Python Web Scraping Tutorial (1): Scraping 妹子圖 (mzitu) Images with requests + BeautifulSoup

04-22

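Official documentation

Most of what follows comes from the official requests documentation, edited and summarized for this article. For more detail, see the official documentation: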

cn.python-requests.org

Installing requests

Install with pip:

pip install requests

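requests usage example

import requests

response = requests.get("https://www.douban.com/")  # send a GET request and fetch the HTML page
response.status_code  # HTTP status code of the response
response.text         # page content decoded as text
response.content      # page content as raw bytes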
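In a scraper it usually helps to set a timeout and check the status code before using the response body. The following is a minimal sketch, not part of the original tutorial: fetch_html is a hypothetical helper name, and the optional session argument is an assumption added so the function can be exercised without network access.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Fetch a page and return its text (fetch_html is a hypothetical helper).

    Raises requests.HTTPError when the server answers with a 4xx/5xx status.
    """
    session = session or requests.Session()
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.text
```

Calling fetch_html("https://www.douban.com/") would then either return the page text or raise an exception, instead of silently handing back an error page.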

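requests offers other request methods and response attributes as well; see Cui Qingcai's personal blog:

"Python爬蟲利器一之Requests庫的用法" (Python scraping tools, part 1: using the Requests library) | 靜覓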

cuiqingcai.com

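The BeautifulSoup library

BeautifulSoup is a Python library whose main job is to extract the data we need from web pages. It parses HTML into a tree of Python objects: a tag's attributes can be read like a dictionary, and searches return list-like collections of tags.

The official documentation:

Beautiful Soup 4.4.0 documentation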

beautifulsoup.readthedocs.io

Installing BeautifulSoup

Install with pip:

pip install beautifulsoup4

BeautifulSoup basic usage

import requests
from bs4 import BeautifulSoup

url = "http://www.baidu.com"
res = requests.get(url)  # send a GET request and fetch the HTML page
soup = BeautifulSoup(res.text, "html.parser")  # parse the downloaded page with BeautifulSoup
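To see what the parsed soup object offers without touching the network, here is a small sketch that parses an inline HTML string. The HTML fragment is made up for illustration; only the select() and get() calls mirror what the scraper in this article relies on.

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment standing in for a downloaded page.
html = """
<div id="pins">
  <a href="/post/1"><img src="/img/a01.jpg"></a>
  <a href="/post/2"><img src="/img/b01.jpg"></a>
  <a href="/post/1">duplicate link</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

links = soup.select("#pins a")              # CSS selector; returns a list of tags
hrefs = [a.get("href") for a in links]      # read an attribute from each tag
first_img = soup.select("#pins a img")[0]   # descendant selector, first match

print(len(links))
print(sorted(set(hrefs)))   # set() drops the duplicated link
print(first_img["src"])     # attributes can also be read like dictionary keys
```

select() takes a CSS selector, so "#pins a" means every a tag inside the element whose id is pins.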

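Scraping 妹子圖 images with requests + BeautifulSoup

The images are downloaded into a local folder, as shown in the figure:

Here is the code; the details are explained in the comments. If anything is unclear, leave a comment, and corrections from more experienced readers are welcome:

Target URL: http://www.mzitu.com/page/1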

# coding=utf-8
import os

import requests
from bs4 import BeautifulSoup


def imgurl(url):
    res = requests.get(url)  # url is the href of an <a> tag, i.e. the link behind a cover image
    soup = BeautifulSoup(res.text, "html.parser")  # parse the downloaded page with BeautifulSoup
    page = int(soup.select(".pagenavi span")[-2].text)  # total page count; -2 skips the previous/next buttons
    # a = soup.select(".main-image a")[0]  # link wrapping the current image
    # src = a.select("img")[0].get("src")
    src = soup.select(".main-image a img")[0].get("src")  # URL of the first image
    meiziid = src[-9:-6]  # slice three characters near the end of src to use as the name
    print("Downloading set:", meiziid)
    # Add headers that imitate a browser to get past the site's anti-scraping check
    headers = {
        "User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
        "Referer": "http://www.mzitu.com",
    }
    for i in range(1, page + 1):
        num = "%02d" % i  # zero-padded page number: 1 -> "01"
        img = src.replace("01.jpg", num + ".jpg")  # swap the page number into the URL
        response = requests.get(img, headers=headers)
        # Save under D:\666 (the folder must already exist)
        with open(os.path.join(r"D:\666", meiziid + num + ".jpg"), "wb") as f:
            f.write(response.content)
        print("===> %s done" % (meiziid + num))
    print(" %s downloaded" % meiziid)


def imgpage(page=""):
    res = requests.get("http://www.mzitu.com/page/" + page)
    soup = BeautifulSoup(res.text, "html.parser")  # parse the listing page
    anchors = soup.select("#pins a")  # select the links to each image set
    links = set(a.get("href") for a in anchors)  # collect the hrefs; set() removes duplicates
    for link in links:  # download every set
        imgurl(link)


result = input("Which page to download: ")
imgpage(result)
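The download loop derives every image URL from the first one by string replacement, which is easy to get wrong. Below is a minimal sketch of that step in isolation. build_image_urls is a hypothetical helper name and the example URL is invented; the sketch assumes, as the scraper does, that the first image URL ends in 01.jpg and that later images differ only in that two-digit counter.

```python
def build_image_urls(first_src, page_count):
    """Derive per-page image URLs from the first image URL.

    Assumes first_src ends in "01.jpg" and that later pages only differ
    in that two-digit counter.
    """
    urls = []
    for i in range(1, page_count + 1):
        number = "%02d" % i  # zero-pad: 1 -> "01", 12 -> "12"
        urls.append(first_src.replace("01.jpg", number + ".jpg"))
    return urls


# Hypothetical URL, used only to illustrate the string handling.
first_src = "http://img.example.com/2018/04/22a01.jpg"
urls = build_image_urls(first_src, 3)
meiziid = first_src[-9:-6]  # the three characters just before "01.jpg"

print(meiziid)
for u in urls:
    print(u)
```

Note that str.replace() replaces every occurrence of "01.jpg" in the string, so this only works when that text appears once, at the end of the URL.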