Python Data Analysis and Visualization Examples: Crawler Source Code (01)

01-26

Index of all articles in this series: Python Data Analysis and Visualization Examples, Table of Contents

1. Background

(1) The target thread was picked at random; Tieba can be accessed without any request headers: 【美圖】雜圖_美圖吧_百度貼吧

(2) I originally planned to scrape some pictures of girls, but Zhihu's interface is so clean that I would rather not sully it.

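(3) Key points to master in the source code: sending requests with Requests, extracting tags with regular expressions (re), extracting tags with BeautifulSoup, getting an image's size, checking whether a folder exists, and saving images.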

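(4) Homework: use Pillow to stitch the scraped images into a photo wall, difficulty ☆☆. Don't assume this is useless: once you are comfortable with Pillow you can roll your own word clouds and image clouds, and later on the first-pass processing for captchas and image-similarity work will rely on it as well. Put it this way: data cleaning relies on Pandas, text processing on Gensim, image processing on Pillow, and video processing is something I can't do.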

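(5) If you adapt this source code into a crawler of your own, you are welcome to submit it to this column; the reward is a selection of Chinese and English Python materials, source code, and videos.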
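As a starting point for the homework in item (4), here is a minimal sketch of one way a photo wall could be assembled. It is not the author's solution: the function names `grid_positions` and `build_photo_wall`, the output file `photo_wall.jpg`, the four-column layout, and the 200x300 cell size are all assumptions made for illustration. Only the `baidu_img` folder name comes from the crawler below.

```python
import math
from pathlib import Path


def grid_positions(count, columns, cell_w, cell_h):
    """Return the wall size and the top-left corner of each cell, row by row."""
    rows = math.ceil(count / columns)
    positions = [((i % columns) * cell_w, (i // columns) * cell_h)
                 for i in range(count)]
    return (columns * cell_w, rows * cell_h), positions


def build_photo_wall(folder="baidu_img", out_path="photo_wall.jpg",
                     columns=4, cell=(200, 300)):
    """Paste every .jpg in `folder` onto one canvas; return the saved path or None."""
    from PIL import Image  # imported lazily, so the layout helper needs no Pillow

    paths = sorted(Path(folder).glob("*.jpg"))
    if not paths:
        return None  # nothing scraped yet
    wall_size, positions = grid_positions(len(paths), columns, *cell)
    wall = Image.new("RGB", wall_size, "white")
    for path, pos in zip(paths, positions):
        with Image.open(path) as img:
            # Naive resize: every tile is forced to the cell size,
            # so the aspect ratio is not preserved (an assumed simplification).
            tile = img.convert("RGB").resize(cell)
            wall.paste(tile, pos)
    wall.save(out_path)
    return out_path
```

Calling `build_photo_wall()` after the crawler has filled `baidu_img` should write a single stitched image. Cropping each tile to a common aspect ratio before resizing would likely look better and is a natural next step for the exercise.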

2. Source code

# coding: utf-8
import os
import re
from io import BytesIO

import requests
from bs4 import BeautifulSoup
from PIL import Image  # get familiar with this library: it helps a lot with captchas and AI work

# Request headers. A snake can't move without its head, so bring them along;
# a later article covers using a Cookie to log in to a single account without a password.
headers = {
    # "Host": "i.meizitu.net",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0",
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
    "Connection": "keep-alive",
    "Accept-Encoding": "gzip, deflate",
    "Upgrade-Insecure-Requests": "1",
}
s = requests.session()  # keep the session alive across requests


def re_test(text):
    """
    :param text: page source
    :return: list of image links
    """
    # https://imgsa.baidu.com/forum/w%3D580/sign=20bec30aa0ec8a13141a57e8c7029157/2508562c11dfa9ecfd81b1f26bd0f703938fc180.jpg
    # Dots are escaped so that "." only matches a literal dot.
    img_url = re.findall(r"https://imgsa\.baidu\.com/forum/.*?\.jpg", text)
    return img_url


def bs_test(text):
    """
    :param text: page source
    :return: list of image links
    """
    # <img class="BDE_Image" src="//i0.wp.com/imgsa.baidu.com/forum/w%3D580/sign=44312fe1a5af2eddd4f149e1bd110102/1c3477094b36acaf1ce5c71b75d98d1000e99c2f.jpg" size="201260" width_="479" height="852">
    soup = BeautifulSoup(text, "lxml")
    img_urls = soup.find_all("img", {"class": "BDE_Image"})
    img_url = [i.get("src") for i in img_urls]
    return img_url


def img_size(content):
    """Return (width, height) of the image held in the raw bytes `content`."""
    img = Image.open(BytesIO(content))
    # width, height = img.size
    # Getting the size, resizing, stitching a photo wall: try these yourself first.
    return img.size


def save_img(url):
    """
    :param url: image URL
    :return: nothing
    """
    img_name = url.strip().split("/")[-1]
    print(img_name)
    url_re = s.get(url.strip(), headers=headers)
    if url_re.status_code == 200:  # 200 is the HTTP "OK" status
        # print("about to save")
        os.makedirs("baidu_img", exist_ok=True)  # create the folder if it is missing
        width, height = img_size(url_re.content)  # decode the image only once
        if width > 400 and height > 600:  # keep only images larger than 400 x 600 pixels
            print("Good size, keeping it")
            with open(os.path.join("baidu_img", img_name), "wb") as f:
                f.write(url_re.content)


if __name__ == "__main__":
    for i in range(2):  # test with 2 pages
        url = "https://tieba.baidu.com/p/5033202671?pn=" + str(i + 1)  # build the per-page link
        req_text = s.get(url).text
        # print(re_test(req_text))  # regex
        # print(bs_test(req_text))  # BeautifulSoup
        # urls = bs_test(req_text)  # alternative: BeautifulSoup
        urls = re_test(req_text)  # use the regex to collect image links
        for img_url in urls:
            save_img(img_url)
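One detail worth spelling out: the sample tag in the comment of `bs_test` has a `src` that begins with `//`, a protocol-relative link that `requests` cannot fetch as it stands, and the same picture can appear more than once in a page. The sketch below shows one possible way to handle both. The helper `extract_img_urls`, the pattern `IMG_PATTERN`, and the image paths in the usage example are hypothetical and are not part of the original source.

```python
import re
from urllib.parse import urljoin

PAGE_URL = "https://tieba.baidu.com/p/5033202671"  # the thread used in the article

# Hypothetical, looser variant of the pattern in re_test(): the scheme is
# optional and the dot before "jpg" is escaped so it matches literally.
IMG_PATTERN = re.compile(r'(?:https?:)?//[^\s"<>]+?\.jpg')


def extract_img_urls(html, base_url=PAGE_URL):
    """Find .jpg links in raw HTML; return absolute URLs without duplicates."""
    seen = []
    for raw in IMG_PATTERN.findall(html):
        # urljoin gives "//host/x.jpg" the scheme of base_url (https here)
        absolute = urljoin(base_url, raw)
        if absolute not in seen:
            seen.append(absolute)
    return seen
```

For example, with a made-up fragment such as `<img src="//imgsa.baidu.com/forum/pic/a.jpg"><img src="https://imgsa.baidu.com/forum/pic/a.jpg">`, the function returns a single `https://imgsa.baidu.com/forum/pic/a.jpg` entry, which can then be passed to `save_img`.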


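The glue language is broad and deep,

and I have grasped only a little of it, just enough to show newcomers the way.

Veterans can head over to the other column: Python中文社區 (Python Chinese Community)

Finally, don't just bookmark this without following!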