Scraping Banciyuan (bcy.net) cosplay images with Python 3

09-03

From the column: Python中文社區

(Chrome is recommended.)

My skills are limited, so corrections from fellow Python crawler-keepers are welcome.

First, open Banciyuan (bcy.net), click COS, then 熱門推薦 (Hot Recommendations).

Press F12 to open the developer tools window.

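We will analyze the markup of the first COS photo.

Taking this image as the example, inspect its HTML.

The red arrow marks the HTML for this image. Copy the URL into the address bar by hand, remove the trailing /2X3, and press Enter to get the highest-quality version of the image.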
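To make the inspection step concrete, here is a minimal sketch of pulling the `src` out of such an `<img>` tag using only the standard library. The class name `cardImage` is taken from the selector used in the script later in this post. The sample HTML fragment, its URL, and the names `CardImageParser` and `extract_card_images` are made up for illustration, and the real page markup may differ.

```python
from html.parser import HTMLParser


class CardImageParser(HTMLParser):
    """Collect the src of every <img> whose class list contains 'cardImage'."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "cardImage" in classes and attr_map.get("src"):
            self.sources.append(attr_map["src"])


def extract_card_images(html):
    """Return the src values of all cardImage <img> tags found in html."""
    parser = CardImageParser()
    parser.feed(html)
    return parser.sources


# Hypothetical fragment, shaped like the markup seen in the developer tools.
SAMPLE_HTML = (
    '<li class="js-smallCards _box"><a class="db posr ovf">'
    '<img class="cardImage" src="https://img.example.com/coser/abc123.jpg/2X3">'
    "</a></li>"
)

print(extract_card_images(SAMPLE_HTML))
```

The script in this post uses pyquery for the same job; the standard-library parser is used here only so the sketch runs without extra packages.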

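Now compare it with the image URL from the HTML above.

-----Update 2018.8.31: please ignore the trailing /w650 segment; treat it as if it were blank.

As you can see, the URL ending in /2X3 is the one we got for the first COS photo when the page first loaded, and the one after it is the URL of the high-resolution image.

The same applies to the other COS photos.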
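The URL rewrite described above can be sketched as a small helper. This is only an illustration under assumptions: the function name `full_size_url` and the example URLs are hypothetical, and treating /w650 as a second removable suffix is my reading of the update note, not something the site documents.

```python
def full_size_url(url, suffixes=("/2X3", "/w650")):
    """Remove one trailing size suffix, if present, and return the rest."""
    for suffix in suffixes:
        if url.endswith(suffix):
            return url[: -len(suffix)]
    return url


# Hypothetical URLs used only to show the behaviour.
thumb = "https://img.example.com/coser/abc123.jpg/2X3"
print(full_size_url(thumb))

# str.rstrip("/2X3") removes any trailing run of the characters '/', '2', 'X'
# and '3', so it can also eat the end of a name made of those characters.
tricky = "https://img.example.com/coser/pic32/2X3"
print(tricky.rstrip("/2X3"))
print(full_size_url(tricky))
```

Checking with `endswith` and slicing removes exactly the suffix and nothing more, which is why it is preferred here over `rstrip`.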

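Keep scrolling down on the page we first opened and you will notice that it loads more content automatically when you reach the bottom, which tells us the page uses AJAX. Go back to the top and refresh; once the page has loaded, press F12 to open the developer tools and proceed as shown in the screenshot.

Click XHR.

Keep scrolling down. When the page loads more content, a new entry appears in the list; click it.

Scroll down in the Headers tab and you will see X-Requested-With: XMLHttpRequest, which indicates an AJAX request. Find Query String Parameters: these are the parameters sent with the AJAX request. The data returned by the request is visible in the Preview tab.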
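For readers who want to mimic what the browser sends, here is a rough sketch of building a request that carries the same X-Requested-With header. It only constructs the request object and sends nothing. The script in this post does not set this header, so whether the server needs it is an assumption; the helper name `build_ajax_request` and the shortened User-Agent are placeholders.

```python
from urllib.request import Request

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"  # shortened placeholder


def build_ajax_request(url):
    """Build (but do not send) a GET request marked as an AJAX call."""
    return Request(
        url,
        headers={
            "User-Agent": USER_AGENT,
            "X-Requested-With": "XMLHttpRequest",
        },
    )


req = build_ajax_request("https://bcy.net/circle/timeline/showtag?since=25941.552")
print(req.get_method(), req.full_url)
print(req.header_items())
```

Passing the resulting object to `urllib.request.urlopen` would actually perform the request; that step is left out so the sketch stays offline.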

Keep scrolling so the page loads more data, and several new entries appear in the Network panel.

Comparing the Query String Parameters of these AJAX requests, we find:

grid_type: flow
sort: hot
tag_id: 399

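These three parameters are identical across the Network entries, but since differs. Comparing entries, since has the form 25853.xxx, where the three digits xxx follow no regular pattern: the digits after the decimal point were 565, 523, 483, and 428 (for reference only; use the values you actually observe). This means the since values have to be entered by hand for the image crawl that follows.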
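As a small illustration of how these parameters turn into a request URL, the sketch below combines the three fixed values with a hand-copied since value. The endpoint is the one used by the script in this post; the names `BASE_URL` and `build_timeline_url` are made up for the example, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Endpoint taken from the script in this post.
BASE_URL = "https://bcy.net/circle/timeline/showtag"


def build_timeline_url(since):
    """Return the AJAX URL for one page, given a since value copied from DevTools."""
    params = {
        "since": since,        # changes with every request, entered by hand
        "grid_type": "flow",   # the three values below stayed constant
        "sort": "hot",
        "tag_id": 399,
    }
    return BASE_URL + "?" + urlencode(params)


for since in ("25941.552", "25941.511"):
    print(build_timeline_url(since))
```

Passing since as a string keeps the digits exactly as they were copied from the Network panel.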

import os
import re
import time
from urllib.parse import urlencode

import requests
from pyquery import PyQuery as pq

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"}
CARD_SELECTOR = "li.js-smallCards._box a.db.posr.ovf img.cardImage"

# The directory name can be changed; make sure it does not contain "/".
Filepath = input("Please enter the path to save images: ")
list_since = []
for n in range(1, 5):
    list_since.append(float(input("[%d/4]Please enter the Since in the Query String Parameters: " % n)))
print("==================Start crawling==================")
# list_since = [25941.552, 25941.511, 25941.479, 25941.415]  # since values of the AJAX requests


# ======================== Function definitions ========================
def get_html(url):
    """Return the body of url as text, or None if the request fails."""
    try:
        return requests.get(str(url), headers=HEADERS).text
    except Exception as e:  # the request itself is now inside the try block
        print(e)
        return None


def DownloadFileWithFilename(url, filename, path):
    """Download url and save it as path/filename, creating path if needed."""
    target = os.path.join(str(path), str(filename))
    try:
        os.makedirs(path, exist_ok=True)
        r = requests.get(url)
        with open(target, "wb") as code:
            code.write(r.content)
        print("Downloaded!", target)
    except Exception as e:  # covers both network and file errors
        print("Download Failed!")
        print(e)


def download_card_images(html):
    """Download every COS card image found in an HTML page or fragment."""
    if not html:
        return
    for img in pq(html)(CARD_SELECTOR).items():
        src = img.attr("src")
        if not src:
            continue
        url = str(src)
        # Remove the trailing /2X3 exactly; rstrip("/2X3") strips a character
        # set and could also eat the end of the file name.
        if url.endswith("/2X3"):
            url = url[:-len("/2X3")]
        filename = re.search(r"[^/]+(?!.*/)", url).group(0)  # last URL segment: xxx.jpg
        DownloadFileWithFilename(url, filename, Filepath)
        time.sleep(1)  # sleep one second to avoid an IP ban


def getStaticHtmlImage():
    """Download the COSPLAY images on the page before any AJAX update."""
    download_card_images(get_html("https://bcy.net/coser"))


def getDynamicHtmlImage(since1):
    """Download the COSPLAY images returned by one AJAX update."""
    ajax_get_data = {"since": since1, "grid_type": "flow", "sort": "hot", "tag_id": 399}
    url = "https://bcy.net/circle/timeline/showtag?" + urlencode(ajax_get_data)
    download_card_images(get_html(url))


# ======================== Execution ========================
getStaticHtmlImage()
for i in list_since:
    print(i)
    getDynamicHtmlImage(i)
print("===========================================================")
print("Finished! The path of images:", Filepath)

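Write down four since values, for example: 25941.552, 25941.511, 25941.479, 25941.415. Run the script, then (1) enter the download directory and (2) enter the since values, one per prompt.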
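If typing the values one at a time feels tedious, a possible variation is to paste all four on one line and split them. The sketch below is not part of the original script; the function name `parse_since_values` is hypothetical, and accepting a full-width comma is simply a convenience for Chinese input methods.

```python
def parse_since_values(text, expected=4):
    """Turn '25941.552,25941.511,...' into a list of validated strings."""
    # Accept both ASCII and full-width commas as separators.
    parts = text.replace("，", ",").split(",")
    values = [part.strip() for part in parts if part.strip()]
    if len(values) != expected:
        raise ValueError("expected %d since values, got %d" % (expected, len(values)))
    for value in values:
        float(value)  # raises ValueError if an entry is not a number
    return values


print(parse_since_values("25941.552,25941.511,25941.479,25941.415"))
```

The values are kept as strings after validation so that they are sent exactly as they appeared in the Network panel.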

The final result:

Feeling a little excited?

Likes and comments are appreciated.

GitHub: wenead99/BCY.net-Cos-image-Crawl-tool

And, tempting fate, a mention for @Pt.wang