Left-Hand R, Right-Hand Python Series: Multiprocess/Multithreaded Data Scraping and Web Requests
This installment covers how to use multiprocess/multithreaded task handling in the web-request stage. Web requests raise two important issues. First, concurrent requests from multiple processes face a much greater anti-scraping risk. Second, scraping page data means capturing return values, and those return values then have to be assembled into one relational table (a data frame). This is different from the binary file downloads in the previous post, where each download task only had to execute its statement block and no return values needed to be collected.
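To make that "collect the return values" point concrete before the walkthroughs, here is a minimal Python sketch of the pattern, assuming a hypothetical fetch_page stub in place of the real per-page scraper: each worker returns a small data frame, and the per-page frames are concatenated into one table.

from multiprocessing import Pool
import pandas as pd

def fetch_page(i):
    # hypothetical stand-in for a real per-page scraper
    return pd.DataFrame({"page": [i], "job_item": ["..."]})

if __name__ == "__main__":
    with Pool(4) as pool:
        frames = pool.map(fetch_page, range(1, 11))  # a list of DataFrames, one per page
    mydata = pd.concat(frames, ignore_index=True)    # assembled into a single table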
The R version uses RCurl + XML; the Python version uses urllib + lxml.
library("RCurl")nlibrary("XML")nlibrary("magrittr")n
Approach 1: a hand-rolled explicit loop.
Getjobs <- function(){
  fullinfo <- data.frame()
  headers <- c("Referer" = "https://www.hellobi.com/jobs/search",
               "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36"
  )
  d <- debugGatherer()
  handle <- getCurlHandle(debugfunction = d$update, followlocation = TRUE, cookiefile = "", verbose = TRUE)
  i = 0
  while (i < 11){
    i = i + 1
    url <- sprintf("https://www.hellobi.com/jobs/search?page=%d", i)
    tryCatch({
      content    <- getURL(url, .opts = list(httpheader = headers), .encoding = "utf-8", curl = handle) %>% htmlParse()
      job_item   <- content %>% xpathSApply(., "//div[@class='job_item_middle pull-left']/h4/a", xmlValue)
      job_links  <- content %>% xpathSApply(., "//div[@class='job_item_middle pull-left']/h4/a", xmlGetAttr, "href")
      job_info   <- content %>% xpathSApply(., "//div[@class='job_item_middle pull-left']/h5", xmlValue, trim = TRUE)
      job_salary <- content %>% xpathSApply(., "//div[@class='job_item-right pull-right']/h4", xmlValue, trim = TRUE)
      job_origin <- content %>% xpathSApply(., "//div[@class='job_item-right pull-right']/h5", xmlValue, trim = TRUE)
      myresult   <- data.frame(job_item, job_links, job_info, job_salary, job_origin)
      fullinfo   <- rbind(fullinfo, myresult)
      cat(sprintf("Page %d scraped!", i), sep = "\n")
    }, error = function(e){
      cat(sprintf("Page %d failed!", i), sep = "\n")
    })
  }
  cat("all pages are OK!!!")
  return(fullinfo)
}
system.time(mydata1 <- Getjobs())
The whole run took 11.03 seconds.
Approach 2: a vectorized function.
Getjobs <- function(i){
  headers <- c("Referer" = "https://www.hellobi.com/jobs/search",
               "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36"
  )
  d <- debugGatherer()
  handle <- getCurlHandle(debugfunction = d$update, followlocation = TRUE, cookiefile = "", verbose = TRUE)
  url <- sprintf("https://www.hellobi.com/jobs/search?page=%d", i)
  content    <- getURL(url, .opts = list(httpheader = headers), .encoding = "utf-8", curl = handle) %>% htmlParse()
  job_item   <- content %>% xpathSApply(., "//div[@class='job_item_middle pull-left']/h4/a", xmlValue)
  job_links  <- content %>% xpathSApply(., "//div[@class='job_item_middle pull-left']/h4/a", xmlGetAttr, "href")
  job_info   <- content %>% xpathSApply(., "//div[@class='job_item_middle pull-left']/h5", xmlValue, trim = TRUE)
  job_salary <- content %>% xpathSApply(., "//div[@class='job_item-right pull-right']/h4", xmlValue, trim = TRUE)
  job_origin <- content %>% xpathSApply(., "//div[@class='job_item-right pull-right']/h5", xmlValue, trim = TRUE)
  data.frame(job_item, job_links, job_info, job_salary, job_origin) %>% return()
}

system.time(mydata <- plyr::ldply(1:10, Getjobs, .progress = "text"))
The whole run took 9.07 seconds.
Approach 3: using a multiprocessing package.
system.time({
  library("doParallel")
  library("foreach")
  cl <- makeCluster(4)
  registerDoParallel(cl)
  mydata2 <- foreach(i = 1:10,
                     .combine  = rbind,
                     .packages = c("RCurl", "XML", "magrittr")
  ) %dopar% Getjobs(i)
  stopCluster(cl)
})
Total time: 5.14 seconds.
This also explains why yesterday's multiprocess PDF download showed no improvement at all: for network-I/O-bound tasks, the download itself is limited by bandwidth and takes so long that it almost completely swallows whatever time multiprocessing saves (the PDF files averaged around 5 MB each).
The Python version:
The Python examples use the urllib and lxml packages.
from urllib.request import urlopen, Request
import pandas as pd
import numpy as np
import time
from lxml import etree
Approach 1: scraping with an explicit loop.
def getjobs():
    myresult = {
        "job_item":   [],
        "job_links":  [],
        "job_info":   [],
        "job_salary": [],
        "job_origin": []
    }
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36",
        "Referer": "https://www.hellobi.com/jobs/search"
    }
    i = 0
    while i < 11:
        i += 1
        url = "https://www.hellobi.com/jobs/search?page={}".format(i)
        pagecontent = urlopen(Request(url, headers=header)).read().decode("utf-8")
        result = etree.HTML(pagecontent)
        myresult["job_item"].extend(result.xpath('//div[@class="job_item_middle pull-left"]/h4/a/text()'))
        myresult["job_links"].extend(result.xpath('//div[@class="job_item_middle pull-left"]/h4/a/@href'))
        myresult["job_info"].extend([text.xpath("string(.)").strip() for text in result.xpath('//div[@class="job_item_middle pull-left"]/h5')])
        myresult["job_salary"].extend(result.xpath('//div[@class="job_item-right pull-right"]/h4/span/text()'))
        myresult["job_origin"].extend(result.xpath('//div[@class="job_item-right pull-right"]/h5/span/text()'))
        time.sleep(1)
        print("Scraping page {}".format(i))
    print("everything is OK")
    return pd.DataFrame(myresult)

if __name__ == "__main__":
    t0 = time.time()
    mydata1 = getjobs()
    t1 = time.time()
    total = t1 - t0
    print("Elapsed time: {}".format(total))
Total elapsed time was nearly 19 seconds (the code sleeps one second per page, so the estimated net scraping time is around 9 seconds).
Approach 2: scraping with multiple threads.
import threading

def executeThread(i):
    myresult = {
        "job_item":   [],
        "job_links":  [],
        "job_info":   [],
        "job_salary": [],
        "job_origin": []
    }
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36",
        "Referer": "https://www.hellobi.com/jobs/search"
    }
    url = "https://www.hellobi.com/jobs/search?page={}".format(i)
    try:
        pagecontent = urlopen(Request(url, headers=header)).read().decode("utf-8")
        result = etree.HTML(pagecontent)
        myresult["job_item"].extend(result.xpath('//div[@class="job_item_middle pull-left"]/h4/a/text()'))
        myresult["job_links"].extend(result.xpath('//div[@class="job_item_middle pull-left"]/h4/a/@href'))
        myresult["job_info"].extend([text.xpath("string(.)").strip() for text in result.xpath('//div[@class="job_item_middle pull-left"]/h5')])
        myresult["job_salary"].extend(result.xpath('//div[@class="job_item-right pull-right"]/h4/span/text()'))
        myresult["job_origin"].extend(result.xpath('//div[@class="job_item-right pull-right"]/h5/span/text()'))
    except:
        pass
    # append each page's rows to one CSV; only page 1 writes the header row
    with open("D:/Python/File/hellolive.csv", "a+") as f:
        pd.DataFrame(myresult).to_csv(f, index=False, header=False if i > 1 else True)

def main():
    threads = []
    for i in range(1, 11):
        thread = threading.Thread(target=executeThread, args=(i,))
        threads.append(thread)
        thread.start()
    for i in threads:
        i.join()

if __name__ == "__main__":
    t0 = time.time()
    main()
    t1 = time.time()
    total = t1 - t0
    print("Elapsed time: {}".format(total))
The threaded version above took only about 1.64 seconds; compared with the single-threaded run, the efficiency advantage of concurrent scraping is obvious.
Approach 3: scraping with multiple processes.
from multiprocessing import Pool, cpu_count
from urllib.request import urlopen, Request
import pandas as pd
import time
from lxml import etree

def executeThread(i):
    myresult = {
        "job_item":   [],
        "job_links":  [],
        "job_info":   [],
        "job_salary": [],
        "job_origin": []
    }
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36",
        "Referer": "https://www.hellobi.com/jobs/search"
    }
    url = "https://www.hellobi.com/jobs/search?page={}".format(i)
    try:
        pagecontent = urlopen(Request(url, headers=header)).read().decode("utf-8")
        result = etree.HTML(pagecontent)
        myresult["job_item"].extend(result.xpath('//div[@class="job_item_middle pull-left"]/h4/a/text()'))
        myresult["job_links"].extend(result.xpath('//div[@class="job_item_middle pull-left"]/h4/a/@href'))
        myresult["job_info"].extend([text.xpath("string(.)").strip() for text in result.xpath('//div[@class="job_item_middle pull-left"]/h5')])
        myresult["job_salary"].extend(result.xpath('//div[@class="job_item-right pull-right"]/h4/span/text()'))
        myresult["job_origin"].extend(result.xpath('//div[@class="job_item-right pull-right"]/h5/span/text()'))
    except:
        pass
    with open("D:/Python/File/hellolive.csv", "a+") as f:
        pd.DataFrame(myresult).to_csv(f, index=False, header=False if i > 1 else True)

def shell():
    # multiprocess: one worker per CPU core
    pool = Pool(cpu_count())
    pool.map(executeThread, list(range(1, 11)))
    pool.close()
    pool.join()

if __name__ == "__main__":
    # start the timer
    t0 = time.time()
    shell()
    t1 = time.time()
    total = t1 - t0
    print("Elapsed time: {}".format(total))
The final multiprocess version also runs in roughly 1.5 seconds. However, because Windows has no fork (multiprocessing has to spawn fresh interpreter processes instead), this code cannot be run directly inside an interactive editor: save the multiprocess code as a .py file and execute that file from cmd or PowerShell, as in the sketch below.
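A minimal sketch of that Windows-safe layout, assuming a hypothetical file scrape_demo.py with a fetch stub in place of the real scraper; the key detail is the __main__ guard, which lets spawned child processes import the module without re-running the pool setup:

# scrape_demo.py -- run with `python scrape_demo.py` from cmd or PowerShell
from multiprocessing import Pool, freeze_support

def fetch(i):
    # stand-in for the real per-page scraping work
    return i * i

if __name__ == "__main__":
    freeze_support()       # only needed for frozen Windows executables, harmless otherwise
    with Pool(4) as pool:  # on Windows, workers are spawned and re-import this file
        print(pool.map(fetch, range(1, 11)))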
Today's examples show that for network-I/O-bound tasks, multithreading and multiprocessing really do improve efficiency. But higher speed also means greater anti-scraping pressure. Especially in a multiprocess/multithreaded setting, concurrent requests call for more thorough disguises, such as rotating random User-Agents and proxy IPs, to avoid getting blocked too early.
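As a minimal illustration of the random-UA idea (the USER_AGENTS entries below are a few sample strings for illustration, not an authoritative pool; rotating proxy IPs would follow the same pattern with a list of proxies):

import random
from urllib.request import urlopen, Request

# a few illustrative UA strings; in practice use a larger, fresher pool
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8",
    "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
]

def fetch(url):
    # each request picks a random UA, so concurrent workers don't all
    # present the same fingerprint
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return urlopen(Request(url, headers=headers)).read()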
For the online course, see the original-article link at the end of this post:
Hellobi Live | January 16, 2018, R web-scraping case studies: NetEase Cloud Classroom, Zhihu Live, Toutiao, and Bilibili videos
For the data behind past case studies, head to my GitHub:
https://github.com/ljtyduyu/DataWarehouse/tree/master/File