R in the Left Hand, Python in the Right series: Multiprocess/Multithreaded Data Scraping and Web Requests

This installment covers how to bring multiprocess task handling into the web-request stage. Web requests raise two important issues. First, concurrent multiprocess operation faces a much higher anti-scraping risk. Second, scraping page data means collecting return values and assembling them into a relational table (a data frame), unlike the binary file downloads in the previous post, where each task only executed a statement block and no return values had to be gathered.

On the R side we use RCurl + XML; on the Python side, urllib + lxml.

library("RCurl")nlibrary("XML")nlibrary("magrittr")n

Scheme 1: a hand-rolled explicit loop:

Getjobs <- function(){
  fullinfo <- data.frame()
  headers <- c("Referer" = "https://www.hellobi.com/jobs/search",
               "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36"
               )
  d <- debugGatherer()
  handle <- getCurlHandle(debugfunction = d$update, followlocation = TRUE, cookiefile = "", verbose = TRUE)
  i <- 0
  while (i < 10){   # pages 1 through 10, matching the other schemes
    i <- i + 1
    url <- sprintf("https://www.hellobi.com/jobs/search?page=%d", i)
    tryCatch({
      content    <- getURL(url, .opts = list(httpheader = headers), .encoding = "utf-8", curl = handle) %>% htmlParse()
      job_item   <- content %>% xpathSApply(., "//div[@class='job_item_middle pull-left']/h4/a", xmlValue)
      job_links  <- content %>% xpathSApply(., "//div[@class='job_item_middle pull-left']/h4/a", xmlGetAttr, "href")
      job_info   <- content %>% xpathSApply(., "//div[@class='job_item_middle pull-left']/h5", xmlValue, trim = TRUE)
      job_salary <- content %>% xpathSApply(., "//div[@class='job_item-right pull-right']/h4", xmlValue, trim = TRUE)
      job_origin <- content %>% xpathSApply(., "//div[@class='job_item-right pull-right']/h5", xmlValue, trim = TRUE)
      myresult   <- data.frame(job_item, job_links, job_info, job_salary, job_origin)
      fullinfo   <- rbind(fullinfo, myresult)
      cat(sprintf("Page %d scraped!", i), sep = "\n")
    }, error = function(e){
      cat(sprintf("Page %d failed!", i), sep = "\n")
    })
  }
  cat("all pages are OK!!!")
  return(fullinfo)
}
system.time(mydata1 <- Getjobs())

The whole run took 11.03 seconds.

Scheme 2: a vectorized function:

Getjobs <- function(i){
  headers <- c("Referer" = "https://www.hellobi.com/jobs/search",
               "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36"
               )
  d <- debugGatherer()
  handle <- getCurlHandle(debugfunction = d$update, followlocation = TRUE, cookiefile = "", verbose = TRUE)
  url <- sprintf("https://www.hellobi.com/jobs/search?page=%d", i)
  content    <- getURL(url, .opts = list(httpheader = headers), .encoding = "utf-8", curl = handle) %>% htmlParse()
  job_item   <- content %>% xpathSApply(., "//div[@class='job_item_middle pull-left']/h4/a", xmlValue)
  job_links  <- content %>% xpathSApply(., "//div[@class='job_item_middle pull-left']/h4/a", xmlGetAttr, "href")
  job_info   <- content %>% xpathSApply(., "//div[@class='job_item_middle pull-left']/h5", xmlValue, trim = TRUE)
  job_salary <- content %>% xpathSApply(., "//div[@class='job_item-right pull-right']/h4", xmlValue, trim = TRUE)
  job_origin <- content %>% xpathSApply(., "//div[@class='job_item-right pull-right']/h5", xmlValue, trim = TRUE)
  data.frame(job_item, job_links, job_info, job_salary, job_origin) %>% return()
}

system.time(mydata <- plyr::ldply(1:10, Getjobs, .progress = "text"))

The whole run took about 9.07 seconds.

Scheme 3: a multiprocess package (foreach + doParallel):

system.time({
  library("doParallel")
  library("foreach")
  cl <- makeCluster(4)
  registerDoParallel(cl)
  mydata2 <- foreach(i = 1:10,
                     .combine  = rbind,
                     .packages = c("RCurl", "XML", "magrittr")
                     ) %dopar% Getjobs(i)
  stopCluster(cl)
})

Total time: 5.14 seconds. Note the .packages argument: each worker started by makeCluster() is a fresh R session, so RCurl, XML, and magrittr have to be loaded on every node for Getjobs() to run there.

This also explains why yesterday's multiprocess PDF download showed no benefit at all. My take: for network-I/O-bound tasks the bottleneck is bandwidth. The PDFs averaged around 5 MB each, and downloading several of them in parallel still pushes the same total bytes through the same pipe; for instance, ten 5 MB files over a hypothetical 1 MB/s link take about 50 seconds whether one process fetches them or four do, so the transfer time almost completely swallows whatever the extra processes save.

The Python version:

The Python examples use the urllib and lxml packages.

from urllib.request import urlopen, Request
import pandas as pd
import numpy as np
import time
from lxml import etree

Scheme 1: scraping with an explicit loop:

def getjobs():
    myresult = {
        "job_item":   [],
        "job_links":  [],
        "job_info":   [],
        "job_salary": [],
        "job_origin": []
    }
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36",
        "Referer": "https://www.hellobi.com/jobs/search"
    }
    i = 0
    while i < 10:   # pages 1 through 10
        i += 1
        url = "https://www.hellobi.com/jobs/search?page={}".format(i)
        pagecontent = urlopen(Request(url, headers=header)).read().decode("utf-8")
        result = etree.HTML(pagecontent)
        myresult["job_item"].extend(result.xpath('//div[@class="job_item_middle pull-left"]/h4/a/text()'))
        myresult["job_links"].extend(result.xpath('//div[@class="job_item_middle pull-left"]/h4/a/@href'))
        myresult["job_info"].extend([text.xpath('string(.)').strip() for text in result.xpath('//div[@class="job_item_middle pull-left"]/h5')])
        myresult["job_salary"].extend(result.xpath('//div[@class="job_item-right pull-right"]/h4/span/text()'))
        myresult["job_origin"].extend(result.xpath('//div[@class="job_item-right pull-right"]/h5/span/text()'))
        time.sleep(1)   # polite 1-second delay between pages
        print("Scraping page {}".format(i))
    print("everything is OK")
    return pd.DataFrame(myresult)

if __name__ == "__main__":
    t0 = time.time()
    mydata1 = getjobs()
    t1 = time.time()
    total = t1 - t0
    print("Elapsed time: {}".format(total))

The total came to nearly 19 seconds. (The code sleeps 1 second per page, about 10 seconds of deliberate delay in all, so the net scraping time is roughly 9 seconds.)

Scheme 2: multithreaded scraping:

import threading

def executeThread(i):
    myresult = {
        "job_item":   [],
        "job_links":  [],
        "job_info":   [],
        "job_salary": [],
        "job_origin": []
    }
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36",
        "Referer": "https://www.hellobi.com/jobs/search"
    }
    url = "https://www.hellobi.com/jobs/search?page={}".format(i)
    try:
        pagecontent = urlopen(Request(url, headers=header)).read().decode("utf-8")
        result = etree.HTML(pagecontent)
        myresult["job_item"].extend(result.xpath('//div[@class="job_item_middle pull-left"]/h4/a/text()'))
        myresult["job_links"].extend(result.xpath('//div[@class="job_item_middle pull-left"]/h4/a/@href'))
        myresult["job_info"].extend([text.xpath('string(.)').strip() for text in result.xpath('//div[@class="job_item_middle pull-left"]/h5')])
        myresult["job_salary"].extend(result.xpath('//div[@class="job_item-right pull-right"]/h4/span/text()'))
        myresult["job_origin"].extend(result.xpath('//div[@class="job_item-right pull-right"]/h5/span/text()'))
    except:
        pass
    with open("D:/Python/File/hellolive.csv", "a+") as f:
        # only the first page writes the CSV header row
        pd.DataFrame(myresult).to_csv(f, index=False, header=False if i > 1 else True)

def main():
    threads = []
    for i in range(1, 11):
        thread = threading.Thread(target=executeThread, args=(i,))
        threads.append(thread)
        thread.start()
    for i in threads:
        i.join()

if __name__ == "__main__":
    t0 = time.time()
    main()
    t1 = time.time()
    total = t1 - t0
    print("Elapsed time: {}".format(total))

The multithreaded run above took only about 1.64 seconds; compared with single-threaded scraping, the efficiency gain is dramatic. (Python's GIL is not a problem here: for network-I/O-bound work the threads spend most of their time waiting on sockets, during which the GIL is released.)
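As an aside, the standard library's concurrent.futures offers a more compact way to express the same fan-out. This is only a sketch, not the original code; it reuses the executeThread function defined above and still runs one thread per page:

from concurrent.futures import ThreadPoolExecutor

# One worker thread per page; the with-block waits for all of them to finish.
with ThreadPoolExecutor(max_workers=10) as ex:
    ex.map(executeThread, range(1, 11))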

Scheme 3: multiprocess scraping:

import multiprocessing
from multiprocessing import Pool
from urllib.request import urlopen, Request
import pandas as pd
import time
from lxml import etree

def executeThread(i):
    myresult = {
        "job_item":   [],
        "job_links":  [],
        "job_info":   [],
        "job_salary": [],
        "job_origin": []
    }
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36",
        "Referer": "https://www.hellobi.com/jobs/search"
    }
    url = "https://www.hellobi.com/jobs/search?page={}".format(i)
    try:
        pagecontent = urlopen(Request(url, headers=header)).read().decode("utf-8")
        result = etree.HTML(pagecontent)
        myresult["job_item"].extend(result.xpath('//div[@class="job_item_middle pull-left"]/h4/a/text()'))
        myresult["job_links"].extend(result.xpath('//div[@class="job_item_middle pull-left"]/h4/a/@href'))
        myresult["job_info"].extend([text.xpath('string(.)').strip() for text in result.xpath('//div[@class="job_item_middle pull-left"]/h5')])
        myresult["job_salary"].extend(result.xpath('//div[@class="job_item-right pull-right"]/h4/span/text()'))
        myresult["job_origin"].extend(result.xpath('//div[@class="job_item-right pull-right"]/h5/span/text()'))
    except:
        pass
    with open("D:/Python/File/hellolive.csv", "a+") as f:
        pd.DataFrame(myresult).to_csv(f, index=False, header=False if i > 1 else True)

def shell():
    # pool sized to the number of CPU cores
    pool = Pool(multiprocessing.cpu_count())
    pool.map(executeThread, list(range(1, 11)))
    pool.close()
    pool.join()

if __name__ == "__main__":
    # start timing
    t0 = time.time()
    shell()
    t1 = time.time()
    total = t1 - t0
    print("Elapsed time: {}".format(total))

The multiprocess version also finishes in roughly 1.5 seconds. However, because Windows has no fork and spawns fresh interpreter processes instead, this code cannot be run directly inside an interactive editor: save the multiprocess code in a .py file and run that file from cmd or PowerShell.
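A minimal sketch of that pattern follows (the file name scrape_jobs.py and the work() body are placeholders, not part of the original code). Everything that creates the Pool must sit under the __main__ guard, because on Windows each child process re-imports the module from the top:

# scrape_jobs.py  (hypothetical file name)
from multiprocessing import Pool, cpu_count

def work(i):
    # stand-in for the per-page scraping function
    return i * i

if __name__ == "__main__":   # required on Windows: children re-import this module
    with Pool(cpu_count()) as pool:
        print(pool.map(work, range(10)))

Then launch it from cmd or PowerShell with: python scrape_jobs.py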

Today's examples show that, for network-I/O-bound tasks, multithreading and multiprocessing really do improve efficiency. But more speed also means more anti-scraping pressure: in a concurrent multiprocess/multithread setting you need more thorough disguising measures, for example rotating random User-Agents and proxy IPs (a sketch follows), so you don't get banned too early.
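As a minimal sketch of the User-Agent idea (the USER_AGENTS pool and the fetch() helper are illustrative assumptions, not from the original code), one could pick a random UA per request:

import random
from urllib.request import urlopen, Request

# A small placeholder pool of User-Agent strings; swap in real, current ones.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8",
    "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
]

def fetch(url):
    # Rotate the User-Agent on every request so the traffic looks less uniform.
    header = {
        "User-Agent": random.choice(USER_AGENTS),
        "Referer": "https://www.hellobi.com/jobs/search",
    }
    return urlopen(Request(url, headers=header)).read().decode("utf-8")

Rotating proxy IPs works the same way in spirit: keep a pool, draw one per request, and retire entries that start failing.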

For the online course, follow the original-article link at the end of this post:

Hellobi Live | January 16, 2018: R web-scraping case studies on NetEase Cloud Classroom, Zhihu Live, Toutiao, and Bilibili videos

For the datasets from past case studies, see my GitHub:

github.com/ljtyduyu/Dat

