Python3 Web Scraping (5): Scraping 100 Lianjia Pages with BeautifulSoup4

02-03

As usual, let's look at the scraped data first:

3,000 records were scraped, and this time the listing URLs were collected as well.

Scraping 100 Lianjia pages: difficulty 1.5 stars

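Scraping Lianjia's second-hand housing pages is about as hard as scraping the used-car listings in the previous post: there is no login, no anti-scraping restriction, and so on. The only addition this time is that the listing URLs are scraped too.

So I won't go into detail here and will go straight to the code. If you need a detailed explanation, see the previous post:

曹驥: Python3 Web Scraping (4): Scraping Text from Multiple Web Pages (zhuanlan.zhihu.com)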

import urllib.request
from bs4 import BeautifulSoup
import pandas as pd

def get_html(url):
    html = urllib.request.urlopen(url)  # request the page and open it
    htmltext = html.read().decode('utf-8')  # read the content and decode it
    return htmltext

def get_data(htmltext, list_http, list_title, list_size, list_position, list_followinfo, list_price):
    soup = BeautifulSoup(htmltext, 'html.parser')  # name the parser explicitly
    main_part = soup.find('ul', 'sellListContent')  # the main HTML block holding the house listings
    items = main_part.find_all('li', 'clear')  # split the block into one tag per house
    for item in items:  # process each house in turn
        try:  # skip empty rows and listings with a missing field
            http = item.find('a', href=True)['href']  # the listing URL
            title = item.find('div', 'title').text
            size = item.find('div', 'houseInfo').text
            position = item.find('div', 'positionInfo').text
            followinfo = item.find('div', 'followInfo').text
            price = item.find('div', 'priceInfo').text
        except (AttributeError, TypeError):
            continue
        # append only after every field was found, so the six lists stay the same length
        list_http.append(http)
        list_title.append(title)
        list_size.append(size)
        list_position.append(position)
        list_followinfo.append(followinfo)
        list_price.append(price)

# create six empty lists, one per kind of information
list_http = []
list_title = []
list_size = []
list_position = []
list_followinfo = []
list_price = []

# loop over the 100 pages
for i in range(1, 101):
    url = 'https://cd.lianjia.com/ershoufang/pg%d/' % i
    htmltext = get_html(url)
    get_data(htmltext, list_http, list_title, list_size, list_position, list_followinfo, list_price)
    print('Finished scraping page %d' % i)

# put the scraped data into an empty DataFrame and save it to Excel
df = pd.DataFrame()
df['Title'] = list_title
df['Info'] = list_size
df['Location'] = list_position
df['Other'] = list_followinfo
df['Price'] = list_price
df['URL'] = list_http
# newer pandas releases dropped .xls writing; switch to an .xlsx path there
df.to_excel(r'E:\test\ershoufang.xls')
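To see how the extraction in get_data behaves without touching the network, here is a minimal sketch that parses a small hand-written HTML fragment. The fragment is an assumption: it only imitates a few of the class names used above (sellListContent, clear, title, priceInfo), and the real Lianjia markup is richer and may have changed. The helper name parse_items and the example.com URL are hypothetical.

```python
from bs4 import BeautifulSoup

# Hand-written sample that imitates the class names used above
# (an assumption, not a copy of the real page).
SAMPLE_HTML = """
<ul class="sellListContent">
  <li class="clear">
    <a href="https://example.com/house/1.html"></a>
    <div class="title">Bright two-bedroom flat</div>
    <div class="priceInfo">120</div>
  </li>
  <li class="clear">
    <div class="title">Listing without a link or price</div>
  </li>
</ul>
"""

def parse_items(htmltext):
    """Return one (url, title, price) tuple per complete listing."""
    soup = BeautifulSoup(htmltext, 'html.parser')
    main_part = soup.find('ul', 'sellListContent')
    rows = []
    for item in main_part.find_all('li', 'clear'):
        try:
            link = item.find('a', href=True)['href']
            title = item.find('div', 'title').text
            price = item.find('div', 'priceInfo').text
        except (AttributeError, TypeError):
            continue  # a field is missing: skip the whole listing
        rows.append((link, title, price))
    return rows

rows = parse_items(SAMPLE_HTML)
print(rows)
```

The second listing has no link, so find returns None and the whole item is skipped. Collecting all fields before storing any of them keeps every column the same length, which the DataFrame step needs.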

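This approach is fairly simple and easy to follow, but it feels slow: scraping the 100 pages took 2 minutes 10 seconds (about 1.3 seconds per page). Most of that time was spent in BeautifulSoup4 parsing each page.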
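Where the time actually goes can be checked by timing the download and the parsing separately. This is a minimal sketch, assuming fetch and parse are any two single-argument callables. The helper name time_stages is hypothetical, and the stand-in lambdas exist only so the sketch runs offline.

```python
import time

def time_stages(urls, fetch, parse):
    """Return (seconds spent fetching, seconds spent parsing) over all urls."""
    fetch_seconds = 0.0
    parse_seconds = 0.0
    for url in urls:
        start = time.perf_counter()
        htmltext = fetch(url)
        middle = time.perf_counter()
        parse(htmltext)
        end = time.perf_counter()
        fetch_seconds += middle - start
        parse_seconds += end - middle
    return fetch_seconds, parse_seconds

# Stand-in callables so the sketch runs offline; swap in real ones to measure.
fetch_s, parse_s = time_stages(
    ['page-1', 'page-2'],
    fetch=lambda url: '<html>%s</html>' % url,
    parse=lambda text: len(text),
)
print('fetch: %.6fs, parse: %.6fs' % (fetch_s, parse_s))
```

With the real scraper, fetch would be get_html and parse a small wrapper that calls get_data with the six lists.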

The lxml library is said to be much faster, so we can try it in the next post.
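Short of rewriting the extraction with lxml directly, BeautifulSoup can already use lxml as its underlying parser when 'lxml' is passed as the second argument. This small sketch assumes lxml may or may not be installed and falls back to the built-in parser. The helper name pick_parser is hypothetical.

```python
import importlib.util

def pick_parser():
    """Return 'lxml' when that package is installed, else the built-in 'html.parser'."""
    if importlib.util.find_spec('lxml') is not None:
        return 'lxml'
    return 'html.parser'

parser_name = pick_parser()
print('BeautifulSoup parser to use:', parser_name)
```

In get_data the soup would then be created with BeautifulSoup(htmltext, pick_parser()). Whether that shortens the 2 minutes 10 seconds is something to measure rather than assume.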
