Lianjia Housing Listings Scraper (Source Code Included)

05-14

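The Lianjia app lists a large number of homes for sale as well as completed sales. Scraping this information yields a lot of valuable data, so this article explains how to crawl it and store it for later analysis.

This article covers the following topics:

Program overview
Tutorial
Implementation approach
Data storage
Visual analysis (for analysis with Pandas, see https://zhuanlan.zhihu.com/p/31532470)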

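Program overview

The program can crawl Lianjia's online second-hand housing listings, historical transaction data, online rental listings, and all residential communities in a specified city.
Storage currently supports three databases (MySQL, PostgreSQL, SQLite3).
Because Lianjia rate-limits requests by IP address, the program does not crawl with multiple threads, and it throttles its request rate to avoid being blocked.
It provides a way to move the MySQL data into ES (Elasticsearch), which makes visual analysis easier.
The data is already used by the Beijing second-hand housing data site www.ershoufangdata.com.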
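
The single-threaded, rate-limited crawling described above can be sketched roughly as follows. This is a minimal illustration in Python 3, not the project's actual code: the function name crawl_sequentially, the fetch callable, and the delay bounds are all assumptions chosen for the example.

```python
import random
import time


def crawl_sequentially(urls, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Fetch URLs one at a time, pausing a random interval between requests.

    `fetch` is any callable that maps a URL to page source (hypothetical here).
    `sleep` is injectable so the pacing can be checked without really waiting.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(random.uniform(min_delay, max_delay))
        results[url] = fetch(url)
    return results
```

Randomizing the pause makes the request pattern less regular; the actual limits Lianjia enforces are not stated in this article, so the bounds would need tuning.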

Tutorial

Source code: https://github.com/XuefengHuang/lianjia-scrawler. If you like it, please give it a star. Thank you!
Download the source code and install the dependencies

1. git clone https://github.com/XuefengHuang/lianjia-scrawler.git
2. cd lianjia-scrawler
# If you'd rather not use [virtualenv](https://virtualenv.pypa.io/en/stable/), skip steps 3 and 4.
3. virtualenv lianjia
4. source lianjia/bin/activate
5. pip install -r requirements.txt

Configure the database and the city and districts to crawl (three databases are supported)

DBENGINE = 'mysql'  # ENGINE OPTIONS: mysql, sqlite3, postgresql
DBNAME = 'test'
DBUSER = 'root'
DBPASSWORD = ''
DBHOST = '127.0.0.1'
DBPORT = 3306
CITY = 'bj'  # only one city; shanghai=sh, shenzhen=sz, ...
REGIONLIST = [u'chaoyang', u'xicheng']  # pinyin only

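Run python scrawl.py. (Comment out line 14 once you have crawled all the community information you want.)
You can edit scrawl.py to crawl only listings for sale, only completed sales, or only rental listings.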
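
Before starting a long crawl, it can help to sanity-check the settings described above. The sketch below is an illustration in Python 3 and is not part of the project: validate_settings is a hypothetical helper, and treating "pinyin" and the city code as lowercase ASCII letters is an assumption.

```python
import re

# Engine names taken from the comment in the settings shown above.
SUPPORTED_ENGINES = {"mysql", "sqlite3", "postgresql"}


def validate_settings(db_engine, city, region_list):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if db_engine not in SUPPORTED_ENGINES:
        problems.append("unsupported DBENGINE: %r" % (db_engine,))
    # Assumption: city codes such as 'bj' or 'sh' are lowercase ASCII letters.
    if not re.fullmatch(r"[a-z]+", city):
        problems.append("CITY does not look like a city code: %r" % (city,))
    for region in region_list:
        # Assumption: a pinyin district name is lowercase ASCII letters only.
        if not re.fullmatch(r"[a-z]+", region):
            problems.append("REGIONLIST entry is not pinyin: %r" % (region,))
    return problems
```

For example, validate_settings('mysql', 'bj', ['chaoyang', 'xicheng']) returns an empty list, while a district name written in Chinese characters is reported as a problem.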

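Implementation approach

Before crawling, look at the structure of the target page or site; the URL structure matters most. Lianjia's second-hand housing list has 100 pages, with URLs of the form http://bj.lianjia.com/ershoufang/pg9/, where bj is the city, /ershoufang/ is the channel name, and pg9 is the page number. We want Beijing's second-hand housing channel, so the leading part stays fixed, while the page number varies from 1 to 100. We therefore split the URL in two: the fixed part is assigned to url, and the variable part is produced in a for loop. As an example, here is how to search second-hand listings for sale by community name: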

import urllib2  # Python 2 standard library
from bs4 import BeautifulSoup

BASE_URL = u"http://bj.lianjia.com/"
url = BASE_URL + u"ershoufang/rs" + urllib2.quote(communityname.encode('utf8')) + "/"
total_pages = misc.get_total_pages(url)  # get the total number of pages
for page in range(total_pages):
    if page > 0:
        # page 1 is url itself; later pages insert pgN before the search term
        url_page = BASE_URL + u"ershoufang/pg%drs%s/" % (page + 1, urllib2.quote(communityname.encode('utf8')))

# Code that gets the total number of pages
def get_total_pages(url):
    source_code = get_source_code(url)
    soup = BeautifulSoup(source_code, 'lxml')
    total_pages = 0
    try:
        page_info = soup.find('div', {'class': 'page-box house-lst-page-box'})
    except AttributeError as e:
        page_info = None
    if page_info is None:
        return None
    page_info_str = page_info.get('page-data').split(',')[0]  # {"totalPage":5,"curPage":1}
    total_pages = int(page_info_str.split(':')[1])
    return total_pages
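
The same URL construction and page-count parsing can be sketched in Python 3 without any network access. This is an illustration under stated assumptions rather than the project's code: parse_total_pages and community_page_urls are hypothetical names, and the page-data attribute is assumed to hold a JSON string such as {"totalPage":5,"curPage":1}, as the comment in the original code suggests.

```python
import json
from urllib.parse import quote

BASE_URL = "http://bj.lianjia.com/"


def parse_total_pages(page_data):
    """Read totalPage from a page-data string; return None if it is missing."""
    if not page_data:
        return None
    # Parsing the attribute as JSON avoids depending on the order of its keys.
    return int(json.loads(page_data)["totalPage"])


def community_page_urls(communityname, total_pages):
    """Build the search URL for every result page of one community."""
    encoded = quote(communityname)  # percent-encodes the UTF-8 bytes of the name
    urls = [BASE_URL + "ershoufang/rs" + encoded + "/"]  # page 1 has no pgN part
    for page in range(2, total_pages + 1):
        urls.append(BASE_URL + "ershoufang/pg%drs%s/" % (page, encoded))
    return urls
```

Returning None when the attribute is absent mirrors the original get_total_pages, so a caller would need to check for None before looping over pages.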

A fetched page cannot be read or mined for data directly; it first has to be parsed. We use BeautifulSoup to parse the page.

soup = BeautifulSoup(source_code, 'lxml')
nameList = soup.findAll("li", {"class": "clear"})

Once the page is parsed, the key information can be extracted. Below we extract each field of a listing.

for name in nameList:  # per house loop
    i = i + 1
    info_dict = {}
    try:
        # title and link
        housetitle = name.find("div", {"class": "title"})
        info_dict.update({u'title': housetitle.get_text().strip()})
        info_dict.update({u'link': housetitle.a.get('href')})

        # community, layout, area, and orientation, separated by '|'
        houseaddr = name.find("div", {"class": "address"})
        info = houseaddr.div.get_text().split('|')
        info_dict.update({u'community': info[0].strip()})
        info_dict.update({u'housetype': info[1].strip()})
        info_dict.update({u'square': info[2].strip()})
        info_dict.update({u'direction': info[3].strip()})

        # floor and construction year
        housefloor = name.find("div", {"class": "flood"})
        floor_all = housefloor.div.get_text().split('-')[0].strip().split(' ')
        info_dict.update({u'floor': floor_all[0].strip()})
        info_dict.update({u'years': floor_all[-1].strip()})

        followInfo = name.find("div", {"class": "followInfo"})
        info_dict.update({u'followInfo': followInfo.get_text()})

        tax = name.find("div", {"class": "tag"})
        info_dict.update({u'taxtype': tax.get_text().strip()})

        # total price, unit price, and house ID
        totalPrice = name.find("div", {"class": "totalPrice"})
        info_dict.update({u'totalPrice': int(totalPrice.span.get_text())})

        unitPrice = name.find("div", {"class": "unitPrice"})
        info_dict.update({u'unitPrice': int(unitPrice.get('data-price'))})
        info_dict.update({u'houseID': unitPrice.get('data-hid')})
    except Exception:
        # skip listings with missing or malformed fields
        continue
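
The loop above skips a listing whenever any field is missing. As an alternative illustration, the '|'-separated address text can be split with an explicit length check. The sketch below is in Python 3; parse_house_info is a hypothetical helper, and the sample strings in the usage note are made up for the example rather than taken from Lianjia.

```python
def parse_house_info(text):
    """Split 'community | housetype | square | direction' text into a dict.

    Returns None when fewer than four fields are present, so the caller can
    decide how to handle an incomplete listing.
    """
    parts = [part.strip() for part in text.split("|")]
    if len(parts) < 4:
        return None
    keys = ("community", "housetype", "square", "direction")
    # zip stops after the four named keys, so extra trailing fields are ignored.
    return dict(zip(keys, parts))
```

With a made-up input such as "望京西园 | 2室1厅 | 80平米 | 南 北 | 精装", the helper keeps the first four fields and drops the fifth; with "a | b" it returns None.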

After extraction, the data is saved to the previously configured database for later analysis.

model.Houseinfo.insert(**info_dict).upsert().execute()
model.Hisprice.insert(houseID=info_dict['houseID'], totalPrice=info_dict['totalPrice']).upsert().execute()
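
To show what the upsert achieves, here is a rough equivalent using only the sqlite3 module from the Python 3 standard library. It is a sketch, not the project's ORM code: the trimmed-down table definitions, the function names, and the composite key on hisprice (houseID, totalPrice) are assumptions made for the example.

```python
import sqlite3


def create_tables(conn):
    """Create simplified versions of the two tables (columns are assumptions)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS houseinfo "
        "(houseID TEXT PRIMARY KEY, title TEXT, totalPrice INTEGER)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hisprice "
        "(houseID TEXT, totalPrice INTEGER, PRIMARY KEY (houseID, totalPrice))"
    )


def save_house(conn, info):
    """Insert a listing, replacing any existing row with the same key."""
    conn.execute(
        "INSERT OR REPLACE INTO houseinfo (houseID, title, totalPrice) VALUES (?, ?, ?)",
        (info["houseID"], info["title"], info["totalPrice"]),
    )
    # One hisprice row per distinct (house, price) pair records price changes.
    conn.execute(
        "INSERT OR REPLACE INTO hisprice (houseID, totalPrice) VALUES (?, ?)",
        (info["houseID"], info["totalPrice"]),
    )
    conn.commit()
```

Under these assumptions, re-crawling an unchanged listing leaves both tables as they were, while a price change updates the houseinfo row and adds a new hisprice row.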

Data storage

Supported databases: MySQL, PostgreSQL, SQLite3
Database tables:

Community: community information (id, title, link, district, bizcurcle, taglist)
Houseinfo: listings for sale (houseID, title, link, community, years, housetype, square, direction, floor, taxtype, totalPrice, unitPrice, followInfo, validdate)
Hisprice: historical transaction information (houseID, totalPrice, date)
Sellinfo: completed sales (houseID, title, link, community, years, housetype, square, direction, floor, status, source, totalPrice, unitPrice, dealdate, updatedate)
Rentinfo: rental listings (houseID, title, link, region, zone, meters, other, subway, decoration, heating, price, pricepre, updatedate)

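Visual analysis

First sync the MySQL data into ES, then analyze it with Kibana. The sync step can be handled by this tool.
Screenshot examples:

Listing information

Listing information as JSON data

Map of listings by district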
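
For readers curious what the MySQL-to-ES sync involves, the sketch below turns database rows into the newline-delimited JSON body that Elasticsearch's bulk API accepts. It is an illustration in Python 3 and not the tool mentioned above: to_bulk_body is a hypothetical helper, the index name houseinfo is an assumption, and actually sending the body to a cluster is left out.

```python
import json


def to_bulk_body(rows, index_name="houseinfo"):
    """Build a bulk request body: one action line plus one document line per row."""
    lines = []
    for row in rows:
        # Using houseID as the document id makes repeated syncs overwrite, not duplicate.
        action = {"index": {"_index": index_name, "_id": row["houseID"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(row, ensure_ascii=False))
    if not lines:
        return ""
    # The bulk format requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```

Each row here is a plain dict such as one fetched from the Houseinfo table.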