Scraping real-estate data with Python and displaying it on a map

First, the approach: scrape the site, extract each community's name, address, price, and coordinates, and save them to Excel. Then upload the Excel file to the BDP site to generate a map report.

This time I used the Scrapy framework, which may be overkill, but I had just finished learning it and wanted some practice. Before writing any code I still recommend analyzing the site and its data first; a good analysis gets you twice the result for half the effort. The site 烏魯木齊吉屋網 (wlmq.jiwu.com, an Urumqi real-estate listing site) has fairly complete data: each page shows a list of properties and supports paging, and clicking a listing opens a detail page with the property's details (name, address, price, longitude and latitude). A pipeline then saves each item to Excel, and finally the map report is generated in BDP. Without further ado, the code:
JiwuspiderSpider.py

```python
# -*- coding: utf-8 -*-
import re

from scrapy import Spider, Request

from jiwu.items import JiwuItem


class JiwuspiderSpider(Spider):
    name = "jiwuspider"
    allowed_domains = ["wlmq.jiwu.com"]
    start_urls = ['http://wlmq.jiwu.com/loupan']

    def parse(self, response):
        """Parse each page of the property list."""
        for url in response.xpath('//a[@class="index_scale"]/@href').extract():
            yield Request(url, self.parse_html)  # for each url in the list, call the detail parser

        # if a "next page" link still exists, extract its url
        nextpage = response.xpath('//a[@class="tg-rownum-next index-icon"]/@href').extract_first()
        # check that it is not empty
        if nextpage:
            yield Request(nextpage, self.parse)  # call back into parse to keep paging

    def parse_html(self, response):
        """Parse each property's detail page and generate an item."""
        pattern = re.compile(
            r'<script type="text/javascript">.*?lng = (.*?);.*?lat = (.*?);.*?bname = (.*?);.*?'
            r'address = (.*?);.*?price = (.*?);',
            re.S)
        item = JiwuItem()
        results = re.findall(pattern, response.text)
        for result in results:
            item['name'] = result[2]
            item['address'] = result[3]
            # keep only the digits of the price; default to 0 if empty
            pricestr = result[4]
            pattern2 = re.compile(r'(\d+)')
            s = re.findall(pattern2, pricestr)
            if len(s) == 0:
                item['price'] = 0
            else:
                item['price'] = s[0]
            item['lng'] = result[0]
            item['lat'] = result[1]
            yield item
```
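The regex in `parse_html` can be tried out on its own, without running the crawler. The sketch below is a standalone illustration: the sample HTML is invented (the real pages on wlmq.jiwu.com embed the values in an inline `<script>` block in roughly this shape), and it also shows the digit-only price cleanup the spider performs.

```python
import re

# Invented sample of the inline <script> block a detail page embeds.
sample = """<script type="text/javascript">
    var lng = 87.616848;
    var lat = 43.825592;
    var bname = "某小區";
    address = "某路1號";
    price = "8500元/平米";
</script>"""

# Same pattern as in parse_html above.
pattern = re.compile(
    r'<script type="text/javascript">.*?lng = (.*?);.*?lat = (.*?);.*?bname = (.*?);.*?'
    r'address = (.*?);.*?price = (.*?);',
    re.S)

lng, lat, name, address, pricestr = re.findall(pattern, sample)[0]

# Keep only the digits of the price, as the spider does; fall back to 0.
digits = re.findall(r'(\d+)', pricestr)
price = digits[0] if digits else 0

print(name, address, price, lng, lat)
```

Because the pattern uses non-greedy `(.*?)` groups with `re.S`, each capture stops at the first following `;`, so stray whitespace or `var` keywords before the assignments do not matter.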
items.py

```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class JiwuItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    price = scrapy.Field()
    address = scrapy.Field()
    lng = scrapy.Field()
    lat = scrapy.Field()
```
pipelines.py — note that the MongoDB save call is commented out here; you can pick whichever storage method you prefer.

```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from scrapy.conf import settings
from openpyxl import workbook


class JiwuPipeline(object):
    wb = workbook.Workbook()
    ws = wb.active
    # header row: community name, address, price, longitude, latitude
    ws.append(['小區名稱', '地址', '價格', '經度', '緯度'])

    def __init__(self):
        # read the database connection settings
        host = settings['MONGODB_URL']
        port = settings['MONGODB_PORT']
        dbname = settings['MONGODB_DBNAME']
        client = pymongo.MongoClient(host=host, port=port)

        # select the database
        db = client[dbname]
        self.table = db[settings['MONGODB_TABLE']]

    def process_item(self, item, spider):
        jiwu = dict(item)
        # self.table.insert(jiwu)  # uncomment to save to MongoDB instead
        line = [item['name'], item['address'], str(item['price']), item['lng'], item['lat']]
        self.ws.append(line)
        self.wb.save('jiwu.xlsx')

        return item
```
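For the pipeline to run, it has to be registered in the project's settings.py, along with the MongoDB connection keys it reads. A minimal sketch of that fragment follows; the key names match those used in the pipeline above, but the values are placeholders you would adjust for your own setup:

```python
# settings.py (fragment) -- values are placeholders, adjust for your setup
ITEM_PIPELINES = {
    'jiwu.pipelines.JiwuPipeline': 300,  # priority 300, lower runs earlier
}

# connection info read by JiwuPipeline.__init__
MONGODB_URL = 'localhost'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'jiwu'
MONGODB_TABLE = 'loupan'
```

With this in place, `scrapy crawl jiwuspider` runs the spider and writes `jiwu.xlsx` in the directory the crawl is started from.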
The data behind the final report:

Screenshot of the map report built from the scraped data: BDP shared dashboard, shared visualization.