Scrapy by Example (2): Scraping Infinite-Scroll Pages

01-29

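The previous post scraped news data from the Huxiu homepage, and a friend suggested I try scraping multiple pages. After some digging, it turns out the Huxiu homepage loads more content with a POST request that carries its parameters in Form Data, so wrapping the request in a loop should be enough. That was my first idea. Before trying it, let's look at how this kind of page is handled in Scrapy.

How do you scrape an infinite-scroll page?

Let's start with an example. The target site is quotes (http://spidyquotes.herokuapp.com/scroll).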

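Analyzing the page

As you scroll down, new requests keep appearing. Each of them returns JSON, which is exactly the data we want. The only difference between them is one changing parameter. The full URLs are: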

http://spidyquotes.herokuapp.com/api/quotes?page=2
http://spidyquotes.herokuapp.com/api/quotes?page=3
http://spidyquotes.herokuapp.com/api/quotes?page=4

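That makes the pattern clear.

The response is JSON, so we need to parse it and then extract the data. How do we know how many pages of JSON there are? The response itself tells us: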

has_next:true
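
The paging rule can be sketched as a small pure function. This is only an illustration: the helper name next_page_url and the sample payloads are made up here. Only the has_next and page keys and the URL pattern come from the requests shown above.

```python
import json

# URL pattern taken from the requests observed above.
QUOTES_API = "http://spidyquotes.herokuapp.com/api/quotes?page=%s"


def next_page_url(body):
    """Return the URL of the next page, or None when has_next is false.

    `body` is the raw JSON text of one API response. The helper name is
    hypothetical; it is not part of Scrapy.
    """
    data = json.loads(body)
    if data.get("has_next"):
        return QUOTES_API % (data["page"] + 1)
    return None


# Hand-written sample payloads (assumed shape, trimmed to the keys used).
sample_middle = '{"has_next": true, "page": 2, "quotes": []}'
sample_last = '{"has_next": false, "page": 10, "quotes": []}'

print(next_page_url(sample_middle))
print(next_page_url(sample_last))
```

Keeping the decision in one place like this makes it easy to check the stop condition without touching the network.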

Creating the project

scrapy startproject quote
cd quote
scrapy genspider spiderquote http://spidyquotes.herokuapp.com/scroll

Defining the Item

Looking at the site, we will collect three fields: text, author, and tags.

import scrapy


class QuoteItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

Writing the spider

# -*- coding: utf-8 -*-
import json

import scrapy


class SpiderquoteSpider(scrapy.Spider):
    name = "spiderquote"
    quotes_base_url = "http://spidyquotes.herokuapp.com/api/quotes?page=%s"
    start_urls = [quotes_base_url % 1]
    download_delay = 1.5  # wait 1.5 seconds between requests

    def parse(self, response):
        # Each response is a JSON document holding one page of quotes.
        data = json.loads(response.body)
        for item in data.get("quotes", []):
            yield {
                "text": item.get("text"),
                "author": item.get("author", {}).get("name"),
                "tags": item.get("tags"),
            }
        # Request the next page for as long as the API reports one.
        if data["has_next"]:
            next_page = data["page"] + 1
            yield scrapy.Request(self.quotes_base_url % next_page)

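Run the spider and you will see the results.

Applying this to Huxiu

So how do we apply this to Huxiu? As before, start by analyzing the page.

The Huxiu request carries three parameters: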

huxiu_hash_code:13a3a353c52d424e1e263dda4d594e59
page:3
last_dateline:1512026700

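We know that page is the page number and huxiu_hash_code is a string that does not change. last_dateline looks like a Unix timestamp, and checking it confirms that it is. Does the timestamp actually need to be sent? I wanted to test that.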
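
One quick way to check the timestamp guess is to convert the value with the standard library. A minimal sketch, assuming the number is seconds since the Unix epoch; the variable names are chosen here for illustration.

```python
from datetime import datetime, timezone

last_dateline = 1512026700  # value copied from the Form Data above

# Interpret the integer as seconds since the Unix epoch, in UTC.
as_utc = datetime.fromtimestamp(last_dateline, tz=timezone.utc)
print(as_utc.isoformat())
```

This prints 2017-11-30T07:25:00+00:00, a plausible publication time, which supports reading the field as a Unix timestamp.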

Testing in Postman shows that the request still returns data without last_dateline, and the returned JSON also tells us how many pages there are in total:

"total_page": 1654
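
With the total known, the form data for every page could be generated from it instead of from a hard-coded range. The sketch below is only an illustration: form_payloads is a hypothetical helper, and it takes the total as a plain argument because where total_page sits inside the JSON is not shown here.

```python
def form_payloads(total_page, hash_code, first_page=1):
    """Yield one form-data dict per page, first_page..total_page inclusive.

    Page numbers are converted to strings so every form value is text.
    """
    for page in range(first_page, total_page + 1):
        yield {"huxiu_hash_code": hash_code, "page": str(page)}


HASH_CODE = "13a3a353c52d424e1e263dda4d594e59"  # from the Form Data above
payloads = list(form_payloads(total_page=1654, hash_code=HASH_CODE))
print(len(payloads), payloads[0], payloads[-1])
```

Each dict has the shape a POST form needs, so one request per dict would cover every page.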

In the spider we can follow the same pattern:

# -*- coding: utf-8 -*-
import json

import scrapy
from lxml import etree

from huxiu.items import HuxiuItem


class HuxiuSpider(scrapy.Spider):
    name = "HuXiu"

    def start_requests(self):
        url = "https://www.huxiu.com/v2_action/article_list"
        for i in range(1, 10):
            # FormRequest is how Scrapy sends a POST request
            yield scrapy.FormRequest(
                url=url,
                formdata={
                    "huxiu_hash_code": "13a3a353c52d424e1e263dda4d594e59",
                    "page": str(i),
                },
                callback=self.parse,
            )

    def parse(self, response):
        item = HuxiuItem()
        data = json.loads(response.text)
        # data["data"] is an HTML fragment, parsed here with lxml
        s = etree.HTML(data["data"])
        item["title"] = s.xpath('//a[@class="transition msubstr-row2"]/text()')
        item["link"] = s.xpath('//a[@class="transition msubstr-row2"]/@href')
        item["author"] = s.xpath('//span[@class="author-name"]/text()')
        item["introduction"] = s.xpath('//div[@class="mob-sub"]/text()')
        yield item

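The output is a bit ugly: it comes out in chunks, because each item holds a whole page of titles, links, authors, and introductions as lists.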
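
One possible way to tidy this up is to pair the parallel lists into one record per article before yielding. This is a sketch, not the approach used in this post: rows_from_columns is a hypothetical helper and the sample values are invented.

```python
def rows_from_columns(titles, links, authors, introductions):
    """Pair up parallel per-page lists into one dict per article.

    zip() stops at the shortest list, so if a field is missing for some
    article the remaining entries can be misaligned or dropped.
    """
    return [
        {
            "title": title.strip(),
            "link": link,
            "author": author.strip(),
            "introduction": intro.strip(),
        }
        for title, link, author, intro in zip(titles, links, authors, introductions)
    ]


# Invented sample data standing in for one page of XPath results.
rows = rows_from_columns(
    titles=[" First article ", "Second article"],
    links=["/article/1.html", "/article/2.html"],
    authors=["alice", "bob"],
    introductions=["intro one ", "intro two"],
)
for row in rows:
    print(row)
```

Because zip() only lines the lists up by position, selecting each article's container first and reading its fields relative to it would be the more robust variant.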

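Because data["data"] is a chunk of HTML, I parse it with lxml and query it with XPath. I am not sure whether Scrapy's own XPath tools can be used on it directly. If they can, please let me know in the comments.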

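Takeaways

Scraping dynamic sites with Scrapy starts with analyzing the page
How to simulate POST requests with Scrapy (see the Scrapy documentation)
Liu Yifei is very pretty

To do

Improve how the data is saved and parsed
A full-site crawl took about 3 minutes, which is a bit slow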
