Scrapy Learning Examples (3): Crawling Pages in Bulk

02-08

---

Let's start with a Naruto track to calm the nerves.

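I first came across Rules in the Scrapy documentation, but I could not work out what they meant. Later I saw them used in other people's examples, so I spent quite a lot of time figuring them out.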

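Explanation:
A Rule defines how links are extracted. In a typical setup, two rules cover the paginated list pages and the detail pages respectively. The key point is to use restrict_xpaths so that the links to crawl next are taken only from a specific part of the page.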

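In my own words: first, Rules make pagination easy; second, they let you crawl second-level pages, which is like opening each detail page and reading its content. So with Rules, crawling pages in bulk becomes convenient.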

Official documentation

CrawlSpider example

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class MySpider(CrawlSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com']

        rules = (
            # Extract links matching 'category.php' (but not matching 'subsection.php')
            # and follow links from them (since no callback means follow=True by default).
            Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

            # Extract links matching 'item.php' and parse them with the spider's method parse_item
            Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
        )

        def parse_item(self, response):
            self.logger.info('Hi, this is an item page! %s', response.url)
            # A plain dict works as an item; a bare scrapy.Item() declares no
            # fields and would reject these keys.
            item = {}
            item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
            item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
            item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
            return item

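This spider starts crawling from the home page of http://example.com, collects the category and item links, and applies the parse_item method to the latter. For each item response, it extracts some data from the HTML with XPath and uses it to populate the item.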

Practical use

To understand this better, let's look at how Rules are used in real cases.

Douban

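    rules = [
        # Follow the pagination links; "\?" matches the literal "?" in the URL.
        Rule(LinkExtractor(allow=(r'https://movie\.douban\.com/top250\?start=\d+.*',))),
        # Parse each movie page with parse_item and do not follow links from it.
        Rule(LinkExtractor(allow=(r'https://movie\.douban\.com/subject/\d+',)),
             callback='parse_item', follow=False),
    ]

If you have used Django, you will notice that these rules look a lot like Django's URL routing (I have forgotten all my Django by now -_-!). Both are simply regular-expression matching.

The pattern r'https://movie\.douban\.com/top250\?start=\d+.*' matches the pagination links. Note that the ? has to be escaped as \?; left unescaped, it would make the preceding 0 optional instead of matching the literal question mark. Example links: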

https://movie.douban.com/top250?start=25&filter=
https://movie.douban.com/top250?start=50&filter=

The pattern r'https://movie\.douban\.com/subject/\d+' matches the links to individual movies, for example:

https://movie.douban.com/subject/1292052/
https://movie.douban.com/subject/1291546/
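A quick way to check such patterns before putting them into a Rule is to run them through Python's re module against the sample URLs. This is only a sketch: it assumes search-style matching on the full URL, and the helper name matches is my own.

```python
import re

# With an unescaped "?", the regex treats the preceding "0" as optional
# instead of matching the literal question mark in the URL.
unescaped = re.compile(r"https://movie.douban.com/top250?start=\d+.*")
# Escaped variant: "\?" matches the literal "?" and "\." a literal dot.
page_pattern = re.compile(r"https://movie\.douban\.com/top250\?start=\d+.*")
subject_pattern = re.compile(r"https://movie\.douban\.com/subject/\d+")

page_urls = [
    "https://movie.douban.com/top250?start=25&filter=",
    "https://movie.douban.com/top250?start=50&filter=",
]
subject_urls = [
    "https://movie.douban.com/subject/1292052/",
    "https://movie.douban.com/subject/1291546/",
]


def matches(pattern, urls):
    """Return the URLs in which the pattern finds a match."""
    return [url for url in urls if pattern.search(url)]


print(matches(unescaped, page_urls))          # nothing matches
print(matches(page_pattern, page_urls))       # both pagination URLs
print(matches(subject_pattern, subject_urls)) # both movie URLs
```

The first result is empty, which is why the question mark needs a backslash in front of it.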

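Lianjia

A crawler usually needs to pick up further links from a page and then follow them level by level. Scrapy provides the LinkExtractor class for extracting links from a page, and it is meant to be used inside the CrawlSpider class. Compared with Spider, the main addition in CrawlSpider is rules, where you can add extraction rules. Look at the following example first, which crawls links from Lianjia: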

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class LianjiaSpider(CrawlSpider):
        name = "lianjia"
        allowed_domains = ["lianjia.com"]
        start_urls = [
            "http://bj.lianjia.com/ershoufang/"
        ]

        rules = [
            # Match the regular expression and handle the next page.
            # "\d+" matches the page number; "https?" accepts both schemes.
            Rule(LinkExtractor(allow=(r'https?://bj\.lianjia\.com/ershoufang/pg\d+/?$',)),
                 callback='parse_item'),
            # Match the regular expression, add the results to the URL list,
            # and set a request pre-processing function.
            # Rule(FangLinkExtractor(allow=('http://www.lianjia.com/client/', )), follow=True, process_request='add_cookie')
        ]

        def parse_item(self, response):
            # Handle the response here, just like in the earlier parse method.
            pass

In the same way, r'https?://bj\.lianjia\.com/ershoufang/pg\d+/?$' matches the next-page links, for example:

https://bj.lianjia.com/ershoufang/pg2/
https://bj.lianjia.com/ershoufang/pg3/

You can also use r'https://bj\.lianjia\.com/ershoufang/\d+\.html' to match the detail-page links, for example:

https://bj.lianjia.com/ershoufang/101102126888.html
https://bj.lianjia.com/ershoufang/101100845676.html
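The two kinds of Lianjia links can be told apart with a small stand-alone check. This is a sketch rather than part of the spider: the function name classify and the exact patterns (with https? and a trailing $) are my own variants of the ones discussed here.

```python
import re

# List pages end in /pg<number>/, detail pages in /<number>.html.
LIST_PAGE = re.compile(r"https?://bj\.lianjia\.com/ershoufang/pg\d+/?$")
DETAIL_PAGE = re.compile(r"https?://bj\.lianjia\.com/ershoufang/\d+\.html$")


def classify(url):
    """Return 'list', 'detail', or None for URLs that match neither pattern."""
    if LIST_PAGE.search(url):
        return "list"
    if DETAIL_PAGE.search(url):
        return "detail"
    return None


for url in [
    "https://bj.lianjia.com/ershoufang/pg2/",
    "https://bj.lianjia.com/ershoufang/101102126888.html",
    "http://bj.lianjia.com/ershoufang/",
]:
    print(url, "->", classify(url))
```

Checking patterns this way is cheaper than starting a crawl only to find that a rule never fires.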

Parameters

Rule objects

A Rule object takes the following parameters:

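link_extractor: the link extraction rule
callback: the callback invoked with the response for each link extracted by link_extractor
cb_kwargs: extra keyword arguments that are passed to the callback
follow: whether links should also be extracted from the pages fetched through this rule (boolean). If False, no further links are extracted from those pages. When omitted, it defaults to True if no callback is given and to False otherwise
process_links: a callback that receives the list of links extracted from each response, usually used for filtering
process_request: pre-processes each request generated by the rule (for example, to add headers or cookies)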

LinkExtractor

Commonly used LinkExtractor parameters:

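allow: extract links that match the regular expression(s)
deny: exclude links that match the regular expression(s) (takes precedence over allow)
allow_domains: domains to allow (str or list)
deny_domains: domains to exclude (str or list)
restrict_xpaths: extract links only from the regions selected by the XPath expression(s) (str or list)
restrict_css: extract links only from the regions selected by the CSS selector(s) (str or list)
tags: tags to extract links from, a and area by default (str or list)
attrs: attributes to read links from, href by default (list)
unique: whether extracted links are de-duplicated (boolean)
process_value: a function applied to each extracted attribute value before the allow/deny filtering; returning None drops the link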

For a detailed description of the LinkExtractor parameters, see the official documentation.

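Note: when writing the rules for a crawl spider, avoid using parse as the callback, because CrawlSpider uses the parse method to implement its own logic. If you override parse, the crawl spider will no longer work.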

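Finally, a silly mistake I made myself. I have a habit with Scrapy: after creating a project, I cd straight into the directory and run the genspider command, and then...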

    D:\Backup\桌面
    λ scrapy startproject example
    New Scrapy project 'example', using template directory 'c:\users\administrator\appdata\local\programs\python\python36\lib\site-packages\scrapy\templates\project', created in:
        D:\Backup\桌面\example

    You can start your first spider with:
        cd example
        scrapy genspider example example.com

    D:\Backup\桌面
    λ cd example

    D:\Backup\桌面\example
    λ scrapy genspider em example.com
    Created spider 'em' using template 'basic' in module:
      example.spiders.em

And then my em.py looked like this:

    # -*- coding: utf-8 -*-
    import scrapy


    class EmSpider(scrapy.Spider):
        name = 'em'
        allowed_domains = ['example.com']
        start_urls = ['http://example.com/']

        def parse(self, response):
            pass

Note that rules cannot be used at this point, because the base class is wrong. It should be

    class EmSpider(CrawlSpider):

rather than

    class EmSpider(scrapy.Spider):

Let's keep at it together!

The next part should cover what each component in Scrapy does, along with the famous diagram of them.
