Using the Scrapy Framework: Scraping and Analysis

02-12

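Target URL: "http://blog.jobbole.com/all-posts/"

Tools: 1. Windows 10  2. IDE: PyCharm  3. The Scrapy framework  4. Google Chrome

References: for beginners, the first pick is of course Professor Song Tian's Python course series, which is free and efficient.

Course page: "Python網路爬蟲與信息提取_北京理工大學_中國大學MOOC(慕課)" (Python Web Crawling and Information Extraction, Beijing Institute of Technology, China University MOOC)

Second: the Chinese translation of the official Scrapy documentation: "Scrapy 0.25 文檔 - Scrapy 0.24.1 文檔"

Third: XPath syntax, see the XPath tutorial ("XPath 教程").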

------------------------------- divider, hehe ----------------------------------

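The first step with Scrapy is to create a project. Press Win+R, type cmd, and press Enter to open a command prompt.

Then create the project step by step, in the order Scrapy expects:

scrapy startproject tutorial    (tutorial is a project name of your own choosing)

Then change into that folder: cd tutorial

scrapy genspider <spider name, which becomes the .py file under spiders/> <target domain>    (for this post: scrapy genspider bole blog.jobbole.com)

The startproject command creates a tutorial directory with the following contents:

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

These files are:

scrapy.cfg: the project's configuration file.
tutorial/: the project's Python module; this is where you will add your code.
tutorial/items.py: the project's items file.
tutorial/pipelines.py: the project's pipelines file.
tutorial/settings.py: the project's settings file.
tutorial/spiders/: the directory that holds the spider code.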

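Now that the project exists, open it in PyCharm.

Once opened, the directory structure looks like this:

Click any article to open its detail page and see the content we want to scrape:

So the fields to scrape are the title (title), the publication time (time), the tags (tags), the number of likes (thumb_num), the number of bookmarks (collect_num), the number of comments (comment_num), and the body text (article).

Start by declaring these fields in items.py, that is, define an Item.

An Item is the container that holds the scraped data. It is used much like a Python dict, and it adds a safeguard: assigning to a field that was never declared raises an error, so a misspelled field name cannot slip through unnoticed. Edit the items.py file in the bolearticles directory (the project used in this post is named bolearticles rather than tutorial).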

Analyze the page structure: open the browser's developer tools on the detail page.

<div class="entry-header"> <h1>談談程序員的離職和跳槽</h1> </div>

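Clicking the title with the element picker shows that it sits inside this div, so we can pull it out with XPath. I usually check an XPath expression in one of two ways. The first is the XPath Helper extension for Chrome. The Chrome Web Store is not reachable from mainland China, but there are workarounds for downloading the extension that you can find with a Baidu search if you are interested. Once installed, it shows up as an X-shaped icon.

The second is the interactive shell that Scrapy provides. In the cmd window, run:

scrapy shell "target URL"    (here the target URL is the URL of the detail page)

Once the page has loaded, you land at an interactive prompt where the downloaded page is available as response.

Note: a quick XPath primer: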

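/html/head/title: selects the <title> element inside the <head> of the HTML document
/html/head/title/text(): selects the text of that <title> element
//td: selects all <td> elements
//div[@class="mine"]: selects all div elements that have the attribute class="mine"

These are only a few simple examples; XPath is far more powerful than this. To learn more, see the XPath tutorial listed in the references at the top.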
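To try an attribute predicate without starting Scrapy, here is a small sketch using the standard library's ElementTree on the title markup shown earlier. This is a stand-in, not what Scrapy uses: ElementTree supports only a subset of XPath (no text() step, for example), and the wrapping <body> element is my own addition so the snippet parses as one document.

```python
import xml.etree.ElementTree as ET

# Title markup copied from the article detail page shown earlier.
snippet = '<div class="entry-header"> <h1>談談程序員的離職和跳槽</h1> </div>'

# Wrap the snippet in a root element so that ".//" can search below it.
root = ET.fromstring("<body>" + snippet + "</body>")

# ElementTree has the [@attr='value'] predicate but no text() step,
# so the text is read from the element's .text attribute instead.
headings = root.findall(".//div[@class='entry-header']/h1")
title = headings[0].text

print(title)
```

In Scrapy itself the equivalent expression would end in /text(), as in the spider code later in this post.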

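Now let's try extracting the title. In the Scrapy shell, response.xpath('//div[@class="entry-header"]/h1/text()') returns a result, but it is a list of selectors rather than a string. Calling extract() turns it into a list of strings, and [0] takes the first one.

Next, the time:

<p class="entry-meta-hide-on-mobile"> 2017/12/09 · <a href="http://blog.jobbole.com/category/career/" rel="category tag" class="">職場</a> · <a href="#article-comment" class=""> 6 評論 </a> · <a href="http://blog.jobbole.com/tag/%e8%81%8c%e5%9c%ba/" class="">職場</a>, <a href="http://blog.jobbole.com/tag/%e8%b7%b3%e6%a7%bd/" class="">跳槽</a> </p>

Selecting the time with the element picker reveals the HTML above.

Then strip the unwanted characters (the surrounding whitespace and the "·" separator) with ordinary Python string methods.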
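The cleanup step can be sketched as follows. The raw string is an assumed example of what the first text node of the entry-meta paragraph looks like (the date, then the "·" separator, padded with whitespace).

```python
# Assumed raw value: the first text node of the entry-meta paragraph.
raw_time = " 2017/12/09 · "

# Trim whitespace, drop the separator, then trim the space it left behind.
clean_time = raw_time.strip().replace("·", "").strip()

print(clean_time)
```

The second strip() matters: without it, removing the "·" would leave a trailing space on the date.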

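Next come the likes, comments, and so on.

The like count lives in the markup below. Extracting it is a little harder, because it does not follow the same pattern as the fields above.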

<span data-post-id="112809" class=" btn-bluet-bigger href-style vote-post-up register-user-only "><i class="fa fa-thumbs-o-up"></i> <h10 id="112809votetotal">1</h10> 贊</span>

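The class attribute holds many values, and the like count we want is inside the h10 tag. This calls for XPath's contains() function: //span[contains(@class, "vote-post-up")]/h10/text()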
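Why contains() is needed can be seen with a small standard-library sketch on the markup above. ElementTree has no contains() function, so the substring test is written in plain Python as a stand-in; in Scrapy you would use the contains() XPath directly. The wrapping <body> element is my own addition.

```python
import xml.etree.ElementTree as ET

# Like-button markup copied from the detail page above.
snippet = (
    '<span data-post-id="112809" class=" btn-bluet-bigger href-style '
    'vote-post-up register-user-only "><i class="fa fa-thumbs-o-up"></i> '
    '<h10 id="112809votetotal">1</h10> 贊</span>'
)
root = ET.fromstring("<body>" + snippet + "</body>")

# An exact comparison against the whole attribute value finds nothing,
# because the attribute holds several class names plus padding spaces.
exact = root.findall(".//span[@class='vote-post-up']")

# Plain-Python stand-in for contains(@class, "vote-post-up"): substring test.
matched = [
    span for span in root.iter("span")
    if "vote-post-up" in span.get("class", "")
]
thumb_num = matched[0].find("h10").text

print(len(exact), thumb_num)
```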

The comment count is different again: it sits inside an a tag.

Now we can extract the tags.

<p class="entry-meta-hide-on-mobile"> 2017/12/09 · <a href="http://blog.jobbole.com/category/career/" rel="category tag" class="">職場</a> · <a href="#article-comment" class=""> 6 評論 </a> · <a href="http://blog.jobbole.com/tag/%e8%81%8c%e5%9c%ba/" class="">職場</a>, <a href="http://blog.jobbole.com/tag/%e8%b7%b3%e6%a7%bd/" class="">跳槽</a> </p>

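Build the XPath to extract them: //p[@class="entry-meta-hide-on-mobile"]/a/text()

Think this through: the result also contains the "N 評論" (comments) link, and what if an article has no comment link at all? Removing that entry by its position in the list would then raise an IndexError. Since the comment count has already been extracted separately, we simply filter the comment entry out of the tag list with basic Python, which works whether or not it is present.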
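A minimal sketch of that filter, using the link texts from the markup above as sample input. The helper name clean_tags is hypothetical and exists only for this illustration.

```python
def clean_tags(link_texts):
    """Join category/tag names, skipping the "N 評論" comment link."""
    # Filtering by content works whether or not the comment link exists,
    # whereas deleting by position fails on an article without it.
    names = [text for text in link_texts if not text.strip().endswith("評論")]
    return ",".join(names)


# Link texts as they appear in the entry-meta paragraph above.
with_comments = clean_tags(["職場", " 6 評論 ", "職場", "跳槽"])
# The same article imagined without a comment link.
without_comments = clean_tags(["職場", "跳槽"])

print(with_comments)
print(without_comments)
```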

Last comes the body text.

That completes the extraction for the article detail page. Put together, the code looks like this (the spider part):

    def parse_item(self, response):
        item = BolearticlesItem()
        item["title"] = response.xpath('//div[@class="entry-header"]/h1/text()').extract()[0]
        # Keep only the date: trim whitespace and drop the "·" separator.
        item["time"] = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/text()').extract()[0].strip().replace("·", "").strip()
        tags = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/a/text()').extract()
        # Skip the "N 評論" link so that only category and tag names remain.
        item["tags"] = ",".join([element for element in tags if not element.strip().endswith("評論")])
        item["thumb_num"] = response.xpath('//span[contains(@class, "vote-post-up")]/h10/text()').extract()[0]
        collect_nums = response.xpath('//span[contains(@class, "bookmark-btn")]/text()').extract()[0]
        # re.search finds the first run of digits; a greedy ".*(\d+).*" with
        # re.match would capture only the last digit of a multi-digit count.
        collect = re.search(r"(\d+)", collect_nums)
        if collect:
            item["collect_num"] = int(collect.group(1))
        else:
            item["collect_num"] = 0
        comment_nums = response.xpath('//a[@href="#article-comment"]/span/text()').extract()[0]
        comment = re.search(r"(\d+)", comment_nums)
        if comment:
            item["comment_num"] = int(comment.group(1))
        else:
            item["comment_num"] = 0
        item["article"] = response.xpath('//div[@class="entry"]/p/text()').extract()
        yield item

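The main work here is the regular expression, which pulls the number out of the comment and bookmark text and discards the words that follow it.

The rough process:

There is one catch: what if an article has no comments or bookmarks? That is why the condition is there: when no number is found, the field is set to 0.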
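The number extraction with its fallback can be tried on its own. In this sketch the sample strings are assumptions about what the button text looks like, and extract_count is a hypothetical helper name. It also shows a trap worth knowing: with a greedy ".*" in front, the capture group keeps only the last digit of a multi-digit count.

```python
import re


def extract_count(text):
    """Return the first integer found in text, or 0 when there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else 0


# Assumed sample strings: a bookmark button with and without a count.
with_number = extract_count(" 16 收藏")
without_number = extract_count(" 收藏")

# The greedy ".*" swallows the leading digits before (\d+) gets a turn.
greedy = re.match(r".*(\d+).*", " 16 收藏").group(1)

print(with_number, without_number, greedy)
```

Searching for the first run of digits avoids the greedy problem and also works when the text starts with a line break, which "." does not match.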

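Next, pagination.

Open the list page of the target site.

Clicking "next page" reveals a very simple pattern:

the number in the URL is simply the page number, so page N lives at http://blog.jobbole.com/all-posts/page/N/.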
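The pattern can be written down as a tiny sketch. The URL template is taken from the spider's start_urls; page_urls is a hypothetical helper, and the page count passed to it is an arbitrary illustration, not the site's real number of pages.

```python
# URL pattern taken from the spider's start_urls.
BASE_URL = "http://blog.jobbole.com/all-posts/page/{page}/"


def page_urls(last_page):
    """Return the list-page URLs for pages 1..last_page."""
    return [BASE_URL.format(page=number) for number in range(1, last_page + 1)]


for url in page_urls(3):
    print(url)
```

The spider in this post does not generate URLs this way; it follows the "next page" link instead, which has the advantage of not needing to know the last page number in advance.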

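Scrapy works like this: you only need to hand it a URL. The URL is wrapped in a Request and passed along, and the Downloader fetches the page for you.

That is the top-right part of the Scrapy architecture diagram. So on this list page we only need to extract each article's URL and send it to the downloader, which fetches the detail page. For that we write one more function, parse.

What does parse do? It extracts each article's URL and the next-page URL and yields them as requests; the downloaded article pages are then handled by parse_item. The division of labor is clear, much like the departments inside a company.

Analyze the page structure:

The article URLs are all under div[@class="post floated-thumb"]/div/a/, in the href attribute.

The result is a list, so loop over it:

That takes care of each article's URL; next comes pagination:

We can follow the "next page" link, which is found with //a[@class="next page-numbers"]/@href.

Each time a page is processed, the link to the following page is extracted, and this repeats until the last page. The code: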

    def parse(self, response):
        urls = response.xpath('//div[@class="post floated-thumb"]/div/a/@href').extract()
        for url in urls:
            # Send each article URL to the downloader; parse_item handles the response.
            yield scrapy.Request(url=url, callback=self.parse_item)
        next_url = response.xpath('//a[@class="next page-numbers"]/@href').extract()
        # The last page has no "next" link, so only follow it when it exists.
        if next_url:
            yield scrapy.Request(url=next_url[0], callback=self.parse)

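The first kind of URL goes to parse_item, which handles the downloaded detail page. The second is the pagination URL. Who should receive that one? parse, of course. Why? The next page is again a list of article URLs, which is exactly what parse deals with. So the pagination URL goes back to parse, and parse in turn picks out the article URLs and hands them to parse_item.

The complete spider code: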

    # -*- coding: utf-8 -*-
    import re

    import scrapy

    from bolearticles.items import BolearticlesItem


    class BoleSpider(scrapy.Spider):
        name = "bole"
        allowed_domains = ["blog.jobbole.com"]
        start_urls = ["http://blog.jobbole.com/all-posts/page/1/"]

        def parse(self, response):
            urls = response.xpath('//div[@class="post floated-thumb"]/div/a/@href').extract()
            for url in urls:
                # Send each article URL to the downloader; parse_item handles the response.
                yield scrapy.Request(url=url, callback=self.parse_item)
            next_url = response.xpath('//a[@class="next page-numbers"]/@href').extract()
            # The last page has no "next" link, so only follow it when it exists.
            if next_url:
                yield scrapy.Request(url=next_url[0], callback=self.parse)

        def parse_item(self, response):
            item = BolearticlesItem()
            item["title"] = response.xpath('//div[@class="entry-header"]/h1/text()').extract()[0]
            # Keep only the date: trim whitespace and drop the "·" separator.
            item["time"] = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/text()').extract()[0].strip().replace("·", "").strip()
            tags = response.xpath('//p[@class="entry-meta-hide-on-mobile"]/a/text()').extract()
            # Skip the "N 評論" link so that only category and tag names remain.
            item["tags"] = ",".join([element for element in tags if not element.strip().endswith("評論")])
            item["thumb_num"] = response.xpath('//span[contains(@class, "vote-post-up")]/h10/text()').extract()[0]
            collect_nums = response.xpath('//span[contains(@class, "bookmark-btn")]/text()').extract()[0]
            # re.search finds the first run of digits; a greedy ".*(\d+).*" with
            # re.match would capture only the last digit of a multi-digit count.
            collect = re.search(r"(\d+)", collect_nums)
            if collect:
                item["collect_num"] = int(collect.group(1))
            else:
                item["collect_num"] = 0
            comment_nums = response.xpath('//a[@href="#article-comment"]/span/text()').extract()[0]
            comment = re.search(r"(\d+)", comment_nums)
            if comment:
                item["comment_num"] = int(comment.group(1))
            else:
                item["comment_num"] = 0
            item["article"] = response.xpath('//div[@class="entry"]/p/text()').extract()
            yield item

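Note that a Scrapy project is run from the terminal.

Click the Terminal button in PyCharm, type scrapy crawl bole, and press Enter.

What if you want to save the results as a JSON or CSV file?

Type scrapy crawl bole -o bole.json

or scrapy crawl bole -o bole.csv.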

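At this point we have the data and can save it as CSV or JSON. But what about saving it to a database? The code for that comes next. This post saves to MySQL; in Scrapy, the code that stores scraped data in a database belongs in the pipeline.

To work with the database I use the Navicat for MySQL client, which is small and convenient. I won't walk through installing MySQL itself, since there are plenty of tutorials on Baidu. Installing Navicat is easy: just click Next until it finishes.

After installing and opening it, it looks like this:

Step one is to create the table. There is a Connection button in the top-left corner; click it, choose MySQL, and a dialog appears.

The connection name can be anything you like, and so can the password. Finally, click OK.

Right-click in the panel to create a table. The table's columns must match the fields of the Item in Scrapy.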
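To make "columns must match the Item fields" concrete, here is a runnable sketch that swaps in the standard library's in-memory SQLite for MySQL, so it needs no server. The column types and the sample row are assumptions for illustration; in the post the table itself is created through the Navicat GUI.

```python
import sqlite3

# Column names mirror the Item fields listed earlier in the post.
FIELDS = ["title", "time", "tags", "thumb_num",
          "collect_num", "comment_num", "article"]

# In-memory SQLite stands in for MySQL here; the column types are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles ("
    "title TEXT, time TEXT, tags TEXT, thumb_num INTEGER, "
    "collect_num INTEGER, comment_num INTEGER, article TEXT)"
)

# A made-up row shaped like one scraped item.
row = {"title": "談談程序員的離職和跳槽", "time": "2017/12/09",
       "tags": "職場,跳槽", "thumb_num": 1, "collect_num": 0,
       "comment_num": 6, "article": "..."}

# sqlite3 uses "?" placeholders where MySQLdb uses "%s".
placeholders = ", ".join("?" for _ in FIELDS)
sql = "INSERT INTO articles ({}) VALUES ({})".format(", ".join(FIELDS), placeholders)
conn.execute(sql, [row[name] for name in FIELDS])
conn.commit()

stored = conn.execute("SELECT title, comment_num FROM articles").fetchone()
print(stored)
```

Passing the values separately from the SQL text, rather than formatting them into the string, is the same idea the MySQL pipelines in this post rely on.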

Now that the table exists, it is time to write the pipeline code.

Import the required modules. The basic code:

    import MySQLdb
    import MySQLdb.cursors


    class MysqlPipeline(object):
        def __init__(self):
            # Arguments in order: host (127.0.0.1 is the local machine), user name
            # (root by default), the password you set, and the database name.
            self.conn = MySQLdb.connect("127.0.0.1", "root", "123456", "articles", charset="utf8", use_unicode=True)
            self.cursor = self.conn.cursor()

        def process_item(self, item, spider):
            insert_sql = """
                insert into articles(title, time, tags, thumb_num, collect_num, article, comment_num, img_urls)
                VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
            """
            self.cursor.execute(insert_sql, (
                item["title"], item["time"], item["tags"], item["thumb_num"],
                item["collect_num"],
                "\n".join(item["article"]),  # the spider stores the body as a list of paragraphs
                item["comment_num"],
                item.get("img_urls"),  # not filled by the spider in this post, so stored as NULL
            ))
            self.conn.commit()
            return item

Then this class also has to be added to ITEM_PIPELINES in settings.

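This can serve as a basic template: only the host and password at the top and the field section at the bottom need changing. It is written synchronously, which means scraping and storing run in lockstep and each insert blocks the crawl until it finishes. Scraping is usually faster than storing, so what then? The code has to be changed to store asynchronously. First add the database details to settings:

    MYSQL_HOST = "127.0.0.1"
    MYSQL_DBNAME = "articles"
    MYSQL_USER = "root"
    MYSQL_PASSWORD = "123456"

The code to add to the pipeline: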

    import MySQLdb
    import MySQLdb.cursors
    from twisted.enterprise import adbapi


    class MysqlTwistedPipline(object):
        def __init__(self, dbpool):
            self.dbpool = dbpool

        @classmethod
        def from_settings(cls, settings):
            dbparms = dict(
                host=settings["MYSQL_HOST"],
                db=settings["MYSQL_DBNAME"],
                user=settings["MYSQL_USER"],
                passwd=settings["MYSQL_PASSWORD"],
                charset="utf8",
                cursorclass=MySQLdb.cursors.DictCursor,
                use_unicode=True,
            )
            dbpool = adbapi.ConnectionPool("MySQLdb", **dbparms)
            return cls(dbpool)

        def process_item(self, item, spider):
            # Use Twisted to run the MySQL insert asynchronously.
            query = self.dbpool.runInteraction(self.do_insert, item)
            query.addErrback(self.handle_error, item, spider)  # handle errors
            return item

        def handle_error(self, failure, item, spider):
            # Report errors raised by the asynchronous insert.
            print(failure)

        def do_insert(self, cursor, item):
            # articlesss is the table name used here; change it to your own table.
            insert_sql = """
                insert into articlesss(title, time, tags, thumb_num, collect_num, article, comment_num, img_urls)
                values (%s, %s, %s, %s, %s, %s, %s, %s)
            """
            cursor.execute(insert_sql, (
                item["title"], item["time"], item["tags"], item["thumb_num"],
                item["collect_num"],
                "\n".join(item["article"]),  # the spider stores the body as a list of paragraphs
                item["comment_num"],
                item.get("img_urls"),  # not filled by the spider in this post, so stored as NULL
            ))

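There is no need to agonize over how this code works; I can't follow all of it either. It can be used as-is, as a kind of template: only the field section at the end needs changing. Finally, add this class to settings.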
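Registering the class means adding it to the ITEM_PIPELINES setting. A sketch of that fragment follows; the dotted path assumes the project is named bolearticles and the class lives in pipelines.py, and the number 300 is an arbitrary priority of my choosing.

```python
# settings.py (fragment)
# Keys are dotted paths to pipeline classes; values are priorities, where
# lower numbers run first (conventionally kept within 0-1000).
ITEM_PIPELINES = {
    "bolearticles.pipelines.MysqlTwistedPipline": 300,
}
```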

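All done. Run the spider and the rows show up in the database.

Only the asynchronous version is needed here; the synchronous one from before can be left out.

Next, how to save to MongoDB. This one is simpler. Search for the official Scrapy documentation.

Find the Item Pipeline page and open it; it contains a MongoDB example.

Copy that code straight over.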

    import pymongo


    class MongoPipeline(object):
        collection_name = "bole"  # change this to your own collection name

        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db

        @classmethod
        def from_crawler(cls, crawler):
            return cls(
                mongo_uri=crawler.settings.get("MONGO_URI"),
                mongo_db=crawler.settings.get("MONGO_DATABASE", "item"),  # change this default to your own
            )

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            self.db[self.collection_name].insert_one(dict(item))
            return item
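The pipeline reads MONGO_URI and MONGO_DATABASE from the settings, so those need to exist as well. The fragment below is only a sketch: the URI assumes a MongoDB server on the local machine with the default port, the database name is a placeholder of mine, and the dotted path assumes the bolearticles project with the class in pipelines.py.

```python
# settings.py (fragment)
# Both values are assumptions for a local MongoDB on the default port.
MONGO_URI = "mongodb://localhost:27017"
MONGO_DATABASE = "bole"

# Register the pipeline; 300 is an arbitrary priority.
ITEM_PIPELINES = {
    "bolearticles.pipelines.MongoPipeline": 300,
}
```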

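That wraps up this post. Next time I will keep using this example to show how to scrape images.

Things to watch out for in this project, which are also the places where it is easy to go wrong:

(1) Pagination. It is easy to be unsure which function the URL should be handed back to.

(2) Extracting the time. For Python's time format rules, see: datetime

(3) The regex extraction, and handling articles with or without comments.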
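For point (2), here is a small sketch of turning the cleaned date string into a real date with the standard datetime module. The format string matches the YYYY/MM/DD shape seen in the page markup; falling back to today's date on a parse failure is an assumption of this sketch, not something the spider above does.

```python
from datetime import datetime

# Date string in the format shown in the entry-meta markup (YYYY/MM/DD).
clean_time = "2017/12/09"

try:
    published = datetime.strptime(clean_time, "%Y/%m/%d").date()
except ValueError:
    # Assumed fallback for this sketch; pick whatever default suits your data.
    published = datetime.now().date()

print(published.isoformat())
```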

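I come from the humanities myself: a management major at a second-tier university in Shanghai, learning this purely out of interest.

"Being the very best is the daily standard I set for myself, wherever I go." - Draven

Learning is lonely, don't you think? Want some company? Learning together is not lonely, so go ahead and follow me. I will post updates from time to time. Hahaha. Let's keep each other company and enjoy the learning. Hahaha.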