Crawling Zhihu User Profiles with Scrapy and Storing Them in MongoDB

02-09


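Author: Cui Qingcai (崔慶才)

This section walks through a hands-on Scrapy project that crawls the profile information of Zhihu users.

Goals

This section covers two things:

Starting from one high-profile ("big V") user, recursively crawl follower lists and followee lists to collect the detailed information of Zhihu users.
Store the crawled results in MongoDB and deduplicate them.

Approach

Every user has a followee list and a follower list, and for big V users both lists are especially long.

If we start from one big V, we can first fetch that user's profile, then fetch the follower list and the followee list, and then visit every user in those lists to fetch their profile together with their own follower and followee lists. Repeating this recursively, one user leads to a hundred, a hundred to ten thousand, and ten thousand to a million: the social relationships naturally form a crawl network that reaches every user connected to the starting point. Users with zero followers and zero followees cannot be reached this way, so we simply ignore them.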
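
The recursive expansion described above amounts to a graph traversal with a "seen" set. The following is only a minimal sketch on a made-up in-memory graph (the names `GRAPH` and `crawl_order` are hypothetical and not part of the project); in the real spider, Scrapy's scheduler plays the role of the queue.

```python
from collections import deque

# Hypothetical toy social graph: each user maps to the users reachable
# through their follower and followee lists. Real data would come from the API.
GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a"],
    "d": ["b"],
    "e": [],  # zero followers and zero followees: never reached from "a"
}


def crawl_order(graph, start):
    """Return users in the order a breadth-first crawl would visit them."""
    seen = {start}          # users already scheduled, to avoid repeats
    queue = deque([start])  # users waiting to be visited
    order = []
    while queue:
        user = queue.popleft()
        order.append(user)
        for neighbour in graph.get(user, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return order


visited = crawl_order(GRAPH, "a")
print(visited)
```

The isolated user "e" never appears in the result, which matches the remark about users with no followers and no followees.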

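How do we get the data? By analyzing the requests that the Zhihu site makes, we can find the relevant API endpoints; requesting them returns user details as well as follower and followee lists.

Let's start crawling.

Requirements

Python3

This project uses Python 3. Make sure Python 3 is installed before you begin.

Scrapy

Scrapy is a powerful crawling framework. Install it with:

pip3 install scrapy

MongoDB

A non-relational database. Install MongoDB and start the service before you begin.

PyMongo

The MongoDB driver for Python. Install it with:

pip3 install pymongo

Creating the project

With the environment in place, we can start the project.

First, create a project from the command line:

scrapy startproject zhihuuser

Creating the spider

Next, create a spider, again from the command line, but this time from inside the project directory:

cd zhihuuser
scrapy genspider zhihu www.zhihu.com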

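Disabling ROBOTSTXT_OBEY

Next, open settings.py and set ROBOTSTXT_OBEY to False:

ROBOTSTXT_OBEY = False

The settings.py generated by scrapy startproject sets it to True, which means the spider obeys the rules in robots.txt. So what is robots.txt?

In plain terms, robots.txt is a file that follows the Robots protocol. It lives on the website's server and tells search-engine crawlers which directories of the site should not be crawled or indexed. When the setting is enabled, Scrapy requests the site's robots.txt right after starting and uses it to decide what it may crawl.

Of course, we are not building a search engine, and in some cases the content we want is exactly what robots.txt disallows. So sometimes we set this option to False and decline to follow the Robots protocol.

That is why it is set to False here. This particular crawl may not actually be restricted by robots.txt, but we usually disable the setting first.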
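
To make the role of robots.txt concrete, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content and the example.com URLs are made up for illustration; they are not Zhihu's actual rules, and this is not how Scrapy implements the check internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # feed the rules in directly, no network access

# Ask whether a crawler with any user agent may fetch each URL.
allowed_public = parser.can_fetch("*", "https://example.com/public/page")
allowed_private = parser.can_fetch("*", "https://example.com/private/page")
print(allowed_public, allowed_private)
```

A crawler that obeys the file would skip the second URL; setting ROBOTSTXT_OBEY to False means no such check is applied.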

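A first crawl attempt

Without changing any code, run the crawl:

scrapy crawl zhihu

The crawl fails with this error:

500 Internal Server Error

Zhihu returned status code 500, so the crawl did not succeed. The reason is that we sent no request headers: Zhihu inspects the User-Agent, sees that it is not a browser, and returns an error response.

So the next step is to add request headers. You can pass them as an argument to Request, or set them in the spider's custom_settings, but the simplest way is to put them in the global settings.

Open settings.py, uncomment DEFAULT_REQUEST_HEADERS, and add the following:

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}

This sets default headers for your requests: any request that does not specify its own headers uses these. With the User-Agent added, our spider can pass itself off as a browser.

Run the spider again:

scrapy crawl zhihu

This time the response status code is normal.

With that solved, we can analyze the page logic and implement the spider properly.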

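Crawl flow

First we need to find the API endpoints for user details and for the followee list.

Go back to the web page, open the browser's developer tools, and switch to the Network tab.

First we pick a big V. We'll use 輪子哥 (vczh) as the example; his profile page is: vczh - 知乎

Open vczh's profile page first.

Here we can see his basic information, which is what we want to crawl: name, headline, occupation, number of followees, number of upvotes, and so on.

Next, find the followee-list endpoint. Click the following tab, scroll down, and click through the pages. Among the requests, Ajax requests whose names begin with followees appear. This is the endpoint for the followee list.

Let's look at the structure of this request.

It is a GET request to https://www.zhihu.com/api/v4/members/excited-vczh/followees, followed by three parameters: include, offset, and limit.

On inspection, include is a query parameter listing the basic information to return for each followed user, such as answer count and article count.

offset is the paging offset. We are looking at page 3 of the followee list, and offset is currently 40.

limit is the number of items per page, 20 here. Combined with the offset above, we can infer that offset 0 returns the first page of the followee list, offset 20 the second page, and so on.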
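
The relationship between page number, offset, and limit inferred above can be written as a one-line formula. This is a small sketch with a hypothetical helper name (`offset_for_page`), assuming pages are numbered from 1 as in the text.

```python
def offset_for_page(page, limit=20):
    """Return the offset of a 1-based page number for a given page size."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return (page - 1) * limit


# Pages 1, 2 and 3 with 20 items per page.
offsets = [offset_for_page(p) for p in (1, 2, 3)]
print(offsets)  # [0, 20, 40]
```

Page 3 maps to offset 40, which agrees with the request observed in the browser.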

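Now look at the response:

It has two fields, data and paging. data holds the data: 20 entries, each the basic information of one user in the followee list.

paging has several fields of its own: is_end indicates whether paging has finished, and next is the link to the next page. So when handling pagination, we first check is_end to see whether paging has finished, and if it has not, take the next link and request the next page.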
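
The paging decision described above can be isolated into a small pure function. This is a sketch only: the helper name `next_page_url` and the sample payloads (including the example.com link) are made up, and only the field names data, paging, is_end, and next come from the response described in the text.

```python
def next_page_url(payload):
    """Return the URL of the next page, or None when paging has finished."""
    paging = payload.get("paging") or {}
    # Only follow the link when the API explicitly says there are more pages.
    if paging.get("is_end") is False:
        return paging.get("next")
    return None


# Hypothetical payloads shaped like the response described above.
middle_page = {"data": [], "paging": {"is_end": False, "next": "https://example.com/page/4"}}
last_page = {"data": [], "paging": {"is_end": True, "next": "https://example.com/page/5"}}
no_paging = {"data": []}

print(next_page_url(middle_page))
print(next_page_url(last_page))
print(next_page_url(no_paging))
```

Returning None for a missing paging field keeps the caller from requesting a page that does not exist.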

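That gives us the followee list through the API.

Next, find the user-detail endpoint. Hover the mouse over any avatar in the followee list and watch the network requests: another Ajax request appears.

This time the request URL is https://www.zhihu.com/api/v4/members/lu-jun-ya-1

It is again followed by an include parameter. As with the previous endpoint, include lists the fields to query, but this time the list is very complete and can retrieve almost every detail. The last segment of the URL looks like a username; it is actually the url_token, as in the previous endpoint, and it can be obtained from the returned data.

To summarize:

To get a user's followee list, request an endpoint of the form https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&offset={offset}&limit={limit}, where user is the user's url_token, include is a fixed query parameter, offset is the paging offset, and limit is the number of items per page.
To get a user's details, request an endpoint of the form https://www.zhihu.com/api/v4/members/{user}?include={include}, where user is the user's url_token and include is the query parameter.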
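
The two endpoint templates summarized above can be filled in with `str.format`. The sketch below uses shortened include values chosen only for readability (the constant names `USER_URL` and `FOLLOWS_URL` are also just for this example), and it parses the result back with `urllib.parse` to check the query parameters.

```python
from urllib.parse import parse_qs, urlsplit

# URL templates taken from the summary above; placeholders are filled per request.
USER_URL = "https://www.zhihu.com/api/v4/members/{user}?include={include}"
FOLLOWS_URL = (
    "https://www.zhihu.com/api/v4/members/{user}/followees"
    "?include={include}&offset={offset}&limit={limit}"
)

# Shortened include values, for illustration only.
user_url = USER_URL.format(user="excited-vczh", include="gender,follower_count")
follows_url = FOLLOWS_URL.format(
    user="excited-vczh", include="data[*].gender", offset=0, limit=20
)

# Parse the query string back out to confirm the parameters landed correctly.
params = parse_qs(urlsplit(follows_url).query)
print(user_url)
print(follows_url)
print(params)
```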

With the endpoint logic clear, we can start building requests.

Generating the first requests

The first step is to request vczh's basic information and then his followee list. We build format-string URLs with the variable parts pulled out, and override start_requests to generate the first requests. After that, we need to process the followee list that comes back.

import json

from scrapy import Spider, Request

from zhihuuser.items import UserItem


class ZhihuSpider(Spider):
    name = "zhihu"
    allowed_domains = ["www.zhihu.com"]
    # URL templates; the placeholders are filled in for each request
    user_url = 'https://www.zhihu.com/api/v4/members/{user}?include={include}'
    follows_url = 'https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&offset={offset}&limit={limit}'
    start_user = 'excited-vczh'
    user_query = 'locations,employments,gender,educations,business,voteup_count,thanked_Count,follower_count,following_count,cover_url,following_topic_count,following_question_count,following_favlists_count,following_columns_count,answer_count,articles_count,pins_count,question_count,commercial_question_count,favorite_count,favorited_count,logs_count,marked_answers_count,marked_answers_text,message_thread_token,account_status,is_active,is_force_renamed,is_bind_sina,sina_weibo_url,sina_weibo_name,show_sina_weibo,is_blocking,is_blocked,is_following,is_followed,mutual_followees_count,vote_to_count,vote_from_count,thank_to_count,thank_from_count,thanked_count,description,hosted_live_count,participated_live_count,allow_message,industry_category,org_name,org_homepage,badge[?(type=best_answerer)].topics'
    follows_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'

    def start_requests(self):
        # Request the start user's details and the first page of the followee list
        yield Request(self.user_url.format(user=self.start_user, include=self.user_query), self.parse_user)
        yield Request(self.follows_url.format(user=self.start_user, include=self.follows_query, limit=20, offset=0),
                      self.parse_follows)

Then we implement the two parse methods, parse_user and parse_follows.

    def parse_user(self, response):
        print(response.text)

    def parse_follows(self, response):
        print(response.text)

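For now they simply print the response. Run the spider and look at the result:

scrapy crawl zhihu

This time you will see:

401 HTTP status code is not handled or not allowed

Access was denied. Comparing with the request the browser sends, we find that the browser's request carries an extra OAuth request header.

OAuth

OAuth is short for Open Authorization.

OAuth token: the "token" obtained in the final step of the OAuth flow. With this token you can request, from the site that holds the resources, any resource that the token is permitted to access.

I am not logged in to Zhihu here, and the OAuth value is

oauth c3cef7c66a1843f8b3a9e6a1e3160e20

In my long-term observation this value has never changed, so it can be reused. We add it to DEFAULT_REQUEST_HEADERS, which becomes:

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20',
}

Run the spider again and it now crawls normally.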

parse_user

Next we handle the user's basic information. First look at what data the endpoint returns.

The response is very complete, so we simply declare an Item that stores all of it.

Declare a new UserItem in items.py:

from scrapy import Item, Field


class UserItem(Item):
    # define the fields for your item here like:
    id = Field()
    name = Field()
    avatar_url = Field()
    headline = Field()
    description = Field()
    url = Field()
    url_token = Field()
    gender = Field()
    cover_url = Field()
    type = Field()
    badge = Field()

    answer_count = Field()
    articles_count = Field()
    commercial_question_count = Field()
    favorite_count = Field()
    favorited_count = Field()
    follower_count = Field()
    following_columns_count = Field()
    following_count = Field()
    pins_count = Field()
    question_count = Field()
    thank_from_count = Field()
    thank_to_count = Field()
    thanked_count = Field()
    vote_from_count = Field()
    vote_to_count = Field()
    voteup_count = Field()
    following_favlists_count = Field()
    following_question_count = Field()
    following_topic_count = Field()
    marked_answers_count = Field()
    mutual_followees_count = Field()
    hosted_live_count = Field()
    participated_live_count = Field()

    locations = Field()
    educations = Field()
    employments = Field()

In the parse method we parse the response body into a JSON object, then check each declared field in turn and assign it if it is present in the result:

result = json.loads(response.text)
item = UserItem()
# Copy only the fields that the item declares and the response contains
for field in item.fields:
    if field in result.keys():
        item[field] = result.get(field)
yield item

Once we have the item, we return it with yield.
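
The field-copy loop above can be tried without Scrapy by using a plain dict in place of UserItem. This is only a sketch: `FIELDS` is a hypothetical three-field subset, `pick_fields` is a made-up helper name, and the sample response is invented.

```python
# Hypothetical subset of the UserItem fields, for illustration only.
FIELDS = ("name", "url_token", "gender")


def pick_fields(result, fields=FIELDS):
    """Keep only the declared fields that are present in the API result."""
    return {field: result[field] for field in fields if field in result}


# Invented sample: one declared field is missing, one extra key is present.
sample = {"name": "vczh", "url_token": "excited-vczh", "unused": 1}
item = pick_fields(sample)
print(item)
```

Keys the item does not declare are dropped, and declared fields missing from the response are simply left unset, which is the same behavior as the loop in parse_user.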

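That completes saving the user's basic information.

We also need to fetch this user's followee list here, so we issue another request for the followee list.

Add the following at the end of parse_user:

yield Request(
    self.follows_url.format(user=result.get('url_token'), include=self.follows_query, limit=20, offset=0),
    self.parse_follows)

This generates the request for that user's followee list.

parse_follows

Next we handle the followee list. Again we parse the response text, and then do two things:

For each user in the followee list, issue a request for that user's details.
Handle pagination: inspect paging and fetch the next page of the followee list.

So parse_follows is rewritten as follows:

results = json.loads(response.text)

# Request the details of every user on this page
if 'data' in results.keys():
    for result in results.get('data'):
        yield Request(self.user_url.format(user=result.get('url_token'), include=self.user_query),
                      self.parse_user)

# Follow the next page unless paging has finished
if 'paging' in results.keys() and results.get('paging').get('is_end') == False:
    next_page = results.get('paging').get('next')
    yield Request(next_page, self.parse_follows)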

The complete code so far:

# -*- coding: utf-8 -*-
import json

from scrapy import Spider, Request

from zhihuuser.items import UserItem


class ZhihuSpider(Spider):
    name = "zhihu"
    allowed_domains = ["www.zhihu.com"]
    user_url = 'https://www.zhihu.com/api/v4/members/{user}?include={include}'
    follows_url = 'https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&offset={offset}&limit={limit}'
    start_user = 'excited-vczh'
    user_query = 'locations,employments,gender,educations,business,voteup_count,thanked_Count,follower_count,following_count,cover_url,following_topic_count,following_question_count,following_favlists_count,following_columns_count,answer_count,articles_count,pins_count,question_count,commercial_question_count,favorite_count,favorited_count,logs_count,marked_answers_count,marked_answers_text,message_thread_token,account_status,is_active,is_force_renamed,is_bind_sina,sina_weibo_url,sina_weibo_name,show_sina_weibo,is_blocking,is_blocked,is_following,is_followed,mutual_followees_count,vote_to_count,vote_from_count,thank_to_count,thank_from_count,thanked_count,description,hosted_live_count,participated_live_count,allow_message,industry_category,org_name,org_homepage,badge[?(type=best_answerer)].topics'
    follows_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'

    def start_requests(self):
        yield Request(self.user_url.format(user=self.start_user, include=self.user_query), self.parse_user)
        yield Request(self.follows_url.format(user=self.start_user, include=self.follows_query, limit=20, offset=0),
                      self.parse_follows)

    def parse_user(self, response):
        result = json.loads(response.text)
        item = UserItem()
        for field in item.fields:
            if field in result.keys():
                item[field] = result.get(field)
        yield item

        # Recurse into this user's followee list
        yield Request(
            self.follows_url.format(user=result.get('url_token'), include=self.follows_query, limit=20, offset=0),
            self.parse_follows)

    def parse_follows(self, response):
        results = json.loads(response.text)

        if 'data' in results.keys():
            for result in results.get('data'):
                yield Request(self.user_url.format(user=result.get('url_token'), include=self.user_query),
                              self.parse_user)

        if 'paging' in results.keys() and results.get('paging').get('is_end') == False:
            next_page = results.get('paging').get('next')
            yield Request(next_page, self.parse_follows)

With this, we fetch a user's basic information and then recursively fetch followee lists to issue further requests.

Run the spider again and you can see that it now crawls recursively in a loop.

followers

So far the crawl loop is driven by followee lists; follower lists are just as necessary. Analysis shows that the follower-list API is almost identical, with followees replaced by followers in the URL and everything else the same, so we add the followers logic in the same way.

The final spider code:

# -*- coding: utf-8 -*-
import json

from scrapy import Spider, Request

from zhihuuser.items import UserItem


class ZhihuSpider(Spider):
    name = "zhihu"
    allowed_domains = ["www.zhihu.com"]
    user_url = 'https://www.zhihu.com/api/v4/members/{user}?include={include}'
    follows_url = 'https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&offset={offset}&limit={limit}'
    followers_url = 'https://www.zhihu.com/api/v4/members/{user}/followers?include={include}&offset={offset}&limit={limit}'
    start_user = 'excited-vczh'
    user_query = 'locations,employments,gender,educations,business,voteup_count,thanked_Count,follower_count,following_count,cover_url,following_topic_count,following_question_count,following_favlists_count,following_columns_count,answer_count,articles_count,pins_count,question_count,commercial_question_count,favorite_count,favorited_count,logs_count,marked_answers_count,marked_answers_text,message_thread_token,account_status,is_active,is_force_renamed,is_bind_sina,sina_weibo_url,sina_weibo_name,show_sina_weibo,is_blocking,is_blocked,is_following,is_followed,mutual_followees_count,vote_to_count,vote_from_count,thank_to_count,thank_from_count,thanked_count,description,hosted_live_count,participated_live_count,allow_message,industry_category,org_name,org_homepage,badge[?(type=best_answerer)].topics'
    follows_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'
    followers_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'

    def start_requests(self):
        yield Request(self.user_url.format(user=self.start_user, include=self.user_query), self.parse_user)
        yield Request(self.follows_url.format(user=self.start_user, include=self.follows_query, limit=20, offset=0),
                      self.parse_follows)
        yield Request(self.followers_url.format(user=self.start_user, include=self.followers_query, limit=20, offset=0),
                      self.parse_followers)

    def parse_user(self, response):
        result = json.loads(response.text)
        item = UserItem()
        for field in item.fields:
            if field in result.keys():
                item[field] = result.get(field)
        yield item

        # Recurse into both the followee list and the follower list
        yield Request(
            self.follows_url.format(user=result.get('url_token'), include=self.follows_query, limit=20, offset=0),
            self.parse_follows)
        yield Request(
            self.followers_url.format(user=result.get('url_token'), include=self.followers_query, limit=20, offset=0),
            self.parse_followers)

    def parse_follows(self, response):
        results = json.loads(response.text)

        if 'data' in results.keys():
            for result in results.get('data'):
                yield Request(self.user_url.format(user=result.get('url_token'), include=self.user_query),
                              self.parse_user)

        if 'paging' in results.keys() and results.get('paging').get('is_end') == False:
            next_page = results.get('paging').get('next')
            yield Request(next_page, self.parse_follows)

    def parse_followers(self, response):
        results = json.loads(response.text)

        if 'data' in results.keys():
            for result in results.get('data'):
                yield Request(self.user_url.format(user=result.get('url_token'), include=self.user_query),
                              self.parse_user)

        if 'paging' in results.keys() and results.get('paging').get('is_end') == False:
            next_page = results.get('paging').get('next')
            yield Request(next_page, self.parse_followers)

The places that change are:

start_requests additionally yields the followers request.
parse_user additionally yields the followers request.
parse_followers issues the corresponding detail requests and handles pagination.

With that, the spider is complete: it crawls recursively through the social network and collects the details of every user it reaches.

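Summary

The spider above implements the following logic:

start_requests issues the detail request for the first big V user, along with the requests for his follower and followee lists.
parse_user extracts the user details and requests the follower and followee lists.
parse_follows requests each user in a followee list and handles pagination.
parse_followers requests each user in a follower list and handles pagination.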

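Adding a pipeline

We store the data in MongoDB, so we need an Item Pipeline, implemented as follows:

import pymongo


class MongoPipeline(object):
    collection_name = 'users'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Upsert keyed on url_token: update the existing document or insert a new one
        self.db[self.collection_name].update_one({'url_token': item['url_token']}, {'$set': dict(item)}, upsert=True)
        return item

The key point is process_item, which calls update_one. The first argument is the query filter, here the url_token. The second argument is the update document, which sets the fields from our item as a dict. The upsert=True argument ensures that the document is updated if a match exists and inserted if it does not, which is what provides deduplication. The pipeline reads MONGO_URI and MONGO_DATABASE from the settings, so both must be defined in settings.py.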
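
To see why an upsert keyed on url_token deduplicates, here is an in-memory stand-in that mimics the update-or-insert semantics with a plain dict. It does not use MongoDB or PyMongo at all; the helper name `upsert` and the follower counts are made up for illustration.

```python
def upsert(collection, item, key="url_token"):
    """Mimic upsert semantics: merge into the existing record or create one."""
    # setdefault returns the existing record for this key, or stores a new empty one
    existing = collection.setdefault(item[key], {})
    existing.update(item)


# A dict keyed by url_token stands in for the MongoDB collection.
users = {}
upsert(users, {"url_token": "excited-vczh", "follower_count": 1})
upsert(users, {"url_token": "excited-vczh", "follower_count": 2})  # same user seen again
upsert(users, {"url_token": "lu-jun-ya-1", "follower_count": 3})

print(len(users))
print(users["excited-vczh"]["follower_count"])
```

The same user crawled twice produces one record holding the latest values, which is the behavior the pipeline relies on when a user is reached through several different lists.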

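Also remember to enable the Item Pipeline in settings.py:

ITEM_PIPELINES = {
    'zhihuuser.pipelines.MongoPipeline': 300,
}

Then run the spider again:

scrapy crawl zhihu

Now you will see normal output. The spider keeps running, and users are saved to the database one by one.

Take a look in MongoDB at the user details we crawled.

At this point the spider is essentially complete. We implemented the crawl logic recursively, and the storage step deduplicates results by upserting on url_token.

Going faster

Right now we are running a single-machine spider, and the speed of one computer is limited. To raise crawl throughput later we need a distributed crawler, which uses Redis to maintain a shared crawl queue.

For more on implementing distributed crawlers, see 自己動手，豐衣足食！Python3網路爬蟲實戰案例.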
