A Brief Look at Scrapy's Cookie Handling
First, to settle a common doubt: Scrapy manages cookies automatically, just like a browser does. From the official FAQ:

> Does Scrapy manage cookies automatically?
>
> Yes, Scrapy receives and keeps track of cookies sent by servers, and sends them back on subsequent requests, like any regular web browser does.
Cookie management is handled by CookiesMiddleware, one of the downloader middlewares; every request and response passes through it.
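The middleware is enabled by default. Two Scrapy settings control it; a minimal `settings.py` sketch (the second setting is what produces the debug log shown later in this post):

```python
# settings.py
COOKIES_ENABLED = True   # default: True; set to False to bypass CookiesMiddleware entirely
COOKIES_DEBUG = True     # default: False; log every Cookie / Set-Cookie header exchanged
```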
Let's start with the request-processing side. The code:
```python
from collections import defaultdict

from scrapy.http.cookies import CookieJar


class CookiesMiddleware(object):
    """This middleware enables working with sites that need cookies"""

    def __init__(self, debug=False):
        # a dict holding one cookie jar per key
        self.jars = defaultdict(CookieJar)
        self.debug = debug

    def process_request(self, request, spider):
        if request.meta.get('dont_merge_cookies', False):
            return

        # each jar is looked up by the "cookiejar" key in request.meta
        cookiejarkey = request.meta.get("cookiejar")
        jar = self.jars[cookiejarkey]
        cookies = self._get_request_cookies(jar, request)
        # store the request's own cookies into the jar
        for cookie in cookies:
            jar.set_cookie_if_ok(cookie, request)

        # set Cookie header: drop any existing cookies,
        # then let the jar add the ones that apply
        request.headers.pop('Cookie', None)
        jar.add_cookie_header(request)
        self._debug_cookie(request, spider)
```
The flow:
- Initialize a dict of cookie jars
- Look up the cookie jar that each request's meta specifies
- Add the request's own cookies to that jar, subject to the cookie policy
- Finally, add the applicable cookies from the jar to the request's Cookie header
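The jar-per-key bookkeeping in `__init__` can be illustrated with the standard library's `http.cookiejar` (which Scrapy's own `CookieJar` builds on); the session keys here are made up for illustration:

```python
from collections import defaultdict
from http.cookiejar import CookieJar

# One jar per "cookiejar" key, mirroring the middleware's self.jars.
# A missing key (None for ordinary requests) creates a fresh jar on first use.
jars = defaultdict(CookieJar)

jar_a = jars["session-a"]  # hypothetical session keys
jar_b = jars["session-b"]
assert jar_a is not jar_b          # distinct keys -> isolated cookie sessions
assert jars["session-a"] is jar_a  # the same key always maps to the same jar
assert jars[None] is jars[None]    # requests without a key share one default jar
```

This is why setting a different `cookiejar` value in `request.meta` gives you independent logged-in sessions within one spider.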
Next, let's see how cookies in the response are handled:
```python
    def process_response(self, request, response, spider):
        if request.meta.get('dont_merge_cookies', False):
            return response

        # extract cookies from Set-Cookie and drop invalid/expired cookies
        cookiejarkey = request.meta.get("cookiejar")
        jar = self.jars[cookiejarkey]
        jar.extract_cookies(response, request)
        self._debug_set_cookie(response, spider)

        return response
```
The flow:
- Look up the cookie jar for this request in the jar dict.
- Call extract_cookies to add the cookies from the response headers to the jar.
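The full round trip — `extract_cookies` pulling Set-Cookie headers into the jar, then `add_cookie_header` replaying them on the next request — can be sketched with the standard library's `http.cookiejar`, which exposes the same two methods the middleware calls. The `FakeResponse` stub, cookie value, and example.com URLs are made up for illustration:

```python
import urllib.request
from email.message import Message
from http.cookiejar import CookieJar


class FakeResponse:
    """Minimal response stub: extract_cookies() only needs info() -> headers."""
    def __init__(self, headers):
        self._headers = headers

    def info(self):
        return self._headers


headers = Message()
headers["Set-Cookie"] = "sessionid=abc123; Path=/"

jar = CookieJar()
request = urllib.request.Request("http://example.com/")
# Pull cookies out of the (fake) response, as process_response does.
jar.extract_cookies(FakeResponse(headers), request)
assert [c.name for c in jar] == ["sessionid"]

# On the next request to the same site, the jar fills in the Cookie header,
# as process_request does via jar.add_cookie_header(request).
next_request = urllib.request.Request("http://example.com/profile")
jar.add_cookie_header(next_request)
assert next_request.get_header("Cookie") == "sessionid=abc123"
```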
A hands-on demonstration
Logging in to Zhihu in five lines of code. The log:
```
2017-03-16 14:12:32 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <200 https://www.zhihu.com>
Set-Cookie: aliyungf_tc=AQAAAORPOnVLugoAa9ZkyvKxdjghakUA; Path=/; HttpOnly
Set-Cookie: q_c1=63b9dcc3f102407cbbf376bbd38824dc|1489644752000|14896447521100; Domain=zhihu.com; expires=Sun, 15 Mar 2020 06:12:32 GMT; Path=/
Set-Cookie: nweb_qa=heifetz; Domain=zhihu.com; expires=Sat, 15 Apr 2017 06:12:32 GMT; Path=/
Set-Cookie: _xsrf=fa1b4a61943eddsbf16c1489392989e60dd; Path=/
Set-Cookie: r_cap_id="YjRhNGFiNDYwMfzYwNDg2YTg4ZGViZjRkZWExMjQ5OWY=|1489644752|dba3443a0b52e3ba046d5cc00eac73b9d89dacaf"; Domain=zhihu.com; expires=Sat, 15 Apr 2017 06:12:32 GMT; Path=/
Set-Cookie: cap_id="MjEwYjRmOTlkYjBiNGVkMmIxNjA5YjdhNzcwYjM3NmI=|1489644752|a9cf208351118a31b7371c9e054756cdf77a835a"; Domain=zhihu.com; expires=Sat, 15 Apr 2017 06:12:32 GMT; Path=/
Set-Cookie: l_cap_id="MWRiMzg2NjUxMGYyNGJhZTliMWJhODY3MjM1ZjhjNGQ=|1489644752|ab04f14607a253b34af62403639d9eaba719909b"; Domain=zhihu.com; expires=Sat, 15 Apr 2017 06:12:32 GMT; Path=/
2017-03-16 14:12:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zhihu.com> (referer: None)
2017-03-16 14:12:32 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <POST https://www.zhihu.com/login/phone_num>
Cookie: aliyungf_tc=AQAAAORPOnVLugoAa9ZkyvKxdjghakUA; _xsrf=fa1b4a61943ebf16c1489392989e60dd; r_cap_id="YjRhNGFiNDYwMzYwNDg2YTg4ZGViZjRkZWExMjQ5OWY=|1489644752|dba3443a0b52e3ba046d5cc00eac73b9d89dacaf"; nweb_qa=heifetz; q_c1=63b9dcc3f102407cbbf376bbd38824dc|1489644752000|1489644752000; cap_id="MjEwYjRmOTlkYjBiNGVkMmIxNjA5YjdhNzcwYjM3NmI=|1489644752|a9cf208351118a31b7371c9e054756cdf77a835a"; l_cap_id="MWRiMzg2NjUxMGYyNGJhZTliMWJhODY3MjM1ZjhjNGQ=|1489644752|ab04f14607a253b34af62403639d9eaba719909b"
2017-03-16 14:12:33 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <200 https://www.zhihu.com/login/phone_num>
Set-Cookie: l_n_c=; Domain=zhihu.com; expires=Wed, 16 Mar 2016 06:12:33 GMT; Path=/
Set-Cookie: z_c0="QUJDQXdQeUE4Zsd2tYQUFBQVlRSlZUZEc1OFZoNFR4TVp2TGEteU8xQVQxZHFxaWRDczlzNXJRPT0=|1489644753|997e58a03d0acc60016564d03ee3e4f8fe748495"; Domain=zhihu.com; expires=Sat, 15 Apr 2017 06:12:33 GMT; httponly; Path=/
Set-Cookie: nweb_qa=heifetz; Domain=zhihu.com; expires=Sat, 15 Apr 2017 06:12:33 GMT; Path=/
Set-Cookie: _xsrf=; Domain=zhihu.com; expires=Wed, 16 Mar 2016 06:12:33 GMT; Path=/
Set-Cookie: l_cap_id=; Domain=zhihu.com; expires=Wed, 16 Mar 2016 06:12:33 GMT; Path=/
Set-Cookie: n_c=; Domain=zhihu.com; expires=Wed, 16 Mar 2016 06:12:33 GMT; Path=/
2017-03-16 14:12:33 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.zhihu.com/login/phone_num> (referer: https://www.zhihu.com)
{r: 0, msg: 登錄成功}
```
As the log shows, the cookies set while requesting the Zhihu home page were added to the cookie jar, and after logging in our session cookies were preserved as well (the final response body, {r: 0, msg: 登錄成功}, means "login successful").
Summary:
Scrapy has solid built-in support for cookie management. In everyday use you don't need to worry much about cookies at all, just as you never think about how cookies are set when you browse with a regular browser.
Zhihu login code: simple_spdier