[Notes] Scrapy simulated login: copied cookies not taking effect

Problem

I took over a scraping task that required being logged in. Using Scrapy, I copied the browser's Request Headers into the settings.py file, but the responses always came back as not logged in.

Solution

A cookie in the browser usually looks like this:

Cookie:

aliyungf_tc=AQAAAMeFWkXOsQ4ABmMhfZulWGtZkfPs; _xsrf=48aa43fd-bdd2-4b63-afa3-7cc8f2e3abca; q_c1=0e8f6896778146f587caeb0b227a3b76|1516199621000|1516199621000; capsion_ticket="2|1:0|10:1516199621|14:capsion_ticket|44:OTQ1ZGQ3OTI2NmFlNGJjYzlhZjFmMGZkZGNlNGZiOTY=|a13a774404b025674a34fbcfcb907e2e5145e686eb0273553830253301a53fa2"; _zap=20e42cbd-1d16-44eb-b1be-533400a082f6

Copied into settings.py, it looked like this:

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
    'Cookie': 'aliyungf_tc=AQAAAMeFWkXOsQ4ABmMhfZulWGtZkfPs; _xsrf=48aa43fd-bdd2-4b63-afa3-7cc8f2e3abca; q_c1=0e8f6896778146f587caeb0b227a3b76|1516199621000|1516199621000; capsion_ticket="2|1:0|10:1516199621|14:capsion_ticket|44:OTQ1ZGQ3OTI2NmFlNGJjYzlhZjFmMGZkZGNlNGZiOTY=|a13a774404b025674a34fbcfcb907e2e5145e686eb0273553830253301a53fa2"; _zap=20e42cbd-1d16-44eb-b1be-533400a082f6',
}
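A likely reason the header approach fails: when cookie handling is enabled (the default), Scrapy's built-in CookiesMiddleware manages the Cookie header itself and a hand-written one in DEFAULT_REQUEST_HEADERS gets overridden. A sketch of the settings change if you really do want to send the raw header (this is my assumption about the setup, and it trades away Scrapy's automatic cookie tracking):

```python
# settings.py -- sketch: only if you insist on a hand-copied Cookie header.
# With COOKIES_ENABLED at its default (True), CookiesMiddleware manages the
# Cookie header itself, so the one below would not reach the server as-is.
COOKIES_ENABLED = False

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 ...',        # your real UA string here
    'Cookie': 'aliyungf_tc=...; _xsrf=...', # raw header copied from the browser
}
```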

But the example in the official documentation looks like this:

cookies (dict or list) –

the request cookies. These can be sent in two forms.

Using a dict:

request_with_cookies = Request(url="http://www.example.com",
                               cookies={'currency': 'USD', 'country': 'UY'})

Using a list of dicts:

request_with_cookies = Request(url="http://www.example.com",
                               cookies=[{'name': 'currency',
                                         'value': 'USD',
                                         'domain': 'example.com',
                                         'path': '/currency'}])

The latter form allows for customizing the domain and path attributes of the cookie. This is only useful if the cookies are saved for later requests.

When some site returns cookies (in a response) those are stored in the cookies for that domain and will be sent again in future requests. That's the typical behaviour of any regular web browser. However, if, for some reason, you want to avoid merging with existing cookies you can instruct Scrapy to do so by setting the dont_merge_cookies key to True in the Request.meta.

Example of request without merging cookies:

request_with_cookies = Request(url="http://www.example.com",
                               cookies={'currency': 'USD', 'country': 'UY'},
                               meta={'dont_merge_cookies': True})

For more info see CookiesMiddleware.

So the officially recommended way is not what I did above: cookies are passed in as key-value pairs. The fix is simply to override start_requests and pass the cookies argument directly.

That is:

cookies = {
    'aliyungf_tc': 'AQAAAMeFWkXOsQ4ABmMhfZulWGtZkfPs',
    '_xsrf': '48aa43fd-bdd2-4b63-afa3-7cc8f2e3abca',
    'q_c1': '0e8f6896778146f587caeb0b227a3b76|1516199621000|1516199621000',
    'capsion_ticket': '"2|1:0|10:1516199621|14:capsion_ticket|44:OTQ1ZGQ3OTI2NmFlNGJjYzlhZjFmMGZkZGNlNGZiOTY=|a13a774404b025674a34fbcfcb907e2e5145e686eb0273553830253301a53fa2"',
    '_zap': '20e42cbd-1d16-44eb-b1be-533400a082f6',
}

def start_requests(self):
    # url must be a string, not a list
    url = 'https://www.zhihu.com/people/cuishite/activities'
    yield scrapy.Request(url, cookies=self.cookies)
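Hand-transcribing every pair from the browser is tedious and error-prone. A small helper (my own sketch, not part of Scrapy) can split the raw Cookie header into the dict that scrapy.Request expects. Note it splits each pair on the first '=' only, because values such as capsion_ticket contain '=' themselves:

```python
def cookie_string_to_dict(raw_cookie: str) -> dict:
    """Turn a raw browser Cookie header into a {name: value} dict.

    Splits on '; ' between pairs and on the FIRST '=' within each pair,
    since cookie values (e.g. base64 tickets) may themselves contain '='.
    """
    cookies = {}
    for pair in raw_cookie.split('; '):
        name, _, value = pair.partition('=')
        cookies[name.strip()] = value
    return cookies


raw = '_xsrf=48aa43fd-bdd2-4b63-afa3-7cc8f2e3abca; capsion_ticket="abc=|def"'
print(cookie_string_to_dict(raw))
# {'_xsrf': '48aa43fd-bdd2-4b63-afa3-7cc8f2e3abca', 'capsion_ticket': '"abc=|def"'}
```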

Of course, another option is to simulate the actual login flow, but unless you have multiple accounts to juggle at once, I don't recommend it.

Usage with Requests

With the requests library, by contrast, both approaches work. For example:

import requests

url = "https://www.zhihu.com/people/cuishite/activities"

headers = {
    'accept-encoding': "gzip, deflate, br",
    'accept-language': "zh-CN,zh;q=0.9,en;q=0.8",
    'upgrade-insecure-requests': "1",
    'user-agent': "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36",
    'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    'cache-control': "no-cache",
    'cookie': 'd_c0="AJCCdkCtTAuPTlyOq8KViXPcE2lzDxFr660=|1486904251"; _zap=4bb82da4-e1a4-4567-b776-680b773be441; OUTFOX_SEARCH_USER_ID_NCOO=522960375.05909324; _ga=GA1.2.314890948.1495081760; q_c1=d402a896494040fcaff27f29a58e3cec|1507645799000|1486904252000; __utmv=51854390.100-1|2=registration_date=20151010=1^3=entry_date=20151010=1; _xsrf=ae3399dce59c1cfc914e2eca9fa61864; r_cap_id="NjZkZmIyM2U5ZjQwNGRkMDgwOGE3MjM2M2Q1ZDkxMDk=|1514708518|49bee39e58fbf764e17202ca1fc640e3a04e566b"; cap_id="YmYzZTVhYTJmMDk5NGEzM2E3NGI2NDdhYzczN2I0NzQ=|1514708518|1753b8dace7c55b8f42efbbd46c10c68ad576554"; l_cap_id="NTY2ODk5MjY5OGZjNDc5YjgxMGMwMTY3NDVkZTM0NWE=|1514708518|79cc996555d5d795362e65b72212c33811799746"; capsion_ticket="2|1:0|10:1514708520|14:capsion_ticket|44:NWVhM2JmYTVkYTM0NDI4NmFjMjMyMWMzNjM3MTZjY2U=|56fa4dd544de498a4d47428a76f90cfc405fcf91f2ab822d246636490a0e951d"; z_c0="2|1:0|10:1514708521|4:z_c0|92:Mi4xUUhNc0FnQUFBQUFBa0lKMlFLMU1DeVlBQUFCZ0FsVk5LZXcxV3dCNUNOVURrbzFJLXU2NFgyS2Y3QllORWlaZTJn|1b9512c0017bb73efdc1901db0f096c20d7bcc1c95e44e036c86cef97143723c"; __utma=51854390.2040575535.1509024688.1514710953.1514995748.12; __utmz=51854390.1514995748.12.17.utmcsr=zhihu.com|utmccn=(referral)|utmcmd=referral|utmcct=/question/19668080; q_c1=d402a896494040fcaff27f29a58e3cec|1516023938000|1486904252000; aliyungf_tc=AQAAAOXIRUa9RAUABmMhfeWn0nHHvpqn; _xsrf=ae3399dce59c1cfc914e2eca9fa61864',
    'connection': "keep-alive",
}

response = requests.request("GET", url, headers=headers)

print(response.text)

Or:

>>> import requests
>>> url = 'http://httpbin.org/cookies'
>>> cookies = dict(cookies_are='working')

>>> r = requests.get(url, cookies=cookies)
>>> r.text
'{"cookies": {"cookies_are": "working"}}'
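With requests there is also a third option worth knowing: a requests.Session keeps cookies returned by the server and resends them on later requests, much like Scrapy's cookie middleware. A minimal offline sketch (the cookie value below is just the _xsrf from earlier, used as a placeholder):

```python
import requests

session = requests.Session()
# Pre-seed the session's cookie jar, e.g. with values copied from the browser.
session.cookies.set('_xsrf', '48aa43fd-bdd2-4b63-afa3-7cc8f2e3abca')

# Every request made through this session now carries the cookie, and any
# Set-Cookie header in a response is merged back into the same jar.
print(session.cookies.get('_xsrf'))
# 48aa43fd-bdd2-4b63-afa3-7cc8f2e3abca
```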

Leaving this note here so I don't make the same mistake again.

