[Notes] Scrapy simulated login: copied cookies not taking effect

Problem

I took over a scraping task that required being logged in. Using Scrapy, I copied the browser's Request Headers into the settings.py file, but the responses always came back as not logged in.

Solution

A cookie in the browser usually looks like this:

Cookie:

aliyungf_tc=AQAAAMeFWkXOsQ4ABmMhfZulWGtZkfPs; _xsrf=48aa43fd-bdd2-4b63-afa3-7cc8f2e3abca; q_c1=0e8f6896778146f587caeb0b227a3b76|1516199621000|1516199621000; capsion_ticket="2|1:0|10:1516199621|14:capsion_ticket|44:OTQ1ZGQ3OTI2NmFlNGJjYzlhZjFmMGZkZGNlNGZiOTY=|a13a774404b025674a34fbcfcb907e2e5145e686eb0273553830253301a53fa2"; _zap=20e42cbd-1d16-44eb-b1be-533400a082f6

Copied into settings.py, it looked like this:

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
    'Cookie': 'aliyungf_tc=AQAAAMeFWkXOsQ4ABmMhfZulWGtZkfPs; _xsrf=48aa43fd-bdd2-4b63-afa3-7cc8f2e3abca; q_c1=0e8f6896778146f587caeb0b227a3b76|1516199621000|1516199621000; capsion_ticket="2|1:0|10:1516199621|14:capsion_ticket|44:OTQ1ZGQ3OTI2NmFlNGJjYzlhZjFmMGZkZGNlNGZiOTY=|a13a774404b025674a34fbcfcb907e2e5145e686eb0273553830253301a53fa2"; _zap=20e42cbd-1d16-44eb-b1be-533400a082f6',
}
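A likely reason the header approach fails: when cookie handling is enabled (the default), Scrapy's built-in CookiesMiddleware manages the Cookie header itself and a hand-written one in DEFAULT_REQUEST_HEADERS gets overridden. A sketch of the settings change if you really do want to send the raw header (this is my assumption about the setup, and it trades away Scrapy's automatic cookie tracking):

```python
# settings.py -- sketch: only if you insist on a hand-copied Cookie header.
# With COOKIES_ENABLED at its default (True), CookiesMiddleware manages the
# Cookie header itself, so the one below would not reach the server as-is.
COOKIES_ENABLED = False

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 ...',        # your real UA string here
    'Cookie': 'aliyungf_tc=...; _xsrf=...', # raw header copied from the browser
}
```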

But the example in the official documentation looks like this:

cookies (dict or list) –

the request cookies. These can be sent in two forms.

Using a dict:

request_with_cookies = Request(url="http://www.example.com",
                               cookies={'currency': 'USD', 'country': 'UY'})

Using a list of dicts:

request_with_cookies = Request(url="http://www.example.com",
                               cookies=[{'name': 'currency',
                                         'value': 'USD',
                                         'domain': 'example.com',
                                         'path': '/currency'}])

The latter form allows for customizing the domain and path attributes of the cookie. This is only useful if the cookies are saved for later requests.

When some site returns cookies (in a response) those are stored in the cookies for that domain and will be sent again in future requests. That's the typical behaviour of any regular web browser. However, if, for some reason, you want to avoid merging with existing cookies you can instruct Scrapy to do so by setting the dont_merge_cookies key to True in the Request.meta.

Example of request without merging cookies:

request_with_cookies = Request(url="http://www.example.com",
                               cookies={'currency': 'USD', 'country': 'UY'},
                               meta={'dont_merge_cookies': True})

For more info see CookiesMiddleware.

So the officially recommended way is not what I did above: cookies are passed in as key-value pairs. The fix is simply to override start_requests and pass the cookies argument directly.

That is:

cookies = {
    'aliyungf_tc': 'AQAAAMeFWkXOsQ4ABmMhfZulWGtZkfPs',
    '_xsrf': '48aa43fd-bdd2-4b63-afa3-7cc8f2e3abca',
    'q_c1': '0e8f6896778146f587caeb0b227a3b76|1516199621000|1516199621000',
    'capsion_ticket': '"2|1:0|10:1516199621|14:capsion_ticket|44:OTQ1ZGQ3OTI2NmFlNGJjYzlhZjFmMGZkZGNlNGZiOTY=|a13a774404b025674a34fbcfcb907e2e5145e686eb0273553830253301a53fa2"',
    '_zap': '20e42cbd-1d16-44eb-b1be-533400a082f6',
}

def start_requests(self):
    # url must be a string, not a list
    url = 'https://www.zhihu.com/people/cuishite/activities'
    yield scrapy.Request(url, cookies=self.cookies)
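Hand-transcribing every pair from the browser is tedious and error-prone. A small helper (my own sketch, not part of Scrapy) can split the raw Cookie header into the dict that scrapy.Request expects. Note it splits each pair on the first '=' only, because values such as capsion_ticket contain '=' themselves:

```python
def cookie_string_to_dict(raw_cookie: str) -> dict:
    """Turn a raw browser Cookie header into a {name: value} dict.

    Splits on '; ' between pairs and on the FIRST '=' within each pair,
    since cookie values (e.g. base64 tickets) may themselves contain '='.
    """
    cookies = {}
    for pair in raw_cookie.split('; '):
        name, _, value = pair.partition('=')
        cookies[name.strip()] = value
    return cookies


raw = '_xsrf=48aa43fd-bdd2-4b63-afa3-7cc8f2e3abca; capsion_ticket="abc=|def"'
print(cookie_string_to_dict(raw))
# {'_xsrf': '48aa43fd-bdd2-4b63-afa3-7cc8f2e3abca', 'capsion_ticket': '"abc=|def"'}
```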

Of course, another option is to simulate the actual login flow, but unless you have multiple accounts to juggle at once, I don't recommend it.

Usage with Requests

With the requests library, by contrast, both approaches work. For example:

import requests

url = "https://www.zhihu.com/people/cuishite/activities"

headers = {
    'accept-encoding': "gzip, deflate, br",
    'accept-language': "zh-CN,zh;q=0.9,en;q=0.8",
    'upgrade-insecure-requests': "1",
    'user-agent': "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36",
    'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    'cache-control': "no-cache",
    'cookie': 'd_c0="AJCCdkCtTAuPTlyOq8KViXPcE2lzDxFr660=|1486904251"; _zap=4bb82da4-e1a4-4567-b776-680b773be441; OUTFOX_SEARCH_USER_ID_NCOO=522960375.05909324; _ga=GA1.2.314890948.1495081760; q_c1=d402a896494040fcaff27f29a58e3cec|1507645799000|1486904252000; __utmv=51854390.100-1|2=registration_date=20151010=1^3=entry_date=20151010=1; _xsrf=ae3399dce59c1cfc914e2eca9fa61864; r_cap_id="NjZkZmIyM2U5ZjQwNGRkMDgwOGE3MjM2M2Q1ZDkxMDk=|1514708518|49bee39e58fbf764e17202ca1fc640e3a04e566b"; cap_id="YmYzZTVhYTJmMDk5NGEzM2E3NGI2NDdhYzczN2I0NzQ=|1514708518|1753b8dace7c55b8f42efbbd46c10c68ad576554"; l_cap_id="NTY2ODk5MjY5OGZjNDc5YjgxMGMwMTY3NDVkZTM0NWE=|1514708518|79cc996555d5d795362e65b72212c33811799746"; capsion_ticket="2|1:0|10:1514708520|14:capsion_ticket|44:NWVhM2JmYTVkYTM0NDI4NmFjMjMyMWMzNjM3MTZjY2U=|56fa4dd544de498a4d47428a76f90cfc405fcf91f2ab822d246636490a0e951d"; z_c0="2|1:0|10:1514708521|4:z_c0|92:Mi4xUUhNc0FnQUFBQUFBa0lKMlFLMU1DeVlBQUFCZ0FsVk5LZXcxV3dCNUNOVURrbzFJLXU2NFgyS2Y3QllORWlaZTJn|1b9512c0017bb73efdc1901db0f096c20d7bcc1c95e44e036c86cef97143723c"; __utma=51854390.2040575535.1509024688.1514710953.1514995748.12; __utmz=51854390.1514995748.12.17.utmcsr=zhihu.com|utmccn=(referral)|utmcmd=referral|utmcct=/question/19668080; q_c1=d402a896494040fcaff27f29a58e3cec|1516023938000|1486904252000; aliyungf_tc=AQAAAOXIRUa9RAUABmMhfeWn0nHHvpqn; _xsrf=ae3399dce59c1cfc914e2eca9fa61864',
    'connection': "keep-alive",
}

response = requests.request("GET", url, headers=headers)

print(response.text)

Or:

>>> import requests
>>> url = 'http://httpbin.org/cookies'
>>> cookies = dict(cookies_are='working')

>>> r = requests.get(url, cookies=cookies)
>>> r.text
'{"cookies": {"cookies_are": "working"}}'
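With requests there is also a third option worth knowing: a requests.Session keeps cookies returned by the server and resends them on later requests, much like Scrapy's cookie middleware. A minimal offline sketch (the cookie value below is just the _xsrf from earlier, used as a placeholder):

```python
import requests

session = requests.Session()
# Pre-seed the session's cookie jar, e.g. with values copied from the browser.
session.cookies.set('_xsrf', '48aa43fd-bdd2-4b63-afa3-7cc8f2e3abca')

# Every request made through this session now carries the cookie, and any
# Set-Cookie header in a response is merged back into the same jar.
print(session.cookies.get('_xsrf'))
# 48aa43fd-bdd2-4b63-afa3-7cc8f2e3abca
```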

Leaving this note here so I don't make the same mistake again.

