Python Web Scraping Series (2): requests Basics

05-17

From the column: Programming, Making Complex Things Simple

1. Sending requests:

import requests

# Fetch data.
# r is a Response object. It holds everything the request returned.
r = requests.get("https://github.com/timeline.json")
print(r.content)

Printed result:

b'{"message":"Hello there, wayfaring stranger. If you\xe2\x80\x99re reading this then you probably didn\xe2\x80\x99t see our blog post a couple of years back announcing that this API would go away: GitHub API v2: End of Life Fear not, you should be able to get what you need from the shiny new Events API instead.","documentation_url":"Events | GitHub Developer Guide"}'

Besides GET, requests can send the other HTTP methods. Four of them are shown here:

r = requests.put("http://httpbin.org/put")

r = requests.delete("http://httpbin.org/delete")

r = requests.head("http://httpbin.org/get")

r = requests.options("http://httpbin.org/get")

2. Passing URL parameters

Both examples below pass parameters through the URL. Here the parameters are given as a dict.

import requests

payload1 = {"key1": "value1", "key2": "value2"}
r1 = requests.get("http://httpbin.org/get", params=payload1)
print(r1.url)

payload2 = {"key1": "value1", "key2": ["value2", "value3"]}
r2 = requests.get("http://httpbin.org/get", params=payload2)
print(r2.url)

Results:

http://httpbin.org/get?key1=value1&key2=value2

http://httpbin.org/get?key1=value1&key2=value2&key2=value3

Note the difference: a list value repeats the key once for each item.
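The URL that requests builds can be inspected without sending anything by preparing the request first. The following is a minimal sketch; `build_url` is a hypothetical helper name, not part of requests, and the `None` case is an extra illustration that is not in the examples above.

```python
import requests


def build_url(base, params):
    """Return the URL requests would send, without touching the network."""
    prepared = requests.Request("GET", base, params=params).prepare()
    return prepared.url


# A list value repeats the key once per item.
list_url = build_url(
    "http://httpbin.org/get",
    {"key1": "value1", "key2": ["value2", "value3"]},
)
# A key whose value is None is left out of the query string.
none_url = build_url("http://httpbin.org/get", {"key1": "value1", "key2": None})

print(list_url)
print(none_url)
```

Preparing a request this way is handy for checking how parameters are encoded before pointing a crawler at a real site.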

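3. Response content

r = requests.get("https://github.com/timeline.json")
# Get the response body as text.
print(r.text)
# Get the encoding used to decode the body.
print(r.encoding)
# Change the encoding. Reading r.text afterwards uses the new encoding.
r.encoding = "ISO-8859-1"

Note that some characters decode differently under a different encoding.

Wrap the content in a binary stream:

from io import BytesIO
i = BytesIO(r.content)

Parse the content as JSON:

print(r.json())

Note: a successful call to r.json() does not mean the request succeeded. Some servers include a JSON object in a failed response (for example, the error details of an HTTP 500), and that JSON is decoded and returned just the same. To check whether the request succeeded, call r.raise_for_status() or check that r.status_code is what you expect.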
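To see why r.json() alone is not a success check, here is a sketch that runs against a throwaway local server instead of a real site. `ErrorHandler` and `fetch_error_demo` are hypothetical names made up for this example, and the server always answers 500 with a JSON body.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class ErrorHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: always answers 500 with a JSON body."""

    def do_GET(self):
        body = json.dumps({"error": "something broke"}).encode("utf-8")
        self.send_response(500)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output quiet


def fetch_error_demo():
    server = HTTPServer(("127.0.0.1", 0), ErrorHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    session = requests.Session()
    session.trust_env = False  # ignore proxy settings for this local test
    try:
        r = session.get(f"http://127.0.0.1:{server.server_port}/", timeout=5)
        parsed = r.json()  # decodes fine even though the request failed
        try:
            r.raise_for_status()
            raised = False
        except requests.exceptions.HTTPError:
            raised = True
        return r.status_code, parsed, raised
    finally:
        server.shutdown()
        server.server_close()


status, parsed, raised = fetch_error_demo()
print(status, parsed, raised)
```

The JSON body parses without complaint, and only raise_for_status() (or the status code) reveals the failure.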

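Raw response content

What is the raw content? It is what comes back at the socket level between the client and the server. You must set stream=True to read it, and r.raw is a urllib3 response object.

r = requests.get("https://github.com/timeline.json", stream=True)
# Read the first 100 bytes from the stream.
r.raw.read(100)

To save the returned data to a file, however, use the stream like this:

# filename and chunk_size are values you choose yourself.
with open(filename, "wb") as fd:
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)

In other words, use Response.iter_content instead of r.raw.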
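The file-saving pattern can be wrapped in a small function. This is a sketch under assumptions: `download`, `FileHandler` and `PAYLOAD` are hypothetical names, and a local test server stands in for the real site so the example runs offline.

```python
import os
import tempfile
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

PAYLOAD = b"x" * 50_000  # hypothetical file contents served by the test server


class FileHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", str(len(PAYLOAD)))
        self.end_headers()
        self.wfile.write(PAYLOAD)

    def log_message(self, *args):
        pass  # keep the example output quiet


def download(session, url, path, chunk_size=8192):
    """Stream url into path chunk by chunk; return the number of bytes written."""
    written = 0
    with session.get(url, stream=True, timeout=5) as r:
        r.raise_for_status()
        with open(path, "wb") as fd:
            for chunk in r.iter_content(chunk_size):
                fd.write(chunk)
                written += len(chunk)
    return written


server = HTTPServer(("127.0.0.1", 0), FileHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
session = requests.Session()
session.trust_env = False  # ignore proxy settings for this local test
target = os.path.join(tempfile.mkdtemp(), "download.bin")
try:
    size = download(session, f"http://127.0.0.1:{server.server_port}/", target)
finally:
    server.shutdown()
    server.server_close()

print(size, os.path.getsize(target))
```

With stream=True the body is fetched as the loop consumes it, so memory use stays near chunk_size rather than growing with the file.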

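4. Custom request headers

url = "https://api.github.com/some/endpoint"
headers = {"user-agent": "my-app/0.0.1"}
# The dict is sent as HTTP request headers. It is not part of the URL.
r = requests.get(url, headers=headers)

A few things to keep in mind:

Note: custom headers have lower priority than some more specific sources of information. For example:

If credentials are set in .netrc, authorization set with headers= will not take effect. If the auth= parameter is set, the .netrc settings are ignored in turn.

If the request is redirected to a different host, the authorization header is removed.

The proxy authorization header is overridden by proxy credentials given in the URL.

Content-Length is overwritten whenever the length of the content can be determined.

Going a step further, Requests does not change its behavior based on which custom headers you set. The headers are simply passed along in the final request.

Note: every header value must be a string, bytestring, or unicode.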
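The sketch below shows two of these points without sending anything: a custom user-agent replaces the session default, and a non-string header value is rejected. It assumes a reasonably recent requests release, where the rejection surfaces as `requests.exceptions.InvalidHeader`. The `X-Count` header is a made-up name for illustration.

```python
import requests

URL = "https://api.github.com/some/endpoint"

session = requests.Session()
request = requests.Request("GET", URL, headers={"user-agent": "my-app/0.0.1"})
# prepare_request merges the session's default headers with ours; ours win.
prepared = session.prepare_request(request)
print(prepared.headers["User-Agent"])

# Header values must be str or bytes; an int is rejected while preparing.
try:
    requests.Request("GET", URL, headers={"X-Count": 1}).prepare()
    rejected = False
except requests.exceptions.InvalidHeader:
    rejected = True
print(rejected)
```

Header names are matched case-insensitively, which is why "user-agent" overrides the default "User-Agent" entry.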

5. More complex POST requests

import json
import requests

# Pass tuples (useful when a key repeats).
payload1 = (("key1", "value1"), ("key1", "value2"))
r1 = requests.post("http://httpbin.org/post", data=payload1)

# Pass a dict.
payload2 = {"key1": "value1", "key2": "value2"}
r2 = requests.post("http://httpbin.org/post", data=payload2)

# Pass a JSON string.
url1 = "https://api.github.com/some/endpoint"
payload3 = {"some": "data"}
r3 = requests.post(url1, data=json.dumps(payload3))

# Pass a dict through json= and let Requests encode it.
url2 = "https://api.github.com/some/endpoint"
payload4 = {"some": "data"}
r4 = requests.post(url2, json=payload4)
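The four variants differ in the body and the Content-Type they produce. As a sketch, the requests can be prepared (not sent) and inspected; the variable names here are only for illustration.

```python
import json

import requests

URL = "http://httpbin.org/post"

form = requests.Request("POST", URL, data={"key1": "value1", "key2": "value2"}).prepare()
multi = requests.Request("POST", URL, data=(("key1", "value1"), ("key1", "value2"))).prepare()
raw = requests.Request("POST", URL, data=json.dumps({"some": "data"})).prepare()
as_json = requests.Request("POST", URL, json={"some": "data"}).prepare()

# Dicts and tuples are form-encoded.
print(form.body, form.headers["Content-Type"])
print(multi.body)
# A string passed to data= is sent as-is, and no Content-Type is added.
print(raw.body, "Content-Type" in raw.headers)
# json= serializes the dict and labels the body as JSON.
print(json.loads(as_json.body), as_json.headers["Content-Type"])
```

So data=json.dumps(...) and json=... carry the same payload, but only json= sets the Content-Type header, which some APIs require.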

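6. Uploading files

import requests

url = "http://httpbin.org/post"

# files = {"file": open("report.xls", "rb")}

# Explicitly set the filename, content type, and headers.
# files = {"file": ("report.xls", open("report.xls", "rb"), "application/vnd.ms-excel", {"Expires": "0"})}

# Send a string as if it were a file.
files = {"file": ("report.xls", "some,data,to,send\nanother,row,to,send\n")}

r = requests.post(url, files=files)
print(r.text)

For the third variant (a string sent as a file), httpbin echoes the string back under the "files" key of the response.

Note: for streaming very large files as multipart/form-data, the official docs point to the separate requests-toolbelt package. We will demonstrate it later.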
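To see what a file upload looks like on the wire, the request can be prepared without sending it. This is a sketch that reuses the string-as-file variant; nothing beyond the standard requests API is assumed.

```python
import requests

files = {"file": ("report.xls", "some,data,to,send\nanother,row,to,send\n")}
prepared = requests.Request("POST", "http://httpbin.org/post", files=files).prepare()

content_type = prepared.headers["Content-Type"]
# The media type, with the randomly generated boundary stripped off.
print(content_type.split(";")[0])
# The multipart body carries the filename and the data as bytes.
print(b'filename="report.xls"' in prepared.body)
print(b"another,row,to,send" in prepared.body)
```

Requests picks the multipart boundary itself, so avoid setting Content-Type by hand when using files=.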

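7. Response status codes

r = requests.get("http://httpbin.org/get")
print(r.status_code)
# requests.codes is a lookup object for status codes.
print(r.status_code == requests.codes.ok)

bad_r = requests.get("http://httpbin.org/status/404")
print(bad_r.status_code)
# When the request went wrong, raise_for_status() raises an exception.
bad_r.raise_for_status()

Result: the three print calls output 200, True and 404, and the final call raises requests.exceptions.HTTPError.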

8. Response headers

import requests

r = requests.get("http://httpbin.org/get")
print(r.status_code)
# Get the response headers. They behave like a dict with case-insensitive keys.
print(r.headers)
print(r.headers["Content-Type"])
print(r.headers.get("content-type"))
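Both lookups above work because r.headers is a `CaseInsensitiveDict`. A short offline sketch of that behavior, using a hand-built header dict rather than a real response:

```python
from requests.structures import CaseInsensitiveDict

headers = CaseInsensitiveDict({"Content-Type": "application/json"})

lower = headers["content-type"]      # lookup ignores case
upper = headers.get("CONTENT-TYPE")  # so does .get()
missing = headers.get("x-missing")   # absent keys return None with .get()
keys = list(headers)                 # the original spelling is preserved

print(lower, upper, missing, keys)
```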

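9. Cookies

import requests

url = "http://example.com/some/cookie/setting/url"
r = requests.get(url)
# Read a cookie returned by the response.
r.cookies["example_cookie_name"]

url = "http://httpbin.org/cookies"
# Send cookies with the request. This is used a lot after simulating a login.
cookies = dict(cookies_are="working")
r = requests.get(url, cookies=cookies)
r.text

# Cookies come back in a RequestsCookieJar, which behaves like a dict and is suited to use across multiple domains and paths.
# Honestly, is this really "cross-domain"? To me it looks more like reusing a logged-in session.
jar = requests.cookies.RequestsCookieJar()
jar.set("tasty_cookie", "yum", domain="httpbin.org", path="/cookies")
jar.set("gross_cookie", "blech", domain="httpbin.org", path="/elsewhere")
url = "http://httpbin.org/cookies"
r = requests.get(url, cookies=jar)
r.text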
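The point of the domain and path arguments is that only matching cookies are attached to a request. A sketch that checks this offline by preparing requests instead of sending them; `cookie_header` is a hypothetical helper name.

```python
import requests

jar = requests.cookies.RequestsCookieJar()
jar.set("tasty_cookie", "yum", domain="httpbin.org", path="/cookies")
jar.set("gross_cookie", "blech", domain="httpbin.org", path="/elsewhere")


def cookie_header(url):
    """Return the Cookie header requests would attach for url (nothing is sent)."""
    prepared = requests.Request("GET", url, cookies=jar).prepare()
    return prepared.headers.get("Cookie")


cookies_path = cookie_header("http://httpbin.org/cookies")
elsewhere_path = cookie_header("http://httpbin.org/elsewhere")

print(cookies_path)
print(elsewhere_path)
```

Each URL gets only the cookie whose path matches, which is what lets one jar serve several sites and paths at once.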

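10. Redirection and request history

By default, Requests follows redirects automatically for every method except HEAD. Use the history attribute of the response object to trace them.

What is a redirect? You request address A and are automatically sent on to address B.

In the example below, the 301 that comes back means a permanent redirect. Don't overthink it, just remember it.

Why does a request to a single address get redirected here? GitHub's server answers the http:// request with a 301 that points to the https:// address. The redirect happens at the HTTP level; DNS only resolves the domain name to an IP address and is not involved.

Response.history is a list of the Response objects that were created to complete the request. The list is sorted from the oldest response to the most recent.

r = requests.get("http://github.com")
print(r.url)
print(r.history)

Disabling redirects:

With GET, OPTIONS, POST, PUT, PATCH, or DELETE, redirect handling can be disabled through the allow_redirects parameter.

r = requests.get("http://github.com", allow_redirects=False)
print(r.status_code)
print(r.history)

Enabling redirects for HEAD:

r = requests.head("http://github.com", allow_redirects=True)
print(r.history)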
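The same behavior can be reproduced offline. In this sketch a local test server redirects /old to /new with a 301; `RedirectHandler` and both paths are made up for the example.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class RedirectHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: /old answers 301 and points to /new."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(301)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"arrived"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output quiet


server = HTTPServer(("127.0.0.1", 0), RedirectHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
session = requests.Session()
session.trust_env = False  # ignore proxy settings for this local test
old_url = f"http://127.0.0.1:{server.server_port}/old"
try:
    followed = session.get(old_url, timeout=5)
    not_followed = session.get(old_url, allow_redirects=False, timeout=5)
finally:
    server.shutdown()
    server.server_close()

# Followed: final URL is /new and history holds the 301 response.
print(followed.url, [h.status_code for h in followed.history])
# Not followed: the 301 itself is returned and history is empty.
print(not_followed.status_code, not_followed.history)
```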

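11. Timeouts

r = requests.get("http://github.com", timeout=0.001)

Timeouts are very useful. Without one, the program can block indefinitely when no response arrives. timeout applies to the connection phase and has nothing to do with downloading the response body. It is not a limit on the total time of the download. Instead, an exception is raised if the server does not respond within timeout seconds (more precisely, if no bytes are received on the underlying socket for timeout seconds).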
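Requests also accepts a (connect, read) tuple for timeout, which makes the "no bytes received" rule easy to observe. The sketch below uses a deliberately slow local server; `SlowHandler` and the specific delays are assumptions chosen for the demonstration.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class SlowHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(0.5)  # stay silent longer than the client's read timeout
        try:
            self.send_response(200)
            self.send_header("Content-Length", "0")
            self.end_headers()
        except OSError:
            pass  # the client has already given up

    def log_message(self, *args):
        pass  # keep the example output quiet


server = HTTPServer(("127.0.0.1", 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
session = requests.Session()
session.trust_env = False  # ignore proxy settings for this local test
try:
    # timeout=(connect, read): 1 s to connect, then at most 0.1 s of silence.
    session.get(f"http://127.0.0.1:{server.server_port}/", timeout=(1, 0.1))
    outcome = "completed"
except requests.exceptions.ReadTimeout:
    outcome = "read timeout"
except requests.exceptions.ConnectTimeout:
    outcome = "connect timeout"
finally:
    server.shutdown()
    server.server_close()

print(outcome)
```

The connection itself succeeds at once, so the failure comes from the read side: the server sends nothing for longer than the allowed 0.1 seconds.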

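12. Errors and exceptions

On a network problem (for example a DNS lookup failure or a refused connection), Requests raises a ConnectionError exception.

If the HTTP request returned an unsuccessful status code, Response.raise_for_status() raises an HTTPError exception.

If the request times out, a Timeout exception is raised.

If the request exceeds the configured maximum number of redirects, a TooManyRedirects exception is raised.

Every exception that Requests raises explicitly inherits from requests.exceptions.RequestException.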
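Because everything inherits from RequestException, a handler can test for specific cases first and fall back to the base class. A small offline sketch; `classify` and its labels are hypothetical, and the exceptions are constructed by hand rather than triggered by real requests.

```python
from requests import exceptions as rex


def classify(exc):
    """Map an exception to a short label, most specific checks first."""
    if isinstance(exc, rex.Timeout):
        return "timeout"
    if isinstance(exc, rex.ConnectionError):
        return "network problem"
    if isinstance(exc, rex.HTTPError):
        return "bad status"
    if isinstance(exc, rex.TooManyRedirects):
        return "too many redirects"
    if isinstance(exc, rex.RequestException):
        return "other requests error"
    return "not a requests error"


samples = (
    rex.ConnectTimeout(),   # both a ConnectionError and a Timeout
    rex.ConnectionError(),
    rex.HTTPError(),
    rex.TooManyRedirects(),
    rex.InvalidURL(),       # covered only by the RequestException fallback
    ValueError(),
)
labels = [classify(e) for e in samples]
print(labels)
```

Order matters here: ConnectTimeout inherits from both ConnectionError and Timeout, so whichever check comes first decides its label.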

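By now we have a basic understanding of requests. Tomorrow we will go further and look at its advanced usage.

I just hope that Niu Xiaomei, the new colleague at our company, will take some time to read this carefully and run the code to see what it does.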