Python 3.5: Chinese text is all garbled when parsing a Chinese web page with BeautifulSoup?

01-19

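I know this is a well-worn question, and I have read plenty of answers and tutorials, but I'm slow on the uptake and have to ask anyway. I'm new to scraping with no prior background, and I have been stuck on this page-encoding problem for several days.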
from bs4 import BeautifulSoup
import requests

page = requests.get("http://info.sporttery.cn/basketball/vote/fb_vote.php?page=2search_value=num=2016-05-09")
soup = BeautifulSoup(page.text, "lxml")
print(soup)

Add one line right after the requests.get call, before page.text is read:

page.encoding = page.apparent_encoding

That should do it.

For details, see a blog post I wrote earlier:

Garbled text when scraping Baidu Baike with python + requests

With requests you can set the encoding on the response object yourself:

req.encoding = "gb18030"

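page.encoding: the encoding of the response body as guessed from the HTTP response headers (the charset in Content-Type);

page.apparent_encoding: the encoding detected by analysing the response body itself;

So after adding page.encoding = page.apparent_encoding, the content of page is usually decoded with a more accurate encoding.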
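
To see why the header-based guess goes wrong, here is a minimal sketch that uses only the standard library and no network access. It assumes the page body is GBK-encoded and that the server sent a text/* Content-Type without a charset, in which case Requests falls back to ISO-8859-1. The sample string and variable names are made up for illustration.

```python
# Stand-in for page.content: Chinese text encoded as GBK bytes.
original = "中文网页"
raw = original.encode("gbk")

# What page.text amounts to when page.encoding is the ISO-8859-1 fallback:
# every byte becomes one Latin-1 character, which is the garbled output.
wrong = raw.decode("iso-8859-1")

# What page.text amounts to once page.encoding names a matching encoding.
# gb18030 is a superset of GBK, so it decodes these bytes correctly.
right = raw.decode("gb18030")

print(repr(wrong))
print(right)
```

Setting page.encoding (to page.apparent_encoding or to an explicit name such as "gb18030") only changes which decoder page.text uses; the bytes in page.content stay the same.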

Response Content

We can read the content of the server's response. Consider the GitHub timeline again:
>>> import requests
>>> r = requests.get("https://github.com/timeline.json")
>>> r.text
u'[{"repository":{"open_issues":0,"url":"https://github.com/...
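Requests automatically decodes content from the server. Most unicode charsets are decoded seamlessly.
When you make a request, Requests makes an educated guess about the encoding of the response based on the HTTP headers. That guessed text encoding is used when you access r.text. You can find out which encoding Requests is using, and change it, through the r.encoding property: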
>>> r.encoding
'utf-8'
>>> r.encoding = "ISO-8859-1"
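If you change the encoding, Requests will use the new value of r.encoding whenever you access r.text. You may want to do this whenever you can apply special logic to work out the encoding of the content. For example, HTML and XML can specify their encoding in the body itself. In that case, use r.content to find the encoding and then set r.encoding accordingly, so that r.text is decoded with the correct encoding.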
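
As a rough sketch of that "special logic", the snippet below looks for a charset declaration in the raw bytes of an HTML page, the way you might inspect r.content before setting r.encoding. The helper name sniff_meta_charset and the sample HTML are my own assumptions, not part of Requests, and the regular expression is deliberately simple rather than a full HTML parser.

```python
import re

# Matches <meta charset="..."> as well as
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_META_CHARSET = re.compile(
    rb'<meta[^>]*charset=["\']?([A-Za-z0-9_-]+)', re.IGNORECASE
)


def sniff_meta_charset(content, default=None):
    """Return the charset declared in the first 4 KB of HTML bytes, or default."""
    match = _META_CHARSET.search(content[:4096])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


# Stand-in for r.content: a GB2312-encoded page that declares its own charset.
html = '<html><head><meta charset="GB2312"><title>测试</title></head></html>'.encode("gb2312")

charset = sniff_meta_charset(html)
text = html.decode(charset)  # with Requests: r.encoding = charset, then r.text
print(charset, text)
```

With a real response you would pass r.content to the helper and assign the result to r.encoding only when it is not None.
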
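Requests can also use custom encodings if you need them. If you have created your own encoding and registered it with the codecs module, you can simply use the codec name as the value of r.encoding and Requests will handle the decoding for you.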
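
A minimal sketch of registering such a codec with the standard codecs module follows. The codec name site_legacy is hypothetical, and to keep the example short it simply delegates to the built-in gb18030 codec instead of implementing a new one.

```python
import codecs


def _search_site_legacy(name):
    """Codec search function: return a CodecInfo for our name, else None."""
    if name == "site_legacy":  # hypothetical codec name
        base = codecs.lookup("gb18030")
        return codecs.CodecInfo(
            name="site_legacy",
            encode=base.encode,
            decode=base.decode,
        )
    return None


# After registration the name works anywhere Python accepts a codec name.
codecs.register(_search_site_legacy)

raw = "中文".encode("gb18030")       # stand-in for r.content
decoded = raw.decode("site_legacy")  # what r.text would do with r.encoding = "site_legacy"
print(decoded)
```

Once the search function is registered, assigning r.encoding = "site_legacy" would make r.text go through the same decode path.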

From now on I'll just answer by copying the docs; Zhihu keeps getting easier. Docs link: http://cn.python-requests.org/zh_CN/latest/user/quickstart.html#id2