A Detailed Guide to the urllib Library
This article introduces Cui Qingcai's lectures on Python web scraping, covering the basic principles and the underlying theory.
The article is about 1,200 characters long, should take roughly 10 minutes to read, and aims to combine theory with hands-on practice.
Contents:
I. What is the urllib library?
II. How to use the urllib library
I. What is the urllib library?
Definition: urllib is Python's built-in HTTP request library. It consists of four modules (a minimal sketch follows this list):
urllib.request: the request module
urllib.error: the exception-handling module
urllib.parse: the URL-parsing module (splitting, joining, etc.)
urllib.robotparser: the robots.txt parsing module
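As a quick orientation before the details, here is a minimal sketch that touches all four modules; the URLs are only placeholders, and any site could be substituted:

from urllib import request, error, parse, robotparser

# urllib.parse: split a URL into its components
parts = parse.urlparse('https://www.baidu.com/s?wd=urllib')
print(parts.scheme, parts.netloc, parts.query)

# urllib.request + urllib.error: fetch a page, catching failures
try:
    response = request.urlopen('https://www.baidu.com', timeout=5)
    print(response.status)
except error.URLError as e:
    print(e.reason)

# urllib.robotparser: check whether a path may be crawled
rp = robotparser.RobotFileParser('https://www.baidu.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://www.baidu.com/s'))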
II. How to use the urllib library
1. The urlopen function:
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)  # the first three parameters are the URL, the request body, and the timeout setting
The first step in scraping (using urlopen):
from urllib import request

response = request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))  # read the content of the response body

A POST-style request (using urllib.parse):

from urllib import parse

data = bytes(parse.urlencode({'word': 'hello'}), encoding='utf8')
response1 = request.urlopen('http://httpbin.org/post', data=data)  # http://httpbin.org/ is a site for testing HTTP requests
print(response1.read())

Setting a timeout:

import socket
from urllib import error

response2 = request.urlopen('http://httpbin.org/get', timeout=1)  # set the timeout to 1 second
print(response2.read())

try:
    response3 = request.urlopen('http://httpbin.org/get', timeout=0.1)  # set the timeout to 0.1 seconds
except error.URLError as e:
    if isinstance(e.reason, socket.timeout):  # use isinstance to check whether the cause of the error is a timeout
        print('TIME OUT')
2. The response
Response type:
print(type(response))  # keeps the original response; you can also assign a new one yourself
# <class 'http.client.HTTPResponse'>

Status code and response headers:

print(response.status)        # status code
print(response.getheaders())  # all response headers
print(response.getheader('Set-Cookie'))  # the header data is dict-like, so a value can be looked up by its key name
3. Request
from urllib import parse, request, error
import socket

request1 = request.Request('http://python.org/')  # build a Request object; comparing with the plain urlopen usage above, this step can be omitted
response = request.urlopen(request1)
print(response.read().decode('utf-8'))

# Build a POST request:
url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64)',
    'Host': 'httpbin.org'
}
dict1 = {'name': 'Germey'}
data = bytes(parse.urlencode(dict1), encoding='utf8')  # the form data
req = request.Request(url=url, data=data, headers=headers, method='POST')  # assemble the whole Request()
response = request.urlopen(req)
print(response.read().decode('utf-8'))  # the output shows the headers and dict1 we constructed above
Another way to achieve the same thing:
req1 = request.Request(url=url, data=data, method='POST')
req1.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64)')  # add the header with add_header
response = request.urlopen(req1)
print(response.read().decode('utf-8'))
4. Handlers:
Proxies (official documentation: https://docs.python.org/3/library/urllib.request.html#module-urllib.request)
from urllib import request

proxy_handler = request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})  # this proxy has expired; my usual source was blocked recently, so I can't demo it for you, sorry ><
opener = request.build_opener(proxy_handler)
response = opener.open('http://www.baidu.com')
print(response.read())
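If you want every subsequent plain urlopen() call to go through the proxy instead of keeping an opener object around, the opener can be installed globally. A minimal sketch, assuming a working proxy listens at 127.0.0.1:9743:

from urllib import request

proxy_handler = request.ProxyHandler({'http': 'http://127.0.0.1:9743'})  # assumed local proxy
opener = request.build_opener(proxy_handler)
request.install_opener(opener)  # from here on, plain request.urlopen() uses this opener

response = request.urlopen('http://www.baidu.com')  # routed through the proxy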
5. Cookies (text data stored on the client that records the user's identity and maintains the logged-in state)
from urllib import request
from http import cookiejar

cookie = cookiejar.CookieJar()  # create a cookie jar
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + '=' + item.value)
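The CookieJar above lives only in memory. To keep the login state across runs, a file-backed jar can be swapped in; a minimal sketch using cookiejar.MozillaCookieJar, where the filename cookies.txt is just an example:

from urllib import request
from http import cookiejar

cookie = cookiejar.MozillaCookieJar('cookies.txt')  # a jar that can save to / load from a Mozilla-format file
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)  # write the cookies out to cookies.txt

# In a later run, load them back to reuse the session:
cookie2 = cookiejar.MozillaCookieJar()
cookie2.load('cookies.txt', ignore_discard=True, ignore_expires=True)
opener2 = request.build_opener(request.HTTPCookieProcessor(cookie2))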
6. Exception handling

from urllib import request, error

# Try visiting a URL that doesn't exist:
try:
    response = request.urlopen('http://www.cuiqingcai.com/index.html')  # http://www.cuiqingcai.com/ is Cui Qingcai's personal blog
except error.URLError as e:
    print(e.reason)  # inspecting the output lets us check that the exception we caught matches the cause
The exceptions that can be caught (official documentation: https://docs.python.org/3/library/urllib.error.html#module-urllib.error):
try:
    response = request.urlopen('http://www.cuiqingcai.com/index.html')
except error.HTTPError as e:  # it is best to catch HTTPError first and the other exceptions afterwards
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')

# A timeout exception:
import socket

try:
    response = request.urlopen('http://www.baidu.com', timeout=0.01)  # set the timeout to 0.01 seconds
except error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):  # check the type of the error
        print('TIME OUT')
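The ordering advice above works because HTTPError is a subclass of URLError: HTTPError means the server answered with an error status, while a bare URLError means no usable response arrived at all. A small helper, sketched under that assumption (the function name fetch is just illustrative):

from urllib import request, error

def fetch(url, timeout=5):
    # Return the response body, or None if anything goes wrong
    try:
        return request.urlopen(url, timeout=timeout).read()
    except error.HTTPError as e:   # the server replied, but with an error status
        print('HTTP error:', e.code)
    except error.URLError as e:    # no response: DNS failure, refused connection, timeout, ...
        print('URL error:', e.reason)
    return None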
7. URL parsing
Official documentation: urllib.parse - Parse URLs into components (Python 3.6.4 documentation)
7.1 urlparse (splits a URL into several component parts, which can then be reassembled):
parse.urlparse(urlstring, scheme='', allow_fragments=True)  # (URL, protocol scheme, whether to keep what follows the #)

from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/s?wd=urllib&ie=UTF-8')
print(type(result), result)  # <class 'urllib.parse.ParseResult'>

# The URL carries no scheme, so we supply one ourselves:
result = urlparse('www.baidu.com/s?wd=urllib&ie=UTF-8', scheme='https')
print(result)

# The URL already specifies a scheme and we also pass one:
result1 = urlparse('http://www.baidu.com/s?wd=urllib&ie=UTF-8', scheme='https')
print(result1)

# Using the allow_fragments parameter:
result1 = urlparse('http://www.baidu.com/s?#comment', allow_fragments=False)
result2 = urlparse('http://www.baidu.com/s?wd=urllib&ie=UTF-8#comment', allow_fragments=False)
print(result1, result2)
# With allow_fragments=False the part after # is not put in the fragment slot; instead it is merged
# into the preceding component. Compare result1 and result2 to see the difference.

urlunparse (the inverse of urlparse):

from urllib.parse import urlunparse

# data takes the components in the order urlparse produces them; note that even empty
# components must be written in as empty strings, or it will fail
data = ['https', '', 'www.baidu.com/s', '', 'wd=urllib&ie=UTF-8', '']
print(urlunparse(data))
7.2 urljoin (joining URLs):
from urllib.parse import urljoin

# In short: whether the second argument is a proper link or something typed at random, urljoin will join it
# onto the base; but if both arguments are complete URLs (http or https), no joining happens and the
# latter link is printed.
print(urljoin('http://www.baidu.com', 'FQA.html'))
# http://www.baidu.com/FQA.html
print(urljoin('http://www.baidu.com', 'http://www.caiqingcai.com/FQA.html'))
# http://www.caiqingcai.com/FQA.html
print(urljoin('https://www.baidu.com/about.html', 'http://www.caiqingcai.com/FQA.html'))
# http://www.caiqingcai.com/FQA.html
print(urljoin('http://www.baidu.com/about.html', 'https://www.caiqingcai.com/FQA.html'))
# https://www.caiqingcai.com/FQA.html
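One detail worth noting, easy to verify yourself: when the second argument is a relative name, urljoin resolves it against the base URL's directory, so the base's last path segment is dropped. A quick check:

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com/docs/about.html', 'FQA.html'))
# http://www.baidu.com/docs/FQA.html  (about.html is replaced; docs/ is kept)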
7.3 urlencode (turning a dict into GET request parameters):
from urllib.parse import urlencode

params = {'name': 'Arise', 'age': 21}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
# http://www.baidu.com?name=Arise&age=21
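A related pair is parse.quote and parse.unquote, which percent-encode and decode a single value, e.g. a keyword containing non-ASCII characters; a minimal sketch:

from urllib.parse import quote, unquote

keyword = '爬蟲'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)  # the keyword becomes %XX escape sequences
print(url)
print(unquote(url))  # decodes back to https://www.baidu.com/s?wd=爬蟲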
7.4 robotparser (for parsing robots.txt):
Official documentation (for reference only):
urllib.robotparser - Parser for robots.txt (Python 3.6.4 documentation)

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.musi-cal.com/robots.txt")
rp.read()
rrate = rp.request_rate("*")
print(rrate.requests)  # 3
print(rrate.seconds)   # 20
print(rp.crawl_delay("*"))  # 6
print(rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco"))  # False
print(rp.can_fetch("*", "http://www.musi-cal.com/"))  # True
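To tie the pieces of this article together, a polite fetch helper might consult robots.txt before downloading. This is only a sketch; the helper name polite_fetch and the user-agent string are made up for illustration:

from urllib import request, robotparser
from urllib.parse import urlparse

def polite_fetch(url, agent='my-crawler'):
    # Read robots.txt from the target host before fetching the page itself
    root = urlparse(url)
    rp = robotparser.RobotFileParser(f'{root.scheme}://{root.netloc}/robots.txt')
    rp.read()
    if not rp.can_fetch(agent, url):
        return None  # disallowed by robots.txt
    return request.urlopen(url).read()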