A Detailed Guide to the urllib Library

This article walks through Cui Qingcai's lectures on Python web scraping, covering the basic principles and the key theoretical points.

The article is roughly 1,200 words long, should take about 10 minutes to read, and aims to combine theory with practice.



Contents:

I. What is the urllib library?

II. How to use the urllib library


I. What is the urllib library?

Definition: Python's built-in HTTP request library.

urllib.request: the request module

urllib.error: the exception-handling module

urllib.parse: the URL parsing module (splitting, joining, etc.)

urllib.robotparser: the robots.txt parsing module
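As a quick orientation before the detailed walkthrough, the sketch below (a minimal example, using the same test URL that appears later in the article) shows how three of these submodules cooperate in a single request; urllib.robotparser is covered on its own in section 7.4.

from urllib import request, parse, error

# Build the query string with urllib.parse, send the request with
# urllib.request, and handle failures with urllib.error.
query = parse.urlencode({'wd': 'urllib'})
try:
    response = request.urlopen('https://www.baidu.com/s?' + query, timeout=5)
    print(response.status)
except error.URLError as e:
    print('request failed:', e.reason)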


II. How to use the urllib library

1. urlopen explained:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)  # the first three parameters of urlopen are the URL, the request data, and the timeout

The first step of a crawler (using urlopen):

from urllib import request

response = request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))  # read the content of the response body

A POST-style request (using parse):

from urllib import parse

data = bytes(parse.urlencode({'word': 'hello'}), encoding='utf8')
response1 = request.urlopen('http://httpbin.org/post', data=data)  # http://httpbin.org/ is a site for testing HTTP requests
print(response1.read())

Setting a timeout:

import socket
from urllib import error

response2 = request.urlopen('http://httpbin.org/get', timeout=1)  # set the timeout to 1 second
print(response2.read())

try:
    response3 = request.urlopen('http://httpbin.org/get', timeout=0.1)  # set the timeout to 0.1 seconds
except error.URLError as e:
    if isinstance(e.reason, socket.timeout):  # use isinstance to check whether the cause of the error is a timeout
        print('TIME OUT')

2. Responses

Response type

print(type(response))  # re-using the response from above; you could also create a new one
# Out[20]: <class 'http.client.HTTPResponse'>

Status code and response headers:

print(response.status)        # status code
print(response.getheaders())  # all response headers
print(response.getheader('Set-Cookie'))  # the headers behave like a dictionary, so a value can be looked up by its header name

3. Request

from urllib import request

request1 = request.Request('http://python.org/')  # wrap the URL in a Request object; compared with plain urlopen above, this step can be omitted
response = request.urlopen(request1)
print(response.read().decode('utf-8'))

Constructing a POST request:

import socket
from urllib import parse, request, error

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64)',
    'Host': 'httpbin.org'
}
dict1 = {'name': 'Germey'}
data = bytes(parse.urlencode(dict1), encoding='utf8')  # the form data
req = request.Request(url=url, data=data, headers=headers, method='POST')  # assemble the whole Request() object
response = request.urlopen(req)
print(response.read().decode('utf-8'))  # the output echoes the headers and dict1 constructed above

An alternative way to do the same thing

req1 = request.Request(url=url, data=data, method='POST')
req1.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64)')  # add the header with add_header
response = request.urlopen(req1)
print(response.read().decode('utf-8'))

4. Handler:

Proxies (official documentation: docs.python.org/3/libra)

from urllib import request

proxy_handler = request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})  # this proxy address has expired; my source was recently blocked, so I cannot demonstrate it live, sorry
opener = request.build_opener(proxy_handler)
response = opener.open('http://www.baidu.com')
print(response.read())

5. Cookies (text data stored on the client side, used to identify the user and keep the login session alive)

from urllib import request
from http import cookiejar

cookie = cookiejar.CookieJar()  # create a cookie jar
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + '=' + item.value)
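The example above only prints the cookies it received. As an extra, hedged sketch beyond the original walkthrough: the same http.cookiejar module also provides MozillaCookieJar, which can persist cookies to a text file and load them back later, keeping a session alive between runs (the file name cookies.txt is only an illustrative choice).

from urllib import request
from http import cookiejar

filename = 'cookies.txt'  # illustrative file name, not taken from the original article
cookie = cookiejar.MozillaCookieJar(filename)  # a CookieJar that reads/writes the Mozilla cookies.txt format
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)  # write the cookies to disk

# Later, or in another script, the saved cookies can be loaded back:
cookie2 = cookiejar.MozillaCookieJar()
cookie2.load(filename, ignore_discard=True, ignore_expires=True)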

6. Exception handling

from urllib import request, error

# try to visit a page that does not exist
try:
    response = request.urlopen('http://www.cuiqingcai.com/index.html')  # http://www.cuiqingcai.com/ is Cui Qingcai's personal blog
except error.URLError as e:
    print(e.reason)  # inspect the output to confirm that the exception we caught is the one we expected

Exceptions that can be caught (official documentation: docs.python.org/3/libra):

import socket
from urllib import request, error

try:
    response = request.urlopen('http://www.cuiqingcai.com/index.html')
except error.HTTPError as e:  # it is best to catch HTTPError first, then the more general errors
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')

try:
    response = request.urlopen('http://www.baidu.com', timeout=0.01)  # trigger a timeout exception
except error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):  # check the type of the error
        print('TIME OUT')

7. URL parsing

21.8. urllib.parse - Parse URLs into components - Python 3.6.4 documentation: docs.python.org

7.1 urlparse (splits a URL into several components and assigns each part in turn)

parse.urlparse(urlstring, scheme='', allow_fragments=True)  # (URL, scheme/protocol to assume, whether to split off the part after '#')

from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/s?wd=urllib&ie=UTF-8')
print(type(result), result)  # <class 'urllib.parse.ParseResult'>

# no scheme in the URL, so the one supplied as an argument is used
result = urlparse('www.baidu.com/s?wd=urllib&ie=UTF-8', scheme='https')
print(result)

# the URL already specifies a scheme, so the supplied one is ignored
result1 = urlparse('http://www.baidu.com/s?wd=urllib&ie=UTF-8', scheme='https')
print(result1)

# the allow_fragments parameter
result1 = urlparse('http://www.baidu.com/s?#comment', allow_fragments=False)
result2 = urlparse('http://www.baidu.com/s?wd=urllib&ie=UTF-8#comment', allow_fragments=False)
print(result1, result2)
# with allow_fragments=False the part after '#' is not split off as a fragment;
# it is folded into the preceding component instead; compare result1 with result2

urlunparse (the inverse of urlparse):

from urllib.parse import urlunparse

# data takes the components in the order urlparse returns them;
# note that empty components still have to be written out, otherwise it raises an error
data = ['https', 'www.baidu.com', '/s', '', 'wd=urllib&ie=UTF-8', '']
print(urlunparse(data))

7.2 urljoin (joining URLs):

from urllib.parse import urljoin

# In short: the second argument is joined onto the first, whether it is a normal link or something made up;
# but if both arguments are complete URLs (http or https), no joining happens and the second one is printed.

print(urljoin('http://www.baidu.com', 'FQA.html'))
# http://www.baidu.com/FQA.html
print(urljoin('http://www.baidu.com', 'http://www.caiqingcai.com/FQA.html'))
# http://www.caiqingcai.com/FQA.html
print(urljoin('https://www.baidu.com/about.html', 'http://www.caiqingcai.com/FQA.html'))
# http://www.caiqingcai.com/FQA.html
print(urljoin('http://www.baidu.com/about.html', 'https://www.caiqingcai.com/FQA.html'))
# https://www.caiqingcai.com/FQA.html

7.3 urlencode (turning a dict into GET request parameters):

from urllib.parse import urlencode

params = {'name': 'Arise', 'age': 21}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
# http://www.baidu.com?name=Arise&age=21

7.4 robotparser (for parsing robots.txt):

Official documentation (for reference only):

21.10. urllib.robotparser - Parser for robots.txt - Python 3.6.4 documentation: docs.python.org

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.musi-cal.com/robots.txt")
rp.read()
rrate = rp.request_rate("*")
rrate.requests  # 3
rrate.seconds   # 20
rp.crawl_delay("*")  # 6
rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")  # False
rp.can_fetch("*", "http://www.musi-cal.com/")  # True

Original article: A Detailed Guide to the urllib Library - CSDN blog (blog.csdn.net)


Recommended reading:

How a programming beginner can write a crawler
Lessons learned from using a Node.js crawler to scrape ancient Chinese texts, 16,000 pages in total
