Writing a Python Crawler from Scratch --- 1.3 Parsers in the BS4 Library

01-28

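The bs4 library can locate the elements we want quickly because it first parses the HTML document into a tree. Different parsers give different results, and this post walks through them.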

Choosing a bs4 parser

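The ultimate goal of a web crawler is to filter and extract information from the web, so the parser is arguably its most important part: the quality of the parser determines the crawler's speed and efficiency. Besides the 'html.parser' parser we used in the previous post, bs4 also supports several third-party parsers, which we compare below.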

The official bs4 documentation recommends the lxml parser because it is faster, so that is the parser we will use as well.

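Installing the lxml parser:

As before, we install it with pip:

pip install lxml
Note: I am on a Unix-like system, where pip works very smoothly. On Windows, installing this way tends to run into one problem or another, so I recommend that Windows users go to the official lxml site and download the installer that matches their system version.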
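
If lxml turns out to be hard to install on a given machine, one option is to fall back to the built-in parser. The following is only a minimal sketch of that idea: the helper name pick_parser and the sample markup are made up for illustration and are not part of bs4.

```python
import importlib.util

from bs4 import BeautifulSoup


def pick_parser():
    """Return "lxml" if the lxml package is importable, else "html.parser"."""
    # find_spec returns None when the package is not installed.
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"


parser = pick_parser()
# Illustrative markup, not the demo file used in this post.
soup = BeautifulSoup("<p>Hello, <b>soup</b></p>", parser)
print(parser, soup.b.string)
```

Both parsers expose the same bs4 API, so the rest of the code does not need to change; only speed and the handling of malformed markup may differ.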

Parsing a web page with the lxml parser

We again use the Alice document from the previous post as our example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Let's try it:

import bs4

# First, cook the HTML file into a soup using the lxml parser
with open('Beautiful Soup 爬蟲/demo.html') as f:
    soup = bs4.BeautifulSoup(f, 'lxml')

# Print the result: a clearly laid-out tree structure.
print(soup.prettify())

OUT:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

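How do we actually use it?

bs4 first converts the string or file handle it is given to Unicode, so scraping Chinese text does not bring the usual encoding headaches. When the encoding is not detected correctly, which can happen with less common encodings such as 'big5', we have to set it by hand:
soup = BeautifulSoup(markup, from_encoding="<encoding name>")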
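
To make the from_encoding argument concrete, here is a small sketch. The byte string is a made-up sample that we encode as big5 ourselves; in a real crawler the bytes would come from the HTTP response. Note that from_encoding only matters for byte input, since a str is already Unicode.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a tiny page whose bytes are encoded as big5.
raw = "<html><body><p>中文</p></body></html>".encode("big5")

# Tell bs4 which encoding to use instead of letting it guess.
soup = BeautifulSoup(raw, "html.parser", from_encoding="big5")
text = soup.p.string

print(text)
print(soup.original_encoding)  # the encoding bs4 ended up using
```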

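Kinds of objects:

bs4 turns a complex HTML document into a complex tree in which every node is a Python object. All of these objects fall into four types: Tag, NavigableString, BeautifulSoup and Comment.

Let's go through them one by one: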

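Tag: essentially the same as a tag in HTML, and easy to pick up.
NavigableString: the string wrapped inside a tag.
BeautifulSoup: represents the entire content of a document. Most of the time it can be treated as a Tag object, and it supports the methods for traversing and searching the document tree.
Comment: a special kind of NavigableString. When it appears in an HTML document it is output with special formatting, for example as a comment.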
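
The four types can be checked directly with type(). The one-line document below is invented for this sketch so that it contains a tag, a string and a comment at the same time.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Made-up document holding a tag, a string and a comment.
soup = BeautifulSoup("<p><b>Bold text</b><!-- a note --></p>", "html.parser")

tag = soup.b                  # a Tag
text = tag.string             # the NavigableString inside <b>
comment = soup.p.contents[1]  # the Comment node after <b>

# Print the class name of each kind of object.
for obj in (soup, tag, text, comment):
    print(type(obj).__name__)
```

Because Comment is a subclass of NavigableString, code that collects strings will also pick up comments unless it checks for them explicitly.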

The simplest way to search the document tree is by the name of the tag you want:

soup.head
# <head><title>The Dormouse's story</title></head>

soup.title
# <title>The Dormouse's story</title>

You can also chain names to reach a tag nested further down. For example, to find the part wrapped in a b tag under body:

soup.body.b
# <b>The Dormouse's story</b>

However, this approach only finds the first tag with that name in document order.
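
A short sketch of that "first match only" behaviour, using a trimmed-down fragment of the Alice document (the fragment is our own abbreviation):

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.a         # attribute access: the first <a> in document order
same = soup.find("a")  # find() returns that same first match
missing = soup.table   # there is no <table>, so this is None

print(first["id"], same is first, missing)
```

Since a missing tag yields None rather than an error, chained access such as soup.table.tr would raise an AttributeError, so it is worth checking for None first.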

What about getting all of the tags?

This is where the find_all() method comes in. It returns a list:

tag = soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Suppose we want the second of the a tags:
need = tag[1]
# Easy, right?
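
find_all() also accepts filters, which often saves indexing into the list afterwards. The sketch below uses a fragment of the Alice document; the variable names are ours.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Filter by tag name and CSS class (class_ avoids the Python keyword).
sisters = soup.find_all("a", class_="sister")
hrefs = [a["href"] for a in sisters]

# Stop after the first two matches.
first_two = soup.find_all("a", limit=2)

# Filter by attribute only, whatever the tag name.
by_id = soup.find_all(id="link2")

print(hrefs)
print(len(first_two), by_id[0].string)
```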

A tag's .contents attribute gives the tag's children as a list:

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
print(title_tag)
# <title>The Dormouse's story</title>
title_tag.contents
# ["The Dormouse's story"]

In addition, a tag's .children iterator lets you loop over its children:

for child in title_tag.children:
    print(child)
# The Dormouse's story

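This only walks the direct children. How do we walk all of the descendants?

Descendants: for example, the child of head (head.contents) is <title>The Dormouse's story</title>, and title itself has a child, the string "The Dormouse's story". That string is also a descendant of head.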

for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story
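
To see the difference between children and descendants side by side, here is a small sketch that parses just the head of the Alice document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser"
)
head_tag = soup.head

children = list(head_tag.children)        # direct children only
descendants = list(head_tag.descendants)  # children, grandchildren, ...

print(len(children), len(descendants))
for node in descendants:
    print(type(node).__name__, node)
```

Here .children yields only the title tag, while .descendants yields the title tag and then the string inside it.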

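How do we get all of the text under a tag?

If the tag has only one child and it is a NavigableString, tag.string finds it directly.
If the tag has many children and grandchildren, each containing a string:

we can iterate over all of them: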

for string in soup.strings:
    print(repr(string))
# "The Dormouse's story"
# '\n\n'
# "The Dormouse's story"
# '\n\n'
# 'Once upon a time there were three little sisters; and their names were\n'
# 'Elsie'
# ',\n'
# 'Lacie'
# ' and\n'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n\n'
# '...'
# '\n'
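
The output above contains a lot of whitespace-only strings. As a hedged sketch of two related tools that bs4 provides, .stripped_strings and get_text(), here is a shortened version of the story paragraph (the fragment is our own abbreviation):

```python
from bs4 import BeautifulSoup

html = """<p class="story">Once upon a time there were three little sisters:
<a id="link1">Elsie</a>,
<a id="link2">Lacie</a> and
<a id="link3">Tillie</a>.</p>"""
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# .string is None here because <p> has more than one child.
single = p.string

# .stripped_strings skips whitespace-only strings and strips the rest.
clean = list(p.stripped_strings)

# get_text() joins all the strings; strip=True trims each piece first.
joined = p.get_text(" ", strip=True)

print(single)
print(clean)
print(joined)
```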

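That's it for the basic usage of the bs4 library. As for the remaining parts:

parent nodes, sibling nodes, and moving backward and forward all work much like finding elements through child nodes as shown above.

For the details, have a look at the official documentation.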
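
For readers who want a quick taste before opening the documentation, here is a brief sketch of parent, sibling and forward navigation. The one-line fragment and the variable names are ours; the attributes and methods are standard bs4.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="story"><a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")
link2 = soup.find(id="link2")

parent = link2.parent                         # the enclosing <p>
next_node = link2.next_sibling                # the string right after the tag
next_link = link2.find_next_sibling("a")      # the next <a> on the same level
prev_link = link2.find_previous_sibling("a")  # the previous <a>
forward = link2.next_element                  # next node in parse order

print(parent.name, repr(next_node))
print(prev_link["id"], next_link["id"], forward)
```

Note that .next_sibling returns the neighbouring string rather than the next tag, which is why find_next_sibling("a") is used to jump to the following link.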

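Each day's study notes are also posted to:
WeChat official account: findyourownway
Zhihu column: Writing a Python Crawler from Scratch (從零開始寫Python爬蟲) - Zhihu column
blog: www.ehcoblog.ml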