IV. The BeautifulSoup Library

03-06

BeautifulSoup workflow:

Step 1: Parse the HTML page

soup = BeautifulSoup(demo, "html.parser")

Step 2: Get tags

soup.find_all("a")

soup.a

Step 3: Get information about a tag

soup.a.name
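The three steps above can be put together in one short script. This is a minimal sketch: the inline html_doc string and the example.com links are made up for illustration and stand in for a downloaded page.

```python
from bs4 import BeautifulSoup

# Illustrative inline page (an assumption, not the demo page used in these notes).
html_doc = (
    '<html><head><title>Demo</title></head><body>'
    '<p class="course">See <a href="http://example.com/a" id="link1">Basic</a> and '
    '<a href="http://example.com/b" id="link2">Advanced</a>.</p>'
    '</body></html>'
)

# Step 1: parse the HTML with Python's built-in parser.
soup = BeautifulSoup(html_doc, "html.parser")

# Step 2: get tags. soup.a is the first <a>; find_all("a") returns all of them.
first_link = soup.a
all_links = soup.find_all("a")

# Step 3: read information from the tags.
tag_name = first_link.name
hrefs = [link.get("href") for link in all_links]
print(tag_name, hrefs)
```
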

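1. Introduction

Purpose: parses HTML and XML pages into a tree and extracts information from it

Installation: pip install beautifulsoup4

Documentation:

English:

Beautiful Soup Documentation (www.crummy.com)

Chinese:

Beautiful Soup 4.2.0 Documentation (www.crummy.com)

Getting the HTML source:

Manually: right-click the target page and choose "View page source"
With the Requests library: download the page's HTML in code

Example: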

import requests
from bs4 import BeautifulSoup

r = requests.get("https://python123.io/ws/demo.html")
print(r.text)
demo = r.text
# Pass in the content to parse and the name of the parser
soup = BeautifulSoup(demo, "html.parser")
print(soup.prettify())

# Output of print(r.text):
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>

# Output of print(soup.prettify()):
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

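Understanding the Beautiful Soup library: an HTML document is a tree of tags delimited by angle brackets. Beautiful Soup is a library for parsing, traversing, and maintaining that tag tree, and one BeautifulSoup object corresponds to the entire content of one HTML/XML document.

BeautifulSoup class <---> tag tree <---> HTML/XML document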

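2. Beautiful Soup parsers:

① Parser: bs4's default HTML parser (Python's built-in html.parser)

Usage: BeautifulSoup(mk, "html.parser")

Requirement: the bs4 library is installed

② Parser: lxml's HTML parser

Usage: BeautifulSoup(mk, "lxml")

Requirement: pip install lxml

③ Parser: lxml's XML parser

Usage: BeautifulSoup(mk, "xml")

Requirement: pip install lxml

④ Parser: the html5lib parser

Usage: BeautifulSoup(mk, "html5lib")

Requirement: pip install html5lib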
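Since only html.parser is guaranteed to be available, it can be handy to try the optional parsers first and fall back when they are not installed. The sketch below assumes a hypothetical helper named make_soup (not part of bs4); it relies on bs4 raising FeatureNotFound when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser in `parsers` that is installed."""
    for parser in parsers:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup, used = make_soup("<p>Hello <b>world</b></p>")
print(used, soup.b.string)
```

The parsers can build slightly different trees for broken markup, so it is worth pinning one parser once a scraper depends on the exact structure.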

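3. Basic elements of the Beautiful Soup classes:

● Tag

The p tag

<p> 最好看的美女圖片就在妹子圖，記住我們的網址 mmjpg.com Copyright © 2018 妹子圖湘ICP備16007494號-3 </p>

In a tag such as <p class="title"> ... </p>, p is the tag's name; it appears as an opening and closing pair that marks the tag's extent. class="title" is an attribute: class is the attribute name and "title" is its value. A tag has zero or more attributes, and each attribute is a key-value pair.

The a tag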

<a href="http://www.mmjpg.com/mm/1195" target="_blank"> 美臀嫩模小姐姐各種性感姿勢極度香艷 </a>

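from bs4 import BeautifulSoup

soup = BeautifulSoup(demo, "html.parser")
soup.title  # the text the browser shows in the tab or title bar
# Returns <title>This is a python demo page</title>
soup.a  # the a tag is the link (anchor) tag
# Returns <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

Any tag that exists in HTML syntax can be accessed as soup.<tag>. When the document contains several tags with the same name, soup.<tag> returns only the first one.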
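To see the "first match only" behaviour next to find_all, here is a small sketch. The html_doc fragment is a shortened, made-up version of the demo page's links.

```python
from bs4 import BeautifulSoup

# Shortened illustrative fragment with two <a> tags.
html_doc = '<p><a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a></p>'
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.a               # attribute access: only the first <a>
every = soup.find_all("a")   # find_all: every <a> in the document
missing = soup.table         # a tag that does not occur gives None

print(first["id"], [tag["id"] for tag in every], missing)
```
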

● Name (the tag's name)

from bs4 import BeautifulSoup

soup = BeautifulSoup(demo, "html.parser")
soup.a.name  # Returns 'a'
soup.a.parent.name  # Returns 'p'
soup.a.parent.parent.name  # Returns 'body'
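The chain of .parent.name lookups above can also be written as a loop that climbs until there is no parent left. A minimal sketch, using a made-up html_doc that mimics the nesting of the demo page:

```python
from bs4 import BeautifulSoup

# Illustrative markup: <a> inside <p> inside <body> inside <html>.
html_doc = "<html><body><p>Learn <a>Basic Python</a></p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

node = soup.a
chain = []
# Climb one level at a time; the BeautifulSoup object itself has no parent.
while node.parent is not None:
    node = node.parent
    chain.append(node.name)

print(chain)
```

The last name collected is "[document]", which is the name bs4 gives to the BeautifulSoup object at the root of the tree.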

● Attributes (the tag's attributes)

tag = soup.a
tag.attrs
# Returns {'class': ['py1'], 'id': 'link1', 'href': 'http://www.icourse163.org/course/BIT-268001'}
tag.attrs["class"]  # value of the class attribute
# Returns ['py1']
tag.attrs["href"]  # value of the link attribute
# Returns 'http://www.icourse163.org/course/BIT-268001'
type(tag.attrs)
# <class 'dict'>
type(tag)
# <class 'bs4.element.Tag'>
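Because attrs is a plain dict, a missing key raises KeyError. A tag also supports dict-style access directly, plus get() and has_attr() for attributes that may be absent. A short sketch; the extra class value "big" is added here only to show how multi-valued attributes come back.

```python
from bs4 import BeautifulSoup

# One link from the demo page, with an extra illustrative class value ("big").
html_doc = (
    '<a class="py1 big" href="http://www.icourse163.org/course/BIT-268001" '
    'id="link1">Basic Python</a>'
)
soup = BeautifulSoup(html_doc, "html.parser")
tag = soup.a

href = tag["href"]          # shorthand for tag.attrs["href"]
classes = tag["class"]      # class is multi-valued, so it is returned as a list
title = tag.get("title")    # absent attribute: None instead of KeyError
has_id = tag.has_attr("id")

print(href, classes, title, has_id)
```
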

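● NavigableString (the non-attribute string inside a tag)

This is the string between a tag's opening and closing angle-bracket pairs. For example:

print(soup.a.string)
# Basic Python

print(soup.p)
# <p class="title"><b>The demo python introduces several python courses.</b></p>
print(soup.p.string)
# The demo python introduces several python courses.

A NavigableString can be extracted across tag levels: in the example above the p tag contains a b tag, yet soup.p.string still returns the text.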
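The cross-level behaviour of .string only applies when each tag on the way down has exactly one child. When a tag has several children, .string is None and get_text() is the usual way to collect the text. A small sketch with a made-up two-paragraph html_doc:

```python
from bs4 import BeautifulSoup

# Illustrative markup: the first <p> has a single child, the second has several.
html_doc = (
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course">Learn <a>Basic Python</a> and <a>Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html_doc, "html.parser")
title_p, course_p = soup.find_all("p")

single = title_p.string         # passes through the lone <b> child
several = course_p.string       # None: <p> has more than one child
all_text = course_p.get_text()  # joins every string under the tag

print(single, several, all_text, sep="\n")
```
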

● Comment (the comment part of a string inside a tag)

newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
print(newsoup.b.string)
# Prints This is a comment
type(newsoup.b.string)
# Returns <class 'bs4.element.Comment'>
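Printing a Comment shows its text without the comment markers, so a comment can be mistaken for ordinary page text. One way to guard against that is a type check. A minimal sketch; visible_string is a hypothetical helper name, not a bs4 function.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = "<b><!--This is a comment--></b><p>This is not a comment</p>"
soup = BeautifulSoup(markup, "html.parser")


def visible_string(tag):
    """Hypothetical helper: return tag.string unless it is an HTML comment."""
    text = tag.string
    if isinstance(text, Comment):
        return None
    return text


print(visible_string(soup.b))  # the <b> tag holds only a comment
print(visible_string(soup.p))  # the <p> tag holds ordinary text
```
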

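4. Three ways to traverse the tree in BeautifulSoup:

Upward, downward, and sibling traversal

① Downward traversal:

.contents: a list of the tag's child nodes

.children: an iterator over the child nodes

.descendants: an iterator over all descendant nodes

Example:

soup = BeautifulSoup(demo, "html.parser")
soup.head
# Returns <head><title>This is a python demo page</title></head>
soup.head.contents
# Returns [<title>This is a python demo page</title>]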
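To make the difference between the three downward attributes concrete, the sketch below counts what each one yields on a tiny made-up document (html_doc is illustrative, not the demo page).

```python
from bs4 import BeautifulSoup

# Illustrative markup with two levels of nesting under <body>.
html_doc = "<body><p><b>Title</b></p><p>Text <a>link</a></p></body>"
soup = BeautifulSoup(html_doc, "html.parser")
body = soup.body

n_children = len(body.contents)   # .contents is a list, so len() and indexing work
second = body.contents[1]         # the second <p>
child_names = [child.name for child in body.children]  # direct children only
n_descendants = len(list(body.descendants))            # every nested node, strings included

print(n_children, child_names, n_descendants)
```

Here .descendants yields seven nodes (p, b, "Title", p, "Text ", a, "link") while .children yields only the two p tags.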

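Note: a tag's child nodes include string nodes as well as tag nodes. Use len() on .contents to count a tag's children, and list indexing to pick out a particular one.

# Iterate over the child nodes
for child in soup.body.children:
    print(child)

# Iterate over all descendant nodes
for child in soup.body.descendants:
    print(child)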

② Upward traversal:

.parent: the node's parent tag

.parents: an iterator over the node's ancestor tags

soup = BeautifulSoup(demo, "html.parser")
soup.title.parent
# Returns <head><title>This is a python demo page</title></head>

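③ Sibling traversal:

.next_sibling: the next sibling node in document order

.previous_sibling: the previous sibling node in document order

.next_siblings: an iterator over all following sibling nodes

.previous_siblings: an iterator over all preceding sibling nodes

Note: sibling traversal only covers nodes that share the same parent.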
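A common surprise with the sibling attributes is that the neighbour of a tag is often a string rather than another tag. The sketch below shows this on a made-up paragraph, and uses find_next_sibling, a bs4 method that skips ahead to the next matching tag.

```python
from bs4 import BeautifulSoup

# Illustrative paragraph: text and <a> tags alternate under one parent.
html_doc = (
    '<p>Learn <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html_doc, "html.parser")
first = soup.a

before = first.previous_sibling              # the string "Learn "
after = first.next_sibling                   # the string " and ", not the second <a>
next_link = first.find_next_sibling("a")     # skips strings, returns the second <a>
following = list(first.next_siblings)        # every later node under the same <p>

print(repr(before), repr(after), next_link["id"], len(following))
```
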

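To display an HTML page in a more readable form, use the .prettify() method.

Note: the NavigableString between tags is also a node, so the nodes returned by these traversals are not necessarily tags.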
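When only tags are wanted, the traversal results can be filtered by type. A minimal sketch on a made-up fragment, separating Tag nodes from NavigableString nodes:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Illustrative fragment: descendants mix tags and strings.
html_doc = "<body><p>Text <a>link</a></p></body>"
soup = BeautifulSoup(html_doc, "html.parser")

nodes = list(soup.body.descendants)
tag_names = [node.name for node in nodes if isinstance(node, Tag)]
strings = [str(node) for node in nodes if isinstance(node, NavigableString)]

print(tag_names, strings)
```
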