Searching HTML content with the bs4 library

01-28

Target page: This is a python demo page (http://python123.io/ws/demo.html)

The examples below mainly use BeautifulSoup's find_all method.

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> demo
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'

find_all( name , attrs , recursive , string , **kwargs )

Returns a list-type object that holds the search results.
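As a minimal sketch of what "a list-type object" means in practice, the snippet below parses a local copy of the demo page instead of fetching it. The `DEMO_HTML` string and the variable names are assumptions made for illustration, and the paragraph text is shortened.

```python
from bs4 import BeautifulSoup

# Local stand-in for the demo page, so the sketch runs without network access.
DEMO_HTML = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language.
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

soup = BeautifulSoup(DEMO_HTML, "html.parser")

links = soup.find_all("a")
print(isinstance(links, list))    # the result can be used like a list
print(len(links))                 # number of matching tags
print(links[0].get("href"))       # each element is a Tag object

missing = soup.find_all("table")  # no match gives an empty list, not None
print(missing)
```

Because an unsuccessful search yields an empty list, the result can be looped over directly without a None check.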

name: a search string matched against tag names

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all(['a', 'b'])
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

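>>> # True matches any value: the loop below finds every tag but returns no string nodes
>>> for tag in soup.find_all(True):
...     print(tag.name)
...
html
head
title
body
p
b
p
a
a
>>> import re
>>> # Regular expression: matches tag names that contain "b"
>>> for tag in soup.find_all(re.compile('b')):
...     print(tag.name)
...
body
b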
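The regular-expression case is easy to misread, so here is a small sketch under the assumption of a trimmed-down local HTML string (the `html` variable is illustrative, not the full demo page). Beautiful Soup applies the pattern with a search, so an unanchored pattern matches anywhere in the tag name.

```python
import re
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the demo page (an assumption for this sketch).
html = """<html><head><title>This is a python demo page</title></head>
<body><p><b>The demo python introduces several python courses.</b></p></body></html>"""
soup = BeautifulSoup(html, "html.parser")

# Unanchored pattern: any tag whose name contains "t".
containing_t = [tag.name for tag in soup.find_all(re.compile("t"))]

# Anchored pattern: only tags whose name starts with "t".
starting_t = [tag.name for tag in soup.find_all(re.compile("^t"))]

# A list matches any of the given names, in document order.
from_list = [tag.name for tag in soup.find_all(["title", "b"])]

print(containing_t)  # ['html', 'title']
print(starting_t)    # ['title']
print(from_list)     # ['title', 'b']
```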

attrs: a search string matched against tag attribute values; the attribute to search can also be named explicitly

>>> soup.find_all('p', 'course')
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>> soup.find_all(id='link1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> import re
>>> soup.find_all(id=re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
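The attribute search can be spelled in a few equivalent ways. The sketch below compares them on a small local HTML string, which is an assumption for illustration. `class_` and the `attrs` dictionary are standard Beautiful Soup parameters, although the session above does not use them.

```python
from bs4 import BeautifulSoup

# Small stand-in for the course paragraph of the demo page (an assumption).
html = """<p class="course">
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a>
<a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a></p>"""
soup = BeautifulSoup(html, "html.parser")

# A bare string in the second position is matched against the class attribute.
by_position = soup.find_all("p", "course")

# "class" is a Python keyword, so the keyword form is spelled class_.
by_class = soup.find_all("a", class_="py1")

# A dictionary names the attribute explicitly.
by_dict = soup.find_all("a", attrs={"id": "link2"})

print(len(by_position))    # 1
print(by_class[0].string)  # Basic Python
print(by_dict[0].string)   # Advanced Python
```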

recursive: whether to search all descendants; defaults to True

>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a', recursive=False)
[]
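The empty result above follows from where the a tags sit in the tree: they are not direct children of the node being searched. A hedged sketch on a local stand-in for the page (the `html` string is an assumption) makes the direct-child rule visible:

```python
from bs4 import BeautifulSoup

# Local stand-in for the demo page (an assumption for this sketch).
html = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful language.
<a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

# Only <html> is a direct child of the soup object itself.
top_level = [tag.name for tag in soup.find_all(True, recursive=False)]

# Both <p> tags are direct children of <body>.
body_paragraphs = soup.body.find_all("p", recursive=False)

# The <a> tags are grandchildren of <body>, so they are not found here.
body_links = soup.body.find_all("a", recursive=False)

print(top_level)             # ['html']
print(len(body_paragraphs))  # 2
print(body_links)            # []
```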

string: a search string matched against the string region inside <>...</>

>>> soup
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.find_all(string='Basic Python')
['Basic Python']
>>> import re
>>> soup.find_all(string=re.compile('python'))
['This is a python demo page', 'The demo python introduces several python courses.']
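Note that the regular-expression search above is case-sensitive, which is why strings such as "Basic Python" are missing from its result. The following sketch, run against a shortened local copy of the page (an assumption), contrasts the exact match, the case-sensitive pattern, and a case-insensitive one:

```python
import re
from bs4 import BeautifulSoup

# Shortened local stand-in for the demo page (an assumption for this sketch).
html = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language.
<a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the whole text node.
exact = [str(s) for s in soup.find_all(string="Basic Python")]

# A compiled pattern matches anywhere in the text node and is case-sensitive.
lower_only = [str(s) for s in soup.find_all(string=re.compile("python"))]

# re.IGNORECASE also picks up the nodes that spell it "Python".
any_case = [str(s) for s in soup.find_all(string=re.compile("python", re.IGNORECASE))]

print(exact)
print(len(lower_only), len(any_case))
```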

Tips:

<tag>(..) is equivalent to <tag>.find_all(..)
soup(..) is equivalent to soup.find_all(..)
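The shorthand can be checked directly. This is a small sketch on a local HTML string (an assumption for illustration) comparing the call form with the explicit method:

```python
from bs4 import BeautifulSoup

# Small local stand-in for the demo page (an assumption for this sketch).
html = """<html><body>
<p class="course">Python courses:
<a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

# Calling the soup object is the same as calling soup.find_all.
same_on_soup = soup("a") == soup.find_all("a")

# The same holds for any tag, with the same arguments.
same_on_tag = soup.body("p", "course") == soup.body.find_all("p", "course")

print(same_on_soup, same_on_tag)  # True True
print(len(soup("a")))             # 2
```

The call form is shorter, while the explicit `find_all` is easier to search for in a larger code base.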

Source: Python網路爬蟲與信息提取 (Python Web Crawling and Information Extraction), Beijing Institute of Technology, China University MOOC