Searching HTML Content with the bs4 Library

Target URL: This is a python demo page (http://python123.io/ws/demo.html)

The search relies mainly on BeautifulSoup's find_all method.

>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> demo
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'

find_all( name , attrs , recursive , string , **kwargs )

Returns a list-type object that stores the search results.

name: the search string for tag names
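As a small illustration of the return value, the sketch below runs find_all on a trimmed, offline stand-in for the demo page (the inline HTML is an assumption made so that no network access is needed). It suggests that the result can be treated like an ordinary list, and that a search with no matches gives an empty list rather than None.

```python
from bs4 import BeautifulSoup

# Trimmed offline stand-in for the demo page (assumption: not the full page).
html = (
    '<p class="course"><a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")        # two matching tags
missing = soup.find_all("table")  # no <table> in this snippet

print(isinstance(links, list))    # the result is a list subclass
print(len(links), links[0].get_text())
print(len(missing))               # an empty result, not None
```

Because the result supports len(), indexing, and iteration, it can be passed straight into a for loop or a list comprehension.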

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all(['a', 'b'])
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

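>>> # True matches any value: this finds every tag, but returns no string nodes
>>> for tag in soup.find_all(True):
...     print(tag.name)
...
html
head
title
body
p
b
p
a
a
>>> import re
>>> # A regular expression matches every tag whose name contains "b"
>>> for tag in soup.find_all(re.compile('b')):
...     print(tag.name)
...
body
b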
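The regular expression above matches both body and b because Beautiful Soup applies the pattern with search(), so any tag name containing the pattern qualifies. The sketch below, on a shortened offline copy of the page (an assumption, to avoid a network request), shows how anchoring the pattern narrows the match.

```python
import re
from bs4 import BeautifulSoup

# Shortened offline copy of the demo page (assumption: links omitted).
html = (
    "<html><head><title>This is a python demo page</title></head>"
    "<body><p><b>The demo python introduces several python courses.</b></p>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Unanchored: any tag name that contains "t".
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]
# Anchored at the start: only names that begin with "t".
starts_with_t = [tag.name for tag in soup.find_all(re.compile("^t"))]
# Anchored at both ends: only the tag named exactly "b".
exact_b = [tag.name for tag in soup.find_all(re.compile("^b$"))]

print(contains_t)
print(starts_with_t)
print(exact_b)
```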

attrs: the search string for tag attribute values; a specific attribute can also be named in the search

>>> soup.find_all('p', 'course')
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>> soup.find_all(id='link1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> import re
>>> soup.find_all(id=re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
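In find_all('p', 'course') the second positional argument is attrs, and a plain string there is treated as a CSS class name. The sketch below, using a trimmed offline snippet (an assumption) in place of the live page, compares that form with the more explicit class_ keyword and with an attrs dictionary, which can also combine several attributes.

```python
from bs4 import BeautifulSoup

# Trimmed offline stand-in for the demo page (assumption: href values omitted).
html = (
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course"><a class="py1" id="link1">Basic Python</a> and '
    '<a class="py2" id="link2">Advanced Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

# Three ways to ask for <p> tags whose class is "course".
by_position = soup.find_all("p", "course")
by_keyword = soup.find_all("p", class_="course")  # class is a Python keyword, hence class_
by_dict = soup.find_all("p", attrs={"class": "course"})

# A keyword argument names the attribute directly.
by_id = soup.find_all(id="link1")

# An attrs dictionary can require several attributes at once.
by_two = soup.find_all("a", attrs={"class": "py2", "id": "link2"})

print(len(by_position), len(by_keyword), len(by_dict))
print(by_id[0].get_text(), "|", by_two[0].get_text())
```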

recursive: whether to search all descendants; defaults to True

>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a', recursive=False)
[]
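The empty result above comes from the fact that no a tag is a direct child of soup itself. As a rough sketch on a trimmed offline snippet (an assumption, so the example runs without network access), the code below applies recursive=False at different levels of the tree to show which tags count as direct children.

```python
from bs4 import BeautifulSoup

# Trimmed offline stand-in for the demo page (assumption: head and hrefs omitted).
html = (
    "<html><body>"
    '<p class="title"><b>The demo python introduces several python courses.</b></p>'
    '<p class="course"><a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Only <html> sits directly under the soup object.
top_level = [tag.name for tag in soup.find_all(True, recursive=False)]

# Both <p> tags are direct children of <body> ...
body_paragraphs = soup.body.find_all("p", recursive=False)
# ... but the <a> tags are grandchildren of <body>, so they are skipped.
body_links = soup.body.find_all("a", recursive=False)
# Searching from the enclosing <p> finds them as direct children.
course_links = soup.find("p", "course").find_all("a", recursive=False)

print(top_level)
print(len(body_paragraphs), len(body_links), len(course_links))
```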

string: the search string for the text region between <>...</>

>>> soup
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.find_all(string='Basic Python')
['Basic Python']
>>> import re
>>> soup.find_all(string=re.compile('python'))
['This is a python demo page', 'The demo python introduces several python courses.']
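Note that re.compile('python') above is case-sensitive, which is why "Basic Python" and "Advanced Python" are missing from that result. The sketch below, on a shortened offline copy of the page (an assumption), contrasts an exact string match, a case-sensitive pattern, and a case-insensitive pattern, and shows that each hit is a string node whose enclosing tag is reachable through .parent.

```python
import re
from bs4 import BeautifulSoup

# Shortened offline copy of the demo page (assumption: hrefs and classes omitted).
html = (
    "<html><head><title>This is a python demo page</title></head>"
    "<body><p><b>The demo python introduces several python courses.</b></p>"
    "<p>Python is a wonderful general-purpose programming language. "
    '<a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the whole text node.
exact = soup.find_all(string="Basic Python")

# Patterns are applied with search(), so a substring is enough.
lower_only = soup.find_all(string=re.compile("python"))
any_case = soup.find_all(string=re.compile("python", re.IGNORECASE))

print(len(exact), exact[0].parent.name)  # the tag that holds the matched text
print(len(lower_only), len(any_case))
```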

Tips:

<tag>(..) is equivalent to <tag>.find_all(..)
soup(..) is equivalent to soup.find_all(..)

Source: Python Web Crawling and Information Extraction, Beijing Institute of Technology, China University MOOC
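A quick way to check the shorthand is to compare both spellings on the same tree. The sketch below does so on a trimmed offline snippet (an assumption, to keep the example self-contained); calling a soup or tag object simply forwards its arguments to find_all.

```python
from bs4 import BeautifulSoup

# Trimmed offline stand-in for the demo page (assumption: hrefs omitted).
html = (
    "<html><body>"
    '<p class="course"><a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Calling the soup object is shorthand for soup.find_all(...).
short_soup = soup("a")
long_soup = soup.find_all("a")

# The same shorthand works on any tag, including keyword arguments.
short_tag = soup.body("a", id="link2")
long_tag = soup.body.find_all("a", id="link2")

print([a["id"] for a in short_soup])
print(short_tag[0].get_text())
```

The explicit find_all spelling is usually easier to read in longer scripts; the call form is mostly a convenience at the interactive prompt.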
