Searching HTML content with the bs4 library
Target page: This is a python demo page (http://python123.io/ws/demo.html)
The examples mainly use BeautifulSoup's find_all method.
>>> import requests
>>> r = requests.get("http://python123.io/ws/demo.html")
>>> demo = r.text
>>> demo
'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'
find_all( name , attrs , recursive , string , **kwargs )
Returns a list-type object that stores the search results.
name : the search string matched against tag names
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all(['a', 'b'])
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
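>>> # True matches any value: the loop below finds every tag, but returns no string nodes
>>> for tag in soup.find_all(True):
...     print(tag.name)
...
html
head
title
body
p
b
p
a
a
>>> import re
>>> # Regular expression: matches every tag whose name contains "b"
>>> for tag in soup.find_all(re.compile('b')):
...     print(tag.name)
...
body
b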
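The different kinds of name filter can be compared side by side. The following is a minimal offline sketch, assuming a local copy of the demo page held in a hypothetical DEMO_HTML constant instead of being fetched over the network:

```python
import re

from bs4 import BeautifulSoup

# Assumption: a local copy of the demo page, so the sketch runs offline
DEMO_HTML = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

soup = BeautifulSoup(DEMO_HTML, "html.parser")

# A string matches one tag name exactly
links = soup.find_all("a")

# A list matches any of the listed tag names, in document order
names_from_list = [tag.name for tag in soup.find_all(["a", "b"])]

# True matches every tag (string nodes are not returned)
all_tag_names = [tag.name for tag in soup.find_all(True)]

# A compiled regex is applied with search(), so "b" matches names containing "b"
names_with_b = [tag.name for tag in soup.find_all(re.compile("b"))]

print(len(links), names_from_list, all_tag_names, names_with_b)
```

The result of find_all behaves like a list, so it can be indexed, sliced, and iterated in the usual way.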
attrs : the search string matched against tag attribute values; a specific attribute can also be named as a keyword argument
>>> soup.find_all('p', 'course')
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
>>> soup.find_all(id='link1')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
>>> import re
>>> soup.find_all(id=re.compile('link'))
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
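Attribute filters match exactly unless a regular expression is supplied. A minimal offline sketch of the common forms, again assuming a hypothetical DEMO_HTML constant holding a local copy of the demo page:

```python
import re

from bs4 import BeautifulSoup

# Assumption: a local copy of the demo page, so the sketch runs offline
DEMO_HTML = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

soup = BeautifulSoup(DEMO_HTML, "html.parser")

# A string in the second positional slot is treated as a CSS class
course_paragraphs = soup.find_all("p", "course")

# Keyword arguments filter on the named attribute; plain strings match exactly
exact_id = soup.find_all(id="link1")
partial_id = soup.find_all(id="link")          # no tag has id exactly "link"

# A compiled regex allows partial matches on the attribute value
regex_ids = [tag["id"] for tag in soup.find_all(id=re.compile("link"))]

# class is a Python keyword, so use class_ or an attrs dictionary
py1_links = soup.find_all(class_="py1")
py2_links = soup.find_all("a", attrs={"class": "py2"})

print(len(course_paragraphs), len(exact_id), len(partial_id), regex_ids)
```

The empty result for id="link" is the usual surprise: without a regular expression, the attribute value has to match in full.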
recursive : whether to search all descendants; defaults to True
>>> soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
>>> soup.find_all('a', recursive=False)
[]
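The empty list above follows from the tree structure: with recursive=False only direct children of the node are examined, and the only direct child of soup is the html tag. A minimal offline sketch, assuming a hypothetical DEMO_HTML constant holding a local copy of the demo page:

```python
from bs4 import BeautifulSoup

# Assumption: a local copy of the demo page, so the sketch runs offline
DEMO_HTML = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

soup = BeautifulSoup(DEMO_HTML, "html.parser")

# Default (recursive=True): all descendants are searched
all_links = soup.find_all("a")

# recursive=False: only direct children of the node are searched
top_level_links = soup.find_all("a", recursive=False)      # <a> is nested deeper
top_level_html = soup.find_all("html", recursive=False)    # <html> is a direct child

# The same rule applies when starting from any tag
body_paragraphs = soup.body.find_all("p", recursive=False)  # both <p> are children of <body>
body_links = soup.body.find_all("a", recursive=False)       # <a> sits inside <p>, not <body>

print(len(all_links), len(top_level_links), len(top_level_html),
      len(body_paragraphs), len(body_links))
```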
string : the search string matched against the text between <>...</>
>>> soup
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
>>> soup.find_all(string='Basic Python')
['Basic Python']
>>> import re
>>> soup.find_all(string=re.compile('python'))
['This is a python demo page', 'The demo python introduces several python courses.']
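String filters are case-sensitive and, like attribute filters, match exactly unless a regular expression is used. A minimal offline sketch, assuming a hypothetical DEMO_HTML constant holding a local copy of the demo page:

```python
import re

from bs4 import BeautifulSoup

# Assumption: a local copy of the demo page, so the sketch runs offline
DEMO_HTML = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

soup = BeautifulSoup(DEMO_HTML, "html.parser")

# A plain string must equal the whole text node
exact = soup.find_all(string="Basic Python")

# A regex matches text nodes that contain the pattern (lowercase "python" only)
lowercase_hits = soup.find_all(string=re.compile("python"))

# re.IGNORECASE also picks up the nodes that spell it "Python"
any_case_hits = soup.find_all(string=re.compile("python", re.IGNORECASE))

# Combined with a tag name, find_all returns the tags whose text matches
basic_link = soup.find_all("a", string="Basic Python")

print(exact, len(lowercase_hits), len(any_case_hits), basic_link[0]["id"])
```

With only string given, the results are text nodes; adding a tag name switches the results to the enclosing tags.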
Tips:
<tag>(..) is equivalent to <tag>.find_all(..)
soup(..) is equivalent to soup.find_all(..)
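The shorthand can be checked directly: calling a BeautifulSoup object or a tag forwards the arguments to find_all. A minimal offline sketch, assuming a hypothetical DEMO_HTML constant holding a local copy of the demo page:

```python
from bs4 import BeautifulSoup

# Assumption: a local copy of the demo page, so the sketch runs offline
DEMO_HTML = """<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>
</body></html>"""

soup = BeautifulSoup(DEMO_HTML, "html.parser")

# soup(..) forwards to soup.find_all(..)
short_ids = [tag["id"] for tag in soup("a")]
long_ids = [tag["id"] for tag in soup.find_all("a")]

# <tag>(..) forwards to <tag>.find_all(..), including extra arguments
short_course = soup.body("p", "course")
long_course = soup.body.find_all("p", "course")

print(short_ids == long_ids, len(short_course) == len(long_course))
```

The explicit find_all spelling is usually easier to read in shared code; the call shorthand is mostly convenient in an interactive session.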