Python Crawler Basics: Regular Expression Matching Examples (Part 2)

03-03

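Before we get to the examples, there is one very important point about metacharacters that the previous chapter left out.

That point is the escape character, \.

Since we write this escape character inside Python code, we first have to be clear about whether \ plays the same role in Python as it does in a regular expression, and if it does, how to deal with that. Let's start with a short piece of code.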

For example:

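import re

string = 'a\bc\\d123efg'  # the characters a, backspace, c, \, d, 1, 2, 3, e, f, g
a = re.findall('a\b', string)   # Python turns \b into a backspace character
b = re.findall('c\d', string)   # Python keeps the unrecognized escape \d as-is (recent versions warn about it)
c = re.findall('c\\d', string)  # escaping the backslash in Python so that \d stops meaning "a digit"; the result is still empty. Why?
print(a)
print(b)
print(c)

Output:

['a\x08']
[]
[]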

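So why did pattern a match while the patterns for c\d did not? The pattern a\b worked only because Python turned \b into a backspace character before the regex engine ever saw it, and the string contains that same character. In a regular expression, \d means "match a digit", so to match a literal backslash followed by d we have to escape the backslash at the regex level. But \ is also the escape character in Python string literals, so 'c\\d' reaches the regex engine as c\d, which still means "c followed by a digit". We need one more layer of backslashes so that the engine receives c\\d.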

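import re

string = 'a\bc\\d123efg'
a = re.findall('c\\\\d', string)  # four backslashes in the source, so the regex engine receives c\\d
print(a)

Output:

['c\\d']

(print shows the repr of the list, so the single backslash in the match is displayed as \\.)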
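The two layers of escaping can be checked directly. The sketch below is not part of the original example: it compares an ordinary string literal with a raw string (the `r` prefix) and with `re.escape`, a standard-library helper that adds the regex-level escaping for you. The variable names are illustrative, and `text` is assumed to hold the same characters as the example string.

```python
import re

# Assumed to match the example string: a, backspace, c, \, d, 1, 2, 3, e, f, g
text = 'a\bc\\d123efg'

plain_pattern = 'c\\\\d'   # ordinary literal: four backslashes typed, two reach the regex engine
raw_pattern = r'c\\d'      # raw string: what you type is what the regex engine receives
escaped_pattern = 'c' + re.escape('\\') + 'd'  # re.escape adds the regex-level escape

# All three spell the same regex, which matches the three characters c, \, d
same = plain_pattern == raw_pattern == escaped_pattern
matches = re.findall(raw_pattern, text)

print(same, len(raw_pattern), matches)
# → True 4 ['c\\d']
```

Raw strings are usually the least error-prone choice for patterns, since only the regex layer of escaping is left to think about.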

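The structures we need to match in real web pages are rarely this simple. By combining what we have covered so far, though, we can put together patterns that extract all kinds of data.

Let's look at a real example, using a page from a novel site picked at random.

First, be clear about what you want to scrape. Here, I want every book title that appears in the page below together with its ID, so that the two can later be joined into a single string: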

import re

string = '''<!doctype html><html><head><meta http-equiv="Cache-Control" content="no-siteapp"/><meta http-equiv="Cache-Control" content="no-transform"/><meta http-equiv="Content-Type" content="text/html; charset=gbk"/><title> 第九十八章大盜活人（2）_萬界天尊_筆趣閣</title><meta name="keywords" content="萬界天尊, 第九十八章大盜活人（2）"/><meta name="description" content="筆趣閣提供了血紅創作的玄幻小說《萬界天尊》乾淨清爽無錯字的文字章節：第九十八章大盜活人（2）在線閱讀。"/><link rel="stylesheet" type="text/css" href="/images/biquge.css"/><script type="text/javascript" src="//libs.baidu.com/jquery/1.4.2/jquery.min.js"></script><script type="text/javascript" src="/images/bqg.js"></script><script type="text/javascript">var preview_page = "/16_16662/8277223.html";var next_page = "/16_16662/8278646.html";var index_page = "/16_16662/";var article_id = "16662"; var chapter_id = "8277236"; function jumpPage() {var event = document.all ? window.event : arguments[0];if (event.keyCode == 37) document.location = preview_page;if (event.keyCode == 39) document.location = next_page;if (event.keyCode == 13) document.location = index_page;}document.onkeydown=jumpPage;</script></head>
<body><div id="wrapper"><script>login();</script> <div class="header"> <div class="header_logo"> <a href="http://www.biquge.com.tw">筆趣閣</a> </div> <script>bqg_panel();</script> </div>
<div class="nav"> <ul><li><a href="/">網站首頁</a></li><li><a href="/xuanhuan/">玄幻小說</a></li><li><a href="/xiuzhen/">修真小說</a></li><li><a href="/dushi/">都市小說</a></li><li><a href="/lishi/">歷史小說</a></li><li><a href="/wangyou/">網遊小說</a></li><li><a href="/kehuan/">科幻小說</a></li><li><a href="/kongbu/">恐怖小說</a></li><li><a href="/quanben/">全本小說</a></li> </ul> </div>
<div class="content_read"> <div class="box_con"> <div class="con_top"> <script>textselect();</script> <a href="/">筆趣閣</a> > 玄幻小說 > <a href="http://www.biquge.com.tw/16_16662/">萬界天尊</a> > 第九十八章大盜活人（2） </div>
<div class="bookname"> <h1> 第九十八章大盜活人（2）</h1> <div class="bottem1"> <a href="javascript:;" onclick="showpop(/modules/article/uservote.php?id=16662&ajax_request=1);">投推薦票</a> <a href="/16_16662/8277223.html">上一章</a> ← <a href="/16_16662/">萬界天尊</a> → <a href="/16_16662/8278646.html">下一章</a> <a href="javascript:;" onclick="showpop(/modules/article/addbookcase.php?id=16662&cid=8277236&ajax_request=1);">加入書籤</a> </div>
<div class="lm"> 熱門推薦：
<a href=/18_18820/ stylex=font-weight:bold>飛劍問道</a>
<a href=/0_703/ >斗戰狂潮</a>
<a href=/0_213/ stylex=font-weight:bold>一念永恆</a>
<a href=/16_16662/ >萬界天尊</a>
<a href=/16_16209/ stylex=font-weight:bold>我是至尊</a>
<a href=/19_19107/ >廚道仙途</a>
<a href=/18_18949/ stylex=font-weight:bold>大道朝天</a>
<a href=/18_18970/ >蒼穹之上</a>
<a href=/18_18489/ stylex=font-weight:bold>天行戰記</a>
<a href=/19_19019/ >我是仙凡</a>
<a href=/16_16802/ stylex=font-weight:bold>大劫主</a>
<a href=/8_8568/ >伏天氏</a>
<a href=/17_17380/ >紂臨</a>
<a href=/1_1237/ stylex=font-weight:bold>超品巫師</a>
<a href=/9_9651/ >他從地獄來</a>
<a href=/11_11850/ >聖墟</a>
<a href=/18_18698/ >漢鄉</a>
<a href=/6_6595/ >牧神記</a>
<a href=/3_3907/ >我真是大明星</a>
<a href=/18_18186/ >重生之魔教教主</a>'''

# compile() builds a regular expression object; the r prefix marks a raw string,
# so backslashes reach the regex engine unchanged.
# Group 1 is the book ID, group 2 any extra attribute, group 3 the title.
# Without re.S, "." stops at line breaks, so each link must sit on its own line.
re_exp = re.compile(r"<a href=(/\d{1,2}\w\d{3,5}/) (>$|.*)>(.*?)</a>", re.M)
a = re.findall(re_exp, string)
print(a)
for i in a:
    print(i[0], i[2])

The value of a is a list of tuples, one tuple per link:

[('/18_18820/', 'stylex=font-weight:bold', '飛劍問道'),
 ('/0_703/', '', '斗戰狂潮'),
 ('/0_213/', 'stylex=font-weight:bold', '一念永恆'),
 ('/16_16662/', '', '萬界天尊'),
 ('/16_16209/', 'stylex=font-weight:bold', '我是至尊'),
 ('/19_19107/', '', '廚道仙途'),
 ('/18_18949/', 'stylex=font-weight:bold', '大道朝天'),
 ('/18_18970/', '', '蒼穹之上'),
 ('/18_18489/', 'stylex=font-weight:bold', '天行戰記'),
 ('/19_19019/', '', '我是仙凡'),
 ('/16_16802/', 'stylex=font-weight:bold', '大劫主'),
 ('/8_8568/', '', '伏天氏'),
 ('/17_17380/', '', '紂臨'),
 ('/1_1237/', 'stylex=font-weight:bold', '超品巫師'),
 ('/9_9651/', '', '他從地獄來'),
 ('/11_11850/', '', '聖墟'),
 ('/18_18698/', '', '漢鄉'),
 ('/6_6595/', '', '牧神記'),
 ('/3_3907/', '', '我真是大明星'),
 ('/18_18186/', '', '重生之魔教教主')]

The for loop then pulls out just the data we want, the ID and the title:

/18_18820/ 飛劍問道
/0_703/ 斗戰狂潮
/0_213/ 一念永恆
/16_16662/ 萬界天尊
/16_16209/ 我是至尊
/19_19107/ 廚道仙途
/18_18949/ 大道朝天
/18_18970/ 蒼穹之上
/18_18489/ 天行戰記
/19_19019/ 我是仙凡
/16_16802/ 大劫主
/8_8568/ 伏天氏
/17_17380/ 紂臨
/1_1237/ 超品巫師
/9_9651/ 他從地獄來
/11_11850/ 聖墟
/18_18698/ 漢鄉
/6_6595/ 牧神記
/3_3907/ 我真是大明星
/18_18186/ 重生之魔教教主
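A pattern that uses a greedy `.*` inside the tag only behaves when every link sits on its own line; with several links on one line it would run from the first link to the last. As a hedged alternative (not the author's pattern), the sketch below uses `[^>]*` to stay inside a single tag and `re.finditer` with named groups, then joins each ID to the site root to build a full URL. The shortened `html` sample, the group names `book_id` and `title`, and the choice of `urljoin` with `http://www.biquge.com.tw` as the base are assumptions made for illustration.

```python
import re
from urllib.parse import urljoin

# Shortened sample with several links on ONE line, the case a greedy ".*" mishandles
html = ('<a href=/18_18820/ stylex=font-weight:bold>飛劍問道</a> '
        '<a href=/0_703/ >斗戰狂潮</a> '
        '<a href=/0_213/ stylex=font-weight:bold>一念永恆</a>')

BASE_URL = 'http://www.biquge.com.tw'  # site root, as it appears in the page's own links

# [^>]* consumes any extra attributes but can never run past the end of the tag
link_re = re.compile(r'<a href=(?P<book_id>/\d{1,2}_\d{3,5}/)[^>]*>(?P<title>.*?)</a>')

books = [(m.group('title'), urljoin(BASE_URL, m.group('book_id')))
         for m in link_re.finditer(html)]

for title, url in books:
    print(title, url)
# → 飛劍問道 http://www.biquge.com.tw/18_18820/
# → 斗戰狂潮 http://www.biquge.com.tw/0_703/
# → 一念永恆 http://www.biquge.com.tw/0_213/
```

For anything beyond a quick extraction like this, an HTML parser is generally more reliable than a regular expression, but the regex version keeps the focus on the matching rules covered here.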

Example 2:

Extracting birthday information:

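import re

# Birthdays come in many formats; how do we extract them all?
# (出生於 means "was born on"; 年, 月, 日 mean year, month, day.)
string = '''A出生於1990年05月21日
B出生於1994/02/02
C出生於1993/8/1
D出生於1988年9月2
E出生於1986-03-29
F出生於1986/04
G出生於1990-10
H出生於97-10'''

# Regular expression object.
# re.M lets $ match at the end of every line. re.S is left out so that "."
# cannot cross line breaks and swallow several records into the name group.
re_exp = re.compile(r"(.*?)出生於(\d{2,4}[年/-]\d{1,2}([月/-]\d{1,2}|[月/-]$|$))", re.M)
birthday = re.findall(re_exp, string)
print(birthday)

Output:

[('A', '1990年05月21', '月21'), ('B', '1994/02/02', '/02'), ('C', '1993/8/1', '/1'), ('D', '1988年9月2', '月2'), ('E', '1986-03-29', '-29'), ('F', '1986/04', ''), ('G', '1990-10', ''), ('H', '97-10', '')]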
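Once the raw date strings are extracted, a natural next step is to normalize them. The sketch below is an extension under stated assumptions, not part of the original example: it captures the year, the month, and an optional day with named groups, and expands two-digit years by assuming they belong to the 1900s. The helper name `parse_birthdays` and the shortened `records` sample are hypothetical.

```python
import re

# Shortened sample: one record per line, with and without a day
records = '''A出生於1990年05月21日
C出生於1993/8/1
F出生於1986/04
H出生於97-10'''

birthday_re = re.compile(
    r'^(?P<name>.+?)出生於'                      # name, then the literal "was born on"
    r'(?P<year>\d{2,4})[年/-](?P<month>\d{1,2})'  # year and month with any separator
    r'(?:[月/-](?P<day>\d{1,2}))?',               # optional day, in a non-capturing wrapper
    re.M,                                         # ^ matches at the start of every line
)

def parse_birthdays(text):
    """Return {name: (year, month, day)}; day is None when it is missing."""
    result = {}
    for m in birthday_re.finditer(text):
        year = int(m.group('year'))
        if year < 100:
            year += 1900  # assumption: two-digit years are 19xx
        day = int(m.group('day')) if m.group('day') else None
        result[m.group('name')] = (year, int(m.group('month')), day)
    return result

print(parse_birthdays(records))
# → {'A': (1990, 5, 21), 'C': (1993, 8, 1), 'F': (1986, 4, None), 'H': (1997, 10, None)}
```

Named groups make the result independent of group order, and the non-capturing `(?: ... )` wrapper keeps the separator out of the captured values.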

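That is all for this chapter. If you have any questions, leave them in the comments. The next chapter covers searching and replacing in strings.

If anything here falls short, corrections are very welcome!

Author: Sruty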