Downloading Ebooks with a Python Crawler
I spent the past two days refactoring crawler code I wrote six months ago. I figured it wouldn't take long; in the end it ate nearly four hours of my time.
Words fail me!
The six-month-old code was procedural: a handful of functions executed in sequence that eventually, slowly, churned out the PDF. It was fully functional, but its readability and extensibility were terrible. It is now fully object-oriented: the requests.Session handling is split out into a Crawler class, and the page parsing into a Parse class. The structure is much clearer and the coupling is greatly reduced compared to before, which basically meets my bar. A skeleton of the layout follows; the full code is in the Implementation section below.
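For orientation, here is the resulting class layout in skeleton form. The method bodies are elided and the docstrings are my own summaries; the complete implementation appears below.

```python
# Skeleton of the refactored layout: Crawler owns the requests.Session
# plumbing; Parse inherits from it and handles parsing and PDF generation.
import requests

class Crawler(object):
    """Session handling: login, xsrf token, logoff."""
    SESSION = requests.Session()

class Parse(Crawler):
    """Page parsing: collect chapter URLs, write HTML, convert to PDF."""
    def __init__(self, book_url, book_name):
        self.book_url = book_url
        self.book_name = book_name
```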
Once the core functionality worked, I wrote a cache function to store the Session for later reuse. It worked in local debugging, but I dropped it in the end. My idea was to keep the Session resident for a set period: before each run, check the cache and reuse a hit, creating a new Session only on a miss. But my cache function's contents are released as soon as the program exits, so every run still has to open a fresh Session connection. I have been learning Redis these past few days, and the effect I want probably requires Redis; a rough sketch of the idea follows.
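A minimal sketch of what that Redis-backed version could look like. The key name, TTL, and the `login_with_session` stub are placeholders of mine; it assumes a local Redis server and the `redis` Python package, and has not been tested against the site.

```python
# -*- coding: utf-8 -*-
# Sketch of a Redis-backed Session cache. CACHE_KEY and CACHE_TTL are
# placeholder values; login_with_session() stands in for the real login
# routine shown in the Implementation section.
import pickle
import redis
import requests

CACHE_KEY = "laidu:session_cookies"   # placeholder key name
CACHE_TTL = 3600                      # keep the login cookies for an hour

def login_with_session():
    """Stand-in for the real login routine below."""
    return requests.Session()

def get_session(client):
    raw = client.get(CACHE_KEY)
    if raw is not None:
        # Cache hit: rebuild a Session from the cookies of a previous run
        session = requests.Session()
        session.cookies = pickle.loads(raw)
        return session
    # Cache miss: log in, then persist the cookies with an expiry
    session = login_with_session()
    client.setex(CACHE_KEY, CACHE_TTL, pickle.dumps(session.cookies))
    return session

if __name__ == "__main__":
    session = get_session(redis.StrictRedis(host="localhost", port=6379, db=0))
```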
One caveat: after the pages have been saved as local HTML files, converting them to PDF with pdfkit is very time-consuming, so be aware of that.
Environment
macOS 10.11.6 + Anaconda Navigator 1.7.0 + Python 2.7.12 + Sublime Text 3.0
Key techniques
- Session handling with Requests
- Page parsing with BeautifulSoup
- pdfkit (note: you must install the wkhtmltopdf toolkit first; a quick smoke test follows this list)
- Decorators
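A quick way to verify the pdfkit + wkhtmltopdf setup before running the crawler. The binary path assumes a typical Homebrew-style install on macOS; point it at wherever wkhtmltopdf lives on your machine.

```python
# Smoke test: if this produces test.pdf, pdfkit can find wkhtmltopdf.
# The binary path is an assumption; adjust it for your own install.
import pdfkit

config = pdfkit.configuration(wkhtmltopdf="/usr/local/bin/wkhtmltopdf")
pdfkit.from_string("<h1>hello wkhtmltopdf</h1>", "test.pdf", configuration=config)
```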
Implementation
```python
# -*- coding: utf-8 -*-
"""
Created on Wed Mar 21 18:21:37 2018
Func: download an ebook from the Laidu site with a crawler
PGM:  python_learning_crawler_laidu_new.py
@author: benbendemo
@email: xc0910@hotmail.com
"""
import requests
from bs4 import BeautifulSoup
import re
import datetime
import profile
import pdfkit

PGMname = "PGM:python_learning_crawler_laidu_new"


class Crawler(object):
    """
    1. Base crawler class
    2. Defines the SESSION, LOGIN_URL and HEADER class attributes
    """
    SESSION = requests.Session()
    LOGIN_URL = "http://laidu.co/login"
    HEADER = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) "
                            "AppleWebKit/537.36 (KHTML, like Gecko) "
                            "Chrome/58.0.3029.110 Safari/537.36"}

    @classmethod
    def login_with_session(cls):
        """The target site requires login, so authenticate via SESSION.post."""
        # Register on Laidu first, then fill in your own email and password
        payload = {
            "email": "XXXXXXXX",
            "password": "XXXXXXXX",
            "_token": cls.get_xsrf()
        }
        cls.SESSION.post(url=cls.LOGIN_URL, data=payload)
        return cls.SESSION

    @classmethod
    def get_xsrf(cls):
        """Fetch the xsrf pseudo-random token needed before logging in."""
        response = cls.SESSION.get(url=cls.LOGIN_URL, headers=cls.HEADER)
        soup = BeautifulSoup(response.content, "html.parser")
        xsrf = soup.find("input", attrs={"name": "_token"}).get("value")
        print "---xsrf---:", xsrf
        return xsrf

    @classmethod
    def logoff_with_session(cls):
        """Close the SESSION once crawling is finished."""
        cls.SESSION.close()


class Parse(Crawler):
    """
    1. Page-parsing class, inherits from Crawler
    2. Defines the BASE_URL, HTML_TEMPLATE and regex class attributes
    """
    # Base address
    BASE_URL = "http://laidu.co"
    # HTML page template
    HTML_TEMPLATE = """
    <!DOCTYPE HTML>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <meta content="text/html; charset=utf-8" http-equiv="Content-Type">
    </head>
    <body>
    {content}
    </body>
    </html>
    """
    # Regex matching <img> tags in the HTML text
    IMG_PATTERN = r'(<img .*?src=")(.*?\.png|.*?\.jpg)(")(/>)'
    IMG_PATTERN_COMPILED = re.compile(IMG_PATTERN)
    # Regex greedily matching everything up to the last "/" in a URL
    URL_PATTERN = ".*/"
    URL_PATTERN_COMPILED = re.compile(URL_PATTERN)

    def __init__(self, book_url, book_name):
        # Ebook index page URL and ebook title
        self.book_url = book_url
        self.book_name = book_name

    # Decorator that reports how long the wrapped function takes
    def decorator(func):
        def wrapper(*args, **kwargs):
            print "*** Function Name:***", func.__name__
            #print "*** Function Args:***", args
            #print "*** Function kwargs:***", kwargs
            t1 = datetime.datetime.now()
            res = func(*args, **kwargs)
            t2 = datetime.datetime.now()
            print "*** Function Takes:***", (t2 - t1), "Time"
            return res
        return wrapper

    def parse_url(self, url):
        """Given a chapter URL, keep the prefix up to the last "/" (drop index.html)."""
        match = re.match(self.URL_PATTERN_COMPILED, url)
        if match:
            return match.group(0)
        else:
            return None

    def get_total_chapter_urls(self):
        """Parse every chapter URL out of the ebook index page; return them as a list."""
        resp = self.SESSION.get(self.book_url)
        soup = BeautifulSoup(resp.content, "html.parser")
        menu_tag = soup.find_all("ul", class_="summary")[0]
        url_total = []
        for li in menu_tag.find_all("li", class_="chapter"):
            url = self.BASE_URL + li.a.get("href")
            url_total.append(url)
        print "AAA url_total:", url_total
        return url_total

    def gen_book_html(self, chapter_name, chapter_url):
        """
        Crawl one chapter URL and write the corresponding HTML page.
        Relative image paths inside <img> tags are matched by the regex
        and rewritten to absolute paths with re.sub.
        """
        def gen_absolute_url(match):
            """re.sub passes a match object; join the pieces into an absolute image URL."""
            rtn = "".join([match.group(1),
                           self.parse_url(chapter_url),
                           match.group(2),
                           match.group(3),
                           match.group(4)])
            return str(rtn)

        resp = self.SESSION.get(chapter_url)
        soup = BeautifulSoup(resp.content, "html.parser")
        body = soup.find_all("div", class_="normal")[0]
        html_before = str(body)
        # Note: re.sub can take a function that receives a match object
        html_after = re.sub(self.IMG_PATTERN_COMPILED, gen_absolute_url, html_before)
        html = self.HTML_TEMPLATE.format(content=html_after)
        with open(chapter_name, "wb") as fp:
            fp.write(html)

    @decorator
    def transfer_html_2_pdf(self, htmls, bookname):
        """
        Convert all the HTML files into a single PDF.
        For the options see https://wkhtmltopdf.org/usage/wkhtmltopdf.txt
        """
        options = {
            "margin-top": "0.75in",
            "margin-right": "0.75in",
            "margin-bottom": "0.75in",
            "margin-left": "0.75in",
            "minimum-font-size": 75,
            "zoom": 4,
        }
        print "*** Transfer_html_2_pdf begin ***"
        config = pdfkit.configuration(wkhtmltopdf="/usr/local/bin/wkhtmltopdf")
        pdfname = str(bookname) + ".pdf"
        pdfkit.from_file(htmls, pdfname, options=options, configuration=config)
        print "*** Transfer_html_2_pdf end ***"

    @decorator
    def run(self):
        print "*** PGM begin ***"
        # Open the SESSION connection
        self.login_with_session()
        chapter_name_tot = []
        for chapter_index, chapter_url in enumerate(self.get_total_chapter_urls()):
            chapter_name = ".".join([str(self.book_name), str(chapter_index), "html"])
            chapter_name_tot.append(chapter_name)
            print "MAIN chapter_index:", chapter_index
            print "MAIN chapter_url:", chapter_url
            print "MAIN chapter_name:", chapter_name
            self.gen_book_html(chapter_name, chapter_url)
        # Merge all the chapter HTML files into one PDF
        self.transfer_html_2_pdf(chapter_name_tot, self.book_name)
        # Close the SESSION connection
        self.logoff_with_session()
        print "*** PGM end ***"


if __name__ == "__main__":
    bookindexurl = "http://laidu.co/books/7fa8fcfa612989251007dafde19a1e86/index.html"
    bookname = "把時間當作朋友"
    p = Parse(bookindexurl, bookname)
    p.run()
    # Profile performance with:
    # profile.run("p.run()")
```
Results
- The generated HTML and PDF files are shown below.
- In the PDF preview, note the red box: chapter 5, section 4, "逆命題", is misplaced. I checked, and it is not a parsing bug: in the ebook's HTML source, that section's heading is wrongly tagged as h1. Changing it to h2 by hand and regenerating the PDF fixes the problem (a sketch of an automated version of the fix follows the log below).
- The output log is shown below.
```
runfile('/Users/jacksonshawn/PythonCodes/pythonlearning/python_learning_crawler_laidu_new.py', wdir='/Users/jacksonshawn/PythonCodes/pythonlearning')
*** Function Name:*** run
*** PGM begin ***
---xsrf---: f8EMOPQer81CI4xwn3mQ8ccqnyLikVRNQAMk5887
MAIN chapter_index: 0
MAIN chapter_url: http://laidu.co/books/7fa8fcfa612989251007dafde19a1e86/index.html
MAIN chapter_name: 把時間當做朋友.0.html
MAIN chapter_index: 1
MAIN chapter_url: http://laidu.co/books/7fa8fcfa612989251007dafde19a1e86/Preface.html
MAIN chapter_name: 把時間當做朋友.1.html
MAIN chapter_index: 2
MAIN chapter_url: http://laidu.co/books/7fa8fcfa612989251007dafde19a1e86/Forword.html
MAIN chapter_name: 把時間當做朋友.2.html
MAIN chapter_index: 3
MAIN chapter_url: http://laidu.co/books/7fa8fcfa612989251007dafde19a1e86/Chapter0.html
MAIN chapter_name: 把時間當做朋友.3.html
MAIN chapter_index: 4
MAIN chapter_url: http://laidu.co/books/7fa8fcfa612989251007dafde19a1e86/Chapter1.html
MAIN chapter_name: 把時間當做朋友.4.html
MAIN chapter_index: 5
MAIN chapter_url: http://laidu.co/books/7fa8fcfa612989251007dafde19a1e86/Chapter2.html
MAIN chapter_name: 把時間當做朋友.5.html
MAIN chapter_index: 6
MAIN chapter_url: http://laidu.co/books/7fa8fcfa612989251007dafde19a1e86/Chapter3.html
MAIN chapter_name: 把時間當做朋友.6.html
MAIN chapter_index: 7
MAIN chapter_url: http://laidu.co/books/7fa8fcfa612989251007dafde19a1e86/Chapter4.html
MAIN chapter_name: 把時間當做朋友.7.html
MAIN chapter_index: 8
MAIN chapter_url: http://laidu.co/books/7fa8fcfa612989251007dafde19a1e86/Chapter5.html
MAIN chapter_name: 把時間當做朋友.8.html
MAIN chapter_index: 9
MAIN chapter_url: http://laidu.co/books/7fa8fcfa612989251007dafde19a1e86/Chapter6.html
MAIN chapter_name: 把時間當做朋友.9.html
MAIN chapter_index: 10
MAIN chapter_url: http://laidu.co/books/7fa8fcfa612989251007dafde19a1e86/Chapter7.html
MAIN chapter_name: 把時間當做朋友.10.html
*** Function Name:*** transfer_html_2_pdf
*** Transfer_html_2_pdf begin ***
Loading pages (1/6)
libpng warning: iCCP: known incorrect sRGB profile
... (warning repeated many times as the progress bar runs from 50% to 99%)
Counting pages (2/6)
Resolving links (4/6)
Loading headers and footers (5/6)
Printing pages (6/6)
Done
*** Transfer_html_2_pdf end ***
*** Function Takes:*** 0:01:09.791587 Time
*** PGM end ***
*** Function Takes:*** 0:01:16.134701 Time
```
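If you would rather patch that mis-tagged heading in code instead of by hand, something along these lines should work. It assumes the section title text "逆命題" identifies the bad heading; you would call it on the chapter HTML before writing it to disk.

```python
# -*- coding: utf-8 -*-
# Hypothetical automated version of the manual fix: demote any h1 whose
# text is "逆命題" to h2 before the chapter HTML is written out.
from bs4 import BeautifulSoup

def demote_mistagged_heading(html):
    soup = BeautifulSoup(html, "html.parser")
    for h1 in soup.find_all("h1"):
        if h1.get_text(strip=True) == u"逆命題":
            h1.name = "h2"   # bs4 lets you rename a tag in place
    return str(soup)
```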
Summary
- To generate PDFs with pdfkit, you must install wkhtmltopdf first. pdfkit is only the entry point; all the dirty work of actually producing the PDF is done by wkhtmltopdf. Once it is installed, make sure transfer_html_2_pdf points at the correct binary path.
- There is no caching mechanism yet. I have been reading up on Redis and plan to add a cache layer later.
- The @classmethod decorators in the Crawler class could be removed entirely; I tested it, and everything works without them. They just make the code look fancier.
- The decorator used in the Parse class could be pulled out into a standalone class to reduce the coupling further (a sketch follows this list).
- You can run profile.run("p.run()") as a profiling job. By far the most time-consuming step is generating the PDF; crawling the pages takes comparatively little time.
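A possible standalone version of that timing decorator, using functools.wraps so the wrapped function keeps its own name. This is my sketch of the refactor, not what the current code does.

```python
# timing.py - sketch of the timing decorator pulled out of Parse.
import datetime
import functools

def timed(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        t1 = datetime.datetime.now()
        res = func(*args, **kwargs)
        t2 = datetime.datetime.now()
        print "*** %s took %s ***" % (func.__name__, t2 - t1)
        return res
    return wrapper
```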
GitHub repo:
- Downloading Ebooks with a Python Crawler
References:
- pdfkit options reference