Fetching Zhihu Answers and Converting Them to Markdown Files

01-30

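20170609 update:

Thanks to 一介草民 and ftzz for their feedback.

(1) Fixed saving to paths that contain Chinese characters

(2) Fixed the offset problem

(3) Fixed the first problem

Here's something fun.

20170607 update:

(1) Thanks to Ftzz for the reminder: images are now replaced with the original full-size versions

(2) Files are now saved locally, which removes the biggest drawback: you no longer need to be online to read them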

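Hi everyone, I'm 四毛.

A few words up front

Before we start, I'd like to share something I found in a crawler script while browsing Github a while ago:

So when you crawl a site, please be considerate. Crawl gently, and cherish it while you can.

OK, back to the topic.

Today I'll mainly cover how to convert all the answers to a given Zhihu question into local Markdown files.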

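Prerequisites

python2.7
html2text
markdownpad (anything that can render md will do)
Knowing how to capture network traffic
Most importantly, a proxy, because Zhihu has started banning IPs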

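1. What is a Markdown file

Markdown is a lightweight "markup language" for writing. It replaces layout work with a concise syntax, unlike the word processors we usually use, such as Word or Pages, which come with a large number of layout and font settings. It lets us concentrate on writing and use "markup" syntax in place of the usual formatting. This post, for example, from content to formatting and even the illustrations, was produced entirely from the keyboard.

Well, I copied the above, haha. If you'd like to know more, take a look here.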

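2. Why convert answers to Markdown

Because... I'm lazy. Haha, just kidding. The main reason is that Markdown is comfortable to read. When writing scripts I've also kept wondering how to save a web page that mixes text and images just as it appears. With tools there are plenty of options: press CTRL+P and choose "Save as PDF" when printing, or use Evernote to save the whole page. But how do we do it with a crawler? A few days ago I happened to come across this project, studied it carefully, and found it very inspiring.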

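3. How it works

The principle is simple: take the BODY part of the response, build a new HTML file around it, convert that file to Markdown with the html2text module, and finally post-process the images and headings so they follow Markdown formatting. For now the main use case is Zhihu.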
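The last step, tidying the converted Markdown, can be sketched with the standard library alone. This is a minimal illustration, not the post's exact code: the function name `tidy_markdown` and the sample text are made up, and it assumes the converter leaves stray spaces inside bold markers and puts images on the same line as the following text.

```python
import re


def tidy_markdown(text):
    """Tidy converter output: trim bold markers and space out images."""
    # "** title **" -> "**title**": strip stray spaces inside bold markers
    text = re.sub(r"\*\*(.*?)\*\*",
                  lambda m: "**" + m.group(1).strip() + "**",
                  text, flags=re.S)
    # Add a blank line after every image so it renders as its own block
    text = re.sub(r"(!\[[^\]]*\]\([^)]*\))", r"\1\n\n", text)
    return text


# Hypothetical converter output, for illustration only
sample = "** Title **\n![](https://example.com/a.png)next paragraph"
print(tidy_markdown(sample))
```

Using `re.sub` with a callback rewrites each match in place, which avoids touching unrelated text that happens to contain the same characters.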

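4. Show Code

4.1 Fetching Zhihu answers

I had two use cases in mind when writing the code. First, fetch the data of one specific answer and convert it. Second, fetch all the answers to a question and convert them one by one; here the upvote count can serve as a quality filter on which answers to fetch.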
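The quality filter mentioned for the second use case might look like the sketch below. It assumes each answer is a dict with a `voteup_count` key, as in the API response used later in this post; the function name `filter_by_votes`, the threshold, and the sample data are made up for illustration.

```python
def filter_by_votes(answers, min_votes):
    """Keep only the answers whose upvote count reaches min_votes."""
    return [a for a in answers if a.get("voteup_count", 0) >= min_votes]


# Made-up sample data, for illustration only
answers = [
    {"id": 1, "voteup_count": 1200},
    {"id": 2, "voteup_count": 3},
    {"id": 3, "voteup_count": 50},
]
good = filter_by_votes(answers, min_votes=50)
print([a["id"] for a in good])
```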

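4.1.1 Fetching one specific answer

url: https://www.zhihu.com/question/27621722/answer/48658220 (the first number is the question ID, the second is the answer ID)

I fetch this data in two parts. The first part requests the URL above to get the answer body and the upvote count. The second part requests the following endpoint:

https://www.zhihu.com/api/v4/answers/48658220

Why do it this way? Because the answer body returned by this endpoint is not complete, so two steps are needed.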
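As a rough sketch of the first step, the two IDs can be pulled out of the answer URL with a regular expression and the endpoint URL built from the answer ID. The helper names `split_answer_url` and `urls_for_answer` are made up for this example; no request is sent here.

```python
import re

ANSWER_URL_RE = re.compile(
    r"^https://www\.zhihu\.com/question/(\d+)/answer/(\d+)")
ANSWER_API = "https://www.zhihu.com/api/v4/answers/{}"


def split_answer_url(answer_url):
    """Return (question_id, answer_id) parsed from an answer page URL."""
    match = ANSWER_URL_RE.match(answer_url)
    if match is None:
        raise ValueError("not a Zhihu answer URL: %s" % answer_url)
    return int(match.group(1)), int(match.group(2))


def urls_for_answer(answer_url):
    """Return the two URLs to request: the answer page and the endpoint."""
    _, answer_id = split_answer_url(answer_url)
    return answer_url, ANSWER_API.format(answer_id)


page_url, api_url = urls_for_answer(
    "https://www.zhihu.com/question/27621722/answer/48658220")
print(api_url)
```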

4.1.2 Fetching all answers to a question

This data is much easier to get. The endpoint is:

https://www.zhihu.com/api/v4/questions/27621722/answers?sort_by=default&include=data%5B%2A%5D.is_normal%2Cis_collapsed%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Cmark_infos%2Ccreated_time%2Cupdated_time%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cupvoted_followees%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=20&offset=3

Everything comes back as JSON, which is easy to work with. One thing to note: the answer body taken from this response is just a piece of text, not a complete HTML file, so a full document still has to be constructed around it.
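Walking through the pages of this endpoint can be sketched as a generator. The `data`, `paging`, `is_end` and `next` keys are the ones the script in this post reads from the JSON; the name `iter_answers`, the injected `fetch_json` callable, and the in-memory pages are assumptions made so the example runs without a network.

```python
def iter_answers(first_url, fetch_json):
    """Yield every answer dict, following paging.next until paging.is_end."""
    url = first_url
    while url:
        page = fetch_json(url)
        for answer in page.get("data", []):
            yield answer
        paging = page.get("paging", {})
        # Stop only after the last page has been processed, not before it
        url = None if paging.get("is_end", True) else paging.get("next")


# Hypothetical in-memory pages standing in for real API responses
fake_pages = {
    "page1": {"data": [{"id": 1}, {"id": 2}],
              "paging": {"is_end": False, "next": "page2"}},
    "page2": {"data": [{"id": 3}],
              "paging": {"is_end": True, "next": "page3"}},
}
ids = [a["id"] for a in iter_answers("page1", fake_pages.__getitem__)]
print(ids)
```

Checking `is_end` after yielding the page's answers matters: testing it first would silently drop the final page.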

4.1.3 Saved fields

author_name name of the user who answered
answer_id answer ID
question_id question ID
question_title question title
vote_up_count upvote count
create_time creation time

Answer body
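One way to keep these fields together is a small record type plus a function that renders the metadata header of each Markdown file. This is only a sketch: the `Answer` tuple, `render_header`, the header wording, and the sample values are made up, and `content` is assumed here as the field name for the answer body.

```python
from collections import namedtuple

# One record per saved answer; "content" holds the answer body
Answer = namedtuple("Answer", [
    "author_name", "answer_id", "question_id", "question_title",
    "vote_up_count", "create_time", "content",
])


def render_header(answer):
    """Build the metadata header written at the top of a Markdown file."""
    origin_url = "https://www.zhihu.com/question/{}/answer/{}".format(
        answer.question_id, answer.answer_id)
    lines = [
        "-" * 40,
        "## Original link: " + origin_url,
        "### question_title: " + answer.question_title,
        "### Author_Name: " + answer.author_name,
        "### Answer_ID: %d" % answer.answer_id,
        "### Question_ID: %d" % answer.question_id,
        "### VoteCount: %s" % answer.vote_up_count,
        "### Create_Time: " + answer.create_time,
        "-" * 40,
    ]
    return "\n".join(lines) + "\n"


# Made-up sample values, for illustration only
sample = Answer("some_author", 48658220, 27621722, "Sample question",
                128, "2017-06-05 12:00:00", "<p>answer body</p>")
print(render_header(sample))
```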

4.2 Code

zhihu.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Created by shimeng on 17-6-5
import os
import re
import json
import requests
import html2text
from parse_content import parse

"""
just for study and fun
Talk is cheap
show me your code
"""


class ZhiHu(object):
    def __init__(self):
        self.request_content = None

    def request(self, url, retry_times=10):
        header = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
            'authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20',
            'Host': 'www.zhihu.com'
        }
        times = 0
        while retry_times > 0:
            times += 1
            print 'request %s, times: %d' % (url, times)
            try:
                # placeholder: replace with the address of your own proxy
                ip = 'your proxy ip'
                proxy = None  # fall back to a direct request when no proxy is set
                if ip:
                    proxy = {
                        'http': 'http://%s' % ip,
                        'https': 'http://%s' % ip
                    }
                self.request_content = requests.get(url, headers=header, proxies=proxy, timeout=10).content
            except Exception as e:
                print e
                retry_times -= 1
            else:
                return self.request_content

    def get_all_answer_content(self, question_id, flag=2):
        # offset=0 so that the first answers are not skipped
        first_url_format = 'https://www.zhihu.com/api/v4/questions/{}/answers?sort_by=default&include=data%5B%2A%5D.is_normal%2Cis_collapsed%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Cmark_infos%2Ccreated_time%2Cupdated_time%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cupvoted_followees%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=20&offset=0'
        first_url = first_url_format.format(question_id)
        response = self.request(first_url)
        if response:
            contents = json.loads(response)
            print contents.get('paging').get('is_end')
            while True:
                for content in contents.get('data'):
                    self.parse_content(content, flag)
                # check is_end after parsing, so the last page is not dropped
                if contents.get('paging').get('is_end'):
                    break
                next_page_url = contents.get('paging').get('next').replace('http://', 'https://')
                contents = json.loads(self.request(next_page_url))
        else:
            raise ValueError('request failed, quit......')

    def get_single_answer_content(self, answer_url, flag=1):
        all_content = {}
        question_id, answer_id = re.findall(r'https://www.zhihu.com/question/(\d+)/answer/(\d+)', answer_url)[0]

        html_content = self.request(answer_url)
        if html_content:
            all_content['main_content'] = html_content
        else:
            raise ValueError('request failed, quit......')

        ajax_answer_url = 'https://www.zhihu.com/api/v4/answers/{}'.format(answer_id)
        ajax_content = self.request(ajax_answer_url)
        if ajax_content:
            all_content['ajax_content'] = json.loads(ajax_content)
        else:
            raise ValueError('request failed, quit......')

        self.parse_content(all_content, flag)

    def parse_content(self, content, flag=None):
        data = parse(content, flag)
        self.transform_to_markdown(data)

    def transform_to_markdown(self, data):
        content = data['content']
        author_name = data['author_name']
        answer_id = data['answer_id']
        question_id = data['question_id']
        question_title = data['question_title']
        vote_up_count = data['vote_up_count']
        create_time = data['create_time']

        # file name: "<question title>--<author>'s answer[<answer id>].md"
        file_name = u'%s--%s的回答[%d].md' % (question_title, author_name, answer_id)
        folder_name = u'%s' % (question_title)

        # write into the folder by path instead of os.chdir(), so that
        # repeated calls do not create nested folders
        if not os.path.exists(folder_name):
            os.mkdir(folder_name)

        f = open(os.path.join(folder_name, file_name), "wt")
        f.write("-" * 40 + "\n")
        origin_url = 'https://www.zhihu.com/question/{}/answer/{}'.format(question_id, answer_id)
        # "本答案原始鏈接" means "original link to this answer"
        f.write("## 本答案原始鏈接: " + origin_url + "\n")
        f.write("### question_title: " + question_title.encode('utf-8') + "\n")
        f.write("### Author_Name: " + author_name.encode('utf-8') + "\n")
        f.write("### Answer_ID: %d" % answer_id + "\n")
        f.write("### Question_ID: %d" % question_id + "\n")
        f.write("### VoteCount: %s" % vote_up_count + "\n")
        f.write("### Create_Time: " + create_time + "\n")
        f.write("-" * 40 + "\n")

        text = html2text.html2text(content.decode('utf-8')).encode("utf-8")
        # headings: strip stray whitespace inside bold and italic markers
        r = re.findall(r'\*\*(.*?)\*\*', text, re.S)
        for i in r:
            if i != " ":
                text = text.replace(i, i.strip())

        r = re.findall(r'_(.*)_', text)
        for i in r:
            if i != " ":
                text = text.replace(i, i.strip())
        text = text.replace('_ _', '')

        # images: add a blank line after each image
        r = re.findall(r'!\[\]\((?:.*?)\)', text)
        for i in r:
            text = text.replace(i, i + "\n\n")

        f.write(text)

        f.close()


if __name__ == '__main__':
    zhihu = ZhiHu()
    url = 'https://www.zhihu.com/question/27621722/answer/105331078'
    zhihu.get_single_answer_content(url)

    # question_id = 27621722
    # zhihu.get_all_answer_content(question_id)

zhihu.py is the main script. It is simple: it sends the requests, calls the parsing function, and saves the result.
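The retry behaviour of the request step can be isolated into a small helper. The sketch below is not the script's own code: `fetch_with_retry` and the `flaky` stand-in are hypothetical names, and the fetch function is injected so the example runs without a network or a proxy.

```python
import time


def fetch_with_retry(fetch, url, retry_times=10, delay=0.0):
    """Call fetch(url) until it succeeds or retry_times attempts are used."""
    last_error = None
    for _ in range(retry_times):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose: any failure retries
            last_error = error
            time.sleep(delay)
    raise RuntimeError("gave up on %s after %d attempts: %s"
                       % (url, retry_times, last_error))


# Hypothetical fetcher that fails twice before succeeding
attempts = []


def flaky(url):
    attempts.append(url)
    if len(attempts) < 3:
        raise IOError("temporary failure")
    return "ok"


print(fetch_with_retry(flaky, "https://www.zhihu.com/", retry_times=5))
```

Raising once the attempts are exhausted, instead of returning None, makes a permanent failure visible to the caller right away.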

Parsing script: parse_content.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Created by shimeng on 17-6-5
import time
from bs4 import BeautifulSoup


def html_template(data):
    # api content: wrap the answer fragment in a complete HTML document
    html = '''
        <html>
        <head>
        </head>
        <body>
        %s
        </body>
        </html>
        ''' % data
    return html


def parse(content, flag=None):
    data = {}
    if flag == 1:
        # single
        main_content = content.get('main_content')
        ajax_content = content.get('ajax_content')

        soup = BeautifulSoup(main_content.decode("utf-8"), "lxml")
        answer = soup.find("span", class_="RichText CopyrightRichText-richText")

        author_name = ajax_content.get('author').get('name')
        answer_id = ajax_content.get('id')
        question_id = ajax_content.get('question').get('id')
        question_title = ajax_content.get('question').get('title')
        vote_up_count = soup.find("meta", itemprop="upvoteCount")["content"]
        create_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(ajax_content.get('created_time')))

    else:
        # all
        answer_content = content.get('content')

        author_name = content.get('author').get('name')
        answer_id = content.get('id')
        question_id = content.get('question').get('id')
        question_title = content.get('question').get('title')
        vote_up_count = content.get('voteup_count')
        create_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(content.get('created_time')))

        content = html_template(answer_content)
        soup = BeautifulSoup(content, 'lxml')
        answer = soup.find("body")

    print author_name, answer_id, question_id, question_title, vote_up_count, create_time
    # not original: adapted from someone else's code
    soup.body.extract()
    soup.head.insert_after(soup.new_tag("body", **{'class': 'zhi'}))

    soup.body.append(answer)

    # lazy-loaded images keep their real address in data-actualsrc
    img_list = soup.find_all("img", class_="content_image lazy")
    for img in img_list:
        img["src"] = img["data-actualsrc"]
    img_list = soup.find_all("img", class_="origin_image zh-lightbox-thumb lazy")
    for img in img_list:
        img["src"] = img["data-actualsrc"]
    noscript_list = soup.find_all("noscript")
    for noscript in noscript_list:
        noscript.extract()

    data['content'] = soup
    data['author_name'] = author_name
    data['answer_id'] = answer_id
    data['question_id'] = question_id
    data['question_title'] = question_title
    data['vote_up_count'] = vote_up_count
    data['create_time'] = create_time

    return data

parse_content.py is mainly responsible for building the new HTML and then parsing it to extract the data.
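The "build a new HTML" step boils down to wrapping the answer fragment in a complete document before handing it to a parser. A minimal standalone sketch, where the names `HTML_TEMPLATE` and `wrap_fragment` are made up for this example:

```python
HTML_TEMPLATE = "<html>\n<head></head>\n<body>\n{body}\n</body>\n</html>\n"


def wrap_fragment(fragment):
    """Wrap an HTML fragment from the API in a complete HTML document."""
    # str.format only interprets braces in the template, not in the fragment
    return HTML_TEMPLATE.format(body=fragment)


print(wrap_fragment("<p>answer body</p>"))
```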

5. Test results

Well, there's more further down, so I won't screenshot the rest.

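6. Drawbacks and limitations

Now let's talk about the drawbacks of this approach.

The biggest drawback is:

You must be online!

That's because the md file only contains the URLs of the images, so the Markdown editor has to fetch each image from the server that hosts it. Going offline means you can't see the images, and if the user deletes an image you won't see it either.

However, I later found that when a file is exported to HTML from markdownpad, the whole content, images included, stays visible even offline. So if you really like an answer, saving it to Evernote is a good choice, and saving it directly as a PDF works well too. If you do use the method in this post, it's best to export the result to HTML.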
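Another way around the remote-image problem is to download each image and point the Markdown at the local copy. The sketch below only shows the link rewriting: `localize_images`, the `images` folder, and the `download` callback are hypothetical, and the callback is injected so that the example runs offline. In real use it could fetch the URL and write the bytes to the given path.

```python
import posixpath
import re

IMAGE_RE = re.compile(r"!\[([^\]]*)\]\((https?://[^)\s]+)\)")


def localize_images(markdown, download, image_dir="images"):
    """Rewrite remote image links to local paths.

    download(url, path) is called once for every image link found.
    """
    def replace(match):
        alt, url = match.group(1), match.group(2)
        # Name the local file after the last path segment of the URL
        name = posixpath.basename(url.split("?")[0])
        local_path = image_dir + "/" + name
        download(url, local_path)
        return "![{}]({})".format(alt, local_path)

    return IMAGE_RE.sub(replace, markdown)


# A stand-in downloader that only records what it was asked to fetch
saved = []
text = "intro ![](https://example.com/abc.jpg) outro"
print(localize_images(text, lambda url, path: saved.append((url, path))))
```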

Another drawback is that the output of html2text isn't especially good, so it still needs some post-processing afterwards.

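7. Summary

There is still plenty of room to improve the code. Feel free to get in touch: QQ:549411552 (please mention that you came from 靜覓)

As is customary: the code is here

That's all.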

Please credit when reposting: 靜覓 » Fetching Zhihu Answers and Converting Them to Markdown Files
