Turning on picture-only mode for Zhihu collections
The code is here, in Python 3~
wzyonggege/Zhihu-Crawler
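When browsing Zhihu, we often run into a certain kind of collection. Yes, you know the kind~
I browse them too~
So I wanted to write a small Python program that turns on a "pictures only" mode~
Something like this: a quick, simple crawl~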
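The program is fairly simple and fairly gentle on the site. It first simulates a logged-in session with a cookie. Each collection page is then requested once to get the links to its ten answers, each answer link is visited once to collect the image URLs on that page, and the images are written to local disk~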
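I won't go into simulated login in detail; here is just one fairly simple method. After logging in to Zhihu in the browser, open the developer tools, find the request for the main page, go to Network -> Headers -> Request Headers -> Cookie, and copy the entire value. That string is what we use to simulate the login.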
Once the cookie is copied, use requests to fetch the page (the code in this post is Python 2.7):
    import requests

    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36',
        'Cookie': 'cookie',  # replace with your own cookie string
    }
    url = 'https://www.zhihu.com/collection/69135664'
    response = requests.get(url, headers=headers).content
That is all it takes to get a simple simulated login working.
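As a small complement, the copied Cookie string can also be split into name/value pairs, which makes it easier to check that nothing was cut off while copying. This is only a minimal sketch in Python 3, not the code from this post; `parse_cookie_header`, `build_headers` and the sample cookie names are hypothetical and made up for illustration.

```python
def parse_cookie_header(raw_cookie):
    """Split a copied 'Cookie' header value into a dict of name -> value."""
    cookies = {}
    for pair in raw_cookie.split(";"):
        # Split on the first '=' only, since values may contain '=' themselves.
        name, sep, value = pair.strip().partition("=")
        if sep and name:  # skip empty fragments and pairs without '='
            cookies[name] = value
    return cookies


def build_headers(raw_cookie, user_agent="Mozilla/5.0"):
    """Build the headers dict used for the simulated login."""
    return {"User-Agent": user_agent, "Cookie": raw_cookie.strip()}


# Hypothetical cookie string; a real one copied from the browser is much longer.
sample = "sid=abc123; token=x=y; theme=dark"
print(parse_cookie_header(sample))
print(build_headers(sample)["Cookie"])
```

If you prefer, the parsed dict can be handed to requests through its `cookies=` parameter instead of sending a raw `Cookie` header.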
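Next come pagination and extracting tags from the page, both fairly simple, so feel free to explore them yourself (for what this post needs, there is no need to call the Zhihu API to parse dynamic resources such as JSON). I'll just offer a rough starting point here.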
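To make the pagination and tag extraction a bit more concrete, here is a minimal sketch in Python 3 that uses only the standard library. It swaps in `html.parser` for the lxml XPath used in this post, and the sample HTML is a made-up fragment that only mimics the `zm-item-answer` structure. The names `page_urls` and `AnswerLinkParser` are hypothetical.

```python
from html.parser import HTMLParser

BASE = "https://www.zhihu.com"


def page_urls(collection_num, pages):
    """Return the URL of each collection page, numbered from 1."""
    return ["{}/collection/{}?page={}".format(BASE, collection_num, i)
            for i in range(1, pages + 1)]


class AnswerLinkParser(HTMLParser):
    """Collect href values of <link> tags found inside div.zm-item-answer."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # div nesting depth while inside an answer block
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div":
            # Start counting at the answer div, keep counting nested divs.
            if self.depth or "zm-item-answer" in classes:
                self.depth += 1
        elif tag == "link" and self.depth and attrs.get("href"):
            self.links.append(BASE + attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


# Made-up fragment for illustration, not real Zhihu markup.
sample = '''
<div class="zm-item">
  <div class="zm-item-answer ">
    <link itemprop="url" href="/question/1/answer/2">
    <div class="content">text</div>
  </div>
</div>
<link href="/not/an/answer">
'''
parser = AnswerLinkParser()
parser.feed(sample)
print(page_urls(69135664, 2))
print(parser.links)
```

Unlike the XPath `/link` step, which matches direct children only, this parser accepts any `<link>` nested inside the answer block, so treat it as an approximation.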
    # coding:utf-8
    import requests
    from lxml import html
    import os

    # If you run into encoding problems, you can add the following three lines
    # import sys
    # reload(sys)
    # sys.setdefaultencoding('utf-8')

    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36',
        'Cookie': 'cookie',  # replace with your own cookie string
    }


    def get_link_list(collection_num):
        # raw_input + int avoids Python 2's input(), which evaluates what is typed
        page = int(raw_input('How many pages do you want? (Mind your health~): '))
        result = []
        collection_title = None
        for i in range(1, page + 1):
            link = 'https://www.zhihu.com/collection/{}?page={}'.format(collection_num, i)
            response = requests.get(link, headers=headers).content
            sel = html.fromstring(response)
            # Create the folder on the first page only
            if collection_title is None:
                # Name of the collection
                collection_title = sel.xpath('//h2[@class="zm-item-title zm-editable-content"]/text()')[0].strip()
                if not os.path.exists(collection_title):
                    os.mkdir(collection_title)
            # Each answer block carries a <link> whose href is the answer's relative URL
            each = sel.xpath('//div[@class="zm-item"]//div[@class="zm-item-answer "]/link')
            for e in each:
                link = 'https://www.zhihu.com' + e.xpath('@href')[0]
                result.append(link)
        return [collection_title, result]


    def get_pic(collection, answer_link):
        response = requests.get(answer_link, headers=headers).content
        sel = html.fromstring(response)
        title = sel.xpath('//h1[@class="QuestionHeader-title"]/text()')[0].strip()
        try:
            author = sel.xpath('//a[@class="UserLink-link"]/text()')[0].strip()
        except IndexError:
            # Anonymous users have no profile link
            author = u'Anonymous'
        # Build the folder path for this answer
        path = collection + '/' + title + '-' + author
        try:
            if not os.path.exists(path):
                os.mkdir(path)
            n = 1
            for i in sel.xpath('//div[@class="RichContent-inner"]//img/@src'):
                # Skip whitedot links
                if 'whitedot' not in i:
                    # print i
                    pic = requests.get(i).content
                    fname = path + '/' + str(n) + '.jpg'
                    with open(fname, 'wb') as p:
                        p.write(pic)
                    n += 1
            print u'{} done'.format(title)
        except Exception:
            # Skip this answer if the folder cannot be created or a download fails
            pass


    if __name__ == '__main__':
        collection_num = raw_input('Enter the collection number: ').strip()
        r = get_link_list(collection_num)
        collection = r[0]
        collection_list = r[1]
        for k in collection_list:
            get_pic(collection, k)
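One thing the program above leaves open is what happens when a question title contains a character such as `/`: `os.mkdir` would fail and the answer would be skipped silently. Below is a minimal Python 3 sketch of one possible way to deal with that, combined with the whitedot filtering and file numbering. `safe_name` and `plan_downloads` are hypothetical helpers that are not part of the original program, and the set of characters being replaced is an assumption.

```python
import os
import re


def safe_name(text, replacement="_"):
    """Replace characters that are unsafe in folder names (assumed set)."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n]+', replacement, text).strip()
    return cleaned or "untitled"


def plan_downloads(collection, title, author, srcs):
    """Return (folder, [(image_url, file_path), ...]) without downloading."""
    folder = os.path.join(safe_name(collection),
                          safe_name(title + "-" + author))
    # Skip whitedot placeholder images, as the original program does.
    wanted = [s for s in srcs if "whitedot" not in s]
    files = [(url, os.path.join(folder, "{}.jpg".format(n)))
             for n, url in enumerate(wanted, start=1)]
    return folder, files


# Made-up inputs for illustration.
folder, files = plan_downloads(
    "pics", "A/B test?", "someone",
    ["https://example.com/a.jpg",
     "https://example.com/whitedot.jpg",
     "https://example.com/b.jpg"])
print(folder)
for url, path in files:
    print(url, "->", path)
```

The actual download step would then create the folder with `os.makedirs(folder, exist_ok=True)` and write each response body to its planned path.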
Well~ that's it~ (Leave a like before you go!)