Building a Word Cloud from 87,767 Douban Short Reviews of Wolf Warrior 2 (《戰狼2》)

Source code on GitHub: ronghuaxu/zhanlang2_wordcloud

Below I walk through the code and describe the problems I ran into while writing the crawler:

1. The Python crawler that fetches the short reviews:

```python
# -*- coding:utf-8 -*-
import codecs
import random
import time

from downloader import download as dd
from parser import movieparser as ps

if __name__ == '__main__':

    templateurl = 'https://movie.douban.com/subject/26363254/comments?start={}&limit=20&sort=new_score&status=P'
    with codecs.open('pjl_comment.txt', 'a', encoding='utf-8') as f:
        # 4249 pages, 20 comments per page
        for i in range(4249):
            print(u'開始爬取{}頁評論...'.format(i))
            targeturl = templateurl.format(i * 20)
            res = dd.download_page(targeturl)
            f.writelines(ps.get_douban_comments(res))
            # random 1-2 s pause between requests to stay under the rate limit
            time.sleep(1 + float(random.randint(1, 20)) / 20)
```

```python
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup


def get_douban_comments(res):
    comments_list = []  # list of comment strings
    soup = BeautifulSoup(res, 'html.parser')  # explicit parser avoids a bs4 warning
    comment_nodes = soup.select('.comment > p')
    for node in comment_nodes:
        comments_list.append(node.get_text().strip().replace('\n', '') + u'\n')
    return comments_list
```

```python
# -*- coding:utf-8 -*-
import requests


# download the page source
def download_page(url):
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0',
        'Cookie': 'input your login cookie',
    }
    html = requests.get(url, headers=header).content
    return html
```

Note: fetching the short reviews requires being logged in, so it is worth registering a throwaway account for the crawler and pasting its cookie into the `Cookie` header above. A quick way to check that the cookie actually works is sketched below.
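Before launching the full 4249-page crawl, it helps to verify the cookie on a page deep in the comment list, since Douban serves those only to logged-in sessions. A minimal sanity check (my addition, not part of the repo):

```python
# Fetch one page deep in the comment list and make sure the parser still
# finds comments; anonymous sessions get cut off after the first pages.
# (Sanity-check sketch, not part of the original repo.)
from downloader import download as dd
from parser import movieparser as ps

test_url = ('https://movie.douban.com/subject/26363254/comments'
            '?start=2000&limit=20&sort=new_score&status=P')
comments = ps.get_douban_comments(dd.download_page(test_url))
print('got %d comments' % len(comments))  # expect 20 while the cookie is valid
```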

When the crawl finishes, you end up with a text file of roughly 10 MB (pjl_comment.txt).

2. Segmenting the comments and drawing the word cloud

```python
# -*- coding:utf-8 -*-
import codecs
from os import path

import jieba
from scipy.misc import imread
from wordcloud import WordCloud


def get_all_keywords(file_name):
    word_lists = []  # list of keywords
    with codecs.open(file_name, 'r', encoding='utf-8') as f:
        Lists = f.readlines()  # list of lines
        for List in Lists:
            cut_list = list(jieba.cut(List))
            for word in cut_list:
                word_lists.append(word)
    word_lists_set = set(word_lists)  # de-duplicate
    sort_count = []
    word_lists_set = list(word_lists_set)
    length = len(word_lists_set)
    print u"共有%d個關鍵詞" % length
    k = 1
    for w in word_lists_set:
        sort_count.append(w + u':' + unicode(word_lists.count(w)) + u"次\n")
        print u"%d---" % k + w + u":" + unicode(word_lists.count(w)) + u"次"
        k += 1
    with codecs.open('count_word.txt', 'w', encoding='utf-8') as f:
        f.writelines(sort_count)


# segment the whole comment file with jieba and save the result
def save_jieba_result():
    # enable parallel segmentation
    jieba.enable_parallel(4)
    dirs = path.join(path.dirname(__file__), '../pjl_comment.txt')
    with codecs.open(dirs, encoding='utf-8') as f:
        comment_text = f.read()
    cut_text = " ".join(jieba.cut(comment_text))  # join the segmented words with spaces
    with codecs.open('pjl_jieba.txt', 'a', encoding='utf-8') as f:
        f.write(cut_text)


# draw the word cloud
def draw_wordcloud2():
    dirs = path.join(path.dirname(__file__), 'pjl_jieba.txt')
    with codecs.open(dirs, encoding='utf-8') as f:
        comment_text = f.read()

    color_mask = imread("/Users/huazi/Desktop/music.jpg")  # read the mask image

    stopwords = [u'就是', u'電影', u'你們', u'這麼', u'不過', u'但是', u'什麼', u'沒有', u'這個', u'那個',
                 u'大家', u'比較', u'看到', u'真是', u'除了', u'時候', u'已經', u'可以']
    cloud = WordCloud(font_path="/Users/huazi/Desktop/simsunttc/simsun.ttc", background_color='white',
                      max_words=2000, max_font_size=200, min_font_size=4, mask=color_mask,
                      stopwords=stopwords)
    word_cloud = cloud.generate(comment_text)  # generate the word cloud
    word_cloud.to_file("pjl_cloud.jpg")


save_jieba_result()
draw_wordcloud2()
```
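A side note on `get_all_keywords`: calling `word_lists.count(w)` inside the loop makes the counting quadratic in the number of tokens, which hurts with 87,767 comments. A one-pass variant with `collections.Counter` (my suggestion, not in the original repo; it writes the same `word:count` lines to count_word.txt):

```python
# One-pass keyword counting with collections.Counter
# (suggested variant, not the author's code).
import codecs
from collections import Counter

import jieba


def count_keywords(file_name):
    with codecs.open(file_name, 'r', encoding='utf-8') as f:
        counter = Counter(jieba.cut(f.read()))
    # most_common() already sorts by frequency, most frequent first
    lines = [u'%s:%d次\n' % (w, c) for w, c in counter.most_common()]
    with codecs.open('count_word.txt', 'w', encoding='utf-8') as f:
        f.writelines(lines)
```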

A few problems I ran into:

  1. With this many comments, jieba segmentation takes a long time. Enabling parallel segmentation with `jieba.enable_parallel(4)` speeds it up considerably; a rough timing sketch follows this list.
  2. Passing Chinese stopwords to WordCloud had no effect at first; I got it working by modifying a function in the wordcloud source. An alternative that avoids patching the library is also sketched below.
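For item 1, here is a rough way to measure what parallel mode buys you (an illustrative sketch, not measured numbers; jieba's parallel mode is based on `os.fork`, so it works on macOS/Linux but not on Windows):

```python
# Rough timing comparison of jieba with and without parallel mode
# (illustrative sketch; assumes pjl_comment.txt from step 1 exists).
import codecs
import time

import jieba

with codecs.open('pjl_comment.txt', encoding='utf-8') as f:
    text = f.read()

start = time.time()
list(jieba.cut(text))
print('single process: %.1fs' % (time.time() - start))

jieba.enable_parallel(4)  # 4 worker processes (POSIX only)
start = time.time()
list(jieba.cut(text))
print('4 processes:    %.1fs' % (time.time() - start))
jieba.disable_parallel()
```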


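For item 2, the stopwords can also be filtered out during segmentation, before the text ever reaches WordCloud, which sidesteps the issue without touching the library (a workaround sketch, not the author's actual patch):

```python
# Drop Chinese stopwords while segmenting instead of relying on
# WordCloud's stopwords argument (workaround sketch, not the author's patch).
import codecs

import jieba

stopwords = set([u'就是', u'電影', u'你們', u'這麼', u'不過', u'但是', u'什麼', u'沒有',
                 u'這個', u'那個', u'大家', u'比較', u'看到', u'真是', u'除了',
                 u'時候', u'已經', u'可以'])

with codecs.open('pjl_comment.txt', encoding='utf-8') as f:
    comment_text = f.read()

# keep only non-stopword tokens, then join with spaces for WordCloud.generate()
cut_text = u' '.join(w for w in jieba.cut(comment_text)
                     if w.strip() and w not in stopwords)
```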