Python Web Scraping: Douban Music Top 250

01-29

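I have been home for a long while and finally could not sit still any longer, so I decided to scrape some data for fun. My laptop dual-boots Win7 and Ubuntu 16.04, and I had planned to write the code in Ubuntu, but since coming home Ubuntu only boots to a purple screen. The fixes I found on Baidu and Zhihu did not help; if anyone knows how to solve it, please teach me (there is a New Year red envelope in it for you!). So I ended up writing the code under Win7 after all (the machine is so slow that I had been putting off installing Python there). Today's target is the Douban Music Top 250. It is fairly simple, mainly for practice.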

Code

import re
import time

import pymongo
import requests
from bs4 import BeautifulSoup

# Store results in the local MongoDB instance: database "douban", collection "musictop"
client = pymongo.MongoClient('localhost', 27017)
douban = client['douban']
musictop = douban['musictop']

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}
# Ten list pages, 25 albums each
urls = ['https://music.douban.com/top250?start={}'.format(i) for i in range(0, 250, 25)]


def get_url_music(url):
    """Collect the detail-page links on one list page and scrape each of them."""
    wb_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    music_hrefs = soup.select('a.nbg')
    for music_href in music_hrefs:
        get_music_info(music_href['href'])
        time.sleep(2)  # pause between detail pages


def get_music_info(url):
    """Scrape one album detail page and store the result in MongoDB."""
    wb_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    names = soup.select('h1 > span')
    authors = soup.select('span.pl > a')
    # Douban pages are in Simplified Chinese, so the labels are written that way.
    # The patterns assume each label is closed by </span> and each value ends at
    # a <br> tag; adjust them to the page's actual markup.
    styles = re.findall('流派:</span>&nbsp;(.*?)<br', wb_data.text, re.S)
    times = re.findall('发行时间:</span>&nbsp;(.*?)<br', wb_data.text, re.S)
    contents = soup.select('span.short > span')
    if len(names) == 0:
        name = '缺失'  # "missing"
    else:
        name = names[0].get_text()
    if len(authors) == 0:
        author = '佚名'  # "anonymous"
    else:
        author = authors[0].get_text()
    if len(styles) == 0:
        style = '未知'  # "unknown"
    else:
        style = styles[0].split('\n')[0]
    if len(times) == 0:
        release_year = '未知'  # "unknown"
    else:
        release_year = times[0].split('-')[0]  # keep only the year
    if len(contents) == 0:
        content = '無'  # "none"
    else:
        content = contents[0].get_text()
    info = {
        'name': name,
        'author': author,
        'style': style,
        'time': release_year,
        'content': content
    }
    musictop.insert_one(info)


for url in urls:
    get_url_music(url)

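1. I added a request header. (I originally had none; after a few debugging runs the data suddenly stopped coming back. Adding the header did not fix it at first, but later it worked again, so it may have been a network problem.)

2. This time the data is scraped from each album's detail page. (Last time, when scraping movies, I did not do this and some fields were missing.)

3. The data preprocessing uses a lot of if statements. If anyone knows a better way to write this, suggestions are welcome.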
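On the question in note 3, one possible way to cut down the repeated if/else blocks is a small helper that returns the first match or a default. This is only a minimal sketch: `first_or_default` is a hypothetical helper name that does not appear in the original code, and the sample lists are made up to stand in for the scraper's results.

```python
def first_or_default(items, default, transform=lambda x: x):
    """Return transform(items[0]) if items is non-empty, otherwise default."""
    return transform(items[0]) if items else default


# Made-up values shaped like what re.findall / soup.select return
styles = ['流行\n']
times = ['2004-08-03']
authors = []  # nothing matched on the page

style = first_or_default(styles, '未知', lambda s: s.split('\n')[0])
year = first_or_default(times, '未知', lambda s: s.split('-')[0])
author = first_or_default(authors, '佚名')
print(style, year, author)
```

For the BeautifulSoup results, the same helper could be called with `lambda tag: tag.get_text()` as the transform.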

Data analysis

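1. Part of the data is shown in the figure above.

2. Quite a lot of the artists are Chinese, haha.

3. With the spread of audio devices and the internet and the growth of pop music, the number of works clearly rises after 2000, then drops sharply around 2010. (Classics are classics; I will not even start on today's music.)

4. As for genres, pop, rock, and folk make up more than half.

5. Finally, I made a word cloud from Jay Chou's "不能說的秘密" (Secret), which brings back childhood memories.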
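Counts like the ones behind observations 3 and 4 can be produced with a few lines of standard-library Python. This is a minimal sketch, assuming records shaped like the `info` dict stored by the scraper. The sample records are made up for illustration and are not the scraped results, and `count_by_style` and `count_by_decade` are hypothetical helper names.

```python
from collections import Counter

# Made-up sample records; real ones would come from the database
records = [
    {'style': '流行', 'time': '2004'},
    {'style': '摇滚', 'time': '1994'},
    {'style': '流行', 'time': '2008'},
    {'style': '民谣', 'time': '未知'},
]


def count_by_style(rows):
    """Count how many records fall under each genre."""
    return Counter(row['style'] for row in rows)


def count_by_decade(rows):
    """Count records per decade, skipping rows whose year is not numeric."""
    decades = Counter()
    for row in rows:
        year = row['time']
        if year.isdigit():
            decades[int(year) // 10 * 10] += 1
    return decades


style_counts = count_by_style(records)
decade_counts = count_by_decade(records)
print(style_counts.most_common())
print(sorted(decade_counts.items()))
```

With the MongoDB collection from the code section, the rows could be supplied by `musictop.find()` instead of the hard-coded list.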

Problem

import re
import time

import pymysql
import requests
from bs4 import BeautifulSoup

# Connect to the local MySQL server, database "test"
conn = pymysql.connect(host='localhost', user='root', password='123456',
                       database='test', port=3306, charset='utf8')
cursor = conn.cursor()

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}
# Ten list pages, 25 albums each
urls = ['https://music.douban.com/top250?start={}'.format(i) for i in range(0, 250, 25)]


def get_url_music(url):
    """Collect the detail-page links on one list page and scrape each of them."""
    wb_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    music_hrefs = soup.select('a.nbg')
    for music_href in music_hrefs:
        get_music_info(music_href['href'])
        time.sleep(2)  # pause between detail pages


def get_music_info(url):
    """Scrape one album detail page and insert the result into MySQL."""
    wb_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    names = soup.select('h1 > span')
    authors = soup.select('span.pl > a')
    # Douban pages are in Simplified Chinese, so the labels are written that way.
    # The patterns assume each label is closed by </span> and each value ends at
    # a <br> tag; adjust them to the page's actual markup.
    styles = re.findall('流派:</span>&nbsp;(.*?)<br', wb_data.text, re.S)
    times = re.findall('发行时间:</span>&nbsp;(.*?)<br', wb_data.text, re.S)
    contents = soup.select('span.short > span')
    if len(names) == 0:
        name = '缺失'  # "missing"
    else:
        name = names[0].get_text()
    if len(authors) == 0:
        author = '佚名'  # "anonymous"
    else:
        author = authors[0].get_text()
    if len(styles) == 0:
        style = '未知'  # "unknown"
    else:
        style = styles[0].split('\n')[0]
    if len(times) == 0:
        release_year = '未知'  # "unknown"
    else:
        release_year = times[0].split('-')[0]  # keep only the year
    if len(contents) == 0:
        content = '無'  # "none"
    else:
        content = contents[0].get_text()

    cursor.execute("use test")
    cursor.execute(
        "insert into doubanmusic250 (name,author,style,time,content) values(%s,%s,%s,%s,%s)",
        (name, author, style, release_year, content))
    conn.commit()


for url in urls:
    get_url_music(url)

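I have been learning MySQL recently and wanted to connect to it from Python, but the code above fails. I am attaching it in the hope that someone more experienced can point out the problem. Screenshot of the error: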
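Without the screenshot, the actual cause of the error cannot be determined here. One way to narrow it down is to wrap the insert so that the failing row and the database's own message are printed and the transaction is rolled back. The following is only a minimal sketch. It assumes a DB-API connection such as the one returned by `pymysql.connect`, and that the `doubanmusic250` table already exists with these five columns. `save_row`, `FakeCursor`, and `FakeConnection` are hypothetical names used for the demo and are not part of the original code.

```python
INSERT_SQL = (
    "insert into doubanmusic250 (name, author, style, time, content) "
    "values (%s, %s, %s, %s, %s)"
)


def save_row(conn, row):
    """Insert one row; on failure print the error, roll back, and return False."""
    cursor = conn.cursor()
    try:
        cursor.execute(INSERT_SQL, row)
        conn.commit()
        return True
    except Exception as exc:  # kept broad so the sketch stays driver-agnostic
        conn.rollback()
        print('insert failed for', row[0], '->', repr(exc))
        return False
    finally:
        cursor.close()


class FakeCursor:
    """Stand-in cursor for the demo; it only checks the parameter count."""

    def __init__(self, rows):
        self.rows = rows

    def execute(self, sql, params=None):
        if len(params) != sql.count('%s'):
            raise ValueError('wrong number of parameters')
        self.rows.append(params)

    def close(self):
        pass


class FakeConnection:
    """Stand-in connection so the sketch runs without a MySQL server."""

    def __init__(self):
        self.rows = []
        self.rolled_back = 0

    def cursor(self):
        return FakeCursor(self.rows)

    def commit(self):
        pass

    def rollback(self):
        self.rolled_back += 1


conn = FakeConnection()
ok = save_row(conn, ('Name', 'Author', '流行', '2004', 'short review'))
bad = save_row(conn, ('only', 'two'))
```

In the real script, `conn` would be the pymysql connection, and the printed message should show what MySQL is actually rejecting.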

Author: 羅羅攀, columnist for the Python愛好者社區 (Python Enthusiasts Community). Please do not repost. Thank you.