Scraping QQ Music with Python
I have been learning Python web scraping recently, so today I am writing down how I scraped QQ Music.
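1. Finding the entry point and the patterns
Open the QQ Music home page at https://y.qq.com/, click "Charts" (排行榜), right-click one of the songs and choose "Inspect (N)".
The page source contains no link to the audio file: QQ Music fetches it dynamically through network requests.
Open the "Network" tab and filter by "JS". One of the requests listed there returns the song names along with the rest of the song information.
Record the URL of that request (https://c.y.qq.com/v8/fcg-bin/fcg_v8_toplist_cp.fcg?tpl=3&page=detail&date=2018-02-08&topid=4&type=top&song_begin=0&song_num=30&g_tk=809396275&jsonpCallback=MusicJsonCallbacktoplist&loginUin=344604012&hostUin=0&format=jsonp&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq&needNewCode=0)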
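The query string of this URL is easier to read once it is split into individual parameters. The sketch below does that with the standard library only; the helper name query_params is made up for illustration, and nothing here establishes which of these parameters the server actually requires.

```python
from urllib.parse import urlsplit, parse_qs

# The chart request recorded above, split over several lines for readability.
TOPLIST_URL = (
    "https://c.y.qq.com/v8/fcg-bin/fcg_v8_toplist_cp.fcg"
    "?tpl=3&page=detail&date=2018-02-08&topid=4&type=top"
    "&song_begin=0&song_num=30&g_tk=809396275"
    "&jsonpCallback=MusicJsonCallbacktoplist&loginUin=344604012&hostUin=0"
    "&format=jsonp&inCharset=utf8&outCharset=utf-8&notice=0"
    "&platform=yqq&needNewCode=0"
)


def query_params(url):
    """Return the query string of `url` as a flat dict of strings."""
    parsed = parse_qs(urlsplit(url).query)
    # parse_qs maps every key to a list; each key occurs once here.
    return {key: values[0] for key, values in parsed.items()}


params = query_params(TOPLIST_URL)
for name in ("topid", "date", "song_begin", "song_num", "jsonpCallback"):
    print(name, "=", params[name])
```

The parameters that change from request to request in the script in section 5 are date, topid, song_begin and song_num.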
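Looking further down the request list, we find another request that serves as the entry point for every chart in the left sidebar. Its topID values are closely tied to the sidebar links: the 4 in https://y.qq.com/n/yqq/toplist/4.html#stat=y_new.toplist.menu.4 is the same value as topid=4 in the chart request.
This request is therefore the entry point for the crawler. Record its URL (https://c.y.qq.com/v8/fcg-bin/fcg_v8_toplist_opt.fcg?page=index&format=html&tpl=macv4&v8debug=1&jsonCallback=jsonCallback)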
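Both requests return JSONP, that is, JSON wrapped in a callback call such as jsonCallback(...), so the wrapper has to be removed before the body can be parsed. Here is a minimal sketch of one way to do that with a regular expression; strip_jsonp is a hypothetical helper, and the sample payloads are invented, shaped only after the fields the script in section 5 reads.

```python
import json
import re

# Matches "callbackName( ... )" with an optional trailing semicolon.
_JSONP = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def strip_jsonp(text):
    """Parse a JSONP response body; plain JSON is accepted as well."""
    match = _JSONP.match(text)
    payload = match.group(1) if match else text
    return json.loads(payload)


# Invented sample bodies, for illustration only.
entry = strip_jsonp('jsonCallback([{"List": [{"topID": 4, "update_key": "2018_5"}]}])')
chart = strip_jsonp('MusicJsonCallbacktoplist({"songlist": []});\n')
print(entry[0]["List"][0]["topID"], chart["songlist"])
```

Matching the callback name generically avoids hard-coding a different string for each endpoint.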
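2. Finding the song URL
So far we still do not have the URL of the audio file. Start playing a song, right-click it and choose "Inspect (N)", then open Network -> Media. The song's resource URL shows up there, served from dl.stream.qqmusic.qq.com.
Repeating this for other songs always turns up a similar URL. Comparing them shows that the audio URL differs from song to song in two places: the file name and the vkey.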
3. Getting the vkey and the filename
The filename turns out to be derived from the songmid found in step 1: filename = C400 + songmid + .m4a
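The rule can be written out as two small functions. This is a sketch under assumptions: the function names are made up, the C400 prefix and the guid, uin and fromtag values are simply the ones that appear in the download URL in section 5, and the vkey is taken as an input here.

```python
from urllib.parse import urlencode

MEDIA_HOST = "http://dl.stream.qqmusic.qq.com"


def media_filename(songmid):
    """File name of a song: the C400 prefix, the songmid, and .m4a."""
    return "C400{0}.m4a".format(songmid)


def media_url(songmid, vkey, guid="9010457983", uin="344604012"):
    """Build the audio URL in the shape used by the script in section 5."""
    query = urlencode({"vkey": vkey, "guid": guid, "uin": uin, "fromtag": 66})
    return "{0}/{1}?{2}".format(MEDIA_HOST, media_filename(songmid), query)


# "SONGMID" and "VKEY" are placeholders, not real values.
print(media_url("SONGMID", "VKEY"))
```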
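The vkey, however, is still missing. Continuing from step 2, look under "JS" again, and the request that returns the vkey finally appears.
Inspecting that request shows that its URL takes songmid and filename as parameters.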
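A sketch of building that request and reading the vkey out of the parsed response follows. The endpoint and the parameter values are copied from the script in section 5, but the list is trimmed to a subset for readability, and it is an assumption that the server accepts the shorter list. The function names are hypothetical, and the sample payload is invented to match the path data -> items -> [0] -> vkey that the script reads.

```python
from urllib.parse import urlencode

VKEY_ENDPOINT = "https://c.y.qq.com/base/fcgi-bin/fcg_music_express_mobile3.fcg"


def vkey_request_url(songmid, guid="9010457983", uin="344604012"):
    """Build the vkey request URL for one song (trimmed parameter list)."""
    params = {
        "format": "json",
        "platform": "yqq",
        "cid": 205361747,
        "uin": uin,
        "songmid": songmid,
        "filename": "C400{0}.m4a".format(songmid),
        "guid": guid,
    }
    return VKEY_ENDPOINT + "?" + urlencode(params)


def extract_vkey(payload):
    """Pull the vkey out of an already-parsed response."""
    return payload["data"]["items"][0]["vkey"]


# Invented payload, for illustration only.
sample = {"data": {"items": [{"vkey": "EXAMPLE-VKEY"}]}}
print(vkey_request_url("SONGMID"))
print(extract_vkey(sample))
```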
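4. Putting it together
At this point we know how to get to the audio file:
entry URL gives the topid -> the sidebar (chart) request gives the songmid -> the songmid gives the vkey -> the vkey gives the audio URL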
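The chain above can be sketched as a generator that takes each step as a function. Everything in this block is illustrative: iter_downloads and the fake_* stand-ins are made-up names, and the stand-ins return invented data so the flow can be run without touching the network.

```python
def iter_downloads(fetch_topids, fetch_songs, fetch_vkey, build_url):
    """Yield (song name, audio URL) pairs by chaining the four steps."""
    for topid in fetch_topids():                    # entry URL -> topid
        for songmid, songname in fetch_songs(topid):  # chart -> songmid
            vkey = fetch_vkey(songmid)              # songmid -> vkey
            yield songname, build_url(songmid, vkey)  # vkey -> audio URL


# Stand-ins with made-up data; real versions would issue HTTP requests.
def fake_topids():
    return [4]


def fake_songs(topid):
    return [("mid{0}a".format(topid), "Song A"),
            ("mid{0}b".format(topid), "Song B")]


def fake_vkey(songmid):
    return "vkey-for-" + songmid


def fake_url(songmid, vkey):
    return "C400{0}.m4a?vkey={1}".format(songmid, vkey)


results = list(iter_downloads(fake_topids, fake_songs, fake_vkey, fake_url))
for songname, url in results:
    print(songname, url)
```

Passing the steps in as functions keeps the control flow separate from the network code, which makes the ordering easy to check on its own.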
5. Code implementation
import json
import time

import requests

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36"
)
# Cookie string copied from the browser session.
COOKIE = (
    "pt2gguin=o0344604012; RK=hBKNVevgd9; "
    "ptcz=7e70edebd26744f63d321bdc3eea832e59681b1614a86c481cb4bdd7af326ae0; "
    "pgv_pvid=9010457983; o_cookie=344604012; pac_uid=1_344604012; "
    "pgv_pvi=728103936; ts_uid=1996056142; luin=o0344604012; "
    "lskey=00010000e413b4b01a49b1b29d38d9babc926a4938ea3ea55f6e0816c6cd499ee24b2b3308e048e5aee5ad9f; "
    "p_luin=o0344604012; "
    "p_lskey=000400005434bcef5702931d41bb0bff1e229a77686dc7890afefd848e20a1f60a95ecf18b8c1d946bfb4aef; "
    "yq_index=0; pgv_si=s4678867968; pgv_info=ssid=s1658792426; "
    "ts_refer=ADTAGnewyqq.toplist; yqq_stat=0; "
    "ts_last=y.qq.com/n/yqq/toplist/4.html"
)


def getSongMID(pageurl, topid):
    """Return [songmid, songname] pairs for one page of a chart."""
    headers = {
        "User-Agent": USER_AGENT,
        "Referer": "https://y.qq.com/n/yqq/toplist/{0}.html".format(topid),
        # "Cookie": COOKIE,
    }
    html = requests.get(pageurl, headers=headers).text
    # The response is JSONP: drop the callback wrapper to get plain JSON.
    t1 = html.replace("MusicJsonCallbacktoplist(", "")
    t2 = t1.strip().rstrip(")")
    jsonp = json.loads(t2)
    songmid = []
    for tid in jsonp["songlist"]:
        songmid.append([tid["data"]["songmid"], tid["data"]["songname"]])
    return songmid


def saveMusic(songid, vkey, name):
    """Download one song and save it under G:/music/."""
    url = (
        "http://dl.stream.qqmusic.qq.com/C400{0}.m4a"
        "?vkey={1}&guid=9010457983&uin=344604012&fromtag=66"
    ).format(songid, vkey)
    headers = {
        "User-Agent": USER_AGENT,
        "Host": "dl.stream.qqmusic.qq.com",
        # "Cookie": COOKIE,
    }
    # Drop or replace characters that are not allowed in file names.
    safe_name = (
        name.replace("?", "").replace("/", "_").replace("\\", "_").replace('"', "")
    )
    filename = "G:/music/{0}.m4a".format(safe_name)
    print(filename)
    res = requests.get(url, headers=headers, stream=True)
    print(url)
    with open(filename, "wb") as f:
        # Write the body in chunks instead of loading it all into memory.
        for chunk in res.iter_content(chunk_size=8192):
            f.write(chunk)


def getVkey(songmid):
    """Request and return the vkey for one song."""
    url = (
        "https://c.y.qq.com/base/fcgi-bin/fcg_music_express_mobile3.fcg"
        "?g_tk=1418093288&jsonpCallback=MusicJsonCallback01822902435765017"
        "&loginUin=344604012&hostUin=0&format=json&inCharset=utf8"
        "&outCharset=utf-8&notice=0&platform=yqq&needNewCode=0&cid=205361747"
        "&callback=MusicJsonCallback&uin=344604012&songmid={0}"
        "&filename=C400{1}.m4a&guid=9010457983"
    ).format(songmid, songmid)
    headers = {
        "User-Agent": USER_AGENT,
        "Referer": "https://y.qq.com/portal/player.html",
        # "Cookie": COOKIE,
    }
    html = requests.get(url, headers=headers).text
    t1 = html.replace("MusicJsonCallback(", "")
    t2 = t1.strip().rstrip(")")
    jsonp = json.loads(t2)
    vkey = jsonp["data"]["items"][0]["vkey"]
    return vkey


# Entry URL
start_url = (
    "https://c.y.qq.com/v8/fcg-bin/fcg_v8_toplist_opt.fcg"
    "?page=index&format=html&tpl=macv4&v8debug=1&jsonCallback=jsonCallback"
)
headers = {
    "User-Agent": USER_AGENT,
    "Referer": "https://y.qq.com/n/yqq/toplist/4.html",
    "Cookie": COOKIE,
}
html = requests.get(start_url, headers=headers).text
t1 = html.replace("jsonCallback(", "")
t2 = t1.strip().rstrip(")")
json_dict = json.loads(t2)
# Only fetch the first group, "QQ音樂巔峰榜"; I can't understand the Korean
# or Japanese charts anyway.
ch_album = json_dict[0]
for item1 in ch_album["List"]:
    # Some update_key values look like 2018_5, but the request needs 2018_05,
    # so pad a single-digit second part with a zero.
    update_key = item1["update_key"]
    tt = update_key.split("_")
    if len(tt) == 2 and len(tt[1]) == 1:
        update_key = tt[0] + "_0" + tt[1]
    mm = 0
    while True:
        start_item = mm * 30
        num = 30
        mm += 1
        # Request URL for one page of this sidebar chart
        page_url = (
            "https://c.y.qq.com/v8/fcg-bin/fcg_v8_toplist_cp.fcg"
            "?tpl=3&page=detail&date={0}&topid={1}&type=top"
            "&song_begin={2}&song_num={3}&g_tk=1418093288"
            "&jsonpCallback=MusicJsonCallbacktoplist&loginUin=344604012"
            "&hostUin=0&format=jsonp&inCharset=utf8&outCharset=utf-8"
            "&notice=0&platform=yqq&needNewCode=0"
        ).format(update_key, item1["topID"], start_item, num)
        songinfo = getSongMID(page_url, item1["topID"])  # songmid and songname
        print(songinfo)
        if len(songinfo) <= 0:  # no more songs on this chart, leave the loop
            break
        for sid in songinfo:
            vkey = getVkey(sid[0])  # get the vkey of each song
            saveMusic(sid[0], vkey, sid[1])  # save the song
            time.sleep(1)  # sleep 1 second so the server does not block us
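Two steps of the script, padding the update_key and cleaning the song name for use as a file name, can be pulled out into small functions and checked without network access. A minimal sketch follows; the function names are made up for illustration, and the underscore format of update_key is inferred from the split("_") call in the script.

```python
def normalize_update_key(update_key):
    """Pad a one-digit second part with a zero, e.g. 2018_5 -> 2018_05."""
    parts = update_key.split("_")
    if len(parts) == 2 and len(parts[1]) == 1:
        return parts[0] + "_0" + parts[1]
    return update_key


def safe_filename(name):
    """Apply the same character replacements the script uses for file names."""
    return (
        name.replace("?", "").replace("/", "_").replace("\\", "_").replace('"', "")
    )


print(normalize_update_key("2018_5"))
print(normalize_update_key("2018-02-08"))
print(safe_filename('What? A/B "live"'))
```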