How can a crawler get the play count of a bilibili video?

12-28

The page source does not contain the actual play count. How can I get it?

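Take av2047063 as an example and visit the following URL: [URL redacted]
@妹空醬 reminded me of this...
First apply for an appkey of your own, here:
bilibili - 提示
After that you can do whatever you like with the bilibili API.
This is how third-party Bilibili clients get built.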

Note the last two parameters of the URL: id=<av number>&page=<part number>
The 18253 after play is the play count.
==============================
Bilibili has a public API... why go to so much trouble...

Just press F12 and you will find an API that needs no key:
http://api.bilibili.com/archive_stat/stat?aid=170001
(A random example, please don't nitpick.)
Fetch it with WebRequest or any other method and you get this JSON:
{ "code": 0, "data": { "view": 4465790, "danmaku": 180954, "reply": 21133, "favorite": 157883, "coin": 14786, "share": 53475, "now_rank": 0, "his_rank": 1000 }, "message": "" }
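The main fields are:

view: play count

danmaku: danmaku (bullet comment) count

reply: reply count

favorite: favorite count

coin: coin count

share: share count

Parse it with any library, or by hand, and write the result into a database.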
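For anyone who prefers Python, here is a minimal sketch of the "parse it and store it" step. It works on the sample JSON quoted above rather than on a live request. The `video_stat` table name, the in-memory SQLite database, and the helper names are my own assumptions, not part of the original answer.

```python
import json
import sqlite3

# Sample response body quoted in the answer above (aid=170001).
SAMPLE = (
    '{ "code": 0, "data": { "view": 4465790, "danmaku": 180954, '
    '"reply": 21133, "favorite": 157883, "coin": 14786, "share": 53475, '
    '"now_rank": 0, "his_rank": 1000 }, "message": "" }'
)

FIELDS = ("view", "danmaku", "reply", "favorite", "coin", "share")


def parse_stat(body):
    """Return the "data" object of a stat response, or raise on an API error."""
    payload = json.loads(body)
    if payload.get("code") != 0:
        raise ValueError("API error %r: %s" % (payload.get("code"), payload.get("message")))
    return payload["data"]


def save_stat(conn, aid, stat):
    """Insert or update one row in the (hypothetical) video_stat table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS video_stat ("
        "aid INTEGER PRIMARY KEY, view INTEGER, danmaku INTEGER, "
        "reply INTEGER, favorite INTEGER, coin INTEGER, share INTEGER)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO video_stat VALUES (?, ?, ?, ?, ?, ?, ?)",
        (aid,) + tuple(stat[name] for name in FIELDS),
    )
    conn.commit()


stat = parse_stat(SAMPLE)
conn = sqlite3.connect(":memory:")  # swap for a file path to keep the data
save_stat(conn, 170001, stat)
row = conn.execute(
    "SELECT view, danmaku FROM video_stat WHERE aid = ?", (170001,)
).fetchone()
print(row)
```

Checking `code` before reading `data` keeps error responses out of the table.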
Here is the C# code:
using System;
using System.IO;
using System.Net;
using System.Threading;

namespace BiliSpider
{
    class Program
    {
        static int Count = 0;
        static string Host = "http://api.bilibili.com";
        static FileMode Mode = FileMode.Create;

        static void Main(string[] Arg)
        {
            // One worker thread per starting aid; each one counts upward from there.
            new Thread(() => Task(1000000)).Start();
            new Thread(() => Task(2000000)).Start();
            new Thread(() => Task(3000000)).Start();
            new Thread(() => Task(4000000)).Start();
            new Thread(() => Task(5000000)).Start();
        }

        static void Task(int i)
        {
            var Url = Host + "/archive_stat/stat?aid=";
            var File = new FileStream(i + ".json", Mode);
            var Writer = new StreamWriter(File);
            while (true)
            {
                // Count is shared by all threads, so increment it atomically.
                Console.Title = "Spider " + Interlocked.Increment(ref Count);
                var Line = i + " " + Get(i++, Url);
                Writer.WriteLine(Line);
                Writer.Flush();
                Console.WriteLine(Line);
            }
        }

        // Turns the trimmed "name":value list returned by Get into an int array.
        static int[] Parse(int Num, string Url)
        {
            var Section = Get(Num, Url).Split(',');
            var Result = new int[Section.Length];
            for (int i = 0; i < Section.Length; i++)
            {
                Section[i] = Section[i].Split(':')[1];
                Result[i] = int.Parse(Section[i]);
            }
            return Result;
        }

        static string Get(int Num, string Url)
        {
            try
            {
                var Request = WebRequest.Create(Url + Num);
                var Response = Request.GetResponse();
                using (var Stream = Response.GetResponseStream())
                using (var Reader = new StreamReader(Stream))
                {
                    // Drop the first 18 characters (the envelope before the first
                    // field, assuming a compact response) and cut at the first "}".
                    var Json = Reader.ReadLine().Remove(0, 18);
                    return Json.Substring(0, Json.IndexOf('}'));
                }
            }
            catch (Exception Error)
            {
                // On failure (for example a 403) the error message is written instead.
                return Error.Message;
            }
        }
    }
}
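Sometimes the request fails with 403 (Access Denied); records that failed need to be handled separately.
Running on my ECS instance it fetches roughly 1,000,000 records per hour, and the full dataset is about 500 MB.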
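One possible way to deal with the occasional 403 is to wait and retry, then give up and record the aid for a later pass. The sketch below is only an illustration of that idea. `AccessDenied`, `fetch_with_retry`, and the injected `fetch`/`sleep` callables are hypothetical names of mine, and the fake fetcher stands in for a real HTTP call.

```python
import time


class AccessDenied(Exception):
    """Raised by a fetcher when the server answers 403."""


def fetch_with_retry(fetch, aid, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(aid); on AccessDenied wait and retry, doubling the delay.

    Returns the fetched value, or None once every attempt has failed, so the
    caller can log the aid and revisit it later.
    """
    delay = base_delay
    for attempt in range(retries):
        try:
            return fetch(aid)
        except AccessDenied:
            if attempt == retries - 1:
                return None
            sleep(delay)
            delay *= 2
    return None


# Demo with a fake fetcher that is refused twice and then succeeds.
calls = []


def flaky_fetch(aid):
    calls.append(aid)
    if len(calls) < 3:
        raise AccessDenied(aid)
    return {"view": 1}


waits = []  # collect the delays instead of really sleeping
result = fetch_with_retry(flaky_fetch, 170001, sleep=waits.append)
print(result, waits)
```

Passing `sleep` in as a parameter keeps the demo instant; in a real crawler the default `time.sleep` would be used.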

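# encoding=utf8
# author: shell-von
import re

import requests

aid = "3210612"
api_key = "http://interface.bilibili.com/count?key=27f582250563d5d6b11d6833&aid=%s"
data = requests.get(api_key % aid).text
# Captures the element name and the number from calls of the form
# ('#dianji').html('108999'); the escapes were restored from the garbled post.
regex = r"\(['\"](?:\.|#)([\w_]+)['\"]\)\.html\(['\"]?(\d+)['\"]?\)"
print(dict(re.findall(regex, data)))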
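I tried a few videos and it seems to work for all of them. I found that api_key by watching the page request two JS files in Chrome DevTools; after testing several videos it turned out the key can be reused... 233333333

Screenshots of the few av numbers I tested:

dianji is what the question asks for. In case the other numbers are wanted as well, I grabbed them too: dm_count is the danmaku count, stow_count is the favorite count, and v_ctimes is the coin count. Switching to urllib2 works just the same: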

urllib2.urlopen(api_key % aid).read()
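To try the regex step without hitting the network, here is a small Python 3 sketch. The sample body is an assumption: its shape is guessed from the regex, and only the field names and numbers come from this thread. The pattern is my reconstruction of the one in the answer, whose backslashes did not survive.

```python
import re

# Hypothetical response body: the jQuery-style shape is an assumption.
SAMPLE = (
    "$('#dianji').html('108999');"
    "$('#stow_count').html('8212');"
    "$('#v_ctimes').html('3354');"
    "$('#dm_count').html('6510');"
)

# Group 1 is the element name, group 2 the number passed to .html().
PATTERN = re.compile(r"\(['\"](?:\.|#)([\w_]+)['\"]\)\.html\(['\"]?(\d+)['\"]?\)")


def parse_counts(body):
    """Return a {name: count} dict for every .html(<number>) call in body."""
    return {name: int(value) for name, value in PATTERN.findall(body)}


counts = parse_counts(SAMPLE)
print(counts["dianji"])
```

Converting the captured digits to `int` makes the numbers ready for sorting or storage.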

Here is the MATLAB version; the api URL can be found with Chrome's developer tools:
aid = 3295561;
api = 'http://interface.bilibili.com/count?key=b9415053057bb00966665eaa';
data = regexp(webread(api, 'aid', aid), '#(\w+)\D*(\d+)', 'tokens');
data = [data{:}]
On MATLAB versions before R2014b, the third line has to use urlread instead of webread:
data = regexp(urlread(sprintf('%s&aid=%d', api, aid)), '#(\w+)\D*(\d+)', 'tokens');
Result:
data = 'dianji' '108999' 'stow_count' '8212' 'v_ctimes' '3354' 'dm_count' '6510'
Here is the general idea.
0. Open a specific av page; through this statement …

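In fact, what Ctrl+U shows is only one of the responses the site sends us; the final page is assembled from several responses.
Sometimes a site wraps its data in JSON or XML, fetches it with several requests, and builds the final page locally with JS.
So what you see on the page is the end result. To tell whether a value comes from the source page, from JSON, or from XML, capture the traffic with the developer tools and analyze it yourself.
In short, the logic is:
0. Where does this data come from? Find out by capturing and analyzing the traffic.
1. Simulate the process that fetches the data: request the data's source URL directly.
Also pay attention to the parameters you have to send; where those come from is something you need to work out yourself as well.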
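Step 0 can be partly automated. Save the response bodies you see in the developer tools, then search them for the number shown on the page. The sketch below assumes the bodies are already held in a dict. The file names and contents are made up for illustration.

```python
def find_sources(captured, needle):
    """Return the names of the captured responses whose body contains needle."""
    return [name for name, body in captured.items() if needle in body]


# Hypothetical captured responses: the page itself and one script it loads.
captured = {
    "page.html": "<html><span id='dianji'></span></html>",
    "count.js": "$('#dianji').html('108999');",
}

# 108999 is the play count visible in the browser.
hits = find_sources(captured, "108999")
print(hits)
```

Whichever response turns up in `hits` is the URL to request directly in step 1.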
====================================================
Let me give an example anyway.

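Note: Bilibili sends its responses gzip-compressed, but urllib2's urlopen does not decompress them automatically, so you have to do that by hand.
See this answer for reference:
Does python urllib2 automatically uncompress gzip data fetched from webpage?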
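For reference, the manual decompression step looks roughly like this in Python 3. This is a sketch on locally compressed bytes, not on a real Bilibili response. `decode_body` is a hypothetical helper name, and UTF-8 is an assumed charset.

```python
import gzip


def decode_body(raw, content_encoding, charset="utf-8"):
    """Decompress raw bytes if the Content-Encoding header says gzip, then decode."""
    if content_encoding == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode(charset)


# Simulate a gzip-compressed response body.
compressed = gzip.compress("<html>play: 18253</html>".encode("utf-8"))
text = decode_body(compressed, "gzip")
plain = decode_body(b"<html>plain</html>", None)
print(text)
```

With a real response, `content_encoding` would come from `response.headers.get("Content-Encoding")`.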
I picked a random video from the home page:
【愛深黑切】路人女主的玩壞方法~第一彈
import urllib2
import re
from StringIO import StringIO
import gzip

# Assumption: the tag names inside the two patterns below were stripped from the
# original post. "<" (the start of the next tag after the EmbedPlayer call) and
# "click" are stand-ins; check them against the real page and player response.
PLAY_TAG = "click"


def find_cid_aid(html):
    target = re.compile(r"EmbedPlayer(?P<args>.*?)<", re.DOTALL)
    cidaid = target.search(html)
    cidaid = html[cidaid.start("args"):cidaid.end("args")]
    cid = cidaid.find("cid=")
    aid = cidaid.find("&aid=")
    # The aid value runs up to the closing double quote.
    index = aid
    while cidaid[index] != '"':
        index += 1
    return (cidaid[cid + 4:aid], cidaid[aid + 5:index])


def find_how_many(cid_aid):
    target = re.compile(r"<%s>(?P<result>.*?)</%s>" % (PLAY_TAG, PLAY_TAG), re.DOTALL)
    cid = cid_aid[0]
    aid = cid_aid[1]
    addr = r"http://interface.bilibili.com/player?id=cid:" + cid + "&aid=" + aid
    f = urllib2.urlopen(addr)
    res = f.read()
    target = target.search(res)
    return res[target.start("result"):target.end("result")]


headers = {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
           "Accept-Language": "zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3",
           "User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:28.0) Gecko/20100101 Firefox/28.0",
           "Host": "www.bilibili.com",
           "Accept-Encoding": "gzip, deflate",
           "Cache-Control": "max-age=0",
           "Connection": "keep-alive"}
request = urllib2.Request(r"http://www.bilibili.com/video/av2046145/", headers=headers)
html = urllib2.urlopen(request)
# urlopen does not decompress gzip, so do it by hand when needed.
if html.info().get("Content-Encoding") == "gzip":
    buf = StringIO(html.read())
    f = gzip.GzipFile(fileobj=buf)
    html = f.read()
else:
    html = html.read()
cid_aid = find_cid_aid(html)
print find_how_many(cid_aid)
Because I didn't realize at first that Bilibili's response was gzip-compressed, I rewrote the headers to match the browser exactly; don't mind that detail.

This is the very first answer I have ever written...
———————————————————————————————————————
I learned Python not long ago, so this is a good chance to review.
Here is the code:
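import re
import urllib

page = urllib.urlopen("http://m.acg.tv/video/av2046040.html")
HTML = page.read()
# The HTML tags in both patterns were stripped from this post; each "&" marks a
# missing tag. Fill them in from the mobile page source before running.
re_times = r"&&&(.*)&"
result = re.findall(re_times, HTML)
re_title = r"&(.*)&"
title = re.findall(re_title, HTML)
print title[0], "play count:", result[0]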
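Take av2046040 as an example: http://www.bilibili.com/video/av2046040/
On the page you can see:

Viewing the source of the selected part in Firefox gives:

But Python's urllib module did not get the page content:
page=urllib.urlopen("http://www.bilibili.com/video/av2046040/")

So I changed approach. Bilibili's mobile pages seemed to work,
so I used Firefox's User-Agent Overrider to change the browser UA to Android Firefox/29,

which gives the following page:

With the real page address in hand, view the source in Firefox again,

and the regular expression can be written:
re_times=r"&&&(.*)&"
Then just match with the regex.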
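The same UA trick can be done from code instead of a browser add-on. Below is a minimal Python 3 sketch that only builds the request. The UA string is an assumption modeled on Firefox 29 for Android, and `build_mobile_request` is a name I made up.

```python
import urllib.request

# Assumption: a Firefox 29 for Android style UA; other mobile UAs may work as well.
MOBILE_UA = "Mozilla/5.0 (Android; Mobile; rv:29.0) Gecko/29.0 Firefox/29.0"


def build_mobile_request(url, user_agent=MOBILE_UA):
    """Return a Request that identifies itself as a mobile browser."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_mobile_request("http://m.acg.tv/video/av2046040.html")
# urllib.request.urlopen(request).read() would then fetch the mobile page.
print(request.get_header("User-agent"))
```

`urllib` stores header names capitalized as `User-agent`, which is why the lookup uses that spelling.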

I wrote one of these a while back...

haogefeifei/get_bilibili_anime · GitHub

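Today I wanted to see which shows in Bilibili's TV drama section are popular and have high play counts, so I wrote a crawler in Python to grab the data. Here is the source; the original post is on Jianshu (http://www.jianshu.com/p/d2c9740e85dc). I tested the code before going to bed and it worked.
I'll look at what ended up in my txt file tomorrow.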

```
#_*_coding:utf-8_*_
import requests
from bs4 import BeautifulSoup
import time
import random
#==================
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
# A magic snippet that works around "UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-6: ordinal not in range(128)"
#=================
files=open("爬取信息小表格.txt","wb")
info_data=[]
url=[]
counts=0

# analysis_endpage: fetch the first listing page and return its highest page number
def analysis_endpage():
    url="http://www.bilibili.com/video/tv-drama-1.html#!page=1&order=default"
    endpage=requests.get(url)
    soup=BeautifulSoup(endpage.text,"lxml")
    return soup.select("a.p.endPage")[0].get_text()

# get_info: the crawler's main function; for one listing page it extracts each video's title, upload time, play count, danmaku count, favorite count, and av link

def get_info(url):
    global counts
    website=requests.get(url)
    time.sleep(random.randint(1,3)*0.1)
    soup=BeautifulSoup(website.text,"lxml")
    titles=soup.find_all("a","title")
    timelines=soup.select("div.up-info > span")
    viedo_plays=soup.select("span.v-info-i.gk > span")
    tucaos=soup.select("span.v-info-i.dm > span")
    collections=soup.select("span.v-info-i.sc > span")
    avs=titles
    #print website.text
    #print soup
    for title,timeline,viedo_play,tucao,collection,av in zip(titles,timelines,viedo_plays,tucaos,collections,avs):
        a=title.get_text()
        b=timeline.get_text()
        c=viedo_play.get_text()
        d=tucao.get_text()
        e=collection.get_text()
        f=av.get("href")
        #info_data.append([a,b,c,d,e,"http://www.bilibili.com/"+f])
        files.write(a+"\t"+b+"\t"+c+"\t"+d+"\t"+e+"\t"+"http://www.bilibili.com/"+f+"\n")
        print "Writing record %r"%counts
        counts+=1

# get_page: run after analysis_endpage; builds the URL of every page to crawl
def get_page(end_page,start_page=1,):
    for i in range(start_page,(end_page+1)):
        url.append("http://www.bilibili.com/list/default-34-"+ str(i)+"-2016-06-21~2016-06-28.html" )
        #"http://www.bilibili.com/list/default-34-1-2016-06-21~2016-06-28.html"

#body > div.b-page-body > div > div.container-body > div > div.b-page-large.b-f-left > div > div.vd-list-cnt.loaded > ul > li:nth-child(1) > div > div > a
#body > div.b-page-body > div > div.container-body > div > div.b-page-large.b-f-left > div > div.vd-list-cnt.loaded > ul > li:nth-child(1) > div > div > div.up-info > span
#body > div.b-page-body > div > div.container-body > div > div.b-page-large.b-f-left > div > div.vd-list-cnt.loaded > ul > li:nth-child(1) > div > div > div.v-info > span.v-info-i.gk > span
#body > div.b-page-body > div > div.container-body > div > div.b-page-large.b-f-left > div > div.vd-list-cnt.loaded > ul > li:nth-child(1) > div > div > div.v-info > span.v-info-i.dm > span
#body > div.b-page-body > div > div.container-body > div > div.b-page-large.b-f-left > div > div.vd-list-cnt.loaded > ul > li:nth-child(1) > div > div > div.v-info > span.v-info-i.sc > span

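#================ divider ========================
print "Starting the bilibili crawler~"
time.sleep(1)

files.write("Title\tUpload time\tPlay count\tDanmaku count\tFavorite count\tURL\n")
print "Loading the page count..."
# Ask the site once and reuse the answer below.
total_pages=int(analysis_endpage())
print "Found %r pages"%total_pages
try:
    get_infomation=int(raw_input("Number of pages to crawl: "))

except ValueError:
    print "Invalid input: digits only!"
    files.close()
    quit()

if get_infomation > total_pages:
    print "That exceeds the number of pages, please try again!"
else:
    get_page(get_infomation)
    c=1
    for aaa in url:
        get_info(aaa)
        print "Page %r, OK!"%c
        c+=1

print "Spy mission complete -.-, good work~"
# counts already equals the number of records written.
print "Got %r records in total"%counts
files.close()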
```

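Get the cid and aid, then request http://interface.bilibili.com/player
Capture the traffic and you will see what is going on.
Take the "lazy glasses" video in the picture as an example (you know the one): cid and aid can be matched straight out of the page source with a regex,

cid=1511100&aid=1044050

Then request

http://interface.bilibili.com/player?id=cid:1511100&aid=1044050

The play count is the value wrapped in the tag pair:
<…>4611</…>
As for the code, implement it yourself.
In PHP you can fetch it directly with file_get_contents();
or use curl.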
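If Python is more convenient than PHP, the first two steps might look like the sketch below. It only extracts the ids and builds the URL, without sending any request. The page fragment is a made-up example, and `extract_cid_aid` and `build_player_url` are hypothetical helper names.

```python
import re

PLAYER_API = "http://interface.bilibili.com/player"


def extract_cid_aid(source):
    """Return (cid, aid) as strings from text containing cid=...&aid=..."""
    match = re.search(r"cid=(\d+)&aid=(\d+)", source)
    if match is None:
        raise ValueError("cid/aid pair not found")
    return match.group(1), match.group(2)


def build_player_url(cid, aid):
    """Build the player interface URL in the form shown in the answer above."""
    return "%s?id=cid:%s&aid=%s" % (PLAYER_API, cid, aid)


# Hypothetical fragment of page source; only the cid/aid values come from the answer.
fragment = 'EmbedPlayer("player", "play.swf", "cid=1511100&aid=1044050");'
cid, aid = extract_cid_aid(fragment)
player_url = build_player_url(cid, aid)
print(player_url)
```

The response to that URL would still have to be fetched and searched for the play count as described above.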
