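MATLAB Graph Object (1): Scraping the Data
Our intern myc promised everyone a few days ago that he was going to make some big news, and then went quiet... It wasn't that he was slacking off. It was because
he's being converted to a full-time employee!
Before the conversion, though, the boss said:
Write the documentation for your project first, so that future interns can pick it up where you left off.
So myc started on one of the things programmers like least: writing documentation.
myc: There are two things I hate most: (1) writing documentation for my own programs, and (2) reading programs that have no documentation.
To reproduce the final result shown in the teaser article, we first need to get the data.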
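Who does web scraping best?
MATLAB, at any rate, is not great at it... So here we borrow zhihu_oauth, written in Python by @7sDream, to scrape data from Zhihu.
myc: But... I don't know Python.
No problem. myc is a MATLAB expert. Remember how to call Python from MATLAB?
First we install zhihu_oauth by following its tutorial. We used Python 3.5 here.
Once it is installed, use pyversion in MATLAB to check whether Python can be called, then load the zhihu_oauth module and check whether its interface is reachable: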
>> [v, e, isloaded] = pyversion;

>> py.importlib.import_module('zhihu_oauth');

>> client = py.zhihu_oauth.ZhihuClient
If it works, you will see

client = 

  Python ZhihuClient with no properties.

    <zhihu_oauth.client.ZhihuClient object at 0x7ff6bf45ac18>
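If the import fails inside MATLAB, it can help to ask the same Python interpreter directly whether it can see the module. The following is a minimal sketch, not part of the original workflow; `can_import` is a hypothetical helper name, and it only handles top-level module names.

```python
import importlib.util


def can_import(module_name):
    """Return True if this interpreter can locate the top-level module."""
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(module_name) is not None


# 'json' ships with Python; 'zhihu_oauth' is only found if it was installed
# for this particular interpreter.
for name in ("json", "zhihu_oauth"):
    print(name, can_import(name))
```

Run it with the interpreter that pyversion reports, since a package installed for a different Python will not be visible to MATLAB.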
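Following the tutorial, we authenticate with a token. This requires logging in from the OS command line with login_in_terminal (see the official documentation). After logging in, store the session with save_token. The remaining steps happen back in MATLAB.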
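The one-time login might look roughly like the sketch below. It assumes zhihu_oauth exposes `ZhihuClient` with the `login_in_terminal` and `save_token` methods named above; `save_session` is a hypothetical wrapper, and `token.pkl` is the file name the MATLAB code loads later.

```python
def save_session(token_path="token.pkl"):
    """Log in interactively and store the session token at token_path."""
    # Imported inside the function so that merely defining it does not
    # require zhihu_oauth to be installed.
    from zhihu_oauth import ZhihuClient

    client = ZhihuClient()
    client.login_in_terminal()     # prompts for the account and password
    client.save_token(token_path)  # the file that load_token reads later
    return token_path


# Call save_session() once from a regular Python prompt, not from MATLAB.
```

Once `token.pkl` exists, MATLAB can load it as in the next snippet.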
>> client.load_token('token.pkl');
>> matlab = client.column('matlab')
On success you get

matlab = 

  Python Column with properties:

    article_count: [1x1 py.int]
    articles: [1x1 py.zhihu_oauth.zhcls.generator.ArticleGenerator]
    articles_count: [1x1 py.int]
    author: [1x1 py.zhihu_oauth.zhcls.people.People]
    comment_permission: [1x3 py.str]
    description: [1x27 py.str]
    follower_count: [1x1 py.int]
    followers: [1x1 py.zhihu_oauth.zhcls.generator.PeopleGenerator]
    id: [1x6 py.str]
    image_url: [1x62 py.str]
    pure_data: [1x1 py.dict]
    title: [1x6 py.str]
    updated: [1x1 py.int]
    updated_time: [1x1 py.int]

    <zhihu_oauth.zhcls.column.Column object at 0x7ff6bd962080>
Let's look at the column's follower count

>> matlab.follower_count

ans = 

  Python int with properties:

    denominator: [1x1 py.int]
    imag: [1x1 py.int]
    numerator: [1x1 py.int]
    real: [1x1 py.int]

    503
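We have already passed 500~ Confetti~
Next we process the column's followers one by one. Here we use Python's zip to build an iterator and finally convert the result to a cell array so that MATLAB can work with it.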
>> iter = py.zip(py.range(matlab.follower_count),matlab.followers);
>> list = py.list(iter);
>> ppl = list.cell;
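To see what the zip/range combination does on the Python side, here is a small sketch with a made-up generator standing in for `matlab.followers`. The stand-in `fake_followers` and its values are hypothetical; only the behavior of `zip` and `range` is the point.

```python
def fake_followers():
    """Hypothetical stand-in for a lazy follower generator."""
    n = 0
    while True:  # pretend the source never runs out
        n += 1
        yield "follower-{}".format(n)


follower_count = 3  # in the article this comes from matlab.follower_count

# zip stops as soon as its shortest input is exhausted, so
# range(follower_count) caps how many items are pulled from the generator
# and pairs each one with a zero-based index.
pairs = list(zip(range(follower_count), fake_followers()))
print(pairs)
```

Each element is an (index, item) tuple, which is why the MATLAB code later picks the second entry of each cell to get the follower object.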
Let's see what is inside the first one
>> subscriberCell = ppl{1}.cell

subscriberCell = 

    [1x1 py.int]    [1x1 py.zhihu_oauth.zhcls.people.People]

>> subscriber = subscriberCell{2}
subscriber = 

  Python People with properties:

    activities: [1x1 py.zhihu_oauth.zhcls.generator.ActivityGenerator]
    answer_count: [1x1 py.int]
    answers: [1x1 py.zhihu_oauth.zhcls.generator.AnswerGenerator]
    articles: [1x1 py.zhihu_oauth.zhcls.generator.ArticleGenerator]
    articles_count: [1x1 py.int]
    avatar_url: [1x60 py.str]
    business: [1x1 py.zhihu_oauth.zhcls.streaming.StreamingJSON]
    collected_count: [1x1 py.int]
    collection_count: [1x1 py.int]
    collections: [1x1 py.zhihu_oauth.zhcls.generator.CollectionGenerator]
    column_count: [1x1 py.int]
    columns: [1x1 py.zhihu_oauth.zhcls.generator.ColumnGenerator]
    columns_count: [1x1 py.int]
    created_at: [1x1 py.NoneType]
    description: [1x10 py.str]
    draft_count: [1x1 py.NoneType]
    educations: [1x1 py.zhihu_oauth.zhcls.streaming.StreamingJSON]
    email: [1x1 py.NoneType]
    employments: [1x1 py.zhihu_oauth.zhcls.streaming.StreamingJSON]
    favorite_count: [1x1 py.int]
    favorited_count: [1x1 py.int]
    follower_count: [1x1 py.int]
    followers: [1x1 py.zhihu_oauth.zhcls.generator.PeopleGenerator]
    following_collections: [1x1 py.zhihu_oauth.zhcls.generator.CollectionGenerator]
    following_column_count: [1x1 py.int]
    following_columns: [1x1 py.zhihu_oauth.zhcls.generator.ColumnGenerator]
    following_count: [1x1 py.int]
    following_question_count: [1x1 py.int]
    following_questions: [1x1 py.zhihu_oauth.zhcls.generator.QuestionGenerator]
    following_topic_count: [1x1 py.int]
    following_topics: [1x1 py.zhihu_oauth.zhcls.generator.TopicGenerator]
    followings: [1x1 py.zhihu_oauth.zhcls.generator.PeopleGenerator]
    friendly_score: [1x1 py.NoneType]
    gender: [1x1 py.int]
    has_daily_recommend_permission: [1x1 py.NoneType]
    headline: [1x9 py.str]
    id: [1x32 py.str]
    is_active: [1x1 py.NoneType]
    is_baned: [1x1 py.NoneType]
    is_bind_sina: 0
    is_locked: [1x1 py.NoneType]
    is_moments_user: [1x1 py.NoneType]
    locations: [1x1 py.zhihu_oauth.zhcls.streaming.StreamingJSON]
    name: [1x8 py.str]
    pure_data: [1x1 py.dict]
    question_count: [1x1 py.int]
    questions: [1x1 py.zhihu_oauth.zhcls.generator.QuestionGenerator]
    shared_count: [1x1 py.int]
    sina_weibo_name: [1x1 py.NoneType]
    sina_weibo_url: [1x1 py.NoneType]
    thanked_count: [1x1 py.int]
    uid: [1x1 py.NoneType]
    voteup_count: [1x1 py.int]

    <zhihu_oauth.zhcls.people.People object at 0x7ff6bd9621d0>
It looks like we now have the information for the column's first follower. Let's see who it is
>> nameS = char(subscriber.name)
nameS =

Yu Jiang

>> uidS = char(subscriber.id)
uidS =

8c479214206c3f2664df29a000a0bf95

>> followerCountS = double(subscriber.follower_count)
followerCountS =

   257
Nice. The next step is to process every follower and each of their followers in turn. The script looks roughly like this
% The Python version can only be changed before Python is loaded.
[v, e, isloaded] = pyversion;
if isloaded
    disp('To change the Python version, restart MATLAB, then call pyversion.')
else
    pyversion('/usr/local/bin/python3.5')
end

py.importlib.import_module('zhihu_oauth');

fid = fopen('relation.txt','w+');

client = py.zhihu_oauth.ZhihuClient;
client.load_token('token.pkl');
matlab = client.column('matlab');
iter = py.zip(py.range(matlab.follower_count),matlab.followers);

list = py.list(iter);
ppl = list.cell;

for i = 1:numel(ppl)
    try
        % Each element is an (index, People) pair; keep the People object.
        subscriberCell = ppl{i}.cell;
        subscriber = subscriberCell{2};
        nameS = char(subscriber.name);
        uidS = char(subscriber.id);
        followerCountS = double(subscriber.follower_count);
        % Write to file: the subscriber, then a tab
        fprintf(fid,'%s,%s,%i\t',uidS,nameS,followerCountS);
        % Record all followers
        iterP = py.zip(py.range(subscriber.follower_count),subscriber.followers);
        listP = py.list(iterP);
        followers = listP.cell;

        for j = 1:numel(followers)
            try
                followersCell = followers{j}.cell;
                follower = followersCell{2};
                nameF = char(follower.name);
                uidF = char(follower.id);
                followerCountF = double(follower.follower_count);
                % One follower per ';'-terminated field
                fprintf(fid,'%s,%s,%i;',uidF,nameF,followerCountF);
            catch
                % Skip followers whose data cannot be read
                continue
            end
        end

        % New line
        fprintf(fid,'\n');
    catch
        % Skip subscribers whose data cannot be read
        continue
    end
end

fclose(fid);
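Each record the script writes to relation.txt is one line: the subscriber as `uid,name,count`, a tab, then one `uid,name,count;` field per follower. As a sketch of how that layout could be read back, assuming no name contains a tab or a semicolon (the script does not escape them), the hypothetical helpers below parse one line. The sample record is made up.

```python
from collections import namedtuple

Person = namedtuple("Person", ["uid", "name", "follower_count"])


def parse_person(field):
    """Parse one 'uid,name,count' field; the name may itself contain commas."""
    uid, rest = field.split(",", 1)      # the uid is everything before the first comma
    name, count = rest.rsplit(",", 1)    # the count is everything after the last comma
    return Person(uid, name, int(count))


def parse_line(line):
    """Split one record into the subscriber and a list of the followers."""
    head, _, tail = line.rstrip("\n").partition("\t")
    followers = [parse_person(f) for f in tail.split(";") if f]
    return parse_person(head), followers


# Made-up sample record in the same layout the script writes.
sample = "u1,Alice,2\tu2,Bob,0;u3,Carol, Jr.,5;\n"
subscriber, followers = parse_line(sample)
print(subscriber.name, [f.name for f in followers])
```

Splitting the uid from the left and the count from the right keeps names with commas intact, which a plain split on every comma would not.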
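Once it was done, myc uploaded both the script and the results here. What is left is processing the data...
Finally getting converted to full-time. Let me sleep for a bit...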