Python Learning: Article Data Analysis

When studying NLP we often work with corpora and run some basic analysis on them. In this article we will practice that by analyzing one of Andrew Ng's answers on Quora:

The original answer is copied below:

Deep Learning is an amazing tool that is helping numerous groups create exciting AI applications. It is helping us build self-driving cars, accurate speech recognition, computers that can understand images, and much more.

Despite all the recent progress, I still see huge untapped opportunities ahead. There're many projects in precision agriculture, consumer finance, medicine, ... where I see a clear opportunity for deep learning to have a big impact, but that none of us have had time to focus on yet. So I'm confident deep learning isn't going to "plateau" anytime soon and that it'll continue to grow rapidly.

Deep Learning has also been overhyped. Because neural networks are very technical and hard to explain, many of us used to explain it by drawing an analogy to the human brain. But we have pretty much no idea how the biological brain works. UC Berkeley's Michael Jordan calls deep learning a "cartoon" of the biological brain--a vastly oversimplified version of something we don't even understand--and I agree. Despite the media hype, we're nowhere near being able to build human-level intelligence. Because we fundamentally don't know how the brain works, attempts to blindly replicate what little we know in a computer also has not resulted in particularly useful AI systems. Instead, the most effective deep learning work today has made its progress by drawing from CS and engineering principles and at most a touch of biological inspiration, rather than try to blindly copy biology.

Concretely, if you hear someone say "The brain does X. My system also does X. Thus we're on a path to building the brain," my advice is to run away!

Many of the ideas used in deep learning have been around for decades. Why is it taking off only now? Two of the key drivers of its progress are: (i) scale of data and (ii) scale of computation. With our society spending more time on websites and mobile devices, for the past two decades we've been rapidly accumulating data. It was only recently that we figured out how to scale computation so as to build deep learning algorithms that can take advantage of this voluminous amount of data.

This has now put us in two positive feedback loops, which is accelerating the progress of deep learning:

First, now that we have huge machines to absorb huge amounts of data, the value of big data is clearer. This creates a greater incentive to acquire more data, which in turn creates a greater incentive to build bigger/faster neural networks.

Second, that we have fast deep learning implementations also speeds up innovation, and accelerates deep learning's research progress. Many people underestimate the impact of computer systems investments in deep learning. When carrying out deep learning research, we start out not knowing what algorithms will and won't work, and our job is to run a lot of experiments and figure it out. If we have an efficient compute infrastructure that lets you run an experiment in a day rather than a week, then your research progress could be almost 7x as fast!

This is why around 2008 my group at Stanford started advocating shifting deep learning to GPUs (this was really controversial at that time; but now everyone does it); and I'm now advocating shifting to HPC (High Performance Computing/Supercomputing) tactics for scaling up deep learning. Machine learning should embrace HPC. These methods will make researchers more efficient and help accelerate the progress of our whole field.

To summarize: Deep learning has already helped AI made tremendous progress. But the best is still to come!

Our analysis of this article has two goals: count how many times each word appears, and then report the most frequent words. The rest of the post works toward these two goals:

We will use the matplotlib module to draw the chart:

1: Tokenization

One nice thing about English text is that words are separated by spaces, but words often carry punctuation such as periods and commas, which gets in the way. We therefore use the string module to clean them up: string.punctuation holds the punctuation characters and string.whitespace holds the whitespace characters, and passing both to strip() removes them from the ends of each word. As for hist[word] = hist.get(word, 0) + 1, this one line is equivalent to the if-else shown in the comment above it in the code; it records each word together with the number of times that word appears.
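A minimal sketch of this step (it mirrors process_line in the full code at the end of the post):

import string

def process_line(line, hist):
    # Process one line: split on whitespace, then count each cleaned-up word
    for word in line.split():
        # Strip punctuation and whitespace characters surrounding the word
        word = word.strip(string.punctuation + string.whitespace)
        # Normalize to lowercase so "Deep" and "deep" count as the same word
        word = word.lower()
        # Equivalent to: if word not in hist: hist[word] = 1 else: hist[word] += 1
        hist[word] = hist.get(word, 0) + 1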

The result is as follows:

2: Sorting

After the previous step we have every word and the number of times it appears, but the result is unordered. Here we turn the dict into a list and sort it: lists have a sort() method, and calling it with reverse=True sorts from largest to smallest.
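A minimal sketch of the sorting step (it mirrors most_word in the full code below):

def most_word(hist, num):
    # Turn the {word: count} dict into a list of [count, word] pairs,
    # sort it from largest to smallest, and keep the top `num` entries
    temp = []
    for key, value in hist.items():
        temp.append([value, key])
    temp.sort(reverse=True)
    return temp[:num]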

The result is as follows:

3: Plotting

Everyone is already familiar with the matplotlib plotting used here; we simply draw one bar for each of the top words:
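A minimal sketch of the plotting step, mirroring the loop in the full code below (here `data` is assumed to be the list of [count, word] pairs returned by most_word):

from matplotlib import pyplot as plt

# data is assumed to come from most_word(hist, 20)
for i in range(20):
    # Draw one bar per word, indexed by its rank
    plt.bar(i, data[i][0])
plt.xlabel('word')
plt.ylabel('rate')
plt.show()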

Ideally the chart should have word labels along the bottom, for example:

That would be the best presentation, but after switching to another computer I found the labels at the bottom looked really ugly and crowded, so I removed them. If you are interested, you can add them back yourself; one possible approach is sketched below.
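A possible sketch for adding the labels back; the use of plt.xticks and the rotation value are my own assumptions, not part of the original code:

from matplotlib import pyplot as plt

# Assuming `data` is the list of [count, word] pairs returned by most_word(hist, 20)
words = [item[1] for item in data]
counts = [item[0] for item in data]
plt.bar(range(len(words)), counts)
# Put the words on the x axis and rotate them so they do not crowd each other
plt.xticks(range(len(words)), words, rotation=45)
plt.xlabel('word')
plt.ylabel('count')
plt.show()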

The complete code is as follows:

# -*- coding:utf-8 -*-
import string
from matplotlib import pyplot as plt


def process_line(line, hist):
    # Process each line
    for word in line.split():
        # Strip punctuation (and whitespace) surrounding the word
        word = word.strip(string.punctuation + string.whitespace)
        # Normalize every word to lowercase
        word = word.lower()
        # Dict idiom; original logic:
        #   if word not in hist: hist[word] = 1
        #   else: hist[word] += 1
        hist[word] = hist.get(word, 0) + 1


def process_file(filename):
    # Process the whole article
    res = {}
    with open(filename, 'r') as f:
        for line in f:
            process_line(line, res)
    return res


def most_word(hist, num):
    # Return the requested number of word-frequency entries, from high to low
    temp = []
    for key, value in hist.items():
        temp.append([value, key])
    temp.sort(reverse=True)
    print(temp)
    return temp[:num]


if __name__ == '__main__':
    hist = process_file('emma.txt')
    data = most_word(hist, 20)  # the 20 most frequent words
    for i in range(20):
        # plt.bar(t[i][1:], t[i][:-1])
        plt.bar(i, data[i][0])
    plt.xlabel('word')
    plt.ylabel('rate')
    plt.title('show')
    plt.show()

