[I Love Memorizing Words] A list of 3,000 high-frequency English news words, distilled from 3 million words

02-02

-----------------2017-01-22 20:21----------------

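[The word list has been updated]

[A reader pointed out that when the file is opened in the Notepad that ships with Windows, the words are not separated from one another. I will reprocess the file when I get home tonight. In the Notepad I use myself the words are split onto separate lines as expected, so please bear with me.]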

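Whatever your reason for learning English, vocabulary is a hurdle you cannot get around: without enough words, grammar alone will not let you build phrases and sentences.

A while ago, 惡魔的奶爸 shared several high-frequency word lists, each geared toward a particular field, and they were quite good.

Recently, while writing an introductory tutorial on web crawlers in Python, I came across several English-language websites, which gave me the idea of extracting word frequencies from one of them.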

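About the 3,000 high-frequency word list:

Source: Connecting China Connecting the World, 4,700+ pages from across the whole site

Removed: 127 common stop words, words of length 1, and the site's English name, chinadaily.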

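In my younger and more vulnerable years my father gave me some advice that I've been turning over in my mind ever since. "Whenever you feel like criticizing any one," he told me, "just remember that all the people in this world haven't had the advantages that you've had."
(The Great Gatsby)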

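What follows is how the 3,000 high-frequency words were extracted. If you only want the final word list, skip to the end of the article.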

1. Crawl the URLs of every page on ChinaDaily

import requests
from bs4 import BeautifulSoup

# Links collected so far
pages = set()

def get_all_link(url):
    try:
        # Split the URL; host[2] is the host name
        host = url.split('/')
        # print(host[2])
        wbdata = requests.get(url).text
        soup = BeautifulSoup(wbdata, 'lxml')
        for link in soup.find_all('a'):
            # Check what form the extracted URL takes
            if link.get('href') not in pages and link.get('href') is not None:
                if link.get('href').startswith('http'):
                    # Absolute URL: follow it only if it is on the same host
                    if link.get('href').split('/')[2] == host[2]:
                        newpage = link.get('href')
                        # print(newpage)
                        pages.add(newpage)
                        get_all_link(newpage)
                elif link.get('href').startswith('/'):
                    # Site-relative URL: prepend the host before requesting it
                    newpage = link.get('href')
                    pages.add(newpage)
                    newpage_url = 'http://' + host[2] + newpage
                    # print(newpage_url)
                    get_all_link(newpage_url)
        print('Number of URLs: ' + str(len(pages)))
    except Exception as e:
        print('Error: {0}'.format(e))
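The function above recurses once per link, so on a site with thousands of pages it can run into Python's recursion limit. As a rough sketch of an alternative (not the code used for this article), the same same-host crawl can be written with a queue. The names `LinkParser`, `crawl_site`, `fetch`, and `max_pages` are all made up for this illustration, and it uses only the standard library so the logic can be tried without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)


def crawl_site(start_url, fetch, max_pages=5000):
    """Breadth-first crawl of same-host pages; returns the set of URLs seen.

    fetch is any callable that takes a URL and returns its HTML as a string
    (a hypothetical hook, so the crawl logic can be tested offline).
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception as e:
            print('Failed to fetch {0}: {1}'.format(url, e))
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            if len(seen) >= max_pages:
                return seen
            # urljoin resolves relative links; the fragment (#...) is dropped
            absolute = urljoin(url, href).split('#')[0]
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen
```

To run it against a live site, one option is to pass `lambda u: requests.get(u, timeout=10).text` as `fetch`. Because every link is resolved to an absolute URL before it is stored, the returned set can be fed to the next step directly.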

2. Request each crawled URL and parse the words on the page

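import re
from multiprocessing import Pool

import requests
from bs4 import BeautifulSoup

# Parse the words on a page and append them to a text file
def resolve_html(url):
    wbdata = requests.get(url).content
    soup = BeautifulSoup(wbdata, 'lxml')
    # Remove line-break characters
    text = str(soup).replace('\n', '').replace('\r', '')
    # Replace <script> tags and their contents
    text = re.sub(r'<script.*?>.*?</script>', ' ', text)
    # Replace HTML tags
    text = re.sub(r'<.*?>', " ", text)
    # Replace everything that is not a letter
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # Convert to lowercase
    text = text.lower()
    text = text.split()
    # Drop one-letter words and the site's own name
    text = [i for i in text if len(i) > 1 and i != 'chinadaily']
    text = ' '.join(text)
    print(text)
    with open(r"j:\python\words.txt", 'a+', encoding='utf-8') as file:
        file.write(text + ' ')
        print("Write succeeded")

if __name__ == '__main__':
    urllist = []  # placeholder: fill with the URLs collected in step 1
    pool = Pool(processes=2)
    pool.map_async(resolve_html, urllist)
    pool.close()
    pool.join()
    print('Run complete')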
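To see what the cleaning steps do, here is a small sketch that applies the same sequence of replacements to an HTML string instead of a downloaded page. The function name `extract_words`, its `site_name` parameter, and the sample string are invented for this illustration.

```python
import re


def extract_words(html, site_name='chinadaily'):
    """Return the lowercase words in an HTML string, minus noise."""
    # Drop line breaks so that .*? can span what used to be separate lines
    text = html.replace('\n', '').replace('\r', '')
    # Remove <script> blocks together with their contents
    text = re.sub(r'<script.*?>.*?</script>', ' ', text)
    # Remove the remaining HTML tags
    text = re.sub(r'<.*?>', ' ', text)
    # Replace anything that is not an ASCII letter with a space
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    words = text.lower().split()
    # Keep words longer than one letter that are not the site's own name
    return [w for w in words if len(w) > 1 and w != site_name]


sample = '<p>Hello <b>World</b> 2017 a chinadaily</p><script>var x = 1;</script>'
print(extract_words(sample))  # ['hello', 'world']
```

The digits, the one-letter word "a", the site name, and the JavaScript inside the script block are all gone, which is the filtering described in the word list notes at the top.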

3. Compute word frequencies from the word text file

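import nltk
from nltk.corpus import PlaintextCorpusReader, stopwords

# Compute word frequencies from the word text file
# (the stop word list needs a one-time nltk.download('stopwords'))
def resolve_words():
    # Directory that holds words.txt
    corpath = r'J:\python\words'
    wordlist = PlaintextCorpusReader(corpath, '.*')
    allwords = nltk.Text(wordlist.words('words.txt'))
    print("Total words", len(allwords))
    print("Distinct words", len(set(allwords)))
    stop = set(stopwords.words('english'))
    swords = [i for i in allwords if i not in stop]
    print("Total words after removing stop words:", len(swords))
    print("Distinct words after removing stop words:", len(set(swords)))
    print("Starting frequency count")
    fdist = nltk.FreqDist(swords)
    print(fdist.most_common(3000))
    with open(r'J:\python\words3000.txt', 'a+', encoding='utf-8') as file:
        for item in fdist.most_common(3000):
            print(item, item[0])
            # One word per line; '\n' rather than a bare '\r', which
            # Windows Notepad does not display as a line break
            file.write(item[0] + '\n')
    print("Write complete")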

The results:

Total words 3537063
Distinct words 38201
Total words after removing stop words: 2603450
Distinct words after removing stop words: 38079

Some of the words and their frequencies:

('online', 8788)
('business', 8772)
('society', 8669)
('people', 8646)
('content', 8498)
('story', 8463)
('multimedia', 8287)
('cdic', 8280)
('travel', 7959)
('com', 7691)
('cover', 7679)
('cn', 7515)
('hot', 7219)
('shanghai', 7064)
('first', 6941)
('photos', 6739)
('page', 6562)
('years', 6367)
('paper', 6289)
('festival', 6188)
('offer', 6064)
('sports', 6025)
('africa', 6008)
('forum', 5983)
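For readers who do not have NLTK installed, the counting step can be approximated with the standard library alone. `nltk.FreqDist` is a subclass of `collections.Counter`, so `most_common` behaves the same way in both. This is only a sketch: the function name `top_words`, the sample sentence, and the tiny stop word list are made up for the illustration and are not NLTK's list of English stop words.

```python
from collections import Counter


def top_words(words, stop_words, n):
    """Return the n most frequent words that are not stop words."""
    stop = set(stop_words)
    counts = Counter(w for w in words if w not in stop)
    return counts.most_common(n)


words = 'the news of the day is business news and more business news'.split()
print(top_words(words, ['the', 'of', 'is', 'and'], 2))
# [('news', 3), ('business', 2)]
```

Each result is a (word, count) pair, the same shape as the tuples listed above.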

The end result is a txt file containing 3,000 high-frequency words, which you can import into the word book of any of the major vocabulary apps.

Download:

Follow the WeChat official account: 州的先生
Reply with the keyword: 3000高頻詞