A Collection of Open Deep Learning Datasets from China and Abroad (Worth Bookmarking, Continuously Updated)

04-23

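I. Image processing datasets

1. MNIST is one of the most popular deep learning datasets. It is a dataset of handwritten digits, with a training set of 60,000 examples and a test set of 10,000 examples. It is a good database for trying out learning techniques and pattern recognition methods on real-world data without spending much time or effort on data preprocessing.

Size: about 50 MB

Count: 70,000 images in 10 classes.

Identify the Digits

datahack.analyticsvidhya.com

MNIST handwritten digit database, Yann LeCun, Corinna Cortes and Chris Burges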

yann.lecun.com

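2. Fashion-MNIST contains 60,000 training images and 10,000 test images. It is an MNIST-like database of fashion products. Its developers felt that MNIST had been overused, so they built this dataset as a drop-in replacement for MNIST. Each image is grayscale and carries a label from one of 10 classes.

Size: 30 MB

Count: 70,000 images in 10 classes

zalandoresearch/fashion-mnist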

github.com

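3. The PASCAL VOC challenge is a benchmark for visual object classification, recognition, and detection. It provides a standard annotated image dataset and a standard evaluation procedure for comparing detection algorithms and learning performance. The PASCAL VOC image set covers 20 categories: person; animals (bird, cat, cow, dog, horse, sheep); vehicles (aeroplane, bicycle, boat, bus, car, motorbike, train); indoor objects (bottle, chair, dining table, potted plant, sofa, TV). The challenge has not been held since 2012, but the images are of good quality and fully annotated, which makes the dataset well suited to testing algorithm performance.

Dataset size: ~2 GB

Visual Object Classes Challenge 2012 (VOC2012)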

host.robots.ox.ac.uk

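4. VQA is a dataset of open-ended questions about images. Answering these questions requires an understanding of both vision and language. The dataset has the following interesting features:

Size: 25 GB (compressed)

Count: 265,016 images, at least 3 questions per image, 10 ground-truth answers per question

Announcing the VQA Challenge 2018!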

www.visualqa.org

5. COCO is a large-scale dataset for object detection, segmentation, and captioning.

Size: about 25 GB (compressed)

Count: 330,000 images, 80 object categories, 5 captions per image, 250,000 people with keypoints

Common Objects in Context

cocodataset.org

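6. CIFAR-10 is another dataset for image classification. It consists of 60,000 images in 10 classes (in the original figure, each class is shown as one row). There are 50,000 training images and 10,000 test images. The dataset is divided into 6 parts: 5 training batches and 1 test batch, each containing 10,000 images.

Size: 170 MB

Count: 60,000 images in 10 classes

http://www.cs.toronto.edu/~kriz/cifar.html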

www.cs.toronto.edu
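
To make the batch layout described above more concrete, here is a minimal sketch in Python. It rests on assumptions that are not stated in this article: that you downloaded the "python version" of CIFAR-10, that each batch file is a pickled dict with b"data" and b"labels" entries, and that each data row stores 1024 red, 1024 green, and 1024 blue values in that order. The helper names load_batch and row_to_image are made up for illustration.

```python
import pickle

# Assumption: each image row holds 3072 values, stored as three 32x32
# colour planes (red, then green, then blue), each plane row by row.
SIDE = 32
PLANE = SIDE * SIDE  # 1024 values per colour channel


def load_batch(path):
    """Return (rows, labels) from one batch file such as data_batch_1.

    Assumption: the real files store the rows as a NumPy array, so NumPy
    must be installed for unpickling to succeed.
    """
    with open(path, "rb") as f:
        batch = pickle.load(f, encoding="bytes")
    return batch[b"data"], batch[b"labels"]


def row_to_image(row):
    """Turn one flat 3072-value row into a 32x32 grid of (r, g, b) tuples."""
    if len(row) != 3 * PLANE:
        raise ValueError("expected 3072 values per image")
    return [
        [
            (row[y * SIDE + x],
             row[PLANE + y * SIDE + x],
             row[2 * PLANE + y * SIDE + x])
            for x in range(SIDE)
        ]
        for y in range(SIDE)
    ]


# Synthetic row for illustration: red plane all 10, green all 20, blue all 30.
fake_row = [10] * PLANE + [20] * PLANE + [30] * PLANE
image = row_to_image(fake_row)
print(len(image), len(image[0]), image[0][0])
```

With the real data you would call load_batch once for each of the 5 training batches and once for the test batch, then pass individual rows to row_to_image.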

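7. ImageNet is an image dataset organized according to the WordNet hierarchy. WordNet contains roughly 100,000 phrases, and ImageNet provides on average about 1,000 images to illustrate each phrase.

Size: about 150 GB

Count: about 1,500,000 images in total; each image has multiple bounding boxes with their class labels.

http://www.image-net.org/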

www.image-net.org

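8. The Street View House Numbers (SVHN) dataset is a real-world dataset for developing object detection algorithms. It requires minimal data preprocessing. It is similar to MNIST but has far more labeled data (over 600,000 images). The data was collected from house numbers seen in Google Street View.

Size: 2.5 GB

Count: 630,420 images in 10 classes

The Street View House Numbers (SVHN) Dataset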

ufldl.stanford.edu

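9. Open Images is a dataset of nearly 9 million image URLs. The images are annotated with image-level labels and bounding boxes spanning thousands of classes. The training set contains 9,011,219 images, the validation set 41,260 images, and the test set 125,436 images.

Size: 500 GB (compressed); about 1.5 GB without the images

Count: 9,011,219 images with more than 5,000 labels

openimages/dataset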

github.com

10. A very large machine-annotated dataset containing 200 million images.

We address the problem of large-scale annotation of web images. Our approach is based on the concept of visual synset, which is an organization of images which are visually-similar and semantically-related. Each visual synset represents a single prototypical visual concept, and has an associated set of weighted annotations. Linear SVMs are utilized to predict the visual synset membership for unseen image examples, and a weighted voting rule is used to construct a ranked list of predicted annotations from a set of visual synsets. We demonstrate that visual synsets lead to better performance than standard methods on a new annotation database containing more than 200 million images and 300 thousand annotations, which is the largest ever reported.

VisualSynset

cpl.cc.gatech.edu

11. A dataset containing 130,000 images.

Scene categorization is a fundamental problem in computer vision. However, scene understanding research has been constrained by the limited scope of currently-used databases which do not capture the full variety of scene categories. Whereas standard databases for object categorization contain hundreds of different classes of objects, the largest available dataset of scene categories contains only 15 classes. In this paper we propose the extensive Scene UNderstanding (SUN) database that contains 899 categories and 130,519 images. We use 397 well-sampled categories to evaluate numerous state-of-the-art algorithms for scene recognition and establish new bounds of performance. We measure human scene classification performance on the SUN database and compare this with computational methods.

http://vision.princeton.edu/projects/2010/SUN/

vision.princeton.edu

12. Contains 1 million images and 23,000 videos. It is produced by Microsoft Research Asia, so the quality should be dependable.

Microsoft Research – Emerging Technology, Computer, and Software Research

research.microsoft.com

II. Natural language processing datasets

1. The IMDB movie review dataset is a treat for movie lovers. It is meant for binary sentiment classification and contains more data than earlier datasets in this field. Besides the training and test review examples, there is additional unlabeled data. Both raw text and a preprocessed bag-of-words format are included.

Size: 80 MB

Count: 25,000 highly polar movie reviews each for training and testing

Sentiment Analysis

ai.stanford.edu
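
As a rough illustration of how the raw-text reviews can be turned into binary labels, here is a minimal Python sketch. It assumes details that come from the dataset's own README rather than from this article: that each review is a file named like id_rating.txt, that ratings of 7 or more count as positive and 4 or less as negative, and that the unlabeled reviews carry a rating of 0. The function names are made up for illustration.

```python
from pathlib import Path


def parse_review_name(name):
    """Split a name like '200_8.txt' into (review_id, rating)."""
    stem = Path(name).stem
    review_id, rating = stem.split("_")
    return int(review_id), int(rating)


def label_from_rating(rating):
    """Map a 1-10 star rating to a binary label; None means unlabeled."""
    if rating >= 7:
        return "pos"
    if 1 <= rating <= 4:
        return "neg"
    return None  # assumption: rating 0 marks the unlabeled reviews


# Made-up file names for illustration.
for name in ["200_8.txt", "17_1.txt", "5_0.txt"]:
    review_id, rating = parse_review_name(name)
    print(review_id, rating, label_from_rating(rating))
```

Middle ratings (5 and 6) also map to None here, which matches the "highly polar" description: such reviews are not expected in the labeled folders.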

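2. The machine translation dataset for European languages contains training data for four European languages and is intended to improve current translation methods. You can use any of the following language pairs: French-English, Spanish-English, German-English, Czech-English.

Size: about 15 GB

Count: about 30,000,000 sentences with their translations

2018 Third Conference on Machine Translation (WMT18)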

statmt.org

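3. WordNet is a large English database of synsets. A synset is a group of synonyms, and each synset describes a different concept. WordNet's structure makes it a very useful tool for NLP.

Size: 10 MB

Count: 117,000 synsets

A Lexical Database for English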

wordnet.princeton.edu

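4. The Wikipedia Corpus is a collection of the full text of Wikipedia, containing nearly 1.9 billion words from more than 4 million articles. You can search it by word, by phrase, or by part of a paragraph, which makes it a powerful NLP dataset.

Size: 20 MB

Count: 4,400,000 articles containing 1.9 billion words

Tagged and Cleaned Wikipedia (TC Wikipedia) and its Ngram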

nlp.cs.nyu.edu

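5. The Yelp dataset is an open dataset released by Yelp for learning purposes. It contains millions of user reviews, business attributes, and over 200,000 photos from multiple metropolitan areas. It is a very commonly used dataset for NLP challenges worldwide.

Size: 2.66 GB JSON, 2.9 GB SQL, and 7.5 GB of photos (all compressed)

Count: 5,200,000 reviews, 174,000 business attributes, 200,000 photos, and 11 metropolitan areas

Yelp Dataset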

www.yelp.com

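6. The Blog Authorship Corpus contains blog posts collected from thousands of bloggers on blogger.com. Each blog is provided as a separate file, and each blog contains at least 200 occurrences of commonly used English words.

Size: 300 MB

Count: 681,288 posts, totaling over 140 million words.

http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm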

u.cs.biu.ac.il

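7. The Twenty Newsgroups dataset, as the name suggests, contains information about newsgroups. It is a collection of 20,000 newsgroup documents taken from 20 different newsgroups (1,000 from each). The articles have typical features such as subject lines and lead paragraphs.

Size: 20 MB

Count: 20,000 messages from 20 newsgroups

Twenty Newsgroups Data Set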

archive.ics.uci.edu
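
Because each document is a newsgroup post, its subject line can usually be separated from the body with Python's standard email parser. The sketch below is only an illustration: the sample post is made up, and it assumes each file starts with mail-style headers followed by a blank line and then the body.

```python
from email import message_from_string

# Made-up post for illustration; real files carry more headers.
sample_post = """\
From: alice@example.com (Alice)
Subject: Re: Which telescope for a beginner?
Newsgroups: sci.space

I would start with a small reflector.
"""

message = message_from_string(sample_post)
subject = message["Subject"]          # header lookup is case-insensitive
body = message.get_payload().strip()  # plain string for non-multipart posts
print(subject)
print(body)
```

When reading the real files, you may need to open them with an explicit encoding such as latin-1, since not every post is valid UTF-8.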

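8. Sentiment140 is a dataset for sentiment analysis. This popular dataset is a good way to start your natural language processing journey. Emoticons have been removed from the data in advance. The final dataset has the following six fields: the polarity of the tweet, the ID of the tweet, the date of the tweet, the query, the username of the tweeter, and the text of the tweet.

Size: 80 MB (compressed)

Count: 1,600,000 tweets

For Academics - Sentiment140 - A Twitter Sentiment Analysis Tool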

help.sentiment140.com
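
The six fields listed above map naturally onto a CSV row. The following Python sketch shows one way to read them. It makes assumptions beyond this article: that the file is a header-less CSV with quoted fields in the order given, and that polarity is coded as 0 (negative), 2 (neutral), and 4 (positive). The two sample rows are invented for illustration.

```python
import csv
import io

FIELDS = ["polarity", "id", "date", "query", "user", "text"]
# Assumption: polarity coding used by the dataset's documentation.
POLARITY_NAMES = {"0": "negative", "2": "neutral", "4": "positive"}

# Invented rows standing in for the real file.
sample = io.StringIO(
    '"4","3","Mon May 11 03:17:40 UTC 2009","kindle2","user_a",'
    '"Loving my new reader, it is fantastic"\n'
    '"0","7","Mon May 11 03:21:41 UTC 2009","NO_QUERY","user_b",'
    '"My flight got cancelled again"\n'
)

# csv.reader handles the quoting, so commas inside the text are preserved.
rows = [dict(zip(FIELDS, record)) for record in csv.reader(sample)]
for row in rows:
    print(POLARITY_NAMES[row["polarity"]], "|", row["user"], "|", row["text"])
```

For the real file, replace the StringIO object with an opened file handle; passing newline="" to open() is the usual recommendation for the csv module.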

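III. Audio / speech datasets

1. VoxCeleb is a large-scale speaker identification dataset. It contains around 100,000 utterances by 1,251 celebrities, extracted from YouTube videos. The data is roughly gender balanced (55% male). The celebrities span different accents, professions, and ages. There is no overlap between the development and test sets. Classifying and identifying speech from big stars makes for an interesting task.

Size: 150 MB

Count: 100,000 utterances by 1,251 celebrities

VoxCeleb dataset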

www.robots.ox.ac.uk

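2. YouTube-8M is a video dataset open-sourced by Google. The videos come from YouTube: 8 million videos in total, 500,000 hours of footage, and 4,800 classes. To keep the labeled video collection stable and of good quality, Google only uses public videos with more than 1,000 views. So that researchers and students with limited computing resources can also use the collection, Google preprocessed the videos and extracted frame-level features, which are compressed to fit on a single hard disk (under 1.5 TB).

Size: ~1.5 TB

https://research.google.com/youtube8m/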

research.google.com

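3. The Free Spoken Digit Dataset is yet another dataset in this article inspired by MNIST! It was created for the task of identifying spoken digits in audio samples. It is an open dataset, so the hope is that it will keep growing as people contribute more data. Currently it has the following characteristics: 3 speakers; 1,500 recordings (50 of each digit from 0 to 9 per speaker); English pronunciations.

Size: 10 MB

Count: 1,500 audio samples. SOTA: "Raw Waveform-based Audio ..."

Jakobovski/free-spoken-digit-dataset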

github.com
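
The count of 1,500 recordings follows from 3 speakers times 10 digits times 50 takes. The Python sketch below checks that arithmetic and shows how labels could be recovered from file names. It assumes the naming scheme digit_speaker_index.wav described in the repository README, which this article does not mention, and the speaker names are hypothetical.

```python
from collections import Counter


def parse_recording_name(name):
    """Split a name like '7_alice_32.wav' into (digit, speaker, take index)."""
    stem = name.rsplit(".", 1)[0]
    digit, speaker, index = stem.split("_")
    return int(digit), speaker, int(index)


# Hypothetical speaker names; the article only says there are 3 voices.
speakers = ["alice", "bob", "carol"]
names = [
    f"{digit}_{speaker}_{take}.wav"
    for digit in range(10)
    for speaker in speakers
    for take in range(50)
]

per_digit = Counter(parse_recording_name(name)[0] for name in names)
print(len(names))     # 3 speakers * 10 digits * 50 takes
print(per_digit[0])   # recordings of the digit 0 across all speakers
```

Note that this simple split would break if a speaker name itself contained an underscore.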

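4. The Million Song Dataset is a freely available collection of audio features and metadata for a million contemporary popular music tracks. Its purposes are: to encourage research on algorithms that scale to commercial sizes; to provide a reference dataset for evaluating research; to serve as a shortcut alternative to creating a large dataset with APIs (for example, The Echo Nest API); and to help new researchers get started in the MIR field. The core of the dataset is the feature analysis and metadata for one million songs. The dataset does not include any audio, only the derived features. Sample audio can be fetched from services such as 7digital, using code provided by Columbia University (https://github.com/tb2332/MSongsDB/tree/master/Tasks_Demos/Preview7digital).

Size: 280 GB

Count: one million songs!

https://labrosa.ee.columbia.edu/millionsong/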

labrosa.ee.columbia.edu

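5. FMA is a dataset for music analysis, made up of full-length HQ audio, pre-computed features, and track-level and user-level metadata. It is an open dataset for evaluating several tasks in MIR. The CSV files it includes, and their contents, are as follows. tracks.csv: per-track metadata such as ID, title, artist, genres, tags, and play counts, for all 106,574 tracks. genres.csv: the IDs of all 163 genres with their names and parent genres (used to infer the genre hierarchy and top-level genres). features.csv: common features extracted with librosa. echonest.csv: audio features provided by Echonest (now Spotify) for a subset of 13,129 tracks.

Size: about 1000 GB

Count: about 100,000 tracks

mdeff/fma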

github.com

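6. Ballroom contains audio files of ballroom dance music. It provides some characteristic excerpts of many dance styles in real audio format. Some characteristics of the dataset: total number of instances: 698; duration of each clip: about 30 seconds; total duration: about 20,940 seconds.

Size: 14 GB (compressed)

Count: about 700 audio samples

Ballroom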

mtg.upf.edu
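
The total duration quoted for Ballroom is simply the clip count multiplied by the approximate clip length. A short Python check, using only the figures given in the description:

```python
instances = 698
seconds_each = 30  # approximate clip length from the description

total_seconds = instances * seconds_each
total_hours = total_seconds / 3600
print(total_seconds)          # 20940
print(round(total_hours, 2))  # 5.82
```

So the whole collection amounts to a little under six hours of audio.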

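7. LibriSpeech is a large corpus of approximately 1,000 hours of English speech. The data comes from audiobooks of the LibriVox project and has been properly segmented and aligned. If you are looking for a starting point, see http://www.kaldi-asr.org/downloads/build/6/trunk/egs/ for acoustic models trained on this dataset and http://www.openslr.org/11/ for language models suitable for evaluation.

Size: about 60 GB

Count: 1,000 hours of speech

openslr.org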

www.openslr.org

IV. General datasets

1. A very large Flickr dataset released by Yahoo, containing more than 100 million images.

The data collected so far represents the world's largest multimedia metadata collection that is available for research on scalable similarity search techniques. CoPhIR consists of 106 million processed images. CoPhIR is now available to the research community to try and compare different indexing technologies for similarity search, with scalability being the key issue. Our use of the Flickr image content is compliant to the Creative Commons license. CoPhIR Test Collection is compliant to the European Recommendation 29/2001 CE, based on WIPO (World Intellectual Property Organization) Copyright Treaty and Performances and Phonograms Treaty, and to the current Italian law 68/2003. In order to access the CoPhIR distribution, the organizations (universities, research labs, etc.) interested in building experimentations on it will have to sign the enclosed CoPhIR Access Agreement and the CoPhIR Access Registration Form, sending the original signed document to us by mail. Please follow the instructions in the section "How to get CoPhIR Test Collection". You will then receive Login and Password to download the required files.

CoPhIR - what is

cophir.isti.cnr.it

2. Contains 80 million 32x32 images; CIFAR-10 and CIFAR-100 were selected from it.

The 79 million images are stored in one giant binary file, 227Gb in size. The metadata accompanying each image is also in a single giant file, 57Gb in size. To read images/metadata from these files, we have provided some Matlab wrapper functions. There are two versions of the functions for reading image data: (i) loadTinyImages.m - plain Matlab function (no MEX), runs under 32/64bits. Loads images in by image number. Use this by default. (ii) read_tiny_big_binary.m - Matlab wrapper for 64-bit MEX function. A bit faster and more flexible than (i), but requires a 64-bit machine. There are two types of annotation data: (i) Manual annotation data, sorted in annotations.txt, that holds the label of images manually inspected to see if image content agrees with noun used to collect it. Some other information, such as search engine, is also stored. This data is available for only a very small portion of images. (ii) Automatic annotation data, stored in tiny_metadata.bin, consisting of information relating the gathering of the image, e.g. search engine, which page, url to thumbnail etc. This data is available for all 79 million images.

http://horatio.cs.nyu.edu/mit/tiny/data/index.html

horatio.cs.nyu.edu
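
Loading an image "by image number" from one giant binary file comes down to seeking to a fixed offset. The Python sketch below illustrates the idea under assumptions that the quoted text does not confirm: that every image occupies exactly 32 x 32 x 3 = 3072 bytes, that images are stored back to back with no header, and that numbering starts at 0 here (the Matlab functions likely count from 1). The function name read_image_bytes is made up, and the sketch does not decode the pixel layout inside each block.

```python
import os
import tempfile

# Assumption: one byte per channel value, no per-image header.
BYTES_PER_IMAGE = 32 * 32 * 3  # 3072


def read_image_bytes(path, index):
    """Return the raw 3072 bytes of image number `index` (0-based)."""
    with open(path, "rb") as f:
        f.seek(index * BYTES_PER_IMAGE)
        chunk = f.read(BYTES_PER_IMAGE)
    if len(chunk) != BYTES_PER_IMAGE:
        raise IndexError(f"image {index} is past the end of the file")
    return chunk


# Demo on a tiny stand-in file holding 3 fake images filled with 0, 1 and 2.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for value in range(3):
        tmp.write(bytes([value]) * BYTES_PER_IMAGE)
    demo_path = tmp.name

second = read_image_bytes(demo_path, 1)
print(len(second), second[0])  # 3072 1
os.remove(demo_path)

# Rough size check: 79 million images at 3072 bytes each, in GiB.
estimated_gib = round(79_000_000 * BYTES_PER_IMAGE / 2**30)
print(estimated_gib)  # 226
```

The estimate of roughly 226 GiB is close to the 227Gb quoted above, which suggests the 3072-bytes-per-image assumption is at least plausible.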

3. The MIRFLICKR-25000 open evaluation project consists of 25000 images downloaded from the social photography site Flickr through its public API coupled with complete manual annotations, pre-computed descriptors and software for bag-of-words based similarity and classification and a matlab-like tool for exploring and classifying imagery.

It has 800 Google Scholar citations and 39,000 downloads, from universities (MIT, Cambridge, Stanford, Oxford, Columbia, institutions in the United States and Singapore, Tsinghua, the University of Tokyo, KAIST, etc.) and companies (IBM, Microsoft, Google, Yahoo!, Facebook, Philips, Sony, Nokia, etc.).

The MIRFLICKR Retrieval Evaluation

press.liacs.nl