These datasets can be used for benchmarking deep learning algorithms:
Symbolic Music Datasets
- Piano-midi.de: classical piano pieces (http://www.piano-midi.de/)
- Nottingham : over 1000 folk tunes (http://abc.sourceforge.net/NMD/)
- MuseData: electronic library of classical music scores (http://musedata.stanford.edu/)
- JSB Chorales: set of four-part harmonized chorales (http://www.jsbchorales.net/index.shtml)
Natural Images
- MNIST: handwritten digits (http://yann.lecun.com/exdb/mnist/)
- NIST: similar to MNIST, but larger
- Perturbed NIST: a dataset developed in Yoshua’s class (NIST with tons of deformations)
- CIFAR10 / CIFAR100: 32×32 natural image dataset with 10/100 categories (http://www.cs.utoronto.ca/~kriz/cifar.html)
- Caltech 101: pictures of objects belonging to 101 categories (http://www.vision.caltech.edu/Image_Datasets/Caltech101/)
- Caltech 256: pictures of objects belonging to 256 categories (http://www.vision.caltech.edu/Image_Datasets/Caltech256/)
- Caltech Silhouettes: 28×28 binary images containing silhouettes of objects from the Caltech 101 dataset
- STL-10: an image recognition dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms. It is inspired by the CIFAR-10 dataset but with some modifications (http://www.stanford.edu/~acoates//stl10/)
- The Street View House Numbers (SVHN) Dataset - http://ufldl.stanford.edu/housenumbers/
- NORB: binocular images of toy figurines under various illumination and pose (http://www.cs.nyu.edu/~ylclab/data/norb-v1.0/)
- Imagenet: image database organized according to the WordNet hierarchy (http://www.image-net.org/)
- Pascal VOC: various object recognition challenges (http://pascallin.ecs.soton.ac.uk/challenges/VOC/)
- Labelme: A large dataset of annotated images, http://labelme.csail.mit.edu/Release3.0/browserTools/php/dataset.php
- COIL 20: different objects imaged at every angle in a 360-degree rotation (http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php)
- COIL100: different objects imaged at every angle in a 360-degree rotation (http://www1.cs.columbia.edu/CAVE/software/softlib/coil-100.php)
Artificial Datasets
- Arcade Universe: an artificial dataset generator that produces images containing arcade game sprites, such as Tetris pentomino/tetromino objects. The generator is based on O. Breleux’s bugland dataset generator.
- A collection of datasets inspired by the ideas from BabyAISchool:
  - BabyAIShapesDatasets: distinguishing between 3 simple shapes
  - BabyAIImageAndQuestionDatasets: a question-image-answer dataset
- Datasets generated for the purpose of an empirical evaluation of deep architectures (DeepVsShallowComparisonICML2007):
  - MnistVariations: introducing controlled variations in MNIST
  - RectanglesData: discriminating between wide and tall rectangles
  - ConvexNonConvex: discriminating between convex and nonconvex shapes
  - BackgroundCorrelation: controlling the degree of correlation in noisy MNIST backgrounds
Faces
- Labeled Faces in the Wild: 13,000 images of faces collected from the web, labeled with the name of the person pictured (http://vis-www.cs.umass.edu/lfw/)
- Toronto Face Dataset
- Olivetti: a few images of several different people (http://www.cs.nyu.edu/~roweis/data.html)
- Multi-Pie: The CMU Multi-PIE Face Database (http://www.multipie.org/)
- Face-in-Action (http://www.flintbox.com/public/project/5486/)
- JACFEE: Japanese and Caucasian Facial Expressions of Emotion (http://www.humintell.com/jacfee/)
- FERET: The Facial Recognition Technology Database (http://www.itl.nist.gov/iad/humanid/feret/feret_master.html)
- mmifacedb: MMI Facial Expression Database (http://www.mmifacedb.com/)
- IndianFaceDatabase (http://vis-www.cs.umass.edu/~vidit/IndianFaceDatabase/)
- The Yale Face Database (http://vision.ucsd.edu/content/yale-face-database) and The Yale Face Database B (http://vision.ucsd.edu/~leekc/ExtYaleDatabase/ExtYaleB.html)
Text
- 20 newsgroups: classification task, mapping word occurrences to newsgroup ID (http://qwone.com/~jason/20Newsgroups/)
- Reuters (RCV*) Corpora: text/topic prediction (http://about.reuters.com/researchandstandards/corpus/)
- Penn Treebank : used for next word prediction or next character prediction (http://www.cis.upenn.edu/~treebank/)
- Broadcast News: large text dataset, classically used for next word prediction (http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97S44)
- Wikipedia Dataset
- Multidomain sentiment analysis dataset: http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
Speech
- TIMIT Speech Corpus: phoneme classification (http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S1)
- Aurora: TIMIT with noise and additional information
Recommendation Systems
- MovieLens: Two datasets available from http://www.grouplens.org. The first dataset has 100,000 ratings for 1682 movies by 943 users, subdivided into five disjoint subsets. The second dataset has about 1 million ratings for 3900 movies by 6040 users.
- Jester: This dataset contains 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users.
- Netflix Prize: Netflix released an anonymised version of their movie rating dataset; it consists of 100 million ratings, done by 480,000 users who have rated between 1 and all of the 17,770 movies.
- Book-Crossing dataset: This dataset is from the Book-Crossing community, and contains 278,858 users providing 1,149,780 ratings about 271,379 books.
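The user, item, and rating counts quoted above give a rough sense of how sparse each rating matrix is. The short sketch below computes the observed fraction (ratings divided by users times items) from those figures only. The labels "MovieLens 100K" and "MovieLens 1M" and the helper name `density` are illustrative choices rather than anything defined by the datasets, and the "about 1 million" count is rounded to exactly 1,000,000, so the results are approximate.

```python
# Approximate (users, items, ratings) counts as quoted in the list above.
# The dictionary keys are illustrative labels, not official dataset names.
DATASETS = {
    "MovieLens 100K": (943, 1682, 100_000),
    "MovieLens 1M": (6040, 3900, 1_000_000),   # "about 1 million" ratings
    "Jester": (73_421, 100, 4_100_000),
    "Netflix Prize": (480_000, 17_770, 100_000_000),
    "Book-Crossing": (278_858, 271_379, 1_149_780),
}


def density(users, items, ratings):
    """Fraction of the users-by-items rating matrix that is observed."""
    return ratings / (users * items)


densities = {name: density(*counts) for name, counts in DATASETS.items()}

if __name__ == "__main__":
    # Print datasets from densest to sparsest.
    for name, d in sorted(densities.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:15s} {d:.4%}")
```

Under these figures Jester is by far the densest (more than half of all user-joke pairs are rated), while Book-Crossing is the sparsest, which is worth keeping in mind when comparing algorithms across these benchmarks.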
Misc
- “Musk” dataset
- CMU Motion Capture Database: (http://mocap.cs.cmu.edu/)
- Brodatz dataset: texture modeling (http://www.ux.uis.no/~tranden/brodatz.html)
- Million Song dataset: http://labrosa.ee.columbia.edu/millionsong/
- Merck Molecular Activity Challenge - http://www.kaggle.com/c/MerckActivity/data
from: http://deeplearning.NET/datasets/