Some New Interesting Deep Learning Datasets for Data Scientists

Larger tagged datasets and more available computing power are what triggered the recent AI revolution. In this article, I have listed some of the most interesting Deep Learning datasets for data scientists that I have come across recently.

EMNIST: An Extension of MNIST to Handwritten Letters

MNIST is one of the most popular datasets for people getting started with Deep Learning in particular and Machine Learning on images in general. It contains images of handwritten digits, each labeled with the digit it shows. EMNIST extends this to images of handwritten letters as well. The dataset can be downloaded here. We also discovered an alternative dataset on Reddit. It is called HASYv2 and can be downloaded here.
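
To make the idea of labels concrete, here is a minimal sketch of turning a numeric class label back into the character it stands for. It assumes a 62-class layout (digits first, then uppercase letters, then lowercase letters), which is my understanding of how one of the EMNIST splits is organized. Other splits use different layouts, so check the mapping file that ships with the dataset before relying on this. The name `label_to_char` is made up for illustration.

```python
import string

# Assumed class layout: 0-9 -> digits, 10-35 -> 'A'-'Z', 36-61 -> 'a'-'z'.
# This is an assumption for illustration, not an official mapping.
CLASSES = string.digits + string.ascii_uppercase + string.ascii_lowercase


def label_to_char(label: int) -> str:
    """Return the character for a numeric class label under the assumed layout."""
    if not 0 <= label < len(CLASSES):
        raise ValueError(f"label {label} is outside 0..{len(CLASSES) - 1}")
    return CLASSES[label]


if __name__ == "__main__":
    for label in (7, 10, 36):
        print(label, "->", label_to_char(label))
```

With plain MNIST the same function would only ever see labels 0 to 9, which is exactly the part that EMNIST extends.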

HICO has images containing multiple objects, and these objects have been tagged along with the relationships between them. The proposed problem is for algorithms, after being trained on this dataset, to pick out the objects in an image and the relationships between them. I expect multiple papers to come out of this dataset in the future.

CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

CLEVR comes from the group of Fei-Fei Li, the scientist who led the development of the revolutionary ImageNet dataset. It has images of objects and questions asked about those objects, along with their answers. The aim of the project is to develop machines with common sense about what they see. For example, the machine should be able to find “an odd one out” in an image automatically. You can download the dataset here.

HolStep: A Machine Learning Dataset for Higher-order Logic Theorem Proving

This dataset is tagged so that algorithms trained on it can be used for automatic theorem proving. The download link is here.

The Parallel Meaning Bank: Towards a Multilingual Corpus of Translations Annotated with Compositional Meaning Representations

The Parallel Meaning Bank (PMB), developed at the University of Groningen, comprises sentences and texts in raw and tokenised format, tags for part of speech, named entities and lexical categories, and formal meaning representations. The download link is here.

JFLEG: A Fluency Corpus and Benchmark for Grammatical Error Correction

The JFLEG dataset tags sentences with both minimal grammatical corrections and smarter, fluency-oriented corrections. It is aimed at building machines that can automatically correct grammatical mistakes in people's writing. The dataset can be downloaded here.

Introducing VQA v2.0: A More Balanced and Bigger VQA Dataset!

This dataset has images, questions asked about them, and the tagged answers. The aim is to train machines to answer questions about images (and, by extension, about the real world they see). Visual QA is an older dataset, but its 2.0 version came out just this December.


Google Cloud & YouTube-8M Video Understanding Challenge

This is probably the largest dataset openly available for training. It consists of 8 million YouTube videos tagged with the objects within them. There is also a Kaggle competition running on the dataset with a prize pool of $100,000.

Data Science Bowl 2017

This turns out to be the largest bounty offered to crack a data science problem. There is $1 million in prizes to be won by data scientists who can detect lung cancer using this dataset of tagged CT scans.

Exoplanets Dataset

A team that includes MIT and is led by the Carnegie Institution for Science has released the largest collection of observations made with a technique called radial velocity, to be used for hunting exoplanets. The dataset can be downloaded here.

End-to-End Interpretation of the French Street Name Signs Dataset

This is a huge dataset of French street signs labeled with what they denote. The dataset is easily readable by everyone’s favorite, TensorFlow, and can be downloaded here.

A Realistic Dataset for the Smart Home Device Scheduling Problem for DCOPs

An upcoming dataset at the interface of IoT and AI. You can download it here.

RepEval 2017 Shared Task

From Sam Bowman’s team, the creators of the famous SNLI dataset, comes this dataset about understanding the meaning of text, which will be out soon as a competition. The dataset is expected by 15th March. You can find it here once it’s live.

Driver Speed Dataset

A huge 200 GB dataset aimed at estimating the speed of moving vehicles. It can be downloaded here.


Similar questions dataset by Quora

This dataset was donated by Quora so that people can train algorithms to identify similar questions. To be clear, it is very difficult for a machine to tell that “Why is Apple tasty?” and “Why are Apple stocks trending?” are different questions. This is what the dataset aims to teach machines to recognize.
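
A rough sketch of why this is hard: a naive word-overlap score sees the shared words in the two example questions but knows nothing about which "Apple" is meant. The Jaccard similarity used here is my own choice for illustration, not something the dataset prescribes, and the helper names `tokens` and `jaccard` are hypothetical.

```python
import re


def tokens(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Share of distinct words that the two texts have in common."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    q1 = "Why is Apple tasty?"
    q2 = "Why are Apple stocks trending?"
    # The questions share "why" and "apple" out of 7 distinct words.
    print(round(jaccard(q1, q2), 2))  # 0.29
```

The score is well above zero even though the questions have nothing to do with each other, and two genuinely duplicate questions phrased with different words could easily score lower. That gap is what a model trained on labeled question pairs is meant to close.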

NWPU-RESISC45 Remote sensing images dataset

A huge dataset of remote sensing images covering a wide array of landscapes as seen from satellites. Potential applications include satellite surveys, monitoring, and surveillance. Unfortunately, we are still waiting for the download link here.

Recipe to create your own free datasets from the open web

This is probably the most interesting of the datasets. It has been tagged not by humans but by machines. The authors also make clear what needs to be done to create a similar dataset from the millions of images already available on the web.

The LIP Dataset

This large-scale dataset focuses on the semantic understanding of a person. The download link for the dataset is here.

WikiReading Data

WikiReading is a large-scale natural language understanding task and publicly available dataset with 18 million instances. The download link is here.


MUSCIMA++ is a dataset of handwritten music notation for musical symbol detection. Here is the download link.

DeScript (Describing Script Structure)

DeScript is a corpus of event sequence descriptions (ESDs) for different scenarios, crowdsourced via Amazon Mechanical Turk. Here is the download link.

Maluuba Frames

Maluuba’s datasets are meant to encourage research in the field of Artificial Intelligence. Here is the download link.

The UMCD Dataset

This is a collection of geo-referenced video sequences acquired at low-altitude for mosaicking and change detection purposes.

Stanford 2D-3D-Semantics Dataset (2D-3D-S)

The 2D-3D-S dataset provides a variety of mutually registered modalities from 2D, 2.5D and 3D domains, with instance-level semantic and geometric annotations.

That wraps up the deep learning datasets list for now. I will keep updating it as I find more interesting datasets.
