Introduction
Data is one of the most important factors in applying deep learning, so choosing a good dataset is crucial to a project's success. When selecting a dataset, pay attention not only to its size, diversity, and quality, but also to whether it reflects the real conditions of the research problem. This article collects publicly available datasets in the field of deep learning for readers to choose from when training models.
1 Comprehensive Datasets
1.1 Kaggle Datasets
Kaggle is one of the largest online repositories of datasets, covering topics that range from sports to medicine to government. The platform is community-led, meaning users can upload their own datasets. Because Kaggle's data comes from so many different sources, it is important to check the quality of any dataset thoroughly before relying on it. Kaggle also hosts discussions on machine learning topics and tutorials on key processes.
Address: kaggle datasets
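As a rough illustration of the quality check recommended above, the sketch below counts rows, exact duplicate rows, and empty cells per column in a CSV file. It is a minimal first pass rather than a full audit, and the sample data and the function name `summarize_csv` are invented for this example.

```python
import csv
import io
from collections import Counter


def summarize_csv(text):
    """Return basic quality statistics for CSV text that has a header row."""
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []
    missing = Counter()  # column name -> number of empty cells
    seen = set()         # rows already encountered, used to spot duplicates
    rows = 0
    duplicates = 0
    for row in reader:
        rows += 1
        key = tuple(row.get(col) for col in columns)
        if key in seen:
            duplicates += 1
        seen.add(key)
        for col in columns:
            value = row.get(col)
            if value is None or value.strip() == "":
                missing[col] += 1
    return {"rows": rows, "duplicates": duplicates, "missing": dict(missing)}


# Hypothetical sample standing in for a downloaded file.
SAMPLE = "id,age,city\n1,34,Paris\n2,,Berlin\n2,,Berlin\n3,51,\n"
report = summarize_csv(SAMPLE)
print(report)
```

For a real download, read the file contents with `open(path, newline="", encoding="utf-8").read()` and pass the text to the same function. Very large files would need a streaming variant.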
1.2 AI Studio Dataset
AI Studio, launched by Baidu, is a one-stop development platform that brings together AI tutorials, a coding environment, computing power for training, and datasets. It provides free online cloud computing in an integrated programming environment.
Address: AI Studio Dataset
1.3 Tianchi Dataset
Tianchi Dataset is Alibaba Group's public research data platform. Its data is provided jointly by Alibaba Group business teams and external research institutions. It covers more than ten industries, including e-commerce, entertainment, logistics, healthcare, transportation, manufacturing, natural science, and energy, and spans classic artificial intelligence fields such as data mining, machine learning, computer vision, natural language processing, and decision intelligence.
Address: tianchi datasets
1.4 Graviti Dataset
Graviti is a platform that provides public datasets. You can easily search for the data you want and preview sample data, annotations, and labels online. Graviti has collected more than 400 high-quality CV datasets covering AI application fields such as autonomous driving, smart retail, and robotics.
Address: graviti datasets
1.7 Papers with Code
Papers with Code lists over four thousand datasets (and counting), all uploaded by the community. You can easily filter them by modality, task, and language. The site also links to other databases that provide a variety of datasets.
Address: papers with code datasets
1.8 DataFlair
DataFlair links to over 70 machine learning datasets and also includes useful information such as source code and project ideas. For example, for the datasets containing handwritten digits, DataFlair suggests building an image classifier that recognizes handwritten digits on paper. Use the site to spark new ideas.
Address: data flair
1.9 EliteDataScience
EliteDataScience includes free datasets and a curated list of the most popular aggregators. These datasets are organized by use case, and include datasets that can be used for deep learning, natural language processing, web scraping, and more.
Address: elite data science
1.10 UCI dataset
UCI has more than 500 machine learning datasets, sortable by file type, task, application domain, and topic. Many of these datasets link to academic papers that can be used for benchmarking. It is one of the oldest sources of datasets and a natural first stop when looking for interesting data. The datasets are user-contributed and therefore vary in cleanliness, but the vast majority are clean and can be downloaded directly from the UCI Machine Learning Repository without registration.
Address: uci dataset
1.11 GitHub Public Datasets
The GitHub public datasets collection is an open source catalog of public datasets. You can browse the catalog and choose a topic, ranging from agriculture to transportation and more. GitHub also hosts collections of general machine learning models. Most of the linked datasets are free.
Address: github datasets
1.12 Azure Datasets
Microsoft Azure has a database of public datasets that developers can use for prototyping and testing. Its categories include U.S. government and agency data, other statistical and scientific data, and online service data. The site also offers documentation on SQL and on building mobile and web applications.
Address: azure datasets
2 Computer Vision Datasets
2.1 ImageNet dataset
The ImageNet dataset is one of the most popular datasets in deep learning today, and it contains a large amount of image data and annotations. Its labels are organized hierarchically into broad, intermediate, and fine-grained categories: the broader categories are more general, and the narrower ones are more specific. This structure makes the dataset well suited to research on image classification.
Address: ImageNet dataset
2.2 COCO Dataset
The full name is "Microsoft Common Objects in Context". COCO is a large-scale dataset that can be used for object detection, semantic segmentation, and image captioning. It has more than 330K images (220K of which are labeled) containing 1.5 million object instances across 80 object categories (pedestrians, cars, elephants, etc.) and 91 stuff categories (grass, wall, sky, etc.). Each image comes with five one-sentence captions, and 250,000 pedestrians are annotated with keypoints.
Address: coco dataset
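To make the COCO annotation structure concrete, here is a minimal sketch that counts object instances per category. It assumes the detection-style JSON layout with top-level "images", "categories", and "annotations" lists. The tiny dictionary below is hand-written for illustration and is not real COCO data.

```python
from collections import Counter

# Hand-written stand-in for a COCO-style annotation file (illustrative only).
coco = {
    "images": [
        {"id": 1, "file_name": "a.jpg"},
        {"id": 2, "file_name": "b.jpg"},
        {"id": 3, "file_name": "c.jpg"},
    ],
    "categories": [{"id": 1, "name": "person"}, {"id": 3, "name": "car"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [10, 20, 50, 80]},
        {"id": 11, "image_id": 1, "category_id": 3, "bbox": [60, 30, 40, 20]},
        {"id": 12, "image_id": 2, "category_id": 1, "bbox": [5, 5, 30, 60]},
    ],
}


def objects_per_category(data):
    """Count annotated object instances for each category name."""
    names = {cat["id"]: cat["name"] for cat in data["categories"]}
    counts = Counter(names[ann["category_id"]] for ann in data["annotations"])
    return dict(counts)


def images_without_annotations(data):
    """List file names of images that have no annotation at all."""
    annotated = {ann["image_id"] for ann in data["annotations"]}
    return [img["file_name"] for img in data["images"] if img["id"] not in annotated]


print(objects_per_category(coco))
print(images_without_annotations(coco))
```

With a real annotation file, the dictionary would come from `json.load` on the downloaded JSON. The two functions only rely on the keys shown here.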
2.3 IMDB-Wiki Dataset
The IMDB-Wiki dataset provides a very large collection of face images, with over 500,000 images. Many of the images show celebrities and come from IMDb and Wikipedia. Each image is tagged with gender and age.
Address: imdb datasets
2.4 LabelMe dataset
The LabelMe dataset was built with the LabelMe annotation tool, which lets users outline and label objects. The dataset can be used in image recognition projects.
Address: labelme datasets
2.5 Chars74K Dataset
Chars74K includes 74,000 images and targets character recognition in natural images (for example, images of restaurant signs).
Address: chars74k datasets
2.6 Kinetics-700 Dataset
Kinetics-700 consists of links to YouTube videos labeled with human actions. There are more than 650,000 video clips covering 700 action classes.
Address: kinetics-700 datasets
2.7 Places2 Database
Places2 Database is a dataset released by MIT, containing more than 10 million images covering more than 400 scenes. It is helpful for projects such as scene classification and scene parsing.
Address: places2 datasets
2.8 MPII Human Pose Dataset
The MPII Human Pose dataset includes about 25,000 images covering 410 human activities. The images contain approximately 40,000 different people, each with annotated body joints. The images were collected from YouTube videos.
Address: human-pose datasets
2.9 Open Images dataset
Open Images is an open source image dataset released by Google. The latest version, V7, was released in October 2022. This version contains more than 9 million images, all labeled with categories, and more than 1.9 million of them have very fine annotations. Open Images can be used in many different applications, including image classification, object detection, image segmentation, and image generation.
Address: open images dataset
2.10 Cityscapes dataset
Cityscapes is a dataset for semantic segmentation of urban street scenes, containing 3257 high-resolution images from 50 cities in Germany. The dataset covers street-view images under different lighting conditions such as morning, day, and night. Each image has a resolution of 2048x1024 and is professionally annotated with multiple labels, including buildings, roads, and pedestrians. The dataset also provides training, validation, and test lists, as well as benchmark performance metrics. Cityscapes helps advance urban scene analysis and opens up more possibilities for the research and application of deep learning algorithms.
Address: cityscapes dataset
2.11 Sogou Dataset
This Internet image library is drawn from part of the data indexed by Sogou image search. It contains 2,836,535 pictures in categories including people, animals, buildings, machinery, landscapes, and sports. For each picture, the dataset gives the original image, a thumbnail, the web page where the picture appears, and the relevant text from that page. The full dataset is more than 200 GB.
Address: http://www.sogou.com/labs/dl/p.html
2.12 ImageCLEF Dataset
ImageCLEF, part of the Cross Language Evaluation Forum (CLEF), provides benchmarks for image-related tasks such as retrieval, classification, and annotation. The competition has been held every year since 2003.
Address: http://www.imageclef.org/
3 Natural Language Processing Datasets
3.1 Google Blogger Corpus
Google Blogger Corpus includes nearly 700,000 blog posts from blogger.com. Each article has at least 200 English words. Overall, these blog posts contain many common English words.
Address: BlogCorpus datasets
3.2 Yelp Reviews
The Yelp Reviews dataset covers ratings and reviews of restaurants and contains rich information on this topic. The reviews can be used in sentiment analysis projects.
Address: yelp datasets
3.3 WikiQA Corpus
The WikiQA corpus is a question answering dataset compiled from Bing search data. It includes more than 3,000 questions and 29,000 candidate sentences, about 1,500 of which are labeled as correct answer sentences.
Address: WikiQA Corpus
3.4 WordNet
WordNet is a database of English words grouped by meaning. It contains 117,000 synsets (sets of synonymous words), each linked to related synsets. It can be used in text classification projects.
Address: wordnet datasets
3.5 OpinRank dataset
The OpinRank dataset contains 300,000 reviews from Edmunds and TripAdvisor. They are categorized by destination, hotel and other relevant factors.
Address: OpinRank datasets
3.6 Multi-Domain Sentiment Dataset
The multi-domain sentiment dataset includes Amazon.com product reviews from four domains: DVD, Books, Kitchen, and Electronics. Each domain has thousands of reviews with 1-5 star ratings. As the name suggests, this dataset is useful for sentiment analysis projects.
Address: mdredze datasets
3.7 Twitter Sentiment Analysis Dataset
The Twitter sentiment analysis dataset includes more than 1.5 million classified tweets. Each row of the dataset carries a label: 1 for positive sentiment and 0 for negative sentiment.
Address: twitter-sentiment datasets
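Before training on a binary-labeled corpus like this one, it is worth checking how balanced the two classes are. The sketch below does that for CSV text. The column names "Sentiment" and "SentimentText" are assumptions made for this example, so adjust them to match the header of the file you actually download.

```python
import csv
import io


def class_balance(text, label_column="Sentiment"):
    """Count positive (1) and negative (0) rows and return the positive share."""
    counts = {"positive": 0, "negative": 0}
    for row in csv.DictReader(io.StringIO(text)):
        label = row[label_column].strip()
        if label == "1":
            counts["positive"] += 1
        elif label == "0":
            counts["negative"] += 1
        else:
            raise ValueError(f"unexpected label: {label!r}")
    total = counts["positive"] + counts["negative"]
    share = counts["positive"] / total if total else 0.0
    return counts, share


# Invented rows that only mimic the 1 = positive / 0 = negative convention.
SAMPLE_TWEETS = (
    "Sentiment,SentimentText\n"
    "1,what a great day\n"
    "0,my flight got cancelled\n"
    "1,loving the new album\n"
    "0,stuck in traffic again\n"
)
counts, positive_share = class_balance(SAMPLE_TWEETS)
print(counts, positive_share)
```

A positive share far from 0.5 suggests resampling or class weighting before training.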
3.8 Newsgroups dataset
The Newsgroups dataset contains 20,000 documents drawn, as its full name (20 Newsgroups) suggests, from 20 different newsgroups. It covers many topics, some of which are closely related. The dataset comes in three versions: the original version, a version with dates removed, and a version with duplicates removed.
Address: 20Newsgroups datasets
3.9 HuggingFace dataset
The HuggingFace dataset hub includes 611 text datasets that can be downloaded and made ready to use with one line of Python. It covers 467 languages, 99 of which have at least 10 datasets.
Address: huggingface datasets
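The "one line of Python" refers to the `load_dataset` function from the `datasets` library. The wrapper below is a minimal sketch that assumes the library is installed (`pip install datasets`) and that network access is available on first use. The dataset name in the usage comment is only an illustrative choice.

```python
def load_text_dataset(name, split="train"):
    """Download (or load from the local cache) one split of a hub dataset."""
    # Imported lazily so this file can be read without the library installed.
    from datasets import load_dataset

    return load_dataset(name, split=split)


# Example usage (requires network access the first time it runs):
#   reviews = load_text_dataset("imdb")
#   print(reviews[0])
```

Later calls with the same name reuse the local cache instead of downloading again.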
4 Audio and Video Datasets
4.1 M-AI Labs Speech Dataset
The M-AI Labs speech dataset includes nearly 1,000 hours of audio with transcriptions. It features male and female voices in multiple languages.
Address: MAI labs datasets
4.2 LibriSpeech
LibriSpeech includes approximately 1000 hours of speech data that has been segmented and aligned. These data were compiled from audiobooks from the LibriVox project.
Address: Librispeech datasets
5 Dataset Search
5.1 Google Dataset Search
Google provides a dataset search engine where you can search for datasets by name. The engine allows you to sort datasets by several features, such as file type, subject, latest update, and relevance. It can also pull datasets from thousands of databases on the internet, so you can really search through a wide range of options. Uploaders of the dataset include numerous international organizations such as Harvard University and the World Health Organization.
Address: google dataset search
5.2 CLUE Dataset Search
CLUE is a set of Chinese language understanding benchmarks that includes representative datasets, baseline (pre-trained) models, corpora, and leaderboards. It selects datasets for a series of representative tasks to serve as test benchmarks, and these datasets vary in task type, data volume, and difficulty.
Address: cluebenchmarks
5.3 VisualData
VisualData contains some excellent datasets for building computer vision models. Users can search by CV topic, such as semantic segmentation, image captioning, image generation, or self-driving cars.
Address: visualdata
6 Domain-Specific Datasets
6.1 Medical Image Datasets
- Lung nodule database LIDC-IDRI: cancer image
- Breast image databases DDSM and MIAS: Breast Image Database
- Medical Image FAQ: medical-image-faq
- Right Ventricle Segmentation Challenge (2012): mr-images
- Lung Cancer Classification Competition: http://data-science-bowl-2017
- Segmenting lung cancers (Kaggle): finding-lungs-in-ct
- Lung cancer database: cancer image
- Medical imaging dataset: medical-data
- Medical Image Analysis: grand-challenge
6.2 Kaggle Competition Dataset
- Book Recommendation Dataset (goodreads/tens of thousands of books/millions of reviews) [Kaggle] https://www.kaggle.com/zygmunt/goodbooks-10k
- NFL Game Details Dataset with Expected Points and Probability of Win (2009-2016) [Kaggle] https://www.kaggle.com/maxhorowitz/nflplaybyplay2009to2016
- HackerNews dataset (about 1/4 articles since 2006) [Kaggle] https://www.kaggle.com/hacker-news/hacker-news-corpus
- Hotel Review Dataset [Kaggle] https://www.kaggle.com/datafiniti/hotel-reviews
- NBA player status & performance data set since 1950 [Kaggle] https://www.kaggle.com/drgilermo/nba-players-stats
- [Kaggle competition] Facial keypoints calibration competition data: https://www.kaggle.com/c/facial-keypoints-detection
- [Kaggle competition] Competition data for predicting user gender and age from mobile app usage behavior: http://dataju.cn/Dataju/web/datasetInstanceDetail/332
- [Kaggle competition] DSTL satellite imagery recognition competition data: https://www.kaggle.com/c/dstl-satellite-imagery-feature-detection
- [Kaggle Competition] Cat and Dog Image Classification Data: https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition
- [Kaggle Competition] Predicting Threat Competition Based on Security Inspection Body Scanning Imaging: https://www.kaggle.com/c/passenger-screening-algorithm-challenge
- [Kaggle competition] Titanic disaster data: https://www.kaggle.com/c/titanic
- [Kaggle competition] Philadelphia crime record data: https://www.kaggle.com/mchirico/philadelphiacrimedata
- [Kaggle Competition] Ad real-time bidding data: https://www.kaggle.com/zurfer/rtb
- [Kaggle Competition] News and web page content recommendation and click competition: https://www.kaggle.com/c/outbrain-click-prediction
- [Kaggle data] IMDB 5,000 movie data: https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset
- [Kaggle Data] European football player performance data: https://www.kaggle.com/hugomathien/soccer
- [Kaggle Data] Economic development data of countries around the world: https://www.kaggle.com/worldbank/world-development-indicators
- Kepler Space Telescope Deep Space Planet Light Intensity Time Series Dataset [Kaggle] https://www.kaggle.com/keplersmachines/kepler-labelled-time-series-data
- Pakistan UAV Attack Dataset (2004-2016) [Kaggle] https://www.kaggle.com/zusmani/pakistandroneattacks
- Melbourne Housing Market Dataset [Kaggle] https://www.kaggle.com/anthonypino/melbourne-housing-market
- 1789-2016 US Presidents Signing Executive Order Dataset [Kaggle] https://www.kaggle.com/nationalarchives/executive-orders
- Python language question answering data set from the Stack Overflow platform [Kaggle] https://www.kaggle.com/stackoverflow/pythonquestions
- R language question answering data set from the Stack Overflow platform [Kaggle] https://www.kaggle.com/stackoverflow/rquestions
- Daily Sea Ice Extent Dataset [Kaggle] https://www.kaggle.com/nsidcorg/daily-sea-ice-extent-data
- NIPS (1987-2016) paper dataset [Kaggle] https://www.kaggle.com/benhamner/nips-papers
- US stock news data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/220
- US medical insurance market data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/225
- American financial customer complaint data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/229
- Lending Club online loan default data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/206
- Credit card fraud data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/206
- US stock data XBRL [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/214
- New York Stock Exchange data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/214
- Loan default prediction competition data [Kaggle competition] http://dataju.cn/Dataju/web/datasetInstanceDetail/249
- Zillow website real estate value prediction competition data [Kaggle competition] http://dataju.cn/Dataju/web/datasetInstanceDetail/249
- Sberbank Russian real estate value prediction competition data [Kaggle competition] http://dataju.cn/Dataju/web/datasetInstanceDetail/266
- Homesite Insurance Pricing Competition Data [Kaggle Competition] http://dataju.cn/Dataju/web/datasetInstanceDetail/336
- Winton stock return forecasting competition data [Kaggle competition] http://dataju.cn/Dataju/web/datasetInstanceDetail/347?match
- [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/324
- Amazon unlocked mobile phone review data http://dataju.cn/Dataju/web/datasetInstanceDetail/349
- [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/364
- [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/207
- Kaggle competition data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/207
- Bosch production line reduces defective rate competition data [Kaggle competition] http://dataju.cn/Dataju/web/datasetInstanceDetail/208
- Online advertising real-time bidding data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/337
- Shopping cart product association competition data [Kaggle competition] http://dataju.cn/Dataju/web/datasetInstanceDetail/335
- Airbnb new users' homestay reservation prediction competition data [Kaggle competition] http://dataju.cn/Dataju/web/datasetInstanceDetail/333
- Food nutrition data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/80
- EEG brainwave data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/79
- Someone's gene sequence data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/121
- Cancer CT image data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/242
- Soft tissue sarcoma CT image data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/124
- Cat and dog classification recognition competition data [Kaggle competition] http://dataju.cn/Dataju/web/datasetInstanceDetail/318
- DSTL satellite image recognition competition data [Kaggle competition] http://dataju.cn/Dataju/web/datasetInstanceDetail/328
- Predict user gender and age competition data based on mobile application software usage behavior [Kaggle competition] http://dataju.cn/Dataju/web/datasetInstanceDetail/332
- Face key point calibration competition data [Kaggle competition] http://dataju.cn/Dataju/web/datasetInstanceDetail/331
- Kaggle competition data collection (partial competition data) http://dataju.cn/Dataju/web/datasetInstanceDetail/368
- Boston Airbnb public data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/209
- Economic development data of countries in the world [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/202
- World University Ranking Chicago Crime Data (2001-2017) [Kaggle Data] http://dataju.cn/Dataju/web/datasetInstanceDetail/233
- Worldwide Significant Earthquake Data (1965-2016) [Kaggle Data] http://dataju.cn/Dataju/web/datasetInstanceDetail/231
- American baby name data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/222
- Data of shark attacks on humans all over the world [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/219
- Air crash data since 1908 [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/218
- 2016 U.S. Presidential Election Data [Kaggle Data] http://dataju.cn/Dataju/web/datasetInstanceDetail/217
- 2013 American Community Statistics [Kaggle Data] http://dataju.cn/Dataju/web/datasetInstanceDetail/273
- 2014 American Community Statistics [Kaggle Data] http://dataju.cn/Dataju/web/datasetInstanceDetail/274
- 2015 American Community Statistics [Kaggle Data] http://dataju.cn/Dataju/web/datasetInstanceDetail/215
- Performance data of European football players [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/211
- US Environmental Pollution Data [Kaggle Data] http://dataju.cn/Dataju/web/datasetInstanceDetail/224
- US H1-B visa application data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/224
- IMDB five thousand movie data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/224
- 2015 flight delay and cancellation data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/226
- Homicide report data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/216
- Human resource analysis data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/259
- Crime data in Philadelphia, USA [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/260
- Enron email data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/262
- Historical baseball data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/263
- United Airlines Twitter user comment data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/264
- Boston Airbnb public data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/265
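For the entries above that point directly at kaggle.com, the official Kaggle command-line tool can fetch the files once API credentials are configured. The helper below is a small sketch that turns such a URL into the matching command. The function name is made up for this example, and it handles only the URL shapes that appear in this list plus the newer `/datasets/` and `/competitions/` prefixes.

```python
from urllib.parse import urlparse


def kaggle_download_command(url):
    """Build the Kaggle CLI download command for a kaggle.com URL."""
    parsed = urlparse(url)
    if parsed.netloc not in ("www.kaggle.com", "kaggle.com"):
        raise ValueError(f"not a kaggle.com URL: {url}")
    parts = [p for p in parsed.path.split("/") if p]
    # Competition pages look like /c/<name> (older) or /competitions/<name>.
    if len(parts) >= 2 and parts[0] in ("c", "competitions"):
        return f"kaggle competitions download -c {parts[1]}"
    # Dataset pages look like /<owner>/<slug> or /datasets/<owner>/<slug>.
    if parts and parts[0] == "datasets":
        parts = parts[1:]
    if len(parts) >= 2:
        return f"kaggle datasets download -d {parts[0]}/{parts[1]}"
    raise ValueError(f"cannot tell what to download from: {url}")


print(kaggle_download_command("https://www.kaggle.com/c/titanic"))
print(kaggle_download_command("https://www.kaggle.com/zygmunt/goodbooks-10k"))
```

The function only builds the command string. Running it still requires the `kaggle` package and, for competitions, acceptance of the competition rules on the website.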
6.3 Natural Language Processing Datasets
- RCV1:http://dataju.cn/Dataju/web/datasetInstanceDetail/93
- English: http://dataju.cn/Dataju/web/datasetInstanceDetail/90
- News data: http://dataju.cn/Dataju/web/datasetInstanceDetail/78
- Natural Language Reasoning (Text Entailment Marking) Dataset [NYU] https://www.nyu.edu/projects/bowman/multinli/
- 20news English news data http://dataju.cn/Dataju/web/datasetInstanceDetail/78
- First Quora Release Question Pairs Q&A data http://dataju.cn/Dataju/web/datasetInstanceDetail/94
- JRC Names:http://dataju.cn/Dataju/web/datasetInstanceDetail/92
- National language specific entity names: http://dataju.cn/Dataju/web/datasetInstanceDetail/89
- Multi-Domain Sentiment V2.0: http://dataju.cn/Dataju/web/datasetInstanceDetail/205
- LETOR information retrieval data: http://dataju.cn/Dataju/web/datasetInstanceDetail/205
- Yale Youtube Video Text: http://dataju.cn/Dataju/web/datasetInstanceDetail/221
- Stanford question and answer data [Kaggle data]: http://dataju.cn/Dataju/web/datasetInstanceDetail/221
- US fake news data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/212
- NIPS conference article information data (1987-2016) [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/268
- 2016 U.S. Presidential Election Debate Data [Kaggle Data] http://dataju.cn/Dataju/web/datasetInstanceDetail/269
- WikiLinks cross-document referencing corpus: http://dataju.cn/Dataju/web/datasetInstanceDetail/277
- European Parliament Proceedings Parallel Corpus machine translation data http://dataju.cn/Dataju/web/datasetInstanceDetail/285
- WikiText English semantic thesaurus data: http://dataju.cn/Dataju/web/datasetInstanceDetail/272
- WMT 2011 News Crawl Machine Translation Data: http://dataju.cn/Dataju/web/datasetInstanceDetail/288
- Stanford Sentiment Treebank vocabulary data: http://dataju.cn/Dataju/web/datasetInstanceDetail/334
- English language model word prediction competition data: http://dataju.cn/Dataju/web/datasetInstanceDetail/201
- Apache Software Foundation Public Mail Archive: The entire publicly available Apache Software Foundation mail archive as of July 11, 2011. (200 GB) http://aws.amazon.com/de/datasets/apache-software-foundation-public-mail-archives/
- Blogger Original Corpus: Contains posts from 19,320 bloggers collected in August 2004 from the http://blogger.com website, totaling 681,288 posts and over 140 million words. (298 MB) http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm
- Amazon Food Reviews [Kaggle]: Contains 568,454 food reviews left by Amazon users before October 2012. (240MB) https://www.kaggle.com/snap/amazon-fine-food-reviews
- Amazon Reviews: Stanford has collected 35 million Amazon reviews. (11GB) https://snap.stanford.edu/data/web-Amazon.html
- arXiv: Full text of all papers (270GB) plus source files (190GB). http://arxiv.org/help/bulk_data_s3
- ASAP Automatic Essay Scoring [Kaggle]: In this competition, there are 8 essay collections. Each composition is generated from responses to a single prompt. Selected essays range in length from 150 to 550 words. Some compositions rely on source information, while others do not. All papers are written by students in grades 7-10. All essays are graded manually and a double-grading system is used. (100MB) https://www.kaggle.com/c/asap-aes/data
- ASAP Short Answer Scoring [Kaggle]: Each dataset is generated from responses to a single prompt. The average length of selected responses is 50 words. Some answers rely on source information, while others do not. All responses were written by 10th grade students. All responses were scored manually and a double scoring system was adopted. (35MB) https://www.kaggle.com/c/asap-sas/data
- Political Social Media Categorization: Categorizing social media messages from politicians by content. (4MB) https://www.crowdflower.com/data-for-everyone/
- CLiPS Corpus of Stylistics Research (CSI): Expanded each year with two types of student writing: essays and reviews. The purpose of this corpus is mainly stylistic research, but it can also be used for other research. (The data set needs to be obtained by application) http://www.clips.uantwerpen.be/datasets/csi-corpus
- ClueWeb09 FACC: ClueWeb09 with Freebase annotation. (72GB) http://lemurproject.org/clueweb09/FACC1/
- ClueWeb12 FACC: ClueWeb12 with Freebase annotations. (92GB) http://lemurproject.org/clueweb12/FACC1/
- Common Crawl corpus: consists of more than 5 billion web pages (541TB) of crawl data. http://aws.amazon.com/de/datasets/common-crawl-corpus/
- Cornell Movie Dialog Corpus: A large, metadata-rich collection of dialogues extracted from original movie scripts: 617 movies, 220,579 conversational exchanges between 10,292 pairs of movie characters. (9.5MB) http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
- Business messaging: The job of categorizing what businesses are actually talking about on social media. Volunteers were asked to categorize corporate statements as information (objective statements about the company or its activities), conversation (replying to users, etc.) or action (messages asking for votes or asking users to click on links, etc.). (600KB) http://aws.amazon.com/de/datasets/common-crawl-corpus/
- Crosswikis: A database linking English phrases to Wikipedia articles. (11GB) http://nlp.stanford.edu/data/crosswikis-data.tar.bz2/
- DBpedia: A collective effort of the web community to extract structured information from Wikipedia and make this information available on the web. (17GB) http://aws.amazon.com/de/datasets/dbpedia-3-5-1/?tag=datasets%23keywords%23encyclopedic
- Death Row: The last words of every prisoner executed since 1984. (HTML form) http://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html
- Del.icio.us: 1.25 million bookmarks on http://delicious.com. http://arvindn.livejournal.com/116137.html
- Disaster tweets on social media: 10,000 tweets, annotated with or without disaster events. (2MB) https://www.crowdflower.com/data-for-everyone/
- Economic News Related Articles: Determine whether a news article is relevant to the U.S. economy, and if so, what the tone of the article is. The time range is from 1951 to 2014. (12MB) https://www.crowdflower.com/data-for-everyone/
- Enron Email Data: Contains 1,227,255 emails with 493,384 attachments covering 151 managers. (210GB) http://aws.amazon.com/de/datasets/enron-email-data/
- Event Registry: A free tool that provides real-time access to news articles from 100,000 outlets around the world. An API is available. (query tool) http://eventregistry.org/
- Examiner.com - Clickbait and Spam News Headlines [Kaggle]: 3 million crowdsourced news headlines published by the now-defunct clickbait site The Examiner from 2010-2015. (200MB) https://www.kaggle.com/therohk/examine-the-examiner
- Federal Contracts from the Federal Procurement Data Center (USASpending.gov): A database of all federal contracts from the Federal Procurement Data Center at USASpending.gov. (180GB) http://aws.amazon.com/de/datasets/federal-contracts-from-the-federal-procurement-data-center-usaspending-gov/
- Flickr Personal Taxonomy: A tree-structured dataset of personal labels. (40MB) http://www.isi.edu/~lerman/downloads/flickr/flickr_taxonomies.html
- Freebase Data Dump: A dump of all current facts and assertions in Freebase. (26GB) http://aws.amazon.com/de/datasets/freebase-data-dump/
- Freebase Simple Topic Dump: A dump of the basic identifying facts for every topic in Freebase. (5GB) http://aws.amazon.com/de/datasets/freebase-simple-topic-dump/
- Freebase Quad Dump: A dump of all current facts and assertions in Freebase, in quad format. (35GB) http://aws.amazon.com/de/datasets/freebase-quad-dump/
- GigaOM Wordpress Challenge [Kaggle]: Blog Posts, Metadata, User Likes. (1.5GB) https://www.kaggle.com/c/predict-wordpress-likes/data
- Google Books n-grams: Also available as a hadoop-format file on Amazon S3. (2.2TB) http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
- Google Web 5-grams: n-grams containing English words, and their frequency counts. (24GB) https://catalog.ldc.upenn.edu/LDC2006T13
- Gutenberg eBook List: A list of annotated eBooks. (2MB) http://www.gutenberg.org/wiki/Gutenberg:Offline_Catalogs
- Canadian Parliament Text Blocks: 1.3 million standard text blocks (sentences or smaller fragments) from the official records of the 36th Parliament of Canada (Hansards). (82MB) http://www.isi.edu/natural-language/download/hansard/
- Harvard Libraries: Bibliographic records of more than 12 million volumes of materials held at Harvard Libraries, including books, periodicals, electronic resources, manuscripts, archival materials, musical scores, audio, video, and other materials. (4GB) http://library.harvard.edu/open-metadata#Harvard-Library-Bibliographic-Dataset
- Hate Speech Identification: Volunteers looked at short texts and determined whether each one a) contains hate speech, b) is offensive but contains no hate speech, or c) is not offensive at all. The dataset contains nearly 15 thousand rows, and each text string received three volunteer judgements. (3MB) https://github.com/t-davidson/hate-speech-and-offensive-language
- Hillary Clinton's Emails [Kaggle]: Collated nearly 7,000 pages of Clinton's emails. (12MB) https://www.kaggle.com/kaggle/hillary-clinton-emails
- The Home Depot Company Product Search Association [Kaggle]: Contains many product and customer search terms from the Home Depot Company website. The challenge is to predict the relevance score of search term combinations and products. To create authentic labels, The Home Depot crowdsourced search/product pairings to multiple raters. (65MB) https://www.kaggle.com/c/home-depot-product-search-relevance/data
- Identify key phrases in text: question/answer pairs and text composition; determine whether contextual text is relevant to question/answer. (8MB) https://www.crowdflower.com/data-for-everyone/
- US TV show 'Jeopardy': A collection of 216,930 past questions from 'Jeopardy'. (53MB) http://www.reddit.com/r/datasets/comments
- 200k Plaintext Jokes in English: Archive of 208,000 plaintext jokes from different sources. https://github.com/taivop/joke-dataset
- European language machine translation. (612MB) http://statmt.org/wmt11/translation-task.html#download
- Material Safety Data Sheets: 230,000 material safety data sheets. (3GB) http://aws.amazon.com/de/datasets/material-safety-data-sheets/
- Million News Headlines - ABC Australia [Kaggle]: 1.3 million headlines from 2003-2017 published by ABC News Australia. (56MB) https://www.kaggle.com/therohk/million-headlines
- MCTest: Free-to-use collection of 660 stories and associated questions for researching machine understanding of text, question answering. (1MB) http://research.microsoft.com/en-us/um/redmond/projects/mctest/index.html
- Negra: A syntactically annotated corpus of German newspaper texts. Free for all universities and non-profit organizations. Access requires signing an agreement and submitting an application. http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/negra-corpus.html
- News Headlines - Times of India [Kaggle]: 2.7 million news headlines, with categories, published by Times of India from 2001 to 2017. (185MB) https://www.crowdflower.com/data-for-everyone/
- News article/Wikipedia page pairing: Volunteers read a short article and were asked which of two Wikipedia articles was the best match. (6MB) https://www.crowdflower.com/data-for-everyone/
- 2015 NIPS Papers (Version 2) [Kaggle]: Full text of all 2015 NIPS papers. (335MB) https://www.kaggle.com/benhamner/nips-2015-papers/version/2
- NYT Facebook Data: All NYT posts on Facebook. (5MB) http://minimaxir.com/2015/07/facebook-scraper/
- Global News Weekly Feed [Kaggle]: A dataset of 1.4 million news events published globally in more than 20 languages during one week in August 2017. (115MB) https://www.kaggle.com/therohk/global-news-week
- Correctness of sentence/concept pairs: Volunteers read sentences about two concepts. For example, "A dog is an animal", or "A captain can mean the same thing as an owner", and they were then asked if this sentence was correct and rated it 1-5. (700KB) https://www.crowdflower.com/data-for-everyone/
- Open Library Database: A dump of all revisions of all records in Open Library. (16GB) https://openlibrary.org/developers/dumps
- Personae Corpus: Collected for experiments in authorship attribution and personality prediction. Consists of 145 Dutch essays written by 145 students. (Access requires application) http://www.clips.uantwerpen.be/datasets/personae-corpus
- Reddit comments: All public comments on the reddit forum as of July 2015. A total of 1.7 billion comments. (250GB) https://www.reddit.com/r/datasets/comments/3bxlg7
- Reddit comments (May 2015): A subset of the Reddit comments data hosted on Kaggle. (8GB) https://www.kaggle.com/reddit/reddit-comments-may-2015
- Reddit Submission Corpus: All publicly available Reddit submissions from January 2006 - August 31, 2015. (42GB) https://www.reddit.com/r/datasets/comments/3mg812
- Reuters Corpus: A dataset of Reuters news articles for research and development of natural language processing, information retrieval, and machine learning systems. The corpus, also known as "Reuters Corpus Volume 1" or RCV1, is much larger than the well-known Reuters 21578 dataset that was originally widely used in text classification. To obtain the corpus, you must sign an agreement and send a request by email. (2.5GB) https://trec.nist.gov/data/reuters/reuters.html
- SaudiNewsNet: 31030 headlines and metadata extracted from various Saudi Arabian online newspapers. (2MB) https://github.com/ParallelMazen/SaudiNewsNet
- SMS Spam Dataset: 5,574 real, non-encoded English SMS messages labeled as legitimate (ham) or spam. (200KB) http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
- South Park dataset: A csv file containing script information for seasons, episodes, characters, and lines. (3.6MB) https://github.com/BobAdamsEE/SouthParkData
- Stack Overflow: 7.3 million Stack Overflow questions, plus questions and answers from other Stack Exchange (Q&A) sites. http://data.stackexchange.com/
- Twitter Cheng-Caverlee-Lee user geolocation dataset: Geolocated tweets from September 2009 to January 2010. (400MB) https://archive.org/details/twitter_cikm_2010
- Twitter sentiment about the New England Patriots' "Deflategate": Before the 2015 Super Bowl, there was a lot of chatter about deflated footballs and whether the Patriots were cheating. The dataset provides Twitter sentiment from the time of the scandal to gauge how the public felt about the event as a whole. (2MB) https://www.figure-eight.com/data-for-everyone/
- Analysis of public opinion on left-leaning topics on Twitter: Tweets about abortion legalization, feminism, Hillary Clinton, and other left-leaning topics. Each tweet is classified by its inferred stance as For, Against, Neutral, or None of the above. (600KB) https://www.figure-eight.com/data-for-everyone/
- Twitter's Sentiment140 (Sentiment Analysis Dataset): Tweets about brands/keywords. The website also lists papers and research ideas. (77MB) http://help.sentiment140.com/for-students/
- Analysis of public opinion on self-driving cars on Twitter: Contributors read tweets and categorized their attitude towards autonomous driving as very positive, somewhat positive, neutral, somewhat negative, or very negative. Tweets unrelated to self-driving cars were flagged as well. (1MB) https://www.figure-eight.com/data-for-everyone/
- Geolocated tweets from Tokyo on Twitter: 200,000 tweets from Tokyo. (47MB) http://followthehashtag.com/datasets/200000-tokyo
- Geolocated tweets from the UK on Twitter: 170,000 tweets from the UK. (47MB) http://followthehashtag.com/datasets/170000-uk
- Geolocated tweets from the US on Twitter: 200,000 tweets from the US. (45MB) http://followthehashtag.com/datasets/free-twitter-dataset
- Attitudes towards major U.S. airlines on Twitter (Kaggle dataset): A sentiment analysis task on problems with each major U.S. airline. Tweets from February 2015 were scraped, and contributors classified them as positive, negative, or neutral, giving a reason for each tweet classified as negative (e.g. "plane is late" or "poor service attitude"). (2.5MB) https://www.kaggle.com/crowdflower/twitter-airline-sentiment
- U.S. Economic Performance Based on News Headlines: Ranks the relevance of news articles to the U.S. economy based on their headlines and summaries. (5MB) https://www.figure-eight.com/data-for-everyone/
- Urban Dictionary (American Online Slang Dictionary) Words and Definitions: A cleansed CSV corpus of all 2.6 million words, definitions, authors, and votes in Urban Dictionary as of May 2016. (238MB) https://www.kaggle.com/therohk/urban-dictionary-words-dataset
- Westbury Lab Usenet Corpus (hosted on Amazon): An anonymized compilation of postings from 47,860 English-language newsgroups from 2005-2010. (40GB) http://aws.amazon.com/de/datasets/the-westburylab-usenet-corpus/
- Westbury Lab Wikipedia Corpus: A snapshot of all articles in the English-language part of Wikipedia as of April 2010. The website describes in detail how the data was processed, i.e. stripped of all links and irrelevant material (navigation text, etc.). The corpus is unannotated raw text and is used by Stanford NLP. http://www.psych.ualberta.ca
- Stanford NLP jump link: https://scholar.google.com/scholar
- Wikipedia Extraction (WEX): The processed English version of Wikipedia. (66GB) http://aws.amazon.com/de/datasets/wikipedia-extraction-wex/
- Wikipedia's XML-formatted data: A complete copy of all Wikimedia wikis, in the form of wikitext source and metadata embedded in XML. (500GB) http://aws.amazon.com/de/datasets/wikipedia-xml-data/
- Comprehensive Questions and Answers from Yahoo Answers: The Yahoo Answers corpus as of October 25, 2007, containing 4,483,032 questions and answers. (3.6GB) http://webscope.sandbox.yahoo.com/catalog.php?datatype=l
- Questions Asked in French in Yahoo Answers: A subset of the Yahoo Answers corpus from 2006-2015, containing 1.7 million French questions and their answers. (3.8GB) https://webscope.sandbox.yahoo.com/catalog.php?datatype=l
- "How To" Questions from Yahoo Answers [LZ2]: A subset of 142,627 questions and answers from the October 25, 2007 Yahoo Answers corpus selected according to linguistic attributes. (104MB) https://webscope.sandbox.yahoo.com/catalog.php?datatype=l
- Yahoo's HTML forms extracted from public web pages: Contains a small sample of pages with complex HTML forms, with 2.67 million complex forms in total. (50+ GB) https://webscope.sandbox.yahoo.com/catalog.php?datatype=l
- Metadata extracted from public web pages by Yahoo: 100 million triples of data in RDF format. (2GB) https://webscope.sandbox.yahoo.com/catalog.php?datatype=l
- Yahoo's N-Gram Representations data: Contains n-gram representations that can be used for query rewriting, a common task in IR research, and for word and sentence similarity analysis, a common task in NLP research. (2.6GB) https://webscope.sandbox.yahoo.com/catalog.php?datatype=l
- Yahoo's N-gram data (version 2.0): n-gram data (n=1-5) extracted from a corpus of 14.6 million documents (126 million unique sentences, 3.4 billion running words) crawled from 12,000 news-oriented sites. (12 GB) https://webscope.sandbox.yahoo.com/catalog.php?datatype=l
- Relevance judgments for Yahoo search logs: Anonymized Yahoo search logs with relevance judgments. (1.3GB) https://webscope.sandbox.yahoo.com/catalog.php?datatype=l
- Yahoo's English Wikipedia Semantic Annotation Snapshot: Contains 1,490,688 entries of the English Wikipedia as of November 4, 2006 after processing with some publicly available NLP tools. (6GB) https://webscope.sandbox.yahoo.com/catalog.php?datatype=l
- Yelp: Contains restaurant rankings and 2.2 million reviews. https://www.yelp.com/dataset
- Youtube: 1.7 million YouTube video descriptions. (torrent format) https://www.reddit.com/r/datasets/comments/
- Excellent public NLP datasets (with more listings) https://github.com/awesomedata/awesome-public-datasets
- Amazon public dataset https://aws.amazon.com/de/datasets/
- CrowdFlower Dataset (contains a large number of small surveys and crowdsourced data for specific tasks) https://www.crowdflower.com/data-for-everyone/
- Kaggle datasets https://www.kaggle.com/datasets
- Kaggle competitions (please make sure these kaggle competition data can be used outside the competition) https://www.kaggle.com/competitions
- Open Library https://openlibrary.org/developers/dumps
- Quora (mostly annotated corpus) https://www.quora.com/Datasets
- reddit datasets (numerous datasets, mostly crawled by amateurs, but data curation and licensing may not be standardized) https://www.reddit.com/r/datasets
- rs.io: Another very long list of datasets http://rs.io/100-interesting-data-sets-for-statistics/
- Stackexchange: Open Data http://opendata.stackexchange.com/
- Stanford NLP group (mostly labeled corpora and TreeBanks, and practical NLP tools) https://nlp.stanford.edu/links/statnlp.html
- Yahoo Research's dataset summary Webscope (also includes a list of papers that use the data) http://webscope.sandbox.yahoo.com/
- Natural Language Processing (NLP) dataset list [Nicolas Iderhoff] https://github.com/niderhoff/nlp-datasets
- NLVR: Natural language grounding dataset (reasoning about object groupings, quantities, comparisons, and spatial relationships) http://lic.nlp.cornell.edu/nlvr/
- New multi-turn, cross-domain, task-oriented dialogue dataset released by Stanford NLP [Mihail Eric] https://github.com/keunwoochoi/YouTube-music-video-5M
- "The Beauty of Data" natural language dataset/code http://t.cn/hBOTM4
- Large-scale crowdsourced dataset for semantic parsing of natural language queries over relational databases (80,000+ query samples) http://t.cn/RNMr09n
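Several of the corpora above (Google Books n-grams, Google Web 5-grams, Yahoo's N-gram data) ship as n-gram frequency counts rather than raw text. As a rough illustration of what such counts contain, the sketch below counts word n-grams for n = 1 to 5 in a small string. The function name `ngram_counts` and the whitespace tokenization are assumptions made for this example; they are not the preprocessing used by any of the listed datasets.

```python
from collections import Counter


def ngram_counts(text, max_n=5):
    """Count word n-grams for n = 1..max_n (hypothetical helper).

    Tokenization is a plain lowercase whitespace split, which is an
    assumption for illustration only.
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of length n over the token list.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


sample = "the cat sat on the mat and the cat slept"
counts = ngram_counts(sample, max_n=2)
print(counts[("the",)])        # unigram frequency of "the"
print(counts[("the", "cat")])  # bigram frequency of "the cat"
```

Each key is a tuple of words and each value is its frequency, which mirrors the "n-gram plus frequency count" layout these datasets describe. Real corpora of this size would need streaming or distributed counting rather than an in-memory `Counter`.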
6.4 Comprehensive image data (various types and scenes)
- Visual Genome image data http://dataju.cn/Dataju/web/datasetInstanceDetail/311
- Visual7w image data http://dataju.cn/Dataju/web/datasetInstanceDetail/315
- COCO image data http://dataju.cn/Dataju/web/datasetInstanceDetail/316
- SUFR image data http://dataju.cn/Dataju/web/datasetInstanceDetail/317
- ILSVRC 2014 training data (part of ImageNet) http://dataju.cn/Dataju/web/datasetInstanceDetail/369
- PASCAL Visual Object Classes 2012 image data http://dataju.cn/Dataju/web/datasetInstanceDetail/85
- PASCAL Visual Object Classes 2011 image data http://dataju.cn/Dataju/web/datasetInstanceDetail/107
- PASCAL Visual Object Classes 2010 image data http://dataju.cn/Dataju/web/datasetInstanceDetail/51
- 80 Million Tiny Images image data [data too large; introduction only] http://dataju.cn/Dataju/web/datasetInstanceDetail/240
- ImageNet [data too large; introduction only] http://dataju.cn/Dataju/web/datasetInstanceDetail/55
- Google Open Images [data too large; introduction only] http://dataju.cn/Dataju/web/datasetInstanceDetail/40
6.5 Scene Image
- Street Scenes image data http://dataju.cn/Dataju/web/datasetInstanceDetail/45
- Places2 scene image data http://dataju.cn/Dataju/web/datasetInstanceDetail/48
- 69GB large-scale UAV (campus) image dataset [Stanford] http://cvgl.stanford.edu/projects/uav_data/
- ADE20K scene understanding/parsing/segmentation/multi-object recognition dataset [MIT] https://groups.csail.mit.edu/vision/datasets/ADE20K/
- Multimodal Dyadic Behavior Dataset [GaTech] http://www.cbi.gatech.edu/mmdb/
- Berkeley Image Segmentation Dataset BSDS500 [Berkeley] https://www2.eecs.berkeley.edu
- UCF Google Street View image data http://dataju.cn/Dataju/web/datasetInstanceDetail/138
- SUN scene image data http://dataju.cn/Dataju/web/datasetInstanceDetail/138
- The Celebrity in Places image data http://dataju.cn/Dataju/web/datasetInstanceDetail/83
6.6 Web Image Tags
- HARRISON social label image http://dataju.cn/Dataju/web/datasetInstanceDetail/183
- NUS-WIDE label image http://dataju.cn/Dataju/web/datasetInstanceDetail/74
- Visual Synset label image http://dataju.cn/Dataju/web/datasetInstanceDetail/112
- Animals With Attributes tag image http://dataju.cn/Dataju/web/datasetInstanceDetail/160
6.7 Human silhouette image
- Face Sketch Dataset [CUHK] http://mmlab.ie.cuhk.edu.hk/archive/facesketch.html
- MPII Human Shape human body contour data http://dataju.cn/Dataju/web/datasetInstanceDetail/234 http://dataju.cn/Dataju/web/datasetInstanceDetail/173
- Biwi Kinect Head Pose head pose data http://dataju.cn/Dataju/web/datasetInstanceDetail/52
- Upper body portrait data http://dataju.cn/Dataju/web/datasetInstanceDetail/52
- INRIA Person dataset http://dataju.cn/Dataju/web/datasetInstanceDetail/235
6.8 Visual Text Recognition Image
- Street View House Number house number image data http://dataju.cn/Dataju/web/datasetInstanceDetail/236
- MNIST handwritten digit recognition image data http://dataju.cn/Dataju/web/datasetInstanceDetail/253
- 3D MNIST digit recognition image data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/129
- MediaTeam Document Document photocopy and content data http://dataju.cn/Dataju/web/datasetInstanceDetail/129
- Text Recognition text image data http://dataju.cn/Dataju/web/datasetInstanceDetail/110
- NIST Handprinted Forms and Characters handwritten English character data http://dataju.cn/Dataju/web/datasetInstanceDetail/49
- NIST Structured Forms Reference Set of Binary Images (SFRS) image data http://dataju.cn/Dataju/web/datasetInstanceDetail/73 http://dataju.cn/Dataju/web/datasetInstanceDetail/47
- NIST Structured Forms Reference Set of Binary Images (SFRS) II image data http://dataju.cn/Dataju/web/datasetInstanceDetail/23 http://dataju.cn/Dataju/web/datasetInstanceDetail/203
6.9 Images of a particular class of things
- The famous cat image annotation data http://dataju.cn/Dataju/web/datasetInstanceDetail/128
- Caltech-UCSD Birds200 bird image data http://dataju.cn/Dataju/web/datasetInstanceDetail/176 http://dataju.cn/Dataju/web/datasetInstanceDetail/278
- Stanford Car car image data http://dataju.cn/Dataju/web/datasetInstanceDetail/294
- Cars car image data http://dataju.cn/Dataju/web/datasetInstanceDetail/295
- MIT Cars car image data http://dataju.cn/Dataju/web/datasetInstanceDetail/41
- Stanford Cars car image data http://dataju.cn/Dataju/web/datasetInstanceDetail/105
- Food-101 food image data http://dataju.cn/Dataju/web/datasetInstanceDetail/106
- 17_Category_Flower image data http://dataju.cn/Dataju/web/datasetInstanceDetail/106 http://dataju.cn/Dataju/web/datasetInstanceDetail/254
- 102_Category_Flower image data http://dataju.cn/Dataju/web/datasetInstanceDetail/255 http://dataju.cn/Dataju/web/datasetInstanceDetail/109
- UCI Folio Leaf image data http://dataju.cn/Dataju/web/datasetInstanceDetail/114
- Labeled Fishes in the Wild fish image data http://dataju.cn/Dataju/web/datasetInstanceDetail/115 http://dataju.cn/Dataju/web/datasetInstanceDetail/60
- US Yelp review site hotel photos http://dataju.cn/Dataju/web/datasetInstanceDetail/61
- CMU-Oxford Sculpture statue image data http://dataju.cn/Dataju/web/datasetInstanceDetail/63 http://dataju.cn/Dataju/web/datasetInstanceDetail/174
- Oxford-IIIT Pet pet image data http://dataju.cn/Dataju/web/datasetInstanceDetail/256
- Nature Conservancy Fisheries Monitoring overfishing monitoring image data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/301 http://dataju.cn/Dataju/web/datasetInstanceDetail/118
- Pet picture (segmentation) dataset [Oxford] http://www.robots.ox.ac.uk/~vgg/data/pets/
6.10 Material texture images
- CURET texture material image data http://dataju.cn/Dataju/web/datasetInstanceDetail/111
- ETHZ Synthesizability texture image data http://dataju.cn/Dataju/web/datasetInstanceDetail/127
- KTH-TIPS texture material image data http://dataju.cn/Dataju/web/datasetInstanceDetail/172
- Describable Textures texture image data http://dataju.cn/Dataju/web/datasetInstanceDetail/71
6.11 Object Classification Images
- COIL-20 image data http://dataju.cn/Dataju/web/datasetInstanceDetail/62
- COIL-100 image data http://dataju.cn/Dataju/web/datasetInstanceDetail/70
- Caltech-101 image data http://dataju.cn/Dataju/web/datasetInstanceDetail/54
- Caltech-256 image data http://dataju.cn/Dataju/web/datasetInstanceDetail/46
- CIFAR-10 image data http://dataju.cn/Dataju/web/datasetInstanceDetail/42
- CIFAR-100 image data http://dataju.cn/Dataju/web/datasetInstanceDetail/53
- STL-10 image data http://dataju.cn/Dataju/web/datasetInstanceDetail/72
- LabelMe_12_50k image data http://dataju.cn/Dataju/web/datasetInstanceDetail/72 http://dataju.cn/Dataju/web/datasetInstanceDetail/69
- NORB v1.0 image data http://dataju.cn/Dataju/web/datasetInstanceDetail/117
- NEC Toy Animal image data http://dataju.cn/Dataju/web/datasetInstanceDetail/237
- iCubWorld image classification data http://dataju.cn/Dataju/web/datasetInstanceDetail/238
- Multi-class image classification data http://dataju.cn/Dataju/web/datasetInstanceDetail/239
- GRAZ image classification data http://dataju.cn/Dataju/web/datasetInstanceDetail/108
6.12 Face Image
- IMDB-WIKI 500k+ face images, age and gender data http://dataju.cn/Dataju/web/datasetInstanceDetail/68
- Labeled Faces in the Wild face data http://dataju.cn/Dataju/web/datasetInstanceDetail/50
- Extended Yale Face Database B face data http://dataju.cn/Dataju/web/datasetInstanceDetail/131
- Bao Face face data http://dataju.cn/Dataju/web/datasetInstanceDetail/87
- DC-IGN paper face data http://dataju.cn/Dataju/web/datasetInstanceDetail/119
- 300 Face in Wild image data http://dataju.cn/Dataju/web/datasetInstanceDetail/120
- BioID Face face data http://dataju.cn/Dataju/web/datasetInstanceDetail/122
- CMU Frontal Face Images http://dataju.cn/Dataju/web/datasetInstanceDetail/123
- FDDB_Face Detection Data Set and Benchmark http://dataju.cn/Dataju/web/datasetInstanceDetail/130
- NIST Mugshot Identification Database http://dataju.cn/Dataju/web/datasetInstanceDetail/140
- Faces in the Wild face data http://dataju.cn/Dataju/web/datasetInstanceDetail/170
- CelebA celebrity face image data http://dataju.cn/Dataju/web/datasetInstanceDetail/175
- VGG Face face image data http://dataju.cn/Dataju/web/datasetInstanceDetail/189
- Caltech 10k Web Faces face image data http://dataju.cn/Dataju/web/datasetInstanceDetail/125
6.13 Pose Action Images
- HMDB_a large human motion database http://dataju.cn/Dataju/web/datasetInstanceDetail/126
- Human Actions and Scenes Dataset http://dataju.cn/Dataju/web/datasetInstanceDetail/177
- Buffy Stickmen V3 human body contour recognition image data http://dataju.cn/Dataju/web/datasetInstanceDetail/178
- Human Pose Evaluator Human body contour recognition image data http://dataju.cn/Dataju/web/datasetInstanceDetail/179
- Buffy pose Human pose image data http://dataju.cn/Dataju/web/datasetInstanceDetail/181
- VGG Human Pose Estimation pose image annotation data http://dataju.cn/Dataju/web/datasetInstanceDetail/197
6.14 Fingerprint recognition image
- NIST FIGS fingerprint identification data http://dataju.cn/Dataju/web/datasetInstanceDetail/281
- NIST Supplemental Fingerprint Card Data (SFCD) fingerprint identification data http://dataju.cn/Dataju/web/datasetInstanceDetail/280
- NIST Plain and Rolled Images from Paired Fingerprint Cards in 500 pixels per inch fingerprint identification data http://dataju.cn/Dataju/web/datasetInstanceDetail/279 http://dataju.cn/Dataju/web/datasetInstanceDetail/77
- NIST Plain and Rolled Images from Paired Fingerprint Cards in 1000 pixels per inch fingerprint identification data http://dataju.cn/Dataju/web/datasetInstanceDetail/289 http://dataju.cn/Dataju/web/datasetInstanceDetail/132
6.15 Other image data
- Visual Question Answering V1.0 Image Data http://dataju.cn/Dataju/web/datasetInstanceDetail/84
- Visual Question Answering V2.0 Image Data http://dataju.cn/Dataju/web/datasetInstanceDetail/241
- Fashion-MNIST: MNIST-style clothing image dataset [Xiao Han] https://github.com/zalandoresearch/fashion-mnist
- Japanese manga dataset Manga109: http://dl.acm.org/citation.cfm?doid=3011549.3011551
- Pixiv (coloring) image data set [Jerry Li] https://github.com/jerryli27/pixiv_dataset
- Quick, Draw! doodle (sketch) dataset https://github.com/googlecreativelab/quickdraw-dataset
- Sketch (doodle) datasets for sketch-rnn [hardmaru] https://github.com/hardmaru/sketch-rnn-datasets
- Large-Scale Street-Level Image (Segmentation) Dataset [Peter Kontschieder] http://blog.mapillary.com/product/2017
- Large-Scale Japanese Image Description Dataset https://github.com/STAIR-Lab-CIT/STAIR-captions
- Cityscapes Street View Semantic Segmentation Dataset (50 cities, 30 classes, 5k finely annotated and 20k coarsely annotated images, plus annotated videos) https://github.com/mcordts/cityscapesScripts
- (Street) fashion clothing dataset (2000+ labeled pictures) https://github.com/bearpaw/clothing-co-parsing
6.16 Recommender System Dataset
- Netflix movie rating data http://dataju.cn/Dataju/web/datasetInstanceDetail/330
- MovieLens 20m Movie Recommendation Dataset http://dataju.cn/Dataju/web/datasetInstanceDetail/329
- WikiLens http://dataju.cn/Dataju/web/datasetInstanceDetail/227
- Jester http://dataju.cn/Dataju/web/datasetInstanceDetail/350
- HetRec2011 http://dataju.cn/Dataju/web/datasetInstanceDetail/354
- Book Crossing http://dataju.cn/Dataju/web/datasetInstanceDetail/32
- Large Movie Review http://dataju.cn/Dataju/web/datasetInstanceDetail/116
- Retailrocket product review and recommendation data http://dataju.cn/Dataju/web/datasetInstanceDetail/97
- MovieLens https://grouplens.org/datasets/movielens/
- Jester http://www2.informatik.uni-freiburg.de/~cziegler/BX/
- Book-Crossings http://www2.informatik.uni-freiburg.de/~cziegler/BX/
- Last.fm https://grouplens.org/datasets/hetrec-2011/
- OpenStreetMap http://planet.openstreetmap.org/planet/full-history/
- Python Git Repositories https://github.com/lab41/hermes
6.17 Financial Datasets
- Official data released by the US Bureau of Labor Statistics http://dataju.cn/Dataju/web/datasetInstanceDetail/139
- Full ex-rights/ex-dividend, rights issue, and additional share issuance data for Shanghai and Shenzhen stocks, as of 2016.12.31 http://dataju.cn/Dataju/web/datasetInstanceDetail/344
- Daily data for the Shanghai Stock Exchange main board, as of 2017.05.05: original price, forward-adjusted price, and backward-adjusted price, 1260 stocks http://dataju.cn/Dataju/web/datasetInstanceDetail/340
- Daily data for the Shenzhen Stock Exchange main board, as of 2017.05.05: original price, forward-adjusted price, and backward-adjusted price, 466 stocks http://dataju.cn/Dataju/web/datasetInstanceDetail/341
- Daily data for the SZSE SME board, as of 2017.05.05: original price, forward-adjusted price, and backward-adjusted price, 852 stocks http://dataju.cn/Dataju/web/datasetInstanceDetail/342
- Daily data for Shenzhen ChiNext, as of 2017.05.05: original price, forward-adjusted price, and backward-adjusted price, 636 stocks http://dataju.cn/Dataju/web/datasetInstanceDetail/343
- Daily data for Shanghai A shares, 1999.12.09 to 2016.06.08, forward-adjusted, 1095 stocks http://dataju.cn/Dataju/web/datasetInstanceDetail/37
- Daily data for Shenzhen A shares, 1999.12.09 to 2016.06.08, forward-adjusted, 1766 stocks http://dataju.cn/Dataju/web/datasetInstanceDetail/38
- Daily data for the Shenzhen Stock Exchange GEM, 1999.12.09 to 2016.06.08, forward-adjusted, 510 stocks http://dataju.cn/Dataju/web/datasetInstanceDetail/39
- MT4 platform foreign exchange transaction historical data http://dataju.cn/Dataju/web/datasetInstanceDetail/43
- Forex platform foreign exchange transaction historical data http://dataju.cn/Dataju/web/datasetInstanceDetail/67
- Several sets of foreign exchange tick data http://dataju.cn/Dataju/web/datasetInstanceDetail/44
6.18 Traffic Dataset
- 2013 New York taxi driving data http://dataju.cn/Dataju/web/datasetInstanceDetail/348
- 2013 Chicago taxi driving data http://dataju.cn/Dataju/web/datasetInstanceDetail/355
- Udacity self-driving car data http://dataju.cn/Dataju/web/datasetInstanceDetail/356
- New York Uber pick-up data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/76
- British car accident data (2005-2015) [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/323
- Chicago car speeding data [Kaggle data] http://dataju.cn/Dataju/web/datasetInstanceDetail/86
- KITTI autonomous driving task data [data too large; only part is provided] http://dataju.cn/Dataju/web/datasetInstanceDetail/210
- Cityscapes scene annotation data [data too large; only part is provided] http://dataju.cn/Dataju/web/datasetInstanceDetail/210
- German traffic sign recognition data http://dataju.cn/Dataju/web/datasetInstanceDetail/232
- Traffic signal recognition data http://dataju.cn/Dataju/web/datasetInstanceDetail/228
- Chicago Divvy shared bicycle riding data (2013 to present) http://dataju.cn/Dataju/web/datasetInstanceDetail/228
- Riding data of shared bicycles in Chattanooga, USA http://dataju.cn/Dataju/web/datasetInstanceDetail/270
- Bay Area shared bicycle riding data http://dataju.cn/Dataju/web/datasetInstanceDetail/338
- Nice Ride shared bicycle riding data http://dataju.cn/Dataju/web/datasetInstanceDetail/339
- Citi Bike shared bicycle riding data http://dataju.cn/Dataju/web/datasetInstanceDetail/325
- Tracking the human footprint in the Amazon rainforest using satellite data [Kaggle competition] http://dataju.cn/Dataju/web/datasetInstanceDetail/358
- Official trip data from the New York City Taxi and Limousine Commission (2009-2016) http://dataju.cn/Dataju/web/datasetInstanceDetail/359
6.19 Commercial data
- Airbnb open listing information and guest review data http://dataju.cn/Dataju/web/datasetInstanceDetail/360
- Amazon food review data http://dataju.cn/Dataju/web/datasetInstanceDetail/361
- US video game sales and rating data http://dataju.cn/Dataju/web/datasetInstanceDetail/309
- Predicting Apartment Rent Competition Data http://dataju.cn/Dataju/web/datasetInstanceDetail/208
- Bank product recommendation competition data http://dataju.cn/Dataju/web/datasetInstanceDetail/213
- Website user recommendation click prediction competition data http://dataju.cn/Dataju/web/datasetInstanceDetail/319
6.20 Medical data
- Brain MRI image data when people recognize objects http://dataju.cn/Dataju/web/datasetInstanceDetail/99
- Brain MRI image data when people understand words http://dataju.cn/Dataju/web/datasetInstanceDetail/101
- Cardiac atrial images and labeled data http://dataju.cn/Dataju/web/datasetInstanceDetail/100
- Cytopathology identification http://dataju.cn/Dataju/web/datasetInstanceDetail/98
- FIRE retinal fundus lesion image data http://dataju.cn/Dataju/web/datasetInstanceDetail/290
- Cancer data warehouse initiated by the National Cancer Institute of the US Department of Health and Human Services [introduction only] http://dataju.cn/Dataju/web/datasetInstanceDetail/250
- Data Science Bowl 2017 lung cancer recognition competition data [data too large; introduction only] http://dataju.cn/Dataju/web/datasetInstanceDetail/258
- TCGA-LUAD lung cancer CT image data http://dataju.cn/Dataju/web/datasetInstanceDetail/261
- RIDER Lung CT lung cancer CT image http://dataju.cn/Dataju/web/datasetInstanceDetail/275
- TCGA-COAD cancer CT image data http://dataju.cn/Dataju/web/datasetInstanceDetail/284
- TCIA-TCGA-OV cancer CT image data http://dataju.cn/Dataju/web/datasetInstanceDetail/283
- TCIA RIDER NEURO cancer MRI image data http://dataju.cn/Dataju/web/datasetInstanceDetail/287
- QIN Breast breast cancer MRI image data http://dataju.cn/Dataju/web/datasetInstanceDetail/291
6.21 Video data (human motion, object detection, dense crowd, etc.)
- DAVIS_Densely Annotated Video Segmentation data http://dataju.cn/Dataju/web/datasetInstanceDetail/147
- YouTube-8M Video Dataset [data too large; introduction only] http://dataju.cn/Dataju/web/datasetInstanceDetail/133
- YouTube website video backup [data too large; introduction only] http://dataju.cn/Dataju/web/datasetInstanceDetail/134
6.22 Human Action Video
- Microsoft Research Action human action video data http://dataju.cn/Dataju/web/datasetInstanceDetail/144
- UCF50 Action Recognition action recognition data http://dataju.cn/Dataju/web/datasetInstanceDetail/135
- UCF101 Action Recognition action recognition data http://dataju.cn/Dataju/web/datasetInstanceDetail/136
- UT-Interaction human action video data http://dataju.cn/Dataju/web/datasetInstanceDetail/137
- UCF iPhone sensor data in motion http://dataju.cn/Dataju/web/datasetInstanceDetail/148
- UCF YouTube Human Action Video Data http://dataju.cn/Dataju/web/datasetInstanceDetail/125
- UCF Sport human action video data http://dataju.cn/Dataju/web/datasetInstanceDetail/126
- UCF-ARG human action video data http://dataju.cn/Dataju/web/datasetInstanceDetail/141
- HMDB human action video http://dataju.cn/Dataju/web/datasetInstanceDetail/157
- HOLLYWOOD2 human action video data http://dataju.cn/Dataju/web/datasetInstanceDetail/146
- Recognition of human actions action video data http://dataju.cn/Dataju/web/datasetInstanceDetail/244
- Motion Capture motion capture video data http://dataju.cn/Dataju/web/datasetInstanceDetail/245
- SBU Kinect Interaction body movement video data http://dataju.cn/Dataju/web/datasetInstanceDetail/246
6.23 Object Detection Video
- UCSD Pedestrian pedestrian video data http://dataju.cn/Dataju/web/datasetInstanceDetail/247
- Caltech Pedestrian pedestrian video data http://dataju.cn/Dataju/web/datasetInstanceDetail/248
- ETH pedestrian video data http://dataju.cn/Dataju/web/datasetInstanceDetail/223
- INRIA pedestrian video data http://dataju.cn/Dataju/web/datasetInstanceDetail/159
- TudBrussels pedestrian video data http://dataju.cn/Dataju/web/datasetInstanceDetail/151
- Daimler pedestrian video data http://dataju.cn/Dataju/web/datasetInstanceDetail/150
- ALOV++ object tracking video data http://dataju.cn/Dataju/web/datasetInstanceDetail/152
6.24 Dense Crowd Video
- Crowd Counting high-density crowd images http://dataju.cn/Dataju/web/datasetInstanceDetail/156
- Crowd Segmentation high-density crowd video data http://dataju.cn/Dataju/web/datasetInstanceDetail/243
- Tracking in High Density Crowds high-density crowd video http://dataju.cn/Dataju/web/datasetInstanceDetail/200
6.25 Other Videos
- Fire Detection video data http://dataju.cn/Dataju/web/datasetInstanceDetail/186
- Large logo dataset (500,000 logos) https://data.vision.ee.ethz.ch/cvl/lld/
- 4D scanning (3D scanning of moving non-rigid objects at 60fps) data set [D-FAUST] http://dfaust.is.tue.mpg.de
- Counting MNIST: MNIST-based synthetic dataset for visual counting http://fomoro.com/tools/counting-mnist/
- YouTube Music Video Dataset [Keunwoo Choi] https://github.com/keunwoochoi/YouTube-music-video-5M
- Animal Attribute Labeling Dataset [Christoph H. Lampert/Daniel Pucher/Johannes Dostal] http://cvml.ist.ac.at/AwA2/
- Overhead Dance Video Dataset http://homepages.inf.ed.ac.uk/rbf/CEILIDHDATA/
- e-VDS Video Dataset https://engineering.purdue.edu/elab/eVDS/#download
- Clothing Portrait Generation Model (and Chictopia10K [HumanParsing] Fashion Portrait Analysis Dataset) [Christoph Lassner/Gerard Pons-Moll/Peter V. Gehler] http://files.is.tue.mpg.de/classner/gp/
- Pixel-wise segmentation on the VOC2012 dataset, implemented in PyTorch [Bodo Kaiser] https://github.com/bodokaiser/piwise
- Twenty Billion Neurons video dataset of complex object motion and interaction [Nikita Johnson]
6.26 Audio data
- Google AudioSet audio data [data too large to host; introduction only] http://dataju.cn/Dataju/web/datasetInstanceDetail/164
- Sinhala TTS speech data (Sinhala text-to-speech) http://dataju.cn/Dataju/web/datasetInstanceDetail/251
- TIMIT American English speech recognition data http://dataju.cn/Dataju/web/datasetInstanceDetail/252
- LibriSpeech ASR corpus speech data http://dataju.cn/Dataju/web/datasetInstanceDetail/194
- Room Impulse Response and Noise voice data http://dataju.cn/Dataju/web/datasetInstanceDetail/191
- ALFFA African voice data http://dataju.cn/Dataju/web/datasetInstanceDetail/96
- THUYG-20 Uyghur speech data http://dataju.cn/Dataju/web/datasetInstanceDetail/96
- AMI Corpus speech recognition http://dataju.cn/Dataju/web/datasetInstanceDetail/96
- NSynth: Large-Scale High-Quality Note Labeled Audio Dataset https://magenta.tensorflow.org/datasets/nsynth
- Bird Sound Dataset [xeno-canto] http://www.xeno-canto.org
- (TensorFlow) AudioSet Audio Event Dataset Classification Model GitHub: tensorflow/models/tree/master/audioset
6.27 Text, evaluation, answer data collection
- English joke dataset (200,000 jokes) [Taivo Pungas] https://github.com/taivop/joke-dataset
- Machine Learning Insurance Industry Q&A Open Dataset [Hain Wang] https://github.com/shuzi/insuranceQA
- Insurance Industry Question Answering (QA) Dataset [Minwei Feng] https://github.com/shuzi/insuranceQA
- Entity/Noun Semantic Relationship Labeling Dataset [David S. Batista] https://github.com/davidsbatista/Annotated-Semantic-Relationships-Datasets
- RACE: large-scale reading comprehension dataset drawn from English exams (28,000 articles, 100,000 questions) https://github.com/qizhex/RACE_AR_baselines
- Misspelling Dataset http://www.dcs.bbk.ac.uk/~ROGER/corpora.html
- Text Simplification Dataset http://www.cs.pomona.edu/~dkauchak/simplification/
- FrameNet: English word/sentence/semantic frame annotation dataset https://framenet.icsi.berkeley.edu/fndrupal/
- Cross-language/multi-style/multi-granularity text similarity detection dataset https://github.com/FerreroJeremy/Cross-Language-Dataset
- Quora dataset: 400,000 rows of potentially duplicate questions http://qim.ec.quoracdn.net/quora_duplicate_questions.tsv
- Text Classification Dataset http://disi.unitn.it/moschitti/corpora.htm
- Frames: Maluuba dialogue dataset https://datasets.maluuba.com/Frames/dl
- Cross-domain (Amazon Product Reviews) Sentiment Dataset http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
- Semantic Web machine learning system evaluation/benchmark dataset http://dws.informatik.uni-mannheim.de
- Japanese woodblock printing character recognition dataset http://t.cn/RCZPfYB
- Benchmark datasets for evaluating supervised machine learning algorithms https://github.com/EpistasisLab/penn-ml-benchmarks
- New Yelp dataset: 4.7 million reviews covering 156,000 merchants http://t.cn/RNG6JYi
- StackExchange Approximate/Duplicate Question Dataset http://nlp.cis.unimelb.edu.au/resources/cqadupstack/
- AI2 Science Question Answering Dataset (multiple choice) http://t.cn/RI5liwJ
6.28 Research datasets
- NIPS 2003 Feature Selection Challenge data http://dataju.cn/Dataju/web/datasetInstanceDetail/370
- Classification modeling data in LIBSVM format, prepared by Professor Chih-Jen Lin of National Taiwan University http://dataju.cn/Dataju/web/datasetInstanceDetail/296
- Large-scale classification modeling data http://dataju.cn/Dataju/web/datasetInstanceDetail/297
- Several large-scale classification modeling datasets from UCI http://dataju.cn/Dataju/web/datasetInstanceDetail/298
- Social Computing http://dataju.cn/Dataju/web/datasetInstanceDetail/299
- Data Repository social network data http://dataju.cn/Dataju/web/datasetInstanceDetail/300
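Some of the classification data in this section is distributed in LIBSVM format. As a rough illustration only, and assuming the common sparse text layout `<label> <index>:<value> ...` with integer feature indices and no ranking `qid` field, one line of such a file could be parsed in Python as sketched below. The function name `parse_libsvm_line` and the sample line are hypothetical and are not taken from any dataset listed here.

```python
from typing import Dict, Tuple


def parse_libsvm_line(line: str) -> Tuple[float, Dict[int, float]]:
    """Parse one line of the assumed layout: <label> <index>:<value> ..."""
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    label_text, *pairs = tokens
    features: Dict[int, float] = {}
    for pair in pairs:
        # Each feature is stored as "index:value"; absent indices are implicitly zero.
        index_text, value_text = pair.split(":", 1)
        features[int(index_text)] = float(value_text)
    return float(label_text), features


# Hypothetical sample line used only to exercise the parser.
label, features = parse_libsvm_line("+1 3:0.5 10:1 42:-2.25")
print(label, features)
```

For real work, an existing loader such as scikit-learn's `load_svmlight_file` is likely a better choice than a hand-written parser, since it returns a sparse matrix directly.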
6.29 Social Datasets
- Hillary Clinton’s email leak http://dataju.cn/Dataju/web/datasetInstanceDetail/267
- Chicago crime record data since 2001 http://dataju.cn/Dataju/web/datasetInstanceDetail/267
- Crime record data for Chattanooga, USA (2003 to present) http://dataju.cn/Dataju/web/datasetInstanceDetail/353
- Chicago sidewalk café license data http://dataju.cn/Dataju/web/datasetInstanceDetail/358
- Chicago restaurant health inspection results data http://dataju.cn/Dataju/web/datasetInstanceDetail/351
- GPS datasets of several human movement locations and routes (cycling, running, etc.) http://dataju.cn/Dataju/web/datasetInstanceDetail/352
6.30 Other Comprehensive Datasets
- Data Science/Machine Learning Dataset Summary https://elitedatascience.com/datasets
- CORe50: Continuous Object Recognition Dataset [Vincenzo Lomonaco & Davide Maltoni] https://vlomonaco.github.io/core50/
- (Matlab) Automatic discovery of the statistical types of variables in a dataset [Isabel Valera] http://proceedings.mlr.press/v70/valera17a.html
- (Building) Damage Assessment Dataset [tsunami] https://github.com/faiton713/ABCDdataset
- IndieWeb Social Graph Dataset [IndieWeb] http://www.indiemap.org
- DeepMind open source environment/dataset/code collection [DeepMind] https://deepmind.com/research/open-source/
- Wolfram Dataset Repository https://datarepository.wolframcloud.com
- FMA: large dataset for music analysis https://github.com/mdeff/fma
- Instacart online grocery shopping dataset (3 million orders) [Jeremy Stanley] https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2
- Synthetic Financial Dataset for Fraud Detection [TESTIMON] https://www.kaggle.com/ntnu-testimon/paysim1
- LIBSVM format classification/regression/multi-label/string dataset https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html
- Fitting logistic regression to a 100 GB dataset on a laptop [Dmitriy Selivanov] http://dsnotes.com/post/2017-02-07-large-data
- Most complete review of KDD Cup competitions from 2010 to 2017, with datasets http://suo.im/2kRoQ1
- Recipe Dataset: More than 20,000 recipes with ratings, nutrition and category information [Hugo Darwood] https://www.kaggle.com/hugodarwood/epirecipes
- Oscar Dataset [Academy of Motion Picture Arts and Sciences] https://www.kaggle.com/theacademy/academy-awards
- Clustering dataset https://cs.joensuu.fi/sipu/datasets/
- Official Open Climate Dataset https://pan.baidu.com/s/1i52Xarb
- Global Terrorist Attack Dataset [START Consortium] https://www.kaggle.com/START-UMD/gtd
- Seven Machine Learning Time Series Datasets https://machinelearningmastery.com/time-series-datasets-for-machine-learning/
- Horse racing odds dataset http://t.cn/RNf0tXN
- JMIR Dataset Special Issue "JMIR Data" http://t.cn/RCIhmvS
- Census Income Dataset Classification https://github.com/dformoso/sklearn-classification
- Multimodal Binary Behavior Dataset http://t.cn/RCzFn1g
- Facebook StarCraft game dataset (TorchCraft-readable; 365 GB; more than 60,000 games; 1.5 billion frames; nearly 500 million user operations) http://t.cn/R9j8AUM
- Collection of machine learning papers/datasets/tools (Japanese) http://t.cn/RKV7x2A
- Ten data collection strategies for machine learning companies http://t.cn/R54rtvd
- Japanese similar word data set http://t.cn/RaVFV35
- Large-scale human-based cloze (multiple choice reading comprehension) dataset http://t.cn/Rac2Pey
- List of high-quality free datasets http://t.cn/R6B1aqa
- Microsoft MS MARCO dataset, the "ImageNet" of reading comprehension http://t.cn/RIMqGBK
7 Government Open Datasets
- European Government Dataset https://data.europa.eu/euodp/data/dataset
- US Government Dataset https://www.data.gov/
- New Zealand Government Dataset https://catalogue.data.govt.nz/dataset
- Indian Government Dataset https://data.gov.in/
- Northern Ireland Public Dataset https://www.opendatani.gov.uk/