Key elements of deep learning: data collection and sharing

Introduction

In deep learning applications, data is one of the most important factors, so choosing a good dataset is crucial to a model's success. When selecting a dataset, pay attention not only to its size, diversity, and quality, but also to whether it represents the real conditions of the research problem. This article collects the public datasets currently available in the field of deep learning for readers to choose from when training models.
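The selection criteria above (size, diversity, quality) can be checked roughly before any training starts. The following is only a minimal sketch using the Python standard library; the function names and the toy data are hypothetical and are not taken from any of the platforms listed below.

```python
from collections import Counter


def summarize_labels(labels):
    """Return the number of samples and each class's share of the dataset."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0, {}
    shares = {label: count / total for label, count in counts.items()}
    return total, shares


def count_duplicates(samples):
    """Count samples that exactly repeat an earlier sample."""
    return len(samples) - len(set(samples))


# Hypothetical toy data standing in for a real dataset.
samples = ["good movie", "bad movie", "good movie", "great plot"]
labels = ["pos", "neg", "pos", "pos"]

total, shares = summarize_labels(labels)
print("samples:", total)
print("class shares:", shares)
print("exact duplicates:", count_duplicates(samples))
```

A heavily skewed class share or a large number of duplicates is a hint that the dataset may not represent the real problem well and deserves a closer look.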

1 Comprehensive Datasets

1.1 Kaggle Dataset

Kaggle is one of the largest online repositories of datasets, covering topics that range from sports to medicine to government. The platform is community-led, meaning users can upload their own datasets. Because the data comes from so many different sources, it is important to check the quality of any dataset thoroughly before using it. Kaggle also hosts discussions on machine learning topics and tutorials on key processes.

Address: kaggle datasets

1.2 AI Studio Dataset

AI Studio, launched by Baidu, is a one-stop development platform and integrated programming environment. It includes AI tutorials, a code environment, computing power for algorithms, and datasets, and it provides free online cloud computing.

Address: AI Studio Dataset

1.3 Tianchi Dataset

Tianchi Dataset is Alibaba Group's publicly accessible scientific research data platform. Its data is provided jointly by Alibaba Group's business teams and external research institutions. It covers more than ten industries, such as e-commerce, entertainment, logistics, healthcare, transportation, manufacturing, natural science, and energy, and spans classic artificial intelligence fields including data mining, machine learning, computer vision, natural language processing, and decision intelligence.

Address: tianchi datasets

1.4 Graviti Dataset

Graviti is a platform that provides public datasets. You can easily search for the data you want and preview sample data, annotations, and labels online. Graviti has collected more than 400 high-quality CV datasets, covering AI application fields such as autonomous driving, smart retail, and robotics.

Address: graviti datasets

1.7 Papers with Code

Papers with Code hosts over four thousand datasets (and counting), uploaded by the community. You can easily filter them by modality, task, and language. The site also links to other databases that provide a variety of datasets.

Address: papers with code datasets

1.8 DataFlair

DataFlair links to over 70 machine learning datasets and also includes useful information such as source code and project ideas. For example, for its list of handwritten-digit datasets, DataFlair suggests building an image classification model that recognizes handwritten digits on paper. Use the site to inspire new ideas.

Address: data flair

1.9 EliteDataScience

EliteDataScience includes free datasets and a curated list of the most popular aggregators. These datasets are organized by use case, and include datasets that can be used for deep learning, natural language processing, web scraping, and more.

Address: elite data science

1.10 UCI dataset

The UCI Machine Learning Repository has more than 500 machine learning datasets, sortable by file type, task, application domain, and topic. Many of them link to academic papers that can be used for benchmarking. It is one of the oldest sources of datasets and a good first stop when looking for interesting data. Because the datasets are user-contributed, their cleanliness varies, but the vast majority are clean and can be downloaded directly from the repository without registration.

Address: uci dataset

1.11 GitHub public dataset

GitHub public datasets is an open source collection of public datasets. You can browse the catalog and choose a topic, ranging from agriculture to transportation and more. The collection also includes general machine learning models. Most of the linked datasets are free.

Address: github datasets

1.12 Azure Datasets

Microsoft Azure has a database of public datasets that developers can use for prototyping and testing. Its categories include U.S. government and agency data, other statistical and scientific data, and online service data. You can also read documentation there on SQL and on building mobile and web applications.

Address: azure datasets

2 Computer Vision Datasets

2.1 ImageNet dataset

ImageNet is one of the most popular datasets in deep learning today, and it contains a large amount of image data and annotations. Its annotations are hierarchical, covering large, medium, and small categories: the larger categories are more general, and the smaller categories are more specific. This structure makes the dataset well suited to research on image classification.

Address: ImageNet dataset

2.2 COCO Dataset

The full name is "Microsoft Common Objects in Context Dataset". COCO is a large-scale dataset that can be used for object detection, semantic segmentation, and image captioning. It has more than 330K images (220K of which are labeled), containing 1.5 million object instances in 80 object categories (pedestrians, cars, elephants, etc.) and 91 stuff categories (grass, wall, sky, etc.). Each image comes with five sentences describing it, and 250,000 people are annotated with keypoints.
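COCO annotations are distributed as JSON files with top-level "images", "categories", and "annotations" lists, where each annotation refers to an image and a category by id and stores its box as [x, y, width, height]. The sketch below reads a tiny hand-written sample in that layout (not real COCO data) and counts objects per category; the helper names are hypothetical.

```python
import json
from collections import Counter

# Tiny hand-written sample in the COCO annotation layout (not real COCO data).
SAMPLE = """
{
  "images": [{"id": 1, "file_name": "a.jpg", "width": 640, "height": 480}],
  "categories": [{"id": 1, "name": "person"}, {"id": 3, "name": "car"}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [10, 20, 50, 100]},
    {"id": 11, "image_id": 1, "category_id": 3, "bbox": [200, 150, 120, 60]},
    {"id": 12, "image_id": 1, "category_id": 1, "bbox": [300, 40, 40, 90]}
  ]
}
"""


def objects_per_category(coco):
    """Count annotated objects for each category name."""
    names = {cat["id"]: cat["name"] for cat in coco["categories"]}
    return Counter(names[ann["category_id"]] for ann in coco["annotations"])


def bbox_to_corners(bbox):
    """Convert a COCO [x, y, width, height] box to (x1, y1, x2, y2)."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


coco = json.loads(SAMPLE)
counts = objects_per_category(coco)
print(counts)
print(bbox_to_corners(coco["annotations"][0]["bbox"]))
```

For real annotation files, replace `json.loads(SAMPLE)` with `json.load` on the downloaded file; the rest of the sketch stays the same.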

Address: coco dataset

2.3 IMDB-Wiki Dataset

The IMDB-Wiki dataset is one of the largest public collections of face images, with over 500,000 images. The images show celebrities and were collected from IMDb and Wikipedia. Each image is tagged with gender and age.

Address: imdb datasets

2.4 LabelMe dataset

The LabelMe dataset was built using the LabelMe labeling tool, which lets users outline and label objects. The dataset can be used in image recognition projects.

Address: labelme datasets

2.5 chars74k dataset

chars74k includes 74,000 images. The data is intended for character recognition in natural images (for example, images of restaurant signs).

Address: chars74k datasets

2.6 Kinetics-700 Dataset

Kinetics-700 consists of links to YouTube videos, each labeled with a human action. There are more than 650,000 video clips covering 700 human action classes.

Address: kinetics-700 datasets

2.7 Places2 Database

Places2 Database is a dataset released by MIT, containing more than 10 million images covering more than 400 scenes. It is helpful for projects such as scene classification and scene parsing.

Address: places2 datasets

2.8 MPII Human Pose Dataset

The MPII Human Pose dataset includes about 25,000 images covering 410 human activities. The images contain approximately 40,000 different people, each with annotated body joints. The images were collected from YouTube videos.

Address: human-pose datasets

2.9 Open Images dataset

Open Images is an open source image dataset released by Google; the latest version, V7, was released in October 2022. It contains more than 9 million images, all labeled with categories, and more than 1.9 million of them have very fine-grained annotations. Open Images can be used for many applications, including image classification, object detection, image segmentation, and image generation.

Address: open images dataset

2.10 Cityscapes dataset

Cityscapes is a dataset for semantic segmentation of urban street scenes, containing 3,257 high-resolution images from 50 cities in Germany. It covers street-view images under different lighting conditions, such as morning, daytime, and night. Each image has a resolution of 2048x1024 and is professionally annotated with multiple labels, including buildings, roads, and pedestrians. The dataset also provides training, validation, and test splits, as well as benchmark performance metrics. Cityscapes has helped advance urban scene analysis and opened up more possibilities for research on and application of deep learning algorithms.

Address: cityscapes dataset

2.11 Sogou Dataset

This Internet image library is drawn from part of the data indexed by Sogou image search. It contains 2,836,535 pictures in categories including people, animals, buildings, machinery, landscapes, and sports. For each picture, the dataset provides the original image, a thumbnail, the web page where the picture appears, and the relevant text on that page. The dataset is larger than 200 GB.

Address: http://www.sogou.com/labs/dl/p.html

2.12 ImageCLEF dataset

ImageCLEF is part of the Cross Language Evaluation Forum (CLEF) and is committed to providing benchmarks for image-related tasks (retrieval, classification, annotation, etc.). The competition has been held every year since 2003.

Address: http://www.imageclef.org/

3 Natural Language Processing Datasets

3.1 Google Blogger Corpus

Google Blogger Corpus includes nearly 700,000 blog posts from blogger.com. Each article has at least 200 English words. Overall, these blog posts contain many common English words.

Address: BlogCorpus datasets

3.2 Yelp Reviews

The Yelp Reviews dataset contains ratings and reviews of restaurants, along with rich related information. The reviews can be used in sentiment analysis projects.

Address: yelp datasets

3.3 WikiQA Corpus

The WikiQA corpus is a question answering dataset compiled from Bing search data. It includes more than 3,000 questions and 29,000 candidate answer sentences, 1,500 of which are labeled as correct answers.

Address: WikiQA Corpus

3.4 WordNet

WordNet is a database of English words grouped by meaning. It contains 117,000 synsets (sets of synonymous words), each of which is linked to related synsets. It can be used in text classification projects.

Address: wordnet datasets

3.5 OpinRank dataset

The OpinRank dataset contains 300,000 reviews from Edmunds and TripAdvisor. They are categorized by destination, hotel and other relevant factors.

Address: OpinRank datasets

3.6 Multi-Domain Sentiment Dataset

The multi-domain sentiment dataset includes Amazon.com product reviews from four domains: DVD, Books, Kitchen, and Electronics. Each domain has thousands of reviews with 1-5 star ratings. As the name suggests, this dataset is useful for sentiment analysis projects.

Address: mdredze datasets

3.7 Twitter Sentiment Analysis Dataset

The Twitter sentiment analysis dataset includes more than 1.5 million classified tweets. Each row of the dataset has a label: 1 for positive sentiment and 0 for negative sentiment.
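With a 0/1 label on every row, a quick look at the label distribution is a natural first step. The following sketch assumes the data is a CSV file; the column names "Sentiment" and "SentimentText" and the three sample rows are assumptions made for illustration, so adjust them to the header of the file you actually download.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the column names are assumptions, not the
# documented schema of the real dataset.
SAMPLE_CSV = """Sentiment,SentimentText
1,I love this phone
0,worst service ever
1,what a great day
"""


def label_distribution(csv_file, label_column="Sentiment"):
    """Count how many rows carry each sentiment label (1 positive, 0 negative)."""
    reader = csv.DictReader(csv_file)
    return Counter(int(row[label_column]) for row in reader)


dist = label_distribution(io.StringIO(SAMPLE_CSV))
print("positive:", dist[1], "negative:", dist[0])
```

For a file on disk, pass `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` object.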

Address: twitter-sentiment datasets

3.8 Newsgroups dataset

The 20 Newsgroups dataset contains 20,000 documents drawn, as the name suggests, from 20 different newsgroups. It covers many topics, some of which are closely related. The dataset comes in three versions: the original version, a version with dates removed, and a version with duplicates removed.

Address: 20Newsgroups datasets

3.9 HuggingFace dataset

The HuggingFace datasets library includes 611 text datasets, each of which can be downloaded and made ready to use with one line of Python. They cover 467 languages, 99 of which have at least 10 datasets.
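The "one line of Python" refers to the `load_dataset` function of the `datasets` library. The sketch below wraps that call in a small helper; the helper name and the choice of the "imdb" dataset as a default are only illustrative, and the call assumes the library is installed (`pip install datasets`) and that a network connection is available the first time a dataset is fetched.

```python
def load_text_dataset(name="imdb", split="train"):
    """Download (or load from the local cache) one split of a Hub dataset."""
    # Imported lazily so this file can be imported even if the optional
    # `datasets` package is not installed.
    from datasets import load_dataset

    return load_dataset(name, split=split)


# Usage (not executed here because it downloads data):
#   reviews = load_text_dataset("imdb", "train")
#   print(reviews[0]["text"][:80], reviews[0]["label"])
```

The downloaded split behaves like a list of dictionaries, so individual examples can be inspected by index before the data is fed to a model.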

Address: huggingface datasets

4 Audio and Video Datasets

4.1 M-AI Labs Speech Dataset

The M-AI Labs speech dataset includes nearly 1,000 hours of audio with transcriptions. It contains male and female voices in multiple languages.

Address: MAI labs datasets

4.2 LibriSpeech

LibriSpeech includes approximately 1000 hours of speech data that has been segmented and aligned. These data were compiled from audiobooks from the LibriVox project.

Address: Librispeech datasets

5 Dataset Search

5.1 Google Dataset Search

Google provides a dataset search engine where you can search for datasets by name. The engine allows you to sort datasets by several features, such as file type, subject, latest update, and relevance. It can also pull datasets from thousands of databases on the internet, so you can really search through a wide range of options. Uploaders of the dataset include numerous international organizations such as Harvard University and the World Health Organization.

Address: google dataset search

5.2 CLUE dataset retrieval

CLUE is a Chinese language understanding benchmark that includes representative datasets, baseline (pre-trained) models, corpora, and leaderboards. It selects a series of datasets corresponding to representative tasks as its benchmark test sets, covering different tasks, data volumes, and levels of difficulty.

Address: cluebenchmarks

5.3 VisualData dataset

VisualData contains some excellent datasets for building computer vision models. Users can search it by CV topic, such as semantic segmentation, image captioning, image generation, and self-driving cars.

Address: visualdata

6 Specific Datasets

6.1 Medical Image Datasets

Lung nodule database LIDC-IDRI: cancer image

Breast Image Database DDSM MIAS: Breast Image Database

Medical Image FAQ: medical-image-faq

Right Ventricle Segmentation Challenge (2012): mr-images

Lung Cancer Classification Competition: http://data-science-bowl-2017

Segmenting lung cancers (Kaggle): finding-lungs-in-ct

Lung cancer database: cancer image

Medical imaging dataset: medical-data

Medical Image Analysis: grand-challenge

6.2 Kaggle Competition Dataset

6.3 Natural Language Processing Datasets

6.4 Various types/scene image data/comprehensive image

6.5 Scene Image

6.6 Web Image Tags

6.7 Human silhouette image

6.8 Visual Text Recognition Image

6.10 Material texture images

6.11 Object Classification Images

6.12 Face Image

6.13 Pose Action Images

6.14 Fingerprint recognition image

6.15 Other image data

6.16 Recommender System Dataset

6.17 Financial Datasets

6.19 Commercial data

6.21 Video data (human motion, object detection, dense crowd, etc.)

6.22 Human Action Video

6.23 Object Detection Video

6.24 Dense Crowd Video

6.25 Other Videos

6.26 Audio data

6.27 Text, evaluation, answer data collection

6.28 Research datasets

6.29 Social Datasets

6.30 Synthesis of other datasets

7 Government Open Datasets

European Government Dataset https://data.europa.eu/euodp/data/dataset

US Government Dataset https://www.data.gov/

New Zealand Government Dataset https://catalogue.data.govt.nz/dataset

Indian Government Dataset https://data.gov.in/

Northern Ireland Public Dataset https://www.opendatani.gov.uk/

Origin blog.csdn.net/lsb2002/article/details/132178923