20 Datasets for Deep Learning Training and Research

Datasets play a vital role in computer science and data science. They are used to train and evaluate machine learning models, to research and develop new algorithms, to improve data quality, to solve practical problems, to advance scientific research, and to support data visualization and decision making. Datasets provide a wealth of information for understanding and applying data, supporting a wide range of application domains, including healthcare, finance, transportation, and social media. Proper selection and processing of datasets is a key factor in the success of data-driven applications and is essential for innovation and for solving complex problems. As such, datasets are not only the basis for technological development but also powerful tools for advancing scientific progress and societal decision-making.

Whether your interest is image recognition, natural language processing, healthcare, or any other area of AI, good datasets are essential, so this article collects 20 commonly used and effective datasets.

MNIST : A classic dataset for image recognition tasks, MNIST contains 28x28 grayscale images of handwritten digits from 0 to 9, split into 60,000 training and 10,000 test samples. It is often called the "Hello World" of image recognition.
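
As a quick illustration, here is a minimal sketch of how MNIST might be loaded with torchvision (assuming torch and torchvision are installed; the "./data" path is just a placeholder):

    # Minimal sketch: download MNIST and iterate over it with a DataLoader.
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    train_set = datasets.MNIST(root="./data", train=True, download=True,
                               transform=transforms.ToTensor())
    train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
    images, labels = next(iter(train_loader))   # images: [64, 1, 28, 28], labels: [64]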

CIFAR-10 : Another popular image recognition dataset, CIFAR-10 contains 60,000 32x32 color images spread evenly across 10 object classes, such as airplanes, cars, and animals.
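
Loading CIFAR-10 follows the same pattern; a hedged torchvision sketch (again with a placeholder root directory):

    # Minimal sketch: CIFAR-10 with a simple ToTensor transform.
    from torchvision import datasets, transforms

    train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                                 transform=transforms.ToTensor())
    test_set = datasets.CIFAR10(root="./data", train=False, download=True,
                                transform=transforms.ToTensor())
    image, label = train_set[0]          # image: [3, 32, 32] tensor
    print(train_set.classes[label])      # e.g. "frog"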

ImageNet : One of the largest image recognition datasets, ImageNet contains more than 14 million labeled images organized into over 21,000 categories; the widely used ILSVRC-2012 subset covers 1,000 classes.
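
torchvision also ships an ImageNet wrapper, but it cannot download the data for you: the ILSVRC-2012 archives must be obtained manually from image-net.org and placed under the root directory. A hedged sketch with placeholder paths:

    # Sketch: torchvision only parses archives (e.g. ILSVRC2012_img_train.tar)
    # that are already present under `root`.
    from torchvision import datasets, transforms

    transform = transforms.Compose([transforms.Resize(256),
                                    transforms.CenterCrop(224),
                                    transforms.ToTensor()])
    train_set = datasets.ImageNet(root="/path/to/imagenet", split="train",
                                  transform=transform)
    print(len(train_set.classes))   # 1000 classes in the ILSVRC-2012 subset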

COCO : This dataset is commonly used for object detection and segmentation tasks; it contains more than 300,000 images and more than 2 million labeled object instances across 80 object categories.
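
For COCO, torchvision provides CocoDetection, which expects the image folder and annotation file downloaded separately from cocodataset.org and the pycocotools package installed; a sketch with placeholder paths:

    # Sketch: pair the train2017 images with their instance annotations.
    from torchvision import datasets, transforms

    coco = datasets.CocoDetection(
        root="/path/to/coco/train2017",
        annFile="/path/to/coco/annotations/instances_train2017.json",
        transform=transforms.ToTensor())
    image, targets = coco[0]   # targets: list of dicts with "bbox", "category_id", ...
    print(len(coco), len(targets))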

Cityscapes : A dataset for autonomous driving tasks, Cityscapes contains street scenes recorded in 50 different cities, with pixel-level annotations for objects such as cars, pedestrians, and buildings.
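
torchvision's Cityscapes wrapper expects the leftImg8bit and gtFine packages to be downloaded manually (registration is required at cityscapes-dataset.com) and extracted under one root; a hedged sketch:

    # Sketch: finely annotated training images with semantic segmentation masks.
    from torchvision import datasets

    train_set = datasets.Cityscapes(root="/path/to/cityscapes", split="train",
                                    mode="fine", target_type="semantic")
    image, mask = train_set[0]   # PIL images: street scene and its pixel-level label map
    print(image.size, mask.size)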

Pascal VOC : Another popular object detection dataset, Pascal VOC contains images of real-world scenes annotated with object bounding boxes and class labels for 20 object categories.
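
Pascal VOC can be downloaded automatically by torchvision; a sketch for the 2012 detection split (the "./data" path is a placeholder):

    # Sketch: VOC 2012 detection data; targets are parsed from the XML annotations.
    from torchvision import datasets

    voc = datasets.VOCDetection(root="./data", year="2012",
                                image_set="train", download=True)
    image, target = voc[0]
    objects = target["annotation"]["object"]   # list of dicts with "name" and "bndbox"
    print([obj["name"] for obj in objects])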

WikiText : A large-scale language modeling dataset built from Wikipedia articles, containing over 100 million tokens in its largest version. WikiText-2 is roughly twice the size of the Penn Treebank, while WikiText-103 is about 110 times larger than the Penn Treebank.
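
Both WikiText configurations are available through the Hugging Face datasets library; a hedged sketch comparing the two (assuming the datasets package is installed, and noting that WikiText-103 is a sizeable download):

    # Sketch: load WikiText-2 and WikiText-103 and compare their training splits.
    from datasets import load_dataset

    wikitext2 = load_dataset("wikitext", "wikitext-2-raw-v1")
    wikitext103 = load_dataset("wikitext", "wikitext-103-raw-v1")
    print(len(wikitext2["train"]), len(wikitext103["train"]))   # number of text lines
    print(wikitext2["train"][10]["text"][:100])                 # a sample passage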

Penn Treebank : A widely used dataset for natural language processing tasks, Penn Treebank contains parsed text from The Wall Street Journal.


SNLI : The Stanford Natural Language Inference dataset contains 570,000 human-written sentence pairs labeled as entailment, contradiction, or neutral. It supports research on natural language inference, also known as recognizing textual entailment (RTE).
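
SNLI is also available on the Hugging Face Hub; a sketch (in this release, labels 0/1/2 correspond to entailment/neutral/contradiction, and -1 marks pairs without a gold label):

    # Sketch: load SNLI and inspect one premise/hypothesis pair.
    from datasets import load_dataset

    snli = load_dataset("snli")
    example = snli["train"][0]
    print(example["premise"])
    print(example["hypothesis"])
    print(example["label"])   # 0 = entailment, 1 = neutral, 2 = contradiction, -1 = no gold label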

SQuAD : The Stanford Question Answering Dataset contains more than 100,000 questions posed on Wikipedia articles, where the answer to each question is a span of text from the corresponding passage.
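
SQuAD v1.1 can be pulled the same way; answers are provided as text spans plus character offsets into the context:

    # Sketch: load SQuAD and look at one question/answer pair.
    from datasets import load_dataset

    squad = load_dataset("squad")
    sample = squad["train"][0]
    print(sample["question"])
    print(sample["answers"]["text"][0])          # answer span text
    print(sample["answers"]["answer_start"][0])  # character offset into sample["context"]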

MIMIC-III : MIMIC-III is a large electronic health record dataset containing de-identified clinical records and diagnostic data from more than 40,000 intensive care patients; access requires completing the PhysioNet credentialing process.
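
MIMIC-III cannot be downloaded by script: after PhysioNet credentialing, the data arrives as compressed CSV tables. A hedged pandas sketch, assuming the files have been placed in a local mimic-iii/ directory (a placeholder path):

    # Sketch: join admissions with patient demographics for a simple cohort view.
    import pandas as pd

    admissions = pd.read_csv("mimic-iii/ADMISSIONS.csv.gz")
    patients = pd.read_csv("mimic-iii/PATIENTS.csv.gz")
    cohort = admissions.merge(patients, on="SUBJECT_ID")
    print(cohort[["SUBJECT_ID", "HADM_ID", "ADMITTIME", "GENDER"]].head())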

Fashion-MNIST : A drop-in replacement for the MNIST dataset, Fashion-MNIST contains 28x28 grayscale images of clothing items from Zalando's catalog instead of handwritten digits, with 60,000 training samples and 10,000 test samples.
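
Because Fashion-MNIST mirrors MNIST's format, the torchvision loader is identical apart from the class name; a sketch:

    # Sketch: Fashion-MNIST as a drop-in replacement for MNIST.
    from torchvision import datasets, transforms

    train_set = datasets.FashionMNIST(root="./data", train=True, download=True,
                                      transform=transforms.ToTensor())
    image, label = train_set[0]          # image: [1, 28, 28]
    print(train_set.classes[label])      # e.g. "Ankle boot"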

CelebA : A dataset of more than 200,000 celebrity face images, each annotated with 40 attributes such as age, gender, and facial expression, making it useful for face attribute prediction and face recognition research. The dataset was released by MMLAB at The Chinese University of Hong Kong.
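
torchvision also wraps CelebA; note that the automatic download pulls from Google Drive and can be rate-limited, in which case the files can be fetched manually into the root directory. A sketch requesting the 40 binary attributes as targets:

    # Sketch: CelebA images with their 40-dimensional attribute vectors.
    from torchvision import datasets, transforms

    celeba = datasets.CelebA(root="./data", split="train", target_type="attr",
                             download=True, transform=transforms.ToTensor())
    image, attrs = celeba[0]       # attrs: tensor of 40 binary attributes
    print(celeba.attr_names[:5])   # e.g. ['5_o_Clock_Shadow', 'Arched_Eyebrows', ...]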

Kinetics : A dataset for human action recognition, Kinetics contains hundreds of thousands of YouTube video clips of people performing actions such as walking, running, and dancing. Each clip is about 10 seconds long, and the Kinetics-600 release covers 600 action classes.

Open Images : A large-scale dataset for object detection and other vision tasks, Open Images contains about 9 million images annotated with bounding boxes for 600 object categories, along with image-level labels and visual relationship annotations.

LJSpeech : A dataset for text-to-speech synthesis, LJSpeech contains 13,100 short audio clips (about 24 hours in total) of a single speaker reading aloud passages drawn from 7 non-fiction books, each paired with a transcription.
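
torchaudio includes a loader for LJSpeech that downloads the archive on first use; a sketch (the root path is a placeholder):

    # Sketch: each LJSpeech item is a waveform plus its transcript.
    from torchaudio.datasets import LJSPEECH

    ljspeech = LJSPEECH(root="./data", download=True)
    waveform, sample_rate, transcript, normalized_transcript = ljspeech[0]
    print(sample_rate)             # 22050 Hz audio
    print(normalized_transcript)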

LibriSpeech : A dataset for speech recognition tasks, LibriSpeech contains roughly 1,000 hours of 16 kHz read English speech derived from LibriVox audiobooks, with corresponding transcripts.
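
torchaudio similarly wraps LibriSpeech; subsets are selected by name (e.g. "train-clean-100"), and the small "dev-clean" split is used here to keep the download manageable:

    # Sketch: load the dev-clean subset and inspect one utterance.
    from torchaudio.datasets import LIBRISPEECH

    libri = LIBRISPEECH(root="./data", url="dev-clean", download=True)
    waveform, sample_rate, utterance, speaker_id, chapter_id, utterance_id = libri[0]
    print(sample_rate)   # 16000 Hz audio
    print(utterance)     # transcript of the spoken sentence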

AudioSet : A dataset for audio event recognition, AudioSet contains over 2 million 10-second sound clips drawn from YouTube videos, human-labeled with an ontology of 527 sound classes.

NSynth : A dataset for musical note synthesis, NSynth contains 305,979 musical notes produced by 1,006 different instruments, each annotated with pitch and timbre information.

Chess : A dataset for chess game outcome prediction, containing data from thousands of games with information such as player ratings and move sequences.

Datasets are indispensable tools in data science and artificial intelligence: they provide the raw material for model training and evaluation, problem solving, and scientific research. Selecting an appropriate dataset and performing effective data processing and analysis are important steps in ensuring the success of data-driven applications.

https://avoid.overfit.cn/post/8e58a98d26f04a00811257aebdd3e931


Reprinted from: blog.csdn.net/m0_46510245/article/details/132634799