20 open source data communities [notes]

For AI developers, even the best algorithm is useless without good data. Where can good data be found? Open source communities are an important source.

To help developers quickly find the data they need, the editor has collected 20 high-quality open source communities, with a brief introduction and analysis of each, so that readers can quickly locate suitable training data.


Image dataset

1. CORe50 open source platform

CORe50 is designed specifically for (C)ontinuous (O)bject (Re)cognition. It is a dataset and benchmark for continuous object recognition built from 50 everyday objects belonging to 10 categories.

Website: CORe50

2. Caltech Dataset

Caltech 101 is an image dataset provided by Caltech. It contains 101 categories with roughly 40 to 800 images each, and each image is approximately 300x200 pixels. Developers can use it to test recognition algorithms.

Website: http://www.vision.caltech.edu/Image_Datasets/Caltech101/

3. STL data set

The STL dataset is an image recognition dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms. It is inspired by the CIFAR-10 dataset, with some modifications. It contains ten categories of images and a total of 100,000 unlabeled images.

URL: STL-10 dataset

4. NORB image recognition data set

This dataset is intended for experiments in recognizing 3D objects from shape. It contains images of 50 toys belonging to 5 categories. The training set consists of 5 instances of each category (instances 4, 6, 7, 8, and 9), and the test set consists of the remaining 5 instances (instances 0, 1, 2, 3, and 5).

Information: NORB Object Recognition Dataset, Fu Jie Huang, Yann LeCun, New York University
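
The instance-based split described above can be sketched in a few lines of Python. This is only an illustration: the record layout (dictionaries with `category` and `instance` keys) and the function name `split_by_instance` are assumptions made for the example, not the dataset's actual file format or loader.

```python
# Instance ids for each split, as listed in the description above.
TRAIN_INSTANCES = {4, 6, 7, 8, 9}
TEST_INSTANCES = {0, 1, 2, 3, 5}


def split_by_instance(records):
    """Split records into (train, test) lists by their toy instance id."""
    train, test = [], []
    for record in records:
        instance = record["instance"]
        if instance in TRAIN_INSTANCES:
            train.append(record)
        elif instance in TEST_INSTANCES:
            test.append(record)
        else:
            raise ValueError(f"unexpected instance id: {instance}")
    return train, test


# Hypothetical records: 5 categories x 10 instances = 50 toys.
toys = [
    {"category": category, "instance": instance}
    for category in range(5)
    for instance in range(10)
]
train_toys, test_toys = split_by_instance(toys)
print(len(train_toys), len(test_toys))  # 25 25
```

Splitting by toy instance rather than by image means no individual toy appears in both sets, so the test measures recognition of unseen objects within a known category.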

5. ImageNet

ImageNet is an image database organized according to the WordNet hierarchy (currently nouns only), where each node of the hierarchy is represented by hundreds or thousands of images. The project has played an important role in advancing computer vision and deep learning research. The data is freely available to researchers for non-commercial use.

URL: ImageNet

6. The Children's Book Test

A benchmark of (question + context, answer) pairs extracted from children's books made available through Project Gutenberg. It is used for question answering (reading comprehension) and factoid lookup.

Download link: http://www.thespermwhale.com/jaseweston/babi/CBTest.tgz
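
As a rough illustration of what a (question + context, answer) pair looks like in code, the sketch below models one example and scores a predictor against a list of them. The class name `ReadingExample`, the helper functions, and the sample sentences are all made up for this example; they do not reflect the dataset's real file format or contents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadingExample:
    context: str   # passage the question refers to
    question: str  # query with a missing word
    answer: str    # expected answer


def accuracy(examples, predict):
    """Fraction of examples where predict(context, question) equals the answer."""
    if not examples:
        return 0.0
    correct = sum(
        1 for ex in examples if predict(ex.context, ex.question) == ex.answer
    )
    return correct / len(examples)


# Made-up examples for illustration, not taken from the dataset.
examples = [
    ReadingExample("The fox ran into the forest.", "The fox ran into the ____.", "forest"),
    ReadingExample("Anna fed the cat.", "Anna fed the ____.", "cat"),
]


def last_word_baseline(context, question):
    """Toy baseline: guess the last word of the context."""
    return context.rstrip(".").split()[-1]


print(accuracy(examples, last_word_baseline))  # 1.0
```

Any model can be plugged in as `predict`, which makes it easy to compare a trained reader against trivial baselines on the same pairs.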

7. UCI Machine Learning Repository

The UCI Machine Learning Repository is maintained by the University of California, Irvine. It currently hosts 585 datasets covering fields such as healthcare, physics, business, and the social sciences. The datasets are well suited to machine learning experiments in university teaching. The repository was founded in 1987 and has been used by students and educators around the world ever since.

URL: http://archive.ics.uci.edu/ml/index.php

Text data set

8. Amazon Open Source Dataset

Amazon offers a number of downloadable datasets that can be used for machine learning, natural language processing, and more. The datasets can be used directly on the Amazon platform or downloaded to a local computer.

Website: Registry of Open Data on AWS

9. Kaggle community

Kaggle is a data science competition platform that has been acquired by Google. On this platform, researchers can publish data and problems and offer prize money to those who solve them, which makes it something like a crowdsourced data science platform. In addition to text datasets, the community also hosts image, audio, and other datasets. Much of the data used for machine learning can be downloaded for free.

Website: Kaggle: Your Machine Learning and Data Science Community

10. Fudan University Chinese text classification corpus

The data set is provided by the Natural Language Processing Group of the International Database Center of the Department of Computer Information and Technology of Fudan University. It includes 9833 test corpus documents and 9804 training corpus documents, divided into 20 categories in total. The data set is suitable for NLP learning.

Website: Workbench - Heywhale.com

11. CMU Q/A data set

CMU Q/A provides a dataset for academic research consisting of Wikipedia articles, human-generated factoid questions about those articles, and human-generated answers to the questions, together with difficulty ratings. The data was collected by a group of computer scientists working on natural language processing and machine learning, along with many students at Carnegie Mellon University and the University of Pittsburgh.

URL: Question-Answer Dataset

12. bAbI data set

The bAbI dataset is a collection of reading comprehension and question answering datasets from Facebook AI Research (FAIR). It includes data built from children's books and movie dialogues, WikiMovies, dialog-based language learning datasets, and more.

Website: https://research.fb.com/projects/babi/

13. Data Portal

The platform lists a total of 590 data portals from around the world. Most of the data is public data published by countries and regions, including municipal healthcare, population, and geographic information data.

URL: Data Portals

14. Multi-domain sentiment analysis data set

The multi-domain sentiment analysis dataset is an older academic dataset, now at version 2.0 after an update from version 1.0. It totals about 2 GB of data and contains product reviews from many product types (domains) on Amazon.com.

URL: Multi-Domain Sentiment Dataset

Speech dataset

15. Datahub Community

The Datahub community provides datasets in finance, healthcare, the social sciences, education, and other fields, covering a broad and varied range of topics.

URL: https://datahub.io

16. Academic Torrents

According to its official website, Academic Torrents hosts more than 65TB of research data. Anyone can upload datasets through the platform, and the many shared datasets are available to researchers. The platform mainly hosts speech and image data, with an emphasis on the medical field.

Website: https://academictorrents.com/

17. OpenML database

OpenML is a free database of machine learning experiments where anyone can share and download large amounts of open data. The platform currently has a total of 92 datasets, covering healthcare, machinery, the Internet, finance, and other industries.

URL: OpenML

18. GitHub Community

Besides its huge volume of code, the GitHub community also hosts a large number of datasets that can be downloaded for free. They are organized into fine-grained fields, including agriculture, climate, biology, computer networks, economics, education, finance, and others.

Website: GitHub - awesomedata/awesome-public-datasets: A topic-centric list of HQ open datasets.

19. OpenSLR open source platform

OpenSLR is a well-known speech resource platform in the United States, hosting open source speech data from all over the world. Developers in China can download the data through the OpenSLR China mirror, for which Magic Data currently provides storage. The platform hosts more than 1,000 hours of English speech corpora in total.

Website: openslr.org

20. MagicHub open source community

The MagicHub open source community is developed and maintained by the data company Magic Data. The community has so far released more than 30 conversational AI datasets for developers to use in testing and training, including Chinese and English customer service text corpora, pronunciation dictionaries, Mandarin TTS datasets, datasets for eight major dialect areas, and datasets in Italian, Arabic, Spanish, and dozens of other languages. The number and variety of datasets will continue to grow.

Website: MagicHub - Datasets Download | Open-Source Datasets

Origin blog.csdn.net/WASEFADG/article/details/133234270