42 Recommended Datasets for Artificial Intelligence and Machine Learning

To launch artificial intelligence (AI) projects successfully, many companies are turning to external datasets. Finding datasets is easier than ever, and datasets are increasingly critical to the performance of machine learning models. A number of sites host data repositories covering a wide range of topics, from images of rare frogs to handwriting samples. Whatever your machine learning (ML) project, you can find relevant datasets to use as a starting point. In this article, we collect links to more than 40 high-quality ML data repositories and datasets. For ease of use, we have categorized them by project type and industry. Note that while these datasets are often good starting points, your use case may require additional labeling on top of what is readily available.

 

What kind of data do I need?

Before starting your search for the right dataset, it helps to answer a few key questions:

  • What am I trying to achieve with the AI project?
  • Do I have enough internal data to use for this project?
  • What data do I want to have?
  • What use cases do I need the data to cover?
  • Which edge cases do I need the data to cover?

These preliminary questions help you get a clearer picture of the specific type of data you need. If you're dealing with protected classes (that is, groups defined by race, gender, sexual orientation, or other characteristics), more effort will be required to ensure that your dataset represents these groups appropriately. In any case, be specific when searching for data; machine learning projects can easily be derailed by low-quality data.
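One way to make the representation check concrete is to measure how large each group's share of a candidate dataset is before committing to it. The following is a minimal sketch, not a complete fairness audit; the field name `age_band`, the toy records, and the 20% threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def group_shares(records, key):
    """Return the fraction of records that fall into each group of `key`."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def underrepresented(shares, min_share):
    """Return the groups whose share is below `min_share`, sorted by name."""
    return sorted(group for group, share in shares.items() if share < min_share)


# Hypothetical toy records; a real dataset would be loaded from files.
records = (
    [{"age_band": "31-50"}] * 6
    + [{"age_band": "18-30"}] * 3
    + [{"age_band": "51+"}] * 1
)

shares = group_shares(records, "age_band")
flagged = underrepresented(shares, min_share=0.2)  # threshold is an assumption
print(shares)
print("Below threshold:", flagged)
```

A group that falls below the threshold is a signal to look for additional data or labeling for that group, not proof that the dataset is unusable.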

Why choose an off-the-shelf dataset?

Your team may ultimately decide to train your model with an off-the-shelf dataset. Such choices are increasingly common in the field of AI for one reason: building AI is very difficult. Most AI projects fail to achieve deployment due to a variety of factors, including:

  • Low budget. Investing in AI projects often requires substantial capital.
  • Lack of talent. Skills gaps exist across the technology sector, but especially in AI and ML. Without enough highly skilled people, existing AI plans cannot be launched and future plans are pushed even further out. This gap is likely to widen as the industry develops.
  • It is still early in the development of AI. Businesses must establish the proper organizational structure to build AI, which means having the internal processes, strategies, and collaborations needed to build it successfully.
  • Data quality is low or data is insufficient. This last factor, it turns out, is the biggest hurdle to building AI. ML models often require large amounts of data to perform accurately. Depending on the use case, acquiring data presents different challenges. Furthermore, converting low-quality data to high-quality labeled data can be time-consuming and inefficient.

Setting up data annotation is also difficult for many businesses, so it's no surprise that they are turning to third parties. To solve the data bottleneck, enterprises buy off-the-shelf datasets or use free ones. These datasets are good starting points for building ML models and, in some cases, are sufficient to cover all use cases. The advantages of off-the-shelf datasets include:

  • Compliance. Customers and regulatory authorities are increasingly demanding data security, which makes it increasingly difficult for companies to use internal data. Some businesses naturally have access to large amounts of data at work, but that doesn't mean they can use that data for ML models, especially if doing so could violate customer privacy.
  • Reduced bias. Building responsible AI is a more prominent topic than ever, as companies recognize the importance of reducing model bias. When businesses rely on internal data, it can be difficult to detect and reduce bias. With an off-the-shelf dataset, you can research the source of the data to see whether it was created with a bias check. Trusted data providers will be able to provide diverse, high-quality datasets.
  • Faster time to market. Collecting and preparing data is time-consuming and takes up most of a data scientist's time on a project. With off-the-shelf datasets, most of that work is already done (although you will still need to check the quality of the dataset yourself). In an industry where speed is of the essence, this shortens time to market.
  • Cost-effectiveness. The process of aggregating, reviewing, and preparing internal data can be costly. Many off-the-shelf online datasets are available for free or at low cost. If your AI budget is modest, leveraging off-the-shelf datasets might be the right choice.

The advantages of ready-made datasets can help solve many common problems in AI development. Using off-the-shelf datasets is certainly a beneficial strategy to consider in ML model implementation.  

Best starting points for finding datasets

The Internet is full of high-quality off-the-shelf datasets. Listed below are many of the best places to search and discover datasets online, in no particular order. We start with data repositories and then list the best datasets for specific use cases.

Data repositories

Data repositories collect datasets from across the web.

Kaggle

Kaggle is one of the largest online repositories of datasets, covering a range of topics from sports to medicine to government. Its platform is community-led, meaning users can upload their own datasets. Given the variety of Kaggle's data sources, it's important to thoroughly check the quality of any dataset you take from it. Kaggle also offers discussions on machine learning topics as well as tutorials on key processes.

Google Dataset Search

Google provides a dataset search engine where you can search for datasets by name. The engine lets you filter datasets by several features, such as file type, subject, latest update, and relevance. It pulls datasets from thousands of databases on the internet, so you can search through a wide range of options. Dataset providers include institutions such as Harvard University and the World Health Organization.

Papers with Code

Papers with Code currently has over 4,000 datasets (and counting), all uploaded by the community. You can easily filter these datasets by modality, task, and language. The site also links to other databases that provide a variety of datasets.

DataFlair

DataFlair links to over 70 machine learning datasets and also includes useful information such as source code and project ideas. For example, for the datasets containing handwritten digits, DataFlair suggests building an image classification algorithm that recognizes handwritten digits on paper. Use the site to inspire new ideas.

EliteDataScience

EliteDataScience includes free datasets and a curated list of the most popular aggregators. These datasets are organized by use case, and include datasets that can be used for deep learning, natural language processing, web scraping, and more.

UCI Machine Learning Repository

UCI has more than 500 machine learning datasets, sortable by file type, task, application domain, and topic. Many of these datasets contain links to academic papers that can be used for benchmarking.

Awesome Public Datasets on GitHub

GitHub hosts an open source collection of public datasets. There you can view the catalog and choose a topic, ranging from agriculture to transportation and more. GitHub also includes a collection of general machine learning models. Most of the linked datasets are free.

Azure Public Datasets

Microsoft Azure has a database of public datasets that developers can use for prototyping and testing. Database categories include U.S. government and agency data, other statistical and scientific data, and online service data. Also, there you can read documentation about SQL and how to build mobile and web apps.

Snowflake Data Marketplace

Snowflake includes more than 650 real-time, ready-to-query datasets from more than 175 third-party data providers and data service providers. It serves data scientists, business intelligence and analytics professionals, and anyone who wants to make data-driven decisions.

Registry of Open Data on AWS

AWS has a registry of datasets available through AWS resources. Users can share their own datasets or add examples of how to use a particular dataset. There are over 280 searchable datasets in the registry.

KDNuggets

KDNuggets has a comprehensive list of data repositories that includes a wide variety of datasets. The list includes more than 75 data repositories, some of which are international.

Appen

Appen offers a variety of off-the-shelf training datasets. Our catalog includes more than 250 licensable datasets in more than 80 languages, covering multiple dialects. These datasets include many machine learning use cases, such as speech recognition and natural language processing, and cover a range of file types (text, image, video, speech, and audio). For example:

  • Fully transcribed speech datasets for broadcast, call center, in-vehicle and telephony applications;
  • Pronunciation dictionaries, including general vocabulary and domain-specific vocabulary (e.g. names, places, natural numbers);
  • Dictionaries and thesauruses with part-of-speech tags;
  • A text corpus with lexical information and tokens for named entities.

We only provide the highest quality datasets to power your AI needs.  

Computer Vision Datasets

These databases and datasets include image data for your computer vision projects.

ImageNet

ImageNet is an image database organized according to the WordNet noun hierarchy, where each node has hundreds or thousands of associated images. Data in this repository are freely available to researchers.

MNIST database

MNIST features images of handwritten digits. This includes a training set of 60,000 examples and a test set of 10,000 examples.

IMDB-Wiki dataset

The IMDB-Wiki dataset is one of the largest public collections of face images, with over 500,000 images. The images show celebrities and are drawn from IMDb and Wikipedia. Each image is tagged with gender and age.

LabelMe dataset

LabelMe Dataset is built using the LabelMe labeling tool. This tool enables users to outline and label objects. This dataset can be used in image recognition projects.

MS COCO dataset

MS COCO stands for "Microsoft Common Objects in Context". The dataset was released to support recognizing common objects in their natural context. It contains more than 120,000 images, and each image has multiple annotations for tasks such as object detection and segmentation. The objects in the dataset are classified into 91 categories.

Chars74K

Chars74K, as the name suggests, includes 74,000 images. The dataset targets character recognition in natural images (for example, images of restaurant signs).

Kinetics-700

Kinetics-700 contains links to YouTube videos labeled with human actions. It includes more than 650,000 video clips covering 700 human action classes.

Places2 Database

Places2 Database is a dataset released by MIT, containing more than 10 million images covering more than 400 scenes. It is helpful for projects such as scene classification and scene parsing.

Open Images

The Open Images dataset is one of the largest datasets featuring object location annotations. It has more than 9 million images, each with object bounding boxes, segmentation, and other annotations. There are 16 million bounding boxes in total, covering 600 categories.

MPII Human Pose Dataset

The MPII Human Pose dataset includes about 25,000 images covering 410 human activities. The images contain about 40,000 different people, each with annotated body joints. These images are collected from YouTube videos.

Natural Language Processing Datasets

The following datasets contain natural language examples in text and audio form that can be used in your natural language processing projects. Example tasks include sentiment analysis, speech recognition, and transcription.

Google Blogger Corpus

Google Blogger Corpus includes nearly 700,000 blog posts from blogger.com. Each post has at least 200 English words. Overall, these blog posts contain many common English words.

Yelp Reviews

The Yelp Reviews dataset covers ratings and reviews of restaurants and contains rich information on this topic. The reviews in this dataset can be used in sentiment analysis projects.

WikiQA Corpus

The WikiQA corpus is a question answering dataset compiled from Bing search data. It includes more than 3,000 questions and 29,000 candidate sentences, about 1,500 of which are labeled as answer sentences.

M-AILABS Speech Dataset

The M-AILABS Speech Dataset includes nearly 1,000 hours of audio with transcriptions. It includes male and female voices in multiple languages.

LibriSpeech

LibriSpeech includes approximately 1,000 hours of speech data that has been segmented and aligned. The data was compiled from audiobooks from the LibriVox project.

WordNet

WordNet is a database of English words grouped by meaning. There are 117,000 synsets (sets of synonymous words), each of which is linked to related synsets. You can use it in your next text classification project.

OpinRank dataset

The OpinRank dataset contains 300,000 reviews: car reviews from Edmunds and hotel reviews from TripAdvisor. The reviews are organized by destination, hotel, and other relevant factors.

Multi-Domain Sentiment Dataset

The multi-domain sentiment dataset includes Amazon.com product reviews from four domains: DVD, Books, Kitchen, and Electronics. Each domain has thousands of reviews with 1-5 star ratings attached. As the name suggests, this dataset is useful for sentiment analysis projects.

Twitter Sentiment Analysis

The Twitter Sentiment Analysis dataset includes over 1.5 million classified tweets. Each row of the dataset carries a label: 1 for positive sentiment and 0 for negative sentiment.
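A binary labeling scheme like this one is straightforward to read and sanity-check. The sketch below assumes a hypothetical two-column CSV layout (`label`, `text`) and uses three made-up rows; the real file's column names and layout may differ, so treat them as placeholders.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in an assumed (label, text) layout; not real data.
SAMPLE = (
    "label,text\n"
    "1,Loving the new update\n"
    "0,Worst service ever\n"
    "1,Great game tonight\n"
)

# Mapping from the dataset's numeric labels to readable names.
LABEL_NAMES = {"1": "positive", "0": "negative"}


def load_labeled_tweets(handle):
    """Read (text, sentiment) pairs from a CSV file-like object."""
    reader = csv.DictReader(handle)
    return [(row["text"], LABEL_NAMES[row["label"]]) for row in reader]


tweets = load_labeled_tweets(io.StringIO(SAMPLE))
balance = Counter(sentiment for _, sentiment in tweets)
print(balance)
```

Checking the class balance first is a cheap way to spot a skewed sample before training a sentiment classifier on it.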

20 Newsgroups

20 Newsgroups contains about 20,000 documents drawn, as the name suggests, from 20 different newsgroups. It covers many topics, some of which are closely related. The dataset comes in three versions: an original version, a version with dates removed, and a version with duplicates removed.

Datasets by Industry

It is worth mentioning that there are several valuable resources available for obtaining industry-specific data.

US Government Data Portal

The U.S. Government Data Portal publishes open data released by the U.S. government. Through the portal you can search more than 300,000 datasets (for example, student loan data and healthcare facility billing data). Industry: Government

EU Open Data Portal

The EU Open Data Portal provides a way to search EU institutional data such as demographic data, education data, etc. Industry: Government

World Health Organization

The World Health Organization provides data covering important topics such as world hunger, healthcare and disease. Industry: Medical

Broad Institute

The Broad Institute provides many datasets involving cancer, covering related topics from sequencing to classification. Industry: Medical

Google Finance

Google Finance includes over 40 years of stock market data, updated continuously in real time. Industry: Finance

Berkeley DeepDrive

Created by the University of California, Berkeley, Berkeley DeepDrive includes more than 100,000 video clips spanning different geographic locations, environments, and weather conditions. These clips are annotated with object bounding boxes, lane markings, and various forms of segmentation. This dataset can be used to help train self-driving cars. Industry: Automotive

Level5

Level5 was created by ride-sharing company Lyft. The dataset includes raw camera and LiDAR sensor data captured by numerous autonomous vehicles in specific geographic areas. This dataset is annotated with 3D bounding boxes of specific target objects. Industry: Automotive

USDA Open Data Catalog

The USDA Open Data Catalog includes data captured by the United States Department of Agriculture. Topics range from measuring productivity in U.S. agriculture to estimating the cost of foodborne illness. Industry: Agriculture

Fashion-MNIST

Fashion-MNIST includes a training set of 60,000 images and a test set of 10,000 images of fashion products, divided into 10 categories. This data is useful for product categorization projects. Industry: Retail

E-Commerce Search Relevance

The E-Commerce Search Relevance dataset includes links to various products, the rank of each product on the results page, the search query that produced the results, and other relevant attributes. The data comes from the top 5 English-language e-commerce websites. Industry: Retail

To find industry datasets not mentioned here, search the data repositories listed above using the appropriate industry tag.

Expert Insights from Chief Data Scientist Monchu Chen

Dataset Selection Considerations

When starting a new project, it's best not to rush to acquire an existing dataset right away. Take a step back and think carefully about the user needs your application or service must fulfill. Sometimes the same product design can be achieved with different AI-driven capabilities. The solution you settle on may mean choosing among vastly different ML models that vary in development cost and in the training data they require. When you're ready to move forward, the following tips for selecting publicly available datasets can help you start model development even if you don't have a dedicated budget to collect the data yourself.

Select a subset of the dataset

When choosing a dataset, don't be intimidated by the complexity of the entire dataset. Sometimes, you can extract a subset of the overall dataset, which may be just what your ML project needs.
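As an illustration of extracting a subset, the sketch below keeps only the labels a project cares about and caps the number of examples per label. The record layout (`id`, `label`), the label names, and the cap are hypothetical; a fixed seed is used so the same subset is drawn on every run.

```python
import random
from collections import defaultdict


def select_subset(examples, keep_labels, max_per_label, seed=0):
    """Keep only `keep_labels`, with at most `max_per_label` examples each."""
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    by_label = defaultdict(list)
    for example in examples:
        if example["label"] in keep_labels:
            by_label[example["label"]].append(example)

    subset = []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)  # sample at random instead of taking the first rows
        subset.extend(items[:max_per_label])
    return subset


# Hypothetical toy dataset standing in for a much larger one.
labels = ["cat", "dog", "frog", "cat", "dog", "cat", "frog", "cat"]
examples = [{"id": i, "label": label} for i, label in enumerate(labels)]

subset = select_subset(examples, keep_labels={"cat", "dog"}, max_per_label=2)
print(subset)
```

Capping each label separately keeps a frequent class from dominating the subset, which matters when the full dataset is heavily imbalanced.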

Combining multiple datasets

Sometimes the dataset you choose may not exactly match the data needed to develop your model. You might consider combining multiple datasets (or subsets) to build a training set that better covers the full range of use cases you are dealing with.
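Combining datasets usually means reconciling different label schemes and dropping overlapping examples. The following sketch shows one possible approach under simple assumptions: each source is a list of (text, label) pairs, each source has its own hypothetical mapping onto a shared label set, and duplicates are detected by case-insensitive text match.

```python
def combine(datasets, label_maps):
    """Merge several (text, label) datasets into one shared label scheme."""
    seen = set()
    combined = []
    for name, rows in datasets.items():
        mapping = label_maps[name]
        for text, label in rows:
            key = text.strip().lower()  # naive duplicate key; an assumption
            if label not in mapping or key in seen:
                continue  # skip labels with no shared equivalent, and repeats
            seen.add(key)
            combined.append(
                {"text": text, "label": mapping[label], "source": name}
            )
    return combined


# Hypothetical sources with different label conventions.
datasets = {
    "star_reviews": [("Great food", 5), ("Awful", 1)],
    "tagged_reviews": [("great food ", "pos"), ("Fine", "neu"), ("Bad", "neg")],
}
label_maps = {
    "star_reviews": {5: "positive", 1: "negative"},
    "tagged_reviews": {"pos": "positive", "neg": "negative"},
}

combined = combine(datasets, label_maps)
print(combined)
```

Recording the `source` of each row makes it possible to evaluate the model per source later and to remove one dataset if its license or quality turns out to be a problem.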

Existing APIs

Many datasets come with APIs or libraries to facilitate data access and transformation. This can save you valuable time initially.

Existing example projects

You can also look for people who have worked on projects using popular datasets and published their work in repositories such as GitHub. Use their source code, models, or even pre-trained models as a basis, or simply as a reference when selecting data.

License issues

Just like software, there are different types of licenses for datasets. Some licenses may require you to share your work on that particular dataset. Others may restrict your application to non-commercial use only. A general strategy is to separate the code from the dataset as much as possible. The best way to ensure safety is to seek legal advice before selecting a dataset for use in an application.

Short-term and long-term considerations

When making a short-term decision (such as choosing your first dataset), it's best to consider its long-term impact. Looking at the big picture, when you later need to transition from a public dataset to one you curate yourself, you may find that a choice that seemed sub-optimal at the outset saves you a lot of time, effort, and budget.

Origin blog.csdn.net/Appen_China/article/details/132324665