Open data sources

Excerpt from: https://mp.weixin.qq.com/s?__biz=MzI2MjM2MDEzNQ==&mid=2247489072&idx=1&sn=2ac46ef358be4eef43f3de8670086746&chksm=ea4d0b18dd3a820ef82122648806c8516970e8e7323efb5475aa0db1da1752d22ee8c38ec604&mpshare=1&scene=23&srcid=042625ULmfK6xU66wcmkCf1G#rd

Overview

If I could sum up the essence of learning data science in one sentence, it would be: The best way to learn data science is to apply data science. If you are a beginner, you will greatly improve your ability after completing a new project. If you are an experienced data science expert, you already know the value contained here.

This article provides a list of websites and resources whose data you can use to complete your own data projects or even create your own products.

1. How to use these resources?

There is no limit to how these data sources can be used; the applications are bounded only by your creativity and practical needs. The easiest way to use them is to build data projects and publish them on a website. Not only will this improve your data and visualization skills, but it will also improve your structured thinking.

On the other hand, if you are considering or already building data-based products, these datasets can extend your product's functionality by providing additional or new input data. So go ahead, work on these projects, and share them with the wider world to showcase your data prowess! We've divided these data sources into sections to help you categorize them by application.

We start with simple, general, easy-to-handle datasets, then move to large, industry-relevant datasets. We then provide links to datasets for specific purposes: text mining, image classification, recommendation engines, etc. This should give you a fairly complete list of data sources. If you can think of any applications for these datasets, or know of a popular resource we missed, please share with us in the comments below. (Some links may require a VPN to access.)

2. Start with simple and generic datasets

1. data.gov ( https://www.data.gov/ )  This is the home of the U.S. government's open data. The site lists over 190,000 datasets, covering climate, education, energy, finance and more.

2. data.gov.in ( https://data.gov.in/ )  This is where the government of India publishes its open data. Browsing the data on various industries, climate, healthcare, etc. can give you some inspiration. Depending on the country you live in, you can also look for a similar site run by your own government.

3. World Bank ( http://data.worldbank.org/ )  Open data from the World Bank. The platform provides several tools such as the Open Data Catalog, World Development Indicators, education indicators, etc.

4. RBI ( https://rbi.org.in/Scripts/Statistics.aspx )  Data provided by the Reserve Bank of India. This includes several indicators on money market operations, balance of payments, use of banking, and some banking products.

5. FiveThirtyEight datasets ( https://github.com/fivethirtyeight/data )  FiveThirtyEight, also known as 538, is a blog focusing on poll analysis, politics, economics and sports. This repository holds the data behind its articles. Each dataset includes the data, a dictionary explaining the data, and a link to the corresponding FiveThirtyEight article. If you want to learn how to create data stories, look no further than this.
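
Whichever of the sources above you start with, a sensible first step is to profile the file before analyzing it. The following is a minimal sketch in Python, assuming the dataset is a CSV with a header row. The sample data, the column names and the `profile_csv` helper are all invented for illustration and do not come from any of the sites listed.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a downloaded CSV file.
# Column names and values are invented for illustration only.
SAMPLE_CSV = """region,year,score
A,2012,10
A,2016,
B,2012,7
B,2016,9
"""


def profile_csv(text):
    """Count rows, missing values per column, and distinct values per column."""
    reader = csv.DictReader(io.StringIO(text))
    columns = list(reader.fieldnames)
    missing = Counter()
    distinct = {name: set() for name in columns}
    rows = 0
    for row in reader:
        rows += 1
        for name in columns:
            value = row.get(name)
            if value is None or value.strip() == "":
                missing[name] += 1  # empty cell counts as missing
            else:
                distinct[name].add(value)
    return {
        "rows": rows,
        "missing": {name: missing[name] for name in columns},
        "distinct": {name: len(distinct[name]) for name in columns},
    }


profile = profile_csv(SAMPLE_CSV)
print(profile)
```

For a real file you would pass the contents of the downloaded CSV instead of `SAMPLE_CSV`. Comparing the result against the dataset's data dictionary is a quick way to spot columns that need cleaning.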

3. Large datasets

1. Amazon Web Services (AWS) datasets ( https://aws.amazon.com/cn/datasets/ )  Amazon provides some large datasets that can be used on its platform or on a local computer. You can also use EC2, or Hadoop through EMR, to analyze the data in the cloud. Popular datasets on Amazon include the full Enron email dataset, Google Books n-grams, the NASA NEX datasets, the Million Song Dataset, etc.

 

2. Google datasets ( https://cloud.google.com/bigquery/public-data/ )  Google provides some datasets as part of its BigQuery tool. These include data from public GitHub repositories and all Hacker News stories and comments.

3. YouTube labeled video dataset ( https://research.google.com/youtube8m/ )  A few months ago, the Google research team released a labeled dataset built from YouTube, consisting of 8 million YouTube video IDs with associated labels drawn from 4,800 visual entities. It comes with pre-computed, state-of-the-art visual features extracted from billions of frames.

4. Predictive Modeling and Machine Learning Datasets

1. UCI Machine Learning Repository ( https://archive.ics.uci.edu/ml/datasets.html )  The UCI Machine Learning Repository is clearly the most famous data repository. It is usually the first place to go if you are looking for datasets related to machine learning. The repository covers a wide variety of datasets, from popular ones like Iris and Titanic to more recent contributions like air quality and GPS trajectories. It contains over 350 datasets labeled by attributes such as domain and type of problem (classification/regression). You can use these labels as filters to find the dataset you need.

 

2. Kaggle ( https://www.kaggle.com/datasets )  Kaggle offers a platform where people can contribute datasets and other community members can vote on them and run kernels/scripts against them. In total it hosts over 350 datasets, more than 200 of which are featured datasets. While some of the datasets also appear elsewhere, I've seen some interesting datasets on the platform that I haven't found anywhere else. Besides the new datasets, another benefit is that you can see scripts and questions from community members in the same interface.

3. Analytics Vidhya ( https://datahack.analyticsvidhya.com/contest/all/ )  You can participate in our practice problems and hackathon problems and download their datasets. The datasets are based on real industry problems and are relatively small, since they are meant for hackathons lasting 2 to 7 days.

4. Quandl ( https://www.quandl.com/ )  Quandl provides financial, economic and alternative data from different sources via its website, its API, or direct integration with some tools. Its datasets are divided into open and premium: all open datasets are free, but premium datasets require payment. Good-quality free datasets can still be found on the platform by searching. For example, stock exchange data from India is free.

5. Past KDD Cups ( http://www.kdd.org/kdd-cup )  KDD Cup is an annual data mining and knowledge discovery competition organized by the ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD).

6. Driven Data ( https://www.drivendata.org/ )  Driven Data finds real-world problems where data science can be applied to bring about positive social impact. They then run online modeling competitions in which data scientists develop the best models to solve these problems.

5. Image classification dataset

1. The MNIST Database ( http://yann.lecun.com/exdb/mnist/ )  The most popular image recognition dataset, made up of handwritten digits. It includes a training set of 60,000 examples and a test set of 10,000 examples. This is usually the first dataset people try for image recognition.

2. Chars74K ( http://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/ )  This is the next step up once you have mastered handwritten digits. The dataset covers character recognition in natural images. It contains 74,000 images, hence the name.

3. Frontal Face Images ( http://vasc.ri.cmu.edu//idb/html/face/frontal_images/index.html )  If you have completed the first two items and are able to recognize digits and characters, this is the next level of challenge in image recognition: frontal face images. The images were collected by CMU and MIT and are arranged in four folders.

4. ImageNet ( http://image-net.org/ )  Now it's time to build something generic. ImageNet is a database of images organized according to the WordNet hierarchy (currently only nouns). Each node of the hierarchy is depicted by hundreds of images; currently, the collection averages over 500 images per node (and growing).
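
As a starting point for the MNIST entry above, here is a hedged Python sketch of reading the IDX image format used by the MNIST files. It assumes the file has already been decompressed (the downloads are gzip-compressed, so `gzip.open(path, "rb").read()` is one way to get the bytes) and that it follows the documented layout: a big-endian header with magic number 2051, the image count, the row count and the column count, followed by one unsigned byte per pixel. The `parse_idx_images` helper and the tiny synthetic file are made up for illustration.

```python
import struct


def parse_idx_images(data):
    """Parse decompressed IDX3 image bytes into a list of images.

    Each image is a list of rows; each row is a list of 0-255 pixel values.
    """
    # Header: four big-endian unsigned 32-bit integers.
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    if magic != 2051:
        raise ValueError(f"unexpected magic number {magic}, expected 2051")
    pixels = data[16:]
    if len(pixels) != count * rows * cols:
        raise ValueError("pixel data length does not match the header")
    images = []
    for i in range(count):
        start = i * rows * cols
        image = [
            list(pixels[start + r * cols : start + (r + 1) * cols])
            for r in range(rows)
        ]
        images.append(image)
    return images


# Synthetic stand-in for a real file: two 2x2 "images" (real MNIST is 28x28).
fake_file = struct.pack(">IIII", 2051, 2, 2, 2) + bytes([0, 255, 128, 64, 1, 2, 3, 4])
images = parse_idx_images(fake_file)
print(len(images), images[0])
```

The label files use the same idea with a shorter header (magic number 2049 and a count, then one byte per label), so a matching label parser is a small variation on this function.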

6. Text classification dataset

1. Spam – Non Spam ( http://www.esp.uem.es/jmgomez/smsspamcorpus/ )  Telling spam SMS messages apart from legitimate ones is an interesting problem. You need to build a classifier that labels each text message.

2. Twitter Sentiment Analysis ( http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/ )  This dataset contains 1,578,627 classified tweets, with each row marked 1 for positive sentiment and 0 for negative sentiment. The data is in turn based on a Kaggle competition and Nick Sanders' analysis.

3. Movie Review Data ( http://www.cs.cornell.edu/People/pabo/movie-review-data/ )  This website provides a collection of movie review documents labeled with their overall sentiment polarity (positive or negative) or subjective rating (e.g., "two and a half stars"), along with sentences labeled with their subjectivity status (subjective or objective) or polarity.
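
To give a feel for what "building a classifier" on datasets like these involves, below is a minimal sketch of a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing, written in plain Python. It is only one of many possible approaches, not the method used by any of the sources above. The four training messages and the `spam`/`ham` label names are invented for illustration; a real project would load labeled messages from one of the listed corpora.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy messages, invented for illustration only.
TRAIN = [
    ("win a free prize now", "spam"),
    ("free entry claim your prize", "spam"),
    ("are we still meeting for lunch", "ham"),
    ("call me when you are free", "ham"),
]


def train_naive_bayes(examples):
    """Count words per label, documents per label, and the vocabulary."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior for the label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_naive_bayes(TRAIN)
print(predict(model, "claim your free prize"))
print(predict(model, "are you free for lunch"))
```

The same structure carries over to sentiment data by swapping the labels for positive and negative. With real data you would also hold out part of the corpus to measure accuracy.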

7. Recommendation engine dataset

1. MovieLens ( https://grouplens.org/ )  MovieLens is a website that helps people find movies. It has thousands of registered users. GroupLens runs online experiments on it covering automatic content recommendation, recommendation interfaces, tag-based recommendation pages, etc. The resulting datasets are available for download and can be used to build your own recommender systems.

2. Jester ( http://www.ieor.berkeley.edu/~goldberg/jester-data/ )  Rating data from an online joke recommendation system.
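
As an illustration of the kind of recommender these rating datasets support, here is a minimal user-based collaborative filtering sketch in plain Python. It scores each item the target user has not rated by summing, over the other users, their rating multiplied by their cosine similarity to the target user. The users, item names and ratings are invented for illustration, and this simple scoring rule is an assumption of the sketch, not the approach used by MovieLens or Jester.

```python
import math

# Hypothetical ratings (user -> {item: rating on a 1-5 scale}),
# invented for illustration only.
RATINGS = {
    "ann": {"movie_a": 5, "movie_b": 4, "movie_c": 1},
    "bob": {"movie_a": 4, "movie_b": 5, "movie_c": 2, "movie_d": 5},
    "cat": {"movie_a": 1, "movie_b": 2, "movie_c": 5, "movie_e": 5},
}


def cosine(a, b):
    """Cosine similarity between two users over the items both have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top_n=1):
    """Rank items the user has not rated by similarity-weighted rating sum."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item in ratings[user]:
                continue  # only recommend unseen items
            scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(recommend("ann", RATINGS))
```

Here `ann` rates the shared movies much like `bob` and quite unlike `cat`, so the item only `bob` has rated ranks first for her. On a real dataset you would typically also normalize ratings per user and evaluate on held-out ratings.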

8. Dataset Websites from Various Sources

1. KDNuggets (http://www.kdnuggets.com/datasets/index.html)  The datasets page of KDNuggets has always been a reference for people searching for datasets. The list is comprehensive, but some sources no longer provide datasets. Therefore, careful selection of datasets and sources is required.

2. Awesome Public Datasets ( https://github.com/caesar0301/awesome-public-datasets )  A GitHub repository that contains a comprehensive list of datasets, neatly categorized by domain, which is very useful. However, the repository itself gives no description of the datasets, which would have made it even more useful.

3. Reddit Datasets Subreddit ( https://www.reddit.com/r/datasets/ )  Since this is a community-driven forum, it can be a bit messy compared to the previous two sources. However, you can sort the posts by popularity/votes to see the most popular datasets. Plus, it has some interesting datasets and discussions.

9. Final words

We hope this list of resources will be very useful for those who want to work on data projects. This is definitely a gold mine, so put it to good use!

 

