A Summary of Essential Machine Learning Resources and Datasets

  1. Python Programming Specification
    Concise Python Programming Specification
    https://blog.csdn.net/gzlaiyonghao/article/details/2834883
    Python Language Specification
    http://zh-google-styleguide.readthedocs.io/en/latest/google-python-styleguide/python_language_rules/
    Python Style Guide
    http://en-google-styleguide.readthedocs.io/en/latest/google-python-styleguide/python_style_rules/
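As a quick illustration of the conventions these guides cover (module constants, naming, and docstring layout), here is a minimal sketch; the function and class names are invented for illustration:

```python
"""Module docstring: illustrates Google-style naming and docstrings."""

_MAX_RETRIES = 3  # Module-level constants use CAPS_WITH_UNDERSCORES.


def count_positive(numbers):
    """Returns the number of strictly positive values in `numbers`.

    Args:
        numbers: An iterable of ints or floats.

    Returns:
        The count of values greater than zero.
    """
    return sum(1 for n in numbers if n > 0)


class RetryPolicy:
    """Classes use CapWords; methods and attributes use lower_with_under."""

    def __init__(self, max_retries=_MAX_RETRIES):
        self.max_retries = max_retries
```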


2. Essential for machine learning: common sites for finding large datasets


|UCI Machine Learning Repository
http://archive.ics.uci.edu/ml/index.php
The best-known dataset repository; many papers draw their data from it.
|AWS Public Datasets
https://aws.amazon.com/cn/datasets/
Datasets provided by Amazon Web Services, covering astronomy, biology, chemistry, weather, economics and other fields.
|YAHOO Webscope datasets
https://webscope.sandbox.yahoo.com/
Datasets provided by Yahoo, including image, language, ranking, classification and other multi-domain data.
|Kaggle datasets
https://www.kaggle.com/datasets
The dataset library of the Kaggle competition platform, with plenty of interesting industry data
from companies such as Uber, Netflix (the Netflix Prize) and McDonald's.


Computer Vision
| ImageNet
http://www.image-net.org/
The most famous dataset for image processing; you can search for whatever kind of image your project needs
for object recognition, localization, classification, scene analysis and other tasks. It contains 14,197,122 images
of various sizes, about 140 GB in total.
|MNIST
http://yann.lecun.com/exdb/mnist/
A dataset that virtually every newly proposed machine learning algorithm is benchmarked on. MNIST is a
handwritten-digit database with 60,000 training samples and 10,000 test samples, drawn as a
subset of the NIST database.
|The CIFAR-10 dataset
https://www.cs.toronto.edu/~kriz/cifar.html
60,000 32×32 color images in 10 classes.
|Google Open Images
https://github.com/ejlb/google-open-image-download
Google Open Images is a large-scale image-annotation dataset released by Google, containing annotations
for 7,800 categories across 9 million images.
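MNIST is distributed as gzipped IDX files. A minimal sketch of parsing a decompressed image file follows; the layout (a big-endian magic number 2051, then image/row/column counts, then one unsigned byte per pixel) is the documented IDX format, while the helper name is invented here:

```python
import struct


def parse_idx_images(data):
    """Parses the raw bytes of an MNIST IDX image file into nested lists.

    Layout: 4-byte big-endian magic (2051 for images), number of images,
    rows, columns, then rows*cols unsigned bytes per image.
    """
    magic, n, rows, cols = struct.unpack(">IIII", data[:16])
    if magic != 2051:
        raise ValueError("not an IDX image file")
    pixels = data[16:]
    size = rows * cols
    return [list(pixels[i * size:(i + 1) * size]) for i in range(n)]
```

For the real files, pass the decompressed contents of e.g. `train-images-idx3-ubyte` (read in binary mode) to this function.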


Natural Language Processing
| Text Classification Dataset
https://drive.google.com/drive/folders/0Bz8a_Dbh9Qhbfll6bVpmNUtUcFdjYmF2SEpmZUZUcVNiMUw1TWN6RDV3a0JHT3kxLVhVR2M
A large collection of text-classification datasets drawn from DBPedia, Amazon, Yelp, Yahoo!, Sogou and AG.
Sample sizes range from 120K to 3.6M, and the number of classes ranges from 2 to 14.
|WikiText
https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset
Large language modeling corpus from Wikipedia articles.
|Billion Words
http://www.statmt.org/lm-benchmark/
Often used to train distributed word representations such as word2vec or GloVe.
|Stanford Sentiment Treebank
https://nlp.stanford.edu/sentiment/code.html
Datasets for sentiment analysis.
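The text-classification collection above is commonly distributed as CSV files with the class index in the first column followed by title and description; a minimal loading sketch under that assumption (check the README of the specific dataset, as the exact column layout may differ):

```python
import csv
import io


def load_label_text_csv(fileobj):
    """Reads (label, title, description) rows from a text-classification CSV.

    Returns a pair (labels, texts), where each text is the title and
    description joined with a space.
    """
    labels, texts = [], []
    for row in csv.reader(fileobj):
        labels.append(int(row[0]))
        texts.append(row[1] + " " + row[2])
    return labels, texts
```

Example usage with an in-memory file: `load_label_text_csv(io.StringIO('"3","Some title","Some description"\n'))`.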


Speech Recognition
| 2000 HUB5 English
https://catalog.ldc.upenn.edu/LDC2002T43
English speech data.
|CHIME
http://spandh.dcs.shef.ac.uk/chime_challenge/data.html
Speech recognition dataset recorded in noisy environments.
|TED-LIUM
http://www-lium.univ-lemans.fr/en/content/ted-lium-corpus
Speech dataset built from TED talks, with full-text transcripts.


Other categories
| UCR Time Series
http://www.cs.ucr.edu/~eamonn/time_series_data/
"Imagnet" in the time series world, you must run when you publish articles.
|Million Song Dataset
https://labrosa.ee.columbia.edu/millionsong/
Likely useful for programmers working on music recommendation or classification.
|Netflix recommendation system data
http://dataju.cn/Dataju/web/datasetInstanceDetail/32
A movie-rating dataset containing ratings from 480,000 randomly selected Netflix customers
for 17,000 movies, covering October 1998 to November 2005. Each movie is rated on a 5-point
scale (1-5 stars), and customer information has been anonymized.
|Udacity self-driving data set
https://github.com/udacity/self-driving-car/
The self-driving car dataset from Udacity's open self-driving course, which aims to build an open-
source self-driving car. Distributed as multiple compressed binary files, totaling about 100 GB.
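The standard baseline on the UCR time-series archive is 1-nearest-neighbor classification under Euclidean distance (or DTW). A minimal Euclidean sketch, assuming equal-length series; the function name is invented here:

```python
import math


def one_nn_classify(train, query):
    """Labels `query` with the class of its nearest training series
    under Euclidean distance -- the classic UCR archive baseline.

    `train` is a list of (label, series) pairs; all series must have
    the same length as `query`.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    best_label, _ = min(train, key=lambda item: dist(item[1], query))
    return best_label
```

In practice the UCR archive also z-normalizes each series before computing distances, which this sketch omits for brevity.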
