"2018: skymind.ai issued a very comprehensive set of open source data"

This is a very comprehensive collection of open-source datasets. Are you sure you don't want it?

Recently, skymind.ai published a very comprehensive list of open-source datasets, covering biometrics, natural images, deep learning image data, and more. Heart of the Machine has summarized it below (links included).

 

Recently added data sets

 

  • Open source biometric data: http://openbiometrics.org/

 

  • Google AudioSet: an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos.

  • Address: https://research.google.com/audioset/

 

  • Uber 2B trip data: gradual rollout of access to ride data for 2 billion trips.

  • Address: https://movement.uber.com/cities

 

  • Yelp Open Dataset: a subset of Yelp's business, review, and user data, intended for NLP work.

  • Address: https://www.yelp.com/dataset

 

  • Core50: a new dataset and benchmark for continuous object recognition.

  • Address: https://vlomonaco.github.io/core50/

 

  • Kaggle datasets: https://www.kaggle.com/datasets

 

  • Data Portal: http://dataportals.org/

 

  • Open Data Monitor: https://opendatamonitor.eu/

 

  • Quandl Data Portal: https://www.quandl.com/

 

  • Mut1ny head/face segmentation dataset: http://www.mut1ny.com/face-headsegmentation-dataset

 

  • Awesome public datasets on GitHub: https://www.kdnuggets.com/2015/04/awesome-public-datasets-github.html

 

  • Head CT scan dataset: the CQ500 dataset of 491 scans.

  • Address: http://headctstudy.qure.ai/

 

Natural image datasets

 

  • MNIST: handwritten digit images and the most common sanity check. The images are 28x28, centered, black-and-white handwritten digits. It is an easy task: the fact that a method works on MNIST does not mean it works in general.

  • Address: http://yann.lecun.com/exdb/mnist/

 

  • CIFAR10 / CIFAR100: 32x32 color images in 10 / 100 classes. No longer commonly used, but still an interesting sanity check.

  • Address: http://www.cs.utoronto.ca/~kriz/cifar.html

 

  • Caltech 101: pictures of objects in 101 categories.

  • Address: http://www.vision.caltech.edu/Image_Datasets/Caltech101/

 

  • Caltech 256: pictures of objects in 256 categories.

  • Address: http://www.vision.caltech.edu/Image_Datasets/Caltech256/

 

  • STL-10 dataset: an image recognition dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms. It is a modified version of CIFAR-10.

  • Address: http://cs.stanford.edu/~acoates/stl10/

 

  • The Street View House Numbers (SVHN): house numbers from Google Street View. Think of it as a recurrent MNIST in the wild.

  • Address: http://ufldl.stanford.edu/housenumbers/

 

  • NORB: binocular images of toy figurines under various illumination and poses.

  • Address: http://www.cs.nyu.edu/~ylclab/data/norb-v1.0/

 

  • Pascal VOC: generic image segmentation / classification. Not very useful for building real-world image annotation, but great for baselines.

  • Address: http://pascallin.ecs.soton.ac.uk/challenges/VOC/

 

  • LabelMe: a large dataset of annotated images.

  • Address: http://labelme.csail.mit.edu/Release3.0/browserTools/php/dataset.php

 

  • ImageNet: the de facto image dataset for new algorithms. Many image API companies return labels from their REST interfaces that are suspiciously close to the 1,000-category WordNet hierarchy from ImageNet.

  • Address: http://image-net.org/

 

  • LSUN: scene understanding with many ancillary tasks (room layout estimation, saliency prediction, etc.) and an associated competition.

  • Address: http://lsun.cs.princeton.edu/2016/

 

  • MS COCO: generic image understanding / captioning, with an associated competition.

  • Address: http://mscoco.org/

 

  • COIL 20: different objects imaged at every angle of a 360-degree rotation.

  • Address: http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php

 

  • COIL100: different objects imaged at every angle of a 360-degree rotation.

  • Address: http://www1.cs.columbia.edu/CAVE/software/softlib/coil-100.php

 

  • Google Open Images: a collection of 9 million URLs to images, available under Creative Commons licenses, that have been annotated with labels spanning more than 6,000 categories.

  • Address: https://research.googleblog.com/2016/09/introducing-open-images-dataset.html
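A note on reading the MNIST files listed above: they are distributed in the IDX binary format, where an image file starts with four big-endian 32-bit integers (magic number 2051, image count, rows, columns) followed by one unsigned byte per pixel. The following is a minimal sketch of a parser for that layout, not an official loader. The function names `parse_idx_images` and `make_fake_idx` are hypothetical, and the example runs on synthetic bytes rather than the real download.

```python
import struct


def parse_idx_images(raw: bytes):
    """Parse IDX3 image bytes (the layout MNIST uses) into nested lists.

    Returns a list of images; each image is a list of rows of pixel values.
    """
    # Header: magic number, image count, rows, cols (big-endian uint32 each).
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != 2051:
        raise ValueError(f"unexpected magic number {magic}")
    pixels = raw[16:]
    size = rows * cols
    if len(pixels) != count * size:
        raise ValueError("pixel data length does not match header")
    return [
        [list(pixels[i * size + r * cols: i * size + (r + 1) * cols])
         for r in range(rows)]
        for i in range(count)
    ]


def make_fake_idx(count: int, rows: int, cols: int) -> bytes:
    """Build synthetic IDX3 bytes for testing (hypothetical helper)."""
    header = struct.pack(">IIII", 2051, count, rows, cols)
    body = bytes(i % 256 for i in range(count * rows * cols))
    return header + body


# Two synthetic 28x28 "images", matching the MNIST image size.
images = parse_idx_images(make_fake_idx(2, 28, 28))
print(len(images), len(images[0]), len(images[0][0]))
```

For the real files, decompress the downloaded `.gz` archive first (for example with the standard `gzip` module) and pass the resulting bytes to the parser.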

 

Geospatial data

 

  • OpenStreetMap: vector data for the entire planet under a free license. It includes (an older version of) the US Census Bureau's data.

  • Address: http://wiki.openstreetmap.org/wiki/Planet.osm

 

  • Landsat8: satellite imagery of the entire surface of the Earth, updated every few weeks.

  • Address: https://landsat.usgs.gov/landsat-8

 

  • NEXRAD: Doppler radar scans of atmospheric conditions in the US.

  • Address: https://www.ncdc.noaa.gov/data-access/radar-data/nexrad

 

-------- I am the deep learning image dividing line --------

 

Artificial datasets

 

  • Arcade Universe: an artificial dataset generator that produces images containing arcade game sprites, such as Tetris pentomino/tetromino objects. The generator is based on O. Breleux's bugland dataset generator.

  • Address: https://github.com/caglar/Arcade-Universe

 

  • A collection of datasets inspired by the Baby AI School.

  • Address: http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/BabyAISchool

 

  • Baby AI Shapes Dataset: distinguishing between three simple shapes.

  • Address: http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/BabyAIShapesDatasets

 

  • Baby AI Image And Question Dataset: a question-image-answer dataset.

  • Address: http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/BabyAIImageAndQuestionDatasets

 

  • Deep Vs Shallow Comparison ICML2007: datasets generated for an empirical evaluation of deep architectures.

  • Address: http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DeepVsShallowComparisonICML2007

 

  • MnistVariations: introducing controlled variations into MNIST.

  • Address: http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/MnistVariations

 

  • RectanglesData: distinguishing between wide and tall rectangles.

  • Address: http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/RectanglesData

 

  • ConvexNonConvex: distinguishing between convex and non-convex shapes.

  • Address: http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/ConvexNonConvex

 

  • BackgroundCorrelation: controlling the degree of correlation in noisy MNIST backgrounds.

  • Address: http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/BackgroundCorrelation

 

Face datasets

 

  • Labelled Faces in the Wild: 13,000 cropped face regions (detected with Viola-Jones), each labeled with a name identifier. A subset of the people in the dataset have two images, so it is commonly used to train face matching systems.

  • Address: http://vis-www.cs.umass.edu/lfw/

 

  • UMD Faces: an annotated dataset of 367,920 faces of 8,501 subjects.

  • Address: http://www.umdfaces.io/

 

  • CASIA WebFace: a face dataset of 453,453 images of more than 10,575 identities, obtained after face detection. Requires some quality filtering.

  • Address: http://www.cbsr.ia.ac.cn/english/CASIA-WebFace-Database.html

 

  • MS-Celeb-1M: 1 million images of celebrities from around the world. Requires some filtering to get the best results with deep networks.

  • Address: https://www.microsoft.com/en-us/research/project/ms-celeb-1m-challenge-recognizing-one-million-celebrities-real-world/

 

  • Olivetti: a few images each of several different people.

  • Address: http://www.cs.nyu.edu/~roweis/data.html

 

  • Multi-Pie: the CMU Multi-PIE Face database.

  • Address: http://www.multipie.org/

 

  • Face-in-Action: http://www.flintbox.com/public/project/5486/

 

  • JACFEE: Japanese and Caucasian Facial Expressions of Emotion.

  • Address: http://www.humintell.com/jacfee/

 

  • FERET: the Facial Recognition Technology database.

  • Address: http://www.itl.nist.gov/iad/humanid/feret/feret_master.html

 

  • mmifacedb: MMI facial expression database.

  • Address: http://www.mmifacedb.com/

 

  • IndianFaceDatabase: http://vis-www.cs.umass.edu/~vidit/IndianFaceDatabase/

 

  • Yale face database: http://vision.ucsd.edu/content/yale-face-database

 

  • Yale face database B: http://vision.ucsd.edu/~leekc/ExtYaleDatabase/ExtYaleB.html

 

  • Mut1ny head/face segmentation dataset: more than 16K face/head images with pixel-level segmentation.

  • Address: http://www.mut1ny.com/face-headsegmentation-dataset

 

-------- I am the deep learning video dividing line --------

 

Video data sets

 

  • Youtube-8M: a large and diverse labeled video dataset for video understanding research.

  • Address: https://research.googleblog.com/2016/09/announcing-youtube-8m-large-and-diverse.html

 

Text data sets

 

  • 20 newsgroups: a classification task that maps word occurrences to a newsgroup ID. One of the classic datasets for text classification, generally used as a benchmark for pure classification or to validate an IR / indexing algorithm.

  • Address: http://qwone.com/~jason/20Newsgroups/

 

  • Reuters news dataset: an (older) purely classification-based dataset with text from the newswire. Commonly used in tutorials.

  • Address: https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection

 

  • Penn Treebank: used for next-word or next-character prediction.

  • Address: http://www.cis.upenn.edu/~treebank/

 

  • UCI's Spambase: an (older) classic spam email dataset from the well-known UCI Machine Learning Repository. Because of how the dataset was curated, it can serve as an interesting baseline for learning personalized spam filtering.

  • Address: https://archive.ics.uci.edu/ml/datasets/Spambase

 

  • Broadcast News: a large text dataset, typically used for next-word prediction.

  • Address: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97S44

 

  • Text classification datasets: from Zhang et al., 2015. A set of eight datasets for text classification, which serve as the benchmark for new text classification baselines. Sample sizes range from 120K to 3.6M, and the problems range from binary to 14-class. The datasets come from DBPedia, Amazon, Yelp, Yahoo! and AG.

  • Address: https://drive.google.com/drive/u/0/folders/0Bz8a_Dbh9Qhbfll6bVpmNUtUcFdjYmF2SEpmZUZUcVNiMUw1TWN6RDV3a0JHT3kxLVhVR2M

 

  • WikiText: a large language modeling corpus built from high-quality Wikipedia articles, curated by Salesforce MetaMind.

  • Address: http://metamind.io/research/the-wikitext-long-term-dependency-language-modeling-dataset/

 

  • SQuAD: the Stanford Question Answering Dataset, a broadly useful question answering and reading comprehension dataset in which the answer to every question is a segment of text.

  • Address: https://rajpurkar.github.io/SQuAD-explorer/

 

  • Billion Words dataset: a large, general-purpose language modeling dataset. Often used to train distributed word representations such as word2vec.

  • Address: http://www.statmt.org/lm-benchmark/

 

  • Common Crawl: a petabyte-scale crawl of the web, most commonly used for learning word embeddings. Available for free on Amazon S3. Because it is a crawl of the web, it can also be used as a network dataset.

  • Address: http://commoncrawl.org/the-data/

 

  • Google Books Ngrams: successive words from Google Books. It offers a simple way to explore when a word first came into wide use.

  • Address: https://aws.amazon.com/datasets/google-books-ngrams/

 

  • Yelp Open Dataset: a subset of Yelp's business, review, and user data, intended for NLP work.

  • Address: https://www.yelp.com/dataset

 

-------- I am the deep learning text dividing line --------

 

Question answering datasets

 

  • Maluuba NewsQA dataset: 120,000 question-answer pairs on CNN news articles.

  • Address: https://datasets.maluuba.com/NewsQA

 

  • Quora Question Pairs: the first dataset released by Quora, containing duplicate / semantic similarity labels.

  • Address: https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs

 

  • CMU Q/A dataset: manually generated factoid question/answer pairs from Wikipedia articles, with difficulty ratings.

  • Address: http://www.cs.cmu.edu/~ark/QA-data/

 

  • Maluuba goal-oriented dialogue: a procedural conversation dataset in which the dialogue aims to accomplish a task or reach a decision. Often used for chatbots.

  • Address: https://datasets.maluuba.com/Frames

 

  • bAbi: synthetic reading comprehension and question answering datasets from Facebook AI Research (FAIR).

  • Address: https://research.fb.com/projects/babi/

 

  • The Children's Book Test: a baseline of (question + context, answer) pairs extracted from children's books available through Project Gutenberg. Useful for question answering (reading comprehension) and factoid lookup.

  • Address: http://www.thespermwhale.com/jaseweston/babi/CBTest.tgz

 

Sentiment datasets

 

  • Multi-domain sentiment analysis dataset: an older academic dataset.

  • Address: http://www.cs.jhu.edu/~mdredze/datasets/sentiment/

 

  • IMDB: an older, relatively small dataset for binary sentiment classification. It has fallen out of favor for benchmarks in the literature in favor of larger datasets.

  • Address: http://ai.stanford.edu/~amaas/data/sentiment/

 

  • Stanford Sentiment Treebank: a standard sentiment dataset with fine-grained sentiment annotations at every node of each sentence's parse tree.

  • Address: http://nlp.stanford.edu/sentiment/code.html

 

Recommendation and ranking systems

 

  • Movielens: movie rating datasets from the Movielens website, available in various sizes.

  • Address: https://grouplens.org/datasets/movielens/

 

  • Million Song Dataset: a large, metadata-rich, open-source dataset on Kaggle that is well suited to experimenting with hybrid recommendation systems.

  • Address: https://www.kaggle.com/c/msdchallenge

 

  • Last.fm: a music recommendation dataset with access to the underlying social network and other metadata that can be useful for hybrid systems.

  • Address: http://grouplens.org/datasets/hetrec-2011/

 

  • Book-Crossing dataset: from the Book-Crossing community. Contains 1,149,780 ratings of 271,379 books provided by 278,858 users.

  • Address: http://www.informatik.uni-freiburg.de/~cziegler/BX/

 

  • Jester: 4.1 million continuous ratings (from -10 to 10) of 100 jokes from 73,421 users.

  • Address: http://www.ieor.berkeley.edu/~goldberg/jester-data/

 

  • Netflix Prize: Netflix released an anonymized version of its movie rating dataset, consisting of 100 million ratings of 17,770 movies by 480,000 users. It was the first major Kaggle-style data challenge. Because privacy issues arose, it is now only available unofficially.

  • Address: http://www.netflixprize.com/
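The user, item, and rating counts quoted for Book-Crossing, Jester, and Netflix Prize give a quick sense of how sparse each rating matrix is. As a rough sketch (the `density` helper is a hypothetical name, and it assumes each user rates an item at most once), the fraction of filled user-item cells is the number of ratings divided by users times items:

```python
def density(num_ratings: int, num_users: int, num_items: int) -> float:
    """Fraction of user-item pairs that have a rating."""
    return num_ratings / (num_users * num_items)


# (ratings, users, items), taken from the entries in this list.
counts = {
    "Book-Crossing": (1_149_780, 278_858, 271_379),
    "Jester": (4_100_000, 73_421, 100),
    "Netflix Prize": (100_000_000, 480_000, 17_770),
}

densities = {name: density(*c) for name, c in counts.items()}

for name, d in densities.items():
    print(f"{name}: {d:.4%} of user-item pairs are rated")
```

Jester comes out dense (more than half of all cells are filled), Netflix Prize at roughly 1%, and Book-Crossing far sparser still, which matters when choosing between neighborhood methods and matrix factorization.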

 

-------- I am the deep learning graph dividing line --------

 

Networks and graphs

 

  • Amazon Co-Purchasing: data crawled from the "users who bought this also bought..." section of Amazon, along with Amazon review data for the related products. Good for experimenting with recommendation systems on networks.

  • Address: http://snap.stanford.edu/data/#amazon

 

  • Friendster social network dataset: before it became a gaming website, Friendster released anonymized data in the form of friend lists for 103,750,348 users.

  • Address: https://archive.org/details/friendster-dataset-201107

 

Speech datasets

 

  • 2000 HUB5 English: English-only speech data, used most recently in the Deep Speech paper from Baidu.

  • Address: https://catalog.ldc.upenn.edu/LDC2002T43

 

  • LibriSpeech: an audiobook dataset of text and speech. It consists of nearly 500 hours of clean speech from various audiobooks read by multiple speakers, organized by book chapter and containing both the text and the speech.

  • Address: http://www.openslr.org/12/

 

  • VoxForge: a clean speech dataset of accented English. Useful when you need robustness to different accents or intonations.

  • Address: http://www.voxforge.org/

 

  • TIMIT: an English speech recognition dataset.

  • Address: https://catalog.ldc.upenn.edu/LDC93S1

 

  • CHIME: a noisy speech recognition challenge dataset. It contains real, simulated, and clean recordings. The real portion consists of nearly 9,000 recordings of four speakers in four noisy locations; the simulated portion is generated by combining multiple environments with speech utterances; the clean portion consists of noise-free recordings.

  • Address: http://spandh.dcs.shef.ac.uk/chime_challenge/data.html

 

  • TED-LIUM: audio transcription of TED talks. Contains 1,495 TED talk recordings along with full text transcriptions of those recordings.

  • Address: http://www-lium.univ-lemans.fr/en/content/ted-lium-corpus

 

-------- I am the deep learning audio dividing line --------

 

Symbolic music datasets

 

  • Piano-midi.de: classical piano pieces.

  • Address: http://www.piano-midi.de/

 

  • Nottingham: over 1,000 folk tunes.

  • Address: http://abc.sourceforge.net/NMD/

 

  • MuseData: an electronic library of classical music scores.

  • Address: http://musedata.stanford.edu/

 

  • JSB Chorales: a set of four-part harmonized chorales.

  • Address: http://www.jsbchorales.net/index.shtml

 

Other datasets

 

  • CMU Motion Capture Database: http://mocap.cs.cmu.edu/

 

  • Brodatz dataset: texture modeling.

  • Address: http://www.ux.uis.no/~tranden/brodatz.html

 

  • 300TB of high-quality data from the Large Hadron Collider (LHC) at CERN.

  • Address: http://opendata.cern.ch/search?ln=en&p=Run2011A+AND+collection:CMS-Primary-Datasets+OR+collection:CMS-Simulated-Datasets+OR+collection:CMS-Derived-Datasets

 

  • New York Taxi dataset: NYC taxi data obtained through a FOIA request, which led to privacy issues.

  • Address: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

 

  • Uber FOIL dataset: data on 4.5M pickups in New York City, obtained through an Uber FOIL request.

  • Address: https://github.com/fivethirtyeight/uber-tlc-foil-response

 

  • Criteo click stream dataset: a large Internet advertising dataset from a major EU retargeter.

  • Address: http://research.criteo.com/outreach/

 

Health & biological data

 

  • EU Surveillance Atlas of Infectious Diseases: http://ecdc.europa.eu/en/data-tools/atlas/Pages/atlas.aspx

 

  • Merck Molecular Activity Challenge: http://www.kaggle.com/c/MerckActivity/data

 

  • Musk dataset: describes a set of molecules occurring in different conformations. Each molecule is either a musk or a non-musk, and one of its conformations determines this property.

  • Address: https://archive.ics.uci.edu/ml/datasets/Musk+(Version+2)

 

Government & Statistics

 

  • Data USA: the most comprehensive visualization of US public data.

  • Address: http://datausa.io/

 

  • EU gender statistics database: http://eige.europa.eu/gender-statistics

 

  • The Netherlands' National Georegister: http://www.nationaalgeoregister.nl/geonetwork/srv/dut/search#fast=index&from=1&to=50&any_OR_geokeyword_OR_title_OR_keyword=landinrichting*&relation=within

 

  • UNDP projects: http://open.undp.org/#2016


Origin www.cnblogs.com/cx2016/p/12059299.html