A look at the 20 Newsgroups dataset

The 20 Newsgroups dataset is one of the internationally used standard datasets for research in text classification, text mining, and information retrieval.

The dataset contains about 20,000 newsgroup documents, divided roughly evenly among 20 newsgroups, each covering a different topic.

Some of the newsgroups cover closely related topics (e.g. comp.sys.ibm.pc.hardware / comp.sys.mac.hardware), while others are completely unrelated (e.g. misc.forsale / soc.religion.christian).

The 20 Newsgroups dataset is available in three versions:

The first version, 19997, is the original, unmodified version: 20news-19997.tar.gz - the original 20 Newsgroups dataset

The second version, bydate, is sorted by date and split into a training set (60%) and a test set (40%). It contains no duplicate documents and no newsgroup-identifying headers (Newsgroups, Path, Followup-To, Date): 20news-bydate.tar.gz - sorted by date; duplicates and newsgroup-identifying headers removed (18,846 documents)

The third version, 18828, contains no duplicate documents and keeps only the "From" and "Subject" headers: 20news-18828.tar.gz - duplicates removed; only the "From" and "Subject" headers kept (18,828 documents)
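
To make the difference between the versions more concrete, here is a minimal sketch of the two kinds of processing described above: keeping only the "From" and "Subject" headers of a post, and splitting posts chronologically 60/40. It is not the procedure the dataset maintainers actually used. The function names, the sample post, and the percentage-based cut are assumptions made for illustration, and only the Python standard library is used.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

KEPT_HEADERS = ("From", "Subject")  # headers kept in the 18828-style version


def keep_from_and_subject(raw_post):
    """Return the post with every header except From and Subject dropped."""
    msg = message_from_string(raw_post)
    kept = [f"{name}: {msg[name]}" for name in KEPT_HEADERS if msg[name] is not None]
    return "\n".join(kept) + "\n\n" + msg.get_payload()


def split_by_date(raw_posts, train_percent=60):
    """Sort posts by their Date header, then cut into (train, test).

    Hypothetical helper: the oldest `train_percent` percent of the posts
    become the training set and the rest become the test set.
    """
    ordered = sorted(
        raw_posts,
        key=lambda post: parsedate_to_datetime(message_from_string(post)["Date"]),
    )
    cut = len(ordered) * train_percent // 100
    return ordered[:cut], ordered[cut:]


# A made-up post in Usenet style, used only to exercise the helpers.
sample_post = (
    "Path: example!news\n"
    "From: alice@example.com\n"
    "Newsgroups: comp.sys.mac.hardware\n"
    "Subject: SCSI question\n"
    "Date: Mon, 5 Apr 1993 10:00:00 GMT\n"
    "\n"
    "Does anyone know which cable I need?\n"
)

print(keep_from_and_subject(sample_post))
```

Splitting by date rather than at random means the test posts are all newer than the training posts, which is closer to how a classifier would be used on future messages.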

In sklearn, the dataset can be loaded in two ways. The first is sklearn.datasets.fetch_20newsgroups, which returns a list of raw texts that can be passed to a text feature extractor with custom parameters (e.g. sklearn.feature_extraction.text.CountVectorizer). The second is sklearn.datasets.fetch_20newsgroups_vectorized, which returns features that have already been extracted, so no feature extractor is needed.
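
The two loading paths can be sketched as follows. This is only an illustration: the helper names, the choice of categories, and the stop_words setting are assumptions, not something the original post prescribes. The imports sit inside the functions so that nothing is downloaded until a function is actually called.

```python
# Two related newsgroups mentioned earlier; any of the 20 names would work.
CATEGORIES = ("comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware")


def load_raw_and_count(categories=CATEGORIES):
    """Way 1: fetch the raw texts, then apply our own feature extractor."""
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer

    train = fetch_20newsgroups(subset="train", categories=list(categories))
    # train.data is a list of strings; train.target holds the integer labels.
    vectorizer = CountVectorizer(stop_words="english")  # custom parameter, our choice
    counts = vectorizer.fit_transform(train.data)  # sparse document-term matrix
    return counts, train.target, vectorizer


def load_prevectorized():
    """Way 2: fetch features that were already extracted."""
    from sklearn.datasets import fetch_20newsgroups_vectorized

    train = fetch_20newsgroups_vectorized(subset="train")
    # train.data is already a sparse feature matrix, so no vectorizer is needed.
    return train.data, train.target
```

Calling load_raw_and_count() or load_prevectorized() downloads the data on first use and caches it locally, so the first call needs network access. The first way is the one to pick when the tokenization or vocabulary needs tuning; the second is convenient when the default features are good enough.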

Origin www.cnblogs.com/wqbin/p/11335037.html