Exploring the 20 Newsgroups dataset with text analysis algorithms

What is NLP

The 20 Newsgroups dataset, as its name suggests, consists of text taken from newsgroup articles. It was collected by Ken Lang and is widely used in experiments on text-based applications driven by machine learning, especially applications built with natural language processing technology.

Natural language processing (NLP) is an important subfield of machine learning that studies the interaction between machines (computers) and human (natural) languages. Natural languages are not limited to speech and conversation; they can also be written or sign languages. The data used in NLP tasks takes many forms, including text from social media, web pages, and medical prescriptions, audio from voicemail, commands for control systems, and even our favorite music or movies. Nowadays, NLP is widely used in everyday life: we can hardly live without machine translation; weather forecast scripts are generated automatically; voice search is very convenient; intelligent question answering systems let us get quick answers to questions (for example, what is the population of Canada?); and speech-to-text technology helps students with special needs.

If a machine can understand language the way humans do, we consider it intelligent. In 1950, the famous mathematician Alan Turing proposed, in an article entitled "Computing Machinery and Intelligence", a criterion for judging whether a machine is intelligent; it was later called the Turing test. Its goal is to examine whether a computer understands language well enough to fool humans into thinking that it is a person. It is probably no surprise that, so far, no computer has passed the Turing test. The 1950s are considered the beginning of the history of natural language processing.

Understanding a language may be hard, but would it be easier to automatically translate text from one language to another? I still remember the first programming lesson of my life: the lab manual contained a very basic machine translation algorithm. A translation algorithm at this level amounts to little more than looking words up in a dictionary and generating the translated text. A more feasible approach is to collect texts that people have already translated and use them to train a computer program. In 1954, in the Georgetown-IBM experiment (a joint project between Georgetown University and IBM), scientists claimed that machine translation would be solved within three to five years.

Unfortunately, there is still no machine translation system that can beat human translators. But the quality of machine translation has improved greatly since the introduction of deep learning methods.

Conversational agents, or chatbots, are another hot topic in NLP. The fact that computers can carry on conversations with people has changed the way business is run. In 2016, Microsoft released Tay, an artificial intelligence chatbot that imitated a teenage girl and could chat with users on Twitter in real time. She learned how to chat from the tweets and comments posted by users. However, after being bombarded with waves of malicious tweets, she automatically picked up their bad language and started posting inappropriate tweets to her feed. She was shut down within 24 hours.

There are also NLP tasks that try to organize knowledge and concepts so that they become easier for computer programs to manipulate. The way we organize and represent concepts is called an ontology. An ontology defines the relationships between concepts; for example, we can use a so-called ontology triple to represent the relationship between two concepts, such as (python, is a, programming language).

Compared with the use cases above, part-of-speech (POS) tagging is a lower-level application of NLP. A part of speech is a grammatical category of a word, such as noun or verb. Part-of-speech tagging tries to determine the appropriate tag for each word in a sentence or a longer document. Table 2-1 gives some examples of English parts of speech.

Table 2-1 Examples of English parts of speech

Part of speech    Examples
Noun              david, machine
Pronoun           them, her
Adjective         awesome, amazing
Verb              read, write
Adverb            very, quite
Preposition       out, at
Conjunction       and, but
Interjection      oh
Article           a, the

2.2 A tour of powerful Python NLP libraries

After introducing several real-world applications of NLP, the rest of this section takes you on a tour of the Python NLP stack. These Python packages handle a wide range of NLP tasks, including the applications mentioned above, as well as sentiment analysis, text classification, and named entity recognition.

The most famous NLP libraries written in Python include the Natural Language Toolkit (NLTK), Gensim, and TextBlob. The scikit-learn library also provides NLP-related functionality. NLTK was originally developed for educational purposes and is now widely used in industry as well. There is a saying that you can't talk about NLP without mentioning NLTK. It is one of the most famous and leading platforms for building NLP applications in Python. We can install it by running the command sudo pip install -U nltk in the terminal.

NLTK comes with more than 50 large, well-structured text datasets, which are called corpora in NLP terminology. A corpus can be used as a dictionary for checking whether a word occurs, and also as a dataset for model learning and training. Some useful and interesting corpora included in NLTK are the Web Text corpus, the Twitter samples, the Shakespeare XML Corpus Sample, the Sentiment Polarity corpus, the Names corpus (which contains common names and which we will use shortly), WordNet, and the Reuters-21578 benchmark corpus. The full list of NLTK corpora can be found on the official website. Whichever corpus resource we want to use, we first have to download it by running the following script in the Python interpreter:

>>> import nltk
>>> nltk.download()

Running the above commands pops up a new window asking which packages or corpora to download, as shown in Figure 2-1.

I strongly recommend installing the whole package, since it contains all the important datasets used in this book and in your future research, and this is what most people do. Once installation is complete, we can immediately explore the Names corpus.

First, import the corpus:

>>> from nltk.corpus import names

Use the following code to output the first 10 names of the list:

>>> print names.words()[:10]
[u'Abagael', u'Abagail', u'Abbe', u'Abbey', u'Abbi', u'Abbie',
u'Abby', u'Abigael', u'Abigail', u'Abigale']

There are 7944 names in total:

>>> print len(names.words())
7944
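
As mentioned above, a corpus can also act as a dictionary for checking whether a word occurs in it. A quick sketch (the two lookups below are just illustrative):

>>> name_set = set(names.words())   # build a set for fast membership checks
>>> 'Abigail' in name_set
True
>>> 'Gensim' in name_set
False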

Other corpora are also interesting and worth exploring.

Besides these easy-to-use and data-rich corpora, more importantly, NLTK is good at conquering the following NLP and text analysis tasks.

  • Tokenization: Tokenization cuts a given sequence of text characters into fragments separated by whitespace; punctuation, numbers, and emoticons are usually removed along the way. The resulting fragments are called tokens and are kept for further processing. In computational linguistics, a token consisting of a single word is called a unigram; two consecutive words from the original text form a bigram; three consecutive words form a trigram; and n consecutive words form an n-gram. An example of tokenization is shown in Figure 2-2. (A quick sketch of tokenization and POS tagging follows this list.)

  • Part-of-speech (POS) tagging: We can use an off-the-shelf tagger, or combine several NLTK taggers to customize the tagging process. Using the built-in function pos_tag directly is very simple, for example pos_tag(input_tokens). Behind the function call, however, a pre-built supervised learning model makes the predictions. The model is trained on a large corpus in which every word has been correctly annotated with its part of speech in advance.
  • Named entity recognition: Given a text sequence, the task of named entity recognition is to locate and identify words or phrases of predefined categories, such as names of persons, companies, and locations. This will be covered in detail in the next chapter.
  • Stemming and lemmatization: Stemming is the process of reverting an inflected or derived word to its root form. For example, machine is the stem of machines, and learning and learned come from learn. Lemmatization is a more cautious version of stemming: it takes the part of speech of the word into account. We will discuss these two text preprocessing techniques in more detail later. For now, let's quickly see how they are implemented in NLTK.
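
Before looking at stemming and lemmatization, here is a quick, minimal sketch of tokenization, n-grams, and POS tagging in NLTK (the sample sentence is arbitrary, and it assumes the required tokenizer and tagger models were downloaded via nltk.download()):

>>> from nltk import word_tokenize, pos_tag
>>> from nltk.util import ngrams
>>> tokens = word_tokenize('I love reading books about machine learning')
>>> print tokens
['I', 'love', 'reading', 'books', 'about', 'machine', 'learning']
>>> print list(ngrams(tokens, 2))[:2]   # the first two bigrams
[('I', 'love'), ('love', 'reading')]
>>> print pos_tag(tokens)               # exact tags depend on the tagger model
[('I', 'PRP'), ('love', 'VBP'), ('reading', 'VBG'), ('books', 'NNS'),
('about', 'IN'), ('machine', 'NN'), ('learning', 'NN')]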

First, import PorterStemmer, one of the three built-in stemming algorithms (the other two are LancasterStemmer and SnowballStemmer), and initialize a stemmer:

>>> from nltk.stem.porter import PorterStemmer
>>> porter_stemmer = PorterStemmer()

Extract the stems of machines and learning:

>>> porter_stemmer.stem('machines')
u'machin'
>>> porter_stemmer.stem('learning')
u'learn'

Note that, when extracting a stem, the stemmer also chops off letters if necessary. For example, machin above has lost the trailing e.

Now, import a lemmatization algorithm based on the built-in WordNet corpus and initialize a lemmatizer:

>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()

Similarly, we lemmatize machines and learning:

>>> lemmatizer.lemmatize('machines')
u'machine'
>>> lemmatizer.lemmatize('learning')
'learning'

Why is the form of learning unchanged after lemmatization? The reason is that the algorithm lemmatizes words as nouns by default.
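
If we tell the lemmatizer to treat the word as a verb (the pos argument accepts WordNet tags such as 'n' for noun and 'v' for verb), learning is reduced as expected:

>>> lemmatizer.lemmatize('learning', pos='v')
u'learn'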

The Gensim library, developed by Radim Rehurek, has gained popularity in recent years. It was originally designed in 2008 to generate a list of articles similar to a given article, which is how it got its name (Gensim stands for generate similar). Radim Rehurek later greatly improved its efficiency and scalability. Again, it can easily be installed from the terminal by running pip install --upgrade gensim. It depends on the NumPy and SciPy libraries, so make sure they are installed before installing Gensim.

Gensim is famous for its powerful semantic and topic modeling algorithms. Topic modeling is a typical text mining task that aims to discover the hidden semantic structure in documents. In plain words, the semantic structure is the distribution of words over the documents. It is clearly an unsupervised learning task: we feed in plain text and let the model find the abstract topics in it.
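
As a minimal sketch of topic modeling with Gensim (the toy documents and the choice of two topics are arbitrary and only for illustration):

>>> from gensim import corpora, models
>>> docs = [['graphics', 'image', 'screen'],
...         ['car', 'engine', 'bumper'],
...         ['graphics', 'color', 'image']]
>>> dictionary = corpora.Dictionary(docs)               # map each word to an integer id
>>> corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words representation
>>> lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary)
>>> lda.print_topics()                                  # lists the top words of each topic

On a real corpus we would, of course, feed in many more documents and tokenize them first.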

In addition to its powerful semantic modeling methods, Gensim also provides the following functionality.

  • Similarity querying: retrieve objects that are similar to a given query object.
  • Word vectorization: a modern way of representing words that preserves their co-occurrence features.
  • Distributed computing: the ability to learn efficiently from millions of documents.

TextBlob is a relatively new library built on top of NLTK. It not only provides easy-to-use built-in functions and methods, but also wraps common tasks, which simplifies NLP and text analysis. TextBlob can be installed by running pip install -U textblob in the terminal.

In addition, TextBlob also has features not currently available in NLTK, such as spell checking and correction, language detection and translation.
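
For example, spell correction takes only a couple of lines (a minimal sketch; the misspelled sentence is arbitrary):

>>> from textblob import TextBlob
>>> blob = TextBlob('I havv goood speling!')
>>> print blob.correct()
I have good spelling!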

Although scikit-learn is discussed last, it is just as important. As mentioned in Chapter 1, scikit-learn is the main library used throughout this book. Fortunately, it provides all the text processing features we need (such as tokenization) along with a wide variety of machine learning functionality. In addition, it has a built-in loader for the 20 Newsgroups dataset.

Now that we know which tools to use and have installed them properly, what about the data?

2.3 Newsgroup data set

For the first project in this book, we use the 20 Newsgroups dataset from scikit-learn. The dataset contains approximately 20,000 articles taken from 20 online newsgroups. A newsgroup is a place on the Internet where people can ask and answer questions about a particular topic. The dataset has already been split into a training set and a test set, based on a specific date.

All documents in the dataset are in English. From the names of newsgroups, you can infer the topics they discuss.

Some of the newsgroups are closely related or even overlapping, such as the five computer newsgroups (comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, and comp.windows.x), while others are not closely related to anything else, such as the baseball newsgroup (rec.sport.baseball). The dataset is labeled: each document consists of text data and a group label, which makes it well suited to supervised learning tasks such as text classification. Supervised learning will be covered in detail in Chapter 4. For now, we focus on unsupervised learning, starting with acquiring the data.

2.4 Getting the data

It is possible to download the dataset manually from the original website or from other online repositories, but many versions of the dataset exist; some are cleaned to a certain degree and some are in the raw format. To avoid confusion, it is best to obtain the dataset in a consistent way. The scikit-learn library provides a utility function that loads the dataset.

Once the dataset is downloaded, scikit-learn automatically caches it, so we don't need to download it again. In most cases, caching the dataset is considered a best practice, especially when the dataset is relatively small. Other Python libraries also provide download utilities, but not all of them implement automatic caching. This is another reason we love scikit-learn.

Before loading the dataset, first import the loader of the dataset:

>>> from sklearn.datasets import fetch_20newsgroups

Then, we use the loader to download the dataset with the default parameters:

>>> groups = fetch_20newsgroups()

We can also specify one or more topics, a particular section of the dataset (the training set, the test set, or both), or load just a subset of the data. All the parameters and possible values of the loader function are listed in Table 2-2.

Table 2-2 Loader function parameters

Parameter              Default value          Example values                      Description
subset                 train                  train, test, all                    Load the training set, the test set, or the whole dataset
data_home              ~/scikit_learn_data    ~/myfiles                           Directory where the dataset is stored
categories             None                   ['alt.atheism', 'sci.space']        List of newsgroup names to load; all newsgroups are loaded by default
shuffle                True                   True, False                         Boolean indicating whether to shuffle the order of the data
random_state           42                     743                                 Integer random seed used when shuffling the data
remove                 ()                     ('headers', 'footers', 'quotes')    Tuple indicating which parts of each article (headers, footers, quotes) to strip; nothing is stripped by default
download_if_missing    True                   True, False                         Boolean indicating whether to download the data if it is not found locally
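
For example, the following call loads only the training portion of two newsgroups (a minimal sketch; the chosen categories are arbitrary):

>>> data_train = fetch_20newsgroups(subset='train',
...     categories=['alt.atheism', 'sci.space'], random_state=42)
>>> data_train.target_names
['alt.atheism', 'sci.space']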

2.5 Thinking about features

No matter how the 20 Newsgroups dataset was downloaded, once it is available we can access it in the program through the data object groups. The data object is a dictionary-like structure of key-value pairs, and its keys are as follows.

>>> groups.keys()
dict_keys(['description', 'target_names', 'target', 'filenames',
  'DESCR', 'data'])

The key target_names gives the names of the 20 newsgroups:

>>> groups['target_names']
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x',
'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball',
'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space',
'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast',
'talk.politics.misc', 'talk.religion.misc']

The key target contains the topic labels of all documents (that is, which newsgroup each document belongs to), represented as integers:

>>> groups.target
array([7, 4, 4, ..., 3, 1, 8])

How many distinct integers are there in the output above? We can find out with NumPy's unique function:

>>> import numpy as np
>>> np.unique(groups.target)
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

There are 20 numbers from 0 to 19, representing the 20 topics. Let's look at the topic number of the first document and the corresponding newsgroup name:

>>> groups.target[0]
7
>>> groups.target_names[groups.target[0]]
'rec.autos'

As we can see from the output above, the first document comes from the rec.autos newsgroup, which is numbered 7. Reading the article, it is not hard to see that it is about cars: the word car actually appears several times, and words like bumper also seem very car-related. However, words such as doors are not necessarily about cars; they may also appear in posts about home improvement or other topics. By the way, it makes sense not to distinguish between door, doors, or the capitalized form of the same word (such as Doors). We need case sensitivity only in rare situations, for example, when trying to find out whether a document is about the band The Doors or about the more common concept of doors (made of wood).

We can boldly draw a conclusion: if we want to know whether a document comes from the rec.autos newsgroup, the presence or absence of words such as car, doors, and bumper is a very useful feature. Presence or absence can be represented by a Boolean variable, but we can also look at the counts of particular words. For instance, car appears multiple times in the document; perhaps the more often such a word appears, the more likely the document is to be about cars. The counts of particular words also depend on document length: longer texts usually contain more words, so we have to offset the effect of document length. For example, the first two documents have different lengths:

>>> len(groups.data[0])
721
>>> len(groups.data[1])
858

So, should we take document length into account? In my opinion, even if the number of pages in this book changed (within a reasonable range), the book would still be about Python and machine learning; therefore, the length of an article is probably not a significant feature.

What about word sequences? Phrases such as front bumper, sports car, and engine specs seem to strongly indicate that a document is about cars. However, car occurs more often than sports car, and the number of distinct bigrams is far larger than the number of distinct unigrams; besides, bigrams such as this car and looking car carry essentially the same amount of information for newsgroup classification. Clearly, some words carry very little information. Words that occur frequently in documents of every category, such as a, the, and are, are called stop words, and we should ignore them. We are only interested in whether specific words occur, how many times they occur, or some related measure; we do not care about the order in which the words occur. We can therefore view a text as a bag containing a number of words; this is called the bag-of-words model. Although it is a very basic model, it works quite well in practice. We could define more complex models that take word order and parts of speech into account, but such models are computationally more expensive and much harder to implement. The basic bag-of-words model satisfies most needs. Don't believe it? We can try plotting the distribution of unigrams and see how well the bag-of-words model works.
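
Before doing that, here is a minimal sketch of what the bag-of-words representation looks like for two toy strings (the strings are arbitrary):

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> docs = ['car bumper sports car', 'front bumper engine']
>>> cv = CountVectorizer()
>>> bow = cv.fit_transform(docs)
>>> print(cv.get_feature_names())
['bumper', 'car', 'engine', 'front', 'sports']
>>> print(bow.toarray())
[[1 2 0 0 1]
 [1 0 1 1 0]]

Each document becomes a row of word counts over the vocabulary, and the order of the words is discarded; that is exactly the bag-of-words model.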

2.6 Visualization

Visualization displays the data so that we can get a rough understanding of its structure, discover potential problems, and determine whether it contains irregularities that require special handling. Visualization techniques are of great benefit.

In a multi-topic or multi-category classification task, it is important to know how the topics are distributed. A uniform class distribution is the easiest to deal with, because there are no under-represented or over-represented categories. However, datasets often have a skewed distribution, with one or several categories dominating. We use the seaborn package to compute the histogram of the categories and the matplotlib package to plot it; both packages can be installed with pip. We plot the distribution of the categories with the following code:

>>> import seaborn as sns
>>> sns.distplot(groups.target)
<matplotlib.axes._subplots.AxesSubplot object at 0x108ada6a0>
>>> import matplotlib.pyplot as plt
>>> plt.show()

The output result of the above code is shown in Figure 2-3.

As shown in Figure 2-3, the categories are distributed (approximately) uniformly, so that is one less thing to worry about.

The dimensionality of the text data in the 20 Newsgroups dataset is very high, since each feature requires one dimension. If we use word counts as features, there are as many dimensions as there are features of interest. For counting unigrams, we can use the CountVectorizer class; its parameters are described in Table 2-3.

Table 2-3 CountVectorizer parameters

Constructor parameter    Default value    Example values                        Description
ngram_range              (1, 1)           (1, 2), (2, 2)                        Lower and upper bounds of the range of n-grams to extract from the input text
stop_words               None             'english', ['a', 'the', 'of'], None   Which stop word list to use; if None, no stop words are filtered out
lowercase                True             True, False                           Whether to convert all characters to lowercase before extracting features
max_features             None             None, 500                             If not None, keep only this many of the most frequent features; None means no limit
binary                   False            True, False                           If True, all non-zero counts are set to 1

We use the following code to draw a histogram of the word counts of 500 high-frequency words:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> from sklearn.datasets import fetch_20newsgroups
 
>>> cv = CountVectorizer(stop_words="english", max_features=500)
>>> groups = fetch_20newsgroups()
>>> transformed = cv.fit_transform(groups.data)
>>> print(cv.get_feature_names())
 
>>> sns.distplot(np.log(transformed.toarray().sum(axis=0)))
>>> plt.xlabel('Log Count')
>>> plt.ylabel('Frequency')
>>> plt.title('Distribution Plot of 500 Word Counts')
>>> plt.show()

The output result is shown in Figure 2-4.

The list of 500 high-frequency words is as follows:

    ['00', '000', '0d', '0t', '10', '100', '11', '12', '13', '14', '145',
'15', '16', '17', '18', '19', '1993', '1d9', '20', '21', '22', '23', '24',
'25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '34u', '35',
'40', '45', '50', '55', '80', '92', '93', '__', '___', 'a86', 'able', 'ac',
'access', 'actually', 'address', 'ago', 'agree', 'al', 'american',
'andrew', 'answer', 'anybody', 'apple', 'application', 'apr', 'april',
'area', 'argument', 'armenian', 'armenians', 'article', 'ask', 'asked',
'att', 'au', 'available', 'away', 'ax', 'b8f', 'bad', 'based', 'believe',
'berkeley', 'best', 'better', 'bible', 'big', 'bike', 'bit', 'black',
'board', 'body', 'book', 'box', 'buy', 'ca', 'california', 'called',
'came', 'canada', 'car', 'card', 'care', 'case', 'cause', 'cc', 'center',
'certain', 'certainly', 'change', 'check', 'children', 'chip', 'christ',
'christian', 'christians', 'church', 'city', 'claim', 'clinton', 'clipper',
'cmu', 'code', 'college', 'color', 'colorado', 'columbia', 'com', 'come',
'comes', 'company', 'computer', 'consider', 'contact', 'control', 'copy',
'correct', 'cost', 'country', 'couple', 'course', 'cs', 'current', 'cwru',
'data', 'dave', 'david', 'day', 'days', 'db', 'deal', 'death',
'department', 'dept', 'did', 'didn', 'difference', 'different', 'disk',
'display', 'distribution', 'division', 'dod', 'does', 'doesn', 'doing',
'don', 'dos', 'drive', 'driver', 'drivers', 'earth', 'edu', 'email',
'encryption', 'end', 'engineering', 'especially', 'evidence', 'exactly',
'example', 'experience', 'fact', 'faith', 'faq', 'far', 'fast', 'fax',
'feel', 'file', 'files', 'following', 'free', 'ftp', 'g9v', 'game',
'games', 'general', 'getting', 'given', 'gmt', 'god', 'going', 'good',
'got', 'gov', 'government', 'graphics', 'great', 'group', 'groups',
'guess', 'gun', 'guns', 'hand', 'hard', 'hardware', 'having', 'health',
'heard', 'hell', 'help', 'hi', 'high', 'history', 'hockey', 'home', 'hope',
'host', 'house', 'hp', 'human', 'ibm', 'idea', 'image', 'important',
'include', 'including', 'info', 'information', 'instead', 'institute',
'interested', 'internet', 'isn', 'israel', 'israeli', 'issue', 'james',
'jesus', 'jewish', 'jews', 'jim', 'john', 'just', 'keith', 'key', 'keys',
'keywords', 'kind', 'know', 'known', 'large', 'later', 'law', 'left',
'let', 'level', 'life', 'like', 'likely', 'line', 'lines', 'list',
'little', 'live', 'll', 'local', 'long', 'look', 'looking', 'lot', 'love',
'low', 'ma', 'mac', 'machine', 'mail', 'major', 'make', 'makes', 'making',
'man', 'mark', 'matter', 'max', 'maybe', 'mean', 'means', 'memory', 'men',
'message', 'michael', 'mike', 'mind', 'mit', 'money', 'mr', 'ms', 'na',
'nasa', 'national', 'need', 'net', 'netcom', 'network', 'new', 'news',
'newsreader', 'nice', 'nntp', 'non', 'note', 'number', 'numbers', 'office',
'oh', 'ohio', 'old', 'open', 'opinions', 'order', 'org', 'organization',
'original', 'output', 'package', 'paul', 'pay', 'pc', 'people', 'period',
'person', 'phone', 'pitt', 'pl', 'place', 'play', 'players', 'point',
'points', 'police', 'possible', 'post', 'posting', 'power', 'president',
'press', 'pretty', 'price', 'private', 'probably', 'problem', 'problems',
'program', 'programs', 'provide', 'pub', 'public', 'question', 'questions',
'quite', 'read', 'reading', 'real', 'really', 'reason', 'religion',
'remember', 'reply', 'research', 'right', 'rights', 'robert', 'run',
'running', 'said', 'sale', 'san', 'saw', 'say', 'saying', 'says', 'school',
'science', 'screen', 'scsi', 'season', 'second', 'security', 'seen',
'send', 'sense', 'server', 'service', 'services', 'set', 'similar',
'simple', 'simply', 'single', 'size', 'small', 'software', 'sorry', 'sort',
'sound', 'source', 'space', 'speed', 'st', 'standard', 'start', 'started',
'state', 'states', 'steve', 'stop', 'stuff', 'subject', 'summary', 'sun',
'support', 'sure', 'systems', 'talk', 'talking', 'team', 'technology',
'tell', 'test', 'text', 'thanks', 'thing', 'things', 'think', 'thought',
'time', 'times', 'today', 'told', 'took', 'toronto', 'tried', 'true',
'truth', 'try', 'trying', 'turkish', 'type', 'uiuc', 'uk', 'understand',
'university', 'unix', 'unless', 'usa', 'use', 'used', 'user', 'using',
'usually', 'uucp', 've', 'version', 'video', 'view', 'virginia', 'vs',
'want', 'wanted', 'war', 'washington', 'way', 'went', 'white', 'win',
'window', 'windows', 'won', 'word', 'words', 'work', 'working', 'works',
'world', 'wouldn', 'write', 'writes', 'wrong', 'wrote', 'year', 'years',
'yes', 'york']

In this first attempt, we obtained the list of the 500 most frequent words shown above. Our goal is to find the most indicative features, but the list is not perfect yet. Can we improve it? Yes, by using the data preprocessing techniques described in the next section.

This article is excerpted from "Python Machine Learning Practical Combat"

Python Machine Learning Practical Combat

1. Before explaining the principles of each algorithm and implementing it with the scikit-learn library, the author walks you through concrete calculations in a few examples and has you implement the algorithm by hand;
2. The code in the book is coherent and can be pasted directly into a Jupyter Notebook and run, which is very helpful for beginners;
3. The examples in the book are easy to understand and cover a variety of application scenarios, such as news topic classification, spam filtering, online advertising click-through rate prediction, and stock price prediction, explained in a lively and engaging way;
4. Source code is provided.

The book begins with an introduction to the Python language and how to set up a machine learning development environment. The following chapters introduce important related concepts such as data analysis, data preprocessing, feature extraction, data visualization, clustering, classification, regression, and model performance measurement. The book contains multiple project cases involving several important and interesting machine learning algorithms, and guides readers to implement their own models from scratch. After finishing this book, you will understand the machine learning ecosystem as a whole and master the practice and application of machine learning techniques.
With the help of this book, you will learn to use the powerful yet simple Python language to tackle data science problems and build your own solutions.

This book covers the following topics:
· Extracting, processing, and exploring data with Python;
· Visualizing multi-dimensional data and extracting useful features with Python;
· Digging into data analysis techniques to correctly predict trends;
· Implementing machine learning classification and regression algorithms from scratch in Python;
· Analyzing and predicting stock prices with Yahoo Finance data;
· Evaluating and optimizing the performance of machine learning models;
· Solving practical problems with machine learning and Python.
