Getting started with social media sentiment analysis: a look at two open-source Python packages

Learn the basics of natural language processing and explore two useful Python packages.

Natural language processing (NLP) is a type of machine learning that addresses spoken and written language and the computer-aided analysis of those languages. We experience numerous NLP innovations in everyday life, from writing assistance and suggestions to real-time speech translation and interpretation.

This article examines one specific area of NLP: sentiment analysis, where the focus is to determine whether input language is positive, negative, or neutral in nature. This part explains the background of NLP and sentiment analysis and explores two open-source Python packages.

When learning sentiment analysis, a general understanding of NLP is helpful. This article will not dig into the mathematical details; instead, our goal is to clarify the key concepts of NLP that are essential for incorporating these methods into your own solutions in practice.

Natural language and text data

A reasonable place to start is a definition: what is natural language? It is the means by which we humans communicate with one another, and its primary modalities are speech and writing. We could take this further and consider only textual communication. After all, living in the era of ubiquitous Siri, Alexa, and the like, we know that speech is just a group of computations away from text.

Data prospects and challenges

Considering only text data, what can we do with language and text? First, language, particularly English, is full of exceptions to its rules, plurality of meaning, and contextual differences that can confuse even a human interpreter, let alone a machine. In elementary school, we learn articles and punctuation, and by speaking our native language we gain the intuition to spot words that carry little unique meaning. Articles such as "a," "the," and "or" are known in NLP as stop words, because traditionally an NLP algorithm's search stops when it reaches these words in a sequence.
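
For instance, here is a minimal sketch of stop word filtering, assuming scikit-learn's built-in English stop word list (any stop word list would work the same way; the sentence is made up):

```python
# A minimal sketch of stop word filtering, using scikit-learn's
# built-in English stop word list.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

sentence = "the movie was a delight or so the critics said"
tokens = sentence.split()

# Keep only the words that carry unique meaning.
content_words = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
print(content_words)  # articles like "the", "a", and "or" are dropped
```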

Since our goal is to automatically classify text by sentiment, we need a computational way of processing text data. Therefore, we must consider how text data is represented to a machine. As we all know, the rules for using and interpreting language are complicated, and input text can vary significantly in size and structure. We need to convert text data into numeric data, the form preferred by machines and mathematics. This transformation falls under the category of feature extraction.

After extracting a numeric representation of the input text data, one refinement might be: given an input body of text, determine a set of quantitative statistics for the parts of speech listed above and classify documents based on them. For example, a glut of adverbs might suggest an angry writer, or excessive use of stop words might help identify a term paper padded with filler. Admittedly, this may not have much bearing on our goal of sentiment analysis.

Bag of words

When you evaluate whether a text statement is positive or negative, what contextual clues do you use to assess its polarity (i.e., whether the text carries positive, negative, or neutral sentiment)? One way is connotative adjectives: something called "disgusting" is viewed as negative, but if the same thing were called "beautiful," you would judge it as positive. Colloquialisms, by definition, give a sense of familiarity and are often positive, whereas curse words could be a sign of hostility. Text data can also include emoticons, which carry inherent sentiments.

Understanding the polar influence of individual words on a text provides a basis for the bag-of-words (BoW) model. It considers a set of words, or vocabulary, and extracts measures of the presence or absence of those words in the input text. The vocabulary is formed from text whose polarity is known, referred to as labeled training data. Features are extracted from this labeled data set, and the relationships between the features and the labels associated with the data are then analyzed.

"Bag of words" is the name explains its use: a single word that is irrespective of the spatial position or contexts. Glossary usually appear all the words from the training set of built, after training tend to be trimmed. If you do not clean up before training stop words, stop word because of its high frequency and low context is removed. Word rarely used can also be deleted, because of the lack of information for the general input examples provided.

However, it is important to note that you can (and should) further consider how often a word appears within an individual instance of training data; this is called term frequency (TF). You should also consider the counts of a word across all instances of the input data; typically, words that appear with low frequency across all documents are more notable, which is captured by inverse document frequency (IDF). These metrics are bound to come up in other articles and software packages on this topic, so an awareness of them will help.
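
Here is a sketch of how these two measures are commonly combined in practice, again assuming scikit-learn (TfidfVectorizer multiplies TF by IDF; the documents are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are wonderful",
]

# Words common to every document are down-weighted by IDF, while
# rare, distinctive words receive higher weights.
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)
print(dict(zip(tfidf.get_feature_names_out(), weights.toarray()[0])))
```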

Bag of words is useful in a number of document classification applications; for sentiment analysis, however, things can be gamed when its lack of contextual awareness is exploited. Consider the following sentences:

  • We do not like this war.
  • I hate rainy days; the good thing is that today is sunny.
  • This is not a matter of life and death.

The sentiment of these phrases is tricky even for human interpreters, and by focusing strictly on instances of individual vocabulary words, it is difficult for a machine interpreter as well.

Groupings of words, called n-grams, can also be considered in NLP. A bigram considers groups of two adjacent words instead of (or in addition to) the single bag of words. This should alleviate situations such as "do not like" above, but because contextual meaning is still lost, it remains open to gaming. Furthermore, in the second sentence above, the sentiment context of the second half could be understood as negating the first half. Thus, spatial locality of contextual clues is also lost in this approach. Complicating matters from a practical standpoint is the sparsity of the features extracted from a given input text. For a complete, large vocabulary, a count is maintained for each word, which can be viewed as an integer vector. Most document vectors have a large number of zero counts, which adds unnecessary space and time complexity to operations. While a number of clever approaches have been proposed to reduce this complexity, it remains an issue.
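
To make the bigram idea concrete, here is a small sketch, again assuming scikit-learn and using the made-up sentences above:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "we do not like this war",
    "i hate rainy days",
]

# ngram_range=(1, 2) keeps single words *and* adjacent pairs, so
# negations such as "do not" and "not like" become features.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # includes 'do not', 'not like'
print(X.toarray())  # mostly zeros already; real vocabularies are far sparser
```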

Word embeddings

Word embeddings are a distributed representation that allows words with similar meanings to have similar representations. They are based on real-valued vectors that associate words with the company they keep; the focus is on the way words are used, rather than simply on their presence or absence. In addition, a huge practical advantage of word embeddings is their focus on dense vectors: by moving away from a word-count model with correspondingly many zero-valued vector elements, word embeddings provide a more efficient computational paradigm in both storage and time.

Below are two good word embedding approaches.

Word2vec

The first is Word2vec, which was developed at Google. You will likely see this embedding method mentioned as you go deeper into your study of NLP and sentiment analysis. It uses either a continuous bag of words (CBOW) or a continuous skip-gram model. In CBOW, a word's context is learned during training from the words that surround it. The continuous skip-gram model instead learns the words that tend to surround a given word. Although this is probably more than you will need to tackle, if you are ever faced with generating your own word embeddings, the authors of Word2vec advocate the CBOW method for speed and for assessing frequent words, while the skip-gram approach is better suited for embeddings where rare words matter more.
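
Here is a minimal training sketch, assuming gensim 4.x and a made-up toy corpus (real embeddings require vastly more data):

```python
from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens (made up for illustration).
sentences = [
    ["we", "love", "sunny", "days"],
    ["we", "hate", "rainy", "days"],
    ["sunny", "days", "make", "us", "happy"],
]

# sg=0 selects CBOW; sg=1 would select continuous skip-gram.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(model.wv["sunny"])               # a dense 50-dimensional vector
print(model.wv.most_similar("sunny"))  # neighbors by cosine similarity
```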

GloVe

The second is Global Vectors for Word Representation (GloVe), developed at Stanford. It is an extension of the Word2vec approach that tries to combine the information gained through classical global text statistics with the local contextual information determined by Word2vec. In practice, GloVe outperforms Word2vec in some applications and falls short of it in others. Ultimately, the target data set for your word embedding will determine which method is optimal, so it is good to know that both exist and to understand their high-level mechanics, as you are likely to encounter them.
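
You rarely need to train such vectors yourself. Here is a sketch of loading pretrained GloVe vectors via gensim's downloader API, assuming network access (the named model is one of the sets the gensim project hosts and is a sizable download):

```python
import gensim.downloader as api

# Download (once) and load 100-dimensional GloVe vectors trained
# on Wikipedia and Gigaword.
glove = api.load("glove-wiki-gigaword-100")

print(glove.most_similar("beautiful"))     # semantically close words
print(glove.similarity("hate", "detest"))  # cosine similarity score
```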

Creating and using word embeddings

Finally, it is useful to know how to obtain word embeddings. In part 2, you will see that we stand on the shoulders of giants by leveraging the substantial work of others in the community. This is one way to acquire a word embedding: using an existing trained and proven model. Indeed, myriad models exist for English and other languages, and it is possible that one of them does what your application needs out of the box!

If not, the opposite end of the spectrum in terms of development work is to train your own standalone model without regard to your application. In essence, you would acquire substantial amounts of labeled training data and likely use one of the approaches above to train a model. Even then, you are still only at the point of understanding your input text data; you then need to develop a model specific to your application (for example, analyzing the sentiment valence of software version-control messages), which, in turn, requires its own time and effort.

You could also train a word embedding on data specific to your application; while this could reduce time and effort, the embedding would be application-specific, which would reduce its reusability.

Available tooling options

Given the large amounts of time and computing power required, you may be wondering how you could ever arrive at a workable solution. Indeed, the complexity of developing a reliable model can seem daunting. However, there is good news: many proven models, tools, and software libraries already exist that can provide most of what we need. We will focus on Python, which conveniently provides a wealth of tooling for these applications.

SpaCy

SpaCy provides a number of language models for parsing input text data and extracting features. It is highly optimized and touted as the fastest library of its kind. Best of all, it is open source! SpaCy performs tokenization, part-of-speech classification, and dependency annotation. It contains word embedding models for performing this and other feature extraction operations, covering more than 46 languages. You will see how it can be used for text analysis and feature extraction in the second article of this series.
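
As a quick taste of the API, here is a minimal sketch, assuming the small English model has been installed with `python -m spacy download en_core_web_sm`:

```python
import spacy

# Load the small English pipeline.
nlp = spacy.load("en_core_web_sm")

doc = nlp("We do not like this war.")
for token in doc:
    # Each token carries a part-of-speech tag and a dependency label.
    print(token.text, token.pos_, token.dep_)
```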

vaderSentiment

The vaderSentiment package provides a measure of positive, negative, and neutral sentiment. Its models were developed and tuned specifically for social media text data. VADER was trained on a thorough set of human-labeled data that includes common emoticons, UTF-8 encoded emojis, and colloquial terms and abbreviations (e.g., meh, lol, sux).

For given input text data, vaderSentiment returns a triple of polarity score percentages. It also provides a single score, referred to as vaderSentiment's compound metric. This is a real-valued measurement in the range [-1, 1], where sentiment is considered positive for values greater than 0.05, negative for values less than -0.05, and neutral otherwise.
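
Here is a minimal usage sketch, assuming the package is installed with `pip install vaderSentiment`:

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("I hate rainy days, but today is sunny lol")

# 'neg', 'neu', and 'pos' are the polarity proportions; 'compound'
# is the single score in [-1, 1] described above.
print(scores)
```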

Source: blog.csdn.net/zhizhun88/article/details/90707131