Text analysis based on big data

People commonly summarize big data by its 4V characteristics: large volume, variety, high velocity of production, and low value density. A large number of technologies and tools have therefore emerged to advance the field. To make good use of big data, effectively extracting useful features from it is just as important: tools and platforms must rely on sound data models and algorithms to bring out the data's value.

Here we take text analysis as a case study to examine the role and impact of data processing techniques in the big data field. We first discuss three text representations: the bag-of-words model, the TF-IDF weighted representation, and feature hashing.

 "Bag of words, also called "bag of words". In information retrieval, the Bag of words model assumes that for a text, ignore its word order, grammar, and syntax, and regard it as just a word set or a combination of words. The text The appearance of each word in it is independent and does not depend on whether other words appear, or when the author of this article chooses a word at any position, it is not affected by the previous sentence and is chosen independently.”

 

Its processing is as follows:

· Word segmentation: split the text into a collection of words.

· Stop-word removal: delete stop words such as "the", "and", "but", etc.

· Stemming: reduce each word to its basic form, i.e., normalize it.

· Vectorization: represent the processed words as vectors. A binary vector is the simplest representation, using 1/0 to indicate whether a given word is present. As the vocabulary grows (to millions of words, for example), a sparse representation that records only the words that actually appear is used instead, saving memory, disk space, and computation time (a small sketch follows this list).
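As an illustration, here is a minimal sketch in plain Scala (not tied to any particular library; the toy documents and helper names are made up for this example) of turning tokenized texts into sparse count vectors over a fixed vocabulary:

object BagOfWords {
  // Build a vocabulary: each distinct word gets a fixed vector index.
  def buildVocabulary(docs: Seq[Seq[String]]): Map[String, Int] =
    docs.flatten.distinct.zipWithIndex.toMap

  // Represent one tokenized document as a sparse vector: index -> count.
  def vectorize(tokens: Seq[String], vocab: Map[String, Int]): Map[Int, Int] =
    tokens.flatMap(vocab.get).groupBy(identity).map { case (idx, occ) => idx -> occ.size }

  def main(args: Array[String]): Unit = {
    // Toy documents, assumed to be already segmented, stop-word filtered, and stemmed.
    val docs = Seq(
      Seq("nanjing", "drunk", "driving", "accident"),
      Seq("nanjing", "bus", "accident"))
    val vocab = buildVocabulary(docs)
    docs.foreach(d => println(vectorize(d, vocab)))
  }
}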

 

Although this assumption simplifies natural language and facilitates modeling, it is unreasonable in some cases. For example, problems arise when the bag-of-words model is used for personalized news recommendation. Suppose user A is very interested in the phrase "Nanjing drunk driving accident". Because bag-of-words ignores word order and syntax, the system only concludes that user A is interested in "Nanjing", "drunk", "driving" and "accident", and may therefore recommend news related to "Nanjing", "bus" and "accident", which is obviously unreasonable.

The following introduces two feature extraction techniques included in Spark MLlib.

"TF-IDF, term frequency-inverse text frequency. Among them, term frequency assigns a weight to each word based on the frequency of the word appearing in the text, and inverse text frequency is calculated based on the frequency of the word in all documents."

 

The intuition behind this design is that the importance of a word increases with the number of times it appears in a document, but decreases with how often it appears across the whole corpus; this better expresses the ability of the word or phrase to distinguish between categories. Various forms of TF-IDF weighting are commonly used by search engines as a measure or ranking of a document's relevance to a user's query.

"Feature hashing uses a hashing equation to assign vector subscripts to features, which requires a pre-selected size of the feature vector."

 

If the number of features grows explosively, dimensionality reduction, such as clustering or PCA, is normally needed. However, these methods become computationally expensive when the numbers of features and samples are large. For text and categorical data, there is a simple and efficient technique for handling such high-dimensional data: feature hashing.

The goal of feature hashing is to compress the original high-dimensional feature vector into a lower-dimensional one without losing the expressive power of the original features. Its advantages are that it does not need to build a vocabulary mapping and keep it in memory, and it does not need to scan the dataset in advance; it is easy to implement and very fast, and because the resulting dimension is much smaller than the raw feature space, it limits the amount of memory needed for model training and prediction.
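To make the idea concrete, here is a minimal sketch in plain Scala of the hashing trick: each term is mapped to an index in a fixed-size vector through a hash function. The vector size and helper names are chosen only for this example; Spark's HashingTF works in a similar spirit.

object FeatureHashing {
  // Map a term to a bucket in [0, numFeatures) using its hash code.
  def indexOf(term: String, numFeatures: Int): Int = {
    val h = term.hashCode % numFeatures
    if (h < 0) h + numFeatures else h // keep the index non-negative
  }

  // Hash a tokenized document into a sparse term-frequency vector: index -> count.
  def transform(tokens: Seq[String], numFeatures: Int = 1 << 10): Map[Int, Int] =
    tokens.groupBy(indexOf(_, numFeatures)).map { case (idx, ts) => idx -> ts.size }

  def main(args: Array[String]): Unit =
    println(transform(Seq("nanjing", "drunk", "driving", "accident")))
}

Note that collisions are possible: two different terms can hash to the same index, which is the price paid for not having to store an explicit vocabulary.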

 

TF-IDF does not train a machine learning model; it performs feature extraction or transformation. It is often used as a preprocessing step before dimensionality reduction, classification, or regression.

Spark can be used to perform calculations according to the following steps:

Step 1. Use HashingTF, which applies feature hashing to map each term of the input text to an index in the term-frequency vector.

val tf = hashingTF.transform(tokens)

Step 2. Fit a global IDF vector and use it to convert the term-frequency vectors into TF-IDF vectors.

val idf = new IDF().fit(tf)

val tfidf = idf.transform(tf)
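Putting the two steps together, a minimal end-to-end sketch using Spark MLlib's RDD-based API might look as follows; the input path and the whitespace tokenization are assumptions made only for this example.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.{HashingTF, IDF}

object TfIdfExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TfIdfExample"))

    // Assume one document per line in a (hypothetical) text file; tokenize on whitespace.
    val tokens = sc.textFile("hdfs:///data/documents.txt")
      .map(_.toLowerCase.split("\\s+").toSeq)

    // Step 1: hash each term to an index and build term-frequency vectors.
    val hashingTF = new HashingTF()
    val tf = hashingTF.transform(tokens)

    // Step 2: fit a global IDF model, then convert the TF vectors into TF-IDF vectors.
    tf.cache()
    val idf = new IDF().fit(tf)
    val tfidf = idf.transform(tf)

    tfidf.take(5).foreach(println)
    sc.stop()
  }
}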

Trident is an advanced abstraction based on Storm for real-time stream processing. It provides operations such as aggregation, projection, and filtering of real-time streams, which greatly simplifies the workload of Storm task development. In addition, Trident provides primitives to handle stateful, incremental update operations against a database or other persistent store.

A Trident task can have multiple data sources, and state within the task is defined in the form of TridentState.

Now we use Trident to compute TF-IDF as a measure of term importance: we implement a TridentTopology and define a Stream data pipeline (referred to below as termStream) over the incoming documents.

The calculation formula of tf-idf is as follows:

tf-idf = tf(t, d) * log(D / (1 + df(t)))

Here, tf(t, d) is the frequency of term t in document d, df(t) is the number of documents in which term t appears, and D is the total number of documents.
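As a small sketch, the formula can be written directly as a Scala function (the names follow the definitions above):

// tf: occurrences of term t in document d; df: number of documents containing t; totalDocs: D.
def tfIdf(tf: Double, df: Double, totalDocs: Double): Double =
  tf * math.log(totalDocs / (1.0 + df))

// A term that is frequent in one document but rare across the corpus receives a high weight.
println(tfIdf(3, 1, 100))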

Its specific implementation is as follows:

// termStream is assumed to emit one tuple per term occurrence, carrying the
// fields "documentId", "source", and "term".

// Document frequency df: group by "term" and keep a persistent count per term.
TridentState dfState = termStream.groupBy(new Fields("term"))
    .persistentAggregate(getStateFactory("df"), new Count(), new Fields("df"));

// Total document count D: group by "source" and keep a persistent count.
TridentState dState = termStream.groupBy(new Fields("source"))
    .persistentAggregate(getStateFactory("d"), new Count(), new Fields("d"));

// Term frequency tf per (documentId, term), then compute tf-idf for each new value.
// TfidfExpression is assumed to be a user-defined function that combines tf with the
// df and D values maintained in the state above.
Stream tfidfStream = termStream.groupBy(new Fields("documentId", "term"))
    .persistentAggregate(getStateFactory("tf"), new Count(), new Fields("tf"))
    .newValuesStream()
    .each(new Fields("term", "documentId", "tf"), new TfidfExpression(), new Fields("tfidf"));
