Social Network Analysis 2 (Part 1): Methods, challenges and cutting-edge technologies of social network sentiment analysis

Preface

The "Social Network Analysis" course is taught by Mr. Lu Hongwei, whose teaching is rigorous and responsible yet full of humor and personal insight. The subject was particularly attractive to me, and I took the course with great interest.

2. Social Network Sentiment Analysis
Based on Chapter 2 of the course slides (PPT), "Social Network Sentiment Analysis".
This chapter briefly introduces the basic concepts and methods of social network sentiment analysis.


As social networks become more prevalent in our daily lives, it becomes increasingly important to understand and analyze emotional expressions on these platforms. Social network sentiment analysis not only helps us gain insight into public sentiment, but also provides critical insights in the fields of business, politics, and social research.

This blog aims to provide an in-depth analysis of the core concepts of sentiment analysis, the challenges faced, and its application in the field of social networks. We will explore different sentiment analysis methods, recent technical advances, and major Python tool libraries to provide a comprehensive guide for interested researchers and practitioners.

1. Basic concepts of sentiment analysis

This section covers the basic concepts, application scenarios, and technical methods of sentiment analysis, as well as its importance in social network analysis.

  1. Definition and historical background:

    • Sentiment analysis refers to the process of analyzing, processing, summarizing, and reasoning about subjective, emotionally colored text using automated or semi-automated methods.
    • Since the early 2000s, sentiment analysis has become one of the most active research areas in natural language processing (NLP).
  2. Main tasks and categories:

    • Tasks include the classification, extraction, retrieval, and summarization of sentiment information.
    • It is often compared with opinion mining, but there are subtle differences between the two:
      • Sentiment usually refers to attitudes, thoughts, or judgments prompted by feelings.
      • Opinion is a point of view, judgment, or evaluation of a specific thing.
    • Because the differences are subtle, both are usually grouped under sentiment analysis.
  3. Applications in social networks:

    • By collecting and analyzing text information on social networks, complex social phenomena can be understood and explained, and predictions made.
  4. The purpose and process of sentiment analysis:

    • The goal is to obtain automated tools for extracting opinions and sentiments from natural language text.
    • The extracted knowledge is structured for use by decision support systems or decision makers.
    • Sentiment analysis is a process of transformation from unstructured data to structured data.
  5. Multidisciplinary research areas:

    • Scholars such as Erik Cambria have pointed out that sentiment analysis integrates knowledge from multiple disciplines:
      • Artificial intelligence and semantic network technologies are used for knowledge representation and mining.
      • Mathematics is used for graph data mining and data dimensionality reduction.
      • Linguistics is used for semantic and pragmatic analysis.
      • Sociology and psychology are used to develop a deep understanding of natural language.
  6. Challenges and scope:

    • Sentiment analysis is a very challenging task for both industry and academia.
    • Before the emergence of online social networks, sentiment analysis mainly focused on news web pages, blogs, and forums.
  7. Quintuple representation of sentiment analysis results:

    • Sentiment analysis results are usually represented by a five-tuple.
    • This representation method transforms unstructured text into structured data (such as database tables).
    • Structured data can be used to conduct rich qualitative, quantitative and trend analysis, leveraging traditional database management systems and online analytical processing tools.
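
As a minimal sketch of this idea, the five-tuple can be stored as rows of a database table and then aggregated with ordinary SQL. The source does not spell out the five elements, so the field names below follow the commonly cited (entity, aspect, sentiment, holder, time) formulation and are an assumption, as are the table name `opinions` and the sample rows.

```python
import sqlite3
from dataclasses import dataclass, astuple


@dataclass
class Opinion:
    """One sentiment quintuple; field names are illustrative assumptions."""
    entity: str      # what is being evaluated, e.g. a product
    aspect: str      # which attribute of the entity
    sentiment: int   # +1 positive, 0 neutral, -1 negative
    holder: str      # who expressed the opinion
    time: str        # when it was expressed (ISO date)


# Hypothetical quintuples extracted from unstructured reviews.
opinions = [
    Opinion("phone", "price", 1, "user_a", "2023-01-05"),
    Opinion("phone", "camera", -1, "user_a", "2023-01-05"),
    Opinion("phone", "price", 1, "user_b", "2023-01-07"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE opinions (entity TEXT, aspect TEXT, sentiment INTEGER, "
    "holder TEXT, time TEXT)"
)
conn.executemany(
    "INSERT INTO opinions VALUES (?, ?, ?, ?, ?)",
    [astuple(o) for o in opinions],
)

# Once structured, the data supports ordinary quantitative queries,
# for example the average sentiment per aspect.
rows = conn.execute(
    "SELECT aspect, AVG(sentiment) FROM opinions "
    "GROUP BY aspect ORDER BY aspect"
).fetchall()
print(rows)
```

The same table could be filtered by `time` for trend analysis or by `holder` for user-level analysis, which is the kind of downstream use the quintuple representation is meant to enable.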

Challenges and applications of social network sentiment analysis

This section covers the challenges, applications, and value of social network sentiment analysis, illustrated with specific cases.

  1. Technical challenge

    • Although social network sentiment analysis builds on traditional sentiment analysis techniques, the distinctive online social network environment poses new challenges:
      • Huge data volume: for example, more than 100 million posts are published on Sina Weibo every day.
      • Noisy data: social network texts are usually short, which reduces the effectiveness of methods designed for long texts.
      • Incomplete data: users tend to browse more than they post.
      • Rapidly changing language: the constant emergence of new vocabulary makes analysis difficult.
      • Broad connectivity: social networks influence users’ social identities and the way they express themselves.
  2. Particularities of social networks:

    • Traditional sentiment analysis methods do not consider the influence of social network environment.
    • The difficulty of social network sentiment analysis is to quantify the social network environment and effectively integrate it into the analysis method.
  3. Academic and applied value:

    • Public opinion monitoring and event prediction: for example, sentiment analysis makes it possible to respond to and guide public opinion in time and so help prevent social instability.
    • Commercial applications:
      • Recommendation system: Provide personalized recommendations by analyzing user emotions, such as recommending similar movies based on movie reviews.
      • Product improvement and market strategy: Companies can optimize product and market strategies by analyzing discussions about products in social networks.
  4. Practical example

    • Public opinion case: In 2011, the incident in which Guo Meimei flaunted her wealth on Sina Weibo had a negative impact on the Red Cross Society of China.
    • Event prediction case: During the 2012 US election, the election results were predicted by analyzing information from Twitter.
  5. Future direction

    • As technology evolves, social network sentiment analysis needs to continue to adapt to new challenges, such as processing more complex data types, tracking rapidly changing online terms, and better understanding the impact of social networks on individual emotions.

Current status of sentiment analysis research

This section reviews the development of sentiment analysis from its origins to the present, covering technological evolution, expanding applications, and future trends.

  1. Research origins:

    • Research on sentiment analysis began in the 1990s, pioneered by Wiebe et al.
    • The initial research focused on judging whether a text is a statement of objective facts or an expression of the author's own point of view, that is, distinguishing the subjectivity and objectivity of the text.
  2. Introduction of the concept:

    • The concept of sentiment analysis first appeared in 2001, proposed by Das et al. when studying the text of stock market message boards.
    • They defined sentiment analysis as the classification of opinions into positive and negative.
  3. Introduction of opinion mining:

    • The concept of Opinion Mining was proposed by Dave et al.
    • The focus of the research is to analyze the opinions on product attributes in the text to obtain positive, neutral, and negative evaluations about the attribute.
  4. Development process:

    • Research developed from the initial subjective/objective classification toward more detailed analysis of sentiments and opinions.
    • Research gradually extended from simple text classification to in-depth sentiment understanding and opinion extraction.
  5. Technical evolution:

    • Early research mostly relied on lexical resources (such as sentiment lexicons) and rules.
    • With the development of machine learning and deep learning, methods have become more automated and more accurate.
  6. Expansion of application fields:

    • Initial applications were mainly in finance and product evaluation.
    • With the rise of social media, applications have expanded to social network analysis, public sentiment monitoring, market trend prediction, and more.
  7. Future trends:

    • There is increasing focus on multimodal sentiment analysis, such as combining text with visual and audio data to provide more comprehensive emotion recognition.
    • Cross-cultural and cross-language sentiment analysis is receiving attention in order to meet the needs of global communication.

2. Into what types can sentiment analysis be divided according to the object of analysis? Explain briefly.

Sentiment analysis and opinion mining are important research areas in natural language processing. Depending on the objects of analysis, sentiment analysis can be divided into the following types:

  1. Document-level Sentiment Analysis:

    • Analyze the overall sentiment of an entire document or chapter (e.g., news article, blog).
    • It is often assumed that the entire document expresses only a single emotional tendency, such as positive or negative.
  2. Sentence-level Sentiment Analysis:

    • Perform sentiment analysis on individual sentences.
    • Determine the overall emotional color of a sentence, such as determining the emotional tendency of a tweet or comment.
  3. Aspect-based Sentiment Analysis (ABSA):

    • A more nuanced analysis that focuses on the emotional tendencies of specific aspects or attributes in a text.
    • For example, for the review "This phone is cheap, but the pixels are not high", the sentiment of the "price" aspect is positive, while the "pixels" aspect is negative.
    • Aspect-level analysis can reveal users’ complex emotional attitudes toward different features.
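
The phone review above can be worked through with a very small rule-based sketch. It is not a real ABSA system: the lexicons, the clause splitting on commas and "but", and the clause-wide negation rule are all simplifying assumptions made here for illustration.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
# Opinion words that imply an aspect on their own ("cheap" implies price).
IMPLICIT_ASPECT = {"cheap": ("price", 1), "expensive": ("price", -1)}
# Explicit aspect terms and generic opinion words.
ASPECT_TERMS = {"pixels", "battery", "screen"}
OPINION_WORDS = {"high": 1, "good": 1, "low": -1, "bad": -1}
NEGATIONS = {"not", "no", "never"}


def aspect_sentiments(review):
    """Return {aspect: polarity} using clause splitting and lexicon lookup."""
    result = {}
    # Treat each clause separated by a comma or "but" as one opinion unit.
    for clause in re.split(r",|\bbut\b", review.lower()):
        tokens = re.findall(r"[a-z']+", clause)
        negated = any(t in NEGATIONS for t in tokens)
        aspect = next((t for t in tokens if t in ASPECT_TERMS), None)
        for tok in tokens:
            if tok in IMPLICIT_ASPECT:
                name, polarity = IMPLICIT_ASPECT[tok]
                result[name] = -polarity if negated else polarity
            elif tok in OPINION_WORDS and aspect is not None:
                polarity = OPINION_WORDS[tok]
                result[aspect] = -polarity if negated else polarity
    return result


result = aspect_sentiments("This phone is cheap, but the pixels are not high")
print(result)  # {'price': 1, 'pixels': -1}
```

A document-level or sentence-level classifier would have to collapse this review into a single label, whereas the aspect-level view keeps the positive "price" and negative "pixels" judgments separate.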

Vocabulary representation method

In the sentiment analysis task based on deep learning, the text is first preprocessed and then converted into a vector form that the computer can understand through word embedding:

  1. Word Vector/Word Embedding:

    • A method for mapping words from symbolic form to vector form.
    • This form of representation facilitates the calculation and understanding of natural language by machines.
    • Word embeddings have become the basis for downstream tasks in natural language processing and understanding.
  2. Comparison of traditional and modern methods:

    • Traditional sentiment analysis usually performs coarse-grained analysis on document-level or sentence-level text.
    • Aspect-level sentiment analysis provides a fine-grained analysis method capable of identifying texts containing multiple emotional aspects.
    • With the development of deep learning, sentiment analysis methods tend to be more accurate and automated.
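
To make the word-vector idea from item 1 concrete, here is a toy sketch. The three-dimensional vectors are made up for illustration; real embeddings are learned from large corpora and have far more dimensions. Averaging word vectors is only one simple way to represent a sentence and is an assumption of this sketch.

```python
import math

# Toy 3-dimensional embeddings with made-up values.
EMBEDDINGS = {
    "good":  [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "bad":   [-0.9, 0.1, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def embed_sentence(tokens):
    """Average the vectors of known tokens (a simple sentence encoding)."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return [0.0] * 3
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


print(cosine(EMBEDDINGS["good"], EMBEDDINGS["great"]))  # close to 1
print(cosine(EMBEDDINGS["good"], EMBEDDINGS["bad"]))    # negative
print(embed_sentence(["good", "great", "unknown"]))
```

Once words are vectors, similarity becomes a numeric computation, which is what allows downstream neural models to operate on text.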

3. Into what types can sentiment analysis be divided according to the analysis method? Explain briefly.

Sentiment Analysis is an important direction in Natural Language Processing (NLP). It is mainly used to identify and classify emotional attitudes in texts. According to different analysis methods, sentiment analysis can be divided into the following types:

  1. Sentiment analysis method based on keyword recognition: This method relies on detecting specific sentiment keywords in the text, such as "like" and "hate". These keywords usually have a clear polarity, and the sentiment of the overall text is judged from the frequency and context of these keywords.

  2. Dictionary-based sentiment analysis method: This method uses a predefined sentiment dictionary in which each word is assigned a sentiment score indicating its polarity (positive or negative) and intensity. When analyzing text, the system checks whether each word is in the sentiment dictionary and calculates the sentiment of the entire text from the scores of the matched words.

  3. Sentiment analysis method based on machine learning: This method identifies the emotional tendency of text by training a machine learning model. First, a large text data set with emotional labels (such as positive, negative) is needed, and then these data are used to train a classifier (such as support vector machine, neural network, etc.) so that it can make emotional judgments on new texts.

  4. Combination of multiple methods: In practical applications, in order to improve accuracy and adapt to different text types, the above methods are often combined. For example, a dictionary-based approach can be used to conduct an initial analysis of the text, and then a machine learning model can be used for in-depth analysis and adjustment.

Each method has its advantages and limitations, and the choice of which method depends on the specific application scenario and available resources.

Sentiment analysis method based on keyword recognition

This section looks at keyword-based sentiment analysis in more depth, including its basic principles, improved variants, and main challenges, in order to clarify where the approach applies and where it falls short.

  1. Basic concept

    • The earliest and most intuitive sentiment analysis method: text is classified according to specific sentiment words (seed words).
    • Commonly used seed words include "happy", "sad", "scared", etc., which have clear emotional tendencies.
  2. Typical usage example:

    • The vocabulary list created by Elliot contains 198 emotion words, combined with adverbs expressing degree (such as "extremely", "somewhat", etc.).
  3. Improved methods:

    • Hatzivassiloglou et al. proposed the sentiment consistency hypothesis, which uses the behavior of different conjunctions (for example "and" versus "but") to label keyword polarity.
    • Turney computes the mutual information between phrases in the text and the reference words "excellent" and "poor" in a corpus to classify sentiment.
    • Yu et al. improved Turney's method and proposed a log-likelihood ratio calculation method using 600 adjectives.
    • Rao et al. used a label propagation algorithm, treating each word as a node of a graph and updating its label with an algorithm similar to web page ranking.
    • Qiu et al. define the relationships between sentiment keywords and their features, and use rules for polarity assignment.
  4. Scenario-specific applications:

    • Zhang et al. proposed a bipartite graph and an iterative algorithm for specific scenarios to handle text in which explicit sentiment words are missing, for example by analyzing "verb + quantifier + noun" structures to determine sentiment polarity.
  5. Advantages of the method:

    • Simple and direct, suitable for texts containing clear emotional words.
  6. Challenges:

    • Cannot handle negation: for example, in a sentence containing a negation such as "Don't be happy!", keyword-based methods have difficulty judging the sentiment correctly.
    • Lack of deep understanding: for sentences that do not express emotion directly, such as "People in the hometown of Ningnan County, Liangshan Prefecture spontaneously folded flowers and waited for the heroes to return home," the method has difficulty recognizing the strong negative sentiment.
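
The keyword approach and its two weaknesses can be reproduced in a few lines. This is a minimal sketch: the seed words, the degree-adverb weights, and the rule that an adverb only modifies the word directly after it are assumptions chosen for illustration, not the vocabulary list described above.

```python
import re

# Hypothetical seed words and degree adverbs, for illustration only.
SEED_WORDS = {"happy": 1, "glad": 1, "sad": -1, "scared": -1}
DEGREE_ADVERBS = {"extremely": 2.0, "very": 1.5, "somewhat": 0.5}


def keyword_score(text):
    """Sum seed-word polarities, scaled by a directly preceding adverb."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0.0
    for i, tok in enumerate(tokens):
        if tok in SEED_WORDS:
            weight = DEGREE_ADVERBS.get(tokens[i - 1], 1.0) if i > 0 else 1.0
            score += weight * SEED_WORDS[tok]
    return score


def classify(text):
    """Map the keyword score to a coarse label."""
    score = keyword_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


implicit = "People folded flowers and waited for the heroes to return home"
print(classify("I am extremely happy today"))  # positive
print(classify("Don't be happy!"))             # positive: the negation is missed
print(classify(implicit))                      # neutral: no seed word present
```

The second and third calls show the failure modes listed above: the negated sentence is still scored by the word "happy", and the sentence without any seed word gets no sentiment at all.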

Dictionary-based sentiment analysis method

This section gives an overview of dictionary-based sentiment analysis methods, including their basic principles, main applications, improved variants, and main challenges.



  1. Basic principle

    • Words are assigned sentiment labels or scores according to a pre-constructed sentiment dictionary.
    • Words in the sentence are matched against the dictionary, and the final classification is based on the resulting sentiment scores or labels.
  2. Examples of sentiment dictionaries:

    • General Inquirer (GI): An early sentiment dictionary that labels 1,915 positive words and 2,291 negative words.
    • Opinion Lexicon: Contains 2,006 positive words and 4,783 negative words, including slang, word variants, and so on.
    • SentiWordNet: Assigns objectivity, positivity, and negativity scores to synonym sets based on WordNet.
    • ConceptNet: A knowledge representation system that represents human common sense as a semantic graph. It is used to discover keywords and expand vocabulary.
    • SenticNet: Based on ConceptNet, it assigns sentiment scores to concepts and includes 14,000 concepts with sentiment labels.
  3. Chinese sentiment dictionaries:

    • HowNet: Contains Chinese and English words, and uses "sememes" to describe the different meanings of the words.
    • Emotional Vocabulary Ontology Database: Established by Dalian University of Technology, it subdivides positive emotions more finely and contains 11,229 positive words and 10,783 negative words.
    • NTUSD: A sentiment dictionary established by National Taiwan University, containing 2,810 positive words and 8,276 negative words.
  4. Challenges and Improvements:

    • Handling sarcasm and domain dependence: for example, the word "big" carries different sentiment in "big trouble" and "big room".
    • Integrating human cognition: Xing’s method constructs a dictionary based on human cognition and learns from incorrectly predicted texts.
    • Vector representation: Shin proposed representing dictionary information as vectors for use in convolutional neural networks.
    • Using common-sense concepts: Ma represents common-sense concepts from SenticNet as vectors for use in long short-term memory networks.
  5. Special processing of Chinese text:

    • Use sequence labeling to extract evaluation elements and extend the sentiment dictionary to related domains (Song Jiaying et al.).
    • A method for computing the semantic orientation of words based on HowNet (Zhu Yanlan et al.).
    • Use the synonym groups in Tongyici Cilin (a Chinese synonym thesaurus) to expand the seed vocabulary (Lu Bin et al.).
  6. Advantages and disadvantages:

    • Advantages: Simple, direct, suitable for texts containing clear emotional words.
    • Disadvantages: It is difficult to process texts that contain sarcasm or indirect expressions of emotion. The words in the dictionary may have different emotional tendencies in different fields.
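
A minimal sketch of dictionary-based scoring, including the domain-dependence problem with "big" mentioned above. The lexicon entries and scores are made up for illustration and do not come from any of the dictionaries listed; averaging the matched scores is one possible aggregation rule, assumed here for simplicity.

```python
import re

# Hypothetical general-purpose lexicon with sentiment scores in [-1, 1].
GENERAL_LEXICON = {"good": 0.7, "excellent": 1.0, "trouble": -0.6,
                   "poor": -0.7, "big": 0.0}
# Hypothetical domain lexicon: in hotel reviews a "big" room is a plus.
HOTEL_LEXICON = {"big": 0.5}


def lexicon_score(text, domain_lexicon=None):
    """Average the scores of all tokens found in the lexicon."""
    lexicon = dict(GENERAL_LEXICON)
    if domain_lexicon:
        lexicon.update(domain_lexicon)   # domain entries override general ones
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = [lexicon[t] for t in tokens if t in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


print(lexicon_score("The room was big"))                 # neutral with the general lexicon
print(lexicon_score("The room was big", HOTEL_LEXICON))  # positive with the domain lexicon
print(lexicon_score("We had big trouble"))               # negative
```

Overriding general entries with a domain lexicon is one simple way to express that the same word can carry different sentiment in different fields.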

Sentiment analysis method based on machine learning

This section explains sentiment analysis methods based on machine learning, including their principles, main algorithms, research cases, and challenges.

  1. Basic principles and benefits

    • Trained on a labeled corpus, machine learning methods consider not only the sentiment polarity of keywords but also other factors such as punctuation and word co-occurrence frequency.
    • They are well suited to long texts, but their performance on short texts still needs improvement.
  2. Algorithm examples and applications:

    • Tony Mullen et al.: Added new features to the unigram features of Pang et al. and used a support vector machine (SVM) classifier to improve the accuracy of text analysis.
    • Whitelaw et al.: Used a dictionary to identify sentiment phrases such as "very good" and used them as features for sentiment classification with SMO.
    • Ye et al.: Applied sentiment classification to a specific domain (travel blogs) and compared naive Bayes, SVM, and n-gram models.
    • Chaovalit et al.: Compared the sentiment classification accuracy of machine learning algorithms and semantic orientation algorithms on movie reviews, using n-gram models and Turney's semantic orientation algorithm.
    • Li Suke et al.: Extracted features from comments, trained a classifier on common features and sentiment features, and combined it with spectral clustering to improve classification performance.
    • Yang Zhen et al.: To address the sparse features and missing context of short texts, extracted information such as time, space, and relationships to reconstruct the text, then used naive Bayes for Weibo sentiment analysis.
  3. Bayesian methods:

    • Basic principle: Bayesian classification is grounded in probability theory. It combines prior probabilities with observed evidence to obtain posterior probabilities, is suitable for large data sets, and has a low misclassification rate.
    • Naive Bayes: Assuming that features are independent of one another, the probability of each feature within each category and the prior probability of each category are estimated from the training data.
  4. Application of Recurrent Neural Network (RNN):

    • RNN is suitable for processing sequence data, such as text and video, and can consider the context of the input sequence.
    • Irsoy et al.: Proposed a spatial deep RNN model to process the hierarchical features of language.
    • Yang et al.: Treated a document as a hierarchy of sentences and a sentence as a hierarchy of words, and used bidirectional gated recurrent units (GRU) with an attention mechanism for sentiment analysis of long texts.
    • Wang et al.: Combined a CNN with a long short-term memory (LSTM) network: the CNN captures local information in the sentence, which is fed to the LSTM for fine-grained sentiment analysis.
    • Zeng Yifu et al.: Applied recurrent neural networks to aspect-level sentiment analysis, combining local encoding and segmental decoding to extract sentiment features.
    • Cai Guoyong et al.: Constructed a hierarchical multimodal attention network from the perspective of semantic association between visual and text data.
    • Socher et al.: Designed the recursive neural tensor network (RNTN) to deal with the semantic composition problem of sentences.
  5. Long short-term memory network (LSTM):

    • Used to solve the vanishing and exploding gradient problems in long sequence training.
    • Compared with ordinary RNN, LSTM performs better on longer sequences.
    • Wang et al.: To address the deviation problem between LSTM and tree LSTM, constructed a capsule tree-LSTM model and introduced a dynamic routing algorithm.
    • Li Weijiang et al.: Built a model based on bidirectional LSTM that analyzes text information and sentiment resources, using different channels to make full use of sentiment information.
    • Liu Quan et al.: Combined a regional convolutional neural network with a hierarchical LSTM to analyze aspect-level sentiment.
  6. Other deep learning applications:

    • Mikolov et al.: Proposed continuous distributed vector representations of words; neural network methods built on them significantly improved sentiment analysis results.
    • Kim et al.: Constructed a CNN model for text classification that uses pre-trained word vectors as features and showed good results.
    • Kalchbrenner et al.: Proposed a dynamic convolutional neural network model to capture semantic relationships between words at different distances within a sentence.
    • He Yanxiang et al.: Mapped the emoticons in Weibo posts to continuous vector representations and used a multi-channel CNN model to strengthen the model's sentiment analysis capabilities.
    • Luo et al.: Proposed the Seq2SentiSeq model, which uses a Gaussian kernel layer to finely control sentiment intensity and combines it with a cycle reinforcement learning algorithm to guide model training.
    • Chen et al.: Constructed a transfer capsule network model to address the high annotation cost of aspect-level sentiment analysis, and proposed an aspect routing method.
    • Bao et al.: Proposed using a dictionary to enhance the attention mechanism and obtain a more flexible model.
    • Tan et al.: Developed a dual-attention multi-label classification model to handle sentences that express both positive and negative sentiment.
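
The naive Bayes description in item 3 above (class priors plus per-class feature probabilities under an independence assumption) can be sketched directly. This is a minimal multinomial variant with add-one smoothing written for illustration; the class name, the smoothing choice, and the four-sentence training set are assumptions, not details from the cited studies.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_log_prob = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior of the class.
            log_prob = math.log(count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                if tok not in self.vocabulary:
                    continue  # ignore words never seen in training
                # Smoothed likelihood of the word given the class.
                likelihood = (self.word_counts[label][tok] + 1) / (
                    total_words + vocab_size
                )
                log_prob += math.log(likelihood)
            if log_prob > best_log_prob:
                best_label, best_log_prob = label, log_prob
        return best_label


# Tiny made-up training set, for illustration only.
train_docs = [
    "great movie loved it".split(),
    "wonderful acting great plot".split(),
    "terrible movie hated it".split(),
    "boring plot terrible acting".split(),
]
train_labels = ["pos", "pos", "neg", "neg"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("great plot".split()))       # pos
print(model.predict("terrible boring".split()))  # neg
```

Working in log space avoids numerical underflow when many word probabilities are multiplied together.
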
Challenges and future directions of machine learning methods
  1. Dataset size: Machine learning methods require a large amount of training data to achieve high accuracy, and have limited effect on scenarios with small amounts of data.

  2. Short text analysis: When processing short texts (such as Weibo and comments), due to the small amount of information, the performance needs to be improved.

  3. Complex language processing: For example, processing texts containing complex emotional expressions such as sarcasm and puns.

  4. Challenges for deep learning methods:

    • Processing complex text: Deep learning methods need to solve complex emotional expression problems such as sarcasm and metaphor.
    • Data dependence: High-performance deep learning models often rely on large amounts of labeled data.
  5. Future direction:

    • Combine multiple models: such as the combination of CNN and RNN to adapt to different types of text analysis needs.
    • Utilize the attention mechanism: Enhance the model's ability to capture key information and improve the accuracy of sentiment analysis.
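
As a rough illustration of what an attention mechanism does, the sketch below weights each word vector by its relevance to a query vector and pools the result. The two-dimensional vectors and the query are hand-set assumptions; in a trained model they would be learned parameters, and this dot-product form is only one of several attention variants.

```python
import math


def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, query):
    """Weight each word vector by its dot product with the query vector."""
    scores = [sum(a * b for a, b in zip(vec, query)) for vec in vectors]
    weights = softmax(scores)
    pooled = [
        sum(w * vec[d] for w, vec in zip(weights, vectors))
        for d in range(len(query))
    ]
    return weights, pooled


# Made-up 2-dimensional word vectors; the first dimension loosely encodes
# how much sentiment a word carries.
tokens = ["the", "food", "was", "awful"]
vectors = [[0.0, 0.1], [0.2, 0.5], [0.0, 0.2], [3.0, 0.1]]
query = [1.0, 0.0]

weights, pooled = attention_pool(vectors, query)
for tok, w in zip(tokens, weights):
    print(f"{tok}: {w:.2f}")
```

Most of the weight falls on "awful", which is the sense in which attention helps a model focus on the words that carry the key information.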

Research status of social network sentiment analysis

This section surveys advanced research on social network sentiment analysis, including the key directions of current research, the challenges faced, and future development trends.

5. What are the main problems faced by social network sentiment analysis?

  1. Text length limit: Social networking platforms (such as Weibo) usually have restrictions on the length of published content, resulting in concise information that is not enough to express complex emotions.

  2. Informal expressions: Social network users commonly use informal language, including spelling errors, informal abbreviations, and emerging words (such as "QTQ" and "23333"), which increases the difficulty of sentiment analysis.

  3. Data heterogeneity: Data in social networks are highly heterogeneous, involving text, pictures, videos and other forms, which increases the complexity of comprehensive sentiment analysis.

  4. User relationship impact: Social interactions between users (such as following, forwarding, and commenting) affect emotional expression, and these social factors need to be considered in sentiment analysis.

  5. Difficulty in emotional annotation: Emotional annotation of social network texts is subjective, and different users may have different emotional understandings of the same content.

  6. User-specific emotional expression: Different users may have different emotional expression habits, and personalization factors need to be considered in the model.

  7. Integration of text and user relationships: Integrating text content and user social relationships in sentiment analysis models is a challenge, especially when considering complex interaction patterns between users.

  8. Multimodal data processing: Non-text data such as pictures and videos in social networks also contain emotional information. How to effectively integrate these multimodal data is a big problem.
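
Informal expressions (item 2 above) are often handled by turning surface cues into features and by lightly normalizing the text. The sketch below shows one simple way to do that; the emoticon list, the feature names, and the "collapse to two characters" rule are assumptions made for illustration.

```python
import re

# Hypothetical emoticon list, for illustration only.
EMOTICONS = {":)": 1, ":D": 1, ":(": -1, "T_T": -1}


def informal_features(text):
    """Extract simple surface cues that are common in social media posts."""
    return {
        # words containing a letter repeated three or more times ("sooooo")
        "elongated_words": len(
            re.findall(r"\b\w*([a-zA-Z])\1{2,}\w*\b", text)
        ),
        # runs of two or more "!" or "?" characters
        "repeated_punct": len(re.findall(r"[!?]{2,}", text)),
        # summed polarity of the emoticons that occur in the text
        "emoticon_polarity": sum(
            polarity * text.count(emo) for emo, polarity in EMOTICONS.items()
        ),
    }


def normalize(text):
    """Collapse any character repeated 3+ times down to two occurrences."""
    return re.sub(r"(.)\1{2,}", r"\1\1", text)


post = "I am sooooo happy!!! :D"
print(informal_features(post))
print(normalize(post))  # I am soo happy!! :D
```

The features keep the emphasis signal (elongation, repeated punctuation, emoticons) while the normalized text reduces the number of distinct spellings a downstream model has to learn.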

Main problems and solutions faced by social network sentiment analysis

  1. Particularities of social networks:

    • Large data volume and short texts: length limits result in brief messages.
    • Users express themselves in varied ways: spelling errors, informal abbreviations, and new words are frequent.
  2. Early methods:

    • Go et al.: Labeled sentiment using emoticons on Twitter and classified tweets using the method of Pang et al.
    • Pak et al.: Performed adjective disambiguation using a Bayesian classifier.
    • Davidov et al.: Detected sarcastic sentences in Twitter and Amazon reviews using the KNN algorithm.
  3. Improved techniques:

    • Agarwal et al.: Represented tweets as trees and used a tree kernel to compute subtree similarity for sentiment analysis.
    • Kouloumpis et al.: Used n-gram, lexicon, and part-of-speech (POS) features, trained with an AdaBoost classifier.
    • Mohammad et al.: Combining hand-selected features, sentiment dictionary features and traditional features.
  4. Target level sentiment analysis:

    • Jiang et al.: Proposed target-level sentiment analysis based on target dependence, using the LSTM model with attention mechanism.
  5. Feature extraction and classification method:

    • Cui et al.: Extract generalized emoticons, repeated punctuation and repeated letter information, and classify them through the label propagation algorithm.
    • Kiritchenko et al.: Use the relationship between words and emoticons to build an emotional dictionary and extract emotional features.
    • Barbosa et al.: First carry out subjective and objective classification, and then determine the emotional tendency, using meta-information and grammatical information.
  6. Feature completeness and inclusion of buzzwords:

    • Riloff's dictionary: Provides subjectivity features and polarity features; popular Internet words are added, and an SVM is trained on the resulting features.
  7. User Relationship and Weibo Sentiment Analysis:

    • Feng et al.: Using the contextual features of Weibo posts, applied a hierarchical LSTM with two attention mechanisms to analyze Weibo sentiment.
    • Tan et al.: Used follow and @-mention relationships between users for sentiment analysis, minimizing the difference in sentiment labels between adjacent nodes.
    • Ren et al.: Treated user sentiment analysis as a collaborative filtering task and applied matrix factorization.
    • Cheng et al.: Refined user relationships into approval and disapproval relationships and used unsupervised methods for user-level sentiment analysis.
    • Huang Faliang et al.: Combined the LDA model with user relationships to analyze the sentiment of Weibo posts.
    • Hu et al.: Built a relationship matrix between Weibo posts for sentiment analysis.
    • Lu: Building on the work of Hu et al., proposed a semi-supervised sentiment analysis model that takes into account user relationships and the textual similarity of Weibo posts.
  8. Weibo user interaction and emotional network:

    • Wu et al.: Extend social context information to the prediction stage to analyze Weibo sentiment.
    • West et al.: Use the emotional value of user interaction text to construct a weighted user relationship network and predict user opinions.
    • Fersini et al.: Use likes and retweets to build an agreement network and construct an unsupervised model to analyze emotions.
    • Guo et al.: Constructed an RNN model over user indices and Weibo posts, and introduced an attention mechanism and a Hawkes process to classify sentiment.
    • Wang et al.: Proposed a convolutional neural network model based on an adversarial cross-lingual learning framework and a user attention mechanism to capture users' expression habits.
    • Speriosu et al.: Used the user follower graph, combined with a maximum entropy model and a label propagation algorithm.
    • Smith et al.: Obtain user-level emotions through emotion clustering, but ignore the impact of user relationships.
    • Kim et al.: Use collaborative filtering method to analyze emotions based on user similarity, without fully considering social relationships.
  9. Personalized sentiment analysis model:

    • Wu et al.: Establish a personalized sentiment analysis model, combining global classifiers and user-specific classifiers, but the effect is limited on large data sets and inactive users.
    • Wu Fangzhao et al.: Considering the opinion gap between users, a logistic regression model with L1 regularization is used, but it is difficult to extract heterogeneous relationships.
  10. Information network framework and user influence:

    • Deng et al.: Based on the information network framework, they explore the similarities and differences of user opinions and propose a semi-supervised optimization model.
    • Kaewpitakkun et al.: Extract implicit connections through user historical microblogs for user-level sentiment analysis.
    • Eliacik et al.: Considering user influence, using the PageRank algorithm to identify influential users and extending the sentiment analysis method.
  11. Integrated Information and Heterogeneous Networks:

    • Li et al.: Proposed a user-event-based supervised topic model, combining text topics and user-event factors.
    • Nozza et al.: Treat Weibo as a heterogeneous network and infer the emotional polarity of Weibo and users.
    • Kuo et al.: Combine social interaction information and text opinions to construct a social opinion graph for group sentiment analysis.

Challenges and future directions

  1. Challenges

    • Diversity and unstructured nature of social network data.
    • The impact of user relationships and social dynamics on sentiment analysis.
    • Finding an efficient way to integrate user behavior, textual content and social structure.
    • Informality, abbreviations, and neologisms appear frequently in social network texts.
    • The information content of short texts is limited, making it difficult to accurately grasp emotional tendencies.
    • Processing heterogeneous social network relationships and extracting complex emotional interactions between Weibo posts and users.
  2. Development trends

    • Delve deeper into the emotional impact of user behavior, relationships, and social dynamics.
    • Combine multiple data sources (such as text, user relationships, metadata) for comprehensive analysis.
    • Leverage deep learning and natural language processing techniques to improve the accuracy and adaptability of analysis.
    • Combine multiple features and models, such as emojis, POS tags, and n-grams.
    • Develop more complex models to handle implicit emotions such as sarcasm and puns in text.
    • Develop more complex models, such as integrating social structure and content analysis, to better handle heterogeneous relationships and personalized emotional expressions among users.

Comprehensive analysis

Existing social network sentiment analysis methods mainly focus on user-level or topic-level analysis, while there are still challenges in sentiment analysis of Weibo itself, especially in extracting heterogeneous relationships in widespread social networks. Future research should explore more deeply the combined impact of user behavior, social dynamics, and text content, while developing more complex and precise analytical models to handle the diversity and unstructured nature of social networks.

Sentiment analysis related technologies

4. Briefly describe the basic process of sentiment analysis.

  1. Data acquisition and cleaning:

    • First, a large amount of data is obtained from the data set and data cleaning is performed to remove irrelevant information and noise and improve data quality.
  2. Data preprocessing:

    • Preprocess the cleaned data, including text standardization, removal of stop words, etc. This step is time-consuming but crucial to improving classification accuracy.
  3. Text vectorization:

    • Convert text data into machine-understandable vector form. Commonly used methods include bag-of-words models, TF-IDF, word embeddings, etc.
  4. Feature extraction:

    • Key features are extracted from vectorized text, which will be used to train sentiment analysis models.
  5. Model construction and training:

    • Establish sentiment analysis models. Common models include naive Bayes, support vector machine, random forest, deep learning model, etc.
    • Perform model training, tune and identify optimal hyperparameters to optimize model performance.
  6. Outcome prediction and evaluation:

    • Use the trained model to make predictions on the test data set.
    • Evaluate model effectiveness using common indicators such as accuracy, recall, and F1 score.
  7. Model deployment:

    • Deploy evaluated models into real-world applications for real-time or batch sentiment analysis.
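
The first six steps above can be sketched end to end in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the toy corpus, the stop-word list, and the function names (preprocess, train_naive_bayes, predict) are all made up for this example, and a hand-written multinomial naive Bayes with Laplace smoothing stands in for the model. A real project would use a proper data set and an established library.

```python
import math
import re
from collections import Counter, defaultdict

# Toy labeled corpus (invented for illustration, not a real data set).
TRAIN = [
    ("I love this phone, it is great!", "pos"),
    ("What a wonderful, happy day :)", "pos"),
    ("Great service and lovely staff", "pos"),
    ("I hate this, it is terrible", "neg"),
    ("Awful day, so sad and angry", "neg"),
    ("Terrible service, I am angry", "neg"),
]
TEST = [("I love this great day", "pos"), ("so terrible and awful", "neg")]

# Hypothetical, very small stop-word list.
STOP_WORDS = {"i", "a", "it", "is", "this", "and", "so", "am", "what"}


def preprocess(text):
    """Steps 1-2: strip punctuation and noise, lowercase, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def train_naive_bayes(samples):
    """Steps 3-5: turn texts into word counts per class and fit the model."""
    class_docs = Counter()              # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for text, label in samples:
        class_docs[label] += 1
        for tok in preprocess(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_docs, word_counts, vocab


def predict(model, text):
    """Step 6: choose the class with the highest log posterior."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)
        # Laplace (add-one) smoothing over the training vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in preprocess(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_naive_bayes(TRAIN)
predictions = [predict(model, text) for text, _ in TEST]
accuracy = sum(p == y for p, (_, y) in zip(predictions, TEST)) / len(TEST)
print(predictions, accuracy)  # ['pos', 'neg'] 1.0
```

Step 7 (deployment) is left out here; in practice it would mean wrapping a function like predict in a service or a batch job.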

6. What are the common text vectorization models? Briefly describe them.

In natural language processing tasks, the most fine-grained representations are words. Words can be composed into sentences, and sentences can be composed into paragraphs, chapters and documents.

But computers do not know these words, so we need to mathematically represent the natural language represented by words.

To put it simply, we need to convert vocabulary into a computer-recognizable numerical form. There are currently two main methods of transformation and representation: one is the one-hot encoding used in traditional machine learning, and the other is word embedding based on neural networks.

  1. Bag of Words (BoW) model

    • Convert text into word frequency vectors, ignoring the order and context of words.
    • Each document is represented as a long vector, where each element represents the number of times a specific word appears in the document.
  2. TF-IDF(Term Frequency-Inverse Document Frequency)

    • Evaluate the importance of a word in a collection of documents.
    • Combining term frequency (TF) and inverse document frequency (IDF) to reduce the influence of common words and increase the weight of rare words.
  3. Word Embedding:

    • Map words into real-valued dense vectors to capture the relationships between words.
    • Common methods include Word2Vec, GloVe, etc., which can reflect the semantic and grammatical relationships between words.
  4. Topic model

    • Automatically identify topics from large amounts of text and represent the text as a mixture of a series of topics.
    • Commonly used algorithms include Latent Dirichlet Allocation (LDA).
  5. One-Hot encoding

    • Represent each word as a vector whose length equals the vocabulary size, with a 1 at the word's position in the vocabulary and 0 elsewhere.
    • It is simple but less efficient and cannot express the semantic relationship between words.
  6. CountVectorizer

    • Convert text documents into word frequency matrices.
    • A scikit-learn class that implements the bag-of-words model through word frequency statistics.

Bag of Words (BoW) model

  1. Basic concepts: The bag-of-words model is a text representation method widely used in natural language processing and information retrieval. It represents a text as an unordered collection of words, focusing on how often each word occurs rather than on its position in the text or its grammatical structure.

  2. Importance

    • Structured text data: Convert unstructured text data into structured numerical data to facilitate machine learning model processing.
    • Widely used: Suitable for a variety of natural language processing tasks, such as text classification, sentiment analysis, document clustering, etc.

Build steps

  1. Tokenization

    • Split text into sequences of words.
  2. Dictionary Creation:

    • Count unique words in all documents to form a dictionary.
  3. Vectorization:

    • Represent each document as a vector, with each element of the vector corresponding to a word in the dictionary.
    • The values in the vector represent the frequency of occurrence of the word in the document.
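
The three build steps can be sketched in plain Python as follows. The two example sentences and the helper names (tokenize, vectorize) are assumptions made for illustration only; the vocabulary is sorted alphabetically simply so that the vector positions are predictable.

```python
import re

# Two toy documents (invented for this example).
docs = ["the cat sat on the mat", "the dog chased the cat"]


def tokenize(text):
    """Step 1: split text into a sequence of lowercase words."""
    return re.findall(r"\w+", text.lower())


tokenized = [tokenize(d) for d in docs]

# Step 2: collect the unique words of all documents into a dictionary.
vocabulary = sorted({tok for doc in tokenized for tok in doc})
word_index = {word: i for i, word in enumerate(vocabulary)}


def vectorize(tokens):
    """Step 3: count how often each dictionary word occurs in one document."""
    vec = [0] * len(vocabulary)
    for tok in tokens:
        vec[word_index[tok]] += 1
    return vec


vectors = [vectorize(t) for t in tokenized]

print(vocabulary)  # ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors[0])  # [1, 0, 0, 1, 1, 1, 2]
print(vectors[1])  # [1, 1, 1, 0, 0, 0, 2]
```

Note that both vectors have the same length even though the documents differ, and that word order is lost: "the cat sat" and "sat the cat" would produce identical vectors.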

Application of One-hot Encoding in Lexical Representation

One-hot encoding is a commonly used text vectorization method for converting words in text into computer-recognizable numerical forms. In this encoding, each word is mapped to a unique binary vector.

  • Word mapping: Assign a unique index to each individual word in the corpus.
  • Vector representation: The length of the generated vector is equal to the size of the vocabulary, the index position of the corresponding vocabulary is set to 1, and the remaining positions are 0.

Example

In the case provided, we have a small corpus containing five different words: I, love, Dad, Mom, China. In one-hot encoding, each word is assigned a unique index number. In this example, the index numbers are as follows:

  • "I" -> 1
  • "Love" -> 2
  • "Dad" -> 3
  • "Mom" -> 4
  • "China" -> 5

Each word is represented as a vector of length 5 (because there are 5 unique words in the corpus). In this vector, the index position of the corresponding word is marked as 1, and the remaining positions are 0.

  • For the sentence "I love China", we convert each word into a vector according to One-hot encoding.

    • "I" is the first word, so the first position is 1 and the other positions are 0.
    • "Love" is the second word, so the second position is 1 and the other positions are 0.
    • "China" is the 5th word, so the 5th position is 1 and the other positions are 0.

    Combining these three word vectors position by position, the sentence "I love China" is expressed as: (1, 1, 0, 0, 1).

  • For "Mom and Dad love me", each word is converted accordingly:

    • "Dad" is the third word, "Mom" is the fourth word, "love" is the second word, and "I" is the first word.

    Therefore, the One-hot encoding of this sentence is expressed as: (1, 1, 1, 1, 0).

  • For "Mom and Dad Love China", the conversion process is similar.

    Therefore, the One-hot encoding of this sentence is expressed as: (0, 1, 1, 1, 1).
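
The example can be reproduced with a short Python sketch. The function names one_hot and sentence_vector are made up for illustration, and the sentence vector is assumed to be the position-wise maximum (logical OR) of the word vectors, which is what the results above imply. The code uses 0-based list positions, so index 1 in the text corresponds to position 0 here.

```python
# Vocabulary in the same order as the index table above.
vocabulary = ["I", "love", "Dad", "Mom", "China"]
index = {word: i for i, word in enumerate(vocabulary)}  # 0-based positions


def one_hot(word):
    """Vector of vocabulary length with a single 1 at the word's position."""
    vec = [0] * len(vocabulary)
    vec[index[word]] = 1
    return vec


def sentence_vector(words):
    """Combine word vectors position by position (1 if any word sets it)."""
    vec = [0] * len(vocabulary)
    for word in words:
        vec = [max(a, b) for a, b in zip(vec, one_hot(word))]
    return vec


print(one_hot("China"))                                # [0, 0, 0, 0, 1]
print(sentence_vector(["I", "love", "China"]))         # [1, 1, 0, 0, 1]
print(sentence_vector(["Dad", "Mom", "love", "I"]))    # [1, 1, 1, 1, 0]
print(sentence_vector(["Dad", "Mom", "love", "China"]))  # [0, 1, 1, 1, 1]
```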

Important points

  • The disadvantage of one-hot encoding is that the vectors are usually very sparse (most positions are 0), which can lead to inefficiency when you have a large vocabulary.
  • It cannot capture similarities or semantic relationships between words because each word is encoded independently of each other.

Advantages

  • Simple and clear: Each word has a unique vector and is easy to implement.
  • Effective representation: Solves the problem of converting categorical variables into binary vectors.

Disadvantages

  • Matrix sparse: The vector dimension is large and most elements are 0, resulting in a waste of computing resources.
  • Curse of Dimensionality: As the vocabulary increases, the vector dimensions grow dramatically.
  • Missing semantics: The vectors are mutually orthogonal and cannot express the semantic relationship between words. (Intuitively, "I" and "you" should be more similar to each other than either is to "banana" or "apple", but one-hot vectors give every pair of distinct words a similarity of zero.)
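
The sparsity and orthogonality points can be checked numerically. The sketch below is only an illustration: the four-word vocabulary and the vocabulary size of 10,000 are arbitrary assumptions, and cosine similarity is used as one common way to compare vectors.

```python
import math


def one_hot(position, size):
    """One-hot vector of the given size with a 1 at the given position."""
    return [1 if j == position else 0 for j in range(size)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vocab = ["I", "you", "banana", "apple"]
vecs = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

# Any two distinct words are orthogonal, so their similarity is always 0.
print(cosine(vecs["I"], vecs["you"]))         # 0.0
print(cosine(vecs["banana"], vecs["apple"]))  # 0.0
print(cosine(vecs["apple"], vecs["apple"]))   # 1.0

# Sparsity: with an assumed vocabulary of 10,000 words, all but one entry is 0.
size = 10_000
zero_fraction = one_hot(0, size).count(0) / size
print(zero_fraction)  # 0.9999
```

Dense word embeddings such as Word2Vec or GloVe avoid both issues by using far fewer dimensions and placing related words close together.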

TF-IDF (term frequency-inverse document frequency)

Definition

TF-IDF is a commonly used weighting technique in the field of information retrieval and text mining, which is used to evaluate how important a word is to one document within a document set or corpus.

The main idea

  • Term frequency (TF): The frequency with which a term appears in a text, usually normalized (the term's count divided by the total number of words in the text) to avoid a bias towards long texts.
  • Inverse Document Frequency (IDF): Calculated by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm. If a word appears in only a few documents, its IDF value is large, indicating that it has better category distinguishing ability.
  • Calculation formula: TF-IDF = TF * IDF, which combines the statistics of term frequency and inverse document frequency.
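
The definitions above translate directly into code. The following is a minimal sketch that assumes the plain, unsmoothed formulas exactly as described (TF is the count divided by the document length, IDF is the natural logarithm of the total number of documents divided by the number of documents containing the term). The three toy documents and the function names are invented for illustration; library implementations often use smoothed variants, so their numbers may differ.

```python
import math
from collections import Counter

# Toy corpus (invented), already tokenized by whitespace.
docs = [
    "the movie was great great fun".split(),
    "the movie was boring".split(),
    "the weather was great".split(),
]


def tf(term, doc):
    """Term frequency: count of the term divided by the document length."""
    return Counter(doc)[term] / len(doc)


def idf(term, corpus):
    """Inverse document frequency: log(N / number of docs containing term).

    Assumes the term occurs in at least one document of the corpus.
    """
    doc_freq = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / doc_freq)


def tf_idf(term, doc, corpus):
    """TF-IDF = TF * IDF."""
    return tf(term, doc) * idf(term, corpus)


# "the" occurs in every document, so its IDF and TF-IDF are 0.
print(tf_idf("the", docs[0], docs))                # 0.0
# "great" occurs twice in document 0 but also in document 2.
print(round(tf_idf("great", docs[0], docs), 4))    # 0.1352
# "boring" occurs only in document 1, so it gets the highest weight here.
print(round(tf_idf("boring", docs[1], docs), 4))   # 0.2747
```

The output shows the intended behavior: a word shared by all documents is filtered out, while a word unique to one document is weighted most heavily.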

Importance

  • Discriminative ability: Words that appear frequently in specific documents but are rare in the corpus are given high weight, which helps filter common words and highlight important words.
  • Widely used: Suitable for search engines, keyword extraction, text similarity assessment, text summarization, etc.

Disadvantages

  • Lack of semantic information: The simple structure of TF-IDF does not consider the semantic information of words and cannot effectively handle polysemy (one word with multiple meanings) or synonymy (multiple words with the same meaning).

Application scenarios

  1. Search Engine: Used to evaluate the importance of query keywords in the document.
  2. Keyword extraction: Extract the most representative words from the text.
  3. Text similarity: Compare the similarity of different documents.
  4. Text summary: Extract key information of the document as a summary.
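
As an illustration of the text similarity scenario, the sketch below compares documents by the cosine similarity of their TF-IDF vectors. It is a hedged example rather than a reference implementation: the three short documents are invented, the unsmoothed TF-IDF definition is assumed, and vectors are stored as sparse dictionaries (term to weight) purely for brevity.

```python
import math
from collections import Counter

# Toy corpus (invented): d1 and d2 share a topic, d3 does not.
corpus = {
    "d1": "the phone battery is great".split(),
    "d2": "the battery life of the phone is great".split(),
    "d3": "the weather is rainy today".split(),
}
all_docs = list(corpus.values())


def tfidf_vector(doc, docs):
    """Sparse TF-IDF vector of one document as a {term: weight} dict."""
    vec = {}
    for term, count in Counter(doc).items():
        doc_freq = sum(1 for d in docs if term in d)
        vec[term] = (count / len(doc)) * math.log(len(docs) / doc_freq)
    return vec


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = {name: tfidf_vector(doc, all_docs) for name, doc in corpus.items()}
sim_12 = cosine(vectors["d1"], vectors["d2"])
sim_13 = cosine(vectors["d1"], vectors["d3"])

print(round(sim_12, 2))  # 0.41
print(sim_13)            # 0.0
```

Documents d1 and d3 share only "the" and "is", which occur in every document and therefore have zero weight, so their similarity is 0 even though they have words in common.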

Origin blog.csdn.net/wtyuong/article/details/135006785