【BERTopic Application 3/3】Analyzing Qatar World Cup Twitter Data

Photo by Fauzan Saari on Unsplash

1. Description

        This is Part 3 of our analysis of World Cup Twitter data. We will run sentiment analysis on the tweets to find out how people feel about the World Cup in Qatar. A powerful toolkit I will cover here is Hugging Face, where you can find a wide range of models, tasks, and datasets; it also offers courses for those just starting out in machine learning. In this post, we will use a sentiment analysis model together with a Hugging Face token for our task.

2. Sentiment Analysis

      Sentiment analysis is the use of natural language processing (NLP) to identify, extract, and study emotional states and subjective information. We can apply this technique to customer reviews and survey responses to mine opinions about a product or service.

        Let's look at a few examples:

  • I love the weather today! Tag: positive
  • The weather forecast says it will be cloudy tomorrow. Tag: neutral
  • The rain won't stop, and our picnic plans were postponed. Ugh... Tag: negative

        The above examples clearly show the polarity of each sentence because the text has a simple structure. Here are some more challenging cases where the sentiment is not as easily detected.

  • I don't like rainy days. (negative)
  • I love running when it's windy but wouldn't recommend it to my friends. (conditional positive emotion, difficult to classify)
  • A good cup of coffee really takes time: it made me wait 30 minutes for a sip. (sarcasm)

        Now that we've covered what sentiment analysis is and where it can be applied, let's learn how to implement it on our Twitter data.

Meet " twitter-roberta-base-mood-latest "

        Twitter-roBERTa-base for Sentiment Analysis is a RoBERTa-base model trained on ~124M tweets from January 2018 to December 2021 and fine-tuned for sentiment analysis with the TweetEval benchmark. I won't go into the details of the RoBERTa-base model, but briefly, RoBERTa removes the next sentence prediction (NSP) task from the pre-training process and introduces dynamic masking, so the masked tokens change during training. For a more detailed review, I recommend reading Suleiman Khan's article and Chandan Durgia's article.
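
Before wiring up the API, it may help to see the model in action locally. Below is a minimal sketch using the transformers pipeline; it assumes the transformers library and a PyTorch backend are installed, and it downloads the model weights on first use (the score shown is illustrative):

from transformers import pipeline

# Minimal local sketch (assumes transformers + PyTorch are installed)
sentiment_pipe = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment-latest")
print(sentiment_pipe("I love the weather today!"))
# e.g. [{'label': 'positive', 'score': 0.98}]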

If you would rather not download any model weights, you can instead use the Inference API created by the Hugging Face team. The Inference API lets you test and evaluate 80,000+ machine learning models on tasks in NLP, audio, and computer vision. Check here for detailed documentation. The Inference API is easy to use: all you need is a Hugging Face token, which is free. First, create a Hugging Face account and sign in. Then, click on your profile and go to Settings.

         Go to Access Tokens and click New Token. When creating a new token, you will be asked to choose a role for it.
  • Read: use this role if you only need to read from the Hugging Face Hub (for example, when downloading a private model or running inference).
  • Write: use this role if you need to create or push content to a repository (for example, when training a model or modifying a model card).

User Access Tokens settings page (screenshots by the author)
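
Once you have a token, avoid hard-coding it into scripts you plan to share. One option is to read it from an environment variable; a small sketch (the variable name HF_TOKEN is my own choice, not part of the original tutorial):

import os

# Set the variable beforehand, e.g. `export HF_TOKEN=hf_...`
hf_token = os.environ["HF_TOKEN"]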

Now that we have everything we need, let's do some analysis!

3. Preparing Our Data

        First, we need to import some dependencies and load the data. Running sentiment analysis on all 10,000 tweets we have would take quite some time, so for demonstration purposes we will randomly sample 300 tweets from the pool.

import pandas as pd
import pickle
import requests
import random

with open('world_cup_tweets.pkl', 'rb') as f:
    data = pickle.load(f)  # DataFrame of preprocessed World Cup tweets

tweets = data.Tweet_processed.to_list()
tweets = random.sample(tweets, 300)  # random subset for a quicker demo
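
One optional tweak: if you want every run to pick the same 300 tweets, seed the random module before the random.sample call above (the seed value itself is arbitrary):

random.seed(42)  # call before random.sample for a reproducible subset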

4. Running the Analysis

        We then assign the model name and our Hugging Face token to two variables.

model = "cardiffnlp/twitter-roberta-base-sentiment-latest"
hf_token = "YOUR OWN TOKEN"

        Next, we define an analysis function that takes a single argument, data. The function posts the input to the Inference API as a JSON payload and returns the JSON response, which contains the model's predictions for that input.

API_URL = "https://api-inference.huggingface.co/models/" + model
headers = {"Authorization": "Bearer %s" % (hf_token)}

def analysis(data):
    # wait_for_model=True queues the request until the model has loaded,
    # rather than returning an error while it is still starting up
    payload = dict(inputs=data, options=dict(wait_for_model=True))
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()
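
As a quick sanity check, you can call the function on a single sentence. For this model the response is a list containing one list of label/score dictionaries, which is why the loop below indexes [0]. The scores here are illustrative, not actual output:

print(analysis("I love the weather today!"))
# [[{'label': 'positive', 'score': 0.98},
#   {'label': 'neutral', 'score': 0.01},
#   {'label': 'negative', 'score': 0.01}]]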

        We initialize an empty list to store the sentiment analysis result for each tweet, then loop over the tweets. Inside the loop we use a try-except block:

  • For tweets that can be analyzed, we call the analysis function we defined, passing the current tweet as input, and retrieve the first element of the returned list. The result is a list of dictionaries, each containing a sentiment label and a score. We use the built-in max function to find the dictionary with the highest score, and append a new dictionary to the tweets_analysis list containing the tweet and the label of its highest-scoring sentiment.
  • For tweets that cannot be analyzed, the except block catches any exception raised in the try block and prints an error message. Some tweets may fail sentiment analysis (for example, if the API returns an error), so this block handles those cases.
tweets_analysis = []
for tweet in tweets:
    try:
        sentiment_result = analysis(tweet)[0]
        top_sentiment = max(sentiment_result, key=lambda x: x['score']) # Get the sentiment with the highest score
        tweets_analysis.append({'tweet': tweet, 'sentiment': top_sentiment['label']})
 
    except Exception as e:
        print(e)
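
One caveat: the free Inference API is rate-limited, so with larger samples some requests in this loop may fail. A short pause between calls usually helps; the half-second delay below is an arbitrary starting point, not a documented limit:

import time

# Add inside the loop above, after each request
time.sleep(0.5)  # brief pause between API calls; tune as needed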

        We can then load the data into a dataframe and see some preliminary results.

# Load the data in a dataframe
df = pd.DataFrame(tweets_analysis)
 
# Show a tweet for each sentiment
print("Positive tweet:")
print(df[df['sentiment'] == 'positive']['tweet'].iloc[0])
print("\nNeutral tweet:")
print(df[df['sentiment'] == 'neutral']['tweet'].iloc[0])
print("\nNegative tweet:")
print(df[df['sentiment'] == 'negative']['tweet'].iloc[0])
# Outputs: (edited by author to remove vulgarity)

Positive tweet:
Messi, you finally get this World Cup trophy. Happy ending and you are officially called球王  

Neutral tweet:
Nicholas the Dolphin picks 2022 World Cup Final winner     

Negative tweet:
Yall XXXX and this XXXX world cup omg who XXXX CARESSS

We can also use the groupby function to count how many tweets in the sample are positive, neutral, or negative.

sentiment_counts = df.groupby(['sentiment']).size()
print(sentiment_counts)

# Outputs: 
sentiment
negative     46
neutral      63
positive    166
dtype: int64
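
Note that the counts sum to 275 rather than 300: the tweets that raised exceptions in the loop were skipped. If proportions are easier to read than raw counts, value_counts with normalize=True gives the same breakdown as fractions:

print(df['sentiment'].value_counts(normalize=True).round(3))

# Outputs:
# positive    0.604
# neutral     0.229
# negative    0.167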

While we're at it, let's use a pie chart to visualize the results:

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(6,6), dpi=100)
ax = plt.subplot(111)
sentiment_counts.plot.pie(ax=ax, autopct='%1.1f%%', startangle=270, fontsize=12, label="")
plt.show()  # display the chart when running as a script

Sentiment distribution pie chart (created by the author)

It seems that most people are happy with the World Cup in Qatar. Great!

What are people talking about in positive and negative tweets? We can use a word cloud to display the keywords in these groups.

# pip install wordcloud first if it is not already installed in your environment

from wordcloud import WordCloud
from wordcloud import STOPWORDS
 
# Wordcloud with positive tweets
positive_tweets = df[df['sentiment'] == 'positive']['tweet']
stop_words = ["https", "co", "RT"] + list(STOPWORDS)
# Join the tweets into one string; str() on a Series would leak index numbers into the cloud
positive_wordcloud = WordCloud(max_font_size=50, max_words=50, background_color="white", stopwords=stop_words).generate(" ".join(positive_tweets))
plt.figure()
plt.title("Positive Tweets - Wordcloud")
plt.imshow(positive_wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

# Wordcloud with negative tweets
negative_tweets = df[df['sentiment'] == 'negative']['tweet']
stop_words = ["https", "co", "RT"] + list(STOPWORDS)
negative_wordcloud = WordCloud(max_font_size=50, max_words=50, background_color="white", stopwords=stop_words).generate(" ".join(negative_tweets))  # same join as above
plt.figure()
plt.title("Negative Tweets - Wordcloud")
plt.imshow(negative_wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

Positive and negative tweet word clouds (created by the author)

5. Summary

        By now I hope you have learned how to use the Hugging Face Inference API to perform sentiment analysis on tweets. It is a powerful tool that applies across many fields. Follow me for more ideas and techniques.

Origin: blog.csdn.net/gongdiwudu/article/details/132287137