【BERTopic Application 01/3】Analyzing Qatar World Cup Twitter Data

Photo by: Rhett Lewis on Unsplash

1. Description

        The Qatar World Cup was full of surprises! From Saudi Arabia shocking the world by beating Argentina to Morocco's historic run to the semi-finals, the tournament was packed with moments of football fever. In this post, I will use BERTopic to analyze tweets posted during the 2022 World Cup. Let's see what the most popular topics related to the World Cup are and whether we can make sense of them.

2. Prepare data

        First, we need to retrieve data from social media. This time, our research object is text data from Twitter. To scrape tweets, we'll use snscrape, a scraper for social networking services such as Twitter, Facebook, and Reddit. To install the development version:

pip install git+https://github.com/JustAnotherArchivist/snscrape.git

        With this scraper, we can fetch Twitter user profiles, hashtags, searches (live tweets, top tweets, and users), tweets (single or surrounding thread), list posts, communities, and trends. To start scraping, import the required modules:

# Get tweets using SNSCRAPE 
import snscrape.modules.twitter as sntwitter
import pandas as pd

        Then, let's get some tweets written in English that contain the search term "world cup", posted between 2022-11-20 and 2022-12-18. NOTE: since we only want original tweets and not replies, we filter replies out. We also need to be careful with the time frame. The period is set with: until:2022-12-19_00:00:00_AST since:2022-11-20_00:00:00_AST. Because the time in the until clause is exclusive, we set it to 2022-12-19 00:00:00 so that all of 2022-12-18 (the day of the final) is included. AST stands for Arabia Standard Time, Qatar's local time zone; if we don't append _AST, the times are interpreted as UTC.

# Get 10,000 tweets containing search term: world cup within a certain period of time
query = "(world cup) lang:en until:2022-12-19_00:00:00_AST since:2022-11-20_00:00:00_AST -filter:replies"
tweets = []
limit = 10000


for tweet in sntwitter.TwitterSearchScraper(query).get_items():
    if len(tweets) == limit:
        break
    # In recent snscrape versions the username lives under tweet.user
    tweets.append([tweet.date, tweet.id, tweet.user.username, tweet.content])

# Store tweets under a data frame        
df = pd.DataFrame(tweets, columns=['Date', 'Id', 'User', 'Tweet'])

3. Preprocessing data

        When we are dealing with unstructured text data, one of the things we need to do before running any analysis is to preprocess the data. Here, we will remove all URLs, emoji, and line breaks in the tweet. I recommend printing out some samples after each round of data cleaning to see if we are manipulating the text the way we want. Also, when you're done with all the data cleaning, I recommend using pickle to save the final version. If we need to perform a different task with the dataset, we can simply load the pickled data instead of scraping from scratch again.

import re

# Round 1: Remove URLs
df['Tweet'] = df['Tweet'].str.replace(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
    ' ', regex=True)
# df.head()

# Round 2: Remove emoji and other stray symbols
df['Tweet'] = df['Tweet'].str.replace(r'[^\w\s#@/:%.,_-]', '', flags=re.UNICODE, regex=True)
# df.head()

# Round3: Remove newlines:\n
df['Tweet_processed'] = df['Tweet'].replace('\n','', regex=True)
# df.head()

# Store the filtered tweets in a new data frame
df_new = df.drop('Tweet', axis=1)
# df_new.head()

df_new.to_pickle('world_cup_tweets.pkl')
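To sanity-check the cleaning patterns before pickling, here is a standalone sketch that applies the same three rounds to a made-up tweet (the sample text and function name are hypothetical):

```python
import re

URL_PATTERN = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
SYMBOL_PATTERN = r'[^\w\s#@/:%.,_-]'

def clean_tweet(text):
    text = re.sub(URL_PATTERN, ' ', text)                      # Round 1: URLs
    text = re.sub(SYMBOL_PATTERN, '', text, flags=re.UNICODE)  # Round 2: emoji and stray symbols
    return text.replace('\n', '')                              # Round 3: newlines

sample = "What a final! \U0001F410 https://example.com/messi\n#WorldCup"
print(clean_tweet(sample))
```

Printing a before/after pair like this after each round makes it easy to catch a regex that eats more than intended.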

4. Use BERTopic

BERTopic is a topic modeling technique that leverages transformers and a class-based TF-IDF (c-TF-IDF) to create dense clusters, allowing for easy-to-interpret topics while keeping important words in the topic descriptions. To install BERTopic:

pip install bertopic

        We then pass the tweets to BERTopic using the following code. Please be patient, as this step may take a while depending on how many tweets you scraped and whether you are running on a CPU (slower) or GPU (faster).

texts = df_new['Tweet_processed']

# Set the language to English. Other language models are available as well.
from bertopic import BERTopic
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(texts)

5. Extract the theme

        After fitting the model, we can inspect some results. First, we can examine the 10 most common themes:

freq = topic_model.get_topic_info(); freq.head(11)
Top ten themes

Topic -1 groups all outlier documents and should be ignored. Next, let's look at the most common theme generated:

topic_model.get_topic(0)  # Select the most frequent topic, which is topic 0
# Result 
[('qatar', 0.03543929861004061),
 ('hosting', 0.01386799885558573),
 ('best', 0.012979749598061752),
 ('qatar2022', 0.011693988492318397),
 ('thank', 0.01145218957158738),
 ('ever', 0.011044699651033682),
 ('tournament', 0.008304478567760317),
 ('hosted', 0.007997848070012806),
 ('you', 0.007665018845225487),
 ('the', 0.007350426929036396)]

        As we can see, the topic refers to football fans expressing their gratitude to the host country, Qatar. Although the World Cup in Qatar was controversial from start to finish, in the end fans appreciated the hosts' efforts. We can also look at three representative tweets under topic 0:

topic_model.get_representative_docs(0)
# Result
['Critics sceptical of Qatars carbon neutrality claim at #WorldCup  ',
 'Apart from Messi being the highlight of the WC, this tournament has been a huge success for Qatar itself. From the stunning venues to the welcoming and hospitable atmosphere, the tournament has truly shone, making this World Cup truly one to remember. Congratulations Qatar  ',
 'With the World Cup over, that means no more Tracey Holmes doing Qatar propaganda']

6. Visualize the top 10 topics

        We can get a clearer picture of the 10 most common themes using:

topic_model.visualize_barchart(top_n_topics=10)
Top 10 Topics for Visualization

7. Load data

        In section 3 above, we saved the cleaned data under the name "world_cup_tweets.pkl". Now we can unpickle it, using:

import pandas as pd
import pickle
with open('world_cup_tweets.pkl', 'rb') as f:
    data = pickle.load(f)
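As a side note, pandas ships its own one-liner for this. A minimal round-trip sketch with a toy frame (the file name and contents here are hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({'Date': ['2022-12-18'],
                     'Tweet_processed': ['What a final #WorldCup']})
demo.to_pickle('demo_tweets.pkl')

# pd.read_pickle is equivalent to opening the file and calling pickle.load
loaded = pd.read_pickle('demo_tweets.pkl')
print(loaded.equals(demo))
```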

8. Dynamic Topic Modeling

"Dynamic Topic Modeling (DTM) is a set of techniques designed to analyze changes in topics over time. These methods allow you to understand how a topic has been represented over time .

        To represent the different time periods, we create a list of tweets and a list of their corresponding posting times. Then we create and train a BERTopic model, just as we did earlier:

timestamps = data.Date.to_list()
tweets = data.Tweet_processed.to_list()

from bertopic import BERTopic
topic_model = BERTopic(language="english", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(tweets)

        Now we call topics_over_time and pass in the tweets and timestamps. Note that a bin represents a range of contiguous values used to group points in the chart. If the bin width is too large, we don't get enough differentiation; if it's too small, the data isn't grouped properly. Here we set the number of bins to 20 and then visualize the top 10 topics:

topics_over_time = topic_model.topics_over_time(tweets, timestamps, nr_bins=20)
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10)
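To get a feel for what nr_bins does, here is a standalone sketch that slices the same date range into 20 equal-width bins using plain pandas (no BERTopic needed; the synthetic timestamps are illustrative):

```python
import pandas as pd

# One synthetic timestamp per tournament day, 2022-11-20 through 2022-12-18
timestamps = pd.Series(pd.date_range('2022-11-20', '2022-12-18', freq='D'))

# Cut the range into 20 equal-width time bins, analogous to nr_bins=20
binned = pd.cut(timestamps, bins=20)
print(binned.nunique())
```

Each tweet's topic assignment is then aggregated per bin, which is what produces the lines in the chart below.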

9. Results

Top 10 Topics Over Time

        From this graph, we can easily see how the different themes emerge and fade over time. Isn't that convenient?

10. Summary

        In this post, we learned how to scrape tweets using snscrape and how to use BERTopic to model topics from the Qatar World Cup case study. The results showed that fans generally recognized Qatar's efforts and were surprised by the thrilling final between Argentina and France.

        BERTopic has more amazing features beyond Dynamic Topic Modeling (DTM), and they are well worth exploring. Goodbye for now, stay tuned!


Origin blog.csdn.net/gongdiwudu/article/details/132287094