Text mining in practice: what are people abroad doing at home during virus isolation?

Using text mining to explore what people are doing during the coronavirus lockdown, with an analysis of their feelings and emotions

As more and more countries announce lockdowns, most people have been asked to stay at home in isolation. I wanted to see how people abroad are spending this "time off" and how they feel about it, so in this article I analyze a set of tweets to find out what they are actually doing.

Data acquisition and pre-processing

For data collection, I used the txxxR library to extract 20,000 tweets with the "#quarantine" and "#stayhome" hashtags from Twitter.

After importing the data into R, we need to pre-process the tweets and split the text into individual words (tokens) for analysis.

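# Packages the pipeline relies on (`tweets` is the data frame of imported tweets)
library(dplyr)      # select, mutate, filter, %>%
library(stringr)    # str_replace_all, regex
library(textclean)  # replace_non_ascii (assumed source; qdap also provides it)
library(tidytext)   # unnest_tokens, stop_words, get_sentiments

# Clean the tweet text, then split it into one word per row
tweet_words <- tweets %>%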
  select(id,
         screenName,
         text,
         created) %>%
  mutate(created_date = as.POSIXct(created, format="%m/%d/%Y %H")) %>%
  mutate(text = replace_non_ascii(text, replacement = "", remove.nonconverted = TRUE)) %>%
  mutate(text = str_replace_all(text, regex("@\\w+"),"" )) %>%
  mutate(text = str_replace_all(text, regex("[[:punct:]]"),"" )) %>%
  mutate(text = str_replace_all(text, regex("http\\w+"),"" )) %>%
  unnest_tokens(word, text)

Removing common words and stop words from the data

After the data set has been pre-processed and tokenized, we need to remove stop words that are useless for analysis, such as "for", "the", "an" and so on.

#Remove stop words
my_stop_words <- tibble(
  word = c(
    "https","t.co","rt","amp","rstats","gt",
    "cent","aaya","ia","aayaa","aayaaaayaa","aaaya"
  ),
  lexicon = "txxxxr"
)

#Prepare stop words tibble
all_stop_words <- stop_words %>%
  bind_rows(my_stop_words)

#Remove numbers
suppressWarnings({
  no_numbers <- tweet_words %>%
    filter(is.na(as.numeric(word)))
})

#Anti-join the stop words and tweets tibbles
no_stop_words <- no_numbers %>%
  anti_join(all_stop_words, by = "word")

We can also use the following code to quickly check how many stop words were removed from the data set:

tibble(total_words = nrow(tweet_words),
  after_cleanup = nrow(no_stop_words)
)

The results are as follows:

In the output, the number on the right (155,940) is the number of tokens remaining after removing stop words.

Now that our data cleaning is complete, the data is ready for analysis.

Word frequency analysis

A common text mining method is to look at word frequency. First, let us look at the words most commonly used in the tweets.

The top five words are:

 quarantine - 13,358 occurrences
 Covid19 - 1,628 occurrences
 coronavirus - 1,566 occurrences
 day - 1,200 occurrences
 home - 1,122 occurrences

Clearly, quarantine is tied to the coronavirus / COVID-19 situation: people are staying at home to avoid exposure to the virus.
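
The counts above come from tallying tokens. The following is a minimal base-R sketch of that tally on made-up tokens. The `words` vector is illustrative and is not the tweet data. With dplyr, `count(word, sort = TRUE)` on `no_stop_words` should give the equivalent table.

```r
# Toy tokens standing in for no_stop_words$word (illustrative data only)
words <- c("quarantine", "home", "covid19", "quarantine", "day",
           "home", "quarantine", "coronavirus")

# Count each distinct word, then sort from most to least frequent
word_counts <- sort(table(words), decreasing = TRUE)

# Keep the two most frequent words as a data frame with columns word and n
top_words <- data.frame(
  word = names(word_counts)[1:2],
  n = as.integer(word_counts)[1:2],
  stringsAsFactors = FALSE
)
print(top_words)
```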

#Unigram word cloud
no_stop_words %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100, random.order = FALSE, scale = c(4, 0.7),
                 colors = brewer.pal(8, "Dark2"), random.color = TRUE))

The most common positive and negative words

After obtaining the word frequencies, we can use a sentiment lexicon to assign a label (positive or negative) to each word; the code for this step uses the "bing" lexicon. Then we can draw a word cloud split by label.

The word cloud shows that most people feel stressed and bored during isolation. On the plus side, people are also sending friendly messages, telling others to stay safe and healthy.

#Positive and negative terms word cloud
no_stop_words %>%
  inner_join(get_sentiments("bing"), by = c("word" = "word")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  # brewer.pal() needs n >= 3, so request 3 colors and keep the first 2
  comparison.cloud(colors = brewer.pal(3, "Dark2")[1:2],
                   max.words = 100)

Sentiment analysis

Sentiment analysis helps us identify the opinions expressed in text data. It helps us understand attitudes and feelings about a particular topic.

Sentiment ranking extracted from the tweets

Although people are worried about the coronavirus, most still maintain a positive attitude. Surprisingly, people posted more positive words than negative words during isolation.

#Sentiment ranking
nrc_words <- no_stop_words %>%
  inner_join(get_sentiments("nrc"), by = "word")

sentiments_rank <- nrc_words %>%
  group_by(sentiment) %>%
  tally %>%
  arrange(desc(n))

#ggplot
sentiments_rank %>%
  #count(sentiment, sort = TRUE) %>%
  #filter(n > 700) %>%
  mutate(sentiment = reorder(sentiment, n)) %>%
  ggplot(aes(sentiment, n)) +
  geom_col(fill = "mediumturquoise") +
  xlab(NULL) +
  coord_flip() +
  ggtitle("Sentiment Ranking") +
  geom_text(aes(x = sentiment, label = n), vjust = 0, hjust = -0.3, size = 3)

Emotional introspection - finding out people's emotions

Using the "NRC" lexicon, we can also tag words with eight types of emotion, in addition to positive and negative sentiment.

After assigning the labels, we can group the words by emotion and generate a word frequency chart. Note that some words, such as music and money, appear under more than one emotion label.

Some insights from the emotion labels:

 During this period, people are striving for money, (no) birthdays, music and art
 People are talking about the government: Congress and agreements

#Ten types of emotion chart
tweets_sentiment <- no_stop_words %>%
  inner_join(get_sentiments("nrc"), by = c("word" = "word"))

tweets_sentiment %>%
  count(word, sentiment, sort = TRUE)

#ggplot
tweets_sentiment %>%
  # Count by word and sentiment
  count(word, sentiment) %>%
  # Group by sentiment
  group_by(sentiment) %>%
  # Take the top 10 words for each sentiment
  top_n(10, n) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ sentiment, scales = "free") +
  coord_flip() +
  ggtitle("Word frequency based on emotion")

Visualizing word relationships

In text mining, visualizing word relationships is very important. By arranging words into a "network" diagram, we can see how the words in the data set are connected to each other.

First, we need to tokenize the data set into bigrams (pairs of consecutive words). We can then arrange the word pairs as connected nodes and visualize them.

Figure: network graph of the "quarantine" data set

#Tokenize the dataset into bigrams
tweets_bigrams <- tweets %>%
  select(id,
  #       screenName,
         text,
         created) %>%
  mutate(created_date = as.POSIXct(created, format="%m/%d/%Y %H")) %>%
  mutate(text = replace_non_ascii(text, replacement = "", remove.nonconverted = TRUE)) %>%
  mutate(text = str_replace_all(text, regex("@\\w+"),"" )) %>%
  mutate(text = str_replace_all(text, regex("[[:punct:]]"),"" )) %>%
  mutate(text = str_replace_all(text, regex("http\\w+"),"" )) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

#Separate the bigrams into unigrams
bigrams_separated <- tweets_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")

#Remove stop words
bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% all_stop_words$word) %>%
  filter(!word2 %in% all_stop_words$word)

#Combine bigrams together
bigrams_united <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")
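
The code above stops at the cleaned bigrams; the step that turns them into network edges is not shown in the source. As a rough sketch of that idea, the base-R snippet below builds bigrams from made-up texts and counts them into an edge list. The `texts` vector, the `make_bigrams` helper and the `edges` data frame are illustrative names, not the author's code. An edge list of this shape is what a graph library would typically consume, for example `igraph::graph_from_data_frame()`.

```r
# Toy texts standing in for cleaned tweets (illustrative data only)
texts <- c("stay home stay safe", "stay home and read", "stay safe everyone")

# Hypothetical helper: split one text into pairs of consecutive words
make_bigrams <- function(text) {
  w <- strsplit(tolower(text), "\\s+")[[1]]
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}

# Collect all bigrams and count how often each one occurs
bigrams <- unlist(lapply(texts, make_bigrams))
bigram_counts <- sort(table(bigrams), decreasing = TRUE)

# Edge list: one row per distinct bigram, weighted by its count
edges <- data.frame(
  from = sub(" .*$", "", names(bigram_counts)),
  to = sub("^.* ", "", names(bigram_counts)),
  n = as.integer(bigram_counts),
  stringsAsFactors = FALSE
)
print(edges)
```

In a network plot, each distinct word becomes a node and `n` can set the edge weight, so frequent pairs such as "stay home" stand out.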

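Some interesting insights from the network graph:

 People are keeping diaries on Twitter during quarantine
 During quarantine, people are listening to Lee Morgan's jazz music
 During quarantine, Jojo's live performances are becoming increasingly popular
 Self-isolation is one way to fight Covid-19, and people are interested in health tips and stress-relief tips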

Word correlation analysis - how is social distancing perceived?

Social isolation can be emotionally challenging, so I wanted to learn more about how people feel during this period.

Word correlation lets us measure how often a pair of words appears together in the data set. It gives us a more specific understanding of a word and its associations with other words.
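
The source does not say which correlation measure was used. A common choice for word co-occurrence is the phi coefficient, which is what `pairwise_cor()` in the widyr package computes. The sketch below assumes that measure and works it out in base R on made-up tweets; `tweets_tokens`, `has_word` and `phi` are illustrative names.

```r
# Toy tweets, each already split into tokens (illustrative data only)
tweets_tokens <- list(
  c("bored", "tiktok", "home"),
  c("bored", "tiktok"),
  c("stress", "tips"),
  c("bored", "games"),
  c("stress", "home")
)

# 1 if the word appears in a tweet, 0 otherwise, for every tweet
has_word <- function(w) {
  as.integer(vapply(tweets_tokens, function(toks) w %in% toks, logical(1)))
}

# Pearson correlation of two 0/1 indicators equals the phi coefficient
phi <- function(a, b) cor(has_word(a), has_word(b))

print(phi("bored", "tiktok"))  # positive: the words tend to share tweets
print(phi("bored", "stress"))  # negative: the words never share a tweet
```

A value near 1 means two words almost always appear in the same tweets, a value near -1 means one appears only when the other does not, and a value near 0 means no clear association.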

From the word cloud, we know that "stress" and "bored" appear often in our data set. So I picked three words, "bored", "stress" and "stuck", to see which words correlate with them.

Figure: word correlations for feelings about staying at home during isolation

Insights from the correlations for "bored", "stress" and "stuck":

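 When bored, people use TikTok (the international version of Douyin) and games to pass the time
 Boredom pretty much sums up most people's lives in 2020
 Feeling stressed, people are looking online for tips to reduce stress
 While "stuck" at home, people watch horror movies and series on Netflix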

Word correlation analysis - so what do people do at home?

To understand how people spend their time while quarantined at home, I picked three words, "play", "read" and "watch", for more insight.

Figure: word correlations for activities while staying at home during isolation

Insights from the correlations for "play", "read" and "watch":

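 Most people probably spend their time playing games and watching movies and videos
 People spend time reading to their children
 People also finally have time to read during this period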

Word correlation analysis - birthdays, money and community ...

Three words appear often in the emotion chart: "birthday", "community" and "money". So I studied the correlation between these words and other terms.

Insights from the correlations for "birthday", "community" and "money":

 Birthday parties are cancelled. Instead, people post their wishes on Twitter
 People agree that money cannot stop us from being infected with the virus

Conclusion

We can now understand how people feel during this coronavirus lockdown and what they are doing while still following the rules of social isolation.

Some of the key insights we extracted include:

 People feel heavily stressed by the coronavirus situation but remain positive
 TikTok and Netflix are widely used during this stay-at-home and quarantine period
 People are spending more time on their children, art, music and movies

Finally: the above is based mainly on research, data science and machine learning. We are not health professionals or epidemiologists, so the views in this article should not be construed as professional advice.

Original address: https://imba.deephub.ai/p/96b84f50722111ea90cd05de3860c663

Origin www.cnblogs.com/deephub/p/12597944.html