[Recommended reading] Consumer insights based on text data

Author introduction
@edan

Former business data analyst, current TMD data product manager.

I look forward to doing interesting things with fellow data practitioners~

01 What is consumer insight?

As society develops, the environment Chinese consumers live in is changing, and so are their attitudes toward consumption: people have shifted from merely getting by to actively managing and enjoying their lives. In a rapidly changing market, many factors affect the growth of consumer brands. A key one is understanding consumers deeply and responding promptly to changes in their psychology and behavior. This is what we usually call "consumer insight".

We can borrow the Laddering model to understand consumer insight: "Analyze consumers' preferences and motivations at different levels, such as product attributes, functional benefits, emotional benefits, and values, and understand the factors that win their favor."

In the Internet field, consumer insight (i.e., user research) is an important part of product work. Only a deep understanding of users' behavior habits and the needs behind them can deliver a good user experience.

In marketing, too, consumer insight is the starting point for every marketing action. Product positioning needs to explore the needs of target groups in different scenarios; product promotion needs messaging that matches consumer needs in order to reach consumers; and products already on the market need their health diagnosed through word-of-mouth analysis.

02 Consumer observation based on text big data

Before the Internet age, consumer insight was developed mainly through questionnaires combined with user interviews. The advantage of this approach is that you can ask exactly what you want to know. Its problems are equally obvious: the sample size is small, and what users say may not reflect what they really think.

In the Internet era, a large volume of consumer expression already exists online: Weibo, e-commerce reviews, forum posts (such as Baby Tree), and even specialized medical consultation platforms such as Good Doctor. This provides more fertile "soil" for consumer insight:

1) The sample size is larger. In past questionnaire surveys, a few hundred samples already counted as large, whereas the number of consumers available for online research is on the order of 100 million;

2) The scenarios are richer. For example, a questionnaire is unlikely to ask about consumers using potato chips for cooking;

3) The expression is more authentic. It is not an answer steered by a question, but what consumers say of their own accord.

Therefore, consumer insights based on online texts have been widely recognized by brands.

03 From text data to insights

Taking the diaper category in the maternal and infant industry as an example, I will introduce how to derive consumer insights from text big data.

1) Determine the target group and capture relevant data

In the diaper market, although the users are babies aged 0-3, the real consumers are mothers, and mothers start paying attention to diapers from the beginning of pregnancy. Our target users for analysis are therefore mothers from pregnancy through the baby's first year. To obtain the target users' online comments, the author used a web crawler to collect mothers' data from relevant mother-and-baby forums. The data include basic profile information, text data (posts, Q&A), and the mothers' follow relationships, as shown below.

2) Structure the text data by tagging

Take the sentence "Kao diapers are really a bit thick". It contains two dimensions of information: "Kao diapers" and "a bit thick". To extract these two dimensions, the author built a keyword lexicon covering the different dimensions of information. If a sentence contains a keyword from a dimension, the sentence receives that dimension's label.

For example, suppose the [diaper brand-Kao] dimension in the lexicon contains three keywords: Kao, kao, and Miaoershu (the Chinese name of Kao's Merries diapers). The three sentences "Kao is a bit thick", "kao's diapers are a bit thick", and "Miaoershu is a bit thick" each match a keyword in the Kao dimension, so they all carry the Kao brand information point. The details are explained below~

(1) Build the lexicon

(A) Build a preliminary lexicon framework from professional information. For example, to build a lexicon for the diaper field, you can first collect product information from e-commerce websites. Below are the brand information and functional features of Pampers that can be crawled from JD.com. Combining these with some industry experience, the author sorted out an initial framework for the lexicon and took these official expressions as its initial dimensions.

(B) Use NLP word segmentation to expand and supplement the lexicon. Randomly sample a certain amount of text and segment each sentence with the jieba package in Python. Going through the words from highest to lowest frequency, place each keyword into the matching dimension: add keywords to existing dimensions, and create a new dimension when none fits. The result is a relatively complete lexicon.
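
The segmentation and word-frequency step might look roughly like the sketch below. It assumes jieba's `lcut` function for segmentation and falls back to whitespace splitting when jieba is not installed; the helper names (`default_tokenizer`, `word_frequencies`), the `min_len` filter, and the sample sentences are illustrative assumptions, not the author's actual script.

```python
from collections import Counter


def default_tokenizer():
    """Use jieba.lcut when jieba is installed; otherwise split on whitespace."""
    try:
        import jieba  # third-party package: pip install jieba
        return jieba.lcut
    except ImportError:
        return str.split  # fallback so the sketch still runs without jieba


def word_frequencies(texts, tokenize=None, min_len=2):
    """Count tokens across all texts, returned from most to least frequent."""
    tokenize = tokenize or default_tokenizer()
    counts = Counter()
    for text in texts:
        tokens = (tok.strip() for tok in tokenize(text))
        # Skip very short tokens (assumed filter; many are particles or punctuation).
        counts.update(tok for tok in tokens if len(tok) >= min_len)
    return counts.most_common()


# Invented sample texts; a real run would use a random sample of forum posts.
sample = ["Kao diapers are a bit thick", "Kao diapers feel soft"]
for word, count in word_frequencies(sample, tokenize=str.split)[:5]:
    print(word, count)
```

The high-frequency words at the top of this list are the candidates to review by hand and assign to dimensions.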

To match consumer expressions more flexibly, regular expression patterns can be used in place of plain keywords. The final form of the lexicon is as follows, where tagname is the name of the dimension and keywords is the regular expression for its keywords.
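
As a minimal sketch of such a lexicon, the rows below pair a tagname with a keywords regular expression and match them against a sentence. The field names come from the description above; the dimension names, the patterns themselves, and the `tag_sentence` helper are hypothetical examples in English rather than the author's real entries.

```python
import re

# Hypothetical lexicon rows: tagname is the dimension, keywords is a regex.
LEXICON = [
    {"tagname": "brand-Kao", "keywords": r"kao|merries"},
    {"tagname": "brand-Pampers", "keywords": r"pampers"},
    {"tagname": "feature-thickness", "keywords": r"\b(thick|thin)\b"},
    {"tagname": "feature-breathability", "keywords": r"breathab\w*|stuffy"},
]

# Compile once so repeated matching over many sentences stays cheap.
COMPILED = [
    (row["tagname"], re.compile(row["keywords"], re.IGNORECASE))
    for row in LEXICON
]


def tag_sentence(sentence):
    """Return the dimension labels whose pattern occurs in the sentence."""
    return [tag for tag, pattern in COMPILED if pattern.search(sentence)]


print(tag_sentence("Kao diapers are really a bit thick"))
# ['brand-Kao', 'feature-thickness']
```

The word boundaries in the thickness pattern keep it from firing on words such as "think", which is the kind of wrong match the manual review step is meant to catch.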

(C) Manually check sample data to review the coverage rate and accuracy rate of the lexicon. Randomly draw 1,000 texts, read through each one, add keywords that were missed, and correct keywords that matched wrongly (tip: using software to highlight the matched words greatly improves review efficiency).

Coverage rate = the number of texts in which all key information points are covered / the number of texts spot-checked;

Accuracy rate of a dimension = the number of texts correctly labeled for that dimension / the number of texts that hit the dimension.

When the coverage rate is >90% and the overall accuracy rate of the lexicon is >90%, the lexicon can be put into use.
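
The two rates and the 90% threshold can be computed from the manual review results along these lines. The record layout (`all_points_covered`, `labels`) is a hypothetical way to store a reviewer's judgments, and treating "overall accuracy" as correct labels pooled over all dimension hits is an assumption, since the text does not define it.

```python
def coverage_rate(reviews):
    """Share of reviewed texts whose key information points were all covered."""
    covered = sum(1 for r in reviews if r["all_points_covered"])
    return covered / len(reviews)


def accuracy_by_dimension(reviews):
    """Per dimension: correctly labeled texts / texts that hit the dimension."""
    hits, correct = {}, {}
    for r in reviews:
        for dim, is_correct in r["labels"].items():
            hits[dim] = hits.get(dim, 0) + 1
            correct[dim] = correct.get(dim, 0) + int(is_correct)
    return {dim: correct[dim] / hits[dim] for dim in hits}


def overall_accuracy(reviews):
    """Assumed definition: correct labels pooled over all dimension hits."""
    judgments = [ok for r in reviews for ok in r["labels"].values()]
    return sum(judgments) / len(judgments)


def lexicon_ready(reviews, threshold=0.9):
    """Apply the >90% rule to both coverage and overall accuracy."""
    return coverage_rate(reviews) > threshold and overall_accuracy(reviews) > threshold


# Invented review records: one dict per spot-checked text.
reviews = [
    {"all_points_covered": True, "labels": {"brand-Kao": True, "thickness": True}},
    {"all_points_covered": True, "labels": {"brand-Kao": True}},
    {"all_points_covered": False, "labels": {"brand-Kao": False}},
    {"all_points_covered": True, "labels": {"thickness": True}},
]
print(coverage_rate(reviews), accuracy_by_dimension(reviews), lexicon_ready(reviews))
```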

(2) Tag the text using the lexicon

Write a small Python script that takes the lexicon as input and outputs the labeled data. The basic steps are as follows:

Input text file -> split each text into sentences by certain rules (for example, at periods or semicolons) -> label each sentence using the lexicon -> produce the label data.

The results take the form of the following table (sessionid is the id of the segmented short sentence). Based on the label data, cross-dimension analysis can be done.
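
The steps above can be sketched as a short script. The sessionid column follows the description in the text; the other column names, the `textid-sentence number` id format, the splitting rule, and the lexicon entries are assumptions made for illustration.

```python
import re

# Hypothetical lexicon: tagname is the dimension, keywords is a regex.
LEXICON = [
    {"tagname": "brand-Kao", "keywords": r"kao|merries"},
    {"tagname": "brand-Pampers", "keywords": r"pampers"},
    {"tagname": "feature-thickness", "keywords": r"\b(thick|thin)\b"},
    {"tagname": "feature-absorbency", "keywords": r"absorb\w*"},
]


def split_sentences(text):
    """Split a text into short sentences at periods, semicolons and similar marks."""
    parts = re.split(r"[.;!?。；！？]+", text)
    return [p.strip() for p in parts if p.strip()]


def label_texts(texts, lexicon):
    """Return one row per (short sentence, matched dimension)."""
    rows = []
    for text_id, text in enumerate(texts, start=1):
        for n, sentence in enumerate(split_sentences(text), start=1):
            session_id = f"{text_id}-{n}"  # assumed id format
            for entry in lexicon:
                if re.search(entry["keywords"], sentence, re.IGNORECASE):
                    rows.append({
                        "textid": text_id,
                        "sessionid": session_id,
                        "tagname": entry["tagname"],
                        "sentence": sentence,
                    })
    return rows


# Invented input; a real run would read the crawled posts from a file.
texts = ["Kao diapers are a bit thick. Pampers absorbs well; will buy again"]
for row in label_texts(texts, LEXICON):
    print(row["sessionid"], row["tagname"])
```

Because a brand label and a feature label share the same sessionid when they occur in the same short sentence, grouping rows by sessionid gives the brand-feature pairs needed for cross-dimension analysis.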

(3) Sentiment recognition

Sentiment recognition is done mainly with machine learning models for sentiment classification. Based on the lexicon labels, the corresponding "entity-feature" pairs (such as "Kao-breathability") can be captured from the text. We then sample a certain amount of data for manual sentiment annotation (negative/positive/neutral). Finally, the annotated data are used to train the model, which then predicts sentiment for more text data.
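
The text does not say which model was used. Purely as an illustration of the train-then-predict loop, the sketch below implements a small multinomial naive Bayes classifier with add-one smoothing in plain Python; the model choice, the whitespace tokenization, and the six annotated sentences are all assumptions, and a real project would need far more annotated data.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSentiment:
    """Multinomial naive Bayes with add-one smoothing over whitespace tokens."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for word in text.lower().split():
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented, manually "annotated" sentences (negative/positive/neutral).
train = [
    ("kao breathability is great no rash", "positive"),
    ("very breathable and soft", "positive"),
    ("kao breathability is poor baby got rash", "negative"),
    ("stuffy and leaks", "negative"),
    ("bought kao diapers yesterday", "neutral"),
    ("diapers arrived today", "neutral"),
]
model = NaiveBayesSentiment().fit([t for t, _ in train], [y for _, y in train])
print(model.predict("soft and breathable"))
```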

3) Data analysis

Once the text data is structured, the author can mine it to analyze consumers. The following examples cover demand analysis of the category market and analysis of differences in brand recognition.

(1) Demand analysis of category market

By analyzing how often different needs are mentioned in diaper-category texts and the share of positive comments for each need, we find that "diaper rash/allergies" is a need that consumers currently consider very important but that is not well met. The results are obtained from the two formulas for demand importance and demand satisfaction, as shown in the figure below.

Demand importance = the number of users who pay attention to (mention) a given need / the number of users who mention the diaper category or a brand;

Demand satisfaction = the number of positive expressions of a need / the number of mentions of that need.
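
Applied to the label data, the two formulas might be computed as in the sketch below. The input layout (one tuple per labeled mention), the `demand_metrics` helper, and the toy records are assumptions for illustration; the numbers are not the article's results.

```python
def demand_metrics(mentions, category_users):
    """Compute demand importance and satisfaction per need.

    mentions: iterable of (user_id, need, sentiment) tuples.
    category_users: set of users who mention the category or a brand.
    """
    users_by_need, total, positive = {}, {}, {}
    for user_id, need, sentiment in mentions:
        users_by_need.setdefault(need, set()).add(user_id)
        total[need] = total.get(need, 0) + 1
        positive[need] = positive.get(need, 0) + int(sentiment == "positive")
    return {
        need: {
            # users who mention the need / users who mention category or brand
            "importance": len(users_by_need[need]) / len(category_users),
            # positive expressions of the need / all mentions of the need
            "satisfaction": positive[need] / total[need],
        }
        for need in total
    }


# Invented toy data.
category_users = {"u1", "u2", "u3", "u4"}
mentions = [
    ("u1", "diaper rash", "negative"),
    ("u2", "diaper rash", "negative"),
    ("u3", "diaper rash", "positive"),
    ("u1", "absorbency", "positive"),
    ("u1", "absorbency", "positive"),
]
for need, m in demand_metrics(mentions, category_users).items():
    print(need, round(m["importance"], 2), round(m["satisfaction"], 2))
```

Importance counts distinct users while satisfaction counts individual expressions, so a single user who posts often raises the mention count but not the importance.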

(2) Analysis of differences in brand recognition

Among consumers with brand awareness, the number of mentions differs significantly across brands (as shown below).

Kao, Huggies, and Pampers are the top three brands by attention, and the gap between Huggies and first-placed Kao is very small;

Recognition of product series is much higher for Huggies than for Pampers: only 3% of users who mention the Pampers brand explicitly name a product series.

Brand attention = the number of users who mention a brand / the number of users who mention any brand.
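
The brand attention formula can be sketched in a few lines. The `(user_id, brand)` input layout, the `brand_attention` helper, and the toy records are assumptions for illustration and do not reproduce the article's figures.

```python
def brand_attention(brand_mentions):
    """Users who mention a brand / users who mention any brand."""
    users_by_brand = {}
    for user_id, brand in brand_mentions:
        users_by_brand.setdefault(brand, set()).add(user_id)
    all_users = set().union(*users_by_brand.values())
    if not all_users:
        return {}  # no brand-aware users, nothing to compute
    return {
        brand: len(users) / len(all_users)
        for brand, users in users_by_brand.items()
    }


# Invented toy data; repeated mentions by the same user count once.
brand_mentions = [
    ("u1", "Kao"), ("u2", "Kao"), ("u2", "Huggies"),
    ("u3", "Pampers"), ("u1", "Kao"),
]
print(brand_attention(brand_mentions))
```

A user can mention several brands, so the shares across brands may sum to more than 1.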

Looking at how different demand points are distributed in consumers' positive reviews, the main reasons users choose each brand are as follows (as shown in the figure below):

"No diaper rash/allergies" and "easy to use" are the points for which brands in general are recognized by consumers;

Huggies is more recognized for "no diaper rash/allergies", "softness", and "breathability";

Kao is more recognized for its "place of origin";

Pampers is more recognized for its "water absorption" and "price".

04 Summary

This article uses a case study to show how to derive consumer insights from text data. If you have similar data scenarios at work, you can follow the approach in this article for basic hands-on practice. Due to limited space, some related topics could not be fully covered (for example, how to build a lexicon efficiently with more natural language processing methods). Interested readers, remember to leave a comment~
