2020 American Collegiate Mathematical Contest in Modeling, Problem C (the value of data): complete solution write-up and code

2020 American Collegiate Mathematical Contest in Modeling

Problem C: The value of data

Original problem statement:

  The Amazon online store gives customers the opportunity to rate and review products after purchasing them. Individual ratings, called "star ratings", let buyers express their satisfaction with a product on a scale from 1 (low rating, low satisfaction) to 5 (high rating, high satisfaction). Customers can also submit text messages (known as "reviews") to share further opinions and information about a product. Other customers can mark these reviews as helpful or unhelpful (called "helpfulness ratings") to aid their own purchasing decisions. Companies use this data to gain insight into the markets they participate in, the timing of that participation, and the potential success of product design feature choices.
  Sunshine Company plans to launch and sell three new products on the online marketplace: microwave ovens, baby pacifiers and hair dryers. They have engaged your team as consultants to use historical ratings and reviews provided by customers for competing products to identify key patterns, relationships, measures and parameters that will 1) inform their online sales strategy and 2) identify potentially important design features that would make the products more attractive. Sunshine has used data to guide sales strategies in the past, but has never used this particular combination and type of data before. Sunshine is particularly interested in time-based patterns in this data and in whether they interact in ways that help the company build successful products.
  To help you, Sunshine's data center provides three data files for this project: hair_dryer.tsv, microwave.tsv, and pacifier.tsv. The data represents ratings and reviews provided by customers for microwave ovens, baby pacifiers, and hair dryers sold on the Amazon marketplace during the time periods indicated in the data. A glossary of data label definitions is also provided. The provided data files contain the only data you should use for this problem.
  Task
  1. Analyze the three provided product datasets to identify, describe, and support with mathematical evidence meaningful quantitative and/or qualitative patterns, relationships, measures, and parameters within and between star ratings, reviews, and helpfulness ratings that will help Sunshine succeed with its three new online products.
  2. Use your analysis to address the following specific questions and requests from Sunshine's Marketing Director. (Reading the original English problem statement carefully is recommended: it contains many mathematical and economic terms, and this translation is not fully accurate.)
  a. Identify the data measures based on ratings and reviews that are most informative for Sunshine to track once its three products are on sale in the online marketplace.
  b. Identify and discuss time-based measures and patterns in each dataset that may indicate an increase or decrease in a product's reputation in the online marketplace.
  c. Determine combinations of text-based and rating-based measures that best indicate a potentially successful or failing product.
  d. Do specific star ratings incite more reviews? For example, are customers more likely to write some type of review after seeing a series of low star ratings?
  e. Are specific quality descriptors in text-based reviews, such as "enthusiastic" and "disappointed", strongly associated with rating levels?
  3. Write a one- to two-page letter to Sunshine's Marketing Director summarizing your team's analysis and results. Include specific rationale for the findings that your team is most confident recommending to the Marketing Director.

Overview of the solution process (abstract)

  We carry out data analysis and information mining from four angles: the relationship among reviews, star ratings, and helpfulness ratings; product brand ratings; product reputation prediction; and the impact of star ratings on reviews. From these we propose reliable sales strategies and product improvement suggestions.

  First, after data preprocessing, we use the NLTK toolkit to tokenize the review text and run a preliminary sentiment analysis, quantifying each review as a sentiment score in the range [-1, 1]. Using data visualization, descriptive statistics, and correlation analysis, we then build a multivariate logistic regression model to analyze how helpfulness ratings relate to review length, star rating, and composite degree. The results show that the helpfulness rating has an inverted U-shaped relationship with review length and a U-shaped relationship with star rating.
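  The two shapes reported above can be illustrated with a logistic model that includes a squared term. The sketch below is only an illustration: the coefficients are hypothetical values chosen to produce the two shapes, not the fitted values from the paper, and the function names are made up for this example.

```python
import math

def helpful_probability(x, b0, b1, b2):
    """Logistic model with a quadratic term: P = sigmoid(b0 + b1*x + b2*x^2)."""
    z = b0 + b1 * x + b2 * x * x
    return 1.0 / (1.0 + math.exp(-z))

def turning_point(b1, b2):
    """x where b0 + b1*x + b2*x^2 peaks (b2 < 0) or bottoms out (b2 > 0)."""
    return -b1 / (2.0 * b2)

# Hypothetical coefficients (b0, b1, b2), chosen only to show the two shapes.
LENGTH_COEFS = (-2.0, 0.04, -0.0001)  # b2 < 0: inverted U in review length (words)
STAR_COEFS = (1.5, -1.2, 0.2)         # b2 > 0: U shape in star rating (1 to 5)

length_peak = turning_point(LENGTH_COEFS[1], LENGTH_COEFS[2])
star_trough = turning_point(STAR_COEFS[1], STAR_COEFS[2])
print("length with highest predicted helpfulness:", round(length_peak))
print("star rating with lowest predicted helpfulness:", round(star_trough, 2))

for words in (50, 200, 350):
    print(words, "words:", round(helpful_probability(words, *LENGTH_COEFS), 3))
for stars in range(1, 6):
    print(stars, "stars:", round(helpful_probability(stars, *STAR_COEFS), 3))
```

  The sign of the squared-term coefficient decides the shape: a negative value gives a peak at a moderate review length, while a positive value means extreme star ratings (1 or 5) are predicted to be more helpful than middling ones.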

  Next, we build the rating and evaluation models. We fit an LDA topic model to find the topic characteristics of each product and base our product improvement suggestions on them. From these topics we also distill five indicators that affect product sales: quality, price, appearance, service and size. An index score is then calculated for each review using a search algorithm based on text similarity. We combine these scores with the analytic hierarchy process (AHP) to determine the weight of each indicator and construct a weighted brand scoring system. Finally, we group all product brands by hierarchical clustering to select potential high-quality brands and recommend them to Sunshine Company.
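  As a rough illustration of the weighting step, the sketch below derives indicator weights from a pairwise comparison matrix using the row geometric-mean method, one common way to approximate AHP priorities. The comparison matrix and the brand scores are hypothetical; the paper's actual judgments and method details are not given in this excerpt.

```python
import math

INDICATORS = ["quality", "price", "appearance", "service", "size"]

# Hypothetical pairwise comparison matrix on a 1-9 importance scale:
# PAIRWISE[i][j] says how much more important indicator i is than indicator j.
PAIRWISE = [
    [1,     2,     4,     3,     5],
    [1 / 2, 1,     3,     2,     4],
    [1 / 4, 1 / 3, 1,     1 / 2, 2],
    [1 / 3, 1 / 2, 2,     1,     3],
    [1 / 5, 1 / 4, 1 / 2, 1 / 3, 1],
]

def ahp_weights(matrix):
    """Approximate the AHP priority vector with the row geometric-mean method."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]

def weighted_score(indicator_scores, weights):
    """Combine per-indicator scores (a dict keyed by indicator name) into one score."""
    return sum(indicator_scores[name] * w for name, w in zip(INDICATORS, weights))

weights = ahp_weights(PAIRWISE)
for name, w in zip(INDICATORS, weights):
    print(f"{name}: {w:.3f}")

# Hypothetical per-indicator scores for one brand, each in [0, 1].
brand_a = {"quality": 0.8, "price": 0.6, "appearance": 0.7, "service": 0.5, "size": 0.9}
print("brand score:", round(weighted_score(brand_a, weights), 3))
```

  A full AHP workflow would also check the consistency of the comparison matrix before trusting the weights; that step is left out here to keep the sketch short.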

  We then compute a composite score from reviews and star ratings and treat it as the product's reputation, which lets us forecast the future reputation of the three products through time series analysis. The analysis indicates that all three products show seasonal patterns and are likely to keep a stable seasonal cycle in the near future, although their composite reputation scores peak at different times. Further analysis shows that product sales are relatively high during the peak periods of the reputation score, so Sunshine can plan its sales accordingly.
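  The excerpt does not say which time series model was used. As a stand-in, the sketch below computes simple monthly seasonal indices and a seasonal naive forecast (each future month repeats the value from twelve months earlier). The monthly reputation scores are synthetic and the function names are hypothetical.

```python
from collections import defaultdict

def seasonal_indices(monthly_scores):
    """monthly_scores: list of (year, month, score) tuples.

    Returns {month: index}; an index above 1 means that month is, on
    average, above the overall mean score.
    """
    by_month = defaultdict(list)
    for _, month, score in monthly_scores:
        by_month[month].append(score)
    overall = sum(score for _, _, score in monthly_scores) / len(monthly_scores)
    return {m: (sum(v) / len(v)) / overall for m, v in sorted(by_month.items())}

def seasonal_naive_forecast(monthly_scores, horizon, period=12):
    """Forecast each future step with the value observed one full period earlier."""
    values = [score for _, _, score in monthly_scores]
    history = list(values)
    for _ in range(horizon):
        history.append(history[-period])
    return history[len(values):]

# Synthetic reputation scores: a mild upward drift plus a December peak.
scores = [
    (year, month, 0.5 + 0.02 * (year - 2013) + (0.2 if month == 12 else 0.0))
    for year in (2013, 2014)
    for month in range(1, 13)
]

indices = seasonal_indices(scores)
peak_month = max(indices, key=indices.get)
forecast = seasonal_naive_forecast(scores, horizon=12)
print("peak month:", peak_month)
print("forecast for the next 12 months:", [round(v, 2) for v in forecast])
```

  If the seasonal cycle is as stable as the analysis suggests, even this naive forecast points to when the next reputation peak is expected, which is the information a sales plan needs.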

  Finally, we analyze the relationship between star ratings and reviews. A distributed lag model shows that a customer's current review is affected by the ratings and reviews of other customers. We also observe a time-varying synchrony between sentiment scores and star ratings: reviews containing positive words go with higher star ratings, while reviews containing negative words go with lower star ratings. In other words, there is a strong correlation between star ratings and specific quality descriptors.
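  A finite distributed lag model of the kind mentioned above can be sketched as y[t] = a + b0*x[t] + b1*x[t-1] + ... + bq*x[t-q], fitted by ordinary least squares. The sketch below assumes x is a series of star ratings and y a series of review sentiment scores; the data are synthetic, the lag length is arbitrary, and the paper's actual specification is not shown in this excerpt.

```python
import random

def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_distributed_lag(x, y, lags):
    """OLS fit of y[t] = a + sum_k b[k] * x[t-k] for k = 0..lags.

    Returns [a, b0, ..., b_lags], solved through the normal equations.
    """
    rows = [[1.0] + [x[t - k] for k in range(lags + 1)] for t in range(lags, len(y))]
    targets = y[lags:]
    p = lags + 2
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Synthetic demo data (not from the contest files): star ratings and a
# sentiment series built from known, hypothetical lag coefficients.
rng = random.Random(0)
stars = [float(rng.randint(1, 5)) for _ in range(60)]
TRUE_COEFS = [0.1, 0.05, 0.03, 0.01]  # intercept, lag 0, lag 1, lag 2
sentiment = [0.0, 0.0] + [
    TRUE_COEFS[0] + sum(TRUE_COEFS[k + 1] * stars[t - k] for k in range(3))
    for t in range(2, len(stars))
]

estimated = fit_distributed_lag(stars, sentiment, lags=2)
print([round(c, 4) for c in estimated])
```

  Non-zero coefficients on the lagged terms are what would indicate that earlier ratings carry over into the current review. On real data a statistics package would be the more practical choice, since it also reports standard errors.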

Model assumptions:

  We made several assumptions in our model. Later, we may relax some of these assumptions to optimize our model for complex real-world environments.

  1. All reviews are written by real customers; none are automatically generated.

  2. There is no deliberate attempt to discredit the product in customer ratings and reviews.

  3. Reviews without a verified purchase do not correspond to a sale on Amazon, so the number of verified purchases in the dataset equals total sales.

  4. The reviews of Amazon Vine certified members have high credibility.

Question restatement:

  Sunshine plans to launch and sell microwave ovens, hair dryers and baby pacifiers on the online marketplace. To understand these three product markets and develop sales strategies, it is necessary to analyze customer feedback data. We will complete the following tasks based on the given data:
  1. Develop a sales strategy for Sunshine Company.
  2. Identify potentially important design features that increase product desirability.
  To accomplish these two tasks, our specific work is as follows:
  1. Analyze the relationship among star ratings, helpfulness votes and reviews.
  2. Conduct an in-depth analysis of reviews and ratings to identify product strengths and weaknesses and make recommendations for product improvements.
  3. Establish a product rating system based on the analysis of reviews and ratings, then select high-quality brand products and recommend them to Sunshine Company for sale.
  4. Establish a reputation scoring system to predict the trend of product reputation.
  5. Analyze whether the current review is affected by previous star ratings and reviews.
  6. Analyze whether specific quality descriptors in text-based reviews are strongly correlated with rating levels.

Model establishment and solution

Overall paper thumbnail (images not reproduced here)

Part of the program code (the full code and documentation are not free):

import re

import pandas as pd
from gensim import corpora, models
from snownlp import SnowNLP
# from nltk.tokenize import word_tokenize

# `test2` is not defined in this excerpt; it is assumed to hold the review
# texts (one review per row) loaded earlier in the full program.
data = pd.DataFrame(test2)
print(type(data))
# data_null = data.drop_duplicates()
# data_null.to_csv('C:/Users/lenovo/Desktop/comments_null.csv')
# data_null_comments = data_null['contents']
# data_null_comments.to_csv('C:/Users/lenovo/Desktop/contents.txt', index=False, encoding='utf-8')
# data_len = data_null_comments[data_null_comments.str.len() > 4]
# print(data_len)
# data_len.to_csv('contents.txt', index=False, encoding='utf-8')

# Score each review with SnowNLP; `sentiments` is a value between 0 and 1,
# where values near 1 indicate positive text.
coms = data[0].apply(lambda x: SnowNLP(x).sentiments)
# Split the reviews into positive and negative sets at the 0.01 threshold.
data_post = data[coms >= 0.01]
data_neg = data[coms < 0.01]
print(data_post)
print(data_neg)
# Save the two sets to disk (the row index is written as the first column).
data_post[0].to_csv('C:/Users/lenovo/Desktop/comments_positive.txt', encoding='utf-8', header=False)
data_neg[0].to_csv('C:/Users/lenovo/Desktop/comments_negative.txt', encoding='utf-8-sig', header=False)

# Characters to strip from the raw text. Some punctuation marks in the
# original pattern were lost in extraction and are not restored here.
pattern = re.compile(r'\t|\n|\.|-|——|,|;|\)|\(|\?|"')

# Clean the positive reviews and append the result to comments_post.txt.
# Newlines are stripped too, so the cleaned text ends up on a single line.
with open('C:/Users/lenovo/Desktop/comments_positive.txt', encoding='utf-8') as fn1:
    string_data1 = fn1.read()
string_data1 = pattern.sub('', string_data1)
print(string_data1)
with open('C:/Users/lenovo/Desktop/comments_post.txt', 'a', encoding='utf-8') as fp:
    fp.write(string_data1 + '\n')

# Same cleaning for the negative reviews. The file was written with a BOM
# ('utf-8-sig'), so it is read back with the same encoding.
with open('C:/Users/lenovo/Desktop/comments_negative.txt', encoding='utf-8-sig') as fn2:
    string_data2 = fn2.read()
string_data2 = pattern.sub('', string_data2)
print(string_data2)
with open('C:/Users/lenovo/Desktop/comments_neg.txt', 'a', encoding='utf-8') as fp:
    fp.write(string_data2 + '\n')
data1 = pd.read_csv('C:/Users/lenovo/Desktop/comments_post.txt', encoding='utf-8', header=None)
data2 = pd.read_csv('C:/Users/lenovo/Desktop/comments_neg.txt', encoding='utf-8', header=None)
# Tokenization with NLTK is disabled in this listing:
# mycut = lambda s: ' '.join(word_tokenize(s))
# data1 = data1[0].apply(mycut)
# data2 = data2[0].apply(mycut)
data1 = data1[0]
data2 = data2[0]
data1.to_csv('C:/Users/lenovo/Desktop/comments_post_cut.txt', index=False, header=False, encoding='utf-8')
data2.to_csv('C:/Users/lenovo/Desktop/comments_neg_cut.txt', index=False, header=False, encoding='utf-8')
print(data2)

# Reload the texts, skipping malformed lines. `on_bad_lines='skip'`
# (pandas >= 1.3) replaces the deprecated `error_bad_lines=False`.
post = pd.read_csv('C:/Users/lenovo/Desktop/comments_post_cut.txt', encoding='utf-8', header=None, on_bad_lines='skip')
neg = pd.read_csv('C:/Users/lenovo/Desktop/comments_neg_cut.txt', encoding='utf-8', header=None, on_bad_lines='skip')

# Load the stop-word list, one word per line. The separator 'tipdm' is a
# string that should never occur in the file, so each line is read as one field.
stop = pd.read_csv('C:/Users/lenovo/Desktop/stoplist.txt', encoding='utf-8', header=None, sep='tipdm', engine='python')
stop = [' ', ''] + list(stop[0])

# Split each text on spaces (column 1) and drop stop words (column 2).
post[1] = post[0].apply(lambda s: s.split(' '))
post[2] = post[1].apply(lambda x: [i for i in x if i not in stop])
neg[1] = neg[0].apply(lambda s: s.split(' '))
neg[2] = neg[1].apply(lambda x: [i for i in x if i not in stop])
# Fit a 4-topic LDA model on the positive reviews and print all four topics.
post_dict = corpora.Dictionary(post[2])
post_corpus = [post_dict.doc2bow(i) for i in post[2]]
post_lda = models.LdaModel(post_corpus, num_topics=4, id2word=post_dict)
for i in range(4):
    print(post_lda.print_topic(i))
print()

# Same for the negative reviews.
neg_dict = corpora.Dictionary(neg[2])
neg_corpus = [neg_dict.doc2bow(i) for i in neg[2]]
neg_lda = models.LdaModel(neg_corpus, num_topics=4, id2word=neg_dict)
for i in range(4):
    print(neg_lda.print_topic(i))
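
The LDA step in the listing above works on a bag-of-words corpus: each document becomes a list of (token id, count) pairs. The plain-Python sketch below shows that representation on two made-up token lists. It does not call gensim, the helper names are hypothetical, and the ids it assigns may differ from those gensim would choose.

```python
from collections import Counter

def build_vocabulary(documents):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocab = {}
    for tokens in documents:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bag_of_words(tokens, vocab):
    """Return sorted (token_id, count) pairs; tokens missing from vocab are ignored."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

# Made-up, already tokenized and stop-word-filtered reviews.
docs = [
    ["great", "dryer", "great", "price"],
    ["dryer", "broke"],
]

vocab = build_vocabulary(docs)
corpus = [to_bag_of_words(doc, vocab) for doc in docs]
print(vocab)
print(corpus)
```

Word order is discarded in this representation, which is why the stop-word filtering step matters: without it, the most frequent tokens in every topic would be filler words.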

Origin blog.csdn.net/weixin_43292788/article/details/131822531