Using the Scrapy crawler framework and Selenium: LDA text mining of coupon-recommendation website data

Original link: http://tecdat.cn/?p=12203


Introduction

Everyone likes to save money. We all try to make the most of our funds, and sometimes the simplest choice makes the biggest difference. Coupons have long been taken to supermarkets to get discounts, but using them has never been easier, thanks to Groupon.

Groupon is a coupon recommendation service that broadcasts electronic coupons for restaurants and shops near you. Some of these coupons can matter a great deal, especially when planning group activities, since discounts can reach 60%.


Data

The data was obtained from the New York City area of the Groupon website. The site is laid out as an album-style search page listing all the different Groupons, followed by a detail page for each specific Groupon. The site looks like this:

[Figure: Groupon listing page and deal detail page]

The layout of the two page types is not dynamic, so a custom Scrapy spider was built to quickly browse all the pages and retrieve the information to be analyzed. The reviews, however, an important piece of information, are rendered and loaded via JavaScript. A Selenium script therefore takes the Groupon URLs collected by Scrapy and essentially mimics a human clicking the "Next" button in the review section.
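The post does not show the spider itself, so here is a minimal sketch of what such a Scrapy spider might look like. The spider name, start URL, and CSS selectors are hypothetical, since Groupon's markup changes over time:

import scrapy

class GrouponSpider(scrapy.Spider):
    # Minimal sketch: crawl a listing page and follow each deal link.
    name = 'groupon'
    start_urls = ['https://www.groupon.com/browse/new-york']  # assumed listing URL

    def parse(self, response):
        # follow every link that points at a deal detail page
        for href in response.css('a[href*="/deals/"]::attr(href)').getall():
            yield response.follow(href, callback=self.parse_deal)

    def parse_deal(self, response):
        # hypothetical field extraction from the deal page
        yield {
            'title': response.css('h1::text').get(),
            'url': response.url,
        }

The Selenium script that pages through each deal's reviews follows.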

import csv
import time

import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

# Assumed setup (not shown in the original post): the deal URLs collected by
# the Scrapy spider and an output CSV file for the reviews.
url_list = pd.read_csv('groupon_urls.csv')
csv_file = open('groupon_reviews.csv', 'w', newline='')
writer = csv.writer(csv_file)
driver = webdriver.Chrome()

# Note: find_element_by_* is the Selenium 3 API used in the original post;
# Selenium 4 replaces it with driver.find_element(By.XPATH, ...).
for url in url_list.url[0:50]:
    try:
        driver.get(url)
        time.sleep(2)
        # close any popup that appears
        try:
            close = driver.find_element_by_xpath('//a[@id="nothx"]')
            close.click()
        except NoSuchElementException:
            pass
        time.sleep(1)
        # expand the reviews ("tips") section
        try:
            link = driver.find_element_by_xpath('//div[@id="all-tips-link"]')
            driver.execute_script("arguments[0].click();", link)
            time.sleep(2)
        except NoSuchElementException:
            pass  # the original used `next` here, which is a no-op
        i = 1
        print(url)
        # page through the reviews until there is no "Next" button left
        while True:
            try:
                time.sleep(2)
                print("Scraping Page: " + str(i))
                reviews = driver.find_elements_by_xpath('//div[@class="tip-item classic-tip"]')
                next_bt = driver.find_element_by_link_text('Next')
                for review in reviews[3:]:
                    review_dict = {}
                    content = review.find_element_by_xpath('.//div[@class="twelve columns tip-text ugc-ellipsisable-tip ellipsis"]').text
                    author = review.find_element_by_xpath('.//div[@class="user-text"]/span[@class="tips-reviewer-name"]').text
                    date = review.find_element_by_xpath('.//div[@class="user-text"]/span[@class="reviewer-reviewed-date"]').text
                    review_dict['author'] = author
                    review_dict['date'] = date
                    review_dict['content'] = content
                    review_dict['url'] = url
                    writer.writerow(review_dict.values())
                i += 1
                next_bt.click()
            except Exception:
                break  # no "Next" button (or click failed): last page reached
    except Exception:
        continue  # skip URLs that fail to load (original used the no-op `next`)

csv_file.close()
driver.close()

The data retrieved for each Groupon is shown below:

Groupon title
Category information
Deal features
Location
Total rating
URL

The data retrieved for each review is shown below:

Author
Date
Comment
URL

There are about 89,000 user reviews in total. A quick check, shown below, turns up a handful of reviews whose content is missing (stored as NaN):

# reviews whose content failed to load are NaN, i.e. floats rather than strings
print(all_groupon_reviews[all_groupon_reviews.content.apply(lambda x: isinstance(x, float))])

# inspect one of the offending rows directly
indx = [10096]
all_groupon_reviews.content.iloc[indx]
            author       date content  \
10096  Patricia D. 2017-02-15     NaN   
15846       Pat H. 2016-09-24     NaN   
19595      Tova F. 2012-12-20     NaN   
40328   Phyllis H. 2015-06-28     NaN   
80140     Andre A. 2013-03-26     NaN   

                                                 url  year  month  day  
10096  https://www.groupon.com/deals/statler-grill-9  2017      2   15  
15846         https://www.groupon.com/deals/impark-3  2016      9   24  
19595   https://www.groupon.com/deals/hair-bar-nyc-1  2012     12   20  
40328     https://www.groupon.com/deals/kumo-sushi-1  2015      6   28  
80140  https://www.groupon.com/deals/woodburybus-com  2013      3   26  

Exploratory data analysis

An interesting finding is that Groupon usage has grown substantially over the past few years. We discovered this by examining the dates attached to the reviews. This is obvious from the figure below, where the x-axis represents the month/year and the y-axis the review count. The slight decline at the end is likely because some Groupons are seasonal.
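As a minimal sketch, assuming the all_groupon_reviews dataframe shown earlier (with its date column), the monthly counts behind the figure can be reproduced with a pandas resample:

import matplotlib.pyplot as plt
import pandas as pd

# count reviews per month to reproduce the trend figure
all_groupon_reviews['date'] = pd.to_datetime(all_groupon_reviews['date'])
monthly_counts = all_groupon_reviews.set_index('date').resample('M').size()

monthly_counts.plot()
plt.xlabel('Month / Year')
plt.ylabel('Count')
plt.show()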

[Figure: number of reviews by month/year]

The Groupons were also grouped by category; the breakdown is shown in the pie chart produced below.

import matplotlib.pyplot as plt

# count the number of Groupons in each category
pie_chart_df = Groupons.groupby('categories').agg('count')

plt.rcParams['figure.figsize'] = (8, 8)

sizes = list(pie_chart_df.mini_info)  # any fully populated column serves as the count
labels = pie_chart_df.index
plt.pie(sizes, shadow=True, labels=labels, autopct='%1.1f%%', startangle=140)
plt.axis('equal')
plt.show()
[Figure: pie chart of Groupon categories]

Finally, since much of the data is given as text, such as "price (original price)", regular expressions were derived to parse the price information and the number of deals each merchant offers. This information is displayed in the bar chart below the sketch.
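The exact expressions are not given in the post; a hypothetical sketch of the price parsing might look like this (the pattern and the sample string are assumptions):

import re

# Hypothetical pattern: deal text often reads like "$59 ($120 value)";
# capture the discounted price and the original price
price_re = re.compile(r'\$(\d+(?:\.\d{2})?)\s*\(\$(\d+(?:\.\d{2})?)\s+value\)')

def parse_prices(text):
    # return (price, original_price) pairs found in a deal description
    return [(float(p), float(o)) for p, o in price_re.findall(text)]

print(parse_prices('$59 ($120 value) for an omakase dinner'))  # [(59.0, 120.0)]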


import numpy as np
import matplotlib.pyplot as plt

# offer_counts maps "number of offers per deal" to how often it occurs
objects = list(offer_counts.keys())
y = list(offer_counts.values())
tst = np.arange(len(y))

plt.bar(tst, y, align='center')
plt.xticks(tst, objects)
plt.ylabel('Total Number of Groupons')
plt.xlabel('Different Discounts Offers')
plt.show()
[Figure: bar chart of the number of discount offers per Groupon]

# Excerpt: labeling for a stacked bar chart of offerings per category.
# `ind` holds the bar positions, and p0..p7, p10 are the handles returned by
# earlier plt.bar calls (one per offer count, not shown in the post; no deals
# had 8 or 9 offers, hence the gap in the legend).
plt.ylabel('Number of Offerings')
plt.xticks(ind, ('Auto', 'Beauty', 'Food', 'Health', 'Home', 'Personal', 'Things'))
plt.xlabel('Category of Groupon')
plt.legend((p0[0], p1[0], p2[0], p3[0], p4[0], p5[0], p6[0], p7[0], p10[0]),
           ('0', '1', '2', '3', '4', '5', '6', '7', '10'))

[Figure: stacked bar chart of offerings per category]

import seaborn as sns

# savings_dataframe holds the parsed savings amounts, one column per category
sns.violinplot(data=savings_dataframe)

[Figure: violin plot of savings by category]

Finally, the user review data is used to generate a word cloud:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# `text` is the concatenation of all review contents
plt.rcParams['figure.figsize'] = (20, 20)
wordcloud = WordCloud(width=4000, height=2000, max_words=150,
                      background_color='white').generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

[Figure: word cloud of user reviews]

Topic modeling

For topic modeling, the two most important packages used are gensim and spaCy. The first step in creating the corpus is to remove all stop words ("the", "and", and so on); the final step is to build trigrams.
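A minimal sketch of that preprocessing, assuming docs is a list of raw review strings and that spaCy's small English model is installed:

import spacy
from gensim.models.phrases import Phrases, Phraser

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def tokenize(doc):
    # keep lowercase alphabetic tokens, dropping stop words
    return [t.lower_ for t in nlp(doc) if t.is_alpha and not t.is_stop]

tokens = [tokenize(d) for d in docs]  # docs: assumed list of review strings

# two Phrases passes merge frequent word pairs into bigrams, then trigrams
bigram = Phraser(Phrases(tokens, min_count=5, threshold=10))
trigram = Phraser(Phrases(bigram[tokens], threshold=10))
corpus_tokens = [trigram[bigram[t]] for t in tokens]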

The selected model is Latent Dirichlet Allocation (LDA), both because it can distinguish topics across different documents and because a package exists that visualizes its results clearly and effectively. Since the method is unsupervised, the number of topics must be chosen in advance; across successive runs of the model, 25 topics proved optimal.
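A minimal sketch of the modeling step, assuming the corpus_tokens from the preprocessing sketch above; the filtering thresholds and pass count are assumptions, not the post's settings. This produces the interactive figure shown below:

import pyLDAvis
import pyLDAvis.gensim_models  # pyLDAvis.gensim in older releases
from gensim.corpora import Dictionary
from gensim.models import LdaModel

dictionary = Dictionary(corpus_tokens)
dictionary.filter_extremes(no_below=10, no_above=0.5)  # assumed thresholds
bow_corpus = [dictionary.doc2bow(toks) for toks in corpus_tokens]

# 25 topics, the number chosen in the post
lda = LdaModel(bow_corpus, num_topics=25, id2word=dictionary,
               passes=10, random_state=42)

# interactive inter-topic distance visualization
vis = pyLDAvis.gensim_models.prepare(lda, bow_corpus, dictionary)
pyLDAvis.save_html(vis, 'lda_vis.html')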

[Figure: pyLDAvis inter-topic distance map with top terms per topic]

The above visualization projects the topics onto two components, so similar topics appear closer together and dissimilar topics farther apart. The words on the right are those that make up each topic, and the lambda parameter controls how exclusive the words are: a lambda of 0 shows the words most exclusive to each topic, while a lambda of 1 shows the words most frequent within it.

The first topic represents the quality of service and reception. The second topic has words describing exercise and physical activity. Finally, the third topic has words that belong to the food category.

Conclusion

Topic modeling is a form of unsupervised learning, and the scope of this project was a brief look at how it finds patterns underneath the words we use. Although we like to think our reviews of a product or service are unique, this model clearly shows that certain words are used across the entire population of reviewers.


Origin: https://www.cnblogs.com/tecdat/p/12737316.html