What does an inflatable doll feel like? Let Python tell you

Hack  9 May 21

The following article comes from the blog Naked Pig, part of the author's introductory Python tutorial series.

Today I'll take you through an exciting project!

I. Requirement background

In real-world development, before we start building, a product manager explains the requirements to the whole team (testers, front-end, back-end, project managers, and so on); once everyone understands the requirements, we discuss the technical solution together.

Even when we implement some small feature for ourselves, we should state the requirement first: that is, explain why we are doing this thing, and what problem we want the result to solve.

We often see memes, pictures, or news about inflatable dolls, but people rarely talk to each other about the actual experience. So I believe most students (myself included) have no idea what an inflatable doll actually feels like, and I was curious: what kind of experience is it? Is it really as great as rumored?

II. Functional description

Since many people have never experienced an inflatable doll but are very curious, I hope to answer the question intuitively and truthfully by crawling and analyzing real review data (the finished result is shown in the picture below).

III. Technical plan

To implement the requirement above, here is the concrete technical plan:

  1. Analyze the JD.com comment data request

  2. Use the requests library to fetch the comments of an inflatable-doll product on JD.com

  3. Display the data as a word cloud

IV. Implementation

As mentioned in the last article, today we take the JD.com product with ID 1263013576 as our object of data analysis. Let's walk through the detailed technical steps!

This tutorial is for learning and exchange only, not for commercial profit; use it at your own risk!
If it infringes on or adversely affects any company or individual, please let me know and it will be deleted.

1. Analyze and obtain the comment interface URL

Step 1: open a JD.com product page and find the product you want to study.

Step 2: right-click the page and choose Inspect (or press F12) to bring up the browser's debug window.

Step 3: with the debug window open, click the comments tab so the comment data loads, then click the Network panel to view the request data.

Step 4: find the request that loads the comment data; we can search for a snippet of a comment in the debug window to locate it.


With the four analysis steps above, we have obtained the JD.com comment data interface: https://sclub.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98vv4654&productId=1263013576&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1

productPageComments: the name alone tells you this returns the product page's reviews.

2. Crawl the comment data

Once we have the comment-data URL, we can start writing code to fetch the data. Generally we first try to grab a single page; once that succeeds, we figure out how to crawl in bulk.

In a previous article we already explained how to make HTTP/S requests with the requests library; let's look at the code.
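That first attempt might look like the following minimal sketch (the callback token `fetchJSON_comment98vv4654` is copied from the URL found above and can vary per session):

```python
import requests

# Comment interface URL from the analysis above (page 0, 10 reviews per page).
COMMENT_URL = (
    "https://sclub.jd.com/comment/productPageComments.action"
    "?callback=fetchJSON_comment98vv4654&productId=1263013576"
    "&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1"
)

def spider_comment():
    """First attempt: fetch one page of comments with no extra headers."""
    resp = requests.get(COMMENT_URL)
    print(resp.text)

# spider_comment()  # uncomment to run
```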

But the printed result was empty! Why does the request succeed in the browser while our code gets no data? Have we run into anti-crawling measures? How do we deal with this?

Whenever you face this situation, go back to the browser's debug window and look at the request headers the browser sent: the browser's request may be carrying header parameters that our code's request is not.

Sure enough, the browser sends two request headers, Referer and User-Agent. Let's add them to the request headers in our code and try again!
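A sketch of the fixed request, with the two headers copied from the browser (the User-Agent string below is only an example; paste whatever your own browser sends):

```python
import requests

# Both header values mimic what a real browser sends; JD's comment interface
# returns an empty body for requests that lack them.
HEADERS = {
    "Referer": "https://item.jd.com/1263013576.html",
    "User-Agent": ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/77.0.3865.120 Safari/537.36"),
}

COMMENT_URL = (
    "https://sclub.jd.com/comment/productPageComments.action"
    "?callback=fetchJSON_comment98vv4654&productId=1263013576"
    "&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1"
)

def spider_comment():
    """Fetch one page of comments, this time with the browser headers."""
    resp = requests.get(COMMENT_URL, headers=HEADERS)
    print(resp.text)

# spider_comment()  # uncomment to run
```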

3. Data Extraction

Analyzing the crawled data, we find the request returns jsonp (cross-domain JSON), so we just strip the leading fetchJSON_comment98vv4654( and the trailing ); to get the JSON data.

Copy the data into a JSON formatting tool, or click Preview in the Chrome debug window, and you can see that the value of the comments key in the JSON is the review data we want.

Analyzing the value of comments, we find it is a list of multiple entries, and each item in the list is one review, including the comment content, time, id, source, and other information; the content field is the user review text we see on the page.

Next we use code to extract the content field of each review and print it out.
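A sketch of that extraction step: the jsonp wrapper is stripped by cutting between the first `(` and the last `)`. The sample response below is fabricated purely to illustrate the shape of the real data:

```python
import json

def parse_comments(raw):
    """Strip the jsonp wrapper, e.g. fetchJSON_comment98vv4654(...);,
    and return the content field of every review."""
    start = raw.find("(") + 1  # skip past the callback name and "("
    end = raw.rfind(")")       # drop the trailing ");"
    data = json.loads(raw[start:end])
    return [c["content"] for c in data["comments"]]

# Fabricated mini-response, just to show the shape of the real data.
sample = 'fetchJSON_comment98vv4654({"comments": [{"content": "feels great"}]});'
for content in parse_comments(sample):
    print(content)  # -> feels great
```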

4. Data storage

After extracting the data we need to save it. Storage formats generally fall into three categories: files, databases, and memory. Today we save the data in txt file format, because file operations are relatively simple while still meeting the needs of our subsequent data analysis.
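A minimal sketch of the save step (the file name is an assumption):

```python
def save_comments(comments, path="jd_comment.txt"):
    """Append each review to a txt file, one review per line.
    Opening in "a" mode lets page after page accumulate in one file."""
    with open(path, "a", encoding="utf-8") as f:
        for content in comments:
            f.write(content + "\n")
```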

Then let's check whether the content of the generated file is correct.

5. Batch crawling

Now that we have completed crawling, extracting, and saving one page of data, let's look at how to fetch in batches.

Students who have done web development probably know a feature we all have to implement: paging. What is paging? Why page at all?

When browsing the web we often see the words "Next page": that is paging at work. Because it is impossible to show all the data to the user at once, paging technology displays it one page at a time.

Let's go back to the comment-data URL from the beginning:

https://sclub.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98vv4654&productId=1263013576&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1

Notice the two parameters in the link, page=0&pageSize=10: page is the current page number and pageSize is how many items each page shows, and the two map directly to a database LIMIT clause.

Old hands will see at a glance that these are paging parameters, but some students will say: if I were an old hand, would I still be reading your article? So let me teach you how to find the paging parameters.

Go back to the JD.com product page and scroll to the bottom of the reviews to find the paging buttons, then clear the previously recorded requests in the debug window.

After clearing the recorded requests, click the paging button labeled 2 (the red box in the figure), which loads the second page, then copy the first review on that page and search for it in the debug window; you will finally find the matching request link.

Then click Headers to view the URL of the second-page data request.

Now let's compare the difference between the URLs of the first-page and second-page reviews.

This also confirms my guess: page is the current page number and pageSize is how many items per page. We can draw another conclusion: the first page has page=0, the second page=1, and so on. Some students will ask: why is the first page 0 rather than 1? Because most databases count from 0, and in programming many lists and arrays are likewise indexed from 0.

Well, now that you know the paging rule, we just need to increment the page parameter on each request to fetch in batches, right? Let's write the code!

Briefly, the changes made:

  1. Add a page parameter to the spider_comment method:

    the page number fills a placeholder in the URL, so the URL can be built dynamically to crawl any specified page.

  2. Add a batch_spider_comment method that calls spider_comment in a loop, tentatively crawling 100 pages.

  3. Inside the batch_spider_comment loop, sleep for a random interval to simulate a user browsing and keep the IP from being banned for crawling too frequently.
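Putting the three changes together, the batch version might be sketched like this (the fetch/parse/save body is elided; only the paging and throttling logic is shown):

```python
import random
import time

# page is now a {} placeholder so any page number can be substituted in.
PAGE_URL = (
    "https://sclub.jd.com/comment/productPageComments.action"
    "?callback=fetchJSON_comment98vv4654&productId=1263013576"
    "&score=0&sortType=5&page={}&pageSize=10&isShadowSku=0&fold=1"
)

def spider_comment(page=0):
    """Crawl a single page of comments."""
    url = PAGE_URL.format(page)
    # ...fetch with headers, strip the jsonp wrapper, extract and save,
    # exactly as in the earlier steps...
    return url

def batch_spider_comment(pages=100):
    """Crawl the first `pages` pages, pausing randomly between requests."""
    for page in range(pages):
        spider_comment(page)
        # Random sleep simulates a human browsing and helps avoid an IP ban.
        time.sleep(random.random() * 5)

# batch_spider_comment()  # uncomment to run
```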

Check the results after the crawl completes.

6. Data Cleaning

After the data has been successfully saved, we need to segment it word by word to clean it; for segmentation we use the well-known Chinese word-segmentation library jieba.
First install the jieba library:

pip3 install jieba


Of course, here you can also filter out prepositions and other meaningless words to avoid invalid data.
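A sketch of the cleaning step; the stop-word set below is a tiny illustrative sample, not a real stop-word list:

```python
STOPWORDS = {"的", "了", "很", "是", "和"}  # tiny sample; extend with a real list

def drop_stopwords(words, stopwords=STOPWORDS):
    """Keep only meaningful tokens: non-blank and not in the stop-word set."""
    return [w for w in words if w.strip() and w not in stopwords]

def clean_text(text):
    """Segment review text with jieba, then filter out the stop words."""
    import jieba  # third-party segmentation library: pip3 install jieba
    return drop_stopwords(jieba.cut(text))
```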

7. Generate the word cloud

To generate the word cloud we need the numpy, matplotlib, wordcloud, and Pillow libraries, so download them first. The matplotlib library is used to display the image, and the wordcloud library generates the word cloud.

Note: font_path selects the font file. If you don't set it, the default font may not support Chinese; I chose a Chinese font that ships with the Mac system!
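A sketch of the word-cloud step using the wordcloud library's documented API (WordCloud, generate, to_file); font_path must point at a Chinese-capable font file on your own system:

```python
def make_wordcloud(words, out_path="wordcloud.png", font_path=None):
    """Join the cleaned words and render them as a word-cloud image.
    Pass font_path pointing at a Chinese-capable font file, otherwise
    the default font may render Chinese characters as empty boxes."""
    import matplotlib.pyplot as plt  # third-party: pip3 install matplotlib
    from wordcloud import WordCloud  # third-party: pip3 install wordcloud

    wc = WordCloud(font_path=font_path, width=800, height=600,
                   background_color="white")
    wc.generate(" ".join(words))  # wordcloud does its own frequency counting
    wc.to_file(out_path)          # save the image to disk
    plt.imshow(wc)                # and display it
    plt.axis("off")
    plt.show()

# make_wordcloud(words)  # words: the cleaned token list from the last step
```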

The end result:

Let's look at the complete code.

V. Summary

Out of consideration for beginners this is a long article, describing everything in detail from the requirement to technical analysis, data crawling, data cleaning, and the final data analysis. Let's summarize what this article covered:

  1. How to analyze and find the URL that loads the data

  2. How to use the Referer and User-Agent headers with the requests library to get past basic anti-crawling checks

  3. How to find the paging parameters to crawl in bulk

  4. Setting a crawl interval to avoid an IP ban

  5. Extracting data and saving it to a file

  6. Using the jieba library to segment and clean the data

  7. Using wordcloud to generate a word cloud in a given shape

This is a complete data-analysis case. I hope you can try it yourself and explore more interesting cases; be an interesting person!

Project address: https://github.com/pig6/jd_comment_spider

[End]



Origin blog.csdn.net/pangzhaowen/article/details/102912555