A simple web crawler in Python (single-page code and loop-based crawling)

For students and professionals who may not yet be familiar with web crawlers for data collection: before introducing web crawlers, let me first explain how a browser normally works. Once you understand how an ordinary browser operates, web crawlers become much easier to understand.
The following diagram shows the workflow of a normal browser:
[Figure: workflow of a typical browser]
As the diagram shows, there are four steps:
(1) We first enter the URL into the browser;
(2) The browser sends an HTTP request to the specified server. There are two common request methods: GET and POST. What is the difference? Roughly speaking, GET means we want to download data back from the server, while POST means we upload (submit) data to the server (a minimal sketch of both is shown after this list);
(3) When the server receives the request, it sends an HTTP response back to the browser;
(4) The response that comes back is actually HTML source code, which most people cannot read directly; the browser then renders it into the rich, beautiful pages we see!
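As a minimal sketch of the two request methods (the URL httpbin.org is only a placeholder test service and is not part of the original example), a GET request and a POST request with urllib might look like this:

import urllib.request
import urllib.parse

# GET: download data back from the server
resp = urllib.request.urlopen('https://httpbin.org/get')
print(resp.read().decode('utf-8'))

# POST: upload (submit) data to the server; passing data= makes urlopen send a POST
payload = urllib.parse.urlencode({'key': 'value'}).encode('utf-8')
resp = urllib.request.urlopen('https://httpbin.org/post', data=payload)
print(resp.read().decode('utf-8'))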

Now that we have covered the normal browser workflow, we can start on our web crawler. A web crawler is a program that simulates a browser requesting a page in order to download the network resources we need; in essence, it forges an HTTP request. Let's move on to the code (note: this uses a Python environment).
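As a side note, and only a minimal sketch that is not taken from the original post: since a crawler essentially forges a browser's HTTP request, it is common to attach a browser-like User-Agent header so the server treats the request as coming from a browser. With urllib this could look like the following (the header string and URL are assumed example values):

import urllib.request

# Attach a browser-like User-Agent header to the request (example value)
req = urllib.request.Request('https://example.com',
                             headers={'User-Agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read().decode('utf-8')
print(html[:200])  # print the first 200 characters of the page source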
First we load the required libraries in Python: re, urllib, json, and time (all four are part of the Python standard library).

# Load the required libraries
import urllib.request
import re
import json
import time
###
# Target URL
url = 'https://sclub.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98vv6992&productId=8758880&score=0&sortType=5&page=2&pageSize=10&isShadowSku=0&rid=0&fold=1'

# Step 1: simulate the browser's HTTP request
#html = urllib.request.urlopen(url)
# Step 2: simulate the response process (read the returned content)
#html = urllib.request.urlopen(url).read()
# We can print(html) to inspect the source; if it comes back garbled, set the encoding
# when decoding the response. Common encodings are utf-8, gbk and gb18030.

# The complete HTTP request and response process in one step
html = urllib.request.urlopen(url).read().decode('gbk')  # request and response combined here
print(html)

# The source we read back is not standard JSON (it is wrapped in a JSONP callback),
# so we extract the JSON part with a regular expression
json_data = re.search('{.+}', html).group()  # regular expression extraction

data = json.loads(json_data)  # convert the JSON string into a dict (deserialization)
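To illustrate why the regular expression step is needed, here is a minimal sketch with a shortened, made-up response body; the real response wraps the JSON in the callback name fetchJSON_comment98vv6992 from the URL above:

import re
import json

# The comment API returns JSONP: the JSON object is wrapped in a callback call
sample = 'fetchJSON_comment98vv6992({"comments": [], "maxPage": 10});'  # shortened, illustrative

# '{.+}' matches greedily from the first '{' to the last '}', i.e. the JSON body
json_data = re.search('{.+}', sample).group()
data = json.loads(json_data)
print(data)  # {'comments': [], 'maxPage': 10}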
   

Now let's crawl the comment data for page 1 by running the following code:

# If you forget the purpose of any line of code, see the comments above
import pandas as pd
import urllib.request
import re
import json
import random
import time

url = 'https://sclub.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98vv6992&productId=8758880&score=0&sortType=5&page=2&pageSize=10&isShadowSku=0&rid=0&fold=1'

html = urllib.request.urlopen(url).read().decode('gbk')
print(html)
json_data = re.search('{.+}', html).group()
data = json.loads(json_data)
data['comments']

After running this, we get the comment data we need.
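Each element of data['comments'] is a dictionary. As a minimal sketch of inspecting the first comment (the field names referenceName, nickname, creationTime and content are the ones used in the loop code below):

comments = data['comments']      # list of comment dictionaries
first = comments[0]              # look at the first comment
print(first['referenceName'])    # product name of the reviewed item
print(first['nickname'])         # nickname of the buyer
print(first['creationTime'])     # time the review was posted
print(first['content'])          # text of the review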

Doesn't the web crawler suddenly seem very simple? A few lines of code solve it completely. But don't forget that this only crawls a single page of data; to crawl multiple pages we need to write our own loop. To make things easier, I have also written the loop-based crawler code below:

all_url = []
for i in range(0,10):
    all_url.append('https://sclub.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98vv6992&productId=8758880&score=0&sortType=5&page='+str(i)+'&pageSize=10&isShadowSku=0&rid=0&fold=1')

allh_data = pd.DataFrame()
for k in range(0,10):    # crawl 10 pages of data

    print("Crawling the comment data of page {}".format(k+1))

    html_data = urllib.request.urlopen(all_url[k]).read().decode('gb18030')

    json_data = re.search('{.+}', html_data).group()
    all_data = json.loads(json_data)

    alls_data = all_data['comments']
    referenceName = [x['referenceName'] for x in alls_data]  # extract the product name
    nickname = [x['nickname'] for x in alls_data]  # extract the buyer's nickname
    creationTime = [x['creationTime'] for x in alls_data]  # extract the review time
    content = [x['content'] for x in alls_data]  # extract the review text

    all_data_hp = pd.DataFrame({'referenceName': referenceName,
                                'nickname': nickname,
                                'creationTime': creationTime,
                                'content': content})

    allh_data = pd.concat([allh_data, all_data_hp])  # accumulate pages (DataFrame.append was removed in newer pandas)
    time.sleep(random.randint(2, 3))  # pause a few seconds between pages
    print(">>>Finished crawling page {}.......".format(k+1))

allh_data.index = range(len(allh_data))  # reset the DataFrame index
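If you want to keep the result, a minimal follow-up step (the file name comments.csv is just an example, not from the original post) is to write the DataFrame to disk:

# Save the crawled comments; utf-8-sig keeps Chinese text readable when opened in Excel
allh_data.to_csv('comments.csv', index=False, encoding='utf-8-sig')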

In the end we get data like the following:
[Figure: 10 pages of crawled comment data for one product]
If you have any questions, feel free to leave a comment and I will try to answer.

Reprint note: if you reprint this article, please cite the source: https://blog.csdn.net/data_bug/article/details/84646030
