Crawl dynamically loaded data

Dynamically loaded data

Example 1: Crawling movie detail data from Douban Movies

url: https://movie.douban.com/

1. What is dynamically loaded data?

Not all of a page's data can be fetched by sending one request with the requests module to the URL shown in the browser's address bar. Some of the data on the page is returned by additional requests that the page makes to other URLs after it loads; that data is the dynamically loaded data. (Typically, JavaScript on the page sends Ajax requests to other URLs to fetch it.)

2. How to detect whether a page contains dynamically loaded data

Open the packet capture tool (e.g. the browser's Network panel) on the current page and locate the packet corresponding to the URL in the address bar. Search for the data you want to crawl in that packet's Response tab. If the search finds it, the data is not dynamically loaded; otherwise, it is dynamically loaded.
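The same check can be scripted: fetch the raw HTML and see whether the target text appears in it. A minimal sketch (the helper name is my own; the usage URL and keyword are placeholders):

```python
def is_dynamically_loaded(html, target_text):
    """True if target_text does not appear in the raw HTML,
    suggesting the data arrives via a separate Ajax request."""
    return target_text not in html

# usage (network call for illustration only):
# html = requests.get('https://movie.douban.com/', headers=headers).text
# print(is_dynamically_loaded(html, '肖申克的救赎'))
```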

3. If the data is dynamically loaded, how can we capture it?

Do a global search in the packet capture tool.

Locate the packet corresponding to the dynamically loaded data, and extract from it:

  • The requested URL
  • The request method
  • The parameters carried by the request
  • The response data

Then we need to analyze the relationship between the parameters and the URL. The request carries 'start': '0', 'limit': '20' as parameters in the URL. start is the index of the movie to start from, and limit is how many movies to return from that point. For example, to crawl the first 30 movies, use start: 0, limit: 30; start: 2, limit: 12 fetches 12 movies beginning with the second one.
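The start/limit pair behaves like list slicing. A quick local analogy with a made-up movie list (the names are placeholders, not real API data):

```python
# hypothetical list standing in for the full ranked movie chart
movies = [f'movie_{i}' for i in range(100)]

def fetch(start, limit):
    """Mimic the API: return `limit` items beginning at index `start`."""
    return movies[start:start + limit]

print(fetch(0, 3))   # ['movie_0', 'movie_1', 'movie_2']
print(fetch(10, 5))  # 5 movies starting from index 10
```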

Now we write the code: crawl 50 movies starting from the 10th.

import requests

headers = {
    'User-Agent': 'Mozilla/5.0'  # pretend to be a browser
}
url = 'https://movie.douban.com/j/chart/top_list'
params = {
    'type': '5',
    'interval_id': '100:90',
    'action': '',
    'start': '10',   # start from the 10th movie
    'limit': '50',   # fetch 50 movies
}
response = requests.get(url=url, params=params, headers=headers)
# .json() deserializes the JSON string into a dict or list object
page_text = response.json()
# parse out each movie's name and score
for movie in page_text:
    name = movie['title']
    score = movie['score']
    print(name, score)

Example 2: Crawling KFC restaurant location data by page

  • Crawling paginated data
  • Analysis:
    • 1. Enter a keyword in the search box and press the search button; this initiates an Ajax request
      • The location information refreshed on the current page must be the data returned by that Ajax request
    • 2. Using the packet capture tool, locate the packet for the Ajax request, and capture from it:
      • The requested URL
      • The request method
      • The parameters carried by the request
      • The response data
import requests

headers = {
    'User-Agent': 'Mozilla/5.0'  # pretend to be a browser
}
# crawl the first page of data
url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
data = {
    'cname': '',
    'pid': '',
    'keyword': '北京',
    'pageIndex': '1',
    'pageSize': '10',
}
# the data argument is how a POST request carries dynamic parameters;
# GET parameters go in params and show up in the URL; POST uses data!
response = requests.post(url=url, headers=headers, data=data)
page_text = response.json()
for dic in page_text['Table1']:
    title = dic['storeName']
    addr = dic['addressDetail']
    print(title, addr)
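The difference between params and data can be seen directly with requests' prepared requests; nothing is actually sent over the network (example.com is a placeholder host):

```python
import requests

# GET: params are encoded into the URL's query string
get_req = requests.Request('GET', 'http://example.com/api', params={'q': 'kfc'}).prepare()
print(get_req.url)    # http://example.com/api?q=kfc
print(get_req.body)   # None

# POST: data is form-encoded into the request body, not the URL
post_req = requests.Request('POST', 'http://example.com/api', data={'q': 'kfc'}).prepare()
print(post_req.url)   # http://example.com/api
print(post_req.body)  # q=kfc
```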
    
# To crawl multiple pages
After the analysis above, just adjust the data dict: pageIndex is the page number to display. Use a for loop to request each page.
for page in range(1, 9):  # pages 1 through 8
    data = {
        'cname': '',
        'pid': '',
        'keyword': '北京',
        'pageIndex': str(page),  # the page number varies per request
        'pageSize': '10',
    }
    response = requests.post(url=url, headers=headers, data=data)
    page_text = response.json()
    for dic in page_text['Table1']:
        title = dic['storeName']
        addr = dic['addressDetail']
        print(title, addr)

Example 3: Crawling detailed data from each page of the SFDA site

url = http://125.35.6.84:81/xk/

Requesting the page URL directly, the Response tab contains no data for any enterprise. Through the packet capture tool, we find the data is fetched by a POST request to a different URL, so we send our request to that POST URL instead.

The parameters are:

Then we send the request to get the data of each list page. But we also need the content of each company's detail page on every list page.

Clicking through to a detail page, its GET request's Response tab contains no data either. This detail page is also dynamically loaded: a POST request fetches its data.

After analysis, we find the detail request queries by id, and every enterprise's id is present in the list page's JSON data.

With that, we write code to get the detailed information of all companies on the first 3 pages:

import requests

list_url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsList'
detail_url = 'http://125.35.6.84:81/xk/itownet/portalAction.do?method=getXkzsById'
for i in range(1, 4):  # first 3 pages
    data = {
        'on': 'true',
        'page': str(i),  # the page number varies per request
        'pageSize': '15',
        'productName': '',
        'conditionType': '1',
        'applyname': '',
        'applysn': '',
    }
    page_json = requests.post(url=list_url, data=data).json()
    for w in page_json['list']:
        print(w)
        # each enterprise's ID is in the list JSON; use it to query the detail data
        detail = requests.post(url=detail_url, data={'id': w['ID']}).json()
        print(detail)


Origin www.cnblogs.com/zzsy/p/12687221.html