[Python] JSON data analysis

Table of contents

JSON file data analysis

Crawling and parsing the json data packet of Honor of Kings hero information

Crawling and parsing the Douyin video json data packet


JSON file data analysis

json string: usually a combination of lists and dictionaries similar to python data types, or possibly a standalone list or dictionary. The json module's interface converts a json string into the corresponding python data type, and a python data type back into a json string.
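As a minimal illustration of that round trip (the sample string here is made up for demonstration):

import json

# a json string whose outer layer is a list containing dictionaries
s = '[{"name": "A", "score": 1}, {"name": "B", "score": 2}]'

data = json.loads(s)       # json string -> python list of dicts
print(data[0]['name'])     # A

back = json.dumps(data)    # python data -> json string
print(type(back))          # <class 'str'>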

json file: the backend usually transfers database data to the frontend as a json file, and the frontend processes and renders the data in the json file and displays it on the page.

Often, the data a crawler gets from the front-end page is incomplete because it has already been processed and rendered, so sometimes we need to fetch the back-end json data packet directly.

Crawling and parsing the json data packet of Honor of Kings hero information

Note: this code is for learning crawler techniques only and has no commercial purpose

[Python] crawler data extraction

In the previous article, the Honor of Kings hero skin images were obtained from urls in the front-end page code. That yielded only 93 heroes, so the data was incomplete. This time we can get the data from the json file in the network packets and download it with the crawler: 114 heroes in total, which is every hero currently in the game.

Find the json data packet that stores the hero information among the network packets.

Get the url of the json file from the request headers, then download the json packet.

Code to download every Honor of Kings hero skin image:

import requests
import json
import os

url = "https://pvp.qq.com/web201605/js/herolist.json"
response = requests.get(url)    # request the json packet
# print(response.text)

# the json here is a list whose elements are dictionaries
heroList = json.loads(response.text)
# print(len(heroList))        # 114

for i in heroList:
    hero_id = i['ename']
    name = i['cname']
    print(hero_id, name)

    # create a separate directory for each hero's skins
    os.makedirs(f"./imag/{name}", exist_ok=True)

    # compare the skin image urls to find the pattern:
    # skins are numbered from 1, so count up until a request fails
    cnt = 1
    while True:
        try:
            url = f"https://game.gtimg.cn/images/yxzj/img201606/skin/hero-info/{hero_id}/{hero_id}-bigskin-{cnt}.jpg"
            response = requests.get(url)
            if response.status_code != 200:
                break
            with open(f"./imag/{name}/skin-{cnt}.jpg", "wb") as f:
                f.write(response.content)
            cnt += 1
        except requests.RequestException as e:
            print(e)
            break
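Note one assumption in this loop: skin numbers start at 1 and are consecutive, so the loop stops at the first non-200 response. If a hero's skin numbering ever had a gap, the skins after the gap would be missed.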

Crawling and parsing the Douyin video json data packet

Note: this code is for learning crawler techniques only and has no commercial purpose

Example: the homepage videos of Liu Genghong, a fitness coach with tens of millions of followers

1. First open the Douyin web page, search for Liu Genghong, enter Coach Liu's video homepage, then right-click and select Inspect.

There are many ways to crawl a video. You can download it directly from the video's web link with you-get, but you-get often fails against Douyin's anti-crawling mechanisms. In that case we look for the data packet of the video file and fetch the video directly through a network request.

2. Select the PC version of the page, refresh it, and examine the network packets for one that stores the video. Packet names have no uniform format, so you have to search them one by one. This takes luck and experience, and you may not find the video packet at all.

Observe the information in the response: generally there is a "video" tag near a video file, while an image file's link will start with p or be tagged img or p. This is not absolute; it takes experience to judge.

3. Searching the packets on the PC side did not turn up the one storing the video file, so we can search the packets on the mobile side instead, where we may well find it. Of course, searching the mobile page presupposes that the site has a mobile version.

After switching the developer tools to mobile mode, refresh the page; the home page switches to the mobile layout, and the packets in the Network panel change as well. We still have to search them one by one for video packets. Looking at the home page on the left, there are 12 videos in total, which means we can find at most 12 video data packets.

4. Going on experience, I found a data packet with a video tag. Under it there is a url_list containing links; copying one into the browser confirms that it is exactly the video file we need.

5. To analyze the data structure in the packet, open a tool called sojson, which you can find by searching Baidu directly, and click json online analysis after entering the site. This tool formats string data into an expandable layout so that we can observe the overall structure of the packet.
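If you would rather not paste the packet into an online tool, the json module can produce the same expandable view locally. A small sketch (the raw string below is a made-up stand-in for response.text):

import json

# format a json string into an indented, readable layout;
# ensure_ascii=False keeps any Chinese text readable in the output
raw = '{"aweme_list": [{"desc": "title", "video": {"play_addr": {"url_list": ["..."]}}}]}'
print(json.dumps(json.loads(raw), indent=4, ensure_ascii=False))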

Observation shows that the outermost layer of this packet is a dictionary, and the value of the 'aweme_list' key is a list. Expanding the list reveals multiple layers of dictionaries and lists nested inside.

6. Get the url of the request packet; it can be found in the request headers, and the request method is get. While getting the url, also take the user-agent, referer, cookie... information from the request headers.

7. After obtaining the response, load it with the json module: json.loads(response.text) returns the packet as dictionary-format data. Then use the dictionary's key to get the list containing the video information, and iterate over the list to get each video's file url and title.

Make a network request for each video file's url and write the returned data in binary to a file with the .mp4 suffix; that completes the download. The code below creates the directory where the videos are stored.

import requests
import json
import os

# url of the data packet that contains the video information
url = "https://m.douyin.com/web/api/v2/aweme/post/?reflow_source=reflow_page&sec_uid=MS4wLjABAAAASwhiL0bRi1X_zs7UhAIO2udbD1F_XKrsJMOaukl1Io4&count=15&max_cursor=0&msToken=AjdH_77aAG1sC-0U-MaMQBD3QT95XjiZP1e4e5JJYpBnimVxKqDUU10RT2MgbZWKVfyTaxM09vdszhneWinYQNztXdYjJmQxVrp-phFdeimKvdCLmEP8uf3XbhPt4qI=&X-Bogus=DFSzKwVOFYJANeTitVG4MBt/pLfR&_signature=_02B4Z6wo00001.ZN80AAAIDCfQZoo-troSP2XffAAJmzewiu-7U6iD-JbAD74nmRsnNpUV.-BS9Fw6LVCVWwonyxlS-XqkHgugFjUnAqh-vM3n5uFWhxCFihg6oeZDnwSp1ZGQjVtWQvVauT29"
head = {
    'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Mobile Safari/537.36',
    'referer': 'https://m.douyin.com/share/user/MS4wLjABAAAASwhiL0bRi1X_zs7UhAIO2udbD1F_XKrsJMOaukl1Io4',
    'cookie': 'ttwid=1%7CSfdsymdYx1QlfjQTR3xYeFSNa9O6q4f1wFp6BtrRqxs%7C1676826682%7C30e0eb856043a47a208c80c1af10d30fc0cfcf4e270115abae592c5ec3873249; passport_csrf_token=6372cba6b4f2cdcca86e2189f5d1065e; passport_csrf_token_default=6372cba6b4f2cdcca86e2189f5d1065e; strategyABtestKey=%221681489097.642%22; d_ticket=5bc0d2611324cd7fbfc9e0f926fec81dc0d98; passport_assist_user=CkHkH_6ka8IiPFCiZKjYeqLlQ3IQO4dnoM20tGGmRhZNzWR0b-vg3QFOuNc8hYUqhotVsLwAHo5tpR-NWGrr7cp3_hpICjynJIYtuuLTHP0W9Aa8AzzU2xfJszvShJhQUoSBt96bF2UGXYQAgVAAwid85Z0QTR8LLb5lB5pKAX6aZzkQ_r6uDRiJr9ZUIgED4L4Gmw%3D%3D; n_mh=48GDnZrzh9L0L71QYQe9RHV7INgvIpqOtZbHqRblemk; sso_auth_status=5ac2453cbbb4e7711ec74d4525109405; sso_auth_status_ss=5ac2453cbbb4e7711ec74d4525109405; sso_uid_tt=23bd5aa65e9bf83cf5d1da0295dab814; sso_uid_tt_ss=23bd5aa65e9bf83cf5d1da0295dab814; toutiao_sso_user=e17ba4d3ba02f5112b43bf87f6ebcf82; toutiao_sso_user_ss=e17ba4d3ba02f5112b43bf87f6ebcf82; sid_ucp_sso_v1=1.0.0-KDkzMjFjN2Q2NzM2YTlkZDA5ZDQ3NThiMzNlZDlhZjFlM2U4ZDhhNTIKHwjX-IDEvvXbAxDohOahBhjvMSAMMIKS-v0FOAJA8QcaAmxmIiBlMTdiYTRkM2JhMDJmNTExMmI0M2JmODdmNmViY2Y4Mg; ssid_ucp_sso_v1=1.0.0-KDkzMjFjN2Q2NzM2YTlkZDA5ZDQ3NThiMzNlZDlhZjFlM2U4ZDhhNTIKHwjX-IDEvvXbAxDohOahBhjvMSAMMIKS-v0FOAJA8QcaAmxmIiBlMTdiYTRkM2JhMDJmNTExMmI0M2JmODdmNmViY2Y4Mg; odin_tt=c4d2b5e57715bbf9f03692fd068425eda4129913fbb28d6c1c1287a19e2203d77a66deb3279cb017d3bc28e0e451f475ca24d947928720cc9b5762f8aedd9ea6; passport_auth_status=a7ce1e07f923c0f2f1c214dd3a228ba0%2Cc53255e0ba344f56a27077246b056268; passport_auth_status_ss=a7ce1e07f923c0f2f1c214dd3a228ba0%2Cc53255e0ba344f56a27077246b056268; uid_tt=2750b133aa1926c33be64dbd3cd2ff92; uid_tt_ss=2750b133aa1926c33be64dbd3cd2ff92; sid_tt=310d75ea30feab567b26cc6cb5972446; sessionid=310d75ea30feab567b26cc6cb5972446; sessionid_ss=310d75ea30feab567b26cc6cb5972446; publish_badge_show_info=%220%2C0%2C0%2C1681490545371%22; LOGIN_STATUS=1; store-region=cn-zj; store-region-src=uid; sid_guard=310d75ea30feab567b26cc6cb5972446%7C1681490547%7C5183992%7CTue%2C+13-Jun-2023+16%3A42%3A19+GMT; sid_ucp_v1=1.0.0-KDdiN2I2MWM2N2IzYmFhYmRiODNiMTlhMDJiNmI0ZGUyZDZjZTQ4Y2MKGwjX-IDEvvXbAxDzhOahBhjvMSAMOAJA8QdIBBoCaGwiIDMxMGQ3NWVhMzBmZWFiNTY3YjI2Y2M2Y2I1OTcyNDQ2; ssid_ucp_v1=1.0.0-KDdiN2I2MWM2N2IzYmFhYmRiODNiMTlhMDJiNmI0ZGUyZDZjZTQ4Y2MKGwjX-IDEvvXbAxDzhOahBhjvMSAMOAJA8QdIBBoCaGwiIDMxMGQ3NWVhMzBmZWFiNTY3YjI2Y2M2Y2I1OTcyNDQ2; download_guide=%223%2F20230415%22; SEARCH_RESULT_LIST_TYPE=%22single%22; my_rd=1; s_v_web_id=verify_lghtat20_3RFiWIRj_x930_45U0_BZ3H_uaRjGNXAso9X; ttcid=08ead2e071f34a49a4e20a80c33afd4f42; FOLLOW_LIVE_POINT_INFO=%22MS4wLjABAAAAxMM2c6KNNNHvRluZ2KTOB7UJeBxyCmzUXWp4TliKXZ_wSxAmrY0IkUK4pwbGPM7g%2F1681574400000%2F0%2F1681553160083%2F0%22; FOLLOW_NUMBER_YELLOW_POINT_INFO=%22MS4wLjABAAAAxMM2c6KNNNHvRluZ2KTOB7UJeBxyCmzUXWp4TliKXZ_wSxAmrY0IkUK4pwbGPM7g%2F1681574400000%2F0%2F0%2F1681554392542%22; msToken=AjdH_77aAG1sC-0U-MaMQBD3QT95XjiZP1e4e5JJYpBnimVxKqDUU10RT2MgbZWKVfyTaxM09vdszhneWinYQNztXdYjJmQxVrp-phFdeimKvdCLmEP8uf3XbhPt4qI=; tt_scid=N0mdcnXLcnZ0sRc48a2X5KXeTgh11VQsoMIIH7tamef--JVlvaRFUjPP8LOViFHJ8ae2; VIDEO_FILTER_MEMO_SELECT=%7B%22expireTime%22%3A1682160805638%2C%22type%22%3A1%7D; msToken=fvm1KRAKqcE7dnf6sC2JpB9vNqVP3TloBQ4vGEQeljq2Ly5ypnjz_iUbBl3q2wq_ISn2uUhS4_XncTkCQQEsNQX63mFUzjeDdrx9yOh2ERFmUkGvQEHhgw==; home_can_add_dy_2_desktop=%221%22'
}

response = requests.get(url, headers=head)
# print(response.text)

# extract the data with json; observation shows the outermost layer is a dictionary
data_dict = json.loads(response.text)
data_list = data_dict['aweme_list']     # the videos live under this key: all 12 from the home page
# print(len(data_list))   # 12

os.makedirs("./LiuGH", exist_ok=True)   # directory where the videos are stored

# 'aweme_list' is a dictionary key whose value is a list of length 12,
# and each element of the list is itself a dictionary
for i in data_list[:3]:                 # download the first 3 videos as a demo
    v_title = i['desc']
    v_url = i['video']['play_addr']['url_list'][0]
    print(v_title, v_url)               # title and url of each video file
    res = requests.get(v_url, headers=head)
    with open(f"./LiuGH/{v_title}.mp4", "wb") as f:
        f.write(res.content)
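One caveat: the title from desc goes straight into the file name, and characters such as / or ? are illegal in file names, which would make open() fail. A sketch of stripping them first (the safe_name helper is my addition, not part of the original code):

import re

def safe_name(title: str) -> str:
    # replace characters that are illegal in file names on common systems
    return re.sub(r'[\\/:*?"<>|]', '_', title)

# usage inside the loop above:
# with open(f"./LiuGH/{safe_name(v_title)}.mp4", "wb") as f: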

 


Origin blog.csdn.net/phoenixFlyzzz/article/details/130170792