How to crawl Douyin videos simply and quickly in Python (with detailed steps)


Foreword

Earlier articles covered some Python crawler basics:

Python crawler weapon - Selenium

Python job analysis report

Python crawls girl pictures

However, those articles all crawled content from web pages in a PC browser. Mobile apps are used more and more, and many of them have no web page at all. Douyin, for example, has no web version. Does that mean its videos cannot be downloaded in batches?


1. App packet capture

Of course not! An app communicates in much the same way as a web page: both send requests to a backend server to obtain data. In a browser we can open the developer tools and inspect each request, but an app offers no such view. We therefore need a packet capture tool to see the app's requests and responses. Common choices include Wireshark, Fiddler, and Charles. This article shows how to use Fiddler to capture the traffic of a mobile app.

Fiddler works as a proxy. Once it is configured, requests from the mobile app are forwarded to the server by Fiddler, and the server's responses are relayed back through Fiddler as well. This lets us see in Fiddler both the requests the app sends and the responses the server returns.
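The same relay idea can be tried from Python itself. The following is only a minimal sketch, assuming a proxy is listening on the given host and port; `build_proxy_opener` is a hypothetical helper name, and the host and port in the usage comment are placeholders for whatever your Fiddler instance shows under Connections.

```python
import urllib.request


def build_proxy_opener(host, port):
    """Return a urllib opener that sends HTTP and HTTPS requests via host:port."""
    proxy = f'http://{host}:{port}'
    handler = urllib.request.ProxyHandler({'http': proxy, 'https': proxy})
    return urllib.request.build_opener(handler)


# Usage (assumed values; requires the proxy to be running):
# opener = build_proxy_opener('127.0.0.1', 8888)
# opener.open('http://example.com')  # the request would then appear in the proxy
```

Any request made through the returned opener passes through the proxy first, which is the same path the phone's traffic takes once its Wi-Fi proxy is set.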

2. Fiddler installation and configuration

1. Fiddler configuration

After installing Fiddler, open Tools > Options > HTTPS and enable HTTPS capture (typically the Capture HTTPS CONNECTs and Decrypt HTTPS traffic checkboxes).
Then check Allow remote computers to connect under the Connections tab so that Fiddler accepts requests from other devices.

Also note the port number shown on this tab (Fiddler's default is 8888); you will need to enter it on the phone.

After saving the configuration, be sure to close Fiddler and reopen it.

2. Mobile phone configuration

Make sure the phone and the computer are on the same LAN. First check the computer's IP address by running ipconfig in cmd. My computer is on a wireless network, and its IP address is 192.168.1.3.
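As an alternative to reading the ipconfig output, the address can also be looked up from Python. This is a best-effort sketch, not part of the original workflow: `local_ip` is a hypothetical helper, and the probe address 192.0.2.1 is just a documentation-range address used to make the operating system pick an outgoing interface.

```python
import socket


def local_ip(probe_host='192.0.2.1', probe_port=80):
    """Best-effort guess of this machine's LAN IPv4 address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packets; it only makes the OS
        # choose the interface it would use to reach probe_host.
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        # No usable route (for example, no network): fall back to loopback.
        return '127.0.0.1'
    finally:
        sock.close()


if __name__ == '__main__':
    print(local_ip())
```

On a machine with several network adapters the result may not be the Wi-Fi address, so it is worth comparing it against what ipconfig reports.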

On the phone, open the Wi-Fi settings, long-press the network you are connected to, and choose Modify Network. In the proxy settings, enter the computer's IP address as the proxy host and Fiddler's port as the proxy port.
After saving, open http://192.168.1.3:8888 (the computer's IP address and the Fiddler port) in the phone's built-in browser. This page would not open for me in the Quark browser; I had to use the browser that ships with the phone.

On the page that opens, click the link at the bottom to download the certificate, then install it.

It is also worth opening this address in the computer's browser and installing the certificate there, so that browser traffic can be captured later as well.

Once the certificate is installed, everything is ready. Open the app on the phone and its requests will show up in Fiddler.

3. Code

The code is very simple: as in the previous articles, just use requests to fetch the corresponding link.

The code is only a simple example that downloads the content of the current page. To download all videos, construct a new URL based on the has_more and max_cursor fields in the returned JSON and keep downloading.

The user_id in the URL can be changed to the user you want to crawl. To find it, share the user's profile to WeChat and open the link in a browser; the user_id appears in the URL that opens.

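import re
import urllib.request

import requests


def get_url(url):
    # Send a mobile-looking User-Agent, as the captured app request did.
    headers = {'user-agent': 'mobile'}
    # verify=False disables TLS certificate verification for this request.
    resp = requests.get(url, headers=headers, verify=False)
    data = resp.json()
    for item in data['aweme_list']:
        # Use the description as the file name, falling back to the video ID.
        name = item['desc'] or item['aweme_id']
        # Replace characters that are not allowed in file names.
        name = re.sub(r'[\\/:*?"<>|\r\n]', '_', name)
        video_url = item['video']['play_addr']['url_list'][0]
        urllib.request.urlretrieve(video_url, filename=name + '.mp4')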


if __name__ == "__main__":
   get_url('https://api.amemv.com/aweme/v1/aweme/post/?max_cursor=0&user_id=98934041906&count=20&retry_type=no_retry&mcc_mnc=46000&iid=58372527161&device_id=56750203474&ac=wifi&channel=huawei&aid=1128&app_name=aweme&version_code=421&version_name=4.2.1&device_platform=android&ssmix=a&device_type=STF-AL10&device_brand=HONOR&language=zh&os_api=26&os_version=8.0.0&uuid=866089034995361&openudid=008c22ca20dd0de5&manifest_version_code=421&resolution=1080*1920&dpi=480&update_version_code=4212&_rticket=1548080824056&ts=1548080822&js_sdk_version=1.6.4&as=a1b51dc4069b2cc6252833&cp=dab7ca5f68594861e1[wIa&mas=014a70c81a9db218501e1433b04c38963ccccc1c4cac4c6cc6c64c')

After running, the videos on the current page are saved as .mp4 files in the working directory.
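The paging idea mentioned earlier (following has_more and max_cursor) could be sketched roughly as follows. This is an illustration under assumptions, not the author's code: `with_cursor`, `iter_pages`, `fetch_json`, and `max_pages` are hypothetical names, `fetch_json` stands for any function that takes a URL and returns the decoded JSON, and the source does not confirm that the server accepts a URL whose max_cursor was changed while the other captured parameters stay the same.

```python
import re


def with_cursor(url, cursor):
    """Return url with the value of its max_cursor query parameter replaced."""
    return re.sub(r'([?&]max_cursor=)\d+',
                  lambda m: m.group(1) + str(cursor), url, count=1)


def iter_pages(first_url, fetch_json, max_pages=50):
    """Yield each page's aweme_list, following max_cursor while has_more is truthy."""
    url = first_url
    for _ in range(max_pages):  # hard cap so a misbehaving response cannot loop forever
        data = fetch_json(url)
        yield data.get('aweme_list', [])
        if not data.get('has_more'):
            break
        url = with_cursor(url, data['max_cursor'])
```

Passing the fetch function in as an argument keeps the loop independent of how the request is made, so it can be exercised with canned responses before it is pointed at a real URL.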

Summary

If you have any questions, you can add me as a friend through the account backend and ask me there.


Origin blog.csdn.net/liaozp88/article/details/129650201