Use Python to build an automatic YouTube downloader (complete code attached)

Process

1. post

Following the first step of the idea, we first send a POST request to obtain the encrypted JavaScript. I use the third-party requests library for this. For background on crawlers, please refer to my previous article.

i. First format the headers in the post

# set the headers or the website will not return information
    # the cookies in here you may need to change
    headers = {
        "cache-Control": "no-cache",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
                  "*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
        "content-type": "application/x-www-form-urlencoded",
        "cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
                  "clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
                  "helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
                  "_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
                  "PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
                  "PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
        "origin": "https://en.savefrom.net",
        "pragma": "no-cache",
        "referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
        "sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
        "sec-ch-ua-mobile": "?0",
        "sec-fetch-dest": "iframe",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/87.0.4280.88 Safari/537.36"}

The cookie value may need to be changed; it is best to copy the one your own browser sends. The meaning of each header is beyond the scope of this article; you can look them up with a search engine.
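Since the cookie usually has to be replaced with the one from your own browser, it can be convenient to paste the raw Cookie header and convert it into a dictionary. The following is only a minimal sketch: the helper name parse_cookie_header is hypothetical and not part of the original code, and the sample string is shortened.

```python
def parse_cookie_header(raw):
    """Turn a raw Cookie header string ("a=1; b=2") into a dict."""
    cookies = {}
    for part in raw.split(";"):
        part = part.strip()
        if not part:
            continue
        # split on the first "=" only, because a value may itself contain "="
        name, _, value = part.partition("=")
        cookies[name] = value
    return cookies


raw = "lang=en; country=CN; x-requested-with=; sfHelperDist=72"
print(parse_cookie_header(raw))
```

A dictionary like this could then be passed to requests.post through its cookies= argument instead of embedding the string in the headers.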

ii. Then format the parameters

# set the parameter, we can get from chrome
    kv = {"sf_url": url,
          "sf_submit": "",
          "new": "1",
          "lang": "en",
          "app": "",
          "country": "cn",
          "os": "Windows",
          "browser": "Chrome"}

The sf_url field is the URL of the YouTube video we want to download; the other parameters stay unchanged.

iii. Finally execute the post request of the requests library

# do the POST request
    r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers,
                      data=kv)
    r.raise_for_status()

Note that the form fields are passed with data=kv, so requests sends them form-encoded in the request body.
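To get a feel for what data=kv puts into the request body, the sketch below encodes a shortened version of the form with the standard library's urllib.parse.urlencode, which produces the same application/x-www-form-urlencoded format. It only illustrates the encoding and sends nothing.

```python
from urllib.parse import urlencode

# a shortened version of the kv dictionary from the article
kv = {"sf_url": "https://www.youtube.com/watch?v=YPvtz1lHRiw",
      "sf_submit": "",
      "new": "1",
      "lang": "en"}

# the dictionary becomes one form-encoded string for the request body
body = urlencode(kv)
print(body)
```

Passing the same dictionary as params= would instead append it to the URL as a query string, which is not what this form expects.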

iv. Encapsulated into a function

import requests

def gethtml(url):
    # set the headers or the website will not return information
    # the cookies in here you may need to change
    headers = {
        "cache-Control": "no-cache",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
                  "*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
        "content-type": "application/x-www-form-urlencoded",
        "cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
                  "clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
                  "helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
                  "_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
                  "PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
                  "PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
        "origin": "https://en.savefrom.net",
        "pragma": "no-cache",
        "referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
        "sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
        "sec-ch-ua-mobile": "?0",
        "sec-fetch-dest": "iframe",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/87.0.4280.88 Safari/537.36"}
    # set the parameter, we can get from chrome
    kv = {"sf_url": url,
          "sf_submit": "",
          "new": "1",
          "lang": "en",
          "app": "",
          "country": "cn",
          "os": "Windows",
          "browser": "Chrome"}
    # do the POST request
    r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers,
                      data=kv)
    r.raise_for_status()
    # get the result
    return r.text

2. Call the decryption function

i. Analysis

The difficulty lies in executing JavaScript code from Python. Existing solutions include PyV8, among others; this article uses execjs. As noted in the idea part, the last few lines of the JavaScript contain the call to the decryption function, so we only need to load the whole script in execjs first and then evaluate the decryption call separately.

ii. Take out the js part first

# target(youtube address) url
    url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
    # get the target text
    reo = gethtml(url)
    # Remove the code from the head and tail (we need the javascript part, information store with encryption in js part)
    reo = reo.split("<script type=\"text/javascript\">")[1].split("</script>")[0]

Regular expressions could also be used here, but since I am not very familiar with them, I use split directly.
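For readers who prefer the regular-expression route, a minimal sketch could look like the following. The function name extract_script and the sample HTML are made up for illustration; the real response is of course much longer.

```python
import re


def extract_script(html):
    """Return the body of the first <script type="text/javascript"> block."""
    # re.S lets "." match newlines, since the script spans many lines
    match = re.search(r'<script type="text/javascript">(.*?)</script>', html, re.S)
    if match is None:
        raise ValueError("no inline javascript block found")
    return match.group(1)


sample = '<html><script type="text/javascript">\nvar a = 1;\n</script></html>'
print(extract_script(sample))
```

Compared with the chained split calls, this raises a readable error when the page layout changes, instead of an IndexError.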

iii. Extract the decryption call

If you fetch the results for different videos several times, you will find that the decryption function is different each time, but it always sits at the same position counted from the end of the script.

# split into lines (this helps us find the decrypt function in the last few lines)
    reA = reo.split("\n")
    # get the decrypt function: the first statement on the third-to-last line
    name = reA[len(reA) - 3].split(";")[0] + ";"

So name holds our decryption call (admittedly not a great variable name).

iv. Execute with execjs

# use execjs to compile the js code
    ct = execjs.compile(reo)
    # do the decryption
    text = ct.eval(name.split("=")[1].replace(";", ""))

Here we take only the part after = and strip the semicolon, so the function is evaluated without the assignment. Running the assignment together with the decryption and then reading the variable would also work.
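The two string steps above (pick the third-to-last line, then keep what follows the =) can be sketched as one helper. The function name decrypt_call_expression, its offset parameter, and the sample script are hypothetical; the real script is obfuscated and looks different. Unlike the article's code, this sketch splits on the first = only, which is an assumption meant to protect any = inside the call's arguments.

```python
def decrypt_call_expression(js, offset=3):
    """Return the expression assigned by the first statement on the
    offset-th line from the end of the script."""
    lines = js.split("\n")
    statement = lines[-offset].split(";")[0]
    # split on the first "=" only, so an "=" inside the call is preserved
    return statement.split("=", 1)[1].strip()


# made-up stand-in for the tail of the downloaded script
sample_js = "var k = 1;\nresult = decode('abc=');run();\n})();\n"
print(decrypt_call_expression(sample_js))
```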

However, this fails with an error right away (if only it were that simple).

1. The window variable does not exist

If I remember correctly, the error involves this or $b. I tried removing every this, and also wrapping everything in a class (so that this would refer to that class), but neither worked. Then I found jsdom, an npm package that can simulate the window variable inside execjs (there is probably a better way). So we need to install npm and the jsdom package, and then rewrite the code above:

addition = """
    const jsdom = require("jsdom");
    const { JSDOM } = jsdom;
    const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
    window = dom.window;
    document = window.document;
    XMLHttpRequest = window.XMLHttpRequest;
    """
    # use execjs to execute the js code, and the cwd is the result of `npm root -g`(the path of npm in your computer)
    ct = execjs.compile(addition + reo, cwd=r'C:\Users\xxx\AppData\Roaming\npm\node_modules')

Here:

  • The cwd argument is the output of npm root -g, i.e. the path of the global npm modules.
  • addition is used to simulate window.

With this in place, we run into the next error.
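Instead of hard-coding the cwd path, the output of npm root -g could also be looked up from Python. This is a minimal sketch under a few assumptions: npm is on the PATH, Python is 3.7 or newer (for capture_output), and the helper name find_npm_global_root is hypothetical. It returns None when npm cannot be found or the command fails.

```python
import shutil
import subprocess


def find_npm_global_root():
    """Return the output of `npm root -g`, or None if npm is unavailable."""
    npm = shutil.which("npm")
    if npm is None:
        return None
    try:
        done = subprocess.run([npm, "root", "-g"], capture_output=True,
                              text=True, check=True, timeout=60)
    except (subprocess.SubprocessError, OSError):
        return None
    # an empty answer is treated the same as "not found"
    return done.stdout.strip() or None


print(find_npm_global_root())
```

The returned string could then be passed as the cwd argument of execjs.compile.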

2. The alert function does not exist

This error occurs because calling alert under execjs is meaningless: there is no browser to show the pop-up. The alert function is normally provided by the browser's window, but we have replaced window with our own, so we have to define alert ourselves before the code that calls it (in effect, an empty stub).

# override the alert function, because the code calls it in one place;
    # alerting is meaningless under execjs, but without the override the code raises an error
    reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
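To make the effect of this replacement visible, here is a small sketch on a made-up one-line script. The function name stub_alert is hypothetical, and the real script is far longer.

```python
def stub_alert(js):
    """Insert an empty alert definition at the start of each wrapper function."""
    return js.replace("(function(){", "(function(){\nthis.alert=function(){};")


# made-up stand-in for the downloaded script
sample_js = "(function(){alert('hi');})();"
patched = stub_alert(sample_js)
print(patched)
```

Note that str.replace patches every occurrence of "(function(){", not only the first one, so each such wrapper in the script gets its own stub.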

v. Putting the code together

# target(youtube address) url
    url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
    # get the target text
    reo = gethtml(url)
    # Remove the code from the head and tail (we need the javascript part, information store with encryption in js part)
    reo = reo.split("<script type=\"text/javascript\">")[1].split("</script>")[0]
    # override the alert function, because the code calls it in one place;
    # alerting is meaningless under execjs, but without the override the code raises an error
    reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
    # split each line(help us find the decrypt function in last few line)
    reA = reo.split("\n")
    # get the decrypt function
    name = reA[len(reA) - 3].split(";")[0] + ";"
    # add jsdom into the execjs because the code will use(maybe there is a solution without jsdom, but i have no idea)
    addition = """
    const jsdom = require("jsdom");
    const { JSDOM } = jsdom;
    const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
    window = dom.window;
    document = window.document;
    XMLHttpRequest = window.XMLHttpRequest;
    """
    # use execjs to execute the js code, and the cwd is the result of `npm root -g`(the path of npm in your computer)
    ct = execjs.compile(addition + reo, cwd=r'C:\Users\19308\AppData\Roaming\npm\node_modules')
    # do the decryption
    text = ct.eval(name.split("=")[1].replace(";", ""))

3. Analyze the decryption result

i. Get key json

After running the part above, the decryption result is stored in text. As seen in the idea part, what really matters is the JSON passed to window.parent.sf.videoResult.show(), so we use a regular expression to extract that JSON.

# get the result in json (raw string avoids invalid-escape warnings; group(1) is the part inside show(...))
    result = re.search(r'show\((.*?)\);;', text, re.I | re.M).group(1)

ii. Format json

Python has many libraries for parsing JSON; here I use the built-in json module (remember to import it).

# use `json` to load json
    j = json.loads(result)
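The extraction and parsing steps can be tried without the website by running them on a hand-written string. In the sketch below the sample text, the example.com link, and the helper name extract_result are all made up; only the show(...);; pattern and the j["url"][num]["url"] layout come from the article.

```python
import json
import re


def extract_result(text):
    """Pull the JSON argument out of a ...videoResult.show(...);; call."""
    match = re.search(r'show\((.*?)\);;', text, re.I | re.M)
    if match is None:
        raise ValueError("videoResult.show(...) call not found")
    # group(1) is the text between "show(" and ");;"
    return json.loads(match.group(1))


# made-up stand-in for the decrypted text
sample_text = ('window.parent.sf.videoResult.show('
               '{"url": [{"url": "https://example.com/a.mp4", "quality": "720"}]}'
               ');;')
j = extract_result(sample_text)
print(j["url"][0]["url"])
```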

iii. Get the download address

Then comes the last step. Based on the idea part and a JSON formatting tool, we can see that j["url"][num]["url"] is the download link, where num selects the video format we want (different resolutions and types).

# the selection of video (in this case, num=1 means the video is
    # - 360p, known from j["url"][num]["quality"]
    # - MP4, known from j["url"][num]["type"]
    # - audio, known from j["url"][num]["audio"])
    num = 1
    downurl = j["url"][num]["url"]
    # do some download
    # thanks :)
    # - EOF -
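The code above stops at "do some download". One possible way to continue is sketched below, with several assumptions stated up front: the helper names pick_format and download are hypothetical, the sample entries and their "quality" and "type" values are made up (inspect the real JSON before relying on them), and the download uses the standard library's urllib.request rather than requests. Real links may also require extra headers, which this sketch does not send.

```python
import shutil
import urllib.request


def pick_format(entries, quality, type_):
    """Return the first entry matching quality and type, or None."""
    for entry in entries:
        if (str(entry.get("quality")) == str(quality)
                and str(entry.get("type", "")).lower() == type_.lower()):
            return entry
    return None


def download(url, path):
    """Stream the response body to a file without loading it into memory."""
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)


# made-up entries shaped like j["url"] in the article
sample_entries = [
    {"url": "https://example.com/720.mp4", "quality": "720", "type": "MP4"},
    {"url": "https://example.com/360.mp4", "quality": "360", "type": "MP4"},
]
chosen = pick_format(sample_entries, "360", "mp4")
print(chosen["url"])
# download(chosen["url"], "video.mp4")  # commented out: needs network access
```

Selecting by field values instead of a fixed num keeps working if the order of the formats changes between videos.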

4. Complete code

# -*- coding: utf-8 -*-
# @Time: 2021/1/10
# @Author: Eritque arcus
# @File: Youtube.py
# @License: MIT
# @Environment:
#           - windows 10
#           - python 3.6.2
# @Dependence:
#           - jsdom in npm(windows also can use)
#           - requests, execjs, re, json in python
import requests
import execjs
import re
import json


def gethtml(url):
    # set the headers or the website will not return information
    # the cookies in here you may need to change
    headers = {
        "cache-Control": "no-cache",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
                  "*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
        "content-type": "application/x-www-form-urlencoded",
        "cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
                  "clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
                  "helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
                  "_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
                  "PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
                  "PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
        "origin": "https://en.savefrom.net",
        "pragma": "no-cache",
        "referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
        "sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
        "sec-ch-ua-mobile": "?0",
        "sec-fetch-dest": "iframe",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/87.0.4280.88 Safari/537.36"}
    # set the parameter, we can get from chrome
    kv = {"sf_url": url,
          "sf_submit": "",
          "new": "1",
          "lang": "en",
          "app": "",
          "country": "cn",
          "os": "Windows",
          "browser": "Chrome"}
    # do the POST request
    r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers,
                      data=kv)
    r.raise_for_status()
    # get the result
    return r.text


if __name__ == '__main__':
    # target(youtube address) url
    url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
    # get the target text
    reo = gethtml(url)
    # Remove the code from the head and tail (we need the javascript part, information store with encryption in js part)
    reo = reo.split("<script type=\"text/javascript\">")[1].split("</script>")[0]
    # override the alert function, because the code calls it in one place;
    # alerting is meaningless under execjs, but without the override the code raises an error
    reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
    # split each line(help us find the decrypt function in last few line)
    reA = reo.split("\n")
    # get the decrypt function
    name = reA[len(reA) - 3].split(";")[0] + ";"
    # add jsdom into the execjs because the code will use(maybe there is a solution without jsdom, but i have no idea)
    addition = """
    const jsdom = require("jsdom");
    const { JSDOM } = jsdom;
    const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
    window = dom.window;
    document = window.document;
    XMLHttpRequest = window.XMLHttpRequest;
    """
    # use execjs to execute the js code, and the cwd is the result of `npm root -g`(the path of npm in your computer)
    ct = execjs.compile(addition + reo, cwd=r'C:\Users\19308\AppData\Roaming\npm\node_modules')
    # do the decryption
    text = ct.eval(name.split("=")[1].replace(";", ""))
    # get the result in json
    result = re.search(r'show\((.*?)\);;', text, re.I | re.M).group(1)
    # use `json` to load json
    j = json.loads(result)
    # the selection of video (in this case, num=1 means the video is
    # - 360p, known from j["url"][num]["quality"]
    # - MP4, known from j["url"][num]["type"]
    # - audio, known from j["url"][num]["audio"])
    num = 1
    downurl = j["url"][num]["url"]
    # do some download
    # thanks :)
    # - EOF -


-end-


Copyright statement: This article is the original article of the blogger and follows the CC 4.0 BY-SA copyright agreement. Please attach the original source link and this statement for reprinting.

Author: https://www.cnblogs.com/Eritque-arcus/ or https://blog.csdn.net/qq_40832960


Origin blog.csdn.net/weixin_43881394/article/details/112561777