Python: a multithreaded YouTube video crawler with automatic translation of non-Chinese titles


Disclaimer: all my articles are originally published on Cnblogs (the "blog garden"). I've seen people copy-paste them elsewhere without even citing the source, and stamp their own watermark on top... I'm honestly speechless.

Introduction: a while ago I did a video-crawling project. The code has been written for some time; here I'm writing it up and re-analyzing it. Please bear with any rough spots :)

Environment: Python 2.7 + Windows 10

 

First things first: accessing YouTube from China requires a proxy/VPN; sort that out yourself, ideally with a global (system-wide) proxy.

OK. Now, start by opening the site and taking a look.

(Screenshot: the YouTube home page)

The site is clean and pleasant. The goal this time is to crawl videos matching a search keyword, which also gives a natural way to categorize them: searching in Chinese mostly returns domestic videos, searching in English mostly returns foreign ones.

Testing a Chinese keyword first ("funny"), plenty of videos come back, and you can filter further by various conditions. YouTube video links are very regular: they all look like https://www.youtube.com/watch?v=v_OVBHGwOaU, and only the v value differs. I'll call that value the id.

(Screenshot: search results, showing the watch?v= link pattern)
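By the way, since the id is just the v query parameter, the standard library can pull it out of any watch link. A minimal Python 2 sketch (the video_id helper is my own illustration, not part of the final script):

# -*- coding:utf-8 -*-
from urlparse import urlparse, parse_qs   # stdlib in Python 2

def video_id(watch_url):
    # Extract the id, i.e. the v query parameter, from a watch link.
    return parse_qs(urlparse(watch_url).query).get('v', [None])[0]

print video_id("https://www.youtube.com/watch?v=v_OVBHGwOaU")   # v_OVBHGwOaU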

OK, start with the easiest approach: view the page source and see whether those video links are in there. I opened my 24k-gold single-dog eyes wide and looked... and found all the video links sitting inside a <script> tag.

(Screenshot: the video links inside a <script> tag in the page source)

That being the case, a regular expression can match them directly:

"url":"/watch\?v=(.*?)","webPageType"

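A minimal Python 2 sketch of that idea (the regex is the one above; the search URL is just an example, and whether the ids still appear in the source like this depends on YouTube not changing its markup):

# -*- coding:utf-8 -*-
import re
import requests

html = requests.get("https://www.youtube.com/results?search_query=funny").text
ids = re.findall(r'"url":"/watch\?v=(.*?)","webPageType"', html)
print ids[:5]   # a few ids scraped straight out of the page source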
That regex pulls out the ids. But it seems to capture only the videos on the first page; a closer look shows the trick fails from page two onward, because paging is driven by an ajax request and the page source always holds only the first page of data. OK then, we have to analyze the ajax request. I like Chrome for this: open the developer tools, go to the Network tab, and capture the traffic.

Keep scrolling down and the page automatically fires a POST request; look at the response and the video data is right there.

(Screenshot: the ajax response containing the video data)

Seeing that, I was glad; victory felt close. But then I looked at the headers and POST parameters being sent, and... wtf.

(Screenshot: the request headers and the encrypted POST parameters)

Ten thousand alpacas stampeded through me. I've marked the encrypted parameters above. Since this data is sent from the client, it must be generated on the front end; figuring out exactly where would mean tracing it step by step, and in the end... I couldn't work it out. Inspecting the nearby js files did show the parameters are indeed generated in the js, but... it's written so damn convolutedly that, with my limited skills, I couldn't crack it. Give up, then? Certainly not, or you wouldn't be reading this article. Then I had an idea: I appended &page= to the results URL in the address bar, and it really returned the next page of videos... FML, hahaha, I was genuinely delighted. The results page doesn't even have paging buttons, so I never expected this to actually work. Haha.

(Screenshot: the next page of results returned via the &page= parameter)

With that discovered, the plan is clear: page through the results, grab the source, regex-match, and collect video links in bulk (a sketch follows below). After that, the remaining problem is how to download directly from those links. I searched Baidu and Google and found plenty of methods and quite a few APIs; no need to reinvent the wheel, just use one.
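Putting the two tricks together, collecting links in bulk is just a loop over pages plus the regex; a rough sketch (search_ids is my own illustrative helper, and the 25-page cap mirrors the full script below):

# -*- coding:utf-8 -*-
import re
import requests

def search_ids(keyword, pages=25):
    # Walk the result pages via the &page= trick and regex out the ids.
    ids = []
    for page in range(1, pages + 1):
        url = ("https://www.youtube.com/results?search_query=%s&page=%s"
               % (keyword, page))
        html = requests.get(url).text
        ids += re.findall(r'"url":"/watch\?v=(.*?)","webPageType"', html)
    return ids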

There's an open-source project on GitHub, youtube-dl, a command-line application. After installing it, it works like this:


youtube-dl -F https://www.youtube.com/watch?v=_iupLGTX890

That lists every available format for the video, and you can then download it by id. A very handy tool.

To use it from code, you just invoke the command line. However, after testing I found that in batch downloads some videos were never downloaded completely, so I dropped that approach and instead found a rather good API on a foreign website.
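For reference, driving youtube-dl from Python is just a subprocess call; a minimal sketch (I didn't end up using this, and the fetch helper is only an illustration; -f and -o are youtube-dl's standard format-selection and output-template flags):

# -*- coding:utf-8 -*-
import subprocess

def fetch(video_id, outdir):
    # Download a single video by id through the youtube-dl command line.
    url = "https://www.youtube.com/watch?v=" + video_id
    return subprocess.call(["youtube-dl", "-f", "mp4",
                            "-o", outdir + "/%(title)s.%(ext)s", url])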

I won't walk through how I found the API or how to use it; it will be obvious at a glance in the code attached below.

One more point: when the keyword I enter is English, the search results are all in English too, so after a successful download I translate the title before saving the file. For translating into Chinese I ended up with Kingsoft (iciba); their official API seems to charge, and I wasn't having that, so I simply scrape their translation page instead: submit the English, get the Chinese back, parse the page, and match out the result. Heh heh heh.

OK, enough talk; here's the code.

 


# -*-coding:utf-8-*-
# author : Corleone
from bs4 import BeautifulSoup
import lxml
import Queue
import requests
import re,os,sys,random
import threading
import logging
import json,hashlib,urllib
from requests.exceptions import ConnectTimeout,ConnectionError,ReadTimeout,SSLError,MissingSchema,ChunkedEncodingError
import random

reload(sys)
sys.setdefaultencoding('gbk')   # GBK for Chinese Windows; change this on Linux

# logging setup
logger = logging.getLogger("AppName")
formatter = logging.Formatter('%(asctime)s %(levelname)-5s: %(message)s')
console_handler = logging.StreamHandler(sys.stdout)
console_handler.formatter = formatter
logger.addHandler(console_handler)
logger.setLevel(logging.INFO)

q = Queue.Queue()       # [video_url, title] pairs waiting to be downloaded
page_q = Queue.Queue()  # search-result page numbers waiting to be crawled

def download(q, x, path):
    # Worker: pull [url, title] pairs off the queue and save the videos.
    # x is the thread index (unused except to tell workers apart).
    urlhash = "https://weibomiaopai.com/"
    try:
        html = requests.get(urlhash).text
    except SSLError:
        logger.info(u"Network unstable, retrying")
        html = requests.get(urlhash).text
    # Scrape the hash value the video API below requires as a parameter.
    reg = re.compile(r'var hash="(.*?)"', re.S)
    result = reg.findall(html)
    hash_v = result[0]
    while True:
        data = q.get()
        url, name = data[0], data[1].strip().replace("|", "")
        fname = os.path.join(path, '%s' + ".mp4") % name
        # Two mirrors of the same resolving API; fall back to the second.
        api = "https://steakovercooked.com/api/video/?cached&hash=" + hash_v + "&video=" + url
        api2 = "https://helloacm.com/api/video/?cached&hash=" + hash_v + "&video=" + url
        try:
            res = requests.get(api)
            result = json.loads(res.text)
        except (ValueError, SSLError):
            try:
                res = requests.get(api2)
                result = json.loads(res.text)
            except (ValueError, SSLError):
                q.task_done()
                return False
        vurl = result['url']   # direct downloadable video url
        logger.info(u"Downloading: %s" % name)
        try:
            r = requests.get(vurl)
        except SSLError:        # retry once on an SSL hiccup
            r = requests.get(vurl)
        except MissingSchema:   # malformed url, skip this video
            q.task_done()
            continue
        try:
            with open(fname, 'wb') as f:
                f.write(r.content)
        except IOError:
            # Title unusable as a filename: fall back to a placeholder
            # ("so happy" in Chinese) plus a random number.
            name = u'好开心么么哒 %s' % random.randint(1, 9999)
            fname = os.path.join(path, '%s' + ".mp4") % name
            with open(fname, 'wb') as f:
                f.write(r.content)
        logger.info(u"Finished: %s" % name)
        q.task_done()

def get_page(keyword, page_q):
    # Worker: fetch search-result pages, regex out the ids, resolve titles.
    while True:
        headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'
        }
        page = page_q.get()
        # sp=EgIIAg%253D%253D keeps the upload-date filter (last day only);
        # &page= is the pagination trick found above.
        url = "https://www.youtube.com/results?sp=EgIIAg%253D%253D&search_query=" + keyword + "&page=" + str(page)
        try:
            html = requests.get(url, headers=headers).text
        except (ConnectTimeout, ConnectionError):
            print u"Cannot reach youtube; check that your proxy works"
            os._exit(0)
        reg = re.compile(r'"url":"/watch\?v=(.*?)","webPageType"', re.S)
        result = reg.findall(html)
        logger.info(u"Page %s" % page)
        for x in result:
            vurl = "https://www.youtube.com/watch?v=" + x
            try:
                res = requests.get(vurl).text
            except (ConnectionError, ChunkedEncodingError):
                logger.info(u"Network unstable, retrying")
                try:
                    res = requests.get(vurl).text
                except SSLError:
                    continue
            # Title sits in "<title>... - YouTube"; strip the trailing dash.
            reg2 = re.compile(r"<title>(.*?)YouTube", re.S)
            name = reg2.findall(res)[0].replace("-", "")
            if u'\u4e00' <= keyword <= u'\u9fff':
                # Keyword starts with a CJK character: keep the title as-is.
                q.put([vurl, name])
            else:
                # English keyword: translate the title by scraping iciba (Kingsoft)
                logger.info(u"Translating")
                url_js = "http://www.iciba.com/" + name
                html2 = requests.get(url_js).text
                soup = BeautifulSoup(html2, "lxml")
                try:
                    res2 = soup.select('.clearfix')[0].get_text()
                    title = res2.split("\n")[2]
                except IndexError:
                    # Translation page gave nothing usable: placeholder title.
                    title = u'好开心么么哒 %s' % random.randint(1, 9999)
                q.put([vurl, title])
        page_q.task_done()


def main():
    # Prompts are decoded as GBK because the console is Chinese Windows.
    keyword = raw_input(u"Enter a keyword: ").decode("gbk")
    threads = int(raw_input(u"Number of threads (1-10 suggested): "))
    # Create the per-keyword download directory if needed
    path = r'D:\youtube\%s' % keyword
    if not os.path.exists(path):
        os.makedirs(path)
    # Crawl the search-result pages
    logger.info(u"Parsing result pages")
    for page in range(1, 26):
        page_q.put(page)
    for y in range(threads):
        t = threading.Thread(target=get_page, args=(keyword, page_q))
        t.setDaemon(True)
        t.start()
    page_q.join()
    logger.info(u"%s videos in total" % q.qsize())
    # Download with multiple worker threads
    logger.info(u"Starting downloads")
    for x in range(threads):
        t = threading.Thread(target=download, args=(q, x, path))
        t.setDaemon(True)
        t.start()
    q.join()
    logger.info(u"All videos downloaded!")

main()


One note: I'm on Windows 10, so everything is GBK-encoded; if you run it on Linux, change the encoding yourself. Downloads are multithreaded; the default download directory is D:\youtube, with a subdirectory created per keyword to hold the videos. Also, I filter in the code so that only videos uploaded within the last day are crawled, and I re-run it once a day.
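Concretely, the one-day filter lives in the sp parameter of the search URL built in get_page(); drop it to search without the date restriction (as far as I can tell, EgIIAg%3D%3D is YouTube's encoded upload-date filter, and the script carries it double-encoded):

# search URL with the upload-date filter, as used in get_page()
filtered = "https://www.youtube.com/results?sp=EgIIAg%253D%253D&search_query=funny"
# the same search with no filter at all
plain = "https://www.youtube.com/results?search_query=funny"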

As for testing: download speed depends on the network at the time, and when the connection is flaky there may be exceptions I haven't caught yet. With a decent proxy server, the speed is acceptable.

(Screenshot: a test run)

OK, that's the whole article; writing it up took nearly an hour. Not easy :-)

My GitHub: https://github.com/binglansky/spider


Origin: blog.csdn.net/pangzhaowen/article/details/104230623