Crawling TED talk videos with Python (with code)

Environment: Windows + Python 3.6 + PyCharm (not strictly required)

Python libraries/modules used: requests, bs4, os, random, you-get

Prerequisites: basic use of requests, BeautifulSoup's find_all(), os.system("cmd command"), and you-get

Crawling steps:

1. When writing crawlers, I habitually use an IP proxy pool. Some websites have no anti-crawling measures, but using a proxy pool anyway does no harm. I wrap the proxy pool in a module so that it can be imported whenever it is needed.

Here is the code, get_ip.py:

import requests
from bs4 import BeautifulSoup
import random

head = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre',
    'ue': 'utf-8',
}

def get_ip_list():  # Scrape a batch of proxy addresses from a free proxy-list site
    url = 'http://www.xicidaili.com/nn/'  # proxy-list website
    response = requests.get(url, headers=head).text
    bs = BeautifulSoup(response, 'html.parser')
    rows = bs.find_all('tr')
    ip_list = []
    for row in rows[1:]:  # skip the table header row
        tds = row.find_all('td')
        if len(tds) > 2:  # ignore rows without IP and port cells
            # cell 1 holds the IP address, cell 2 the port
            ip_list.append(tds[1].text + ':' + tds[2].text)
    return ip_list

def get_random_ip():  # Return one randomly chosen proxy from the pool
    ips_list = get_ip_list()
    proxy_list = []
    for ip in ips_list:
        proxy_list.append('http://' + ip)
    proxy_ip = random.choice(proxy_list)
    proxies = {'http': proxy_ip}
    return proxies
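
Note that requests looks up the proxy by URL scheme, so a dict with only an 'http' key is not applied to https:// URLs such as the TED pages. Below is a minimal sketch of a variant that maps the chosen proxy to both schemes. build_proxies is a hypothetical helper name that does not appear in get_ip.py, and the address in the example is a placeholder, not a working proxy.

```python
import random


def build_proxies(ip_list, rng=random):
    """Pick one 'host:port' entry and map it to both URL schemes.

    Hypothetical helper for illustration; not part of the original get_ip.py.
    """
    if not ip_list:
        raise ValueError("proxy pool is empty")
    proxy_url = 'http://' + rng.choice(ip_list)
    # requests selects the proxy by the scheme of the requested URL
    return {'http': proxy_url, 'https': proxy_url}


# Placeholder address, used only to show the shape of the result
print(build_proxies(['10.0.0.1:8080']))
```

The returned dict can be passed as the proxies argument of requests.get. Whether a free proxy actually supports HTTPS traffic has to be checked case by case.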

2. Now, crawling TED itself

(1) Analyze the TED website. Here are the URL patterns I found:

        TED Homepage: https://www.ted.com/

        TED talk list page: https://www.ted.com/talks?page=1, where the trailing page=1 selects the first page of the list, page=2 the second, and so on
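
The paging rule can be sketched as a small helper that builds the list-page URLs for the first N pages. list_page_urls and BASE_LIST_URL are hypothetical names chosen for this illustration. The script itself builds the same URLs by splitting the start URL on '='.

```python
from urllib.parse import urlencode

BASE_LIST_URL = 'https://www.ted.com/talks'  # list page without the query string


def list_page_urls(count):
    """Return the URLs of list pages 1..count (hypothetical helper)."""
    return [
        '{}?{}'.format(BASE_LIST_URL, urlencode({'page': n}))
        for n in range(1, count + 1)
    ]


for page_url in list_page_urls(3):
    print(page_url)
```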

(2) Here is the code, get_TED.py:

import requests
from get_ip import get_random_ip
from bs4 import BeautifulSoup
import os

head = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre',
    'ue': 'utf-8',
}
proxies = get_random_ip()  # pick one proxy when the script starts
path = r'F:\TED'  # directory the videos are downloaded into

def get_TED(url, count):
    page_part = url.split('=')  # separate the page number from the rest of the URL
    for i in range(1, count + 1):
        url_ted = page_part[0] + '=' + str(i)
        # send head as HTTP headers and route the request through the proxy
        response = requests.get(url_ted, headers=head, proxies=proxies)
        html = response.text
        bs = BeautifulSoup(html, 'html.parser')
        talks_list = bs.find_all('div', attrs={'class': 'media__message'})
        for talk in talks_list:
            ted_a = talk.find_all('a', attrs={'class': 'ga-link', 'data-ga-context': 'talks'})
            if not ted_a:  # skip blocks that contain no talk link
                continue
            ted_url = 'https://www.ted.com' + ted_a[0]['href']
            print("TED talk topic: " + ted_a[0].text)
            # hand the talk page URL to you-get, which saves the video under path
            os.system(r'you-get -o {} {}'.format(path, ted_url))

if __name__ == '__main__':
    url = 'https://www.ted.com/talks?page=1'
    count = int(input("Number of pages to download (36 TED talks per page): "))
    get_TED(url, count)
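
os.system passes one formatted string through the shell, so a download path or URL containing spaces or shell metacharacters can break the command. Here is a minimal sketch of the same you-get call made with subprocess.run and an argument list instead, assuming you-get is installed and on the PATH. build_you_get_command and download_talk are hypothetical helper names, and the talk URL in the example is a placeholder.

```python
import subprocess


def build_you_get_command(output_dir, talk_url):
    """Build the argument list for: you-get -o <output_dir> <talk_url>."""
    return ['you-get', '-o', output_dir, talk_url]


def download_talk(output_dir, talk_url):
    """Run you-get without a shell and report whether it exited with code 0.

    Hypothetical helper; it assumes the you-get executable is on the PATH.
    """
    result = subprocess.run(build_you_get_command(output_dir, talk_url))
    return result.returncode == 0


# Show the command that would be run; the URL is a placeholder
print(build_you_get_command(r'F:\TED', 'https://www.ted.com/talks/example_talk'))
```

Because each argument is passed separately, no quoting is needed, and the return value lets the caller log or retry failed downloads.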


The code is available on my personal GitHub: https://github.com/goodloving/python
