Environment: Windows + Python 3.6 + PyCharm (optional)
Referenced Python libraries/modules: requests, bs4, os, random, you-get
Prerequisite knowledge: basic use of requests, BeautifulSoup's find_all(), os.system("cmd command"), you-get
Crawling steps:
1. For crawlers I am in the habit of using an IP proxy pool. Even though some sites have no anti-crawling measures, using one does no harm. Encapsulate the IP proxy pool as a module that can be called at any time.
The code, get_ip.py:
```python
import requests
from bs4 import BeautifulSoup
import random

head = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre',
}

def get_ip_list():
    # Scrape a batch of IPs directly from an IP proxy website
    url = 'http://www.xicidaili.com/nn/'  # IP proxy website
    response = requests.get(url, headers=head).text
    bs = BeautifulSoup(response, 'html.parser')
    ips = bs.find_all('tr')
    ip_list = []
    for i in range(1, len(ips)):
        ip_info = ips[i]
        tds = ip_info.find_all('td')
        ip_list.append(tds[1].text + ':' + tds[2].text)
    return ip_list

def get_random_ip():
    # Pick a random IP from the pool and build the proxies dict for requests
    ip_list = get_ip_list()
    proxy_list = ['http://' + ip for ip in ip_list]
    proxy_ip = random.choice(proxy_list)
    proxies = {'http': proxy_ip}
    return proxies
```
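The proxies dict that get_random_ip() returns simply maps the URL scheme to one pool entry, in the shape requests expects for its `proxies` argument. A minimal self-contained sketch of that last step, using hard-coded placeholder IPs instead of a live scrape:

```python
import random

def build_proxies(ip_list):
    # Mirror get_random_ip(): prefix each host:port with the scheme,
    # then pick one entry at random for requests' `proxies` argument.
    proxy_list = ['http://' + ip for ip in ip_list]
    return {'http': random.choice(proxy_list)}

# Placeholder pool entries, not real proxies
pool = ['1.2.3.4:8080', '5.6.7.8:3128']
proxies = build_proxies(pool)
print(proxies)  # e.g. {'http': 'http://1.2.3.4:8080'}
```

You would then pass this dict as `requests.get(url, headers=head, proxies=proxies)`.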
2. Now implement crawling TED.
(1) Analyze the TED web pages; the URL rules are listed directly here:
TED Homepage: https://www.ted.com/
The TED talks list page: https://www.ted.com/talks?page=1, where the trailing page=1 means the first page of the list, and so on for later pages.
(2) The code, get_TED.py:
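The URL rule above can be sketched as a tiny helper that swaps the page number in the query string, which is exactly what the crawler below does with split('='):

```python
def page_url(base, page):
    # Keep everything before '=' and append the requested page number
    return base.split('=')[0] + '=' + str(page)

base = 'https://www.ted.com/talks?page=1'
print(page_url(base, 3))  # https://www.ted.com/talks?page=3
```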
```python
import requests
from get_ip import get_random_ip
from bs4 import BeautifulSoup
import os

head = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre',
}
path = r'F:\TED'

def get_TED(url, count):
    page_part = url.split('=')
    for i in range(1, count + 1):
        url_ted = page_part[0] + '=' + str(i)
        # Pass the random proxy via the proxies argument, not inside the headers
        response = requests.get(url_ted, headers=head, proxies=get_random_ip())
        html = response.text
        bs = BeautifulSoup(html, 'html.parser')
        talks_list = bs.find_all('div', attrs={'class': 'media__message'})
        for j in range(len(talks_list)):
            ted_a = talks_list[j].find_all('a', attrs={'class': 'ga-link', 'data-ga-context': 'talks'})
            ted_url = 'https://www.ted.com' + ted_a[0]['href']
            print("TED talk topic: " + ted_a[0].text)
            # Hand the talk URL to you-get, which downloads the video into `path`
            os.system(r'you-get -o {} {}'.format(path, ted_url))

if __name__ == '__main__':
    url = 'https://www.ted.com/talks?page=1'
    count = int(input("Enter the number of pages to download (36 TED talks per page): "))
    get_TED(url, count)
```
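The link-extraction step can be tried offline against a hand-made HTML sample that imitates the talks-list markup this script expects (the class names are the ones the crawler targets; the sample talk and its href are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hand-made sample imitating the TED talks-list markup; not real page content
html = '''
<div class="media__message">
  <a class="ga-link" data-ga-context="talks" href="/talks/sample_talk">A sample talk</a>
</div>
'''
bs = BeautifulSoup(html, 'html.parser')
links = []
for block in bs.find_all('div', attrs={'class': 'media__message'}):
    a = block.find_all('a', attrs={'class': 'ga-link', 'data-ga-context': 'talks'})
    links.append('https://www.ted.com' + a[0]['href'])
print(links)  # ['https://www.ted.com/talks/sample_talk']
```

Each URL collected this way is what get_TED() passes on to you-get.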
The code is on my personal GitHub: https://github.com/goodloving/python