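Environment: Windows + Python 3.6 + PyCharm (optional)
Python libraries/modules used: requests, bs4, os, random, you-get
Prerequisites: basic use of requests, BeautifulSoup's find_all(), os.system("cmd command"), and you-get
Crawling steps:
1. When writing crawlers I habitually use an IP proxy pool. Some sites have no anti-crawling measures, but using a pool does no harm. Wrapping the proxy pool in its own module means it can be called from any script.
Here is the code: get_ip.py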
import random

import requests
from bs4 import BeautifulSoup

head = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre',
    'ue': 'utf-8',
}


def get_ip_list():
    # Scrape a large batch of IPs directly from proxy site 1
    url = 'http://www.xicidaili.com/nn/'  # proxy-list site
    response = requests.get(url, headers=head).text
    bs = BeautifulSoup(response, 'html.parser')
    ips = bs.find_all('tr')
    ip_list = []
    for i in range(1, len(ips)):  # start at 1 to skip the table header row
        ip_info = ips[i]
        tds = ip_info.find_all('td')
        ip_list.append(tds[1].text + ':' + tds[2].text)  # "ip:port"
    return ip_list


def get_random_ip():
    # Pick one random address from the pool for the caller to use
    ips_list = get_ip_list()
    proxy_list = []
    for ip in ips_list:
        proxy_list.append('http://' + ip)
    proxy_ip = random.choice(proxy_list)
    # requests selects the proxy by URL scheme: this entry covers http:// URLs only;
    # an 'https' key is needed for https:// URLs to go through the proxy
    proxies = {'http': proxy_ip}
    return proxies
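To make the return format of get_random_ip() concrete, here is a minimal offline sketch of just the selection step. The helper name pick_proxy and the sample addresses (taken from the ranges reserved for documentation) are assumptions for illustration and are not part of get_ip.py.

```python
import random


def pick_proxy(ip_list, rng=random):
    """Pick one 'ip:port' entry and wrap it in the dict shape requests accepts."""
    if not ip_list:
        raise ValueError("proxy pool is empty")
    proxy_ip = 'http://' + rng.choice(ip_list)
    # requests looks the proxy up by URL scheme, so 'http' covers http:// URLs only
    return {'http': proxy_ip}


# Hypothetical sample pool; real entries would come from get_ip_list()
sample_pool = ['192.0.2.10:8080', '198.51.100.7:3128']
proxies = pick_proxy(sample_pool)
print(proxies)
```

A dict of this shape would then be handed to requests as its own keyword argument, for example requests.get(url, headers=head, proxies=proxies).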
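2. Now let's crawl TED
(1) Analyze the TED site. I'll just state the URL pattern here.
TED home page: https://www.ted.com/
TED talk list page: https://www.ted.com/talks?page=1, where the trailing page=1 means the first page of the list; later pages follow the same pattern.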
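The pagination pattern above can be sketched as a small helper that produces the list-page URLs for the first few pages. The function name page_urls is an assumption for illustration; it uses urllib.parse from the standard library rather than string splitting, which is one possible approach.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(list_url, count):
    """Return the list-page URLs for pages 1..count by rewriting the page parameter."""
    parts = urlsplit(list_url)
    query = dict(parse_qsl(parts.query))
    urls = []
    for page in range(1, count + 1):
        query['page'] = str(page)
        urls.append(urlunsplit(parts._replace(query=urlencode(query))))
    return urls


urls = page_urls('https://www.ted.com/talks?page=1', 3)
for u in urls:
    print(u)
```

Rewriting the query parameter instead of splitting on '=' keeps working if the list URL later gains other parameters.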
(2) Here is the code: get_TED.py
import os

import requests
from bs4 import BeautifulSoup

from get_ip import get_random_ip

head = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre',
    'ue': 'utf-8',
}
proxies = get_random_ip()  # proxies are not a header, so keep them out of head
path = r'F:\TED'  # download directory


def get_TED(url, count):
    page_part = url.split('=')
    for i in range(1, count + 1):
        # Build the URL of list page i
        url_ted = page_part[0] + '=' + str(i)
        # headers and proxies are separate keyword arguments of requests.get
        response = requests.get(url_ted, headers=head, proxies=proxies)
        html = response.text
        bs = BeautifulSoup(html, 'html.parser')
        talks_list = bs.find_all('div', attrs={'class': 'media__message'})
        for j in range(len(talks_list)):
            ted_a = talks_list[j].find_all('a', attrs={'class': 'ga-link', 'data-ga-context': 'talks'})
            ted_url = 'https://www.ted.com' + ted_a[0]['href']
            print("TED talk title: " + ted_a[0].text)
            # Hand the talk page to you-get, which downloads the video into path
            os.system(r'you-get -o {} {}'.format(path, ted_url))


if __name__ == '__main__':
    url = 'https://www.ted.com/talks?page=1'
    count = int(input("Enter the number of pages to download (36 talks per page): "))
    get_TED(url, count)
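One caveat with os.system is that the command is a single string parsed by the shell, so an output directory containing spaces would be split into several arguments. A possible alternative, sketched below, passes you-get an argument list through subprocess.run. The helper names and the example talk URL are hypothetical; only the -o output-directory option already used above is assumed from you-get.

```python
import subprocess


def build_you_get_command(out_dir, talk_url):
    """Build the you-get call as an argument list, so no shell quoting is needed."""
    return ['you-get', '-o', out_dir, talk_url]


def download_talk(out_dir, talk_url):
    """Run you-get and return its exit code (requires you-get on PATH)."""
    return subprocess.run(build_you_get_command(out_dir, talk_url)).returncode


# Hypothetical directory and talk URL, used only to show the resulting command
cmd = build_you_get_command(r'F:\TED talks', 'https://www.ted.com/talks/example_talk')
print(cmd)
```

In get_TED.py, the os.system line could then be swapped for download_talk(path, ted_url), with a non-zero return value indicating a failed download.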
The code is available on my GitHub: https://github.com/goodloving/python