From the listing page of Movie Heaven (dytt8), crawl each linked detail page for the movie title and download address, then use a regular expression to pick out the movie genre we want.
Get the source code here:
https://github.com/akh5/Python/blob/master/movieparise.py
What the crawler does is start from a category page, follow the link into each movie's detail page, scrape the information we want, and store it in a dictionary. Here we store only the title and the download link.
The implementation is as follows:
```python
from lxml import etree
import requests
import re

BASE_DOMAIN = 'http://dytt8.net'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'
}
```
First we define a base URL so that, after extracting the href attribute from each `<a>` tag, we can jump to that page.
`headers` is a dictionary of request headers used to disguise the crawler as a regular browser.
The main function:

```python
def spider():
    base_url = "https://www.dytt8.net/html/gndy/dyzz/list_23_1.html"
    movies = []
    url = base_url
    detail_urls = get_detail_urls(url)
    for detail_url in detail_urls:
        movie = parse_detail_page(detail_url)
        movies.append(movie)
        # print(movie)
    # print(movies)
    find_what_u_want(movies)
```
This only crawls the first page; a for loop over the page numbers could crawl ten pages instead.
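A minimal sketch of that idea, assuming the listing URLs follow the site's `list_23_1.html` pattern, with the page number as the only part that changes:

```python
def listing_urls(num_pages=10):
    # Build the listing-page URLs for the first num_pages pages.
    # Assumption: the trailing number in list_23_N.html is the page index.
    return ["https://www.dytt8.net/html/gndy/dyzz/list_23_{}.html".format(p)
            for p in range(1, num_pages + 1)]

# Each URL would then be fed to get_detail_urls() exactly as in spider():
# for url in listing_urls():
#     for detail_url in get_detail_urls(url):
#         ...
```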
```python
def get_detail_urls(url):
    response = requests.get(url, headers=headers)
    text = response.text
    html = etree.HTML(text)
    # Every movie link on the listing page sits inside the tbspan table
    detail_urls = html.xpath("//table[@class='tbspan']//a/@href")
    detail_urls = map(lambda url: BASE_DOMAIN + url, detail_urls)
    return detail_urls
```
This function collects the URL of every detail page.
It uses XPath, a syntax made specifically for locating tags within a page; it is fairly simple to use, although the page responses here are noticeably slow.
The hrefs it returns are relative paths, so the base URL is concatenated on before the detail page is actually visited.
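As an aside, `urllib.parse.urljoin` from the standard library is a more robust way to do this join than plain string concatenation; the path below is a made-up example, not a real page:

```python
from urllib.parse import urljoin

BASE_DOMAIN = 'http://dytt8.net'

# urljoin handles relative hrefs as well as already-absolute ones,
# which plain string concatenation does not.
print(urljoin(BASE_DOMAIN, '/html/gndy/dyzz/20190501/58628.html'))
# -> http://dytt8.net/html/gndy/dyzz/20190501/58628.html
```

If an href happens to be an absolute URL already, `urljoin` returns it unchanged instead of producing a malformed string.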
```python
def parse_detail_page(url):
    movie = {}
    response = requests.get(url, headers=headers)
    # The detail pages are encoded in GBK, not UTF-8
    text = response.content.decode('gbk')
    html = etree.HTML(text)
    title = html.xpath("//div[@class='title_all']//font[@color='#07519a']/text()")[0]
    download = html.xpath("//div[@id='Zoom']//a/@href")[0]
    movie['title'] = title
    movie['download'] = download
    return movie
```
This function extracts the title and download address from a detail page and returns them in a `movie` dictionary; the main function collects these dictionaries in a list.
There is also a regular-expression function: we pass the list of dictionaries in and, using a regex, keep only the entries whose title matches what we want, e.g. "悬疑" (mystery), storing them in a list.
```python
def find_what_u_want(movies):
    find_out = []
    for movie in movies:
        text = movie['title']
        # re.search finds the genre keyword anywhere in the title
        if re.search('悬疑', text):
            find_out.append(movie)
    print(find_out)
```
The end result is that only mystery ("悬疑") movies are kept.
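The filtering logic can be checked on its own without touching the network; the titles and download links below are made-up sample data, not scraped results, and the `keyword` parameter is an added generalization:

```python
import re

def find_what_u_want(movies, keyword='悬疑'):
    # Keep only the movies whose title contains the genre keyword
    return [m for m in movies if re.search(keyword, m['title'])]

# Hypothetical sample data for demonstration only
sample = [
    {'title': '2019年悬疑片《某某》BD中字', 'download': 'ftp://example/a'},
    {'title': '2019年喜剧片《某某》BD中字', 'download': 'ftp://example/b'},
]
print(find_what_u_want(sample))  # only the 悬疑 (mystery) entry survives
```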
The full code:

```python
from lxml import etree
import requests
import re

BASE_DOMAIN = 'http://dytt8.net'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'
}

def get_detail_urls(url):
    response = requests.get(url, headers=headers)
    text = response.text
    html = etree.HTML(text)
    detail_urls = html.xpath("//table[@class='tbspan']//a/@href")
    detail_urls = map(lambda url: BASE_DOMAIN + url, detail_urls)
    return detail_urls

def parse_detail_page(url):
    movie = {}
    response = requests.get(url, headers=headers)
    text = response.content.decode('gbk')
    html = etree.HTML(text)
    title = html.xpath("//div[@class='title_all']//font[@color='#07519a']/text()")[0]
    download = html.xpath("//div[@id='Zoom']//a/@href")[0]
    movie['title'] = title
    movie['download'] = download
    return movie

def find_what_u_want(movies):
    find_out = []
    for movie in movies:
        text = movie['title']
        if re.search('悬疑', text):
            find_out.append(movie)
    print(find_out)

def spider():
    base_url = "https://www.dytt8.net/html/gndy/dyzz/list_23_1.html"
    movies = []
    url = base_url
    detail_urls = get_detail_urls(url)
    for detail_url in detail_urls:
        movie = parse_detail_page(detail_url)
        movies.append(movie)
    find_what_u_want(movies)

spider()
```