Python crawler: scraping movie titles and download links from Dytt8 (电影天堂), with regex matching of movie genres

Starting from a Dytt8 (电影天堂) list page, we crawl each linked detail page for the movie title and download link, then use a regular expression to pick out the movie genre we want.

Source code:

https://github.com/akh5/Python/blob/master/movieparise.py
The crawler starts from the category page, follows the link to each movie's detail page, scrapes the information we want, and stores it in a dict.

Here we store only the title and the download link.

```python
from lxml import etree
import requests
import re

# Root domain of the site; the detail-page links on the list page are relative
BASE_DOMAIN = 'http://dytt8.net'

# Browser-like headers so the crawler looks like an ordinary visitor
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'
}
```

First we define a root URL; once we extract the href attribute from each `<a>` tag, we concatenate it onto this root to get the page to jump to. `headers` is a dict of request headers used to disguise the crawler as a normal browser.
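
For instance, the joining looks like this (the href path below is made up, purely to show the concatenation):

```python
BASE_DOMAIN = 'http://dytt8.net'  # as defined above

# A relative href as it appears on the list page (hypothetical path)
href = '/html/gndy/dyzz/20191012/59222.html'
print(BASE_DOMAIN + href)
# -> http://dytt8.net/html/gndy/dyzz/20191012/59222.html
```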

The main function:

```python
def spider():
    base_url = "https://www.dytt8.net/html/gndy/dyzz/list_23_1.html"
    movies = []
    url = base_url
    # Collect the URL of every movie's detail page on the list page
    detail_urls = get_detail_urls(url)
    for detail_url in detail_urls:
        # Scrape the title and download link from each detail page
        movie = parse_detail_page(detail_url)
        movies.append(movie)
    # Filter the collected movies by genre with a regular expression
    find_what_u_want(movies)
```

This only crawls the first page; you could also use a for loop to crawl, say, the first 10 pages, as sketched below.
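
A minimal sketch of that variant, assuming the list pages follow the `list_23_<n>.html` pattern visible in `base_url`:

```python
def spider():
    # Page n of the category lives at list_23_n.html (assumed pattern)
    base_url = "https://www.dytt8.net/html/gndy/dyzz/list_23_{}.html"
    movies = []
    for page in range(1, 11):  # pages 1 through 10
        url = base_url.format(page)
        for detail_url in get_detail_urls(url):
            movies.append(parse_detail_page(detail_url))
    find_what_u_want(movies)
```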

```python
def get_detail_urls(url):
    # Download the list page and parse it into an element tree
    response = requests.get(url, headers=headers)
    text = response.text
    html = etree.HTML(text)
    # Each movie entry is a <table class="tbspan">; take the href of its link
    detail_urls = html.xpath("//table[@class='tbspan']//a/@href")
    # The hrefs are relative paths, so prepend the root domain
    detail_urls = map(lambda url: BASE_DOMAIN + url, detail_urls)
    return detail_urls
```

This function collects the URL of every detail page. It uses XPath, a syntax built specifically for locating tags inside a page; it is fairly simple to write, although the responses come back quite slowly. What it returns are the detail-page URLs; since the site's hrefs are relative, the root URL must be prepended by string concatenation before visiting them.
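
As a quick illustration of the XPath call, here is a self-contained sketch on a made-up fragment shaped like the list page:

```python
from lxml import etree

# Made-up HTML in the same shape as the list page, for illustration only
snippet = """
<table class="tbspan">
  <tr><td><a href="/html/gndy/dyzz/20191012/59222.html">a movie</a></td></tr>
</table>
"""
html = etree.HTML(snippet)
# Same XPath as in get_detail_urls: every href inside the tbspan table
print(html.xpath("//table[@class='tbspan']//a/@href"))
# -> ['/html/gndy/dyzz/20191012/59222.html']
```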

```python
def parse_detail_page(url):
    movie = {}
    response = requests.get(url, headers=headers)
    # The site serves GBK-encoded pages, so decode the raw bytes explicitly;
    # errors='ignore' skips the occasional byte that GBK cannot decode
    text = response.content.decode('gbk', errors='ignore')
    html = etree.HTML(text)
    # The title is the colored heading at the top of the detail page
    title = html.xpath("//div[@class='title_all']//font[@color='#07519a']/text()")[0]
    # The first link inside the Zoom div is the download address
    download = html.xpath("//div[@id='Zoom']//a/@href")[0]
    movie['title'] = title
    movie['download'] = download
    return movie
```

This function extracts the title and download link from a detail page and returns them in a `movie` dict; the main function gathers those dicts into a list.
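
For illustration, calling it on a single detail page would look like this (the URL here is hypothetical):

```python
# Hypothetical detail-page URL, only to show the call and the returned shape
movie = parse_detail_page('http://dytt8.net/html/gndy/dyzz/20191012/59222.html')
print(movie)  # {'title': '...', 'download': '...'}
```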

There is also a regular-expression function. The list of dicts is passed in, and a regex picks out the entries we want, for example titles containing 悬疑 (suspense), storing the matches in a list.

```python
def find_what_u_want(movies):
    find_out = []
    for movie in movies:
        text = movie['title']
        # Keep the movie if its title mentions 悬疑 (suspense)
        if re.search('悬疑', text):
            find_out.append(movie)
    print(find_out)
```
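
A quick sanity check of the filter with made-up titles:

```python
import re

# Made-up titles, purely to illustrate what the pattern matches
print(bool(re.search('悬疑', '2019年悬疑片《某某》BD中英双字')))  # True
print(bool(re.search('悬疑', '2019年喜剧片《某某》BD中英双字')))  # False
```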
        

The final result is a list containing only 悬疑 (suspense) movies; each entry is a dict of the form `{'title': ..., 'download': ...}`.

Full code:

```python
from lxml import etree
import requests
import re

BASE_DOMAIN = 'http://dytt8.net'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'
}

def get_detail_urls(url):
    response = requests.get(url, headers=headers)
    text = response.text
    html = etree.HTML(text)
    detail_urls = html.xpath("//table[@class='tbspan']//a/@href")
    detail_urls = map(lambda url: BASE_DOMAIN + url, detail_urls)
    return detail_urls

def parse_detail_page(url):
    movie = {}
    response = requests.get(url, headers=headers)
    text = response.content.decode('gbk', errors='ignore')
    html = etree.HTML(text)
    title = html.xpath("//div[@class='title_all']//font[@color='#07519a']/text()")[0]
    download = html.xpath("//div[@id='Zoom']//a/@href")[0]
    movie['title'] = title
    movie['download'] = download
    return movie

def find_what_u_want(movies):
    find_out = []
    for movie in movies:
        text = movie['title']
        if re.search('悬疑', text):
            find_out.append(movie)
    print(find_out)

def spider():
    base_url = "https://www.dytt8.net/html/gndy/dyzz/list_23_1.html"
    movies = []
    url = base_url
    detail_urls = get_detail_urls(url)
    for detail_url in detail_urls:
        movie = parse_detail_page(detail_url)
        movies.append(movie)
    find_what_u_want(movies)

spider()
```