Crawling every video on the Linlang Community site with Python (about 6,000 videos in one night)

Linlang Community is rumored to be one of the most popular websites among men. Hmm, I wanted to see if that's true.

This project crawls every video on the Linlang Community site (for learning purposes only).

Environment: Python 3.7 + Scrapy + MySQL 8.0 + Windows 10

  • First, decide what to crawl and define the item:
class LinglangItem(scrapy.Item):
    # which module (category) the video belongs to
    video_belong_module = scrapy.Field()
    # URL of the video's playback page
    video_url = scrapy.Field()
    # video title
    video_title = scrapy.Field()
    # real m3u8 address of the video
    video_m3u8_url = scrapy.Field()
  • Then write the spider file:
    First, the parse callback for the initial URL: it collects the category (module) links of the Linlang site and creates the main storage directory locally (a sketch of the spider attributes these callbacks rely on follows the parse method below).
def parse(self, response):
    # create the main directory
    if not os.path.exists(self.base_dir):
        os.mkdir(self.base_dir)
    all_module_url = response.xpath('//div[@id="head_nav"]/div/div[@class="left_nav"]/a/@href').extract()[1:]
    # build the absolute URL of every module (Latest, Anime, ...)
    all_module_url = [self.start_urls[0] + url for url in all_module_url]
    # issue a request for every module page
    for page_url in all_module_url:
        # the engine recognizes the yielded value as a Request and hands it to the scheduler;
        # other types (e.g. a plain list) would not be recognized and could only be handled via the -o option
        yield scrapy.Request(page_url, callback=self.page_parse)
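
The parse and page_parse callbacks reference several spider attributes (base_dir, start_urls, m3u8_domain, num) and a few imports that the post does not show. Here is a minimal sketch of how the spider header might look; the spider name, the placeholder start URL, and the local paths are assumptions for illustration, not taken from the source:

import os
import re
from random import random

import scrapy

from ..items import LinglangItem   # assumes the standard Scrapy project layout


class LinglangSpider(scrapy.Spider):
    name = 'linglang'                                  # assumed spider name
    start_urls = ['https://www.example-site.com']      # placeholder; the author does not publish the real URL
    m3u8_domain = 'https://bbb.188370aa.com/'          # domain prefix used to rebuild the m3u8 address
    base_dir = 'D:/linglang/'                          # local root directory for the per-video folders
    num = 0                                            # running counter of crawled videos

    # parse(), page_parse() and m3u8_parse() shown in this post go here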

Next, define the parsing function for a module page; it also follows the pagination:

def page_parse(self, response):
    # the URL, title and m3u8 address of every video on this page (20 per page)
    video_urls = response.xpath('//ul[contains(@class,"piclist")]/li/a/@href').extract()
    video_titles = response.xpath('//ul[contains(@class,"piclist")]/li/a/@title').extract()
    video_m3u8_url_ls = response.xpath('//ul[contains(@class,"piclist")]/li/a/@style').extract()
    # the module this page belongs to
    video_belong_module = response.xpath('//a[contains(@class,"on")]/text()').extract_first()
    for index, video_m3u8_url in enumerate(video_m3u8_url_ls):
        # best to create a fresh item for every yield, otherwise problems such as duplicated names can occur
        item = dict()
        ls = video_m3u8_url.split('/')
        # e.g. https://bbb.188370aa.com/20191014/WLDsLTZK/index.m3u8
        # split('/') -> ['https:', '', 'bbb.188370aa.com', '20191014', 'WLDsLTZK', 'index.m3u8']
        #                   0      1          2                3           4            5
        # rebuild the absolute m3u8 URL
        try:
            m3u8_url = self.m3u8_domain + ls[3] + '/' + ls[4] + '/index.m3u8'
        except IndexError:
            continue
        item['video_belong_module'] = video_belong_module
        item['video_url'] = self.start_urls[0] + video_urls[index]

        # lesson learned: some titles end with spaces, and the file could not be found when deleting it
        # item['video_title'] = video_titles[index].strip()
        item['video_title'] = video_titles[index].strip().replace('.', '')
        # item['video_m3u8_url'] = m3u8_url
        self.num += 1
        print('This is video no. %s: %s' % (self.num, item['video_title']))
        # create a directory for each video
        module_name = video_belong_module
        file_name = item['video_title']
        # module_path = os.path.join(self.base_dir, module_name)
        # video_path = os.path.join(module_path, file_name)
        module_path = self.base_dir + module_name + '/'
        video_path = module_path + file_name + '/'
        if not os.path.exists(video_path):
            try:
                os.makedirs(video_path)
            except OSError:
                # fall back to a random directory name if the title is not a valid path
                video_path = module_path + str(random()) + '/'
                os.makedirs(video_path)

        yield scrapy.Request(m3u8_url, callback=self.m3u8_parse,
                             meta={'video_path': video_path, 'item': item})
    try:
        # selector for the "next page" <a> tag
        next_page_selector = response.xpath('//div[@class="pages"]/a')[-2]
        # if there is a next page, request it; on the last page the "next page" <a> tag has no href attribute
        next_page = next_page_selector.xpath('./@href').extract_first()
        if next_page:
            next_page_url = self.start_urls[0] + next_page
            yield scrapy.Request(next_page_url, callback=self.page_parse)
    except IndexError:
        pass

Parse the m3u8 index to get the real playlist URL, then return the item to the pipelines:

def m3u8_parse(self, response):
    item = LinglangItem()
    for k, v in response.meta['item'].items():
        item[k] = v
    # response.text is the content of the m3u8 file as a string;
    # the last match is the URL of the newest (real) m3u8 playlist
    real_url = re.findall(r'https:.*?m3u8', response.text)[-1]
    item['video_m3u8_url'] = real_url
    # when this is yielded, the engine checks whether it is an Item and, if so, passes it to the pipelines
    yield item
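
For context, the top-level index.m3u8 usually just wraps the address of the actual playlist, which is why the last https:…m3u8 match is taken. A small standalone illustration with made-up playlist content (the nested path is an assumption, not real site output):

import re

# made-up example of what a wrapper index.m3u8 can look like
sample_m3u8 = """#EXTM3U
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=800000
https://bbb.188370aa.com/20191014/WLDsLTZK/800kb/hls/index.m3u8
"""

# the last full URL in the file is the real playlist
print(re.findall(r'https:.*?m3u8', sample_m3u8)[-1])
# -> https://bbb.188370aa.com/20191014/WLDsLTZK/800kb/hls/index.m3u8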

Implement a deduplication pipeline:

from random import random


# deduplication Item Pipeline: filter out duplicate data
class DuplicatesPipline(object):
    # runs only once, before the first item arrives; optional, used for initialisation
    def __init__(self):
        self.video_title_set = set()

    def process_item(self, item, spider):
        video_title = item['video_title']
        if video_title in self.video_title_set:
            # make a duplicated title unique by appending a random number
            item['video_title'] = item['video_title'] + str(random())
        self.video_title_set.add(video_title)
        # returning the item tells the engine this item is done and the next one can be processed
        return item
    # then enable DuplicatesPipline in settings.py

Then implement the storage pipeline that writes the data to MySQL (any other database would work just as well):

import pymysql


# store the item data in the database
class MySqlPipeline(object):
    def __init__(self, database):
        self.database = database

    # this method gives access to the configuration in settings.py
    @classmethod
    def from_crawler(cls, crawler):
        # effectively returns a MySqlPipeline instance
        return cls(
            # read the corresponding setting and pass it to __init__
            database=crawler.settings.get('DATABASE')
        )

    # called when the spider is opened: connect to the database
    def open_spider(self, spider):
        self.db = pymysql.connect(host='localhost', port=3306, user='root', password='123456',
                                  database=self.database, charset='utf8')
        self.cursor = self.db.cursor()
        print('database:', type(self.db), type(self.cursor))

    def process_item(self, item, spider):
        sql = "insert into video_info values(%s,%s,%s,%s);"
        values = tuple(dict(item).values())
        # execute() returns 1 on success
        self.cursor.execute(sql, values)
        # execute() only writes to a buffer; commit() actually writes to the database
        self.db.commit()
        return item
        # then enable MySqlPipeline in settings.py (not enabled yet at this point)

    # called when the spider is closed: close the database connection
    def close_spider(self, spider):
        self.cursor.close()
        self.db.close()
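
Both pipelines have to be registered in settings.py, and from_crawler expects a DATABASE setting. A sketch of what that could look like; the module path linglang.pipelines, the database name, and the priorities are assumptions for illustration, not taken from the post:

# settings.py (sketch)
DATABASE = 'linglang'   # database name read by MySqlPipeline.from_crawler

ITEM_PIPELINES = {
    'linglang.pipelines.DuplicatesPipline': 300,  # deduplicate first
    'linglang.pipelines.MySqlPipeline': 400,      # then store in MySQL
}

Since process_item inserts tuple(dict(item).values()), the video_info table needs four columns whose order matches the order in which the item fields are filled (video_belong_module, video_url, video_title, video_m3u8_url).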

At this point the crawler already works. But because Scrapy hits the site with so many requests so frequently, the server will eventually decide we are a crawler and forcibly close the connection.

Scrapy does put those failed requests back into the scheduler and retries them once the connection succeeds again, but that wastes time.

To improve efficiency, we can use a rotating proxy in the downloader middleware to re-issue a request as soon as the local one fails:

def process_response(self, request, response, spider):
    # Called with the response returned from the downloader.

    # Must either;
    # - return a Response object
    # - return a Request object
    # - or raise IgnoreRequest
    # if the response status is not 200, re-issue the current request through a proxy
    if response.status != 200:
        print('using a proxy -------------------------')
        headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36'
        }
        # attach the headers and a random proxy to the current request
        request.headers = headers
        request.meta['proxy'] = 'http://' + self.random_ip()
        return request
    return response
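
The middleware above calls self.random_ip(), which the post does not show, and the middleware has to be enabled in settings.py. A minimal sketch assuming the proxies come from a hand-maintained list; the class name, proxy addresses, and priority are made up for illustration:

import random


class ProxyDownloaderMiddleware(object):
    # placeholder proxy pool; in practice this would come from a proxy provider or a local pool
    PROXY_POOL = [
        '111.111.111.111:8888',
        '222.222.222.222:8888',
    ]

    def random_ip(self):
        # pick one "ip:port" string at random
        return random.choice(self.PROXY_POOL)

    # process_response() from above goes here

# settings.py (sketch):
# DOWNLOADER_MIDDLEWARES = {
#     'linglang.middlewares.ProxyDownloaderMiddleware': 543,
# }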

Finally, start the crawler, wait for it to finish, and check the database: plenty of results.
The site turns out to have 5,997 videos in total, not as many as I had imagined. I dare not post the site's URL here, I'm scared, haha.

Real knowledge comes from practice, and a lively site like this is a good practice target. My health hasn't been getting better day by day, though; probably the late nights...


Original post: blog.csdn.net/qq_36291294/article/details/102907624