Crawling Maoyan movies with Python requests + XPath and writing them to a database (with screenshots)

1. Python MySQL database connection

I wrote the connection logic in a separate .py file so it can be imported from the Maoyan crawler script. You will need the pymysql library; if you don't have it, install it first (pip install pymysql). Here is the code:

#coding=utf-8
import pymysql

class mysqlConn:
    def get_conn(self, dbname):
        """Connect to the database with the given name."""
        self.conn = pymysql.connect(
            host="127.0.0.1",
            user="root",
            password="your password",
            db=dbname,  # the database to connect to
            charset="utf8"
        )
        self.cur = self.conn.cursor()

    def exe_sql(self, sql):
        """Execute a statement that returns no result set, e.g. INSERT/UPDATE/DELETE."""
        self.cur.execute(sql)
        self.conn.commit()
        # print("transaction committed")

    def select_sql(self, sql):
        """Execute a query and return all rows."""
        self.cur.execute(sql)
        return self.cur.fetchall()

    def close_conn(self):
        if self.cur:
            self.cur.close()
        if self.conn:
            self.conn.close()

if __name__ == "__main__":
    # try it on an existing table to check that the connection works
    connection = mysqlConn()
    connection.get_conn("school")  # connect to the 'school' database
    sql = '''insert into student2 (name, nickname) values ("赵六", "六娃")'''
    connection.exe_sql(sql)
    connection.close_conn()
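
To reproduce the __main__ test above you need a 'school' database with a 'student2' table. A minimal sketch, assuming the file above is saved as mysql_api.py (the column sizes here are my own guess, not from the original post; create the database first with `create database school;`):

#coding=utf-8
from mysql_api import mysqlConn

conn = mysqlConn()
conn.get_conn("school")
# hypothetical table layout; only name/nickname are used by the test insert
conn.exe_sql("""create table if not exists student2(
        id int primary key auto_increment,
        name varchar(30) not null,
        nickname varchar(30) null
        )default charset = utf8""")
conn.close_conn()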

Note the #coding=utf-8 at the top of the file; without it you can get an encoding error. I didn't have the habit of writing it before, but it seems worth making one.
Let's look at the result:
[Screenshot: the new row inserted into the table]

2. Grabbing the useful information with XPath

The Maoyan movie list address: https://maoyan.com/films?showType=3
First look at the final database table to see what information we need to crawl:
[Screenshot: the target maoyan table]
From the table you can see we need to grab the movie name, the poster link, the link to the movie's details page, the introduction, and the cast.
Now look at the Maoyan listing page:
[Screenshot: the movie listing]
[Screenshot: the pagination links]
From here we can already see the pattern in how the URL changes from page to page (only the offset query parameter changes, in steps of 30, as shown in the sketch below), so we can crawl the URLs of the different movies on the first page.
Now look at the poster addresses:
[Screenshot: a poster's img element in the page source]
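
For instance, a quick sketch of the listing URLs this pattern produces (the same format string the full crawler at the end uses):

# listing pages differ only in the offset parameter, 30 films per page
for i in range(2):
    page_url = "https://maoyan.com/films?showType=3&offset={}".format(30 * i)
    print(page_url)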
I used an XPath debugging plugin here (XPath Helper); you can get it from the Chrome Web Store, and after installing it the shortcut Ctrl+Shift+X toggles it quickly.
[Screenshot: XPath Helper matching the poster and details-page nodes]
With those two expressions working, everything we need from the listing page can be crawled; the remaining fields have to come from the details page. Let's write the code for the listing page first and see how it does:

#coding=utf-8
import requests
from lxml import etree
from fake_useragent import UserAgent

# This uses a library that fakes a random request header, because crawling too many
# times a day led to frequent CAPTCHA checks, and a randomized header helps a lot.
# If you're interested, 'pip install fake_useragent'; otherwise a plain header works too.
headers = {
    "User-Agent": UserAgent().random
}
url = 'https://maoyan.com/films?showType=3'
print("url: " + url)
resp = requests.get(url=url, headers=headers)
tree = etree.HTML(resp.text)
# the full image URL, can be opened directly
img_ar = tree.xpath('//dl/dd//img[2]/@src')
# only the second half of the URL; 'https://maoyan.com' needs to be prepended
urls_ar = tree.xpath('//dd/div[@class="movie-item film-channel"]/a/@href')

# print them to check whether the data was actually scraped
print(img_ar)
print(urls_ar)

[Screenshot: the details-page URLs were printed, but the image list came back empty]
The details-page addresses were captured, but the poster addresses were not??? This is genuinely baffling; I was confused for a while and thought my XPath was wrong. The only thing left was to debug, so let's set a breakpoint:
[Screenshot: the response HTML in the debugger, showing data-src on the img tags]
It turns out Maoyan lazy-loads its posters: in the raw HTML that requests receives, the real image URL sits in the data-src attribute, and the browser's JavaScript only copies it into src later, which is why DevTools shows src but requests sees data-src. So change the image XPath to '//dl/dd//img[2]/@data-src' and try again; now it crawls successfully.
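
If you want to be defensive about this, a small sketch that prefers data-src and falls back to src (my own addition, not in the original code):

# lazy-loaded pages keep the real URL in data-src; fully rendered HTML uses src
img_ar = tree.xpath('//dl/dd//img[2]/@data-src')
if not img_ar:
    img_ar = tree.xpath('//dl/dd//img[2]/@src')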
Now we need to grab the cast and the other details from the details page.
The movie name and introduction are easy to catch, so I won't screenshot those; the XPath expressions are all in the code below, and you can verify them yourself.
[Screenshot: XPath Helper matching the cast list on the details page]
You can try other XPath formulations here too; I won't go through them all.
The code for crawling the details page:

#coding=utf-8
import requests
from lxml import etree
from fake_useragent import UserAgent  # same imports as the listing-page snippet

headers = {
    "User-Agent": UserAgent().random
}
url = 'https://maoyan.com/films/1218029'

print("url: " + url)
resp = requests.get(url=url, headers=headers)

tree = etree.HTML(resp.content.decode("utf-8"))
name = str(tree.xpath('string(//h1)'))  # cast to Python's built-in str, to be safe
print("Storing movie <{}>......".format(name))
actors_ar = tree.xpath('//div[@class="celebrity-group"][2]//li/div[@class="info"]/a/text()')  # list of actors
types = tree.xpath('string(//li[@class="ellipsis"])').replace("\n", "").replace(" ", "")  # a single string
intro = str(tree.xpath('string(//span[@class="dra"])'))
actors = '|'.join(actors_ar).replace("\n", "").replace(" ", "")  # join the actor list into one string

You can print the results yourself to have a look; as long as you don't hit a CAPTCHA, it crawls fine.
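
For example, a quick check (my own sketch; the exact values depend on the movie page):

print(name, types)
print(actors)
print(intro[:50])
# an alternative formulation for the title, if you prefer text() to string():
# tree.xpath('//h1/text()') returns only the h1's direct text nodes,
# while string(//h1) also concatenates text from nested child elements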

Finally, the complete Maoyan crawler, packaged into a class like this; it's a good habit to develop:

import requests
from lxml import etree
from mysql_api import mysqlConn
from fake_useragent import UserAgent
from pymysql import err

class maoYan_spider:
    headers = {
        "User-Agent": UserAgent().random
    }
    def get_urls(self, url):
        """Return all poster addresses and movie-details URLs captured from one listing page."""
        print("url: " + url)
        resp = requests.get(url=url, headers=self.headers)
        tree = etree.HTML(resp.text)
        # the full image URL, can be opened directly
        img_ar = tree.xpath('//dl/dd//img[2]/@data-src')
        # only the second half of the URL; 'https://maoyan.com' needs to be prepended
        urls_ar = tree.xpath('//dd/div[@class="movie-item film-channel"]/a/@href')
        # return both lists (Python returns them together as a tuple)
        return img_ar, urls_ar

    def save_data(self, img_src, url):
        """Write one movie's details into the database."""
        #print("url: " + url)
        resp = requests.get(url=url, headers=self.headers)

        tree = etree.HTML(resp.content.decode("utf-8"))
        name = str(tree.xpath('string(//h1)'))
        print("Storing movie <{}>......".format(name))
        if name == "":
            print("Hit a CAPTCHA, stopping")
            return False
        actors_ar = tree.xpath('//div[@class="celebrity-group"][2]//li/div[@class="info"]/a/text()')  # list of actors
        types = tree.xpath('string(//li[@class="ellipsis"])').replace("\n", "").replace(" ", "")  # a single string
        intro = str(tree.xpath('string(//span[@class="dra"])'))
        actors = '|'.join(actors_ar).replace("\n", "").replace(" ", "")  # join the actor list into one string
        sql = 'insert into maoyan (m_name, m_type, m_src, m_link, m_intro, m_actors) values ("%s","%s","%s","%s","%s","%s")' % (name, types, img_src, url, intro, actors)
        try:
            self.connect.exe_sql(sql)
        except err.ProgrammingError:
            print("This row has an encoding problem, discarding it")
        return True
    def run(self):
        self.connect = mysqlConn()
        self.connect.get_conn("movies")
        tag = True
        # crawl the first two pages of movies
        for i in range(2):
            main_url = "https://maoyan.com/films?showType=3&offset={}".format(30 * i)
            imgs, urls = self.get_urls(main_url)
            if len(imgs) == 0:
                print("Hit a CAPTCHA on the listing page")
                print("Trying again...")
                imgs, urls = self.get_urls(main_url)
            for img, url in zip(imgs, urls):
                img = img.split('@')[0]
                url = 'https://maoyan.com' + url
                tag = self.save_data(img, url)
                while not tag:
                    print("Trying again...")
                    tag = self.save_data(img, url)
        self.connect.close_conn()

if __name__ == "__main__":
    # conn1 = mysqlConn()
    # conn1.get_conn("movies")
    # sql = """create table maoyan(
    #         m_id int primary key auto_increment,
    #         m_name varchar(30) not null,
    #         m_type varchar(20) null,
    #         m_src varchar(100) not null,
    #         m_link varchar(100) not NULL,
    #         m_intro text null,
    #         m_actors text null
    #         )default charset = utf8"""
    # conn1.exe_sql(sql)
    # conn1.close_conn()
    spider = maoYan_spider()
    spider.run()

[Screenshot: the crawler running and storing movies]
These two files, the database connection file and the crawler file, just need to live in the same directory; then 'from mysql_api import mysqlConn' works (inside a package you would write 'from .mysql_api import mysqlConn'), and you can go catch some movies with it.
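
For reference, the layout I'm assuming (the crawler's filename is my own choice; the original post doesn't name it):

project/
    mysql_api.py      # the mysqlConn class from part 1
    maoyan_spider.py  # the maoYan_spider class above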

A few pitfalls that are relatively easy to fall into:

Pitfall 1

[Screenshot: the commented-out table-creation block]
This is the table-creation code, on the premise that you have already created a 'movies' database in MySQL yourself.
Uncomment that block and comment out the last two lines, then run it once to create the table; you'll get a character-set warning during the run, which doesn't matter. When it finishes, comment the table-creation block back out, uncomment the last two lines, and start the crawler. Of course you can also just create the table in the database directly; I had only ever done CRUD from Python before and wanted to try creating a table from it too. Don't ask me why m_actors is a text type: I originally gave it varchar(200), got a 'data too long' error, changed it to 400 and it still wasn't enough, so I went to see which movie has that many actors. It was The Avengers (复仇者联盟).
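
Incidentally, a sketch if the character-set warning bothers you (my own addition, not from the original post: utf8mb4 is the modern superset of MySQL's utf8 and also stores 4-byte characters such as emoji):

conn1 = mysqlConn()
conn1.get_conn("movies")
conn1.exe_sql("alter table maoyan convert to character set utf8mb4")
conn1.close_conn()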

Pitfall 2

[Screenshot: the INSERT statement]
In this SQL statement, pay attention to the double quotes around each placeholder. They have to be double quotes, because single quotes carry special meaning in the database (so the article I checked says; I don't know the specifics yet). I originally used single quotes around the placeholders, with double quotes wrapping the SQL string, and several pages kept raising 'pymysql.err.ProgrammingError'. I couldn't find the problem and could only catch the exception; after swapping the quotes around as in the code above, nothing more went wrong.
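
The quoting problem disappears entirely if you let pymysql substitute the values itself. A sketch of what that could look like (my own variation, not the original code: exe_sql grows an optional args parameter that is passed to cursor.execute):

    def exe_sql(self, sql, args=None):
        """Pass args through to pymysql, which escapes and quotes the values."""
        self.cur.execute(sql, args)
        self.conn.commit()

and then in save_data:

sql = ("insert into maoyan (m_name, m_type, m_src, m_link, m_intro, m_actors) "
       "values (%s, %s, %s, %s, %s, %s)")
self.connect.exe_sql(sql, (name, types, img_src, url, intro, actors))

This way a quote character inside a movie introduction can no longer break the statement, so those ProgrammingError rows would not need to be discarded.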

Pitfall 3

Any other issues can be discussed in the comments. After just a few days of learning web crawling without watching videos, this counts as my first crawl completed independently. Hee hee.

The result

[Screenshot: the maoyan table filled with the crawled movies]
