A Super-Detailed Python Crawler Tutorial (zero-basics introduction that anyone can follow)

Author: 码农BookSea

https://blog.csdn.net/bookssea/article/details/107309591

Read first, like later; make it a habit.
Like and bookmark, and life will be brilliant.

Before explaining our crawler, let's first go over some basic concepts about crawlers (after all, this is a zero-basics tutorial).

What is a crawler

A web crawler (also known as a web spider or web robot) is a program that simulates a browser, sends requests to websites, and receives their responses, automatically scraping information from the Internet according to certain rules.
In principle, anything a browser (the client) can do, a crawler can do as well.
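To make this concrete: in its simplest form, a crawler is only a few lines of Python. The sketch below (my own addition, using only the standard library and the placeholder site example.com) just fetches a page the way a browser would; the real project later in this article does the same thing, with extra headers and error handling.

import urllib.request

# Fetch a page the same way a browser would, then print the first 200 characters.
html = urllib.request.urlopen("https://example.com").read().decode("utf-8")
print(html[:200])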

Why do we use crawlers

The era of Internet big data has brought convenience to our lives, along with an explosive growth of data appearing on the Internet.
In the past we relied on books, newspapers, television, and radio to obtain information. The amount of information was limited and, after a certain amount of screening, it was relatively reliable, but the downside was that it was far too narrow. Such asymmetric information transmission limited our horizons and kept us from learning more.
In the era of Internet big data, we suddenly have free access to information, and an enormous amount of it, but most of it is useless junk.
For example, Sina Weibo generates hundreds of millions of status updates a day, and a single Baidu search for "weight loss" returns on the order of 100,000,000 results.
With such a huge volume of information fragments, how do we obtain the information that is actually useful to us?
The answer is filtering!
We collect the relevant content through some technique, then analyze and prune it to get the information we really need.
This work of collecting, analyzing, and integrating information can be applied very widely: whether in everyday services, travel, financial investment, or market demand for all kinds of manufactured products, this technique lets us obtain more accurate and more useful information to take advantage of.
Although web crawler technology has an odd name, and one's first reaction is to picture some soft, creeping creature, it is in fact a powerful tool for getting ahead in the virtual world.

Crawler preparation

We usually speak of "Python crawlers", but there is actually a misunderstanding here: crawlers are not unique to Python. Many languages can be used to write crawlers, for example PHP, Java, C#, C++, and Python. I chose Python simply because it is relatively easy to use and its libraries are fairly complete.
First of all, we need to download Python; I downloaded the latest official version, 3.8.3.
Secondly, we need an environment in which to run Python; I use PyCharm.


(It can also be downloaded from the official website.)
We also need some libraries to support the crawler (some of them come bundled with Python).


These are more or less all the libraries we need; I have already added comments for each of them in the code below.

(While the crawler runs, you may need more than just the libraries above; it depends on how you write your crawler. In any case, whenever a library is missing, we can install it directly in PyCharm's settings.)
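As a quick sanity check (this little snippet is my own addition, not part of the original project), you can try importing the third-party libraries the crawler relies on; if an import fails, install the missing package, for example with pip install beautifulsoup4 xlwt.

# A minimal sketch: verify that the crawler's dependencies are importable.
# bs4 and xlwt are third-party packages; re and urllib ship with Python.
from bs4 import BeautifulSoup   # HTML parsing
import xlwt                     # writing .xls spreadsheets
import re, urllib.request       # standard library

print("All crawler dependencies are available.")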

Crawler project walkthrough

What I built is a crawler for the Douban Top 250 movie chart.
The site we want to crawl is this one: https://movie.douban.com/top250

I have already run the crawl, so let me show you the result first: the crawled content is saved into an .xls file.

The content we crawl is: the movie detail link, poster image link, Chinese title, foreign title, rating, number of ratings, one-line summary, and related information.

Code analysis

First, here is the complete code; afterwards I will analyze it step by step.

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup  # parse web pages and extract data
import re  # regular expressions, for text matching
import urllib.request, urllib.error  # build the URL and fetch the page data
import xlwt  # Excel (.xls) operations
#import sqlite3  # SQLite database operations


findLink = re.compile(r'<a href="(.*?)">')  # compile a regular-expression object (the matching rule): movie detail link
findImgSrc = re.compile(r'<img.*src="(.*?)"', re.S)  # poster image link
findTitle = re.compile(r'<span class="title">(.*)</span>')  # movie title
findRating = re.compile(r'<span class="rating_num" property="v:average">(.*)</span>')  # rating
findJudge = re.compile(r'<span>(\d*)人评价</span>')  # number of ratings ("人评价" means "people rated")
findInq = re.compile(r'<span class="inq">(.*)</span>')  # one-line summary
findBd = re.compile(r'<p class="">(.*?)</p>', re.S)  # related information








def main():
    baseurl = "https://movie.douban.com/top250?start="  # base URL of the pages to crawl
    # 1. Crawl the pages
    datalist = getData(baseurl)
    savepath = "豆瓣电影Top250.xls"    # create an .xls file in the current directory and store the results there
    # dbpath = "movie.db"              # or create a database in the current directory and store the results there
    # 3. Save the data
    saveData(datalist,savepath)      # two storage options; pick only one
    # saveData2DB(datalist,dbpath)






# Crawl the pages
def getData(baseurl):
    datalist = []  # used to store the crawled page information
    for i in range(0, 10):  # call the page-fetching function 10 times
        url = baseurl + str(i * 25)
        html = askURL(url)  # save the fetched page source
        # 2. Parse the data item by item
        soup = BeautifulSoup(html, "html.parser")
        for item in soup.find_all('div', class_="item"):  # find the blocks that match our requirements
            data = []  # holds all the information for one movie
            item = str(item)
            link = re.findall(findLink, item)[0]  # search with the regular expression
            data.append(link)
            imgSrc = re.findall(findImgSrc, item)[0]
            data.append(imgSrc)
            titles = re.findall(findTitle, item)
            if (len(titles) == 2):
                ctitle = titles[0]
                data.append(ctitle)
                otitle = titles[1].replace("/", "")  # strip the "/" separator from the foreign title
                data.append(otitle)
            else:
                data.append(titles[0])
                data.append(' ')
            rating = re.findall(findRating, item)[0]
            data.append(rating)
            judgeNum = re.findall(findJudge, item)[0]
            data.append(judgeNum)
            inq = re.findall(findInq, item)
            if len(inq) != 0:
                inq = inq[0].replace("。", "")  # drop the trailing Chinese full stop
                data.append(inq)
            else:
                data.append(" ")
            bd = re.findall(findBd, item)[0]
            bd = re.sub(r'<br(\s+)?/>(\s+)?', "", bd)  # remove <br/> tags
            bd = re.sub('/', "", bd)  # remove "/" separators
            data.append(bd.strip())
            datalist.append(data)


    return datalist




# Get the page content for a given URL
def askURL(url):
    head = {  # simulate browser header information when sending the request to Douban's server
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36"
    }
    # The User-Agent tells Douban's server what kind of machine and browser we are (essentially, what level of content we are able to receive)


    request = urllib.request.Request(url, headers=head)
    html = ""
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html




# Save the data to a spreadsheet
def saveData(datalist,savepath):
    print("save.......")
    book = xlwt.Workbook(encoding="utf-8",style_compression=0) # create a workbook object
    sheet = book.add_sheet('豆瓣电影Top250', cell_overwrite_ok=True) # create a worksheet
    col = ("电影详情链接","图片链接","影片中文名","影片外国名","评分","评价数","概况","相关信息")  # column names: detail link, image link, Chinese title, foreign title, rating, rating count, summary, related info
    for i in range(0,8):
        sheet.write(0,i,col[i])  # column headers
    for i in range(0,250):
        # print("第%d条" %(i+1))       # debug print, for testing
        data = datalist[i]
        for j in range(0,8):
            sheet.write(i+1,j,data[j])  # data cells
    book.save(savepath) # save the file


# def saveData2DB(datalist,dbpath):
#     init_db(dbpath)
#     conn = sqlite3.connect(dbpath)
#     cur = conn.cursor()
#     for data in datalist:
#             for index in range(len(data)):
#                 if index == 4 or index == 5:
#                     continue
#                 data[index] = '"'+data[index]+'"'
#             sql = '''
#                     insert into movie250(
#                     info_link,pic_link,cname,ename,score,rated,instroduction,info)
#                     values (%s)'''%",".join(data)
#             # print(sql)     # print the SQL statement, for testing
#             cur.execute(sql)
#             conn.commit()
#     cur.close()
#     conn.close()




# def init_db(dbpath):
#     sql = '''
#         create table movie250(
#         id integer  primary  key autoincrement,
#         info_link text,
#         pic_link text,
#         cname varchar,
#         ename varchar ,
#         score numeric,
#         rated numeric,
#         instroduction text,
#         info text
#         )
#
#
#     '''  # create the data table
#     conn = sqlite3.connect(dbpath)
#     cursor = conn.cursor()
#     cursor.execute(sql)
#     conn.commit()
#     conn.close()


# Save the data to the database














if __name__ == "__main__":  # entry point when the program is run
    # call the main function
    main()
    # init_db("movietest.db")
    print("爬取完毕!")  # prints "Crawling finished!"

Let me now walk through and analyze the code step by step, from top to bottom.

# -*- coding: utf-8 -*- at the very top declares that the file is encoded as utf-8; it is written at the top to prevent garbled characters.
The import statements below it simply pull in the libraries we need as preparation (I did not use the sqlite3 library, so I commented it out).
The variables below that which start with find are regular expressions, used to filter out the information we want.
(Regular expressions come from the re library; strictly speaking they are not mandatory, but they are convenient here.)
The overall process is divided into three steps:

1. Crawl web pages
2. Parse the data item by item
3. Save the data

Let's first analyze step 1, crawling the web pages. baseurl is the URL of the site we want to crawl. Moving down, we call getData(baseurl).
Now let's look at the getData method:

    for i in range(0, 10):  # call the page-fetching function 10 times
        url = baseurl + str(i * 25)

This part may be confusing at first, but it actually works like this: because the chart is a Top 250 and each page only displays 25 movies, we need to visit the page 10 times, since 25 * 10 = 250.

baseurl = "https://movie.douban.com/top250?start="

We just need to append a number after baseurl to jump to the corresponding page. For example, when i = 1 the URL becomes:

https://movie.douban.com/top250?start=25

I left it as a hyperlink, so you can click it and see which page it jumps to; after all, practice is the best teacher.
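To make the pagination concrete, here is a tiny standalone snippet (my own illustration, not part of the original program) that prints the ten URLs the loop above will request:

# Print the ten page URLs that the loop in getData() generates.
baseurl = "https://movie.douban.com/top250?start="
for i in range(0, 10):
    print(baseurl + str(i * 25))
# The output runs from ...start=0 (page 1) up to ...start=225 (page 10),
# and each page lists 25 movies: 25 * 10 = 250.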

Then we call askURL to request the page; this is the core method for requesting a web page.
To spare you from scrolling back up, I will paste the code here again so you can look at it directly.

def askURL(url):
    head = {  # simulate browser header information when sending the request to Douban's server
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36"
    }
    # The User-Agent tells Douban's server what kind of machine and browser we are (essentially, what level of content we are able to receive)


    request = urllib.request.Request(url, headers=head)
    html = ""
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html

This askURL is what we use to send requests to the web page. So here comes the classic question: why do we need the head (request headers) here?

This is because, if we don't include it, some websites will recognize us as a crawler and respond with an error and an error code:

418

This code is actually a bit of a running joke; you can look it up on Baidu:

418 I'm a teapot

The HTTP 418 I'm a teapot client error response code indicates that the server refuses to brew coffee because it is a teapot. This error is a reference to the Hyper Text Coffee Pot Control Protocol, which was an April Fools' joke in 1998.

I am a teapot

So we need to "pretend" and pretend that we are a browser, so that we will not be recognized and
pretend to be an identity.
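As a quick experiment (my own illustration, not part of the original program), you can send the same request without the User-Agent header and watch the error branch fire; on Douban this typically shows up as the 418 code mentioned above:

# Request the page without a User-Agent and print the error code we get back.
import urllib.request, urllib.error

try:
    urllib.request.urlopen("https://movie.douban.com/top250")
except urllib.error.HTTPError as e:
    print(e.code, e.reason)  # on Douban this is typically 418 ("I'm a teapot")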

Alright, let's keep going.

  html = response.read().decode("utf-8")

This line reads the content of the web page and decodes it as utf-8; the purpose is to prevent garbled characters.
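For beginners, here is a tiny illustration (my addition) of why the decode step matters: the raw response body is bytes, and .decode("utf-8") turns it into an ordinary string that BeautifulSoup can parse.

# The response body arrives as bytes; decoding turns it back into readable text.
raw = "豆瓣电影".encode("utf-8")  # bytes, e.g. b'\xe8\xb1\x86\xe7\x93\xa3...'
text = raw.decode("utf-8")        # back to the string "豆瓣电影"
print(type(raw), type(text))      # <class 'bytes'> <class 'str'>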
Once the request succeeds, we come to the second step:

2. Parse the data item by item

To parse the data we use the BeautifulSoup library, which is practically a must-have library for crawlers, no matter what you are scraping.

We then start looking for the data that meets our requirements, using BeautifulSoup's methods together with regular expressions from the re library to match it:

findLink = re.compile(r'<a href="(.*?)">')  # compile a regular-expression object (the matching rule): movie detail link
findImgSrc = re.compile(r'<img.*src="(.*?)"', re.S)  # poster image link
findTitle = re.compile(r'<span class="title">(.*)</span>')  # movie title
findRating = re.compile(r'<span class="rating_num" property="v:average">(.*)</span>')  # rating
findJudge = re.compile(r'<span>(\d*)人评价</span>')  # number of ratings ("人评价" means "people rated")
findInq = re.compile(r'<span class="inq">(.*)</span>')  # one-line summary
findBd = re.compile(r'<p class="">(.*?)</p>', re.S)  # related information

We match the data that meets our requirements and append it to datalist, so in the end datalist holds all the data we need.
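If the parsing step still feels abstract, here is a small self-contained sketch (my own illustration, using a made-up <div class="item"> block whose structure merely mimics the real Douban page) showing how find_all and the regular expressions work together:

from bs4 import BeautifulSoup
import re

# A made-up movie block, shaped like the ones on the real page.
sample_html = '''
<div class="item">
  <a href="https://movie.douban.com/subject/0000000/"><img src="poster.jpg"></a>
  <span class="title">示例电影</span>
  <span class="rating_num" property="v:average">9.9</span>
  <span>12345人评价</span>
</div>
'''

findLink = re.compile(r'<a href="(.*?)">')
findRating = re.compile(r'<span class="rating_num" property="v:average">(.*)</span>')

soup = BeautifulSoup(sample_html, "html.parser")
for item in soup.find_all('div', class_="item"):  # locate each movie block
    item = str(item)                              # the regexes work on the raw HTML string
    print(re.findall(findLink, item)[0])          # -> the detail link
    print(re.findall(findRating, item)[0])        # -> 9.9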

And the last step:

3. Save the data

    # 3. Save the data
    saveData(datalist,savepath)      # two storage options; pick only one
    # saveData2DB(datalist,dbpath)

You can save the data either to an .xls spreadsheet (which requires the xlwt library),
or to an SQLite database (which requires the sqlite3 library).

Here I chose to save to the .xls spreadsheet, which is why a lot of the code is commented out; the commented-out part is the code that saves to the SQLite database. Pick one of the two.

The main method for saving to .xls is saveData (the saveData2DB method below it is for saving to the SQLite database):

def saveData(datalist,savepath):
    print("save.......")
    book = xlwt.Workbook(encoding="utf-8",style_compression=0) # create a workbook object
    sheet = book.add_sheet('豆瓣电影Top250', cell_overwrite_ok=True) # create a worksheet
    col = ("电影详情链接","图片链接","影片中文名","影片外国名","评分","评价数","概况","相关信息")  # column names: detail link, image link, Chinese title, foreign title, rating, rating count, summary, related info
    for i in range(0,8):
        sheet.write(0,i,col[i])  # column headers
    for i in range(0,250):
        # print("第%d条" %(i+1))       # debug print, for testing
        data = datalist[i]
        for j in range(0,8):
            sheet.write(i+1,j,data[j])  # data cells
    book.save(savepath) # save the file

We create the worksheet and its columns (the file will be created in the current directory):

    sheet = book.add_sheet('豆瓣电影Top250', cell_overwrite_ok=True) # create a worksheet
    col = ("电影详情链接","图片链接","影片中文名","影片外国名","评分","评价数","概况","相关信息")  # column names: detail link, image link, Chinese title, foreign title, rating, rating count, summary, related info

Then we store the data from datalist into the sheet row by row.
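If you have never used xlwt before, here is a tiny standalone example (my own addition, writing a throwaway demo.xls file) of how Workbook, add_sheet, write, and save fit together:

import xlwt

book = xlwt.Workbook(encoding="utf-8")
sheet = book.add_sheet('demo')
sheet.write(0, 0, "title")     # row 0, column 0: a header cell
sheet.write(1, 0, "example")   # row 1, column 0: a data cell
book.save("demo.xls")          # writes demo.xls to the current directory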

When the run completes successfully, a file like this is generated in the project panel on the left:

Open it and check whether it is the result we wanted.

It's done, it's done!

If we do need to store the data in a database, we can generate the .xls file first and then import the .xls file into the database.
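For completeness, here is a rough sketch of that import step (my own addition, not from the original article). It assumes the third-party xlrd package for reading the .xls file and the movie250 table layout from the commented-out init_db code above; run init_db first so the table exists.

import sqlite3
import xlrd  # third-party package that can read legacy .xls files

def import_xls_to_db(xlspath, dbpath):
    sheet = xlrd.open_workbook(xlspath).sheet_by_index(0)
    conn = sqlite3.connect(dbpath)
    cur = conn.cursor()
    # Skip the header row, then insert one database row per movie.
    for i in range(1, sheet.nrows):
        cur.execute(
            "insert into movie250(info_link,pic_link,cname,ename,score,rated,instroduction,info) "
            "values (?,?,?,?,?,?,?,?)",
            sheet.row_values(i)
        )
    conn.commit()
    conn.close()

For example, import_xls_to_db("豆瓣电影Top250.xls", "movie.db") would load the spreadsheet we just generated into the database.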

That wraps up the explanation in this article. I feel I have covered things in quite a bit of detail. I only started learning about crawlers recently and happen to be quite interested in them, so there are bound to be shortcomings; everyone is welcome to correct me.

I am also constantly learning, and I will share new things with you as soon as I learn them.
Move those little hands and hit follow so you don't lose track of me.

If there is anything in this article you don't understand, please leave a comment below, and I will answer every single one.


Freeloading isn't nice, and creating content isn't easy. Your likes are the biggest motivation for my writing. If I got anything wrong, please leave a correction in the comments.
Friends, if you got anything out of this, please give it a free like to encourage the blogger.

Sharing this post, or clicking "Watching", is the greatest support you can give me.
