01: Crawler Quick Start

1. Introduction

1. Robots protocol

The full name of the Robots protocol (crawler protocol) is the "Robots Exclusion Protocol". Websites use it to tell search engines which pages may be crawled and which may not. The protocol is a common code of ethics in the international Internet community; although it is not written into law, every crawler should abide by it.
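As a quick illustration (not part of the original tutorial), Python's standard library ships urllib.robotparser, which can check a site's robots.txt before crawling. The URL below is the demo site used later in this post.

from urllib import robotparser

# Load and parse the site's robots.txt
rp = robotparser.RobotFileParser()
rp.set_url("http://www.santostang.com/robots.txt")
rp.read()

# True if a crawler with this user agent may fetch the page
print(rp.can_fetch("*", "http://www.santostang.com/"))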

2. The crawler process

[Figure: flowchart of the crawler process]

(1) Get the web page

Obtaining a web page means sending a request to a URL; the server responds with the data of the entire page.

Commonly used techniques

Basic techniques for obtaining web pages: requests, urllib, and selenium (a minimal sketch follows this list)

Advanced techniques for obtaining web pages: multi-process and multi-thread crawling, crawling behind logins, working around IP bans, and crawling from servers
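A minimal sketch of the basic approach using requests (the headers value here is a placeholder, not from the original tutorial; urllib and selenium follow the same fetch-then-read idea):

import requests

url = "http://www.santostang.com/"  # replace with the page you want to fetch
headers = {"User-Agent": "Mozilla/5.0"}  # minimal browser-like header

# Send a GET request; r.text holds the raw HTML of the whole page
r = requests.get(url, headers=headers, timeout=10)
r.raise_for_status()  # fail loudly on 4xx/5xx responses
html = r.text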

(2) Parsing web pages (extracting data)

Parsing a web page means extracting the data we want from the full page content we have obtained.

Commonly used techniques

Basic techniques for parsing web pages: re (regular expressions), BeautifulSoup, and lxml

Advanced technique for parsing web pages: handling garbled Chinese characters (encoding issues); a sketch follows below
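A minimal sketch: BeautifulSoup turns the fetched HTML into a searchable tree, and re-decoding with r.apparent_encoding is one common fix for garbled Chinese text (the h1 selector is a placeholder; adjust it to your page):

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.santostang.com/", timeout=10)
# Common fix for garbled Chinese: let requests guess the real charset
r.encoding = r.apparent_encoding

# Parse the HTML into a tree and look up the element we want
soup = BeautifulSoup(r.text, "html.parser")
first_h1 = soup.find("h1")  # placeholder: adjust the tag/class to your page
if first_h1 is not None:
    print(first_h1.text.strip())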

(3) Store data

Storing data is also easy to understand: it simply means persisting the extracted data, for example to a CSV file or to a database.

Commonly used techniques

Basic techniques for storing data: saving to txt files and saving to CSV files (a sketch follows below)

Advanced techniques for storing data: storing in a MySQL or MongoDB database
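A minimal sketch of saving to CSV with the standard library (the rows are made-up sample data); storing in MySQL or MongoDB follows the same pattern, with a database client in place of the file:

import csv

# Made-up sample data: one header row plus one data row
rows = [["title", "link"],
        ["example post", "http://www.santostang.com/"]]

# newline="" avoids blank lines on Windows; utf-8 keeps Chinese text intact
with open("results.csv", "a+", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(rows)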

2. Environment installation

(1) Anaconda

(2) PyCharm

3. Entry-level case

Website for learning HTML: http://www.w3school.com.cn/html/index.asp

Python 100 practice questions

Crawler practice website

Here is how to find the element we want in the browser:

Step 1: Open the browser, go to the target web page, then right-click and choose Inspect to open the developer tools.


Step 2: After clicking the small element-picker arrow, the developer tools automatically jump to the content we point at on the page and show its tag.


#*******************************************************#
# Imports: requests, and BeautifulSoup from bs4
#*******************************************************#
import requests
from bs4 import BeautifulSoup
#*******************************************************#
#  link is the web page we want to fetch;
#  replace it with any page you want to crawl
#*******************************************************#
link = "http://www.santostang.com/"
#*******************************************************#
#  headers defines the browser request headers,
#  so the request looks like it comes from a browser;
#  it is usually fixed and rarely needs changing
#*******************************************************#
headers = {
    "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20100101 Firefox/3.5.6"}
#*******************************************************#
#####  Step 1: request the page
#*******************************************************#
r = requests.get(link, headers=headers)
print(f"The HTML content of the fetched page is:\n{r.text}")
#*******************************************************#
#####  Step 2: parse the page
#   BeautifulSoup converts the fetched HTML document into a soup object,
#   which we then use to look up the elements we want
#*******************************************************#
soup = BeautifulSoup(r.text, 'html.parser')
#*******************************************************#
#    Use the soup object to find the target element:
#    soup.find("h1", class_="post-title").a.text.strip() finds the first post title,
#    i.e. locate the h1 element whose class is "post-title", take its a element,
#    extract the a element's text, and strip() the surrounding whitespace
#*******************************************************#
title = soup.find("h1", class_="post-title").a.text.strip()
#*******************************************************#
# Print the extracted result to check that it worked
#*******************************************************#
print(title)
#*******************************************************#
#####  Step 3: save the data to a txt file
#*******************************************************#
with open("results.txt", "a+", encoding="utf-8") as f:
    f.write(title + "\n")
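Running the script prints the page's raw HTML and the first post's title, then appends the title to results.txt; because the file is opened in append mode ("a+"), each run adds a new line instead of overwriting the file.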


Origin: blog.csdn.net/qq_63119830/article/details/131044159