Data acquisition learning (using Python's urllib module)

1. Course introduction

  • Environment setup
  • urllib and BeautifulSoup
  • Storing data in MySQL
  • Common document reading (TXT, PDF)
  • What to pay attention to when using crawlers

1. Prerequisite courses


2. Related projects you can build

  • Price-comparison sites
  • Aggregated search sites
  • Statistics on QQ users' interests and hobbies
  • Search engines

2. Environment setup

1. Download and install Python

Download the installer from https://www.python.org/downloads/ and run it.

2. Install BeautifulSoup4

1. Linux installation commands

sudo apt-get install python-bs4

2. Mac installation command

sudo easy_install pip
pip install beautifulsoup4

3. Windows installation command

pip install beautifulsoup4    # Python 2 environment
# or
pip3 install beautifulsoup4   # Python 3 environment


3. Check whether the installation is successful

  • Enter the following commands on the command line:
python    # check that Python starts correctly
from urllib.request import urlopen   # check that the urllib module is available
from bs4 import BeautifulSoup     # check that the bs4 module is available

If none of the three commands raise an error, the environment is ready.

3. urllib and BeautifulSoup

  • urllib is the URL-handling library that ships with Python 3.x; it makes it easy to simulate a user visiting web pages with a browser.

1. Usage of urllib

1. Specific steps

# 1. Import the request module from the urllib package
from urllib import request

# 2. Request the URL
resp = request.urlopen('http://www.baidu.com')

# 3. Use the response object to output the data
print(resp.read().decode("utf-8"))

2. A complete example of a simple urllib GET request

from bs4 import BeautifulSoup   # import the BeautifulSoup module (not used in this example yet)
from urllib import request    # import urllib's request module

url = "http://www.baidu.com/"
resp = request.urlopen(url)
print(resp.read().decode("utf-8"))

Running the script prints the HTML of the Baidu home page.

3. Simulate a real browser

1. Carry the User-Agent header

from urllib import request

url = "http://www.baidu.com"
key = "User-Agent"
value = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.42"
req = request.Request(url)
req.add_header(key, value)
resp = request.urlopen(req)
print(resp.read().decode("utf-8"))

Running the program again prints the page HTML as before.

4. Steps to send a request using the post method

# 1. Import parse from the urllib package
from urllib import parse

# 2. Use urlencode to build the POST data
postData = parse.urlencode([
    (key1, val1),
    (key2, val2),
    (keyn, valn)
])

# 3. Send the POST request with postData
resp = request.urlopen(req, data=postData.encode('utf-8'))

# 4. Get the response status
resp.status

# 5. Get the reason phrase returned by the server (e.g. "OK")
resp.reason

5. Example: request the Taiwan High Speed Rail site with urllib's POST method

from urllib import request
from urllib.request import urlopen
from urllib import parse

url = "https://m.thsrc.com.tw/TimeTable/Search"
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# # Proxy IP, provided by Kuaidaili
# proxy = '124.94.203.122:20993'
# proxy_values = "%(ip)s" % {'ip': proxy}
# proxies = {"http": proxy_values, "https": proxy_values}
#
# # Set up the proxy
# handler = request.ProxyHandler(proxies)
# opener = request.build_opener(handler)

data = {
    "SearchType": "S",
    "Lang": "TW",
    "StartStation": "NanGang",
    "EndStation": "ZuoYing",
    "OutWardSearchDate": "2022/10/18",
    "OutWardSearchTime": "14:30",
    "ReturnSearchDate": "2022/10/18",
    "ReturnSearchTime": "14:30",
    "DiscountType": ""
}
data = parse.urlencode(data).encode("utf8")    # encode the parameters
req = request.Request(url=url, data=data, headers=headers, method="POST")    # build the request
resp = request.urlopen(req)
# resp = opener.open(req)    # use this instead when going through the proxy

print(resp.read().decode("utf-8"))

If access is denied, refer to this article: https://blog.csdn.net/kdl_csdn/article/details/103989024
Running the program prints the response returned by the site.
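The commented-out proxy lines in the example above only sketch how a proxy is wired in. As a minimal standalone sketch (the proxy address and the httpbin.org test URL are placeholders, not values from the course):

from urllib import request

# Placeholder proxy address; substitute a proxy you actually have access to.
proxies = {"http": "http://127.0.0.1:8888", "https": "http://127.0.0.1:8888"}

handler = request.ProxyHandler(proxies)    # handler that routes traffic through the proxy
opener = request.build_opener(handler)     # opener that uses that handler

req = request.Request("http://httpbin.org/ip",
                      headers={"User-Agent": "Mozilla/5.0"})
resp = opener.open(req)                    # use the opener instead of request.urlopen
print(resp.read().decode("utf-8"))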

Extended example: implemented with the requests module

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36 Edg/105.0.1343.42"
}
url = "https://m.thsrc.com.tw/TimeTable/Search"
params = {
    "SearchType": "S",
    "Lang": "TW",
    "StartStation": "NanGang",
    "EndStation": "ZuoYing",
    "OutWardSearchDate": '2022/10/18',
    "OutWardSearchTime": "14:00",
    "ReturnSearchDate": "2022/10/18",
    "ReturnSearchTime": "14:00",
    "DiscountType": ""
}

# Note: params= puts the values in the query string; pass data=params instead to send them as a form body.
resp = requests.post(url=url, headers=headers, params=params)
# print(resp.status_code)     # 200
print(resp.text)

Running it prints the same response as the urllib version.
Tools for testing and inspecting requests: Postman, Fiddler.

2. Use of BeautifulSoup

1. Comparison of advantages and disadvantages of parsers

  • Python standard library: BeautifulSoup(markup, "html.parser")
    Advantages: built into Python; moderate speed; reasonably tolerant of bad markup.
    Disadvantages: poor fault tolerance in Python versions before 2.7.3 / 3.2.2.
  • lxml HTML parser: BeautifulSoup(markup, "lxml")
    Advantages: very fast; tolerant of bad markup.
    Disadvantages: requires the external lxml C library to be installed.
  • lxml XML parser: BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")
    Advantages: very fast; the only parser that supports XML.
    Disadvantages: requires the external lxml C library to be installed.
  • html5lib parser: BeautifulSoup(markup, "html5lib")
    Advantages: best fault tolerance; parses documents the same way a browser does; generates documents in HTML5 format.
    Disadvantages: very slow; depends on the external html5lib package.
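To see the differences in practice, here is a minimal sketch (it assumes lxml and html5lib have already been installed with pip, which is not covered above):

from bs4 import BeautifulSoup

broken_html = "<p>hello<li>one<li>two"    # deliberately malformed markup

# Each parser repairs the broken markup in its own way, so the resulting trees differ.
for parser_name in ("html.parser", "lxml", "html5lib"):
    soup = BeautifulSoup(broken_html, parser_name)
    print(parser_name, "->", soup)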

2. Several simple ways to browse structured data

soup.title    # the first <title> tag
# <title>The Dormouse's story</title>

soup.title.name    # the name of the first <title> tag
# u'title'

soup.title.string   # the text inside the first <title> tag
# u'The Dormouse's story'

soup.title.parent.name    # the name of the first <title> tag's parent element
# u'head'

soup.p    # the first <p> tag
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']    # the class attribute of the first <p> tag
# u'title'

soup.a    # the first <a> tag
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')     # all <a> tags
"""
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
]
"""

soup.find(id="link3")  # the first tag whose id is "link3"
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

3. Test the methods commonly used in BeautifulSoup

from bs4 import BeautifulSoup as bs

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://exampleScom/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well </p>

<p class="story">...</p>
"""

soup = bs(html_doc, "html.parser")

# print(soup.prettify())

print(soup.title.string)   # the content of the <title> tag
print(soup.a)    # the first <a> tag
print(soup.find(id="link2"))    # the element with id="link2"
print(soup.find(id="link2").string)    # its text (works only because the element contains no nested tags)
print(soup.find(id="link2").get_text())    # its text
print(soup.find_all("a"))     # all <a> tags
print(soup.findAll("a"))    # all <a> tags (findAll is the older alias of find_all)
print([item.string for item in soup.findAll("a")])    # the text of every <a> tag, via a list comprehension
print(soup.find("p", {"class": "story"}))    # the <p> tag whose class is "story"
print(soup.find("p", {"class": "story"}).get_text())   # its text content
print(soup.find("p", {"class": "story"}).string)  # prints None: this <p> contains other tags, so .string cannot be used

print()
import re
# Using regular expressions
for tag in soup.find_all(re.compile("^b")):    # find tags whose names start with "b"
    print(tag.name)

# Find all <a> tags whose href attribute starts with "http://example.com/"
data = soup.findAll("a", href=re.compile(r"^http://example.com/"))    # the unescaped "." matches any character
print(data)
data2 = soup.findAll("a", href=re.compile(r"^http://example\.com/"))  # the escaped "\." matches only a literal dot
print(data2)

# Docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#id28


4. Example: obtain Wikipedia entry names and links (for reference only; the page has changed, so the example may no longer work as-is)

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# Request the URL and decode the result as UTF-8
resp = urlopen("https://en.wikipedia.org/wiki/Main_Page").read().decode("utf-8")
# Parse with BeautifulSoup
soup = BeautifulSoup(resp, "html.parser")
# Get all <a> tags whose href attribute starts with /wiki/
listUrls = soup.findAll("a", href=re.compile("^/wiki/"))
# Print the name and URL of every entry
for url in listUrls:
    if not re.search(r"\.(jpg|JPG)$", url['href']):    # filter out image URLs ending in .jpg or .JPG
        # print(url['href'])     # prints the relative URL only
        # print(url.get_text(), "<--->", url['href'])    # prints the name and the relative URL
        print(url.get_text(), "<---->", "https://en.wikipedia.org" + url['href'])    # prints the name and the full URL


5. Example: Get the entries and links of Baidu Encyclopedia

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# URL the entries come from
url = "https://baike.baidu.com/"
# Request the URL and decode the result as UTF-8
resp = urlopen(url).read().decode("utf-8")
# Parse with BeautifulSoup
soup = BeautifulSoup(resp, "html.parser")
# Get all divs with class="card_cnt_tit"
list_divs = soup.findAll("div", {"class": "card_cnt_tit"})
# Based on the page source, the <a> tag of each entry is wrapped in such a div
for div in list_divs:
    # Inside each div, use a regular expression to pick out the <a> tag
    a = div.find("a", href=re.compile(r"^https://"))
    # Print the entry's name and link
    print(a.string, "<-------->", a['href'])


4. Store data in MySQL

1. Environment preparation

  • The pymysql module needs to be installed:
pip install pymysql

2. How to store in MySQL database

# 1. Import the package
import pymysql.cursors

# 2. Get a database connection
connection = pymysql.connect(
    host="localhost",
    user="root",
    password="123456",
    db="baikeurl",
    charset="utf8mb4")

# 3. Get a cursor
cursor = connection.cursor()

# 4. Execute the SQL statement
cursor.execute(sql, (arg1, arg2, ..., argn))

# 5. Commit
connection.commit()

# 6. Close the connection
connection.close()

3. Example: Store the data in the previous example into the MySQL database

1. Use Navicat to create database and data table

In Navicat, create a database named baikeurl, then create a table named urls in it with columns id, urlname, and urlhref.
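If you would rather create the database and table with code than with Navicat's GUI, a rough equivalent with pymysql might look like this (a sketch: the column types and lengths are assumptions; only the database, table, and column names come from the example code below):

import pymysql

# Connect without selecting a database so we can create it first.
connection = pymysql.connect(host="localhost", user="root",
                             password="123456", charset="utf8mb4")
try:
    with connection.cursor() as cursor:
        cursor.execute("CREATE DATABASE IF NOT EXISTS baikeurl DEFAULT CHARACTER SET utf8mb4")
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS baikeurl.urls (
                id      INT AUTO_INCREMENT PRIMARY KEY,
                urlname VARCHAR(255),    -- entry name
                urlhref VARCHAR(1000)    -- entry link
            )
        """)
    connection.commit()
finally:
    connection.close()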

2. Modify the previous sample code to add data to the database

Full code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import pymysql.cursors

# URL the entries come from
url = "https://baike.baidu.com/"
# Request the URL and decode the result as UTF-8
resp = urlopen(url).read().decode("utf-8")
# Parse with BeautifulSoup
soup = BeautifulSoup(resp, "html.parser")
# Get all divs with class="card_cnt_tit"
list_divs = soup.findAll("div", {"class": "card_cnt_tit"})
# Based on the page source, the <a> tag of each entry is wrapped in such a div
for div in list_divs:
    # Inside each div, use a regular expression to pick out the <a> tag
    a = div.find("a", href=re.compile(r"^https://"))
    # Print the entry's name and link
    print(a.string, "<-------->", a['href'])

    # Get a database connection
    connection = pymysql.connect(host="localhost",
                                 user="root",
                                 password="123456",
                                 database="baikeurl",
                                 charset="utf8mb4")
    try:
        # Get a cursor
        with connection.cursor() as cursor:
            # Build the SQL statement
            sql = "insert into `urls` (`urlname`, `urlhref`) values (%s, %s)"
            # Execute it
            cursor.execute(sql, (a.get_text(), a['href']))
            # Commit
            connection.commit()
    finally:
        connection.close()


4. How to read from MySQL database

# 1. Import the package
import pymysql.cursors

# 2. Get a database connection
connection = pymysql.connect(
    host="localhost",
    user="root",
    password="123456",
    db="baikeurl",
    charset="utf8mb4")

# 3. Get a cursor
cursor = connection.cursor()

# 4.1 Get the total number of matching records (execute returns the row count)
cursor.execute(sql)

# 4.2 Fetch the next row
cursor.fetchone()

# 4.3 Fetch a given number of rows
cursor.fetchmany(size=None)

# 4.4 Fetch all rows
cursor.fetchall()

# 5. Close the connection
connection.close()

5. Example: Query the content in the MySQL database

# Import the module
import pymysql.cursors

# Get a database connection
connection = pymysql.connect(host="localhost",
                                user="root",
                                password="123456",
                                database="baikeurl",
                                charset="utf8mb4")
try:
    # Get a cursor
    with connection.cursor() as cursor:
        # Query statement
        sql = "select urlname, urlhref from urls where id is not null"
        # execute returns the number of matching records
        count = cursor.execute(sql)
        print(count)   # 9

        # Fetch the data
        result = cursor.fetchmany(size=3)    # the first three rows
        # result = cursor.fetchall()    # all rows
        print(result)
finally:
    connection.close()

The query returns the rows that were inserted into the database earlier.

5. Common document reading (TXT, PDF)

  • Reading TXT documents (with the urlopen() method)
  • Reading PDF documents (with the third-party pdfminer3k module)

1. Why characters in some languages appear garbled

  • Computers can only handle two numbers, 0 and 1, so to process text the characters must first be turned into numbers. The earliest computers used eight bits per byte, so the largest integer a byte can represent is 255 (11111111); representing larger numbers takes more bytes.
  • Because the computer was invented in the United States, only 127 characters were encoded at first: the Arabic numerals, upper- and lower-case letters, and the symbols on the keyboard. This encoding is called ASCII. For example, the ASCII code of the uppercase letter A is 65, which in binary is 01000001; that is what the computer actually processes.
  • ASCII obviously cannot express Chinese, so China defined its own GB2312 encoding, which is compatible with ASCII. The problem is that the same byte values mean different things in different encodings: the bytes for the characters of "MOOC" in GB2312 (say 61, 62, 63) could be read in the ASCII table as the @ symbol or something else entirely.
  • Unicode brings all languages together under one encoding, but it takes up more space than ASCII.
  • Files are therefore converted between encodings: storing and transmitting as UTF-8 saves space, while keeping text as Unicode maximizes compatibility. Servers likewise convert Unicode strings to UTF-8 before sending them to the browser, which reduces the browser's burden.
  • Python 3 strings use Unicode by default, so Python 3 supports multiple languages.
  • A Unicode str can be encoded into bytes in a specified encoding with the encode() method.
  • If bytes contain characters outside the ASCII table, they are displayed as \x## escapes; calling .decode('utf-8') on the bytes turns them back into a str.
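A minimal sketch of the round trip described above (the sample string is just an illustration):

# -*- coding: utf-8 -*-

text = "慕课MOOC"               # a Python 3 str is Unicode by default

data = text.encode("utf-8")     # encode the str into bytes; non-ASCII chars show up as \x## escapes
print(data)                     # b'\xe6\x85\x95\xe8\xaf\xbeMOOC'

print(data.decode("utf-8"))     # decode the bytes back into a str: 慕课MOOC

# Decoding with the wrong encoding is what produces garbled text (or an error):
print(data.decode("gbk", errors="replace"))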

2. Read txt

from urllib.request import urlopen
# Baidu's robots.txt: https://www.baidu.com/robots.txt

url = "https://www.baidu.com/robots.txt"
html = urlopen(url)

print(html.read().decode('utf-8'))

3. Read PDF files

1. Install the pdfminer3k module

Download and install the pdfminer3k module:

pip install pdfminer3k

Alternatively, download the package archive directly, unzip it, change into the package directory (it contains a setup.py file), and install it with:

python setup.py install

2. Check whether the pdfminer3k module is installed successfully

python

import pdfminer    # no error means the module is installed successfully

3. The process of reading PDF documents

The overall flow: open the PDF and wrap it in a PDFParser; create a PDFDocument and connect it to the parser; initialize the document (passing its password, or an empty string); create a PDFResourceManager and LAParams, build a PDFPageAggregator from them, and create a PDFPageInterpreter from the resource manager and aggregator; finally, loop over doc.get_pages(), let the interpreter process each page, and read the text out of the aggregator's layout result. The code below follows exactly this sequence.

4. Read the pdf file

The file is opened with open(path, "rb"), i.e. binary read-only mode.

Example: read local PDF

# Import the required classes:
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice

# Get the file object:
fp = open("Automatic Detection.pdf", "rb")    # open in binary read-only mode

# Create a parser tied to the file
parser = PDFParser(fp)

# The PDF document object
doc = PDFDocument()

# Connect the parser and the document object
parser.set_document(doc)
doc.set_parser(parser)

# Initialize the document
doc.initialize("")      # the document has no password, so pass an empty string

# Create a PDF resource manager
resource = PDFResourceManager()

# Create a layout parameter analyzer
laparam = LAParams()

# Create an aggregator
device = PDFPageAggregator(resource, laparams=laparam)

# Create a PDF page interpreter
interpreter = PDFPageInterpreter(resource, device)

# Get the collection of pages from the document object
for page in doc.get_pages():
    # Let the page interpreter read the page
    interpreter.process_page(page)

    # Get the content from the aggregator
    layout = device.get_result()

    # Loop over every item in the layout
    for out in layout:
        # Avoid the error: AttributeError: 'LTFigure' object has no attribute 'get_text'
        if hasattr(out, "get_text"):
            print(out.get_text())

Running the script prints the text content of the PDF; layout items without text (such as figures) are skipped.

Example: Reading PDFs on the Web

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from urllib.request import urlopen

# Get the document object
# Sample PDF: https://www.tipdm.org/u/cms/www/202107/28162910tww9.pdf
# fp = open("Automatic Detection.pdf", "rb")    # open a local file in binary read-only mode
fp = urlopen("https://www.tipdm.org/u/cms/www/202107/28162910tww9.pdf")    # request the PDF over the network

# Create a parser tied to the file
parser = PDFParser(fp)

# The PDF document object
doc = PDFDocument()

# Connect the parser and the document object
parser.set_document(doc)
doc.set_parser(parser)

# Initialize the document
doc.initialize("")      # the document has no password, so pass an empty string

# Create a PDF resource manager
resource = PDFResourceManager()

# Create a layout parameter analyzer
laparam = LAParams()

# Create an aggregator
device = PDFPageAggregator(resource, laparams=laparam)

# Create a PDF page interpreter
interpreter = PDFPageInterpreter(resource, device)

# Get the collection of pages from the document object
for page in doc.get_pages():
    # Let the page interpreter read the page
    interpreter.process_page(page)

    # Get the content from the aggregator
    layout = device.get_result()

    # Loop over every item in the layout
    for out in layout:
        # Avoid the error: AttributeError: 'LTFigure' object has no attribute 'get_text'
        if hasattr(out, "get_text"):
            print(out.get_text())


6. What to pay attention to when using crawlers

Precautions

  • The Robots protocol (also known as the crawler protocol or robot protocol), formally the "Robots Exclusion Standard", is how a website tells search engines which pages may be crawled and which may not (a small code sketch follows this list).
  • User-agent: specifies which crawler the rules apply to; * is a wildcard matching all crawlers.
  • Disallow: paths that must not be accessed.
  • Allow: paths that may be accessed.
  • The Robots protocol is generally aimed at search engines; it places no technical restriction on users' crawlers.
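Python's standard library includes urllib.robotparser for reading robots.txt; a minimal sketch (the Baidu robots.txt is the one used in the TXT example above, and the results depend on its current contents):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.baidu.com/robots.txt")   # the robots.txt read in the TXT example
rp.read()

# Ask whether a given user agent may fetch a given URL.
print(rp.can_fetch("*", "https://www.baidu.com/baidu"))
print(rp.can_fetch("Baiduspider", "https://www.baidu.com/"))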

What not to do

  1. The website's terms of service explicitly prohibit crawlers, and the site has detected your behaviour and notified you, by some means, to stop.
  2. Using distributed, multi-threaded crawlers puts a huge load on the other party's server, affects its users' normal access, or even causes real damage to the server.
  3. Deliberately using crawlers to exhaust the other party's server resources, i.e. a malicious attack.
    If all three conditions above are met at the same time, it constitutes an infringement of the other party's property. Violating only the crawler protocol, without the other two conditions, is not in itself illegal. So throttle your crawlers and avoid crawling during peak hours; a minimal throttling sketch follows.
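As a rough illustration of throttling (the URL list and the one-second delay are arbitrary placeholders, not values from the course):

import time
from urllib.request import urlopen

urls = [
    "https://www.baidu.com/robots.txt",   # placeholder URLs for illustration
    "https://www.baidu.com/robots.txt",
]

for url in urls:
    html = urlopen(url).read().decode("utf-8")
    print(url, "->", len(html), "characters")
    time.sleep(1)    # pause between requests so the target server is not hammered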

Scenario analysis

Scenario one

  • Crawling all the content of a small website.
    Be sure to do it when the site is relatively idle, in the early morning (roughly 3 a.m. to 8 a.m.).

Scenario two

  • Searching for some topic and crawling tens of thousands of websites.
    Crawl each site quickly and do not spend too much time on any single one.
    If you want to crawl all of one website's content, put limits on your crawler, because copying a large site wholesale can infringe its owner's copyright.
    Crawling part of the content is fine; do not mirror the whole site.

Scenario three

  • Crawling a large, heavily used website such as MOOC.
    It is best not to crawl this kind of site exhaustively: the extra load can overwhelm the site and even bring it down.

7. Course Summary

Environment setup

  • Python3
  • BeautifulSoup4

urllib and BeautifulSoup

  • urllib
    • Use urlopen to request a URL
    • Use the add_header(key, value) method to add request headers
    • Use decode to decode the result
    • Use Request(url) to build a request object
    • Use parse.urlencode() to generate POST data
    • Use urlopen(req, data=postData.encode('utf-8'))
  • BeautifulSoup
    • Use BeautifulSoup(html, "html.parser") to parse HTML
    • Find a node: soup.find(id='imooc')
    • Find multiple nodes: soup.findAll('a')
    • Use regular expression matching: soup.findAll('a', href=re.compile(exp))

Store data to MySQL database

  • Get a database connection: connection = pymysql.connect(host='localhost', user='root', password='123456', db='db', charset='utf8mb4')
  • Use connection.cursor() to get a cursor
  • Use cursor.execute(sql, (parameter 1, parameter 2, ..., parameter n)) to execute SQL
  • Submit connection.commit()
  • Close the connection connection.close()
  • Use cursor.execute() to get how many records are queried
  • Use cursor.fetchone() to get the next row of records
  • Use cursor.fetchmany(size=10) to get the specified number of records
  • Use cursor.fetchall() to get all records

Common document reading (TXT, PDF)

  • Causes of garbled characters and solutions
  • Read PDF documents with pdfminer3k

What to pay attention to when using crawlers

  • The crawler protocol file robots.txt
  • User-agent: specifies which crawler the rules apply to; * is a wildcard
  • Disallow: access is not allowed
  • Allow: access is allowed
  • How to find a website's robots protocol: append robots.txt to the site's root URL

Article Notes Reference Course: https://www.imooc.com/video/12622
Code Resources: https://download.csdn.net/download/ungoing/86790114

Origin: blog.csdn.net/ungoing/article/details/127382349