# Part 5 Sharing: Python Crawlers - Opening a New Era of Data Collection (1)

# A first experience with Python crawlers
1. Introduction to crawlers:
a. Theoretical basis: a crawler fetches the data we need from websites, including text, audio, video, and so on. Not every website can be crawled easily, though, because some data owners do not want their data shared, which is why anti-crawling measures exist. While learning, start with websites that have no anti-crawling protection; later you can raise the difficulty bit by bit and play the attack-and-defense game.

Crawlers are divided into:
general crawlers, such as search engines: Baidu, Google, Firefox;
focused crawlers, which crawl a specific web page in a targeted way, such as a light-music site or a movie-trailer site, and capture the corresponding data (focused crawlers are what we mainly introduce here);

Crawler trilogy: data crawling -> data cleaning -> data storage (use and management)

b. Crawler module introduction:
#1. No installation needed; use Python's built-in module: from urllib import request;
#2. Or install the third-party module requests: pip install requests; import requests;

Pick either one of the two; the syntax differs but the functionality is the same, as shown in the sketch below.
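
To make the "pick either one" point concrete, here is a minimal sketch (assuming a working network connection and that Baidu's homepage is UTF-8 encoded) that fetches the same page with both modules:

from urllib import request   # built-in module
import requests              # third-party module: pip install requests

url = "http://www.baidu.com"

# Built-in urllib: build a Request object, open it, read bytes, then decode to str
html1 = request.urlopen(request.Request(url)).read().decode("utf-8")

# Third-party requests: one call; .text decodes to str for you
html2 = requests.get(url).text

print(len(html1), len(html2))   # both now hold the page's HTML as a string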

c. Data cleaning module introduction:
#1. No installation needed; use Python's built-in module: import re (regular-expression filtering does not care what format the data comes in; a summary of regular-expression usage follows below);
#2. Or install lxml: for filtering and cleaning .html web-page content (press F12 to view a page's source code): pip install lxml; from lxml import etree;

Note:
#1. Regular expressions (re):
a. Every letter, digit, and symbol has a corresponding matching pattern;
b. Matching has two modes: greedy mode (.*) and non-greedy mode (.*?);
c. re.search() stops after the first match, while findall() keeps matching through to the end of the content;
#2. lxml matching of HTML documents, taking the div tag as an example:
a. //div (matches all content under this tag)
b. //div[@class="attribute"] (matches the div tags carrying this attribute and the content under them)
c. //div[@class="attribute"]/a (matches the content of the a tags under that div); appending /text() strips the tags from the crawled result
d. //div[@class="attribute"]/a/@href (matches the href value of the a tag)

The above can be combined; choosing the appropriate cleaning method keeps the code concise (see the short lxml sketch below).
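
To illustrate the XPath patterns above, here is a minimal sketch run against a small made-up HTML snippet (the class name "attribute" and the example links are invented for demonstration):

from lxml import etree   # pip install lxml

# A tiny made-up HTML snippet, just to exercise the XPath patterns above
html = '''
<div class="attribute">
    <a href="http://example.com/1">First link</a>
    <a href="http://example.com/2">Second link</a>
</div>
'''

tree = etree.HTML(html)

texts = tree.xpath('//div[@class="attribute"]/a/text()')  # tag text, tags stripped
hrefs = tree.xpath('//div[@class="attribute"]/a/@href')   # href attribute values

print(texts)   # ['First link', 'Second link']
print(hrefs)   # ['http://example.com/1', 'http://example.com/2']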

Maybe the introduction above still doesn't make much sense; don't panic. It took me a long time of studying before it clicked. This is only a first experience, so just get the general idea; follow along with me and the epiphany will come.

re: filters data of any format

# #Regular expressions: data filtering & matching
import re
#Ordinary characters as atoms (each matches one literal character)
# a="湖南湖北广东广西"
# pat="湖北"
# result=re.search(pat,a)
# print(result)

#Matching general character classes
#\w any letter, digit, or underscore
#\W the opposite of \w
#\d a decimal digit
#\D anything other than a decimal digit
#\s a whitespace character
#\S a non-whitespace character

# b="136892763900"
# pat2="1\d\d\d\d\d\d\d\d\d\d"
# print(re.search(pat2,b))

# c="@@@@@@@@@@##@!_tdyuhdihdiw"
# pat3=r"\W\w\w"
# print(re.search(pat3,c))

#Matching digits, English letters, and Chinese characters
# digits             [0-9]
# English letters    [a-z][A-Z]
# Chinese characters [\u4e00-\u9fa5]

# d="!@#$@#@##$张三%$^%$%#@$boy#@%##$%$$@#@#23@#@#@#@##$%$%$"

# pat1=r"[\u4e00-\u9fa5][\u4e00-\u9fa5]"
# pat2=r"[a-z][a-z][a-z]"
# pat3=r"[0-9][0-9]"

# result1=re.search(pat1,d)
# result2=re.search(pat2,d)
# result3=re.search(pat3,d)

# print(result1,result2,result3)

#Atom tables (character classes)
#define a set of equivalent atoms
# b="18689276390"
# pat2="1[3578]\d\d\d\d\d\d\d\d\d"
# print(re.search(pat2,b))

# c="nsiwsoiwpythonjsoksosj"
# pat3=r"py[abcdt]hon"

# print(re.search(pat3,c))
Very important:
#Metacharacters -- characters with special meaning in a regular expression
# . matches any character except \n
# ^ matches the start of the string, e.g. ^136
# $ matches the end of the string, e.g. 6666$
# * repeats the preceding atom zero or more times, e.g. \d*
# ? repeats the preceding atom zero or one time, e.g. \d?              (.* greedy mode   .*? non-greedy mode)
# + repeats the preceding atom one or more times, e.g. \d+

d="135738484941519874888813774748687"
# pat1="..."
# pat2="^135\d\d\d\d\d\d\d\d"
# pat3=".*8687$"
pat4="8*"
pat5="8+"
print(re.findall(pat5,d))

#Matching a fixed number of times
#{n}   the preceding atom appears exactly n times
#{n,}  the preceding atom appears at least n times
#{n,m} the preceding atom appears between n and m times
# a="234ded65de45667888991jisw"
# pat1=r"\d{8,10}"
# print(re.search(pat1,a))

# #Multiple expressions (alternation) with |
# a="13699998888"
# b="027-1234567"
# pat1=r"1[3578]\d{9}|\d{3}-\d{7}"
# print(re.search(pat1,a))

#Grouping with ()
# a="jiwdjeodjo@$#python%$$^^&*&^%$java#@!!!!!!!!!!!!!!13688889999!!!!!!!!!!!!!!!!!#@#$#$"
# pat=r"(python).{0,}(java).{0,}(1[3578]\d{9})"
# print(re.search(pat,a).group(3))
#group(), like group(0), returns the overall match of the regular expression
#group(1) returns the part matched by the first pair of parentheses, group(2) the second pair, group(3) the third pair.

# a="jiwdjeodjo@$#python%$$^^&*&^%$java#@!!!!!!!!!!!!!!aaa我要自学网bbb!!!!!!!!!!!!!!!!!#@#$#$"
# pat=r"aaa(.*?)bbb"
# print(re.findall(pat,a))

**#Greedy mode and non-greedy mode**
#Greedy mode: match as much as possible while still letting the whole expression succeed;
#Non-greedy mode: match as little as possible while still letting the whole expression succeed (add ?);
#Python is greedy by default.
# strr='aa<div>test1</div>bb<div>test2</div>cc'
# pat1=r"<div>.*</div>"
# print(re.search(pat1,strr)) #greedy mode
# strr='aa<div>test1</div>bb<div>test2</div>cc'
# pat1=r"<div>.*?</div>"
# print(re.findall(pat1,strr)) #non-greedy mode

# import re
# #The compile function converts a regular expression into an internal pattern object to improve execution efficiency
# strr="PYTHON666Java"
# pat=re.compile(r"Python",re.I) #pattern modifier flag: ignore case
# print(pat.search(strr))

# import re
# #The match and search functions
# # match  -- matches only at the start of the string
# # search -- matches at any position
# #Both perform a single match: once a match is found they stop searching
# strr="javapythonjavahtmlpythonjs"
# pat=re.compile(r"python")
# print(pat.search(strr).group())

Very important:
# import re
# #findall()   finds all matches and returns them in a list
# #finditer()  finds all matches and returns them in an iterator
# strr="hello--------hello-----------\
# ---------hello-----------------\
# ---------hello--hello----------------\
# ----------hello---------hello----hello----------"
# pat=re.compile(r"hello")
# #print(pat.findall(strr))
# data=pat.finditer(strr)
# list1=[]
# for i in data:
#  list1.append(i.group())
# print(list1)

lxml: use tags to filter data (open the web page -> press F12 to view the page source -> find the data you need -> filter it with its tags):

2. The first simple crawler:
a. urllib.request + re crawler:

from urllib import request                #built-in Python module
import re                                 #regular expression module, used for data filtering
import random                             #random module

# 1. Target URL:
url = "http://www.baidu.com/?tn=18029102_3_dg"
#2. Browser selection
#Disguise as a browser: pick one of several browsers at random
#a. Create a custom request object ----- to deal with anti-crawling mechanisms
#b. Anti-crawling mechanism 1: the site checks whether the visitor is a real browser
#c. We can still crawl it by disguising the request as a browser
agent1 = "Mozilla/5.0 (Windows NT 10.0;Win64; x64) AppleWebKit/537.36 " \
         "(KHTML, like Gecko)Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362"
agent2 = "Mozilla/5.0 (Linux; Android 8.1.0; ALP-AL00Build/HUAWEIALP-AL00;" \
         " wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/63.0.3239.83 " \
         "Mobile Safari/537.36 T7/10.13 baiduboxapp/10.13.0.11 (Baidu; P1 8.1.0)"

#Build a list and pick a User-Agent at random
agent_list = [agent1,agent2]
agent = random.choice(agent_list)                 #random choice
header = {"User-Agent":agent}                     #build the request header: dict format
#print(agent)

#3. Send the request and read the response
req = request.Request(url,headers = header)
response = request.urlopen(req).read().decode()  #decode ---- (the opposite of encode)
#print(response)

#4. Data cleaning
pat = r"<title>(.*?)</title>"   #clean the data with a regular expression
data = re.findall(pat,response)
#print(data)

#5. Write the result to a file
f = open("wenjian.txt", "a+",encoding='utf-8')
f.write(str(data))    #findall returns a list, so convert it to a string before writing
f.close()

b. requests GET-method crawler:

import requests
url = "http://www.baidu.com/s?"
header = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0;\
           Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)\
           Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362"}
#GET parameters to append to the URL
wd = {"wd":"中国"}
#response = requests.request("get",url,params = wd,headers = header)
#Send the GET request and receive the response
response = requests.get(url,params = wd,headers = header)
#Decode the response
data1 = response.text    #returns str data
data2 = response.content #returns bytes data
data3 = data2.decode()   #returns str data
#Print the data
print(data2)

c. requests POST-method crawler:

import requests
import re

#Build the request header
header = {
    "User-Agent":"Mozilla / 5.0(WindowsNT10.0;Win64;x64) AppleWebKit /"
        "537.36(KHTML, likeGecko) Chrome / 70.0.3538.102Safari / 537.36Edge / 18.18362"}

url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"
key = input("Enter the text to translate: ")
#Parameters to submit with the POST request
formdata={
"action":"FY_BY_CLICKBUTTION",
"bv":"1ca13a5465c2ab126e616ee8d6720cc3",
"client":"fanyideskweb",
"doctype":"json",
"from": "AUTO",
"i":key,
"keyfrom":"fanyi.web",
"salt":"15825444779453",
"sign":"28237f8c8331019bc43baef299570901",
"smartresult":"dict",
"to":"AUTO",
"ts":"1582544477945",
"version": "2.1"
}
response = requests.post(url,headers = header,data = formdata).text
pat = r'"tgt":"(.*?)"}]]'
result = re.findall(pat,response)
print(result)
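
Because formdata sends doctype=json, the same response can also be parsed as JSON instead of being matched with a regular expression. The following is a minimal sketch that reuses the url, header, and formdata variables defined above; the "translateResult" key path is an assumption based on how this Youdao endpoint responded at the time of writing and may change:

import requests

# Minimal sketch: parse the JSON body instead of matching it with a regex.
# Assumes url, header, and formdata are the ones defined above.
resp = requests.post(url, headers=header, data=formdata)
data = resp.json()   # doctype=json was requested, so the body should parse as JSON

# Hypothetical key path: "tgt" is the key the regex above targets;
# "translateResult" is assumed from this endpoint's response shape and may differ.
print(data["translateResult"][0][0]["tgt"])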

d. requests + lxml crawler:

#Crawl jokes from qiushibaike.com
from lxml import etree
import requests

url = 'https://www.qiushibaike.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362'}

response = requests.get(url,headers=headers).content.decode()

html = etree.HTML(response)
result1 = html.xpath('//a[@crecmd-content]')
for i in range(len(result1)):
    print(result1[i].text)
    print("------------------------------------------")

How to find content by tag:
"Crawler - data search" is a video I recorded, because for some of this knowledge text alone is not enough; you can also check out my other videos if you are interested.

The above is a first experience with crawlers. After learning it, I believe you have a basic understanding of crawlers and can build a simple little crawler. If you are interested, keep learning with me. If not, let me at least recommend a site, "national resolve", so your visit was not wasted: movies that normally require a paid VIP can be watched there for free. I bought a VIP myself, yet content that still asks for extra payment to unlock can be watched this way too; I don't tell most people that.
Part 5 of this sharing series, continuously updated...

Origin blog.csdn.net/weixin_46008828/article/details/108645192