Python Advanced: Web Crawler

Web Crawler

2.1 Send a request

Import the Requests module:

import requests

Get the webpage:

r = requests.get('http://xxx.xxx')

At this point we have a Response object r, and we can get the information we need from it. Requests' simple API means every HTTP request type is obvious; here is an example using the common request types get, post, put, and delete:

r = requests.get('http://xxx.xxx/get')
r = requests.post('http://xxx.xxx/post', data={'key': 'value'})
r = requests.put('http://xxx.xxx/put', data={'key': 'value'})
r = requests.delete('http://xxx.xxx/delete')
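
For instance, we can inspect the Response object r returned by the calls above (the values in the comments are illustrative, since the hosts here are placeholders):

print(r.status_code)              # HTTP status code, e.g. 200
print(r.headers['Content-Type'])  # response headers behave like a case-insensitive dictionary
print(r.url)                      # the final URL after any redirects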

We usually set a timeout on the request. Requests does this with the timeout parameter, measured in seconds. For example:

r = requests.head('http://xxx.xxx/get', timeout=1)
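
If the server does not answer within that time, Requests raises an exception; a minimal sketch of catching it looks like this:

import requests

try:
    r = requests.get('http://xxx.xxx/get', timeout=1)
except requests.exceptions.Timeout:
    print('the request timed out')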

2.2 Parameter passing

When sending a request with the get method, we place key-value parameters after the question mark in the URL, such as http://xxx.xxx/get?key=val. Requests lets us supply these parameters as a dictionary through the params keyword argument. For example, to pass key1=val1 and key2=val2 to http://xxx.xxx/get:

pms = {'key1': 'val1', 'key2': 'val2'}
r = requests.get("http://xxx.xxx/get", params=pms)

Requests also allows a list to be passed in as a value:

pms = {'key1': 'val1', 'key2': ['val2', 'val3']}

Note: Keys with a value of None in the dictionary will not be added to the URL query string.
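
As a sketch of how this expands (reusing the placeholder host from above), the list value is repeated in the query string while a None value is dropped; prepare() builds the URL without actually sending the request:

import requests

pms = {'key1': 'val1', 'key2': ['val2', 'val3'], 'key3': None}
prepared = requests.Request('GET', 'http://xxx.xxx/get', params=pms).prepare()
print(prepared.url)  # http://xxx.xxx/get?key1=val1&key2=val2&key2=val3  (key3 is dropped)
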
2.3 Response content

Let's read the response content from the server, using https://api.github.com as an example:

import requests
r = requests.get('https://api.github.com')
print(r.text)

# Output:
# {"current_user_url":"https://api.github.com/user","current_user...

When we access r.text, Requests decodes the response with the text encoding it has inferred. We can inspect that encoding with r.encoding, or change it, for example r.encoding = 'GBK'. Once the encoding has been changed, accessing r.text again makes Requests use the new value of r.encoding.
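
For example (the inferred encoding shown is what this API typically returns, so treat it as illustrative):

import requests

r = requests.get('https://api.github.com')
print(r.encoding)    # encoding Requests inferred from the headers, typically 'utf-8'
r.encoding = 'GBK'   # override it; the next access to r.text decodes with the new encoding
print(r.text[:60])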

1) Binary response content. When we want the data of an image, for example, we read the response data in binary form via r.content:

from PIL import Image
from io import BytesIO
import requests

r = requests.get('http://xxx.xxx/image.png')  # placeholder image URL, following the examples above
i = Image.open(BytesIO(r.content))

2) JSON response content. Requests has a built-in JSON decoder, so we can easily parse JSON data:

import requests
r = requests.get('https://api.github.com')
r.json()

Note: A successful call to r.json() does not necessarily mean the request succeeded; some servers include a JSON object in a failed response (such as the details of an HTTP 500 error). In that case we need to check the response status code with r.status_code or call r.raise_for_status(). On success, r.status_code is 200 and r.raise_for_status() returns None.
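
A minimal sketch of that check (the endpoint path here is hypothetical):

import requests

r = requests.get('https://api.github.com/some/endpoint')  # hypothetical path for illustration
if r.status_code == 200:
    print(r.json())
else:
    r.raise_for_status()  # raises requests.exceptions.HTTPError for 4xx/5xx responses
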
2.4 Custom request headers

When we want to add custom headers to a request, we only need to pass a dictionary to the headers parameter:

url = 'http://xxx.xxx'
hds = {'user-agent': 'xxx'}
r = requests.get(url, headers=hds)

Note: Custom headers have lower priority than some more specific sources of information. For example, if user authentication information is set in .netrc, an Authorization header set through headers will not take effect, and setting the auth parameter in turn overrides .netrc. All header values must be a string, bytestring, or unicode; unicode is generally not recommended.
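
As a sketch of that precedence rule (the host and credentials are placeholders):

import requests

# the explicit auth argument takes precedence over any credentials found in ~/.netrc
r = requests.get('http://xxx.xxx',
                 auth=('user', 'passwd'),
                 headers={'user-agent': 'xxx'})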

Exercise:
Crawl the article titles from Anquanke (www.anquanke.com)

import re
import requests
import time

url = "https://www.anquanke.com/"
# Disguise the request with a browser User-Agent
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
}
# Fetch the page (headers must be passed as a keyword argument)
req = requests.get(url, headers=headers)
# Check the response status
# print(req)
# Dump the full page source
# print(req.text)

# Save the page source to a local file
with open("anquanke.txt", "w", encoding="utf-8") as f1:
    f1.write(req.text)

# Read the local file back
with open("anquanke.txt", "r", encoding="utf-8") as f2:
    content = f2.read()

    # Sample of the HTML the pattern has to match:
    # <a class="title g-line1" href="/post/id/289226">智在粤港澳,阿里云原生安全2.0应运而生</a>
    # Build the regular expression
    m = re.findall('<a class="title g-line1" href="(.*?)">(.*?)</a>', content)

    print(f"Anquanke Security Monitoring and Response Center daily security highlights\t{time.strftime('%Y-%m-%d')}\n")
    # Print every title and its link
    for num, i in enumerate(m, start=1):
        print(f"Title {num}: {i[1]}")
        print(f"Link: https://www.anquanke.com{i[0]}")

Crawl images

import requests
import re

"""
1. Decide the target URL
2. Send the request and receive the response
3. Filter out the data we want
4. Save it locally
"""

url = 'https://wallspic.com/tag/iphone/for_mobile'

response = requests.get(url=url).text

# Pull every image URL out of the page source
contentUrl = re.findall(r'contentUrl":"(https://img\d.wallspic.com/crops/.*?)"', response)

j = 0
for i in contentUrl:
    # print(contentUrl)
    j += 1
    # Download the raw image bytes
    Content = requests.get(url=i).content
    # print(Content)  # debug: raw image bytes
    with open(f'Wallspic-{j}.jpg', mode='wb') as f:
        f.write(Content)
        print(f'[+] Saved wallpaper {j}!')
        

Crawling song titles and singers from NetEase Cloud Music playlists

import requests
import fake_useragent
import re

"""
1. Decide the target URL
2. Send the request and receive the response
3. Filter out the data we want
4. Save it locally
"""

url = "https://music.163.com/discover/toplist?id=3778678"
ua = fake_useragent.UserAgent()

header = {
    'user-agent': ua.random
}
response = requests.get(url=url, headers=header)
r = response.text
# print(r)
response.close()

# The <ul> block that holds every song title and link, e.g.
# <ul class="f-hide"><li><a href="/song?id=1974443814">我记得</a></li>
all = ''.join(re.findall(r'<ul class="f-hide">(.*?)</ul>', r))
# print(all)

# Extract the song titles from all
# <li><a href="/song?id=1974443814">我记得</a></li><li>
name = re.findall(r'<a href=".*?">(.*?)</a>', all)

# Extract the song URLs from all
song_url = re.findall(r'<a href="(.*?)">.*?</a>', all)


# Extract the artist names from the full page source
# "artists":[{"id":6731,"name":"赵雷","tns":[],"alias":[]}],"stat
singer = re.findall(r'"artists":\[{"id":.*?,"name":"(.*?)",', r)


# Print the results
for i in range(len(name)):
    print(name[i], '\t', singer[i], '\t', song_url[i])


Practice:
Crawling QQ Music playlists

import requests
from fake_useragent import UserAgent
import re
from lxml import etree
import csv
import time


# Open the CSV file for writing
f = open('QQ音乐热歌榜单.csv', mode='w', newline='', encoding='utf-8-sig')
w_headers = csv.DictWriter(f, fieldnames=['歌名', '歌手', '歌曲地址'])
w_headers.writeheader()

# Target URL
url = 'https://y.qq.com/n/ryqq/toplist/4'

# Request headers with a random User-Agent
headers = {
    'user-agent': UserAgent().random
}

time.sleep(2)

# Send the request
response = requests.get(url=url, headers=headers)
html = response.text

# Method 1: extract the data with XPath
h = etree.HTML(html)

Dict = {}
data = h.xpath('//div[@class="songlist__item songlist__item--even"]')[0]
Dict['歌名'] = data.xpath('//div[@class="songlist__songname"]/span/a[2]/text()')
Dict['歌手'] = data.xpath('//div[@class="songlist__artist"]/a/text()')
s = 'https://y.qq.com'
src = data.xpath('//div[@class="songlist__songname"]/span/a[2]/@href')
print(src)
# Turn the relative song links into absolute URLs
for i in range(0, len(src)):
    src[i] = s + src[i]
Dict['歌曲地址'] = src

w = {}
for i in range(0, 20):
    w['歌名'] = Dict['歌名'][i]
    w['歌手'] = Dict['歌手'][i]
    w['歌曲地址'] = Dict['歌曲地址'][i]
    w_headers.writerow(w)
    print(f'[*] Saved info for {i + 1} songs')

# Method 2: extract the data with a regular expression
Dict = {}
title = re.finditer(r'<a title=".*?" href="(?P<src>.*?)">(?P<name>.*?)<.*?<a class="playlist__author" title="(?P<singer>.*?)" href', html)
j = 0
for i in title:
    j += 1
    Dict['歌名'] = i.group('name')
    Dict['歌手'] = i.group('singer')
    Dict['歌曲地址'] = 'https://y.qq.com' + i.group('src')
    w_headers.writerow(Dict)
    print(f'[*] Saved info for {j} songs')

Crawling Lianjia’s housing information

import csv
import requests
import fake_useragent
from lxml import etree

f = open('链家租房信息.csv', mode='w', newline='', encoding='utf-8-sig')
writer = csv.DictWriter(f, fieldnames=['地区', '小区', '简介', '月租'])
writer.writeheader()

for i in range(1, 4):
    ua = fake_useragent.UserAgent()
    url = "https://bj.lianjia.com/zufang/pg{}/".format(i)
    header = {"User-Agent": ua.random}
    # Fetch the page source
    r = requests.get(url=url, headers=header).text
    # print(r)
    html = etree.HTML(r)

    # First grab the whole listing
    data = html.xpath('//div[@class="content__list"]//div[@class="content__list--item"]')
    print(data)
    # Then loop over each item and extract its details
    for item in data:
        # Put the fields into a dict so they are easy to write out later
        d = {}
        d['地区'] = item.xpath('.//p[@class="content__list--item--des"]/a[1]/text()')
        d['小区'] = item.xpath('.//p[@class="content__list--item--des"]/a[3]/text()')
        d['简介'] = item.xpath('.//p[@class="content__list--item--des"]/text()')
        d['月租'] = item.xpath('.//span[@class="content__list--item-price"]/em/text()')
        print(d)
        writer.writerow(d)

Crawling vulnerability information from the Exploit-DB vulnerability database

import csv
import requests
import fake_useragent
from lxml import etree

f = open('漏洞库.csv', mode='w', newline='', encoding='utf-8-sig')
writer = csv.DictWriter(f, fieldnames=['Title', 'EDB-ID', 'CVE', 'Author', 'Type', 'Date', 'DOWNLOAD'])
writer.writeheader()

ua = fake_useragent.UserAgent()
# url = 'https://www.exploit-db.com/exploits/51467'
header = {
    'user-agent': ua.random
}

# Build a list of 300 exploit URLs, counting down from ID 51515
zzz = []
for i in range(0, 300):
    zzz.append(f"https://www.exploit-db.com/exploits/{51515 - i}")

for i in range(0, 300):
    req = requests.get(url=zzz[i], headers=header)
    if req.status_code == 200:
        h = etree.HTML(req.text)
        Dict = {}
        Dict['Title'] = h.xpath('//div[@class="row justify-content-md-center"]/h1/text()')
        b = h.xpath('//div[@class="info info-horizontal"]/div')[0]
        # Note: these five queries all start with //, so each returns the full
        # list of h6 values on the page rather than a single field
        Dict['EDB-ID'] = b.xpath('//div[@class="col-6 text-center"]/h6/text()')
        Dict['CVE'] = b.xpath('//div[@class="col-6 text-center"]/h6/text()')
        Dict['Author'] = b.xpath('//div[@class="col-6 text-center"]/h6/text()')
        Dict['Type'] = b.xpath('//div[@class="col-6 text-center"]/h6/text()')
        Dict['Date'] = b.xpath('//div[@class="col-6 text-center"]/h6/text()')
        asd = h.xpath('//div[@class="stats h5 text-center"]/a[1]/@href')
        Dict['DOWNLOAD'] = f"https://www.exploit-db.com{asd[0]}"
        print(Dict)
        writer.writerow(Dict)

Origin blog.csdn.net/m0_51553670/article/details/131275153