Python Scraping, Day 1

1. Introduction: what is a crawler?

A program that fetches information from a website given its URL.

2. How do you scrape a website?

a. Impersonate a browser: send an HTTP request to a given address and collect the returned string.

(Import the requests module to impersonate a browser. Install it with: pip3 install requests)

a1. Download the page

response = requests.get(url="URL")

response.encoding = response.apparent_encoding          # detect the page's encoding and use it for decoding

response.text                                           # the page as decoded text

response.content                                        # the page as raw bytes

Note that these attributes live on the response object returned by requests.get, not on the requests module itself.
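A minimal runnable sketch of the calls above (the URL is a placeholder):

import requests

# fetch a page, then decode it with the detected encoding
response = requests.get(url="https://example.com")   # placeholder URL
response.encoding = response.apparent_encoding
print(response.text[:200])     # first 200 characters of the decoded HTML
print(len(response.content))   # size of the raw body in bytes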

response = requests.post(
    url='URL',
    data={},          # form fields, as a dict
    headers={},       # request headers
    cookies={},
)

cookies_dict = response.cookies.get_dict()          # get_dict is a method, so it must be called

Note: 1. impersonate the browser; 2. analyze the request.
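A hedged sketch of a login-style session putting POST, headers, and cookies together; the endpoint, form fields, and pages here are hypothetical, not from any real site:

import requests

# POST credentials to a (hypothetical) login endpoint, impersonating a browser
response = requests.post(
    url='https://example.com/login',           # hypothetical endpoint
    data={'user': 'name', 'pwd': 'secret'},    # field names depend on the site's form
    headers={'User-Agent': 'Mozilla/5.0'},     # minimal browser impersonation
)

# capture the session cookies and reuse them on a follow-up request
cookies_dict = response.cookies.get_dict()
profile = requests.get(
    url='https://example.com/profile',         # hypothetical page behind the login
    cookies=cookies_dict,
)
print(profile.status_code)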

b. Parse: extract the specific content you want with BeautifulSoup (import the BeautifulSoup module, which parses HTML strings).

Install it with: pip3 install BeautifulSoup4

soup = BeautifulSoup('<html>........</html>', 'html.parser')          # lxml is faster but must be installed separately

div = soup.find(name='tag')                                                       # find the first matching tag

div = soup.find(name='tag', id='il')

div = soup.find(name='tag', class_='il')                                          # class_, because class is a Python keyword

div = soup.find(name='div', id='auto-channel-lazyload-article')

div.text                                                                          # the tag's text

div.attrs                                                                         # all attributes, as a dict

div.get('href')                                                                   # a single attribute
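A self-contained sketch of find and attribute access on a made-up HTML snippet:

from bs4 import BeautifulSoup

html = '<div id="il" class="item"><a href="/news">headline</a></div>'
soup = BeautifulSoup(html, 'html.parser')

div = soup.find(name='div', id='il')   # first tag matching name and id
print(div.text)                        # headline
print(div.attrs)                       # {'id': 'il', 'class': ['item']}
print(div.find('a').get('href'))       # /news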

divs = soup.find_all(name='tag')                                                  # find all matching tags

divs = soup.find_all(name='tag', id='il')

divs = soup.find_all(name='tag', class_='il')

divs = soup.find_all(name='div', id='auto-channel-lazyload-article')

find_all returns a list, so you pick results out by index; you cannot call find directly on the result.
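The same idea as a sketch on another made-up snippet; note the class_ keyword and the indexing:

from bs4 import BeautifulSoup

html = '<ul><li class="il">a</li><li class="il">b</li><li>c</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

lis = soup.find_all(name='li', class_='il')   # class_ because class is a keyword
print(len(lis))      # 2
print(lis[0].text)   # a  -- pick items by index; lis.find(...) would fail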

Example (a full download-and-parse run over an article list; the target URL is left as a placeholder):

import requests
from bs4 import BeautifulSoup


# 1. Download the page
ret = requests.get(url="URL")
ret.encoding = ret.apparent_encoding
# print(ret.text)

# 2. Parse: pull out the desired content with BeautifulSoup
soup = BeautifulSoup(ret.text, 'html.parser')  # lxml is faster but must be installed separately

div = soup.find(name='div', id='auto-channel-lazyload-article')

li_list = div.find_all(name='li')
for li in li_list:
    h3 = li.find(name='h3')
    if not h3:  # skip list items without a headline
        continue
    p = li.find(name='p')
    a = li.find('a')
    img = li.find('img')
    src = img.get('src')

    # derive a file name from the last '__'-separated segment of the image URL
    file_name = src.rsplit('__', maxsplit=1)[1]

    # download the image and save it to disk
    ret_img = requests.get(url="https:" + src)
    with open(file_name, 'wb') as f:
        f.write(ret_img.content)

    print(h3.text, a.get('href'))
    print(p.text)

The idea: to scrape a site's content, make your program look as much like a real browser as possible; if the impersonation is convincing enough, the site's anti-scraping checks won't catch you (see the sketch below).

Step one: download. Step two: parse.
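A hedged sketch of that impersonation idea: send headers a real browser would send. The header values are typical examples, not a guaranteed way past any particular site's checks:

import requests

headers = {
    # a typical desktop User-Agent; many sites reject the default
    # "python-requests/x.y" agent, so this is the first thing to fake
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/120.0.0.0 Safari/537.36'),
    'Referer': 'https://example.com/',   # some sites also check the Referer
}

ret = requests.get(url='https://example.com', headers=headers)
print(ret.status_code)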


Reposted from www.cnblogs.com/guanzizai/p/9269281.html