Web Scraping: Basic Usage

1. Basics

import requests
# Purpose of the module: forge a browser request
response = requests.get(url_to_visit)
from bs4 import BeautifulSoup
# Parse the HTML content into an object
soup = BeautifulSoup(response.text, 'html.parser')
# Lookup method: returns the first match only
soup.find(name='tag_name', attrs={'attr_name': 'attr_value'})
# find_all finds every match and returns a list
# Getting the response body:
# content: the raw body as bytes, for binary data (images, videos)
# text: the body decoded to str, for HTML and other text
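A minimal, self-contained sketch of these lookups (the HTML snippet, tag names, and the id value below are invented for illustration):

from bs4 import BeautifulSoup

html = '''
<div id="news">
  <li><h3>First title</h3><img src="//example.com/a.jpg"></li>
  <li><h3>Second title</h3><img src="//example.com/b.jpg"></li>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
# find returns the first match, or None if nothing matches
div = soup.find(name='div', attrs={'id': 'news'})
# find_all returns every match as a list
for li in div.find_all(name='li'):
    print(li.find(name='h3').text)         # .text -> decoded str
    print(li.find(name='img').get('src'))  # .get reads a tag attribute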

2. Example

import requests
from bs4 import BeautifulSoup
import os
# Directory where downloaded images will be saved
path = os.path.join(os.getcwd(), 'img')
os.makedirs(path, exist_ok=True)  # create the folder if it does not exist
# 1. Forge a browser request
response = requests.get("......")
response.encoding = 'gbk'  # match the page's GBK encoding so .text decodes correctly
# 2. Get the page's HTML
# print(response.text)
# 3. Use bs4 to parse the HTML into an object
soup = BeautifulSoup(response.text, 'html.parser')
# print(soup)
div = soup.find(name='div', attrs={'id': 'auto-channel-lazyload-article'})
# print(div)
li_list = div.find_all(name='li')
for li in li_list:
    print('='*120)
    # print(li)
    h3 = li.find(name='h3')
    if not h3:
        continue
    print(h3.text)
    a = li.find(name='a')
    href = a.get('href')
    print('https:{}'.format(href))
    img = li.find(name='img')
    src = img.get('src')
    src = 'https:{}'.format(src)
    print(src)
    file_name = src.rsplit('/', maxsplit=1)[1]
    # print(file_name)
    file_path = os.path.join(path, file_name)
    # print(file_path)
    # src is just a URL, so forge another GET request for the image itself
    ret = requests.get(src)
    # content holds the raw bytes of the response
    # print(ret.content)
    # Save the image to disk
    with open(file_path, 'wb') as f:
        f.write(ret.content)
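The script above sends no headers and does no error handling, so a real site may reject the request, or a failed download may write an error page to disk. A minimal hardened sketch; the download_image name, the User-Agent string, and the timeout value are my own assumptions, not part of the original post:

import os
import requests

def download_image(src, dest_dir):
    # Pretend to be a browser; many sites block the default requests UA
    headers = {'User-Agent': 'Mozilla/5.0'}
    ret = requests.get(src, headers=headers, timeout=10)
    ret.raise_for_status()  # raise on HTTP 4xx/5xx instead of saving garbage
    os.makedirs(dest_dir, exist_ok=True)
    file_path = os.path.join(dest_dir, src.rsplit('/', maxsplit=1)[1])
    with open(file_path, 'wb') as f:
        f.write(ret.content)  # raw bytes, not .text
    return file_path

raise_for_status() turns HTTP error responses into exceptions, which is usually what you want in a batch download loop.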

Reposted from www.cnblogs.com/wt7018/p/11706125.html