While my girlfriend spends money on vacation, I stay home making money with Python, so spend away~~~

Hello everyone, I am a red panda ❤

It's not easy for my girlfriend to get a holiday, and she's absolutely charming when she takes me out to spend it~


How could I not spoil a girlfriend like that!

As a programmer, you should know how to create value with your own skills~

Today, I will show you how to use Python to collect data from an outsourcing (freelance task) website~

This is the first step on the road to wealth.


Environment:

  • Python 3.8
  • PyCharm

Modules used:

  • requests >>> pip install requests
  • parsel >>> pip install parsel
  • csv

Module installation:

  1. Press Win + R, type cmd and click OK, then run the installation command pip install module name (e.g. pip install requests) and press Enter
  2. Click Terminal in PyCharm and run the same installation command there

Why installation fails:
Failure 1: 'pip' is not recognized as an internal or external command

Solution: set the environment variable (add Python to PATH)

Failure 2: a wall of red error messages (read timed out)

Solution: the network connection timed out, so switch to a mirror source, for example:


  • Tsinghua: https://pypi.tuna.tsinghua.edu.cn/simple
  • Alibaba Cloud: http://mirrors.aliyun.com/pypi/simple/
  • University of Science and Technology of China: https://pypi.mirrors.ustc.edu.cn/simple/
  • Huazhong University of Science and Technology: http://pypi.hustunique.com/
  • Shandong University of Technology: http://pypi.sdutlinux.org/
  • Douban: http://pypi.douban.com/simple/

For example: pip3 install -i https://pypi.doubanio.com/simple/ module name


Failure 3: cmd reports the module is already installed (or the installation succeeds), but it still cannot be imported in PyCharm

Solution: you may have multiple Python installations (for example both Anaconda and a standalone Python); uninstall one of them, or check whether the Python interpreter in PyCharm is set correctly.
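
A quick way to diagnose this is to print the path of the interpreter that is actually running your code: if pip installed the module under a different Python than the one PyCharm uses, the import will fail. A minimal sketch:

# Print which Python interpreter is currently running this script and its version;
# compare this path with the installation that pip put the module into.
import sys

print(sys.executable)  # e.g. C:\Python38\python.exe
print(sys.version)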

How do we implement this crawler case?

1. Analyze the data: what does it look like, and where does it come from?

Use the browser's developer tools to capture the requests and work out where the data comes from.

2. Code implementation steps:

  1. Send a request to the URL address we found
  2. Get the data returned by the server
  3. Parse the data and extract the content we want
  4. Save the data to a CSV table
  5. Collect data across multiple pages: analyze how the requested URL changes from page to page (see the sketch right after this list)
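
For step 5, the listing URLs on task.epwk.com differ only in the page number (page1.html, page2.html, ...), which is the pattern the full script below relies on, so multi-page collection is just a loop over that number. A minimal sketch:

# The page number is the only part of the URL that changes,
# so looping over it covers the first 10 listing pages.
for page in range(1, 11):
    url = f'https://task.epwk.com/page{page}.html'
    print(url)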

Code

Import modules

# Import the data-request module; an imported but unused module is greyed out (idle)
import requests  # third-party module, pip install requests (open-source code written by others that you can call directly)
# Import the data-parsing module
import parsel  # third-party module, pip install parsel
# Import the csv module
import csv  # built-in module, no installation needed

Full code

f = open('data.csv', mode='a', encoding='utf-8', newline='')
# Quick replace tip: select the content to replace, press Ctrl + R and enter a regex
csv_writer = csv.DictWriter(f, fieldnames=[
    'title',
    'tender',
    'viewers',
    'bidders',
    'status',
    'price',
    'detail page',
])
csv_writer.writeheader()

# 1. Send a request to the URL address we found
for page in range(1, 11):
    print(f'Scraping the data of page {page}')
    url = f'https://task.epwk.com/page{page}.html'  # the target URL
    # The crawler pretends to be (disguises itself as) a browser when requesting the URL
    # Anti-crawling works like caller ID: if the call shows up as a sales ad, nobody answers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers)  # <Response [200]>  a 200 status code means the request succeeded, not that we already have the data
    # 2. Get the data returned by the server
    # print(response.text)   # returns the HTML as a string  >>> extracting from a raw string would require the re regex module
    # 3. Parse the data and extract the content we want
    selectors = parsel.Selector(response.text)  # convert the data type (wrap the HTML string in a Selector)
    # Familiar with CSS selectors? XPath extracts data by tag nodes, CSS selectors extract data by tag attributes
    # First extraction: get all the div tags of the task items
    divs = selectors.css('.itemblock')  # a list of selectors
    for div in divs:  # loop over each item
        # Locate the a tag inside the div whose class name is title, and take its title attribute
        # attr() is the attribute selector: it selects the value of an attribute inside a tag
        title = div.css('div.title a::attr(title)').get()  # get() returns a string, taking the first matching tag
        # strip() removes whitespace from both ends of the string
        modelName = div.css('div.modelName::text').get().strip()  # tender mode
        num = div.css('div.browser div:nth-child(2) span::text').get().strip()  # number of viewers
        num_1 = div.css('div.browser div:nth-child(3) span::text').get().strip()  # number of bidders
        status = div.css('span.status::text').get().strip()  # tender status
        price = div.css('span.price::text').get().strip()  # price
        href = div.css('div.title a::attr(href)').get()  # detail page
        # 4. Save the data to the CSV table; build a dictionary of key: value pairs separated by commas
        # Values can be any data type; the keys here are strings and must match the CSV field names
        dit = {
            'title': title,
            'tender': modelName,
            'viewers': num,
            'bidders': num_1,
            'status': status,
            'price': price,
            'detail page': href,
        }
        csv_writer.writerow(dit)
        print(title, modelName, num, num_1, status, price, href)
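
One detail the script above leaves out: the file handle f is never closed, so the last buffered rows may not be flushed to disk right away. A minimal follow-up sketch (assuming the data.csv produced above with the English field names used here) that closes the file and reads the table back to verify the saved rows:

# Close the CSV file so all buffered rows are flushed to disk
f.close()

# Read the saved CSV back to check what was collected
with open('data.csv', mode='r', encoding='utf-8', newline='') as fp:
    reader = csv.DictReader(fp)
    for row in reader:
        print(row['title'], row['price'])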

Video tutorial [watch it if fate brings you here; if not, never mind]

https://www.bilibili.com/video/BV1vS4y1v7Fu/?spm_id_from=333.999.0.0

I hope everyone can make a little money~

The article ends here~

I'm Red Panda, see you in the next article (✿◡‿◡)

