Collecting the Top 250 movie ranking with Python, so you'll never be bored again...

Hello everyone, I'm Red Panda ❤

There's been a bit of drama lately...

The top ten movies are no longer enough, so this time let's go through the entire Top 250 ranking in one go.


Highlights of this tutorial:

  • Analyzing the web page structure
  • Parsing data with css / xpath / re
  • Saving to a CSV file

Environment used:

  • Python 3.8
  • PyCharm


Modules used:

  • requests >>> data request module: pip install requests
  • parsel >>> data parsing module: pip install parsel
  • csv >>> part of the Python standard library, no installation needed

Module installation problems:

If you need to install a third-party Python module:

  1. Press Win + R, type cmd and click OK, then enter the install command pip install <module name> (e.g. pip install requests) and press Enter.
  2. Or click Terminal in PyCharm and enter the same install command.

Reason for installation failure:

Failure 1: 'pip' is not recognized as an internal or external command
Solution: add the Python Scripts directory to your PATH environment variable

Failure 2: a wall of red error output (read time out)
Solution:

The network connection timed out; switch to a mirror source:

Tsinghua: https://pypi.tuna.tsinghua.edu.cn/simple
Alibaba Cloud: http://mirrors.aliyun.com/pypi/simple/
University of Science and Technology of China: https://pypi.mirrors.ustc.edu.cn/simple/
Huazhong University of Science and Technology: http://pypi.hustunique.com/
Shandong University of Technology: http://pypi.sdutlinux.org/
Douban: http://pypi.douban.com/simple/

For example: pip3 install -i https://pypi.doubanio.com/simple/ <module name>

Failure 3: cmd says the module is already installed (or the installation succeeds), but it still cannot be imported in PyCharm
Solution: you probably have multiple Python installations (e.g. both Anaconda and a standalone Python; keep just one), or the Python interpreter in your PyCharm project is not configured correctly.


How to configure the Python interpreter in PyCharm?

  1. Select File >>> Settings >>> Project >>> Python Interpreter
  2. Click the gear icon, choose Add, and add the path of your Python installation.

How to install plugins in PyCharm?

  1. Select File >>> Settings >>> Plugins
  2. Click Marketplace and enter the name of the plugin you want; for example, for a translation plugin, search for "translation".
  3. Select the matching plugin and click Install.
  4. Once installation succeeds, PyCharm offers to restart; click OK, and the plugin takes effect after the restart.

A crawler script basically comes down to four steps:

  1. Send a request
    Determine the request URL, then use Python code to simulate a browser and send a request to that address.

  2. Get the data
    Receive the response data returned by the server.

  3. Parse the data
    Extract the pieces of content we actually want.

  4. Save the data
    Write the extracted data out, here to a CSV file.



The complete code:

import requests  # data request module
import parsel  # data parsing module
import csv  # for saving the CSV file

f = open('top250最终版本03.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=[
    '电影名',
    # '导演',
    # '主演',
    '演员信息',
    '年份',
    '国家',
    '电影类型',
    '评分',
    '评论量',
    '简介',
    '详情页',
])
csv_writer.writeheader()
# 1. Send the request
for page in range(0, 250, 25):
    url = f'https://movie.douban.com/top250?start={page}&filter='
    # Request headers: a dict of key-value pairs. For sites with little
    # anti-scraping protection, the request may work without headers at all.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.41 Safari/537.36'
    }
    # Send the request
    response = requests.get(url=url, headers=headers)  # <Response [200]>  status code 200 means success
    # 2. Get the text data of the response object
    # print(response.text)  # a string
    # 3. Parse the data and extract the content we want. To parse (extract
    # from) the raw string directly, only re regular expressions would work;
    # parsel instead turns the HTML string into a selectable object.
    selector = parsel.Selector(response.text)
    # print(selector)   <Selector xpath=None data='<html lang="zh-CN" class="ua-windows ...'>
    # CSS selectors: extract data based on tag names and attributes
    lis = selector.css('.grid_view li')  # first extraction: all li tags, returned as a list
    for li in lis:  # iterate over the list elements one by one
        try:
            # span:nth-child(1) selects the first span tag
            # span:nth-child(1)::text gets the text inside the first span tag
            # span::text also works; get() returns the first match, getall() returns all matches
            title = li.css('.hd a span::text').get()  # movie title
            info_list = li.css('.bd p:nth-child(1)::text').getall()
            # strip() removes surrounding whitespace; split() splits a string
            # into a list; replace() substitutes substrings
            # actor_list = info_list[0].strip().split('   ')
            # director = actor_list[0].replace('导演: ', '')  # director
            # actor = actor_list[1].replace('主演: ', '').replace('/...', '')  # lead actors
            actor_list = info_list[0].strip().replace('导演: ', '').replace('主演: ', '')  # cast info
            info = info_list[1].strip().split(' / ')
            date = info[0]  # year
            country = info[1]  # country
            movie_types = info[2]  # genres
            score = li.css('.rating_num::text').get()  # rating
            comment = li.css('.star span:nth-child(4)::text').get().replace('人评价', '')  # number of ratings
            summary = li.css('.inq::text').get()  # one-line summary
            href = li.css('.hd a::attr(href)').get()  # detail page URL
            dit = {
                '电影名': title,
                # '导演': director,
                # '主演': actor,
                '演员信息': actor_list,
                '年份': date,
                '国家': country,
                '电影类型': movie_types,
                '评分': score,
                '评论量': comment,
                '简介': summary,
                '详情页': href,
            }
            csv_writer.writerow(dit)
            print(title, actor_list, date, country, movie_types, score, comment, summary, href, sep=' | ')
        except Exception as e:
            print(e)
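After the script runs, the saved CSV can be read back with the standard library's csv.DictReader. This sketch writes one sample row the same way the script does and reads it straight back; it uses an in-memory buffer so it runs standalone, whereas with the real file you would use open('top250最终版本03.csv', encoding='utf-8') instead:

```python
import csv
import io

# Write one sample row exactly the way the script above does
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['电影名', '评分'])
writer.writeheader()
writer.writerow({'电影名': '肖申克的救赎', '评分': '9.7'})

# Read it back: DictReader yields each row as a dict keyed by the header
buf.seek(0)
for row in csv.DictReader(buf):
    print(row['电影名'], row['评分'])  # 肖申克的救赎 9.7
```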

Today's article ends here~

I'm Red Panda, see you in the next article (✿◡‿◡)




Origin blog.csdn.net/m0_67575344/article/details/127091904