Want this "single benefit"? Using Python to collect girls' profile data from a dating site

Foreword

As the saying goes, is it easy for the guys studying our major to find a partner?

It's almost the end of the year, so it's about time to find a girlfriend.

I found a website here that you might want to take a look at.

Even if you don't need this wave of "single benefits", you can still learn how to collect this kind of data.


Environment and Modules

Development environment

  • Python 3.8
  • Pycharm

Modules used

import parsel       --> pip install parsel
import requests     --> pip install requests
import csv
import re

Installing modules

If you have not installed a module yet:

Press Win + R, type cmd, and run the installation command pip install <module name> (if the download speed is slow, you can switch to a domestic mirror source).

Module installation problems:

  • How to install a third-party Python module:
    1. Press Win + R, type cmd, click OK, enter the installation command pip install <module name> (e.g. pip install requests) and press Enter
    2. Or click Terminal in PyCharm and enter the installation command there

  • Common reasons for installation failure:

  • Failure 1: pip is not recognized as an internal command
    Solution: set the environment variables (add Python to PATH)

  • Failure 2: the output is full of red errors (read timed out)
    Solution: the network connection timed out, so switch to a domestic mirror source

Tsinghua University: https://pypi.tuna.tsinghua.edu.cn/simple
Alibaba Cloud: https://mirrors.aliyun.com/pypi/simple/
University of Science and Technology of China: https://pypi.mirrors.ustc.edu.cn/simple/
Huazhong University of Science and Technology: https://pypi.hustunique.com/
Shandong University of Technology: https://pypi.sdutlinux.org/
Douban: https://pypi.douban.com/simple/
Example: pip3 install -i https://pypi.doubanio.com/simple/ <module name>
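
If you would rather make a mirror the default instead of passing -i every time, newer versions of pip (10 and above) also support a one-off configuration, for example:

pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple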

  • Failure 3: cmd shows the module is already installed, or the installation succeeds, but it still cannot be imported in
    PyCharm. This means the Python interpreter is not set up correctly.

How to configure the Python interpreter in PyCharm?

  1. Select File >>> Settings >>> Project >>> Python Interpreter
  2. Click the gear icon and select Add
  3. Add the Python installation path

How to install plugins in PyCharm?

  1. Select File >>> Settings >>> Plugins
  2. Click Marketplace and enter the name of the plugin you want to install. For example: enter Translation for the translation plugin, or Chinese for the Chinese language pack
  3. Select the corresponding plugin and click Install
  4. After the installation succeeds, a prompt to restart PyCharm will pop up; click OK and the restart takes effect

Basic thought process

1. Data source analysis:

  1. Clarify the requirements:

What data are we collecting? —> The profile data <static web page>, which can be found directly in the page source code.


As long as we obtain all the IDs, we can collect all of the data.


The list-page response contains the url IDs (UIDs) of all the girls' detail pages.

2. Code implementation steps:

  • Send the request
  • Get the data
  • Parse the data
  • Save the data

Get all detail page IDs:

  1. Send a request: simulate the browser and send a request to the url address of the list page


  2. Get the data: get the response data returned by the server
     (Developer Tools —> Response)

  3. Parse the data and extract the content we want:
     the detail page ID —> UID, which is then used to get the detail page information

  4. Send a request: simulate the browser and send a request to the url address of the profile detail page

  5. Get the data: get the response data returned by the server (the web page source code)

  6. Parse the data and extract the basic information we want

  7. Save the data: save the basic information locally as a CSV file, and save the photos into a local folder
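
Putting these steps together, the overall flow looks roughly like this (an outline only, with placeholder URLs because the real domain is masked in this article; the full code follows in the next section):

import requests
import parsel

# placeholder URLs -- the real domain is masked in this article
LIST_URL = 'https://example.com/valueApp/api/love/searchLoveUser?page={page}&perPage=12&sex=0'
DETAIL_URL = 'https://example.com/detail/{uid}'
headers = {'User-Agent': 'Mozilla/5.0'}

for page in range(1, 11):
    # steps 1-3: request a list page, read the JSON, pull out the profile items
    items = requests.get(LIST_URL.format(page=page), headers=headers).json()['data']['items']
    for item in items:
        # steps 4-5: build the detail-page url from the UID and fetch its source code
        html = requests.get(DETAIL_URL.format(uid=item['uid']), headers=headers).text
        selector = parsel.Selector(html)
        # steps 6-7: extract the fields with CSS selectors, write a CSV row, save the photo
        ...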



Implementation code

Click here to get the complete source code

# import the data-request module
import requests
# import the data-parsing module
import parsel
# import the csv module
import csv
# import the regular-expression module
import re

f = open('data.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=['Nickname',
                                           'Gender',
                                           'Age',
                                           'Height',
                                           'Weight',
                                           'Date of birth',
                                           'Chinese zodiac',
                                           'Star sign',
                                           'Hometown',
                                           'Location',
                                           'Education',
                                           'Marital status',
                                           'Occupation',
                                           'Annual income',
                                           'Housing',
                                           'Car',
                                           'Photo',
                                           'Detail page',
                                           ])
csv_writer.writeheader()

1. Send a request: simulate the browser and send a request to the url address

  • Simulate the browser with request headers: they can be copied and pasted from the developer tools, and they help get past basic anti-crawling checks
  • <Response [200]>: a response object whose status code is 200 means the request was successful

for page in range(1, 11):
    # request url (domain masked in the original article)
    url = f'https://********.com/valueApp/api/love/searchLoveUser?page={page}&perPage=12&sex=0'
    # pretend to be a browser
    headers = {
        # User-Agent: basic browser information
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36'
    }
    # send the request
    response = requests.get(url=url, headers=headers)
    print(response)
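
As a small aside (not part of the original script), requests can also check the status explicitly instead of just printing the response object:

# 200 means success; raise_for_status() raises an exception for 4xx/5xx responses
print(response.status_code)
response.raise_for_status()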

2. Get the data: get the response data returned by the server

Developer Tools —> Response

response.json() gets the response JSON data as a Python dictionary
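
If you want to see the structure before writing the loop, you can print the keys first (the 'data' and 'items' keys here are the same ones used in the next step):

json_data = response.json()
print(list(json_data.keys()))            # 'data' should be among the keys
print(len(json_data['data']['items']))   # number of profiles on this page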

3. Parse the data and extract the content of the data we want

    Detail page ID ---> UID
    Since the data we get is a dictionary,
    we parse it with key-value access ---> use the content on the left of the colon (the key) to extract the content on the right (the value)
# loop through the list and take the elements out one by one
for index in response.json()['data']['items']:
    # e.g. https://love.19lou.com/detail/51593564  -- f-string formatting
    link = f'https://****.com/detail/{index["uid"]}'

4. Send a request: simulate the browser and send a request to the url address of the profile detail page

        https://love.19lou.com/detail/51593564  (profile detail page url address)

5. Get the data: get the response data returned by the server (the web page source code)

  • response.text gets the response text data and returns a string
  • response.json() gets the response JSON data and returns a dictionary

html_data = requests.get(url=link, headers=headers).text

6. Parse the data and extract the data content we want

Basic information

CSS selector: extract data by tag attributes
XPath: extract data by tag nodes
re: extract data with regular expressions

  1. In the developer tools, find the tag that contains the data
  2. Select it and copy the selector
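
To see how the three approaches differ, here is a toy comparison on made-up HTML (not the site's real markup):

sample = '<div class="username">Alice</div><span class="info-tag">Age: 25</span>'
sel = parsel.Selector(sample)
print(sel.css('.username::text').get())                      # CSS selector -> 'Alice'
print(sel.xpath('//span[@class="info-tag"]/text()').get())   # XPath -> 'Age: 25'
print(sel.re(r'Age: (\d+)'))                                 # regular expression -> ['25']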

Convert the acquired html string data <html_data> into a parsable object

selector = parsel.Selector(html_data)
name = selector.css('.username::text').get()
info_list = selector.css('.info-tag::text').getall()

The dot ( . ) means calling a method or attribute of the object.

gender = info_list[0].split(':')[-1]
age = info_list[1].split(':')[-1]
height = info_list[2].split(':')[-1]
date = info_list[-1].split(':')[-1]

Check the number of elements in info_list: if there are only 4 elements, the profile has no weight field.

if len(info_list) == 4:
    weight = '0kg'
else:
    weight = info_list[3].split(':')[-1]
# the remaining basic-information fields (skip the first two spans)
info_list_1 = selector.css('.basic-item span::text').getall()[2:]
zodiac = info_list_1[0].split(':')[-1]
constellation = info_list_1[1].split(':')[-1]
nativePlace = info_list_1[2].split(':')[-1]
location = info_list_1[3].split(':')[-1]
edu = info_list_1[4].split(':')[-1]
maritalStatus = info_list_1[5].split(':')[-1]
job = info_list_1[6].split(':')[-1]
money = info_list_1[7].split(':')[-1]
house = info_list_1[8].split(':')[-1]
car = info_list_1[9].split(':')[-1]
img_url = selector.css('.page .left-detail .abstract .avatar img::attr(src)').get()
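
The snippets shown here stop at the parsing step, but the full source also needs the row dictionary (printed as dit below) and a safe file name (new_name) for the photo. A minimal sketch of that step, with keys matching the CSV header defined earlier (the exact code in the complete source may differ):

# build one row of data, keyed by the CSV field names
dit = {
    'Nickname': name,
    'Gender': gender,
    'Age': age,
    'Height': height,
    'Weight': weight,
    'Date of birth': date,
    'Chinese zodiac': zodiac,
    'Star sign': constellation,
    'Hometown': nativePlace,
    'Location': location,
    'Education': edu,
    'Marital status': maritalStatus,
    'Occupation': job,
    'Annual income': money,
    'Housing': house,
    'Car': car,
    'Photo': img_url,
    'Detail page': link,
}
# write the row into data.csv
csv_writer.writerow(dit)
# strip characters that Windows does not allow in file names before using the nickname as a file name
new_name = re.sub(r'[\\/:*?"<>|\n]', '', name)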

7. Save the photo: request the image url and get its binary data

# request the image url and take the raw binary content
img_content = requests.get(url=img_url, headers=headers).content
# new_name is the cleaned-up nickname used as the file name (see the sketch above)
with open('data\\' + new_name + '.jpg', mode='wb') as img:
    img.write(img_content)
print(dit)

Effect

Readers who are up for it can also build their own visualization charts from data.csv, as sketched below.
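
A minimal sketch with pandas and matplotlib (neither library is part of this tutorial, so install them first with pip install pandas matplotlib; the column name 'Age' matches the CSV header above):

import pandas as pd
import matplotlib.pyplot as plt

# read the scraped data and plot a simple age distribution
df = pd.read_csv('data.csv')
df['Age'].value_counts().sort_index().plot(kind='bar')
plt.xlabel('Age')
plt.ylabel('Number of profiles')
plt.show()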


Finally

Xiaoyuan also recommends a case tutorial here for friends starting from zero. If you are interested, take a look; if you need the source code, you can also click the business card below to get it~

[Python case teaching] The most suitable practical cases for learning from zero: practice hands-on and become the next Python master


Original article: blog.csdn.net/yxczsz/article/details/128792917