Foreword
As the saying goes: is it easy for guys in our line of work to find a partner?
The New Year holiday is almost over, so it's time to find a girlfriend.
I found a website that may be worth a look.
Even if you don't need this little perk for singles, you can still use it to learn how to collect this kind of data.
Environment and Modules
Development environment
- Python 3.8
- PyCharm
Modules used
- parsel (third-party) --> pip install parsel
- requests (third-party) --> pip install requests
- csv (standard library)
- re (standard library)
Installing modules
If you have not installed the modules yet:
press Win + R, enter cmd, and run the installation command pip install <module name>. If the download is slow, switch to a domestic mirror source.
Module installation problems
To install a third-party Python module:
1. Press Win + R, enter cmd, click OK, type the installation command pip install <module name> (for example, pip install requests), and press Enter
2. Or click Terminal in PyCharm and enter the same installation command
Reasons for installation failure:
- Failure 1: pip is not recognized as an internal command
  Solution: add Python and its Scripts folder to the PATH environment variable
- Failure 2: a lot of red text appears (read timed out)
  Solution: the network connection timed out, so switch to a mirror source
  Tsinghua University: https://pypi.tuna.tsinghua.edu.cn/simple
  Alibaba Cloud: https://mirrors.aliyun.com/pypi/simple/
  University of Science and Technology of China: https://pypi.mirrors.ustc.edu.cn/simple/
  Huazhong University of Science and Technology: https://pypi.hustunique.com/
  Shandong University of Technology: https://pypi.sdutlinux.org/
  Douban: https://pypi.douban.com/simple/
  Example: pip3 install -i https://pypi.doubanio.com/simple/ <module name>
- Failure 3: cmd shows that the module is already installed, or that the installation succeeded, but it still cannot be imported in PyCharm
  Solution: the Python interpreter is not set up in PyCharm
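Failures 2 and 3 can often be avoided together by running pip through the interpreter you actually use and pointing it at a mirror. The following is only a minimal sketch: build_pip_command and install are hypothetical helper names, and the Tsinghua mirror is just one choice from the list above.

```python
import subprocess
import sys

# Mirror URL taken from the list above; any of the listed mirrors can be substituted.
TSINGHUA_MIRROR = "https://pypi.tuna.tsinghua.edu.cn/simple"


def build_pip_command(module_name, index_url=TSINGHUA_MIRROR):
    """Return the pip command that installs module_name from the given mirror."""
    # sys.executable is the interpreter running this script, so the module is
    # installed for that interpreter rather than for another Python on PATH.
    return [sys.executable, "-m", "pip", "install", "-i", index_url, module_name]


def install(module_name, index_url=TSINGHUA_MIRROR):
    """Run the installation and raise CalledProcessError if pip fails."""
    subprocess.run(build_pip_command(module_name, index_url), check=True)


if __name__ == "__main__":
    # Only print the command here; call install("requests") to really run it.
    print(" ".join(build_pip_command("requests")))
```

Run the script with the same interpreter that PyCharm is configured to use, so the module ends up where the project can import it.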
How do you configure the Python interpreter in PyCharm?
- Select File >>> Settings >>> Project >>> Python Interpreter
- Click the gear icon and select Add
- Add the path of your Python installation
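To check whether the interpreter is set up correctly, you can run a small script inside PyCharm. This is a sketch that uses only the standard library; missing_modules is a hypothetical helper name.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the names that the current interpreter cannot import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # The path printed here should match the interpreter selected in PyCharm.
    print("Interpreter:", sys.executable)
    print("Missing:", missing_modules(["requests", "parsel", "csv", "re"]))
```

If the printed interpreter path is not the one you installed the modules into, the modules will show up as missing even though pip reported success in cmd.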
How do you install plugins in PyCharm?
- Select File >>> Settings >>> Plugins
- Click Marketplace and enter the name of the plugin you want to install. For example, enter translation for a translation plugin, or Chinese for the Chinese language pack
- Select the corresponding plugin and click Install
- After the installation succeeds, PyCharm offers to restart; click OK, and the plugin takes effect after the restart
Basic approach
1. Data source analysis:
- Clarify the requirement: what data are we collecting, and where does it come from?
  The profile data is in the page source of each details page <static web page>
- As long as we obtain every details page ID, we can collect all of the profile data
  The list data contains the details page ID of every girl
2. Code implementation steps:
- Send a request
- Get the data
- Parse the data
- Save the data
Get all details page IDs:
- Send a request: simulate a browser and request the URL
- Get the data: read the response returned by the server (Developer Tools --> response)
- Parse the data: extract the content we want, the details page ID --> UID
Get the details page information:
- Send a request: simulate a browser and request the URL of the profile details page
- Get the data: read the response returned by the server, which is the web page source code
- Parse the data: extract the basic profile information we want
- Save the data locally: write the basic profile information to a CSV file and save the photos to a local folder
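The two-stage flow above can be sketched as a pair of nested loops. This is only an outline under assumptions: collect_profiles, fetch_uids, and fetch_profile are hypothetical names, and the two fetch functions are passed in so the flow can be tried with fake data before any real request is written.

```python
def collect_profiles(fetch_uids, fetch_profile, pages):
    """Two-stage crawl: list pages give UIDs, each UID gives one profile.

    fetch_uids(page) -> list of UIDs found on that list page
    fetch_profile(uid) -> dict with the profile's fields
    Both callables are hypothetical placeholders for the request and
    parsing code; they are injected so the flow can be tried offline.
    """
    profiles = []
    for page in pages:
        for uid in fetch_uids(page):
            profiles.append(fetch_profile(uid))
    return profiles


if __name__ == "__main__":
    # Fake data standing in for the real responses.
    fake_lists = {1: [101, 102], 2: [103]}
    result = collect_profiles(
        fetch_uids=lambda page: fake_lists[page],
        fetch_profile=lambda uid: {"uid": uid},
        pages=[1, 2],
    )
    print(result)
```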
Implementation code
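# Import the HTTP request module
import requests
# Import the HTML parsing module
import parsel
# Import the csv module
import csv
# Import the regular expression module
import re

# Open the CSV file in append mode; newline='' prevents blank lines between rows
f = open('data.csv', mode='a', encoding='utf-8', newline='')
# Column names, in order: nickname, gender, age, height, weight, date of birth,
# zodiac animal, constellation, native place, location, education, marital status,
# occupation, annual income, housing, car, photo, details page
csv_writer = csv.DictWriter(f, fieldnames=[
    '昵称',
    '性别',
    '年龄',
    '身高',
    '体重',
    '出生日期',
    '生肖',
    '星座',
    '籍贯',
    '所在地',
    '学历',
    '婚姻状况',
    '职业',
    '年收入',
    '住房',
    '车辆',
    '照片',
    '详情页',
])
csv_writer.writeheader()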
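Because the file is opened with mode='a', calling writeheader() on every run appends another header row to data.csv. One possible way around this, shown here only as a sketch, is to write the header only when the file is new or empty. open_csv_writer is a hypothetical helper name.

```python
import csv
import os


def open_csv_writer(path, fieldnames):
    """Open path in append mode and write the header only for a new or empty file."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    f = open(path, mode="a", encoding="utf-8", newline="")
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    if needs_header:
        writer.writeheader()
    # The caller is responsible for closing f when all rows are written.
    return f, writer
```

It would be used in place of the open() and csv.DictWriter() calls, for example f, csv_writer = open_csv_writer('data.csv', [...]) with the same list of field names.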
1. Send a request: simulate a browser and request the URL
- The headers dictionary imitates a browser's request headers; they can be copied from the developer tools and help avoid anti-crawling checks
- <Response [200]>: status code 200 on the response object means the request succeeded
for page in range(1, 11):
    # Request URL of the list page
    url = f'https://********.com/valueApp/api/love/searchLoveUser?page={page}&perPage=12&sex=0'
    # Disguise the request as a browser
    headers = {
        # User-Agent: identifies the browser to the server
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36'
    }
    # Send the request
    response = requests.get(url=url, headers=headers)
    print(response)
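The query string of the list URL carries three parameters: page, perPage, and sex. As a small illustration of how the page number changes the URL, here is a sketch that builds it with the standard library. The host example.com is a placeholder because the real host is masked above, and build_search_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Hypothetical placeholder; the real host is masked in the listing above.
BASE_URL = "https://example.com/valueApp/api/love/searchLoveUser"


def build_search_url(page, per_page=12, sex=0, base_url=BASE_URL):
    """Return the list-page URL for one page of search results."""
    query = urlencode({"page": page, "perPage": per_page, "sex": sex})
    return f"{base_url}?{query}"


if __name__ == "__main__":
    # Same range as the loop above: pages 1 to 10.
    for page in range(1, 11):
        print(build_search_url(page))
```

requests.get also accepts a params dictionary and builds the query string itself, so the same parameters could be passed that way instead of being written into the URL.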
2. Get the data: read the response returned by the server
Developer Tools --> response
response.json() returns the response's JSON data as a dictionary
3. Parse the data and extract the content we want
Details page ID ---> UID
The data we received is a dictionary,
so we parse it by key: use the content to the left of the colon [key] to extract the content to the right of the colon [value]
    # Loop over the list and take out its elements one by one
    for index in response.json()['data']['items']:
        # Example: https://love.19lou.com/detail/51593564; the link is built with an f-string
        link = f'https://****.com/detail/{index["uid"]}'
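To make the key lookup concrete, here is a sketch that runs on a made-up payload shaped like the response described above (data --> items --> uid). The sample values and the helper name extract_uids are assumptions; real items carry more fields.

```python
import json

# Made-up sample shaped like the response described above: data -> items -> uid.
SAMPLE_RESPONSE = '{"data": {"items": [{"uid": 111}, {"uid": 222}, {"nick": "no uid"}]}}'


def extract_uids(payload):
    """Return every uid found under payload['data']['items'], skipping items without one."""
    items = payload.get("data", {}).get("items", [])
    return [item["uid"] for item in items if "uid" in item]


if __name__ == "__main__":
    payload = json.loads(SAMPLE_RESPONSE)  # stands in for response.json()
    for uid in extract_uids(payload):
        print(f"https://example.com/detail/{uid}")  # placeholder host
```

Using .get() with defaults means an unexpected or empty response yields an empty list instead of a KeyError.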
4. Send a request: simulate a browser and request the URL of the profile details page
https://love.19lou.com/detail/51593564 is an example of a details page URL
5. Get the data: read the response returned by the server, here the web page source code
- response.text returns the response body as a string
- response.json() returns the response's JSON data as a dictionary
        html_data = requests.get(url=link, headers=headers).text
6. Parse the data and extract the content we want
Basic information
- CSS selector: extracts data by tag attributes
- XPath: extracts data by tag nodes
- re: regular expressions
In the developer tools, find the tag that holds the data, then just choose Copy to copy its selector
Convert the HTML string <html_data> into a parsable Selector object
        selector = parsel.Selector(html_data)
        name = selector.css('.username::text').get()
        info_list = selector.css('.info-tag::text').getall()
A . after an object calls one of its methods or accesses one of its attributes
        gender = info_list[0].split(':')[-1]
        age = info_list[1].split(':')[-1]
        height = info_list[2].split(':')[-1]
        date = info_list[-1].split(':')[-1]
Check the number of elements in info_list: when there are only 4, the profile has no weight field
        if len(info_list) == 4:
            weight = '0kg'
        else:
            weight = info_list[3].split(':')[-1]
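Reading fields by position breaks as soon as one field is missing, which is why the weight needs a special case. An alternative, sketched here under assumptions, is to turn the label:value strings into a dictionary and look fields up by label. The sample labels are made up, parse_labeled_fields is a hypothetical helper name, and the pattern accepts both the ASCII colon and the full-width colon because the real page may use either.

```python
import re


def parse_labeled_fields(texts):
    """Turn strings such as 'label:value' into a {label: value} dictionary."""
    fields = {}
    for text in texts:
        # Split on the first ASCII or full-width colon; skip strings without one.
        parts = re.split(r"[:：]", text.strip(), maxsplit=1)
        if len(parts) == 2:
            fields[parts[0].strip()] = parts[1].strip()
    return fields


if __name__ == "__main__":
    # Made-up sample; the labels on the real page may differ.
    sample = ["gender:female", "age:25", "height:160cm", "birthday:1998-01-01"]
    fields = parse_labeled_fields(sample)
    weight = fields.get("weight", "0kg")  # same default as the code above
    print(fields["age"], weight)
```

Note that this splits on the first colon, whereas split(':')[-1] keeps only the part after the last colon, so the two differ for values that themselves contain a colon.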
        info_list_1 = selector.css('.basic-item span::text').getall()[2:]
        zodiac = info_list_1[0].split(':')[-1]
        constellation = info_list_1[1].split(':')[-1]
        nativePlace = info_list_1[2].split(':')[-1]
        location = info_list_1[3].split(':')[-1]
        edu = info_list_1[4].split(':')[-1]
        maritalStatus = info_list_1[5].split(':')[-1]
        job = info_list_1[6].split(':')[-1]
        money = info_list_1[7].split(':')[-1]
        house = info_list_1[8].split(':')[-1]
        car = info_list_1[9].split(':')[-1]
        img_url = selector.css('.page .left-detail .abstract .avatar img::attr(src)').get()
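The listing later prints a variable named dit without showing how it is built. A plausible reading is that dit is the row dictionary, keyed by the CSV field names, that gets passed to csv_writer.writerow(). The sketch below illustrates that idea only; build_row is a hypothetical helper, and the shortened English column names stand in for the Chinese field names used above.

```python
import csv
import io

# Shortened English stand-ins for the column names; the listing above uses
# the Chinese field names passed to csv.DictWriter.
FIELDNAMES = ["nickname", "gender", "age", "photo", "detail_page"]


def build_row(fieldnames, values):
    """Pair each column name with its value, refusing mismatched lengths."""
    if len(fieldnames) != len(values):
        raise ValueError("expected one value per column")
    return dict(zip(fieldnames, values))


if __name__ == "__main__":
    buffer = io.StringIO()  # in-memory file so the demo leaves nothing on disk
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    dit = build_row(
        FIELDNAMES,
        ["demo", "female", "25", "demo.jpg", "https://example.com/detail/1"],
    )
    writer.writerow(dit)
    print(buffer.getvalue())
```

csv.DictWriter raises ValueError if the dictionary contains a key that is not in fieldnames, so the keys of the row must match the header exactly.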
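7. Save the photo: request the image URL and get its binary data
        img_content = requests.get(url=img_url, headers=headers).content
        # new_name must hold a file-name-safe version of the nickname, and the data folder must already exist
        with open('data\\' + new_name + '.jpg', mode='wb') as img:
            img.write(img_content)
        # dit must hold the row dictionary for this profile, keyed by the CSV field names
        print(dit)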
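The save step uses new_name, which the listing does not define. One way to obtain a usable value, assuming new_name is meant to be the nickname with characters that Windows forbids in file names removed, is sketched below. safe_file_name and photo_path are hypothetical helper names.

```python
import os
import re


def safe_file_name(name, fallback="unnamed"):
    """Remove characters that Windows does not allow in file names."""
    # name may be None because .get() returns None when nothing matches.
    cleaned = re.sub(r'[\\/:*?"<>|]', "", name or "").strip()
    return cleaned or fallback


def photo_path(folder, nickname):
    """Create folder if needed and return the .jpg path for nickname."""
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, safe_file_name(nickname) + ".jpg")
```

With these helpers, the open() call could use photo_path('data', name) instead of concatenating the path by hand, which also creates the data folder on the first run.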
Result
If you like, you can also build charts from the collected data yourself.