Foreword
Recently the company needed purchase data for some of its users, for example which gift packages brought in the largest total amount.
But I found that the website does not offer any way to download it (that is what the operations staff told me, so don't blame me if it's wrong ┓(;´_`)┏).
Each page shows only ten records, and turning the page does not jump to a new URL, so simply fetching the web page is definitely not enough.
Then I wrote a crawler to fetch the table, but it kept failing; the site seems to have fairly strong anti-crawler measures...
I searched online and found the selenium library. I had not used it before and it looked quite interesting, so I decided to go with it.
Preparation
Some basic knowledge is still necessary; you can read an expert's blog post on selenium first.
Next, install the browser driver. Here we use Edge, which everyone should have.
Check your Edge version first, find the matching driver, download it, unzip it, put it in the Python directory, and add that path to the environment variables (My Computer >> right-click Properties >> Advanced System Settings >> Advanced >> Environment Variables >> System Variables >> Path).
But when we run the program, it may report an error.
In fact, you only need to rename the driver file to the name given in the error message.
If the selenium library is not installed yet, install it (for example with pip install selenium).
That completes the preparation.
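If you want to confirm that the driver really is reachable through Path before starting, a small check like the one below may help. It is only a sketch: the default name msedgedriver is my assumption for the Edge driver file, so use whatever name your own error message asks for, and find_driver is just a helper name made up for this example.

```python
import shutil


def find_driver(name="msedgedriver", search_path=None):
    """Return the full path of the driver executable, or None if it is not found.

    name: driver file name (assumed default; adjust to your setup).
    search_path: optional directory list to search instead of the PATH variable.
    """
    return shutil.which(name, path=search_path)


# Quick check against the current PATH
location = find_driver()
print(location if location else "driver not found on PATH")
```

If this prints "driver not found on PATH", recheck the environment variable step (and reopen the terminal or IDE so it picks up the change).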
1. Open the web page
Selenium works by simulating user actions, so we start by opening the browser.
First we import the libraries and set the pandas display options.
from selenium import webdriver
import time
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 10)
Then open the browser (Edge here, which everyone should have) and go straight to the website.
driver = webdriver.Edge()
driver.get('https://me.modian.com/u/user_index')
time.sleep(1)
Note that a browser opened this way looks a little different from one you open normally.
2. Log in
Since everything is automated, we have to log in with an account and password. Looking at where this button sits in the page,
we find that we can locate the parent element by its class, then step down to the h3, and finally to the span, which we click.
The blog post mentioned earlier explains this part, so you can read it first.
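To see how a path of the form "class, then h3, then span" walks down the page, here is a small offline sketch using only the standard library. The HTML fragment and the button text are invented for illustration (the real page will differ), and ElementTree only supports a subset of XPath, whereas Selenium hands the expression to the browser's full XPath engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; only the class/h3/span nesting matters here.
SAMPLE_HTML = """
<div>
  <div class="login_field">
    <h3><span>Log in with password</span></h3>
  </div>
  <div class="other_field">
    <h3><span>Something else</span></h3>
  </div>
</div>
"""


def find_login_switch_text(html):
    """Find the span under h3 under the element whose class is login_field."""
    root = ET.fromstring(html.strip())
    span = root.find(".//*[@class='login_field']/h3/span")
    return None if span is None else span.text


print(find_login_switch_text(SAMPLE_HTML))
```

Only the span inside the login_field block is matched; the span in the other block is skipped because the class condition fails one level up.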
The next step is to locate the element and click it.
There is a pause here because we have to wait for the browser to respond; the same applies later on.
# Selenium 4.3+ removed find_element_by_xpath; "xpath" is the value of By.XPATH
button1 = driver.find_element("xpath", "//*[@class='login_field']/h3/span")
time.sleep(1)
button1.click()
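A fixed time.sleep either wastes time or is too short on a slow network. One alternative is to poll until the thing you are waiting for shows up. Selenium ships its own WebDriverWait for this; the sketch below is not that class, just a plain-Python illustration of the same idea, and wait_for is a name I made up.

```python
import time


def wait_for(probe, timeout=10.0, interval=0.5):
    """Call probe() repeatedly until it returns something truthy.

    Returns the truthy result, or raises TimeoutError once `timeout`
    seconds have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = probe()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Hypothetical usage with a Selenium driver (find_elements returns an
# empty list, which is falsy, while nothing matches yet):
# spans = wait_for(lambda: driver.find_elements("xpath", "//*[@class='login_field']/h3/span"))
# spans[0].click()
```

The probe should use find_elements (plural), because find_element raises an exception instead of returning an empty result when nothing matches.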
Next we find the two input boxes and then log in, so there are three elements to locate.
Again we locate the parent by class and step down to the child, then simulate two inputs and one click (anyone who runs this directly without changing the account and password, get out of here Ψ( ̄∀ ̄)Ψ)
time.sleep(1)
driver.find_element("xpath", "//*[@class='phone']/input").send_keys('your_account')
driver.find_element("xpath", "//*[@class='password']/input").send_keys('your_password')
button2 = driver.find_element("xpath", "//*[@class='loginBtn hover other_input']")
button2.click()
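If you would rather not leave the account and password in the script, one option is to read them from environment variables. This is only a sketch: the variable names MODIAN_ACCOUNT and MODIAN_PASSWORD and the helper load_credentials are my own inventions, not something the site or selenium defines.

```python
import os


def load_credentials(env=None):
    """Read the login account and password from environment variables.

    env: mapping to read from; defaults to os.environ.
    The variable names are hypothetical, pick whatever you like.
    """
    env = os.environ if env is None else env
    account = env.get("MODIAN_ACCOUNT")
    password = env.get("MODIAN_PASSWORD")
    if not account or not password:
        raise RuntimeError("Set MODIAN_ACCOUNT and MODIAN_PASSWORD before running")
    return account, password


# Hypothetical usage:
# account, password = load_credentials()
# ...then pass `account` and `password` to the two send_keys calls.
```

This also fails loudly if you forget to fill in your own credentials, instead of typing a placeholder into the login form.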
Now we are on the account home page, but the customer information is not here; it is behind the link inside the element with id user_manage,
so we need one more click.
time.sleep(1)
driver.find_element("xpath", "//*[@id='user_manage']/a").click()
time.sleep(1)
OK, we have finally reached the data page. Perfect! (* ^ - ^ *)ゞ
Reading and saving the data
First build a DataFrame to hold the data (the 金额 column is the amount).
df = pd.DataFrame(columns = ['id','name','金额'])
num=0
Then look at where the data sits in the page. We can read all the table rows by their class, and then split each row's text into fields.
# find_elements (plural) returns a list of every matching row
userslist = driver.find_elements("xpath", "//tr[@class='el-table__row']")
for user in userslist:
    data = user.text.split('\n')
    print(data)
Since this is company data it is heavily censored, but you can still see that each customer has 7 fields.
The next step is to pick out the fields we want to save, replacing print(data) with the code below.
Note that we only want the rows whose group is 成交客户 (customers with a completed deal), so add a check.
    if len(data) == 7:
        # "成交客户" is the group label for customers with a completed deal
        if data[2] == "成交客户":
            df.loc[num, 'id'] = data[3]
            df.loc[num, 'name'] = data[0]
            df.loc[num, '金额'] = data[4]
            num = num + 1
However, you may run into a problem at run time, for a different reason.
In the example just now, every customer had all fields filled in. What if one of them is blank?
Then there are only six fields, so the original indexes are off by one. Add another check:
    if len(data) == 6:
        if data[1] == "成交客户":
            df.loc[num, 'id'] = data[2]
            df.loc[num, 'name'] = data[0]
            df.loc[num, '金额'] = data[3]
            num = num + 1
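The two checks above can also be folded into one small function, which makes the index shift easier to see and to test without a browser. This is just a sketch of the same logic; parse_row is a name I made up, and the sample rows in the usage lines are invented, since the real data is censored.

```python
TARGET_GROUP = "成交客户"  # group label for customers with a completed deal


def parse_row(text):
    """Turn one table row's text into a record, or None if the row is skipped.

    A full row has 7 fields; when the second field is blank only 6 remain
    and every later index shifts down by one.
    """
    fields = text.split("\n")
    if len(fields) == 7:
        group, user_id, amount = fields[2], fields[3], fields[4]
    elif len(fields) == 6:
        group, user_id, amount = fields[1], fields[2], fields[3]
    else:
        return None
    if group != TARGET_GROUP:
        return None
    return {"id": user_id, "name": fields[0], "金额": amount}


# Invented sample rows, for illustration only
full_row = "Alice\nsome note\n成交客户\n1001\n88.00\n2022-02-01\nedit"
short_row = "Bob\n成交客户\n1002\n12.50\n2022-02-02\nedit"
print(parse_row(full_row))
print(parse_row(short_row))
```

Inside the loop you would call parse_row(user.text) and, when the result is not None, write its three values into df.loc[num] as before.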
This gives us the data for one page. Of course we still have to turn the page, and the method is the same as before: find the button and click it!
driver.find_element("xpath", "//*[@class='el-icon el-icon-arrow-right']").click()
#print(df[-10:-1])
time.sleep(2)
This way you can read page by page. The wait above can be shortened if your network is fast enough; if it is too short, the same page gets read twice.
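As a safety net against reading the same page twice, you could drop repeated records afterwards. The sketch below assumes that each customer id appears only once in the whole list, which I have not verified for this site; dedupe_by_id is a made-up helper name. On a DataFrame, pandas' drop_duplicates(subset='id') does the same job.

```python
def dedupe_by_id(records):
    """Keep the first record for each id and drop later repeats, preserving order."""
    seen = set()
    unique = []
    for record in records:
        if record["id"] in seen:
            continue
        seen.add(record["id"])
        unique.append(record)
    return unique


# Invented example: the second page was read twice
rows = [
    {"id": "1001", "name": "Alice", "金额": "88.00"},
    {"id": "1002", "name": "Bob", "金额": "12.50"},
    {"id": "1002", "name": "Bob", "金额": "12.50"},
]
print(len(dedupe_by_id(rows)))
```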
What remains is repetition, so wrap it in a for loop:
for i in range(1):  # set this to your number of pages! (don't ask me why I only have one page of data)
    userslist = driver.find_elements("xpath", "//tr[@class='el-table__row']")
    for user in userslist:
        data = user.text.split('\n')
        if len(data) == 7:
            if data[2] == "成交客户":
                df.loc[num, 'id'] = data[3]
                df.loc[num, 'name'] = data[0]
                df.loc[num, '金额'] = data[4]
                num = num + 1
        elif len(data) == 6:
            if data[1] == "成交客户":
                df.loc[num, 'id'] = data[2]
                df.loc[num, 'name'] = data[0]
                df.loc[num, '金额'] = data[3]
                num = num + 1
    # Go to the next page and give it time to load
    driver.find_element("xpath", "//*[@class='el-icon el-icon-arrow-right']").click()
    #print(df[-10:-1])
    time.sleep(2)
Finally, save the result.
df.to_excel(r'D:\allmoney.xlsx',index = False)
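Note that to_excel needs an Excel writer engine such as openpyxl to be installed. If you do not have one, writing a CSV file with the standard library is a possible fallback. The sketch below assumes the records are plain dicts with the same three columns; save_csv is a made-up helper, and the utf-8-sig encoding is there so that Excel recognizes the Chinese column name.

```python
import csv


def save_csv(records, path):
    """Write records (dicts with id, name, 金额) to a CSV file that Excel can open."""
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "name", "金额"])
        writer.writeheader()
        writer.writerows(records)


# Hypothetical usage with the DataFrame built above:
# save_csv(df.to_dict("records"), r"D:\allmoney.csv")
```

pandas also has df.to_csv, which would work just as well here if you prefer to stay with the DataFrame.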
Done! * \ (^ o ^) / *
Full code
# -*- coding: utf-8 -*-
"""
Created on Wed Feb 23 19:08:35 2022
@author: xyyl
"""
from selenium import webdriver
import time
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 10)
# Open the browser
driver = webdriver.Edge()
driver.get('https://me.modian.com/u/user_index')
time.sleep(1)
# Switch to account/password login
# (Selenium 4.3+ removed find_element_by_xpath; "xpath" is the value of By.XPATH)
button1 = driver.find_element("xpath", "//*[@class='login_field']/h3/span")
time.sleep(1)
button1.click()
# Enter the account and password
time.sleep(1)
driver.find_element("xpath", "//*[@class='phone']/input").send_keys('your_account')
driver.find_element("xpath", "//*[@class='password']/input").send_keys('your_password')
button2 = driver.find_element("xpath", "//*[@class='loginBtn hover other_input']")
button2.click()
# Go to the customer information page
time.sleep(1)
driver.find_element("xpath", "//*[@id='user_manage']/a").click()
time.sleep(1)
# Create the DataFrame (金额 = amount)
df = pd.DataFrame(columns = ['id','name','金额'])
num = 0
# Walk through the customer list
for i in range(1):  # set this to your number of pages! (don't ask me why I only have one page of data)
    userslist = driver.find_elements("xpath", "//tr[@class='el-table__row']")
    for user in userslist:
        data = user.text.split('\n')
        if len(data) == 7:
            # "成交客户" is the group label for customers with a completed deal
            if data[2] == "成交客户":
                df.loc[num, 'id'] = data[3]
                df.loc[num, 'name'] = data[0]
                df.loc[num, '金额'] = data[4]
                num = num + 1
        elif len(data) == 6:
            if data[1] == "成交客户":
                df.loc[num, 'id'] = data[2]
                df.loc[num, 'name'] = data[0]
                df.loc[num, '金额'] = data[3]
                num = num + 1
    # Go to the next page and give it time to load
    driver.find_element("xpath", "//*[@class='el-icon el-icon-arrow-right']").click()
    #print(df[-10:-1])
    time.sleep(2)
df.to_excel(r'D:\allmoney.xlsx', index = False)
It works well, except that it takes a long time to run when there is a lot of data. For a blogger like me, whose site is almost out of business, that is not much of a problem ( ̄ω ̄;)