Scraping customer purchase data with Python and Selenium (front-end analysis + crawler code)

Foreword

Recently my company needed some users' purchase data, for example which reward packages had the largest total amounts.

But I found that the website provides no download channel (the operations staff told me so, so it's not my fault ┓(;´_`)┏).

Each page shows only ten records and there is no way to jump to a full list, so scraping a single page is definitely not enough.
I first wrote an ordinary crawler to fetch the table, but it kept failing; the site apparently has fairly strong anti-crawling measures. . .
Searching online, I found the selenium library. I had never used it before, and it looked quite interesting, so I decided to use it.

Preparation

Some basic knowledge is still necessary; you can read an introductory blog post on selenium first.

Next, install the browser driver. I use Edge here, which everyone should have.
First check your browser version, then find the matching driver, download it, unzip it, and put it in your Python directory.
Then add that path to the environment variables (My Computer >> right-click Properties >> Advanced System Settings >> Advanced >> Environment Variables >> System Variables >> Path).

But when we run the program, an error is reported.

In fact, the fix is simply to rename the driver executable to the name shown in that error message.
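Alternatively, instead of renaming, you can point Selenium at the driver file directly. This is a minimal sketch assuming Selenium 3, whose webdriver.Edge accepts an executable_path argument (the path below is a placeholder for wherever you unzipped the driver):

from selenium import webdriver

# the path is a placeholder; point it at your unzipped msedgedriver.exe
driver = webdriver.Edge(executable_path=r'C:\Python\msedgedriver.exe')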

If you don't have the selenium library yet, install it first:
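pip install selenium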

That completes the preparation.

1. Open the web page

Selenium works by simulating user actions, so we start from opening the browser.
First import the libraries and set some pandas display options:

from selenium import webdriver
import time
import pandas as pd
import numpy as np

pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 10)

Then open the browser (Edge here, which everyone should have) and go directly to the website:

driver = webdriver.Edge()

driver.get('https://me.modian.com/u/user_index')

time.sleep(1)

Note that the window looks different from a normally opened browser: there is a banner indicating it is being controlled by automated test software.

2. Log in to the account

Since the whole process is automated, we need to log in with account and password. Inspecting the login button, we find we can locate its enclosing tag by class, then go down through the h3 to the span, and click that.

The blog posts mentioned earlier explain this part, so take a look at them first.

The next step is to locate the element and click it.

There is a short sleep here because we have to wait for the browser to respond; the same applies to the later steps.

button1=driver.find_element_by_xpath("//*[@class='login_field']/h3/span")
time.sleep(1)
button1.click()
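As a side note, fixed time.sleep calls are fragile. A sketch of the same click using selenium's standard explicit-wait API (WebDriverWait plus expected_conditions, not what this post's code uses) would be:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the login tab to become clickable, then click it
button1 = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, "//*[@class='login_field']/h3/span"))
)
button1.click()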


The next step is to find the two input boxes and the login button, three elements in total.


We again locate downward through the class, then simulate two text inputs and one click (if you fork this, don't just run it without changing the account and password Ψ( ̄∀ ̄)Ψ).

time.sleep(1)
driver.find_element_by_xpath("//*[@class='phone']/input").send_keys('your_account')
driver.find_element_by_xpath("//*[@class='password']/input").send_keys('your_password')
button2=driver.find_element_by_xpath("//*[@class='loginBtn hover other_input']")
button2.click()

Now we are on the account home page, but the customer information is not here yet; it lives on the user management page, so one more click is needed:

time.sleep(1)
driver.find_element_by_xpath("//*[@id='user_manage']/a").click()
time.sleep(1)

OK, we have finally reached the data page. Perfect! (*^-^*)ゞ

3. Read and save the data

First build a DataFrame to hold the data:

df = pd.DataFrame(columns=['id', 'name', '金额'])   # 金额 = amount
num = 0

Then look at where the data sits on the page: we can grab all the information rows by their class, then split each row's text into individual fields.

userslist = driver.find_elements_by_xpath("//tr[@class='el-table__row']")
for user in userslist:
    data = user.text.split('\n')   # one field per line of the row text
    print(data)

Since this is company information the screenshot is heavily mosaicked, but you can still see that one customer row contains 7 fields.

The next step is to pick out the fields we want to save; replace the print(data) with the code below.

Note that we only want the rows marked 成交客户 (customers with a completed purchase), so add a check on that field first.

if len(data) == 7:
    if data[2] == "成交客户":   # keep only completed-purchase customers
        df.loc[num, 'id'] = data[3]
        df.loc[num, 'name'] = data[0]
        df.loc[num, '金额'] = data[4]
        num = num + 1

However, an error may still be raised at runtime, for a different reason: in the example just now, every column had a value. What if one field is blank? Then the row has only six fields, so the original indexing is of course wrong. Add another branch:

if len(data) == 6:
    if data[1] == "成交客户":
        df.loc[num, 'id'] = data[2]
        df.loc[num, 'name'] = data[0]
        df.loc[num, '金额'] = data[3]
        num = num + 1
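As a small refactor of my own (not part of the original code), the two branches can be merged by anchoring on the position of the 成交客户 marker, since the id and the amount always come right after it and the name is always first:

def parse_row(data):
    # returns (user_id, name, amount) for a 成交客户 row, else None;
    # assumes the 7-field and 6-field layouts shown above
    if "成交客户" not in data:
        return None
    k = data.index("成交客户")
    return data[k + 1], data[0], data[k + 2]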

With that we can read the data of one page. Of course we then have to turn the page; the usual trick applies: find the next-page button and click it!

driver.find_element_by_xpath("//*[@class='el-icon el-icon-arrow-right']").click()
#print(df[-10:-1])
time.sleep(2)

This way you can read page after page. The sleep above can be shortened if your network is fast enough; if the new page has not loaded when the next read starts, the same rows get read twice.
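A more reliable approach than tuning the sleep (again my own sketch, using selenium's standard wait API) is to wait until the old rows go stale after clicking next:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

old_first_row = userslist[0]
driver.find_element_by_xpath("//*[@class='el-icon el-icon-arrow-right']").click()
# block until the old table rows are detached, i.e. the new page has rendered
WebDriverWait(driver, 10).until(EC.staleness_of(old_first_row))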
The rest is just repeating this for every page, so wrap it in a loop:

for i in range(1):  # your page count!!! (don't ask why I only have one page of data)
    userslist = driver.find_elements_by_xpath("//tr[@class='el-table__row']")
    for user in userslist:
        data = user.text.split('\n')
        if len(data) == 7:
            if data[2] == "成交客户":
                df.loc[num, 'id'] = data[3]
                df.loc[num, 'name'] = data[0]
                df.loc[num, '金额'] = data[4]
                num = num + 1
        elif len(data) == 6:
            if data[1] == "成交客户":
                df.loc[num, 'id'] = data[2]
                df.loc[num, 'name'] = data[0]
                df.loc[num, '金额'] = data[3]
                num = num + 1
    driver.find_element_by_xpath("//*[@class='el-icon el-icon-arrow-right']").click()
    #print(df[-10:-1])
    time.sleep(2)

Finally, add a save:

df.to_excel(r'D:\allmoney.xlsx',index = False)
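One caveat: pandas writes .xlsx files through the openpyxl engine, so if to_excel complains about a missing optional dependency, install it too:

pip install openpyxl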

Done! *\(^o^)/*

Full code

# -*- coding: utf-8 -*-
"""
Created on Wed Feb 23 19:08:35 2022

@author: xyyl
"""

from selenium import webdriver
import time
import pandas as pd
import numpy as np

pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 10)

# open the browser
driver = webdriver.Edge()

driver.get('https://me.modian.com/u/user_index')

time.sleep(1)

# locate the account/password login button
button1=driver.find_element_by_xpath("//*[@class='login_field']/h3/span")
time.sleep(1)
button1.click()

# enter account and password
time.sleep(1)
driver.find_element_by_xpath("//*[@class='phone']/input").send_keys('your_account')
driver.find_element_by_xpath("//*[@class='password']/input").send_keys('your_password')
button2=driver.find_element_by_xpath("//*[@class='loginBtn hover other_input']")
button2.click()

# go to the customer info page
time.sleep(1)
driver.find_element_by_xpath("//*[@id='user_manage']/a").click()
time.sleep(1)

# create the DataFrame
df = pd.DataFrame(columns=['id', 'name', '金额'])   # 金额 = amount
num = 0
# locate the customer list and read it page by page

for i in range(1):  # your page count!!! (don't ask why I only have one page of data)
    userslist = driver.find_elements_by_xpath("//tr[@class='el-table__row']")
    for user in userslist:
        data = user.text.split('\n')
        if len(data) == 7:
            if data[2] == "成交客户":
                df.loc[num, 'id'] = data[3]
                df.loc[num, 'name'] = data[0]
                df.loc[num, '金额'] = data[4]
                num = num + 1
        elif len(data) == 6:
            if data[1] == "成交客户":
                df.loc[num, 'id'] = data[2]
                df.loc[num, 'name'] = data[0]
                df.loc[num, '金额'] = data[3]
                num = num + 1
    driver.find_element_by_xpath("//*[@class='el-icon el-icon-arrow-right']").click()
    #print(df[-10:-1])
    time.sleep(2)
df.to_excel(r'D:\allmoney.xlsx',index = False)
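A final note: the find_element_by_xpath style used throughout this post was deprecated in Selenium 4 and removed in later releases, so on a current install the equivalent calls would look like this (same XPaths, just the newer locator API):

from selenium.webdriver.common.by import By

button1 = driver.find_element(By.XPATH, "//*[@class='login_field']/h3/span")
userslist = driver.find_elements(By.XPATH, "//tr[@class='el-table__row']")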

The result works well; the only complaint is that the run takes a long time, a bit like this blogger, who is nearly done and still slacking off ( ̄ω ̄;)
