Logging in automatically with Selenium to crawl website data

Target website: an ancient-poetry website (gushiwen.cn)
Goal: log in to the website automatically, crawl the data of a specified page, and store it.
Tools used: Selenium, Baidu handwriting OCR

Step 1: Browse the web

Looking at the target website, we find that logging in only requires filling in the user's account information and then a verification code (captcha). With the requirements clear, we can start.

Step 2: Preliminary operations

Using the browser's developer tools, we can locate the account and password inputs, read their id attributes, locate them with Selenium, and fill them in with send_keys. To browse more comfortably, we can also maximize the window with maximize_window().

from selenium import webdriver

bro = webdriver.Chrome('./chromedriver.exe')
bro.get('https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx')
bro.maximize_window()
# fill in the account
bro.find_element_by_id('email').send_keys('your account')
# fill in the password
bro.find_element_by_id('pwd').send_keys('your password')

The key to logging in is getting the verification code. I first tried downloading the captcha image directly by its URL, but fetching the image makes a second request to the site, so the captcha returned no longer matches the one shown on the page, and login fails after recognition. That is why I take a screenshot of the page instead.
For the image filename, I use the time module to name it after the current time.

import time

t = time.strftime('%Y%m%d%H%M%S')          # current time, used as the filename
picture_name2 = str(t) + '抠图.png'         # '抠图' means "cutout"
bro.save_screenshot(picture_name2)

At this point we have a screenshot of the entire page, and we will crop the captcha out of it next. We locate the captcha element by its id with Selenium, the same as before:

address = bro.find_element_by_id('imgCode')
# crop box; the extra offsets are hand-tuned corrections
left = address.location['x'] + 370
top = address.location['y'] + 105
right = left + address.size['width'] + 18
height = top + address.size['height'] + 12

The extra offsets above were found by manual trial and error; I will look into a way to locate the box precisely later. The cutout code is as follows:

from PIL import Image

jt_img = Image.open(picture_name2)          # the full-page screenshot
kt_img = jt_img.crop((left, top, right, height))
kt_img.save(picture_name2)                  # overwrite with the cropped captcha
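As a less hand-tuned alternative (my own sketch, not the original method): the crop box can be computed from the element's location and size once you know the page's device pixel ratio (readable in the browser via window.devicePixelRatio), and newer Selenium versions can even capture the element directly with WebElement.screenshot(). A minimal helper, where the scale factor is an assumption:

```python
def crop_box(location, size, scale=1.0):
    """Compute a PIL crop box (left, top, right, bottom) from a Selenium
    element's .location and .size dicts, scaled by the device pixel ratio."""
    left = int(location['x'] * scale)
    top = int(location['y'] * scale)
    right = int((location['x'] + size['width']) * scale)
    bottom = int((location['y'] + size['height']) * scale)
    return (left, top, right, bottom)

# example with made-up coordinates and a scale of 2 (a "retina" screen):
print(crop_box({'x': 100, 'y': 50}, {'width': 80, 'height': 30}, scale=2.0))
# (200, 100, 360, 160)
```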

Let's see what the cutout looks like. Although the letters are legible, there is still some noise. To remove or reduce this interference, we grayscale and binarize the image.

imgs = kt_img.convert('L')                  # grayscale
threshold = 127
filter_func = lambda x: 0 if x < threshold else 1
image = imgs.point(filter_func, '1')        # binarize to mode '1'
image.save(picture_name2)                   # save the processed captcha

The point function maps every pixel through filter_func: pixels below the threshold become 0 and all others become 1. In mode '1', 0 is black and 1 is white, so the dark strokes of the letters are kept black while the lighter noise is pushed to white, discarding the parts of the image we don't need.
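The mapping can be checked on a few sample gray values with the same filter_func, standalone:

```python
threshold = 127
filter_func = lambda x: 0 if x < threshold else 1

# gray values running from dark stroke to light background:
samples = [12, 90, 126, 127, 200, 255]
print([filter_func(p) for p in samples])
# [0, 0, 0, 1, 1, 1]
```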

After processing the picture, we can see that the verification code in the middle is much clearer.

Step 3: Identify the picture

After reading in the picture, we can use Baidu's handwriting OCR to recognize it. To use it, log in to Baidu's AI open platform and create an OCR application; other articles cover the specific steps, so I won't go into them here.

Find the following parameters in the console and write them into our code:

APP_ID = ''
APP_KEY = ''
SECRET_KEY = ''
Then we can perform image recognition

from aip import AipOcr

baidu_sip = AipOcr(APP_ID, APP_KEY, SECRET_KEY)
with open(picture_name2, 'rb') as f:
    chuli_img = f.read()                    # bytes of the processed captcha
result = baidu_sip.handwriting(chuli_img)
print('The captcha result is:', result['words_result'][0]['words'])

result is a dictionary; we only need to extract the recognized captcha text from it.
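For illustration, the response has roughly this shape (the values below are made up, not real output):

```python
# illustrative shape of the handwriting OCR response:
result = {
    'log_id': 123456789,
    'words_result_num': 1,
    'words_result': [{'words': 'AB3D'}],
}

# the recognized captcha text is the only part we need:
V_code = result['words_result'][0]['words']
print(V_code)  # AB3D
```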

Step 4: Fill in the verification code to log in to the website

Locate the captcha input box, fill in the code we just recognized, then click the login button to complete the login.

V_code = result['words_result'][0]['words']
bro.find_element_by_id('code').send_keys(V_code)   # fill in the captcha
bro.find_element_by_id('denglu').click()           # 'denglu' (登录) = the login button

Step 5: Crawl website data

I won't crawl the whole site here; interested readers can learn CrawlSpider and combine it with Selenium for full-site crawling, which I'll cover in a later article. For now, let's pick a tab at random and crawl the first page of famous sentences. The code is below and needs little explanation.

import csv
import time
from lxml import etree

bro.find_element_by_xpath('//*[@id="html"]//div[1]/div[1]/div/div[2]/div[1]/a[2]').click()
time.sleep(2)                               # wait for the page to load
# crawl the famous sentences and their authors
res = bro.page_source
html = etree.HTML(res)
item = html.xpath('//*[@id="html"]//div[2]/div[1]/div[2]')
for i in item:
    shiwen = i.xpath('./div/a[1]/text()')   # sentence text
    zuozhe = i.xpath('./div/a[2]/text()')   # author
    with open('./古诗文.csv', 'a', encoding='utf-8', newline='') as f:
        data = csv.writer(f)
        data.writerow(shiwen)
        data.writerow(zuozhe)
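As a variation (my own sketch, not the original code): writing each sentence together with its author on one CSV row keeps the pair on a single line, which is easier to read back later. The sample rows below are purely illustrative data:

```python
import csv
import io

# illustrative (sentence, author) pairs, as the xpath loop would collect them:
rows = [('example sentence 1', 'author A'), ('example sentence 2', 'author B')]

buf = io.StringIO()                 # stands in for the CSV file
writer = csv.writer(buf)
for sentence, author in rows:
    writer.writerow([sentence, author])

print(buf.getvalue().splitlines())
# ['example sentence 1,author A', 'example sentence 2,author B']
```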

Handwriting recognition sometimes misreads the captcha; it cannot be 100% correct, but the accuracy is still quite high.
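Since recognition can fail, one common pattern (a sketch under my own assumptions, with a hypothetical attempt_login callable that would wrap screenshot → OCR → submit and report success) is simply to retry a few times:

```python
def login_with_retry(attempt_login, max_tries=3):
    """Call attempt_login() up to max_tries times; it should return True
    once the login succeeded (e.g. the URL changed after clicking)."""
    for _ in range(max_tries):
        if attempt_login():
            return True
    return False
```

Each attempt should re-screenshot the page, because the captcha refreshes after a failed submit.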

Origin blog.csdn.net/weixin_44052130/article/details/130683864