Target website: Ancient poetry website
Goal: automatically log in to the website, crawl the data of a specified page, and store it.
Tools used: selenium, Baidu handwriting OCR
Step 1: Browse the web
Browsing the target website, we find that logging in only requires filling in the user's account information plus a verification code. With the requirements clear, we can start.
Step 2: Preliminary operations
Using the browser's developer tools, we can locate the account and password inputs and read their id attributes, then have selenium find each element by id and fill it in with send_keys. To make the page easier to work with, we can also maximize the window with maximize_window().
from selenium import webdriver

bro = webdriver.Chrome('./chromedriver.exe')
bro.get('https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx')
bro.maximize_window()
# Fill in the account
bro.find_element_by_id('email').send_keys('your account')
# Fill in the password
bro.find_element_by_id('pwd').send_keys('your password')
The key to logging in is obtaining the verification code. I first tried downloading the captcha image directly, but requesting the image URL fires a second request to the site, so the captcha it returns no longer matches the one displayed on the page. The code I filled in therefore never matched the picture, which is why I take a screenshot instead and cut the captcha out of it.
For naming the picture, I used the time function to take the current time as the name:
import time

t = time.time()
picture_name2 = str(t) + '抠图.png'
bro.save_screenshot(picture_name2)
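A raw time.time() float works as a name, but a formatted timestamp is easier to read in a directory listing. A small helper could look like this (the function name and format are my own choices, not from the original code):

```python
import time

def screenshot_name(prefix='captcha', ts=None):
    """Build a filename like 'captcha_20240101-120000.png' from a timestamp."""
    if ts is None:
        ts = time.time()
    stamp = time.strftime('%Y%m%d-%H%M%S', time.localtime(ts))
    return f'{prefix}_{stamp}.png'

print(screenshot_name())
```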
This screenshot covers the entire page; later we will crop the captcha out of it. To find where the captcha sits, we locate its element the same way as before, with selenium's find-by-id:
address = bro.find_element_by_id('imgCode')
left = address.location['x'] + 370
top = address.location['y'] + 105
right = left + address.size['width'] + 18
height = top + address.size['height'] + 12
The extra offsets added above were found by manual trial and error; I will look into how to locate the region precisely later. The following is the cropping code:
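One way to avoid hand-tuned offsets entirely, assuming a Selenium version where WebElement.screenshot() is available, is to screenshot just the captcha element instead of the whole page. The wrapper below is my own sketch, not part of the original script:

```python
def capture_captcha(driver, element_id, filename):
    """Screenshot only one element (requires WebElement.screenshot support)."""
    element = driver.find_element_by_id(element_id)
    element.screenshot(filename)  # writes a PNG containing just that element

# Against the article's driver this would be called as:
# capture_captcha(bro, 'imgCode', picture_name2)
```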
from PIL import Image

jt_img = Image.open(picture_name2)  # the full-page screenshot saved earlier
kt_img = jt_img.crop((left, top, right, height))
kt_img.save(picture_name2)
Check what the cropped picture looks like. The letters are legible, but there are still some interfering marks. To remove or weaken this interference, we grayscale and binarize the image:
img = Image.open(picture_name2)
imgs = img.convert('L')  # convert to grayscale
threshold = 127
filter_func = lambda x: 0 if x < threshold else 1
image = imgs.point(filter_func, '1')
The point function applies a mapping to every pixel, so each grayscale value is recorded as 0 or 1, which makes the pixel matrix easy to work with. In mode '1', 0 is black and 1 is white: pixels darker than the threshold become black, and everything at or above it becomes white, which washes out the parts of the image we don't need.
After processing the picture, we can see that the verification code in the middle is much clearer.
Step 3: Recognize the picture
After reading the picture, we can use Baidu handwriting OCR to recognize it. To set it up, log in to the Baidu AI open platform; the registration steps are covered well in other articles, so I won't repeat them here. Find the following credentials and write them into our code:
APP_ID =
APP_KEY =
SECRET_KEY =
Then we can run the recognition:
from aip import AipOcr

baidu_sip = AipOcr(APP_ID, APP_KEY, SECRET_KEY)
with open(picture_name2, 'rb') as f:
    chuli_img = f.read()  # raw bytes of the processed captcha image
result = baidu_sip.handwriting(chuli_img)
print('The captcha result is:', result['words_result'][0]['words'])
handwriting() returns a dictionary; we only need to pull the recognized verification code out of it.
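A successful response carries the text under 'words_result'; a failed call carries an error payload instead, and indexing blindly raises a KeyError. A defensive extraction helper (my own sketch, with an assumed response shape) could look like:

```python
def extract_code(result):
    """Pull the first recognized string out of a Baidu OCR response dict.

    Returns None when the response has no usable 'words_result' entry
    (e.g. recognition failed or the API returned an error payload).
    """
    words = result.get('words_result') or []
    if not words:
        return None
    return words[0].get('words')

# Shapes assumed here for illustration:
assert extract_code({'words_result': [{'words': 'AB3D'}]}) == 'AB3D'
assert extract_code({'error_code': 17}) is None
```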
Step 4: Fill in the verification code to log in to the website
Locate the captcha input box, fill in the code we just recognized, and click the login button to log in to the page:
V_code = result['words_result'][0]['words']
bro.find_element_by_id('code').send_keys(V_code)
bro.find_element_by_id('denglu').click()
Step 5: Crawl website data
I won't crawl the whole site here; interested readers can learn CrawlSpider and combine it with selenium for full-site crawling, which I will cover in a later article. For now we pick one tab, the famous-sentences page, and crawl its first page of data. The code follows and needs little explanation.
import csv
from lxml import etree

bro.find_element_by_xpath('//*[@id="html"]//div[1]/div[1]/div/div[2]/div[1]/a[2]').click()
# Crawl the famous sentences and poems
res = bro.page_source
html = etree.HTML(res)
item = html.xpath('//*[@id="html"]//div[2]/div[1]/div[2]')
for i in item:
    shiwen = i.xpath('./div/a[1]/text()')
    zuozhe = i.xpath('./div/a[2]/text()')
    with open('./古诗文.csv', 'a', encoding='utf-8', newline='') as f:
        data = csv.writer(f)
        data.writerow(shiwen)
        data.writerow(zuozhe)
Handwriting OCR occasionally misreads the verification code, so recognition will never be 100% correct, but in practice the accuracy is quite high.
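Since a misread captcha just means one failed login attempt, a natural extension (my own sketch, not in the original script) is to wrap the whole cycle in a retry loop:

```python
def login_with_retry(attempt_login, max_tries=3):
    """Call attempt_login() up to max_tries times until it returns True.

    attempt_login is assumed to run one full cycle: screenshot the captcha,
    run OCR on it, submit the form, and report whether the login succeeded.
    """
    for _ in range(max_tries):
        if attempt_login():
            return True
    return False
```

The callable argument keeps the retry logic independent of selenium, so it can be tested without a browser.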