[Selenium crawler] Yhen takes you by the hand: using a Selenium-automated crawler to download One Piece anime pictures

The following content is original. Everyone is welcome to read and learn from it, but commercial use is forbidden. Please credit the source when reprinting.
  Hello everyone! I'm Yhen, and I've been learning Python for a month now. I'm very happy to share my learning experience with you here. As a beginner, I run into all kinds of bugs when I write code, so I'm sharing some of that experience in the hope that it helps!

  Today I'll take you through a special crawling method, Selenium, to crawl the One Piece pictures on Baidu Images. The full source code is at the end. Since the walkthrough is quite detailed it may run long, so if you only want the result, feel free to jump straight to the source code.

———————— Manual dividing line ————————————————

Alright, on with today's share.

With nothing much to do, I felt like crawling some anime pictures for fun. But which anime should I crawl?

I opened the wallpaper section of Baidu Image Search and browsed through the cartoon/anime column.

Aha, I've decided: it's you, "One Piece"!

I haven't actually watched much of this anime, but I've heard about its superb artwork, and I believe many of you love this series.

The pictures in there are all gorgeous.
Search keywords: 壁纸 卡通动漫 海贼王 (wallpaper cartoon anime One Piece)

And mind you, there are far more than the few dozen pictures you see at first: scrolling all the way to the bottom, there are 447 pictures in total.
Our goal today is to crawl all 447 of them.

Now that we have our goal, let's start analyzing:

Since we're going to crawl pictures, it's natural to think of the usual approach:
1. Send a request to the homepage to obtain the page data
2. Extract from that data the links to the pictures
3. Send a request to each picture link to obtain the image data
4. Save the pictures locally

Isn't this the same as crawling the emoticon packs before? So easy, done in ten minutes!

But ... is it really that simple?

Come, let me show you the usual way of crawling pictures.

First, a simple import and request:

# Import the requests crawler library
import requests
# Import the pyquery data-extraction library
from pyquery import PyQuery as pq

# Homepage URL
url = "https://image.baidu.com/search/index?ct=&z=&tn=baiduimage&ipn=r&word=%E5%A3%81%E7%BA%B8%20%E5%8D%A1%E9%80%9A%E5%8A%A8%E6%BC%AB%20%E6%B5%B7%E8%B4%BC%E7%8E%8B&pn=0&istype=2&ie=utf-8&oe=utf-8&cl=&lm=-1&st=-1&fr=&fmq=1587020770329_R&ic=&se=&sme=&width=1920&height=1080&face=0&hd=&latest=&copyright="
# Request headers
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.9 Safari/537.36"}
# Send a request to the homepage and get the text data back
response = requests.get(url, headers=headers).text
print(response)

The data comes back normally.

Next, data extraction.

First, open the browser's inspect tool and locate the first picture. On the right you can see that the corresponding element's class selector is main_img img-hover, and that it carries a data-imgurl attribute containing a link: https://ss0.bdstatic.com/70cFvHSh_Q1YnxGkpoWK1HF6hhy/it/u=1296489273,320485179&fm=26&gp=0.jpg

Let's visit it: it turns out to be exactly the picture detail page we're looking for.

So next let's use pyquery to extract the data and see whether we can pull out that link.

Since this isn't today's focus, I'll show you the code directly; if you want to know how to use pyquery, see my earlier blog posts.

# Initialize the data
doc = pq(response)
# Extract via the class selector main_img img-hover. Note: replace the space in the middle with a "."
main_img = doc(".main_img.img-hover").text()
print(main_img)

Print it and see whether we got the data we want.

Oh no, why is there nothing?!
After confirming there's no problem with the code we wrote,
my first reaction was: we're being anti-crawled!!!

No matter. If it's anti-crawling, we'll just add a few more parameters to the request headers.

I added the request data type, user info (Cookie), and anti-hotlinking Referer to the request headers:

# Request headers
#          Browser type
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.9 Safari/537.36",
          # Request data type
          "Accept": "application/json, text/javascript, */*; q=0.01",
          # User info
          "Cookie": "BIDUPSID=19D65DF48337FDD785B388B0DF53C923; PSTM=1585231725; BAIDUID=19D65DF48337FDD770FCA7C7FB5EE199:FG=1; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; indexPageSugList=%5B%22%E9%AB%98%E6%B8%85%E5%A3%81%E7%BA%B8%22%2C%22%E5%A3%81%E7%BA%B8%22%5D; delPer=0; PSINO=1; BDRCVFR[dG2JNJb_ajR]=mk3SLVN4HKm; BDRCVFR[tox4WRQ4-Km]=mk3SLVN4HKm; BDRCVFR[-pGxjrCMryR]=mk3SLVN4HKm; BCLID=8092759760795831765; BDSFRCVID=KH_OJeC62A1E9y7u9Ovg2mkxL2uBKEJTH6aoBC3ekpDdtYkQoCaWEG0PoM8g0KubBuN4ogKK3gOTH4AF_2uxOjjg8UtVJeC6EG0Ptf8g0M5; H_BDCLCKID_SF=tJCHoK_MfCD3HJbpq45HMt00qxby26niWNO9aJ5nJDoNhqKw2jJhef4BbN5LabvrtjTGah5FQpP-HJ7tLTbqMn8vbhOkahoy0K6UKl0MLn7Ybb0xynoDLRLNjMnMBMPe52OnaIbp3fAKftnOM46JehL3346-35543bRTLnLy5KJYMDcnK4-XD653jN3P; ZD_ENTRY=baidu; H_PS_PSSID=30963_1440_21081_31342_30824_26350_31164",
          # Anti-hotlinking referer
          "Referer": "https://image.baidu.com/search/index?ct=&z=&tn=baiduimage&ipn=r&word=%E5%A3%81%E7%BA%B8%20%E5%8D%A1%E9%80%9A%E5%8A%A8%E6%BC%AB%20%E6%B5%B7%E8%B4%BC%E7%8E%8B&pn=0&istype=2&ie=utf-8&oe=utf-8&cl=&lm=-1&st=-1&fr=&fmq=1587020770329_R&ic=&se=&sme=&width=1920&height=1080&face=0&hd=&latest=&copyright="
          }

Request again to see if we can get the data we want

......

Still ... nothing?

Despair ...

But it's okay. How could a setback like this stop me!!!

Let's analyze this in reverse.

We located the data with the class selector main_img img-hover, yet got nothing back, and this happened even though the code is correct and we're not being anti-crawled.

Then...

...there is only one truth!
The homepage data we requested at the start never contained the class selector main_img img-hover at all!!!

Let's verify whether this homepage data really is the culprit: first print the requested homepage data, then search it for main_img img-hover.

The search turns up nothing; the search box just flashes red. So this guy has been fooling us all along and simply doesn't hand over the data! In other words, the homepage is rendered dynamically, and its real data interface is not the homepage URL!
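(If you'd rather confirm this programmatically than eyeball the console, a one-line membership test on the response string from the requests snippet above does the same job.)

# Is our class selector anywhere in the raw homepage HTML?
print("main_img img-hover" in response)  # False -- the class only appears after JS renders the page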

Faced with this, there are two solutions:

1. Dig through the vast pile of network requests to find the interface that actually serves the homepage data
2. Use Selenium to request the page directly

I don't know which one you'd pick, but making me hunt for one interface in that much data? I refuse!!! Far too much effort.

Isn't Selenium just nicer?

Why?

Because when a webpage is opened with Selenium, all of its information is loaded into Elements, and then the dynamic page can be crawled with the same methods as a static one.

That means as long as Selenium makes the request to the homepage, the data we get back is the same source code we see in the console after pressing F12! No painstaking interface hunt required!
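(To make that difference concrete, here is a minimal side-by-side sketch: the same page fetched both ways, compared by whether the class selector shows up. The shortened query URL is hypothetical, and it assumes chromedriver is set up as described below.)

import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Hypothetical shortened query URL, standing in for the long homepage URL above
url = "https://image.baidu.com/search/index?tn=baiduimage&word=%E6%B5%B7%E8%B4%BC%E7%8E%8B"

# Static fetch: the raw HTML, before any JavaScript has run
static_html = requests.get(url).text

# Rendered fetch: the DOM after the browser has executed the page's scripts
opts = Options()
opts.add_argument("--headless")
driver = webdriver.Chrome(options=opts)
driver.get(url)
rendered_html = driver.page_source
driver.quit()

print("main_img" in static_html)    # expect False: the images are injected by JS
print("main_img" in rendered_html)  # expect True: Selenium sees the rendered DOM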

As for Selenium, its most common use is still automation: it can launch a browser and simulate user actions such as logging in or turning pages automatically.

Students who want to learn more can refer to this Chinese translation of the documentation:

https://selenium-python-zh.readthedocs.io/en/latest/

OK, let's just do it.

First come the imports, and then the browser configuration. Selenium has a visual mode (it opens a real browser so you can watch it operate) and a silent mode (it runs in the background, out of sight).

Today we'll use silent mode: when the focus is the crawler, you don't need to watch the browser work, and popping open a browser window on every run is both annoying and memory-hungry.

Another important point: to use Selenium you must first install the webdriver matching your browser and put it in the same path as your .py file, so that Selenium can drive the browser for you.

For the download addresses of the various browser drivers, see
https://www.jianshu.com/p/6185f07f46d4

The webdriver should sit in the same folder as the .py file.
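(If you'd rather not copy the driver next to the script, you can also point to it explicitly. A minimal sketch, assuming Selenium 4's Service API and a hypothetical local driver path:)

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Hypothetical path: point this at wherever your chromedriver actually lives
service = Service("./chromedriver")
driver = webdriver.Chrome(service=service)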
Below is Selenium's silent-mode configuration. It is a bit fiddly, but it's fixed boilerplate; just get familiar with it.

from selenium import webdriver  # import the webdriver module
from selenium.webdriver.chrome.options import Options  # import the Options class

chrome_options = Options()  # instantiate Options
chrome_options.add_argument('--headless')  # start the browser in silent (headless) mode
driver = webdriver.Chrome(options=chrome_options)  # use Chrome as the browser engine

If you ask me why each step is set up this way, I honestly don't know; if you want the details, go read the documentation.

Only silent mode is this fiddly, though. Visual mode takes just two or three lines of code.
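(For comparison, a minimal visual-mode sketch: no Options needed, and a real Chrome window opens so you can watch it work.)

from selenium import webdriver

driver = webdriver.Chrome()            # launches a visible Chrome window
driver.get("https://image.baidu.com")  # watch the browser navigate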

With the configuration done, we can use Selenium to send the request:

# Request the homepage
driver.get('https://image.baidu.com/search/index?ct=&z=&tn=baiduimage&ipn=r&word=%E5%A3%81%E7%BA%B8%20%E5%8D%A1%E9%80%9A%E5%8A%A8%E6%BC%AB%20%E6%B5%B7%E8%B4%BC%E7%8E%8B&pn=0&istype=2&ie=utf-8&oe=utf-8&cl=&lm=-1&st=-1&fr=&fmq=1587020770329_R&ic=&se=&sme=&width=1920&height=1080&face=0&hd=&latest=&copyright=')
# Get the page source
response = driver.page_source

Note that to get the page source we use driver.page_source, which returns the data to us directly as a string.
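(A quick sanity check, assuming the response variable above:)

# page_source hands back plain text, ready for any HTML parser
print(type(response))  # <class 'str'>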

Let's print the returned data. Success, we got it!

Now search this newly obtained data for main_img img-hover and see whether it really contains what we want.

Ta-da! This time there's truly no problem: the image URLs we want are right here.

Next we can use pyquery to extract the image URLs: first initialize the data, then select the elements with the class selector, and while traversing them pull each image link out of the "data-imgurl" attribute. The code is as follows:

from pyquery import PyQuery as pq
# Initialize the data
doc = pq(response)
# Extract the elements via the class selector
x = doc(".main_img.img-hover").items()
# Traverse the results
for main_img in x:
     # Pull the image link out of the "data-imgurl" attribute
     image_url = main_img.attr("data-imgurl")
     print(image_url)

After all that effort, I finally got the image URLs. It wasn't easy, so I printed them out with excitement, and found ...

WHAT???
The page just showed more than 400 pictures,
and you return me 20 URLs??? Did you eat the rest????

Interesting. Just when I thought I had it beat, the problem got one step ahead of me.

So I went back to the webpage.

In the source we just retrieved, open the search with Ctrl+F and look for main_img img-hover: there are only 20 results.

Then it hit me: the four-hundred-odd pictures earlier only appeared because we kept scrolling down.

So most likely the initial page simply hasn't finished loading. To get the remaining URLs, we must have Selenium drive the browser to keep scrolling down.

How do we do that?
I had no idea, hahaha,
but I know Baidu, so I asked it.

I found a CSDN article describing how to use Selenium to simulate scrolling to the bottom of a page.
Original link:
https://blog.csdn.net/weixin_43632109/article/details/86797701

Be aware, though, that one scroll to the bottom is not enough to load all four-hundred-plus pictures. Selenium has to scroll to the bottom many times before we read the page source, or we won't get all the data.

So how do we scroll multiple times? I set up a for loop. First, the code:

import time
# Scroll to the bottom of the page 25 times
for a in range(25):
     # Move the scrollbar to the bottom of the page
     js = "var q=document.documentElement.scrollTop=1000000"
     driver.execute_script(js)
     # Wait one second so the page data has time to load
     time.sleep(1)

I set the loop count to 25. Why 25? Because I tested it beforehand, and 25 passes is just enough to reach the last picture.
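(A hard-coded 25 is fragile if the result count ever changes. A more general sketch, assuming the same driver object: keep scrolling until the page height stops growing.)

import time

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll to the bottom, then give the next batch of images time to load
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # the height stopped growing: nothing more is loading
    last_height = new_height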

Now let's look at the image links we get.

Clearly we got a lot more data this time. I clicked the last link, and sure enough, it is the link to the last picture on the page. So we have successfully obtained all the image links.

The next step is to send requests to these image links, get the data, and save it locally.

import requests
import os

# Make sure the download folder exists (create it if it doesn't)
os.makedirs("海贼王图片下载", exist_ok=True)
# Initialize count to 0
count = 0
# Traverse the results
for main_img in x:
     # Pull the image link out of the "data-imgurl" attribute
     image_url = main_img.attr("data-imgurl")

     # Send a request to the image link to get the image data
     image = requests.get(image_url)

     # Save as a .jpg in the 海贼王图片下载 (One Piece downloads) folder, opened with "wb": w = write, b = binary mode
     f = open("海贼王图片下载/" + "{}.jpg".format(count), "wb")
     # Write the fetched data; .content is the raw binary body
     f.write(image.content)
     # Close the file
     f.close()
     # i.e. count = count + 1
     count += 1

To send the requests we use our classic crawler library, requests.

First send a request to get the image data. Then save it as a .jpg file in the One Piece download folder, opening the file with "wb" (w for write, b for binary mode), write the fetched data into it, and finally close the file.
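(As an aside, a more idiomatic way to write the save step is a with block, which closes the file automatically even if an error occurs. This sketch assumes the image_url and count variables from the loop above.)

# Equivalent save step using a context manager -- no explicit f.close() needed
image = requests.get(image_url)
with open("海贼王图片下载/{}.jpg".format(count), "wb") as f:
    f.write(image.content)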

Run the program and see whether all the pictures come down.

Perfect! All 447 pictures on the page downloaded successfully.
Confetti! The end!

Finally, the full source code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
from pyquery import PyQuery as pq
import requests
import os


# Instantiate an Options object
chrome_options = Options()
# Set the browser to silent (headless) mode
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)

# Request the homepage
driver.get('https://image.baidu.com/search/index?ct=&z=&tn=baiduimage&ipn=r&word=%E5%A3%81%E7%BA%B8%20%E5%8D%A1%E9%80%9A%E5%8A%A8%E6%BC%AB%20%E6%B5%B7%E8%B4%BC%E7%8E%8B&pn=0&istype=2&ie=utf-8&oe=utf-8&cl=&lm=-1&st=-1&fr=&fmq=1587020770329_R&ic=&se=&sme=&width=1920&height=1080&face=0&hd=&latest=&copyright=')
# Scroll to the bottom of the page 25 times
for a in range(25):
     # Move the scrollbar to the bottom of the page
     js = "var q=document.documentElement.scrollTop=1000000"
     driver.execute_script(js)
     # Wait one second so the page data has time to load
     time.sleep(1)

# Get the page source
response = driver.page_source
# print(response)
# Initialize the data
doc = pq(response)
# Extract the elements via the class selector
x = doc(".main_img.img-hover").items()
# Make sure the download folder exists (create it if it doesn't)
os.makedirs("海贼王图片下载", exist_ok=True)
# Initialize count to 0
count = 0
# Traverse the results
for main_img in x:
     # Pull the image link out of the "data-imgurl" attribute
     image_url = main_img.attr("data-imgurl")

     # Send a request to the image link to get the image data
     image = requests.get(image_url)

     # Save as a .jpg in the 海贼王图片下载 folder, opened with "wb": w = write, b = binary mode
     f = open("海贼王图片下载/" + "{}.jpg".format(count), "wb")
     # Write the fetched data; .content is the raw binary body
     f.write(image.content)
     # Close the file
     f.close()
     # i.e. count = count + 1
     count += 1

Now for a bit of rambling from me.

[Yhen's words]
  Long time no see, everyone. For various reasons I hadn't posted a crawler article in a while. After reviewing my article at noon the day before yesterday, I spent the afternoon thinking about what to write next, and it suddenly occurred to me to try crawling Baidu Images, so I gave it a go myself. I assumed it would go smoothly, that the ordinary crawler method in this article would work directly. This time the content came entirely from my own line of thinking, not from following a teacher as before. The Six Star teacher does have a tutorial on crawling Baidu Images, but I deliberately didn't watch it before writing this article, just to see whether I could complete a project on my own. After finishing, I looked up the teacher's video and found he used the find-the-interface method, so I'd like to think my approach counts as a small innovation. Even after switching to Selenium, the incomplete-data problem still appeared and troubled me for a while, but after some research I solved it. Isn't learning exactly this process of constantly running into problems and then solving them? When I finally crawled down all 447 pictures, I felt very fulfilled, hahaha. So I hope you'll give it a try when you're free: while respecting the site's rules, use crawlers to do something you're interested in. You may hit plenty of setbacks along the way, but only you will know how exciting it is when you succeed. It also proves that after learning Python you really can use it, and the time wasn't wasted, right? Keep going!

Yesterday I actually found my article about crawling novels on some unknown website ... with no attribution. I'm going to contact CSDN and the people involved to sort it out. So, that sentence again: everyone is welcome to read my articles and learn from them, but please credit the source when reprinting, and commercial use is prohibited! Thank you for your cooperation.

I'm very happy to share my experience with you here, and I hope it helps. If there's anything you don't understand, or any suggestions for me, please leave a message in the comments!

If you feel this post has helped you, please give it a little like, and a follow would be even better. Your support is my motivation; I'll share more experience with everyone in the future.

I'm Yhen. See you next time!

[Review of previous articles]

[Crawler] Yhen takes you to crawl fiction websites with Python, sweeping up every novel on the net, read away!
(This may be the most detailed tutorial you've ever seen)

[Crawler] Yhen takes you by the hand crawling with Python

【Crawler】Yhen teaches you to crawl emoticon packs, making you the coolest kid in the meme-battle world

【Crawler】Yhen takes you hand in hand to grab Qunar's popular travel information (and package it into a travel-info lookup gadget)



Original: blog.csdn.net/Yhen1/article/details/105585188