Docker + Python headless browser crawler


Where does massive data come from? There is no such thing as ready-made big data in the world. If you crawl more, you will naturally have data.

Why use Docker?

As a developer, are you still struggling with environment setup and getting frustrated? Are you often put off new technologies by complicated installation steps?

Docker was born to solve these pain points. Docker is not a new technology; similar tools have existed for a long time, but Docker is more pleasant to use. You should really try it. It is like fingerprint unlocking: once you use it, you can't go back.

Many websites now have anti-crawling measures. What we have to do is disguise our requests so that they look as if they came from a real browser. The best way is to let a browser send the requests directly, for example by using WebDriver to drive a browser and simulate real operations, but that is slow. Besides, servers generally run a server edition of Linux with no desktop at all, so a browser has nowhere to display. So we use a headless browser: it works the same as a real browser and is faster, but has no visible interface.
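
For contrast with the browser-driven approach, the simplest form of disguise is to send browser-like headers on a plain HTTP request. Here is a minimal sketch using only the Python 3 standard library. The User-Agent string and the helper name `browser_like_request` are illustrative assumptions, not something from the original article, and the request is only built, not sent:

```python
from urllib.request import Request

# Hypothetical User-Agent value, chosen only for illustration.
FAKE_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0"


def browser_like_request(url):
    """Build (but do not send) a request that carries browser-like headers."""
    return Request(url, headers={
        "User-Agent": FAKE_USER_AGENT,
        "Accept-Language": "en-US,en;q=0.8",
    })


req = browser_like_request("http://www.baidu.com")
# urllib stores header names capitalized, hence "User-agent".
print(req.get_header("User-agent"))
```

Headers alone often do not get past checks that depend on JavaScript running in the page, which is one reason to drive a real browser instead.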

1. Install Ubuntu 16.04 in a virtual machine. (Omitted)
2. Install Docker. Ubuntu 14 and later provide Docker as a package, so you can install it directly.
ubt1606@ubt1606-virtual-machine:~$ docker
The program 'docker' is currently not installed. You can install it by typing:
sudo apt install docker.io
ubt1606@ubt1606-virtual-machine:~$ sudo apt install docker.io
ubt1606@ubt1606-virtual-machine:~$ docker images
Cannot connect to the Docker daemon. Is the docker daemon running on this host?

On Ubuntu, you need to put sudo before the docker command; otherwise the message above is reported.

ubt1606@ubt1606-virtual-machine:~$ sudo docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE

If it is troublesome to add sudo every time you enter a docker command, you can add the current user to the docker user group. For details, refer to this article
3. Pull the docker image

Search the Docker image registry for python+selenium. One of the results is described as "Container with python selenium for lazy people (like me) to avoid configuration xvfb on server." This is exactly what we want.

The author also kindly gave a small demo. We will use it later.

Using default tag: latest
Pulling repository docker.io/pimuzzo/selenium-python
Network timed out while trying to connect to https://index.docker.io/v1/repositories/pimuzzo/selenium-python/images. You may want to check your internet connection or if you are behind a proxy.

curl -sSL https://get.daocloud.io/daotools/set_mirror.sh | sh -s http://c4c833cb.m.daocloud.io

ubt1606@ubt1606-virtual-machine:~$ sudo su
[sudo] password for ubt1606:
root@ubt1606-virtual-machine:/home/ubt1606# echo "DOCKER_OPTS=\"$DOCKER_OPTS --registry-mirror=http://c4c833cb.m.daocloud.io\"" >> /etc/default/docker
root@ubt1606-virtual-machine:/home/ubt1606# vi /etc/default/docker

root@ubt1606-virtual-machine:/home/ubt1606# service docker restart



root@ubt1606-virtual-machine:/home/ubt1606# docker pull index.docker.io/pimuzzo/selenium-python-xvfb
Using default tag: latest
latest: Pulling from pimuzzo/selenium-python-xvfb
759d6771041e: Already exists
8836b825667b: Already exists
c2f5e51744e6: Already exists
a3ed95caeb02: Already exists
21fb0716901c: Already exists
9cc47e6dfb6f: Pull complete
08c1371dc842: Pull complete
0aa04c2152b2: Pull complete
db151fc54aee: Pull complete
3f0af4107074: Pull complete
00d9524b72cc: Pull complete
3ba8b369c5ab: Pull complete
aad0e22b9317: Pull complete
Digest: sha256:73b4aca6ecfc2a5bf392065cd07cf7fc89e5da61104492e7c04332f2bfd8da4d
Status: Downloaded newer image for pimuzzo/selenium-python-xvfb:latest

If you see output similar to the above, the image was pulled successfully, and it will be listed by docker images. Pay attention to the SIZE column: the image is large, so the pull is likely to fail on a slow connection. If it fails, try a few more times. If that does not work, look for another registry mirror. If that still does not work, use OpenConnect (you need to buy a VPS). You can also copy an image file from someone else and import it into Docker. Any method is fine as long as it gets the image into Docker quickly and easily.

At this point the environment is basically ready. Apart from the small detour when pulling the image, the whole operation is very simple. Bear in mind that pulling an image is easy, but building one is not; it is genuinely tedious. A ready-made Docker image therefore saves a great deal of time. The same image can be used during development and then deployed directly afterwards, killing two birds with one stone. The one drawback is that debugging during development is less convenient. So for small development tasks, testing in a specific environment, or trying out new technologies, Docker is really convenient.
4. Write the first small demo

Create a demo.py file under /home/ubt1606/demo. Note that ubt1606 is the username.

[code="python"]#!/usr/bin/env python

from pyvirtualdisplay import Display
from selenium import webdriver

# Start a virtual display (Xvfb) so Firefox has a screen to draw on.
display = Display(visible=0, size=(800, 600))
display.start()

# Now Firefox will run in the virtual display.
# You will not see the browser.
browser = webdriver.Firefox()
browser.get('http://www.baidu.com')
# print() with a single argument works on both Python 2 and Python 3.
print(browser.title)
browser.quit()

display.stop()
[/code]

5. Start the container and map the data volume



-ti: can also be written as -i -t. It requests a console for interacting with the container. The letters stand for interactive and tty (terminal) respectively.

-v: maps a folder on the host to the /home/something folder in the container, much like a shared folder between Windows and VMware.

The image is comparable to a windows.iso file; the container is comparable to the Windows system that has been started from it.

python /home/something/demo.py: runs the demo.py file in the /home/something folder inside the container. Note that this path is the path inside the container, not on the host.

If you take a file out of /home/something and copy it to the /home/other folder, just change the command to python /home/other/demo2.py. Be sure to understand what "in Docker" and "path in Docker" mean. To avoid trouble for yourself, it is best not to copy the file elsewhere.
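
Putting the pieces described above together, the container could be started with a command along these lines. This is a sketch in Python 3 that assembles the argument list. The helper names `build_run_command` and `start_container`, the use of /home/ubt1606/demo as the host folder (taken from step 4) and the ordering of the arguments are assumptions, not the author's exact command:

```python
import subprocess


def build_run_command(host_dir, container_dir, image, script_name):
    """Assemble a docker run argument list from the parts described in step 5."""
    return [
        "sudo", "docker", "run",
        "-ti",                                          # interactive console
        "-v", "{}:{}".format(host_dir, container_dir),  # host:container folder mapping
        image,                                          # image the container starts from
        "python", "{}/{}".format(container_dir, script_name),  # path inside the container
    ]


def start_container(cmd):
    """Run the command; needs Docker installed and the image already pulled."""
    return subprocess.call(cmd)


cmd = build_run_command(
    "/home/ubt1606/demo",            # host folder holding demo.py (assumed from step 4)
    "/home/something",               # the same folder as seen inside the container
    "pimuzzo/selenium-python-xvfb",  # image pulled in step 3
    "demo.py",
)
print(" ".join(cmd))
```

Printing the command first makes it easy to check that the script path is the one inside the container before calling `start_container(cmd)`.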

Summary:

Pull a docker image.

Start the container and run the demo.py file.

The crawler itself is written with the Python version of WebDriver. The only difference is that the browser does not draw on a real screen but on a virtual display provided by pyvirtualdisplay.

In practice, WebDriver is what you will use most when writing crawlers. This article is only a guide. The WebDriver API is relatively simple, and anyone with a JavaEE background can pick it up quickly, so I will not introduce it in detail here.
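
As a pointer for where to go next, here is a hedged sketch of what a small link-collecting crawler might look like with the same pieces as demo.py. It assumes Python 3 and a Selenium version that provides `find_elements(By.TAG_NAME, ...)`. The function names `absolute_links` and `collect_links` are hypothetical. The Selenium imports are kept inside `collect_links` so that the pure helper can be tried without a browser installed:

```python
from urllib.parse import urljoin, urldefrag


def absolute_links(base_url, hrefs):
    """Resolve hrefs against base_url, drop fragments and duplicates, keep order."""
    seen = []
    for href in hrefs:
        if not href:
            continue
        url, _fragment = urldefrag(urljoin(base_url, href))
        if url.startswith("http") and url not in seen:
            seen.append(url)
    return seen


def collect_links(page_url):
    """Open page_url in Firefox on a virtual display and return its links."""
    from pyvirtualdisplay import Display
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    display = Display(visible=0, size=(800, 600))
    display.start()
    try:
        browser = webdriver.Firefox()
        try:
            browser.get(page_url)
            anchors = browser.find_elements(By.TAG_NAME, "a")
            hrefs = [a.get_attribute("href") for a in anchors]
        finally:
            # Always close the browser, even if the page fails to load.
            browser.quit()
    finally:
        # Always shut down the virtual display as well.
        display.stop()
    return absolute_links(page_url, hrefs)


links = absolute_links("http://www.baidu.com/a/", ["b.html", "/c#top", None, "b.html"])
print(links)
```

Inside the container, `collect_links('http://www.baidu.com')` would replace the `browser.title` line of demo.py. The try/finally blocks are there so that a failed page load does not leave Firefox or the virtual display running.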


