Scrapy-Splash introduction and CAPTCHA recognition (1)

  In the previous post we learned how to use Selenium, which is one way to fetch dynamic pages. There are other ways to crawl dynamic pages, however, and here we introduce the Splash approach and explain it with concrete examples.


I: Splash Introduction and Preparation

1. Introduction

  Splash is a JavaScript rendering service. Mention JS and you will surely think of web pages, and rightly so: Splash is a lightweight browser with an HTTP API, and Scrapy-Splash is the tool that brings its JavaScript rendering to Scrapy, so you can crawl dynamically rendered pages.
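
  As a rough illustration of what "a browser with an HTTP API" means, the sketch below builds a request for Splash's render.html endpoint, which returns a page's HTML after its JavaScript has run. The address localhost:8050 assumes Splash was started with the Docker command shown in the installation section; the helper names and the target URL are placeholders of our own, not part of Splash.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_render_url(target_url, wait=0.5, splash_base="http://localhost:8050"):
    """Build a URL for Splash's render.html endpoint (hypothetical helper)."""
    # 'url' is the page to render; 'wait' is the number of seconds Splash
    # waits after the page loads so that scripts have time to run.
    query = urlencode({"url": target_url, "wait": wait})
    return f"{splash_base}/render.html?{query}"


def fetch_rendered_html(target_url, wait=0.5):
    """Ask a running Splash instance for the rendered HTML of target_url."""
    with urlopen(build_render_url(target_url, wait), timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

  With Splash running, calling fetch_rendered_html("https://example.com") should return the rendered HTML as a string; the wait value usually has to be tuned per site.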

2. Installation

  There are two ways to install Scrapy-Splash; here we install it with Docker, so we must first install Docker (a container technology that packages an application together with its environment into a self-contained unit, keeping each application isolated, which suits large-scale crawler systems). The download address is:

  https://docs.docker.com/docker-for-windows/install/

  After downloading and running the installer, you may see the error "Docker Desktop requires Windows 10 Pro or Enterprise version 15063". The cause is that Windows 10 Home does not support Hyper-V, so Docker Desktop cannot be installed; you need to download and install Docker Toolbox instead. Address:

  http://mirrors.aliyun.com/docker-toolbox/windows/docker-toolbox/


  If your edition of Windows supports Hyper-V, you can also enable it and install Docker Desktop; that is not described in detail here.

  After installation, opening a cmd console shows the following output, which indicates that Docker is running (the installation process is tedious and requires patience):

  The following command downloads the Splash image (on first use) and runs it, exposing the service that Scrapy-Splash needs on port 8050:

docker run -p 8050:8050 scrapinghub/splash
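
  Once the container is up, a quick way to confirm that something is listening on the published port is to try opening a TCP connection to it. The sketch below uses only the standard library; the function name is our own, and localhost:8050 is an assumption taken from the -p 8050:8050 mapping in the command above.

```python
import socket


def splash_is_listening(host="localhost", port=8050, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

  Note that with Docker Toolbox the container usually runs inside a VirtualBox VM, so the host to check is typically the VM's address rather than localhost.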

  At this point I ran into an environment configuration problem. I set Intel Virtualization Technology to enabled in the BIOS, but Docker still fails when I run it: even though Intel virtualization is turned on, the VirtualBox virtual machine cannot use it. We will continue with Splash in a later post, once this is solved. If you know the cause, please get in touch with me.



II: CAPTCHA Recognition (1)

  Many websites now use a variety of anti-crawling measures, and one of them is the CAPTCHA (verification code). CAPTCHAs have evolved into many kinds, and interactive ones that require mouse operations are becoming more and more popular, which makes the work of crawlers harder and harder. We start by describing how to recognize the common graphic CAPTCHA with Python.

Recognizing graphic CAPTCHAs

  The graphic CAPTCHA was the earliest kind of verification code and is very common; it generally consists of letters and digits. Let's save a few CAPTCHA images from the web, as follows:

(1) Recognition technology used

  OCR technology: optical character recognition, the process of scanning characters, detecting their shapes, and translating them into electronic text.

(2) Recognition library used

  tesserocr library: a Python OCR library that is a wrapper around tesseract, so you must install tesseract first and then install tesserocr. The installation process is not covered here.

(3) Implementation

import tesserocr
from PIL import Image

# Open the saved CAPTCHA image and let tesseract convert it to text
image = Image.open('1.jpg')
result = tesserocr.image_to_text(image)
print(result)

The original image and its recognition result are as follows:

  There is another way to recognize a CAPTCHA: the file_to_text() method converts an image file directly to a string. This time we use a different image:

print(tesserocr.file_to_text('2.jpg'))

The original image and its recognition result are as follows:

  We can see that the recognition result differs from what we expected. The reason is that the interfering lines in the image got in the way of recognition; a following post will cover how to handle this.
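
  Until that post, here is one common preprocessing idea, given only as a sketch and not necessarily the method the follow-up will use: convert the image to greyscale, then binarize it with a threshold so that faint interference is pushed to white before OCR. The helper names and the default threshold of 127 are our own assumptions; the threshold almost always has to be tuned for each kind of CAPTCHA.

```python
def build_threshold_table(threshold=127):
    """Map each grey level 0-255 to black (0) or white (255)."""
    return [0 if level < threshold else 255 for level in range(256)]


def recognize_with_threshold(path, threshold=127):
    """Greyscale, binarize, then OCR an image file (hypothetical helper)."""
    # Imported here so the table helper above works without these packages.
    import tesserocr
    from PIL import Image

    image = Image.open(path).convert("L")  # "L" = 8-bit greyscale
    # Apply the lookup table and output a 1-bit black-and-white image
    image = image.point(build_threshold_table(threshold), "1")
    return tesserocr.image_to_text(image).strip()
```

  For example, recognize_with_threshold('2.jpg', threshold=100) would try the second image with a stricter cut-off. Lines that are as dark as the characters themselves will survive this step, so it is not a complete solution.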

Recognition of other kinds of CAPTCHA is as follows:

Origin www.cnblogs.com/ITXiaoAng/p/11799090.html