Writing a Simple Web Crawler

While browsing the Internet, we often come across nice pictures we would like to save, whether as desktop wallpaper or as design material. We can implement this with a simple Python crawler that downloads the content we want to our local machine. Let's look at how to implement such a feature in Python.

Development Tools

The tool I use is Sublime Text 3; its lightness and simplicity appeal to me greatly, so I recommend it. Of course, if your computer is well configured, PyCharm may suit you better.

For setting up a Python development environment in Sublime Text 3, this blog post is recommended:

[Setting up a Python development environment in Sublime Text] [http://www.cnblogs.com/codefish/p/4806849.

Introduction to Crawlers

A crawler, as the name suggests, is like a bug crawling across the big web that is the Internet, fetching what we want along the way.

To crawl the Internet, we first need to understand the URL, formally the "Uniform Resource Locator", informally a "link". Its structure consists of three parts:

(1) Protocol: for example, the common HTTP protocol seen in URLs.

(2) Domain name or IP address: a domain name such as www.baidu.com, or the IP address, i.e. what the domain name resolves to.

(3) Path: the directory or file on the host.

Developing the Simplest Crawler with urllib

(1) Introduction to urllib

urllib is Python's standard library for working with URLs. In Python 3 the request functionality lives in urllib.request, which provides the urlopen method for opening URLs and the Request class for building requests, while the error classes live in urllib.error. (In Python 2 this functionality was split between urllib and urllib2.)

(2) Developing the Simplest Crawler

The Baidu home page is simple and elegant, which makes it very suitable for our crawler.

The crawler code is as follows:

 
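The original code block did not survive extraction; here is a minimal Python 3 sketch of what it presumably looked like (the original post may have targeted Python 2, where the call was urllib2.urlopen):

```python
from urllib.request import urlopen

# open the Baidu home page and read the raw response
response = urlopen("http://www.baidu.com")
html = response.read().decode("utf-8")  # decode the response bytes into text
print(html)
```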

The run prints the HTML source of the Baidu home page. We can right-click a blank area of the Baidu home page, choose "View page source", and compare it with our output.

Of course, we can also construct a Request object first, and then open that object with the urlopen method.

The code is as follows:

 
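Again the original listing is missing; a sketch of the Request-object variant might look like this:

```python
from urllib.request import Request, urlopen

# build a Request object first, then hand it to urlopen
req = Request("http://www.baidu.com")
response = urlopen(req)
print(response.read().decode("utf-8"))
```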

The result is the same as before.

(3) Error Handling

Error handling is done through the urllib error module. The main errors are URLError and HTTPError, where HTTPError is a subclass of URLError, which means an HTTPError can also be caught as a URLError. An HTTPError can be identified by its code attribute.

The code for handling HTTPError is as follows:

 
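The original listing is missing; presumably it requested a page that returned 404 and printed the error code. To keep the sketch self-contained, this version serves the 404 from a throwaway local server instead of a real URL:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen
from urllib.error import HTTPError

# a tiny local server that answers every request with 404,
# standing in for whatever real URL the original post used
class NotFoundHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_error(404, "Not Found")

    def log_message(self, *args):  # silence request logging
        pass

server = HTTPServer(("127.0.0.1", 0), NotFoundHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

try:
    urlopen("http://127.0.0.1:%d/" % server.server_port)
except HTTPError as e:
    print(e.code)  # prints 404

server.shutdown()
```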

The run prints the error code 404; for more details on this status code, you can look it up yourself.

A URLError can be identified by its reason attribute.

The code for handling URLError is as follows:

 
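With the original listing lost, here is a sketch that triggers a URLError by requesting a host that cannot resolve (the .invalid TLD is reserved and never resolves, so this fails even without network access):

```python
from urllib.request import urlopen
from urllib.error import URLError

try:
    # .invalid is a reserved TLD, so name resolution always fails here
    urlopen("http://no-such-host.invalid/")
except URLError as e:
    print(e.reason)  # the underlying cause, e.g. a DNS resolution error
```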

The run prints the reason for the failure.

Since we are handling errors anyway, it is best to handle both of them in the code; after all, the more detailed, the clearer. Note that HTTPError is a subclass of URLError, so the HTTPError clause must be placed before the URLError clause; otherwise the URLError branch will always be taken, and a 404 would be reported only as Not Found.

The code is as follows:

 
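The original listing is missing; a sketch of the combined handler, with the subclass caught first, might look like this (the URL is a placeholder standing in for whatever the original post requested):

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

url = "http://no-such-host.invalid/"  # placeholder URL for the demo

try:
    response = urlopen(url)
except HTTPError as e:      # the subclass must be caught first
    print("HTTPError:", e.code)
except URLError as e:       # the parent class comes second
    print("URLError:", e.reason)
else:
    print(response.read().decode("utf-8"))
```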


Origin blog.csdn.net/sinat_38682860/article/details/94763235