Python web crawler (Part 1)

1, Two recurring problems: how to handle pages that rely heavily on JavaScript (JS), and how to deal with login.

2, Related terms: screen scraping, data mining, web harvesting, web crawling, web crawler, bot.

3, Advantages of web crawlers: first, they can process thousands or even millions of pages; second, unlike a traditional search engine, they can return more precise data; third, compared with fetching data through an API, they are more flexible.

4, Uses of web crawlers: collecting and analyzing (classifying and aggregating) data for market forecasting, machine translation, medical diagnostics, news sites and articles, health forums, macroeconomics, genetics, international relations, the arts, and other fields.

5, Web crawling touches on: databases, web servers, the HTTP protocol, HTML (HyperText Markup Language), network security, image processing, and data science.

6, Layers of a web page: the HTML text layer, the CSS style layer (Cascading Style Sheets), the JavaScript execution layer, and the image rendering layer.

7, JavaScript design influences: (1) basic syntax borrowed from C; (2) data types and memory management borrowed from Java; (3) from Scheme, functions promoted to "first-class citizens"; (4) from Self, a prototype-based inheritance mechanism. JavaScript consists of: (1) the core (ECMAScript), which describes the language's syntax and basic objects; (2) the Document Object Model (DOM), which describes the methods and interfaces for working with page content; (3) the Browser Object Model (BOM), which describes the methods and interfaces for interacting with the browser. JavaScript libraries: jQuery, Prototype, MooTools, etc.

8, HTML text structure: an HTML document is a tree, and the browser builds that tree in memory.
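As a rough illustration of the tree structure, the sketch below walks a small HTML string with the standard library's HTMLParser and records each tag indented by its depth. The class name TreePrinter, the VOID_TAGS set, and the sample markup are assumptions made for this example, and the sketch assumes reasonably well-formed input.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not deepen the tree.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class TreePrinter(HTMLParser):
    """Record each opening tag, indented by its depth in the tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            # Never go below the root, even on stray closing tags.
            self.depth = max(self.depth - 1, 0)

sample = ("<html><head><title>Demo</title></head>"
          "<body><h1>Hi</h1><p>Text</p></body></html>")
printer = TreePrinter()
printer.feed(sample)
print("\n".join(printer.lines))
```

Here head and body appear one level below html, and title, h1, and p one level below those, which is the parent/child nesting a parser keeps in memory.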

9, HTML is responsible only for a document's structure and content; presentation is left entirely to CSS. Basic CSS syntax: selector {property: value; property: value; property: value;} (tagAttributes)

10, To display a page, the browser has to load a number of related resources, including image files, JavaScript files, CSS files, and the URLs of links to other pages.

11, How the browser loads a server resource such as <img src='cutkitten.jpg'>: based on the tag it builds a request packet, asks the operating system to send the request to the server, and interprets the returned data as an image. A browser is itself code, and code can be broken down into basic components that can be rewritten, modified, and reused as needed.
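To make the resource-loading step concrete, here is a minimal sketch that lists the resources a browser would have to request for a page. It uses only the standard library; the ResourceCollector class, the URL_ATTRS table, the sample page, and the example.com base address are all assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Which attribute carries the resource URL for each tag of interest.
URL_ATTRS = {"img": "src", "script": "src", "link": "href", "a": "href"}

class ResourceCollector(HTMLParser):
    """Collect (tag, absolute URL) pairs in document order."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        wanted = URL_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative paths against the page's own address.
                self.resources.append((tag, urljoin(self.base_url, value)))

page = """
<html><head>
  <link rel="stylesheet" href="style.css">
  <script src="/js/app.js"></script>
</head><body>
  <img src='cutkitten.jpg'>
  <a href="https://example.org/next">next</a>
</body></html>
"""
collector = ResourceCollector("https://example.com/pets/index.html")
collector.feed(page)
for tag, url in collector.resources:
    print(tag, url)
```

Each collected URL corresponds to one further request: the relative cutkitten.jpg resolves against the page's directory, while /js/app.js resolves against the site root.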

12, .get_text(): strips all tags from the HTML document, removing hyperlinks, paragraph markup, and other non-text information, and returns only the text. Use .get_text() last, just before printing, storing, or otherwise operating on the final data!
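A small sketch of that advice, assuming the bs4 package is installed; the sample markup is made up for the example. Navigation happens on tag objects, and .get_text() is called only at the end.

```python
from bs4 import BeautifulSoup

html = "<div><h1>Title</h1><p>Some <a href='/x'>linked</a> text.</p></div>"
bs = BeautifulSoup(html, "html.parser")

# Keep working with tag objects while navigating...
paragraph = bs.find("p")

# ...and strip the markup only at the end, when plain text is needed.
text = paragraph.get_text()
print(text)
```

Once .get_text() has run, the <a> tag and its href are gone, which is why it is better left until the link is no longer needed.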

Principle Analysis and Development Tools

1, urllib standard library: from urllib.request import urlopen

1.1, What urllib does: requests data from the web, handles cookies, and can change metadata such as request headers and the user agent.

1.2, urlopen: opens a remote object over the network and reads it; it can read HTML files, image files, and any other file stream.
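The sketch below shows the basic urlopen flow together with a custom User-Agent header. To keep it runnable without network access, a data: URL stands in for a real page; that choice, the sample HTML, and the "my-crawler/0.1" agent string are assumptions for this example, not something the notes prescribe.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# A data: URL stands in for a real address so the sketch runs offline.
sample_html = "<html><body><h1>Hello</h1></body></html>"
url = "data:text/html," + quote(sample_html)

# Request lets you attach headers such as User-Agent before sending.
# (A data: URL has no server to read them; an HTTP server would.)
request = Request(url, headers={"User-Agent": "my-crawler/0.1"})

with urlopen(request) as response:
    raw = response.read()  # urlopen always hands back bytes
    content_type = response.headers.get_content_type()

html = raw.decode("utf-8")
print(content_type)
print(html)
```

With a real site, only the url value changes; the read-then-decode steps stay the same.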

2, BeautifulSoup library

2.1, XML is eXtensible Markup Language; HTML is HyperText Markup Language. XML syntax is stricter and HTML syntax is looser. XML is mainly used to store data and HTML mainly to build pages, and XML complements HTML rather than replacing it. The two were designed for different purposes: XML was designed to transport and store data, so its focus is the content of the data; HTML was designed to display data, so its focus is the appearance of the data.

2.2, The BeautifulSoup library formats and organizes messy web information by locating HTML tags, and presents the XML structure as easy-to-use Python objects.

2.3, Creating a BeautifulSoup object: bs = BeautifulSoup(html.read(), 'html.parser'), where html is the response returned by urlopen. HTML tags are then reachable through bs (bs.html.title, bs.html.body.h1, bs.html.body.div)

  First argument: the HTML text the BeautifulSoup object is built from

  Second argument: the parser used to build the BeautifulSoup object: 'html.parser', 'lxml', or 'html5lib'

  1, Advantage of 'lxml' and 'html5lib': fault tolerance. If the HTML tags (tagName) are malformed (unclosed, improperly nested, missing head or body tags), 'lxml' and 'html5lib' can repair the markup.

  2, The three parsers differ in parsing speed: 'lxml' > 'html.parser' > 'html5lib'. In practice, though, the bottleneck is almost always network bandwidth rather than parsing speed!
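A minimal sketch of object creation and tag access, assuming bs4 is installed. It uses 'html.parser' because that parser ships with Python, whereas 'lxml' and 'html5lib' are separate installs; the sample page is invented for the example.

```python
from bs4 import BeautifulSoup

html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1>Main heading</h1>
    <div>First block</div>
  </body>
</html>
"""

# First argument: the HTML text. Second argument: the parser name.
bs = BeautifulSoup(html, "html.parser")

print(bs.title)            # first <title> tag
print(bs.html.body.h1)     # explicit path through the tree
print(bs.h1.get_text())    # shortcut: first <h1> anywhere in the document
print(bs.nav)              # a tag that is absent comes back as None
```

Swapping "html.parser" for "lxml" or "html5lib" leaves the rest of the code unchanged, provided the corresponding package is available.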

2.4, Crawler exceptions. Because page data can be malformed, a crawler can fail in two places: first, in urlopen(); second, in print(bs.h1).

  One, urlopen() problems:

  1, The page does not exist on the server, or the server hit an error: HTTPError, e.g. '404 Page Not Found', '500 Internal Server Error'

  2, The server does not exist: URLError

  Two, print(bs.h1) problems:

  1, The target tag does not exist: None, then AttributeError! When you ask a BeautifulSoup object for a tag that does not exist, it returns None. Calling a child tag on that None raises AttributeError, so both cases must be guarded against!

  Approach to handling exceptions: try ... except ... else ..., adding a checkpoint wherever an exception can occur.

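from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    # Checkpoint 1: the request itself can fail.
    try:
        html = urlopen(url)
    except (HTTPError, URLError) as e:
        return None
    # Checkpoint 2: bsObj.body is None when the page has no <body>,
    # so bsObj.body.h1 would raise AttributeError.
    try:
        bsObj = BeautifulSoup(html.read(), "lxml")
        title = bsObj.body.h1
    except AttributeError as e:
        return None
    return title

# The URL is blank in the source; put the address of the page to fetch here
# (urlopen rejects an empty string with ValueError).
title = getTitle("")
if title is None:
    print("Title could not be found")
else:
    print(title)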
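The two urlopen() failure modes can be told apart as in the sketch below. The fetch helper and its (body, error) return shape are assumptions for illustration. HTTPError is a subclass of URLError, so it has to be caught first. The demo call uses a made-up URL scheme, which fails before any network traffic, so only the URLError branch is exercised here.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def fetch(url):
    """Return (body, error); exactly one of the two is None."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read(), None
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        return None, f"HTTP error {e.code}"
    except URLError as e:
        # No usable answer at all: bad host, refused connection, unknown scheme.
        return None, f"URL error: {e.reason}"

# An unknown scheme is rejected locally, so this call runs offline.
body, error = fetch("nosuchscheme://example.invalid/")
print(body, error)
```

Reversing the two except clauses would send every HTTPError into the URLError branch and lose the status code.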

3, HTML parsing: BeautifulSoup, regular expressions

3.1, Ways to make a crawler more flexible and its code more readable:

  1, Direction one: compare the desktop (PC) version of the page with the mobile (app) version and their HTML. Pick the version that is easier to scrape, and set the request headers so that the server returns that version.

  2, Direction two: look in the JavaScript files; information that the page loads through JavaScript may be contained in those files.

  3, Direction three: information such as the page title is often contained in the URL itself, so it can be read directly from the target URL.

  4, Direction four: look for the target information on another source website.

3.2, Using BeautifulSoup: find a tag or a group of tags by CSS attribute values, and navigate the tag tree.


  2, bs.find_all(tagName, tagAttributes)
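A short sketch of find_all with a tag name and an attribute dictionary, assuming bs4 is installed; the markup and class names are invented for the example.

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="name">Anna</li>
  <li class="role">Editor</li>
  <li class="name">Boris</li>
</ul>
<span class="name">Clara</span>
"""
bs = BeautifulSoup(html, "html.parser")

# tagName only: every <li> in the document
all_items = bs.find_all("li")

# tagName plus tagAttributes: only <li> tags whose class is "name"
name_items = bs.find_all("li", {"class": "name"})

# A list of tag names matches any of them
names_any_tag = bs.find_all(["li", "span"], {"class": "name"})

print([tag.get_text() for tag in name_items])
```

find_all returns a list-like result even when nothing matches, so it can be looped over without a None check.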


4, Writing web crawlers: Scrapy

5, Storing the target information: MySQL

Examples of use











