[Python][Crawler 01] From Zero Environment Setup to the First Crawler

>Python environment

    First, we need to download a version of Python.

    Python comes in two main flavors: 2.7 and 3.x. Since some syntax and features changed in Python 3 compared to Python 2, a number of older libraries do not support Python 3 very well.

    For convenience later on, Python 2.7 (64-bit) is used here. (Unlike 32-bit Python, the 64-bit build is not limited to roughly 2 GB of memory.) In general, I install both Python 3 and Python 2.7 side by side.

    Next, we need an IDE. The main candidates are Eclipse, IDEA, and PyCharm.

  • Eclipse is completely free, while the latter two are paid (both offer free Community and educational editions).
  • Strictly speaking, Eclipse and IDEA are Java IDEs, but both have plugins for other languages; PyCharm is an IDE built specifically for Python by JetBrains, the company behind IDEA.
  • In terms of overall experience, IDEA and PyCharm are much more pleasant to use than Eclipse. Setting up a Python environment in Eclipse or in IDEA is covered in their respective setup guides.

> A spider's levels of abstraction

    In Python, a complete crawler (web spider) can be implemented in many ways, but the main flow always boils down to two steps:

  1. Forge an HTTP request and obtain the corresponding HTML document (including, but not limited to, CSS, JS, etc.);
  2. Parse the HTML (as an XML/DOM tree) to extract the required data.

    In the first step, the common approach is to use urllib and urllib2 (in Python 3 these two packages have been merged into one) to fetch web resources. That said, the official documentation actually recommends a third-party library instead: requests.

  • The urllib(2) modules are the components of the Python standard library for working with URLs;
  • requests is a higher-level HTTP client interface, in a sense a high-level encapsulation of that low-level machinery (see the sketch after this list).
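
    As a point of comparison, a minimal sketch of the same kind of fetch with requests (a third-party library installed via pip) might look like this; the URL and printed fields are only illustrative:

# minimal requests sketch; the URL here is only an example
import requests

response = requests.get('http://www.baidu.com/')
print response.status_code   # HTTP status code of the reply
print response.text          # page body as a (unicode) string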

    In the second step, to parse the scraped HTML (and other) files, the two commonly used tools are regular expressions (regex) and BeautifulSoup (lxml is left aside for now, since BeautifulSoup can actually use lxml as its parsing engine).

  • How hard regular expressions are to learn depends on your background: if you have already used regex in another language, you can pick up Python's quickly.
  • BeautifulSoup is quicker to get started with and reads more naturally, but pages that are not written very rigorously may be hard to parse with BS; in those cases you fall back to regular expressions for extraction. A small BeautifulSoup sketch follows this list.
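
    For a first taste of BeautifulSoup, here is a minimal sketch (assuming beautifulsoup4 is installed via pip); the HTML snippet and class names are made up purely for illustration:

# minimal BeautifulSoup sketch; the HTML and class names are made up
from bs4 import BeautifulSoup

html = '<div class="card"><p class="title">some video</p></div>'
soup = BeautifulSoup(html, 'html.parser')   # lxml could be plugged in as the engine instead
for p in soup.find_all('p', class_='title'):
    print p.get_text()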

    But these methods are still somewhat inconvenient, which is why a high-level, abstract crawler framework like Scrapy exists. (It can be used for more than just scraping web data.) A minimal spider sketch is shown below.
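
    For reference only, a Scrapy spider in its most minimal form looks roughly like the sketch below; the spider name, URL, and selector are placeholders, and a real project is normally generated with "scrapy startproject" and run with "scrapy crawl":

# minimal Scrapy spider sketch; name, URL, and selector are placeholders
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        # Scrapy schedules the requests and responses; we only describe the extraction
        self.log(response.xpath('//title/text()').extract_first())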

    In terms of learning and operating difficulty, going from urllib + regex, to requests + BeautifulSoup, to Scrapy, the level of abstraction rises step by step and so does the convenience; but a grasp of the lower layers helps you understand the real causes of request errors and parsing failures. For the first crawler, we will use the most basic approach: urllib plus regular expressions.


> The first crawler

    Let's try to crawl the HTML content of Baidu's homepage:

import urllib2

url = 'http://www.baidu.com/'
response = urllib2.urlopen(url)  # send the request and get the response object
result = response.read()         # read the raw HTML body
print result

    At this point, the HTML content of the entire page (including JS and CSS) is printed to the console.
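
    (As mentioned above, urllib and urllib2 are merged in Python 3; purely for comparison, the same fetch there would look roughly like this, assuming the page is UTF-8 encoded:)

# the same request under Python 3, shown for comparison only
import urllib.request

url = 'http://www.baidu.com/'
response = urllib.request.urlopen(url)
result = response.read().decode('utf-8')   # assumes a UTF-8 page
print(result)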

    But this is not very interesting yet. Let's put urllib + regex to real use: our goal is to grab the top promoted videos on the homepage of bilibili (station B):


    First, step one: we forge the request:

import urllib2

def req(url):
    # send the request and return the response object for later parsing
    response = urllib2.urlopen(url)
    return response
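
    If a site rejects the default urllib2 user agent, the request can be "disguised" further by attaching headers through urllib2.Request; the helper name and header value below are only an illustration, not part of the script above:

# sketch: attach a User-Agent header via urllib2.Request
# (req_with_headers and the UA string are illustrative additions)
import urllib2

def req_with_headers(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    request = urllib2.Request(url, headers=headers)
    return urllib2.urlopen(request)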

    Then, step two: we parse the fetched data and extract what we need from the HTML with regex. Using the browser's inspect-element tool, we find the following:


    Therefore, we can easily get the DOM Tree structure of each card:

<div class="groom-module home-card">
    <p class="title">...</p>
    <p class="author">...</p>
    <p class="play">...</p>
</div>

    So the decode function can be written out:
import re

def decode(response):
    # regex patterns for each card and the fields inside it
    card_root_div = r'<div class="groom-module home-card">(.*?)</div>'
    card_title_p = r'<p class="title">(.*?)</p>'
    card_author_p = r'<p class="author">(.*?)</p>'
    card_play_p = r'<p class="play">(.*?)</p>'
    # re.S lets "." match newlines; re.M makes "^" and "$" match per line
    all_card_root = re.findall(card_root_div, response, re.S | re.M)
    for c in all_card_root:
        title = re.search(card_title_p, c, re.S | re.M).group(1)
        author = re.search(card_author_p, c, re.S | re.M).group(1)
        play = re.search(card_play_p, c, re.S | re.M).group(1)
        print title, author, play

    Execute our crawler script:
bilibili_url = 'https://www.bilibili.com/'
decode(req(bilibili_url).read())

    The output is as follows (video title + video author + play count):


    * Regular expressions themselves call for further study; here we only demonstrate how to analyze a web page and pair that analysis with regex-based extraction.


    This is how the first complete, simple crawler script is written.
