Python Web Crawler in Practice (II): Data Parsing

The previous article covered how to crawl a page and the problems you may run into while crawling. Now we need to parse the pages we have crawled and extract the data we want.

Depending on what we crawl, we need to write different parsing methods. The most common payload is HTML data, i.e. the page source, but some responses may be JSON data instead. JSON is a lightweight data-interchange format that is relatively easy to parse, and it looks like this:

{
    "name": "中国",
    "province": [{
        "name": "黑龙江",
        "cities": {
            "city": ["哈尔滨", "大庆"]
        }
    }, {
        "name": "广东",
        "cities": {
            "city": ["广州", "深圳", "珠海"]
        }
    }, {
        "name": "台湾",
        "cities": {
            "city": ["台北", "高雄"]
        }
    }, {
        "name": "新疆",
        "cities": {
            "city": ["乌鲁木齐"]
        }
    }]
}
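As a first taste of how easy it is, here is a minimal sketch using Python's standard json module on a trimmed-down copy of the data above (the variable names are just for illustration):

import json

# A trimmed-down version of the JSON above, as Python receives it: a plain string
raw = '{"name": "中国", "province": [{"name": "广东", "cities": {"city": ["广州", "深圳", "珠海"]}}]}'

data = json.loads(raw)  # objects become dicts, arrays become lists
print(data['province'][0]['name'])            # 广东
print(data['province'][0]['cities']['city'])  # ['广州', '深圳', '珠海']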

The asynchronously loaded part of the Ctrip page discussed earlier is exactly this case: the request returns JSON data. For this kind of data Python has a very convenient parsing library, so we hardly have to write any code.

For crawled HTML data, however, the tag structure can be complex and differs from page to page, so the parsing method has to suit the case at hand.

The relatively easy parsing methods are regular expressions, XPath, and the BeautifulSoup4 library.

Comparing the running speed of the three: regular expressions are of course the fastest, XPath comes second, and Bs4 is the slowest, because Bs4 is a heavily wrapped library; next to the other two it is like a heavy tank. On the other hand, Bs4 is the simplest of the three to use, and regular expressions are the most troublesome.

Regular expressions are supported by almost every programming language; the syntax differs slightly between languages but is broadly similar. If you are designing a complex system, do not build it on regular expressions: the approach is too fragile, and you can never guarantee that the regex rules you wrote for the pages as they are today will keep working without errors.
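As a quick illustration of both the idea and the fragility (the HTML fragment here is made up), extracting text with Python's re module looks like this:

import re

html = '<a class="name">Movie A</a><a class="name">Movie B</a>'

# The pattern hard-codes the surrounding markup, so any change to the HTML breaks it
names = re.findall(r'<a class="name">(.*?)</a>', html)
print(names)  # ['Movie A', 'Movie B']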

XPath is a language for finding information in an XML document. It can be used to traverse the elements and attributes of an XML document.
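Since we will install lxml below anyway, here is a minimal XPath sketch (the HTML fragment and the class name are made up for illustration):

from lxml import etree

html = '<dl><dd><a class="name">Movie A</a></dd><dd><a class="name">Movie B</a></dd></dl>'
tree = etree.HTML(html)

# //dd/a selects every <a> directly under a <dd>, anywhere in the document
for a in tree.xpath('//dd/a[@class="name"]'):
    print(a.text)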

Regular expressions and XPath will be covered in detail in later hands-on articles; for now the main goal is to master Bs4.

First we need to install the Bs4 library, along with the lxml parser it will use:

pip install lxml
pip install beautifulsoup4

Once we have crawled down an entire HTML page, Bs4 can pick out the data you want according to the relative position of its tags.

This relative positioning is similar to the following:

body > div.banner > div > div.celeInfo-right.clearfix > div.movie-stats-container > div > div > span > span

You can think of the HTML page as an onion being peeled layer by layer.

This locating string is called a selector. We don't have to write it ourselves; the HTML structure can be fairly complex, and writing it by hand makes mistakes easy.

Instead, open the browser console (F12) and find the content you want to parse in the Elements panel. As you move the mouse over a tag, the corresponding content on the page is highlighted in blue so you can compare, as shown below.

[Screenshot: the Elements panel, with a dd tag hovered and the matching movie highlighted on the page]

You can see that these dd tags contain all of the movie information on the current page: the first movie can be understood as dd-1, the second as dd-2, and so on.

Now right-click the dd tag; among the Copy options there is one called Copy selector, which copies the selector for you.

Below are the selectors copied for the first two movies. You can see that only the final dd:nth-child differs:

#app > div > div.movies-panel > div.movies-list > dl > dd:nth-child(1)

#app > div > div.movies-panel > div.movies-list > dl > dd:nth-child(2)

Using this pattern, we can easily parse this kind of list-style page in one go.


# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

# Request headers: identify our client to the server
header = {
    'Accept': '*/*',
    'Connection': 'keep-alive',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Host': 'maoyan.com',
    'Referer': 'http://maoyan.com/',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'
}

# Fetch the movie list page and parse it with Bs4's lxml parser
data = requests.get('https://maoyan.com/films', headers=header)
soup = BeautifulSoup(data.text, 'lxml')

# Select every dd tag in the movie list (the copied selector, minus nth-child)
titles = soup.select('#app > div > div.movies-panel > div.movies-list > dl > dd')

print(titles)

Let's go through the code above carefully.

requests.get(url, headers=header) was introduced in the previous article. headers is the request header, which carries information identifying our client; whether the request is a GET or a POST is decided by which requests function we call.

data is the response that comes back. You can print data directly, but the response contains more than the HTML page; it also carries request-related information, such as the status code: 200 means success, 404 means the resource could not be found, and so on.

data.text extracts the page's HTML code from the response body.
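Putting these two points together, a common pattern (a sketch, not part of the original code) is to check the status code before touching the body:

# Only parse the body if the request actually succeeded
if data.status_code == 200:
    soup = BeautifulSoup(data.text, 'lxml')
else:
    print('Request failed with status', data.status_code)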

BeautifulSoup is our main parsing object, and lxml is the parser we tell it to use.

Calling BeautifulSoup's select method with the selector we copied earlier returns the matching tags from the HTML.

Seen this way, Bs4 is really quite simple. This is only the basic usage of Bs4, but it is enough for parsing an ordinary page; if you are interested you can dig deeper. After all, it is just a utility library, and if you are willing to take the trouble you could parse pages yourself.

Having read the code, suppose we now want to get the movie names on the page. The selector above won't work, because it is not precise enough: it only reaches the dd tag, while we want to reach the movie name itself.

So we use this selector instead:


#app > div > div.movies-panel > div.movies-list > dl > dd:nth-child(1) > div.channel-detail.movie-item-title > a

Getting almost anything else from the page works the same way.
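For instance, building on the soup object from the code above, a sketch of printing every movie name might look like this (dropping the dd:nth-child(1) part so that every dd on the page matches):

names = soup.select(
    '#app > div > div.movies-panel > div.movies-list > dl > dd '
    '> div.channel-detail.movie-item-title > a')

for a in names:
    print(a.get_text().strip())  # the movie name is the link text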

That covers HTML parsing. Sometimes the data we crawl comes back as JSON instead, and such data is very regular; I always hope the target returns JSON.

Take the Ctrip example from the earlier chapter: its flight information is returned as JSON by the request.

[Screenshot: the JSON flight data returned by Ctrip]

Parsing JSON in Python is very simple: you can just treat it as a dictionary.

The JSON you receive is initially just a string; after passing it through Python's json.loads(jsonData), what comes back is actually a dictionary, which you can operate on directly.


import json

# JSON arrives as a plain string; triple quotes let it span multiple lines
jsonData = '''{
        "name": "gzj",
        "age": "23",
        "sex": "man",
        "mail": {
            "gmail": "[email protected]",
            "qmail": "[email protected]"
        }
    }'''

# json.loads turns the JSON string into a Python dictionary
res = json.loads(jsonData)

print(res['mail']['qmail'])

(For the hands-on parts, I have recently been thinking about whether to record videos to accompany the two-part articles; follow the official account to find out!)
