How to crawl website data with a web crawler for big data analysis

Most people will have heard of web crawlers; here is a brief introduction. A crawler is a program that automatically fetches information from web pages and can help us extract useful data from them. Any program that can do this can be called a crawler. The mainstream Python crawler frameworks are divided into a scheduler, a URL manager, a web downloader, a web parser, and the application (which consumes the crawled data). The scheduler mainly coordinates the URL manager, the downloader, and the parser; the URL manager keeps track of URLs to prevent fetching the same page repeatedly or getting stuck in crawl loops; the downloader fetches pages and turns them into strings; the parser parses the downloaded string, typically by building a DOM tree, and can handle both XML and HTML. The crawler framework does about 80% of the work for us; we only need to focus on three steps (a minimal skeleton of such a framework is sketched after the list below):


1. How to request data from the target site;


2. How to parse out the data we want from the response;


3. How to analyze the data we obtain.
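To make the division of labour concrete, here is a minimal, illustrative skeleton of such a framework. The class and function names are hypothetical, not a real library; they only mirror the components described above.

# A minimal, illustrative skeleton of the crawler framework described above.
# The class and function names here are hypothetical, not a real library.
import requests


class UrlManager(object):
    """URL manager: tracks URLs still to crawl and URLs already crawled."""
    def __init__(self):
        self.new_urls, self.old_urls = set(), set()

    def add(self, url):
        if url and url not in self.old_urls:
            self.new_urls.add(url)

    def has_next(self):
        return len(self.new_urls) > 0

    def next_url(self):
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url


def download(url):
    """Downloader: fetch the page and return it as a string."""
    return requests.get(url).text


def parse(html):
    """Parser: extract wanted data and follow-up URLs from the HTML."""
    return [], []  # (data, new_urls) - the site-specific logic goes here


def schedule(seed_url):
    """Scheduler: drives the URL manager, downloader and parser."""
    urls = UrlManager()
    urls.add(seed_url)
    while urls.has_next():
        data, new_urls = parse(download(urls.next_url()))
        for u in new_urls:
            urls.add(u)
        # hand `data` over to the application / analysis step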


Here we take Douyu, one of the hottest live-streaming sites, as the example: we will find the most popular topics and the most popular streamers, and walk through the whole implementation process.


Setting up the Python environment

You can refer to the following address to set up a Python environment: http://www.runoob.com/python/python-install.html


The crawler also needs a couple of extra modules, requests and beautifulsoup4. Install each by running:


pip install requests


pip install beautifulSoup4


We use PyCharm as the development tool. After running the commands above, also install the two modules in PyCharm. If the installation succeeded, the packages can be imported in the IDE:


#!/usr/bin/python

import requests

from bs4 import BeautifulSoup as bs

If the imports succeed, the installation is complete. With the environment set up, we can start writing code.


Fetching data from the target website

Open the Douyu live directory, https://www.douyu.com/directory/all, and press F12 to inspect the page's structure.


Request the site's data with requests:


response = requests.get("https://www.douyu.com/directory/all")

print response.text



We now have the page's content as a string, which completes the first step.
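One practical note: some sites reject requests that do not look like they come from a browser. A small, optional sketch of adding a browser-like User-Agent header and checking the status code before using the response (the header value here is only an example):

headers = {"User-Agent": "Mozilla/5.0"}  # example browser-like UA string
response = requests.get("https://www.douyu.com/directory/all", headers=headers)
if response.status_code == 200:
    print response.text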


Parsing the page information

Next we need to filter the data we want out of this large block of text. BeautifulSoup is very powerful here: it parses the page into a DOM tree for us. It can use Python's built-in html.parser, and it can also parse with lxml.


html = response.text

html_tree = bs(html, "html.parser")

print html_tree

As you can see, the raw string has been formatted into readable HTML text, and we can now conveniently access the data in each node of the DOM tree. Looking at the HTML, the useful data sits in <li /> tags inside a <ul /> tag: the room name is the content of an <h3 class="ellipsis" /> tag, the room type is in a <span class="tag ellipsis" /> tag, the viewer count is in a <span class="dy-num fr" /> tag, and the streamer name is in a <span class="dy-name ellipsis fl" /> tag.

In the HTML we just parsed, find the <ul /> tag whose id is "live-list-contentbox", and get the content of all its <li /> tags:


# Find the ul tag that holds the room list
host_infos = html_tree.find("ul", {"id": "live-list-contentbox"})
# print host_infos
# Find all li tags (one per room)
host_list = host_infos.find_all("li")
print host_list

# Iterate over the rooms and pull out the live information
for host in host_list:
    # Room name
    home_name = host.find("h3", {"class": "ellipsis"}).string.strip()
    home_name = home_name.replace(",", "")
    # Streamer name
    p_str = host.find("p")
    host_name = p_str.find("span", {"class": "dy-name ellipsis fl"}).string.strip()
    # Room type
    home_type = host.find("span", {"class": "tag ellipsis"}).string
    # Viewer count
    home_num = host.find("span", {"class": "dy-num fr"}).string
    print "\033[31mroom name: \033[0m%s, \033[31mroom type: \033[0m%s, \033[31mstreamer name: \033[0m%s, \033[31mviewer count: \033[0m%s" \
          % (home_name, home_type, host_name, home_num)



At this point we have crawled the first page and parsed out the data we need; we can continue with pages 2, 3, and so on. For most websites this is the basic flow of data crawling. Of course, different sites are harder or easier to crawl and require different techniques; you need to observe the site and think about how to reach the useful data. For example, some sites only return data after you log in, so you have to simulate the login process and save the cookie or token to use in later requests.
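As an illustration of that last point, here is a minimal sketch of a simulated login with requests.Session; the login URL and form field names are hypothetical and depend entirely on the target site.

# Hypothetical login flow; the URL and form field names depend on the site.
session = requests.Session()
login_data = {"username": "your_name", "password": "your_password"}
session.post("https://example.com/login", data=login_data)    # server sets cookies on the session
response = session.get("https://example.com/member/data")     # cookies are sent automatically
print response.text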


Back on the Douyu site, if we click the next-page button and watch the network requests in F12, we find an interesting pattern:






When we click page 3, the requested link is https://www.douyu.com/gapi/rkc/directory/0_0/3


When we click page 4, the requested link is https://www.douyu.com/gapi/rkc/directory/0_0/4


The number at the end of the request link is simply the page number, so we can fetch hundreds of pages of data in one go. Here is the code; after running it we have live-room data for 200 pages.


#!/usr/bin/python
# coding=UTF-8

import requests
import json
import sys  # must be imported before calling reload()

reload(sys)
sys.setdefaultencoding('utf-8')  # avoid garbled Chinese characters

count = 1
base_url = "https://www.douyu.com/gapi/rkc/directory/0_0/"

# Path where the crawled data is stored
host_file_data = open("D:\\tmp_data\\file_data.csv", "w")
host_file_data.write("room name,room type,streamer name,viewer count\n")

# Request 200 pages of data
while count <= 200:
    request_url = base_url + str(count)
    response = requests.get(request_url)
    # Load the JSON data
    json_data = json.loads(response.text)
    for host_info in json_data["data"]["rl"]:
        # Parse the room name, room type, streamer name and viewer count out of the JSON
        home_name = host_info["rn"].replace(" ", "").replace(",", "")
        home_type = host_info["c2name"]
        host_name = host_info["nn"]
        home_user_num = host_info["ol"]
        # print "\033[31mroom name: \033[0m%s, \033[31mroom type: \033[0m%s, \033[31mstreamer name: \033[0m%s, \033[31mviewer count: \033[0m%s" \
        #       % (home_name, home_type, host_name, home_user_num)
        # Write one CSV row per room
        host_file_data.write(home_name + "," + home_type + "," + host_name +
                             "," + str(home_user_num) + "\n")
    count += 1

host_file_data.close()



Statistical analysis of the data

Our goal is to rank the hottest topics and the most popular streamers. Python's matplotlib library lets us draw 2D charts quickly, and the pandas library handles the data-analysis work; here it is used to import and read the data.


Install the libraries with pip install pandas and pip install matplotlib. If running the code reports that the SimHei font cannot be found, or Chinese characters display as boxes:


First copy simhei.ttf from the Windows fonts directory into the /python2.7/site-packages/matplotlib/mpl-data/fonts/ttf directory under your Python installation,

then delete the ~/.cache/matplotlib cache directory and re-run.
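An alternative that often avoids copying font files is to point matplotlib at SimHei through rcParams (this assumes the SimHei font is installed on the system):

import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']  # use SimHei for Chinese labels
plt.rcParams['axes.unicode_minus'] = False    # render the minus sign correctly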


      1. Count the topics (room types) with the most live rooms


import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("/root/.jupyter/study_python/file_data.csv")  # read the crawled data
# Count the topics (room types) with the most live rooms
names = df["room type"].value_counts()
plt.rcParams['figure.figsize'] = (20.0, 4.0)  # set the figure size
plt.rcParams['figure.dpi'] = 200              # set the resolution
# Set the chart font so Chinese labels display correctly
font = {
    'family': 'SimHei',
    'weight': 'bold',
    'size': '15'
}
plt.rc('font', **font)
plt.bar(names.index[0:15], names.values[0:15], fc='b')
plt.show()

      2. Ranking of live topics by viewer count
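The chart for this step is not reproduced here. A minimal sketch of the computation, assuming the CSV column names written by the crawler above and the df already loaded in step 1:

# Sum viewer counts per room type and plot the top 15 topics
topic_totals = df.groupby("room type")["viewer count"].sum().sort_values(ascending=False)
plt.bar(topic_totals.index[0:15], topic_totals.values[0:15], fc='b')
plt.show()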




      3. Ranking of streamers by viewer count
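Likewise for the streamer ranking, a minimal sketch under the same assumptions:

# Sum viewer counts per streamer and plot the top 15 streamers
host_totals = df.groupby("streamer name")["viewer count"].sum().sort_values(ascending=False)
plt.bar(host_totals.index[0:15], host_totals.values[0:15], fc='b')
plt.show()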




From crawling the data to analyzing it, that is the basic end-to-end flow. I hope this simple example can spark readers' interest in big data and crawlers. Follow-up posts will continue to share more advanced data analysis. Thank you.


Origin: blog.51cto.com/14485508/2426994