Python data crawling

A. Python data crawling

 

1. Understanding the data analysis approach

 

 

                              Figure 1.1 The four layers of the analysis approach

 

1.1 Requirements layer

 

1.1.1 Description

The requirement is the starting point of data analysis; it also sets the target and direction of your analysis.

In principle, the deliverable of data analysis is a set of data-backed results that serve the requirement, rather than the final solution itself.

 

1.1.2 Sources of requirements

Scene 1: An existing metric shows an anomaly, and the cause needs to be analyzed from the data.

Scene 2: The company wants to evaluate an existing business model or product to decide whether it needs adjustment or optimization.

Scene 3: The company sets a short-term strategic goal and needs analysis to work out how to achieve it.

 

1.1.3 Skills for handling requirements

1. You need a fairly deep understanding of the business, the product, and their background, plus enough context to pin down what the requirement actually is.

2. Understanding the requirement alone is not enough; you also need to quickly combine the tools and skills you have mastered to form a preliminary analysis plan for it.

3. After weighing everything, decide whether the analysis is needed and how it should be carried out, and confirm the plan with the requirement owner.

 

1.2 Data Layer

 

1.2.1 Description

The data layer can be roughly divided into data acquisition, data cleaning, and data organization.

The real core of big data analysis lies in making use of databases.
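As a minimal sketch of that idea (using Python's built-in sqlite3 module; the table name and sample rows below are made up for illustration), crawled records can be organized into a small database and queried later:

import sqlite3

# Hypothetical crawled (title, url) records; in practice these come from the crawler.
rows = [("Hadoop", "http://hadoop.apache.org"), ("Spark", "http://spark.apache.org")]

conn = sqlite3.connect("crawl.db")
conn.execute("CREATE TABLE IF NOT EXISTS links (title TEXT, url TEXT)")
conn.executemany("INSERT INTO links VALUES (?, ?)", rows)
conn.commit()

# Query the organized data
for title, url in conn.execute("SELECT title, url FROM links ORDER BY title"):
    print(title, url)

conn.close()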

 

1.2.2 Big Data

Meaning: big data refers to massive, complex data sets that cannot be captured, stored, searched, shared, analyzed, or processed with existing software tools.

Mining value: 1. Segment customers and provide customized services for each specific group; 2. Simulate real-world conditions to discover new demand while increasing the return on investment; 3. Strengthen the links between departments and improve the efficiency of the whole management and industrial chain; 4. Reduce service costs and uncover hidden clues for innovating products and services.

 

 

 

1.3 Analysis layer

 

1.3.1 Description

The tools to master for the analysis stage include SQL, Excel, Python, and so on.

Analysis steps: descriptive analysis - lock in a direction - build a model - test the model - iterate and optimize - deploy the model - draw insights and conclusions.

 

1.3.2 Data Description

Describing the data means recording its basic information, including the total number of records, the time span and time granularity, the spatial extent and spatial granularity, the data source, and so on.
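A rough sketch of such a description (assuming pandas; the DataFrame and its column names here are hypothetical) could look like this:

import pandas as pd

# Hypothetical sample records; in practice this would be the crawled or loaded dataset.
df = pd.DataFrame({
    "ts": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-05"]),
    "city": ["Beijing", "Shanghai", "Beijing"],
    "value": [10, 12, 9],
})

print("total records:", len(df))                           # total number of records
print("time span:", df["ts"].min(), "->", df["ts"].max())  # time span
print("spatial extent:", df["city"].unique())              # spatial extent covered
print(df.describe())                                       # basic statistics of numeric columns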

 

1.3.3 Statistical Indicators

When producing a report, the statistical indicators that describe the actual situation of the data can be roughly divided into four categories: change, distribution, comparison, and prediction.
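The following is only a toy illustration of those four categories (assuming pandas and numpy; the column names and figures are invented for the example):

import numpy as np
import pandas as pd

# Hypothetical daily metric for two groups.
df = pd.DataFrame({
    "day": pd.date_range("2020-03-01", periods=6),
    "group": ["A", "B"] * 3,
    "value": [10, 8, 12, 9, 15, 11],
})

daily = df.groupby("day")["value"].sum()

change = daily.diff()                              # change: day-over-day trend
distribution = df["value"].describe()              # distribution: how the values are spread
comparison = df.groupby("group")["value"].mean()   # comparison: group A vs. group B

# Prediction: a naive linear trend extrapolated one step ahead
slope, intercept = np.polyfit(np.arange(len(daily)), daily.values, 1)
forecast = slope * len(daily) + intercept

print(change, distribution, comparison, forecast, sep="\n\n")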

 

1.4 Output layer

 

1.4.1 Description

A complete data report should contain at least the following six elements: the background of the report; the purpose of the report; basic information such as the data sources and data volume; paginated charts with a conclusion on each page; a summary for each part plus a final overall summary; and next-step strategies or trend predictions.

 

2. Simple page crawling

 

2.1 Preparation: the requests library and the User-Agent

Installation: pip install requests

The requests library is built on top of urllib and is a commonly used library for making HTTP requests.

User-Agent: it lets the crawler pretend to be a normal user visiting the target site with a browser when it sends requests to the server.

After installing, confirm that the library is available (see Figure 2.1).

 

 

                          Figure 2.1 Checking the requests library installation
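A minimal sketch of a request that carries a User-Agent header (the header value below is just a typical browser string used for illustration):

import requests

# Send a User-Agent header so the request looks like it comes from a normal browser.
headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0 Safari/537.36"),
}

r = requests.get("http://hao.bigdata.ren/", headers=headers)
print(r.status_code)                    # 200 means the request succeeded
print(r.headers.get("Content-Type"))    # response metadata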

 

2.2 Code implementation

import requests
from bs4 import BeautifulSoup


def get_info(url):
    """Fetch the raw web page content."""
    r = requests.get(url)
    return r.content


def parse_str(content):
    """Parse the page and extract the content we need (link text and href)."""
    soup = BeautifulSoup(content, 'lxml')
    infos = [v.find('a') for v in soup.find_all('li')]
    r = []
    for v in infos:
        try:
            r.append('\t'.join([v.text, v['href']]))
        except (AttributeError, KeyError, TypeError):
            # Skip <li> items without a usable <a> tag or href attribute
            pass
    return '\n'.join(r)


def load_rlt(rlt, filename):
    """Save the results to a file."""
    with open(filename, 'w') as fw:
        fw.write(rlt)


def main():
    url = 'http://hao.bigdata.ren/'
    r = get_info(url)
    rlt = parse_str(r)
    load_rlt(rlt, 'bigdata.csv')


if __name__ == '__main__':
    main()
    print('finished!')

 

The goal is to crawl the link information on the big data navigation page (http://hao.bigdata.ren).

 

  

                               Figure 2.2 The target URL

 

2.2.1 Running the code: method one

Run the code in the Visual Studio Code editor; it generates the custom file bigdata.csv.

 

 

                                               Figure 2.3 Running the code in VS Code

 

2.2.2 Running the code: method two

Run the script from the cmd command line; confirm in advance that the requests library is installed.

First copy the path of the folder containing the .py file, then on the command line:

cd <paste the copied path>

python <script name>.py

 

                               Figure 2.4 Running from the command line

 

 

 

 

 

 

Running the script generates the bigdata.csv file.

 

 

                   Figure 2.5 The file saved successfully to the desktop

 
