I. Python data crawling
1. Understanding the data analysis approach
Figure 1.1: The four layers of the analysis approach
1.1 Requirements layer
1.1.1 Description
Requirements are the starting point of data analysis; they also set the direction the analysis should take.
In principle, the deliverable of data analysis is a data-backed conclusion that serves the requirement, not the solution itself.
1.1.2 Sources of requirements
Scenario 1: an existing monitored metric shows an anomaly, and the cause needs to be analyzed from the data.
Scenario 2: the company wants to evaluate an existing business model or product to decide whether it needs adjustment or optimization.
Scenario 3: the company has set a short-term strategic objective or target and needs analysis to work out how to achieve it.
1.1.3 Skills for handling requirements
1. A reasonably deep understanding of the business, the product, and their background, enough to guide you in judging the requirement.
2. Understanding the requirement alone is not enough; you need to quickly combine the tools and skills you have mastered to form a preliminary analysis approach for the requirement.
3. After weighing everything, decide whether the analysis is needed, how it should be carried out, and whether the requesting side agrees.
1.2 Data Layer
1.2.1 Description
The data layer is roughly divided into: data acquisition, data cleaning, and data organization.
Making good use of databases is the real core of big data analysis.
1.2.2 Big data
Definition: data sets so large and complex that they cannot be captured, stored, searched, shared, analyzed, or processed with existing software tools.
Value of mining big data:
1. Segment customers and provide customized services for each specific group;
2. Simulate real-world conditions to uncover new demands while increasing the return on investment;
3. Strengthen linkages between departments and improve the efficiency of the whole management and industrial chain;
4. Reduce service costs and discover hidden clues for innovative products and services.
1.3 Analysis layer
1.3.1 Description
Tools to master throughout the analysis include SQL, Excel, Python, and so on.
Analysis steps: descriptive analysis → locking in a direction → modeling analysis → model testing → iterative optimization → model deployment → insights and conclusions.
1.3.2 Data Description
Describe the basic profile of the data, including: total number of records, time span, time granularity, spatial extent, spatial granularity, data sources, and so on.
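As a minimal sketch of such a profile (the records and fields below are hypothetical, purely for illustration), the basic information can be computed directly with the standard library:

```python
from datetime import datetime

# Hypothetical sample records: (date, city, value) — illustrative data only
records = [
    ('2021-01-01', 'Beijing', 12),
    ('2021-01-15', 'Shanghai', 7),
    ('2021-03-02', 'Beijing', 9),
]

def describe(rows):
    """Summarize the basic profile: record count, time span, spatial extent."""
    dates = [datetime.strptime(r[0], '%Y-%m-%d') for r in rows]
    cities = {r[1] for r in rows}
    return {
        'total_records': len(rows),
        'time_span_days': (max(dates) - min(dates)).days,
        'spatial_extent': sorted(cities),
    }

print(describe(records))
```

A real profile would also record the time and spatial granularity (here: daily, city-level) and the data source.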
1.3.3 Statistical Indicators
For reporting purposes, the indicators describing the actual situation of the data can be roughly divided into four categories: change, distribution, comparison, and prediction.
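The four categories can be illustrated with a toy calculation (the monthly sales figures below are made up for illustration; the "prediction" here is a deliberately naive average, not a real forecasting method):

```python
# Hypothetical monthly sales figures — illustrative data only
sales = {'Jan': 100, 'Feb': 120, 'Mar': 90}

# Change: month-over-month growth rate
feb_change = (sales['Feb'] - sales['Jan']) / sales['Jan']

# Distribution: each month's share of the total
total = sum(sales.values())
shares = {m: v / total for m, v in sales.items()}

# Comparison: February vs. March
feb_vs_mar = sales['Feb'] - sales['Mar']

# Prediction (naive): simple average as a next-month forecast
forecast_apr = total / len(sales)

print(feb_change, feb_vs_mar, round(forecast_apr, 1))
```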
1.4 Output layer
1.4.1 Description
A complete data report should contain at least the following six elements: the report background; the report purpose; basic information such as data sources and record counts; paginated charts with conclusions on each page; a summary for each part; and a final overall summary with next-step strategies or trend predictions.
2. Simple page crawling
2.1 Preparation: the requests library and the User-Agent
Installation: pip install requests
The requests library is built on top of urllib and is a commonly used library for HTTP requests.
User-Agent: lets the crawler pose as a normal user visiting the target site through a browser when it sends requests to the server.
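A minimal sketch of attaching a browser-like User-Agent with requests (the UA string below is one common example; any recent browser string works). Building the request with requests.Request(...).prepare() lets us inspect the headers without actually contacting the server:

```python
import requests

# A common desktop-browser User-Agent string (illustrative; any recent browser UA works)
HEADERS = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/91.0.4472.124 Safari/537.36')
}

def build_request(url):
    """Prepare a GET request carrying a browser-like User-Agent header."""
    return requests.Request('GET', url, headers=HEADERS).prepare()

req = build_request('http://hao.bigdata.ren/')
# The request now identifies itself as a browser rather than python-requests
print(req.headers['User-Agent'])
```

To actually send such a request, pass the same dict to requests.get(url, headers=HEADERS).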
After a successful installation:
Figure 2.1: Verifying the requests library
2.2 Code implementation
import requests
from bs4 import BeautifulSoup

def get_info(url):
    """Fetch the raw content of the web page."""
    r = requests.get(url)
    return r.content

def parse_str(content):
    """Parse the page and extract the content we need."""
    soup = BeautifulSoup(content, 'lxml')  # the 'lxml' parser requires: pip install lxml
    infos = [v.find('a') for v in soup.find_all('li')]
    r = []
    for v in infos:
        try:
            # keep 'link text <TAB> link URL' for each list item that has a link
            r.append('\t'.join([v.text, v['href']]))
        except (AttributeError, KeyError):
            pass  # skip list items without a usable <a href=...>
    return '\n'.join(r)

def load_rlt(rlt, filename):
    """Save the results to a file."""
    with open(filename, 'w') as fw:
        fw.write(rlt)

def main():
    url = 'http://hao.bigdata.ren/'
    r = get_info(url)
    rlt = parse_str(r)
    load_rlt(rlt, 'bigdata.csv')

if __name__ == '__main__':
    main()
    print('finished!')
# The goal is to crawl the URL information on the big data navigation page (http://hao.bigdata.ren)
Figure 2.2: The URLs to be crawled
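Once the script has run, the saved file can be spot-checked. The sketch below writes a one-line sample in the same tab-separated 'title TAB href' format the crawler produces (the row is made up for illustration) and reads it back with the csv module:

```python
import csv

# Write a one-line sample in the crawler's tab-separated format (illustrative row)
with open('bigdata.csv', 'w', newline='') as fw:
    fw.write('Example Site\thttp://example.com\n')

# Read it back, telling the csv module the delimiter is a tab
with open('bigdata.csv') as f:
    rows = list(csv.reader(f, delimiter='\t'))

for title, href in rows:
    print(title, '->', href)
```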
2.2.1 Running the code: method one
Run the code in the Visual Studio Code editor; it generates the custom file bigdata.csv.
Figure 2.3: Running the code in VS Code
2.2.2 Running the code: method two
Run it from the cmd command line; confirm in advance that the requests library is installed.
First copy the directory path of the .py file, then at the command line:
cd <directory of the .py file>   (paste the copied path)
python <filename>.py
Figure 2.4: Running from the command line
Running the code generates the bigdata.csv file.
Figure 2.5: bigdata.csv successfully saved to the desktop