A Practical Guide to Getting Started with Python Data Analysis

Python has become a standard language and platform for scientific computing and data analysis. So how does a complete beginner get started with Python data analysis quickly?

Below, the relevant knowledge, skills, and learning resources are organized around the general data analysis workflow.

A typical data analysis workflow looks like this:

  1. Data collection
  2. Data storage and extraction
  3. Data cleaning and preprocessing
  4. Data modeling and analysis
  5. Data visualization

1. Data Collection

Data sources fall into two categories: internal and external. Internal data mainly means the company's own database data, while external data is mostly downloaded from public datasets or collected with a web crawler. (If your analysis only deals with internal data, you can skip this step.)

Public datasets can simply be downloaded, so the real skill to learn in this step is the web crawler. That means mastering Python's basic syntax and learning how to write a crawler in Python.

Python basic syntax: master the core data types (lists, dictionaries, tuples, etc.) and the basics of variables, loops, and functions, so that you can write code fluently and at least avoid syntax errors.
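
As a quick warm-up covering exactly those basics, here is a minimal snippet (the numbers in it are made up for illustration):

```python
# Core data types: list, dictionary, tuple
scores = [72, 85, 90, 66]                  # list
student = {"name": "Alice", "score": 85}   # dictionary
point = (3, 4)                             # tuple

# A simple function with a loop
def average(values):
    total = 0
    for v in values:
        total += v
    return total / len(values)

print(average(scores))         # 78.25
print(student["name"], point)  # Alice (3, 4)
```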

Python crawler skills: learn how to implement a web crawler using mature Python libraries such as urllib, requests, BeautifulSoup, and Scrapy.

Most websites have some form of anti-scraping mechanism, so you also need techniques for handling different sites' anti-scraping policies. These include regular expressions, simulating user login, using proxies, limiting the crawl frequency, carrying cookie information, and so on.
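
As a minimal sketch of a polite crawler with requests and BeautifulSoup, assuming a placeholder URL and a generic selector (real sites need their own selectors, and possibly login or cookie handling):

```python
import time

import requests
from bs4 import BeautifulSoup

# A custom User-Agent helps with simple anti-scraping checks.
HEADERS = {"User-Agent": "Mozilla/5.0 (data-analysis study project)"}

def fetch_link_texts(url):
    """Download one page and extract the text of its links."""
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return [a.get_text(strip=True) for a in soup.select("a")]

if __name__ == "__main__":
    # Placeholder URL; replace with pages you are allowed to crawl.
    for page in ["https://example.com"]:
        print(fetch_link_texts(page))
        time.sleep(1)  # throttle the crawl frequency
```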

Recommended Resources:

2. Data Storage and Extraction

When it comes to data storage, databases are unavoidable. SQL is the most fundamental tool for working with databases, and you must master it. You should also be familiar with the common relational and non-relational databases.

SQL: the four basic operations, CRUD (create, read, update, delete), should be second nature. In the analysis process you constantly need to extract specific data, so being able to write a SQL statement that pulls out exactly the data you want is an essential skill. More complex processing also involves grouping and aggregating data and joining multiple tables, which you must master as well.
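
As a minimal, self-contained sketch of reads and group-by aggregation, using Python's built-in sqlite3 module and a made-up orders table so it runs without a database server:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Create and populate a tiny orders table (made-up example data).
cur.execute("CREATE TABLE orders (id INTEGER, user TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "alice", 30.0), (2, "bob", 12.5), (3, "alice", 7.5)],
)

# Read: extract only the rows you need.
cur.execute("SELECT * FROM orders WHERE amount > 10")
print(cur.fetchall())  # [(1, 'alice', 30.0), (2, 'bob', 12.5)]

# Group and aggregate: total spend per user.
cur.execute("SELECT user, SUM(amount) FROM orders GROUP BY user")
print(cur.fetchall())  # [('alice', 37.5), ('bob', 12.5)]

conn.close()
```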

MySQL and MongoDB: master the basic use of both MySQL and MongoDB and understand the differences between the two databases. Once you have learned these two, other databases can be picked up quickly on that foundation.
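
For contrast with the SQL sketch above, here is roughly the same query and aggregation in MongoDB via the pymongo driver; the connection URI and the demo database and collection names are assumptions, and the snippet needs a running MongoDB server:

```python
from pymongo import MongoClient

# Assumes a local MongoDB server; adjust the URI for your setup.
client = MongoClient("mongodb://localhost:27017")
orders = client["demo"]["orders"]

orders.insert_many([
    {"user": "alice", "amount": 30.0},
    {"user": "bob", "amount": 12.5},
    {"user": "alice", "amount": 7.5},
])

# Equivalent of: SELECT ... WHERE amount > 10
for doc in orders.find({"amount": {"$gt": 10}}):
    print(doc)

# Equivalent of: GROUP BY user with SUM(amount)
pipeline = [{"$group": {"_id": "$user", "total": {"$sum": "$amount"}}}]
print(list(orders.aggregate(pipeline)))
```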

Recommended Resources:

3. Data Cleaning and Preprocessing

The data you get is often dirty: it contains duplicates, missing values, outliers, and the like. At this point the data needs to be cleaned and preprocessed to remove these interfering factors, so that the analysis results are more accurate.

For data preprocessing, we mainly use Python's Pandas library.

Pandas: a library for data processing that provides rich data structures along with functions for manipulating data tables and time series. The main skills are selecting data, handling missing values, removing duplicates, dealing with outliers and whitespace, and the related operations of merging and grouping.
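
As a minimal cleaning sketch on a made-up DataFrame with exactly those problems (a duplicate row, a missing value, an outlier):

```python
import numpy as np
import pandas as pd

# Made-up messy data: a duplicate row, a missing age, an extreme amount.
df = pd.DataFrame({
    "user": ["alice", "bob", "bob", "carol", "dave"],
    "age": [25, 31, 31, np.nan, 28],
    "amount": [30.0, 12.5, 12.5, 7.5, 9999.0],
})

df = df.drop_duplicates()                         # remove duplicated rows
df["age"] = df["age"].fillna(df["age"].median())  # fill missing values

# Treat values beyond the 99th percentile as outliers and drop them.
cap = df["amount"].quantile(0.99)
df = df[df["amount"] <= cap]

print(df)
```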

Recommended Resources:

4. Data Modeling and Analysis

This is the highlight of data analysis. It goes beyond simply processing data and requires some knowledge of probability and statistics as well as machine learning.

Probability and statistics: basic statistics (mean, median, mode, etc.), descriptive statistics (variance, standard deviation), sampling concepts (population and sample, parameters and statistics, etc.), probability distributions and hypothesis testing (the common distributions and the hypothesis-testing procedure), conditional probability, Bayes' theorem, and other probability fundamentals.
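
As a tiny worked example combining descriptive statistics with a hypothesis test, using scipy.stats on a made-up sample:

```python
import numpy as np
from scipy import stats

# Made-up sample of measurements.
sample = np.array([5.1, 4.8, 5.4, 5.0, 4.9, 5.3, 5.2, 4.7])

print("mean:", sample.mean())       # descriptive statistics
print("median:", np.median(sample))
print("std:", sample.std(ddof=1))   # sample standard deviation

# One-sample t-test: is the population mean plausibly 5.0?
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# A large p-value means we fail to reject the null hypothesis.
```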

Machine learning: master the commonly used classification, regression, and clustering algorithms and the principles behind them; understand feature engineering and parameter tuning; and learn Python's analysis packages such as SciPy, NumPy, and scikit-learn, so that you can pick an appropriate algorithm, build a model of the data, and draw conclusions from the analysis (see the sketch after the list below).

  • NumPy: a foundational library that provides not only support for common numeric arrays but also functions for processing those arrays efficiently.
  • SciPy: a scientific computing library that greatly extends NumPy's functionality, with some overlapping features. NumPy and SciPy once shared a code base and later went their separate ways.
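
Here is a minimal scikit-learn sketch of the classification workflow described above; it uses the bundled iris dataset so it runs as-is, and the random forest is just one reasonable default model, not a recommendation from the original guide:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small built-in dataset and split it for evaluation.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit a classifier; n_estimators is one tunable hyperparameter.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on held-out data.
pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
```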

As you work through more practical projects, you will gradually learn how to choose an algorithm or model for different kinds of problems, and understand how feature extraction and parameter tuning improve prediction accuracy.

Recommended Resources:

5. Data Visualization

Data visualization relies mainly on Python's Matplotlib and Seaborn: present the analysis results from the steps above visually, and produce the analysis report (a plotting sketch follows the list).

  • Matplotlib: a 2D plotting library with good support for rendering graphs and images. Matplotlib is part of the SciPy ecosystem and works on top of NumPy.
  • Seaborn: a Python visualization package built on top of Matplotlib. It provides a high-level interface that makes it easy to draw a variety of attractive charts.
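
As a minimal sketch with made-up data, saving the chart to a file for the report:

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Made-up data: y is a noisy linear function of x.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)

# Seaborn builds on Matplotlib, so the two mix freely.
sns.scatterplot(x=x, y=y)
plt.title("Made-up sample: y vs. x")
plt.xlabel("x")
plt.ylabel("y")
plt.savefig("scatter.png")  # write the chart out for the report
```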

Recommended Resources:

If you follow the path above and complete it step by step, you can basically meet the requirements of an entry-level data analyst. But don't forget: after mastering the basics, keep practicing on real projects; only hands-on work will keep your skills improving.

Finally, here are some recommended practice projects:

The projects above come from Shiyanlou's "Lou+ Data Analysis and Mining in Practice" course.
