Python 3 Data Analysis and Mining in Practice
Introduction to Python for Data Analysis
Getting Started with Python
Run: in cmd, type "python hello.py"
Basic commands:
Third-party libraries
Installation
On Windows:
pip install numpy
Or download the source and run:
python setup.py install
A default pandas installation cannot read or write Excel files; install the xlrd and xlwt libraries to add Excel support:
pip install xlrd
pip install xlwt
StatsModels can be installed with pip; note that it depends on the pandas and patsy libraries.
Scikit-Learn is a machine learning library, but it does not include artificial neural networks.
model.fit() # train the model: fit(X, y) for supervised models, fit(X) for unsupervised models
# Supervised model interface
model.predict(X_new) # predict new samples
model.predict_proba(X_new) # predict class probabilities
model.score() # the higher the score, the better the fit
# Unsupervised model interface
model.transform() # map data into the new "basis space" learned from the data
model.fit_transform() # learn a new basis from the data and transform the data according to that basis
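The interface above can be tried on a tiny data set. The following is a minimal sketch, not a recommended workflow: the toy arrays are made up, and LogisticRegression and PCA are chosen here only as convenient examples of a supervised and an unsupervised estimator.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Toy data, made up for illustration: two features and a binary label.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
              [1.0, 1.1], [0.9, 1.2], [1.2, 0.8]])
y = np.array([0, 0, 0, 1, 1, 1])
X_new = np.array([[0.05, 0.05], [1.1, 1.0]])

# Supervised model: fit(X, y), then predict, predict_proba and score.
clf = LogisticRegression()
clf.fit(X, y)
labels = clf.predict(X_new)        # predicted class for each new sample
proba = clf.predict_proba(X_new)   # one probability per class, rows sum to 1
accuracy = clf.score(X, y)         # mean accuracy on the given data

# Unsupervised model: fit(X) learns a new basis, transform projects onto it.
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)       # learn the basis and transform X
X_new_reduced = pca.transform(X_new)   # reuse the learned basis on new data
```

Note that score() takes the data to evaluate on; for classifiers it returns mean accuracy.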
Keras is an enhanced deep learning library built on Theano. It can be used to build ordinary neural networks as well as various deep learning models, such as autoencoders, recurrent neural networks, recursive neural networks, and convolutional neural networks. Theano is also a Python library. It performs symbolic computation efficiently, is fast and stable, and supports GPU acceleration, which on data-intensive processing is about 10 times faster than the CPU. Its drawback is a steep learning curve. Keras runs much slower on Windows.
On Windows: install MinGW, then Theano, then Keras, then install and configure CUDA.
Gensim handles language processing tasks such as text similarity calculation, LDA, and Word2Vec; it is recommended to run it under Windows.
On Linux:
sudo apt-get install python-numpy
sudo apt-get install python-scipy
sudo apt-get install python-matplotlib
Usage
Matplotlib's default font only covers English; to use Chinese labels, set:
plt.rcParams['font.sans-serif'] = ['SimHei']
When an image is saved, the minus sign may not display correctly; to fix this, set:
plt.rcParams['axes.unicode_minus'] = False
Data exploration
Dirty data: missing values, outliers, inconsistent values, duplicate data
Outlier analysis
- Simple statistical analysis: values that fall outside a reasonable range
- 3-sigma rule: if the data are normally distributed, an outlier is a value that deviates from the mean by more than three standard deviations; otherwise, describe a value by how many standard deviations it lies from the mean.
- Box plot analysis: outliers are values less than Q_L - 1.5 IQR or greater than Q_U + 1.5 IQR. Q_L is the lower quartile: one quarter of all the data are smaller than it. Q_U is the upper quartile. IQR is the interquartile range, IQR = Q_U - Q_L
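The 3-sigma rule and the box plot rule can be sketched with NumPy as follows. The function names and the sample data are made up for illustration, and np.percentile with its default linear interpolation is only one of several quartile conventions.

```python
import numpy as np

def three_sigma_outliers(values):
    """Return the values lying more than three standard deviations from the mean."""
    x = np.asarray(values, dtype=float)
    mean, std = x.mean(), x.std(ddof=1)  # sample standard deviation
    return x[np.abs(x - mean) > 3 * std]

def iqr_outliers(values):
    """Return the values below Q_L - 1.5 IQR or above Q_U + 1.5 IQR."""
    x = np.asarray(values, dtype=float)
    q_l, q_u = np.percentile(x, [25, 75])
    iqr = q_u - q_l
    return x[(x < q_l - 1.5 * iqr) | (x > q_u + 1.5 * iqr)]

# Made-up sample: nineteen ordinary readings and one suspicious value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
        12, 11, 12, 10, 13, 12, 11, 12, 11, 95]

print(three_sigma_outliers(data))
print(iqr_outliers(data))
```

On small samples the 3-sigma rule can miss outliers, because a single extreme value inflates the standard deviation; the box plot rule is less sensitive to this.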
Distribution Analysis
Distribution analysis of quantitative data: find the range (max - min), decide the class width and the number of classes, decide the class boundaries, tabulate the frequency distribution, and plot the frequency histogram.
Distribution analysis of qualitative data: pie charts or bar charts
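For quantitative data, the steps above (range, number of classes, class boundaries, frequency table) can be sketched with NumPy. The sales figures and the choice of four classes are assumptions made only for this example.

```python
import numpy as np

# Made-up daily sales figures, used only for illustration.
sales = np.array([12, 45, 78, 23, 56, 89, 34, 67, 90, 11, 48, 52, 61, 73, 27])

value_range = sales.max() - sales.min()  # range = max - min
n_classes = 4                            # number of classes, chosen arbitrarily here
class_width = value_range / n_classes

# np.histogram builds equal-width class boundaries and counts each class.
counts, boundaries = np.histogram(sales, bins=n_classes)
frequencies = counts / counts.sum()

# Frequency distribution table; note that the last class also includes
# its right boundary.
for left, right, count, freq in zip(boundaries[:-1], boundaries[1:], counts, frequencies):
    print(f"{left:6.2f} to {right:6.2f}: count={count}, frequency={freq:.1%}")
```

Passing the same data and bins to plt.hist would draw the corresponding frequency histogram.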
Comparative analysis
Statistical analysis
Measures of central tendency: mean, median, mode
Measures of dispersion: range, standard deviation, coefficient of variation, interquartile range
Coefficient of variation:
$$
CV = \frac{s}{\bar{x}} \times 100\%
$$
where $s$ is the standard deviation and $\bar{x}$ is the mean.
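These measures can all be computed with Python's standard statistics module. The sample values below are made up, and the "inclusive" quartile method is one convention among several.

```python
import statistics

data = [2, 4, 4, 5, 7, 8]  # made-up sample

# Measures of central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Measures of dispersion
value_range = max(data) - min(data)
std = statistics.stdev(data)  # sample standard deviation
cv = std / mean               # coefficient of variation, as a fraction
q_l, _, q_u = statistics.quantiles(data, n=4, method="inclusive")
iqr = q_u - q_l               # interquartile range

print(mean, median, mode, value_range, round(std, 4), round(cv, 4), iqr)
```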
Periodic analysis
Contribution analysis
Also known as Pareto analysis. It rests on the Pareto principle, also called the 20/80 rule: the same investment placed in different areas produces different returns.
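A contribution analysis usually ranks items by their contribution and looks at the cumulative share. A minimal sketch, with made-up item names and profit figures and an assumed 80% threshold:

```python
from itertools import accumulate

# Hypothetical profit per item, for illustration only.
profit = {"A": 500, "B": 350, "C": 80, "D": 40, "E": 20, "F": 10}

# Rank items from the largest contribution to the smallest.
ranked = sorted(profit.items(), key=lambda item: item[1], reverse=True)
total = sum(profit.values())

# Cumulative share of the total after each ranked item.
cumulative_share = [c / total for c in accumulate(value for _, value in ranked)]

# Collect items until their combined share reaches 80% of the total.
key_items = []
for (name, _), share in zip(ranked, cumulative_share):
    key_items.append(name)
    if share >= 0.8:
        break

print(key_items)
```

Here two of the six items already account for at least 80% of the total profit.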
Correlation analysis
Methods: draw a scatter plot, draw a scatter plot matrix, compute the correlation coefficient
Pearson correlation coefficient: requires continuous variables whose values are normally distributed.
$$
\begin{cases}
|r| \leq 0.3 & \text{no linear correlation} \\
0.3 < |r| \leq 0.5 & \text{low linear correlation} \\
0.5 < |r| \leq 0.8 & \text{significant linear correlation} \\
0.8 < |r| \leq 1 & \text{high linear correlation}
\end{cases}
$$
The correlation coefficient r lies in the range [-1, 1].
Spearman correlation coefficient: for variables that are not normally distributed, or for the association between categorical or ordinal variables, the Spearman coefficient can be used instead. It is also known as the rank correlation coefficient.
Sort each of the two variables in ascending order to obtain the ranks. R_i is the rank of x_i, and Q_i is the rank of y_i.
Coefficient of determination: the square of the correlation coefficient; it measures how well the regression equation explains y.
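The three coefficients can be sketched with NumPy as follows. The Spearman function uses the standard rank-difference formula 1 - 6 * sum((R_i - Q_i)^2) / (n (n^2 - 1)), which is valid only when there are no tied values; the function names and the sample data are made up for illustration.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def ranks(values):
    """Ranks 1..n in ascending order; assumes there are no ties."""
    order = np.argsort(values)
    result = np.empty(len(values))
    result[order] = np.arange(1, len(values) + 1)
    return result

def spearman(x, y):
    """Spearman rank correlation via the rank-difference formula (no ties)."""
    r, q = ranks(x), ranks(y)
    n = len(x)
    return float(1 - 6 * ((r - q) ** 2).sum() / (n * (n ** 2 - 1)))

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # monotonic in x, but not linear

r = pearson(x, y)
r_s = spearman(x, y)
r_squared = r ** 2     # coefficient of determination

print(round(r, 4), r_s, round(r_squared, 4))
```

Because y increases monotonically with x, the Spearman coefficient is exactly 1 even though the Pearson coefficient is slightly below 1.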
Data exploration function
E-commerce website user behavior analysis and recommendation service
Data extraction: set up the database, import the data, and set up a Python environment for working with the database
Data analysis
- Page type analysis
- Click count analysis
- Page ranking
Data preprocessing
- Data cleaning: delete data (intermediate-page URLs, publish-success URLs, login-assistant pages)
- Data transformation: identify paginated URLs and deduplicate them; manually reclassify misclassified URLs, then classify further
- Attribute reduction: keep only the user and the web pages the user selected
Model building
Item-based collaborative filtering: compute the similarity between items and build a similarity matrix; then generate a recommendation list for each user from the item similarities and the user's historical behavior.
Similarity measures: cosine similarity, Jaccard coefficient, correlation coefficient
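A minimal sketch of item-based collaborative filtering with cosine similarity is shown below. The user-item matrix, the function names, and the scoring rule (sum of similarities to the items the user has already viewed) are assumptions made for illustration, not the exact procedure of the case study.

```python
import numpy as np

# Rows are users, columns are items; 1 means the user viewed the item.
# Made-up data; every item is assumed to have at least one view.
R = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
], dtype=float)

def item_cosine_similarity(R):
    """Item-by-item cosine similarity matrix with a zeroed diagonal."""
    norms = np.linalg.norm(R, axis=0)
    sim = (R.T @ R) / np.outer(norms, norms)
    np.fill_diagonal(sim, 0.0)  # an item should not recommend itself
    return sim

def recommend(R, sim, user, k=2):
    """Return up to k unseen item indices, ranked by similarity to the user's history."""
    scores = sim @ R[user]             # interest score for every item
    scores[R[user] > 0] = -np.inf      # exclude items the user already viewed
    top = np.argsort(scores)[::-1][:k]
    return [int(i) for i in top if np.isfinite(scores[i])]

sim = item_cosine_similarity(R)
print(recommend(R, sim, user=0))
```

For user 0, who viewed items 0 and 1, item 2 ranks first because it is similar to both viewed items.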
Analysis of Factors Affecting Fiscal Revenue and Prediction Model
Data analysis
- Descriptive statistical analysis
- Correlation analysis
Model building
For fiscal revenue, value-added tax, business tax, corporate income tax, government funds, and personal income tax:
- Adaptive-Lasso variable selection model: remove irrelevant variables
- Build a grey prediction model and a neural network model separately
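The notes do not say which grey model is used. The sketch below assumes the common GM(1,1) form; the function name gm11_forecast and the sample series are made up for illustration.

```python
import numpy as np

def gm11_forecast(x0, steps=1):
    """Forecast the next `steps` values of a positive series with GM(1,1)."""
    x0 = np.asarray(x0, dtype=float)
    n = len(x0)
    x1 = np.cumsum(x0)                 # accumulated series
    z1 = 0.5 * (x1[1:] + x1[:-1])      # background values (adjacent means)

    # Least-squares estimate of a and b in x0(k) + a * z1(k) = b.
    B = np.column_stack([-z1, np.ones(n - 1)])
    a, b = np.linalg.lstsq(B, x0[1:], rcond=None)[0]

    # Time response of the accumulated series (assumes a != 0),
    # then differencing restores the original scale.
    k = np.arange(n + steps)
    x1_hat = (x0[0] - b / a) * np.exp(-a * k) + b / a
    x0_hat = np.concatenate([[x0[0]], np.diff(x1_hat)])
    return x0_hat[n:]

# Made-up series that grows by about 10% per period.
series = [100, 110, 121, 133.1, 146.41]
forecast = gm11_forecast(series, steps=1)
print(forecast)
```

GM(1,1) suits short series that grow roughly exponentially; in practice the fit is usually checked before the forecast is used, for example with a posterior variance test.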
Business district analysis based on base-station positioning data
Data preprocessing
- Attribute reduction: delete redundant attributes and merge the time attributes
- Data transformation: compute indicators such as dwell time on working days, in the early morning, and on weekends, daily figures, and so on, then standardize them.
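The notes do not say which standardization is applied. The sketch below shows two common choices, min-max scaling and z-score standardization, on a made-up indicator matrix; the function names are hypothetical.

```python
import numpy as np

# Rows are base stations, columns are indicators (made-up values).
X = np.array([
    [78.0, 521.0, 602.0, 2863.0],
    [144.0, 600.0, 521.0, 2245.0],
    [95.0, 457.0, 468.0, 1283.0],
    [69.0, 596.0, 695.0, 1054.0],
])

def min_max_scale(X):
    """Scale each column to [0, 1]; assumes no column is constant."""
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def z_score(X):
    """Give each column zero mean and unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X_minmax = min_max_scale(X)
X_z = z_score(X)
print(X_minmax.round(3))
```

Either way, the indicators end up on comparable scales, which matters for distance-based methods such as clustering.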
Model building
- Build the business district clustering model: hierarchical clustering algorithm
- Model analysis: examine the features of each cluster in the clustering results
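A minimal sketch of hierarchical clustering with SciPy is shown below. The standardized indicator matrix is made up, and Ward linkage and a cut into two clusters are assumptions chosen for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Made-up standardized indicators: rows are base stations.
X = np.array([
    [0.10, 0.20, 0.90],
    [0.20, 0.10, 0.80],
    [0.15, 0.25, 0.85],
    [0.90, 0.80, 0.10],
    [0.80, 0.90, 0.20],
    [0.85, 0.75, 0.15],
])

# Build the merge tree, then cut it into at most two clusters.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")

# Inspect each cluster through the mean of its indicators.
for cluster_id in np.unique(labels):
    members = X[labels == cluster_id]
    print(cluster_id, members.mean(axis=0).round(2))
```

Comparing the per-cluster means is one simple way to describe the features of each business district type.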
Sentiment analysis of e-commerce product review data
Text collection: Octopus collector (a crawler tool)
Text preprocessing:
- Text deduplication: automatic reviews, completely duplicated reviews, copied reviews
- Mechanical compression to remove repeated words
- Delete short sentences
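The three preprocessing steps can be sketched with the standard library. This is a simplified illustration: the function names, the sample reviews, and the minimum length of 4 characters are assumptions, and the regular expression is a crude stand-in for real mechanical compression rules.

```python
import re

def deduplicate(comments):
    """Drop exact duplicate comments while keeping first-seen order."""
    return list(dict.fromkeys(comments))

def compress_repeats(text):
    """Collapse any fragment that is repeated back to back into one copy."""
    return re.sub(r"(.+?)\1+", r"\1", text)

def drop_short(comments, min_length=4):
    """Remove comments shorter than min_length characters."""
    return [c for c in comments if len(c) >= min_length]

# Made-up reviews, for illustration only.
comments = ["很好很好很好，推荐", "很好很好很好，推荐", "好好好", "物流很快，安装方便"]

cleaned = drop_short([compress_repeats(c) for c in deduplicate(comments)])
print(cleaned)
```

The regular expression also collapses legitimate doubled words such as "谢谢", so a real pipeline would need more careful rules about where compression is allowed.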
Review word segmentation: use the Python Chinese word segmentation package "jieba", whose accuracy is above 97%.
Model building
- Sentiment orientation model: generate word vectors; manually annotate and map a subset of the review set; train a stacked autoencoder network