Python data analysis tools - Pandas, StatsModels, Scikit-Learn

Pandas

Pandas is one of the most powerful data analysis and exploration tools in Python. It provides advanced data structures and convenient tools that make working with data fast and easy. Pandas is built on top of NumPy, which makes it a natural fit for NumPy-centric applications. It supports SQL-like addition, deletion, querying and modification of data, offers a rich set of data-processing functions, supports time series analysis, and handles missing data flexibly.
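As a quick sketch of the SQL-like querying and missing-data handling mentioned above (the column names and values here are invented purely for illustration):

```python
import numpy as np
import pandas as pd

# Toy data with one missing value
df = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [4.0, 5.0, 6.0]})

filled = df.fillna(0)     # replace missing values with 0
dropped = df.dropna()     # or drop rows that contain missing values
subset = df[df['y'] > 4]  # SQL-like filtering (WHERE y > 4)
```

The same filter can also be written as `df.query('y > 4')`, which reads even more like SQL.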

Installing Pandas is relatively easy: once NumPy is installed, you can install Pandas directly, either with pip install pandas or by downloading the source code and running python setup.py install. Since we frequently read and write Excel files, and Pandas alone cannot do so by default, we also need the xlrd (read) and xlwt (write) libraries to add Excel support, as follows:

pip install xlrd #Add Excel reading support to Python

pip install xlwt #Add Excel writing support to Python

The basic data structures of Pandas are Series and DataFrame. As the names imply, a Series is a sequence, similar to a one-dimensional array; a DataFrame is equivalent to a two-dimensional table, similar to a two-dimensional array, and each of its columns is a Series. To locate the elements of a Series, Pandas provides the Index object: every Series has a corresponding Index that labels its elements. Index values are not necessarily numbers; they can also be letters, Chinese characters, and so on, similar to a primary key in SQL.

Similarly, a DataFrame is equivalent to a combination of several Series that share the same Index, with a unique column header identifying each Series. For example:

# -*- coding:utf-8 -*-

import pandas as pd #pd is the usual alias for pandas

s = pd.Series([1,2,3], index=['a','b','c']) #Create a Series s

d = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c']) #Create a table

d2 = pd.DataFrame(s) #A table can also be created from an existing Series

print(d.head()) #Preview the first 5 rows of data

print(d.describe()) #Basic statistics of the data

pd.read_excel('data.xls') #Read an Excel file and create a DataFrame

pd.read_csv('data.csv', encoding='utf-8') #Read text-format data; the encoding is usually specified explicitly
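To make the role of the Index and column headers concrete, here is a short sketch of label-based selection, reusing the same s and d as above:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
d = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])

print(s['b'])         # select an element of the Series by its Index label
print(d['a'])         # each column of a DataFrame is itself a Series
print(d.loc[0, 'c'])  # select a single cell by row label and column header
```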

StatsModels

Pandas focuses on reading, processing and exploring data, while StatsModels focuses on statistical modelling and analysis, giving Python something of the flavour of the R language. StatsModels can exchange data with Pandas, and together they form a powerful data mining combination in Python.

Installing StatsModels is fairly simple, either via pip or from source. For Windows users, the official website even provides compiled exe installers. If you install it manually, you need to resolve the dependencies yourself: StatsModels depends on Pandas (and therefore also on NumPy), as well as on patsy (a library for describing statistical models).

The following is an example of an ADF stationarity test using StatsModels.

# -*- coding: utf-8 -*-

from statsmodels.tsa.stattools import adfuller as ADF #Import the ADF (augmented Dickey-Fuller) test

import numpy as np

ADF(np.random.rand(100)) #The returned result includes the ADF statistic and the p-value

Scikit-Learn

Scikit-Learn is a powerful machine learning toolkit for Python, providing a complete toolbox that covers data preprocessing, classification, regression, clustering, prediction and model analysis. Scikit-Learn depends on NumPy, SciPy and Matplotlib, so you only need to install these libraries in advance and then install Scikit-Learn; this generally causes no problems. The installation method is the same as before: either pip install scikit-learn, or download the source code and install it yourself.

Creating a machine learning model is simple:

# -*- coding:utf-8 -*-

from sklearn.linear_model import LinearRegression #Import the linear regression model

model = LinearRegression() #Build a linear regression model

print(model)

1) The interface provided by all models:

model.fit(): train the model; it is fit(X, y) for supervised models and fit(X) for unsupervised models.

2) The interfaces provided by supervised models:

model.predict(X_new): predict new samples

model.predict_proba(X_new): predict probabilities; only available for some models (such as logistic regression)

model.score(): the higher the score, the better the fit

3) The interfaces provided by unsupervised models:

model.transform(): transform the data into the new "basis" learned from the data

model.fit_transform(): learn the new basis from the data and transform the data according to it, in one step.
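The interfaces listed above can be exercised with a tiny sketch. The toy data y = 2x is invented for illustration, and StandardScaler stands in as an example of an unsupervised transformer:

```python
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Toy supervised data: y = 2x
X = [[1], [2], [3], [4]]
y = [2, 4, 6, 8]

model = LinearRegression()
model.fit(X, y)              # supervised: fit(X, y)
pred = model.predict([[5]])  # predict a new sample
r2 = model.score(X, y)       # higher score = better fit (R^2 for regression)

scaler = StandardScaler()
Xs = scaler.fit_transform(X) # unsupervised: learn the scaling and transform in one step
```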

Scikit-Learn itself ships with some example data; the better-known ones include Anderson's iris dataset, the handwritten digits dataset, and so on. Let's now write a simple machine learning example using the iris dataset. For background on this dataset, you can read "R Language Data Mining Practice - Introduction to Data Mining".

# -*- coding:utf-8 -*-

from sklearn import datasets #Import the datasets module

iris = datasets.load_iris() #Load the dataset

print(iris.data.shape) #View the dataset size

from sklearn import svm #Import the SVM model

clf = svm.LinearSVC() #Build a linear SVM classifier

clf.fit(iris.data, iris.target) #Train the model with the data

clf.predict([[5.0,3.6,1.3,0.25]]) #After training, input new data for prediction

clf.coef_ #View the parameters of the trained model
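A natural extension of the example above is to hold out part of the data to measure accuracy on unseen samples. The split ratio and max_iter value below are arbitrary choices for illustration:

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Hold out 30% of the samples as a test set
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

clf = svm.LinearSVC(max_iter=10000)  # a higher max_iter helps convergence
clf.fit(X_train, y_train)            # train only on the training split
acc = clf.score(X_test, y_test)      # accuracy on the held-out data
print(acc)
```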
