Pandas
Pandas is one of the most powerful data analysis and exploration tools in Python. It provides advanced data structures and convenient tools that make working with data fast and easy. Pandas is built on top of NumPy, so it integrates smoothly with NumPy-centric applications. It supports SQL-like operations for adding, deleting, querying, and modifying data; offers rich data-processing functions; supports time-series analysis; and handles missing data flexibly.
Installing Pandas is relatively easy. Once NumPy is installed, you can install Pandas directly, either via pip install pandas or by downloading the source code and running python setup.py install. Since we frequently read and write Excel files, and Pandas alone cannot do so, we also need to install the xlrd (read) and xlwt (write) libraries to support Excel reading and writing, as follows:
pip install xlrd #Adds Excel-reading support to Python
pip install xlwt #Adds Excel-writing support to Python
The basic data structures of Pandas are Series and DataFrame. As the names imply, a Series is a sequence, similar to a one-dimensional array; a DataFrame is equivalent to a two-dimensional table, similar to a two-dimensional array, and each of its columns is a Series. To locate elements in a Series, Pandas provides the Index object: each Series has a corresponding Index that labels its elements. Index values are not necessarily numbers; they can also be letters, Chinese characters, and so on, much like a primary key in SQL.
Likewise, a DataFrame can be viewed as a combination of several Series sharing the same Index, with each Series identified by a unique column header. For example:
# -*- coding:utf-8 -*-
import pandas as pd #Usually use pd as an alias for pandas.
s=pd.Series([1,2,3], index=['a','b','c']) #Create a sequence s
d=pd.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c']) #Create a table
d2=pd.DataFrame(s) #You can also use an existing sequence to create a table
print(d.head()) #Preview the first 5 rows of data
print(d.describe()) #Data basic statistics
pd.read_excel('data.xls') #Read an Excel file and create a DataFrame
pd.read_csv('data.csv', encoding='utf-8') #Read text-format data; use encoding to specify the file encoding
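Since flexible handling of missing data is one of the strengths mentioned above, a minimal sketch of the two most common approaches (the file names and values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# A small DataFrame with one missing value (np.nan)
d = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                  'b': [4.0, 5.0, 6.0]})

dropped = d.dropna()    # drop any row that contains a missing value
filled = d.fillna(0)    # replace missing values with 0

print(dropped.shape)           # (2, 2)
print(filled['a'].tolist())    # [1.0, 0.0, 3.0]
```

fillna also accepts per-column values or interpolation via d.interpolate(), depending on what the missing data represents.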
StatsModels
Pandas focuses on reading, processing, and exploring data, while StatsModels pays more attention to statistical modeling and analysis, giving Python some of the flavor of the R language. StatsModels supports data interchange with Pandas, so combined with Pandas it becomes a powerful data-mining toolset in Python.
Installing StatsModels is fairly simple, either via pip or from source. For Windows users, the official website even provides compiled exe installers for download. If you install it manually, you need to resolve the dependencies yourself: StatsModels depends on Pandas (and hence also on NumPy), and it also depends on patsy (a library for describing statistical models).
The following is an example of an ADF stationarity test using StatsModels:
# -*- coding: utf-8 -*-
from statsmodels.tsa.stattools import adfuller as ADF #Import the ADF test
import numpy as np
ADF(np.random.rand(100)) #The returned result includes the ADF statistic and the p-value
Scikit-Learn
Scikit-Learn is a powerful machine learning toolkit for Python, providing a complete toolbox that includes data preprocessing, classification, regression, clustering, prediction, and model analysis. Scikit-Learn depends on NumPy, SciPy, and Matplotlib, so you only need to install those libraries in advance and then install Scikit-Learn; this usually poses no problem. The installation method is the same as before: either pip install scikit-learn, or download the source code and install it yourself.
Creating a machine learning model is simple:
# -*- coding:utf-8 -*-
from sklearn.linear_model import LinearRegression #Import the linear regression model
model = LinearRegression() #Build a linear regression model
print(model)
1) The interface provided by all models:
model.fit(): trains the model; fit(X, y) for supervised models and fit(X) for unsupervised models.
2) The interfaces provided by supervised models:
model.predict(X_new): predicts new samples
model.predict_proba(X_new): predicts class probabilities; only available for some models (such as logistic regression)
model.score(): the higher the score, the better the fit
3) The interfaces provided by unsupervised models:
model.transform(): transforms data into the new "base space" learned from the data
model.fit_transform(): learns the new basis from the data and transforms the data according to it
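The supervised interfaces above can be illustrated with the LinearRegression model just created. A minimal sketch on made-up data (X and y here are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])   # exactly y = 2x

model = LinearRegression()
model.fit(X, y)                       # train the model
pred = model.predict(np.array([[5.0]]))  # predict a new sample
r2 = model.score(X, y)                # R^2 score; 1.0 means a perfect fit

print(round(float(pred[0]), 2))   # 10.0
print(round(r2, 4))               # 1.0
```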
Scikit-Learn itself provides some example datasets; the more common ones are Anderson's iris dataset and the handwritten digits dataset. Below is a simple machine learning example using the iris dataset. For more on this dataset, see "R Language Data Mining Practice - Introduction to Data Mining".
# -*- coding:utf-8 -*-
from sklearn import datasets #Import datasets
iris= datasets.load_iris() #Load dataset
print(iris.data.shape) #View dataset size
from sklearn import svm #Import SVM model
clf=svm.LinearSVC() #Build a linear SVM classifier
clf.fit(iris.data,iris.target) #train the model with data
clf.predict([[5.0,3.6,1.3,0.25]]) #After training the model, input new data for prediction
clf.coef_ #View the parameters of the trained model
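To judge how well the classifier above generalizes, it is common practice to hold out part of the data for testing rather than scoring on the training set. A sketch using train_test_split (the split ratio and random_state are arbitrary choices here):

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
# Hold out 30% of the samples as a test set
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

clf = svm.LinearSVC(max_iter=10000)  # raise max_iter to ensure convergence
clf.fit(X_train, y_train)            # train only on the training split
acc = clf.score(X_test, y_test)      # accuracy on unseen data

print(acc)
```

For classifiers, score() returns accuracy; the iris dataset is easy enough that accuracy is typically well above 0.9.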