pandas is a third-party library that every data analyst should be familiar with. It has great advantages in scientific computing and is especially important for data analysis. Python already has NumPy, but NumPy is still relatively mathematical, so there is also a need for a library that represents the data model more concretely. As we all know, Excel plays a very important role in data processing, and the table is one of the best ways to present a data model.
pandas brings the tabular data model to Python. It also offers simple SQL-like data processing that is easy to use from Python.
Installation of pandas
pandas is installed with pip, in the same way as other Python packages:
pip install pandas
Creating pandas objects
pandas has two data structures: Series and DataFrame.
Series
A Series is like a Python list in which each item has its own index. A Series can be created from a list:
>>> import pandas as pd
>>> s1 = pd.Series([100,23,'bugingcode'])
>>> s1
0 100
1 23
2 bugingcode
dtype: object
>>>
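As a small sketch of what "each item has its own index" means in practice, the snippet below rebuilds the same Series and looks at its default index. It assumes nothing beyond pandas itself; the printed forms may differ slightly between pandas versions.

```python
import pandas as pd

# Same data as above: mixing integers and a string gives dtype "object".
s1 = pd.Series([100, 23, 'bugingcode'])

# Without an explicit index, pandas numbers the items 0, 1, 2.
print(list(s1.index))

# Look up a single item by its index label.
print(s1[2])

# Convert the Series back to a plain Python list.
print(s1.tolist())
```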
A Series can also be given an explicit index through the index argument:
>>> import numpy as np
>>> ts = pd.Series(np.random.randn(365), index=np.arange(1,366))
>>> ts
The index values set through index run from 1 to 365, because np.arange(1, 366) stops before 366.
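To check the index bounds, a short sketch like the following can help. It reuses the ts Series from above; the values are random, so only the index and length are inspected.

```python
import numpy as np
import pandas as pd

ts = pd.Series(np.random.randn(365), index=np.arange(1, 366))

# First and last index labels, and the number of items.
print(ts.index[0], ts.index[-1])
print(len(ts))

# With an integer index, ts[1] looks up the label 1,
# which sits at position 0 of the Series.
print(ts[1] == ts.iloc[0])
```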
The built-in Python data structure most similar to a Series is the dictionary, and a Series can be created from a dictionary:
sd = {'xiaoming':14,'tom':15,'john':13}
s4 = pd.Series(sd)
At this point, you can see that the Series already has its own index: the keys of the dictionary.
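The following sketch shows how the dictionary keys can then be used as labels. The filter at the end is an extra illustration that is not part of the original example.

```python
import pandas as pd

sd = {'xiaoming': 14, 'tom': 15, 'john': 13}
s4 = pd.Series(sd)

# The dictionary keys become the index, in insertion order.
print(list(s4.index))

# Look up a value by its label, just like in a dictionary.
print(s4['tom'])

# Keep only the entries whose value is greater than 13.
older = s4[s4 > 13]
print(older)
```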
pandas is closely connected with Matplotlib, another third-party Python library that is most commonly used for displaying data. If you are not yet familiar with Matplotlib, it is introduced in a later chapter; for now, just use it directly. If it has not been installed yet, install it with pip in the same way: pip install matplotlib. The following code plots the data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
ts = pd.Series(np.random.randn(365), index=np.arange(1,366))
ts.plot()
plt.show()
The result is an irregular graph. In data analysis, time is an important feature because a lot of data is related to time: sales are related to time, weather is related to time, and so on. pandas also provides functions for working with time; use date_range to generate a series of times.
>>> pd.date_range('01/01/2017',periods=365)
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
'2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
'2017-01-09', '2017-01-10',
...
'2017-12-22', '2017-12-23', '2017-12-24', '2017-12-25',
'2017-12-26', '2017-12-27', '2017-12-28', '2017-12-29',
'2017-12-30', '2017-12-31'],
dtype='datetime64[ns]', length=365, freq='D')
>>>
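date_range accepts more than a start date and a number of periods. The sketch below shows two other common forms, assuming the start/end and freq parameters of pd.date_range; the variable names are only illustrative.

```python
import pandas as pd

# 365 daily timestamps starting on 1 January 2017, as above.
days = pd.date_range('2017-01-01', periods=365)
print(days[0], days[-1])

# Give a start and an end instead of a number of periods.
january = pd.date_range(start='2017-01-01', end='2017-01-31')
print(len(january))

# Change the step with freq: here, one timestamp every 7 days.
weekly = pd.date_range('2017-01-01', periods=4, freq='7D')
print(weekly)
```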
The earlier graph was irregular. One reason is that the data was not continuous: each value was an independent random number. Use cumsum to take the running total, which makes the data continuous:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
ts = pd.Series(np.random.randn(365), index=pd.date_range('01/01/2017',periods=365))
ts = ts.cumsum()
ts.plot()
plt.show()
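To see what cumsum does without random data, here is a minimal sketch on a hand-made Series. The numbers are made up for illustration.

```python
import pandas as pd

steps = pd.Series([1, -2, 3, 4])

# Each output item is the sum of all input items up to that position.
running = steps.cumsum()
print(running.tolist())  # [1, -1, 2, 6]
```

Because every point is the previous point plus one small step, the plotted line moves gradually instead of jumping between unrelated values.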
DataFrame
A DataFrame is equivalent to a Series extended by one dimension. It is a two-dimensional data model, comparable to the data in an Excel table, with horizontal and vertical coordinates. As in a Series, the rows are labeled by index, and the columns are labeled by columns. When creating a DataFrame object, three elements need to be specified: the data, the row labels (index), and the column labels (columns).
df = pd.DataFrame(np.random.randn(8,6), index=pd.date_range('01/01/2018',periods=8),columns=list('ABCDEF'))
print(df)
The output looks like the following (the values are random, so yours will differ):
A B C D E F
2018-01-01 0.712636 0.546680 -0.847866 -0.629005 2.152686 0.563907
2018-01-02 -1.292799 1.122098 0.743293 0.656412 0.989738 2.468200
2018-01-03 1.762894 0.783614 -0.301468 0.289608 -0.780844 0.873074
2018-01-04 -0.818066 1.629542 -0.595451 0.910141 0.160980 0.306660
2018-01-05 2.008658 0.456592 -0.839597 1.615013 0.718422 -0.564584
2018-01-06 0.480893 0.724015 -1.076434 -0.253731 0.337147 -0.028212
2018-01-07 -0.672501 0.739550 -1.316094 1.118234 -1.456680 -0.601890
2018-01-08 -1.028436 -1.036542 -0.459044 1.321962 -0.198338 -1.034822
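A DataFrame can also be built from a dictionary of columns, which may be easier to read for small tables. This is a sketch with made-up values; df_small is a hypothetical name used only here.

```python
import pandas as pd

# Each dictionary key becomes a column; the index labels the rows.
df_small = pd.DataFrame(
    {'A': [1.0, 2.0, 3.0], 'B': [4.0, 5.0, 6.0]},
    index=pd.date_range('2018-01-01', periods=3),
)

print(df_small.shape)          # (number of rows, number of columns)
print(list(df_small.columns))  # the column labels
print(df_small.index[0])       # the first row label
```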
In data analysis, the data very often comes directly from Excel or CSV files. Data can be read from an Excel file into a DataFrame and then processed there:
df = pd.read_excel('data.xlsx',sheet_name= 'Sheet1')
print(df)
Likewise, to_excel saves the data back to an Excel file.
The functions for CSV data are read_csv and to_csv, and the functions for HDF5 data are read_hdf and to_hdf.
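As a sketch of a CSV round trip, the example below writes a small DataFrame with to_csv and reads it back with read_csv. To stay self-contained it uses an in-memory io.StringIO buffer instead of a real file; the names df_out, df_in, and buffer, as well as the data, are made up for illustration.

```python
import io

import pandas as pd

df_out = pd.DataFrame({'name': ['tom', 'john'], 'age': [15, 13]})

# Write the table as CSV text; index=False leaves out the row labels.
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back into a new DataFrame.
buffer.seek(0)
df_in = pd.read_csv(buffer)
print(df_in)
```

With a real file, the buffer would be replaced by a file path in both calls.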
A DataFrame can be accessed in much the same way as a two-dimensional array:
print(df['A'])
This prints column A together with its row labels:
2018-01-01 0.712636
2018-01-02 -1.292799
2018-01-03 1.762894
2018-01-04 -0.818066
2018-01-05 2.008658
2018-01-06 0.480893
2018-01-07 -0.672501
2018-01-08 -1.028436
You can also specify an element:
print(df['A']['2018-01-01'])
Slice the DataFrame by selecting all columns and then the first three rows:
>>> import pandas as pd
>>> df = pd.read_excel('data.xlsx',sheet_name= 'Sheet1')
>>> df[:][0:3]
A B C D E F
2018-01-01 0.712636 0.546680 -0.847866 -0.629005 2.152686 0.563907
2018-01-02 -1.292799 1.122098 0.743293 0.656412 0.989738 2.468200
2018-01-03 1.762894 0.783614 -0.301468 0.289608 -0.780844 0.873074
>>>
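pandas also provides the .loc (label-based) and .iloc (position-based) indexers, which are not covered in the text above but express the same selections in a single step. The sketch below uses integer data instead of random values so the results are easy to follow; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(48).reshape(8, 6),
    index=pd.date_range('2018-01-01', periods=8),
    columns=list('ABCDEF'),
)

# First three rows, selected by position.
first_three = df.iloc[0:3]

# A single cell, selected by row label and column label.
value = df.loc[pd.Timestamp('2018-01-01'), 'A']

# A block of rows and columns; label slices include both end points.
block = df.loc['2018-01-02':'2018-01-03', ['A', 'B']]
print(block)
```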
DataFrame has many more functions, which will be introduced in later sections.