Getting started with pandas (1): Installing pandas and creating objects

pandas is a third-party library that every data analyst should be familiar with. It has great strengths in scientific computing and is especially important for data analysis. Python already has NumPy, but NumPy is still fairly mathematical, and there is also a need for a library that represents the data model more concretely. We all know how important Excel is in data processing, and the table is one of the best ways to present a data model.

pandas is an implementation of the tabular data model in Python. It supports simple SQL-like data processing, which makes such operations easy to carry out in Python.

Installation of pandas

pandas is installed with pip, in the same way as other Python packages:

pip install pandas
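
To check that the installation worked, a minimal sketch like the following can be run. It only assumes that pandas is imported under its conventional alias pd; the variable name version is illustrative.

```python
import pandas as pd

# Read the installed version; getting any version string back
# means the import (and therefore the installation) succeeded.
version = pd.__version__
print(version)
```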

Creating pandas objects

pandas has two main data structures: Series and DataFrame.

Series

A Series is like a Python list in which each element has its own index. Create a Series from a list:

>>> import pandas as pd
>>> s1 = pd.Series([100,23,'bugingcode'])
>>> s1
0           100
1            23
2    bugingcode
dtype: object
>>>
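
The dtype: object in the output comes from mixing numbers and a string in one list. As a small sketch (the names mixed and numbers are only illustrative), a list that holds integers alone gets an integer dtype instead:

```python
import pandas as pd

# Integers mixed with a string: pandas falls back to the generic object dtype.
mixed = pd.Series([100, 23, 'bugingcode'])
print(mixed.dtype)

# Integers only: pandas picks an integer dtype.
numbers = pd.Series([100, 23, 7])
print(numbers.dtype)

# The default index is 0, 1, 2, so label 2 is the last element.
print(mixed.loc[2])
```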

Give a Series a custom index with the index argument:

>>> import numpy as np
>>> ts = pd.Series(np.random.randn(365), index=np.arange(1,366))
>>> ts

The index passed through index runs from 1 to 365, because np.arange(1, 366) excludes the end value 366.
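
With a custom index, access by label and access by position no longer use the same numbers. A short sketch, using a three-element Series named s purely for illustration:

```python
import numpy as np
import pandas as pd

# np.arange(1, 4) produces the labels 1, 2, 3 (the end value is excluded).
s = pd.Series([10, 20, 30], index=np.arange(1, 4))

print(s.loc[1])   # label 1 is the first element
print(s.iloc[0])  # position 0 is also the first element

# The same rule gives 365 labels, 1 through 365, for np.arange(1, 366).
labels = np.arange(1, 366)
print(len(labels), labels[0], labels[-1])
```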

The Python data structure that a Series most resembles is the dictionary. Create a Series from a dictionary:

sd = {'xiaoming':14,'tom':15,'john':13}
s4 = pd.Series(sd)

Here you can see that the Series already has its own index, taken from the dictionary keys.

pandas has many connections with another third-party Python library, Matplotlib, which is most commonly used for displaying data. If you are not yet familiar with Matplotlib, it will be introduced in later chapters; for now, just use it directly. If it has not been installed yet, install it with pip as well: pip install matplotlib. The following code displays the data:
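
A minimal sketch of what that index looks like and how it can be used, reusing the sd and s4 names from the snippet above:

```python
import pandas as pd

sd = {'xiaoming': 14, 'tom': 15, 'john': 13}
s4 = pd.Series(sd)

# The dictionary keys become the index labels.
print(list(s4.index))

# Values can be looked up by key, just as in the dictionary.
print(s4['tom'])

# Unlike a dictionary, a Series also supports whole-column operations.
print(s4.max())
```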

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

ts = pd.Series(np.random.randn(365), index=np.arange(1,366))
ts.plot()
plt.show()

Figure: line plot of the Series

The result is an irregular-looking graph. In data analysis, time is an important feature because a lot of data is related to time: sales, weather, and so on. pandas also provides functions for working with time; use date_range to generate a sequence of dates:

>>> pd.date_range('01/01/2017',periods=365)
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
               '2017-01-09', '2017-01-10',
               ...
               '2017-12-22', '2017-12-23', '2017-12-24', '2017-12-25',
               '2017-12-26', '2017-12-27', '2017-12-28', '2017-12-29',
               '2017-12-30', '2017-12-31'],
              dtype='datetime64[ns]', length=365, freq='D')
>>>
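
date_range can be driven either by a number of periods or by an end date. A small sketch (the names days and january are illustrative):

```python
import pandas as pd

# Five consecutive days starting on 1 January 2017 (daily frequency is the default).
days = pd.date_range('2017-01-01', periods=5)
print(days)

# A range can also be given by its start and end dates; both ends are included.
january = pd.date_range(start='2017-01-01', end='2017-01-31')
print(len(january))
```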

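The earlier graph looked irregular. One reason is that the values are independent random numbers, so neighbouring points have nothing to do with each other. Use cumsum to replace each value with the running total so far, which makes the series move continuously from one day to the next: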

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

ts = pd.Series(np.random.randn(365), index=pd.date_range('01/01/2017',periods=365))
ts = ts.cumsum()
ts.plot()
plt.show()
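
What cumsum does is easiest to see on a tiny, fixed input. A minimal sketch with made-up values:

```python
import pandas as pd

steps = pd.Series([1, 2, 3, 4])

# Each output element is the sum of all input elements up to that position.
running_total = steps.cumsum()
print(running_total.tolist())  # [1, 3, 6, 10]
```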

DataFrame

A DataFrame extends a Series by one dimension. It is a two-dimensional data model, comparable to the data in an Excel sheet, with row labels and column labels. As in a Series, the rows are labelled by index, while the columns are labelled by columns. Creating a DataFrame therefore requires three things: the data, the row labels (index), and the column labels (columns).

df = pd.DataFrame(np.random.randn(8, 6), index=pd.date_range('01/01/2018', periods=8), columns=list('ABCDEF'))
print(df)

The data looks like this:

                   A         B         C         D         E         F
2018-01-01  0.712636  0.546680 -0.847866 -0.629005  2.152686  0.563907
2018-01-02 -1.292799  1.122098  0.743293  0.656412  0.989738  2.468200
2018-01-03  1.762894  0.783614 -0.301468  0.289608 -0.780844  0.873074
2018-01-04 -0.818066  1.629542 -0.595451  0.910141  0.160980  0.306660
2018-01-05  2.008658  0.456592 -0.839597  1.615013  0.718422 -0.564584
2018-01-06  0.480893  0.724015 -1.076434 -0.253731  0.337147 -0.028212
2018-01-07 -0.672501  0.739550 -1.316094  1.118234 -1.456680 -0.601890
2018-01-08 -1.028436 -1.036542 -0.459044  1.321962 -0.198338 -1.034822
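
The three ingredients can be read back from the finished object. A small sketch with fixed values instead of random ones (the name small is illustrative):

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(
    np.arange(6).reshape(2, 3),                    # data: 2 rows, 3 columns
    index=pd.date_range('01/01/2018', periods=2),  # row labels
    columns=list('ABC'),                           # column labels
)

print(small.shape)
print(list(small.columns))
print(small.index[0])
```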

In data analysis, the data very often comes directly from Excel or CSV files. Such data can be read into a DataFrame and processed there:

df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
print(df)

Likewise, to_excel saves the data to an Excel file.

The functions for CSV data are read_csv and to_csv, and the functions for HDF5 are read_hdf and to_hdf.
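
As a hedged sketch of the CSV pair, the following writes a DataFrame to CSV text and reads it back. It uses an in-memory io.StringIO buffer in place of a real file so that it runs anywhere; with an actual file, a path such as 'data.csv' (a hypothetical name) would be passed instead.

```python
import io

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])

# Write the DataFrame, including its index, as CSV text.
buffer = io.StringIO()
df.to_csv(buffer)

# Rewind and read it back, using the first column as the index again.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```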

A DataFrame can be accessed in much the same way as a two-dimensional array. Select a column by its label:

print(df['A'])

The output includes the index labels:

2018-01-01    0.712636
2018-01-02   -1.292799
2018-01-03    1.762894
2018-01-04   -0.818066
2018-01-05    2.008658
2018-01-06    0.480893
2018-01-07   -0.672501
2018-01-08   -1.028436

You can also specify an element:

print(df['A']['2018-01-01'])

Slice the DataFrame. Here df[:] keeps everything and [0:3] takes the first three rows:
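
Chained indexing like this works for reading, but pandas also offers loc (by label) and iloc (by position), which pick the row and the column in one step. A sketch on a small fixed DataFrame (the values are made up):

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [1.0, 2.0], 'B': [3.0, 4.0]},
    index=pd.date_range('01/01/2018', periods=2),
)

# By label: row 2018-01-01, column 'A'.
by_label = df.loc[pd.Timestamp('2018-01-01'), 'A']

# By position: first row, first column.
by_position = df.iloc[0, 0]

print(by_label, by_position)
```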

>>> import pandas as pd
>>> df = pd.read_excel('data.xlsx',sheet_name= 'Sheet1')
>>> df[:][0:3]
                   A         B         C         D         E         F
2018-01-01  0.712636  0.546680 -0.847866 -0.629005  2.152686  0.563907
2018-01-02 -1.292799  1.122098  0.743293  0.656412  0.989738  2.468200
2018-01-03  1.762894  0.783614 -0.301468  0.289608 -0.780844  0.873074
>>>
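
To slice rows and columns together, iloc accepts one slice per axis and loc accepts labels. A minimal sketch using a fixed 4x3 table rather than the Excel file (the names top_left and a_and_c are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=pd.date_range('01/01/2018', periods=4),
    columns=list('ABC'),
)

# First three rows and first two columns, by position.
top_left = df.iloc[0:3, 0:2]
print(top_left)

# All rows, columns 'A' and 'C', by label.
a_and_c = df.loc[:, ['A', 'C']]
print(a_and_c)
```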

DataFrame has many more features, which will be introduced in later posts.

Origin blog.csdn.net/weixin_40425640/article/details/79095807