1 pandas Index
A Series can be accessed directly by index label, for example:
s=pd.Series(np.array([1,2,3,4]),index=['a','b','a','b'])
s['a']
For a DataFrame, use loc to select rows by label, for example:
s=pd.DataFrame(np.array([[1,2],[3,4],[5,6]]),index=['a','b','a'])
s.loc['b']
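The label lookups above can be tried end to end with a small sketch. The variable names (a_rows, b_row, a_block) are only illustrative; the point is what a repeated label returns compared with a unique one.

```python
import numpy as np
import pandas as pd

# A Series whose index contains repeated labels.
s = pd.Series(np.array([1, 2, 3, 4]), index=['a', 'b', 'a', 'b'])
a_rows = s['a']        # repeated label: a Series holding both matches

# A DataFrame with a repeated row label; rows are selected with loc.
df = pd.DataFrame(np.array([[1, 2], [3, 4], [5, 6]]), index=['a', 'b', 'a'])
b_row = df.loc['b']    # unique label: a single row, returned as a Series
a_block = df.loc['a']  # repeated label: a DataFrame with two rows
```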
2 Duplicate index
Detection: to check whether an index contains duplicate labels, use the is_unique property of the index, for example:
s.index.is_unique
To get the distinct index labels, use the unique method of the index, for example:
s1=s.index.unique()
To process rows that share an index label, use groupby, for example:
s2=s.groupby(s.index).sum()
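A minimal sketch of the duplicate-index workflow, assuming the check is meant for the index labels rather than the values. The duplicated() call is an extra not mentioned in the text, included because it marks which positions repeat an earlier label.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'a', 'b'])

has_unique_index = s.index.is_unique   # False: 'a' and 'b' both repeat
labels = s.index.unique()              # the distinct labels, in order of appearance
dup_mask = s.index.duplicated()        # True where a label was already seen
totals = s.groupby(s.index).sum()      # collapse repeated labels by summing
```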
3 Multi-level index
3.1 Series
A multi-level index can be created with the MultiIndex constructors, for example:
a1=[['a','b','a','c','b','b'],[1,2,2,3,1,2]]
a2=list(zip(*a1))
index=pd.MultiIndex.from_tuples(a2,names=['level1','level2'])
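To show what the resulting index is for, the sketch below attaches it to a Series and selects by one level at a time. The Series values (0 to 5) are made up for illustration.

```python
import pandas as pd

a1 = [['a', 'b', 'a', 'c', 'b', 'b'], [1, 2, 2, 3, 1, 2]]
a2 = list(zip(*a1))  # pair the two lists into (level1, level2) tuples
index = pd.MultiIndex.from_tuples(a2, names=['level1', 'level2'])

s = pd.Series(range(6), index=index)
outer_a = s.loc['a']               # all rows whose level1 label is 'a'
inner_2 = s.xs(2, level='level2')  # all rows whose level2 label is 2
```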
3.2 DataFrame
Creation differs slightly from the Series case, because both the rows and the columns can have multiple levels, for example:
df=pd.DataFrame(np.random.randint(1,10,(6,3)),index=[['a','b','a','c','b','b'],[1,2,2,3,1,2]],columns=[['one','two','one'],['red','blue','red']])
3.3 Swapping index levels
df.swaplevel(0,1)
3.4 Sorting by index level
df.sort_index(level=1)
3.5 Computing by index level
df.groupby(level=0).sum()
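The three level operations can be compared on one small frame. This is a sketch with fixed values and plain column names (x, y, z), both chosen only so the results are easy to check. It uses sort_index(level=...) and groupby(level=...).sum(), which are the spellings available in current pandas releases.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(18).reshape(6, 3),
    index=[['a', 'b', 'a', 'c', 'b', 'b'], [1, 2, 2, 3, 1, 2]],
    columns=['x', 'y', 'z'],
)

swapped = df.swaplevel(0, 1)           # inner level becomes the outer level
by_inner = df.sort_index(level=1)      # sort rows by the second level
outer_sum = df.groupby(level=0).sum()  # sum rows sharing a first-level label
```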
3.6 Converting columns to an index
df.set_index([column name 1, column name 2, ...])
The index can also be turned back into ordinary columns, for example df.reset_index()
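A short round trip illustrates the two calls. The column names city, year and value are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['X', 'X', 'Y'],
    'year': [2018, 2019, 2018],
    'value': [10, 20, 30],
})

indexed = df.set_index(['city', 'year'])  # two columns become a two-level index
restored = indexed.reset_index()          # the levels become columns again
```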
4 Grouped computation
The basic process: split - apply - combine
4.1 Grouping by a list
Example: df.groupby([column name 1, column name 2, ...]).sum()
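As a concrete instance of split - apply - combine, the sketch below groups on two columns at once. The names key1, key2 and data are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b'],
    'key2': ['x', 'y', 'x', 'x'],
    'data': [1, 2, 3, 4],
})

# Split by each (key1, key2) pair, sum within each group, combine the results.
totals = df.groupby(['key1', 'key2']).sum()
```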
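4.2 Grouping by a dictionary
df=pd.DataFrame(np.random.randint(1,10,(6,4)),columns=['a','b','c','d'])
mapping={'a':'red','b':'blue','c':'red','d':'blue'}
df1=df.groupby(mapping,axis=1)
Note: the axis=1 argument of groupby is deprecated since pandas 2.1; in newer versions, transpose first with df.T.groupby(mapping)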
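A sketch of dictionary-based column grouping that avoids the axis argument by transposing, grouping the rows, and transposing back. The fixed input values are an assumption made so the sums are predictable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=['a', 'b', 'c', 'd'])
mapping = {'a': 'red', 'b': 'blue', 'c': 'red', 'd': 'blue'}

# After transposing, the column labels are row labels, so the dictionary
# maps each of them to a group; the final .T restores the orientation.
by_colour = df.T.groupby(mapping).sum().T
```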
4.3 Grouping by a function
The function is called on each index label, and its return value is used as the group key
def _group(idx):
    return idx
df.groupby(_group)
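The identity function above leaves every label in its own group, so a slightly more telling sketch is one where the function maps several labels to the same key. The helper first_letter and the fruit labels are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {'v': [1, 2, 3, 4]},
    index=['apple', 'avocado', 'banana', 'blueberry'],
)

def first_letter(label):
    # Called once per index label; the return value is the group key.
    return label[0]

sums = df.groupby(first_letter).sum()
```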
4.4 Grouping by a multi-level index
Group by the name of one index level, for example:
df.groupby(level=index name,axis=1)
5 Aggregation
5.1 Built-in aggregation functions
sum,mean,min,max,describe
5.2 Custom aggregation function
def _group(s):
    return s.max()-s.min()
df1.agg(_group)
Note: df1 is the result of grouping (a GroupBy object)
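A self-contained sketch of a custom aggregation follows. The function name value_range and the columns key, data1, data2 are illustrative, not taken from a real dataset.

```python
import pandas as pd

df = pd.DataFrame({
    'key': ['x', 'x', 'y', 'y'],
    'data1': [1, 5, 2, 10],
    'data2': [3, 4, 8, 6],
})

def value_range(s):
    # Receives one column of one group as a Series and returns a scalar.
    return s.max() - s.min()

spread = df.groupby('key').agg(value_range)
```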
5.3 apply method
def top(g,n=2,column='data1'):
    return g.sort_values(by=column,ascending=False)[:n]
df2=df.groupby(column name).apply(top)
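The top-n pattern can be sketched as follows, with made-up data. Selecting the data columns before apply is a choice made here so that the grouping column is not passed into the function, which behaves the same across pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    'key': ['x', 'x', 'x', 'y', 'y'],
    'data1': [3, 9, 5, 2, 7],
    'data2': [1, 2, 3, 4, 5],
})

def top(g, n=2, column='data1'):
    # Keep the n rows of the group with the largest values in `column`.
    return g.sort_values(by=column, ascending=False)[:n]

top2 = df.groupby('key')[['data1', 'data2']].apply(top)
```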
6 Data Import and Export
6.1 Data Import
6.1.1 Import method
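pd.read_csv(file path)
pd.read_table(file path,sep=',')
pd.read_table works like pd.read_csv except that its default separator is a tab. In both functions sep may be a regular expression, which makes pandas fall back to the slower Python parsing engine.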
6.1.2 Import settings
When reading a file, you can state whether it has a header row and designate one or more columns as the row index, for example:
df=pd.read_csv(path,header=None,index_col=[column 1, column 2, ...])
6.1.3 Missing values
pd.read_csv automatically treats empty fields, NA and similar markers as missing values. Additional markers can be specified, for example:
df=pd.read_csv(path,na_values=['NA','NULL','foo'])
A dictionary can also be passed to set the markers separately for each column
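The difference between a list and a dictionary for na_values shows up in a small in-memory example. The CSV text and its column names are invented, and io.StringIO stands in for a file path.

```python
import io
import pandas as pd

csv_text = "name,score,grade\nann,90,A\nfoo,NULL,foo\ncat,,B\n"

# A list applies the extra marker to every column.
df_list = pd.read_csv(io.StringIO(csv_text), na_values=['foo'])

# A dictionary applies it only to the named column, so 'foo' survives in name.
df_dict = pd.read_csv(io.StringIO(csv_text), na_values={'grade': ['foo']})
```

In both frames the score column has two missing entries, because NULL and the empty field are recognised by default.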
6.1.4 Reading data in chunks
Use the chunksize parameter to read the data in chunks, for example to count how often each value occurs in the column key:
df=pd.read_csv(path,chunksize=1000)
result=pd.Series([],dtype=float)
for chunk in df:
    result=result.add(chunk[key].value_counts(),fill_value=0)
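The chunked count can be run on a tiny in-memory file to see the accumulation at work. The data and the chunk size of 3 are assumptions chosen to force several chunks.

```python
import io
import pandas as pd

# Seven rows whose key column reads a, b, c, a, b, c, a.
rows = "\n".join(f"{k},{i}" for i, k in enumerate("abcabca"))
csv_text = "key,value\n" + rows + "\n"

reader = pd.read_csv(io.StringIO(csv_text), chunksize=3)  # yields DataFrames
result = pd.Series(dtype=float)
for chunk in reader:
    # fill_value=0 treats a key missing from one side as a count of zero.
    result = result.add(chunk['key'].value_counts(), fill_value=0)
```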
6.2 Export data to disk
df.to_csv(path)
The index can be left out of the file, for example: df.to_csv(path,index=False)
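The effect of index=False is easiest to see on the text itself. In this sketch to_csv is called without a path, in which case it returns the CSV as a string instead of writing a file.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

with_index = df.to_csv()                # first column holds the row index
without_index = df.to_csv(index=False)  # data columns only
```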
7 Time Series
7.1 Two common libraries
from datetime import datetime
from datetime import timedelta
7.2 Define a time
t1=datetime(2019,9,23)
7.3 Time conversion
Datetime to string:
t1=datetime(2019,7,23)
t1.strftime('%Y/%m/%d %H:%M:%S')
String to datetime:
datetime.strptime('2019-7-12 9:20','%Y-%m-%d %H:%M')
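Both conversions, plus a timedelta shift, in one runnable sketch. Note that a four-digit year needs %Y; %y matches only two digits.

```python
from datetime import datetime, timedelta

t1 = datetime(2019, 7, 23)
text = t1.strftime('%Y/%m/%d %H:%M:%S')  # datetime to string

parsed = datetime.strptime('2019-7-12 9:20', '%Y-%m-%d %H:%M')  # string to datetime
later = parsed + timedelta(days=1, hours=2)  # shift by a fixed duration
```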
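7.4 Generating a time series
Use date_range:
pd.date_range('20190620','20190628')
pd.date_range('20190620',periods=10,freq='M')
In pandas 2.2 and later the month-end frequency for date_range is written 'ME'; 'M' remains the monthly frequency for periods.
Use period_range:
pd.period_range('2016-10',periods=10,freq='M')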
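A short sketch contrasting the two generators. Daily frequency is used for date_range here so the example does not depend on how a given pandas version spells the month-end alias.

```python
import pandas as pd

days = pd.date_range('20190620', '20190628')                # both endpoints included
ten_days = pd.date_range('20190620', periods=10, freq='D')  # fixed number of points
months = pd.period_range('2016-10', periods=10, freq='M')   # whole months, not instants
```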
7.5 Converting between timestamps and periods
Timestamp to period:
For example: s=pd.Series(np.random.randint(5,size=5),index=pd.date_range('2016-04-01',periods=5,freq='D'))
Call s.to_period(); a frequency can also be given, as in s.to_period(freq='M')
Similarly, for period-indexed data s1, s1.to_timestamp() converts the index back to timestamps
Note: the range functions take a periods argument (plural), while the conversion method is to_period (singular)!
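The round trip between the two index types can be sketched as follows, with fixed values instead of random ones so the result is reproducible.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5),
              index=pd.date_range('2016-04-01', periods=5, freq='D'))

daily = s.to_period()            # frequency taken from the index: daily periods
monthly = s.to_period(freq='M')  # every date falls in the period 2016-04
back = daily.to_timestamp()      # each period maps to its start timestamp
```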
7.6 Time resampling
7.6.1 Downsampling: high frequency to low frequency
Call ts.resample(sampling period, label=which bin edge labels the result) and then an aggregation method, for example:
ts.resample('5min',label='right').sum()
The groupby function can also be used:
ts.groupby(lambda x:x.month).sum()
ts.groupby(ts.index.to_period('M')).sum()
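A sketch of downsampling twelve one-minute observations into five-minute bins. The data is invented, and the aggregation is written as a method call on the resampler, which is the form current pandas releases accept.

```python
import numpy as np
import pandas as pd

ts = pd.Series(np.arange(12),
               index=pd.date_range('2019-01-01', periods=12, freq='min'))

# Bins are [00:00, 00:05), [00:05, 00:10), ...; label='right' names each
# bin after its right edge.
five_min = ts.resample('5min', label='right').sum()

# The groupby form: every timestamp here falls in month 1.
by_month = ts.groupby(lambda x: x.month).sum()
```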
7.6.2 Upsampling: low frequency to high frequency
Timestamp: ts.resample('D').ffill()
Period: ts.resample('A-DEC').sum()
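Upsampling creates rows that have no data, so a fill rule is needed. The sketch below forward-fills a series with a three-day gap; the values are made up.

```python
import pandas as pd

ts = pd.Series([1.0, 2.0],
               index=pd.to_datetime(['2019-01-01', '2019-01-04']))

# Insert the missing days and copy the last known value forward.
daily = ts.resample('D').ffill()
```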
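7.7 Parsing dates when reading a file
With parse_dates=True, pandas tries to parse the index as dates, so it is usually combined with index_col:
df=pd.read_csv(path,index_col=0,parse_dates=True)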
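A minimal sketch of date parsing at read time, using an in-memory CSV with a hypothetical date column as the index.

```python
import io
import pandas as pd

csv_text = "date,value\n2019-06-20,1\n2019-06-21,2\n"

# index_col=0 makes the first column the index; parse_dates=True then
# converts that index to datetimes.
df = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)
```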
8 Data Visualization
Line chart: ts.plot(figsize=tuple, style=color and line type, title=figure title)
Scatter plot: df.plot.scatter(x='', y='')
Bar chart: ts.plot.bar(stacked=True)
Histogram: ts.plot.hist(bins=20)
Pie chart: ts.plot.pie()