Pandas
A Python toolset for structured data analysis,
built on NumPy (high-performance array computation)
and matplotlib (data visualization).
Getting Started documentation: Link
-
Creating pandas objects:
- One-dimensional array Series:
  s = pd.Series(data, index=index)
  index is a list of labels for the data
  data can be an ndarray, a dict-like object (supports .get()), or a scalar
- Two-dimensional array DataFrame:
  df = pd.DataFrame(data, index=index, columns=columns)
  index is the row labels, columns is the column labels
  - Created from an array:
    data = pd.DataFrame(np.random.randn(6,4), index=index_list, columns=column_label_list)
    # e.g. dates=pd.date_range('20160301',periods=6); pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
  - Created from a dict:
    d = pd.DataFrame({'A': [1,2,3], 'B': ['a','b','c'], 'C': [545,565,585]})  # A/B/C are the column labels
- Three-dimensional array Panel:
  currently used relatively rarely
  items: axis 0; each item corresponds to a DataFrame
  major_axis: axis 1; the row labels of each DataFrame
  minor_axis: axis 2; the column labels of each DataFrame
  # e.g. data={'Item1':pd.DataFrame(np.random.randn(4,3)), 'Item2':pd.DataFrame(np.random.randn(4,2))}; pn=pd.Panel(data); pn.to_frame()  # converts the 3-D structure into a 2-D array with a multi-level row index
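A minimal sketch of the creation APIs above; the labels and values are illustrative, not from the original notes:

```python
import numpy as np
import pandas as pd

# Series from a dict: the dict keys become the index labels
s = pd.Series({'a': 1, 'b': 2, 'c': 3})

# DataFrame from an ndarray, with date row labels and letter column labels
dates = pd.date_range('20160301', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
```

Note that Panel has been removed from recent pandas versions, which is why the sketch sticks to Series and DataFrame.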
-
Reindex
s.reindex(list, fill_value=0) reindex; missing entries are assigned the default value 0
s.reindex(list, method='bfill') reindex, filling missing entries with the next value; method='ffill' fills with the previous value
df.reindex(index=, columns=) reindex a DataFrame
-
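A short sketch of the two fill strategies mentioned above (values are illustrative):

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Missing labels are assigned fill_value
s2 = s.reindex(['a', 'b', 'c', 'd'], fill_value=0)

# ffill carries the previous value forward over gaps in a sorted index
s3 = pd.Series([1, 2], index=[0, 2]).reindex(range(4), method='ffill')
```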
View data:
data.values generates an ndarray
data.shape view the shape
data.head() returns the first five rows
data.tail() returns the last five rows
data.index returns the row labels
data.columns returns the column labels
s.value_counts() how many times each value appears
df.mode() the most frequently occurring value(s)
data.describe() view basic statistics (mean, quantiles, etc.)
df.pivot_table(values=['D'], index=['A', 'B'], columns=['C'], aggfunc='mean') pivot table; cells with multiple matching values are averaged, cells with no match are null
s.unique() returns the list of distinct values in the Series
s.index.is_unique checks whether the index contains duplicates
s.isin(element_list) checks whether each element of the Series is in the list
-
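A small sketch of the inspection helpers; the frame below is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3], 'B': [10, 20, 30, 40]})

shape = df.shape                 # (4, 2)
stats = df.describe()            # count / mean / std / quantiles per numeric column
counts = df['A'].value_counts()  # how many times each value appears
distinct = df['A'].unique()      # distinct values, in order of appearance
```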
Data selection
data[col] or data.col selects the specified column
data.loc[label] selects a row by its row label
data.loc['20160301', 'B'] selects a single value by labels
data.iloc[pos] selects a row by its integer position
data[5:10] selects a range of rows
data[bool_vector] selects rows by a boolean vector
data.at[pd.Timestamp('20160301'), 'B'] likewise selects a single value, but requires native index types; more efficient
data.iat[1, 1] selects a single value by integer position
data[data > 0] conditional selection
-
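Label-, position-, and condition-based selection side by side; the data is a made-up 6x4 frame:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20160301', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

col = df['B']             # column by label (same as df.B)
row = df.loc['20160301']  # row by label
val = df.iat[1, 1]        # single value by integer position
big = df[df['A'] > 10]    # boolean / conditional selection
```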
Sorting
data.sort_index(axis=1, ascending=True) sort by column labels
data.sort_values(by='column name') sort by the values of a column
s.rank(method='first' / 'average') rank the values of a Series
-
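The three sorting and ranking calls above in one sketch (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({'B': [3, 1, 2], 'A': [9, 7, 8]})

by_label = df.sort_index(axis=1)      # column order becomes A, B
by_value = df.sort_values(by='B')     # rows ordered by the values in B
ranks = df['B'].rank(method='first')  # rank of each value within the Series
```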
Add / delete / change
- data.copy() deep copy
- pd.concat([df1, df2, df3, ...]) concatenates multiple DataFrames
- pd.merge(left, right, on='key') SQL-style join
  # equivalent SQL: SELECT * FROM left INNER JOIN right ON left.key = right.key;
- df.append(s, ignore_index=True) appends vertically
- df.insert(position, 'column name', values) inserts a new column at the given position
- df.assign(column_name=values) adds new columns (operates on a copy; the original array is unchanged)
- del df['column name'] or df.pop('column name') or df.drop('column name', axis=1) deletes a column
- df.drop('row label') deletes a row
- df.stack() pivots the column index into the row index
- stacked.unstack() restores the row index back into a column index
- df.add_prefix('prefix') prefixes the column names
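A sketch of merge, concat, and assign; `key`, `lval`, and `rval` are made-up names:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['a', 'b'], 'rval': [3, 4]})

merged = pd.merge(left, right, on='key')              # SQL-style inner join on 'key'
stacked = pd.concat([left, left], ignore_index=True)  # vertical concatenation
wide = left.assign(double=left['lval'] * 2)           # copy with an extra column
```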
-
Basic operations
- Null values
  df.dropna() removes rows containing nulls
  df.fillna(value=0) replaces nulls with 0
  pd.isnull(df) checks element-wise whether the data is null
  pd.isnull(df).any().any() checks whether the whole array contains any null
- Shifting
  pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2) shifts the sequence down so the first two values become missing
- Whole-array computation
  df.sub(s, axis='index') subtracts the one-dimensional array s from each column of the two-dimensional array df
  df.apply(np.cumsum, axis=) applies a function along rows or columns (axis=0, per column, by default), e.g. a cumulative sum over the two-dimensional array
  df.applymap(fun) applies a function to every value
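Null handling, shifting, and function application together; the frame is invented for the sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [4.0, 5.0, 6.0]})

filled = df.fillna(value=0)              # NaN -> 0
has_null = pd.isnull(df).any().any()     # True: the frame contains a NaN
col_sums = df.apply(np.nansum)           # function applied to each column
shifted = pd.Series([1, 3, 5]).shift(2)  # first two entries become NaN
```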
-
Grouping
- df.groupby('A').sum() grouped statistics
- df.groupby('A').size() number of values in each group
- Group by a dict mapping:
  mapping = {'A': 'red', 'B': 'red', 'C': 'blue', 'D': 'orange', 'E': 'blue'}
  df.groupby(mapping, axis=1)
- df.groupby(len) group by a function
- df.groupby(level=, axis=) group by an index level
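A minimal groupby sketch with an invented two-column frame:

```python
import pandas as pd

df = pd.DataFrame({'A': ['x', 'y', 'x', 'y'], 'B': [1, 2, 3, 4]})

sums = df.groupby('A').sum()    # per-group totals of B
sizes = df.groupby('A').size()  # number of rows per group
```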
-
Time series
- Creating a time series:
  dates = pd.date_range('20160301', periods=6) daily intervals by default
  dates = pd.period_range('2000Q1', '2016Q1', freq='Q')
- Resampling:
  s.resample('2Min').sum() resample (older pandas: s.resample('2Min', how='sum'))
- Interval calculation:
  pd.Timestamp('20160301') - pd.Timestamp('20160201')
  pd.Timestamp('20160301') + pd.Timedelta(days=5)
- df['grade'] = df.raw_grade.astype('category') adds a category column
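Timestamp arithmetic from the interval-calculation notes (2016 is a leap year, so February has 29 days):

```python
import pandas as pd

dates = pd.date_range('20160301', periods=6)  # daily frequency by default

span = pd.Timestamp('20160301') - pd.Timestamp('20160201')  # a Timedelta
later = pd.Timestamp('20160301') + pd.Timedelta(days=5)
```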
-
Data IO
df.to_csv('data.csv') exports data
pd.read_csv('data.csv', index_col=0) reads data
pd.read_table('data.dat', sep=' ', header=None, names=column_name_list) reads a .dat file
-
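A round-trip sketch of the CSV calls; it writes to an in-memory buffer rather than an actual file on disk:

```python
import io
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

buf = io.StringIO()
df.to_csv(buf)                        # export; the index is written as the first column
buf.seek(0)
back = pd.read_csv(buf, index_col=0)  # read it back, first column as the index
```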
Other
%matplotlib inline # plot inline in the page
formatter = '{0:.03f}'.format defines a formatting function
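The formatter line defines a reusable function via the bound str.format method:

```python
# A three-decimal float formatter built from str.format
formatter = '{0:.03f}'.format
text = formatter(3.14159)
```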