20/02/08 Data science packages: study notes (2)

Pandas

A structured-data analysis toolkit for Python
Built on NumPy: high-performance array and matrix computation
Plotting library matplotlib: provides data visualization

Getting Started documentation: Link

  • Creating pandas objects:

    • One-dimensional array, Series:
      s = pd.Series(data, index=index)
      index is a list of labels for the data
      Properties: ndarray-like object / dict-like object (supports .get()) / label-aligned operations
    • Two-dimensional array, DataFrame:
      df = pd.DataFrame(data, index=index, columns=columns)
      index holds the row labels, columns holds the column labels
      • Created from an array:
        data = pd.DataFrame(np.random.randn(6, 4), index=row_labels, columns=column_labels)
      • Created from a dictionary:
        d = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c'], 'C': [545, 565, 585]})  # A / B / C are the column labels; all lists must have the same length
      # e.g.
      dates = pd.date_range('20160301', periods=6)
      pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
      
    • Three-dimensional array, Panel:
      rarely used today (Panel was deprecated in pandas 0.20 and removed in 0.25, so this example needs an older version)
      items: axis 0, each item corresponds to one DataFrame
      major_axis: axis 1, the row labels of each DataFrame
      minor_axis: axis 2, the column labels of each DataFrame
      # e.g.
      data = {'Item1': pd.DataFrame(np.random.randn(4, 3)),
              'Item2': pd.DataFrame(np.random.randn(4, 2))}
      pn = pd.Panel(data)
      pn.to_frame()  # convert the 3-D panel into a 2-D array with multi-level labels
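
The constructors listed above can be tried with a small sketch like the following. Panel is not available in current pandas releases, so the last step swaps in a different technique: `pd.concat` over a dict of frames, which yields a two-level row index roughly comparable to what `to_frame()` produced. The variable names and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Series from a dict: the dict keys become the index labels.
s = pd.Series({'a': 1, 'b': 2, 'c': 3})
missing = s.get('z', -1)   # dict-like access with a default for absent labels

# DataFrame from a 2-D array, with explicit row and column labels.
dates = pd.date_range('20160301', periods=6)
df_array = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

# DataFrame from a dict: every value list must have the same length.
df_dict = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c'], 'C': [545, 565, 585]})

# Stand-in for a Panel (assumption: a two-level row index is acceptable).
# The dict keys become the outer index level.
items = {'Item1': pd.DataFrame(np.ones((4, 3))),
         'Item2': pd.DataFrame(np.zeros((4, 2)))}
stacked = pd.concat(items)

print(stacked)
```

Item2 has only two columns, so its third column is filled with NaN after concatenation.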
      
  • Reindex
    s.reindex(labels, fill_value=0)  reindex; labels missing from the original get the default value 0
    s.reindex(labels, method='ffill')  reindex; missing entries are filled with the preceding value (method='bfill' fills with the following value)
    df.reindex(index=..., columns=...)  reindex a DataFrame
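
A minimal sketch of the three fill behaviours, using a made-up Series with gaps in its index. `method=` requires a monotonic index, which holds here.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[0, 2, 4])

filled_zero = s.reindex(range(6), fill_value=0)     # absent labels get 0
filled_fwd = s.reindex(range(6), method='ffill')    # absent labels copy the previous value
filled_back = s.reindex(range(6), method='bfill')   # absent labels copy the next value

# Reindexing a DataFrame on both axes; unknown labels produce NaN.
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])
df2 = df.reindex(index=['y', 'x', 'z'], columns=['B', 'A', 'C'])

print(filled_back)
print(df2)
```

With 'bfill' the last label (5) has no following value, so it stays NaN.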

  • Viewing data:
    data.values  returns the data as an ndarray
    data.shape  shape of the data
    data.head()  returns the first five rows
    data.tail()  returns the last five rows
    data.index  returns the row labels
    data.columns  returns the column labels
    s.value_counts()  number of times each value occurs (on a DataFrame, available since pandas 1.1, it counts unique rows)
    df.mode()  returns the most frequent value(s)
    data.describe()  basic summary statistics (mean, quantiles, etc.)
    df.pivot_table(values=['D'], index=['A', 'B'], columns=['C'], aggfunc='mean')  multiple values for the same cell are averaged; cells with no matching data are null
    s.unique()  returns the distinct values of a Series
    s.index.is_unique  whether the index labels are free of duplicates
    s.isin(values)  for each element of the Series, whether it is in the given list
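
The less obvious calls in this list (`value_counts`, `unique`, `isin`, `pivot_table`) can be sketched on a tiny made-up frame. The column names follow the `pivot_table` line above; the values are invented.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['x', 'x', 'y', 'y'],
    'B': ['p', 'p', 'q', 'q'],
    'C': ['u', 'v', 'u', 'u'],
    'D': [1.0, 2.0, 3.0, 5.0],
})

counts = df['A'].value_counts()        # occurrences of each value in column A
distinct = df['C'].unique()            # distinct values, in order of appearance
in_list = df['C'].isin(['v'])          # Boolean Series, True where C is 'v'
unique_index = df.index.is_unique      # True: the default index has no repeats

# Rows sharing the same (A, B, C) are averaged; empty combinations become NaN.
table = df.pivot_table(values=['D'], index=['A', 'B'], columns=['C'], aggfunc='mean')

print(table)
```

Here the two rows with (y, q, u) are averaged into one cell, while (y, q, v) has no data and is left null.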

  • Data selection
    data[col] or data.col  select the given column
    data.loc[label]  select a row by its label
    data.loc['20160301', 'B']  select a single value by row and column labels
    data.iloc[position]  select a row by its integer position
    data[5:10]  select a range of rows
    data[bool_vector]  select rows using a Boolean vector
    data.at[pd.Timestamp('20160301'), 'B']  same as the .loc lookup, but the label must be given in its native type; faster
    data.iat[1, 1]  select a single value by integer position
    data[data > 0]  select by condition
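
A short sketch that runs each selection form on a deterministic frame, so the results are easy to check by eye. The frame contents are invented for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20160301', periods=6)
data = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

col_b = data['B']                                  # one column, same as data.B
row_by_label = data.loc['20160301']                # row selected by label
row_by_pos = data.iloc[0]                          # row selected by position
value_loc = data.loc['20160301', 'B']              # scalar by labels (string is parsed)
value_at = data.at[pd.Timestamp('20160301'), 'B']  # scalar by labels, native type
value_iat = data.iat[1, 1]                         # scalar by integer position
some_rows = data[2:4]                              # slice of rows
big_rows = data[data['A'] > 8]                     # rows where the Boolean vector is True
masked = data[data > 10]                           # same shape, NaN where the condition fails

print(value_loc, value_at, value_iat)
```

Note the difference between the last two: a Boolean vector drops rows, while a Boolean frame keeps the shape and blanks out the cells that fail the condition.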

  • Sorting
    data.sort_index(axis=1, ascending=True)  sort by column labels
    data.sort_values(by='column name')  sort by the values of a column
    s.rank(method='first' / 'average')  rank the values of a Series ('average' gives tied values their mean rank; 'first' ranks ties in order of appearance)
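
To make the two `rank` methods concrete, here is a small sketch with one tie in the data. All values are invented.

```python
import pandas as pd

data = pd.DataFrame({'B': [3, 1, 2], 'A': [9, 7, 8]}, index=['r1', 'r2', 'r3'])

by_columns = data.sort_index(axis=1, ascending=True)   # columns reordered to A, B
by_values = data.sort_values(by='B')                   # rows ordered by column B

s = pd.Series([7, 3, 7, 1])
rank_avg = s.rank(method='average')   # the two 7s share the mean of ranks 3 and 4
rank_first = s.rank(method='first')   # the first 7 gets rank 3, the second rank 4

print(rank_avg.tolist())
print(rank_first.tolist())
```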

  • Add / delete / change

    • data.copy()  deep copy
    • pd.concat([df1, df2, df3, ...])  concatenate several DataFrames
      pd.merge(left, right, on='key')  join left and right on a key
      # SQL equivalent: SELECT * FROM left INNER JOIN right ON left.key = right.key;
      df.append(s, ignore_index=True)  append rows vertically (removed in pandas 2.0; pd.concat does the same job)
    • df.insert(position, 'new column name', values)  insert a new column at the given position
      df.assign(new_column_name=values)  add new columns (works on a copy; the original DataFrame is unchanged)
    • del df['column name'] or df.pop('column name') or df.drop('column name', axis=1)  delete a column
      df.drop('row label')  delete a row
    • df.stack()  move the column index into the row index
      stacked.unstack()  move the row index back into the column index
      df.add_prefix('prefix')  add a prefix to the column names
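
The combining and reshaping calls can be sketched end to end on made-up frames. `pd.concat` is used for the vertical append, since it works on both old and new pandas versions.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
df2 = pd.DataFrame({'key': ['c'], 'x': [3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

stacked_rows = pd.concat([df1, df2], ignore_index=True)   # rows of df2 placed under df1
joined = pd.merge(stacked_rows, right, on='key')          # inner join: only keys b and c match

df = stacked_rows.copy()
df.insert(1, 'pos', [0, 1, 2])               # in place: new column at position 1
with_total = df.assign(total=df['x'] * 10)   # returns a copy; df itself is unchanged
without_pos = df.drop('pos', axis=1)         # drop a column (returns a copy)
without_row0 = df.drop(0)                    # drop the row labelled 0

long_form = df.set_index('key').stack()      # column labels move into the row index
wide_again = long_form.unstack()             # and back out into columns

print(joined)
print(long_form)
```

`insert` modifies the frame in place, while `assign` and `drop` return new objects, which is why `df` keeps its `pos` column and gains no `total` column.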
  • Basic operations

    • Missing values
      df.dropna()  drop the rows that contain null values
      df.fillna(value=0)  replace null values with 0
      pd.isnull(df)  check element-wise whether each value is null
      pd.isnull(df).any().any()  check whether the DataFrame contains any null value at all
    • Shifting
      pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)  shift the values down by two positions (the first two become NaN)
    • Whole-array operations
      df.sub(s, axis='index')  subtract the one-dimensional array s from every column of the two-dimensional array df
      df.apply(np.cumsum, axis=...)  apply a function along rows or columns (default axis=0, i.e. down each column); here a cumulative sum over the two-dimensional array
      df.applymap(func)  apply a function to every single value (renamed DataFrame.map in pandas 2.1)
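
A compact sketch of the operations in this group on invented data. For the element-wise step it uses `Series.map` inside `apply`, an assumption made here to avoid the `applymap` / `map` naming difference between pandas versions.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20160301', periods=6)
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)   # values move down two rows

df = pd.DataFrame({'A': [1.0, 2.0, np.nan], 'B': [4.0, 5.0, 6.0]})
has_null = pd.isnull(df).any().any()   # True if any cell is null
dropped = df.dropna()                  # rows with a null removed
filled = df.fillna(value=0)            # nulls replaced by 0

offsets = pd.Series([1.0, 1.0, 1.0])
shifted_down = filled.sub(offsets, axis='index')   # subtract offsets from every column
running = filled.apply(np.cumsum)                  # cumulative sum down each column
squared = filled.apply(lambda col: col.map(lambda v: v ** 2))   # element-wise function

print(s)
print(running)
```

After the shift, the series holds three NaN values: two introduced at the top and the original one, now two rows lower.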
  • Grouping

    • df.groupby('A').sum()  grouped statistics
    • df.groupby('A').size()  number of values in each group
    • Grouping with a dictionary mapping
      mapping = {'A': 'Red', 'B': 'Red', 'C': 'Blue', 'D': 'Orange', 'E': 'Blue'}
      df.groupby(mapping, axis=1)  group the columns through the mapping
    • df.groupby(len)  group by a function (it is called on each index label)
    • df.groupby(level=..., axis=...)  group by index level
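
The four grouping styles, sketched on small invented frames. Recent pandas versions deprecate `axis=1` in `groupby`, so the mapping example groups the transposed frame instead; that substitution is a choice made for this sketch, not part of the notes.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'], 'C': [1, 2, 3, 4]})
sums = df.groupby('A').sum()      # one row per distinct value of A
sizes = df.groupby('A').size()    # number of rows in each group

# Grouping columns through a mapping: transpose, group the labels, transpose back.
wide = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])
mapping = {'a': 'red', 'b': 'red', 'c': 'blue'}
by_colour = wide.T.groupby(mapping).sum().T

# Grouping by a function: len is called on each index label.
people = pd.DataFrame({'v': [1, 2, 3]}, index=['Joe', 'Ann', 'Wesley'])
by_len = people.groupby(len).sum()

# Grouping by one level of a MultiIndex.
idx = pd.MultiIndex.from_product([['x', 'y'], [1, 2]], names=['outer', 'inner'])
multi = pd.DataFrame({'v': [1, 2, 3, 4]}, index=idx)
by_level = multi.groupby(level='outer').sum()

print(by_colour)
print(by_len)
```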
  • Time series

    • Creating time series
      dates = pd.date_range('20160301', periods=6)  the default interval is one day
      dates = pd.period_range('2000Q1', '2016Q1', freq='Q')
    • Resampling
      s.resample('2min').sum()  resample into 2-minute bins and sum each bin (the old how='sum' argument is no longer accepted)
    • Computing time intervals
      pd.Timestamp('20160301') - pd.Timestamp('20160201')
      pd.Timestamp('20160301') + pd.Timedelta(days=5)
      df['grade'] = df.raw_grade.astype('category')  add a categorical column
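
A runnable sketch of the time-series calls. The minute-level series is invented so that the resampled sums are easy to verify.

```python
import pandas as pd

dates = pd.date_range('20160301', periods=6)                # daily by default
quarters = pd.period_range('2000Q1', '2016Q1', freq='Q')    # quarterly periods

# One value per minute, resampled into 2-minute bins and summed.
minutes = pd.date_range('2016-03-01 00:00', periods=6, freq='min')
s = pd.Series(range(6), index=minutes)
two_min = s.resample('2min').sum()

gap = pd.Timestamp('20160301') - pd.Timestamp('20160201')   # a Timedelta
later = pd.Timestamp('20160301') + pd.Timedelta(days=5)

df = pd.DataFrame({'raw_grade': ['a', 'b', 'a']})
df['grade'] = df.raw_grade.astype('category')

print(two_min)
print(gap, later)
```

Since 2016 is a leap year, the gap between 1 February and 1 March comes out as 29 days.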
  • Data I/O
    df.to_csv('data.csv')  export data
    pd.read_csv('data.csv', index_col=0)  read data, using the first column as the index
    pd.read_table('data.dat', sep='', header=None, names=column_names)  read a .dat file (set sep to the file's delimiter)
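
A round-trip sketch that avoids touching the disk by reading and writing in-memory text. The '|' delimiter, the file content and the column names in the second part are hypothetical, chosen only to show `sep`, `header=None` and `names` together.

```python
import io
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3.5, 4.5]}, index=['r1', 'r2'])

# to_csv returns the CSV text when no path is given.
csv_text = df.to_csv()
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)   # first column becomes the index

# A headerless file with a custom delimiter (hypothetical content and names).
dat_text = "1|F|10\n2|M|20\n"
users = pd.read_table(io.StringIO(dat_text), sep='|', header=None,
                      names=['id', 'sex', 'age'])

print(restored)
print(users)
```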

  • Other
    %matplotlib inline  # render plots inline in the notebook page
    formatter = '{0:.03f}'.format  define a formatting function
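
The formatting function is just the bound `format` method of a template string, so it can be passed anywhere a one-argument function is expected. A small sketch with invented values:

```python
import pandas as pd

formatter = '{0:.03f}'.format      # bound str.format: a function of one argument

s = pd.Series([3.14159, 2.0])
as_text = s.map(formatter)         # every value rendered as text with three decimals

print(as_text.tolist())
```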

Origin blog.csdn.net/weixin_44602323/article/details/104217649