Basic operations on a pandas DataFrame

Basic operations on Series and DataFrame objects in pandas.

The DataFrame a used throughout is shown below:

       a  b  c
one    4  1  1
two    6  2  0
three  6  1  6

First, viewing the data (the methods shown apply equally to Series objects)

1. View the first or last xx rows of a DataFrame
a = DataFrame(data)
a.head(6) displays the first 6 rows of data; head() with no argument displays the first 5 rows by default.
a.tail(6) displays the last 6 rows of data; tail() with no argument likewise displays the last 5 rows by default.

2. View the DataFrame's index, columns, and values
a.index; a.columns; a.values

3. describe() gives quick summary statistics
a.describe() reports statistics for each numeric column, including the count, mean, std, quantiles, and so on.
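A runnable sketch of the viewing methods above; the values of a are assumed to match the example table at the top of the article.

```python
import pandas as pd

# Sketch of the example table from the top of this article (values assumed)
a = pd.DataFrame({'a': [4, 6, 6], 'b': [1, 2, 1], 'c': [1, 0, 6]},
                 index=['one', 'two', 'three'])

print(a.head(2))     # first 2 rows: 'one' and 'two'
print(a.tail(2))     # last 2 rows: 'two' and 'three'
print(a.index)       # Index(['one', 'two', 'three'], dtype='object')
print(a.columns)     # Index(['a', 'b', 'c'], dtype='object')
print(a.values)      # the underlying NumPy array
print(a.describe())  # count, mean, std, min, quartiles, max per column
```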

4. Transpose the data
a.T

5. Sort along an axis
a.sort_index(axis=1, ascending=False)
Here axis=1 sorts the column labels, and the data below each label moves along with it. ascending=False means descending order; when the parameter is omitted, ascending order is the default.

6. Sort the DataFrame by values
a.sort_values(by='x')  (in older pandas versions: a.sort(columns='x'))
This sorts by column x, from smallest to largest. Note that only column x determines the order and whole rows move together, whereas the axis sort above reorders all the columns.
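Both kinds of sort can be sketched on the assumed example table; sort_values is the current spelling of the older a.sort(columns=...).

```python
import pandas as pd

a = pd.DataFrame({'a': [4, 6, 6], 'b': [1, 2, 1], 'c': [1, 0, 6]},
                 index=['one', 'two', 'three'])

# Axis sort: reorder the column labels; each column's data follows its label
by_labels = a.sort_index(axis=1, ascending=False)
print(by_labels.columns)   # c, b, a

# Value sort: order whole rows by column 'c', smallest first
by_values = a.sort_values(by='c')
print(by_values.index)     # two (c=0), one (c=1), three (c=6)
```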

Second, selecting data

1. Select data from specific rows and columns
a['x'] returns column x as a Series; note that this form can only return one column at a time. a.x has the same meaning as a['x'].

Row data is selected by slicing with []
e.g. a[0:3] returns the first three rows of data.
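A minimal sketch of column and row selection, again on the assumed example table:

```python
import pandas as pd

a = pd.DataFrame({'a': [4, 6, 6], 'b': [1, 2, 1], 'c': [1, 0, 6]},
                 index=['one', 'two', 'three'])

col = a['a']    # column 'a' as a Series
same = a.a      # attribute access, same column
rows = a[0:3]   # slicing [] selects rows: the first three rows here
print(col)
print(rows)
```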

2. loc selects data by label
a.loc['one'] selects the row labelled 'one';

a.loc[:, ['a', 'b']] selects all rows of columns a and b;

a.loc[['one', 'two'], ['a', 'b']] selects columns a and b of the two rows 'one' and 'two';

a.loc['one', 'a'] and a.loc[['one'], ['a']] address the same cell, but the former shows only the corresponding value, while the latter shows the corresponding row and column labels as well.
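The loc forms above can be sketched like this (example values assumed):

```python
import pandas as pd

a = pd.DataFrame({'a': [4, 6, 6], 'b': [1, 2, 1], 'c': [1, 0, 6]},
                 index=['one', 'two', 'three'])

print(a.loc['one'])                       # the row labelled 'one', as a Series
print(a.loc[:, ['a', 'b']])               # all rows, columns a and b
print(a.loc[['one', 'two'], ['a', 'b']])  # two rows, two columns
print(a.loc['one', 'a'])                  # scalar: just the value 4
print(a.loc[['one'], ['a']])              # same cell, as a 1x1 DataFrame with labels
```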

3. iloc selects data directly by position
It is used much like loc, but with integer positions instead of labels.
a.iloc[1:2, 1:2] shows the data at row 1, column 1 (the end of a slice is not included);

a.iloc[1:2], with no value after the comma, i.e. no column slice, defaults to selecting row 1 across all columns;

a.iloc[[0, 2], [1, 2]] freely combines row positions and column positions to select the corresponding data.
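The same three iloc forms, runnable on the assumed example table:

```python
import pandas as pd

a = pd.DataFrame({'a': [4, 6, 6], 'b': [1, 2, 1], 'c': [1, 0, 6]},
                 index=['one', 'two', 'three'])

print(a.iloc[1:2, 1:2])        # row position 1, column position 1 (slice end excluded)
print(a.iloc[1:2])             # row position 1, all columns
print(a.iloc[[0, 2], [1, 2]])  # rows 0 and 2, columns 1 and 2
```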

4. Selecting with conditions
Select data using a single column
a[a.c > 0] selects the rows where column c is greater than 0.

Select data anywhere in the table
a[a > 0] selects every value in a that is greater than 0.

Use isin() to select the rows whose column contains particular values
a1 = a.copy()
a1[a1['one'].isin(['2', '3'])] shows the rows meeting the condition: all rows whose value in column one is '2' or '3'.
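A sketch of the three condition forms. The isin() call in the text assumes a column named one, which the running example table does not have, so column 'a' is used here instead:

```python
import pandas as pd

a = pd.DataFrame({'a': [4, 6, 6], 'b': [1, 2, 1], 'c': [1, 0, 6]},
                 index=['one', 'two', 'three'])

pos_c = a[a.c > 0]   # rows where column c > 0: 'one' and 'three'
masked = a[a > 0]    # element-wise: entries that are not > 0 become NaN
a1 = a.copy()
picked = a1[a1['a'].isin([4])]   # rows whose value in column 'a' is 4
print(pos_c)
print(masked)
print(picked)
```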

Third, setting values (assignment)

Assignment can be performed directly on top of the selection operations described above.
For example, a.loc[:, ['a', 'c']] = 9 sets every value in columns a and c, across all rows, to 9.
a.iloc[:, [0, 2]] = 9 likewise sets all rows of columns a and c (positions 0 and 2) to 9.

Conditions can also be used directly in assignment
a[a > 0] = -a converts every number in a greater than 0 to its negative.
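The three assignment forms, sketched on copies of the assumed example table so the original stays intact:

```python
import pandas as pd

a = pd.DataFrame({'a': [4, 6, 6], 'b': [1, 2, 1], 'c': [1, 0, 6]},
                 index=['one', 'two', 'three'])

b = a.copy()
b.loc[:, ['a', 'c']] = 9   # by label: columns a and c of all rows become 9
print(b)

b.iloc[:, [0, 2]] = 7      # by position: the same two columns become 7
print(b)

c = a.copy()
c[c > 0] = -c              # every positive number is flipped to negative
print(c)
```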

Fourth, missing values

In pandas, np.nan represents a missing value, and missing values are excluded from calculations by default.

1. The reindex() method
is used to change, add, or delete the index on a specified axis; it returns a copy of the original data.
a.reindex(index=list(a.index) + ['five'], columns=list(a.columns) + ['d'])

a.reindex(index=['one', 'five'], columns=list(a.columns) + ['d'])

That is, index=[] operates on the table's index, and columns=[] on the table's columns.

2. Filling in missing values
a.fillna(value=x)
fills every missing value with the number x.

3. Removing rows that contain missing values
a.dropna(how='any')
removes every row that contains a missing value.
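The reindex, fillna, and dropna steps above can be chained into one runnable sketch (example values assumed):

```python
import pandas as pd

a = pd.DataFrame({'a': [4, 6, 6], 'b': [1, 2, 1], 'c': [1, 0, 6]},
                 index=['one', 'two', 'three'])

# reindex returns a copy with row 'five' and column 'd' added, filled with NaN
b = a.reindex(index=list(a.index) + ['five'], columns=list(a.columns) + ['d'])
print(b)

filled = b.fillna(value=0)      # every NaN replaced by 0
dropped = b.dropna(how='any')   # drop rows containing any NaN;
# column 'd' is NaN in every row here, so all rows are dropped
print(filled)
print(dropped)
```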

Fifth, merging

1. concat
pd.concat(a1, axis=0/1, keys=['xx', 'xx', ...]), where a1 is a list of the data to be connected. axis=1 joins the tables side by side; when axis=0 or axis is not specified, the tables are stacked vertically. The number of keys should match the number of objects to be connected in a1; setting keys makes it possible to tell the original pieces of a1 apart in the connected result.

Example:
a1 = [b['a'], b['c']]
result = pd.concat(a1, axis=1, keys=['1', '2'])
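The example can be run end to end; the table b below is a hypothetical source, since the original only shows the concat call:

```python
import pandas as pd

# Hypothetical source table b with columns a, b, c (not given in the original)
b = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

a1 = [b['a'], b['c']]
result = pd.concat(a1, axis=1, keys=['1', '2'])
print(result)   # b['a'] and b['c'] side by side, relabelled '1' and '2'
```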

2. append connects one or more rows of data to a DataFrame
a.append(a[2:], ignore_index=True)
appends everything from the third row of a onward back onto a. If ignore_index is not specified, the appended data keeps its original index; if ignore_index=True, a fresh index is automatically rebuilt for all rows.
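Note that DataFrame.append was removed in pandas 2.0; pd.concat with ignore_index is the current equivalent of the call above:

```python
import pandas as pd

a = pd.DataFrame({'a': [4, 6, 6], 'b': [1, 2, 1], 'c': [1, 0, 6]},
                 index=['one', 'two', 'three'])

# Equivalent of a.append(a[2:], ignore_index=True) in current pandas
out = pd.concat([a, a[2:]], ignore_index=True)
print(out)   # 4 rows; the index is rebuilt as 0..3
```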

3. merge, similar to a SQL join
Given two DataFrames a1 and a2 that share a key column, there are the following ways to connect the two objects:
(1) inner join: pd.merge(a1, a2, on='key')
(2) left join: pd.merge(a1, a2, on='key', how='left')
(3) right join: pd.merge(a1, a2, on='key', how='right')
(4) outer join: pd.merge(a1, a2, on='key', how='outer')
For the precise differences between the four, refer to the corresponding SQL join semantics.
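A pair of small hypothetical key tables makes the four join types visible at a glance:

```python
import pandas as pd

# Hypothetical tables sharing a 'key' column (assumed for illustration)
a1 = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'x': [1, 2, 3]})
a2 = pd.DataFrame({'key': ['K1', 'K2', 'K3'], 'y': [4, 5, 6]})

inner = pd.merge(a1, a2, on='key')               # keys present in both: K1, K2
left = pd.merge(a1, a2, on='key', how='left')    # every key of a1; NaN where a2 has no match
right = pd.merge(a1, a2, on='key', how='right')  # every key of a2
outer = pd.merge(a1, a2, on='key', how='outer')  # the union of the keys
print(outer)
```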

Sixth, grouping (groupby)

The pd.date_range function generates a given number of consecutive dates starting from a specified date:
pd.date_range('20000101', periods=10)

def shuju():
    data = {
        'date': pd.date_range('20000101', periods=10),
        'gender': np.random.randint(0, 2, size=10),
        'height': np.random.randint(40, 50, size=10),
        'weight': np.random.randint(150, 180, size=10)
    }
    a = DataFrame(data)
    print(a)

        date  gender  height  weight
0 2000-01-01       0      47     165
1 2000-01-02       0      46     179
2 2000-01-03       1      48     172
3 2000-01-04       0      45     173
4 2000-01-05       1      47     151
5 2000-01-06       0      45     172
6 2000-01-07       0      48     167
7 2000-01-08       0      45     157
8 2000-01-09       1      42     157
9 2000-01-10       1      42     164

a.groupby('gender').sum() then gives a result like the one below (the data is random, so the numbers differ on each run). Note that groupby('xx') must be followed by an aggregation such as sum(); otherwise what is displayed is the groupby object, not the data.

        height  weight
gender
0          256     989
1          170     643

Furthermore, a.groupby('gender').size() counts how many rows there are of each gender.

The effect of groupby can thus be seen:
it classifies the rows by gender, automatically sums the corresponding numeric columns, and leaves string columns out of the display. groupby can of course also group on several fields at once, as in groupby(['x1', 'x2', ...]), with an effect analogous to the above.
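The grouping steps above can be sketched as follows. The exact numbers vary on every run, since the data is random; note also that in recent pandas the non-numeric date column must be excluded before sum():

```python
import numpy as np
import pandas as pd

data = {
    'date': pd.date_range('20000101', periods=10),
    'gender': np.random.randint(0, 2, size=10),
    'height': np.random.randint(40, 50, size=10),
    'weight': np.random.randint(150, 180, size=10),
}
a = pd.DataFrame(data)

# Select the numeric columns before summing; recent pandas refuses to sum dates
sums = a.groupby('gender')[['height', 'weight']].sum()
counts = a.groupby('gender').size()
print(sums)
print(counts)   # how many rows of each gender
```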

Seventh, Categorical: re-encoding a column into categories

Re-encode the gender classification from section six, turning 0 and 1 into male and female respectively, as follows:

a['gender1'] = a['gender'].astype('category')
a['gender1'].cat.categories = ['male', 'female']  # i.e. convert the 0/1 codes of the category type into male/female

print(a) then gives:

        date  gender  height  weight gender1
0 2000-01-01       1      40     163  female
1 2000-01-02       0      44     177    male
2 2000-01-03       1      40     167  female
3 2000-01-04       0      41     161    male
4 2000-01-05       0      48     177    male
5 2000-01-06       1      46     179  female
6 2000-01-07       1      42     154  female
7 2000-01-08       1      43     170  female
8 2000-01-09       0      46     158    male
9 2000-01-10       1      44     168  female

As can be seen, the re-encoded column is automatically appended to the DataFrame as the last column.
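Assigning to .cat.categories, as above, no longer works in recent pandas; cat.rename_categories does the same re-encoding. A sketch on a small assumed gender column:

```python
import pandas as pd

a = pd.DataFrame({'gender': [0, 1, 0, 1]})   # small assumed sample
a['gender1'] = a['gender'].astype('category')
# Rename the category codes 0, 1 to male, female
a['gender1'] = a['gender1'].cat.rename_categories(['male', 'female'])
print(a)   # gender1 appears as the last column
```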

Eighth, related operations

Descriptive statistics:
1. a.mean() averages each column of data by default; adding the parameter, as in a.mean(1), averages each row instead.

2. Count the number of times each value appears in a column x: a['x'].value_counts().

3. Applying a function to the data
a.apply(lambda x: x.max() - x.min())
returns, for every column, the difference between its maximum and minimum values.

4. String operations
a['gender1'].str.lower() turns all the capitals in gender1 into lowercase. Note that a DataFrame has no str attribute; only a Series does, so the gender1 field must be selected first.
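The four operations above, sketched on the assumed example table (with a small assumed string Series for .str):

```python
import pandas as pd

a = pd.DataFrame({'a': [4, 6, 6], 'b': [1, 2, 1], 'c': [1, 0, 6]},
                 index=['one', 'two', 'three'])

print(a.mean())     # mean of each column
print(a.mean(1))    # mean of each row
print(a['a'].value_counts())                 # 6 appears twice, 4 once
print(a.apply(lambda x: x.max() - x.min()))  # per-column max - min
s = pd.Series(['MALE', 'Female'])            # assumed sample strings
print(s.str.lower())  # .str lives on a Series, not on the whole DataFrame
```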

Ninth, time series

As in section six, the pd.date_range('xxxx', periods=xx, freq='D/M/Y...') function generates a list of consecutive dates starting from the specified date.
E.g. pd.date_range('20000101', periods=10) generates periods consecutive dates;
pd.date_range('20000201', '20000210', freq='D') instead specifies the start and end dates, without giving periods.

Furthermore, if freq is not specified, the default frequency from the start date is one day. The other frequency aliases are shown below:


[image 1.png: table of the frequency aliases]
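The date ranges above can be run directly; freq='W' is used here as one representative alias:

```python
import pandas as pd

d1 = pd.date_range('20000101', periods=10)            # 10 consecutive days
d2 = pd.date_range('20000201', '20000210', freq='D')  # start and end date instead of periods
w = pd.date_range('20000101', periods=3, freq='W')    # weekly frequency, one of the aliases
print(d1[0], d1[-1])
print(len(d2))
```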

Tenth, plotting (plot)

First, in pycharm: import matplotlib.pyplot as plt
a = Series(np.random.randn(1000), index=pd.date_range('20100101', periods=1000))
b = a.cumsum()
b.plot()
plt.show()  # plt.show() must be added at the end, otherwise no figure is displayed.

[image 2.png: line chart of the cumulative sum]


The following code can be used to chart several time series at once:

a = DataFrame(np.random.randn(1000, 4), index=pd.date_range('20100101', periods=1000), columns=list('ABCD'))
b = a.cumsum()
b.plot()
plt.show()

[image 3.png: chart of the four cumulative-sum series]

Eleventh, importing and exporting files

Writing and reading Excel files
Although there are two ways to write a table out, xls and csv, it is best to use csv sparingly: otherwise, after adjusting the data format inside the table, saving keeps asking whether to save in the new format, which is a lot of trouble. Also, when reading the data back with a specified sheet, the formatting can appear misaligned in pycharm.

When data is written to a table, Excel automatically adds a field at the front of the table that numbers the rows of data.

a.to_excel(r'C:\Users\guohuaiqi\Desktop\2.xls', sheet_name='Sheet1')

a = pd.read_excel(r'C:\Users\guohuaiqi\Desktop\2.xls', 'Sheet1', na_values=['NA'])

Note that Sheet1 after sheet_name starts with a capital letter. When reading the data, you can specify which sheet to read from, and missing values are filled in as NA.

Finally, the code for writing and reading the csv format:
a.to_csv(r'C:\Users\guohuaiqi\Desktop\1.csv')
a = pd.read_csv(r'C:\Users\guohuaiqi\Desktop\1.csv', na_values=['NA'])
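The csv round trip can be sketched in memory with a StringIO buffer, so no Desktop path is needed; note that to_csv takes no sheet_name parameter, as that option belongs to to_excel:

```python
import pandas as pd
from io import StringIO

a = pd.DataFrame({'a': [4, 6], 'b': [1, 2]}, index=['one', 'two'])  # assumed small table

buf = StringIO()
a.to_csv(buf)    # write the table as csv text into the buffer
buf.seek(0)
back = pd.read_csv(buf, index_col=0, na_values=['NA'])
print(back)      # the same table, read back in
```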
Reposted from: http://www.jianshu.com/p/75f915cc5147

Origin www.cnblogs.com/yanruizhe/p/12131561.html