Pandas usage summary

I. Generating a Data Table

1. First, import the pandas library; numpy is usually needed alongside it, so import both:

import numpy as np
import pandas as pd

2. Import a CSV or xlsx file:

data = pd.DataFrame(pd.read_csv('name.csv', header=1))  # header=1 uses the file's second row as the column names
data = pd.DataFrame(pd.read_excel('name.xlsx'))

3. Create a data table with pandas


df = pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006],
"date":pd.date_range('20130102', periods=6),
"city":['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],
"age":[23,44,54,32,34,32],
"category":['100-A','100-B','110-A','110-C','210-A','130-F'],
"price":[1200,np.nan,2133,5433,np.nan,4432]},
columns =['id','date','city','category','age','price'])

II. Viewing the Data Table

1. View the dimensions:

data.shape

2. Basic table information (dimensions, column names, data types, memory usage, etc.):

data.info()

3. Data type of each column:

data.dtypes

4. Data type of a single column:

data['city'].dtype

5. Check for null values:

data.isnull()

6. Check a single column for null values:

data['price'].isnull()

7. View the unique values of a column:

data['city'].unique()

8. View the values of the data table:

data.values

9. View the column names:

data.columns

10. View the first and last rows of data:

data.head()  # first 5 rows by default; pass a number, e.g. data.head(10), for more
data.tail()  # last 5 rows by default

 

III. Cleaning the Data Table

1. Fill null values with the number zero:

data.fillna(value=0)

2. Fill NA values in the price column with its mean:

data['price'].fillna(data['price'].mean())

3. Strip whitespace from the city field:

data['city'] = data['city'].map(str.strip)

4. Convert case:

data['city'] = data['city'].str.lower()

5. Change the data type:

data['price'].astype('int')

6. Rename a column:

data.rename(columns={'category': 'category-size'})

7. Drop duplicate values (the first occurrence is kept by default):

data['city'].drop_duplicates()

8. Drop duplicates, keeping the last occurrence instead:

data['city'].drop_duplicates(keep='last')

9. Replace values:

data['city'].replace('sh', 'shanghai')

IV. Data Preprocessing


df1=pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006,1007,1008],
"gender":['male','female','male','female','male','female','male','female'],
"pay":['Y','N','Y','Y','N','Y','N','Y',],
"m-point":[10,12,20,40,40,40,30,20]})

1. Merging data tables

1.1 merge

df_inner = pd.merge(df, df1, how='inner')  # intersection: keep only matching keys
df_left = pd.merge(df, df1, how='left')
df_right = pd.merge(df, df1, how='right')
df_outer = pd.merge(df, df1, how='outer')  # union: keep all keys

1.2 append

result = df1.append(df2)  # df2 is a placeholder table; note DataFrame.append was removed in pandas 2.0, where pd.concat([df1, df2]) replaces it

1.3 join

result = left.join(right, on='key')  # left and right are placeholder DataFrames sharing a 'key' column

1.4 concat

pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False,
          copy=True)

objs: a sequence or mapping of Series, DataFrame, or Panel objects. If a dict is passed, its sorted keys will be used as the keys argument unless keys is passed explicitly. Any None objects are dropped silently, unless they are all None, in which case a ValueError is raised.
axis: {0, 1, ...}, default 0. The axis to concatenate along.
join: {'inner', 'outer'}, default 'outer'. How to handle indexes on the other axis(es): outer takes the union, inner the intersection.
ignore_index: boolean, default False. If True, do not use the index values along the concatenation axis; the resulting axis is labeled 0, ..., n-1. This is useful when the concatenation axis has no meaningful indexing information. Note that index values on the other axes are still respected in the join.
join_axes: list of Index objects. Specific indexes to use for the other n-1 axes instead of performing inner/outer set logic.
keys: sequence, default None. Construct a hierarchical index using the passed keys as the outermost level. If multiple levels are passed, they should be tuples.
levels: list of sequences, default None. Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys.
names: list, default None. Names for the levels in the resulting hierarchical index.
verify_integrity: boolean, default False. Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation.
copy: boolean, default True. If False, do not copy data unnecessarily.

Example:

frames = [df1, df2, df3]  # df2 and df3 are placeholder tables
result = pd.concat(frames)
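
As a sketch of the keys parameter described above, the df and df1 tables built earlier can be concatenated with an extra outermost index level (the 'x'/'y' labels are chosen only for illustration):

result = pd.concat([df, df1], keys=['x', 'y'])  # outermost index level marks each row's source table
result.loc['x']  # select the rows that came from df via that outer level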

2. Set a column as the index:

df_inner.set_index('id')

3. Sort by the values of a specific column:

df_inner.sort_values(by=['age'])

4. Sort by the index:

df_inner.sort_index()

5. If the value of the price column is greater than 3000, the group column shows high; otherwise it shows low:

df_inner['group'] = np.where(df_inner['price'] > 3000, 'high', 'low')

6. Mark data that satisfies multiple compound conditions:

df_inner.loc[(df_inner['city'] == 'beijing') & (df_inner['price'] >= 4000), 'sign'] = 1

7. Split the values of the category field into parts and create a new data table, using df_inner's index and naming the columns category and size:

split = pd.DataFrame((x.split('-') for x in df_inner['category']), index=df_inner.index, columns=['category', 'size'])

8. After the split is complete, match the new table back onto the original df_inner table:

df_inner = pd.merge(df_inner, split, right_index=True, left_index=True)

 

V. Data Extraction

Three main functions are used: loc, iloc, and ix. loc extracts by index label, iloc extracts by position, and ix can extract by label and position at the same time (ix was deprecated and later removed in modern pandas, where loc and iloc are preferred). A short sketch contrasting the first two follows.
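
A minimal sketch, assuming df_inner still carries its default integer index (the date column only becomes the index in step 4 below):

df_inner.loc[0:2, ['city', 'price']]  # by label: rows labeled 0 through 2, inclusive, and the named columns
df_inner.iloc[0:3, 0:2]               # by position: the first three rows and first two columns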

1. Extract a single row by index label:

df_inner.loc[3]

2. Extract a range of rows by position:

df_inner.iloc[0:5]

3. Reset the index:

df_inner.reset_index()

4. Set the date column as the index:

df_inner = df_inner.set_index('date')

5. Extract all data up to January 4th:

df_inner[:'2013-01-04']

6. Extract a region of data by position with iloc:

df_inner.iloc[:3, :2]  # the numbers around the colon are positions starting from 0, not index labels: the first three rows and first two columns

7. Extract scattered positions with iloc:

df_inner.iloc[[0, 2, 5], [4, 5]]  # rows 0, 2 and 5; columns 4 and 5

8. Extract data by a mix of label and position with ix:

df_inner.ix[:'2013-01-03', :4]  # rows up to 2013-01-03, first four columns (ix was removed in pandas 1.0; prefer loc/iloc)

9. Check whether the values of the city column equal beijing:

df_inner['city'].isin(['beijing'])

10. Check whether the city column contains beijing or shanghai, and extract the rows that meet the condition:

df_inner.loc[df_inner['city'].isin(['beijing', 'shanghai'])]

11. Extract the first three characters of the category column and generate a data table:

pd.DataFrame(df_inner['category'].str[:3])

VI. Data Filtering

Use "and", "or", and "not" conditions together with greater-than, less-than, and equals comparisons to filter the data, then count and sum the results.

1. Filter with "and":

df_inner.loc[(df_inner['age'] > 25) & (df_inner['city'] == 'beijing'), ['id','city','age','category','gender']]

2. Filter with "or":

df_inner.loc[(df_inner['age'] > 25) | (df_inner['city'] == 'beijing'), ['id','city','age','category','gender']].sort_values(by='age')  # .sort() was removed from pandas; sort_values replaces it

3. Filter with "not":

df_inner.loc[(df_inner['city'] != 'beijing'), ['id','city','age','category','gender']].sort_values(by='id')

4. Count the filtered data by the city column:

df_inner.loc[(df_inner['city'] != 'beijing'), ['id','city','age','category','gender']].sort_values(by='id').city.count()

5. Filter with the query function:

df_inner.query('city == ["beijing", "shanghai"]')

6. Sum the price of the filtered results:

df_inner.query('city == ["beijing", "shanghai"]').price.sum()

VII. Data Summary

The main functions are groupby and pivot_table; a pivot_table sketch follows the groupby examples below.

1. Summarize counts for all columns, grouped by city:

df_inner.groupby('city').count()

2. Count the id field grouped by city:

df_inner.groupby('city')['id'].count()

3. Count grouped by two fields:

df_inner.groupby(['city', 'size'])['id'].count()

4. Group by the city field and calculate the count, total, and mean of price:

df_inner.groupby('city')['price'].agg([len, np.sum, np.mean])
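
pivot_table, mentioned above, produces a similar summary in a single call. A minimal sketch against the df_inner table built earlier (the choice of index, columns, values, and aggregation functions is only illustrative):

pd.pivot_table(df_inner, index=['city'], columns=['size'], values=['price'],
               aggfunc=[len, np.sum], fill_value=0)  # count and total price per city, split into one column per size value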

VIII. Statistics

Data sampling; calculating the standard deviation, covariance, and correlation coefficient.

1. Simple random sampling:

df_inner.sample(n=3)

2. Manually set the sampling weights:

weights = [0, 0, 0, 0, 0.5, 0.5]
df_inner.sample(n=2, weights=weights)

3. Sampling without replacement:

df_inner.sample(n=6, replace=False)

4. Sampling with replacement:

df_inner.sample(n=6, replace=True)

5. Descriptive statistics of the data table:

df_inner.describe().round(2).T  # round() sets the decimal places displayed; .T transposes the result

6. Calculate the standard deviation of a column:

df_inner['price'].std()

7. Calculate the covariance between two fields:

df_inner['price'].cov(df_inner['m-point'])

8. Covariance between all fields in the data table:

df_inner.cov()

9. Correlation analysis of two fields:

df_inner['price'].corr(df_inner['m-point'])  # the correlation coefficient lies between -1 and 1: near 1 is positively correlated, near -1 negatively correlated, 0 uncorrelated

10. Correlation analysis of the whole data table:

df_inner.corr()

IX. Data Output

The analyzed data can be output in xlsx or csv format.

1. Write to Excel:

df_inner.to_excel('excel_to_python.xlsx', sheet_name='bluewhale_cc')

2. Write to CSV:

df_inner.to_csv('excel_to_python.csv')

 

Origin www.cnblogs.com/zuichuyouren/p/11094662.html