I. Generating a data table
1. First import the pandas library; the numpy library is usually needed as well:
import numpy as np
import pandas as pd
2. Import a CSV or xlsx file:
data = pd.read_csv('name.csv', header=1)
data = pd.read_excel('name.xlsx')
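In practice the import step often needs a few extra arguments; a minimal sketch, assuming a UTF-8 encoded CSV whose first row is the header and a worksheet named Sheet1 (both assumptions for illustration):
data = pd.read_csv('name.csv', header=0, encoding='utf-8')  # header=0 uses the first row as column names
data = pd.read_excel('name.xlsx', sheet_name='Sheet1')  # sheet name is hypothetical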
3. Create a data table with pandas
df = pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006],
"date":pd.date_range('20130102', periods=6),
"city":['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],
"age":[23,44,54,32,34,32],
"category":['100-A','100-B','110-A','110-C','210-A','130-F'],
"price":[1200,np.nan,2133,5433,np.nan,4432]},
columns =['id','date','city','category','age','price'])
II. Viewing the data table
1. View the table dimensions:
data.shape
2. Basic table information (dimensions, column names, data types, memory usage, etc.):
data.info()
3. Data type of each column:
data.dtypes
4. Data type of a single column:
data['B'].dtype
5. Check for null values:
data.isnull()
6. Check a single column for null values:
data['B'].isnull()
7. View the unique values of a column:
data['B'].unique()
8. View the values of the data table:
data.values
9. View the column names:
data.columns
10. View the first 10 rows and the last 10 rows of data:
data.head(10)  # first 10 rows (head() returns 5 by default)
data.tail(10)  # last 10 rows (tail() returns 5 by default)
III. Cleaning the data table
1. Fill null values with the number zero:
data.fillna(value=0)
2. Fill NA values with the mean of the price column:
data['price'].fillna(data['price'].mean())
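fillna returns a new object rather than changing the table in place; a minimal sketch of keeping the result by assigning it back:
data['price'] = data['price'].fillna(data['price'].mean())  # assign back so the filled values persist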
3. Strip whitespace from the city field:
data['city'] = data['city'].map(str.strip)
4. Convert case:
data['city'] = data['city'].str.lower()
5. Change the data format:
data['price'].astype('int')
6. Rename a column:
data.rename(columns={'category': 'category-size'})
7. Drop duplicate values (the first occurrence is kept):
data['city'].drop_duplicates()
8. Drop duplicate values, keeping the last occurrence:
data['city'].drop_duplicates(keep='last')
9. Replace values:
data['city'].replace('sh', 'shanghai')
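The cleaning steps above can be chained into a short pass over the table; a minimal sketch combining the fill, strip, and lower-case steps:
data['price'] = data['price'].fillna(data['price'].mean())  # step 2: fill nulls with the column mean
data['city'] = data['city'].map(str.strip).str.lower()  # steps 3 and 4: strip spaces, then lower-case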
IV. Data preprocessing
df1=pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006,1007,1008],
"gender":['male','female','male','female','male','female','male','female'],
"pay":['Y','N','Y','Y','N','Y','N','Y',],
"m-point":[10,12,20,40,40,40,30,20]})
1. Merging data tables
1.1 merge
df_inner = pd.merge(df, df1, how='inner')  # intersection of keys
df_left = pd.merge(df, df1, how='left')
df_right = pd.merge(df, df1, how='right')
df_outer = pd.merge(df, df1, how='outer')  # union of keys
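By default merge joins on the columns the two tables share by name (here id); naming the key explicitly with the on parameter is a common safeguard. A minimal sketch:
df_inner = pd.merge(df, df1, on='id', how='inner')  # join explicitly on the shared id column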
1.2 append
result = df1.append(df2)
1.3 join
result = left.join(right, on='key')
1.4 concat
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False,
          copy=True)
objs: a sequence or mapping of Series, DataFrame, or Panel objects. If a dict is passed, its sorted keys are used as the keys argument unless keys is passed explicitly, in which case the values are selected. Any None objects are dropped silently, unless they are all None, in which case a ValueError is raised.
axis: {0, 1, ...}, default 0. The axis to concatenate along.
join: {'inner', 'outer'}, default 'outer'. How to handle indexes on the other axis(es): 'outer' takes the union, 'inner' the intersection.
ignore_index: boolean, default False. If True, do not use the index values along the concatenation axis; the resulting axis is labeled 0, ..., n-1. Useful when the concatenated objects have no meaningful index information. Note that index values on the other axes are still respected in the join.
join_axes: list of Index objects. Specific indexes to use for the other n-1 axes instead of performing the inner/outer set logic.
keys: sequence, default None. Construct a hierarchical index using the passed keys as the outermost level. If multiple levels are passed, the keys should be tuples.
levels: list of sequences, default None. Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they are inferred from the keys.
names: list, default None. Names for the levels of the resulting hierarchical index.
verify_integrity: boolean, default False. Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation.
copy: boolean, default True. If False, do not copy data unnecessarily.
Example:
frames = [df1, df2, df3]
result = pd.concat(frames)
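A slightly fuller sketch of the parameters described above, using keys to label each source frame in a hierarchical index (the labels x, y, and z are assumptions):
frames = [df1, df2, df3]
result = pd.concat(frames, keys=['x', 'y', 'z'])  # the outermost index level records which frame each row came from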
2. Set a column as the index:
df_inner.set_index('id')
3. Sort by the values of a particular column:
df_inner.sort_values(by=['age'])
4. Sort by the index column:
df_inner.sort_index()
5. If the value of the price column is greater than 3000, show high in the group column, otherwise show low:
df_inner['group'] = np.where(df_inner['price'] > 3000, 'high', 'low')
6. Mark data that meets multiple compound conditions:
df_inner.loc[(df_inner['city'] == 'beijing') & (df_inner['price'] >= 4000), 'sign'] = 1
7. Split the values of the category field, and create a data table whose index is the df_inner index and whose column names are category and size:
split = pd.DataFrame((x.split('-') for x in df_inner['category']), index=df_inner.index, columns=['category', 'size'])
8. Match the split data table back to the original df_inner table:
df_inner = pd.merge(df_inner, split, right_index=True, left_index=True)
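An equivalent way to write steps 7 and 8 is pandas' built-in str.split with expand=True; a minimal sketch:
split = df_inner['category'].str.split('-', expand=True)  # split each value into two columns
split.columns = ['category', 'size']  # name the resulting columns
df_inner = pd.merge(df_inner, split, right_index=True, left_index=True)  # then match back as in step 8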
V. Data Extraction
Three main functions are used: loc, iloc, and ix. loc extracts by label, iloc by position, and ix by a mix of label and position.
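A minimal sketch of the label/position difference (note that ix was deprecated in later pandas versions):
df_inner.loc[3]  # loc: the row whose index label is 3
df_inner.iloc[3]  # iloc: the row at position 3, counting from 0, regardless of its label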
1. Extract a single row by index label:
df_inner.loc[3]
2. Extract a range of rows by index:
df_inner.iloc[0:5]
3. Reset the index:
df_inner.reset_index()
4. Set the date column as the index:
df_inner = df_inner.set_index('date')
5. Extract all data before the 4th:
df_inner[:'2013-01-04']
6. Extract data by position with iloc:
df_inner.iloc[:3, :2]  # the numbers before and after the colon are positions (starting from 0), not index labels: the first three rows and the first two columns
7. Extract data at scattered positions with iloc:
df_inner.iloc[[0, 2, 5], [4, 5]]  # rows 0, 2 and 5; columns 4 and 5
8. Extract data by a mix of label and position with ix:
df_inner.ix[:'2013-01-03', :4]  # rows before 2013-01-03, first four columns
9. Determine whether the values of the city column are beijing:
df_inner['city'].isin(['beijing'])
10. Determine whether the city column contains beijing or shanghai, then extract the matching rows:
df_inner.loc[df_inner['city'].isin(['beijing', 'shanghai'])]
11. Extract the first three characters of the category column and generate a data table:
category = df_inner['category']
pd.DataFrame(category.str[:3])
VI. Data filtering
Use "and", "or", and "not" conditions together with greater-than, less-than, and equality comparisons to filter the data, then count and sum the results.
1. Filter with "and":
df_inner.loc[(df_inner['age'] > 25) & (df_inner['city'] == 'beijing'), ['id', 'city', 'age', 'category', 'gender']]
2. Filter with "or":
df_inner.loc[(df_inner['age'] > 25) | (df_inner['city'] == 'beijing'), ['id', 'city', 'age', 'category', 'gender']].sort_values(['age'])
3. Filter with "not":
df_inner.loc[(df_inner['city'] != 'beijing'), ['id', 'city', 'age', 'category', 'gender']].sort_values(['id'])
4. Count the filtered data by the city column:
df_inner.loc[(df_inner['city'] != 'beijing'), ['id', 'city', 'age', 'category', 'gender']].sort_values(['id']).city.count()
5. Filter with the query function:
df_inner.query('city == ["beijing", "shanghai"]')
6. Sum the price column over the filtered results:
df_inner.query('city == ["beijing", "shanghai"]').price.sum()
VII. Data summary
The main functions are groupby and pivot_table (a pivot_table sketch follows the groupby examples below).
1. Count and summarize all columns:
df_inner.groupby('city').count()
2. Count the id field grouped by city:
df_inner.groupby('city')['id'].count()
3. Count grouped by two fields:
df_inner.groupby(['city', 'size'])['id'].count()
4. Summarize by the city field and calculate the total and mean of price:
df_inner.groupby('city')['price'].agg([len, np.sum, np.mean])
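For comparison, roughly the same summary as step 4 expressed with pivot_table; a minimal sketch (the aggfunc list mirrors the agg call above):
pd.pivot_table(df_inner, index='city', values='price', aggfunc=[len, np.sum, np.mean])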
VIII. Statistics
Data sampling and calculation of the standard deviation, covariance, and correlation coefficients.
1. Simple random sampling:
df_inner.sample(n=3)
2. Manually set the sampling weights:
weights = [0, 0, 0, 0, 0.5, 0.5]
df_inner.sample(n=2, weights=weights)
3. Sampling without replacement:
df_inner.sample(n=6, replace=False)
4. Sampling with replacement:
df_inner.sample(n=6, replace=True)
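Sampling is random on every call; for reproducible draws, sample accepts a random_state seed (the value 42 is arbitrary):
df_inner.sample(n=3, random_state=42)  # returns the same 3 rows on every run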
5. Descriptive statistics for the data table:
df_inner.describe().round(2).T  # round sets the number of decimal places shown; T transposes the result
6. Calculate the standard deviation of a column:
df_inner['price'].std()
7. Calculate the covariance between two fields:
df_inner['price'].cov(df_inner['m-point'])
8. Covariance between all fields in the data table:
df_inner.cov()
9. Correlation analysis of two fields:
df_inner['price'].corr(df_inner['m-point'])  # the correlation coefficient lies between -1 and 1; close to 1 means positive correlation, close to -1 negative correlation, and 0 no correlation
10. Correlation analysis of the whole data table:
df_inner.corr()
IX. Data output
Analysis results can be output in xlsx and csv formats.
1. Write to Excel:
df_inner.to_excel('excel_to_python.xlsx', sheet_name='bluewhale_cc')
2. Write to CSV:
df_inner.to_csv('excel_to_python.csv')
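Writing xlsx files requires an Excel engine such as openpyxl or xlsxwriter to be installed; a minimal sketch of writing several tables into one workbook with ExcelWriter (the sheet names are assumptions):
with pd.ExcelWriter('excel_to_python.xlsx') as writer:
    df_inner.to_excel(writer, sheet_name='result')  # hypothetical sheet name
    df1.to_excel(writer, sheet_name='raw')  # hypothetical sheet name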