Python - Pandas - Data Analysis(1)

pandas data analysis


Check whether pandas is already installed:
pip list
Install it:
pip install pandas

Two basic data structures

One-dimensional array Series

Array initialization:

By default, each element of the list becomes one row of the Series, and the index is an increasing integer index starting at 0.
import pandas as pd

# Initialize a Series from a list
series1 = pd.Series([1, 'a', [1.1, 1.2], 'string'])

print(series1)

print(series1[3])

Specifying the index with index=:

The index does not have to be 0, 1, 2, ...; the labels can also be other types, such as strings.
import numpy as np

series2 = pd.Series([1, 'a', [1.1, 1.2], np.nan], index=['甲', '乙', '丙', '丁'])

print(series2)

print(series2['甲'])

A dictionary specifies both index and values:

Each key in the dictionary becomes the index label of one row in the Series, and its value becomes that row's data.
dictionary = {'甲': 1, '乙': 'a', '丙': [123, 456], '丁': np.nan}

series3 = pd.Series(dictionary)

print(series3)

print(series3['丁'])
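
The three ways of building a Series described above can be compared side by side. This is a minimal sketch; the variable names and the labels 'x', 'y', 'z' are made up for illustration.

```python
import numpy as np
import pandas as pd

# Default index: positions 0..n-1
by_position = pd.Series([10, 20, 30])

# Explicit string labels passed through index=
by_label = pd.Series([10, 20, 30], index=['x', 'y', 'z'])

# Dictionary: keys become the index, values become the data
from_dict = pd.Series({'x': 10, 'y': 20, 'z': np.nan})

print(by_position.index.tolist())  # default integer labels
print(by_label['y'])               # lookup by label
print(from_dict.isna().sum())      # number of missing values
```

Lookup by label works the same way whether the labels came from index= or from dictionary keys.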

Two-dimensional array DataFrame:

Initialization with columns:

columns lists the column names and is defined separately from the data.
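import pandas as pd
import numpy as np

# Write the data row by row
data = [
    ['a', 1, 111],
    ['b', 2, 222],
    ['c', 3, 333]
]
# One name per column, in the same order as the values in each row
columns = ['name', 'class', 'score']

# Pass the names by keyword: the second positional parameter is index, not columns.
# dtype= could force a single type, but these columns are mixed, so it is left out.
dataframe = pd.DataFrame(data, columns=columns)

print(dataframe)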
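
One detail worth checking when passing column names: the second positional parameter of pd.DataFrame is index, so a list of names passed positionally labels the rows rather than the columns. The sketch below shows both spellings; the variable names are illustrative.

```python
import pandas as pd

rows = [['a', 1, 111], ['b', 2, 222], ['c', 3, 333]]
names = ['name', 'class', 'score']

as_columns = pd.DataFrame(rows, columns=names)  # names label the columns
as_index = pd.DataFrame(rows, names)            # names label the rows instead

print(as_columns.columns.tolist())
print(as_index.index.tolist())
print(as_index.columns.tolist())  # columns fall back to 0, 1, 2
```

Passing columns= by keyword avoids the mix-up.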

Dictionary initialization:

Each key in the dictionary becomes one column of the DataFrame. No index is given here, so the rows
get the default integer index. Where a column has no value for a row label (for example when the
columns are Series with different indexes), the gap is filled with NaN (numpy.nan).
import pandas as pd
import numpy as np 

# Write the dictionary column by column
dictionary = {
    'name' : ['a', 'b', 'c'],
    'class': [1, 2, 3],
    'score': [111, 222, 333]
}

dataframe = pd.DataFrame(dictionary)  # dtype= is left out because the columns are mixed

print(dataframe)

print(dataframe[['name', 'score']])
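
To see where NaN comes from in dictionary initialization, here is a small sketch that builds the columns from two Series whose indexes do not fully overlap. The Series names and values are hypothetical.

```python
import pandas as pd

scores = pd.Series([111, 222, 333], index=[0, 1, 2])
classes = pd.Series([1, 2], index=[0, 1])  # no entry for row label 2

# Rows are aligned on the index labels; the missing cell becomes NaN
frame = pd.DataFrame({'class': classes, 'score': scores})

print(frame)
print(frame['class'].isna().tolist())
```

Plain lists, by contrast, must all have the same length, because there are no labels to align on.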

Combining Series and other types in a DataFrame:

A **column** of a DataFrame can be filled with a Series
# The data argument accepts several kinds of values

# Array of dates
dates = pd.date_range('20130101', periods=4)
# Series
series = pd.Series(1, index=list(range(4)), dtype='float32')
# NumPy array
np_array = np.array([3] * 4, dtype='int32')
# Categorical
categorical = pd.Categorical(["test", "train", "test", "train"])

dataframe1 = pd.DataFrame({
    'A': 1.,        # a scalar is repeated on every row; an array must match the row count
    'B': dates,
    'C': series,
    'D': np_array,
    'E': categorical,
    'F': 'foo'
})

print(dataframe1)
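
A quick way to confirm that each column keeps the type of the object it was built from is to inspect dtypes. The following is a reduced sketch of the same idea with three rows; the column contents are chosen only for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'A': 1.0,                                    # scalar, repeated on every row
    'B': pd.Series([1, 2, 3], dtype='float32'),  # the Series supplies the row index
    'C': np.array([3] * 3, dtype='int32'),
    'D': pd.Categorical(['test', 'train', 'test']),
})

print(frame.dtypes)
```

Each column reports its own dtype, which is what distinguishes a DataFrame from a plain two-dimensional array.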

Basic functions:

Indexing data:

  • Sample data:
import pandas as pd

dataframe = pd.DataFrame({
    'a' : [1, 2, 3, 4, 5],
    'b' : ['a', 'b', 'c', 'd', 'e'],
    'c' : [2.1, 2.2, 2.3, 2.4, 2.5]
})
print(dataframe)

Column selection:

dataframe[['key1', 'key2', 'key3', ...]] selects columns by name.
print(dataframe.a)
# Equivalent to:
print(dataframe['a'])
# Several columns:
print(dataframe[['b', 'c']])

loc:

.loc[a:b, ['key1', 'key2', 'key3', ...]]
Rows are selected by label, from a through b. Unlike ordinary Python slices, the end label b is included.
# All rows, column c
print(dataframe.loc[:, ['c']])

# Rows labelled 0 through 3 (four rows), columns c and a
print(dataframe.loc[0:3, ['c', 'a']])
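
Because .loc slices by label, it is worth confirming how the end point of a slice is treated. The sketch below counts the rows returned and repeats the slice with string labels; the labels 'v' to 'z' are invented for the example.

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'c': [2.1, 2.2, 2.3, 2.4, 2.5]})

# Label slice 0:3 returns the rows labelled 0, 1, 2 and 3
by_label = frame.loc[0:3, ['c', 'a']]
print(len(by_label))

# The same rule holds for non-integer labels
labelled = frame.copy()
labelled.index = list('vwxyz')
print(labelled.loc['w':'y', 'a'].tolist())
```

Both slices include their end label, which is the main difference from position-based slicing.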

iloc:

.iloc[a:b:c, d:e:f]
Rows start at position a and advance by c each step, stopping before b.
Columns start at position d and advance by f each step, stopping before e.

.iloc[a:b:c, [d, e]]
Rows start at position a and advance by c each step, stopping before b; only the columns at positions d and e are taken.
print(dataframe.iloc[0:1, 0:3:2])

print(dataframe.iloc[0:4, [0, 2]])
  • iloc works on integer positions and cannot take column names directly; after all, its name starts with i (for integer).
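
The position rules for iloc can be checked on the sample data. This is a small sketch; the variable names stepped and picked are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': ['a', 'b', 'c', 'd', 'e'],
    'c': [2.1, 2.2, 2.3, 2.4, 2.5],
})

stepped = frame.iloc[0:5:2, 0:3:2]  # every second row, every second column
picked = frame.iloc[0:4, [0, 2]]    # rows at positions 0..3, columns at positions 0 and 2

print(stepped.index.tolist(), stepped.columns.tolist())
print(picked.shape)
```

Here the stop position is excluded, as in ordinary Python slicing.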

Conditional indexing:

Comparing a column with a value:

# Print the complete matching rows
print(dataframe[dataframe.a > 3])

# Print only some columns of the matching rows
print(dataframe.loc[dataframe.a > 2, ['b', 'c']])

Testing whether a value is in a list:

print(dataframe[dataframe['a'].isin([3, 5])])
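
Conditional indexing works through a boolean mask: the comparison produces one True/False per row, and only the True rows are kept. The sketch below prints the mask itself and then combines two conditions; combining with & and parentheses is an addition not shown in the text above.

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': list('abcde')})

mask = frame['a'] > 3
print(mask.tolist())
print(frame[mask]['b'].tolist())

# Conditions are combined with & (and) or | (or); each one needs parentheses
both = frame[(frame['a'] > 1) & frame['a'].isin([1, 3, 5])]
print(both['a'].tolist())
```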

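Deleting data (not in place):

The calls below return a new DataFrame and leave the original unchanged.
Here dataframe2 is assumed to be a DataFrame with columns a, b and c, like the sample data above.

Delete rows with index[]:

# df.index[[1, 2]] picks the labels of the rows at positions 1 and 2; drop removes those rows
dataframe2.drop(dataframe2.index[[1, 2]])

Delete columns with a list of names + axis:

dataframe2.drop(['a', 'b'], axis = 1)

Delete rows that hold a given value in a column:

dataframe2[~(dataframe2['a'] == 3)]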
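
The three deletions above can be tried on a small frame to confirm that none of them modifies the original. This is a sketch with a made-up four-row frame standing in for dataframe2.

```python
import pandas as pd

# Hypothetical stand-in for dataframe2
frame = pd.DataFrame({'a': [1, 2, 3, 4], 'b': list('wxyz'), 'c': [0.1, 0.2, 0.3, 0.4]})

fewer_rows = frame.drop(frame.index[[1, 2]])  # remove rows at positions 1 and 2
fewer_cols = frame.drop(['a', 'b'], axis=1)   # remove columns a and b
without_3 = frame[~(frame['a'] == 3)]         # keep rows where a is not 3

print(frame.shape)  # the original is untouched
print(fewer_rows.index.tolist())
print(fewer_cols.columns.tolist())
print(without_3['a'].tolist())
```

To keep a result, assign it to a name as done here.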

Data preprocessing:

  • The given data is as follows:
df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'a': [11, 22, 33],
    'b': [111, 222, 333]})
df2 = pd.DataFrame({
    'id': [4, 5, 6],
    'd': [1.1, 2.2, 3.3]})

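Concatenation with concat:

join decides what happens to labels that exist in only one frame: 'outer' keeps them and fills the gaps with NaN, 'inner' keeps only the labels shared by all frames.
axis decides the direction: axis=0 stacks the frames as rows and aligns them on column names, axis=1 places them side by side and aligns them on the index.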

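axis0 + outer:

# Outer join along rows: column names are aligned, the row index is not merged (0, 1, 2 repeats), gaps become NaN
pd.concat([df1, df2], axis=0, join='outer')

axis0 + inner:

# Inner join along rows: only columns present in both frames are kept (here just id); the row index is not merged
pd.concat([df1, df2], axis=0, join='inner')

axis1 + outer:

# Outer join along columns: rows are aligned on the index, column names are not merged (id appears twice); no NaN here because both indexes are 0, 1, 2
pd.concat([df1, df2], axis=1, join='outer')

axis1 + inner:

# Inner join along columns: only index labels present in both frames are kept; no NaN can appear
pd.concat([df1, df2], axis=1, join='inner')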
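
Inspecting the shape, columns and index of each result makes the effect of axis and join concrete. The sketch below rebuilds df1 and df2 so that it runs on its own; ignore_index=True is an extra option not covered above and is shown as one possible way to renumber the rows.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'a': [11, 22, 33], 'b': [111, 222, 333]})
df2 = pd.DataFrame({'id': [4, 5, 6], 'd': [1.1, 2.2, 3.3]})

rows_outer = pd.concat([df1, df2], axis=0, join='outer')
rows_inner = pd.concat([df1, df2], axis=0, join='inner')
cols_outer = pd.concat([df1, df2], axis=1, join='outer')

print(rows_outer.shape, rows_outer.index.tolist())  # repeated index labels
print(rows_inner.columns.tolist())                  # only the shared column
print(cols_outer.columns.tolist())                  # id appears twice

# Renumber the rows 0..n-1 instead of keeping the original labels
renumbered = pd.concat([df1, df2], axis=0, ignore_index=True)
print(renumbered.index.tolist())
```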

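Joining with merge:

concat with axis=1 can only align on the index.
merge can join on a chosen key column and is not limited to the index.
# Default how='inner': only id values present in both frames are kept.
# df1 and df2 above share no id values, so this particular result is empty.
pd.merge(df1, df2, on = 'id')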
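
A key-based merge is easier to follow when the two frames share some key values but not all. The frames left and right below are hypothetical and chosen so that ids 2 and 3 overlap; how='outer' is shown next to the default for comparison.

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'a': [11, 22, 33]})
right = pd.DataFrame({'id': [2, 3, 4], 'd': [2.2, 3.3, 4.4]})

inner = pd.merge(left, right, on='id')               # default how='inner'
outer = pd.merge(left, right, on='id', how='outer')  # keep ids from either side

print(inner['id'].tolist())
print(sorted(outer['id'].tolist()))
print(int(outer.isna().sum().sum()))  # cells with no match on the other side
```

The key column appears once in the result, unlike concat with axis=1, which keeps both id columns.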

Deduplication with drop_duplicates():

Deduplication targets rows that are completely identical. After deduplication the surviving rows keep
their original index labels, so the index is no longer continuous.
  • Data are as follows:
import pandas as pd

dataframe = pd.DataFrame({
    'id' : [1, 2, 3, 3, 5],
    'vale' : [1, 2, 3, 3, 5]
})

keep='first':

  • Keep the first row of each group of duplicates
dataframe.drop_duplicates(keep='first')

keep='last':

  • Keep the last row of each group of duplicates
dataframe.drop_duplicates(keep='last')

keep=False:

  • Keep none of the duplicated rows; every row that has a duplicate is dropped
dataframe.drop_duplicates(keep=False)
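
The surviving index labels show which rows each keep option retains. This sketch uses a frame with the same duplicate pattern as above; the column name value and the final reset_index(drop=True) step are additions for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'id': [1, 2, 3, 3, 5], 'value': [1, 2, 3, 3, 5]})

first = frame.drop_duplicates(keep='first')
last = frame.drop_duplicates(keep='last')
unique_only = frame.drop_duplicates(keep=False)

print(first.index.tolist())        # the later duplicate (label 3) is gone
print(last.index.tolist())         # the earlier duplicate (label 2) is gone
print(unique_only.index.tolist())  # both duplicated rows are gone

# One way to get a continuous index again
renumbered = first.reset_index(drop=True)
print(renumbered.index.tolist())
```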

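Removing NaNs:

Here dataframe3 is assumed to be any DataFrame that contains NaN values.

any mode:

If a row/column contains at least one NaN, the whole row/column is discarded
# Drop every row that contains a NaN
dataframe3.dropna(how = 'any', axis=0)

# Drop every column that contains a NaN
dataframe3.dropna(how = 'any', axis=1)

all mode:

A row/column is discarded only if all of its values are NaN
# Drop a row only if every value in it is NaN
dataframe3.dropna(how = 'all', axis = 0)

# Drop a column only if every value in it is NaN
dataframe3.dropna(how = 'all', axis = 1)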
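
The difference between the any and all modes shows up clearly on a frame where one row is entirely NaN and another is only partly NaN. The frame below is a hypothetical stand-in for dataframe3.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    'x': [1.0, np.nan, np.nan],
    'y': [4.0, 5.0, np.nan],
    'z': [7.0, 8.0, np.nan],
})

any_rows = frame.dropna(how='any', axis=0)  # rows with at least one NaN are dropped
all_rows = frame.dropna(how='all', axis=0)  # only the fully empty row is dropped
any_cols = frame.dropna(how='any', axis=1)  # every column holds a NaN, so none remain
all_cols = frame.dropna(how='all', axis=1)  # no column is entirely NaN, so all remain

print(any_rows.index.tolist())
print(all_rows.index.tolist())
print(any_cols.shape)
print(all_cols.columns.tolist())
```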

Filling NaNs:

fillna():

Every NaN is replaced with value
dataframe3.fillna(value = 0)

ffill:

Fill each NaN with the value from the previous row of the same column:
# Same effect as fillna(method='ffill', axis=0), which recent pandas versions deprecate
dataframe3.ffill(axis = 0)
Fill each NaN with the value from the previous column of the same row:
dataframe3.ffill(axis = 1)
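
The direction of a forward fill is easiest to see on a tiny frame. The sketch below uses the DataFrame.ffill method, which gives the same forward fill without the method= argument of fillna; the frame is a hypothetical stand-in for dataframe3.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [np.nan, 5.0, np.nan]})

zeros = frame.fillna(value=0)  # every NaN becomes 0
down = frame.ffill(axis=0)     # copy the value from the row above
across = frame.ffill(axis=1)   # copy the value from the column to the left

print(zeros)
print(down)
print(across)
```

A NaN with nothing before it in the fill direction stays NaN, for example the first row of y when filling down.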

Origin blog.csdn.net/buptsd/article/details/129352521