Titanic mission: data loading, preliminary observation, and exploratory analysis

Data loading, preliminary observation, and exploratory analysis

Load and observe

Relative paths and absolute paths

  • Use os.getcwd() to view the current working directory, os.chdir(path) to change it, and os.path.abspath('.') to obtain the absolute path of the current directory (see the sketch after this list).

  • A relative path refers to a file under the current working directory, which can be opened directly: data = pd.read_csv('train.csv')

  • An absolute path lets you point to a file anywhere on disk:

    • Because the backslash \ is used as an escape character, write it as a double backslash \\
    • Use a raw string by adding r before the path, e.g. data = pd.read_csv(r'E:\数分学习\hands-on-data-analysis-master\第一单元项目集合\train.csv')
    • Use forward slashes /, e.g. data = pd.read_csv('E:/数分学习/hands-on-data-analysis-master/第一单元项目集合/train.csv')
  • One difference between read_csv and read_table is that the former uses the comma , as the field separator, while the latter defaults to the tab character. To get the same DataFrame from both, use read_table('train.csv', sep=',')

  • When the data set is too large to fit in memory, read it in chunks; the return type is then a TextFileReader instead of a DataFrame:

    data = pd.read_csv('train.csv', chunksize=100)  # read 100 rows at a time

    for piece in data:
        print(piece)        # iterate over every chunk; do the per-chunk processing inside the for loop

    data.get_chunk()        # read a single chunk (only valid while the reader is not yet exhausted)
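A minimal, self-contained sketch of the calls above, assuming train.csv sits in the current working directory (the directory passed to os.chdir is only an illustration and is left commented out):

import os
import pandas as pd

print(os.getcwd())            # current working directory
print(os.path.abspath('.'))   # absolute path of the current directory
# os.chdir('E:/数分学习')      # switch to another directory (illustrative path)

data = pd.read_csv('train.csv')                   # relative path: file in the current working directory
reader = pd.read_csv('train.csv', chunksize=100)  # TextFileReader, not a DataFrame
chunks = [piece for piece in reader]              # read all chunks...
data_again = pd.concat(chunks)                    # ...and reassemble them into a single DataFrame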
    

Modify column names

data.rename(columns={'PassengerId': '乘客ID',
                     'Survived': '是否幸存',
                     'Pclass': '乘客等级(1/2/3等舱位)'})
# rename() does not modify the original data, so assign the result to a new variable;
# to change the source data directly, add inplace=True

newcolumns = ['乘客ID', '是否幸存', '乘客等级(1/2/3等舱位)']  # the list must contain one name for every column; only the first three are shown here
data.columns = newcolumns  # modifies the column names of the original data directly

data = pd.read_csv('train.csv', names=newcolumns, header=0)  # rename the columns while reading; header=0 prevents the original header row from being read as data

View basic information about the data

df.info()             # print a concise summary
df.describe()         # descriptive statistics
df.values             # the data as an <ndarray>
df.to_numpy()         # the data as an <ndarray> (recommended)
df.shape              # shape: (number of rows, number of columns)
df.columns            # column labels <Index>
df.columns.values     # column labels <ndarray>
df.index              # row labels <Index>
df.index.values       # row labels <ndarray>
df.head(n)            # first n rows
df.tail(n)            # last n rows
pd.options.display.max_columns = n   # display at most n columns
pd.options.display.max_rows = n      # display at most n rows
df.memory_usage()     # memory usage in bytes
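A quick sketch applying a few of these calls to the Titanic data (assuming train.csv is in the working directory):

import pandas as pd

df = pd.read_csv('train.csv')
df.info()                 # column names, dtypes, non-null counts, memory usage
print(df.shape)           # (number of rows, number of columns)
print(df.describe())      # count / mean / std / min / quartiles / max for numeric columns
print(df.head(10))        # first 10 rows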
    

View null values

# check whether each value is missing: returns True where the value is null, False elsewhere
data.isnull().head()
# check whether there is any null value in the whole data set: returns True if so, otherwise False
data1.isnull().values.any()
# check whether each column contains NaN:
df.isnull().any(axis=0)
# check whether each row contains NaN:
df.isnull().any(axis=1)
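Beyond these boolean checks, isnull().sum() counts the missing values per column; a small sketch (again assuming train.csv):

import pandas as pd

data = pd.read_csv('train.csv')
print(data.isnull().sum())          # number of NaN values in each column
print(data.isnull().sum().sum())    # total number of NaN values in the whole table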

Data saving

data.to_csv('train_chinese.csv', encoding='GBK')  # save with GBK encoding so that the Chinese column names display correctly
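By default to_csv also writes the row index as an extra column; a small follow-up sketch (the index=False choice is an assumption, not part of the original note) that also reads the file back with the same encoding:

import pandas as pd

data = pd.read_csv('train.csv')
data.to_csv('train_chinese.csv', encoding='GBK', index=False)  # index=False drops the row-index column
data_back = pd.read_csv('train_chinese.csv', encoding='GBK')   # read it back with the same encoding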

pandas basics

View data column names

# list comprehension
[column for column in df]
# convert directly with list()
list(df)
# the columns attribute returns an Index, which tolist() converts to a list
df.columns.tolist()

View the columns and rows of the data

df['w']      # select column 'w' with dict-like indexing; returns a Series
df.w         # select column 'w' with attribute access; returns a Series
df[['w']]    # select column 'w' with a list of labels; returns a DataFrame
data[0:2]    # rows 1 through 2: a half-open interval that includes the start but not the end
data[1:2]    # row 2 (counting from 0); a slice with both start and end returns a single row here
data.iloc[-1]    # the last row of the DataFrame, returned as a Series
data.iloc[-1:]   # the last row of the DataFrame, returned as a DataFrame

View specific data

# view specific data: loc looks up by index label; iloc looks up by row and column position, counting from 0
midage.loc[[100,105,108], ['性别','客舱','乘客姓名']]
midage.iloc[[100,105,108], [3,4,10]]
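A minimal sketch of the loc/iloc difference on a toy DataFrame (the names below are illustrative, not from the original note): loc slices by label and includes both endpoints, while iloc slices by position and excludes the end.

import pandas as pd

toy = pd.DataFrame({'a': [10, 20, 30, 40]}, index=[100, 101, 102, 103])
print(toy.loc[100:102])    # label-based: rows 100, 101 and 102 (end label included)
print(toy.iloc[0:2])       # position-based: rows 0 and 1 (end position excluded)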

Delete rows or columns

# axis defaults to 0, i.e. drop by row; here the rows labelled 1 and 3 are dropped
# (inplace=True would modify the original data)
df1.drop(labels=[1, 3])
df1.drop(labels=range(1, 4))   # drop rows 1 through 3
# axis=1 means drop by column; several columns can be dropped at once
df1.drop(labels=['col1', 'col2'], axis=1)
df.drop(['col1', 'col2'], axis=1)
df.drop(columns=['col1', 'col2'])
# drop specific index combinations from a DataFrame with a MultiIndex
df.drop(index='cow', columns='small')
df.drop(index=('falcon', 'weight'))
df.drop(index='length', level=1)
# other ways to delete a column
del df['col1']
df.pop('col1')
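A small runnable sketch of row and column deletion on a toy DataFrame (column names are illustrative):

import pandas as pd

df1 = pd.DataFrame({'a': range(5), 'b': range(5), 'c': range(5)})
print(df1.drop(labels=[1, 3]))         # drop the rows labelled 1 and 3
print(df1.drop(columns=['b', 'c']))    # drop columns 'b' and 'c'
df2 = df1.copy()
del df2['a']                           # delete column 'a' in place
popped = df2.pop('b')                  # remove column 'b' and return it as a Series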

Conditional filtering and set operations (intersection, union, complement)

data[(data["a"] > 2) & (data["b"] < 9)]   # intersection &
data[(data["a"] > 2) | (data["b"] < 9)]   # union |
data[~(data["a"] > 2)]                    # complement ~
# isin(sequence): select the rows whose values appear in the sequence
lis = [3, 4]
data[data['a'].isin(lis)]
# symmetric difference ^: elements that are in the union of A and B but not in their intersection
data[(data['a'] >= 2) ^ (data['b'] <= 8)]   # symmetric difference
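Applied to the Titanic data, the same pattern looks like this (a sketch assuming the standard train.csv column names such as Age, Fare and Pclass):

import pandas as pd

data = pd.read_csv('train.csv')
midage = data[(data['Age'] > 10) & (data['Age'] < 50)]    # passengers older than 10 and younger than 50
cheap_or_young = data[(data['Fare'] < 10) | (data['Age'] < 18)]
first_or_third = data[data['Pclass'].isin([1, 3])]        # passengers in class 1 or class 3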

Index reset

midage = midage.reset_index(drop=True)
# parameters of reset_index():
# drop: whether to discard the old index instead of inserting it back into the DataFrame as a new column; default False
# inplace: whether to modify the original DataFrame in place; default False

Sort

# sort_values(by, axis=0, ascending=True, inplace=False, na_position='last')
df.sort_values(by=['col1', 'col2'])   # by can be a single column name or a list of column names
df.sort_values(by='col1', ascending=False)   # descending order; the default is ascending
df.sort_values(by='col1', ascending=False, na_position='first')   # put NaN values first
# sort_index(axis=0, ascending=True, inplace=False, na_position='last')
df.sort_index()   # sorts by the row index, ascending by default; the options mirror sort_values
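For example, on the Titanic data (a sketch assuming the standard Age and Fare columns):

import pandas as pd

data = pd.read_csv('train.csv')
print(data.sort_values(by=['Age', 'Fare'], ascending=False).head())  # oldest passengers first, ties broken by fare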

pandas arithmetic operations
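A minimal sketch of element-wise pandas arithmetic (the frames below are illustrative): operations between two DataFrames align on row and column labels.

import pandas as pd

frame1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
frame2 = pd.DataFrame({'a': [10, 20], 'b': [30, 40]})
print(frame1 + frame2)       # element-wise addition, aligned on labels
print(frame1.add(frame2))    # the same operation via the add() method
print(frame1.sub(frame2))    # element-wise subtraction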
