Python Data Analysis in Practice (CSDN course) — Notes

URL:https://edu.csdn.net/course/play/26990/361139

 

Chapter 1: Course Introduction

Background

1. Data cleaning is the first step in the data analysis process, and also the most time-consuming step in an entire data analysis project.

2. The data cleaning process determines the accuracy of the data analysis.

3. With the increasing popularity of big data, data cleaning has become a necessary skill.

4. Processing data efficiently with Python has therefore become very important.

 

The course is based on e-commerce data.

 

Course Objectives

1. Master data cleaning methods and techniques.

2. Master the use of the Numpy and Pandas libraries for data cleaning.

3. Be able to complete the data cleaning stage of a data analysis project.

 

Course Outline

1. Common data cleaning tools

  1. Numpy common data structures and methods

  2. Numpy common data cleaning functions

  3. Pandas Series common data structures and methods

  4. Pandas DataFrame common data structures and methods

2. Data cleaning: reading and writing files

  1. Reading and writing CSV files with Pandas, with the related parameters explained

  2. Reading Excel files with Pandas, with the related parameters explained

  3. Pandas interaction with MySQL

3. Data cleaning: data table processing

  1. Data filtering

  2. Adding and deleting data

  3. Modifying and searching data

  4. Data tidying (merging)

4. Data cleaning: data transformation

  1. String data processing

  2. Date-format data processing

  3. Data transformation using functions or mappings

5. Data cleaning: data statistics

  1. Grouping data with groupby

  2. Using aggregation functions on grouped objects

  3. Grouped objects and the apply function

6. Data cleaning: data preprocessing

  1. Handling duplicate values

  2. Handling missing values

  3. Handling outliers

  4. Data discretization

Summary of data cleaning

  Data cleaning essentially means taking the dirty data produced by real business problems and converting it into 'clean data'. 'Dirty' here means the data may suffer from several (major) problems:

    1. Missing data (Incomplete): attribute values are empty, e.g. Occupancy = ""

    2. Noisy data (Noisy): values are anomalous, e.g. Salary = "-100"

    3. Inconsistent data (Inconsistent): the data contradicts itself, e.g. Age = "042" or Birthday = "01/09/1985"

    4. Redundant data (Redundant): the amount of data, or the number of attributes, exceeds what the analysis requires.

    5. Outliers (Outliers): values that deviate from the majority of the data.

    6. Duplicate data: records that appear more than once in the dataset.

 

 


Chapter 2: Common Data Cleaning Tools

1. Why data cleaning matters

  1. In real life, data is never perfect and must be cleaned before it can be analyzed.

  2. Data cleaning is the most time-consuming step of an entire data analysis project.

  3. The quality of the data ultimately determines the accuracy of the data analysis.

  4. Data cleaning is the only way to improve data quality and make the results of data analysis more reliable.

2. Data cleaning tools

  1. In Python, numpy and pandas are currently the most mainstream tools.

  2. Numpy's vectorized operations make data processing efficient.

  3. Pandas provides efficient methods for cleaning large amounts of data.

  4. In Python, use numpy and pandas functions as much as possible to improve data cleaning efficiency.

3. Numpy common data structures

  1. Numpy's most commonly used data structure is the ndarray.

  2. Create one with the array function: array(list or tuple).

  3. Arrays can also be created with other functions such as arange, linspace, zeros, etc.

    

 

 

   Notes: ndim returns the number of dimensions, shape returns the structure (shape) of the array, size returns the total number of elements, and dtype returns the element type.

  Indexing an ndarray: arr[a, b], where a is the row index and b is the column index.
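  A minimal sketch of these attributes and of row/column indexing (sample values invented for illustration):

  import numpy as np

  # Create an ndarray from a list; arange, linspace, zeros work similarly.
  arr = np.array([[1, 2, 3], [4, 5, 6]])

  print(arr.ndim)   # 2 -> number of dimensions
  print(arr.shape)  # (2, 3) -> structure of the array
  print(arr.size)   # 6 -> total number of elements
  print(arr.dtype)  # element type, e.g. int64 (platform dependent)

  # arr[a, b]: a is the row index, b is the column index.
  print(arr[1, 2])  # 6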

4. Numpy common data cleaning functions

  4.1 Sorting functions

      sort function: sorts in ascending order; the axis keyword determines the axis along which to sort.

      argsort function: returns the indices that would sort the data from smallest to largest.

  4.2 Data searching

      where function: np.where(s > 3, 1, -1) returns 1 where s > 3 and -1 otherwise.

      extract function: np.extract(s > 3, s) keeps only the elements that meet the condition and discards the rest.
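  A short sketch of all four functions on an invented array s:

  import numpy as np

  s = np.array([4, 1, 5, 2])

  print(np.sort(s))              # [1 2 4 5] -> ascending sort (axis selects the axis)
  print(np.argsort(s))           # [1 3 0 2] -> indices that sort s from small to large
  print(np.where(s > 3, 1, -1))  # [ 1 -1  1 -1] -> 1 where the condition holds, else -1
  print(np.extract(s > 3, s))    # [4 5] -> keep only elements meeting the condition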

5. Pandas Series common data structures and methods

      Series data structures are created with pandas.Series:

      pandas.Series(data, index, dtype, name)

      Among these parameters, data may be a list, an array, or a dict.

      index is the index and must have the same length as data; name is the name of the Series object.
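      A minimal Series example (the values and the name 'score' are invented):

      import pandas as pd
      import numpy as np

      # data can be a list, an ndarray, or a dict.
      s1 = pd.Series([85, 92, 78], index=['a', 'b', 'c'], dtype=np.float64, name='score')
      s2 = pd.Series({'a': 85, 'b': 92})  # dict keys become the index

      print(s1.index)  # Index(['a', 'b', 'c'], dtype='object')
      print(s1.name)   # 'score'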

6. Pandas DataFrame common data structures and methods

      DataFrame data structures are created with pandas.DataFrame:

      pandas.DataFrame(data, index, dtype, columns)

      Among these parameters, data may be a list, an array, or a dict.

      index is the row index; columns is the column labels (column names).
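      A minimal DataFrame example (column names invented to echo the course's meal data):

      import pandas as pd

      # data can be a list, an ndarray, or a dict; dict keys become column labels.
      df = pd.DataFrame({'dish': ['tofu', 'rice'], 'price': [12.0, 2.0]},
                        index=['r1', 'r2'])

      print(df.index)    # row index: Index(['r1', 'r2'], dtype='object')
      print(df.columns)  # column labels: Index(['dish', 'price'], dtype='object')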

 

Chapter 3: Data Cleaning: Reading and Writing Files

1. Reading and writing CSV files

  pandas has about ten built-in reader functions for different data sources; CSV and Excel are the most common.

  Use the read_csv method to read; the result is a dataframe.

  When reading a csv file, keep the file name in English where possible.

  There are many parameters you can set yourself, but the defaults are usually used.

  When reading a CSV, pay attention to the encoding; common encodings are utf-8, gbk, gb2312, and gb18030.

  Use the to_csv method to save quickly.

  df = pd.read_csv('meal_order_info.csv', encoding='gbk')
  df = pd.read_csv('meal_order_info.csv', encoding='gbk', nrows=10)  # read only the first 10 rows
  df.to_csv('df.csv', index=False)  # save without writing the index

  Tips: 1. pd.set_option('display.max_rows', 100)  # show up to 100 rows when printing

      2. import os
        os.chdir('path')  # change the working directory so relative file names resolve

Workflow summary: (the course's workflow diagram is omitted here)

2. Reading and writing Excel files

  Use the read_excel method to read; the result is a dataframe.

  Reading an excel file takes roughly the same parameters as reading a csv file, but you also have to consider which worksheet (sheet) to read.

  There are many parameters you can set yourself, but the defaults are usually used.

  When reading excel, pay attention to the encoding; common encodings are utf-8, gbk, gb2312, and gb18030.

  Use the to_excel method to quickly save in xlsx format.

  df = pd.read_excel('meal_info.xlsx', sheet_name='sheet1')
  df = pd.read_excel('meal_info.xlsx', encoding='utf-8', nrows=10)
  df.to_excel('a.xlsx', sheet_name='sheet1', index=False, encoding='utf-8')

3. Reading and writing database data

  Establish a connection using sqlalchemy.

  You need to know the database parameters, such as the database IP address, user name, and password.

  Read with pandas' read_sql function; the result is a dataframe.

  Save with the dataframe's to_sql method.

  sql = 'select * from meal_order_info'
  df1 = pd.read_sql(sql, conn)
  df.to_sql('testdf', con=conn, index=False, if_exists='replace')

  Connection parameters:

  conn = create_engine('mysql+pymysql://user:password@IP:3306/test01')

  user: the user name

  password: the password

  IP: the server IP (use localhost for your local machine)

  3306: the port number

  test01: the database name

  df.to_sql(name, con=engine, if_exists='replace/append/fail', index=False)

  name: the table name

  con: the connection

  if_exists: what to do if the table already exists. Three options: append appends to the existing table; replace drops the original table and creates a new one; fail does nothing.

  index=False: do not write the index as a column
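  Putting the pieces together, a minimal round-trip sketch; the connection details are the course's placeholders, and it assumes the pymysql driver is installed:

  import pandas as pd
  from sqlalchemy import create_engine

  # user/password/IP/port/database below are placeholders to fill in.
  conn = create_engine('mysql+pymysql://user:password@localhost:3306/test01')

  sql = 'select * from meal_order_info'
  df1 = pd.read_sql(sql, conn)                                      # read -> dataframe
  df1.to_sql('testdf', con=conn, index=False, if_exists='replace')  # write back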

 

Chapter 4: Data Cleaning: Data Table Processing

1. Data filtering

  1.1 Common data filtering methods

  Select the rows or columns you need from the data.

  The basic indexing method is direct referencing.

  loc[row index name or condition, column index name or label]

  iloc[row index position, column index position]

  Note: distinguish loc from iloc.

  loc indexes by label name; iloc indexes by the integer position of rows and columns.
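  A small sketch contrasting the two (data invented; buy_mount and cat echo columns used later in these notes):

  import pandas as pd

  df = pd.DataFrame({'buy_mount': [3, 5, 8], 'cat': ['24', '26', '30']},
                    index=['a', 'b', 'c'])

  print(df.loc['a', 'cat'])                    # by label name
  print(df.iloc[0, 1])                         # by integer position -> same cell
  print(df.loc[df['buy_mount'] > 4, ['cat']])  # a condition in the row slot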

2. Adding and deleting data

  Add a column to the data by direct assignment.

  Use the df.insert method to add a column at a given position.

  Master the usage of drop(labels, axis, inplace=True):

  labels is the data to delete, axis is the axis to operate on, and inplace=True applies the change to the original data.

  axis=0 operates on rows; axis=1 operates on columns.

  Use the del statement to delete a column directly.

  del basic['数据']  # delete one column directly

  basic.drop(labels=['数量', '价格'], axis=1, inplace=True)  # delete two columns

  basic.drop(labels=range(6, 11), axis=0, inplace=True)  # delete the rows labelled 6 through 10

  basic.insert(position, 'new_column_name', values_to_insert)  # placeholders for location, name, data

3. Modifying and searching data

  Use rename to change column names or row index names.

  Use the loc method to modify data.

  Use the loc method to find data matching a condition.

  Join conditions with & or |, meaning 'and' and 'or' respectively.

  Use between and isin to select rows that satisfy a condition:

  df[df['buy_mount'].between(4, 10, inclusive=True)]

  df[df['cat'].isin(['24', '26', '546'])]
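  A short sketch combining rename, conditional modification with loc, and a compound condition (data invented):

  import pandas as pd

  df = pd.DataFrame({'buy_mount': [3, 5, 8], 'cat': ['24', '26', '546']})

  df = df.rename(columns={'cat': 'cat_id'})      # rename a column label
  df.loc[df['buy_mount'] > 4, 'buy_mount'] = 10  # modify rows matching a condition
  hits = df[(df['buy_mount'] >= 4) & (df['cat_id'].isin(['26', '546']))]  # & = and, | = or
  print(hits)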

4. Data tidying (merging)

  Definition: during data cleaning, different pieces of data often need to be brought together to simplify later analysis. This process is also called data merging.

  Merging methods: the common methods are stacking and merging on a primary key. Stacking splits into horizontal stacking and vertical stacking; merging on a key is similar to a join in SQL. A sketch follows below.
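  A minimal sketch of both merging styles with invented two-column frames:

  import pandas as pd

  a = pd.DataFrame({'id': [1, 2], 'x': ['a', 'b']})
  b = pd.DataFrame({'id': [1, 2], 'y': ['c', 'd']})

  wide = pd.concat([a, b], axis=1)              # horizontal stacking
  tall = pd.concat([a, a], axis=0)              # vertical stacking
  keyed = pd.merge(a, b, on='id', how='inner')  # key-based merge, like a SQL join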

  

5. Hierarchical indexing

   A hierarchical (multi-level) row index lets loc select with a tuple of index levels, for example:

     df.loc[(28, [20303, 2344]), ['auction_id', 'cat_id']]
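     A sketch that builds such an index with set_index and then reproduces the selection pattern above (the column names mirror the course's example; the values are invented):

     import pandas as pd

     df = pd.DataFrame({'user_id': [28, 28, 30],
                        'auction_id': [20303, 2344, 111],
                        'cat_id': [24, 26, 30]})

     # Build a two-level row index, then select with (level0, level1) in loc.
     df2 = df.set_index(['user_id', 'auction_id'])
     print(df2.loc[(28, [20303, 2344]), ['cat_id']])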

 

Chapter 5: Data Cleaning: Data Transformation

1. Date-format data processing

  In Pandas, use the to_datetime() method to convert text to a date format.

  If a dataframe column has dtype datetime64, the dt accessor can extract the year, month, day, and so on.

  For time-difference data, the timedelta type can be converted to a numeric value in a given unit.

  Time-difference data also supports the dt accessor for its common attributes.

  

 

   df['diff_day'].astype('timedelta64[Y]')  # express the time difference in years
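   A minimal sketch of the conversion chain (sample timestamps invented; note that newer pandas versions prefer the dt accessor over astype for unit conversion):

   import pandas as pd

   df = pd.DataFrame({'use_start_time': ['2016-08-01 11:11:46', '2016-08-03 09:20:00'],
                      'lock_time': ['2016-08-02 11:11:46', '2016-08-05 09:20:00']})

   df['use_start_time'] = pd.to_datetime(df['use_start_time'])  # text -> datetime64
   df['lock_time'] = pd.to_datetime(df['lock_time'])
   print(df['lock_time'].dt.year)  # dt accessor: year, month, day, ...

   df['diff_day'] = df['lock_time'] - df['use_start_time']  # timedelta column
   print(df['diff_day'].dt.days)   # whole days; the course uses astype('timedelta64[Y]') for years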

2. Higher-order function processing

  In a dataframe, use the apply method to call a custom function on the data.

  For apply, axis=0 applies the function to each column, while axis=1 applies it to each row.

  The astype function can be used for type conversion.

  The map function can be used for value conversion (mapping); a sketch follows after the example lines below.


  df2['性别'] = df2['gender'].map({'0': '女', '1': '男', '2': '未知'})  # map with a dict

  df2['性别'] = df2['gender'].map(f1)  # map with a function (f1 was defined on a course slide)
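  A sketch of all three techniques (sample data invented; since the course's f1 is not in the notes, a stand-in counting function is used):

  import pandas as pd

  df2 = pd.DataFrame({'gender': ['0', '1', '2'], 'price': ['12.5', '3.0', '7.0']})

  df2['price'] = df2['price'].astype(float)  # astype: type conversion
  df2['性别'] = df2['gender'].map({'0': '女', '1': '男', '2': '未知'})  # map: value conversion

  def count_non_null(col):
      return col.count()

  # apply with a custom function: axis=0 feeds each column, axis=1 feeds each row.
  print(df2.apply(count_non_null, axis=0))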

3. String data processing
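  The slides for this part are missing from the notes; as a placeholder, a minimal sketch of pandas' vectorized string methods through the .str accessor (sample values invented):

  import pandas as pd

  s = pd.Series([' 24 ', '26', 'cat_546'])

  print(s.str.strip())              # trim surrounding whitespace
  print(s.str.contains('cat'))      # boolean match per element
  print(s.str.replace('cat_', ''))  # substring replacement
  print(s.str.split('_'))           # split each element into a list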

 

 

Chapter 6: Data Cleaning: Data Statistics

1. Data grouping methods

  Use the groupby method for grouped computation; it returns a GroupBy object.

  The syntax is df.groupby(by=).

  Descriptive statistics such as count, mean, median, max, and min can be applied to the GroupBy object.

  group = loan_info.groupby(by='product')

  group1 = loan_info.groupby(by=['product', 'jgmc'])

  group.mean()

  group.sum()

  group.max()

2. Using aggregation functions
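  The worked examples for this part are missing from the notes; a minimal sketch of agg on the grouped object from the previous section (the amount column is invented):

  import pandas as pd

  loan_info = pd.DataFrame({'product': ['A', 'A', 'B'],
                            'jgmc': ['x', 'y', 'x'],
                            'amount': [100, 200, 50]})

  group = loan_info.groupby(by='product')

  # agg takes one function, a list of functions, or a per-column dict.
  print(group['amount'].agg(['sum', 'mean']))
  print(group.agg({'amount': 'max'}))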

