The Road to AI: Data Analysis (1), Pandas Summary and Framework Outline

1. Preface

This post mainly summarizes the framework for the current stage of study.

1.1 The road to AI:

Data analysis → machine learning → deep learning → CV/NLP

1.2 Tools/skills:

Python, NumPy, Pandas, Matplotlib → Scikit-learn; LR, SVM, … → TensorFlow, Keras, PyTorch; CNN, RNN, …

2. Data analysis

Data analysis can be done with NumPy or Pandas. Pandas is built on top of NumPy, offers higher-level tools for labeled, tabular data, and has its own plotting interface based on Matplotlib for visualization.

2.1 The process of data analysis

  1. Ask questions
  2. Understand the data
  3. Clean the data
  4. Build a model
  5. Visualize the data

2.2 Basic data manipulation methods

Using Pandas operations on two-dimensional (and higher) data as an example, this section summarizes some methods for manipulating data.

2.2.1 Overview of Pandas

  • Pandas method chaining:
    Most Pandas methods return a DataFrame, so the result can be passed directly to the next method call.
  • Creating a DataFrame:
    From a dictionary: specify the values of each column.
    From an array (list of rows): specify the values of each row.
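
The two construction styles and a method chain can be sketched as follows. The column names `a`, `b`, and `total` and the sample values are illustrative assumptions, not taken from the original post.

```python
import pandas as pd

# From a dictionary: each key is a column name, each value is that column's data.
df_from_dict = pd.DataFrame({"a": [4, 5, 6], "b": [7, 8, 9]})

# From a list of rows: each inner list is one row; column names are passed separately.
df_from_rows = pd.DataFrame([[4, 7], [5, 8], [6, 9]], columns=["a", "b"])

# Both constructions describe the same table.
same = df_from_dict.equals(df_from_rows)

# Method chain: each call returns a new DataFrame that feeds the next call.
chained = (
    df_from_dict
    .assign(total=lambda d: d["a"] + d["b"])  # add a column
    .query("total > 11")                      # keep rows where total > 11
    .sort_values("total", ascending=False)    # sort the remaining rows
)
```

Wrapping the chain in parentheses is only a layout choice; it lets each step sit on its own line.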

2.2.2 The core of manipulating data with Pandas

There are many methods for operating on data, but they all come down to two steps:
first select the data, then apply a function to it.
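
A minimal sketch of the two steps, assuming a small made-up table with `length` and `width` columns:

```python
import pandas as pd

df = pd.DataFrame({"length": [5, 8, 9, 6], "width": [2.0, 3.0, 4.0, 2.5]})

# Step 1: select the 'width' values of the rows whose length exceeds 7.
selected = df.loc[df["length"] > 7, "width"]

# Step 2: apply a function to the selection.
mean_width = selected.mean()
```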

(1) Select data

This covers selecting/filtering both rows and columns.

  1. Select/filter by row
  • Show the first/last n rows
    df.head(n)
    df.tail(n)
  • Randomly sample rows
    df.sample(frac=0.5)
    df.sample(n=10)
  • Select by row position (slice)
    df.iloc[0:2] selects the first two rows (positions 0 and 1; the end is excluded)
  • Select by row label (slice)
    df.loc[0:2] may raise an error if the index has no label 0
    df.loc[1:2] selects the rows labeled 1 through 2 (both ends included)
  • Select the top/bottom n rows ranked by a given column
    df.nlargest(n, 'value')
    df.nsmallest(n, 'value')
  • Select by logical condition
    df[df.Length > 7]
  • Delete duplicate rows
    df.drop_duplicates()
  2. Select/filter by column
  • Select by column label (column name)
    Select one column
    df['width'] or df.width
    Select multiple columns
    df[['width','length','species']]
  • Select by slice
    Slice by column label (column name)
    df.loc[:, 'x2':'x4']
    Select by column position
    df.iloc[:, [1, 2, 5]]
    df.iloc[:, 1:3]
  • Filter by regular expression
    df.filter(regex='regex')
  • Filter by logical condition
    df.loc[df['a'] > 10, ['a','c']]
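
The row and column selections listed above can be tried on a small table. The data and the 1-based index below are assumptions chosen so that positions (`iloc`) and labels (`loc`) give visibly different results.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "length": [5, 8, 9, 6, 8],
        "width": [2, 3, 4, 2, 3],
        "species": ["a", "b", "b", "a", "b"],
    },
    index=[1, 2, 3, 4, 5],  # labels start at 1, so position and label differ
)

# Rows
first_two = df.iloc[0:2]            # by position: positions 0 and 1 (end excluded)
labels_1_to_2 = df.loc[1:2]         # by label: labels 1 and 2 (end included)
longest = df.nlargest(2, "length")  # the 2 rows with the largest 'length'
long_rows = df[df["length"] > 7]    # boolean mask
unique_rows = df.drop_duplicates()  # drops the repeated (8, 3, 'b') row

# Columns
one_col = df["width"]                     # a single column, returned as a Series
by_label = df.loc[:, "length":"width"]    # label slice, end included
by_position = df.iloc[:, [0, 2]]          # columns at positions 0 and 2
regex_cols = df.filter(regex="th$")       # column names ending in 'th'
rows_and_cols = df.loc[df["length"] > 7, ["length", "species"]]
```

Label slices with `loc` include both endpoints, while position slices with `iloc` exclude the end, which is the usual source of off-by-one surprises.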

(2) Operate on data

This covers functions applied to rows, columns, and the data as a whole.

  • Descriptive statistics

    Overall description
    df.shape (an attribute, not a method)
    df.info()
    df.describe()
    len(df)
    df['W'].value_counts()
    df['W'].unique()

    Specific statistics
    sum()
    count()
    median()
    min()
    max()
    mean()
    var()
    std()
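
A short sketch of these descriptive calls, using a made-up table whose column `W` follows the naming in the list above and whose `mpg` values are assumptions:

```python
import pandas as pd

df = pd.DataFrame({"W": ["x", "y", "x", "x"], "mpg": [10.0, 20.0, 30.0, 40.0]})

n_rows, n_cols = df.shape        # shape is an attribute, so no parentheses
counts = df["W"].value_counts()  # occurrences of each distinct value
distinct = df["W"].unique()      # array of the distinct values
summary = df.describe()          # count, mean, std, min, quartiles, max (numeric columns)

mean_mpg = df["mpg"].mean()
median_mpg = df["mpg"].median()
var_mpg = df["mpg"].var()        # sample variance (ddof=1 by default)
```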

  • Modify data
    Add, delete, modify, query
    df.assign() adds a column

    Group and reshape
    df.groupby() groups a DataFrame

    pd.merge() joins different DataFrames on shared keys

    pd.melt() gathers columns into rows (column names become values of a column)
    df.pivot() and pd.melt() are inverse operations of each other

    pd.concat() concatenates DataFrames by row or by column
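
The reshaping and combining calls can be illustrated on a tiny table. The frames `wide` and `names` and their columns are hypothetical examples, not data from the original post.

```python
import pandas as pd

wide = pd.DataFrame({"id": [1, 2], "x": [10, 20], "y": [30, 40]})

# assign: add a column computed from existing ones.
with_total = wide.assign(total=wide["x"] + wide["y"])

# melt: the column names 'x' and 'y' become values of a 'variable' column.
long = pd.melt(wide, id_vars="id", value_vars=["x", "y"])

# pivot: reverse the melt, spreading 'variable' back into columns.
back = long.pivot(index="id", columns="variable", values="value")

# groupby: aggregate per group.
sums = long.groupby("variable")["value"].sum()

# merge: join two DataFrames on a shared key column.
names = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
merged = pd.merge(wide, names, on="id")

# concat: stack by row (axis=0, the default) or side by side by column (axis=1).
stacked = pd.concat([wide, wide], ignore_index=True)
side_by_side = pd.concat([wide, names[["name"]]], axis=1)
```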

  • General-purpose functions
    apply(function)
    df.dropna()
    df.fillna(value)
    df.drop()
    agg(function)
    df.sort_values('mpg')
    df.sort_index()
    df.reset_index() turns the row index into a column
    df.rename()
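
A hedged sketch of these functions on a small table with one missing value. The column `mpg` comes from the list above; `hp`, the index labels, and the values are assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"mpg": [30.0, np.nan, 18.0], "hp": [90, 120, 150]},
    index=["c", "a", "b"],
)

no_missing = df.dropna()                           # drop rows that contain NaN
filled = df.fillna(0)                              # replace NaN with 0
without_hp = df.drop(columns="hp")                 # remove a column
col_range = df.apply(lambda s: s.max() - s.min())  # function applied to each column
stats = df.agg(["min", "max"])                     # several aggregations at once
by_mpg = df.sort_values("mpg")                     # NaN rows are placed last by default
by_index = df.sort_index()                         # sort by row labels
flat = df.reset_index()                            # old index becomes a column named 'index'
renamed = df.rename(columns={"mpg": "miles_per_gallon"})
```

None of these calls modifies `df` in place; each returns a new object, which is what makes them usable in a method chain.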

  • Visualization functions
    df.plot.hist()
    df.plot.scatter()
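
The plotting calls might be used as in the sketch below. It assumes Matplotlib is installed, selects the non-interactive `Agg` backend so that no display is needed, and uses made-up `mpg` and `hp` values.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no display is required

import pandas as pd

df = pd.DataFrame({"mpg": [18, 21, 24, 30, 33], "hp": [150, 130, 110, 95, 80]})

# Histogram of a single column; returns a Matplotlib Axes object.
hist_ax = df["mpg"].plot.hist(bins=5)

# Scatter plot of two columns; x and y are column names.
scatter_ax = df.plot.scatter(x="hp", y="mpg")
```

Both calls return Matplotlib Axes, so titles, labels, and saving to a file can be handled with the usual Matplotlib methods.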

2.2.3 Data details

  • Data types
  • ...

3. Closing remarks

Data analysis is the foundation of machine learning.
Reference material for review (three images in the original post, not reproduced here).


Source: blog.csdn.net/Robin_Pi/article/details/103834061