1. Preface
This post is mainly a stage-by-stage summary of the learning framework.
1.1 The road to AI:
Data analysis → machine learning → deep learning → CV/NLP
1.2 Tools/skills:
Python, NumPy, Pandas, Matplotlib → Scikit-learn; LR, SVM… → TensorFlow, Keras, PyTorch; CNN, RNN…
2. Data analysis
Use NumPy or Pandas for data analysis. Pandas is the more powerful and specialized of the two, and it comes with its own plotting interface built on Matplotlib for visualization.
2.1 The process of data analysis
- Pose the question
- Understand the data
- Clean the data
- Build the model
- Visualize the data
2.2 Basic methods for operating on data
Using Pandas operations on data with two or more dimensions as an example, this section summarizes some methods for manipulating data.
2.2.1 Overview of Pandas
- Pandas method chaining:
Most Pandas methods return a DataFrame object, so the result can be passed straight to the next Pandas method.
- Create a DataFrame:
Pass in a dictionary: give the values of each column
Pass in an array: give the values of each row
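The two construction styles and method chaining can be sketched as follows. This is a minimal illustration; the column names and values are made up for the example and do not come from any real dataset.

```python
import pandas as pd

# From a dictionary: each key is a column name, each value holds that column's values.
df_cols = pd.DataFrame({"width": [3, 5, 8], "length": [6, 9, 12]})

# From a list of rows: each inner list is one row; column names are passed separately.
df_rows = pd.DataFrame([[3, 6], [5, 9], [8, 12]], columns=["width", "length"])

# Both constructions describe the same table.
same = df_cols.equals(df_rows)

# Method chaining: each call returns a new DataFrame that feeds the next call.
chained = (
    df_cols
    .assign(area=lambda d: d["width"] * d["length"])  # add a column
    .sort_values("area", ascending=False)             # sort by the new column
    .head(2)                                          # keep the top two rows
)
print(chained)
```

Wrapping the chain in parentheses lets each step sit on its own line, which keeps longer pipelines readable.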
2.2.2 The core of manipulating data with Pandas
However many methods there are for operating on data, they come down to two steps:
first select the data, then apply a function to it.
(1) Select data
This covers selecting/filtering both row data and column data.
- Select/filter by row
- Show the first/last n rows
df.head(n)
df.tail(n)
- Randomly select
df.sample(frac=0.5)
df.sample(n=10)
- Select by row position (slice)
df.iloc[0:2] selects the first and second rows
- Select by row label (slice)
df.loc[0:2] may raise an error (if the index has no label 0)
df.loc[1:2] selects the rows labeled 1 to 2
- Select the n largest/smallest rows by a specific column
df.nlargest(n, 'value')
df.nsmallest(n, 'value')
- Select by logical condition
df[df.Length > 7]
- Delete duplicate rows
df.drop_duplicates()
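The row-selection calls above can be tried on a small frame. The sketch below uses made-up data whose index starts at 1, to show the difference between position-based `iloc` and label-based `loc`; the `random_state` argument is added only to make the random draw repeatable.

```python
import pandas as pd

# Hypothetical data; the index starts at 1, so label 0 does not exist.
df = pd.DataFrame(
    {"Length": [5, 9, 7, 9, 12], "value": [10, 40, 20, 40, 30]},
    index=[1, 2, 3, 4, 5],
)

first_two = df.head(2)              # first two rows
by_position = df.iloc[0:2]          # positions 0 and 1; the end position is excluded
by_label = df.loc[1:2]              # labels 1 and 2; the end label is included
top_two = df.nlargest(2, "value")   # the two rows with the largest 'value'
long_rows = df[df.Length > 7]       # boolean mask keeps rows where Length > 7
unique_rows = df.drop_duplicates()  # rows 2 and 4 are identical, so row 4 is dropped
random_two = df.sample(n=2, random_state=0)  # two random rows, repeatable
```

Here `df.iloc[0:2]` and `df.loc[1:2]` happen to return the same two rows, but only because the labels are offset by one from the positions.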
- Select/filter by column
- Select by column label (column name)
Select one column
df['width'] or df.width
Select multiple columns
df[['width','length','species']]
- Select with slices
Slice by column label (column name)
df.loc[:, 'x2':'x4']
Select by column position (integer index)
df.iloc[:, [1, 2, 5]]
df.iloc[:, 1:3]
- Filter by regular expression
df.filter(regex='regex')
- Filter by logical condition
df.loc[df['a']>10, ['a','c']]
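A short sketch of the column-selection calls, assuming a made-up frame with columns `x1` to `x5`. Note that `loc` takes labels and `iloc` takes integer positions.

```python
import pandas as pd

# Hypothetical frame with columns x1..x5 and two rows.
df = pd.DataFrame({"x1": [1, 20], "x2": [2, 30], "x3": [3, 40], "x4": [4, 50], "x5": [5, 60]})

one_col = df["x2"]                      # a single column comes back as a Series
many_cols = df[["x1", "x3"]]            # a list of names returns a DataFrame
label_slice = df.loc[:, "x2":"x4"]      # label slice, both ends included
pos_list = df.iloc[:, [1, 2, 4]]        # integer positions 1, 2 and 4
pos_slice = df.iloc[:, 1:3]             # positions 1 and 2; the end is excluded
by_regex = df.filter(regex=r"^x[12]$")  # columns whose name matches the pattern
rows_and_cols = df.loc[df["x1"] > 10, ["x1", "x3"]]  # filter rows and pick columns at once
```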
(2) Operate on data
This covers function operations on row data, column data, and the data as a whole.
- Descriptive statistics
Overall description
df.shape
df.info()
df.describe()
len(df)
df['W'].value_counts()
df['W'].unique()
Specific statistics
sum()
count()
median()
min()
max()
mean()
var()
std()
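As a rough illustration of the descriptive calls, the sketch below runs them on a tiny made-up frame. The column name `W` follows the list above; `mpg` and all values are assumptions for the example.

```python
import pandas as pd

df = pd.DataFrame({"W": ["a", "b", "a", "c"], "mpg": [10.0, 20.0, 30.0, 40.0]})

n_rows, n_cols = df.shape        # shape is an attribute, not a method
summary = df.describe()          # count, mean, std, min, quartiles, max for numeric columns
counts = df["W"].value_counts()  # how often each value occurs
distinct = df["W"].unique()      # distinct values in order of appearance

mean_mpg = df["mpg"].mean()
median_mpg = df["mpg"].median()
var_mpg = df["mpg"].var()        # sample variance (ddof=1) by default
std_mpg = df["mpg"].std()        # sample standard deviation (ddof=1) by default
```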
- Modify data
Add, delete, modify, query
df.assign() adds a column
Group and reshape
df.groupby() groups a DataFrame
pd.merge() merges different DataFrames
pd.melt() converts column names into column data (gathers columns into rows)
df.pivot() and pd.melt() are inverse operations of each other
pd.concat() concatenates different DataFrames by column or by row
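The grouping and reshaping calls are easier to follow on a concrete table. The following is a minimal sketch with made-up frames (`wide`, `names`) showing `melt` and `pivot` undoing each other, plus `assign`, `groupby`, `merge`, and `concat`.

```python
import pandas as pd

wide = pd.DataFrame({"id": [1, 2], "x": [10, 20], "y": [30, 40]})

# assign adds a column and returns a new DataFrame.
with_total = wide.assign(total=wide["x"] + wide["y"])

# melt gathers the x and y columns into (variable, value) rows.
long = pd.melt(wide, id_vars="id", var_name="variable", value_name="value")

# pivot spreads them back out, undoing the melt.
back = long.pivot(index="id", columns="variable", values="value")

# groupby splits rows into groups and aggregates each group.
sums = long.groupby("variable")["value"].sum()

# merge joins two DataFrames on a shared key column.
names = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
merged = pd.merge(wide, names, on="id")

# concat stacks DataFrames by row (axis=0) or side by side by column (axis=1).
stacked = pd.concat([wide, wide], axis=0, ignore_index=True)
```

After the round trip, `back` holds the same values as `wide`, with `id` as the index instead of a column.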
- Function operations
apply(function)
df.dropna()
df.fillna(value)
df.drop()
agg(function)
df.sort_values('mpg')
df.sort_index()
df.reset_index() turns the row index into column data
df.rename()
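A small sketch of these function-style operations, assuming a made-up frame with one missing value. The column `mpg` follows the list above; `cyl` and all values are assumptions for the example.

```python
import pandas as pd

df = pd.DataFrame({"mpg": [21.0, float("nan"), 33.9, 15.2], "cyl": [6, 4, 4, 8]})

no_missing = df.dropna()                    # drop rows that contain NaN
filled = df.fillna(0)                       # or replace NaN with a fixed value
without_cyl = df.drop(columns="cyl")        # drop a column by name
ordered = df.sort_values("mpg")             # sort rows by a column; NaN goes last by default
renamed = df.rename(columns={"cyl": "cylinders"})
doubled = df["cyl"].apply(lambda v: v * 2)  # apply a function to each element of a Series
stats = df.agg({"mpg": ["min", "max"], "cyl": ["mean"]})  # several aggregations at once
as_column = no_missing.reset_index()        # the old index becomes a column named 'index'
```

None of these calls modify `df` in place; each returns a new object, which is what makes them chainable.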
- Visualization functions
df.plot.hist()
df.plot.scatter()
2.2.3 Data details
- Data types
- ...
3. Closing remarks
Data analysis is the foundation of machine learning.
Reference material is attached for review.