This is the 44th article in the future of machines
Original address: https://blog.csdn.net/RobotFutures/article/details/126615267
Article directory
- 1 Overview
- 2. Load the dataset
- 3. View data characteristics
- 3.1 View the first 5 rows of data
- 3.2 View the maximum value of each feature of the dataset
- 3.3 View the minimum value of each feature
- 3.4 View feature mean
- 3.5 View feature percentiles
- 3.6 View the spread of each feature's distribution
- 3.8 View the number of feature samples
- 3.9 View label data distribution
- 4. Other commonly used scientific functions
- 5. Summary
1 Overview
This article uses preprocessing of the iris dataset to show how NumPy's scientific computing functions are used in machine learning.
2. Load the dataset
Take the iris dataset as an example.
The iris dataset has 4 features and 1 label. The features are sepal_length, sepal_width, petal_length, and petal_width. The label is the iris species, encoded as 0, 1, and 2 for Setosa, Versicolor, and Virginica, respectively.
import numpy as np
data = []
with open(file='iris.txt', mode='r') as f:
    f.readline()  # read and discard the first line (assumed to be a header)
    while True:
        line = f.readline()
        if line:
            data.append(line.strip().split(','))
        else:
            break
data = np.array(data, dtype=float)
# Slice out the first 4 columns as the feature data
X_data = data[:, :4]  # or X_data = data[:, :-1]
# Slice out the last column as the label data
y_data = data[:, -1]
data.shape, X_data.shape, y_data.shape
((150, 5), (150, 4), (150,))
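The manual read loop can likely be replaced by a single np.loadtxt call. The sketch below is a minimal illustration, not the article's method: it reads from an in-memory stand-in for iris.txt, and the header names and the three data rows are assumptions made up for the example.

```python
import io
import numpy as np

# Hypothetical stand-in for iris.txt: one header line plus three data rows.
sample = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,label\n"
    "5.1,3.5,1.4,0.2,0\n"
    "7.0,3.2,4.7,1.4,1\n"
    "6.3,3.3,6.0,2.5,2\n"
)

# skiprows=1 drops the header; delimiter="," splits each row into columns.
demo = np.loadtxt(sample, delimiter=",", skiprows=1)

X_demo = demo[:, :-1]              # every column except the last: features
y_demo = demo[:, -1].astype(int)   # last column: integer class labels

print(demo.shape, X_demo.shape, y_demo.shape)  # (3, 5) (3, 4) (3,)
```

With a real file, passing the path 'iris.txt' in place of the StringIO object should behave the same way, provided every data row is numeric.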
3. View data characteristics
3.1 View the first 5 rows of data
X_data[0:5], y_data[0:5]
(array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2]]),
array([0., 0., 0., 0., 0.]))
3.2 View the maximum value of each feature of the dataset
# axis=0 reduces along the rows, giving the maximum of each column
np.max(X_data, axis=0)
array([7.9, 4.4, 6.9, 2.5])
These are the per-feature maxima: the maximum sepal length in the dataset is 7.9, the maximum sepal width is 4.4, the maximum petal length is 6.9, and the maximum petal width is 2.5.
If the axis parameter is omitted, np.max returns the single largest value in the whole array, taken over all rows and columns.
np.max(X_data)
7.9
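To make the effect of the axis parameter concrete, here is a small sketch on a made-up 2x3 array (the array m is an assumption for illustration, not part of the iris data).

```python
import numpy as np

m = np.array([[1.0, 8.0, 3.0],
              [4.0, 2.0, 9.0]])

col_max = np.max(m, axis=0)   # collapse the rows: one value per column
row_max = np.max(m, axis=1)   # collapse the columns: one value per row
all_max = np.max(m)           # no axis: a single scalar for the whole array

print(col_max)  # [4. 8. 9.]
print(row_max)  # [8. 9.]
print(all_max)  # 9.0
```

The same axis convention applies to np.min, np.mean, np.std and the other reductions used in this article.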
3.3 View the minimum value of each feature
np.min(X_data, axis=0)
array([4.3, 2. , 1. , 0.1])
These are the per-feature minima: the minimum sepal length in the dataset is 4.3, the minimum sepal width is 2, the minimum petal length is 1, and the minimum petal width is 0.1.
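One common preprocessing use of the per-column minimum and maximum is min-max scaling, which maps every feature to the range [0, 1]. The article does not perform this step; the following is only a sketch on a made-up array (X_small is hypothetical) and assumes no column is constant, since that would cause a division by zero.

```python
import numpy as np

# Hypothetical 3x2 feature matrix, not taken from the iris data.
X_small = np.array([[5.0, 2.0],
                    [6.0, 4.0],
                    [7.0, 3.0]])

col_min = np.min(X_small, axis=0)   # [5. 2.]
col_max = np.max(X_small, axis=0)   # [7. 4.]

# Broadcasting applies each column's min and range to every row.
X_scaled = (X_small - col_min) / (col_max - col_min)

print(X_scaled)
```

After scaling, each column has minimum 0 and maximum 1, so features measured on different scales become comparable.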
3.4 View feature mean
np.mean(X_data, axis=0)
array([5.84333333, 3.05733333, 3.758 , 1.19933333])
3.5 View feature percentiles
A percentile is a statistical measure: the p-th percentile is the value below which p percent of the observations in a sample fall.
# 25% (np.percentile takes q on a 0-100 scale, so pass 25, not 0.25)
np.percentile(X_data, 25, axis=0)
array([5.1, 2.8, 1.6, 0.3])
# 50%
np.percentile(X_data, 50, axis=0)
array([5.8 , 3.  , 4.35, 1.3 ])
# 75%
np.percentile(X_data, 75, axis=0)
array([6.4, 3.3, 5.1, 1.8])
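A point that is easy to get wrong: np.percentile expects q on a 0 to 100 scale, while np.quantile expects the same quantity on a 0 to 1 scale. The sketch below shows the difference on a made-up array (values is hypothetical) and also derives the interquartile range from the quartiles.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

p25 = np.percentile(values, 25)      # 25th percentile, q on a 0-100 scale
q25 = np.quantile(values, 0.25)      # same statistic, q on a 0-1 scale
tiny = np.percentile(values, 0.25)   # the 0.25th percentile, almost the minimum

# Several percentiles can be requested in one call.
q1, q2, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1                        # interquartile range

print(p25, q25)  # 2.0 2.0
print(iqr)       # 2.0
```

Here tiny is roughly 1.01, which shows that passing 0.25 to np.percentile returns a value close to the minimum rather than the first quartile.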
3.6 View the spread of each feature's distribution
np.std(X_data, axis=0)
array([0.82530129, 0.43441097, 1.75940407, 0.75969263])
The standard deviations show that sepal width varies the least (0.43441097) and petal length varies the most (1.75940407).
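By default np.std divides by N (the population standard deviation); passing ddof=1 divides by N - 1 instead (the sample standard deviation). The sketch below compares the two on a made-up array and then uses the mean and standard deviation to compute z-scores, a standardization step the article itself does not perform.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical data

pop_std = np.std(x)              # ddof=0 (default): divide by N
sample_std = np.std(x, ddof=1)   # ddof=1: divide by N - 1

# Standardize: subtract the mean, divide by the standard deviation.
z = (x - np.mean(x)) / pop_std

print(pop_std)                 # 2.0
print(sample_std > pop_std)    # True
```

For the 150-sample iris data the two variants differ only slightly, but the choice matters more for small samples.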
3.8 View the number of feature samples
X_data.shape
(150, 4)
The dataset contains 150 samples, each with 4 features.
3.9 View label data distribution
np.unique with return_counts=True returns the distinct label values and the number of samples for each one; zip and dict then combine them into a dictionary.
unique, count = np.unique(y_data, return_counts=True)
label_count = dict(zip(unique, count))
label_count
{0.0: 50, 1.0: 50, 2.0: 50}
It can be seen that the labels are balanced, and the number of samples for each category is 50.
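Because the labels are stored as floats, the dictionary keys come out as 0.0, 1.0 and 2.0. A possible refinement, sketched here on a made-up label array (y_small and class_names are assumptions based on the label encoding described in section 2), maps them to species names.

```python
import numpy as np

y_small = np.array([0.0, 0.0, 1.0, 2.0, 2.0, 2.0])    # hypothetical labels
class_names = ["Setosa", "Versicolor", "Virginica"]   # index = label value

labels, counts = np.unique(y_small, return_counts=True)

# Cast each float label to int so it can index the list of names.
named_counts = {class_names[int(label)]: int(n)
                for label, n in zip(labels, counts)}

print(named_counts)  # {'Setosa': 2, 'Versicolor': 1, 'Virginica': 3}
```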
4. Other commonly used scientific functions
Function | Description | Example |
---|---|---|
np.sum | Sum of elements | np.sum((y_pred - y_data)**2) |
np.exp | Exponential function with base e | np.exp(2) |
np.var | Variance | np.var(X_data, axis=0) |
np.round | Rounding | np.round(np.var(X_data, axis=0), decimals=2) |
np.square | Element-wise square | np.square(X_data) |
np.abs | Absolute value | np.abs([1, -1, -7.9, 6]) |
np.argmax | Index of the maximum value | np.argmax(X_data, axis=0) |
np.argmin | Index of the minimum value | np.argmin(X_data, axis=0) |
… |
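A few of the functions in the table, sketched on small made-up arrays. The names y_true and y_pred are hypothetical and stand in for real labels and model predictions.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])      # hypothetical predictions

sse = np.sum((y_pred - y_true) ** 2)    # sum of squared errors: 0.25 + 0 + 1
e_squared = np.exp(2)                   # e raised to the power 2

abs_vals = np.abs([1, -1, -7.9, 6])     # [1.  1.  7.9 6. ]
top_idx = np.argmax(abs_vals)           # position of the largest value
low_idx = np.argmin(abs_vals)           # first position of the smallest value

rounded_var = np.round(np.var(abs_vals), decimals=2)

print(sse)               # 1.25
print(top_idx, low_idx)  # 2 0
```

Note that np.argmax and np.argmin return the first matching index when the extreme value occurs more than once.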
5. Summary
This was a brief introduction to NumPy's scientific functions; more APIs will be covered as they come up in later articles.
Closing notes:
- Blog introduction: Focused on the AIoT field, following the pulse of the coming era and recording technical growth along the way.
- Column introduction: Master the common data science libraries NumPy, Matplotlib, and Pandas from 0 to 1.
- Target audience: AI beginners
- Column plan: A series of introductory artificial intelligence posts will be published step by step, so stay tuned
- Python Quick Start from Scratch Series
- Python Data Science Series
- Artificial intelligence development environment building series
- Machine Learning Series
- Object Detection Quick Start Series
- Automatic Driving Object Detection Series
- …