[Python Data Science Quick Start Series | 05] Common Scientific Computing Functions

This is the 44th article from the RobotFutures blog.

Original address: https://blog.csdn.net/RobotFutures/article/details/126615267

1. Overview

This article uses preprocessing of the iris dataset as an example to show how NumPy's scientific computing functions are used in machine learning.

2. Load the dataset

This article uses the iris dataset.
The iris dataset has 4 features and 1 label. The features are sepal_length, sepal_width, petal_length, and petal_width. The label is the iris class: 0, 1, and 2 represent Setosa, Versicolor, and Virginica, respectively.

import numpy as np

data = []
with open(file='iris.txt', mode='r') as f:
    f.readline()  # skip the first line (assumed to be a header)
    for line in f:
        if line.strip():  # ignore blank lines
            data.append(line.strip().split(','))

data = np.array(data, dtype=float)

# Use slicing to take the first 4 columns as the feature data
X_data = data[:, :4]  # or X_data = data[:, :-1]

# Use slicing to take the last column as the label data
y_data = data[:, -1]

data.shape, X_data.shape, y_data.shape
((150, 5), (150, 4), (150,))
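
The same load can also be sketched with np.loadtxt, assuming the file is comma-separated with a single header line, as the loop above implies. The file name iris_sample.txt and its three rows are made up for illustration so that the snippet runs on its own; they are not the author's iris.txt.

```python
import os
import tempfile

import numpy as np

# Hypothetical three-row sample in the layout the loop above expects:
# one header line, then comma-separated numeric values.
sample = (
    "sepal_length,sepal_width,petal_length,petal_width,label\n"
    "5.1,3.5,1.4,0.2,0\n"
    "4.9,3.0,1.4,0.2,0\n"
    "4.7,3.2,1.3,0.2,0\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "iris_sample.txt")
    with open(path, mode="w") as f:
        f.write(sample)

    # skiprows=1 drops the header; delimiter="," splits each line into columns.
    sample_data = np.loadtxt(path, delimiter=",", skiprows=1)

X_sample = sample_data[:, :-1]  # every column except the last: features
y_sample = sample_data[:, -1]   # last column: labels

print(sample_data.shape, X_sample.shape, y_sample.shape)  # (3, 5) (3, 4) (3,)
```

np.loadtxt returns a float array by default, so no separate dtype conversion is needed here.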

3. View data characteristics

3.1 View the first 5 rows of data

X_data[0:5], y_data[0:5]
(array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2]]),
 array([0., 0., 0., 0., 0.]))

3.2 View the maximum value of each feature of the dataset

# axis=0 reduces along the rows, giving the maximum of each column
np.max(X_data, axis=0)
array([7.9, 4.4, 6.9, 2.5])

These are the maximum values of each feature: the maximum sepal length is 7.9, the maximum sepal width is 4.4, the maximum petal length is 6.9, and the maximum petal width is 2.5.

If the axis parameter is omitted, np.max returns a single value: the maximum over all elements of the array, across every column.

np.max(X_data)
7.9
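
A minimal sketch of how the axis parameter changes the result, using a small made-up array rather than the iris data:

```python
import numpy as np

# Small made-up array: 2 rows (samples) x 3 columns (features).
a = np.array([[1.0, 8.0, 3.0],
              [4.0, 2.0, 9.0]])

col_max = np.max(a, axis=0)  # reduce over rows -> one value per column
row_max = np.max(a, axis=1)  # reduce over columns -> one value per row
all_max = np.max(a)          # no axis -> one value for the whole array

print(col_max)  # [4. 8. 9.]
print(row_max)  # [8. 9.]
print(all_max)  # 9.0
```

The same axis convention applies to np.min, np.mean, np.std, and the other reductions used in this article.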

3.3 View the minimum value of each feature

np.min(X_data, axis=0)
array([4.3, 2. , 1. , 0.1])

These are the minimum values of each feature: the minimum sepal length is 4.3, the minimum sepal width is 2, the minimum petal length is 1, and the minimum petal width is 0.1.

3.4 View feature mean

np.mean(X_data, axis=0)
array([5.84333333, 3.05733333, 3.758     , 1.19933333])

3.5 View feature percentiles

A percentile is a statistical measure: the p-th percentile is the value below which p percent of the observations fall.

# np.percentile expects q on a 0-100 scale, so the 25th percentile is q=25
# 25%
np.percentile(X_data, 25, axis=0)
array([5.1, 2.8, 1.6, 0.3])
# 50%
np.percentile(X_data, 50, axis=0)
array([5.8 , 3.  , 4.35, 1.3 ])
# 75%
np.percentile(X_data, 75, axis=0)
array([6.4, 3.3, 5.1, 1.8])
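
A short sketch of how the q argument is interpreted, on made-up data. np.percentile takes q on a 0-100 scale, while np.quantile takes it on a 0-1 scale, so passing 0.25 to np.percentile asks for the 0.25th percentile, which sits very close to the minimum.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up data

p25 = np.percentile(values, 25)      # q on a 0-100 scale: 25th percentile
q25 = np.quantile(values, 0.25)      # q on a 0-1 scale: same point
tiny = np.percentile(values, 0.25)   # 0.25th percentile, just above the minimum

# Several percentiles can be requested in one call.
quartiles = np.percentile(values, [25, 50, 75])

print(p25, q25)     # 2.0 2.0
print(tiny)
print(quartiles)    # [2. 3. 4.]
```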

3.6 View the spread of the feature data

np.std(X_data, axis=0)
array([0.82530129, 0.43441097, 1.75940407, 0.75969263])

The standard deviations show that sepal width varies the least (0.43441097) and petal length varies the most (1.75940407).
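
For reference, np.std computes the population standard deviation by default (ddof=0). The sketch below, on made-up data, writes the formula out by hand and compares it with the ddof=1 sample version:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up data

# Population standard deviation written out: sqrt(mean((x - mean)^2)).
manual_std = np.sqrt(np.mean((x - np.mean(x)) ** 2))

pop_std = np.std(x)             # default ddof=0 divides by N
sample_std = np.std(x, ddof=1)  # ddof=1 divides by N - 1

print(manual_std, pop_std)  # 2.0 2.0
print(sample_std)           # slightly larger than the population value
```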

3.8 View the number of feature samples

X_data.shape
(150, 4)

The dataset has 150 samples, and each sample has 4 features.

3.9 View label data distribution

np.unique returns the distinct label values and, with return_counts=True, the number of samples for each; zip and dict then combine the two arrays into a dictionary.

unique, count = np.unique(y_data, return_counts=True)
label_count = dict(zip(unique, count))
label_count
{0.0: 50, 1.0: 50, 2.0: 50}

It can be seen that the labels are balanced, and the number of samples for each category is 50.
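
Because the labels were loaded as floats, the dictionary keys are 0.0, 1.0, and 2.0. A possible variation, sketched on made-up labels, casts them to integers and maps them to the class names from the dataset description. The class_names list and its ordering are an assumption based on that description.

```python
import numpy as np

# Made-up labels stored as floats, as they are after np.array(..., dtype=float).
labels = np.array([0.0, 1.0, 1.0, 2.0, 2.0, 2.0])

# Class names in label order (0, 1, 2), assumed from the dataset description.
class_names = ["Setosa", "Versicolor", "Virginica"]

values, counts = np.unique(labels.astype(int), return_counts=True)

# int(...) converts NumPy scalars to plain Python ints for cleaner keys and values.
name_count = {class_names[int(v)]: int(c) for v, c in zip(values, counts)}
print(name_count)  # {'Setosa': 1, 'Versicolor': 2, 'Virginica': 3}
```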

4. Other commonly used scientific functions

Function    Description                            Example
np.sum      Sum of elements                        np.sum((y_pred - y_data)**2)
np.exp      Exponential function with base e       np.exp(2)
np.var      Variance                               np.var(X_data, axis=0)
np.round    Rounding                               np.round(np.var(X_data, axis=0), decimals=2)
np.square   Element-wise square                    np.square(X_data)
np.abs      Absolute value                         np.abs([1, -1, -7.9, 6])
np.argmax   Index of the maximum value             np.argmax(X_data, axis=0)
np.argmin   Index of the minimum value             np.argmin(X_data, axis=0)
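
A few of these functions combined in one small sketch. The arrays y_true and y_pred are made up for illustration; y_pred stands in for the output of some hypothetical model.

```python
import numpy as np

# Made-up targets and predictions; y_pred is a hypothetical model output.
y_true = np.array([0.0, 1.0, 2.0, 1.0])
y_pred = np.array([0.0, 1.0, 1.0, 3.0])

sse = np.sum((y_pred - y_true) ** 2)          # sum of squared errors
abs_err = np.abs(y_pred - y_true)             # element-wise absolute error
worst = np.argmax(abs_err)                    # index of the largest error
best = np.argmin(abs_err)                     # index of the first smallest error
e_squared = np.round(np.exp(2), decimals=2)   # e**2 rounded to 2 decimals

print(sse)        # 5.0
print(abs_err)    # [0. 0. 1. 2.]
print(worst, best)  # 3 0
print(e_squared)  # 7.39
```

When several elements tie for the maximum or minimum, np.argmax and np.argmin return the index of the first one.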

5. Summary

This has been a brief introduction to NumPy's scientific computing functions; more APIs will be covered as they come up in later articles.

Closing notes:

  • Blog introduction: Focused on the AIoT field, following the pulse of the coming era and recording technical growth along the way.
  • Column introduction: Master the common data science libraries NumPy, Matplotlib, and Pandas from 0 to 1.
  • Target audience: Beginners in AI.
  • Column plan: A series of blog posts introducing artificial intelligence will be published step by step, so stay tuned.
