NumPy Data Analysis Fundamentals: NumPy Features Explained and Compared with Python's Built-in Data Structures

Table of contents

Foreword

1. Differences between NumPy and Python's built-in structures

2. Operational performance comparison

3. The underlying architecture

Follow along so you don't get lost. If you spot any mistakes, please leave a comment. Thank you very much!



Foreword

NumPy is one of the three pillars of Python data analysis, alongside Pandas and matplotlib, so it deserves an explanation of its own. Its application scenarios are very broad, and many Pandas functions convert their data to NumPy arrays internally. In machine learning, deep learning, and some data processing work it is used even more often than Pandas. NumPy is also powerful, convenient, and supports a wide range of complex operations. I regularly use NumPy in my Pandas and machine learning articles, but those posts never explain its operations in detail or document the specific functions involved. That does not sit well with a blogger who aims to offer a one-stop service, so I am filling this old gap with a new quick-learning series: the NumPy data analysis basics column.

This series will be collected in my column, the quick-learning series on NumPy data analysis basics, which covers most of what you need to use NumPy for everyday business analysis, routine mathematical modeling, and complex operations. I will put a lot of time and thought into it, working from basic array operations up to matrix and vector processing and the most commonly used NumPy functions. If you work in data analysis, data development, mathematical modeling, or Python engineering, I recommend subscribing to the column so that you get the most practical, commonly used knowledge as soon as it is published. This post is long and worth both reading and practicing; I will pick out the most important parts and discuss them in detail. I will maintain these posts over the long term, so if you find mistakes or have questions, please point them out in the comments. Thank you for your support.


1. Differences between NumPy and Python's built-in structures

NumPy is a scientific computing library for Python. It includes:

1. A powerful N-dimensional array object, ndarray;

2. A mature library of functions;

3. Tools for integrating C/C++ and Fortran code;

4. Practical linear algebra, Fourier transform, and random number generation functions.

NumPy works well together with SciPy, which adds further scientific routines such as sparse matrix operations, and with Pandas for more comprehensive data analysis. NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and a variety of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selection, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation, and more.
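As a quick tour of the routine families listed above, here is a minimal sketch. The matrix, signal, seed, and data values are made up purely for illustration and do not come from any real dataset.

```python
import numpy as np

# Linear algebra: solve the 2x2 system A @ x = b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)        # the solution is x = [2, 3]

# Discrete Fourier transform of a constant signal:
# all of the energy lands in the zero-frequency bin.
signal = np.ones(4)
spectrum = np.fft.fft(signal)

# Random numbers from a seeded generator (the seed 0 is arbitrary).
rng = np.random.default_rng(0)
samples = rng.normal(size=1000)

# Sorting and a basic statistic.
data = np.array([4, 1, 3, 2])
sorted_data = np.sort(data)      # returns a sorted copy
mean_value = data.mean()         # 2.5
```

Each call here lives in a different part of the library (np.linalg, np.fft, np.random, and the top-level namespace), but they all accept and return ndarray objects.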

The core of the NumPy package is the ndarray object. It encapsulates n-dimensional arrays of homogeneous data types, and many operations on it are performed in compiled code for performance. There are several important differences between NumPy arrays and standard Python sequences:

  • Unlike Python lists (which can grow dynamically), NumPy arrays have a fixed size at creation. Changing the size of an ndarray creates a new array and deletes the original.
  • The elements of a NumPy array must all be of the same data type, and therefore take up the same amount of memory. The exception is that an array can hold (Python, including NumPy) objects, which allows for elements of different sizes.
  • NumPy arrays facilitate advanced mathematical and other types of operations on large amounts of data. In general, such operations are performed more efficiently and with less code than using Python's built-in sequences.
  • A growing number of Python-based scientific and math packages are using NumPy arrays; while they generally support Python sequence input, they convert such input to NumPy arrays before processing, and usually output NumPy arrays. In other words, to effectively use many (or even most) of today's Python-based scientific/math software, it's not enough to know how to use Python's built-in sequence types, you also need to know how to use NumPy arrays.
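The first two differences in the list above can be checked in a few lines. This is a minimal sketch with made-up values; the variable names are only for illustration.

```python
import numpy as np

a = np.array([1, 2, 3])

# "Growing" an array: np.append returns a new array
# instead of extending `a` in place.
bigger = np.append(a, 4)

# Mixed numeric input is converted to one common data type.
mixed = np.array([1, 2.5, 3])            # the integers become float64

# The exception: an object array stores references to arbitrary
# Python objects, so its elements may differ in type and size.
objs = np.array([1, "two", 3.0], dtype=object)

print(a.shape, bigger.shape)   # (3,) (4,)
print(mixed.dtype)             # float64
print(objs.dtype)              # object
```

Object arrays give up most of the speed advantage, because each element is an ordinary Python object rather than a fixed-size value in a contiguous buffer.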

2. Operational performance comparison

Sequence size and speed are especially important in scientific computing. As a simple example, consider the case of multiplying each element in a one-dimensional sequence with the corresponding element in another sequence of the same length. If the data is stored in two Python lists a and b, we can iterate over each element:

# Multiply two lists element by element with an explicit loop
a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
c = []
for i in range(len(a)):
    c.append(a[i] * b[i])

But if both a and b contain millions of numbers, looping in Python is nowhere near as efficient as in C. The same task can be done faster in C by writing (ignoring variable declaration and initialization, memory allocation, etc. for clarity):

for (i = 0; i < rows; i++) {
  c[i] = a[i]*b[i];
}

However, things get more complicated with multi-dimensional arrays: the amount of code grows with the number of dimensions, and aggregation operations over multi-dimensional arrays make it more complex still. For a two-dimensional array, the C code becomes:

for (i = 0; i < rows; i++) {
  for (j = 0; j < columns; j++) {
    c[i][j] = a[i][j]*b[i][j];
  }
}

NumPy combines the advantages of both approaches. Its core is written in C, and it is extremely simple to use: whenever an ndarray is involved, element-by-element operation is the default mode, and those operations are executed quickly by precompiled C code:

import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])
c = a * b  # element-wise product: array([ 5, 12, 21, 32])

This does what the previous examples do, at close to C speed, with the simplicity we expect from Python code. In fact, the NumPy version is even simpler than the Python loop. This last example illustrates two of the features that underlie much of NumPy's power: vectorization and broadcasting.
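To see the difference on your own machine, the list loop and the NumPy expression can be timed side by side. This is a rough sketch rather than a rigorous benchmark: the array size and repeat count are arbitrary choices, and the actual timings depend on hardware and NumPy version, so no expected numbers are shown.

```python
import timeit

import numpy as np

n = 100_000  # arbitrary size, chosen only for illustration
a_list = list(range(n))
b_list = list(range(n))
a_arr = np.arange(n, dtype=np.int64)
b_arr = np.arange(n, dtype=np.int64)


def multiply_with_loop():
    """Element-wise product of two lists using an explicit Python loop."""
    c = []
    for i in range(len(a_list)):
        c.append(a_list[i] * b_list[i])
    return c


def multiply_vectorized():
    """Element-wise product of two arrays; the loop runs in compiled code."""
    return a_arr * b_arr


# Both approaches produce the same values.
assert multiply_with_loop() == multiply_vectorized().tolist()

loop_time = timeit.timeit(multiply_with_loop, number=5)
numpy_time = timeit.timeit(multiply_vectorized, number=5)
print(f"Python loop: {loop_time:.4f} s, NumPy: {numpy_time:.4f} s")

# The same expression works unchanged for 2-D arrays, with no nested loops.
a2 = np.array([[1, 2], [3, 4]])
b2 = np.array([[5, 6], [7, 8]])
c2 = a2 * b2  # element-wise product, not a matrix product
```

Note that `*` on arrays is always element-wise; a matrix product would use the `@` operator instead.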

3. The underlying architecture

Vectorization describes the absence of any explicit loops, indexing, etc. in the code. These things still happen, of course, but "behind the scenes" in optimized, precompiled C code. Vectorized code has many advantages, including:

  • Vectorized code is cleaner and easier to read
  • Fewer lines of code usually means fewer bugs
  • The code is closer to standard math notation (it's often easier to encode math structures correctly)
  • Vectorization produces more "Pythonic" code. Without vectorization, our code would be full of loops that are inefficient and hard to read.
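To illustrate the point about math notation, here is a small sketch that computes the Euclidean distance between two points, sqrt(sum((x_i - y_i)^2)), first with an explicit loop and then in vectorized form. The two points are made-up values.

```python
import math

import numpy as np

x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]

# Explicit loop: the indices and the accumulator obscure the formula.
total = 0.0
for i in range(len(x)):
    total += (x[i] - y[i]) ** 2
dist_loop = math.sqrt(total)

# Vectorized: the code reads almost like the formula itself.
xa = np.array(x)
ya = np.array(y)
dist_vec = np.sqrt(np.sum((xa - ya) ** 2))

print(dist_loop, dist_vec)  # both are 5.0
```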

Broadcasting is the term used to describe the implicit element-by-element behavior of operations. Generally speaking, all NumPy operations behave this way, not only arithmetic operations but also logical, bitwise, and functional ones; that is, they broadcast. Moreover, in the example above, a and b could be multidimensional arrays of the same shape, or a scalar and an array, or even two arrays with different shapes, provided that the smaller array can be "expanded" to the shape of the larger one in a way that makes the resulting broadcast unambiguous. For the detailed "rules" of broadcasting, see:

Broadcasting — NumPy v1.24.dev0 Manual
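The cases mentioned above (a scalar with an array, two arrays of different but compatible shapes, and a non-arithmetic operation) can be sketched as follows. The values are made up for illustration.

```python
import numpy as np

a = np.array([1, 2, 3, 4])

# A scalar is "stretched" to match the shape of the array.
scaled = a * 2                       # array([2, 4, 6, 8])

# Arrays of shape (3, 1) and (3,) broadcast to a common shape (3, 3).
col = np.array([[0], [10], [20]])    # shape (3, 1)
row = np.array([1, 2, 3])            # shape (3,)
grid = col + row

# Broadcasting is not limited to arithmetic: comparisons broadcast too.
mask = a > 2                         # array([False, False,  True,  True])

# Shapes that cannot be expanded unambiguously raise a ValueError.
try:
    np.array([1, 2, 3]) + np.array([1, 2])
    shapes_compatible = True
except ValueError:
    shapes_compatible = False

print(grid.shape)          # (3, 3)
print(shapes_compatible)   # False
```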
 

NumPy fully supports an object-oriented approach, starting, once again, with ndarray. For example, ndarray is a class with numerous methods and attributes. Many of its methods are mirrored by functions in the outermost NumPy namespace, so programmers can code in whichever paradigm they prefer. This flexibility has made the NumPy array dialect and the ndarray class the de facto language of multidimensional data interchange in Python.
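As a brief sketch of the two paradigms, the same operations can be written either as ndarray methods or as functions from the top-level numpy namespace. The array contents are arbitrary.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

# Object-oriented style: call methods on the array.
total_method = a.sum()
reshaped_method = a.reshape(3, 2)

# Functional style: call the mirrored functions in the numpy namespace.
total_function = np.sum(a)
reshaped_function = np.reshape(a, (3, 2))

# Attributes describe the array itself.
print(a.shape, a.ndim)                # (2, 3) 2
print(total_method, total_function)   # 21 21
```

Which style to use is largely a matter of taste; both forms return the same results.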

Follow along so you don't get lost. If you spot any mistakes, please leave a comment. Thank you very much!

That's all for this issue. I'm fanstuck. If you have any questions, feel free to leave a comment to discuss them. See you in the next issue!



Origin blog.csdn.net/master_hunter/article/details/127118157