03-Indexing and slicing: the seventy-two transformations of arrays

What is an index?

The Oxford English-Chinese Dictionary, an authoritative reference for English-Chinese translation, contains nearly a million words and is thicker than a brick. How do we look something up in it? The entries are sorted by their first letter; entries that share a first letter are sorted by the second letter, and so on. This makes finding a word very convenient. For example, to find the word python, we first go to the section for the initial letter P, then find the second letter Y within it, and continue letter by letter until we reach python.

In effect, the Oxford dictionary uses the 26 letters A-Z as a multi-level index that keeps the words in order, and every lookup walks through this index layer by layer. That is what an index does: it provides a pointer to the storage location of a data value.

In a list, we use the position of an element as its index. Since arrays are an extension of lists into multi-dimensional space, many readers will have wondered whether array indexes extend list indexes in the same way. This is indeed the case: the index of an array is a multi-dimensional extension of the index of a list. In this section, we take a closer look at how to work with array indexes.

1. Array indexing and slicing

1.1 One-dimensional array

One-dimensional arrays are relatively simple and basically consistent with lists. Examples are as follows:

import numpy as np
arr1d = np.arange(1,10,1, dtype=np.float32)
arr1d
Out: array([1., 2., 3., 4., 5., 6., 7., 8., 9.], dtype=float32)

# Indexing starts at 0; select the element at index 6
arr1d[6]
Out: 7.0

# Select the elements in the interval [5:8); note that it is closed on the left and open on the right
arr1d[5:8]
Out: array([6., 7., 8.], dtype=float32)
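
As a quick check of the claim that one-dimensional arrays index like lists, the sketch below applies the same index and slice to a plain Python list and to an array built from it. The names `values` and `arr` are illustrative and do not come from the text above.

```python
import numpy as np

values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]  # plain Python list
arr = np.array(values, dtype=np.float32)                 # same data as an array

# A single integer selects one element in both cases.
print(values[6], arr[6])

# A slice is half-open in both cases: index 5 is included, index 8 is not.
print(values[5:8], arr[5:8].tolist())

# Negative indices count from the end for both lists and arrays.
print(values[-1], arr[-1])
```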

1.2 Two-dimensional array

Take a 3×3 array as an example:

# Create a two-dimensional array
arr2d=np.arange(9, dtype=np.float32).reshape(3,3)
arr2d
Out: 
    array([[0., 1., 2.],
           [3., 4., 5.],
           [6., 7., 8.]], dtype=float32)

A two-dimensional array has a two-dimensional index. Mapped onto a plane, its two axes are axis 0 and axis 1. NumPy indexes axis 0 first and then axis 1. (An array can be regarded as a nested list: the outermost level is axis 0, and the axis number increases as we move inward. This rule also applies to higher-dimensional arrays.)

[Figure: axis 0 and axis 1 of a two-dimensional array]

If only indexing axis 0:

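# Inside the square brackets, each entry acts on one axis. A single integer here acts on the outermost axis, axis 0
arr2d[1]
# See the left part of the figure below for the result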
Out: array([3., 4., 5.], dtype=float32)

Index axis 0 and axis 1 at the same time:

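# Two integers in the brackets act on axis 0 and axis 1 in turn (axis 0 first); this selects the element at index 1 on axis 0 and index 1 on axis 1
arr2d[1, 1]
# See the right part of the figure below for the result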
Out: 4.0

[Figure: left, the result of arr2d[1]; right, the result of arr2d[1, 1]]
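
The nested-list analogy mentioned earlier can be made concrete. The following is a minimal sketch, assuming the same 3×3 array as above; `nested` and the `element_*` names are introduced only for illustration.

```python
import numpy as np

arr2d = np.arange(9, dtype=np.float32).reshape(3, 3)
nested = arr2d.tolist()  # the same data as a list of lists

# Axis 0 corresponds to the outer list, axis 1 to the inner lists.
row_from_array = arr2d[1]
row_from_list = nested[1]

# arr2d[1, 1] indexes axis 0 and then axis 1 in one step; chaining two
# single-axis lookups, arr2d[1][1], reaches the same element.
element_a = arr2d[1, 1]
element_b = arr2d[1][1]
element_c = nested[1][1]

print(row_from_array, row_from_list)
print(element_a, element_b, element_c)
```

The comma form `arr2d[1, 1]` is generally preferred for arrays, since it performs a single indexing operation rather than two.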

Building on single indexes, selecting several consecutive indexes at once gives us a slice. For example, we make a simple change to the index from the previous step:

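# Inside the square brackets, the two entries still act on axis 0 and axis 1, but now each is a slice.
# The result is the array of elements whose axis 0 index lies in [0:2) and whose axis 1 index lies in [1:3)
# Note that arr2d[0:2, 1:3] is NOT equivalent to arr2d[[0,1],[1,2]]!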
arr2d[0:2, 1:3]
Out: 
    array([[1., 2.],
           [4., 5.]], dtype=float32)
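
The warning in the comment above, that the slice form and the integer-list form are not equivalent, can be checked directly. This is a small sketch on the same array; the names `block` and `paired` are illustrative. Indexing with integer lists is covered in the fancy indexing section later on.

```python
import numpy as np

arr2d = np.arange(9, dtype=np.float32).reshape(3, 3)

block = arr2d[0:2, 1:3]          # rows 0-1 crossed with columns 1-2
paired = arr2d[[0, 1], [1, 2]]   # only the elements at (0, 1) and (1, 2)

print(block.shape)   # (2, 2)
print(paired.shape)  # (2,)
print(paired)
```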

Simply put, a two-dimensional array can be thought of as a two-dimensional coordinate system, which makes indexing more intuitive. Practice it a few times to get comfortable with it.

1.3 Three-dimensional array

Three-dimensional arrays have one more dimension than two-dimensional arrays. They are common in image processing: an image in RGB (three primary colors) mode is represented by an array of shape m × n × 3, where m is the vertical size of the image, n is the horizontal size, and 3 stands for the three primary colors. How should we index and slice a three-dimensional array?

The format of indexing and slicing of three-dimensional arrays is an extension of two-dimensional arrays. Examples are as follows:

# Create a three-dimensional array
arr3d = np.arange(1, 19, 1).reshape(3,2,3)
arr3d
Out: 
    array([[[ 1,  2,  3],
            [ 4,  5,  6]],

           [[ 7,  8,  9],
            [10, 11, 12]],

           [[13, 14, 15],
            [16, 17, 18]]])

# For the 3D array, take the elements at outermost index 1 and second-level index 0
arr3d[1,0]
Out: array([7, 8, 9])

# For the 3D array, take the elements at outermost slice 0:2, second-level index 1, and innermost slice 2:3.
arr3d[:2,1,2:3]
Out: 
    array([[ 6],
           [12]])
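
To connect this with the RGB image case mentioned at the start of this subsection, here is a minimal sketch that uses a small m × n × 3 array as a stand-in for image data. The sizes and the names `img`, `red`, `pixel`, and `top_left` are assumptions made for illustration, and no real image is loaded.

```python
import numpy as np

m, n = 4, 5                                   # hypothetical image height and width
img = np.arange(m * n * 3).reshape(m, n, 3)   # stand-in for RGB pixel data

red = img[:, :, 0]        # one channel: all rows, all columns, channel 0
pixel = img[2, 3]         # the three channel values of the pixel at row 2, column 3
top_left = img[:2, :2]    # a 2×2 patch that keeps all three channels

print(red.shape, pixel.shape, top_left.shape)
```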

1.4 High-dimensional arrays

In data analysis, and even in AI and big data, the raw input is usually two-dimensional: each sample is one-dimensional (a sample is described by n indicators), so the sample set is two-dimensional. In image recognition, the raw input is usually three-dimensional, because images are generally three-dimensional arrays. Being familiar with indexing and slicing for 2D and 3D arrays is therefore enough to handle most practical scenarios, and there is little need to dig deeper into higher-dimensional arrays here.

2. Boolean indexing

Boolean indexing is where arrays meet practical scenarios. It works much like a filter in Excel: the Boolean result of a condition determines which data we select.

Let's first look at an example:

cities = np.array(["hz", "sh", "hz", "bj", "wh", "sh", "sz"])
arr_rnd = np.random.randn(7,4)
arr_rnd
Out: 
    array([[ 0.52214772,  0.70276312, -2.2606387 ,  0.44816176],
           [ 1.8575996 , -0.07908252, -0.60976332, -1.24109283],
           [ 0.79739726,  0.86862637,  0.91748762,  1.58236216],
           [-2.01706647,  1.02411895, -0.27238117,  0.11644394],
           [-0.5413323 ,  0.41044278, -0.54505957, -0.27226035],
           [ 0.85592045,  1.14458831,  0.36227036, -0.22211316],
           [ 2.40476032,  1.22042702, -1.07018219,  0.95419508]])

# Use a comparison on the array to generate a Boolean array
cities == "hz"
Out: array([ True, False,  True, False, False, False, False])

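# Index the array with the Boolean array and observe the pattern
# We can infer that the length of the Boolean array must match the length of the axis being indexed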
arr_rnd[cities == "hz"]
Out:
    array([[ 0.52214772,  0.70276312, -2.2606387 ,  0.44816176],
           [ 0.79739726,  0.86862637,  0.91748762,  1.58236216]])
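
The inference in the comment above, that the Boolean array must be as long as the axis it indexes, can be tested with a mask of the wrong length. This is a sketch with made-up data; `good_mask`, `bad_mask`, and `mismatch_rejected` are illustrative names.

```python
import numpy as np

arr = np.arange(28).reshape(7, 4)
good_mask = np.array([True, False, True, False, False, False, False])  # length 7
bad_mask = np.array([True, False, True])                               # length 3

print(arr[good_mask].shape)  # two rows selected

# A mask whose length differs from the axis length is rejected.
try:
    arr[bad_mask]
    mismatch_rejected = False
except IndexError as exc:
    mismatch_rejected = True
    print("IndexError:", exc)
```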

Note that Boolean indexes can be combined with slices:

# Index along 2 dimensions with a Boolean array and a slice
arr_rnd[cities == "hz", :3]
Out: 
    array([[ 0.52214772,  0.70276312, -2.2606387 ],
           [ 0.79739726,  0.86862637,  0.91748762]])

Used well, Boolean indexing can work wonders. For example, suppose we want to find all the negative numbers in arr_rnd, the array of standard normal samples generated in the previous step, and set them to 0. It is very simple:

arr_rnd[arr_rnd<0] = 0
arr_rnd
Out: 
    array([[0.52214772, 0.70276312, 0.        , 0.44816176],
           [1.8575996 , 0.        , 0.        , 0.        ],
           [0.79739726, 0.86862637, 0.91748762, 1.58236216],
           [0.        , 1.02411895, 0.        , 0.11644394],
           [0.        , 0.41044278, 0.        , 0.        ],
           [0.85592045, 1.14458831, 0.36227036, 0.        ],
           [2.40476032, 1.22042702, 0.        , 0.95419508]])
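
The assignment above modifies arr_rnd in place. When the original values are still needed, one possible alternative is `np.where`, which builds a new array instead. The sketch below uses a seeded generator purely for repeatability; the seed and the names `arr`, `clipped`, and `in_place` are assumptions, not part of the example above.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, purely for repeatability
arr = rng.standard_normal((7, 4))

clipped = np.where(arr < 0, 0, arr)     # new array; arr itself is unchanged

in_place = arr.copy()
in_place[in_place < 0] = 0              # the assignment form used above

print(np.array_equal(clipped, in_place))  # both routes give the same values
print((arr < 0).any())                    # arr still contains its negatives
```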

The logical operators for Boolean arrays also deserve a mention. It sounds convoluted but is easy to understand: they implement the "not", "and", and "or" operations between Boolean arrays:

# NOT: ~
~(cities == "hz")
Out: array([False,  True, False,  True,  True,  True,  True])
# AND: &
(cities == "hz") & (cities == "sz")
Out: array([False, False, False, False, False, False, False])
# OR: |
(cities == "hz") | (cities == "sz")
Out: array([ True, False,  True, False, False, False,  True])
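
A combined mask can be used as an index like any other Boolean array. The sketch below also shows why `&` and `|` are used rather than the Python keywords `and` and `or`, which raise an error on arrays. The `data` array and the flag `keyword_failed` are illustrative. The parentheses around each comparison matter, because `|` and `&` bind more tightly than `==`.

```python
import numpy as np

cities = np.array(["hz", "sh", "hz", "bj", "wh", "sh", "sz"])
data = np.arange(28).reshape(7, 4)

mask = (cities == "hz") | (cities == "sz")
selected = data[mask]        # rows 0, 2 and 6
print(selected)

# Python's `or` asks for the truth value of a whole array, which is ambiguous.
try:
    (cities == "hz") or (cities == "sz")
    keyword_failed = False
except ValueError:
    keyword_failed = True
print(keyword_failed)
```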

3. Fancy indexing

To summarize briefly: so far we have covered indexing with single integers, slices, Boolean arrays, and combinations of them. These are already powerful enough to handle most scenarios. Now consider a case: we have a 4×6 two-dimensional array and want to take out its 4 corner elements to form a 2×2 array. The effect is as follows:

[Figure: the four corner elements of a 4×6 array extracted into a 2×2 array]

This can be done with what we have covered so far, but because we have only learned contiguous slices and independent single indexes, it is somewhat cumbersome. Before reading on, think about how you would solve this problem.

Here are two approaches to start with:

# Create a 4×6 two-dimensional array arr_demo01
arr_demo01 = np.arange(24).reshape(4,6)
arr_demo01
Out:
    array([[ 0,  1,  2,  3,  4,  5],
           [ 6,  7,  8,  9, 10, 11],
           [12, 13, 14, 15, 16, 17],
           [18, 19, 20, 21, 22, 23]])

# Method 1: index each of the 4 corner elements separately, then assemble them into a new 2×2 array
arr_method1 = np.array([[arr_demo01[0,0], arr_demo01[0,-1]], 
                        [arr_demo01[-1,0],arr_demo01[-1,-1]]])
arr_method1
Out:
    array([[ 0,  5],
           [18, 23]])

# Method 2: Boolean indexing can select non-contiguous rows. Index along axis 0 and along axis 1 separately; note that this takes 2 indexing steps.
arr_method2 = arr_demo01[[True, False, False, True]][:, [True, False, False, False, False, True]]
arr_method2
Out:
    array([[ 0,  5],
           [18, 23]])

The first method is easy to understand. The second method indexes twice: the first step applies a Boolean index along axis 0; the second step combines a slice with a Boolean index to index along axis 1 on the result of the first step.

Is there a more concise way? This is where fancy indexing comes in.

Fancy indexing simply means indexing with integer arrays. Using the arr_demo01 array created above, let's look at two simple examples.

# Pass in an integer array to index along axis 0; the rows of the result follow the order of the integer array:
arr_demo01[[2,0]]
Out: 
    array([[12, 13, 14, 15, 16, 17],
           [ 0,  1,  2,  3,  4,  5]])

If we pass in two integer arrays at the same time, the result may be somewhat different from what we expected. Let's look at the following example:

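# If 2 integer arrays are passed in, separated by a comma, they are paired element by element to index individual elements. The result is not a rectangular region!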
arr_demo01[[0,-1], [0,-1]]
Out: 
    array([ 0, 23])
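
The pairing behaviour can be spelled out with an explicit loop. This is a minimal sketch on an array of the same shape as arr_demo01; `rows`, `cols`, `fancy`, and `explicit` are illustrative names.

```python
import numpy as np

arr = np.arange(24).reshape(4, 6)
rows = [0, -1]
cols = [0, -1]

fancy = arr[rows, cols]
# Spelled out: pair rows[k] with cols[k] and look up each coordinate.
explicit = np.array([arr[r, c] for r, c in zip(rows, cols)])

print(fancy, explicit)
```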

The elements actually selected here are those at the two coordinates (0, 0) and (-1, -1); note that this is not a rectangular region. How, then, do we achieve the effect of the demo above? Here are several methods to compare.

Method 3: pass in the coordinates of the four corners directly. The idea is very similar to Method 1, but the code is more concise:

# Method 3: pass in the coordinates of the 4 corners; note the pattern in the 2 integer arrays
arr_demo01[[0, 0, -1, -1], [0, -1, 0, -1]]
Out: array([ 0,  5, 18, 23])

arr_demo01[[0,0,-1,-1], [0,-1,0,-1]].reshape(2,2)
Out: 
    array([[ 0,  5],
           [18, 23]])

Note that the data obtained this way is only a one-dimensional array of elements, so we need the additional reshape call to change its shape.

# Method 4: mix fancy indexing with slicing. The idea is very similar to Method 2: 2 consecutive indexing steps produce a rectangular region
arr_demo01[[0,-1]][:,[0,-1]]
Out: 
    array([[ 0,  5],
           [18, 23]])

Finally, here is an indexer: the np.ix_ function converts the two one-dimensional integer arrays passed in into an indexer that selects a rectangular region of elements.

# Method 5: use the np.ix_ function to build a rectangular indexer:
arr_demo01[np.ix_([0,-1], [0,-1])]
Out: 
    array([[ 0,  5],
           [18, 23]])
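
It may help to look at what np.ix_ actually returns. As the sketch below suggests, it produces one index array per axis, shaped so that broadcasting them against each other yields a grid of coordinates. The names `row_idx`, `col_idx`, `corners`, and `same` are illustrative.

```python
import numpy as np

arr = np.arange(24).reshape(4, 6)

row_idx, col_idx = np.ix_([0, -1], [0, -1])
print(row_idx.shape, col_idx.shape)   # a column of row indices, a row of column indices

# Broadcasting the (2, 1) and (1, 2) index arrays yields a 2×2 grid of coordinates.
corners = arr[row_idx, col_idx]

# Writing the two index arrays by hand gives the same selection.
same = arr[np.array([[0], [-1]]), np.array([[0, -1]])]
print(corners)
```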

4. Summary

This chapter has shown the "seventy-two transformations" of arrays. Using indexing and slicing, we can rearrange an array's contents and extract elements or sets of elements as we like.

Overall, indexes fall into 4 types: single integer indexes, Boolean indexes, slice indexes (consistent with Python lists), and integer array indexes. Combined indexes, such as a slice mixed with one of the other types, are the more complicated case. Readers who have worked through the above carefully should be able to master array indexing. For day-to-day study, treat this chapter as a reference: focus on understanding the logic and look things up when you run into problems, rather than memorizing by rote.
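
As a compact recap, the sketch below applies each of the four index types, plus one combined form, to a single array. The array and variable names are illustrative only.

```python
import numpy as np

arr = np.arange(24).reshape(4, 6)

single = arr[1, 2]             # single integers: one element
sliced = arr[1:3, :2]          # slices: a contiguous block
masked = arr[arr % 10 == 0]    # Boolean array: elements that satisfy a condition
fancy = arr[[0, -1]]           # integer array: the first and last rows
combined = arr[[0, -1], 1:3]   # integer array combined with a slice

print(single, sliced.shape, masked, fancy.shape)
print(combined)
```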

Array indexing and slicing underpin the entire data analysis course. In the next chapter, on Pandas, we will meet slicing and indexing again; you can compare the two then to deepen your understanding.