Python for Scientific Research 06: Advanced Data Analysis with NumPy

Posted on 2024-04-09 Edited on 2024-04-09 In Programming language

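NumPy, short for "Numerical Python", is a core numerical library for Python that focuses on efficient handling of multidimensional arrays and matrices. It plays a central role in data analysis, providing functions and tools for complex mathematical operations, linear algebra, and statistical analysis. Its high-performance array processing lets users work with large datasets, whether for numerical computation, data transformation, or data cleaning. Its concise, intuitive API makes data analysis and scientific computing simpler and more efficient. In data science, machine learning, and scientific computing, NumPy is an indispensable foundation that helps researchers and engineers implement complex data processing and analysis tasks quickly.

This lesson continues the week-six material. It moves beyond basic NumPy usage and teaches more advanced techniques through a series of concrete problems.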

Convert an array of arrays into a flat 1D array

Problem: convert array_of_arrays into a flat, linear 1D array.

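import numpy as np

arr1 = np.arange(3)
arr2 = np.arange(3,7)
arr3 = np.arange(7,10)

# The sub-arrays have different lengths, so dtype=object is required
# (NumPy 1.24+ raises an error for ragged input without it)
array_of_arrays = np.array([arr1, arr2, arr3], dtype=object)
print('array_of_arrays: ', array_of_arrays)

# Solution 1: nested list comprehension
arr_1d = np.array([a for arr in array_of_arrays for a in arr])

# Solution 2: concatenate the sub-arrays
arr_1d = np.concatenate(array_of_arrays)
print(arr_1d)
# > array_of_arrays:  [array([0, 1, 2]) array([3, 4, 5, 6]) array([7, 8, 9])]
# > [0 1 2 3 4 5 6 7 8 9]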

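The numpy.concatenate() function joins a sequence of arrays along an existing axis.

numpy.concatenate((a1, a2, ...), axis=0, out=None, dtype=None, casting="same_kind") # Join a sequence of arrays along an existing axis.

Parameters:

a1, a2, …: sequence of array_like. The arrays must have the same shape, except in the dimension corresponding to axis (the first dimension by default).
axis: int, optional. The axis along which the arrays are joined. If axis is None, the arrays are flattened before use. Default is 0.
out: ndarray, optional. If provided, the destination in which to place the result. The shape must be correct, matching what concatenate would have returned if no out argument were specified.
dtype: str or dtype, optional. If provided, the destination array will have this dtype. Cannot be provided together with out.
casting: {'no', 'equiv', 'safe', 'same_kind', 'unsafe'}, optional. Controls what kind of data casting may occur. Default is 'same_kind'.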

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
np.concatenate((a, b), axis=0)
# > array([[1, 2],
# >        [3, 4],
# >        [5, 6]])
np.concatenate((a, b.T), axis=1)
# > array([[1, 2, 5],
# >        [3, 4, 6]])
np.concatenate((a, b), axis=None)
# > array([1, 2, 3, 4, 5, 6])

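How to generate one-hot encodings for an array in NumPy?

In machine learning we often meet categorical features, for example a person's gender (male or female) or country (China, the United States, France, and so on). These feature values are discrete and unordered rather than continuous, so they usually need to be converted to numbers. One-hot encoding, also called one-of-N encoding, uses an N-bit state register to encode N states: each state has its own register bit, and exactly one bit is active at any time.

Why use one-hot encoding: in regression, classification, clustering, and other machine learning algorithms, computing distances or similarities between features is very important, and the common measures, including cosine similarity, are defined in Euclidean space. Encoding categories as plain integers would impose an artificial order and unequal distances on them, whereas one-hot vectors keep every pair of distinct categories the same distance apart.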
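
The distance argument can be checked numerically. The following is a minimal sketch, assuming three hypothetical categories coded as 1, 2, 3 (the names codes, one_hot, int_dist, and hot_dist are illustrative only). It compares pairwise Euclidean distances under integer coding and under one-hot coding.

```python
import numpy as np

# Hypothetical integer codes for three unordered categories
codes = np.array([1, 2, 3])
one_hot = np.eye(3)[codes - 1]   # rows: [1,0,0], [0,1,0], [0,0,1]

# Pairwise Euclidean distances for each representation
int_dist = np.abs(codes[:, None] - codes[None, :])
hot_dist = np.linalg.norm(one_hot[:, None, :] - one_hot[None, :, :], axis=2)

print(int_dist)            # categories 1 and 3 are twice as far apart as 1 and 2
print(hot_dist.round(3))   # every pair of distinct categories is sqrt(2) apart
```

With integer codes the distances depend on an arbitrary numbering, while the one-hot rows treat all categories symmetrically.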

Compute the one-hot encodings (dummy binary variables for each unique value in the array).

# Input:
np.random.seed(101) 
arr = np.random.randint(1,4, size=6)
arr
# > array([2, 3, 2, 2, 2, 1])

# Desired output:
# > array([[ 0.,  1.,  0.],
# >        [ 0.,  0.,  1.],
# >        [ 0.,  1.,  0.],
# >        [ 0.,  1.,  0.],
# >        [ 0.,  1.,  0.],
# >        [ 1.,  0.,  0.]])

# Solution:
def one_hot_encodings(arr):
    uniqs = np.unique(arr)
    out = np.zeros((arr.shape[0], uniqs.shape[0]))
    for i, k in enumerate(arr):
        out[i, k-1] = 1
    return out

one_hot_encodings(arr)
# > array([[ 0.,  1.,  0.],
# >        [ 0.,  0.,  1.],
# >        [ 0.,  1.,  0.],
# >        [ 0.,  1.,  0.],
# >        [ 0.,  1.,  0.],
# >        [ 1.,  0.,  0.]])

# Method 2:
(arr[:, None] == np.unique(arr)).view(np.int8)
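
The loop-based solution indexes with k-1, so it relies on the values being exactly 1..N. As a hedged sketch of a more general variant (the helper name one_hot_general and the sample labels are hypothetical), np.unique with return_inverse=True maps arbitrary labels to column indices:

```python
import numpy as np

def one_hot_general(arr):
    # uniqs: sorted unique labels; inverse: index of each element in uniqs
    uniqs, inverse = np.unique(arr, return_inverse=True)
    # Row i of the identity matrix is the one-hot vector for label index i
    return uniqs, np.eye(uniqs.size)[inverse]

labels = np.array([10, 30, 10, 20])   # hypothetical labels, not 1..N
uniqs, encoded = one_hot_general(labels)
print(uniqs)
print(encoded)
```

Returning uniqs alongside the matrix keeps the column-to-label mapping available to the caller.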

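How to create row numbers grouped by a categorical variable?

Create row numbers grouped by a categorical variable. Use the following sample of iris species as input.

# Input: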
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
species = np.genfromtxt(url, delimiter=',', dtype='str', usecols=4)
np.random.seed(100)
species_small = np.sort(np.random.choice(species, size=20))
species_small
# > array(['Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
# >        'Iris-setosa', 'Iris-versicolor', 'Iris-versicolor',
# >        'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor',
# >        'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor',
# >        'Iris-versicolor', 'Iris-virginica', 'Iris-virginica',
# >        'Iris-virginica', 'Iris-virginica', 'Iris-virginica',
# >        'Iris-virginica'],
# >       dtype='<U15')
print([i for val in np.unique(species_small) for i, grp in enumerate(species_small[species_small==val])])
# > [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1, 2, 3, 4, 5]
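
The same numbering can be sketched without a Python-level loop. This version assumes the labels are already sorted, as species_small is, and uses a small hypothetical labels array so it runs without downloading the dataset.

```python
import numpy as np

# Hypothetical sorted labels standing in for species_small
labels = np.array(['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c'])

# starts: index of the first element of each group; counts: group sizes
_, starts, counts = np.unique(labels, return_index=True, return_counts=True)

# Subtract each group's start index from the global position
seq = np.arange(labels.size) - np.repeat(starts, counts)
print(seq)
# > [0 1 2 0 1 0 1 2 3]
```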

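How to create group IDs based on a given categorical variable?

Create group IDs based on a given categorical variable. Use the following sample of iris species as input.

# Input: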
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
species = np.genfromtxt(url, delimiter=',', dtype='str', usecols=4)
np.random.seed(100)
species_small = np.sort(np.random.choice(species, size=20))
species_small
# > array(['Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
# >        'Iris-setosa', 'Iris-versicolor', 'Iris-versicolor',
# >        'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor',
# >        'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor',
# >        'Iris-versicolor', 'Iris-virginica', 'Iris-virginica',
# >        'Iris-virginica', 'Iris-virginica', 'Iris-virginica',
# >        'Iris-virginica'],
# >       dtype='<U15')

# Solution:
output = [np.argwhere(np.unique(species_small) == s).tolist()[0][0] for val in np.unique(species_small) for s in species_small[species_small==val]]

# Solution: For Loop version
output = []
uniqs = np.unique(species_small)

for val in uniqs:  # uniq values in group
    for s in species_small[species_small==val]:  # each element in group
        groupid = np.argwhere(uniqs == s).tolist()[0][0]  # groupid
        output.append(groupid)

print(output)
# > [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
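
A shorter route to the same group IDs is the return_inverse option of np.unique, which returns for each element its index in the sorted unique values. A minimal sketch with hypothetical labels (it does not require the input to be sorted):

```python
import numpy as np

# Hypothetical labels; order does not matter here
labels = np.array(['setosa', 'setosa', 'versicolor', 'virginica', 'versicolor'])

uniqs, group_ids = np.unique(labels, return_inverse=True)
print(uniqs)
# > ['setosa' 'versicolor' 'virginica']
print(group_ids)
# > [0 0 1 2 1]
```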

How to rank the items in an array using NumPy?

Create the ranks for the given numeric array a.

np.random.seed(10)
a = np.random.randint(20, size=10)
a
# array([ 9,  4, 15,  0, 17, 16, 17,  8,  9,  0])

# Solution
a.argsort()
# array([3, 9, 1, 7, 0, 8, 2, 5, 4, 6], dtype=int64)

a.argsort().argsort()
# array([4, 2, 6, 0, 8, 7, 9, 3, 5, 1], dtype=int64)
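
Why does applying argsort twice give ranks? The first argsort lists, for each sorted position, which element lands there; the rank is the inverse of that permutation. The sketch below builds the inverse explicitly on a short hypothetical array and yields the same result as argsort().argsort().

```python
import numpy as np

a = np.array([9, 4, 15, 0, 17])
order = a.argsort()                 # indices that would sort a
ranks = np.empty_like(order)
ranks[order] = np.arange(a.size)    # element at sorted position k gets rank k
print(order)   # > [3 1 0 2 4]
print(ranks)   # > [2 1 3 0 4]
```

Ties are ranked in whatever order argsort places them, so equal values receive different ranks.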

How to rank items in a multidimensional array using NumPy?

Create a rank array of the same shape as the given numeric array a.

# Input:
np.random.seed(10)
a = np.random.randint(20, size=[2,5])
print(a)
# > [[ 9  4 15  0 17]
# >  [16 17  8  9  0]]

# Solution
print(a.ravel().argsort().argsort().reshape(a.shape))
# > [[4 2 6 0 8]
# >  [7 9 3 5 1]]

numpy.ravel() returns a contiguous flattened array.

numpy.ravel(a, order='C') # Return a contiguous flattened array. A 1-D array, containing the elements of the input, is returned. A copy is made only if needed.

How to find the maximum value in each row of a 2D NumPy array?

Problem: compute the maximum of each row in the given array.

# Input
np.random.seed(100)
a = np.random.randint(1,10, [5,3])
a
# array([[9, 9, 4],
#        [8, 8, 1],
#        [5, 3, 6],
#        [3, 3, 3],
#        [2, 1, 9]])

# Solution 1
np.amax(a, axis=1)
# np.amax is an alias of np.max, kept for historical reasons

# Solution 2
np.apply_along_axis(np.max, arr=a, axis=1)
# > array([9, 8, 6, 3, 9])

numpy.apply_along_axis() applies the function func1d to 1-D slices along the given axis.

# numpy.apply_along_axis
numpy.apply_along_axis(func1d, axis, arr, *args, **kwargs)
# Apply a function to 1-D slices along the given axis.

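How to compute the min-by-max ratio for each row of a 2D NumPy array?

Compute the ratio of the minimum to the maximum of each row for the given 2D NumPy array.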

# Input
np.random.seed(100)
a = np.random.randint(1,10, [5,3])
a
# array([[9, 9, 4],
#        [8, 8, 1],
#        [5, 3, 6],
#        [3, 3, 3],
#        [2, 1, 9]])

# Solution
np.apply_along_axis(lambda x: np.min(x)/np.max(x), arr=a, axis=1)
# array([0.44444444, 0.125     , 0.5       , 1.        , 0.11111111])

np.apply_along_axis(lambda x: np.min(x)/np.max(x), arr=a, axis=0)
# array([0.22222222, 0.11111111, 0.11111111])
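
np.apply_along_axis calls the Python function once per slice. For simple reductions such as this one, the axis argument of min and max gives the same numbers in a fully vectorized form. A sketch using the array from the example (row_ratio and looped are illustrative names):

```python
import numpy as np

a = np.array([[9, 9, 4],
              [8, 8, 1],
              [5, 3, 6],
              [3, 3, 3],
              [2, 1, 9]])

# Reduce along axis=1 (within each row), then divide elementwise
row_ratio = a.min(axis=1) / a.max(axis=1)

# Same computation via apply_along_axis, for comparison
looped = np.apply_along_axis(lambda x: np.min(x) / np.max(x), arr=a, axis=1)
print(row_ratio)
```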

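How to find duplicate records in a NumPy array?

Find the duplicate entries (second occurrence onwards) in the given NumPy array and mark them as True. First occurrences should be False.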

# Input
np.random.seed(100)
a = np.random.randint(0, 5, 10)
a
# array([0, 0, 3, 0, 2, 4, 2, 2, 2, 2])

## Solution
# There is no direct function to do this as of 1.13.3

# Create an all True array
out = np.full(a.shape[0], True)

# Find the index positions of unique elements
unique_positions = np.unique(a, return_index=True)[1]

# Mark those positions as False
out[unique_positions] = False

print(out)
# > [False  True False  True False False  True  True  True  True]

How to find the grouped mean of a numeric column?

Find the mean of a numeric column grouped by a categorical column in a 2D array.

# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')
names = ('sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'species')

# Desired output:
# > [[b'Iris-setosa', 3.418],
# >  [b'Iris-versicolor', 2.770],
# >  [b'Iris-virginica', 2.974]]

# Solution
# No direct way to implement this. Just a version of a workaround.
numeric_column = iris[:,1].astype('float')  # sepalwidth
grouping_column = iris[:,4]  # species

# List comprehension version
[[group_val, numeric_column[grouping_column==group_val].mean()] for group_val in np.unique(grouping_column)]

# For Loop version
output = []
for group_val in np.unique(grouping_column):
    output.append([group_val, numeric_column[grouping_column==group_val].mean()])

output
# > [[b'Iris-setosa', 3.418],
# >  [b'Iris-versicolor', 2.770],
# >  [b'Iris-virginica', 2.974]]
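
As an alternative sketch, grouped sums and counts can be computed in one pass with np.bincount once the labels are mapped to integer IDs. The groups and values arrays below are hypothetical stand-ins for grouping_column and numeric_column, so the block runs without the dataset.

```python
import numpy as np

# Hypothetical stand-ins for grouping_column and numeric_column
groups = np.array(['a', 'b', 'a', 'c', 'b', 'a'])
values = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 5.0])

uniqs, ids = np.unique(groups, return_inverse=True)
sums = np.bincount(ids, weights=values)   # per-group sum
counts = np.bincount(ids)                 # per-group size
means = sums / counts
print(dict(zip(uniqs.tolist(), means.tolist())))
# > {'a': 3.0, 'b': 4.0, 'c': 4.0}
```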

How to convert a PIL image to a NumPy array?

Import the image from the following URL and convert it to a numpy array.
URL = 'https://upload.wikimedia.org/wikipedia/commons/8/8b/Denali_Mt_McKinley.jpg'

from io import BytesIO

import numpy as np
import requests
from PIL import Image

# Import image from URL
URL = 'https://upload.wikimedia.org/wikipedia/commons/8/8b/Denali_Mt_McKinley.jpg'
response = requests.get(URL)

# Read it as Image
I = Image.open(BytesIO(response.content))

# Optionally resize
I = I.resize([150,150])

# Convert to numpy array
arr = np.asarray(I)

# Optionally convert it back to an image and show
im = Image.fromarray(np.uint8(arr))
im.show()

Drop all NaN values from a NumPy array

Drop all NaN values from a 1D NumPy array.

a = np.array([1,2,3,np.nan,5,6,7,np.nan])
a[~np.isnan(a)]
# > array([ 1.,  2.,  3.,  5.,  6.,  7.])

Compute the Euclidean distance between two arrays

Compute the Euclidean distance between the arrays a and b.

# Input
a = np.array([1,2,3,4,5])
b = np.array([4,5,6,7,8])

# Solution
dist = np.linalg.norm(a-b)
dist
# > 6.7082039324993694
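
For reference, np.linalg.norm(a-b) here is the square root of the sum of squared differences. A short sketch writing that formula out directly (manual is an illustrative name):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([4, 5, 6, 7, 8])

# Every difference is -3, so the sum of squares is 5 * 9 = 45
manual = np.sqrt(np.sum((a - b) ** 2))
print(manual)
```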

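How to find all the local maxima (peaks) in a 1D array?

Find all the peaks in a 1D numeric array a. Peaks are points surrounded by smaller values on both sides.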

a = np.array([1, 3, 7, 1, 2, 6, 0, 1])
doublediff = np.diff(np.sign(np.diff(a)))
peak_locations = np.where(doublediff == -2)[0] + 1
peak_locations
# > array([2, 5])
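
To see why the value -2 marks a peak, it helps to print the intermediate arrays. In this sketch (slope, direction, and turn are illustrative names), a peak is where the direction of change flips from +1 (rising) to -1 (falling), so the second difference is -2.

```python
import numpy as np

a = np.array([1, 3, 7, 1, 2, 6, 0, 1])
slope = np.diff(a)              # > [ 2  4 -6  1  4 -6  1]
direction = np.sign(slope)      # > [ 1  1 -1  1  1 -1  1]
turn = np.diff(direction)       # > [ 0 -2  2  0 -2  2]

# turn[i] describes the point a[i+1], hence the +1 offset
peaks = np.where(turn == -2)[0] + 1
print(peaks)                    # > [2 5]
```

Flat-topped peaks (repeated equal values) produce a 0 in direction and are not detected by this test.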

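The numpy.diff() function calculates the n-th discrete difference along the given axis.

The numpy.sign function returns -1 if x < 0, 0 if x==0, 1 if x > 0.

numpy.diff(a, n=1, axis=-1, prepend=<no value>, append=<no value>) # Calculate the n-th discrete difference along the given axis.

Parameters:

a: array_like, the input array.
n: int, optional. The number of times values are differenced. Default is 1; if zero, the input is returned as-is.
axis: int, optional. The axis along which the difference is taken; default is the last axis.

Returns:

diff: ndarray, the n-th differences. The shape of the output is the same as a except along axis, where the dimension is smaller by n. The type of the output is the same as the type of the difference between any two elements of a. In most cases this is the same as the type of a. A notable exception is datetime64, which results in a timedelta64 output array.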

x = np.array([1, 2, 4, 7, 0])
np.diff(x)
# > array([ 1,  2,  3, -7])
np.diff(x, n=2)
# > array([  1,   1, -10])
x = np.array([[1, 3, 6, 10], [0, 5, 6, 8]])
np.diff(x)
# > array([[2, 3, 4],
# >        [5, 1, 2]])
np.diff(x, axis=0)
# > array([[-1,  2,  0, -2]])

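Subtract a 1D array from a 2D array, where each item of the 1D array is subtracted from its respective row

Subtract the 1D array b_1d from the 2D array a_2d, such that each item of b_1d is subtracted from the respective row of a_2d.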

# Input
a_2d = np.array([[3,3,3],[4,4,4],[5,5,5]])
b_1d = np.array([1,2,3])

# Solution
print(a_2d - b_1d[:,None])
# > [[2 2 2]
# >  [2 2 2]
# >  [2 2 2]]

Find the index of the n-th occurrence of an item in an array

Find the index of the 5th occurrence of the number 1 in x.

x = np.array([1, 2, 1, 1, 3, 4, 3, 1, 1, 2, 1, 1, 2])
n = 5

# Solution 1: List comprehension
[i for i, v in enumerate(x) if v == 1][n-1]

# Solution 2: Numpy version
np.where(x == 1)[0][n-1]
# > 8

How to convert a NumPy datetime64 object to a datetime.datetime object?

Problem: convert a NumPy datetime64 object to a datetime.datetime object.

# Input: a numpy datetime64 object
dt64 = np.datetime64('2018-02-25 22:10:10')

# Solution
from datetime import datetime
dt64.tolist()
# or
dt64.astype(datetime)
# > datetime.datetime(2018, 2, 25, 22, 10, 10)
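
One caveat worth sketching: the conversion depends on the unit of the datetime64 value. Python's datetime only resolves microseconds, so a value stored with nanosecond precision converts to a plain integer instead. The example below (dt_s, dt_ns, and as_datetime are illustrative names) shows this and one possible workaround, casting to microseconds first.

```python
import numpy as np
from datetime import datetime

dt_s = np.datetime64('2018-02-25T22:10:10')      # unit: seconds
dt_ns = dt_s.astype('datetime64[ns]')            # unit: nanoseconds

print(type(dt_s.tolist()))    # datetime.datetime
print(type(dt_ns.tolist()))   # int (nanoseconds since the epoch)

# Casting to microseconds first restores a datetime result
as_datetime = dt_ns.astype('datetime64[us]').tolist()
print(as_datetime)
```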

Compute the moving average of a NumPy array

Compute the moving average with a window size of 3 for the given 1D array.

np.random.seed(100)
Z = np.random.randint(10, size=10)

# Solution
# Source: https://stackoverflow.com/questions/14313510/how-to-calculate-moving-average-using-numpy
def moving_average(a, n=3) :
    ret = np.cumsum(a, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n

print('array: ', Z)
# Method 1
print('moving average: ', moving_average(Z, n=3).round(2))

# Method 2:  # Thanks AlanLRH!
# np.ones(3)/3 gives equal weights. Use np.ones(4)/4 for window size 4.
np.convolve(Z, np.ones(3)/3, mode='valid')

# > array:  [8 8 3 7 7 0 4 2 5 2]
# > moving average:  [ 6.33  6.    5.67  4.67  3.67  2.    3.67  3.  ]

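Create a NumPy array sequence given only the starting point, length, and step

Create a NumPy array of length 10, starting from 5, with a step of 3 between consecutive numbers.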

length = 10
start = 5
step = 3

def seq(start, length, step):
    end = start + (step*length)
    return np.arange(start, end, step)

seq(start, length, step)
# > array([ 5,  8, 11, 14, 17, 20, 23, 26, 29, 32])

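Fill in missing dates in an irregular series of NumPy dates

Given an array of a non-continuous sequence of dates, fill in the missing dates to make it a continuous sequence.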

# Input
dates = np.arange(np.datetime64('2018-02-01'), np.datetime64('2018-02-25'), 2)
print(dates)
# > ['2018-02-01' '2018-02-03' '2018-02-05' '2018-02-07' '2018-02-09'
# >  '2018-02-11' '2018-02-13' '2018-02-15' '2018-02-17' '2018-02-19'
# >  '2018-02-21' '2018-02-23']

# Solution ---------------
filled_in = np.array([np.arange(date, (date+d)) for date, d in zip(dates, np.diff(dates))]).reshape(-1)

# add the last day
output = np.hstack([filled_in, dates[-1]])
output

# For loop version -------
out = []
for date, d in zip(dates, np.diff(dates)):
    out.append(np.arange(date, (date+d)))

filled_in = np.array(out).reshape(-1)

# add the last day
output = np.hstack([filled_in, dates[-1]])
output
# > array(['2018-02-01', '2018-02-02', '2018-02-03', '2018-02-04',
# >        '2018-02-05', '2018-02-06', '2018-02-07', '2018-02-08',
# >        '2018-02-09', '2018-02-10', '2018-02-11', '2018-02-12',
# >        '2018-02-13', '2018-02-14', '2018-02-15', '2018-02-16',
# >        '2018-02-17', '2018-02-18', '2018-02-19', '2018-02-20',
# >        '2018-02-21', '2018-02-22', '2018-02-23'], dtype='datetime64[D]')
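
When the input dates are sorted, a simpler sketch is to generate the full daily range between the first and last date directly. This assumes day resolution (datetime64[D]); filled is an illustrative name.

```python
import numpy as np

dates = np.arange(np.datetime64('2018-02-01'), np.datetime64('2018-02-25'), 2)

# The stop value is exclusive, so add one day to include the last date
filled = np.arange(dates[0], dates[-1] + np.timedelta64(1, 'D'))
print(filled.size)   # > 23
```

Unlike the list-comprehension version, this does not need the gaps between dates to be equal.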

Create strides from a given 1D array

From the given 1D array arr, generate a 2D matrix using strides, with a window length of 4 and a stride of 2, like [[0,1,2,3], [2,3,4,5], [4,5,6,7]..]

arr = np.arange(15) 
arr
# > array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

def gen_strides(a, stride_len=5, window_len=5):
    n_strides = ((a.size-window_len)//stride_len) + 1
    # return np.array([a[s:(s+window_len)] for s in np.arange(0, a.size, stride_len)[:n_strides]])
    return np.array([a[s:(s+window_len)] for s in np.arange(0, n_strides*stride_len, stride_len)])

print(gen_strides(np.arange(15), stride_len=2, window_len=4))
# > [[ 0  1  2  3]
# >  [ 2  3  4  5]
# >  [ 4  5  6  7]
# >  [ 6  7  8  9]
# >  [ 8  9 10 11]
# >  [10 11 12 13]]
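
NumPy 1.20 and later also provide numpy.lib.stride_tricks.sliding_window_view, which can produce the same windows without a Python loop. A minimal sketch, assuming that NumPy version is available: build every length-4 window, then keep every second one.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

arr = np.arange(15)

# All length-4 windows (step 1), then keep every 2nd window for a stride of 2
windows = sliding_window_view(arr, window_shape=4)[::2]
print(windows)
```

The result is a read-only view into arr, so call .copy() before modifying it.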

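Chapter summary

Array attributes

When working with NumPy, you will often want to know certain information about an array. Fortunately, NumPy provides many convenient attributes that give you that information.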

# Array properties
a = np.array([[11, 12, 13, 14, 15],
              [16, 17, 18, 19, 20],
              [21, 22, 23, 24, 25],
              [26, 27, 28 ,29, 30],
              [31, 32, 33, 34, 35]])

print(type(a)) # >>><class 'numpy.ndarray'>
print(a.dtype) # >>>int64
print(a.size) # >>>25
print(a.shape) # >>>(5, 5)
print(a.itemsize) # >>>8
print(a.ndim) # >>>2
print(a.nbytes) # >>>200

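As you can see in the code above, a NumPy array is actually of type 'numpy.ndarray'.

The shape attribute gives the number of rows and columns; the array above has 5 rows and 5 columns, so its shape is (5, 5).
The itemsize attribute is the number of bytes each item occupies. This array's data type is int64; an int64 has 64 bits and 1 byte = 8 bits, so each item takes 8 bytes.
The ndim attribute is the number of dimensions of the array, 2 in this case.
The nbytes attribute is the number of bytes consumed by all the data in the array. It does not count the overhead of the array object itself, so the actual memory the array occupies is slightly larger.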
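
The relationship between these attributes can be sketched as nbytes = size × itemsize. The example below sets the dtype explicitly, because the default integer type may differ across platforms and NumPy versions; b is an illustrative name.

```python
import numpy as np

a = np.arange(25, dtype=np.int64).reshape(5, 5)
print(a.itemsize)                      # > 8
print(a.nbytes)                        # > 200
print(a.size * a.itemsize)             # > 200

# Halving the item width halves the data size
b = a.astype(np.int32)
print(b.itemsize, b.nbytes)            # > 4 100
```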

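Array indexing

NumPy offers several ways to index arrays, including single-element indexing, slice indexing, integer array indexing, and boolean array indexing.

Single-element indexing

Single-element indexing for 1-D arrays works the way one would expect. It works exactly like indexing other standard Python sequences: it is 0-based and accepts negative indices for indexing from the end of the array.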

>>> x = np.arange(10)
>>> x[2]
2
>>> x[-2]
8

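Slice indexing (Slicing)

Similar to Python lists, NumPy arrays can be sliced. Since arrays may be multidimensional, you must specify a slice for each dimension of the array: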

import numpy as np

a = np.array([[1, 2, 3, 4],     # Create the following rank 2 array with shape (3, 4)
              [5, 6, 7, 8],
              [9,10,11,12]])

b = a[:2, 1:3]    # Use slicing to pull out the subarray consisting of the first 2 rows and columns 1 and 2; 
print(b)          # b is the following array of shape (2, 2):
                  # [[2 3]
                  #  [6 7]]

# A slice of an array is a view into the same data, so modifying it
# will modify the original array.
print(a[0, 1])   # Prints "2"
b[0, 0] = 77     # b[0, 0] is the same piece of data as a[0, 1]
print(a[0, 1])   # Prints "77"

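You can also mix integer indexing with slice indexing. However, doing so yields an array of lower rank than the original array. Note that this is quite different from the way MATLAB handles array slicing: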

import numpy as np

a = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8], 
              [9,10,11,12]])

# Two ways of accessing the data in the middle row of the array.
# Mixing integer indexing with slices yields an array of lower rank,
# while using only slices yields an array of the same rank as the
# original array:
row_r1 = a[1, :]    # Rank 1 view of the second row of a
row_r2 = a[1:2, :]  # Rank 2 view of the second row of a
print(row_r1, row_r1.shape)  # Prints "[5 6 7 8] (4,)"
print(row_r2, row_r2.shape)  # Prints "[[5 6 7 8]] (1, 4)"

# We can make the same distinction when accessing columns of an array:
col_r1 = a[:, 1]
col_r2 = a[:, 1:2]
print(col_r1, col_r1.shape)  # Prints "[ 2  6 10] (3,)"
print(col_r2, col_r2.shape)  # Prints "[[ 2]
                             #          [ 6]
                             #          [10]] (3, 1)"

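Integer array indexing

When you index into a NumPy array using slicing, the resulting array view will always be a subarray of the original array. In contrast, integer array indexing allows you to construct arbitrary arrays using the data from another array. Here is an example: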

import numpy as np

a = np.array([[1, 2],
              [3, 4], 
              [5, 6]])

# An example of integer array indexing.
# The returned array will have shape (3,).
print(a[[0, 1, 2], [0, 1, 0]])  # Prints "[1 4 5]"

# The above example of integer array indexing is equivalent to this:
print(np.array([a[0, 0], a[1, 1], a[2, 0]]))  # Prints "[1 4 5]"

# When using integer array indexing, you can reuse the same
# element from the source array:
print(a[[0, 0], [1, 1]])  # Prints "[2 2]"

# Equivalent to the previous integer array indexing example
print(np.array([a[0, 1], a[0, 1]]))  # Prints "[2 2]"
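
A practical consequence of this difference: a slice is a view that shares memory with the original array, while integer array indexing returns a copy. The sketch below checks this with np.shares_memory (sliced and picked are illustrative names).

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])
sliced = a[:2, :]          # basic slicing: a view
picked = a[[0, 1], :]      # integer array indexing: a copy

print(np.shares_memory(a, sliced))   # > True
print(np.shares_memory(a, picked))   # > False

# Writing to the copy leaves the original untouched
picked[0, 0] = 99
print(a[0, 0])                       # > 1
```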

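Boolean array indexing

Boolean array indexing lets you pick out arbitrary elements of an array. This type of indexing is frequently used to select the elements of an array that satisfy some condition. Here is an example: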

import numpy as np

a = np.array([[1,2], [3, 4], [5, 6]])

bool_idx = (a > 2)   # Find the elements of a that are bigger than 2;
                     # this returns a numpy array of Booleans of the same
                     # shape as a, where each slot of bool_idx tells
                     # whether that element of a is > 2.

print(bool_idx)      # Prints "[[False False]
                     #          [ True  True]
                     #          [ True  True]]"

# We use boolean array indexing to construct a rank 1 array
# consisting of the elements of a corresponding to the True values
# of bool_idx
print(a[bool_idx])  # Prints "[3 4 5 6]"

# We can do all of the above in a single concise statement:
print(a[a > 2])     # Prints "[3 4 5 6]"

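The where function

The where() function is an efficient way to locate the elements of an array that satisfy a condition. Pass it a condition and it returns a tuple of index arrays, one per dimension, giving the positions where the condition is true.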

# Where
a = np.arange(0, 100, 10)
b = np.where(a < 50) 
c = np.where(a >= 50)[0]
print(b) # >>>(array([0, 1, 2, 3, 4]),)
print(c) # >>>[5 6 7 8 9]
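
np.where also has a three-argument form, np.where(condition, x, y), which selects elementwise from x where the condition is true and from y elsewhere. A short sketch (capped is an illustrative name):

```python
import numpy as np

a = np.arange(0, 100, 10)

# Keep values below 50, replace everything else with 50
capped = np.where(a < 50, a, 50)
print(capped)
# > [ 0 10 20 30 40 50 50 50 50 50]
```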

Reverse the columns of a 2D array

# Input
arr = np.arange(9).reshape(3,3)

# Solution
arr[:, ::-1]
# > array([[2, 1, 0],
# >        [5, 4, 3],
# >        [8, 7, 6]])

Swap two columns in a 2D numpy array

# Input
arr = np.arange(9).reshape(3,3)
arr
# > array([[0, 1, 2],
# >        [3, 4, 5],
# >        [6, 7, 8]])

# Solution
arr[:, [1,0,2]]
# > array([[1, 0, 2],
# >        [4, 3, 5],
# >        [7, 6, 8]])

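Extract a particular column from a 1D array of tuples

Problem: extract the text column species from the 1D iris dataset imported in an earlier problem.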

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_1d = np.genfromtxt(url, delimiter=',', dtype=None)
print(iris_1d.shape)

# Solution:
species = np.array([row[4] for row in iris_1d])
species[:5]
# > array([b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa',
# >        b'Iris-setosa'],
# >       dtype='|S15')

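Broadcasting

Broadcasting is a powerful mechanism that allows NumPy to work with arrays of different shapes when performing arithmetic operations. Frequently we have a smaller array and a larger array, and we want to use the smaller array multiple times to perform some operation on the larger array.

For example, suppose we want to add a constant vector to each row of a matrix. One way, without broadcasting, is to stack copies of the vector explicitly: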

import numpy as np

# We will add the vector v to each row of the matrix x,
# storing the result in the matrix y
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])
vv = np.tile(v, (4, 1))   # Stack 4 copies of v on top of each other
print(vv)                 # Prints "[[1 0 1]
                          #          [1 0 1]
                          #          [1 0 1]
                          #          [1 0 1]]"
y = x + vv  # Add x and vv elementwise
print(y)  # Prints "[[ 2  2  4]
          #          [ 5  5  7]
          #          [ 8  8 10]
          #          [11 11 13]]"

Broadcasting typically makes your code more concise and faster, so you should use it wherever possible.
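
As a sketch of what that looks like for the example above: with broadcasting, the np.tile step is unnecessary, because NumPy stretches v across the rows of x automatically. The second part (col and z are illustrative names) shows the column-wise case, where a new axis is added so that each item lines up with a row.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
v = np.array([1, 0, 1])

# v has shape (3,) and is broadcast across the 4 rows of x
y = x + v
print(y)

# To apply one value per row instead, reshape to (4, 1) with [:, None]
col = np.array([1, 2, 3, 4])
z = x - col[:, None]
print(z)
```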

Reshaping an array

Problem: how to convert a 1D array into a 2D array with 2 rows

arr = np.arange(10)
arr.reshape(2, -1)  # Setting to -1 automatically decides the number of cols
# > array([[0, 1, 2, 3, 4],
# >        [5, 6, 7, 8, 9]])

numpy.reshape(a, newshape, order='C')
# Gives a new shape to an array without changing its data.

Concatenating arrays

a = np.arange(10).reshape(2,-1)
# array([[0, 1, 2, 3, 4],
#        [5, 6, 7, 8, 9]])

b = np.repeat(1, 10).reshape(2,-1)
# array([[1, 1, 1, 1, 1],
#        [1, 1, 1, 1, 1]])

# Stack two arrays vertically
# Method 1:
np.concatenate([a, b], axis=0)

# Method 2:
np.vstack([a, b])

# Method 3:
np.r_[a, b]
# > array([[0, 1, 2, 3, 4],
# >        [5, 6, 7, 8, 9],
# >        [1, 1, 1, 1, 1],
# >        [1, 1, 1, 1, 1]])

# Stack two arrays horizontally
# Method 1:
np.concatenate([a, b], axis=1)

# Method 2:
np.hstack([a, b])

# Method 3:
np.c_[a, b]
# > array([[0, 1, 2, 3, 4, 1, 1, 1, 1, 1],
# >        [5, 6, 7, 8, 9, 1, 1, 1, 1, 1]])

Random number generation

# Solution Method 1:
rand_arr = np.random.randint(low=5, high=10, size=(5,3)) + np.random.random((5,3))
# print(rand_arr)

# Solution Method 2:
rand_arr = np.random.uniform(5,10, size=(5,3))
print(rand_arr)
# > [[ 8.50061025  9.10531502  6.85867783]
# >  [ 9.76262069  9.87717411  7.13466701]
# >  [ 7.48966403  8.33409158  6.16808631]
# >  [ 7.75010551  9.94535696  5.27373226]
# >  [ 8.0850361   5.56165518  7.31244004]]

# Problem: insert np.nan values at 20 random positions in the iris_2d dataset

# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='object')

# Method 1
i, j = np.where(iris_2d)

# i, j contain the row numbers and column numbers of the non-empty elements of iris_2d
np.random.seed(100)
iris_2d[np.random.choice((i), 20), np.random.choice((j), 20)] = np.nan

# Method 2
np.random.seed(100)
iris_2d[np.random.randint(150, size=20), np.random.randint(4, size=20)] = np.nan

# Print first 10 rows
print(iris_2d[:10])
# > [[b'5.1' b'3.5' b'1.4' b'0.2' b'Iris-setosa']
# >  [b'4.9' b'3.0' b'1.4' b'0.2' b'Iris-setosa']
# >  [b'4.7' b'3.2' b'1.3' b'0.2' b'Iris-setosa']
# >  [b'4.6' b'3.1' b'1.5' b'0.2' b'Iris-setosa']
# >  [b'5.0' b'3.6' b'1.4' b'0.2' b'Iris-setosa']
# >  [b'5.4' b'3.9' b'1.7' b'0.4' b'Iris-setosa']
# >  [b'4.6' b'3.4' b'1.4' b'0.3' b'Iris-setosa']
# >  [b'5.0' b'3.4' b'1.5' b'0.2' b'Iris-setosa']
# >  [b'4.4' nan b'1.4' b'0.2' b'Iris-setosa']
# >  [b'4.9' b'3.1' b'1.5' b'0.1' b'Iris-setosa']]

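In-class exercise

Generate a random array a of shape [10, 20] whose values are uniformly distributed in [-10, 10);

Insert NaN at 20 random positions in array a;

Find the NaN values in array a and replace them with random numbers uniformly distributed in [-20, 20);

Replace values in array a greater than 5 with 5, and values less than -5 with -5.

In-class exercise answers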

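import numpy as np

# 1. Uniform random values in [-10, 10) with shape (10, 20)
a = np.random.uniform(-10, 10, (10, 20))
# 2. Pick 20 distinct flat positions and set them to NaN
flat_idx = np.random.choice(a.size, size=20, replace=False)
a.flat[flat_idx] = np.nan
# 3. Replace each NaN with its own uniform draw from [-20, 20)
nan_mask = np.isnan(a)
a[nan_mask] = np.random.uniform(-20, 20, size=nan_mask.sum())
# 4. Clamp values to the range [-5, 5]
a = np.clip(a, -5, 5)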