Python in Scientific Research 04: NumPy Fundamentals for Data Analysis

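NumPy, short for "Numerical Python", is a core numerical library for Python that focuses on efficient handling of multidimensional arrays and matrices. It plays a central role in data analysis, providing tools for mathematical operations, linear algebra, and statistics. Its fast array processing makes it practical to work with large datasets, whether for numerical computation, data transformation, or data cleaning, and its concise API keeps that work simple. In data science, machine learning, and scientific computing, NumPy is a foundational tool for researchers and engineers.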

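This lesson serves only as a reference for learning NumPy. It aims to take you beyond basic NumPy usage by working through concrete problems that teach more advanced techniques.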

Import NumPy as np and check the version

Import NumPy as np and print the version number:

import numpy as np
print(np.__version__)
# > 1.13.3

You must import NumPy as np for the rest of the code in this lesson to work. To install NumPy, installing Anaconda is recommended, since it already includes NumPy.

How to create a 1D array

Create a 1D array of the numbers from 0 to 9

arr = np.arange(10)
print(arr)

# > [0 1 2 3 4 5 6 7 8 9]

Create a boolean array

Create a NumPy array in which every element is True

np.full((3, 3), True, dtype=bool)
# array([[ True, True, True],
# [ True, True, True],
# [ True, True, True]], dtype=bool)

# Alternate method:
np.ones((3,3), dtype=bool)
# array([[ True, True, True],
# [ True, True, True],
# [ True, True, True]])

Extract the items of a 1D array that satisfy a given condition

Extract all odd numbers from arr

# Input
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# Solution
print(arr % 2 == 1)
# > array([False, True, False, True, False, True, False, True, False, True])

print(arr[arr % 2 == 1])
# > array([1, 3, 5, 7, 9])

Replace items that satisfy a condition with another value

Replace all odd numbers in arr with -1.

arr[arr % 2 == 1] = -1
print(arr)
# > array([ 0, -1, 2, -1, 4, -1, 6, -1, 8, -1])

Replace items that satisfy a condition without affecting the original array

Replace all odd numbers in arr with -1 without changing arr.

arr = np.arange(10)
out = np.where(arr % 2 == 1, -1, arr)

print(arr)
# > array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

print(out)
#> array([ 0, -1, 2, -1, 4, -1, 6, -1, 8, -1])

numpy.where(condition, x, y)
  • condition: array_like, bool. Where True, yield x; otherwise yield y.
  • x, y: array_like. Values from which to choose. x, y and condition must be broadcastable to a common shape.
  • returns: ndarray with elements taken from x where condition is True and from y elsewhere.

a = np.arange(10)
a
# > array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.where(a < 5, a, 10*a)
# > array([ 0, 1, 2, 3, 4, 50, 60, 70, 80, 90])

# This can be used on multidimensional arrays too:
np.where([[True, False], [True, True]],
         [[1, 2], [3, 4]],
         [[9, 8], [7, 6]])
# > array([[1, 8],
# > [3, 4]])

Change the shape of an array

Problem: How to convert a 1D array into a 2D array with 2 rows

arr = np.arange(10)
arr.reshape(2, -1) # Setting to -1 automatically decides the number of cols
# > array([[0, 1, 2, 3, 4],
# > [5, 6, 7, 8, 9]])
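
numpy.reshape(a, newshape, order='C')
# Gives a new shape to an array without changing its data.
  • a: array_like, the array to be reshaped;
  • newshape: int or tuple of ints. The new shape should be compatible with the original shape. If an integer, the result is a 1-D array of that length. One shape dimension can be -1, in which case its value is inferred from the length of the array and the remaining dimensions.
  • order: {'C', 'F', 'A'}, optional. Read the elements of a using this index order, and place the elements into the reshaped array using this index order. 'C' means C-like index order, with the last axis index changing fastest and the first axis index changing slowest. 'F' means Fortran-like index order, with the first index changing fastest and the last index changing slowest. Note that 'C' and 'F' take no account of the memory layout of the underlying array and only refer to the order of indexing. 'A' means Fortran-like index order if a is Fortran contiguous in memory, and C-like order otherwise.
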
a = np.arange(6).reshape((3, 2))
a
# > array([[0, 1],
# > [2, 3],
# > [4, 5]])

np.reshape(a, (2, 3)) # C-like index ordering
# > array([[0, 1, 2],
# > [3, 4, 5]])

np.reshape(a, (2, 3), order='F') # Fortran-like index ordering
# > array([[0, 4, 3],
# > [2, 1, 5]])

Stack two arrays vertically

Problem: Stack arrays a and b vertically

a = np.arange(10).reshape(2,-1)
# array([[0, 1, 2, 3, 4],
# [5, 6, 7, 8, 9]])

b = np.repeat(1, 10).reshape(2,-1)
# array([[1, 1, 1, 1, 1],
# [1, 1, 1, 1, 1]])

# Method 1:
np.concatenate([a, b], axis=0)

# Method 2:
np.vstack([a, b])

# Method 3:
np.r_[a, b]
# > array([[0, 1, 2, 3, 4],
# > [5, 6, 7, 8, 9],
# > [1, 1, 1, 1, 1],
# > [1, 1, 1, 1, 1]])

Stack two arrays horizontally

Problem: Stack arrays a and b horizontally.

a = np.arange(10).reshape(2,-1)
# array([[0, 1, 2, 3, 4],
# [5, 6, 7, 8, 9]])

b = np.repeat(1, 10).reshape(2,-1)
# array([[1, 1, 1, 1, 1],
# [1, 1, 1, 1, 1]])

# Answers
# Method 1:
np.concatenate([a, b], axis=1)

# Method 2:
np.hstack([a, b])

# Method 3:
np.c_[a, b]
# > array([[0, 1, 2, 3, 4, 1, 1, 1, 1, 1],
# > [5, 6, 7, 8, 9, 1, 1, 1, 1, 1]])

Get the common items between two NumPy arrays

Problem: Get the common items between arrays a and b.

a = np.array([1,2,3,2,3,4,3,4,5,6])
b = np.array([7,2,10,2,7,4,9,4,9,8])
np.intersect1d(a,b)
# > array([2, 4])
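
[intersect1d, comm1, comm2] = numpy.intersect1d(ar1, ar2, assume_unique=False, return_indices=False)
# Find the intersection of two arrays.
  • ar1, ar2: array_like, input arrays. They are flattened if not already 1D.
  • assume_unique: bool. If True, the input arrays are both assumed to be unique, which can speed up the calculation. If True but ar1 or ar2 are not unique, incorrect results and out-of-bounds indices could result. Default is False.
  • return_indices: bool. If True, the indices that correspond to the intersection of the two arrays are returned. If a value occurs more than once, its first instance is used. Default is False.
  • intersect1d: ndarray, sorted 1D array of common and unique elements.
  • comm1: indices of the first occurrences of the common values in ar1. Only provided if return_indices is True.
  • comm2: indices of the first occurrences of the common values in ar2. Only provided if return_indices is True.
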
x = np.array([1, 1, 2, 3, 4])
y = np.array([2, 1, 4, 6])
xy, x_ind, y_ind = np.intersect1d(x, y, return_indices=True)
x_ind, y_ind
# >(array([0, 2, 4]), array([1, 0, 2]))
xy, x[x_ind], y[y_ind]
# >(array([1, 2, 4]), array([1, 2, 4]), array([1, 2, 4]))

Remove from one array the items that exist in another

Problem: From array a, remove all items present in array b.

a = np.array([1,2,3,4,5])
b = np.array([5,6,7,8,9])

# From 'a' remove all of 'b'
np.setdiff1d(a,b)
# > array([1, 2, 3, 4])
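
setdiff1d = numpy.setdiff1d(ar1, ar2, assume_unique=False)
# Find the set difference of two arrays.
  • ar1: array_like, input array;
  • ar2: array_like, input comparison array;
  • assume_unique: bool. If True, the input arrays are both assumed to be unique, which can speed up the calculation. Default is False.
  • setdiff1d: 1D array of the values in ar1 that are not in ar2. The result is sorted when assume_unique=False; otherwise it is only sorted if the input is sorted.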
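
The effect of assume_unique on the ordering of the result can be seen in a small sketch. The arrays below are made-up illustration values, not data from this lesson; ar1 is unique but deliberately unsorted.

```python
import numpy as np

# Illustrative inputs (assumption): ar1 is unique but not sorted
ar1 = np.array([3, 1, 2, 5])
ar2 = np.array([2])

# Default: inputs are de-duplicated and sorted, so the result is sorted
default = np.setdiff1d(ar1, ar2)

# assume_unique=True: no sorting step, so the order of ar1 is kept
fast = np.setdiff1d(ar1, ar2, assume_unique=True)

print(default)  # [1 3 5]
print(fast)     # [3 1 5]
```

Only pass assume_unique=True when both inputs really contain no duplicates.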

Get the positions where the elements of two arrays match

Problem: Get the positions where the elements of a and b match.

a = np.array([1,2,3,2,3,4,3,4,5,6])
b = np.array([7,2,10,2,7,4,9,4,9,8])

np.where(a == b)
# > (array([1, 3, 5, 7]),)

Extract all numbers within a given range from a NumPy array

Problem: Get all items between 5 and 10.

a = np.arange(15)

# Method 1
index = np.where((a >= 5) & (a <= 10))
a[index]
# > array([ 5, 6, 7, 8, 9, 10])

# Method 2:
index = np.where(np.logical_and(a>=5, a<=10))
a[index]
# > array([ 5, 6, 7, 8, 9, 10])

# Method 3: (thanks loganzk!)
a[(a >= 5) & (a <= 10)]
# > array([ 5, 6, 7, 8, 9, 10])

Make a Python function that handles scalars work on NumPy arrays

Problem: Convert the function maxx, which works on two scalars, so that it works on two arrays.

# Given:
def maxx(x, y):
    """Get the maximum of two items"""
    if x >= y:
        return x
    else:
        return y

maxx(1, 5)
# > 5

# Desired output:
a = np.array([5, 7, 9, 8, 6, 4, 5])
b = np.array([6, 3, 4, 8, 9, 7, 1])
pair_max(a, b)
# > array([ 6., 7., 9., 8., 9., 7., 5.])

# Solution
def maxx(x, y):
    """Get the maximum of two items"""
    if x >= y:
        return x
    else:
        return y

pair_max = np.vectorize(maxx, otypes=[float])

a = np.array([5, 7, 9, 8, 6, 4, 5])
b = np.array([6, 3, 4, 8, 9, 7, 1])

pair_max(a, b)
# > array([ 6., 7., 9., 8., 9., 7., 5.])
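
np.vectorize is a convenience wrapper that still calls the Python function once per element, so it gives no speed advantage over a loop. For this particular function NumPy already has a built-in element-wise equivalent, np.maximum. The sketch below compares the two on the same inputs; using np.maximum here is a suggested alternative, not part of the original solution.

```python
import numpy as np

def maxx(x, y):
    """Get the maximum of two items"""
    return x if x >= y else y

a = np.array([5, 7, 9, 8, 6, 4, 5])
b = np.array([6, 3, 4, 8, 9, 7, 1])

# Wrapped scalar function: calls maxx once per pair, returns floats
pair_max = np.vectorize(maxx, otypes=[float])
via_vectorize = pair_max(a, b)

# Built-in ufunc: runs in compiled code and keeps the integer dtype
via_ufunc = np.maximum(a, b)

print(via_vectorize)
print(via_ufunc)
```

np.vectorize remains useful when no built-in ufunc expresses the scalar logic.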

Swap two columns in a 2D numpy array

Problem: Swap the first and second columns in the array arr.

# Input
arr = np.arange(9).reshape(3,3)
arr
# > array([[0, 1, 2],
# > [3, 4, 5],
# > [6, 7, 8]])

# Solution
arr[:, [1,0,2]]
# > array([[1, 0, 2],
# > [4, 3, 5],
# > [7, 6, 8]])

Swap two rows in a 2D numpy array

Problem: Swap the first and second rows in the array arr:

# Input
arr = np.arange(9).reshape(3,3)

# Solution
arr[[1,0,2], :]
# > array([[3, 4, 5],
# > [0, 1, 2],
# > [6, 7, 8]])

Reverse the rows of a 2D array

Problem: Reverse the rows of the 2D array arr.

# Input
arr = np.arange(9).reshape(3,3)
# Solution
arr[::-1]
# > array([[6, 7, 8],
# > [3, 4, 5],
# > [0, 1, 2]])

Reverse the columns of a 2D array

Problem: Reverse the columns of the 2D array arr.

# Input
arr = np.arange(9).reshape(3,3)

# Solution
arr[:, ::-1]
# > array([[2, 1, 0],
# > [5, 4, 3],
# > [8, 7, 6]])

Create a 2D array of random floats between 5 and 10

Problem: Create a 2D array of shape 5x3 containing random decimal numbers between 5 and 10.

# Solution Method 1:
rand_arr = np.random.randint(low=5, high=10, size=(5,3)) + np.random.random((5,3))
# print(rand_arr)

# Solution Method 2:
rand_arr = np.random.uniform(5,10, size=(5,3))
print(rand_arr)
# > [[ 8.50061025 9.10531502 6.85867783]
# > [ 9.76262069 9.87717411 7.13466701]
# > [ 7.48966403 8.33409158 6.16808631]
# > [ 7.75010551 9.94535696 5.27373226]
# > [ 8.0850361 5.56165518 7.31244004]]

Print only three decimal places in a NumPy array

Problem: Print or show only 3 decimal places of the numpy array rand_arr.

# Create the random array
rand_arr = np.random.random([5,3])

# Limit to 3 decimal places
np.set_printoptions(precision=3)
rand_arr[:4]
# > array([[ 0.443, 0.109, 0.97 ],
# > [ 0.388, 0.447, 0.191],
# > [ 0.891, 0.474, 0.212],
# > [ 0.609, 0.518, 0.403]])

Print a NumPy array without scientific notation (such as 1e10)

Problem: Print rand_arr with scientific notation (such as 1e10) suppressed

# Reset printoptions to default
np.set_printoptions(suppress=False)

# Create the random array
np.random.seed(100)
rand_arr = np.random.random([3,3])/1e3
rand_arr
# > array([[ 5.434049e-04, 2.783694e-04, 4.245176e-04],
# > [ 8.447761e-04, 4.718856e-06, 1.215691e-04],
# > [ 6.707491e-04, 8.258528e-04, 1.367066e-04]])
np.set_printoptions(suppress=True, precision=6) # precision is optional
rand_arr
# > array([[ 0.000543, 0.000278, 0.000425],
# > [ 0.000845, 0.000005, 0.000122],
# > [ 0.000671, 0.000826, 0.000137]])

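Limit the number of items printed in the output of a numpy array

Problem: Limit the number of items printed for the numpy array a to a maximum of 6 elements.
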
np.set_printoptions(threshold=6)
a = np.arange(15)
a
# > array([ 0, 1, 2, ..., 12, 13, 14])

Print the full numpy array without truncating

Problem: Print the full numpy array a without truncating.

# Input
np.set_printoptions(threshold=6)
a = np.arange(15)

# Solution
import sys
np.set_printoptions(threshold=sys.maxsize)  # recent NumPy versions reject threshold=np.nan
a
# > array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
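
np.set_printoptions changes a global setting, so each of the printing examples above affects every later one. Where that matters, the np.printoptions context manager (available since NumPy 1.15) applies a setting only inside a with block. The following is a minimal sketch of that alternative, not part of the original solutions.

```python
import numpy as np

a = np.arange(15)
before = np.get_printoptions()["threshold"]

# The setting is active only inside the with block
with np.printoptions(threshold=6):
    short = repr(a)

after = np.get_printoptions()["threshold"]

print(short)            # summarized with "..."
print(before == after)  # the global threshold is untouched
```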

Import a dataset with numbers and text, keeping the text intact in the numpy array

Problem: Import the iris dataset, keeping the text intact.

# Solution
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')
names = ('sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'species')

# Print the first 3 rows
iris[:3]
# > array([[b'5.1', b'3.5', b'1.4', b'0.2', b'Iris-setosa'],
# > [b'4.9', b'3.0', b'1.4', b'0.2', b'Iris-setosa'],
# > [b'4.7', b'3.2', b'1.3', b'0.2', b'Iris-setosa']], dtype=object)

Extract a particular column from a 1D array of tuples

Problem: Extract the text column species from the 1D iris dataset imported in the previous problem.

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_1d = np.genfromtxt(url, delimiter=',', dtype=None)
print(iris_1d.shape)

# Solution:
species = np.array([row[4] for row in iris_1d])
species[:5]
# > array([b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa',
# > b'Iris-setosa'],
# > dtype='|S15')

Convert a 1D array of tuples into a 2D NumPy array

Problem: Convert the 1D iris dataset into a 2D array iris_2d by omitting the species text field.

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_1d = np.genfromtxt(url, delimiter=',', dtype=None)

# Solution:
# Method 1: Convert each row to a list and get the first 4 items
iris_2d = np.array([row.tolist()[:4] for row in iris_1d])
iris_2d[:4]

# Alt Method 2: Import only the first 4 columns from source url
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])
iris_2d[:4]
# > array([[ 5.1, 3.5, 1.4, 0.2],
# > [ 4.9, 3. , 1.4, 0.2],
# > [ 4.7, 3.2, 1.3, 0.2],
# > [ 4.6, 3.1, 1.5, 0.2]])

Compute the mean, median, and standard deviation of a numpy array

Problem: Find the mean, median, and standard deviation of the iris sepallength (1st column)

# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')
sepallength = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0])

# Solution
mu, med, sd = np.mean(sepallength), np.median(sepallength), np.std(sepallength)
print(mu, med, sd)
# > 5.84333333333 5.8 0.825301291785

Normalize an array so that its values range exactly between 0 and 1

Problem: Create a normalized form of the iris sepallength whose values range exactly between 0 and 1, so that the minimum is 0 and the maximum is 1.

# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
sepallength = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0])

# Solution
Smax, Smin = sepallength.max(), sepallength.min()
S = (sepallength - Smin)/(Smax - Smin)
# or
S = (sepallength - Smin)/np.ptp(sepallength) # Thanks, David Ojeda! (the ndarray.ptp method was removed in NumPy 2.0)
print(S)
# > [ 0.222 0.167 0.111 0.083 0.194 0.306 0.083 0.194 0.028 0.167
# > 0.306 0.139 0.139 0. 0.417 0.389 0.306 0.222 0.389 0.222
# > 0.306 0.222 0.083 0.222 0.139 0.194 0.194 0.25 0.25 0.111
# > 0.139 0.306 0.25 0.333 0.167 0.194 0.333 0.167 0.028 0.222
# > 0.194 0.056 0.028 0.194 0.222 0.139 0.222 0.083 0.278 0.194
# > 0.75 0.583 0.722 0.333 0.611 0.389 0.556 0.167 0.639 0.25
# > 0.194 0.444 0.472 0.5 0.361 0.667 0.361 0.417 0.528 0.361
# > 0.444 0.5 0.556 0.5 0.583 0.639 0.694 0.667 0.472 0.389
# > 0.333 0.333 0.417 0.472 0.306 0.472 0.667 0.556 0.361 0.333
# > 0.333 0.5 0.417 0.194 0.361 0.389 0.389 0.528 0.222 0.389
# > 0.556 0.417 0.778 0.556 0.611 0.917 0.167 0.833 0.667 0.806
# > 0.611 0.583 0.694 0.389 0.417 0.583 0.611 0.944 0.944 0.472
# > 0.722 0.361 0.944 0.556 0.667 0.806 0.528 0.5 0.583 0.806
# > 0.861 1. 0.583 0.556 0.5 0.944 0.556 0.583 0.472 0.722
# > 0.667 0.722 0.417 0.694 0.667 0.667 0.556 0.611 0.528 0.444]

Find the percentiles of a numpy array

Problem: Find the 5th and 95th percentiles of the iris sepallength

# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
sepallength = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0])

# Solution
np.percentile(sepallength, q=[5, 95])
# > array([ 4.6 , 7.255])

Insert values at random positions in an array

Problem: Insert np.nan values at 20 random positions in the iris_2d dataset

# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='object')

# Method 1
i, j = np.where(iris_2d)

# i, j contain the row numbers and column numbers of 600 elements of iris_x
np.random.seed(100)
iris_2d[np.random.choice((i), 20), np.random.choice((j), 20)] = np.nan

# Method 2
np.random.seed(100)
iris_2d[np.random.randint(150, size=20), np.random.randint(4, size=20)] = np.nan

# Print first 10 rows
print(iris_2d[:10])
# > [[b'5.1' b'3.5' b'1.4' b'0.2' b'Iris-setosa']
# > [b'4.9' b'3.0' b'1.4' b'0.2' b'Iris-setosa']
# > [b'4.7' b'3.2' b'1.3' b'0.2' b'Iris-setosa']
# > [b'4.6' b'3.1' b'1.5' b'0.2' b'Iris-setosa']
# > [b'5.0' b'3.6' b'1.4' b'0.2' b'Iris-setosa']
# > [b'5.4' b'3.9' b'1.7' b'0.4' b'Iris-setosa']
# > [b'4.6' b'3.4' b'1.4' b'0.3' b'Iris-setosa']
# > [b'5.0' b'3.4' b'1.5' b'0.2' b'Iris-setosa']
# > [b'4.4' nan b'1.4' b'0.2' b'Iris-setosa']
# > [b'4.9' b'3.1' b'1.5' b'0.1' b'Iris-setosa']]

Find the positions of missing values in a NumPy array

Problem: Find the number and position of missing values in the sepallength (1st column) of iris_2d

# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])
iris_2d[np.random.randint(150, size=20), np.random.randint(4, size=20)] = np.nan

# Solution
print("Number of missing values: \n", np.isnan(iris_2d[:, 0]).sum())
print("Position of missing values: \n", np.where(np.isnan(iris_2d[:, 0])))
# > Number of missing values:
# > 5
# > Position of missing values:
# > (array([ 39, 88, 99, 130, 147]),)

Filter a numpy array based on two or more conditions

Problem: Filter the rows of iris_2d that have petallength (3rd column) > 1.5 and sepallength (1st column) < 5.0.

# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])

# Solution
condition = (iris_2d[:, 2] > 1.5) & (iris_2d[:, 0] < 5.0)
iris_2d[condition]
# > array([[ 4.8, 3.4, 1.6, 0.2],
# > [ 4.8, 3.4, 1.9, 0.2],
# > [ 4.7, 3.2, 1.6, 0.2],
# > [ 4.8, 3.1, 1.6, 0.2],
# > [ 4.9, 2.4, 3.3, 1. ],
# > [ 4.9, 2.5, 4.5, 1.7]])

Drop the rows that contain a missing value from a numpy array

Problem: Select the rows of iris_2d that do not have any nan value.

# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])
iris_2d[np.random.randint(150, size=20), np.random.randint(4, size=20)] = np.nan

# Solution
# No direct numpy function for this.
# Method 1:
any_nan_in_row = np.array([~np.any(np.isnan(row)) for row in iris_2d])
iris_2d[any_nan_in_row][:5]

# Method 2: (By Rong)
iris_2d[np.sum(np.isnan(iris_2d), axis = 1) == 0][:5]
# > array([[ 4.9, 3. , 1.4, 0.2],
# > [ 4.7, 3.2, 1.3, 0.2],
# > [ 4.6, 3.1, 1.5, 0.2],
# > [ 5. , 3.6, 1.4, 0.2],
# > [ 5.4, 3.9, 1.7, 0.4]])
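
Method 1 loops over the rows in Python. The same mask can be built in a single vectorized step with np.isnan(...).any(axis=1). The sketch below uses a small made-up array rather than the iris data so that it runs without a network connection.

```python
import numpy as np

# Small synthetic array (assumption, not the iris data); rows 1 and 3 contain NaN
x = np.array([[5.1, 3.5, 1.4, 0.2],
              [4.9, np.nan, 1.4, 0.2],
              [4.7, 3.2, 1.3, 0.2],
              [np.nan, 3.1, 1.5, np.nan]])

# True for each row that has at least one NaN
row_has_nan = np.isnan(x).any(axis=1)

# Keep only the complete rows
clean = x[~row_has_nan]
print(clean)
```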

Find the correlation between two columns of a numpy array

Problem: Find the correlation between SepalLength (1st column) and PetalLength (3rd column) in iris_2d

# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])

# Solution 1
np.corrcoef(iris[:, 0], iris[:, 2])[0, 1]

# Solution 2
from scipy.stats import pearsonr
corr, p_value = pearsonr(iris[:, 0], iris[:, 2])
print(corr)

# Correlation coef indicates the degree of linear relationship between two numeric variables.
# It can range between -1 to +1.

# The p-value roughly indicates the probability of an uncorrelated system producing
# datasets that have a correlation at least as extreme as the one computed.
# The lower the p-value (<0.01), stronger is the significance of the relationship.
# It is not an indicator of the strength.
# > 0.871754157305
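
To make the coefficient less of a black box, it can also be computed directly from its definition: the sum of products of the centered values divided by the square root of the product of their sums of squares. The sketch below does this on synthetic data (the arrays x and y and the helper pearson_r are illustrative assumptions) and checks the result against np.corrcoef.

```python
import numpy as np

# Synthetic, linearly related data (assumption; stands in for two iris columns)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)

def pearson_r(u, v):
    """Pearson correlation coefficient computed from its definition."""
    du = u - u.mean()
    dv = v - v.mean()
    return (du * dv).sum() / np.sqrt((du ** 2).sum() * (dv ** 2).sum())

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```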

Find out whether a given array has any null values

Problem: Find out whether iris_2d has any missing values.

# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])

np.isnan(iris_2d).any()
# > False

Replace all missing values with 0 in a numpy array

Problem: Replace all occurrences of nan with 0 in the numpy array

# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])
iris_2d[np.random.randint(150, size=20), np.random.randint(4, size=20)] = np.nan

# Solution
iris_2d[np.isnan(iris_2d)] = 0
iris_2d[:4]
# > array([[ 5.1, 3.5, 1.4, 0. ],
# > [ 4.9, 3. , 1.4, 0.2],
# > [ 4.7, 3.2, 1.3, 0.2],
# > [ 4.6, 3.1, 1.5, 0.2]])

Find the counts of unique values in a numpy array

Problem: Find the unique values and the count of each unique value in the iris species

# Import iris keeping the text column intact
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')
names = ('sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'species')

# Solution
# Extract the species column as an array
species = np.array([row.tolist()[4] for row in iris])

# Get the unique values and the counts
np.unique(species, return_counts=True)
# > (array([b'Iris-setosa', b'Iris-versicolor', b'Iris-virginica'],
# > dtype='|S15'), array([50, 50, 50]))

Convert a numeric array to a categorical (text) array

Problem: Bin the petal length (3rd column) of iris_2d to form a text array, such that if the petal length is:

< 3 --> 'small'
3-5 --> 'medium'
>= 5 --> 'large'

# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')
names = ('sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'species')

# Bin petallength
petal_length_bin = np.digitize(iris[:, 2].astype('float'), [0, 3, 5, 10])

# Map it to respective category
label_map = {1: 'small', 2: 'medium', 3: 'large', 4: np.nan}
petal_length_cat = [label_map[x] for x in petal_length_bin]

# View
petal_length_cat[:4]
# > ['small', 'small', 'small', 'small']

Create a new column from existing columns of a numpy array

Problem: Create a new column for volume in iris_2d, where volume is (pi x petallength x sepallength^2)/3

# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='object')

# Solution
# Compute volume
sepallength = iris_2d[:, 0].astype('float')
petallength = iris_2d[:, 2].astype('float')
volume = (np.pi * petallength * (sepallength**2))/3

# Introduce new dimension to match iris_2d's
volume = volume[:, np.newaxis]

# Add the new column
out = np.hstack([iris_2d, volume])

# View
out[:4]
# > array([[b'5.1', b'3.5', b'1.4', b'0.2', b'Iris-setosa', 38.13265162927291],
# > [b'4.9', b'3.0', b'1.4', b'0.2', b'Iris-setosa', 35.200498485922445],
# > [b'4.7', b'3.2', b'1.3', b'0.2', b'Iris-setosa', 30.0723720777127],
# > [b'4.6', b'3.1', b'1.5', b'0.2', b'Iris-setosa', 33.238050274980004]], dtype=object)

Probabilistic sampling in NumPy

Problem: Randomly sample the iris species such that setosa is twice as numerous as versicolor and virginica

# Import iris keeping the text column intact
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')

# Solution
# Get the species column
species = iris[:, 4]

# Approach 1: Generate Probabilistically
np.random.seed(100)
a = np.array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])
species_out = np.random.choice(a, 150, p=[0.5, 0.25, 0.25])

# Approach 2: Probabilistic Sampling (preferred)
np.random.seed(100)
probs = np.r_[np.linspace(0, 0.500, num=50), np.linspace(0.501, .750, num=50), np.linspace(.751, 1.0, num=50)]
index = np.searchsorted(probs, np.random.random(150))
species_out = species[index]
print(np.unique(species_out, return_counts=True))

# > (array([b'Iris-setosa', b'Iris-versicolor', b'Iris-virginica'], dtype=object), array([77, 37, 36]))

Approach 2 is preferred because it creates an index variable that can be used to sample 2D tabular data.
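
As a rough illustration of that point, the same index can select whole rows of a table together with their labels. The table and labels arrays below are hypothetical stand-ins for the iris data (150 rows, 50 per class), built locally so the sketch runs offline.

```python
import numpy as np

rng = np.random.default_rng(100)

# Hypothetical stand-in for the iris table: 150 rows, 50 per class
labels = np.repeat(np.array(['setosa', 'versicolor', 'virginica']), 50)
table = np.column_stack([np.arange(150), np.arange(150) * 0.1])

# Same cumulative-probability grid as Approach 2: the first 50 rows cover
# half of the [0, 1] range, the other two classes a quarter each
probs = np.r_[np.linspace(0, 0.500, num=50),
              np.linspace(0.501, 0.750, num=50),
              np.linspace(0.751, 1.0, num=50)]
index = np.searchsorted(probs, rng.random(150))

# One index samples the rows and the matching labels together
sampled_rows = table[index]
sampled_labels = labels[index]
print(np.unique(sampled_labels, return_counts=True))
```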

Get the second largest value of an array when grouped by another array

Problem: What is the second longest petallength of the species setosa

# Import iris keeping the text column intact
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')

# Solution
# Get the species and petal length columns
petal_len_setosa = iris[iris[:, 4] == b'Iris-setosa', [2]].astype('float')

# Get the second last value
np.unique(np.sort(petal_len_setosa))[-2]
# > 1.7

Sort a 2D array by a column

Problem: Sort the iris dataset based on the sepallength column.

# Sort by column position 0: SepalLength
print(iris[iris[:,0].argsort()][:20])
# > [[b'4.3' b'3.0' b'1.1' b'0.1' b'Iris-setosa']
# > [b'4.4' b'3.2' b'1.3' b'0.2' b'Iris-setosa']
# > [b'4.4' b'3.0' b'1.3' b'0.2' b'Iris-setosa']
# > [b'4.4' b'2.9' b'1.4' b'0.2' b'Iris-setosa']
# > [b'4.5' b'2.3' b'1.3' b'0.3' b'Iris-setosa']
# > [b'4.6' b'3.6' b'1.0' b'0.2' b'Iris-setosa']
# > [b'4.6' b'3.1' b'1.5' b'0.2' b'Iris-setosa']
# > [b'4.6' b'3.4' b'1.4' b'0.3' b'Iris-setosa']
# > [b'4.6' b'3.2' b'1.4' b'0.2' b'Iris-setosa']
# > [b'4.7' b'3.2' b'1.3' b'0.2' b'Iris-setosa']
# > [b'4.7' b'3.2' b'1.6' b'0.2' b'Iris-setosa']
# > [b'4.8' b'3.0' b'1.4' b'0.1' b'Iris-setosa']
# > [b'4.8' b'3.0' b'1.4' b'0.3' b'Iris-setosa']
# > [b'4.8' b'3.4' b'1.9' b'0.2' b'Iris-setosa']
# > [b'4.8' b'3.4' b'1.6' b'0.2' b'Iris-setosa']
# > [b'4.8' b'3.1' b'1.6' b'0.2' b'Iris-setosa']
# > [b'4.9' b'2.4' b'3.3' b'1.0' b'Iris-versicolor']
# > [b'4.9' b'2.5' b'4.5' b'1.7' b'Iris-virginica']
# > [b'4.9' b'3.1' b'1.5' b'0.1' b'Iris-setosa']
# > [b'4.9' b'3.1' b'1.5' b'0.1' b'Iris-setosa']]

Find the most frequent value in a NumPy array

Problem: Find the most frequent value of petal length (3rd column) in the iris dataset.

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')

# Solution:
vals, counts = np.unique(iris[:, 2], return_counts=True)
print(vals[np.argmax(counts)])
# > b'1.5'

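Find the position of the first occurrence of a value greater than a given value

Problem: Find the position of the first occurrence of a value greater than 1.0 in the petalwidth (4th column) of the iris dataset.
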
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')

# Solution: (edit: changed argmax to argwhere. Thanks Rong!)
np.argwhere(iris[:, 3].astype(float) > 1.0)[0]
# > 50

Replace all values greater than a given value with a given cutoff

Problem: In array a, replace all values greater than 30 with 30 and all values less than 10 with 10.

# Input
np.set_printoptions(precision=2)
np.random.seed(100)
a = np.random.uniform(1,50, 20)

# Solution 1: Using np.clip
np.clip(a, a_min=10, a_max=30)

# Solution 2: Using np.where
print(np.where(a < 10, 10, np.where(a > 30, 30, a)))
# > [ 27.63 14.64 21.8 30. 10. 10. 30. 30. 10. 29.18 30.
# > 11.25 10.08 10. 11.77 30. 30. 10. 30. 14.43]

Get the positions of the top n values in a numpy array

Problem: Get the positions of the top 5 maximum values in the given array a.

# Input
np.random.seed(100)
a = np.random.uniform(1,50,20)

# Solution:
print(a.argsort()[-5:])
# > [18 7 3 10 15]

# Solution 2:
np.argpartition(-a, 5)[:5]
# > [15 10 3 7 18]

# Below methods will get you the values.
# Method 1:
a[a.argsort()][-5:]

# Method 2:
np.sort(a)[-5:]

# Method 3:
np.partition(a, kth=-5)[-5:]

# Method 4:
a[np.argpartition(-a, 5)][:5]

Compute row-wise counts of all possible values in an array

Problem: Compute the counts of unique values row-wise.

np.random.seed(100)
arr = np.random.randint(1,11,size=(6, 10))
arr
# > array([[ 9, 9, 4, 8, 8, 1, 5, 3, 6, 3],
# > [ 3, 3, 2, 1, 9, 5, 1, 10, 7, 3],
# > [ 5, 2, 6, 4, 5, 5, 4, 8, 2, 2],
# > [ 8, 8, 1, 3, 10, 10, 4, 3, 6, 9],
# > [ 2, 1, 8, 7, 3, 1, 9, 3, 6, 2],
# > [ 9, 2, 6, 5, 3, 9, 4, 6, 1, 10]])
# Solution
def counts_of_all_values_rowwise(arr2d):
    # Unique values and their counts, row wise
    num_counts_array = [np.unique(row, return_counts=True) for row in arr2d]

    # Counts of all values row wise
    return [[int(b[a == i][0]) if i in a else 0 for i in np.unique(arr2d)] for a, b in num_counts_array]

# Print
print(np.arange(1,11))
counts_of_all_values_rowwise(arr)
# > [ 1 2 3 4 5 6 7 8 9 10]

# > [[1, 0, 2, 1, 1, 1, 0, 2, 2, 0],
# > [2, 1, 3, 0, 1, 0, 1, 0, 1, 1],
# > [0, 3, 0, 2, 3, 1, 0, 1, 0, 0],
# > [1, 0, 2, 1, 0, 1, 0, 2, 1, 2],
# > [2, 2, 2, 0, 0, 1, 1, 1, 1, 0],
# > [1, 1, 1, 1, 1, 2, 0, 0, 2, 1]]
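# The output contains 10 columns representing the numbers 1 to 10. The values are the counts of each number in the respective row. For example, cell (0, 2) has the value 2, which means the number 3 occurs exactly 2 times in the first row.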

# Example 2:
arr = np.array([np.array(list('bill clinton')), np.array(list('narendramodi')), np.array(list('jjayalalitha'))])
print(np.unique(arr))
counts_of_all_values_rowwise(arr)
# > [' ' 'a' 'b' 'c' 'd' 'e' 'h' 'i' 'j' 'l' 'm' 'n' 'o' 'r' 't' 'y']

# > [[1, 0, 1, 1, 0, 0, 0, 2, 0, 3, 0, 2, 1, 0, 1, 0],
# > [0, 2, 0, 0, 2, 1, 0, 1, 0, 0, 1, 2, 1, 2, 0, 0],
# > [0, 4, 0, 0, 0, 0, 1, 1, 2, 2, 0, 0, 0, 0, 1, 1]]
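
For arrays of moderate size the same table can be produced without Python loops by comparing every element against every unique value through broadcasting. This is an alternative sketch rather than the solution above, shown on a small made-up array. It builds a temporary boolean array of shape (rows, cols, n_values), so memory use grows with the number of distinct values.

```python
import numpy as np

# Small illustrative array (assumption)
arr = np.array([[1, 3, 3, 2],
                [2, 2, 2, 4],
                [4, 1, 1, 1]])

values = np.unique(arr)  # [1 2 3 4]

# (rows, cols, 1) == (n_values,) broadcasts to (rows, cols, n_values);
# summing over the column axis counts the matches per row and value
counts = (arr[:, :, np.newaxis] == values).sum(axis=1)

print(values)
print(counts)
# [[1 1 2 0]
#  [0 3 0 1]
#  [3 0 0 1]]
```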

Convert an array of arrays into a flat 1D array

Problem: Convert array_of_arrays into a flat linear 1D array.

arr1 = np.arange(3)
arr2 = np.arange(3,7)
arr3 = np.arange(7,10)

array_of_arrays = np.array([arr1, arr2, arr3], dtype=object)  # ragged input needs dtype=object in recent NumPy
print('array_of_arrays: ', array_of_arrays)

# Solution 1
arr_2d = np.array([a for arr in array_of_arrays for a in arr])

# Solution 2:
arr_2d = np.concatenate(array_of_arrays)
print(arr_2d)
# > array_of_arrays: [array([0, 1, 2]) array([3, 4, 5, 6]) array([7, 8, 9])]
# > [0 1 2 3 4 5 6 7 8 9]