Python in Scientific Research 05: Advanced Data Analysis with NumPy

Posted on 2024-04-09 Edited on 2024-04-09 In Programming language

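NumPy, short for "Numerical Python", is a core numerical library for Python that focuses on efficient handling of multidimensional arrays and matrices. It plays a central role in data analysis, offering tools for mathematical operations, linear algebra, and statistics. Its high-performance array processing lets users work with large datasets, whether for numerical computation, data transformation, or data cleaning. Its concise, intuitive API makes data analysis and scientific computing simpler and more efficient. In data science, machine learning, and scientific computing, NumPy is an indispensable foundation that helps researchers and engineers implement complex data processing and analysis tasks quickly.

This lesson continues the week-five lesson. It moves beyond basic NumPy usage and teaches more advanced techniques through a series of concrete problems.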

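Import a dataset with numbers and text, keeping the text intact in a NumPy array

Problem: Import the iris dataset, keeping the text intact.

# Solution
import numpy as np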
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')
names = ('sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'species')

# Print the first 3 rows
iris[:3]
# > array([[b'5.1', b'3.5', b'1.4', b'0.2', b'Iris-setosa'],
# >        [b'4.9', b'3.0', b'1.4', b'0.2', b'Iris-setosa'],
# >        [b'4.7', b'3.2', b'1.3', b'0.2', b'Iris-setosa']], dtype=object)

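Specifically, dtype='object' tells NumPy to store each element as a reference to an arbitrary Python object rather than as a fixed-size numeric value. This lets a single array hold values of different kinds, such as the numeric fields and the species text here (both loaded as byte strings, hence the b prefix). This is different from a NumPy dtype object such as np.dtype('<f8'), which describes the type, size, and byte order of the array elements.

Note that dtype='object' makes array operations slower, because each element is processed by the Python interpreter rather than by NumPy's optimized routines. Use it only when necessary; otherwise prefer the predefined data types for efficiency.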
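
As a minimal sketch of the point above, the following example builds a small object array by hand (the two rows are made up to resemble iris records and are not read from the file) and converts its numeric columns to a native float array so that vectorized arithmetic applies.

```python
import numpy as np

# Hypothetical rows shaped like iris records: four numeric fields and a label
rows = np.array([[b'5.1', b'3.5', b'1.4', b'0.2', b'Iris-setosa'],
                 [b'4.9', b'3.0', b'1.4', b'0.2', b'Iris-setosa']], dtype=object)

# Every element is an ordinary Python bytes object, not a number
first_type = type(rows[0, 0])

# Convert the numeric columns to float64 before doing any math
numeric = rows[:, :4].astype(float)
col_means = numeric.mean(axis=0)
print(first_type, numeric.dtype, col_means)
```

Keeping the text column in a separate array and the numbers in a float array is usually the more efficient layout.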

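Extract a particular column from a 1D array of tuples

Problem: Extract the text column species from the 1D iris dataset imported as in the previous problem.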

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_1d = np.genfromtxt(url, delimiter=',', dtype=None)
print(iris_1d.shape)

# Solution:
species = np.array([row[4] for row in iris_1d])
species[:5]
# > array([b'Iris-setosa', b'Iris-setosa', b'Iris-setosa', b'Iris-setosa',
# >        b'Iris-setosa'],
# >       dtype='|S15')

Convert a 1D array of tuples to a 2D NumPy array

Problem: Convert the 1D iris dataset to a 2D array iris_2d by omitting the species text field.

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_1d = np.genfromtxt(url, delimiter=',', dtype=None)

# Solution:
# Import only the first 4 columns from source url
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])
iris_2d[:4]
# > array([[ 5.1,  3.5,  1.4,  0.2],
# >        [ 4.9,  3. ,  1.4,  0.2],
# >        [ 4.7,  3.2,  1.3,  0.2],
# >        [ 4.6,  3.1,  1.5,  0.2]])

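Compute the mean, median, and standard deviation of a NumPy array

Problem: Find the mean, median, and standard deviation of the iris sepal length (1st column).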

# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
sepallength = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0])

# Solution
mu, med, sd = np.mean(sepallength), np.median(sepallength), np.std(sepallength)
print(mu, med, sd)
# > 5.84333333333 5.8 0.825301291785

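numpy.mean(a, axis=None, dtype=None, out=None, keepdims=<no value>, *, where=<no value>) # Compute the arithmetic mean along the specified axis.

a: array_like. Array containing the numbers whose mean is desired. If a is not an array, a conversion is attempted.
axis: None, int, or tuple of ints, optional. Axis or axes along which the means are computed. The default is to compute the mean of the flattened array.
dtype: data-type, optional. Type to use in computing the mean. For integer inputs the default is float64; for floating-point inputs it is the same as the input dtype.
out: ndarray, optional. Alternate output array in which to place the result. The default is None; if provided, it must have the same shape as the expected output, but the type will be cast if necessary.
keepdims: bool, optional. If True, the reduced axes are kept in the result as dimensions of size 1, so the result broadcasts correctly against the input array. If the default value is passed, keepdims is not passed through to the mean method of ndarray subclasses, but any non-default value is. If the subclass method does not implement keepdims, an exception is raised.
where: array_like of bool, optional. Elements to include in the mean.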

a = np.array([[1, 2],
              [3, 4]])
np.mean(a)
# > 2.5
np.mean(a, axis=0)
# > array([2., 3.])
np.mean(a, axis=1)
# > array([1.5, 3.5])

a = np.array([[5, 9, 13], [14, 10, 12], [11, 15, 19]])
np.mean(a)
# > 12.0
np.mean(a, where=[[True], [False], [False]])
# > 9.0
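
The keepdims parameter is easiest to appreciate when centering data. The sketch below uses a small made-up matrix (not the iris data) and subtracts each row's mean; with keepdims=True the means have shape (2, 1) and broadcast against the rows directly.

```python
import numpy as np

a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 9.0]])

# Shape (2, 1): one mean per row, kept as a column
row_means = np.mean(a, axis=1, keepdims=True)

# Broadcasting subtracts each row's own mean from that row
centered = a - row_means

# Without keepdims the result has shape (2,) and would need reshaping first
flat_means = np.mean(a, axis=1)
print(row_means.shape, flat_means.shape)
```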

Find the percentiles of a NumPy array

Problem: Find the 5th and 95th percentiles of the iris sepal length.

# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
sepallength = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0])

# Solution
np.percentile(sepallength, q=[5, 95])
# > array([ 4.6  ,  7.255])

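numpy.percentile(a, q, axis=None, out=None, overwrite_input=False, method='linear', keepdims=False, *, interpolation=None) # Compute the q-th percentile of the data along the specified axis.

a: array_like of real numbers. Input array or object that can be converted to an array.
q: array_like of float. Percentile or sequence of percentiles to compute. Values must be between 0 and 100 inclusive.
axis: {int, tuple of int, None}, optional. Axis or axes along which the percentiles are computed. The default is to compute the percentiles along a flattened version of the array.
overwrite_input: bool, optional. If True, the input array a may be modified by intermediate calculations to save memory.
method: str, optional. Method used to estimate the percentile. Many methods exist, some of them unique to NumPy; see the notes in the NumPy documentation for an explanation. The options include 'inverted_cdf', 'averaged_inverted_cdf', 'closest_observation', 'interpolated_inverted_cdf', 'hazen', 'weibull', 'linear' (default), 'median_unbiased', and 'normal_unbiased'.
keepdims: bool, optional. If True, the reduced axes are kept in the result as dimensions of size 1, so the result broadcasts correctly against the original array.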
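
To make the q and axis parameters concrete, here is a small sketch on a made-up 2 x 5 array (not the iris data). Passing a list for q adds a leading dimension with one entry per requested percentile.

```python
import numpy as np

data = np.array([[1.0, 2.0, 3.0, 4.0, 5.0],
                 [10.0, 20.0, 30.0, 40.0, 50.0]])

# Quartiles of each row; result shape is (len(q), number of rows) = (3, 2)
per_row = np.percentile(data, q=[25, 50, 75], axis=1)

# Median of each row, kept as a column of shape (2, 1)
row_median = np.percentile(data, 50, axis=1, keepdims=True)
print(per_row)
print(row_median)
```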

Check whether a given array has any null values

Problem: Find out whether iris_2d has any missing values.

# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])

np.isnan(iris_2d).any()
# > False

Insert values at random positions in an array

Problem: Insert np.nan values at 20 random positions in the iris_2d dataset.

# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='object')

# Method 1
# i, j contain the row and column indices of every element of iris_2d
i, j = np.where(iris_2d)
np.random.seed(100)
iris_2d[np.random.choice((i), 20), np.random.choice((j), 20)] = np.nan

# Method 2
np.random.seed(100)
iris_2d[np.random.randint(150, size=20), np.random.randint(4, size=20)] = np.nan

# Print first 10 rows
print(iris_2d[:10])
# > [[b'5.1' b'3.5' b'1.4' b'0.2' b'Iris-setosa']
# >  [b'4.9' b'3.0' b'1.4' b'0.2' b'Iris-setosa']
# >  [b'4.7' b'3.2' b'1.3' b'0.2' b'Iris-setosa']
# >  [b'4.6' b'3.1' b'1.5' b'0.2' b'Iris-setosa']
# >  [b'5.0' b'3.6' b'1.4' b'0.2' b'Iris-setosa']
# >  [b'5.4' b'3.9' b'1.7' b'0.4' b'Iris-setosa']
# >  [b'4.6' b'3.4' b'1.4' b'0.3' b'Iris-setosa']
# >  [b'5.0' b'3.4' b'1.5' b'0.2' b'Iris-setosa']
# >  [b'4.4' nan b'1.4' b'0.2' b'Iris-setosa']
# >  [b'4.9' b'3.1' b'1.5' b'0.1' b'Iris-setosa']]

Find the positions of missing values in a NumPy array

Problem: Find the number and positions of missing values in the sepallength (1st column) of iris_2d.

# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])
iris_2d[np.random.randint(150, size=20), np.random.randint(4, size=20)] = np.nan

# Solution
print("Number of missing values: \n", np.isnan(iris_2d[:, 0]).sum())
print("Position of missing values: \n", np.where(np.isnan(iris_2d[:, 0])))
# > Number of missing values: 
# >  5
# > Position of missing values: 
# >  (array([ 39,  88,  99, 130, 147]),)
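
Because the positions above depend on an unseeded random draw, a deterministic sketch may help. The array below is made up for illustration; it shows checking for, counting, and locating nan values, and why an equality test cannot be used for this.

```python
import numpy as np

x = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 8.0, 9.0]])

has_nan = np.isnan(x).any()               # True if at least one nan exists
nan_per_column = np.isnan(x).sum(axis=0)  # count of nan in each column
rows, cols = np.where(np.isnan(x))        # row and column index of each nan

# nan never compares equal to anything, itself included,
# so an equality test finds no missing values at all
equality_finds_nan = (x == np.nan).any()
print(has_nan, nan_per_column, rows, cols, equality_finds_nan)
```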

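Filter a NumPy array based on two or more conditions

Problem: Filter the rows of iris_2d where petallength (3rd column) > 1.5 and sepallength (1st column) < 5.0.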

# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])

# Solution
condition = (iris_2d[:, 2] > 1.5) & (iris_2d[:, 0] < 5.0)
iris_2d[condition]
# > array([[ 4.8,  3.4,  1.6,  0.2],
# >        [ 4.8,  3.4,  1.9,  0.2],
# >        [ 4.7,  3.2,  1.6,  0.2],
# >        [ 4.8,  3.1,  1.6,  0.2],
# >        [ 4.9,  2.4,  3.3,  1. ],
# >        [ 4.9,  2.5,  4.5,  1.7]])
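
The same pattern extends to "or" and "not" conditions. The sketch below uses four made-up rows in the iris column layout; note that each comparison needs its own parentheses, because & and | bind more tightly than > and <.

```python
import numpy as np

data = np.array([[4.8, 3.4, 1.6, 0.2],
                 [5.1, 3.5, 1.4, 0.2],
                 [4.9, 2.4, 3.3, 1.0],
                 [6.3, 3.3, 6.0, 2.5]])

long_petal = data[:, 2] > 1.5
short_sepal = data[:, 0] < 5.0

both = data[long_petal & short_sepal]        # rows meeting both conditions
either = data[long_petal | short_sepal]      # rows meeting at least one
neither = data[~(long_petal | short_sepal)]  # rows meeting none
print(both.shape, either.shape, neither.shape)
```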

Drop rows that contain a missing value from a NumPy array

Problem: Select the rows of iris_2d that do not have any nan value.

# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])
iris_2d[np.random.randint(150, size=20), np.random.randint(4, size=20)] = np.nan

# Solution
# No direct numpy function for this.
# Method 1:
no_nan_in_row = np.array([~np.any(np.isnan(row)) for row in iris_2d])
iris_2d[no_nan_in_row][:5]

# Method 2: (By Rong)
iris_2d[np.sum(np.isnan(iris_2d), axis = 1) == 0][:5]
# > array([[ 4.9,  3. ,  1.4,  0.2],
# >        [ 4.7,  3.2,  1.3,  0.2],
# >        [ 4.6,  3.1,  1.5,  0.2],
# >        [ 5. ,  3.6,  1.4,  0.2],
# >        [ 5.4,  3.9,  1.7,  0.4]])
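
A fully vectorized variant of the same idea, shown here on a small made-up array, builds the row mask with any(axis=1) instead of a Python loop. This is a sketch of one common idiom, not the only way to do it.

```python
import numpy as np

x = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, 6.0],
              [7.0, np.nan]])

# True for rows in which no element is nan
keep = ~np.isnan(x).any(axis=1)
clean = x[keep]
print(keep, clean)
```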

Find the correlation between two columns of a NumPy array

Problem: Find the correlation between SepalLength (1st column) and PetalLength (3rd column) in iris_2d.

# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])

# Solution 1
np.corrcoef(iris[:, 0], iris[:, 2])[0, 1]

# Solution 2
from scipy.stats import pearsonr
corr, p_value = pearsonr(iris[:, 0], iris[:, 2])
print(corr)

# Correlation coef indicates the degree of linear relationship between two numeric variables.
# It ranges from -1 to +1.

# The p-value roughly indicates the probability of an uncorrelated system producing
# datasets that have a correlation at least as extreme as the one computed.
# The lower the p-value (<0.01), the stronger the significance of the relationship.
# The p-value is not an indicator of the strength of the correlation.
# > 0.871754157305

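numpy.corrcoef(x, y=None, rowvar=True, bias=<no value>, ddof=<no value>, *, dtype=None) # Return Pearson product-moment correlation coefficients.

The correlation coefficient $r$ is a unitless value between -1 and 1. Statistical significance is expressed as a $p$-value.

The closer $r$ is to 0, the weaker the linear relationship.
A positive $r$ indicates a positive correlation: the values of the two variables tend to increase together.
A negative $r$ indicates a negative correlation: when one variable increases, the other tends to decrease.
The values 1 and -1 both represent "perfect" correlation, positive and negative respectively. Two perfectly correlated variables change together at a fixed rate. We say they have a linear relationship; when plotted on a scatter plot, all data points lie on a single straight line.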
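
The properties of $r$ listed above can be checked numerically. The sketch below defines a hypothetical helper, pearson_r, that computes the standard Pearson formula by hand and compares it with np.corrcoef on made-up data.

```python
import numpy as np

def pearson_r(u, v):
    """Pearson correlation computed from centered values (illustrative helper)."""
    u_c = u - u.mean()
    v_c = v - v.mean()
    return np.sum(u_c * v_c) / np.sqrt(np.sum(u_c**2) * np.sum(v_c**2))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pos = 2.0 * x + 1.0                           # exact increasing line
y_neg = -0.5 * x + 3.0                          # exact decreasing line
y_noisy = np.array([1.2, 1.9, 3.4, 3.8, 5.3])   # roughly linear, with scatter

r_pos = pearson_r(x, y_pos)
r_neg = pearson_r(x, y_neg)
r_noisy = pearson_r(x, y_noisy)
r_numpy = np.corrcoef(x, y_noisy)[0, 1]
print(r_pos, r_neg, r_noisy, r_numpy)
```

The two exact lines give values at the ends of the range, while the scattered data gives a value strictly inside it.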

Replace all missing values with 0 in a NumPy array

Problem: Replace all occurrences of nan with 0 in a NumPy array.

# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])
iris_2d[np.random.randint(150, size=20), np.random.randint(4, size=20)] = np.nan

# Solution
iris_2d[np.isnan(iris_2d)] = 0
iris_2d[:4]
# > array([[ 5.1,  3.5,  1.4,  0. ],
# >        [ 4.9,  3. ,  1.4,  0.2],
# >        [ 4.7,  3.2,  1.3,  0.2],
# >        [ 4.6,  3.1,  1.5,  0.2]])
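
The boolean-mask assignment above modifies the array in place. When the original must be preserved, one option is to build a filled copy instead; the sketch below shows two ways on a made-up array.

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0, np.nan])

# Both calls return a new array and leave x untouched
filled_where = np.where(np.isnan(x), 0.0, x)
filled_num = np.nan_to_num(x, nan=0.0)

remaining_nan_in_x = int(np.isnan(x).sum())
print(filled_where, filled_num, remaining_nan_in_x)
```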

Find the count of unique values in a NumPy array

Problem: Find the unique values and the count of each unique value in the iris species.

# Import iris keeping the text column intact
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')
names = ('sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'species')

# Solution
# Extract the species column as an array
species = np.array([row.tolist()[4] for row in iris])

# Get the unique values and the counts
np.unique(species, return_counts=True)
# > (array([b'Iris-setosa', b'Iris-versicolor', b'Iris-virginica'],
# >        dtype='|S15'), array([50, 50, 50]))

numpy.unique(ar, return_index=False, return_inverse=False, return_counts=False, axis=None, *, equal_nan=True) # Find the unique elements of an array.

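numpy.unique() returns the sorted unique elements of an array. Besides the unique elements, there are three optional outputs:

the indices of the input array that give the unique values
the indices of the unique array that reconstruct the input array
the number of times each unique value occurs in the input array

Parameters:

ar: array_like. Input array. Unless axis is specified, it is flattened if it is not already 1-D.
return_index: bool, optional. If True, also return the indices of ar that give the unique array.
return_inverse: bool, optional. If True, also return the indices of the unique array that can be used to reconstruct ar.
return_counts: bool, optional. If True, also return the number of times each unique item appears in ar.
axis: int or None, optional. The axis to operate on. If None, ar is flattened. If an integer, the subarrays indexed by the given axis are flattened and treated as the elements of a 1-D array with the dimension of the given axis.
equal_nan: bool, optional. If True, multiple NaN values in the returned array are collapsed into one.

Returns:

unique: ndarray. The sorted unique values.
unique_indices: ndarray, optional. The indices of the first occurrences of the unique values in the original array. Only provided if return_index is True.
unique_inverse: ndarray, optional. The indices to reconstruct the original array from the unique array. Only provided if return_inverse is True.
unique_counts: ndarray, optional. The number of times each unique value occurs in the original array. Only provided if return_counts is True.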

np.unique([1, 1, 2, 2, 3, 3])
# > array([1, 2, 3])
a = np.array([[1, 1], [2, 3]])
np.unique(a)
# > array([1, 2, 3])

# Return the indices of the original array that give the unique values
a = np.array(['a', 'b', 'b', 'c', 'a'])
u, indices = np.unique(a, return_index=True)
u
# > array(['a', 'b', 'c'], dtype='<U1')
indices
# > array([0, 1, 3])
a[indices]
# > array(['a', 'b', 'c'], dtype='<U1')

# Reconstruct the input array from the unique values and inverse:
a = np.array([1, 2, 6, 4, 2, 3, 2])
u, indices = np.unique(a, return_inverse=True)
u
# > array([1, 2, 3, 4, 6])
indices
# > array([0, 1, 4, 3, 1, 2, 1])
u[indices]
# > array([1, 2, 6, 4, 2, 3, 2])

# Reconstruct the input values from the unique values and counts:
a = np.array([1, 2, 6, 4, 2, 3, 2])
values, counts = np.unique(a, return_counts=True)
values
# > array([1, 2, 3, 4, 6])
counts
# > array([1, 3, 1, 1, 1])
np.repeat(values, counts)
# > array([1, 2, 2, 2, 3, 4, 6])    # original order not preserved
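
The axis parameter described above is what makes np.unique work on whole rows rather than on individual elements. A minimal sketch on a made-up 2D array:

```python
import numpy as np

a = np.array([[1, 0, 0],
              [1, 0, 0],
              [2, 3, 4]])

# With axis=0 each row is treated as one item
unique_rows, row_counts = np.unique(a, axis=0, return_counts=True)
print(unique_rows)
print(row_counts)
```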

Find the most frequent value in a NumPy array

Problem: Find the most frequent petal length (3rd column) in the iris dataset.

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')

# Solution:
vals, counts = np.unique(iris[:, 2], return_counts=True)
print(vals[np.argmax(counts)])
# > b'1.5'
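
One detail worth knowing about this approach: np.argmax returns only the first maximum, so when several values tie for most frequent, only the smallest of them is reported. The sketch below, on made-up arrays, shows how to recover all tied values.

```python
import numpy as np

vals = np.array([1.5, 1.4, 1.5, 1.3, 1.4, 1.5])
uniq, counts = np.unique(vals, return_counts=True)
mode = uniq[np.argmax(counts)]            # single most frequent value

tied = np.array([2, 1, 2, 1])
t_uniq, t_counts = np.unique(tied, return_counts=True)
first_mode = t_uniq[np.argmax(t_counts)]           # only the first of the tied values
all_modes = t_uniq[t_counts == t_counts.max()]     # every value with the top count
print(mode, first_mode, all_modes)
```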

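Convert a numeric array to a categorical (text) array

Problem: Bin the petal length (3rd column) of iris_2d to form a text array, such that if the petal length is:

< 3 --> 'small'
3 to < 5 --> 'medium'
>= 5 --> 'large'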

# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')
names = ('sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'species')

# Bin petallength 
petal_length_bin = np.digitize(iris[:, 2].astype('float'), [0, 3, 5, 10])

# Map it to respective category
label_map = {1: 'small', 2: 'medium', 3: 'large', 4: np.nan}
petal_length_cat = [label_map[x] for x in petal_length_bin]

# View
petal_length_cat[:4]
# > ['small', 'small', 'small', 'small']
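
The behavior at the bin edges is easy to get wrong, so here is a small sketch with made-up lengths chosen to sit on or near the edges. With its default right=False, np.digitize assigns a value x to bin i when bins[i-1] <= x < bins[i]. The label lookup assumes every value lies in [0, 10).

```python
import numpy as np

lengths = np.array([1.4, 3.0, 4.9, 5.0, 6.9])
edges = [0, 3, 5, 10]

# Bin numbers start at 1 for values in [0, 3)
bin_ids = np.digitize(lengths, edges)

# Map bin numbers 1, 2, 3 to labels by indexing an array of names
names = np.array(['small', 'medium', 'large'])
labels = names[bin_ids - 1]
print(bin_ids, labels)
```

A value exactly equal to 3 lands in 'medium' and a value exactly equal to 5 lands in 'large'.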

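Create a new column from existing columns of a NumPy array

Problem: Create a new column in iris_2d whose values are computed from other columns. Here the new column is volume = (pi x petallength x sepallength^2) / 3.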

# Input
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='object')

# Solution
# Compute volume
sepallength = iris_2d[:, 0].astype('float')
petallength = iris_2d[:, 2].astype('float')
volume = (np.pi * petallength * (sepallength**2))/3

# Introduce new dimension to match iris_2d's
volume = volume[:, np.newaxis]

# Add the new column
out = np.hstack([iris_2d, volume])

# View
out[:4]
# > array([[b'5.1', b'3.5', b'1.4', b'0.2', b'Iris-setosa', 38.13265162927291],
# >        [b'4.9', b'3.0', b'1.4', b'0.2', b'Iris-setosa', 35.200498485922445],
# >        [b'4.7', b'3.2', b'1.3', b'0.2', b'Iris-setosa', 30.0723720777127],
# >        [b'4.6', b'3.1', b'1.5', b'0.2', b'Iris-setosa', 33.238050274980004]], dtype=object)
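
When the text column is not needed, the same computation can stay entirely in float arrays, which avoids the object dtype in the result. A minimal sketch using two made-up rows in the iris column layout and np.column_stack, which accepts the 1D volume array without adding a new axis first:

```python
import numpy as np

data = np.array([[5.1, 3.5, 1.4, 0.2],
                 [4.9, 3.0, 1.4, 0.2]])

# Same volume formula as above, on float columns
volume = np.pi * data[:, 2] * data[:, 0] ** 2 / 3

# column_stack treats the 1D array as a column
out = np.column_stack([data, volume])
print(out.shape, out.dtype)
```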

Do probabilistic sampling in NumPy

Problem: Randomly sample 150 rows of the iris species such that 'Iris-setosa' is twice as likely as 'Iris-versicolor' and 'Iris-virginica'.

# Import iris keeping the text column intact
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')

# Solution
# Get the species column
species = iris[:, 4]

# Approach 1: Generate Probabilistically
np.random.seed(100)
a = np.array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])
species_out = np.random.choice(a, 150, p=[0.5, 0.25, 0.25])

# Approach 2: Probabilistic Sampling (preferred)
np.random.seed(100)
probs = np.r_[np.linspace(0, 0.500, num=50), np.linspace(0.501, .750, num=50), np.linspace(.751, 1.0, num=50)]
index = np.searchsorted(probs, np.random.random(150))
species_out = species[index]
print(np.unique(species_out, return_counts=True))

# > (array([b'Iris-setosa', b'Iris-versicolor', b'Iris-virginica'], dtype=object), array([77, 37, 36]))

Approach 2 is preferred because it creates an index variable that can be used to sample rows of 2D tabular data.
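
As an alternative sketch of the index-based idea, row indices can also be drawn directly with per-row weights. This swaps the searchsorted technique for Generator.choice from NumPy's newer random API; the labels array below is a made-up stand-in for the species column.

```python
import numpy as np

rng = np.random.default_rng(100)

# Stand-in for the species column: 50 rows per species
labels = np.repeat(np.array(['setosa', 'versicolor', 'virginica']), 50)

# Give each setosa row twice the weight of the other rows, then normalize
weights = np.where(labels == 'setosa', 2.0, 1.0)
p = weights / weights.sum()

# Row indices, reusable to sample any array with the same number of rows
index = rng.choice(len(labels), size=150, p=p)
sample = labels[index]
print(np.unique(sample, return_counts=True))
```

With these weights the total probability of drawing a setosa row is 0.5, against 0.25 for each of the other two species.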

Get the second largest value of an array when grouped by another array

Problem: Find the second longest petallength of the species setosa.

# Import iris keeping the text column intact
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')

# Solution
# Get the species and petal length columns
petal_len_setosa = iris[iris[:, 4] == b'Iris-setosa', [2]].astype('float')

# Get the second largest value
np.unique(np.sort(petal_len_setosa))[-2]
# > 1.7
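
The call to np.unique matters here because the largest value may occur more than once. The sketch below, on made-up lengths, contrasts the two results; since np.unique already returns sorted values, it can be applied directly.

```python
import numpy as np

vals = np.array([1.4, 1.9, 1.7, 1.9, 1.5])

# Second largest distinct value
second_distinct = np.unique(vals)[-2]

# Second-to-last element after sorting, which repeats the maximum here
second_sorted = np.sort(vals)[-2]
print(second_distinct, second_sorted)
```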

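numpy.sort(a, axis=-1, kind=None, order=None) # Return a sorted copy of an array.

a: array_like. Array to be sorted.
axis: int or None, optional. Axis along which to sort. If None, the array is flattened before sorting. The default is -1, which sorts along the last axis.
kind: {'quicksort', 'mergesort', 'heapsort', 'stable'}, optional. Sorting algorithm. The default is 'quicksort'.
order: str or list of str, optional. When a is an array with fields defined, this argument specifies which fields to compare first, second, and so on. A single field can be given as a string, and not all fields need to be specified; unspecified fields are still used, in the order in which they appear in the dtype, to break ties.
sorted_array: ndarray, return value. Array of the same type and shape as a.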

a = np.array([[1,4],[3,1]])
np.sort(a)                # sort along the last axis
# > array([[1, 4],
# >        [1, 3]])
np.sort(a, axis=None)     # sort the flattened array
# > array([1, 1, 3, 4])
np.sort(a, axis=0)        # sort along the first axis
# > array([[1, 1],
# >        [3, 4]])

Sort a 2D array by a column

Problem: Sort the iris dataset by the sepallength column.

# Sort by column position 0: SepalLength
print(iris[iris[:,0].argsort()][:20])
# > [[b'4.3' b'3.0' b'1.1' b'0.1' b'Iris-setosa']
# >  [b'4.4' b'3.2' b'1.3' b'0.2' b'Iris-setosa']
# >  [b'4.4' b'3.0' b'1.3' b'0.2' b'Iris-setosa']
# >  [b'4.4' b'2.9' b'1.4' b'0.2' b'Iris-setosa']
# >  [b'4.5' b'2.3' b'1.3' b'0.3' b'Iris-setosa']
# >  [b'4.6' b'3.6' b'1.0' b'0.2' b'Iris-setosa']
# >  [b'4.6' b'3.1' b'1.5' b'0.2' b'Iris-setosa']
# >  [b'4.6' b'3.4' b'1.4' b'0.3' b'Iris-setosa']
# >  [b'4.6' b'3.2' b'1.4' b'0.2' b'Iris-setosa']
# >  [b'4.7' b'3.2' b'1.3' b'0.2' b'Iris-setosa']
# >  [b'4.7' b'3.2' b'1.6' b'0.2' b'Iris-setosa']
# >  [b'4.8' b'3.0' b'1.4' b'0.1' b'Iris-setosa']
# >  [b'4.8' b'3.0' b'1.4' b'0.3' b'Iris-setosa']
# >  [b'4.8' b'3.4' b'1.9' b'0.2' b'Iris-setosa']
# >  [b'4.8' b'3.4' b'1.6' b'0.2' b'Iris-setosa']
# >  [b'4.8' b'3.1' b'1.6' b'0.2' b'Iris-setosa']
# >  [b'4.9' b'2.4' b'3.3' b'1.0' b'Iris-versicolor']
# >  [b'4.9' b'2.5' b'4.5' b'1.7' b'Iris-virginica']
# >  [b'4.9' b'3.1' b'1.5' b'0.1' b'Iris-setosa']
# >  [b'4.9' b'3.1' b'1.5' b'0.1' b'Iris-setosa']]

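numpy.argsort(a, axis=-1, kind=None, order=None) # Returns the indices that would sort an array.

numpy.argsort() works almost exactly like np.sort(); the difference is that np.sort() returns the sorted values, while np.argsort() returns the indices that would sort the array.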

x = np.array([3, 1, 2])
np.argsort(x)
# > array([1, 2, 0])

# Two-dimensional array:
x = np.array([[0, 3], [2, 2]])
x
# > array([[0, 3],
# >        [2, 2]])
ind = np.argsort(x, axis=0)  # sorts along first axis (down)
ind
# > array([[0, 1],
# >        [1, 0]])
ind = np.argsort(x, axis=1)  # sorts along last axis (across)
ind
# > array([[0, 1],
# >        [0, 1]])
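
One caveat when sorting an object array of byte strings, as in the iris example above: the comparison is lexicographic, not numeric. It gives the expected order there only because every value has the same "d.d" format. The sketch below uses made-up values to show the difference and the usual remedy of converting to float before calling argsort.

```python
import numpy as np

col = np.array([b'10.0', b'4.3', b'9.5'], dtype=object)

# Byte strings compare character by character, so b'10.0' sorts first
lexical_order = col.argsort()

# Converting to float first gives the numeric order
numeric_order = col.astype(float).argsort()
print(lexical_order, numeric_order)
```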

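Find the position of the first occurrence of a value greater than a given value

Problem: Find the position of the first occurrence of a value greater than 1.0 in the petalwidth (4th column) of the iris dataset.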

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')

# Solution: (edit: changed argmax to argwhere. Thanks Rong!)
np.argwhere(iris[:, 3].astype(float) > 1.0)[0]
# > array([50])
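
The comment about switching from argmax to argwhere hints at an edge case: what happens when no value exceeds the threshold. The sketch below, on a made-up array, shows that np.argmax on an all-False mask silently returns 0, while np.argwhere returns an empty result that can be tested.

```python
import numpy as np

w = np.array([0.2, 0.4, 1.0, 1.4, 1.5])

mask = w > 1.0
first_pos = int(np.argwhere(mask)[0, 0])   # position of the first match

# No value exceeds 2.0
none_mask = w > 2.0
misleading = int(np.argmax(none_mask))     # 0, although nothing matched
matches = np.argwhere(none_mask)           # empty array, safe to check

first_or_minus_one = int(matches[0, 0]) if matches.size else -1
print(first_pos, misleading, first_or_minus_one)
```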

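Replace all values greater than a given value with a given cutoff

Problem: In the array a, replace all values greater than 30 with 30 and all values less than 10 with 10.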

# Input
np.set_printoptions(precision=2)
np.random.seed(100)
a = np.random.uniform(1,50, 20)

# Solution: Using np.clip
np.clip(a, a_min=10, a_max=30)

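numpy.clip() takes an interval and clips values outside it to the interval edges. For example, if the interval [0, 1] is specified, values smaller than 0 become 0 and values larger than 1 become 1. No check is performed to ensure a_min < a_max.

numpy.clip(a, a_min, a_max, out=None, **kwargs) # Clip (limit) the values in an array.

a: array_like. Array containing the elements to clip.
a_min, a_max: array_like or None. Minimum and maximum value. If None, clipping is not performed on the corresponding edge. Only one of a_min and a_max may be None.
out: ndarray, optional. The results are placed in this array. out must have the right shape to hold the output.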
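
To see the clipping rule on concrete numbers, here is a small sketch with a made-up array, together with an equivalent formulation using nested np.where calls. The bounds are passed positionally.

```python
import numpy as np

a = np.array([3.0, 12.5, 30.0, 47.2])

# Values below 10 become 10, values above 30 become 30
clipped = np.clip(a, 10, 30)

# The same result spelled out with np.where
manual = np.where(a < 10, 10, np.where(a > 30, 30, a))
print(clipped, manual)
```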

Get the positions of the top n values from a NumPy array

Problem: Get the positions of the top 5 maximum values in a given array a.

# Input
np.random.seed(100)
a = np.random.uniform(1,50,20)

# Solution:
print(a.argsort()[-5:])
# > [18 7 3 10 15]


# Below methods will get you the values.
# Method 1:
a[a.argsort()][-5:]

# Method 2:
np.sort(a)[-5:]
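
For large arrays, a full sort does more work than needed to find the top n. A possible alternative, sketched below on a made-up array, is np.argpartition, which only guarantees that the last n positions hold the n largest values, in no particular order.

```python
import numpy as np

a = np.array([7.0, 42.0, 3.0, 19.0, 35.0, 28.0, 11.0])
n = 3

# Indices of the n largest values, in arbitrary order
top_unsorted = np.argpartition(a, -n)[-n:]

# Order those n indices from largest value to smallest
top_sorted = top_unsorted[np.argsort(a[top_unsorted])[::-1]]

# Same answer from a full argsort, for comparison
top_by_argsort = a.argsort()[-n:][::-1]
print(top_sorted, top_by_argsort)
```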