x = np.arange(10) # (10,) shape
y = x.reshape(-1, 1) # (10, 1) shape
z = x.reshape(1, -1) # (1, 10) shape
np.vstack([x, x]).shape
np.hstack([x, x]).shape
np.vstack([y, y]).shape
np.hstack([y, y]).shape
np.vstack([z, z]).shape
np.hstack([z, z]).shape
numpy.random.choice¶
Generates a random sample from a given 1-D array.
numpy.random.choice(a, size=None, replace=True, p=None)
https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.choice.html
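The replace parameter controls whether sampling is done with replacement, i.e. whether the same element may be drawn more than once.
random.choices from the Python standard library is similar.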
np.random.choice(5, 3, p=[0.1, 0, 0.3, 0.6, 0])
np.random.choice(5, 3, replace=False, p=[0.1, 0, 0.3, 0.6, 0])
aa_milne_arr = ['pooh', 'rabbit', 'piglet', 'Christopher']
np.random.choice(aa_milne_arr, 5, p=[0.5, 0.1, 0.1, 0.3])
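The effect of replace can be checked with a small sketch. It assumes NumPy 1.17 or later and uses the newer np.random.default_rng generator rather than the legacy np.random.choice call shown above; the seed and sample sizes are arbitrary illustration values.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for reproducibility
names = ['pooh', 'rabbit', 'piglet', 'Christopher']

# With replacement: the sample may be larger than the population.
with_replacement = rng.choice(names, size=6, replace=True)

# Without replacement: every element appears at most once.
without_replacement = rng.choice(names, size=4, replace=False)

# Asking for more items than the population holds, without replacement, fails.
try:
    rng.choice(names, size=5, replace=False)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Drawing all four names without replacement simply returns a permutation of the list.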
numpy.allclose(a, b, rtol=1e-05, atol=1e-08, equal_nan=False)
Returns True if two arrays are element-wise equal within a tolerance.
https://docs.scipy.org/doc/numpy/reference/generated/numpy.allclose.html
By default two NaNs are not considered equal; pass equal_nan=True to treat them as equal.
See also: np.array_equal
np.allclose([1e10, 1e-7], [1.00001e10, 1e-8])
np.allclose([1e10, 1e-8], [1.00001e10, 1e-9])
np.allclose([1e10, 1e-8], [1.0001e10, 1e-9])
np.allclose([1.0, np.nan], [1.0, np.nan])
np.allclose([1.0, np.nan], [1.0, np.nan], equal_nan=True)
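The tolerance test behind allclose is |a - b| <= atol + rtol * |b|, applied element by element. The sketch below re-implements that check by hand (the helper name allclose_by_hand is made up for illustration and ignores NaN handling) and compares it with np.allclose on the first three examples above.

```python
import numpy as np

def allclose_by_hand(a, b, rtol=1e-05, atol=1e-08):
    """Check |a - b| <= atol + rtol * |b| for every element (no NaN handling)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return bool(np.all(np.abs(a - b) <= atol + rtol * np.abs(b)))

pairs = [
    ([1e10, 1e-7], [1.00001e10, 1e-8]),
    ([1e10, 1e-8], [1.00001e10, 1e-9]),
    ([1e10, 1e-8], [1.0001e10, 1e-9]),
]
by_hand = [allclose_by_hand(a, b) for a, b in pairs]
by_numpy = [bool(np.allclose(a, b)) for a, b in pairs]
# Both lists are [False, True, False].
```

Note that the test is not symmetric in a and b, because only |b| scales the relative tolerance.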
numpy.array_split¶
numpy.array_split(ary, indices_or_sections, axis=0)
Splits ary into several sub-arrays and returns them as a list.
https://docs.scipy.org/doc/numpy/reference/generated/numpy.array_split.html
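A minimal sketch of both forms of indices_or_sections, using a small example array chosen for illustration:

```python
import numpy as np

x = np.arange(8)

# An integer asks for that many sections; they need not be equal in size.
parts = np.array_split(x, 3)            # sizes 3, 3, 2

# A list of indices gives the positions at which to cut.
by_index = np.array_split(x, [2, 5])    # x[:2], x[2:5], x[5:]
```

Unlike np.split, array_split does not raise when the array cannot be divided evenly.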
https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.norm.html
Algebra Routine
ord      norm for matrices               norm for vectors
None     Frobenius norm                  2-norm
'fro'    Frobenius norm                  –
'nuc'    nuclear norm                    –
inf      max(sum(abs(x), axis=1))        max(abs(x))
-inf     min(sum(abs(x), axis=1))        min(abs(x))
0        –                               sum(x != 0)
1        max(sum(abs(x), axis=0))        as below
-1       min(sum(abs(x), axis=0))        as below
2        2-norm (largest sing. value)    as below
-2       smallest singular value         as below
other    –                               sum(abs(x)**ord)**(1./ord)
The Frobenius norm is given by:
||A||_F = [\sum_{i,j} abs(a_{i,j})^2]^{1/2}
The nuclear norm is the sum of the singular values.
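The definitions above can be verified numerically. This is a small sketch on an arbitrary 2x2 matrix, comparing np.linalg.norm with values computed directly from the formulas and from the singular values.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
singular_values = np.linalg.svd(A, compute_uv=False)

fro = np.linalg.norm(A, 'fro')
fro_by_hand = np.sqrt(np.sum(np.abs(A) ** 2))   # [sum |a_ij|^2]^(1/2)

nuc = np.linalg.norm(A, 'nuc')                  # sum of the singular values
two_norm = np.linalg.norm(A, 2)                 # largest singular value

inf_norm = np.linalg.norm(A, np.inf)            # max row sum: 3 + 4 = 7
one_norm = np.linalg.norm(A, 1)                 # max column sum: 2 + 4 = 6
```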
def my_func(a):
    """Average first and last element of a 1-D array"""
    return (a[0] + a[-1]) * 0.5
b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
np.apply_along_axis(my_func, 0, b)
np.apply_along_axis(my_func, 1, b)
b = np.array([[1, 2], [4, 5], [7, 8]])
res = np.apply_along_axis(np.diag, -1, b)
res.shape
https://docs.scipy.org/doc/numpy/reference/generated/numpy.bincount.html
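Returns the number of occurrences of each value in x; values that do not occur get a count of 0.
The input must be an array of non-negative ints.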
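A short sketch of the counting behaviour, with an example array chosen for illustration:

```python
import numpy as np

x = np.array([0, 1, 1, 3, 2, 1, 7])

counts = np.bincount(x)                 # counts[i] is how often the value i occurs
padded = np.bincount(x, minlength=10)   # pad the result with zeros up to length 10
```

The result has length max(x) + 1 unless minlength asks for more, so the values 4, 5 and 6 appear with a count of 0.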
numpy.unique(ar, return_index=False, return_inverse=False, return_counts=False, axis=None)
https://docs.scipy.org/doc/numpy/reference/generated/numpy.unique.html
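Returns the sorted unique elements. With the default axis=None the input is flattened first. return_index controls whether the indices of the first occurrences are returned, and return_counts controls whether the counts are returned.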
a = np.array([[1, 2, 1], [2, 3, 4]])
u, indices = np.unique(a, return_index=True)
indices
a[np.unravel_index(indices, a.shape)]
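The optional return values can be requested together. The sketch below reuses the array from the example above; the inverse is flattened explicitly because its shape for multi-dimensional input differs between NumPy versions.

```python
import numpy as np

a = np.array([[1, 2, 1], [2, 3, 4]])
u, index, inverse, counts = np.unique(
    a, return_index=True, return_inverse=True, return_counts=True)

# index points into the flattened input; inverse rebuilds the input from u.
first_occurrences = a.ravel()[index]
rebuilt = u[np.ravel(inverse)].reshape(a.shape)
```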
# Scalar fields are declared without a shape; passing a shape of 1 is deprecated in newer NumPy.
custom_ndarray = np.zeros(5, dtype=[('position', float, 2),
                                    ('size', float),
                                    ('growth', float),
                                    ('color', float, 4),
                                    ('name', 'U1')])
custom_ndarray
custom_ndarray[0]
custom_ndarray[0]['position']
array([([0., 0.], 0., 0., [0., 0., 0., 0.], ''),
([0., 0.], 0., 0., [0., 0., 0., 0.], ''),
([0., 0.], 0., 0., [0., 0., 0., 0.], ''),
([0., 0.], 0., 0., [0., 0., 0., 0.], ''),
([0., 0.], 0., 0., [0., 0., 0., 0.], '')],
dtype=[('position', '<f8', (2,)), ('size', '<f8'), ('growth', '<f8'), ('color', '<f8', (4,)), ('name', '<U1')])
savetxt¶
np.savetxt(fname, X, fmt='%.18e', delimiter=' ', newline='\n', header='', footer='', comments='# ', encoding=None)
By default savetxt writes values in scientific notation (fmt='%.18e').
When saving data with savetxt it is best to specify encoding explicitly; header and comments can be set as well.
# Export as csv
np.savetxt('data/output.csv', arr, delimiter=',', header='',
           comments='', encoding='utf-8')  # float format
# Export as txt
np.savetxt('data/output.txt', arr, delimiter=' ',
           header='', comments='', encoding='utf-8')
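To see what fmt, header and comments do to the output, the following sketch writes to an in-memory io.StringIO buffer instead of a file and reads the text back with np.loadtxt. The array contents and the column names a,b are made-up illustration values.

```python
import io
import numpy as np

arr = np.array([[1.5, 2.25], [3.0, 4.125]])  # hypothetical example data

buffer = io.StringIO()  # in-memory stand-in for a file such as 'data/output.csv'
np.savetxt(buffer, arr, fmt='%.3f', delimiter=',', header='a,b', comments='')
text = buffer.getvalue()

# comments='' keeps the header line free of the default '# ' prefix,
# so it has to be skipped explicitly when reading the data back.
restored = np.loadtxt(io.StringIO(text), delimiter=',', skiprows=1)
```

A fixed-point fmt such as '%.3f' is usually easier to read than the default scientific notation, at the cost of precision.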
x = np.zeros(
(2,), dtype=[('time', [('min', int), ('sec', int)]), ('temp', float)])
x[0]['time']['min'] = 10
x['temp'] = 98.25
genfromtxt¶
np.genfromtxt(fname, dtype=<class 'float'>, comments='#', delimiter=None, skip_header=0, skip_footer=0, converters=None, missing_values=None, filling_values=None, usecols=None, names=None, excludelist=None, deletechars=None, replace_space='_', autostrip=False, case_sensitive=True, defaultfmt='f%i', unpack=None, usemask=False, loose=True, invalid_raise=True, max_rows=None, encoding='bytes')
from struct import Struct

def write_records(records, format, f):
    """Write a sequence of tuples to a binary file of structures."""
    record_struct = Struct(format)
    for r in records:
        f.write(record_struct.pack(*r))

records = [(1, 2.3, 4.5),
           (6, 7.8, 9.0),
           (12, 13.4, 56.7)]
with open('data/data.b', 'wb') as f:
    write_records(records, '<idd', f)
records = np.fromfile('data/data.b', dtype='<i,<d,<d')
records
np.load¶
np.load(file, mmap_mode=None, allow_pickle=True, fix_imports=True, encoding='ASCII')
Note: this signature is from an older NumPy release; since NumPy 1.16.3 the default is allow_pickle=False.
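A minimal round trip through np.save and np.load, sketched with an in-memory io.BytesIO buffer in place of a .npy file so that nothing is written to disk:

```python
import io
import numpy as np

arr = np.arange(6).reshape(2, 3)

buffer = io.BytesIO()   # stands in for a file opened in binary mode
np.save(buffer, arr)    # writes the array in .npy format
buffer.seek(0)          # rewind before reading

# Plain numeric arrays do not need pickle, so it can stay disabled.
loaded = np.load(buffer, allow_pickle=False)
```

Shape and dtype are stored in the .npy header, so they survive the round trip unchanged.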
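Converting an ndarray to a DataFrame (N-D -> 2-D)¶
A multi-dimensional ndarray is an efficient data structure, but processing it with pandas is somewhat awkward because pandas mostly works with 2-D data. The extra dimensions of the ndarray therefore have to be turned into columns of a 2-D DataFrame.
A conservative conversion¶
Suppose the data has shape (50, 100, 3): the first dimension is time, the second is the individual ID, and the third holds the coordinates x, y, z of each individual. To process it with pandas, we want to convert the ndarray into a 2-D DataFrame of shape (5000, 5), where 5000 = 50 x 100 and the columns are x, y, z plus two added columns t and ID, giving the column labels t, ID, x, y, z.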
array([[[ 2.42442956e+02, 7.76911920e+01, 6.64777151e-01],
[ 2.61380074e+02, 2.01793185e+02, 2.94516922e+00],
[ 4.12767690e+02, 1.35482822e+02, -4.92483385e-01],
[ 4.10753164e+02, 2.02361917e+02, -1.47121999e-01],
[ 2.69633830e+02, 2.68148789e+02, 1.27458590e+00],
[ 3.30322105e+02, 1.75890005e+02, -7.97956043e-01]],
[[ 2.45365704e+02, 7.83676099e+01, 2.27428156e-01],
[ 2.58458717e+02, 2.02475590e+02, 2.91211536e+00],
[ 4.14913593e+02, 1.33386372e+02, -7.73741839e-01],
[ 4.11856535e+02, 1.99572191e+02, -1.19416454e+00],
[ 2.70467390e+02, 2.71030660e+02, 1.28923763e+00],
[ 3.32290845e+02, 1.73626366e+02, -8.54962584e-01]],
[[ 2.48239647e+02, 7.75071160e+01, -2.90917521e-01],
[ 2.55461149e+02, 2.02596349e+02, -3.18185644e+00],
[ 4.17462899e+02, 1.31804904e+02, -5.55250026e-01],
[ 4.09707081e+02, 1.97479383e+02, -2.36954656e+00],
[ 2.68405895e+02, 2.73210163e+02, 2.32837616e+00],
[ 3.33943130e+02, 1.71122378e+02, -9.87520127e-01]],
[[ 3.56975004e+02, 5.97239259e+00, -5.75102134e-01],
[ 1.30143542e+02, 1.64376675e+02, -2.87853593e+00],
[ 4.74988523e+02, 1.15517297e+01, -8.53119091e-01],
[ 4.27604192e+02, 8.84718553e+01, -7.13370919e-01],
[ 1.53469992e+02, 2.17281949e+02, -3.01382082e+00],
[ 3.95602341e+02, 5.33207351e+01, -9.52802521e-01]],
[[ 3.58316991e+02, 3.28928418e+00, -1.10701967e+00],
[ 1.27151855e+02, 1.64599848e+02, -3.21605251e+00],
[ 4.75983562e+02, 8.72155337e+00, -1.23271295e+00],
[ 4.29080172e+02, 8.58600584e+01, -1.05641846e+00],
[ 1.50495321e+02, 2.16892942e+02, -3.01155720e+00],
[ 3.97560036e+02, 5.10475365e+01, -8.59831857e-01]],
[[ 3.60269963e+02, 1.01202655e+00, -8.61907762e-01],
[ 1.24279619e+02, 1.63733672e+02, -2.84869734e+00],
[ 4.77039363e+02, 5.91347873e+00, -1.21116002e+00],
[ 4.31150993e+02, 8.36894137e+01, -8.08928892e-01],
[ 1.47569414e+02, 2.16230318e+02, -2.91888170e+00],
[ 3.99619625e+02, 4.88662316e+01, -8.14090703e-01]]])
indice = np.mgrid[0:dim_1, 0:dim_2]
indice
indice = indice.reshape((2,-1))
indice
indice = indice.T
indice
array([[[ 0, 0, 0, ..., 0, 0, 0],
[ 1, 1, 1, ..., 1, 1, 1],
[ 2, 2, 2, ..., 2, 2, 2],
[47, 47, 47, ..., 47, 47, 47],
[48, 48, 48, ..., 48, 48, 48],
[49, 49, 49, ..., 49, 49, 49]],
[[ 0, 1, 2, ..., 97, 98, 99],
[ 0, 1, 2, ..., 97, 98, 99],
[ 0, 1, 2, ..., 97, 98, 99],
[ 0, 1, 2, ..., 97, 98, 99],
[ 0, 1, 2, ..., 97, 98, 99],
[ 0, 1, 2, ..., 97, 98, 99]]])
array([[242.44295567, 77.69119197, 0.66477715],
[261.38007362, 201.79318451, 2.94516922],
[412.76769009, 135.48282173, -0.49248338],
[431.15099334, 83.68941366, -0.80892889],
[147.56941378, 216.23031823, -2.9188817 ],
[399.61962538, 48.86623161, -0.8140907 ]])
array([[ 0. , 0. , 242.44295567, 77.69119197,
0.66477715],
[ 0. , 1. , 261.38007362, 201.79318451,
2.94516922],
[ 0. , 2. , 412.76769009, 135.48282173,
-0.49248338],
[ 49. , 97. , 431.15099334, 83.68941366,
-0.80892889],
[ 49. , 98. , 147.56941378, 216.23031823,
-2.9188817 ],
[ 49. , 99. , 399.61962538, 48.86623161,
-0.8140907 ]])
import pandas as pd
df = pd.DataFrame(result, columns=["t", "ID", "x", "y", "z"])
df.head()
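The steps of the conservative conversion can be put together in one self-contained sketch. The random data stands in for the real measurements, and the way result is allocated and filled is an assumption, since that cell is not shown above.

```python
import numpy as np
import pandas as pd

dim_1, dim_2, dim_3 = 50, 100, 3
rng = np.random.default_rng(0)
data = rng.normal(size=(dim_1, dim_2, dim_3))  # placeholder for the real data

# (2, 50, 100) index grid -> (2, 5000) -> (5000, 2): one (t, ID) pair per row
indice = np.mgrid[0:dim_1, 0:dim_2].reshape((2, -1)).T

result = np.empty((dim_1 * dim_2, 2 + dim_3))
result[:, :2] = indice
result[:, 2:] = data.reshape(-1, dim_3)  # C order keeps rows aligned with indice

df = pd.DataFrame(result, columns=["t", "ID", "x", "y", "z"])
```

Because both the index grid and the data are flattened in C order, row k of the DataFrame corresponds to t = k // 100 and ID = k % 100.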
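A more aggressive conversion¶
Suppose we also want to collapse the three values of the last dimension into rows, i.e. convert the (50, 100, 3) data into (15000, 4) data with column labels t, ID, cat, value, where cat takes one of the three categories x, y, z.
This is probably the data layout best suited to processing with pandas.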
array([[[ 2.42442956e+02, 7.76911920e+01, 6.64777151e-01],
[ 2.61380074e+02, 2.01793185e+02, 2.94516922e+00],
[ 4.12767690e+02, 1.35482822e+02, -4.92483385e-01],
[ 4.10753164e+02, 2.02361917e+02, -1.47121999e-01],
[ 2.69633830e+02, 2.68148789e+02, 1.27458590e+00],
[ 3.30322105e+02, 1.75890005e+02, -7.97956043e-01]],
[[ 2.45365704e+02, 7.83676099e+01, 2.27428156e-01],
[ 2.58458717e+02, 2.02475590e+02, 2.91211536e+00],
[ 4.14913593e+02, 1.33386372e+02, -7.73741839e-01],
[ 4.11856535e+02, 1.99572191e+02, -1.19416454e+00],
[ 2.70467390e+02, 2.71030660e+02, 1.28923763e+00],
[ 3.32290845e+02, 1.73626366e+02, -8.54962584e-01]],
[[ 2.48239647e+02, 7.75071160e+01, -2.90917521e-01],
[ 2.55461149e+02, 2.02596349e+02, -3.18185644e+00],
[ 4.17462899e+02, 1.31804904e+02, -5.55250026e-01],
[ 4.09707081e+02, 1.97479383e+02, -2.36954656e+00],
[ 2.68405895e+02, 2.73210163e+02, 2.32837616e+00],
[ 3.33943130e+02, 1.71122378e+02, -9.87520127e-01]],
[[ 3.56975004e+02, 5.97239259e+00, -5.75102134e-01],
[ 1.30143542e+02, 1.64376675e+02, -2.87853593e+00],
[ 4.74988523e+02, 1.15517297e+01, -8.53119091e-01],
[ 4.27604192e+02, 8.84718553e+01, -7.13370919e-01],
[ 1.53469992e+02, 2.17281949e+02, -3.01382082e+00],
[ 3.95602341e+02, 5.33207351e+01, -9.52802521e-01]],
[[ 3.58316991e+02, 3.28928418e+00, -1.10701967e+00],
[ 1.27151855e+02, 1.64599848e+02, -3.21605251e+00],
[ 4.75983562e+02, 8.72155337e+00, -1.23271295e+00],
[ 4.29080172e+02, 8.58600584e+01, -1.05641846e+00],
[ 1.50495321e+02, 2.16892942e+02, -3.01155720e+00],
[ 3.97560036e+02, 5.10475365e+01, -8.59831857e-01]],
[[ 3.60269963e+02, 1.01202655e+00, -8.61907762e-01],
[ 1.24279619e+02, 1.63733672e+02, -2.84869734e+00],
[ 4.77039363e+02, 5.91347873e+00, -1.21116002e+00],
[ 4.31150993e+02, 8.36894137e+01, -8.08928892e-01],
[ 1.47569414e+02, 2.16230318e+02, -2.91888170e+00],
[ 3.99619625e+02, 4.88662316e+01, -8.14090703e-01]]])
indice = np.mgrid[0:dim_1, 0:dim_2]
indice
indice = indice.reshape((2,-1))
indice
indice = indice.T
indice
data = data.reshape(-1) # alternatively, take the 3 columns separately and combine them
indice = np.repeat(indice, dim_3, axis=0)
indice.shape
cat = np.tile(np.arange(dim_3).reshape((dim_3,-1)), (dim_1*dim_2,1))
array([[[ 0, 0, 0, ..., 0, 0, 0],
[ 1, 1, 1, ..., 1, 1, 1],
[ 2, 2, 2, ..., 2, 2, 2],
[47, 47, 47, ..., 47, 47, 47],
[48, 48, 48, ..., 48, 48, 48],
[49, 49, 49, ..., 49, 49, 49]],
[[ 0, 1, 2, ..., 97, 98, 99],
[ 0, 1, 2, ..., 97, 98, 99],
[ 0, 1, 2, ..., 97, 98, 99],
[ 0, 1, 2, ..., 97, 98, 99],
[ 0, 1, 2, ..., 97, 98, 99],
[ 0, 1, 2, ..., 97, 98, 99]]])
result = np.zeros((dim_1 * dim_2 * dim_3, 4))  # one row per (t, ID, cat) combination
result[:, :2] = indice
result[:, 2] = cat.reshape(-1)
result[:, -1] = data
result
array([[ 0. , 0. , 0. , 242.44295567],
[ 0. , 0. , 1. , 77.69119197],
[ 0. , 0. , 2. , 0.66477715],
[ 49. , 99. , 0. , 399.61962538],
[ 49. , 99. , 1. , 48.86623161],
[ 49. , 99. , 2. , -0.8140907 ]])
import pandas as pd
df = pd.DataFrame(result, columns=["t", "ID", "cat", "value"])
df.head()
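# Replace the numeric codes with labels in one step; assigning strings into a
# float column piece by piece triggers dtype warnings in recent pandas versions.
df["cat"] = df["cat"].astype(int).map({0: "x", 1: "y", 2: "z"})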
df.head()
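The aggressive conversion can likewise be sketched end to end. This version is a variation on the steps above, not the original cells: it builds the DataFrame from separate columns so that t and ID stay integers and cat is a string column from the start. The random data is again a placeholder.

```python
import numpy as np
import pandas as pd

dim_1, dim_2, dim_3 = 50, 100, 3
rng = np.random.default_rng(0)
data = rng.normal(size=(dim_1, dim_2, dim_3))  # placeholder for the real data

indice = np.mgrid[0:dim_1, 0:dim_2].reshape((2, -1)).T   # (5000, 2)
indice = np.repeat(indice, dim_3, axis=0)                # (15000, 2), each pair 3 times
cat = np.tile(np.arange(dim_3), dim_1 * dim_2)           # 0, 1, 2, 0, 1, 2, ...

df = pd.DataFrame({
    "t": indice[:, 0],
    "ID": indice[:, 1],
    "cat": np.array(["x", "y", "z"])[cat],   # map the codes 0, 1, 2 to labels
    "value": data.reshape(-1),
})
```

np.repeat stretches the index pairs while np.tile cycles the category codes, which matches the C-order flattening of data.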
NumPy offers the following main tools for functional programming:
apply_along_axis(func1d, axis, arr, *args, ...)
Apply a function to 1-D slices along the given axis.
apply_over_axes(func, a, axes)
Apply a function repeatedly over multiple axes.
vectorize(pyfunc[, otypes, doc, excluded, ...])
Generalized function class.
frompyfunc(func, nin, nout)
Takes an arbitrary Python function and returns a NumPy ufunc.
piecewise(x, condlist, funclist, *args, **kw)
Evaluate a piecewise-defined function.
numpy.vectorize¶
class numpy.vectorize(pyfunc, otypes=None, doc=None, excluded=None, cache=False, signature=None)
Define a vectorized function which takes a nested sequence of objects or numpy arrays as inputs and returns a single numpy array or a tuple of numpy arrays as output. The vectorized function evaluates pyfunc over successive tuples of the input arrays like the python map function, except it uses the broadcasting rules of numpy.
In short, it returns a vectorized function based on the given Python function.
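A small sketch of np.vectorize wrapping a scalar function with a branch, which could not be applied to an array directly. The function myfunc is an illustration only.

```python
import numpy as np

def myfunc(a, b):
    """Return a - b if a > b, otherwise return a + b (scalar inputs)."""
    return a - b if a > b else a + b

vfunc = np.vectorize(myfunc)

# The scalar 2 is broadcast against every element of the first argument.
out = vfunc([1, 2, 3, 4], 2)
```

np.vectorize is a convenience wrapper around a Python-level loop, so it should not be expected to run at the speed of a native ufunc.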
arr = np.array([5, 4, -2, 1, -2, 0, 4, 4, -6, -1])
u, indices = np.unique(arr, return_inverse=True)
indices
count = np.bincount(indices)
count
u[np.argmax(count)]
arr = np.array([[5, 5, 5, 5, -2, 0, 4, 4, -6, -1],
[0, 1, 1, 2, 3, 4, 5, 6, 7, 8]])
u, indices = np.unique(arr, return_inverse=True)
indices
# minlength has to be passed to bincount here
counted = np.apply_along_axis(np.bincount, 1, indices.reshape(arr.shape),
None, np.max(indices) + 1)
counted
u[np.argmax(counted, axis=1)]
https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.meshgrid.html
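Taking one value from the same position in each of the returned ndarrays gives one coordinate pair, which matches the usual coordinate-axis convention; this is useful for evaluating functions of several variables.
Generates a grid from the given arrays or lists.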
x = np.arange(0, 4, 2)
y = np.arange(0, 6, 2)
ls = np.meshgrid(x, y)
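The sketch below unpacks the two grids from the example above and uses them to evaluate a function of two variables; the function x**2 + y is an arbitrary illustration.

```python
import numpy as np

x = np.arange(0, 4, 2)   # [0, 2]
y = np.arange(0, 6, 2)   # [0, 2, 4]
xx, yy = np.meshgrid(x, y)   # both have shape (len(y), len(x)) == (3, 2)

# (xx[i, j], yy[i, j]) is one grid point, so f is evaluated on the whole grid at once.
f = xx ** 2 + yy
```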
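ndarray and matrix¶
Reference: What are the differences between numpy arrays and matrices? Which one should I use?
A matrix is strictly 2-D, while an ndarray can be N-D. matrix is a subclass of ndarray and has all of ndarray's methods. The main advantage of matrix is convenient matrix multiplication: a*b is a matrix product.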
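The difference in the meaning of * can be seen in a short sketch; the 2x2 values are arbitrary.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
m = np.asmatrix(a)          # same data viewed as a matrix

elementwise = a * a         # ndarray: * multiplies element by element
matrix_product = a @ a      # ndarray: @ (or np.matmul) is the matrix product
m_product = m * m           # matrix: * is already the matrix product
```

With the @ operator available for ndarray, the main convenience of matrix is largely covered without leaving ndarray.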
matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 1 is different from 3)
matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 3 is different from 1)
Returns elements chosen from x or y depending on condition.
https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.where.html?highlight=where#numpy.where
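numpy.where has two uses:
given condition together with x and y
given condition only
Given condition together with x and y¶
If x and y are 1-D, this is equivalent to
[xv if c else yv for (c,xv,yv) in zip(condition,x,y)]
This is not limited to 1-D inputs; the 1-D case is only used for illustration.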
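Both uses can be compared side by side. The arrays in this sketch are made-up illustration values.

```python
import numpy as np

condition = np.array([True, False, True, False])
x = np.array([1, 2, 3, 4])
y = np.array([10, 20, 30, 40])

# Three-argument form: pick from x where condition holds, otherwise from y.
chosen = np.where(condition, x, y)
by_comprehension = [xv if c else yv for (c, xv, yv) in zip(condition, x, y)]

# One-argument form: a tuple of index arrays, one per dimension.
(indices,) = np.where(condition)
```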
numpy.argwhere¶
numpy.argwhere(a)
Find the indices of array elements that are non-zero, grouped by element.
https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.argwhere.html
The result is a list of coordinates: each row is one intuitive coordinate tuple (x, y, ...) locating a non-zero element.
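A minimal sketch of the coordinate layout, compared with the per-dimension layout of np.nonzero:

```python
import numpy as np

x = np.arange(6).reshape(2, 3)          # [[0, 1, 2], [3, 4, 5]]

coords = np.argwhere(x > 1)             # one (row, column) pair per matching element
same = np.transpose(np.nonzero(x > 1))  # nonzero groups by dimension instead
```

The row-per-element layout is convenient for reading, while the tuple returned by np.nonzero is the form that can be used directly for indexing.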
numpy.tile and numpy.repeat¶
tile¶
numpy.tile(A, reps)
Construct an array by repeating A the number of times given by reps.
https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.tile.html
numpy.repeat(a, repeats, axis=None)
Repeat elements of an array, element by element.
With the default axis=None the array is flattened before repeating.
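The difference between the two functions is easiest to see on small inputs; the values here are arbitrary.

```python
import numpy as np

a = np.array([0, 1, 2])
tiled = np.tile(a, 2)        # the whole array is repeated
repeated = np.repeat(a, 2)   # each element is repeated

m = np.array([[1, 2], [3, 4]])
flat_repeat = np.repeat(m, 2)          # axis=None: flattened first
row_repeat = np.repeat(m, 2, axis=0)   # each row is repeated
```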
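The message from resize is clearer: only works on single-segment arrays
That is, the data is split into segments. The flags attribute shows that sliced_1 does not own its data; it references the data of its base array.
An array that does not own its data, and whose referenced data is not contiguous in memory, cannot be resized.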
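The situation can be reproduced with a strided slice. The definition of sliced_1 is not shown here, so the base array and slice below are hypothetical stand-ins.

```python
import numpy as np

base = np.arange(10)
sliced_1 = base[::2]   # hypothetical stand-in: a strided view into base

owns_data = sliced_1.flags['OWNDATA']          # the view borrows base's buffer
is_contiguous = sliced_1.flags['C_CONTIGUOUS'] # every second element is skipped

try:
    sliced_1.resize(3)
    resize_error = None
except ValueError as exc:
    resize_error = str(exc)

# A copy owns its own contiguous buffer, so it can be resized in place.
owned = sliced_1.copy()
owned.resize(3, refcheck=False)
```

Copying first is the simple way out when an in-place resize of a view is needed.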
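a = np.array([1, 2, 3, 4])  # example values (assumed)
b = np.array([5, 6, 7, 8])
a[1:3], b[1:3] = b[1:3], a[1:3]  # slices are views, so the swap fails: a -> [1 6 7 4], b -> [5 6 7 8]
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])
a[1:3], b[1:3] = b[1:3].copy(), a[1:3].copy()  # copies swap correctly: a -> [1 6 7 4], b -> [5 2 3 8]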
https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.ufunc.outer.html
Applies the function to every pair (a, b) with a in A and b in B.
The mechanism is similar to a nested for loop:
r = np.empty((len(A), len(B)))
for i in range(len(A)):
    for j in range(len(B)):
        r[i, j] = op(A[i], B[j])  # op = ufunc in question
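The nested-loop description can be checked against a concrete ufunc. This sketch uses np.add with two small example arrays.

```python
import numpy as np

A = np.array([1, 2, 3])
B = np.array([4, 5])

outer_sum = np.add.outer(A, B)   # shape (len(A), len(B))

# The same result written as the explicit double loop.
looped = np.empty((len(A), len(B)), dtype=outer_sum.dtype)
for i in range(len(A)):
    for j in range(len(B)):
        looped[i, j] = np.add(A[i], B[j])
```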
numpy.sort(a, axis=-1, kind='quicksort', order=None)
https://docs.scipy.org/doc/numpy/reference/generated/numpy.sort.html
Returns a sorted copy of the ndarray.
dtype = [('name', 'S10'), ('height', float), ('age', int)]
values = [('Arthur', 1.8, 41), ('Lancelot', 1.9, 38),
('Galahad', 1.7, 38)]
a = np.array(values, dtype=dtype) # create a structured array
np.sort(a, order='height')
x = np.array([(1, 0), (0, 1)], dtype=[('x', '<i4'), ('y', '<i4')])
np.argsort(x, order=('x', 'y'))
np.argsort(x, order=('y', 'x'))
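The order argument also accepts several fields, in which case later fields break ties in earlier ones. The sketch reuses the structured array from above, with a unicode name field ('U10' instead of 'S10') so that the names come back as str.

```python
import numpy as np

dtype = [('name', 'U10'), ('height', float), ('age', int)]
values = [('Arthur', 1.8, 41), ('Lancelot', 1.9, 38), ('Galahad', 1.7, 38)]
a = np.array(values, dtype=dtype)

# Sort by age first; equal ages are ordered by height.
by_age_then_height = np.sort(a, order=['age', 'height'])
names_in_order = by_age_then_height['name'].tolist()

# np.sort returns a copy, so the original order is untouched.
original_first = str(a['name'][0])
```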