This post covers the following four parts:

7. Grouping
8. Reshaping
9. Time series
10. Categoricals
7. Grouping

By “group by” we are referring to a process involving one or more of the following steps:
Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure
See the Grouping section.
import numpy as np
import pandas as pd
df = pd.DataFrame(
{
"A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
"B": ["one", "one", "two", "three", "two", "two", "one", "three"],
"C": np.random.randn(8),
"D": np.random.randn(8),})
df
  | A | B | C | D |
---|---|---|---|---|
0 | foo | one | -1.688913 | -2.514367 |
1 | bar | one | -1.252292 | 1.632814 |
2 | foo | two | 0.363461 | 1.565951 |
3 | bar | three | -0.330988 | -1.242105 |
4 | foo | two | 1.387786 | 1.159832 |
5 | bar | two | 2.395909 | -0.361804 |
6 | foo | one | -1.085930 | 0.089441 |
7 | foo | three | -0.694535 | 0.752959 |
Grouping and then applying the sum() function to the resulting groups:
df.groupby("A").sum()
A | C | D |
---|---|---|
bar | -2.250374 | -0.334026 |
foo | 1.244043 | 0.054678 |
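To make the three steps explicit, here is a minimal hand-rolled sketch of split-apply-combine (for illustration only; groupby does all of this in one call):

# Illustration only: split-apply-combine spelled out by hand
pieces = {key: grp for key, grp in df.groupby("A")}                  # split
sums = {key: grp[["C", "D"]].sum() for key, grp in pieces.items()}   # apply
pd.DataFrame(sums).T                                                 # combine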
Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function:
df.groupby(["A", "B"]).sum()
A | B | C | D |
---|---|---|---|
bar | one | -1.312258 | 0.076088 |
 | three | -0.285997 | 0.058971 |
 | two | -0.652119 | -0.469085 |
foo | one | -1.510840 | -0.534755 |
 | three | 0.981995 | 0.419014 |
 | two | 1.772888 | 0.170420 |
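groupby is not limited to sum(); agg() accepts a per-column mapping of aggregation functions. A small example (not in the original):

# Different aggregations per column
df.groupby(["A", "B"]).agg({"C": "sum", "D": "mean"})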
8. Reshaping

See the sections on Hierarchical Indexing and Reshaping.
tuples = list(
zip(
*[
["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
["one", "two", "one", "two", "one", "two", "one", "two"],
]
)
)
tuples
[('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
index
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
names=['first', 'second'])
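As an aside, the same index can be built more compactly with MultiIndex.from_product (an equivalent construction, not used in the original):

# Equivalent construction via the Cartesian product of the two levels
pd.MultiIndex.from_product(
    [["bar", "baz", "foo", "qux"], ["one", "two"]], names=["first", "second"]
)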
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])
df2 = df[:4]
df2
first | second | A | B |
---|---|---|---|
bar | one | -0.642310 | -0.121771 |
 | two | 0.105929 | -0.170706 |
baz | one | 0.624519 | 1.795898 |
 | two | 0.355989 | 0.295519 |
The stack() method “compresses” a level in the DataFrame’s columns:
stacked = df2.stack()
stacked
first second
bar one A -0.642310
B -0.121771
two A 0.105929
B -0.170706
baz one A 0.624519
B 1.795898
two A 0.355989
B 0.295519
dtype: float64
With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:
stacked.unstack()
first | second | A | B |
---|---|---|---|
bar | one | -0.642310 | -0.121771 |
 | two | 0.105929 | -0.170706 |
baz | one | 0.624519 | 1.795898 |
 | two | 0.355989 | 0.295519 |
stacked.unstack(1)
first | | one | two |
---|---|---|---|
bar | A | -0.642310 | 0.105929 |
 | B | -0.121771 | -0.170706 |
baz | A | 0.624519 | 0.355989 |
 | B | 1.795898 | 0.295519 |
stacked.unstack(0)
second | | bar | baz |
---|---|---|---|
one | A | -0.642310 | 0.624519 |
 | B | -0.121771 | 1.795898 |
two | A | 0.105929 | 0.355989 |
 | B | -0.170706 | 0.295519 |
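unstack() also accepts a level name instead of an integer position; given the index built above, the following is equivalent to stacked.unstack(1):

# Unstack by level name rather than position
stacked.unstack("second")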
See the section on Pivot Tables.
df = pd.DataFrame(
{
"A": ["one", "one", "two", "three"] * 3,
"B": ["A", "B", "C"] * 4,
"C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2,
"D": np.random.randn(12),
"E": np.random.randn(12),
}
)
df
  | A | B | C | D | E |
---|---|---|---|---|---|
0 | one | A | foo | 0.022691 | 0.250766 |
1 | one | B | foo | 1.246638 | 0.597852 |
2 | two | C | foo | 0.237767 | 1.409630 |
3 | three | A | bar | 0.781579 | 0.698842 |
4 | one | B | bar | -0.350703 | 1.788431 |
5 | one | C | bar | 2.225344 | 0.052856 |
6 | two | A | foo | 0.748157 | 0.376670 |
7 | three | B | foo | -1.509539 | -0.405203 |
8 | one | C | foo | -1.840205 | -0.195269 |
9 | one | A | bar | 1.051340 | -1.058422 |
10 | two | B | bar | 0.587531 | -0.431633 |
11 | three | C | bar | -0.191187 | -0.008472 |
We can produce pivot tables from this data very easily:
pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])
A | B | bar | foo |
---|---|---|---|
one | A | 1.051340 | 0.022691 |
 | B | -0.350703 | 1.246638 |
 | C | 2.225344 | -1.840205 |
three | A | 0.781579 | NaN |
 | B | NaN | -1.509539 |
 | C | -0.191187 | NaN |
two | A | NaN | 0.748157 |
 | B | 0.587531 | NaN |
 | C | NaN | 0.237767 |
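pivot_table defaults to the mean as its aggregation; the aggfunc and fill_value parameters let you change the reducer and fill the NaN holes. An extra example, not from the original:

# Explicit aggregation function and a fill value for missing cells
pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"],
               aggfunc="sum", fill_value=0)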
9. Time series

pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications. See the Time Series section.
rng = pd.date_range("1/1/2012", periods=1000, freq="S")
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts
2012-01-01 00:00:00 145
2012-01-01 00:00:01 198
2012-01-01 00:00:02 434
2012-01-01 00:00:03 39
2012-01-01 00:00:04 250
2012-01-01 00:00:05 36
2012-01-01 00:00:06 174
2012-01-01 00:00:07 424
2012-01-01 00:00:08 339
2012-01-01 00:00:09 194
2012-01-01 00:00:10 175
2012-01-01 00:00:11 476
2012-01-01 00:00:12 88
2012-01-01 00:00:13 61
2012-01-01 00:00:14 255
2012-01-01 00:00:15 21
2012-01-01 00:00:16 283
2012-01-01 00:00:17 437
2012-01-01 00:00:18 140
2012-01-01 00:00:19 246
2012-01-01 00:00:20 246
2012-01-01 00:00:21 354
2012-01-01 00:00:22 287
2012-01-01 00:00:23 429
2012-01-01 00:00:24 39
2012-01-01 00:00:25 367
2012-01-01 00:00:26 296
2012-01-01 00:00:27 384
2012-01-01 00:00:28 482
2012-01-01 00:00:29 457
...
2012-01-01 00:16:10 150
2012-01-01 00:16:11 180
2012-01-01 00:16:12 175
2012-01-01 00:16:13 16
2012-01-01 00:16:14 109
2012-01-01 00:16:15 413
2012-01-01 00:16:16 446
2012-01-01 00:16:17 220
2012-01-01 00:16:18 367
2012-01-01 00:16:19 465
2012-01-01 00:16:20 178
2012-01-01 00:16:21 348
2012-01-01 00:16:22 322
2012-01-01 00:16:23 24
2012-01-01 00:16:24 236
2012-01-01 00:16:25 496
2012-01-01 00:16:26 467
2012-01-01 00:16:27 400
2012-01-01 00:16:28 177
2012-01-01 00:16:29 267
2012-01-01 00:16:30 21
2012-01-01 00:16:31 115
2012-01-01 00:16:32 173
2012-01-01 00:16:33 66
2012-01-01 00:16:34 240
2012-01-01 00:16:35 287
2012-01-01 00:16:36 259
2012-01-01 00:16:37 288
2012-01-01 00:16:38 489
2012-01-01 00:16:39 335
Freq: S, Length: 1000, dtype: int64
ts.resample("3Min").sum()
2012-01-01 00:00:00 43769
2012-01-01 00:03:00 46206
2012-01-01 00:06:00 48664
2012-01-01 00:09:00 45263
2012-01-01 00:12:00 45174
2012-01-01 00:15:00 25726
Freq: 3T, dtype: int64
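resample() is a groupby-like operation, so any reducer works on the bins; ohlc() is a common choice for financial data. Shown here as additional examples, not from the original:

# Mean per 3-minute bin, and open/high/low/close summaries
ts.resample("3Min").mean()
ts.resample("3Min").ohlc()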
Time zone representation:
rng = pd.date_range("3/6/2012 00:00", periods=5, freq="D")
ts = pd.Series(np.random.randn(len(rng)), rng)
ts
2012-03-06 1.082849
2012-03-07 -0.260217
2012-03-08 -0.878703
2012-03-09 -0.883832
2012-03-10 0.832079
Freq: D, dtype: float64
ts_utc = ts.tz_localize("UTC")
ts_utc
2012-03-06 00:00:00+00:00 1.082849
2012-03-07 00:00:00+00:00 -0.260217
2012-03-08 00:00:00+00:00 -0.878703
2012-03-09 00:00:00+00:00 -0.883832
2012-03-10 00:00:00+00:00 0.832079
Freq: D, dtype: float64
Converting to another time zone:
ts_utc.tz_convert("US/Eastern")
2012-03-05 19:00:00-05:00 1.082849
2012-03-06 19:00:00-05:00 -0.260217
2012-03-07 19:00:00-05:00 -0.878703
2012-03-08 19:00:00-05:00 -0.883832
2012-03-09 19:00:00-05:00 0.832079
Freq: D, dtype: float64
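Any IANA time zone name works with tz_convert; for instance (an extra example, not in the original):

# Convert the same UTC series to another zone
ts_utc.tz_convert("Asia/Shanghai")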
Converting between time span representations:
rng = pd.date_range("1/1/2012", periods=5, freq="M")
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
2012-01-31 0.085870
2012-02-29 0.320371
2012-03-31 -0.144583
2012-04-30 0.259971
2012-05-31 0.031106
Freq: M, dtype: float64
ps = ts.to_period()
ps
2012-01 0.085870
2012-02 0.320371
2012-03 -0.144583
2012-04 0.259971
2012-05 0.031106
Freq: M, dtype: float64
ps.to_timestamp()
2012-01-01 0.085870
2012-02-01 0.320371
2012-03-01 -0.144583
2012-04-01 0.259971
2012-05-01 0.031106
Freq: MS, dtype: float64
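to_timestamp() anchors at the start of each period by default; passing how="end" anchors at the period end instead (an extra example, not in the original):

# Anchor converted timestamps at the end of each month
ps.to_timestamp(how="end")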
Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following example, we convert a quarterly frequency with year ending in November to 9am of the end of the month following the quarter end:
prng = pd.period_range("1990Q1", "2000Q4", freq="Q-NOV")
ts = pd.Series(np.random.randn(len(prng)), prng)
ts
1990Q1 0.205428
1990Q2 -0.468783
1990Q3 -0.660921
1990Q4 -0.603024
1991Q1 0.626573
1991Q2 0.930349
1991Q3 1.025994
1991Q4 0.757378
1992Q1 0.239052
1992Q2 -0.188778
1992Q3 0.400186
1992Q4 -1.182243
1993Q1 0.488901
1993Q2 -0.229461
1993Q3 -1.149555
1993Q4 -0.493716
1994Q1 0.358941
1994Q2 -0.862758
1994Q3 1.415536
1994Q4 0.667995
1995Q1 -0.082420
1995Q2 -0.131518
1995Q3 -0.942415
1995Q4 0.045751
1996Q1 0.542599
1996Q2 0.438003
1996Q3 -0.391305
1996Q4 -2.592706
1997Q1 0.799962
1997Q2 -0.667447
1997Q3 -0.166855
1997Q4 0.476623
1998Q1 -0.948281
1998Q2 0.508382
1998Q3 1.489794
1998Q4 -0.090221
1999Q1 -2.080581
1999Q2 0.944585
1999Q3 1.499972
1999Q4 -1.385293
2000Q1 1.545408
2000Q2 0.536199
2000Q3 -0.835179
2000Q4 -0.902938
Freq: Q-NOV, dtype: float64
# Take the quarter's last month ("e" = end), advance one month to the month
# following the quarter end, convert to an hourly period at that month's
# start ("s" = start), then add 9 hours -> 9am
ts.index = (prng.asfreq("M", "e") + 1).asfreq("H", "s") + 9
ts.head()
1990-03-01 09:00 0.205428
1990-06-01 09:00 -0.468783
1990-09-01 09:00 -0.660921
1990-12-01 09:00 -0.603024
1991-03-01 09:00 0.626573
Freq: H, dtype: float64
10. Categoricals
pandas can include categorical data in a DataFrame. For full docs, see the categorical introduction and the API documentation.
df = pd.DataFrame(
{"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
)
Converting the raw grades to a categorical data type:
df["grade"] = df["raw_grade"].astype("category")
df
  | id | raw_grade | grade |
---|---|---|---|
0 | 1 | a | a |
1 | 2 | b | b |
2 | 3 | b | b |
3 | 4 | a | a |
4 | 5 | a | a |
5 | 6 | e | e |
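The new column now carries a category dtype whose categories were inferred from the data; you can inspect them via the cat accessor:

# Categories inferred from the raw data: ['a', 'b', 'e']
df["grade"].cat.categories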
df["raw_grade"] = df["raw_grade"].astype("str")
df.info()
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
id 6 non-null int64
raw_grade 6 non-null object
grade 6 non-null category
dtypes: category(1), int64(1), object(1)
memory usage: 286.0+ bytes
Rename the categories to more meaningful names (assigning to Series.cat.categories operates in place!):
df["grade"].cat.categories = ["very good", "good", "very bad"]
df
  | id | raw_grade | grade |
---|---|---|---|
0 | 1 | a | very good |
1 | 2 | b | good |
2 | 3 | b | good |
3 | 4 | a | very good |
4 | 5 | a | very good |
5 | 6 | e | very bad |
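Note that assigning to Series.cat.categories is not allowed in recent pandas versions (the setter was removed around pandas 2.0); the supported equivalent is rename_categories:

# pandas >= 2.0 equivalent of the assignment above
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])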
Reorder the categories and simultaneously add the missing categories (methods under the Series.cat accessor return a new Series by default):
df["grade"] = df["grade"].cat.set_categories(
["very bad", "bad", "medium", "good", "very good"]
)
df
  | id | raw_grade | grade |
---|---|---|---|
0 | 1 | a | very good |
1 | 2 | b | good |
2 | 3 | b | good |
3 | 4 | a | very good |
4 | 5 | a | very good |
5 | 6 | e | very bad |
Sorting is per order in the categories, not lexical order:
df.sort_values(by="grade")
  | id | raw_grade | grade |
---|---|---|---|
5 | 6 | e | very bad |
1 | 2 | b | good |
2 | 3 | b | good |
0 | 1 | a | very good |
3 | 4 | a | very good |
4 | 5 | a | very good |
Grouping by a categorical column also shows empty categories:
df.groupby("grade").size()
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64
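Whether empty categories appear is controlled by groupby's observed parameter; to keep the behavior shown above explicit in newer pandas versions, pass observed=False:

# Keep unobserved categories in the result (the behavior shown above)
df.groupby("grade", observed=False).size()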