This post covers the following four parts:

7. Grouping
8. Reshaping
9. Time series
10. Categoricals
7. Grouping

By “group by” we are referring to a process involving one or more of the following steps:
Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure
See the Grouping section.
import numpy as np
import pandas as pd
df = pd.DataFrame(
{
"A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
"B": ["one", "one", "two", "three", "two", "two", "one", "three"],
"C": np.random.randn(8),
"D": np.random.randn(8),})
df
  | A | B | C | D |
---|---|---|---|---|
0 | foo | one | -1.688913 | -2.514367 |
1 | bar | one | -1.252292 | 1.632814 |
2 | foo | two | 0.363461 | 1.565951 |
3 | bar | three | -0.330988 | -1.242105 |
4 | foo | two | 1.387786 | 1.159832 |
5 | bar | two | 2.395909 | -0.361804 |
6 | foo | one | -1.085930 | 0.089441 |
7 | foo | three | -0.694535 | 0.752959 |
Grouping and then applying the sum() function to the resulting groups:
df.groupby("A").sum()
A | C | D |
---|---|---|
bar | -2.250374 | -0.334026 |
foo | 1.244043 | 0.054678 |
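To make the three steps explicit, here is a minimal hand-rolled sketch of split-apply-combine (for illustration only; groupby does all of this in one call):

# Illustration only: split-apply-combine spelled out by hand
pieces = {key: grp for key, grp in df.groupby("A")}                  # split
sums = {key: grp[["C", "D"]].sum() for key, grp in pieces.items()}   # apply
pd.DataFrame(sums).T                                                 # combine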
Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function:
df.groupby(["A", "B"]).sum()
A | B | C | D |
---|---|---|---|
bar | one | -1.312258 | 0.076088 |
 | three | -0.285997 | 0.058971 |
 | two | -0.652119 | -0.469085 |
foo | one | -1.510840 | -0.534755 |
 | three | 0.981995 | 0.419014 |
 | two | 1.772888 | 0.170420 |
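groupby is not limited to sum(); agg() accepts a per-column mapping of aggregation functions. A small example (not in the original):

# Different aggregations per column
df.groupby(["A", "B"]).agg({"C": "sum", "D": "mean"})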
8. Reshaping

See the sections on Hierarchical Indexing and Reshaping.
tuples = list(
zip(
*[
["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
["one", "two", "one", "two", "one", "two", "one", "two"],
]
)
)
tuples
[('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
index
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
names=['first', 'second'])
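As an aside, the same index can be built more compactly with MultiIndex.from_product (an equivalent construction, not used in the original):

# Equivalent construction via the Cartesian product of the two levels
pd.MultiIndex.from_product(
    [["bar", "baz", "foo", "qux"], ["one", "two"]], names=["first", "second"]
)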
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])
df2 = df[:4]
df2
first | second | A | B |
---|---|---|---|
bar | one | -0.642310 | -0.121771 |
 | two | 0.105929 | -0.170706 |
baz | one | 0.624519 | 1.795898 |
 | two | 0.355989 | 0.295519 |
The stack() method “compresses” a level in the DataFrame’s columns:
stacked = df2.stack()
stacked
first second
bar one A -0.642310
B -0.121771
two A 0.105929
B -0.170706
baz one A 0.624519
B 1.795898
two A 0.355989
B 0.295519
dtype: float64
With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:
stacked.unstack()
first | second | A | B |
---|---|---|---|
bar | one | -0.642310 | -0.121771 |
 | two | 0.105929 | -0.170706 |
baz | one | 0.624519 | 1.795898 |
 | two | 0.355989 | 0.295519 |
stacked.unstack(1)
first | | one | two |
---|---|---|---|
bar | A | -0.642310 | 0.105929 |
 | B | -0.121771 | -0.170706 |
baz | A | 0.624519 | 0.355989 |
 | B | 1.795898 | 0.295519 |
stacked.unstack(0)
second | | bar | baz |
---|---|---|---|
one | A | -0.642310 | 0.624519 |
 | B | -0.121771 | 1.795898 |
two | A | 0.105929 | 0.355989 |
 | B | -0.170706 | 0.295519 |
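unstack() also accepts a level name instead of an integer position; given the index built above, the following is equivalent to stacked.unstack(1):

# Unstack by level name rather than position
stacked.unstack("second")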
See the section on Pivot Tables.
df = pd.DataFrame(
{
"A": ["one", "one", "two", "three"] * 3,
"B": ["A", "B", "C"] * 4,
"C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2,
"D": np.random.randn(12),
"E": np.random.randn(12),
}
)
df
  | A | B | C | D | E |
---|---|---|---|---|---|
0 | one | A | foo | 0.022691 | 0.250766 |
1 | one | B | foo | 1.246638 | 0.597852 |
2 | two | C | foo | 0.237767 | 1.409630 |
3 | three | A | bar | 0.781579 | 0.698842 |
4 | one | B | bar | -0.350703 | 1.788431 |
5 | one | C | bar | 2.225344 | 0.052856 |
6 | two | A | foo | 0.748157 | 0.376670 |
7 | three | B | foo | -1.509539 | -0.405203 |
8 | one | C | foo | -1.840205 | -0.195269 |
9 | one | A | bar | 1.051340 | -1.058422 |
10 | two | B | bar | 0.587531 | -0.431633 |
11 | three | C | bar | -0.191187 | -0.008472 |
We can produce pivot tables from this data very easily:
pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])
A | B | bar | foo |
---|---|---|---|
one | A | 1.051340 | 0.022691 |
 | B | -0.350703 | 1.246638 |
 | C | 2.225344 | -1.840205 |
three | A | 0.781579 | NaN |
 | B | NaN | -1.509539 |
 | C | -0.191187 | NaN |
two | A | NaN | 0.748157 |
 | B | 0.587531 | NaN |
 | C | NaN | 0.237767 |
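pivot_table defaults to the mean as its aggregation; the aggfunc and fill_value parameters let you change the reducer and fill the NaN holes. An extra example, not from the original:

# Explicit aggregation function and a fill value for missing cells
pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"],
               aggfunc="sum", fill_value=0)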
9. Time series

pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications. See the Time Series section.
rng = pd.date_range("1/1/2012", periods=1000, freq="S")
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts
2012-01-01 00:00:00 145
2012-01-01 00:00:01 198
2012-01-01 00:00:02 434
2012-01-01 00:00:03 39
2012-01-01 00:00:04 250
2012-01-01 00:00:05 36
2012-01-01 00:00:06 174
2012-01-01 00:00:07 424
2012-01-01 00:00:08 339
2012-01-01 00:00:09 194
2012-01-01 00:00:10 175
2012-01-01 00:00:11 476
2012-01-01 00:00:12 88
2012-01-01 00:00:13 61
2012-01-01 00:00:14 255
2012-01-01 00:00:15 21
2012-01-01 00:00:16 283
2012-01-01 00:00:17 437
2012-01-01 00:00:18 140
2012-01-01 00:00:19 246
2012-01-01 00:00:20 246
2012-01-01 00:00:21 354
2012-01-01 00:00:22 287
2012-01-01 00:00:23 429
2012-01-01 00:00:24 39
2012-01-01 00:00:25 367
2012-01-01 00:00:26 296
2012-01-01 00:00:27 384
2012-01-01 00:00:28 482
2012-01-01 00:00:29 457
...
2012-01-01 00:16:10 150
2012-01-01 00:16:11 180
2012-01-01 00:16:12 175
2012-01-01 00:16:13 16
2012-01-01 00:16:14 109
2012-01-01 00:16:15 413
2012-01-01 00:16:16 446
2012-01-01 00:16:17 220
2012-01-01 00:16:18 367
2012-01-01 00:16:19 465
2012-01-01 00:16:20 178
2012-01-01 00:16:21 348
2012-01-01 00:16:22 322
2012-01-01 00:16:23 24
2012-01-01 00:16:24 236
2012-01-01 00:16:25 496
2012-01-01 00:16:26 467
2012-01-01 00:16:27 400
2012-01-01 00:16:28 177
2012-01-01 00:16:29 267
2012-01-01 00:16:30 21
2012-01-01 00:16:31 115
2012-01-01 00:16:32 173
2012-01-01 00:16:33 66
2012-01-01 00:16:34 240
2012-01-01 00:16:35 287
2012-01-01 00:16:36 259
2012-01-01 00:16:37 288
2012-01-01 00:16:38 489
2012-01-01 00:16:39 335
Freq: S, Length: 1000, dtype: int64
ts.resample("3Min").sum()
2012-01-01 00:00:00 43769
2012-01-01 00:03:00 46206
2012-01-01 00:06:00 48664
2012-01-01 00:09:00 45263
2012-01-01 00:12:00 45174
2012-01-01 00:15:00 25726
Freq: 3T, dtype: int64
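resample() is a groupby-like operation, so any reducer works on the bins; ohlc() is a common choice for financial data. Shown here as additional examples, not from the original:

# Mean per 3-minute bin, and open/high/low/close summaries
ts.resample("3Min").mean()
ts.resample("3Min").ohlc()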
Time zone representation:
rng = pd.date_range("3/6/2012 00:00", periods=5, freq="D")
ts = pd.Series(np.random.randn(len(rng)), rng)
ts
2012-03-06 1.082849
2012-03-07 -0.260217
2012-03-08 -0.878703
2012-03-09 -0.883832
2012-03-10 0.832079
Freq: D, dtype: float64
ts_utc = ts.tz_localize("UTC")
ts_utc
2012-03-06 00:00:00+00:00 1.082849
2012-03-07 00:00:00+00:00 -0.260217
2012-03-08 00:00:00+00:00 -0.878703
2012-03-09 00:00:00+00:00 -0.883832
2012-03-10 00:00:00+00:00 0.832079
Freq: D, dtype: float64
Converting to another time zone:
ts_utc.tz_convert("US/Eastern")
2012-03-05 19:00:00-05:00 1.082849
2012-03-06 19:00:00-05:00 -0.260217
2012-03-07 19:00:00-05:00 -0.878703
2012-03-08 19:00:00-05:00 -0.883832
2012-03-09 19:00:00-05:00 0.832079
Freq: D, dtype: float64
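Any IANA time zone name works with tz_convert; for instance (an extra example, not in the original):

# Convert the same UTC series to another zone
ts_utc.tz_convert("Asia/Shanghai")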
Converting between time span representations:
rng = pd.date_range("1/1/2012", periods=5, freq="M")
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
2012-01-31 0.085870
2012-02-29 0.320371
2012-03-31 -0.144583
2012-04-30 0.259971
2012-05-31 0.031106
Freq: M, dtype: float64
ps = ts.to_period()
ps
2012-01 0.085870
2012-02 0.320371
2012-03 -0.144583
2012-04 0.259971
2012-05 0.031106
Freq: M, dtype: float64
ps.to_timestamp()
2012-01-01 0.085870
2012-02-01 0.320371
2012-03-01 -0.144583
2012-04-01 0.259971
2012-05-01 0.031106
Freq: MS, dtype: float64
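to_timestamp() anchors at the start of each period by default; passing how="end" anchors at the period end instead (an extra example, not in the original):

# Anchor converted timestamps at the end of each month
ps.to_timestamp(how="end")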
Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following example, we convert a quarterly frequency with year ending in November to 9am of the end of the month following the quarter end:
prng = pd.period_range("1990Q1", "2000Q4", freq="Q-NOV")
ts = pd.Series(np.random.randn(len(prng)), prng)
ts
1990Q1 0.205428
1990Q2 -0.468783
1990Q3 -0.660921
1990Q4 -0.603024
1991Q1 0.626573
1991Q2 0.930349
1991Q3 1.025994
1991Q4 0.757378
1992Q1 0.239052
1992Q2 -0.188778
1992Q3 0.400186
1992Q4 -1.182243
1993Q1 0.488901
1993Q2 -0.229461
1993Q3 -1.149555
1993Q4 -0.493716
1994Q1 0.358941
1994Q2 -0.862758
1994Q3 1.415536
1994Q4 0.667995
1995Q1 -0.082420
1995Q2 -0.131518
1995Q3 -0.942415
1995Q4 0.045751
1996Q1 0.542599
1996Q2 0.438003
1996Q3 -0.391305
1996Q4 -2.592706
1997Q1 0.799962
1997Q2 -0.667447
1997Q3 -0.166855
1997Q4 0.476623
1998Q1 -0.948281
1998Q2 0.508382
1998Q3 1.489794
1998Q4 -0.090221
1999Q1 -2.080581
1999Q2 0.944585
1999Q3 1.499972
1999Q4 -1.385293
2000Q1 1.545408
2000Q2 0.536199
2000Q3 -0.835179
2000Q4 -0.902938
Freq: Q-NOV, dtype: float64
# Take the quarter's last month ("e" = end), advance one month to the month
# following the quarter end, convert to an hourly period at that month's
# start ("s" = start), then add 9 hours -> 9am
ts.index = (prng.asfreq("M", "e") + 1).asfreq("H", "s") + 9
ts.head()
1990-03-01 09:00 0.205428
1990-06-01 09:00 -0.468783
1990-09-01 09:00 -0.660921
1990-12-01 09:00 -0.603024
1991-03-01 09:00 0.626573
Freq: H, dtype: float64
10. Categoricals
pandas can include categorical data in a DataFrame. For full docs, see the categorical introduction and the API documentation.
df = pd.DataFrame(
{"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
)
Converting the raw grades to a categorical data type:
df["grade"] = df["raw_grade"].astype("category")
df
  | id | raw_grade | grade |
---|---|---|---|
0 | 1 | a | a |
1 | 2 | b | b |
2 | 3 | b | b |
3 | 4 | a | a |
4 | 5 | a | a |
5 | 6 | e | e |
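The new column now carries a category dtype whose categories were inferred from the data; you can inspect them via the cat accessor:

# Categories inferred from the raw data: ['a', 'b', 'e']
df["grade"].cat.categories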
df["raw_grade"] = df["raw_grade"].astype("str")
df.info()
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
id 6 non-null int64
raw_grade 6 non-null object
grade 6 non-null category
dtypes: category(1), int64(1), object(1)
memory usage: 286.0+ bytes
Rename the categories to more meaningful names (assigning to Series.cat.categories operates in place!):
df["grade"].cat.categories = ["very good", "good", "very bad"]
df
  | id | raw_grade | grade |
---|---|---|---|
0 | 1 | a | very good |
1 | 2 | b | good |
2 | 3 | b | good |
3 | 4 | a | very good |
4 | 5 | a | very good |
5 | 6 | e | very bad |
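Note that assigning to Series.cat.categories is not allowed in recent pandas versions (the setter was removed around pandas 2.0); the supported equivalent is rename_categories:

# pandas >= 2.0 equivalent of the assignment above
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])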
Reorder the categories and simultaneously add the missing categories (methods under the Series.cat accessor return a new Series by default):
df["grade"] = df["grade"].cat.set_categories(
["very bad", "bad", "medium", "good", "very good"]
)
df
  | id | raw_grade | grade |
---|---|---|---|
0 | 1 | a | very good |
1 | 2 | b | good |
2 | 3 | b | good |
3 | 4 | a | very good |
4 | 5 | a | very good |
5 | 6 | e | very bad |
Sorting is per order in the categories, not lexical order:
df.sort_values(by="grade")
  | id | raw_grade | grade |
---|---|---|---|
5 | 6 | e | very bad |
1 | 2 | b | good |
2 | 3 | b | good |
0 | 1 | a | very good |
3 | 4 | a | very good |
4 | 5 | a | very good |
Grouping by a categorical column also shows empty categories:
df.groupby("grade").size()
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64
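Whether empty categories appear is controlled by groupby's observed parameter; to keep the behavior shown above explicit in newer pandas versions, pass observed=False:

# Keep unobserved categories in the result (the behavior shown above)
df.groupby("grade", observed=False).size()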