pandas 10 minutes


10 minutes to pandas (English original) · pandas10minnutes_中英对照01 · pandas10minnutes_中英对照02 · pandas10minnutes_中英对照03 · [pandas10minnutes_中英对照04, to be updated]

This part covers the following four sections:
7. Grouping
8. Reshaping
9. Time series
10. Categoricals

7. Grouping

By “group by” we are referring to a process involving one or more of the following steps:
Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure
See the Grouping section.
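Before the worked example below, here is a minimal hand-rolled sketch of those three steps; the toy DataFrame and variable names are purely illustrative and not part of the original example:

import pandas as pd

toy = pd.DataFrame({"key": ["x", "y", "x"], "val": [1, 2, 3]})

# Split: groupby() yields (group name, sub-DataFrame) pairs
pieces = {name: grp for name, grp in toy.groupby("key")}

# Apply: reduce each group independently
sums = {name: grp["val"].sum() for name, grp in pieces.items()}

# Combine: assemble the per-group results into a single Series
pd.Series(sums)   # x -> 4, y -> 2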

import numpy as np
import pandas as pd
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),})
df
     A      B         C         D
0  foo    one -1.688913 -2.514367
1  bar    one -1.252292  1.632814
2  foo    two  0.363461  1.565951
3  bar  three -0.330988 -1.242105
4  foo    two  1.387786  1.159832
5  bar    two  2.395909 -0.361804
6  foo    one -1.085930  0.089441
7  foo  three -0.694535  0.752959

Grouping and then applying the sum() function to the resulting groups:

df.groupby("A").sum()
            C         D
A
bar -2.250374 -0.334026
foo  1.244043  0.054678
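In the run above only the numeric columns C and D are aggregated; depending on the pandas version, the string column B may be dropped automatically or concatenated, so selecting the columns explicitly avoids the ambiguity (a minimal sketch):

df.groupby("A")[["C", "D"]].sum()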

Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function:

df.groupby(["A", "B"]).sum()
                  C         D
A   B
bar one   -1.312258  0.076088
    three -0.285997  0.058971
    two   -0.652119 -0.469085
foo one   -1.510840 -0.534755
    three  0.981995  0.419014
    two    1.772888  0.170420
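The same grouping can apply a different aggregation to each column via agg(); for example (a sketch on the same df):

df.groupby(["A", "B"]).agg({"C": "sum", "D": "mean"})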
8. Reshaping

See the sections on Hierarchical Indexing and Reshaping.

8.1 Stack
tuples = list(
    zip(
        *[
            ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
            ["one", "two", "one", "two", "one", "two", "one", "two"],
        ]
    )
)

tuples
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
index
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])
df2 = df[:4]
df2
                     A         B
first second
bar   one    -0.642310 -0.121771
      two     0.105929 -0.170706
baz   one     0.624519  1.795898
      two     0.355989  0.295519

The stack() method “compresses” a level in the DataFrame’s columns:

stacked = df2.stack()
stacked
first  second   
bar    one     A   -0.642310
               B   -0.121771
       two     A    0.105929
               B   -0.170706
baz    one     A    0.624519
               B    1.795898
       two     A    0.355989
               B    0.295519
dtype: float64

With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:

stacked.unstack()
                     A         B
first second
bar   one    -0.642310 -0.121771
      two     0.105929 -0.170706
baz   one     0.624519  1.795898
      two     0.355989  0.295519
stacked.unstack(1)
second        one       two
first
bar   A -0.642310  0.105929
      B -0.121771 -0.170706
baz   A  0.624519  0.355989
      B  1.795898  0.295519
stacked.unstack(0)
first          bar       baz
second
one    A -0.642310  0.624519
       B -0.121771  1.795898
two    A  0.105929  0.355989
       B -0.170706  0.295519
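unstack() also accepts a level name instead of a position, which reads more clearly when the MultiIndex levels are named; this sketch is equivalent to unstack(1) above:

stacked.unstack("second")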
8.2 Pivot tables

See the section on Pivot Tables.

df = pd.DataFrame(
    {
        "A": ["one", "one", "two", "three"] * 3,
        "B": ["A", "B", "C"] * 4,
        "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2,
        "D": np.random.randn(12),
        "E": np.random.randn(12),
    }
)

df
        A  B    C         D         E
0     one  A  foo  0.022691  0.250766
1     one  B  foo  1.246638  0.597852
2     two  C  foo  0.237767  1.409630
3   three  A  bar  0.781579  0.698842
4     one  B  bar -0.350703  1.788431
5     one  C  bar  2.225344  0.052856
6     two  A  foo  0.748157  0.376670
7   three  B  foo -1.509539 -0.405203
8     one  C  foo -1.840205 -0.195269
9     one  A  bar  1.051340 -1.058422
10    two  B  bar  0.587531 -0.431633
11  three  C  bar -0.191187 -0.008472

We can produce pivot tables from this data very easily:

pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])
C             bar       foo
A     B
one   A  1.051340  0.022691
      B -0.350703  1.246638
      C  2.225344 -1.840205
three A  0.781579       NaN
      B       NaN -1.509539
      C -0.191187       NaN
two   A       NaN  0.748157
      B  0.587531       NaN
      C       NaN  0.237767
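By default pivot_table() averages duplicate entries; its aggfunc and fill_value arguments change the aggregation and fill the NaN holes, for example (a sketch):

pd.pivot_table(
    df, values="D", index=["A", "B"], columns=["C"], aggfunc="sum", fill_value=0
)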
9. Time series

pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications. See the Time Series section.

rng = pd.date_range("1/1/2012", periods=1000, freq="S")
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts
2012-01-01 00:00:00    145
2012-01-01 00:00:01    198
2012-01-01 00:00:02    434
2012-01-01 00:00:03     39
2012-01-01 00:00:04    250
2012-01-01 00:00:05     36
2012-01-01 00:00:06    174
2012-01-01 00:00:07    424
2012-01-01 00:00:08    339
2012-01-01 00:00:09    194
2012-01-01 00:00:10    175
2012-01-01 00:00:11    476
2012-01-01 00:00:12     88
2012-01-01 00:00:13     61
2012-01-01 00:00:14    255
2012-01-01 00:00:15     21
2012-01-01 00:00:16    283
2012-01-01 00:00:17    437
2012-01-01 00:00:18    140
2012-01-01 00:00:19    246
2012-01-01 00:00:20    246
2012-01-01 00:00:21    354
2012-01-01 00:00:22    287
2012-01-01 00:00:23    429
2012-01-01 00:00:24     39
2012-01-01 00:00:25    367
2012-01-01 00:00:26    296
2012-01-01 00:00:27    384
2012-01-01 00:00:28    482
2012-01-01 00:00:29    457
                      ... 
2012-01-01 00:16:10    150
2012-01-01 00:16:11    180
2012-01-01 00:16:12    175
2012-01-01 00:16:13     16
2012-01-01 00:16:14    109
2012-01-01 00:16:15    413
2012-01-01 00:16:16    446
2012-01-01 00:16:17    220
2012-01-01 00:16:18    367
2012-01-01 00:16:19    465
2012-01-01 00:16:20    178
2012-01-01 00:16:21    348
2012-01-01 00:16:22    322
2012-01-01 00:16:23     24
2012-01-01 00:16:24    236
2012-01-01 00:16:25    496
2012-01-01 00:16:26    467
2012-01-01 00:16:27    400
2012-01-01 00:16:28    177
2012-01-01 00:16:29    267
2012-01-01 00:16:30     21
2012-01-01 00:16:31    115
2012-01-01 00:16:32    173
2012-01-01 00:16:33     66
2012-01-01 00:16:34    240
2012-01-01 00:16:35    287
2012-01-01 00:16:36    259
2012-01-01 00:16:37    288
2012-01-01 00:16:38    489
2012-01-01 00:16:39    335
Freq: S, Length: 1000, dtype: int64
ts.resample("3Min").sum()
2012-01-01 00:00:00    43769
2012-01-01 00:03:00    46206
2012-01-01 00:06:00    48664
2012-01-01 00:09:00    45263
2012-01-01 00:12:00    45174
2012-01-01 00:15:00    25726
Freq: 3T, dtype: int64
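resample() returns a grouper-like object, so aggregations other than sum() work the same way (a sketch):

ts.resample("3Min").mean()   # average value per 3-minute bin
ts.resample("5Min").ohlc()   # open/high/low/close per 5-minute bin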

Time zone representation:

rng = pd.date_range("3/6/2012 00:00", periods=5, freq="D")
ts = pd.Series(np.random.randn(len(rng)), rng)
ts
2012-03-06    1.082849
2012-03-07   -0.260217
2012-03-08   -0.878703
2012-03-09   -0.883832
2012-03-10    0.832079
Freq: D, dtype: float64
ts_utc = ts.tz_localize("UTC")
ts_utc
2012-03-06 00:00:00+00:00    1.082849
2012-03-07 00:00:00+00:00   -0.260217
2012-03-08 00:00:00+00:00   -0.878703
2012-03-09 00:00:00+00:00   -0.883832
2012-03-10 00:00:00+00:00    0.832079
Freq: D, dtype: float64

Converting to another time zone:

ts_utc.tz_convert("US/Eastern")
2012-03-05 19:00:00-05:00    1.082849
2012-03-06 19:00:00-05:00   -0.260217
2012-03-07 19:00:00-05:00   -0.878703
2012-03-08 19:00:00-05:00   -0.883832
2012-03-09 19:00:00-05:00    0.832079
Freq: D, dtype: float64
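Any IANA time-zone name can be used the same way, for example (a sketch):

ts_utc.tz_convert("Asia/Shanghai")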

Converting between time span representations:

rng = pd.date_range("1/1/2012", periods=5, freq="M")
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
2012-01-31    0.085870
2012-02-29    0.320371
2012-03-31   -0.144583
2012-04-30    0.259971
2012-05-31    0.031106
Freq: M, dtype: float64
ps = ts.to_period()
ps
2012-01    0.085870
2012-02    0.320371
2012-03   -0.144583
2012-04    0.259971
2012-05    0.031106
Freq: M, dtype: float64
ps.to_timestamp()
2012-01-01    0.085870
2012-02-01    0.320371
2012-03-01   -0.144583
2012-04-01    0.259971
2012-05-01    0.031106
Freq: MS, dtype: float64
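to_timestamp() anchors each period at its start by default; passing how="end" anchors at the period end instead (a sketch):

ps.to_timestamp(how="end")   # timestamps at the month ends instead of the month starts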

Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following example, we convert a quarterly frequency with year ending in November to 9am of the end of the month following the quarter end:

prng = pd.period_range("1990Q1", "2000Q4", freq="Q-NOV")
ts = pd.Series(np.random.randn(len(prng)), prng)
ts
1990Q1    0.205428
1990Q2   -0.468783
1990Q3   -0.660921
1990Q4   -0.603024
1991Q1    0.626573
1991Q2    0.930349
1991Q3    1.025994
1991Q4    0.757378
1992Q1    0.239052
1992Q2   -0.188778
1992Q3    0.400186
1992Q4   -1.182243
1993Q1    0.488901
1993Q2   -0.229461
1993Q3   -1.149555
1993Q4   -0.493716
1994Q1    0.358941
1994Q2   -0.862758
1994Q3    1.415536
1994Q4    0.667995
1995Q1   -0.082420
1995Q2   -0.131518
1995Q3   -0.942415
1995Q4    0.045751
1996Q1    0.542599
1996Q2    0.438003
1996Q3   -0.391305
1996Q4   -2.592706
1997Q1    0.799962
1997Q2   -0.667447
1997Q3   -0.166855
1997Q4    0.476623
1998Q1   -0.948281
1998Q2    0.508382
1998Q3    1.489794
1998Q4   -0.090221
1999Q1   -2.080581
1999Q2    0.944585
1999Q3    1.499972
1999Q4   -1.385293
2000Q1    1.545408
2000Q2    0.536199
2000Q3   -0.835179
2000Q4   -0.902938
Freq: Q-NOV, dtype: float64
ts.index = (prng.asfreq("M", "e") + 1).asfreq("H", "s") + 9
ts.head()
1990-03-01 09:00    0.205428
1990-06-01 09:00   -0.468783
1990-09-01 09:00   -0.660921
1990-12-01 09:00   -0.603024
1991-03-01 09:00    0.626573
Freq: H, dtype: float64
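To unpack that index-conversion expression step by step (a sketch; the intermediate variable names are only illustrative):

month_end = prng.asfreq("M", "e")     # each Q-NOV quarter -> its last month (1990Q1 -> 1990-02)
next_month = month_end + 1            # the month following the quarter end (1990-03)
hourly = next_month.asfreq("H", "s")  # monthly periods -> hourly periods anchored at the month start
ts.index = hourly + 9                 # shift by 9 hours -> 09:00 on the first day of that month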
10. Categoricals

pandas can include categorical data in a DataFrame. For full docs, see the categorical introduction and the API documentation.

df = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
)

Converting the raw grades to a categorical data type:

df["grade"] = df["raw_grade"].astype("category")
df
   id raw_grade grade
0   1         a     a
1   2         b     b
2   3         b     b
3   4         a     a
4   5         a     a
5   6         e     e
df["raw_grade"] = df["raw_grade"].astype("str")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
id           6 non-null int64
raw_grade    6 non-null object
grade        6 non-null category
dtypes: category(1), int64(1), object(1)
memory usage: 286.0+ bytes

Rename the categories to more meaningful names (assigning to Series.cat.categories is done in place!):

df["grade"].cat.categories = ["very good", "good", "very bad"]
df
   id raw_grade      grade
0   1         a  very good
1   2         b       good
2   3         b       good
3   4         a  very good
4   5         a  very good
5   6         e   very bad
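Note that assigning to Series.cat.categories is deprecated in newer pandas releases; a version-safe equivalent (a sketch) uses Series.cat.rename_categories():

df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])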

Reorder the categories and simultaneously add the missing categories (methods under the Series.cat accessor return a new Series by default):

df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)
df
   id raw_grade      grade
0   1         a  very good
1   2         b       good
2   3         b       good
3   4         a  very good
4   5         a  very good
5   6         e   very bad
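set_categories() can also mark the categories as ordered, which enables order-aware comparisons (a sketch):

df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"], ordered=True
)
df["grade"] > "medium"   # element-wise ordered comparison against one of the categories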

Sorting is per order in the categories, not lexical order:

df.sort_values(by="grade")
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good

Grouping by a categorical column also shows empty categories:

df.groupby("grade").size()
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64
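Whether the empty categories appear in the grouped result is controlled by the observed argument of groupby(), whose default has changed across pandas versions (a sketch):

df.groupby("grade", observed=False).size()   # keep empty categories (bad, medium)
df.groupby("grade", observed=True).size()    # only the categories actually present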

Original source: http://outofmemory.cn/langs/793319.html
