10 minutes to pandas

English original: 10 minutes to pandas. Series: pandas10minnutes_中英对照01, pandas10minnutes_中英对照02, pandas10minnutes_中英对照03 [pandas10minnutes_中英对照04, to be updated]

This part covers the following three sections:
1. Object creation
2. Viewing data
3. Selection

Import the packages:
import numpy as np
import pandas as pd

1. Object creation

Creating a Series by passing a list of values, letting pandas create a default integer index:

s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
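
The dtype in the output above is float64 even though most inputs are integers. A minimal sketch of why, not part of the original tutorial (the names ints and with_nan are illustrative): np.nan is a float, so its presence upcasts the whole Series.

```python
import numpy as np
import pandas as pd

# Without a missing value, a list of ints is stored with an integer dtype.
ints = pd.Series([1, 3, 5])

# np.nan is a float, so the whole Series is upcast to float64.
with_nan = pd.Series([1, 3, 5, np.nan, 6, 8])

print(ints.dtype)              # an integer dtype
print(with_nan.dtype)          # float64
print(with_nan.isna().sum())   # number of missing entries: 1
```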

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns.
First create the datetime index:

# Equivalent spellings of the start date:
# dates = pd.date_range("20130101", periods=6)
# dates = pd.date_range("2013-01-01", periods=6)
dates = pd.date_range("2013/01/01", periods=6, freq="D")
dates
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
# Create the DataFrame. np.random.randn is unseeded, so the values differ on every run.
# Equivalent: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=["A", "B", "C", "D"])
df
                   A         B         C         D
2013-01-01 -0.520896 -0.340412 -1.265841 -0.419562
2013-01-02 -0.270485  1.139635 -0.099596 -0.622623
2013-01-03  1.380236 -1.922205  1.406446 -1.534292
2013-01-04  1.049023  0.363657 -0.479516 -0.243051
2013-01-05  0.720896  0.821581  0.369389 -0.133051
2013-01-06 -0.337006 -0.329537  1.296696 -2.602595
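
Because the frame above is filled with unseeded random numbers, its values change on every run. The following is a sketch under assumptions, not the tutorial's own code: it swaps in NumPy's seeded Generator (the seed 0 and the names df_seeded and df_again are arbitrary choices) to make the data repeatable.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2013-01-01", periods=6, freq="D")

# A seeded generator makes the "random" frame repeatable across runs.
rng = np.random.default_rng(seed=0)  # the seed value 0 is an arbitrary choice
df_seeded = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

# Re-creating the generator with the same seed yields identical data.
rng_again = np.random.default_rng(seed=0)
df_again = pd.DataFrame(rng_again.standard_normal((6, 4)), index=dates, columns=list("ABCD"))

print(df_seeded.equals(df_again))  # True
```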

Creating a DataFrame by passing a dictionary of objects that can be converted into a series-like structure:

df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df2
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

The columns of the resulting DataFrame have different dtypes:

# Check the dtype of each column
df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
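
Per-column dtypes can also be used to pick columns. A small sketch, assuming the same dictionary as above (df2_demo and numeric are illustrative names), that keeps only the numeric columns with select_dtypes:

```python
import numpy as np
import pandas as pd

df2_demo = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)

# Keep only the numeric columns (float64, float32 and int32 here);
# the datetime, categorical and string columns are left out.
numeric = df2_demo.select_dtypes(include="number")
print(list(numeric.columns))  # ['A', 'C', 'D']
```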

If you're using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Typing df2. and pressing Tab lists the attributes that can be completed, and the columns A, B, C, and D appear among them. E and F are there as well; the full list is too long to reproduce here.

# Summary statistics for the numeric columns
df2.describe()
         A    C    D
count  4.0  4.0  4.0
mean   1.0  1.0  3.0
std    0.0  0.0  0.0
min    1.0  1.0  3.0
25%    1.0  1.0  3.0
50%    1.0  1.0  3.0
75%    1.0  1.0  3.0
max    1.0  1.0  3.0

2. Viewing data

Here is how to view the top and bottom rows of the frame:

# First rows (5 by default)
df.head()
                   A         B         C         D
2013-01-01 -0.520896 -0.340412 -1.265841 -0.419562
2013-01-02 -0.270485  1.139635 -0.099596 -0.622623
2013-01-03  1.380236 -1.922205  1.406446 -1.534292
2013-01-04  1.049023  0.363657 -0.479516 -0.243051
2013-01-05  0.720896  0.821581  0.369389 -0.133051
# Last 3 rows
df.tail(3)
                   A         B         C         D
2013-01-04  1.049023  0.363657 -0.479516 -0.243051
2013-01-05  0.720896  0.821581  0.369389 -0.133051
2013-01-06 -0.337006 -0.329537  1.296696 -2.602595

Display the index, columns:

# Show the index
df.index
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
# Show the column names
df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')

DataFrame.to_numpy() gives a NumPy representation of the underlying data. Note that this can be an expensive operation when your DataFrame has columns with different data types, which comes down to a fundamental difference between pandas and NumPy: NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.

For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and doesn't require copying data:

# to_numpy() is available since pandas 0.24; on older versions use df.values
df.to_numpy()

For df2, the DataFrame with multiple dtypes, DataFrame.to_numpy() is relatively expensive:

df2.to_numpy()
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

Note:
DataFrame.to_numpy() does not include the index or column labels in the output.
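
The dtype difference described above can be checked directly. A minimal sketch with two small illustrative frames (floats and mixed are made-up names, not from the tutorial):

```python
import pandas as pd

# Homogeneous frame: every column is float64.
floats = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})

# Mixed frame: a float column next to a string column.
mixed = pd.DataFrame({"x": [1.0, 2.0], "label": ["a", "b"]})

print(floats.to_numpy().dtype)  # float64
print(mixed.to_numpy().dtype)   # object: the only dtype that holds both floats and strings
print(floats.to_numpy().shape)  # (2, 2): index and column labels are not part of the array
```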

describe() shows a quick statistical summary of your data:

df.describe()
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.336962 -0.044547  0.204596 -0.925862
std    0.812650  1.096672  1.037980  0.961737
min   -0.520896 -1.922205 -1.265841 -2.602595
25%   -0.320375 -0.337693 -0.384536 -1.306375
50%    0.225206  0.017060  0.134896 -0.521093
75%    0.966991  0.707100  1.064869 -0.287179
max    1.380236  1.139635  1.406446 -0.133051
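
To make the rows of the summary concrete, here is a small sketch on an illustrative one-column frame (df_demo is a made-up name). It shows that the 50% row is the median and that std is the sample standard deviation (ddof=1):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]})
summary = df_demo.describe()

print(summary.loc["mean", "A"])  # 2.5
print(summary.loc["50%", "A"])   # 2.5, the median
# describe() reports the sample standard deviation (ddof=1), like Series.std().
print(np.isclose(summary.loc["std", "A"], df_demo["A"].std(ddof=1)))  # True
```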

Transposing your data:

df.T
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A   -0.520896   -0.270485    1.380236    1.049023    0.720896   -0.337006
B   -0.340412    1.139635   -1.922205    0.363657    0.821581   -0.329537
C   -1.265841   -0.099596    1.406446   -0.479516    0.369389    1.296696
D   -0.419562   -0.622623   -1.534292   -0.243051   -0.133051   -2.602595

Sorting by an axis:

# Sort by column name, descending
df.sort_index(axis=1, ascending=False)
                   D         C         B         A
2013-01-01 -0.419562 -1.265841 -0.340412 -0.520896
2013-01-02 -0.622623 -0.099596  1.139635 -0.270485
2013-01-03 -1.534292  1.406446 -1.922205  1.380236
2013-01-04 -0.243051 -0.479516  0.363657  1.049023
2013-01-05 -0.133051  0.369389  0.821581  0.720896
2013-01-06 -2.602595  1.296696 -0.329537 -0.337006
# Sort by row index, descending
df.sort_index(axis=0, ascending=False)
                   A         B         C         D
2013-01-06 -0.337006 -0.329537  1.296696 -2.602595
2013-01-05  0.720896  0.821581  0.369389 -0.133051
2013-01-04  1.049023  0.363657 -0.479516 -0.243051
2013-01-03  1.380236 -1.922205  1.406446 -1.534292
2013-01-02 -0.270485  1.139635 -0.099596 -0.622623
2013-01-01 -0.520896 -0.340412 -1.265841 -0.419562

Sorting by values:

df.sort_values(by="B")
                   A         B         C         D
2013-01-03  1.380236 -1.922205  1.406446 -1.534292
2013-01-01 -0.520896 -0.340412 -1.265841 -0.419562
2013-01-06 -0.337006 -0.329537  1.296696 -2.602595
2013-01-04  1.049023  0.363657 -0.479516 -0.243051
2013-01-05  0.720896  0.821581  0.369389 -0.133051
2013-01-02 -0.270485  1.139635 -0.099596 -0.622623
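
The sorting calls above return a new object rather than reordering df. A short sketch on an illustrative frame (df_demo and by_b_desc are made-up names) that sorts by B in descending order and confirms the original is untouched:

```python
import pandas as pd

df_demo = pd.DataFrame({"A": [3, 1, 2], "B": [0.5, -1.0, 2.0]}, index=["x", "y", "z"])

# sort_values returns a sorted copy; ascending=False puts the largest B first.
by_b_desc = df_demo.sort_values(by="B", ascending=False)

print(list(by_b_desc.index))  # ['z', 'x', 'y']
print(list(df_demo.index))    # ['x', 'y', 'z'] (original order is untouched)
```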

3. Selection

Note:
While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, .at, .iat, .loc and .iloc. See the indexing documentation Indexing and Selecting Data and MultiIndex / Advanced Indexing.

3.1 Getting

Selecting a single column, which yields a Series, equivalent to df.A:

df["A"]
2013-01-01   -0.520896
2013-01-02   -0.270485
2013-01-03    1.380236
2013-01-04    1.049023
2013-01-05    0.720896
2013-01-06   -0.337006
Freq: D, Name: A, dtype: float64

Selecting via [], which slices the rows:

df[0:3]
                   A         B         C         D
2013-01-01 -0.520896 -0.340412 -1.265841 -0.419562
2013-01-02 -0.270485  1.139635 -0.099596 -0.622623
2013-01-03  1.380236 -1.922205  1.406446 -1.534292
df["2013-01-02":"2013-05-04"]
                   A         B         C         D
2013-01-02 -0.270485  1.139635 -0.099596 -0.622623
2013-01-03  1.380236 -1.922205  1.406446 -1.534292
2013-01-04  1.049023  0.363657 -0.479516 -0.243051
2013-01-05  0.720896  0.821581  0.369389 -0.133051
2013-01-06 -0.337006 -0.329537  1.296696 -2.602595

3.2 Selection by label

See more in Selection by Label.
For getting a cross section using a label:

df.loc[dates[0]]
A   -0.520896
B   -0.340412
C   -1.265841
D   -0.419562
Name: 2013-01-01 00:00:00, dtype: float64

Selecting on a multi-axis by label:

df.loc[:, ["A", "B"]]
                   A         B
2013-01-01 -0.520896 -0.340412
2013-01-02 -0.270485  1.139635
2013-01-03  1.380236 -1.922205
2013-01-04  1.049023  0.363657
2013-01-05  0.720896  0.821581
2013-01-06 -0.337006 -0.329537

Showing label slicing, both endpoints are included:

df.loc["20130102":"20130104", ["A", "B"]]
                   A         B
2013-01-02 -0.270485  1.139635
2013-01-03  1.380236 -1.922205
2013-01-04  1.049023  0.363657
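
The inclusive endpoints of a label slice contrast with ordinary Python slicing by position, where the stop is excluded. A sketch with deterministic illustrative data (df_demo holds the integers 0 to 23 rather than the random values above):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2013-01-01", periods=6, freq="D")
df_demo = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Label slice: both 2013-01-02 and 2013-01-04 are included (3 rows).
by_label = df_demo.loc["20130102":"20130104", ["A", "B"]]

# Position slice: the stop position is excluded, so 1:4 is needed for the same rows.
by_position = df_demo.iloc[1:4, 0:2]

print(len(by_label))                 # 3
print(by_label.equals(by_position))  # True
print(len(df_demo.iloc[1:3]))        # 2
```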

Reduction in the dimensions of the returned object:

df.loc["20130102", ["A", "B"]]
A   -0.270485
B    1.139635
Name: 2013-01-02 00:00:00, dtype: float64

For getting a scalar value:

df.loc[dates[0], "A"]
-0.52089556678858

For getting fast access to a scalar (equivalent to the prior method):

df.at[dates[0], "A"]
-0.52089556678858
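
A brief sketch of how the two accessors relate, using illustrative data (df_demo, via_loc and via_at are made-up names): .loc is the general label indexer and also takes lists and slices, while .at takes exactly one row label and one column label.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2013-01-01", periods=6, freq="D")
df_demo = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list("ABCD"))

via_loc = df_demo.loc[dates[0], "A"]  # general label-based indexer
via_at = df_demo.at[dates[0], "A"]    # scalar-only accessor

print(via_loc == via_at)              # True

# .loc also accepts slices and lists, which .at does not.
block = df_demo.loc[dates[0]:dates[1], ["A", "B"]]
print(block.shape)                    # (2, 2)
```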

3.3 Selection by position

See more in Selection by Position.
Select via the position of the passed integers:

df.iloc[3]
A    1.049023
B    0.363657
C   -0.479516
D   -0.243051
Name: 2013-01-04 00:00:00, dtype: float64

By integer slices, acting similar to NumPy/Python:

df.iloc[3:5, 0:2]

By lists of integer position locations, similar to the NumPy/Python style:

df.iloc[[1, 2, 4], [0, 2]]
                   A         C
2013-01-02 -0.440009 -0.094901
2013-01-03 -1.095589  1.443271
2013-01-05 -0.826357  2.082919

For slicing rows explicitly:

df.iloc[1:3, :]
                   A         B         C         D
2013-01-02 -0.440009  0.666086 -0.094901  1.087610
2013-01-03 -1.095589  0.708428  1.443271 -0.012472

For slicing columns explicitly:

df.iloc[:, 1:3]
                   B         C
2013-01-01  0.129966  0.749187
2013-01-02  0.666086 -0.094901
2013-01-03  0.708428  1.443271
2013-01-04 -0.339991  0.584877
2013-01-05  0.072159  2.082919
2013-01-06 -0.746247  0.195187

For getting a value explicitly:

df.iloc[1, 1]
0.6660861685291358

For getting fast access to a scalar (equivalent to the prior method):

df.iat[1, 1]
0.6660861685291358
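
Position and label are easy to confuse when the index itself holds integers. A minimal sketch with an illustrative frame (df_demo is a made-up name) whose integer labels are deliberately out of order:

```python
import pandas as pd

# The index labels are integers, but not in positional order.
df_demo = pd.DataFrame({"A": [10, 20, 30]}, index=[2, 1, 0])

print(df_demo.iloc[0]["A"])  # 10: the first row by position
print(df_demo.loc[0, "A"])   # 30: the row whose label is 0
print(df_demo.iat[0, 0])     # 10: scalar access by position
```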

3.4 Boolean indexing

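Using a single column's values to select data. The two outputs below come from a run in which df had a month-end index, so the dates and values differ from the earlier tables: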

df[df["A"] > 0]
                   A         B         C         D
2013-02-28  1.671920 -1.603149 -0.154643 -0.752101
2013-03-31  0.116484 -0.519765  0.918146 -0.717562
2013-04-30  0.383318  0.410609  0.071098 -0.029965
2013-06-30  1.059115  0.402510  0.773409 -1.164358
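
Conditions on several columns can be combined. The sketch below is an illustration on made-up data (df_demo, both, either and not_a are hypothetical names): each comparison needs its own parentheses, and the element-wise operators &, | and ~ replace Python's and, or and not.

```python
import pandas as pd

df_demo = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [-1.0, -5.0, 6.0]})

# Rows where both conditions hold.
both = df_demo[(df_demo["A"] > 0) & (df_demo["B"] > 0)]

# Rows where at least one condition holds.
either = df_demo[(df_demo["A"] > 0) | (df_demo["B"] > 0)]

# Rows where the condition does not hold.
not_a = df_demo[~(df_demo["A"] > 0)]

print(list(both.index))    # [2]
print(list(either.index))  # [0, 2]
print(list(not_a.index))   # [1]
```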

Selecting values from a DataFrame where a boolean condition is met:

df[df > 0]
                   A         B         C   D
2013-01-31       NaN       NaN  1.407725 NaN
2013-02-28  1.671920       NaN       NaN NaN
2013-03-31  0.116484       NaN  0.918146 NaN
2013-04-30  0.383318  0.410609  0.071098 NaN
2013-05-31       NaN       NaN  0.362031 NaN
2013-06-30  1.059115  0.402510  0.773409 NaN
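
Indexing with a boolean DataFrame keeps the shape and fills the non-matching cells with NaN, which is what DataFrame.where does as well. A small sketch on illustrative data (df_demo and masked are made-up names):

```python
import pandas as pd

df_demo = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})

# Cells that fail the condition become NaN; the shape stays the same.
masked = df_demo[df_demo > 0]

print(masked.shape)                               # (2, 2)
print(int(masked.isna().sum().sum()))             # 2
print(masked.equals(df_demo.where(df_demo > 0)))  # True
```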

Using the isin() method for filtering:

df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]
df2
                   A         B         C         D      E
2013-01-01 -0.633072  0.129966  0.749187  1.201542    one
2013-01-02 -0.440009  0.666086 -0.094901  1.087610    one
2013-01-03 -1.095589  0.708428  1.443271 -0.012472    two
2013-01-04 -0.012166 -0.339991  0.584877 -0.930127  three
2013-01-05 -0.826357  0.072159  2.082919 -0.478526   four
2013-01-06 -0.357370 -0.746247  0.195187 -1.009280  three
df2[df2["E"].isin(["two", "four"])]
                   A         B         C         D     E
2013-01-03 -1.095589  0.708428  1.443271 -0.012472   two
2013-01-05 -0.826357  0.072159  2.082919 -0.478526  four
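
Under the hood, isin() just produces a boolean mask. A minimal sketch using the same labels as column E above (labels, mask and manual are illustrative names):

```python
import pandas as pd

labels = pd.Series(["one", "one", "two", "three", "four", "three"])

mask = labels.isin(["two", "four"])
print(mask.tolist())           # [False, False, True, False, True, False]

# The same mask spelled out with == and |.
manual = (labels == "two") | (labels == "four")
print(mask.equals(manual))     # True

# ~ inverts the mask to keep everything except the listed values.
print(labels[~mask].tolist())  # ['one', 'one', 'three', 'three']
```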

3.5 Setting

Setting a new column automatically aligns the data by the indexes:

# s1 starts one day after df, so 2013-01-01 gets NaN and the value 6 (2013-01-07) is dropped
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
df["F"] = s1
df
                   A         B         C         D    F
2013-01-01 -0.633072  0.129966  0.749187  1.201542  NaN
2013-01-02 -0.440009  0.666086 -0.094901  1.087610  1.0
2013-01-03 -1.095589  0.708428  1.443271 -0.012472  2.0
2013-01-04 -0.012166 -0.339991  0.584877 -0.930127  3.0
2013-01-05 -0.826357  0.072159  2.082919 -0.478526  4.0
2013-01-06 -0.357370 -0.746247  0.195187 -1.009280  5.0
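
Index alignment on assignment can be seen in isolation with a small frame. The sketch below uses illustrative data (df_demo is a made-up name) and a Series whose index is shifted by one day: labels missing from the Series become NaN, and Series labels missing from the frame are dropped.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2013-01-01", periods=6, freq="D")
df_demo = pd.DataFrame({"A": range(6)}, index=dates)

# The Series index runs from 2013-01-02 to 2013-01-07.
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("2013-01-02", periods=6, freq="D"))

# Assignment matches on index labels, not on position.
df_demo["F"] = s1

print(df_demo["F"].tolist())  # [nan, 1.0, 2.0, 3.0, 4.0, 5.0]
print(len(df_demo))           # 6: the frame's own index is unchanged
```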

Setting values by label:

df.at[dates[0], "A"] = 0
df
                   A         B         C         D    F
2013-01-01  0.000000  0.129966  0.749187  1.201542  NaN
2013-01-02 -0.440009  0.666086 -0.094901  1.087610  1.0
2013-01-03 -1.095589  0.708428  1.443271 -0.012472  2.0
2013-01-04 -0.012166 -0.339991  0.584877 -0.930127  3.0
2013-01-05 -0.826357  0.072159  2.082919 -0.478526  4.0
2013-01-06 -0.357370 -0.746247  0.195187 -1.009280  5.0

Setting values by position:

df.iat[0, 1] = 0
df
                   A         B         C         D    F
2013-01-01  0.000000  0.000000  0.749187  1.201542  NaN
2013-01-02 -0.440009  0.666086 -0.094901  1.087610  1.0
2013-01-03 -1.095589  0.708428  1.443271 -0.012472  2.0
2013-01-04 -0.012166 -0.339991  0.584877 -0.930127  3.0
2013-01-05 -0.826357  0.072159  2.082919 -0.478526  4.0
2013-01-06 -0.357370 -0.746247  0.195187 -1.009280  5.0

Setting by assigning with a NumPy array:

df.loc[:, "D"] = np.array([5] * len(df))
df
                   A         B         C  D    F
2013-01-01  0.000000  0.000000  0.749187  5  NaN
2013-01-02 -0.440009  0.666086 -0.094901  5  1.0
2013-01-03 -1.095589  0.708428  1.443271  5  2.0
2013-01-04 -0.012166 -0.339991  0.584877  5  3.0
2013-01-05 -0.826357  0.072159  2.082919  5  4.0
2013-01-06 -0.357370 -0.746247  0.195187  5  5.0

A where operation with setting:

df2 = df.copy()
df2[df2 > 0] = -df2
df2
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -0.749187 -5  NaN
2013-01-02 -0.440009 -0.666086 -0.094901 -5 -1.0
2013-01-03 -1.095589 -0.708428 -1.443271 -5 -2.0
2013-01-04 -0.012166 -0.339991 -0.584877 -5 -3.0
2013-01-05 -0.826357 -0.072159 -2.082919 -5 -4.0
2013-01-06 -0.357370 -0.746247 -0.195187 -5 -5.0
# Then map every negative value x to -x + 1
df2[df2 < 0] = -df2 + 1
df2
                   A         B         C  D    F
2013-01-01  0.000000  0.000000  1.749187  6  NaN
2013-01-02  1.440009  1.666086  1.094901  6  2.0
2013-01-03  2.095589  1.708428  2.443271  6  3.0
2013-01-04  1.012166  1.339991  1.584877  6  4.0
2013-01-05  1.826357  1.072159  3.082919  6  5.0
2013-01-06  1.357370  1.746247  1.195187  6  6.0
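
The masked assignment above can also be written without mutating a copy. This is an alternative sketch, not the tutorial's method: DataFrame.mask(cond, other) replaces the cells where cond is True with the matching cells of other (df_demo, flipped and alt are illustrative names).

```python
import pandas as pd

df_demo = pd.DataFrame({"A": [1.0, -2.0, 0.0], "B": [3.0, 4.0, -5.0]})

# In-place style from the text: overwrite positive cells with their negation.
flipped = df_demo.copy()
flipped[flipped > 0] = -flipped

# Equivalent expression that returns a new frame instead.
alt = df_demo.mask(df_demo > 0, -df_demo)

print(flipped.equals(alt))                # True
print(bool((flipped <= 0).all().all()))   # True: no positive values remain
```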

Original article: http://outofmemory.cn/langs/793290.html
