Pandas DataFrame Basics: Adding, Deleting, Modifying, Querying, Deduplicating, and Sampling

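Overview

pandas offers three main indexing accessors, plus two scalar shortcuts:

loc: label-based indexing, using row and column names

iloc: integer position-based indexing (absolute position), i.e. the n-th row and m-th column, counting from 0

ix: a hybrid of iloc and loc (deprecated in pandas 0.20 and removed in pandas 1.0; use loc or iloc instead)

at: a fast shortcut for loc when accessing a single value

iat: a fast shortcut for iloc when accessing a single value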
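To make the difference between the accessors concrete, here is a minimal sketch. The string row labels "x", "y", "z" are an assumption chosen for illustration, so that labels and positions cannot be confused.

```python
import pandas as pd

# Non-default string labels make the label/position distinction visible.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]}, index=["x", "y", "z"])

by_label = df.loc["y", "a"]        # row label "y", column label "a"
by_position = df.iloc[1, 0]        # second row, first column
scalar_label = df.at["y", "a"]     # scalar-only access, label-based
scalar_position = df.iat[1, 0]     # scalar-only access, position-based

# Label slices include both endpoints; positional slices exclude the stop.
label_slice = df.loc["x":"y"]      # rows "x" and "y"
position_slice = df.iloc[0:1]      # row "x" only

print(by_label, by_position, scalar_label, scalar_position)
print(len(label_slice), len(position_slice))
```

All four lookups return the same cell; the two slices differ in length because loc includes the end label while iloc excludes the end position.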

Build a test dataset:

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c'], 'c': ["A", "B", "C"]})
    print(df)
       a  b  c
    0  1  a  A
    1  2  b  B
    2  3  c  C

Row operations

Select a single row

    print(df.loc[1, :])
    a    2
    b    b
    c    B
    Name: 1, dtype: object

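Select multiple rows

    print(df.loc[1:2, :])    # rows labeled 1 through 2 (both ends included), step 1
       a  b  c
    1  2  b  B
    2  3  c  C

    print(df.loc[::-1, :])   # all rows with step -1, i.e. in reverse order
       a  b  c
    2  3  c  C
    1  2  b  B
    0  1  a  A

    print(df.loc[0:2:2, :])  # rows 0 through 2 with step 2; same as df.loc[::2, :] here because there are only 3 rows
       a  b  c
    0  1  a  A
    2  3  c  C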

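Conditional filtering

Basic conditional filtering

    print(df.loc[:, "a"] > 2)  # the comparison first produces a boolean Series, which is then used as a mask
    0    False
    1    False
    2     True
    Name: a, dtype: bool

    print(df.loc[df.loc[:, "a"] > 2, :])
       a  b  c
    2  3  c  C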

Conditions can also be combined with logical operators: | for or, & for and, and ~ for not. Wrap each condition in parentheses.

    In [129]: s = pd.Series(range(-3, 4))

    In [132]: s[(s < -1) | (s > 0.5)]
    Out[132]:
    0   -3
    1   -2
    4    1
    5    2
    6    3
    dtype: int64
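The same operators work on DataFrame rows. The following is a small sketch on the test dataset from the top of the article; the variable names both, either, and negated are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"], "c": ["A", "B", "C"]})

# Each comparison needs its own parentheses because & and | bind
# more tightly than comparison operators such as > and ==.
both = df.loc[(df["a"] > 1) & (df["b"] != "c"), :]     # and: row 1
either = df.loc[(df["a"] < 2) | (df["c"] == "C"), :]   # or: rows 0 and 2
negated = df.loc[~(df["a"] > 1), :]                    # not: row 0

print(both)
print(either)
print(negated)
```

Python's and, or, and not keywords cannot be used here, because they try to reduce a whole Series to a single truth value.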

isin

Using isin on values (not the index)

    import numpy as np  # needed for the examples below

    In [141]: s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')

    In [143]: s.isin([2, 4, 6])
    Out[143]:
    4    False
    3    False
    2     True
    1    False
    0     True
    dtype: bool

    In [144]: s[s.isin([2, 4, 6])]
    Out[144]:
    2    2
    0    4
    dtype: int64

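Using isin on the index

    In [145]: s[s.index.isin([2, 4, 6])]
    Out[145]:
    4    0
    2    2
    dtype: int64

    # compare it to the following
    # (s[[2, 4, 6]] raises a KeyError in pandas 1.0 and later because label 6 is missing,
    # so reindex is used here to get the NaN-filled result)
    In [146]: s.reindex([2, 4, 6])
    Out[146]:
    2    2.0
    4    0.0
    6    NaN
    dtype: float64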

Combining isin with any()/all() to filter on multiple columns

    In [151]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],
       .....:                    'ids2': ['a', 'n', 'c', 'n']})
       .....:

    In [156]: values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}

    In [157]: row_mask = df.isin(values).all(axis=1)

    In [158]: df[row_mask]
    Out[158]:
       vals ids ids2
    0     1   a    a

where()

where(cond, other) keeps the values where cond is True and replaces the rest with other, so df.where(df < 0, -df) makes every value non-positive. The numbers in Out[162] were evidently produced from a different random draw than the frame shown in Out[3].

    In [1]: dates = pd.date_range('1/1/2000', periods=8)

    In [2]: df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])

    In [3]: df
    Out[3]:
                       A         B         C         D
    2000-01-01  0.469112 -0.282863 -1.509059 -1.135632
    2000-01-02  1.212112 -0.173215  0.119209 -1.044236
    2000-01-03 -0.861849 -2.104569 -0.494929  1.071804
    2000-01-04  0.721555 -0.706771 -1.039575  0.271860
    2000-01-05 -0.424972  0.567020  0.276232 -1.087401
    2000-01-06 -0.673690  0.113648 -1.478427  0.524988
    2000-01-07  0.404705  0.577046 -1.715002 -1.039268
    2000-01-08 -0.370647 -1.157892 -1.344312  0.844885

    In [162]: df.where(df < 0, -df)
    Out[162]:
                       A         B         C         D
    2000-01-01 -2.104139 -1.309525 -0.485855 -0.245166
    2000-01-02 -0.352480 -0.390389 -1.192319 -1.655824
    2000-01-03 -0.864883 -0.299674 -0.227870 -0.281059
    2000-01-04 -0.846958 -1.222082 -0.600705 -1.233203
    2000-01-05 -0.669692 -0.605656 -1.169184 -0.342416
    2000-01-06 -0.868584 -0.948458 -2.297780 -0.684718
    2000-01-07 -2.670153 -0.114722 -0.168904 -0.048048
    2000-01-08 -0.801196 -1.392071 -0.048788 -0.808838

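How DataFrame.where() differs from numpy.where()

The signatures differ: df.where(cond, other) is roughly equivalent to np.where(cond, df, other), with the DataFrame itself taking the place of numpy's second argument.

    In [172]: df.where(df < 0, -df) == np.where(df < 0, df, -df)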
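The comparison above can be checked on a small fixed frame. This is a sketch with hand-picked values (an assumption, chosen so the result is easy to verify by eye) rather than the random data used in the article.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})

# pandas: keep values where the condition holds, otherwise take them from -df.
pandas_result = df.where(df < 0, -df)

# numpy: same rule, but the data is passed explicitly as the second argument
# and the result is a plain ndarray without index or column labels.
numpy_result = np.where(df < 0, df, -df)

print(pandas_result)
print(type(numpy_result).__name__)
print((pandas_result.to_numpy() == numpy_result).all())
```

The values agree element by element; what differs is the return type, since DataFrame.where() preserves the index and column labels.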

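When where() is called on a Series, it returns a Series of the same shape as the original, with NaN wherever the condition is False. Boolean indexing, by contrast, returns only the matching entries.

    In [141]: s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')

    In [159]: s[s > 0]
    Out[159]:
    3    1
    2    2
    1    3
    0    4
    dtype: int64

    In [160]: s.where(s > 0)
    Out[160]:
    4    NaN
    3    1.0
    2    2.0
    1    3.0
    0    4.0
    dtype: float64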

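Sampling

    DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)

When sampling with weights, rows with a missing weight are treated as having weight 0. If the weights do not sum to 1, each weight is divided by the total so that they do. random_state sets the seed for the random draw, and axis can be set to sample columns instead of rows.

    In [105]: df2 = pd.DataFrame({'col1': [9, 8, 7, 6], 'weight_column': [0.5, 0.4, 0.1, 0]})

    In [106]: df2.sample(n=3, weights='weight_column')
    Out[106]:
       col1  weight_column
    1     8            0.4
    0     9            0.5
    2     7            0.1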
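The other parameters can be exercised the same way. The sketch below reuses df2 from the example above; the seed value 0 is an arbitrary assumption, and no particular rows are claimed as the expected draw.

```python
import pandas as pd

df2 = pd.DataFrame({"col1": [9, 8, 7, 6], "weight_column": [0.5, 0.4, 0.1, 0]})

# Same seed, same rows: random_state makes the draw repeatable.
first = df2.sample(n=2, random_state=0)
second = df2.sample(n=2, random_state=0)

# A weight of 0 means the row can never be drawn (row 3 here).
weighted = df2.sample(n=3, weights="weight_column", random_state=0)

# frac draws a fraction of the rows; axis=1 samples columns instead of rows.
half = df2.sample(frac=0.5, random_state=0)
one_column = df2.sample(n=1, axis=1, random_state=0)

print(first.equals(second))
print(sorted(weighted.index))
print(half.shape, one_column.shape)
```

Passing a column name as weights only works when sampling rows, which is why the axis=1 call above does not use it.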

Add a row

    # df is the 3-row test dataset built at the beginning
    df.loc[3, :] = 4
    print(df)
         a  b  c
    0  1.0  a  A
    1  2.0  b  B
    2  3.0  c  C
    3  4.0  4  4

Insert a row

pandas has no built-in method for inserting a row at a given position, so you have to assemble it yourself:

    line = pd.DataFrame({df.columns[0]: "--", df.columns[1]: "--", df.columns[2]: "--"}, index=[1])
    # df.loc[:0] cannot be written as df.loc[0] here, because df.loc[0] returns a Series
    df = pd.concat([df.loc[:0], line, df.loc[1:]]).reset_index(drop=True)
    print(df)
         a   b   c
    0  1.0   a   A
    1   --  --  --
    2  2.0   b   B
    3  3.0   c   C
    4  4.0   4   4
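The split-and-concatenate idea above can be wrapped in a small helper. insert_row is a hypothetical function written for this sketch, not a pandas API, and it uses iloc so that the split is by position rather than by label.

```python
import pandas as pd


def insert_row(df, position, row):
    """Return a new DataFrame with `row` (a dict) inserted before `position`.

    Hypothetical helper for illustration; pandas does not provide it.
    """
    new_line = pd.DataFrame([row], columns=df.columns)
    top = df.iloc[:position]       # rows before the insertion point
    bottom = df.iloc[position:]    # rows from the insertion point onward
    return pd.concat([top, new_line, bottom]).reset_index(drop=True)


df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"], "c": ["A", "B", "C"]})
result = insert_row(df, 1, {"a": 0, "b": "--", "c": "--"})
print(result)
```

Because every call copies the data, this approach suits occasional inserts; for many rows it is usually cheaper to collect them first and concatenate once.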

Swap rows

    # starting again from the original 3-row test dataset
    df.loc[[1, 2], :] = df.loc[[2, 1], :].values
    print(df)
       a  b  c
    0  1  a  A
    1  3  c  C
    2  2  b  B

Delete a row

    df.drop(0, axis=0, inplace=True)
    print(df)
       a  b  c
    1  2  b  B
    2  3  c  C
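drop can also return a new DataFrame instead of modifying the original, and it accepts several labels at once. A short sketch on the test dataset follows; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"], "c": ["A", "B", "C"]})

dropped_one = df.drop(index=0)         # returns a copy; df itself is unchanged
dropped_many = df.drop(index=[0, 2])   # drop several row labels at once
filtered = df[df["a"] != 2]            # "delete" rows by keeping only the others

print(dropped_one)
print(dropped_many)
print(filtered)
```

Leaving out inplace=True keeps the original frame intact, which makes it easier to rerun a step without rebuilding the data.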

Note

In a DataFrame indexed by dates, .loc slices must use labels compatible with the index, such as date strings. Integer slices like dfl.loc[2:3] raise a TypeError; use iloc for integer positions.

    In [39]: dfl = pd.DataFrame(np.random.randn(5, 4), columns=list('ABCD'), index=pd.date_range('20130101', periods=5))

    In [40]: dfl
    Out[40]:
                       A         B         C         D
    2013-01-01  1.075770 -0.109050  1.643563 -1.469388
    2013-01-02  0.357021 -0.674600 -1.776904 -0.968914
    2013-01-03 -1.294524  0.413738  0.276662 -0.472035
    2013-01-04 -0.013960 -0.362543 -0.006154 -0.923061
    2013-01-05  0.895717  0.805244 -1.206412  2.565646

    In [41]: dfl.loc['20130102':'20130104']
    Out[41]:
                       A         B         C         D
    2013-01-02  0.357021 -0.674600 -1.776904 -0.968914
    2013-01-03 -1.294524  0.413738  0.276662 -0.472035
    2013-01-04 -0.013960 -0.362543 -0.006154 -0.923061
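The different ways of slicing a date-indexed frame can be compared side by side. This sketch uses np.arange instead of random numbers (an assumption made so the results are reproducible).

```python
import numpy as np
import pandas as pd

dfl = pd.DataFrame(np.arange(20).reshape(5, 4), columns=list("ABCD"),
                   index=pd.date_range("20130101", periods=5))

# Label-based slices accept date strings or Timestamps; both ends are included.
by_string = dfl.loc["20130102":"20130104"]
by_timestamp = dfl.loc[pd.Timestamp("2013-01-02"):pd.Timestamp("2013-01-04")]

# Position-based slice: positions 1, 2 and 3 (the stop is excluded).
by_position = dfl.iloc[1:4]

# An integer slice is not compatible with a DatetimeIndex under .loc.
try:
    dfl.loc[2:3]
    integer_slice_error = None
except TypeError as exc:
    integer_slice_error = exc

print(by_string.equals(by_position))
print(type(integer_slice_error).__name__)
```

All three successful slices select the same three rows; only the way the boundaries are expressed differs.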

Column operations

Select a single column

    print(df.loc[:, "a"])
    0    1
    1    2
    2    3
    Name: a, dtype: int64

Select multiple columns

    print(df.loc[:, "a":"b"])  # label slice, both ends included
       a  b
    0  1  a
    1  2  b
    2  3  c

Add a column (if the column already exists, this assigns new values to it)

    df.loc[:, "d"] = 4
    print(df)
       a  b  c  d
    0  1  a  A  4
    1  2  b  B  4
    2  3  c  C  4

Swap the values of two columns

    df.loc[:, ['b', 'a']] = df.loc[:, ['a', 'b']].values
    print(df)
       a  b  c
    0  a  1  A
    1  b  2  B
    2  c  3  C
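Swapping values is different from reordering columns: in the first case the labels stay in place and the data moves, in the second the labels move together with their data. A minimal sketch of both follows; it uses plain column assignment instead of .loc as a design choice for this example, since that replaces the columns outright.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"], "c": ["A", "B", "C"]})

# Reorder: each label keeps its own data, only the column order changes.
reordered = df[["b", "a", "c"]]

# Swap values: labels stay where they are, the data underneath is exchanged.
# to_numpy() strips the labels so pandas does not realign the columns by name.
swapped = df.copy()
swapped[["a", "b"]] = swapped[["b", "a"]].to_numpy()

print(reordered)
print(swapped)
```

Without to_numpy() (or .values, as in the article), pandas would align the right-hand side by column name and nothing would change.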

Delete a column

1) Use del directly: del df['column-name']

2) Use the drop method; the following three expressions are equivalent: