Pandas Quick Start Guide

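Contents

- 1. Creating Pandas data structures
  - (1) Series objects
  - (2) DataFrame objects
- 2. Basic functionality
  - (1) Viewing the head and tail of data
  - (2) Basic attributes
  - (3) Broadcasting
  - (4) Statistics
  - (5) Label operations
  - (6) Iteration
  - (7) Sorting
    - a. Sorting by index
    - b. Sorting by value
  - (8) I/O operations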

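Before reading this article, you should have some familiarity with NumPy. If you are not yet comfortable with it, see my other article:
NumPy Quick Start Guide

This article covers only the common operations used to clean and normalize data during preprocessing; advanced Pandas features are touched on only briefly.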

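Introduction to Pandas
Pandas is a Python library for working with tabular datasets. It makes exploring and processing data straightforward, and it is usually combined with NumPy for numerical work and Matplotlib for visualization. Basic statistics such as the mean, median, maximum and minimum are each a single method call. This article introduces the basic usage of Pandas so that readers can get started quickly.
Before using it, import Pandas along with NumPy, which serves as a helper library:

>>> import pandas as pd # pd is the conventional alias
>>> import numpy as np # np is the conventional alias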

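1. Creating Pandas data structures

Pandas has two main data structures, Series and DataFrame. A Series holds one-dimensional data, while a DataFrame holds two-dimensional data.

(1) Series objects

A Series can be created like this: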

>>> s = pd.Series([1,2,3,np.nan,5,6])
>>> s
0    1.0
1    2.0
2    3.0
3    NaN   # NaN means "not a number"; it is the standard missing-value marker in Pandas
4    5.0
5    6.0
dtype: float64

Notice that a Series gets a default integer index when it is created. We can also supply the index ourselves:

>>> s = pd.Series(['D','5.0','str'],index = [4,5,7])
>>> s
4      D
5    5.0
7    str
dtype: object

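The index has been replaced by the one we supplied, and the data we pass in can be of any type (here strings, so the dtype is object).

A Series can also be created from a NumPy ndarray: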

>>> s = pd.Series(np.random.randn(5),index = ['a','b','c','d','e'])
>>> s
a   -0.516244
b   -1.249185
c    0.704626
d   -0.430690
e   -0.297458
dtype: float64

The index of a Series is available through its index attribute:

>>> s.index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

A Series can also be created from a dictionary:

>>> d = {'a':1,'b':2,'c':3}
>>> s = pd.Series(d)
>>> s
a    1
b    2
c    3
dtype: int64

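When creating a Series from a dictionary, you can also pass an index. Any index label that is not a key of the dictionary gets NaN as its value: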

>>> d = {'a':1,'b':2,'c':3}
>>> pd.Series(d,index=['a','b','d','c'])
a    1.0
b    2.0
d    NaN
c    3.0
dtype: float64

A Series can also be created from a scalar. In that case an index must be supplied, and the scalar is repeated to match the length of the index:

>>> pd.Series(6,index=[1,2,3,4,5])
1    6
2    6
3    6
4    6
5    6
dtype: int64

Many Series operations resemble those of a NumPy ndarray, and most NumPy functions can be applied to a Series directly. Consider the following examples:

>>> s = pd.Series(np.random.randn(5),index = ['a','b','c','d','e'])
>>> s
a   -1.179425
b    1.737900
c   -1.817220
d    0.635392
e   -0.008855
dtype: float64

>>> s.iloc[0] # the element at position 0
-1.1794246984641934

>>> s[3:] # elements from position 3 to the end (position 3 included)
d    0.635392
e   -0.008855
dtype: float64

>>> s[s > s.median()] # elements of s greater than the median of s
b    1.737900
d    0.635392
dtype: float64

>>> np.exp(s) # e raised to the power of each element of s
a    0.307456
b    5.685393
c    0.162477
d    1.887762
e    0.991184
dtype: float64

The dtype attribute of a Series tells us the type of the data it holds:

>>> s.dtype
dtype('float64')

Series.array returns the array that backs the Series, as in the following example:

>>> s.array
<PandasArray>
[-1.1794246984641934,  1.7379002494230982, -1.8172198722285076,
  0.6353918842322767, -0.0088550837527395]
Length: 5, dtype: float64

To convert a Series into a NumPy ndarray, use to_numpy():

>>> s.to_numpy()
array([-1.1794247 ,  1.73790025, -1.81721987,  0.63539188, -0.00885508])

When the index is not the default one (0, 1, ...), we can look values up by label, and test whether a label is present with in:

>>> s['a']
-1.1794246984641934

>>> 'e' in s
True

>>> 'f' in s
False

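Looking up a missing label with s['f'] raises a KeyError. The get() method instead returns None, or a default value that you specify:

>>> s.get('f')
# returns None, so nothing is printed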
>>> s.get('f',np.nan)
nan
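
The difference between bracket lookup and get() can be sketched with a small Series of fixed values. The values and the variable names below are chosen only for illustration.

```python
import pandas as pd

s = pd.Series([1.5, 2.5], index=["a", "b"])

# Bracket lookup of a label that does not exist raises KeyError.
try:
    s["f"]
    missing_raises = False
except KeyError:
    missing_raises = True

# get() returns None, or the supplied default, instead of raising.
no_default = s.get("f")
with_default = s.get("f", 0.0)

print(missing_raises, no_default, with_default)
```

get() is convenient when a missing label is an expected case rather than an error.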

Basic arithmetic on a Series also works much as it does in NumPy, as the following examples show:

>>> s = pd.Series(np.random.randn(4),index = ['a','b','c','d'])
>>> s
a    1.037190
b    0.226480
c    0.507962
d   -1.296568
dtype: float64

>>> s + s # add s to itself
a    2.074381
b    0.452961
c    1.015925
d   -2.593136
dtype: float64

>>> s * 2 # multiply s by 2
a    2.074381
b    0.452961
c    1.015925
d   -2.593136
dtype: float64

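Note that operations between Series are aligned on index labels: values are matched by label, not by position. Two Series with different numbers of elements can therefore still be combined. The result covers the union of both indexes, and any label that is missing from either side gets NaN:

>>> s[:2] + s[1:] # add the first two elements of s to the elements from position 1 onward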
a         NaN
b    0.452961
c         NaN
d         NaN
dtype: float64
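
Label alignment is easier to follow with fixed values than with random ones. The sketch below uses two small Series, `left` and `right`, whose names and values are made up for illustration.

```python
import pandas as pd

left = pd.Series([1.0, 2.0], index=["a", "b"])
right = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# The + operator matches values by label; a label that exists on only
# one side produces NaN in the result.
aligned = left + right

# dropna() keeps only the labels that were present on both sides.
overlap = aligned.dropna()

print(aligned)
print(overlap)
```

Only the label b exists in both Series, so it is the only entry with a numeric result.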

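After creating a Series we can give it a name, read it back through the name attribute, and use rename() to get a copy of the Series under a new name, as in the following example:

>>> s = pd.Series(np.random.randn(3),name='a Series object')
>>> s.name
'a Series object'

>>> s1 = s.rename('a new name')
>>> s1.name
'a new name'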

(2) DataFrame objects

There are many ways to create a DataFrame. One is to build it from a dictionary of Series, where each key becomes a column name:

>>> d = {
... 'one':pd.Series([1,2,3],index=['a','b','c']),
... 'two':pd.Series([4,5,6,7],index=['a','b','c', 'd'])
... }
>>> df = pd.DataFrame(d)
>>> df
	one	two
a	1.0	  4
b	2.0	  5
c	3.0	  6
d	NaN	  7

>>> df = pd.DataFrame(d,index = ['a','b','c','d'],
... columns=['two','three'])
>>> df
	two	three
a	4	NaN
b	5	NaN
c	6	NaN
d	7	NaN

The index and columns attributes return the row labels and the column names of a DataFrame:

>>> df.index
Index(['a', 'b', 'c', 'd'], dtype='object')

>>> df.columns
Index(['two', 'three'], dtype='object')

A DataFrame can also be created from a dictionary of ndarrays or lists. Again, the dictionary keys become the column names, as in the following example:

>>> d = {'first':[1,2,3],'second':[4,5,6]}
>>> df = pd.DataFrame(d)
>>> df
    first	second
0	    1	    4
1	    2	    5
2	    3	    6

A DataFrame can also be created from a list of dictionaries:

>>> data = [{'a':1,'b':2},{'c':3,'d':4}]
>>> df = pd.DataFrame(data)
>>> df
	a	b	c	d
0	1.0	2.0	NaN	NaN
1	NaN	NaN	3.0	4.0

A DataFrame can also be built from existing Series objects:

>>> s1 = pd.Series(np.random.randn(4))
>>> s2 = pd.Series(np.random.randn(4))
>>> pd.DataFrame({'one':s1,'two':s2})
	one	        two
0	1.761223	0.561849
1	0.896577	1.826814
2	1.841716	0.056537
3	1.103170	-1.337749

Using column names, we can select, add and delete the columns of a DataFrame:

>>> df = pd.DataFrame([{'a':1,'b':2},{'a':3,'b':4},{'a':5,'b':6}],
... index=['one','two','three'])
>>> df
	    a	b
one	    1	2
two	    3	4
three	5	6

>>> df['a']
one      1
two      3
three    5
Name: a, dtype: int64

>>> df['c'] = df['a'] * df['b']
>>> df
	    a	b	c
one	    1	2	2
two	    3	4	12
three	5	6	30

>>> df['flag'] = df['c']>10
>>> df
    	a	b	c	flag
one	    1	2	2	False
two	    3	4	12	True
three	5	6	30	True

>>> del df['flag'] # delete the 'flag' column
>>> df.pop('c') # remove the 'c' column and return it (returned Series not shown), much like popping from a stack
>>> df
	    a	b
one	    1	2
two	    3	4
three	5	6

We can also assign a scalar to a column directly, or assign part of one column to a new column:

>>> df['new'] = 'Hello' # set every value in the new column to Hello
>>> df
	    a	b	new
one	    1	2	Hello
two	    3	4	Hello
three	5	6	Hello

>>> df['new'] = df['b'][:1] # assign only the first element of b (positions 0 up to, but not including, 1); the other rows become NaN
>>> df
	    a	b	new
one	    1	2	2.0
two	    3	4	NaN
three	5	6	NaN

The insert() method inserts a column at a given position:

>>> df.insert(1,'insert',df['b']) # insert a column named 'insert' at position 1, filled with the values of column b
>>> df
	    a	insert	b	new
one	    1	2	2	2.0
two 	3	4	4	NaN
three	5	6	6	NaN

A DataFrame can be indexed and sliced in several ways, summarized in the table below:

Operation	Syntax	Result
Select a column	df[col]	Series
Select a row by label	df.loc[label]	Series
Select a row by integer position	df.iloc[loc]	Series
Slice rows	df[4:8]	DataFrame
Select rows with a boolean vector	df[bool_vec]	DataFrame

Here are a few examples:

>>> df.loc['one'] # select the row labeled one; the result is a Series
a         1.0
insert    2.0
b         2.0
new       2.0
Name: one, dtype: float64

>>> df.iloc[2] # select the row at position 2 (three); the result is a Series
a         5.0
insert    6.0
b         6.0
new       NaN
Name: three, dtype: float64
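
The selection forms listed in the table can be tried side by side on one small DataFrame. This is a minimal sketch; the frame and the variable names are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 3, 5], "b": [2, 4, 6]},
                  index=["one", "two", "three"])

col = df["a"]                 # one column -> Series
row_by_label = df.loc["two"]  # one row by label -> Series
row_by_pos = df.iloc[0]       # one row by integer position -> Series
first_two = df[0:2]           # a slice of rows -> DataFrame
big = df[df["a"] > 1]         # boolean vector -> DataFrame

print(type(col).__name__, type(first_two).__name__)
print(big)
```

Selecting a single row or column gives a Series, while slices and boolean masks keep the result two-dimensional.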

The .T attribute transposes a DataFrame:

>>> df.T
	    one	two	three
a	    1.0	3.0	5.0
insert	2.0	4.0	6.0
b	    2.0	4.0	6.0
new	    2.0	NaN	NaN

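Column names are also exposed as attributes of the DataFrame, so a column can be accessed directly (this works when the name is a valid Python identifier that does not clash with an existing DataFrame attribute or method):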

>>> df.a
one      1
two      3
three    5
Name: a, dtype: int64

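2. Basic functionality

This section introduces some basic features of Pandas. We start by creating a few objects to use later:

>>> date = pd.date_range('2022/1/1',periods=8) # create a sequence of dates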
>>> date
DatetimeIndex(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
               '2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08'],
              dtype='datetime64[ns]', freq='D')

>>> s = pd.Series(np.random.randn(5),index = ['a','b','c','d','e'])
>>> df = pd.DataFrame(np.random.randn(8,3),index=date,columns=['A','B','C'])

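(1) Viewing the head and tail of data

The head() and tail() methods show the first and last rows of the data. The default is 5 rows, and you can pass the number of rows you want to see, as in the following example: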

>>> long_series = pd.Series(np.random.randn(1000))
>>> long_series.head()
0   -0.822034
1    0.918064
2   -0.029024
3   -0.453670
4    0.148209
dtype: float64

>>> long_series.tail(3) # show the last 3 rows
997    0.482723
998   -1.188191
999    0.384464
dtype: float64

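(2) Basic attributes

Series and DataFrame objects have the following attributes:
shape, which gives the dimensions of the object;
axis labels (index, and columns for a DataFrame), which give the labels along each axis.

Consider the following example:

>>> df.shape
(8, 3)

>>> df.columns = [x.lower() for x in df.columns] # convert the column labels of df to lowercase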
>>> df.head()
	          a	         b      	 c
2022-01-01	0.278170	0.699225	0.848391
2022-01-02	0.145116	2.537738	-1.286354
2022-01-03	1.715956	0.716130	-0.923585
2022-01-04	-0.743891	0.571609	-1.313765
2022-01-05	-0.779608	-0.757700	0.469854

The .array attribute turns an Index or a Series into an array:

>>> s.index.array
<PandasArray>
['a', 'b', 'c', 'd', 'e']
Length: 5, dtype: object

>>> s.array 
<PandasArray>
[-0.30892022321403717,  -1.3042857117851279,  0.12166198582976105,
   0.3368197766037855,  0.10975093002310791]
Length: 5, dtype: float64

to_numpy() or numpy.asarray() converts them into a NumPy ndarray:

>>> s.to_numpy()
array([-0.30892022, -1.30428571,  0.12166199,  0.33681978,  0.10975093])

>>> np.asarray(s)
array([-0.30892022, -1.30428571,  0.12166199,  0.33681978,  0.10975093])

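(3) Broadcasting

Pandas provides methods for arithmetic: add(), sub(), mul() and div(), which correspond to addition, subtraction, multiplication and division. The following examples demonstrate how they are used: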

>>> df = pd.DataFrame(
...  {
...      'one':pd.Series(np.random.randn(3),index=['a','b','c']),
...      'two':pd.Series(np.random.randn(4),index=['a','b','c','d']),
...      'three':pd.Series(np.random.randn(2),index=['b','c'])
...  })
>>> df
	one	        two	        three
a	0.266653	-0.212135	NaN
b	0.870522	0.298908	-0.525418
c	-1.113338	-1.510115	0.166605
d	NaN	        0.665832	NaN

>>> row = df.iloc[1]
>>> df.add(row,axis=1) # row is the second row of df (label b); it is added to every row of df
    one	        two	        three
a	1.137175	0.086774	NaN
b	1.741043	0.597817	-1.050836
c	-0.242817	-1.211206	-0.358813
d	NaN	        0.964741	NaN

>>> columns = df['two'] 
>>> df.sub(columns,axis=0) # subtract the values of column two from every column of df
	one	        two	    three
a	0.478788	0.0	    NaN
b	0.571613	0.0	    -0.824326
c	0.396776	0.0	    1.676719
d	NaN	        0.0	    NaN

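These examples show how broadcasting works in Pandas: the axis argument decides whether the Series is matched against the columns (axis=1) or against the index (axis=0).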
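
The role of axis is easier to check with integers than with random numbers. The following sketch uses a made-up frame with columns x and y; the names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]}, index=["a", "b", "c"])

row = df.loc["a"]   # indexed by the column labels x, y
col = df["x"]       # indexed by the row labels a, b, c

# axis=1: the Series index is matched against the column labels,
# so the row is subtracted from every row of df.
minus_row = df.sub(row, axis=1)

# axis=0: the Series index is matched against the row labels,
# so the column is subtracted from every column of df.
minus_col = df.sub(col, axis=0)

print(minus_row)
print(minus_col)
```

If the wrong axis is used, the labels do not match and the result is filled with NaN, which is usually the first sign that axis needs to be changed.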

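Handling missing values

The functions above cannot compute a result where a value is missing. To handle this, pass the fill_value option when calling the function, so that a NaN is replaced by an actual value before the operation. Consider the following example.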

>>> df
	one			two			three
a	0.266653	-0.212135	NaN
b	0.870522	0.298908	-0.525418
c	-1.113338	-1.510115	0.166605
d	NaN			0.665832	NaN

>>> df2 = pd.DataFrame({'one':[1,2,3,4],'two':[2,5,4,8],
... 'three':[5,6,1,2]},index=['a','b','c','d'])
>>> df2
	one	two	three
a	1	2	5
b	2	5	6
c	3	4	1
d	4	8	2

>>> df.add(df2, fill_value=0) # treat a NaN in one DataFrame as 0 before adding
	one			two			three
a	1.266653	1.787865	5.000000
b	2.870522	5.298908	5.474582
c	1.886662	2.489885	1.166605
d	4.000000	8.665832	2.000000
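
One detail is worth checking by hand: fill_value only helps when at least one side has a value. The sketch below assumes two small single-column frames, `left` and `right`, invented for the example.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"v": [1.0, np.nan, np.nan]}, index=["a", "b", "c"])
right = pd.DataFrame({"v": [10.0, 20.0, np.nan]}, index=["a", "b", "c"])

# Row a: both present. Row b: missing on the left only, so 0 is used.
# Row c: missing on both sides, so the result stays NaN.
total = left.add(right, fill_value=0)

print(total)
```

A remaining NaN after add(..., fill_value=0) therefore means the value was missing in both inputs.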

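Several members summarize data as boolean results: empty, any(), all() and bool().
all() tests whether every value satisfies a condition;
any() tests whether at least one value satisfies it;
empty tells whether the object contains no data;
bool() returns the boolean value of an object that holds a single element (bool() was deprecated in pandas 2.1, so it may be unavailable in newer releases).
Consider the following examples: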

>>> (df>0).all() # for each column of df, are all values > 0?
one      False
two      False
three    False
dtype: bool

>>> (df>0).all().all() # are all elements of the whole df > 0?
False

>>> df.empty # is this df empty?
False

>>> pd.DataFrame(columns=['ABC']).empty # is a df that has only column names empty?
True

>>> (df>0).any()
one      True
two      True
three    True
dtype: bool

>>> pd.Series([False]).bool()
False

>>> pd.DataFrame([[True]]).bool()
True

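Sometimes, when comparing with operators, for example df + df == 2 * df, some of the resulting values are False. This is because NaN never compares equal to anything, including another NaN.
Use the equals() method instead, which treats NaNs in the same position as equal:

>>> (df + df).equals(2 * df)
True

Note that equals() can still return False when the two objects hold the same data but their indexes are in a different order. In that case, sort the unsorted object with df.sort_index() before calling equals().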
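
Both points, the NaN comparison and the index order, can be reproduced with a three-row frame. This is a small sketch with made-up values and variable names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"v": [1.0, np.nan, 3.0]}, index=["a", "b", "c"])

# Element-wise comparison: the NaN row compares as False.
elementwise = (df + df == 2 * df)
all_equal = bool(elementwise.all().all())

# equals() treats NaNs in the same position as equal.
same = (df + df).equals(2 * df)

# Same data, different row order: equals() is False until we sort.
shuffled = df.loc[["c", "a", "b"]]
same_unsorted = df.equals(shuffled)
same_sorted = df.equals(shuffled.sort_index())

print(all_equal, same, same_unsorted, same_sorted)
```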

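(4) Statistics

When processing data, we often want a quick look at its numerical characteristics to get a rough picture of it. Pandas provides methods for exactly that, as in the following examples: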

>>> df
	one			two			three
a	0.242950	1.626914	NaN
b	-0.948663	-0.352977	0.956809
c	-0.742458	0.882777	0.374549
d	NaN			-0.285258	NaN

>>> df.mean() # mean of each column of df
one     -0.482724
two      0.467864
three    0.665679
dtype: float64

>>> df.median(axis=1) # median of each row of df
a    0.934932
b   -0.352977
c    0.374549
d   -0.285258
dtype: float64

The skipna option controls whether NaN values are excluded. It defaults to True:

>>> df.std(axis=0,skipna=True) # standard deviation of each column, ignoring NaN
one      0.636853
two      0.958562
three    0.411720
dtype: float64

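Besides the methods demonstrated above, other commonly used ones are listed in the table below:

Method	Description
count	Number of non-NaN values
sum	Sum
mean	Mean
mad	Mean absolute deviation (removed in pandas 2.0)
median	Median
min	Minimum
max	Maximum
mode	Mode
abs	Absolute value
std	Sample standard deviation
var	Unbiased variance
sem	Standard error of the mean
skew	Sample skewness (third moment)
kurt	Sample kurtosis (fourth moment)
quantile	Sample quantile (value at %)
cumsum	Cumulative sum
cumprod	Cumulative product
cummax	Cumulative maximum
cummin	Cumulative minimum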

The describe() method gives a quick summary of these statistics:

>>> s = pd.Series(np.random.randn(1000))
>>> s.describe()
count    1000.000000  # number of values
mean        0.023479  # mean
std         1.022949  # standard deviation
min        -3.031548  # minimum
25%        -0.656509  # value at the 25th percentile
50%         0.014604
75%         0.681994
max         2.823129  # maximum
dtype: float64
dtype: float64

The idxmin() and idxmax() methods return the index labels of the minimum and maximum values:

>>> df = pd.DataFrame(np.random.randn(5, 3), columns=["A", "B", "C"])
>>> df
	A			B			C
0	-0.071734	0.336667	-0.413956
1	-0.588041	1.325679	-0.695395
2	-0.926539	0.674262	0.591602
3	0.292243	0.612337	0.401866
4	-0.526544	-0.699263	0.006968

>>> df.idxmax(axis=1) # column label of the maximum in each row
0    B
1    B
2    B
3    B
4    C
dtype: object

>>> df.idxmin(axis=0) # row label of the minimum in each column
A    2
B    4
C    1
dtype: int64

The value_counts() method counts how many times each value occurs:

>>> d = np.random.randint(0,10,size=(50))
>>> d
array([2, 0, 8, 4, 5, 8, 1, 8, 9, 5, 2, 2, 5, 3, 1, 5, 2, 5, 4, 7, 6, 2,
       1, 3, 9, 4, 1, 1, 4, 8, 1, 3, 3, 7, 6, 0, 2, 4, 5, 0, 6, 5, 9, 2,
       8, 1, 1, 9, 8, 8])

>>> s = pd.Series(d)
>>> s.value_counts()
1    8
2    7
8    7
5    7
4    5
9    4
3    4
0    3
6    3
7    2
dtype: int64
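
value_counts() can also report proportions instead of raw counts through its normalize option. A minimal sketch on a short, hand-written Series (the data is made up for illustration):

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "c", "a", "b"])

counts = s.value_counts()                # sorted from most to least frequent
shares = s.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```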

(5) Label operations

The reindex() method conforms the data to a new set of labels. If you want one object to have the same labels as another, use reindex_like():

>>> s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
>>> s
a    1.584789
b    0.996338
c    0.484087
d    0.496449
e   -1.393620
dtype: float64

>>> s.reindex(['e','b','g','a','i'])
e   -1.393620
b    0.996338
g         NaN
a    1.584789
i         NaN
dtype: float64

>>> s1.reindex_like(s)
a   NaN
b   NaN
c   NaN
d   NaN
e   NaN
dtype: float64

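When reindexing, missing values can be filled at the same time by passing the method option to reindex(). The main fill methods are:

Method	Rule
ffill	Fill values forward (carry the last known value forward)
bfill	Fill values backward (use the next known value)
nearest	Fill from the nearest index value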

>>> date = pd.date_range('2022/1/1',periods=8)
>>> s = pd.Series(np.random.randn(8),index=date)
>>> s
2022-01-01   -0.579916
2022-01-02    2.261039
2022-01-03    2.422621
2022-01-04   -0.350369
2022-01-05   -0.400870
2022-01-06   -1.017559
2022-01-07   -1.343754
2022-01-08    0.411179
Freq: D, dtype: float64

>>> s2 = s.iloc[[0,3,6]] # keep only the rows at positions 0, 3 and 6
>>> s2
2022-01-01   -0.579916
2022-01-04   -0.350369
2022-01-07   -1.343754
Freq: 3D, dtype: float64

>>> s2.reindex(s.index) # without a fill method, the missing dates become NaN
2022-01-01   -0.579916
2022-01-02         NaN
2022-01-03         NaN
2022-01-04   -0.350369
2022-01-05         NaN
2022-01-06         NaN
2022-01-07   -1.343754
2022-01-08         NaN
Freq: D, dtype: float64

>>> s2.reindex(s.index,method = 'ffill') # fill with the previous value
2022-01-01   -0.579916
2022-01-02   -0.579916
2022-01-03   -0.579916
2022-01-04   -0.350369
2022-01-05   -0.350369
2022-01-06   -0.350369
2022-01-07   -1.343754
2022-01-08   -1.343754
Freq: D, dtype: float64

>>> s2.reindex(s.index,method = 'bfill') # fill with the next value
2022-01-01   -0.579916
2022-01-02   -0.350369
2022-01-03   -0.350369
2022-01-04   -0.350369
2022-01-05   -1.343754
2022-01-06   -1.343754
2022-01-07   -1.343754
2022-01-08         NaN
Freq: D, dtype: float64

>>> s2.reindex(s.index,method = 'nearest') # fill with the nearest value
2022-01-01   -0.579916
2022-01-02   -0.579916
2022-01-03   -0.350369
2022-01-04   -0.350369
2022-01-05   -0.350369
2022-01-06   -1.343754
2022-01-07   -1.343754
2022-01-08   -1.343754
Freq: D, dtype: float64

The limit option caps how many consecutive missing labels are filled, which keeps a value from being propagated too far:

>>> s2.reindex(s.index,method = 'ffill',limit = 1)
2022-01-01   -0.579916
2022-01-02   -0.579916
2022-01-03         NaN
2022-01-04   -0.350369
2022-01-05   -0.350369
2022-01-06         NaN
2022-01-07   -1.343754
2022-01-08   -1.343754
Freq: D, dtype: float64
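
The fill options are easier to compare on a short series with known values. In this sketch, `full` and `sparse` are hypothetical names for a five-day index and a Series that has values on only two of those days.

```python
import pandas as pd

full = pd.date_range("2022-01-01", periods=5)
sparse = pd.Series([1.0, 2.0], index=full[[0, 3]])  # values on day 1 and day 4

plain = sparse.reindex(full)                              # gaps become NaN
ffilled = sparse.reindex(full, method="ffill")            # carry values forward
limited = sparse.reindex(full, method="ffill", limit=1)   # at most one step

print(plain)
print(ffilled)
print(limited)
```

With limit=1, day 2 is filled from day 1 but day 3 stays NaN, because it is two steps away from the last known value.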

The drop() method removes rows or columns from the data, as in the following example:

>>> df.drop('A',axis=1)
	B			C
0	-0.338204	-0.144193
1	-0.701576	-0.366568
2	-0.056439	0.678802
3	1.076433	-1.252925
4	-2.340144	-0.469283

>>> df.drop([1,3],axis=0)
	A			B			C
0	-0.567549	-0.338204	-0.144193
2	-0.107124	-0.056439	0.678802
4	-0.753300	-2.340144	-0.469283

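As we saw above, reindex() does not rename labels: it selects data by label, so labels that are not present in the original come back as NaN. To rename labels without changing the data, use the rename() method:

>>> s = pd.Series(np.random.randn(5),index=['a','b','c','d','e'])
>>> s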
a   -0.326658
b    1.032068
c    1.408185
d   -0.498249
e   -0.780058
dtype: float64

>>> s.rename(str.upper) # convert the labels to uppercase
A   -0.326658
B    1.032068
C    1.408185
D   -0.498249
E   -0.780058
dtype: float64

>>> s.rename({'a':'o','b':'p','c':'q','d':'r','e':'s'}) # rename using a dictionary
o   -0.326658
p    1.032068
q    1.408185
r   -0.498249
s   -0.780058
dtype: float64
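
rename() works on a DataFrame as well, where row labels and column names can be changed in one call. A brief sketch on a made-up two-by-two frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

# columns takes a mapping (or a function); index here takes a function.
renamed = df.rename(columns={"a": "alpha"}, index=str.upper)

print(renamed)
print(df)  # the original is left unchanged
```

Labels that are not mentioned in the mapping, such as b here, are kept as they are.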

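(6) Iteration

The simplest way to iterate is a for loop. Iterating over a DataFrame directly yields its column labels, as in the following example:

>>> df = pd.DataFrame({"col1": np.random.randn(3),
...                    "col2": np.random.randn(3)},
...                   index=["a", "b","c"])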
>>> for col in df:
...     print(col)
col1
col2

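Pandas also provides the items() method for iterating over (label, column) pairs, much like the key-value pairs of a dictionary. Rows can be iterated with iterrows() and itertuples(); the latter is usually much faster than the former:

>>> for label, ser in df.items():
...     print(label)
...     print(ser)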
col1
a   -0.752748
b    0.355623
c   -1.396863
Name: col1, dtype: float64
col2
a    0.170046
b    2.115181
c    1.130240
Name: col2, dtype: float64

iterrows() iterates over the rows of a DataFrame as (index, Series) pairs:

>>> for index, row in df.iterrows():
...    print(index,row,sep='\n')
a
col1   -0.752748
col2    0.170046
Name: a, dtype: float64
b
col1    0.355623
col2    2.115181
Name: b, dtype: float64
c
col1   -1.396863
col2    1.130240
Name: c, dtype: float64

itertuples() returns each row as a namedtuple instead:

>>> for row in df.itertuples():
...     print(row)
Pandas(Index='a', col1=-0.7527479960512652, col2=0.17004562449070096)
Pandas(Index='b', col1=0.3556232583123519, col2=2.1151810120272803)
Pandas(Index='c', col1=-1.3968625542650128, col2=1.1302402391429855)
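
One practical difference between the two row iterators is how they treat column types. The sketch below uses a frame with one integer and one float column; the data and names are chosen only for illustration. It also shows that simple arithmetic often needs no loop at all.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [0.5, 1.5, 2.5]},
                  index=["a", "b", "c"])

# iterrows() builds one Series per row, so the integer and float
# columns are upcast to a single dtype (float64 here).
first_label, first_row = next(df.iterrows())

# itertuples() yields namedtuples with one field per column.
first_tuple = next(df.itertuples())

# A vectorized expression computes the same per-row sum without a loop.
totals = df["col1"] + df["col2"]

print(first_row.dtype, first_tuple)
print(totals)
```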

(7) Sorting

Pandas supports three kinds of sorting: by labels, by values, or by a combination of both:

a. Sorting by index

Use the sort_index() method:

>>> df = pd.DataFrame({
... "one": pd.Series(np.random.randn(3), index=["a", "b", "c"]),
... "two": pd.Series(np.random.randn(4), index=["a", "b", "c", "d"]),
... "three": pd.Series(np.random.randn(3), index=["b", "c", "d"]),})
>>> unsorted_df = df.reindex(
... index=["a", "d", "c", "b"], columns=["three", "two", "one"])
>>> unsorted_df
	three		two			one
a	NaN			0.663213	0.334880
d	0.002582	0.722183	NaN
c	-0.063257	-2.603502	0.459381
b	-0.926569	0.244567	-0.165420

>>> unsorted_df.sort_index()
	three		two			one
a	NaN			0.663213	0.334880
b	-0.926569	0.244567	-0.165420
c	-0.063257	-2.603502	0.459381
d	0.002582	0.722183	NaN

>>> unsorted_df.sort_index(ascending=False) # sort in descending order
	three		two			one
d	0.002582	0.722183	NaN
c	-0.063257	-2.603502	0.459381
b	-0.926569	0.244567	-0.165420
a	NaN			0.663213	0.334880

>>> unsorted_df.sort_index(axis=1) # sort by column labels
	one			three		two
a	0.334880	NaN			0.663213
d	NaN			0.002582	0.722183
c	0.459381	-0.063257	-2.603502
b	-0.165420	-0.926569	0.244567

b. Sorting by value

Use the sort_values() method. The by option selects which column to sort on, and when the data contains NaN values, the na_position option decides where they are placed:

>>> df1 = pd.DataFrame({"one": [2, 1, 1, 1],
...                     "two": [1, 3, 2, 4],
...                     "three": [5, np.nan, 3, 2]})
>>> df1.sort_values(by='two') # sort by the values in column two
	one	two	three
0	2	1	5.0
2	1	2	3.0
1	1	3	NaN
3	1	4	2.0

>>> df1.sort_values('three',na_position='first') # put NaN first
	one	two	three
1	1	3	NaN
3	1	4	2.0
2	1	2	3.0
0	2	1	5.0
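
The by option also accepts a list of columns, with a matching list for ascending. The following sketch sorts a small made-up frame by one ascending key and one descending key.

```python
import pandas as pd

df = pd.DataFrame({"one": [2, 1, 1, 1], "two": [1, 3, 2, 4]})

# Sort by "one" ascending; ties are broken by "two" descending.
ordered = df.sort_values(by=["one", "two"], ascending=[True, False])

print(ordered)
```

The original row labels travel with the rows, so the index of the result shows how the rows were reordered.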

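(8) I/O operations

Datasets usually arrive as files, for example in CSV format. We need to read the data from the file before processing it further with Pandas. Reading a file is fairly simple, so only a brief introduction is given here.
In general, to read a given type of file you call pd.read_*, where * stands for the file type. For example, to read a CSV file, just type

pd.read_csv('data.csv') # the argument is the path and name of the file

Or, to read an Excel file:

pd.read_excel('data.xls')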

Other file types are encountered less often and are not covered here.
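
Reading and writing can be tried without any file on disk by using an in-memory text buffer. This is a minimal sketch; the CSV text and the column names are invented for the example, and to_csv() is the writing counterpart of read_csv().

```python
import io

import pandas as pd

csv_text = "name,score\nalice,90\nbob,85\n"

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Write the frame back out as CSV text, without the row index.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Reading the written text again gives an equal DataFrame.
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))

print(df)
print(round_trip.equals(df))
```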

Feel free to share. When reposting, please credit the source: 内存溢出

Original article: http://outofmemory.cn/langs/923144.html
