Python
数据分析
- 🌸个人主页:JoJo的数据分析历险记
- 📝个人介绍:小编大四统计在读,目前保研到统计学top3高校继续攻读统计研究生
- 💌如果文章对你有帮助,欢迎关注、点赞、收藏、订阅专栏
本专栏主要介绍python数据分析领域的应用
参考资料:
利用python数据分析
文章目录
- Python数据分析
- 💮1.Series 对象
-
- 🌹2.DataFrame对象
- 🥀3.pandas基本数据运算
- 🌺3.1 算术运算
- 🌻3.2 基本算术运算符
- 🌼3.3 函数映射
- 🌷4.统计函数
- 🌱4.1 相关性和协方差
- 🌲4.1.1 Series对象
- 🌳4.1.2DataFrame对象
- 🌴4.1.3DataFrame和Series相关性
- 🌵4.2排序和秩
- 🌾5.Pandas缺失值处理
- 🌿5.1 创建NaN数据
- ☘️5.2 删除NaN
- 🍀5.3 为NaN元素填充其他值
- 🍁6. 层级索引和分层统计
- 🍂6.1 unstack()函数和stack()函数
- 🍃6.2调整层级顺序
- 🌍6.3按层级统计数据
- 🌎7.数据导入
- 🌏8.数据处理
- 🌐8.1 连接
- 🎇8.1.1 内连接
- 🎉8.1.2 外连接
- 🎊8.1.3 以索引作为键进行连接
- 🎄8.2拼接
- 🎋8.3组合
- 🏆文章推荐
我们介绍了Numpy在数据处理方面的应用,本文介绍一下pandas在数据处理方面的应用,pandas可以是基于numpy构建的,但是可以让数据处理变得更便捷
导入相关库
import numpy as np
import pandas as pd
💮1.Series 对象
pandas主要有两个数据对象,一个是Series,类似于一个向量的形式,另一个是DataFrame数据框形式。我们先来看一下如何创建一个Series数据对象。
s = pd.Series([12,-4,7,9])
s
0 12
1 -4
2 7
3 9
dtype: int64
🏵️1.1 Series基本 *** 作
#选择内部元素
s[2]
7
#为元素赋值
s[2]=5
s
s['a'] = 4
s
0 12
1 -4
2 5
3 9
a 4
dtype: int64
#用其它对象定义新的series对象
arr = np.array([1,2,3,4])
#此时s2只是原来的一个动态视图,会随数组的改变而改变,例如我们改变原来数组中的第二个元素值
s2 = pd.Series(arr)
s2
arr[1] = 9
s2
0 1
1 9
2 3
3 4
dtype: int32
#筛选元素
s[s>8]
0 12
3 9
dtype: int64
#Series对象的组成元素
serd = pd.Series([1,0,2,1,2,3], index=['white','white','blue','green','green','yellow'])
serd
white 1
white 0
blue 2
green 1
green 2
yellow 3
dtype: int64
#unique去重 返回一个数组
serd.unique()
array([1, 0, 2, 3], dtype=int64)
#value_counts 去重 返回出现次数
serd.value_counts()
2 2
1 2
3 1
0 1
dtype: int64
#isin 函数,返回布尔值
serd.isin([0,3])
white False
white True
blue False
green False
green False
yellow True
dtype: bool
serd[serd.isin([0,3])]
white 0
yellow 3
dtype: int64
#NaN
s2 = pd.Series([-5,3,np.NaN,14])
s2
0 -5.0
1 3.0
2 NaN
3 14.0
dtype: float64
# 用isnull 和 notnull 来进行判断
s2.isnull()
s2.notnull()
0 True
1 True
2 False
3 True
dtype: bool
s2
0 -5.0
1 3.0
2 NaN
3 14.0
dtype: float64
#用作字典
mydict = {'red':2000,'blue':1000,'yellow':500,'orange':1000}
myseries = pd.Series(mydict)
myseries
red 2000
blue 1000
yellow 500
orange 1000
dtype: int64
当出现缺失值时,会直接用NaN替代
colors = ['red','blue','yellow','orange','green']
myseries = pd.Series(mydict, index = colors)
myseries
red 2000.0
blue 1000.0
yellow 500.0
orange 1000.0
green NaN
dtype: float64
进行运算时有NaN为NaN
mydict2 ={'red':400,'yellow':1000,"black":700}
myseries2 = pd.Series(mydict2)
myseries.fillna(0) + myseries2.fillna(0)
black NaN
blue NaN
green NaN
orange NaN
red 2400.0
yellow 1500.0
dtype: float64
🌹2.DataFrame对象
DataFrame对象是我们在进行数据分析时最常见的数据格式,相当于一个矩阵数据,由不同行不同列组成,通常每一列代表一个变量,每一行代表一个观察数据。我们先来看一下DataFrame的一些基础应用。
创建DataFrame对象
#DataFrame对象
data = {'color':['blue','green','yellow','red','white'],
'object':['ball','pen','pencil','paper','mug'],
'price':[1.2,1.0,0.6,0.9,1.7]}
frame = pd.DataFrame(data)
frame
| color | object | price |
---|
0 | blue | ball | 1.2 |
---|
1 | green | pen | 1.0 |
---|
2 | yellow | pencil | 0.6 |
---|
3 | red | paper | 0.9 |
---|
4 | white | mug | 1.7 |
---|
frame2 = pd.DataFrame(data, columns=['object','price'])
frame2
| object | price |
---|
0 | ball | 1.2 |
---|
1 | pen | 1.0 |
---|
2 | pencil | 0.6 |
---|
3 | paper | 0.9 |
---|
4 | mug | 1.7 |
---|
frame3 = pd.DataFrame(data,index=['one','two','three','four','five'])
frame3
| color | object | price |
---|
one | blue | ball | 1.2 |
---|
two | green | pen | 1.0 |
---|
three | yellow | pencil | 0.6 |
---|
four | red | paper | 0.9 |
---|
five | white | mug | 1.7 |
---|
#选取元素
#获得所有列的名称
frame.columns
Index(['color', 'object', 'price'], dtype='object')
#获得所有行的名称
frame.index
RangeIndex(start=0, stop=5, step=1)
#获得所有值
frame.values
array([['blue', 'ball', 1.2],
['green', 'pen', 1.0],
['yellow', 'pencil', 0.6],
['red', 'paper', 0.9],
['white', 'mug', 1.7]], dtype=object)
#获得某一列的值
frame['price']
0 1.2
1 1.0
2 0.6
3 0.9
4 1.7
Name: price, dtype: float64
#获得行的值 用ix属性和行的索引项
frame.iloc[2]
color yellow
object pencil
price 0.6
Name: 2, dtype: object
#指定多个索引值能选取多行
frame.iloc[[2,4]]
| color | object | price |
---|
2 | yellow | pencil | 0.6 |
---|
4 | white | mug | 1.7 |
---|
#可以用frame[0:1]或者frame[0:2]选择行 但切记frame[0]没有数
frame[0:4]
对DataFrame进行行选择时,使用索引frame[0:1]返回第一行数据,[1:2]返回第二行数据
| color | object | price |
---|
0 | blue | ball | 1.2 |
---|
1 | green | pen | 1.0 |
---|
2 | yellow | pencil | 0.6 |
---|
3 | red | paper | 0.9 |
---|
#如果要获取其中的一个元素,必须依次指定元素所在的列名称、行的索引值或标签
frame['object'][3]
'paper'
#赋值
frame['new']=12 #直接添加某一列
frame
| color | object | price | new |
---|
0 | blue | ball | 1.2 | 12 |
---|
1 | green | pen | 1.0 | 12 |
---|
2 | yellow | pencil | 0.6 | 12 |
---|
3 | red | paper | 0.9 | 12 |
---|
4 | white | mug | 1.7 | 12 |
---|
frame['new']=[1,2,3,4,5]
frame
| color | object | price | new |
---|
0 | blue | ball | 1.2 | 1 |
---|
1 | green | pen | 1.0 | 2 |
---|
2 | yellow | pencil | 0.6 | 3 |
---|
3 | red | paper | 0.9 | 4 |
---|
4 | white | mug | 1.7 | 5 |
---|
#修改单个元素的方法
frame['price'][2]=3.3
frame
| color | object | price | new |
---|
0 | blue | ball | 1.2 | 1 |
---|
1 | green | pen | 1.0 | 2 |
---|
2 | yellow | pencil | 3.3 | 3 |
---|
3 | red | paper | 0.9 | 4 |
---|
4 | white | mug | 1.7 | 5 |
---|
# 删除一整列的所有数据,用del
frame['new'] = 12
frame
del frame['new']
frame
| color | object | price |
---|
0 | blue | ball | 1.2 |
---|
1 | green | pen | 1.0 |
---|
2 | yellow | pencil | 3.3 |
---|
3 | red | paper | 0.9 |
---|
4 | white | mug | 1.7 |
---|
#筛选元素
frame3 = pd.DataFrame(np.arange(16).reshape((4,4)),index = ['red','white','blue','green'],
columns=['ball','pen','pencil','paper'])
frame3
frame3[frame3>12]
| ball | pen | pencil | paper |
---|
red | NaN | NaN | NaN | NaN |
---|
white | NaN | NaN | NaN | NaN |
---|
blue | NaN | NaN | NaN | NaN |
---|
green | NaN | 13.0 | 14.0 | 15.0 |
---|
#用嵌套字典生成DataFrame对象 当出现缺失值时用NaN替代
nestdict = {'red':{2012:22, 2013:33},'white':{2011: 13,2012:22,2013:16},'blue':{2011:17,2012:27,2013:48}}
nestdict
{'red': {2012: 22, 2013: 33},
'white': {2011: 13, 2012: 22, 2013: 16},
'blue': {2011: 17, 2012: 27, 2013: 48}}
frame2 = pd.DataFrame(nestdict)
frame2
| red | white | blue |
---|
2011 | NaN | 13 | 17 |
---|
2012 | 22.0 | 22 | 27 |
---|
2013 | 33.0 | 16 | 48 |
---|
进行转置
frame2.T
| 2011 | 2012 | 2013 |
---|
red | NaN | 22.0 | 33.0 |
---|
white | 13.0 | 22.0 | 16.0 |
---|
blue | 17.0 | 27.0 | 48.0 |
---|
#index对象
ser = pd.Series([5,0,3,8,4], index=['red','blue','yellow','white','green'])
ser.index
Index(['red', 'blue', 'yellow', 'white', 'green'], dtype='object')
ser.idxmax()
'white'
ser.idxmin()
'blue'
#含重复标签的Index
serd = pd.Series(range(6), index=['white','white','blue','green','green','yellow'])
serd
white 0
white 1
blue 2
green 3
green 4
yellow 5
dtype: int64
#当一个标签对应多个元素时,返回一个Series对象 而不是单个元素
serd['white']
white 0
white 1
dtype: int64
#判断是否由重复值, is_unique
#索引对象的其他功能
ser = pd.Series([2,5,7,4],index = ['one','two','three','four'])
ser
one 2
two 5
three 7
four 4
dtype: int64
#reindex()函数可以更换series对象的索引,生成一个新的series对象
ser.reindex(['three','one','five','two'])
three 7.0
one 2.0
five NaN
two 5.0
dtype: float64
ser3 = pd.Series([1,5,6,3],index=[0,3,5,6])
ser3
0 1
3 5
5 6
6 3
dtype: int64
#自动插补
#reindex()函数,method:ffill 表示插补的数为前面的值,bfill表示插补的数为后面的值
ser3.reindex(range(6),method='ffill')
0 1
1 1
2 1
3 5
4 5
5 6
dtype: int64
ser3.reindex(range(8),method='bfill')
0 1.0
1 5.0
2 5.0
3 5.0
4 6.0
5 6.0
6 3.0
7 NaN
dtype: float64
frame.reindex(range(5), method='ffill',columns=['colors','price','new','object'])
| colors | price | new | object |
---|
0 | blue | 1.2 | blue | ball |
---|
1 | green | 1.0 | green | pen |
---|
2 | yellow | 3.3 | yellow | pencil |
---|
3 | red | 0.9 | red | paper |
---|
4 | white | 1.7 | white | mug |
---|
ser = pd.Series(np.arange(4.),index=['red','blue','yellow','white'])
ser
red 0.0
blue 1.0
yellow 2.0
white 3.0
dtype: float64
ser.drop('yellow')
red 0.0
blue 1.0
white 3.0
dtype: float64
ser.drop(['blue','white'])
red 0.0
yellow 2.0
dtype: float64
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
frame
| ball | pen | pencil | paper |
---|
red | 0 | 1 | 2 | 3 |
---|
blue | 4 | 5 | 6 | 7 |
---|
yellow | 8 | 9 | 10 | 11 |
---|
white | 12 | 13 | 14 | 15 |
---|
#删除时默认是行 axis指定轴,1为列
frame.drop(['pen'],axis=1)
| ball | pencil | paper |
---|
red | 0 | 2 | 3 |
---|
blue | 4 | 6 | 7 |
---|
yellow | 8 | 10 | 11 |
---|
white | 12 | 14 | 15 |
---|
🥀3.pandas基本数据运算
🌺3.1 算术运算
- 当有两个series或DataFrame对象时,如果一个标签,两个对象都有,则把他们的值相加
- 当一个标签只有一个对象有时,则为NaN
s1 = pd.Series([3,2,5,1],index=['white','yellow','green','blue'])
s1
white 3
yellow 2
green 5
blue 1
dtype: int64
s2 = pd.Series([1,4,7,2,1],['white','yellow','black','blue','brown'])
s1 + s2
black NaN
blue 3.0
brown NaN
green NaN
white 4.0
yellow 6.0
dtype: float64
# DateFrame对象也一样
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
columns=['ball','pen','pencil','paper'],
index = ['red','blue','yellow','white'])
frame1
| ball | pen | pencil | paper |
---|
red | 0 | 1 | 2 | 3 |
---|
blue | 4 | 5 | 6 | 7 |
---|
yellow | 8 | 9 | 10 | 11 |
---|
white | 12 | 13 | 14 | 15 |
---|
frame2 = pd.DataFrame(np.arange(12).reshape((4,3)),
index = ['blue','yellow','green','white']
,columns=['ball','pen','mug'])
frame2
| ball | pen | mug |
---|
blue | 0 | 1 | 2 |
---|
yellow | 3 | 4 | 5 |
---|
green | 6 | 7 | 8 |
---|
white | 9 | 10 | 11 |
---|
frame3 = frame1+frame2
frame3
| ball | mug | paper | pen | pencil |
---|
blue | 4.0 | NaN | NaN | 6.0 | NaN |
---|
green | NaN | NaN | NaN | NaN | NaN |
---|
red | NaN | NaN | NaN | NaN | NaN |
---|
white | 21.0 | NaN | NaN | 23.0 | NaN |
---|
yellow | 11.0 | NaN | NaN | 13.0 | NaN |
---|
🌻3.2 基本算术运算符
主要的算术运算符如下
- add() frame1.add(frame2) = frame1+frame2
- sub()
- div()
- mul()
下面通过一些案例来说明
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
columns=['ball','pen','pencil','paper'],
index = ['red','blue','yellow','white'])
frame
| ball | pen | pencil | paper |
---|
red | 0 | 1 | 2 | 3 |
---|
blue | 4 | 5 | 6 | 7 |
---|
yellow | 8 | 9 | 10 | 11 |
---|
white | 12 | 13 | 14 | 15 |
---|
ser = pd.Series(np.arange(4),['ball','pen','pencil','paper'])
ser #与frame 的列名称保持一致,行不可以
ball 0
pen 1
pencil 2
paper 3
dtype: int32
frame-ser
| ball | pen | pencil | paper |
---|
red | 0 | 0 | 0 | 0 |
---|
blue | 4 | 4 | 4 | 4 |
---|
yellow | 8 | 8 | 8 | 8 |
---|
white | 12 | 12 | 12 | 12 |
---|
当索引项只存在于其中一个数据结构时,那么运算结果会为其产生一个新的索引项,但其值为NaN
具体案例如下,我们给ser增加一列mug
ser['mug'] = 9
ser
ball 0
pen 1
pencil 2
paper 3
mug 9
dtype: int64
frame - ser
| ball | mug | paper | pen | pencil |
---|
red | 0 | NaN | 0 | 0 | 0 |
---|
blue | 4 | NaN | 4 | 4 | 4 |
---|
yellow | 8 | NaN | 8 | 8 | 8 |
---|
white | 12 | NaN | 12 | 12 | 12 |
---|
🌼3.3 函数映射
在dataframe和series数据对象中,可以使用函数对所有元素进行 *** 作
frame
| ball | pen | pencil | paper |
---|
red | 0 | 1 | 2 | 3 |
---|
blue | 4 | 5 | 6 | 7 |
---|
yellow | 8 | 9 | 10 | 11 |
---|
white | 12 | 13 | 14 | 15 |
---|
# 求所有元素的平方根
np.sqrt(frame)
| ball | pen | pencil | paper |
---|
red | 0.000000 | 1.000000 | 1.414214 | 1.732051 |
---|
blue | 2.000000 | 2.236068 | 2.449490 | 2.645751 |
---|
yellow | 2.828427 | 3.000000 | 3.162278 | 3.316625 |
---|
white | 3.464102 | 3.605551 | 3.741657 | 3.872983 |
---|
#定义函数
#法一:
f = lambda x:x.max()-x.min()#返回数组取值范围
#法二:
def f(x):
return x.max()-x.min()
# apply函数可以调用定义的函数
frame.apply(f)
ball 12
pen 12
pencil 12
paper 12
dtype: int64
def f(x):
return pd.Series([x.min(),x.max()],index = ['min','max'])
frame.apply(f,axis = 1)
# 默认axis=0
| min | max |
---|
red | 0 | 3 |
---|
blue | 4 | 7 |
---|
yellow | 8 | 11 |
---|
white | 12 | 15 |
---|
🌷4.统计函数
- 数组大多数统计函数对DataFrame对象有用,故可以直接使用
frame.sum()
ball 24
pen 28
pencil 32
paper 36
dtype: int64
frame.mean()
ball 6.0
pen 7.0
pencil 8.0
paper 9.0
dtype: float64
# describe()函数可以计算多个统计量
frame.describe()
| ball | pen | pencil | paper |
---|
count | 4.000000 | 4.000000 | 4.000000 | 4.000000 |
---|
mean | 6.000000 | 7.000000 | 8.000000 | 9.000000 |
---|
std | 5.163978 | 5.163978 | 5.163978 | 5.163978 |
---|
min | 0.000000 | 1.000000 | 2.000000 | 3.000000 |
---|
25% | 3.000000 | 4.000000 | 5.000000 | 6.000000 |
---|
50% | 6.000000 | 7.000000 | 8.000000 | 9.000000 |
---|
75% | 9.000000 | 10.000000 | 11.000000 | 12.000000 |
---|
max | 12.000000 | 13.000000 | 14.000000 | 15.000000 |
---|
```python ser.rank(method='first') ```
red 4.0
blue 1.0
yellow 2.0
white 5.0
green 3.0
dtype: float64
🌱4.1 相关性和协方差
🌲4.1.1 Series对象
- 通常涉及两个数据对象
- 函数分别corr()和cov()
seq = pd.Series([3,4,3,4,5,4,3,2],['2006','2007','2008','2009','2010','2011','2012','2013'])
seq
2006 3
2007 4
2008 3
2009 4
2010 5
2011 4
2012 3
2013 2
dtype: int64
seq2 = pd.Series([1,2,3,4,4,3,2,1],['2006','2007','2008','2009','2010','2011','2012','2013'])
seq2
2006 1
2007 2
2008 3
2009 4
2010 4
2011 3
2012 2
2013 1
dtype: int64
seq.corr(seq2)
0.7745966692414834
seq.cov(seq2)
0.8571428571428571
🌳4.1.2DataFrame对象
DataFrame对象计算相关性和协方差依然返回一个dataframe对象
frame2 = pd.DataFrame([[1,4,3,6],[4,5,6,1],[3,3,1,5],[4,1,6,4]],
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
frame2
| ball | pen | pencil | paper |
---|
red | 1 | 4 | 3 | 6 |
---|
blue | 4 | 5 | 6 | 1 |
---|
yellow | 3 | 3 | 1 | 5 |
---|
white | 4 | 1 | 6 | 4 |
---|
frame2.corr()
| ball | pen | pencil | paper |
---|
ball | 1.000000 | -0.276026 | 0.577350 | -0.763763 |
---|
pen | -0.276026 | 1.000000 | -0.079682 | -0.361403 |
---|
pencil | 0.577350 | -0.079682 | 1.000000 | -0.692935 |
---|
paper | -0.763763 | -0.361403 | -0.692935 | 1.000000 |
---|
frame2.cov()
| ball | pen | pencil | paper |
---|
ball | 2.000000 | -0.666667 | 2.000000 | -2.333333 |
---|
pen | -0.666667 | 2.916667 | -0.333333 | -1.333333 |
---|
pencil | 2.000000 | -0.333333 | 6.000000 | -3.666667 |
---|
paper | -2.333333 | -1.333333 | -3.666667 | 4.666667 |
---|
🌴4.1.3DataFrame和Series相关性
corrwith()可以计算DataFrame对象的列或行与Series对象或者其他DataFrame对象元素两两之间的相关性
ser
red 5
blue 0
yellow 3
white 8
green 4
dtype: int64
frame2.corrwith(ser)
ball -0.140028
pen -0.869657
pencil 0.080845
paper 0.595854
dtype: float64
frame2.corrwith(frame)
ball 0.730297
pen -0.831522
pencil 0.210819
paper -0.119523
dtype: float64
🌵4.2排序和秩
- Series用sort_values()和rank(),默认是升序,使用ascending=False改变为升序,下同
- DataFrame用sort_index(by=‘’)和rank()
对ser排序
ser.sort_values()
blue 0
yellow 3
green 4
red 5
white 8
dtype: int64
对ser求秩
ser.rank()
red 4.0
blue 1.0
yellow 2.0
white 5.0
green 3.0
dtype: float64
安装pen对frame进行排序
frame.sort_values(by='pen')
| ball | pen | pencil | paper |
---|
red | 0 | 1 | 2 | 3 |
---|
blue | 4 | 5 | 6 | 7 |
---|
yellow | 8 | 9 | 10 | 11 |
---|
white | 12 | 13 | 14 | 15 |
---|
ser = pd.Series([5,0,3,8,4],index=['red','blue','yellow','white','green'])
ser
red 5
blue 0
yellow 3
white 8
green 4
dtype: int64
ser.sort_index()
#按字母表顺序升序排列
blue 0
green 4
red 5
white 8
yellow 3
dtype: int64
ser.sort_index(ascending=False)
# 改为降序
yellow 3
white 8
red 5
green 4
blue 0
dtype: int64
ser.sort_values()
#根据值排序
blue 0
yellow 3
green 4
red 5
white 8
dtype: int64
frame
| ball | pen | pencil | paper |
---|
red | 0 | 1 | 2 | 3 |
---|
blue | 4 | 5 | 6 | 7 |
---|
yellow | 8 | 9 | 10 | 11 |
---|
white | 12 | 13 | 14 | 15 |
---|
frame.sort_index()
| ball | pen | pencil | paper |
---|
blue | 4 | 5 | 6 | 7 |
---|
red | 0 | 1 | 2 | 3 |
---|
white | 12 | 13 | 14 | 15 |
---|
yellow | 8 | 9 | 10 | 11 |
---|
axis代表轴,1表示纵轴,0表示横轴
frame.sort_index(axis=1)
| ball | paper | pen | pencil |
---|
red | 0 | 3 | 1 | 2 |
---|
blue | 4 | 7 | 5 | 6 |
---|
yellow | 8 | 11 | 9 | 10 |
---|
white | 12 | 15 | 13 | 14 |
---|
🌾5.Pandas缺失值处理
🌿5.1 创建NaN数据
ser = pd.Series([0,1,2,np.NaN,9], index=['red','blue','yellow','white','green'])
ser
red 0.0
blue 1.0
yellow 2.0
white NaN
green 9.0
dtype: float64
ser['white']
nan
☘️5.2 删除NaN
- dropna()
- ser[ser.notnull()]
- DataFrame中去除时 为避免删除整行或整列,用how='all’来表示只删除所有元素均为NAN的行或列,如果使用how=‘any’,则只要这一列有缺失值就删除整列
frame3.dropna(how='all')
| ball | mug | paper | pen | pencil |
---|
blue | 4.0 | NaN | NaN | 6.0 | NaN |
---|
white | 21.0 | NaN | NaN | 23.0 | NaN |
---|
yellow | 11.0 | NaN | NaN | 13.0 | NaN |
---|
🍀5.3 为NaN元素填充其他值
frame3.fillna(0)
| ball | mug | paper | pen | pencil |
---|
blue | 4.0 | 0.0 | 0.0 | 6.0 | 0.0 |
---|
green | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
---|
red | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
---|
white | 21.0 | 0.0 | 0.0 | 23.0 | 0.0 |
---|
yellow | 11.0 | 0.0 | 0.0 | 13.0 | 0.0 |
---|
#若要将不同列的NaN换成不同元素,依次指定列名称及要替换成的元素即可
frame3.fillna({'ball':1,"pen":99})
| ball | mug | paper | pen | pencil |
---|
blue | 4.0 | NaN | NaN | 6.0 | NaN |
---|
green | 1.0 | NaN | NaN | 99.0 | NaN |
---|
red | 1.0 | NaN | NaN | 99.0 | NaN |
---|
white | 21.0 | NaN | NaN | 23.0 | NaN |
---|
yellow | 11.0 | NaN | NaN | 13.0 | NaN |
---|
🍁6. 层级索引和分层统计
有时候我们需要对数据进行分层级的索引,具体看下面这个例子
mser = pd.Series(np.random.rand(8),index=[['white','white','white','blue','blue','red','red','red'],
['up','down','right','up','down','up','down','left']])
mser
white up 0.323513
down 0.080292
right 0.503630
blue up 0.201143
down 0.173879
red up 0.866267
down 0.601906
left 0.140885
dtype: float64
mser.index
MultiIndex(levels=[['blue', 'red', 'white'], ['down', 'left', 'right', 'up']],
codes=[[2, 2, 2, 0, 0, 1, 1, 1], [3, 0, 2, 3, 0, 3, 0, 1]])
mser['white']
up 0.323513
down 0.080292
right 0.503630
dtype: float64
mser[:,'up']
white 0.323513
blue 0.201143
red 0.866267
dtype: float64
mser[:,'right']
white 0.50363
dtype: float64
mser['white','up']#可以得到某一特定元素的
0.32351250980575463
🍂6.1 unstack()函数和stack()函数
unstack把等级索引Series对象转换为一个简单的DataFrame对象,把第二列索引转换为相应的列,stack则相反,具体如下
mser.unstack() #将series转换为dataframe
mser.unstack().fillna(0)
| down | left | right | up |
---|
blue | 0.173879 | 0.000000 | 0.00000 | 0.201143 |
---|
red | 0.601906 | 0.140885 | 0.00000 | 0.866267 |
---|
white | 0.080292 | 0.000000 | 0.50363 | 0.323513 |
---|
frame
| ball | pen | pencil | paper |
---|
red | 0 | 1 | 2 | 3 |
---|
blue | 4 | 5 | 6 | 7 |
---|
yellow | 8 | 9 | 10 | 11 |
---|
white | 12 | 13 | 14 | 15 |
---|
frame.stack()#将dataframe转换为series对象
red ball 0
pen 1
pencil 2
paper 3
blue ball 4
pen 5
pencil 6
paper 7
yellow ball 8
pen 9
pencil 10
paper 11
white ball 12
pen 13
pencil 14
paper 15
dtype: int32
dataframe对象的行与列也可以定义分层级索引
mframe = pd.DataFrame(np.arange(16).reshape((4,4)),
index = [['white','white','red','red'],['up','down','up','down']],
columns=[['pen','pen','paper','paper'],[1,2,1,2]])
mframe
| | pen | paper |
---|
| | 1 | 2 | 1 | 2 |
---|
white | up | 0 | 1 | 2 | 3 |
---|
down | 4 | 5 | 6 | 7 |
---|
red | up | 8 | 9 | 10 | 11 |
---|
down | 12 | 13 | 14 | 15 |
---|
🍃6.2调整层级顺序
- swaplevel()函数以要互换位置的两个层级的名称为参数,返回交换位置后的一个新对象,其中的个元素的顺序保持不变
mframe.columns.names = ['object','id']
mframe.index.names = ['colors','status']
mframe
| object | pen | paper |
---|
| id | 1 | 2 | 1 | 2 |
---|
colors | status | | | | |
---|
white | up | 0 | 1 | 2 | 3 |
---|
down | 4 | 5 | 6 | 7 |
---|
red | up | 8 | 9 | 10 | 11 |
---|
down | 12 | 13 | 14 | 15 |
---|
mframe.swaplevel('colors','status')
| object | pen | paper |
---|
| id | 1 | 2 | 1 | 2 |
---|
status | colors | | | | |
---|
up | white | 0 | 1 | 2 | 3 |
---|
down | white | 4 | 5 | 6 | 7 |
---|
up | red | 8 | 9 | 10 | 11 |
---|
down | red | 12 | 13 | 14 | 15 |
---|
🌍6.3按层级统计数据
mframe.sum(level='colors')
object | pen | paper |
---|
id | 1 | 2 | 1 | 2 |
---|
colors | | | | |
---|
white | 4 | 6 | 8 | 10 |
---|
red | 20 | 22 | 24 | 26 |
---|
若想对某一层级的列进行统计,则需要把axis的值设置为1
mframe.sum(level='id', axis=1)
| id | 1 | 2 |
---|
colors | status | | |
---|
white | up | 2 | 4 |
---|
down | 10 | 12 |
---|
red | up | 18 | 20 |
---|
down | 26 | 28 |
---|
🌎7.数据导入
很多时候,我们要分析的数据来自电脑上保存的数据文件,本文介绍一下如何导入我们最常用的csv文件,后续我还会介绍如何导入json
数据、以及连接SQL数据库等其他的方式来导入数据
import pandas as pd # 加载模块
df = pd.read_csv('student.csv')
df
Student ID name age gender
11 1111 Dw 3 Female
12 1112 Q 23 Male
13 1113 W 21 Female
| id | color | brand_x | sid | brand_y |
---|
0 | ball | white | OMG | ball | ABC |
---|
1 | pencil | red | ABC | pencil | OMG |
---|
2 | pencil | red | ABC | pencil | POD |
---|
3 | pen | red | ABC | pen | POD |
---|
🌏8.数据处理
🌐8.1 连接
使用merge()函数 类似sql中的多表连接
🎇8.1.1 内连接
import numpy as np
import pandas as pd
frame1 = pd.DataFrame({'id':['ball','pencil','pen','mug','ashtray'],
'price':[12.33,11.44,33.21,13.23,33.62]})
frame1
| id | price |
---|
0 | ball | 12.33 |
---|
1 | pencil | 11.44 |
---|
2 | pen | 33.21 |
---|
3 | mug | 13.23 |
---|
4 | ashtray | 33.62 |
---|
frame2 = pd.DataFrame({'id':['pencil','pencil','ball','pen'],
'color':['white','red','red','black']})
frame2
| id | color |
---|
0 | pencil | white |
---|
1 | pencil | red |
---|
2 | ball | red |
---|
3 | pen | black |
---|
pd.merge(frame1,frame2)
| id | price | color |
---|
0 | ball | 12.33 | red |
---|
1 | pencil | 11.44 | white |
---|
2 | pencil | 11.44 | red |
---|
3 | pen | 33.21 | black |
---|
上述返回的DataFrame对象由原来的两个DataFrame对象中ID相同的行组成 并且没有指定基于哪一列进行合并,实际应用中通常要指定连接条件, 用on来zhid
frame1 = pd.DataFrame({'id':['ball','pencil','pen','mug','ashtray'],
'color':['white','red','red','black','green'],
'brand':['OMG','ABC','ABC','POD','POD']})
frame1
| id | color | brand |
---|
0 | ball | white | OMG |
---|
1 | pencil | red | ABC |
---|
2 | pen | red | ABC |
---|
3 | mug | black | POD |
---|
4 | ashtray | green | POD |
---|
frame2 = pd.DataFrame({'id':['pencil','pencil','ball','pen'],
'brand':['OMG','POD','ABC','POD']})
frame2
| id | brand |
---|
0 | pencil | OMG |
---|
1 | pencil | POD |
---|
2 | ball | ABC |
---|
3 | pen | POD |
---|
pd.merge(frame1,frame2,on='id') # 以id进行合并
| id | color | brand_x | brand_y |
---|
0 | ball | white | OMG | ABC |
---|
1 | pencil | red | ABC | OMG |
---|
2 | pencil | red | ABC | POD |
---|
3 | pen | red | ABC | POD |
---|
pd.merge(frame1,frame2,on='brand') # 以brand进行合并
| id_x | color | brand | id_y |
---|
0 | ball | white | OMG | pencil |
---|
1 | pencil | red | ABC | ball |
---|
2 | pen | red | ABC | ball |
---|
3 | mug | black | POD | pencil |
---|
4 | mug | black | POD | pen |
---|
5 | ashtray | green | POD | pencil |
---|
6 | ashtray | green | POD | pen |
---|
当出现两个列的名称不一致的时候,使用left_on 和 right_on,例如,下面两个表,一个是id
,一个是sid
,我们相当于是用第一个表的id和第二个表的sid连接
frame2.columns = ['sid','brand']
frame2
| sid | brand |
---|
0 | pencil | OMG |
---|
1 | pencil | POD |
---|
2 | ball | ABC |
---|
3 | pen | POD |
---|
pd.merge(frame1,frame2,left_on = 'id',right_on ='sid')
| id | color | brand_x | sid | brand_y |
---|
0 | ball | white | OMG | ball | ABC |
---|
1 | pencil | red | ABC | pencil | OMG |
---|
2 | pencil | red | ABC | pencil | POD |
---|
3 | pen | red | ABC | pen | POD |
---|
merge()函数默认的是内连接,上述结果中的键是由交叉 *** 作出来的
🎉8.1.2 外连接
- 连接类型用how选项指定
- 左连接 共有的加上左边的
- 右连接 共有的加上右边的
- 外连接把所有的键整合到一起
frame2.columns=['id','brand']
pd.merge(frame1,frame2,how='outer')
| id | color | brand |
---|
0 | ball | white | OMG |
---|
1 | pencil | red | ABC |
---|
2 | pen | red | ABC |
---|
3 | mug | black | POD |
---|
4 | ashtray | green | POD |
---|
5 | pencil | NaN | OMG |
---|
6 | pencil | NaN | POD |
---|
7 | ball | NaN | ABC |
---|
8 | pen | NaN | POD |
---|
pd.merge(frame1,frame2,how='left')
| id | color | brand |
---|
0 | ball | white | OMG |
---|
1 | pencil | red | ABC |
---|
2 | pen | red | ABC |
---|
3 | mug | black | POD |
---|
4 | ashtray | green | POD |
---|
pd.merge(frame1,frame2,how='right')
| id | color | brand |
---|
0 | pencil | NaN | OMG |
---|
1 | pencil | NaN | POD |
---|
2 | ball | NaN | ABC |
---|
3 | pen | NaN | POD |
---|
要合并多个键,则把多个键给on选项
pd.merge(frame1,frame2,on=['id','brand'],how='outer')
| id | color | brand |
---|
0 | ball | white | OMG |
---|
1 | pencil | red | ABC |
---|
2 | pen | red | ABC |
---|
3 | mug | black | POD |
---|
4 | ashtray | green | POD |
---|
5 | pencil | NaN | OMG |
---|
6 | pencil | NaN | POD |
---|
7 | ball | NaN | ABC |
---|
8 | pen | NaN | POD |
---|
🎊8.1.3 以索引作为键进行连接
#方法一
pd.merge(frame1,frame2,left_index=True,right_index=True)
| id_x | color | brand_x | id_y | brand_y |
---|
0 | ball | white | OMG | pencil | OMG |
---|
1 | pencil | red | ABC | pencil | POD |
---|
2 | pen | red | ABC | ball | ABC |
---|
3 | mug | black | POD | pen | POD |
---|
#方法二 用join函数 默认左连接
frame2.columns = ['id2','brand2']
frame1.join(frame2)
| id | color | brand | id2 | brand2 |
---|
0 | ball | white | OMG | pencil | OMG |
---|
1 | pencil | red | ABC | pencil | POD |
---|
2 | pen | red | ABC | ball | ABC |
---|
3 | mug | black | POD | pen | POD |
---|
4 | ashtray | green | POD | NaN | NaN |
---|
🎄8.2拼接
- numpy中的concatenation()函数可以用来进行拼接 *** 作
- pandas的concat()函数实现了按轴拼接的功能()
arr1 = np.arange(9).reshape(3,3)
arr1
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
arr2 = np.arange(6,15).reshape(3,3)
arr2
array([[ 6, 7, 8],
[ 9, 10, 11],
[12, 13, 14]])
np.concatenate([arr1,arr2],axis=1)#默认是axis=0
array([[ 0, 1, 2, 6, 7, 8],
[ 3, 4, 5, 9, 10, 11],
[ 6, 7, 8, 12, 13, 14]])
ser1 = pd.Series(np.random.rand(4), index = [1,2,3,4])
ser1
1 0.180191
2 0.061649
3 0.236378
4 0.105309
dtype: float64
ser2 = pd.Series(np.random.rand(4), index = [5,6,7,8])
ser2
5 0.935277
6 0.516146
7 0.210461
8 0.912048
dtype: float64
pd.concat([ser1,ser2])
1 0.180191
2 0.061649
3 0.236378
4 0.105309
5 0.935277
6 0.516146
7 0.210461
8 0.912048
dtype: float64
pd.concat([ser1,ser2],axis = 1)
| 0 | 1 |
---|
1 | 0.180191 | NaN |
---|
2 | 0.061649 | NaN |
---|
3 | 0.236378 | NaN |
---|
4 | 0.105309 | NaN |
---|
5 | NaN | 0.935277 |
---|
6 | NaN | 0.516146 |
---|
7 | NaN | 0.210461 |
---|
8 | NaN | 0.912048 |
---|
默认是外连接
pd.concat([ser1,ser2],axis=1,join='inner')
如果想要创建等级索引,需要用keys选项来完成
pd.concat([ser1,ser2],keys=[1,2])
1 1 0.180191
2 0.061649
3 0.236378
4 0.105309
2 5 0.935277
6 0.516146
7 0.210461
8 0.912048
dtype: float64
pd.concat([ser1,ser2],axis=1,keys=[1,2])
| 1 | 2 |
---|
1 | 0.180191 | NaN |
---|
2 | 0.061649 | NaN |
---|
3 | 0.236378 | NaN |
---|
4 | 0.105309 | NaN |
---|
5 | NaN | 0.935277 |
---|
6 | NaN | 0.516146 |
---|
7 | NaN | 0.210461 |
---|
8 | NaN | 0.912048 |
---|
🎋8.3组合
- 当无法通过合并或者拼接方法组合数据用组合函数
- combine_first()函数可以用来组合Series对象,同时对齐数据
ser1 = pd.Series(np.random.rand(5), index=[1,2,3,4,5])
ser1
1 0.708279
2 0.233048
3 0.030991
4 0.261291
5 0.379752
dtype: float64
ser2 = pd.Series(np.random.rand(4), index = [2,4,5,6])
ser2
2 0.017397
4 0.764295
5 0.407552
6 0.352605
dtype: float64
ser1.combine_first(ser2)
1 0.708279
2 0.233048
3 0.030991
4 0.261291
5 0.379752
6 0.352605
dtype: float64
pd.concat([ser1,ser2])
1 0.708279
2 0.233048
3 0.030991
4 0.261291
5 0.379752
2 0.017397
4 0.764295
5 0.407552
6 0.352605
dtype: float64
🏆文章推荐
Python数据可视化大杀器之Seaborn:学完可实现90%数据分析绘图
Python数据分析大杀器之Numpy详解
评论列表(0条)