Pandas data structures: DataFrame [2-D] (Part 4) [column indexing: df[column_name]] [row indexing: df.loc[], df.iloc[]]

A DataFrame has both a row index and a column index. It can be viewed as a dict of Series that all share one index.

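I. Indexing by column name: df['col_name']

Columns are selected by name: selecting one column returns a Series, selecting several returns a DataFrame.

df[] is normally used to select columns, with the column name(s) inside the brackets (which is why columns are usually given explicit names instead of the default integer labels, so they are not confused with the index);
a single column comes back as a Series and prints in Series format;
multiple columns come back as a DataFrame and print in DataFrame format;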
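
To make the remark about default integer column names concrete, here is a minimal sketch. The frame `df_default` is an illustrative assumption, not part of the original example; it shows how the same-looking integer means "column label" when used alone and "row position" when used in a slice.

```python
import numpy as np
import pandas as pd

# No column names given, so pandas falls back to integer column labels 0..3
df_default = pd.DataFrame(np.arange(12).reshape(3, 4))

col0 = df_default[0]       # single integer -> column labelled 0, returned as a Series
rows01 = df_default[0:2]   # integer slice -> rows at positions 0 and 1, a DataFrame

print(col0.tolist())       # [0, 4, 8]
print(rows01.shape)        # (2, 4)
```

With explicit column names such as 'a', 'b', 'c', 'd' this ambiguity does not arise.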

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(12).reshape(3, 4) * 100,
                  index=['one', 'two', 'three'],
                  columns=['a', 'b', 'c', 'd'])
print("df = ", df)
print('-' * 100)

# Select columns by name: one column returns a Series, several return a DataFrame
data1 = df['a']
data2 = df[['a', 'c']]
print("data1 = \n{0}\ntype(data1) = {1}".format(data1, type(data1)))
print('-' * 100)
print("data2 = \n{0}\ntype(data2) = {1}".format(data2, type(data2)))

Output:

df =                 a          b          c          d
one    12.427304  39.089892  22.467365  22.711018
two    50.808058  67.916443  39.312617  95.227642
three   3.399731  57.874266  45.771234  99.649908
----------------------------------------------------------------------------------------------------
data1 = 
one      12.427304
two      50.808058
three     3.399731
Name: a, dtype: float64
type(data1) = <class 'pandas.core.series.Series'>
----------------------------------------------------------------------------------------------------
data2 = 
               a          c
one    12.427304  22.467365
two    50.808058  39.312617
three   3.399731  45.771234
type(data2) = <class 'pandas.core.frame.DataFrame'>
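
A related detail that is easy to miss: what decides between Series and DataFrame is whether a list is passed, not how many columns are named. The sketch below uses a small made-up frame (`df_demo` is an assumption for illustration).

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]},
                       index=['one', 'two', 'three'])

as_series = df_demo['a']     # single label -> Series
as_frame = df_demo[['a']]    # list holding one label -> one-column DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (3, 1)
```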

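II. Row indexing

Rows are selected by index label: selecting one row returns a Series, selecting several returns a DataFrame.

1. Single-label indexing: df.loc[1], df.loc['one']

By row name: df.loc[row_name]
By default integer label: df.loc[int_label] (with the default index the labels 0, 1, 2, ... happen to equal the positions, but .loc still treats the integer as a label)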

import numpy as np
import pandas as pd

# df.loc[] - select rows by index label
# Key point: df.loc[label] selects rows by index label; it works with a custom index as well as the default integer index

df1 = pd.DataFrame(np.random.rand(16).reshape(4, 4) * 100,
                   index=['one', 'two', 'three', 'four'],
                   columns=['a', 'b', 'c', 'd'])

df2 = pd.DataFrame(np.random.rand(16).reshape(4, 4) * 100,
                   columns=['a', 'b', 'c', 'd'])

print("df1 = \n{0}\ntype(df1) = {1}".format(df1, type(df1)))
print('-' * 50)
print("df2 = \n{0}\ntype(df2) = {1}".format(df2, type(df2)))
print('-' * 100)

# Single-label indexing returns a Series
data1 = df1.loc['one']
data2 = df2.loc[1]
print("Single-label indexing: data1 = \ndf1.loc['one'] = \n{0}\ntype(data1) = {1}".format(data1, type(data1)))
print('-' * 50)
print("Single-label indexing: data2 = \ndf2.loc[1] = \n{0}\ntype(data2) = {1}".format(data2, type(data2)))
print('-' * 100)

Output:

df1 = 
               a          b          c          d
one    93.037642  52.895322  42.547540  95.435676
two    24.088954  56.966169  79.185705  48.582922
three  76.162602  32.962263  41.853371  99.138612
four   24.979909  10.191909  27.335317  20.452524
type(df1) = <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------
df2 = 
           a          b          c          d
0  21.656858  31.404614  88.520987  41.839721
1  26.884644   9.943081  91.739139  81.479288
2  96.522109  71.673956  55.843560  38.131336
3  73.574839  93.350715  89.358183  45.521198
type(df2) = <class 'pandas.core.frame.DataFrame'>
----------------------------------------------------------------------------------------------------
Single-label indexing: data1 = 
df1.loc['one'] = 
a    93.037642
b    52.895322
c    42.547540
d    95.435676
Name: one, dtype: float64
type(data1) = <class 'pandas.core.series.Series'>
--------------------------------------------------
Single-label indexing: data2 = 
df2.loc[1] = 
a    26.884644
b     9.943081
c    91.739139
d    81.479288
Name: 1, dtype: float64
type(data2) = <class 'pandas.core.series.Series'>
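
With the default index, df2.loc[1] looks positional only because label 1 sits at position 1. The following sketch, using an assumed frame `df_lab` whose integer labels do not match their positions, shows that .loc really matches labels.

```python
import pandas as pd

# Integer labels that deliberately differ from the row positions
df_lab = pd.DataFrame({'a': [1, 2, 3]}, index=[10, 20, 30])

by_position = df_lab.iloc[1]   # second row, whatever its label is
by_label = df_lab.loc[20]      # row whose label is 20

try:
    df_lab.loc[1]              # no row is labelled 1
    missing_label_raises = False
except KeyError:
    missing_label_raises = True

print(by_position['a'], by_label['a'], missing_label_raises)  # 2 2 True
```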

2. Multi-label indexing: df.loc[[3, 2, 1]], df.loc[['two', 'three']]

import numpy as np
import pandas as pd

# df.loc[] - select rows by index label
# Key point: df.loc[label] selects rows by index label; it works with a custom index as well as the default integer index

df1 = pd.DataFrame(np.random.rand(16).reshape(4, 4) * 100,
                   index=['one', 'two', 'three', 'four'],
                   columns=['a', 'b', 'c', 'd'])

df2 = pd.DataFrame(np.random.rand(16).reshape(4, 4) * 100,
                   columns=['a', 'b', 'c', 'd'])

print("df1 = \n{0}\ntype(df1) = {1}".format(df1, type(df1)))
print('-' * 50)
print("df2 = \n{0}\ntype(df2) = {1}".format(df2, type(df2)))
print('-' * 100)

# Multi-label indexing [the labels may be given in any order]
data3 = df1.loc[['two', 'three']]
data4 = df2.loc[[3, 2, 1]]
print("Multi-label indexing: data3 = \ndf1.loc[['two', 'three']] = \n{0}\ntype(data3) = {1}".format(data3, type(data3)))
print('-' * 50)
print("Multi-label indexing: data4 = \ndf2.loc[[3, 2, 1]] = \n{0}\ntype(data4) = {1}".format(data4, type(data4)))
print('-' * 100)

Output:

df1 = 
               a          b          c          d
one    93.037642  52.895322  42.547540  95.435676
two    24.088954  56.966169  79.185705  48.582922
three  76.162602  32.962263  41.853371  99.138612
four   24.979909  10.191909  27.335317  20.452524
type(df1) = <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------
df2 = 
           a          b          c          d
0  21.656858  31.404614  88.520987  41.839721
1  26.884644   9.943081  91.739139  81.479288
2  96.522109  71.673956  55.843560  38.131336
3  73.574839  93.350715  89.358183  45.521198
type(df2) = <class 'pandas.core.frame.DataFrame'>
----------------------------------------------------------------------------------------------------
Multi-label indexing: data3 = 
df1.loc[['two', 'three']] = 
               a          b          c          d
two    24.088954  56.966169  79.185705  48.582922
three  76.162602  32.962263  41.853371  99.138612
type(data3) = <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------
Multi-label indexing: data4 = 
df2.loc[[3, 2, 1]] = 
           a          b          c          d
3  73.574839  93.350715  89.358183  45.521198
2  96.522109  71.673956  55.843560  38.131336
1  26.884644   9.943081  91.739139  81.479288
type(data4) = <class 'pandas.core.frame.DataFrame'>
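
One thing the example does not cover is a list that contains a label missing from the index. As far as I know, pandas 1.0 and later raise KeyError in that case; reindex is one way to get NaN rows instead. The frame `df_rows` and the label 'five' below are assumptions made up for this sketch.

```python
import numpy as np
import pandas as pd

df_rows = pd.DataFrame(np.arange(8).reshape(4, 2),
                       index=['one', 'two', 'three', 'four'],
                       columns=['a', 'b'])

try:
    df_rows.loc[['two', 'five']]       # 'five' is not in the index
    raised = False
except KeyError:
    raised = True

# reindex keeps the requested order and fills rows for missing labels with NaN
filled = df_rows.reindex(['two', 'five'])

print(raised)                           # True
print(filled.loc['five'].isna().all())  # True
```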

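3. Row slicing: df.loc[1:3], df.loc['one':'three'], df.iloc[1:3], df[1:3]

df.loc[]

df.loc[]: selects rows by index label [end label included]
Key point: df.loc[label] selects rows by index label; it works with a custom index as well as the default integer index.
An int passed to .loc is an index label, not a positional offset.

df.iloc[]

Selects rows by position [end position excluded]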

import numpy as np
import pandas as pd

# df.loc[] - select rows by index label
# Key point: df.loc[label] selects rows by index label; it works with a custom index as well as the default integer index

df1 = pd.DataFrame(np.random.rand(16).reshape(4, 4) * 100,
                   index=['one', 'two', 'three', 'four'],
                   columns=['a', 'b', 'c', 'd'])

df2 = pd.DataFrame(np.random.rand(16).reshape(4, 4) * 100,
                   columns=['a', 'b', 'c', 'd'])

print("df1 = \n{0}\ntype(df1) = {1}".format(df1, type(df1)))
print('-' * 50)
print("df2 = \n{0}\ntype(df2) = {1}".format(df2, type(df2)))
print('-' * 100)

# .loc accepts a slice [end label included]
data5 = df1.loc['one':'three']
data6 = df2.loc[1:3]
print("Slice indexing: data5 = \ndf1.loc['one':'three'] = \n{0}\ntype(data5) = {1}".format(data5, type(data5)))
print('-' * 50)
print("Slice indexing: data6 = \ndf2.loc[1:3] = \n{0}\ntype(data6) = {1}".format(data6, type(data6)))
print('-' * 100)

# Slicing with .iloc [end position excluded]
data7 = df2.iloc[1:3]
print("Slice indexing: data7 = \ndf2.iloc[1:3] = \n{0}\ntype(data7) = {1}".format(data7, type(data7)))
print('-' * 100)

Output:

Slice indexing: data5 = 
df1.loc['one':'three'] = 
               a          b          c          d
one    93.037642  52.895322  42.547540  95.435676
two    24.088954  56.966169  79.185705  48.582922
three  76.162602  32.962263  41.853371  99.138612
type(data5) = <class 'pandas.core.frame.DataFrame'>
--------------------------------------------------
Slice indexing: data6 = 
df2.loc[1:3] = 
           a          b          c          d
1  26.884644   9.943081  91.739139  81.479288
2  96.522109  71.673956  55.843560  38.131336
3  73.574839  93.350715  89.358183  45.521198
type(data6) = <class 'pandas.core.frame.DataFrame'>
----------------------------------------------------------------------------------------------------
Slice indexing: data7 = 
df2.iloc[1:3] = 
           a          b          c          d
1  27.523996  42.360457  38.575211   0.698684
2  29.461314  53.466241  83.289472  36.324424
type(data7) = <class 'pandas.core.frame.DataFrame'>

Note: the df2.iloc[1:3] output was captured from a separate run, so its values differ from rows 1 and 2 of df2.loc[1:3]; np.random.rand is not seeded here and produces new numbers on every run.
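
Because the printed values above depend on random data, the end-inclusive versus end-exclusive behaviour may be easier to check on fixed numbers. This is a small sketch with an assumed deterministic frame `df_s`.

```python
import numpy as np
import pandas as pd

df_s = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['one', 'two', 'three', 'four'],
                    columns=['a', 'b', 'c', 'd'])

by_label = df_s.loc['one':'three']   # stop label 'three' is included
by_position = df_s.iloc[0:3]         # stop position 3 is excluded

print(by_label.equals(by_position))  # True
print(len(df_s.iloc[0:2]))           # 2
```

Both selections cover the same three rows, yet the label slice names its last row while the positional slice names the position just past it.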

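Row slicing with df[]: df[:row_index]

When df[] receives a slice it selects rows; only slices work, a single position does not (df[0] is looked up as a column labelled 0 and raises KeyError here)
The result is a DataFrame even when only one row is selected
df[] cannot select a row by a single index label (df['one'] is looked up as a column name); use df.loc['one'] for that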

import numpy as np
import pandas as pd

# df[] is normally used to select columns, but it can also select rows
# Key point: df[col] selects columns; put the column name(s) inside []

df = pd.DataFrame(np.random.rand(12).reshape(3, 4) * 100,
                  index=['one', 'two', 'three'],
                  columns=['a', 'b', 'c', 'd'])
print("df = ", df)
print('-' * 100)

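# With a slice, df[] selects rows; a single position is not allowed (df[0])
# The result is a DataFrame even when only one row is selected
# df[] cannot select a row by a single index label (df['one'])
data3 = df[:1]
# data3 = df[0]      # KeyError: there is no column labelled 0
# data3 = df['one']  # KeyError: there is no column labelled 'one'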
print("data3 = \n{0}\ntype(data3) = {1}".format(data3, type(data3)))
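
The rules for df[] can be checked on a frame with fixed values. In this sketch `df_b` is an assumed stand-in for df. It also shows one case not mentioned above: a slice of index labels is accepted by df[] and, like .loc, includes its end label.

```python
import numpy as np
import pandas as pd

df_b = pd.DataFrame(np.arange(12).reshape(3, 4),
                    index=['one', 'two', 'three'],
                    columns=['a', 'b', 'c', 'd'])

first_row = df_b[:1]               # positional slice, still a DataFrame
label_slice = df_b['one':'two']    # label slice on rows, end label included

try:
    df_b['one']                    # looked up as a column name, not a row
    single_label_ok = True
except KeyError:
    single_label_ok = False

print(first_row.shape, label_slice.shape, single_label_ok)  # (1, 4) (2, 4) False
```

For anything beyond quick slicing, .loc and .iloc state the intent more clearly than df[].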

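4. The difference between df.loc[] and df.iloc[]

First, a brief introduction to the two:

loc uses index labels to fetch the rows (or columns) you want [label-based]
iloc uses positions within the index to fetch the rows (or columns) you want [position-based], so it takes integer positions rather than labels.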

import numpy as np
import pandas as pd

s = pd.Series(np.nan, index=[49, 48, 47, 46, 45, 1, 2, 3, 4, 5])
print("s = \n", s)

Output:

s = 
49   NaN
48   NaN
47   NaN
46   NaN
45   NaN
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
dtype: float64

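Let us try fetching values with the integer 3.

s.iloc[:3] returns the first 3 rows (3 is treated as a position);
s.loc[:3] returns the first 8 rows (3 is treated as a label, and the slice runs up to and including the row labelled 3);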

import numpy as np
import pandas as pd

s = pd.Series(np.nan, index=[49, 48, 47, 46, 45, 1, 2, 3, 4, 5])
print("s.iloc[:3] = \n", s.iloc[:3])
print("-" * 50)
print("s.loc[:3] = \n", s.loc[:3])

Output:

s.iloc[:3] = 
49   NaN
48   NaN
47   NaN
dtype: float64
--------------------------------------------------
s.loc[:3] = 
49   NaN
48   NaN
47   NaN
46   NaN
45   NaN
1    NaN
2    NaN
3    NaN
dtype: float64

What happens if we use an integer that is not in the index, such as 6?

s.iloc[:6] simply returns the first 6 rows.

import numpy as np
import pandas as pd

s = pd.Series(np.nan, index=[49, 48, 47, 46, 45, 1, 2, 3, 4, 5])
print("s.iloc[:6] = \n", s.iloc[:6])

Output:

s.iloc[:6] = 
 49   NaN
48   NaN
47   NaN
46   NaN
45   NaN
1    NaN
dtype: float64

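However, s.loc[:6] raises KeyError: 6 is not a label in the index, and because this index is not sorted pandas cannot work out where the slice should end.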
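
The KeyError can be reproduced and contrasted with a sorted index in a few lines. This is a sketch reusing the same Series; the names `raised` and `sorted_labels` are only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.nan, index=[49, 48, 47, 46, 45, 1, 2, 3, 4, 5])

try:
    s.loc[:6]                  # 6 is not a label and the index is unsorted
    raised = False
except KeyError:
    raised = True

# After sorting, the index is monotonic, so a missing slice bound is allowed
sorted_labels = s.sort_index().loc[:6].index.tolist()

print(raised)         # True
print(sorted_labels)  # [1, 2, 3, 4, 5]
```

On the sorted index the slice ends at the last label not greater than 6.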

III. Slicing rows and columns together: df.iloc[]

Mixing label and position information: for a DataFrame, how do I extract everything up to and including row c that also lies in the first 4 columns?

iloc[row_start:row_end, column_start:column_end]

import numpy as np
import pandas as pd

df = pd.DataFrame(np.nan,
                  index=list('abcde'),
                  columns=['x', 'y', 'z', 8, 9])
print("df = \n", df)

print("-" * 100)

df_select = df.iloc[:df.index.get_loc('c') + 1, :4]
print("df_select = \n", df_select)

Output:

df = 
     x   y   z   8   9
a NaN NaN NaN NaN NaN
b NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN
d NaN NaN NaN NaN NaN
e NaN NaN NaN NaN NaN
----------------------------------------------------------------------------------------------------
df_select = 
     x   y   z   8
a NaN NaN NaN NaN
b NaN NaN NaN NaN
c NaN NaN NaN NaN


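get_loc (as of pandas 0.24.1) is an Index method that returns the integer position of a label in the index, provided the label is unique.

Note that the slice end is excluded, so + 1 is needed to include row c.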
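
The get_loc approach is one way to mix labels and positions. Two alternatives that should give the same result are sketched below on an assumed frame `df_m` with the same labels but fixed values: translating the column positions into labels for .loc, or chaining a label step and a positional step.

```python
import numpy as np
import pandas as pd

df_m = pd.DataFrame(np.arange(25).reshape(5, 5),
                    index=list('abcde'),
                    columns=['x', 'y', 'z', 8, 9])

# Original approach: turn the row label into a position, then use iloc
via_get_loc = df_m.iloc[:df_m.index.get_loc('c') + 1, :4]
# Label slice for rows plus the labels of the first 4 columns
via_loc = df_m.loc[:'c', df_m.columns[:4]]
# Label step first, then a positional step on the columns
via_chain = df_m.loc[:'c'].iloc[:, :4]

print(via_get_loc.equals(via_loc), via_get_loc.equals(via_chain))  # True True
print(via_get_loc.shape)                                           # (3, 4)
```

The .loc version needs no + 1 because label slices already include their end label.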

IV. Extracting target rows / target columns: df.loc[]

import numpy as np
import pandas as pd

df = pd.DataFrame(np.nan,
                  index=list('abcde'),
                  columns=['x', 'y', 'z', 8, 9])
print("df = \n", df)

print("-" * 100)

data1 = df.loc[['b', 'c']]
print("data1 = \n", data1)
print("-" * 50)
data2 = df.loc[:, ['y', 8]]
print("data2 = \n", data2)
print("-" * 50)
data3 = df.loc[['b', 'c'], ['y', 8]]
print("data3 = \n", data3)

Output:

df = 
     x   y   z   8   9
a NaN NaN NaN NaN NaN
b NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN
d NaN NaN NaN NaN NaN
e NaN NaN NaN NaN NaN
----------------------------------------------------------------------------------------------------
data1 = 
     x   y   z   8   9
b NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN
--------------------------------------------------
data2 = 
     y   8
a NaN NaN
b NaN NaN
c NaN NaN
d NaN NaN
e NaN NaN
--------------------------------------------------
data3 = 
     y   8
b NaN NaN
c NaN NaN
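
Since every cell above is NaN, it is hard to see which cells were actually picked. The sketch below repeats the idea on an assumed frame `df_t` with distinct values, and also shows how the return type of df.loc[rows, cols] follows from whether each selector is a scalar, a list, or a slice.

```python
import numpy as np
import pandas as pd

df_t = pd.DataFrame(np.arange(25).reshape(5, 5),
                    index=list('abcde'),
                    columns=['x', 'y', 'z', 8, 9])

cell = df_t.loc['b', 'y']            # two scalar labels -> a single value
row_part = df_t.loc['b', ['y', 8]]   # scalar row, list of columns -> Series
block = df_t.loc['b':'c', ['y', 8]]  # row slice, list of columns -> DataFrame

print(cell)               # 6
print(row_part.tolist())  # [6, 8]
print(block.shape)        # (2, 2)
```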
