# Import the pandas library under the alias "pd"
import pandas as pd
A first look at the "homelessness" dataset used throughout
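The post does not show how `homelessness` is loaded. As a minimal sketch for experimenting, the five rows printed by `.head()` in this post can be typed in by hand. The name `homelessness_sample` is an assumption made for this example, and the sample covers only 5 of the 51 rows, so summary statistics computed from it will not match the full dataset.

```python
import pandas as pd

# Sample only: the first five of the 51 rows, copied from the .head() output.
# With the full dataset you would normally load a file instead, for example
# with pd.read_csv(...) and whatever path your copy of the data lives at.
homelessness_sample = pd.DataFrame({
    "region": ["East South Central", "Pacific", "Mountain",
               "West South Central", "Pacific"],
    "state": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
    "individuals": [2570.0, 1434.0, 7259.0, 2280.0, 109008.0],
    "family_members": [864.0, 582.0, 2606.0, 432.0, 20964.0],
    "state_pop": [4887681, 735139, 7158024, 3009733, 39461588],
})

print(homelessness_sample.shape)   # (rows, columns) of the sample
print(homelessness_sample.dtypes)  # one dtype per column
```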
.head() / .info() / .shape / .describe()
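# Print the first five rows of the DataFrame
print(homelessness.head())
# Show the column names, non-null counts, and data types
print(homelessness.info())
# Print the shape of the DataFrame as (rows, columns)
print(homelessness.shape)
# Print summary statistics for the numeric columns, such as mean and standard deviation
print(homelessness.describe())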
Output:
               region       state  individuals  family_members  state_pop
0  East South Central     Alabama       2570.0           864.0    4887681
1             Pacific      Alaska       1434.0           582.0     735139
2            Mountain     Arizona       7259.0          2606.0    7158024
3  West South Central    Arkansas       2280.0           432.0    3009733
4             Pacific  California     109008.0         20964.0   39461588

Int64Index: 51 entries, 0 to 50
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   region          51 non-null     object
 1   state           51 non-null     object
 2   individuals     51 non-null     float64
 3   family_members  51 non-null     float64
 4   state_pop       51 non-null     int64
dtypes: float64(2), int64(1), object(2)
memory usage: 2.4+ KB
None

(51, 5)

       individuals  family_members  state_pop
count       51.000          51.000  5.100e+01
mean      7225.784        3504.882  6.406e+06
std      15991.025        7805.412  7.327e+06
min        434.000          75.000  5.776e+05
25%       1446.500         592.000  1.777e+06
50%       3082.000        1482.000  4.461e+06
75%       6781.500        3196.000  7.341e+06
max     109008.000       52070.000  3.946e+07

The None after the .info() summary appears because .info() prints its report itself and returns None, which the surrounding print() then displays.
.values / .columns / .index
# View the underlying values as a 2-D array
print(homelessness.values)
# View the column labels
print(homelessness.columns)
# View the row labels (the index)
print(homelessness.index)
Output:
[['East South Central' 'Alabama' 2570.0 864.0 4887681]
 ['Pacific' 'Alaska' 1434.0 582.0 735139]
 : : :
 ['South Atlantic' 'West Virginia' 1021.0 222.0 1804291]
 ['East North Central' 'Wisconsin' 2740.0 2167.0 5807406]
 ['Mountain' 'Wyoming' 434.0 205.0 577601]]
Index(['region', 'state', 'individuals', 'family_members', 'state_pop'], dtype='object')
Int64Index([0, 1, ... 48, 49, 50], dtype='int64')
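The three attributes can also be tried on a tiny frame. The sketch below uses `.to_numpy()`, which the pandas documentation recommends over `.values`. The two-row frame `df` is made up from values shown in this post. Note that `Int64Index` was removed in pandas 2.0, so on a recent version a default index shows up as a `RangeIndex` instead.

```python
import pandas as pd

# Two rows taken from the post's data, just enough to show the attributes.
df = pd.DataFrame({
    "state": ["Alabama", "Alaska"],
    "individuals": [2570.0, 1434.0],
    "state_pop": [4887681, 735139],
})

# Mixed text and numeric columns have no common numeric type,
# so the whole array falls back to dtype object.
arr = df.to_numpy()

# Selecting only numeric columns gives a proper numeric array
# (float64 here, because ints and floats are upcast together).
numeric = df[["individuals", "state_pop"]].to_numpy()

print(arr.dtype, numeric.dtype)
print(df.columns.tolist())  # column labels as a plain list
print(df.index.tolist())    # row labels as a plain list
```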
.sort_values
# Sort by the "individuals" column in ascending order
one_column_increase = homelessness.sort_values("individuals")
# Sort by the "individuals" column in descending order
one_column_decrease = homelessness.sort_values("individuals", ascending=False)
# Sort by "region" in ascending (alphabetical) order first, then by "family_members" in descending order
multi_column_incre_decre = homelessness.sort_values(["region", "family_members"], ascending=[True, False])
Check the results with .head():
Output:
one_column_increase.head() is
                region         state  individuals  family_members  state_pop
50            Mountain       Wyoming        434.0           205.0     577601
34  West North Central  North Dakota        467.0            75.0     758080
7       South Atlantic      Delaware        708.0           374.0     965479
39         New England  Rhode Island        747.0           354.0    1058287
45         New England       Vermont        780.0           511.0     624358

one_column_decrease.head() is
                region       state  individuals  family_members  state_pop
4              Pacific  California     109008.0         20964.0   39461588
32        Mid-Atlantic    New York      39827.0         52070.0   19530351
9       South Atlantic     Florida      21443.0          9587.0   21244317
43  West South Central       Texas      19199.0          6111.0   28628666
47             Pacific  Washington      16424.0          5880.0    7523869

multi_column_incre_decre.head() is
                region      state  individuals  family_members  state_pop
13  East North Central   Illinois       6752.0          3891.0   12723071
35  East North Central       Ohio       6929.0          3320.0   11676341
22  East North Central   Michigan       5209.0          3142.0    9984072
49  East North Central  Wisconsin       2740.0          2167.0    5807406
14  East North Central    Indiana       3776.0          1482.0    6695497
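To see how the sort keys interact, here is a small sketch on a hand-typed five-row sample (the frame `sample` is an assumption built from values in this post, not the full dataset). It also shows that sorting keeps the original row labels, which is why the index in the output above looks shuffled; `reset_index(drop=True)` renumbers the rows if that is unwanted.

```python
import pandas as pd

# Five rows copied from the post's data; not the full 51-row dataset.
sample = pd.DataFrame({
    "region": ["East South Central", "Pacific", "Mountain",
               "West South Central", "Pacific"],
    "state": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
    "individuals": [2570.0, 1434.0, 7259.0, 2280.0, 109008.0],
    "family_members": [864.0, 582.0, 2606.0, 432.0, 20964.0],
})

# Single key, ascending (the default).
by_individuals = sample.sort_values("individuals")

# Two keys: region A-Z, and within each region family_members high to low.
by_region_then_family = sample.sort_values(
    ["region", "family_members"], ascending=[True, False]
)

# Sorting keeps the original row labels; drop them to renumber from 0.
renumbered = by_individuals.reset_index(drop=True)

print(by_individuals["state"].tolist())
print(by_region_then_family["state"].tolist())
print(renumbered.index.tolist())
```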
Some common ways to subset a DataFrame
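Selecting columns by name:
# Select the single column "individuals" (returns a Series)
one_col = homelessness["individuals"]
# Select the two columns "state" and "family_members" (returns a DataFrame)
two_cols = homelessness[["state", "family_members"]]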
Selecting rows by value:
# Find the rows whose "state" is one of several specific names
fix_state = ["California", "Arizona", "Nevada", "Utah"]  # the names to match
name_find_rows = homelessness[homelessness["state"].isin(fix_state)]  # filter with .isin()
Selecting rows by logical conditions:
# Find the rows where "family_members" is below 1000 and "region" is "Pacific"
logi_find_rows = homelessness[(homelessness["family_members"] < 1000) & (homelessness["region"] == "Pacific")]
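The row filters can be checked on a small hand-typed sample (again an assumption, built from five rows shown in this post). The sketch also shows why each condition is wrapped in parentheses: `&` binds more tightly than `<` and `==` in Python, so the parentheses are required. `DataFrame.query` is included as one alternative spelling of the same filter.

```python
import pandas as pd

# Five rows copied from the post's data; not the full 51-row dataset.
sample = pd.DataFrame({
    "region": ["East South Central", "Pacific", "Mountain",
               "West South Central", "Pacific"],
    "state": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
    "family_members": [864.0, 582.0, 2606.0, 432.0, 20964.0],
})

# Membership test: keep rows whose state appears in the list.
wanted_states = ["California", "Arizona", "Nevada", "Utah"]
by_name = sample[sample["state"].isin(wanted_states)]

# Combined conditions: each comparison yields a boolean Series,
# and & combines them element-wise (parentheses are required).
mask = (sample["family_members"] < 1000) & (sample["region"] == "Pacific")
by_mask = sample[mask]

# The same filter written as a query string.
by_query = sample.query("family_members < 1000 and region == 'Pacific'")

print(by_name["state"].tolist())
print(by_mask["state"].tolist())
print(by_query["state"].tolist())
```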
Creating a new column
# Create the column "indiv_per_10k": homeless individuals per 10,000 state residents
homelessness["indiv_per_10k"] = 10000 * homelessness["individuals"] / homelessness["state_pop"]
Output:
               region       state  individuals  family_members  state_pop  indiv_per_10k
0  East South Central     Alabama       2570.0           864.0    4887681          5.258
1             Pacific      Alaska       1434.0           582.0     735139         19.507
2            Mountain     Arizona       7259.0          2606.0    7158024         10.141
3  West South Central    Arkansas       2280.0           432.0    3009733          7.575
4             Pacific  California     109008.0         20964.0   39461588         27.624
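The techniques in this post combine naturally. As a rough sketch on a five-row hand-typed sample (the threshold of 15 and the variable names are arbitrary choices for illustration), the code below adds the derived column, filters on it, sorts, and keeps only two columns.

```python
import pandas as pd

# Five rows copied from the post's data; not the full 51-row dataset.
sample = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
    "individuals": [2570.0, 1434.0, 7259.0, 2280.0, 109008.0],
    "state_pop": [4887681, 735139, 7158024, 3009733, 39461588],
})

# Column arithmetic is element-wise, so this computes one rate per row.
sample["indiv_per_10k"] = 10000 * sample["individuals"] / sample["state_pop"]

# Keep states above an arbitrary illustrative threshold of 15 per 10k.
high = sample[sample["indiv_per_10k"] > 15]

# Sort highest first, then keep only the columns of interest.
high_sorted = high.sort_values("indiv_per_10k", ascending=False)
result = high_sorted[["state", "indiv_per_10k"]]

print(result)
```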