[Basic Algorithm Applications for Data Analysts] Data Preprocessing with Python: Numeric Data Summary



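Contents: Introduction; Discrete Data Processing; Map-Based Category Conversion; One-hot Encoding; Basic Description of Numeric Data; Binary Features; Polynomial Features; Binning Numeric Ranges; Quantile-Based Binning; Log Transform; Date-Related Features; Time-Related Features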

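Introduction

This chapter summarizes numeric data processing. It covers numeric features, map-based category conversion, one-hot encoding, basic description of numeric data, binary features, polynomial features, binning numeric ranges, quantile-based binning, log transforms, date-related features, and time-related features.

This article summarizes the Python data preprocessing methods that come up most often in data analysis work. Across image data, numeric data, text data, feature extraction, and feature processing, the series walks through the data processing routines a data analyst uses regularly.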

import pandas as pd
import numpy as np
Discrete Data Processing
# Load the data and inspect a few rows
vg_df = pd.read_csv('datasets/vgsales.csv', encoding="ISO-8859-1")
vg_df[['Name', 'Platform', 'Year', 'Genre', 'Publisher']].iloc[1:7]

# Unique values of the discrete field
genres = np.unique(vg_df['Genre'])
genres
# >>> array(['Action', 'Adventure', 'Fighting', 'Misc', 'Platform',
#            'Puzzle', 'Racing', 'Role-Playing', 'Shooter', 'Simulation',
#            'Sports', 'Strategy'], dtype=object)

# Encode the discrete variable as integer labels
from sklearn.preprocessing import LabelEncoder
gle = LabelEncoder()
genre_labels = gle.fit_transform(vg_df['Genre'])
genre_labels
# >>> array([10,  4,  6, ...,  6,  5,  4])

# Dictionary mapping each integer label back to its category
genre_mappings = {index: label for index, label in enumerate(gle.classes_)}
genre_mappings
# >>> {0: 'Action',
#      1: 'Adventure',
#      2: 'Fighting',
#      3: 'Misc',
#      4: 'Platform',
#      5: 'Puzzle',
#      6: 'Racing',
#      7: 'Role-Playing',
#      8: 'Shooter',
#      9: 'Simulation',
#      10: 'Sports',
#      11: 'Strategy'}

# Attach the labels to the DataFrame and slice a few rows
vg_df['GenreLabel'] = genre_labels
vg_df[['Name', 'Platform', 'Year', 'Genre', 'GenreLabel']].iloc[1:7]
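The label encoding above can be checked on a small scale. The following is a minimal sketch on a made-up genre array (the `toy_` names and values are illustrative and not taken from vgsales.csv); it shows that the integer label is simply the position of a value in the sorted `classes_` array, and that `inverse_transform` recovers the original strings.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy genre column, made up for illustration (not from vgsales.csv)
toy_genres = np.array(['Sports', 'Platform', 'Racing', 'Sports', 'Puzzle'])

toy_encoder = LabelEncoder()
toy_labels = toy_encoder.fit_transform(toy_genres)

# classes_ holds the sorted unique values; a label is an index into it
print(list(toy_encoder.classes_))  # ['Platform', 'Puzzle', 'Racing', 'Sports']
print(toy_labels.tolist())         # [3, 0, 2, 3, 1]

# inverse_transform maps the integer labels back to the original strings
toy_decoded = toy_encoder.inverse_transform(toy_labels)
print(list(toy_decoded))
```

Because the labels follow alphabetical order, they carry no meaningful ranking; that is one reason the later sections move on to explicit maps and one-hot encoding.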

Map-Based Category Conversion
poke_df = pd.read_csv('datasets/Pokemon.csv', encoding='utf-8')
# Shuffle the rows reproducibly and reset the index
poke_df = poke_df.sample(random_state=1, frac=1).reset_index(drop=True)
np.unique(poke_df['Generation'])
# >>> array(['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 5', 'Gen 6'], dtype=object)

# Build the mapping dictionary (category -> ordinal value)
gen_ord_map = {'Gen 1': 1, 'Gen 2': 2, 'Gen 3': 3,
               'Gen 4': 4, 'Gen 5': 5, 'Gen 6': 6}

# Apply the mapping to create an ordinal label column
poke_df['GenerationLabel'] = poke_df['Generation'].map(gen_ord_map)
poke_df[['Name', 'Generation', 'GenerationLabel']].iloc[4:10]
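One thing worth knowing about `Series.map` with a dictionary is that any value missing from the dictionary becomes NaN rather than raising an error. The sketch below uses a made-up series (the value 'Gen 7' and the `toy_` names are hypothetical) to show one way to spot unmapped categories.

```python
import pandas as pd

# Toy data: 'Gen 7' is deliberately absent from the mapping dictionary
toy_gen = pd.Series(['Gen 1', 'Gen 3', 'Gen 7', 'Gen 2'])
toy_map = {'Gen 1': 1, 'Gen 2': 2, 'Gen 3': 3}

# Values without a dictionary entry are mapped to NaN
toy_mapped = toy_gen.map(toy_map)

# List the categories that the mapping failed to cover
unmapped = toy_gen[toy_mapped.isna()].unique().tolist()
print(unmapped)  # ['Gen 7']
```

Running a check like this after mapping helps catch typos or new categories before they silently turn into missing values.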

One-hot Encoding
# Select the columns to be converted
poke_df[['Name', 'Generation', 'Legendary']].iloc[4:10]

# Convert the categorical variables to integer labels with LabelEncoder
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

gen_le = LabelEncoder()
gen_labels = gen_le.fit_transform(poke_df['Generation'])
poke_df['Gen_Label'] = gen_labels

leg_le = LabelEncoder()
leg_labels = leg_le.fit_transform(poke_df['Legendary'])
poke_df['Lgnd_Label'] = leg_labels

poke_df_sub = poke_df[['Name', 'Generation', 'Gen_Label', 'Legendary', 'Lgnd_Label']]
poke_df_sub.iloc[4:10]

# Create the one-hot encoded fields from the label columns
gen_ohe = OneHotEncoder()
gen_feature_arr = gen_ohe.fit_transform(poke_df[['Gen_Label']]).toarray()
gen_feature_labels = list(gen_le.classes_)
print(gen_feature_labels)
gen_features = pd.DataFrame(gen_feature_arr, columns=gen_feature_labels)
# >>> ['Gen 1', 'Gen 2', 'Gen 3', 'Gen 4', 'Gen 5', 'Gen 6']

leg_ohe = OneHotEncoder()
leg_feature_arr = leg_ohe.fit_transform(poke_df[['Lgnd_Label']]).toarray()
leg_feature_labels = ['Legendary_' + str(cls_label) for cls_label in leg_le.classes_]
print(leg_feature_labels)
leg_features = pd.DataFrame(leg_feature_arr, columns=leg_feature_labels)
# >>> ['Legendary_False', 'Legendary_True']

# Combine the original columns with the one-hot columns
poke_df_ohe = pd.concat([poke_df_sub, gen_features, leg_features], axis=1)
columns = sum([['Name', 'Generation', 'Gen_Label'], gen_feature_labels,
               ['Legendary', 'Lgnd_Label'], leg_feature_labels], [])
poke_df_ohe[columns].iloc[4:10]
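For quick exploration, pandas can produce the same kind of indicator columns in one call. This is a sketch of an alternative, not the approach used above: it runs `pd.get_dummies` on a made-up DataFrame (the `toy_` names and rows are illustrative).

```python
import pandas as pd

# Toy frame, made up for illustration (not from Pokemon.csv)
toy_df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Generation': ['Gen 1', 'Gen 2', 'Gen 1', 'Gen 3'],
})

# One indicator column per distinct category, stored as 0/1 integers
toy_dummies = pd.get_dummies(toy_df['Generation'], dtype=int)

# Attach the indicator columns to the original frame
toy_ohe = pd.concat([toy_df, toy_dummies], axis=1)
print(toy_ohe)
```

A fitted `OneHotEncoder` remembers its categories and can be reused on new data, whereas `get_dummies` only sees the categories present in the frame it is given; which trade-off matters depends on the workflow.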

Basic Description of Numeric Data
poke_df = pd.read_csv('datasets/Pokemon.csv', encoding='utf-8')
poke_df.head()

poke_df[['HP', 'Attack', 'Defense']].head()

poke_df[['HP', 'Attack', 'Defense']].describe()

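Binary Features
# popsong_df is assumed to be an already loaded DataFrame with a 'Listen_count' column
# np.array makes a copy, so the original column is left unchanged
watched = np.array(popsong_df['Listen_count'])
watched[watched >= 1] = 1
popsong_df['watched'] = watched
popsong_df.head(10)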

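# Convert to a binary class based on a threshold
from sklearn.preprocessing import Binarizer
bn = Binarizer(threshold=0.9)
# Binarizer expects 2-D input, so wrap the column in a list and take the first row back
pd_watched = bn.transform([popsong_df['Listen_count']])[0]
popsong_df['pd_watched'] = pd_watched
popsong_df.head(11)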
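The same thresholding can also be written as a plain boolean comparison. The sketch below uses a made-up play-count column (the `toy_` names and counts are hypothetical) and marks every row with at least one listen as 1.

```python
import pandas as pd

# Toy listen counts, made up for illustration
toy_songs = pd.DataFrame({'Listen_count': [0, 2, 0, 17, 1]})

# The comparison yields booleans; astype(int) turns them into 0/1 flags
toy_songs['watched'] = (toy_songs['Listen_count'] >= 1).astype(int)
print(toy_songs['watched'].tolist())  # [0, 1, 0, 1, 1]
```

For integer counts, a threshold of 0.9 in `Binarizer` (values strictly greater than the threshold become 1) gives the same result as `>= 1` here.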

Polynomial Features
atk_def = poke_df[['Attack', 'Defense']]
atk_def.head()

# The degree-2 polynomial terms of [a, b] are [1, a, b, a^2, ab, b^2];
# include_bias=False drops the constant term 1
from sklearn.preprocessing import PolynomialFeatures
pf = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
res = pf.fit_transform(atk_def)
res

intr_features = pd.DataFrame(res, columns=['Attack', 'Defense', 'Attack^2',
                                           'Attack x Defense', 'Defense^2'])
intr_features.head(5)
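To make the meaning of the five columns concrete, the degree-2 terms can be computed by hand. This is a minimal NumPy sketch on two made-up rows (the values are illustrative); the column order a, b, a^2, ab, b^2 follows the order that `PolynomialFeatures(degree=2, include_bias=False)` uses for two inputs.

```python
import numpy as np

# Two toy rows of (a, b) values, made up for illustration
toy_ab = np.array([[2.0, 3.0],
                   [4.0, 5.0]])
a = toy_ab[:, 0]
b = toy_ab[:, 1]

# Degree-2 terms without the bias column: a, b, a^2, a*b, b^2
manual_poly = np.column_stack([a, b, a ** 2, a * b, b ** 2])
print(manual_poly)
# [[ 2.  3.  4.  6.  9.]
#  [ 4.  5. 16. 20. 25.]]
```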

Binning Numeric Ranges
fcc_survey_df = pd.read_csv('datasets/fcc_2016_coder_survey_subset.csv', encoding='utf-8')
fcc_survey_df[['ID.x', 'EmploymentField', 'Age', 'Income']].head()

# Plot a histogram of developer ages
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
fcc_survey_df['Age'].hist(color='#A9C5D3')
ax.set_title('Developer Age Histogram', fontsize=12)
ax.set_xlabel('Age', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)


# Bin ages into decades by dividing by 10 and rounding down
fcc_survey_df['Age_bin_round'] = np.array(np.floor(np.array(fcc_survey_df['Age']) / 10.))
fcc_survey_df[['ID.x', 'Age', 'Age_bin_round']].iloc[1071:1076]
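Dividing by 10 gives equal-width bins. When the ranges should be uneven, `np.digitize` is one option. The sketch below is an illustration under assumed cut points: the ages, the edges 18/30/45/60, and the bin names are all made up and do not come from the survey data.

```python
import numpy as np

# Toy ages and hypothetical, unevenly spaced bin edges
toy_ages = np.array([15, 22, 34, 47, 65, 81])
age_edges = np.array([18, 30, 45, 60])
age_names = ['<18', '18-29', '30-44', '45-59', '60+']

# digitize returns i such that age_edges[i-1] <= age < age_edges[i]
age_bin_idx = np.digitize(toy_ages, age_edges)
age_bin_names = [age_names[i] for i in age_bin_idx]

print(age_bin_idx.tolist())  # [0, 1, 2, 3, 4, 4]
print(age_bin_names)         # ['<18', '18-29', '30-44', '45-59', '60+', '60+']
```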

Quantile-Based Binning
fcc_survey_df[['ID.x', 'Age', 'Income']].iloc[4:9]

# Plot a histogram of developer income
fig, ax = plt.subplots()
fcc_survey_df['Income'].hist(bins=30, color='#A9C5D3')
ax.set_title('Developer Income Histogram', fontsize=12)
ax.set_xlabel('Developer Income', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)

# Compute the quartile boundaries
quantile_list = [0, .25, .5, .75, 1.]
quantiles = fcc_survey_df['Income'].quantile(quantile_list)
quantiles

# Visualize the quartiles on top of the histogram
quantile_list = [0, .25, .5, .75, 1.]
quantiles = fcc_survey_df['Income'].quantile(quantile_list)
fig, ax = plt.subplots()
fcc_survey_df['Income'].hist(bins=30, color='#A9C5D3')
for quantile in quantiles:
    qvl = plt.axvline(quantile, color='r')
ax.legend([qvl], ['Quantiles'], fontsize=10)
ax.set_title('Developer Income Histogram with Quantiles', fontsize=12)
ax.set_xlabel('Developer Income', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)

# Bin the data by quantile and attach the corresponding labels
quantile_list = [0, .25, .5, .75, 1.]
quantile_labels = ['0-25Q', '25-50Q', '50-75Q', '75-100Q']
fcc_survey_df['Income_quantile_range'] = pd.qcut(fcc_survey_df['Income'],
                                                 q=quantile_list)
fcc_survey_df['Income_quantile_label'] = pd.qcut(fcc_survey_df['Income'],
                                                 q=quantile_list,
                                                 labels=quantile_labels)
fcc_survey_df[['ID.x', 'Age', 'Income', 'Income_quantile_range',
               'Income_quantile_label']].iloc[4:9]
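The defining property of quantile binning is that each bin receives roughly the same number of rows. A small sketch on made-up income values (the `toy_` names and numbers are illustrative) makes that visible.

```python
import pandas as pd

# Eight toy income values, made up for illustration
toy_income = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])
toy_q_labels = ['0-25Q', '25-50Q', '50-75Q', '75-100Q']

# Quartile bins: edges are taken from the data, not fixed in advance
toy_q_bins = pd.qcut(toy_income, q=[0, .25, .5, .75, 1.], labels=toy_q_labels)

# Each quartile ends up with the same number of rows
toy_q_counts = toy_q_bins.value_counts().sort_index()
print(toy_q_counts.tolist())  # [2, 2, 2, 2]
```

With heavily tied values the bin edges can coincide, in which case `pd.qcut` raises an error unless `duplicates='drop'` is passed.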

Log Transform
# Add 1 before taking the log so that an income of 0 maps to 0 instead of -inf
fcc_survey_df['Income_log'] = np.log((1 + fcc_survey_df['Income']))
fcc_survey_df[['ID.x', 'Age', 'Income', 'Income_log']].iloc[4:9]

# Histogram of the log-transformed values, with the mean marked in red
income_log_mean = np.round(np.mean(fcc_survey_df['Income_log']), 2)
fig, ax = plt.subplots()
fcc_survey_df['Income_log'].hist(bins=30, color='#A9C5D3')
plt.axvline(income_log_mean, color='r')
ax.set_title('Developer Income Histogram after Log transform', fontsize=12)
ax.set_xlabel('Developer Income (log scale)', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
ax.text(11.5, 450, r'$\mu$=' + str(income_log_mean), fontsize=10)
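NumPy also provides `np.log1p`, which computes log(1 + x) directly, and `np.expm1`, which inverts it. The following sketch on made-up income values (illustrative only) shows the round trip back to the original scale.

```python
import numpy as np

# Toy income values, made up for illustration; note the zero
toy_inc = np.array([0.0, 6000.0, 40000.0, 200000.0])

# log1p(x) is log(1 + x), so a zero income maps to exactly 0
toy_inc_log = np.log1p(toy_inc)

# expm1 is the inverse: exp(y) - 1 recovers the original values
toy_inc_back = np.expm1(toy_inc_log)

print(np.round(toy_inc_log, 2))
print(np.allclose(toy_inc_back, toy_inc))  # True
```

Being able to invert the transform matters when a model is trained on the log scale and its predictions need to be reported in the original units.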

Date-Related Features
import datetime
import numpy as np
import pandas as pd
from dateutil.parser import parse
import pytz

time_stamps = ['2015-03-08 10:30:00.360000+00:00', '2017-07-13 15:45:05.755000-07:00',
               '2012-01-20 22:30:00.254000+05:30', '2016-12-25 00:30:00.000000+10:00']
df = pd.DataFrame(time_stamps, columns=['Time'])
df

# Convert the strings to Timestamp objects
ts_objs = np.array([pd.Timestamp(item) for item in np.array(df.Time)])
df['TS_obj'] = ts_objs
ts_objs

# Extract date components to build new date-based categorical fields
df['Year'] = df['TS_obj'].apply(lambda d: d.year)
df['Month'] = df['TS_obj'].apply(lambda d: d.month)
df['Day'] = df['TS_obj'].apply(lambda d: d.day)
df['DayOfWeek'] = df['TS_obj'].apply(lambda d: d.dayofweek)
# day_name() replaces the weekday_name attribute, which was removed in pandas 1.0
df['Dayname'] = df['TS_obj'].apply(lambda d: d.day_name())
df['DayOfYear'] = df['TS_obj'].apply(lambda d: d.dayofyear)
df['WeekOfYear'] = df['TS_obj'].apply(lambda d: d.weekofyear)
df['Quarter'] = df['TS_obj'].apply(lambda d: d.quarter)
df[['Time', 'Year', 'Month', 'Day', 'Quarter',
    'DayOfWeek', 'Dayname', 'DayOfYear', 'WeekOfYear']]
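For larger tables, the per-row `apply` calls can be replaced by the vectorized `.dt` accessor. This is a sketch of that alternative with one important assumption: the timestamps are first normalized to UTC with `pd.to_datetime(..., utc=True)`, so the extracted fields describe the UTC moment and can differ from the local-time fields near midnight. The two strings below are the first two timestamps from the example above; the `toy_` names are illustrative.

```python
import pandas as pd

# First two timestamps from the example, with different UTC offsets
toy_stamps = pd.Series(['2015-03-08 10:30:00.360000+00:00',
                        '2017-07-13 15:45:05.755000-07:00'])

# Normalize everything to UTC so the column gets a single datetime dtype
toy_utc = pd.to_datetime(toy_stamps, utc=True)

# Vectorized field extraction via the .dt accessor
toy_dates = pd.DataFrame({
    'Year': toy_utc.dt.year,
    'Month': toy_utc.dt.month,
    'DayOfWeek': toy_utc.dt.dayofweek,   # Monday=0 ... Sunday=6
    'DayName': toy_utc.dt.day_name(),
    'Quarter': toy_utc.dt.quarter,
})
print(toy_dates)
```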

Time-Related Features
df['Hour'] = df['TS_obj'].apply(lambda d: d.hour)
df['Minute'] = df['TS_obj'].apply(lambda d: d.minute)
df['Second'] = df['TS_obj'].apply(lambda d: d.second)
df['MUsecond'] = df['TS_obj'].apply(lambda d: d.microsecond)    # microseconds
df['UTC_offset'] = df['TS_obj'].apply(lambda d: d.utcoffset())  # offset from UTC
df[['Time', 'Hour', 'Minute', 'Second', 'MUsecond', 'UTC_offset']]

# Bin the hour into parts of the day; each bin is (left, right], so -1 makes hour 0 fall in the first bin
hour_bins = [-1, 5, 11, 16, 21, 23]
bin_names = ['Late Night', 'Morning', 'Afternoon', 'Evening', 'Night']
df['TimeOfDayBin'] = pd.cut(df['Hour'],
                            bins=hour_bins, labels=bin_names)
df[['Time', 'Hour', 'TimeOfDayBin']]
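It can be helpful to check where the boundary hours land. The sketch below reuses the same bin edges and names as above on a made-up list of boundary hours; with the default `right=True`, each interval includes its right edge, so hour 5 is still 'Late Night' and hour 6 is the first 'Morning' hour.

```python
import pandas as pd

# Boundary hours chosen to probe both sides of every bin edge
toy_hours = pd.Series([0, 5, 6, 11, 12, 16, 17, 21, 22, 23])
toy_hour_bins = [-1, 5, 11, 16, 21, 23]
toy_bin_names = ['Late Night', 'Morning', 'Afternoon', 'Evening', 'Night']

# Intervals are (-1, 5], (5, 11], (11, 16], (16, 21], (21, 23]
toy_tod = pd.cut(toy_hours, bins=toy_hour_bins, labels=toy_bin_names)
print(dict(zip(toy_hours.tolist(), toy_tod.astype(str).tolist())))
```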

Original article: https://outofmemory.cn/langs/1184578.html
