1.基本概念2.转换数据类型3.处理重复数据4.处理缺失数据
a.检查缺失数据b.直接删除缺失数据c.用指定值填补缺失数据d.根据规律填补缺失值 5.离群数据
5.1 通过可视化Box-plotScatter-plot5.2 通过数学计算Z-ScoreIQR score5.3 处理离群数据 参考资料
1.基本概念import pandas as pd df = pd.read_csv("test.csv") df.sample(10)
获取前几行数据
data.head()
获取数据维度信息
df.shape
获取数据表属性的相关信息
data.info()
获取数据表属性类型信息
data.head()2.转换数据类型
import pandas as pd df = pd.Dataframe([{'col1':'a', 'col2':'1'}, {'col1':'b', 'col2':'2'}]) df.dtypes
Object转换为数值型
df['col2-int'] = df['col2'].astype(int)
含有不是数字的Object类型转换为数值型
#此时由于含有不是数字的值,所以就算忽略报错后转换后的类型还是Object s = pd.Series(['1', '2', '4.7', 'pandas', '10']) s.astype(float,errors='ignore')
#这种方法可以将其转换为数值型,Pandas则变为Nan pd.to_numeric(s, errors='coerce')
转换为日期类型
#可以将三列数据Month、Day、Year转换为日期类型的数据 pd.to_datetime(df[['Month', 'Day', 'Year']])
最终的代码
import pandas as pd import numpy as np def convert_money(value): new_value = value.replace(",","").replace("$","") return float(new_value) df2 = pd.read_csv("/home/aistudio/data/data20506/sales_types.csv", dtype = {'Customer Number': 'int'}, converters = {'2016': convert_money, '2017': convert_money, 'Percent Growth': lambda x: float(x.replace("%", "")) / 100, 'Jan Units': lambda x: pd.to_numeric(x, errors='coerce'), 'Active': lambda x: np.where(x =='Y', 1, 0), }) df2['Date'] = pd.to_datetime(df[['Month', 'Day', 'Year']]) df23.处理重复数据
import pandas as pd d = {'Name':['Newton', 'Galilei', 'Einstein', 'Feynman', 'Newton', 'Maxwell', 'Galilei'], 'Age':[26, 30, 28, 28, 26, 39, 40], 'Score':[90, 80, 90, 100, 90, 70, 90]} df = pd.Dataframe(d,columns=['Name','Age','Score']) df
df.duplicated()函数使用
Consider dataset containing ramen rating. >>> df = pd.Dataframe({ ... 'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'], ... 'style': ['cup', 'cup', 'cup', 'pack', 'pack'], ... 'rating': [4, 4, 3.5, 15, 5] ... }) >>> df brand style rating 0 Yum Yum cup 4.0 1 Yum Yum cup 4.0 2 Indomie cup 3.5 3 Indomie pack 15.0 4 Indomie pack 5.0 By default, for each set of duplicated values, the first occurrence is set on False and all others on True. >>> df.duplicated() 0 False 1 True 2 False 3 False 4 False dtype: bool By using 'last', the last occurrence of each set of duplicated values is set on False and all others on True. >>> df.duplicated(keep='last') 0 True 1 False 2 False 3 False 4 False dtype: bool By setting ``keep`` on False, all duplicates are True. >>> df.duplicated(keep=False) 0 True 1 True 2 False 3 False 4 False dtype: bool To find duplicates on specific column(s), use ``subset``. >>> df.duplicated(subset=['brand']) 0 False 1 True 2 False 3 True 4 True dtype: bool
删除重复的
#保留第一个 df.drop_duplicates() #保留最后一个重复的元素 df.drop_duplicates('Age', keep='last')4.处理缺失数据 a.检查缺失数据
#方法一 #isna()函数,若为空则为False,若不为空则为True df = pd.Dataframe({"one":[1, 2, np.nan], "two":[np.nan, 3, 4]}) df.isna() #返回m行n列,每个元素的值都会返回(True,False) df.isna().any() #只返回1列,只要有一个是False就整个属性的值就为False #方法二 #可以统计缺失率 (data.shape[0] - data.count()) / data.shape[0]b.直接删除缺失数据
Dropna()函数
Examples -------- >>> df = pd.Dataframe({"name": ['Alfred', 'Batman', 'Catwoman'], ... "toy": [np.nan, 'Batmobile', 'Bullwhip'], ... "born": [pd.NaT, pd.Timestamp("1940-04-25"), ... pd.NaT]}) >>> df name toy born 0 Alfred NaN NaT 1 Batman Batmobile 1940-04-25 2 Catwoman Bullwhip NaT Drop the rows where at least one element is missing. >>> df.dropna() name toy born 1 Batman Batmobile 1940-04-25 Drop the columns where at least one element is missing. >>> df.dropna(axis='columns') name 0 Alfred 1 Batman 2 Catwoman Drop the rows where all elements are missing. >>> df.dropna(how='all') name toy born 0 Alfred NaN NaT 1 Batman Batmobile 1940-04-25 2 Catwoman Bullwhip NaT Keep only the rows with at least 2 non-NA values. >>> df.dropna(thresh=2) name toy born 1 Batman Batmobile 1940-04-25 2 Catwoman Bullwhip NaT Define in which columns to look for missing values. >>> df.dropna(subset=['name', 'born']) name toy born 1 Batman Batmobile 1940-04-25 Keep the Dataframe with valid entries in the same variable. 在同一个变量改变 >>> df.dropna(inplace=True) >>> df name toy born 1 Batman Batmobile 1940-04-25c.用指定值填补缺失数据
df = pd.Dataframe({'ColA':[1, np.nan, np.nan, 4, 5, 6, 7], 'ColB':[1, 1, 1, 1, 2, 2, 2]}) df['ColA'].fillna(method='ffill')#以前面一个值填充 df['ColA'].fillna(method='bfill')#以后面一个值填充
调用sklearn.impute中的SimpleImputer来填补缺失数据
from sklearn.impute import SimpleImputer #以均值填充空值 imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean') col_values = imp_mean.fit_transform(pdf2['Height-na'].values.reshape((-1, 1))) #以常数填充空值 df = pd.Dataframe({"name": ["Google", "Huawei", "Facebook", "Alibaba"],"price": [100, -1, -1, 90]}) imp = SimpleImputer(missing_values=-1, strategy='constant', fill_value=110) # ⑤ imp.fit_transform(df['price'].values.reshape((-1, 1)))d.根据规律填补缺失值
利用sklearn.linear_model的LinearRegression来回归未知的数据
df = pd.Dataframe({"one":np.random.randint(1, 100, 10), "two": [2, 4, 6, 8, 10, 12, 14, 16, 18, 20], "three":[5, 9, 13, np.nan, 21, np.nan, 29, 33, 37, 41]}) from sklearn.linear_model import LinearRegression # ⑥ df_train = df.dropna() #训练集 df_test = df[df['three'].isnull()] #测试集 regr = LinearRegression() regr.fit(df_train['two'].values.reshape(-1, 1), df_train['three'].values.reshape(-1, 1)) # ⑦ df_three_pred = regr.predict(df_test['two'].values.reshape(-1, 1)) # 将所得数值填补到原数据集中 df.loc[(df.three.isnull()), 'three'] = df_three_pred df
使用sklearn.ensemble的RandomForestRegressor来进行预测
import pandas as pd train_data = pd.read_csv("/home/aistudio/data/data20507/train.csv") train_data.info() df = train_data[['Age','Fare', 'Parch', 'SibSp', 'Pclass']] #可能跟年龄有关的特征 known_age = df[df['Age'].notnull()].values unknown_age = df[df['Age'].isnull()].values y = known_age[:, 0] X = known_age[:, 1:] from sklearn.ensemble import RandomForestRegressor # ⑩ rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1) # ○11 rfr.fit(X, y) pred_age = rfr.predict(unknown_age[:, 1:]) # ○13 pred_age.mean() train_data.loc[(train_data.Age.isnull()), 'Age'] = pred_age train_data.isna().any()
还可以利用KNN来填补缺失值
%matplotlib inline import seaborn as sns from sklearn.datasets import load_iris # 引入鸢尾花数据集 import numpy as np iris = load_iris() X = iris.data # 制造含有缺失值的数据集 rng = np.random.RandomState(0) X_missing = X.copy() mask = np.abs(X[:, 2] - rng.normal(loc=5.5, scale=0.7, size=X.shape[0])) < 0.6 X_missing[mask, 3] = np.nan # X_missing是包含了缺失值的数据集 from missingpy import KNNImputer # 引入KNN填充缺失值的模型 imputer = KNNImputer(n_neighbors=3, weights="uniform") X_imputed = imputer.fit_transform(X_missing) #填补之前的数据分布 sns.distplot(X.reshape((-1, 1))) #填补缺失数据后的分布 sns.distplot(X_imputed.reshape((-1, 1)))5.离群数据 5.1 通过可视化
import pandas as pd import matplotlib.pyplot as plt %matplotlib inline from sklearn.datasets import load_boston boston = load_boston() x = boston.data y = boston.target columns = boston.feature_names #create the dataframe boston_df = pd.Dataframe(x) boston_df.columns = columns boston_df.head()Box-plot
import seaborn as sns sns.boxplot(x=boston_df['INDUS']);
![[箱线图.png]]
sns.boxplot(x=boston_df['PTRATIO']); sns.boxplot(x=boston_df['DIS']);Scatter-plot
fig, ax = plt.subplots(figsize=(16,8)) ax.scatter(boston_df['INDUS'], boston_df['TAX']) ax.set_xlabel('Proportion of non-retail business acres per town') ax.set_ylabel('Full-value property-tax rate per ,000') plt.show();5.2 通过数学计算 Z-Score
Z-Score是指观测点或数据的值超过观测值或测量值平均值的标准差的符号数。
from scipy import stats import numpy as np z = np.abs(stats.zscore(boston_df)) print(z) threshold = 3 print(np.where(z > 3))IQR score
四分位数范围(IQR),又称平均数或50%平均数,或技术上称为H-spread,是衡量统计学分散度的指标,等于75%和25%之间的差值,或上四分位数和下四分位数之间的差值,IQR=Q3 - Q1。百度百科解释
![[IQR.png]]
Q1 = boston_df.quantile(0.25) Q3 = boston_df.quantile(0.75) IQR = Q3 - Q1 print(IQR)
既然我们现在有了IQR估计,那么就可以选择离群值。下面的代码将产生带有真值和假值的结果。带有False的数据点表示这些值是有效的,而True则表示有释放。
print((boston_df < (Q1 - 1.5 * IQR)) | (boston_df > (Q3 + 1.5 * IQR)))5.3 处理离群数据
识别异常值–分析有无异常值的结果模型–做出结论。如果你确定数值是错误的,就修正它。如果离群值不在利益分配范围内,则删除。考虑到数据的差距,使用抗离群值的统计工具,例如,稳健回归(用另一种参数估计方法)Robust_regression。离群值的常见原因是两种分布的混合,可能是两个不同的子人群,也可能表明 "测量正确 "和 “测量误差”;这通常是由混合模型来建模。 (Mixture model). 参考资料
https://blog.csdn.net/weixin_42199542/article/details/106898892
https://blog.csdn.net/bbbeoy/article/details/72124004
欢迎分享,转载请注明来源:内存溢出
评论列表(0条)