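To use the mean for numeric columns and the most frequent value for non-numeric columns, you could do something like the following. You could further distinguish between integers and floats. I guess it might make sense to use the median for integer columns instead.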
import pandas as pd
import numpy as np
from sklearn.base import TransformerMixin


class DataframeImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.

        Columns of dtype object are imputed with the most frequent value
        in column.

        Columns of other types are imputed with mean of column.
        """

    def fit(self, X, y=None):
        # One fill value per column: the most frequent value for object
        # columns, the mean for everything else.
        self.fill = pd.Series(
            [X[c].value_counts().index[0]
             if X[c].dtype == np.dtype('O') else X[c].mean()
             for c in X],
            index=X.columns)
        return self

    def transform(self, X, y=None):
        # fillna with a Series matches fill values to columns by label.
        return X.fillna(self.fill)


data = [
    ['a', 1, 2],
    ['b', 1, 1],
    ['b', 2, 2],
    [np.nan, np.nan, np.nan]
]

X = pd.DataFrame(data)
xt = DataframeImputer().fit_transform(X)

print('before...')
print(X)
print('after...')
print(xt)
which prints:
before...
     0   1   2
0    a   1   2
1    b   1   1
2    b   2   2
3  NaN NaN NaN
after...
   0         1         2
0  a  1.000000  2.000000
1  b  1.000000  1.000000
2  b  2.000000  2.000000
3  b  1.333333  1.666667
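The idea of treating integer columns separately could be sketched roughly as below. This is only an illustration, not part of the answer's code: the helper name `fit_fill_values` and the column names are made up, and it assumes integer columns are stored with pandas' nullable `Int64` dtype, since a plain integer column that contains NaN is normally converted to float and can then no longer be told apart from a real float column.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def fit_fill_values(df):
    """Hypothetical helper: choose one fill value per column.

    Integer columns -> median (rounded back to an int),
    float columns   -> mean,
    anything else   -> most frequent value.
    """
    fill = {}
    for c in df.columns:
        col = df[c]
        if is_integer_dtype(col.dtype):
            # The median can land on .5 for an even number of values,
            # so round it to keep the column integer-valued.
            fill[c] = int(round(col.median()))
        elif is_float_dtype(col.dtype):
            fill[c] = col.mean()
        else:
            # mode() ignores missing values; ties resolve to the first
            # value in sorted order.
            fill[c] = col.mode().iloc[0]
    return fill


# Assumed example data; "n" uses the nullable Int64 dtype so it stays
# an integer column despite the missing value.
df = pd.DataFrame({
    "label": ["a", "b", "b", np.nan],
    "n": pd.array([1, 1, 2, None], dtype="Int64"),
    "score": [2.0, 1.0, 2.0, np.nan],
})

fill = fit_fill_values(df)
out = df.fillna(value=fill)  # a dict maps column name -> fill value
print(out)
```

The same per-column logic could be moved into the `fit` method of the transformer class if you want to keep the scikit-learn interface.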