Extrapolating values in a Pandas DataFrame

Extrapolating Pandas DataFrames

DataFrames can be extrapolated; however, pandas has no simple method call for this, so another library (e.g. scipy.optimize) is required.
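As a small illustration of why a separate fitting step is needed, the sketch below uses a made-up five-point Series (not the data from the question) and the default linear method of `interpolate`. In the pandas versions I am aware of, gaps between known values are filled, the leading NaN is left alone, and the trailing NaN just repeats the last known value instead of continuing the trend.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: NaNs before, between, and after the known values.
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# Default method is 'linear', applied in the forward direction.
filled = s.interpolate()

print(filled)
# The middle gap is interpolated; the trailing value is a repeat of 3.0,
# not an extrapolated 4.0, and the leading NaN stays NaN.
```

So `interpolate` used this way covers the inside of the data range only; values beyond it need a model of the trend.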

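Extrapolating

Extrapolating, in general, requires making certain assumptions about the data being extrapolated. One approach is to curve fit a general parameterized equation to the data, find the parameter values that best describe the existing data, and then use them to calculate values beyond the range of that data. The difficult and limiting issue with this approach is that some assumption about the trend must be made when the parameterized equation is selected. A suitable equation can be found through trial and error with different equations until the desired result is obtained, or it can sometimes be inferred from the source of the data. The data provided in the question is really not a large enough dataset to obtain a well-fit curve; however, it is good enough for illustration.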
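To make the point about trend assumptions concrete, here is a minimal sketch that fits two different polynomial models to the same three known `avg` values from the question. It uses `numpy.polyfit` rather than `scipy.optimize.curve_fit` purely for brevity; the choice of a straight line and a parabola as the two candidates is an assumption made for illustration.

```python
import numpy as np

# Known points: the three measured avg values from the question's data.
x = np.array([250.0, 1000.0, 3000.0])
y = np.array([0.558931, 0.625137, 0.649448])

# Two candidate trend assumptions fitted to the same points.
line = np.polyfit(x, y, deg=1)      # least-squares straight line
parabola = np.polyfit(x, y, deg=2)  # passes through all three points

x_in = 2000.0    # inside the range of the known data
x_out = 50000.0  # far outside it

# Disagreement between the two models at each location.
gap_in = abs(np.polyval(line, x_in) - np.polyval(parabola, x_in))
gap_out = abs(np.polyval(line, x_out) - np.polyval(parabola, x_out))

print('Model disagreement inside the data range: ', gap_in)
print('Model disagreement outside the data range:', gap_out)
```

Both models describe the known points reasonably well, yet they disagree far more at `x_out` than at `x_in`, which is why the choice of equation dominates the extrapolated result.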

The following is an example of extrapolating the DataFrame with a 3rd-order polynomial:

$$f(x) = a x^3 + b x^2 + c x + d \qquad \text{(Eq. 1)}$$

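This generic function (func()) is curve fit to each column to obtain column-specific parameters (i.e. a, b, c and d). These parameterized equations are then used to extrapolate the data in each column for all index values that still hold NaNs.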

    import pandas as pd
    from io import StringIO
    from scipy.optimize import curve_fit

    # The header has four names and each row has five fields,
    # so pandas uses the first field of every row as the index.
    df = pd.read_table(StringIO('''     neg       neu       pos       avg
        0           NaN       NaN       NaN       NaN
        250    0.508475  0.527027  0.641292  0.558931
        500         NaN       NaN       NaN       NaN
        1000   0.650000  0.571429  0.653983  0.625137
        2000        NaN       NaN       NaN       NaN
        3000   0.619718  0.663158  0.665468  0.649448
        4000        NaN       NaN       NaN       NaN
        6000        NaN       NaN       NaN       NaN
        8000        NaN       NaN       NaN       NaN
        10000       NaN       NaN       NaN       NaN
        20000       NaN       NaN       NaN       NaN
        30000       NaN       NaN       NaN       NaN
        50000       NaN       NaN       NaN       NaN'''), sep=r'\s+')

    # Do the original interpolation
    df.interpolate(method='nearest', axis=0, inplace=True)

    # Display result
    print('Interpolated data:')
    print(df)
    print()

    # Function to curve fit to the data
    def func(x, a, b, c, d):
        return a * (x ** 3) + b * (x ** 2) + c * x + d

    # Initial parameter guess, just to kick off the optimization
    guess = (0.5, 0.5, 0.5, 0.5)

    # Create copy of data to remove NaNs for curve fitting
    fit_df = df.dropna()

    # Place to store function parameters for each column
    col_params = {}

    # Curve fit each column
    for col in fit_df.columns:
        # Get x & y
        x = fit_df.index.astype(float).values
        y = fit_df[col].values
        # Curve fit column and get curve parameters
        params = curve_fit(func, x, y, guess)
        # Store optimized parameters
        col_params[col] = params[0]

    # Extrapolate each column
    for col in df.columns:
        # Get the index values for NaNs in the column
        mask = df[col].isnull()
        x = df.index[mask].astype(float).values
        # Extrapolate those points with the fitted function
        # (.loc avoids the chained assignment df[col][x] = ...)
        df.loc[mask, col] = func(x, *col_params[col])

    # Display result
    print('Extrapolated data:')
    print(df)
    print()

    print('Data was extrapolated with these column functions:')
    for col in col_params:
        print('f_{}(x) = {:0.3e} x^3 + {:0.3e} x^2 + {:0.4f} x + {:0.4f}'.format(col, *col_params[col]))

Extrapolation results

    Interpolated data:
                neg       neu       pos       avg
    0           NaN       NaN       NaN       NaN
    250    0.508475  0.527027  0.641292  0.558931
    500    0.508475  0.527027  0.641292  0.558931
    1000   0.650000  0.571429  0.653983  0.625137
    2000   0.650000  0.571429  0.653983  0.625137
    3000   0.619718  0.663158  0.665468  0.649448
    4000        NaN       NaN       NaN       NaN
    6000        NaN       NaN       NaN       NaN
    8000        NaN       NaN       NaN       NaN
    10000       NaN       NaN       NaN       NaN
    20000       NaN       NaN       NaN       NaN
    30000       NaN       NaN       NaN       NaN
    50000       NaN       NaN       NaN       NaN

    Extrapolated data:
                   neg          neu         pos          avg
    0         0.411206     0.486983    0.631233     0.509807
    250       0.508475     0.527027    0.641292     0.558931
    500       0.508475     0.527027    0.641292     0.558931
    1000      0.650000     0.571429    0.653983     0.625137
    2000      0.650000     0.571429    0.653983     0.625137
    3000      0.619718     0.663158    0.665468     0.649448
    4000      0.621036     0.969232    0.708464     0.766245
    6000      1.197762     2.799529    0.991552     1.662954
    8000      3.281869     7.191776    1.702860     4.058855
    10000     7.767992    15.272849    3.041316     8.694096
    20000    97.540944   150.451269   26.103320    91.365599
    30000   381.559069   546.881749   94.683310   341.042883
    50000  1979.646859  2686.936912  467.861511  1711.489069

    Data was extrapolated with these column functions:
    f_neg(x) = 1.864e-11 x^3 + -1.471e-07 x^2 + 0.0003 x + 0.4112
    f_neu(x) = 2.348e-11 x^3 + -1.023e-07 x^2 + 0.0002 x + 0.4870
    f_avg(x) = 1.542e-11 x^3 + -9.016e-08 x^2 + 0.0002 x + 0.5098
    f_pos(x) = 4.144e-12 x^3 + -2.107e-08 x^2 + 0.0000 x + 0.6312

[Figure: plot for the avg column]

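Without a larger dataset or knowledge of the data source, this result may be completely wrong, but it should illustrate the process of extrapolating a DataFrame. The assumed equation in func() would probably need to be adjusted to obtain a correct extrapolation. Also, no attempt was made to make the code efficient.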
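As one example of adjusting the assumed equation, the sketch below swaps the cubic for a logarithmic model, a·ln(x) + b. This model is my own assumption for illustration and is not suggested by the data source; the helper name `log_model` is hypothetical. Because the model is linear in its parameters, `numpy.polyfit` on ln(x) is used in place of `curve_fit`, and the index value 0 is left out because ln(0) is undefined. Only the three measured rows are used for fitting.

```python
import numpy as np
import pandas as pd

# The three measured rows from the question's data.
known = pd.DataFrame(
    {'neg': [0.508475, 0.650000, 0.619718],
     'neu': [0.527027, 0.571429, 0.663158],
     'pos': [0.641292, 0.653983, 0.665468],
     'avg': [0.558931, 0.625137, 0.649448]},
    index=[250, 1000, 3000],
)

# Add a few of the larger index values as all-NaN rows to be filled.
df = known.reindex([250, 1000, 3000, 10000, 50000])

# Hypothetical alternative to func(): a slowly growing logarithmic trend.
def log_model(x, a, b):
    return a * np.log(x) + b

for col in df.columns:
    mask = df[col].isnull()
    x_known = df.index[~mask].astype(float).values
    y_known = df.loc[~mask, col].values
    # Straight-line fit in ln(x) gives the two model parameters.
    a, b = np.polyfit(np.log(x_known), y_known, deg=1)
    x_missing = df.index[mask].astype(float).values
    df.loc[mask, col] = log_model(x_missing, a, b)

print(df)
```

With this model the extrapolated values grow slowly instead of running into the hundreds and thousands as the cubic does, but whether that is closer to the truth still depends on what the data actually represents.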

Sharing is welcome; when reposting, please credit the source: 内存溢出

Original URL: http://outofmemory.cn/zaji/5623395.html
