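Extrapolating DataFrames
A DataFrame can be extrapolated, but pandas has no simple method call for it, so another library (for example scipy.optimize) is required.

Extrapolating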
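In general, extrapolating requires you to make certain assumptions about the data being extrapolated. One approach is to curve fit a general parameterized equation to the data, find the parameter values that best describe the existing data, and then use the fitted equation to calculate values beyond the range of that data. The difficult and limiting part of this approach is that some assumption about the trend must be made when the parameterized equation is selected. A suitable equation can be found by trial and error with different equations until one gives the desired result, or it can sometimes be inferred from the source of the data. The data provided in the question is really not a large enough dataset to obtain a well-fit curve; however, it is good enough for illustration.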
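To make the point about trend assumptions concrete, here is a minimal sketch that fits two different polynomials to the same five points and compares what each predicts inside and outside the data range. The points are the interpolated avg values from the example further down; the use of numpy.polyfit and the rescaling of x to thousands are choices made only for this illustration and are not part of the original answer.

```python
import numpy as np

# Interpolated "avg" values from the example data (index 250 to 3000).
# x is expressed in thousands to keep the polynomial fit well conditioned.
x = np.array([250, 500, 1000, 2000, 3000], dtype=float) / 1000.0
y = np.array([0.558931, 0.558931, 0.625137, 0.625137, 0.649448])

# Fit two candidate trend models to the same points.
linear_coeffs = np.polyfit(x, y, deg=1)
cubic_coeffs = np.polyfit(x, y, deg=3)

# Evaluate both inside the data range and far outside it.
x_inside, x_outside = 1.0, 50.0  # 1000 and 50000 in original units
for name, coeffs in (("linear", linear_coeffs), ("cubic", cubic_coeffs)):
    inside = np.polyval(coeffs, x_inside)
    outside = np.polyval(coeffs, x_outside)
    print(f"{name:>6}: f(1000) = {inside:.4f}, f(50000) = {outside:.4f}")
```

Both models give similar values where data exists, yet they diverge sharply at 50000. The fit quality inside the range says little about which extrapolation, if either, is right.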
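The following is an example of extrapolating the DataFrame with a 3rd-order polynomial:

f(x) = a x^3 + b x^2 + c x + d    (Eq. 1)

This generic function (func()) is curve fit to each column to obtain a unique set of column-specific parameters (i.e. a, b, c, d). These parameterized equations are then used to extrapolate the data in each column for all index values that hold NaNs.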
    import pandas as pd
    from io import StringIO
    from scipy.optimize import curve_fit

    # Build the example DataFrame; the first column becomes the index
    df = pd.read_table(StringIO('''
                    neg       neu       pos       avg
    0               NaN       NaN       NaN       NaN
    250        0.508475  0.527027  0.641292  0.558931
    500             NaN       NaN       NaN       NaN
    1000       0.650000  0.571429  0.653983  0.625137
    2000            NaN       NaN       NaN       NaN
    3000       0.619718  0.663158  0.665468  0.649448
    4000            NaN       NaN       NaN       NaN
    6000            NaN       NaN       NaN       NaN
    8000            NaN       NaN       NaN       NaN
    10000           NaN       NaN       NaN       NaN
    20000           NaN       NaN       NaN       NaN
    30000           NaN       NaN       NaN       NaN
    50000           NaN       NaN       NaN       NaN'''), sep=r'\s+')

    # Do the original interpolation
    df.interpolate(method='nearest', axis=0, inplace=True)

    # Display result
    print('Interpolated data:')
    print(df)
    print()

    # Function to curve fit to the data
    def func(x, a, b, c, d):
        return a * (x ** 3) + b * (x ** 2) + c * x + d

    # Initial parameter guess, just to kick off the optimization
    guess = (0.5, 0.5, 0.5, 0.5)

    # Create copy of data to remove NaNs for curve fitting
    fit_df = df.dropna()

    # Place to store function parameters for each column
    col_params = {}

    # Curve fit each column
    for col in fit_df.columns:
        # Get x & y
        x = fit_df.index.astype(float).values
        y = fit_df[col].values
        # Curve fit column and get curve parameters
        params = curve_fit(func, x, y, guess)
        # Store optimized parameters
        col_params[col] = params[0]

    # Extrapolate each column
    for col in df.columns:
        # Get the index labels where the column is NaN
        nan_idx = df.index[df[col].isnull()]
        x = nan_idx.astype(float).values
        # Extrapolate those points with the fitted function
        df.loc[nan_idx, col] = func(x, *col_params[col])

    # Display result
    print('Extrapolated data:')
    print(df)
    print()

    print('Data was extrapolated with these column functions:')
    for col in col_params:
        print('f_{}(x) = {:0.3e} x^3 + {:0.3e} x^2 + {:0.4f} x + {:0.4f}'.format(col, *col_params[col]))

Extrapolating results
    Interpolated data:
                neg       neu       pos       avg
    0           NaN       NaN       NaN       NaN
    250    0.508475  0.527027  0.641292  0.558931
    500    0.508475  0.527027  0.641292  0.558931
    1000   0.650000  0.571429  0.653983  0.625137
    2000   0.650000  0.571429  0.653983  0.625137
    3000   0.619718  0.663158  0.665468  0.649448
    4000        NaN       NaN       NaN       NaN
    6000        NaN       NaN       NaN       NaN
    8000        NaN       NaN       NaN       NaN
    10000       NaN       NaN       NaN       NaN
    20000       NaN       NaN       NaN       NaN
    30000       NaN       NaN       NaN       NaN
    50000       NaN       NaN       NaN       NaN

    Extrapolated data:
                   neg          neu         pos          avg
    0         0.411206     0.486983    0.631233     0.509807
    250       0.508475     0.527027    0.641292     0.558931
    500       0.508475     0.527027    0.641292     0.558931
    1000      0.650000     0.571429    0.653983     0.625137
    2000      0.650000     0.571429    0.653983     0.625137
    3000      0.619718     0.663158    0.665468     0.649448
    4000      0.621036     0.969232    0.708464     0.766245
    6000      1.197762     2.799529    0.991552     1.662954
    8000      3.281869     7.191776    1.702860     4.058855
    10000     7.767992    15.272849    3.041316     8.694096
    20000    97.540944   150.451269   26.103320    91.365599
    30000   381.559069   546.881749   94.683310   341.042883
    50000  1979.646859  2686.936912  467.861511  1711.489069

    Data was extrapolated with these column functions:
    f_neg(x) = 1.864e-11 x^3 + -1.471e-07 x^2 + 0.0003 x + 0.4112
    f_neu(x) = 2.348e-11 x^3 + -1.023e-07 x^2 + 0.0002 x + 0.4870
    f_avg(x) = 1.542e-11 x^3 + -9.016e-08 x^2 + 0.0002 x + 0.5098
    f_pos(x) = 4.144e-12 x^3 + -2.107e-08 x^2 + 0.0000 x + 0.6312

Plot for the avg column (figure not included)
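Without a larger dataset or knowledge of the source of the data, this result may be completely wrong, but it should illustrate the process of extrapolating a DataFrame. The equation assumed in func() would probably need to be played with to get a correct extrapolation. Also, no attempt was made to make the code efficient.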
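As one illustration of playing with the assumed equation, the sketch below swaps the cubic for a logarithmic trend, f(x) = a + b ln(x), and fits it to the three known avg values. The logarithmic form is purely an assumption chosen for this example, and nothing in the data confirms it is the right model. The helper name log_model is hypothetical. Because the model is linear in ln(x), numpy.polyfit is enough and no initial guess is needed.

```python
import numpy as np

# Known "avg" values from the example data (index values 250, 1000, 3000).
x_known = np.array([250.0, 1000.0, 3000.0])
y_known = np.array([0.558931, 0.625137, 0.649448])

# Assumed model (not from the original answer): f(x) = a + b * ln(x).
# It is linear in ln(x), so a degree-1 least-squares fit recovers b and a.
b, a = np.polyfit(np.log(x_known), y_known, deg=1)

def log_model(x):
    """Evaluate the fitted logarithmic trend at x (x must be positive)."""
    return a + b * np.log(x)

# Extrapolate to the larger index values that were NaN in the table.
x_new = np.array([4000.0, 6000.0, 8000.0, 10000.0, 20000.0, 30000.0, 50000.0])
y_new = log_model(x_new)
for xi, yi in zip(x_new, y_new):
    print(f"{xi:>7.0f}  {yi:.4f}")
```

With this model the extrapolated values grow far more slowly than the cubic ones above. Note that it cannot be evaluated at index 0, since ln(0) is undefined, so that row would need separate handling.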