Well, this is exactly the kind of question that can actually be settled with a little experimentation and a few code snippets…
In any case, the most general answer seems to be a firm no: not only between Python and Spark MLlib, but not even between Spark submodules, or between Python and NumPy…
Here is some reproducible code, run on the Databricks community cloud (where pyspark is already imported and the relevant contexts initialized):
import sys
import random
import pandas as pd
import numpy as np
from pyspark.sql.functions import rand, randn
from pyspark.mllib import random as r  # avoid conflict with native Python random module

print("Spark version " + spark.version)
print("Python version %s.%s.%s" % sys.version_info[:3])
print("Numpy version " + np.version.version)
# Spark version 2.3.1
# Python version 3.5.2
# Numpy version 1.11.1

s = 1234  # RNG seed

# Spark SQL random module:
spark_df = sqlContext.range(0, 10)
spark_df = spark_df.select("id", randn(seed=s).alias("normal"), rand(seed=s).alias("uniform"))

# Python 3 random module:
random.seed(s)
x = [random.uniform(0, 1) for i in range(10)]  # random.random() gives the exact same results
random.seed(s)
y = [random.normalvariate(0, 1) for i in range(10)]
df = pd.DataFrame({'uniform': x, 'normal': y})

# numpy random module:
np.random.seed(s)
xx = np.random.uniform(size=10)  # again, np.random.rand(10) gives the exact same results
np.random.seed(s)
yy = np.random.randn(10)
numpy_df = pd.DataFrame({'uniform': xx, 'normal': yy})

# Spark MLlib random module:
rdd_uniform = r.RandomRDDs.uniformRDD(sc, 10, seed=s).collect()
rdd_normal = r.RandomRDDs.normalRDD(sc, 10, seed=s).collect()
rdd_df = pd.DataFrame({'uniform': rdd_uniform, 'normal': rdd_normal})
And here are the results:
Native Python 3:
# df
     normal   uniform
0  1.430825  0.966454
1  1.803801  0.440733
2  0.321290  0.007491
3  0.599006  0.910976
4 -0.700891  0.939269
5  0.233350  0.582228
6 -0.613906  0.671563
7 -1.622382  0.083938
8  0.131975  0.766481
9  0.191054  0.236810
NumPy:
# numpy_df
     normal   uniform
0  0.471435  0.191519
1 -1.190976  0.622109
2  1.432707  0.437728
3 -0.312652  0.785359
4 -0.720589  0.779976
5  0.887163  0.272593
6  0.859588  0.276464
7 -0.636524  0.801872
8  0.015696  0.958139
9 -2.242685  0.875933
Spark SQL:
# spark_df.show()
+---+--------------------+-------------------+
| id|              normal|            uniform|
+---+--------------------+-------------------+
|  0|  0.9707422835368164| 0.9499610869333489|
|  1|  0.3641589200870126| 0.9682554532421536|
|  2|-0.22282955491417034|0.20293463923130883|
|  3|-0.00607734375219...|0.49540111648680385|
|  4|  -0.603246393509015|0.04350782074761239|
|  5|-0.12066287904491797|0.09390549680302918|
|  6|  0.2899567922101867| 0.6789838400775526|
|  7|  0.5827830892516723| 0.6560703836291193|
|  8|   1.351649207673346| 0.7750229279150739|
|  9|  0.5286035772104091| 0.6075560897646175|
+---+--------------------+-------------------+
Spark MLlib:
# rdd_df
     normal   uniform
0 -0.957840  0.259282
1  0.742598  0.674052
2  0.225768  0.707127
3  1.109644  0.850683
4 -0.269745  0.414752
5 -0.148916  0.494394
6  0.172857  0.724337
7 -0.276485  0.252977
8 -0.963518  0.356758
9  1.366452  0.703145
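The "gives the exact same results" remarks in the code comments above, as well as the cross-library disagreement itself, can be checked locally without any Spark cluster. A quick sketch (assuming the legacy `np.random` API, where `uniform(0, 1)` reduces to the plain `[0, 1)` stream):

```python
import random
import numpy as np

s = 1234  # same seed as in the experiment above

# random.uniform(0, 1) is a + (b - a) * random.random(), which for (0, 1)
# reduces bit-for-bit to random.random():
random.seed(s)
x1 = [random.uniform(0, 1) for _ in range(10)]
random.seed(s)
x2 = [random.random() for _ in range(10)]
assert x1 == x2

# Likewise, np.random.uniform(size=n) and np.random.rand(n) draw from the
# same underlying [0, 1) stream in the legacy RandomState API:
np.random.seed(s)
y1 = np.random.uniform(size=10)
np.random.seed(s)
y2 = np.random.rand(10)
assert np.array_equal(y1, y2)

# Yet the two libraries, seeded identically, still disagree with each other:
assert x1[0] != y1[0]
```

So the equivalences hold *within* each library, while the streams across libraries remain different, consistent with the tables above.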
And of course, even if the above results were identical, this would not guarantee that the results of, say, a Random Forest in scikit-learn would be exactly the same as those of a pyspark Random Forest…
So, although the answer is no, I honestly cannot see how this affects the deployment of any ML system; that is, if the results depended critically on the RNG, then something would certainly be wrong…
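For deployment purposes, what matters is reproducibility *within* whichever library you actually use, and every one of the RNGs above does provide that once the seed is fixed. A minimal stdlib sketch (the `sample` helper and `SEED` constant are illustrative names, not part of any API):

```python
import random

SEED = 1234  # a fixed seed, as you would pin it in a deployment config


def sample(n=10, seed=SEED):
    """Draw n uniforms from a private, seeded generator (no global state)."""
    rng = random.Random(seed)  # independent generator instance
    return [rng.random() for _ in range(n)]


# Two runs with the same seed reproduce each other exactly...
assert sample() == sample()
# ...while a different seed yields a different stream.
assert sample(seed=SEED) != sample(seed=SEED + 1)
```

Using a private `random.Random` instance instead of the module-level functions also keeps the stream immune to other code reseeding the global generator.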