I've found the best way to do this is to combine several StringIndexers into a list and use a Pipeline to execute them all:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

# Fit one StringIndexer per column, skipping the non-categorical 'date' column
indexers = [
    StringIndexer(inputCol=column, outputCol=column + "_index").fit(df)
    for column in list(set(df.columns) - set(['date']))
]

# The fitted indexers are already Transformers, so Pipeline.fit passes them through
pipeline = Pipeline(stages=indexers)
df_r = pipeline.fit(df).transform(df)
df_r.show()

+-------+--------------+----+----+----------+----------+-------------+
|address|          date|food|name|food_index|name_index|address_index|
+-------+--------------+----+----+----------+----------+-------------+
|1111111|20151122045510| gre| Yin|       0.0|       0.0|          0.0|
|1111111|20151122045501| gra| Yin|       2.0|       0.0|          0.0|
|1111111|20151122045500| gre| Yln|       0.0|       2.0|          0.0|
|1111112|20151122065832| gre| Yun|       0.0|       4.0|          3.0|
|1111113|20160101003221| gre| Yan|       0.0|       3.0|          1.0|
|1111111|20160703045231| gre| Yin|       0.0|       0.0|          0.0|
|1111114|20150419134543| gre| Yin|       0.0|       0.0|          5.0|
|1111115|20151123174302| ddd| Yen|       1.0|       1.0|          2.0|
|2111115|      20123192| ddd| Yen|       1.0|       1.0|          4.0|
+-------+--------------+----+----+----------+----------+-------------+
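As a side note, if you are on Spark 3.0 or later, StringIndexer accepts inputCols/outputCols directly, so a single indexer can cover every column without the list comprehension or the Pipeline. A minimal sketch, assuming Spark >= 3.0 and the same df as above:

from pyspark.ml.feature import StringIndexer

# Spark >= 3.0: one StringIndexer can index several columns at once
cols = [c for c in df.columns if c != 'date']
indexer = StringIndexer(
    inputCols=cols,
    outputCols=[c + "_index" for c in cols]
)
df_r = indexer.fit(df).transform(df)
df_r.show()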