Pipeline:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, StringIndexerModel

df = spark.createDataFrame([("a", "foo"), ("b", "bar")], ("x1", "x2"))

pipeline = Pipeline(stages=[
    StringIndexer(inputCol=c, outputCol='{}_index'.format(c))
    for c in df.columns
])
model = pipeline.fit(df)
```
From the fitted stages:

```python
# Accessing _java_obj shouldn't be necessary in Spark 2.3+
{x._java_obj.getOutputCol(): x.labels
 for x in model.stages if isinstance(x, StringIndexerModel)}
# {'x1_index': ['a', 'b'], 'x2_index': ['foo', 'bar']}
```
From the metadata of the transformed DataFrame:

```python
indexed = model.transform(df)
{c.name: c.metadata["ml_attr"]["vals"]
 for c in indexed.schema.fields if c.name.endswith("_index")}
# {'x1_index': ['a', 'b'], 'x2_index': ['foo', 'bar']}
```
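Both approaches recover the same label lists. For intuition, here is a minimal pure-Python sketch (no Spark required) of how a StringIndexer builds its label list under the default stringOrderType="frequencyDesc": distinct values ordered by descending frequency. The tie-breaking rule used here (first occurrence in the data) is an assumption for illustration only; Spark does not guarantee a particular order among equally frequent values.

```python
from collections import Counter

def index_labels(values):
    """Sketch of frequencyDesc label ordering: most frequent value first.
    Ties are broken by first occurrence (an illustrative assumption)."""
    counts = Counter(values)
    first_seen = {}
    for i, v in enumerate(values):
        first_seen.setdefault(v, i)
    return sorted(counts, key=lambda v: (-counts[v], first_seen[v]))

print(index_labels(["a", "b", "a", "c", "b", "a"]))  # ['a', 'b', 'c']
# The label's position in this list is the index the value maps to:
mapping = {v: i for i, v in enumerate(index_labels(["a", "b", "a"]))}
print(mapping)  # {'a': 0, 'b': 1}
```

Under this sketch, the single-occurrence columns from the example above come out in row order, matching the outputs shown ('x1_index': ['a', 'b'], 'x2_index': ['foo', 'bar']).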