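VectorAssembler is a transformer that combines multiple columns into a single vector column. It merges the raw features into one feature vector, which can then be used to train machine learning models such as linear regression and decision trees.
VectorIndexer indexes the categorical features in a vector column, here the features passed on from VectorAssembler.
It automatically decides which features are categorical and converts their original values to category indices.
A feature is treated as categorical when it has at most maxCategories distinct values; a feature with more distinct values is treated as continuous and left unchanged.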
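Code demo:

import org.apache.spark.ml.feature.{VectorAssembler, VectorIndexer, VectorIndexerModel}
import org.apache.spark.sql.SparkSession

// The object name is arbitrary; it only makes the demo runnable.
object VectorIndexerDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("VectorIndexerDemo").getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    // Five rows with three integer columns.
    val seqRdd = spark.sparkContext.makeRDD(Seq(
      (1, 1, 5),
      (2, 2, 0),
      (2, 3, 5),
      (1, 4, 2),
      (3, 5, 2)
    ))
    val df = spark.createDataFrame(seqRdd).toDF("a", "b", "c")
    println("=========== Raw data =============")
    df.show(false)

    // Combine columns a, b and c into a single vector column.
    val vectorAssembler = new VectorAssembler().setInputCols(Array("a", "b", "c")).setOutputCol("rawFeatures")
    val df1 = vectorAssembler.transform(df)
    println("=========== VectorAssembler =============")
    df1.show(false)

    val indexerModel: VectorIndexerModel = new VectorIndexer()
      .setInputCol("rawFeatures")
      .setOutputCol("indexedFeatures")
      // .setMaxCategories(2) // A feature with more distinct values than this is treated as continuous and left as-is; only categorical features are indexed.
      // The call above is commented out, so the default (20) applies and all three columns are indexed.
      .fit(df1)
    println("=========== VectorIndexer =============")
    indexerModel.transform(df1).show(false)

    spark.stop()
  }
}

Output:

=========== Raw data =============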
+---+---+---+
|a |b |c |
+---+---+---+
|1 |1 |5 |
|2 |2 |0 |
|2 |3 |5 |
|1 |4 |2 |
|3 |5 |2 |
+---+---+---+
=========== VectorAssembler =============
+---+---+---+-------------+
|a |b |c |rawFeatures |
+---+---+---+-------------+
|1 |1 |5 |[1.0,1.0,5.0]|
|2 |2 |0 |[2.0,2.0,0.0]|
|2 |3 |5 |[2.0,3.0,5.0]|
|1 |4 |2 |[1.0,4.0,2.0]|
|3 |5 |2 |[3.0,5.0,2.0]|
+---+---+---+-------------+
=========== VectorIndexer =============
+---+---+---+-------------+---------------+
|a |b |c |rawFeatures |indexedFeatures|
+---+---+---+-------------+---------------+
|1 |1 |5 |[1.0,1.0,5.0]|[0.0,0.0,2.0] |
|2 |2 |0 |[2.0,2.0,0.0]|[1.0,1.0,0.0] |
|2 |3 |5 |[2.0,3.0,5.0]|[1.0,2.0,2.0] |
|1 |4 |2 |[1.0,4.0,2.0]|[0.0,3.0,1.0] |
|3 |5 |2 |[3.0,5.0,2.0]|[2.0,4.0,1.0] |
+---+---+---+-------------+---------------+
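
The indexedFeatures column above can be reproduced without Spark, which may help make the maxCategories rule concrete. The following is a minimal sketch in plain Scala, not Spark's actual implementation: the object VectorIndexerSketch and the helper indexFeatures are hypothetical names, and assigning indices in ascending order of the distinct values is an assumption that happens to match the output shown above.

```scala
object VectorIndexerSketch {
  // Hypothetical helper (not a Spark API): mimics the rule described in this post.
  // A column with at most maxCategories distinct values is treated as categorical
  // and its values are replaced by 0-based indices; other columns are left unchanged.
  def indexFeatures(rows: Seq[Seq[Double]], maxCategories: Int): Seq[Seq[Double]] = {
    val numFeatures = rows.head.length

    // Column index -> (original value -> category index), categorical columns only.
    // Assumption: indices follow the ascending order of the distinct values.
    val categoryMaps: Map[Int, Map[Double, Int]] =
      (0 until numFeatures)
        .map(j => j -> rows.map(_(j)).distinct.sorted)
        .filter { case (_, values) => values.length <= maxCategories }
        .map { case (j, values) => j -> values.zipWithIndex.toMap }
        .toMap

    rows.map { row =>
      row.zipWithIndex.map { case (value, j) =>
        categoryMaps.get(j).map(m => m(value).toDouble).getOrElse(value)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Same five rows as the rawFeatures column in the demo.
    val rows = Seq(
      Seq(1.0, 1.0, 5.0),
      Seq(2.0, 2.0, 0.0),
      Seq(2.0, 3.0, 5.0),
      Seq(1.0, 4.0, 2.0),
      Seq(3.0, 5.0, 2.0)
    )

    // maxCategories = 20: every column has at most 20 distinct values, so all are indexed.
    indexFeatures(rows, 20).foreach(r => println(r.mkString("[", ",", "]")))

    // maxCategories = 3: column b has 5 distinct values, so it is kept as-is.
    indexFeatures(rows, 3).foreach(r => println(r.mkString("[", ",", "]")))
  }
}
```

With maxCategories = 20 the sketch yields the same vectors as the indexedFeatures column above. With maxCategories = 3, columns a and c are still indexed while column b keeps its original values, since it has five distinct values.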