Source: LSHMinHash
import org.apache.spark.ml.feature.MinHashLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val dfA = spark.createDataFrame(Seq(
  (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
  (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0)))),
  (2, Vectors.sparse(6, Seq((0, 1.0), (2, 1.0), (4, 1.0))))
)).toDF("id", "features")

val dfB = spark.createDataFrame(Seq(
  (3, Vectors.sparse(6, Seq((1, 1.0), (3, 1.0), (5, 1.0)))),
  (4, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (5, 1.0)))),
  (5, Vectors.sparse(6, Seq((1, 1.0), (2, 1.0), (4, 1.0))))
)).toDF("id", "features")

val key = Vectors.sparse(6, Seq((1, 1.0), (3, 1.0)))

val mh = new MinHashLSH()
  .setNumHashTables(5)
  .setInputCol("features")
  .setOutputCol("hashes")

val model = mh.fit(dfA)

// Feature Transformation
println("The hashed dataset where hashed values are stored in the column 'hashes':")
model.transform(dfA).show(false)

// Compute the locality sensitive hashes for the input rows, then perform approximate
// similarity join.
// We could avoid computing hashes by passing in the already-transformed dataset, e.g.
// `model.approxSimilarityJoin(transformedA, transformedB, 0.6)`
println("Approximately joining dfA and dfB on Jaccard distance smaller than 0.6:")
model.approxSimilarityJoin(dfA, dfB, 0.6, "JaccardDistance")
  .select(col("datasetA.id").alias("idA"),
    col("datasetB.id").alias("idB"),
    col("JaccardDistance")).show()

// Compute the locality sensitive hashes for the input rows, then perform approximate nearest
// neighbor search.
// We could avoid computing hashes by passing in the already-transformed dataset, e.g.
// `model.approxNearestNeighbors(transformedA, key, 2)`
// It may return less than 2 rows when not enough approximate near-neighbor candidates are
// found.
println("Approximately searching dfA for 2 nearest neighbors of the key:")
model.approxNearestNeighbors(dfA, key, 2).show()
The difference between dense and sparse vectors
A dense vector stores its values as an ordinary Double array, while a sparse vector consists of two parallel arrays: indices and values.
For example, the vector (1.0, 0.0, 1.0, 3.0):
in dense format is represented as
[1.0, 0.0, 1.0, 3.0]
and in sparse format as
Vectors.sparse(4, [0, 2, 3], [1.0, 1.0, 3.0]) or Vectors.sparse(4, Seq((0, 1.0), (2, 1.0), (3, 3.0)))
The meaning of each part is as follows:

  Part              Meaning
  4                 the size (length) of the vector
  [0, 2, 3]         the indices of the non-zero entries
  [1.0, 1.0, 3.0]   the values at those indices
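The equivalence between the two representations can be checked with a minimal sketch using Spark's ml.linalg API (this assumes a spark-shell session or spark-mllib on the classpath; the value names are illustrative only):

import org.apache.spark.ml.linalg.Vectors

// Dense representation: the full Double array, zeros included
val dv = Vectors.dense(1.0, 0.0, 1.0, 3.0)

// Sparse representation: size plus parallel indices/values arrays
val sv1 = Vectors.sparse(4, Array(0, 2, 3), Array(1.0, 1.0, 3.0))

// Equivalent Seq-of-(index, value) form
val sv2 = Vectors.sparse(4, Seq((0, 1.0), (2, 1.0), (3, 3.0)))

// All three describe the same 4-element vector
assert(sv1.toArray.sameElements(dv.toArray))
assert(sv2.toArray.sameElements(dv.toArray))
// Indexing works the same way regardless of representation
assert(dv(3) == 3.0 && sv1(3) == 3.0)

Note that `toArray` materializes the full array (including zeros), which is why the sparse form is preferred for high-dimensional feature vectors such as the 6-dimensional ones in the MinHashLSH example above.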