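Selecting features by importance with the random forest classifier RandomForestClassifier
Fitting the classifier returns a model that exposes the importance of each feature (as a sparse vector).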
01. Import the module and create the SparkSession object
from pyspark.sql import SparkSession
# Parentheses let the builder chain span several lines.
spark = (
    SparkSession.builder
    .config("spark.driver.host", "192.168.1.4")
    .config("spark.ui.showConsoleProgress", "false")
    .appName("importancefeatures")
    .master("local[*]")
    .getOrCreate()
)
02. Read the data into a DataFrame and inspect its schema
data = spark.read.csv("/mnt/e/win_ubuntu/Code/DataSet/MLdataset/dog_food.csv", header=True, inferSchema=True)
data.printSchema()
Output:
root
 |-- A: integer (nullable = true)
 |-- B: integer (nullable = true)
 |-- C: double (nullable = true)
 |-- D: integer (nullable = true)
 |-- Spoiled: double (nullable = true)
03. Look at the data itself by showing the first 3 rows
data.show(3)
Output:
+---+---+----+---+-------+
|  A|  B|   C|  D|Spoiled|
+---+---+----+---+-------+
|  4|  2|12.0|  3|    1.0|
|  5|  6|12.0|  7|    1.0|
|  6|  2|13.0|  6|    1.0|
+---+---+----+---+-------+
only showing top 3 rows
04. Import VectorAssembler, combine the feature columns into a single vector column, and view the resulting table
from pyspark.ml.feature import VectorAssembler
# Columns A to D are the features; Spoiled is the label and is left out.
vectorAssembler = VectorAssembler(inputCols=["A", "B", "C", "D"], outputCol="features")
datavector = vectorAssembler.transform(data)
datavector.show()
Output:
+---+---+----+---+-------+-------------------+
|  A|  B|   C|  D|Spoiled|           features|
+---+---+----+---+-------+-------------------+
|  4|  2|12.0|  3|    1.0| [4.0,2.0,12.0,3.0]|
|  5|  6|12.0|  7|    1.0| [5.0,6.0,12.0,7.0]|
|  6|  2|13.0|  6|    1.0| [6.0,2.0,13.0,6.0]|
|  4|  2|12.0|  1|    1.0| [4.0,2.0,12.0,1.0]|
|  4|  2|12.0|  3|    1.0| [4.0,2.0,12.0,3.0]|
| 10|  3|13.0|  9|    1.0|[10.0,3.0,13.0,9.0]|
..............................................
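To make the effect of the assembling step concrete, here is a minimal plain-Python sketch of what happens to a single row. It only mimics the result and is not how Spark implements VectorAssembler; the helper name `assemble` and the dictionary form of the row are assumptions made for illustration.

```python
def assemble(row, input_cols):
    """Collect the named columns of one row into a list of floats.

    Illustrative stand-in for the effect of VectorAssembler on one row;
    `row` is a plain dict here, not a Spark Row.
    """
    return [float(row[col]) for col in input_cols]


# First row of the dog_food table shown above.
row = {"A": 4, "B": 2, "C": 12.0, "D": 3, "Spoiled": 1.0}

# Only the feature columns are assembled; the label column is left out.
features = assemble(row, ["A", "B", "C", "D"])
print(features)  # [4.0, 2.0, 12.0, 3.0]
```

The integer columns come out as floats, which matches the `[4.0,2.0,12.0,3.0]` entry in the `features` column.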
05. Select the columns that are needed and view the first 3 rows
datavector = datavector.select("features", "Spoiled")
datavector.show(3)
Output:
+------------------+-------+
|          features|Spoiled|
+------------------+-------+
|[4.0,2.0,12.0,3.0]|    1.0|
|[5.0,6.0,12.0,7.0]|    1.0|
|[6.0,2.0,13.0,6.0]|    1.0|
+------------------+-------+
only showing top 3 rows
06. Import the random forest classifier and construct it
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(featuresCol="features", labelCol="Spoiled")
07. Train the classifier on the dataset to obtain a classification model
model = rf.fit(datavector)
08. View the importance the model assigns to each feature:
model.featureImportances
Output:
SparseVector(4, {0: 0.0183, 1: 0.0201, 2: 0.9359, 3: 0.0256})
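The sparse vector is indexed by position, so index 0 corresponds to A, 1 to B, and so on, following the order of `inputCols`. The sketch below pairs each importance with its column name and sorts them. It is plain Python that uses the values printed above; the helper `rank_features` is a hypothetical name. With a live model, the dense values could presumably be taken from `model.featureImportances.toArray()` instead of typing them in.

```python
def rank_features(names, size, entries):
    """Pair feature names with importances and sort in descending order.

    `size` and `entries` mirror the printed SparseVector(size, {index: value});
    indices missing from `entries` count as 0.0.
    """
    dense = [entries.get(i, 0.0) for i in range(size)]
    return sorted(zip(names, dense), key=lambda pair: pair[1], reverse=True)


# Same order as the inputCols passed to VectorAssembler.
feature_names = ["A", "B", "C", "D"]

# Values copied from the SparseVector printed above.
importances = {0: 0.0183, 1: 0.0201, 2: 0.9359, 3: 0.0256}

ranked = rank_features(feature_names, 4, importances)
for name, score in ranked:
    print(f"{name}: {score:.4f}")
# C: 0.9359
# D: 0.0256
# B: 0.0201
# A: 0.0183
```

Column C dominates the ranking with an importance of about 0.94, so it is the feature the model relies on most when predicting Spoiled.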