CMAR代码学习2 计算卡方值并利用CR-tree剪枝_java

CMAR代码学习1 生成FP-tree挖掘规则https://blog.csdn.net/woshiwu6666/article/details/124717377

有一个java语法的知识点，通过赋值给属性，会自动调用toString方法

要特别注意的是，下面这个例子当中，要写成return "Animal [i=" + this.i + ", o=" + this.o +",kafang=" +(this.kafang=(NUMBER_INSTANCES/3))+"]";有this指针toString()方法自动被调用，并且有值更新的时候toString()里的值也会默认更新。如果写成,return "Animal [i=" + i + ", o=" + o +",kafang=" +(kafang=(NUMBER_INSTANCES/3))+"]";则toString()方法不能被自动调用，要使用System.out.println();才能被调用

public class Animal {
	
	int i=1;
	String o="hello";
	double kafang=0;
	static double NUMBER_INSTANCES=0;
	public void printType() {
		System.out.println("Cat");
		}
	
	@Override
	public String toString() {
		return "Animal [i=" + this.i + ", o=" + this.o +",kafang=" +(this.kafang=(NUMBER_INSTANCES/3))+"]";
	}
	
	public static void main(String[] args) {
		Animal dog=new Animal();
		//通过赋值,会调用toString方法
		dog.i=2;
		System.out.println(dog);
		//更新值，toString方法也会更新
//		dog.NUMBER_INSTANCES=9;
		//通过赋值,会调用toString方法,自动计算了卡方值
		Animal.NUMBER_INSTANCES=9;
//		dog.i=2;
		System.out.println("kafang="+dog.kafang);
		System.out.println(dog);
	}

}

卡方值的计算

AlgoCMAR.java

生成这个类，new ClassifierCMAR(rules, dataset, delta);，利用该类计算卡方值并利用CR-tree剪枝

/**
     * Train a classifier
     * @param dataset a training dataset
     * @return a rule classifier
     * @throws Exception if an error occurs
     */
    @Override
    public ClassifierCMAR train(Dataset dataset){
    	// Apply a modified FPGrowth algorithm to obtain the rules
        FPGrowthForCMAR fpgrowth = new FPGrowthForCMAR(dataset, minSup, minConf);
        List rules = fpgrowth.run();
        
        // Return a classifier that is created using these rules
        return new ClassifierCMAR(rules, dataset, delta);
    }

ClassifierCMAR.java

NUMBER_INSTANCES表示数据的总条数

	/**
     * Constructor
     * 
     * @param rules    forming the current classifier
     * @param training dataset used while training classifier
     * @param delta  delta
     * 
     * @throws Exception
     */
    public ClassifierCMAR(List rules, Dataset training, int delta) {
        super("CMAR");
        //传了一个数进去，就可以计算卡方值
       RuleCMAR.NUMBER_INSTANCES = training.getInstances().size();
 
//        System.out.println(1);
       //一共有多少个属性值
        CRTree.NUMBER_SINGLETONS = training.getDistinctItemsCount();
        CRTree crTree = new CRTree(training, delta);
        for (Rule rule : rules) {
            crTree.insert(rule);
        }
        //它为什么把那些size并且优先级高的都去掉了
        crTree.pruneUsingCover();
        this.rules = crTree.getRules();
//        System.out.println(1);
    }

Rule.java

属性有更新，会自动调用toString()方法

/*
	 * (non-Javadoc)
	 * 
	 * @see java.lang.Object#toString()
	 */
	public String toString() {
//		System.out.println(1);
		return this.antecedent.toString() + " -> " + this.getKlass() + getMeasuresToString();
	}
	
	public abstract String getMeasuresToString();

RuleCMAR.java

	@Override
	public String getMeasuresToString() {
		return " #SUP: " + getSupportRule() + " #CONF: " + getConfidence() + " #CHISQUARE: " + this.getChiSquare();
	}

计算卡方值

假设有这么一条规则，计算它的卡方值

生成实际表格与期望表格

2.利用公式计算，得到结果

调用getChiSquare()函数进行计算

以下面的两条规则为例，解释SupportKlass，SupportAntecedent，SupportRule以及CONF

SupportKlass: 分类为class=12的数量

SupportAntecedent：条件部分的出现次数

SupportRule: 这些条件判断为12的次数

CONF: SupportRule/SupportAntecedent

[5] -> 12 #SupportRule: 1#SupportAntecedent: 1#SupportKlass: 6 #CONF: 1.0

[5, 8] -> 12 #SupportRule: 1#SupportAntecedent: 1#SupportKlass: 6 #CONF: 1.0

/**
	 * Get the chi square of this rule
	 * 
	 * @return chi square of this rule
	 */
	public double getChiSquare() {
		double[] observedValues = new double[4];
		double[] expectedValues = new double[4];
		// Calculate observed and expected values
		observedValues[0] = this.supportRule;
		observedValues[1] = this.supportAntecedent - this.supportRule;
		observedValues[2] = this.supportKlass - this.supportRule;
		observedValues[3] = NUMBER_INSTANCES - this.supportAntecedent - this.supportKlass + this.supportRule;

		// Calculate additional support values
		double supNotAntecedent = NUMBER_INSTANCES - this.supportAntecedent;
		double supNotConsequent = NUMBER_INSTANCES - this.supportKlass;

		// Expected values
		expectedValues[0] = (this.supportKlass * this.supportAntecedent) / NUMBER_INSTANCES;
		expectedValues[1] = (supNotConsequent * this.supportAntecedent) / NUMBER_INSTANCES;
		expectedValues[2] = (this.supportKlass * supNotAntecedent) / NUMBER_INSTANCES;
		expectedValues[3] = (supNotConsequent * supNotAntecedent) / NUMBER_INSTANCES;

		double sumChiSquaredValues = 0.0;

		for (int index = 0; index < observedValues.length; index++) {
			double chiValue = Math.pow((observedValues[index] - expectedValues[index]), 2.0) / expectedValues[index];
			sumChiSquaredValues = sumChiSquaredValues + chiValue;
		}

		return sumChiSquaredValues;
	}

下面展示计算的输出

[5] -> 12 #SupportRule: 1#SupportAntecedent: 1#SupportKlass: 6 #CONF: 1.0
observedValues[0]1.0
observedValues[1]0.0
observedValues[2]5.0
observedValues[3]3.0
supNotAntecedent8.0
supNotConsequent3.0
expectedValues[0]0.6666666666666666
expectedValues[1]0.3333333333333333
expectedValues[2]5.333333333333333
expectedValues[3]2.6666666666666665
chiValue0.1666666666666667
chiValue0.3333333333333333
chiValue0.020833333333333297
chiValue0.041666666666666706
sumChiSquaredValues0.5625
[5, 8] -> 12 #SupportRule: 1#SupportAntecedent: 1#SupportKlass: 6 #CONF: 1.0
observedValues[0]1.0
observedValues[1]0.0
observedValues[2]5.0
observedValues[3]3.0
supNotAntecedent8.0
supNotConsequent3.0
expectedValues[0]0.6666666666666666
expectedValues[1]0.3333333333333333
expectedValues[2]5.333333333333333
expectedValues[3]2.6666666666666665
chiValue0.1666666666666667
chiValue0.3333333333333333
chiValue0.020833333333333297
chiValue0.041666666666666706
sumChiSquaredValues0.5625

Yes表示是5，No表示非

	判断为12	判断为11	total
[5]Yes	1	0=1-0	1
[5]No	5=6-1	3=9-5-1	8
total	6	3	9

	判断为12	判断为11	total
[5, 8]Yes	1	0=1-0	1
[5, 8]No	5=6-1	3=9-5-1	8
total	6	3	9

下一步是创建CR-Tree

为CR-Tree赋值
调用 crTree.insert( )方法插入规则
调用crTree.pruneUsingCover( )过滤掉不符合规则的方法
得到过滤后的规则crTree.getRules( )

minCover：是database coverage的参数

第一步为CR-tree赋值

CRTree(Dataset dataset, int delta) {
		minCover = delta;
		this.dataset = dataset;
	}

第二步调用insert()方法

/**
	 * Insert a rule in the tree
	 * 
	 * @param baseRule a rule to be inserted into the CRTree
	 */
	protected void insert(Rule baseRule) {

		RuleCMAR rule = (RuleCMAR) baseRule;

		// If the rule fails the Chi-Squared test
		if (rule.getChiSquare() <= THRESHOLD_CHI_SQUARE) {
			// We dont add the rule
			return;
		}

		// Otherwise, we create a new node for this rule
		CRNode newNode = new CRNode(rule);

		// If it is the first rule, it will be added as the root of the tree.
		//添加到树的rootnode当中
		if (rootNode == null) {
			rootNode = newNode;
			return;
		}

		// If more general rule with higher ranking exists, current rule will be
		// discarded
		//如果两项都满足就discarded，只要有一项不满足就可以添加
		if (isMoreGeneralNode(newNode)) {
			return;
		}

		// Add current node as the first node
		//和这颗树的首节点比较，如果比它优先级高就添加它到树的下一个节点当中然后返回
		//如果比这颗树的头节点大
		if (newNode.rule.isGreater(rootNode.rule)) {
			newNode.next = rootNode;
			rootNode = newNode;
			return;
		}

		// Search for the position where
		// the rule should be inserted in terms of rank
		//则要找到合适的位置的位置添加
		CRNode currentNode = rootNode;
		CRNode nextNode = rootNode.next;
		while (nextNode != null) {
			if (newNode.rule.isGreater(nextNode.rule)) {
				currentNode.next = newNode;
				newNode.next = nextNode;
				return;
			}
			currentNode = nextNode;
			nextNode = nextNode.next;
		}

		// Add new node at the very end
		//整棵树里都没有比它小的就添加到最后，最小的放到最后
		currentNode.next = newNode;
	}

首先,有一个THRESHOLD_CHI_SQUARE 的筛选，默认阈值1.6424。不满足阈值，则不放进这颗树中

/**
	 * Critical threshold for 25% "significance" level (assuming "degree of freedom"
	 * equivalent to 1).
	 */
	private static final double THRESHOLD_20 = 1.6424;

	/**
	 * Critical threshold value for CHI_SQUARE
	 */
	private static final double THRESHOLD_CHI_SQUARE = THRESHOLD_20; // Default

否则放进这颗树中，这一棵单路径的树，其数据结构如下所示

CRNode.java

/**
 * Class representing a CRTree node, as used by the CMAR algorithm.
 * 
 * @see AlgoCMAR
 */
public class CRNode {
    /**
     * Rule stored in the current node
     */
    RuleCMAR rule = null;

    /**
     * Link to the next node
     */
    CRNode next = null;

    /**
     * Constructor 
     * 
     * @param rule to be stored in the node
     */
    CRNode(RuleCMAR rule) {
        this.rule = rule;
    }
}

接着判断规则是不是general和high confidence的，用general和high confidence的去覆盖specific和low confidence的

	/**
	 * Checks whether there are a more general rule, with higher ranking in the
	 * current tree
	 * 
	 * @param ruleNode a node to be inserted
	 * @return true if more general rule exists in the tree
	 */
	private boolean isMoreGeneralNode(CRNode ruleNode) {
		
		CRNode currentNode = rootNode;

		// Search in tree by follwing the "next" links between nodes
		while (currentNode != null) {
			//如果这两个都满足就discard
			//如果size更小，并且优先级更高，就discard，不用放到树当中，因为都满足的不需要修剪
			//只满足一个的才需要修剪
			if (ruleNode.rule.isMoreGeneral(currentNode.rule) && ruleNode.rule.isGreater(currentNode.rule))
				{
//				System.out.println(1);
//				System.out.println(ruleNode.rule);
				return true;
				}
			currentNode = currentNode.next;
		}

		return false;
	}

如何算是more generalisMoreGeneralNode(newNode)，原文中是怎么描述的

欢迎分享，转载请注明来源：内存溢出

原文地址: http://outofmemory.cn/langs/942366.html

CMAR代码学习2 计算卡方值并利用CR-tree剪枝

发表评论

评论列表（0条）