数据挖掘笔记-关联规则-Apriori-1

fighting_2013

浏览: 15880 次
性别:

最近访客更多访客>>

98es

123456.H

woodding2008

tianshun.cao

博主相关

博客

微博

相册

留言

关于我

文章分类

社区版块

存档分类

博客分类：

DATA_MINING

数据挖掘关联规则 Apriori java

今天看了一下关联规则分析中的Apriori算法，先了解下基本概念：

关联规则分析用于发现隐藏在大型数据集中的有意义的联系。在交易数据、关系数据或其他信息载体中，查找存在于项目集合或对象集合之间的频繁模式、关联、相关性或因果结构。

•关联规则挖掘形式化定义：

•原始数据描述

设I ={i1, i2,…,im}是所有项(item)的集合，若干项的集合，称为项集(Item Sets)。

记T为交易(transaction,事务) t的集合，其中，交易t是项的集合，并且t⊆I。

设A、C分别是一个I中项的集合，如果A⊆T且C⊆T，且A∩C=Φ那么称交易T包含A ∪ C。

•目标数据描述

所有形如A⇒C蕴涵式的称为关联规则，这里A⊂I, C⊂I，并且A∩C=Φ。

•为了描述关联规则的有用性和确定性

•Ø关联规则的支持度

–如果交易集T中s次交易包含A∪C，则称规则A=>C在事务集T上的支持度为s。

–Support(A=>C)=P(A∪C)

•Ø关联规则的置信度

–如果交易数据库D中，包含A的交易中有c(%)的交易同时也包含C，称规则的置信度为c。（条件概率）

–Confidence(A=>C)=P(C|A) =support({A} ∪{C})/support({A})

•支持度, s,一次交易中包含{A、C}的可能性

•置信度, c,包含{A}的交易中也包含{C}的条件概率

•量化后的目标

–查找所有满足最小支持度和可信度的规则A=>C

•频繁项集

–如果项集满足最小支持度，则称之为频繁项集

•例如A={尿布，啤酒} ,支持度=3

•如果最小支持度= 3,则A是频繁项集

•如果频繁项集中包含K个项，则称为频繁K-项集，A为2-项集

•关联规则的挖掘步骤

–发现频繁项集

–由频繁项集生成满足最小支持度和最小置信度的关联规则

Apriori性质

–一个频繁项集中的任一非空子集也应是频繁项集。

–如果一项交易包含{牛奶，面包，汽水}，那么它一定包含{牛奶，面包}

–{牛奶，面包，汽水}是频繁的=>{牛奶，面包}一定也是频繁的

–即：任何非频繁项集的超集一定也是非频繁的

非频繁项集的超集可以不用进行测试，许多项之间的组合可以去掉（不满足频繁条件）

算法核心：逐层搜索的迭代方法，寻找最大频繁集。

下面是Apriori算法Java的简单实现：

public class AprioriBuilder {
	/** 最小支持度*/
	private int minSupport = 2;
	/** 最小置信度*/
	private double minConfidence = 0.6;
	/** 数据集*/
	private Data data = null;
	/** 候选集集合*/
	private List<List<ItemSet>> candidates = null;
	/** 频繁集集合*/
	private List<List<ItemSet>> frequencies = null;
	/** 关联规则集合*/
	private Set<AssociationRule> associationRules = null;
	
	public void initialize() {
		data = DataLoader.load("d:\\apriori.txt");
		candidates = new ArrayList<List<ItemSet>>();
		frequencies = new ArrayList<List<ItemSet>>();
		associationRules = new HashSet<AssociationRule>();
	}
	
	/** 生成频繁一项集*/
	private void frequency_1_itemset_gen() {
		List<ItemSet> frequency = new ArrayList<ItemSet>();
		List<ItemSet> candidate = new ArrayList<ItemSet>();
		Map<String, Integer> map = new HashMap<String, Integer>();
		for (Instance instance : data.getInstances()) {
			Set<String> valueSet = new TreeSet<String>();
			for (String value : instance.getValues()) {
				Integer mValue = map.get(value);
				map.put(value, null == mValue ? 1 : mValue + 1);
				valueSet.add(value);
			}
		}
		ShowUtils.print(map);
		for (Map.Entry<String, Integer> entry : map.entrySet()) {
			candidate.add(new ItemSet(entry.getKey(), entry.getValue()));
			if (entry.getValue() >= minSupport) {
				frequency.add(new ItemSet(entry.getKey(), entry.getValue()));
			}
		}
		candidates.add(candidate);
		frequencies.add(frequency);
	}
	
	/** 生成频繁K项集*/
	private void frequency_k_itemset_gen(int k) {
		Iterator<ItemSet> f1Iter = frequencies.get(k - 2).iterator();
		Iterator<ItemSet> f2Iter = frequencies.get(0).iterator();
		List<ItemSet> candidate = new ArrayList<ItemSet>();
		while (f1Iter.hasNext()) {
			ItemSet item1 = f1Iter.next();
			while (f2Iter.hasNext()) {
				ItemSet item2 = f2Iter.next();
				ItemSet temp = new ItemSet();
				temp.getItems().addAll(item1.getItems());
				if (!temp.getItems().containsAll(item2.getItems())) {
					temp.getItems().addAll(item2.getItems());
					boolean isContain = false;
					for (ItemSet itemSet : candidate) {
						if (itemSet.getItems().containsAll(temp.getItems())) {
							isContain = true;
						}
					}
					if (!isContain) {
						candidate.add(temp);
					}
				}
			}
			f2Iter = frequencies.get(0).iterator();
		}
		candidates.add(candidate);
		List<ItemSet> frequency = new ArrayList<ItemSet>();
		for (ItemSet itemSet : candidate) {
			int support = calculateSupport(itemSet.getItemsArray());
			if (support >= minSupport) {
				frequency.add(itemSet);
			}
		}
		frequencies.add(frequency);
	}
	
	/** 计算项集支持度*/
	private int calculateSupport(String... items) {
		if (null == items || items.length == 0) return 0; 
		int support = 0;
		for (Instance instance : data.getInstances()) {
			int temp = 0;
			for (String value : instance.getValues()) {
				for (String item : items) {
					if (item.equals(value)) {
						temp++;
					}
				}
			}
			if (temp == items.length) {
				support++;
			}
		}
		return support;
	}
	
	/** 计算关联规则置信度*/
	private void calculateConfidence(AssociationRule associationRule) {
		String[] arLeft = associationRule.getLeft().getItemsArray();
		String[] arRight = associationRule.getRight().getItemsArray();
		int leftLength = arLeft.length;
		int rightLength = arRight.length;
		String[] left = new String[leftLength + rightLength];
		String[] right = new String[rightLength];
		System.arraycopy(arLeft, 0, left, 0, leftLength);
		System.arraycopy(arRight, 0, left, leftLength, rightLength);
		System.arraycopy(arRight, 0, right, 0, rightLength);
		double leftSup = calculateSupport(left);
		double rightSup = calculateSupport(right);
		System.out.print(AssociationRuleHelper.convert(left) + ": " + leftSup + " ");
		System.out.println(AssociationRuleHelper.convert(right) + ": " + rightSup + " ");
		if (rightSup != 0) {
			double confidence = leftSup / rightSup;
			associationRule.setConfidence(confidence);
			if (confidence >= minConfidence && !AssociationRuleHelper.isContain(
					associationRules, associationRule)) {
				associationRules.add(associationRule);
			}
		}
		for (AssociationRule child : associationRule.getChildren()) {
			calculateConfidence(child);
		}
	}
	
	/** 获取最新频繁项集*/
	private List<ItemSet> getLastFrequency() {
		int index = frequencies.size() - 1;
		List<ItemSet> frequency = frequencies.get(index);
		while (0 == frequency.size()) {
			frequency = frequencies.get((index--));
		}
		return frequency;
	}
	
	/** 生成关联规则并且计算置信度*/
	private void association_rule_gen(List<ItemSet> frequency) {
		for (ItemSet itemSet : frequency) {
			AssociationRule ar = new AssociationRule(itemSet, null);
			child_association_rule_gen(ar);
			calculateConfidence(ar);
			AssociationRuleHelper.print(ar, 0);
		}
	}
	
	/** 生成子关联规则*/
	private void child_association_rule_gen(AssociationRule associationRule) {
		ItemSet left = associationRule.getLeft();
		TreeSet<String> items = left.getItems();
		int length = items.size();
		if (length == 1) return;
		List<String> temp = new ArrayList<String>(items);
		for (int i = 0; i < length; i++) {
			AssociationRule child = new AssociationRule();
			associationRule.getChildren().add(child);
			child.getRight().addAll(associationRule.getRight().getItems());
			child.getRight().add(temp.get(i));
			for (int j = 0; j < length; j++) {
				if (j != i) {
					child.getLeft().add(temp.get(j));
				}
			}
			child_association_rule_gen(child);
		}
	}
	
	public void build() {
		initialize();
		frequency_1_itemset_gen();
		print(candidates, true);
		print(frequencies, false);
		for (int k = 2; frequencies.get(k - 2).size() > 0; k++) {
			frequency_k_itemset_gen(k);
			print(candidates, true);
			print(frequencies, false);
		}
		List<ItemSet> lastFrequency = getLastFrequency();
		print(lastFrequency);
		association_rule_gen(lastFrequency);
		System.out.println("associationRules size: " + associationRules.size());
		for (AssociationRule associationRule : associationRules) {
			AssociationRuleHelper.print(associationRule);
		}
	}
	
	public void print(List<List<ItemSet>> itemSetss, boolean isCandidate) {
		System.out.println((isCandidate ?  "Candidate" : "Frequency") + " Item Set");
		System.out.println(itemSetss.size());
		for (List<ItemSet> itemSets : itemSetss) {
			print(itemSets);
		}
	}
	
	public void print(List<ItemSet> itemSets) {
		System.out.println("----------");
		for (ItemSet itemSet : itemSets) {
			System.out.println(itemSet.getItems());
		}
		System.out.println("----------");
	}
	
	public static void main(String[] args) {
		AprioriBuilder ab = new AprioriBuilder();
		ab.build();
	}
}

分享到：

数据挖掘笔记-关联规则-FPGrowth-1 | 数据挖掘笔记-分类-决策树-5

2014-05-29 17:17
浏览 791
评论(0)
分类:编程语言
查看更多

发表评论

您还没有登录,请您登录后再发表评论

最近访客更多访客>>

博主相关

文章分类

社区版块

存档分类

最新评论