M个int 排重分析 -

981875739

浏览: 7916 次
性别:
来自: 长沙

最近访客更多访客>>

ITJung

一介布衣

博主相关

博客

微博

相册

留言

关于我

文章分类

社区版块

存档分类

M个int 排重分析

博客分类：

学习篇

大数据

最近两周学习研究了有关大数据的解决办法，下面就M个int（数据量很大，大于内存）排重进行浅层分析。做这种大数据的问题，我们要真正的动脑筋去想解决办法，在可行的条件下最好自己去实践一下，否则也只能是纸上谈兵。

对几亿个int型数据进行排重，我们的电脑内存空间都是有限的【以1G内存为例，我们最多存2^30B，也就是是2^28个int（1GB=1024MB，1MB=1024KB，1KB=1024B）】想要把数据都装入数组中，那显然是装不下的~~那么怎么样才可行呢？

下面分析一下解决办法：

方法一：

将大数据分为若干部分，在每个小部分里面先进行一次排序（从大到小），每次从每个部分中取出各自最大的那个，再在其中选出最大存入文件，在拿若干个数据进行比较，选出第二大的，若与前一个元素相同，则舍去，不同则写入文件。这样依次进行下去，遍历完所有的数据也就完成了排重的工作。

但是这个方法需要扫描数据两遍，比较费事，在时间花销上应该是很大的。

那么有没有其他更好的方法呢？

方法二：

1个int—>4个Byte—>32个bit

试想想，我们可不可以在1个int里存入32个数据的相关信息呢？这样的话，是不是将占用的内存降低到1/32了呢？

将一个int表示为32位的01字符串，如果对应位置的数据存在就把bit设为1，不存在就为0。

假设有数据“5”，那么就把第5个位置的0改成1

将01串转换成int存入数组里就可以了

下面看一下我写的方法，比较简单

package Eliminate;

/**
 * 一组int很多，排重
 * @author Administrator
 * 最多有2^32(2147483647])个int数字，用一个bit来表示，最多用2^27大小的数组来表示
 */
	
public class Eliminate {

		/**
		 * 将一个32位的字符串转成一个整数
		 * 
		 * @param s
		 * @return
		 */
		public int stringTOint(String s) {
			int a=(int) (((int) s.charAt(0) - 48) * (2147483647+1));
			
			int b=(int)((int) s.charAt(1) - 48) * 1073741824 + ((int) s.charAt(2) - 48)* 536870912
					+ ((int) s.charAt(3) - 48) * 268435456+ ((int) s.charAt(4) - 48) * 134217728
					+ ((int) s.charAt(5) - 48) * 67108864 + ((int) s.charAt(6) - 48) * 33554432
					+ ((int) s.charAt(7) - 48) * 16777216 + ((int) s.charAt(8) - 48) * 8388608;
			
			int c=(int)((int) s.charAt(9) - 48) * 4194304 + ((int) s.charAt(10) - 48)*2097152
			        + ((int) s.charAt(11) - 48) * 1048576 + ((int) s.charAt(12) - 48)*524288
					+ ((int) s.charAt(13) - 48) *  262144+ ((int) s.charAt(14) - 48) * 131072
					+ ((int) s.charAt(15) - 48) * 65536 + ((int) s.charAt(16) - 48) * 32768 
					+ ((int) s.charAt(17) - 48)*16384+ ((int) s.charAt(18) - 48) * 8192
					+ ((int) s.charAt(19) - 48)*4096 + ((int) s.charAt(20) - 48) * 2048
					+ ((int) s.charAt(21) - 48) * 1024 + ((int) s.charAt(22) - 48) * 512
					+ ((int) s.charAt(23) - 48) * 256 + ((int) s.charAt(24) - 48)* 128
					+ ((int) s.charAt(25) - 48) * 64 + ((int) s.charAt(26) - 48)* 32 
					+ ((int) s.charAt(27) - 48) * 16+ ((int) s.charAt(28) - 48) * 8 
					+ ((int) s.charAt(29) - 48) * 4+ ((int) s.charAt(30) - 48) * 2
					+ ((int) s.charAt(31) - 48);
			return a+b+c;
					
		}
		
		/**
		 * 将int型转化为32位的bit
		 */
		public  String intTOstring(int index){
			String str=Integer.toBinaryString(index);
			int strlength=str.length();
			if(strlength<32){
				for(int i=0;i<32-strlength;i++){
					str="0"+str;
				}
			}
			return str;
		}
		
		/**
		 * 打印最终排重结果输出 
		 * @param args
		 */
		public void  print(int src[]){
			int number;
			int count=0;
			//将src数组中的数据还原成01字符串，找出相应位置的下标
			//循环src数组
			for(int i=1;i<src.length;i++){
				String s=intTOstring(src[i]);
				System.out.println(s);
				for(int j=0;j<s.length();j++){
					if(s.charAt(j)=='1'){
						number=(i-1)*32+j;
						count++;
						System.out.print(number+"  ");					
					}
					
				}
				System.out.println();
			}
			System.out.println("排重后的数据有"+count+"个");
			
		}
		
		
	}

测试代码：

package Eliminate;

import java.util.Random;

public class Test {
	private static  int SrcSize=80000;//最多用2^27大小的数组来表示
	private static int M =1000000;//数据量
	private static  int [] src=new int [SrcSize];
	private static  Random r=new Random();
	
	/**
	 * 主函数
	 * @param args
	 */
    public static void main(String args[]){
    	Long BeginTime=System.currentTimeMillis(); 
    	Eliminate e=new Eliminate();
		int temp;//随机数
		int number;//商
		int index;//数组下标
		int mod;//余数
		int x;
		for(int i=0;i<M;i++){
			    temp=r.nextInt(2000000);	
			    System.out.print(temp+" ");
			    System.out.println();
				number=temp/32;//取商			
				index=number+1;
				mod=temp%32;//取余			
				//将int型转化为32位的bit
				String str=e.intTOstring(src[index]);
				//修改mod处的bit值
				String newStr=new String();
                for(int j=0;j<32;j++){
                	char c=str.charAt(j);
                	if(j==mod){
                		newStr=newStr+1;             		
                	}else{
                		newStr=newStr+c;
                	}
                }
                //将修改后的字符串转化为int，放回数组
			    x=e.stringTOint(newStr);
			    src[index]=x;
			
		}
		e.print(src);
		Long EndTime=System.currentTimeMillis();
        System.out.println("排重时间为："+(EndTime-BeginTime));
	}
}

由于时间问题，没有对很庞大的数据进行测试。这里的数据量为100万，数组大小为8万

测试结果：

排重后的数据有786678个

排重时间为：9344

个人认为排重速度还是不错的~~是不是很有效哟~~

下面进行了一个小数据的测试，证明方法的有效和正确性：

测试结果：

随机产生数据50个：

83 62 85 29 98 21 22 36 84 13 50 33 90 39 17 26 90 52 15 56 54 85 19 48 75 46 93 36 52 14 8 44 68 36 24 22 63 30 54 18 28 95 92 98 92 67 36 49 63 51

00000000100001110111011010101110

8 13 14 15 17 18 19 21 22 24 26 28 29 30

01001001000010101111101010000011

33 36 39 44 46 48 49 50 51 52 54 56 62 63

00011000000100000001110000101101

67 68 75 83 84 85 90 92 93 95

00100000000000000000000000000000

排重后的数据有39个

排重时间为：15

以上就是M个int（数据量很大，大于内存）排重的分析及解决方案，大数据问题还有很多很多，还需要进一步的分析研究~~应该会很有趣哟~~

分享到：

MyHash实现

2012-10-28 00:58
浏览 981
评论(1)
分类:非技术
查看更多

1 楼 lixiongzhi_m 2013-01-21

   不知道是不是我理解上的差异。这个方法的核心是不是比如有0,1,2,3、、、、31这样的32个数字我们可以用三十二个1也就是来表示，将他转换成一个int即2^32-1来表示，然后存储到数组中。这样当我们要去用的时候取出来，逆过程：先将这个数取出来然后转换成32个01串，1代表这个位置有数字0代表没有。
    这样数组下标为0的一个数能存储0到31，为1的能存储32到63，依次列推。
    是不是大概是这样的思路？？？要是不是的话那么可能是我理解上的错误。要是是的话那么这边有几个问题？
    1.要是这M个数中有两个一样的，因为这些01串的位置都是唯一的，那么该如何去存储？
    2.还有这样一个情况，这个M不大，但其中有一个数非常的大，那么这个大数按这种方法存储到数组下标对应的数字也就比较大。那么这些不多的数按道理来讲是不是也要开个大数组来存储？

发表评论

您还没有登录,请您登录后再发表评论

最近访客更多访客>>

博主相关

文章分类

社区版块

存档分类

最新评论

M个int 排重分析

评论

发表评论

相关推荐

最近访客 更多访客>>

博主相关

文章分类

社区版块

存档分类

最新评论

M个int 排重分析

评论

发表评论

相关推荐

MyHash实现

TCP的三次握手及四次断开

2012上半年总结

黄金矿工游戏总结

画图板总结

java学习总结

java学习总结

关键字小结

最近访客更多访客>>