String Encoding in MapReduce

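The shuffle phase sorts data through many rounds of comparison. If every comparison first had to deserialize both records, performance would suffer badly. This is where RawComparator comes in: it compares the serialized byte streams directly, so sorting can be done without deserializing anything.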

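Hadoop serializes through the JDK's own encoder and decoder (DataOutputStream and DataInputStream), so a string written with writeUTF() follows the JDK's rules for turning characters into bytes. One character (a Java char) becomes 1, 2, or 3 bytes.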
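As a quick check of the 1/2/3-byte rule, the sketch below encodes single characters with writeUTF() and reports how many payload bytes each one produces. The class name `UtfWidthDemo` and the helper `payloadLength` are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class UtfWidthDemo {
    // Number of bytes writeUTF() emits for s, not counting the 2-byte length prefix.
    static int payloadLength(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size() - 2;
    }

    public static void main(String[] args) {
        System.out.println(payloadLength("a"));      // ASCII, U+0061: prints 1
        System.out.println(payloadLength("\u00e9")); // U+00E9: prints 2
        System.out.println(payloadLength("\u4e2d")); // U+4E2D: prints 3
    }
}
```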

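The byte stream begins with 2 bytes that hold the length of the encoded data, so the encoded data can be at most 65535 bytes long (this is the length after encoding, not the length of the string passed in).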
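The prefix and the 65535-byte limit can be observed with a small sketch like the one below. The names `UtfLengthPrefixDemo`, `encode`, `prefixLength`, `fitsInWriteUTF`, and `repeat` are hypothetical helpers, not JDK or Hadoop APIs.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.util.Arrays;

class UtfLengthPrefixDemo {
    // Raw bytes produced by writeUTF(), length prefix included.
    static byte[] encode(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Decodes the unsigned 16-bit big-endian length stored in the first two bytes.
    static int prefixLength(byte[] encoded) {
        return ((encoded[0] & 0xFF) << 8) | (encoded[1] & 0xFF);
    }

    // True if writeUTF() accepts s, false if it rejects it as too long.
    static boolean fitsInWriteUTF(String s) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return true;
        } catch (UTFDataFormatException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a string of n copies of c.
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        // One char, but the prefix records 3 because it counts encoded bytes.
        System.out.println(prefixLength(encode("\u4e2d")));
        // 21845 three-byte chars encode to exactly 65535 bytes and fit.
        System.out.println(fitsInWriteUTF(repeat('\u4e2d', 21845)));
        // One more char pushes the encoded length past 65535.
        System.out.println(fitsInWriteUTF(repeat('\u4e2d', 21846)));
    }
}
```

Note that the second string is far shorter than 65535 characters and is still rejected, because the limit applies to the encoded bytes.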

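What RawComparator sees is the byte stream produced under these rules, and it cannot be compared directly as-is: the first 2 bytes are the length prefix, so a plain byte-wise comparison would order records by encoded length before content. In my secondary sort I got around this with a workaround. The best approach is to imitate the encoding and comparator used by Text, and then implement a comparator for the custom key.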
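To make the problem concrete, here is a minimal sketch of comparing two writeUTF() records without deserializing them. It is not Hadoop code: `WriteUtfRawCompare` and its methods are hypothetical, and in a real job this logic would sit inside a RawComparator's `compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2)` method, probably delegating the payload comparison to `WritableComparator.compareBytes`. The sketch assumes each record is exactly one writeUTF() string.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class WriteUtfRawCompare {
    // Raw bytes produced by writeUTF(), length prefix included.
    static byte[] encode(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Reads the unsigned 16-bit big-endian length prefix starting at offset s.
    static int readLength(byte[] b, int s) {
        return ((b[s] & 0xFF) << 8) | (b[s + 1] & 0xFF);
    }

    // Compares two records in place: skip each prefix, then compare payload
    // bytes as unsigned values; a shorter payload that is a prefix sorts first.
    static int compare(byte[] b1, int s1, byte[] b2, int s2) {
        int n1 = readLength(b1, s1);
        int n2 = readLength(b2, s2);
        int common = Math.min(n1, n2);
        for (int i = 0; i < common; i++) {
            int x = b1[s1 + 2 + i] & 0xFF;
            int y = b2[s2 + 2 + i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        return n1 - n2;
    }

    // What goes wrong: comparing whole records, prefix included.
    static int naiveCompare(byte[] b1, byte[] b2) {
        int common = Math.min(b1.length, b2.length);
        for (int i = 0; i < common; i++) {
            int x = b1[i] & 0xFF;
            int y = b2[i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        return b1.length - b2.length;
    }

    public static void main(String[] args) {
        byte[] b = encode("b");
        byte[] ab = encode("ab");
        // Negative: the naive comparison puts "b" first because it is shorter.
        System.out.println(naiveCompare(b, ab));
        // Positive: skipping the prefix orders "ab" before "b".
        System.out.println(compare(b, 0, ab, 0));
    }
}
```

For strings that do not contain U+0000, this payload ordering should agree with String.compareTo(), since each char is encoded independently and longer encodings start with larger lead bytes. Treat that as a property to verify on your own keys rather than a guarantee.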

The rest of this post looks at why.

The source of DataOutputStream.writeUTF() is not long, so here it is in full:

    /**
     * Encodes the string as a (modified) UTF-8 byte stream and writes it to out.
     */
    static int writeUTF(String str, DataOutput out) throws IOException {
        int strlen = str.length();
        int utflen = 0;
        int c, count = 0;

        /* Compute the length of the encoded byte stream. */
        for (int i = 0; i < strlen; i++) {
            c = str.charAt(i);
            if ((c >= 0x0001) && (c <= 0x007F)) {   // ASCII: 1 char -> 1 byte
                utflen++;
            } else if (c > 0x07FF) {                // 1 char -> 3 bytes
                utflen += 3;
            } else {
                utflen += 2;                        // 1 char -> 2 bytes
            }
        }

        // The total encoded length must not exceed 65535 bytes.
        if (utflen > 65535)
            throw new UTFDataFormatException(
                "encoded string too long: " + utflen + " bytes");

        // Set up the buffer used for encoding.
        byte[] bytearr = null;
        if (out instanceof DataOutputStream) {
            DataOutputStream dos = (DataOutputStream)out;
            if(dos.bytearr == null || (dos.bytearr.length < (utflen+2)))
                dos.bytearr = new byte[(utflen*2) + 2];
            bytearr = dos.bytearr;
        } else {
            bytearr = new byte[utflen+2];
        }

        // The first 2 bytes hold the encoded length, utflen (high byte first).
        bytearr[count++] = (byte) ((utflen >>> 8) & 0xFF);
        bytearr[count++] = (byte) ((utflen >>> 0) & 0xFF);

        // Fast path: copy the leading run of chars in 0x01..0x7F (ASCII),
        // 1 char -> 1 byte, and stop at the first char outside that range.
        int i=0;
        for (i=0; i<strlen; i++) {
           c = str.charAt(i);
           if (!((c >= 0x0001) && (c <= 0x007F))) break;
           bytearr[count++] = (byte) c;
        }

        // General path for the rest of the string.
        for (;i < strlen; i++){
            c = str.charAt(i);
            if ((c >= 0x0001) && (c <= 0x007F)) {       // ASCII
                bytearr[count++] = (byte) c;
            } else if (c > 0x07FF) {                    // non-ASCII, needs 3 bytes
                bytearr[count++] = (byte) (0xE0 | ((c >> 12) & 0x0F));
                bytearr[count++] = (byte) (0x80 | ((c >>  6) & 0x3F));
                bytearr[count++] = (byte) (0x80 | ((c >>  0) & 0x3F));
            } else {                                    // everything else (including 0x0000) needs 2 bytes
                bytearr[count++] = (byte) (0xC0 | ((c >>  6) & 0x1F));
                bytearr[count++] = (byte) (0x80 | ((c >>  0) & 0x3F));
            }
        }
        out.write(bytearr, 0, utflen+2);
        return utflen + 2;
    }
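To see the bit manipulation at work, the sketch below applies the three-byte branch by hand to U+4E2D and then checks two places where writeUTF() output differs from standard UTF-8: U+0000, which takes the two-byte branch, and characters outside the BMP, whose two surrogate chars are each encoded as three bytes. `ModifiedUtf8Demo` and its helpers are illustrative names only.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {
    // The three-byte branch of writeUTF(), applied to a single char.
    static byte[] encodeThreeByte(char c) {
        return new byte[] {
            (byte) (0xE0 | ((c >> 12) & 0x0F)), // 1110xxxx: top 4 bits
            (byte) (0x80 | ((c >> 6) & 0x3F)),  // 10xxxxxx: middle 6 bits
            (byte) (0x80 | (c & 0x3F))          // 10xxxxxx: low 6 bits
        };
    }

    // Bytes writeUTF() emits for s, with the 2-byte length prefix removed.
    static byte[] writeUtfPayload(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Upper-case hex dump of a byte array.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encodeThreeByte('\u4e2d')));                    // prints E4B8AD
        System.out.println(hex("\u4e2d".getBytes(StandardCharsets.UTF_8)));    // prints E4B8AD
        System.out.println(hex(writeUtfPayload("\u0000")));                    // prints C080
        System.out.println(hex("\u0000".getBytes(StandardCharsets.UTF_8)));    // prints 00
        // U+1F600 as a surrogate pair: 6 bytes from writeUTF(), 4 in standard UTF-8.
        System.out.println(writeUtfPayload("\uD83D\uDE00").length);
        System.out.println("\uD83D\uDE00".getBytes(StandardCharsets.UTF_8).length);
    }
}
```

For ordinary BMP text the two encodings agree byte for byte. The differences only matter if keys may contain U+0000 or supplementary characters.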
