Hadoop序列化与Writable接口(一)

序列化序列化（serialization）是指将结构化的对象转化为字节流，以便在网络上传输或者写入到硬盘进行永久存储；相对的反序列化（deserialization）是指将字节流转回到结构化对象的过程。在分布式系统中进程将对象序列化为字节流，通过网络传输到另一进
序列化序列化（serialization）是指将结构化的对象转化为字节流，以便在网络上传输或者写入到硬盘进行永久存储；相对的反序列化（deserialization）是指将字节流转回到结构化对象的过程。
在分布式系统中进程将对象序列化为字节流，通过网络传输到另一进程，另一进程接收到字节流，通过反序列化转回到结构化对象，以达到进程间通信。在hadoop中，mapper，combiner，reducer等阶段之间的通信都需要使用序列化与反序列化技术。举例来说，mapper产生的中间结果（）需要写入到本地硬盘，这是序列化过程（将结构化对象转化为字节流，并写入硬盘），而reducer阶段读取mapper的中间结果的过程则是一个反序列化过程（读取硬盘上存储的字节流文件，并转回为结构化对象），需要注意的是，能够在网络上传输的只能是字节流，mapper的中间结果在不同主机间洗牌时，对象将经历序列化和反序列化两个过程。
序列化是hadoop核心的一部分，在hadoop中，位于org.apache.hadoop.io包中的writable接口是hadoop序列化格式的实现。
writable接口hadoop writable接口是基于datainput和dataoutput实现的序列化协议，紧凑（高效使用存储空间），快速（读写数据、序列化与反序列化的开销小）。hadoop中的键（key）和值（value）必须是实现了writable接口的对象（键还必须实现writablecomparable，以便进行排序）。
以下是hadoop（使用的是hadoop 1.1.2）中writable接口的声明：
package org.apache.hadoop.io;import java.io.dataoutput;import java.io.datainput;import java.io.ioexception;public interface writable { /** * serialize the fields of this object to out. * * @param out dataouput to serialize this object into. * @throws ioexception */ void write(dataoutput out) throws ioexception; /** * deserialize the fields of this object from in. * * for efficiency, implementations should attempt to re-use storage in the * existing object where possible.
* * @param in datainput to deseriablize this object from. * @throws ioexception */ void readfields(datainput in) throws ioexception;}
writable类hadoop自身提供了多种具体的writable类，包含了常见的java基本类型（boolean、byte、short、int、float、long和double等）和集合类型（byteswritable、arraywritable和mapwritable等）。这些类型都位于org.apache.hadoop.io包中。
(图片来源：safaribooksonline.com)
定制writable类虽然hadoop内建了多种writable类提供用户选择，hadoop对java基本类型的包装writable类实现的rawcomparable接口，使得这些对象不需要反序列化过程，便可以在字节流层面进行排序，从而大大缩短了比较的时间开销，但是当我们需要更加复杂的对象时，hadoop的内建writable类就不能满足我们的需求了(需要注意的是hadoop提供的writable集合类型并没有实现rawcomparable接口，因此也不满足我们的需要)，这时我们就需要定制自己的writable类，特别将其作为键（key）的时候更应该如此，以求达到更高效的存储和快速的比较。
下面的实例展示了如何定制一个writable类，一个定制的writable类首先必须实现writable或者writablecomparable接口，然后为定制的writable类编写write(dataoutput out)和readfields(datainput in)方法，来控制定制的writable类如何转化为字节流（write方法）和如何从字节流转回为writable对象。
package com.yoyzhou.weibo;import java.io.datainput;import java.io.dataoutput;import java.io.ioexception;import org.apache.hadoop.io.vlongwritable;import org.apache.hadoop.io.writable;/** *this mywritable class demonstrates how to write a custom writable class * **/public class mywritable implements writable{ private vlongwritable field1; private vlongwritable field2; public mywritable(){ this.set(new vlongwritable(), new vlongwritable()); } public mywritable(vlongwritable fld1, vlongwritable fld2){ this.set(fld1, fld2); } public void set(vlongwritable fld1, vlongwritable fld2){ //make sure the smaller field is always put as field1 if(fld1.get() o is a mywritable with the same values. */ @override public boolean equals(object o) { if (!(o instanceof mywritable)) return false; mywritable other = (mywritable)o; return field1.equals(other.field1) && field2.equals(other.field2); } @override public int hashcode(){ return field1.hashcode() * 163 + field2.hashcode(); } @override public string tostring() { return field1.tostring() + \t + field2.tostring(); } }
未完待续，下一篇中将介绍writable对象序列化为字节流时占用的字节长度以及其字节序列的构成。
参考资料tom white, hadoop: the definitive guide, 3rd edition
---to be continued---
原文地址：hadoop序列化与writable接口(一), 感谢原作者分享。

Hadoop序列化与Writable接口(一)

推荐信息