This document discusses Hadoop I/O: data integrity, serialization, compression, and file-based data structures.
It explains that Hadoop ensures data integrity by computing a checksum for each chunk of data it stores (by default, a CRC-32 checksum for every 512 bytes). When data is read, the checksum is recomputed and compared against the stored value; a mismatch raises a ChecksumException, signalling corruption. Compression is also discussed, with emphasis on splittability: a splittable format such as bzip2 lets MapReduce process a compressed file in parallel, one split per map task, whereas a non-splittable format such as gzip forces the entire file through a single mapper.
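As a concrete illustration of the checksum mechanism, here is a minimal sketch using Hadoop's public FileSystem API. By default every read verifies the stored checksums; FileSystem.setVerifyChecksum(false) turns verification off, which can be useful for salvaging what remains of a corrupt file. The class name and file path are hypothetical.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ChecksummedRead {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path file = new Path("/data/example.txt"); // hypothetical path
    FileSystem fs = FileSystem.get(file.toUri(), conf);

    // Default behavior: checksums are recomputed on read, and a
    // mismatch throws org.apache.hadoop.fs.ChecksumException.
    try (FSDataInputStream in = fs.open(file)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }

    // Disable verification to read the raw bytes regardless of
    // corruption (e.g., to recover a damaged file).
    fs.setVerifyChecksum(false);
    try (FSDataInputStream in = fs.open(file)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}
```

For compression, the standard Hadoop pattern is to let CompressionCodecFactory infer the codec from the file extension (.gz, .bz2, and so on) and wrap the raw input stream in a decompressing one. The sketch below assumes a hypothetical gzip-compressed input path; the factory and codec APIs are real Hadoop classes.

```java
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class DecompressFile {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path input = new Path("/data/example.txt.gz"); // hypothetical path
    FileSystem fs = FileSystem.get(input.toUri(), conf);

    // Map the file extension to a registered codec.
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(input);
    if (codec == null) {
      System.err.println("No codec found for " + input);
      System.exit(1);
    }

    // Wrap the raw stream in a decompressing stream and copy to stdout.
    try (InputStream in = codec.createInputStream(fs.open(input))) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}
```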
The document describes how Hadoop uses serialization to convert objects to byte streams for network transmission (interprocess communication over RPC) and for persistent storage. It introduces the Writable interface, the contract implemented by Hadoop's own serializable types, and surveys the built-in Writable classes such as IntWritable and Text. Finally, it mentions alternative serialization frameworks, such as Apache Avro, Thrift, and Protocol Buffers, which define types in a language-neutral schema or interface definition language rather than in code.
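To make the Writable contract concrete, here is a sketch of a custom type built on that interface: write() serializes the fields to a DataOutput, and readFields() reads them back in exactly the same order. The class name and field choices (a string label paired with an integer count) are hypothetical; the interface and the built-in Text and IntWritable types are part of the real org.apache.hadoop.io API.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Hypothetical pair type: a string label with an integer count.
public class LabelCountWritable implements Writable {
  private final Text label = new Text();
  private final IntWritable count = new IntWritable();

  public void set(String l, int c) {
    label.set(l);
    count.set(c);
  }

  @Override
  public void write(DataOutput out) throws IOException {
    // Delegate to the built-in Writables; the field order here
    // must match readFields() exactly.
    label.write(out);
    count.write(out);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    label.readFields(in);
    count.readFields(in);
  }

  @Override
  public String toString() {
    return label + "\t" + count;
  }
}
```

Note that to serve as a MapReduce key, a type like this would also need to implement WritableComparable, since keys are sorted during the shuffle.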