PROGRAMMING LANGUAGES/HADOOP
Hadoop: The Definitive Guide
ISBN: 978-1-491-90163-2
US $49.99 CAN $57.99
“Now you have the opportunity to learn about Hadoop from a master—not only of the
technology, but also of common sense and plain talk.”
—Doug Cutting, Cloudera
Twitter: @oreillymedia
facebook.com/oreilly
Get ready to unlock the power of your data. With the fourth edition of
this comprehensive guide, you’ll learn how to build and maintain reliable,
scalable, distributed systems with Apache Hadoop. This book is ideal for
programmers looking to analyze datasets of any size, and for administrators
who want to set up and run Hadoop clusters.
Using Hadoop 2 exclusively, author Tom White presents new chapters
on YARN and several Hadoop-related projects such as Parquet, Flume,
Crunch, and Spark. You’ll learn about recent changes to Hadoop, and
explore new case studies on Hadoop’s role in healthcare systems and
genomics data processing.
■ Learn fundamental components such as MapReduce, HDFS, and YARN
■ Explore MapReduce in depth, including steps for developing applications with it
■ Set up and maintain a Hadoop cluster running HDFS and MapReduce on YARN
■ Learn two data formats: Avro for data serialization and Parquet for nested data
■ Use data ingestion tools such as Flume (for streaming data) and Sqoop (for bulk data transfer)
■ Understand how high-level data processing tools like Pig, Hive, Crunch, and Spark work with Hadoop
■ Learn the HBase distributed database and the ZooKeeper distributed configuration service
Tom White, an engineer at Cloudera and member of the Apache Software
Foundation, has been an Apache Hadoop committer since 2007. He has written
numerous articles for oreilly.com, java.net, and IBM’s developerWorks, and speaks
regularly about Hadoop at industry conferences.
Hadoop: The Definitive Guide
FOURTH EDITION
Tom White
STORAGE AND ANALYSIS AT INTERNET SCALE
4th Edition, Revised & Updated
Tom White
FOURTH EDITION
Hadoop: The Definitive Guide
Hadoop: The Definitive Guide, Fourth Edition
by Tom White
Copyright © 2015 Tom White. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/
institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editors: Mike Loukides and Meghan Blanchette
Production Editor: Matthew Hacker
Copyeditor: Jasmine Kwityn
Proofreader: Rachel Head
Indexer: Lucie Haskins
Cover Designer: Ellie Volckhausen
Interior Designer: David Futato
Illustrator: Rebecca Demarest
June 2009: First Edition
October 2010: Second Edition
May 2012: Third Edition
April 2015: Fourth Edition
Revision History for the Fourth Edition:
2015-03-19: First release
2015-04-17: Second release
See http://oreilly.com/catalog/errata.csp?isbn=9781491901632 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Hadoop: The Definitive Guide, the cover
image of an African elephant, and related trade dress are trademarks of O’Reilly Media, Inc.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as
trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark
claim, the designations have been printed in caps or initial caps.
While the publisher and the author have used good faith efforts to ensure that the information and instruc‐
tions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors
or omissions, including without limitation responsibility for damages resulting from the use of or reliance
on this work. Use of the information and instructions contained in this work is at your own risk. If any code
samples or other technology this work contains or describes is subject to open source licenses or the intel‐
lectual property rights of others, it is your responsibility to ensure that your use thereof complies with such
licenses and/or rights.
ISBN: 978-1-491-90163-2
[LSI]
For Eliane, Emilia, and Lottie
Table of Contents
Foreword. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix
Part I. Hadoop Fundamentals
1. Meet Hadoop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Data! 3
Data Storage and Analysis 5
Querying All Your Data 6
Beyond Batch 6
Comparison with Other Systems 8
Relational Database Management Systems 8
Grid Computing 10
Volunteer Computing 11
A Brief History of Apache Hadoop 12
What’s in This Book? 15
2. MapReduce. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
A Weather Dataset 19
Data Format 19
Analyzing the Data with Unix Tools 21
Analyzing the Data with Hadoop 22
Map and Reduce 22
Java MapReduce 24
Scaling Out 30
Data Flow 30
Combiner Functions 34
Running a Distributed MapReduce Job 37
Hadoop Streaming 37
Ruby 37
Python 40
3. The Hadoop Distributed Filesystem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
The Design of HDFS 43
HDFS Concepts 45
Blocks 45
Namenodes and Datanodes 46
Block Caching 47
HDFS Federation 48
HDFS High Availability 48
The Command-Line Interface 50
Basic Filesystem Operations 51
Hadoop Filesystems 53
Interfaces 54
The Java Interface 56
Reading Data from a Hadoop URL 57
Reading Data Using the FileSystem API 58
Writing Data 61
Directories 63
Querying the Filesystem 63
Deleting Data 68
Data Flow 69
Anatomy of a File Read 69
Anatomy of a File Write 72
Coherency Model 74
Parallel Copying with distcp 76
Keeping an HDFS Cluster Balanced 77
4. YARN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Anatomy of a YARN Application Run 80
Resource Requests 81
Application Lifespan 82
Building YARN Applications 82
YARN Compared to MapReduce 1 83
Scheduling in YARN 85
Scheduler Options 86
Capacity Scheduler Configuration 88
Fair Scheduler Configuration 90
Delay Scheduling 94
Dominant Resource Fairness 95
Further Reading 96
5. Hadoop I/O. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Data Integrity 97
Data Integrity in HDFS 98
LocalFileSystem 99
ChecksumFileSystem 99
Compression 100
Codecs 101
Compression and Input Splits 105
Using Compression in MapReduce 107
Serialization 109
The Writable Interface 110
Writable Classes 113
Implementing a Custom Writable 121
Serialization Frameworks 126
File-Based Data Structures 127
SequenceFile 127
MapFile 135
Other File Formats and Column-Oriented Formats 136
Part II. MapReduce
6. Developing a MapReduce Application. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
The Configuration API 141
Combining Resources 143
Variable Expansion 143
Setting Up the Development Environment 144
Managing Configuration 146
GenericOptionsParser, Tool, and ToolRunner 148
Writing a Unit Test with MRUnit 152
Mapper 153
Reducer 156
Running Locally on Test Data 156
Running a Job in a Local Job Runner 157
Testing the Driver 158
Running on a Cluster 160
Packaging a Job 160
Launching a Job 162
The MapReduce Web UI 165
Retrieving the Results 167
Debugging a Job 168
Hadoop Logs 172
Remote Debugging 174
Tuning a Job 175
Profiling Tasks 175
MapReduce Workflows 177
Decomposing a Problem into MapReduce Jobs 177
JobControl 178
Apache Oozie 179
7. How MapReduce Works. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Anatomy of a MapReduce Job Run 185
Job Submission 186
Job Initialization 187
Task Assignment 188
Task Execution 189
Progress and Status Updates 190
Job Completion 192
Failures 193
Task Failure 193
Application Master Failure 194
Node Manager Failure 195
Resource Manager Failure 196
Shuffle and Sort 197
The Map Side 197
The Reduce Side 198
Configuration Tuning 201
Task Execution 203
The Task Execution Environment 203
Speculative Execution 204
Output Committers 206
8. MapReduce Types and Formats. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
MapReduce Types 209
The Default MapReduce Job 214
Input Formats 220
Input Splits and Records 220
Text Input 232
Binary Input 236
Multiple Inputs 237
Database Input (and Output) 238
Output Formats 238
Text Output 239
Binary Output 239
Multiple Outputs 240
Lazy Output 245
Database Output 245
9. MapReduce Features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
Counters 247
Built-in Counters 247
User-Defined Java Counters 251
User-Defined Streaming Counters 255
Sorting 255
Preparation 256
Partial Sort 257
Total Sort 259
Secondary Sort 262
Joins 268
Map-Side Joins 269
Reduce-Side Joins 270
Side Data Distribution 273
Using the Job Configuration 273
Distributed Cache 274
MapReduce Library Classes 279
Part III. Hadoop Operations
10. Setting Up a Hadoop Cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
Cluster Specification 284
Cluster Sizing 285
Network Topology 286
Cluster Setup and Installation 288
Installing Java 288
Creating Unix User Accounts 288
Installing Hadoop 289
Configuring SSH 289
Configuring Hadoop 290
Formatting the HDFS Filesystem 290
Starting and Stopping the Daemons 290
Creating User Directories 292
Hadoop Configuration 292
Configuration Management 293
Environment Settings 294
Important Hadoop Daemon Properties 296
Hadoop Daemon Addresses and Ports 304
Other Hadoop Properties 307
Security 309
Kerberos and Hadoop 309
Delegation Tokens 312
Other Security Enhancements 313
Benchmarking a Hadoop Cluster 314
Hadoop Benchmarks 314
User Jobs 316
11. Administering Hadoop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
HDFS 317
Persistent Data Structures 317
Safe Mode 322
Audit Logging 324
Tools 325
Monitoring 330
Logging 330
Metrics and JMX 331
Maintenance 332
Routine Administration Procedures 332
Commissioning and Decommissioning Nodes 334
Upgrades 337
Part IV. Related Projects
12. Avro. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
Avro Data Types and Schemas 346
In-Memory Serialization and Deserialization 349
The Specific API 351
Avro Datafiles 352
Interoperability 354
Python API 354
Avro Tools 355
Schema Resolution 355
Sort Order 358
Avro MapReduce 359
Sorting Using Avro MapReduce 363
Avro in Other Languages 365
13. Parquet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
Data Model 368
Nested Encoding 370
Parquet File Format 370
Parquet Configuration 372
Writing and Reading Parquet Files 373
Avro, Protocol Buffers, and Thrift 375
Parquet MapReduce 377
14. Flume. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
Installing Flume 381
An Example 382
Transactions and Reliability 384
Batching 385
The HDFS Sink 385
Partitioning and Interceptors 387
File Formats 387
Fan Out 388
Delivery Guarantees 389
Replicating and Multiplexing Selectors 390
Distribution: Agent Tiers 390
Delivery Guarantees 393
Sink Groups 395
Integrating Flume with Applications 398
Component Catalog 399
Further Reading 400
15. Sqoop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
Getting Sqoop 401
Sqoop Connectors 403
A Sample Import 403
Text and Binary File Formats 406
Generated Code 407
Additional Serialization Systems 407
Imports: A Deeper Look 408
Controlling the Import 410
Imports and Consistency 411
Incremental Imports 411
Direct-Mode Imports 411
Working with Imported Data 412
Imported Data and Hive 413
Importing Large Objects 415
Performing an Export 417
Exports: A Deeper Look 419
Exports and Transactionality 420
Exports and SequenceFiles 421
Further Reading 422
16. Pig. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
Installing and Running Pig 424
Execution Types 424
Running Pig Programs 426
Grunt 426
Pig Latin Editors 427
An Example 427
Generating Examples 429
Comparison with Databases 430
Pig Latin 432
Structure 432
Statements 433
Expressions 438
Types 439
Schemas 441
Functions 445
Macros 447
User-Defined Functions 448
A Filter UDF 448
An Eval UDF 452
A Load UDF 453
Data Processing Operators 456
Loading and Storing Data 456
Filtering Data 457
Grouping and Joining Data 459
Sorting Data 465
Combining and Splitting Data 466
Pig in Practice 466
Parallelism 467
Anonymous Relations 467
Parameter Substitution 467
Further Reading 469
17. Hive. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
Installing Hive 472
The Hive Shell 473
An Example 474
Running Hive 475
Configuring Hive 475
Hive Services 478
The Metastore 480
Comparison with Traditional Databases 482
Schema on Read Versus Schema on Write 482
Updates, Transactions, and Indexes 483
SQL-on-Hadoop Alternatives 484
HiveQL 485
Data Types 486
Operators and Functions 488
Tables 489
Managed Tables and External Tables 490
Partitions and Buckets 491
Storage Formats 496
Importing Data 500
Altering Tables 502
Dropping Tables 502
Querying Data 503
Sorting and Aggregating 503
MapReduce Scripts 503
Joins 505
Subqueries 508
Views 509
User-Defined Functions 510
Writing a UDF 511
Writing a UDAF 513
Further Reading 518
18. Crunch. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519
An Example 520
The Core Crunch API 523
Primitive Operations 523
Types 528
Sources and Targets 531
Functions 533
Materialization 535
Pipeline Execution 538
Running a Pipeline 538
Stopping a Pipeline 539
Inspecting a Crunch Plan 540
Iterative Algorithms 543
Checkpointing a Pipeline 545
Crunch Libraries 545
Further Reading 548
19. Spark. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549
Installing Spark 550
An Example 550
Spark Applications, Jobs, Stages, and Tasks 552
A Scala Standalone Application 552
A Java Example 554
A Python Example 555
Resilient Distributed Datasets 556
Creation 556
Transformations and Actions 557
Persistence 560
Serialization 562
Shared Variables 564
Broadcast Variables 564
Accumulators 564
Anatomy of a Spark Job Run 565
Job Submission 565
DAG Construction 566
Task Scheduling 569
Task Execution 570
Executors and Cluster Managers 570
Spark on YARN 571
Further Reading 574
20. HBase. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575
HBasics 575
Backdrop 576
Concepts 576
Whirlwind Tour of the Data Model 576
Implementation 578
Installation 581
Test Drive 582
Clients 584
Java 584
MapReduce 587
REST and Thrift 589
Building an Online Query Application 589
Schema Design 590
Loading Data 591
Online Queries 594
HBase Versus RDBMS 597
Successful Service 598
HBase 599
Praxis 600
HDFS 600
UI 601
Metrics 601
Counters 601
Further Reading 601
21. ZooKeeper. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603
Installing and Running ZooKeeper 604
An Example 606
Group Membership in ZooKeeper 606
Creating the Group 607
Joining a Group 609
Listing Members in a Group 610
Deleting a Group 612
The ZooKeeper Service 613
Data Model 614
Operations 616
Implementation 620
Consistency 621
Sessions 623
States 625
Building Applications with ZooKeeper 627
A Configuration Service 627
The Resilient ZooKeeper Application 630
A Lock Service 634
More Distributed Data Structures and Protocols 636
ZooKeeper in Production 637
Resilience and Performance 637
Configuration 639
Further Reading 640
Part V. Case Studies
22. Composable Data at Cerner. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 643
From CPUs to Semantic Integration 643
Enter Apache Crunch 644
Building a Complete Picture 644
Integrating Healthcare Data 647
Composability over Frameworks 650
Moving Forward 651
23. Biological Data Science: Saving Lives with Software. . . . . . . . . . . . . . . . . . . . . . . . . . . . 653
The Structure of DNA 655
The Genetic Code: Turning DNA Letters into Proteins 656
Thinking of DNA as Source Code 657
The Human Genome Project and Reference Genomes 659
Sequencing and Aligning DNA 660
ADAM, A Scalable Genome Analysis Platform 661
Literate programming with the Avro interface description language (IDL) 662
Column-oriented access with Parquet 663
A simple example: k-mer counting using Spark and ADAM 665
From Personalized Ads to Personalized Medicine 667
Join In 668
24. Cascading. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669
Fields, Tuples, and Pipes 670
Operations 673
Taps, Schemes, and Flows 675
Cascading in Practice 676
Flexibility 679
Hadoop and Cascading at ShareThis 680
Summary 684
A. Installing Apache Hadoop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685
B. Cloudera’s Distribution Including Apache Hadoop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 691
C. Preparing the NCDC Weather Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 693
D. The Old and New Java MapReduce APIs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 697
Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 701
Foreword
Hadoop got its start in Nutch. A few of us were attempting to build an open source web
search engine and having trouble managing computations running on even a handful
of computers. Once Google published its GFS and MapReduce papers, the route became
clear. They’d devised systems to solve precisely the problems we were having with Nutch.
So we started, two of us, half-time, to try to re-create these systems as a part of Nutch.
We managed to get Nutch limping along on 20 machines, but it soon became clear that
to handle the Web’s massive scale, we’d need to run it on thousands of machines, and
moreover, that the job was bigger than two half-time developers could handle.
Around that time, Yahoo! got interested, and quickly put together a team that I joined.
We split off the distributed computing part of Nutch, naming it Hadoop. With the help
of Yahoo!, Hadoop soon grew into a technology that could truly scale to the Web.
In 2006, Tom White started contributing to Hadoop. I already knew Tom through an
excellent article he’d written about Nutch, so I knew he could present complex ideas in
clear prose. I soon learned that he could also develop software that was as pleasant to
read as his prose.
From the beginning, Tom’s contributions to Hadoop showed his concern for users and
for the project. Unlike most open source contributors, Tom is not primarily interested
in tweaking the system to better meet his own needs, but rather in making it easier for
anyone to use.
Initially, Tom specialized in making Hadoop run well on Amazon’s EC2 and S3 services.
Then he moved on to tackle a wide variety of problems, including improving the Map‐
Reduce APIs, enhancing the website, and devising an object serialization framework.
In all cases, Tom presented his ideas precisely. In short order, Tom earned the role of
Hadoop committer and soon thereafter became a member of the Hadoop Project Man‐
agement Committee.
Tom is now a respected senior member of the Hadoop developer community. Though
he’s an expert in many technical corners of the project, his specialty is making Hadoop
easier to use and understand.
Given this, I was very pleased when I learned that Tom intended to write a book about
Hadoop. Who could be better qualified? Now you have the opportunity to learn about
Hadoop from a master—not only of the technology, but also of common sense and
plain talk.
—Doug Cutting, April 2009
Shed in the Yard, California
1. Alex Bellos, “The science of fun,” The Guardian, May 31, 2008.
2. It was added to the Oxford English Dictionary in 2013.
Preface
Martin Gardner, the mathematics and science writer, once said in an interview:
Beyond calculus, I am lost. That was the secret of my column’s success. It took me so long
to understand what I was writing about that I knew how to write in a way most readers
would understand.1
In many ways, this is how I feel about Hadoop. Its inner workings are complex, resting
as they do on a mixture of distributed systems theory, practical engineering, and com‐
mon sense. And to the uninitiated, Hadoop can appear alien.
But it doesn’t need to be like this. Stripped to its core, the tools that Hadoop provides
for working with big data are simple. If there’s a common theme, it is about raising the
level of abstraction—to create building blocks for programmers who have lots of data
to store and analyze, and who don’t have the time, the skill, or the inclination to become
distributed systems experts to build the infrastructure to handle it.
With such a simple and generally applicable feature set, it seemed obvious to me when
I started using it that Hadoop deserved to be widely used. However, at the time (in early
2006), setting up, configuring, and writing programs to use Hadoop was an art. Things
have certainly improved since then: there is more documentation, there are more ex‐
amples, and there are thriving mailing lists to go to when you have questions. And yet
the biggest hurdle for newcomers is understanding what this technology is capable of,
where it excels, and how to use it. That is why I wrote this book.
The Apache Hadoop community has come a long way. Since the publication of the first
edition of this book, the Hadoop project has blossomed. “Big data” has become a house‐
hold term.2
In this time, the software has made great leaps in adoption, performance,
reliability, scalability, and manageability. The number of things being built and run on
the Hadoop platform has grown enormously. In fact, it’s difficult for one person to keep
track. To gain even wider adoption, I believe we need to make Hadoop even easier to
use. This will involve writing more tools; integrating with even more systems; and writ‐
ing new, improved APIs. I’m looking forward to being a part of this, and I hope this
book will encourage and enable others to do so, too.
Administrative Notes
During discussion of a particular Java class in the text, I often omit its package name to
reduce clutter. If you need to know which package a class is in, you can easily look it up
in the Java API documentation for Hadoop (linked to from the Apache Hadoop home
page), or the relevant project. Or if you’re using an integrated development environment
(IDE), its auto-complete mechanism can help find what you’re looking for.
Similarly, although it deviates from usual style guidelines, program listings that import
multiple classes from the same package may use the asterisk wildcard character to save
space (for example, import org.apache.hadoop.io.*).
The sample programs in this book are available for download from the book’s website.
You will also find instructions there for obtaining the datasets that are used in examples
throughout the book, as well as further notes for running the programs in the book and
links to updates, additional resources, and my blog.
What’s New in the Fourth Edition?
The fourth edition covers Hadoop 2 exclusively. The Hadoop 2 release series is the
current active release series and contains the most stable versions of Hadoop.
There are new chapters covering YARN (Chapter 4), Parquet (Chapter 13), Flume
(Chapter 14), Crunch (Chapter 18), and Spark (Chapter 19). There’s also a new section
to help readers navigate different pathways through the book (“What’s in This Book?”
on page 15).
This edition includes two new case studies (Chapters 22 and 23): one on how Hadoop
is used in healthcare systems, and another on using Hadoop technologies for genomics
data processing. Case studies from the previous editions can now be found online.
Many corrections, updates, and improvements have been made to existing chapters to
bring them up to date with the latest releases of Hadoop and its related projects.
What’s New in the Third Edition?
The third edition covers the 1.x (formerly 0.20) release series of Apache Hadoop, as well
as the newer 0.22 and 2.x (formerly 0.23) series. With a few exceptions, which are noted
in the text, all the examples in this book run against these versions.
This edition uses the new MapReduce API for most of the examples. Because the old
API is still in widespread use, it continues to be discussed in the text alongside the new
API, and the equivalent code using the old API can be found on the book’s website.
The major change in Hadoop 2.0 is the new MapReduce runtime, MapReduce 2, which
is built on a new distributed resource management system called YARN. This edition
includes new sections covering MapReduce on YARN: how it works (Chapter 7) and
how to run it (Chapter 10).
There is more MapReduce material, too, including development practices such as pack‐
aging MapReduce jobs with Maven, setting the user’s Java classpath, and writing tests
with MRUnit (all in Chapter 6). In addition, there is more depth on features such as
output committers and the distributed cache (both in Chapter 9), as well as task memory
monitoring (Chapter 10). There is a new section on writing MapReduce jobs to process
Avro data (Chapter 12), and one on running a simple MapReduce workflow in Oozie
(Chapter 6).
The chapter on HDFS (Chapter 3) now has introductions to high availability, federation,
and the new WebHDFS and HttpFS filesystems.
The chapters on Pig, Hive, Sqoop, and ZooKeeper have all been expanded to cover the
new features and changes in their latest releases.
In addition, numerous corrections and improvements have been made throughout the
book.
What’s New in the Second Edition?
The second edition has two new chapters on Sqoop and Hive (Chapters 15 and 17,
respectively), a new section covering Avro (in Chapter 12), an introduction to the new
security features in Hadoop (in Chapter 10), and a new case study on analyzing massive
network graphs using Hadoop.
This edition continues to describe the 0.20 release series of Apache Hadoop, because
this was the latest stable release at the time of writing. New features from later releases
are occasionally mentioned in the text, however, with reference to the version that they
were introduced in.
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to commands and
command-line options and to program elements such as variable or function
names, databases, data types, environment variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values deter‐
mined by context.
This icon signifies a general note.
This icon signifies a tip or suggestion.
This icon indicates a warning or caution.
Using Code Examples
Supplemental material (code, examples, exercises, etc.) is available for download at this
book’s website and on GitHub.
This book is here to help you get your job done. In general, you may use the code in
this book in your programs and documentation. You do not need to contact us for
permission unless you’re reproducing a significant portion of the code. For example,
writing a program that uses several chunks of code from this book does not require
permission. Selling or distributing a CD-ROM of examples from O’Reilly books does
require permission. Answering a question by citing this book and quoting example code
does not require permission. Incorporating a significant amount of example code from
this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title,
author, publisher, and ISBN. For example: “Hadoop: The Definitive Guide, Fourth Ed‐
ition, by Tom White (O’Reilly). Copyright 2015 Tom White, 978-1-491-90163-2.”
If you feel your use of code examples falls outside fair use or the permission given here,
feel free to contact us at permissions@oreilly.com.
Safari® Books Online
Safari Books Online is an on-demand digital library that
delivers expert content in both book and video form from
the world’s leading authors in technology and business.
Technology professionals, software developers, web designers, and business and crea‐
tive professionals use Safari Books Online as their primary resource for research, prob‐
lem solving, learning, and certification training.
Safari Books Online offers a range of plans and pricing for enterprise, government,
education, and individuals.
Members have access to thousands of books, training videos, and prepublication manu‐
scripts in one fully searchable database from publishers like O’Reilly Media, Prentice
Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit
Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM
Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill,
Jones & Bartlett, Course Technology, and hundreds more. For more information about
Safari Books Online, please visit us online.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional
information. You can access this page at http://bit.ly/hadoop_tdg_4e.
To comment or ask technical questions about this book, send email to
bookquestions@oreilly.com.
For more information about our books, courses, conferences, and news, see our website
at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Acknowledgments
I have relied on many people, both directly and indirectly, in writing this book. I would
like to thank the Hadoop community, from whom I have learned, and continue to learn,
a great deal.
In particular, I would like to thank Michael Stack and Jonathan Gray for writing the
chapter on HBase. Thanks also go to Adrian Woodhead, Marc de Palol, Joydeep Sen
Sarma, Ashish Thusoo, Andrzej Białecki, Stu Hood, Chris K. Wensel, and Owen
O’Malley for contributing case studies.
I would like to thank the following reviewers who contributed many helpful suggestions
and improvements to my drafts: Raghu Angadi, Matt Biddulph, Christophe Bisciglia,
Ryan Cox, Devaraj Das, Alex Dorman, Chris Douglas, Alan Gates, Lars George, Patrick
Hunt, Aaron Kimball, Peter Krey, Hairong Kuang, Simon Maxen, Olga Natkovich,
Benjamin Reed, Konstantin Shvachko, Allen Wittenauer, Matei Zaharia, and Philip
Zeyliger. Ajay Anand kept the review process flowing smoothly. Philip (“flip”) Kromer
kindly helped me with the NCDC weather dataset featured in the examples in this book.
Special thanks to Owen O’Malley and Arun C. Murthy for explaining the intricacies of
the MapReduce shuffle to me. Any errors that remain are, of course, to be laid at my
door.
For the second edition, I owe a debt of gratitude for the detailed reviews and feedback
from Jeff Bean, Doug Cutting, Glynn Durham, Alan Gates, Jeff Hammerbacher, Alex
Kozlov, Ken Krugler, Jimmy Lin, Todd Lipcon, Sarah Sproehnle, Vinithra Varadharajan,
and Ian Wrigley, as well as all the readers who submitted errata for the first edition. I
would also like to thank Aaron Kimball for contributing the chapter on Sqoop, and
Philip (“flip”) Kromer for the case study on graph processing.
For the third edition, thanks go to Alejandro Abdelnur, Eva Andreasson, Eli Collins,
Doug Cutting, Patrick Hunt, Aaron Kimball, Aaron T. Myers, Brock Noland, Arvind
Prabhakar, Ahmed Radwan, and Tom Wheeler for their feedback and suggestions. Rob
Weltman kindly gave very detailed feedback for the whole book, which greatly improved
the final manuscript. Thanks also go to all the readers who submitted errata for the
second edition.
For the fourth edition, I would like to thank Jodok Batlogg, Meghan Blanchette, Ryan
Blue, Jarek Jarcec Cecho, Jules Damji, Dennis Dawson, Matthew Gast, Karthik Kam‐
batla, Julien Le Dem, Brock Noland, Sandy Ryza, Akshai Sarma, Ben Spivey, Michael
Stack, Kate Ting, Josh Walter, Josh Wills, and Adrian Woodhead for all of their invaluable
review feedback. Ryan Brush, Micah Whitacre, and Matt Massie kindly contributed new
case studies for this edition. Thanks again to all the readers who submitted errata.
I am particularly grateful to Doug Cutting for his encouragement, support, and friend‐
ship, and for contributing the Foreword.
Thanks also go to the many others with whom I have had conversations or email
discussions over the course of writing the book.
Halfway through writing the first edition of this book, I joined Cloudera, and I want to
thank my colleagues for being incredibly supportive in allowing me the time to write
and to get it finished promptly.
I am grateful to my editors, Mike Loukides and Meghan Blanchette, and their colleagues
at O’Reilly for their help in the preparation of this book. Mike and Meghan have been
there throughout to answer my questions, to read my first drafts, and to keep me on
schedule.
Finally, the writing of this book has been a great deal of work, and I couldn’t have done
it without the constant support of my family. My wife, Eliane, not only kept the home
going, but also stepped in to help review, edit, and chase case studies. My daughters,
Emilia and Lottie, have been very understanding, and I’m looking forward to spending
lots more time with all of them.
PART I
Hadoop Fundamentals
1. These statistics were reported in a study entitled “The Digital Universe of Opportunities: Rich Data and the
Increasing Value of the Internet of Things.”
2. All figures are from 2013 or 2014. For more information, see Tom Groenfeldt, “At NYSE, The Data Deluge
Overwhelms Traditional Databases”; Rich Miller, “Facebook Builds Exabyte Data Centers for Cold Stor‐
age”; Ancestry.com’s “Company Facts”; Archive.org’s “Petabox”; and the Worldwide LHC Computing Grid
project’s welcome page.
CHAPTER 1
Meet Hadoop
In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log,
they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for
more systems of computers.
—Grace Hopper
Data!
We live in the data age. It’s not easy to measure the total volume of data stored elec‐
tronically, but an IDC estimate put the size of the “digital universe” at 4.4 zettabytes in
2013 and forecast a tenfold growth by 2020, to 44 zettabytes.1
A zettabyte is 10²¹
bytes, or equivalently one thousand exabytes, one million petabytes, or one billion
terabytes. That’s more than one disk drive for every person in the world.
This flood of data is coming from many sources. Consider the following:2
• The New York Stock Exchange generates about 4−5 terabytes of data per day.
• Facebook hosts more than 240 billion photos, growing at 7 petabytes per month.
• Ancestry.com, the genealogy site, stores around 10 petabytes of data.
• The Internet Archive stores around 18.5 petabytes of data.
• The Large Hadron Collider near Geneva, Switzerland, produces about 30 petabytes
of data per year.
So there’s a lot of data out there. But you are probably wondering how it affects you.
Most of the data is locked up in the largest web properties (like search engines) or in
scientific or financial institutions, isn’t it? Does the advent of big data affect smaller
organizations or individuals?
I argue that it does. Take photos, for example. My wife’s grandfather was an avid pho‐
tographer and took photographs throughout his adult life. His entire corpus of medium-
format, slide, and 35mm film, when scanned in at high resolution, occupies around 10
gigabytes. Compare this to the digital photos my family took in 2008, which take up
about 5 gigabytes of space. My family is producing photographic data at 35 times the
rate my wife’s grandfather’s did, and the rate is increasing every year as it becomes easier
to take more and more photos.
More generally, the digital streams that individuals are producing are growing apace.
Microsoft Research’s MyLifeBits project gives a glimpse of the archiving of personal
information that may become commonplace in the near future. MyLifeBits was an ex‐
periment where an individual’s interactions—phone calls, emails, documents—were
captured electronically and stored for later access. The data gathered included a photo
taken every minute, which resulted in an overall data volume of 1 gigabyte per month.
When storage costs come down enough to make it feasible to store continuous audio
and video, the data volume for a future MyLifeBits service will be many times that.
The trend is for every individual’s data footprint to grow, but perhaps more significantly,
the amount of data generated by machines as a part of the Internet of Things will be
even greater than that generated by people. Machine logs, RFID readers, sensor net‐
works, vehicle GPS traces, retail transactions—all of these contribute to the growing
mountain of data.
The volume of data being made publicly available increases every year, too. Organiza‐
tions no longer have to merely manage their own data; success in the future will be
dictated to a large extent by their ability to extract value from other organizations’ data.
Initiatives such as Public Data Sets on Amazon Web Services and Infochimps.org exist
to foster the “information commons,” where data can be freely (or for a modest price)
shared for anyone to download and analyze. Mashups between different information
sources make for unexpected and hitherto unimaginable applications.
Take, for example, the Astrometry.net project, which watches the Astrometry group on
Flickr for new photos of the night sky. It analyzes each image and identifies which part
of the sky it is from, as well as any interesting celestial bodies, such as stars or galaxies.
This project shows the kinds of things that are possible when data (in this case, tagged
photographic images) is made available and used for something (image analysis) that
was not anticipated by the creator.
3. The quote is from Anand Rajaraman’s blog post “More data usually beats better algorithms,” in which he
writes about the Netflix Challenge. Alon Halevy, Peter Norvig, and Fernando Pereira make the same point
in “The Unreasonable Effectiveness of Data,” IEEE Intelligent Systems, March/April 2009.
4. These specifications are for the Seagate ST-41600n.
It has been said that “more data usually beats better algorithms,” which is to say that for
some problems (such as recommending movies or music based on past preferences),
however fiendish your algorithms, often they can be beaten simply by having more data
(and a less sophisticated algorithm).3
The good news is that big data is here. The bad news is that we are struggling to store
and analyze it.
Data Storage and Analysis
The problem is simple: although the storage capacities of hard drives have increased
massively over the years, access speeds—the rate at which data can be read from drives—
have not kept up. One typical drive from 1990 could store 1,370 MB of data and had a
transfer speed of 4.4 MB/s,4
so you could read all the data from a full drive in around
five minutes. Over 20 years later, 1-terabyte drives are the norm, but the transfer speed
is around 100 MB/s, so it takes more than two and a half hours to read all the data off
the disk.
This is a long time to read all data on a single drive—and writing is even slower. The
obvious way to reduce the time is to read from multiple disks at once. Imagine if we had
100 drives, each holding one hundredth of the data. Working in parallel, we could read
the data in under two minutes.
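To make the arithmetic concrete, here is a minimal sketch (not from the book) that
reproduces the three figures above, using the approximate capacities and transfer rates
quoted in the text:

public class DriveReadTimes {
    public static void main(String[] args) {
        double mb1990 = 1370;        // MB stored on a typical 1990 drive
        double rate1990 = 4.4;       // MB/s transfer speed in 1990
        System.out.printf("1990 drive: %.1f minutes%n", mb1990 / rate1990 / 60);

        double mbToday = 1_000_000;  // 1 terabyte expressed in MB
        double rateToday = 100;      // MB/s transfer speed today
        System.out.printf("1 TB drive: %.1f hours%n", mbToday / rateToday / 3600);

        int drives = 100;            // the same terabyte spread over 100 drives, read in parallel
        System.out.printf("100 drives in parallel: %.1f minutes%n",
                mbToday / drives / rateToday / 60);
    }
}

Running it prints roughly 5.2 minutes, 2.8 hours, and 1.7 minutes, matching the figures
in the preceding paragraphs.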
Using only one hundredth of a disk may seem wasteful. But we can store 100 datasets,
each of which is 1 terabyte, and provide shared access to them. We can imagine that the
users of such a system would be happy to share access in return for shorter analysis
times, and statistically, that their analysis jobs would be likely to be spread over time,
so they wouldn’t interfere with each other too much.
There’s more to being able to read and write data in parallel to or from multiple disks,
though.
The first problem to solve is hardware failure: as soon as you start using many pieces of
hardware, the chance that one will fail is fairly high. A common way of avoiding data
loss is through replication: redundant copies of the data are kept by the system so that
in the event of failure, there is another copy available. This is how RAID works, for
instance, although Hadoop’s filesystem, the Hadoop Distributed Filesystem (HDFS),
takes a slightly different approach, as you shall see later.
The second problem is that most analysis tasks need to be able to combine the data in
some way, and data read from one disk may need to be combined with data from any
of the other 99 disks. Various distributed systems allow data to be combined from mul‐
tiple sources, but doing this correctly is notoriously challenging. MapReduce provides
a programming model that abstracts the problem from disk reads and writes, trans‐
forming it into a computation over sets of keys and values. We look at the details of this
model in later chapters, but the important point for the present discussion is that there
are two parts to the computation—the map and the reduce—and it’s the interface be‐
tween the two where the “mixing” occurs. Like HDFS, MapReduce has built-in
reliability.
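As a toy illustration of that idea (plain Java, not Hadoop’s API, which is introduced in
Chapter 2), a word count can be written as a map step that emits (key, value) pairs, a
grouping step in between, and a reduce step that combines each group:

import java.util.*;

public class MiniMapReduce {
    public static void main(String[] args) {
        List<String> records = Arrays.asList("the quick brown fox", "the lazy dog");

        // "Map" phase: each input record is turned into intermediate (word, 1) pairs.
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : records) {
            for (String word : line.split(" ")) {
                intermediate.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // The "mixing" in between: group every emitted value by its key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // "Reduce" phase: each key's values are combined into a single result.
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) sum += value;
            System.out.println(entry.getKey() + "\t" + sum);
        }
    }
}

Hadoop runs the same three stages, but with the map and reduce functions distributed
across a cluster and the grouping handled by the framework.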
In a nutshell, this is what Hadoop provides: a reliable, scalable platform for storage and
analysis. What’s more, because it runs on commodity hardware and is open source,
Hadoop is affordable.
Querying All Your Data
The approach taken by MapReduce may seem like a brute-force approach. The premise
is that the entire dataset—or at least a good portion of it—can be processed for each
query. But this is its power. MapReduce is a batch query processor, and the ability to
run an ad hoc query against your whole dataset and get the results in a reasonable time
is transformative. It changes the way you think about data and unlocks data that was
previously archived on tape or disk. It gives people the opportunity to innovate with
data. Questions that took too long to get answered before can now be answered, which
in turn leads to new questions and new insights.
For example, Mailtrust, Rackspace’s mail division, used Hadoop for processing email
logs. One ad hoc query they wrote was to find the geographic distribution of their users.
In their words:
This data was so useful that we’ve scheduled the MapReduce job to run monthly and we
will be using this data to help us decide which Rackspace data centers to place new mail
servers in as we grow.
By bringing several hundred gigabytes of data together and having the tools to analyze
it, the Rackspace engineers were able to gain an understanding of the data that they
otherwise would never have had, and furthermore, they were able to use what they had
learned to improve the service for their customers.
Beyond Batch
For all its strengths, MapReduce is fundamentally a batch processing system, and is not
suitable for interactive analysis. You can’t run a query and get results back in a few
seconds or less. Queries typically take minutes or more, so it’s best for offline use, where
there isn’t a human sitting in the processing loop waiting for results.
However, since its original incarnation, Hadoop has evolved beyond batch processing.
Indeed, the term “Hadoop” is sometimes used to refer to a larger ecosystem of projects,
not just HDFS and MapReduce, that fall under the umbrella of infrastructure for dis‐
tributed computing and large-scale data processing. Many of these are hosted by the
Apache Software Foundation, which provides support for a community of open source
software projects, including the original HTTP Server from which it gets its name.
The first component to provide online access was HBase, a key-value store that uses
HDFS for its underlying storage. HBase provides both online read/write access of in‐
dividual rows and batch operations for reading and writing data in bulk, making it a
good solution for building applications on.
The real enabler for new processing models in Hadoop was the introduction of YARN
(which stands for Yet Another Resource Negotiator) in Hadoop 2. YARN is a cluster
resource management system, which allows any distributed program (not just MapRe‐
duce) to run on data in a Hadoop cluster.
In the last few years, there has been a flowering of different processing patterns that
work with Hadoop. Here is a sample:
Interactive SQL
By dispensing with MapReduce and using a distributed query engine that uses
dedicated “always on” daemons (like Impala) or container reuse (like Hive on Tez),
it’s possible to achieve low-latency responses for SQL queries on Hadoop while still
scaling up to large dataset sizes.
Iterative processing
Many algorithms—such as those in machine learning—are iterative in nature, so
it’s much more efficient to hold each intermediate working set in memory, com‐
pared to loading from disk on each iteration. The architecture of MapReduce does
not allow this, but it’s straightforward with Spark, for example, and it enables a
highly exploratory style of working with datasets.
Stream processing
Streaming systems like Storm, Spark Streaming, or Samza make it possible to run
real-time, distributed computations on unbounded streams of data and emit results
to Hadoop storage or external systems.
Search
The Solr search platform can run on a Hadoop cluster, indexing documents as they
are added to HDFS, and serving search queries from indexes stored in HDFS.
Despite the emergence of different processing frameworks on Hadoop, MapReduce still
has a place for batch processing, and it is useful to understand how it works since it
introduces several concepts that apply more generally (like the idea of input formats,
or how a dataset is split into pieces).
5. In January 2007, David J. DeWitt and Michael Stonebraker caused a stir by publishing “MapReduce: A major
step backwards,” in which they criticized MapReduce for being a poor substitute for relational databases.
Many commentators argued that it was a false comparison (see, for example, Mark C. Chu-Carroll’s “Data‐
bases are hammers; MapReduce is a screwdriver”), and DeWitt and Stonebraker followed up with “MapRe‐
duce II,” where they addressed the main topics brought up by others.
Comparison with Other Systems
Hadoop isn’t the first distributed system for data storage and analysis, but it has some
unique properties that set it apart from other systems that may seem similar. Here we
look at some of them.
Relational Database Management Systems
Why can’t we use databases with lots of disks to do large-scale analysis? Why is Hadoop
needed?
The answer to these questions comes from another trend in disk drives: seek time is
improving more slowly than transfer rate. Seeking is the process of moving the disk’s
head to a particular place on the disk to read or write data. It characterizes the latency
of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth.
If the data access pattern is dominated by seeks, it will take longer to read or write large
portions of the dataset than streaming through it, which operates at the transfer rate.
On the other hand, for updating a small proportion of records in a database, a traditional
B-Tree (the data structure used in relational databases, which is limited by the rate at
which it can perform seeks) works well. For updating the majority of a database, a B-
Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database.
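A back-of-the-envelope sketch makes the point (the 10 ms seek time and 100-byte record
size are assumptions for illustration, not figures from the text):

public class SeekVersusStream {
    public static void main(String[] args) {
        double seekMs = 10;                 // assumed average seek time per record
        double transferMBps = 100;          // transfer rate quoted earlier in the chapter
        double datasetMB = 1_000_000;       // a 1 TB dataset
        long recordsTouched = 100_000_000L; // 1% of ten billion 100-byte records (assumed)

        double seekHours = recordsTouched * seekMs / 1000 / 3600;
        double streamHours = datasetMB / transferMBps / 3600;
        System.out.printf("Seeking to each updated record: %.0f hours%n", seekHours);
        System.out.printf("Streaming through the whole dataset: %.1f hours%n", streamHours);
    }
}

Even though only 1% of the records are touched, seek-dominated access takes on the order
of hundreds of hours, while a single sequential pass over the entire dataset takes under
three hours, which is why a sort/merge rewrite wins for bulk updates.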
In many ways, MapReduce can be seen as a complement to a Relational Database Man‐
agement System (RDBMS). (The differences between the two systems are shown in
Table 1-1.) MapReduce is a good fit for problems that need to analyze the whole dataset
in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries
or updates, where the dataset has been indexed to deliver low-latency retrieval and
update times of a relatively small amount of data. MapReduce suits applications where
the data is written once and read many times, whereas a relational database is good for
datasets that are continually updated.5
Table 1-1. RDBMS compared to MapReduce

               Traditional RDBMS           MapReduce
Data size      Gigabytes                   Petabytes
Access         Interactive and batch       Batch
Updates        Read and write many times   Write once, read many times
Transactions   ACID                        None
Structure      Schema-on-write             Schema-on-read
Integrity      High                        Low
Scaling        Nonlinear                   Linear
However,thedifferencesbetweenrelationaldatabasesandHadoopsystemsareblurring.
Relational databases have started incorporating some of the ideas from Hadoop, and
from the other direction, Hadoop systems such as Hive are becoming more interactive
(by moving away from MapReduce) and adding features like indexes and transactions
that make them look more and more like traditional RDBMSs.
Another difference between Hadoop and an RDBMS is the amount of structure in the
datasets on which they operate. Structured data is organized into entities that have a
defined format, such as XML documents or database tables that conform to a particular
predefined schema. This is the realm of the RDBMS. Semi-structured data, on the other
hand, is looser, and though there may be a schema, it is often ignored, so it may be used
only as a guide to the structure of the data: for example, a spreadsheet, in which the
structure is the grid of cells, although the cells themselves may hold any form of data.
Unstructured data does not have any particular internal structure: for example, plain
text or image data. Hadoop works well on unstructured or semi-structured data because
it is designed to interpret the data at processing time (so called schema-on-read). This
provides flexibility and avoids the costly data loading phase of an RDBMS, since in
Hadoop it is just a file copy.
Relational data is often normalized to retain its integrity and remove redundancy.
NormalizationposesproblemsforHadoopprocessingbecauseitmakesreadingarecord
a nonlocal operation, and one of the central assumptions that Hadoop makes is that it
is possible to perform (high-speed) streaming reads and writes.
A web server log is a good example of a set of records that is not normalized (for example,
the client hostnames are specified in full each time, even though the same client may
appear many times), and this is one reason that logfiles of all kinds are particularly well
suited to analysis with Hadoop. Note that Hadoop can perform joins; it’s just that they
are not used as much as in the relational world.
MapReduce—and the other processing models in Hadoop—scales linearly with the size
of the data. Data is partitioned, and the functional primitives (like map and reduce) can
work in parallel on separate partitions. This means that if you double the size of the
input data, a job will run twice as slowly. But if you also double the size of the cluster, a
job will run as fast as the original one. This is not generally true of SQL queries.
6. Jim Gray was an early advocate of putting the computation near the data. See “Distributed Computing Eco‐
nomics,” March 2003.
Grid Computing
The high-performance computing (HPC) and grid computing communities have been
doing large-scale data processing for years, using such application program interfaces
(APIs) as the Message Passing Interface (MPI). Broadly, the approach in HPC is to
distribute the work across a cluster of machines, which access a shared filesystem, hosted
by a storage area network (SAN). This works well for predominantly compute-intensive
jobs, but it becomes a problem when nodes need to access larger data volumes (hundreds
of gigabytes, the point at which Hadoop really starts to shine), since the network band‐
width is the bottleneck and compute nodes become idle.
Hadoop tries to co-locate the data with the compute nodes, so data access is fast because
it is local.6
This feature, known as data locality, is at the heart of data processing in
Hadoop and is the reason for its good performance. Recognizing that network band‐
width is the most precious resource in a data center environment (it is easy to saturate
network links by copying data around), Hadoop goes to great lengths to conserve it by
explicitly modeling network topology. Notice that this arrangement does not preclude
high-CPU analyses in Hadoop.
MPI gives great control to programmers, but it requires that they explicitly handle the
mechanics of the data flow, exposed via low-level C routines and constructs such as
sockets, as well as the higher-level algorithms for the analyses. Processing in Hadoop
operates only at the higher level: the programmer thinks in terms of the data model
(such as key-value pairs for MapReduce), while the data flow remains implicit.
Coordinating the processes in a large-scale distributed computation is a challenge. The
hardest aspect is gracefully handling partial failure—when you don’t know whether or
not a remote process has failed—and still making progress with the overall computation.
Distributed processing frameworks like MapReduce spare the programmer from having
to think about failure, since the implementation detects failed tasks and reschedules
replacements on machines that are healthy. MapReduce is able to do this because it is a
shared-nothing architecture, meaning that tasks have no dependence on one another. (This
is a slight oversimplification, since the output from mappers is fed to the reducers, but
this is under the control of the MapReduce system; in this case, it needs to take more
care rerunning a failed reducer than rerunning a failed map, because it has to make sure
it can retrieve the necessary map outputs and, if not, regenerate them by running the
relevant maps again.) So from the programmer’s point of view, the order in which the
tasks run doesn’t matter. By contrast, MPI programs have to explicitly manage their own
checkpointing and recovery, which gives more control to the programmer but makes
them more difficult to write.
7. In January 2008, SETI@home was reported to be processing 300 gigabytes a day, using 320,000 computers
(most of which are not dedicated to SETI@home; they are used for other things, too).
Volunteer Computing
When people first hear about Hadoop and MapReduce they often ask, “How is it dif‐
ferent from SETI@home?” SETI, the Search for Extra-Terrestrial Intelligence, runs a
project called SETI@home in which volunteers donate CPU time from their otherwise
idle computers to analyze radio telescope data for signs of intelligent life outside Earth.
SETI@home is the most well known of many volunteer computing projects; others in‐
clude the Great Internet Mersenne Prime Search (to search for large prime numbers)
and Folding@home (to understand protein folding and how it relates to disease).
Volunteer computing projects work by breaking the problems they are trying to
solve into chunks called work units, which are sent to computers around the world to
be analyzed. For example, a SETI@home work unit is about 0.35 MB of radio telescope
data, and takes hours or days to analyze on a typical home computer. When the analysis
is completed, the results are sent back to the server, and the client gets another work
unit. As a precaution to combat cheating, each work unit is sent to three different ma‐
chines and needs at least two results to agree to be accepted.
Although SETI@home may be superficially similar to MapReduce (breaking a problem
into independent pieces to be worked on in parallel), there are some significant differ‐
ences. The SETI@home problem is very CPU-intensive, which makes it suitable for
running on hundreds of thousands of computers across the world7
because the time to
transfer the work unit is dwarfed by the time to run the computation on it. Volunteers
are donating CPU cycles, not bandwidth.
8. In this book, we use the lowercase form, “namenode,” to denote the entity when it’s being referred to generally,
and the CamelCase form NameNode to denote the Java class that implements it.
9. See Mike Cafarella and Doug Cutting, “Building Nutch: Open Source Search,” ACM Queue, April 2004.
MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated
hardware running in a single data center with very high aggregate bandwidth
interconnects. By contrast, SETI@home runs a perpetual computation on untrusted
machines on the Internet with highly variable connection speeds and no data locality.
A Brief History of Apache Hadoop
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used
text search library. Hadoop has its origins in Apache Nutch, an open source web search
engine, itself a part of the Lucene project.
The Origin of the Name “Hadoop”
The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug
Cutting, explains how the name came about:
The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and
pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids
are good at generating such. Googol is a kid’s term.
Projects in the Hadoop ecosystem also tend to have names that are unrelated to their
function, often with an elephant or other animal theme (“Pig,” for example). Smaller
components are given more descriptive (and therefore more mundane) names. This is
a good principle, as it means you can generally work out what something does from its
name. For example, the namenode8
manages the filesystem namespace.
Building a web search engine from scratch was an ambitious goal, for not only is the
software required to crawl and index websites complex to write, but it is also a challenge
to run without a dedicated operations team, since there are so many moving parts. It’s
expensive, too: Mike Cafarella and Doug Cutting estimated a system supporting a
one-billion-page index would cost around $500,000 in hardware, with a monthly run‐
ning cost of $30,000.9
Nevertheless, they believed it was a worthy goal, as it would open
up and ultimately democratize search engine algorithms.
Nutch was started in 2002, and a working crawler and search system quickly emerged.
However, its creators realized that their architecture wouldn’t scale to the billions of
pages on the Web. Help was at hand with the publication of a paper in 2003 that described
the architecture of Google’s distributed filesystem, called GFS, which was being used in
10. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, “The Google File System,” October 2003.
11. Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” December
2004.
12. “Yahoo! Launches World’s Largest Hadoop Production Application,” February 19, 2008.
production at Google.10
GFS, or something like it, would solve their storage needs for
the very large files generated as a part of the web crawl and indexing process. In par‐
ticular, GFS would free up time being spent on administrative tasks such as managing
storage nodes. In 2004, Nutch’s developers set about writing an open source implemen‐
tation, the Nutch Distributed Filesystem (NDFS).
In 2004, Google published the paper that introduced MapReduce to the world.11
Early
in 2005, the Nutch developers had a working MapReduce implementation in Nutch,
and by the middle of that year all the major Nutch algorithms had been ported to run
using MapReduce and NDFS.
NDFS and the MapReduce implementation in Nutch were applicable beyond the realm
of search, and in February 2006 they moved out of Nutch to form an independent
subproject of Lucene called Hadoop. At around the same time, Doug Cutting joined
Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a
system that ran at web scale (see the following sidebar). This was demonstrated in Feb‐
ruary 2008 when Yahoo! announced that its production search index was being gener‐
ated by a 10,000-core Hadoop cluster.12
Hadoop at Yahoo!
Building Internet-scale search engines requires huge amounts of data and therefore large
numbers of machines to process it. Yahoo! Search consists of four primary components:
the Crawler, which downloads pages from web servers; the WebMap, which builds a
graph of the known Web; the Indexer, which builds a reverse index to the best pages;
and the Runtime, which answers users’ queries. The WebMap is a graph that consists of
roughly 1 trillion (10^12) edges, each representing a web link, and 100 billion (10^11) nodes,
each representing distinct URLs. Creating and analyzing such a large graph requires a
large number of computers running for many days. In early 2005, the infrastructure for
the WebMap, named Dreadnaught, needed to be redesigned to scale up to more nodes.
Dreadnaught had successfully scaled from 20 to 600 nodes, but required a complete
redesign to scale out further. Dreadnaught is similar to MapReduce in many ways, but
provides more flexibility and less structure. In particular, each fragment in a Dread‐
naught job could send output to each of the fragments in the next stage of the job, but
the sort was all done in library code. In practice, most of the WebMap phases were pairs
that corresponded to MapReduce. Therefore, the WebMap applications would not re‐
quire extensive refactoring to fit into MapReduce.
13. Derek Gottfrid, “Self-Service, Prorated Super Computing Fun!” November 1, 2007.
14. Owen O’Malley, “TeraByte Sort on Apache Hadoop,” May 2008.
15. Grzegorz Czajkowski, “Sorting 1PB with MapReduce,” November 21, 2008.
16. Owen O’Malley and Arun C. Murthy, “Winning a 60 Second Dash with a Yellow Elephant,” April 2009.
Eric Baldeschwieler (aka Eric14) created a small team, and we started designing and
prototyping a new framework, written in C++ and modeled after GFS and MapReduce,
to replace Dreadnaught. Although the immediate need was for a new framework for
WebMap, it was clear that standardization of the batch platform across Yahoo! Search
was critical and that by making the framework general enough to support other users,
we could better leverage investment in the new platform.
At the same time, we were watching Hadoop, which was part of Nutch, and its progress.
In January 2006, Yahoo! hired Doug Cutting, and a month later we decided to abandon
our prototype and adopt Hadoop. The advantage of Hadoop over our prototype and
design was that it was already working with a real application (Nutch) on 20 nodes. That
allowed us to bring up a research cluster two months later and start helping real cus‐
tomers use the new framework much sooner than we could have otherwise. Another
advantage, of course, was that since Hadoop was already open source, it was easier
(although far from easy!) to get permission from Yahoo!’s legal department to work in
open source. So, we set up a 200-node cluster for the researchers in early 2006 and put
the WebMap conversion plans on hold while we supported and improved Hadoop for
the research users.
—Owen O’Malley, 2009
In January 2008, Hadoop was made its own top-level project at Apache, confirming its
success and its diverse, active community. By this time, Hadoop was being used by many
other companies besides Yahoo!, such as Last.fm, Facebook, and the New York Times.
In one well-publicized feat, the New York Times used Amazon’s EC2 compute cloud to
crunch through 4 terabytes of scanned archives from the paper, converting them to
PDFs for the Web.13
The processing took less than 24 hours to run using 100 machines,
and the project probably wouldn’t have been embarked upon without the combination
of Amazon’s pay-by-the-hour model (which allowed the NYT to access a large number
of machines for a short period) and Hadoop’s easy-to-use parallel programming model.
In April 2008, Hadoop broke a world record to become the fastest system to sort an
entire terabyte of data. Running on a 910-node cluster, Hadoop sorted 1 terabyte in 209
seconds (just under 3.5 minutes), beating the previous year’s winner of 297 seconds.14
In November of the same year, Google reported that its MapReduce implementation
sorted 1 terabyte in 68 seconds.15
Then, in April 2009, it was announced that a team at
Yahoo! had used Hadoop to sort 1 terabyte in 62 seconds.16
17. Reynold Xin et al., “GraySort on Apache Spark by Databricks,” November 2014.
The trend since then has been to sort even larger volumes of data at ever faster rates. In
the 2014 competition, a team from Databricks were joint winners of the Gray Sort
benchmark. They used a 207-node Spark cluster to sort 100 terabytes of data in 1,406
seconds, a rate of 4.27 terabytes per minute.17
Today, Hadoop is widely used in mainstream enterprises. Hadoop’s role as a general-
purpose storage and analysis platform for big data has been recognized by the industry,
and this fact is reflected in the number of products that use or incorporate Hadoop in
some way. Commercial Hadoop support is available from large, established enterprise
vendors, including EMC, IBM, Microsoft, and Oracle, as well as from specialist Hadoop
companies such as Cloudera, Hortonworks, and MapR.
What’s in This Book?
The book is divided into five main parts: Parts I to III are about core Hadoop, Part IV
covers related projects in the Hadoop ecosystem, and Part V contains Hadoop case
studies. You can read the book from cover to cover, but there are alternative pathways
through the book that allow you to skip chapters that aren’t needed to read later ones.
See Figure 1-1.
Part I is made up of five chapters that cover the fundamental components in Hadoop
and should be read before tackling later chapters. Chapter 1 (this chapter) is a high-level
introduction to Hadoop. Chapter 2 provides an introduction to MapReduce. Chap‐
ter 3 looks at Hadoop filesystems, and in particular HDFS, in depth. Chapter 4 discusses
YARN, Hadoop’s cluster resource management system. Chapter 5 covers the I/O build‐
ing blocks in Hadoop: data integrity, compression, serialization, and file-based data
structures.
Part II has four chapters that cover MapReduce in depth. They provide useful under‐
standing for later chapters (such as the data processing chapters in Part IV), but could
be skipped on a first reading. Chapter 6 goes through the practical steps needed to
develop a MapReduce application. Chapter 7 looks at how MapReduce is implemented
in Hadoop, from the point of view of a user. Chapter 8 is about the MapReduce pro‐
gramming model and the various data formats that MapReduce can work with. Chap‐
ter 9 is on advanced MapReduce topics, including sorting and joining data.
Part III concerns the administration of Hadoop: Chapters 10 and 11 describe how to
set up and maintain a Hadoop cluster running HDFS and MapReduce on YARN.
Part IV of the book is dedicated to projects that build on Hadoop or are closely related
to it. Each chapter covers one project and is largely independent of the other chapters
in this part, so they can be read in any order.
The first two chapters in this part are about data formats. Chapter 12 looks at Avro, a
cross-language data serialization library for Hadoop, and Chapter 13 covers Parquet,
an efficient columnar storage format for nested data.
The next two chapters look at data ingestion, or how to get your data into Hadoop.
Chapter 14 is about Flume, for high-volume ingestion of streaming data. Chapter 15 is
about Sqoop, for efficient bulk transfer of data between structured data stores (like
relational databases) and HDFS.
The common theme of the next four chapters is data processing, and in particular using
higher-level abstractions than MapReduce. Pig (Chapter 16) is a data flow language for
exploring very large datasets. Hive (Chapter 17) is a data warehouse for managing data
stored in HDFS and provides a query language based on SQL. Crunch (Chapter 18) is
a high-level Java API for writing data processing pipelines that can run on MapReduce
or Spark. Spark (Chapter 19) is a cluster computing framework for large-scale data
processing; it provides a directed acyclic graph (DAG) engine, and APIs in Scala, Java,
and Python.
Chapter 20 is an introduction to HBase, a distributed column-oriented real-time data‐
base that uses HDFS for its underlying storage. And Chapter 21 is about ZooKeeper, a
distributed, highly available coordination service that provides useful primitives for
building distributed applications.
Finally, Part V is a collection of case studies contributed by people using Hadoop in
interesting ways.
Supplementary information about Hadoop, such as how to install it on your machine,
can be found in the appendixes.
Figure 1-1. Structure of the book: there are various pathways through the content
CHAPTER 2
MapReduce
MapReduce is a programming model for data processing. The model is simple, yet not
too simple to express useful programs in. Hadoop can run MapReduce programs written
in various languages; in this chapter, we look at the same program expressed in Java,
Ruby, and Python. Most importantly, MapReduce programs are inherently parallel, thus
putting very large-scale data analysis into the hands of anyone with enough machines
at their disposal. MapReduce comes into its own for large datasets, so let’s start by looking
at one.
A Weather Dataset
For our example, we will write a program that mines weather data. Weather sensors
collect data every hour at many locations across the globe and gather a large volume of
log data, which is a good candidate for analysis with MapReduce because we want to
process all the data, and the data is semi-structured and record-oriented.
Data Format
The data we will use is from the National Climatic Data Center, or NCDC. The data is
stored using a line-oriented ASCII format, in which each line is a record. The format
supports a rich set of meteorological elements, many of which are optional or with
variable data lengths. For simplicity, we focus on the basic elements, such as temperature,
which are always present and are of fixed width.
Example 2-1 shows a sample line with some of the salient fields annotated. The line has
been split into multiple lines to show each field; in the real file, fields are packed into
one line with no delimiters.
Example 2-1. Format of a National Climatic Data Center record
0057
332130 # USAF weather station identifier
99999 # WBAN weather station identifier
19500101 # observation date
0300 # observation time
4
+51317 # latitude (degrees x 1000)
+028783 # longitude (degrees x 1000)
FM-12
+0171 # elevation (meters)
99999
V020
320 # wind direction (degrees)
1 # quality code
N
0072
1
00450 # sky ceiling height (meters)
1 # quality code
C
N
010000 # visibility distance (meters)
1 # quality code
N
9
-0128 # air temperature (degrees Celsius x 10)
1 # quality code
-0139 # dew point temperature (degrees Celsius x 10)
1 # quality code
10268 # atmospheric pressure (hectopascals x 10)
1 # quality code
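Because the fields are fixed width, a record can be decoded by position alone. As a rough illustration (not one of the book’s examples), the following Python sketch pulls out the fields used in this chapter; the temperature and quality-code offsets match the awk script shown later, while the year offset is an assumption about the format.

def parse_record(line):
    # Offsets are 0-based; columns 88-92 hold the air temperature and column 93
    # the quality code (as in the awk script); the year offset is assumed.
    year = line[15:19]                # observation year, e.g. "1950"
    temp = int(line[87:92])           # air temperature, tenths of a degree Celsius
    quality = line[92]                # quality code for the temperature reading
    return year, temp, quality

def is_valid(temp, quality):
    # 9999 marks a missing reading; codes 0, 1, 4, 5, and 9 are not suspect
    return temp != 9999 and quality in "01459"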
Datafiles are organized by date and weather station. There is a directory for each year
from 1901 to 2001, each containing a gzipped file for each weather station with its
readings for that year. For example, here are the first entries for 1990:
% ls raw/1990 | head
010010-99999-1990.gz
010014-99999-1990.gz
010015-99999-1990.gz
010016-99999-1990.gz
010017-99999-1990.gz
010030-99999-1990.gz
010040-99999-1990.gz
010080-99999-1990.gz
010100-99999-1990.gz
010150-99999-1990.gz
There are tens of thousands of weather stations, so the whole dataset is made up of a
large number of relatively small files. It’s generally easier and more efficient to process
a smaller number of relatively large files, so the data was preprocessed so that each year’s
readings were concatenated into a single file. (The means by which this was carried out
is described in Appendix C.)
Analyzing the Data with Unix Tools
What’s the highest recorded global temperature for each year in the dataset? We will
answer this first without using Hadoop, as this information will provide a performance
baseline and a useful means to check our results.
The classic tool for processing line-oriented data is awk. Example 2-2 is a small script
to calculate the maximum temperature for each year.
Example 2-2. A program for finding the maximum recorded temperature by year from
NCDC weather records
#!/usr/bin/env bash
for year in all/*
do
  echo -ne `basename $year .gz`"\t"
  gunzip -c $year | \
    awk '{ temp = substr($0, 88, 5) + 0;
           q = substr($0, 93, 1);
           if (temp !=9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
done
The script loops through the compressed year files, first printing the year, and then
processing each file using awk. The awk script extracts two fields from the data: the air
temperature and the quality code. The air temperature value is turned into an integer
by adding 0. Next, a test is applied to see whether the temperature is valid (the value
9999 signifies a missing value in the NCDC dataset) and whether the quality code in‐
dicates that the reading is not suspect or erroneous. If the reading is OK, the value is
compared with the maximum value seen so far, which is updated if a new maximum is
found. The END block is executed after all the lines in the file have been processed, and
it prints the maximum value.
Here is the beginning of a run:
% ./max_temperature.sh
1901 317
1902 244
1903 289
1904 256
1905 283
...
The temperature values in the source file are scaled by a factor of 10, so this works out
as a maximum temperature of 31.7°C for 1901 (there were very few readings at the
beginning of the century, so this is plausible). The complete run for the century took 42
minutes in one run on a single EC2 High-CPU Extra Large instance.
To speed up the processing, we need to run parts of the program in parallel. In theory,
this is straightforward: we could process different years in different processes, using all
the available hardware threads on a machine. There are a few problems with this,
however.
First, dividing the work into equal-size pieces isn’t always easy or obvious. In this case,
the file size for different years varies widely, so some processes will finish much earlier
than others. Even if they pick up further work, the whole run is dominated by the longest
file. A better approach, although one that requires more work, is to split the input into
fixed-size chunks and assign each chunk to a process.
Second, combining the results from independent processes may require further pro‐
cessing. In this case, the result for each year is independent of other years, and they may
be combined by concatenating all the results and sorting by year. If using the fixed-size
chunk approach, the combination is more delicate. For this example, data for a particular
year will typically be split into several chunks, each processed independently. We’ll end
up with the maximum temperature for each chunk, so the final step is to look for the
highest of these maximums for each year.
Third, you are still limited by the processing capacity of a single machine. If the best
time you can achieve is 20 minutes with the number of processors you have, then that’s
it. You can’t make it go faster. Also, some datasets grow beyond the capacity of a single
machine. When we start using multiple machines, a whole host of other factors come
into play, mainly falling into the categories of coordination and reliability. Who runs
the overall job? How do we deal with failed processes?
So, although it’s feasible to parallelize the processing, in practice it’s messy. Using a
framework like Hadoop to take care of these issues is a great help.
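To make the messiness concrete, here is a rough Python sketch of the hand-rolled approach, assuming the per-year gzipped files described above (names like all/1901.gz are an assumption based on the shell script). It parallelizes across files with a process pool and then combines the results by sorting by year, but it does nothing about failed processes or unevenly sized files.

import glob
import gzip
import os
from multiprocessing import Pool

def max_temp_for_file(path):
    # One task per year file; returns (year, maximum valid temperature or None).
    year = os.path.basename(path).replace(".gz", "")
    max_temp = None
    with gzip.open(path, "rt") as f:
        for line in f:
            temp, q = int(line[87:92]), line[92]
            if temp != 9999 and q in "01459":
                max_temp = temp if max_temp is None else max(max_temp, temp)
    return year, max_temp

if __name__ == "__main__":
    with Pool() as pool:                       # one worker per available CPU
        results = pool.map(max_temp_for_file, glob.glob("all/*.gz"))
    for year, max_temp in sorted(results):     # combine: concatenate and sort by year
        print(year, max_temp)

Even this simple version has to decide how to split the work (here, one file per task) and how to merge partial results; a framework takes those decisions, along with failure handling, out of the programmer’s hands.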
Analyzing the Data with Hadoop
To take advantage of the parallel processing that Hadoop provides, we need to express
our query as a MapReduce job. After some local, small-scale testing, we will be able to
run it on a cluster of machines.
Map and Reduce
MapReduce works by breaking the processing into two phases: the map phase and the
reduce phase. Each phase has key-value pairs as input and output, the types of which
may be chosen by the programmer. The programmer also specifies two functions: the
map function and the reduce function.
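To fix ideas before looking at the real Hadoop code, here is a plain-Python sketch (not the Hadoop API) of the two phases for the maximum-temperature problem; the grouping of map outputs by key between the phases is what the framework performs for you.

def map_fn(line):
    # Emit (year, temperature) pairs for valid readings.
    year, temp, q = line[15:19], int(line[87:92]), line[92]
    if temp != 9999 and q in "01459":
        yield year, temp

def reduce_fn(year, temps):
    # Receive all temperatures for one year and emit the maximum.
    yield year, max(temps)

def run(lines):
    # Simulate the shuffle: group map outputs by key before reducing.
    groups = {}
    for line in lines:
        for year, temp in map_fn(line):
            groups.setdefault(year, []).append(temp)
    for year in sorted(groups):
        yield from reduce_fn(year, groups[year])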
  • 14. Remote Debugging 174 Tuning a Job 175 Profiling Tasks 175 MapReduce Workflows 177 Decomposing a Problem into MapReduce Jobs 177 JobControl 178 Apache Oozie 179 7. How MapReduce Works. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185 Anatomy of a MapReduce Job Run 185 Job Submission 186 Job Initialization 187 Task Assignment 188 Task Execution 189 Progress and Status Updates 190 Job Completion 192 Failures 193 Task Failure 193 Application Master Failure 194 Node Manager Failure 195 Resource Manager Failure 196 Shuffle and Sort 197 The Map Side 197 The Reduce Side 198 Configuration Tuning 201 Task Execution 203 The Task Execution Environment 203 Speculative Execution 204 Output Committers 206 8. MapReduce Types and Formats. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209 MapReduce Types 209 The Default MapReduce Job 214 Input Formats 220 Input Splits and Records 220 Text Input 232 Binary Input 236 Multiple Inputs 237 Database Input (and Output) 238 Output Formats 238 Text Output 239 Binary Output 239 viii | Table of Contents
  • 15. Multiple Outputs 240 Lazy Output 245 Database Output 245 9. MapReduce Features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247 Counters 247 Built-in Counters 247 User-Defined Java Counters 251 User-Defined Streaming Counters 255 Sorting 255 Preparation 256 Partial Sort 257 Total Sort 259 Secondary Sort 262 Joins 268 Map-Side Joins 269 Reduce-Side Joins 270 Side Data Distribution 273 Using the Job Configuration 273 Distributed Cache 274 MapReduce Library Classes 279 Part III. Hadoop Operations 10. Setting Up a Hadoop Cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283 Cluster Specification 284 Cluster Sizing 285 Network Topology 286 Cluster Setup and Installation 288 Installing Java 288 Creating Unix User Accounts 288 Installing Hadoop 289 Configuring SSH 289 Configuring Hadoop 290 Formatting the HDFS Filesystem 290 Starting and Stopping the Daemons 290 Creating User Directories 292 Hadoop Configuration 292 Configuration Management 293 Environment Settings 294 Important Hadoop Daemon Properties 296 Table of Contents | ix
  • 16. Hadoop Daemon Addresses and Ports 304 Other Hadoop Properties 307 Security 309 Kerberos and Hadoop 309 Delegation Tokens 312 Other Security Enhancements 313 Benchmarking a Hadoop Cluster 314 Hadoop Benchmarks 314 User Jobs 316 11. Administering Hadoop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317 HDFS 317 Persistent Data Structures 317 Safe Mode 322 Audit Logging 324 Tools 325 Monitoring 330 Logging 330 Metrics and JMX 331 Maintenance 332 Routine Administration Procedures 332 Commissioning and Decommissioning Nodes 334 Upgrades 337 Part IV. Related Projects 12. Avro. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345 Avro Data Types and Schemas 346 In-Memory Serialization and Deserialization 349 The Specific API 351 Avro Datafiles 352 Interoperability 354 Python API 354 Avro Tools 355 Schema Resolution 355 Sort Order 358 Avro MapReduce 359 Sorting Using Avro MapReduce 363 Avro in Other Languages 365 x | Table of Contents
  • 17. 13. Parquet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367 Data Model 368 Nested Encoding 370 Parquet File Format 370 Parquet Configuration 372 Writing and Reading Parquet Files 373 Avro, Protocol Buffers, and Thrift 375 Parquet MapReduce 377 14. Flume. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381 Installing Flume 381 An Example 382 Transactions and Reliability 384 Batching 385 The HDFS Sink 385 Partitioning and Interceptors 387 File Formats 387 Fan Out 388 Delivery Guarantees 389 Replicating and Multiplexing Selectors 390 Distribution: Agent Tiers 390 Delivery Guarantees 393 Sink Groups 395 Integrating Flume with Applications 398 Component Catalog 399 Further Reading 400 15. Sqoop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401 Getting Sqoop 401 Sqoop Connectors 403 A Sample Import 403 Text and Binary File Formats 406 Generated Code 407 Additional Serialization Systems 407 Imports: A Deeper Look 408 Controlling the Import 410 Imports and Consistency 411 Incremental Imports 411 Direct-Mode Imports 411 Working with Imported Data 412 Imported Data and Hive 413 Importing Large Objects 415 Table of Contents | xi
  • 18. Performing an Export 417 Exports: A Deeper Look 419 Exports and Transactionality 420 Exports and SequenceFiles 421 Further Reading 422 16. Pig. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423 Installing and Running Pig 424 Execution Types 424 Running Pig Programs 426 Grunt 426 Pig Latin Editors 427 An Example 427 Generating Examples 429 Comparison with Databases 430 Pig Latin 432 Structure 432 Statements 433 Expressions 438 Types 439 Schemas 441 Functions 445 Macros 447 User-Defined Functions 448 A Filter UDF 448 An Eval UDF 452 A Load UDF 453 Data Processing Operators 456 Loading and Storing Data 456 Filtering Data 457 Grouping and Joining Data 459 Sorting Data 465 Combining and Splitting Data 466 Pig in Practice 466 Parallelism 467 Anonymous Relations 467 Parameter Substitution 467 Further Reading 469 17. Hive. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471 Installing Hive 472 The Hive Shell 473 xii | Table of Contents
  • 19. An Example 474 Running Hive 475 Configuring Hive 475 Hive Services 478 The Metastore 480 Comparison with Traditional Databases 482 Schema on Read Versus Schema on Write 482 Updates, Transactions, and Indexes 483 SQL-on-Hadoop Alternatives 484 HiveQL 485 Data Types 486 Operators and Functions 488 Tables 489 Managed Tables and External Tables 490 Partitions and Buckets 491 Storage Formats 496 Importing Data 500 Altering Tables 502 Dropping Tables 502 Querying Data 503 Sorting and Aggregating 503 MapReduce Scripts 503 Joins 505 Subqueries 508 Views 509 User-Defined Functions 510 Writing a UDF 511 Writing a UDAF 513 Further Reading 518 18. Crunch. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519 An Example 520 The Core Crunch API 523 Primitive Operations 523 Types 528 Sources and Targets 531 Functions 533 Materialization 535 Pipeline Execution 538 Running a Pipeline 538 Stopping a Pipeline 539 Inspecting a Crunch Plan 540 Table of Contents | xiii
  • 20. Iterative Algorithms 543 Checkpointing a Pipeline 545 Crunch Libraries 545 Further Reading 548 19. Spark. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549 Installing Spark 550 An Example 550 Spark Applications, Jobs, Stages, and Tasks 552 A Scala Standalone Application 552 A Java Example 554 A Python Example 555 Resilient Distributed Datasets 556 Creation 556 Transformations and Actions 557 Persistence 560 Serialization 562 Shared Variables 564 Broadcast Variables 564 Accumulators 564 Anatomy of a Spark Job Run 565 Job Submission 565 DAG Construction 566 Task Scheduling 569 Task Execution 570 Executors and Cluster Managers 570 Spark on YARN 571 Further Reading 574 20. HBase. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575 HBasics 575 Backdrop 576 Concepts 576 Whirlwind Tour of the Data Model 576 Implementation 578 Installation 581 Test Drive 582 Clients 584 Java 584 MapReduce 587 REST and Thrift 589 Building an Online Query Application 589 xiv | Table of Contents
  • 21. Schema Design 590 Loading Data 591 Online Queries 594 HBase Versus RDBMS 597 Successful Service 598 HBase 599 Praxis 600 HDFS 600 UI 601 Metrics 601 Counters 601 Further Reading 601 21. ZooKeeper. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603 Installing and Running ZooKeeper 604 An Example 606 Group Membership in ZooKeeper 606 Creating the Group 607 Joining a Group 609 Listing Members in a Group 610 Deleting a Group 612 The ZooKeeper Service 613 Data Model 614 Operations 616 Implementation 620 Consistency 621 Sessions 623 States 625 Building Applications with ZooKeeper 627 A Configuration Service 627 The Resilient ZooKeeper Application 630 A Lock Service 634 More Distributed Data Structures and Protocols 636 ZooKeeper in Production 637 Resilience and Performance 637 Configuration 639 Further Reading 640 Table of Contents | xv
  • 22. Part V. Case Studies 22. Composable Data at Cerner. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 643 From CPUs to Semantic Integration 643 Enter Apache Crunch 644 Building a Complete Picture 644 Integrating Healthcare Data 647 Composability over Frameworks 650 Moving Forward 651 23. Biological Data Science: Saving Lives with Software. . . . . . . . . . . . . . . . . . . . . . . . . . . . 653 The Structure of DNA 655 The Genetic Code: Turning DNA Letters into Proteins 656 Thinking of DNA as Source Code 657 The Human Genome Project and Reference Genomes 659 Sequencing and Aligning DNA 660 ADAM, A Scalable Genome Analysis Platform 661 Literate programming with the Avro interface description language (IDL) 662 Column-oriented access with Parquet 663 A simple example: k-mer counting using Spark and ADAM 665 From Personalized Ads to Personalized Medicine 667 Join In 668 24. Cascading. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669 Fields, Tuples, and Pipes 670 Operations 673 Taps, Schemes, and Flows 675 Cascading in Practice 676 Flexibility 679 Hadoop and Cascading at ShareThis 680 Summary 684 A. Installing Apache Hadoop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685 B. Cloudera’s Distribution Including Apache Hadoop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 691 C. Preparing the NCDC Weather Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 693 D. The Old and New Java MapReduce APIs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 697 Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 701 xvi | Table of Contents
  • 23. Foreword Hadoop got its start in Nutch. A few of us were attempting to build an open source web search engine and having trouble managing computations running on even a handful of computers. Once Google published its GFS and MapReduce papers, the route became clear. They’d devised systems to solve precisely the problems we were having with Nutch. So we started, two of us, half-time, to try to re-create these systems as a part of Nutch. We managed to get Nutch limping along on 20 machines, but it soon became clear that to handle the Web’s massive scale, we’d need to run it on thousands of machines, and moreover, that the job was bigger than two half-time developers could handle. Around that time, Yahoo! got interested, and quickly put together a team that I joined. We split off the distributed computing part of Nutch, naming it Hadoop. With the help of Yahoo!, Hadoop soon grew into a technology that could truly scale to the Web. In 2006, Tom White started contributing to Hadoop. I already knew Tom through an excellent article he’d written about Nutch, so I knew he could present complex ideas in clear prose. I soon learned that he could also develop software that was as pleasant to read as his prose. From the beginning, Tom’s contributions to Hadoop showed his concern for users and for the project. Unlike most open source contributors, Tom is not primarily interested in tweaking the system to better meet his own needs, but rather in making it easier for anyone to use. Initially, Tom specialized in making Hadoop run well on Amazon’s EC2 and S3 services. Then he moved on to tackle a wide variety of problems, including improving the MapReduce APIs, enhancing the website, and devising an object serialization framework. In all cases, Tom presented his ideas precisely. In short order, Tom earned the role of Hadoop committer and soon thereafter became a member of the Hadoop Project Management Committee. xvii
  • 24. Tom is now a respected senior member of the Hadoop developer community. Though he’s an expert in many technical corners of the project, his specialty is making Hadoop easier to use and understand. Given this, I was very pleased when I learned that Tom intended to write a book about Hadoop. Who could be better qualified? Now you have the opportunity to learn about Hadoop from a master—not only of the technology, but also of common sense and plain talk. —Doug Cutting, April 2009 Shed in the Yard, California xviii | Foreword
  • 25. 1. Alex Bellos, “The science of fun,” The Guardian, May 31, 2008. 2. It was added to the Oxford English Dictionary in 2013. Preface Martin Gardner, the mathematics and science writer, once said in an interview: Beyond calculus, I am lost. That was the secret of my column’s success. It took me so long to understand what I was writing about that I knew how to write in a way most readers would understand.1 In many ways, this is how I feel about Hadoop. Its inner workings are complex, resting as they do on a mixture of distributed systems theory, practical engineering, and com‐ mon sense. And to the uninitiated, Hadoop can appear alien. But it doesn’t need to be like this. Stripped to its core, the tools that Hadoop provides for working with big data are simple. If there’s a common theme, it is about raising the level of abstraction—to create building blocks for programmers who have lots of data to store and analyze, and who don’t have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it. With such a simple and generally applicable feature set, it seemed obvious to me when I started using it that Hadoop deserved to be widely used. However, at the time (in early 2006), setting up, configuring, and writing programs to use Hadoop was an art. Things have certainly improved since then: there is more documentation, there are more ex‐ amples, and there are thriving mailing lists to go to when you have questions. And yet the biggest hurdle for newcomers is understanding what this technology is capable of, where it excels, and how to use it. That is why I wrote this book. The Apache Hadoop community has come a long way. Since the publication of the first edition of this book, the Hadoop project has blossomed. “Big data” has become a house‐ hold term.2 In this time, the software has made great leaps in adoption, performance, reliability, scalability, and manageability. The number of things being built and run on the Hadoop platform has grown enormously. In fact, it’s difficult for one person to keep xix
  • 26. track. To gain even wider adoption, I believe we need to make Hadoop even easier to use. This will involve writing more tools; integrating with even more systems; and writing new, improved APIs. I’m looking forward to being a part of this, and I hope this book will encourage and enable others to do so, too. Administrative Notes During discussion of a particular Java class in the text, I often omit its package name to reduce clutter. If you need to know which package a class is in, you can easily look it up in the Java API documentation for Hadoop (linked to from the Apache Hadoop home page), or the relevant project. Or if you’re using an integrated development environment (IDE), its auto-complete mechanism can help find what you’re looking for. Similarly, although it deviates from usual style guidelines, program listings that import multiple classes from the same package may use the asterisk wildcard character to save space (for example, import org.apache.hadoop.io.*). The sample programs in this book are available for download from the book’s website. You will also find instructions there for obtaining the datasets that are used in examples throughout the book, as well as further notes for running the programs in the book and links to updates, additional resources, and my blog. What’s New in the Fourth Edition? The fourth edition covers Hadoop 2 exclusively. The Hadoop 2 release series is the current active release series and contains the most stable versions of Hadoop. There are new chapters covering YARN (Chapter 4), Parquet (Chapter 13), Flume (Chapter 14), Crunch (Chapter 18), and Spark (Chapter 19). There’s also a new section to help readers navigate different pathways through the book (“What’s in This Book?” on page 15). This edition includes two new case studies (Chapters 22 and 23): one on how Hadoop is used in healthcare systems, and another on using Hadoop technologies for genomics data processing. Case studies from the previous editions can now be found online. Many corrections, updates, and improvements have been made to existing chapters to bring them up to date with the latest releases of Hadoop and its related projects. What’s New in the Third Edition? The third edition covers the 1.x (formerly 0.20) release series of Apache Hadoop, as well as the newer 0.22 and 2.x (formerly 0.23) series. With a few exceptions, which are noted in the text, all the examples in this book run against these versions. xx | Preface
  • 27. This edition uses the new MapReduce API for most of the examples. Because the old API is still in widespread use, it continues to be discussed in the text alongside the new API, and the equivalent code using the old API can be found on the book’s website. The major change in Hadoop 2.0 is the new MapReduce runtime, MapReduce 2, which is built on a new distributed resource management system called YARN. This edition includes new sections covering MapReduce on YARN: how it works (Chapter 7) and how to run it (Chapter 10). There is more MapReduce material, too, including development practices such as packaging MapReduce jobs with Maven, setting the user’s Java classpath, and writing tests with MRUnit (all in Chapter 6). In addition, there is more depth on features such as output committers and the distributed cache (both in Chapter 9), as well as task memory monitoring (Chapter 10). There is a new section on writing MapReduce jobs to process Avro data (Chapter 12), and one on running a simple MapReduce workflow in Oozie (Chapter 6). The chapter on HDFS (Chapter 3) now has introductions to high availability, federation, and the new WebHDFS and HttpFS filesystems. The chapters on Pig, Hive, Sqoop, and ZooKeeper have all been expanded to cover the new features and changes in their latest releases. In addition, numerous corrections and improvements have been made throughout the book. What’s New in the Second Edition? The second edition has two new chapters on Sqoop and Hive (Chapters 15 and 17, respectively), a new section covering Avro (in Chapter 12), an introduction to the new security features in Hadoop (in Chapter 10), and a new case study on analyzing massive network graphs using Hadoop. This edition continues to describe the 0.20 release series of Apache Hadoop, because this was the latest stable release at the time of writing. New features from later releases are occasionally mentioned in the text, however, with reference to the version that they were introduced in. Conventions Used in This Book The following typographical conventions are used in this book: Italic Indicates new terms, URLs, email addresses, filenames, and file extensions. Preface | xxi
  • 28. Constant width Used for program listings, as well as within paragraphs to refer to commands and command-line options and to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords. Constant width bold Shows commands or other text that should be typed literally by the user. Constant width italic Shows text that should be replaced with user-supplied values or by values determined by context. This icon signifies a general note. This icon signifies a tip or suggestion. This icon indicates a warning or caution. Using Code Examples Supplemental material (code, examples, exercises, etc.) is available for download at this book’s website and on GitHub. This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission. xxii | Preface
  • 29. We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Hadoop: The Definitive Guide, Fourth Ed‐ ition, by Tom White (O’Reilly). Copyright 2015 Tom White, 978-1-491-90163-2.” If you feel your use of code examples falls outside fair use or the permission given here, feel free to contact us at permissions@oreilly.com. Safari® Books Online Safari Books Online is an on-demand digital library that delivers expert content in both book and video form from the world’s leading authors in technology and business. Technology professionals, software developers, web designers, and business and crea‐ tive professionals use Safari Books Online as their primary resource for research, prob‐ lem solving, learning, and certification training. Safari Books Online offers a range of plans and pricing for enterprise, government, education, and individuals. Members have access to thousands of books, training videos, and prepublication manu‐ scripts in one fully searchable database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones Bartlett, Course Technology, and hundreds more. For more information about Safari Books Online, please visit us online. How to Contact Us Please address comments and questions concerning this book to the publisher: O’Reilly Media, Inc. 1005 Gravenstein Highway North Sebastopol, CA 95472 800-998-9938 (in the United States or Canada) 707-829-0515 (international or local) 707-829-0104 (fax) We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/hadoop_tdg_4e. To comment or ask technical questions about this book, send email to bookquestions@oreilly.com. Preface | xxiii
  • 30. For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com. Find us on Facebook: http://facebook.com/oreilly Follow us on Twitter: http://twitter.com/oreillymedia Watch us on YouTube: http://www.youtube.com/oreillymedia Acknowledgments I have relied on many people, both directly and indirectly, in writing this book. I would like to thank the Hadoop community, from whom I have learned, and continue to learn, a great deal. In particular, I would like to thank Michael Stack and Jonathan Gray for writing the chapter on HBase. Thanks also go to Adrian Woodhead, Marc de Palol, Joydeep Sen Sarma, Ashish Thusoo, Andrzej Białecki, Stu Hood, Chris K. Wensel, and Owen O’Malley for contributing case studies. I would like to thank the following reviewers who contributed many helpful suggestions and improvements to my drafts: Raghu Angadi, Matt Biddulph, Christophe Bisciglia, Ryan Cox, Devaraj Das, Alex Dorman, Chris Douglas, Alan Gates, Lars George, Patrick Hunt, Aaron Kimball, Peter Krey, Hairong Kuang, Simon Maxen, Olga Natkovich, Benjamin Reed, Konstantin Shvachko, Allen Wittenauer, Matei Zaharia, and Philip Zeyliger. Ajay Anand kept the review process flowing smoothly. Philip (“flip”) Kromer kindly helped me with the NCDC weather dataset featured in the examples in this book. Special thanks to Owen O’Malley and Arun C. Murthy for explaining the intricacies of the MapReduce shuffle to me. Any errors that remain are, of course, to be laid at my door. For the second edition, I owe a debt of gratitude for the detailed reviews and feedback from Jeff Bean, Doug Cutting, Glynn Durham, Alan Gates, Jeff Hammerbacher, Alex Kozlov, Ken Krugler, Jimmy Lin, Todd Lipcon, Sarah Sproehnle, Vinithra Varadharajan, and Ian Wrigley, as well as all the readers who submitted errata for the first edition. I would also like to thank Aaron Kimball for contributing the chapter on Sqoop, and Philip (“flip”) Kromer for the case study on graph processing. For the third edition, thanks go to Alejandro Abdelnur, Eva Andreasson, Eli Collins, Doug Cutting, Patrick Hunt, Aaron Kimball, Aaron T. Myers, Brock Noland, Arvind Prabhakar, Ahmed Radwan, and Tom Wheeler for their feedback and suggestions. Rob Weltman kindly gave very detailed feedback for the whole book, which greatly improved the final manuscript. Thanks also go to all the readers who submitted errata for the second edition. xxiv | Preface
  • 31. For the fourth edition, I would like to thank Jodok Batlogg, Meghan Blanchette, Ryan Blue, Jarek Jarcec Cecho, Jules Damji, Dennis Dawson, Matthew Gast, Karthik Kambatla, Julien Le Dem, Brock Noland, Sandy Ryza, Akshai Sarma, Ben Spivey, Michael Stack, Kate Ting, Josh Walter, Josh Wills, and Adrian Woodhead for all of their invaluable review feedback. Ryan Brush, Micah Whitacre, and Matt Massie kindly contributed new case studies for this edition. Thanks again to all the readers who submitted errata. I am particularly grateful to Doug Cutting for his encouragement, support, and friendship, and for contributing the Foreword. Thanks also go to the many others with whom I have had conversations or email discussions over the course of writing the book. Halfway through writing the first edition of this book, I joined Cloudera, and I want to thank my colleagues for being incredibly supportive in allowing me the time to write and to get it finished promptly. I am grateful to my editors, Mike Loukides and Meghan Blanchette, and their colleagues at O’Reilly for their help in the preparation of this book. Mike and Meghan have been there throughout to answer my questions, to read my first drafts, and to keep me on schedule. Finally, the writing of this book has been a great deal of work, and I couldn’t have done it without the constant support of my family. My wife, Eliane, not only kept the home going, but also stepped in to help review, edit, and chase case studies. My daughters, Emilia and Lottie, have been very understanding, and I’m looking forward to spending lots more time with all of them. Preface | xxv
  • 35. 1. These statistics were reported in a study entitled “The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things.” 2. All figures are from 2013 or 2014. For more information, see Tom Groenfeldt, “At NYSE, The Data Deluge Overwhelms Traditional Databases”; Rich Miller, “Facebook Builds Exabyte Data Centers for Cold Storage”; Ancestry.com’s “Company Facts”; Archive.org’s “Petabox”; and the Worldwide LHC Computing Grid project’s welcome page. CHAPTER 1 Meet Hadoop In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers. —Grace Hopper Data! We live in the data age. It’s not easy to measure the total volume of data stored electronically, but an IDC estimate put the size of the “digital universe” at 4.4 zettabytes in 2013 and forecast a tenfold growth by 2020, to 44 zettabytes.1 A zettabyte is 10^21 bytes, or equivalently one thousand exabytes, one million petabytes, or one billion terabytes. That’s more than one disk drive for every person in the world. This flood of data is coming from many sources. Consider the following:2 • The New York Stock Exchange generates about 4-5 terabytes of data per day. • Facebook hosts more than 240 billion photos, growing at 7 petabytes per month. • Ancestry.com, the genealogy site, stores around 10 petabytes of data. • The Internet Archive stores around 18.5 petabytes of data. 3
  • 36. • The Large Hadron Collider near Geneva, Switzerland, produces about 30 petabytes of data per year. So there’s a lot of data out there. But you are probably wondering how it affects you. Most of the data is locked up in the largest web properties (like search engines) or in scientific or financial institutions, isn’t it? Does the advent of big data affect smaller organizations or individuals? I argue that it does. Take photos, for example. My wife’s grandfather was an avid photographer and took photographs throughout his adult life. His entire corpus of medium-format, slide, and 35mm film, when scanned in at high resolution, occupies around 10 gigabytes. Compare this to the digital photos my family took in 2008, which take up about 5 gigabytes of space. My family is producing photographic data at 35 times the rate my wife’s grandfather did, and the rate is increasing every year as it becomes easier to take more and more photos. More generally, the digital streams that individuals are producing are growing apace. Microsoft Research’s MyLifeBits project gives a glimpse of the archiving of personal information that may become commonplace in the near future. MyLifeBits was an experiment where an individual’s interactions—phone calls, emails, documents—were captured electronically and stored for later access. The data gathered included a photo taken every minute, which resulted in an overall data volume of 1 gigabyte per month. When storage costs come down enough to make it feasible to store continuous audio and video, the data volume for a future MyLifeBits service will be many times that. The trend is for every individual’s data footprint to grow, but perhaps more significantly, the amount of data generated by machines as a part of the Internet of Things will be even greater than that generated by people. Machine logs, RFID readers, sensor networks, vehicle GPS traces, retail transactions—all of these contribute to the growing mountain of data. The volume of data being made publicly available increases every year, too. Organizations no longer have to merely manage their own data; success in the future will be dictated to a large extent by their ability to extract value from other organizations’ data. Initiatives such as Public Data Sets on Amazon Web Services and Infochimps.org exist to foster the “information commons,” where data can be freely (or for a modest price) shared for anyone to download and analyze. Mashups between different information sources make for unexpected and hitherto unimaginable applications. Take, for example, the Astrometry.net project, which watches the Astrometry group on Flickr for new photos of the night sky. It analyzes each image and identifies which part of the sky it is from, as well as any interesting celestial bodies, such as stars or galaxies. This project shows the kinds of things that are possible when data (in this case, tagged photographic images) is made available and used for something (image analysis) that was not anticipated by the creator. 4 | Chapter 1: Meet Hadoop
  • 37. 3. The quote is from Anand Rajaraman’s blog post “More data usually beats better algorithms,” in which he writes about the Netflix Challenge. Alon Halevy, Peter Norvig, and Fernando Pereira make the same point in “The Unreasonable Effectiveness of Data,” IEEE Intelligent Systems, March/April 2009. 4. These specifications are for the Seagate ST-41600n. It has been said that “more data usually beats better algorithms,” which is to say that for some problems (such as recommending movies or music based on past preferences), however fiendish your algorithms, often they can be beaten simply by having more data (and a less sophisticated algorithm).3 The good news is that big data is here. The bad news is that we are struggling to store and analyze it. Data Storage and Analysis The problem is simple: although the storage capacities of hard drives have increased massively over the years, access speeds—the rate at which data can be read from drives—have not kept up. One typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s,4 so you could read all the data from a full drive in around five minutes. Over 20 years later, 1-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk. This is a long time to read all data on a single drive—and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once. Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes. Using only one hundredth of a disk may seem wasteful. But we can store 100 datasets, each of which is 1 terabyte, and provide shared access to them. We can imagine that the users of such a system would be happy to share access in return for shorter analysis times, and statistically, that their analysis jobs would be likely to be spread over time, so they wouldn’t interfere with each other too much. There’s more to being able to read and write data in parallel to or from multiple disks, though. The first problem to solve is hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high. A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that in the event of failure, there is another copy available. This is how RAID works, for instance, although Hadoop’s filesystem, the Hadoop Distributed Filesystem (HDFS), takes a slightly different approach, as you shall see later. Data Storage and Analysis | 5
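Before moving on to the second problem, here is a quick back-of-the-envelope check of the scan-time figures above; this is a hypothetical sketch, not code from the book:

    public class DriveScanTimes {
      public static void main(String[] args) {
        double driveBytes = 1e12;        // a 1-terabyte drive (10^12 bytes)
        double transferRate = 100e6;     // roughly 100 MB/s sustained transfer rate
        double serialSeconds = driveBytes / transferRate;
        double parallelSeconds = serialSeconds / 100;   // 100 drives, each holding 1/100 of the data
        System.out.printf("Serial scan:   %.1f hours%n", serialSeconds / 3600);   // ~2.8 hours
        System.out.printf("Parallel scan: %.1f minutes%n", parallelSeconds / 60); // ~1.7 minutes
      }
    }

The hundredfold speedup is simply the ratio of drives, which is the first half of the Hadoop story; handling failures and combining the partial results is the other half.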
  • 38. The second problem is that most analysis tasks need to be able to combine the data in some way, and data read from one disk may need to be combined with data from any of the other 99 disks. Various distributed systems allow data to be combined from mul‐ tiple sources, but doing this correctly is notoriously challenging. MapReduce provides a programming model that abstracts the problem from disk reads and writes, trans‐ forming it into a computation over sets of keys and values. We look at the details of this model in later chapters, but the important point for the present discussion is that there are two parts to the computation—the map and the reduce—and it’s the interface be‐ tween the two where the “mixing” occurs. Like HDFS, MapReduce has built-in reliability. In a nutshell, this is what Hadoop provides: a reliable, scalable platform for storage and analysis. What’s more, because it runs on commodity hardware and is open source, Hadoop is affordable. Querying All Your Data The approach taken by MapReduce may seem like a brute-force approach. The premise is that the entire dataset—or at least a good portion of it—can be processed for each query. But this is its power. MapReduce is a batch query processor, and the ability to run an ad hoc query against your whole dataset and get the results in a reasonable time is transformative. It changes the way you think about data and unlocks data that was previously archived on tape or disk. It gives people the opportunity to innovate with data. Questions that took too long to get answered before can now be answered, which in turn leads to new questions and new insights. For example, Mailtrust, Rackspace’s mail division, used Hadoop for processing email logs. One ad hoc query they wrote was to find the geographic distribution of their users. In their words: This data was so useful that we’ve scheduled the MapReduce job to run monthly and we will be using this data to help us decide which Rackspace data centers to place new mail servers in as we grow. By bringing several hundred gigabytes of data together and having the tools to analyze it, the Rackspace engineers were able to gain an understanding of the data that they otherwise would never have had, and furthermore, they were able to use what they had learned to improve the service for their customers. Beyond Batch For all its strengths, MapReduce is fundamentally a batch processing system, and is not suitable for interactive analysis. You can’t run a query and get results back in a few seconds or less. Queries typically take minutes or more, so it’s best for offline use, where there isn’t a human sitting in the processing loop waiting for results. 6 | Chapter 1: Meet Hadoop
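To make the batch-query model concrete, here is a minimal sketch of the map and reduce functions for a job in the spirit of the Rackspace example above, written against Hadoop’s Java MapReduce API. The tab-separated log layout and the field position are invented for illustration and are not the book’s code; Chapter 2 develops the real thing.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: extract a region code from each log line and emit (region, 1).
    public class RegionCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text region = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fields.length > 3) {          // hypothetical layout: 4th column holds the region
          region.set(fields[3]);
          context.write(region, ONE);
        }
      }
    }

    // Reduce: sum the counts for each region to get its share of the user base.
    class RegionCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
          sum += count.get();
        }
        context.write(key, new IntWritable(sum));
      }
    }

Run over the full set of logfiles, the map step tags every record with a region and the reduce step totals them, which is exactly the whole-dataset scan described above.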
  • 39. However, since its original incarnation, Hadoop has evolved beyond batch processing. Indeed, the term “Hadoop” is sometimes used to refer to a larger ecosystem of projects, not just HDFS and MapReduce, that fall under the umbrella of infrastructure for dis‐ tributed computing and large-scale data processing. Many of these are hosted by the Apache Software Foundation, which provides support for a community of open source software projects, including the original HTTP Server from which it gets its name. The first component to provide online access was HBase, a key-value store that uses HDFS for its underlying storage. HBase provides both online read/write access of in‐ dividual rows and batch operations for reading and writing data in bulk, making it a good solution for building applications on. The real enabler for new processing models in Hadoop was the introduction of YARN (which stands for Yet Another Resource Negotiator) in Hadoop 2. YARN is a cluster resource management system, which allows any distributed program (not just MapRe‐ duce) to run on data in a Hadoop cluster. In the last few years, there has been a flowering of different processing patterns that work with Hadoop. Here is a sample: Interactive SQL By dispensing with MapReduce and using a distributed query engine that uses dedicated “always on” daemons (like Impala) or container reuse (like Hive on Tez), it’s possible to achieve low-latency responses for SQL queries on Hadoop while still scaling up to large dataset sizes. Iterative processing Many algorithms—such as those in machine learning—are iterative in nature, so it’s much more efficient to hold each intermediate working set in memory, com‐ pared to loading from disk on each iteration. The architecture of MapReduce does not allow this, but it’s straightforward with Spark, for example, and it enables a highly exploratory style of working with datasets. Stream processing Streaming systems like Storm, Spark Streaming, or Samza make it possible to run real-time, distributed computations on unbounded streams of data and emit results to Hadoop storage or external systems. Search The Solr search platform can run on a Hadoop cluster, indexing documents as they are added to HDFS, and serving search queries from indexes stored in HDFS. Despite the emergence of different processing frameworks on Hadoop, MapReduce still has a place for batch processing, and it is useful to understand how it works since it introduces several concepts that apply more generally (like the idea of input formats, or how a dataset is split into pieces). Beyond Batch | 7
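The iterative-processing pattern above is easiest to see in code. The following is a small, hypothetical Spark sketch in Java, not taken from the book (Chapter 19 covers Spark properly): the dataset is cached in memory once and then reread on every pass of a toy refinement loop, which is the access pattern that a job-per-iteration MapReduce workflow handles poorly.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class IterativeMeanSketch {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("IterativeMeanSketch").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load (here: a tiny in-memory list) and cache it, so later passes read from memory.
        JavaRDD<Double> data = sc.parallelize(Arrays.asList(1.0, 2.0, 3.0, 4.0)).cache();
        long n = data.count();

        // Toy iterative refinement: nudge an estimate toward the mean on each pass.
        double estimate = 0.0;
        for (int i = 0; i < 10; i++) {
          final double current = estimate;
          double avgError = data.map(x -> x - current).reduce((a, b) -> a + b) / n;
          estimate = current + 0.5 * avgError;   // each pass rereads the cached dataset
        }
        System.out.println("estimate: " + estimate);
        sc.stop();
      }
    }

Keeping the working set resident between passes is what makes machine learning and other iterative algorithms practical on a Hadoop cluster.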
  • 40. 5. In January 2007, David J. DeWitt and Michael Stonebraker caused a stir by publishing “MapReduce: A major step backwards,” in which they criticized MapReduce for being a poor substitute for relational databases. Many commentators argued that it was a false comparison (see, for example, Mark C. Chu-Carroll’s “Databases are hammers; MapReduce is a screwdriver”), and DeWitt and Stonebraker followed up with “MapReduce II,” where they addressed the main topics brought up by others. Comparison with Other Systems Hadoop isn’t the first distributed system for data storage and analysis, but it has some unique properties that set it apart from other systems that may seem similar. Here we look at some of them. Relational Database Management Systems Why can’t we use databases with lots of disks to do large-scale analysis? Why is Hadoop needed? The answer to these questions comes from another trend in disk drives: seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth. If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate. On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate at which it can perform seeks) works well. For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database. In many ways, MapReduce can be seen as a complement to a Relational Database Management System (RDBMS). (The differences between the two systems are shown in Table 1-1.) MapReduce is a good fit for problems that need to analyze the whole dataset in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times of a relatively small amount of data. MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated.5
Table 1-1. RDBMS compared to MapReduce
                 Traditional RDBMS             MapReduce
  Data size      Gigabytes                     Petabytes
  Access         Interactive and batch         Batch
  Updates        Read and write many times     Write once, read many times
  Transactions   ACID                          None
  Structure      Schema-on-write               Schema-on-read
  Integrity      High                          Low
  Scaling        Nonlinear                     Linear
8 | Chapter 1: Meet Hadoop
  • 41. However, the differences between relational databases and Hadoop systems are blurring. Relational databases have started incorporating some of the ideas from Hadoop, and from the other direction, Hadoop systems such as Hive are becoming more interactive (by moving away from MapReduce) and adding features like indexes and transactions that make them look more and more like traditional RDBMSs. Another difference between Hadoop and an RDBMS is the amount of structure in the datasets on which they operate. Structured data is organized into entities that have a defined format, such as XML documents or database tables that conform to a particular predefined schema. This is the realm of the RDBMS. Semi-structured data, on the other hand, is looser, and though there may be a schema, it is often ignored, so it may be used only as a guide to the structure of the data: for example, a spreadsheet, in which the structure is the grid of cells, although the cells themselves may hold any form of data. Unstructured data does not have any particular internal structure: for example, plain text or image data. Hadoop works well on unstructured or semi-structured data because it is designed to interpret the data at processing time (so called schema-on-read). This provides flexibility and avoids the costly data loading phase of an RDBMS, since in Hadoop it is just a file copy. Relational data is often normalized to retain its integrity and remove redundancy. Normalization poses problems for Hadoop processing because it makes reading a record a nonlocal operation, and one of the central assumptions that Hadoop makes is that it is possible to perform (high-speed) streaming reads and writes. A web server log is a good example of a set of records that is not normalized (for example, the client hostnames are specified in full each time, even though the same client may appear many times), and this is one reason that logfiles of all kinds are particularly well suited to analysis with Hadoop. Note that Hadoop can perform joins; it’s just that they are not used as much as in the relational world. MapReduce—and the other processing models in Hadoop—scales linearly with the size of the data. Data is partitioned, and the functional primitives (like map and reduce) can work in parallel on separate partitions. This means that if you double the size of the input data, a job will run twice as slowly. But if you also double the size of the cluster, a job will run as fast as the original one. This is not generally true of SQL queries. Comparison with Other Systems | 9
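To make the schema-on-read point above concrete, here is a minimal sketch (not from the book; the access-log layout and field positions are made-up assumptions) in which structure is imposed on a raw web server log line only at processing time:

    // Schema-on-read: the raw line is stored as plain text; no schema is declared up front.
    // The fields only acquire meaning when the processing code chooses how to interpret them.
    public class SchemaOnReadSketch {
      public static void main(String[] args) {
        // Hypothetical space-separated access-log line, invented for illustration
        String line = "203.0.113.7 - - [12/Mar/2015:06:25:24] \"GET /index.html HTTP/1.1\" 200 5123";
        String[] fields = line.split(" ");
        String clientHost = fields[0];                          // interpretation happens here, at read time
        int statusCode = Integer.parseInt(fields[fields.length - 2]);
        long bytesSent = Long.parseLong(fields[fields.length - 1]);
        System.out.println(clientHost + " -> " + statusCode + " (" + bytesSent + " bytes)");
      }
    }

If the log format changes, only the reading code changes; nothing has to be reloaded or re-indexed, which is the flexibility the text contrasts with an RDBMS’s load-time schema.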
  • 42. 6. Jim Gray was an early advocate of putting the computation near the data. See “Distributed Computing Economics,” March 2003. Grid Computing The high-performance computing (HPC) and grid computing communities have been doing large-scale data processing for years, using such application program interfaces (APIs) as the Message Passing Interface (MPI). Broadly, the approach in HPC is to distribute the work across a cluster of machines, which access a shared filesystem, hosted by a storage area network (SAN). This works well for predominantly compute-intensive jobs, but it becomes a problem when nodes need to access larger data volumes (hundreds of gigabytes, the point at which Hadoop really starts to shine), since the network bandwidth is the bottleneck and compute nodes become idle. Hadoop tries to co-locate the data with the compute nodes, so data access is fast because it is local.6 This feature, known as data locality, is at the heart of data processing in Hadoop and is the reason for its good performance. Recognizing that network bandwidth is the most precious resource in a data center environment (it is easy to saturate network links by copying data around), Hadoop goes to great lengths to conserve it by explicitly modeling network topology. Notice that this arrangement does not preclude high-CPU analyses in Hadoop. MPI gives great control to programmers, but it requires that they explicitly handle the mechanics of the data flow, exposed via low-level C routines and constructs such as sockets, as well as the higher-level algorithms for the analyses. Processing in Hadoop operates only at the higher level: the programmer thinks in terms of the data model (such as key-value pairs for MapReduce), while the data flow remains implicit. Coordinating the processes in a large-scale distributed computation is a challenge. The hardest aspect is gracefully handling partial failure—when you don’t know whether or not a remote process has failed—and still making progress with the overall computation. Distributed processing frameworks like MapReduce spare the programmer from having to think about failure, since the implementation detects failed tasks and reschedules replacements on machines that are healthy. MapReduce is able to do this because it is a shared-nothing architecture, meaning that tasks have no dependence on one another. (This is a slight oversimplification, since the output from mappers is fed to the reducers, but this is under the control of the MapReduce system; in this case, it needs to take more care rerunning a failed reducer than rerunning a failed map, because it has to make sure it can retrieve the necessary map outputs and, if not, regenerate them by running the relevant maps again.) So from the programmer’s point of view, the order in which the tasks run doesn’t matter. By contrast, MPI programs have to explicitly manage their own checkpointing and recovery, which gives more control to the programmer but makes them more difficult to write. 10 | Chapter 1: Meet Hadoop
  • 43. 7. In January 2008, SETI@home was reported to be processing 300 gigabytes a day, using 320,000 computers (most of which are not dedicated to SETI@home; they are used for other things, too). Volunteer Computing When people first hear about Hadoop and MapReduce they often ask, “How is it dif‐ ferent from SETI@home?” SETI, the Search for Extra-Terrestrial Intelligence, runs a project called SETI@home in which volunteers donate CPU time from their otherwise idle computers to analyze radio telescope data for signs of intelligent life outside Earth. SETI@home is the most well known of many volunteer computing projects; others in‐ clude the Great Internet Mersenne Prime Search (to search for large prime numbers) and Folding@home (to understand protein folding and how it relates to disease). Volunteer computing projects work by breaking the problems they are trying to solve into chunks called work units, which are sent to computers around the world to be analyzed. For example, a SETI@home work unit is about 0.35 MB of radio telescope data, and takes hours or days to analyze on a typical home computer. When the analysis is completed, the results are sent back to the server, and the client gets another work unit. As a precaution to combat cheating, each work unit is sent to three different ma‐ chines and needs at least two results to agree to be accepted. Although SETI@home may be superficially similar to MapReduce (breaking a problem into independent pieces to be worked on in parallel), there are some significant differ‐ ences. The SETI@home problem is very CPU-intensive, which makes it suitable for running on hundreds of thousands of computers across the world7 because the time to transfer the work unit is dwarfed by the time to run the computation on it. Volunteers are donating CPU cycles, not bandwidth. Comparison with Other Systems | 11
  • 44. 8. In this book, we use the lowercase form, “namenode,” to denote the entity when it’s being referred to generally, and the CamelCase form NameNode to denote the Java class that implements it. 9. See Mike Cafarella and Doug Cutting, “Building Nutch: Open Source Search,” ACM Queue, April 2004. MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate bandwidth interconnects. By contrast, SETI@home runs a perpetual computation on untrusted machines on the Internet with highly variable connection speeds and no data locality. A Brief History of Apache Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project. The Origin of the Name “Hadoop” The name Hadoop is not an acronym; it’s a made-up name. The project’s creator, Doug Cutting, explains how the name came about: The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid’s term. Projects in the Hadoop ecosystem also tend to have names that are unrelated to their function, often with an elephant or other animal theme (“Pig,” for example). Smaller components are given more descriptive (and therefore more mundane) names. This is a good principle, as it means you can generally work out what something does from its name. For example, the namenode8 manages the filesystem namespace. Building a web search engine from scratch was an ambitious goal, for not only is the software required to crawl and index websites complex to write, but it is also a challenge to run without a dedicated operations team, since there are so many moving parts. It’s expensive, too: Mike Cafarella and Doug Cutting estimated a system supporting a one-billion-page index would cost around $500,000 in hardware, with a monthly running cost of $30,000.9 Nevertheless, they believed it was a worthy goal, as it would open up and ultimately democratize search engine algorithms. Nutch was started in 2002, and a working crawler and search system quickly emerged. However, its creators realized that their architecture wouldn’t scale to the billions of pages on the Web. Help was at hand with the publication of a paper in 2003 that described the architecture of Google’s distributed filesystem, called GFS, which was being used in 12 | Chapter 1: Meet Hadoop
  • 45. 10. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, “The Google File System,” October 2003. 11. Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” December 2004. 12. “Yahoo! Launches World’s Largest Hadoop Production Application,” February 19, 2008. production at Google.10 GFS, or something like it, would solve their storage needs for the very large files generated as a part of the web crawl and indexing process. In particular, GFS would free up time being spent on administrative tasks such as managing storage nodes. In 2004, Nutch’s developers set about writing an open source implementation, the Nutch Distributed Filesystem (NDFS). In 2004, Google published the paper that introduced MapReduce to the world.11 Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year all the major Nutch algorithms had been ported to run using MapReduce and NDFS. NDFS and the MapReduce implementation in Nutch were applicable beyond the realm of search, and in February 2006 they moved out of Nutch to form an independent subproject of Lucene called Hadoop. At around the same time, Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale (see the following sidebar). This was demonstrated in February 2008 when Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster.12 Hadoop at Yahoo! Building Internet-scale search engines requires huge amounts of data and therefore large numbers of machines to process it. Yahoo! Search consists of four primary components: the Crawler, which downloads pages from web servers; the WebMap, which builds a graph of the known Web; the Indexer, which builds a reverse index to the best pages; and the Runtime, which answers users’ queries. The WebMap is a graph that consists of roughly 1 trillion (10^12) edges, each representing a web link, and 100 billion (10^11) nodes, each representing distinct URLs. Creating and analyzing such a large graph requires a large number of computers running for many days. In early 2005, the infrastructure for the WebMap, named Dreadnaught, needed to be redesigned to scale up to more nodes. Dreadnaught had successfully scaled from 20 to 600 nodes, but required a complete redesign to scale out further. Dreadnaught is similar to MapReduce in many ways, but provides more flexibility and less structure. In particular, each fragment in a Dreadnaught job could send output to each of the fragments in the next stage of the job, but the sort was all done in library code. In practice, most of the WebMap phases were pairs that corresponded to MapReduce. Therefore, the WebMap applications would not require extensive refactoring to fit into MapReduce. A Brief History of Apache Hadoop | 13
Eric Baldeschwieler (aka Eric14) created a small team, and we started designing and prototyping a new framework, written in C++ and modeled after GFS and MapReduce, to replace Dreadnaught. Although the immediate need was for a new framework for WebMap, it was clear that standardization of the batch platform across Yahoo! Search was critical and that by making the framework general enough to support other users, we could better leverage investment in the new platform.

At the same time, we were watching Hadoop, which was part of Nutch, and its progress. In January 2006, Yahoo! hired Doug Cutting, and a month later we decided to abandon our prototype and adopt Hadoop. The advantage of Hadoop over our prototype and design was that it was already working with a real application (Nutch) on 20 nodes. That allowed us to bring up a research cluster two months later and start helping real customers use the new framework much sooner than we could have otherwise. Another advantage, of course, was that since Hadoop was already open source, it was easier (although far from easy!) to get permission from Yahoo!'s legal department to work in open source. So, we set up a 200-node cluster for the researchers in early 2006 and put the WebMap conversion plans on hold while we supported and improved Hadoop for the research users.

—Owen O'Malley, 2009

In January 2008, Hadoop was made its own top-level project at Apache, confirming its success and its diverse, active community. By this time, Hadoop was being used by many other companies besides Yahoo!, such as Last.fm, Facebook, and the New York Times. In one well-publicized feat, the New York Times used Amazon's EC2 compute cloud to crunch through 4 terabytes of scanned archives from the paper, converting them to PDFs for the Web.13 The processing took less than 24 hours to run using 100 machines, and the project probably wouldn't have been embarked upon without the combination of Amazon's pay-by-the-hour model (which allowed the NYT to access a large number of machines for a short period) and Hadoop's easy-to-use parallel programming model.

13. Derek Gottfrid, "Self-Service, Prorated Super Computing Fun!" November 1, 2007.

In April 2008, Hadoop broke a world record to become the fastest system to sort an entire terabyte of data. Running on a 910-node cluster, Hadoop sorted 1 terabyte in 209 seconds (just under 3.5 minutes), beating the previous year's winner of 297 seconds.14 In November of the same year, Google reported that its MapReduce implementation sorted 1 terabyte in 68 seconds.15 Then, in April 2009, it was announced that a team at Yahoo! had used Hadoop to sort 1 terabyte in 62 seconds.16

14. Owen O'Malley, "TeraByte Sort on Apache Hadoop," May 2008.
15. Grzegorz Czajkowski, "Sorting 1PB with MapReduce," November 21, 2008.
16. Owen O'Malley and Arun C. Murthy, "Winning a 60 Second Dash with a Yellow Elephant," April 2009.
The trend since then has been to sort even larger volumes of data at ever faster rates. In the 2014 competition, a team from Databricks were joint winners of the Gray Sort benchmark. They used a 207-node Spark cluster to sort 100 terabytes of data in 1,406 seconds, a rate of 4.27 terabytes per minute.17

17. Reynold Xin et al., "GraySort on Apache Spark by Databricks," November 2014.

Today, Hadoop is widely used in mainstream enterprises. Hadoop's role as a general-purpose storage and analysis platform for big data has been recognized by the industry, and this fact is reflected in the number of products that use or incorporate Hadoop in some way. Commercial Hadoop support is available from large, established enterprise vendors, including EMC, IBM, Microsoft, and Oracle, as well as from specialist Hadoop companies such as Cloudera, Hortonworks, and MapR.

What's in This Book?

The book is divided into five main parts: Parts I to III are about core Hadoop, Part IV covers related projects in the Hadoop ecosystem, and Part V contains Hadoop case studies. You can read the book from cover to cover, but there are alternative pathways through the book that allow you to skip chapters that aren't needed to read later ones. See Figure 1-1.

Part I is made up of five chapters that cover the fundamental components in Hadoop and should be read before tackling later chapters. Chapter 1 (this chapter) is a high-level introduction to Hadoop. Chapter 2 provides an introduction to MapReduce. Chapter 3 looks at Hadoop filesystems, and in particular HDFS, in depth. Chapter 4 discusses YARN, Hadoop's cluster resource management system. Chapter 5 covers the I/O building blocks in Hadoop: data integrity, compression, serialization, and file-based data structures.

Part II has four chapters that cover MapReduce in depth. They provide useful understanding for later chapters (such as the data processing chapters in Part IV), but could be skipped on a first reading. Chapter 6 goes through the practical steps needed to develop a MapReduce application. Chapter 7 looks at how MapReduce is implemented in Hadoop, from the point of view of a user. Chapter 8 is about the MapReduce programming model and the various data formats that MapReduce can work with. Chapter 9 is on advanced MapReduce topics, including sorting and joining data.

Part III concerns the administration of Hadoop: Chapters 10 and 11 describe how to set up and maintain a Hadoop cluster running HDFS and MapReduce on YARN.

Part IV of the book is dedicated to projects that build on Hadoop or are closely related to it. Each chapter covers one project and is largely independent of the other chapters in this part, so they can be read in any order.
The first two chapters in this part are about data formats. Chapter 12 looks at Avro, a cross-language data serialization library for Hadoop, and Chapter 13 covers Parquet, an efficient columnar storage format for nested data.

The next two chapters look at data ingestion, or how to get your data into Hadoop. Chapter 14 is about Flume, for high-volume ingestion of streaming data. Chapter 15 is about Sqoop, for efficient bulk transfer of data between structured data stores (like relational databases) and HDFS.

The common theme of the next four chapters is data processing, and in particular using higher-level abstractions than MapReduce. Pig (Chapter 16) is a data flow language for exploring very large datasets. Hive (Chapter 17) is a data warehouse for managing data stored in HDFS and provides a query language based on SQL. Crunch (Chapter 18) is a high-level Java API for writing data processing pipelines that can run on MapReduce or Spark. Spark (Chapter 19) is a cluster computing framework for large-scale data processing; it provides a directed acyclic graph (DAG) engine, and APIs in Scala, Java, and Python.

Chapter 20 is an introduction to HBase, a distributed column-oriented real-time database that uses HDFS for its underlying storage. And Chapter 21 is about ZooKeeper, a distributed, highly available coordination service that provides useful primitives for building distributed applications.

Finally, Part V is a collection of case studies contributed by people using Hadoop in interesting ways.

Supplementary information about Hadoop, such as how to install it on your machine, can be found in the appendixes.
Figure 1-1. Structure of the book: there are various pathways through the content
CHAPTER 2
MapReduce

MapReduce is a programming model for data processing. The model is simple, yet not too simple to express useful programs in. Hadoop can run MapReduce programs written in various languages; in this chapter, we look at the same program expressed in Java, Ruby, and Python. Most importantly, MapReduce programs are inherently parallel, thus putting very large-scale data analysis into the hands of anyone with enough machines at their disposal. MapReduce comes into its own for large datasets, so let's start by looking at one.

A Weather Dataset

For our example, we will write a program that mines weather data. Weather sensors collect data every hour at many locations across the globe and gather a large volume of log data, which is a good candidate for analysis with MapReduce because we want to process all the data, and the data is semi-structured and record-oriented.

Data Format

The data we will use is from the National Climatic Data Center, or NCDC. The data is stored using a line-oriented ASCII format, in which each line is a record. The format supports a rich set of meteorological elements, many of which are optional or with variable data lengths. For simplicity, we focus on the basic elements, such as temperature, which are always present and are of fixed width.

Example 2-1 shows a sample line with some of the salient fields annotated. The line has been split into multiple lines to show each field; in the real file, fields are packed into one line with no delimiters.
Example 2-1. Format of a National Climatic Data Center record

    0057
    332130   # USAF weather station identifier
    99999    # WBAN weather station identifier
    19500101 # observation date
    0300     # observation time
    4
    +51317   # latitude (degrees x 1000)
    +028783  # longitude (degrees x 1000)
    FM-12
    +0171    # elevation (meters)
    99999
    V020
    320      # wind direction (degrees)
    1        # quality code
    N
    0072
    1
    00450    # sky ceiling height (meters)
    1        # quality code
    C
    N
    010000   # visibility distance (meters)
    1        # quality code
    N
    9
    -0128    # air temperature (degrees Celsius x 10)
    1        # quality code
    -0139    # dew point temperature (degrees Celsius x 10)
    1        # quality code
    10268    # atmospheric pressure (hectopascals x 10)
    1        # quality code
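Since every field is fixed width, a value can be recovered from a record purely by character position. The following fragment is not from the book's examples; it is a minimal Java sketch, and the column offsets it uses are assumptions read off Example 2-1 and the awk script later in this chapter (observation date starting at column 16, air temperature in columns 88-92, quality code in column 93).

    public class NcdcRecordSketch {
        public static void main(String[] args) {
            // One NCDC record: a single long line, fields packed with no delimiters.
            String record = args[0];

            // Assumed offsets; Java's substring is 0-based and end-exclusive.
            String year = record.substring(15, 19);                           // columns 16-19 of the observation date
            int airTemperature = Integer.parseInt(record.substring(87, 92));  // columns 88-92, tenths of a degree Celsius
            char quality = record.charAt(92);                                 // column 93, quality code

            // 9999 marks a missing reading; codes 0, 1, 4, 5, and 9 mark acceptable quality.
            boolean usable = airTemperature != 9999 && "01459".indexOf(quality) >= 0;

            System.out.printf("year=%s temperature=%.1f usable=%b%n",
                    year, airTemperature / 10.0, usable);
        }
    }

Because the temperature is stored as tenths of a degree Celsius, the sample value -0128 in Example 2-1 corresponds to -12.8°C.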
Data files are organized by date and weather station. There is a directory for each year from 1901 to 2001, each containing a gzipped file for each weather station with its readings for that year. For example, here are the first entries for 1990:

    % ls raw/1990 | head
    010010-99999-1990.gz
    010014-99999-1990.gz
    010015-99999-1990.gz
    010016-99999-1990.gz
    010017-99999-1990.gz
    010030-99999-1990.gz
    010040-99999-1990.gz
    010080-99999-1990.gz
    010100-99999-1990.gz
    010150-99999-1990.gz

There are tens of thousands of weather stations, so the whole dataset is made up of a large number of relatively small files. It's generally easier and more efficient to process a smaller number of relatively large files, so the data was preprocessed so that each year's readings were concatenated into a single file. (The means by which this was carried out is described in Appendix C.)

Analyzing the Data with Unix Tools

What's the highest recorded global temperature for each year in the dataset? We will answer this first without using Hadoop, as this information will provide a performance baseline and a useful means to check our results.

The classic tool for processing line-oriented data is awk. Example 2-2 is a small script to calculate the maximum temperature for each year.

Example 2-2. A program for finding the maximum recorded temperature by year from NCDC weather records

    #!/usr/bin/env bash
    for year in all/*
    do
      echo -ne `basename $year .gz`"\t"
      gunzip -c $year | \
        awk '{ temp = substr($0, 88, 5) + 0;
               q = substr($0, 93, 1);
               if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
             END { print max }'
    done

The script loops through the compressed year files, first printing the year, and then processing each file using awk. The awk script extracts two fields from the data: the air temperature and the quality code. The air temperature value is turned into an integer by adding 0. Next, a test is applied to see whether the temperature is valid (the value 9999 signifies a missing value in the NCDC dataset) and whether the quality code indicates that the reading is not suspect or erroneous. If the reading is OK, the value is compared with the maximum value seen so far, which is updated if a new maximum is found. The END block is executed after all the lines in the file have been processed, and it prints the maximum value.

Here is the beginning of a run:

    % ./max_temperature.sh
    1901    317
    1902    244
    1903    289
    1904    256
    1905    283
    ...

The temperature values in the source file are scaled by a factor of 10, so this works out as a maximum temperature of 31.7°C for 1901 (there were very few readings at the beginning of the century, so this is plausible). The complete run for the century took 42 minutes in one run on a single EC2 High-CPU Extra Large instance.

To speed up the processing, we need to run parts of the program in parallel. In theory, this is straightforward: we could process different years in different processes, using all the available hardware threads on a machine. There are a few problems with this, however.

First, dividing the work into equal-size pieces isn't always easy or obvious. In this case, the file size for different years varies widely, so some processes will finish much earlier than others. Even if they pick up further work, the whole run is dominated by the longest file. A better approach, although one that requires more work, is to split the input into fixed-size chunks and assign each chunk to a process.

Second, combining the results from independent processes may require further processing. In this case, the result for each year is independent of other years, and they may be combined by concatenating all the results and sorting by year. If using the fixed-size chunk approach, the combination is more delicate. For this example, data for a particular year will typically be split into several chunks, each processed independently. We'll end up with the maximum temperature for each chunk, so the final step is to look for the highest of these maximums for each year (see the sketch at the end of this section).

Third, you are still limited by the processing capacity of a single machine. If the best time you can achieve is 20 minutes with the number of processors you have, then that's it. You can't make it go faster. Also, some datasets grow beyond the capacity of a single machine. When we start using multiple machines, a whole host of other factors come into play, mainly falling into the categories of coordination and reliability. Who runs the overall job? How do we deal with failed processes?

So, although it's feasible to parallelize the processing, in practice it's messy. Using a framework like Hadoop to take care of these issues is a great help.
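To make the combining problem concrete, here is a minimal sketch (not code from this book) of the merge step described above. It assumes each worker process has already produced a map from year to the maximum temperature it saw in its chunk; the class and method names are invented for the illustration.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class CombineChunkMaxima {
        // Merge per-chunk results: for every year, keep the highest maximum seen in any chunk.
        public static Map<String, Integer> combine(List<Map<String, Integer>> perChunkMaxima) {
            Map<String, Integer> result = new HashMap<>();
            for (Map<String, Integer> chunk : perChunkMaxima) {
                for (Map.Entry<String, Integer> entry : chunk.entrySet()) {
                    result.merge(entry.getKey(), entry.getValue(), Integer::max);
                }
            }
            return result;
        }
    }

Even this small piece of glue has to be written, scheduled, and checked by hand, which is exactly the kind of housekeeping a framework takes off your hands.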
Analyzing the Data with Hadoop

To take advantage of the parallel processing that Hadoop provides, we need to express our query as a MapReduce job. After some local, small-scale testing, we will be able to run it on a cluster of machines.

Map and Reduce

MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function.
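Before turning to the real API, it may help to see the shape of these two functions for the maximum-temperature query. The sketch below is a framework-free illustration, not the book's code, and it reuses the record offsets assumed earlier in this chapter.

    import java.util.AbstractMap;
    import java.util.Map;

    public class MaxTemperatureSketch {
        // Map: one raw NCDC record -> a (year, temperature) pair, or null when the
        // reading is missing (9999) or its quality code is not one of 0, 1, 4, 5, 9.
        static Map.Entry<String, Integer> map(String record) {
            String year = record.substring(15, 19);
            int temp = Integer.parseInt(record.substring(87, 92));
            char quality = record.charAt(92);
            if (temp == 9999 || "01459".indexOf(quality) < 0) {
                return null; // a real map function would simply emit nothing here
            }
            return new AbstractMap.SimpleEntry<>(year, temp);
        }

        // Reduce: all temperatures grouped under one year -> the maximum for that year.
        static int reduce(String year, Iterable<Integer> temperatures) {
            int max = Integer.MIN_VALUE;
            for (int t : temperatures) {
                max = Math.max(max, t);
            }
            return max;
        }
    }

Grouping every pair emitted by the map function under its key and handing each group to the reduce function is the framework's job, which is what lets the same two small functions scale from one machine to many.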