This document provides an overview of Hadoop infrastructure and related technologies:
- Hadoop is Apache's open-source implementation of Google's MapReduce and the Google File System (GFS), and it runs on the Java VM. It allows reading, writing, and processing very large datasets using sequential writes over large, block-based files in HDFS. (The Apache implementation of Google's BigTable is a separate project, HBase.)
- HDFS is the distributed file system underlying Hadoop; it handles block replication, fault tolerance, and node management. Technologies like HBase build on top of HDFS to add random, real-time read/write access rather than replacing it.
- Tools like Hive, Pig, and Sqoop help connect to and utilize Hadoop, each with a different purpose: Hive provides a SQL-like data warehouse layer (HiveQL), Pig uses its own dataflow language (Pig Latin), and Sqoop transfers data between Hadoop and relational databases. Cassandra is a separate distributed database that can integrate with Hadoop for analytics workloads.
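The map/shuffle/reduce processing model that Hadoop applies across HDFS blocks can be sketched in plain Python. This is a minimal illustration of the pattern only, with no Hadoop involved; the function names are hypothetical:

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input record
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group values by key, as Hadoop does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts emitted for each word
    return {word: sum(counts) for word, counts in groups.items()}

# Two "records", standing in for lines of a file split across HDFS blocks
lines = ["big data on hadoop", "hadoop stores big files"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"], counts["hadoop"])  # prints: 2 2
```

In real Hadoop, the map and reduce functions run as distributed tasks on the nodes holding the data, and the framework performs the shuffle over the network; higher-level tools like Hive and Pig compile their queries down to jobs of this shape.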
Related topics: