Big Data and Hadoop Introduction
- Objectives
- Contents:
• Big data
• Apache Hadoop
• Examples using Hadoop
- Demo
- Q&A
- References
Objectives
• Big data overview.
• Apache Hadoop common architecture:
– Reading/writing a file in the Hadoop Distributed File System (HDFS)
– How Hadoop MapReduce tasks work
– Differences between Hadoop 1 and Hadoop 2
• Develop a MapReduce job using Hadoop
• Apply Hadoop in the real world
Big data – Information explosion
Big data – Definition
“Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation”
– Gartner
Big data – The 3Vs
• Volume:
– Google receives over 2 million search queries every minute.
– Transactional and sensor data are stored every fraction of a second.
• Variety:
– YouTube and Facebook generate video, audio, image and text data.
– Over 200 million emails are sent every minute.
• Velocity:
– Experiments at CERN generate colossal amounts of data.
– Particles collide 600 million times per second.
– Their data center processes about one petabyte of data every day.
Big data – Challenges
• Difficulty identifying the right data and determining how best to use it.
• Struggling to find the right talent.
• Data access and connectivity obstacles.
• The data technology landscape is evolving extremely fast.
• Finding new ways of collaborating across functions and businesses.
• Security concerns.
Big data – Landscape
Big data – Plays a part in a firm’s revenue
Apache Hadoop – What?
• It is a software platform that:
– allows us to easily write and run data-related applications
– facilitates processing and manipulating massive amounts of data
– scales the processing conveniently as data grows
Apache Hadoop – Brief history
Apache Hadoop – Characteristics
• Reliable shared storage (HDFS) and analysis system (MapReduce).
• Highly scalable.
• Cost-effective, as it can work with commodity hardware.
• Highly flexible: can process both structured and unstructured data.
• Built-in fault tolerance.
• Write once, read many times.
• Optimized for large and very large data sets.
Apache Hadoop – Design principles
• Moving computation is cheaper than moving data
• Hardware will fail; manage it
• Hide execution details from the user
• Use streaming data access
• Use a simple file system coherency model
Apache Hadoop – Core architecture (1)
Apache Hadoop – Core architecture (2)
Apache Hadoop – HDFS architecture
Apache Hadoop – HDFS architecture – Replication
Apache Hadoop – HDFS architecture – Secondary namenode
Apache Hadoop – HDFS – Read a file
Apache Hadoop – HDFS – Write a file (1)
Apache Hadoop – HDFS – Write a file (2)
How the MapReduce pattern works
Apache Hadoop – Running jobs in Hadoop 1
Apache Hadoop – Running jobs in Hadoop 1 – How it works
Apache Hadoop – Running jobs in Hadoop 2
Apache Hadoop – Running jobs in Hadoop 2 – How it works
Apache Hadoop – When to use it (and when not to)
• When to use Hadoop – typical scenarios include:
– Analytics
– Search
– Data retention
– Log file processing
– Analysis of text, image, audio and video content
– Recommendation systems, such as in e-commerce websites
• When not to use Hadoop:
– Low-latency or near real-time data access.
– A large number of small files to process.
– Scenarios with multiple writers, arbitrary writes, or writes between files.
Apache Hadoop – Ecosystem
Examples using Hadoop – A retail management system
Examples using Hadoop – SQL Server and Hadoop
Real-world applications/solutions using Hadoop – MS HDInsight
Real-world applications/solutions using Hadoop – Case studies
References
- http://hadoop.apache.org
- Hadoop in Action – Chuck Lam
- Hadoop: The Definitive Guide – Tom White
- http://www.bigdatanews.com/
- http://stackoverflow.com
- http://codeproject.com
- Hadoop 2 Fundamentals – LiveLessons
Editor's Notes
  • #7: This definition consists of three parts:
    – Part One: the 3Vs (Variety – Velocity – Volume)
    – Part Two: cost-effective, innovative forms of information processing
    – Part Three: enhanced insight and decision making
  • #8: Data scientists at some companies break big data into 4 Vs: Volume, Variety, Velocity and Veracity. Others add one more V to the characteristics of big data: Value.
  • #10: Information about the big data ecosystem can be found at: http://hadoopilluminated.com/hadoop_illuminated/Bigdata_Ecosystem.html
  • #17: Here are a few highlights of the Hadoop architecture:
    – Hadoop works in a master-worker (master-slave) fashion.
    – Hadoop has two core components: HDFS and MapReduce. HDFS (Hadoop Distributed File System) offers highly reliable, distributed storage and ensures reliability, even on commodity hardware, by replicating data across multiple nodes. Unlike a regular file system, data pushed to HDFS is automatically split into multiple blocks (a configurable parameter) and stored/replicated across various datanodes; this ensures high availability and fault tolerance. MapReduce offers an analysis system that can perform complex computations on large datasets. It is responsible for performing all the computations: it breaks a large, complex computation into multiple tasks, assigns those to individual worker/slave nodes, and takes care of coordination and consolidation of the results.
    – The master contains the Namenode and Job Tracker components. The Namenode holds information about all the other nodes in the Hadoop cluster, the files present in the cluster, the constituent blocks of files and their locations in the cluster, and other information useful for the operation of the cluster. The Job Tracker keeps track of the individual tasks/jobs assigned to each node and coordinates the exchange of information and results.
    – Each worker/slave contains the Task Tracker and Datanode components. The Task Tracker is responsible for running the task/computation assigned to it; the Datanode is responsible for holding the data. The computers in the cluster can be in any location; there is no dependency on the location of the physical server.
    – Differences between Hadoop 1 and 2: Hadoop 1 is limited to 4,000 nodes per cluster, while Hadoop 2 scales up to 10,000 nodes per cluster. Hadoop 1 has a JobTracker bottleneck; Hadoop 2 achieves efficient cluster utilization through YARN. Hadoop 1 only supports MapReduce jobs, whereas Hadoop 2 supports more job types.
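  To make the client's view of this architecture concrete, here is a minimal sketch (not taken from the deck) that asks the HDFS client which default block size and replication factor it would apply to new files. The namenode URI hdfs://namenode:8020 and the path /data are placeholders for your own cluster.

```java
// Minimal sketch, assuming a reachable HDFS cluster; URI and path are placeholders.
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClusterDefaults {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // loads core-site.xml / hdfs-site.xml if present
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Block size and replication are per-file attributes fixed at write time;
        // these calls return the defaults the client would use for new files.
        Path p = new Path("/data");
        System.out.println("Default block size : " + fs.getDefaultBlockSize(p) + " bytes");
        System.out.println("Default replication: " + fs.getDefaultReplication(p));

        fs.close();
    }
}
```

  The values returned depend on the cluster's configuration (dfs.blocksize and dfs.replication), not on anything hard-coded in the client.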
  • #18: Part of the core Hadoop project, YARN is the architectural center of Hadoop: it allows multiple data-processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform, unlocking an entirely new approach to analytics. YARN is the prerequisite for Enterprise Hadoop, providing resource management and a central platform to deliver consistent operations, security, and data-governance tools across Hadoop clusters. YARN also extends the power of Hadoop to incumbent and new technologies found within the data center so that they can take advantage of cost-effective, linear-scale storage and processing.
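  As a small, hedged illustration (not from the deck) of YARN as the resource-management layer, the sketch below uses YARN's Java client to list the applications the ResourceManager currently knows about; it assumes yarn-site.xml is on the classpath so the client can find the ResourceManager.

```java
// Hedged sketch: list YARN applications; assumes yarn-site.xml points at the ResourceManager.
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();

        // Every job type (MapReduce, Spark, Tez, ...) shows up here as an application.
        for (ApplicationReport app : yarn.getApplications()) {
            System.out.println(app.getApplicationId() + "  " + app.getName()
                    + "  " + app.getYarnApplicationState());
        }
        yarn.stop();
    }
}
```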
  • #19: Namenode, Datanode and Block:
    – Namenode: the namenode is commodity hardware that contains the GNU/Linux operating system and the namenode software. It is software that can be run on commodity hardware. The system hosting the namenode acts as the master server and performs the following tasks: manages the file system namespace; regulates clients' access to files; and executes file system operations such as renaming, closing, and opening files and directories.
    – Datanode: the datanode is commodity hardware with the GNU/Linux operating system and the datanode software. For every node (commodity hardware/system) in a cluster, there is a datanode. These nodes manage the data storage of their system: they perform read-write operations on the file system as per client requests, and also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
    – Block: user data is generally stored in the files of HDFS. A file in the file system is divided into one or more segments and/or stored in individual datanodes. These file segments are called blocks; in other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB, but it can be increased as needed by changing the HDFS configuration.
    Hadoop commands reference – the syntax is hadoop fs -command, where command is one of:
    1. ls <path> – Lists the contents of the directory specified by path, showing the names, permissions, owner, size and modification date for each entry.
    2. lsr <path> – Behaves like -ls, but recursively displays entries in all subdirectories of path.
    3. du <path> – Shows disk usage, in bytes, for all files matching path; filenames are reported with the full HDFS protocol prefix.
    4. dus <path> – Like -du, but prints a summary of disk usage of all files/directories in the path.
    5. mv <src> <dest> – Moves the file or directory indicated by src to dest, within HDFS.
    6. cp <src> <dest> – Copies the file or directory identified by src to dest, within HDFS.
    7. rm <path> – Removes the file or empty directory identified by path.
    8. rmr <path> – Removes the file or directory identified by path. Recursively deletes any child entries (i.e., files or subdirectories of path).
    9. put <localSrc> <dest> – Copies the file or directory from the local file system identified by localSrc to dest within the DFS.
    10. copyFromLocal <localSrc> <dest> – Identical to -put.
    11. moveFromLocal <localSrc> <dest> – Copies the file or directory from the local file system identified by localSrc to dest within HDFS, then deletes the local copy on success.
    12. get [-crc] <src> <localDest> – Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.
    13. getmerge <src> <localDest> – Retrieves all files matching the path src in HDFS and copies them to a single, merged file in the local file system identified by localDest.
    14. cat <filename> – Displays the contents of filename on stdout.
    15. copyToLocal <src> <localDest> – Identical to -get.
    16. moveToLocal <src> <localDest> – Works like -get, but deletes the HDFS copy on success.
    17. mkdir <path> – Creates a directory named path in HDFS. Creates any missing parent directories in path (like mkdir -p in Linux).
    18. setrep [-R] [-w] rep <path> – Sets the target replication factor for files identified by path to rep. (The actual replication factor will move toward the target over time.)
    19. touchz <path> – Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path, unless the file already has size 0.
    20. test -[ezd] <path> – Returns 1 if path exists, has zero length, or is a directory; 0 otherwise.
    21. stat [format] <path> – Prints information about path. format is a string that accepts file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).
    22. tail [-f] <filename> – Shows the last 1 KB of the file on stdout.
    23. chmod [-R] mode,mode,... <path>... – Changes the file permissions associated with one or more objects identified by path. Performs changes recursively with -R. mode is a 3-digit octal mode, or {augo}+/-{rwxX}. Assumes all scopes if none is specified and does not apply an umask.
    24. chown [-R] [owner][:[group]] <path>... – Sets the owning user and/or group for files or directories identified by path. Sets the owner recursively if -R is specified.
    25. chgrp [-R] group <path>... – Sets the owning group for files or directories identified by path. Sets the group recursively if -R is specified.
    26. help <cmd-name> – Returns usage information for one of the commands listed above. You must omit the leading '-' character in cmd.
    A few of these commands are mirrored in the Java sketch below.
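  Several of the shell commands above have direct equivalents in the Java FileSystem API; this hedged sketch mirrors -mkdir, -put, -setrep and -ls. The paths /user/demo and data.txt are illustrative only, not taken from the demo.

```java
// Hedged sketch mapping some "hadoop fs" commands to the Java FileSystem API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsShellEquivalents {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path dir = new Path("/user/demo");
        fs.mkdirs(dir);                                           // hadoop fs -mkdir /user/demo
        fs.copyFromLocalFile(new Path("data.txt"),                // hadoop fs -put data.txt /user/demo
                             new Path(dir, "data.txt"));
        fs.setReplication(new Path(dir, "data.txt"), (short) 3);  // hadoop fs -setrep 3 /user/demo/data.txt

        for (FileStatus st : fs.listStatus(dir)) {                // hadoop fs -ls /user/demo
            System.out.println(st.getPath() + "  " + st.getLen() + " bytes");
        }
        fs.close();
    }
}
```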
  • #20: When a client is writing data to an HDFS file, its data is first written to a local file, as explained in the previous section. Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of user data, the client retrieves a list of DataNodes from the NameNode; this list contains the DataNodes that will host a replica of that block. The client then flushes the data block to the first DataNode. The first DataNode starts receiving the data in small portions (4 KB), writes each portion to its local repository, and transfers that portion to the second DataNode in the list. The second DataNode, in turn, starts receiving each portion of the data block, writes it to its repository, and then flushes it to the third DataNode. Finally, the third DataNode writes the data to its local repository. Thus a DataNode can be receiving data from the previous node in the pipeline while at the same time forwarding data to the next one: the data is pipelined from one DataNode to the next.
    Why is the default number of replicas 3? Hadoop is used in a clustered environment: each cluster has multiple racks, and each rack has multiple datanodes. To make HDFS fault-tolerant in the cluster, we need to consider the following failures:
    – DataNode failure
    – Rack failure
    The chance of an entire cluster failing is fairly low, so let's not consider it. In the above cases we need to make sure that:
    – if one DataNode fails, we can get the same data from another DataNode;
    – if an entire rack fails, we can get the same data from another rack.
    So it is now pretty clear why the default replication factor is set to 3: no two replicas go to the same DataNode, and at least one replica goes to a different rack, fulfilling the fault-tolerance criteria above.
  • #21: The term "secondary namenode" is somewhat misleading. It is not a namenode in the sense that datanodes cannot connect to the secondary namenode, and in no event can it replace the primary namenode if that fails. The only purpose of the secondary namenode is to perform periodic checkpoints: it periodically downloads the current namenode image and edits log files, joins them into a new image, and uploads the new image back to the (primary and only) namenode. So if the namenode fails and you can restart it on the same physical node, there is no need to shut down the datanodes; only the namenode needs to be restarted. If you cannot use the old node anymore, you will need to copy the latest image somewhere else. The latest image can be found either on the node that used to be the primary before the failure, if available, or on the secondary namenode. The latter will be the latest checkpoint without subsequent edits logs, so the most recent namespace modifications may be missing there. You will also need to restart the whole cluster in this case.
  • #22: Reading a file from HDFS:
    – Step 1: The client opens the file by calling open() on a FileSystem object, which for HDFS is an instance of the DistributedFileSystem class.
    – Step 2: DistributedFileSystem calls the namenode, using RPC (Remote Procedure Call), to determine the locations of the first few blocks of the file. For each block, the namenode returns the addresses of all the datanodes that have a copy of that block; the client will interact with the respective datanodes to read the file. The namenode also provides a token to the client, which it shows to the datanodes for authentication. The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client to read data from; FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.
    – Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the datanode addresses for the first few blocks of the file, connects to the closest datanode holding the first block.
    – Step 4: Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream.
    – Step 5: When the end of the block is reached, DFSInputStream closes the connection to that datanode, then finds the best datanode for the next block. This happens transparently to the client, which from its point of view is just reading a continuous stream.
    – Step 6: Blocks are read in order, with DFSInputStream opening new connections to datanodes as the client reads through the stream. It will also call the namenode to retrieve the datanode locations for the next batch of blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream.
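  A minimal client-side sketch of this read path (assuming the Hadoop configuration on the classpath points at the cluster; /user/demo/data.txt is a placeholder): open() hands back the FSDataInputStream, and block/datanode selection happens transparently underneath read().

```java
// Hedged read-path sketch; the file path is a placeholder.
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());      // DistributedFileSystem for hdfs:// URIs

        try (FSDataInputStream in = fs.open(new Path("/user/demo/data.txt"));   // steps 1-2
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {          // steps 3-6: blocks streamed in order
                System.out.println(line);
            }
        }                                                         // close() on the stream
        fs.close();
    }
}
```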
  • #23: Writing a file to HDFS:
    – Step 1: The client creates the file by calling create() on DistributedFileSystem.
    – Step 2: DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem's namespace, with no blocks associated with it. The namenode performs various checks to make sure the file doesn't already exist and that the client has the right permissions to create it. If these checks pass, the namenode makes a record of the new file; otherwise, file creation fails and the client is thrown an IOException. The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to.
    – Step 3: As the client writes data, DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline; here we'll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first datanode in the pipeline, which stores each packet and forwards it to the second datanode in the pipeline.
    – Step 4: Similarly, the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline.
    – Step 5: DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline.
    – Step 6: When the client has finished writing data, it calls close() on the stream.
    – Step 7: This action flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete. The namenode already knows which blocks the file is made up of, so it only has to wait for the blocks to be minimally replicated before returning successfully.
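  And a matching write-path sketch (placeholder paths again; the explicit setReplication call is only to echo the three-replica pipeline described above): create() returns the FSDataOutputStream, and the data-queue/ack-queue machinery runs underneath close().

```java
// Hedged write-path sketch; the output path is a placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteToHdfs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path out = new Path("/user/demo/output.txt");

        try (FSDataOutputStream os = fs.create(out, true)) {      // steps 1-2: create() via the namenode
            os.writeBytes("hello hadoop\n");                      // step 3: split into packets and queued
        }                                                         // steps 5-7: close() waits for acks
        fs.setReplication(out, (short) 3);                        // target replication, as in the slides
        System.out.println("Wrote " + fs.getFileStatus(out).getLen() + " bytes");
        fs.close();
    }
}
```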
  • #25: How the MapReduce pattern works:
    – There are two procedures: map filters and sorts the data, and reduce summarizes the data. The reduce step is not mandatory; you can have a map-only process. This split facilitates scalability and parallelization.
    – Each task in a MapReduce job is processed on a datanode: individual tasks are simple and the nodes perform similar work, yet combined together the operation can be powerful and even complex. It is therefore necessary to write MapReduce programs with great care.
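  To tie the map and reduce steps together, here is the canonical WordCount example (a standard illustration, not the code from the demo): the mapper emits (word, 1) pairs and the reducer sums the counts for each word.

```java
// Canonical WordCount sketch using the org.apache.hadoop.mapreduce (new) API.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);   // optional local aggregation before the shuffle
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

  Packaged into a jar, it would typically be submitted with something like hadoop jar wordcount.jar WordCount /input /output, with the output directory not existing beforehand.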
  • #31: The Hadoop ecosystem:
    – HBase: Google BigTable-inspired, non-relational distributed database. Random, real-time read/write operations on column-oriented, very large tables (BDDB: Big Data DataBase). It is the Hadoop database and the backing system for MapReduce job outputs; Apache HBase tables can back Hadoop MapReduce jobs.
    – Hive: data warehouse infrastructure developed by Facebook for data summarization, query, and analysis. It provides an SQL-like language (not SQL-92 compliant): HiveQL.
    – Pig: provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data. Pig runs on Hadoop and makes use of both HDFS and Hadoop's processing system, MapReduce: it compiles the Pig Latin scripts that users write into a series of one or more MapReduce jobs that it then executes. Pig Latin looks different from many programming languages you have seen; there are no if statements or for loops. Traditional procedural and object-oriented languages describe control flow, with data flow as a side effect of the program; Pig Latin instead focuses on data flow.
    – ZooKeeper: a coordination service that gives you the tools you need to write correct distributed applications. ZooKeeper was developed at Yahoo! Research. Several Hadoop projects already use ZooKeeper to coordinate the cluster and provide highly available distributed services; perhaps the most famous are Apache HBase, Storm, and Kafka. ZooKeeper is an application library with two principal implementations of the APIs (Java and C) and a service component implemented in Java that runs on an ensemble of dedicated servers. ZooKeeper is for building distributed systems; it simplifies the development process, making it more agile and enabling more robust implementations. Back in 2006, Google published a paper on "Chubby", a distributed lock service which gained wide adoption within their data centers. ZooKeeper, not surprisingly, is a close clone of Chubby designed to fulfill many of the same roles for HDFS and other Hadoop infrastructure.
    – Mahout: machine learning and math library on top of MapReduce.
    – Sqoop: transports data from RDBMSs to HDFS.
    – Flume: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into HDFS.
    – Oozie: a Java web application used to schedule Apache Hadoop jobs.
    – Ambari: monitoring and management of Hadoop clusters and nodes.
  • #36: Real-world case studies:
    – A9.com (Amazon): to build Amazon's product search indices; processes millions of sessions daily for analytics, using both the Java and streaming APIs; clusters vary from 1 to 100 nodes.
    – Yahoo!: more than 100,000 CPUs in ~20,000 computers running Hadoop; biggest cluster: 2,000 nodes (2×4-CPU boxes with 4 TB of disk each); used to support research for ad systems and web search.
    – AOL: used for a variety of things, ranging from statistics generation to running advanced algorithms for behavioral analysis and targeting; cluster size is 50 machines (Intel Xeon, dual processors, dual core, each with 16 GB RAM and 800 GB of disk), giving a total of 37 TB of HDFS capacity.
    – Facebook: to store copies of internal log and dimension data sources and use them as a source for reporting/analytics and machine learning; 320-machine cluster with 2,560 cores and about 1.3 PB of raw storage.
    – FOX Interactive Media: 3 × 20-machine clusters (8 cores/machine, 2 TB storage/machine) and a 10-machine cluster (8 cores/machine, 1 TB storage/machine); used for log analysis, data mining and machine learning.
    – University of Nebraska-Lincoln: one medium-sized Hadoop cluster (200 TB) to store and serve physics data.