Increase computational power
with distributed processing
Neil Stein 03 Nov 2012
Distributed processing
A Discussion Example…
Getting the data, and ordering it as needed…
Familiar with grep and sort?

—  “grep” extracts all the matching lines
—  “sort” sorts all the lines
grep "some_record_parameters" hl7_transfer.data-file | sort
[2012/02/25/ 9:15] records sent to healthcare-1
[2012/02/28/ 6:15] records sent to healthcare-2
[2012/03/12/ 10:30] records sent to healthcare-3
A Discussion Example…
—  As the amount of data increases, the process requires more and more
resources

—  What if hl7_transfer.data-file is 500GB or bigger?
—  What if there are hundreds or thousands of data files?
—  What if there are multiple types of data files?
grep "provider 1" hl7_transfer.data-file | sort

—  Ignoring the process for a moment, how do we write all the data to
disk in the first place?

Need to rethink the process
Distributed File-System – “the cloud”
—  Files can be stored across many machines
—  Files can be replicated across many machines
—  Files can be in a hybrid-cloud model
—  Share the file-system transparently
—  You simply see the usual file structure
—  Opportunity to leverage private and public cloud environments
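
As a rough illustration of the first two points, the sketch below splits a file into fixed-size blocks and assigns each block to several machines. The function name, the node names and the round-robin placement are assumptions made for illustration only; real distributed file systems such as HDFS apply their own placement policies.

```python
def place_blocks(file_size, block_size, nodes, replicas):
    """Assign each block of a file to `replicas` different nodes (round-robin)."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block in range(num_blocks):
        # Consecutive nodes, wrapping around, so the copies land on different machines
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replicas)]
    return placement


# Hypothetical cluster: a 500-unit file, 128-unit blocks, 3 nodes, 2 copies of each block
layout = place_blocks(500, 128, ["node-a", "node-b", "node-c"], replicas=2)
for block, holders in layout.items():
    print(block, holders)
```

Losing any single node in this layout still leaves one copy of every block, which is the point of replication.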
Map-Reduce – the cloud
—  A way of processing large amounts of data across many machines
—  Must be able to split up the data into chunks for processing (Map)
—  The results are recombined after processing (Reduce)
—  Requires a constant flow of data from one simple state to another
—  Allows for a simple way of breaking down a large task into smaller
manageable tasks

—  Increases the available computational power
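
The split / process / recombine idea can be sketched on a single machine. In the example below the data is cut into chunks, each chunk is processed on its own (Map), and the partial results are combined (Reduce). The function names, the sample lines and the chunk size are assumptions for illustration; on a cluster each chunk would be handled by a different machine.

```python
from functools import reduce


def split_into_chunks(items, chunk_size):
    """Cut the input into independent chunks."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def map_chunk(chunk, needle):
    """Map step: count matching lines in one chunk, without looking at other chunks."""
    return sum(1 for line in chunk if needle in line)


def combine(left, right):
    """Reduce step: merge two partial results."""
    return left + right


# Hypothetical sample data
lines = ["provider 1 a", "provider 2 b", "provider 1 c", "provider 1 d", "provider 3 e"]
chunks = split_into_chunks(lines, 2)
partial_counts = [map_chunk(chunk, "provider 1") for chunk in chunks]
total = reduce(combine, partial_counts, 0)
print(partial_counts, total)  # [1, 2, 0] 3
```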
A look at Hadoop
What is Hadoop
—  A Map-Reduce framework
—  Designed to run applications on clusters of
local and remote systems

—  HDFS
—  The file system of Hadoop (Hadoop Distributed
File System)
—  Designed to access clusters of local and
remote systems
Putting the pieces together…
First, we need some code…
Map

Reduce
Map

Hadoop streams the input data to us on STDIN
Separate each output value with a newline (for Hadoop)
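
The code from the original slide is not available, so the following is only a minimal sketch of what a streaming mapper could look like. It assumes records shaped like the sample output shown earlier, with the destination as the last word on the line; map_records, the search text and the tab-separated "key, 1" output are illustrative choices.

```python
import sys


def map_records(lines, needle="records sent to"):
    """Yield 'destination<TAB>1' for every input line that contains `needle`."""
    for line in lines:
        line = line.strip()
        if needle not in line:
            continue
        # Assumed layout: the destination is the last whitespace-separated token
        destination = line.split()[-1]
        yield f"{destination}\t1"


def main():
    # Entry point for Hadoop Streaming: read STDIN, write one record per line to STDOUT
    for record in map_records(sys.stdin):
        print(record)


# Local demonstration with in-memory lines instead of STDIN
sample = [
    "[2012/02/25/ 9:15] records sent to healthcare-1",
    "[2012/02/28/ 6:15] records sent to healthcare-2",
    "unrelated line",
]
print(list(map_records(sample)))  # ['healthcare-1\t1', 'healthcare-2\t1']
```

When saved as a script for streaming use, the demonstration lines would be dropped and main() called at the bottom of the file.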
Reduce

Hadoop streams the mapped records back to us on STDIN
Output the aggregated records
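
Again as a sketch rather than the original slide code: a streaming reducer that sums a count per key. It assumes tab-separated "key, count" input lines and relies on the fact that Hadoop Streaming delivers the records to the reducer sorted by key; reduce_records is a hypothetical name.

```python
import sys
from itertools import groupby


def reduce_records(lines):
    """Sum the counts for each key; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


def main():
    # Entry point for Hadoop Streaming: sorted records arrive on STDIN
    for record in reduce_records(sys.stdin):
        print(record)


# Local demonstration with already-sorted in-memory records
sorted_input = ["healthcare-1\t1\n", "healthcare-1\t1\n", "healthcare-2\t1\n"]
print(list(reduce_records(sorted_input)))  # ['healthcare-1\t2', 'healthcare-2\t1']
```

Because the keys arrive grouped, the reducer only needs to remember one running total at a time.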
Sanity Checking
Command

Results
This should work with small data-sets
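
The original command and results are not preserved here. One way to run such a check without a cluster is to chain the three stages in memory: map, sort, reduce. The mapper and reducer below are small hypothetical stand-ins so that the example is self-contained.

```python
from itertools import groupby


def mapper(lines):
    # Hypothetical mapper: key each non-empty line by its last word
    for line in lines:
        if line.strip():
            yield f"{line.split()[-1]}\t1"


def reducer(sorted_records):
    # Hypothetical reducer: sum the counts per key
    for key, group in groupby(sorted_records, key=lambda rec: rec.split("\t")[0]):
        total = sum(int(rec.split("\t")[1]) for rec in group)
        yield f"{key}\t{total}"


def run_locally(lines):
    """Mimic map -> sort -> reduce on one machine, entirely in memory."""
    return list(reducer(sorted(mapper(lines))))


sample = [
    "[2012/02/25/ 9:15] records sent to healthcare-1",
    "[2012/02/28/ 6:15] records sent to healthcare-2",
    "[2012/03/12/ 10:30] records sent to healthcare-1",
]
print(run_locally(sample))  # ['healthcare-1\t2', 'healthcare-2\t1']
```

If the small run gives the expected totals, the same mapper and reducer logic is a reasonable candidate for the distributed run.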
Push file to “the distributed file system”

Put file on the DFS

Check that the file is in the cloud
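
For Hadoop, these two steps correspond to the "hadoop fs -put" and "hadoop fs -ls" commands. The sketch below only builds the command lines; the target directory /user/demo/input is a hypothetical path, and the helper names are illustrative.

```python
import shlex


def put_command(local_path, dfs_dir):
    """Command that copies a local file into the distributed file system."""
    return ["hadoop", "fs", "-put", local_path, dfs_dir]


def ls_command(dfs_dir):
    """Command that lists a directory in the distributed file system."""
    return ["hadoop", "fs", "-ls", dfs_dir]


# Hypothetical target directory
target = "/user/demo/input"
print(shlex.join(put_command("hl7_transfer.data-file", target)))
print(shlex.join(ls_command(target)))
```

On a machine with Hadoop installed, each list could be passed to subprocess.run(..., check=True) to execute it.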
Running in “the distributed environment”

Call the Hadoop streaming command
Pass the appropriate parameters
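
The exact command from the slides is not preserved. As a hedged sketch, a Hadoop Streaming job is typically started with "hadoop jar" plus the streaming jar and the -input, -output, -mapper and -reducer options. The jar location, the paths and the script names below are all assumptions; the jar path in particular depends on the installation.

```python
import shlex


def streaming_command(streaming_jar, input_path, output_path, mapper, reducer):
    """Build the argument list for a Hadoop Streaming job."""
    return [
        "hadoop", "jar", streaming_jar,
        # Generic option: ship the scripts to the cluster (must precede streaming options)
        "-files", f"{mapper},{reducer}",
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ]


# All names below are hypothetical placeholders
cmd = streaming_command(
    "/path/to/hadoop-streaming.jar",
    "/user/demo/input",
    "/user/demo/output",
    "mapper.py",
    "reducer.py",
)
print(shlex.join(cmd))
```

The output directory is expected not to exist before the job runs; Hadoop creates it and writes the reducer results there.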
Checking Status
—  Cluster Summary
—  Running Jobs
—  Completed Jobs
—  Failed Jobs
—  Job Statistics
—  Detailed Job Logs
Checking Distributed Cluster Health
—  List Data-Nodes
—  Dead Nodes
—  Node Heart-beat information
—  Failed Jobs
—  Job Statistics
—  Detailed Job Logs
Conclusion
—  A different paradigm for solving large-scale problems
—  Designed to solve specific problems that can be defined
in a focused map-reduce manner
