My other computer is a datacentre - 2012 edition
Julio, Steve
Our other computer is a datacentre
Their other computer is a datacentre
That sorts a Terabyte in 82s
Their other computer is a datacentre
facebook: 45 Petabytes
Why?
Big Data

Petabytes of "unstructured" data
Datamining: analysing past events
Prediction, Inference, Clustering
Graph analysis
Facebook-scale NoSQL databases
Big Data vs HPC
Big Data: petabytes                                    HPC: petaflops
Storage of low-value data                              Storage for checkpointing
H/W failure common                                     Surprised by H/W failure
Code: frequency, graphs, machine-learning, rendering   Code: simulation, rendering
Ingress/egress problems                                Less persistent data, ingress & egress
Dense storage of data                                  Dense compute
Mix CPU and data                                       CPU + GPU
Spindle:core ratio                                     Bandwidth to other servers
Architectural Issues
Failure is inevitable – design for it.
Bandwidth is finite – be topology aware.
SSD: expensive, low seek times, best written to sequentially.
HDD: less $, more W; awful random access, so stripe data across disks and read/write in bulk.
Coping with Failure
Avoid SPOFs
Replicate data
Restart work
Redeploy applications onto live machines
Route traffic to live front ends
Decoupled connection to back end: queues, scatter/gather
  Failure-tolerance must be designed in
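One way to picture "restart work" and "redeploy onto live machines" is a failover loop. This is a minimal sketch in plain Groovy, not taken from the talk: the worker names, the task name and the `runWithFailover` helper are all hypothetical, and a real scheduler would also handle timeouts and partial results.

```groovy
// Hypothetical sketch: try a task on each worker in turn until one succeeds.
// 'workers' maps a worker name to a closure that runs the task; both are
// illustrative stand-ins, not part of any framework.
def runWithFailover(Map<String, Closure> workers, String task) {
    def failures = []
    for (entry in workers) {
        try {
            return [worker: entry.key, result: entry.value.call(task), failed: failures]
        } catch (Exception e) {
            failures << entry.key   // record the failure and move on to the next worker
        }
    }
    throw new IllegalStateException("all workers failed: ${failures}")
}

def workers = [
    node1: { t -> throw new IOException('disk failure') },   // simulated dead machine
    node2: { t -> "done: ${t}".toString() }
]
def outcome = runWithFailover(workers, 'count-devices')
println outcome
```

The caller never sees the failure of node1 except as an entry in the `failed` list, which is the point: the failure is absorbed by the design rather than by the developer.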
Scale

Goal: linear performance scaling


Hide the problems from (most) developers
Design applications to scale across thousands of servers, tens of thousands of cores
Massive parallelisation, minimal communication


  Scalability must be designed in
Algorithms and Frameworks

MapReduce – Hadoop, CouchDB, (Dryad)
BSP – Pregel, Giraph, Hama
Column Tables – Cassandra, HBase, BigTable
Location Aware filesystem: GFS, HDFS
State Service: Chubby, Zookeeper, Anubis
Scatter/gather – search engines
(MPI)
MapReduce: Hadoop
Bath Bluetooth Dataset
gate1,b46cca4d3f5f313176e50a0e38e7fde3,,2006-10-30,16:06:17
gate1,f1191b79236083ce59981e049d863604,,2006-10-30,16:06:20
gate1,b45c7795f5be038dda8615ab44676872,,2006-10-30,16:06:21
gate1,02e73779c77fcd4e9f90a193c4f3e7ff,,2006-10-30,16:06:23
gate1,eef1836efddf8dbfe5e2a3cd5c13745f,,2006-10-30,16:06:24


• 2006-2009
• Multiple sites
• 10GB data
Map to device ID
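import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Mapper

// Emits (deviceID, 1) for every sighting record.
// EventParser and event.device come from the talk's own code, which is not shown.
class DeviceMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  def parser = new EventParser()
  def one = new IntWritable(1)

  void map(LongWritable k, Text v, Mapper.Context ctx) {
    def event = parser.parse(v)
    ctx.write(event.device, one)
  }
}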



(gate, device, day, hour) → (deviceID, 1)
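The mapper relies on an EventParser that the slides do not show. As a rough idea of what such a parser has to do, here is a sketch in plain Groovy. `SimpleEventParser` and `Event` are hypothetical names, and the field order (gate, device hash, an empty third field, date, time) is inferred from the sample records only.

```groovy
// Hypothetical stand-in for the talk's EventParser, which is not shown.
// Field order is inferred from the sample records:
// gate, device hash, an empty third field, date, time.
class Event {
    String gate
    String device
    String date
    String time
}

class SimpleEventParser {
    Event parse(String line) {
        def f = line.split(',', -1)     // limit -1 keeps empty fields
        if (f.length != 5) {
            throw new IllegalArgumentException("expected 5 fields: ${line}")
        }
        new Event(gate: f[0], device: f[1], date: f[3], time: f[4])
    }
}

def event = new SimpleEventParser().parse(
    'gate1,b46cca4d3f5f313176e50a0e38e7fde3,,2006-10-30,16:06:17')
println "${event.device} seen at ${event.gate} on ${event.date}"
```

Unlike the parser the mapper uses, this one takes a String rather than a Hadoop Text value, so it runs without Hadoop on the classpath.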
Reduce to device count
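import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Reducer

// Sums the 1s emitted for each device ID.
class CountReducer2 extends Reducer<Text, IntWritable, Text, IntWritable> {
  def iw = new IntWritable()

  void reduce(Text k, Iterable<IntWritable> values, Reducer.Context ctx) {
    def sum = values.collect { it.get() }.sum()
    iw.set(sum)
    ctx.write(k, iw)
  }
}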


(device, [count1, count2, ...]) → (deviceID, count')
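The map, shuffle and reduce phases can be imitated in a single process to see the data flow. This is only a sketch in plain Groovy with no Hadoop involved; the function names `mapPhase`, `shufflePhase` and `reducePhase` are made up for illustration, and the input is the five sample records from the dataset slide.

```groovy
// Local, single-process imitation of map -> shuffle -> reduce.
// Function names are illustrative; nothing here uses the Hadoop API.

// map: one record -> [deviceID, 1]
def mapPhase(List<String> lines) {
    lines.collect { line -> [line.split(',', -1)[1], 1] }
}

// shuffle: group the emitted values by key -> [deviceID: [1, 1, ...]]
def shufflePhase(List pairs) {
    pairs.groupBy { it[0] }.collectEntries { key, group ->
        [(key): group.collect { it[1] }]
    }
}

// reduce: (deviceID, [1, 1, ...]) -> (deviceID, count)
def reducePhase(Map grouped) {
    grouped.collectEntries { key, values -> [(key): values.sum()] }
}

def records = [
    'gate1,b46cca4d3f5f313176e50a0e38e7fde3,,2006-10-30,16:06:17',
    'gate1,f1191b79236083ce59981e049d863604,,2006-10-30,16:06:20',
    'gate1,b45c7795f5be038dda8615ab44676872,,2006-10-30,16:06:21',
    'gate1,02e73779c77fcd4e9f90a193c4f3e7ff,,2006-10-30,16:06:23',
    'gate1,eef1836efddf8dbfe5e2a3cd5c13745f,,2006-10-30,16:06:24'
]
def counts = reducePhase(shufflePhase(mapPhase(records)))
counts.each { device, count -> println "${device}\t${count}" }
```

In Hadoop the shuffle is done by the framework between the mapper and the reducer; here it is an explicit `groupBy` so that the intermediate shape is visible.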
Hadoop running MapReduce
HDFS Filesystem
Commodity disks
Scatter data across machines
Replicate data across racks
Trickle-feed backup to other sites
Archive unused files
In idle time: check health of data, rebalance
Report data location for placement of work
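"Replicate data across racks" and "report data location for placement of work" can be sketched together. This is a simplified illustration in plain Groovy, not the actual HDFS placement policy: the topology map, node names, `placeReplicas` and `pickWorker` are all assumptions made for the example.

```groovy
// Hypothetical sketch of rack-aware replica placement.
// 'racks' maps a rack name to its nodes; names and rules are illustrative.
def placeReplicas(Map<String, List<String>> racks, String writer, int copies = 3) {
    def localRack = racks.find { rack, nodes -> writer in nodes }?.key
    if (localRack == null) {
        throw new IllegalArgumentException("unknown node: ${writer}")
    }
    def remote = racks.findAll { rack, nodes -> rack != localRack }.values().flatten()
    def local = racks[localRack] - writer
    // first copy on the writing node, then other racks, then the local rack
    ([writer] + remote + local).take(copies)
}

// Placement of work: prefer a live node that already holds the data.
def pickWorker(List<String> replicas, Set<String> liveNodes) {
    replicas.find { it in liveNodes }
}

def racks = [rackA: ['a1', 'a2'], rackB: ['b1', 'b2']]
def replicas = placeReplicas(racks, 'a1')
println "replicas: ${replicas}"
println "run task on: ${pickWorker(replicas, ['b2', 'a2'] as Set)}"
```

With copies on more than one rack, losing a whole rack still leaves a replica, and the scheduler can usually find a live node that reads the data locally.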
Results
0072adec0c1699c0af152c3cdb6c018e   2128
0120df42306097c70384501ebbdd888c   243
01541fef30e606ce88f8b0e931f010d2   5
0161257b1b0b8d1884975dd7b62f4387   15
01ad97908c53712e58894bc7009f5aa0   22
0225a2b080a4ac8f18344edd6108c46c   3
0276aba603a2aead55fe67bc48839cec   9
027973a027d85ad4dd4a15efa5142204   1
02e73779c77fcd4e9f90a193c4f3e7ff   3953
02e9a7bef5ba4c1caf5f35e8ada226ed   2
...
Map: day of week; reduce: count
[Figure: bar chart of sighting counts by day of week; x-axis: day 1 to 7, y-axis: count, 0 to 14000]
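The day-of-week job only changes the map key. A minimal sketch in plain Groovy follows, assuming the date is the fourth comma-separated field as in the sample records. ISO numbering (Monday = 1 to Sunday = 7) is an assumption; the numbering behind the chart's axis is not stated. The `dayOfWeek` helper is a made-up name.

```groovy
import java.time.LocalDate

// Map step keyed on day of week: record -> day number.
// ISO numbering (Monday = 1 ... Sunday = 7) is an assumption.
def dayOfWeek(String line) {
    def date = line.split(',', -1)[3]          // e.g. 2006-10-30
    LocalDate.parse(date).dayOfWeek.value
}

def records = [
    'gate1,b46cca4d3f5f313176e50a0e38e7fde3,,2006-10-30,16:06:17',
    'gate1,f1191b79236083ce59981e049d863604,,2006-10-30,16:06:20',
    'gate1,b45c7795f5be038dda8615ab44676872,,2006-10-30,16:06:21',
    'gate1,02e73779c77fcd4e9f90a193c4f3e7ff,,2006-10-30,16:06:23',
    'gate1,eef1836efddf8dbfe5e2a3cd5c13745f,,2006-10-30,16:06:24'
]

// countBy plays the role of shuffle + reduce for this small local example
def counts = records.countBy { dayOfWeek(it) }
println counts
```

All five sample records fall on 2006-10-30, a Monday, so they land in a single bucket; the chart needs the full dataset.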
Other Questions
Peak hours for devices
Predictability of device
Time to cross gates/transit city
Routes they take
Which devices are often co-sighted?
When do they stop being co-sighted?
Clustering: resident, commuter, student, tourist
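The co-sighting question could start from something like the sketch below: count pairs of distinct devices seen at the same gate within a short window. This is plain Groovy and purely illustrative. The window length, the `parseSighting` and `coSightings` helpers and the pair-key format are assumptions, and the quadratic pairing would not scale to the full dataset without being recast as a MapReduce job.

```groovy
import java.time.Duration
import java.time.LocalDateTime

// Hypothetical helper: turn a record into a small map with a timestamp.
def parseSighting(String line) {
    def f = line.split(',', -1)
    [gate: f[0], device: f[1], at: LocalDateTime.parse("${f[3]}T${f[4]}".toString())]
}

// Count pairs of distinct devices seen at the same gate within windowSeconds.
// O(n^2) over the records: fine for a sketch, not for 10GB of data.
def coSightings(List<String> lines, long windowSeconds) {
    def events = lines.collect { parseSighting(it) }
    def pairs = [:]
    events.eachWithIndex { a, i ->
        events.drop(i + 1).each { b ->
            def gap = Duration.between(a.at, b.at).abs().seconds
            if (a.gate == b.gate && a.device != b.device && gap <= windowSeconds) {
                def key = [a.device, b.device].sort().join('+')
                pairs[key] = (pairs[key] ?: 0) + 1
            }
        }
    }
    pairs
}

def sample = [
    'gate1,b46cca4d3f5f313176e50a0e38e7fde3,,2006-10-30,16:06:17',
    'gate1,f1191b79236083ce59981e049d863604,,2006-10-30,16:06:20',
    'gate1,b45c7795f5be038dda8615ab44676872,,2006-10-30,16:06:21',
    'gate1,02e73779c77fcd4e9f90a193c4f3e7ff,,2006-10-30,16:06:23',
    'gate1,eef1836efddf8dbfe5e2a3cd5c13745f,,2006-10-30,16:06:24'
]
def near = coSightings(sample, 2)
near.each { pair, n -> println "${pair}\t${n}" }
```

A pair that keeps recurring across days and gates is a candidate for "often co-sighted"; a single hit inside a two-second window, as here, says very little.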
Hardware Challenges
Energy Proportional Computing [Barroso07]
Energy Proportional Datacenter Networks [Abts10]
Dark Silicon and the End of Multicore Scaling [Esmaeilzadeh10]


Scaling of CPU, storage, network
CS Problems
Scheduling
Placement of data and work
Graph Theory
Machine Learning
Heterogeneous parallelism
Algorithms that run on large, unreliable clusters
Dealing with availability
New problem: availability
Availability in Globally Distributed Storage Systems [Ford10]
Failure Trends in a Large Disk Drive Population [Pinheiro07]
DRAM Errors in the Wild [Schroeder09]
Characterizing Cloud Computing Hardware Reliability [Vishwanath10]
Understanding Network Failures in Data Centers [Gill11]
What will your datacentre do?
