Slide 1
Real-time searching
of big data with
Solr and Hadoop
Rod Cope, CTO
Slide 2
• Introduction
• The problem
• The solution
• Top 10 lessons
• Final thoughts
• Q&A
Slide 3
Rod Cope, CTO
Rogue Wave Software
Slide 4
• “Big data”
– All the world’s open source software
– Metadata, code, indexes
– Individual tables contain many terabytes
– Relational databases aren’t scale-free
• Growing every day
• Need real-time random access to all data
• Long-running and complex analysis jobs
Slide 5
• Hadoop, HBase, and Solr
– Hadoop – distributed file system, map/reduce
– HBase – “NoSQL” data store – column-oriented
– Solr – search server based on Lucene
– All are scalable, flexible, fast, well-supported, and used in
production environments
• And a supporting cast of thousands…
– Stargate, MySQL, Rails, Redis, Resque,
– Nginx, Unicorn, HAProxy, Memcached,
– Ruby, JRuby, CentOS, …
Slide 6
[Architecture diagram: Internet, Application LAN, and Data LAN tiers. *Caching and load balancing not shown]
Slide 7
• HBase = NoSQL
– Think hash table, not relational database
• How do I find my data if the primary key won’t cut it?
• Solr to the rescue
– Very fast, highly scalable search server with built-in sharding
and replication – based on Lucene
– Dynamic schema, powerful query language, faceted search,
accessible via a simple REST-like web API with XML, JSON, Ruby,
and other data formats (see the query sketch below)
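To make that API concrete, here is a minimal query sketch (not from the deck) using only the Ruby standard library; the Solr URL and field names are hypothetical stand-ins for your own schema.
JRuby
require 'net/http'
require 'uri'
require 'cgi'
# Hypothetical field names -- substitute your own schema
params = {
  "q"           => "code:QuickSort",
  "facet"       => "true",
  "facet.field" => "language",
  "wt"          => "json",
  "rows"        => "10"
}
query = params.map { |k, v| "#{CGI.escape(k)}=#{CGI.escape(v)}" }.join("&")
puts Net::HTTP.get(URI.parse("http://localhost:8983/solr/select?#{query}"))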
Slide 8
• Sharding (fan-out sketched below)
– Query any server – it executes the same query against all the other servers in
the group
– Returns the aggregated result to the original caller
• Async replication (slaves poll their masters)
– Can use repeaters if replicating across data centers
• OpenLogic
– Solr farm, sharded, cross-replicated, fronted with HAProxy
• Load balanced writes across masters, reads across masters and slaves
• Be careful not to over-commit
– Billions of lines of code in HBase, all indexed in Solr for real-time search in
multiple ways
– Over 20 Solr fields indexed per source file
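As a hedged sketch of that fan-out (host names hypothetical): distributed search in this era of Solr is driven by the shards request parameter, so any node in the group can act as the aggregator.
JRuby
require 'net/http'
require 'uri'
require 'cgi'
# Hypothetical shard list -- every host serves a slice of the same logical index
shards = "solr1:8983/solr,solr2:8983/solr,solr3:8983/solr"
url = "http://solr1:8983/solr/select?q=#{CGI.escape('code:QuickSort')}" +
      "&shards=#{CGI.escape(shards)}"
# solr1 re-issues the query to every shard and merges the results for the caller
puts Net::HTTP.get(URI.parse(url))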
Slides 9–14: image-only slides (not captured in this transcript)
Slide 15
• Experiment with different Solr merge factors (config sketch below)
– During huge loads, it can help to use a higher factor for load
performance
• Minimizes index manipulation gymnastics
• Start with something like 25
– When you’re done with the massive initial load/import, turn it
back down for search performance
• Minimizes the number of index segments to query
• Start with something like 5
– Note that a small merge factor will hurt indexing performance if you
do massive loads frequently or index continuously
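A sketch of where that knob lives in a Solr 1.4/3.x-era solrconfig.xml; treat the values as the starting points suggested above, not gospel.
solrconfig.xml
<indexDefaults>
  <!-- ~25 during massive loads; back down to ~5 for search-heavy operation -->
  <mergeFactor>25</mergeFactor>
  <ramBufferSizeMB>128</ramBufferSizeMB>
</indexDefaults>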
Slide 16
• Test your write-focused load balancing
– Look for large skews in Solr index size
– Note: you may have to commit, optimize, write again, and commit
before you can really tell
• Make sure your replication slaves are keeping up (see the check below)
– Using identical hardware helps
– If index directories don’t look the same, something is wrong
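One hedged way to script that check (host names hypothetical): Solr’s replication handler reports index size and version, so a few lines of Ruby can compare each master against its slaves.
JRuby
require 'net/http'
require 'uri'
# Hypothetical hosts; compare indexSize/indexVersion across master/slave pairs
%w[solr-master1 solr-slave1 solr-master2 solr-slave2].each do |host|
  url = "http://#{host}:8983/solr/replication?command=details&wt=json"
  puts "#{host}: #{Net::HTTP.get(URI.parse(url))}"
end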
Slide 17
• Don’t commit to Solr too frequently (batching sketch below)
– It’s easy to auto-commit or commit after every record
– Doing this hundreds of times per second will take Solr down,
especially if you have serious warm-up queries configured
• Avoid putting large values in HBase (> 5MB)
– Works, but may cause instability and/or performance issues
– Rows and columns are cheap, so use more of them instead
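A minimal batching sketch against the XML update endpoint; the docs array and field names are hypothetical, and real code should XML-escape field values.
JRuby
require 'net/http'
require 'uri'
update = URI.parse("http://localhost:8983/solr/update")
def post_update(uri, xml)
  Net::HTTP.start(uri.host, uri.port) do |http|
    http.post(uri.path, xml, "Content-Type" => "text/xml")
  end
end
docs = [[1, "alpha"], [2, "beta"]]   # stand-in for your real (id, body) pairs
docs.each_slice(1000) do |batch|
  adds = batch.map { |id, body|
    "<doc><field name='id'>#{id}</field><field name='body'>#{body}</field></doc>"
  }.join
  post_update(update, "<add>#{adds}</add>")
end
post_update(update, "<commit/>")     # one commit at the end, not one per record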
Slide 18
• Don’t use a single machine to load the cluster
– You might not live long enough to see it finish
• At OpenLogic, we spread raw source data across many machines and
hard drives via NFS
– Be very careful with NFS configuration – can hang machines
• Load data into HBase via Hadoop map/reduce jobs
– Turn off the WAL for much better performance (JRuby sketch below)
• put.setWriteToWAL(false)
– Index in Solr as you go
• Good way to test your load balancing write schemes and replication set up
– This will find your weak spots!
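A hedged JRuby sketch of that WAL-off put, using the HBase 0.90-era Java client directly from JRuby; the table, family, and row key are hypothetical.
JRuby
include Java
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.client.Put
table = HTable.new(HBaseConfiguration.create, "source_files")   # hypothetical table
put   = Put.new("com.example/Foo.java".to_java_bytes)           # row key
put.add("content".to_java_bytes, "raw".to_java_bytes,
        "public class Foo {}".to_java_bytes)                    # family, qualifier, value
put.setWriteToWAL(false)   # much faster bulk loads, but rows are lost if a regionserver dies mid-load
table.put(put)
table.close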
Slide 19
• Writing data loading jobs can be tedious
• Scripting is faster and easier than writing Java
• Great for system administration tasks and testing (scan sketch below)
• Standard HBase shell is based on JRuby
• Very easy map/reduce jobs with (J)Ruby and Wukong
• Used heavily at OpenLogic
– Productivity of Ruby
– Power of Java Virtual Machine
– Ruby on Rails, Hadoop integration, GUI clients
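To make the scripting point concrete, here is a hedged JRuby sketch of an ad-hoc scan using the same era’s Java client (table name hypothetical):
JRuby
include Java
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.client.Scan
table = HTable.new(HBaseConfiguration.create, "source_files")   # hypothetical
scan  = Scan.new
scan.setCaching(100)                    # fetch rows from the regionserver in batches
scanner = table.getScanner(scan)
scanner.each_with_index do |result, i|  # JRuby exposes Java Iterables as Enumerables
  puts java.lang.String.new(result.getRow)
  break if i >= 9                       # peek at the first 10 row keys
end
scanner.close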
Slide 20: image-only slide (not captured in this transcript)
Slide 21
JRuby
list = ["Rod", "Neeta", "Eric", "Missy"]
shorts = list.find_all { |name| name.size <= 4 }
puts shorts.size
shorts.each { |name| puts name }
-> 2
-> Rod
Eric
Groovy
list = ["Rod", "Neeta", "Eric", "Missy"]
shorts = list.findAll { name -> name.size() <= 4 }
println shorts.size
shorts.each { name -> println name }
-> 2
-> Rod
Eric
Slide 22
• Hadoop
– NameNode is a single point of failure; append functionality still maturing
• HBase
– Backup, replication, and indexing solutions in flux
• Solr
– Several competing solutions around cloud-like scalability and
fault-tolerance, including ZooKeeper and Hadoop integration
– No clear winner, none quite ready for production
Slide 23
• Many moving parts
– It’s easy to let typos slip through
– Consider automated configuration via Chef, Puppet, or similar
• Pay attention to the details
– Operating system – max open files, sockets, and other limits (example below)
– Hadoop and HBase configuration
• http://hbase.apache.org/book.html#trouble
– Solr merge factor and norms
• Don’t starve HBase or Solr for memory
– Swapping will cripple your system
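For the max-open-files item above, a sketch of the usual fix on CentOS-era Linux; user names and the exact ceiling are site-specific (the HBase troubleshooting guide linked above covers this).
/etc/security/limits.conf
# Raise file-descriptor limits for the service users
hadoop  -  nofile  32768
hbase   -  nofile  32768
solr    -  nofile  32768
# Verify with `ulimit -n` as each user; watch for swapping with `vmstat 5`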
Slide 24
• “Commodity hardware” != 3-year-old desktop
• Dual quad-core, 32GB RAM, 4+ disks
• Don’t bother with RAID on Hadoop data disks
– Be wary of non-enterprise drives
• Expect ugly hardware issues at some point
Slide 25
• Environment
– 100+ CPU cores
– 100+ Terabytes of disk
– Machines don’t have identity
– Add capacity by plugging in new machines
• Why not Amazon EC2?
– Great for computational bursts
– Expensive for long-term storage of big data
– Not yet consistent enough for mission-critical usage of HBase
Slide 26
• Dual quad-core and dual hex-core
• Dell boxes
• 32-64GB RAM
– ECC (highly recommended by Google)
• 6 x 2TB enterprise hard drives
• RAID 1 on two of the drives
– OS, Hadoop, HBase, Solr, NFS mounts (be careful!), job code, etc.
– Key “source” data backups
• Hadoop datanode gets remaining drives
• Redundant enterprise switches
• Dual- and quad-gigabit NICs
Slide 27
• Amazon EC2
– EBS Storage
• 100TB * $0.10/GB/month = $120k/year
– Double Extra Large instances
• 13 EC2 compute units, 34.2GB RAM
• 20 instances * $1.00/hr * 8,760 hrs/yr = $175k/year
• 3-year reserved instances
– 20 * $4k = $80k up front to reserve
– (20 * $0.34/hr * 8,760 hrs/yr * 3 yrs) / 3 = $86k/year to operate
– Totals for 20 virtual machines
• 1st year cost: $120k + $80k + $86k = $286k
• 2nd & 3rd year costs: $120k + $86k = $206k
• Average: ($286k + $206k + $206k) / 3 = $232k/year
Slide 28
• Buy your own
– 20 Dell servers w/ 12 CPU cores, 32GB RAM, 5TB disk = $160k
• Over 33 EC2 compute units each
– Total: $53k/year (amortized over 3 years)
Slide 29
• Amazon EC2
– 20 instances * 13 EC2 compute units = 260 EC2 compute units
– Cost: $232k/year
• Buy your own
– 20 machines * 33 EC2 compute units = 660 EC2 compute units
– Cost: $53k/year
– Does not include hosting and maintenance costs
• Don’t think system administration goes away
– You still “own” all the instances – monitoring, debugging, support
Slide 30
• Hardware
– Power supplies, hard drives
• Operating System
– Kernel panics, zombie processes, dropped packets
• Software Servers
– Hadoop datanodes, HBase regionservers, Stargate servers, Solr
servers
• Your Code and Data
– Stray map/reduce jobs, strange corner cases in your data leading
to program failures
Slide 31: image-only slide (not captured in this transcript)
Slide 32
• Hadoop, HBase, Solr
• Apache, Tomcat, ZooKeeper,
HAProxy
• Stargate, JRuby, Lucene, Jetty,
HSQLDB, Geronimo
• Apache Commons, JUnit
• CentOS
• Dozens more
• Too expensive to build or buy
everything
Slide 33
• You can host big data in your own private cloud
– Tools are available today that didn’t exist a few years ago
– Fast to prototype – production readiness takes time
– Expect to invest in training and support
• HBase and Solr are fast
– 100+ random queries/sec per instance
– Give them memory and stand back
• HBase scales, Solr scales (to a point)
– Don’t worry about out-growing a few machines
– Do worry about out-growing a rack of Solr instances
• Look for ways to partition your data other than “automatic” sharding
Slide 34
Q&A
Slide 35
Rod Cope
rod.cope@roguewave.com
See us in action:
www.roguewave.com