Dipping Your Toe Into Hadoop
How to get started in Big Data without Big Costs
Bobby Dewitt
VP, Systems Architect
Aisle411
StampedeCon 2016
My Background
• Oracle, MySQL, and PostgreSQL DBA with 15 years of experience
• Led database, infrastructure, and business intelligence teams to deliver highly available data systems
• Currently responsible for design, implementation, and operational availability of infrastructure and systems at Aisle411
Aisle411
• Digitizing the indoor world
• Indoor maps, positioning, and analytics
• Asset and customer tracking within locations
• Using augmented reality to make indoor solutions more interactive
• Small company - big data
RDBMS Versus Hadoop
• Relational databases
  • Very structured data
  • Good for transactional and operational systems
  • Difficult to scale out
  • Hardware failures can be disastrous
• Hadoop
  • Semistructured or unstructured data
  • Good for batch and bulk processing as well as analytic systems
  • Simple to scale out
  • Hardware failures are handled seamlessly
Hadoop Adoption
• Still not a reality for many companies
• Major barriers include
  • Lack of skilled employees
  • Getting value out of the investment
  • Constant changes to the ecosystem
Kick the Tires
• Play around with it
• A Hadoop cluster can reside on a single machine
• Pre-loaded virtual machines
• Install on EC2 or another cloud VM
What Data Should I Use?
• Stick with what you know
• Choose a dataset that is not specific to your company
• Try documented examples and use cases
Example Datasets
• Apache web server logs
• Twitter feeds
• Stock market prices
• Census data
• Sports statistics
• Song data
Apache Web Log Data
• Many online resources
• Potentially large data set
• Real business value
• Combine with other data sources
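Before any Hadoop tooling is involved, it can help to see what one web log record contains. The following is a minimal sketch in Python, assuming the logs use Apache's Combined Log Format; the field names, the `parse_log_line` helper, and the sample line are illustrative and not taken from the talk. Requests containing escaped quotes are not handled.

```python
import re
from typing import Optional

# Combined Log Format (assumed here):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_log_line(line: str) -> Optional[dict]:
    """Return the fields of one access-log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # Apache writes "-" when no response body bytes were sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# Illustrative sample line, not real traffic.
SAMPLE = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)

if __name__ == "__main__":
    print(parse_log_line(SAMPLE))
```

Lines that do not match come back as `None`, so malformed records can be counted or set aside instead of stopping a load.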
From Batch to Streaming
• Initial testing done with a batch load using HDFS tools
• Set up streaming to provide near real-time updates
• Used several Hadoop components
  • HDFS
  • Flume
  • Morphlines
  • Avro
  • Hive
  • Impala
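The batch-load step can be sketched as a small Python wrapper around the standard `hdfs dfs` command-line tool. This is an illustration under assumptions, not the speaker's actual process: the local log path and the HDFS target directory are hypothetical, and the script defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess
from typing import List


def build_batch_load_commands(local_file: str, hdfs_dir: str) -> List[List[str]]:
    """Return the `hdfs dfs` invocations that copy one local file into HDFS."""
    return [
        # Create the target directory (and any parents) if it is missing.
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        # Upload the file, overwriting an earlier copy with the same name.
        ["hdfs", "dfs", "-put", "-f", local_file, hdfs_dir],
    ]


def run_batch_load(local_file: str, hdfs_dir: str, dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for command in build_batch_load_commands(local_file, hdfs_dir):
        print(shlex.join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical paths; adjust to the machine and cluster being tested.
    run_batch_load("/var/log/apache2/access.log", "/data/weblogs/raw")
```

Passing `dry_run=False` requires the `hdfs` client to be installed and configured on the machine running the script.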
Quick Wins
• Get data into HDFS
• Get data into Hive or Impala
• Stream live data
• Combine with other data sources
• Create pretty graphs and charts
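As a local stand-in for the kind of summary that feeds a chart, the sketch below counts requests per hour from Apache-style timestamps. In a Hadoop setup this aggregation would more likely be a Hive or Impala query; the `requests_per_hour` function and the sample timestamps are hypothetical and only show the shape of the result.

```python
from collections import Counter
from datetime import datetime


def requests_per_hour(timestamps):
    """Count requests per hour from Apache-style timestamps."""
    counts = Counter()
    for raw in timestamps:
        # Example input: "10/Oct/2000:13:55:36 -0700"
        when = datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")
        # Bucket by the hour as written in the log (its own UTC offset).
        counts[when.strftime("%Y-%m-%d %H:00")] += 1
    return dict(sorted(counts.items()))


# Illustrative timestamps, not real traffic.
SAMPLE_TIMES = [
    "10/Oct/2000:13:55:36 -0700",
    "10/Oct/2000:13:58:01 -0700",
    "10/Oct/2000:14:02:12 -0700",
]

if __name__ == "__main__":
    for hour, hits in requests_per_hour(SAMPLE_TIMES).items():
        print(hour, hits)
```

The returned dictionary maps each hour bucket to a hit count, which is the form most charting tools accept directly.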
Costs
• Start small with a data puddle
• Use virtual machines, not the big appliance
• Research and experimentation time may be the biggest cost
Where Am I?
• Evaluate your initial trials
• Is Hadoop everything you thought it would be?
• Do you have a real business need to use it?
• Can you migrate any existing data or processes?
Training
• Hortonworks University
• MapR Academy
• Cloudera quick start tutorials
• Online classes through Coursera, edX, and others
• Conferences like StampedeCon
Hadoop Is Not For Everyone
• Your “big data” may not be big enough
• Still some work to be done with security and tools
• Skills are being learned, but not quickly enough
Thank You
• Questions?
rdewitt@aisle411.com
