Localized Hadoop Development


Editor's Notes

  • #2: Thank you for attending today and thank you for giving me your time. Tonight, I’ll be talking a bit about training and developing in Hadoop and particularly the challenges of doing so.
  • #3: First, that awkward narcissistic slide where I tell you a little about myself. Like many of you, I grew up lovingly addicted to technology, especially computers. Seventeen years ago I finally turned that passion into a career, and over that time I’ve gotten my hands into many different verticals. Much of that time has been spent working with data, either as a DBA or as an engineer. Paired with that has been a lot of time in the Microsoft stack, either developing and supporting software applications or deploying and managing server infrastructure. As most of my career has been spent in managed hosting, I’ve also had quite a bit of experience with systems automation, monitoring, and infrastructure design and implementation, plus a little dabbling in network engineering. I’ve put my favorite quote there, by Thomas Edison. [READ THE QUOTE] You’ll find out why I like that quote so much in a bit.
  • #4: So, what IS the problem exactly? Well, I should probably start with my story. I got interested in Big Data several years ago when the term became mainstream. I did my typical Google-fu to see what I could learn about the technology and maybe convince my managers to look at implementing it. No dice. It felt like the more I dug, the more questions I had. Hadoop, HDFS, YARN, Pig, Sqoop, MapReduce, Spark, Hive, Solr, Lucene, ZooKeeper, Oozie… and I’ve only scratched the surface of the entire ecosystem. By the time I got INTO big data and Hadoop, it was already overwhelming. Alright, fine, I’d knuckle down and set up a private environment for myself so I could start learning this behemoth. At the time, most of the guides I followed directed me to the cloud providers… which I followed… and a several-hundred-dollar bill, after I forgot I’d left a cluster online for a month, put a big price tag on that lesson. The effect? I shied away, opting instead to try to learn Hadoop in other people’s environments… which of course took a lot more time. So Hadoop has a steep prerequisite for learning: having an environment to learn with in the first place.
  • #5: “Well, but Tim, there must be other options out there,” you’re probably saying right now. “What about Cloudera’s QuickStart VM?” you’re asking. Well, Cloudera has ended the QuickStart environment in favor of pushing a “free” trial of a hosted product. There are other options, and some of them can be pretty effective. Let’s touch back on the cloud-hosted method. There are a vast number of guides that will take you step by step through spinning up a Hadoop cluster in each of the major cloud providers. I will warn you that a lot of those guides are outdated and will have you scratching your head with older or mismatched versions of components. Also, set yourself a reminder: shut that thing down when you’re done with it. Your wallet will thank you later. As for book learning or video training, I’ve always envied people who can sit down, read a training manual cover to cover, and absorb all of that knowledge. Myself? I learn better when I’m getting my hands dirty. Video training à la Pluralsight or Lynda does a pretty good job, but it usually only gets you so far before sending you off on your own without a working environment to use. And of course, for those of you fortunate enough to have a full Cisco UCS chassis sitting in your basement just waiting for another workload to be thrown at it, more power to you. For the rest of us, if you have a spare PC lying around with a fair amount of memory (> 8 GB), you can cobble together a home lab, and there are plenty of guides out there on how to do that.
  • #6: So, what am I proposing? Well, Docker, to be quite honest. The portability, flexibility, and scalability make this option REALLY attractive. So attractive that I gave putting an environment together the good old college try. Now… this is where I fall on my sword and recall that quote from Thomas Edison. I… didn’t fail per se… but I certainly found at LEAST 10,000 ways to build a Dockerized Hadoop environment incorrectly. To that end, in my adventures in this space I’ve stumbled across several repositories that I’ve forked, enhanced, and utilized to create my own environment. What I’ve put together is a Docker Compose file that makes it quick and easy to build and provision a Hadoop cluster with Hive AND a multi-node Spark cluster, all of which is open source and ready to be further enhanced by anyone wanting to contribute.
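    A rough sketch of what such a Compose file might look like is below. This is illustrative only: the service names, image tags, and ports are assumptions for the sake of the example, not the actual contents of the repository (the Big Data Europe images are assumed as a base, since the talk credits that project).

    ```yaml
    # Illustrative sketch only -- service names, images, and ports are
    # assumptions, not the repository's actual docker-compose.yml.
    version: "3"
    services:
      namenode:
        image: bde2020/hadoop-namenode      # assumed Big Data Europe base image
        ports:
          - "9870:9870"                     # HDFS NameNode web UI
      datanode:
        image: bde2020/hadoop-datanode
        depends_on: [namenode]
      hive-server:
        image: bde2020/hive
        depends_on: [hive-metastore]
      hive-metastore:
        image: bde2020/hive
        depends_on: [metastore-db]
      metastore-db:
        image: postgres                     # backing database for the Hive metastore
      spark-master:
        image: bde2020/spark-master
      spark-worker:
        image: bde2020/spark-worker
        depends_on: [spark-master]
    ```

    All services share Compose's default bridge network, which is what lets the containers resolve each other by service name and lets the host reach published ports directly.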
  • #7: My goal with this environment is to give like-minded individuals a way to dip their toes into Hadoop at its core. It’s barebones Hadoop, Hive, and Spark. The idea is straight to the point: get data into the environment, add it to HDFS, create a Hive table for that data, and get to work. You can leave it at that, or you can spin up the Spark cluster and really get your hands dirty with the data. When you execute the docker-compose commands you see here, these are the containers that get provisioned. On the Hadoop side you have a namenode and a single datanode. You get a hive-server, a dedicated hive-metastore container, and a Postgres container that houses the Hive metastore database. On the Spark side you get a master and two worker nodes. All of this interconnects over Docker’s bridge networking, which also allows your workstation to connect to these components as if they were running on your machine. Once you’ve mastered the basics here, you can easily jump in and start adding more components, like Pig or Impala or maybe Ranger.
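    The basic workflow described above — land a file in HDFS, then expose it as a Hive table — might look roughly like this. Container names, the file, and the table schema are illustrative assumptions, not taken from the repository:

    ```shell
    # Copy a local CSV into the namenode container, then onto HDFS
    # (container names "namenode" / "hive-server" are assumptions).
    docker cp sales.csv namenode:/tmp/sales.csv
    docker exec namenode hdfs dfs -mkdir -p /data/sales
    docker exec namenode hdfs dfs -put /tmp/sales.csv /data/sales/

    # Create an external Hive table over that directory via beeline
    docker exec -it hive-server beeline -u jdbc:hive2://localhost:10000 -e "
      CREATE EXTERNAL TABLE sales (id INT, amount DOUBLE)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION '/data/sales';"
    ```

    An external table is a reasonable choice here because dropping it leaves the underlying HDFS files in place, which suits a learning environment where you rebuild tables often.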
  • #8: We’ve covered why I built the environment, but here are a few reasons why I think it could be helpful for others and why I’m sharing it with you all today. Obviously the most useful thing about this environment is enabling people to learn Hadoop, and to learn it without all the distractions that enterprise deployments bring with them. I’m looking at you, Cloudera. Development can take place in this environment, and I’m comfortable saying it will get you at least 90% of the way there; you’ll want to spend that last 10% tweaking your code for performance on whatever environment you’re actually deploying to. And lastly, maybe you’re assessing whether Hadoop is right for your team. With this environment you can rapidly stand up a proof of concept and decide whether Hadoop is right for your datasets, or whether Spark would be advantageous to you.
  • #9: The environment is not, let me repeat that, NOT for production purposes. It’s not optimized for performance at all, and that’s on purpose. I think part of the fun of working at this scale is troubleshooting the hair-raising events that would come up in a production environment. So the installation is completely default: throw your workload on it and tweak the performance to your liking. I don’t know if I made this clear enough before, but to reiterate, this environment is NOT for production. I haven’t taken any security standards or best practices into account when building this. Again, that’s on purpose. If I were to secure everything the way it should be, no one would want to use it. That said, it’s the perfect environment for learning how to implement security policies, so feel free to go nuts. Worst-case scenario, you blow away your containers and spin up new ones, ready to be broken again.
  • #11: Getting started is as simple as cloning the GitHub repository and following the instructions in the README. A few warnings, or disclaimers: this hasn’t been thoroughly tested on all platforms. Yes, it’s Docker, and as long as you’re running a recent version of it, it SHOULD work fine, but I think we all know there’s a big difference between SHOULD work and WILL work. In the spirit of open source, I also want to make it known that I will be actively maintaining this repository, so feel free to throw PRs my way, or fork my work and enhance it for your own uses.
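    In practice, getting up and running looks roughly like the following. The repository URL is a placeholder (it isn’t given in these notes), and the separate Spark compose filename is an assumption; the compose commands themselves are the standard ones:

    ```shell
    # Clone the repository (URL is a placeholder -- use the one in the README)
    git clone https://github.com/<user>/<hadoop-docker-repo>.git
    cd <hadoop-docker-repo>

    # Bring up the core Hadoop + Hive services in the background
    docker-compose up -d

    # Optionally bring up the Spark master and workers
    # (the separate compose filename here is an assumption)
    docker-compose -f docker-compose-spark.yml up -d

    # Tear everything down when you're done -- your wallet-friendly habit
    docker-compose down
    ```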
  • #14: That brings me to the end of my presentation. Thank you all for sitting through my babbling; hopefully you found at least some of it useful. Again, here is my contact information should you have ANY questions at all or want to help participate in the project. A HUGE thank you to Ivan Ermilov and his team at Big Data Europe. Their work REALLY saved me on this, and I highly recommend you check out what they’ve done in their repositories.