© 2016 IBM Corporation
Big Data Developer meetup
Installing Apache Hadoop and Spark from scratch
Ljubljana, June 2016
Agenda
 Why do you need Hadoop
 What do you need before you install Apache Hadoop
 Hadoop distributions
 Hadoop components you need to know about
 About Spark
 Installation process walk-through
 Adding cluster nodes
 Ways to automate
 Zero-install options
Why do you need Apache Hadoop
License – free
Scalable
General purpose MPP engine
Distributed storage
Packed with tools
Backend for your Big Data project
What do you need before you install Hadoop and Spark
 A server (or servers)
 An installed OS (for the IBM distribution: RHEL 6.5-7 or SUSE 11 SP3)
 A Hadoop distribution (more later)
 Or avoid all that trouble by using a VM or Docker if you are just experimenting (more later)
Apache Hadoop Distributions
 Hortonworks HDP
 Cloudera CDH
 IBM IOP (today’s focus)
 A number of others
 Distributions are similar but not identical, much like Linux distributions
 Some are part of ODP, some are not
Hadoop components you need to know about
 YARN – resource manager
 HDFS
 MapReduce
 Ambari
 ZooKeeper
 Hive
 Pig
 Sqoop
 Apache Spark is a fast, general-purpose, easy-to-use cluster computing system for large-scale data processing
– Fast
•Leverages aggressively cached in-memory distributed computing and dedicated App Executor processes that stay up even when no jobs are running
•Faster than MapReduce
– General purpose
•Covers a wide range of workloads
•Provides SQL, streaming and complex analytics
– Flexible and easier to use than MapReduce
•Spark is written in Scala, an object-oriented, functional programming language
•Scala, Python and Java APIs
•Scala and Python interactive shells
•Runs on Hadoop, Mesos, standalone or in the cloud
Figure: Logistic regression in Hadoop and Spark (from http://spark.apache.org)
Installation process walk-through
 Review the requirements
 Review the installation docs
 Get IOP software: http://www-01.ibm.com/support/docview.wss?uid=swg24040517
Prereqs
 Install OS
 Set up a yum repository
 Install prerequisites
• yum install nc
 Full list of preparation steps
 Make sure your hostname is in /etc/hosts
 Tweak some settings (disable Transparent Huge Pages)
• echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
• (this is the RHEL 6 path; on RHEL 7 the file is /sys/kernel/mm/transparent_hugepage/enabled)
 Generate an ssh key and set up passwordless ssh
• ssh-keygen
• chmod 700 ~/.ssh
• Check with ssh localhost
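Two of the checks above (hostname present in /etc/hosts, Transparent Huge Pages disabled) are easy to verify with a script before starting the installer. The following is a minimal sketch, not part of the IOP tooling; the function names `hostname_in_hosts` and `thp_mode` are made up for this illustration, and the second THP path is the RHEL 7 location rather than the one shown on the slide.

```python
import socket
from pathlib import Path

# Candidate THP control files: the first is the RHEL 6 path from the slide,
# the second is the location used on RHEL 7 (an addition for this sketch).
THP_PATHS = (
    "/sys/kernel/mm/redhat_transparent_hugepage/enabled",
    "/sys/kernel/mm/transparent_hugepage/enabled",
)


def hostname_in_hosts(hostname, hosts_text):
    """Return True if hostname is listed as a name on a non-comment line."""
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into whitespace-separated fields.
        fields = line.split("#", 1)[0].split()
        # fields[0] is the IP address; the rest are host names and aliases.
        if hostname in fields[1:]:
            return True
    return False


def thp_mode(enabled_text):
    """Return the active THP mode, e.g. 'never' for 'always madvise [never]'."""
    for token in enabled_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


if __name__ == "__main__":
    hosts_file = Path("/etc/hosts")
    hosts_text = hosts_file.read_text() if hosts_file.exists() else ""
    name = socket.gethostname()
    print("hostname in /etc/hosts:", hostname_in_hosts(name, hosts_text))
    for path in THP_PATHS:
        thp_file = Path(path)
        if thp_file.exists():
            print(path, "->", thp_mode(thp_file.read_text()))
```

Both helpers take the file contents as text, so they can be run against a copy of a node's files as well as on the node itself.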
Prereqs (cont.)
 Disable IPv6
 Configure ulimit
• /etc/security/limits.conf
 Disable SELinux
 Set up NTP on all servers
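As a rough illustration of the ulimit and SELinux steps, the sketch below renders entries in the `<domain> <type> <item> <value>` format used by /etc/security/limits.conf and reads the `SELINUX=` setting from the text of /etc/selinux/config. The helper names and the value 65536 are assumptions made for this example only; take the actual limits from the installation docs of your distribution.

```python
def limits_lines(domain, nofile, nproc):
    """Render limits.conf entries in '<domain> <type> <item> <value>' form."""
    lines = []
    for item, value in (("nofile", nofile), ("nproc", nproc)):
        for kind in ("soft", "hard"):
            lines.append(f"{domain} {kind} {item} {value}")
    return lines


def selinux_mode(config_text):
    """Return the SELINUX= value from /etc/selinux/config text, or None."""
    for line in config_text.splitlines():
        line = line.strip()
        # Comment lines start with '#', and 'SELINUXTYPE=' does not match here.
        if line.startswith("SELINUX="):
            return line.split("=", 1)[1].strip()
    return None


if __name__ == "__main__":
    # 65536 is an illustrative value, not a recommendation from the slides.
    print("\n".join(limits_lines("*", 65536, 65536)))
    sample = "# comment\nSELINUX=disabled\nSELINUXTYPE=targeted\n"
    print("SELinux mode:", selinux_mode(sample))
```

The script only prints the lines; appending them to limits.conf is left as a manual step so that nothing on the server changes by accident.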
First step – install Ambari
 Install repository
• yum install iop-4.1.0.0-1.<version>.<platform>.rpm
 Install Ambari
• yum install ambari-server
 Set up the Ambari server
• sudo ambari-server setup
 Start the Ambari server
• ambari-server start
 Go to the Ambari interface at http://<your-ip>:8080
• Default user/pass = admin/admin
 Launch the installation wizard
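Once the server is up, the same port also serves the Ambari REST API, which is handy for a quick scripted check. The sketch below assumes the `/api/v1/clusters` endpoint, HTTP basic auth with the default admin/admin credentials from the slide, and a response with an `items` list whose entries carry `Clusters.cluster_name`; treat the response shape as an assumption and verify it against your Ambari version. The helper names are hypothetical.

```python
import base64
import json
import urllib.request


def ambari_request(host, path, user="admin", password="admin", port=8080):
    """Build an authenticated request for the Ambari REST API (no network I/O)."""
    url = f"http://{host}:{port}/api/v1/{path.lstrip('/')}"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(url)
    req.add_header("Authorization", f"Basic {token}")
    # Ambari expects this header on modifying requests; it is harmless on GET.
    req.add_header("X-Requested-By", "ambari")
    return req


def list_clusters(host):
    """Return cluster names known to the Ambari server (assumed response shape)."""
    with urllib.request.urlopen(ambari_request(host, "clusters"), timeout=10) as resp:
        body = json.load(resp)
    return [item["Clusters"]["cluster_name"] for item in body.get("items", [])]
```

Before the wizard has been run, `list_clusters("<your-ip>")` would be expected to return an empty list; after a successful deployment it should contain the cluster name you provided.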
Ambari installation
 Next-next-next
 Provide cluster name
 Provide private ssh key
Choose services
Assign masters
Assign slaves and clients
Customize services
Here you would have to set up proper DB server connections in your production environment
Review and deploy
Validate
Adding a new cluster node
 Create a new server with the same prerequisites
 Make sure that passwordless ssh works from the Ambari server to the node
 ssh-copy-id -i ~/.ssh/id_rsa.pub root@hostname01
 And done
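Before adding the node in Ambari, it can help to confirm that key-based login really works without a prompt. A minimal sketch, to be run on the Ambari server: it calls the system `ssh` client with `BatchMode=yes`, which makes ssh fail instead of asking for a password. The function names are made up for this example, and `hostname01` is the placeholder host from the slide.

```python
import subprocess


def ssh_check_command(host, user="root", timeout=5):
    """Build an ssh command that fails fast instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",
        "-o", f"ConnectTimeout={timeout}",
        f"{user}@{host}",
        "true",  # trivial remote command; only the exit status matters
    ]


def passwordless_ssh_works(host, user="root"):
    """Return True if key-based login to the host succeeds without a prompt."""
    result = subprocess.run(ssh_check_command(host, user), capture_output=True)
    return result.returncode == 0
```

For example, `passwordless_ssh_works("hostname01")` should return True once the `ssh-copy-id` step has been done for that node.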
Extra steps
 Install Anaconda / Jupyter for data analysis
 PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port 12000 --ip='0.0.0.0'" ./bin/pyspark
Ways to automate - Ansible
 Simple automation tool
 Infrastructure as code
 Agent-less
 Easy to learn
 Check for examples online: search for “ansible hadoop playbook”
Zero-install options
•BigInsights QSE
•BigInsights on cloud (paid)
WRAP-UP
