Installing Hadoop / Spark from scratch

1
© 2016 IBM Corporation
Big Data Developer meetup
Installing Apache Hadoop and Spark from scratch
Ljubljana, June 2016

2
Agenda
 Why do you need Hadoop
 What do you need before you install Apache Hadoop
 Hadoop distributions
 Hadoop components you need to know about
 About Spark
 Installation process walk-through
 Adding cluster nodes
 Ways to automate
 Zero-install options

3
Why do you need Apache Hadoop
License – free
Scalable
General purpose MPP engine
Distributed storage
Packed with tools
Backend for your Big Data project

4
What do you need before you install Hadoop and Spark
 A server (or servers)
 An installed OS (for IBM IOP: RHEL 6.5-7 or SUSE 11 SP3)
 A Hadoop distribution (more later)
 Or avoid all that trouble by using a VM or Docker image if you are just experimenting (more later)

5
Apache Hadoop Distributions
 Hortonworks HDP
 Cloudera CDH
 IBM IOP (today’s focus)
 A number of others
 Distributions are similar but not identical, much like Linux distributions
 Some are part of ODP (Open Data Platform), some are not

6
Hadoop components you need to know about
 YARN – resource manager
 HDFS
 MapReduce
 Ambari
 ZooKeeper
 Hive
 Pig
 Sqoop

7
 Apache Spark is a fast, general-purpose, easy-to-use cluster computing system for large-scale data processing
– Fast
•Leverages aggressively cached in-memory distributed computing and dedicated App Executor processes even when no jobs are running
•Faster than MapReduce
– General purpose
•Covers a wide range of workloads
•Provides SQL, streaming and complex analytics
– Flexible and easier to use than MapReduce
•Spark is written in Scala, an object-oriented, functional programming language
•Scala, Python and Java APIs
•Scala and Python interactive shells
•Runs on Hadoop, Mesos, standalone or in the cloud
Figure: Logistic regression in Hadoop and Spark (from http://spark.apache.org)

8
Installation process walk-through
 Review the requirements
 Review the installation docs
 Get IOP software: http://www-01.ibm.com/support/docview.wss?uid=swg24040517

9
Prereqs
 Install OS
 Set up a yum repository
 Install prerequisites
• yum install nc
 Full list of preparation steps
 Make sure your hostname is in /etc/hosts
 Tweak some settings (disable Transparent Huge Pages)
• echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
• This is the RHEL 6 path; on RHEL 7 the file is /sys/kernel/mm/transparent_hugepage/enabled
 Generate an ssh key and set up passwordless ssh
• ssh-keygen
• cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
• chmod 700 ~/.ssh
• Check with ssh localhost
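
The hostname and Transparent Huge Pages steps above are easy to get wrong on one node out of many, so a small read-only check can help. The sketch below is only an illustration: the function names host_in_hosts_file and thp_mode are made up for this example, and both functions only read files, they change nothing.

```shell
#!/usr/bin/env bash
# Read-only pre-flight checks for the preparation steps above.
# Function names are hypothetical, chosen for this sketch only.

# Succeed if hostname $1 appears as a name or alias in hosts file $2
# (default /etc/hosts). Comments are ignored.
host_in_hosts_file() {
    local name="$1" file="${2:-/etc/hosts}"
    awk -v h="$name" '
        { sub(/#.*/, "") }                                   # strip comments
        { for (i = 2; i <= NF; i++) if ($i == h) found = 1 } # field 1 is the IP
        END { exit (found ? 0 : 1) }
    ' "$file"
}

# Print the active THP mode, i.e. the bracketed word in a line such as
# "always madvise [never]". Pass the path of the "enabled" file as $1.
thp_mode() {
    sed -n 's/.*\[\(.*\)\].*/\1/p' "$1"
}

# Example use on a node (paths as on the slide, RHEL 6):
#   host_in_hosts_file "$(hostname)" || echo "add $(hostname) to /etc/hosts"
#   thp_mode /sys/kernel/mm/redhat_transparent_hugepage/enabled
```

If thp_mode prints anything other than never, the echo command from the slide has not been applied (or did not survive a reboot).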

10
Prereqs (cont.)
 Disable IPv6
 Configure ulimit
• /etc/security/limits.conf
 Disable SELinux
 Set up NTP on all servers
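
The settings above can be verified without changing anything. The following is a minimal sketch with hypothetical function names; the minimum open-file limit is passed in as a parameter because the right value depends on your distribution's documentation, not on this example.

```shell
#!/usr/bin/env bash
# Read-only checks for the settings above; nothing here modifies the system.
# Function names are assumptions made for this sketch.

# Print the SELINUX= value (enforcing, permissive or disabled) from a
# config file, by default /etc/selinux/config.
selinux_config_mode() {
    sed -n 's/^SELINUX=\([a-z]*\).*/\1/p' "${1:-/etc/selinux/config}"
}

# Succeed if the current shell's open-file limit is at least $1.
nofile_at_least() {
    local want="$1" have
    have=$(ulimit -n)
    [ "$have" = "unlimited" ] || [ "$have" -ge "$want" ]
}

# Succeed if the flag file $1 contains 1. The default path is the kernel's
# "disable IPv6 on all interfaces" switch; if that file is missing, the
# function fails even though IPv6 may simply not be loaded.
ipv6_disabled() {
    [ "$(cat "${1:-/proc/sys/net/ipv6/conf/all/disable_ipv6}" 2>/dev/null)" = "1" ]
}

# Example use on a node:
#   selinux_config_mode
#   nofile_at_least 10000 || echo "raise nofile in /etc/security/limits.conf"
#   ipv6_disabled || echo "IPv6 still enabled"
```

The 10000 in the usage comment is only a placeholder value.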

11
First step – install Ambari
 Install repository
• yum install iop-4.1.0.0-1.<version>.<platform>.rpm
 Install Ambari
• yum install ambari-server
 Set up the Ambari server
• sudo ambari-server setup
 Start the Ambari server
• ambari-server start
 Go to the Ambari web interface at http://<your-ip>:8080
• Default user/pass = admin/admin
 Launch the installation wizard
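
The Ambari server can take a little while to come up after it is started, so when scripting these steps it may help to wait until the web interface answers. This is a sketch under assumptions: ambari_url and wait_for_ambari are hypothetical helper names, curl is assumed to be installed, and the defaults (30 tries, 5 seconds apart) are arbitrary.

```shell
#!/usr/bin/env bash
# Wait until the Ambari web interface responds before opening the wizard.
# Helper names and retry defaults are assumptions for this sketch.

# Build the web interface URL from a host ($1) and optional port ($2).
ambari_url() {
    printf 'http://%s:%s/\n' "$1" "${2:-8080}"
}

# Poll the URL with curl up to $2 times, sleeping $3 seconds between tries.
wait_for_ambari() {
    local url tries="${2:-30}" delay="${3:-5}" i=0
    url=$(ambari_url "$1")
    while [ "$i" -lt "$tries" ]; do
        if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
            echo "Ambari is answering at $url"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "No answer from $url after $tries tries" >&2
    return 1
}

# Example use, right after "ambari-server start":
#   wait_for_ambari <your-ip>
```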

12
Ambari installation
 Next-next-next
 Provide cluster name
 Provide the private SSH key generated earlier

13
Choose services

14
Assign masters

15
Assign slaves and clients

16
Customize services
In a production environment, this is where you would set up proper database server connections

17
Review and deploy

18
Validate

19
Adding a new cluster node
 Create a new server with the same prerequisites
 Make sure that passwordless ssh works from the Ambari server to the node
• ssh-copy-id -i ~/.ssh/id_rsa.pub root@hostname01
 And done
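
When several nodes are added at once, the ssh-copy-id step can be generated from a list of hostnames. The sketch below only prints the commands so they can be reviewed before running; key_copy_commands is a hypothetical helper name, and the hosts file (one hostname per line) is an assumption of this example.

```shell
#!/usr/bin/env bash
# Print one ssh-copy-id command per new node (dry run; nothing is executed).
# key_copy_commands is a made-up helper name for this sketch.
#   $1  file with one hostname per line (blank lines and # comments skipped)
#   $2  public key to distribute (default ~/.ssh/id_rsa.pub)
#   $3  remote user (default root, as on the slide)
key_copy_commands() {
    local hosts_file="$1" pubkey="${2:-$HOME/.ssh/id_rsa.pub}" user="${3:-root}"
    local host
    while read -r host _ || [ -n "$host" ]; do
        case "$host" in
            ''|'#'*) continue ;;   # skip blank lines and comments
        esac
        printf 'ssh-copy-id -i %s %s@%s\n' "$pubkey" "$user" "$host"
    done < "$hosts_file"
}

# Example use on the Ambari server:
#   key_copy_commands new-nodes.txt           # review the output
#   key_copy_commands new-nodes.txt | sh      # then run it
```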

20
Extra steps
 Install Anaconda / Jupyter for data analysis
 PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port 12000 --ip='0.0.0.0'" ./bin/pyspark

21
Ways to automate - Ansible
 Simple automation tool
 Infrastructure as code
 Agent-less
 Easy to learn
 For examples, search online for “ansible hadoop playbook”

22
Zero-install options
•BigInsights QSE (Quick Start Edition)
•BigInsights on cloud (paid)

23
WRAP-UP
