SMF analysis using
Apache Spark and Jupyterlab
April 9th 2024 – GSE z/OS Expertenforum
Marcel Schmidt
Introduction
Infrastructure overview
Component description
Building the infrastructure
- WSL 2 (Windows Subsystem for Linux)
- Apache Spark
- Jupyterlab
- PostgreSQL
- Nvidia CUDA (Compute Unified Device Architecture)
- Tesla T4 accelerator card
Demo
Wrapup
Agenda
● It is possible to analyze SMF data using an assembly of open source technologies
● The necessary infrastructure can be built on a Windows or Linux platform
● The SMF records must be transformed and made available on the analysis platform, either as JSON files or as PostgreSQL records
● There are commercial solutions available that combine these open source tools with proprietary code to directly access SMF data residing on z/OS
Introduction
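To make that transformation step concrete, below is a minimal PySpark sketch of reading transformed SMF records from JSON into a Spark DataFrame; the file name smf110.json is an illustrative assumption, not an asset from this deck.

# Minimal sketch, assuming SMF records were already transformed off-host into
# line-delimited JSON (the file name smf110.json is hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smf-analysis").getOrCreate()

df = spark.read.json("smf110.json")  # Spark expects one JSON object per line
df.printSchema()                     # inspect the transformed record layout
df.show(5)                           # peek at the first few records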
Infrastructure Overview
Infrastructure overview
Component description
● Windows Subsystem for Linux V2
Allows you to run a Linux environment on a Windows machine without the need for a separate virtualization solution.
Component description – WSL 2
● Unified analytics engine for large-scale data processing, aka "Big Data" and "Hadoop"
● Spark Core provides distributed task dispatching, scheduling, and basic I/O functionalities, exposed through an API for Java, Python, Scala, .NET and R
● Spark SQL provides a data abstraction called DataFrames
Component description – Apache Spark
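As a hedged illustration of that DataFrame abstraction (the column names jobname and cputime are hypothetical, not actual SMF record fields):

# Sketch of the Spark SQL / DataFrame abstraction; jobname and cputime are
# hypothetical column names used for illustration only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smf-sql").getOrCreate()
df = spark.read.json("smf110.json")  # assumed pre-transformed SMF data

df.createOrReplaceTempView("smf")    # expose the DataFrame to SQL
spark.sql("""
    SELECT jobname, SUM(cputime) AS total_cpu
    FROM smf
    GROUP BY jobname
    ORDER BY total_cpu DESC
    LIMIT 10
""").show()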
● JupyterLab is the latest web-based interactive development environment for notebooks, code, and data.
● Its flexible interface allows users to configure and arrange workflows in data science, scientific computing, computational journalism, and machine learning.
● A modular design invites extensions to expand and enrich functionality.
● The JupyterLab environment provides a productivity-focused redesign of Jupyter Notebook. It introduces tools such as a built-in HTML viewer and CSV viewer, along with features that unify several discrete features of Jupyter Notebooks on the same screen.
Component description – Jupyterlab
PostgreSQL is a powerful, open source object-relational database system with over 35 years of active development that has earned it a strong reputation for reliability, feature robustness, and performance.
There is a wealth of information describing how to install and use PostgreSQL in the official documentation.
Component description – PostgreSQL
CUDA® is a parallel computing platform and programming model developed by NVIDIA for general computing on graphics processing units (GPUs). With CUDA, developers are able to dramatically speed up computing applications by harnessing the power of GPUs.
In GPU-accelerated applications, the sequential part of the workload runs on the CPU – which is optimized for single-threaded performance – while the compute-intensive portion of the application runs on thousands of GPU cores in parallel.
Component description – Nvidia CUDA
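For a feel of that CPU/GPU split from Python, here is a sketch using CuPy; note that CuPy is not installed by the steps in this deck (it would be an extra pip install cupy-cuda12x), so treat it as illustrative only.

# Illustrative sketch only: CuPy is an assumption, not part of this deck's
# install steps (pip install cupy-cuda12x).
import cupy as cp

x = cp.random.random(10_000_000)  # array allocated in Tesla T4 device memory
gpu_mean = x.mean()               # reduction runs in parallel on the GPU cores
print(float(gpu_mean))            # copying the scalar back runs on the CPU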
The Tesla T4 is a professional graphics card by NVIDIA. Built on the 12 nm process and based on the TU104 graphics processor, the card supports DirectX 12 Ultimate.
● It features 2560 shading units, 160 texture mapping units, and 64 ROPs. Also included are 320 tensor cores, which help improve the speed of machine learning applications.
● NVIDIA has paired 16 GB of GDDR6 memory with the Tesla T4, connected using a 256-bit memory interface. The GPU operates at a frequency of 585 MHz, which can be boosted up to 1590 MHz; memory runs at 1250 MHz (10 Gbps effective).
● It does not require any additional power connector; its power draw is rated at 70 W maximum.
Component description – Tesla T4 Hardware
Building the infrastructure
WSL 2 / Ubuntu distribution
1. Ensure that your WSL version is 0.67.6 or newer.
Systemd support is required!
To check, run wsl --version.
To update, run wsl --update or download from MS Store
2. wsl --install
3. reboot Windows
4. wsl --install Ubuntu
5. wsl --list --verbose
NAME STATE VERSION
* Ubuntu Running 2
6. wsl
7. sudo apt update; sudo apt upgrade
8. sudo apt install wget tar net-tools mc -y
Apache Spark (1)
1. Install Java runtime
Apache Spark requires Java to run
sudo apt install curl mlocate default-jdk -y
2. Download Apache Spark
Download the latest release of Apache Spark from the downloads page.
https://spark.apache.org/downloads.html
VER=3.5.1 (23. Feb. 2024)
wget https://dlcdn.apache.org/spark/spark-$VER/spark-$VER-bin-hadoop3.tgz
tar xvf spark-$VER-bin-hadoop3.tgz
Move the Spark folder created after extraction to the /opt/ directory.
sudo mv spark-$VER-bin-hadoop3/ /opt/spark
Apache Spark (2)
# Set Spark environment
# Open your bashrc configuration file.
nano ~/.bashrc
add:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Activate changes:
source ~/.bashrc
Apache Spark (3)
3. Start a standalone master server:
start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-EMA.out
The process will be listening on TCP port 8080.
sudo ss -tunelp | grep 8080
tcp LISTEN 0 1 *:8080 *:* users:
(("java",pid=5437,fd=286)) ino:61662 sk:6 cgroup:/ v6only:0 <->
http://localhost:8080/
My Spark URL is spark://EMA:7077
Apache Spark (4)
4. Starting the Spark worker process
The start-worker.sh command is used to start the Spark worker process:
start-worker.sh spark://EMA:7077
starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-EMA.out
5. Using the Spark shell
Use the spark-shell command to access the Spark shell:
spark-shell
Apache Spark (5)
spark-shell
24/04/07 12:33:43 WARN Utils: Your hostname, EMA resolves to a loopback address: 127.0.1.1;
using 172.26.226.96 instead (on interface eth0)
24/04/07 12:33:43 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/04/07 12:33:54 WARN NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
Spark context Web UI available at http://172.26.226.96:4040
Spark context available as 'sc' (master = local[*], app id = local-1712486036586).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.1
      /_/
Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 11.0.22)
Type in expressions to have them evaluated.
Jupyterlab (1)
pre-requisites
sudo apt install python3 python3-pip python3-venv nodejs -y
python3 --version
Python 3.10.12
pip3 --version
pip 22.0.2 from /usr/lib/python3/dist-packages/pip (python 3.10)
Jupyterlab (2)
add user and group
Run the following commands to create a new user called jupyteruser and grant it sudo permission:
# Add a new group
sudo groupadd jupyter
# Create jupyteruser and add it to the jupyter group
sudo useradd --groups jupyter jupyteruser
sudo passwd jupyteruser
# Add jupyteruser to the sudo group
sudo adduser jupyteruser sudo
# Create the home directory before changing its owner
sudo mkdir /home/jupyteruser
sudo chown jupyteruser:jupyter /home/jupyteruser
su - jupyteruser
Jupyterlab (3)
python3 -m pip install --user --upgrade pip
python3 -m pip install --user psycopg2-binary bokeh plotly chart_studio numpy scipy python-dotenv
python3 -m pip install --user jupyterlab
python3 -m pip install --user pyspark
python3 -m pip install --user matplotlib seaborn
# install scala kernel
pip install spylon-kernel
sudo python3 -m spylon-kernel install
Jupyterlab (4)
https (ssl) setup
mkdir ~/ssl_cert && cd ~/ssl_cert
# Generate a new private key.
openssl genrsa -out jupyter.key 2048
# Create a certificate signing request (CSR).
openssl req -new -key jupyter.key -out jupyter.csr
# Create a self-signed certificate
openssl x509 -req -days 365 -in jupyter.csr -signkey jupyter.key -out jupyter.pem
Certificate request self-signature ok
subject=C = CH, ST = Thurgau, L = Ettenhausen, O = MMS IT GmbH
Jupyterlab (5)
# Password protect your JupyterLab server by generating and modifying a Jupyter config file:
jupyter server --generate-config
Writing default config to: /home/jupyteruser/.jupyter/jupyter_server_config.py
jupyter server password
[JupyterPasswordApp] Wrote hashed password to
/home/jupyteruser/.jupyter/jupyter_server_config.json
# Find the config file and open it; changes are required for SSL
nano ~/.jupyter/jupyter_server_config.py
If using the SSL certificate, also add the location of the certificate file and the private key to the config file:
c.ServerApp.certfile = '/home/jupyteruser/ssl_cert/jupyter.pem'
c.ServerApp.keyfile = '/home/jupyteruser/ssl_cert/jupyter.key'
mkdir /home/jupyteruser/notebooks
jupyter-lab --no-browser --ip "*" --notebook-dir=/home/jupyteruser/notebooks --port=8888
Jupyterlab (6)
systemd Setup
sudo nano /etc/systemd/system/jupyter.service
add the following lines:
[Unit]
Description=Jupyter Notebook
[Service]
Type=simple
PIDFile=/run/jupyter.pid
# If you need environment variables for Tensorflow GPU work, .bashrc usually does the job,
# but you need to somehow make those available to the Jupyter service, or else notebooks
# that need the GPU won't be able to see it.
Environment="PATH=/usr/local/cuda-12.3/bin:$PATH"
Environment="LD_LIBRARY_PATH=/usr/local/cuda-12.3/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH"
Environment="CUDA_HOME=/usr/local/cuda-12.3"
Jupyterlab (7)
Environment="PYSPARK_ALLOW_INSECURE_GATEWAY=1"
Environment="CLASSPATH=/home/jupyteruser/postgresql-42.5.0.jar:$CLASSPATH"
ExecStart=/home/jupyteruser/.local/bin/jupyter-lab --notebook-dir=/home/jupyteruser/notebooks --no-browser --ip "*" --port=8888
User=jupyteruser
Group=jupyter
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
Jupyterlab (8)
sudo systemctl enable jupyter
Created symlink /etc/systemd/system/multi-user.target.wants/jupyter.service →
/etc/systemd/system/jupyter.service.
Reload the systemd daemon and restart the service
sudo systemctl daemon-reload
sudo systemctl restart jupyter
sudo systemctl status jupyter
jupyter.service - Jupyter Notebook
Loaded: loaded (/etc/systemd/system/jupyter.service; enabled; vendor preset: enabled)
Active: active (running) since Sun 2024-04-07 14:03:11 CEST; 27ms ago
Main PID: 7507 (jupyter-lab)
Tasks: 1 (limit: 4589)
Memory: 2.8M
CGroup: /system.slice/jupyter.service
└─7507 /usr/bin/python3 /home/jupyteruser/.local/bin/jupyter-lab
--notebook-dir=/home/jupyteruser/notebook>
Apr 07 14:03:11 EMA systemd[1]: Started Jupyter Notebook.
Jupyterlab (9)
Finally, you can monitor the output of the service:
To show the log messages since the last boot (-b) and without additional fields like timestamp and
hostname (-o cat), type:
sudo journalctl -u jupyter -b -o cat -f
Open a browser window on your local computer and enter the following to open the notebook.
https://[External IP]:8888
PostgreSQL (1)
apt install postgresql libpostgresql-jdbc-java
systemctl start postgresql
systemctl enable postgresql
systemctl status postgresql
# You will need a JDBC driver to connect Apache Spark to your PostgreSQL database. It's available for download here:
cd /home/jupyteruser
wget https://jdbc.postgresql.org/download/postgresql-42.7.3.jar
chown jupyteruser:jupyter postgresql-42.7.3.jar
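A minimal PySpark sketch of using that JDBC driver from a notebook; the database name, table, user, and password below are placeholder assumptions for wherever the SMF records were loaded.

# Sketch: read SMF records from PostgreSQL via the JDBC driver downloaded above.
# Database name (smfdb), table (smf110), user, and password are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("smf-postgres")
         .config("spark.jars", "/home/jupyteruser/postgresql-42.7.3.jar")
         .getOrCreate())

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/smfdb")
      .option("dbtable", "smf110")
      .option("user", "jupyteruser")
      .option("password", "changeme")
      .option("driver", "org.postgresql.Driver")
      .load())

df.show(5)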
Nvidia CUDA (1)
# Disable the "nouveau" driver: it tries to activate the Tesla card as a graphics card, which doesn't work because the card has no graphics port.
In /etc/default/grub, add the following phrase to the value of
GRUB_CMDLINE_LINUX:
module_blacklist=nouveau
Create /etc/modprobe.d/nouveau.conf and add the following line:
blacklist nouveau
Rebuild modules:
depmod -a
Rebuild your grub config:
grub2-mkconfig --output=/boot/efi/EFI/rocky/grub.cfg
Nvidia CUDA (2)
Download and install the Nvidia Tesla driver
wget https://us.download.nvidia.com/tesla/525.60.13/NVIDIA-Linux-x86_64-525.60.13.run
chmod +x *.run
Execute the downloaded package in the Shell
./NVIDIA-xxx --kernel-source-path=/usr/src/kernels/xxx
Nvidia CUDA (3)
nvidia-smi
Sat Dec 17 14:03:36 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13 Driver Version: 525.60.13 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:01:00.0 Off | 0 |
| N/A 93C P0 41W / 70W | 2MiB / 15360MiB | 8% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Nvidia CUDA (4) – CUDA Toolkit
wget https://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.243_418.87.00_linux.run
sh cuda_10.1.243_418.87.00_linux.run --override (--override required to bypass gcc version check)
# unselect the driver. install the rest
===========
= Summary =
===========
Driver: Not Selected
Toolkit: Installed in /usr/local/cuda-10.1/
Samples: Installed in /root/, but missing recommended libraries
Please make sure that
- PATH includes /usr/local/cuda-10.1/bin
- LD_LIBRARY_PATH includes /usr/local/cuda-10.1/lib64, or, add /usr/local/cuda-10.1/lib64 to
/etc/ld.so.conf and run ldconfig as root
To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-10.1/bin
Demo
Install the demo assets
Download
SMF110_Spark_Python3.ipynb
SMF110_data.json.zip
From
https://github.com/IzODA/examples/tree/master/SMF
and put them into /home/jupyteruser/notebooks
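The notebook itself is the reference; as a rough sketch of what its first steps amount to (assuming the zip unpacks to SMF110_data.json):

# Rough sketch of the demo's first steps, not the notebook's exact code.
# Assumption: SMF110_data.json.zip unpacks to SMF110_data.json.
import zipfile
from pyspark.sql import SparkSession

with zipfile.ZipFile("/home/jupyteruser/notebooks/SMF110_data.json.zip") as z:
    z.extractall("/home/jupyteruser/notebooks")

spark = SparkSession.builder.appName("smf110-demo").getOrCreate()
df = spark.read.json("/home/jupyteruser/notebooks/SMF110_data.json")
df.printSchema()  # SMF 110 (CICS monitoring) fields exposed as columns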
Demo (1)
Demo (2)
Demo (3)
Demo (4)
Demo (5)
Demo (6)
Demo (7)
Demo (8)