SMF analysis using
Apache Spark and Jupyterlab
April 9th 2024 – GSE z/OS Expertenforum
Marcel Schmidt
Introduction
Infrastructure overview
Component description
Building the infrastructure
- WSL 2 (Windows Subsystem for Linux)
- Apache Spark
- Jupyterlab
- PostgreSQL
- Nvidia CUDA (Compute Unified Device Architecture)
- Tesla T4 accelerator card
Demo
Wrapup
Agenda
● It is possible to analyze SMF data using an assembly of open source technologies
● The necessary infrastructure can be built on a Windows or Linux platform
● The SMF records must be transformed and made available on the analysis platform, either as JSON files or as PostgreSQL records
● There are commercial solutions available that combine these open source tools with proprietary code to directly access SMF data residing on z/OS
Introduction
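To make that transformation step concrete, below is a minimal PySpark sketch of reading transformed SMF records from JSON into a Spark DataFrame; the file name smf110.json is an illustrative assumption, not an asset from this deck.

# Minimal sketch, assuming SMF records were already transformed off-host into
# line-delimited JSON (the file name smf110.json is hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smf-analysis").getOrCreate()

df = spark.read.json("smf110.json")  # Spark expects one JSON object per line
df.printSchema()                     # inspect the transformed record layout
df.show(5)                           # peek at the first few records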
Infrastructure Overview
Infrastructure overview
Component description
● Windows Subsystem for Linux V2
Allows you to run a Linux environment on a Windows machine without the need for a separate virtualization solution.
Component description – WSL 2
● Unified analytics engine for large-scale data processing, aka "Big Data" and "Hadoop"
● Spark Core provides distributed task dispatching, scheduling, and basic I/O functionalities, exposed through an API for Java, Python, Scala, .NET and R
● Spark SQL provides a data abstraction called DataFrames
Component description – Apache Spark
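As a hedged illustration of that DataFrame abstraction (the column names jobname and cputime are hypothetical, not actual SMF record fields):

# Sketch of the Spark SQL / DataFrame abstraction; jobname and cputime are
# hypothetical column names used for illustration only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smf-sql").getOrCreate()
df = spark.read.json("smf110.json")  # assumed pre-transformed SMF data

df.createOrReplaceTempView("smf")    # expose the DataFrame to SQL
spark.sql("""
    SELECT jobname, SUM(cputime) AS total_cpu
    FROM smf
    GROUP BY jobname
    ORDER BY total_cpu DESC
    LIMIT 10
""").show()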
● JupyterLab is the latest web-based interactive development environment for notebooks, code, and data.
● Its flexible interface allows users to configure and arrange workflows in data science, scientific computing, computational journalism, and machine learning.
● A modular design invites extensions to expand and enrich functionality.
● The JupyterLab environment provides a productivity-focused redesign of Jupyter Notebook. It introduces tools such as a built-in HTML viewer and CSV viewer, along with features that unify several discrete features of Jupyter Notebooks on the same screen.
Component description – Jupyterlab
PostgreSQL is a powerful, open source object-relational database system with over 35 years of active development that has earned it a strong reputation for reliability, feature robustness, and performance.
There is a wealth of information describing how to install and use PostgreSQL in the official documentation.
Component description – PostgreSQL
CUDA® is a parallel computing platform and programming model developed by NVIDIA for general computing on graphics processing units (GPUs). With CUDA, developers are able to dramatically speed up computing applications by harnessing the power of GPUs.
In GPU-accelerated applications, the sequential part of the workload runs on the CPU – which is optimized for single-threaded performance – while the compute-intensive portion of the application runs on thousands of GPU cores in parallel.
Component description – Nvidia CUDA
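For a feel of that CPU/GPU split from Python, here is a sketch using CuPy; note that CuPy is not installed by the steps in this deck (it would be an extra pip install cupy-cuda12x), so treat it as illustrative only.

# Illustrative sketch only: CuPy is an assumption, not part of this deck's
# install steps (pip install cupy-cuda12x).
import cupy as cp

x = cp.random.random(10_000_000)  # array allocated in Tesla T4 device memory
gpu_mean = x.mean()               # reduction runs in parallel on the GPU cores
print(float(gpu_mean))            # copying the scalar back runs on the CPU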
The Tesla T4 is a professional graphics card by NVIDIA. Built on the 12 nm process and based on the TU104 graphics processor, the card supports DirectX 12 Ultimate.
● It features 2560 shading units, 160 texture mapping units, and 64 ROPs. Also included are 320 tensor cores, which help improve the speed of machine learning applications.
● NVIDIA has paired 16 GB of GDDR6 memory with the Tesla T4, connected using a 256-bit memory interface. The GPU operates at a frequency of 585 MHz, which can be boosted up to 1590 MHz; memory runs at 1250 MHz (10 Gbps effective).
● It does not require any additional power connector; its power draw is rated at 70 W maximum.
Component description – Tesla T4 Hardware
Building the infrastructure
WSL 2 / Ubuntu distribution
1. Ensure that your WSL version is 0.67.6 or newer.
Systemd support is required!
To check, run wsl --version.
To update, run wsl --update or download from MS Store
2. wsl --install
3. reboot Windows
4. wsl --install Ubuntu
5. wsl --list --verbose
NAME STATE VERSION
* Ubuntu Running 2
6. wsl
7. sudo apt update; sudo apt upgrade
8. sudo apt install wget tar net-tools mc -y
Apache Spark (1)
1. Install Java runtime
Apache Spark requires Java to run
sudo apt install curl mlocate default-jdk -y
2. Download Apache Spark
Download the latest release of Apache Spark from the downloads page.
https://spark.apache.org/downloads.html
VER=3.5.1 (23. Feb. 2024)
wget https://dlcdn.apache.org/spark/spark-$VER/spark-$VER-bin-hadoop3.tgz
tar xvf spark-$VER-bin-hadoop3.tgz
Move the Spark folder created after extraction to the /opt/ directory.
sudo mv spark-$VER-bin-hadoop3/ /opt/spark
Apache Spark (2)
# Set Spark environment
# Open your bashrc configuration file.
nano ~/.bashrc
add:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Activate changes:
source ~/.bashrc
Apache Spark (3)
3. Start a standalone master server:
start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-EMA.out
The process will be listening on TCP port 8080.
sudo ss -tunelp | grep 8080
tcp LISTEN 0 1 *:8080 *:* users:
(("java",pid=5437,fd=286)) ino:61662 sk:6 cgroup:/ v6only:0 <->
http://localhost:8080/
My Spark URL is spark://EMA:7077
Apache Spark (4)
4. Starting the Spark worker process
The start-worker.sh command is used to start the Spark worker process:
start-worker.sh spark://EMA:7077
starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-EMA.out
5. Using the Spark shell
Use the spark-shell command to access the Spark shell:
spark-shell
Apache Spark (5)
spark-shell
24/04/07 12:33:43 WARN Utils: Your hostname, EMA resolves to a loopback address: 127.0.1.1;
using 172.26.226.96 instead (on interface eth0)
24/04/07 12:33:43 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/04/07 12:33:54 WARN NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
Spark context Web UI available at http://172.26.226.96:4040
Spark context available as 'sc' (master = local[*], app id = local-1712486036586).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.1
      /_/
Using Scala version 2.12.18 (OpenJDK 64-Bit Server VM, Java 11.0.22)
Type in expressions to have them evaluated.
Jupyterlab (1)
pre-requisites
sudo apt install python3 python3-pip python3-venv nodejs -y
python3 --version
Python 3.10.12
pip3 --version
pip 22.0.2 from /usr/lib/python3/dist-packages/pip (python 3.10)
Jupyterlab (2)
add user and group
Run the following commands to create a new user called jupyteruser and grant it sudo permission:
# Add a new group
sudo groupadd jupyter
# Create jupyteruser and add it to the jupyter group
sudo useradd --groups jupyter jupyteruser
sudo passwd jupyteruser
# Add jupyteruser to the sudo group
sudo adduser jupyteruser sudo
# Create the home directory before changing its owner
sudo mkdir /home/jupyteruser
sudo chown jupyteruser:jupyter /home/jupyteruser
su - jupyteruser
Jupyterlab (3)
python3 -m pip install --user --upgrade pip
python3 -m pip install --user psycopg2-binary bokeh plotly chart_studio numpy scipy python-dotenv
python3 -m pip install --user jupyterlab
python3 -m pip install --user pyspark
python3 -m pip install --user matplotlib seaborn
# install scala kernel
pip install spylon-kernel
sudo python3 -m spylon-kernel install
Jupyterlab (4)
https (ssl) setup
mkdir ~/ssl_cert && cd ~/ssl_cert
# Generate a new private key.
openssl genrsa -out jupyter.key 2048
# Create a certificate signing request (CSR).
openssl req -new -key jupyter.key -out jupyter.csr
# Create a self-signed certificate
openssl x509 -req -days 365 -in jupyter.csr -signkey jupyter.key -out jupyter.pem
Certificate request self-signature ok
subject=C = CH, ST = Thurgau, L = Ettenhausen, O = MMS IT GmbH
Jupyterlab (5)
# Password protect your JupyterLab server by generating and modifying a Jupyter config file:
jupyter server --generate-config
Writing default config to: /home/jupyteruser/.jupyter/jupyter_server_config.py
jupyter server password
[JupyterPasswordApp] Wrote hashed password to
/home/jupyteruser/.jupyter/jupyter_server_config.json
# Find the config file and open it; changes are required for SSL
nano ~/.jupyter/jupyter_server_config.py
If using the SSL certificate, also add the location of the certificate file and the private key to the config file:
c.ServerApp.certfile = '/home/jupyteruser/ssl_cert/jupyter.pem'
c.ServerApp.keyfile = '/home/jupyteruser/ssl_cert/jupyter.key'
mkdir /home/jupyteruser/notebooks
jupyter-lab --no-browser --ip "*" --notebook-dir=/home/jupyteruser/notebooks --port=8888
Jupyterlab (6)
systemd Setup
sudo nano /etc/systemd/system/jupyter.service
add the following lines:
[Unit]
Description=Jupyter Notebook
[Service]
Type=simple
PIDFile=/run/jupyter.pid
# If you need environment variables for Tensorflow GPU work, .bashrc usually does the job,
# but you need to somehow make those available to the Jupyter service, or else notebooks
# that need the GPU won't be able to see it.
Environment="PATH=/usr/local/cuda-12.3/bin:$PATH"
Environment="LD_LIBRARY_PATH=/usr/local/cuda-12.3/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH"
Environment="CUDA_HOME=/usr/local/cuda-12.3"
Jupyterlab (7)
Environment="PYSPARK_ALLOW_INSECURE_GATEWAY=1"
Environment="CLASSPATH=/home/jupyteruser/postgresql-42.5.0.jar:$CLASSPATH"
ExecStart=/home/jupyteruser/.local/bin/jupyter-lab --notebook-dir=/home/jupyteruser/notebooks --no-browser --ip "*" --port=8888
User=jupyteruser
Group=jupyter
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
Jupyterlab (8)
sudo systemctl enable jupyter
Created symlink /etc/systemd/system/multi-user.target.wants/jupyter.service →
/etc/systemd/system/jupyter.service.
Reload the systemd daemon and restart the service
sudo systemctl daemon-reload
sudo systemctl restart jupyter
sudo systemctl status jupyter
jupyter.service - Jupyter Notebook
Loaded: loaded (/etc/systemd/system/jupyter.service; enabled; vendor preset: enabled)
Active: active (running) since Sun 2024-04-07 14:03:11 CEST; 27ms ago
Main PID: 7507 (jupyter-lab)
Tasks: 1 (limit: 4589)
Memory: 2.8M
CGroup: /system.slice/jupyter.service
└─7507 /usr/bin/python3 /home/jupyteruser/.local/bin/jupyter-lab
--notebook-dir=/home/jupyteruser/notebook>
Apr 07 14:03:11 EMA systemd[1]: Started Jupyter Notebook.
Jupyterlab (9)
Finally, you can monitor the output of the service:
To show the log messages since the last boot (-b) and without additional fields like timestamp and
hostname (-o cat), type:
sudo journalctl -u jupyter -b -o cat -f
Open a browser window on your local computer and enter the following to open the notebook.
https://[External IP]:8888
PostgreSQL (1)
apt install postgresql libpostgresql-jdbc-java
systemctl start postgresql
systemctl enable postgresql
systemctl status postgresql
# You will need a JDBC driver to connect Apache Spark to your PostgreSQL database. It's available for download here:
cd /home/jupyteruser
wget https://jdbc.postgresql.org/download/postgresql-42.7.3.jar
chown jupyteruser:jupyter postgresql-42.7.3.jar
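A minimal PySpark sketch of using that JDBC driver from a notebook; the database name, table, user, and password below are placeholder assumptions for wherever the SMF records were loaded.

# Sketch: read SMF records from PostgreSQL via the JDBC driver downloaded above.
# Database name (smfdb), table (smf110), user, and password are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("smf-postgres")
         .config("spark.jars", "/home/jupyteruser/postgresql-42.7.3.jar")
         .getOrCreate())

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/smfdb")
      .option("dbtable", "smf110")
      .option("user", "jupyteruser")
      .option("password", "changeme")
      .option("driver", "org.postgresql.Driver")
      .load())

df.show(5)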
Nvidia CUDA (1)
# Disable the "nouveau" driver: it tries to activate the Tesla card as a graphics card, which doesn't work because the card has no graphics port.
In /etc/default/grub, add the following phrase to the value of
GRUB_CMDLINE_LINUX:
module_blacklist=nouveau
Create /etc/modprobe.d/nouveau.conf and add the following line:
blacklist nouveau
Rebuild modules:
depmod -a
Rebuild your grub config:
grub2-mkconfig --output=/boot/efi/EFI/rocky/grub.cfg
Nvidia CUDA (2)
Download and install the Nvidia Tesla driver
wget https://us.download.nvidia.com/tesla/525.60.13/NVIDIA-Linux-x86_64-525.60.13.run
chmod +x *.run
Execute the downloaded package in the Shell
./NVIDIA-xxx --kernel-source-path=/usr/src/kernels/xxx
Nvidia CUDA (3)
nvidia-smi
Sat Dec 17 14:03:36 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13 Driver Version: 525.60.13 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:01:00.0 Off | 0 |
| N/A 93C P0 41W / 70W | 2MiB / 15360MiB | 8% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Nvidia CUDA (4) – CUDA Toolkit
wget https://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.243_418.87.00_linux.run
sh cuda_10.1.243_418.87.00_linux.run --override (--override required to bypass gcc version check)
# unselect the driver. install the rest
===========
= Summary =
===========
Driver: Not Selected
Toolkit: Installed in /usr/local/cuda-10.1/
Samples: Installed in /root/, but missing recommended libraries
Please make sure that
- PATH includes /usr/local/cuda-10.1/bin
- LD_LIBRARY_PATH includes /usr/local/cuda-10.1/lib64, or, add /usr/local/cuda-10.1/lib64 to
/etc/ld.so.conf and run ldconfig as root
To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-10.1/bin
Demo
Install the demo assets
Download
SMF110_Spark_Python3.ipynb
SMF110_data.json.zip
From
https://github.com/IzODA/examples/tree/master/SMF
and put them into /home/jupyteruser/notebooks
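The notebook itself is the reference; as a rough sketch of what its first steps amount to (assuming the zip unpacks to SMF110_data.json):

# Rough sketch of the demo's first steps, not the notebook's exact code.
# Assumption: SMF110_data.json.zip unpacks to SMF110_data.json.
import zipfile
from pyspark.sql import SparkSession

with zipfile.ZipFile("/home/jupyteruser/notebooks/SMF110_data.json.zip") as z:
    z.extractall("/home/jupyteruser/notebooks")

spark = SparkSession.builder.appName("smf110-demo").getOrCreate()
df = spark.read.json("/home/jupyteruser/notebooks/SMF110_data.json")
df.printSchema()  # SMF 110 (CICS monitoring) fields exposed as columns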
Demo (1)
Demo (2)
Demo (3)
Demo (4)
Demo (5)
Demo (6)
Demo (7)
Demo (8)