Incorta Spark Integration
Dylan Wan
Solution Architect at Incorta
Agenda
• Spark Overview
• Incorta and Spark
• Installation and Configuration
• Create your first MV in Incorta
• Demo
Spark Overview
Why Spark?
• General-purpose framework for parallel processing on a cluster
• Functional programming available in
• Scala
• Python
• Java
• Spark SQL and Dataframe
• Using the same framework for
• Data Streaming
• Machine Learning
• Graph processing
[Diagram: the Spark stack — Spark Core with the Spark SQL, Spark Streaming, Machine Learning, and Graph Processing libraries on top, running on the Standalone Scheduler, YARN, or Mesos]
Spark Execution Flow
[Diagram: Spark cluster overview, annotated with the Incorta Server and Spark Server — the Driver Program, with its Spark Context, talks to the Spark Master (Cluster Manager), which schedules work on Worker Nodes, each running an Executor with a cache and tasks]
http://spark.apache.org/docs/latest/cluster-overview.html
Spark Concepts
• Spark Context – like a JDBC connection holding a session to a database; it is the connection to the Spark cluster (see the sketch after this list)
• The Master and Workers each have their own JVM process and listener port
• The Master and Workers each have a Web UI for displaying progress
• Application code is sent to and assigned to executors
• Executors read, write and process the data
• Memory can be controlled at the worker level and is allocated individually to executors
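For illustration, a minimal PySpark sketch of creating a Spark Context against a standalone master; the application name and master host are placeholders.

    from pyspark import SparkConf, SparkContext

    # "myApp" and the master host below are placeholders
    conf = SparkConf().setAppName("myApp").setMaster("spark://<spark master host>:7077")
    sc = SparkContext(conf=conf)
    print(sc.parallelize(range(10)).count())   # should print 10
    sc.stop()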
Spark Dataframe
• Organizes data into named columns, like a database table
• A dataframe can be created from a parquet file
• A dataframe can be written into and stored as a parquet file
• A dataframe can be processed via the DataFrame API
https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame
• A dataframe can be registered as a table and processed by SQL (see the sketch below)
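A minimal sketch of these operations with the PySpark 1.6 API; the parquet paths, table name, and column names are placeholders.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="dataframe-example")   # placeholder app name
    sqlContext = SQLContext(sc)

    # create a dataframe from a parquet file (placeholder path)
    df = sqlContext.read.parquet("/tmp/orders.parquet")

    # process it via the DataFrame API
    df.select("customer_id", "amount").show(5)

    # register it as a table and process it with SQL
    df.registerTempTable("orders")
    totals = sqlContext.sql("SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id")

    # write the result back as a parquet file (placeholder path)
    totals.write.mode("overwrite").parquet("/tmp/orders_by_customer.parquet")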
Incorta and Spark
Incorta Data Load
[Diagram: Incorta data load flow — data from web services, *.csv and *.xlsx files, and SQL databases goes through Extract and Load steps, with Backup/Restart and formula columns, references, etc.]
Materialized View in Incorta
• An object created within Incorta, not loaded from data sources
• Created based on other tables already loaded into Incorta
• Once an MV is loaded, it works like any other table:
• It can join from and to other tables or MVs
• Formula columns can be created in an MV
• Aliases can be created against an MV
• An MV can be defined via Spark Python or Spark SQL (see the sketch below)
• The Spark Python or Spark SQL code is executed as part of the regular Loader jobs
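For illustration, a hypothetical MV script written in Spark Python; the schema, table, and column names (SALES.ORDERS, CUSTOMER_ID, AMOUNT) are placeholders, and read()/save() are the Incorta functions described under "Understand read() and save()" later in this deck.

    # hypothetical MV script (Spark Python); all names are placeholders
    df = read("SALES.ORDERS")                      # read a loaded Incorta table as a dataframe
    mv = df.groupBy("CUSTOMER_ID").sum("AMOUNT")   # any DataFrame processing
    save(mv)                                       # materialize the result as this MV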
Incorta Data Load
[Diagram: the same Incorta data load flow, with the MV reading (read) from the loaded tables and saving (save) its result back into Incorta]
Installation and Configuration
Spark Installation
• Download Spark
• http://spark.apache.org/downloads.html
• Select Package Type of “Pre-built for Hadoop XX”
• Select Spark 1.6.2 for Incorta Release 2.8 or before
• Download the tarball file or copy the link from the browser
• Run wget <download URL> from the server machine
• Extract the tar file
• tar -xzvf spark-1.6.2-bin-hadoop2.6.tgz
• Spark is ready to use! Try this (a quick check follows this list)
• bin/pyspark
• exit()
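A quick sanity check, run inside the bin/pyspark shell (sc is the Spark Context the shell creates automatically):

    >>> sc.parallelize(range(100)).sum()   # should return 4950
    >>> exit()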
Spark Configurations
• Edit the spark-env.sh in the <Spark Home>/conf
• Change the WebUI ports if there is any conflict (optional)
• If the default ports are already in use, or not reachable from a browser, you can specify
• SPARK_MASTER_WEBUI_PORT
• SPARK_WORKER_WEBUI_PORT
• Specify the external IP for monitoring Spark jobs (optional)
• If the server machine runs under a firewall and the external IP and internal IP
are different
• Set SPARK_PUBLIC_DNS to the external IP
• Limit the memory used by Spark jobs (optional)
• SPARK_WORKER_MEMORY
• Controls the total memory available on the worker, not the individual allocation to each executor (see the example after this list)
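For illustration, a spark-env.sh fragment combining these settings; the port numbers, IP address, and memory size are placeholder values, not recommendations.

    # <Spark Home>/conf/spark-env.sh  (placeholder values)
    SPARK_MASTER_WEBUI_PORT=8082     # change if the default 8080 conflicts
    SPARK_WORKER_WEBUI_PORT=8083     # change if the default 8081 conflicts
    SPARK_PUBLIC_DNS=203.0.113.10    # external IP, when behind a firewall
    SPARK_WORKER_MEMORY=16g          # total memory the worker can hand out to executors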
Spark Configurations
• Enable Logging – Useful for investigating issues (recommended)
• Create a directory for holding the log files
• cd <spark home>
• mkdir eventlogs
• Edit the spark-defaults.conf in <spark home>/conf
• spark.eventLog.enabled true
• spark.eventLog.dir <spark home>/eventlogs
• Enable History Server (recommended)
• Edit <spark home>/conf/spark-env.sh
• SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=<spark home>/eventlogs"
• Start the history server
• ./sbin/start-history-server.sh
Spark Configuration – Hive metastore DB
• Hive metadata is stored in the Hive metastore
• The Hive metastore requires a database
• Create the hive-site.xml in <spark home>/conf
• Edit <Spark Home>/conf/spark-env.sh
• SPARK_HIVE=true
• SPARK_SUBMIT_CLASSPATH
• SPARK_CLASSPATH
• Make sure the JDBC driver is available to Spark (a spark-env.sh sketch follows the hive-site.xml example below)
hive-site.xml for MySQL
[incorta@clorox2-poc spark-1.6.2-bin-hadoop2.6]$ cat ~/spark-1.6.2-bin-hadoop2.6/conf/hive-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3307/mydb?createDatabaseIfNotExist=true</value>
<description>metadata is stored in a MySQL server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>MySQL JDBC driver class</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>user name for connecting to mysql server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>mysql_root</value>
<description>password for connecting to mysql server</description>
</property>
</configuration>
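For illustration, the corresponding spark-env.sh additions, assuming the MySQL connector jar has been copied to a local directory; the jar path and version are placeholders.

    # <Spark Home>/conf/spark-env.sh  (jar path is a placeholder)
    SPARK_HIVE=true
    SPARK_CLASSPATH=/home/incorta/jars/mysql-connector-java-5.1.38.jar
    SPARK_SUBMIT_CLASSPATH=$SPARK_CLASSPATH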
Starting Spark Master and Worker
• Go to <Spark Home> and start the spark master process
• sbin/start-master.sh
• Check the log file to get the master WebUI URL
• Open the webUI from a browser
• Start the spark slave processes
• sbin/start-slave.sh spark://<spark master host>:7077
• Check the log file to ensure that the worker started properly
• Refresh the browser page to check worker processes
• Start history server (optional, but recommended)
• sbin/start-history-server.sh
Spark Processes
Incorta Configuration
• Edit <Incorta Home>/incorta/server.properties
• spark.home=/home/incorta/spark-1.6.2-bin-hadoop2.6
• spark.master.url=spark://clorox2-poc:7077
• Ensure that spark.master.url is set to the URL shown in the log file when you launch the Spark master.
• You can also see it in the Spark Master Web UI
Monitoring
• Spark Master WebUI
• Check if the job is submitted to Spark master
• Check if the worker has allocated the resources to execute the job
• Check the DAG to optimize performance
• Incorta Log
• Use tail -f <incorta home>/server/logs/incorta/<tenant>/incorta-…out
• See runtime errors
Create your first MV in Incorta
Understand read() and save()
• read("schema.table") – reads the data of an Incorta table into a dataframe
• save(dataframe) – saves the dataframe's data as this MV
• These are Incorta functions; internally they call (roughly as sketched below)
• sqlContext.read.parquet
• df.write.mode("overwrite").parquet
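A sketch of roughly what these wrappers do, assuming the table's data lives in parquet files under an Incorta tenant directory; the paths below are placeholders, not the actual Incorta file layout.

    # read("SALES.ORDERS") corresponds roughly to (placeholder path):
    df = sqlContext.read.parquet("<tenant dir>/SALES/ORDERS/*.parquet")

    # save(mv) corresponds roughly to (placeholder path):
    mv.write.mode("overwrite").parquet("<tenant dir>/<schema>/<MV name>/")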
Demo