Incorta Spark Integration
Dylan Wan
Solution Architect at Incorta
Agenda
• Spark Overview
• Incorta and Spark
• Installation and Configuration
• Create your first MV in Incorta
• Demo
Spark Overview
Why Spark?
• General-purpose framework for parallel processing on a cluster
• Functional programming available in
• Scala
• Python
• Java
• Spark SQL and Dataframe
• Using the same framework for
• Data Streaming
• Machine Learning
• Graph processing
[Diagram: the Spark stack — Spark Core with the Spark SQL, Spark Streaming, Machine Learning, and Graph Processing libraries on top, running on the Standalone Scheduler, YARN, or Mesos]
Spark Execution Flow
[Diagram: Spark cluster overview, annotated with the Incorta Server and Spark Server — the Driver Program, with its Spark Context, talks to the Spark Master (Cluster Manager), which schedules work on Worker Nodes, each running an Executor with a cache and tasks]
http://spark.apache.org/docs/latest/cluster-overview.html
Spark Concepts
• Spark Context – like a JDBC connection holding a session to a database; it is the connection to the Spark cluster (see the sketch after this list)
• The Master and Workers each have their own JVM process and listener port
• The Master and Workers each have a Web UI for displaying progress
• Application code is sent to and assigned to executors
• Executors read, write and process the data
• Memory can be controlled at the worker level and is allocated individually to executors
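For illustration, a minimal PySpark sketch of creating a Spark Context against a standalone master; the application name and master host are placeholders.

    from pyspark import SparkConf, SparkContext

    # "myApp" and the master host below are placeholders
    conf = SparkConf().setAppName("myApp").setMaster("spark://<spark master host>:7077")
    sc = SparkContext(conf=conf)
    print(sc.parallelize(range(10)).count())   # should print 10
    sc.stop()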
Spark Dataframe
• Organizes data into named columns, like a database table
• A dataframe can be created from a parquet file
• A dataframe can be written into and stored as a parquet file
• A dataframe can be processed via the DataFrame API
https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame
• A dataframe can be registered as a table and processed by SQL (see the sketch below)
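A minimal sketch of these operations with the PySpark 1.6 API; the parquet paths, table name, and column names are placeholders.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="dataframe-example")   # placeholder app name
    sqlContext = SQLContext(sc)

    # create a dataframe from a parquet file (placeholder path)
    df = sqlContext.read.parquet("/tmp/orders.parquet")

    # process it via the DataFrame API
    df.select("customer_id", "amount").show(5)

    # register it as a table and process it with SQL
    df.registerTempTable("orders")
    totals = sqlContext.sql("SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id")

    # write the result back as a parquet file (placeholder path)
    totals.write.mode("overwrite").parquet("/tmp/orders_by_customer.parquet")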
Incorta and Spark
Incorta Data Load
[Diagram: Incorta data load flow — data from web services, *.csv and *.xlsx files, and SQL databases goes through Extract and Load steps, with Backup/Restart and formula columns, references, etc.]
Materialized View in Incorta
• An object created within Incorta, not loaded from data sources
• Created based on other tables already loaded into Incorta
• Once an MV is loaded, it works like any other table:
• It can join from and to other tables or MVs
• Formula columns can be created in an MV
• Aliases can be created against an MV
• An MV can be defined via Spark Python or Spark SQL (see the sketch below)
• The Spark Python or Spark SQL code is executed as part of the regular Loader jobs
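For illustration, a hypothetical MV script written in Spark Python; the schema, table, and column names (SALES.ORDERS, CUSTOMER_ID, AMOUNT) are placeholders, and read()/save() are the Incorta functions described under "Understand read() and save()" later in this deck.

    # hypothetical MV script (Spark Python); all names are placeholders
    df = read("SALES.ORDERS")                      # read a loaded Incorta table as a dataframe
    mv = df.groupBy("CUSTOMER_ID").sum("AMOUNT")   # any DataFrame processing
    save(mv)                                       # materialize the result as this MV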
Incorta Data Load
[Diagram: the same Incorta data load flow, with the MV reading (read) from the loaded tables and saving (save) its result back into Incorta]
Installation and Configuration
Spark Installation
• Download Spark
• http://spark.apache.org/downloads.html
• Select Package Type of “Pre-built for Hadoop XX”
• Select Spark 1.6.2 for Incorta Release 2.8 or before
• Download the tarball file or copy the link from the browser
• Run wget <download URL> from the server machine
• Extract the tar file
• tar -xzvf spark-1.6.2-bin-hadoop2.6.tgz
• Spark is ready to use! Try this (a quick check follows this list)
• bin/pyspark
• exit()
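A quick sanity check, run inside the bin/pyspark shell (sc is the Spark Context the shell creates automatically):

    >>> sc.parallelize(range(100)).sum()   # should return 4950
    >>> exit()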
Spark Configurations
• Edit the spark-env.sh in the <Spark Home>/conf
• Change the WebUI ports if there is any conflict (optional)
• If the default ports are already in use, or not reachable from a browser, you can specify
• SPARK_MASTER_WEBUI_PORT
• SPARK_WORKER_WEBUI_PORT
• Specify the external IP for monitoring Spark jobs (optional)
• If the server machine runs under a firewall and the external IP and internal IP
are different
• Set SPARK_PUBLIC_DNS to the external IP
• Limit the memory used by Spark jobs (optional)
• SPARK_WORKER_MEMORY
• Controls the total memory available on the worker, not the individual allocation to each executor (see the example after this list)
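For illustration, a spark-env.sh fragment combining these settings; the port numbers, IP address, and memory size are placeholder values, not recommendations.

    # <Spark Home>/conf/spark-env.sh  (placeholder values)
    SPARK_MASTER_WEBUI_PORT=8082     # change if the default 8080 conflicts
    SPARK_WORKER_WEBUI_PORT=8083     # change if the default 8081 conflicts
    SPARK_PUBLIC_DNS=203.0.113.10    # external IP, when behind a firewall
    SPARK_WORKER_MEMORY=16g          # total memory the worker can hand out to executors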
Spark Configurations
• Enable Logging – Useful for investigating issues (recommended)
• Create a directory for holding the log files
• cd <spark home>
• mkdir eventlogs
• Edit the spark-defaults.conf in <spark home>/conf
• spark.eventLog.enabled true
• spark.eventLog.dir <spark home>/eventlogs
• Enable History Server (recommended)
• Edit <spark home>/conf/spark-env.sh
• SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=<spark home>/eventlogs"
• Start the history server
• ./sbin/start-history-server.sh
Spark Configuration – Hive metastore DB
• Hive metadata is stored in the Hive metastore
• The Hive metastore requires a database
• Create the hive-site.xml in <spark home>/conf
• Edit <Spark Home>/conf/spark-env.sh
• SPARK_HIVE=true
• SPARK_SUBMIT_CLASSPATH
• SPARK_CLASSPATH
• Make sure the JDBC driver is available to Spark (a spark-env.sh sketch follows the hive-site.xml example below)
hive-site.xml for MySQL
[incorta@clorox2-poc spark-1.6.2-bin-hadoop2.6]$ cat ~/spark-1.6.2-bin-hadoop2.6/conf/hive-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3307/mydb?createDatabaseIfNotExist=true</value>
<description>metadata is stored in a MySQL server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>MySQL JDBC driver class</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>user name for connecting to mysql server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>mysql_root</value>
<description>password for connecting to mysql server</description>
</property>
</configuration>
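For illustration, the corresponding spark-env.sh additions, assuming the MySQL connector jar has been copied to a local directory; the jar path and version are placeholders.

    # <Spark Home>/conf/spark-env.sh  (jar path is a placeholder)
    SPARK_HIVE=true
    SPARK_CLASSPATH=/home/incorta/jars/mysql-connector-java-5.1.38.jar
    SPARK_SUBMIT_CLASSPATH=$SPARK_CLASSPATH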
Starting Spark Master and Worker
• Go to <Spark Home> and start the spark master process
• sbin/start-master.sh
• Check the log file to get the master WebUI URL
• Open the webUI from a browser
• Start the spark slave processes
• sbin/start-slave.sh spark://<spark master host>:7077
• Check the log file to ensure that the worker started properly
• Refresh the browser page to check worker processes
• Start history server (optional, but recommended)
• sbin/start-history-server.sh
Spark Processes
Incorta Configuration
• Edit <Incorta Home>/incorta/server.properties
• spark.home=/home/incorta/spark-1.6.2-bin-hadoop2.6
• spark.master.url=spark://clorox2-poc:7077
• Ensure that spark.master.url is set to the URL shown in the log file when you launch the Spark master.
• You can also see it in the Spark Master Web UI
Monitoring
• Spark Master WebUI
• Check if the job is submitted to Spark master
• Check if the worker has allocated the resources to execute the job
• Check the DAG to optimize performance
• Incorta Log
• Use tail -f <incorta home>/server/logs/incorta/<tenant>/incorta-…out
• See runtime errors
Create your first MV in Incorta
Understand read() and save()
• read("schema.table") – reads the data of an Incorta table into a dataframe
• save(dataframe) – saves the dataframe's data as this MV
• These are Incorta functions; internally they call (roughly as sketched below)
• sqlContext.read.parquet
• df.write.mode("overwrite").parquet
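A sketch of roughly what these wrappers do, assuming the table's data lives in parquet files under an Incorta tenant directory; the paths below are placeholders, not the actual Incorta file layout.

    # read("SALES.ORDERS") corresponds roughly to (placeholder path):
    df = sqlContext.read.parquet("<tenant dir>/SALES/ORDERS/*.parquet")

    # save(mv) corresponds roughly to (placeholder path):
    mv.write.mode("overwrite").parquet("<tenant dir>/<schema>/<MV name>/")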
Demo