Lighting up Big Data Analytics with Apache Spark in Azure
Jen Stirrup, Data Whisperer, Data Relish UK
Please silence
cell phones
Free online webinar
events
Free 1-day local
training events
Local user groups
around the world
Online special
interest user groups
Business analytics
training
Free Online Resources
PASS Blog
White Papers
Session Recordings
Newsletter www.pass.org
Explore everything PASS has to offer
PASS Connector
BA Insights
Get involved
Session evaluations
Your feedback is important and valuable. 3 Ways to Access:
• Download the GuideBook App and search: PASS Summit 2017
• Follow the QR code link displayed on session signage throughout the conference venue and in the program guide
• Go to passSummit.com
Submit by 5pm Friday, November 10th to win prizes.
Jen Stirrup
Data Whisperer
Data Relish UK
Postgrad in Artificial Intelligence
Universities in the UK and Paris
AI and BI Consultant for 20 years
Global delivery of projects
Author
Published author on Business Intelligence
technology books
/jenstirrup @jenstirrup jenstirrup
Artificial
Intelligence of
Business
Intelligence
Augmented Reality
Apache Spark™ is a fast and general engine for large-scale data processing.
Apache Spark
It is one of the largest open source projects in data processing.
Since its release, Apache Spark has seen rapid adoption
by enterprises across a wide range of industries.
Apache Spark is a fast, in-memory data processing
engine with elegant and expressive development APIs
that allow data workers to efficiently execute streaming,
machine learning, or SQL workloads that
require fast iterative access to datasets.
Why Apache Spark?
FASTER THAN HADOOP • RUNS EVERYWHERE
Who uses Spark?
Apache Spark
Apache Spark consists of Spark Core and a set of libraries.
The core is the distributed execution engine and the Java,
Scala, and Python APIs offer a platform for distributed ETL
application development.
Quickly achieve success by writing applications in Java,
Scala, or Python.
Resilient Distributed Datasets (RDDs)
Resilient Distributed Datasets (RDDs) are the fundamental
object used in Apache Spark.
RDDs are immutable collections representing datasets.
New RDDs are created upon any operation.
Lineage is also stored.
Input File → Read → RDD → Map → RDD → Filter → RDD → Reduce → Output File
Apache Spark
It comes with a built-in set of over 80 high-level operators,
and you can use it interactively to query data within the
shell.
In addition to Map and Reduce operations, it supports SQL
queries, streaming data, machine learning and graph data
processing.
Apache Spark
Developers can use these capabilities stand-alone or
combine them to run in a single data pipeline use case.
Spark Components on HDInsight
Apache Spark is an open-source parallel processing
framework that supports in-memory processing to boost
the performance of big-data analytic applications.
A Spark cluster on HDInsight is compatible with Azure
Storage (WASB) as well as Azure Data Lake Store.
Spark Components on HDInsight
Apache Spark
When you create a Spark cluster on HDInsight, you create
Azure compute resources with Spark installed and
configured.
It only takes about 10 minutes to create a Spark cluster in
HDInsight. The data to be processed is stored in Azure
Storage or Azure Data Lake Store.
Apache Spark
Spark provides primitives for in-memory cluster
computing.
A Spark job can load and cache data into memory and
query it repeatedly, much more quickly than disk-based
systems.
Spark also integrates into the Scala programming
language to let you manipulate distributed data sets like
local collections.
What does Spark give you?
Apache Spark is a powerful open source processing engine
for Hadoop data, built around speed, ease of use, and
sophisticated analytics.
When it comes to Big Data, processing speed always matters.
We always look to process our huge datasets as fast as
possible.
What does Spark give you?
Spark enables applications in Hadoop clusters to run up to
100x faster in memory, and 10x faster even when running
on disk.
Spark makes this possible by reducing the number of
reads and writes to disk. It stores intermediate processing
data in memory.
Why Spark?
Easy: Built on Spark’s lightweight yet powerful APIs, Spark Streaming
lets you rapidly develop streaming applications
Fault tolerant: Unlike other streaming solutions (e.g. Storm), Spark
Streaming recovers lost work and delivers exactly-once semantics out
of the box with no extra code or configuration
Integrated: Reuse the same code for batch and stream processing,
even joining streaming data to historical data
Why Spark?
It uses the concept of the Resilient Distributed Dataset (RDD),
which allows it to transparently store data in memory and
persist it to disk only when needed.
This helps to eliminate most of the disk reads and writes,
the main time-consuming factors in data processing.
YARN Data Operating System:
YARN is one of the key features in the second-generation
Hadoop 2 version of the Apache Software Foundation's
open source distributed processing framework.
Originally described by Apache as a redesigned resource
manager, YARN is now characterized as a large-scale,
distributed operating system for big data applications.
YARN Data Operating System:
YARN is a software rewrite that decouples MapReduce's
resource management and scheduling capabilities from the
data processing component, enabling Hadoop to support
more varied processing approaches and a broader array of
applications.
Spark Deployment Modes:
Two deployment modes can be used to launch Spark applications:
In cluster mode, jobs are managed by the YARN cluster. The Spark driver runs
inside an Application Master (AM) process that is managed by YARN. This
means that the client can go away after initiating the application.
In client mode, the Spark driver runs in the client process, and the Application
Master is used only to request resources from YARN.
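The two modes differ only in one `spark-submit` flag. The paths, class name, and jar below are placeholders for illustration; `--deploy-mode` is the setting that matters.

```shell
# Cluster mode: the driver runs inside the YARN Application Master,
# so the submitting client can disconnect after launch.
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.MyApp my-app.jar

# Client mode: the driver runs in the local client process;
# the Application Master only requests resources from YARN.
spark-submit --master yarn --deploy-mode client \
  --class com.example.MyApp my-app.jar
```

Client mode suits interactive work (shells, notebooks); cluster mode suits production jobs that must outlive the submitting session.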
Resilient Distributed Datasets
The Resilient Distributed Dataset (RDD) is the fundamental
data structure of Spark. It is an immutable distributed
collection of objects.
Each dataset in an RDD is divided into logical partitions, which
may be computed on different nodes of the cluster. RDDs
can contain any type of Python, Java, or Scala objects,
including user-defined classes.
Resilient Distributed Datasets
There are two ways to create RDDs:
Parallelizing an existing collection in your driver program, or
referencing a dataset in an external storage system, such as a shared
filesystem, HDFS, HBase, or any data source offering a Hadoop
InputFormat.
Resilient Distributed Datasets
Parallelized Collections
Parallelized collections are created by calling SparkContext’s
parallelize method on an existing collection in your driver
program (a Scala Seq). The elements of the collection are
copied to form a distributed dataset that can be operated
on in parallel.
Resilient Distributed Datasets
External Datasets
Spark can create distributed datasets from any storage
source supported by Hadoop, including your local file
system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark
supports text files, SequenceFiles, and any other Hadoop
InputFormat.
Transformations
map(func): Returns a new distributed dataset formed by
passing each element of the source through a function
func.
filter(func): Returns a new dataset formed by selecting
those elements of the source on which func returns true.
distinct([numTasks]): Returns a new dataset that contains
the distinct elements of the source dataset.
Summary
Try it out!
Thank You
Learn more from Jen Stirrup
/jenstirrup @jenstirrup jenstirrup

Editor's Notes

  • #3: Today, CIOs and other business decision-makers are increasingly recognizing the value of open source software and Azure cloud computing for the enterprise, as a way of driving down costs whilst delivering enterprise capabilities. For the Business Intelligence professional, how can you introduce Open Source for analytics into the Enterprise in a robust way, whilst also creating an architecture that accommodates cloud, on-premise and hybrid architectures? We will examine strategies for using open source technologies to improve common Business Intelligence issues, using Apache Spark as our backdrop to delivering open source Big Data analytics. - incorporating Apache Spark into your existing projects - looking at your choices to parallelize your computations across nodes of a Hadoop cluster with Apache Spark - how ScaleR works with Spark - Using sparklyr and SparkR within a ScaleR workflow Join this session to learn more about open source with Azure for Business Intelligence
  • #7: Image credit: https://pixabay.com/en/users/Seanbatty-5097598/ No attribution required. In this information age, we drive information to create value (Skok, 2013). But, the tools which create this value have always required substantial economic capital. Intelligence systems that learn and suggest what we need to know, based on: History Your colleague’s actions Data behaviour AI can make sense of data Learn and predict what you need to see.
  • #8: https://pixabay.com/en/directory-away-wisdom-education-229117/ We need something to prioritize the data for us Insights come in the form of KPIs but they are automatic, suggestive, predictive and drive value. Gone are the days of reports and complicated dashboards. We will see more focused, targeted information that can be consumed by users. Mobile, apple watch and we can react immediately.
  • #9: https://pixabay.com/en/laptop-prezi-3d-presentation-mockup-2411303/ Augmented reality. If we think we are in trouble over the three Vs…. Too much data. Automated data integration Blending data is essential to insights Automated data
  • #10: Blockchain – people are talking about currencies and a financial world that we don’t even use. https://pixabay.com/en/block-chain-data-records-concept-2850277/
  • #13: Generality Combine SQL, streaming, and complex analytics. Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
  • #14: Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Apache Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing.
  • #17: Resilient Distributed Datasets (RDDs) are the fundamental object used in Apache Spark.  RDDs are immutable collections representing datasets and have the inbuilt capability of reliability and failure recovery. By nature, RDDs create new RDDs upon any operation such as transformation or action. They also store the lineage, which is used to recover from failures.