Using HPCC Systems for Big Data and More - Because Who Has Time for MapReduce?
John Andleman
October 7, 2014

About Me
I love to architect and build systems that acquire, manage, and use data to solve problems
• Operational
• Analytical
• Real-time
• Big Data
• Data Science

About Citrix SaaS Division
A market-leading global provider of web collaboration, remote access, data sharing and IT support software as a service.
• GoToMeeting: for online meetings
• GoToWebinar: for do-it-yourself webinars
• GoToTraining: for online training
• GoToAssist: for integrated IT support tools
• GoToMyPC: for remote access to your Mac or PC
• ShareFile: for data sharing and storage
• Podio: for social collaboration
• OpenVoice: for affordable audio conferencing

Finding Insights In Big Data
• Structured and semi-structured data
• Data sets from very small to many billions of records
• Hundreds of terabytes of log files
• Thousands of Oracle database tables
• Spreadsheet data
• And more…

Finding Insights In Big Data – Traditional BI?
• Oracle Data Warehouse
• ROLAP and Data Cubes
• Very expensive licensing and hardware costs
• Does not scale well to very large data sets
• ETL to get data loaded is complicated
• Extracting useful content from log files is complicated
• Limited analytic capabilities

Finding Insights In Big Data – Hadoop?
• It’s very powerful, but…
• Why do they have to make it so complicated?!
• MapReduce scales, but it is a giant step backwards in productivity
• Java is a horrible language for data processing; Python is a little better
• Extracting useful content from log files is very complicated
• Much of the Hadoop infrastructure is immature and poorly documented

Finding Insights In Big Data – Hadoop with Pig?
• Much more productive than writing MapReduce code, but…
• The language is very limited
• Where the language has gaps, you end up writing user-defined functions, or worse, going back to writing MapReduce code
• Extracting useful content from log files is still very complicated

Finding Insights In Big Data – HPCC?
• ECL is a very mature data processing language
• ECL is a very complete language
• ECL has very powerful pattern matching constructs for extracting useful content from log files – the best I have seen!
• ECL is the best ETL language I have worked with
• HPCC with ECL scales well and is a very productive development environment
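
The slides do not show any code, so to make the log-extraction task concrete, here is a minimal sketch in Python (not ECL) of pulling named fields out of a log line with a regular expression. The log format, the field names, and the `parse_line` helper are all hypothetical and chosen only for illustration; they are not taken from the talk or from any Citrix system.

```python
import re
from typing import Optional

# Hypothetical log format, invented for illustration:
#   2014-10-07 12:00:01 ERROR session=123 audio dropout
LOG_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"session=(?P<session>\d+) "
    r"(?P<message>.*)"
)

def parse_line(line: str) -> Optional[dict]:
    """Return the named fields of one log line, or None if it does not match."""
    match = LOG_PATTERN.fullmatch(line.strip())
    return match.groupdict() if match else None

lines = [
    "2014-10-07 12:00:01 ERROR session=123 audio dropout",
    "not a log line",
]
# Keep only the lines that matched the pattern.
records = [r for r in (parse_line(l) for l in lines) if r is not None]
print(records)
```

Lines that do not match are dropped rather than raising an error, which is a common choice for noisy logs but is only one possible design.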

Big Data Projects at Citrix – GoToMeeting and GoToWebinar
• Study product feature usage by:
  ᵒ Different customer segments
  ᵒ Trial vs. paid customers
  ᵒ Retained vs. lost accounts
• Study trial usage patterns of converted vs. non-converted accounts
• Study relationships of various session statistics to customer retention
• Study usage patterns of VoIP vs. dial-up audio in sessions
• Study patterns of session audio problems
• Study to identify fraudulent usage including trial abuse and spam activity

Big Data Projects at Citrix – HPCC and ECL
• Raw Oracle data was dumped to CSV files for ingestion into HPCC – this turned out to be faster than doing any data crunching in Oracle
• ETL jobs ran MUCH faster on HPCC than on Oracle, even when HPCC was run on much smaller hardware
• HPCC is a much more capable and productive ETL tool than anything else I have used. I even prefer it for data that is not “big”.
• ECL has excellent support for analytics, especially on very large data sets that would choke most other analytic tools
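
As a rough illustration of the "dump to CSV, then process outside the database" step described above, the following Python sketch reads CSV text and aggregates it. It is not HPCC or ECL code; the column names (`account_id`, `status`, `session_minutes`), the sample rows, and the `total_minutes_by_status` helper are made up for the example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical CSV dump; the column names and rows are made up for illustration.
CSV_DUMP = """account_id,status,session_minutes
a1,trial,30
a2,paid,45
a1,trial,15
"""

def total_minutes_by_status(csv_text: str) -> dict:
    """Sum session_minutes per account status from CSV text."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["status"]] += int(row["session_minutes"])
    return dict(totals)

print(total_minutes_by_status(CSV_DUMP))
```

In a real pipeline the CSV would be read from the dumped files rather than an in-memory string; the string is used here only to keep the sketch self-contained.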

Work better. Live better.


HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for MapReduce?
