Why was it so important to us to open the MapReduce framework?
12/11/2013

Syncsort Confidential and Proprietary - do not copy or distribute
Agenda
Who are we?
What did we do?
Why did we do that?
With whom did we do it?
What were the results?

Syncsort
For 40 years we have been helping companies solve their big data issues… even before they knew the name Big Data!
Integrating Big Data… Smarter!

Our customers are achieving the impossible, every day!

• 50% of all mainframes run Syncsort
• 1,500 mainframe customers: most used & trusted 3rd party mainframe software
• Speed leader for ETL & Sort
• A history of innovation
• 25+ issued & pending patents
• Large global customer base
• 15,000+ deployments in 68 countries
• First-to-market, fully integrated approach to Hadoop ETL
Key Partners

Smart Contributions to Improve Hadoop
Augmenting Critical Batch Processing Capabilities

JIRA   Description
4807   Allow MapOutputBuffer to be pluggable
4808   Allow Reduce-side merge to be pluggable
4809   Make classes required for 2454 public
4812   Create reduce input merger plug-in
4842   Shuffle race can hang reducer
2461   HDFS file name globbing in libhdfs
4482   Backport of 2454 to MapReduce 1 & 1.2

Plugin shipping on CDH 4.2 and later

Opening the MapReduce Framework
Mapper → Output Sorter → Shuffle → Input Sorter → Reducer
• Output Sorter and Input Sorter: here and here to replace MapReduce native sort
• Mapper and Reducer: here to perform functional logic on our engine
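
The idea of a pluggable sort can be illustrated with a small, purely conceptual Python sketch. It is not Hadoop's API and not Syncsort's plug-in: `run_job`, `native_sort`, `native_merge`, and `engine_sort` are hypothetical names, and a single reducer is assumed for brevity. It only shows that the map-side sorter and the reduce-side merger can be swapped without touching the mapper or reducer logic.

```python
import heapq
import itertools
from typing import Callable, Iterable, List, Tuple

Record = Tuple[str, int]


def native_sort(records: List[Record]) -> List[Record]:
    """Stand-in for the framework's built-in map-side sort."""
    return sorted(records, key=lambda kv: kv[0])


def native_merge(runs: List[List[Record]]) -> Iterable[Record]:
    """Stand-in for the framework's built-in reduce-side merge of sorted runs."""
    return heapq.merge(*runs, key=lambda kv: kv[0])


def run_job(splits, mapper, reducer,
            output_sorter: Callable = native_sort,
            input_merger: Callable = native_merge):
    # Map side: map each split, then hand the map output to the output sorter.
    runs = [output_sorter([kv for line in split for kv in mapper(line)])
            for split in splits]
    # Shuffle and reduce side: merge the sorted runs (one reducer, for brevity).
    merged = input_merger(runs)
    return [reducer(key, [value for _, value in group])
            for key, group in itertools.groupby(merged, key=lambda kv: kv[0])]


def word_mapper(line: str) -> List[Record]:
    return [(word, 1) for word in line.split()]


def sum_reducer(key: str, values: List[int]) -> Record:
    return (key, sum(values))


calls = {"sort": 0}


def engine_sort(records: List[Record]) -> List[Record]:
    """Hypothetical plug-in; a real one would delegate to an external sort engine."""
    calls["sort"] += 1
    return sorted(records, key=lambda kv: kv[0])


splits = [["b a", "a c"], ["c b a"]]
baseline = run_job(splits, word_mapper, sum_reducer)
plugged = run_job(splits, word_mapper, sum_reducer, output_sorter=engine_sort)
print(baseline, plugged, calls)
```

Both runs produce the same word counts; only the component doing the sorting changes, which is the point of making the sort pluggable.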

Syncsort: Just integrating data… faster
• A simple DI engine, easy to deploy, operate, and administer
• ETL-like development GUI
• Auto-tuning
• Best patented algorithms
• Fast, fast, faster than any other
• The more data the better
Sort, Join, Aggregate, Copy, Merge

From Data to Big Data
[Chart: data volume, variety, and velocity (each marked $$$) growing over time. Velocity scale: Quarterly, Monthly, Weekly, Daily, Intra-day, Right / Real-time. Eras: Mainframe, PC, Internet Revolution, Mobile & Social Media Revolution. Timeline: 60s, 70s, 80s, 90s, 2000s, 2010s, Next?]

Smart Architecture
Hadoop Integration… for Real
(No Code Generation. No Compiling. No Bolts. No Nuts!)
• Runs natively within MapReduce
• Small footprint installs on every node
• Open source contributions extend capabilities of MapReduce

Unleash Hadoop's Potential
• Pluggable sort
• Expanded use cases (e.g. "No sort" option)
• Vertical scalability
• Design flexibility (MapMapReduceReduce)

[Diagram: Hadoop Cluster, with the callout "No need to worry about this…"]

Cloudera + Syncsort: Smarter Connectivity… Also for Mainframe
Because Mainframe Is Big Data Too!

Connect
• Read files directly from mainframe
• No software required on mainframe
• Already installed on 50% of mainframes

Translate
• Parse & transform: packed decimal, EBCDIC/ASCII, multi-format
• No coding required

Load & Process
• Load directly to HDFS
• Offload batch data processing
• Find more insights
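
To make the "Translate" step concrete, here is a minimal Python sketch of what decoding EBCDIC text and a packed-decimal field involves. It is only an illustration, not how DMX-h works: the function names are hypothetical, the code page `cp037` is an assumption (mainframe data sets use various EBCDIC code pages), and the field layout is invented for the example.

```python
from decimal import Decimal


def decode_ebcdic(raw: bytes, codec: str = "cp037") -> str:
    """Decode EBCDIC bytes to text; cp037 is an assumed code page."""
    return raw.decode(codec)


def unpack_packed_decimal(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed-decimal field: two digits per byte, sign in the last nibble."""
    if not raw:
        raise ValueError("packed-decimal field must not be empty")
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    sign = nibbles.pop()  # the low nibble of the last byte carries the sign
    if any(digit > 9 for digit in nibbles):
        raise ValueError("invalid digit nibble in packed-decimal field")
    value = int("".join(str(digit) for digit in nibbles))
    if sign in (0x0B, 0x0D):  # 0xB and 0xD mark negative values
        value = -value
    # scale is the number of implied decimal places (hypothetical record layout)
    return Decimal(value).scaleb(-scale)


name = decode_ebcdic(bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6]))
amount = unpack_packed_decimal(bytes([0x12, 0x34, 0x5C]), scale=2)
refund = unpack_packed_decimal(bytes([0x01, 0x2D]))
print(name, amount, refund)
```

Real mainframe records add further complications (multi-format records, binary COMP fields, copybook-driven layouts) that this sketch does not attempt to cover.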

Syncsort DMX-h + Cloudera Manager
[Diagram: Cloudera Manager (CDH Cluster + ISV software) connected through an API to Syncsort DMX-h, with the labels Support Integration, Monitoring, Management, and Installation. CDH Nodes: DMX-h on every CDH node.]

Test cases
Sort Acceleration
– Terasort
• Run terasort with DMX-h and without DMX-h in various configurations to compare performance.

ETL
– Use DMX-h to perform several different ETL jobs and compare against equivalent jobs in Pig (Apache Pig version 0.9.2-gphd-1.2.0.0).
• File Change Data Capture (CDC)
• Web Log Aggregation

File CDC
[Figure: side-by-side implementations labeled DMX-h, Pig, and Java, with callouts "149 Lines of Code" and "70 Lines of Code".]

Web Log Aggregation
[Figure: side-by-side implementations labeled DMX-h, Pig, and Java, with callouts "94 Lines of Code" and "48 Lines of Code".]

Cluster Configuration – DMX-h Ran on 763 Nodes!
Cluster Specs:
– 763 node cluster
• 1 node – job tracker
• 1 node – name node
• 1 node – secondary name node
• 760 data and task nodes

Hadoop cluster configuration changes (from defaults):
– 128 MB HDFS block size (file.blocksize)
– 1.5 GB map / 4 GB reduce task JVM memory (mapred.child.java.opts)
– Maximum 22 map tasks and 4 reduce tasks per node (mapred.tasktracker.map.tasks.maximum & mapred.tasktracker.reduce.tasks.maximum)

Cluster Node Specs:
– 12 cores: dual Intel Westmere (hex-core) CPUs, 2.93 GHz, 12 MB cache
– 48 GB DDR3 RDIMM memory
– 12 x 2 TB 3.5" Seagate drives, 7200 rpm
– Disk 0 + Disk 1 are RAID1 (mirrored) for OS
• 100 MB/Sec write
• 115 MB/Sec read
– 10 single disk JBOD
– Mellanox ConnectX®-3 VPI NIC (supported data rates 40GbE; 10GbE)
– RHEL 6.1 64-bit
– Java 1.6 (jdk.x86_64-2000:1.6.0_29fcs)
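
As a rough sanity check on these settings, the sketch below works out the cluster-wide task slots and the worst-case JVM heap per node. It uses only the figures on the slide; the class and function names are hypothetical, and the heap figure is an upper bound that assumes every slot is occupied and every task JVM grows to its maximum heap at the same time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterConfig:
    task_nodes: int = 760            # data and task nodes
    map_slots_per_node: int = 22     # mapred.tasktracker.map.tasks.maximum
    reduce_slots_per_node: int = 4   # mapred.tasktracker.reduce.tasks.maximum
    map_heap_gb: float = 1.5         # map task JVM memory
    reduce_heap_gb: float = 4.0      # reduce task JVM memory
    node_ram_gb: int = 48            # physical memory per node


def capacity(cfg: ClusterConfig) -> dict:
    """Derive slot counts and the worst-case task heap per node."""
    heap_per_node = (cfg.map_slots_per_node * cfg.map_heap_gb
                     + cfg.reduce_slots_per_node * cfg.reduce_heap_gb)
    return {
        "map_slots": cfg.task_nodes * cfg.map_slots_per_node,
        "reduce_slots": cfg.task_nodes * cfg.reduce_slots_per_node,
        "max_task_heap_per_node_gb": heap_per_node,
        "heap_exceeds_ram": heap_per_node > cfg.node_ram_gb,
    }


summary = capacity(ClusterConfig())
print(summary)
```

Under those assumptions the configured heaps add up to slightly more than the 48 GB of physical memory per node, so the settings rely on tasks not all reaching their heap limit simultaneously.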

Sort Acceleration - Terasort
Use case: TERASORT. Category: Sort Acceleration. Baseline (Native/Alternative): Native.

Data Size  Native   DMX-h    Elapsed  Native       DMX-h Phys.  Memory   Native      DMX-h       CPU      Native       DMX-h
(GB)       Elapsed  Elapsed  Improv.  Memory (GB)  Memory (GB)  Improv.  CPU Time    CPU Time    Improv.  MB/Sec/Node  MB/Sec/Node
512        0:01:47  0:01:45  2%       12,863       12,873       0%       114,297     62,491      45%      6.5          6.6
1,024      0:02:29  0:01:11  52%      14,512       14,522       0%       194,896     98,972      49%      9.3          19.4
1,536      0:04:02  0:01:23  66%      14,684       14,694       0%       287,055     143,759     50%      8.6          25.0
4,096      0:03:31  0:02:29  29%      31,520       31,549       0%       927,379     380,442     59%      26.2         37.0
10,242     0:08:51  0:05:14  41%      47,935       47,951       0%       2,835,927   1,460,101   49%      26.4         44.6
20,484     0:14:55  0:12:28  16%      106,153      105,239      1%       6,112,296   3,696,727   40%      31.0         37.4
102,400    1:12:12  0:51:59  28%      387,262      387,211      0%       30,436,624  16,589,332  45%      32.3         44.9

File CDC
Use case: FileCDC. Category: ETL. Baseline (Native/Alternative): Pig.

Data Size  Pig      DMX-h    Elapsed  Pig          DMX-h Phys.  Memory   Pig         DMX-h       CPU      Pig          DMX-h
(GB)       Elapsed  Elapsed  Improv.  Memory (GB)  Memory (GB)  Improv.  CPU Time    CPU Time    Improv.  MB/Sec/Node  MB/Sec/Node
148        0:05:31  0:01:33  72%      79,876       79,559       0%       79,876      79,559      0%       0.6          2.2
450        0:05:11  0:01:58  62%      243,834      182,869      25%      243,834     182,869     25%      1.9          5.3
1,515      0:07:49  0:03:44  52%      845,263      557,226      34%      845,263     557,226     34%      4.4          9.4
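
The slides do not state how the improvement columns were derived. The Python sketch below assumes the usual relative-reduction formula, (baseline - DMX-h) / baseline, and checks it against a few printed values; the function names are hypothetical, and the formula is an inference from the numbers rather than something the deck documents.

```python
def to_seconds(hms: str) -> int:
    """Convert an h:mm:ss elapsed time, as printed on the slides, to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def improvement_pct(baseline: float, candidate: float) -> float:
    """Relative reduction of candidate versus baseline, in percent (assumed formula)."""
    return 100.0 * (baseline - candidate) / baseline


# Terasort, 1,024 GB: elapsed time, native versus DMX-h
elapsed_1tb = improvement_pct(to_seconds("0:02:29"), to_seconds("0:01:11"))
# Terasort, 102,400 GB: elapsed time
elapsed_100tb = improvement_pct(to_seconds("1:12:12"), to_seconds("0:51:59"))
# Terasort, 512 GB: CPU time
cpu_512gb = improvement_pct(114_297, 62_491)
# File CDC, 148 GB: elapsed time, Pig versus DMX-h
cdc_148gb = improvement_pct(to_seconds("0:05:31"), to_seconds("0:01:33"))

print(round(elapsed_1tb), round(elapsed_100tb), round(cpu_512gb), round(cdc_148gb))
```

The same function applies to the elapsed-time, memory, and CPU columns, since each compares a baseline figure with the corresponding DMX-h figure.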

Web Log Aggregation
Use case: WebLogAggregation (Split Size & fixes). Baseline (Native/Alternative): Pig.

Data Size  Pig      DMX-h    Elapsed  Pig          DMX-h Phys.  Memory   Pig         DMX-h       CPU      Pig          DMX-h
(GB)       Elapsed  Elapsed  Improv.  Memory (GB)  Memory (GB)  Improv.  CPU Time    CPU Time    Improv.  MB/Sec/Node  MB/Sec/Node
2,067      0:01:12  0:00:58  19%      13,499       7,813        42%      145,972     56,496      61%      40.1         49.8
4,135      0:01:42  0:01:23  19%      18,003       15,579       13%      300,627     152,390     49%      56.1         69.6
10,240     0:05:16  0:02:04  61%      40,773       39,091       4%       807,473     335,537     58%      45.3         115.4
20,480     0:07:54  0:06:58  12%      78,654       78,128       1%       1,339,453   568,107     58%      60.4         68.4

Test Drive DMX-h:
Bridge the Gap Between Big Iron & Big Data!
• Self-contained image
• Use case accelerators for mainframe, Hadoop and more!

Running on CDH
A Smarter Approach… and Quite Possibly The Only Approach!
www.syncsort.com/try


Why Hadoop is important to Syncsort


Editor's Notes

  • #14: The ability to process and analyze mainframe data with Hadoop can open up a wealth of opportunities by delivering deeper analytics, at lower cost. Unfortunately, there are no native Hadoop ETL capabilities for mainframe. Simply ingesting mainframe data involves lots of manual effort and coding, plus a combination of mainframe and Hadoop skills that are nearly impossible to find. The Use Case Accelerators for Mainframe Connectivity and Translation combine decades of mainframe expertise with state-of-the-art Hadoop capabilities to provide a painless and seamless approach to leverage mainframe data. Read files directly from the mainframe, parse and transform the data – packed decimal, COMP, EBCDIC/ASCII, multi-format records, and more - without installing any software on the mainframe and without writing any code. SAMPLE EBCDIC data!