Why Hadoop is important to Syncsort

Why was it so important to us to open the MapReduce framework?
12/11/2013

Syncsort Confidential and Proprietary - do not copy or distribute

Agenda
Who are we?
What did we do?
Why did we do that?
With whom did we do it?

With what results?


Syncsort
For 40 years we have been helping companies solve their big data
issues…even before they knew the name Big Data!
Integrating Big Data…
Smarter!

Our customers are achieving the
impossible, every day!

• 50% of all mainframes run Syncsort
• 1,500 Mainframe Customers: Most used & trusted 3rd party mainframe software
• Speed leader for ETL & Sort
• A history of innovation
• 25+ Issued & Pending Patents

• Large global customer base
• 15,000+ deployments in 68 countries

• First-to-market, fully integrated approach to Hadoop ETL

Key Partners

Smart Contributions to Improve Hadoop
Augmenting Critical Batch
Processing Capabilities

JIRA   Description
4807   Allow MapOutputBuffer to be pluggable
4808   Allow Reduce-side merge to be pluggable
4809   Make classes required for 2454 public
4812   Create reduce input merger plug-in
4842   Shuffle race can hang reducer
2461   HDFS file name globbing in libhdfs
4482   Backport of 2454 to MapReduce 1 & 1.2

Plugin Shipping on CDH 4.2 and later


Opening the MapReduce Framework
Mapper -> Output Sorter -> Shuffle -> Input Sorter -> Reducer

• Output Sorter and Input Sorter: here and here to replace MapReduce native sort
• Mapper and Reducer: here to perform functional logic on our engine
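
As a rough illustration of what pluggable sort stages mean for the pipeline above, the following Python sketch models a single-reducer job in which the map-side output sorter and the reduce-side input sorter are passed in as functions. It is a toy model, not Hadoop or DMX-h code: `run_job`, `native_sort`, and `custom_sort` are hypothetical names chosen for this example.

```python
from itertools import groupby
from operator import itemgetter


def native_sort(records):
    """Stand-in for the framework's built-in sort (orders records by key)."""
    return sorted(records, key=itemgetter(0))


def run_job(splits, mapper, reducer,
            output_sorter=native_sort, input_sorter=native_sort):
    """Toy single-reducer MapReduce run with two pluggable sort stages."""
    # Map side: map each input split, then sort that split's output.
    map_outputs = []
    for split in splits:
        records = [kv for line in split for kv in mapper(line)]
        map_outputs.append(output_sorter(records))

    # Shuffle: with one reducer, every map output is sent to it.
    shuffled = [kv for output in map_outputs for kv in output]

    # Reduce side: order the shuffled records, then reduce each key group.
    ordered = input_sorter(shuffled)
    return [reducer(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


def word_mapper(line):
    return [(word, 1) for word in line.split()]


def count_reducer(key, values):
    return (key, sum(values))


# A replacement sorter with the same contract as native_sort.
# In this sketch it only logs how many records each call received.
sorter_calls = []


def custom_sort(records):
    sorter_calls.append(len(records))
    return sorted(records, key=itemgetter(0))


splits = [["a b a"], ["b c"]]
default_result = run_job(splits, word_mapper, count_reducer)
plugged_result = run_job(splits, word_mapper, count_reducer,
                         output_sorter=custom_sort, input_sorter=custom_sort)
print(default_result)  # [('a', 2), ('b', 2), ('c', 1)]
print(sorter_calls)    # [3, 2, 5]
```

The job result is the same with either sorter; only the component doing the ordering changes, which is the point of making the two stages pluggable.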

Syncsort: Just integrating data … faster

• A simple DI engine, easy to deploy, operate, and administer
• ETL-like development GUI
• Auto-tuning
• Best patented algorithms: Sort, Join, Aggregate, Copy, Merge
• Fast, fast, faster than any other
• The more data the better

From Data to Big Data
[Figure: data growth along three axes, each marked with rising cost ($$$)]
• Variety
• Velocity: Quarterly, Monthly, Weekly, Daily, Intra-day, Right / Real-time
• Volume: Mainframe, PC, Internet Revolution, Mobile & Social Media Revolution
• Timeline: 60s, 70s, 80s, 90s, 2000s, 2010s, Next?

Smart Architecture
Hadoop Integration… for Real
(No Code Generation. No Compiling. No Bolts. No Nuts!)

• Runs natively within MapReduce
• Small footprint installs on every node
• Open source contributions extend capabilities of MapReduce

Hadoop Cluster

Unleash Hadoop’s Potential
• Pluggable sort
• Expanded use cases (e.g. “No sort” option)
• Vertical scalability
• Design flexibility (MapMapReduceReduce)

No need to worry about this…

Cloudera + Syncsort: Smarter Connectivity… Also for Mainframe
Because Mainframe Is Big Data Too!

Connect

• Read files directly from mainframe
• No software required on mainframe
• Already installed on 50% of mainframes

Translate

• Parse & transform: packed
decimal, EBCDIC/ASCII, multi-format
• No coding required

Load & Process

• Load directly to HDFS
• Offload batch data processing
• Find more insights


Syncsort DMX-h + Cloudera Manager
[Diagram labels: Cloudera Manager; CDH Cluster + ISV software; Support Integration; Monitoring; Syncsort DMX-h; API; Management; Installation; CDH Nodes]

DMX-h on every CDH node

Test cases
Sort Acceleration
– Terasort
• Run terasort with DMX-h and without DMX-h in various configurations to
compare performance.

ETL
– Use DMX-h to perform several different ETL jobs and compare against
equivalent jobs in Pig (Apache Pig version 0.9.2-gphd-1.2.0.0).
• File Change Data Capture (CDC)
• Web Log Aggregation
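
For readers unfamiliar with file CDC, the sketch below shows the general idea in Python: compare an old and a new snapshot of a keyed file and emit the inserts, updates, and deletes between them. This is an assumption about the kind of job meant here, not the DMX-h or Pig job used in the tests; `file_cdc` and the sample records are hypothetical.

```python
def file_cdc(old_rows, new_rows, key):
    """Return (change_type, row) pairs describing how new_rows differs from old_rows."""
    old = {key(row): row for row in old_rows}
    new = {key(row): row for row in new_rows}
    changes = []
    # Visit every key seen in either snapshot, in sorted order.
    for k in sorted(old.keys() | new.keys()):
        if k not in old:
            changes.append(("INSERT", new[k]))
        elif k not in new:
            changes.append(("DELETE", old[k]))
        elif old[k] != new[k]:
            changes.append(("UPDATE", new[k]))
        # Rows that are identical in both snapshots produce no change record.
    return changes


# Hypothetical snapshots: (id, name) records keyed on id.
old_snapshot = [("1", "alice"), ("2", "bob"), ("3", "carol")]
new_snapshot = [("1", "alice"), ("2", "robert"), ("4", "dave")]

changes = file_cdc(old_snapshot, new_snapshot, key=lambda row: row[0])
print(changes)
# [('UPDATE', ('2', 'robert')), ('DELETE', ('3', 'carol')), ('INSERT', ('4', 'dave'))]
```

At cluster scale the same comparison is typically done as a sort and join on the key rather than with in-memory dictionaries.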



File CDC
DMX-h

Pig

Java

149
Lines of Code

70
Lines of Code

Web Log Aggregation
DMX-h

Pig

Java

94
Lines of Code

48
Lines of Code

Cluster Configuration – DMX-h Ran on 763 Nodes!
Cluster Specs:
– 763 node cluster
• 1 node – job tracker
• 1 node – name node
• 1 node – secondary name node
• 760 data and task nodes

Hadoop cluster configuration changes (from defaults):
– 128 MB HDFS block size (file.blocksize)
– 1.5 GB map / 4 GB reduce task JVM memory (mapred.child.java.opts)
– Maximum 22 map tasks and 4 reduce tasks per node (mapred.tasktracker.map.tasks.maximum & mapred.tasktracker.reduce.tasks.maximum)
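
To put these settings in cluster-wide terms, the short Python calculation below multiplies the per-node task limits by the 760 data and task nodes. It is only arithmetic on the figures listed above; the variable names are made up for this example, and it assumes the 1.5 GB and 4 GB figures are per-task JVM sizes.

```python
WORKER_NODES = 760            # data and task nodes in the cluster
MAP_SLOTS_PER_NODE = 22       # maximum map tasks per node
REDUCE_SLOTS_PER_NODE = 4     # maximum reduce tasks per node
MAP_JVM_GB = 1.5              # assumed per-task JVM size for map tasks
REDUCE_JVM_GB = 4.0           # assumed per-task JVM size for reduce tasks

total_map_slots = WORKER_NODES * MAP_SLOTS_PER_NODE
total_reduce_slots = WORKER_NODES * REDUCE_SLOTS_PER_NODE

# Upper bound on task JVM memory per node if every slot is busy at once.
max_task_jvm_gb_per_node = (MAP_SLOTS_PER_NODE * MAP_JVM_GB
                            + REDUCE_SLOTS_PER_NODE * REDUCE_JVM_GB)

print(total_map_slots)           # 16720
print(total_reduce_slots)        # 3040
print(max_task_jvm_gb_per_node)  # 49.0
```

Under that assumption the per-node upper bound (49 GB) is close to the 48 GB of memory listed for each node; the slides do not say how task memory was managed in practice.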


Cluster Node Specs:
– 12 cores - Dual Intel Westmere (Hexcore) CPUs, 2.93 GHz, 12 MB Cache
– 48GB DDR3 RDIMM Memory
– 12 x 2TB 3.5” drives Seagate 7200rpm.
– Disk 0 + Disk 1 are RAID1 (mirrored)
for OS.
• 100 MB/Sec write
• 115 MB/Sec read

– 10 single disk JBOD
– Mellanox ConnectX®-3 VPI NIC (Supported data rates 40GbE; 10GbE)
– RHEL 6.1 64-bit
– Java 1.6 (jdk.x86_64-2000:1.6.0_29fcs)


Sort Acceleration - Terasort

Use case: Terasort (sort acceleration, DMX-h vs. native)

Elapsed time and throughput
Data size (GB)  Native elapsed  DMX-h elapsed  Improvement  Native MB/sec/node  DMX-h MB/sec/node
512             0:01:47         0:01:45        2%           6.5                 6.6
1,024           0:02:29         0:01:11        52%          9.3                 19.4
1,536           0:04:02         0:01:23        66%          8.6                 25.0
4,096           0:03:31         0:02:29        29%          26.2                37.0
10,242          0:08:51         0:05:14        41%          26.4                44.6
20,484          0:14:55         0:12:28        16%          31.0                37.4
102,400         1:12:12         0:51:59        28%          32.3                44.9

Memory and CPU
Data size (GB)  Native memory (GB)  DMX-h physical memory (GB)  Memory improvement  Native CPU time  DMX-h CPU time  CPU improvement
512             12,863              12,873                      0%                  114,297          62,491          45%
1,024           14,512              14,522                      0%                  194,896          98,972          49%
1,536           14,684              14,694                      0%                  287,055          143,759         50%
4,096           31,520              31,549                      0%                  927,379          380,442         59%
10,242          47,935              47,951                      0%                  2,835,927        1,460,101       49%
20,484          106,153             105,239                     1%                  6,112,296        3,696,727       40%
102,400         387,262             387,211                     0%                  30,436,624       16,589,332      45%
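
The improvement and throughput figures for Terasort appear to follow from the elapsed times and data sizes. The Python sketch below shows one way to derive them, using the 1,024 GB run as a check. The formulas are inferred from the numbers, not stated in the slides: they assume improvement is measured relative to the native run, 1 GB = 1,024 MB, and throughput is spread over the 760 data and task nodes. The function names are hypothetical.

```python
def to_seconds(hms):
    """Convert an 'h:mm:ss' elapsed time to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def improvement_pct(native, dmx):
    """Reduction relative to the native figure, as a whole percentage."""
    return round(100 * (native - dmx) / native)


def mb_per_sec_per_node(data_gb, elapsed_seconds, nodes=760):
    """Per-node throughput, assuming 1 GB = 1,024 MB and 760 worker nodes."""
    return round(data_gb * 1024 / elapsed_seconds / nodes, 1)


# Terasort, 1,024 GB run: native 0:02:29, DMX-h 0:01:11.
native_s = to_seconds("0:02:29")
dmx_s = to_seconds("0:01:11")

elapsed_gain = improvement_pct(native_s, dmx_s)
cpu_gain = improvement_pct(194_896, 98_972)       # CPU time for the same run
native_rate = mb_per_sec_per_node(1024, native_s)
dmx_rate = mb_per_sec_per_node(1024, dmx_s)

print(elapsed_gain, cpu_gain)   # 52 49
print(native_rate, dmx_rate)    # 9.3 19.4
```

These values match the reported 52% elapsed time improvement, 49% CPU improvement, and 9.3 versus 19.4 MB/sec/node for that run.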

File CDC

Use case: FileCDC (ETL, DMX-h vs. Pig)

Elapsed time and throughput
Data size (GB)  Pig elapsed  DMX-h elapsed  Improvement  Pig MB/sec/node  DMX-h MB/sec/node
148             0:05:31      0:01:33        72%          0.6              2.2
450             0:05:11      0:01:58        62%          1.9              5.3
1,515           0:07:49      0:03:44        52%          4.4              9.4

Memory and CPU
Data size (GB)  Pig memory (GB)  DMX-h physical memory (GB)  Memory improvement  Pig CPU time  DMX-h CPU time  CPU improvement
148             79,876           79,559                      0%                  79,876        79,559          0%
450             243,834          182,869                     25%                 243,834       182,869         25%
1,515           845,263          557,226                     34%                 845,263       557,226         34%



Web Log Aggregation

Use case: WebLogAggregation Split Size & fixes (DMX-h vs. Pig)

Elapsed time and throughput
Data size (GB)  Pig elapsed  DMX-h elapsed  Improvement  Pig MB/sec/node  DMX-h MB/sec/node
2,067           0:01:12      0:00:58        19%          40.1             49.8
4,135           0:01:42      0:01:23        19%          56.1             69.6
10,240          0:05:16      0:02:04        61%          45.3             115.4
20,480          0:07:54      0:06:58        12%          60.4             68.4

Memory and CPU
Data size (GB)  Pig memory (GB)  DMX-h physical memory (GB)  Memory improvement  Pig CPU time  DMX-h CPU time  CPU improvement
2,067           13,499           7,813                       42%                 145,972       56,496          61%
4,135           18,003           15,579                      13%                 300,627       152,390         49%
10,240          40,773           39,091                      4%                  807,473       335,537         58%
20,480          78,654           78,128                      1%                  1,339,453     568,107         58%



Test Drive DMX-h:
Bridge the Gap Between
Big Iron & Big Data!
• Self-contained image
• Use case accelerators for mainframe, Hadoop and more!

Running on CDH
A Smarter Approach…


www.syncsort.com/try
…and Quite Possibly The Only Approach!