Ken Krugler | President, Scale Unlimited
Faster Workflows, Faster
The Twitter Pitch
• Cascading is a solid, established workflow API
•Good for complex custom ETL workflows
• Flink is a new streaming dataflow engine
•50% better performance by leveraging memory
Perils of Comparisons
• Performance comparisons are clickbait for devs
•“You won’t believe the speed of Flink!”
• Really, really hard to do well
• I did “moderate” tuning of existing code base
•Somewhat complex workflow in EMR
•Dataset bigger than memory (100M…1B records)
TL;DR
• Flink gets faster by minimizing disk I/O
•Map-Reduce job always has write/read at job break
•Can also spill map output, reduce merge-sort
• Flink has no job boundaries
•And no map-side spills
•So only reduce merge-sort is extra I/O
TOC
• A very short intro to Cascading
•An even shorter intro to Flink
• An example of converting a workflow
• More in-depth results
In the beginning…
• There was Hadoop, and it was good
•But life is too short to write M-R jobs with K-V data
• Then along came Cascading…
Cascading
What is Cascading?
• A thin Java library on top of Hadoop
•An open source project (Apache License, 8 years old)
• An API for defining and running ETL workflows
30,000ft View
• Records (Tuples) flow through Pipes
• Pipes connect Operations
• You do Operations on Tuples
• Tuples flow into Pipes from Source Taps
• Tuples flow from Pipes into Sink Taps
• This is a data processing workflow (Flow)
Java API to Define Flow
Pipe ipDataPipe = new Pipe("ip data pipe");
RegexParser ipDataParser = new RegexParser(new Fields("Data IP", "Country"), "^([\\d.]+)\\t(.*)");
ipDataPipe = new Each(ipDataPipe, new Fields("line"), ipDataParser);
Pipe logAnalysisPipe = new CoGroup( logDataPipe,           // left-side pipe
                                    new Fields("Log IP"),  // left-side field for joining
                                    ipDataPipe,            // right-side pipe
                                    new Fields("Data IP"), // right-side field for joining
                                    new LeftJoin());       // type of join to do
logAnalysisPipe = new GroupBy(logAnalysisPipe, new Fields("Country", "Status"));
logAnalysisPipe = new Every(logAnalysisPipe, new Count(new Fields("Count")));
logAnalysisPipe = new Each(logAnalysisPipe, new Fields("Country"), new Not(new RegexFilter("null")));
Tap logDataTap = new Hfs(new TextLine(), "access.log");
Tap ipDataTap = new Hfs(new TextLine(), "ip-map.tsv");
Tap outputTap = new Hfs(new TextLine(), "results");
FlowDef flowDef = new FlowDef().setName("log analysis flow")
    .addSource(logDataPipe, logDataTap).addSource(ipDataPipe, ipDataTap)
    .addTailSink(logAnalysisPipe, outputTap);
Flow flow = new HadoopFlowConnector(properties).connect(flowDef);
Visualizing the DAG
Things I Like
• “Stream Shaping” - easy to add, drop fields
•Field consistency checking in DAG
• Building blocks - Operations, SubAssemblies
• Flexible planner - MR, local, Tez
Time for a Change
• I’ve used Cascading for 100s of projects
•And made a lot of money consulting on ETL
• But … it was getting kind of boring
Flink
Elevator Pitch
• High throughput/low latency stream processing
•Also supports batch (bounded streams)
• Runs locally, stand-alone, or in YARN
• Super-awesome team
Versus Spark? Sigh…OK
• Very similar in many ways
•Natively streaming, vs. natively batch
• Not as mature, smaller community/ecosystem
Similar to Cascading
• You define a DAG with Java (or Scala) code
•You have data sources and sinks
• Data flows through streams to operators
• The planner turns this into a bunch of tasks
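For comparison with the Cascading snippet earlier, a minimal sketch of the same source → operators → sink shape in Flink's Java DataSet API (the batch API of that era). The input path and field layout are illustrative, not from the talk:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class FlinkDagSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<String> lines = env.readTextFile("access.log");   // source

        DataSet<Tuple2<String, Integer>> countsPerIp = lines
            .map(new MapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public Tuple2<String, Integer> map(String line) {
                    // first tab-separated field is the IP
                    return new Tuple2<>(line.split("\t")[0], 1);
                }
            })
            .groupBy(0)   // operator: group by IP
            .sum(1);      // operator: count per IP

        countsPerIp.writeAsCsv("results");                        // sink
        env.execute("flink dag sketch");   // the planner turns this into tasks
    }
}
```

The `env.execute()` call is where the declared DAG actually gets planned and run - until then you've only built a description of the flow, much like connecting Pipes before handing them to a FlowConnector.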
It’s Faster, But…
• I don’t want to rewrite my code
•I use lots of custom Cascading schemes
• I don’t really know Scala
•And POJOs ad nauseam are no fun
•Same for Tuple21<Integer, String, String, …>
Scala is the New APL
val input = env.readFileStream(fileName,100)
.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
.timeWindowAll(Time.of(60, TimeUnit.SECONDS))
.trigger(ContinuousProcessingTimeTrigger.of(Time.seconds(5)))
.fold(Set[String]()){(r,i) => { r + i}}
.map{x => (new Timestamp(System.currentTimeMillis()), x.size)}
Tuple21 Hell
public void reduce(Iterable<Tuple2<Tuple3<String, String, Integer>, Tuple2<String, Integer>>> tupleGroup,
                   Collector<Tuple3<String, String, Double>> out) {
    for (Tuple2<Tuple3<String, String, Integer>, Tuple2<String, Integer>> tuple : tupleGroup) {
        Tuple3<String, String, Integer> idWC = tuple.f0;
        Tuple2<String, Integer> idTW = tuple.f1;
        out.collect(new Tuple3<String, String, Double>(idWC.f0, idWC.f1, (double) idWC.f2 / idTW.f1));
    }
}
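One escape from the generics noise (not code from the talk - the POJO names here are hypothetical) is to trade the nested tuples for small POJOs, so the same per-record division reads with field names instead of f0/f1/f2:

```java
// Hedged sketch: the same math as the Tuple3/Tuple2 reduce, with POJOs.
// TermCount and TermTotal are illustrative names, not types from the talk.
public class PojoReduce {
    static class TermCount {
        final String docId; final String term; final int count;
        TermCount(String docId, String term, int count) {
            this.docId = docId; this.term = term; this.count = count;
        }
    }

    static class TermTotal {
        final String docId; final int total;
        TermTotal(String docId, int total) { this.docId = docId; this.total = total; }
    }

    // Equivalent of out.collect(... (double) idWC.f2 / idTW.f1)
    static double relativeFrequency(TermCount wc, TermTotal tw) {
        return (double) wc.count / tw.total;
    }

    public static void main(String[] args) {
        TermCount wc = new TermCount("doc1", "flink", 3);
        TermTotal tw = new TermTotal("doc1", 12);
        System.out.println(relativeFrequency(wc, tw)); // prints 0.25
    }
}
```

The cost, as the slide notes, is defining a POJO class for every intermediate shape in the flow - exactly the "POJOs ad nauseam" complaint.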
Cascading-Flink
Cascading 3 Planner
• Converts the Cascading DAG into a Flink DAG
•Around 5K lines of code
• The DAG it plans looks like Cascading local mode
Boundaries for Data Sets
•Speed == no spill to disk
•Task CPU is the same
•Other than serde time
How Painful?
• Use the FlinkFlowConnector
•Flink Flow planner to convert DAG to job
• Uber jar vs. classic Hadoop jar
• Grungy details of submitting jobs to EMR cluster
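The connector swap is the claimed one-line change. A sketch against the FlowDef from the earlier Cascading slide (connector class name as the talk gives it - check the cascading-flink project for the exact import):

```java
// Before: classic Hadoop MapReduce planner
Flow hadoopFlow = new HadoopFlowConnector(properties).connect(flowDef);

// After: same FlowDef, planned as a single Flink job
// (FlinkFlowConnector is the name used in the talk; see cascading-flink)
Flow flinkFlow = new FlinkFlowConnector(properties).connect(flowDef);
flinkFlow.complete();
```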
Wikiwords Workflow
• Find association between terms and categories
•For every page in Wikipedia, for every term
•Find distance from term to intra-wiki links
• Then calc statistics to find “strong association”
•Prob unusually high that term is close to link
Timing Test Details
• EMR cluster with 5 i2.xlarge slaves
•~1 billion input records (term, article ref, distance)
• Hadoop MapReduce took 148 minutes
• Flink took 98 minutes
•So 1.5x faster - nice but not great
•Mostly due to spillage in many boundaries
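The headline speedup is just the ratio of the two wall-clock times on this slide:

```java
public class Speedup {
    public static void main(String[] args) {
        double mapReduceMinutes = 148.0; // Hadoop MapReduce run
        double flinkMinutes = 98.0;      // Flink run of the same workflow
        double speedup = mapReduceMinutes / flinkMinutes;
        System.out.printf("%.2fx faster%n", speedup); // about 1.51x
    }
}
```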
Summary
If You’re a Java ETL Dev
• And you have to deal with batch big data
•Then the Cascading API is a good fit
• And using Flink typically gives better performance
• While still using a standard Hadoop/YARN cluster
Status of Cascading-Flink
• Still young, but surprisingly robust
•Doesn’t support Full or RightOuter HashJoins
• Pending optimizations
•Tuple serialization
•Flink improvements
Better Planning…
•Defer late-stage join
•Avoid premature resource usage
More questions?
• Feel free to contact me
•http://www.scaleunlimited.com/contact/
•ken@scaleunlimited.com
• Check out Cascading, Flink, and Cascading-Flink
•http://www.cascading.org
•http://flink.apache.org
•http://github.com/dataArtisans/cascading-flink


Editor's Notes

  • #4: Before we go further, where did I get that 50% faster number? Did you really tune both systems appropriately? And what’s a reasonable amount of tuning? What version did you use? Oh, there’s a new release that’s way faster. Both are fast-moving targets. Need to test with data at scale. Systems like Flink & Spark make it harder, because if it fits in memory then it’s much faster. But for smallish data sets, who cares if it goes from 15 minutes to 5 minutes, or even 1 minute? Now for a number of ML algorithms that is significant, since you’re iterating. But I’m talking about classic ETL stuff, so you have to do runs that take hours, where data spills to disk, and you’re stressing the system. Workflow was 12 Hadoop jobs, with moderate parallelism. And by bigger than memory, I mean that many of the tasks in the flow were processing 100M -> 1B records. To me, that represents a more typical ETL, vs. a big reduction early in the workflow. Many newer systems focus on datasets that fit in memory, where they really fly (e.g. ML)
  • #5: You have to ultimately read the source data, and write the sink data. If the jobs are CPU-bound, then Flink is no faster. The code doesn’t magically get more efficient because Flink is “newer”. Because we’re talking about batch workflows here (not streaming) any group/join has to accumulate all the data in memory.
  • #6: I've been to many conferences, where I’m at a talk that isn’t right for me. But stuck too close to the front. I'm going to wet my whistle, I won't be offended if you get up and walk out
  • #7: At my previous startup (Krugle) we were using Hadoop to process open source code. Unless you’re just coding up the Hello World of big data - aka word count.
  • #9: Not an Apache project
  • #10: A Tuple is just a list of values.
  • #11: A pipe has a list of field names & types (like the header row in a CSV file)
  • #16: I prefer a more fluid builder approach. Cascading has a start on this, but after using Flink I think I now know what I want it to look like.
  • #18: Without this approach, you’re defining many POJOs (Crunch) or using field indexes to generic arrays of data You know if a field is missing in a downstream operation when the graph is built, versus at run-time Encourages code re-use You can run Cascading workflows locally, or using map-reduce, or (now) on Tez.
  • #19: Been using Cascading for more than 6 years. So I started checking out streaming options - Storm, then Samza, looked at Kafka streaming.
  • #21: Apache project, of course Flink is relatively new - incubator about 2 years ago, graduated to TLP 6 months later.
  • #22: I can’t say I’ve used either enough to feel good about summarizing. Similar in many ways, focusing on leveraging lots of RAM, iterative calculations, Scala/Java mix, streaming & batch modes. Both are moving targets, so comparisons are hard - e.g. in the past Flink had better memory management, but Spark’s Project Tungsten has improved things a lot.
  • #24: Remember all those nice building blocks from Cascading? Stream shaping means defining a new POJO class for each such change
  • #25: When I go back to look at non-trivial code, I’m like WTF???
  • #26: Tuple of tuple of tuple… So then I found out that there was this project called Cascading-Flink
  • #28: It’s all a single step
  • #29: The graph is the complete flow. Green is where we’re reading from/writing to HDFS, yellow is where data has to collect (group/join operations). So the only explicit read/write ops happen at top/bottom, unlike Hadoop where there are twelve jobs, each with an HDFS write-then-read boundary. Group/join are where data accumulates in memory, and might have to spill to disk if it gets too big. So the performance win is from avoiding disk I/O (writing/reading). Processing time is going to be roughly the same, other than fewer spills also meaning less serialization/deserialization.
  • #30: Pretty much a one line change - honestly. We leverage a cascading.utils library that hides some of this from us. I know, many people say don’t use a Hadoop jar, but it’s proven the most portable/reliable approach for us. Used Hadoop 2.7 in EMR
  • #31: Out of all of the terms that are “close to” a link to a page (across all Wikipedia pages), this term has an unusually high probability. And once I’ve got that, plus the categories a page belongs to, I can calculate this same associative value for terms to categories. And with that data in hand, I can extract all the terms from some arbitrary page, sum the category association strengths for any that are known. This gives me a set of categories with scores for an arbitrary chunk of text. That’s the theory, I’m still trying this out to see if it’s really effective.
  • #32: And time spent doing slower/more general serialization/deserialization using Kryo
  • #35: I’ve run into some issues but devs have been very responsive in fixing them. Any improvements in Flink that help DataSet (batch) will help Cascading
  • #36: This would be a change in Flink, not Cascading-Flink