Spark-Powered Smart Data Warehouse
Sim Simeonov, Founder & CTO, Swoop
Goal-Based Data Production
petabyte-scale data engineering & ML
run our business
What is a data-powered culture?
answer questions 10x easier with
goal-based data production
simple data production request
top 10 campaigns by gross revenue
running on health sites,
weekly for the past 2 complete weeks,
with cost-per-click and
the click-through rate
with
origins as (select origin_id, site_vertical from dimension_origins where site_vertical = 'health'),
t1 as (
select campaign_id, origin_id, clicks, views, billing,
date(to_utc_timestamp(date_sub(to_utc_timestamp(from_unixtime(uts), 'America/New_York'),
cast(date_format(to_utc_timestamp(from_unixtime(uts), 'America/New_York'), 'u') as int) - 1), 'America/New_York')) as ts
from rep_campaigns_daily
where from_unixtime(uts) >= to_utc_timestamp(date_sub(to_utc_timestamp(current_date(), 'America/New_York'),
cast(date_format(to_utc_timestamp(current_date(), 'America/New_York'), 'u') as int) - 1) - INTERVAL 2 WEEK,
'America/New_York')
and from_unixtime(uts) < to_utc_timestamp(date_sub(to_utc_timestamp(current_date(), 'America/New_York'),
cast(date_format(to_utc_timestamp(current_date(), 'America/New_York'), 'u') as int) - 1), 'America/New_York')
),
t2 as (
select ts, campaign_id, sum(views) as views, sum(clicks) as clicks, sum(billing) / 1000000.0 as gross_revenue
from t1 lhs join origins rhs on lhs.origin_id = rhs.origin_id
group by campaign_id, ts
),
t3 as (select *, rank() over (partition by ts order by gross_revenue desc) as rank from t2),
t4 as (select * from t3 where rank <= 10)
select
ts, rank, campaign_short_name as campaign,
bround(gross_revenue) as gross_revenue,
format_number(if(clicks = 0, 0, gross_revenue / clicks), 3) as cpc,
format_number(if(views = 0, 0, 100 * clicks / views), 3) as pct_ctr
from t4 lhs join dimension_campaigns rhs on lhs.campaign_id = rhs.campaign_id
order by ts, rank
DSL is equally complex
spark.table("rep_campaigns_daily")
.where("""
from_unixtime(uts) >= to_utc_timestamp(date_sub(to_utc_timestamp(current_date(), 'America/New_York'),
cast(date_format(to_utc_timestamp(current_date(), 'America/New_York'), 'u') as int) - 1) - INTERVAL 2 WEEK, 'America/New_York')
and from_unixtime(uts) < to_utc_timestamp(date_sub(to_utc_timestamp(current_date(), 'America/New_York'),
cast(date_format(to_utc_timestamp(current_date(), 'America/New_York'), 'u') as int) - 1), 'America/New_York')""")
.join(spark.table("dimension_origins").where('site_vertical === "health").select('origin_id), "origin_id")
.withColumn("ts", expr("date(to_utc_timestamp(date_sub(to_utc_timestamp(from_unixtime(uts), 'America/New_York'),
cast(date_format(to_utc_timestamp(from_unixtime(uts), 'America/New_York'), 'u') as int) - 1), 'America/New_York'))"))
.groupBy('campaign_id, 'ts)
.agg(
sum('views).as("views"),
sum('clicks).as("clicks"),
(sum('billing) / 1000000).as("gross_revenue")
)
.withColumn("rank", rank.over(Window.partitionBy('ts).orderBy('gross_revenue.desc)))
.where('rank <= 10)
.join(spark.table("dimension_campaigns").select('campaign_id, 'campaign_short_name.as("campaign")), "campaign_id")
.withColumn("gross_revenue", expr("bround(gross_revenue)"))
.withColumn("cpc", format_number(when('clicks === 0, 0).otherwise('gross_revenue / 'clicks), 3))
.withColumn("pct_ctr", format_number(when('views === 0, 0).otherwise(lit(100) * 'clicks / 'views), 3))
.select('ts, 'rank, 'campaign, 'gross_revenue, 'cpc, 'pct_ctr)
.orderBy('ts, 'rank)
Spark needs to know
what you want and
how to produce it
General data processing requires
detailed instructions every single time
the curse of generality: verbosity
what (5%) vs. how (95%)
the curse of generality: duplication
-- calculate click-through rate, %
format_number(if(views = 0, 0, 100 * clicks / views), 3) as pct_ctr
-- join to get campaign names from campaign IDs
SELECT campaign_short_name AS campaign, ...
FROM ... lhs
JOIN dimension_campaigns rhs
ON lhs.campaign_id = rhs.campaign_id
the curse of generality: complexity
-- weekly
date(to_utc_timestamp(
date_sub(
to_utc_timestamp(from_unixtime(uts), 'America/New_York'),
cast(
date_format(
to_utc_timestamp(from_unixtime(uts), 'America/New_York'),
'u')
as int) - 1),
'America/New_York'))
the curse of generality: inflexibility
-- code depends on time column datatype, format & timezone
date(to_utc_timestamp(
date_sub(
to_utc_timestamp(from_unixtime(uts), 'America/New_York'),
cast(
date_format(
to_utc_timestamp(from_unixtime(uts), 'America/New_York'),
'u')
as int) - 1),
'America/New_York'))
Can we keep what and toss how?
Can how become implicit context?
• Data sources, schema, join relationships
• Presentation & formatting
• Week start (Monday)
• Reporting time zone (East Coast)
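A hedged sketch of how that implicit context could be registered once and shared by every request. The Prop(...) keys reuse the requirement names shown later in the deck; the .set(...) registration call itself is an assumption, not Swoop's actual API.

// hypothetical registration of the "how" as shared context
smartDataWarehouse.context
  .set(Prop("time.week.start"), "Monday")
  .set(Prop("time.tz.source"), "UTC")
  .set(Prop("time.tz.reporting"), "America/New_York")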
goal-based data production
DataProductionRequest()
.select('campaign.top(10).by('gross_revenue), 'cpc, 'ctr)
.where("health sites")
.weekly.time("past 2 complete weeks")
.humanize
let’s go behind the curtain
the DSL is not as important as the
processing & configuration models
Data Production Request
Data Production Goal
Data Production Rule
Request · Goal · Rule
dpr.addGoal(
  ResultColumn("cpc")
)
Request · Goal · Rule
// .time("past 2 complete weeks")
dpr.addProductionRule(
  TimeFilter(CompleteWeeks(ref = Now, beg = -2)),
  at = SourceFiltering // production stage
)
Request · Goal · Rule
// .time("past 2 complete weeks")
dpr.addProductionRule(
  TimeFilter(CompleteWeeks(ref = Now, beg = -2)),
  at = SourceFiltering // production stage
)
Requirement
SourceColumn(
  role = TimeDimension()
)
Request · Goal · Rule
// .time("past 2 complete weeks")
dpr.addProductionRule(
  TimeFilter(CompleteWeeks(ref = Now, beg = -2)),
  at = SourceFiltering // production stage
)
Requirement
Prop("time.week.start")
Prop("time.tz.source")
Prop("time.tz.reporting")
goal + rule + requirement
// .weekly
val timeCol = SourceColumn(
  role = TimeDimension(maxGranularity = Weekly)
)
val ts = ResultColumn("ts",
  production = Array(WithColumn("ts", BeginningOf("week", timeCol)))
)
dpr.addGoals(ts).addGroupBy(ts)
Rule · Requirement · Context
the context satisfies a rule's requirements, e.g.:
• Source table
• Time dimension column
• Business week start
• Reporting time zone
Rule · Requirement · Context · Resolution
partial resolution rewrites the request
Rule · Requirement · Context · Resolution · Dataset
partial resolution rewrites the request;
full resolution builds the transformations
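A minimal sketch of that processing model, with all type names assumed rather than taken from Swoop's code: a rule declares requirements, the context partially resolves them by rewriting the request, and a fully resolved request builds ordinary DataFrame transformations.

import org.apache.spark.sql.DataFrame

// hypothetical model types; the real system is richer
sealed trait Requirement
final case class Prop(key: String) extends Requirement

trait Context {
  // returns a value if this context can satisfy the requirement
  def satisfy(req: Requirement): Option[String]
}

trait ProductionRule {
  def requirements: Seq[Requirement]
  // partial resolution: rewrite the rule with whatever the context satisfies
  def resolve(ctx: Context): ProductionRule
  // full resolution: no open requirements remain; emit a plain transformation
  def toTransformation: DataFrame => DataFrame
}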
Spark makes this quite easy
• Open & rewritable processing model
• Column = Expression + Metadata
• LogicalPlan
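A small, self-contained illustration of the building blocks named above, using only public Spark APIs (the sample data is made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, 100L, 7L)).toDF("campaign_id", "views", "clicks")

// Column = Expression + Metadata: a Column wraps a Catalyst Expression
// and can carry arbitrary metadata alongside it
val clicksExpression = col("clicks").expr
val tagged = col("clicks").as("clicks",
  new MetadataBuilder().putString("role", "metric").build())

// LogicalPlan: every Dataset exposes its logical plan, which is what makes
// an open, rewritable processing model possible
println(df.select(tagged).queryExecution.logical.numberedTreeString)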
not your typical enterprise/BI
death-by-configuration system
context is a metadata query
smartDataWarehouse.context
.release("3.1.2")
.merge(_.tagged("sim.ml.experiments"))
Minimal, just-in-time context is sufficient.
No need for a complete, unified model.
automatic joins by column name
{
  "table": "dimension_campaigns",
  "primary_key": ["campaign_id"],
  "joins": [ {"fields": ["campaign_id"]} ]
}
join any table with a campaign_id column
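A hedged sketch of what that join metadata could expand to at production time; the helper name and signature are assumptions, not the actual implementation.

import org.apache.spark.sql.{DataFrame, SparkSession}

// hypothetical helper: if the produced frame already contains the declared
// join fields, join the dimension table on them automatically
def autoJoin(df: DataFrame, dimTable: String, fields: Seq[String])
            (implicit spark: SparkSession): DataFrame =
  if (fields.forall(df.columns.contains)) df.join(spark.table(dimTable), fields)
  else df

// e.g. autoJoin(report, "dimension_campaigns", Seq("campaign_id"))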
automatic semantic joins
{
  "table": "dimension_campaigns",
  "primary_key": ["campaign_id"],
  "joins": [ {"fields": ["ref:swoop.campaign.id"]} ]
}
join any table with a Swoop campaign ID
calculated columns
{
  "field": "ctr",
  "description": "click-through rate",
  "expr": "if(nvl(views,0)=0,0,nvl(clicks,0)/nvl(views,0))",
  ...
}
automatically available to matching tables
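A sketch of the calculated column above being applied; the expression comes straight from the metadata, while the report frame is hypothetical.

import org.apache.spark.sql.functions.expr

// ctr becomes available on any table whose schema contains views and clicks
val withCtr = report.withColumn("ctr",
  expr("if(nvl(views,0)=0, 0, nvl(clicks,0)/nvl(views,0))"))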
humanization
{
  "field": "ctr",
  "humanize": {
    "expr": "format_number(100 * value, 3) as pct_ctr"
  }
}
change column name as units change
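Continuing the sketch, the humanization rule rescales the value to a percentage and renames the column to signal the unit change:

// 0.0123 (a ratio) is reported as "1.230" (a percentage) under a new name
val humanized = withCtr
  .withColumn("pct_ctr", expr("format_number(100 * ctr, 3)"))
  .drop("ctr")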
humanization
ResultColumn("rank_campaign_by_gross_revenue",
production = Array(...),
shortName = "rank", // simplified name
defaultOrdering = Asc, // row ordering
before = "campaign") // column ordering
optimization hints
{
  "field": "campaign",
  // hint: allows join after groupBy as opposed to before
  "unique": true,
  ...
}
a general optimizer can never do this
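A sketch of the plan difference the unique hint unlocks (frame names hypothetical): because each campaign_id maps to exactly one campaign name, the dimension join can be deferred until after aggregation, joining a handful of ranked rows rather than the full fact table.

import org.apache.spark.sql.functions.sum

// without the hint: join before groupBy, dragging the name column
// through the shuffle and aggregation
val eager = facts
  .join(dimCampaigns, "campaign_id")
  .groupBy("campaign_id", "campaign")
  .agg(sum("billing").as("gross_revenue"))

// with the hint: aggregate first, then join the (tiny) result
val deferred = facts
  .groupBy("campaign_id")
  .agg(sum("billing").as("gross_revenue"))
  .join(dimCampaigns, "campaign_id")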
automatic source & join selection
• 14+ Swoop tables can satisfy the request
• Only 2 are good choices based on cost
– 100+x faster execution
10x easier data production
val df = DataProductionRequest()
.select('campaign.top(10).by('gross_revenue), 'cpc, 'ctr)
.where("health sites")
.weekly.time("past 2 complete weeks")
.humanize
.toDF // result is a normal dataframe/dataset
revolutionary benefits
• Increased productivity & flexibility
• Improved performance & lower costs
• Easy collaboration (even across companies!)
• Business users can query Spark
At Swoop, we are committed to making
goal-based data production
the best free & open-source
interface for Spark data production
Interested in contributing?
Email spark@swoop.com
Join our open-source work.
Email spark@swoop.com
More Spark magic: http://bit.ly/spark-records