Ibis:
Seamless Transition From
Pandas to Spark
Spark + AI Summit
2020
1
This document is being distributed for informational and educational purposes only and is not an offer to sell or the
solicitation of an offer to buy any securities or other instruments. The information contained herein is not intended to
provide, and should not be relied upon for, investment advice. The views expressed herein are not necessarily the
views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”). Such views reflect the
assumptions of the author(s) of the document and are subject to change without notice. The document may employ
data derived from third-party sources. No representation is made by Two Sigma as to the accuracy of such information
and the use of such information in no way implies an endorsement of the source of such information or its validity.
The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities
other than Two Sigma. If so, such copyrights and/or trademarks are most likely owned by the entity that created the
material and are used purely for identification and comment as fair use under international copyright and/or trademark
laws. Use of such image, copyright or trademark does not imply any association with such organization (or
endorsement of such organization) by Two Sigma, nor vice versa.
Legal Disclaimer
2
● If you…
○ like pandas but need to analyze large datasets
○ are interested in “distributed DataFrames” but don’t know which one to choose
○ want your analysis code to run faster or scale better without making code changes
Target Audience
3
● Modeling Tools @ Two Sigma
● Apache Spark, Pandas, Apache Arrow, Flint, Ibis
About Me: Li Jin
4
A common data science task...
5
df = pd.read_parquet(...)
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg({'feature2': {'max', 'min'}})
A common data science task...
6
df = pd.read_parquet(...)
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg({'feature2': {'max', 'min'}})
A common data science task...
7
df = pd.read_parquet(...)
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg({'feature2': {'max', 'min'}})
A common data science task...
8
df = pd.read_parquet(...)
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg({'feature2': {'max', 'min'}})
A common data science task...
9
● The software is not designed for large amounts of data
● Your machine doesn’t have enough RAM to hold all the data
● You are not utilizing all the CPUs
You are happy until... the code runs too slow
10
Try a few things...
11
● Use a bigger machine
● Pros:
○ Low human cost: no code change
● Cons:
○ Same software limits
○ Single threaded
○ Probably not fast enough
Try a few things...
12
● Use a generic way to distribute the code (see the sketch after this slide):
○ sparkContext.parallelize(range(2000, 2020)).map(compute_for_year).collect()
● Pros:
○ Medium human cost: small code change
○ Scalable
● Cons:
○ Works only for embarrassingly parallel problems
○ Boundary handling can be tricky
○ Distributed failures
Try a few things...
13
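To make the approach on the previous slide concrete, here is a minimal sketch of the parallelize/map/collect pattern. Only that pattern comes from the slide; the body of compute_for_year, the file path layout, and the column names are hypothetical.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def compute_for_year(year):
    # Hypothetical per-year task: each Spark task runs ordinary pandas code
    # on one year's worth of data, so the inner analysis logic is unchanged.
    df = pd.read_parquet(f"/data/my_table/year={year}")  # assumed path layout
    df['feature'] = (df['v1'] + df['v2']) / 2
    return year, df['feature'].mean()

# One task per year; the small per-year results are collected on the driver.
results = (
    spark.sparkContext
    .parallelize(range(2000, 2020))
    .map(compute_for_year)
    .collect()
)

This keeps the pandas code intact, but it only helps when the work splits cleanly along one dimension, which is exactly the limitation listed under Cons.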
Try a few things...
● Use a distributed dataframe library
○ Spark
○ Dask
○ Koalas
○ ...
● Pros:
○ Scalable
● Cons:
○ High human cost: learn another API
○ Not obvious which one to use
○ Distributed failures
14
Take a step back...
15
Take a step back...
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg({'feature2': {'max', 'min'}})
16
The problem
The problem is not how we express the computation, but
how we execute it.
17
Separation of Concerns
From Wikipedia:
“In computer science, separation of concerns (SoC) is a design
principle for separating a computer program into distinct sections
such that each section addresses a separate concern.”
18
Separation of expression and execution
Can we separate “how we express the computation”
(expression) and “how we execute it” (execution)?
19
Separation of expression and execution
● SQL is a way to express the computation independent of the
execution.
● Can we have something like SQL, but for Python Data Analysis?
20
Outline
● Ibis: A high level introduction
● Ibis: expression language
● Ibis: backend execution
● PySpark backend for Ibis
● Conclusion
21
Ibis: A high level
introduction
22
Ibis: Python Data Analysis Framework
● Open source
● Started in 2015 by Wes McKinney
● Worked on by top pandas committers:
○ Wes McKinney
○ Phillip Cloud
○ Jeff Reback
23
Ibis components
● ibis language
○ The API that is used to express the computation with ibis expressions
● ibis backends
○ Modules that translate ibis expressions to something that can be
executed by different computation engines
■ ibis.pandas
■ ibis.pyspark
■ ibis.bigquery
■ ibis.impala
■ ibis.omniscidb
■ ...
24
Ibis language
● Table API (a short sketch follows this slide)
○ Projection
○ Filtering
○ Join
○ Groupby
○ Sort
○ Window
○ Aggregation
○ …
○ AsofJoin
○ UDFs
● Ibis expressions (intermediate representation)
25
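To make the Table API list concrete, here is a small sketch of a few of the listed operations as Ibis expressions, using an unbound table with the schema used throughout this deck. Method names follow the Ibis releases current at the time of this talk (for example, sort_by).

import ibis

# An unbound Ibis table expression with the schema used in this deck.
table = ibis.table([('key', 'string'), ('v1', 'int64'), ('v2', 'int64')], name='foo')

filtered = table.filter(table['v1'] > 0)      # Filtering
projected = filtered[['key', 'v1', 'v2']]     # Projection
ordered = projected.sort_by('v1')             # Sort
agg = ordered.group_by('key').aggregate(      # Groupby + Aggregation
    v1_sum=ordered['v1'].sum()
)

All of these build expressions; nothing is executed until a backend is asked to run them.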
Ibis language
● Table API
○ table = table.mutate(v3=table['v1'] + table['v2'])
● Ibis expressions
26
Ibis backends
● Ibis expressions -> backend-specific expressions
● table.mutate(v3=table['v1'] + table['v2'])
○ Pandas: df = df.assign(v3=df['v1'] + df['v2'])
○ PySpark: df = df.withColumn('v3', df['v1'] + df['v2'])
27
Ibis: expression
language
28
Recall our earlier example in pandas
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg({'feature2': {'max', 'min'}})
29
Basic column selection and arithmetic expression
Pandas code
df['feature'] = (df['v1'] + df['v2']) / 2
30
Basic column selection and arithmetic expression
Ibis expression
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Pandas code
df['feature'] = (df['v1'] + df['v2']) / 2
31
Group-by and windowed aggregation
Ibis expression
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Pandas code
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
32
Group-by and windowed aggregation
Ibis expression
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
Pandas code
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
33
Composite aggregation
Ibis expression
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
Pandas code
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg(
{'feature2': {'max', 'min'}}
)
34
Composite aggregation
Ibis expression
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
Pandas code
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg(
{'feature2': {'max', 'min'}}
)
35
Final translation
Ibis expression
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
Pandas code
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg(
{'feature2': {'max', 'min'}}
)
36
Pandas code
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg(
{'feature2': {'max', 'min'}}
)
Ibis tables are expressions, not dataframes
Ibis expression
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
37
So far, table and the result of the
transformation on the left are
expressions, not actual dataframes.
Pandas code
df['feature'] = (df['v1'] + df['v2']) / 2
df['feature2'] = (
df.groupby('key')['feature']
.rolling(3, min_periods=1).mean()
.sort_index(level=1)
.reset_index(drop=True)
)
df.groupby('key').agg(
{'feature2': {'max', 'min'}}
)
Ibis tables are expressions, not dataframes
Ibis expression
table = table.mutate(
feature=(table['v1'] + table['v2'])/2
)
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
38
We need to execute these expressions
on real data in a backend.
Ibis: backend
execution
39
Initialize backend-specific Ibis client
PySpark
pyspark_client = ibis.pyspark.connect(pyspark_session)
Pandas
pandas_client = ibis.pandas.connect()
Impala
impala_client = ibis.impala.connect(host=host, port=port)
...
40
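For quick local experiments, the pandas backend can also be pointed at in-memory DataFrames. A minimal sketch, assuming the pandas backend of this era accepts a dictionary mapping table names to DataFrames (the sample data is made up):

import ibis
import pandas as pd

df = pd.DataFrame({
    'key': ['a', 'a', 'b'],
    'v1': [1, 2, 3],
    'v2': [4, 5, 6],
})

# The pandas backend is handed {table_name: DataFrame}.
pandas_client = ibis.pandas.connect({'foo': df})
foo = pandas_client.table('foo')
print(foo.mutate(feature=(foo['v1'] + foo['v2']) / 2).execute())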
Ibis expression
my_table = client.table('foo')
Access table in Ibis
41
Ibis expression
my_table = pyspark_client.table('foo')
Access table in Ibis
42
Ibis expression
my_table = pyspark_client.table('foo')
# my_table is an ibis table expression
print(my_table)
PySparkTable[table]
name: foo
schema:
key : string
v1 : int64
v2 : int64
Access table in Ibis
43
Ibis expression
my_table = pyspark_client.table('foo')
# execute materializes the table expression into a pandas DataFrame
df = my_table.execute()
df
Access table in Ibis
44
def transform(table: ibis.expr.types.TableExpr) -> ibis.expr.types.TableExpr:
    table = table.mutate(
        feature=(table['v1'] + table['v2'])/2
    )
    w = ibis.trailing_window(preceding=2, group_by='key', order_by='v1')
    table = table.mutate(
        feature2=table['feature'].mean().over(w)
    )
    table = table.group_by('key').aggregate([
        table['feature2'].min(),
        table['feature2'].max()
    ])
    return table
Recall our table transformation in Ibis
45
def transform(table) -> ibis.expr.types.TableExpr:
    table = table.mutate(
        feature=(table['v1'] + table['v2'])/2
    )
    w = ibis.trailing_window(preceding=2, group_by='key', order_by='v1')
    table = table.mutate(
        feature2=table['feature'].mean().over(w)
    )
    table = table.group_by('key').aggregate([
        table['feature2'].min(),
        table['feature2'].max()
    ])
    return table
result_table = transform(my_table)
Apply transform() on our Ibis table
46
def transform(table) -> ibis.expr.types.TableExpr:
    table = table.mutate(
        feature=(table['v1'] + table['v2'])/2
    )
    w = ibis.trailing_window(preceding=2, group_by='key', order_by='v1')
    table = table.mutate(
        feature2=table['feature'].mean().over(w)
    )
    table = table.group_by('key').aggregate([
        table['feature2'].min(),
        table['feature2'].max()
    ])
    return table
result_table = transform(my_table)
Apply transform() on our Ibis table
47
my_table and result_table are ibis table expressions, not dataframes.
def transform(table) -> ibis.expr.types.TableExpr:
    table = table.mutate(
        feature=(table['v1'] + table['v2'])/2
    )
    w = ibis.trailing_window(preceding=2, group_by='key', order_by='v1')
    table = table.mutate(
        feature2=table['feature'].mean().over(w)
    )
    table = table.group_by('key').aggregate([
        table['feature2'].min(),
        table['feature2'].max()
    ])
    return table
result_table = transform(my_table)
result_table.execute()
Execute the result on our backend
48
def transform(table) -> ibis.expr.types.TableExpr:
    table = table.mutate(
        feature=(table['v1'] + table['v2'])/2
    )
    w = ibis.trailing_window(preceding=2, group_by='key', order_by='v1')
    table = table.mutate(
        feature2=table['feature'].mean().over(w)
    )
    table = table.group_by('key').aggregate([
        table['feature2'].min(),
        table['feature2'].max()
    ])
    return table
result_table = transform(my_table)
result_table.execute()
Execute the result on our backend
49
pandas DataFrame
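Putting the pieces together, a hedged end-to-end sketch of the "seamless transition" idea: the transform() defined above runs unchanged against a pandas backend and a PySpark backend. It assumes the pandas backend accepts a dict of DataFrames and that the PySpark client resolves table names through the Spark session catalog (hence the temp view); the sample data is made up.

import ibis
import pandas as pd
from pyspark.sql import SparkSession

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'v1': [1, 2, 3], 'v2': [4, 5, 6]})

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame(df).createOrReplaceTempView('foo')  # register 'foo' for Spark

# The same expression-building function, two different execution backends.
pandas_result = transform(ibis.pandas.connect({'foo': df}).table('foo')).execute()
spark_result = transform(ibis.pyspark.connect(spark).table('foo')).execute()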
Ibis: PySpark
backend
50
Translate Ibis expressions into PySpark expressions
51
(Diagram: Ibis expression tree)
Basic column selection and arithmetic expression
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
52
Basic column selection and arithmetic expression
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
53
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Basic column selection and arithmetic expression
54
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
    op = expr.op()
    left = t.translate(op.left, scope)
    right = t.translate(op.right, scope)
    return left / right
Basic column selection and arithmetic expression
55
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
    op = expr.op()
    left = t.translate(op.left, scope)
    right = t.translate(op.right, scope)
    return left / right
t is a PySparkTranslator
Basic column selection and arithmetic expression
56
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
    op = expr.op()
    left = t.translate(op.left, scope)
    right = t.translate(op.right, scope)
    return left / right
t is a PySparkTranslator
It has a translate() method that
evaluates an Ibis expression into a
PySpark object.
Basic column selection and arithmetic expression
57
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
    op = expr.op()
    left = t.translate(op.left, scope)
    right = t.translate(op.right, scope)
    return left / right
expr is the ibis expression to
translate
Basic column selection and arithmetic expression
58
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
    op = expr.op()
    left = t.translate(op.left, scope)
    right = t.translate(op.right, scope)
    return left / right
scope is a dict that caches results
of previously translated Ibis
expressions.
Basic column selection and arithmetic expression
59
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
    op = expr.op()
    left = t.translate(op.left, scope)
    right = t.translate(op.right, scope)
    return left / right
We use the PySparkTranslator to
evaluate the left and right operands
(which are themselves ibis
expressions) into PySpark columns.
Basic column selection and arithmetic expression
60
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
    op = expr.op()
    left = t.translate(op.left, scope)
    right = t.translate(op.right, scope)
    return left / right
left and right are
PySpark columns
PySpark column division
Basic column selection and arithmetic expression
61
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Divide)
def compile_divide(t, expr, scope, **kwargs):
    op = expr.op()
    left = t.translate(op.left, scope)
    right = t.translate(op.right, scope)
    return left / right
Basic column selection and arithmetic expression
62
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Add)
def compile_add(t, expr, scope, **kwargs):
    op = expr.op()
    left = t.translate(op.left, scope)
    right = t.translate(op.right, scope)
    return left + right
PySpark column addition
Basic column selection and arithmetic expression
63
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.Add)
def compile_add(t, expr, scope, **kwargs):
    op = expr.op()
    left = t.translate(op.left, scope)
    right = t.translate(op.right, scope)
    return left + right
Basic column selection and arithmetic expression
64
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.TableColumn)
def compile_column(t, expr, scope, **kwargs):
    op = expr.op()
    column_name = op.name
    pyspark_df = t.translate(op.table, scope)
    return pyspark_df[column_name]
Basic column selection and arithmetic expression
65
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.TableColumn)
def compile_column(t, expr, scope, **kwargs):
    op = expr.op()
    column_name = op.name
    pyspark_df = t.translate(op.table, scope)
    return pyspark_df[column_name]
Basic column selection and arithmetic expression
66
Ibis expression
table.mutate(
feature=(table['v1'] + table['v2'])/2
)
Ibis translation code
@compiles(ops.TableColumn)
def compile_column(t, expr, scope, **kwargs):
    op = expr.op()
    column_name = op.name
    pyspark_df = t.translate(op.table, scope)
    return pyspark_df[column_name]
PySpark column selection
Basic column selection and arithmetic expression
67
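The @compiles functions above hint at how the PySpark backend is organized. The following is not the actual Ibis source, just a simplified sketch of the dispatch pattern it implies: a registry keyed by operation type, and a translate() method that looks up the matching compile function and caches results in scope.

# Simplified sketch of the dispatch behind @compiles (not the real Ibis code).
_registry = {}

def compiles(op_class):
    def decorator(fn):
        _registry[op_class] = fn
        return fn
    return decorator

class PySparkTranslator:
    def translate(self, expr, scope, **kwargs):
        # scope caches already-translated expressions so shared subtrees
        # of the expression tree are compiled only once.
        key = id(expr.op())
        if key in scope:
            return scope[key]
        compile_fn = _registry[type(expr.op())]
        result = compile_fn(self, expr, scope, **kwargs)
        scope[key] = result
        return result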
Group-by and windowed aggregation
Ibis expression
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
68
Group-by and windowed aggregation
Ibis expression
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
69
PySpark translation
Group-by and windowed aggregation
Ibis expression
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
70
PySpark translation
from pyspark.sql.window import Window
w = (
Window.partitionBy('key')
.orderBy('v1')
.rowsBetween(-2, Window.currentRow)
)
Group-by and windowed aggregation
Ibis expression
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
71
PySpark translation
from pyspark.sql.window import Window
import pyspark.sql.functions as F
w = (
Window.partitionBy('key')
.orderBy('v1')
.rowsBetween(-2, Window.currentRow)
)
F.mean(df['feature']).over(w)
Group-by and windowed aggregation
Ibis expression
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
72
PySpark translation
from pyspark.sql.window import Window
import pyspark.sql.functions as F
w = (
Window.partitionBy('key')
.orderBy('v1')
.rowsBetween(-2, Window.currentRow)
)
df = df.withColumn(
'feature2', F.mean(df['feature']).over(w)
)
Group-by and windowed aggregation
Ibis expression
w = ibis.trailing_window(preceding=2,
group_by='key',
order_by='v1')
table = table.mutate(
feature2=table['feature'].mean().over(w)
)
73
PySpark translation
from pyspark.sql.window import Window
import pyspark.sql.functions as F
w = (
Window.partitionBy('key')
.orderBy('v1')
.rowsBetween(-2, Window.currentRow)
)
df = df.withColumn(
'feature2', F.mean(df['feature']).over(w)
)
Composite aggregation
Ibis expression
table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
74
Composite aggregation
Ibis expression
table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
75
PySpark translation
Composite aggregation
Ibis expression
table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
76
PySpark translation
# PySpark column expressions
F.min(df['feature2'])
F.max(df['feature2'])
Composite aggregation
Ibis expression
table.group_by('key').aggregate([
table['feature2'].min(),
table['feature2'].max()
])
77
PySpark translation
df.groupby('key').agg(
F.min(df['feature2']),
F.max(df['feature2'])
)
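Stitched together, the PySpark translations above correspond roughly to the following self-contained script. The toy data is an assumption, and this is what the compiled computation amounts to, not code that Ibis emits verbatim.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('a', 1, 4), ('a', 2, 5), ('b', 3, 6)],
    ['key', 'v1', 'v2'],
)

# feature = (v1 + v2) / 2
df = df.withColumn('feature', (df['v1'] + df['v2']) / 2)

# Trailing window of 3 rows (2 preceding + current) per key, ordered by v1.
w = (
    Window.partitionBy('key')
    .orderBy('v1')
    .rowsBetween(-2, Window.currentRow)
)
df = df.withColumn('feature2', F.mean(df['feature']).over(w))

# min/max of feature2 per key
result = df.groupBy('key').agg(
    F.min(df['feature2']).alias('min_feature2'),
    F.max(df['feature2']).alias('max_feature2'),
)
result.show()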
More interesting examples
Ibis expression
table['v1'].rank().over(window)
78
PySpark translation
More interesting examples
Ibis expression
table['v1'].rank().over(window)
79
PySpark translation
import pyspark.sql.functions as F
F.rank().over(window).astype('long') - F.lit(1)
More interesting examples
Ibis expression
table['v1'].rank().over(window)
80
PySpark translation
import pyspark.sql.functions as F
F.rank().over(window).astype('long') - F.lit(1)
Subtle differences: Ibis rank() is zero-based and typed int64, while PySpark's rank() is one-based, so the translation casts to long and subtracts one.
More interesting examples
Ibis expression
table['boolean_col'].not_any()
81
PySpark translation
More interesting examples
Ibis expression
table['boolean_col'].not_any()
82
PySpark translation
~F.max(df['boolean_col'])
More interesting examples
Ibis expression
table['boolean_col'].not_any()
83
PySpark translation
~F.max(df['boolean_col'])
No direct PySpark equivalent: the backend composes one from existing functions.
Conclusion
84
Conclusion
● Separate expression and execution
● Don’t limit yourself to what you can use today; think about what you can use in the future
○ Arrow Dataset
○ Modin
○ cudf / dask-cudf
○ ...
85
Thanks!
ice.xelloss@gmail.com (@icexelloss)
86