WIFI SSID:Spark+AISummit | Password: UnifiedDataAnalytics
Alfonso Roa, Habla Computing
Working with Complex
Types in DataFrames:
Optics to the Rescue
#UnifiedDataAnalytics #SparkAISummit
Who am I
Alfonso Roa
● Scala 👍
● Spark 👍
● Functional Programming 👍
● Open source (what I can) 👍
● Big data 👍
Where I work
info@hablapps.com
Agenda
(Live code session)
• The problem working with complex types
• How to solve it in a non-Spark world
• How to solve it in a Spark world
• …
• Profits
Notebook used
Spark optics
https://github.com/hablapps/sparkOptics
Binder
Complex types are complex
case class Street(number: Int, name: String)
case class Address(city: String, street: Street)
case class Company(name: String, address: Address)
case class Employee(name: String, company: Company)
Our example for the talk
val employee =
Employee("john",
Company("awesome inc",
Address("london",
Street(23, "high street")
)))
How we see it in DataFrames
import sparkSession.implicits._
val df = List(employee).toDF
df.show
df.printSchema
+----+--------------------+
|name| company|
+----+--------------------+
|john|[awesome inc, [lo...|
+----+--------------------+
root
|-- name: string (nullable = true)
|-- company: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- address: struct (nullable = true)
| | |-- city: string (nullable = true)
| | |-- street: struct (nullable = true)
| | | |-- number: integer (nullable = false)
| | | |-- name: string (nullable = true)
Changes in DF
val employeeNameChanged = df.select(
  concat(df("name"), lit("!!!")).as("name"),
  df("company")
)
employeeNameChanged.show
employeeNameChanged.printSchema
+-------+--------------------+
| name| company|
+-------+--------------------+
|john!!!|[awesome inc, [lo...|
+-------+--------------------+
root
|-- name: string (nullable = true)
|-- company: struct (nullable = true)
| ...
Changes in complex structs
val companyNameChanged = df.select(
df("name"),
struct(
concat(df("company.name"),lit("!!!")).as("name"),
df("company.address")
).as("company")
)
Even more complex structs
df.select(df("name"),struct(
df("company.name").as("name"),
struct(
df("company.address.city").as("city"),
struct(
df("company.address.street.number").as("number"),
upper(df("company.address.street.name")).as("name")
).as("street")
).as("address")
).as("company"))
How this is done with case classes
employee.copy(name = employee.name+"!!!")
employee.copy(company =
employee.company.copy(name =
employee.company.name+"!!!")
)
Employee(
"john!!!",
Company("awesome inc", Address("london", Street(23,
"high street")))
)
Employee(
"john",
Company("awesome inc!!!", Address("london",
Street(23, "high street")))
)
Immutability is hard
Very similar...
BUT WE HAVE OPTICS!
Monocle
Scala optics library
https://julien-truffaut.github.io/Monocle/
Lenses are used to focus on an element
import monocle.Lens
import monocle.macros.GenLens
val employeeName : Lens[Employee, String] = GenLens[Employee](_.name)
The context The element to focus on
Macro generator for the lens
Lenses are used to focus on an element
employeeName.get(employee)
returns "john"
val f: Employee => Employee =
employeeName.set("James")
f(employee)
Employee(
"James",
Company("awesome inc", Address("london",
Street(23, "high street")))
)
val f: Employee => Employee =
employeeName.modify(a => a + "!!!")
f(employee)
Employee(
"john!!!",
Company("awesome inc", Address("london",
Street(23, "high street")))
)
Optics can be composed
import monocle.Lens
import monocle.macros.GenLens
val company : Lens[Employee, Company] = GenLens[Employee](_.company)
val address : Lens[Company , Address] = GenLens[Company](_.address)
val street : Lens[Address , Street] = GenLens[Address](_.street)
val streetName: Lens[Street , String] = GenLens[Street](_.name)
val employeeStreet: Lens[Employee, String] = company composeLens address composeLens street composeLens streetName
They are composable
Functionality
val streetChanger:Employee => Employee = employeeStreet.modify(_ + "!!!")
streetChanger(employee)
Employee(
"john",
Company("awesome inc", Address("london", Street(23, "high street!!!")))
)
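What composeLens does can be demystified with a minimal, hand-rolled lens: a getter paired with an immutable setter, composed by chaining. This is only a sketch of the concept, not Monocle's actual implementation; the MiniLens name and method names below are ours.

```scala
// MiniLens: a hypothetical, minimal lens (not Monocle's implementation).
// get extracts the focused element; set rebuilds the whole structure via copy.
case class MiniLens[S, A](get: S => A, set: A => S => S) {
  // modify = get the focus, transform it, set it back
  def modify(f: A => A): S => S = s => set(f(get(s)))(s)
  // composition: focus with this lens first, then with `other`
  def andThenLens[B](other: MiniLens[A, B]): MiniLens[S, B] =
    MiniLens(
      s => other.get(get(s)),
      b => s => set(other.set(b)(get(s)))(s)
    )
}

case class Street(number: Int, name: String)
case class Address(city: String, street: Street)

val street: MiniLens[Address, Street] =
  MiniLens(_.street, st => a => a.copy(street = st))
val streetName: MiniLens[Street, String] =
  MiniLens(_.name, n => st => st.copy(name = n))
val addressStreetName: MiniLens[Address, String] =
  street.andThenLens(streetName)

val addr = Address("london", Street(23, "high street"))
println(addressStreetName.modify(_.toUpperCase)(addr))
// Address(london,Street(23,HIGH STREET))
```

Each composed setter only knows how to rebuild its own level, so the nested copy chain from the previous slide is assembled automatically.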
How lucky they are
So easy
Wish there was something like this for Spark DataFrames…
Spark optics!
https://github.com/hablapps/sparkOptics
Similar to typed optics
import org.hablapps.sparkOptics.Lens
import org.hablapps.sparkOptics.syntax._
val lens = Lens("name")(df.schema)
The context The element to focus on
Same methods, including modify
val lens = Lens("name")(df.schema)
val column: Column = lens.get(df)
val transformedDF = df.select(lens.modify(c => concat(c, lit("!!!"))):_*)
transformedDF.printSchema
transformedDF.as[Employee].head
(Column => Column) => Array[Column]
Same methods, including modify
root
|-- name: string (nullable = true)
|-- company: struct (nullable = false)
| |-- name: string (nullable = true)
| |-- address: struct (nullable = false)
| | |-- city: string (nullable = true)
| | |-- street: struct (nullable = false)
| | | |-- number: integer (nullable = true)
| | | |-- name: string (nullable = true)
Employee(
"john!!!",
Company("awesome inc", Address("london", Street(23, "high street")))
)
Creating the lenses
But not as easy as with typed optics to get the context of inner elements.
import org.apache.spark.sql.types.StructType
val companyL: Lens = Lens("company")(df.schema)
val companySchema = df.schema.fields.find(_.name == "company").get.dataType.asInstanceOf[StructType]
val addressL = Lens("address")(companySchema)
val addressSchema = companySchema.fields.find(_.name == "address").get.dataType.asInstanceOf[StructType]
val streetL = Lens("street")(addressSchema)
val streetSchema = addressSchema.fields.find(_.name == "street").get.dataType.asInstanceOf[StructType]
val streetNameL = Lens("name")(streetSchema)
Get the schema of the inner element
And again and again… 😔
Composable
But they are still composable
val employeeCompanyStreetName =
companyL composeLens addressL composeLens streetL composeLens streetNameL
val modifiedDF = df.select(employeeCompanyStreetName.set(lit("new street name")):_*)
modifiedDF.as[Employee].head
Employee(
"john",
Company("awesome inc", Address("london", Street(23, "new street name")))
)
Creating easier lenses
Introducing the ProtoLens, a lens without a context (yet)
val companyL: Lens = Lens("company")(df.schema)
val addressProtolens: ProtoLens = Lens("address")
val composedLens: Lens = companyL composeProtoLens addressProtolens
val composedLens: ProtoLens = Lens("a") composeProtoLens Lens("b")
Checks that the schema of companyL has the address element, or it will throw an error
No schema in any element?
Still a valid protolens
Sugar in composition
Similar syntax to Spark SQL
val sweetLens = Lens("company.address.street.name")(df.schema)
val sourLens = Lens("company")(df.schema) composeProtoLens
Lens("address") composeProtoLens
Lens("street") composeProtoLens
Lens("name")
Comparison
val flashLens = Lens("company.address.street.name")(df.schema)
val modifiedDF = df.select(flashLens.modify(upper):_*)
Much better than
val mDF = df.select(df("name"),struct(
df("company.name").as("name"),
struct(
df("company.address.city").as("city"),
struct(
df("company.address.street.number").as("number"),
upper(df("company.address.street.name")).as("name")
).as("street")
).as("address")
).as("company"))
And lens functions are reusable
Extra functionality
Schema changing functions
Prune
Deletes elements inside a struct
val flashLens = Lens("company.address.street.name")(df.schema)
df.select(flashLens.prune(Vector.empty):_*).printSchema
root
|-- name: string (nullable = true)
|-- company: struct (nullable = false)
| |-- name: string (nullable = true)
| |-- address: struct (nullable = false)
| | |-- city: string (nullable = true)
| | |-- street: struct (nullable = false)
| | | |-- number: integer (nullable = true)
Deleted: the focused company.address.street.name field no longer appears in the schema
Rename
Renames the focused element of a struct
val flashLens = Lens("company.address.street.name")(df.schema)
df.select(flashLens.rename("newName"):_*).printSchema
root
|-- name: string (nullable = true)
|-- company: struct (nullable = false)
| |-- name: string (nullable = true)
| |-- address: struct (nullable = false)
| | |-- city: string (nullable = true)
| | |-- street: struct (nullable = false)
| | | |-- number: integer (nullable = true)
| | | |-- newName: string (nullable = true)
Future Work
New types of optics (traversals)
Implement them with Spark's internal model rather than the public API (if it is worth it)
Compatibility with other APIs (Frameless)
Thanks for your interest
Links:
Monocle
https://julien-truffaut.github.io/Monocle/
Spark optics
https://github.com/hablapps/sparkOptics
Social networks
Habla Computing:
www.hablapps.com
@hablapps
Alfonso Roa
https://linkedin.com/in/roaalfonso
@saco_pepe
QUESTIONS?
Thanks for attending
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT