Streaming ETL
With Flink and Elasticsearch
Jared Stehler | @jstehler
About Me
● Principal Architect @ Edgenuity - Online Learning Software for K-12
○ Previously @ Intellify Learning
● I Live in Boston, MA
● I Work on Analytics Systems for Educational (EdTech) Software
Agenda
● Background
● ETL Approach
● Data Pipeline
● Building, Deploying, Running
Source Data
Education App
Sensor
Entity, Entity, Event, Event, Event, Event, Entity, Event, Event
Entities
● Course
● Assignment
● Student
● Enrollment
Events
● Login
● Navigation
● Assessment
● Outcome
Caliper Analytics Specification
● Entities
○ name
○ dateModified
○ Extensions
● Events
○ Shared Attributes
■ eventTime
■ Actor
■ Action
■ Object
○ Specific Types
■ Session
■ Annotation
■ Navigation
■ Assessment
http://www.imsglobal.org/caliper/caliperv1p0/ims-caliper-analytics-implementation-guide
Example Engagement Sequence
Actors: Student, LMS, Reader App, Quiz App
Basic Flow of Events
1. The student logs in to interact with a course. A SessionEvent is sent by the sensor with an action of logged in.
2. The student navigates to a page in the reading. A NavigationEvent is sent by the sensor with an action of navigated to.
3. The student adds a tag to the page. A TagAnnotation is generated. An AnnotationEvent is sent by the sensor with an action of tagged.
4. The student starts a quiz and generates an Attempt. An AssessmentEvent is sent by the sensor with an action of started.
5. The student starts question 1 and generates an Attempt. An AssessmentItemEvent is sent by the sensor with an action of started.
6. The student completes question 1 and generates a response. An AssessmentItemEvent is sent by the sensor with an action of completed.
7. Steps 5-6 repeat for all questions.
8. The student submits the quiz. An AssessmentEvent is sent by the sensor with an action of submitted.
9. The system grades the quiz and generates a Result. An OutcomeEvent is sent by the sensor with an action of graded.
10. The student logs out of the course. A SessionEvent is sent by the sensor with an action of logged out.
Diagram: Education App, Sensor, Elasticsearch, Web Reports, Tableau Workbooks
Problems with Initial Approach
● Flexible Input Schema, Non-Uniform Input Data
● Everything in One Index - Query Performance Problems
● Elasticsearch Doesn’t Do Joins!
ETL Solution
● Unified Set of Table Schemas - Avro
● Flink Job per Input Source - Map / Transform to Avro Schema
● Store Outputs in Index per Table - Elasticsearch
● Store Outputs in S3 as Parquet
ETL Join Example
{event} - attempt
- generated.actor.@id
- generated.score
- object.assignable.@id
{entity} - assignment
- @id
- courseSection.subOrgOf.@id
{entity} - enrollment
- member.@id
- organization.@id
- role
<<avro>>
- attemptId
- assignmentId
- courseId
- role
- score
ETL Join Example - Attempt Event Record
{
  "event": {
    "@type": "http://purl.imsglobal.org/caliper/v1/OutcomeEvent",
    "generated": {
      "actor": {
        "@id": "https://example.edu/users/554433"
      },
      "totalScore": 92.4
    },
    "object": {
      "assignable": {
        "@id": "https://example.edu/terms/201801/courses/7/sections/1/assign/2"
      }
    }
  }
}
ETL Join Example - Assignment Entity Record
{
  "entity": {
    "@type": "http://purl.imsglobal.org/caliper/v1/AssignableDigitalResource",
    "@id": "https://example.edu/terms/201801/courses/7/sections/1/assign/2",
    "extensions": {
      "courseSection": {
        "@id": "https://example.edu/terms/201801/courses/7/sections/1",
        "subOrganizationOf": {
          "@id": "https://example.edu/terms/201801/courses/7"
        }
      }
    }
  }
}
ETL Join Example - Enrollment Entity Record
{
  "entity": {
    "@type": "http://purl.imsglobal.org/caliper/v1/lis/Membership",
    "member": {
      "@id": "https://example.edu/users/554433",
      "type": "Person"
    },
    "organization": {
      "@id": "https://example.edu/terms/201801/courses/7/sections/1",
      "type": "CourseSection",
      "subOrganizationOf": {
        "@id": "https://example.edu/terms/201801/courses/7",
        "type": "CourseOffering"
      }
    },
    "roles": [ "Learner" ]
  }
}
Join Registry
Diagram, init: Mapper, register(), Join Registry
Diagram, runtime: Source, ETL Mapper (FlatMap), Joinable Entity Mapper (FlatMap), Transform, Co-Process Function
Join State
EntityJoin
- entityName
- sourceField
- targetField
- relations: (src, dst)[]
- sourceKey()
- targetKey()
JoinableKey
- entityName
- conditions: (key, val)[]
JoinableEntity
- joinableKeys[]
- record
Diagram: Join Registry, Entity Record, Joinable Entity
Transform State
EntityJoin
- entityName
- sourceField
- targetField
- relations: (src, dst)[]
- sourceKey()
- targetKey()
TransformContext
- joins[]
- mapFunction
- outputType: {APPEND|UPDATE}
TransformingRecord
- context
- record
Diagram: ETL Mapper (FlatMap), Transformable Record, Transforming Record
ETL Transforming Record Lifecycle
● Flat Map
○ Filter on Attributes
○ Unwind Values
● Fulfill Joins
○ Join Entity Data
● Map to Avro
○ Translate Attributes
○ Build Avro Object
ETL Flink Job
● Split on Record Type
○ {"type": "entity"}: Joinable Entity Mapper
○ {"type": "*"}: Transforming Record Mapper
● CoProcess Function
○ Joinable Entities: MapState<JoinableKey, JoinableEntity>
○ Pending Joins: MapState<JoinableKey, List<PendingJoin>>
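The two state maps on this slide suggest a join that buffers records until their entity arrives. The following is a minimal sketch of that idea in plain Java, not the talk's actual implementation: `HashMap` stands in for Flink `MapState`, a plain `String` stands in for `JoinableKey`, and the class name, method names, and the fixed `"activity"` target field are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java stand-in for the keyed join state; HashMaps replace Flink MapState. */
final class PendingJoinSketch {
    // Joinable entities seen so far, indexed by join key (assumption: key is a String).
    private final Map<String, Map<String, Object>> joinableEntities = new HashMap<>();
    // Records that arrived before their entity, parked under the same key.
    private final Map<String, List<Map<String, Object>>> pendingJoins = new HashMap<>();
    // Fully joined records, collected here instead of a Flink Collector.
    final List<Map<String, Object>> output = new ArrayList<>();

    /** Entity side: store the entity, then release any records that were waiting for it. */
    void onEntity(String key, Map<String, Object> entity) {
        joinableEntities.put(key, entity);
        List<Map<String, Object>> waiting = pendingJoins.remove(key);
        if (waiting != null) {
            for (Map<String, Object> record : waiting) {
                output.add(join(record, entity));
            }
        }
    }

    /** Record side: join immediately if the entity is known, otherwise park the record. */
    void onRecord(String key, Map<String, Object> record) {
        Map<String, Object> entity = joinableEntities.get(key);
        if (entity != null) {
            output.add(join(record, entity));
        } else {
            pendingJoins.computeIfAbsent(key, k -> new ArrayList<>()).add(record);
        }
    }

    /** Number of records still waiting for an entity. */
    int pendingCount() {
        return pendingJoins.values().stream().mapToInt(List::size).sum();
    }

    // Copies the record and attaches the entity under a target field (hypothetical name).
    private static Map<String, Object> join(Map<String, Object> record, Map<String, Object> entity) {
        Map<String, Object> joined = new HashMap<>(record);
        joined.put("activity", entity);
        return joined;
    }
}
```

Because either side may arrive first, the record side has to park unmatched records, and the entity side has to drain them. A production version would also need a policy for expiring pending joins that never resolve, which this sketch leaves out.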
ETL Job - Framework
● Main Class: extends Base ETL Class, Defines Source Topic
● Concrete Transformer Classes <AvroType>: extend RecordTransformer<T>
● Emit Avro Records
ETL Job - Main
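import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class LMSIngestEtlProgram extends IngestEtlProgram {
    public static void main(String[] args) throws Exception {
        // Hands control to the shared base class, which builds and runs the Flink job.
        IngestEtlProgram.etlFlinkMain(LMSIngestEtlProgram.class,
                EtlSourceType.LMS, args);
    }
}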
ETL Job - Record Transformer
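import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.util.Collector;
import org.springframework.stereotype.Component;

@Component
public class QuizSubmissionTransformer extends RecordTransformer<QuizSubmission> {
    public QuizSubmissionTransformer(EntityJoinRegistry joinRegistry) {
        super(joinRegistry, QuizSubmission.class);
    }

    // Declares the joins and the map function for this output type.
    @Override
    public void configure(Builder<QuizSubmission> context) { /* */ }

    // Selects the source records this transformer handles.
    @Override
    public void flatMap(SourceRecordElement value, Collector<TransformingRecord> out) throws Exception { /* */ }

    // Maps a fully joined record to the Avro output type.
    private static class Mapper implements MapFunction<SourceRecordElement, QuizSubmission> {
        @Override
        public QuizSubmission map(SourceRecordElement value) throws Exception { /* */ }
    }
}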
ETL Job - Transformer - FlatMap
@Override
public void flatMap(SourceRecordElement value, Collector<TransformingRecord> out)
        throws Exception {
    if ("http://purl.imsglobal.org/caliper/v1/OutcomeEvent".equals(
            stringProp(value.getRecord(), "event.@type"))) {
        out.collect(new TransformingRecord(getTransformContext(), value));
    }
}
ETL Job - Transformer - Configure
@Override
public void configure(Builder<QuizSubmission> context) {
    context.join("http://purl.imsglobal.org/caliper/v1/AssignableDigitalResource")
            .toField("activity")
            .on("entity.@id").eq("event.object.assignable.@id");
    context.join("http://purl.imsglobal.org/caliper/v1/lis/Membership")
            .toField("enrollment")
            .on("entity.member.@id").eq("event.generated.actor.@id")
            .and("entity.organization.@id").eq("activity.extensions.courseSection.subOrganizationOf.@id");
    context.map(new Mapper());
}
ETL Job - Transformer - Map Function
@Override
public QuizSubmission map(SourceRecordElement value) throws Exception {
    QuizSubmission.Builder val = QuizSubmission.newBuilder();
    ObjectNode rec = value.getRecord();
    val.setAction(stringProp(rec, "event.action"));
    val.setActivityId(stringProp(rec, "event.object.assignable.@id"));
    val.setActivityType(stringProp(rec, "activity.extensions.moduleType"));
    val.setAttemptId(stringProp(rec, "event.object.@id"));
    val.setCourseId(stringProp(rec, "activity.extensions.courseSection.subOrganizationOf.@id"));
    val.setDateAndTime(dateTimeProp(rec, "event.eventTime"));
    val.setRole(stringProp(rec, "enrollment.roles"));
    val.setScore(doubleProp(rec, "event.generated.totalScore", 0d));
    val.setStudentId(stringProp(rec, "event.generated.actor.@id"));
    return val.build();
}
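The map function leans on dotted-path helpers such as `stringProp` and `doubleProp`, whose bodies are not shown in the slides. As a rough sketch of how such a lookup could work, the version below walks nested `java.util.Map` values instead of a Jackson `ObjectNode` so that it stays self-contained. The class name `PropPaths` and the exact null and default handling are assumptions, not the talk's code.

```java
import java.util.Map;

final class PropPaths {
    /** Walks a dotted path such as "event.generated.actor.@id"; null if any segment is missing. */
    static Object prop(Map<String, Object> rec, String path) {
        Object current = rec;
        for (String segment : path.split("\\.")) {
            if (!(current instanceof Map)) {
                return null; // hit a leaf (or null) before the path was consumed
            }
            current = ((Map<?, ?>) current).get(segment);
        }
        return current;
    }

    /** String form of the value at the path, or null when the path does not resolve. */
    static String stringProp(Map<String, Object> rec, String path) {
        Object value = prop(rec, path);
        return value == null ? null : value.toString();
    }

    /** Numeric value at the path, or the default when it is missing or not a number. */
    static double doubleProp(Map<String, Object> rec, String path, double defaultValue) {
        Object value = prop(rec, path);
        return value instanceof Number ? ((Number) value).doubleValue() : defaultValue;
    }
}
```

Returning null or a default for a missing path, rather than throwing, suits the flexible input schema described earlier, where not every record carries every attribute.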
ETL - Versioned Namespaces
● ETL jobs <<multitenant>>
○ ETL: Source A: v1
○ ETL: Source A: v2
○ ETL: Source B: v1
○ ETL: Source C: v1
● elasticsearch indices <<customer>>
○ table_0: source_A: v1
○ table_0: source_A: v2
○ table_0: source_B: v1
○ table_0: source_C: v1
● Alias Management Service
○ table_0 <<customer>> <<alias>> over sources A, B, C
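One way to picture the alias switch when a new ETL version takes over is the request body for Elasticsearch's `POST /_aliases` endpoint, which applies its `remove` and `add` actions atomically. The sketch below only builds that body as a string. The index naming scheme and the class and method names are hypothetical, and the Alias Management Service from the slide may work differently.

```java
final class AliasSwapSketch {
    /** Hypothetical naming scheme: <customer>_<table>_<source>_v<version>. */
    static String indexName(String customer, String table, String source, int version) {
        return customer + "_" + table + "_" + source + "_v" + version;
    }

    /** Body for POST /_aliases that moves an alias from the old index to the new one. */
    static String swapBody(String alias, String oldIndex, String newIndex) {
        return "{\"actions\":["
            + "{\"remove\":{\"index\":\"" + oldIndex + "\",\"alias\":\"" + alias + "\"}},"
            + "{\"add\":{\"index\":\"" + newIndex + "\",\"alias\":\"" + alias + "\"}}"
            + "]}";
    }
}
```

Readers keep querying the alias throughout, so a v2 job can rebuild its index in the background and the switch becomes visible in a single step.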
Flink Bootstrapping Source
● S3 Data Lake: DataLake BootstrapSourceFunction, Range: {seedStart, seedEnd}
● Kafka Topic: FlinkKafkaConsumer, Starting Offsets: seedEnd
● Bootstrapping SourceFunction: wraps both sources
Flink Bootstrapping Source
@Override
public void run(SourceContext<T> ctx)
        throws Exception {
    while (running) {
        if (seeding.get()) {
            if (bootstrapSource != null) {
                bootstrapSource.run(ctx);
            }
            seeding.set(false);
            // yield momentarily to allow other parallel seeding tasks time to complete
            Thread.sleep(5_000);
        } else {
            streamSource.run(ctx);
        }
    }
}
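Stripped of Flink types, the run loop above reduces to "drain the bounded historical source, then hand over to the live one". The sketch below shows only that ordering in plain Java. The `Source` interface, the class name, and the list-backed sources are stand-ins invented for illustration, and the sleep and the `running` flag are omitted.

```java
import java.util.List;
import java.util.function.Consumer;

final class SeedThenStreamSketch {
    /** Hypothetical stand-in for a source: pushes records into a sink until it is done. */
    interface Source<T> {
        void run(Consumer<T> sink);
    }

    /** Emits everything from the bootstrap source first, then switches to the stream source. */
    static <T> void run(Source<T> bootstrap, Source<T> stream, Consumer<T> sink) {
        if (bootstrap != null) {
            bootstrap.run(sink); // historical data, seedStart through seedEnd
        }
        stream.run(sink);        // live data, starting at seedEnd
    }

    /** Bounded source over a fixed list, used here in place of S3 and Kafka. */
    static <T> Source<T> of(List<T> records) {
        return sink -> records.forEach(sink);
    }
}
```

The handover is only gap-free if the stream source starts at the point where the seed range ends, which is what the "Starting Offsets: seedEnd" label on the diagram indicates.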
Data Pipeline - Overview
Collection, Ingestion, ETL, Processing, Delivery
Data Pipeline - Collection
● Sources: LMS Product A, LMS Product B, E-Reader App, Third-Party App
● Collector Service: Push (HTTP), Pull (S3), Pull (REST API)
● Kafka Ingest Topic
Data Pipeline - Ingestion
● Kafka Ingest Topic
● Dedup Filter * (Redis)
● Elasticsearch Sink (ES)
● Dedup Unique Hash Sink *
● ETL Source Kafka Sink * (Kafka)
● Bucketing Sink (EFS)
*: 2-Phase Commit
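The dedup step can be pictured as "hash each record, keep it only if the hash is new". The sketch below is an assumption-laden illustration: an in-memory `Set` stands in for Redis, SHA-256 over the raw payload stands in for whatever the pipeline actually uses as its unique hash, and the two-phase commit coordination marked with * is not modeled.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

final class DedupSketch {
    // Stand-in for the shared store of seen hashes (Redis in the pipeline).
    private final Set<String> seen = new HashSet<>();

    /** Hex-encoded SHA-256 of the payload; the real hash inputs are an assumption. */
    static String uniqueHash(String payload) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] bytes = digest.digest(payload.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : bytes) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /** Returns true the first time a payload is seen and false for duplicates. */
    boolean accept(String payload) {
        return seen.add(uniqueHash(payload));
    }
}
```

In the pipeline, the filter and the hash sink are separate operators, so a hash is only recorded once the record has made it through the job. This sketch collapses both into one call.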
Data Pipeline - Ingestion (Raw Data Lake)
● EFS: buckets by ingest time (2018-09-05-16-30)
● Translate Ingest Time to Source Time
● S3: y=2018 / m=09 / d=05 / 14.avro
● Record: { "sourceTimestamp", "uniqueHash", "record" }
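Translating to source time amounts to deriving the partition path from each record's `sourceTimestamp` rather than from its arrival time. The sketch below assumes the timestamp is epoch milliseconds, that partitions are computed in UTC, and that the `y=`, `m=`, `d=` segments are nested directories joined with slashes. None of this is confirmed by the slides, and the class name is hypothetical.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

final class LakePathSketch {
    // Quoted literals keep "y=", "m=" and "d=" from being read as pattern letters.
    private static final DateTimeFormatter HOURLY =
        DateTimeFormatter.ofPattern("'y='yyyy'/m='MM'/d='dd'/'HH'.avro'", Locale.ROOT);

    /** Hourly partition path for a source timestamp given in epoch milliseconds (UTC). */
    static String pathFor(long sourceTimestampMillis) {
        ZonedDateTime utc = Instant.ofEpochMilli(sourceTimestampMillis).atZone(ZoneOffset.UTC);
        return HOURLY.format(utc);
    }
}
```

For example, a record whose source time is 2018-09-05T14:30:00Z maps to `y=2018/m=09/d=05/14.avro`, even if it was ingested during the 16:30 bucket.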
Data Pipeline - ETL
● Kafka ETL Source "A" Topic: ETL Source "A" Job
● Kafka ETL Source "B" Topic: ETL Source "B" Job
● Ingest Avro Schemas: Students, Courses, Assessments
Data Pipeline - “Processing”
● Sessionization Jobs: “Time on Task”, “Concurrent Active Sessions” Reports
● Daily / YTD Aggregations: Student, Course, School, District
● Future: CEP - Detect Cheating, etc
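A "Time on Task" figure is commonly derived by gap-based sessionization: consecutive events count toward active time unless the gap between them exceeds an inactivity threshold. The sketch below illustrates that general technique for one student's sorted event times. It is not the talk's job, and the method name and the threshold parameter are assumptions.

```java
import java.util.List;

final class TimeOnTaskSketch {
    /**
     * Sums the gaps between consecutive event times, skipping any gap longer than
     * maxGapMillis (treated as a session break). Expects timestamps sorted ascending.
     */
    static long timeOnTaskMillis(List<Long> sortedEventTimes, long maxGapMillis) {
        long total = 0;
        for (int i = 1; i < sortedEventTimes.size(); i++) {
            long gap = sortedEventTimes.get(i) - sortedEventTimes.get(i - 1);
            if (gap <= maxGapMillis) {
                total += gap;
            }
        }
        return total;
    }
}
```

In a streaming job the same logic would typically be expressed with session windows keyed by student, with the threshold playing the role of the session gap.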
Data Pipeline - Delivery
● Real-time reports created / delivered via Elasticsearch indices
● BI Analytics / exploration via SQL on S3 data in Parquet buckets by timestamp
○ Presto / Athena
○ Drill
Flink Deployment Model
● DataStream App
○ CI builds versioned artifacts
○ Deployable to environments via descriptor
● DataStream App Instance (1..*)
○ Produces output to namespaced sinks
○ Actions:
■ Suspend: cancel with savepoint
■ Resume: start from savepoint
■ Recover: start from last external checkpoint
■ Upgrade: update App version to currently deployed artifact
● DataStream App Instance Job (1..*)
○ Link between App Instance and Flink Job Instance
Flink Deployment Model
● SCM: CI Build produces App Artifact
● Deploy: App @ Version
● Start: App Instance
● flink run: Instance Job
DS App Deployment Descriptor
{
  "type": "datastream-app",
  "description": "ETL job for edgenuity-lms data sources",
  "multitenant": true,
  "versioned": true,
  "seedingEnabled": true,
  "seedingType": "datasource_type",
  "enabledSeedingTargets": [
    "edgenuity_lms"
  ]
}
Deploying
DS App Management UI
DS App - Start New Instance
DS App - Flink Job Config Params - Versioned
DS App - Flink Job Config Params - Singleton
Flink Pipeline Monitoring
● rate(write_records{unique_id="crusher-prime-job-system-system--unversioned", vertex_type="source"}[5m])
● max(flink_jobmanager_job_lastcheckpointduration{jobname=~"global-ingest-.*"}) / 1000.0
● sum(rate(flink_jobmanager_job_fullrestarts[10m])) BY (jobname) > 0
Thanks!
Jared Stehler | @jstehler

Flink Forward Berlin 2018: Jared Stehler - "Streaming ETL with Flink and Elasticsearch"
