Flink Forward Berlin 2018: Jared Stehler - "Streaming ETL with Flink and Elasticsearch"

Streaming ETL
With Flink and Elasticsearch
Jared Stehler | @jstehler

About Me
● Principal Architect @ Edgenuity - Online Learning Software for K-12
○ Previously @ Intellify Learning
● I Live in Boston, MA
● I Work on Analytics Systems for Educational (EdTech) Software

Agenda
● Background
● ETL Approach
● Data Pipeline
● Building, Deploying, Running

Source Data
[Diagram: Education App / Sensor emitting an interleaved stream of Entity and Event records]
Entities
● Course
● Assignment
● Student
● Enrollment
Events
● Login
● Navigation
● Assessment
● Outcome

Caliper Analytics Specification
● Entities
○ name
○ dateModified
○ Extensions
● Events
○ Shared Attributes
■ eventTime
■ Actor
■ Action
■ Object
○ Specific Types
■ Session
■ Annotation
■ Navigation
■ Assessment
http://www.imsglobal.org/caliper/caliperv1p0/ims-caliper-analytics-implementation-guide

Example Engagement Sequence
Actors: Student, LMS, Reader App, Quiz App
Basic Flow of Events:
1. The student logs in to interact with a course. A SessionEvent is sent by the sensor with an action of logged in.
2. The student navigates to a page in the reading. A NavigationEvent is sent by the sensor with an action of navigated to.
3. The student adds a tag to the page, generating a TagAnnotation. An AnnotationEvent is sent by the sensor with an action of tagged.
4. The student starts a quiz, generating an Attempt. An AssessmentEvent is sent by the sensor with an action of started.
5. The student starts question 1, generating an Attempt. An AssessmentItemEvent is sent by the sensor with an action of started.
6. The student completes question 1, generating a response. An AssessmentItemEvent is sent by the sensor with an action of completed.
7. Steps 5-6 repeat for all questions.
8. The student submits the quiz. An AssessmentEvent is sent by the sensor with an action of submitted.
9. The system grades the quiz and generates a Result. An OutcomeEvent is sent by the sensor with an action of graded.
10. The student logs out of the course. A SessionEvent is sent by the sensor with an action of logged out.

[Diagram: Education App / Sensor → Elasticsearch → Web Reports, Tableau Workbooks]

Problems with Initial Approach
● Flexible Input Schema, Non-Uniform Input Data
● Everything in One Index - Query Performance Problems
● Elasticsearch Doesn’t Do Joins!

ETL Solution
● Unified Set of Table Schemas - Avro
● Flink Job per Input Source - Map / Transform to Avro Schema
● Store Outputs in Index per Table - Elasticsearch
● Store Outputs in S3 as Parquet

ETL Join Example
{event} - attempt
- generated.actor.@id
- generated.score
- object.assignable.@id
{entity} - assignment
- @id
- courseSection.subOrgOf.@id
{entity} - enrollment
- member.@id
- organization.@id
- role
<<avro>>
- attemptId
- assignmentId
- courseId
- role
- score

ETL Join Example - Attempt Event Record
{
  "event": {
    "@type": "http://purl.imsglobal.org/caliper/v1/OutcomeEvent",
    "generated": {
      "actor": {
        "@id": "https://example.edu/users/554433"
      },
      "totalScore": 92.4
    },
    "object": {
      "assignable": {
        "@id": "https://example.edu/terms/201801/courses/7/sections/1/assign/2"
      }
    }
  }
}

ETL Join Example - Assignment Entity Record
{
  "entity": {
    "@type": "http://purl.imsglobal.org/caliper/v1/AssignableDigitalResource",
    "@id": "https://example.edu/terms/201801/courses/7/sections/1/assign/2",
    "extensions": {
      "courseSection": {
        "@id": "https://example.edu/terms/201801/courses/7/sections/1",
        "subOrganizationOf": {
          "@id": "https://example.edu/terms/201801/courses/7"
        }
      }
    }
  }
}

ETL Join Example - Enrollment Entity Record
{
  "entity": {
    "@type": "http://purl.imsglobal.org/caliper/v1/lis/Membership",
    "member": {
      "@id": "https://example.edu/users/554433",
      "type": "Person"
    },
    "organization": {
      "@id": "https://example.edu/terms/201801/courses/7/sections/1",
      "type": "CourseSection",
      "subOrganizationOf": {
        "@id": "https://example.edu/terms/201801/courses/7",
        "type": "CourseOffering"
      }
    },
    "roles": [ "Learner" ]
  }
}
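
The join across the three records above can be walked through in plain Java. This is only a sketch under assumptions: nested maps stand in for the JSON records, `prop` is a hypothetical dotted-path helper (not the talk's `stringProp`), only the student and assignment conditions are checked, and the output field names follow the <<avro>> box on the join slide.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class JoinExampleSketch {
    // Walks a dotted path such as "event.object.assignable.@id" through nested maps;
    // returns null when any step is missing.
    static Object prop(Map<String, Object> rec, String path) {
        Object cur = rec;
        for (String key : path.split("\\.")) {
            if (!(cur instanceof Map)) {
                return null;
            }
            cur = ((Map<?, ?>) cur).get(key);
        }
        return cur;
    }

    // Trimmed copies of the three sample records (only the joined fields).
    static Map<String, Object> attempt() {
        return Map.of("event", Map.of(
            "generated", Map.of(
                "actor", Map.of("@id", "https://example.edu/users/554433"),
                "totalScore", 92.4),
            "object", Map.of("assignable",
                Map.of("@id", "https://example.edu/terms/201801/courses/7/sections/1/assign/2"))));
    }

    static Map<String, Object> assignment() {
        return Map.of("entity", Map.of(
            "@id", "https://example.edu/terms/201801/courses/7/sections/1/assign/2",
            "extensions", Map.of("courseSection", Map.of(
                "@id", "https://example.edu/terms/201801/courses/7/sections/1",
                "subOrganizationOf", Map.of("@id", "https://example.edu/terms/201801/courses/7")))));
    }

    static Map<String, Object> enrollment() {
        return Map.of("entity", Map.of(
            "member", Map.of("@id", "https://example.edu/users/554433"),
            "roles", List.of("Learner")));
    }

    // A join condition holds only when both sides are present and equal.
    private static boolean matches(Object left, Object right) {
        return left != null && left.equals(right);
    }

    // Produces one flat row when both join conditions hold, otherwise empty.
    static Optional<Map<String, Object>> flatten(Map<String, Object> attempt,
            Map<String, Object> assignment, Map<String, Object> enrollment) {
        boolean sameAssignment = matches(
            prop(assignment, "entity.@id"), prop(attempt, "event.object.assignable.@id"));
        boolean sameStudent = matches(
            prop(enrollment, "entity.member.@id"), prop(attempt, "event.generated.actor.@id"));
        if (!sameAssignment || !sameStudent) {
            return Optional.empty();
        }
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("assignmentId", prop(assignment, "entity.@id"));
        row.put("courseId", prop(assignment, "entity.extensions.courseSection.subOrganizationOf.@id"));
        row.put("role", prop(enrollment, "entity.roles"));
        row.put("score", prop(attempt, "event.generated.totalScore"));
        return Optional.of(row);
    }

    public static void main(String[] args) {
        System.out.println(flatten(attempt(), assignment(), enrollment()));
    }
}
```

The sample attempt record carries no attempt id, so the sketch leaves `attemptId` out of the row rather than inventing one.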

Join Registry
init
● Mapper → register() → Join Registry
runtime
● Source → Transform → Co-Process Function
● Transform: ETL Mapper, Joinable Entity FlatMap

Join State
EntityJoin
- entityName
- sourceField
- targetField
- relations: (src, dst)[]
- sourceKey()
- targetKey()
JoinableKey
- entityName
- conditions: (key, val)[]
JoinableEntity
- joinableKeys[]
- record
Entity Record → Join Registry → Joinable Entity
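
Because JoinableKey is later used as a map-state key, it needs value equality. The class below is a hypothetical sketch of that idea, assuming the conditions are simple string pairs; the talk does not show the real class, so every name here is illustrative.

```java
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

final class JoinableKeySketch {
    private final String entityName;
    // Map equality already ignores insertion order; the TreeMap additionally
    // gives a stable toString for logging.
    private final TreeMap<String, String> conditions;

    JoinableKeySketch(String entityName, Map<String, String> conditions) {
        this.entityName = Objects.requireNonNull(entityName);
        this.conditions = new TreeMap<>(conditions);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof JoinableKeySketch)) {
            return false;
        }
        JoinableKeySketch other = (JoinableKeySketch) o;
        return entityName.equals(other.entityName) && conditions.equals(other.conditions);
    }

    @Override
    public int hashCode() {
        return Objects.hash(entityName, conditions);
    }

    @Override
    public String toString() {
        return entityName + conditions;
    }
}
```

Two keys built from the same entity name and the same (key, val) pairs compare equal regardless of the order in which the pairs were supplied, which is what a keyed lookup of a Joinable Entity would rely on.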

Transform State
EntityJoin
- entityName
- sourceField
- targetField
- relations: (src, dst)[]
- sourceKey()
- targetKey()
TransformContext
- joins[]
- mapFunction
- outputType: {APPEND|UPDATE}
TransformingRecord
- context
- record
Transformable Record → ETL Mapper FlatMap → Transforming Record

ETL Transforming Record Lifecycle
● Flat Map
○ Filter on Attributes
○ Unwind Values
● Fulfill Joins
○ Join Entity Data
● Map to Avro
○ Translate Attributes
○ Build Avro Object

ETL Flink Job
Split on Record Type
● {"type": "entity"} → Joinable Entity Mapper
● {"type": "*"} → Transforming Record Mapper → Transforming Record
CoProcess Function
● Joinable Entities: MapState<JoinableKey, JoinableEntity>
● Pending Joins: MapState<JoinableKey, List<PendingJoin>>
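
The two pieces of state suggest a buffer-until-fulfilled join. The talk does not show the CoProcess function body, so the following is a plain-Java sketch of the assumed behaviour, with two HashMaps standing in for Flink MapState and strings standing in for keys, entities and records.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PendingJoinSketch {
    // Stand-in for MapState<JoinableKey, JoinableEntity>.
    private final Map<String, String> joinableEntities = new HashMap<>();
    // Stand-in for MapState<JoinableKey, List<PendingJoin>>.
    private final Map<String, List<String>> pendingJoins = new HashMap<>();
    // Stand-in for the downstream Collector.
    private final List<String> output = new ArrayList<>();

    // Entity side: remember the entity, then release every record waiting on its key.
    void onEntity(String key, String entity) {
        joinableEntities.put(key, entity);
        List<String> waiting = pendingJoins.remove(key);
        if (waiting != null) {
            for (String record : waiting) {
                output.add(record + "+" + entity);
            }
        }
    }

    // Record side: join immediately if the entity is known, otherwise park the record.
    void onRecord(String key, String record) {
        String entity = joinableEntities.get(key);
        if (entity != null) {
            output.add(record + "+" + entity);
        } else {
            pendingJoins.computeIfAbsent(key, k -> new ArrayList<>()).add(record);
        }
    }

    List<String> output() {
        return output;
    }

    public static void main(String[] args) {
        PendingJoinSketch join = new PendingJoinSketch();
        join.onRecord("assign/2", "attempt-1");  // arrives before its entity, so it waits
        join.onEntity("assign/2", "assignment"); // releases attempt-1
        join.onRecord("assign/2", "attempt-2");  // joins immediately
        System.out.println(join.output());
    }
}
```

Parking records this way makes the result independent of whether the entity or the event reaches the operator first; a real job would also need a policy for pending joins that never get fulfilled.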

ETL Job - Framework
● Main Class
○ Base ETL Class
○ Defines Source Topic
● RecordTransformer<T>
○ Concrete Transformer Class <AvroType> (shown three times)
● Emit Avro Records

ETL Job - Main
@SpringBootApplication
public class LMSIngestEtlProgram extends IngestEtlProgram {
    public static void main(String[] args) throws Exception {
        IngestEtlProgram.etlFlinkMain(LMSIngestEtlProgram.class,
            EtlSourceType.LMS, args);
    }
}

ETL Job - Record Transformer
@Component
public class QuizSubmissionTransformer extends RecordTransformer<QuizSubmission> {
    public QuizSubmissionTransformer(EntityJoinRegistry joinRegistry) {
        super(joinRegistry, QuizSubmission.class);
    }
    @Override
    public void configure(Builder<QuizSubmission> context) { /* */ }
    @Override
    public void flatMap(SourceRecordElement value, Collector<TransformingRecord> out) throws Exception { /* */ }
    private static class Mapper implements MapFunction<SourceRecordElement, QuizSubmission> {
        @Override
        public QuizSubmission map(SourceRecordElement value) throws Exception { /* */ }
    }
}

ETL Job - Transformer - FlatMap
@Override
public void flatMap(SourceRecordElement value, Collector<TransformingRecord> out)
        throws Exception {
    if ("http://purl.imsglobal.org/caliper/v1/OutcomeEvent".equals(
            stringProp(value.getRecord(), "event.@type"))) {
        out.collect(new TransformingRecord(getTransformContext(), value));
    }
}

ETL Job - Transformer - Configure
@Override
public void configure(Builder<QuizSubmission> context) {
    context.join("http://purl.imsglobal.org/caliper/v1/AssignableDigitalResource")
        .toField("activity")
        .on("entity.@id").eq("event.object.assignable.@id");
    context.join("http://purl.imsglobal.org/caliper/v1/lis/Membership")
        .toField("enrollment")
        .on("entity.member.@id").eq("event.generated.actor.@id")
        .and("entity.organization.@id").eq("activity.extensions.courseSection.subOrganizationOf.@id");
    context.map(new Mapper());
}

ETL Job - Transformer - Map Function
@Override
public QuizSubmission map(SourceRecordElement value) throws Exception {
    QuizSubmission.Builder val = QuizSubmission.newBuilder();
    ObjectNode rec = value.getRecord();
    val.setAction(stringProp(rec, "event.action"));
    val.setActivityId(stringProp(rec, "event.object.assignable.@id"));
    val.setActivityType(stringProp(rec, "activity.extensions.moduleType"));
    val.setAttemptId(stringProp(rec, "event.object.@id"));
    val.setCourseId(stringProp(rec, "activity.extensions.courseSection.subOrganizationOf.@id"));
    val.setDateAndTime(dateTimeProp(rec, "event.eventTime"));
    val.setRole(stringProp(rec, "enrollment.roles"));
    val.setScore(doubleProp(rec, "event.generated.totalScore", 0d));
    val.setStudentId(stringProp(rec, "event.generated.actor.@id"));
    return val.build();
}

ETL - Versioned Namespaces
ETL jobs <<multitenant>>
● ETL: Source A: v1
● ETL: Source A: v2
● ETL: Source B: v1
● ETL: Source C: v1
elasticsearch indices <<customer>>
● table_0: source_A: v1
● table_0: source_A: v2
● table_0: source_B: v1
● table_0: source_C: v1
Alias <<customer>> <<alias>>
● table_0 (sources A, B, C)
Alias Management Service
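
Moving a customer alias from a v1 index to a v2 index can be done in one request against Elasticsearch's `_aliases` endpoint, which applies its `remove` and `add` actions atomically. The sketch below only builds the request body; the index naming scheme and the helper names are assumptions for illustration, since the talk does not show how the Alias Management Service works.

```java
import java.util.Locale;

final class AliasSwapSketch {
    // Hypothetical naming scheme: <customer>_<table>_<source>_v<version>.
    // Elasticsearch index names must be lowercase, hence the normalisation.
    static String indexName(String customer, String table, String source, int version) {
        return String.format("%s_%s_%s_v%d", customer, table, source, version)
            .toLowerCase(Locale.ROOT);
    }

    // JSON body for POST /_aliases that detaches the alias from the old index
    // and attaches it to the new one in a single atomic operation.
    static String swapBody(String alias, String oldIndex, String newIndex) {
        return "{\"actions\":["
            + "{\"remove\":{\"index\":\"" + oldIndex + "\",\"alias\":\"" + alias + "\"}},"
            + "{\"add\":{\"index\":\"" + newIndex + "\",\"alias\":\"" + alias + "\"}}"
            + "]}";
    }

    public static void main(String[] args) {
        String v1 = indexName("acme", "table_0", "source_A", 1);
        String v2 = indexName("acme", "table_0", "source_A", 2);
        System.out.println(swapBody("acme_table_0", v1, v2));
    }
}
```

Because only the source_A index is swapped, indices for sources B and C can stay attached to the same alias, so reports that query the alias keep seeing one logical table.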

Flink Bootstrapping Source
BootstrappingSourceFunction
● S3 Data Lake → DataLakeBootstrapSourceFunction
○ Range: {seedStart, seedEnd}
● Kafka Topic → FlinkKafkaConsumer
○ Starting Offsets: seedEnd

Flink Bootstrapping Source
@Override
public void run(SourceContext<T> ctx)
        throws Exception {
    while (running) {
        if (seeding.get()) {
            if (bootstrapSource != null) {
                bootstrapSource.run(ctx);
            }
            seeding.set(false);
            // yield momentarily to allow other parallel seeding tasks time to complete
            Thread.sleep(5_000);
        } else {
            streamSource.run(ctx);
        }
    }
}
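
The handoff between the two sources can be illustrated without Flink. This sketch assumes that seedStart and seedEnd are timestamps, that the seed range is half-open, and that the stream starts exactly at seedEnd; none of these details are confirmed by the slides.

```java
import java.util.ArrayList;
import java.util.List;

final class BootstrapThenStreamSketch {
    // Emits data-lake records in [seedStart, seedEnd), then topic records from seedEnd on.
    // Records are represented by their timestamps only.
    static List<Long> replay(List<Long> dataLake, List<Long> topic, long seedStart, long seedEnd) {
        List<Long> out = new ArrayList<>();
        // Seeding phase: bounded read of the historical range.
        for (long ts : dataLake) {
            if (ts >= seedStart && ts < seedEnd) {
                out.add(ts);
            }
        }
        // Streaming phase: the live source starts where seeding stopped.
        for (long ts : topic) {
            if (ts >= seedEnd) {
                out.add(ts);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(replay(List.of(1L, 2L, 3L, 4L), List.of(3L, 4L, 5L, 6L), 2L, 4L));
    }
}
```

With a half-open seed range, a record that exists in both the lake and the topic is emitted exactly once, which is the property the seedEnd starting offset appears to be aiming for.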

Data Pipeline - Overview
● Collection
● Ingestion
● ETL
● Processing
● Delivery

Data Pipeline - Collection
● Sources: LMS Product A, LMS Product B, E-Reader App, Third-Party App
● Collector Service
○ Push (HTTP)
○ Pull (S3)
○ Pull (REST API)
● Kafka Ingest Topic

Data Pipeline - Ingestion
● Source: Kafka Ingest Topic
● Dedup Filter *
● Sinks
○ Elasticsearch Sink
○ Dedup Unique Hash Sink *
○ ETL Source Kafka Sink *
○ Bucketing Sink
● Stores: Redis, ES, EFS, Kafka
*: 2-Phase Commit
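
A dedup filter keyed on a unique hash can be sketched as follows. The talk does not say how the hash is computed, so SHA-256 over the raw record is an assumption, and an in-memory Set stands in for Redis; the two-phase commit coordination is not modelled.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

final class DedupFilterSketch {
    // Stand-in for the Redis set of hashes that have already been seen.
    private final Set<String> seen = new HashSet<>();

    // Hex-encoded SHA-256 of the raw record text (assumed hashing scheme).
    static String uniqueHash(String rawRecord) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(rawRecord.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Returns true the first time a record is seen and false for every repeat.
    boolean accept(String rawRecord) {
        return seen.add(uniqueHash(rawRecord));
    }

    public static void main(String[] args) {
        DedupFilterSketch filter = new DedupFilterSketch();
        System.out.println(filter.accept("{\"event\":1}")); // first sighting
        System.out.println(filter.accept("{\"event\":1}")); // duplicate
    }
}
```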

Data Pipeline - Ingestion (Raw Data Lake)
● EFS: buckets by ingest time, e.g. 2018-09-05-16-30
● Translate Ingest Time to Source Time
● S3: y=2018/m=09/d=05/14.avro
● Record: { "sourceTimestamp", "uniqueHash", "record" }
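
Deriving the S3 object path from a record's source timestamp might look like the sketch below. It assumes that the file name `14.avro` is the hour of day and that partitions are computed in UTC; the slide states neither.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

final class LakePathSketch {
    // Quoted segments are literals; yyyy, MM, dd and HH are filled from the timestamp.
    private static final DateTimeFormatter PATH_FORMAT =
        DateTimeFormatter.ofPattern("'y='yyyy'/m='MM'/d='dd'/'HH'.avro'")
            .withZone(ZoneOffset.UTC);

    // Maps a source timestamp to a partitioned path such as y=2018/m=09/d=05/14.avro.
    static String pathFor(Instant sourceTimestamp) {
        return PATH_FORMAT.format(sourceTimestamp);
    }

    public static void main(String[] args) {
        System.out.println(pathFor(Instant.parse("2018-09-05T14:30:00Z")));
    }
}
```

Partitioning on source time rather than ingest time means late-arriving records land next to the other records from the same period, which suits time-bounded SQL queries.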

Data Pipeline - ETL
● Kafka ETL Source "A" Topic → ETL Source "A" Job
● Kafka ETL Source "B" Topic → ETL Source "B" Job
● ingest Avro Schemas
● Outputs: Students, Courses, Assessments

Data Pipeline - “Processing”
● Sessionization Jobs: “Time on Task”, “Concurrent Active Sessions” Reports
● Daily / YTD Aggregations: Student, Course, School, District
● Future: CEP - Detect Cheating, etc
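
A "Time on Task" figure is typically derived by sessionizing a student's events on an inactivity gap. The talk does not describe the job's actual logic, so this is a minimal sketch of gap-based sessionization, with the timeout as an assumed parameter; in Flink the same idea would presumably be expressed with session windows.

```java
import java.util.List;

final class TimeOnTaskSketch {
    // Sums the time between consecutive events, skipping any gap longer than
    // maxGapSeconds because such a gap ends one session and starts the next.
    // Timestamps are epoch seconds and must be sorted ascending.
    static long timeOnTaskSeconds(List<Long> sortedEpochSeconds, long maxGapSeconds) {
        long total = 0;
        for (int i = 1; i < sortedEpochSeconds.size(); i++) {
            long gap = sortedEpochSeconds.get(i) - sortedEpochSeconds.get(i - 1);
            if (gap <= maxGapSeconds) {
                total += gap;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Two sessions: 0..120 and 4000..4030, separated by a gap above 30 minutes.
        System.out.println(timeOnTaskSeconds(List.of(0L, 60L, 120L, 4000L, 4030L), 1800L));
    }
}
```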

Data Pipeline - Delivery
● Real-time reports created / delivered via Elasticsearch indices
● BI Analytics / exploration via SQL on S3 data in Parquet buckets by timestamp
○ Presto / Athena
○ Drill

Flink Deployment Model
● DataStream App
○ CI builds versioned artifacts
○ Deployable to environments via descriptor
● DataStream App Instance (1..*)
○ Produces output to namespaced sinks
○ Actions:
■ Suspend: cancel with savepoint
■ Resume: start from savepoint
■ Recover: start from last external checkpoint
■ Upgrade: update App version to currently deployed artifact
● DataStream App Instance Job (1..*)
○ Link between App Instance and Flink Job Instance

Flink Deployment Model
SCM → (CI Build) → App Artifact → (Deploy) → App @ Version → App Instance → (Start) → Instance Job → (flink run)

DS App Deployment Descriptor
{
  "type": "datastream-app",
  "description": "ETL job for edgenuity-lms data sources",
  "multitenant": true,
  "versioned": true,
  "seedingEnabled": true,
  "seedingType": "datasource_type",
  "enabledSeedingTargets": [
    "edgenuity_lms"
  ]
}

DS App - Flink Job Config Params - Versioned

DS App - Flink Job Config Params - Singleton

Flink Pipeline Monitoring
● rate(write_records{unique_id="crusher-prime-job-system-system--unversioned", vertex_type="source"}[5m])
● max(flink_jobmanager_job_lastcheckpointduration{jobname=~"global-ingest-.*"}) / 1000.0
● sum(rate(flink_jobmanager_job_fullrestarts[10m])) BY (jobname) > 0

Thanks!
Jared Stehler | @jstehler
