State Schema Evolution for Apache Flink® Applications
戴资力, Tzu-Li (Gordon) Tai
Apache Flink PMC

Agenda
1. Evolving Stateful Flink Streaming Applications
2. Schema Evolution for Flink Built-in Types
3. Implementing Custom State Serializers

Evolving Stateful Flink Streaming Applications

Anatomy of a Flink stream job upgrade

[Figure: user code, local state backend, persisted savepoint]
1. User code performs local reads and writes that manipulate state in the local state backend.
2. On savepoint, the state is persisted to a DFS.
3. The application is upgraded.
4. State is restored from the persisted savepoint to the state backends.
5. The upgraded user code continues to access state.
Schema Evolution for Built-In Types

State registration with built-in serialization

    ValueStateDescriptor<MyStateType> desc =
        new ValueStateDescriptor<>(
            "my-value-state",
            MyStateType.class  // type information for the state
        );
    ValueState<MyStateType> state = getRuntimeContext().getState(desc);

Flink infers information about the type from the class passed to the descriptor and creates a serializer for it:
● Primitive types: IntSerializer, DoubleSerializer, LongArraySerializer, etc.
● Tuples: TupleSerializer
● POJOs / Scala case classes: PojoSerializer, CaseClassSerializer
● Apache Avro types: AvroSerializer
● Fallback is Kryo: KryoSerializer

Evolving state schema for Apache Avro types
● Can evolve schema according to Avro specifications*
● Can swap between GenericRecord and code-generated SpecificRecords
● Cannot change the namespace of generated SpecificRecord classes
*Avro specifications: http://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution

Status quo of schema evolution support
● Avro types are the only built-in types that support schema evolution (as of Flink 1.7)
● More is planned for 1.8+: POJOs, Scala case classes, Rows (for Flink Tables)
● Avoid using Kryo if you want evolvable schema for state

Implementing Custom State Serializers

State registration with custom serializers

    ValueStateDescriptor<MyStateType> desc =
        new ValueStateDescriptor<>(
            "my-value-state",
            new MyStateTypeSerializer()
        );

    class MyStateTypeSerializer extends TypeSerializer<MyStateType> { … }

State Schema and Serialization
● The terms "data schema" and "serialization format" are interchangeable here
● Evolving a state's data schema requires evolving the state's serializer
● Depending on the serialization behaviour of the state backend (heap vs. off-heap), state migration may be required

State Serialization for Heap Backends

[Figure: user code, local state backend, persisted savepoint]
1. User code registers state with new SerializerV1(). The local state backend holds Key1 to Key5 as objects.
2. On savepoint, the objects are serialized by the V1 serializer. The persisted savepoint contains "Key 1 bytes V1" to "Key 5 bytes V1".
3. The application is upgraded, and user code now registers state with new SerializerV2(). The savepoint still contains V1 bytes.
4. Restoring the savepoint requires the V1 serializer, which deserializes the V1 bytes back into objects Key1 to Key5 in the local state backend.
5. On the next savepoint, the objects are serialized by the V2 serializer. The persisted savepoint contains "Key 1 bytes V2" to "Key 5 bytes V2".
State Serialization for Heap Backends
● Serialization happens on restore + snapshot: state is deserialized eagerly on restore and serialized lazily, only when a snapshot is taken
● By nature, restoring + snapshotting state is already a state migration process
● Requires a written form of the previous serializer in the snapshot
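
The heap-backend behaviour summarised above can be modelled in a few lines of plain Java. This is only a sketch under stated assumptions: `StringCodec`, `HeapBackendSketch`, and the two toy formats `V1` and `V2` are hypothetical stand-ins invented for illustration, not Flink classes, and the "savepoint" is just an in-memory map.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a state serializer; not a Flink interface.
interface StringCodec {
    byte[] write(String value);
    String read(byte[] bytes);
}

// Toy model of a heap backend: live state is kept as objects and is only
// turned into bytes when a snapshot is taken.
class HeapBackendSketch {
    // Assumed "V1" format: the plain UTF-8 bytes of the value.
    static final StringCodec V1 = new StringCodec() {
        public byte[] write(String value) { return value.getBytes(StandardCharsets.UTF_8); }
        public String read(byte[] bytes) { return new String(bytes, StandardCharsets.UTF_8); }
    };
    // Assumed "V2" format: the same value behind a "v2:" prefix.
    static final StringCodec V2 = new StringCodec() {
        public byte[] write(String value) { return ("v2:" + value).getBytes(StandardCharsets.UTF_8); }
        public String read(byte[] bytes) { return new String(bytes, StandardCharsets.UTF_8).substring(3); }
    };

    final Map<String, String> objects = new HashMap<>();

    // Lazy serialization: bytes are produced only at snapshot time,
    // using whichever serializer the job currently registers.
    Map<String, byte[]> snapshot(StringCodec current) {
        Map<String, byte[]> savepoint = new HashMap<>();
        for (Map.Entry<String, String> e : objects.entrySet()) {
            savepoint.put(e.getKey(), current.write(e.getValue()));
        }
        return savepoint;
    }

    // Eager deserialization: every entry is read back on restore, using
    // the serializer that originally wrote the savepoint.
    void restore(Map<String, byte[]> savepoint, StringCodec previous) {
        objects.clear();
        for (Map.Entry<String, byte[]> e : savepoint.entrySet()) {
            objects.put(e.getKey(), previous.read(e.getValue()));
        }
    }
}
```

Restoring a V1 savepoint with `V1` and then calling `snapshot(V2)` rewrites every entry in the V2 format, which is why restore plus snapshot already amounts to a migration for this kind of backend.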

State Serialization for Out-of-Core Backends

[Figure: user code, local state backend, persisted savepoint]
1. User code registers state with new SerializerV1(). The local state backend holds each key's state as serialized bytes (Key1bytesV1 to Key5bytesV1), and user code exchanges bytes with the backend on every access.
2. On savepoint, the V1 bytes are copied to the persisted savepoint by file transfer.
3. The application is upgraded, and user code now registers state with new SerializerV2().
4. On restore, the V1 bytes are copied back into the local state backend by file transfer, unchanged.
5. State access with the V2 serializer on V1 bytes? This requires migration!
6. After migration, the local state backend holds V2 bytes.
7. On the next savepoint, the V2 bytes are copied to the persisted savepoint by file transfer.
State Serialization for Out-of-Core Backends
● Serialization happens on every state access: eager serialization, lazy deserialization
● After restore, state migration occurs on first access if the schema has changed
● The previous serializer is required if state migration occurs
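
A minimal sketch of why out-of-core backends need an explicit migration step, assuming a toy backend that stores raw bytes per key. `LongCodec`, `OutOfCoreBackendSketch`, and the 4-byte and 8-byte formats are hypothetical and chosen only to make the mismatch visible; none of this is Flink API.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a state serializer; not a Flink interface.
interface LongCodec {
    byte[] write(long value);
    long read(byte[] bytes);
}

// Toy model of an out-of-core backend: state is held as bytes, so every
// read deserializes and every write serializes.
class OutOfCoreBackendSketch {
    // Assumed "V1" format: the value stored as a 4-byte int.
    static final LongCodec V1 = new LongCodec() {
        public byte[] write(long value) { return ByteBuffer.allocate(4).putInt((int) value).array(); }
        public long read(byte[] bytes) { return ByteBuffer.wrap(bytes).getInt(); }
    };
    // Assumed "V2" format: the value stored as an 8-byte long.
    static final LongCodec V2 = new LongCodec() {
        public byte[] write(long value) { return ByteBuffer.allocate(8).putLong(value).array(); }
        public long read(byte[] bytes) { return ByteBuffer.wrap(bytes).getLong(); }
    };

    final Map<String, byte[]> bytes = new HashMap<>();
    private final LongCodec codec;

    OutOfCoreBackendSketch(LongCodec codec) {
        this.codec = codec;
    }

    // Lazy deserialization: bytes are decoded only when a value is read.
    long value(String key) {
        return codec.read(bytes.get(key));
    }

    // Eager serialization: the value is encoded as soon as it is written.
    void update(String key, long value) {
        bytes.put(key, codec.write(value));
    }
}
```

If the bytes written by a `V1` backend are copied unchanged into a backend constructed with `V2`, calling `value(...)` throws `java.nio.BufferUnderflowException`, because the 8-byte reader finds only 4 bytes. The stored bytes have to be rewritten before the new serializer can be used.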

Main abstraction: TypeSerializerSnapshot

    interface TypeSerializerSnapshot<T> {
        int getCurrentVersion();
        void writeSnapshot(DataOutputView out);
        void readSnapshot(int readVersion, DataInputView in, ClassLoader userCodeClassloader);
        TypeSerializer<T> restoreSerializer();
        TypeSerializerSchemaCompatibility<T> resolveSchemaCompatibility(TypeSerializer<T> newSerializer);
    }

● Represents the written form of a state's serializer, written to snapshots
● Encodes information about the state's written schema + serializer configuration
● Serves as a factory for the previous serializer
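
To make the "written form" and "factory" roles concrete, here is a simplified sketch that uses plain `java.io` streams in place of Flink's `DataOutputView` and `DataInputView`. `JoinSerializer`, `JoinSerializerSnapshot`, and the `roundTrip` helper are hypothetical names for illustration; the method names only mirror the interface above and omit the class loader and compatibility parts.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical serializer whose only configuration is a delimiter character.
class JoinSerializer {
    final char delimiter;

    JoinSerializer(char delimiter) {
        this.delimiter = delimiter;
    }

    String serialize(String[] parts) {
        return String.join(String.valueOf(delimiter), parts);
    }
}

// Simplified snapshot: the written form of a JoinSerializer's configuration.
class JoinSerializerSnapshot {
    static final int CURRENT_VERSION = 1;
    private char delimiter;

    // No-arg constructor, used when the snapshot is read back.
    JoinSerializerSnapshot() {}

    // Constructor used when creating a snapshot for writing.
    JoinSerializerSnapshot(char delimiter) {
        this.delimiter = delimiter;
    }

    void writeSnapshot(DataOutputStream out) throws IOException {
        out.writeChar(delimiter);
    }

    void readSnapshot(int readVersion, DataInputStream in) throws IOException {
        if (readVersion != CURRENT_VERSION) {
            throw new IOException("Unknown snapshot version: " + readVersion);
        }
        delimiter = in.readChar();
    }

    // Factory for the serializer that was in use when the snapshot was written.
    JoinSerializer restoreSerializer() {
        return new JoinSerializer(delimiter);
    }

    // Writes version + configuration to bytes, then rebuilds a serializer from them.
    static JoinSerializer roundTrip(JoinSerializer original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeInt(CURRENT_VERSION);
            new JoinSerializerSnapshot(original.delimiter).writeSnapshot(out);
            out.flush();

            DataInputStream in = new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
            int readVersion = in.readInt();
            JoinSerializerSnapshot restored = new JoinSerializerSnapshot();
            restored.readSnapshot(readVersion, in);
            return restored.restoreSerializer();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The version number is written ahead of the configuration so that a later release of the snapshot class can recognise and read older layouts.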

Heap Backends

[Figure: user code, local state backend, persisted savepoint]
1. On savepoint, the state objects Key1 to Key5 are serialized by SerializerV1. SerializerV1.snapshotConfiguration() produces a SerializerV1Snapshot, which is written to the persisted savepoint alongside the V1 bytes.
2. After the upgrade, user code registers state with new SerializerV2(). On restore, SerializerV1Snapshot.restoreSerializer() recreates SerializerV1.
3. The V1 bytes are deserialized by SerializerV1 into objects in the local state backend.
Out-of-Core Backends

[Figure: user code, local state backend, persisted savepoint]
1. After restore, the local state backend holds V1 bytes while user code has registered new SerializerV2(). State access with the V2 serializer?
2. The restored serializer snapshot is asked whether the new serializer is compatible:

    TypeSerializerSchemaCompatibility<T> compat =
        serializerV1Snapshot.resolveSchemaCompatibility(serializerV2);
    if (compat.isCompatibleAfterMigration()) {
        // migrate the state schema
    }

3. If migration is required, restoreSerializer() on the snapshot recreates SerializerV1.
4. SerializerV1 reads the V1 bytes into a state object.
5. SerializerV2 writes the state object back to the local state backend.
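
The read-with-V1, write-with-V2 procedure above can be sketched as a small loop over the stored entries. This is an illustration under assumptions, not Flink's migration code: `MigrationSketch`, its `Compatibility` enum, and the use of `Function` objects as stand-ins for the previous and new serializers are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class MigrationSketch {
    // Hypothetical mirror of the three outcomes a compatibility check can report.
    enum Compatibility { COMPATIBLE_AS_IS, COMPATIBLE_AFTER_MIGRATION, INCOMPATIBLE }

    // Rewrites stored bytes when required:
    // old bytes -> state object (previous serializer) -> new bytes (new serializer).
    static Map<String, byte[]> migrate(
            Map<String, byte[]> stored,
            Compatibility compat,
            Function<byte[], Long> previousReader,
            Function<Long, byte[]> newWriter) {
        switch (compat) {
            case COMPATIBLE_AS_IS: {
                // The new serializer can read the existing bytes unchanged.
                return stored;
            }
            case COMPATIBLE_AFTER_MIGRATION: {
                Map<String, byte[]> migrated = new HashMap<>();
                for (Map.Entry<String, byte[]> e : stored.entrySet()) {
                    Long stateObject = previousReader.apply(e.getValue());
                    migrated.put(e.getKey(), newWriter.apply(stateObject));
                }
                return migrated;
            }
            default: {
                throw new IllegalStateException("New serializer is incompatible with the stored state");
            }
        }
    }
}
```

The previous serializer is only needed in the middle branch, which matches the point that it is required if and only if a migration actually takes place.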

Example: Evolvable PojoSerializer [FLINK-10987]

    class Employee {
        int age;
        String name;
        Department dep;
        ...
    }

For each field, the field name is written together with the field's serializer:
● age: IntSerializer
● name: StringSerializer
● dep: PojoSerializer

Example: Evolvable PojoSerializer [FLINK-10987]

    class PojoSerializer<T> extends TypeSerializer<T> {
        private Field[] fields;
        private TypeSerializer<?>[] fieldSerializers;
        …
        // Captures the current fields and their serializers as the written form.
        public TypeSerializerSnapshot<T> snapshotConfiguration() {
            return new PojoSerializerSnapshot<>(fields, fieldSerializers);
        }
    }

    class PojoSerializerSnapshot<T> implements TypeSerializerSnapshot<T> {
        private Field[] fields;
        private TypeSerializer<?>[] fieldSerializers;

        /**
         * Constructor for instantiating the snapshot when reading.
         */
        public PojoSerializerSnapshot() {}

        /**
         * Constructor to create a snapshot for writing.
         */
        public PojoSerializerSnapshot(Field[] fields, TypeSerializer<?>[] fieldSerializers) {
            this.fields = fields;
            this.fieldSerializers = fieldSerializers;
        }
        ...
    }

        ...
        public TypeSerializerSchemaCompatibility<T> resolveSchemaCompatibility(TypeSerializer<T> newSerializer) {
            if (newSerializer instanceof PojoSerializer) {
                Field[] newFields = ((PojoSerializer<T>) newSerializer).getFields();
                if (hasDifferentTypedFields(this.fields, newFields)) {
                    // A field changed its type: the old bytes cannot be carried over.
                    return TypeSerializerSchemaCompatibility.incompatible();
                } else if (hasNewFields(this.fields, newFields) || hasRemovedFields(this.fields, newFields)) {
                    // Fields were added or removed: the state must be rewritten.
                    return TypeSerializerSchemaCompatibility.compatibleAfterMigration();
                }
                return TypeSerializerSchemaCompatibility.compatibleAsIs();
            }
            // A different kind of serializer cannot read POJO state.
            return TypeSerializerSchemaCompatibility.incompatible();
        }
    }
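
The helper checks named in the snippet above (`hasDifferentTypedFields`, `hasNewFields`, `hasRemovedFields`) are not shown on the slides. One possible way to write them, assuming fields are matched by name and compared by declared type, is sketched below. `PojoCompatibilitySketch`, its `Compatibility` enum, and the three `Employee` variants are hypothetical and exist only for this example.

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

class PojoCompatibilitySketch {
    // Hypothetical mirror of the three outcomes used in the snippet above.
    enum Compatibility { COMPATIBLE_AS_IS, COMPATIBLE_AFTER_MIGRATION, INCOMPATIBLE }

    // Three assumed versions of the same POJO.
    static class EmployeeV1 { int age; String name; }
    static class EmployeeV2 { int age; String name; String email; } // field added
    static class EmployeeV3 { long age; String name; }              // field type changed

    // Indexes declared fields by name, skipping compiler-generated ones.
    private static Map<String, Class<?>> typesByName(Field[] fields) {
        Map<String, Class<?>> types = new HashMap<>();
        for (Field f : fields) {
            if (!f.isSynthetic()) {
                types.put(f.getName(), f.getType());
            }
        }
        return types;
    }

    // True if a field present in both versions changed its declared type.
    static boolean hasDifferentTypedFields(Field[] oldFields, Field[] newFields) {
        Map<String, Class<?>> newTypes = typesByName(newFields);
        for (Map.Entry<String, Class<?>> e : typesByName(oldFields).entrySet()) {
            Class<?> newType = newTypes.get(e.getKey());
            if (newType != null && !newType.equals(e.getValue())) {
                return true;
            }
        }
        return false;
    }

    // True if the new version declares a field name the old version lacks.
    static boolean hasNewFields(Field[] oldFields, Field[] newFields) {
        return !typesByName(oldFields).keySet().containsAll(typesByName(newFields).keySet());
    }

    // True if the old version declares a field name the new version lacks.
    static boolean hasRemovedFields(Field[] oldFields, Field[] newFields) {
        return !typesByName(newFields).keySet().containsAll(typesByName(oldFields).keySet());
    }

    // Same decision order as the resolveSchemaCompatibility snippet above.
    static Compatibility resolve(Class<?> oldType, Class<?> newType) {
        Field[] oldFields = oldType.getDeclaredFields();
        Field[] newFields = newType.getDeclaredFields();
        if (hasDifferentTypedFields(oldFields, newFields)) {
            return Compatibility.INCOMPATIBLE;
        } else if (hasNewFields(oldFields, newFields) || hasRemovedFields(oldFields, newFields)) {
            return Compatibility.COMPATIBLE_AFTER_MIGRATION;
        }
        return Compatibility.COMPATIBLE_AS_IS;
    }
}
```

Checking for changed types before added or removed fields matters: a version that both adds a field and retypes another must be reported as incompatible rather than as migratable.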

        ...
        // Rebuilds the serializer that was in use when the snapshot was written.
        public TypeSerializer<T> restoreSerializer() {
            return new PojoSerializer<>(fields, fieldSerializers);
        }
    }

Miscellaneous Best Practices
● Avoid classname changes to the serializer snapshot class: the classname is the entry point to reading a serializer snapshot
● Avoid using anonymous or nested classes for snapshot classes
● Use CompositeSerializerSnapshot to handle nested TypeSerializers
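
The sketch below illustrates, in plain Java, why the class name matters. It assumes a simplified mechanism in which only the class name is recorded and the class is later instantiated reflectively through its no-arg constructor; `SnapshotSketch`, `StableSnapshot`, and `SnapshotClassNameSketch` are hypothetical names, not Flink classes.

```java
// Hypothetical minimal snapshot contract for this example.
interface SnapshotSketch {
    String describe();
}

// A top-level class with a public no-arg constructor: its name stays the
// same as long as nobody renames or moves the class.
class StableSnapshot implements SnapshotSketch {
    public StableSnapshot() {}

    public String describe() {
        return "stable";
    }
}

class SnapshotClassNameSketch {
    // Writing side: the class name is what gets recorded.
    static String recordedName(SnapshotSketch snapshot) {
        return snapshot.getClass().getName();
    }

    // Reading side: the recorded name is the entry point for creating the
    // snapshot object again. A renamed class can no longer be found.
    static SnapshotSketch instantiate(String className) {
        try {
            return (SnapshotSketch) Class.forName(className)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate snapshot class " + className, e);
        }
    }

    // An anonymous class receives a compiler-generated name (enclosing class,
    // '$', and a number), which can shift when the enclosing class is edited.
    static SnapshotSketch anonymous() {
        return new SnapshotSketch() {
            public String describe() {
                return "anonymous";
            }
        };
    }
}
```

Comparing `recordedName(new StableSnapshot())` with `recordedName(anonymous())` shows the difference: the first is a name the author controls, while the second is assigned by the compiler.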

Flink 1.7 now supports state schema evolution
● Avro schema evolution is supported; support for more built-in types is on the radar
● Covered details on implementing custom state serializers with evolvable schema
