Enhancing AI-Driven User Engagement with Real-Time Data Streaming via Flink.pptx

1
Enhancing AI-Driven
User Engagement
with Real-Time Data
Streaming via Flink
Zbigniew Królikowski
Senior ML Engineer @VirtusLab

2
● Our client, a worldwide retailer, runs an e-commerce
platform. Content displayed to the customer is
personalised based on individual purchase history.
● The existing approach was based on batch data
processing and model training that introduced a
significant delay between customer actions on the
website and tailored content recommendations,
leading to missed sales opportunities.
● The client’s platform allows for continuous tracking and
processing of customer actions with minimal latency. Up
to this point this capability wasn’t leveraged.
● Our Machine Learning engineering team undertook the
challenge to build an online data processing and model
training solution that improves customer engagement
in order to generate more sales opportunities for the
client.
The opportunity

3
● From a Machine Learning perspective our main goal was
to improve the metric called Click-Through Rate (CTR),
which is built on a statistic of individual users seeing a
particular piece of content and engaging with it.
● Engagement with content leads to sales opportunities
impacting the Revenue per Customer (RPC) as well as
Revenue per Basket (RPB).
● Both RPC and RPB have a direct link to profit and were
actively monitored during A/B testing through analytics
means. However, CTR is more practical as a running
metric as it can be calculated on-the-fly.
Business metrics
we were interested in
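A minimal sketch of how CTR can be kept as a running metric. The event fields ("type", "content_id") and the class name are hypothetical and not taken from the client's event schema.

```python
from collections import defaultdict


class RunningCTR:
    """Keeps per-content impression and click counts so CTR can be read at any time."""

    def __init__(self):
        self.impressions = defaultdict(int)
        self.clicks = defaultdict(int)

    def update(self, event):
        # Hypothetical event layout: {"type": "impression" | "click", "content_id": str}
        if event["type"] == "impression":
            self.impressions[event["content_id"]] += 1
        elif event["type"] == "click":
            self.clicks[event["content_id"]] += 1

    def ctr(self, content_id):
        shown = self.impressions.get(content_id, 0)
        # Content that was never shown has no meaningful CTR; report 0.0 here.
        return self.clicks.get(content_id, 0) / shown if shown else 0.0


tracker = RunningCTR()
for kind in ["impression", "impression", "click", "impression", "impression"]:
    tracker.update({"type": kind, "content_id": "banner"})

print(tracker.ctr("banner"))  # 1 click over 4 impressions
```

Only two counters per piece of content are needed, which is why the metric is cheap to maintain inside a streaming job.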

4
● Events describing user actions are fed through Kafka. Our
solution supports backwards-compatibility for different
event versions.
● Events contain information about user identity, displayed
content, user’s basket as well as user’s interactions within
the page. A number of them need to be taken into account to
form a complete customer story.
● All event types have to be fed through specialised
filtering logic as well as correctly joined and aggregated to
build the model training features.
● Events are broadcast from all locations on the mobile app
and website while select subsets of this data have to be
efficiently routed into dedicated machine learning models.
The data
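The version handling and routing described above can be sketched as follows. Both event layouts, the field names and the model names are assumptions made for illustration; the real schemas are not shown in this deck.

```python
def normalise(event):
    """Map two hypothetical event versions onto one internal shape."""
    version = event.get("version", 1)
    if version == 1:
        return {"user": event["uid"], "kind": event["action"], "location": event["page"]}
    if version == 2:
        return {"user": event["user"]["id"], "kind": event["type"], "location": event["location"]}
    raise ValueError(f"unsupported event version: {version}")


def route(events, model_locations):
    """Yield (model_name, event) for every event whose location feeds a model."""
    for raw in events:
        event = normalise(raw)
        for model, locations in model_locations.items():
            if event["location"] in locations:
                yield model, event


# Hypothetical mapping of page locations to dedicated models.
model_locations = {"homepage_model": {"home"}, "basket_model": {"basket"}}

events = [
    {"version": 1, "uid": "u1", "action": "click", "page": "home"},
    {"version": 2, "user": {"id": "u2"}, "type": "impression", "location": "basket"},
    {"version": 2, "user": {"id": "u3"}, "type": "impression", "location": "search"},
]

routed = list(route(events, model_locations))
```

The third event is dropped because no model is registered for its location, which mirrors the filtering step.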

5
● The client offers a mobile application as well as a web
service. Our solution is capable of supporting both of these
as well as potential extensions for on-site use-cases.
● Training a single model, on one platform, requires
processing events for each individual customer exposed to
that piece of content. This, on average, is close to 2000
events per second.
● There are multiple locations, on each platform, that can be
enhanced with the move to the online solution enabled by
Flink.
● In order to provide complete coverage, the solution needs
to be able to scale up to tens of thousands of events per
second, while keeping costs manageable during
development ramp-up.
The data: volume

6
● Uptime SLAs are fulfilled and the system needs to be able
to recover from transient failures automatically.
● Data is never lost and the whole process is always
recoverable.
● User identity remains hidden at all times.
● No bias is introduced to the models through the data
ensuring fair treatment. Training is based solely on relevant
characteristics.
● Personalisation is only delivered to logged-in users who
have expressly agreed to participate.
Other
requirements

7
Apache Flink is a stream processing framework designed from the ground up with
these tenets in mind:
● Correctness guarantees - reproducible and consistent results
● Layered APIs - viable for all team skill sets
● Operational focus - production ready from the start
● Scalability - from minimal use-cases to core business
● Performance - no-compromise approach to low-latency, high-throughput
and minimal operational costs.
Flink offers excellent integration with a number of queue systems including
Kafka as well as SQL and NoSQL databases.
Apache Flink is Open Source Software available under the Apache License
2.0.
Apache Flink

8
● In contrast to stateless processing, stateful
processing enables a broader set of operations that
encompass more than one event. This opens up the
possibilities for more sophisticated business
processes to be represented within the code.
● Apache Flink supports multiple back-ends for
storing state to best fit different scalability,
throughput, and latency requirements: in-memory
or a key-value database (RocksDB).
Stateful
processing
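A minimal sketch of why state matters: joining a click to the impression that preceded it requires remembering the impression. This is plain Python standing in for keyed state, not Flink's state API, and the event fields are hypothetical.

```python
class ImpressionClickJoiner:
    """Remembers pending impressions per (user, content) key and joins clicks to them."""

    def __init__(self):
        # Keyed state: (user, content_id) -> timestamp of the last impression.
        self._pending = {}

    def process(self, event):
        key = (event["user"], event["content_id"])
        if event["type"] == "impression":
            self._pending[key] = event["ts"]
            return None
        if event["type"] == "click" and key in self._pending:
            shown_at = self._pending.pop(key)
            return {
                "user": event["user"],
                "content_id": event["content_id"],
                "seconds_to_click": event["ts"] - shown_at,
            }
        # A click with no remembered impression cannot be joined.
        return None


joiner = ImpressionClickJoiner()
first = joiner.process({"user": "u1", "content_id": "c1", "type": "impression", "ts": 100})
joined = joiner.process({"user": "u1", "content_id": "c1", "type": "click", "ts": 105})
orphan = joiner.process({"user": "u2", "content_id": "c1", "type": "click", "ts": 106})
```

A stateless operator sees each event in isolation and could not produce the joined record.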

9
● Apache Flink supports both batch and stream processing
and guarantees full reproducibility of results across both
approaches. This requires close to zero code changes and works
with each of Flink’s APIs. Both bounded and unbounded
streams are supported.
Image source: https://flink.apache.org/what-is-flink/flink-architecture/
Batch processing and
Stream processing
● This feature makes Flink capable of covering the same use-cases
as batch processing tools like Apache Spark. This has the potential
to reduce the amount of different tooling necessary for a
project.

10
A big part of what makes Flink a good fit for the client is the very
powerful feature of event-time processing.
By using this feature, the risk associated with a scalable, dynamic
environment is mitigated and we achieve full reproducibility. As
logical time is decoupled from wall-clock time, applications gain
data consistency and the focus can be directed towards keeping
latency as low as possible.
The development process is also greatly enhanced, as the logic can be
rolled back to any point in time, allowing for a replicable
development environment.
Moreover, the same applies to the root-cause-analysis process.
Production issues can be resolved swiftly and reliably, reducing
the ongoing maintenance costs.
Event-time processing
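A simplified sketch of event-time windowing, assuming tumbling windows and a watermark that trails the highest timestamp seen by a fixed delay. It only illustrates the idea and does not reproduce Flink's exact watermark or window-firing rules; the class and parameter names are hypothetical.

```python
from collections import defaultdict


class EventTimeWindowCounter:
    """Counts events per tumbling event-time window, closing windows by watermark."""

    def __init__(self, window_size, max_out_of_orderness):
        self.window_size = window_size
        self.max_out_of_orderness = max_out_of_orderness
        self.counts = defaultdict(int)  # window start -> event count
        self.max_ts = None

    def on_event(self, ts):
        """Add one event; return (window_start, count) for windows the watermark closed."""
        window_start = ts - ts % self.window_size
        if self.max_ts is not None:
            watermark = self.max_ts - self.max_out_of_orderness
            if window_start + self.window_size <= watermark:
                return []  # late event: its window was already emitted, so drop it
        self.counts[window_start] += 1
        self.max_ts = ts if self.max_ts is None else max(self.max_ts, ts)
        watermark = self.max_ts - self.max_out_of_orderness
        closed = sorted(s for s in self.counts if s + self.window_size <= watermark)
        return [(s, self.counts.pop(s)) for s in closed]


def run(arrival_order):
    counter = EventTimeWindowCounter(window_size=10, max_out_of_orderness=5)
    emitted = []
    for ts in arrival_order:
        emitted.extend(counter.on_event(ts))
    return emitted


# The same events arriving in two different orders.
first = run([1, 12, 7, 18, 26, 3])
second = run([12, 1, 7, 18, 26, 3])
print(first == second)
```

Because windows are assigned from the timestamp carried by the event rather than from arrival time, both arrival orders yield the same emitted windows.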

11
Flink supports two application programming interfaces
(APIs) with a large overlap of features:
● Table API (SQL) - suitable for BI specialists, analysts
and data scientists.
● DataStream API (available for Python, Java and
Scala) - suitable for data engineers and ML
engineers.
This makes Flink fit seamlessly into the existing proficiencies
of the team, without the time investment needed for
re-skilling.
Available languages and
APIs
Image source: https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/concepts/overview/

13
Flink can be easily integrated with
cloud platforms as well as on-premise
environments through the
use of the Flink Kubernetes
Operator, which handles many
aspects of the deployment.
Deployment
image source: https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/img/overview.svg

14
● Flink offers complete consistency of results
through checkpointing, which is run either
at very short intervals (seconds
to minutes) or continuously. In case of
issues, a Flink application is able to reload
from the last checkpoint and seamlessly catch up
with minimal disruption.
Resilience
● When deployed on Kubernetes, Flink
supports a high-availability (HA)
deployment method in which application
stability is maintained in the case of failure
of the JobManager, the entity that supervises
scheduling and resource management.
● Industry-standard ZooKeeper can be used
instead. This makes HA possible on
non-Kubernetes platforms.
Image source: https://nightlies.apache.org/flink/flink-docs-master/docs/concepts/stateful-stream-processing/
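The checkpoint-and-replay idea can be sketched in a few lines. This is a toy model that assumes a replayable input log (as Kafka provides) and a snapshot that stores the read offset together with the operator state; it is not Flink's checkpointing implementation.

```python
import copy


class CountingJob:
    """Counts keys from a replayable log and can snapshot and restore its progress."""

    def __init__(self):
        self.offset = 0   # position in the input log
        self.counts = {}  # operator state

    def process(self, log, upto):
        while self.offset < upto:
            key = log[self.offset]
            self.counts[key] = self.counts.get(key, 0) + 1
            self.offset += 1

    def checkpoint(self):
        # Offset and state are captured together so they stay consistent.
        return copy.deepcopy({"offset": self.offset, "counts": self.counts})

    def restore(self, snapshot):
        self.offset = snapshot["offset"]
        self.counts = copy.deepcopy(snapshot["counts"])


log = ["a", "b", "a", "c", "a"]

# Run with a failure: progress made after the checkpoint is lost.
job = CountingJob()
job.process(log, upto=3)
snapshot = job.checkpoint()
job.process(log, upto=4)  # this work is lost when the job crashes

recovered = CountingJob()
recovered.restore(snapshot)
recovered.process(log, upto=len(log))

# Reference run without any failure.
reference = CountingJob()
reference.process(log, upto=len(log))
```

The recovered job replays the log from the stored offset, so each event is counted exactly once in the final state.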

15
The deployment will adapt to changing load patterns by
increasing or reducing the allocation of resources based on the
load. This lowers the operating costs, while at the same time
safeguarding against deterioration of the service level in times of
increased customer traffic.
Auto-scaling
Apache Flink will automatically re-adjust its internal
allocation of resources to individual processing steps based on
quality characteristics. Through this approach, the
administrative and development costs can be reduced while at the
same time ensuring operational efficiency.
Image source: https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.7/docs/custom-resource/autoscaler
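A rough sketch of load-based scaling, assuming parallelism is chosen so that each parallel instance runs at a target utilisation. The function, its parameters and the per-instance rate of 1000 events per second are hypothetical; the actual autoscaler uses its own metrics and rules.

```python
import math


def target_parallelism(input_rate, rate_per_instance, target_utilisation=0.7,
                       min_parallelism=1, max_parallelism=64):
    """Return how many parallel instances keep each one at the target utilisation."""
    required = math.ceil(input_rate / (rate_per_instance * target_utilisation))
    # Clamp to the configured bounds so the job never scales to zero or without limit.
    return max(min_parallelism, min(max_parallelism, required))


# Single-model load of about 2000 events per second.
single_model = target_parallelism(2000, rate_per_instance=1000)

# Full coverage in the tens of thousands of events per second.
full_coverage = target_parallelism(20000, rate_per_instance=1000)

# Very low traffic still keeps the minimum parallelism.
quiet_period = target_parallelism(10, rate_per_instance=1000)
```

Keeping utilisation below 100% leaves headroom for traffic spikes and for catching up after a restart.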

16
Flink offers excellent built-in monitoring capability with a web UI:
● Provides real-time statistics of input and output data sizes and
message counts for every step of the processing.
● Collects all application logs in a single location for seamless
browsing and auditing.
● Displays information regarding checkpoints, providing a clear
understanding of the stability of the applications.
● The same capability is available during the development
process, ensuring quality and clarity of operation from the early
stages.
Additionally, we’ve integrated the Flink instance with external
cloud monitoring tools, giving us a uniform monitoring capability
across all cloud deployments.
Monitoring

17
System architecture - adjusted

18
● Our solution gave a 35% uptick
in CTR for the duration of the A/B trial.
● Performance has been
maintained over time. This shows
that the method has prevented
concept/model drift.
The results
