Timeplus: One single binary to tackle streaming and historical analytics

SF TechWeek : Data & AI Edition
Ken Chen, Co-founder & Chief Architect
Oct 8, 2024
Proton - one single binary to tackle
streaming and historical analytics

Agenda
• Two Quick Demos
• What kind of problems Timeplus Proton resolves
• Architecture
• Storage
• Query Processing
• Python Extension
• Integrate with ML / AI Use Cases Study
• QA

Demo 1 : Query on Kafka
Kafka Topic
(Source)
Proton Materialized View
(Streaming ETL +
Window Aggregation)
Kafka Topic
(Target)
Data Gen
{
“devicename”:
“region”:
“city”:
“version”:
“lat”:
“lon”:
“battery”:
“humidity”:
“temperature”:
“hydraulic_pressure”:
“atmospheric_pressure”:
“timestamp”:
}

Demo 2 : Ingest, Materialize, Serving (All in one binary)
Proton Source
Stream
Proton Materialized View
(Window Aggregation)
Proton Target
Stream
Data Gen

What’s Timeplus Proton
• Core problems to solve :
“Query, detect and act on fast-changing stream data in an
incremental way.”
• Key Tech Highlights
• Unifies streaming processing and historical analytics via SQL
• Dual data stores to handle data persistence
• Materialized incremental computation engine
• Everything is in single binary, light-weight and efficient , run from edge
to cloud, from single instance to distributed cluster

CONFIDENTIAL
Proton Architecture
Timeplus Proton
(Edge / Cloud)
Incremental ML / AI

CONFIDENTIAL
Proton Architecture - Zoom In
Unified Incremental SQL Processing Engine (Vectorized / JIT)
NativeLog
(Multi-Raft Replicated)
Query processing super power
(Applied state of the art
vectorized / JIT capabilities)
V8 Engine
(JavaScript)
CPython
Engine
(Cloud) Historical Store
NativeLog - Streaming Store
Optimized for concurrent ingest / latency Optimized for bulk scan / throughput
SQL via TCP, JSON Rest via HTTP
Ingest Query
API Layer

CONFIDENTIAL
Streaming Storage – NativeLog (Zoom In)
9
Replicated
LogManager
MetaStore
(Replicated)
NativeLog
Server
Loglet
Loglet
Segment sequence
index
atime
index
etime
index
Client
ReplicatedLog
Segment
…

CONFIDENTIAL
Historical Storage - LSM-like (Zoom In)
Level 0 Level 0 Level 0 … Level 0
write `part`
Level 1 Background (Multi-level) Merge
primary.idx skipping..idx
data..bin data..mrk2
Metadata files
…
Big chunk of columnarized columns

Query - One SQL to Rule All Analytics
Powerful windowing, aggregation and joining etc for streaming and historical query
SUBSCRIBE TO
SELECT device_name, avg(temperature), predict(temperature)
FROM tumble(devices, 5s)
INNER JOIN table(device_products)
ON devices.product_id = device_products.id
GROUP BY device_name, window_end
[SETTINGS seek_to = ‘2021-12-02 10:00:00’]
EMIT [AFTER watermark | LAST 30m]
1. Just one declarative SQL query
2. The most
advanced streaming
windowing & global
functions
5. Intelligent watermark control can
handle late events and time skew issues
properly
7. JOIN between
stream and stream, or
stream and table can
drive more real-time
analytics insights
6. LAST X can help user focusing on what’s
happening in recent time window
9. Super fast push-based stateful query through TCP.
No more “Refresh”!
4. Time can be easily
rewinded to any
historical moment for
reprocessing
8. UDA/F for customer functions, e.g. ML
prediction or anomaly detection, customized
aggregations, Complex Event Processing
3. Connect historical
data via table()

CONFIDENTIAL
Extended SQL / Execution Plans
13
StreamSource HistoricalSource
BuildingHashTable
Joining
Transform
Watermark
Transform
WindowAssigment
Transform
TumbleWindowAggr
Transform
Output Project
WITH joined AS
(
SELECT
*
FROM devices
INNER JOIN table(device_inventory) AS
device_inventory ON id = device_inventory.device_id
)
SELECT
max_k(cpu, device_name), window_start
FROM
tumble(joined, 5s)
GROUP BY window_start

CONFIDENTIAL
Query Processing - Historical Backfill
Entry Entry Entry Block Block Block Block
Historical Data
Reader (Backfill)
Streaming Data
Reader (Live)
Concatenated Stream Reader
Last sequence number
(1)
(2)
(3)
Downstream
Pipeline
NativeLog Historical Data Store
● Streaming Lookup
● Streaming Enrichment
● Traveling across historical
store and streaming events
● Correlated windows analytics:
Compare live window with
historical window
sn
SELECT count(), max(cpu), max(memory)
FROM devices WHERE _tp_time > ‘2022-11-12 00:00:00’ GROUP by device_id;

CONFIDENTIAL
Native Python Integration (Python UDF)
CREATE FUNCTION mask_password(values string) RETURN string
LANGUAGE PYTHON AS
$$
import re
for value in values:
value[i]=re.sub(‘password=(?=.*[A-Za-z])(?=.*d)[A-Za-zd]{8,}’, ‘password=***’, v);
return value;
$$;

CONFIDENTIAL
Native Python Integration (Python UDAF)
CREATE AGGREGATE FUNCTION second_max(value double) RETURN double
LANGUAGE PYTHON AS
$$
def initialize():
pass
def process(values):
pass
def finalize():
pass
}
$$;

CONFIDENTIAL
Case Study : Real-time DDos Detection
https://www.timeplus.com/post/real-time-ddos-detection

CONFIDENTIAL
Case Study : Real-time Machine Learning
https://www.timeplus.com/post/real-time-machine-learning

Recap - Timeplus Proton
• Core problems to solve :
“Query, detect and act on fast-changing stream data in an
incremental way.”
• Key Tech Highlights
• Unifies streaming processing and historical analytics via SQL
• Dual data stores to handle data persistence
• Materialized incremental computation engine
• Everything is in single binary, light-weight and efficient , run from edge
to cloud, from single instance to distributed cluster

Timeplus: One single binary to tackle streaming and historical analytics

CONFIDENTIAL
Timeplus Cluster - Single Binary with Multiple-Roles
(Replicated NativeLog, powered by Multi-Raft)
Data Node
shard-1 shard-3
shard-2
Data Node
shard-1 shard-3
shard-2
Data Node
shard-1 shard-3
shard-2
Data Ingestion
Node
Data Query Node
Data Ingestion
Node
Data Ingestion
Node
Data Query Node
Data Query Node
Query Query Query
Ingestion Ingestion Ingestion
Client
replica = 3
Access
Layer
Computing
Layer
Data
Layer
Data
Replication
and parallel
access
High
Availability
Query and
Ingestion and
horizontal
scalability
High
Availability
Access and
Load
Balancing
Metadata
Node
Metadata
Node
Metadata
Node

CONFIDENTIAL
Timeplus Cluster - Query Failover
Data Node
shard-1 shard-3
shard-2
Data Node
shard-1 shard-3
shard-2
Data Node
shard-1 shard-3
shard-2
Data Query Node Data Query Node
Query Query
Client
replica = 3
checkpoint checkpoint checkpoint
Access
Layer
Computing
Layer
Data
Layer
Data
Replication
and parallel
access
High
Availability
Query and
Ingestion and
horizontal
scalability
High
Availability
Access and
Load
Balancing

Timeplus: One single binary to tackle streaming and historical analytics

More Related Content

Similar to Timeplus: One single binary to tackle streaming and historical analytics (20)

More from Zilliz (20)

Recently uploaded (20)

Timeplus: One single binary to tackle streaming and historical analytics