MaxScale
Architecture Evolution
Johan Wikman
Lead Developer
Overview
● What is MaxScale
● Architecture
● Performance
● Summary
What is MaxScale
What is MaxScale
● Cluster Abstraction
○ Hides the complexity.
○ Load Balancer
○ High Availability
○ Easier Maintenance
● And more
○ Firewall
○ Data masking
○ Logging
○ Cache
○ ...
[Diagram: Client → MaxScale → Master, Slave, Slave]
Read Write Splitting
● Analyze statements
○ Send where appropriate
[Diagram: Client → MaxScale → Master, Slave, Slave]
Read Write Splitting
● Analyze statements
○ Send where appropriate
● Write statements to master
[Diagram: Client → MaxScale → Master, Slave, Slave; "> INSERT INTO ..." is routed to the Master]
Read Write Splitting
● Analyze statements
○ Send where appropriate
● Write statements to master
● Read statements to some slave
[Diagram: Client → MaxScale → Master, Slave, Slave; "> SELECT * ..." is routed to a Slave]
Read Write Splitting
● Analyze statements
○ Send where appropriate
● Write statements to master
● Read statements to some slave
● Session statements to all servers
[Diagram: Client → MaxScale → Master, Slave, Slave; "> SET autocommit" is routed to all servers]
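A minimal sketch of the routing rule described above. classify() is a hypothetical helper written only for illustration; it is not MaxScale's query classifier, which actually parses the statement rather than looking at the first keyword.

#include <strings.h>

typedef enum { TARGET_MASTER, TARGET_SLAVE, TARGET_ALL } target_t;

// Hypothetical, simplified classification: illustration only.
target_t classify(const char* stmt)
{
    if (strncasecmp(stmt, "SELECT", 6) == 0)
    {
        return TARGET_SLAVE;   // read statements go to some slave
    }
    if (strncasecmp(stmt, "SET", 3) == 0)
    {
        return TARGET_ALL;     // session statements go to all servers
    }
    return TARGET_MASTER;      // everything else is treated as a write
}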
Architecture
Static Architecture
● Plugin APIs and example modules:
○ Protocol: MariaDBClient, ...
○ Authenticator: MySQLAuth, ...
○ Filter: DBFwfilter, ...
○ Router: ReadWriteSplit, ...
○ Query Classifier: qc_sqlite
○ Monitor: MariaDBMon
Core
● Threading
● Logging
● Plugin loading
● Lifetime management
● REST-API
● Admin Functionality
● etc.
Data Flow
[Diagram: Client → Protocol → Filter → Filter → Router → Protocol → Servers, everything but the Client and the Servers inside MaxScale. The Monitor monitors the servers and updates the Server State, which is used when routing; the Router uses the Query Classifier.]
Code
MaxScale: 147 kloc
Core: 51 kloc
Authenticators: 5 kloc
Filters: 27 kloc
Routers: 43 kloc
Monitors: 12 kloc
Protocols: 9 kloc
Modules: 96 kloc
For comparison:
● MariaDB server: 2500 kloc
Threading Architecture
● MaxScale is essentially a router.
○ It receives SQL packets from numerous clients and dispatches them to one or more servers.
○ Waits for responses from one or more servers, and sends a response to the client.
○ Number of clients may be large.
● Basic alternatives:
○ One thread per client.
○ Asynchronous I/O and fixed number of threads.
● The reason is lost in the mists of history, but MaxScale is implemented using the
latter approach.
Asynchronous I/O in Principle.
● Basically:
● When there is no activity, the thread is idle.
● When something happens, the thread wakes up and handles the events.
○ May involve initiating asynchronous I/O whose result is later reported as an event.
● Once the event has been handled, the thread returns to waiting for events.
setup();
while (true)
{
io_events events = wait_for_io();
handle_events(events);
}
● Create some file descriptors
● Make them non-blocking
● Add them to some waiting mechanism.
● Wait for something to happen to those
file descriptors
● Handle whatever happened
So How do You Wait on Events?
● select
○ The original mechanism; it has been around since the beginning of time. Fixed-size limit on the
number of descriptors. O(N).
● poll
○ No limit on number of descriptors. O(N).
● epoll
○ More complex to set up. No limit on number of descriptors. All changes via system calls, i.e.
thread safe. O(1).
Epoll is not a better poll, it’s different.
MaxScale epoll Setup
● At startup, socket creation is triggered by the presence of listeners.
[TheListener]
type=listener
service=TheService
...
port=4009
epoll_fd = epoll_create(...);
so = socket(...);
...
listen(so);
...
struct epoll_event ev;
ev.events = events;
ev.data.ptr = data;
epoll_ctl(epoll_fd, EPOLL_CTL_ADD, so, &ev);
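For concreteness, a self-contained sketch of the same listener setup with real calls; make_listener is an illustrative name, error handling is omitted, and port 4009 would come from the configuration above.

#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/socket.h>

// Create a non-blocking listening socket and register it with the epoll
// instance. Illustrative sketch only; error handling omitted.
static int make_listener(int epoll_fd, uint16_t port)
{
    int so = socket(AF_INET, SOCK_STREAM, 0);
    fcntl(so, F_SETFL, fcntl(so, F_GETFL, 0) | O_NONBLOCK);

    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    bind(so, (struct sockaddr*)&addr, sizeof(addr));
    listen(so, SOMAXCONN);

    struct epoll_event ev = { 0 };
    ev.events = EPOLLIN;
    ev.data.fd = so;
    epoll_ctl(epoll_fd, EPOLL_CTL_ADD, so, &ev);

    return so;
}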
Client Connection
[Diagram: a client connecting to MaxScale]
while (!shutdown)
{
struct epoll_event events[MAX_EVENTS];
int ndfs = epoll_wait(epoll_fd, events, ...);
for (int i = 0; i < ndfs; ++i)
{
epoll_event* event = &events[i];
handle_event(event);
}
}
Client Connection, cont’d
void handle_event(struct epoll_event* event)
{
if (event->events & EPOLLIN)
{
if (descriptor was a listening socket)
{
handle_accept(event);
}
else
{
handle_read(event);
}
}
if (event->events & ...)
{
...
}
}
Handle Accept
void handle_accept(struct epoll_event* event)
{
for (all servers in service)
{
int so;
connect each server;
struct epoll_event ev;
ev.events = events;
ev.data.ptr = data;
epoll_ctl(epoll_fd, EPOLL_CTL_ADD, so, &ev);
}
}
[TheService]
type=service
service=readwritesplit
servers=server,server2
...
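The backend connects are non-blocking as well: connect() typically returns -1 with errno set to EINPROGRESS, the socket is added to the epoll instance, and completion is later reported as an EPOLLOUT event. A sketch under those assumptions; connect_backend is an illustrative name, not MaxScale's API.

#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/socket.h>

// Start a non-blocking connect to one backend server and register the socket;
// the completed connect is reported later as an EPOLLOUT event.
static int connect_backend(int epoll_fd, const char* ip, uint16_t port, void* data)
{
    int so = socket(AF_INET, SOCK_STREAM, 0);
    fcntl(so, F_SETFL, fcntl(so, F_GETFL, 0) | O_NONBLOCK);

    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);

    if (connect(so, (struct sockaddr*)&addr, sizeof(addr)) == -1 && errno != EINPROGRESS)
    {
        return -1; // immediate failure
    }

    struct epoll_event ev = { 0 };
    ev.events = EPOLLIN | EPOLLOUT;
    ev.data.ptr = data;
    epoll_ctl(epoll_fd, EPOLL_CTL_ADD, so, &ev);

    return so;
}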
Handle Read
void handle_read(struct epoll_event* event)
{
char buffer[MAX_SIZE];
read(sd, buffer, sizeof(buffer));
figure out what to do with the data
// - wait for more
// - authenticate
// - send to master, send to slaves, send to all
// ...
...
}
● When the servers reply,
the response will be
handled in a similar
manner.
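Because the descriptors are non-blocking, the read side is typically a loop that drains the socket until read() reports EAGAIN. A minimal sketch of that pattern; process_data is a hypothetical placeholder, not part of MaxScale.

#include <errno.h>
#include <stddef.h>
#include <unistd.h>

// Hypothetical handler for whatever was read; not part of MaxScale's API.
void process_data(const char* data, size_t len);

// Drain a non-blocking socket: returns 0 on EOF, -1 on error, 1 when all
// currently available data has been consumed.
static int drain_socket(int sd)
{
    char buffer[4096];

    for (;;)
    {
        ssize_t n = read(sd, buffer, sizeof(buffer));

        if (n > 0)
        {
            process_data(buffer, (size_t)n);
        }
        else if (n == 0)
        {
            return 0;  // peer closed the connection
        }
        else if (errno == EAGAIN || errno == EWOULDBLOCK)
        {
            return 1;  // nothing more to read right now
        }
        else if (errno != EINTR)
        {
            return -1; // real error
        }
        // on EINTR: retry
    }
}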
Binding Things Together
[Diagram: a DCB is the representation of a connection/descriptor. The client connection is a DCB tied to a Session; the Session refers, across the plugin boundary, to an MXS_ROUTER_SESSION (e.g. RWSplitSession), which holds 1..* DCBs for the server connections.]
● A Session object ties together the client connection and all server
connections associated with that client connection.
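As a rough sketch of the ownership described above; the field names are illustrative and the real MaxScale structures contain far more.

// Illustrative only; not MaxScale's actual definitions.
typedef struct dcb
{
    int fd;                        // the socket descriptor this DCB wraps
    struct session* session;       // the session this connection belongs to
} DCB;

typedef struct rwsplit_session     // behind the plugin boundary (an MXS_ROUTER_SESSION)
{
    DCB** backend_dcbs;            // 1..* connections to the backend servers
    int   n_backends;
} RWSplitSession;

typedef struct session
{
    DCB* client_dcb;               // the client connection
    RWSplitSession* router_session;
} SESSION;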
MaxScale 1.0 - 2.0
epoll_fd = epoll_create(...);
Thread 1
while (!shutdown)
{
epoll_wait(epoll_fd, ...);
...
}
Thread 2
while (!shutdown)
{
epoll_wait(epoll_fd, ...);
...
}
Thread 3
while (!shutdown)
{
epoll_wait(epoll_fd, ...);
...
}
Problems with one epoll Instance
● There are multiple socket descriptors for each client session.
○ One for the client connection.
○ One for every backend server.
● It is possible that an event on each of those is concurrently handled by as
many threads.
○ The client has issued a request that has been sent to all servers.
○ A response arrives from each server at the same time as the client closes its connection.
● Session data ends up being manipulated by many threads concurrently.
Implications
● Lots of locks and lots of locking were needed.
○ Primarily spinlocks, intended to be held for brief periods of time.
● Events for a socket could be reported to one thread while another thread was still
handling earlier events for that same socket.
○ Event extraction and event handling had to be decoupled => locking.
● Very hard to be sure no deadlocks could occur.
● Very hard to be sure no races were possible.
● Very hard to program, as it was not always obvious what could and what
could not occur concurrently.
● The locks started to hurt under high load and lots of clients.
MaxScale 2.1
Thread 1
while (!shutdown)
{
epoll_wait(epoll_fd, ...);
...
}
epoll_fd = epoll_create(...);
Thread 2
while (!shutdown)
{
epoll_wait(epoll_fd, ...);
...
}
epoll_fd = epoll_create(...);
Thread 3
while (!shutdown)
{
epoll_wait(epoll_fd, ...);
...
}
epoll_fd = epoll_create(...);
MaxScale 2.1
● Each thread has an epoll instance of its own.
● When a client connects:
○ The thread that handles the client will also handle all communication with all backends on
behalf of that client.
○ All descriptors belonging to a particular client session are only added to the epoll instance of
the thread in question.
● Listening sockets are still an exception; added to the poll set of all threads.
○ After accepting, the client socket is then moved in a round-robin fashion to some thread.
● Huge impact on the performance.
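One common way to implement that kind of hand-off, shown here only as an illustration and not necessarily MaxScale's actual mechanism: each worker watches a pipe in its own epoll instance, and the accepting thread writes the new client descriptor into the pipe of the worker chosen by round-robin. Once adopted, the descriptor is only ever touched by that worker.

#include <sys/epoll.h>
#include <unistd.h>

// Per-worker state: its own epoll instance plus a pipe it watches.
typedef struct worker
{
    int epoll_fd;
    int pipe_rd;   // watched by this worker's epoll instance
    int pipe_wr;   // written to by whichever thread accepted the client
} worker_t;

// Called by the accepting thread: hand the client fd to the chosen worker.
static void assign_client(worker_t* w, int client_fd)
{
    write(w->pipe_wr, &client_fd, sizeof(client_fd));
}

// Called by the worker when its pipe becomes readable: adopt the client fd
// into its own epoll instance.
static void adopt_client(worker_t* w)
{
    int client_fd;
    if (read(w->pipe_rd, &client_fd, sizeof(client_fd)) == sizeof(client_fd))
    {
        struct epoll_event ev = { 0 };
        ev.events = EPOLLIN;
        ev.data.fd = client_fd;
        epoll_ctl(w->epoll_fd, EPOLL_CTL_ADD, client_fd, &ev);
    }
}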
MaxScale 2.2
● Remove the last traces of inter-thread communication.
● Basic problem: How to distribute new connections among existing threads?
● New connections should be distributed across different threads in a roughly
even manner.
● All ports must be treated in the same way.
○ So a particular port cannot e.g. be permanently assigned to a specific thread.
epoll
● Two ways events can be triggered:
○ Edge-triggered, reported when something has happened.
○ Level-triggered, reported when something is available.
[Diagram: an inactive/active signal over time; edge-triggered reports an event at the transition to active, level-triggered keeps reporting events for as long as the signal is active.]
Implication of edge/level triggered epoll
1. The file descriptor that represents the read side of a pipe
(rfd) is registered on the epoll instance.
2. A pipe writer writes 2 kB of data on the write side of the pipe.
3. A call to epoll_wait(2) is done that will return rfd as a
ready file descriptor.
4. The pipe reader reads 1 kB of data from rfd.
5. A call to epoll_wait(2) is done.
● If rfd was added using EPOLLET (edge-triggered) then the call at 5 will hang.
● EPOLLET requires
○ Non-blocking descriptors.
○ Events can be waited for (epoll_wait) only after read or write returns EAGAIN.
Example straight from
$ man epoll
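The scenario is easy to reproduce with a small standalone program; this is a sketch that uses a zero timeout so that step 5 returns immediately instead of hanging.

#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

// Reproduce the man-page scenario: with EPOLLET the second epoll_wait() sees
// no event even though 1 kB is still buffered in the pipe.
int main(void)
{
    int p[2];
    pipe(p);

    int ep = epoll_create1(0);
    struct epoll_event ev = { 0 };
    ev.events = EPOLLIN | EPOLLET;   // drop EPOLLET to see the level-triggered behaviour
    ev.data.fd = p[0];
    epoll_ctl(ep, EPOLL_CTL_ADD, p[0], &ev);

    char buf[2048] = { 0 };
    write(p[1], buf, sizeof(buf));   // step 2: 2 kB written

    struct epoll_event out;
    printf("first wait: %d event(s)\n", epoll_wait(ep, &out, 1, 0));  // step 3: reports rfd

    read(p[0], buf, 1024);           // step 4: only 1 kB read

    // Step 5: with EPOLLET this reports 0 events; level-triggered would still
    // report the remaining 1 kB.
    printf("second wait: %d event(s)\n", epoll_wait(ep, &out, 1, 0));
    return 0;
}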
Two Kinds of Descriptors
● Listening sockets that all threads should handle.
● Sockets related to a client session that only a particular thread should handle.
● What’s the problem with the listening sockets being in the epoll instance of
each thread (as in MaxScale 2.1)?
○ Also the listening socket must be non-blocking and added using EPOLLET.
○ A thread that returns from epoll_wait must call accept on the listening socket until it returns
EWOULDBLOCK.
○ So, either we must accept that a thread suddenly may have to deal with a large number of
clients (if there is a sudden surge) or a thread must be able to offload an accepted client
socket to another thread.
What we Want
● A thread should not need to accept more than one client at a time.
○ That is, EPOLLET cannot be used for the listening sockets.
● We should not have to manipulate the epoll instance of a thread from outside the
thread.
○ Listening sockets are a global resource while sockets related to a client session are thread
local resources.
○ Not having to do that also makes it easier to increase and decrease the number of
threads at runtime.
But epoll instances can also be waited for.
● If an epoll file descriptor has events waiting, then it will indicate that as
being readable.
● So,
○ if a file descriptor is added to an epoll instance, and
○ the descriptor of that epoll instance is added to a second epoll instance, then
○ when something happens to the file descriptor, a thread blocked in an epoll_wait call on the
second epoll instance will return.
● If the thread now calls epoll_wait on the first epoll instance, it will return
with the actual file descriptor on which some change has occurred.
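This nesting is easy to verify with a standalone sketch: a pipe's read end is registered in an inner epoll instance, the inner epoll descriptor is registered in an outer one, and a write to the pipe wakes up a wait on the outer instance.

#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

int main(void)
{
    int p[2];
    pipe(p);

    int inner = epoll_create1(0);
    int outer = epoll_create1(0);

    struct epoll_event ev = { 0 };
    ev.events = EPOLLIN;
    ev.data.fd = p[0];
    epoll_ctl(inner, EPOLL_CTL_ADD, p[0], &ev);   // pipe read end -> inner

    ev.data.fd = inner;
    epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev);  // inner epoll fd -> outer

    write(p[1], "x", 1);                          // make the pipe readable

    struct epoll_event out;
    epoll_wait(outer, &out, 1, -1);               // wakes up: out.data.fd == inner
    int n = epoll_wait(out.data.fd, &out, 1, 0);  // returns the pipe read end
    printf("inner reported %d event(s), fd %d\n", n, out.data.fd);
    return 0;
}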
MaxScale 2.2
Thread N
l_fd = epoll_create(...);
struct epoll_event ev;
ev.events = EPOLLIN; // NOT EPOLLET
epoll_ctl(l_fd, EPOLL_CTL_ADD, g_fd, &ev);
while (!shutdown)
{
epoll_wait(l_fd, ...);
...
}
g_fd = epoll_create(...);
void add_listening_socket(int sd)
{
struct epoll_event ev;
ev.events = EPOLLIN; // NOT EPOLLET
epoll_ctl(g_fd, EPOLL_CTL_ADD, sd, &ev);
}
void add_client_socket(int l_fd, int sd)
{
struct epoll_event ev;
ev.events = .. | EPOLLET;
epoll_ctl(l_fd, EPOLL_CTL_ADD, sd, &ev);
}
Client Connecting
typedef void (*handler_t)(epoll_event*);
while (!shutdown)
{
struct epoll_event events[MAX_EVENTS];
int ndfs = epoll_wait(epoll_fd, events, ...);
for (int i = 0; i < ndfs; ++i)
{
epoll_event* event = &events[i];
handler_t handler = get_handler(event);
handler(event);
}
}
void handle_epoll_event(epoll_event* event)
{
struct epoll_event events[1];
int fd = get_fd(event); // fd == g_fd
epoll_wait(fd, events, 1, 0); // 0 timeout.
epoll_event* event = &events[0];
handler_t handler = get_handler(event);
handler(event);
}
void handle_accept_event(epoll_event* event)
{
int sd = get_fd(event);
int cd;
while ((cd = accept(sd, NULL, NULL)) != -1)
{
...
add_client_socket(cd, ...);
}
}
get_handler(event) and get_fd(event)?
typedef union epoll_data {
void *ptr;
int fd;
uint32_t u32;
uint64_t u64;
} epoll_data_t;
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
struct epoll_event {
uint32_t events;
epoll_data_t data;
};
● When adding a descriptor to an epoll instance you can associate a value.
● When something occurs you get that value back.
○ If you do not store the fd, you do not know what fd the event relates to.
○ If you store the fd, you cannot store anything else.
Storing More Context With an Event
typedef uint32_t (*mxs_poll_handler_t)(struct mxs_poll_data* data, int wid, uint32_t events);
typedef struct mxs_poll_data {
mxs_poll_handler_t handler; /*< Handler for this particular kind of mxs_poll_data. */
} MXS_POLL_DATA;
typedef struct dcb {
MXS_POLL_DATA poll;
int fd;
...
} DCB;
static uint32_t dcb_poll_handler(MXS_POLL_DATA *data, ...) {
DCB *dcb = (DCB*)data;
...
};
DCB* create_dcb(...)
{
DCB* dcb = alloc_dcb(...);
dcb->poll.handler = dcb_poll_handler;
return dcb;
}
class Worker : private MXS_POLL_DATA {
public:
Worker() {
MXS_POLL_DATA::handler = &Worker::epoll_handler;
...
};
static uint32_t epoll_handler(MXS_POLL_DATA* data, ...) {
return ((Worker*)data)->handler(...);
}
int fd;
};
Adding and Extracting Events
void poll_add_fd(int fd, uint32_t events, MXS_POLL_DATA* pData)
{
struct epoll_event ev;
ev.events = events;
ev.data.ptr = pData;
epoll_ctl(m_epoll_fd, EPOLL_CTL_ADD, fd, &ev);
}
DCB* dcb = ...;
poll_add_fd(dcb->fd, ..., &dcb->poll);
Worker* pWorker = ...;
poll_add_fd(pWorker->fd, ..., pWorker);
while (!should_shutdown)
{
struct epoll_event events[MAX_EVENTS];
int n = epoll_wait(epoll_fd, events, MAX_EVENTS, -1);
for (int i = 0; i < n; ++i)
{
MXS_POLL_DATA* data = (MXS_POLL_DATA*)events[i].data.ptr;
data->handler(data, ..., events[i].events);
}
}
Each worker thread sits in this loop.
Performance
MaxScale 2.0.5 (up to)
Hardware: Two physical servers, 16 cores / 32 hyperthreads,
128 GB RAM and an SSD drive, connected over a GbE LAN. One
runs MaxScale and sysbench, the other runs 4 MariaDB servers set up
as a Master and 3 Slaves.
Workload: OLTP read-only, 100 simple selects per iteration, no
transaction boundaries.
● direct: Sysbench uses all
servers directly in
round-robin fashion.
● rcr: MaxScale readconnroute
router.
● rws: MaxScale readwritesplit
router.
MaxScale 2.1.0
● The architectural change that
allowed the removal of a large
number of locks provided a
dramatic improvement for
readconnroute.
● No change for readwritesplit.
● With a small number of clients the
newly introduced cache improved the
performance; with a large number of
clients it had no impact.
Query Classification
● When ReadWriteSplitting, MaxScale must parse the statement.
○ Does it need to be sent to the master, to some slave or to all servers?
● The classification is done using a significantly modified parser from sqlite.
● In each thread, the parsing is done using a thread specific in-memory
database.
[Diagram: Thread 1 and Thread 2, each with its own sqlite instance.]
● No shared data, should be no
contention.
● However, sqlite had not been built
with the right flags, so there was
serialization going on.
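For illustration, sqlite's locking behaviour is controlled by how the library is built and how connections are opened. The snippet below uses the standard sqlite3 API to show the kind of knobs involved; it is not MaxScale's actual build configuration.

#include <sqlite3.h>

// Illustration of sqlite threading knobs; not MaxScale's actual configuration.
// At compile time, -DSQLITE_THREADSAFE=2 selects "multi-thread" mode instead
// of the default "serialized" mode, which wraps every API call in a mutex.
sqlite3* open_thread_local_db(void)
{
    sqlite3* db = NULL;

    // SQLITE_OPEN_NOMUTEX: this connection is used by a single thread only,
    // so the per-connection mutex can be skipped.
    sqlite3_open_v2(":memory:", &db,
                    SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE | SQLITE_OPEN_NOMUTEX,
                    NULL);
    return db;
}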
Data Collection
● While parsing a statement, a fair amount of information was collected.
○ What tables and columns are accessed. What functions are called. Etc.
● Allocating memory for that information did not come without a cost.
○ Basically only the firewall filter uses that information.
● Now no information is collected by default; a filter that is interested in
the information must ask for it explicitly.
qc_parse_result_t parse_result = qc_parse(stmt, QC_COLLECT_ALL);
Custom Parser
● Many routers and filters need to know whether a transaction is ongoing.
● Up until MaxScale 2.1.1 that implied that the statements had to be parsed
using the query classifier.
● For MaxScale 2.1.2 we introduced a custom parser that only detects
statements affecting the autocommit mode & transaction state.
○ Much faster than full parsing.
● In MaxScale 2.3 we will rely upon the server telling the autocommit mode &
transaction state.
○ Implies that changes performed via prepared statements or functions will also be detected.
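A minimal sketch of the idea behind such a parser, written only for illustration and far simpler than MaxScale's actual one: look at the first keyword of the statement and classify only those statements that can affect the transaction state or the autocommit mode.

#define _GNU_SOURCE // for strcasestr
#include <string.h>
#include <strings.h>

typedef enum { TRX_NONE, TRX_BEGIN, TRX_END, AUTOCOMMIT_CHANGE } trx_effect_t;

// Hypothetical, simplified detector; illustration only.
trx_effect_t detect_trx_effect(const char* sql)
{
    while (*sql == ' ' || *sql == '\t' || *sql == '\n')
    {
        ++sql; // skip leading whitespace
    }

    if (strncasecmp(sql, "BEGIN", 5) == 0 ||
        strncasecmp(sql, "START TRANSACTION", 17) == 0)
    {
        return TRX_BEGIN;
    }
    if (strncasecmp(sql, "COMMIT", 6) == 0 ||
        strncasecmp(sql, "ROLLBACK", 8) == 0)
    {
        return TRX_END;
    }
    if (strncasecmp(sql, "SET", 3) == 0 && strcasestr(sql, "AUTOCOMMIT") != NULL)
    {
        return AUTOCOMMIT_CHANGE;
    }
    return TRX_NONE;
}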
MaxScale 2.1.3 versus 2.0.5
Cache
● The cache was introduced in 2.1.0 but the
performance was less than satisfactory.
● Problem was caused by parsing.
○ The cache parsed all statements to detect non-cacheable statements.
○ E.g. SELECT CURRENT_DATE();
● Added possibility to declare that all SELECT statements are cacheable.
[TheCache]
type=filter
module=cache
...
selects=assume_cacheable
● Huge impact on the performance.
MaxScale ReadWriteSplit
● In the best case the performance of MaxScale 2.1.3 is three times that of
MaxScale 2.0.5, and eight times if caching is used.
The Importance of Early Customer Feedback
● MaxScale caches user information so that it can authenticate users.
● In MaxScale 2.2.0 the user database was shared between threads.
● Worked fine when connection attempts were relatively rare and sessions
were relatively long.
[Diagram: Thread 1 and Thread 2 sharing a single Users database.]
● If a user is not found, any thread may refresh the user data from the server and
update the database.
● All access must use locks.
User Report
● With MaxScale 2.2.0 Beta a user reported that he got only 6000 qps.
function event(thread_id)
db_connect()
rs = db_query("select 1;")
db_disconnect()
● Reason turned out to be thread contention in relation to the user database.
[Diagram: Thread 1 and Thread 2, each with its own Users database.]
● We split it, so that each thread has its own
user database.
“I just tested the same case, got 361437
queries/second, I think it works for us”
What about MaxScale 2.2.2
● No real difference, which is good, because 2.2 does more than 2.1.
○ E.g. must catch “SET SESSION SQL_MODE=ORACLE”
Summary
Summary
● MaxScale 0.1 -> 2.0
○ One epoll instance that all worker threads wait on.
○ Any thread can handle anything.
○ Lots of locking needed, and lots of potential for hard-to-resolve races.
○ Performance problems.
● MaxScale 2.1
○ One epoll instance per worker thread.
○ Any thread can accept, but must distribute the client socket to a particular thread.
○ All activity related to a particular session handled by one thread.
○ Significantly reduced need for locking and race risk effectively eliminated.
○ Good performance.
● MaxScale 2.2
○ One epoll instance for “shared” descriptors (listening sockets).
○ One epoll instance per worker thread.
○ All activity related to a particular session handled by one thread.
○ Even less locking needed.
○ Good performance.
Where do We Go From Here?
● The architectural evolution of MaxScale can be summarized as:
○ Decrease the explicit coupling between the worker threads.
■ If that leads to duplicate work or increased memory usage, fine.
● We are likely to continue moving in that direction, so that conceptually we
will end up running N “mini”-MaxScales in parallel,
completely oblivious of each other.
● That would also make it easy to allow starting and stopping threads
while MaxScale is running.