MaxScale
Architecture Evolution
Johan Wikman
Lead Developer
Overview
● What is MaxScale
● Architecture
● Performance
● Summary
What is MaxScale
What is MaxScale
● Cluster Abstraction
○ Hides the complexity.
○ Load Balancer
○ High Availability
○ Easier Maintenance
● And more
○ Firewall
○ Data masking
○ Logging
○ Cache
○ ...
[Diagram: Client → MaxScale → Master, Slave, Slave]
Read Write Splitting
● Analyze statements
○ Send where appropriate
[Diagram: Client → MaxScale → Master, Slave, Slave]
Read Write Splitting
● Analyze statements
○ Send where appropriate
● Write statements to master
[Diagram: Client → MaxScale → Master, Slave, Slave; "> INSERT INTO ..." is routed to the Master]
Read Write Splitting
● Analyze statements
○ Send where appropriate
● Write statements to master
● Read statements to some slave
[Diagram: Client → MaxScale → Master, Slave, Slave; "> SELECT * ..." is routed to a Slave]
Read Write Splitting
● Analyze statements
○ Send where appropriate
● Write statements to master
● Read statements to some slave
● Session statements to all servers
[Diagram: Client → MaxScale → Master, Slave, Slave; "> SET autocommit" is routed to all servers]
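A minimal sketch of the routing rule described above. classify() is a hypothetical helper written only for illustration; it is not MaxScale's query classifier, which actually parses the statement rather than looking at the first keyword.

#include <strings.h>

typedef enum { TARGET_MASTER, TARGET_SLAVE, TARGET_ALL } target_t;

// Hypothetical, simplified classification: illustration only.
target_t classify(const char* stmt)
{
    if (strncasecmp(stmt, "SELECT", 6) == 0)
    {
        return TARGET_SLAVE;   // read statements go to some slave
    }
    if (strncasecmp(stmt, "SET", 3) == 0)
    {
        return TARGET_ALL;     // session statements go to all servers
    }
    return TARGET_MASTER;      // everything else is treated as a write
}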
Architecture
Static Architecture
● Plugin APIs and example modules:
○ Protocol: MariaDBClient, ...
○ Authenticator: MySQLAuth, ...
○ Filter: DBFwfilter, ...
○ Router: ReadWriteSplit, ...
○ Query Classifier: qc_sqlite
○ Monitor: MariaDBMon
Core
● Threading
● Logging
● Plugin loading
● Lifetime management
● REST-API
● Admin Functionality
● etc.
Data Flow
[Diagram: Client → Protocol → Filter → Filter → Router → Protocol → Servers, everything but the Client and the Servers inside MaxScale. The Monitor monitors the servers and updates the Server State, which is used when routing; the Router uses the Query Classifier.]
Code
MaxScale: 147 kloc
Core: 51 kloc
Authenticators: 5 kloc
Filters: 27 kloc
Routers: 43 kloc
Monitors: 12 kloc
Protocols: 9 kloc
Modules: 96 kloc
For comparison:
● MariaDB server: 2500 kloc
Threading Architecture
● MaxScale is essentially a router.
○ It receives SQL packets from numerous clients and dispatches them to one or more servers.
○ Waits for responses from one or more servers, and sends a response to the client.
○ Number of clients may be large.
● Basic alternatives:
○ One thread per client.
○ Asynchronous I/O and fixed number of threads.
● The reason is lost in the mists of history, but MaxScale is implemented using the
latter approach.
Asynchronous I/O in Principle.
● Basically:
● When there is no activity, the thread is idle.
● When something happens, the thread wakes up and handles the events.
○ May involve initiating asynchronous I/O whose result is later reported as an event.
● Once the event has been handled, the thread returns to waiting for events.
setup();
while (true)
{
io_events events = wait_for_io();
handle_events(events);
}
● Create some file descriptors
● Make them non-blocking
● Add them to some waiting mechanism.
● Wait for something to happen to those
file descriptors
● Handle whatever happened
So How do You Wait on Events?
● select
○ The original mechanism; it has been around since the beginning of time. Fixed-size limit on the
number of descriptors. O(N).
● poll
○ No limit on number of descriptors. O(N).
● epoll
○ More complex to set up. No limit on number of descriptors. All changes via system calls, i.e.
thread safe. O(1).
Epoll is not a better poll, it’s different.
MaxScale epoll Setup
● At startup, socket creation is triggered by the presence of listeners.
[TheListener]
type=listener
service=TheService
...
port=4009
epoll_fd = epoll_create(...);
so = socket(...);
...
listen(so);
...
struct epoll_event ev;
ev.events = events;
ev.data.ptr = data;
epoll_ctl(epoll_fd, EPOLL_CTL_ADD, so, &ev);
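For concreteness, a self-contained sketch of the same listener setup with real calls; make_listener is an illustrative name, error handling is omitted, and port 4009 would come from the configuration above.

#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/socket.h>

// Create a non-blocking listening socket and register it with the epoll
// instance. Illustrative sketch only; error handling omitted.
static int make_listener(int epoll_fd, uint16_t port)
{
    int so = socket(AF_INET, SOCK_STREAM, 0);
    fcntl(so, F_SETFL, fcntl(so, F_GETFL, 0) | O_NONBLOCK);

    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    bind(so, (struct sockaddr*)&addr, sizeof(addr));
    listen(so, SOMAXCONN);

    struct epoll_event ev = { 0 };
    ev.events = EPOLLIN;
    ev.data.fd = so;
    epoll_ctl(epoll_fd, EPOLL_CTL_ADD, so, &ev);

    return so;
}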
Client Connection
[Diagram: a client connecting to MaxScale]
while (!shutdown)
{
struct epoll_event events[MAX_EVENTS];
int ndfs = epoll_wait(epoll_fd, events, ...);
for (int i = 0; i < ndfs; ++i)
{
epoll_event* event = &events[i];
handle_event(event);
}
}
Client Connection, cont’d
void handle_event(struct epoll_event* event)
{
if (event->events & EPOLLIN)
{
if (descriptor was a listening socket)
{
handle_accept(event);
}
else
{
handle_read(event);
}
}
if (event->events & ...)
{
...
}
}
Handle Accept
void handle_accept(struct epoll_event* event)
{
for (all servers in service)
{
int so;
connect each server;
struct epoll_event ev;
ev.events = events;
ev.data.ptr = data;
epoll_ctl(epoll_fd, EPOLL_CTL_ADD, so, &ev);
}
}
[TheService]
type=service
service=readwritesplit
servers=server,server2
...
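The backend connects are non-blocking as well: connect() typically returns -1 with errno set to EINPROGRESS, the socket is added to the epoll instance, and completion is later reported as an EPOLLOUT event. A sketch under those assumptions; connect_backend is an illustrative name, not MaxScale's API.

#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/socket.h>

// Start a non-blocking connect to one backend server and register the socket;
// the completed connect is reported later as an EPOLLOUT event.
static int connect_backend(int epoll_fd, const char* ip, uint16_t port, void* data)
{
    int so = socket(AF_INET, SOCK_STREAM, 0);
    fcntl(so, F_SETFL, fcntl(so, F_GETFL, 0) | O_NONBLOCK);

    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);

    if (connect(so, (struct sockaddr*)&addr, sizeof(addr)) == -1 && errno != EINPROGRESS)
    {
        return -1; // immediate failure
    }

    struct epoll_event ev = { 0 };
    ev.events = EPOLLIN | EPOLLOUT;
    ev.data.ptr = data;
    epoll_ctl(epoll_fd, EPOLL_CTL_ADD, so, &ev);

    return so;
}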
Handle Read
void handle_read(struct epoll_event* event)
{
char buffer[MAX_SIZE];
read(sd, buffer, sizeof(buffer));
figure out what to do with the data
// - wait for more
// - authenticate
// - send to master, send to slaves, send to all
// ...
...
}
● When the servers reply,
the response will be
handled in a similar
manner.
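Because the descriptors are non-blocking, the read side is typically a loop that drains the socket until read() reports EAGAIN. A minimal sketch of that pattern; process_data is a hypothetical placeholder, not part of MaxScale.

#include <errno.h>
#include <stddef.h>
#include <unistd.h>

// Hypothetical handler for whatever was read; not part of MaxScale's API.
void process_data(const char* data, size_t len);

// Drain a non-blocking socket: returns 0 on EOF, -1 on error, 1 when all
// currently available data has been consumed.
static int drain_socket(int sd)
{
    char buffer[4096];

    for (;;)
    {
        ssize_t n = read(sd, buffer, sizeof(buffer));

        if (n > 0)
        {
            process_data(buffer, (size_t)n);
        }
        else if (n == 0)
        {
            return 0;  // peer closed the connection
        }
        else if (errno == EAGAIN || errno == EWOULDBLOCK)
        {
            return 1;  // nothing more to read right now
        }
        else if (errno != EINTR)
        {
            return -1; // real error
        }
        // on EINTR: retry
    }
}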
Binding Things Together
[Diagram: a DCB is the representation of a connection/descriptor. The client connection is a DCB tied to a Session; the Session refers, across the plugin boundary, to an MXS_ROUTER_SESSION (e.g. RWSplitSession), which holds 1..* DCBs for the server connections.]
● A Session object ties together the client connection and all server
connections associated with that client connection.
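As a rough sketch of the ownership described above; the field names are illustrative and the real MaxScale structures contain far more.

// Illustrative only; not MaxScale's actual definitions.
typedef struct dcb
{
    int fd;                        // the socket descriptor this DCB wraps
    struct session* session;       // the session this connection belongs to
} DCB;

typedef struct rwsplit_session     // behind the plugin boundary (an MXS_ROUTER_SESSION)
{
    DCB** backend_dcbs;            // 1..* connections to the backend servers
    int   n_backends;
} RWSplitSession;

typedef struct session
{
    DCB* client_dcb;               // the client connection
    RWSplitSession* router_session;
} SESSION;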
MaxScale 1.0 - 2.0
epoll_fd = epoll_create(...);
Thread 1
while (!shutdown)
{
epoll_wait(epoll_fd, ...);
...
}
Thread 2
while (!shutdown)
{
epoll_wait(epoll_fd, ...);
...
}
Thread 3
while (!shutdown)
{
epoll_wait(epoll_fd, ...);
...
}
Problems with one epoll Instance
● There are multiple socket descriptors for each client session.
○ One for the client connection.
○ One for every backend server.
● It is possible that an event on each of those is concurrently handled by as
many threads.
○ The client has issued a request that has been sent to all servers.
○ A response arrives from each server at the same time as the client closes its connection.
● Session data ends up being manipulated by many threads concurrently.
Implications
● Lots of locks and lots of locking were needed.
○ Primarily spinlocks, intended to be held for brief periods of time.
● Events for a socket could be reported to one thread while another thread was still
handling earlier events for that same socket.
○ Event extraction and event handling had to be decoupled => locking.
● Very hard to be sure no deadlocks could occur.
● Very hard to be sure no races were possible.
● Very hard to program, as it was not always obvious what could and what
could not occur concurrently.
● The locks started to hurt under high load and lots of clients.
MaxScale 2.1
Thread 1
while (!shutdown)
{
epoll_wait(epoll_fd, ...);
...
}
epoll_fd = epoll_create(...);
Thread 2
while (!shutdown)
{
epoll_wait(epoll_fd, ...);
...
}
epoll_fd = epoll_create(...);
Thread 3
while (!shutdown)
{
epoll_wait(epoll_fd, ...);
...
}
epoll_fd = epoll_create(...);
MaxScale 2.1
● Each thread has an epoll instance of its own.
● When a client connects:
○ The thread that handles the client will also handle all communication with all backends on
behalf of that client.
○ All descriptors belonging to a particular client session are only added to the epoll instance of
the thread in question.
● Listening sockets are still an exception; added to the poll set of all threads.
○ After accepting, the client socket is then moved in a round-robin fashion to some thread.
● Huge impact on the performance.
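One common way to implement that kind of hand-off, shown here only as an illustration and not necessarily MaxScale's actual mechanism: each worker watches a pipe in its own epoll instance, and the accepting thread writes the new client descriptor into the pipe of the worker chosen by round-robin. Once adopted, the descriptor is only ever touched by that worker.

#include <sys/epoll.h>
#include <unistd.h>

// Per-worker state: its own epoll instance plus a pipe it watches.
typedef struct worker
{
    int epoll_fd;
    int pipe_rd;   // watched by this worker's epoll instance
    int pipe_wr;   // written to by whichever thread accepted the client
} worker_t;

// Called by the accepting thread: hand the client fd to the chosen worker.
static void assign_client(worker_t* w, int client_fd)
{
    write(w->pipe_wr, &client_fd, sizeof(client_fd));
}

// Called by the worker when its pipe becomes readable: adopt the client fd
// into its own epoll instance.
static void adopt_client(worker_t* w)
{
    int client_fd;
    if (read(w->pipe_rd, &client_fd, sizeof(client_fd)) == sizeof(client_fd))
    {
        struct epoll_event ev = { 0 };
        ev.events = EPOLLIN;
        ev.data.fd = client_fd;
        epoll_ctl(w->epoll_fd, EPOLL_CTL_ADD, client_fd, &ev);
    }
}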
MaxScale 2.2
● Remove the last traces of inter-thread communication.
● Basic problem: How to distribute new connections among existing threads?
● New connections should be distributed across different threads in a roughly
even manner.
● All ports must be treated in the same way.
○ So a particular port cannot e.g. be permanently assigned to a specific thread.
epoll
● Two ways events can be triggered:
○ Edge-triggered, reported when something has happened.
○ Level-triggered, reported when something is available.
[Diagram: an inactive/active signal over time; edge-triggered reports an event at the transition to active, level-triggered keeps reporting events for as long as the signal is active.]
Implication of edge/level triggered epoll
1. The file descriptor that represents the read side of a pipe
(rfd) is registered on the epoll instance.
2. A pipe writer writes 2 kB of data on the write side of the pipe.
3. A call to epoll_wait(2) is done that will return rfd as a
ready file descriptor.
4. The pipe reader reads 1 kB of data from rfd.
5. A call to epoll_wait(2) is done.
● If rfd was added using EPOLLET (edge-triggered) then the call at 5 will hang.
● EPOLLET requires
○ Non-blocking descriptors.
○ Events can be waited for (epoll_wait) only after read or write returns EAGAIN.
Example straight from
$ man epoll
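The scenario is easy to reproduce with a small standalone program; this is a sketch that uses a zero timeout so that step 5 returns immediately instead of hanging.

#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

// Reproduce the man-page scenario: with EPOLLET the second epoll_wait() sees
// no event even though 1 kB is still buffered in the pipe.
int main(void)
{
    int p[2];
    pipe(p);

    int ep = epoll_create1(0);
    struct epoll_event ev = { 0 };
    ev.events = EPOLLIN | EPOLLET;   // drop EPOLLET to see the level-triggered behaviour
    ev.data.fd = p[0];
    epoll_ctl(ep, EPOLL_CTL_ADD, p[0], &ev);

    char buf[2048] = { 0 };
    write(p[1], buf, sizeof(buf));   // step 2: 2 kB written

    struct epoll_event out;
    printf("first wait: %d event(s)\n", epoll_wait(ep, &out, 1, 0));  // step 3: reports rfd

    read(p[0], buf, 1024);           // step 4: only 1 kB read

    // Step 5: with EPOLLET this reports 0 events; level-triggered would still
    // report the remaining 1 kB.
    printf("second wait: %d event(s)\n", epoll_wait(ep, &out, 1, 0));
    return 0;
}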
Two Kinds of Descriptors
● Listening sockets that all threads should handle.
● Sockets related to a client session that only a particular thread should handle.
● What’s the problem with the listening sockets being in the epoll instance of
each thread (as in MaxScale 2.1)?
○ Also the listening socket must be non-blocking and added using EPOLLET.
○ A thread that returns from epoll_wait must call accept on the listening socket until it returns
EWOULDBLOCK.
○ So, either we must accept that a thread suddenly may have to deal with a large number of
clients (if there is a sudden surge) or a thread must be able to offload an accepted client
socket to another thread.
What we Want
● A thread should not need to accept more than one client at a time.
○ That is, EPOLLET cannot be used for the listening sockets.
● We should not have to manipulate the epoll instance of a thread from outside the
thread.
○ Listening sockets are a global resource while sockets related to a client session are thread
local resources.
○ Not having to do that also makes it easier to increase and decrease the number of
threads at runtime.
But epoll instances can also be waited for.
● If an epoll file descriptor has events waiting, then it will indicate that as
being readable.
● So,
○ if a file descriptor is added to an epoll instance, and
○ the descriptor of that epoll instance is added to a second epoll instance, then
○ when something happens to the file descriptor, a thread blocked in an epoll_wait call on the
second epoll instance will return.
● If the thread now calls epoll_wait on the first epoll instance, it will return
with the actual file descriptor on which some change has occurred.
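This nesting is easy to verify with a standalone sketch: a pipe's read end is registered in an inner epoll instance, the inner epoll descriptor is registered in an outer one, and a write to the pipe wakes up a wait on the outer instance.

#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

int main(void)
{
    int p[2];
    pipe(p);

    int inner = epoll_create1(0);
    int outer = epoll_create1(0);

    struct epoll_event ev = { 0 };
    ev.events = EPOLLIN;
    ev.data.fd = p[0];
    epoll_ctl(inner, EPOLL_CTL_ADD, p[0], &ev);   // pipe read end -> inner

    ev.data.fd = inner;
    epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev);  // inner epoll fd -> outer

    write(p[1], "x", 1);                          // make the pipe readable

    struct epoll_event out;
    epoll_wait(outer, &out, 1, -1);               // wakes up: out.data.fd == inner
    int n = epoll_wait(out.data.fd, &out, 1, 0);  // returns the pipe read end
    printf("inner reported %d event(s), fd %d\n", n, out.data.fd);
    return 0;
}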
MaxScale 2.2
Thread N
l_fd = epoll_create(...);
struct epoll_event ev;
ev.events = EPOLLIN; // NOT EPOLLET
epoll_ctl(l_fd, EPOLL_CTL_ADD, g_fd, &ev);
while (!shutdown)
{
epoll_wait(l_fd, ...);
...
}
g_fd = epoll_create(...);
void add_listening_socket(int sd)
{
struct epoll_event ev;
ev.events = EPOLLIN; // NOT EPOLLET
epoll_ctl(g_fd, EPOLL_CTL_ADD, sd, &ev);
}
void add_client_socket(int l_fd, int sd)
{
struct epoll_event ev;
ev.events = .. | EPOLLET;
epoll_ctl(l_fd, EPOLL_CTL_ADD, sd, &ev);
}
Client Connecting
typedef void (*handler_t)(epoll_event*);
while (!shutdown)
{
struct epoll_event events[MAX_EVENTS];
int ndfs = epoll_wait(epoll_fd, events, ...);
for (int i = 0; i < ndfs; ++i)
{
epoll_event* event = &events[i];
handler_t handler = get_handler(event);
handler(event);
}
}
void handle_epoll_event(epoll_event* event)
{
struct epoll_event events[1];
int fd = get_fd(event); // fd == g_fd
epoll_wait(fd, events, 1, 0); // 0 timeout.
epoll_event* event = &events[0];
handler_t handler = get_handler(event);
handler(event);
}
void handle_accept_event(epoll_event* event)
{
int sd = get_fd(event);
int cd;
while ((cd = accept(sd, NULL, NULL)) != -1)
{
...
add_client_socket(cd, ...);
}
}
get_handler(event) and get_fd(event)?
typedef union epoll_data {
void *ptr;
int fd;
uint32_t u32;
uint64_t u64;
} epoll_data_t;
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
struct epoll_event {
uint32_t events;
epoll_data_t data;
};
● When adding a descriptor to an epoll instance you can associate a value.
● When something occurs you get that value back.
○ If you do not store the fd, you do not know what fd the event relates to.
○ If you store the fd, you cannot store anything else.
Storing More Context With an Event
typedef uint32_t (*mxs_poll_handler_t)(struct mxs_poll_data* data, int wid, uint32_t events);
typedef struct mxs_poll_data {
mxs_poll_handler_t handler; /*< Handler for this particular kind of mxs_poll_data. */
} MXS_POLL_DATA;
typedef struct dcb {
MXS_POLL_DATA poll;
int fd;
...
} DCB;
static uint32_t dcb_poll_handler(MXS_POLL_DATA *data, ...) {
DCB *dcb = (DCB*)data;
...
};
DCB* create_dcb(...)
{
DCB* dcb = alloc_dcb(...);
dcb->poll.handler = dcb_poll_handler;
return dcb;
}
class Worker : private MXS_POLL_DATA {
public:
Worker() {
MXS_POLL_DATA::handler = &Worker::epoll_handler;
...
};
static uint32_t epoll_handler(MXS_POLL_DATA* data, ...) {
return ((Worker*)data)->handler(...);
}
int fd;
};
Adding and Extracting Events
void poll_add_fd(int fd, uint32_t events, MXS_POLL_DATA* pData)
{
struct epoll_event ev;
ev.events = events;
ev.data.ptr = pData;
epoll_ctl(m_epoll_fd, EPOLL_CTL_ADD, fd, &ev);
}
DCB* dcb = ...;
poll_add_fd(dcb->fd, ..., &dcb->poll);
Worker* pWorker = ...;
poll_add_fd(pWorker->fd, ..., pWorker);
while (!should_shutdown)
{
struct epoll_event events[MAX_EVENTS];
int n = epoll_wait(epoll_fd, events, MAX_EVENTS, -1);
for (int i = 0; i < n; ++i)
{
MXS_POLL_DATA* data = (MXS_POLL_DATA*)events[i].data.ptr;
data->handler(data, ..., events[i].events);
}
}
Each worker thread sits in this loop.
Performance
MaxScale 2.0.5 (up to)
Hardware: Two physical servers, 16 cores / 32 hyperthreads,
128 GB RAM and an SSD drive, connected over a GbE LAN. One
runs MaxScale and sysbench, the other runs 4 MariaDB servers set up
as a Master and 3 Slaves.
Workload: OLTP read-only, 100 simple selects per iteration, no
transaction boundaries.
● direct: Sysbench uses all
servers directly in
round-robin fashion.
● rcr: MaxScale readconnroute
router.
● rws: MaxScale readwritesplit
router.
MaxScale 2.1.0
● The architectural change that
allowed the removal of a large
number of locks provided a
dramatic improvement for
readconnroute.
● No change for readwritesplit.
● With a small number of clients the
newly introduced cache improved the
performance; with a large number of
clients it had no impact.
Query Classification
● When ReadWriteSplitting, MaxScale must parse the statement.
○ Does it need to be sent to the master, to some slave or to all servers?
● The classification is done using a significantly modified parser from sqlite.
● In each thread, the parsing is done using a thread specific in-memory
database.
[Diagram: Thread 1 and Thread 2, each with its own sqlite instance.]
● No shared data, should be no
contention.
● However, sqlite had not been built
with the right flags, so there was
serialization going on.
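For illustration, sqlite's locking behaviour is controlled by how the library is built and how connections are opened. The snippet below uses the standard sqlite3 API to show the kind of knobs involved; it is not MaxScale's actual build configuration.

#include <sqlite3.h>

// Illustration of sqlite threading knobs; not MaxScale's actual configuration.
// At compile time, -DSQLITE_THREADSAFE=2 selects "multi-thread" mode instead
// of the default "serialized" mode, which wraps every API call in a mutex.
sqlite3* open_thread_local_db(void)
{
    sqlite3* db = NULL;

    // SQLITE_OPEN_NOMUTEX: this connection is used by a single thread only,
    // so the per-connection mutex can be skipped.
    sqlite3_open_v2(":memory:", &db,
                    SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE | SQLITE_OPEN_NOMUTEX,
                    NULL);
    return db;
}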
Data Collection
● While parsing a statement, a fair amount of information was collected.
○ What tables and columns are accessed. What functions are called. Etc.
● Allocating memory for that information did not come without a cost.
○ Basically only the firewall filter uses that information.
● Now no information is collected by default; a filter that is interested in
the information must ask for it explicitly.
qc_parse_result_t parse_result = qc_parse(stmt, QC_COLLECT_ALL);
Custom Parser
● Many routers and filters need to know whether a transaction is ongoing.
● Up until MaxScale 2.1.1 that implied that the statements had to be parsed
using the query classifier.
● For MaxScale 2.1.2 we introduced a custom parser that only detects
statements affecting the autocommit mode & transaction state.
○ Much faster than full parsing.
● In MaxScale 2.3 we will rely upon the server telling the autocommit mode &
transaction state.
○ Implies that changes performed via prepared statements or functions will also be detected.
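A minimal sketch of the idea behind such a parser, written only for illustration and far simpler than MaxScale's actual one: look at the first keyword of the statement and classify only those statements that can affect the transaction state or the autocommit mode.

#define _GNU_SOURCE // for strcasestr
#include <string.h>
#include <strings.h>

typedef enum { TRX_NONE, TRX_BEGIN, TRX_END, AUTOCOMMIT_CHANGE } trx_effect_t;

// Hypothetical, simplified detector; illustration only.
trx_effect_t detect_trx_effect(const char* sql)
{
    while (*sql == ' ' || *sql == '\t' || *sql == '\n')
    {
        ++sql; // skip leading whitespace
    }

    if (strncasecmp(sql, "BEGIN", 5) == 0 ||
        strncasecmp(sql, "START TRANSACTION", 17) == 0)
    {
        return TRX_BEGIN;
    }
    if (strncasecmp(sql, "COMMIT", 6) == 0 ||
        strncasecmp(sql, "ROLLBACK", 8) == 0)
    {
        return TRX_END;
    }
    if (strncasecmp(sql, "SET", 3) == 0 && strcasestr(sql, "AUTOCOMMIT") != NULL)
    {
        return AUTOCOMMIT_CHANGE;
    }
    return TRX_NONE;
}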
MaxScale 2.1.3 versus 2.0.5
Cache
● The cache was introduced in 2.1.0 but the
performance was less than satisfactory.
● Problem was caused by parsing.
○ The cache parsed all statements to detect non-cacheable statements.
○ E.g. SELECT CURRENT_DATE();
● Added possibility to declare that all SELECT statements are cacheable.
[TheCache]
type=filter
module=cache
...
selects=assume_cacheable
● Huge impact on the performance.
MaxScale ReadWriteSplit
● In the best case the performance of MaxScale 2.1.3 is three times that of
MaxScale 2.0.5, and eight times if caching is used.
The Importance of Early Customer Feedback
● MaxScale caches user information so that it can authenticate users.
● In MaxScale 2.2.0 the user database was shared between threads.
● Worked fine when connection attempts were relatively rare and sessions
were relatively long.
[Diagram: Thread 1 and Thread 2 sharing a single Users database.]
● If a user is not found, any thread may refresh the user data from the server and
update the database.
● All access must use locks.
User Report
● With MaxScale 2.2.0 Beta a user reported that he got only 6000 qps.
function event(thread_id)
db_connect()
rs = db_query("select 1;")
db_disconnect()
● Reason turned out to be thread contention in relation to the user database.
[Diagram: Thread 1 and Thread 2, each with its own Users database.]
● We split it, so that each thread has its own
user database.
“I just tested the same case, got 361437
queries/second, I think it works for us”
What about MaxScale 2.2.2
● No real difference, which is good, because 2.2 does more than 2.1.
○ E.g. must catch “SET SESSION SQL_MODE=ORACLE”
Summary
Summary
● MaxScale 0.1 -> 2.0
○ One epoll instance that all worker threads wait on.
○ Any thread can handle anything.
○ Lots of locking needed, and lots of potential for hard-to-resolve races.
○ Performance problems.
● MaxScale 2.1
○ One epoll instance per worker thread.
○ Any thread can accept, but must distribute the client socket to a particular thread.
○ All activity related to a particular session handled by one thread.
○ Significantly reduced need for locking and race risk effectively eliminated.
○ Good performance.
● MaxScale 2.2
○ One epoll instance for “shared” descriptors (listening sockets).
○ One epoll instance per worker thread.
○ All activity related to a particular session handled by one thread.
○ Even less locking needed.
○ Good performance.
Where do We Go From Here?
● The architectural evolution of MaxScale can be summarized as:
○ Decrease the explicit coupling between the worker threads.
■ If that leads to duplicate work or increased memory usage, fine.
● We are likely to continue moving in that direction, so that conceptually we
will end up running N “mini”-MaxScales in parallel,
completely oblivious of each other.
● That would also make it easy to allow starting and stopping threads
while MaxScale is running.