ColumnStore Bulk Data Adapters
David Thompson, VP Engineering
Jens Rowekamp, Engineer
Streamline and simplify
the process of data ingestion
Motivation
Organizations need to make data available for analysis as soon as it arrives.
Enable machine learning results to be published and made accessible to business users through SQL-based tools.
Ease of integration, whether with custom code or ETL tools.
Bulk data adapters
Applications can use the bulk data adapters SDK to collect and write data - on-demand data loading
No need to copy CSV to a ColumnStore node - simpler
Bypass the SQL interface, parser, and optimizer - faster writes
MariaDB Server
ColumnStore UM
Application
ColumnStore PM ColumnStore PM ColumnStore PM
Write API Write API Write API
MariaDB Server
ColumnStore UM
Bulk Data Adapter
1. For each row
a. For each column
bulkInsert->setColumn
b. bulkInsert->writeRow
2. bulkInsert->commit
* Buffers 100,000 rows by default
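The loop above can be sketched in plain Python. `FakeBulkInsert` below is a hypothetical stand-in for pymcsapi's ColumnStoreBulkInsert, used only to illustrate the setColumn / writeRow / commit call sequence and the batch-flushing behaviour; it is not part of the real API.

```python
# Sketch of the bulk-insert pattern. FakeBulkInsert is a stand-in stub,
# not pymcsapi; it mimics the row buffering described above (the real
# adapter buffers 100,000 rows by default before flushing).
class FakeBulkInsert:
    def __init__(self, batch_size=100000):
        self.batch_size = batch_size
        self.buffer = []   # rows waiting to be flushed
        self.flushed = 0   # rows already written out
        self.current = {}  # column values for the row in progress

    def setColumn(self, index, value):
        self.current[index] = value
        return self

    def writeRow(self):
        self.buffer.append(self.current)
        self.current = {}
        if len(self.buffer) >= self.batch_size:
            self._flush()
        return self

    def _flush(self):
        self.flushed += len(self.buffer)
        self.buffer = []

    def commit(self):
        # commit flushes any remaining buffered rows
        self._flush()

# Usage: the same call sequence as the real API.
b = FakeBulkInsert(batch_size=2)
for i, name in [(1, "ABC"), (2, "DEF"), (3, "GHI")]:
    b.setColumn(0, i)
    b.setColumn(1, name)
    b.writeRow()
b.commit()
print(b.flushed)  # 3
```

With a batch size of 2, the second writeRow triggers a flush and commit flushes the remaining row.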
Language Bindings
● The API is C++11 based.
● Currently available on modern Linux distributions:
○ May be ported to Windows and Mac in a future release.
● Other language bindings are implemented using SWIG, which generates efficient, near-identical native bindings on top of the C++ library:
○ Java 8 (also providing Scala support).
○ Python 2 & 3.
○ Other language bindings can be implemented in the future.
System Configuration
● The adapter assumes the existence of a ColumnStore.xml file in the system in
order to determine the system topology, hosts, and ports for the PM nodes.
● If you are running on a ColumnStore node, the adapter works immediately.
● For a remote host, you will need to copy ColumnStore.xml from a server node.
● The adapter needs to be able to connect to the ProcMon (8800), WriteEngine (8630), and DBRMController (8616) ports.
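As a rough illustration of how an adapter might read topology information from ColumnStore.xml, the sketch below parses host and port entries from an XML fragment with Python's standard library. The element names used here (ProcMgr, DBRM_Controller, IPAddr, Port) are assumptions for the example, not a guaranteed schema.

```python
import xml.etree.ElementTree as ET

# Illustrative ColumnStore.xml fragment. The element names are
# assumptions for this sketch, not the file's documented schema.
CONFIG = """
<Columnstore>
  <ProcMgr><IPAddr>192.168.1.10</IPAddr><Port>8800</Port></ProcMgr>
  <DBRM_Controller><IPAddr>192.168.1.10</IPAddr><Port>8616</Port></DBRM_Controller>
</Columnstore>
"""

def read_endpoint(root, section):
    """Return (host, port) for a named configuration section."""
    node = root.find(section)
    return node.findtext("IPAddr"), int(node.findtext("Port"))

root = ET.fromstring(CONFIG)
host, port = read_endpoint(root, "ProcMgr")
print(host, port)  # 192.168.1.10 8800
```

In practice the real file is loaded from the install path (or the path passed to the ColumnStoreDriver constructor, described below) rather than from an inline string.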
Core Classes
The following classes provide the core interface:
● ColumnStoreDriver: Entry point / connection management
● ColumnStoreBulkInsert: Per table interface for writing a transaction
● ColumnStoreSystemCatalog: Table metadata retrieval
Language namespaces:
● C++ - mcsapi::
● Java - com.mariadb.columnstore.api
● Python - pymcsapi
Core Classes - ColumnStoreDriver
● Entry point and factory class for creating:
○ ColumnStoreBulkInsert objects to allow bulk write of a single transaction for a single table
○ ColumnStoreSystemCatalog object to allow retrieval of table and column metadata
● The default constructor looks for ColumnStore.xml in:
○ $COLUMNSTORE_INSTALL_DIR/etc/ColumnStore.xml (for non-root installs).
○ /usr/local/mariadb/columnstore/etc/ColumnStore.xml
● Alternatively, pass the path to ColumnStore.xml as a constructor argument to specify a non-standard location.
ColumnStoreDriver Examples
Java
import com.mariadb.columnstore.api.*;
..
ColumnStoreDriver d1, d2;
d1 = new ColumnStoreDriver();
d2 = new ColumnStoreDriver("/etc/cs2.xml");
Python
import pymcsapi
d1 = pymcsapi.ColumnStoreDriver()
d2 = pymcsapi.ColumnStoreDriver("/etc/cs2.xml")
C++
mcsapi::ColumnStoreDriver *d1, *d2;
d1 = new mcsapi::ColumnStoreDriver();
d2 = new mcsapi::ColumnStoreDriver("/etc/cs2.xml");
Core Classes - ColumnStoreBulkInsert
● Encapsulates bulk insert operations. Constructed for a single table and
transaction.
● Multiple instances can be created for multiple drivers, but only one can be active per table per ColumnStore instance.
● Error handling is important: if you fail to commit or rollback, a ColumnStore table lock will be left behind and must be released manually with the cleartablelock command.
○ resetRow can be used to clear the current row if an error occurs and you want to commit the prior rows.
● After completion, getSummary returns summary details.
ColumnStoreBulkInsert Examples
Java
import com.mariadb.columnstore.api.*;
..
ColumnStoreDriver d;
ColumnStoreBulkInsert b;
d = new ColumnStoreDriver();
try {
b = d.createBulkInsert("test", "t1",
(short)0, 0);
b.setColumn(0, 1);
b.setColumn(1, "ABC");
b.writeRow();
b.setColumn(0,2);
b.setColumn(1, "DEF");
b.writeRow();
b.commit();
} catch (ColumnStoreException e) {
b.rollback();
..
}
Python
import pymcsapi
d = pymcsapi.ColumnStoreDriver()
try:
    b = d.createBulkInsert("test", "t1", 0, 0)
    b.setColumn(0, 1)
    b.setColumn(1, "ABC")
    b.writeRow()
    b.setColumn(0, 2)
    b.setColumn(1, "DEF")
    b.writeRow()
    b.commit()
except RuntimeError as err:
    b.rollback()
C++
mcsapi::ColumnStoreDriver* d;
mcsapi::ColumnStoreBulkInsert* b;
d = new mcsapi::ColumnStoreDriver();
try {
b = d->createBulkInsert("test", "t1",
0, 0);
b->setColumn(0, (uint32_t)1);
b->setColumn(1, "ABC");
b->writeRow();
b->setColumn(0, (uint32_t)2);
b->setColumn(1, "DEF");
b->writeRow();
b->commit();
} catch (mcsapi::ColumnStoreError &e) {
b->rollback();
..
}
Core Classes - ColumnStoreSystemCatalog
● Allows retrieval of ColumnStore table and column metadata, enabling generic implementations.
ColumnStoreSystemCatalog Examples
Java
import com.mariadb.columnstore.api.*;
..
ColumnStoreDriver d;
ColumnStoreSystemCatalog c;
ColumnStoreSystemCatalogTable t;
ColumnStoreSystemCatalogColumn c1,c2;
d = new ColumnStoreDriver();
c = d.getSystemCatalog();
t = c.getTable("test", "t1");
int t1_cols = t.getColumnCount();
c1 = t.getColumn(0);
String c1_name = c1.getColumnName();
c2 = t.getColumn("area_code");
Python
import pymcsapi
d = pymcsapi.ColumnStoreDriver()
c = d.getSystemCatalog()
t = c.getTable("test", "t1")
t1_cols = t.getColumnCount()
c1 = t.getColumn(0)
c1_name = c1.getColumnName()
c2 = t.getColumn("area_code")
C++
mcsapi::ColumnStoreDriver* d;
mcsapi::ColumnStoreSystemCatalog c;
mcsapi::ColumnStoreSystemCatalogTable t;
mcsapi::ColumnStoreSystemCatalogColumn c1,c2;
d = new mcsapi::ColumnStoreDriver();
c = d->getSystemCatalog();
t = c.getTable("test", "t1");
uint16_t t1_cols = t.getColumnCount();
c1 = t.getColumn(0);
std::string c1_name = c1.getColumnName();
c2 = t.getColumn("area_code");
Core Classes - Bulk Insert
ColumnStoreDriver
char* getVersion()
ColumnStoreBulkInsert* createBulkInsert(..)
ColumnStoreSystemCatalog& getSystemCatalog()
ColumnStoreBulkInsert
uint16_t getColumnCount()
ColumnStoreBulkInsert* writeRow()
ColumnStoreBulkInsert* resetRow()
void commit()
void rollback()
ColumnStoreSummary& getSummary()
void setTruncateIsError(bool)
void setBatchSize(uint32_t)
bool isActive()
ColumnStoreBulkInsert* setColumn(uint16_t,
const std::string& value,..)
ColumnStoreBulkInsert* setColumn(uint16_t, uint64_t,..)
..
ColumnStoreSummary
double getExecutionTime()
uint64_t getRowsInsertedCount()
uint64_t getTruncationCount()
uint64_t getSaturatedCount()
uint64_t getInvalidCount()
ColumnStoreDateTime
ColumnStoreDateTime(..)
bool set(..)
ColumnStoreDecimal
ColumnStoreDecimal(..)
bool set(..)
Core Classes - System Catalog
ColumnStoreSystemCatalogColumn
uint32_t getOID()
const std::string& getColumnName()
uint32_t getDictionaryOID()
columnstore_data_types_t getType()
uint32_t getWidth()
uint32_t getPosition()
const std::string& getDefaultValue()
bool isAutoincrement()
uint32_t getPrecision()
uint32_t getScale()
bool isNullable()
uint8_t compressionType()
ColumnStoreSystemCatalogTable
const std::string& getSchemaName()
const std::string& getTableName()
uint32_t getOID()
uint16_t getColumnCount()
ColumnStoreSystemCatalogColumn& getColumn(const std::string&)
ColumnStoreSystemCatalogColumn& getColumn(uint16_t)
ColumnStoreDriver
char* getVersion()
ColumnStoreBulkInsert* createBulkInsert(..)
ColumnStoreSystemCatalog& getSystemCatalog()
ColumnStoreSystemCatalog
ColumnStoreSystemCatalogTable& getTable(const
std::string& schemaName, const std::string& tableName)
Use Cases
The Bulk Data Adapters are designed to make it easier to build integrations and streaming use cases such as:
- Kafka or messaging integration
- Exposing data import via an API
- ETL tool adapters
- Custom ETL logic
MariaDB has introduced a few specific streaming adapters (MaxScale CDC and Kafka) and we plan to build more in the future. For further details, please attend tomorrow's session, "Real-time Analytics With The New Streaming Data Adapters", at 8:40am.
Spark Connector
● Enables publishing of machine learning results from Spark DataFrames to
ColumnStore.
● Enables a best-of-breed approach:
○ In-memory machine learning algorithms in Spark.
○ Publish results to ColumnStore for ease of consumption with SQL tools such as Tableau.
● Supports both Scala and Python notebooks.
● To pull data from ColumnStore into Spark, use the JDBC connector and Spark SQL to read data.
○ In the future we plan to add a bulk read API.
● Requires adding JAR files to the Spark runtime configuration.
● Available as a Docker image for reference / easy evaluation.
Spark Connector Demo / Example
Spark Connector - Getting Started with Docker
git clone https://github.com/mariadb-corporation/mariadb-columnstore-docker.git
cd mariadb-columnstore-docker/columnstore_jupyter
docker-compose up -d
In your browser, open http://localhost:8888 and enter 'mariadb' as the password to log in to the Jupyter notebook application.
Thank you!
M|18 Ingesting Data with the New Bulk Data Adapters