Transactional vs. Analytical Architecture
Two data architecture models have been around for a long time, and many organizations use them as pillars for their data strategy. But which one is better?
The answer is straightforward: neither. Each is focused on optimally solving different use cases. We can use either of the two to solve the other's use cases, but we will not get the best performance for every need.
Transactional databases serve a particular function. They are responsible for sending and receiving data at high speed, allowing many users to carry out many transactions simultaneously.
Transactional database architectures were designed to resolve the insertion, update, or deletion of a particular record quickly, in real time, focusing on small but instant movements in the data. To achieve this, the data is stored in record (row) structures, and many databases use indexes as accelerators to identify the specific record that will be affected.
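As a minimal sketch of this pattern, here is an illustrative example using Python's built-in sqlite3 module purely as a stand-in for any row-oriented transactional engine; the table and column names are hypothetical:

    import sqlite3

    # Row-oriented store: each transaction touches one record, located via an index.
    conn = sqlite3.connect("bank.db")
    conn.execute("CREATE TABLE IF NOT EXISTS accounts (id INTEGER PRIMARY KEY, owner TEXT, balance REAL)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_accounts_owner ON accounts(owner)")

    # Insert a single record and update another one instantly.
    conn.execute("INSERT INTO accounts (owner, balance) VALUES (?, ?)", ("Alice", 100.0))
    conn.execute("UPDATE accounts SET balance = balance - 25 WHERE id = ?", (1,))
    conn.commit()

    # A small, indexed lookup returns without scanning the whole table.
    row = conn.execute("SELECT balance FROM accounts WHERE id = ?", (1,)).fetchone()
    print(row)

Each statement touches one row, and the index lets the engine jump straight to it, which is exactly the workload these databases are built for.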
This physical structure allows banking, financial, or commercial applications in any industry to record individual transactions quickly and efficiently. It also performs small queries on the data efficiently and responds with minimal latency. These databases typically retain only current data.
The problem with these databases arises when we use them to retrieve massive amounts of information, since in this case the indexes are no longer efficient. Our strategic and historical dashboards will still work, but with low performance and at the cost of saturating the database at critical times. A simple example: we store accounting and customer data in a transactional database, and at the same time a dashboard retrieves all the historical information from that same database. At times, the dashboard will behave slowly or very, very slowly, and users may simultaneously perceive that their transactions are no longer being processed in real time. Have you experienced this situation?
If this represents a tactical problem for the organization, the best action is to divide the workloads and dedicate an analytical database to the dashboards and other analytical processes.
Analytical databases store and manage large amounts of structured information and big data, including business, market, and customer data, for business intelligence (BI) analysis.
These databases are specially optimized for faster queries and scalability. Analytical database software is designed to quickly analyze massive amounts of data, with performance up to 1,000 times faster than an operational database for demanding analytical workloads. They also optimize queries that involve one or multiple joins.
The physical structure is usually column-based. This allows for data compression appropriate to each data type and improves performance when running massive data queries. These databases are designed to insert batches of millions of records at a time and to perform aggregation operations easily and quickly. They focus on retrieving large numbers of records at once. Generally, these databases hold all of a company's historical information, or as many years as the organization considers convenient. The stored data typically converges from different data sources.
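A toy illustration of the difference between the two layouts, in plain Python rather than any specific engine's real storage format:

    # Row-oriented layout: every value of every column travels together per record.
    rows = [
        {"order_id": 1, "region": "EU", "amount": 120.0},
        {"order_id": 2, "region": "US", "amount": 75.5},
        {"order_id": 3, "region": "EU", "amount": 240.0},
    ]

    # Column-oriented layout: each column is stored as its own contiguous array,
    # which compresses well and lets an aggregation read only the columns it needs.
    columns = {
        "order_id": [1, 2, 3],
        "region":   ["EU", "US", "EU"],
        "amount":   [120.0, 75.5, 240.0],
    }

    # SUM(amount): the row layout must visit every record; the column layout scans one array.
    total_from_rows    = sum(r["amount"] for r in rows)
    total_from_columns = sum(columns["amount"])
    assert total_from_rows == total_from_columns

In a real column store, each of those arrays would also be compressed with an encoding suited to its data type, which is what makes scanning billions of values affordable.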
Business analysts, researchers, financial market analysts, big data analysts, geospatial analysts, and data scientists rely on the high availability of analytical databases that can handle large volumes of data.
In this case, I will use a specific example with a platform: Snowflake.
Snowflake stores data in a columnar format and physically divides it into micro-partitions. A table is a set of physical micro-partitions. Each micro-partition is immutable, meaning that records cannot be modified or deleted in place.
How does this work if the database allows inserts, updates, and deletes? This is where the magic of processing large amounts of data emerges. When inserting a large amount of data, a micro-partition can hold 100,000 records or many more, depending on the data types and the number of columns. They are all recorded in one physical space, and each column carries metadata: the minimum and maximum values, the number of nulls, the number of records, and the number of distinct values. If a table comprises 3,000 micro-partitions, each micro-partition has that metadata. When running a query, instead of reading every micro-partition, the Snowflake engine consults the metadata first. If the values to be retrieved fall within the minimum and maximum, the micro-partition is read; otherwise, it is skipped and the engine moves on to the next one. In this way, the engine can make large jumps without having to read records that do not contain the data we need. Of course, part of the magic is organizing our data efficiently so that we can take full advantage of this metadata.
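Here is a much-simplified simulation of that pruning idea in Python; the real metadata and pruning logic are internal to Snowflake, so the structure below is only illustrative:

    # Each "micro-partition" keeps per-column metadata such as min and max values.
    partition_metadata = [
        {"id": 1, "min_date": "2021-01-01", "max_date": "2021-06-30"},
        {"id": 2, "min_date": "2021-07-01", "max_date": "2021-12-31"},
        {"id": 3, "min_date": "2022-01-01", "max_date": "2022-06-30"},
    ]

    def partitions_to_scan(target_date):
        """Keep only partitions whose min/max range could contain the value."""
        return [p["id"] for p in partition_metadata
                if p["min_date"] <= target_date <= p["max_date"]]

    # A query filtering on 2021-09-15 only needs to read partition 2;
    # the other partitions are skipped without reading a single record.
    print(partitions_to_scan("2021-09-15"))   # -> [2]

This is also why loading the data sorted or clustered on the columns we filter by pays off: the tighter the min/max ranges, the more micro-partitions the engine can skip.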
Returning to the topic of each micro-partition being immutable: the initial creation is very fast, but inserting a single record is expensive, since the micro-partition is internally deleted and a new one is created with the previous records plus the new one. This is done very efficiently when the new insertion includes hundreds or thousands of records. However, if we use this technology for individual, recurring insertions, the process becomes slow, since each time a record arrives, the entire micro-partition is deleted and recreated. For this reason, an analytical database is not optimal for small, real-time transactional processes.
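A small Python sketch of why this matters, modeling a micro-partition as an immutable tuple that must be rebuilt on every insert (an intentionally simplified model, not Snowflake's actual implementation):

    # Toy model of an immutable micro-partition: any "insert" rewrites the whole thing.
    def insert_into_partition(partition, new_rows):
        """Return a brand-new partition containing the old rows plus the new ones."""
        return tuple(partition) + tuple(new_rows)   # old partition discarded, new one created

    incoming = [("row-%d" % i,) for i in range(1000)]

    # Row-by-row loading: 1,000 inserts -> 1,000 full rewrites of the partition.
    partition = ()
    for row in incoming:
        partition = insert_into_partition(partition, [row])

    # Batch loading: one insert -> one rewrite, regardless of how many rows arrive.
    partition_batch = insert_into_partition((), incoming)

    assert partition == partition_batch

The end result is identical, but the row-by-row path pays the rewrite cost a thousand times, which is exactly the trap of pointing a transactional, record-at-a-time workload at an analytical engine.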
Every organization should be able to exploit both technologies. Two database engines, each focused on its strengths and integrated with the other, can multiply the impact of your data strategy.
Snowflake has tried to combine both worlds in a single platform. While its primary focus remains analytical, the release of hybrid tables allows this type of physical structure to be used together with indexes that are optimized for transactions. Thus, by using the different table types correctly, we can have the advantages of both approaches in the same database.
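As an illustrative sketch, assuming the snowflake-connector-python package and placeholder credentials, the two table types can sit side by side in the same database; the exact hybrid-table DDL options should be checked against the current Snowflake documentation:

    import snowflake.connector  # assumes the snowflake-connector-python package is installed

    # Connection parameters are placeholders; replace them with your own account details.
    conn = snowflake.connector.connect(
        account="<account_identifier>",
        user="<user>",
        password="<password>",
        warehouse="<warehouse>",
        database="<database>",
        schema="<schema>",
    )
    cur = conn.cursor()

    # Hybrid table (row-oriented, requires a primary key): suited to fast point
    # lookups and frequent single-row writes.
    cur.execute("""
        CREATE HYBRID TABLE orders_live (
            order_id INT PRIMARY KEY,
            customer_id INT,
            amount NUMBER(10, 2)
        )
    """)

    # Standard table (columnar micro-partitions): suited to large scans,
    # aggregations, and history.
    cur.execute("""
        CREATE TABLE orders_history (
            order_id INT,
            customer_id INT,
            amount NUMBER(10, 2),
            order_date DATE
        )
    """)

The live, transactional records can then be landed in the hybrid table and periodically consolidated into the standard table that feeds dashboards and analytics.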
Suppose you currently have only one type of database and want to improve the performance and cost of your analytical processes. In that case, you can start by considering specialized databases for each use case. This type of implementation will always improve those two pains: performance (time to respond) and cost (money). The only caveat is that once everything is fine-tuned and working to the end users' expectations, consumption will grow and expenses will increase; but that increase is aligned with the strategic implementation of projects that generate greater ROI and, ultimately, higher business profits.
If you are using Snowflake with another transactional database or as your only database, consider using all table types supported by the platform.
Our business will always want all the benefits of rapid access to data: strategic dashboards, business analysts making decisions that impact and improve the company, data scientists generating value and innovation for us, and finally, a way to monetize our data.
If you are looking for an ally to help you develop a strategy and implement or restructure your data warehouse as an analytical complement, contact the Pomerol team.
Víctor Gómez
Principal Data Consultant
M +1 872 444-6008
M +506 8833-9295