How I Chose MS SQL Server as the Ultimate Data Solution for Scoris in 2023

For Lithuanian readers, there is a Lithuanian version of this article on the Scoris blog.

Scoris, a Lithuanian business data platform, had one significant yet somewhat overlooked drawback: the data was only updated once a month, so some of it could be outdated by a month or even two. As of today, the most recent company data is always available on the Scoris website, updated automatically every day.

Why was the data updated only once a month?

A common issue with startups is the reluctance to invest heavily in infrastructure early on, as it can be time-consuming and costly. At the beginning, the focus is on an MVP PoC (Minimum Viable Product, Proof of Concept) — a functional product achieved with minimal effort and resources. Once such a product is in place, the company can start exploring the market. If the product proves successful, infrastructure modernization can then follow. This is the ideal scenario.

However, the frequent reality is to start with an MVP PoC and keep building upon it. Things start to break, you patch them up and continue. After a year, modernizing the infrastructure becomes a mammoth task, mainly because the PoC has expanded so much.

Such was the case with Scoris. We assembled our data model in Knime, a useful tool but not one intended for database work. In the beginning, a manual processing run took about half an hour. But as Scoris grew and the Knime workflow expanded, Python scripts, MS Excel PowerQuery, and a few interim databases grew up alongside it.

In the end, the update process took around 3 hours. The Knime workflow looked something like this:

[Image: Scoris data preparation Knime workflow]

This tangled web was not suited to automation or good performance. It was a bloated PoC, which is why daily data updates were unrealistic.

The Quest for a New Data Infrastructure

It had long been evident that the data infrastructure needed an overhaul, but the expanded PoC made such a task daunting. Just how daunting?

  • Over 100 data sources
  • Almost every conceivable file format (csv, xlsx, json, zip)
  • Constantly changing data structures
  • Approximately 40 GB of raw data

Knowing that a revamp was inevitable, I began my search for the best solution. Several components were essential:

  1. Data downloading
  2. Storing raw data in an interim DB
  3. Data preparation and modeling
  4. Publishing the final results

For the third point, I was certain SQL would be the tool, regardless of the technology (Postgres, MySQL, MS SQL, etc.). The fourth was also clear since the website was built on a MySQL database, and I didn’t want significant changes there as it functioned quite well. The real headache lay in the first two points.

Unsuccessful Attempts

I started with Keboola. The free tier promised 250 GB of data and around an hour of compute time. Since I could already manage the data downloading and interim DB storage in about an hour, I assumed powerful cloud machines would do it faster. I was mistaken. After uploading just 3-4 files, I had already burned through 120 minutes of compute time, with over 90 files still pending. I had to abandon this option because a paid plan would have been prohibitively expensive.

My second attempt was with Airbyte, which offers a free version for on-prem installation. I rented a mid-range VPS (64 GB RAM, 8 cores), installed Airbyte, and got to work. The free version of Airbyte is poorly optimized: loading a few GB of data into the database pinned the server's CPU and RAM at 100% for a very long time, and once the resource limit was hit, performance slowed dramatically. I had to abandon this approach because of its sluggishness and the resources it demanded for simple tasks.

The third attempt was a do-it-yourself (DIY) approach using open-source tools: Python scripts for downloading the data and storing it in a Postgres database, with Airflow for orchestration. I thought this would be the final solution; it was fast and cost-effective. For data transformation, I considered using DBT. The issue was that the code became lengthy and complex, and keeping it stable would have been too labor-intensive for one person.

While pursuing other objectives, I got acquainted with two technologies: the Azure cloud and MS SQL Server (SSIS). I tried tasks on both platforms in parallel, curious to see whether one might prove suitable and appealing. Both had steep learning curves, with excessive configuration requirements and cryptic error messages, but it got easier over time and I grew fond of both. After some deliberation, Azure was discarded because:

  • It lacked adequate control. I frequently had to use Python scripts on a Spark cluster to bypass limitations, which wasn't straightforward and demanded significant resources.
  • The costs began to escalate. Billing is based on the resources consumed, and even before I had completed 10% of the task, the cost was already around 5 EUR/day. A full solution would not have been economical.

Thus, only MS SQL Server with SSIS remained. I already had a home server running 24/7 (with a UPS and failover internet). I managed to obtain an affordable MS SQL 2019 Server license and got to work.

MS SQL 2019 Server-Based Data Architecture in 2023?!

It’s rare to see anyone recommending such a data architecture in 2023. Both MS SQL Server and SSIS seem like legacy technologies, but after trying several "modern" options, this "outdated" one fit best.

The greatest advantage is speed and flexibility: processing 40-60 GB of data in half an hour is a respectable benchmark, and the system runs very steadily. The primary drawback is its age and somewhat dated logic, which demand detailed configuration and documentation. Another challenge is getting access-rights management right across Windows and SQL Server.
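To give a flavour of that last point, below is a minimal sketch of the kind of grants an ETL service account ends up needing. The login, database, and schema names are hypothetical placeholders rather than the actual Scoris configuration.

```sql
-- Hypothetical example: give a Windows service account just enough rights
-- to read and write the staging database used by the ETL jobs.
USE [master];
CREATE LOGIN [HOMESERVER\scoris_etl] FROM WINDOWS;        -- placeholder account
GO

USE [ScorisStaging];                                      -- placeholder database
CREATE USER [scoris_etl] FOR LOGIN [HOMESERVER\scoris_etl];
ALTER ROLE db_datareader ADD MEMBER [scoris_etl];         -- read every table
ALTER ROLE db_datawriter ADD MEMBER [scoris_etl];         -- write every table
GRANT EXECUTE ON SCHEMA::dbo TO [scoris_etl];             -- run ETL stored procedures
GO
```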

The overall setup looks like this:

[Image: New Scoris MS SQL Server-based architecture]

The system operates quickly and reliably. Daily ETL processes take about 30 minutes, and monthly processes take roughly 2 hours. There’s room for further optimization.
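To make the daily run a little more concrete, here is a rough sketch of what a single step boils down to, written as plain T-SQL rather than the SSIS data flow it actually lives in. The file path, schema, and table names are made up for illustration.

```sql
-- Sketch of one daily ETL step (hypothetical names and path).

-- 1. Reload the raw staging table from the file downloaded that morning.
TRUNCATE TABLE staging.companies_raw;

BULK INSERT staging.companies_raw
FROM 'D:\scoris\downloads\companies.csv'       -- placeholder path
WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', CODEPAGE = '65001');

-- 2. Fully rebuild the cleaned target table from the staging data.
TRUNCATE TABLE dbo.companies;

INSERT INTO dbo.companies (company_code, company_name, status, loaded_at)
SELECT company_code,
       LTRIM(RTRIM(company_name)),
       status,
       SYSDATETIME()
FROM   staging.companies_raw
WHERE  company_code IS NOT NULL;
```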

Currently, nearly all tables are fully reloaded in this way on every run, even though the day-to-day data changes are minimal. This also complicates snapshot creation. A near-term plan is to optimize by loading deltas instead and enabling snapshotting.
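A delta-based load could look roughly like the sketch below: a MERGE that touches only new or changed rows and writes every change into a history table, which is what would make snapshots cheap. As above, the table and column names are hypothetical.

```sql
-- Hypothetical sketch of the planned delta load: upsert only the rows that
-- actually changed, and record each change in a history table for snapshots.
MERGE dbo.companies AS tgt
USING staging.companies_raw AS src
      ON tgt.company_code = src.company_code
WHEN MATCHED AND (tgt.company_name <> src.company_name
                  OR tgt.status <> src.status) THEN
    UPDATE SET tgt.company_name = src.company_name,
               tgt.status       = src.status,
               tgt.loaded_at    = SYSDATETIME()
WHEN NOT MATCHED BY TARGET THEN
    INSERT (company_code, company_name, status, loaded_at)
    VALUES (src.company_code, src.company_name, src.status, SYSDATETIME())
OUTPUT $action, inserted.company_code, inserted.company_name,
       inserted.status, SYSDATETIME()
INTO dbo.companies_history (change_type, company_code, company_name,
                            status, changed_at);
```

The appeal of this approach is that it slots into the existing flow: for the tables where deltas are worth the effort, the MERGE simply replaces the truncate-and-reload step.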

The Benefits

The obvious benefit for me is the 3 hours saved every month. For users, it means always having access to the latest company data. But the most significant advantage lies in Scoris's future: the new data architecture enables further scaling of the platform and allows us to collect much more historical data than was ever gathered before. In time, once enough data accumulates, this will make it possible to introduce products and offerings that no market player has yet managed to offer.

In hindsight, while the journey was demanding, time-consuming, and full of obstacles, it was undoubtedly worth the effort. Scoris now operates on an advanced, efficient, and effective data architecture that supports future growth and opportunities.

Closing Thoughts

Data management is an ever-evolving discipline. What might seem like a perfect solution today might need tweaking tomorrow. My journey to update Scoris’s data infrastructure was a challenge, but also a rich learning experience. In the end, it's all about finding the right balance between current needs, scalability, and technology stack choices.

Even in 2023, seemingly "outdated" solutions like MS SQL 2019 Server can be a robust choice, proving that it's not about the age of the technology but about its fit and capability for the particular problem.

As Scoris expands, we'll continue to reassess our technology stack, ensuring we're delivering the best possible data and service to our users. Today, MS SQL 2019 Server is our choice, but who knows what the future might bring. Regardless, we’re ready for it.

