Serverless Compose vs Data Warehouse

Serverless Compose vs Data Warehouse
Artur Wita

Agenda
The overall brief of the presentation:
● Intro
● Problem context
● Solution
● Challenges
○ Architectural
○ Serverless
● Results
It’s not a golden hammer, only an attempt.

About me
● Node.js developer with 2 years of commercial experience
(1 year with the Serverless Framework)
● At The Software House, I am responsible for greeting new
employees (and occasionally blowing up production)
● Co-author of the data warehouse
● Privately: mountain lover, sports freak (hiking, ice skating,
swimming, ultimate frisbee), workation enthusiast,
and a friendly-neighborhood rapper

Intro
A data warehouse is a system responsible for gathering miscellaneous data. 🙃
The most important things to take care of in such systems are:
● ensuring that the data is up to date
● having data backup / the ability to recreate data
● and surely something else
With data warehouses, we can run complex queries that help us analyze trends
and make decisions that are going to define our future.
Sounds good?
Because it is good…

How did we build our data warehouse?
Our internal data warehouse is based on AWS Step Functions.
● ensuring that the data is up to date
○ cron jobs responsible for updating the data every day
● having data backup / the ability to recreate data
○ manually-triggered recreate workflows
Almost every feature supports 2 workflows; sometimes we were
able to reuse the recreate flow for updating the data.

What might a workflow look like?
1. Extract
● Collect data, e.g. from an external API
2. Transform
● Transform and model raw data
3. Load
● Store the data in the database
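
An Extract / Transform / Load workflow of this kind could be declared roughly as follows. This is only a sketch, assuming the serverless-step-functions plugin is used to define state machines; the service name, function names, handler paths, state machine name and schedule are all hypothetical.

```yaml
# serverless.yml (sketch) - names, paths and the schedule are illustrative assumptions
service: finance

plugins:
  - serverless-step-functions   # assumed plugin for declaring state machines

provider:
  name: aws

functions:
  extract:
    handler: functions/invoices/extract.handler
  transform:
    handler: functions/invoices/transform.handler
  load:
    handler: functions/invoices/load.handler

stepFunctions:
  stateMachines:
    updateInvoices:
      events:
        # Daily update workflow. A recreate workflow would have no schedule
        # and would be started manually instead.
        - schedule: cron(0 3 * * ? *)
      definition:
        StartAt: Extract
        States:
          Extract:
            Type: Task
            Resource:
              Fn::GetAtt: [extract, Arn]
            Next: Transform
          Transform:
            Type: Task
            Resource:
              Fn::GetAtt: [transform, Arn]
            Next: Load
          Load:
            Type: Task
            Resource:
              Fn::GetAtt: [load, Arn]
            End: true
```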

Granulation advantages
Building workflows from single-responsibility lambdas has many
benefits, such as:
● easier testing
● reusable steps
● improved scalability
● ability to adjust particular resources
(memory size, timeouts, etc.)
● easier debugging
(a graphical representation of steps)
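
Adjusting resources per step might look like the fragment below. It is a sketch with hypothetical function names, handler paths and values; `timeout` (in seconds) and `memorySize` (in MB) are standard per-function settings in the Serverless Framework.

```yaml
# serverless.yml fragment (sketch) - function names and values are illustrative
functions:
  extract:
    handler: functions/invoices/extract.handler
    timeout: 120      # waiting on an external API may take a while
    memorySize: 256   # little processing happens here
  transform:
    handler: functions/invoices/transform.handler
    timeout: 30
    memorySize: 1024  # modelling raw data may need more memory
```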

Time for some maths
At the time of making the decision to split our warehouse
we had about 16 features.
Each feature supported 2 workflows.
Each workflow consisted of 3-4 lambdas* on average.
16 features × 4 lambdas = 64 lambdas
The average deployment time of the whole application*
was between 18 and 20 minutes.
And we still had a small warehouse...

One workflow does not make a difference, but how about seventeen?

Serverless Compose
Serverless Compose is an official package created to
“simplify deploying and orchestrating multiple services”.
The key features allow us to:
● Deploy multiple services in parallel
● Deploy services in a specific order
● Share outputs from one service to another
● Run commands across multiple services
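
A minimal Compose file could look like this. It is a sketch only: the service names and paths are hypothetical, and each path is assumed to contain its own serverless.yml.

```yaml
# serverless-compose.yml (sketch) - service names and paths are hypothetical
services:
  service-a:
    path: services/service-a
  service-b:
    path: services/service-b
```

Running `serverless deploy` next to this file deploys both services, and `serverless deploy --service=service-a` targets a single one.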

How does it work?
The heart of each Serverless Framework application is
a serverless.yml file. It's used for configuring things such as:
● the framework itself
● cloud provider (e.g. AWS, Azure, Google)
● functions
● state machines
● resources
● etc.
In a certain sense, what Serverless Compose does
is handle multiple serverless.yml files and specify
dependencies between the services they describe.

Dependent services
There are two ways of specifying dependencies
between services:
● implicit - by referencing a resource belonging to
another service
● explicit - by marking a service as dependent on another one
The dependency is deployed first, then the service that depends on it.
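
Both kinds of dependency can be sketched in a single Compose file. The service names and the `queueUrl` output are hypothetical; `params` and `dependsOn` are the Compose keys used for this.

```yaml
# serverless-compose.yml (sketch) - service names and the output name are hypothetical
services:
  service-a:
    path: service-a

  service-b:
    path: service-b
    # Implicit dependency: referencing an output of service-a
    # makes Compose deploy service-a first.
    params:
      queueUrl: ${service-a.queueUrl}

  service-c:
    path: service-c
    # Explicit dependency: nothing is shared, only the order is enforced.
    dependsOn: service-a
```

Inside service-b's serverless.yml, the shared value would then be read as `${param:queueUrl}`.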

It’s not a problem - it’s a challenge
Architectural
● How should we organize directories?
● Where should we put shared code?
● What about configuration files?
● Are we going to have a problem with the database?
Serverless
● At least we can share webpack config, can’t we?
● How many deployment buckets are we going to need?
The same for the data lake ones?
● Are we still going to be able to work locally?
● Will the pipeline configuration be troublesome?

Goals
We aimed to meet the following goals:
● shorten deployment time
● make it possible to deploy a single service
as well as everything at once
● maintain the ability to work locally
● keep a single package.json*

Organising directories
We decided to split features by domains and created 4 services:
● api
● finance
● migrations
● people
In each service, we kept the previous directory structure
(functions + shared).
We also kept the original shared folder in the project’s root
directory.

Organising directories
[Figure: directory structure, before and after the split]

Configuration files
Since the very beginning, we opted for small, dedicated configs,
rather than one huge config. Thanks to that, we were able to:
● keep feature-specific configs in features’ directories
● keep service-specific configs in the service’s shared directory
● keep common configs in the project’s shared directory
To isolate environment variables, we split the original
.env file into a dedicated .env file for each service.

Configuration files
[Figure: configuration files, before and after the split]

Database configuration
Because of a few saboteurs, we decided to keep all models in the
project’s shared directory.
Even though we put all models in the same directory, we were able
to create separate, slightly differing ORM configs for each service.
To keep our database in sync, we created a dedicated
service for migrations and marked the other services as dependent on it.
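
With the four services named earlier, that setup could be expressed roughly as below. This is a sketch: the paths are assumptions, and only the deployment order is shown, not how the migrations themselves are run.

```yaml
# serverless-compose.yml (sketch) - paths are assumptions
services:
  migrations:
    path: services/migrations

  api:
    path: services/api
    dependsOn: migrations

  finance:
    path: services/finance
    dependsOn: migrations

  people:
    path: services/people
    dependsOn: migrations
```

With this layout, Compose deploys migrations first; the other three services do not depend on each other, so they can be deployed in parallel afterwards.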

Database configuration
[Figure: ORM configuration for the migrations service vs the other services]

Webpack configuration
If it looks the same, it works the same, right?
Apparently not. 😕
What could be the reason?
A single static config was causing a race condition.
As soon as we created a dedicated config for every service,
our serverless roads were safe again.
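
A dedicated config per service might be wired up as in the fragment below. It is a sketch assuming the serverless-webpack plugin; the file location is hypothetical.

```yaml
# services/finance/serverless.yml fragment (sketch) - assumes serverless-webpack
plugins:
  - serverless-webpack

custom:
  webpack:
    # Each service points at a webpack config of its own
    # instead of a single file shared by all services.
    webpackConfig: ./webpack.config.js
```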

Webpack configuration
[Figure: webpack configuration, before and after the split]

S3 buckets
Although the Serverless Framework allows us to use an existing
bucket for uploading source code files, we still had to use
a separate deployment bucket for each service.
We also decided to split our data lake bucket, so that
every service has its own. That was not necessary,
but we thought that such isolation might be valuable.
We only had to migrate the data from the old data lake bucket to
the new ones. For that we used the AWS CLI sync command (aws s3 sync).
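
Pointing a service at its own deployment bucket could look like this. It is a sketch: the bucket name is hypothetical, and the bucket is assumed to exist already.

```yaml
# services/finance/serverless.yml fragment (sketch) - the bucket name is hypothetical
provider:
  name: aws
  deploymentBucket:
    name: warehouse-finance-deployments   # one deployment bucket per service
```

The data lake migration itself would be a one-off command along the lines of `aws s3 sync s3://<old-data-lake-bucket> s3://<new-service-bucket>`, with the placeholders replaced by real bucket names.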

Local environment
The great thing about our warehouse is that it can run 100%
locally, so we had to preserve that ability.
We achieved it by:
● using service-dedicated commands
● assigning unique HTTP and Lambda ports to each service
The only thing left was refactoring docker-compose.yml
by adding a local Step Functions instance for each service.
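
Unique ports per service could be assigned as in the fragment below. It is a sketch assuming the serverless-offline plugin; the port numbers are arbitrary.

```yaml
# services/finance/serverless.yml fragment (sketch) - assumes serverless-offline
plugins:
  - serverless-offline

custom:
  serverless-offline:
    httpPort: 3010    # HTTP API of this service
    lambdaPort: 3011  # Lambda invocation endpoint of this service
    # Another service would use a different pair, e.g. 3020 / 3021,
    # so that both can run at the same time.
```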

Local environment
[Figure: local environment configuration, before and after the split]

After: We had to use different ports to be able to run multiple services concurrently.
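
Local Step Functions instances per service might be declared as below. It is a sketch assuming the amazon/aws-stepfunctions-local image; the container names, host ports and Lambda endpoints are hypothetical.

```yaml
# docker-compose.yml (sketch) - names, host ports and endpoints are hypothetical
services:
  finance-step-functions:
    image: amazon/aws-stepfunctions-local
    ports:
      - "8083:8083"   # the container listens on 8083
    environment:
      # Assumed local Lambda endpoint of the finance service.
      # host.docker.internal may need extra setup on Linux.
      LAMBDA_ENDPOINT: http://host.docker.internal:3011

  people-step-functions:
    image: amazon/aws-stepfunctions-local
    ports:
      - "8084:8083"   # different host port, same container port
    environment:
      LAMBDA_ENDPOINT: http://host.docker.internal:3021
```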

Pipelines
We had to slightly refactor our pipelines too.
That consisted of:
● adding a few more pipelines to be able to deploy each
service independently to each environment
● calling the lambda responsible for running migrations in a
different way

Before: A single definition for the deploy step. Migrations run immediately after the deployment.

After: A separate definition for each service. To run migrations, we need to call the migrations service.

                                 Before        After         Difference
Services count                   1 service     4 services    +3 services
Deployment time (all at once)    20 minutes    10 minutes    -10 minutes
Monthly cost                     X             Y             Δ ≈ 0
Capabilities                     100%          100%          none

Tips & tricks
Besides learning how to split an application using
Serverless Compose, we also made some observations:
● use short service names
● name your resources using a common prefix:
<service><stage><resource_name>
● don’t be afraid of the lack of serverless knowledge
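
The naming convention could be applied with framework variables, as in the sketch below. The service name, the resource and its logical ID are hypothetical, and the dashes between the parts are an assumption, since the pattern above does not specify a separator.

```yaml
# serverless.yml fragment (sketch) - the service name and resource are hypothetical
service: fin   # a short service name keeps generated resource names short

resources:
  Resources:
    DataLakeBucket:
      Type: AWS::S3::Bucket
      Properties:
        # <service><stage><resource_name>, joined with dashes (an assumption)
        BucketName: ${self:service}-${sls:stage}-data-lake
```

Short prefixes help because several AWS resource names have length limits, for example 64 characters for Lambda function names.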

Summary
We managed not only to shorten the deployment time of our
warehouse but also to make it scalable.
The application is still easy to maintain thanks to its modular
structure.
Serverless Compose is a truly powerful tool and you should
definitely give it a try!

Join the community!
https://tsh.io/programowanko
https://github.com/arturwita/serverless-compose-boilerplate

tsh.io
Thank you for your attention