AWS Techniques and lessons writing a minimal cost GitLab runner
February 2023
● Principal Engineer for Digio
● Focus on platform engineering
● Background in development
● 10 years AWS experience
● Worked ~2 years each in Azure and GCP
● 12 years Infrastructure as Code
● Passion for automating things
● 4 years Terraform experience
● Terraform Associate certified
● Previously AWS Associate certified, but now I’m lazy
Overview of Digio & Mantel Group
Digio and Mantel Group
Melbourne
Sydney
Brisbane
Auckland
Queenstown
Magnetic Island
Perth
Adelaide
Hobart
We’re an Australian-owned, principle-based, technology-led consulting business founded in Melbourne.
Digio is Australia’s premier digital services provider from concept to production, continually evolving alongside technologies and methods.
We are a dynamic business established in November 2017 and have grown to a team of over 200 across Australia and New Zealand.
We are part of the broader Mantel Group, which currently comprises 9 technology brands and a total team of over 800. As a group we have been recognised in the AFR’s 2020 fastest growing companies, achieved #1 Best Place to Work for 2021 and 2022 in the Great Place to Work Survey, and were awarded AWS 2022 Services Partner and Migration Partner of the Year.
Mantel Group Brands
Working with Mantel Group not only enables access to expertise within Digio, but across all current and future brands.
A broad end-to-end capability that is vendor agnostic, yet has deep specialisations…
Software Engineering (API)
Software Engineering (QA)
Platform Enablement
Software Engineering (.NET)
Security & Identity
Managed Services
Data & Analytics
Data Strategy
Analytics & BI
Advanced Analytics
Platform Agnostic
Data Engineering
Technology Strategy & Advisory
Software Engineering (Web)
Application Modernisation
Capabilities
Cloud Native
Migration
Security
Data & Analytics
Managed Services
Digital Workplace
Capabilities
Automation & DevOps
Cloud Computing
Analytics & Machine Learning
Security & Identity
MarTech
Collaboration & Productivity
Capabilities
Training & Certification
Application Transformation
Capabilities
Pursuit Model
Discovery Sprints
Rapid Prototyping
Service Design
ML Engineering
UX/UI Design
Software Engineering (Mobile)
Capabilities
Platform Enablement
Data Engineering
Data Architecture
Training & Certification
Capabilities
Native Mobile Technology Strategy
Native Mobile Product / Design Strategy
Software Engineering (Android)
Software Engineering (iOS)
Delivery & Method
Advanced Analytics
Capabilities
Data Engineering
Data Architecture
Data Strategy
Analytics & BI
Coaching & Training
The problem we faced
Our Solution
https://github.com/cmdlabs/terraform-aws-gitlab-runner-scale
https://registry.terraform.io/modules/cmdlabs/gitlab-runner-scale/aws
Function URL vs EventBridge with polling
The webhook (Lambda function URL) is:
● Faster to respond to events, as it runs ~instantly
● Zero AWS cost to enable
● Cheaper if the repository / runner activity is low
● Open to abuse: third parties could invoke the function without any security permissions
EventBridge is:
● More predictable in terms of AWS spend
● 14 million free invocations
● Slower to respond
● Lower cost if the GitLab project activity is high
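The two trigger options above could be wired up roughly as follows. This is a minimal sketch, not the module's actual code: the variable names, resource names and the one-minute schedule are assumptions for illustration, and the scaling Lambda itself is assumed to be defined elsewhere.

```hcl
# Hypothetical inputs: the scaling Lambda is assumed to exist already.
variable "lambda_function_name" {
  description = "Name of the Lambda function that scales the runners."
  type        = string
}

variable "lambda_function_arn" {
  description = "ARN of the same Lambda function."
  type        = string
}

# Option 1: webhook. A function URL gives GitLab an HTTPS endpoint to call.
# With authorization_type = "NONE" anyone who knows the URL can invoke it,
# so the function code has to validate the caller itself.
resource "aws_lambda_function_url" "webhook" {
  function_name      = var.lambda_function_name
  authorization_type = "NONE"
}

# Option 2: polling. EventBridge invokes the function on a fixed schedule.
resource "aws_cloudwatch_event_rule" "poll" {
  name                = "gitlab-runner-poll" # hypothetical name
  schedule_expression = "rate(1 minute)"
}

resource "aws_cloudwatch_event_target" "poll" {
  rule = aws_cloudwatch_event_rule.poll.name
  arn  = var.lambda_function_arn
}

# Allow EventBridge to invoke the function.
resource "aws_lambda_permission" "poll" {
  statement_id  = "AllowEventBridgeInvoke"
  action        = "lambda:InvokeFunction"
  function_name = var.lambda_function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.poll.arn
}
```

In practice only one of the two options would normally be enabled. For the webhook, one common mitigation for the abuse risk is to check the secret token that GitLab sends in the `X-Gitlab-Token` header inside the function.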
Scaling Out
● Make use of:
○ CloudWatch metrics and CloudWatch alarms
○ Triggers on the Auto Scaling group
○ Scaling policies to determine how many instances to add
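As a rough illustration of how an alarm, a trigger and a scaling policy fit together for scaling out, consider the sketch below. The metric namespace and name, the thresholds and all resource names are hypothetical; the deck does not say which metrics the module publishes.

```hcl
variable "autoscaling_group_name" {
  description = "Name of the runner Auto Scaling group (hypothetical input)."
  type        = string
}

# The policy decides how many instances to add when it is triggered.
resource "aws_autoscaling_policy" "scale_out" {
  name                   = "gitlab-runner-scale-out"
  autoscaling_group_name = var.autoscaling_group_name
  policy_type            = "SimpleScaling"
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = 1
  cooldown               = 300
}

# The alarm watches a metric and triggers the policy when jobs are waiting.
resource "aws_cloudwatch_metric_alarm" "jobs_pending" {
  alarm_name          = "gitlab-runner-jobs-pending"
  namespace           = "GitLabRunner" # hypothetical custom namespace
  metric_name         = "PendingJobs"  # hypothetical custom metric
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0
  treat_missing_data  = "notBreaching"
  alarm_actions       = [aws_autoscaling_policy.scale_out.arn]
}
```

A single evaluation period keeps the reaction time short, which suits scaling out, where waiting jobs should get a runner as soon as possible.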
Scaling In
● Requires multiple inputs and considerations
○ Avoid churn of runners
● Scale down based on load (number of active runners and jobs in the queue)
● Make use of how CloudWatch avoids premature state transitions (see "Avoiding premature transitions to alarm state" in the CloudWatch documentation)
○ AWS alarms include logic to try to avoid false alarms
○ CloudWatch waits the full N periods before alarming
○ Any time the metric is above the threshold, the alarm "timer" is effectively reset
● The trade-off: longer idle time means additional cost
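A scale-in alarm along these lines might look like the sketch below. The metric, the ten-minute window and the resource names are assumptions for illustration only. The point is that the alarm fires only after N consecutive idle periods, so a single busy datapoint restarts the wait.

```hcl
variable "autoscaling_group_name" {
  description = "Name of the runner Auto Scaling group (hypothetical input)."
  type        = string
}

resource "aws_autoscaling_policy" "scale_in" {
  name                   = "gitlab-runner-scale-in"
  autoscaling_group_name = var.autoscaling_group_name
  policy_type            = "SimpleScaling"
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = -1 # negative value removes an instance
  cooldown               = 300
}

resource "aws_cloudwatch_metric_alarm" "idle" {
  alarm_name  = "gitlab-runner-idle"
  namespace   = "GitLabRunner"         # hypothetical custom namespace
  metric_name = "ActiveAndPendingJobs" # hypothetical custom metric
  statistic   = "Maximum"
  period      = 60

  # All 10 of the last 10 one-minute datapoints must be idle before the
  # alarm fires; any datapoint above the threshold resets the "timer".
  evaluation_periods  = 10
  datapoints_to_alarm = 10
  comparison_operator = "LessThanOrEqualToThreshold"
  threshold           = 0

  alarm_actions = [aws_autoscaling_policy.scale_in.arn]
}
```

Raising `evaluation_periods` reduces runner churn at the price of more idle instance time, which is the trade-off noted above.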
Cost Estimation
Lambda
● Running the Lambda with (in the ap-southeast-2 region):
○ x86 architecture
○ 1 request per minute
○ 2,000 ms duration
○ 128 MB memory allocated
○ 512 MB ephemeral storage (default)
● With the free tier: $0.00 a month
● Without the free tier: $0.19 USD a month (43,800 invocations)
Runner (EC2)
● t3.medium spot instance(s), 5 hours over the month at the average price of $0.0158 per hour: $0.079 a month
● t3.medium on-demand instance(s), 5 hours over the month at the average price of $0.0528 per hour: $0.264 a month
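The figures above can be reproduced with the following arithmetic. The Lambda rates used here ($0.20 per million requests and $0.0000166667 per GB-second for x86) and the 730-hour month are assumptions, as the slide does not state them, but they do yield the quoted 43,800 invocations and $0.19.

```latex
\begin{align*}
\text{Invocations} &= 1\ \text{per minute} \times 60 \times 730\ \text{hours} = 43{,}800 \\
\text{Compute} &= 43{,}800 \times 2\ \text{s} \times 0.125\ \text{GB} = 10{,}950\ \text{GB-seconds} \\
\text{Compute cost} &= 10{,}950 \times \$0.0000166667 \approx \$0.1825 \\
\text{Request cost} &= 43{,}800 \times \frac{\$0.20}{1{,}000{,}000} \approx \$0.0088 \\
\text{Lambda total} &\approx \$0.1825 + \$0.0088 \approx \$0.19 \\[4pt]
\text{EC2 spot} &= 5\ \text{h} \times \$0.0158 = \$0.079 \\
\text{EC2 on demand} &= 5\ \text{h} \times \$0.0528 = \$0.264
\end{align*}
```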
Cost Estimation vs Docker Machine
● Trade-off: slower to respond due to runner startup
● Likely not ideal for high-activity pipelines
● Small pipelines that trigger after hours
● Install and register GitLab Runner for autoscaling with Docker Machine
○ ~$10 a month for a pilot instance running 24/7
● Patching and maintenance
● Verification
● Troubleshooting
● Internally we had issues with SSH access
● Overall cost becomes a lot higher
● Nice to just have it work
Terraform tips and tricks
● Diagrams and pictures
● Working examples
● Explain why, not what
Auto-generated Terraform docs - https://terraform-docs.io/
● Variable validation
● Ensure we pass in valid data
● Can never be sure what users will pass in
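A minimal sketch of variable validation is shown below. The variable name `asg` and its attributes are hypothetical and not taken from the module. Grouping the values into one object lets a single variable's validation compare them with each other; `optional()` with defaults assumes Terraform 1.3 or later.

```hcl
variable "asg" {
  description = "Auto Scaling group sizing (hypothetical variable)."
  type = object({
    min_size         = optional(number, 0)
    max_size         = optional(number, 1)
    desired_capacity = optional(number, 0)
  })
  default = {}

  validation {
    condition     = var.asg.min_size >= 0 && var.asg.min_size <= var.asg.max_size
    error_message = "The min_size must be zero or more and no greater than max_size."
  }

  validation {
    condition = (
      var.asg.desired_capacity >= var.asg.min_size &&
      var.asg.desired_capacity <= var.asg.max_size
    )
    error_message = "The desired_capacity must be between min_size and max_size."
  }
}
```

With this in place, a call such as `asg = { min_size = 3, max_size = 1 }` should fail during `terraform plan` with the first error message rather than surfacing later at apply time.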
● Sort attributes alphabetically
○ Reduces cognitive load
● Order resources logically
○ If the same resource, alphabetically
● Multiple tf files
● Split via high level resource type
● Reduces cognitive load
● Reduces visual complexity
● Reduce duplication with locals
● Move complex operations into locals
● Magic strings
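The sketch below illustrates both uses of locals: a repeated "magic string" defined once, and a more complex expression moved out of the resource block. The variable names, the naming scheme and the log group are made up for the example.

```hcl
variable "project" {
  description = "Hypothetical project name used to build resource names."
  type        = string
}

variable "extra_tags" {
  description = "Additional tags supplied by the consumer."
  type        = map(string)
  default     = {}
}

locals {
  # One place for the repeated string instead of hard-coding it per resource.
  name_prefix = "${var.project}-gitlab-runner"

  # Complex expressions live here so resource blocks stay readable.
  tags = merge(
    {
      Name      = local.name_prefix
      ManagedBy = "terraform"
    },
    var.extra_tags
  )
}

resource "aws_cloudwatch_log_group" "runner" {
  name              = "/${local.name_prefix}/lambda"
  retention_in_days = 14
  tags              = local.tags
}
```

Because `name_prefix` is a local rather than a variable with a default, consumers cannot override it by accident.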
● Infer data where possible
● Reduces input requirements
● Reduces possible mistakes
○ VPC and subnets not aligned
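One way to infer data instead of asking for it is sketched below, assuming the consumer only passes subnet IDs (a hypothetical input). The account ID comes from the credentials in use and the VPC is looked up from the first subnet, so the VPC and subnets cannot drift apart.

```hcl
variable "subnet_ids" {
  description = "Subnets for the runner instances (hypothetical input)."
  type        = list(string)

  validation {
    condition     = length(var.subnet_ids) > 0
    error_message = "At least one subnet ID is required."
  }
}

# The account being deployed into; no account_id variable is needed.
data "aws_caller_identity" "current" {}

# Look up the first subnet so the VPC can be derived from it.
data "aws_subnet" "first" {
  id = var.subnet_ids[0]
}

locals {
  account_id = data.aws_caller_identity.current.account_id
  vpc_id     = data.aws_subnet.first.vpc_id
}

resource "aws_security_group" "runner" {
  name_prefix = "gitlab-runner-"
  vpc_id      = local.vpc_id
}
```

The same approach applies to the region, which can be read from the `aws_region` data source rather than passed in.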
Demo
Thank you

Editor's Notes

  • #2: Hi, I’m Anthony Scata and I’m going to talk about some of my experience, lessons, coding tips and tricks from my journey to write a module for deploying GitLab runners in a cost-effective manner. We will see how things go; I may even show some live demos.
  • #3: Start by saying Happy Valentine’s Day; hopefully by saying this I can gain some good karma from my wife, who is likely sitting at home, angrily watching TV and wondering where I am. I did ask her to join us but she wasn’t keen.
  • #5: As a good consultant I cannot start a presentation without talking about where I work.
  • #9: Working in a consultancy we often have internal projects, some of which are hosted in AWS. They are not business critical but may be a small application used by a few people, an internal project or a solution accelerator that we showcase to clients regarding the latest technologies. The issue is that we don’t make money from these; as a professional services consultancy we have our team members billable to clients. This means most people are very busy working on client projects and can be taken off internal work for higher value work. It also means people are busy, trying to work internally and just getting things done; typically this means automation or infrastructure as code are on the back burner. It is quite ironic that a company that works so much in the CI/CD space has very little maturity internally, but as mentioned this isn’t how we make money. As time is tough to come by and client projects can pop up, consistency for internal projects is often an issue and a lot of projects become orphaned with little to no support.
  • #10: Engineers come onto these projects and implement something simple, rarely with time to make it better or easier, just doing what they know. Over time this leads to a large mess of reinvention of solutions, especially infrastructure as code, all bandaged together. Automation is usually not people's expertise and, as most of us know, is overlooked until the next person comes along to see the dumpster fire of a setup, continuing the cycle.
  • #11: If people do look at automation it often gets expensive, both in time to set up and then to maintain. Any system left on needs to be patched, verified, validated and monitored. We have found this to be a large sink of money, especially for projects that are rarely touched. We often float the idea of centrally managed runners but then we have an issue with ownership, cost allocation, debugging and general usage; it becomes very painful.
  • #12: The solution needed to be low cost, easy to build and maintain, and reproducible, so that it could be deployed into any AWS account or region. It also needed to be high quality so it can be reused on another project without falling over every few months. Serverless technologies provide a great approach: there is no need to patch or upgrade running systems, they are usually low cost, or at least lower, and they provide little attack surface for malicious actors. The idea is that it also works well for small projects and can scale to something larger, for when you don’t want to spend a lot of money on CI/CD runners but, if the project grows, don’t want to have to set up a whole new process or implementation.
  • #13: So this is where it led to a Terraform module, automating the process of building GitLab runners.
  • #20: EC2 cost ~$15 a month plus extra costs
  • #21: EC2 cost ~$15 a month plus extra costs
  • #22: Now some of the more technical tips and tricks that I learnt along the way. These help the next person picking up the code. Again, one issue is people picking up the code. Build and document as if you were the one looking at this for the first time, and think about what would really help.
  • #23: As with anything, an architecture diagram, or documentation as a whole, is important. Nothing says this is a well maintained piece of code like documentation which is factual and thorough. Diagrams can really help paint a picture of what will be deployed. Again, why have consumers extract data by looking at code when they can see it from a high level? Coupling this with examples of working code makes it easier for people to try. You want to lower the barrier to entry for any piece of software, and you may need the examples yourself as they provide a good guide. And lastly, explain in the documentation why decisions were made. At times we focus on what was done but not the motivation or limitation as to why. The how can mostly be seen from the code and we can reason about it; the why is more abstract and less obvious. We use this CloudWatch setting because..., this is set to a negative value so that... This helps the next person who thinks, why was this done, let me change it to something else that makes more sense to me, only to find themselves in the same situation and rabbit hole you did. Be kind to your future self and to other engineers.
  • #24: I want my code to be well documented and for those interested in it to look at it only if necessary, the key word being necessary. terraform-docs provides the ability to automatically generate resource, variable, input, provider and other docs based on the code. This means less looking at the code if you are new to the module, and provides a better snapshot. Now I can see if this works with the AWS provider version I need for another module, whether it uses a resource type that my organisation does not allow and, more importantly, which input variables I need to supply, why and how. With the validation mentioned earlier and the docs, a consumer shouldn’t need to view the code to see how a variable will be used, making it easier to use for less experienced engineers.
  • #25: As of Terraform 0.13.0 you have the ability to validate variables: to check that the contents of a variable are, for example, within a certain number range, match a regex or are a valid JSON string. The idea is that sometimes a plan does not catch these incompatibilities due to the provider; we only find them when it's running the apply, which is likely too late. Let's do this before the plan to ensure we have a consistent and working environment. One advantage of the variable map and optionals as mentioned before is that we can check multiple variable values, for example that the min is less than the max and the desired is somewhere in between. If the variables are defined separately this cannot be done.
  • #26: This may seem minor but it really helps others who are viewing or changing your code. Some resources may use 10 or 20+ attributes and it may be hard to comprehend what is being used. Sorting the attributes alphabetically makes it easy for others to see where an attribute is placed and then how it's used. Reducing the cognitive load of making changes and decisions (where does this go, should I put this here) helps. This includes the resources themselves. Although Terraform does not run in a sequential order, it helps us humans to comprehend change and find resources.
  • #27: We have all seen code with hundreds or thousands of lines and thought, oh god, not this file again. This adds extra stress and cognitive load to changes. You are much better off splitting the files by higher level resource type, possibly autoscaling and cloudwatch, and then adding locals specific to that set of resources into the file. Keeping the resources somewhat contained helps to facilitate change. This may sound contradictory to the earlier point about logical ordering, and it does depend on how many resources you are creating, but anything more than 10 resources per file starts to get unwieldy.
  • #28: The use of locals makes it easier to reuse strings or data without having to hard code them in multiple places. Sometimes people make these variables with defaults, which can be messy as it gives consumers the ability to change them. Utilise locals where possible to reduce duplication of magic strings: once, twice, three times, extract into a local. Locals can also be used to move the complexity of how a value is computed into a separate file. Resource definitions can be large enough, let alone when you add in a join, compact, concat, split, tostring or try. Move this out, make it a local and reference it when needed.
  • #29: Try to infer variables where possible rather than having the consumer pass them in. For example, the caller account: there is no need to pass in an account ID, as we are deploying into this account, so just grab the ID with a data source. The same goes for the region. This reduces the duplication of variables and the possibility that the consumer changes region and forgets to update the variable.
  • #31: Thank you for listening to my presentation; I hope that you gained something useful for your Terraform and Infrastructure as Code journey.