Data warehousing basics: environments
If you have been in data warehousing for many years, please jump to the end and read the last section: Using production data for development and testing.
Like everything else in IT, in data warehousing we have 3 environments: the development environment, the testing environment and the production environment. We call them Dev, Test and Prod, respectively. Some companies have PreProd or Staging, some companies have Break-Fix, but because this is data warehousing basics, let's keep it simple and use only 3 environments: Dev > Test > Prod.
By the way, the Test environment is also known as QA (Quality Assurance).
So you are a data engineer: you develop the data warehouse, the ingestions and the transformations in the Dev env. Your colleague, a business analyst, needs to test the data warehouse. So you deploy the data warehouse to the Test env, you deploy your ingestion + transformation code to the Test env, and you run it. The warehouse in the Test env is now populated with data, and the BA can do their tests.
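As a side note, here is a minimal Python sketch of what "the same pipeline code running against Dev or Test" can look like. Every name in it (servers, databases, functions, the TARGET_ENV variable) is made up for illustration; your ingestion tool will have its own way of doing this.

```python
# Minimal sketch: the same ingestion/transformation code runs against Dev,
# Test or Prod purely by changing configuration. All server, database and
# function names below are hypothetical.
import os

CONNECTION_STRINGS = {
    "dev":  "Server=dwh-dev;Database=PaymentsDWH",
    "test": "Server=dwh-test;Database=PaymentsDWH",
    "prod": "Server=dwh-prod;Database=PaymentsDWH",
}

def ingest_payments(conn_str: str) -> None:
    # In a real pipeline this would load the source payment files/tables
    # into staging tables in the target warehouse.
    print(f"Ingesting payment data into: {conn_str}")

def transform_payments(conn_str: str) -> None:
    # In a real pipeline this would populate the fact and dimension tables.
    print(f"Transforming payment data in: {conn_str}")

if __name__ == "__main__":
    # The deployment pipeline decides the target, e.g. TARGET_ENV=test.
    env = os.environ.get("TARGET_ENV", "dev")
    ingest_payments(CONNECTION_STRINGS[env])
    transform_payments(CONNECTION_STRINGS[env])
```

The point is that only the configuration changes between environments; the code itself stays identical.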
The tests fail. The BA raises a bug ticket in JIRA and describes the issue. The ticket is assigned to you, you fix the issue in Dev, and redeploy the code to Test and run it. The BA retests the code, and this time all tests are successful.
So you create a Pull Request to merge the code from your branch into Main. In Azure DevOps (ADO), the Main branch is the good, clean version of the code. When you start your development, you copy this Main branch into your own branch.
In your branch you add your code, such as an additional ingestion or an additional transformation. Then you run it locally to test it. Then you create something called a Pull Request (PR), meaning you want to pull the content of your branch into the Main branch and merge them.
Your colleague, another data engineer, reviews your PR. In ADO they see the Main branch and your branch side by side. So they can see the difference, which is the additional code that you have just added. They check this additional code and everything looks fine, so they approve the PR. When they approve the PR, your code is merged with the Main branch. Your changes are now in the Main branch.
Now you deploy the Main branch to the Dev environment and test it there. You check the ingested and transformed data in the Dev data warehouse, and it is correct.
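What does "checking the data in Dev" look like in practice? Here is one hedged example: a few reconciliation queries comparing the staging table with the fact table. The table names are hypothetical, and sqlite3 merely stands in for the Dev warehouse connection.

```python
# Hypothetical sanity checks after a pipeline run in Dev: compare row counts
# and totals between the staging table and the fact table. The table names
# are made up, and sqlite3 stands in for the Dev warehouse connection.
import sqlite3

conn = sqlite3.connect("dev_warehouse.db")

checks = {
    "staging row count": "SELECT COUNT(*) FROM stg_payments",
    "fact row count":    "SELECT COUNT(*) FROM fact_payment",
    "staging total":     "SELECT SUM(amount) FROM stg_payments",
    "fact total":        "SELECT SUM(amount) FROM fact_payment",
}

for name, sql in checks.items():
    value = conn.execute(sql).fetchone()[0]
    print(f"{name}: {value}")

conn.close()
```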
Now you deploy the Main branch to the Test environment and ask your BA to check the data in the Test DB. You change the status of the ticket in JIRA to "In testing". All this time the status was "In development".
So the BA tests the changes, and all the data looks good. The BA changes the status in JIRA to "Passed testing".
Now we are ready to deploy to the Production environment. You fill in a change form. In that form you explain what the change is, why the change is required, and which objects are impacted. You also attach the test evidence from the BA.
You also explain how to deploy the change to Prod and how long the deployment will take, plus how to roll back the change if something goes wrong and how long the rollback will take. You also write down how to verify that the change has been implemented successfully.
Once a week, changes are approved in a forum called the "Change Control Board" (CCB). They only approve changes to production systems. There are all sorts of changes: networking changes, Microsoft Office changes, changes to company websites, data governance, infrastructure changes, etc. Your data warehouse change is included in this weekly CCB forum and is approved.
Some changes are so critical that they need to be implemented at the weekend. This way the Production Support team will have time to roll back the change if something goes wrong. Some changes are small (taking only a minute to deploy) and the impact is minimal, so they can be implemented during the day, or after working hours, like 6pm. After implementing the change, the Production Support team verifies that the change has been implemented correctly.
And that's it, your change is now in production. And that, my friends, is the full lifecycle of a change going through the Dev, Test and Prod environments.
I know I missed a lot of things above, like check-in, check-out, CI/CD, the release process, IaC, etc. But hey, this article is part of the Data Warehousing Basics series (scroll down to the bottom for the complete list). It is an intro for someone who has just started in data warehousing, so it has to be simple so they can grasp the basics.
What I am going to explain below is absolutely crucial for a beginner, but it is also crucial for experts. It is about using production data in data warehousing projects.
Using production data for development and testing
Imagine that you have a data warehouse which is operational. It gets payment data from various payment processing systems and integrates it into one place in the warehouse. Your job, as a BI developer, is to create a payment dashboard. On this dashboard users can see the total payments to all suppliers every month.
For this you are given access to the Dev warehouse, which has only 2 suppliers, and the data is 2 years old. You have the dashboard spec in your hand, so you start creating that dashboard using that old data. You have an issue creating the bar chart for "breakdown by payment type", because all the payments in the Dev environment have payment type = blank.
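To see why a blank column kills a chart, here is a tiny pandas sketch of the "breakdown by payment type" logic, using made-up data shaped like that Dev environment.

```python
# Made-up data resembling the Dev environment: only 2 suppliers, old dates,
# and payment_type is blank on every row.
import pandas as pd

dev_payments = pd.DataFrame({
    "supplier":     ["Supplier A", "Supplier A", "Supplier B"],
    "payment_date": ["2023-01-15", "2023-02-20", "2023-03-05"],
    "payment_type": [None, None, None],   # blank everywhere in Dev
    "amount":       [1200.0, 850.0, 430.0],
})

# The logic behind the "breakdown by payment type" bar chart:
breakdown = dev_payments.groupby("payment_type")["amount"].sum()
print(breakdown)   # empty Series, so the bar chart renders blank
```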
You get through somehow. You manage to create all the charts specified in the spec. But when you run it, most of the charts and tables on the dashboard are blank.
But you've done your job, so you hand it over to the tester. You deploy that dashboard to the Test environment and run it. This time it's a little bit better: the Test DB contains dummy data for 10 suppliers, and a few charts appear on the screen. But the tester can't test it properly, as many of the charts are blank.
That's what happens when you use made-up data to develop and test BI reports and BI dashboards: you can't do your job properly.
Now imagine the same scenario as above, but before that story began, we copied the Prod DB to the Dev DB and the Test DB. Now when you develop your dashboard, you see all the charts and tables on the screen, fully populated with all the numbers.
When the tester tests it, they will see the same thing: all charts and tables on the dashboard are fully populated with data. So the tester can check that the numbers are correct.
So my friends, when you develop BI reports and dashboards, you need to use production data. The same goes for testing.
These days data warehouses are used for training ML models too, and for this we need production data as well. You absolutely cannot use made-up data to train ML models. The same goes for Gen AI models: they need to be trained with production data.
So you need to copy your Prod DB into the Dev DB and the Test DB, so that you can develop and test your BI/AI reports and models properly.
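If it helps to picture the copy, here is a hedged sketch that refreshes a single Dev table from Prod using pandas and SQLAlchemy. In reality a DBA would more likely restore a full Prod backup into Dev and Test; every connection string and table name below is invented.

```python
# Hypothetical sketch: refresh one Dev table from Prod with pandas/SQLAlchemy.
# In practice a DBA would more likely restore a full Prod backup into Dev and
# Test; every connection string and table name here is made up.
import pandas as pd
from sqlalchemy import create_engine

prod = create_engine("mssql+pyodbc://dwh-prod/PaymentsDWH?driver=ODBC+Driver+18+for+SQL+Server")
dev  = create_engine("mssql+pyodbc://dwh-dev/PaymentsDWH?driver=ODBC+Driver+18+for+SQL+Server")

# Read the payment fact table from Prod and overwrite the Dev copy.
payments = pd.read_sql("SELECT * FROM fact_payment", prod)
payments.to_sql("fact_payment", dev, if_exists="replace", index=False)
print(f"Copied {len(payments)} rows of production payment data into Dev")
```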
How does this fit with data governance?
Under GDPR, companies in Europe can only use customer data for the purposes to which the customer has given their consent. This needs to be written in the terms of service or the privacy policy. If you want to use customer data to train AI models, your terms of service need to include something like this:
We may use customer data to support and develop AI functionality within our products and services. You instruct us to use customer data for such purposes. You may opt out of having your customer data used to train Company X models by emailing dataprivacy@companyx.com
And to use customer data to create BI reports, you put something like this in your privacy policy:
We use your personal data for a variety of purposes, including to provide our products and services to you, contact you about relevant content, and improve and promote our products and services.
The key phrase here is "to improve our products and services". The BI reports you develop are used to improve your services. For example, to analyse the bottleneck in the ordering process: when customers put in their orders, at which point in the process do they get stuck? You can do this analysis (and reporting) because you have stated that the purpose of collecting customer data is to improve your products and services.
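As an illustration, that bottleneck analysis could look something like this in pandas. The order events and column names are invented; the point is simply to measure how long orders sit in each step.

```python
# Made-up order events: one row per step each order reaches.
import pandas as pd

events = pd.DataFrame({
    "order_id": [1, 1, 1, 2, 2, 2],
    "step":     ["placed", "payment", "dispatch", "placed", "payment", "dispatch"],
    "timestamp": pd.to_datetime([
        "2024-05-01 09:00", "2024-05-01 09:05", "2024-05-03 14:00",
        "2024-05-02 10:00", "2024-05-02 10:02", "2024-05-04 16:30",
    ]),
})

events = events.sort_values(["order_id", "timestamp"])
# Hours it took each order to reach this step from the previous one.
events["hours_to_reach"] = (
    events.groupby("order_id")["timestamp"].diff().dt.total_seconds() / 3600
)

# The step with the largest average wait is the bottleneck (here: dispatch).
print(events.groupby("step")["hours_to_reach"].mean())
```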
How about doing trend analysis and reporting? Then you put something like this in your privacy policy:
We use the information you provide us, professional data, log files, and usage data about how you interact with our product and services to develop and improve our products, services, and communications.
This way you can use usage data to assess trends and usage across the product to determine what new features or integrations your users may be interested in.
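A hedged sketch of that kind of trend analysis, again with invented usage numbers:

```python
# Invented usage data: monthly event counts per product feature.
import pandas as pd

usage = pd.DataFrame({
    "month":   ["2024-01", "2024-01", "2024-02", "2024-02", "2024-03", "2024-03"],
    "feature": ["export",  "search",  "export",  "search",  "export",  "search"],
    "events":  [120,        800,       310,       790,       650,       805],
})

trend = usage.pivot_table(index="month", columns="feature", values="events", aggfunc="sum")
print(trend)                # export usage is climbing fast, search is flat
print(trend.pct_change())   # month-on-month growth per feature
```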
Bottom line: you need to explicitly mention what you will be using customer data for.
Hope this helps.
Cover image: https://miamioh.edu/it-services/news/2017/08/dev-test-prod.html
My Linkedin articles: https://www.linkedin.com/pulse/list-all-my-articles-vincent-rainardi-eohge/
My blog: https://www.datawarehouse.org.uk/
#DataWarehouse #Data