8 Best Practices in Data Engineering
In data engineering we essentially do data processing (transformations, calculations, formatting, mapping), and for that we build data pipelines and data storage. These pipelines and processing could be for a data migration, for a data platform, or for feeding applications (including AI). I've been building these pipelines and storage for years, and over time I've noticed a few principles and best practices that we should follow.
Here they are:
1. Cloud infrastructure
2. Build in stages
3. Automation
4. Expect bad data
5. Secure
6. Good DevOps
7. Data sharing
8. Easy to manage
First up: Cloud infrastructure
The first thing we should do is build the infrastructure. It should be cloud based, like Azure or AWS, and it should be SaaS, so you are not installing anything yourself. There are many reasons for building the infrastructure in the cloud. It does not require a big up-front capital outlay; you only pay for what you used in the previous month. It's much more secure than if you build the security yourself (and a lot cheaper). It is scalable: you can increase the computing power and storage instantly. It's flexible. And it helps you attract talent, and retain it.
You must build all the infrastructure using IaC (Infrastructure as Code) like Terraform. Do not build it manually, because you need to build 3 or 4 sets (called "environments"): one for Dev, one for Test, one for Production, and possibly one for PreProduction. If you build them manually it's a nightmare to keep them in sync across environments.
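One way to keep those environments honest (a minimal sketch, assuming each environment is described by a simple key = value .tfvars file; the file names and the helper itself are hypothetical) is to check that the same variables are defined in Dev, Test and Production:

```python
# Hypothetical helper: compare simple "key = value" .tfvars files across
# environments to spot settings that exist in one environment but not another.
from pathlib import Path


def read_tfvars(path: str) -> dict:
    """Parse a minimal key = value tfvars file (no nested blocks)."""
    settings = {}
    for line in Path(path).read_text().splitlines():
        line = line.split("#")[0].strip()  # drop comments and whitespace
        if "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


def report_drift(envs: dict) -> None:
    """Print variables that are missing from any environment."""
    all_keys = set().union(*(settings.keys() for settings in envs.values()))
    for key in sorted(all_keys):
        missing = [env for env, settings in envs.items() if key not in settings]
        if missing:
            print(f"{key}: missing in {', '.join(missing)}")


if __name__ == "__main__":
    environments = {
        "dev": read_tfvars("dev.tfvars"),
        "test": read_tfvars("test.tfvars"),
        "prod": read_tfvars("prod.tfvars"),
    }
    report_drift(environments)
```

In practice you would more likely rely on a shared Terraform module plus one variable file or workspace per environment; the point is that what an environment contains is defined in code, not in someone's memory.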
Second: build it in stages
Do not try to build everything in one go; it will end in disaster. Instead, pick just one area and build it using Terraform. I would recommend starting with the database. Build the database first, using Terraform: the roles, database, schemas, compute, users, privileges, etc. Then build the ingestion. Again, start small: just aim to put one file into S3 and load it into the database. While it's still small, go live with it. Put it into the Test and Production environments, and you'll learn a lot from doing that. From connectivity to authentication to service users, and everything in between. From transformation to data processing to calculation and formatting. From data mapping to data loading, from blob storage to Delta tables, from logging to monitoring. There is a ton to build. The key, my friends, is to build it one at a time. One by one. And every time you build one thing, put it into Production. Don't wait until things accumulate in Dev.
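As an illustration of that first small slice (a minimal sketch, assuming an AWS landing bucket and a warehouse that supports a COPY-style load; the bucket, table and file names are made up):

```python
# A minimal sketch of the "one file into S3, then into the database" first slice.
# Bucket name, table name and the load statement are assumptions, not a spec.
import boto3


def upload_raw_file(local_path: str, bucket: str, key: str) -> None:
    """Push one source file into the landing bucket."""
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key)


def load_into_warehouse(cursor, bucket: str, key: str) -> None:
    """Load the landed file into a staging table.

    The exact statement depends on your warehouse (Redshift COPY,
    Snowflake or Databricks COPY INTO, etc.); this one is only indicative.
    """
    cursor.execute(
        f"COPY INTO raw.customers FROM 's3://{bucket}/{key}' FILE_FORMAT = (TYPE = CSV)"
    )


if __name__ == "__main__":
    upload_raw_file("customers.csv", "my-landing-bucket", "incoming/customers.csv")
    # cursor = ...  # open a connection to your warehouse here, then:
    # load_into_warehouse(cursor, "my-landing-bucket", "incoming/customers.csv")
```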
Third: automate it
You have to automate EVERYTHING. From the moment the data is extracted from the sources, all the way to the reports and applications, everything must be automated. There should be no manual steps. The water in the pipes must flow freely end-to-end, to use the analogy. This is what differentiates a good pipeline from a bad one. In a good data pipeline everything is automated: the processing, the calculation, the formatting, the extraction, the transformation, the loading, everything.
One of the key success factors in data engineering is automating the testing. After you build something, spend an hour or two creating unit tests for it: a few tests that prove it works. For example, if it's about loading data, put in a small file and test that the data is loaded into the target table correctly. You can use Azure Test Plans, dbt, IceDQ, Astera, QuerySurge, Rivery, etc. In the long run it will save you thousands of hours, and increase your code quality many times over.
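For example, a couple of pytest tests around a small (hypothetical) transformation function might look like this; the point is that every piece you build comes with a few cheap, automated proofs that it works:

```python
# A minimal sketch of automated unit tests for a transform step, using pytest.
# The transformation function and its rules are made up for illustration.
import pytest


def clean_customer(row: dict) -> dict:
    """Example transformation under test: trim names and normalise country codes."""
    return {
        "customer_id": int(row["customer_id"]),
        "name": row["name"].strip(),
        "country": row["country"].upper(),
    }


def test_clean_customer_trims_and_uppercases():
    row = {"customer_id": "42", "name": "  Jane Doe ", "country": "gb"}
    assert clean_customer(row) == {"customer_id": 42, "name": "Jane Doe", "country": "GB"}


def test_clean_customer_rejects_non_numeric_id():
    with pytest.raises(ValueError):
        clean_customer({"customer_id": "abc", "name": "X", "country": "gb"})
```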
And everything must be logged. Every single step in the data processing must be logged: what happened and when, who ran it, how long it took and what the result was. Everything must be logged securely. This is necessary for troubleshooting when there's a problem, and it's also mandatory for audit.
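A minimal sketch of what step-level logging could look like in Python (the step names and log fields are assumptions; in a real platform you would ship these logs to your central logging and monitoring tool):

```python
# A minimal sketch of step-level logging: what ran, who ran it, when,
# how long it took, and the result. Field names are assumptions.
import getpass
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("pipeline")


@contextmanager
def logged_step(step_name: str):
    """Log start, end, duration, user and outcome of one pipeline step."""
    user = getpass.getuser()
    start = time.monotonic()
    logger.info("step=%s user=%s status=started", step_name, user)
    try:
        yield
        logger.info("step=%s user=%s status=success duration=%.1fs",
                    step_name, user, time.monotonic() - start)
    except Exception:
        logger.exception("step=%s user=%s status=failed duration=%.1fs",
                         step_name, user, time.monotonic() - start)
        raise


if __name__ == "__main__":
    with logged_step("load_customers"):
        pass  # call the actual load step here
```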
Fourth: Expect bad data
You will encounter a lot of bad data. That is just the nature of data engineering. From duplicate data to missing data to strings in a numeric column, you name it, it's there. So code accordingly. Don't let the data fail the process. If one day a date column contains an invalid date value, your code should be able to deal with it gracefully. The last thing we all want is for the load process to stop because of bad data. If a customer ID exists twice in the customer file (a duplicate), it should not stop the process. If a mandatory field is not supplied, it should not stop the process. No bad data of any kind should stop the process.
Yes, there should be data quality remediation: you're running data quality rules, data governance, a data dictionary, and all those good things. But no bad data should stop the pipelines.
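As a sketch of that defensive style (column names and validation rules are assumptions), bad rows are routed to a rejects list for the data quality process to pick up, and the load itself never fails:

```python
# A minimal sketch of defensive loading: bad rows go to a rejects list
# for data-quality follow-up instead of stopping the run.
from __future__ import annotations

from datetime import datetime


def parse_row(row: dict) -> tuple[dict | None, str | None]:
    """Return (clean_row, None) or (None, reason) without ever raising."""
    if not row.get("customer_id"):
        return None, "missing mandatory customer_id"
    try:
        signup_date = datetime.strptime(row.get("signup_date", ""), "%Y-%m-%d").date()
    except ValueError:
        return None, f"invalid signup_date: {row.get('signup_date')!r}"
    return {"customer_id": row["customer_id"], "signup_date": signup_date}, None


def load(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    clean, rejects, seen = [], [], set()
    for row in rows:
        parsed, reason = parse_row(row)
        if reason:
            rejects.append({"row": row, "reason": reason})
        elif parsed["customer_id"] in seen:
            rejects.append({"row": row, "reason": "duplicate customer_id"})
        else:
            seen.add(parsed["customer_id"])
            clean.append(parsed)
    return clean, rejects  # clean rows get loaded, rejects get reported, nothing fails
```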
Fifth: Secure
Everything must be secure, end-to-end: from the moment the data is extracted until the data is used in the target application or in the reports/dashboards. When the data is being transported it must be encrypted, especially when it's transported over the internet. When the data is being stored it must be encrypted, both in databases and in files. Of course the encryption must be under the hood: when you look at the files or query the database, to you it looks like plain text (customer name, address, etc.), but to anyone outside the network it's encrypted.
Data privacy procedures and protocols must be followed to the letter. People who are not authorised to see customers' PII (Personally Identifiable Information) must not be able to.
Your pipelines must use a Key Management Service (KMS) or Azure Key Vault to store the encryption keys. And everything must be signed off by the InfoSec manager or security architect AND the data protection officer (DPO). They are the ones responsible for data security in your company. One more thing: always test for data leakage, in every step of your pipeline. And every pipeline must be auditable.
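For example, with the Azure Key Vault SDK a pipeline can fetch its secrets at runtime instead of storing them in code or config (a minimal sketch; the vault URL and secret name are assumptions):

```python
# A minimal sketch of pulling a connection secret from Azure Key Vault at runtime,
# so no credentials live in code or config. Vault URL and secret name are made up.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient


def get_db_password(vault_url: str = "https://my-vault.vault.azure.net") -> str:
    credential = DefaultAzureCredential()  # uses managed identity or environment credentials
    client = SecretClient(vault_url=vault_url, credential=credential)
    return client.get_secret("warehouse-db-password").value


if __name__ == "__main__":
    password = get_db_password()
    # pass the password to your database connection; never log or print it
```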
Sixth: Good DevOps
As a data engineer you have to do CI/CD. CI stands for Continuous Integration, meaning that you and your fellow data engineers continuously check your code changes into the main branch. Using pull requests, of course: you create a new branch, do your development, then merge it back into the main branch via a pull request. Another engineer must review your code and approve your pull request. CD stands for Continuous Deployment, meaning that right after you merge the code change into the main branch, it is deployed to the Dev environment. Then the automated tests run, and the change is deployed to the Test environment for the users/BAs to sign off. Then it's deployed into Production. You must fill in a change request form before deploying any change into Production, and you must attach the test evidence. The change request must be approved by the production support team.
This is the bread and butter of data engineering. If you don't have good DevOps processes, you might as well go home, because without DevOps, doing your job as a DE will be a nightmare.
Seventh: Share it
You need to share both the transformed data and the raw data with the business users and the business/data analysts. You need to use RBAC (Role Based Access Control), not UBAC (User Based Access Control). So users are grouped into roles, and each role is then granted permission to access the reports or tables. Business users access the reports, analysts access the database tables via views. Whether you use Azure SQL or Databricks, or anything in between, you need to provide a SQL interface for the analysts and testers. And you need to provide a self-service facility for the business users: something like Tableau, Power BI or Qlik. Something they can use to create their own reports, without bothering the IT department. You also need to provide a facility for business users to export data from the reports into Excel. That is a de facto requirement in every data project; there is no point resisting it.
And finally: Easy to manage
You have to talk to the production operations team before you build your infrastructure and pipelines, and plan the whole thing together with them, because after you build it you will hand it over to them to operate. You need to give them a way to monitor the pipelines, and they need to be notified automatically when a pipeline fails for any reason. Every step in the pipeline needs to be logged so that the ops team can troubleshoot when a problem occurs. If the ops team needs to do something (say, start a data load or check incoming files), write it down clearly in the handover documentation, and automate it as much as possible. For example, archive the files automatically after they are loaded into the database, and provide some kind of data status view. This might take a day or two to code, but it will save the ops team a lot of time. Automate it as far as possible: the less manual it is, the more accurate and efficient it is. So make it simple to manage, and the ops team will thank you for it.
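As a small sketch of those two touches (assuming AWS S3 and made-up bucket prefixes; the alert hook is a placeholder for whatever your ops team actually uses):

```python
# A minimal sketch of two ops-friendly touches: archive files automatically
# after a successful load, and raise an alert when a load fails.
# Bucket names, prefixes and the alert mechanism are assumptions.
import boto3

s3 = boto3.client("s3")


def archive_file(bucket: str, key: str) -> None:
    """Move a loaded file from incoming/ to archive/ so ops can see what is done."""
    archived_key = key.replace("incoming/", "archive/", 1)
    s3.copy_object(Bucket=bucket, CopySource={"Bucket": bucket, "Key": key}, Key=archived_key)
    s3.delete_object(Bucket=bucket, Key=key)


def notify_ops(message: str) -> None:
    """Send a failure notification; swap in email, Slack, PagerDuty, etc."""
    print(f"ALERT: {message}")


def run_load(bucket: str, key: str) -> None:
    try:
        # load_into_warehouse(bucket, key)  # the actual load step goes here
        archive_file(bucket, key)
    except Exception as exc:
        notify_ops(f"load failed for s3://{bucket}/{key}: {exc}")
        raise
```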
And those, my friends, are the 8 best practices in data engineering, based on my experience. Thank you for reading.
I'm sure I made mistakes somewhere above, and forgot many things too, so I would appreciate your opinions and comments below. Thank you.
List of my articles: https://lnkd.in/eRTNN6GP