YAML Engineers

In the data engineering field, YAML files have become a beloved tool, much like in DevOps. Over the years, data engineers have increasingly turned to YAML for managing configurations, especially in metadata-driven pipelines. Metadata is often stored in configuration files, and while JSON is an option, it lacks the readability and simplicity that YAML offers. JSON's strict syntax and lack of support for comments make it less ideal for configuration files. In contrast, YAML's human-readable format and support for complex data structures make it a favorite in the industry.

The Problem - No Native Inclusion in YAML

One challenge with YAML is its lack of native support for including one file within another. This becomes an issue when you want to modularize configurations across environments such as development, QA, and production. Custom tags can address this limitation. Think of a custom tag as a label attached to a node that tells the parser how to interpret the data within that node. By using custom tags, we can implement an "include" feature that pulls content from separate files into the document being loaded, improving modularity and maintainability.

Understanding Key Concepts in PyYAML

Loader: The loader is responsible for parsing the YAML document and converting it into Python objects. It interprets various node types within the YAML structure.

Constructor: A constructor is a function that defines how specific tags in a YAML file are converted into Python objects. You can create custom constructors for custom tags.

Node Types:

Scalar Nodes: Represent single values like strings, integers, or booleans.

Sequence Nodes: Represent ordered lists or arrays.

Mapping Nodes: Represent key-value pairs, similar to dictionaries.

Nesting: Not a separate node type. Sequence and mapping nodes can contain any of the three node types above, which is how complex nested structures are built.
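To make the node types concrete, here is a minimal sketch that uses PyYAML's yaml.compose to build the node graph without converting it to Python objects, then prints the kind and tag of each node. The sample document and the describe helper are made up for illustration.

```python
import yaml

# Illustrative document: a mapping holding a scalar, a sequence, and a nested mapping
document = """
name: demo
ports: [8080, 8443]
database:
  host: localhost
"""

# yaml.compose returns the root node instead of constructed Python objects
root = yaml.compose(document, Loader=yaml.SafeLoader)


def describe(node, depth=0):
    """Return one line per node showing its class name and resolved tag."""
    lines = [f"{'  ' * depth}{type(node).__name__} {node.tag}"]
    if isinstance(node, yaml.MappingNode):
        # A mapping node's value is a list of (key_node, value_node) pairs
        for key_node, value_node in node.value:
            lines.extend(describe(key_node, depth + 1))
            lines.extend(describe(value_node, depth + 1))
    elif isinstance(node, yaml.SequenceNode):
        # A sequence node's value is a list of child nodes
        for item in node.value:
            lines.extend(describe(item, depth + 1))
    return lines


node_lines = describe(root)
print("\n".join(node_lines))
```

A custom constructor receives exactly these node objects, which is why it has to know whether it is handed a scalar, a sequence, or a mapping.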


Imagine you have common configurations shared across different environments. The project structure is organized like this:


[Image: project structure diagram]

common.yml

database:
  port: 5432
  user: app_user

api:
  timeout: 30

features:
  enable_feature_x: true        

dev.yml (We want to include common.yml)

common: !include 'common.yml'

database:
  host: dev-db.example.com

api:
  endpoint: https://dev-api.example.com

To make this work, we need to register a constructor that identifies the custom tag `!include` and reads common.yml, storing its content in a Python dictionary. Here's how you can do it:


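# custom_yaml.py
import os

import yaml


def include_constructor(loader, node):
    """Custom constructor for including YAML files."""
    # The scalar after !include is the file name, e.g. 'common.yml'.
    # It is resolved against the config directory beside this module's directory.
    file_path = os.path.join(
        os.path.dirname(__file__), "../config", loader.construct_scalar(node)
    )
    with open(file_path, "r", encoding="utf-8") as input_file:
        # safe_load uses SafeLoader, so nested !include tags are resolved too
        included_data = yaml.safe_load(input_file)
    return included_data


# Register the custom constructor with SafeLoader
yaml.SafeLoader.add_constructor("!include", include_constructor)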
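Registering on yaml.SafeLoader changes behavior for every yaml.safe_load call in the process. As an alternative sketch, not the approach used in this article, the tag can be attached to a SafeLoader subclass and the included path resolved relative to the file being parsed. The names IncludeLoader and include_relative are hypothetical, and the temporary directory exists only so the example runs on its own.

```python
import os
import tempfile

import yaml


class IncludeLoader(yaml.SafeLoader):
    """SafeLoader subclass that owns the !include tag (hypothetical name)."""


def include_relative(loader, node):
    """Load the referenced file relative to the file currently being parsed."""
    # loader.name is the stream's file name when a file object is being loaded
    base_dir = os.path.dirname(os.path.abspath(loader.name))
    target = os.path.join(base_dir, loader.construct_scalar(node))
    with open(target, "r", encoding="utf-8") as included:
        # Reuse IncludeLoader so nested includes keep working
        return yaml.load(included, Loader=IncludeLoader)


# add_constructor on the subclass leaves yaml.SafeLoader itself untouched
IncludeLoader.add_constructor("!include", include_relative)

# Demo with throwaway files standing in for the config directory
with tempfile.TemporaryDirectory() as config_dir:
    with open(os.path.join(config_dir, "common.yml"), "w", encoding="utf-8") as f:
        f.write("database:\n  port: 5432\n")
    with open(os.path.join(config_dir, "dev.yml"), "w", encoding="utf-8") as f:
        f.write("common: !include 'common.yml'\n")
    with open(os.path.join(config_dir, "dev.yml"), "r", encoding="utf-8") as f:
        scoped_config = yaml.load(f, Loader=IncludeLoader)

print(scoped_config)
```

The trade-off is that callers must pass Loader=IncludeLoader explicitly instead of relying on yaml.safe_load.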
        

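In the main program, when we read dev.yml, the loader processes the !include tag automatically and places the content of common.yml under the common key. Note that the included content is nested there; it is not merged into the top-level database and api mappings.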

import yaml

import custom_yaml  # noqa: F401  (imported only to register the !include constructor)

# Load the development configuration with inclusion.
# The relative path is resolved against the current working directory.
try:
    with open("../config/dev.yml", "r", encoding="utf-8") as f:
        dev_config = yaml.safe_load(f)
    print("Development Configuration:", dev_config)
except FileNotFoundError as fnf_error:
    print(f"File not found: {fnf_error}")
except yaml.YAMLError as yaml_error:
    print(f"YAML Error: {yaml_error}")
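With the files shown earlier, the loaded dictionary keeps the shared settings nested under the common key. If you want the environment-specific values layered on top of the shared ones, one possible approach is a recursive merge after loading. The deep_merge helper below is a hypothetical sketch, and the dictionary literal simply mirrors what loading dev.yml would produce so the example runs without any files.

```python
from copy import deepcopy


def deep_merge(base, override):
    """Return a new dict where values from override win over base, recursively."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key
            merged[key] = deep_merge(merged[key], value)
        else:
            # Otherwise the override value replaces the base value
            merged[key] = deepcopy(value)
    return merged


# Shape of dev_config once the !include tag has been resolved
dev_config = {
    "common": {
        "database": {"port": 5432, "user": "app_user"},
        "api": {"timeout": 30},
        "features": {"enable_feature_x": True},
    },
    "database": {"host": "dev-db.example.com"},
    "api": {"endpoint": "https://dev-api.example.com"},
}

# Everything except the included block is treated as an environment override
overrides = {key: value for key, value in dev_config.items() if key != "common"}
effective_config = deep_merge(dev_config["common"], overrides)
print(effective_config["database"])
```

Lists are replaced wholesale in this sketch; whether they should be concatenated instead depends on the configuration.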
        

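I modularized the setup by importing the custom_yaml module, which registers the custom constructor as a side effect of the import when the main program runs. If the constructor is not registered with SafeLoader and you call yaml.safe_load on a file containing !include, PyYAML raises a ConstructorError because it cannot determine a constructor for the tag. ConstructorError is a subclass of yaml.YAMLError, so the except clause above would catch it.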
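The failure mode is easy to reproduce. This short sketch loads a document containing !include in a process where nothing has registered the tag and captures the resulting error message.

```python
import yaml

# A document using !include, loaded without any constructor registered for the tag
snippet = "common: !include 'common.yml'"

try:
    yaml.safe_load(snippet)
    error_message = None
except yaml.constructor.ConstructorError as exc:
    # exc.problem carries the human-readable reason for the failure
    error_message = exc.problem

print(error_message)
```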

Closing Thoughts

Now you understand the concept of custom tags. What else can be achieved with them? Imagine a scenario where a YAML file needs values from environment variables so that secrets are not hardcoded. You could use a custom tag to fetch these values dynamically, or even retrieve them from a secure vault through the vault's API.
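As a minimal sketch of the environment variable idea, the example below defines a hypothetical !env tag on a SafeLoader subclass. The tag name, the EnvLoader class, and the DEMO_DB_PASSWORD variable are all assumptions made for illustration; the variable is set inside the script only so it runs standalone.

```python
import os

import yaml


class EnvLoader(yaml.SafeLoader):
    """SafeLoader subclass carrying the hypothetical !env tag."""


def env_constructor(loader, node):
    """Replace `!env NAME` with the value of the environment variable NAME."""
    name = loader.construct_scalar(node)
    try:
        return os.environ[name]
    except KeyError:
        # Fail loudly rather than silently loading an empty secret
        raise yaml.constructor.ConstructorError(
            None, None, f"environment variable {name!r} is not set", node.start_mark
        ) from None


EnvLoader.add_constructor("!env", env_constructor)

# Set here only for the demo; in practice the deployment environment provides it
os.environ["DEMO_DB_PASSWORD"] = "s3cret"

env_config = yaml.load(
    "database:\n  password: !env DEMO_DB_PASSWORD\n", Loader=EnvLoader
)
print(env_config)
```

A vault-backed tag would follow the same shape, with the lookup inside the constructor replaced by a call to the vault client.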

Use Cases

  • Validation: Ensure data adheres to specific formats or rules, enhancing data integrity.
  • Transformation: Convert data into various formats or structures, making it adaptable for different applications.
  • Integration: Connect YAML configurations with external systems or data sources, streamlining workflows and processes.
  • Customization: Adapt configurations to meet specific application requirements, allowing for greater flexibility and control.
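To illustrate the validation use case, here is a small sketch of a hypothetical !port tag that rejects values outside the valid TCP port range at load time. The tag name and the ValidatingLoader class are assumptions, not part of PyYAML.

```python
import yaml


class ValidatingLoader(yaml.SafeLoader):
    """SafeLoader subclass carrying the hypothetical !port tag."""


def port_constructor(loader, node):
    """Return the scalar as an int if it is a valid port number, else raise."""
    raw = loader.construct_scalar(node)  # tagged scalars arrive as strings
    if not raw.isdigit() or not 1 <= int(raw) <= 65535:
        raise yaml.constructor.ConstructorError(
            None, None, f"invalid port {raw!r}", node.start_mark
        )
    return int(raw)


ValidatingLoader.add_constructor("!port", port_constructor)

valid_config = yaml.load("database:\n  port: !port 5432\n", Loader=ValidatingLoader)
print(valid_config)

try:
    yaml.load("database:\n  port: !port 70000\n", Loader=ValidatingLoader)
    rejected = False
except yaml.YAMLError:
    rejected = True
print("out-of-range port rejected:", rejected)
```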

By leveraging custom tags, you can create more dynamic, secure, and efficient configurations tailored to your specific needs.
