Demystifying AWS DataZone

Amazon DataZone is a data management service that makes it quick to catalog, discover, share, and govern data across AWS. It allows administrators and data stewards to regulate data access with fine-grained controls, ensuring the right level of access and context. This makes it simpler for a wide range of user personas, including engineers, data scientists, product managers, analysts, and business users, to access and collaborate on organizational data for decision-making and analytical reporting.


AWS services leveraged in a Data Lake architecture

Amazon Web Services (AWS) offers a comprehensive ecosystem for building and managing data lakes, harnessing the power of services like Lake Formation, Glue, Athena, and a centralized, domain-owned DataZone. This article aims to guide you through the best practices for launching your AWS data lake, focusing on configuring Lake Formation and establishing a DataZone domain.

Data Lake

A Data Lake is a centralized repository that lets you store all types of data at scale, collect data securely from various sources, and analyze it with different tools for flexible, large-scale data processing.

AWS Lake Formation

AWS Lake Formation simplifies building a secure data lake on AWS, automating integrations with other AWS services such as Amazon S3 and the AWS Glue Data Catalog for easy data management and access control.

AWS Glue

AWS Glue is a managed service that simplifies data discovery, preparation, and cataloging for analytics and machine learning use cases.

Amazon Athena

Amazon Athena is a serverless query service for analyzing data in Amazon S3 using SQL. Ideal for quick, ad hoc analysis, it supports direct queries on various data formats, enabling efficient analysis and reporting.


Amazon DataZone

Within a data lake, a DataZone facilitates efficient data management and governance. It is essentially a segmented area of your data lake designed to categorize data based on its readiness for centralized consumption in a publish-subscribe model.

The key steps for enabling DataZone through the AWS console and managing access for producers and subscribers are outlined below.

Create S3 Bucket for DataZone

Navigate to the S3 console and set up a bucket named “datazone-bucket-12345”. We’ll use this bucket for our DataZone area. Don’t forget to enable versioning.
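
If you prefer to script this step rather than use the console, here is a minimal boto3 sketch; the bucket name is the walkthrough’s placeholder and the region is assumed to be us-east-1.

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Create the bucket that will back the DataZone data lake location.
# (Outside us-east-1, also pass CreateBucketConfiguration={"LocationConstraint": region}.)
s3.create_bucket(Bucket="datazone-bucket-12345")

# Enable versioning, as recommended above.
s3.put_bucket_versioning(
    Bucket="datazone-bucket-12345",
    VersioningConfiguration={"Status": "Enabled"},
)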


Create Domain for DataZone

Navigate to the Amazon DataZone console and create a domain, which we will use throughout the rest of the walkthrough.
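
For reference, the same domain can be created through the DataZone API. This is only a sketch: the domain name, account ID, and execution role ARN are illustrative placeholders.

import boto3

datazone = boto3.client("datazone")

# Hypothetical execution role ARN: DataZone needs a domain execution role
# (the console can create one for you) before the domain can be created.
response = datazone.create_domain(
    name="demo-domain",
    description="Domain for the DataZone walkthrough",
    domainExecutionRole="arn:aws:iam::123456789012:role/AmazonDataZoneDomainExecutionRole",
)
domain_id = response["id"]
print("Created domain:", domain_id)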


After that, you will see a dashboard with the domain settings. Go to “Blueprints”, select the DefaultDataLake option, and enable it.


Use the S3 location we created for the data lake: “s3://datazone-bucket-12345”.
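
Enabling the blueprint and pointing it at the bucket can also be scripted. In the sketch below the domain ID and IAM role ARNs are placeholders, and the S3Location regional parameter key is an assumption worth verifying against your boto3 version.

import boto3

datazone = boto3.client("datazone")
domain_id = "dzd_example123"  # hypothetical: the domain ID created earlier

# Find the managed DefaultDataLake blueprint in this domain.
blueprints = datazone.list_environment_blueprints(
    domainIdentifier=domain_id, managed=True, name="DefaultDataLake"
)
blueprint_id = blueprints["items"][0]["id"]

# Enable the blueprint in one region and point it at our bucket.
# The role ARNs are placeholders for the provisioning and manage-access
# roles DataZone uses in your account.
datazone.put_environment_blueprint_configuration(
    domainIdentifier=domain_id,
    environmentBlueprintIdentifier=blueprint_id,
    enabledRegions=["us-east-1"],
    provisioningRoleArn="arn:aws:iam::123456789012:role/AmazonDataZoneProvisioningRole",
    manageAccessRoleArn="arn:aws:iam::123456789012:role/AmazonDataZoneGlueManageAccessRole",
    regionalParameters={"us-east-1": {"S3Location": "s3://datazone-bucket-12345"}},
)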


Create a project for Data Publisher

Next, navigate to the DataZone dashboard and select “Open data portal”. You will then be taken to the portal’s main page. Now it’s time to set up the first project for the Publisher. To do this, click on “Create project”.


Enter a name in the input field, for example, “Publisher”.
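
If you would rather create the project with the SDK, a minimal sketch follows; the domain ID is a placeholder.

import boto3

datazone = boto3.client("datazone")

project = datazone.create_project(
    domainIdentifier="dzd_example123",  # hypothetical domain ID
    name="Publisher",
    description="Project that publishes the inventory data",
)
publisher_project_id = project["id"]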


The next step is to create an environment.

First, we have to create an environment profile for the Publisher project. Choose “Create Environment Profile” from the “Environments” tab.


Next, it’s time to create the environment itself. Go to the “Environments” tab and click “Create environment”. Fill in the name and select the profile you created earlier. Leave the rest of the form blank so that DataZone applies its default naming convention.
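
Both steps can also be scripted. In this rough sketch the domain, project, and blueprint IDs, the account ID, and the region are placeholders for the values in your account.

import boto3

datazone = boto3.client("datazone")
domain_id = "dzd_example123"          # hypothetical domain ID
project_id = "prj_publisher123"       # hypothetical Publisher project ID
blueprint_id = "blp_defaultdatalake"  # hypothetical DefaultDataLake blueprint ID

# 1) Environment profile bound to the DefaultDataLake blueprint.
profile = datazone.create_environment_profile(
    domainIdentifier=domain_id,
    projectIdentifier=project_id,
    environmentBlueprintIdentifier=blueprint_id,
    name="PublisherProfile",
    awsAccountId="123456789012",
    awsAccountRegion="us-east-1",
)

# 2) Environment based on that profile. Behind the scenes DataZone deploys a
#    CloudFormation stack with the Glue database, Athena workgroup, and roles.
datazone.create_environment(
    domainIdentifier=domain_id,
    projectIdentifier=project_id,
    environmentProfileIdentifier=profile["id"],
    name="PublisherEnvironment",
)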


After you initiate a new environment, DataZone will start creating resources for it. Behind the scenes, CloudFormation will deploy the stack. Your new environment will be set up shortly!

Once it’s ready, you’ll be able to view a dashboard for the environment. Now repeat these steps to create a Consumer project with its own environment profile and environment. You should then have two projects: Publisher and Consumer.


Now you can create a table with some data. In the Athena query editor, make sure your publishing environment is selected and the publisherdata_pub_db database is chosen.


You can use an example like the one below to ingest data into a new table, inventory_table.
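
Since the original example isn’t reproduced here, below is an illustrative alternative: a single CTAS statement with made-up columns and rows, submitted through boto3. The workgroup name is a placeholder for the one DataZone created for the Publisher environment; you can equally paste the SQL straight into the Athena query editor.

import boto3

athena = boto3.client("athena")

# Hypothetical columns and rows; creates publisherdata_pub_db.inventory_table
# and seeds it with a couple of records in one CTAS statement.
query = """
CREATE TABLE publisherdata_pub_db.inventory_table AS
SELECT * FROM (
    VALUES
        (1, 'widget', 100),
        (2, 'gadget', 25)
) AS t (item_id, item_name, quantity)
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "publisherdata_pub_db"},
    WorkGroup="PublisherEnvironment-workgroup",  # placeholder workgroup name
)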


After that, you can see your table in Amazon Athena.


Generate metadata

It’s time to return to your DataZone domain and generate metadata from the table you created in the previous step. As a Publisher, navigate to the DATA tab and select DataSources from the menu. Here, you’ll see a list of your sources from which the system can generate metadata. Click on the first one, “PublisherData-default-datasource”, which is created by default. Open the Action dropdown menu, choose Run, and then hit the refresh button. Once the data source run is complete, the assets will be added to the Amazon DataZone inventory.
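
The equivalent API call looks roughly like this; both identifiers are placeholders you can look up in the portal or via list_data_sources.

import boto3

datazone = boto3.client("datazone")

run = datazone.start_data_source_run(
    domainIdentifier="dzd_example123",           # hypothetical domain ID
    dataSourceIdentifier="ds_publisherdefault",  # hypothetical data source ID
)
print("Run status:", run["status"])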


Subscribe to data from the data catalog

In the “Consumer” project, search for the inventory_table asset and then send a request to subscribe to the data.
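
Programmatically, the subscription request looks roughly like the sketch below. The listing and project IDs are placeholders, and the exact shape of the subscribedListings and subscribedPrincipals fields should be checked against your boto3 version.

import boto3

datazone = boto3.client("datazone")

request = datazone.create_subscription_request(
    domainIdentifier="dzd_example123",  # hypothetical domain ID
    requestReason="Consumer project needs read access to inventory_table",
    # Listing ID of the published inventory_table asset (placeholder).
    subscribedListings=[{"identifier": "listing_inventory123"}],
    # Subscribe the Consumer project (placeholder project ID).
    subscribedPrincipals=[{"project": {"identifier": "prj_consumer123"}}],
)
print("Request:", request["id"], request["status"])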


As the “Publisher”, go to the DATA tab, choose “Incoming requests”, and approve the request.
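
Approval can be automated as well; the subscription request ID below is a placeholder taken from the incoming request.

import boto3

datazone = boto3.client("datazone")

datazone.accept_subscription_request(
    domainIdentifier="dzd_example123",  # hypothetical domain ID
    identifier="subreq_abc123",         # hypothetical subscription request ID
    decisionComment="Approved for analytics use",
)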


Time to use your data

Now that you have successfully published an asset to the DataZone catalog and subscribed to it, return to DataZone, select the Consumer project, and open Athena. Choose the consumerdata_sub_db database and preview the inventory_table.
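
The preview boils down to a simple query against the subscribed database; the workgroup name is again a placeholder for the Consumer environment’s workgroup.

import boto3

athena = boto3.client("athena")

# Preview the subscribed asset from the Consumer environment.
athena.start_query_execution(
    QueryString="SELECT * FROM consumerdata_sub_db.inventory_table LIMIT 10",
    QueryExecutionContext={"Database": "consumerdata_sub_db"},
    WorkGroup="ConsumerEnvironment-workgroup",  # placeholder workgroup name
)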


Appendix

Data Lake Blueprint

This blueprint defines how Amazon DataZone launches and configures AWS Glue, AWS Lake Formation, and Amazon Athena so that data assets can be published to and consumed from the DataZone catalog.

