From the course: AWS Certified AI Practitioner (AIF-C01) Cert Prep
Data governance strategies - Amazon Web Services (AWS) Tutorial
Data governance strategies
- What exactly is generative AI data governance? Our definition for the purposes of this conversation is that it is a way of ensuring data is managed securely, ethically, and effectively across its entire lifecycle. It also ensures data accuracy, availability, integrity, and security, and that each of these can be audited. Data governance is important because generative AI models rely on very large data sets, so proper governance helps ensure responsible usage as well as compliance.
The generative AI data lifecycle consists of several phases. The first is data collection, which means sourcing data from diverse inputs, including databases, web scraping, sensors, documents, and so forth. Next is data pre-processing, where the data is cleaned and transformed in anticipation of being used to train an AI model. We follow that with the actual training, which uses that data to train AI models, followed by model deployment, where the AI models generate outputs based on input. The lifecycle finishes with data archiving and deletion, where we properly manage the data after use for compliance purposes. A governance consideration here is to make sure that security, privacy, and accuracy are maintained across all of these stages.
Next, we have data logging, which we define as the process of recording key activities related to data usage. Data logging is important because it helps with compliance, troubleshooting, and transparency, and it even helps with monitoring the performance of the model itself.
Next, we have data residency and sovereignty, and the definition is easy: this is just where the data is physically stored. It matters because there may be regulatory reasons why the data has to remain in a certain geography, whether for compliance or for jurisdictional sensitivity.
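To make the data logging and residency ideas above more concrete, here is a minimal Python sketch of what one audit-log record might look like. It is an illustration only, not an AWS API: the function `make_log_entry`, the field names, and the `ALLOWED_REGIONS` policy are all hypothetical, and the region strings are simply used as example geography labels.

```python
import json
from datetime import datetime, timezone

# Lifecycle stages as named in the transcript.
LIFECYCLE_STAGES = (
    "collection",
    "preprocessing",
    "training",
    "deployment",
    "archiving_and_deletion",
)

# Hypothetical residency policy: regions where this data may be stored.
ALLOWED_REGIONS = {"eu-west-1", "eu-central-1"}


def make_log_entry(stage, actor, dataset, region, action):
    """Build one audit-log record for a data-usage activity (hypothetical schema)."""
    if stage not in LIFECYCLE_STAGES:
        raise ValueError(f"unknown lifecycle stage: {stage}")
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "stage": stage,
        "actor": actor,
        "dataset": dataset,
        "region": region,
        "action": action,
        # Flag activity on data held outside the allowed geography.
        "residency_ok": region in ALLOWED_REGIONS,
    }


# Example: a read during training against data stored outside the allowed regions.
entry = make_log_entry("training", "alice", "support-tickets", "us-east-1", "read")
print(json.dumps(entry))
```

Recording the lifecycle stage and storage region on every entry is one way to let the same log serve compliance, troubleshooting, and residency checks at once.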
Then we have monitoring and observation. This is just the continuous tracking of data usage, as well as model behavior and compliance, over time. It is important because it can help identify anomalies and prevent misuse or unauthorized access.
We follow that with data retention and deletion. Retention is just how long the data is kept after we collect it. Deletion is removing the data once it has reached the end of its lifecycle, in such a way that we are still compliant afterwards. The considerations here: keep only the minimum amount of data that you absolutely need, because data storage can be expensive, and make sure that the data is deleted securely.
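As a rough illustration of the retention idea above, the following Python sketch separates records that are still within a retention period from those that have passed it. The 365-day period, the `collected_at` field, and the function name `split_by_retention` are assumptions made for the example. The sketch only identifies expired records; how they are then securely deleted depends on the storage system and is not shown.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention period; real values come from policy or regulation.
RETENTION = timedelta(days=365)


def split_by_retention(records, now, retention=RETENTION):
    """Return (keep, expired) lists based on each record's collected_at time."""
    keep, expired = [], []
    for record in records:
        age = now - record["collected_at"]
        if age > retention:
            expired.append(record)  # past end of lifecycle: candidate for deletion
        else:
            keep.append(record)
    return keep, expired


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
records = [
    {"id": "a", "collected_at": datetime(2023, 6, 1, tzinfo=timezone.utc)},
    {"id": "b", "collected_at": datetime(2024, 9, 1, tzinfo=timezone.utc)},
]
keep, expired = split_by_retention(records, now)
print([r["id"] for r in keep], [r["id"] for r in expired])
```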