Use synthetic data to test your prediction algorithm

Sometimes we need to showcase an idea to a customer or stakeholder, and the conversation is far more meaningful if we can show the idea working, or at least demonstrate it with some sample data.

For example, suppose you want to show a stakeholder that, if they give you their production data, you can make solid predictions on it. Say it is an ed-tech business and they want to see how well you can predict the next course for an active user. You can use a tool like Windsurf to generate the following synthetic (sample) data:

  • 100 courses
  • 100 users
  • 20,000 course interactions (users taking courses)

This synthetic data can be generated with a program like https://github.com/ansarmuhammad/synthetic_data/blob/main/generate_user_interactions.py; incidentally, that program was itself generated by Windsurf from a simple prompt.
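To give a feel for what such a generator looks like, here is a minimal sketch in Python. The column names, id format, and the 1 to 5 rating scale are assumptions for illustration and may differ from the linked script.

```python
# Minimal sketch of a synthetic data generator (assumed schema, not the repo's exact script).
import csv
import random

random.seed(42)

NUM_COURSES = 100
NUM_USERS = 100
NUM_INTERACTIONS = 20_000

courses = [f"course_{i}" for i in range(1, NUM_COURSES + 1)]
users = [f"user_{i}" for i in range(1, NUM_USERS + 1)]

with open("user_course_interactions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["user_id", "course_id", "rating"])  # assumed column names
    for _ in range(NUM_INTERACTIONS):
        writer.writerow([
            random.choice(users),        # a random user takes...
            random.choice(courses),      # ...a random course...
            random.randint(1, 5),        # ...and gives it an assumed 1-5 rating
        ])
```

Running it produces a CSV with 20,000 random user-course interactions that you can load into any analysis tool.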

Furthermore, you can ask Windsurf to generate another program that makes predictions (i.e., recommends courses) for a particular user. That is done here: synthetic_data/recommend_course - good predictions.py at main · ansarmuhammad/synthetic_data
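For illustration, here is a hedged sketch of one simple way such a recommender could work: rank the courses taken by "peer" users who share courses with the target user. The repository's script may use a different technique, and the CSV schema and user id format below are assumptions.

```python
# Sketch of a simple peer-popularity recommender (assumed schema: user_id, course_id, rating).
import pandas as pd

def recommend_courses(interactions_csv: str, user_id: str, top_n: int = 5) -> list[str]:
    df = pd.read_csv(interactions_csv)

    # Courses the target user has already taken.
    taken = set(df.loc[df["user_id"] == user_id, "course_id"])

    # "Peers" are users who share at least one course with the target user.
    peers = df.loc[df["course_id"].isin(taken) & (df["user_id"] != user_id), "user_id"].unique()

    # Rank the peers' courses that the target user has not taken, by popularity.
    peer_courses = df.loc[df["user_id"].isin(peers) & ~df["course_id"].isin(taken), "course_id"]
    return peer_courses.value_counts().head(top_n).index.tolist()

# Example call; the user id format ("user_2" vs. a plain integer) is an assumption.
print(recommend_courses("user_course_interactions_20k.csv", "user_2"))
```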

The output of the prediction algorithm is available here: synthetic_data/LEARNING PROFILE FOR USER 2.docx at main · ansarmuhammad/synthetic_data 

So now you can walk your stakeholder through the synthetic data and the predictions/recommendations from your machine learning algorithm. This gives the stakeholder confidence that you know what you are talking about, and reviewing the data shows them that you understand their domain. Needless to say, the algorithm can be improved further; I have kept it simple. If you want to learn how the Python program works, give it to an LLM (such as perplexity.ai) and ask it to explain the program to you.

Once the stakeholder has seen the output, they may find it easier to share their production data with you, since they can expect high-quality results.

Have you tried a similar approach to get stakeholder approval?

Note: All the code and data have been generated using AI-enabled code generation tools.

Interestingly, in this use case we predicted for "user 2". To see how sensitive AI algorithms are to poor data, we deliberately spoilt the data a little: we went from the https://github.com/ansarmuhammad/synthetic_data/blob/main/user_course_interactions_20k.csv dataset to this one: https://github.com/ansarmuhammad/synthetic_data/blob/main/user_course_interactions_20k_poor_data_quality.csv. We selectively appended '999' to the course id, modifying 62 entries for user 2, which is 31% of that user's 200 rows. The quality of prediction for user 2 dropped by a whopping 36%. Out of the 20,056 rows in the dataset, only 0.3% were spoilt, but because the damage was concentrated in an area of interest, it had a big impact on the results.

The program that runs the same prediction on both datasets is here: https://github.com/ansarmuhammad/synthetic_data/blob/main/recommend_course%20-%20predicting%20with%20poor%20data.py. This highlights why clean data is so important for your AI/ML workloads.
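For illustration, here is a sketch of how that kind of targeted corruption could be introduced. The column names and the corruption rule are assumptions based on the description above, not the repository's actual script.

```python
# Sketch: corrupt ~31% of one user's rows by appending '999' to the course id
# (assumed schema: user_id, course_id, rating; user id format is an assumption).
import pandas as pd

df = pd.read_csv("user_course_interactions_20k.csv")

target_user = "user_2"
mask = df["user_id"] == target_user

# Pick roughly a third of the target user's rows and mangle their course ids.
to_spoil = df.loc[mask].sample(frac=0.31, random_state=42).index
df.loc[to_spoil, "course_id"] = df.loc[to_spoil, "course_id"].astype(str) + "999"

df.to_csv("user_course_interactions_20k_poor_data_quality.csv", index=False)
```

Because the corruption is concentrated on a single user, even a tiny fraction of bad rows overall can badly distort the predictions for that user, which is exactly what the experiment showed.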
