Use synthetic data to test your prediction algorithm

Sometimes we need to showcase an idea to a customer or stakeholder, and the conversation is far more meaningful if we can show the idea working, or at least demonstrate it with some sample data.

For example, suppose you want to show a stakeholder that, if they give you their production data, you can make solid predictions on it. Say it is an ed-tech business and they want to see how well you can predict the next course for an active user. You can use a tool like Windsurf to generate the following synthetic (sample) data:

  • 100 courses
  • 100 users
  • 20,000 course interactions (users taking courses)

This synthetic data can be generated with a program like https://github.com/ansarmuhammad/synthetic_data/blob/main/generate_user_interactions.py; incidentally, that program was itself generated by Windsurf from a simple prompt.
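To give a feel for what such a generator looks like, here is a minimal sketch in Python. The column names, id format, and the 1 to 5 rating scale are assumptions for illustration and may differ from the linked script.

```python
# Minimal sketch of a synthetic data generator (assumed schema, not the repo's exact script).
import csv
import random

random.seed(42)

NUM_COURSES = 100
NUM_USERS = 100
NUM_INTERACTIONS = 20_000

courses = [f"course_{i}" for i in range(1, NUM_COURSES + 1)]
users = [f"user_{i}" for i in range(1, NUM_USERS + 1)]

with open("user_course_interactions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["user_id", "course_id", "rating"])  # assumed column names
    for _ in range(NUM_INTERACTIONS):
        writer.writerow([
            random.choice(users),        # a random user takes...
            random.choice(courses),      # ...a random course...
            random.randint(1, 5),        # ...and gives it an assumed 1-5 rating
        ])
```

Running it produces a CSV with 20,000 random user-course interactions that you can load into any analysis tool.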

Furthermore, you can ask Windsurf to generate another program that makes predictions (i.e., recommends courses) for a particular user. That is done here: synthetic_data/recommend_course - good predictions.py at main · ansarmuhammad/synthetic_data
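For illustration, here is a hedged sketch of one simple way such a recommender could work: rank the courses taken by "peer" users who share courses with the target user. The repository's script may use a different technique, and the CSV schema and user id format below are assumptions.

```python
# Sketch of a simple peer-popularity recommender (assumed schema: user_id, course_id, rating).
import pandas as pd

def recommend_courses(interactions_csv: str, user_id: str, top_n: int = 5) -> list[str]:
    df = pd.read_csv(interactions_csv)

    # Courses the target user has already taken.
    taken = set(df.loc[df["user_id"] == user_id, "course_id"])

    # "Peers" are users who share at least one course with the target user.
    peers = df.loc[df["course_id"].isin(taken) & (df["user_id"] != user_id), "user_id"].unique()

    # Rank the peers' courses that the target user has not taken, by popularity.
    peer_courses = df.loc[df["user_id"].isin(peers) & ~df["course_id"].isin(taken), "course_id"]
    return peer_courses.value_counts().head(top_n).index.tolist()

# Example call; the user id format ("user_2" vs. a plain integer) is an assumption.
print(recommend_courses("user_course_interactions_20k.csv", "user_2"))
```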

The output of the prediction algorithm is available here: synthetic_data/LEARNING PROFILE FOR USER 2.docx at main · ansarmuhammad/synthetic_data 

So now you can walk your stakeholder through the synthetic data and the predictions/recommendations from your machine learning algorithm. This gives the stakeholder confidence that you know what you are talking about, and reviewing the data shows them that you understand their domain. Needless to say, the algorithm can be improved further; I have kept it simple. If you want to learn how the Python program works, give it to an LLM (such as perplexity.ai) and ask it to explain the program to you.

Once the stakeholder has seen the output, they may find it easier to share their production data with you, since they can expect high-quality results.

Have you tried a similar approach to get stakeholder approval?

Note: All the code and data have been generated using AI-enabled code generation tools.

Interestingly, in this use case we predicted for "user 2". To see how sensitive AI algorithms are to poor data, we deliberately spoilt the data a little: we went from the https://github.com/ansarmuhammad/synthetic_data/blob/main/user_course_interactions_20k.csv dataset to this one: https://github.com/ansarmuhammad/synthetic_data/blob/main/user_course_interactions_20k_poor_data_quality.csv. We selectively appended '999' to the course id, modifying 62 entries for user 2, which is 31% of that user's 200 rows. The quality of prediction for user 2 dropped by a whopping 36%. Out of the 20,056 rows in the dataset, only 0.3% were spoilt, but because the damage was concentrated in an area of interest, it had a big impact on the results.

The program that runs the same prediction on both datasets is here: https://github.com/ansarmuhammad/synthetic_data/blob/main/recommend_course%20-%20predicting%20with%20poor%20data.py. This highlights why clean data is so important for your AI/ML workloads.
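For illustration, here is a sketch of how that kind of targeted corruption could be introduced. The column names and the corruption rule are assumptions based on the description above, not the repository's actual script.

```python
# Sketch: corrupt ~31% of one user's rows by appending '999' to the course id
# (assumed schema: user_id, course_id, rating; user id format is an assumption).
import pandas as pd

df = pd.read_csv("user_course_interactions_20k.csv")

target_user = "user_2"
mask = df["user_id"] == target_user

# Pick roughly a third of the target user's rows and mangle their course ids.
to_spoil = df.loc[mask].sample(frac=0.31, random_state=42).index
df.loc[to_spoil, "course_id"] = df.loc[to_spoil, "course_id"].astype(str) + "999"

df.to_csv("user_course_interactions_20k_poor_data_quality.csv", index=False)
```

Because the corruption is concentrated on a single user, even a tiny fraction of bad rows overall can badly distort the predictions for that user, which is exactly what the experiment showed.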
