I get it, padawan: navigating the post-ChatGPT world of data science is weird. It is for me as well.
I understand you might feel lost and a little disillusioned, because AI seems to fall noticeably short of what people hoped for. We might even be heading towards a major AI bubble burst soon (https://www.cnn.com/2024/03/14/investing/premarket-stocks-trading-ai-bubble-grantham/index.html).
But how do we move forward? Ya know, these kinds of bubbles have happened multiple times in the data world, because the business world tends to put tech before solutions. Instead of looking at problems and constructing solutions from them, it tends to sell "universal solutions" that aren't universal at all. Like forcing a round peg into a square hole, something (or someone) has to suffer when that "universal solution" is forced onto a problem it doesn't fit.
So, what's the way to keep on creating solutions despite the AI bubble? Data.
Everything in data science ultimately lives and dies by its data. So, what should you do, learn and master to survive in today's data science and deal with data well? Here are a few tips:
- Data Quality – To ensure data quality you need to master two skills:
  - Statistics – Knowing how to write a data report that describes in depth the characteristics of the dataset you've been given is an art. It's 50% learning the protocol, 50% learning from experience. My advice? Keep well-known resources such as https://www.stat.berkeley.edu/users/rabbee/s154/ISLR_First_Printing.pdf or https://hastie.su.domains/Papers/ESLII.pdf accessible at all times and start developing your own protocols from your own experience. I have my own protocols for numerical data, biological data, cohort data, text data and so on, which I've built over the years and keep improving (see the first sketch after this list);
  - Quality Metrics – Yes, the quality metrics begin here. Surprised? Determining bias, variance, completeness and so on begins with the data, not the model. From this first set of quality metrics, plus the question you want to answer, the baseline model (the simplest statistical or machine learning model that gets to a solution for the question) will naturally arise, and you will form a strict protocol of metrics that accompanies you through every step of development. Every iteration of the model must answer the question with less error, less noise and less bias than the basic statistical description of the data and the baseline model (see the baseline sketch after this list).
- Data Augmentation and Synthetization Methods – Quite often you will find yourself with data that is too scarce or too sensitive to use directly. Knowing how to properly apply techniques such as Generative Adversarial Networks (GANs) is extremely useful in these situations; a toy example follows this list. You'll be able to keep the results relevant and the data safe. Your clients will thank you.
- Data Pipeline Monitoring and Model Monitoring – AI fails silently until the problems are so huge that you need to retrain the whole pipeline. That's the sad reality of the situation, and massive amounts of money are wasted because people don't monitor their models and pipelines. Yes, I'm telling you to learn at least the basics of MLOps: how to detect drift and the early signs of model failure (see the drift sketch after this list). There's no excuse not to do it. You can use open-source frameworks such as the one from the awesome team at NannyML to learn and test. In the end it will make you more alert to pitfalls in your data and model, and it will make you a favourite of the MLOps engineers. There are only benefits in it.
- Explainer Models – Learning to build and use the explainer models corresponding to the ones you're building is a valuable skill to add to your list. Explainer models "reverse" (to a certain degree) the AI/ML model and let you glimpse what the computer is "thinking" (see the surrogate sketch after this list). They're extremely valuable for understanding the limitations of your models, and quite often some of the bias and problems in the dataset surface through them.
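To make the statistics tip concrete, here's a minimal sketch of a first-pass descriptive report with pandas. Everything here is illustrative: "data.csv" is a placeholder for whatever dataset you've been handed, and a real protocol would go much deeper.

```python
# A minimal first-pass descriptive report, assuming a tabular CSV dataset.
import pandas as pd

df = pd.read_csv("data.csv")  # placeholder path for your dataset

report = {
    "shape": df.shape,
    "dtypes": df.dtypes.to_dict(),
    "missing_ratio": df.isna().mean().to_dict(),   # per-column fraction of NaNs
    "duplicated_rows": int(df.duplicated().sum()),
    "numeric_summary": df.describe().T,            # count, mean, std, quantiles
    "cardinality": df.select_dtypes("object").nunique().to_dict(),
}

for name, value in report.items():
    print(f"--- {name} ---\n{value}\n")
```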
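For the quality-metrics tip, here's a minimal sketch of the "baseline first" protocol using scikit-learn. The synthetic regression data and the choice of MAE are assumptions for illustration; the point is that every model iteration gets held against the same baseline.

```python
# Baseline-first protocol: the candidate model must beat the simplest model
# that "answers" the question, measured with the same metric.
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: simply predict the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))

# Candidate: the simplest "real" model for this question.
model = LinearRegression().fit(X_train, y_train)
model_mae = mean_absolute_error(y_test, model.predict(X_test))

print(f"baseline MAE: {baseline_mae:.2f}")
print(f"model MAE:    {model_mae:.2f}")
assert model_mae < baseline_mae, "model does not beat the baseline yet"
```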
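For the synthetization tip, a toy GAN in PyTorch. The two-column Gaussian "real" data stands in for your sensitive dataset, and in a real project you'd reach for a vetted implementation (CTGAN and friends), but the adversarial loop below is the core idea.

```python
# Toy GAN for tabular data: a generator learns to mimic the "real"
# distribution while a discriminator learns to tell real from fake.
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-in for a sensitive two-column dataset.
real_data = torch.randn(1000, 2) * torch.tensor([1.0, 0.5]) + torch.tensor([3.0, -1.0])

latent_dim = 8
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 2))
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(2000):
    batch = real_data[torch.randint(0, len(real_data), (64,))]
    fake = G(torch.randn(64, latent_dim))

    # Discriminator: separate real samples (label 1) from generated ones (label 0).
    d_loss = loss_fn(D(batch), torch.ones(64, 1)) + loss_fn(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: fool the discriminator into labelling fakes as real.
    g_loss = loss_fn(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# Sample a synthetic dataset that mimics the real distribution.
synthetic = G(torch.randn(1000, latent_dim)).detach()
print("real mean:", real_data.mean(0), "synthetic mean:", synthetic.mean(0))
```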
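For the monitoring tip, here's a from-scratch univariate drift check using a two-sample Kolmogorov-Smirnov test from SciPy. Frameworks like NannyML automate this and much more (multivariate drift, performance estimation); the simulated shift and the 0.01 alarm threshold are assumptions for illustration.

```python
# Univariate drift detection: compare a feature's training-time (reference)
# distribution against what the model sees in production.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5000)   # feature at training time
production = rng.normal(loc=0.4, scale=1.0, size=5000)  # same feature, shifted later

stat, p_value = ks_2samp(reference, production)
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.3g}")

if p_value < 0.01:  # illustrative threshold; tune per feature and volume
    print("Drift alarm: production no longer matches the training distribution.")
```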
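And for the explainer tip, one simple way to build an explainer model is a global surrogate: train an interpretable model to mimic the black box's predictions. Dedicated tools like SHAP or LIME go much further, but this sketch (the data and model choices are illustrative) shows the core idea of "reversing" a model.

```python
# Global surrogate explainer: a shallow decision tree approximates what a
# black-box model has learned, exposing human-readable rules.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=1000, n_features=6, random_state=0)

# The "opaque" model we want to peek into.
black_box = RandomForestClassifier(random_state=0).fit(X, y)

# The surrogate is trained on the black box's *predictions*, not the labels,
# so its rules approximate the black box rather than the raw data.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))

fidelity = surrogate.score(X, black_box.predict(X))  # how well it mimics the model
print(f"surrogate fidelity: {fidelity:.2%}")
print(export_text(surrogate, feature_names=[f"x{i}" for i in range(6)]))
```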
In the end, remember the golden rule: "Garbage in, garbage out." Learn to keep the garbage out of your data and your models will shine in their full splendour.
This set of skills should be a great starting point, dear padawan, to keep you relevant and important in the future of data science.
May the data be with you,