I get it, padawan: navigating the post-ChatGPT world of data science is weird. It is for me as well.
I understand you might feel lost and a little disillusioned, because AI seems to fall noticeably short of what people hoped for. We might even be heading towards a major AI bubble burst soon (https://www.cnn.com/2024/03/14/investing/premarket-stocks-trading-ai-bubble-grantham/index.html).
But how do we move forward? Ya know, these kinds of bubbles have happened multiple times in the data world, because the business world tends to put tech before solutions. Instead of looking at problems and constructing solutions from them, it tends to sell "universal solutions" that aren't universal at all. Like forcing a round peg into a square hole, something (or someone) has to suffer when that "universal solution" is forced onto a problem it doesn't fit.
So, what's the way to keep on creating solutions despite the AI bubble? Data.
Everything in data science ultimately lives and dies by its data. So, what should you do, learn and master to survive in today's data science and deal with data well? Here are a few tips:
- Data Quality – To ensure data quality you need to master two skills:
  - Statistics – Knowing how to write a data report that describes in depth the characteristics of the dataset you've been given is an art. It's 50% learning the protocol, 50% learning from experience. My advice? Keep well-known resources such as https://www.stat.berkeley.edu/users/rabbee/s154/ISLR_First_Printing.pdf or https://hastie.su.domains/Papers/ESLII.pdf accessible at all times and start developing your own protocols from your own experience. I have my own protocols for numerical data, biological data, cohort data, text data and so on, which I've built over the years and keep improving (see the first sketch after this list);
  - Quality Metrics – Yes, the quality metrics begin here. Surprised? Determining bias, variance, completeness and so on begins with the data, not the model. From this first set of quality metrics, plus the question you want to answer, the baseline model (the simplest statistical or machine learning model that gets to a solution for the question) will naturally arise, and you will form a strict protocol of metrics that accompanies you through every step of development. Every iteration of the model must answer the question with less error, less noise and less bias than the basic statistical description of the data and the baseline model (see the baseline sketch after this list).
- Data Augmentation and Synthetization Methods – Quite often you will find yourself with data that is too scarce or too sensitive to use directly. Knowing how to properly apply techniques such as Generative Adversarial Networks (GANs) is extremely useful in these situations; a toy example follows this list. You'll be able to keep the results relevant and the data safe. Your clients will thank you.
- Data Pipeline Monitoring and Model Monitoring – AI fails silently until the problems are so huge that you need to retrain the whole pipeline. That's the sad reality of the situation, and massive amounts of money are wasted because people don't monitor their models and pipelines. Yes, I'm telling you to learn at least the basics of MLOps: how to detect drift and the early signs of model failure (see the drift sketch after this list). There's no excuse not to do it. You can use open-source frameworks such as the one from the awesome team at NannyML to learn and test. In the end it will make you more alert to pitfalls in your data and model, and it will make you a favourite of the MLOps engineers. There are only benefits in it.
- Explainer Models – Learning to build and use the explainer models corresponding to the ones you're building is a valuable skill to add to your list. Explainer models "reverse" (to a certain degree) the AI/ML model and let you glimpse what the computer is "thinking" (see the surrogate sketch after this list). They're extremely valuable for understanding the limitations of your models, and quite often some of the bias and problems in the dataset surface through them.
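To make the statistics tip concrete, here's a minimal sketch of a first-pass descriptive report with pandas. Everything here is illustrative: "data.csv" is a placeholder for whatever dataset you've been handed, and a real protocol would go much deeper.

```python
# A minimal first-pass descriptive report, assuming a tabular CSV dataset.
import pandas as pd

df = pd.read_csv("data.csv")  # placeholder path for your dataset

report = {
    "shape": df.shape,
    "dtypes": df.dtypes.to_dict(),
    "missing_ratio": df.isna().mean().to_dict(),   # per-column fraction of NaNs
    "duplicated_rows": int(df.duplicated().sum()),
    "numeric_summary": df.describe().T,            # count, mean, std, quantiles
    "cardinality": df.select_dtypes("object").nunique().to_dict(),
}

for name, value in report.items():
    print(f"--- {name} ---\n{value}\n")
```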
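For the quality-metrics tip, here's a minimal sketch of the "baseline first" protocol using scikit-learn. The synthetic regression data and the choice of MAE are assumptions for illustration; the point is that every model iteration gets held against the same baseline.

```python
# Baseline-first protocol: the candidate model must beat the simplest model
# that "answers" the question, measured with the same metric.
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: simply predict the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))

# Candidate: the simplest "real" model for this question.
model = LinearRegression().fit(X_train, y_train)
model_mae = mean_absolute_error(y_test, model.predict(X_test))

print(f"baseline MAE: {baseline_mae:.2f}")
print(f"model MAE:    {model_mae:.2f}")
assert model_mae < baseline_mae, "model does not beat the baseline yet"
```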
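For the synthetization tip, a toy GAN in PyTorch. The two-column Gaussian "real" data stands in for your sensitive dataset, and in a real project you'd reach for a vetted implementation (CTGAN and friends), but the adversarial loop below is the core idea.

```python
# Toy GAN for tabular data: a generator learns to mimic the "real"
# distribution while a discriminator learns to tell real from fake.
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-in for a sensitive two-column dataset.
real_data = torch.randn(1000, 2) * torch.tensor([1.0, 0.5]) + torch.tensor([3.0, -1.0])

latent_dim = 8
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 2))
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(2000):
    batch = real_data[torch.randint(0, len(real_data), (64,))]
    fake = G(torch.randn(64, latent_dim))

    # Discriminator: separate real samples (label 1) from generated ones (label 0).
    d_loss = loss_fn(D(batch), torch.ones(64, 1)) + loss_fn(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: fool the discriminator into labelling fakes as real.
    g_loss = loss_fn(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# Sample a synthetic dataset that mimics the real distribution.
synthetic = G(torch.randn(1000, latent_dim)).detach()
print("real mean:", real_data.mean(0), "synthetic mean:", synthetic.mean(0))
```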
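For the monitoring tip, here's a from-scratch univariate drift check using a two-sample Kolmogorov-Smirnov test from SciPy. Frameworks like NannyML automate this and much more (multivariate drift, performance estimation); the simulated shift and the 0.01 alarm threshold are assumptions for illustration.

```python
# Univariate drift detection: compare a feature's training-time (reference)
# distribution against what the model sees in production.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5000)   # feature at training time
production = rng.normal(loc=0.4, scale=1.0, size=5000)  # same feature, shifted later

stat, p_value = ks_2samp(reference, production)
print(f"KS statistic: {stat:.3f}, p-value: {p_value:.3g}")

if p_value < 0.01:  # illustrative threshold; tune per feature and volume
    print("Drift alarm: production no longer matches the training distribution.")
```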
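And for the explainer tip, one simple way to build an explainer model is a global surrogate: train an interpretable model to mimic the black box's predictions. Dedicated tools like SHAP or LIME go much further, but this sketch (the data and model choices are illustrative) shows the core idea of "reversing" a model.

```python
# Global surrogate explainer: a shallow decision tree approximates what a
# black-box model has learned, exposing human-readable rules.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=1000, n_features=6, random_state=0)

# The "opaque" model we want to peek into.
black_box = RandomForestClassifier(random_state=0).fit(X, y)

# The surrogate is trained on the black box's *predictions*, not the labels,
# so its rules approximate the black box rather than the raw data.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))

fidelity = surrogate.score(X, black_box.predict(X))  # how well it mimics the model
print(f"surrogate fidelity: {fidelity:.2%}")
print(export_text(surrogate, feature_names=[f"x{i}" for i in range(6)]))
```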
In the end, remember the golden rule: "Garbage in, garbage out." Learn to keep the garbage out of your data and your models will shine in their full splendour.
This set of skills should be a great starting point, dear padawan, to keep you relevant and important in the future of data science.
May the data be with you,