How to start with Machine Learning

You don't need to be an engineer to use and understand ML, and here I will show you some easy steps to begin using it. I will walk through a high-level overview of how ML models generally work and the seven steps you need to take to build a model.

What is ML? 🤔💭

ML is the process of teaching a computer to learn patterns from data and then apply those patterns to make predictions on new data. So, you provide the computer with many examples of input data as well as the desired output and let the computer learn the rule itself.

ML is effective for problems where you have lots of data with relationships too complex for humans to capture in hand-written rules. Typical applications include image, video, and object recognition; Natural Language Processing (NLP: language translation, sentiment analysis, chatbots, text summarization, and language generation); recommendation systems; predictive analytics; anomaly detection; and resource optimization.


[Image source: https://developers.google.com/codelabs/tensorflow-1-helloworld#1]

ML Model 🤖

To build an ML model, we need to understand two main concepts: decision trees and decision forests.

A decision tree is an algorithm that learns by splitting the data into smaller and smaller subsets. The splits are determined by how well each subset helps to predict the desired outcome.

Here's a simple decision tree for predicting different types of dogs! (We can imagine Type 1 is Beagle, Type 2 is Golden Retriever, etc.) In this model, we have several conditions (questions that must be answered) to classify each dog into one particular category (prediction).

[Image: a simple decision tree for classifying dog types]
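
In code, such a tree is just a set of nested if/else conditions. Here is a toy sketch, where the features ("weight_kg", "ear_length_cm") and the thresholds are made up purely for illustration:

# Toy decision tree: the features and thresholds below are hypothetical,
# chosen only to show how a tree turns questions into a prediction.
def predict_dog_type(weight_kg, ear_length_cm):
    if weight_kg < 12:
        return "Type 1"  # e.g., Beagle
    elif ear_length_cm > 15:
        return "Type 2"  # e.g., Golden Retriever
    else:
        return "Type 3"

print(predict_dog_type(weight_kg=10, ear_length_cm=12))  # -> "Type 1"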

Decision Forests (DF) are a large family of machine-learning algorithms for supervised classification, regression, and ranking. As the name suggests, DFs use decision trees as a building block.
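
Conceptually, a decision forest combines the predictions of many individual trees, for example by majority vote. A toy sketch of that idea (not a real library implementation):

from collections import Counter

def forest_predict(trees, example):
    """Toy illustration: each tree votes, and the most common vote wins."""
    votes = [tree(example) for tree in trees]
    return Counter(votes).most_common(1)[0][0]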

Now that we understand the logic behind ML, we are ready to build our ML model.

Article content
https://developers.google.com/codelabs/tensorflow-1-helloworld#1

Step 1: Set Up the Environment 📊

As a first step we need to install a decision forest library. You can use free libraries for training decision forest models, such as TensorFlow Decision Forests, together with Pandas and NumPy for handling the data.

You can then import these libraries and load your data into a pandas DataFrame (df), a tabular format that Python can work with easily.
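
For example, if you follow the TensorFlow Decision Forests codelab linked above, a minimal setup sketch could look like this:

# Install the library once (for example, in a notebook cell):
# !pip install tensorflow_decision_forests

# Import the libraries used throughout the rest of this article.
import tensorflow_decision_forests as tfdf
import pandas as pd
import numpy as np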


Step 2: Get the Data 📈

ML will learn patterns and trends from the data, so it is very important to know your data well. Expect to spend a lot of your time here, because this is the most important part of the process.

Some things you need to know about your data: Where does it come from? Who collected it, and why? What kind of data ("features") do you have? Are they all numbers (integers or "int", decimals or "float") or words ("strings")? What questions can this data answer? What trends and patterns can you identify? Does it contain bias?

Step 3: Load the Data Set 📉

You can use a small dataset (around 300 examples) or a bigger one (more than 1M examples), depending on the level of accuracy that you expect. You can load your data using pandas for a small sample or TensorFlow for larger samples; just be sure that the data is stored in a .csv-like file.
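
For a small sample, a minimal sketch with pandas could look like this (the file name "dogs.csv" and its columns are hypothetical placeholders for your own data):

# Load a CSV file into a pandas DataFrame.
df = pd.read_csv("dogs.csv")  # hypothetical file name

# Quick sanity checks on what was loaded.
print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first five rows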

Step 4: Explore Your Data 🔍

Because we can have many examples, it is important to explore the data in the form of summary statistics, which we can compute directly on the pandas DataFrame.

To do this you can use the groupby() function on the data frame, grouping by a certain feature (such as the dog type), and do calculations per type.
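
For example, assuming the hypothetical columns "type" and "weight_kg" from the loading step:

# Overall summary statistics for every numerical column.
print(df.describe())

# Average weight per dog type (hypothetical column names).
print(df.groupby("type")["weight_kg"].mean())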

Step 5: Clean Your Data 🧹

  • Handling Missing Values: Identify missing data in your dataset. You can replace missing values with statistical measures like mean, median, or mode, delete rows/columns with missing values, or use advanced imputation techniques like K-Nearest Neighbors (KNN) or predictive models (see the sketch after this list).
  • Removing Duplicates: Identify and remove duplicate rows, as they can skew the model's training by giving undue importance to certain observations.
  • Dealing with Outliers: Detect outliers using statistical methods or visualization techniques (box plots, scatter plots, etc.). Decide whether to remove outliers, transform them, or treat them differently based on the context of your data and the problem you're solving.
  • Handling Inconsistent Data: Check for inconsistencies in categorical variables (e.g., different spellings for the same category). Standardize or correct inconsistencies to ensure uniformity.
  • Feature Scaling and Normalization: Scale numerical features to a similar range to prevent certain features from dominating the model due to larger scales.
  • Encoding Categorical Variables: Convert categorical variables into a numerical format using techniques like one-hot encoding or label encoding.
  • Feature Selection: Identify irrelevant or redundant features that might not contribute much to the model's predictive power. Use techniques like correlation analysis, feature importance, or dimensionality reduction (PCA, LDA) to select the most important features.
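
Here is a minimal sketch of a few of these cleaning steps with pandas, assuming the hypothetical numeric column "weight_kg" and a hypothetical categorical feature "coat_color":

# Fill missing numerical values with the column median.
df["weight_kg"] = df["weight_kg"].fillna(df["weight_kg"].median())

# Remove duplicate rows.
df = df.drop_duplicates()

# One-hot encode a categorical feature column.
df = pd.get_dummies(df, columns=["coat_color"])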

Step 6: Train and Test the Model 📊🔍

To train an ML model we use only a portion of our data, because we need to hold back the rest to test the model and see how well it performs on examples it has never seen.

Typically, we can use a ratio of 70%/30%: 70% of our data for training and 30% for testing. The split has to be a random selection, so that neither subset follows any particular criteria.

# Split the dataset into a training and a testing dataset.

def split_dataset(dataset, test_ratio=0.30):
  """Splits a pandas dataframe into a training set and a testing set."""
  # Draw a random number per row; rows below the ratio go to the test set.
  test_indices = np.random.rand(len(dataset)) < test_ratio
  return dataset[~test_indices], dataset[test_indices]


train_ds_pd, test_ds_pd = split_dataset(df)
print("{} examples in training, {} examples for testing.".format(
    len(train_ds_pd), len(test_ds_pd)))
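
With the data split, a minimal training and evaluation sketch with TensorFlow Decision Forests could look like this (assuming the hypothetical label column "type" from the earlier examples):

# Convert the pandas dataframes into TensorFlow datasets, naming the label column.
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label="type")
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_ds_pd, label="type")

# Train a random forest: a decision forest built from many decision trees.
model = tfdf.keras.RandomForestModel()
model.fit(train_ds)

# Evaluate on the held-out test data.
model.compile(metrics=["accuracy"])
print(model.evaluate(test_ds, return_dict=True))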

Step 7: Retrain the Model 💪

You can gather new, relevant data and apply the same preprocessing steps used during the initial training phase. Divide the new data into training, validation, and test sets, maintaining consistency with the original data splitting strategy, retrain the model, and finally integrate the retrained model into the existing system.
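
A minimal sketch of that idea, where "new_df" is a hypothetical DataFrame of newly collected (and identically preprocessed) data:

# Combine the original data with the new data and retrain from scratch.
combined_df = pd.concat([df, new_df], ignore_index=True)
new_train_pd, new_test_pd = split_dataset(combined_df)

retrained_model = tfdf.keras.RandomForestModel()
retrained_model.fit(tfdf.keras.pd_dataframe_to_tf_dataset(new_train_pd, label="type"))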

As you can see, to master ML you just need to follow the correct steps and keep practicing with the free resources available online.

Learning Resources 📗

A good place to keep practicing is the Google codelab referenced throughout this article: https://developers.google.com/codelabs/tensorflow-1-helloworld#1

Finally, remember that ML is an important tool that can help you analyze huge amounts of data to solve complex problems and automate repetitive tasks, leading to increased efficiency and reduced human error. It can also personalize experiences (communication, recommendations, etc.), identify the most efficient ways to achieve a goal, uncover insights, help in making informed and data-driven decisions, find correlations that might not be apparent through traditional analysis, assist in risk assessment and mitigation, and pave the way for innovative solutions to problems.

I hope you find this high-level overview of ML helpful for your career development. See you in the next WTM article!

Karen Avellaneda

Google Women Techmaker Ambassador
