How to start with Machine Learning
You do not need to be an engineer to use and understand machine learning (ML). In this article I will show you some easy steps to get started: first a high-level overview of how ML models generally work, and then the 7 steps you take to build a model.
What is ML? 🤔💭
ML is the process of teaching a computer to learn patterns from data and then apply those patterns to make predictions on new data. So, you provide the computer with many examples of input data as well as the desired output and let the computer learn the rule itself.
ML is effective for problems where you have lots of data with complex relationships, for which it would be very difficult for humans to write rules by hand. Examples include image, video, and object recognition; Natural Language Processing (NLP: language translation, sentiment analysis, chatbots, text summarization, and language generation); recommendation systems; predictive analytics; anomaly detection; and resource optimization.
ML Model 🤖
To build the ML model in this article, we need to understand two main concepts: decision trees and decision forests.
A decision tree is an algorithm that learns by splitting the data into smaller and smaller subsets. The splits are determined by how well each subset helps to predict the desired outcome.
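To illustrate how a split can be chosen, here is a minimal sketch that tries every candidate threshold on one feature and keeps the one that produces the purest subsets. It assumes Gini impurity as the measure of "how well each subset helps to predict the outcome", which is one common choice among several; the weights and breed labels are made up.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 means every label in the subset is the same."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Try each candidate threshold and keep the one with the purest subsets."""
    best_threshold, best_score = None, float("inf")
    for threshold in sorted(set(values)):
        left = [lab for val, lab in zip(values, labels) if val < threshold]
        right = [lab for val, lab in zip(values, labels) if val >= threshold]
        if not left or not right:
            continue  # a split must put at least one example on each side
        # Weighted average impurity of the two subsets.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Made-up example data: dog weights in kg and their breeds.
weights = [9, 11, 12, 29, 32, 34]
breeds = ["Beagle", "Beagle", "Beagle", "Golden", "Golden", "Golden"]

threshold, impurity = best_split(weights, breeds)
print(threshold, impurity)  # 29 0.0
```

A real tree repeats this search inside each subset, which is how the data ends up in smaller and smaller groups.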
Here's a simple decision tree for predicting different types of dogs! (We can imagine Type 1 is Beagle, Type 2 is Golden Retriever, etc.!) In this model, several conditions (questions that must be answered) lead to classifying each dog into one particular category (the prediction).
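A decision tree like this one can also be read as a chain of nested conditions. The sketch below is purely illustrative: the features (`weight_kg`, `has_long_ears`), the thresholds, and the "other" categories are made-up assumptions, not rules learned from real data.

```python
def classify_dog(weight_kg: float, has_long_ears: bool) -> str:
    """Hand-written decision tree; every threshold here is a made-up example."""
    if weight_kg < 15:        # condition 1: is it a small dog?
        if has_long_ears:     # condition 2: does it have long ears?
            return "Beagle"
        return "Other small dog"
    if weight_kg < 40:        # condition 3: is it a medium-to-large dog?
        return "Golden Retriever"
    return "Other large dog"


print(classify_dog(weight_kg=10, has_long_ears=True))   # Beagle
print(classify_dog(weight_kg=30, has_long_ears=False))  # Golden Retriever
```

In a trained model, the computer picks the conditions and thresholds from the data instead of a person writing them by hand.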
Decision Forests (DF) are a large family of machine-learning algorithms for supervised classification, regression, and ranking. As the name suggests, DFs use decision trees as a building block.
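One common way for a forest to combine its trees, used for example by random forests, is a majority vote. The sketch below assumes three tiny hand-written "trees" with made-up features and thresholds; a real library would learn many larger trees from data.

```python
from collections import Counter


def tree_by_weight(dog):
    return "Beagle" if dog["weight_kg"] < 15 else "Golden Retriever"


def tree_by_height(dog):
    return "Beagle" if dog["height_cm"] < 45 else "Golden Retriever"


def tree_by_coat(dog):
    return "Golden Retriever" if dog["long_coat"] else "Beagle"


def forest_predict(dog, trees):
    """Ask every tree for a prediction and return the most common answer."""
    votes = Counter(tree(dog) for tree in trees)
    return votes.most_common(1)[0][0]


trees = [tree_by_weight, tree_by_height, tree_by_coat]
dog = {"weight_kg": 13, "height_cm": 47, "long_coat": False}

# Two trees vote "Beagle" and one votes "Golden Retriever".
print(forest_predict(dog, trees))  # Beagle
```

Because each tree looks at the data differently, a single tree's mistake can be outvoted by the others.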
Now that we understand the logic behind ML, we are ready to build our ML model.
Step 1: Set Up the environment 📊
As a first step, we need to install a decision forest library. Free libraries are available for training decision forest models, such as TensorFlow Decision Forests, which is typically used together with pandas and NumPy for handling the data.
You can then import these libraries and load your data into a pandas DataFrame (df), a table format that Python can work with directly.
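As a minimal sketch of what a data frame looks like, the example below builds one by hand. The column names and values are made up for illustration; in practice the rows would come from your own data source.

```python
import pandas as pd

# A tiny made-up table: each row is one example, each column is one feature.
df = pd.DataFrame(
    {
        "weight_kg": [10.5, 31.0, 12.2, 29.4],
        "height_cm": [38, 58, 40, 56],
        "dog_type": ["Beagle", "Golden Retriever", "Beagle", "Golden Retriever"],
    }
)

print(df.head())  # the first rows of the table
print(df.shape)   # (number of rows, number of columns)
```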
Step 2: Get the Data 📈
ML learns patterns and trends from the data, so it is very important to know your data well. Expect to spend a lot of your time here, because this is the most important part of the process.
Some things you need to know about your data: Where does it come from? Who collected it, and why? What kind of data ("features") do you have: numbers (integers, "int", or decimals, "float") or words ("strings")? What questions can this data answer? What trends and patterns can you identify? Does it contain bias?
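Some of these questions, such as which types your features have, can be checked with a few lines of pandas. The sketch below uses a made-up table with hypothetical column names; questions about where the data came from or whether it is biased still need human judgment.

```python
import pandas as pd

# Made-up rows for illustration; None marks a missing value.
df = pd.DataFrame(
    {
        "age_years": [3, 5, 2],
        "weight_kg": [10.5, None, 12.2],
        "dog_type": ["Beagle", "Golden Retriever", "Beagle"],
    }
)

print(df.dtypes)                      # the type of each feature (column)
print(df.isna().sum())                # how many values are missing per feature
print(df["dog_type"].value_counts())  # how many examples of each dog type
```

A very uneven count per category is one simple signal that the data may be unbalanced.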
Step 3: Load the Data Set 📉
You can use a small data set (around 300 examples) or a much bigger one (more than 1 million), depending on the level of accuracy that you expect. You can load your data using pandas for a small sample or TensorFlow for larger ones; just be sure that the data is stored in a CSV-like file (.csv).
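For a small sample, loading a CSV with pandas can look like the sketch below. To keep the example self-contained, the CSV content is an in-memory string with made-up values; with a real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Stand-in for a file on disk. With a real file you would write something like
# pd.read_csv("dogs.csv"), where "dogs.csv" is a hypothetical file name.
csv_text = """weight_kg,height_cm,dog_type
10.5,38,Beagle
31.0,58,Golden Retriever
12.2,40,Beagle
"""

df = pd.read_csv(io.StringIO(csv_text))

print(len(df), "examples loaded")
print(df.columns.tolist())
```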
Step 4: Explore your data🔍
Because we can have many examples, it is important to explore the data through summary statistics, which we can compute directly on the pandas DataFrame.
For example, you can call groupby() on the data frame to group the rows by a certain feature, such as dog type, and then run calculations per type.
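A minimal sketch of this kind of exploration, using a made-up table with hypothetical `dog_type` and `weight_kg` columns:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "dog_type": ["Beagle", "Beagle", "Golden Retriever", "Golden Retriever"],
        "weight_kg": [10.0, 12.0, 30.0, 32.0],
    }
)

# Summary statistics (count, mean, min, max, ...) for the numeric columns.
print(df.describe())

# Average weight for each dog type.
mean_weight = df.groupby("dog_type")["weight_kg"].mean()
print(mean_weight)
```

Here the per-type averages come out to 11.0 for Beagle and 31.0 for Golden Retriever, which already hints that weight is a useful feature for telling the two apart.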
Step 5: Clean your Data 🧹
Step 6: Train and Test the Model 📊🔍
To train an ML model, we use only part of our data, because we need to hold back a portion to test the model and see how well it performs on examples it has not seen.
A typical ratio is 70%/30%: 70% of our data is training data and 30% is testing data. The split must be a random selection; rows should not be assigned to either set according to any particular criterion.
import numpy as np

# Split the dataset into a training and a testing dataset.
def split_dataset(dataset, test_ratio=0.30):
    """Splits a pandas DataFrame in two."""
    # Each row draws a random number in [0, 1); rows below test_ratio go to the test set.
    test_indices = np.random.rand(len(dataset)) < test_ratio
    return dataset[~test_indices], dataset[test_indices]

train_ds_pd, test_ds_pd = split_dataset(df)
print("{} examples in training, {} examples for testing.".format(
    len(train_ds_pd), len(test_ds_pd)))
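Once a model has been trained on the training portion, one simple way to see how well it performs is accuracy: the fraction of test examples it predicts correctly. The sketch below assumes you already have the model's predictions for the test set; both lists are made up for illustration.

```python
def accuracy(predictions, true_labels):
    """Fraction of test examples where the prediction matches the true label."""
    correct = sum(p == t for p, t in zip(predictions, true_labels))
    return correct / len(true_labels)


# Made-up test labels and model predictions.
true_labels = ["Beagle", "Golden Retriever", "Beagle", "Beagle"]
predictions = ["Beagle", "Golden Retriever", "Golden Retriever", "Beagle"]

print(accuracy(predictions, true_labels))  # 0.75
```

Accuracy is only one possible metric; it is shown here because it is the easiest to read.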
Step 7: Retrain the Model 💪
To retrain, gather new, relevant data and apply the same preprocessing steps used during the initial training phase. Divide the new data into training, validation, and test sets, keeping the splitting strategy consistent with the original one. Finally, integrate the retrained model into the existing system.
As you can see, to master ML you need to follow the correct steps and keep practicing with the free resources available online.
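One way to keep preprocessing consistent is to put it in a single function and run both the original and the new data through it. The sketch below assumes a hypothetical `preprocess` step (dropping incomplete rows and tidying label text) and made-up data; your own cleaning steps will differ.

```python
import pandas as pd


def preprocess(frame):
    """Hypothetical cleaning step: drop incomplete rows and tidy the labels."""
    cleaned = frame.dropna().copy()
    cleaned["dog_type"] = cleaned["dog_type"].str.strip().str.title()
    return cleaned


# Made-up original data and a newly gathered batch.
original = pd.DataFrame(
    {"weight_kg": [10.0, 30.0], "dog_type": ["Beagle", "Golden Retriever"]}
)
new_batch = pd.DataFrame(
    {
        "weight_kg": [11.5, None, 33.0],
        "dog_type": [" beagle", "Beagle", "golden retriever "],
    }
)

# The same function is applied to both, so old and new rows stay comparable.
combined = pd.concat(
    [preprocess(original), preprocess(new_batch)], ignore_index=True
)
print(combined)
```

The combined table can then be split with the same ratio and method as before, and the model trained again on the result.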
Here you can find some resources that may be helpful.
Learning Resources📗
Finally, remember that ML is an important tool that can help you analyze huge amounts of data to solve complex problems and automate repetitive tasks, leading to increased efficiency and fewer human errors. It can also analyze data to personalize experiences (communication, recommendations, etc.), identify the most efficient ways to achieve a goal, uncover insights, support informed, data-driven decisions, find correlations that might not be apparent through traditional analysis, assist in risk assessment and mitigation, and pave the way for innovative solutions to problems.
I hope you find this high-level overview of ML helpful for your career development. See you in the next WTM article!
Karen Avellaneda
Google Women Techmaker Ambassador