introduction to machine learning

ICT 3202 - INTRODUCTION
TO DATA SCIENCE
BY
ENGR. JOHNSON C. UBAH
B.ENG, M.ENG, HCNA, ASM

Machine Learning and Statistics

Machine learning is the practice of programming computers to learn from
data.
Machine learning is a subfield of artificial intelligence (AI). The goal of
machine learning generally is to understand the structure of data and fit
that data into models that can be understood and utilized by people.
In machine learning, the data a system learns from is called the training set,
and each item in it is called a training example.

Intro. To Machine Learning
Machine learning differs from traditional computational approaches:
Traditional computing algorithms are sets of explicitly programmed steps that
computers follow to solve problems.
Machine learning algorithms allow computers to train on data inputs and use
statistical analysis to generate output values that fall within a specific
range.

Why Machine Learning?
Let’s assume you’d like to write a spam filter program without using machine
learning methods. The steps would be:
1. You’d take a look at what spam e-mails look like.
2. You’d write an algorithm to detect the patterns that you’ve seen, and the
software would then flag matching e-mails as spam.
3. Finally, you’d test the program and repeat the first two steps until the
results are good enough.
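The hand-written approach described above can be sketched as follows. This is only an illustration: the phrase list and the function name `is_spam` are made up for this example and do not come from the course material.

```python
# Hypothetical hand-written rules; the phrases are illustrative only.
SPAM_PHRASES = ("free money", "click here", "you have won")

def is_spam(email_text: str) -> bool:
    """Flag an e-mail as spam if it contains any hard-coded phrase."""
    text = email_text.lower()
    return any(phrase in text for phrase in SPAM_PHRASES)

print(is_spam("Click here to claim your FREE MONEY"))  # True
print(is_spam("Meeting moved to 3 pm"))                # False
```

Every new spam pattern means another entry in the rule list, which is why such a program keeps growing.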

Why Machine learning?
Such a program contains a very long list of rules and is hence
difficult to maintain. A filter built with machine learning learns
the patterns from examples, so it is easier to maintain.
Programs that use ML techniques can automatically detect
changes in what users flag and update themselves accordingly.
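As a contrast to hand-written rules, here is a minimal sketch of a filter that learns from labelled e-mails. It uses a simple word-count heuristic chosen for brevity, not any particular algorithm from the course; the names `train` and `predict` and all e-mail texts are assumptions made for this example.

```python
from collections import Counter

def train(examples):
    """Count how often each word occurs in spam and in non-spam ("ham") e-mails."""
    spam_words, ham_words = Counter(), Counter()
    for text, label in examples:
        counts = spam_words if label == "spam" else ham_words
        counts.update(text.lower().split())
    return spam_words, ham_words

def predict(model, text):
    """Label an e-mail by whichever class has seen its words more often."""
    spam_words, ham_words = model
    words = text.lower().split()
    spam_score = sum(spam_words[w] for w in words)
    ham_score = sum(ham_words[w] for w in words)
    return "spam" if spam_score > ham_score else "ham"

# Made-up training examples: (e-mail text, class).
examples = [
    ("win free money now", "spam"),
    ("free prize click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train(examples)
print(predict(model, "free money prize"))   # spam
print(predict(model, "cheap pills offer"))  # ham (these words are still unknown)

# Users flag a new kind of spam; retraining picks it up with no code change.
examples.append(("cheap pills offer today", "spam"))
model = train(examples)
print(predict(model, "cheap pills offer"))  # spam
```

The code itself never changes when spam changes; only the training data does.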

Why Machine Learning?
A machine learning algorithm updates automatically when users change their preferences.

When to use machine learning
When you have a problem that requires many rules to find the
solution.
Very complex problems for which there is no good solution using a
traditional approach.
Non-stable environments: machine learning software can adapt to
new data.

Classification of ML
There are several types of machine learning systems. We can divide them into
categories, depending on:
1. Whether they are trained with human supervision
◦ Supervised
◦ Unsupervised
◦ Semi-supervised
◦ Reinforcement learning
2. Whether they can learn incrementally
3. Whether they work simply by comparing new data points to known data points, or
instead detect patterns in the training data and then build a model.

Supervised and unsupervised learning
We can classify machine learning systems according to the type
and amount of human supervision during training. They are:
◦ Supervised learning
◦ Unsupervised learning
◦ Semi-supervised learning
◦ Reinforcement learning

Supervised learning
Supervised learning is when an algorithm learns from example data and
associated target responses, which can be numeric values or string labels
such as classes or tags, in order to later predict the correct response
when posed with new examples.
This approach is similar to human learning under the
supervision of a teacher.

Tasks carried out by supervised learning
A typical supervised learning task is classification. The
spam filter program is a good example of this, because
it is trained with many e-mails together with their class
(spam or not spam).
Another task is to predict a numeric value, like the
price of a flat, given a set of features (location, number
of rooms, facilities) called predictors; this task is called
regression.
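To make the regression task concrete, here is a minimal sketch that fits a straight line to one predictor (number of rooms) by ordinary least squares. The prices are made-up numbers and `fit_line` is a hypothetical helper written for this example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

rooms = [1, 2, 3, 4]          # predictor
prices = [50, 70, 90, 110]    # target, made-up prices in thousands

slope, intercept = fit_line(rooms, prices)
predicted_price = slope * 5 + intercept   # price of an unseen 5-room flat
print(predicted_price)  # 130.0
```

A real model would use several predictors (location, facilities) at once; one predictor keeps the arithmetic easy to follow.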

Supervised learning algorithms
Keep in mind that some regression algorithms can be
used for classification as well, and vice versa.
Some important supervised algorithms:
◦ K-nearest neighbors
◦ Linear regression
◦ Neural networks
◦ Support vector machines
◦ Logistic regression
◦ Decision trees and random forests
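As an illustration of the first algorithm in the list, the following is a minimal sketch of a k-nearest neighbors classifier: a new point receives the most common label among its k closest training points. The data and the name `knn_predict` are assumptions for this example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up 2D training set with two classes.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, (2, 2)))  # A
print(knn_predict(points, labels, (7, 9)))  # B
```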

Unsupervised learning
Unsupervised learning occurs when an algorithm learns from plain examples
without any associated response, leaving the algorithm to determine the data
patterns on its own.
This type of algorithm tends to restructure the data into something else, such as
new features that may represent a class or a new series of uncorrelated values.
Such algorithms are quite useful in providing humans with insights into the meaning
of data and new useful inputs to supervised machine learning algorithms.

Unsupervised learning
As a kind of learning, it resembles the methods humans use to figure
out that certain objects or events are from the same class, such as by
observing the degree of similarity between objects. Some
recommendation systems that you find on the web in the form of
marketing automation are based on this type of learning.
In this type of learning the data is unlabeled.

Unsupervised learning algorithms
Some unsupervised learning algorithms include:
◦ Clustering: k-means, hierarchical cluster analysis
◦ Association rule learning: Eclat, Apriori
◦ Visualization and dimensionality reduction: PCA, kernel PCA,
t-distributed stochastic neighbor embedding (t-SNE)
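To show what clustering does with unlabeled data, here is a minimal k-means sketch. It assumes the simplest possible initialization (the first k points become the starting centroids), which is a simplification for this example; the data and the name `k_means` are also made up.

```python
import math

def k_means(points, k, iterations=10):
    """Group points into k clusters by repeatedly moving centroids to cluster means."""
    centroids = list(points[:k])  # assumption: first k points as starting centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return centroids, clusters

# Unlabeled, made-up 2D data with two visible groups.
data = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centroids, clusters = k_means(data, k=2)
print(clusters[0])  # the points near (1, 1)
print(clusters[1])  # the points near (9, 9)
```

No labels are given anywhere; the groups emerge from the distances alone.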

Examples of unsupervised learning
Suppose you have a lot of data on your visitors. You can use a
clustering algorithm to detect groups of similar visitors. For
example, 65% of your visitors might be males who love watching
movies in the evening, while 30% watch plays in the evening. A
clustering algorithm finds these groups and can split them into
smaller ones.
Secondly, visualization algorithms take a lot of unlabeled
data as input and give you a 2D or 3D representation of it
as output. Feature extraction takes place here.

Reinforcement learning
An agent (the AI system) observes the
environment, performs actions, and
receives rewards in return.
Here, the agent must learn by itself which
actions yield the most reward.
You can find this type of learning in many
robotics applications, such as robots that
learn how to walk.
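The observe, act, reward loop can be sketched with tabular Q-learning on a toy problem. Everything here is an assumption made for illustration: a four-cell corridor whose last cell is the goal, a reward of 1 for reaching it, and an agent that explores by acting at random while it learns.

```python
import random

N_STATES = 4          # corridor cells 0..3; cell 3 is the goal
ACTIONS = (-1, +1)    # step left or step right

def step(state, action):
    """Environment: move the agent and return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn action values (a Q-table) from rewards alone."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            a = rng.randrange(2)  # explore by acting at random
            next_state, reward, done = step(state, ACTIONS[a])
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q

q_table = train()
# Greedy policy for the non-goal cells: pick the action with the higher value.
policy = ["left" if left > right else "right" for left, right in q_table[:-1]]
print(policy)  # ['right', 'right', 'right']
```

Nobody tells the agent that moving right is correct; it infers this from the reward received at the goal.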

Semi-supervised learning
Semi-supervised learning is where an incomplete training signal is
given: a training set with some (often many) of the target outputs missing.
A special case of this principle is known as transduction,
where the entire set of problem instances is known at
learning time, except that part of the targets are
missing.

Bad and Insufficient Quantity of Training Data
Machine learning systems are not like children,
who can learn to distinguish apples and oranges in
all sorts of colors and shapes from a few examples.
They require a lot of data to work effectively,
whether you’re working on very simple problems or
on complex applications like image processing and
speech recognition.

Poor Quality Data
If you are working with training data that is full of errors and
outliers, this will make it very hard for the system to detect
patterns, so it won’t work properly.
So, if you want your program to work well, you must spend
more time cleaning up your training data.

Irrelevant features
The system will only be able to learn if the training data contains enough relevant
features and not too many irrelevant ones. A critical part of any ML project is to
develop good features; this is called “feature engineering”.
Feature engineering follows this process:
◦ Feature selection: selecting the most useful features
◦ Feature extraction: combining existing features to provide more useful features.
◦ Creation of new features: creation of new features, based on data.

Testing
To check that your model works well and can generalize
to new cases, you can try it out on new cases by putting the
model in its environment and monitoring how it performs.
This is good practice.
You should also divide your data into two sets, one for training and the
second for testing.

Testing
The generalization error is the error rate on new cases; you estimate it by
evaluating your model on the test set. This value tells you how well your model is
likely to perform on cases it has never seen.
If the error rate is low, the model is good and will perform properly, and vice
versa.
It is common to use 80% of your data for training and 20% for testing.
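The 80/20 split and the error rate on the test set can be sketched as follows. The threshold "model" is a deliberately tiny stand-in for a real learning algorithm, and the data and function names are assumptions for this example.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle the examples and hold out a fraction of them for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_threshold(train_set):
    """Toy model: put the decision threshold midway between the two classes."""
    largest_small = max(x for x, label in train_set if label == "small")
    smallest_big = min(x for x, label in train_set if label == "big")
    return (largest_small + smallest_big) / 2

def error_rate(threshold, data):
    """Fraction of examples whose predicted label differs from the true label."""
    wrong = sum(1 for x, label in data
                if ("big" if x >= threshold else "small") != label)
    return wrong / len(data)

# Made-up labelled data: numbers from 0 to 95, "big" from 50 upwards.
examples = [(x, "big" if x >= 50 else "small") for x in range(0, 100, 5)]

train_set, test_set = train_test_split(examples)
threshold = fit_threshold(train_set)
print(len(train_set), len(test_set))  # 16 4
print("estimated generalization error:", error_rate(threshold, test_set))
```

The model never sees the test set during training, so the error measured on it is an estimate of how the model behaves on new cases.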

Overfitting the data
Overgeneralization in machine learning is called “overfitting”: the model
performs well on the training data but does not generalize well.
Overfitting occurs when the model is too complex for the amount and
noisiness of the training data given.
Solution:
◦ Gather more training data
◦ Reduce the noise in the training data
◦ Select a simpler model with fewer parameters
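Overfitting can be demonstrated on a small, made-up data set. The sketch below compares a straight line (two parameters) with a polynomial that passes through every training point (built here with Lagrange interpolation, chosen only because it fits the training data exactly). The complex model has zero training error but a larger error on held-out points.

```python
def line_model(xs, ys):
    """Simple model: least-squares straight line (2 parameters)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def interpolating_model(xs, ys):
    """Complex model: the polynomial through every training point (Lagrange form)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

def mse(model, xs, ys):
    """Mean squared error of a model on a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up noisy samples of roughly y = 2x.
train_x, train_y = [0, 1, 2, 3, 4], [0.5, 1.6, 4.6, 5.4, 8.5]
test_x, test_y = [0.5, 3.5], [1.0, 7.0]

simple = line_model(train_x, train_y)
complex_model = interpolating_model(train_x, train_y)

print("train error:", mse(simple, train_x, train_y), mse(complex_model, train_x, train_y))
print("test error: ", mse(simple, test_x, test_y), mse(complex_model, test_x, test_y))
```

The polynomial has learned the noise in the five training points, which is why the simpler model wins on the test points.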

Under-fitting the data
This is the opposite of overfitting. You will encounter it when the model is too
simple to learn the underlying structure of the data.
For example, in a model of quality of life, real life is more complex than
the model, so the predictions will be inaccurate, even on the training
examples.
Solution:
◦ Select a more powerful model, with more parameters
◦ Feed better features to the learning algorithm (feature engineering)
◦ Reduce the constraints on your model
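A small sketch of underfitting and of the second remedy above: a straight line cannot follow data that curves, so its error is high even on the training examples, while the same line fitted on a better feature (here the square of x) matches the data. The data and helper names are assumptions for this example.

```python
def fit_line(features, targets):
    """Least-squares fit of target = slope * feature + intercept."""
    n = len(features)
    mean_f, mean_t = sum(features) / n, sum(targets) / n
    slope = (sum((f - mean_f) * (t - mean_t) for f, t in zip(features, targets))
             / sum((f - mean_f) ** 2 for f in features))
    return slope, mean_t - slope * mean_f

def training_error(features, targets):
    """Mean squared error of the fitted line on its own training data."""
    slope, intercept = fit_line(features, targets)
    return sum((slope * f + intercept - t) ** 2
               for f, t in zip(features, targets)) / len(features)

xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]            # made-up curved data: y = x squared

underfit_error = training_error(xs, ys)          # straight line on x
squared_feature = [x * x for x in xs]            # engineered feature
better_error = training_error(squared_feature, ys)

print(underfit_error)  # high, even though this is the training data
print(better_error)    # close to zero
```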

Software for this course
Python’s popularity may be due to the increased development of deep learning
frameworks available for this language recently, including TensorFlow, PyTorch,
and Keras. As a language that has readable syntax and the ability to be used as a
scripting language, Python proves to be powerful and straightforward both for
preprocessing data and working with data directly. The scikit-learn machine
learning library is built on top of several existing Python packages that Python
developers may already be familiar with, namely NumPy, SciPy, and Matplotlib.

Software for this course
MATLAB makes machine learning easy. With tools and functions for
handling big data, as well as apps to make machine learning accessible,
MATLAB is an ideal environment for applying machine learning to your
data analytics.
With MATLAB, engineers and data scientists have immediate access to
prebuilt functions, extensive toolboxes, and specialized apps
for classification, regression, and clustering.
