UNIT 1
INTRODUCTION TO DATA SCIENCE
- Prof. Apurva Bijawe
What is Data Science?
► Data science is the domain of study that deals with vast
volumes of data using modern tools and techniques to find
unseen patterns, derive meaningful information, and make
business decisions.
► Data science uses complex machine learning algorithms to
build predictive models.
► The data used for analysis can come from many different
sources and be presented in various formats.
What is Big Data?
► Big data is a combination of structured, semi-structured and
unstructured data collected by organizations that can be mined
for information and used in machine learning projects, predictive
modeling and other advanced analytics applications.
► Big data is often characterized by the three V’s:
• the large volume of data in many environments;
• the wide variety of data types frequently stored in big data
systems and
• the velocity at which much of the data is generated, collected
and processed.
Different types of Data
► Qualitative Data Type
• Nominal
• Ordinal
► Quantitative Data Type
• Discrete
• Continuous
Introduction
► We are entering the digital era, in which we produce enormous
amounts of data. For instance, a company like Flipkart generates
more than 2 TB of data on a daily basis.
► Because this data is so important to our lives, it becomes
essential to store and process it properly, without errors.
► When dealing with datasets, the category of data plays an
important role in determining which preprocessing strategy will
work for a particular set, and which type of statistical analysis
should be applied for the best results.
Qualitative Data Type
► Qualitative or Categorical Data describes the object under
consideration using a finite set of discrete classes.
► This means that this type of data can’t easily be counted or measured
using numbers, and it is therefore divided into categories.
► The gender of a person (male, female, or others) is a good example of
this data type.
► These values are usually extracted from audio, image, or text sources.
► Another example is a smartphone listing that provides
information about the current rating, the color of the phone, the category
of the phone, and so on. All this information can be categorized as
Qualitative data.
► There are two subcategories under this:
1. Nominal
► These are the set of values that don’t possess a natural ordering.
► Let’s understand this with some examples. The color of a smartphone
can be considered as a nominal data type as we can’t compare one
color with others.
► It is not possible to state that ‘Red’ is greater than ‘Blue’. The gender
of a person is another example, since we can’t rank male,
female, or others. Mobile phone categories, whether midrange,
budget segment, or premium, are also nominal data.
► Nominal data types in statistics are not quantifiable and cannot be
measured in numerical units. Nominal statistical data is
valuable when conducting qualitative research, as it gives
subjects the freedom to express their opinions.
2. Ordinal
► These types of values have a natural ordering while maintaining their
class of values.
► If we consider the sizes offered by a clothing brand, we can easily sort
them according to their tags in the order small < medium < large.
► The grading system used to mark candidates in a test can also be
considered an ordinal data type, where an A+ is definitely better than a B
grade.
These categories help us decide which encoding strategy can
be applied to which type of data.
Encoding qualitative data is important because machine
learning models can’t handle these values directly; they need to
be converted to numerical types, since the models are mathematical
in nature.
For the nominal data type, where there is no comparison among the
categories, one-hot encoding can be applied, which is similar to
binary coding and works well when the number of categories is small.
For the ordinal data type, label encoding can be applied, which is a
form of integer encoding.
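For illustration, here is a minimal sketch of both strategies. It assumes a tiny pandas DataFrame with hypothetical columns color (nominal) and size (ordinal); pd.get_dummies performs one-hot encoding, and scikit-learn's OrdinalEncoder performs the integer encoding.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical smartphone data: 'color' is nominal, 'size' is ordinal
df = pd.DataFrame({
    "color": ["Red", "Blue", "Red", "Black"],
    "size": ["small", "large", "medium", "small"],
})

# Nominal feature: one-hot encoding (no order among the categories)
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal feature: integer encoding that preserves small < medium < large
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_encoded"] = encoder.fit_transform(df[["size"]]).ravel()

print(one_hot)
print(df[["size", "size_encoded"]])
```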
Difference Between Nominal and Ordinal Data

| Aspect | Nominal Data | Ordinal Data |
| --- | --- | --- |
| Definition | Categorizes data into distinct classes or categories without any inherent order or ranking. | Categorizes data into ordered or ranked categories with meaningful differences between them. |
| Examples | Colors, gender, types of animals | Education levels, customer satisfaction ratings |
| Mathematical operations | No meaningful mathematical operations can be performed (e.g., averaging categories). | Limited mathematical operations can be performed, such as determining the mode or median. |
| Order/ranking | No natural or meaningful order exists. | Categories have a specific order or ranking, but the magnitude of the differences between ranks may not be uniform. |
| Central tendency | Mode (most frequent category). | Mode and median (middle category); the mean is not typically used because the intervals between ranks are not uniform. |
| Example use case | Classifying objects, grouping data. | Rating scales, survey responses, educational levels. |
Quantitative Data Type
► This data type tries to quantify things, and it does so using
numerical values that make it countable in nature.
► The price of a smartphone, the discount offered, the number of ratings on
a product, the processor frequency of a smartphone, or the RAM of
that particular phone: all of these fall under the category of
Quantitative data types.
► The key point is that a feature can take an infinite number of
values.
► For instance, the price of a smartphone can vary from some amount x to
any value, and it can be broken down further into fractional
values.
► The two subcategories which describe them clearly are:
• Discrete
• Continuous
1. Discrete
► Numerical values that are integers or whole numbers are
placed under this category.
► The number of speakers in a phone, the number of cameras, the number
of cores in the processor, and the number of SIMs supported are all
examples of the discrete data type.
► Discrete data in statistics cannot be measured – it
can only be counted, as the objects included in discrete data
have fixed values.
► The value may be written in decimal notation, but it has to be
whole. Discrete data is often presented in charts,
including bar charts, pie charts, and tally charts.
2. Continuous
► Fractional numbers are considered continuous values.
► Examples include the operating frequency of the processor,
the Android version of the phone, the Wi-Fi frequency,
the temperature of the cores, and so on.
► Unlike discrete data, which has whole,
fixed values, continuous data can be broken down into smaller
pieces and can take any value.
► For example, varying quantities such as temperature and the
weight of a person are continuous values.
Continuous statistical data is typically represented using a
graph whose line reflects value fluctuations, with highs and
lows over a period of time.
Difference between Discrete Data and Continuous Data

| Aspect | Discrete Data | Continuous Data |
| --- | --- | --- |
| Definition | Consists of distinct, separate values. | Can take any value within a given range. |
| Examples | Number of students in a class, coin toss outcomes (1, 2, 3), customer count. | Height, weight, temperature, time. |
| Nature | Usually involves whole numbers or counts. | Involves any value along a continuous spectrum. |
| Gaps in values | Gaps between values are common and meaningful. | Values can be infinitely divided without gaps. |
| Measurement | Often measured using integers. | Measured with decimal numbers or fractions. |
| Graphical representation | Typically represented with bar charts or histograms. | Represented with line graphs or smooth curves. |
| Mathematical operations | Typically involves counting or summation. | Involves arithmetic operations, including fractions and decimals. |
| Probability distribution | Typically represented using probability mass functions. | Typically represented using probability density functions. |
| Example use case | Counting occurrences, tracking integers. | Measuring quantities and analyzing measurements. |
Data Science Process/Life-Cycle
► The data science life cycle revolves around the use of
machine learning and different analytical strategies to
produce insights and predictions from data in
order to achieve a business objective.
► The data science project life cycle is a
methodology that outlines the stages of a data
science project, from planning to deployment.
► The project life cycle provides a framework that
helps data scientists to manage projects
effectively and efficiently.
Step 1: Problem Identification and Planning
► The first step in the data science project life cycle is to
identify the problem that needs to be solved.
► This involves understanding the business requirements
and the goals of the project.
► Once the problem has been identified, the data science
team will plan the project by determining the data sources,
the data collection process, and the analytical methods that
will be used.
Step 2: Data Collection
► The second step in the data science project life cycle is
data collection. This involves collecting the data that will
be used in the analysis. The data science team must
ensure that the data is accurate, complete, and relevant to
the problem being solved.
Step 3: Pre-processing data
► This involves cleaning and transforming the data to make it
suitable for analysis.
► The data science team will remove any duplicates, missing
values, or irrelevant data from the dataset. They will also
transform the data into a format that is suitable for
analysis.
► Large volumes of data are collected from archives, daily transactions and
intermediate records. The data is available in various formats
and forms; some data may even be available only in hard copy.
► The data is scattered across various places and servers. All of
it is extracted, converted into a single format and
then processed.
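As a small illustrative sketch (not the slides' own procedure), the cleaning and conversion described above might look like this in pandas, assuming the collected data sits in a hypothetical file data.csv with assumed target and date columns:

```python
import pandas as pd

# Hypothetical raw dataset gathered from several sources
df = pd.read_csv("data.csv")                 # assumed file name

# Remove exact duplicate records
df = df.drop_duplicates()

# Handle missing values: drop rows missing the label, fill numeric gaps
df = df.dropna(subset=["target"])            # 'target' is an assumed column
df = df.fillna(df.median(numeric_only=True))

# Convert scattered formats into one consistent representation
df["date"] = pd.to_datetime(df["date"], errors="coerce")  # assumed column
df.to_csv("clean_data.csv", index=False)     # single, processed format
```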
Step 4: Data Analysis/Exploring the Data
► Now that the data is available and ready in the required format,
the next important step is to understand it in depth.
► This understanding comes from analysis of data using various
statistical tools available.
► A data engineer plays a vital role in the analysis of data. This step is
also called Exploratory Data Analysis (EDA).
► Here the data is examined using various statistical
functions, and dependent and independent variables (features)
are identified. Careful analysis reveals which features
are important and what the spread of the data is.
Exploratory Data Analysis Tools
Some of the most common data science tools used to create an EDA
include:
•Python: An interpreted, object-oriented programming language with
dynamic semantics. Its high-level, built-in data structures, combined
with dynamic typing and dynamic binding, make it very attractive for
rapid application development, as well as for use as a scripting or
glue language to connect existing components together. Python and
EDA can be used together to identify missing values in a data set,
which is important so you can decide how to handle missing values
for machine learning.
•R: An open-source programming language and free software
environment for statistical computing and graphics supported by the
R Foundation for Statistical Computing. The R language is widely
used among statisticians in data science in developing statistical
observations and data analysis.
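For instance, the missing-value check mentioned above takes only a few lines of pandas. This is a generic sketch, assuming the dataset sits in a hypothetical file data.csv:

```python
import pandas as pd

df = pd.read_csv("data.csv")            # assumed file name

print(df.shape)                         # number of rows and columns
print(df.dtypes)                        # data type of each feature
print(df.describe())                    # spread of the numeric features
print(df.isnull().sum())                # missing values per column
print(df.corr(numeric_only=True))       # relationships between numeric features
```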
Step 5: Data Modelling
► Data modelling is an important step once the data has been analyzed
and visualized.
► The important components are retained in the dataset, and the
data is thereby further refined. The next question is how to
model the data and which tasks are suitable for modelling.
► Which task is suitable, such as classification or regression,
depends on the business value required, and within each task
many modelling approaches are available.
► The machine learning engineer applies various algorithms to
the data and generates output. While modelling the data,
the models are often first tested on dummy data similar
to the actual data.
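As a hedged sketch of this step (using scikit-learn and a built-in sample dataset in place of a project's refined data), a classification task might be modelled as follows:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Sample data stands in for the refined, analysed dataset
X, y = load_iris(return_X_y=True)

# Hold out part of the data so the model is tested on unseen samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Apply a classification algorithm and generate output
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))
```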
Step 6: Communicating Model
Results/Evaluation
► In the evaluation phase you’ll evaluate the model based on the
goals of your business.
► Then, you’ll review your work process and explain how your
model will help the business, summarize your findings, and
make any corrections.
► Finally, you’ll determine your next steps. Is your model ready
for deployment? Does it need a new iteration or another
dependency project?
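A minimal sketch of such an evaluation using standard scikit-learn metrics (the dataset and model below are stand-ins, chosen only so the example is self-contained):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Stand-in data and model so the evaluation step can be shown end to end
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Compare predictions against held-out data and summarize the findings
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```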
Step 7: Model Deployment and Maintenance
► Once the model is trained on the actual data and its parameters
are fine-tuned, the model is deployed.
► The model is then exposed to real-time data flowing into the
system, and output is generated.
► The model can be deployed as a web service or as an embedded
application in an edge or mobile application. This is a very important
step, as the model is now exposed to the real world.
► Once deployment is complete, the project team monitors the
model's performance and makes modifications as
necessary.
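As a hedged sketch of the web-service option (Flask, joblib, the model path, and the /predict endpoint are illustrative assumptions, not the only way to deploy):

```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")      # assumed path to the trained model

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body such as {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)   # real-time data arrives via HTTP
```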
What is Data Science?
► Data science is a field that studies data. It involves deriving
meaningful insights from humongous amounts of data generated
across the globe on a daily basis.
► In order to do so, it uses a variety of algorithms, scientific
methods, and processes. Data science helps you discover various
patterns from collected raw data.
► Data science is a widely popular term and is
so called because it uses data analysis, big data, and
mathematical statistics to extract insights from either
structured or unstructured data.
► This helps businesses in solving crucial problems by
coming up with efficient solutions.
What is meant by Machine Learning?
► According to Forbes, between 2013 and 2017,
patents for machine learning grew at a 34% compound
annual growth rate (CAGR).
► Machine learning applies artificial intelligence or AI to provide
computers with the ability to learn and improve without having to
be programmed explicitly.
► It is definitely one of the most interesting technologies one could
come across, and it mainly focuses on developing system
programs in such a way that they can access the necessary data and
use it to learn and improve themselves without the need for
human intervention.
► The ability to learn by themselves makes computers more similar
to humans. Learning begins with data or observations, such as
direct experience or instruction.
Machine learning (ML)
• It is the study of computer algorithms that improve
automatically through experience.
• It is seen as a subset of artificial intelligence.
• Machine learning algorithms build a mathematical model
based on sample data known as training data in order to
make predictions or decisions without explicitly being
programmed to do so
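As a toy illustration of this idea (a generic sketch, not a specific algorithm from the slides): the program below is never told the rule y = 2x; it infers that relationship from a handful of training examples and then applies it to new inputs.

```python
from sklearn.linear_model import LinearRegression

# Training data: the rule "y = 2 * x" is never written into the program
X_train = [[1], [2], [3], [4], [5]]
y_train = [2, 4, 6, 8, 10]

# The model builds a mathematical description of the relationship itself
model = LinearRegression()
model.fit(X_train, y_train)

# It can now make predictions for inputs it has never seen before
print(model.predict([[10]]))   # approximately 20
```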
Relationship between Machine Learning and
Data Science
► Data science is a field that studies data and how to extract
meaning from it, whereas machine learning is a field
devoted to understanding and building methods that utilize
data to improve performance or inform predictions.
► Machine learning is a branch of artificial intelligence.
Data science is an interdisciplinary field that brings together
abilities from different fields like machine learning, statistics, and
visualization. It enables us to draw valuable insights from large
volumes of data, supporting well-founded decision-making processes
in the areas of technology, scientific research, and business.
Machine learning, a subset of both data science and artificial
intelligence, utilizes algorithms and statistical models to process
data. This enables machines to learn and make predictions without
explicit programming.
Artificial intelligence, a broader concept, focuses on creating
machines capable of performing tasks that normally require human
intelligence, including reasoning, learning, and problem-solving.
Data science serves as the foundation
for machine learning and artificial intelligence by supplying
the necessary data for these models to learn from.
It integrates algorithms from machine learning and borrows
ideas from traditional domain expertise, statistics, and
mathematics to develop solutions
Examples and use cases to understand the interrelation between
AI, ML and Data Science
► Recommendation systems
• This is an example of AI in action. A recommendation system is a
type of algorithm that makes personalized recommendations based
on user data. Data science is involved in the collection and analysis of
user data, while machine learning is used to develop the algorithm
that powers the recommendation system.
• One use case for recommendation systems powered by AI
is Amazon’s personalized product recommendation algorithm. This
system employs machine learning techniques and algorithms to
understand user behavior information, including past transactions,
product ratings, and browsing history, in order to provide
personalized suggestions to clients.
• Similarly, Netflix also uses a recommendation system to provide
personalized movie and TV show recommendations to its users.
These personalized recommendations enhance the user’s experience
and increase the likelihood of making a purchase or continuing to use
the platform.
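This is not Amazon's or Netflix's actual algorithm, but a toy sketch of the underlying idea: measure how similar users are from a small rating matrix, then recommend items that a similar user rated highly.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows = users, columns = items; values are ratings (0 = not yet rated)
ratings = np.array([
    [5, 4, 0, 1],   # user 0
    [4, 5, 5, 1],   # user 1 (tastes similar to user 0)
    [1, 0, 2, 5],   # user 2
])

# Similarity between user 0 and every user (including itself)
sims = cosine_similarity(ratings[0:1], ratings)[0]
most_similar = int(np.argsort(sims)[-2])       # skip user 0 itself

# Recommend items user 0 hasn't rated but the similar user liked
unrated = np.where(ratings[0] == 0)[0]
recommended = [i for i in unrated if ratings[most_similar, i] >= 4]
print("Most similar user:", most_similar, "-> recommend items:", recommended)
```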
Fraud detection
• This is an example of where data science and machine learning come
into play. Fraud detection involves analyzing large amounts of data to
identify patterns and anomalies that may indicate fraudulent activity.
Machine learning algorithms are used to identify these patterns and
anomalies, while data science is involved in collecting and preparing
the data for analysis.
• For example, PayPal utilizes a combination of machine learning
methods and data science approaches to examine substantial
quantities of transaction information to pinpoint fraudulent actions.
The system can identify patterns and anomalies within the data
that could indicate fraudulent behavior, such as atypical spending
habits or suspicious IP locations. By harnessing these tools and strategies,
PayPal can avert fraudulent transactions, safeguard its users, and
preserve the trustworthiness of its digital environment.
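This is not PayPal's actual system, but a minimal sketch of the general idea: an unsupervised anomaly detector (here scikit-learn's IsolationForest, applied to synthetic transaction amounts) flagging the transactions that look atypical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic transaction amounts: mostly typical spending plus a few outliers
rng = np.random.default_rng(0)
normal = rng.normal(loc=50, scale=15, size=(500, 1))   # typical purchases
unusual = np.array([[900.0], [1200.0], [1500.0]])      # atypical amounts
transactions = np.vstack([normal, unusual])

# Fit an anomaly detector; 'contamination' is the assumed fraction of fraud
detector = IsolationForest(contamination=0.01, random_state=0)
labels = detector.fit_predict(transactions)   # -1 = anomaly, 1 = normal

print("Flagged amounts:", transactions[labels == -1].ravel())
```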
Natural language processing (NLP)
• This is an example of where all three concepts come together.
Natural language processing involves analyzing human language to
extract insights and meaning.
• Data science is used to collect and prepare the data, machine
learning is used to develop the algorithms, and AI is used to power
the natural language processing system.
• For example, chatbots are developed by using NLP techniques.
They are designed to simulate conversation with human users.
AI-powered chatbots, such as ChatGPT, use machine learning
algorithms and NLP techniques to understand natural language
queries and provide more personalized responses.
• To develop an AI-powered chatbot, developers must first collect
and prepare a large dataset of training data. This dataset is used to
train machine learning models that can understand and interpret
natural language text. Once the models have been trained, they
can be integrated into the chatbot to provide more intelligent and
accurate responses.
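This is not how a system like ChatGPT works internally, but a toy sketch of the training idea described above: a tiny intent classifier learned from a handful of hand-labelled example queries (all queries and intent names here are hypothetical).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hand-labelled training set of user queries and their intents
queries = [
    "what is my account balance", "show my balance",
    "I want to reset my password", "forgot my password",
    "talk to a human agent", "connect me to support",
]
intents = ["balance", "balance", "password", "password", "support", "support"]

# Vectorize the text, then train a classifier on the labelled examples
chatbot = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
chatbot.fit(queries, intents)

# Once trained, the model can interpret a new natural-language query
print(chatbot.predict(["please check my account balance"]))  # likely ['balance']
```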