UNIT 1
INTRODUCTION TO DATA SCIENCE
- Prof. Apurva Bijawe
What is Data Science?
► Data science is the domain of study that deals with vast
volumes of data using modern tools and techniques to find
unseen patterns, derive meaningful information, and make
business decisions.
► Data science uses complex machine learning algorithms to
build predictive models.
► The data used for analysis can come from many different
sources and be presented in various formats.
What is Big Data?
► Big data is a combination of structured, semi-structured and
unstructured data collected by organizations that can be mined
for information and used in machine learning projects, predictive
modeling and other advanced analytics applications.
► Big data is often characterized by the three V’s:
• the large volume of data in many environments;
• the wide variety of data types frequently stored in big data
systems and
• the velocity at which much of the data is generated, collected
and processed.
Different types of Data
► Qualitative Data Type
• Nominal
• Ordinal
► Quantitative Data Type
• Discrete
• Continuous
Introduction
► We are entering the digital era, in which we produce enormous
amounts of data. For instance, a company like Flipkart generates
more than 2 TB of data on a daily basis.
► Because this data is so important to our lives, it becomes
essential to store and process it properly, without errors.
► When dealing with datasets, the category of data plays an
important role in determining which preprocessing strategy will
work for a particular set, and which type of statistical analysis
should be applied for the best results.
Qualitative Data Type
► Qualitative or Categorical Data describes the object under
consideration using a finite set of discrete classes.
► This means that this type of data can’t easily be counted or measured
using numbers, and it is therefore divided into categories.
► The gender of a person (male, female, or others) is a good example of
this data type.
► These values are usually extracted from audio, image, or text sources.
► Another example is a smartphone listing that provides
information about the current rating, the color of the phone, the category
of the phone, and so on. All this information can be categorized as
Qualitative data.
► There are two subcategories under this:
1. Nominal
► These are the set of values that don’t possess a natural ordering.
► Let’s understand this with some examples. The color of a smartphone
can be considered as a nominal data type as we can’t compare one
color with others.
► It is not possible to state that ‘Red’ is greater than ‘Blue’. The gender
of a person is another example, since we can’t rank male,
female, or others. Mobile phone categories, whether midrange,
budget segment, or premium, are also nominal data.
► Nominal data types in statistics are not quantifiable and cannot be
measured in numerical units. Nominal statistical data is
valuable when conducting qualitative research, as it gives
subjects the freedom to express their opinions.
2. Ordinal
► These types of values have a natural ordering while maintaining their
class of values.
► If we consider the sizes offered by a clothing brand, we can easily sort
them according to their tags in the order small < medium < large.
► The grading system used to mark candidates in a test can also be
considered an ordinal data type, where an A+ is definitely better than a B
grade.
These categories help us decide which encoding strategy can
be applied to which type of data.
Encoding qualitative data is important because machine
learning models can’t handle these values directly; they need to
be converted to numerical types, since the models are mathematical
in nature.
For the nominal data type, where there is no comparison among the
categories, one-hot encoding can be applied, which is similar to
binary coding and works well when the number of categories is small.
For the ordinal data type, label encoding can be applied, which is a
form of integer encoding.
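For illustration, here is a minimal sketch of both strategies. It assumes a tiny pandas DataFrame with hypothetical columns color (nominal) and size (ordinal); pd.get_dummies performs one-hot encoding, and scikit-learn's OrdinalEncoder performs the integer encoding.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical smartphone data: 'color' is nominal, 'size' is ordinal
df = pd.DataFrame({
    "color": ["Red", "Blue", "Red", "Black"],
    "size": ["small", "large", "medium", "small"],
})

# Nominal feature: one-hot encoding (no order among the categories)
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal feature: integer encoding that preserves small < medium < large
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_encoded"] = encoder.fit_transform(df[["size"]]).ravel()

print(one_hot)
print(df[["size", "size_encoded"]])
```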
Difference Between Nominal and Ordinal Data

| Aspect | Nominal Data | Ordinal Data |
| --- | --- | --- |
| Definition | Categorizes data into distinct classes or categories without any inherent order or ranking. | Categorizes data into ordered or ranked categories with meaningful differences between them. |
| Examples | Colors, gender, types of animals | Education levels, customer satisfaction ratings |
| Mathematical operations | No meaningful mathematical operations can be performed (e.g., averaging categories). | Limited mathematical operations can be performed, such as determining the mode or median. |
| Order/ranking | No natural or meaningful order exists. | Categories have a specific order or ranking, but the magnitude of the differences between ranks may not be uniform. |
| Central tendency | Mode (most frequent category). | Mode and median (middle category); the mean is not typically used because the intervals between ranks are not uniform. |
| Example use case | Classifying objects, grouping data. | Rating scales, survey responses, educational levels. |
Quantitative Data Type
► This data type tries to quantify things, and it does so using
numerical values that make it countable in nature.
► The price of a smartphone, the discount offered, the number of ratings on
a product, the processor frequency of a smartphone, or the RAM of
that particular phone: all of these fall under the category of
Quantitative data types.
► The key point is that a feature can take an infinite number of
values.
► For instance, the price of a smartphone can vary from some amount x to
any value, and it can be broken down further into fractional
values.
► The two subcategories which describe them clearly are:
• Discrete
• Continuous
1. Discrete
► Numerical values that are integers or whole numbers are
placed under this category.
► The number of speakers in a phone, the number of cameras, the number
of cores in the processor, and the number of SIMs supported are all
examples of the discrete data type.
► Discrete data in statistics cannot be measured – it
can only be counted, as the objects included in discrete data
have fixed values.
► The value may be written in decimal notation, but it has to be
whole. Discrete data is often presented in charts,
including bar charts, pie charts, and tally charts.
2. Continuous
► Fractional numbers are considered continuous values.
► Examples include the operating frequency of the processor,
the Android version of the phone, the Wi-Fi frequency,
the temperature of the cores, and so on.
► Unlike discrete data, which has whole,
fixed values, continuous data can be broken down into smaller
pieces and can take any value.
► For example, varying quantities such as temperature and the
weight of a person are continuous values.
Continuous statistical data is typically represented using a
graph whose line reflects value fluctuations, with highs and
lows over a period of time.
Difference between Discrete Data and Continuous Data

| Aspect | Discrete Data | Continuous Data |
| --- | --- | --- |
| Definition | Consists of distinct, separate values. | Can take any value within a given range. |
| Examples | Number of students in a class, coin toss outcomes (1, 2, 3), customer count. | Height, weight, temperature, time. |
| Nature | Usually involves whole numbers or counts. | Involves any value along a continuous spectrum. |
| Gaps in values | Gaps between values are common and meaningful. | Values can be infinitely divided without gaps. |
| Measurement | Often measured using integers. | Measured with decimal numbers or fractions. |
| Graphical representation | Typically represented with bar charts or histograms. | Represented with line graphs or smooth curves. |
| Mathematical operations | Typically involves counting or summation. | Involves arithmetic operations, including fractions and decimals. |
| Probability distribution | Typically represented using probability mass functions. | Typically represented using probability density functions. |
| Example use case | Counting occurrences, tracking integers. | Measuring quantities and analyzing measurements. |
Data Science Process/Life-Cycle
► The data science life cycle revolves around the use of
machine learning and different analytical strategies to
produce insights and predictions from data in
order to achieve a business objective.
► The data science project life cycle is a
methodology that outlines the stages of a data
science project, from planning to deployment.
► The project life cycle provides a framework that
helps data scientists to manage projects
effectively and efficiently.
Step 1: Problem Identification and Planning
► The first step in the data science project life cycle is to
identify the problem that needs to be solved.
► This involves understanding the business requirements
and the goals of the project.
► Once the problem has been identified, the data science
team will plan the project by determining the data sources,
the data collection process, and the analytical methods that
will be used.
Step 2: Data Collection
► The second step in the data science project life cycle is
data collection. This involves collecting the data that will
be used in the analysis. The data science team must
ensure that the data is accurate, complete, and relevant to
the problem being solved.
Step 3: Pre-processing data
► This involves cleaning and transforming the data to make it
suitable for analysis.
► The data science team will remove any duplicates, missing
values, or irrelevant data from the dataset. They will also
transform the data into a format that is suitable for
analysis.
► Large volumes of data are collected from archives, daily transactions and
intermediate records. The data is available in various formats
and forms; some data may even be available only in hard copy.
► The data is scattered across various places and servers. All of
it is extracted, converted into a single format and
then processed.
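As a small illustrative sketch (not the slides' own procedure), the cleaning and conversion described above might look like this in pandas, assuming the collected data sits in a hypothetical file data.csv with assumed target and date columns:

```python
import pandas as pd

# Hypothetical raw dataset gathered from several sources
df = pd.read_csv("data.csv")                 # assumed file name

# Remove exact duplicate records
df = df.drop_duplicates()

# Handle missing values: drop rows missing the label, fill numeric gaps
df = df.dropna(subset=["target"])            # 'target' is an assumed column
df = df.fillna(df.median(numeric_only=True))

# Convert scattered formats into one consistent representation
df["date"] = pd.to_datetime(df["date"], errors="coerce")  # assumed column
df.to_csv("clean_data.csv", index=False)     # single, processed format
```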
Step 4: Data Analysis/Exploring the Data
► Now that the data is available and ready in the required format,
the next important step is to understand it in depth.
► This understanding comes from analysis of data using various
statistical tools available.
► A data engineer plays a vital role in the analysis of data. This step is
also called Exploratory Data Analysis (EDA).
► Here the data is examined using various statistical
functions, and dependent and independent variables (features)
are identified. Careful analysis reveals which features
are important and what the spread of the data is.
Exploratory Data Analysis Tools
Some of the most common data science tools used to create an EDA
include:
•Python: An interpreted, object-oriented programming language with
dynamic semantics. Its high-level, built-in data structures, combined
with dynamic typing and dynamic binding, make it very attractive for
rapid application development, as well as for use as a scripting or
glue language to connect existing components together. Python and
EDA can be used together to identify missing values in a data set,
which is important so you can decide how to handle missing values
for machine learning.
•R: An open-source programming language and free software
environment for statistical computing and graphics supported by the
R Foundation for Statistical Computing. The R language is widely
used among statisticians in data science in developing statistical
observations and data analysis.
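For instance, the missing-value check mentioned above takes only a few lines of pandas. This is a generic sketch, assuming the dataset sits in a hypothetical file data.csv:

```python
import pandas as pd

df = pd.read_csv("data.csv")            # assumed file name

print(df.shape)                         # number of rows and columns
print(df.dtypes)                        # data type of each feature
print(df.describe())                    # spread of the numeric features
print(df.isnull().sum())                # missing values per column
print(df.corr(numeric_only=True))       # relationships between numeric features
```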
Step 5: Data Modelling
► Data modelling is an important step once the data has been analyzed
and visualized.
► The important components are retained in the dataset, and the
data is thereby further refined. The next question is how to
model the data and which tasks are suitable for modelling.
► Which task is suitable, such as classification or regression,
depends on the business value required, and within each task
many modelling approaches are available.
► The machine learning engineer applies various algorithms to
the data and generates output. While modelling the data,
the models are often first tested on dummy data similar
to the actual data.
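As a hedged sketch of this step (using scikit-learn and a built-in sample dataset in place of a project's refined data), a classification task might be modelled as follows:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Sample data stands in for the refined, analysed dataset
X, y = load_iris(return_X_y=True)

# Hold out part of the data so the model is tested on unseen samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Apply a classification algorithm and generate output
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))
```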
Step 6: Communicating Model
Results/Evaluation
► In the evaluation phase you’ll evaluate the model based on the
goals of your business.
► Then, you’ll review your work process and explain how your
model will help the business, summarize your findings, and
make any corrections.
► Finally, you’ll determine your next steps. Is your model ready
for deployment? Does it need a new iteration or another
dependency project?
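A minimal sketch of such an evaluation using standard scikit-learn metrics (the dataset and model below are stand-ins, chosen only so the example is self-contained):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Stand-in data and model so the evaluation step can be shown end to end
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Compare predictions against held-out data and summarize the findings
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```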
Step 7: Model Deployment and Maintenance
► Once the model is trained on the actual data and its parameters
are fine-tuned, the model is deployed.
► The model is then exposed to real-time data flowing into the
system, and output is generated.
► The model can be deployed as a web service or as an embedded
application in an edge or mobile application. This is a very important
step, as the model is now exposed to the real world.
► Once deployment is complete, the project team monitors the
model's performance and makes modifications as
necessary.
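As a hedged sketch of the web-service option (Flask, joblib, the model path, and the /predict endpoint are illustrative assumptions, not the only way to deploy):

```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")      # assumed path to the trained model

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body such as {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)   # real-time data arrives via HTTP
```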
What is Data Science?
► Data science is a field that studies data. It involves deriving
meaningful insights from humongous amounts of data generated
across the globe on a daily basis.
► In order to do so, it uses a variety of algorithms, scientific
methods, and processes. Data science helps you discover various
patterns from collected raw data.
► Data science is a widely popular term and is
so called because it uses data analysis, big data, and
mathematical statistics to extract insights from either
structured or unstructured data.
► This helps businesses in solving crucial problems by
coming up with efficient solutions.
What is meant by Machine Learning?
► According to Forbes, between 2013 and 2017,
patents for machine learning grew at a 34% compound
annual growth rate (CAGR).
► Machine learning applies artificial intelligence or AI to provide
computers with the ability to learn and improve without having to
be programmed explicitly.
► It is definitely one of the most interesting technologies one could
come across, and it mainly focuses on developing system
programs in such a way that they can access the necessary data and
use it to learn and improve themselves without the need for
human intervention.
► The ability to learn by themselves makes computers more similar
to humans. Learning begins with data or observations, such as
direct experience or instruction.
Machine learning (ML)
• It is the study of computer algorithms that improve
automatically through experience.
• It is seen as a subset of artificial intelligence.
• Machine learning algorithms build a mathematical model
based on sample data known as training data in order to
make predictions or decisions without explicitly being
programmed to do so
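As a toy illustration of this idea (a generic sketch, not a specific algorithm from the slides): the program below is never told the rule y = 2x; it infers that relationship from a handful of training examples and then applies it to new inputs.

```python
from sklearn.linear_model import LinearRegression

# Training data: the rule "y = 2 * x" is never written into the program
X_train = [[1], [2], [3], [4], [5]]
y_train = [2, 4, 6, 8, 10]

# The model builds a mathematical description of the relationship itself
model = LinearRegression()
model.fit(X_train, y_train)

# It can now make predictions for inputs it has never seen before
print(model.predict([[10]]))   # approximately 20
```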
Relationship between Machine Learning and
Data Science
► Data science is a field that studies data and how to extract
meaning from it, whereas machine learning is a field
devoted to understanding and building methods that utilize
data to improve performance or inform predictions.
► Machine learning is a branch of artificial intelligence.
Data science is an interdisciplinary field that brings together
abilities from different fields like machine learning, statistics, and
visualization. It enables us to draw valuable insights from large
volumes of data, supporting well-founded decision-making processes
in the areas of technology, scientific research, and business.
Machine learning, a subset of both data science and artificial
intelligence, utilizes algorithms and statistical models to process
data. This enables machines to learn and make predictions without
explicit programming.
Artificial intelligence, a broader concept, focuses on creating
machines capable of performing tasks that normally require human
intelligence, including reasoning, learning, and problem-solving.
Data science serves as the foundation
for machine learning and artificial intelligence by supplying
the necessary data for these models to learn from.
It integrates algorithms from machine learning and borrows
ideas from traditional domain expertise, statistics, and
mathematics to develop solutions
Examples and use cases to understand the interrelation between
AI, ML and Data Science
► Recommendation systems
• This is an example of AI in action. A recommendation system is a
type of algorithm that makes personalized recommendations based
on user data. Data science is involved in the collection and analysis of
user data, while machine learning is used to develop the algorithm
that powers the recommendation system.
• One use case for recommendation systems powered by AI
is Amazon’s personalized product recommendation algorithm. This
system employs machine learning techniques and algorithms to
understand user behavior information, including past transactions,
product ratings, and browsing history, in order to provide
personalized suggestions to clients.
• Similarly, Netflix also uses a recommendation system to provide
personalized movie and TV show recommendations to its users.
These personalized recommendations enhance the user’s experience
and increase the likelihood of making a purchase or continuing to use
the platform.
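This is not Amazon's or Netflix's actual algorithm, but a toy sketch of the underlying idea: measure how similar users are from a small rating matrix, then recommend items that a similar user rated highly.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows = users, columns = items; values are ratings (0 = not yet rated)
ratings = np.array([
    [5, 4, 0, 1],   # user 0
    [4, 5, 5, 1],   # user 1 (tastes similar to user 0)
    [1, 0, 2, 5],   # user 2
])

# Similarity between user 0 and every user (including itself)
sims = cosine_similarity(ratings[0:1], ratings)[0]
most_similar = int(np.argsort(sims)[-2])       # skip user 0 itself

# Recommend items user 0 hasn't rated but the similar user liked
unrated = np.where(ratings[0] == 0)[0]
recommended = [i for i in unrated if ratings[most_similar, i] >= 4]
print("Most similar user:", most_similar, "-> recommend items:", recommended)
```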
Fraud detection
• This is an example of where data science and machine learning come
into play. Fraud detection involves analyzing large amounts of data to
identify patterns and anomalies that may indicate fraudulent activity.
Machine learning algorithms are used to identify these patterns and
anomalies, while data science is involved in collecting and preparing
the data for analysis.
• For example, PayPal utilizes a combination of machine learning
methods and data science approaches to examine substantial
quantities of transaction information to pinpoint fraudulent actions.
The system can identify patterns and anomalies within the data
that could indicate fraudulent behavior, such as atypical spending
habits or suspicious IP locations. By harnessing these tools and strategies,
PayPal can avert fraudulent transactions, safeguard its users, and
preserve the trustworthiness of its digital environment.
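This is not PayPal's actual system, but a minimal sketch of the general idea: an unsupervised anomaly detector (here scikit-learn's IsolationForest, applied to synthetic transaction amounts) flagging the transactions that look atypical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic transaction amounts: mostly typical spending plus a few outliers
rng = np.random.default_rng(0)
normal = rng.normal(loc=50, scale=15, size=(500, 1))   # typical purchases
unusual = np.array([[900.0], [1200.0], [1500.0]])      # atypical amounts
transactions = np.vstack([normal, unusual])

# Fit an anomaly detector; 'contamination' is the assumed fraction of fraud
detector = IsolationForest(contamination=0.01, random_state=0)
labels = detector.fit_predict(transactions)   # -1 = anomaly, 1 = normal

print("Flagged amounts:", transactions[labels == -1].ravel())
```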
Natural language processing (NLP)
• This is an example of where all three concepts come together.
Natural language processing involves analyzing human language to
extract insights and meaning.
• Data science is used to collect and prepare the data, machine
learning is used to develop the algorithms, and AI is used to power
the natural language processing system.
• For example, chatbots are developed by using NLP techniques.
They are designed to simulate conversation with human users.
AI-powered chatbots, such as ChatGPT, use machine learning
algorithms and NLP techniques to understand natural language
queries and provide more personalized responses.
• To develop an AI-powered chatbot, developers must first collect
and prepare a large dataset of training data. This dataset is used to
train machine learning models that can understand and interpret
natural language text. Once the models have been trained, they
can be integrated into the chatbot to provide more intelligent and
accurate responses.
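This is not how a system like ChatGPT works internally, but a toy sketch of the training idea described above: a tiny intent classifier learned from a handful of hand-labelled example queries (all queries and intent names here are hypothetical).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hand-labelled training set of user queries and their intents
queries = [
    "what is my account balance", "show my balance",
    "I want to reset my password", "forgot my password",
    "talk to a human agent", "connect me to support",
]
intents = ["balance", "balance", "password", "password", "support", "support"]

# Vectorize the text, then train a classifier on the labelled examples
chatbot = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
chatbot.fit(queries, intents)

# Once trained, the model can interpret a new natural-language query
print(chatbot.predict(["please check my account balance"]))  # likely ['balance']
```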