Unit-II
Probability and Statistics
Content
• Data Exploration and Pre-processing:
• Data Objects and Attributes
• Statistical Measures
• Visualization
• Data Cleaning and Integration
• Dimensionality Reduction:
• Linear Discriminant Analysis
• Principal Component Analysis
• Transform Domain and Statistical Feature Extraction and Reduction.
DATA
• Data is a crucial component in the field of Machine Learning.
• It refers to the set of observations or measurements that can be used to train a
machine-learning model.
• Data can come in many forms, but machine learning models rely on four primary
data types.
• These include numerical data, categorical data, time series data, and text data.
Introduction to Probability and Statistics
Numerical Data
• Numerical data is any data where data points
are exact numbers.
• Statisticians may also call numerical data
quantitative data.
• This data has meaning as a measurement such
as house prices or as a count, such as a number
of residential properties in Los Angeles or how
many houses sold in the past year.
• Numerical data can be continuous or
discrete. Continuous data can assume any
value within a range, whereas discrete data
takes only distinct, separate values.
Categorical data
• Categorical data is sorted by defining characteristics. This can include
gender, social class, ethnicity, hometown, the industry you work in, or
a variety of other labels.
• When working with this data type, keep in mind that it is non-numerical,
meaning you cannot add the values together, average them, or
sort them in any meaningful numeric order.
• Categorical data is great for grouping individuals or ideas that share
similar attributes, helping your machine learning model streamline its
data analysis.
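Because categorical values cannot be averaged, they are usually summarized by counting and then encoded as numbers before modeling. The following is a minimal sketch, assuming a hypothetical `industries` column; one-hot encoding is one common choice, not the only one.

```python
from collections import Counter

# Hypothetical categorical column: the industry each person works in.
industries = ["tech", "health", "tech", "retail", "health", "tech"]

# Counting is the natural summary for categorical data (a mean is undefined).
counts = Counter(industries)
most_common_label, most_common_count = counts.most_common(1)[0]

# One-hot encoding: one 0/1 indicator per category, in a fixed (sorted) order.
categories = sorted(counts)
one_hot = [[1 if value == cat else 0 for cat in categories] for value in industries]

print(counts)
print(categories)
print(one_hot[0])  # encoding of the first value, "tech"
```

Each encoded row contains exactly one 1, so the grouping is preserved without implying any order between the categories.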
Time series data
• Time series data consists of data points that are indexed at specific
points in time.
• More often than not, this data is collected at consistent intervals.
• Learning and utilizing time series data makes it easy to compare data
from week to week, month to month, year to year, or according to
any other time-based metric you desire.
• The key difference between time series data and plain numerical data is
that every time series value is tied to a point in time, so the series has a
defined start, end, and ordering, while numerical data is simply a collection
of numbers that aren’t rooted in particular time periods.
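The month-to-month comparison described above can be sketched as follows. This is an illustration only: the `readings` values and dates are made up, and grouping by calendar month is just one possible time-based metric.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical readings, each tied to the date it was recorded.
readings = [
    (date(2024, 1, 5), 10.0),
    (date(2024, 1, 20), 14.0),
    (date(2024, 2, 3), 20.0),
    (date(2024, 2, 17), 22.0),
    (date(2024, 3, 9), 30.0),
]

# Group values by (year, month) so that whole periods can be compared.
by_month = defaultdict(list)
for day, value in readings:
    by_month[(day.year, day.month)].append(value)

monthly_mean = {period: mean(values) for period, values in sorted(by_month.items())}

# Month-over-month change in the mean value.
periods = list(monthly_mean)
changes = {
    periods[i]: monthly_mean[periods[i]] - monthly_mean[periods[i - 1]]
    for i in range(1, len(periods))
}

print(monthly_mean)
print(changes)
```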
Text data
• Text data is simply words, sentences, or paragraphs that can provide
some level of insight to your machine learning models.
• Since these words can be difficult for models to interpret on their
own, they are most often grouped together or analyzed using various
methods such as word frequency, text classification, or sentiment
analysis.
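Word frequency, the simplest of the methods listed above, can be sketched in a few lines. The sample sentence and the tokenizing pattern are assumptions made for illustration; real pipelines usually handle punctuation, stop words, and other languages more carefully.

```python
import re
from collections import Counter

# Hypothetical piece of text data.
text = "Good data makes good models. Bad data breaks models."

# Lowercase the text and split it into word tokens.
tokens = re.findall(r"[a-z']+", text.lower())

# Count how often each word occurs.
word_counts = Counter(tokens)

print(word_counts.most_common(3))
```

The resulting counts are numeric, so they can be passed to a model that could not interpret the raw words.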
Data Objects and Attributes
• Data sets are made up of data objects.
• A data object represents an entity—in a sales database, the objects may be
customers, store items, and sales;
• In a medical database, the objects may be patients;
• In a university database, the objects may be students, professors, and
courses.
• Data objects are typically described by attributes.
• Data objects can also be referred to as samples, examples, instances, data
points, or objects.
• If the data objects are stored in a database, they are data tuples.
• That is, the rows of a database correspond to the data objects, and the
columns correspond to the attributes.
What Is an Attribute?
• An attribute is a data field, representing a characteristic or feature of a data
object.
• The nouns attribute, dimension, feature, and variable are often used
interchangeably in the literature.
• The term dimension is commonly used in data warehousing. Machine learning
literature tends to use the term feature, while statisticians prefer the
term variable. Data mining and database professionals commonly use the
term attribute.
• Attributes describing a customer object can include, for
example, customer_ID, name, and address.
• Observed values for a given attribute are known as observations.
• A set of attributes used to describe a given object is called an attribute
vector (or feature vector).
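The relationship between data objects, attributes, and attribute vectors can be sketched with a small record type. The `Customer` class and its sample values are hypothetical; only the attribute names customer_ID, name, and address come from the text above.

```python
from dataclasses import astuple, dataclass, fields


@dataclass
class Customer:
    """One data object; each field is an attribute."""
    customer_id: int
    name: str
    address: str


# Hypothetical data set: each element is one data object (one row).
customers = [
    Customer(1, "Asha", "12 Lake Road"),
    Customer(2, "Ben", "7 Hill Street"),
]

# The attributes correspond to the columns of the table.
attribute_names = [f.name for f in fields(Customer)]

# The attribute vector of an object is its tuple of observed values.
attribute_vectors = [astuple(c) for c in customers]

print(attribute_names)
print(attribute_vectors[0])
```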
Types of Attributes
• The type of an attribute is determined by the set of values it can
take: nominal, binary, ordinal, or numeric.
Nominal Attributes
• Nominal means “relating to names.” The values of a nominal
attribute are symbols or names of things.
• Each value represents some kind of category, code, or state, and so
nominal attributes are also referred to as categorical.
• The values do not have any meaningful order.
• In computer science, the values are also known as enumerations.
Binary Attributes
• A binary attribute is a nominal attribute with only two categories or
states: 0 or 1, where 0 typically means that the attribute is absent,
and 1 means that it is present.
• Binary attributes are referred to as Boolean if the two states
correspond to true and false.
Ordinal Attributes
• An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude
between successive values is not known.
Numeric Attributes
• A numeric attribute is quantitative; that is, it is a measurable
quantity, represented in integer or real values. Numeric attributes can
be interval-scaled or ratio-scaled.
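The attribute type determines which summaries are meaningful. The sketch below uses made-up values for four hypothetical attributes; the pairing of each type with one summary (mode, proportion, median, mean) is a common convention rather than a rule taken from the slides.

```python
from statistics import mean, mode

# Nominal: names without order, so only the most frequent value is meaningful.
colors = ["red", "blue", "red", "green"]
most_frequent_color = mode(colors)

# Binary: 0 means absent and 1 means present, so the share of 1s is meaningful.
is_smoker = [0, 1, 0, 0]
smoker_rate = sum(is_smoker) / len(is_smoker)

# Ordinal: values have an order, but the gaps between them are unknown,
# so a median is meaningful while a mean is not.
SIZE_ORDER = ["small", "medium", "large"]
sizes = ["medium", "small", "large", "medium", "small"]
ranks = sorted(SIZE_ORDER.index(s) for s in sizes)
median_size = SIZE_ORDER[ranks[len(ranks) // 2]]  # odd count: take the middle rank

# Numeric: measurable quantities, so arithmetic such as the mean is meaningful.
temperatures = [20.5, 22.0, 19.5]
mean_temperature = mean(temperatures)

print(most_frequent_color, smoker_rate, median_size, mean_temperature)
```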
Data Exploration and Pre-processing:
• Data exploration, also known as exploratory data analysis (EDA), is a process
where users look at and understand their data with statistical and
visualization methods.
• EDA is one of the techniques used to extract the
important features and trends that machine learning and deep learning
models rely on.
• EDA has therefore become an essential step for anyone working in data
science.
• This step helps identify patterns and problems in the dataset, and helps
decide which model or algorithm to use in subsequent steps.
Objective of Exploratory Data Analysis
• The overall objective of exploratory data analysis is to obtain vital insights
and hence usually includes the following sub-objectives:
• Identifying and removing data outliers
• Identifying trends in time and space
• Uncovering patterns related to the target
• Creating hypotheses and testing them through experiments
• Identifying new sources of data
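The first sub-objective, identifying outliers, can be sketched with the 1.5 × IQR rule. This rule is a common convention and an assumption here, since the slides do not name a specific method; the `values` list is made up.

```python
from statistics import quantiles

# Hypothetical measurements with one suspicious value.
values = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]

# First and third quartiles (statistics.quantiles uses the "exclusive" method by default).
q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]
inliers = [v for v in values if lower_fence <= v <= upper_fence]

print(q1, q3, outliers)
```

Flagged points are candidates for review; whether to remove them depends on whether they are errors or genuine extreme observations.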
Steps Involved in Exploratory Data Analysis (EDA)
• An EDA is typically carried out through the following main
steps:
• Data Collection:
• Nowadays, data is generated in huge volumes and various forms belonging to every
sector of human life, like healthcare, sports, manufacturing, tourism, and so on.
• Finding all Variables and Understanding Them
• When the analysis starts, the first focus is on the available data and the information it
carries.
• This information consists of the values that the various features or characteristics take,
and understanding them helps to draw valuable insights from the data.
• Cleaning the Dataset
• The next step is to clean the data set, which may contain null values and irrelevant
information.
• These are to be removed so that data contains only those values that are relevant and
important from the target point of view.
• Identify Correlated Variables
• Finding a correlation between variables helps to know how a particular variable is
related to another.
• The correlation matrix method gives a clear picture of how different variables correlate,
which further helps in understanding vital relationships among them.
• Choosing the Right Statistical Methods
• Depending on the data, categorical or numerical, the size, type of variables, and the
purpose of analysis, different statistical tools are employed.
• Statistical formulae applied for numerical outputs give fair information, but graphical
visuals are more appealing and easier to interpret.
• Visualizing and Analyzing Results
• Once the analysis is over, the findings should be examined carefully so
that they can be interpreted properly.
• The trends in the spread of the data and the correlations between variables give good insights
for making suitable changes to the data parameters.
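Two of the steps above, cleaning the dataset and identifying correlated variables, can be sketched together. The `rows` table and its column names are hypothetical, dropping incomplete rows is only one possible cleaning strategy, and Pearson correlation is assumed as the correlation measure.

```python
from math import sqrt

# Hypothetical housing table; None marks a null value.
rows = [
    {"area": 50.0, "rooms": 2, "price": 150.0},
    {"area": 80.0, "rooms": 3, "price": 240.0},
    {"area": None, "rooms": 3, "price": 200.0},
    {"area": 120.0, "rooms": 4, "price": 360.0},
    {"area": 100.0, "rooms": 4, "price": None},
]

# Cleaning: keep only the rows that have no null values.
clean = [r for r in rows if all(v is not None for v in r.values())]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Correlation matrix: one coefficient for every pair of columns.
columns = ["area", "rooms", "price"]
corr = {
    (a, b): pearson([r[a] for r in clean], [r[b] for r in clean])
    for a in columns
    for b in columns
}

for a in columns:
    print(a.ljust(6), [round(corr[(a, b)], 3) for b in columns])
```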
Univariate Analysis
• In Univariate Analysis, you analyze data of
just one variable.
• A variable in your dataset refers to a single
feature/ column.
• You can do this either with graphical or
non-graphical means by finding specific
mathematical values in the data. Some
visual methods include:
• Histograms: bar plots in which the frequency
of values in each interval (bin) is shown as a rectangular bar.
• Box plots: the distribution is summarized as a box
spanning the first to third quartiles, with a line at the median.
• Let's make a histogram out of our SalePrice
column.
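The original SalePrice data is not included in these notes, so the following sketch builds a text histogram from made-up prices. The `sale_prices` values, the bin width of 100, and the `histogram` helper are all assumptions for illustration.

```python
# Hypothetical SalePrice column (in thousands); the real data is not shown in the slides.
sale_prices = [120, 135, 150, 155, 160, 180, 210, 250, 320, 480]


def histogram(values, bin_width):
    """Count how many values fall into each bin of the given width.

    Bins are keyed by their left edge; bins with no values are omitted.
    """
    start = (min(values) // bin_width) * bin_width
    counts = {}
    for v in values:
        left = start + ((v - start) // bin_width) * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))


bins = histogram(sale_prices, bin_width=100)

# Draw each bin as a bar whose length is the frequency.
for left, count in bins.items():
    print(f"{left:>4}-{left + 100:<4} | {'#' * count}")
```

With these values most prices fall in the lowest bin and a few stretch far to the right, which is the kind of skew a histogram makes visible at a glance.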
Bivariate Analysis
• Here, you use two variables and compare
them.
• This way, you can see how one feature
is related to the other. It is done with scatter
plots, which plot individual data points, or
correlation matrices, which show the
strength of each correlation as a color hue.
• You can also use box plots.
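A box plot used for bivariate analysis compares a numeric variable across the groups of a categorical one. The sketch below computes the numbers such a plot would display; the `observations` data and the choice of minimum, median, and maximum as the summary are assumptions.

```python
from collections import defaultdict
from statistics import median

# Hypothetical pairs of (neighborhood, sale price in thousands).
observations = [
    ("north", 210), ("north", 250), ("north", 230),
    ("south", 150), ("south", 170), ("south", 130), ("south", 160),
]

# Split the numeric variable by the value of the categorical variable.
groups = defaultdict(list)
for neighborhood, price in observations:
    groups[neighborhood].append(price)

# Per-group summary: the figures a side-by-side box plot would show.
summary = {
    name: {"min": min(prices), "median": median(prices), "max": max(prices)}
    for name, prices in groups.items()
}

for name, stats in summary.items():
    print(name, stats)
```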

