Introduction to Probability and Statistics

Unit-II
Probability and Statistics

Content
• Data Exploration and Pre-processing:
• Data Objects and Attributes
• Statistical Measures
• Visualization
• Data Cleaning, and Integration
• Dimensionality Reduction:
• Linear Discriminant Analysis
• Principal Component Analysis
• Transform Domain and Statistical Feature Extraction and Reduction.

DATA
• Data is a crucial component in the field of Machine Learning.
• It refers to the set of observations or measurements that can be used to train a
machine-learning model.
• Data can come in many forms, but machine learning models rely on four primary
data types.
• These include numerical data, categorical data, time series data, and text data.

Numerical Data
• Numerical data is any data where data points
are exact numbers.
• Statisticians may also call numerical data
quantitative data.
• This data has meaning as a measurement, such
as house prices, or as a count, such as the number
of residential properties in Los Angeles or the
number of houses sold in the past year.
• Numerical data can be continuous or discrete.
Continuous data can assume any value within a
range, whereas discrete data takes only distinct,
separate values.

Categorical data
• Categorical data is sorted by defining characteristics. This can include
gender, social class, ethnicity, hometown, the industry you work in, or
a variety of other labels.
• When working with this data type, keep in mind that it is non-numerical,
meaning its values cannot be added together or averaged, and nominal
categories have no inherent order to sort by.
• Categorical data is great for grouping individuals or ideas that share
similar attributes, helping your machine learning model streamline its
data analysis.
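
A minimal sketch of what grouping by a categorical attribute can look like, using only the Python standard library. The records and the names `records`, `category_counts`, and `mean_salary` are invented for illustration and are not from the slides.

```python
from collections import Counter, defaultdict

# Hypothetical (industry, salary) records, invented for illustration.
records = [
    ("retail", 42000),
    ("tech", 85000),
    ("retail", 39000),
    ("health", 61000),
    ("tech", 91000),
]

# Categories cannot be averaged, but they can be counted.
category_counts = Counter(industry for industry, _ in records)

# They can also be used to group a numeric attribute.
salaries_by_industry = defaultdict(list)
for industry, salary in records:
    salaries_by_industry[industry].append(salary)

mean_salary = {
    industry: sum(values) / len(values)
    for industry, values in salaries_by_industry.items()
}

print(category_counts)
print(mean_salary)
```

The category labels themselves are never added or averaged; only the numeric values inside each group are.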

Time series data
• Time series data consists of data points that are indexed at specific
points in time.
• More often than not, this data is collected at consistent intervals.
• Time series data makes it easy to compare values from week to week,
month to month, year to year, or according to any other time-based
interval.
• The key difference between time series data and plain numerical data is
that time series data is ordered in time and spans a defined period, with
each value tied to a specific point in time, while plain numerical data is
simply a collection of numbers that aren't rooted in particular time periods.
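
The week-to-week comparison described above can be sketched as follows. The daily sales figures are made up, and grouping by ISO week number is just one possible choice of interval.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical daily sales for 14 consecutive days (2024-01-01 is a Monday).
start = date(2024, 1, 1)
daily_sales = [(start + timedelta(days=i), 100 + 10 * i) for i in range(14)]

# Each value is tied to a date, so values can be grouped by calendar week.
weekly_totals = defaultdict(int)
for day, amount in daily_sales:
    iso_year, iso_week, _ = day.isocalendar()
    weekly_totals[(iso_year, iso_week)] += amount

for week, total in sorted(weekly_totals.items()):
    print(week, total)
```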

Text data
• Text data is simply words, sentences, or paragraphs that can provide
some level of insight to your machine learning models.
• Since these words can be difficult for models to interpret on their
own, they are most often grouped together or analyzed using various
methods such as word frequency, text classification, or sentiment
analysis.
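
As a rough illustration of the word-frequency method mentioned above, the sketch below counts words in a short made-up sentence. The tokenization rule (lowercase letters and apostrophes) is a simplifying assumption; real text pipelines usually need more care.

```python
import re
from collections import Counter

# Hypothetical input text, invented for illustration.
text = "The model reads the data. The data trains the model."

# Lowercase the text and split it into simple word tokens.
words = re.findall(r"[a-z']+", text.lower())

# Count how often each word occurs.
word_counts = Counter(words)
print(word_counts.most_common(3))
```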

Data Objects and Attributes
• Data sets are made up of data objects.
• A data object represents an entity. In a sales database, the objects may be
customers, store items, and sales.
• In a medical database, the objects may be patients.
• In a university database, the objects may be students, professors, and
courses.
• Data objects are typically described by attributes.
• Data objects can also be referred to as samples, examples, instances, data
points, or objects.
• If the data objects are stored in a database, they are data tuples.
• That is, the rows of a database correspond to the data objects, and the
columns correspond to the attributes.

What Is an Attribute?
• An attribute is a data field, representing a characteristic or feature of a data
object.
• The nouns attribute, dimension, feature, and variable are often used
interchangeably in the literature.
• The term dimension is commonly used in data warehousing. Machine learning
literature tends to use the term feature, while statisticians prefer the
term variable. Data mining and database professionals commonly use the
term attribute.
• Attributes describing a customer object can include, for
example, customer_ID, name, and address.
• Observed values for a given attribute are known as observations.
• A set of attributes used to describe a given object is called an attribute
vector (or feature vector).
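
One way to picture data objects and attributes in code is a small record type, sketched below. The `Customer` class reuses the customer_ID, name, and address attributes from the example above; the two customers themselves are invented.

```python
from dataclasses import astuple, dataclass, fields


@dataclass
class Customer:
    # Each field is one attribute (a column in a database table).
    customer_id: int
    name: str
    address: str


# Each instance is one data object (a row, or tuple, in a database table).
customers = [
    Customer(1, "Asha", "12 Hill Road"),
    Customer(2, "Ravi", "7 Lake View"),
]

attribute_names = [f.name for f in fields(Customer)]

# The attribute vector of an object is its values taken in attribute order.
attribute_vectors = [astuple(c) for c in customers]

print(attribute_names)
print(attribute_vectors)
```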

Type of Attributes
• The type of an attribute is determined by the set of possible values—
nominal, binary, ordinal, or numeric.
Nominal Attributes
• Nominal means “relating to names.” The values of a nominal
attribute are symbols or names of things.
• Each value represents some kind of category, code, or state, and so
nominal attributes are also referred to as categorical.
• The values do not have any meaningful order.
• In computer science, the values are also known as enumerations.

Binary Attributes
• A binary attribute is a nominal attribute with only two categories or
states: 0 or 1, where 0 typically means that the attribute is absent,
and 1 means that it is present.
• Binary attributes are referred to as Boolean if the two states
correspond to true and false.

Ordinal Attributes
• An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude
between successive values is not known.

Numeric Attributes
• A numeric attribute is quantitative; that is, it is a measurable
quantity, represented in integer or real values. Numeric attributes can
be interval-scaled or ratio-scaled.
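
The four attribute types can be contrasted in a short sketch. The `Color` and `Size` types and all values are hypothetical, and the integer codes given to `Size` are only an encoding of rank: the gaps between them carry no meaning.

```python
from enum import Enum, IntEnum


class Color(Enum):
    # Nominal: names only, with no meaningful order.
    RED = "red"
    GREEN = "green"


class Size(IntEnum):
    # Ordinal: the order is meaningful, the distance between values is not.
    SMALL = 1
    MEDIUM = 2
    LARGE = 3


is_smoker = True          # Binary: one of two states (Boolean here).
temperature_c = 36.6      # Numeric: a measurable quantity.

# Ordinal values can be ranked.
sorted_sizes = sorted([Size.LARGE, Size.SMALL, Size.MEDIUM])

# Nominal values support equality checks but not ordering.
try:
    Color.RED < Color.GREEN
    nominal_is_orderable = True
except TypeError:
    nominal_is_orderable = False

print(sorted_sizes, nominal_is_orderable)
```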

Data Exploration and Pre-processing:
• Data exploration, also known as exploratory data analysis (EDA), is a process
where users look at and understand their data with statistical and
visualization methods.
• Exploratory Data Analysis (EDA) is one of the techniques used for extracting
vital features and trends used by machine learning and deep learning
models in Data Science.
• Thus, EDA has become an important milestone for anyone working in data
science.
• This step helps identify patterns and problems in the dataset, as well as
decide which model or algorithm to use in subsequent steps.

Objective of Exploratory Data Analysis
• The overall objective of exploratory data analysis is to obtain vital insights
and hence usually includes the following sub-objectives:
• Identifying and removing data outliers
• Identifying trends in time and space
• Uncovering patterns related to the target
• Creating hypotheses and testing them through experiments
• Identifying new sources of data
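
For the first sub-objective, one common convention (not named in the slides) is the 1.5 × IQR rule: values far outside the interquartile range are flagged as outliers. The sketch below applies it to made-up numbers.

```python
import statistics

# Hypothetical measurements; 95 is deliberately far from the rest.
values = [12, 14, 15, 15, 16, 17, 18, 19, 95]

# Quartiles split the sorted data into four parts.
q1, _median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Assumed rule: flag anything beyond 1.5 * IQR from the quartiles.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < low or v > high]
cleaned = [v for v in values if low <= v <= high]

print(outliers)
print(cleaned)
```

Whether a flagged value should actually be removed depends on the data; an outlier can be a recording error or a genuine extreme observation.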

Steps Involved in Exploratory Data Analysis (EDA)
• An EDA typically proceeds through the following main steps:
• Data Collection:
• Nowadays, data is generated in huge volumes and various forms belonging to every
sector of human life, like healthcare, sports, manufacturing, tourism, and so on.
• Finding all Variables and Understanding Them
• When the analysis starts, the first focus is on the available data and the
information it carries.
• Each variable records the varying values of some feature or characteristic;
understanding them is what yields valuable insights.
• Cleaning the Dataset
• The next step is to clean the dataset, which may contain null values and
irrelevant information.
• These are removed so that the data contains only those values that are
relevant and important from the target point of view.

• Identify Correlated Variables
• Finding a correlation between variables helps to know how a particular variable is
related to another.
• The correlation matrix method gives a clear picture of how different variables correlate,
which further helps in understanding vital relationships among them.
• Choosing the Right Statistical Methods
• Depending on the data, categorical or numerical, the size, type of variables, and the
purpose of analysis, different statistical tools are employed.
• Statistical formulae applied for numerical outputs give fair information, but graphical
visuals are more appealing and easier to interpret.
• Visualizing and Analyzing Results
• Once the analysis is over, the findings are to be observed cautiously and carefully so
that proper interpretation can be made.
• The trends in the spread of data and correlation between variables give good insights
for making suitable changes in the data parameters.
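
The cleaning step above can be sketched with plain Python. The rows and column names are invented, `None` stands in for a null value, and dropping incomplete rows is only one possible strategy (filling in missing values is another).

```python
# Hypothetical dataset: each dict is one row, None marks a missing value.
rows = [
    {"area": 120.0, "price": 250000.0},
    {"area": None, "price": 180000.0},
    {"area": 95.0, "price": None},
    {"area": 150.0, "price": 310000.0},
]
required = ("area", "price")

# Count the missing values per column before deciding how to clean.
missing_per_column = {
    col: sum(row.get(col) is None for row in rows) for col in required
}

# Keep only the rows where every required column has a value.
clean_rows = [
    row for row in rows if all(row.get(col) is not None for col in required)
]

print(missing_per_column)
print(clean_rows)
```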

Univariate Analysis
• In Univariate Analysis, you analyze data of
just one variable.
• A variable in your dataset refers to a single
feature/ column.
• You can do this either graphically or
non-graphically, the latter by computing
summary values such as the mean, median,
and range. Some visual methods include:
• Histograms: Bar plots in which the frequency
of values in each range (bin) is represented by
a rectangular bar.
• Box plots: The distribution is summarized as
a box spanning the first to third quartiles, with
a line at the median and whiskers for the
remaining spread.
• Let's make a histogram out of our SalePrice
column.
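
The dataset behind the SalePrice column is not included here, so the sketch below uses a handful of made-up prices to show the idea: values are counted into fixed-width bins and printed as a text histogram, followed by the median that a box plot would mark. The bin width of 50,000 is an arbitrary assumption.

```python
import statistics
from collections import Counter

# Hypothetical SalePrice values, invented for illustration.
sale_price = [129000, 181500, 223500, 140000, 250000,
              143000, 307000, 200000, 129900, 118000]

# Assign each price to the left edge of its bin and count per bin.
bin_width = 50000
bins = Counter((price // bin_width) * bin_width for price in sale_price)

# Print one row of '#' marks per bin, lowest bin first.
for left in sorted(bins):
    print(f"{left:>7} to {left + bin_width - 1:<7} {'#' * bins[left]}")

# Non-graphical summary: the median is the line inside a box plot.
median_price = statistics.median(sale_price)
print("median:", median_price)
```

With a plotting library available, the same column would normally be passed to its histogram function instead of printing text bars.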

Bivariate Analysis
• Here, you use two variables and compare
them.
• This way, you can see how one feature
relates to the other. It is done with scatter
plots, which plot individual data points, or
with correlation matrices, which show the
strength of each pairwise correlation as a
color hue.
• You can also use boxplots.
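
Each cell of a correlation matrix is a correlation coefficient between two variables. A minimal sketch of the Pearson coefficient is shown below; the `pearson` helper and the area and price values are illustrative assumptions, not part of the slides.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical paired observations: floor area and price of five houses.
area = [50, 70, 90, 110, 130]
price = [150, 200, 260, 310, 360]

r = pearson(area, price)
print(round(r, 3))
```

A value near 1 or -1 indicates a strong linear relationship, and a value near 0 indicates little linear relationship. It does not by itself show that one feature causes the other.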
