UNIT I
INTRODUCTION
9
Need for data science - benefits and uses - facets of data - data science process -
setting the research goal - retrieving data - cleaning, integrating, and transforming
data - exploratory data analysis - build the models - presenting and building
applications - Frequency distributions - Outliers - relative frequency distributions -
cumulative frequency distributions - frequency distributions for nominal data -
interpreting distributions - graphs-averages - mode - median - mean - averages for
qualitative and ranked data.
Introduction to data science
Definition for data science:
Data Science is an interdisciplinary field that seeks to extract knowledge
or insights from various forms of data.
Data science combines three areas of expertise:
business knowledge
statistical analysis
computer science
Cont.…..
• Imagine you have a giant bag of candy (data). You know there are
chocolates, lollipops, and gummies in there, but it's all mixed up
(messy data).
• A data scientist is like a kid who sorts the candy (data cleaning). They
separate the chocolates, lollipops, and gummies (data organization).
Then, they count how many of each kind there are (data analysis). This
way, you know exactly how much chocolate you have to eat (get
insights from data).
Big data
Big data is an evolving term that describes any amount of structured, semi-structured, and unstructured data
that has the potential to be mined for information.
Structured data - Structured data exists in a predefined format. A relational database consisting of tables with
rows and columns is one of the best examples of structured data.
Example:
Excel files and Google Docs spreadsheets.
Unstructured data - Unstructured data does not exist in a predefined format.
Example:
legal documents, audio, chats, video, images, text on a web page
Characteristics:
Volume-The name 'Big Data' itself is related to a size which is enormous.
Velocity-The term 'velocity' refers to the speed of generation of data.
Variety-Variety refers to heterogeneous sources and the nature of data, both structured and unstructured.
Difference between big data and data science
Big data:
Big data is an evolving term that describes any amount of structured, semi-structured, and unstructured data that has the potential to be mined for information.
Applications:
Social media
Healthcare
Finance
Data science:
Data Science is an interdisciplinary field that seeks to extract knowledge or insights from various forms of data.
Applications:
Shopping online
Movies and music
Weather forecasting
Benefits and uses of data science
• Anomaly detection: fraud, disease and crime
• Classification: An email server classifying emails as important
• Forecasting : sales, revenue and customer retention
• Recognition : Facial, voice, text
• Recommendation : recommendation engines can refer user to movies,
restaurants and books
facets of data
The main categories of data are these:
1. Structured- Structured data is data in a standardized format.
Example:
 Dates
 Phone numbers
 ZIP codes
 Customer names
 Product inventories
 Point-of-sale (POS) transaction information
Example: Relational database
Cont…
2. Unstructured
Unstructured or qualitative data is just the opposite. It doesn't fit
neatly into a spreadsheet or database.
Examples of unstructured data include:
Media: Audio and video files, images
Files: Word docs, PowerPoint presentations, email, chat logs
Social Media: Data from social networking sites like Facebook,
Twitter and LinkedIn
Mobile data: Text messages, locations
Communications: Chat, call recordings
Cont…..
3. Natural language
Natural language is a special type of unstructured data.
• No clear rules: There are no boxes or lines to follow in natural language, unlike a
form. It's like trying to understand a friend's joke without knowing the whole story
(ambiguous).
• Many meanings: One word can have different meanings depending on the situation.
• Learning limitations: Computers are good at learning from data, but natural
language is just too messy and complex sometimes, even for the best computers
(models struggle with new situations).
Typical natural language tasks include:
• Finding key points: like summarizing a long article for you (text summarization).
• Figuring out the main topic: understanding if someone is talking about sports or
music (topic recognition).
• Knowing how someone feels: telling if a message is happy or angry (sentiment
analysis).
Cont….
4. Machine-generated
Machine-generated data is information that’s automatically created by
a computer, process, application, or other machine without human
intervention.
Cont…
5. Graph-based
Graph structures use nodes, edges, and properties to represent and
store graph data. Graph-based data is a natural way to represent
social networks, and its structure allows you to calculate specific
metrics such as the influence of a person and the shortest path between
two people.
DATA SCIENCE PROCESS
Cont.….
 Setting the research goals and creating a project charter
What does the company expect you to do? And why does management place such a value on your research? Is it
part of a bigger strategic picture or a “lone wolf” project originating from an opportunity someone detected?
Answering these three questions (what, why, how) is the goal of the first phase.
✓ Spend time understanding the goals and context of your research
✓ Create a project charter
This charter contains information such as what you’re going to research, how the company benefits
from that, what data and resources you need, a timetable, and deliverables.
A project charter requires teamwork, and your input covers at least the following:
❖ A clear research goal
❖ The project mission and context
❖ How you’re going to perform your analysis
❖ What resources you expect to use
❖ Proof that it’s an achievable project, or proof of concepts
❖ Deliverables and a measure of success
Retrieving data
The second step is to collect data.
Data can be stored in many forms, ranging from simple text files to tables in a database.
✓ Start with data stored within the company
✓ Don’t be afraid to shop around
✓ Do data quality checks now to prevent problems later
External Data
• If data isn’t available inside your organization, look outside your
organization. Companies provide data so that you, in turn, can enrich their
services and ecosystem.
• Such is the case with Twitter, LinkedIn, and Facebook. More and more
governments and organizations share their data for free with the world.
Cont….
Data preparation
Cont….
• Data collection is an error-prone process. In this phase you enhance the
quality of the data and prepare it for use in subsequent steps.
This phase consists of three subphases: data cleansing, data transformation,
and data integration.
❖ Data cleansing removes false values from a data source and
inconsistencies across data sources.
Cont….
Mistakes during data entry
• Mistakes during data entry are errors that occur while inputting
information into a system or database. These errors can include various
types:
1.Typos: These are simple mistakes where a wrong key or combination of
keys is pressed, resulting in incorrect characters or numbers being entered.
For example, typing "hte" instead of "the".
2.Accidental Data Entry: This happens when incorrect data is entered
unintentionally. For instance, entering a wrong date, such as "2022"
instead of "2023".
3.Human Error: This encompasses a range of mistakes due to human
factors such as misinterpretation of data, misunderstanding instructions, or
incorrect application of rules during entry.
Redundant white space
• Redundant white space refers to extra spaces, tabs, or other whitespace
characters that are unintentionally included in text fields.
• String function: use the strip() function to remove leading and trailing spaces in text fields.
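A minimal sketch of cleaning redundant white space with Python's built-in str.strip():

```python
# Removing redundant leading/trailing white space from text fields.
raw_names = ["  Alice ", "Bob\t", " Carol\n"]

cleaned = [name.strip() for name in raw_names]
print(cleaned)  # ['Alice', 'Bob', 'Carol']
```

strip() removes spaces, tabs, and newlines at both ends of a string; lstrip() and rstrip() do the same for one side only.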
Impossible values:
Expected Range: Typically, human body temperature ranges from 36.1°C to
37.2°C.
Impossible Value: Finding a record with a temperature of 150°C.
You can manually review and correct these values, or you can set a rule to
automatically exclude them from your analysis.
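The rule-based exclusion described above can be sketched as a simple range check. The 30-45 °C plausibility window here is an illustrative assumption, deliberately wider than the normal 36.1-37.2 °C range:

```python
# Excluding impossible body-temperature values with a range rule.
readings = [36.5, 37.0, 150.0, 36.8]

valid = [t for t in readings if 30.0 <= t <= 45.0]
excluded = [t for t in readings if not (30.0 <= t <= 45.0)]
print(valid)     # [36.5, 37.0, 36.8]
print(excluded)  # [150.0]
```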
Missing values
• Missing values are pieces of information that are supposed to be in
your dataset but are not there for some reason. For example, if you
have a list of people and their ages, but some ages are not recorded or
are blank, those are missing values.
• How to handle missing values?
 Ignore the Whole Row
 Guessing
 Fill in with Other Data
 Use Special Methods
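Two of the strategies above, ignoring the whole row and filling in with other data, can be sketched on a small list of ages (None standing in for a missing value):

```python
# Handling missing ages: drop them, or fill them with the observed mean.
from statistics import mean

ages = [25, None, 31, None, 40]

# Option 1: ignore the rows with a missing value.
dropped = [a for a in ages if a is not None]

# Option 2: fill in missing values with the mean of the observed ones.
fill_value = mean(dropped)  # (25 + 31 + 40) / 3 = 32
filled = [a if a is not None else fill_value for a in ages]
print(filled)  # [25, 32, 31, 32, 40]
```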
Outliers
• Outliers are data points that are very different from other data points
in a dataset. They are values that are unusually far from the majority of
the data. These can happen because of errors in data collection,
measurement errors.
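One simple way to flag outliers is to mark points that lie far from the mean; the 2-standard-deviation threshold used here is an illustrative choice, not a universal rule:

```python
# Flagging outliers as points more than 2 standard deviations from the mean.
from statistics import mean, stdev

data = [10, 12, 11, 13, 12, 95]

m, s = mean(data), stdev(data)
outliers = [x for x in data if abs(x - m) > 2 * s]
print(outliers)  # [95]
```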
Data transformation
❖ Data transformation ensures that the data is in a suitable format for
use in your models.
The different ways of combining data: you can perform two operations
to combine information from different data sets:
 Joining
 Appending or stacking
Joining
• Joining tables allows you to combine the information of one observation found in one table with
the information that you find in another table. The focus is on enriching a single observation.
• Let’s say that the first table contains information about the purchases of a customer and the other
table contains information about the region where your customer lives.
• Joining the tables allows you to combine the information.
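The customer/region join described above can be sketched with plain Python dicts (the table contents are made up for illustration):

```python
# Joining purchases with customer regions on a shared customer_id key.
purchases = [
    {"customer_id": 1, "amount": 40},
    {"customer_id": 2, "amount": 75},
]
regions = {1: "North", 2: "South"}  # customer_id -> region

# Enrich each purchase with the region of the customer who made it.
joined = [{**p, "region": regions[p["customer_id"]]} for p in purchases]
print(joined[0])  # {'customer_id': 1, 'amount': 40, 'region': 'North'}
```

The focus is on enriching each observation: every purchase row gains a region column, but no new rows are added.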
Appending or stacking
• Appending or stacking tables is effectively adding observations from
one table to another table.
• One table contains the observations from the month January and the
second table contains observations from the month February.
• The result of appending these tables is a larger one with the
observations from January as well as February.
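The January/February example can be sketched the same way; appending adds rows rather than columns (the sales figures are made up):

```python
# Appending (stacking) two monthly tables of observations into one.
january = [{"month": "Jan", "sales": 100}, {"month": "Jan", "sales": 120}]
february = [{"month": "Feb", "sales": 90}]

combined = january + february  # stack the rows of both tables
print(len(combined))  # 3
```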
Reducing the Number of Variables
• Having too many variables in your model makes the model difficult
to handle, and certain techniques don’t perform well when you
overload them with too many input variables. For instance, techniques
based on a Euclidean distance perform well only up to about 10
variables.
Turning Variables into Dummies
• Dummy variables can take only two values: true (1) or false (0).
They’re used to indicate the presence or absence of a categorical
effect that may explain the observation.
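Turning a categorical variable into dummies can be sketched by hand: one 0/1 column per category (the color values are made up for illustration):

```python
# Turning a categorical variable into 0/1 dummy variables.
colors = ["red", "blue", "red", "green"]
categories = sorted(set(colors))  # ['blue', 'green', 'red']

# One dummy column per category: 1 if the row has that value, else 0.
dummies = [{c: int(value == c) for c in categories} for value in colors]
print(dummies[0])  # {'blue': 0, 'green': 0, 'red': 1}
```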
Data integration
• Data integration enriches data sources by combining information from multiple data
sources.
Merging/Joining Data Sets
Merging or joining data sets involves combining two or more datasets based on a common
field. This allows you to create a new dataset that includes data from both of the original
datasets. There are different types of joins, including:
Inner join: This keeps only the rows that have matches in both datasets.
Left join: This keeps all the rows from the left dataset, and matching rows from the right
dataset. Rows in the right dataset that don't have a match in the left dataset will have null
values in the corresponding columns.
Right join: This is the opposite of a left join. It keeps all the rows from the right dataset, and
matching rows from the left dataset. Rows in the left dataset that don't have a match in the
right dataset will have null values in the corresponding columns.
Full join: This keeps all the rows from both datasets, regardless of whether there is a match in
the other dataset. Rows that don't have a match in the other dataset will have null values in the
corresponding columns.
CONT….
Example of Merging Data Sets
Imagine you have two datasets:
Customer dataset: This dataset includes columns for customer ID,
customer name, and email address.
Order dataset: This dataset includes columns for order ID, customer ID,
product ID, and order amount.
You can merge these two datasets on the customer ID field.
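The customer/order merge above can be sketched with plain dicts, contrasting an inner join with a left join (the IDs and amounts are made up):

```python
# Merging orders with customers on customer_id: inner vs left join.
customers = {101: "Alice", 102: "Bob", 103: "Carol"}  # id -> name
orders = [{"order_id": 1, "customer_id": 101, "amount": 50},
          {"order_id": 2, "customer_id": 104, "amount": 20}]

# Inner join: keep only orders whose customer_id matches a customer.
inner = [{**o, "name": customers[o["customer_id"]]}
         for o in orders if o["customer_id"] in customers]

# Left join (orders on the left): keep every order; an unmatched
# customer_id gets a null (None) name.
left = [{**o, "name": customers.get(o["customer_id"])} for o in orders]

print(len(inner), len(left))  # 1 2
```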
Set Operators
• Set operators are used to perform operations on sets of data. Common
set operators include:
• Union: This operator returns the combined set of all unique values
from two sets.
• Intersection: This operator returns the values that are common to both
sets.
• Difference: This operator returns the values that are in one set but not
in the other set.
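All three set operators map directly onto Python's built-in set type:

```python
# Union, intersection, and difference with Python's built-in sets.
a = {1, 2, 3, 4}
b = {3, 4, 5}

print(sorted(a | b))  # union: [1, 2, 3, 4, 5]
print(sorted(a & b))  # intersection: [3, 4]
print(sorted(a - b))  # difference: [1, 2]
```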
Data exploration
Cont…
•Simple graphs: These are the most common type of graph, and they show
the relationship between two variables. Some examples of simple graphs
include bar graphs, line graphs, and pie charts.
•Combined graphs: These graphs combine two or more simple graphs into a
single chart. This can be useful for showing multiple data sets or for
comparing different trends.
Cont…
•Link and brush: This technique allows you to link data between multiple graphs.
•Non-graphical techniques: There are also non-graphical ways to represent data, such
as tables. These can be useful for presenting complex data sets or for data that is not
easily visualized in a graph.
Data modeling or model building
Using machine learning and statistical techniques to achieve your
project goal.
most models consist of the following main steps:
1. Selection of a modeling technique and variables to enter in the model
2. Execution of the model
3. Diagnosis and model comparison
Presentation and automation
Types of data
Qualitative data consist of words (Yes or No), letters (Y or N), or
numerical codes (0 or 1) that represent a class or category.
Ranked data consist of numbers (1st, 2nd, . . . 40th place) that represent
relative standing within a group.
Quantitative data consist of numbers (weights of 238, 170, . . . 185 lbs)
that represent an amount or a count.
To determine the type of data, focus on a single observation in any
collection of observations.
TYPES OF VARIABLES
• Discrete and Continuous Variables: Quantitative variables can be
further distinguished as discrete or continuous.
• A discrete variable consists of isolated numbers separated by gaps.
• Examples : Counts- such as the number of children in a family. (1, 2,
3, etc., but never 1.5)
• These variables cannot have fractional or decimal values. You can
have 20 or 21 cats, but not 20.5
• The number of heads in a sequence of coin tosses. The result of rolling
a die.
• The number of patients in a hospital.
• The population of a country.
continuous variable
• A continuous variable consists of numbers whose values, at least in theory,
have no restrictions.
• Continuous variables can assume any numeric value and can be
meaningfully split into smaller parts.
• Consequently, they have valid fractional and decimal values. In fact,
continuous variables have an infinite number of potential values between any
two points.
• Generally, you measure them using a scale. Examples of continuous
variables include weight, height, length, time, and temperature. Durations,
such as the reaction times of grade school children to a fire alarm; and
standardized test scores, such as those on the Scholastic Aptitude Test (SAT).
Frequency distribution (Tables)
• A frequency distribution is a table produced by sorting observations
into classes and showing their frequency (f) of occurrence in each
class.
• Frequency distribution is used to organize the collected data in table
form. The data could be marks scored by students, temperatures of
different towns, points scored in a volleyball match, etc. After data
collection, we have to show data in a meaningful manner for better
understanding. Organize the data in such a way that all its features are
summarized in a table.
frequency
• Let's consider an example to understand this better. The following are
the scores of 10 students in the G.K. quiz released by Mr. Chris 15, 17,
20, 15, 20, 17, 17, 14, 14, 20. Let's represent this data in frequency
distribution and find out the number of students who got the same
marks.
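The quiz-score example above can be tabulated with collections.Counter:

```python
# Frequency distribution of the G.K. quiz scores, using collections.Counter.
from collections import Counter

scores = [15, 17, 20, 15, 20, 17, 17, 14, 14, 20]
freq = Counter(scores)
for score in sorted(freq):
    print(score, freq[score])
# 14 2
# 15 2
# 17 3
# 20 3
```

Each row shows a score and the number of students (f) who obtained it.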
Cont….
• There are two types of frequency distributions -grouped and
ungrouped.
frequency distributions for Ungrouped data
frequency distributions for grouped data
Guidelines for Frequency Distributions
Cont….
OUTLIERS
• An outlier is an extremely high or extremely low data point relative to
the nearest data point and the rest of the neighboring co-existing
values in a data graph or dataset you're working with.
• Outliers are extreme values that stand out greatly from the overall
pattern of values in a dataset or graph.
RELATIVE FREQUENCY DISTRIBUTIONS
• Relative frequency distributions show the frequency of each class as a
part or fraction of the total frequency for the entire distribution.
CUMULATIVE FREQUENCY DISTRIBUTIONS
• Cumulative frequency distributions show the total number of
observations in each class and in all lower ranked classes. Cumulative
frequencies are usually converted, in turn, to cumulative percentages.
Cumulative percentages are often referred to as percentile ranks.
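Both ideas can be computed in a few lines; the class labels and frequencies below are an illustrative distribution, not data from the text:

```python
# Relative frequencies and cumulative percentages for a distribution.
classes = [("14", 2), ("15", 2), ("17", 3), ("20", 3)]
total = sum(f for _, f in classes)

running = 0
for label, f in classes:
    running += f
    rel = f / total                  # relative frequency (fraction of total)
    cum_pct = 100 * running / total  # cumulative percentage (percentile rank)
    print(label, rel, cum_pct)
```

The last cumulative percentage is always 100, since every observation falls in or below the top class.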
Fundamentals of Data Science -Artificial Intelligence
Describing Data with Averages
• MODE
The mode reflects the value of the most frequently occurring score. In
other words, a mode is the value that has the highest frequency in a
given set of values: the value that appears most often.
Example: In the given set of data 2, 4, 5, 5, 6, 7, the mode of the data
set is 5, since it appears in the set twice.
Types of Modes
• Bimodal, trimodal and multimodal (more than one mode): when there
are two modes in a data set, the set is called bimodal.
• For example, the mode of set A = {2,2,2,3,4,4,5,5,5} is 2 and 5, because
both 2 and 5 are repeated three times in the given set.
• When there are three modes in a data set, the set is called trimodal.
• For example, the mode of set A = {2,2,2,3,4,4,5,5,5,7,8,8,8} is 2, 5 and 8.
• When there are four or more modes in a data set, the set is called
multimodal.
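Python's statistics module handles both cases: mode() returns a single mode, while multimode() returns every mode, so it works for the bimodal and trimodal sets above:

```python
# mode() for a single mode, multimode() for bimodal/trimodal sets.
from statistics import mode, multimode

single = mode([2, 4, 5, 5, 6, 7])
bimodal = multimode([2, 2, 2, 3, 4, 4, 5, 5, 5])
trimodal = multimode([2, 2, 2, 3, 4, 4, 5, 5, 5, 7, 8, 8, 8])

print(single)    # 5
print(bimodal)   # [2, 5]
print(trimodal)  # [2, 5, 8]
```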
Cont….
• Example: The following table represents the number of wickets taken
by a bowler in 10 matches. Find the mode of the given set of data.
MEDIAN
• The median reflects the middle value when observations are ordered
from least to most.
• The median splits a set of ordered observations into two equal parts,
the upper and lower halves.
• Finding the median: order scores from least to most.
If the total number of observations given is odd, then the formula to
calculate the median is:
Median = {(n+1)/2}th term/observation
If the total number of observations is even, then the median formula is:
Median = 1/2 [ (n/2)th term + {(n/2)+1}th term ]
Example 1:
Find the median of the following: 4, 17, 77, 25, 22, 23, 92, 82, 40, 24, 14,
12, 67, 23, 29.
Solution:
n = 15. When we put those numbers in order
we have: 4, 12, 14, 17, 22, 23, 23, 24, 25, 29, 40, 67, 77, 82, 92.
Median = {(n+1)/2}th term = {(15+1)/2}th term = 8th term
The 8th term in the list is 24, so the median value of this set of numbers is 24.
Example 2:
Find the median of the following: 9, 7, 2, 11, 18, 12, 6, 4.
Solution:
n = 8. When we put those numbers in order
we have: 2, 4, 6, 7, 9, 11, 12, 18.
Median = 1/2 [ (n/2)th term + {(n/2)+1}th term ]
= 1/2 [ (8/2)th term + ((8/2)+1)th term ]
= 1/2 [4th term + 5th term]   (in our list the 4th term is 7 and the 5th term is 9)
= 1/2 [7 + 9] = 1/2 (16) = 8
The median value of this set of numbers is 8.
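Both worked examples can be checked with statistics.median, which applies the odd/even formulas automatically:

```python
# statistics.median reproduces both worked median examples.
from statistics import median

odd = [4, 17, 77, 25, 22, 23, 92, 82, 40, 24, 14, 12, 67, 23, 29]
even = [9, 7, 2, 11, 18, 12, 6, 4]

print(median(odd))   # 24
print(median(even))  # 8.0
```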
MEAN
• The mean is found by adding all scores and then dividing by the
number of scores.
• Mean is the average of the given numbers and is calculated by
dividing the sum of given numbers by the total number of numbers.
Types of means
• Sample mean
• Population mean
Sample Mean
• The sample mean is a central tendency measure.
• The arithmetic average is computed using samples or random values
taken from the population.
• It is evaluated as the sum of all the sample variables divided by the
total number of variables.
Population Mean
• The population mean can be calculated by the sum of all values in the
given data/population divided by a total number of values in the given
data/population.
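The arithmetic is identical in both cases: sum the values and divide by how many there are; what differs is whether the values are a sample or the whole population. A sketch with made-up height data:

```python
# Mean of a whole population vs mean of a sample drawn from it.
from statistics import mean

population = [170, 165, 180, 175, 160]
sample = population[:3]  # a sample drawn from the population

print(mean(population))  # 170
print(mean(sample))      # about 171.67
```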
AVERAGES FOR QUALITATIVE AND RANKED
DATA
• Mode: the mode can always be used with qualitative data.
• Median: the median can be used whenever it is possible to order
qualitative data from least to most, because the level of measurement
is ordinal.
More Related Content

PPTX
Data science.chapter-1,2,3
PPTX
Data science unit1
PDF
CS3352-Foundations of Data Science Notes.pdf
PPTX
Data Science topic and introduction to basic concepts involving data manageme...
PDF
Data Science Introduction and Process in Data Science
PPTX
Introduction to Big Data Analytics
PPTX
Unit 1-Data Science Process Overview.pptx
PPTX
Introducition to Data scinece compiled by hu
Data science.chapter-1,2,3
Data science unit1
CS3352-Foundations of Data Science Notes.pdf
Data Science topic and introduction to basic concepts involving data manageme...
Data Science Introduction and Process in Data Science
Introduction to Big Data Analytics
Unit 1-Data Science Process Overview.pptx
Introducition to Data scinece compiled by hu

Similar to Fundamentals of Data Science -Artificial Intelligence (20)

PDF
Data Science Unit1 AMET.pdf
PDF
Module 2 Data Collection and Management.pdf
PPTX
Machine Learning: A Fast Review
PDF
machinelearning-191005133446.pdf
PDF
Bringing a data mindset to your reporting - Brant Houston - Illinois NewsTrai...
PPTX
big data and machine learning ppt.pptx
PPTX
DS103 - Unit03DS103 - Unit03DS103 - Unit03.pptx
PDF
GDPR for Things - ThingsCon Amsterdam 2017
PPTX
Data Science presentation for explanation of numpy and pandas
PPTX
Data analytics using Scalable Programming
PPTX
Machine_Learning_VTU_6th_Semester_Module_2.1.pptx
PPTX
Unit – 1 introduction to big datannj.pptx
PPTX
What is Data?
PPTX
classIX_DS_Teacher_Presentation.pptx
PDF
chapter 2 Data Science.pdf emerging ecnology freshman course
PPTX
Data Science Introduction to Data Science
PPTX
Data quality and data profiling
PPTX
basic of data science and big data......
PPTX
1 UNIT-DSP.pptx
PDF
Introduction to Data Analytics, AKTU - UNIT-1
Data Science Unit1 AMET.pdf
Module 2 Data Collection and Management.pdf
Machine Learning: A Fast Review
machinelearning-191005133446.pdf
Bringing a data mindset to your reporting - Brant Houston - Illinois NewsTrai...
big data and machine learning ppt.pptx
DS103 - Unit03DS103 - Unit03DS103 - Unit03.pptx
GDPR for Things - ThingsCon Amsterdam 2017
Data Science presentation for explanation of numpy and pandas
Data analytics using Scalable Programming
Machine_Learning_VTU_6th_Semester_Module_2.1.pptx
Unit – 1 introduction to big datannj.pptx
What is Data?
classIX_DS_Teacher_Presentation.pptx
chapter 2 Data Science.pdf emerging ecnology freshman course
Data Science Introduction to Data Science
Data quality and data profiling
basic of data science and big data......
1 UNIT-DSP.pptx
Introduction to Data Analytics, AKTU - UNIT-1
Ad

Recently uploaded (20)

PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
PPT
introduction to datamining and warehousing
PPT
Project quality management in manufacturing
PPTX
CH1 Production IntroductoryConcepts.pptx
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PPTX
Sustainable Sites - Green Building Construction
PPTX
Construction Project Organization Group 2.pptx
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PPTX
Internet of Things (IOT) - A guide to understanding
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PPTX
Artificial Intelligence
PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PDF
R24 SURVEYING LAB MANUAL for civil enggi
PDF
PPT on Performance Review to get promotions
PDF
Unit I ESSENTIAL OF DIGITAL MARKETING.pdf
PPTX
UNIT 4 Total Quality Management .pptx
PDF
Embodied AI: Ushering in the Next Era of Intelligent Systems
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
introduction to datamining and warehousing
Project quality management in manufacturing
CH1 Production IntroductoryConcepts.pptx
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
Sustainable Sites - Green Building Construction
Construction Project Organization Group 2.pptx
Foundation to blockchain - A guide to Blockchain Tech
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
Internet of Things (IOT) - A guide to understanding
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
Artificial Intelligence
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
R24 SURVEYING LAB MANUAL for civil enggi
PPT on Performance Review to get promotions
Unit I ESSENTIAL OF DIGITAL MARKETING.pdf
UNIT 4 Total Quality Management .pptx
Embodied AI: Ushering in the Next Era of Intelligent Systems
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
Ad

Fundamentals of Data Science -Artificial Intelligence

  • 1. UNIT I INTRODUCTION 9 Need for data science - benefits and uses - facets of data - data science process - setting the research goal - retrieving data - cleaning, integrating, and transforming data - exploratory data analysis - build the models - presenting and building applications - Frequency distributions - Outliers - relative frequency distributions - cumulative frequency distributions - frequency distributions for nominal data - interpreting distributions - graphs-averages - mode - median - mean - averages for qualitative and ranked data.
  • 2. Introduction to data science Definition for data science: Data Science is an interdisciplinary filed that seeks to extract knowledge or insights from various forms of data. Data science combines three areas of expertise: business knowledge statistical analysis computer science
  • 3. Cont.….. • Imagine you have a giant bag of candy (data). You know there are chocolates, lollipops, and gummies in there, but it's all mixed up (messy data). • A data scientist is like a kid who sorts the candy (data cleaning). They separate the chocolates, lollipops, and gummies (data organization). Then, they count how many of each kind there are (data analysis). This way, you know exactly how much chocolate you have to eat (get insights from data).
  • 4. Big data Big data is an evolving term that describes any amount of structured, semi structured and unstructured data that has the potential to be mined for information. Structured data- Structured data exists in a predefined format. Relational database consisting of tables with rows and columns is one of the best examples of structured data. Example: excel files and Google Docs spreadsheets. unstructured data- Unstructured data does not exists in a predefined format . Example: legal documents, audio, chats, video, images, text on a web page Characteristics: Volume-The name 'Big Data' itself is related to a size which is enormous. Velocity-The term 'velocity' refers to the speed of generation of data. Variety-Variety refers to heterogeneous sources and the nature of data, both structured and unstructured.
  • 5. Difference between big data and data science Big data Data science Big data is an evolving term that describes any amount of structured, semi structured and unstructured data that has the potential to be mined for information. Data Science is an interdisciplinary filed that seeks to extract knowledge or insights from various forms of data. Applications: Social media Healthcare Finance Applications: Shopping online Movies and music Weather forecasting
  • 6. Benefits and uses of data science • Anomaly detection: fraud, disease and crime • Classification: An email server classifying emails as important • Forecasting : sales, revenue and customer retention • Recognition : Facial, voice, text • Recommendation : recommendation engines can refer user to movies, restaurants and books
  • 7. facets of data The main categories of data are these: 1. Structured- Structured data is when data is in a standardized format. Example:  Dates  Phone numbers  ZIP codes  Customer names  Product inventories  Point-of-sale (POS) transaction information
  • 9. Cont… 2. Unstructured Unstructured data Unstructured or qualitative data — is just the opposite. It doesn’t fit nicely into a spreadsheet or database. Examples of unstructured data include: Media: Audio and video files, images files: Word docs, PowerPoint presentations, email, chat logs Social Media: Data from social networking sites like Facebook, Twitter and LinkedIn Mobile data: Text messages, locations Communications: Chat, call recordings
  • 10. Cont….. 3. Natural language Natural language is a special type of unstructured data; • No clear rules: There are no boxes or lines to follow in natural language, unlike a form. It's like trying to understand a friend's joke without knowing the whole story (ambiguous). • Many meanings: One word can have different meanings depending on the situation. • Learning limitations: Computers are good at learning from data, but natural language is just too messy and complex sometimes, even for the best computers (models struggle with new situations). • Finding key points: Like summarizing a long article for you (text summarization). • Figuring out the main topic: Understanding if someone is talking about sports or music (topic recognition). • Knowing how someone feels: Telling if a message is happy or angry (sentiment analysis).
  • 11. Cont…. 4. Machine-generated Machine-generated data is information that’s automatically created by a computer, process, application, or other machine without human intervention.
  • 12. Cont… 5. Graph-based The graph structures use nodes, edges, and properties to represent and store graphical data. Graph-based data is a natural way to represent social networks, and its structure allows you to calculate specific metrics such as the influence of a person and the shortest path between two people.
  • 15.  Setting the research goals and creating a project charter What does the company expect you to do? And why does management place such a value on your research? Is it part of a bigger strategic picture or a “lone wolf” project originating from an opportunity someone detected? Answering these three questions (what, why, how) is the goal of the first phase. prepare a project charter. This charter contains information such as what you’re going to research, how the company benefits from that, what data and resources you need, a timetable, and deliverables. Spend time understanding the goals and context of your research ✓ Create a project charter ✓ A project charter requires teamwork, and your input covers at least the following: A clear research goal ❖ The project mission and context How you’re going to perform your analysis What ❖ ❖ resources you expect to use Proof that it’s an achievable project, or proof of concepts ❖ ❖ Deliverables and a measure of success
  • 16. Retrieving data The second step is to collect data. Data can be stored in many forms, ranging from simple text files to tables in a database. Start with data stored within the company ✓ Don’t be afraid to shop around ✓ Do data quality checks now to prevent problems later ✓
  • 17. External Data • If data isn’t available inside your organization, look outside your organizations. Companies provide data so that you, in turn, can enrich their services and ecosystem. • Such is the case with Twitter, LinkedIn, and Facebook. More and more governments and organizations share their data for free with the world.
  • 20. Cont…. • Data collection is an error-prone process: • In this phase you enhance the quality of the data and prepare it for use in subsequent steps. This phase consists of three subphases: ❖ data cleansing removes false values from a data source and inconsistencies across data sources .
  • 22. Mistakes during data entry • Mistakes during data entry are errors that occur while inputting information into a system or database. These errors can include various types: 1.Typos: These are simple mistakes where a wrong key or combination of keys is pressed, resulting in incorrect characters or numbers being entered. For example, typing "hte" instead of "the". 2.Accidental Data Entry: This happens when incorrect data is entered unintentionally. For instance, entering a wrong date, such as "2022" instead of "2023". 3.Human Error: This encompasses a range of mistakes due to human factors such as misinterpretation of data, misunderstanding instructions, or incorrect application of rules during entry.
• 23. Redundant white space • Redundant white space refers to extra spaces, tabs, or other whitespace characters that are unintentionally included in text fields. • String function: use the strip() function to remove leading and trailing spaces in text fields. • Impossible values: Expected range: typically, human body temperature ranges from 36.1°C to 37.2°C. Impossible value: finding a record with a temperature of 150°C. You can manually review and correct these values, or you can set a rule to automatically exclude them from your analysis.
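As a minimal Python sketch (the name field and the body-temperature cutoffs are the illustrative examples from above, not a universal rule), the strip() cleanup and an impossible-value filter might look like this:

```python
# Sketch: cleaning redundant white space and filtering impossible values.
# The field values and the 36.1-37.2 range are illustrative assumptions.

raw_names = ["  Alice", "Bob  ", "  Carol  "]
cleaned = [name.strip() for name in raw_names]  # strip() removes leading/trailing spaces
print(cleaned)  # ['Alice', 'Bob', 'Carol']

temperatures = [36.6, 37.0, 150.0, 36.2]  # 150.0 is physically impossible
valid = [t for t in temperatures if 36.1 <= t <= 37.2]
print(valid)  # [36.6, 37.0, 36.2]
```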
• 24. Missing values • Missing values are pieces of information that are supposed to be in your dataset but are not there for some reason. For example, if you have a list of people and their ages, but some ages are not recorded or are blank, those are missing values. • How to handle missing values? ❖ Ignore the whole row ❖ Guessing ❖ Fill in with other data ❖ Use special methods
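Two of these strategies can be sketched with the pandas library (the age values are made up; filling with the mean is one common way of "guessing"):

```python
import pandas as pd
import numpy as np

# Sketch of common missing-value strategies using pandas; the data is illustrative.
ages = pd.Series([25, np.nan, 31, np.nan, 40])

dropped = ages.dropna()                  # ignore the rows with missing values
filled_mean = ages.fillna(ages.mean())   # "guess" by filling with the mean
filled_zero = ages.fillna(0)             # fill in with other data (a default)

print(dropped.tolist())      # [25.0, 31.0, 40.0]
print(filled_mean.tolist())  # [25.0, 32.0, 31.0, 32.0, 40.0]
```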
  • 25. Outliers • Outliers are data points that are very different from other data points in a dataset. They are values that are unusually far from the majority of the data. These can happen because of errors in data collection, measurement errors.
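One simple way to flag such points (a rule of thumb, not the only method; the data and the cutoff of 3 median absolute deviations are illustrative assumptions) is to look for values far from the median:

```python
import statistics

# Sketch: flagging outliers as points far from the median.
# The dataset and the 3-MAD cutoff are illustrative, not a universal standard.
data = [12, 13, 12, 14, 13, 12, 98]  # 98 looks suspicious

med = statistics.median(data)                          # 13
mad = statistics.median([abs(x - med) for x in data])  # median absolute deviation
outliers = [x for x in data if abs(x - med) > 3 * mad]
print(outliers)  # [98]
```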
• 26. Data transformation ❖ Data transformation ensures that the data is in a suitable format for use in your models. The different ways of combining data: you can perform two operations to combine information from different data sets. ❖ Joining ❖ Appending or stacking
  • 27. Joining • Joining tables allows you to combine the information of one observation found in one table with the information that you find in another table. The focus is on enriching a single observation. • Let’s say that the first table contains information about the purchases of a customer and the other table contains information about the region where your customer lives. • Joining the tables allows you to combine the information.
  • 28. Appending or stacking • Appending or stacking tables is effectively adding observations from one table to another table. • One table contains the observations from the month January and the second table contains observations from the month February. • The result of appending these tables is a larger one with the observations from January as well as February.
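The January/February stacking described above can be sketched with pandas (the column names and values are illustrative assumptions):

```python
import pandas as pd

# Sketch: appending (stacking) two monthly tables with pandas.concat.
january = pd.DataFrame({"customer": ["A", "B"], "amount": [100, 150]})
february = pd.DataFrame({"customer": ["A", "C"], "amount": [120, 90]})

combined = pd.concat([january, february], ignore_index=True)
print(len(combined))                    # 4: January's rows plus February's
print(combined["customer"].tolist())    # ['A', 'B', 'A', 'C']
```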
• 29. Reducing the Number of Variables • Having too many variables in your model makes the model difficult to handle, and certain techniques don’t perform well when you overload them with too many input variables. For instance, all the techniques based on a Euclidean distance perform well only up to 10 variables.
• 30. Turning Variables into Dummies • Dummy variables can only take two values: true (1) or false (0). They’re used to indicate the absence or presence of a categorical effect that may explain the observation.
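A quick sketch with pandas (the "region" column is an illustrative assumption): each category becomes its own 0/1 column.

```python
import pandas as pd

# Sketch: turning a categorical variable into dummy (0/1) columns
# with pandas.get_dummies; the "region" values are made up.
region = pd.Series(["north", "south", "north"])
dummies = pd.get_dummies(region, dtype=int)

print(dummies["north"].tolist())  # [1, 0, 1]
print(dummies["south"].tolist())  # [0, 1, 0]
```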
• 31. Data integration • Data integration enriches data sources by combining information from multiple data sources. Merging/joining data sets involves combining two or more datasets based on a common field. This allows you to create a new dataset that includes data from both of the original datasets. There are different types of joins: • Inner join: keeps only the rows that have matches in both datasets. • Left join: keeps all the rows from the left dataset and the matching rows from the right dataset. Left rows that don't have a match in the right dataset get null values in the columns that come from the right dataset. • Right join: the opposite of a left join. It keeps all the rows from the right dataset and the matching rows from the left dataset. Right rows that don't have a match in the left dataset get null values in the columns that come from the left dataset. • Full join: keeps all the rows from both datasets, regardless of whether there is a match in the other dataset. Rows without a match in the other dataset get null values in the columns that come from that dataset.
  • 32. CONT…. Example of Merging Data Sets Imagine you have two datasets: Customer dataset: This dataset includes columns for customer ID, customer name, and email address. Order dataset: This dataset includes columns for order ID, customer ID, product ID, and order amount. You can merge these two datasets on the customer ID field.
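The customer/order merge described above can be sketched with pandas (the column names and values are illustrative assumptions):

```python
import pandas as pd

# Sketch of the customer/order merge on the customer ID field.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ada", "Ben", "Cy"],
})
orders = pd.DataFrame({
    "order_id": [101, 102, 103],
    "customer_id": [1, 1, 3],
    "amount": [20.0, 35.0, 15.0],
})

# Inner join: keep only rows whose customer_id appears in both tables.
merged = pd.merge(orders, customers, on="customer_id", how="inner")
print(merged["name"].tolist())  # ['Ada', 'Ada', 'Cy']
```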
  • 33. Set Operators • Set operators are used to perform operations on sets of data. Common set operators include: • Union: This operator returns the combined set of all unique values from two sets. • Intersection: This operator returns the values that are common to both sets. • Difference: This operator returns the values that are in one set but not in the other set.
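All three operators are built into Python's set type (the values below are illustrative):

```python
# Sketch: union, intersection, and difference with Python's built-in sets.
a = {1, 2, 3, 4}
b = {3, 4, 5}

print(a | b)  # union: {1, 2, 3, 4, 5}
print(a & b)  # intersection: {3, 4}
print(a - b)  # difference: {1, 2}
```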
• 35. Cont… • Simple graphs: These are the most common type of graph, and they show the relationship between two variables. Some examples of simple graphs include bar graphs, line graphs, and pie charts. • Combined graphs: These graphs combine two or more simple graphs into a single chart. This can be useful for showing multiple data sets or for comparing different trends.
• 36. Cont… • Link and brush: This technique allows you to link data between multiple graphs. • Non-graphical techniques: There are also non-graphical ways to represent data, such as tables. These can be useful for presenting complex data sets or for data that is not easily visualized in a graph.
• 37. Data modeling or model building • Model building uses machine learning and statistical techniques to achieve your project goal. Most model building consists of the following main steps: 1. Selection of a modeling technique and of the variables to enter into the model 2. Execution of the model 3. Diagnosis and model comparison
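The three steps can be sketched with a hand-rolled simple linear regression (the data is a made-up toy example; a real project would typically use a library such as scikit-learn):

```python
# Sketch of the three model-building steps on illustrative toy data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# 1. Selection: one predictor (x) and a straight-line model y = a*x + b.
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# 2. Execution: fit the model by least squares.
a = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
    / sum((xi - mean_x) ** 2 for xi in x)
b = mean_y - a * mean_x

# 3. Diagnosis: inspect the errors of the fitted model.
errors = [yi - (a * xi + b) for xi, yi in zip(x, y)]
print(a, b)                          # 2.0 0.0
print(max(abs(e) for e in errors))   # 0.0 -- a perfect fit on this toy data
```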
• 39. Types of data Qualitative data consist of words (Yes or No), letters (Y or N), or numerical codes (0 or 1) that represent a class or category. Ranked data consist of numbers (1st, 2nd, . . . 40th place) that represent relative standing within a group. Quantitative data consist of numbers (weights of 238, 170, . . . 185 lbs) that represent an amount or a count. To determine the type of data, focus on a single observation in any collection of observations.
• 40. TYPES OF VARIABLES • Discrete and Continuous Variables: quantitative variables can be further distinguished as discrete or continuous. • A discrete variable consists of isolated numbers separated by gaps. • Examples: counts, such as the number of children in a family (1, 2, 3, etc., but never 1.5). • These variables cannot have fractional or decimal values: you can have 20 or 21 cats, but not 20.5. • Other examples: the number of heads in a sequence of coin tosses, the result of rolling a die, the number of patients in a hospital, the population of a country.
• 41. continuous variable • A continuous variable consists of numbers whose values, at least in theory, have no restrictions. • Continuous variables can assume any numeric value and can be meaningfully split into smaller parts. • Consequently, they have valid fractional and decimal values; in fact, continuous variables have an infinite number of potential values between any two points. • Generally, you measure them using a scale. Examples of continuous variables include weight, height, length, time, and temperature; durations, such as the reaction times of grade school children to a fire alarm; and standardized test scores, such as those on the Scholastic Aptitude Test (SAT).
• 42. Frequency distribution (Tables) • A frequency distribution is a collection of observations produced by sorting observations into classes and showing their frequency (f) of occurrence in each class. • Frequency distributions are used to organize collected data in table form. The data could be marks scored by students, temperatures of different towns, points scored in a volleyball match, etc. After data collection, we have to present the data in a meaningful manner for better understanding, organizing it so that all its features are summarized in a table.
• 43. frequency • Let's consider an example to understand this better. The following are the scores of 10 students in the G.K. quiz released by Mr. Chris: 15, 17, 20, 15, 20, 17, 17, 14, 14, 20. Let's represent this data as a frequency distribution and find out the number of students who got the same marks.
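The tally for these scores can be sketched with Python's collections.Counter:

```python
from collections import Counter

# The quiz scores from the example above, tallied into a frequency table.
scores = [15, 17, 20, 15, 20, 17, 17, 14, 14, 20]
freq = Counter(scores)

for score in sorted(freq):
    print(score, freq[score])
# 14 2
# 15 2
# 17 3
# 20 3
```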
• 44. Cont…. • There are two types of frequency distributions: grouped and ungrouped.
  • 45. frequency distributions for Ungrouped data
  • 47. Guidelines for Frequency Distributions
• 49. OUTLIERS • An outlier is an extremely high or extremely low data point relative to the nearest data point and the rest of the neighboring values in the dataset or graph you're working with. • In other words, outliers are extreme values that stand out greatly from the overall pattern of values in a dataset or graph.
  • 50. RELATIVE FREQUENCY DISTRIBUTIONS • Relative frequency distributions show the frequency of each class as a part or fraction of the total frequency for the entire distribution.
  • 51. CUMULATIVE FREQUENCY DISTRIBUTIONS • Cumulative frequency distributions show the total number of observations in each class and in all lower ranked classes. Cumulative frequencies are usually converted, in turn, to cumulative percentages. Cumulative percentages are often referred to as percentile ranks.
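Both ideas can be sketched in Python using the quiz-score frequencies from the earlier example (scores 14, 15, 17, 20 with frequencies 2, 2, 3, 3):

```python
from collections import Counter
from itertools import accumulate

# Sketch: relative and cumulative frequencies for the earlier quiz scores.
scores = [15, 17, 20, 15, 20, 17, 17, 14, 14, 20]
freq = Counter(scores)
total = sum(freq.values())

classes = sorted(freq)
relative = [freq[c] / total for c in classes]             # fraction of the total
cumulative = list(accumulate(freq[c] for c in classes))   # running totals

print(relative)    # [0.2, 0.2, 0.3, 0.3]
print(cumulative)  # [2, 4, 7, 10]
```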
• 53. Describing Data with Averages • MODE The mode reflects the value of the most frequently occurring score. In other words, the mode is defined as the value with the highest frequency in a given set of values: the value that appears most often. Example: in the data set 2, 4, 5, 5, 6, 7, the mode is 5, since it appears in the set twice.
• 54. Types of Modes • Bimodal, trimodal and multimodal (more than one mode). When there are two modes in a data set, the set is called bimodal. • For example, the mode of set A = {2,2,2,3,4,4,5,5,5} is 2 and 5, because both 2 and 5 are repeated three times in the given set. • When there are three modes in a data set, the set is called trimodal. • For example, the mode of set A = {2,2,2,3,4,4,5,5,5,7,8,8,8} is 2, 5 and 8. • When there are four or more modes in a data set, the set is called multimodal.
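The bimodal example above can be checked with statistics.multimode (available since Python 3.8), which returns all of the most frequent values:

```python
import statistics

# The bimodal set from the example above: 2 and 5 each occur three times.
set_a = [2, 2, 2, 3, 4, 4, 5, 5, 5]
print(statistics.multimode(set_a))  # [2, 5]
```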
  • 55. Cont…. • Example: The following table represents the number of wickets taken by a bowler in 10 matches. Find the mode of the given set of data.
• 56. MEDIAN • The median reflects the middle value when observations are ordered from least to most. • The median splits a set of ordered observations into two equal parts, the upper and lower halves. • Finding the median: order the scores from least to most. If the total number of observations is odd, the median is the ((n+1)/2)th term. If the total number of observations is even, the median is the average of the two middle terms: Median = 1/2 [ (n/2)th term + ((n/2)+1)th term ]
• 57. Example 1: Find the median of the following: 4, 17, 77, 25, 22, 23, 92, 82, 40, 24, 14, 12, 67, 23, 29. Solution: n = 15. When we put those numbers in order we have: 4, 12, 14, 17, 22, 23, 23, 24, 25, 29, 40, 67, 77, 82, 92. Median = ((n+1)/2)th term = ((15+1)/2)th = 8th term. The 8th term in the ordered list is 24, so the median of this set of numbers is 24.
• 58. Example 2: Find the median of the following: 9, 7, 2, 11, 18, 12, 6, 4. Solution: n = 8. When we put those numbers in order we have: 2, 4, 6, 7, 9, 11, 12, 18. Median = 1/2 [ (n/2)th term + ((n/2)+1)th term ] = 1/2 [ 4th term + 5th term ] (in our ordered list the 4th term is 7 and the 5th term is 9) = 1/2 (7 + 9) = 1/2 (16) = 8. The median of this set of numbers is 8.
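Both examples can be checked with Python's statistics module, which orders the data and applies the odd/even rule for you:

```python
import statistics

# The two median examples from above, verified with statistics.median.
odd = [4, 17, 77, 25, 22, 23, 92, 82, 40, 24, 14, 12, 67, 23, 29]
even = [9, 7, 2, 11, 18, 12, 6, 4]

print(statistics.median(odd))   # 24 (the middle of 15 ordered values)
print(statistics.median(even))  # 8.0 (the average of the two middle values, 7 and 9)
```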
• 59. MEAN • The mean is found by adding all scores and then dividing by the number of scores. • In other words, the mean is the average of the given numbers, calculated by dividing their sum by how many numbers there are. Types of means • Sample mean • Population mean
• 60. Sample Mean • The sample mean is a measure of central tendency. • It is the arithmetic average computed from a sample of values taken from the population. • It is evaluated as the sum of all the sampled values divided by the number of values in the sample.
• 61. Population Mean • The population mean is calculated as the sum of all values in the given data/population divided by the total number of values in the given data/population.
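A small sketch of the two means (the values are illustrative, and taking the first four items stands in for a sample; a real sample would be drawn at random):

```python
import statistics

# Sketch: population mean vs. sample mean; the data is illustrative.
population = [2, 4, 4, 4, 5, 5, 7, 9]  # all values in the population
sample = population[:4]                 # a "sample" (a real one would be random)

print(statistics.mean(population))  # 5 -- sum 40 divided by 8 values
print(statistics.mean(sample))      # 3.5 -- sum 14 divided by 4 values
```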
  • 62. AVERAGES FOR QUALITATIVE AND RANKED DATA • Mode The mode always can be used with qualitative data. • Median The median can be used whenever it is possible to order qualitative data from least to most because the level of measurement is ordinal.