UNIT I
INTRODUCTION
9
Need for data science - benefits and uses - facets of data - data science process -
setting the research goal - retrieving data - cleaning, integrating, and transforming
data - exploratory data analysis - build the models - presenting and building
applications - Frequency distributions - Outliers - relative frequency distributions -
cumulative frequency distributions - frequency distributions for nominal data -
interpreting distributions - graphs-averages - mode - median - mean - averages for
qualitative and ranked data.
Introduction to data science
Definition for data science:
Data Science is an interdisciplinary field that seeks to extract knowledge
or insights from various forms of data.
Data science combines three areas of expertise:
business knowledge
statistical analysis
computer science
Cont.…..
• Imagine you have a giant bag of candy (data). You know there are
chocolates, lollipops, and gummies in there, but it's all mixed up
(messy data).
• A data scientist is like a kid who sorts the candy (data cleaning). They
separate the chocolates, lollipops, and gummies (data organization).
Then, they count how many of each kind there are (data analysis). This
way, you know exactly how much chocolate you have to eat (get
insights from data).
Big data
Big data is an evolving term that describes any amount of structured, semi-structured, and unstructured data
that has the potential to be mined for information.
Structured data - Structured data exists in a predefined format. A relational database consisting of tables with
rows and columns is one of the best examples of structured data.
Example:
Excel files and Google Docs spreadsheets.
Unstructured data - Unstructured data does not exist in a predefined format.
Example:
legal documents, audio, chats, video, images, text on a web page
Characteristics:
Volume-The name 'Big Data' itself is related to a size which is enormous.
Velocity-The term 'velocity' refers to the speed of generation of data.
Variety-Variety refers to heterogeneous sources and the nature of data, both structured and unstructured.
Difference between big data and data science
Big data:
Big data is an evolving term that describes any amount of structured, semi-structured, and unstructured data that has the potential to be mined for information.
Applications:
Social media
Healthcare
Finance
Data science:
Data Science is an interdisciplinary field that seeks to extract knowledge or insights from various forms of data.
Applications:
Shopping online
Movies and music
Weather forecasting
Benefits and uses of data science
• Anomaly detection: fraud, disease and crime
• Classification: An email server classifying emails as important
• Forecasting : sales, revenue and customer retention
• Recognition : Facial, voice, text
• Recommendation : recommendation engines can refer user to movies,
restaurants and books
facets of data
The main categories of data are these:
1. Structured- Structured data is data in a standardized format.
Example:
 Dates
 Phone numbers
 ZIP codes
 Customer names
 Product inventories
 Point-of-sale (POS) transaction information
Example: Relational database
Cont…
2. Unstructured
Unstructured or qualitative data is just the opposite. It doesn't fit
neatly into a spreadsheet or database.
Examples of unstructured data include:
Media: Audio and video files, images
Files: Word docs, PowerPoint presentations, email, chat logs
Social Media: Data from social networking sites like Facebook,
Twitter and LinkedIn
Mobile data: Text messages, locations
Communications: Chat, call recordings
Cont…..
3. Natural language
Natural language is a special type of unstructured data.
• No clear rules: There are no boxes or lines to follow in natural language, unlike a
form. It's like trying to understand a friend's joke without knowing the whole story
(ambiguous).
• Many meanings: One word can have different meanings depending on the situation.
• Learning limitations: Computers are good at learning from data, but natural
language is just too messy and complex sometimes, even for the best computers
(models struggle with new situations).
Typical natural language tasks include:
• Finding key points: like summarizing a long article for you (text summarization).
• Figuring out the main topic: understanding if someone is talking about sports or
music (topic recognition).
• Knowing how someone feels: telling if a message is happy or angry (sentiment
analysis).
Cont….
4. Machine-generated
Machine-generated data is information that’s automatically created by
a computer, process, application, or other machine without human
intervention.
Cont…
5. Graph-based
Graph structures use nodes, edges, and properties to represent and
store graph data. Graph-based data is a natural way to represent
social networks, and its structure allows you to calculate specific
metrics such as the influence of a person and the shortest path between
two people.
DATA SCIENCE PROCESS
Cont.….
 Setting the research goals and creating a project charter
What does the company expect you to do? And why does management place such a value on your research? Is it
part of a bigger strategic picture or a “lone wolf” project originating from an opportunity someone detected?
Answering these three questions (what, why, how) is the goal of the first phase.
✓ Spend time understanding the goals and context of your research
✓ Create a project charter
This charter contains information such as what you’re going to research, how the company benefits
from that, what data and resources you need, a timetable, and deliverables.
A project charter requires teamwork, and your input covers at least the following:
❖ A clear research goal
❖ The project mission and context
❖ How you’re going to perform your analysis
❖ What resources you expect to use
❖ Proof that it’s an achievable project, or proof of concepts
❖ Deliverables and a measure of success
Retrieving data
The second step is to collect data.
Data can be stored in many forms, ranging from simple text files to tables in a database.
✓ Start with data stored within the company
✓ Don’t be afraid to shop around
✓ Do data quality checks now to prevent problems later
External Data
• If data isn’t available inside your organization, look outside your
organization. Companies provide data so that you, in turn, can enrich their
services and ecosystem.
• Such is the case with Twitter, LinkedIn, and Facebook. More and more
governments and organizations share their data for free with the world.
Cont….
Data preparation
Cont….
• Data collection is an error-prone process. In this phase you enhance the
quality of the data and prepare it for use in subsequent steps.
This phase consists of three subphases: data cleansing, data transformation,
and data integration.
❖ Data cleansing removes false values from a data source and
inconsistencies across data sources.
Cont….
Mistakes during data entry
• Mistakes during data entry are errors that occur while inputting
information into a system or database. These errors can include various
types:
1.Typos: These are simple mistakes where a wrong key or combination of
keys is pressed, resulting in incorrect characters or numbers being entered.
For example, typing "hte" instead of "the".
2.Accidental Data Entry: This happens when incorrect data is entered
unintentionally. For instance, entering a wrong date, such as "2022"
instead of "2023".
3.Human Error: This encompasses a range of mistakes due to human
factors such as misinterpretation of data, misunderstanding instructions, or
incorrect application of rules during entry.
Redundant white space
• Redundant white space refers to extra spaces, tabs, or other whitespace
characters that are unintentionally included in text fields.
• String function: use the strip() function to remove leading and trailing spaces in text fields.
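A minimal sketch of cleaning redundant white space with Python's built-in str.strip():

```python
# Removing redundant leading/trailing white space from text fields.
raw_names = ["  Alice ", "Bob\t", " Carol\n"]

cleaned = [name.strip() for name in raw_names]
print(cleaned)  # ['Alice', 'Bob', 'Carol']
```

strip() removes spaces, tabs, and newlines at both ends of a string; lstrip() and rstrip() do the same for one side only.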
Impossible values:
Expected Range: Typically, human body temperature ranges from 36.1°C to
37.2°C.
Impossible Value: Finding a record with a temperature of 150°C.
You can manually review and correct these values, or you can set a rule to
automatically exclude them from your analysis.
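The rule-based exclusion described above can be sketched as a simple range check. The 30-45 °C plausibility window here is an illustrative assumption, deliberately wider than the normal 36.1-37.2 °C range:

```python
# Excluding impossible body-temperature values with a range rule.
readings = [36.5, 37.0, 150.0, 36.8]

valid = [t for t in readings if 30.0 <= t <= 45.0]
excluded = [t for t in readings if not (30.0 <= t <= 45.0)]
print(valid)     # [36.5, 37.0, 36.8]
print(excluded)  # [150.0]
```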
Missing values
• Missing values are pieces of information that are supposed to be in
your dataset but are not there for some reason. For example, if you
have a list of people and their ages, but some ages are not recorded or
are blank, those are missing values.
• How to handle missing values?
 Ignore the Whole Row
 Guessing
 Fill in with Other Data
 Use Special Methods
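Two of the strategies above, ignoring the whole row and filling in with other data, can be sketched on a small list of ages (None standing in for a missing value):

```python
# Handling missing ages: drop them, or fill them with the observed mean.
from statistics import mean

ages = [25, None, 31, None, 40]

# Option 1: ignore the rows with a missing value.
dropped = [a for a in ages if a is not None]

# Option 2: fill in missing values with the mean of the observed ones.
fill_value = mean(dropped)  # (25 + 31 + 40) / 3 = 32
filled = [a if a is not None else fill_value for a in ages]
print(filled)  # [25, 32, 31, 32, 40]
```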
Outliers
• Outliers are data points that are very different from other data points
in a dataset. They are values that are unusually far from the majority of
the data. These can happen because of errors in data collection,
measurement errors.
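One simple way to flag outliers is to mark points that lie far from the mean; the 2-standard-deviation threshold used here is an illustrative choice, not a universal rule:

```python
# Flagging outliers as points more than 2 standard deviations from the mean.
from statistics import mean, stdev

data = [10, 12, 11, 13, 12, 95]

m, s = mean(data), stdev(data)
outliers = [x for x in data if abs(x - m) > 2 * s]
print(outliers)  # [95]
```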
Data transformation
❖ Data transformation ensures that the data is in a suitable format for
use in your models.
The different ways of combining data: you can perform two operations
to combine information from different data sets:
 Joining
 Appending or stacking
Joining
• Joining tables allows you to combine the information of one observation found in one table with
the information that you find in another table. The focus is on enriching a single observation.
• Let’s say that the first table contains information about the purchases of a customer and the other
table contains information about the region where your customer lives.
• Joining the tables allows you to combine the information.
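The customer/region join described above can be sketched with plain Python dicts (the table contents are made up for illustration):

```python
# Joining purchases with customer regions on a shared customer_id key.
purchases = [
    {"customer_id": 1, "amount": 40},
    {"customer_id": 2, "amount": 75},
]
regions = {1: "North", 2: "South"}  # customer_id -> region

# Enrich each purchase with the region of the customer who made it.
joined = [{**p, "region": regions[p["customer_id"]]} for p in purchases]
print(joined[0])  # {'customer_id': 1, 'amount': 40, 'region': 'North'}
```

The focus is on enriching each observation: every purchase row gains a region column, but no new rows are added.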
Appending or stacking
• Appending or stacking tables is effectively adding observations from
one table to another table.
• One table contains the observations from the month January and the
second table contains observations from the month February.
• The result of appending these tables is a larger one with the
observations from January as well as February.
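The January/February example can be sketched the same way; appending adds rows rather than columns (the sales figures are made up):

```python
# Appending (stacking) two monthly tables of observations into one.
january = [{"month": "Jan", "sales": 100}, {"month": "Jan", "sales": 120}]
february = [{"month": "Feb", "sales": 90}]

combined = january + february  # stack the rows of both tables
print(len(combined))  # 3
```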
Reducing the Number of Variables
• Having too many variables in your model makes the model difficult
to handle, and certain techniques don’t perform well when you
overload them with too many input variables. For instance, techniques
based on a Euclidean distance perform well only up to about 10
variables.
Turning Variables into Dummies
• Dummy variables can take only two values: true (1) or false (0).
They’re used to indicate the presence or absence of a categorical
effect that may explain the observation.
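Turning a categorical variable into dummies can be sketched by hand: one 0/1 column per category (the color values are made up for illustration):

```python
# Turning a categorical variable into 0/1 dummy variables.
colors = ["red", "blue", "red", "green"]
categories = sorted(set(colors))  # ['blue', 'green', 'red']

# One dummy column per category: 1 if the row has that value, else 0.
dummies = [{c: int(value == c) for c in categories} for value in colors]
print(dummies[0])  # {'blue': 0, 'green': 0, 'red': 1}
```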
Data integration
• Data integration enriches data sources by combining information from multiple data
sources.
Merging/Joining Data Sets
Merging or joining data sets involves combining two or more datasets based on a common
field. This allows you to create a new dataset that includes data from both of the original
datasets. There are different types of joins, including:
Inner join: This keeps only the rows that have matches in both datasets.
Left join: This keeps all the rows from the left dataset, and matching rows from the right
dataset. Rows in the right dataset that don't have a match in the left dataset will have null
values in the corresponding columns.
Right join: This is the opposite of a left join. It keeps all the rows from the right dataset, and
matching rows from the left dataset. Rows in the left dataset that don't have a match in the
right dataset will have null values in the corresponding columns.
Full join: This keeps all the rows from both datasets, regardless of whether there is a match in
the other dataset. Rows that don't have a match in the other dataset will have null values in the
corresponding columns.
CONT….
Example of Merging Data Sets
Imagine you have two datasets:
Customer dataset: This dataset includes columns for customer ID,
customer name, and email address.
Order dataset: This dataset includes columns for order ID, customer ID,
product ID, and order amount.
You can merge these two datasets on the customer ID field.
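The customer/order merge above can be sketched with plain dicts, contrasting an inner join with a left join (the IDs and amounts are made up):

```python
# Merging orders with customers on customer_id: inner vs left join.
customers = {101: "Alice", 102: "Bob", 103: "Carol"}  # id -> name
orders = [{"order_id": 1, "customer_id": 101, "amount": 50},
          {"order_id": 2, "customer_id": 104, "amount": 20}]

# Inner join: keep only orders whose customer_id matches a customer.
inner = [{**o, "name": customers[o["customer_id"]]}
         for o in orders if o["customer_id"] in customers]

# Left join (orders on the left): keep every order; an unmatched
# customer_id gets a null (None) name.
left = [{**o, "name": customers.get(o["customer_id"])} for o in orders]

print(len(inner), len(left))  # 1 2
```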
Set Operators
• Set operators are used to perform operations on sets of data. Common
set operators include:
• Union: This operator returns the combined set of all unique values
from two sets.
• Intersection: This operator returns the values that are common to both
sets.
• Difference: This operator returns the values that are in one set but not
in the other set.
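All three set operators map directly onto Python's built-in set type:

```python
# Union, intersection, and difference with Python's built-in sets.
a = {1, 2, 3, 4}
b = {3, 4, 5}

print(sorted(a | b))  # union: [1, 2, 3, 4, 5]
print(sorted(a & b))  # intersection: [3, 4]
print(sorted(a - b))  # difference: [1, 2]
```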
Data exploration
Cont…
•Simple graphs: These are the most common type of graph, and they show
the relationship between two variables. Some examples of simple graphs
include bar graphs, line graphs, and pie charts.
•Combined graphs: These graphs combine two or more simple graphs into a
single chart. This can be useful for showing multiple data sets or for
comparing different trends.
Cont…
•Link and brush: This technique allows you to link data between multiple graphs.
•Non-graphical techniques: There are also non-graphical ways to represent data, such
as tables. These can be useful for presenting complex data sets or for data that is not
easily visualized in a graph.
Data modeling or model building
Using machine learning and statistical techniques to achieve your
project goal.
most models consist of the following main steps:
1. Selection of a modeling technique and variables to enter in the model
2. Execution of the model
3. Diagnosis and model comparison
Presentation and automation
Types of data
Qualitative data consist of words (Yes or No), letters (Y or N), or
numerical codes (0 or 1) that represent a class or category.
Ranked data consist of numbers (1st, 2nd, . . . 40th place) that represent
relative standing within a group.
Quantitative data consist of numbers (weights of 238, 170, . . . 185 lbs)
that represent an amount or a count.
To determine the type of data, focus on a single observation in any
collection of observations.
TYPES OF VARIABLES
• Discrete and Continuous Variables: Quantitative variables can be
further distinguished as discrete or continuous.
• A discrete variable consists of isolated numbers separated by gaps.
• Examples : Counts- such as the number of children in a family. (1, 2,
3, etc., but never 1.5)
• These variables cannot have fractional or decimal values. You can
have 20 or 21 cats, but not 20.5
• The number of heads in a sequence of coin tosses. The result of rolling
a die.
• The number of patients in a hospital.
• The population of a country.
continuous variable
• A continuous variable consists of numbers whose values, at least in theory,
have no restrictions.
• Continuous variables can assume any numeric value and can be
meaningfully split into smaller parts.
• Consequently, they have valid fractional and decimal values. In fact,
continuous variables have an infinite number of potential values between any
two points.
• Generally, you measure them using a scale. Examples of continuous
variables include weight, height, length, time, and temperature. Durations,
such as the reaction times of grade school children to a fire alarm; and
standardized test scores, such as those on the Scholastic Aptitude Test (SAT).
Frequency distribution (Tables)
• A frequency distribution is a table produced by sorting observations
into classes and showing their frequency (f) of occurrence in each
class.
• Frequency distribution is used to organize the collected data in table
form. The data could be marks scored by students, temperatures of
different towns, points scored in a volleyball match, etc. After data
collection, we have to show data in a meaningful manner for better
understanding. Organize the data in such a way that all its features are
summarized in a table.
frequency
• Let's consider an example to understand this better. The following are
the scores of 10 students in the G.K. quiz released by Mr. Chris 15, 17,
20, 15, 20, 17, 17, 14, 14, 20. Let's represent this data in frequency
distribution and find out the number of students who got the same
marks.
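The quiz-score example above can be tabulated with collections.Counter:

```python
# Frequency distribution of the G.K. quiz scores, using collections.Counter.
from collections import Counter

scores = [15, 17, 20, 15, 20, 17, 17, 14, 14, 20]
freq = Counter(scores)
for score in sorted(freq):
    print(score, freq[score])
# 14 2
# 15 2
# 17 3
# 20 3
```

Each row shows a score and the number of students (f) who obtained it.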
Cont….
• There are two types of frequency distributions -grouped and
ungrouped.
frequency distributions for Ungrouped data
frequency distributions for grouped data
Guidelines for Frequency Distributions
Cont….
OUTLIERS
• An outlier is an extremely high or extremely low data point relative to
the nearest data point and the rest of the neighboring co-existing
values in a data graph or dataset you're working with.
• Outliers are extreme values that stand out greatly from the overall
pattern of values in a dataset or graph.
RELATIVE FREQUENCY DISTRIBUTIONS
• Relative frequency distributions show the frequency of each class as a
part or fraction of the total frequency for the entire distribution.
CUMULATIVE FREQUENCY DISTRIBUTIONS
• Cumulative frequency distributions show the total number of
observations in each class and in all lower ranked classes. Cumulative
frequencies are usually converted, in turn, to cumulative percentages.
Cumulative percentages are often referred to as percentile ranks.
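Both ideas can be computed in a few lines; the class labels and frequencies below are an illustrative distribution, not data from the text:

```python
# Relative frequencies and cumulative percentages for a distribution.
classes = [("14", 2), ("15", 2), ("17", 3), ("20", 3)]
total = sum(f for _, f in classes)

running = 0
for label, f in classes:
    running += f
    rel = f / total                  # relative frequency (fraction of total)
    cum_pct = 100 * running / total  # cumulative percentage (percentile rank)
    print(label, rel, cum_pct)
```

The last cumulative percentage is always 100, since every observation falls in or below the top class.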
Fundamentals of Data Science -Artificial Intelligence
Describing Data with Averages
• MODE
The mode reflects the value of the most frequently occurring score. In
other words, a mode is the value that has the highest frequency in a
given set of values: the value that appears most often.
Example: In the given set of data 2, 4, 5, 5, 6, 7, the mode of the data
set is 5, since it appears in the set twice.
Types of Modes
• Bimodal, trimodal and multimodal (more than one mode): when there
are two modes in a data set, the set is called bimodal.
• For example, the mode of set A = {2,2,2,3,4,4,5,5,5} is 2 and 5, because
both 2 and 5 are repeated three times in the given set.
• When there are three modes in a data set, the set is called trimodal.
• For example, the mode of set A = {2,2,2,3,4,4,5,5,5,7,8,8,8} is 2, 5 and 8.
• When there are four or more modes in a data set, the set is called
multimodal.
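Python's statistics module handles both cases: mode() returns a single mode, while multimode() returns every mode, so it works for the bimodal and trimodal sets above:

```python
# mode() for a single mode, multimode() for bimodal/trimodal sets.
from statistics import mode, multimode

single = mode([2, 4, 5, 5, 6, 7])
bimodal = multimode([2, 2, 2, 3, 4, 4, 5, 5, 5])
trimodal = multimode([2, 2, 2, 3, 4, 4, 5, 5, 5, 7, 8, 8, 8])

print(single)    # 5
print(bimodal)   # [2, 5]
print(trimodal)  # [2, 5, 8]
```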
Cont….
• Example: The following table represents the number of wickets taken
by a bowler in 10 matches. Find the mode of the given set of data.
MEDIAN
• The median reflects the middle value when observations are ordered
from least to most.
• The median splits a set of ordered observations into two equal parts,
the upper and lower halves.
• Finding the median: order scores from least to most.
If the total number of observations given is odd, then the formula to
calculate the median is:
Median = {(n+1)/2}th term/observation
If the total number of observations is even, then the median formula is:
Median = 1/2 [ (n/2)th term + {(n/2)+1}th term ]
Example 1:
Find the median of the following: 4, 17, 77, 25, 22, 23, 92, 82, 40, 24, 14,
12, 67, 23, 29.
Solution:
n = 15. When we put those numbers in order
we have: 4, 12, 14, 17, 22, 23, 23, 24, 25, 29, 40, 67, 77, 82, 92.
Median = {(n+1)/2}th term = {(15+1)/2}th term = 8th term
The 8th term in the list is 24, so the median value of this set of numbers is 24.
Example 2:
Find the median of the following: 9, 7, 2, 11, 18, 12, 6, 4.
Solution:
n = 8. When we put those numbers in order
we have: 2, 4, 6, 7, 9, 11, 12, 18.
Median = 1/2 [ (n/2)th term + {(n/2)+1}th term ]
= 1/2 [ (8/2)th term + ((8/2)+1)th term ]
= 1/2 [4th term + 5th term]   (in our list the 4th term is 7 and the 5th term is 9)
= 1/2 [7 + 9] = 1/2 (16) = 8
The median value of this set of numbers is 8.
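Both worked examples can be checked with statistics.median, which applies the odd/even formulas automatically:

```python
# statistics.median reproduces both worked median examples.
from statistics import median

odd = [4, 17, 77, 25, 22, 23, 92, 82, 40, 24, 14, 12, 67, 23, 29]
even = [9, 7, 2, 11, 18, 12, 6, 4]

print(median(odd))   # 24
print(median(even))  # 8.0
```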
MEAN
• The mean is found by adding all scores and then dividing by the
number of scores.
• Mean is the average of the given numbers and is calculated by
dividing the sum of given numbers by the total number of numbers.
Types of means
• Sample mean
• Population mean
Sample Mean
• The sample mean is a central tendency measure.
• The arithmetic average is computed using samples or random values
taken from the population.
• It is evaluated as the sum of all the sample variables divided by the
total number of variables.
Population Mean
• The population mean can be calculated by the sum of all values in the
given data/population divided by a total number of values in the given
data/population.
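The arithmetic is identical in both cases: sum the values and divide by how many there are; what differs is whether the values are a sample or the whole population. A sketch with made-up height data:

```python
# Mean of a whole population vs mean of a sample drawn from it.
from statistics import mean

population = [170, 165, 180, 175, 160]
sample = population[:3]  # a sample drawn from the population

print(mean(population))  # 170
print(mean(sample))      # about 171.67
```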
AVERAGES FOR QUALITATIVE AND RANKED
DATA
• Mode: the mode can always be used with qualitative data.
• Median: the median can be used whenever it is possible to order
qualitative data from least to most, because the level of measurement
is ordinal.
More Related Content

PPTX
Data science.chapter-1,2,3
PPTX
Data science unit1
PDF
CS3352-Foundations of Data Science Notes.pdf
PPTX
Data Science topic and introduction to basic concepts involving data manageme...
PDF
Data Science Introduction and Process in Data Science
PPTX
Introduction to Big Data Analytics
PPTX
Unit 1-Data Science Process Overview.pptx
PPTX
Introducition to Data scinece compiled by hu
Data science.chapter-1,2,3
Data science unit1
CS3352-Foundations of Data Science Notes.pdf
Data Science topic and introduction to basic concepts involving data manageme...
Data Science Introduction and Process in Data Science
Introduction to Big Data Analytics
Unit 1-Data Science Process Overview.pptx
Introducition to Data scinece compiled by hu

Similar to Fundamentals of Data Science -Artificial Intelligence (20)

PDF
Data Science Unit1 AMET.pdf
PDF
Module 2 Data Collection and Management.pdf
PPTX
Machine Learning: A Fast Review
PDF
machinelearning-191005133446.pdf
PDF
Bringing a data mindset to your reporting - Brant Houston - Illinois NewsTrai...
PPTX
big data and machine learning ppt.pptx
PPTX
DS103 - Unit03DS103 - Unit03DS103 - Unit03.pptx
PDF
GDPR for Things - ThingsCon Amsterdam 2017
PPTX
Data Science presentation for explanation of numpy and pandas
PPTX
Data analytics using Scalable Programming
PPTX
Machine_Learning_VTU_6th_Semester_Module_2.1.pptx
PPTX
Unit – 1 introduction to big datannj.pptx
PPTX
What is Data?
PPTX
classIX_DS_Teacher_Presentation.pptx
PDF
chapter 2 Data Science.pdf emerging ecnology freshman course
PPTX
Data Science Introduction to Data Science
PPTX
Data quality and data profiling
PPTX
basic of data science and big data......
PPTX
1 UNIT-DSP.pptx
PDF
Introduction to Data Analytics, AKTU - UNIT-1
Data Science Unit1 AMET.pdf
Module 2 Data Collection and Management.pdf
Machine Learning: A Fast Review
machinelearning-191005133446.pdf
Bringing a data mindset to your reporting - Brant Houston - Illinois NewsTrai...
big data and machine learning ppt.pptx
DS103 - Unit03DS103 - Unit03DS103 - Unit03.pptx
GDPR for Things - ThingsCon Amsterdam 2017
Data Science presentation for explanation of numpy and pandas
Data analytics using Scalable Programming
Machine_Learning_VTU_6th_Semester_Module_2.1.pptx
Unit – 1 introduction to big datannj.pptx
What is Data?
classIX_DS_Teacher_Presentation.pptx
chapter 2 Data Science.pdf emerging ecnology freshman course
Data Science Introduction to Data Science
Data quality and data profiling
basic of data science and big data......
1 UNIT-DSP.pptx
Introduction to Data Analytics, AKTU - UNIT-1
Ad

Recently uploaded (20)

PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
PPT
introduction to datamining and warehousing
PPT
Project quality management in manufacturing
PPTX
CH1 Production IntroductoryConcepts.pptx
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PPTX
Sustainable Sites - Green Building Construction
PPTX
Construction Project Organization Group 2.pptx
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PPTX
Internet of Things (IOT) - A guide to understanding
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PPTX
Artificial Intelligence
PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PDF
R24 SURVEYING LAB MANUAL for civil enggi
PDF
PPT on Performance Review to get promotions
PDF
Unit I ESSENTIAL OF DIGITAL MARKETING.pdf
PPTX
UNIT 4 Total Quality Management .pptx
PDF
Embodied AI: Ushering in the Next Era of Intelligent Systems
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
introduction to datamining and warehousing
Project quality management in manufacturing
CH1 Production IntroductoryConcepts.pptx
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
Sustainable Sites - Green Building Construction
Construction Project Organization Group 2.pptx
Foundation to blockchain - A guide to Blockchain Tech
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
Internet of Things (IOT) - A guide to understanding
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
Artificial Intelligence
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
R24 SURVEYING LAB MANUAL for civil enggi
PPT on Performance Review to get promotions
Unit I ESSENTIAL OF DIGITAL MARKETING.pdf
UNIT 4 Total Quality Management .pptx
Embodied AI: Ushering in the Next Era of Intelligent Systems
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
Ad

Fundamentals of Data Science -Artificial Intelligence

  • 1. UNIT I INTRODUCTION 9 Need for data science - benefits and uses - facets of data - data science process - setting the research goal - retrieving data - cleaning, integrating, and transforming data - exploratory data analysis - build the models - presenting and building applications - Frequency distributions - Outliers - relative frequency distributions - cumulative frequency distributions - frequency distributions for nominal data - interpreting distributions - graphs-averages - mode - median - mean - averages for qualitative and ranked data.
  • 2. Introduction to data science Definition for data science: Data Science is an interdisciplinary filed that seeks to extract knowledge or insights from various forms of data. Data science combines three areas of expertise: business knowledge statistical analysis computer science
  • 3. Cont.….. • Imagine you have a giant bag of candy (data). You know there are chocolates, lollipops, and gummies in there, but it's all mixed up (messy data). • A data scientist is like a kid who sorts the candy (data cleaning). They separate the chocolates, lollipops, and gummies (data organization). Then, they count how many of each kind there are (data analysis). This way, you know exactly how much chocolate you have to eat (get insights from data).
  • 4. Big data Big data is an evolving term that describes any amount of structured, semi structured and unstructured data that has the potential to be mined for information. Structured data- Structured data exists in a predefined format. Relational database consisting of tables with rows and columns is one of the best examples of structured data. Example: excel files and Google Docs spreadsheets. unstructured data- Unstructured data does not exists in a predefined format . Example: legal documents, audio, chats, video, images, text on a web page Characteristics: Volume-The name 'Big Data' itself is related to a size which is enormous. Velocity-The term 'velocity' refers to the speed of generation of data. Variety-Variety refers to heterogeneous sources and the nature of data, both structured and unstructured.
  • 5. Difference between big data and data science Big data Data science Big data is an evolving term that describes any amount of structured, semi structured and unstructured data that has the potential to be mined for information. Data Science is an interdisciplinary filed that seeks to extract knowledge or insights from various forms of data. Applications: Social media Healthcare Finance Applications: Shopping online Movies and music Weather forecasting
  • 6. Benefits and uses of data science • Anomaly detection: fraud, disease and crime • Classification: An email server classifying emails as important • Forecasting : sales, revenue and customer retention • Recognition : Facial, voice, text • Recommendation : recommendation engines can refer user to movies, restaurants and books
  • 7. facets of data The main categories of data are these: 1. Structured- Structured data is when data is in a standardized format. Example:  Dates  Phone numbers  ZIP codes  Customer names  Product inventories  Point-of-sale (POS) transaction information
  • 9. Cont… 2. Unstructured Unstructured data Unstructured or qualitative data — is just the opposite. It doesn’t fit nicely into a spreadsheet or database. Examples of unstructured data include: Media: Audio and video files, images files: Word docs, PowerPoint presentations, email, chat logs Social Media: Data from social networking sites like Facebook, Twitter and LinkedIn Mobile data: Text messages, locations Communications: Chat, call recordings
  • 10. Cont….. 3. Natural language Natural language is a special type of unstructured data; • No clear rules: There are no boxes or lines to follow in natural language, unlike a form. It's like trying to understand a friend's joke without knowing the whole story (ambiguous). • Many meanings: One word can have different meanings depending on the situation. • Learning limitations: Computers are good at learning from data, but natural language is just too messy and complex sometimes, even for the best computers (models struggle with new situations). • Finding key points: Like summarizing a long article for you (text summarization). • Figuring out the main topic: Understanding if someone is talking about sports or music (topic recognition). • Knowing how someone feels: Telling if a message is happy or angry (sentiment analysis).
  • 11. Cont…. 4. Machine-generated Machine-generated data is information that’s automatically created by a computer, process, application, or other machine without human intervention.
  • 12. Cont… 5. Graph-based The graph structures use nodes, edges, and properties to represent and store graphical data. Graph-based data is a natural way to represent social networks, and its structure allows you to calculate specific metrics such as the influence of a person and the shortest path between two people.
  • 15.  Setting the research goals and creating a project charter What does the company expect you to do? And why does management place such a value on your research? Is it part of a bigger strategic picture or a “lone wolf” project originating from an opportunity someone detected? Answering these three questions (what, why, how) is the goal of the first phase. prepare a project charter. This charter contains information such as what you’re going to research, how the company benefits from that, what data and resources you need, a timetable, and deliverables. Spend time understanding the goals and context of your research ✓ Create a project charter ✓ A project charter requires teamwork, and your input covers at least the following: A clear research goal ❖ The project mission and context How you’re going to perform your analysis What ❖ ❖ resources you expect to use Proof that it’s an achievable project, or proof of concepts ❖ ❖ Deliverables and a measure of success
  • 16. Retrieving data The second step is to collect data. Data can be stored in many forms, ranging from simple text files to tables in a database. Start with data stored within the company ✓ Don’t be afraid to shop around ✓ Do data quality checks now to prevent problems later ✓
  • 17. External Data • If data isn’t available inside your organization, look outside your organizations. Companies provide data so that you, in turn, can enrich their services and ecosystem. • Such is the case with Twitter, LinkedIn, and Facebook. More and more governments and organizations share their data for free with the world.
  • 20. Cont…. • Data collection is an error-prone process: • In this phase you enhance the quality of the data and prepare it for use in subsequent steps. This phase consists of three subphases: ❖ data cleansing removes false values from a data source and inconsistencies across data sources .
  • 22. Mistakes during data entry • Mistakes during data entry are errors that occur while inputting information into a system or database. These errors can include various types: 1.Typos: These are simple mistakes where a wrong key or combination of keys is pressed, resulting in incorrect characters or numbers being entered. For example, typing "hte" instead of "the". 2.Accidental Data Entry: This happens when incorrect data is entered unintentionally. For instance, entering a wrong date, such as "2022" instead of "2023". 3.Human Error: This encompasses a range of mistakes due to human factors such as misinterpretation of data, misunderstanding instructions, or incorrect application of rules during entry.
• 23. Redundant white space • Redundant white space refers to extra spaces, tabs, or other whitespace characters that are unintentionally included in text fields. • String function: use the strip() function to remove leading and trailing spaces in text fields. • Impossible values: Expected range: typically, human body temperature ranges from 36.1°C to 37.2°C. Impossible value: finding a record with a temperature of 150°C. You can manually review and correct these values, or you can set a rule to automatically exclude them from your analysis.
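As a minimal Python sketch (the name field and the body-temperature cutoffs are the illustrative examples from above, not a universal rule), the strip() cleanup and an impossible-value filter might look like this:

```python
# Sketch: cleaning redundant white space and filtering impossible values.
# The field values and the 36.1-37.2 range are illustrative assumptions.

raw_names = ["  Alice", "Bob  ", "  Carol  "]
cleaned = [name.strip() for name in raw_names]  # strip() removes leading/trailing spaces
print(cleaned)  # ['Alice', 'Bob', 'Carol']

temperatures = [36.6, 37.0, 150.0, 36.2]  # 150.0 is physically impossible
valid = [t for t in temperatures if 36.1 <= t <= 37.2]
print(valid)  # [36.6, 37.0, 36.2]
```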
• 24. Missing values • Missing values are pieces of information that are supposed to be in your dataset but are not there for some reason. For example, if you have a list of people and their ages, but some ages are not recorded or are blank, those are missing values. • How to handle missing values? ❖ Ignore the whole row ❖ Guessing ❖ Fill in with other data ❖ Use special methods
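Two of these strategies can be sketched with the pandas library (the age values are made up; filling with the mean is one common way of "guessing"):

```python
import pandas as pd
import numpy as np

# Sketch of common missing-value strategies using pandas; the data is illustrative.
ages = pd.Series([25, np.nan, 31, np.nan, 40])

dropped = ages.dropna()                  # ignore the rows with missing values
filled_mean = ages.fillna(ages.mean())   # "guess" by filling with the mean
filled_zero = ages.fillna(0)             # fill in with other data (a default)

print(dropped.tolist())      # [25.0, 31.0, 40.0]
print(filled_mean.tolist())  # [25.0, 32.0, 31.0, 32.0, 40.0]
```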
  • 25. Outliers • Outliers are data points that are very different from other data points in a dataset. They are values that are unusually far from the majority of the data. These can happen because of errors in data collection, measurement errors.
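One simple way to flag such points (a rule of thumb, not the only method; the data and the cutoff of 3 median absolute deviations are illustrative assumptions) is to look for values far from the median:

```python
import statistics

# Sketch: flagging outliers as points far from the median.
# The dataset and the 3-MAD cutoff are illustrative, not a universal standard.
data = [12, 13, 12, 14, 13, 12, 98]  # 98 looks suspicious

med = statistics.median(data)                          # 13
mad = statistics.median([abs(x - med) for x in data])  # median absolute deviation
outliers = [x for x in data if abs(x - med) > 3 * mad]
print(outliers)  # [98]
```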
• 26. Data transformation ❖ Data transformation ensures that the data is in a suitable format for use in your models. The different ways of combining data: you can perform two operations to combine information from different data sets. ❖ Joining ❖ Appending or stacking
  • 27. Joining • Joining tables allows you to combine the information of one observation found in one table with the information that you find in another table. The focus is on enriching a single observation. • Let’s say that the first table contains information about the purchases of a customer and the other table contains information about the region where your customer lives. • Joining the tables allows you to combine the information.
  • 28. Appending or stacking • Appending or stacking tables is effectively adding observations from one table to another table. • One table contains the observations from the month January and the second table contains observations from the month February. • The result of appending these tables is a larger one with the observations from January as well as February.
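The January/February stacking described above can be sketched with pandas (the column names and values are illustrative assumptions):

```python
import pandas as pd

# Sketch: appending (stacking) two monthly tables with pandas.concat.
january = pd.DataFrame({"customer": ["A", "B"], "amount": [100, 150]})
february = pd.DataFrame({"customer": ["A", "C"], "amount": [120, 90]})

combined = pd.concat([january, february], ignore_index=True)
print(len(combined))                    # 4: January's rows plus February's
print(combined["customer"].tolist())    # ['A', 'B', 'A', 'C']
```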
• 29. Reducing the Number of Variables • Having too many variables in your model makes the model difficult to handle, and certain techniques don’t perform well when you overload them with too many input variables. For instance, all the techniques based on a Euclidean distance perform well only up to 10 variables.
• 30. Turning Variables into Dummies • Dummy variables can only take two values: true (1) or false (0). They’re used to indicate the absence or presence of a categorical effect that may explain the observation.
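A quick sketch with pandas (the "region" column is an illustrative assumption): each category becomes its own 0/1 column.

```python
import pandas as pd

# Sketch: turning a categorical variable into dummy (0/1) columns
# with pandas.get_dummies; the "region" values are made up.
region = pd.Series(["north", "south", "north"])
dummies = pd.get_dummies(region, dtype=int)

print(dummies["north"].tolist())  # [1, 0, 1]
print(dummies["south"].tolist())  # [0, 1, 0]
```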
• 31. Data integration • Data integration enriches data sources by combining information from multiple data sources. Merging/joining data sets involves combining two or more datasets based on a common field. This allows you to create a new dataset that includes data from both of the original datasets. There are different types of joins: • Inner join: keeps only the rows that have matches in both datasets. • Left join: keeps all the rows from the left dataset and the matching rows from the right dataset. Left rows that don't have a match in the right dataset get null values in the columns that come from the right dataset. • Right join: the opposite of a left join. It keeps all the rows from the right dataset and the matching rows from the left dataset. Right rows that don't have a match in the left dataset get null values in the columns that come from the left dataset. • Full join: keeps all the rows from both datasets, regardless of whether there is a match in the other dataset. Rows without a match in the other dataset get null values in the columns that come from that dataset.
  • 32. CONT…. Example of Merging Data Sets Imagine you have two datasets: Customer dataset: This dataset includes columns for customer ID, customer name, and email address. Order dataset: This dataset includes columns for order ID, customer ID, product ID, and order amount. You can merge these two datasets on the customer ID field.
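The customer/order merge described above can be sketched with pandas (the column names and values are illustrative assumptions):

```python
import pandas as pd

# Sketch of the customer/order merge on the customer ID field.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ada", "Ben", "Cy"],
})
orders = pd.DataFrame({
    "order_id": [101, 102, 103],
    "customer_id": [1, 1, 3],
    "amount": [20.0, 35.0, 15.0],
})

# Inner join: keep only rows whose customer_id appears in both tables.
merged = pd.merge(orders, customers, on="customer_id", how="inner")
print(merged["name"].tolist())  # ['Ada', 'Ada', 'Cy']
```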
  • 33. Set Operators • Set operators are used to perform operations on sets of data. Common set operators include: • Union: This operator returns the combined set of all unique values from two sets. • Intersection: This operator returns the values that are common to both sets. • Difference: This operator returns the values that are in one set but not in the other set.
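All three operators are built into Python's set type (the values below are illustrative):

```python
# Sketch: union, intersection, and difference with Python's built-in sets.
a = {1, 2, 3, 4}
b = {3, 4, 5}

print(a | b)  # union: {1, 2, 3, 4, 5}
print(a & b)  # intersection: {3, 4}
print(a - b)  # difference: {1, 2}
```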
• 35. Cont… • Simple graphs: These are the most common type of graph, and they show the relationship between two variables. Some examples of simple graphs include bar graphs, line graphs, and pie charts. • Combined graphs: These graphs combine two or more simple graphs into a single chart. This can be useful for showing multiple data sets or for comparing different trends.
• 36. Cont… • Link and brush: This technique allows you to link data between multiple graphs. • Non-graphical techniques: There are also non-graphical ways to represent data, such as tables. These can be useful for presenting complex data sets or for data that is not easily visualized in a graph.
• 37. Data modeling or model building • Model building uses machine learning and statistical techniques to achieve your project goal. Most model building consists of the following main steps: 1. Selection of a modeling technique and of the variables to enter into the model 2. Execution of the model 3. Diagnosis and model comparison
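The three steps can be sketched with a hand-rolled simple linear regression (the data is a made-up toy example; a real project would typically use a library such as scikit-learn):

```python
# Sketch of the three model-building steps on illustrative toy data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# 1. Selection: one predictor (x) and a straight-line model y = a*x + b.
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# 2. Execution: fit the model by least squares.
a = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
    / sum((xi - mean_x) ** 2 for xi in x)
b = mean_y - a * mean_x

# 3. Diagnosis: inspect the errors of the fitted model.
errors = [yi - (a * xi + b) for xi, yi in zip(x, y)]
print(a, b)                          # 2.0 0.0
print(max(abs(e) for e in errors))   # 0.0 -- a perfect fit on this toy data
```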
• 39. Types of data Qualitative data consist of words (Yes or No), letters (Y or N), or numerical codes (0 or 1) that represent a class or category. Ranked data consist of numbers (1st, 2nd, . . . 40th place) that represent relative standing within a group. Quantitative data consist of numbers (weights of 238, 170, . . . 185 lbs) that represent an amount or a count. To determine the type of data, focus on a single observation in any collection of observations.
• 40. TYPES OF VARIABLES • Discrete and Continuous Variables: quantitative variables can be further distinguished as discrete or continuous. • A discrete variable consists of isolated numbers separated by gaps. • Examples: counts, such as the number of children in a family (1, 2, 3, etc., but never 1.5). • These variables cannot have fractional or decimal values: you can have 20 or 21 cats, but not 20.5. • Other examples: the number of heads in a sequence of coin tosses, the result of rolling a die, the number of patients in a hospital, the population of a country.
• 41. continuous variable • A continuous variable consists of numbers whose values, at least in theory, have no restrictions. • Continuous variables can assume any numeric value and can be meaningfully split into smaller parts. • Consequently, they have valid fractional and decimal values; in fact, continuous variables have an infinite number of potential values between any two points. • Generally, you measure them using a scale. Examples of continuous variables include weight, height, length, time, and temperature; durations, such as the reaction times of grade school children to a fire alarm; and standardized test scores, such as those on the Scholastic Aptitude Test (SAT).
• 42. Frequency distribution (Tables) • A frequency distribution is a collection of observations produced by sorting observations into classes and showing their frequency (f) of occurrence in each class. • Frequency distributions are used to organize collected data in table form. The data could be marks scored by students, temperatures of different towns, points scored in a volleyball match, etc. After data collection, we have to present the data in a meaningful manner for better understanding, organizing it so that all its features are summarized in a table.
• 43. frequency • Let's consider an example to understand this better. The following are the scores of 10 students in the G.K. quiz released by Mr. Chris: 15, 17, 20, 15, 20, 17, 17, 14, 14, 20. Let's represent this data as a frequency distribution and find out the number of students who got the same marks.
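The tally for these scores can be sketched with Python's collections.Counter:

```python
from collections import Counter

# The quiz scores from the example above, tallied into a frequency table.
scores = [15, 17, 20, 15, 20, 17, 17, 14, 14, 20]
freq = Counter(scores)

for score in sorted(freq):
    print(score, freq[score])
# 14 2
# 15 2
# 17 3
# 20 3
```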
• 44. Cont…. • There are two types of frequency distributions: grouped and ungrouped.
  • 45. frequency distributions for Ungrouped data
  • 47. Guidelines for Frequency Distributions
• 49. OUTLIERS • An outlier is an extremely high or extremely low data point relative to the nearest data point and the rest of the neighboring values in the dataset or graph you're working with. • In other words, outliers are extreme values that stand out greatly from the overall pattern of values in a dataset or graph.
  • 50. RELATIVE FREQUENCY DISTRIBUTIONS • Relative frequency distributions show the frequency of each class as a part or fraction of the total frequency for the entire distribution.
  • 51. CUMULATIVE FREQUENCY DISTRIBUTIONS • Cumulative frequency distributions show the total number of observations in each class and in all lower ranked classes. Cumulative frequencies are usually converted, in turn, to cumulative percentages. Cumulative percentages are often referred to as percentile ranks.
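Both ideas can be sketched in Python using the quiz-score frequencies from the earlier example (scores 14, 15, 17, 20 with frequencies 2, 2, 3, 3):

```python
from collections import Counter
from itertools import accumulate

# Sketch: relative and cumulative frequencies for the earlier quiz scores.
scores = [15, 17, 20, 15, 20, 17, 17, 14, 14, 20]
freq = Counter(scores)
total = sum(freq.values())

classes = sorted(freq)
relative = [freq[c] / total for c in classes]             # fraction of the total
cumulative = list(accumulate(freq[c] for c in classes))   # running totals

print(relative)    # [0.2, 0.2, 0.3, 0.3]
print(cumulative)  # [2, 4, 7, 10]
```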
• 53. Describing Data with Averages • MODE The mode reflects the value of the most frequently occurring score. In other words, the mode is defined as the value with the highest frequency in a given set of values: the value that appears most often. Example: in the data set 2, 4, 5, 5, 6, 7, the mode is 5, since it appears in the set twice.
• 54. Types of Modes • Bimodal, trimodal and multimodal (more than one mode). When there are two modes in a data set, the set is called bimodal. • For example, the mode of set A = {2,2,2,3,4,4,5,5,5} is 2 and 5, because both 2 and 5 are repeated three times in the given set. • When there are three modes in a data set, the set is called trimodal. • For example, the mode of set A = {2,2,2,3,4,4,5,5,5,7,8,8,8} is 2, 5 and 8. • When there are four or more modes in a data set, the set is called multimodal.
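The bimodal example above can be checked with statistics.multimode (available since Python 3.8), which returns all of the most frequent values:

```python
import statistics

# The bimodal set from the example above: 2 and 5 each occur three times.
set_a = [2, 2, 2, 3, 4, 4, 5, 5, 5]
print(statistics.multimode(set_a))  # [2, 5]
```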
  • 55. Cont…. • Example: The following table represents the number of wickets taken by a bowler in 10 matches. Find the mode of the given set of data.
• 56. MEDIAN • The median reflects the middle value when observations are ordered from least to most. • The median splits a set of ordered observations into two equal parts, the upper and lower halves. • Finding the median: order the scores from least to most. If the total number of observations is odd, the median is the ((n+1)/2)th term. If the total number of observations is even, the median is the average of the two middle terms: Median = 1/2 [ (n/2)th term + ((n/2)+1)th term ]
• 57. Example 1: Find the median of the following: 4, 17, 77, 25, 22, 23, 92, 82, 40, 24, 14, 12, 67, 23, 29. Solution: n = 15. When we put those numbers in order we have: 4, 12, 14, 17, 22, 23, 23, 24, 25, 29, 40, 67, 77, 82, 92. Median = ((n+1)/2)th term = ((15+1)/2)th = 8th term. The 8th term in the ordered list is 24, so the median of this set of numbers is 24.
• 58. Example 2: Find the median of the following: 9, 7, 2, 11, 18, 12, 6, 4. Solution: n = 8. When we put those numbers in order we have: 2, 4, 6, 7, 9, 11, 12, 18. Median = 1/2 [ (n/2)th term + ((n/2)+1)th term ] = 1/2 [ 4th term + 5th term ] (in our ordered list the 4th term is 7 and the 5th term is 9) = 1/2 (7 + 9) = 1/2 (16) = 8. The median of this set of numbers is 8.
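Both examples can be checked with Python's statistics module, which orders the data and applies the odd/even rule for you:

```python
import statistics

# The two median examples from above, verified with statistics.median.
odd = [4, 17, 77, 25, 22, 23, 92, 82, 40, 24, 14, 12, 67, 23, 29]
even = [9, 7, 2, 11, 18, 12, 6, 4]

print(statistics.median(odd))   # 24 (the middle of 15 ordered values)
print(statistics.median(even))  # 8.0 (the average of the two middle values, 7 and 9)
```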
• 59. MEAN • The mean is found by adding all scores and then dividing by the number of scores. • In other words, the mean is the average of the given numbers, calculated by dividing their sum by how many numbers there are. Types of means • Sample mean • Population mean
• 60. Sample Mean • The sample mean is a measure of central tendency. • It is the arithmetic average computed from a sample of values taken from the population. • It is evaluated as the sum of all the sampled values divided by the number of values in the sample.
• 61. Population Mean • The population mean is calculated as the sum of all values in the given data/population divided by the total number of values in the given data/population.
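A small sketch of the two means (the values are illustrative, and taking the first four items stands in for a sample; a real sample would be drawn at random):

```python
import statistics

# Sketch: population mean vs. sample mean; the data is illustrative.
population = [2, 4, 4, 4, 5, 5, 7, 9]  # all values in the population
sample = population[:4]                 # a "sample" (a real one would be random)

print(statistics.mean(population))  # 5 -- sum 40 divided by 8 values
print(statistics.mean(sample))      # 3.5 -- sum 14 divided by 4 values
```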
  • 62. AVERAGES FOR QUALITATIVE AND RANKED DATA • Mode The mode always can be used with qualitative data. • Median The median can be used whenever it is possible to order qualitative data from least to most because the level of measurement is ordinal.