Business Analytics & Data Science
By Dr. Radha Raman Chandan
• Data Science has become one of the most in-demand jobs of the 21st century. Every organization is looking for candidates with knowledge of data science. This unit covers:
• What is data science
• Need of data science
• Data science jobs
• Data science applications
• Data science tools
• Solving problems with data science
What is Structured Data vs. Unstructured Data?
Structured and unstructured data compare as follows:
• Flexibility: Structured data is less flexible and schema-dependent; unstructured data has no schema, so it is more flexible.
• Scalability: A structured database schema is hard to scale; unstructured data is more scalable.
• Performance: Structured data gives higher performance; unstructured data performs worse than structured data.
• Nature: Structured data is quantitative, i.e., it consists of hard numbers or things that can be counted; unstructured data is qualitative (based on description), as it cannot be processed and analyzed using conventional tools.
• Format: Structured data has a predefined format; unstructured data comes in a variety of formats, shapes, and sizes.
• Analysis: Structured data is easy to search; searching unstructured data is more difficult.
Comparing structured, unstructured, and semi-structured data:
• Organization: Structured data is well organised; unstructured data is not organised at all; semi-structured data is partially organised.
• Flexibility and scale: Structured data is less flexible, difficult to scale, and schema-dependent. Unstructured data is flexible, scalable, and schema-independent. Semi-structured data is more flexible and simpler to scale than structured data, but less so than unstructured data.
• Basis: Structured data is based on relational databases; unstructured data on character and binary data; semi-structured data on XML/RDF.
• Versioning: Structured data can be versioned over tuples, rows, and tables; unstructured data is versioned as a whole; semi-structured data allows versioning over tuples.
• Analysis: Structured data is easy to analyse; unstructured data is difficult to analyse; semi-structured data is harder to analyse than structured data but easier than unstructured data.
• Examples: Financial data and bar codes (structured); media logs, videos, and audio (unstructured); tweets organised by hashtags and folders organised by topic (semi-structured).
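To make the contrast concrete, here is a small illustrative Python sketch; the customer record and field names are invented for this example and do not come from the slides.

```python
# Illustrative only: the same customer information in three forms.

# Structured: fixed schema, like a relational-database row.
structured_row = {"customer_id": 101, "name": "Asha", "purchases": 4}

# Unstructured: free text with no schema; conventional tools
# cannot query it directly.
unstructured_note = "Asha called on Monday, happy with her fourth purchase."

# Semi-structured: self-describing tags (JSON-like here; XML/RDF on the slide),
# partially organised and easier to extend than a rigid schema.
semi_structured = {
    "customer": {"id": 101, "name": "Asha"},
    "activity": [{"type": "purchase", "count": 4}],
}

print(structured_row["purchases"])            # easy, schema-based lookup
print("purchase" in unstructured_note)        # crude keyword search only
print(semi_structured["activity"][0]["count"])
```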
What is Data Science?
• Data science is a deep study of the massive amount of data, which involves extracting
meaningful insights from raw, structured, and unstructured data.
• It is a multidisciplinary field that uses tools and techniques to manipulate the data so that
you can find something new and meaningful.
• Data science uses the most powerful hardware, programming systems, and most efficient
algorithms to solve data-related problems.
• It is the future of Artificial Intelligence.
• We can say that data science is all about:
Analyzing the raw data.
Modeling the data using various complex and efficient algorithms.
Understanding the data to make better decisions and find the final result.
Visualizing the data to get a better perspective.
• Data Science processes raw data to solve business problems and even make predictions about future trends or requirements.
For example, from the huge raw data of a company, data science can help answer questions such as:
• What do customers want?
• How can we improve our services?
• What will be the upcoming trend in sales?
• How much stock do they need for the upcoming festival?
Data science empowers industries to make smarter, faster, and more informed decisions.
Need for Data Science:
• Data science handles, processes, and analyzes huge amounts of data using powerful and efficient algorithms.
Following are some main reasons for using data science technology:
• It converts messy and unstructured data into meaningful data.
• Data science is being adopted by various companies, whether a big brand or a startup, e.g., Google, Facebook, Netflix, and Amazon.
• It is used to automate transportation, such as self-driving cars.
• It helps in different predictions, such as various surveys, electronic polling, and flight or railway ticket confirmation.
Data Science Components:
• Statistics:
• Statistics is one of the most important components of data science. It is a way to collect and analyze numerical data in large amounts.
• Domain Expertise:
• Domain expertise means specialized knowledge or skills in a particular area. In data science, there are various areas for which we need domain experts.
• Data engineering:
• Data engineering involves acquiring, storing, retrieving, and transforming data. It also includes adding metadata (data about data) to the data.
• Visualization: Representation of data in a visual context so that people can easily understand the significance of the data. Data visualization makes it easy to grasp a huge amount of data at a glance (a small sketch follows these bullets).
• Mathematics: Mathematics is a critical part of data science. For a data scientist, a good knowledge of mathematics is essential.
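As a minimal illustration of the visualization point above, here is a sketch assuming matplotlib is installed; the monthly sales figures are invented.

```python
import matplotlib.pyplot as plt

# Hypothetical monthly sales figures, invented for illustration.
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 160, 150]

plt.bar(months, sales)                  # a simple bar chart
plt.title("Monthly sales (illustrative)")
plt.ylabel("Units sold")
plt.show()                              # the trend is obvious at a glance
```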
Tools for Data Science
Following are some tools required for data science:
• Data Analysis tools: R, Python, Statistics, SAS, Jupyter, R Studio,
MATLAB, Excel, RapidMiner.
• Data Warehousing: SQL, Hadoop
• Data Visualization tools: R, Jupyter, Tableau.
• Machine learning tools: Spark, Mahout, Azure ML studio.
Data Scientist
• Data Scientists are highly skilled professionals responsible for gathering, cleaning, and analyzing data to extract valuable insights.
• They create actionable recommendations based on their analysis, helping organizations make informed decisions.
• Data Scientists also construct predictive models that guide critical business strategies.
• They handle diverse data sources, apply statistical methods, and frequently code in Python or R.
• Their vital role involves converting raw data into actionable insights, contributing to organizational success.
Data-Driven Decision-Making Culture
A data-driven culture is a concept where the collection, analysis, and interpretation of data form the basis for decision-making and the formulation of corporate strategies.
This approach allows organizations to:
• Identify market trends;
• Predict future scenarios;
• Improve understanding of the consumer profile;
• Increase operational efficiency;
• Increase innovation capacity.
Data Analysis Challenges
Data analysis is a crucial asset for businesses across various industries, enabling them to extract valuable insights from data for informed decision-making.
However, the path to successful data analytics is filled with challenges.
A Key Performance Indicator (KPI) measures the organization's progress toward its objectives.
Difference between BI and Data Science
• BI stands for business intelligence, which is also used for data analysis of
business information:
• Data source: Business intelligence deals with structured data, e.g., a data warehouse; data science deals with structured and unstructured data, e.g., weblogs, feedback, etc.
• Method: Business intelligence is analytical (historical data); data science is scientific (it goes deeper to find the reasons behind what the data reports).
• Skills: Statistics and visualization are the two skills required for business intelligence; statistics, visualization, and machine learning are the skills required for data science.
• Focus: Business intelligence focuses on both past and present data; data science focuses on past and present data as well as future predictions.
Applications of Data Science
• Banking: Many banks have adopted AI-based systems to provide customer support and to detect anomalies and credit card fraud.
• Finance: Businesses rely heavily on computers and data scientists to anticipate future patterns in the market.
• Agriculture: Organizations are using automation and robotics to help farmers find more efficient ways to protect their crops from weeds.
• Health Care: Many organizations and medical care centers rely on AI. There are many examples of how AI in healthcare has helped patients worldwide.
• Poor data quality can significantly impact the accuracy and reliability of BI insights.
• Here are its top five causes.
1. Human error
• Mistakes made during manual data entry can lead to inaccuracies.
Interpretation of this incorrect data can lead to faulty conclusions.
2. Inconsistent data
• Different data sources may use inconsistent formats and
definitions. Without standardized data formats and definitions,
analysis obstacles can arise.
3. Outdated data
• Outdated data can lead to inaccurate insights and poor business
decisions. Over time, data can become outdated and irrelevant.
4. Incomplete data
• Missing data can lead to incomplete analysis and
inaccurate results.
• Gaps in data can hinder trend analysis and forecasting.
5. Duplicate data
• Duplicate records can lead to confusion and inaccurate
analysis.
• Duplicates may have conflicting information.
Why do you need high-quality data in BI?
• Accurate decision-making:
• High-quality data ensures that insights derived from
BI tools are accurate and reliable.
• This leads to better decision-making and strategic
planning.
• Efficiency:
• Clean and consistent data reduces the time and effort
needed to prepare data for analysis.
• Trust:
• Reliable data helps maintain trust within the business
and with customers by ensuring that data is
accurately recorded and analyzed.
• Cost savings:
• Poor data quality can lead to costly errors and
inefficiencies.
• High data quality helps avoid these issues, saving money
in the long run.
• Competitive advantage:
• Businesses that leverage high-quality data can gain a
competitive edge by making more informed and timelier
decisions.
What is Data Preprocessing?
• Data preprocessing is the process of transforming raw
data into an understandable format.
• It is also an important step in data mining as we cannot
work with raw data.
• The preprocessing of data is mainly to check the data
quality.
• The quality of the data should be checked before
applying machine learning or data mining algorithms.
Why is Data Preprocessing Important?
Data quality can be checked along the following dimensions (a small sketch follows the list):
• Accuracy: whether the data entered is correct.
• Completeness: whether the data is available or went unrecorded.
• Consistency: whether the same data matches everywhere it is kept.
• Timeliness: whether the data is updated correctly and on time.
• Believability: whether the data is trustworthy.
• Interpretability: whether the data is easily understood.
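A minimal sketch of how such quality checks might look in practice, assuming pandas is available; the column names, records, and the 0–120 age rule are invented for illustration.

```python
import pandas as pd

# Invented example records; 'age' has gaps and one impossible value.
df = pd.DataFrame({
    "customer": ["A", "B", "B", "C"],
    "age": [34, None, None, 250],
})

# Completeness: is the data available, or not recorded?
print(df.isna().sum())                 # missing values per column

# Accuracy: is the data entered plausible? (rule invented here)
print(df[(df["age"] < 0) | (df["age"] > 120)])

# Consistency: does the same record appear in several places?
print(df.duplicated().sum())           # count of duplicate rows
```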
Major Tasks in Data Preprocessing
There are 4 major tasks in data
preprocessing –
• Data cleaning,
• Data integration,
• Data reduction, and
• Data transformation.
• Data preprocessing is a basic and primary step for
converting raw data into useful information.
• In general, raw data can be incomplete, redundant, or noisy.
Data Cleaning
Data cleaning is the process of removing incorrect
data, incomplete data, and inaccurate data from the
datasets, and it also replaces the missing values.
What is data cleaning?
• Data cleaning is the process of identifying, deleting, and/or replacing inconsistent or incorrect information in the database.
• This technique ensures high quality of processed data and minimizes the risk of wrong or inaccurate conclusions.
Typical problems that data cleaning addresses:
• Invalid values: Some datasets have well-known valid values, e.g. gender must only contain "F" (Female) or "M" (Male).
• Missing values: Some features in the dataset may have blank or null values.
• Misspellings: Incorrectly written values.
• Misfielded values: When a feature contains the values of another.
Here are some techniques for data cleaning:
Handling Missing Values
• Standard values like "Not Available" or "NA" can be used to replace the missing values.
• Missing values can also be filled manually, but this is not recommended when the dataset is big.
• The attribute's mean value can be used to replace the missing value when the data is normally distributed, whereas in the case of a non-normal distribution the median value of the attribute can be used.
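A minimal pandas sketch of the mean/median replacement just described; the series values are invented, with one outlier to show why the median suits skewed data.

```python
import pandas as pd

# Invented income figures with one missing entry and one extreme outlier.
s = pd.Series([42_000, 45_000, None, 1_200_000])

# Normally distributed data: fill the gap with the mean.
filled_mean = s.fillna(s.mean())

# Skewed (non-normal) data: the median is more robust to outliers.
filled_median = s.fillna(s.median())

print(filled_mean.tolist())
print(filled_median.tolist())
```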
What is Data Imputation?
• Data imputation is the process of replacing missing or unavailable entries
in a dataset with substituted values.
• This process is crucial for maintaining the integrity of data analysis.
• Missing data can arise from various sources, and understanding the
reasons behind it is crucial for selecting the right strategy to handle it.
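As one illustration of imputation in practice, assuming scikit-learn is installed, an imputer object can fill each missing entry from its column; the feature matrix below is invented.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Invented feature matrix with missing entries (np.nan).
X = np.array([[1.0, 10.0],
              [np.nan, 12.0],
              [3.0, np.nan]])

# Replace each missing entry with its column's median.
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)
print(X_filled)
```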
The standard data cleaning process consists of the
following stages:
• Importing Data
• Merging data sets
• Rebuilding missing data
• Standardization
• Normalization
• Deduplication
• Verification & enrichment
• Exporting data
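A compressed sketch of several of these stages using pandas; the file names, join key, and columns are placeholders, not part of the original material.

```python
import pandas as pd

# Importing data and merging data sets (paths are placeholders).
orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")
df = orders.merge(customers, on="customer_id", how="left")

# Rebuilding missing data.
df["city"] = df["city"].fillna("Unknown")

# Standardization (consistent casing) and deduplication.
df["city"] = df["city"].str.strip().str.title()
df = df.drop_duplicates()

# Verification, then exporting data.
assert df["customer_id"].notna().all()
df.to_csv("orders_clean.csv", index=False)
```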
What is Data-Driven Decision Making?
• Data-driven decision-making is the process of using data and insights to enhance the quality and improve the accuracy of decisions.
Challenges in Data-Driven Decision Making
• Decisions informed by data can guide organizations in
achieving their goals.
1. Poor Quality of Data
• Any data that is inaccurate, incomplete and out of date
is of no use to the organization.
• Having poor-quality data only increases your risk of
making bad decisions, eventually leading to a loss in
revenue.
2. Dirty data
• Dirty data contains inaccurate information. The types of dirty data are listed below.
• Inaccurate: Data can be technically correct yet still inaccurate for the organization's purposes.
• Duplicate: Duplicate data may be the result of repeated submissions, incorrect data joining, etc.
• Inconsistent: Inconsistent data is often caused by redundant data.
• Incomplete: Data with missing values.
• Business rule violation: Data that violates one of the organization's business rules.
3. Integrating data
• Data integration is the process of combining data
from different sources and storing it together to
obtain a unified view.
• Data integration problems are likely to be caused by inconsistent data within an organization.
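For instance, a minimal pandas sketch of unifying two sources that disagree on date format; the tables and columns are invented for illustration.

```python
import pandas as pd

# Two invented sources that disagree on format: one classic
# integration problem alluded to above.
crm = pd.DataFrame({"id": [1, 2], "signup": ["2024-01-05", "2024-02-10"]})
web = pd.DataFrame({"id": [3], "signup": ["10/03/2024"]})  # DD/MM/YYYY

# Standardize each source to one schema before combining.
crm["signup"] = pd.to_datetime(crm["signup"], format="%Y-%m-%d")
web["signup"] = pd.to_datetime(web["signup"], format="%d/%m/%Y")

# A unified view of both sources.
unified = pd.concat([crm, web], ignore_index=True)
print(unified)
```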
Why is data integration important?
• Enhances collaboration: Provides access to essential and newly generated
data, streamlining business processes and reducing manual tasks.
• Saves time: Automates data preparation and analysis, eliminating hours of
manual data gathering.
• Improves data quality: Implements precise cleansing like profiling and
validation, ensuring reliable data for confident decision-making and
simplifying quality control.
• Boosts data security: Consolidates data in one location, enhancing
protection with access controls, encryption, and authentication through
modern integration software.
• Supports flexibility: Allows organizations to use a variety of tools at different
stages of the integration process, promoting openness and adaptability in
their data management systems.
Types of data integration
• ETL
• ETL (Extract, Transform, Load) is a widely used data pipeline process that converts raw
data into a unified dataset for business purposes.
• The process begins by extracting data from multiple sources such as databases,
applications, and files.
• Then, data is transformed through various cleansing operations (selecting specific
columns, translating values, joining, sorting, and ordering) in the staging area.
• Finally, this data is loaded into a data warehouse.
Here’s how it works:
• Extract: Pull data from multiple sources (e.g., databases, files, APIs)
• Transform: Clean, standardize, and restructure the data on a separate server
• Load: Transfer the transformed data into your data warehouse
ETL ensures that only processed and refined data enters your warehouse, which can be
beneficial for maintaining data quality and consistency.
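A toy end-to-end sketch of the ETL pattern in Python with pandas; the source file, column names, and the CSV standing in for the warehouse are all assumptions made for this example.

```python
import pandas as pd

# Extract: pull data from a source (a CSV file stands in for a database).
raw = pd.read_csv("sales_raw.csv")

# Transform: cleanse in a staging area before anything reaches the warehouse.
staged = (
    raw[["order_id", "amount", "country"]]               # select specific columns
    .dropna(subset=["order_id"])                         # drop unusable rows
    .assign(country=lambda d: d["country"].str.upper())  # translate values
    .sort_values("order_id")                             # sorting and ordering
)

# Load: only the processed, refined data enters the "warehouse".
staged.to_csv("warehouse_sales.csv", index=False)
```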
• ELT
• ELT (Extract, Load, Transform), compared to ETL, is a data pipeline
without the staging area.
• Data is immediately loaded and transformed into a cloud-based system.
• This technique is a better fit for large data sets that need quick processing, and it suits data lakes well (a toy sketch follows the data streaming note below).
• ELT is a more recent approach that has gained popularity with the rise of
cloud computing and big data.
• Extract: Pull data from various sources
• Load: Transfer raw data directly into your data warehouse or lake
• Transform: Clean and restructure the data within the warehouse itself
• Data streaming
• Data streaming technology allows data to be processed in real time as it flows
continuously from one source to another. This enables immediate analysis and decision-
making without waiting for all data to be collected first.
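Returning to ELT: a toy sketch in which raw data is loaded first and then transformed inside the warehouse itself; SQLite stands in for a cloud warehouse, and the table and columns are invented.

```python
import sqlite3
import pandas as pd

# Extract, then Load the raw data as-is (SQLite stands in for a
# cloud warehouse; no staging area).
conn = sqlite3.connect("warehouse.db")
raw = pd.DataFrame({"order_id": [1, 2, 2], "amount": [10.0, None, 20.0]})
raw.to_sql("raw_orders", conn, if_exists="replace", index=False)

# Transform inside the warehouse itself, after loading.
clean = pd.read_sql_query(
    """
    SELECT DISTINCT order_id, amount
    FROM raw_orders
    WHERE amount IS NOT NULL
    """,
    conn,
)
print(clean)
conn.close()
```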
4. Uncertainty of data
• Uncertainty can occur for many reasons, including measurement errors, processing errors, etc.
• When using real-world data, errors and uncertainty should be expected.
5. Transforming data
• Data transformation can be described as converting data from one format to another.
• Even when the entire dataset can be converted into a usable form, a few things can still go wrong.
Key Components
The data transformation process involves several key components:
1.Data Cleaning: This involves removing duplicates, correcting
errors, and handling missing values.
2.Data Standardization: Ensuring consistency across different data
sources and formats.
3.Data Validation: Verifying the accuracy and integrity of the data.
4.Data Structuring: Organizing the data into a format that’s
suitable for analysis.
Types of data transformation, with a description and an example of each:
• Constructive: adds or replicates data, e.g., copying customer data to create a new marketing list.
• Destructive: removes unnecessary data, e.g., deleting outdated product information.
• Esthetic: standardizes data values, e.g., converting all date formats to YYYY-MM-DD.
• Structural: reorganizes the data structure, e.g., combining 'First Name' and 'Last Name' columns into a single 'Full Name' column.
Organizations should follow these five principles to cultivate good-quality data:
1. Accuracy
2. Completeness
3. Consistency
4. Uniqueness
5. Timeliness
Phases of Data Analytics Life Cycle
• The Data Analytics Life Cycle defines how data moves through a series of phases for professionals working on a project.
• It is a step-by-step procedure arranged in a circular structure. Each phase has its own characteristics and importance.
Phase 1: Discovery
• This initial phase defines the data's purpose and how to complete the data analytics life cycle.
• It identifies all the critical objectives and builds an understanding of the business domain.
Phase 2: Data Preparation
• This phase will involve collecting, processing and cleansing
the data prior to modeling and analysis.
Common tools used are — Hadoop, Spark, OpenRefine, etc.
Phase 3: Model Planning
• In this phase, the team will analyze the quality of the data
and find an appropriate model for the project.
Phase 4: Model Building
• The team will develop training, testing, and production datasets in this phase.
• Once this is done, the team will build and execute the models.
Phase 5: Communicate Results
• This phase determines whether the results are a success or a failure.
Phase 6: Operationalize
• In the final phase, the team will present the full in-depth report with the briefings, code, key findings, and all the technical documents and papers to the stakeholders.
Data Science Vs Business Analytics: Skills and Tools Required