2. Data science has become the most in-demand job of the 21st century. Every organization is looking for candidates with knowledge of data science.
Topics covered: What is Data Science, Need for Data Science, Data Science Jobs, Data Science Applications, Data Science Tools, Solving Problems with Data Science.
3. What is Structured Data vs. Unstructured Data?
5. Structured Data vs. Unstructured Data
• Flexibility: Structured data is less flexible and schema-dependent. Unstructured data has no schema, so it is more flexible.
• Scalability: A structured database schema is hard to scale. Unstructured data is more scalable.
• Performance: Performance with structured data is higher; performance with unstructured data is lower.
• Nature: Structured data is quantitative, i.e., it consists of hard numbers or things that can be counted. Unstructured data is qualitative (based on description), as it cannot be processed and analyzed using conventional tools.
• Format: Structured data has a predefined format. Unstructured data comes in a variety of shapes and sizes.
• Analysis: Structured data is easy to search. Searching unstructured data is more difficult.
6. Structured vs. Unstructured vs. Semi-structured Data
• Organisation: Structured data is well organised; unstructured data is not organised at all; semi-structured data is partially organised.
• Flexibility and scalability: Structured data is less flexible, difficult to scale, and schema-dependent. Unstructured data is flexible, scalable, and schema-independent. Semi-structured data is more flexible and simpler to scale than structured data, but less so than unstructured data.
• Basis: Structured data is based on the relational database model. Unstructured data is based on character and binary data. Semi-structured data is based on XML/RDF.
• Versioning: For structured data, versioning is possible over tuples, rows, and tables. For unstructured data, versioning applies to the data as a whole. For semi-structured data, versioning over tuples is possible.
• Analysis: Structured data is easy to analyse. Unstructured data is difficult to analyse. Semi-structured data is harder to analyse than structured data but easier than unstructured data.
• Examples: Financial data and bar codes are examples of structured data. Media logs, videos, and audio files are examples of unstructured data. Tweets organised by hashtags and folders organised by topic are examples of semi-structured data.
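To make the contrast concrete, here is a minimal Python sketch; the records, field names, and values below are invented purely for illustration:

```python
# Purely illustrative records showing the three kinds of data.
import json
import pandas as pd

# Structured: tabular, schema-dependent (fits a relational model).
structured = pd.DataFrame({"customer_id": [101, 102], "amount": [250.0, 99.5]})

# Semi-structured: partially organised (JSON here); fields may vary per record.
semi_structured = json.loads(
    '[{"tweet": "Great launch!", "hashtags": ["#launch"]},'
    ' {"tweet": "Loved it", "hashtags": ["#launch", "#demo"], "lang": "en"}]'
)

# Unstructured: free text (could equally be an image, audio, or video file).
unstructured = "Customer called to say the product arrived late but works fine."

print(structured.dtypes)           # fixed schema
print(semi_structured[1])          # flexible, nested fields
print(len(unstructured.split()))   # needs extra processing before analysis
```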
8. What is Data Science?
• Data science is a deep study of the massive amount of data, which involves extracting
meaningful insights from raw, structured, and unstructured data.
• It is a multidisciplinary field that uses tools and techniques to manipulate the data so that
you can find something new and meaningful.
• Data science uses the most powerful hardware, programming systems, and most efficient
algorithms to solve data-related problems.
• It is the future of Artificial Intelligence.
• In short, data science is all about:
Analyzing the raw data.
Modeling the data using various complex and efficient algorithms.
Understanding the data to make better decisions and finding the final result.
Visualizing the data to get a better perspective.
9. • Data science processes raw data to solve business problems and even make predictions about future trends or requirements.
For example, from the huge raw data of a company, data science can help answer questions such as:
• What do customers want?
• How can we improve our services?
• What will be the upcoming trend in sales?
• How much stock is needed for the upcoming festival?
10. Data science empowers industries to make smarter, faster, and more informed decisions.
11. Need for Data Science:
• Data science handles, processes, and analyzes huge amounts of data using powerful and efficient algorithms.
Following are some main reasons for using data science technology:
• It converts messy and unstructured data into meaningful data.
• Data science is being adopted by various companies, whether a big brand or a startup: Google, Facebook, Netflix, Amazon.
• It is used for automating transportation, such as self-driving cars.
• It helps in different predictions, such as:
• Various surveys
• Electronic polling
• Flight or railway ticket confirmation
14. Data Science Components:
• Statistics:
• Statistics is one of the most important components of data science. Statistics is a way to collect and analyze numerical data in large amounts.
• Domain Expertise:
• Domain expertise means specialized knowledge or skills of a particular
area. In data science, there are various areas for which we need domain
experts.
• Data engineering:
• It involves acquiring, storing, retrieving, and transforming data. Data engineering also includes adding metadata (data about data) to the data.
15. Data Science Components:
• Visualization: Representation of data in a visual context so that people can easily understand its significance. Data visualization makes it easy to take in a huge amount of data at a glance.
• Mathematics: Mathematics is a critical part of data science. For a data scientist, a good knowledge of mathematics is essential.
17. Tools for Data Science
Following are some tools required for data science:
• Data Analysis tools: R, Python, Statistics, SAS, Jupyter, R Studio,
MATLAB, Excel, RapidMiner.
• Data Warehousing: SQL, Hadoop
• Data Visualization tools: R, Jupyter, Tableau.
• Machine learning tools: Spark, Mahout, Azure ML studio.
21. Data Scientist
• Data Scientists are highly skilled professionals responsible for
• Gathering, Cleaning, and
• Analyzing data to extract valuable insights.
• They create actionable recommendations based on their analysis,
helping organizations make informed decisions.
• Data Scientists also construct predictive models that guide critical
business strategies.
• Data Scientists handle diverse data sources, apply statistical
methods, and frequently code in Python or R.
• Their vital role involves
• converting raw data into actionable insights,
• contributing to organizational success.
22. Data-Driven Decision-Making Culture
A data-driven culture is a concept where the collection, analysis, and interpretation of data form the basis for decision-making and the formulation of corporate strategies.
This approach allows organizations to:
• Identify market trends;
• Predict future scenarios;
• Improve understanding of the consumer profile;
• Increase operational efficiency;
• Increase innovation capacity.
23. Data Analysis Challenges
Data analysis is a crucial asset for businesses across various industries, enabling them to extract valuable insights from data for informed decision-making.
However, the path to successful data analytics is filled with challenges.
24. A Key Performance Indicator (KPI) measures the organization's progress toward its objectives.
26. Difference between BI and Data Science
• BI stands for business intelligence, which is also used for data analysis of business information.
• Data source: Business intelligence deals with structured data, e.g., a data warehouse. Data science deals with structured and unstructured data, e.g., weblogs, feedback, etc.
• Method: Business intelligence is analytical (historical data). Data science is scientific (goes deeper to know the reason behind the data report).
• Skills: Statistics and visualization are the two skills required for business intelligence. Statistics, visualization, and machine learning are the skills required for data science.
• Focus: Business intelligence focuses on both past and present data. Data science focuses on past data, present data, and also future predictions.
27. Applications of Data Science
• Banking: Many banks have adopted AI-based systems to provide customer support and to detect anomalies and credit card fraud.
• Finance: Businesses rely heavily on computers and data scientists to determine future patterns in the market.
• Agriculture: Organizations are using automation and robotics to help farmers find more efficient ways to protect their crops from weeds.
• Health Care: Many organizations and medical care centers rely on AI. There are many examples of how AI in healthcare has helped patients worldwide.
31. • Poor data quality can significantly impact the accuracy and reliability of BI insights.
• Here are its top five causes.
32. 1. Human error
• Mistakes made during manual data entry can lead to inaccuracies.
Interpretation of this incorrect data can lead to faulty conclusions.
2. Inconsistent data
• Different data sources may use inconsistent formats and
definitions. Without standardized data formats and definitions,
analysis obstacles can arise.
3. Outdated data
• Outdated data can lead to inaccurate insights and poor business
decisions. Over time, data can become outdated and irrelevant.
33. 4. Incomplete data
• Missing data can lead to incomplete analysis and
inaccurate results.
• Gaps in data can hinder trend analysis and forecasting.
5. Duplicate data
• Duplicate records can lead to confusion and inaccurate
analysis.
• Duplicates may have conflicting information.
34. Why do you need high-quality data in BI?
• Accurate decision-making:
• High-quality data ensures that insights derived from
BI tools are accurate and reliable.
• This leads to better decision-making and strategic
planning.
• Efficiency:
• Clean and consistent data reduces the time and effort
needed to prepare data for analysis.
• Trust:
• Reliable data helps maintain trust within the business
and with customers by ensuring that data is
accurately recorded and analyzed.
35. • Cost savings:
• Poor data quality can lead to costly errors and
inefficiencies.
• High data quality helps avoid these issues, saving money
in the long run.
• Competitive advantage:
• Businesses that leverage high-quality data can gain a
competitive edge by making more informed and timelier
decisions.
38. What is Data Preprocessing?
• Data preprocessing is the process of transforming raw
data into an understandable format.
• It is also an important step in data mining as we cannot
work with raw data.
• The preprocessing of data is mainly to check the data
quality.
• The quality of the data should be checked before
applying machine learning or data mining algorithms.
39. Why is Data Preprocessing Important?
The quality can be checked by the following:
• Accuracy: To check whether the data entered is correct or not.
• Completeness: To check whether the data is available or whether values are missing.
• Consistency: To check whether the same data matches across all the places where it is kept.
• Timeliness: The data should be updated correctly.
• Believability: The data should be trustworthy.
• Interpretability: The data should be understandable.
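As a rough illustration, a couple of these checks can be automated with pandas. The sketch below assumes a hypothetical file customers.csv and only covers completeness and consistency; accuracy, timeliness, and believability usually need domain-specific rules.

```python
import pandas as pd

# Hypothetical input file; column names are not prescribed by the slides.
df = pd.read_csv("customers.csv")

# Completeness: share of missing values per column.
print(df.isna().mean().sort_values(ascending=False))

# Consistency: duplicated rows and column data types.
print("duplicate rows:", df.duplicated().sum())
print(df.dtypes)
```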
42. Major Tasks in Data Preprocessing
There are 4 major tasks in data
preprocessing –
• Data cleaning,
• Data integration,
• Data reduction, and
• Data transformation.
43. • Data preprocessing is a basic and primary step for
converting raw data into useful information.
• In general, raw data can be incomplete, redundant, or noisy.
46. Data Cleaning
Data cleaning is the process of removing incorrect, incomplete, and inaccurate data from datasets, and it also replaces missing values.
47. What is data cleaning?
• Data cleaning is the process of
• identifying,
• deleting,
• and/or replacing inconsistent or incorrect
information from the database.
• This technique ensures
• high quality of processed data and
• minimizes the risk of wrong or inaccurate
conclusions.
48. Data Cleaning
Here are some techniques for data cleaning:
49. • Handling Missing Values
• Standard values like “Not Available” or “NA” can be used
to replace the missing values.
• Missing values can also be filled manually, but this is not recommended when the dataset is big.
• Data Cleaning
• Invalid values: Some datasets have well-known values, e.g. gender must
only have “F” (Female) and “M” (Male).
• Missing values: Some features in the dataset may have blank or null values.
• Misspellings: Incorrectly written values.
• Misfielded values: When a feature contains the values of another.
51. Data Cleaning
Here are some techniques for data cleaning:
Handling Missing Values
• Standard values like “Not Available” or “NA” can be used to replace the missing values.
• Missing values can also be filled manually, but this is not recommended when the dataset is big.
• The attribute’s mean value can be used to replace the missing value when the data is normally distributed, whereas in the case of a non-normal distribution the median value of the attribute can be used.
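A minimal pandas sketch of the mean/median rule just described, using a hypothetical numeric column 'age':

```python
import pandas as pd

# Hypothetical data with missing entries.
df = pd.DataFrame({"age": [25, 30, None, 41, None, 29]})

# Mean imputation: reasonable when the attribute is roughly normally distributed.
df["age_mean_filled"] = df["age"].fillna(df["age"].mean())

# Median imputation: preferred when the distribution is skewed (non-normal).
df["age_median_filled"] = df["age"].fillna(df["age"].median())

print(df)
```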
52. What is Data Imputation?
• Data imputation is the process of replacing missing or unavailable entries
in a dataset with substituted values.
• This process is crucial for maintaining the integrity of data analysis.
53. • Missing data can arise from various sources, and understanding the
reasons behind it is crucial for selecting the right strategy to handle it.
54. The standard data cleaning process consists of the
following stages:
• Importing Data
• Merging data sets
• Rebuilding missing data
• Standardization
• Normalization
• Deduplication
• Verification & enrichment
• Exporting data
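The stages above can be sketched end to end in pandas; the file names and columns below (orders.csv, customers.csv, amount, country) are assumptions made only for illustration:

```python
import pandas as pd

orders = pd.read_csv("orders.csv")                            # importing data
customers = pd.read_csv("customers.csv")
df = orders.merge(customers, on="customer_id", how="left")    # merging data sets

df["amount"] = df["amount"].fillna(df["amount"].median())     # rebuilding missing data
df["country"] = df["country"].str.strip().str.upper()         # standardization
df["amount_norm"] = (df["amount"] - df["amount"].min()) / (
    df["amount"].max() - df["amount"].min()
)                                                             # normalization (min-max)
df = df.drop_duplicates()                                     # deduplication

assert df["amount"].notna().all()                             # verification
df.to_csv("orders_clean.csv", index=False)                    # exporting data
```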
56. What is Data-Driven Decision-Making?
• Data-driven decision-making is the process of using data and insights to enhance the quality and improve the accuracy of decisions.
57. Challenges in Data-Driven Decision Making
• Decisions informed by data can guide organizations in
achieving their goals.
1. Poor Quality of Data
• Any data that is inaccurate, incomplete and out of date
is of no use to the organization.
• Having poor-quality data only increases your risk of
making bad decisions, eventually leading to a loss in
revenue.
58. 2. Dirty data
• Dirty data contains inaccurate information.
The types of dirty data are listed below.
• Inaccurate: Technically correct data can be inaccurate for the
organization in this case.
• Duplicate: The occurrence of duplicate data may be the result
of repeated submissions, incorrect data joining, etc.
• Inconsistent: Inconsistent data is often caused by redundant
data.
• Incomplete: Data with missing values is the reason for this.
• Business rule violation: A business rule is violated when this
type of data is present.
60. 3. Integrating data
• Data integration is the process of combining data
from different sources and storing it together to
obtain a unified view.
• Data integration problems are likely to be caused by inconsistent data within an organization.
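As a small illustration of unifying two sources, the sketch below combines a hypothetical CSV export and a JSON dump with pandas; the file and column names are invented:

```python
import pandas as pd

sales_csv = pd.read_csv("sales.csv")            # source 1: CSV export
sales_api = pd.read_json("sales_api.json")      # source 2: JSON dump

# Harmonize an inconsistent column name before combining the sources.
sales_api = sales_api.rename(columns={"cust": "customer_id"})

unified = pd.concat([sales_csv, sales_api], ignore_index=True)
print(unified.head())
```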
61. Why is data integration important?
• Enhances collaboration: Provides access to essential and newly generated
data, streamlining business processes and reducing manual tasks.
• Saves time: Automates data preparation and analysis, eliminating hours of
manual data gathering.
• Improves data quality: Implements precise cleansing like profiling and
validation, ensuring reliable data for confident decision-making and
simplifying quality control.
• Boosts data security: Consolidates data in one location, enhancing
protection with access controls, encryption, and authentication through
modern integration software.
• Supports flexibility: Allows organizations to use a variety of tools at different
stages of the integration process, promoting openness and adaptability in
their data management systems.
62. Types of data integration
• ETL
• ETL (Extract, Transform, Load) is a widely used data pipeline process that converts raw
data into a unified dataset for business purposes.
• The process begins by extracting data from multiple sources such as databases,
applications, and files.
• Then, data is transformed through various cleansing operations (selecting specific
columns, translating values, joining, sorting, and ordering) in the staging area.
• Finally, this data is loaded into a data warehouse.
Here’s how it works:
• Extract: Pull data from multiple sources (e.g., databases, files, APIs)
• Transform: Clean, standardize, and restructure the data on a separate server
• Load: Transfer the transformed data into your data warehouse
ETL ensures that only processed and refined data enters your warehouse, which can be
beneficial for maintaining data quality and consistency.
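A minimal ETL sketch in Python, using SQLite as a stand-in for the warehouse; the source file, columns, and table name are assumptions for illustration, and real pipelines would normally use dedicated ETL tooling:

```python
import sqlite3
import pandas as pd

# Extract: pull raw data from a source file.
raw = pd.read_csv("web_orders.csv")

# Transform: select columns, clean values, and sort (the staging-area work).
staged = (
    raw[["order_id", "customer", "amount", "order_date"]]
    .dropna(subset=["order_id"])
    .sort_values("order_date")
)
staged["customer"] = staged["customer"].str.title()

# Load: write the refined data into the "warehouse".
with sqlite3.connect("warehouse.db") as conn:
    staged.to_sql("orders", conn, if_exists="replace", index=False)
```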
63. • ELT
• ELT (Extract, Load, Transform), compared to ETL, is a data pipeline
without the staging area.
• Data is immediately loaded and transformed into a cloud-based system.
• This technique is a better fit for large data sets that need quick processing, and it pairs well with data lakes.
• ELT is a more recent approach that has gained popularity with the rise of
cloud computing and big data.
• Extract: Pull data from various sources
• Load: Transfer raw data directly into your data warehouse or lake
• Transform: Clean and restructure the data within the warehouse itself
• Data streaming
• Data streaming technology allows data to be processed in real time as it flows
continuously from one source to another. This enables immediate analysis and decision-
making without waiting for all data to be collected first.
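For contrast with ETL, here is a minimal ELT sketch: the raw data is loaded first and then transformed inside the store with SQL. SQLite again stands in for a cloud warehouse, and the file and table names are hypothetical:

```python
import sqlite3
import pandas as pd

raw = pd.read_csv("web_orders.csv")                                   # Extract

with sqlite3.connect("lake.db") as conn:
    raw.to_sql("orders_raw", conn, if_exists="replace", index=False)  # Load as-is

    # Transform inside the warehouse itself, after loading.
    conn.execute("DROP TABLE IF EXISTS orders_clean")
    conn.execute(
        """
        CREATE TABLE orders_clean AS
        SELECT order_id, UPPER(customer) AS customer, amount
        FROM orders_raw
        WHERE order_id IS NOT NULL
        """
    )
```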
64. 4. Uncertainty of data:
Uncertainty can occur for many reasons,
• including measurement errors,
• processing errors, etc.
When using real-world data, errors and uncertainty should be expected.
65. 5. Transforming data
• Data Transformation can be described as converting data from one format to another.
• Even though the entire data can be converted into a usable form, there remain a few
things that can go wrong.
66. Key Components
The data transformation process involves several key components:
1.Data Cleaning: This involves removing duplicates, correcting
errors, and handling missing values.
2.Data Standardization: Ensuring consistency across different data
sources and formats.
3.Data Validation: Verifying the accuracy and integrity of the data.
4.Data Structuring: Organizing the data into a format that’s
suitable for analysis.
68. Types of Data Transformation
• Constructive: Adds or replicates data. Example: copying customer data to create a new marketing list.
• Destructive: Removes unnecessary data. Example: deleting outdated product information.
• Esthetic: Standardizes data values. Example: converting all date formats to YYYY-MM-DD.
• Structural: Reorganizes the data structure. Example: combining ‘First Name’ and ‘Last Name’ columns into a single ‘Full Name’ column.
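Each row of the table can be mapped to a single pandas operation; the column names and values below are invented for illustration:

```python
import pandas as pd

customers = pd.DataFrame({
    "First Name": ["Ada", "Alan"],
    "Last Name": ["Lovelace", "Turing"],
    "signup": ["01/02/2023", "15/07/2022"],
    "legacy_code": ["X1", "X2"],
})

marketing_list = customers.copy()                        # Constructive: replicate data
customers = customers.drop(columns=["legacy_code"])      # Destructive: remove unneeded data
customers["signup"] = pd.to_datetime(
    customers["signup"], dayfirst=True
).dt.strftime("%Y-%m-%d")                                # Esthetic: convert dates to YYYY-MM-DD
customers["Full Name"] = (
    customers["First Name"] + " " + customers["Last Name"]
)                                                        # Structural: combine name columns

print(customers)
```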
69. Organizations should follow these 5 principles to
cultivate good quality data.
1.Accuracy
2.Completeness
3.Consistency
4.Uniqueness
5.Timeliness
70. Phases of Data Analytics Life Cycle
• The Data Analytics Life Cycle defines the process of how
information is carried out in various phases for professionals
working on a project.
• It is a step-by-step procedure that is arranged in a circular structure.
Each phase has its own characteristics and importance.
72. Phase 1: Discovery
• This is the initial phase, which defines the data's purpose and how to complete the data analytics life cycle.
• It identifies all the critical objectives and understands the
business domain.
Phase 2: Data Preparation
• This phase will involve collecting, processing and cleansing
the data prior to modeling and analysis.
Common tools used are — Hadoop, Spark, OpenRefine, etc.
Phase 3: Model Planning
• In this phase, the team will analyze the quality of the data
and find an appropriate model for the project.
73. Phase 4: Model Building
• The team will develop training, testing, and production datasets in this phase.
• Once this is done, the team will build and execute the models.
Phase 5: Communicate Results
• This phase determines whether the results of the project are a success or a failure.
Phase 6: Operationalize
• In the final phase, the team delivers the full in-depth report with the briefings, code, key findings, and all the technical documents and papers to the stakeholders.
74. Data Science Vs Business Analytics: Skills and Tools Required