DATA MANAGEMENT
UNIT 1
DATA ANALYTICS
DATA
■ Data is raw facts and numbers used to analyze something or make decisions.
■ Data that has been classified or organized so that it has some meaningful value for the
user is information.
■ Information is processed data used to make decisions and take actions.
■ Processed data must meet the following criteria for it to be of any significant use in
decision-making:
– Accuracy: The information must be accurate.
– Completeness: The information must be complete.
– Timeliness: The information must be available when it’s needed.
FORMS OF DATA
■ The form of data refers to how a collection of information is stored in a particular file.
■ Data can be of three forms:
■ STRUCTURED FORM - Any form of relational database structure where relations between
attributes are possible, i.e., there exists a relation between the rows and columns of a
table-structured database.
Eg: databases managed with database languages and systems (SQL, Oracle, MySQL, etc.).
■ UNSTRUCTURED FORM - Any form of data that does not have a predefined structure is
represented as unstructured data.
Eg: video, images, comments, posts, and some websites such as blogs and Wikipedia.
■ SEMI-STRUCTURED DATA - Does not have the tabular form of an RDBMS, but predefined
organized formats are available.
Eg: CSV, XML, JSON, text files with tab separators, etc.
BIG DATA
■ Big data describes large and diverse datasets that are huge in volume and
also rapidly grow over time.
■ Big data is used in machine learning, predictive modeling, and other
advanced analytics to solve business problems and make informed decisions.
■ Big data can be classified into structured, semi-structured, and
unstructured data.
– Structured data is highly organized and fits neatly into traditional databases.
– Semi-structured data, like JSON or XML, is partially organized.
– Unstructured data, such as text or multimedia, lacks a predefined structure.
4V PROPERTIES OF BIG DATA
Big data is a collection of data from many different sources and is often described by
four characteristics, known as the "4 V's":
■ Volume: How much data is collected
■ Velocity: How quickly data can be generated, gathered, and analyzed
■ Variety: How many points of reference are used to collect data
■ Veracity: How reliable the data is, and whether it comes from a trusted source
SOURCES OF DATA
■ There are two types of data sources available.
PRIMARY SOURCE OF DATA
■ Eg: data created by an individual or a
business concern on their own.
SECONDARY SOURCE OF DATA
■ Eg: data extracted from cloud servers and
website sources (Kaggle, UCI, AWS,
Google Cloud, Twitter, Facebook, YouTube,
GitHub, etc.)
DATA ANALYSIS vs DATA ANALYTICS
DATA ANALYSIS
■ Data analysis is a process of:
– inspecting,
– cleansing,
– transforming and
– modeling data
■ with the goal of:
– discovering useful information,
– informing conclusions and
– supporting decision-making
DATA ANALYTICS
■ Data analytics is the science of:
– analyzing raw data
■ to make conclusions about that
information.
■ This information can then be used to
optimize processes to increase the
overall efficiency of a business or
system
Data & Analytics
Data
■ Raw Facts & Figures - Unprocessed elements
representing values, observations, or descriptions.
■ Foundation of Information - When structured or
interpreted, data becomes information.
■ Digital & Analog - Exists in various forms, such as
numbers, text, images, or sounds.
Data Analytics
■ The science of analyzing raw data to make
conclusions about that information
■ Data analytics is the process of examining,
cleansing, transforming, and modeling data to
discover useful information
What is Data Analytics?
Data analytics is a multifaceted field that involves the systematic computational
analysis of data to discover patterns, infer correlations, and assist in decision-
making.
Components:
– Data Collection
– Data Processing
– Data Cleansing
– Statistical Analysis
– Interpretation
– Reporting
Objective:
To uncover patterns, correlations, and insights to
inform decision-making by
• Examination of Datasets
• Integration of Techniques
• Strategic Decision-Making
Tools & Techniques:
• Utilizes statistical analysis, predictive modeling,
machine learning, and data mining
Role of Data Analytics
■ Gather Hidden Insights – Hidden insights from data are gathered and then analyzed
with respect to business requirements.
■ Generate Reports – Reports are generated from the data and passed on to the
respective teams and individuals so they can take further action to grow the business.
■ Perform Market Analysis – Market analysis can be performed to understand the
strengths and weaknesses of competitors.
■ Improve Business Requirements – Analysis of data allows a business to better meet
customer requirements and improve customer experience.
Tools used in Data Analytics
■ R programming
■ Python
■ Tableau Public
■ QlikView
■ SAS
■ Microsoft Excel
■ RapidMiner
■ KNIME
■ OpenRefine
■ Apache Spark
Types of Data Analytics
■ Descriptive Analytics (What Happened?) – Interprets historical data to identify trends
and patterns.
■ Diagnostic Analytics (Why Did It Happen?) – Delves into data to understand causes and
effects.
■ Predictive Analytics (What Could Happen?) – Uses statistical models to forecast future
outcomes.
■ Prescriptive Analytics (What Should We Do?) – Provides recommendations on possible
courses of action.
Advantages of Data Analytics
Transforming Data into Value
■ Informed Decision-Making
– Empowers organizations to make evidence-based decisions, reducing guesswork.
– Facilitates evidence-based strategies and a data-driven approach to strategic
planning, reducing reliance on intuition or assumptions.
■ Operational Efficiency
– Helps identify inefficiencies, supported by data.
– Streamlines operations by identifying process improvements.
– Results in improved efficiency and productivity and reduced cost.
■ Customer Insights
– Enhances understanding of customer behavior and preferences for better engagement.
– Leads to better product development and customer service.
– Helps tailor products and services to individual customer needs.
■ Innovation
– Drives product development and innovation by revealing market trends.
– Provides insights to stay ahead in the market and gain leverage over competitors.
■ Cost Efficiency
– Revenue Growth: uncovers opportunities for new revenue streams.
– Cost Reduction: pinpoints areas to save resources and reduce waste.
– Supports resource optimization, process automation, predictive maintenance,
financial planning, and many more.
■ Risk Management
– Helps anticipate and mitigate risks.
– Identifies potential risks and offers mitigation strategies through predictive analytics.
– Supports strategic decision-making, real-time monitoring, and scenario planning with
a focus on risk mitigation.
Use Cases: Data Analytics in Action
Data analytics can be applied across various industries to solve specific problems, improve
efficiency, and create value.
Manufacturing
• Quality control & demand forecasting
• Supply chain optimization
• Inventory management
Banking & Finance
• Fraud detection
• Algorithmic trading
• Credit risk assessment
Business Intelligence
• Market trends
• Customer experience enhancement
• Inventory management & operational performance
Healthcare Sector
• Patient data analysis for better treatment plans
• Predicting health trends
• Providing preventative care
Transportation & Logistics
• Route optimization
• Fleet management
Energy Sector
• Demand forecasting
• Grid management
Data Architecture
What is Data Architecture?
■ Definition: A data architecture describes how data is managed, from collection
through to transformation, distribution, and consumption.
■ Blueprint: It sets the blueprint for data and the way it flows through data storage
systems.
Importance of Data Architecture
■ Foundation: It is foundational to data processing operations and artificial
intelligence (AI) applications.
■ Business-Driven: The design should be driven by business requirements.
■ Collaboration: Data architects and data engineers define the data model and
underlying data structures.
Data Architecture
Facilitating Business Needs
■ Purpose: These designs typically facilitate a business need, such as a reporting or
data science initiative.
■ Modern Approaches: Modern data architectures often leverage cloud platforms to
manage and process data.
Benefits of Cloud Platforms
■ Compute Scalability: Enables important data processing tasks to be completed
rapidly.
■ Storage Scalability: Helps to cope with rising data volumes and ensures all relevant
data is available.
■ Quality Improvement: Improves the quality of data available for training AI applications.
Data Architecture
Models
The data architecture includes three types of data model:
1. Conceptual model – A business-level model that uses the Entity Relationship (ER)
model to describe entities, their attributes, and the relations between them.
2. Logical model – Represents the problem in logical structures such as rows and
columns of data, classes, XML tags, and other DBMS constructs.
3. Physical model – Holds the database design, such as which type of database
technology will be suitable for the architecture.
FACTORS THAT INFLUENCE DATA ARCHITECTURE
Various constraints and influences affect data architecture design, including:
■ Enterprise Requirements
– Economical and effective system expansion
– Acceptable performance levels (e.g., system access speed)
– Transaction reliability and transparent data management
– Conversion of raw data into useful information (e.g., data warehouses)
– Techniques:
■ Managing transaction data vs. reference data
■ Separating data capture from data retrieval systems
FACTORS THAT INFLUENCE DATA ARCHITECTURE
Technology Drivers
■ Derived from completed data and
database architecture designs
■ Influenced by:
– Existing organizational
integration frameworks and
standards
– Organizational economics
– Available site resources (e.g.,
software licensing)
Economics
■ Critical considerations during the data
architecture phase
■ Some optimal solutions may be cost-
prohibitive
■ External factors affecting decisions:
– Business cycle
– Interest rates
– Market conditions
– Legal considerations
FACTORS THAT INFLUENCE DATA ARCHITECTURE
Business Policies
■ Influences include:
– Internal organizational policies
– Rules from regulatory bodies
– Professional standards
– Governmental laws (varying by
agency)
■ These policies dictate how the
enterprise processes data
Data Processing Needs
■ Requirements include:
– Accurate and reproducible high-
volume transactions
– Data warehousing for
management information
systems
– Support for periodic and ad hoc
reporting
– Alignment with organizational
initiatives (e.g., budgets, product
development)
DATA COLLECTION
Types of Data Sources
■ Primary Sources: Data generated directly from original sources.
■ Secondary Sources: Data collected from existing sources.
Data Collection Process
■ Involves acquiring, collecting, extracting, and storing large volumes of data.
■ Data can be structured (e.g., databases) or unstructured (e.g., text, video, audio,
XML files, images).
DATA COLLECTION
Importance in Big Data Analysis
■ Initial Step: Data collection is crucial before analyzing patterns and extracting useful
information.
■ Raw Data: Collected data is initially unrefined; it requires cleaning to become useful.
From Data to Knowledge
■ Transformation: Cleaned data leads to information, which is then transformed into
knowledge.
■ Knowledge Applications: Can pertain to various fields, such as business insights or
medical treatments.
DATA COLLECTION
Goals of Data Collection
■ Aim to gather information-rich data.
■ Start with key questions:
– What type of data needs to be collected?
– What are the sources of this data?
Types of Data Collected
■ Qualitative Data: Non-numerical, focusing on behaviors and actions (e.g., words, sentences).
■ Quantitative Data: Numerical data that can be analyzed using scientific tools.
Understanding the sources and types of data is essential for effective data collection and
subsequent analysis, leading to valuable insights and knowledge.
Understanding Data Types
Overview of Data Types
■ Data is primarily categorized into two types:
– Primary Data
– Secondary Data
Importance of Data Collection
■ Goal: To gather rich, relevant data for informed decision-making.
■ Considerations: Ensure data is valid, reliable, and suitable for analysis.
Primary Data
Definition: Raw, original data collected directly from first-hand sources.
Collection Techniques:
■ Interviews: Direct questioning of individuals (interviewee) by an interviewer.
– Can be structured or unstructured (e.g., personal, telephone, email).
■ Surveys: Gathering responses through questionnaires.
– Conducted online or offline (e.g., website forms, social media polls).
■ Observation: Researcher observes behaviors and practices.
– Data is recorded in text, audio, or video formats.
■ Experimental Method: Data collected through controlled experiments.
– Common designs include:
■ CRD (Completely Randomized Design): Randomization for comparison.
■ RBD (Randomized Block Design): Dividing experiments into blocks for analysis.
■ LSD (Latin Square Design): Arranging data in rows and columns to minimize errors.
■ FD (Factorial Design): Testing multiple variables simultaneously.
Secondary Data
Definition: Data previously collected and reused for analysis.
Sources:
■ Internal Sources: Data found within the organization (e.g., sales records, accounting
resources).
■ External Sources: Data obtained from third-party resources (e.g., government
publications, market reports).
Secondary Data
Internal Sources
Examples:
■ Accounting Resources: Financial data
for marketing analysis.
■ Sales Force Reports: Information on
product sales.
■ Internal Experts: Insights from
departmental heads.
■ Miscellaneous Reports: Operational
data.
External Sources
Examples:
■ Government Publications:
Demographic and economic data (e.g.,
Registrar General of India).
■ Central Statistical Organization:
National accounts statistics.
■ Ministry of Commerce: Wholesale
price indices.
■ International Organizations: Data from
ILO, OECD, IMF.
Data Management
What is Data Management?
■ Definition: The practice of collecting, storing, and using data securely, efficiently,
and cost-effectively.
■ Goal: To optimize data usage for individuals and organizations while adhering to
policies and regulations, enabling informed decision-making and maximizing
organizational benefits.
Key Aspects of Data Management
■ Broad Scope: Involves a variety of tasks, policies, procedures, and practices.
Core Components of Data Management
1. Data Creation and Access: Efficient methods for creating, accessing, and updating
data across diverse data tiers.
2. Data Storage Solutions: Strategies for storing data across multiple cloud
environments and on-premises systems.
3. High Availability & Disaster Recovery: Ensuring data is always accessible and
protected against loss through robust recovery plans.
4. Data Utilization: Leveraging data in various applications, analytics, and algorithms
for enhanced insights.
5. Data Privacy & Security: Implementing measures to protect data from
unauthorized access and breaches.
6. Data Archiving & Destruction: Managing data lifecycle by archiving and securely
destroying data according to retention schedules and compliance requirements.
Data Quality
What is Data Quality?
■ Definition: Data quality refers to the assessment of how usable data is and how well
it fits its intended context.
Importance of Data Quality
■ Core of Organizational Activities: High data quality is essential as it underpins all
organizational functions.
■ Consequences of Poor Data Quality: Inaccurate data leads to flawed reporting,
misguided decisions, and potential economic losses.
Data Quality
Key Factors in Measuring Data Quality
■ Data Accuracy: Ensures that stored values reflect real-world conditions.
■ Data Uniqueness: Assesses the level of unwanted duplication within datasets.
■ Data Consistency: Checks for adherence to defined semantic rules across datasets.
■ Data Completeness: Evaluates the presence of necessary values in a dataset.
■ Data Timeliness: Measures the appropriateness of data age for specific tasks.
Other factors can also be taken into consideration, such as availability, ease of
manipulation, and reliability.
Importance of Effective Data Management
■ Decision-Making: Enables organizations to make informed decisions based on
accurate and timely data.
■ Regulatory Compliance: Ensures adherence to legal and regulatory standards
regarding data handling.
■ Operational Efficiency: Streamlines data processes, reducing costs and improving
productivity.
Conclusion
■ Effective data management is essential for organizations to unlock the full
potential of their data, ensuring security, compliance, and optimal use in decision-
making processes.
Data Quality Issues
Impact of Poor Data Quality
■ Negative Effects: Poor data quality significantly hampers various processing efforts.
■ Consequences: It directly influences decision-making and can adversely affect
revenue.
Example: Credit Rating System
■ Implications of Poor Data Quality:
– Loan Rejections: Creditworthy applicants may be denied loans.
– Risky Approvals: Loans may be sanctioned to applicants with a higher
likelihood of defaulting.
Data Quality Issues
Common Data Quality Problems
1. Noise and Outliers: Irregular data points that can skew analysis.
2. Missing Values: Absence of data that can lead to incomplete insights.
3. Duplicate Data: Redundant entries that can distort results.
4. Incorrect Data: Data that is inaccurate or misleading.
Noisy Data
■ Noisy data refers to datasets that contain extraneous, meaningless information.
■ Almost all datasets will inherently include some level of unwanted noise.
■ The term is often synonymous with corrupt data or data that is difficult for machines
to interpret.
■ This type of data can be filtered and processed to enhance overall data quality.
Illustration: The Crowded Room Analogy
■ Imagine trying to listen to a conversation in a noisy, crowded room.
■ While the human brain can effectively filter out background noise, excessive noise
can hinder comprehension.
■ Similarly, as more irrelevant information is added to a dataset, it becomes
increasingly challenging to identify the desired patterns.
Outliers
Definition
■ An outlier is a data point or observation that significantly deviates from the other
observations in a dataset.
Importance in Data Analysis
■ Outliers are critical for analysts and data scientists, as they require careful
examination to avoid misleading estimations.
■ In simple terms, an outlier is an observation that appears distant and diverges from
the overall pattern within a sample.
Causes of Outliers
■ Outliers may arise from experimental errors or unique circumstances.
Outliers
Subjectivity in Identification
■ There is no strict mathematical definition for what constitutes an outlier;
determining whether an observation is an outlier is often a subjective process.
Methods of Outlier Detection
Various techniques exist for detecting outliers:
■ Graphical Methods: Such as normal probability plots.
■ Model-Based Approaches: Including box plots and other statistical models.
Outliers
Types of Outliers: outliers are of two types:
■ Univariate: These outliers can be found when we look at the distribution of a single variable.
■ Multivariate: Multivariate outliers are outliers in an n-dimensional space, i.e., across a
combination of variables.
Impact of Outliers on a dataset:
■ Increased Error Variance: Outliers can elevate error variance, weakening the power of
statistical tests.
■ Decreased Normality: Non-randomly distributed outliers can disrupt the normality of the
dataset.
■ Biased Estimates: Outliers can skew estimates that are crucial for analysis.
■ Violation of Assumptions: They can affect the fundamental assumptions of statistical
models, including Regression and ANOVA.
Outliers
Detecting Outliers
■ Visualization Methods: The most common approach for identifying outliers is
through visualization techniques, including:
– Box Plot
– Histogram
– Scatter Plot
[Figures: outliers highlighted in a histogram, a box plot, and a scatter plot]
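As an illustrative sketch (not part of the original slides), the following R snippet applies these visual and rule-based checks to a small synthetic vector; the 1.5 × IQR rule and the z-score cutoff of 3 are common but arbitrary choices.

```r
# Illustrative sketch: visual and rule-based outlier checks on a synthetic vector.
# The data, the 1.5 * IQR rule, and the z-score cutoff of 3 are assumptions.
set.seed(42)
x <- c(rnorm(100, mean = 50, sd = 5), 120, -10)  # two injected extreme values

# Visual inspection
boxplot(x, main = "Box plot")        # extreme points plotted beyond the whiskers
hist(x, breaks = 30, main = "Histogram")
plot(x, main = "Values by index")    # scatter plot; extremes stand apart

# Rule-based flags
boxplot.stats(x)$out                 # points outside 1.5 * IQR of the quartiles
z <- abs(scale(x))                   # absolute z-scores (column matrix)
x[z > 3]                             # values more than 3 SDs from the mean
```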
Outlier Treatment Methods
■ Retention:
Retaining outliers is appropriate when they represent genuine variability in the
data rather than errors. This approach is useful when:
– The outliers reflect rare but valid events or behaviors.
– The goal is to understand the full range of data, including extremes.
– Removing them would result in loss of valuable insights.
– Retention is common in exploratory analysis or when modeling real-world
phenomena where extreme values are expected.
Outlier Treatment Methods
■ Exclusion:
Exclusion involves removing outliers that are clearly due to data entry mistakes,
measurement errors, or irrelevance to the study context. Depending on the study's
objectives, it may be necessary to determine which outliers should be removed from
the dataset, as they can significantly influence the analysis results. Exclusion is
suitable when:
– The outlier is confirmed to be erroneous or not useful.
– It does not represent the population of interest.
– Its presence distorts statistical summaries or model performance.
– Exclusion should be justified and documented to maintain transparency in the
analysis.
Outlier Treatment Methods
■ Rejection:
Rejection is a more formal process, often based on statistical tests or domain-
specific rules. It is appropriate when:
– The data-generating process and error distribution are well understood.
– The outlier violates known theoretical or physical constraints.
– Statistical methods (e.g., Z-score, Grubbs' test) support its removal.
– Rejection is typically used in controlled environments like laboratory
experiments or quality control settings.
Other treatment methods
The outliers package in R provides, for example:
■ outlier(): Returns the observation that is farthest from the mean
■ scores(): Calculates normalized scores
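A minimal sketch of these functions, assuming the CRAN outliers package is installed; the sample vector is made up for illustration.

```r
# Sketch using the CRAN 'outliers' package (install.packages("outliers") first).
# The sample vector is made up for illustration.
library(outliers)

x <- c(2.1, 2.3, 1.9, 2.2, 2.0, 9.7)  # 9.7 is an obvious extreme value

outlier(x)                  # observation farthest from the mean (here 9.7)
scores(x, type = "z")       # normalized (z) scores for each observation
scores(x, type = "z") > 2   # flag values more than 2 SDs from the mean
```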
Missing Data Treatment
Impact of Missing Values
■ Missing data in the training data set can reduce the power or fit of a model, or can
lead to a biased model, because the behavior of and relationships with other
variables have not been analyzed correctly. This can lead to wrong predictions or
classifications.
Representation in R
■ Missing Values: Represented by the symbol NA (Not
Available).
■ Impossible Values: Denoted by NaN (Not a Number), e.g., the result of 0/0; dividing
a nonzero number by zero is output by R as Inf (Infinity).
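A small base-R illustration of how these values appear and are detected:

```r
# Base-R behavior for missing and impossible values.
v <- c(4, NA, 0/0, 1/0)   # NA = not available, 0/0 = NaN, 1/0 = Inf
v                         # 4  NA NaN Inf

is.na(v)                  # TRUE for both NA and NaN
is.nan(v)                 # TRUE only for NaN
is.infinite(v)            # TRUE only for Inf

mean(v)                   # NA: missing values propagate by default
mean(v[is.finite(v)])     # mean of the finite, observed values only
```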
Predictive Mean Matching (PMM) Approach
Definition:
■ PMM is a semi-parametric imputation method used to handle missing values.
Process:
■ Similar to regression, PMM fills in missing values by randomly selecting from
observed donor values.
■ It identifies the donor observation with regression-predicted values closest to those
of the missing value from the simulated regression model.
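A minimal sketch of PMM imputation, assuming the mice package (which implements PMM as method = "pmm") is installed; airquality is a built-in R dataset with missing values, and m = 5 imputations and the seed are arbitrary choices.

```r
# Sketch of PMM imputation with the 'mice' package (install.packages("mice")).
# airquality is a built-in dataset with NAs; m = 5 and the seed are arbitrary.
library(mice)

data(airquality)
colSums(is.na(airquality))                # Ozone and Solar.R contain NAs

imp <- mice(airquality, method = "pmm", m = 5, seed = 123, printFlag = FALSE)
completed <- complete(imp, 1)             # first of the five imputed datasets
colSums(is.na(completed))                 # no missing values remain
```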
Data Pre-processing
■ Data preprocessing is a data mining
technique which is used to transform the
raw data in a useful and efficient format.
■ Data preprocessing is the process of
transforming raw data into an
understandable format.
■ Data Preprocessing includes the steps we
need to follow to transform or encode data
so that it may be easily parsed by the
machine.
Importance of Data Preprocessing
Why is Data Preprocessing Essential
■ Real-World Challenges: Most real-world datasets for machine learning are often plagued
by missing values, inconsistencies, and noise due to their diverse origins.
■ Impact on Results: Applying data mining algorithms to noisy data can yield poor quality
results, as they struggle to effectively identify patterns.
■ Data Quality Improvement: Data preprocessing is crucial for enhancing overall data
quality, ensuring more reliable outcomes.
■ Statistical Accuracy: Duplicate or missing values can distort the overall statistics, leading
to misleading interpretations.
■ Model Integrity: Outliers and inconsistent data points can disrupt the model's learning
process, resulting in inaccurate predictions.
■ Quality-Driven Decisions: Effective decision-making relies on high-quality data. Data
preprocessing is vital for obtaining this quality data; without it, we risk a "Garbage In,
Garbage Out" scenario.
DATA PRE-PROCESSING STEPS
Data Cleaning
Overview of Data Cleaning
■ Definition: Data cleaning is a crucial step in data preprocessing, aimed at improving
data quality by addressing issues such as missing values, noisy data,
inconsistencies, and outliers.
Key Components of Data Cleaning
1. Missing Values
■ Solutions:
– Ignore: Suitable for large datasets with numerous missing values.
– Fill: Use methods like manual entry, regression prediction, or attribute mean.
Data Cleaning
2. Noisy Data
■ Techniques:
– Binning: Smooths data by dividing it into equal-sized bins and replacing values
with mean, median, or boundary values.
– Regression: Fits data points to a regression function to minimize noise.
– Clustering: Groups similar data points; outliers can be identified and removed.
3. Removing Outliers
■ Method: Utilize clustering techniques to identify and eliminate inconsistent data
points.
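A minimal base-R sketch of these cleaning steps on a synthetic vector: mean imputation for missing values, equal-width binning with bin-mean smoothing, and a 1.5 × IQR rule for outliers (all parameter choices are illustrative).

```r
# Base-R sketch of the cleaning steps above; the vector is synthetic.
x <- c(12, 15, NA, 14, 13, 95, 16, NA, 15, 14)

# 1. Missing values: fill with the attribute mean
x_filled <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# 2. Noisy data: bin into 3 equal-width bins and smooth by bin means
bins <- cut(x_filled, breaks = 3, labels = FALSE)
x_smoothed <- ave(x_filled, bins, FUN = mean)

# 3. Outliers: keep only points within 1.5 * IQR of the quartiles
q   <- quantile(x_filled, c(0.25, 0.75))
iqr <- diff(q)
x_clean <- x_filled[x_filled >= q[1] - 1.5 * iqr & x_filled <= q[2] + 1.5 * iqr]
```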
Data Integration
■ Definition: Merging data from multiple sources into a single data store, such as a
data warehouse.
■ Challenges:
– Schema integration and object matching.
– Removing redundant attributes.
– Resolving data value conflicts.
Data Transformation
■ Purpose: Consolidate quality data into alternative forms through various strategies.
Data transformation steps:
■ Generalization: Transforming low-level data into high-level information (e.g., city to
country).
■ Normalization: Scaling numerical attributes to fit within a specified range using
techniques like:
– Min-max normalization
– Z-score normalization
– Decimal scaling normalization
■ Attribute Selection: Creating new properties from existing attributes to enhance data
mining.
■ Aggregation: Summarizing data for easier interpretation (e.g., monthly sales data).
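A short base-R sketch of the normalization and aggregation steps; the numbers and the two-month sales table are illustrative assumptions.

```r
# Base-R sketch of the transformation steps; values are illustrative.
x <- c(200, 300, 400, 600, 950)

# Min-max normalization to [0, 1]
x_minmax <- (x - min(x)) / (max(x) - min(x))

# Z-score normalization (mean 0, standard deviation 1)
x_z <- as.vector(scale(x))

# Decimal scaling: divide by 10^j so that all scaled values lie below 1
j <- ceiling(log10(max(abs(x))))
x_decimal <- x / 10^j             # 950 -> 0.95, 200 -> 0.20, ...

# Aggregation: summarize daily sales into monthly totals
daily <- data.frame(month = rep(c("Jan", "Feb"), each = 3),
                    sales = c(10, 12, 9, 14, 15, 11))
aggregate(sales ~ month, data = daily, FUN = sum)
```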
Data Reduction
■ Definition: Reducing the size of the dataset while maintaining analytical quality.
Data reduction steps:
■ Data Cube Aggregation: Summarizing data in a compact form.
■ Dimensionality Reduction: Reducing redundant features using techniques like
Principal Component Analysis.
■ Data Compression: Reducing data size using encoding technologies (lossy vs.
lossless).
■ Discretization: Dividing continuous attributes into intervals for better interpretation.
■ Numerosity Reduction: Representing data as models or equations to save storage.
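As an illustrative sketch, dimensionality reduction with Principal Component Analysis in base R on the built-in iris measurements; keeping two components is an arbitrary choice for the example.

```r
# Dimensionality-reduction sketch with PCA in base R; iris is built in,
# and keeping two components is an arbitrary choice for the example.
num <- iris[, 1:4]                        # the four numeric measurements
pca <- prcomp(num, center = TRUE, scale. = TRUE)

summary(pca)                              # proportion of variance per component
reduced <- pca$x[, 1:2]                   # 4 correlated features -> 2 components
dim(reduced)                              # 150 rows, 2 columns
```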
Data Quality Assessment
■ Importance: Ensures high-quality data for operations, customer management, and decision-
making.
Key Components:
■ Completeness: No missing values.
■ Accuracy: Reliable information.
■ Consistency: Uniform features.
■ Validity: Data meets defined criteria.
■ No Redundancy: Elimination of duplicate data.
Quality Assurance Process:
■ Data Profiling: Identifying data quality issues.
■ Data Cleaning: Fixing identified issues.
■ Data Monitoring: Maintaining data quality over time.
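A minimal profiling sketch in base R that checks completeness, redundancy, and a simple validity rule; the data frame and the 0–120 age rule are illustrative assumptions.

```r
# Minimal profiling sketch; the data frame and the 0-120 age rule are assumptions.
df <- data.frame(id   = c(1, 2, 2, 4),
                 age  = c(34, NA, 28, 150),
                 city = c("Pune", "Delhi", "Delhi", ""))

colSums(is.na(df) | df == "")                 # completeness: missing/empty per column
sum(duplicated(df$id))                        # redundancy: duplicate identifiers
sum(df$age < 0 | df$age > 120, na.rm = TRUE)  # validity: out-of-range ages
```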
Best Practices in Data Preprocessing
■ Understand Your Data: Familiarize yourself with the dataset to identify focus areas.
■ Use Statistical Methods: Visualize data for insights into class distribution and
quality.
■ Summarize Data: Identify duplicates, missing values, and outliers.
■ Dimensionality Reduction: Eliminate irrelevant fields and reduce complexity.
■ Feature Engineering: Determine which attributes significantly contribute to model
training.
Conclusion
■ Effective data cleaning and preprocessing are essential for ensuring high-quality
data, which ultimately enhances the accuracy and reliability of data analysis and
decision-making processes.
Data Processing
■ Definition: Data processing refers to the conversion of data into a usable format,
transforming raw data into informative and valuable insights. It is often associated
with information systems, highlighting the dual role of converting information into
data and vice versa.
Processing Data vs. Processed Data
■ Processing Data: Involves defining and managing the structure, characteristics, and
specifications of data within an organization.
■ Processed Data: Refers to refined and finalized data specifications after undergoing
various processing steps.
■ Importance: Processing data reflects ongoing efforts to improve data quality, while
processed data represents the refined dataset ready for effective utilization.
Stages of Data Processing
1. Collection: Gathering raw data from various sources, establishing a foundation for
analysis.
2. Preparation: Organizing, cleaning, and formatting data to facilitate efficient
analysis.
3. Input: Entering prepared data into a computer system, either manually or through
automated methods.
4. Data Processing: Manipulating and analyzing data to extract meaningful insights
and patterns.
5. Data Output: Presenting results in comprehensible formats like reports, charts, or
graphs.
6. Data Storage: Storing processed data for future reference and analysis, ensuring
accessibility and longevity.
Methods of Data Processing
■ Manual Data Processing: Involves human effort to manage data without machines,
prone to errors and time-consuming.
■ Mechanical Data Processing: Utilizes machines like punch cards for increased
efficiency over manual methods.
■ Electronic Data Processing: Leverages computers for enhanced speed, accuracy,
and capacity in data handling.
Types of Data Processing
■ Batch Data Processing: Groups data for scheduled processing, suitable for non-time-
sensitive tasks.
■ Real-time Data Processing: Processes data immediately as generated, crucial for
time-sensitive applications.
■ Online Data Processing: Interactive processing during data collection, supporting
concurrent transactions.
■ Automatic Data Processing: Automates tasks using computers and software,
efficiently handling large data volumes.
Examples of Data Processing
■ Stock Exchanges: Process vast amounts of data during trades, matching buy/sell
orders and updating stock prices in real-time.
■ Manufacturing: Utilizes data processing for quality control, analyzing production data
to identify defects.
■ Smart Home Devices: Process sensor data to manage tasks like adjusting
thermostats and controlling security systems.
■ Electronic Health Records (EHRs): Store and process patient data, facilitating
efficient healthcare delivery.
Conclusion
Data processing is a vital component of data management, encompassing various
stages and methods that transform raw data into actionable insights, ultimately
supporting informed decision-making across diverse fields.
Experimental Data Collection Methods
■ Experimental data collection methods are
broadly classified into
TWO categories:
1. Single-Factor Experiments
2. Multi-Factor Experiments
Single Factor Experiments
■ Single factor experiments involve varying one factor while keeping all other factors
constant. This design helps isolate the effects of the single factor on the outcome.
■ Single factor experiments are experiments in which only a single factor varies while
all others are kept constant.
■ Here the treatment consists exclusively of the different levels of the single variable
factor.
Examples of Single Factor Designs
■ Completely Randomized Design (CRD)
■ Randomized Block Design (RBD)
■ Latin Square Design (LSD)
Completely Randomized Design (CRD)
■ A statistical experimental design used
in data analytics that is based on
randomization and replication. It is
mostly used for comparing
treatments.
Key Characteristics:
■ Applicable only when experimental
material is homogeneous (e.g.,
uniform soil conditions).
Advantages of CRD
■ Simplicity: Easy to understand and
calculate variance.
■ Flexible Replications: Number of
replications can vary across treatments.
■ High Flexibility: Any number of treatments
can be used.
■ Simple Statistical Analysis: Requires
straightforward statistical methods.
■ Maximum Degrees of Freedom: Provides
the highest number of degrees of freedom.
Disadvantages of CRD
■ Not suitable for field experiments with
heterogeneous conditions.
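A hedged sketch of a CRD in R: treatments are allocated to homogeneous units completely at random and compared with a one-way ANOVA; the treatment means and noise level are simulated assumptions.

```r
# Sketch of a CRD: treatments assigned to homogeneous units completely at
# random, then compared with a one-way ANOVA. Effects and noise are simulated.
set.seed(1)
treatment <- factor(sample(rep(c("A", "B", "C"), each = 8)))  # random allocation
response  <- c(10, 12, 15)[as.integer(treatment)] + rnorm(24, sd = 2)

crd <- data.frame(treatment, response)
summary(aov(response ~ treatment, data = crd))   # F-test comparing treatment means
```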
Randomized Block Design (RBD)
■ An experimental design where experimental units are divided into blocks based on a certain
characteristic, and treatments are randomly assigned within each block.
Key Characteristics:
■ More efficient and accurate compared to CRD.
■ Reduces experimental error by controlling for variability among blocks.
Advantages of RBD
■ Efficiency: More accurate results compared to CRD.
■ Reduced Error: Lower chance of error due to blocking.
■ Flexibility: Allows for any number of treatments and replications.
■ Simple Statistical Analysis: Relatively straightforward analysis.
■ Isolation of Errors: Errors from any treatment can be isolated.
Randomized Block Design (RBD)
Disadvantages of RBD
■ Limited Treatments: Not advisable for a large number of treatments.
■ High Heterogeneity: Ineffective if plot heterogeneity is very high.
■ Increased Block Size: Larger treatments may lead to heterogeneous blocks.
■ Higher Experimental Errors: Increased number of treatments can raise the
possibility of errors.
Latin Square Design (LSD)
■ An experimental design where the material is divided into 'm' rows, 'm' columns, and
'm' treatments, with randomization ensuring each treatment occurs only once in
each row and column.
■ LSD is similar to CRD and RBD, but blocking is done on both rows and columns.
■ Rows and columns can be any two sources of variation in an experiment. In this
sense a Latin square is a generalization of a randomized block design with two
different blocking systems.
■ LSD is a balanced two-way classification scheme, for example a 4 x 4
arrangement. In this scheme each letter from A to D occurs only once in each row
and only once in each column.
■ LSD is probably underused in most fields of research.
A B C D
B C D A
C D A B
D A B C
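An illustrative R sketch that constructs a cyclic m x m Latin square matching the 4 x 4 layout above; in a real LSD the rows, columns, and treatment letters would additionally be randomized.

```r
# Construct a cyclic m x m Latin square matching the 4 x 4 layout above:
# each letter occurs exactly once in every row and every column.
# (A real LSD would also randomize rows, columns, and letters.)
latin_square <- function(m) {
  treatments <- LETTERS[1:m]
  sapply(1:m, function(col) treatments[((col + 0:(m - 1) - 1) %% m) + 1])
}
latin_square(4)
```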
Latin Square Design (LSD)
Advantages of LSD
■ Efficient Design: More efficient compared to CRD and RBD.
■ Simple Analysis: Statistical analysis is relatively straightforward, even with one
missing value.
Disadvantages of LSD
■ Agricultural Limitations: Not suitable for agricultural experiments.
■ Complex Analysis: Complicated when two or more values are missing.
■ Treatment Limitations: Challenging when treatments exceed ten.
Multifactor Design (Factorial Design)
■ A design that studies the effects of two or more factors simultaneously, allowing researchers to
analyze interaction effects between factors.
Example: Factorial Design
Key Characteristics:
■ Each factor has multiple levels, enabling a comprehensive understanding of how different
factors influence the outcome.
■ Useful for exploring complex interactions in experiments.
Advantages of Factorial Design
■ Comprehensive Analysis: Allows for the examination of interactions between factors.
■ Efficient Use of Resources: Maximizes the information gained from each experimental run.
■ Flexibility: Can accommodate various numbers of factors and levels.
Multifactor Design (Factorial Design)
Disadvantages of Factorial Design
■ Complexity: Increased complexity in design and analysis, especially with many
factors.
■ Resource Intensive: May require significant resources and time for larger factorial
designs.
More Related Content

PPTX
Big data ppt
PPTX
Big Data_Big Data_Big Data-Big Data_Big Data
PDF
Lesson_1_definitions_BIG DATA INROSUCTIONUE.pdf
PPTX
bigdata introduction for students pg msc
PPTX
Unit 1 - Introduction to Big Data and Big Data Analytics.pptx
PPTX
KIT601 Unit I.pptx
PDF
Big data Analytics
PPTX
Bigdata Hadoop introduction
Big data ppt
Big Data_Big Data_Big Data-Big Data_Big Data
Lesson_1_definitions_BIG DATA INROSUCTIONUE.pdf
bigdata introduction for students pg msc
Unit 1 - Introduction to Big Data and Big Data Analytics.pptx
KIT601 Unit I.pptx
Big data Analytics
Bigdata Hadoop introduction

Similar to Data Analytics JNTUH Unit 1 overview 001 (20)

PPTX
Big data Analytics Unit - CCS334 Syllabus
PPTX
000 introduction to big data analytics 2021
PPTX
Introduction to data analytics - Intro to Data Analytics
PPTX
bigdata- Introduction for pg students fo
PDF
Introduction to Data Analytics, AKTU - UNIT-1
PPTX
Explorasi Data untuk Peluang Bisnis dan Pengembangan Karir.pptx
PPT
Big data
PDF
Data foundation for analytics excellence
PDF
Fundamentals of data science: digital data
PDF
CS3352-Foundations of Data Science Notes.pdf
PPTX
Big data Analytics Fundamentals Chapter 1
PDF
BigData Analytics_1.7
PPTX
Overview of Big Data Characteristics and Technologies.pptx
PPTX
Data Analytics All 5 Units_all topics.pptx
PPTX
This is abouts are you doing the same time who is the best person to be safe and
PDF
Lecture1 introduction to big data
PDF
Harness the power of data
PPTX
Big data
PPTX
basic of data science and big data......
PPTX
Big Data.pptx
Big data Analytics Unit - CCS334 Syllabus
000 introduction to big data analytics 2021
Introduction to data analytics - Intro to Data Analytics
bigdata- Introduction for pg students fo
Introduction to Data Analytics, AKTU - UNIT-1
Explorasi Data untuk Peluang Bisnis dan Pengembangan Karir.pptx
Big data
Data foundation for analytics excellence
Fundamentals of data science: digital data
CS3352-Foundations of Data Science Notes.pdf
Big data Analytics Fundamentals Chapter 1
BigData Analytics_1.7
Overview of Big Data Characteristics and Technologies.pptx
Data Analytics All 5 Units_all topics.pptx
This is abouts are you doing the same time who is the best person to be safe and
Lecture1 introduction to big data
Harness the power of data
Big data
basic of data science and big data......
Big Data.pptx
Ad

Recently uploaded (20)

PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
PDF
Model Code of Practice - Construction Work - 21102022 .pdf
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PPTX
additive manufacturing of ss316l using mig welding
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
PDF
composite construction of structures.pdf
PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
PDF
Well-logging-methods_new................
PPT
Project quality management in manufacturing
PPTX
CH1 Production IntroductoryConcepts.pptx
DOCX
573137875-Attendance-Management-System-original
PPTX
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
PPT
CRASH COURSE IN ALTERNATIVE PLUMBING CLASS
PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PPTX
Construction Project Organization Group 2.pptx
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
Model Code of Practice - Construction Work - 21102022 .pdf
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
additive manufacturing of ss316l using mig welding
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
composite construction of structures.pdf
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
Well-logging-methods_new................
Project quality management in manufacturing
CH1 Production IntroductoryConcepts.pptx
573137875-Attendance-Management-System-original
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
CRASH COURSE IN ALTERNATIVE PLUMBING CLASS
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
Construction Project Organization Group 2.pptx
Ad

Data Analytics JNTUH Unit 1 overview 001

  • 2. DATA ■ Data is raw facts and numbers used to analyze something or make decisions. ■ Data when classified or organized that has some meaningful value for the user is information. ■ Information is processed data used to make decisions and take actions. ■ Processed data must meet the following criteria for it to be of any significant use in decision-making: – Accuracy: The information must be accurate. – Completeness: The information must be complete. – Timeliness: The information must be available when it’s needed.
  • 3. FORMS OF DATA ■ Collection of information stored in a particular file is represented as forms of data. ■ Data can be of three forms: ■ STRUCTURED FORM - Any form of relational database structure where relation between attributes is possible. That there exists a relation between rows and columns in the database with a table structure. Eg: using database programming languages (sql, oracle, mysql etc). ■ UNSTRUCTURED FORM - Any form of data that does not have predefined structure is represented as unstructured form of data. Eg: video, images, comments, posts, few websites such as blogs and Wikipedia. ■ SEMI STRUCTURED DATA - Does not have form tabular data like RDBMS. Predefined organized formats available. Eg: csv, xml, json, txt file with tab separator etc.
  • 4. BIG DATA ■ Big data describes large and diverse datasets that are huge in volume and also rapidly grow over time. ■ Big data is used in machine learning, predictive modeling, and other advanced analytics to solve business problems and make informed decisions. ■ Big data can be classified into structured, semi-structured, and unstructured data. – Structured data is highly organized and fits neatly into traditional databases. – Semi-structured data, like JSON or XML, is partially organized. – Unstructured data, such as text or multimedia, lacks a predefined structure.
  • 5. 4V PROPERTIES OF BIG DATA Big data is a collection of data from many different sources and is often described by four characteristics, known as the "4 V's": ■ Volume: How much data is collected ■ Velocity: How quickly data can be generated, gathered, and analyzed ■ Variety: How many points of reference are used to collect data ■ Veracity: How reliable the data is, and whether it comes from a trusted source
  • 6. SOURCES OF DATA ■ There are two types of sources of data available. PRIMARY SOURCE OF DATA ■ Eg: data created by individual or a business concern on their own. SECONDARY SOURCE OF DATA ■ Eg: data can be extracted from cloud servers, website sources (kaggle, uci, aws, google cloud, twitter, facebook, youtube, github etc..)
  • 7. DATA ANALYSIS vs DATA ANALYTICS DATA ANALYSIS ■ Data analysis is a process of: – inspecting, – cleansing, – transforming and – modeling data ■ with the goal of: – discovering useful information, – informing conclusions and – supporting decision-making DATA ANALYTICS ■ Data analytics is the science of: – analyzing raw data ■ to make conclusions about that information. ■ This information can then be used to optimize processes to increase the overall efficiency of a business or system
  • 8. 8 Data Analytics Data & Analytics Data ■ Raw Facts & Figures - Unprocessed elements representing values, observations, or descriptions. ■ Foundation of Information - When structured or interpreted, data becomes information. ■ Digital & Analog - Exists in various forms, such as numbers, text, images, or sounds. Data Analytics ■ The science of analyzing raw data to make conclusions about that information ■ Data analytics is the process of examining, cleansing, transforming, and modeling data to discover useful information
  • 9. Data Analytics What is Data Analytics? Data analytics is a multifaceted field that involves the systematic computational analysis of data to discover patterns, infer correlations, and assist in decision- making. Components:  Data Collection  Data Processing  Data Cleansing  Statistical Analysis  Interpretation  Reporting Objective: To uncover patterns, correlations, and insights to inform decision-making by • Examination of Datasets • Integration of Techniques • Strategic Decision-Making Tools & Techniques: • Utilizes statistical analysis, predictive modeling, machine learning, and data mining
  • 10. Role of Data Analytics ■ Gather Hidden Insights – Hidden insights from data are gathered and then analyzed with respect to business requirements. ■ Generate Reports – Reports are generated from the data and are passed on to the respective teams and individuals to deal with further actions for a high rise in business. ■ Perform Market Analysis – Market Analysis can be performed to understand the strengths and weaknesses of competitors. ■ Improve Business Requirement – Analysis of Data allows improving Business to customer requirements and experience.
  • 11. Tools used in Data Analytics ■ R programming ■ Python ■ Tableau Public ■ QlikView ■ SAS ■ Microsoft Excel ■ RapidMiner ■ KNIME ■ Open Refine ■ Apache Spark
  • 12. Data Analytics Types of Data Analytics 01 02 03 04 What Happened? Interprets historical data to identify trends and patterns Descriptive Analytics Why Did It Happen? Delves into data to understand causes and effects Diagnostic Analytics What Could Happen? Uses statistical models to forecast future outcomes Predictive Analytics What Should We Do? Provides recommendations on possible courses of action. Prescriptive Analytics Data Analytics
  • 13. 13 Advantages of Data Analytics Transforming Data into Value Informed Decision-Making Innovation Customer Insights Operational Efficiency Cost Efficiency Risk Management Facilitates evidence- based strategies, reducing guesswork. Empowers organizations to make evidence-based decisions Data-driven approach for strategic planning reducing the risk of reliance on intuition or assumptions. Helps identifying inefficiencies supported by data Streamlines operations by identifying process improvements. Results in improved efficiency & productivity, reduced cost Enhances understanding of customer behavior and preferences for better engagement. Leads to better product development and customer service Provides insights to stay ahead in the market. Helps anticipating and mitigating risks Identifies potential risks and offers mitigation strategies through predictive analytics Support strategic decision-making, real time monitoring, scenario planning with focus on risk mitigation. Drives product development and innovation by revealing market trends Provides a leverage over competitors through insights Helps tailor products and services to individual customer needs Revenue Growth - Uncovers opportunities for new revenue streams. Cost Reduction - Pinpoints areas to save resources and reduce waste. Support resource optimization, process automation, predictive maintenance, financial planning and many more…
  • 14. 14 Data analytics can be applied across various industries to solve specific problems, improve efficiency, and create value. Manufacturing • Quality control & demand forecasting • Suppl chain optimization • Inventory management Banking & Finance • Fraud detection • Algorithmic Trading • Credit risk assessment Business Intelligence • Market trends • Customer Experience Enhancement • Inventory Management & Operational performance Healthcare Sector • Patient data analysis for better treatment plans • Predict health trends • Provide preventative care Use Cases Data Analytics in Action Transportation & Logistics • Route Optimization • Fleet Management Energy Sector • Demand Forecasting • Grid Management
  • 15. Data Architecture What is Data Architecture? ■ Definition: A data architecture describes how data is managed--from collection through to transformation, distribution, and consumption. ■ Blueprint: It sets the blueprint for data and the way it flows through data storage systems. Importance of Data Architecture ■ Foundation: It is foundational to data processing operations and artificial intelligence (AI) applications. ■ Business-Driven: The design should be driven by business requirements. ■ Collaboration: Data architects and data engineers define the data model and underlying data structures.
  • 16. Data Architecture Facilitating Business Needs ■ Purpose: These designs typically facilitate a business need, such as a reporting or data science initiative. ■ Modern Approaches: Modern data architectures often leverage cloud platforms to manage and process data. Benefits of Cloud Platforms ■ Compute Scalability: Enables important data processing tasks to be completed rapidly. ■ Storage Scalability: Helps to cope with rising data volumes and ensures all relevant data is available. ■ Quality Improvement: Enhances the quality of training AI applications.
  • 17. Data Architecture Models The data architecture includes three types of data model: 1. Conceptual model – It is a business model which uses Entity Relationship (ER) model for relation between entities and their attributes. 2. Logical model – It is a model where problems are represented in the form of logic such as rows and column of data, classes, xml tags and other DBMS techniques. 3. Physical model – Physical models holds the database design like which type of database technology will be suitable for architecture.
  • 18. FACTORS THAT INFLUENCE DATA ARCHITECTURE Various constraints and influences affect data architecture design, including: ■ Enterprise Requirements – Economical and effective system expansion – Acceptable performance levels (e.g., system access speed) – Transaction reliability and transparent data management – Conversion of raw data into useful information (e.g., data warehouses) – Techniques: ■ Managing transaction data vs. reference data ■ Separating data capture from data retrieval systems
  • 19. FACTORS THAT INFLUENCE DATA ARCHITECTURE Technology Drivers ■ Derived from completed data and database architecture designs ■ Influenced by: – Existing organizational integration frameworks and standards – Organizational economics – Available site resources (e.g., software licensing) Economics ■ Critical considerations during the data architecture phase ■ Some optimal solutions may be cost- prohibitive ■ External factors affecting decisions: – Business cycle – Interest rates – Market conditions – Legal considerations
  • 20. FACTORS THAT INFLUENCE DATA ARCHITECTURE Business Policies ■ Influences include: – Internal organizational policies – Rules from regulatory bodies – Professional standards – Governmental laws (varying by agency) ■ These policies dictate how the enterprise processes data Data Processing Needs ■ Requirements include: – Accurate and reproducible high- volume transactions – Data warehousing for management information systems – Support for periodic and ad hoc reporting – Alignment with organizational initiatives (e.g., budgets, product development)
  • 21. DATA COLLECTION Types of Data Sources ■ Primary Sources: Data generated directly from original sources. ■ Secondary Sources: Data collected from existing sources. Data Collection Process ■ Involves acquiring, collecting, extracting, and storing large volumes of data. ■ Data can be structured (e.g., databases) or unstructured (e.g., text, video, audio, XML files, images).
  • 22. DATA COLLECTION Importance in Big Data Analysis ■ Initial Step: Data collection is crucial before analyzing patterns and extracting useful information. ■ Raw Data: Collected data is initially unrefined; it requires cleaning to become useful. From Data to Knowledge ■ Transformation: Cleaned data leads to information, which is then transformed into knowledge. ■ Knowledge Applications: Can pertain to various fields, such as business insights or medical treatments.
  • 23. DATA COLLECTION Goals of Data Collection ■ Aim to gather information-rich data. ■ Start with key questions: – What type of data needs to be collected? – What are the sources of this data? Types of Data Collected ■ Qualitative Data: Non-numerical, focusing on behaviors and actions (e.g., words, sentences). ■ Quantitative Data: Numerical data that can be analyzed using scientific tools. Understanding the sources and types of data is essential for effective data collection and subsequent analysis, leading to valuable insights and knowledge.
  • 24. Understanding Data Types Overview of Data Types ■ Data is primarily categorized into two types: – Primary Data – Secondary Data Importance of Data Collection ■ Goal: To gather rich, relevant data for informed decision-making. ■ Considerations: Ensure data is valid, reliable, and suitable for analysis.
  • 25. Primary Data Definition: Raw, original data collected directly from official sources. Collection Techniques: ■ Interviews: Direct questioning of individuals (interviewee) by an interviewer. – Can be structured or unstructured (e.g., personal, telephone, email). ■ Surveys: Gathering responses through questionnaires. – Conducted online or offline (e.g., website forms, social media polls). ■ Observation: Researcher observes behaviors and practices. – Data is recorded in text, audio, or video formats. ■ Experimental Method: Data collected through controlled experiments. – Common designs include: ■ CRD (Completely Randomized Design): Randomization for comparison. ■ RBD (Randomized Block Design): Dividing experiments into blocks for analysis. ■ LSD (Latin Square Design): Arranging data in rows and columns to minimize errors. ■ FD (Factorial Design): Testing multiple variables simultaneously.
  • 26. Secondary Data Definition: Data previously collected and reused for analysis. Sources: ■ Internal Sources: Data found within the organization (e.g., sales records, accounting resources). ■ External Sources: Data obtained from third-party resources (e.g., government publications, market reports).
  • 27. Secondary Data Internal Sources Examples: ■ Accounting Resources: Financial data for marketing analysis. ■ Sales Force Reports: Information on product sales. ■ Internal Experts: Insights from departmental heads. ■ Miscellaneous Reports: Operational data. External Sources Examples: ■ Government Publications: Demographic and economic data (e.g., Registrar General of India). ■ Central Statistical Organization: National accounts statistics. ■ Ministry of Commerce: Wholesale price indices. ■ International Organizations: Data from ILO, OECD, IMF.
  • 28. Data Management What is Data Management? ■ Definition: The practice of collecting, storing, and using data securely, efficiently, and cost-effectively. ■ Goal: To optimize data usage for individuals and organizations while adhering to policies and regulations, enabling informed decision-making and maximizing organizational benefits. Key Aspects of Data Management ■ Broad Scope: Involves a variety of tasks, policies, procedures, and practices.
  • 29. Core Components of Data Management 1. Data Creation and Access: Efficient methods for creating, accessing, and updating data across diverse data tiers. 2. Data Storage Solutions: Strategies for storing data across multiple cloud environments and on-premises systems. 3. High Availability & Disaster Recovery: Ensuring data is always accessible and protected against loss through robust recovery plans. 4. Data Utilization: Leveraging data in various applications, analytics, and algorithms for enhanced insights. 5. Data Privacy & Security: Implementing measures to protect data from unauthorized access and breaches. 6. Data Archiving & Destruction: Managing data lifecycle by archiving and securely destroying data according to retention schedules and compliance requirements.
  • 30. Data Quality What is Data Quality? ■ Definition: Data quality refers to the assessment of how usable data is and how well it fits its intended context. Importance of Data Quality ■ Core of Organizational Activities: High data quality is essential as it underpins all organizational functions. ■ Consequences of Poor Data Quality: Inaccurate data leads to flawed reporting, misguided decisions, and potential economic losses.
  • 31. Data Quality Key Factors in Measuring Data Quality ■ Data Accuracy: Ensures that stored values reflect real-world conditions. ■ Data Uniqueness: Assesses the level of unwanted duplication within datasets. ■ Data Consistency: Checks for adherence to defined semantic rules across datasets. ■ Data Completeness: Evaluates the presence of necessary values in a dataset. ■ Data Timeliness: Measures the appropriateness of data age for specific tasks. Other factors can be taken into consideration such as Availability, Ease of Manipulation, reliability.
  • 32. Importance of Effective Data Management ■ Decision-Making: Enables organizations to make informed decisions based on accurate and timely data. ■ Regulatory Compliance: Ensures adherence to legal and regulatory standards regarding data handling. ■ Operational Efficiency: Streamlines data processes, reducing costs and improving productivity. Conclusion ■ Effective data management is essential for organizations to connect the full potential of their data, ensuring security, compliance, and optimal use in decision- making processes.
  • 33. Data Quality Issues Impact of Poor Data Quality ■ Negative Effects: Poor data quality significantly hampers various processing efforts. ■ Consequences: It directly influences decision-making and can adversely affect revenue. Example: Credit Rating System ■ Implications of Poor Data Quality: – Loan Rejections: Creditworthy applicants may be denied loans. – Risky Approvals: Loans may be sanctioned to applicants with a higher likelihood of defaulting.
  • 34. Data Quality Issues Common Data Quality Problems 1. Noise and Outliers: Irregular data points that can skew analysis. 2. Missing Values: Absence of data that can lead to incomplete insights. 3. Duplicate Data: Redundant entries that can distort results. 4. Incorrect Data: Data that is inaccurate or misleading.
  • 35. Noisy Data ■ Noisy data refers to datasets that contain extraneous, meaningless information. ■ Almost all datasets will inherently include some level of unwanted noise. ■ The term is often synonymous with corrupt data or data that is difficult for machines to interpret. ■ This type of data can be filtered and processed to enhance overall data quality. Illustration: The Crowded Room Analogy ■ Imagine trying to listen to a conversation in a noisy, crowded room. ■ While the human brain can effectively filter out background noise, excessive noise can hinder comprehension. ■ Similarly, as more irrelevant information is added to a dataset, it becomes increasingly challenging to identify the desired patterns.
  • 36. Outliers Definition ■ An outlier is a data point or observation that significantly deviates from the other observations in a dataset. Importance in Data Analysis ■ Outliers are critical for analysts and data scientists, as they require careful examination to avoid misleading estimations. ■ In simple terms, an outlier is an observation that appears distant and diverges from the overall pattern within a sample. Causes of Outliers ■ Outliers may arise from experimental errors or unique circumstances.
  • 37. Outliers Subjectivity in Identification ■ There is no strict mathematical definition for what constitutes an outlier; determining whether an observation is an outlier is often a subjective process. Methods of Outlier Detection Various techniques exist for detecting outliers: ■ Graphical Methods: Such as normal probability plots. ■ Model-Based Approaches: Including box plots and other statistical models.
  • 38. Outliers Types of Outliers: There are two types: ■ Univariate: These outliers can be found when we look at the distribution of a single variable. ■ Multivariate: Multivariate outliers are outliers in an n-dimensional space. Impact of Outliers on a dataset: ■ Increased Error Variance: Outliers can elevate error variance, weakening the power of statistical tests. ■ Decreased Normality: Non-randomly distributed outliers can disrupt the normality of the dataset. ■ Biased Estimates: Outliers can bias estimates that are crucial for analysis. ■ Violation of Assumptions: They can affect the fundamental assumptions of statistical models, including Regression and ANOVA.
  • 39. Outliers Detecting Outliers ■ Visualization Methods: The most common approach for identifying outliers is through visualization techniques, including: – Box Plot – Histogram – Scatter Plot
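As a minimal sketch of these visual checks in R (assuming a made-up numeric vector `sales` with two extreme entries appended), base graphics are enough:

```r
# Made-up data: mostly moderate values plus two extreme entries
set.seed(10)
sales <- c(rnorm(100, mean = 50, sd = 5), 120, 135)

# Box plot: points drawn beyond the whiskers are the usual visual outlier candidates
boxplot(sales, main = "Box plot of sales")

# Histogram: isolated bars far from the main mass of the data suggest outliers
hist(sales, breaks = 30, main = "Histogram of sales")

# Scatter plot against the observation index: extreme points stand apart
plot(seq_along(sales), sales, xlab = "Observation", ylab = "Value",
     main = "Scatter plot of sales")

# boxplot.stats() also reports the values lying beyond the whiskers
boxplot.stats(sales)$out
```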
  • 43. Outlier Treatment Methods ■ Retention: Retaining outliers is appropriate when they represent genuine variability in the data rather than errors. This approach is useful when: – The outliers reflect rare but valid events or behaviors. – The goal is to understand the full range of data, including extremes. – Removing them would result in loss of valuable insights. – Retention is common in exploratory analysis or when modeling real-world phenomena where extreme values are expected. ■ Exclusion: – Depending on the study's objectives, it may be necessary to determine which outliers should be removed from the dataset, as they can significantly influence the analysis results.
  • 44. Outlier Treatment Methods ■ Exclusion: Exclusion involves removing outliers that are clearly due to data entry mistakes, measurement errors, or irrelevant to the study context. This is suitable when: – The outlier is confirmed to be erroneous/not useful. – It does not represent the population of interest. – Its presence distorts statistical summaries or model performance. – Exclusion should be justified and documented to maintain transparency in the analysis.
  • 45. Outlier Treatment Methods ■ Rejection: Rejection is a more formal process, often based on statistical tests or domain- specific rules. It is appropriate when: – The data-generating process and error distribution are well understood. – The outlier violates known theoretical or physical constraints. – Statistical methods (e.g., Z-score, Grubbs' test) support its removal. – Rejection is typically used in controlled environments like laboratory experiments or quality control settings.
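A small illustration of rule-based rejection in R, using a made-up `sales` vector: observations whose absolute z-score exceeds 3 are flagged and then dropped. This is only a sketch of the z-score rule named above, not a prescribed procedure:

```r
set.seed(10)
sales <- c(rnorm(100, mean = 50, sd = 5), 135)   # made-up data with one extreme value

# Z-score rule: flag observations more than 3 standard deviations from the mean
z <- (sales - mean(sales)) / sd(sales)
which(abs(z) > 3)                  # indices of flagged observations
sales_kept <- sales[abs(z) <= 3]   # data set after rejecting the flagged points
```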
  • 46. Other treatment methods The outliers package in R ■ outlier(): Returns the observation that is farthest from the sample mean. ■ scores(): Calculates normalized scores (e.g., z-scores) that can be used to flag extreme values.
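A hedged usage sketch, assuming the third-party outliers package is installed and using a made-up vector `x`; `grubbs.test()` from the same package ties back to the rejection approach on the previous slide:

```r
# install.packages("outliers")    # third-party package, install once
library(outliers)

x <- c(12, 14, 15, 13, 14, 16, 15, 42)   # made-up data with one extreme value

outlier(x)               # the observation farthest from the sample mean (42 here)
scores(x, type = "z")    # normalized z-scores for every observation
grubbs.test(x)           # formal test of whether the most extreme value is an outlier
```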
  • 47. Missing Data treatment Impact of Missing Values ■ Missing data in the training data set can reduce the power/fit of a model or can lead to a biased model, because the behavior of, and relationships with, other variables have not been analyzed correctly. It can lead to wrong prediction or classification. Representation in R ■ Missing Values: Represented by the symbol NA (Not Available). ■ Impossible Values: Denoted by NaN (Not a Number), e.g., the result of 0/0; dividing a non-zero number by zero is output by R as Inf (Infinity).
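A tiny sketch of how these values behave in an R session:

```r
x <- c(10, NA, 25)        # NA marks a missing (not available) value
is.na(x)                  # FALSE  TRUE FALSE
mean(x)                   # NA: missing values propagate by default
mean(x, na.rm = TRUE)     # 17.5: drop the NA before computing

0 / 0                     # NaN: an undefined ("impossible") operation
1 / 0                     # Inf: a non-zero number divided by zero
is.nan(0 / 0)             # TRUE
```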
  • 48. Predictive Mean Matching (PMM) Approach Definition: ■ PMM is a semi-parametric imputation method used to handle missing values. Process: ■ Similar to regression, PMM fills in missing values by randomly selecting from observed donor values. ■ It identifies the donor observation with regression-predicted values closest to those of the missing value from the simulated regression model.
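A minimal PMM sketch, assuming the third-party mice package and R's built-in airquality data set (which contains NAs); `method = "pmm"` selects predictive mean matching:

```r
# install.packages("mice")        # third-party imputation package
library(mice)

data(airquality)                  # built-in data set containing missing values
imp <- mice(airquality, m = 5,    # create 5 imputed data sets
            method = "pmm",       # predictive mean matching
            seed = 123, printFlag = FALSE)
completed <- complete(imp, 1)     # extract the first completed data set
colSums(is.na(completed))         # no missing values remain
```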
  • 49. Data Pre-processing ■ Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format. ■ Data preprocessing is the process of transforming raw data into an understandable format. ■ Data preprocessing includes the steps we need to follow to transform or encode data so that it may be easily parsed by the machine.
  • 50. Importance of Data Preprocessing Why is Data Preprocessing Essential? ■ Real-World Challenges: Most real-world datasets for machine learning are often plagued by missing values, inconsistencies, and noise due to their diverse origins. ■ Impact on Results: Applying data mining algorithms to noisy data can yield poor quality results, as they struggle to effectively identify patterns. ■ Data Quality Improvement: Data preprocessing is crucial for enhancing overall data quality, ensuring more reliable outcomes. ■ Statistical Accuracy: Duplicate or missing values can distort the overall statistics, leading to misleading interpretations. ■ Model Integrity: Outliers and inconsistent data points can disrupt the model's learning process, resulting in inaccurate predictions. ■ Quality-Driven Decisions: Effective decision-making relies on high-quality data. Data preprocessing is vital for obtaining this quality data; without it, we risk a "Garbage In, Garbage Out" scenario.
  • 52. Data Cleaning Overview of Data Cleaning ■ Definition: Data cleaning is a crucial step in data preprocessing, aimed at improving data quality by addressing issues such as missing values, noisy data, inconsistencies, and outliers. Key Components of Data Cleaning 1. Missing Values ■ Solutions: – Ignore: Suitable for large datasets with numerous missing values. – Fill: Use methods like manual entry, regression prediction, or attribute mean.
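A small sketch of these options on a made-up data frame `df` with a numeric `age` column (the regression fill is just one possible variant):

```r
df <- data.frame(age    = c(23, NA, 31, 27, NA, 45),
                 income = c(40, 52, 61, 48, 55, 90))

# Ignore: drop the rows that contain missing values
df_dropped <- na.omit(df)

# Fill with the attribute mean
df_mean <- df
df_mean$age[is.na(df_mean$age)] <- mean(df$age, na.rm = TRUE)

# Fill via regression prediction from another attribute
fit <- lm(age ~ income, data = df)            # fitted on the complete cases
missing <- is.na(df$age)
df_reg <- df
df_reg$age[missing] <- predict(fit, newdata = df[missing, ])
```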
  • 53. Data Cleaning 2. Noisy Data ■ Techniques: – Binning: Smooths data by dividing it into equal-sized bins and replacing values with mean, median, or boundary values. – Regression: Fits data points to a regression function to minimize noise. – Clustering: Groups similar data points; outliers can be identified and removed. 3. Removing Outliers ■ Method: Utilize clustering techniques to identify and eliminate inconsistent data points.
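A brief sketch of smoothing by bin means on a made-up, already sorted vector `price`, using equal-frequency bins:

```r
price <- c(4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34)   # made-up, sorted values

# Equal-frequency binning: three bins of four observations each
bins <- rep(1:3, each = 4)

# Smoothing by bin means: replace every value with the mean of its bin
ave(price, bins, FUN = mean)
# 9 9 9 9 22.75 22.75 22.75 22.75 29.25 29.25 29.25 29.25

# Smoothing by bin boundaries would instead snap each value to the
# nearest bin minimum or maximum
```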
  • 54. Data Integration ■ Definition: Merging data from multiple sources into a single data store, such as a data warehouse. ■ Challenges: – Schema integration and object matching. – Removing redundant attributes. – Resolving data value conflicts.
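As a minimal illustration (not a full integration pipeline), two made-up sources sharing a `customer_id` key can be combined in R with a join; conflicting or redundant attributes would still need to be reconciled afterwards:

```r
orders   <- data.frame(customer_id = c(1, 2, 3), amount = c(250, 120, 310))
profiles <- data.frame(customer_id = c(1, 2, 4), region = c("North", "South", "East"))

# Full outer join on the shared key: unmatched rows are kept with NAs
combined <- merge(orders, profiles, by = "customer_id", all = TRUE)
combined
```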
  • 55. Data Transformation ■ Purpose: Consolidate quality data into alternative forms through various strategies. Data transformation steps: ■ Generalization: Transforming low-level data into high-level information (e.g., city to country). ■ Normalization: Scaling numerical attributes to fit within a specified range using techniques like: – Min-max normalization – Z-score normalization – Decimal scaling normalization ■ Attribute Selection: Creating new properties from existing attributes to enhance data mining. ■ Aggregation: Summarizing data for easier interpretation (e.g., monthly sales data).
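A brief sketch of the three normalization strategies on a made-up vector `x`:

```r
x <- c(200, 300, 400, 600, 1000)          # made-up attribute values

# Min-max normalization: rescale into [0, 1]
min_max <- (x - min(x)) / (max(x) - min(x))

# Z-score normalization: centre on the mean, scale by the standard deviation
z_score <- (x - mean(x)) / sd(x)

# Decimal scaling: divide by 10^j, the smallest power that brings all |values| below 1
j <- floor(log10(max(abs(x)))) + 1        # here j = 4
decimal <- x / 10^j                       # 0.02 0.03 0.04 0.06 0.10

min_max; z_score; decimal
```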
  • 56. Data Reduction ■ Definition: Reducing the size of the dataset while maintaining analytical quality. Data reduction steps: ■ Data Cube Aggregation: Summarizing data in a compact form. ■ Dimensionality Reduction: Reducing redundant features using techniques like Principal Component Analysis. ■ Data Compression: Reducing data size using encoding technologies (lossy vs. lossless). ■ Discretization: Dividing continuous attributes into intervals for better interpretation. ■ Numerosity Reduction: Representing data as models or equations to save storage.
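A compact dimensionality-reduction sketch using base R's `prcomp()` on the built-in iris measurements:

```r
data(iris)
num <- iris[, 1:4]                        # the four numeric measurements

pca <- prcomp(num, center = TRUE, scale. = TRUE)
summary(pca)                              # proportion of variance per component

reduced <- pca$x[, 1:2]                   # keep the first two principal components
dim(reduced)                              # 150 observations, 2 features instead of 4
```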
  • 57. Data Quality Assessment ■ Importance: Ensures high-quality data for operations, customer management, and decision-making. Key Components: ■ Completeness: No missing values. ■ Accuracy: Reliable information. ■ Consistency: Uniform features. ■ Validity: Data meets defined criteria. ■ No Redundancy: Elimination of duplicate data. Quality Assurance Process: ■ Data Profiling: Identifying data quality issues. ■ Data Cleaning: Fixing identified issues. ■ Data Monitoring: Maintaining data quality over time.
  • 58. Best Practices in Data Preprocessing ■ Understand Your Data: Familiarize yourself with the dataset to identify focus areas. ■ Use Statistical Methods: Visualize data for insights into class distribution and quality. ■ Summarize Data: Identify duplicates, missing values, and outliers. ■ Dimensionality Reduction: Eliminate irrelevant fields and reduce complexity. ■ Feature Engineering: Determine which attributes significantly contribute to model training. Conclusion ■ Effective data cleaning and preprocessing are essential for ensuring high-quality data, which ultimately enhances the accuracy and reliability of data analysis and decision-making processes.
  • 59. Data Processing ■ Definition: Data processing refers to the conversion of data into a usable format, transforming raw data into informative and valuable insights. It is often associated with information systems, highlighting the dual role of converting data into information and vice versa. Processing Data vs. Processed Data ■ Processing Data: Involves defining and managing the structure, characteristics, and specifications of data within an organization. ■ Processed Data: Refers to refined and finalized data specifications after undergoing various processing steps. ■ Importance: Processing data reflects ongoing efforts to improve data quality, while processed data represents the refined dataset ready for effective utilization.
  • 61. Stages of Data Processing 1. Collection: Gathering raw data from various sources, establishing a foundation for analysis. 2. Preparation: Organizing, cleaning, and formatting data to facilitate efficient analysis. 3. Input: Entering prepared data into a computer system, either manually or through automated methods. 4. Data Processing: Manipulating and analyzing data to extract meaningful insights and patterns. 5. Data Output: Presenting results in comprehensible formats like reports, charts, or graphs. 6. Data Storage: Storing processed data for future reference and analysis, ensuring accessibility and longevity.
  • 62. Methods of Data Processing ■ Manual Data Processing: Involves human effort to manage data without machines, prone to errors and time-consuming. ■ Mechanical Data Processing: Utilizes machines like punch cards for increased efficiency over manual methods. ■ Electronic Data Processing: Leverages computers for enhanced speed, accuracy, and capacity in data handling.
  • 63. Types of Data Processing ■ Batch Data Processing: Groups data for scheduled processing, suitable for non-time-sensitive tasks. ■ Real-time Data Processing: Processes data immediately as generated, crucial for time-sensitive applications. ■ Online Data Processing: Interactive processing during data collection, supporting concurrent transactions. ■ Automatic Data Processing: Automates tasks using computers and software, efficiently handling large data volumes.
  • 64. Examples of Data Processing ■ Stock Exchanges: Process vast amounts of data during trades, matching buy/sell orders and updating stock prices in real-time. ■ Manufacturing: Utilizes data processing for quality control, analyzing production data to identify defects. ■ Smart Home Devices: Process sensor data to manage tasks like adjusting thermostats and controlling security systems. ■ Electronic Health Records (EHRs): Store and process patient data, facilitating efficient healthcare delivery. Conclusion Data processing is a vital component of data management, encompassing various stages and methods that transform raw data into actionable insights, ultimately supporting informed decision-making across diverse fields.
  • 65. Experimental Data Collection Methods ■ Experimental data collection methods are broadly classified into TWO categories: 1. Single-Factor Experiments 2. Multi-Factor Experiments
  • 66. Single Factor Experiments ■ Single-factor experiments are experiments in which only a single factor is varied while all other factors are kept constant. This design helps isolate the effect of that single factor on the outcome. ■ Here the treatments consist exclusively of the different levels of the single variable factor. Examples of Single Factor Designs ■ Completely Randomized Design (CRD) ■ Randomized Block Design (RBD) ■ Latin Square Design (LSD)
  • 67. Completely Randomized Design (CRD) ■ A statistical experimental design used in data analytics which is based on randomization and replication. It is mostly used for comparing treatments. Key Characteristics: ■ Applicable only when experimental material is homogeneous (e.g., uniform soil conditions). Advantages of CRD ■ Simplicity: Easy to understand and calculate variance. ■ Flexible Replications: Number of replications can vary across treatments. ■ High Flexibility: Any number of treatments can be used. ■ Simple Statistical Analysis: Requires straightforward statistical methods. ■ Maximum Degrees of Freedom: Provides the highest number of degrees of freedom. Disadvantages of CRD ■ Not suitable for field experiments with heterogeneous conditions.
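A CRD is commonly analyzed with a one-way ANOVA; a minimal sketch on made-up data with three treatments and six replicates each:

```r
set.seed(42)
crd <- data.frame(
  treatment = factor(rep(c("A", "B", "C"), each = 6)),      # 3 treatments, 6 replicates
  response  = c(rnorm(6, 10), rnorm(6, 12), rnorm(6, 15))   # made-up responses
)

fit <- aov(response ~ treatment, data = crd)   # one-way ANOVA for a CRD
summary(fit)
```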
  • 68. Randomized Block Design (RBD) ■ An experimental design where experimental units are divided into blocks based on a certain characteristic, and treatments are randomly assigned within each block. Key Characteristics: ■ More efficient and accurate compared to CRD. ■ Reduces experimental error by controlling for variability among blocks. Advantages of RBD ■ Efficiency: More accurate results compared to CRD. ■ Reduced Error: Lower chance of error due to blocking. ■ Flexibility: Allows for any number of treatments and replications. ■ Simple Statistical Analysis: Relatively straightforward analysis. ■ Isolation of Errors: Errors from any treatment can be isolated.
  • 69. Randomized Block Design (RBD) Disadvantages of RBD ■ Limited Treatments: Not advisable for a large number of treatments. ■ High Heterogeneity: Ineffective if plot heterogeneity is very high. ■ Increased Block Size: Larger treatments may lead to heterogeneous blocks. ■ Higher Experimental Errors: Increased number of treatments can raise the possibility of errors.
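For an RBD, the block enters the ANOVA model as an additional term so that block-to-block variability is removed from the error; a hedged sketch on made-up data:

```r
set.seed(1)
rbd <- expand.grid(
  treatment = factor(paste0("T", 1:4)),   # 4 treatments
  block     = factor(paste0("B", 1:5))    # 5 blocks
)
rbd$response <- rnorm(nrow(rbd), mean = 20) + as.numeric(rbd$block)  # made-up block effect

fit <- aov(response ~ treatment + block, data = rbd)  # block term absorbs block variation
summary(fit)
```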
  • 70. Latin Square Design (LSD) ■ An experimental design where the material is divided into 'm' rows, 'm' columns, and 'm' treatments, with randomization ensuring each treatment occurs only once in each row and column. ■ LSD is similar to CRD and RBD but blocks on both rows and columns. ■ Rows and columns can be any two sources of variation in an experiment; in this sense a Latin square is a generalization of a randomized block design with two different blocking systems. ■ LSD is a balanced two-way classification scheme, for example a 4 × 4 arrangement in which each letter from A to D occurs exactly once in each row and once in each column: Row 1: A B C D | Row 2: B C D A | Row 3: C D A B | Row 4: D A B C ■ LSD is probably under-used in most fields of research.
  • 71. Latin Square Design (LSD) Advantages of LSD ■ Efficient Design: More efficient compared to CRD and RBD. ■ Simple Analysis: Statistical analysis is relatively straightforward, even with one missing value. Disadvantages of LSD ■ Agricultural Limitations: Not suitable for agricultural experiments. ■ Complex Analysis: Complicated when two or more values are missing. ■ Treatment Limitations: Challenging when treatments exceed ten.
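An LSD adds both row and column blocking terms to the model; a minimal sketch for a 4 × 4 square matching the layout shown above, with made-up responses:

```r
set.seed(2)
lsd <- data.frame(
  row       = factor(rep(1:4, each = 4)),
  col       = factor(rep(1:4, times = 4)),
  treatment = factor(c("A","B","C","D",
                       "B","C","D","A",
                       "C","D","A","B",
                       "D","A","B","C")),   # each letter once per row and per column
  response  = rnorm(16, mean = 50)          # made-up yields
)

fit <- aov(response ~ row + col + treatment, data = lsd)  # rows, columns, treatments
summary(fit)
```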
  • 72. Multifactor Design (Factorial Design) ■ A design that studies the effects of two or more factors simultaneously, allowing researchers to analyze interaction effects between factors (for example, a factorial design with several factors at several levels each). Key Characteristics: ■ Each factor has multiple levels, enabling a comprehensive understanding of how different factors influence the outcome. ■ Useful for exploring complex interactions in experiments. Advantages of Factorial Design ■ Comprehensive Analysis: Allows for the examination of interactions between factors. ■ Efficient Use of Resources: Maximizes the information gained from each experimental run. ■ Flexibility: Can accommodate various numbers of factors and levels.
  • 73. Multifactor Design (Factorial Design) Disadvantages of Factorial Design ■ Complexity: Increased complexity in design and analysis, especially with many factors. ■ Resource Intensive: May require significant resources and time for larger factorial designs.
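A hedged sketch of a 2 × 3 factorial analysis in R on made-up data, where the A:B term captures whether the effect of one factor depends on the level of the other:

```r
set.seed(7)
fac <- expand.grid(
  A   = factor(c("low", "high")),   # factor A with 2 levels
  B   = factor(c("b1", "b2", "b3")),# factor B with 3 levels
  rep = 1:4                         # 4 replicates per combination
)
fac$response <- rnorm(nrow(fac), mean = 30)     # made-up responses

fit <- aov(response ~ A * B, data = fac)        # main effects of A and B plus A:B interaction
summary(fit)

interaction.plot(fac$A, fac$B, fac$response)    # roughly parallel lines suggest little interaction
```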

Editor's Notes

  • #8: Data Raw Facts & Figures - Unprocessed elements representing values, observations, or descriptions. Foundation of Information - When structured or interpreted, data becomes information. Digital & Analog - Exists in various forms, such as numbers, text, images, or sounds. Data Analytics The science of analyzing raw data to make conclusions about that information Data analytics is the process of examining, cleansing, transforming, and modeling data to discover useful information
  • #9: Data analytics is a multifaceted field that involves the systematic computational analysis of data to discover patterns, infer correlations, and assist in decision-making. Objective: To uncover patterns, correlations, and insights to inform decision-making by Examination of Datasets Integration of Techniques Strategic Decision-Making Tools & Techniques: Utilizes statistical analysis, predictive modeling, machine learning, and data mining Components: Data Collection Data Processing Data Cleansing Statistical Analysis Interpretation Reporting
  • #12: Types of Data Analytics - Descriptive Analytics: What Happened? Interprets historical data to identify trends and patterns. - Diagnostic Analytics: Why Did It Happen? Delves into data to understand causes and effects. - Predictive Analytics: What Could Happen? Uses statistical models to forecast future outcomes. - Prescriptive Analytics: What Should We Do? Provides recommendations on possible courses of action.
  • #13: Informed Decision-Making - Facilitates evidence-based strategies, reducing guesswork. - Empowers organizations to make evidence-based decisions. - Data-driven approach for strategic planning, reducing the risk of reliance on intuition or assumptions. Operational Efficiency - Helps identify inefficiencies supported by data. - Streamlines operations by identifying process improvements. - Results in improved efficiency & productivity and reduced cost. Customer Insights - Enhances understanding of customer behavior and preferences for better engagement. - Leads to better product development and customer service. - Provides insights to stay ahead in the market. Risk Management - Helps anticipate and mitigate risks. - Identifies potential risks and offers mitigation strategies through predictive analytics. - Supports strategic decision-making, real-time monitoring, and scenario planning with a focus on risk mitigation. Innovation - Drives product development and innovation by revealing market trends. - Provides leverage over competitors through insights. - Helps tailor products and services to individual customer needs. Cost Efficiency: Revenue Growth - Uncovers opportunities for new revenue streams. Cost Reduction - Pinpoints areas to save resources and reduce waste. - Supports resource optimization, process automation, predictive maintenance, financial planning and many more…
  • #14: Business Intelligence - Market trends - Customer Experience Enhancement - Inventory Management & Operational performance Healthcare Sector - Patient data analysis for better treatment plans - Predict health trends - Provide preventative care Banking & Finance - Fraud detection - Algorithmic Trading - Credit risk assessment Manufacturing - Quality control & demand forecasting - Supply chain optimization - Inventory management Energy Sector - Demand Forecasting - Grid Management Transportation & Logistics - Route Optimization - Fleet Management
  • #38: ANOVA and regression are both statistical methods, but they are used for different purposes. ANOVA (Analysis of Variance) is used to compare the means of three or more groups, while regression is used to model the relationship between variables and make predictions. While they have distinct applications, ANOVA and regression can be used together in certain scenarios, particularly in the context of linear regression
  • #49: The main agenda for a model to be accurate and precise in predictions is that the algorithm should be able to easily interpret the data's features.