DATA MANAGEMENT
UNIT 1
DATA ANALYTICS
DATA
■ Data is raw facts and numbers used to analyze something or make decisions.
■ Data that has been classified or organized so that it has some meaningful value for the
user is information.
■ Information is processed data used to make decisions and take actions.
■ Processed data must meet the following criteria for it to be of any significant use in
decision-making:
– Accuracy: The information must be accurate.
– Completeness: The information must be complete.
– Timeliness: The information must be available when it’s needed.
FORMS OF DATA
■ The form of data refers to how a collection of information is stored in a particular file.
■ Data can be of three forms:
■ STRUCTURED FORM - Any form of relational database structure where relations between
attributes are possible, i.e., there exists a relation between the rows and columns of a
table-structured database.
Eg: databases managed with database languages and systems (SQL, Oracle, MySQL, etc.).
■ UNSTRUCTURED FORM - Any form of data that does not have a predefined structure is
represented as unstructured data.
Eg: video, images, comments, posts, and some websites such as blogs and Wikipedia.
■ SEMI-STRUCTURED DATA - Does not have the tabular form of an RDBMS, but predefined
organized formats are available.
Eg: CSV, XML, JSON, text files with tab separators, etc.
BIG DATA
■ Big data describes large and diverse datasets that are huge in volume and
also rapidly grow over time.
■ Big data is used in machine learning, predictive modeling, and other
advanced analytics to solve business problems and make informed decisions.
■ Big data can be classified into structured, semi-structured, and
unstructured data.
– Structured data is highly organized and fits neatly into traditional databases.
– Semi-structured data, like JSON or XML, is partially organized.
– Unstructured data, such as text or multimedia, lacks a predefined structure.
4V PROPERTIES OF BIG DATA
Big data is a collection of data from many different sources and is often described by
four characteristics, known as the "4 V's":
■ Volume: How much data is collected
■ Velocity: How quickly data can be generated, gathered, and analyzed
■ Variety: How many points of reference are used to collect data
■ Veracity: How reliable the data is, and whether it comes from a trusted source
SOURCES OF DATA
■ There are two types of data sources available.
PRIMARY SOURCE OF DATA
■ Eg: data created by an individual or a
business concern on their own.
SECONDARY SOURCE OF DATA
■ Eg: data extracted from cloud servers and
website sources (Kaggle, UCI, AWS,
Google Cloud, Twitter, Facebook, YouTube,
GitHub, etc.)
DATA ANALYSIS vs DATA ANALYTICS
DATA ANALYSIS
■ Data analysis is a process of:
– inspecting,
– cleansing,
– transforming and
– modeling data
■ with the goal of:
– discovering useful information,
– informing conclusions and
– supporting decision-making
DATA ANALYTICS
■ Data analytics is the science of:
– analyzing raw data
■ to make conclusions about that
information.
■ This information can then be used to
optimize processes to increase the
overall efficiency of a business or
system
Data & Analytics
Data
■ Raw Facts & Figures - Unprocessed elements
representing values, observations, or descriptions.
■ Foundation of Information - When structured or
interpreted, data becomes information.
■ Digital & Analog - Exists in various forms, such as
numbers, text, images, or sounds.
Data Analytics
■ The science of analyzing raw data to make
conclusions about that information
■ Data analytics is the process of examining,
cleansing, transforming, and modeling data to
discover useful information
What is Data Analytics?
Data analytics is a multifaceted field that involves the systematic computational
analysis of data to discover patterns, infer correlations, and assist in decision-
making.
Components:
– Data Collection
– Data Processing
– Data Cleansing
– Statistical Analysis
– Interpretation
– Reporting
Objective:
To uncover patterns, correlations, and insights to
inform decision-making by
• Examination of Datasets
• Integration of Techniques
• Strategic Decision-Making
Tools & Techniques:
• Utilizes statistical analysis, predictive modeling,
machine learning, and data mining
Role of Data Analytics
■ Gather Hidden Insights – Hidden insights from data are gathered and then analyzed
with respect to business requirements.
■ Generate Reports – Reports are generated from the data and passed on to the
respective teams and individuals so they can take further action to grow the business.
■ Perform Market Analysis – Market analysis can be performed to understand the
strengths and weaknesses of competitors.
■ Improve Business Requirements – Analysis of data allows a business to better meet
customer requirements and improve customer experience.
Tools used in Data Analytics
■ R programming
■ Python
■ Tableau Public
■ QlikView
■ SAS
■ Microsoft Excel
■ RapidMiner
■ KNIME
■ OpenRefine
■ Apache Spark
Types of Data Analytics
■ Descriptive Analytics (What Happened?) – Interprets historical data to identify trends
and patterns.
■ Diagnostic Analytics (Why Did It Happen?) – Delves into data to understand causes and
effects.
■ Predictive Analytics (What Could Happen?) – Uses statistical models to forecast future
outcomes.
■ Prescriptive Analytics (What Should We Do?) – Provides recommendations on possible
courses of action.
Advantages of Data Analytics
Transforming Data into Value
■ Informed Decision-Making
– Empowers organizations to make evidence-based decisions, reducing guesswork.
– Facilitates evidence-based strategies and a data-driven approach to strategic
planning, reducing reliance on intuition or assumptions.
■ Operational Efficiency
– Helps identify inefficiencies, supported by data.
– Streamlines operations by identifying process improvements.
– Results in improved efficiency and productivity and reduced cost.
■ Customer Insights
– Enhances understanding of customer behavior and preferences for better engagement.
– Leads to better product development and customer service.
– Helps tailor products and services to individual customer needs.
■ Innovation
– Drives product development and innovation by revealing market trends.
– Provides insights to stay ahead in the market and gain leverage over competitors.
■ Cost Efficiency
– Revenue Growth: uncovers opportunities for new revenue streams.
– Cost Reduction: pinpoints areas to save resources and reduce waste.
– Supports resource optimization, process automation, predictive maintenance,
financial planning, and many more.
■ Risk Management
– Helps anticipate and mitigate risks.
– Identifies potential risks and offers mitigation strategies through predictive analytics.
– Supports strategic decision-making, real-time monitoring, and scenario planning with
a focus on risk mitigation.
Use Cases: Data Analytics in Action
Data analytics can be applied across various industries to solve specific problems, improve
efficiency, and create value.
Manufacturing
• Quality control & demand forecasting
• Supply chain optimization
• Inventory management
Banking & Finance
• Fraud detection
• Algorithmic trading
• Credit risk assessment
Business Intelligence
• Market trends
• Customer experience enhancement
• Inventory management & operational performance
Healthcare Sector
• Patient data analysis for better treatment plans
• Predicting health trends
• Providing preventative care
Transportation & Logistics
• Route optimization
• Fleet management
Energy Sector
• Demand forecasting
• Grid management
Data Architecture
What is Data Architecture?
■ Definition: A data architecture describes how data is managed, from collection
through to transformation, distribution, and consumption.
■ Blueprint: It sets the blueprint for data and the way it flows through data storage
systems.
Importance of Data Architecture
■ Foundation: It is foundational to data processing operations and artificial
intelligence (AI) applications.
■ Business-Driven: The design should be driven by business requirements.
■ Collaboration: Data architects and data engineers define the data model and
underlying data structures.
Data Architecture
Facilitating Business Needs
■ Purpose: These designs typically facilitate a business need, such as a reporting or
data science initiative.
■ Modern Approaches: Modern data architectures often leverage cloud platforms to
manage and process data.
Benefits of Cloud Platforms
■ Compute Scalability: Enables important data processing tasks to be completed
rapidly.
■ Storage Scalability: Helps to cope with rising data volumes and ensures all relevant
data is available.
■ Quality Improvement: Improves the quality of data available for training AI applications.
Data Architecture
Models
The data architecture includes three types of data model:
1. Conceptual model – A business-level model that uses the Entity Relationship (ER)
model to describe entities, their attributes, and the relations between them.
2. Logical model – Represents the problem in logical structures such as rows and
columns of data, classes, XML tags, and other DBMS constructs.
3. Physical model – Holds the database design, such as which type of database
technology will be suitable for the architecture.
FACTORS THAT INFLUENCE DATA ARCHITECTURE
Various constraints and influences affect data architecture design, including:
■ Enterprise Requirements
– Economical and effective system expansion
– Acceptable performance levels (e.g., system access speed)
– Transaction reliability and transparent data management
– Conversion of raw data into useful information (e.g., data warehouses)
– Techniques:
■ Managing transaction data vs. reference data
■ Separating data capture from data retrieval systems
FACTORS THAT INFLUENCE DATA ARCHITECTURE
Technology Drivers
■ Derived from completed data and
database architecture designs
■ Influenced by:
– Existing organizational
integration frameworks and
standards
– Organizational economics
– Available site resources (e.g.,
software licensing)
Economics
■ Critical considerations during the data
architecture phase
■ Some optimal solutions may be cost-
prohibitive
■ External factors affecting decisions:
– Business cycle
– Interest rates
– Market conditions
– Legal considerations
FACTORS THAT INFLUENCE DATA ARCHITECTURE
Business Policies
■ Influences include:
– Internal organizational policies
– Rules from regulatory bodies
– Professional standards
– Governmental laws (varying by
agency)
■ These policies dictate how the
enterprise processes data
Data Processing Needs
■ Requirements include:
– Accurate and reproducible high-
volume transactions
– Data warehousing for
management information
systems
– Support for periodic and ad hoc
reporting
– Alignment with organizational
initiatives (e.g., budgets, product
development)
DATA COLLECTION
Types of Data Sources
■ Primary Sources: Data generated directly from original sources.
■ Secondary Sources: Data collected from existing sources.
Data Collection Process
■ Involves acquiring, collecting, extracting, and storing large volumes of data.
■ Data can be structured (e.g., databases) or unstructured (e.g., text, video, audio,
XML files, images).
DATA COLLECTION
Importance in Big Data Analysis
■ Initial Step: Data collection is crucial before analyzing patterns and extracting useful
information.
■ Raw Data: Collected data is initially unrefined; it requires cleaning to become useful.
From Data to Knowledge
■ Transformation: Cleaned data leads to information, which is then transformed into
knowledge.
■ Knowledge Applications: Can pertain to various fields, such as business insights or
medical treatments.
DATA COLLECTION
Goals of Data Collection
■ Aim to gather information-rich data.
■ Start with key questions:
– What type of data needs to be collected?
– What are the sources of this data?
Types of Data Collected
■ Qualitative Data: Non-numerical, focusing on behaviors and actions (e.g., words, sentences).
■ Quantitative Data: Numerical data that can be analyzed using scientific tools.
Understanding the sources and types of data is essential for effective data collection and
subsequent analysis, leading to valuable insights and knowledge.
Understanding Data Types
Overview of Data Types
■ Data is primarily categorized into two types:
– Primary Data
– Secondary Data
Importance of Data Collection
■ Goal: To gather rich, relevant data for informed decision-making.
■ Considerations: Ensure data is valid, reliable, and suitable for analysis.
Primary Data
Definition: Raw, original data collected directly from first-hand sources.
Collection Techniques:
■ Interviews: Direct questioning of individuals (interviewee) by an interviewer.
– Can be structured or unstructured (e.g., personal, telephone, email).
■ Surveys: Gathering responses through questionnaires.
– Conducted online or offline (e.g., website forms, social media polls).
■ Observation: Researcher observes behaviors and practices.
– Data is recorded in text, audio, or video formats.
■ Experimental Method: Data collected through controlled experiments.
– Common designs include:
■ CRD (Completely Randomized Design): Randomization for comparison.
■ RBD (Randomized Block Design): Dividing experiments into blocks for analysis.
■ LSD (Latin Square Design): Arranging data in rows and columns to minimize errors.
■ FD (Factorial Design): Testing multiple variables simultaneously.
Secondary Data
Definition: Data previously collected and reused for analysis.
Sources:
■ Internal Sources: Data found within the organization (e.g., sales records, accounting
resources).
■ External Sources: Data obtained from third-party resources (e.g., government
publications, market reports).
Secondary Data
Internal Sources
Examples:
■ Accounting Resources: Financial data
for marketing analysis.
■ Sales Force Reports: Information on
product sales.
■ Internal Experts: Insights from
departmental heads.
■ Miscellaneous Reports: Operational
data.
External Sources
Examples:
■ Government Publications:
Demographic and economic data (e.g.,
Registrar General of India).
■ Central Statistical Organization:
National accounts statistics.
■ Ministry of Commerce: Wholesale
price indices.
■ International Organizations: Data from
ILO, OECD, IMF.
Data Management
What is Data Management?
■ Definition: The practice of collecting, storing, and using data securely, efficiently,
and cost-effectively.
■ Goal: To optimize data usage for individuals and organizations while adhering to
policies and regulations, enabling informed decision-making and maximizing
organizational benefits.
Key Aspects of Data Management
■ Broad Scope: Involves a variety of tasks, policies, procedures, and practices.
Core Components of Data Management
1. Data Creation and Access: Efficient methods for creating, accessing, and updating
data across diverse data tiers.
2. Data Storage Solutions: Strategies for storing data across multiple cloud
environments and on-premises systems.
3. High Availability & Disaster Recovery: Ensuring data is always accessible and
protected against loss through robust recovery plans.
4. Data Utilization: Leveraging data in various applications, analytics, and algorithms
for enhanced insights.
5. Data Privacy & Security: Implementing measures to protect data from
unauthorized access and breaches.
6. Data Archiving & Destruction: Managing data lifecycle by archiving and securely
destroying data according to retention schedules and compliance requirements.
Data Quality
What is Data Quality?
■ Definition: Data quality refers to the assessment of how usable data is and how well
it fits its intended context.
Importance of Data Quality
■ Core of Organizational Activities: High data quality is essential as it underpins all
organizational functions.
■ Consequences of Poor Data Quality: Inaccurate data leads to flawed reporting,
misguided decisions, and potential economic losses.
Data Quality
Key Factors in Measuring Data Quality
■ Data Accuracy: Ensures that stored values reflect real-world conditions.
■ Data Uniqueness: Assesses the level of unwanted duplication within datasets.
■ Data Consistency: Checks for adherence to defined semantic rules across datasets.
■ Data Completeness: Evaluates the presence of necessary values in a dataset.
■ Data Timeliness: Measures the appropriateness of data age for specific tasks.
Other factors can also be taken into consideration, such as availability, ease of
manipulation, and reliability.
Importance of Effective Data Management
■ Decision-Making: Enables organizations to make informed decisions based on
accurate and timely data.
■ Regulatory Compliance: Ensures adherence to legal and regulatory standards
regarding data handling.
■ Operational Efficiency: Streamlines data processes, reducing costs and improving
productivity.
Conclusion
■ Effective data management is essential for organizations to unlock the full
potential of their data, ensuring security, compliance, and optimal use in decision-
making processes.
Data Quality Issues
Impact of Poor Data Quality
■ Negative Effects: Poor data quality significantly hampers various processing efforts.
■ Consequences: It directly influences decision-making and can adversely affect
revenue.
Example: Credit Rating System
■ Implications of Poor Data Quality:
– Loan Rejections: Creditworthy applicants may be denied loans.
– Risky Approvals: Loans may be sanctioned to applicants with a higher
likelihood of defaulting.
Data Quality Issues
Common Data Quality Problems
1. Noise and Outliers: Irregular data points that can skew analysis.
2. Missing Values: Absence of data that can lead to incomplete insights.
3. Duplicate Data: Redundant entries that can distort results.
4. Incorrect Data: Data that is inaccurate or misleading.
Noisy Data
■ Noisy data refers to datasets that contain extraneous, meaningless information.
■ Almost all datasets will inherently include some level of unwanted noise.
■ The term is often synonymous with corrupt data or data that is difficult for machines
to interpret.
■ This type of data can be filtered and processed to enhance overall data quality.
Illustration: The Crowded Room Analogy
■ Imagine trying to listen to a conversation in a noisy, crowded room.
■ While the human brain can effectively filter out background noise, excessive noise
can hinder comprehension.
■ Similarly, as more irrelevant information is added to a dataset, it becomes
increasingly challenging to identify the desired patterns.
Outliers
Definition
■ An outlier is a data point or observation that significantly deviates from the other
observations in a dataset.
Importance in Data Analysis
■ Outliers are critical for analysts and data scientists, as they require careful
examination to avoid misleading estimations.
■ In simple terms, an outlier is an observation that appears distant and diverges from
the overall pattern within a sample.
Causes of Outliers
■ Outliers may arise from experimental errors or unique circumstances.
Outliers
Subjectivity in Identification
■ There is no strict mathematical definition for what constitutes an outlier;
determining whether an observation is an outlier is often a subjective process.
Methods of Outlier Detection
Various techniques exist for detecting outliers:
■ Graphical Methods: Such as normal probability plots.
■ Model-Based Approaches: Including box plots and other statistical models.
Outliers
Types of Outliers: outliers are of two types:
■ Univariate: These outliers can be found when we look at the distribution of a single variable.
■ Multivariate: Multivariate outliers are outliers in an n-dimensional space, i.e., across a
combination of variables.
Impact of Outliers on a dataset:
■ Increased Error Variance: Outliers can elevate error variance, weakening the power of
statistical tests.
■ Decreased Normality: Non-randomly distributed outliers can disrupt the normality of the
dataset.
■ Biased Estimates: Outliers can skew estimates that are crucial for analysis.
■ Violation of Assumptions: They can affect the fundamental assumptions of statistical
models, including Regression and ANOVA.
Outliers
Detecting Outliers
■ Visualization Methods: The most common approach for identifying outliers is
through visualization techniques, including:
– Box Plot
– Histogram
– Scatter Plot
[Figures: outliers highlighted in a histogram, a box plot, and a scatter plot]
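As an illustrative sketch (not part of the original slides), the following R snippet applies these visual and rule-based checks to a small synthetic vector; the 1.5 × IQR rule and the z-score cutoff of 3 are common but arbitrary choices.

```r
# Illustrative sketch: visual and rule-based outlier checks on a synthetic vector.
# The data, the 1.5 * IQR rule, and the z-score cutoff of 3 are assumptions.
set.seed(42)
x <- c(rnorm(100, mean = 50, sd = 5), 120, -10)  # two injected extreme values

# Visual inspection
boxplot(x, main = "Box plot")        # extreme points plotted beyond the whiskers
hist(x, breaks = 30, main = "Histogram")
plot(x, main = "Values by index")    # scatter plot; extremes stand apart

# Rule-based flags
boxplot.stats(x)$out                 # points outside 1.5 * IQR of the quartiles
z <- abs(scale(x))                   # absolute z-scores (column matrix)
x[z > 3]                             # values more than 3 SDs from the mean
```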
Outlier Treatment Methods
■ Retention:
Retaining outliers is appropriate when they represent genuine variability in the
data rather than errors. This approach is useful when:
– The outliers reflect rare but valid events or behaviors.
– The goal is to understand the full range of data, including extremes.
– Removing them would result in loss of valuable insights.
– Retention is common in exploratory analysis or when modeling real-world
phenomena where extreme values are expected.
Outlier Treatment Methods
■ Exclusion:
Exclusion involves removing outliers that are clearly due to data entry mistakes,
measurement errors, or irrelevance to the study context. Depending on the study's
objectives, it may be necessary to determine which outliers should be removed from
the dataset, as they can significantly influence the analysis results. Exclusion is
suitable when:
– The outlier is confirmed to be erroneous or not useful.
– It does not represent the population of interest.
– Its presence distorts statistical summaries or model performance.
– Exclusion should be justified and documented to maintain transparency in the
analysis.
Outlier Treatment Methods
■ Rejection:
Rejection is a more formal process, often based on statistical tests or domain-
specific rules. It is appropriate when:
– The data-generating process and error distribution are well understood.
– The outlier violates known theoretical or physical constraints.
– Statistical methods (e.g., Z-score, Grubbs' test) support its removal.
– Rejection is typically used in controlled environments like laboratory
experiments or quality control settings.
Other treatment methods
The outliers package in R provides, for example:
■ outlier(): Returns the observation that is farthest from the mean
■ scores(): Calculates normalized scores
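A minimal sketch of these functions, assuming the CRAN outliers package is installed; the sample vector is made up for illustration.

```r
# Sketch using the CRAN 'outliers' package (install.packages("outliers") first).
# The sample vector is made up for illustration.
library(outliers)

x <- c(2.1, 2.3, 1.9, 2.2, 2.0, 9.7)  # 9.7 is an obvious extreme value

outlier(x)                  # observation farthest from the mean (here 9.7)
scores(x, type = "z")       # normalized (z) scores for each observation
scores(x, type = "z") > 2   # flag values more than 2 SDs from the mean
```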
Missing Data Treatment
Impact of Missing Values
■ Missing data in the training data set can reduce the power or fit of a model, or can
lead to a biased model, because the behavior of and relationships with other
variables have not been analyzed correctly. This can lead to wrong predictions or
classifications.
Representation in R
■ Missing Values: Represented by the symbol NA (Not
Available).
■ Impossible Values: Denoted by NaN (Not a Number), e.g., the result of 0/0; dividing
a nonzero number by zero is output by R as Inf (Infinity).
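A small base-R illustration of how these values appear and are detected:

```r
# Base-R behavior for missing and impossible values.
v <- c(4, NA, 0/0, 1/0)   # NA = not available, 0/0 = NaN, 1/0 = Inf
v                         # 4  NA NaN Inf

is.na(v)                  # TRUE for both NA and NaN
is.nan(v)                 # TRUE only for NaN
is.infinite(v)            # TRUE only for Inf

mean(v)                   # NA: missing values propagate by default
mean(v[is.finite(v)])     # mean of the finite, observed values only
```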
Predictive Mean Matching (PMM) Approach
Definition:
■ PMM is a semi-parametric imputation method used to handle missing values.
Process:
■ Similar to regression, PMM fills in missing values by randomly selecting from
observed donor values.
■ It identifies the donor observation with regression-predicted values closest to those
of the missing value from the simulated regression model.
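A minimal sketch of PMM imputation, assuming the mice package (which implements PMM as method = "pmm") is installed; airquality is a built-in R dataset with missing values, and m = 5 imputations and the seed are arbitrary choices.

```r
# Sketch of PMM imputation with the 'mice' package (install.packages("mice")).
# airquality is a built-in dataset with NAs; m = 5 and the seed are arbitrary.
library(mice)

data(airquality)
colSums(is.na(airquality))                # Ozone and Solar.R contain NAs

imp <- mice(airquality, method = "pmm", m = 5, seed = 123, printFlag = FALSE)
completed <- complete(imp, 1)             # first of the five imputed datasets
colSums(is.na(completed))                 # no missing values remain
```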
Data Pre-processing
■ Data preprocessing is a data mining
technique which is used to transform the
raw data in a useful and efficient format.
■ Data preprocessing is the process of
transforming raw data into an
understandable format.
■ Data Preprocessing includes the steps we
need to follow to transform or encode data
so that it may be easily parsed by the
machine.
Importance of Data Preprocessing
Why is Data Preprocessing Essential
■ Real-World Challenges: Most real-world datasets for machine learning are often plagued
by missing values, inconsistencies, and noise due to their diverse origins.
■ Impact on Results: Applying data mining algorithms to noisy data can yield poor quality
results, as they struggle to effectively identify patterns.
■ Data Quality Improvement: Data preprocessing is crucial for enhancing overall data
quality, ensuring more reliable outcomes.
■ Statistical Accuracy: Duplicate or missing values can distort the overall statistics, leading
to misleading interpretations.
■ Model Integrity: Outliers and inconsistent data points can disrupt the model's learning
process, resulting in inaccurate predictions.
■ Quality-Driven Decisions: Effective decision-making relies on high-quality data. Data
preprocessing is vital for obtaining this quality data; without it, we risk a "Garbage In,
Garbage Out" scenario.
DATA PRE-PROCESSING STEPS
Data Cleaning
Overview of Data Cleaning
■ Definition: Data cleaning is a crucial step in data preprocessing, aimed at improving
data quality by addressing issues such as missing values, noisy data,
inconsistencies, and outliers.
Key Components of Data Cleaning
1. Missing Values
■ Solutions:
– Ignore: Suitable for large datasets with numerous missing values.
– Fill: Use methods like manual entry, regression prediction, or attribute mean.
Data Cleaning
2. Noisy Data
■ Techniques:
– Binning: Smooths data by dividing it into equal-sized bins and replacing values
with mean, median, or boundary values.
– Regression: Fits data points to a regression function to minimize noise.
– Clustering: Groups similar data points; outliers can be identified and removed.
3. Removing Outliers
■ Method: Utilize clustering techniques to identify and eliminate inconsistent data
points.
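A minimal base-R sketch of these cleaning steps on a synthetic vector: mean imputation for missing values, equal-width binning with bin-mean smoothing, and a 1.5 × IQR rule for outliers (all parameter choices are illustrative).

```r
# Base-R sketch of the cleaning steps above; the vector is synthetic.
x <- c(12, 15, NA, 14, 13, 95, 16, NA, 15, 14)

# 1. Missing values: fill with the attribute mean
x_filled <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# 2. Noisy data: bin into 3 equal-width bins and smooth by bin means
bins <- cut(x_filled, breaks = 3, labels = FALSE)
x_smoothed <- ave(x_filled, bins, FUN = mean)

# 3. Outliers: keep only points within 1.5 * IQR of the quartiles
q   <- quantile(x_filled, c(0.25, 0.75))
iqr <- diff(q)
x_clean <- x_filled[x_filled >= q[1] - 1.5 * iqr & x_filled <= q[2] + 1.5 * iqr]
```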
Data Integration
■ Definition: Merging data from multiple sources into a single data store, such as a
data warehouse.
■ Challenges:
– Schema integration and object matching.
– Removing redundant attributes.
– Resolving data value conflicts.
Data Transformation
■ Purpose: Consolidate quality data into alternative forms through various strategies.
Data transformation steps:
■ Generalization: Transforming low-level data into high-level information (e.g., city to
country).
■ Normalization: Scaling numerical attributes to fit within a specified range using
techniques like:
– Min-max normalization
– Z-score normalization
– Decimal scaling normalization
■ Attribute Selection: Creating new properties from existing attributes to enhance data
mining.
■ Aggregation: Summarizing data for easier interpretation (e.g., monthly sales data).
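A short base-R sketch of the normalization and aggregation steps; the numbers and the two-month sales table are illustrative assumptions.

```r
# Base-R sketch of the transformation steps; values are illustrative.
x <- c(200, 300, 400, 600, 950)

# Min-max normalization to [0, 1]
x_minmax <- (x - min(x)) / (max(x) - min(x))

# Z-score normalization (mean 0, standard deviation 1)
x_z <- as.vector(scale(x))

# Decimal scaling: divide by 10^j so that all scaled values lie below 1
j <- ceiling(log10(max(abs(x))))
x_decimal <- x / 10^j             # 950 -> 0.95, 200 -> 0.20, ...

# Aggregation: summarize daily sales into monthly totals
daily <- data.frame(month = rep(c("Jan", "Feb"), each = 3),
                    sales = c(10, 12, 9, 14, 15, 11))
aggregate(sales ~ month, data = daily, FUN = sum)
```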
Data Reduction
■ Definition: Reducing the size of the dataset while maintaining analytical quality.
Data reduction steps:
■ Data Cube Aggregation: Summarizing data in a compact form.
■ Dimensionality Reduction: Reducing redundant features using techniques like
Principal Component Analysis.
■ Data Compression: Reducing data size using encoding technologies (lossy vs.
lossless).
■ Discretization: Dividing continuous attributes into intervals for better interpretation.
■ Numerosity Reduction: Representing data as models or equations to save storage.
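As an illustrative sketch, dimensionality reduction with Principal Component Analysis in base R on the built-in iris measurements; keeping two components is an arbitrary choice for the example.

```r
# Dimensionality-reduction sketch with PCA in base R; iris is built in,
# and keeping two components is an arbitrary choice for the example.
num <- iris[, 1:4]                        # the four numeric measurements
pca <- prcomp(num, center = TRUE, scale. = TRUE)

summary(pca)                              # proportion of variance per component
reduced <- pca$x[, 1:2]                   # 4 correlated features -> 2 components
dim(reduced)                              # 150 rows, 2 columns
```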
Data Quality Assessment
■ Importance: Ensures high-quality data for operations, customer management, and decision-
making.
Key Components:
■ Completeness: No missing values.
■ Accuracy: Reliable information.
■ Consistency: Uniform features.
■ Validity: Data meets defined criteria.
■ No Redundancy: Elimination of duplicate data.
Quality Assurance Process:
■ Data Profiling: Identifying data quality issues.
■ Data Cleaning: Fixing identified issues.
■ Data Monitoring: Maintaining data quality over time.
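A minimal profiling sketch in base R that checks completeness, redundancy, and a simple validity rule; the data frame and the 0–120 age rule are illustrative assumptions.

```r
# Minimal profiling sketch; the data frame and the 0-120 age rule are assumptions.
df <- data.frame(id   = c(1, 2, 2, 4),
                 age  = c(34, NA, 28, 150),
                 city = c("Pune", "Delhi", "Delhi", ""))

colSums(is.na(df) | df == "")                 # completeness: missing/empty per column
sum(duplicated(df$id))                        # redundancy: duplicate identifiers
sum(df$age < 0 | df$age > 120, na.rm = TRUE)  # validity: out-of-range ages
```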
Best Practices in Data Preprocessing
■ Understand Your Data: Familiarize yourself with the dataset to identify focus areas.
■ Use Statistical Methods: Visualize data for insights into class distribution and
quality.
■ Summarize Data: Identify duplicates, missing values, and outliers.
■ Dimensionality Reduction: Eliminate irrelevant fields and reduce complexity.
■ Feature Engineering: Determine which attributes significantly contribute to model
training.
Conclusion
■ Effective data cleaning and preprocessing are essential for ensuring high-quality
data, which ultimately enhances the accuracy and reliability of data analysis and
decision-making processes.
Data Processing
■ Definition: Data processing refers to the conversion of data into a usable format,
transforming raw data into informative and valuable insights. It is often associated
with information systems, highlighting the dual role of converting information into
data and vice versa.
Processing Data vs. Processed Data
■ Processing Data: Involves defining and managing the structure, characteristics, and
specifications of data within an organization.
■ Processed Data: Refers to refined and finalized data specifications after undergoing
various processing steps.
■ Importance: Processing data reflects ongoing efforts to improve data quality, while
processed data represents the refined dataset ready for effective utilization.
Stages of Data Processing
1. Collection: Gathering raw data from various sources, establishing a foundation for
analysis.
2. Preparation: Organizing, cleaning, and formatting data to facilitate efficient
analysis.
3. Input: Entering prepared data into a computer system, either manually or through
automated methods.
4. Data Processing: Manipulating and analyzing data to extract meaningful insights
and patterns.
5. Data Output: Presenting results in comprehensible formats like reports, charts, or
graphs.
6. Data Storage: Storing processed data for future reference and analysis, ensuring
accessibility and longevity.
Methods of Data Processing
■ Manual Data Processing: Involves human effort to manage data without machines,
prone to errors and time-consuming.
■ Mechanical Data Processing: Utilizes machines like punch cards for increased
efficiency over manual methods.
■ Electronic Data Processing: Leverages computers for enhanced speed, accuracy,
and capacity in data handling.
Types of Data Processing
■ Batch Data Processing: Groups data for scheduled processing, suitable for non-time-
sensitive tasks.
■ Real-time Data Processing: Processes data immediately as generated, crucial for
time-sensitive applications.
■ Online Data Processing: Interactive processing during data collection, supporting
concurrent transactions.
■ Automatic Data Processing: Automates tasks using computers and software,
efficiently handling large data volumes.
Examples of Data Processing
■ Stock Exchanges: Process vast amounts of data during trades, matching buy/sell
orders and updating stock prices in real-time.
■ Manufacturing: Utilizes data processing for quality control, analyzing production data
to identify defects.
■ Smart Home Devices: Process sensor data to manage tasks like adjusting
thermostats and controlling security systems.
■ Electronic Health Records (EHRs): Store and process patient data, facilitating
efficient healthcare delivery.
Conclusion
Data processing is a vital component of data management, encompassing various
stages and methods that transform raw data into actionable insights, ultimately
supporting informed decision-making across diverse fields.
Experimental Data Collection Methods
■ Experimental data collection methods are
broadly classified into
TWO categories:
1. Single-Factor Experiments
2. Multi-Factor Experiments
Single Factor Experiments
■ Single factor experiments involve varying one factor while keeping all other factors
constant. This design helps isolate the effects of the single factor on the outcome.
■ Single factor experiments are experiments in which only a single factor varies while
all others are kept constant.
■ Here the treatment consists exclusively of the different levels of the single variable
factor.
Examples of Single Factor Designs
■ Completely Randomized Design (CRD)
■ Randomized Block Design (RBD)
■ Latin Square Design (LSD)
Completely Randomized Design (CRD)
■ A statistical experimental design used
in data analytics that is based on
randomization and replication. It is
mostly used for comparing
treatments.
Key Characteristics:
■ Applicable only when experimental
material is homogeneous (e.g.,
uniform soil conditions).
Advantages of CRD
■ Simplicity: Easy to understand and
calculate variance.
■ Flexible Replications: Number of
replications can vary across treatments.
■ High Flexibility: Any number of treatments
can be used.
■ Simple Statistical Analysis: Requires
straightforward statistical methods.
■ Maximum Degrees of Freedom: Provides
the highest number of degrees of freedom.
Disadvantages of CRD
■ Not suitable for field experiments with
heterogeneous conditions.
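A hedged sketch of a CRD in R: treatments are allocated to homogeneous units completely at random and compared with a one-way ANOVA; the treatment means and noise level are simulated assumptions.

```r
# Sketch of a CRD: treatments assigned to homogeneous units completely at
# random, then compared with a one-way ANOVA. Effects and noise are simulated.
set.seed(1)
treatment <- factor(sample(rep(c("A", "B", "C"), each = 8)))  # random allocation
response  <- c(10, 12, 15)[as.integer(treatment)] + rnorm(24, sd = 2)

crd <- data.frame(treatment, response)
summary(aov(response ~ treatment, data = crd))   # F-test comparing treatment means
```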
Randomized Block Design (RBD)
■ An experimental design where experimental units are divided into blocks based on a certain
characteristic, and treatments are randomly assigned within each block.
Key Characteristics:
■ More efficient and accurate compared to CRD.
■ Reduces experimental error by controlling for variability among blocks.
Advantages of RBD
■ Efficiency: More accurate results compared to CRD.
■ Reduced Error: Lower chance of error due to blocking.
■ Flexibility: Allows for any number of treatments and replications.
■ Simple Statistical Analysis: Relatively straightforward analysis.
■ Isolation of Errors: Errors from any treatment can be isolated.
Randomized Block Design (RBD)
Disadvantages of RBD
■ Limited Treatments: Not advisable for a large number of treatments.
■ High Heterogeneity: Ineffective if plot heterogeneity is very high.
■ Increased Block Size: Larger treatments may lead to heterogeneous blocks.
■ Higher Experimental Errors: Increased number of treatments can raise the
possibility of errors.
Latin Square Design (LSD)
■ An experimental design where the material is divided into 'm' rows, 'm' columns, and
'm' treatments, with randomization ensuring each treatment occurs only once in
each row and column.
■ LSD is similar to CRD and RBD, but blocking is done on both rows and columns.
■ Rows and columns can be any two sources of variation in an experiment. In this
sense a Latin square is a generalization of a randomized block design with two
different blocking systems.
■ LSD is a balanced two-way classification scheme, for example a 4 x 4
arrangement. In this scheme each letter from A to D occurs only once in each row
and only once in each column.
■ LSD is probably underused in most fields of research.
A B C D
B C D A
C D A B
D A B C
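An illustrative R sketch that constructs a cyclic m x m Latin square matching the 4 x 4 layout above; in a real LSD the rows, columns, and treatment letters would additionally be randomized.

```r
# Construct a cyclic m x m Latin square matching the 4 x 4 layout above:
# each letter occurs exactly once in every row and every column.
# (A real LSD would also randomize rows, columns, and letters.)
latin_square <- function(m) {
  treatments <- LETTERS[1:m]
  sapply(1:m, function(col) treatments[((col + 0:(m - 1) - 1) %% m) + 1])
}
latin_square(4)
```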
Latin Square Design (LSD)
Advantages of LSD
■ Efficient Design: More efficient compared to CRD and RBD.
■ Simple Analysis: Statistical analysis is relatively straightforward, even with one
missing value.
Disadvantages of LSD
■ Agricultural Limitations: Not suitable for agricultural experiments.
■ Complex Analysis: Complicated when two or more values are missing.
■ Treatment Limitations: Challenging when treatments exceed ten.
Multifactor Design (Factorial Design)
■ A design that studies the effects of two or more factors simultaneously, allowing researchers to
analyze interaction effects between factors.
Example: Factorial Design
Key Characteristics:
■ Each factor has multiple levels, enabling a comprehensive understanding of how different
factors influence the outcome.
■ Useful for exploring complex interactions in experiments.
Advantages of Factorial Design
■ Comprehensive Analysis: Allows for the examination of interactions between factors.
■ Efficient Use of Resources: Maximizes the information gained from each experimental run.
■ Flexibility: Can accommodate various numbers of factors and levels.
Multifactor Design (Factorial Design)
Disadvantages of Factorial Design
■ Complexity: Increased complexity in design and analysis, especially with many
factors.
■ Resource Intensive: May require significant resources and time for larger factorial
designs.
More Related Content

PPTX
Big data ppt
PPTX
Big Data_Big Data_Big Data-Big Data_Big Data
PDF
Lesson_1_definitions_BIG DATA INROSUCTIONUE.pdf
PPTX
bigdata introduction for students pg msc
PPTX
Unit 1 - Introduction to Big Data and Big Data Analytics.pptx
PPTX
KIT601 Unit I.pptx
PDF
Big data Analytics
PPTX
Bigdata Hadoop introduction
Big data ppt
Big Data_Big Data_Big Data-Big Data_Big Data
Lesson_1_definitions_BIG DATA INROSUCTIONUE.pdf
bigdata introduction for students pg msc
Unit 1 - Introduction to Big Data and Big Data Analytics.pptx
KIT601 Unit I.pptx
Big data Analytics
Bigdata Hadoop introduction

Similar to Data Analytics JNTUH Unit 1 overview 001 (20)

PPTX
Big data Analytics Unit - CCS334 Syllabus
PPTX
000 introduction to big data analytics 2021
PPTX
Introduction to data analytics - Intro to Data Analytics
PPTX
bigdata- Introduction for pg students fo
PDF
Introduction to Data Analytics, AKTU - UNIT-1
PPTX
Explorasi Data untuk Peluang Bisnis dan Pengembangan Karir.pptx
PPT
Big data
PDF
Data foundation for analytics excellence
PDF
Fundamentals of data science: digital data
PDF
CS3352-Foundations of Data Science Notes.pdf
PPTX
Big data Analytics Fundamentals Chapter 1
PDF
BigData Analytics_1.7
PPTX
Overview of Big Data Characteristics and Technologies.pptx
PPTX
Data Analytics All 5 Units_all topics.pptx
PPTX
This is abouts are you doing the same time who is the best person to be safe and
PDF
Lecture1 introduction to big data
PDF
Harness the power of data
PPTX
Big data
PPTX
basic of data science and big data......
PPTX
Big Data.pptx
Big data Analytics Unit - CCS334 Syllabus
000 introduction to big data analytics 2021
Introduction to data analytics - Intro to Data Analytics
bigdata- Introduction for pg students fo
Introduction to Data Analytics, AKTU - UNIT-1
Explorasi Data untuk Peluang Bisnis dan Pengembangan Karir.pptx
Big data
Data foundation for analytics excellence
Fundamentals of data science: digital data
CS3352-Foundations of Data Science Notes.pdf
Big data Analytics Fundamentals Chapter 1
BigData Analytics_1.7
Overview of Big Data Characteristics and Technologies.pptx
Data Analytics All 5 Units_all topics.pptx
This is abouts are you doing the same time who is the best person to be safe and
Lecture1 introduction to big data
Harness the power of data
Big data
basic of data science and big data......
Big Data.pptx
Ad

Recently uploaded (20)

PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
PDF
Model Code of Practice - Construction Work - 21102022 .pdf
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PPTX
additive manufacturing of ss316l using mig welding
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
PDF
composite construction of structures.pdf
PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
PDF
Well-logging-methods_new................
PPT
Project quality management in manufacturing
PPTX
CH1 Production IntroductoryConcepts.pptx
DOCX
573137875-Attendance-Management-System-original
PPTX
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
PPT
CRASH COURSE IN ALTERNATIVE PLUMBING CLASS
PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PPTX
Construction Project Organization Group 2.pptx
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
Model Code of Practice - Construction Work - 21102022 .pdf
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
additive manufacturing of ss316l using mig welding
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
composite construction of structures.pdf
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
Well-logging-methods_new................
Project quality management in manufacturing
CH1 Production IntroductoryConcepts.pptx
573137875-Attendance-Management-System-original
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
CRASH COURSE IN ALTERNATIVE PLUMBING CLASS
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
Construction Project Organization Group 2.pptx
Ad

Data Analytics JNTUH Unit 1 overview 001

  • 2. DATA ■ Data is raw facts and numbers used to analyze something or make decisions. ■ Data when classified or organized that has some meaningful value for the user is information. ■ Information is processed data used to make decisions and take actions. ■ Processed data must meet the following criteria for it to be of any significant use in decision-making: – Accuracy: The information must be accurate. – Completeness: The information must be complete. – Timeliness: The information must be available when it’s needed.
  • 3. FORMS OF DATA ■ Collection of information stored in a particular file is represented as forms of data. ■ Data can be of three forms: ■ STRUCTURED FORM - Any form of relational database structure where relation between attributes is possible. That there exists a relation between rows and columns in the database with a table structure. Eg: using database programming languages (sql, oracle, mysql etc). ■ UNSTRUCTURED FORM - Any form of data that does not have predefined structure is represented as unstructured form of data. Eg: video, images, comments, posts, few websites such as blogs and Wikipedia. ■ SEMI STRUCTURED DATA - Does not have form tabular data like RDBMS. Predefined organized formats available. Eg: csv, xml, json, txt file with tab separator etc.
  • 4. BIG DATA ■ Big data describes large and diverse datasets that are huge in volume and also rapidly grow over time. ■ Big data is used in machine learning, predictive modeling, and other advanced analytics to solve business problems and make informed decisions. ■ Big data can be classified into structured, semi-structured, and unstructured data. – Structured data is highly organized and fits neatly into traditional databases. – Semi-structured data, like JSON or XML, is partially organized. – Unstructured data, such as text or multimedia, lacks a predefined structure.
  • 5. 4V PROPERTIES OF BIG DATA Big data is a collection of data from many different sources and is often described by four characteristics, known as the "4 V's": ■ Volume: How much data is collected ■ Velocity: How quickly data can be generated, gathered, and analyzed ■ Variety: How many points of reference are used to collect data ■ Veracity: How reliable the data is, and whether it comes from a trusted source
  • 6. SOURCES OF DATA ■ There are two types of sources of data available. PRIMARY SOURCE OF DATA ■ Eg: data created by individual or a business concern on their own. SECONDARY SOURCE OF DATA ■ Eg: data can be extracted from cloud servers, website sources (kaggle, uci, aws, google cloud, twitter, facebook, youtube, github etc..)
  • 7. DATA ANALYSIS vs DATA ANALYTICS DATA ANALYSIS ■ Data analysis is a process of: – inspecting, – cleansing, – transforming and – modeling data ■ with the goal of: – discovering useful information, – informing conclusions and – supporting decision-making DATA ANALYTICS ■ Data analytics is the science of: – analyzing raw data ■ to make conclusions about that information. ■ This information can then be used to optimize processes to increase the overall efficiency of a business or system
  • 8. 8 Data Analytics Data & Analytics Data ■ Raw Facts & Figures - Unprocessed elements representing values, observations, or descriptions. ■ Foundation of Information - When structured or interpreted, data becomes information. ■ Digital & Analog - Exists in various forms, such as numbers, text, images, or sounds. Data Analytics ■ The science of analyzing raw data to make conclusions about that information ■ Data analytics is the process of examining, cleansing, transforming, and modeling data to discover useful information
  • 9. Data Analytics What is Data Analytics? Data analytics is a multifaceted field that involves the systematic computational analysis of data to discover patterns, infer correlations, and assist in decision- making. Components:  Data Collection  Data Processing  Data Cleansing  Statistical Analysis  Interpretation  Reporting Objective: To uncover patterns, correlations, and insights to inform decision-making by • Examination of Datasets • Integration of Techniques • Strategic Decision-Making Tools & Techniques: • Utilizes statistical analysis, predictive modeling, machine learning, and data mining
  • 10. Role of Data Analytics ■ Gather Hidden Insights – Hidden insights from data are gathered and then analyzed with respect to business requirements. ■ Generate Reports – Reports are generated from the data and are passed on to the respective teams and individuals to deal with further actions for a high rise in business. ■ Perform Market Analysis – Market Analysis can be performed to understand the strengths and weaknesses of competitors. ■ Improve Business Requirement – Analysis of Data allows improving Business to customer requirements and experience.
  • 11. Tools used in Data Analytics ■ R programming ■ Python ■ Tableau Public ■ QlikView ■ SAS ■ Microsoft Excel ■ RapidMiner ■ KNIME ■ Open Refine ■ Apache Spark
  • 12. Data Analytics Types of Data Analytics 01 02 03 04 What Happened? Interprets historical data to identify trends and patterns Descriptive Analytics Why Did It Happen? Delves into data to understand causes and effects Diagnostic Analytics What Could Happen? Uses statistical models to forecast future outcomes Predictive Analytics What Should We Do? Provides recommendations on possible courses of action. Prescriptive Analytics Data Analytics
  • 13. 13 Advantages of Data Analytics Transforming Data into Value Informed Decision-Making Innovation Customer Insights Operational Efficiency Cost Efficiency Risk Management Facilitates evidence- based strategies, reducing guesswork. Empowers organizations to make evidence-based decisions Data-driven approach for strategic planning reducing the risk of reliance on intuition or assumptions. Helps identifying inefficiencies supported by data Streamlines operations by identifying process improvements. Results in improved efficiency & productivity, reduced cost Enhances understanding of customer behavior and preferences for better engagement. Leads to better product development and customer service Provides insights to stay ahead in the market. Helps anticipating and mitigating risks Identifies potential risks and offers mitigation strategies through predictive analytics Support strategic decision-making, real time monitoring, scenario planning with focus on risk mitigation. Drives product development and innovation by revealing market trends Provides a leverage over competitors through insights Helps tailor products and services to individual customer needs Revenue Growth - Uncovers opportunities for new revenue streams. Cost Reduction - Pinpoints areas to save resources and reduce waste. Support resource optimization, process automation, predictive maintenance, financial planning and many more…
  • 14. 14 Data analytics can be applied across various industries to solve specific problems, improve efficiency, and create value. Manufacturing • Quality control & demand forecasting • Suppl chain optimization • Inventory management Banking & Finance • Fraud detection • Algorithmic Trading • Credit risk assessment Business Intelligence • Market trends • Customer Experience Enhancement • Inventory Management & Operational performance Healthcare Sector • Patient data analysis for better treatment plans • Predict health trends • Provide preventative care Use Cases Data Analytics in Action Transportation & Logistics • Route Optimization • Fleet Management Energy Sector • Demand Forecasting • Grid Management
  • 15. Data Architecture What is Data Architecture? ■ Definition: A data architecture describes how data is managed--from collection through to transformation, distribution, and consumption. ■ Blueprint: It sets the blueprint for data and the way it flows through data storage systems. Importance of Data Architecture ■ Foundation: It is foundational to data processing operations and artificial intelligence (AI) applications. ■ Business-Driven: The design should be driven by business requirements. ■ Collaboration: Data architects and data engineers define the data model and underlying data structures.
  • 16. Data Architecture Facilitating Business Needs ■ Purpose: These designs typically facilitate a business need, such as a reporting or data science initiative. ■ Modern Approaches: Modern data architectures often leverage cloud platforms to manage and process data. Benefits of Cloud Platforms ■ Compute Scalability: Enables important data processing tasks to be completed rapidly. ■ Storage Scalability: Helps to cope with rising data volumes and ensures all relevant data is available. ■ Quality Improvement: Enhances the quality of training AI applications.
  • 17. Data Architecture Models The data architecture includes three types of data model: 1. Conceptual model – It is a business model which uses Entity Relationship (ER) model for relation between entities and their attributes. 2. Logical model – It is a model where problems are represented in the form of logic such as rows and column of data, classes, xml tags and other DBMS techniques. 3. Physical model – Physical models holds the database design like which type of database technology will be suitable for architecture.
  • 18. FACTORS THAT INFLUENCE DATA ARCHITECTURE Various constraints and influences affect data architecture design, including: ■ Enterprise Requirements – Economical and effective system expansion – Acceptable performance levels (e.g., system access speed) – Transaction reliability and transparent data management – Conversion of raw data into useful information (e.g., data warehouses) – Techniques: ■ Managing transaction data vs. reference data ■ Separating data capture from data retrieval systems
  • 19. FACTORS THAT INFLUENCE DATA ARCHITECTURE Technology Drivers ■ Derived from completed data and database architecture designs ■ Influenced by: – Existing organizational integration frameworks and standards – Organizational economics – Available site resources (e.g., software licensing) Economics ■ Critical considerations during the data architecture phase ■ Some optimal solutions may be cost- prohibitive ■ External factors affecting decisions: – Business cycle – Interest rates – Market conditions – Legal considerations
  • 20. FACTORS THAT INFLUENCE DATA ARCHITECTURE Business Policies ■ Influences include: – Internal organizational policies – Rules from regulatory bodies – Professional standards – Governmental laws (varying by agency) ■ These policies dictate how the enterprise processes data Data Processing Needs ■ Requirements include: – Accurate and reproducible high- volume transactions – Data warehousing for management information systems – Support for periodic and ad hoc reporting – Alignment with organizational initiatives (e.g., budgets, product development)
  • 21. DATA COLLECTION Types of Data Sources ■ Primary Sources: Data generated directly from original sources. ■ Secondary Sources: Data collected from existing sources. Data Collection Process ■ Involves acquiring, collecting, extracting, and storing large volumes of data. ■ Data can be structured (e.g., databases) or unstructured (e.g., text, video, audio, XML files, images).
  • 22. DATA COLLECTION Importance in Big Data Analysis ■ Initial Step: Data collection is crucial before analyzing patterns and extracting useful information. ■ Raw Data: Collected data is initially unrefined; it requires cleaning to become useful. From Data to Knowledge ■ Transformation: Cleaned data leads to information, which is then transformed into knowledge. ■ Knowledge Applications: Can pertain to various fields, such as business insights or medical treatments.
  • 23. DATA COLLECTION Goals of Data Collection ■ Aim to gather information-rich data. ■ Start with key questions: – What type of data needs to be collected? – What are the sources of this data? Types of Data Collected ■ Qualitative Data: Non-numerical, focusing on behaviors and actions (e.g., words, sentences). ■ Quantitative Data: Numerical data that can be analyzed using scientific tools. Understanding the sources and types of data is essential for effective data collection and subsequent analysis, leading to valuable insights and knowledge.
  • 24. Understanding Data Types Overview of Data Types ■ Data is primarily categorized into two types: – Primary Data – Secondary Data Importance of Data Collection ■ Goal: To gather rich, relevant data for informed decision-making. ■ Considerations: Ensure data is valid, reliable, and suitable for analysis.
  • 25. Primary Data Definition: Raw, original data collected directly from official sources. Collection Techniques: ■ Interviews: Direct questioning of individuals (interviewee) by an interviewer. – Can be structured or unstructured (e.g., personal, telephone, email). ■ Surveys: Gathering responses through questionnaires. – Conducted online or offline (e.g., website forms, social media polls). ■ Observation: Researcher observes behaviors and practices. – Data is recorded in text, audio, or video formats. ■ Experimental Method: Data collected through controlled experiments. – Common designs include: ■ CRD (Completely Randomized Design): Randomization for comparison. ■ RBD (Randomized Block Design): Dividing experiments into blocks for analysis. ■ LSD (Latin Square Design): Arranging data in rows and columns to minimize errors. ■ FD (Factorial Design): Testing multiple variables simultaneously.
  • 26. Secondary Data Definition: Data previously collected and reused for analysis. Sources: ■ Internal Sources: Data found within the organization (e.g., sales records, accounting resources). ■ External Sources: Data obtained from third-party resources (e.g., government publications, market reports).
  • 27. Secondary Data Internal Sources Examples: ■ Accounting Resources: Financial data for marketing analysis. ■ Sales Force Reports: Information on product sales. ■ Internal Experts: Insights from departmental heads. ■ Miscellaneous Reports: Operational data. External Sources Examples: ■ Government Publications: Demographic and economic data (e.g., Registrar General of India). ■ Central Statistical Organization: National accounts statistics. ■ Ministry of Commerce: Wholesale price indices. ■ International Organizations: Data from ILO, OECD, IMF.
  • 28. Data Management What is Data Management? ■ Definition: The practice of collecting, storing, and using data securely, efficiently, and cost-effectively. ■ Goal: To optimize data usage for individuals and organizations while adhering to policies and regulations, enabling informed decision-making and maximizing organizational benefits. Key Aspects of Data Management ■ Broad Scope: Involves a variety of tasks, policies, procedures, and practices.
  • 29. Core Components of Data Management 1. Data Creation and Access: Efficient methods for creating, accessing, and updating data across diverse data tiers. 2. Data Storage Solutions: Strategies for storing data across multiple cloud environments and on-premises systems. 3. High Availability & Disaster Recovery: Ensuring data is always accessible and protected against loss through robust recovery plans. 4. Data Utilization: Leveraging data in various applications, analytics, and algorithms for enhanced insights. 5. Data Privacy & Security: Implementing measures to protect data from unauthorized access and breaches. 6. Data Archiving & Destruction: Managing data lifecycle by archiving and securely destroying data according to retention schedules and compliance requirements.
  • 30. Data Quality What is Data Quality? ■ Definition: Data quality refers to the assessment of how usable data is and how well it fits its intended context. Importance of Data Quality ■ Core of Organizational Activities: High data quality is essential as it underpins all organizational functions. ■ Consequences of Poor Data Quality: Inaccurate data leads to flawed reporting, misguided decisions, and potential economic losses.
  • 31. Data Quality Key Factors in Measuring Data Quality ■ Data Accuracy: Ensures that stored values reflect real-world conditions. ■ Data Uniqueness: Assesses the level of unwanted duplication within datasets. ■ Data Consistency: Checks for adherence to defined semantic rules across datasets. ■ Data Completeness: Evaluates the presence of necessary values in a dataset. ■ Data Timeliness: Measures the appropriateness of data age for specific tasks. Other factors can be taken into consideration such as Availability, Ease of Manipulation, reliability.
  • 32. Importance of Effective Data Management ■ Decision-Making: Enables organizations to make informed decisions based on accurate and timely data. ■ Regulatory Compliance: Ensures adherence to legal and regulatory standards regarding data handling. ■ Operational Efficiency: Streamlines data processes, reducing costs and improving productivity. Conclusion ■ Effective data management is essential for organizations to connect the full potential of their data, ensuring security, compliance, and optimal use in decision- making processes.
  • 33. Data Quality Issues Impact of Poor Data Quality ■ Negative Effects: Poor data quality significantly hampers various processing efforts. ■ Consequences: It directly influences decision-making and can adversely affect revenue. Example: Credit Rating System ■ Implications of Poor Data Quality: – Loan Rejections: Creditworthy applicants may be denied loans. – Risky Approvals: Loans may be sanctioned to applicants with a higher likelihood of defaulting.
  • 34. Data Quality Issues Common Data Quality Problems 1. Noise and Outliers: Irregular data points that can skew analysis. 2. Missing Values: Absence of data that can lead to incomplete insights. 3. Duplicate Data: Redundant entries that can distort results. 4. Incorrect Data: Data that is inaccurate or misleading.
  • 35. Noisy Data ■ Noisy data refers to datasets that contain extraneous, meaningless information. ■ Almost all datasets will inherently include some level of unwanted noise. ■ The term is often synonymous with corrupt data or data that is difficult for machines to interpret. ■ This type of data can be filtered and processed to enhance overall data quality. Illustration: The Crowded Room Analogy ■ Imagine trying to listen to a conversation in a noisy, crowded room. ■ While the human brain can effectively filter out background noise, excessive noise can hinder comprehension. ■ Similarly, as more irrelevant information is added to a dataset, it becomes increasingly challenging to identify the desired patterns.
  • 36. Outliers Definition ■ An outlier is a data point or observation that significantly deviates from the other observations in a dataset. Importance in Data Analysis ■ Outliers are critical for analysts and data scientists, as they require careful examination to avoid misleading estimations. ■ In simple terms, an outlier is an observation that appears distant and diverges from the overall pattern within a sample. Causes of Outliers ■ Outliers may arise from experimental errors or unique circumstances.
  • 37. Outliers Subjectivity in Identification ■ There is no strict mathematical definition for what constitutes an outlier; determining whether an observation is an outlier is often a subjective process. Methods of Outlier Detection Various techniques exist for detecting outliers: ■ Graphical Methods: Such as normal probability plots. ■ Model-Based Approaches: Including box plots and other statistical models.
  • 38. Outliers Types of Outliers: There are two types: ■ Univariate: These outliers can be found when we look at the distribution of a single variable. ■ Multivariate: Multivariate outliers are outliers in an n-dimensional space. Impact of Outliers on a dataset: ■ Increased Error Variance: Outliers can elevate error variance, weakening the power of statistical tests. ■ Decreased Normality: Non-randomly distributed outliers can disrupt the normality of the dataset. ■ Biased Estimates: Outliers can bias estimates that are crucial for analysis. ■ Violation of Assumptions: They can affect the fundamental assumptions of statistical models, including Regression and ANOVA.
  • 39. Outliers Detecting Outliers ■ Visualization Methods: The most common approach for identifying outliers is through visualization techniques, including: – Box Plot – Histogram – Scatter Plot
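As a minimal sketch of these visual checks in R (assuming a made-up numeric vector `sales` with two extreme entries appended), base graphics are enough:

```r
# Made-up data: mostly moderate values plus two extreme entries
set.seed(10)
sales <- c(rnorm(100, mean = 50, sd = 5), 120, 135)

# Box plot: points drawn beyond the whiskers are the usual visual outlier candidates
boxplot(sales, main = "Box plot of sales")

# Histogram: isolated bars far from the main mass of the data suggest outliers
hist(sales, breaks = 30, main = "Histogram of sales")

# Scatter plot against the observation index: extreme points stand apart
plot(seq_along(sales), sales, xlab = "Observation", ylab = "Value",
     main = "Scatter plot of sales")

# boxplot.stats() also reports the values lying beyond the whiskers
boxplot.stats(sales)$out
```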
  • 43. Outlier Treatment Methods ■ Retention: Retaining outliers is appropriate when they represent genuine variability in the data rather than errors. This approach is useful when: – The outliers reflect rare but valid events or behaviors. – The goal is to understand the full range of data, including extremes. – Removing them would result in loss of valuable insights. – Retention is common in exploratory analysis or when modeling real-world phenomena where extreme values are expected. ■ Exclusion: – Depending on the study's objectives, it may be necessary to determine which outliers should be removed from the dataset, as they can significantly influence the analysis results.
  • 44. Outlier Treatment Methods ■ Exclusion: Exclusion involves removing outliers that are clearly due to data entry mistakes, measurement errors, or irrelevant to the study context. This is suitable when: – The outlier is confirmed to be erroneous/not useful. – It does not represent the population of interest. – Its presence distorts statistical summaries or model performance. – Exclusion should be justified and documented to maintain transparency in the analysis.
  • 45. Outlier Treatment Methods ■ Rejection: Rejection is a more formal process, often based on statistical tests or domain- specific rules. It is appropriate when: – The data-generating process and error distribution are well understood. – The outlier violates known theoretical or physical constraints. – Statistical methods (e.g., Z-score, Grubbs' test) support its removal. – Rejection is typically used in controlled environments like laboratory experiments or quality control settings.
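A small illustration of rule-based rejection in R, using a made-up `sales` vector: observations whose absolute z-score exceeds 3 are flagged and then dropped. This is only a sketch of the z-score rule named above, not a prescribed procedure:

```r
set.seed(10)
sales <- c(rnorm(100, mean = 50, sd = 5), 135)   # made-up data with one extreme value

# Z-score rule: flag observations more than 3 standard deviations from the mean
z <- (sales - mean(sales)) / sd(sales)
which(abs(z) > 3)                  # indices of flagged observations
sales_kept <- sales[abs(z) <= 3]   # data set after rejecting the flagged points
```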
  • 46. Other treatment methods The outliers package in R ■ outlier(): Returns the observation that is farthest from the sample mean. ■ scores(): Calculates normalized scores (e.g., z-scores) that can be used to flag extreme values.
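A hedged usage sketch, assuming the third-party outliers package is installed and using a made-up vector `x`; `grubbs.test()` from the same package ties back to the rejection approach on the previous slide:

```r
# install.packages("outliers")    # third-party package, install once
library(outliers)

x <- c(12, 14, 15, 13, 14, 16, 15, 42)   # made-up data with one extreme value

outlier(x)               # the observation farthest from the sample mean (42 here)
scores(x, type = "z")    # normalized z-scores for every observation
grubbs.test(x)           # formal test of whether the most extreme value is an outlier
```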
  • 47. Missing Data treatment Impact of Missing Values ■ Missing data in the training data set can reduce the power/fit of a model or can lead to a biased model, because the behavior of, and relationships with, other variables have not been analyzed correctly. It can lead to wrong prediction or classification. Representation in R ■ Missing Values: Represented by the symbol NA (Not Available). ■ Impossible Values: Denoted by NaN (Not a Number), e.g., the result of 0/0; dividing a non-zero number by zero is output by R as Inf (Infinity).
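A tiny sketch of how these values behave in an R session:

```r
x <- c(10, NA, 25)        # NA marks a missing (not available) value
is.na(x)                  # FALSE  TRUE FALSE
mean(x)                   # NA: missing values propagate by default
mean(x, na.rm = TRUE)     # 17.5: drop the NA before computing

0 / 0                     # NaN: an undefined ("impossible") operation
1 / 0                     # Inf: a non-zero number divided by zero
is.nan(0 / 0)             # TRUE
```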
  • 48. Predictive Mean Matching (PMM) Approach Definition: ■ PMM is a semi-parametric imputation method used to handle missing values. Process: ■ Similar to regression, PMM fills in missing values by randomly selecting from observed donor values. ■ It identifies the donor observation with regression-predicted values closest to those of the missing value from the simulated regression model.
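A minimal PMM sketch, assuming the third-party mice package and R's built-in airquality data set (which contains NAs); `method = "pmm"` selects predictive mean matching:

```r
# install.packages("mice")        # third-party imputation package
library(mice)

data(airquality)                  # built-in data set containing missing values
imp <- mice(airquality, m = 5,    # create 5 imputed data sets
            method = "pmm",       # predictive mean matching
            seed = 123, printFlag = FALSE)
completed <- complete(imp, 1)     # extract the first completed data set
colSums(is.na(completed))         # no missing values remain
```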
  • 49. Data Pre-processing ■ Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format. ■ Data preprocessing is the process of transforming raw data into an understandable format. ■ Data preprocessing includes the steps we need to follow to transform or encode data so that it may be easily parsed by the machine.
  • 50. Importance of Data Preprocessing Why is Data Preprocessing Essential? ■ Real-World Challenges: Most real-world datasets for machine learning are often plagued by missing values, inconsistencies, and noise due to their diverse origins. ■ Impact on Results: Applying data mining algorithms to noisy data can yield poor quality results, as they struggle to effectively identify patterns. ■ Data Quality Improvement: Data preprocessing is crucial for enhancing overall data quality, ensuring more reliable outcomes. ■ Statistical Accuracy: Duplicate or missing values can distort the overall statistics, leading to misleading interpretations. ■ Model Integrity: Outliers and inconsistent data points can disrupt the model's learning process, resulting in inaccurate predictions. ■ Quality-Driven Decisions: Effective decision-making relies on high-quality data. Data preprocessing is vital for obtaining this quality data; without it, we risk a "Garbage In, Garbage Out" scenario.
  • 52. Data Cleaning Overview of Data Cleaning ■ Definition: Data cleaning is a crucial step in data preprocessing, aimed at improving data quality by addressing issues such as missing values, noisy data, inconsistencies, and outliers. Key Components of Data Cleaning 1. Missing Values ■ Solutions: – Ignore: Suitable for large datasets with numerous missing values. – Fill: Use methods like manual entry, regression prediction, or attribute mean.
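A small sketch of these options on a made-up data frame `df` with a numeric `age` column (the regression fill is just one possible variant):

```r
df <- data.frame(age    = c(23, NA, 31, 27, NA, 45),
                 income = c(40, 52, 61, 48, 55, 90))

# Ignore: drop the rows that contain missing values
df_dropped <- na.omit(df)

# Fill with the attribute mean
df_mean <- df
df_mean$age[is.na(df_mean$age)] <- mean(df$age, na.rm = TRUE)

# Fill via regression prediction from another attribute
fit <- lm(age ~ income, data = df)            # fitted on the complete cases
missing <- is.na(df$age)
df_reg <- df
df_reg$age[missing] <- predict(fit, newdata = df[missing, ])
```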
  • 53. Data Cleaning 2. Noisy Data ■ Techniques: – Binning: Smooths data by dividing it into equal-sized bins and replacing values with mean, median, or boundary values. – Regression: Fits data points to a regression function to minimize noise. – Clustering: Groups similar data points; outliers can be identified and removed. 3. Removing Outliers ■ Method: Utilize clustering techniques to identify and eliminate inconsistent data points.
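A brief sketch of smoothing by bin means on a made-up, already sorted vector `price`, using equal-frequency bins:

```r
price <- c(4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34)   # made-up, sorted values

# Equal-frequency binning: three bins of four observations each
bins <- rep(1:3, each = 4)

# Smoothing by bin means: replace every value with the mean of its bin
ave(price, bins, FUN = mean)
# 9 9 9 9 22.75 22.75 22.75 22.75 29.25 29.25 29.25 29.25

# Smoothing by bin boundaries would instead snap each value to the
# nearest bin minimum or maximum
```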
  • 54. Data Integration ■ Definition: Merging data from multiple sources into a single data store, such as a data warehouse. ■ Challenges: – Schema integration and object matching. – Removing redundant attributes. – Resolving data value conflicts.
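As a minimal illustration (not a full integration pipeline), two made-up sources sharing a `customer_id` key can be combined in R with a join; conflicting or redundant attributes would still need to be reconciled afterwards:

```r
orders   <- data.frame(customer_id = c(1, 2, 3), amount = c(250, 120, 310))
profiles <- data.frame(customer_id = c(1, 2, 4), region = c("North", "South", "East"))

# Full outer join on the shared key: unmatched rows are kept with NAs
combined <- merge(orders, profiles, by = "customer_id", all = TRUE)
combined
```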
  • 55. Data Transformation ■ Purpose: Consolidate quality data into alternative forms through various strategies. Data transformation steps: ■ Generalization: Transforming low-level data into high-level information (e.g., city to country). ■ Normalization: Scaling numerical attributes to fit within a specified range using techniques like: – Min-max normalization – Z-score normalization – Decimal scaling normalization ■ Attribute Selection: Creating new properties from existing attributes to enhance data mining. ■ Aggregation: Summarizing data for easier interpretation (e.g., monthly sales data).
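A brief sketch of the three normalization strategies on a made-up vector `x`:

```r
x <- c(200, 300, 400, 600, 1000)          # made-up attribute values

# Min-max normalization: rescale into [0, 1]
min_max <- (x - min(x)) / (max(x) - min(x))

# Z-score normalization: centre on the mean, scale by the standard deviation
z_score <- (x - mean(x)) / sd(x)

# Decimal scaling: divide by 10^j, the smallest power that brings all |values| below 1
j <- floor(log10(max(abs(x)))) + 1        # here j = 4
decimal <- x / 10^j                       # 0.02 0.03 0.04 0.06 0.10

min_max; z_score; decimal
```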
  • 56. Data Reduction ■ Definition: Reducing the size of the dataset while maintaining analytical quality. Data reduction steps: ■ Data Cube Aggregation: Summarizing data in a compact form. ■ Dimensionality Reduction: Reducing redundant features using techniques like Principal Component Analysis. ■ Data Compression: Reducing data size using encoding technologies (lossy vs. lossless). ■ Discretization: Dividing continuous attributes into intervals for better interpretation. ■ Numerosity Reduction: Representing data as models or equations to save storage.
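A compact dimensionality-reduction sketch using base R's `prcomp()` on the built-in iris measurements:

```r
data(iris)
num <- iris[, 1:4]                        # the four numeric measurements

pca <- prcomp(num, center = TRUE, scale. = TRUE)
summary(pca)                              # proportion of variance per component

reduced <- pca$x[, 1:2]                   # keep the first two principal components
dim(reduced)                              # 150 observations, 2 features instead of 4
```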
  • 57. Data Quality Assessment ■ Importance: Ensures high-quality data for operations, customer management, and decision-making. Key Components: ■ Completeness: No missing values. ■ Accuracy: Reliable information. ■ Consistency: Uniform features. ■ Validity: Data meets defined criteria. ■ No Redundancy: Elimination of duplicate data. Quality Assurance Process: ■ Data Profiling: Identifying data quality issues. ■ Data Cleaning: Fixing identified issues. ■ Data Monitoring: Maintaining data quality over time.
  • 58. Best Practices in Data Preprocessing ■ Understand Your Data: Familiarize yourself with the dataset to identify focus areas. ■ Use Statistical Methods: Visualize data for insights into class distribution and quality. ■ Summarize Data: Identify duplicates, missing values, and outliers. ■ Dimensionality Reduction: Eliminate irrelevant fields and reduce complexity. ■ Feature Engineering: Determine which attributes significantly contribute to model training. Conclusion ■ Effective data cleaning and preprocessing are essential for ensuring high-quality data, which ultimately enhances the accuracy and reliability of data analysis and decision-making processes.
  • 59. Data Processing ■ Definition: Data processing refers to the conversion of data into a usable format, transforming raw data into informative and valuable insights. It is often associated with information systems, highlighting the dual role of converting data into information and vice versa. Processing Data vs. Processed Data ■ Processing Data: Involves defining and managing the structure, characteristics, and specifications of data within an organization. ■ Processed Data: Refers to refined and finalized data specifications after undergoing various processing steps. ■ Importance: Processing data reflects ongoing efforts to improve data quality, while processed data represents the refined dataset ready for effective utilization.
  • 61. Stages of Data Processing 1. Collection: Gathering raw data from various sources, establishing a foundation for analysis. 2. Preparation: Organizing, cleaning, and formatting data to facilitate efficient analysis. 3. Input: Entering prepared data into a computer system, either manually or through automated methods. 4. Data Processing: Manipulating and analyzing data to extract meaningful insights and patterns. 5. Data Output: Presenting results in comprehensible formats like reports, charts, or graphs. 6. Data Storage: Storing processed data for future reference and analysis, ensuring accessibility and longevity.
  • 62. Methods of Data Processing ■ Manual Data Processing: Involves human effort to manage data without machines, prone to errors and time-consuming. ■ Mechanical Data Processing: Utilizes machines like punch cards for increased efficiency over manual methods. ■ Electronic Data Processing: Leverages computers for enhanced speed, accuracy, and capacity in data handling.
  • 63. Types of Data Processing ■ Batch Data Processing: Groups data for scheduled processing, suitable for non-time-sensitive tasks. ■ Real-time Data Processing: Processes data immediately as generated, crucial for time-sensitive applications. ■ Online Data Processing: Interactive processing during data collection, supporting concurrent transactions. ■ Automatic Data Processing: Automates tasks using computers and software, efficiently handling large data volumes.
  • 64. Examples of Data Processing ■ Stock Exchanges: Process vast amounts of data during trades, matching buy/sell orders and updating stock prices in real-time. ■ Manufacturing: Utilizes data processing for quality control, analyzing production data to identify defects. ■ Smart Home Devices: Process sensor data to manage tasks like adjusting thermostats and controlling security systems. ■ Electronic Health Records (EHRs): Store and process patient data, facilitating efficient healthcare delivery. Conclusion Data processing is a vital component of data management, encompassing various stages and methods that transform raw data into actionable insights, ultimately supporting informed decision-making across diverse fields.
  • 65. Experimental Data Collection Methods ■ Experimental data collection methods are broadly classified into TWO categories: 1. Single-Factor Experiments 2. Multi-Factor Experiments
  • 66. Single Factor Experiments ■ Single-factor experiments are experiments in which only a single factor is varied while all other factors are kept constant. This design helps isolate the effect of that single factor on the outcome. ■ Here the treatments consist exclusively of the different levels of the single variable factor. Examples of Single Factor Designs ■ Completely Randomized Design (CRD) ■ Randomized Block Design (RBD) ■ Latin Square Design (LSD)
  • 67. Completely Randomized Design (CRD) ■ A statistical experimental design used in data analytics which is based on randomization and replication. It is mostly used for comparing treatments. Key Characteristics: ■ Applicable only when experimental material is homogeneous (e.g., uniform soil conditions). Advantages of CRD ■ Simplicity: Easy to understand and calculate variance. ■ Flexible Replications: Number of replications can vary across treatments. ■ High Flexibility: Any number of treatments can be used. ■ Simple Statistical Analysis: Requires straightforward statistical methods. ■ Maximum Degrees of Freedom: Provides the highest number of degrees of freedom. Disadvantages of CRD ■ Not suitable for field experiments with heterogeneous conditions.
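A CRD is commonly analyzed with a one-way ANOVA; a minimal sketch on made-up data with three treatments and six replicates each:

```r
set.seed(42)
crd <- data.frame(
  treatment = factor(rep(c("A", "B", "C"), each = 6)),      # 3 treatments, 6 replicates
  response  = c(rnorm(6, 10), rnorm(6, 12), rnorm(6, 15))   # made-up responses
)

fit <- aov(response ~ treatment, data = crd)   # one-way ANOVA for a CRD
summary(fit)
```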
  • 68. Randomized Block Design (RBD) ■ An experimental design where experimental units are divided into blocks based on a certain characteristic, and treatments are randomly assigned within each block. Key Characteristics: ■ More efficient and accurate compared to CRD. ■ Reduces experimental error by controlling for variability among blocks. Advantages of RBD ■ Efficiency: More accurate results compared to CRD. ■ Reduced Error: Lower chance of error due to blocking. ■ Flexibility: Allows for any number of treatments and replications. ■ Simple Statistical Analysis: Relatively straightforward analysis. ■ Isolation of Errors: Errors from any treatment can be isolated.
  • 69. Randomized Block Design (RBD) Disadvantages of RBD ■ Limited Treatments: Not advisable for a large number of treatments. ■ High Heterogeneity: Ineffective if plot heterogeneity is very high. ■ Increased Block Size: Larger treatments may lead to heterogeneous blocks. ■ Higher Experimental Errors: Increased number of treatments can raise the possibility of errors.
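For an RBD, the block enters the ANOVA model as an additional term so that block-to-block variability is removed from the error; a hedged sketch on made-up data:

```r
set.seed(1)
rbd <- expand.grid(
  treatment = factor(paste0("T", 1:4)),   # 4 treatments
  block     = factor(paste0("B", 1:5))    # 5 blocks
)
rbd$response <- rnorm(nrow(rbd), mean = 20) + as.numeric(rbd$block)  # made-up block effect

fit <- aov(response ~ treatment + block, data = rbd)  # block term absorbs block variation
summary(fit)
```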
  • 70. Latin Square Design (LSD) ■ An experimental design where the material is divided into 'm' rows, 'm' columns, and 'm' treatments, with randomization ensuring each treatment occurs only once in each row and column. ■ LSD is similar to CRD and RBD but blocks on both rows and columns. ■ Rows and columns can be any two sources of variation in an experiment; in this sense a Latin square is a generalization of a randomized block design with two different blocking systems. ■ LSD is a balanced two-way classification scheme, for example a 4 × 4 arrangement in which each letter from A to D occurs exactly once in each row and once in each column: Row 1: A B C D | Row 2: B C D A | Row 3: C D A B | Row 4: D A B C ■ LSD is probably under-used in most fields of research.
  • 71. Latin Square Design (LSD) Advantages of LSD ■ Efficient Design: More efficient compared to CRD and RBD. ■ Simple Analysis: Statistical analysis is relatively straightforward, even with one missing value. Disadvantages of LSD ■ Agricultural Limitations: Not suitable for agricultural experiments. ■ Complex Analysis: Complicated when two or more values are missing. ■ Treatment Limitations: Challenging when treatments exceed ten.
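An LSD adds both row and column blocking terms to the model; a minimal sketch for a 4 × 4 square matching the layout shown above, with made-up responses:

```r
set.seed(2)
lsd <- data.frame(
  row       = factor(rep(1:4, each = 4)),
  col       = factor(rep(1:4, times = 4)),
  treatment = factor(c("A","B","C","D",
                       "B","C","D","A",
                       "C","D","A","B",
                       "D","A","B","C")),   # each letter once per row and per column
  response  = rnorm(16, mean = 50)          # made-up yields
)

fit <- aov(response ~ row + col + treatment, data = lsd)  # rows, columns, treatments
summary(fit)
```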
  • 72. Multifactor Design (Factorial Design) ■ A design that studies the effects of two or more factors simultaneously, allowing researchers to analyze interaction effects between factors (for example, a factorial design with several factors at several levels each). Key Characteristics: ■ Each factor has multiple levels, enabling a comprehensive understanding of how different factors influence the outcome. ■ Useful for exploring complex interactions in experiments. Advantages of Factorial Design ■ Comprehensive Analysis: Allows for the examination of interactions between factors. ■ Efficient Use of Resources: Maximizes the information gained from each experimental run. ■ Flexibility: Can accommodate various numbers of factors and levels.
  • 73. Multifactor Design (Factorial Design) Disadvantages of Factorial Design ■ Complexity: Increased complexity in design and analysis, especially with many factors. ■ Resource Intensive: May require significant resources and time for larger factorial designs.
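A hedged sketch of a 2 × 3 factorial analysis in R on made-up data, where the A:B term captures whether the effect of one factor depends on the level of the other:

```r
set.seed(7)
fac <- expand.grid(
  A   = factor(c("low", "high")),   # factor A with 2 levels
  B   = factor(c("b1", "b2", "b3")),# factor B with 3 levels
  rep = 1:4                         # 4 replicates per combination
)
fac$response <- rnorm(nrow(fac), mean = 30)     # made-up responses

fit <- aov(response ~ A * B, data = fac)        # main effects of A and B plus A:B interaction
summary(fit)

interaction.plot(fac$A, fac$B, fac$response)    # roughly parallel lines suggest little interaction
```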

Editor's Notes

  • #8: Data Raw Facts & Figures - Unprocessed elements representing values, observations, or descriptions. Foundation of Information - When structured or interpreted, data becomes information. Digital & Analog - Exists in various forms, such as numbers, text, images, or sounds. Data Analytics The science of analyzing raw data to make conclusions about that information Data analytics is the process of examining, cleansing, transforming, and modeling data to discover useful information
  • #9: Data analytics is a multifaceted field that involves the systematic computational analysis of data to discover patterns, infer correlations, and assist in decision-making. Objective: To uncover patterns, correlations, and insights to inform decision-making by Examination of Datasets Integration of Techniques Strategic Decision-Making Tools & Techniques: Utilizes statistical analysis, predictive modeling, machine learning, and data mining Components: Data Collection Data Processing Data Cleansing Statistical Analysis Interpretation Reporting
  • #12: Types of Data Analytics - Descriptive Analytics: What Happened? Interprets historical data to identify trends and patterns. - Diagnostic Analytics: Why Did It Happen? Delves into data to understand causes and effects. - Predictive Analytics: What Could Happen? Uses statistical models to forecast future outcomes. - Prescriptive Analytics: What Should We Do? Provides recommendations on possible courses of action.
  • #13: Informed Decision-Making - Facilitates evidence-based strategies, reducing guesswork. - Empowers organizations to make evidence-based decisions. - Data-driven approach for strategic planning, reducing the risk of reliance on intuition or assumptions. Operational Efficiency - Helps identify inefficiencies supported by data. - Streamlines operations by identifying process improvements. - Results in improved efficiency & productivity and reduced cost. Customer Insights - Enhances understanding of customer behavior and preferences for better engagement. - Leads to better product development and customer service. - Provides insights to stay ahead in the market. Risk Management - Helps anticipate and mitigate risks. - Identifies potential risks and offers mitigation strategies through predictive analytics. - Supports strategic decision-making, real-time monitoring, and scenario planning with a focus on risk mitigation. Innovation - Drives product development and innovation by revealing market trends. - Provides leverage over competitors through insights. - Helps tailor products and services to individual customer needs. Cost Efficiency: Revenue Growth - Uncovers opportunities for new revenue streams. Cost Reduction - Pinpoints areas to save resources and reduce waste. - Supports resource optimization, process automation, predictive maintenance, financial planning and many more…
  • #14: Business Intelligence - Market trends - Customer Experience Enhancement - Inventory Management & Operational performance Healthcare Sector - Patient data analysis for better treatment plans - Predict health trends - Provide preventative care Banking & Finance - Fraud detection - Algorithmic Trading - Credit risk assessment Manufacturing - Quality control & demand forecasting - Supply chain optimization - Inventory management Energy Sector - Demand Forecasting - Grid Management Transportation & Logistics - Route Optimization - Fleet Management
  • #38: ANOVA and regression are both statistical methods, but they are used for different purposes. ANOVA (Analysis of Variance) is used to compare the means of three or more groups, while regression is used to model the relationship between variables and make predictions. While they have distinct applications, ANOVA and regression can be used together in certain scenarios, particularly in the context of linear regression
  • #49: The main agenda for a model to be accurate and precise in predictions is that the algorithm should be able to easily interpret the data's features.