Module 5
Dr. K V N LAKSHMI
What is data processing?
• Data processing involves transforming raw data into useful
information
• Stages of data processing include collection, filtering,
sorting, and analysis
• Data processing relies on various tools and techniques to
ensure accurate, valuable output
Data collection
• The first stage of data collection involves gathering and
discovering raw data from various sources, such as sensors,
databases, or customer surveys. It is essential to ensure the
collected data is accurate, complete, and relevant to the
analysis or processing goals. Care must be taken to avoid
selection bias, where the method of collecting data
inadvertently favors certain outcomes or groups, potentially
skewing results and leading to inaccurate conclusions.
Data preparation
• Once the data is collected, it moves to the data preparation
stage. Here, the raw data is cleaned up, organized, and
often enriched for further processing. This stage involves
checking for errors, removing any bad data (redundant,
incomplete, or incorrect), and enhancing the dataset with
additional relevant information from external sources, a
process known as data enrichment. Data preparation aims
to create high-quality, reliable, and comprehensive data for
subsequent processing steps.
Data input
• The next stage is data input. In this stage, the clean and
prepped data is fed into a processing system, which could
be software or an algorithm designed for specific data types
or analysis goals. Various methods, such as manual entry,
data import from external sources, or automatic data
capture, can be used to input data into the processing
system.
Data processing
• In the data processing stage, the input data is transformed,
analyzed, and organized to produce relevant information.
Several data processing techniques, like filtering, sorting,
aggregation, or classification, may be employed to process
the data. The choice of methods depends on the desired
outcome or insights from the data.
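As a small illustration, below is a minimal pandas sketch of these four techniques; the dataset and column names (region, sales) are hypothetical and chosen only for the example.

```python
import pandas as pd

# Hypothetical sales records; column names are illustrative only.
df = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South"],
    "sales":  [250, 400, 310, 150, 500],
})

# Filtering: keep only rows that meet a condition.
high_sales = df[df["sales"] > 200]

# Sorting: order rows by a column.
sorted_df = df.sort_values("sales", ascending=False)

# Aggregation: summarise values per group.
totals = df.groupby("region")["sales"].sum()

# Classification: assign each row to a category.
df["tier"] = pd.cut(df["sales"], bins=[0, 200, 400, float("inf")],
                    labels=["low", "medium", "high"])
print(totals)
```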
Data output and interpretation
• The data output and interpretation stage deals with
presenting the processed data in an easily digestible format.
This could involve generating reports, graphs, or
visualizations that simplify complex data patterns and help
with decision-making. Furthermore, the output data should
be interpreted and analyzed to extract valuable insights and
knowledge.
Data storage
• Finally, in the data storage stage, the processed information
is securely stored in databases or data warehouses for
future retrieval, analysis, or use. Proper storage ensures
data longevity, availability, and accessibility while
maintaining data privacy and security.
Batch processing
• Batch processing involves handling large volumes of data
collectively at predetermined times, making it ideal for non-
time-sensitive tasks. This method allows organizations to
efficiently manage data by aggregating it and processing it
during off-peak hours to minimize the impact on daily
operations.
• Example: Financial institutions batch process checks and
transactions overnight, updating account balances in one
comprehensive sweep to ensure accuracy and efficiency.
Real-time processing
• Real-time processing is essential for tasks that require
immediate handling of data upon receipt, providing instant
processing and feedback. This type of processing is crucial
for applications where delays cannot be tolerated, ensuring
timely decisions and responses.
• Example: GPS navigation systems rely on real-time
processing to offer turn-by-turn directions, adjusting routes
based on live traffic and road conditions to ensure the
fastest path.
Multiprocessing (parallel processing)
• Multiprocessing, or parallel processing, involves utilizing
multiple processing units or CPUs to handle various tasks
simultaneously. This approach allows for more efficient data
processing, particularly for complex computations that can
be broken down into smaller, concurrent tasks, thereby
speeding up overall processing time.
• Example: Movie production often utilizes multiprocessing for
rendering complex 3D animations. By distributing the
rendering across multiple computers, the overall project's
completion time is significantly reduced, leading to faster
production cycles and improved visual quality.
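A minimal Python sketch of the same idea is shown below, splitting an illustrative CPU-bound task across several worker processes; the render_frame function is only a stand-in for independent, expensive work, not an actual renderer.

```python
from multiprocessing import Pool

def render_frame(frame_number):
    # Stand-in for an expensive, independent computation (e.g. one frame).
    return sum(i * i for i in range(frame_number * 10_000))

if __name__ == "__main__":
    frames = range(1, 9)
    # Distribute the independent tasks across multiple CPU cores.
    with Pool(processes=4) as pool:
        results = pool.map(render_frame, frames)
    print(len(results), "frames processed")
```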
Online processing
• Online processing facilitates the interactive processing of
data over a network, with continuous input and output for
instant responses. It enables systems to handle user
requests immediately, making it an essential component of
e-commerce and online services.
• Example: Online banking systems utilize online processing
for real-time financial transactions, allowing users to
transfer funds, pay bills, and check account balances with
immediate updates.
Manual data processing
• Manual data processing requires human intervention for the
input, processing, and output of data, typically without the
aid of electronic devices. This labor-intensive method is
prone to errors but was common before the advent of
computerized systems.
• Example: Before the widespread use of computers, libraries
cataloged books manually, requiring librarians to carefully
record each book's details by hand for inventory and
retrieval purposes.
Mechanical data processing
• Mechanical data processing uses machines or equipment to
manage and process data tasks, a prevalent method before
the digital era. This approach involved using tangible,
mechanical devices to input, process, and output data.
• Example: Voting in the early 20th century often involved
mechanical lever machines, where votes were tallied by
pulling levers for each choice, simplifying vote counting and
reducing the potential for errors.
Electronic data processing
• Electronic data processing employs computers and digital
technology to process, store, and communicate data with
efficiency and accuracy. This modern approach to data
handling allows for rapid processing speeds, vast storage
capabilities, and easy data retrieval.
• Example: Retailers use electronic data processing at
checkouts, where barcode scans instantly update inventory
systems and process sales, enhancing checkout speed and
inventory management.
Distributed processing
• Distributed processing involves spreading computational
tasks across multiple computers or devices to improve
processing speed and reliability. This method leverages the
collective power of various systems to handle large-scale
processing tasks more efficiently than could be achieved
with a single computer.
• Example: Video streaming services use distributed
processing to deliver content efficiently. By storing videos
on multiple servers, they ensure smooth playback and quick
access for users worldwide.
Cloud computing
• Cloud computing offers computing resources, such as
servers, storage, and databases, over the internet, providing
flexibility and scalability. This model enables users to access
and utilize computing resources as needed, without the
burden of maintaining physical infrastructure.
• Example: Small businesses leverage cloud computing for
data storage and software services, avoiding the need for
significant upfront hardware investments and allowing easy
scaling as the business grows.
Automatic data processing
• Automatic data processing uses software to automate
routine tasks, reducing the need for manual input and
increasing operational efficiency. This method streamlines
repetitive processes, minimizes human error, and frees up
personnel for more strategic tasks.
• Example: Automated billing systems in telecommunications
automatically calculate and send out monthly charges to
customers, streamlining billing operations and reducing
errors.
Data preparation
• Data preparation is the process of cleaning and
transforming raw data prior to processing and analysis. It is
an important step that often involves reformatting data,
correcting data, and combining datasets to enrich the data.
• Data preparation is often a lengthy undertaking for data
engineers or business users, but it is essential as a
prerequisite to put data in context in order to turn it into
insights and eliminate bias resulting from poor data quality.
Benefits of data preparation in the cloud
• Fix errors quickly — Data preparation helps catch errors before processing. After data has been
removed from its original source, these errors become more difficult to understand and correct.
• Produce top-quality data — Cleaning and reformatting datasets ensures that all data used in
analysis will be of high quality.
• Make better business decisions — Higher-quality data that can be processed and analyzed more
quickly and efficiently leads to more timely, efficient, better-quality business decisions.
• Additionally, as data and data processes move to the cloud, data preparation moves with it for even
greater benefits, such as:
• Superior scalability — Cloud data preparation can grow at the pace of the business. Enterprises
don’t have to worry about the underlying infrastructure or try to anticipate how it will evolve.
• Future proof — Cloud data preparation upgrades automatically so that new capabilities or
problem fixes can be turned on as soon as they are released. This allows organizations to stay
ahead of the innovation curve without delays and added costs.
• Accelerated data usage and collaboration — Doing data prep in the cloud means it is always on,
doesn’t require any technical installation, and lets teams collaborate on the work for faster results.
Data preparation steps
Questionnaire checking
• The data preparation process begins with finding the right data. This can
come from an existing data catalog, or data sources can be added ad hoc.
• Check whether each returned questionnaire is acceptable, i.e., whether it is
complete and properly filled in.
Data editing
• Data editing is the application of checks to detect missing, invalid or
inconsistent entries or to point to data records that are potentially in
error. No matter what type of data you are working with, certain edits are
performed at different stages or phases of data collection and processing.
• Detect errors and omissions
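As a small illustration, here is a pandas sketch of such edit checks; the survey columns (age, satisfaction) and their valid ranges are assumptions made purely for the example.

```python
import pandas as pd

responses = pd.DataFrame({
    "age":          [25, 34, None, 190, 41],   # one missing, one invalid entry
    "satisfaction": [4, 5, 3, 2, 7],            # valid range assumed to be 1-5
})

# Detect missing entries.
missing = responses[responses["age"].isna()]

# Detect invalid entries (values outside a plausible range).
invalid_age = responses[(responses["age"] < 0) | (responses["age"] > 120)]
invalid_sat = responses[~responses["satisfaction"].between(1, 5)]

print("Missing:", len(missing), "| Invalid age:", len(invalid_age),
      "| Invalid satisfaction:", len(invalid_sat))
```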
Data preparation steps
Data coding: Converting data into codes.
Process of assigning numerical values to responses that are
originally in a given format such as numerical, text, audio or
video. The main objective is to facilitate the automatic
treatment of data for analytical purposes.
Coded data can be analyzed using statistical software tools.
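A minimal sketch of coding Likert-style text responses into numeric codes with pandas follows; the response categories and the code values are illustrative assumptions.

```python
import pandas as pd

answers = pd.Series(["Agree", "Strongly agree", "Neutral", "Disagree", "Agree"])

# Codebook: each response category is assigned a numerical code.
codebook = {"Strongly disagree": 1, "Disagree": 2, "Neutral": 3,
            "Agree": 4, "Strongly agree": 5}

coded = answers.map(codebook)
print(coded.tolist())   # [4, 5, 3, 2, 4]
```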
Data preparation steps
Data classification:
• Data classification is the practice of organizing and categorizing data elements according
to pre-defined criteria. Classification makes data easier to locate and retrieve. Classifying
data is instrumental in promoting risk management, security, and regulatory compliance.
• Steps for Effective Data Classification
• Understand the Current Setup: Taking a detailed look at the location of current data
and all regulations that pertain to your organization is perhaps the best starting point for
effectively classifying data. You must know what data you have before you can classify it.
• Creating a Data Classification Policy: Staying compliant with data protection principles
in an organization is nearly impossible without proper policy. Creating a policy should be
your top priority.
• Prioritize and Organize Data: Now that you have a policy and a picture of your current
data, it’s time to properly classify the data. Decide on the best way to tag your data based
on its sensitivity and privacy.
Data preparation steps
Classification is of two types (the class-interval case is sketched below):
• According to attribute, e.g. literacy rate, honesty, beauty, weight, height
• According to class-interval, e.g. income, production, age, and sometimes weight and height
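A minimal sketch of classification according to class-interval, grouping a hypothetical income variable into intervals with pandas; the interval boundaries are assumptions chosen for illustration.

```python
import pandas as pd

income = pd.Series([12000, 27000, 45000, 61000, 88000, 30500])

# Class intervals (boundaries are illustrative assumptions).
bins = [0, 25000, 50000, 75000, 100000]
labels = ["0-25k", "25k-50k", "50k-75k", "75k-100k"]

income_class = pd.cut(income, bins=bins, labels=labels)
print(income_class.value_counts().sort_index())
```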
Data preparation steps
Tabulation:
Tabulation is a method of presenting numeric data in rows
and columns in a logical and systematic manner to aid
comparison and statistical analysis. It allows for easier
comparison by putting relevant data closer together, and it
aids in statistical analysis and interpretation.
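As a small illustration, here is a pandas sketch that tabulates hypothetical survey data into rows and columns with marginal (grand) totals; the variables gender and response are assumptions made for the example.

```python
import pandas as pd

survey = pd.DataFrame({
    "gender":   ["F", "M", "F", "M", "F", "M"],
    "response": ["Yes", "No", "Yes", "Yes", "No", "No"],
})

# Cross-tabulation with row and column (grand) totals.
table = pd.crosstab(survey["gender"], survey["response"], margins=True)
print(table)
```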
IMPORTANCE OF TABULATION
• Information or statistics presented in a table should be divided into different
dimensions, and for each dimension the grand totals and sub-totals should be clearly
shown, so that the associations between the different dimensions of the data in the
tabular form are easy to understand.
• The statistics should be arranged in a systematic manner, with headings and proper
numbering, which helps readers recognise the relevance of the table to the research.
• Tabulation puts the data into concise form; as a result, it helps the reader
understand it easily. The same data can also be presented in the form of graphs, charts,
flow charts, or diagrams.
• Tabular form presents the numerical figures in an attention-grabbing way.
• It reduces complex data to a simpler form, and as a result it becomes easy to
categorise the data.
IMPORTANCE OF TABULATION
• Arranging data in tabular form helps in detecting mistakes.
• Tables help in condensing the information and make it easy to
examine the contents.
• Tabulation is an economical way to present data; it minimises time
and, in turn, helps the researcher work more effectively.
• With modern tools, tabular presentation can easily summarise large
volumes of scattered data in a systematic form.
Tabulation
Data preparation steps
Graphical representation refers to the use of charts and
graphs to visually display, analyze, clarify, and interpret
numerical data, functions, and other qualitative structures.
Stem and leaf plot
• A stem and leaf plot is used to organize data as they are
collected. A stem and leaf plot looks something like a bar
graph. Each number in the data is broken down into a stem
and a leaf, thus the name.
• Ex. 1: 15, 27, 8, 17, 13, 17, 22, 24, 25, 14, 13, 36, 22, 22, 32, 32, 28, 7
• Ex. 2: 72, 85, 89, 93, 88, 109, 115, 97, 102, 113
• Ex. 3: 1.2, 2.3, 1.5, 1.6, 1.8, 2.7, 3.2, 3.6, 4.5, 7.8, 7.1, 10.6, 11.5
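A minimal sketch that builds the stem and leaf plot for Ex. 1, using the tens digit as the stem and the units digit as the leaf:

```python
from collections import defaultdict

data = [15, 27, 8, 17, 13, 17, 22, 24, 25, 14, 13, 36, 22, 22, 32, 32, 28, 7]

stems = defaultdict(list)
for value in sorted(data):
    stems[value // 10].append(value % 10)   # stem = tens digit, leaf = units digit

for stem in sorted(stems):
    print(f"{stem} | {' '.join(str(leaf) for leaf in stems[stem])}")

# Expected output:
# 0 | 7 8
# 1 | 3 3 4 5 7 7
# 2 | 2 2 2 4 5 7 8
# 3 | 2 2 6
```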
Data preparation steps
Data cleaning: the process of fixing or removing incorrect,
corrupted, incorrectly formatted, duplicate, or incomplete data
within a dataset.
Deduplication
• Deduplication refers to a method of eliminating a dataset's
redundant data. In a secure data deduplication process, a
deduplication assessment tool identifies extra copies of data
and deletes them, so a single instance can then be stored.
Data deduplication software analyses data to identify
duplicate byte patterns.
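A minimal pandas sketch of cleaning and deduplicating a small, hypothetical dataset follows; the column names are illustrative. Real deduplication tools work at the byte level, as noted above, but the row-level idea is the same.

```python
import pandas as pd

records = pd.DataFrame({
    "customer_id": [101, 102, 102, 103, 104],
    "email": ["a@x.com", "b@x.com", "b@x.com", None, "d@x.com"],
})

# Data cleaning: drop rows with missing values.
cleaned = records.dropna()

# Deduplication: keep a single instance of each duplicate record.
deduplicated = cleaned.drop_duplicates()
print(deduplicated)
```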
What is ANOVA?
• ANOVA, or Analysis of Variance, is a test used to determine
differences between research results from three or more unrelated
samples or groups
• The key word in ‘Analysis of Variance’ is the last one. ‘Variance’
represents the degree to which numerical values of a particular
variable deviate from its overall mean. You could think of the
dispersion of those values plotted on a graph, with the average being
at the centre of that graph. The variance provides a measure of how
scattered the data points are from this central value.
• H0: There is no difference between the group means
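A minimal sketch of a one-way ANOVA with SciPy; the three groups and their values are illustrative assumptions.

```python
from scipy import stats

# Hypothetical measurements from three unrelated groups.
group_a = [23, 25, 28, 30, 27]
group_b = [31, 33, 29, 35, 34]
group_c = [22, 20, 25, 23, 24]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
# A small p-value (e.g. < 0.05) would lead to rejecting H0 of equal group means.
```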
The Chi squared tests
• What Is Goodness-of-Fit:
The term goodness-of-fit refers to a statistical test that determines how
well sample data fits the distribution expected from the population. Put
simply, it tests whether a sample is skewed or represents the data you would
expect to find in the actual population.
H0: The observed frequencies do not differ from the expected distribution
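A minimal sketch of a chi-squared goodness-of-fit test with SciPy; the observed and expected counts are illustrative assumptions.

```python
from scipy import stats

observed = [18, 22, 20, 25, 15]          # observed frequencies per category
expected = [20, 20, 20, 20, 20]          # frequencies expected under H0

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
# A large p-value means the sample is consistent with the expected distribution.
```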
T-test
• A t-test is a statistical tool that compares the means of two groups or
the mean of a group to a standard value. It's also known as a
Student's t-test, t-statistic, or t-distribution
• H0: There is no difference between the group means
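A minimal sketch of an independent two-sample t-test with SciPy; the two groups of scores are illustrative assumptions.

```python
from scipy import stats

group_1 = [68, 72, 75, 70, 74, 69]
group_2 = [78, 82, 80, 77, 85, 79]

t_stat, p_value = stats.ttest_ind(group_1, group_2)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# A small p-value would lead to rejecting H0 that the two group means are equal.
```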
One-Sample Proportion Test
• The One-Sample Proportion Test is used to assess whether a
population proportion (P1) is significantly different from a
hypothesized value (P0). This is called the hypothesis of inequality.
• H0: The population proportion does not differ from the hypothesized value (P1 = P0)
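A minimal sketch of a one-sample proportion z-test computed from its standard formula; the sample figures and the hypothesized proportion are illustrative assumptions.

```python
from math import sqrt
from scipy import stats

successes, n = 130, 200        # observed successes out of n trials (assumed)
p0 = 0.60                      # hypothesized population proportion P0 (assumed)

p_hat = successes / n
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value = 2 * (1 - stats.norm.cdf(abs(z)))   # two-sided test
print(f"z = {z:.3f}, p = {p_value:.4f}")
# A small p-value would lead to rejecting H0 that P1 = P0.
```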
Correlational test
• A correlational test, also known as correlation analysis, is a statistical
method that measures the strength and direction of the relationship
between two or more variables. The results of a correlational test are
summarized as a correlation coefficient, which is a number between -
1 and +1. The value of the coefficient indicates the strength of the
relationship, and the sign indicates the direction
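A minimal sketch of a Pearson correlation test with SciPy; the paired values are illustrative assumptions.

```python
from scipy import stats

hours_online = [1, 2, 3, 4, 5, 6, 7]
stress_score = [2, 3, 3, 5, 6, 6, 8]

r, p_value = stats.pearsonr(hours_online, stress_score)
print(f"r = {r:.3f}, p = {p_value:.4f}")
# r close to +1 or -1 indicates a strong relationship; the sign gives its direction.
```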
Hypothesis:
• The effect of social media on mental well-being does not significantly
vary based on the frequency of its usage.
• ANOVA: Average Daily Social Media Usage

                    Sum of Squares    df   Mean Square        F     Sig.
  Between Groups            52.737     4        13.184   14.688     .000
  Within Groups             82.582    92          .898
  Total                    135.320    96

• Since the F statistic (14.688) is significant at p < .001, the null hypothesis is rejected:
the effect on well-being varies with the frequency of social media usage.
Hypothesis
• H0: The level of social media addiction does not differ significantly
between gender.
Chi-Square Tests

                                  Value    df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                                  (2-sided)    (2-sided)    (1-sided)
  Pearson Chi-Square             .029 (a)    1        .865
  Continuity Correction (b)      .000        1       1.000
  Likelihood Ratio               .029        1        .865
  Fisher's Exact Test                                              1.000         .516
  Linear-by-Linear Association   .029        1        .866
  N of Valid Cases                 97

  a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 16.60.
  b. Computed only for a 2x2 table

• Since the Pearson chi-square significance (.865) is well above .05, the null hypothesis is not
rejected: the level of social media addiction does not differ significantly between genders.
Hypothesis
• There is no relation between the type of accommodation and
spending more time on social media than intended.