2. Chapter Two
Data science
A multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
3. What is Data?
Data is raw/unprocessed facts or figures, which may seem random and useless to a human.
Data can be represented with the help of:
■ Alphabets (A-Z, a-z)
■ Digits (0-9)
■ Special characters (+, -, *, /, <, >, =, etc.)
5. Data Vs. Information
Data is raw/unprocessed facts or figures, and information is the processed form of data.

Data                             | Information
---------------------------------|-------------------------------------------------------------------------
Unprocessed                      | Processed data
Never suited to a specific need  | Suited to a specific need, because unwanted or irrelevant data are removed
Does not depend on information   | Depends on data
Measured in bits and bytes       | Measured in meaningful units such as time and quality
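A minimal sketch (with made-up numbers) of how processing turns raw data into information suited to a specific need:

```python
# Raw data: unprocessed figures with no context (hypothetical values).
daily_sales = [120, 95, 143, 88, 110, 160, 132]

# Processing turns the data into information for a specific need:
# "How did the shop perform this week?"
total = sum(daily_sales)
average = total / len(daily_sales)
best_day = max(daily_sales)

print(f"Weekly total: {total}")        # information: total sales for the week
print(f"Daily average: {average:.1f}") # information: typical daily performance
print(f"Best day: {best_day}")         # information: peak sales figure
```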
6. Data Processing Life cycle
Data processing is the re-structuring or re-ordering of data by people or machines to increase its usefulness and add value for a particular purpose.
Data processing consists of the following basic steps:
Input (preparation of data for processing)
Processing (changing the input data into a more useful form)
Output (collecting the result of processing)
7. Data Processing Cycle
Data processing is the re-structuring or re-ordering of data by people or machines to increase its usefulness and add value for a particular purpose. It has three steps.
Input:
The input data is prepared in a convenient form for processing. The form depends on the processing machine.
For example, when electronic computers are used for data processing, the input data can be recorded on a
hard disk, CD, flash disk, and so on.
Source: Introduction to emerging technology module page 23
8. Data Processing Cycle
Processing:
The input data is changed to produce data in a more useful form.
For example, interest can be calculated on a deposit to a bank, or a summary of sales for the month can be
calculated from the sales orders.
Output:
The result of the processing step is collected. The particular form of the output data depends on the use of
the data.
For example, the output data may be the payroll for employees.
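The input-processing-output cycle can be illustrated with the bank-interest example above; the interest rate and balances below are made-up values used only for this sketch.

```python
# Input: prepare the data in a convenient form for processing
# (here, hypothetical account balances read into memory).
deposits = [1000.0, 2500.0, 740.0]
annual_rate = 0.07  # assumed interest rate, for illustration only

# Processing: change the input data into a more useful form
# (interest calculated on each deposit).
interest = [round(balance * annual_rate, 2) for balance in deposits]

# Output: collect the result of processing in the form the user needs.
for balance, earned in zip(deposits, interest):
    print(f"Deposit {balance:.2f} earns {earned:.2f} interest")
```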
9. Data types and their representation
A data type is the type of data that is going to be processed.
Data types can be defined from diverse perspectives:
1. Programming perspective
2. Data Analytics perspective
10. Common Data Types in the Programming Perspective
Data types and their representation - programming perspective:
A data type is an attribute of data that tells the compiler or interpreter how the programmer intends to use the data.
Common data types include:
■ Integers (int) - used to store whole numbers, mathematically known as integers
■ Booleans (bool) - used to represent values restricted to one of two values: true or false
■ Characters (char) - used to store a single character
■ Floating-point numbers (float) - used to store real numbers
■ Alphanumeric strings (string) - used to store a combination of characters and numbers
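The same five types can be sketched in Python; note that Python has no dedicated char type, so a single character is represented as a one-character string (this detail belongs to the example, not to the slide).

```python
# Integer: whole numbers
age = 25                   # int

# Boolean: restricted to one of two values
is_student = True          # bool

# Character: a single character (in Python, a string of length 1)
grade = 'A'                # no dedicated char type in Python

# Floating-point number: real numbers
gpa = 3.74                 # float

# Alphanumeric string: a combination of characters and numbers
student_id = "ETS0123/11"  # str

for value in (age, is_student, grade, gpa, student_id):
    print(type(value).__name__, value)
```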
11. Common Data Types in the Data Analytics Perspective
Data types and their representation - Data Analytics perspective:
A data type defines the operations that can be done on the data, the meaning of the data, and the way values of that type can be stored.
Common data types include:
■ Structured - data in a tabular format, with relationships between rows, columns, and tables. Ex. SQL databases, Excel files
■ Semi-structured - uses tags and other markers; also known as a self-describing structure. Ex. JSON and XML
■ Unstructured - not organized in a predefined manner. Ex. audio, video, images, textual data, NoSQL databases
■ Metadata - data about data. Ex. when and where a photo was taken
12. Structured Data
It conforms to a tabular format with a relationship between the different
rows and columns.
Examples of structured data are Excel files or SQL databases. Each of
these has structured rows and columns that can be sorted.
Source: internet
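As a small sketch of structured data, Python's built-in sqlite3 module can hold rows and columns under a fixed schema; the table and values below are invented for illustration.

```python
import sqlite3

# Structured data: a fixed schema of rows and columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, dept TEXT, gpa REAL)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [("Abebe", "CS", 3.6), ("Sara", "IT", 3.9), ("Kebede", "CS", 3.2)],
)

# Because the structure is known, rows can be filtered and sorted easily.
for row in conn.execute(
    "SELECT name, gpa FROM students WHERE dept = 'CS' ORDER BY gpa DESC"
):
    print(row)
conn.close()
```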
13. Semi-structured data
It is a form of structured data that does not conform with the formal structure of
data models associated with relational databases or other forms of data tables, but
nonetheless, contains tags or other markers to separate semantic elements and
enforce hierarchies of records and fields within the data.
Examples of semi-structured data include JSON and XML
Source: internet
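A brief sketch of semi-structured data: the JSON below describes itself with keys (tags), yet individual records need not share an identical schema. The records are invented for illustration.

```python
import json

# Semi-structured data: self-describing keys, but no rigid table schema.
# Note the second record has an extra field the first one lacks.
raw = """
[
  {"name": "Abebe", "dept": "CS"},
  {"name": "Sara",  "dept": "IT", "email": "sara@example.com"}
]
"""

records = json.loads(raw)
for rec in records:
    # Missing fields are handled per record instead of by a fixed schema.
    print(rec["name"], rec.get("email", "no email on record"))
```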
14. Unstructured Data
It is information that either does not have a predefined data model or is not
organized in a pre-defined manner. Unstructured information is typically text-heavy
but may contain data such as dates, numbers, and facts as well which results in
irregularities and ambiguities that make it difficult to understand using traditional
programs as compared to data stored in structured databases.
Examples of unstructured data include audio files, video files, and NoSQL databases.
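A short sketch of why unstructured, text-heavy data is harder to work with: facts such as dates and numbers are buried in free text and must be recovered with pattern matching rather than read from a known column. The sentence and patterns are illustrative only.

```python
import re

# Unstructured data: free text with no predefined model.
note = "Meeting moved to 2024-05-14; 35 attendees expected, budget 1200 birr."

# Facts are embedded in the text, so pattern matching is needed to recover
# them; the second pattern also picks up fragments of the date (2024, 05, 14),
# illustrating the ambiguity compared with a structured column.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", note)
numbers = re.findall(r"\b\d+\b", note)

print("dates:", dates)
print("numbers:", numbers)
```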
16. Data Value Chain
The data value chain describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data.
The steps of the data value chain are:
Data Acquisition - gathering, filtering, and cleaning data
Data Analysis - making raw data amenable to use in decision making; discovering, transforming, and modelling data
Data Curation - the active management of data; includes creation of content, selection, classification, transformation, validation, and preservation, improving the accessibility and quality of data
Data Storage - storing the processed data
Data Usage - using the processed data to make decisions
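A toy walk-through of the five steps on a few made-up sensor readings; a real big data system would use dedicated tools at each stage, so this is only an illustration of the flow.

```python
import json
import statistics

# 1. Data Acquisition: gather and filter raw records (hypothetical readings).
raw = [{"sensor": "t1", "temp": 21.4},
       {"sensor": "t1", "temp": None},
       {"sensor": "t2", "temp": 19.8}]
acquired = [r for r in raw if r["temp"] is not None]   # filtering/cleaning

# 2. Data Analysis: transform/model the data for decision making.
average_temp = statistics.mean(r["temp"] for r in acquired)

# 3. Data Curation: annotate and validate to keep the data usable over time.
curated = {"readings": acquired, "unit": "celsius", "validated": True}

# 4. Data Storage: persist the processed data.
with open("readings.json", "w") as f:
    json.dump(curated, f)

# 5. Data Usage: apply the result to a decision.
print("Turn on heating" if average_temp < 20.5 else "Heating off")
```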
17. Basic concepts of big data
Big data is a term for a collection of data sets so large and complex that they cannot reasonably be processed or stored with traditional tooling or on a single computer.
• A large dataset means a dataset too large to reasonably process or store with traditional tooling or on a single computer.
18. Big data can be characterized by:
■ Volume - large amounts of data; massive datasets
■ Velocity - data is live streaming or in motion (rapidity)
■ Variety - data comes in many different forms from diverse sources; data in many forms
■ Veracity - can we trust the data? How accurate is it? Uncertainty about the data
19. THE 4 Vs OF BIG DATA
■ Volume - the size of the data
■ Velocity - the speed with which data is generated
■ Variety - different types of data
■ Veracity - trustworthiness and accuracy of the data
20. Clustered Computing
Cluster computing: a computer system made up of multiple computers.
It is a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system.
21. Clustered Computing (continued)
Clustering combines the resources of many smaller machines to get benefits such as:
1. Resource Pooling (storage space, CPU, memory)
2. High Availability (fault tolerance)
3. Easy Scalability (scalable by adding additional machines)
A cluster can be managed by software such as YARN (Yet Another Resource Negotiator).
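As a rough single-machine analogy for resource pooling (an analogy only, not how YARN itself works), Python's multiprocessing pool splits work across several workers and combines the results as if they were one system.

```python
from multiprocessing import Pool

def count_words(chunk: str) -> int:
    # Each worker processes its own piece of the data.
    return len(chunk.split())

if __name__ == "__main__":
    chunks = ["big data needs many machines",
              "clusters pool storage cpu and memory",
              "work is split and results are combined"]
    # Several workers behave like one system: work is divided, results pooled.
    with Pool(processes=3) as pool:
        counts = pool.map(count_words, chunks)
    print(sum(counts), "words in total")
```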
22. Hadoop
● An open-source framework intended to make interaction with big data easier.
● Allows for the distributed processing of large datasets across clusters of computers.
● Four key characteristics of Hadoop are:
1. Economical (ordinary computers can be used for data processing)
2. Reliable (resistant to hardware failure)
3. Scalable (easily scalable both horizontally and vertically)
4. Flexible (can store as much data as you need)
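The classic illustration of the distributed processing Hadoop provides is word count via MapReduce. The sketch below runs the map, shuffle, and reduce steps in a single process purely to show the pattern; Hadoop performs the same steps in parallel across a cluster.

```python
from collections import defaultdict

# Toy, single-process sketch of the map -> shuffle -> reduce pattern
# that Hadoop distributes across many machines (word count example).
lines = ["big data big clusters", "data everywhere"]

# Map: emit (word, 1) pairs from each input line.
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the pairs by key (Hadoop does this between map and reduce).
grouped = defaultdict(list)
for word, count in pairs:
    grouped[word].append(count)

# Reduce: combine the values for each key.
totals = {word: sum(counts) for word, counts in grouped.items()}
print(totals)  # {'big': 2, 'data': 2, 'clusters': 1, 'everywhere': 1}
```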
23. Hadoop and its Ecosystem
● The core components of Hadoop:
1. Data management
2. Access
3. Processing
4. Storage
24. Big Data Life Cycle with Hadoop
● Ingesting data into the system: transferring data from different sources into Hadoop
● Processing the data in storage: the data is stored and processed
● Computing and analyzing data: the data is analyzed by processing frameworks such as Pig, Hive, and Impala
● Visualizing the results: accessing and presenting the analyzed data
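A toy, single-machine walk through the four stages with invented data; in a real Hadoop deployment each stage would use cluster tools (e.g. HDFS for storage and Pig/Hive/Impala for analysis).

```python
import csv
import io
from collections import Counter

# 1. Ingest: bring data into the system from an outside source (invented CSV).
raw_csv = "city,sales\nAddis Ababa,120\nAdama,80\nAddis Ababa,60\n"

# 2. Process/store: parse the raw records into a usable, stored form.
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# 3. Compute/analyze: aggregate the stored data.
totals = Counter()
for row in rows:
    totals[row["city"]] += int(row["sales"])

# 4. Visualize: present the analyzed result (a crude text bar chart).
for city, total in totals.items():
    print(f"{city:12s} {'#' * (total // 10)} {total}")
```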
#4: Data is represented with the help of characters such as alphabets (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.), while information is interpreted data, created from structured data in a particular context.