2. Chapter Two
Data science
A multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
3. What is Data?
Data is raw/unprocessed facts or figures, which may seem random and useless to a human.
Data can be represented with the help of:
■ Alphabets (A-Z, a-z)
■ Digits (0-9)
■ Special characters (+, -, *, /, <, >, =, etc.)
5. Data Vs. Information
Data is raw/unprocessed facts or figures, and information is the processed form of data.

Data                             | Information
---------------------------------|-------------------------------------------------------------------------
Unprocessed                      | Processed data
Never suited to a specific need  | Suited to a specific need, because unwanted or irrelevant data are removed
Does not depend on information   | Depends on data
Measured in bits and bytes       | Measured in meaningful units such as time and quality
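A minimal sketch (with made-up numbers) of how processing turns raw data into information suited to a specific need:

```python
# Raw data: unprocessed figures with no context (hypothetical values).
daily_sales = [120, 95, 143, 88, 110, 160, 132]

# Processing turns the data into information for a specific need:
# "How did the shop perform this week?"
total = sum(daily_sales)
average = total / len(daily_sales)
best_day = max(daily_sales)

print(f"Weekly total: {total}")        # information: total sales for the week
print(f"Daily average: {average:.1f}") # information: typical daily performance
print(f"Best day: {best_day}")         # information: peak sales figure
```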
6. Data Processing Life cycle
Data processing is the re-structuring or re-ordering of data by people or machines to increase its usefulness and add value for a particular purpose.
Data processing consists of the following basic steps:
Input (preparation of data for processing)
Processing (changing the input data into a more useful form)
Output (collecting the result of processing)
7. Data Processing Cycle
Data processing is the re-structuring or re-ordering of data by people or machines to increase its usefulness and add value for a particular purpose. It has three steps.
Input:
The input data is prepared in a convenient form for processing. The form depends on the processing machine.
For example, when electronic computers are used for data processing, the input data can be recorded on a
hard disk, CD, flash disk, and so on.
Source: Introduction to emerging technology module page 23
8. Data Processing Cycle
Processing:
The input data is changed to produce data in a more useful form.
For example, interest can be calculated on a deposit to a bank, or a summary of sales for the month can be
calculated from the sales orders.
Output:
The result of the processing step is collected. The particular form of the output data depends on the use of
the data.
For example, the output data may be the payroll for employees.
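The input-processing-output cycle can be illustrated with the bank-interest example above; the interest rate and balances below are made-up values used only for this sketch.

```python
# Input: prepare the data in a convenient form for processing
# (here, hypothetical account balances read into memory).
deposits = [1000.0, 2500.0, 740.0]
annual_rate = 0.07  # assumed interest rate, for illustration only

# Processing: change the input data into a more useful form
# (interest calculated on each deposit).
interest = [round(balance * annual_rate, 2) for balance in deposits]

# Output: collect the result of processing in the form the user needs.
for balance, earned in zip(deposits, interest):
    print(f"Deposit {balance:.2f} earns {earned:.2f} interest")
```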
9. Data types and their representation
A data type is the type of data that is going to be processed.
Data types can be defined from diverse perspectives:
1. Programming perspective
2. Data Analytics perspective
10. Common Data Types in the Programming Perspective
Data types and their representation - programming perspective:
A data type is an attribute of data that tells the compiler or interpreter how the programmer intends to use the data.
Common data types include:
■ Integers (int) - used to store whole numbers, mathematically known as integers
■ Booleans (bool) - used to represent values restricted to one of two values: true or false
■ Characters (char) - used to store a single character
■ Floating-point numbers (float) - used to store real numbers
■ Alphanumeric strings (string) - used to store a combination of characters and numbers
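The same five types can be sketched in Python; note that Python has no dedicated char type, so a single character is represented as a one-character string (this detail belongs to the example, not to the slide).

```python
# Integer: whole numbers
age = 25                   # int

# Boolean: restricted to one of two values
is_student = True          # bool

# Character: a single character (in Python, a string of length 1)
grade = 'A'                # no dedicated char type in Python

# Floating-point number: real numbers
gpa = 3.74                 # float

# Alphanumeric string: a combination of characters and numbers
student_id = "ETS0123/11"  # str

for value in (age, is_student, grade, gpa, student_id):
    print(type(value).__name__, value)
```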
11. Common Data Types in the Data Analytics Perspective
Data types and their representation - Data Analytics perspective:
A data type defines the operations that can be done on the data, the meaning of the data, and the way values of that type can be stored.
Common data types include:
■ Structured - data in a tabular format, with relationships between rows, columns, and tables. Ex. SQL databases, Excel files
■ Semi-structured - uses tags and other markers; also known as a self-describing structure. Ex. JSON and XML
■ Unstructured - not organized in a predefined manner. Ex. audio, video, images, textual data, NoSQL databases
■ Metadata - data about data. Ex. when and where a photo was taken
12. Structured Data
It conforms to a tabular format with a relationship between the different
rows and columns.
Examples of structured data are Excel files or SQL databases. Each of
these has structured rows and columns that can be sorted.
Source: internet
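As a small sketch of structured data, Python's built-in sqlite3 module can hold rows and columns under a fixed schema; the table and values below are invented for illustration.

```python
import sqlite3

# Structured data: a fixed schema of rows and columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, dept TEXT, gpa REAL)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [("Abebe", "CS", 3.6), ("Sara", "IT", 3.9), ("Kebede", "CS", 3.2)],
)

# Because the structure is known, rows can be filtered and sorted easily.
for row in conn.execute(
    "SELECT name, gpa FROM students WHERE dept = 'CS' ORDER BY gpa DESC"
):
    print(row)
conn.close()
```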
13. Semi-structured data
It is a form of structured data that does not conform with the formal structure of
data models associated with relational databases or other forms of data tables, but
nonetheless, contains tags or other markers to separate semantic elements and
enforce hierarchies of records and fields within the data.
Examples of semi-structured data include JSON and XML
Source: internet
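A brief sketch of semi-structured data: the JSON below describes itself with keys (tags), yet individual records need not share an identical schema. The records are invented for illustration.

```python
import json

# Semi-structured data: self-describing keys, but no rigid table schema.
# Note the second record has an extra field the first one lacks.
raw = """
[
  {"name": "Abebe", "dept": "CS"},
  {"name": "Sara",  "dept": "IT", "email": "sara@example.com"}
]
"""

records = json.loads(raw)
for rec in records:
    # Missing fields are handled per record instead of by a fixed schema.
    print(rec["name"], rec.get("email", "no email on record"))
```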
14. Unstructured Data
It is information that either does not have a predefined data model or is not
organized in a pre-defined manner. Unstructured information is typically text-heavy
but may contain data such as dates, numbers, and facts as well which results in
irregularities and ambiguities that make it difficult to understand using traditional
programs as compared to data stored in structured databases.
Examples of unstructured data include audio files, video files, and NoSQL databases.
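A short sketch of why unstructured, text-heavy data is harder to work with: facts such as dates and numbers are buried in free text and must be recovered with pattern matching rather than read from a known column. The sentence and patterns are illustrative only.

```python
import re

# Unstructured data: free text with no predefined model.
note = "Meeting moved to 2024-05-14; 35 attendees expected, budget 1200 birr."

# Facts are embedded in the text, so pattern matching is needed to recover
# them; the second pattern also picks up fragments of the date (2024, 05, 14),
# illustrating the ambiguity compared with a structured column.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", note)
numbers = re.findall(r"\b\d+\b", note)

print("dates:", dates)
print("numbers:", numbers)
```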
16. Data Value Chain
The data value chain describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data.
The steps of the data value chain are:
Data Acquisition - gathering, filtering, and cleaning data
Data Analysis - making raw data amenable to use in decision making; discovering, transforming, and modelling data
Data Curation - the active management of data; includes creation of content, selection, classification, transformation, validation, and preservation, improving the accessibility and quality of data
Data Storage - storing the processed data
Data Usage - using the processed data to make decisions
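A toy walk-through of the five steps on a few made-up sensor readings; a real big data system would use dedicated tools at each stage, so this is only an illustration of the flow.

```python
import json
import statistics

# 1. Data Acquisition: gather and filter raw records (hypothetical readings).
raw = [{"sensor": "t1", "temp": 21.4},
       {"sensor": "t1", "temp": None},
       {"sensor": "t2", "temp": 19.8}]
acquired = [r for r in raw if r["temp"] is not None]   # filtering/cleaning

# 2. Data Analysis: transform/model the data for decision making.
average_temp = statistics.mean(r["temp"] for r in acquired)

# 3. Data Curation: annotate and validate to keep the data usable over time.
curated = {"readings": acquired, "unit": "celsius", "validated": True}

# 4. Data Storage: persist the processed data.
with open("readings.json", "w") as f:
    json.dump(curated, f)

# 5. Data Usage: apply the result to a decision.
print("Turn on heating" if average_temp < 20.5 else "Heating off")
```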
17. Basic concepts of big data
Big data is a term for a collection of data sets so large and complex that they cannot reasonably be processed or stored with traditional tooling or on a single computer.
• A large dataset means a dataset too large to reasonably process or store with traditional tooling or on a single computer.
18. Big data can be characterized by:
■ Volume - large amounts of data; massive datasets
■ Velocity - data is live streaming or in motion (rapidity)
■ Variety - data comes in many different forms from diverse sources; data in many forms
■ Veracity - can we trust the data? How accurate is it? Uncertainty about the data
19. THE 4 Vs OF BIG DATA
■ Volume - the size of the data
■ Velocity - the speed with which data is generated
■ Variety - different types of data
■ Veracity - trustworthiness and accuracy of the data
20. Clustered Computing
Cluster computing: a computer system made up of multiple computers.
It is a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system.
21. Clustered Computing (continued)
Clustering combines the resources of many smaller machines to get benefits such as:
1. Resource Pooling (storage space, CPU, memory)
2. High Availability (fault tolerance)
3. Easy Scalability (scalable by adding additional machines)
A cluster can be managed by software such as YARN (Yet Another Resource Negotiator).
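As a rough single-machine analogy for resource pooling (an analogy only, not how YARN itself works), Python's multiprocessing pool splits work across several workers and combines the results as if they were one system.

```python
from multiprocessing import Pool

def count_words(chunk: str) -> int:
    # Each worker processes its own piece of the data.
    return len(chunk.split())

if __name__ == "__main__":
    chunks = ["big data needs many machines",
              "clusters pool storage cpu and memory",
              "work is split and results are combined"]
    # Several workers behave like one system: work is divided, results pooled.
    with Pool(processes=3) as pool:
        counts = pool.map(count_words, chunks)
    print(sum(counts), "words in total")
```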
22. Hadoop
● An open-source framework intended to make interaction with big data easier.
● Allows for the distributed processing of large datasets across clusters of computers.
● Four key characteristics of Hadoop are:
1. Economical (ordinary computers can be used for data processing)
2. Reliable (resistant to hardware failure)
3. Scalable (easily scalable both horizontally and vertically)
4. Flexible (can store as much data as you need)
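The classic illustration of the distributed processing Hadoop provides is word count via MapReduce. The sketch below runs the map, shuffle, and reduce steps in a single process purely to show the pattern; Hadoop performs the same steps in parallel across a cluster.

```python
from collections import defaultdict

# Toy, single-process sketch of the map -> shuffle -> reduce pattern
# that Hadoop distributes across many machines (word count example).
lines = ["big data big clusters", "data everywhere"]

# Map: emit (word, 1) pairs from each input line.
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the pairs by key (Hadoop does this between map and reduce).
grouped = defaultdict(list)
for word, count in pairs:
    grouped[word].append(count)

# Reduce: combine the values for each key.
totals = {word: sum(counts) for word, counts in grouped.items()}
print(totals)  # {'big': 2, 'data': 2, 'clusters': 1, 'everywhere': 1}
```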
23. Hadoop and its Ecosystem
● The core components of Hadoop:
1. Data management
2. Access
3. Processing
4. Storage
24. Big Data Life Cycle with Hadoop
● Ingesting data into the system: transferring data from different sources into Hadoop
● Processing the data in storage: the data is stored and processed
● Computing and analyzing data: the data is analyzed by processing frameworks such as Pig, Hive, and Impala
● Visualizing the results: accessing and presenting the analyzed data
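A toy, single-machine walk through the four stages with invented data; in a real Hadoop deployment each stage would use cluster tools (e.g. HDFS for storage and Pig/Hive/Impala for analysis).

```python
import csv
import io
from collections import Counter

# 1. Ingest: bring data into the system from an outside source (invented CSV).
raw_csv = "city,sales\nAddis Ababa,120\nAdama,80\nAddis Ababa,60\n"

# 2. Process/store: parse the raw records into a usable, stored form.
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# 3. Compute/analyze: aggregate the stored data.
totals = Counter()
for row in rows:
    totals[row["city"]] += int(row["sales"])

# 4. Visualize: present the analyzed result (a crude text bar chart).
for city, total in totals.items():
    print(f"{city:12s} {'#' * (total // 10)} {total}")
```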
#4: Data is represented with the help of characters such as alphabets (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.), while information is interpreted data, created from structured data in a particular context.