Chapter 2.ppt on Types of Digital f Data

“Fundamentals of Business Analytics”
RN Prasad and Seema Acharya
Copyright © 2011 Wiley India Pvt. Ltd. All rights reserved.
Chapter 2
Types of Digital Data

Learning Objectives and Learning Outcomes
Learning Objectives
Introduction to digital data and its types
1. Structured data – origin, organization, storage, access and usage
2. Semi-structured data – origin, organization, storage, access and usage
3. Unstructured data – origin, organization, storage, access and usage
Learning Outcomes
(a) To differentiate between structured, unstructured and semi-structured data
(b) To understand the need to integrate structured, unstructured and semi-structured data

Session Plan
Lecture time: 45 to 60 minutes
Q/A: 15 minutes

Agenda
• Types of digital data
  – Unstructured
    • Origin
    • Management
    • Storage
    • Storage of unstructured data in relational database
    • Process of extracting information
    • Key take-away and additional reads
  – Semi-structured
    • Origin
    • Management
    • Storage
    • Storage of semi-structured data in relational database
    • XML

Agenda (contd.)
• Types of digital data – contd.
  – Structured
    • Origin
    • Management
    • Storage

Digital Data
• Digital data can be
– Unstructured
– Semi-structured
– Structured
• According to Merrill Lynch, 80–90% of business data is either unstructured or semi-structured
• Data is usually in a format which makes it difficult to extract information from it

Formats of Digital Data

Unstructured Data

What is Unstructured Data?
Unstructured data:
• Does not conform to any data model
• Cannot be stored in the form of rows and columns as in a database
• Is not in any particular format or sequence
• Is not easily usable by a program
• Does not follow any rule or semantics
• Has no easily identifiable structure

Where does Unstructured Data Come from?
Sources of unstructured data:
• Web pages
• Memos
• Videos (MPEG, etc.)
• Images (JPEG, GIF, etc.)
• Body of an e-mail
• Word documents
• PowerPoint presentations
• Chats
• Reports
• Whitepapers
• Surveys

How to Store Unstructured Data?
Challenges faced
• Storage space: The sheer volume of unstructured data and its unprecedented growth make it difficult to store. Audio, video, images, etc. occupy a huge amount of storage space
• Scalability: Scalability becomes an issue as the volume of unstructured data increases
• Retrieve information: Retrieving and recovering unstructured data are cumbersome
• Security: Ensuring security is difficult due to the varied sources of data (e.g. e-mail, web pages)
• Update and delete: Updating, deleting, etc. are not easy due to the unstructured form
• Indexing and searching: Indexing becomes difficult as the data increases. Searching is difficult for non-text data

How to Store Unstructured Data?
Possible solutions
• Change formats: Unstructured data may be converted to formats which are easily managed, stored and searched. For example, IBM is working on providing a solution which converts audio, video, etc. to text
• New hardware: Create hardware which supports unstructured data, either to complement the existing storage devices or to stand alone for unstructured data
• RDBMS/BLOBs: Store in relational databases which support BLOBs (Binary Large Objects)
• XML: Store in XML, which tries to give some structure to unstructured data by using tags and elements
• CAS (content-addressable storage): Organize files based on their metadata
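
The RDBMS/BLOBs option can be illustrated with a minimal sketch. It assumes Python's built-in sqlite3 module as the relational database; the table name documents, its columns, and the helper functions store_document and load_document are hypothetical names chosen for this example, and the byte string is a stand-in for a real image file.

```python
import sqlite3


def store_document(conn, name, mime_type, payload):
    """Insert raw bytes as a BLOB together with descriptive metadata."""
    cur = conn.execute(
        "INSERT INTO documents (name, mime_type, size_bytes, content) "
        "VALUES (?, ?, ?, ?)",
        (name, mime_type, len(payload), payload),
    )
    conn.commit()
    return cur.lastrowid


def load_document(conn, doc_id):
    """Return the stored bytes for one document, or None if it is missing."""
    row = conn.execute(
        "SELECT content FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return None if row is None else bytes(row[0])


# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE documents (
           id INTEGER PRIMARY KEY,
           name TEXT NOT NULL,
           mime_type TEXT NOT NULL,
           size_bytes INTEGER NOT NULL,
           content BLOB NOT NULL
       )"""
)

# Placeholder bytes standing in for an image; not a valid JPEG file.
fake_image = bytes([0xFF, 0xD8, 0xFF, 0xE0]) + b"not a real JPEG body"
doc_id = store_document(conn, "logo.jpg", "image/jpeg", fake_image)
restored = load_document(conn, doc_id)
```

The metadata columns (name, MIME type, size) stay searchable with ordinary SQL, while the content itself remains an opaque block of bytes. That is why BLOB storage helps with storage and retrieval but not with searching inside the data.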

How to Extract Information from Unstructured
Data?
Challenges faced
• Interpretation: Unstructured data is not easily interpreted by conventional search algorithms
• Tags: As the data grows, it is not possible to put tags manually
• Deriving meaning: Computer programs cannot automatically derive meaning/structure from unstructured data
• File formats: The increasing number of file formats makes it difficult to interpret data
• Classification/Taxonomy: Different naming conventions followed across the organization make it difficult to classify data
• Indexing: Designing algorithms to understand the meaning of a document and then tag or index it accordingly is difficult

How to Extract Information from Unstructured
Data?
Possible solutions
• Tags: Unstructured data can be stored in a virtual repository and be automatically tagged. For example, Documentum provides this type of solution
• Text mining: Text mining tools help in grouping, classifying and analyzing unstructured data by considering grammar, context, synonyms, etc.
• Application platforms: Application platforms like XOLAP help extract information from e-mail and XML-based documents
• Classification/Taxonomy: Taxonomies within the organization can be managed automatically to organize data in hierarchical structures
• Naming conventions/standards: Following naming conventions or standards across an organization can greatly improve storage and retrieval
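
Automatic tagging can be approximated, very roughly, by counting the most frequent meaningful words in a document. The sketch below is only an illustration of the idea and not how Documentum or any text mining product works; the function name suggest_tags, the stop-word list, and the sample memo are all assumptions made for this example.

```python
import re
from collections import Counter

# A tiny, hand-picked stop-word list; real tools use far larger ones.
STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "are",
    "for", "on", "with", "this", "that", "be", "by", "it", "as", "at",
}


def suggest_tags(text, max_tags=3):
    """Return the most frequent non-trivial words as candidate tags."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(
        w for w in words if w not in STOP_WORDS and len(w) > 2
    )
    return [word for word, _ in counts.most_common(max_tags)]


# Made-up memo text used only to exercise the function.
memo = (
    "The invoice for the server order is attached. "
    "Please approve the invoice so the server can ship. "
    "Invoice total is final."
)
tags = suggest_tags(memo, max_tags=2)
```

Word counting ignores grammar, context and synonyms, which is exactly what real text mining tools add on top of this kind of baseline.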

Further Reading
• http://www.information-management.com/issues/20030201/6287-1.html
• http://www.enterpriseitplanet.com/storage/features/article.php/11318_3407161_2
• http://domino.research.ibm.com/comm/research_projects.nsf/pages/uima.index.html
• http://www.research.ibm.com/UIMA/UIMA%20Architecture%20Highlights.html

Answer a Quick Question
Ask the participants of the learning program to state some more examples of unstructured data

Do it Exercise
Search, think and write about two best practices for managing the growth of unstructured data

Semi-structured Data

What is Semi-structured Data?
Semi-structured data:
• Does not conform to a data model but contains tags and elements (metadata)
• Cannot be stored in the form of rows and columns as in a database
• The tags and elements describe how data is stored
• Metadata is not sufficient
• Attributes in a group may not be the same
• Similar entities are grouped

Where does Semi-structured Data Come from?
Sources of semi-structured data:
• E-mail
• XML
• TCP/IP packets
• Zipped files
• Binary executables
• Mark-up languages
• Integration of data from heterogeneous sources

How to Manage Semi-structured Data?
Some ways in which semi-structured data is managed and stored:
Schemas
• Describe the structure and content of data to some extent
• Assign meaning to data, hence allowing automatic search and indexing
Graph-based data models
• Contain data on the leaves of the graph. Also known as ‘schema less’
• Used for data exchange among heterogeneous sources
XML
• Models the data using tags and elements
• Schemas are not tightly coupled to data

How to Store Semi-structured Data?
Challenges faced
• Storage cost: Storing data with their schemas increases cost
• RDBMS: Semi-structured data cannot be stored in existing RDBMS as data cannot be mapped into tables directly
• Irregular and partial structure: Some data elements may have extra information while others have none at all
• Implicit structure: In many cases the structure is implicit. Interpreting relationships and correlations is very difficult
• Evolving schemas: Schemas keep changing with requirements, making it difficult to capture them in a database
• Distinction between schema and data: A vague distinction between schema and data exists at times, making it difficult to capture data

How to Store Semi-structured Data?
Possible solutions
• XML: XML allows tags and attributes to be defined to store data. Data can be stored in a hierarchical/nested structure
• RDBMS: Semi-structured data can be stored in a relational database by mapping the data to a relational schema which is then mapped to a table
• Special purpose DBMS: Databases which are specifically designed to store semi-structured data
• OEM (Object Exchange Model): Data can be stored and exchanged in the form of a graph where entities are represented as objects which are the vertices in the graph
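
The graph idea behind OEM can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the OEM specification: the class name OemObject, its fields, and the methods add and find are hypothetical, and the sample people are borrowed from the structured data example later in this deck.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Union

Atomic = Union[str, int, float]


@dataclass
class OemObject:
    """A labelled vertex: either a leaf holding a value or a parent of other objects."""
    label: str
    value: Optional[Atomic] = None
    children: List["OemObject"] = field(default_factory=list)

    def add(self, child: "OemObject") -> "OemObject":
        """Attach a child object and return self so calls can be chained."""
        self.children.append(child)
        return self

    def find(self, label: str) -> Iterator["OemObject"]:
        """Yield every descendant with the given label, depth first."""
        for child in self.children:
            if child.label == label:
                yield child
            yield from child.find(label)


# Two "person" objects with different numbers of e-mail attributes.
root = OemObject("people")
root.add(
    OemObject("person")
    .add(OemObject("name", "Patrick Wood"))
    .add(OemObject("email", "ptw@dcs.abc.ac.uk"))
    .add(OemObject("email", "p.wood@ymail.uk"))
)
root.add(
    OemObject("person")
    .add(OemObject("name", "Mark Taylor"))
    .add(OemObject("email", "MarkT@dcs.ymail.ac.uk"))
)

emails = [obj.value for obj in root.find("email")]
```

No schema is declared anywhere: each object carries its own label, the data sits on the leaves, and two objects with the same label are free to have different attributes.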

How to Extract Information from Semi-structured Data?
Challenges faced
• Flat files: Semi-structured data is usually stored in flat files which are difficult to index and search
• Heterogeneous sources: Data comes from varied sources, which makes it difficult to tag and search
• Incomplete/irregular structure: Extracting structure when there is none and interpreting the relations existing in the structure which is present is a difficult task

How to Extract Information from Semi-structured Data?
Possible solutions
• Indexing: Indexing data in a graph-based model enables quick search
• OEM: Allows data to be stored in a graph-based data model which is easier to index and search
• XML: Allows data to be arranged in a hierarchical or tree-like structure which enables indexing and searching
• Mining tools: Various mining tools are available which search data based on graphs, schemas, structure, etc.
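
One way to picture indexing over a hierarchical or graph-based model is a path index, which maps each label path to the values found under it. The sketch below assumes the data is already held as nested Python dictionaries and lists; the function name build_path_index and the sample catalog are hypothetical.

```python
from collections import defaultdict


def build_path_index(node, path="", index=None):
    """Map every label path (e.g. '/book/title') to the leaf values under it."""
    if index is None:
        index = defaultdict(list)
    if isinstance(node, dict):
        # Each key extends the current label path by one step.
        for key, child in node.items():
            build_path_index(child, f"{path}/{key}", index)
    elif isinstance(node, list):
        # Repeated elements share the same path.
        for child in node:
            build_path_index(child, path, index)
    else:
        # A leaf: record its value under the path that led here.
        index[path].append(node)
    return index


# Made-up records with irregular attributes, as is typical for semi-structured data.
catalog = {
    "book": [
        {"title": "XML Basics", "author": "A. Rao", "price": 30},
        {"title": "Data Models", "year": 2011},
    ]
}
index = build_path_index(catalog)
titles = index["/book/title"]
```

Once the index is built, a lookup by path is a single dictionary access rather than a walk over the whole structure, and records with missing attributes simply contribute nothing to those paths.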

XML – A Solution for Semi-structured Data Management
XML: Extensible Markup Language
• What is XML? An open-standard markup language written in plain text. It is hardware and software independent
• Does what? Designed to store and transport data over the Internet
• How? It allows data to be stored in a hierarchical/nested structure. It allows the user to define tags to store the data

XML – A Solution for Semi-structured Data Management
XML has no predefined tags
The words in the <> (angle brackets) are user-defined tags
XML is known as self-describing, as data can exist without a schema and a schema can be added later
A schema can be described in DTD or XML Schema (XSD)
<message>
  <to> XYZ </to>
  <from> ABC </from>
  <subject> Greetings </subject>
  <body> Hello! How are you? </body>
</message>
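
Because the tags describe the data, a program can read the message above without any schema. The following sketch uses Python's standard xml.etree.ElementTree module; the variable names MESSAGE_XML and fields are chosen for this example only.

```python
import xml.etree.ElementTree as ET

# The message from the slide, held as a plain string.
MESSAGE_XML = """
<message>
  <to> XYZ </to>
  <from> ABC </from>
  <subject> Greetings </subject>
  <body> Hello! How are you? </body>
</message>
"""

root = ET.fromstring(MESSAGE_XML.strip())

# Walk the child elements and collect tag -> text, trimming the padding spaces.
fields = {child.tag: child.text.strip() for child in root}
subject = root.findtext("subject").strip()
```

The tag names to, from, subject and body were invented by the author of the message, yet the parser handles them like any other element. This is what "no predefined tags" means in practice.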

Further Reading
• http://queue.acm.org/detail.cfm?id=1103832
• http://www.computerworld.com/s/article/93968/Taming_Text
• http://searchstorage.techtarget.com/generic/0,295582,sid5_gci1334684,00.html
• http://searchdatamanagement.techtarget.com/generic/0,295582,sid91_gci1264550,00.html
• http://searchdatamanagement.techtarget.com/news/article/0,289142,sid91_gci1252122,00.html

Answer a Quick Question
What is your take on this?
Is a web page unstructured? If yes, why?

What Is Structured Data?
Structured data:
• Conforms to a data model
• Data is stored in the form of rows and columns (e.g., relational database)
• Data resides in fixed fields within a record or file
• Definition, format and meaning of data are explicitly known
• Attributes in a group are the same
• Similar entities are grouped

Where does Structured Data Come from?
Sources of structured data:
• Databases (e.g., Access)
• Spreadsheets
• SQL
• OLTP systems

Structured Data: Everything in its Place
Fully described datasets
Clearly defined categories and sub-categories
Data neatly placed in rows and columns
Data that goes into the records is regulated by a well-defined structure
Indexing can be easily done either by the DBMS itself or manually

Structured Data
Semi-structured
Name                                  E-mail
Patrick Wood                          ptw@dcs.abc.ac.uk, p.wood@ymail.uk
First name: Mark, Last name: Taylor   MarkT@dcs.ymail.ac.uk
Alex Bourdoo                          AlexBourdoo@dcs.ymail.ac.uk

Structured
First Name   Last Name   E-mail Id                     Alternate E-mail Id
Patrick      Wood        ptw@dcs.abc.ac.uk             p.wood@ymail.uk
Mark         Taylor      MarkT@dcs.ymail.ac.uk
Alex         Bourdoo     AlexBourdoo@dcs.ymail.ac.uk
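
The step from the semi-structured listing to the structured table can be sketched as a small normalization routine. It assumes the semi-structured records arrive as Python dictionaries whose keys vary from record to record; the function name to_structured and the output field names are hypothetical.

```python
# The three records from the slide, with their irregular attributes preserved.
raw_records = [
    {"Name": "Patrick Wood", "E-mail": "ptw@dcs.abc.ac.uk, p.wood@ymail.uk"},
    {"First name": "Mark", "Last name": "Taylor", "E-mail": "MarkT@dcs.ymail.ac.uk"},
    {"Name": "Alex Bourdoo", "E-mail": "AlexBourdoo@dcs.ymail.ac.uk"},
]


def to_structured(record):
    """Map one irregular record onto a fixed set of four fields."""
    if "Name" in record:
        # A single "Name" attribute is split at the first space.
        first, _, last = record["Name"].partition(" ")
    else:
        first, last = record["First name"], record["Last name"]
    # One attribute may hold several comma-separated addresses.
    emails = [e.strip() for e in record["E-mail"].split(",")]
    return {
        "first_name": first,
        "last_name": last,
        "email": emails[0],
        "alternate_email": emails[1] if len(emails) > 1 else None,
    }


rows = [to_structured(r) for r in raw_records]
```

After normalization every row has the same four fields, so it can be stored in fixed columns. Where a record has no alternate address, the field is kept and simply left empty.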

Ease with Structured Data-Storage
Ease with structured data
• Storage: Data types, both predefined and user-defined, help with the storage of structured data
• Scalability: Scalability is not generally an issue with increase in data
• Security
• Update and delete: Updating, deleting, etc. are easy due to the structured form

Ease with Structured Data-Retrieval
Ease with structured data
• Retrieve information: A well-defined structure helps in easy retrieval of data
• Indexing and searching: Data can be indexed based not only on a text string but on other attributes as well. This enables streamlined search
• Mining data: Structured data can be easily mined and knowledge can be extracted from it
• BI operations: BI works extremely well with structured data. Hence data mining, warehousing, etc. can be easily undertaken
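
The ease of indexing, retrieval and update can be seen in a short sketch using Python's built-in sqlite3 module. The table name contacts, the index name idx_contacts_last_name, and the helper find_by_last_name are assumptions made for this example; the rows reuse the contacts from the structured data table earlier in the deck.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE contacts (
           id INTEGER PRIMARY KEY,
           first_name TEXT NOT NULL,
           last_name TEXT NOT NULL,
           email TEXT NOT NULL,
           alternate_email TEXT
       )"""
)
conn.executemany(
    "INSERT INTO contacts (first_name, last_name, email, alternate_email) "
    "VALUES (?, ?, ?, ?)",
    [
        ("Patrick", "Wood", "ptw@dcs.abc.ac.uk", "p.wood@ymail.uk"),
        ("Mark", "Taylor", "MarkT@dcs.ymail.ac.uk", None),
        ("Alex", "Bourdoo", "AlexBourdoo@dcs.ymail.ac.uk", None),
    ],
)

# Index on an attribute, so lookups by last name need not scan every row.
conn.execute("CREATE INDEX idx_contacts_last_name ON contacts (last_name)")


def find_by_last_name(conn, last_name):
    """Retrieve first name and e-mail for everyone with the given last name."""
    return conn.execute(
        "SELECT first_name, email FROM contacts WHERE last_name = ? ORDER BY id",
        (last_name,),
    ).fetchall()


COUNT_SQL = "SELECT COUNT(*) FROM contacts WHERE alternate_email IS NOT NULL"

taylors = find_by_last_name(conn, "Taylor")
with_alternate_before = conn.execute(COUNT_SQL).fetchone()[0]

# Updating a single field of a single record is one statement.
conn.execute(
    "UPDATE contacts SET alternate_email = NULL WHERE last_name = ?", ("Wood",)
)
with_alternate_after = conn.execute(COUNT_SQL).fetchone()[0]
```

Retrieval, indexing and update each take one declarative statement because the fields and their types are known in advance, which is the contrast with the unstructured and semi-structured cases described earlier.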

Further Readings
• http://www.govtrack.us/articles/20061209data.xpd
• http://www.sapdesignguild.org/editions/edition2/sui_content.asp

Do it Exercise
Think and write about an instance where data was presented to you in unstructured, semi-structured and structured formats

Summary please…
Ask a few participants of the learning program to summarize the lecture.
