“Fundamentals of Business Analytics”
RN Prasad and Seema Acharya
Copyright © 2011 Wiley India Pvt. Ltd. All rights reserved.
Chapter 2
Types of Digital Data
Learning Objectives and Learning Outcomes
Learning Objectives
Introduction to digital data and its types
1. Structured data – origin, organization, storage, access and usage
2. Semi-structured data – origin, organization, storage, access and usage
3. Unstructured data – origin, organization, storage, access and usage
Learning Outcomes
(a) To differentiate between structured, unstructured and semi-structured data
(b) To understand the need to integrate structured, unstructured and semi-structured data
Session Plan
Lecture time : 45 to 60 minutes
Q/A : 15 minutes
Agenda
• Types of digital data
– Unstructured
• Origin
• Management
• Storage
• Storage of unstructured data in relational database
• Process of extracting information
• Key take-away and additional reads
– Semi-structured
• Origin
• Management
• Storage
• Storage of semi-structured data in relational database
• Process of extracting information
• XML
• Key take-away and additional reads
Agenda (contd.)
• Types of digital data – contd.
– Structured
• Origin
• Management
• Storage
• Process of extracting information
• Key take-away and additional reads
Digital Data
• Digital data can be
– Unstructured
– Semi-structured
– Structured
• According to Merrill Lynch, 80–90% of business data is either unstructured or semi-structured
• Such data is usually in a format that makes it difficult to extract information from it
Formats of Digital Data
Unstructured Data
What is Unstructured Data?
Unstructured data:
• Does not conform to any data model
• Cannot be stored in the form of rows and columns as in a database
• Not in any particular format or sequence
• Not easily usable by a program
• Does not follow any rule or semantics
• Has no easily identifiable structure
Where does Unstructured Data Come from?
Unstructured data sources:
• Web pages
• Memos
• Videos (MPEG, etc.)
• Images (JPEG, GIF, etc.)
• Body of an e-mail
• Word documents
• PowerPoint presentations
• Chats
• Reports
• Whitepapers
• Surveys
How to Store Unstructured Data?
Challenges faced
• Storage space: The sheer volume of unstructured data and its unprecedented growth make it difficult to store. Audio, video, images, etc. occupy a huge amount of storage space
• Scalability: Scalability becomes an issue as unstructured data increases
• Retrieve information: Retrieving and recovering unstructured data are cumbersome
• Security: Ensuring security is difficult due to the varied sources of data (e.g. e-mail, web pages)
• Update and delete: Updating, deleting, etc. are not easy due to the unstructured form
• Indexing and searching: Indexing becomes difficult as data increases. Searching is difficult for non-text data
How to Store Unstructured Data?
Possible solutions
• Change formats: Unstructured data may be converted to formats which are easily managed, stored and searched. For example, IBM is working on providing a solution which converts audio, video, etc. to text
• New hardware: Create hardware that supports unstructured data, either to complement the existing storage devices or to stand alone for unstructured data
• RDBMS/BLOBs: Store in relational databases that support BLOBs (Binary Large Objects)
• XML: Store in XML, which tries to give some structure to unstructured data by using tags and elements
• CAS: Organize files based on their metadata
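The RDBMS/BLOB option can be illustrated with a minimal sketch. It assumes SQLite through Python's standard `sqlite3` module; the `documents` table, the file name and the stand-in bytes are hypothetical and are not taken from the slides.

```python
import sqlite3

def store_blob(conn, name, payload):
    """Insert raw bytes into a BLOB column and return the new row id."""
    cur = conn.execute(
        "INSERT INTO documents (name, content) VALUES (?, ?)",
        (name, payload),
    )
    conn.commit()
    return cur.lastrowid

def load_blob(conn, doc_id):
    """Read the bytes back exactly as they were stored."""
    row = conn.execute(
        "SELECT content FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return bytes(row[0]) if row else None

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, name TEXT, content BLOB)"
)

# Stand-in bytes (hypothetical); a real image would be read with open(path, "rb").
fake_image = bytes([0x89, 0x50, 0x4E, 0x47]) + b"not a real image"
doc_id = store_blob(conn, "logo.png", fake_image)
restored = load_blob(conn, doc_id)
```

The database keeps the bytes intact but knows nothing about what they contain, which is why searching and indexing remain hard for this kind of data.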
How to Extract Information from Unstructured
Data?
Challenges faced
• Interpretation: Unstructured data is not easily interpreted by conventional search algorithms
• Tags: As the data grows, it is not possible to add tags manually
• Deriving meaning: Computer programs cannot automatically derive meaning/structure from unstructured data
• File formats: The increasing number of file formats makes it difficult to interpret data
• Classification/Taxonomy: Different naming conventions followed across the organization make it difficult to classify data
• Indexing: Designing algorithms that understand the meaning of a document and then tag or index it accordingly is difficult
How to Extract Information from Unstructured
Data?
Possible solutions
• Tags: Unstructured data can be stored in a virtual repository and be automatically tagged. For example, Documentum provides this type of solution
• Text mining: Text mining tools help in grouping, classifying and analyzing unstructured data by considering grammar, context, synonyms, etc.
• Application platforms: Application platforms like XOLAP help extract information from e-mail and XML-based documents
• Classification/Taxonomy: Taxonomies within the organization can be managed automatically to organize data in hierarchical structures
• Naming conventions/standards: Following naming conventions or standards across an organization can greatly improve storage and retrieval
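To make the idea of automatic tagging concrete, here is a deliberately naive sketch that proposes tags from word frequency. It is not how Documentum or any text mining product works; the stop-word list, the `suggest_tags` helper and the sample memo are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list; real text mining tools use far larger ones.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "are"}

def suggest_tags(text, max_tags=2):
    """Return the most frequent non-trivial words as candidate tags."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(max_tags)]

memo = (
    "The invoice for the storage upgrade is attached. "
    "Storage costs and the invoice total are listed in the invoice summary."
)
tags = suggest_tags(memo)
```

Frequency counting ignores grammar, context and synonyms, which is exactly the gap that text mining tools try to close.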
Further Reading
• http://www.information-management.com/issues/20030201/6287-1.html
• http://www.enterpriseitplanet.com/storage/features/article.php/11318_3407161_2
• http://domino.research.ibm.com/comm/research_projects.nsf/pages/uima.index.html
• http://www.research.ibm.com/UIMA/UIMA%20Architecture%20Highlights.html
Answer a Quick Question
Ask the participants of the learning program to state some more examples of
Unstructured data
Do it Exercise
Search, think and write about two best practices for managing the growth of
unstructured data
Semi-structured Data
What is Semi-structured Data?
Semi-structured data:
• Does not conform to a data model but contains tags & elements (metadata)
• Cannot be stored in the form of rows and columns as in a database
• The tags and elements describe how data is stored
• Does not have sufficient metadata
• Attributes in a group may not be the same
• Similar entities are grouped
Where does Semi-structured Data Come from?
Semi-structured data sources:
• E-mail
• XML
• TCP/IP packets
• Zipped files
• Binary executables
• Mark-up languages
• Integration of data from heterogeneous sources
How to Manage Semi-structured Data?
Some ways in which semi-structured data is managed and stored:
Schemas
• Describe the structure and content of data to some extent
• Assign meaning to data, hence allowing automatic search and indexing
Graph-based data models
• Contain data on the leaves of the graph. Also known as ‘schema less’
• Used for data exchange among heterogeneous sources
XML
• Models the data using tags and elements
• Schemas are not tightly coupled to data
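The statement that a graph-based model keeps data on the leaves can be sketched with a nested Python structure. This is an illustration under assumptions, not a formal graph data model: the `leaves` helper and the sample record are hypothetical.

```python
def leaves(node, path=()):
    """Yield (path, value) for every leaf of a nested dict/list structure."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from leaves(child, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from leaves(child, path + (index,))
    else:
        # Anything that is not a container is treated as a leaf value.
        yield path, node

# Hypothetical record: inner nodes only label the structure, leaves hold data.
contact = {
    "name": {"first": "Mark", "last": "Taylor"},
    "emails": ["MarkT@dcs.ymail.ac.uk"],
}
flat = dict(leaves(contact))
```

No schema is declared anywhere; the structure is discovered by walking the data, which is what "schema less" refers to.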
How to Store Semi-structured Data?
Challenges faced
• Storage cost: Storing data with their schemas increases cost
• RDBMS: Semi-structured data cannot be stored in an existing RDBMS as is, because the data cannot be mapped into tables directly
• Irregular and partial structure: Some data elements may have extra information while others have none at all
• Implicit structure: In many cases the structure is implicit. Interpreting relationships and correlations is very difficult
• Evolving schemas: Schemas keep changing with requirements, making it difficult to capture them in a database
• Distinction between schema and data: A vague distinction between schema and data exists at times, making it difficult to capture data
How to Store Semi-structured Data?
Possible solutions
• XML: XML allows users to define tags and attributes to store data. Data can be stored in a hierarchical/nested structure
• RDBMS: Semi-structured data can be stored in a relational database by mapping the data to a relational schema, which is then mapped to a table
• Special purpose DBMS: Databases which are specifically designed to store semi-structured data
• OEM (Object Exchange Model): Data can be stored and exchanged in the form of a graph where entities are represented as objects, which are the vertices of the graph
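The RDBMS option, mapping semi-structured data to a relational schema, might look like the sketch below. It assumes Python's standard `xml.etree.ElementTree` and `sqlite3` modules; the `orders` document and both table names are hypothetical examples rather than anything from the slides.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML: the second order has no <item> elements at all.
XML_DOC = """
<orders>
  <order id="1"><customer>ABC</customer><item>pen</item><item>ink</item></order>
  <order id="2"><customer>XYZ</customer></order>
</orders>
""".strip()

conn = sqlite3.connect(":memory:")
# A repeating element becomes a child table instead of a variable-width row.
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE order_item (order_id INTEGER REFERENCES orders(id), item TEXT NOT NULL)"
)

for order in ET.fromstring(XML_DOC).findall("order"):
    order_id = int(order.get("id"))
    conn.execute(
        "INSERT INTO orders (id, customer) VALUES (?, ?)",
        (order_id, order.findtext("customer")),
    )
    for item in order.findall("item"):
        conn.execute("INSERT INTO order_item VALUES (?, ?)", (order_id, item.text))

rows = conn.execute(
    "SELECT o.id, o.customer, COUNT(i.item) FROM orders o "
    "LEFT JOIN order_item i ON i.order_id = o.id "
    "GROUP BY o.id ORDER BY o.id"
).fetchall()
```

The irregular part of the data (zero, one or many items) ends up in its own table, so every row in both tables has the same fixed set of columns.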
How to Extract Information from Semi-structured Data?
Challenges faced
• Flat files: Semi-structured data is usually stored in flat files, which are difficult to index and search
• Heterogeneous sources: Data comes from varied sources, which makes it difficult to tag and search
• Incomplete/irregular structure: Extracting structure when there is none, and interpreting the relations within whatever structure is present, is a difficult task
How to Extract Information from Semi-structured Data?
Possible solutions
• Indexing: Indexing data in a graph-based model enables quick search
• OEM: Allows data to be stored in a graph-based data model, which is easier to index and search
• XML: Allows data to be arranged in a hierarchical or tree-like structure, which enables indexing and searching
• Mining tools: Various mining tools are available which search data based on graphs, schemas, structure, etc.
XML – A Solution for Semi-structured Data Management
XML: Extensible Markup Language
• What is XML? An open-standard markup language written in plain text. It is hardware and software independent
• Does what? Designed to store and transport data over the Internet
• How? It allows data to be stored in a hierarchical/nested structure. It allows users to define tags to store the data
XML – A Solution for Semi-structured Data Management
XML has no predefined tags
The words in the <> (angle brackets) are user-defined tags
XML is known as self-describing, as data can exist without a schema and a schema can be added later
A schema can be described in a DTD or in XML Schema (XSD)
<message>
<to> XYZ </to>
<from> ABC </from>
<subject> Greetings </subject>
<body> Hello! How are you? </body>
</message>
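Because the tags describe the data, a program can read the message without any schema. The sketch below assumes Python's standard `xml.etree.ElementTree` parser; the `MESSAGE` and `fields` names are introduced only for illustration.

```python
import xml.etree.ElementTree as ET

# The message from the slide, held as a plain-text string.
MESSAGE = """<message>
<to> XYZ </to>
<from> ABC </from>
<subject> Greetings </subject>
<body> Hello! How are you? </body>
</message>"""

root = ET.fromstring(MESSAGE)

# Each child element becomes a (tag, text) pair; strip() drops the padding
# spaces that surround the values inside the tags.
fields = {child.tag: child.text.strip() for child in root}
```

The dictionary keys come straight from the user-defined tags, so adding a new tag to the message adds a new key without any change to the code.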
Further Reading
• http://queue.acm.org/detail.cfm?id=1103832
• http://www.computerworld.com/s/article/93968/Taming_Text
• http://searchstorage.techtarget.com/generic/0,295582,sid5_gci1334684,00.html
• http://searchdatamanagement.techtarget.com/generic/0,295582,sid91_gci1264550,00.html
• http://searchdatamanagement.techtarget.com/news/article/0,289142,sid91_gci1252122,00.html
Answer a Quick Question
What is your take on this….
A Web Page is unstructured. If yes, why?
Structured Data
What Is Structured Data?
Structured data:
• Conforms to a data model
• Data is stored in the form of rows and columns (e.g., relational database)
• Data resides in fixed fields within a record or file
• Definition, format & meaning of data is explicitly known
• Attributes in a group are the same
• Similar entities are grouped
Where does Structured Data Come from?
Structured data sources:
• Databases (e.g., Access)
• Spreadsheets
• SQL
• OLTP systems
Structured Data: Everything in its Place
Fully described datasets
Clearly defined categories and sub-categories
Data neatly placed in rows and columns
Data that goes into the records is regulated by a well-defined structure
Indexing can be easily done either by the DBMS itself or manually
Structured Data
Semi-structured:
Name                                 E-mail
Patrick Wood                         ptw@dcs.abc.ac.uk, p.wood@ymail.uk
First name: Mark Last name: Taylor   MarkT@dcs.ymail.ac.uk
Alex Bourdoo                         AlexBourdoo@dcs.ymail.ac.uk

Structured:
First Name   Last Name   E-mail Id                     Alternate E-mail Id
Patrick      Wood        ptw@dcs.abc.ac.uk             p.wood@ymail.uk
Mark         Taylor      MarkT@dcs.ymail.ac.uk
Alex         Bourdoo     AlexBourdoo@dcs.ymail.ac.uk
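The step from the semi-structured listing to the structured one can be sketched as a small normalization routine. The records reuse the names and addresses from the slide; the dictionary representation and the `split_name` and `normalize` helpers are assumptions made for illustration.

```python
def split_name(name):
    """Split 'First Last' into its two parts at the first space."""
    first, _, last = name.partition(" ")
    return first, last

# Semi-structured input: records do not share one set of attributes.
semi_structured = [
    {"Name": "Patrick Wood", "E-mail": "ptw@dcs.abc.ac.uk, p.wood@ymail.uk"},
    {"First name": "Mark", "Last name": "Taylor", "E-mail": "MarkT@dcs.ymail.ac.uk"},
    {"Name": "Alex Bourdoo", "E-mail": "AlexBourdoo@dcs.ymail.ac.uk"},
]

def normalize(record):
    """Map one irregular record onto the four fixed columns."""
    if "Name" in record:
        first, last = split_name(record["Name"])
    else:
        first, last = record["First name"], record["Last name"]
    emails = [address.strip() for address in record["E-mail"].split(",")]
    return {
        "First Name": first,
        "Last Name": last,
        "E-mail Id": emails[0],
        # A missing second address becomes an explicit empty cell.
        "Alternate E-mail Id": emails[1] if len(emails) > 1 else None,
    }

structured = [normalize(record) for record in semi_structured]
```

After normalization every row has the same four attributes, which is the property that lets a database store the rows in one table.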
Ease with Structured Data: Storage
Ease with structured data
• Storage: Data types, both predefined and user-defined, help with the storage of structured data
• Scalability: Scalability is not generally an issue as data increases
• Security
• Update and delete: Updating, deleting, etc. are easy due to the structured form
Ease with Structured Data: Retrieval
Ease with structured data
• Retrieve information: A well-defined structure helps in easy retrieval of data
• Indexing and searching: Data can be indexed based not only on a text string but on other attributes as well. This enables streamlined search
• Mining data: Structured data can be easily mined and knowledge can be extracted from it
• BI operations: BI works extremely well with structured data. Hence data mining, warehousing, etc. can be easily undertaken
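Indexing on an attribute other than a text string can be sketched with SQLite through Python's standard `sqlite3` module. The `employee` table, its rows and the index name are hypothetical, and the exact wording of the query plan differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employee (name, dept, salary) VALUES (?, ?, ?)",
    [("Ravi", "Sales", 52000), ("Meena", "HR", 48000), ("Asha", "Sales", 61000)],
)

# Index on a column: the fixed structure tells the DBMS what to index.
conn.execute("CREATE INDEX idx_employee_dept ON employee (dept)")

names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM employee WHERE dept = ? ORDER BY name", ("Sales",)
    )
]

# The last column of each plan row is a human-readable description.
plan = " ".join(
    row[-1]
    for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT name FROM employee WHERE dept = ?", ("Sales",)
    )
)
```

Inspecting `plan` shows whether the lookup goes through `idx_employee_dept` rather than scanning every row.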
Further Readings
• http://www.govtrack.us/articles/20061209data.xpd
• http://www.sapdesignguild.org/editions/edition2/sui_content.asp
Do it Exercise
Think and write about an instance where data was presented to you in
Unstructured, semi-structured and structured data format
Summary please…
Ask a few participants of the learning program to summarize the lecture.
“Fundamentals of Business Analytics”
RN Prasad and Seema Acharya
Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.

More Related Content

PPTX
Unit_II_1_Types_of_Data.pptx
PPTX
Notes on Types of Digital Data in Data Analytics
PPTX
l2-types-of-digital-data-9-91b2a111.pptx
PPTX
l2-types-of-digital-data-240805145309-91b2a111.pptx
PPTX
345739761-1-Chap-3-Types-of-Digital-Data.pptx
PDF
data analyticsggfgfgfgfdgdfgfdgfdgfdgfdgdffdfdf
PPTX
Big data analytics(BAD601) module-1 ppt
PDF
Understanding the Types of Data in Data Science|ashokveda.pdf
Unit_II_1_Types_of_Data.pptx
Notes on Types of Digital Data in Data Analytics
l2-types-of-digital-data-9-91b2a111.pptx
l2-types-of-digital-data-240805145309-91b2a111.pptx
345739761-1-Chap-3-Types-of-Digital-Data.pptx
data analyticsggfgfgfgfdgdfgfdgfdgfdgfdgdffdfdf
Big data analytics(BAD601) module-1 ppt
Understanding the Types of Data in Data Science|ashokveda.pdf

Similar to Chapter 2.ppt on Types of Digital f Data (20)

PDF
4rth Complete book Database systems Handbook dbms rdbms by Muhammad Sharif.pdf
PDF
Database system Handbook 3rd DONE Complete DBMS book Full book.pdf
PDF
Database system Handbook 3rd DONE Complete DBMS book Full book.pdf
PDF
Database system Handbook 3rd DONE Complete DBMS book Full book.pdf
PDF
4rth Complete book Database systems Handbook dbms rdbms by Muhammad Sharif.pdf
PDF
DBA book sql rdbms 4rth Complete book Database systems Handbook dbms rdbms by...
PPTX
Big data Analytics(BAD601) -module-1 ppt
PPT
Week 5
PPT
Week 5
PDF
the study of data to extract meaningful insights for business
PPT
Behind The Scenes Databases And Information Systems 6
PPT
Database
PPT
Database
PPTX
Database Introduction for MIS Students.pptx
PDF
Webinar: Decoding the Mystery - How to Know if You Need a Data Catalog, a Dat...
PDF
Information and Integration Management Vision
PDF
Database systems Handbook 4th dbms by Muhammad Sharif.pdf
PDF
Database systems Handbook 4th dbms by Muhammad Sharif.pdf
PDF
Database systems Handbook 4th dbms by Muhammad Sharif.pdf
PDF
Database systems Handbook by Muhammad sharif dba.pdf
4rth Complete book Database systems Handbook dbms rdbms by Muhammad Sharif.pdf
Database system Handbook 3rd DONE Complete DBMS book Full book.pdf
Database system Handbook 3rd DONE Complete DBMS book Full book.pdf
Database system Handbook 3rd DONE Complete DBMS book Full book.pdf
4rth Complete book Database systems Handbook dbms rdbms by Muhammad Sharif.pdf
DBA book sql rdbms 4rth Complete book Database systems Handbook dbms rdbms by...
Big data Analytics(BAD601) -module-1 ppt
Week 5
Week 5
the study of data to extract meaningful insights for business
Behind The Scenes Databases And Information Systems 6
Database
Database
Database Introduction for MIS Students.pptx
Webinar: Decoding the Mystery - How to Know if You Need a Data Catalog, a Dat...
Information and Integration Management Vision
Database systems Handbook 4th dbms by Muhammad Sharif.pdf
Database systems Handbook 4th dbms by Muhammad Sharif.pdf
Database systems Handbook 4th dbms by Muhammad Sharif.pdf
Database systems Handbook by Muhammad sharif dba.pdf
Ad

Recently uploaded (20)

PPTX
Qualitative Qantitative and Mixed Methods.pptx
PPTX
climate analysis of Dhaka ,Banglades.pptx
PPTX
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
PPTX
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
PDF
Business Analytics and business intelligence.pdf
PPTX
Computer network topology notes for revision
PDF
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
PDF
Lecture1 pattern recognition............
PDF
Introduction to Data Science and Data Analysis
PPT
Predictive modeling basics in data cleaning process
PDF
annual-report-2024-2025 original latest.
PPTX
SAP 2 completion done . PRESENTATION.pptx
PPTX
Supervised vs unsupervised machine learning algorithms
PPTX
oil_refinery_comprehensive_20250804084928 (1).pptx
PDF
.pdf is not working space design for the following data for the following dat...
PPTX
modul_python (1).pptx for professional and student
PDF
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
PPTX
IB Computer Science - Internal Assessment.pptx
PPTX
Database Infoormation System (DBIS).pptx
PDF
Clinical guidelines as a resource for EBP(1).pdf
Qualitative Qantitative and Mixed Methods.pptx
climate analysis of Dhaka ,Banglades.pptx
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
Business Analytics and business intelligence.pdf
Computer network topology notes for revision
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
Lecture1 pattern recognition............
Introduction to Data Science and Data Analysis
Predictive modeling basics in data cleaning process
annual-report-2024-2025 original latest.
SAP 2 completion done . PRESENTATION.pptx
Supervised vs unsupervised machine learning algorithms
oil_refinery_comprehensive_20250804084928 (1).pptx
.pdf is not working space design for the following data for the following dat...
modul_python (1).pptx for professional and student
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
IB Computer Science - Internal Assessment.pptx
Database Infoormation System (DBIS).pptx
Clinical guidelines as a resource for EBP(1).pdf
Ad

Chapter 2.ppt on Types of Digital f Data

  • 1. “Fundamentals of Business Analytics” RN Prasad and Seema Acharya Copyright  2011 Wiley India Pvt. Ltd. All rights reserved. Chapter 2 Types of Digital Data
  • 2. Learning Objectives and Learning Outcomes Learning Objectives Learning Outcomes Introduction to digital data and its types 1. Structured data – origin, organization, storage, access and usage 2. Semi-structured data – origin, organization, storage, access and usage 3. Unstructured data – origin, organization, storage, access and usage (a) To differentiate between structured, unstructured and semi-structured data (b) To understand the need to integrate structured, unstructured and semi- structured data “Fundamentals of Business Analytics” RN Prasad and Seema Acharya Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
  • 3. Session Plan Lecture time : 45 to 60 minutes Q/A : 15 minutes “Fundamentals of Business Analytics” RN Prasad and Seema Acharya Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
  • 4. Agenda • Types of digital data – Unstructured • Origin • Management • Storage • Storage of unstructured data in relational database • Process of extracting information • Key take-away and additional reads – Semi-structured • Origin • Management • Storage • Storage of semi-structured data in relational database • Process of extracting information • XML • Key take-away and additional reads
  • 5. Agenda (contd.) • Types of digital data – contd. – Structured • Origin • Management • Storage • Process of extracting information • Key take-away and additional reads “Fundamentals of Business Analytics” RN Prasad and Seema Acharya Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
  • 6. Digital Data • Digital data can be – Unstructured – Semi-structured – Structured • According to Merrill Lynch 80–90% of business data is either unstructured or semi-structured • Data is usually in a format which makes it difficult to extract information from it “Fundamentals of Business Analytics” RN Prasad and Seema Acharya Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
  • 7. Formats of Digital Data “Fundamentals of Business Analytics” RN Prasad and Seema Acharya Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
  • 8. Unstructured Data “Fundamentals of Business Analytics” RN Prasad and Seema Acharya Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
  • 9. What is Unstructured Data? Unstructured data Does not conform to any data model Cannot be stored in form of rows and columns as in a database Not in any particular format or sequence Not easily usable by a program Does not follow any rule or semantics Has no easily identifiable structure
  • 10. Where does Unstructured Data Come from? Web pages Memos Videos (MPEG, etc.) Images (JPEG, GIF, etc.) Body of an e-mail Word document PowerPoint presentations Chats Reports Whitepapers Surveys Unstructured data
  • 11. How to Store Unstructured Data? Storage Space Scalability Retrieve information Security Update and delete Indexing and searching Sheer volume of unstructured data and its unprecedented growth makes it difficult to store. Audios, videos, images, etc. acquire huge amount of storage space Scalability becomes an issue with increase in unstructured data Retrieving and recovering unstructured data are cumbersome Ensuring security is difficult due to varied sources of data (e.g. e-mail, web pages) Updating, deleting, etc. are not easy due to the unstructured form Indexing becomes difficult with increase in data. Searching is difficult for non-text data Challenges faced
  • 12. How to Store Unstructured Data? Change formats New hardware RDBMS/ BLOBs XML CAS Unstructured data may be be converted to formats which are easily managed, stored and searched. For example, IBM is working on providing a solution which converts audio , video, etc. to text Create hardware which support unstructured data either compliment the existing storage devices or be a stand alone for unstructured data Store in relational databases which support BLOBs which is Binary Large Objects Store in XML which tries to give some structure to unstructured data by using tags and elements Possible solutions Organize files based on their metadata
  • 13. How to Extract Information from Unstructured Data? Interpretation Tags Deriving meaning File formats Classification/ Taxonomy Indexing Unstructured data is not easily interpreted by conventional search algorithms As the data grows it is not possible to put tags manually Increasing number of file formats make it difficult to interpret data Different naming conventions followed across the organization make it difficult to classify data. Computer programs cannot automatically derive meaning/structure from unstructured data Challenges faced Designing algorithms to understand the meaning of the document and then tag or index them accordingly is difficult
  • 14. How to Extract Information from Unstructured Data? Tags Text mining Application platforms Classification/ Taxonomy Naming conventions/ standards Unstructured data can be stored in a virtual repository and be automatically tagged. For example, Documentum provides this type of solution Application platforms like XOLAP help extract information from e-mail and XML based documents Taxonomies within the organization can be managed automatically to organize data in hierarchical structures Text mining tools help in grouping and classifying unstructured data and analyze by considering grammar, context, synonyms ,etc. Possible solutions Following naming conventions or standards across an organization can greatly improve storage and retrieval
  • 15. Further Reading • http://www.information-management.com/issues/20030201/6287-1.html • http://www.enterpriseitplanet.com/storage/features/article.php/11318_34071 61_2 • http://domino.research.ibm.com/comm/research_projects.nsf/pages/uima.ind ex.html • http://www.research.ibm.com/UIMA/UIMA%20Architecture%20Highlights. html “Fundamentals of Business Analytics” RN Prasad and Seema Acharya Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
  • 16. Answer a Quick Question Ask the participants of the learning program to state some more examples of Unstructured data “Fundamentals of Business Analytics” RN Prasad and Seema Acharya Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
  • 17. Do it Exercise Search, think and write about two best practices for managing the growth of unstructured data “Fundamentals of Business Analytics” RN Prasad and Seema Acharya Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
  • 18. Semi-structured Data “Fundamentals of Business Analytics” RN Prasad and Seema Acharya Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
  • 19. What is Semi-structured Data? Semi- structured data Does not conform to a data model but contains tags & elements (metadata) Cannot be stored in form of rows and columns as in a database The tags and elements describe how data is stored Not sufficient Metadata Attributes in a group may not be the same Similar entities are grouped
  • 20. Where does Semi-structured Data Come from? E-mail XML TCP/IP packets Zipped files Binary executables Mark-up languages Integration of data from heterogeneous sources Semi-structured data
  • 21. How to Manage Semi-structured Data? Schemas • Describe the structure and content of data to some extent • Assign meaning to data hence allowing automatic search and indexing Graph-based data models • Contain data on the leaves of the graph. Also known as ‘schema less’ • Used for data exchange among heterogeneous sources XML • Models the data using tags and elements • Schemas are not tightly coupled to data Some ways in which semi-structured data is managed and stored
  • 22. How to Store Semi-structured Data? Storage cost RDBMS Irregular and partial structure Implicit structure Evolving schemas Distinction between schema and data Storing data with their schemas increases cost Semi-structured data cannot be stored in existing RDBMS as data cannot be mapped into tables directly Challenges faced Some data elements may have extra information while others none at all In many cases the structure is implicit. Interpreting relationships and correlations is very difficult Schemas keep changing with requirements making it difficult to capture it in a database Vague distinction between schema and data exists at times making it difficult to capture data
  • 23. How to Store Semi-structured Data? XML RDBMS Special purpose DBMS OEM Possible solutions XML allows to define tags and attributes to store data. Data can be stored in a hierarchical/nested structure Semi-structured data can be stored in a relational database by mapping the data to a relational schema which is then mapped to a table Databases which are specifically designed to store semi-structured data Data can be stored and exchanged in the form of graph where entities are represented as objects which are the vertices in a graph
  • 24. How to Extract Information from Semi-structured Data? Flat files Heterogeneous sources Incomplete/ irregular structure Semi-structured is usually stored in flat files which are difficult to index and search Data comes from varied sources which is difficult to tag and search Extracting structure when there is none and interpreting the relations existing in the structure which is present is a difficult task Challenges faced
  • 25. How to Extract Information from Semi-structured Data? Indexing OEM XML Mining tools Indexing data in a graph-based model enables quick search Allows data to be stored in a graph-based data model which is easier to index and search Allows data to be arranged in a hierarchical or tree-like structure which enables indexing and searching Various mining tools are available which search data based on graphs, schemas, structure, etc. Possible solutions
  • 26. XML – A Solution for Semi-structured Data Management XML Extensible MarkUp Language What is XML? Open-source mark up language written in plain text. It is hardware and software independent Does what? Designed to store and transport data over the Internet How? It allows data to be stored in a hierarchical/nested structure. It allows user to define tags to store the data
  • 27. XML – A Solution for Semi-structured Data Management XML has no predefined tags The words in the <> (angular brackets) are user-defined tags XML is known as self-describing as data can exist without a schema and schema can be added later Schema can be described in XSLT or XML schema <message> <to> XYZ </to> <from> ABC </from> <subject> Greetings </subject> <body> Hello! How are you? </body> </message> “Fundamentals of Business Analytics” RN Prasad and Seema Acharya Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
  • 28. Further Reading • http://queue.acm.org/detail.cfm?id=1103832 • http://www.computerworld.com/s/article/93968/Taming_Text • http://searchstorage.techtarget.com/generic/0,295582,sid5_gci1334684,00. html • http://searchdatamanagement.techtarget.com/generic/0,295582,sid91_gci1 264550,00.html • http://searchdatamanagement.techtarget.com/news/article/0,289142,sid91_ gci1252122,00.html “Fundamentals of Business Analytics” RN Prasad and Seema Acharya Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
  • 29. Answer a Quick Question What is your take on this…. A Web Page is unstructured. If yes, why? “Fundamentals of Business Analytics” RN Prasad and Seema Acharya Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
  • 31. What Is Structured Data? Structured data Conforms to a data model Data is stored in form of rows and columns (e.g., relational database) Data resides in fixed fields within a record or file Definition, format & meaning of data is explicitly known Attributes in a group are the same Similar entities are grouped
  • 32. Where does Structured Data Come from? Databases (e.g., Access) Spreadsheets SQL OLTP systems Structured Data “Fundamentals of Business Analytics” RN Prasad and Seema Acharya Copyright  2011 Wiley India Pvt. Ltd. All rights reserved.
  • 33. Structured Data: Everything in its Place
      – Fully described datasets
      – Clearly defined categories and sub-categories
      – Data neatly placed in rows and columns
      – Data that goes into the records is regulated by a well-defined structure
      – Indexing can be done easily, either by the DBMS itself or manually
  • 34. Structured Data
      Semi-structured:
        Name                                  E-mail
        Patrick Wood                          ptw@dcs.abc.ac.uk, p.wood@ymail.uk
        First name: Mark Last name: Taylor    MarkT@dcs.ymail.ac.uk
        Alex Bourdoo                          AlexBourdoo@dcs.ymail.ac.uk
      Structured:
        First Name   Last Name   E-mail Id                      Alternate E-mail Id
        Patrick      Wood        ptw@dcs.abc.ac.uk              p.wood@ymail.uk
        Mark         Taylor      MarkT@dcs.ymail.ac.uk
        Alex         Bourdoo     AlexBourdoo@dcs.ymail.ac.uk
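The move from the semi-structured records to the structured table on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionary layout, the column names in `COLUMNS`, and the helper `to_row` are hypothetical, and splitting a full name on the first space is a simplification that works for this sample only.

```python
# The three records from the slide, in their irregular, semi-structured form.
records = [
    {"name": "Patrick Wood", "email": "ptw@dcs.abc.ac.uk, p.wood@ymail.uk"},
    {"name": {"first name": "Mark", "last name": "Taylor"},
     "email": "MarkT@dcs.ymail.ac.uk"},
    {"name": "Alex Bourdoo", "email": "AlexBourdoo@dcs.ymail.ac.uk"},
]

# Assumed column names for the structured form.
COLUMNS = ("first_name", "last_name", "email_id", "alternate_email_id")

def to_row(record):
    """Turn one irregular record into a fixed four-column row."""
    name = record["name"]
    if isinstance(name, dict):
        # The name was already split into parts.
        first, last = name["first name"], name["last name"]
    else:
        # Simplification: treat the first space as the first/last name boundary.
        first, _, last = name.partition(" ")
    emails = [e.strip() for e in record["email"].split(",")]
    alternate = emails[1] if len(emails) > 1 else None
    return (first, last, emails[0], alternate)

rows = [to_row(r) for r in records]
for row in rows:
    print(dict(zip(COLUMNS, row)))
```

Every row now has the same attributes in the same order, and a missing alternate e-mail is represented explicitly as `None` rather than by the absence of a field.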
  • 35. Ease with Structured Data: Storage
      – Storage: Data types, both predefined and user-defined, help with the storage of structured data.
      – Scalability: Scalability is generally not an issue as the data grows.
      – Security
      – Update and delete: Updating, deleting, etc. is easy because of the structured form.
  • 36. Ease with Structured Data: Retrieval
      – Retrieve information: A well-defined structure helps in easy retrieval of data.
      – Indexing and searching: Data can be indexed based not only on a text string but on other attributes as well. This enables streamlined search.
      – Mining data: Structured data can be easily mined, and knowledge can be extracted from it.
      – BI operations: BI works extremely well with structured data. Hence data mining, warehousing, etc. can be easily undertaken.
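Indexing and retrieval on an attribute can be sketched with Python's built-in `sqlite3` module. The table name `contact`, its column names, and the helper `find_by_last_name` are assumptions made for this illustration; the rows are the ones from the structured table on slide 34.

```python
import sqlite3

# An in-memory relational database holding the structured rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contact ("
    " first_name TEXT NOT NULL,"
    " last_name TEXT NOT NULL,"
    " email_id TEXT NOT NULL,"
    " alternate_email_id TEXT)"
)
conn.executemany(
    "INSERT INTO contact VALUES (?, ?, ?, ?)",
    [
        ("Patrick", "Wood", "ptw@dcs.abc.ac.uk", "p.wood@ymail.uk"),
        ("Mark", "Taylor", "MarkT@dcs.ymail.ac.uk", None),
        ("Alex", "Bourdoo", "AlexBourdoo@dcs.ymail.ac.uk", None),
    ],
)

# Index on an attribute other than free text, so lookups by last name
# do not have to scan every row.
conn.execute("CREATE INDEX idx_contact_last_name ON contact (last_name)")

def find_by_last_name(connection, last_name):
    """Return (first_name, email_id) pairs for the given last name."""
    cursor = connection.execute(
        "SELECT first_name, email_id FROM contact WHERE last_name = ?",
        (last_name,),
    )
    return cursor.fetchall()

matches = find_by_last_name(conn, "Taylor")
print(matches)
```

Because every record has the same named columns, the query can address a field directly, which is what makes search, mining and BI operations straightforward on structured data.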
  • 37. Further Readings
      – http://www.govtrack.us/articles/20061209data.xpd
      – http://www.sapdesignguild.org/editions/edition2/sui_content.asp
  • 38. Do it Exercise: Think and write about an instance where data was presented to you in unstructured, semi-structured and structured format.
  • 39. Summary please… Ask a few participants of the learning program to summarize the lecture.

Editor's Notes

  • #7: Gartner estimates that unstructured data constitutes 80% of all enterprise data. Bank of America Merrill Lynch is the investment banking and wealth management division of Bank of America.
  • #8: Data can be classified as:
      – Unstructured data: Data that does not conform to a data model or is not in a form that a computer program can use easily. 80-90% of an organization's data is in this format. E.g., memos, chat rooms, PowerPoint presentations, images, videos, letters, research, white papers, the body of an e-mail, etc.
      – Semi-structured data: Data that does not conform to a data model but has some structure. It is not in a form that a computer program can use easily. Metadata is available, but it is not sufficient. E.g., e-mails, XML, and markup languages like HTML.
      – Structured data: Data that is in an organized form, e.g., in rows and columns, and can be easily used by a computer program. Relationships exist between entities, such as classes and their objects. E.g., data stored in databases.
  • #10: Unstructured data is growing at an alarming rate. 80-85% of the data in any organization is unstructured, and an enormous amount of knowledge is buried in it. Unstructured data is data that cannot be stored in the form of rows and columns in a database and does not conform to any data model, i.e., it is difficult to determine the meaning of the data. It does not follow any rules or semantics. It can be of any type and is hence unpredictable.
  • #11: Unstructured data: Broadly speaking, anything that is in a non-database form is unstructured data. It can be classified into two broad categories:
      – Bitmap objects, e.g., image, video or audio files.
      – Textual objects, e.g., Microsoft Word documents, e-mails or Microsoft Excel spreadsheets.
    Take e-mail as an example: even though e-mail messages may be organized in databases such as Microsoft Exchange or Lotus Notes, the body of the e-mail is essentially raw data, i.e., free-form text without any structure. A lot of unstructured data is also noisy text, such as chat, e-mails and SMS texts, in which the language differs significantly from the standard form of the language.
  • #12: It is difficult to store and manage unstructured data. A huge amount of space is required to store it, and it is difficult to store images, videos, audio, etc. As the data grows, scalability becomes an issue and the cost of storing such data grows. Even once it is stored, unstructured data is difficult to retrieve and recover. Updating and deleting unstructured data is very difficult because retrieval is difficult without a clear structure. Indexing unstructured data is difficult and error-prone because the structure is not clear and attributes are not predefined. As a result, search results are not very accurate. Indexing becomes all the more difficult as the volume of data grows.
  • #13: CAS, or Content Addressable Storage, stores data based on their metadata. It assigns a unique name to every object stored in it. The object is retrieved based on its content and not its location. It is used extensively to store e-mails, etc. Unstructured data can be stored in BLOBs, or Binary Large Objects, in relational databases. While unstructured data such as a video or image file cannot be stored neatly in a relational column, there is no such problem when it comes to storing its metadata, such as the date and time of its creation, the owner, the author of the data, etc. XML, or Extensible Markup Language, is explained in detail in the semi-structured data section.
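The idea of retrieving an object by its content rather than its location can be sketched as below. This is a toy illustration, not how any particular CAS product works: deriving the unique name from a SHA-256 digest of the content is an assumption, and the class name `ContentAddressableStore` is hypothetical.

```python
import hashlib

class ContentAddressableStore:
    """Toy in-memory store whose object names are derived from content."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # Assumption: the unique name is the SHA-256 digest of the content.
        address = hashlib.sha256(data).hexdigest()
        self._objects[address] = data
        return address

    def get(self, address: str) -> bytes:
        # Lookup uses the content-derived name, not a file path or location.
        return self._objects[address]

store = ContentAddressableStore()
first = store.put(b"Hello! How are you?")
second = store.put(b"Hello! How are you?")  # same content, same address
print(first == second)
```

A side effect of this naming scheme is that identical content is stored only once, which is one reason the approach suits large volumes of e-mail and similar fixed content.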
  • #15: Possible solutions:
      – Naming conventions/standards: Organization-wide file naming conventions or standards need to be defined to ensure easier storage and indexing/searching of data.
      – Tags: Storing unstructured data in a virtual repository that automatically tags the data and helps in its retrieval and searching is a possible solution. Documentum provides this kind of solution.
      – Text mining: Various text mining tools are available that either classify unstructured data or analyze and search it based on grammar, context, synonyms, etc.
  • #20: Semi-structured data does not conform to any data model, i.e., it is difficult to determine the meaning of the data, and the data cannot be stored in rows and columns as in a database. However, semi-structured data has tags and markers that help to group data and describe how it is stored. These give some metadata, but not enough for management and automation of the data. Similar entities in the data are grouped and organized in a hierarchy. The attributes or properties within a group may or may not be the same. For example, two addresses may or may not contain the same number of properties:
      Address 1: <house number><street name><area name><city>
      Address 2: <house number><street name><city>
    As another example, an e-mail follows a standard format:
      To: <Name>
      From: <Name>
      Subject: <Text>
      CC: <Name>
      Body: <Text, Graphics, Images etc.>
    The tags give us some metadata, but the body of the e-mail has no format, nor does it convey the meaning of the data it contains. There is a very fine line between unstructured and semi-structured data.
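The split between tagged metadata and a free-form body in an e-mail can be seen with Python's standard `email` package. The message text below is a made-up sample that reuses the placeholder values from the XML slide; it is an illustration, not data from the source.

```python
from email import message_from_string

# A made-up sample message: three tagged header fields and a free-form body.
raw = (
    "To: XYZ\n"
    "From: ABC\n"
    "Subject: Greetings\n"
    "\n"
    "Hello! How are you?\n"
)

msg = message_from_string(raw)

# The headers are the structured part: named fields that can be read directly.
headers = dict(msg.items())

# The body is plain text with no internal structure to query.
body = msg.get_payload()

print(headers["Subject"])
print(body.strip())
```

The program can ask for the subject by name, but nothing comparable exists for the body, which is why an e-mail sits on the line between semi-structured and unstructured data.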
  • #21: E-mail, XML, TCP/IP packets, zipped files, etc. are semi-structured data, as all of them have a certain amount of metadata. Integration of data from heterogeneous sources leads to semi-structured data, as data from one source may not have adequate structure while other sources may have information that is not required or is missing. In short, in semi-structured data:
      – data is organized into semantic entities
      – similar entities are grouped together
      – entities in the same group may not have the same attributes
      – the order of attributes is not necessarily important
      – not all attributes are always required
      – the size of the same attribute may differ within a group
      – the type of the same attribute may differ within a group
    For example, the names and e-mail addresses of different people can be stored in more than one way, as shown below:
      name: Patrick Wood
      email: ptw@dcs.abc.ac.uk, p.wood@ymail.uk

      name:
        first name: Mark
        last name: Taylor
      email: MarkT@dcs.ymail.ac.uk

      name: Alex Bourdoo
      email: AlexBourdoo@dcs.ymail.ac.uk
  • #22: Schemas can be used to describe the structure of data. Schemas define the constraints on the structure, the content of the document, etc. The problem with schemas is that requirements are ever-changing, and changes required in the data also require changes in the schema. Graph-based data models can also be used to describe data. This is a 'schema-less' approach, also known as 'self-describing', because data is presented in such a way that it explains itself. The relationships and hierarchies are represented in the form of a tree-like structure where the vertices contain the object or entity and the leaves contain data. XML is widely used to store and exchange semi-structured data. It allows users to define tags to store data in hierarchical or nested form. Schemas in XML are not tightly coupled to data.
  • #23: Semi-structured data usually has an irregular and partial structure. Data from some sources may have partial structure, while data from others may have none at all. Structure in some sources is implicit, which makes it very difficult to interpret relationships between data. In the case of semi-structured data, schema and data are usually tightly coupled. The same queries may update both schema and data, and the schema is updated very frequently. Sometimes the distinction between schema and data is very vague. E.g., in some cases the data from a source may contain the 'status' (i.e., married or single) as true or false and treat it as a separate attribute, while in other sources it may be an attribute of a larger set or class. These problems complicate the design of a structure for the data.
  • #24: OEM, or Object Exchange Model, is a model for storing and exchanging semi-structured data. It structures data in the form of graphs. The entities are taken as objects, which are represented as the vertices of the graph. Labels connect objects and sub-objects, thereby defining the relationships between them.
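A labeled graph of objects in the spirit of OEM can be sketched in Python as follows. This is a simplified illustration rather than the formal OEM definition: the object identifiers, the dictionary representation, and the helper `follow` are all hypothetical. Complex objects hold a list of (label, sub-object) edges, and atomic objects hold a value.

```python
# Objects are the vertices; each (label, object id) pair is a labeled edge.
# Atomic objects map directly to their value. The data is two of the
# records used in the name/e-mail example.
graph = {
    "root": [("person", "p1"), ("person", "p2")],
    "p1": [("name", "n1"), ("email", "e1"), ("email", "e2")],
    "p2": [("name", "n2"), ("email", "e3")],
    "n1": "Patrick Wood",
    "e1": "ptw@dcs.abc.ac.uk",
    "e2": "p.wood@ymail.uk",
    "n2": [("first name", "f1"), ("last name", "l1")],
    "f1": "Mark",
    "l1": "Taylor",
    "e3": "MarkT@dcs.ymail.ac.uk",
}

def follow(graph, start, path):
    """Return the atomic values reached from `start` along the labels in `path`."""
    current = [start]
    for label in path:
        reached = []
        for object_id in current:
            edges = graph[object_id]
            if isinstance(edges, list):  # only complex objects have edges
                reached.extend(child for edge_label, child in edges
                               if edge_label == label)
        current = reached
    return [graph[o] for o in current if not isinstance(graph[o], list)]

emails = follow(graph, "root", ["person", "email"])
first_names = follow(graph, "root", ["person", "name", "first name"])
print(emails)
print(first_names)
```

The two person objects have differently shaped names, yet both fit in the same graph without a schema; a path of labels plays the role that a column name plays in a table.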
  • #25: Data coming from heterogeneous sources contain different structures (in some cases none at all!) and it is difficult to tag and index them.
  • #26: Storing data in a graph based form enables indexing in semi-structured data. It makes search quick and efficient as it is easy to search leaf based data.
  • #28: XSLT: Extensible Stylesheet Language Transformations, a language for transforming XML documents into other formats.
  • #32: Structured data is organized in semantic chunks (entities). Similar entities are grouped together (relations or classes). Entities in the same group have the same descriptions (attributes). Descriptions for all entities in a group (schema):
      – have the same defined format
      – have a predefined length
      – are all present
      – follow the same order
  • #35: Consider the example taken for semi-structured data:
      name: Patrick Wood
      email: ptw@dcs.abc.ac.uk, p.wood@ymail.uk

      name:
        first name: Mark
        last name: Taylor
      email: MarkT@dcs.ymail.ac.uk

      name: Alex Bourdoo
      email: AlexBourdoo@dcs.ymail.ac.uk
    The same data, when stored in structured form, is held in rows and columns, each with a defined format, as shown on the slide.