CLASSIFICATION OF DIGITAL DATA
By Dr. Shruti Arora
OUTLINE
1. Introduction
2. Structured Data
3. Unstructured Data
4. Semi-Structured Data
5. Difference between Semi-Structured and Structured Data
Introduction
• Data growth has accelerated exponentially since the advent of the computer and the internet.
• Definition: digital data is data stored in a digital format rather than in physical form, for example a picture, a document, or a video.
• Digital data can be classified into three forms:
  1. Unstructured data
  2. Semi-structured data
  3. Structured data
Sources of structured data
• Databases (e.g. Access)
• Spreadsheets
• SQL
• OLTP systems
Characteristics of structured data
• Conforms to a data model
• Data is stored in the form of rows and columns
• Similar entities are grouped
• Attributes in the group are the same
• Definition, format, and meaning of the data are explicitly known
• Data resides in fixed fields within a record or a file
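As a minimal sketch of these characteristics, the snippet below uses Python's standard sqlite3 module. The employee table, its columns, and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")

# The schema acts as the data model: every row has the same attributes,
# each with an explicitly declared name and type.
conn.execute(
    "CREATE TABLE employee ("
    " emp_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " department TEXT NOT NULL,"
    " salary REAL NOT NULL)"
)

rows = [
    (1, "Asha", "Finance", 52000.0),
    (2, "Ravi", "Finance", 48000.0),
    (3, "Meena", "Sales", 45000.0),
]
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?)", rows)

# Column names are known up front from the schema, not inferred from the data.
columns = [info[1] for info in conn.execute("PRAGMA table_info(employee)")]

# Similar entities can be grouped because their attributes are uniform.
by_department = conn.execute(
    "SELECT department, COUNT(*) FROM employee "
    "GROUP BY department ORDER BY department"
).fetchall()

print(columns)
print(by_department)
```

Because the structure is declared before any data arrives, a row that lacks a required field is rejected at insert time rather than discovered later during analysis.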
Advantages of Structured Data
Ease with structured data:
• Storage
• Scalability
• Security
• Update and delete
Advantages of structured data (Easy to work with structured data)
It is easy to work with structured data. The advantages are:
• Storage: Both predefined and user-defined data types help with the storage of structured data.
• Scalability: Scalability is generally not an issue as the data grows.
• Security: Ensuring security is easy.
• Update and delete: Updating, deleting, etc. are easy because of the structured form.
Hassle free structured data
Ease with structured data:
• Retrieving information
• Indexing and searching
• Mining data
• BI operations
Hassle Free Retrieval
• Retrieval of structured data is straightforward. The features are as follows:
• Retrieving information: A well-defined structure makes data easy to retrieve.
• Indexing and searching: Data can be indexed not only on a text string but also on other attributes. This enables streamlined search.
• Mining data: Structured data can be easily mined, and knowledge can be extracted from it.
• BI operations: BI (business intelligence) works extremely well with structured data. Hence data mining, warehousing, etc. can be easily undertaken.
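The retrieval points above can be sketched with sqlite3 from the Python standard library. The orders table, the index name, and the values are hypothetical; the example only illustrates indexing on a non-text attribute and a simple BI-style aggregate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Asha", 120.0), (2, "Ravi", 75.5), (3, "Asha", 310.0), (4, "Meena", 42.0)],
)

# Index on a numeric attribute, not just on a text string.
conn.execute("CREATE INDEX idx_orders_amount ON orders (amount)")

# Retrieval: a range search on the indexed attribute.
large_orders = conn.execute(
    "SELECT order_id, customer, amount FROM orders WHERE amount > ? ORDER BY amount",
    (100.0,),
).fetchall()

# A simple BI-style aggregate: total amount per customer.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()

print(large_orders)
print(totals)
```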
UNSTRUCTURED DATA
• Unstructured data is data that cannot be stored in rows and columns as in a database and does not conform to any data model, so it is difficult to determine the meaning of the data.
• It does not follow any rules and can be of any type, and thus it is unpredictable.
CHARACTERISTICS OF UNSTRUCTURED DATA
SOURCES OF UNSTRUCTURED DATA
Where does unstructured data come from?
• Web pages, memos, videos (MPEG, etc.), images (JPEG, GIF, etc.), the body of an email, Word documents, PowerPoint presentations, chats, reports, white papers, surveys, etc.
Anything in a non-database form is unstructured data. It can be divided into two broad categories:
• Bitmap objects: e.g. image, video, or audio files.
• Textual objects: e.g. Microsoft Word documents, emails, or MS Excel files.
• A lot of unstructured data is also noisy text, such as chats, emails, and SMS texts.
MANAGING UNSTRUCTURED DATA
• INDEXING: Data is indexed to enable faster search and retrieval. An index is an identifier, defined on the basis of some value in the data, that represents a larger record in the data set.
• Indexing unstructured data is difficult: text can be indexed on a text string, but for non-text files (e.g. audio/video) indexing depends on file names.
• TAGS/METADATA: Metadata can be used to tag the data in a document. With unstructured data this is difficult because little or no metadata is available. Also, the data itself has no particular format and comes from more than one source.
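Indexing text on a text string can be illustrated with a small inverted index. This is a minimal sketch in Python; the file names, the sample texts, and the helper names build_inverted_index and search are hypothetical.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


documents = {
    "memo1.txt": "Quarterly sales report for the finance team",
    "chat7.txt": "can u send the sales numbers pls",
    "mail3.txt": "Meeting moved to Friday",
}
index = build_inverted_index(documents)

print(search(index, "sales"))
print(search(index, "sales report"))
```

An audio or video file offers no words to feed into such an index, which is why indexing for those files falls back to file names or externally supplied tags.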
• CLASSIFICATION/TAXONOMY: Taxonomy is the classification of data on the basis of the relationships that exist between data items. Data can be grouped and placed in hierarchies based on the taxonomy prevalent in a firm.
• In the absence of any structure or metadata, identifying relationships between data is difficult, and naming standards are not consistent across the firm, which makes the data difficult to classify.
• CAS (Content Addressable Storage): It stores data based on their metadata. It assigns a unique name to every object stored in it.
• The object is retrieved based on its content and not its location.
• It is used to store emails, etc.
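The idea of retrieving an object by its content rather than its location can be sketched as follows. The slides do not say how the unique name is derived; using a SHA-256 digest of the content is an assumption made here for illustration, and the ContentAddressableStore class is hypothetical.

```python
import hashlib


class ContentAddressableStore:
    """Toy in-memory store where an object's name is derived from its content."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # Assumption: the unique name is the SHA-256 digest of the content.
        address = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same address, so it is stored once.
        self._objects[address] = data
        return address

    def get(self, address: str) -> bytes:
        return self._objects[address]

    def __len__(self):
        return len(self._objects)


store = ContentAddressableStore()
first = store.put(b"Subject: Minutes of the meeting")
second = store.put(b"Subject: Minutes of the meeting")  # duplicate email
third = store.put(b"Subject: Holiday schedule")

print(first == second, len(store))
```

Because the address depends only on the content, storing the same email twice does not create a second copy, which is one reason this style of storage suits email archives.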
CHALLENGES FACED WHILE STORING UNSTRUCTURED DATA
• Storage space: Unstructured data is difficult to store and manage. A lot of space is required to store such data, particularly images, videos, audio, etc.
• Scalability: As the data grows, scalability becomes an issue and the cost of storing the data grows.
• Retrieving information: Even once unstructured data is stored, it is difficult to retrieve and recover information from it.
• Security: Ensuring security is difficult because the data comes from varied sources, e.g. emails, web pages, etc.
• Update and delete: Updating and deleting unstructured data is very difficult because retrieval is difficult without a clear structure.
• Indexing and searching: Indexing unstructured data is difficult because the structure is not clear and attributes are not predefined.
SOLUTIONS FOR STORING UNSTRUCTURED DATA
• Changing format: Unstructured data may be converted to formats that are easily managed, stored, and searched.
• Developing new hardware: New hardware needs to be developed to support unstructured data. It may either complement the existing storage device or be stand-alone for unstructured data.
• Storing in RDBMS/BLOBs (Binary Large Objects): Unstructured data such as video or images does not fit ordinary relational columns and is typically stored as a BLOB. Its metadata, such as the date and time of creation or the author, can be stored in regular columns without any such problem.
• Storing in XML format: Unstructured data may be stored in XML format, which tries to give it some structure by using tags and elements.
• CAS (Content Addressable Storage): It organizes files based on their metadata and assigns a unique name to every object stored in it. It is used extensively to store emails.
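Storing unstructured text in XML can be sketched with Python's standard xml.etree.ElementTree module. The element names (document, metadata, author, created, body) and the helper wrap_as_xml are hypothetical; the point is only that tags add a layer of structure around free text.

```python
import xml.etree.ElementTree as ET


def wrap_as_xml(body: str, author: str, created: str) -> str:
    """Wrap free text in a small XML envelope that carries its metadata."""
    doc = ET.Element("document")
    meta = ET.SubElement(doc, "metadata")
    ET.SubElement(meta, "author").text = author
    ET.SubElement(meta, "created").text = created
    ET.SubElement(doc, "body").text = body
    # Special characters such as "&" are escaped during serialization.
    return ET.tostring(doc, encoding="unicode")


xml_text = wrap_as_xml(
    body="Sales were up in Q2 & costs were flat.",
    author="Asha",
    created="2024-07-01",
)
parsed = ET.fromstring(xml_text)

print(xml_text)
print(parsed.findtext("metadata/author"))
```

The body remains free text, but the metadata can now be queried by element name, which is the partial structure the XML approach aims for.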
CHALLENGES FACED WHILE EXTRACTING INFORMATION FROM STORED UNSTRUCTURED DATA
• Interpretation: Unstructured data is not easily interpreted by conventional search algorithms.
• Classification/Taxonomy: Different naming conventions followed across the firm make it difficult to classify the data.
• Indexing: Designing algorithms that understand the meaning of documents and then tag or index them accordingly is difficult.
• Deriving meaning: Computer programs cannot automatically derive meaning from unstructured data.
• File formats: The increasing number of file formats makes it difficult to interpret data.
• Tags: As the data grows, it is not feasible to tag it manually.
POSSIBLE SOLUTIONS TO THESE CHALLENGES
• Tags: Unstructured data can be stored in a virtual repository and tagged automatically. For example, Documentum provides this type of solution.
• Text mining: It helps group and classify unstructured data and assists analysis by considering grammar, context, synonyms, etc.
• Application platforms: Platforms such as XOLAP help extract information from email and XML-based documents.
• Classification/Taxonomy: Taxonomies within the firm can be managed automatically to organize data in hierarchical structures.
• Naming conventions/standards: Following naming conventions across a firm can greatly improve storage, retrieval, indexing, and search.
UIMA (Unstructured Information Management Architecture)
• UIMA is an open source platform, originally developed by IBM, that integrates different types of analysis engines to provide a complete solution for knowledge discovery from unstructured data.
• In UIMA, the analysis engines enable integration and analysis of unstructured information and bridge the gap between structured and unstructured data.
• It stores information in a structured format, which can then be mined, searched, and put to other uses. The information is analysed in the following ways:
  • Breaking up documents into separate words.
  • Grouping and classifying according to taxonomy.
  • Detecting parts of speech, grammar, and synonyms.
  • Detecting relationships between various elements.
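The first two analysis steps (splitting documents into words and grouping them by taxonomy) can be illustrated in plain Python. This sketch does not use the UIMA API; the TAXONOMY dictionary, its keywords, and the helper names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical taxonomy: category -> keywords that signal it.
TAXONOMY = {
    "finance": {"invoice", "payment", "budget"},
    "hr": {"leave", "payroll", "hiring"},
}


def tokenize(text):
    """Break a document into lowercase words."""
    return re.findall(r"[a-z]+", text.lower())


def classify(text):
    """Return the sorted taxonomy categories whose keywords occur in the text."""
    words = set(tokenize(text))
    return sorted(cat for cat, keywords in TAXONOMY.items() if words & keywords)


docs = [
    "Please approve the invoice before the payment date.",
    "Payroll will run after the leave requests are processed.",
    "Lunch at noon?",
]

# Group documents by category; documents with no match are kept separately.
grouped = defaultdict(list)
for doc in docs:
    for category in classify(doc) or ["unclassified"]:
        grouped[category].append(doc)

print(dict(grouped))
```

Detecting parts of speech, synonyms, and relationships between elements requires linguistic models and is beyond what this keyword-matching sketch attempts.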
Getting to know semi-structured data
• Only about 10% of the data in an organization is semi-structured.
• Still, it is important to understand, manage, and analyze this semi-structured data, which comes from heterogeneous sources.
• Semi-structured data does not conform to any data model. Also, this data cannot be stored in rows and columns as in a database.
• Semi-structured data has tags and markers that help group the data and describe how the data is stored, but they are not sufficient for the management and automation of data.
• Similar entities are grouped and organized in a hierarchy. The attributes or properties within a group may or may not be the same.
Semi-structured data:
• Does not conform to a data model but contains tags and elements
• Attributes in a group may not be the same
• Similar entities are grouped
• Not sufficient metadata
• Cannot be stored in rows and columns as in a database
• The tags and elements describe how the data is stored
• Email standard format:
  • To: <NAME>
  • From: <NAME>
  • Subject: <TEXT>
  • CC: <NAME>
  • Body: <TEXT, GRAPHICS, IMAGES, ETC>
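The email format above can be read with Python's standard email package. This is a minimal sketch; the addresses and message text are made up for illustration.

```python
from email import message_from_string

raw = (
    "To: asha@example.com\n"
    "From: ravi@example.com\n"
    "Subject: Quarterly numbers\n"
    "Cc: meena@example.com\n"
    "\n"
    "Hi Asha, the numbers look good. Charts attached.\n"
)
msg = message_from_string(raw)

# Header fields are tagged, so they can be read by name.
headers = {name: msg[name] for name in ("To", "From", "Subject", "Cc")}

# The body is free text with no internal structure.
body = msg.get_payload()

print(headers)
print(body)
```

The headers behave like labelled fields, while the body is unstructured, which is why email is a common example of semi-structured data.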
Where does semi-structured data come from?
• Email
• XML
• TCP/IP packets
• Zipped files
• Binary executables
• Mark-up languages
• Integration of data from heterogeneous sources
• The characteristics of semi-structured data are summarized below:
  • It is organized into semantic entities.
  • Similar entities are grouped together.
  • Entities in the same group may not have the same attributes.
  • The order of attributes is not necessarily important.
  • Not all attributes are always required.
  • The size of the same attribute may differ within a group.
  • The type of the same attribute may differ within a group.
(Semantic: relating to “meaning”, or arising from distinctions between the meanings of different words)
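These characteristics can be seen in a small JSON document. The sketch below uses Python's standard json module; the contact records are hypothetical.

```python
import json

raw = """
[
  {"name": "Asha", "email": "asha@example.com", "phone": "555-0101"},
  {"phone": 5550102, "name": "Ravi"},
  {"name": "Meena", "email": ["meena@example.com", "m@example.org"]}
]
"""
contacts = json.loads(raw)

# Entities in the same group do not share one attribute set.
attribute_sets = [sorted(c) for c in contacts]

# The same attribute can have different types across entities.
phone_types = sorted({type(c["phone"]).__name__ for c in contacts if "phone" in c})

# Optional attributes must be handled explicitly by whoever reads the data.
emails = [c.get("email") for c in contacts]

print(attribute_sets)
print(phone_types)
print(emails)
```

Every record is still recognisably a contact, yet attribute order, presence, size, and type all vary from one record to the next.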
[Figure: a user accesses multiple data sources through a mediator, which provides uniform access. Data sources shown: structured file, legacy system, OODBMS, RDBMS.]
How to manage semi-structured data?
• Schemas:
  • These can be used to describe the structure of the data. Schemas define the constraints on the structure and content of documents.
• Graph-based data models:
  • These can be used to describe data. This is a “schema-less” approach, also known as “self-describing”, because the data is presented in such a way that it explains itself.
• XML:
  • This is widely used to store and exchange semi-structured data. Schemas in XML are not tightly coupled to the data.
How to store semi-structured data?
Challenges faced:
• Storage cost
• RDBMS
• Irregular and partial structure
• Implicit structure
• Evolving schemas
• Distinction between schemas and data
• Possible solutions to the challenges faced in storing semi-structured data:
  • XML
  • RDBMS
  • Special-purpose DBMS
  • OEM (Object Exchange Model)
Modeling Semi-structured Data
• The OEM way:
  • The Object Exchange Model (OEM) is a model for storing and exchanging semi-structured data.
  • This brings us to the next question.
• Labeled directed graphs (from the Object Exchange Model):
  • In object exchange modeling, nodes are objects and the labels on the arcs are attribute names.
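A labeled directed graph of this kind can be sketched with plain Python dictionaries. The object ids ("&1", "&2", ...), the labels, and the helper follow are hypothetical; only the idea that nodes are objects and arc labels are attribute names comes from the slide.

```python
# Each object has an id. Complex objects hold labelled arcs to other objects;
# atomic objects hold a value.
oem = {
    "&1": {"arcs": [("member", "&2"), ("member", "&3")]},
    "&2": {"arcs": [("name", "&4"), ("email", "&5")]},
    "&3": {"arcs": [("name", "&6")]},
    "&4": {"value": "Asha"},
    "&5": {"value": "asha@example.com"},
    "&6": {"value": "Ravi"},
}


def follow(graph, start, path):
    """Follow the arc labels in `path` from `start`.

    Returns the values of the objects reached, or the object id
    for objects that are not atomic.
    """
    current = [start]
    for label in path:
        current = [
            target
            for obj_id in current
            for arc_label, target in graph[obj_id].get("arcs", [])
            if arc_label == label
        ]
    return [graph[obj_id].get("value", obj_id) for obj_id in current]


names = follow(oem, "&1", ["member", "name"])
emails = follow(oem, "&1", ["member", "email"])

print(names)
print(emails)
```

The second member has no email arc, and nothing in the graph needs to change to accommodate that, which is what makes the model self-describing.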
How to extract information from semi-structured data?
• Data coming from heterogeneous sources contains different structures, and it is difficult to tag and index.
• The challenges faced while extracting information from semi-structured data are listed here, followed by the possible solutions.
• Challenges faced:
  1. Flat files
  2. Heterogeneous sources
  3. Incomplete/irregular structure
Possible solutions:
• Indexing
• OEM (Object Exchange Model)
• XML
• Mining tools
XML: A solution for semi-structured data management
• XML stands for eXtensible Markup Language.
• What is XML? It is an open-standard markup language written in plain text. It is hardware and software independent.
• XML has emerged as a standard for exchanging data over the web.
• It enables separation of content and presentation.
• DTDs (Document Type Definitions) provide partial schemas for XML documents.
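As a minimal sketch of reading semi-structured XML, the snippet below uses Python's standard xml.etree.ElementTree module. The library document and its element names are hypothetical.

```python
import xml.etree.ElementTree as ET

xml_text = """
<library>
  <book id="b1">
    <title>Data Basics</title>
    <author>A. Rao</author>
    <year>2019</year>
  </book>
  <book id="b2">
    <title>Notes on XML</title>
  </book>
</library>
""".strip()

root = ET.fromstring(xml_text)

books = []
for book in root.findall("book"):
    books.append(
        {
            "id": book.get("id"),
            "title": book.findtext("title"),
            # findtext returns None when the element is absent.
            "year": book.findtext("year"),
        }
    )

print(books)
```

The second book omits the author and year elements, and the document is still well-formed, so the reader has to cope with missing elements.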
Semi-structured data                   XML
Consists of attributes                 Consists of tags
Consists of objects                    Consists of elements
Atomic values are the constituents     CDATA (character data) is used
Difference between semi-structured data and structured data
• Semi-structured data is the same as structured data with one minor exception.
• Semi-structured data requires looking at the data itself to determine its structure, whereas structured data only requires examining the data element name.
• Semi-structured data is one processing step away from structured data.
• When semi-structured data is stored in a structured format, it takes the form of rows and columns, each with a defined format.
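The "one processing step" can be sketched as follows: inspect the records to discover the columns, then write every record against that fixed set of columns. The records are hypothetical, and filling missing values with an empty string is an assumption made for this example.

```python
import csv
import io

records = [
    {"name": "Asha", "email": "asha@example.com", "phone": "555-0101"},
    {"name": "Ravi", "phone": "555-0102"},
    {"name": "Meena", "email": "meena@example.com"},
]

# Step 1: look at the data itself to discover the structure (union of attributes).
columns = sorted({key for record in records for key in record})

# Step 2: emit every record with the same columns, filling gaps with "".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
writer.writeheader()
writer.writerows(records)
table = buffer.getvalue()

print(table)
```

After this step each row has the same fields in the same order, so the result can be loaded into a relational table.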
THANK YOU….

Notes on Types of Digital Data in Data Analytics
