XML::XParent
Another way to store XML elements...

Marco Masetti (grubert) - masetti@linux.it, grubert65@gmail.com
Ways of storing XML files
• Plain files, simple scripts to perform XPath
  queries
 – Trivial to set up, but very limited scalability, search and element handling
• DBMS as BLOBs (text)
 – Limited search features, performance and scalability. No
   inherent element handling.
• DBMS with XML support
 – Document oriented. Not supported by all DBMSs, and the features
   provided differ from one to another.
• Native XML databases (Tamino, BaseX, eXist, ...)
 – OK... but then I would need something else to talk about...
• Custom DBMS schemas
 – Data oriented, element handling is trivial, scales very well
Custom DBMS schemas

• Structure mapping:
 – the design of the database schema is based on the
   understanding of XML Schema or DTDs

• Model mapping:
 – A fixed database schema for all XML documents,
   without the assistance of DTDs or XML Schemas
Structure-mapping schema: XML::RDB!
• Perl module that converts XML files into RDB schemas, and
  populates and unpopulates them. You end up with one table
  per XML element type.
• Pros:
  ● Does what it says
  ● Quite fast
  ● Works with XML Schemas too
  ● Could eventually treat value types properly
• Cons:
  ● Inherent hierarchical structure is lost
  ● Not good if the XML files belong to different schemas
  ● Does only what it says...
  ● Not very well maintained...
  ● SQL schemas can easily become unreadable...
Model-mapping schema: XParent !

• XParent is a very simple DBMS schema that can be
  used to store XML elements
• Does not require the XML schema (Schema-oblivious)
• Highly normalized
• Cons:
  ● Values are stored as text
XParent: how it works...
Sample XML document:

<?xml version="1.0" encoding="ISO-8859-1"?>
<Mpeg7 xmlns="http://www.mpeg7.org/2001/MPEG-7_Schema"
       xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance">
  <DescriptionUnit xsi:type="DescriptorCollectionType">
    <Descriptor size="5" xsi:type="DominantColorType">
      <ColorSpace type="HSV" colorReferenceFlag="false"/>
      <SpatialCoherency>0</SpatialCoherency>
      <Values>
        <Percentage>2</Percentage>
        <Index>10 6 0</Index>
      </Values>
      <Values>
        <Percentage>15</Percentage>
        <Index>6 16 9</Index>
      </Values>
      <Values>
        <Percentage>3</Percentage>
        <Index>7 18 4</Index>
      </Values>
    </Descriptor>
  </DescriptionUnit>
</Mpeg7>

Table LabelPath
 id | len | path
----+-----+------------------------------------------------------------------
  1 |   4 | /Mpeg7/DescriptionUnit/Descriptor/ColorSpace
  2 |   5 | /Mpeg7/DescriptionUnit/Descriptor/ColorSpace/@colorReferenceFlag
  3 |   5 | /Mpeg7/DescriptionUnit/Descriptor/ColorSpace/@type

Table Element
 did | pathid | ordinal
-----+--------+---------
   1 |      1 |       1
   2 |      2 |       1
   3 |      3 |       2

Table Data
 did | pathid | ordinal | value
-----+--------+---------+-------
   2 |      2 |       1 | false
   3 |      3 |       2 | HSV

Table DataPath
 pid | cid
-----+-----
   1 |   2
   1 |   3
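To make the role of the four tables concrete, here is a minimal sketch that recreates the rows shown on this slide in an in-memory SQLite database and reads back the attributes of the ColorSpace element. It is written in Python with the standard sqlite3 module, not taken from the module's Perl code. The column types and the helper name children_of are assumptions; only the table names, column names and row values come from the slide.

```python
import sqlite3

# Column types are assumptions; names and rows are copied from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE LabelPath (id INTEGER PRIMARY KEY, len INTEGER, path TEXT);
CREATE TABLE Element   (did INTEGER PRIMARY KEY, pathid INTEGER, ordinal INTEGER);
CREATE TABLE Data      (did INTEGER, pathid INTEGER, ordinal INTEGER, value TEXT);
CREATE TABLE DataPath  (pid INTEGER, cid INTEGER);
""")

base = "/Mpeg7/DescriptionUnit/Descriptor/ColorSpace"
conn.executemany("INSERT INTO LabelPath VALUES (?, ?, ?)", [
    (1, 4, base),
    (2, 5, base + "/@colorReferenceFlag"),
    (3, 5, base + "/@type"),
])
conn.executemany("INSERT INTO Element VALUES (?, ?, ?)",
                 [(1, 1, 1), (2, 2, 1), (3, 3, 2)])
conn.executemany("INSERT INTO Data VALUES (?, ?, ?, ?)",
                 [(2, 2, 1, "false"), (3, 3, 2, "HSV")])
conn.executemany("INSERT INTO DataPath VALUES (?, ?)", [(1, 2), (1, 3)])


def children_of(conn, parent_did):
    """Return (path, value) for every child of parent_did (hypothetical helper).

    DataPath links parent to child, Data holds the child's text value,
    LabelPath resolves the child's pathid into a readable path.
    """
    return conn.execute("""
        SELECT lp.path, d.value
        FROM DataPath dp
        JOIN Data d       ON d.did = dp.cid
        JOIN LabelPath lp ON lp.id = d.pathid
        WHERE dp.pid = ?
        ORDER BY d.ordinal
    """, (parent_did,)).fetchall()


colorspace_attrs = children_of(conn, 1)
for path, value in colorspace_attrs:
    print(path, "=", value)
```

The point of the sketch is that the same four tables can hold any document: a new element type only adds rows to LabelPath, never a new table.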
The XML::XParent module
• Perl module to handle XML documents in an XParent
  schema
• Can load any XML file into the same SQL schema
• Plugins can be registered for custom logic on elements
• Provides utilities to:
  ● Create the XParent schema for SQLite and PostgreSQL
  ● Parse and load an XML file (xparent-parse.pl)
  ● Query the XParent schema (xparent-search.pl)
• Classes:
  ● XML::XParent::Parser: XML parser based on XML::Twig
  ● XML::XParent::Parser::Plugin: base interface class to
    be implemented by any plugin
  ● XML::XParent::Schema: base class (interface) to the
    XParent schema
  ● XML::XParent::Elem: class that describes an XML
    element
XML::XParent::Schema drivers

• The XML::XParent::Schema class implements the
  Driver/Interface pattern, so custom drivers can
  be implemented for specific data stores
• Two generic drivers implemented so far:
  ● XML::XParent::Schema::DBIx: driver implementation based on
    DBIx::Class
    – All the advantages of an ORM (but who cares?)
    – Quite slow!
  ● XML::XParent::Schema::DBI: driver implementation
    based on DBI
    – Direct integration with the data store
    – Much faster...
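The Driver/Interface idea can be sketched as follows. This is an illustration in Python, not the module's Perl API: the class names SchemaDriver and InMemoryDriver, the methods store_element and count, and the make_driver factory are all hypothetical. The sketch only shows how calling code can depend on one interface while the storage backend is chosen by name, much as the --driver option selects DBIx or DBI.

```python
from abc import ABC, abstractmethod


class SchemaDriver(ABC):
    """Hypothetical storage interface; method names are illustrative only."""

    @abstractmethod
    def store_element(self, path, value):
        """Persist one element identified by its label path."""

    @abstractmethod
    def count(self):
        """Return how many elements have been stored."""


class InMemoryDriver(SchemaDriver):
    """Toy driver that keeps elements in a list instead of a database."""

    def __init__(self):
        self.rows = []

    def store_element(self, path, value):
        self.rows.append((path, value))

    def count(self):
        return len(self.rows)


# Registry mapping a driver name to its class (assumed naming scheme).
DRIVERS = {"memory": InMemoryDriver}


def make_driver(name):
    """Instantiate the driver registered under `name`."""
    try:
        return DRIVERS[name]()
    except KeyError:
        raise ValueError(f"unknown driver: {name}") from None


driver = make_driver("memory")
driver.store_element("/Mpeg7/DescriptionUnit/Descriptor/ColorSpace/@type", "HSV")
print(driver.count())
```

A driver for another data store would only need to subclass the interface and register itself; the parser code would not change.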
The quest for speed...

● Tests performed on my laptop:
    ● CPU0: Intel(R) Core(TM) i5 CPU M 540 @ 2.53GHz stepping 05
    ● CPU1: Intel(R) Core(TM) i5 CPU M 540 @ 2.53GHz stepping 05
● Reference XML file:
    ● Size: 45 MB
    ● XML elements: ~600,000
● Reference DBMS: PostgreSQL 8.4.13
● Parsing of the reference file with the DBIx driver:
    ● perl xparent-parse.pl -i <ref.xml> --driver DBIx
    ● Execution time: > 3000 mins!!!
● Parsing of the reference file with the DBI driver:
    ● perl xparent-parse.pl -i <ref.xml> --driver DBI
    ● Execution time: ~400 mins.
...But then...

  ● I realized loading times were diverging!
  ● I realized there was a stupid error in the implementation of
    the algorithm...
[Chart: execution time in minutes, log scale. Reference implementation: 3000 (DBIx), 400 (DBI). Algorithm patched: 177 (DBIx), 28 (DBI).]
...But then...

● I realized that records in the Data and DataPath tables are not
  referenced by any other table...
● They do not need to be inserted one at a time...
● => Bulk loading!!!
● ...given N elements, how many records do we have in the
  DataPath table?
Bulk Loading
• Saves a lot of time storing data:
  --- DBI: Bulk loading of 1000000 records ---
  All in once:     50.462398 wallclock seconds
  Chunks of 1000:  31.157044 wallclock seconds
  Chunks of 2000:  27.747248 wallclock seconds
  Chunks of 5000:  28.209256 wallclock seconds
  Chunks of 10000: 26.334099 wallclock seconds
• Distinct inserts of 1000000 records:
  Elapsed time: 250.563282 wallclock seconds

[Chart: execution time in minutes, log scale. Reference implementation: 3000 (DBIx), 400 (DBI). Algorithm patched: 177, 28. Bulk loading: 98, 16.]
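The chunked bulk loading measured above can be sketched like this. It is a Python/sqlite3 illustration under stated assumptions, not the module's Perl code and not the PostgreSQL mechanism used in the talk: the function names chunks and bulk_load, the default chunk size, and the synthetic (pid, cid) pairs are all made up for the example. It shows rows being buffered and written in batches, one transaction per batch, instead of one INSERT round trip per row.

```python
import sqlite3
from itertools import islice


def chunks(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def bulk_load(conn, rows, chunk_size=2000):
    """Insert (pid, cid) pairs into DataPath in chunks; return rows written."""
    total = 0
    for batch in chunks(rows, chunk_size):
        with conn:  # commits once per chunk, rolls back the chunk on error
            conn.executemany(
                "INSERT INTO DataPath (pid, cid) VALUES (?, ?)", batch)
        total += len(batch)
    return total


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DataPath (pid INTEGER, cid INTEGER)")

# Synthetic parent/child pairs, only to exercise the loader.
pairs = ((n // 3, n) for n in range(1, 10001))
loaded = bulk_load(conn, pairs)
row_count = conn.execute("SELECT COUNT(*) FROM DataPath").fetchone()[0]
print(loaded, row_count)
```

This works for Data and DataPath precisely because no other table references their rows, so nothing needs the generated keys back while loading.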
...But then...
    • For each element we have to check whether its path
      already exists...
    • Much better to cache paths in a hash than to go back
      and forth to the DB...
[Chart: execution time in minutes, log scale. Reference implementation: 3000 (DBIx), 400 (DBI). Algorithm patched: 177, 28. Bulk loading: 98, 16. Cached paths: 41, 12.]
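The path cache can be sketched as a small wrapper around a hash (a dict in Python). This is an illustration with sqlite3, not the module's Perl implementation: the class name PathCache, the method path_id and the db_lookups counter are hypothetical, and len is assumed to be the number of path steps, which matches the LabelPath rows shown earlier.

```python
import sqlite3


class PathCache:
    """Map label paths to LabelPath ids, asking the DB once per distinct path."""

    def __init__(self, conn):
        self.conn = conn
        self.ids = {}          # path -> id, the in-memory hash
        self.db_lookups = 0    # how many times we actually went to the DB

    def path_id(self, path):
        if path in self.ids:
            return self.ids[path]
        self.db_lookups += 1
        row = self.conn.execute(
            "SELECT id FROM LabelPath WHERE path = ?", (path,)).fetchone()
        if row is None:
            # Assumption: len counts the steps in the path.
            cur = self.conn.execute(
                "INSERT INTO LabelPath (len, path) VALUES (?, ?)",
                (path.count("/"), path))
            path_id = cur.lastrowid
        else:
            path_id = row[0]
        self.ids[path] = path_id
        return path_id


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE LabelPath (id INTEGER PRIMARY KEY, len INTEGER, path TEXT)")
cache = PathCache(conn)

paths = ["/a/b", "/a/b/c", "/a/b", "/a/b", "/a/b/c"]
ids = [cache.path_id(p) for p in paths]
print(ids, cache.db_lookups)
```

Since a document usually has far fewer distinct paths than elements, almost every lookup is served from memory.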
...But then...
• Added some indexes:
    CREATE INDEX LabelPath_Path ON LabelPath (Path);
    CREATE INDEX Element_PathID ON Element (PathID);
    CREATE INDEX DataPath_Cid ON DataPath (Cid);
    CREATE INDEX DataPath_Pid ON DataPath (Pid);
    CREATE INDEX Data_Did ON Data (Did);
[Chart: execution time in minutes, log scale. Reference implementation: 3000 (DBIx), 400 (DBI). Algorithm patched: 177, 28. Bulk loading: 98, 16. Cached paths: 41, 12. + Indexes: 29, 8.]
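One way to check that an index is actually used for the path lookup is to compare the query plan before and after creating it. The sketch below does this with SQLite's EXPLAIN QUERY PLAN from Python. It is an assumption-laden illustration: the talk used PostgreSQL, where EXPLAIN output looks different, and the helper name plan and the sample query are made up for the example. Only the CREATE INDEX statement comes from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE LabelPath (id INTEGER PRIMARY KEY, len INTEGER, path TEXT)")


def plan(conn, sql, params=()):
    """Return SQLite's query plan for `sql` as one string (hypothetical helper)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row is the human-readable detail.
    return " | ".join(row[-1] for row in rows)


query = "SELECT id FROM LabelPath WHERE path = ?"

before = plan(conn, query, ("/Mpeg7",))
conn.execute("CREATE INDEX LabelPath_Path ON LabelPath (Path)")
after = plan(conn, query, ("/Mpeg7",))

print("before:", before)
print("after: ", after)
```

Without the index the plan is a full table scan; with it, the plan names LabelPath_Path, so the lookup no longer grows with the size of the table.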
...But then...
• Realized I could “compact” records...
                 <?xml version="1.0" encoding="ISO-8859-1"?>
                 <Mpeg7 xmlns="http://www.mpeg7.org/2001/MPEG-7_Schema"
                        xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance">
                   <DescriptionUnit xsi:type="DescriptorCollectionType">
                     <Descriptor size="5" xsi:type="DominantColorType">
                       <ColorSpace type="HSV" colorReferenceFlag="false"/>
                       <SpatialCoherency>0</SpatialCoherency>
                       <Values>
                         <Percentage>2</Percentage>
                         <Index>10 6 0</Index>
                       </Values>
                       <Values>
                         <Percentage>15</Percentage>
                         <Index>6 16 9</Index>
                       </Values>
                       <Values>
                         <Percentage>3</Percentage>
                         <Index>7 18 4</Index>
                       </Values>
                     </Descriptor>
                   </DescriptionUnit>
                 </Mpeg7>



Saves another 20%-30%...
Needs some logic at query time (experimental)...
To cut a very long story short...
      Time (mins) to load ~600,000 XML elements

             | Reference | Algo patched | Bulk loading | Cached paths | Indexes | Compact
      -------+-----------+--------------+--------------+--------------+---------+---------
      DBIx   | > 3000    | 177          | 98           | 41           | 29      | 22
      DBI    | ~400      | 28           | 16           | 12           | 8       | 6

● ...and we still have to do:
    ● Code profiling...
    ● Specific DBMS techniques...
    ● Use MapReduce to split jobs among several
      workers...
About retrieval...

• At first I tried implementing an XPath-to-SQL
  translator
• Found it very, very hard...
• ...and almost useless
• ...use the power of SQL to express what you
  want!
• XML::XParent provides an API (get_elem) to
  query for a set of elements whose paths match
  a given SQL regex. The API returns a set of
  XML::XParent::Elem objects.
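The retrieval idea, selecting elements by matching their label path in SQL, can be sketched as below. This is not the get_elem API, whose signature is not shown in the slides: the function find_elements is hypothetical, it returns plain tuples rather than XML::XParent::Elem objects, and it uses SQL LIKE wildcards as a portable stand-in for whatever pattern syntax the module accepts. The sample rows are the ones from the "how it works" slide.

```python
import sqlite3


def find_elements(conn, path_pattern):
    """Return (did, path, value) for elements whose path matches a LIKE pattern.

    Hypothetical helper: '%' matches any run of characters, '_' a single one.
    Elements without a Data row come back with value None.
    """
    return conn.execute("""
        SELECT e.did, lp.path, d.value
        FROM Element e
        JOIN LabelPath lp ON lp.id = e.pathid
        LEFT JOIN Data d  ON d.did = e.did
        WHERE lp.path LIKE ?
        ORDER BY e.did
    """, (path_pattern,)).fetchall()


# Column types are assumptions; rows are copied from the earlier slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE LabelPath (id INTEGER PRIMARY KEY, len INTEGER, path TEXT);
CREATE TABLE Element   (did INTEGER PRIMARY KEY, pathid INTEGER, ordinal INTEGER);
CREATE TABLE Data      (did INTEGER, pathid INTEGER, ordinal INTEGER, value TEXT);
""")
base = "/Mpeg7/DescriptionUnit/Descriptor/ColorSpace"
conn.executemany("INSERT INTO LabelPath VALUES (?, ?, ?)", [
    (1, 4, base),
    (2, 5, base + "/@colorReferenceFlag"),
    (3, 5, base + "/@type"),
])
conn.executemany("INSERT INTO Element VALUES (?, ?, ?)",
                 [(1, 1, 1), (2, 2, 1), (3, 3, 2)])
conn.executemany("INSERT INTO Data VALUES (?, ?, ?, ?)",
                 [(2, 2, 1, "false"), (3, 3, 2, "HSV")])

attributes = find_elements(conn, "%/ColorSpace/@%")
element = find_elements(conn, "%/ColorSpace")
print(attributes)
print(element)
```

Because paths are stored as plain strings, "any ColorSpace attribute, at any depth" is a one-line pattern, which is the kind of query that made a full XPath translator feel unnecessary.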
XML::XParent utilities: how to use them
• Configure parameters in the xparent.yml file:
    ---
    schema_params:
        - 'dbi:Pg:dbname=xparent'
    #    - 'dbi:SQLite:xparent.db'
        - grubert
        - grubert
        -
            AutoCommit: 1
    #plugins:
    #    'SLMS::Redis::ParserPlugin':
    #        'tag': 'MovingRegion'
• To load an XML file:
    perl xparent-parse.pl
        -i <input file>
        --driver <the Schema driver to use>
        [--config_file <the config file>]
        [--verbose]
        [--clean]
        [--compact]
• To query the XParent data store:
    perl xparent-search.pl
        --path <path regex>
        --driver <the Schema driver to use>
        [--config_file <the config file>]
• To clean the data store:
    perl xparent-clean.pl
        --driver <the Schema driver to use>
        [--config_file <the config file>]
Contribute!

https://github.com/grubert65/XParent-Perl.git
Thank You !!!!

