HadoopXML: A Suite for Parallel Processing of Massive XML Data with Multiple Twig Pattern Queries

Hyebong Choi1, Kyong-Ha Lee1, Soo-Hyong Kim1, Yoon-Joon Lee1 and Bongki Moon2
1Computer Science Department, KAIST, Korea
2Computer Science Department, University of Arizona, USA
hbchoi@dbserver.kaist.ac.kr, bart7449@gmail.com, kimsh@dbserver.kaist.ac.kr, yoonjoon.lee@kaist.ac.kr, bkmoon@cs.arizona.edu




Motivation

Big data in XML
▶ More than 100GB of protein sequences and their functional information are provided in XML format and are updated every four weeks [2]
▶ Conventional XML tools such as single-site XML DBMSes and XML pub/sub systems failed to process XML data of that size

XML DB       eXist [9]                                BaseX [10]
Data size    Loading time   Query processing          Loading time   Query processing
                            w/ 4000 twig queries                     w/ 4000 twig queries
1GB          5m 54s         failed                    2m 1s          2h 48m 7s
10GB         1h 5m 21s      failed                    19m 36s        30h 11m 34s
100GB        failed         -                         failed         -

YFilter [5]
Data size    Filtering time   Postprocessing time (twig pattern join)
1MB          2m 4s            0.264s
10MB         20s 14s          16s
100MB        3h 22m 6s        1h 1m 37s
1GB          failed           -

System Architecture

[Figure: overall flow] A large XML file and XPath queries -> Pre-processing -> XML blocks and Query index -> 1st M/R job -> Path Solutions -> 2nd M/R job -> Final Answers

[Figure: Preprocessing]
▶ A large XML file -> Partitioning & Labeling -> XML blocks and Label blocks -> Copy to HDFS, with Block collocation (XML block1 with Label block1, XML block2 with Label block2, ..., XML blockn with Label blockn)
▶ XPath queries -> Query Decomposition -> Path patterns and the Relationship between paths and twigs -> Query index builder -> Query index

[Figure: Path query processing (1st M/R job)]
▶ Inputs: XML blocks and Label blocks on HDFS; the Query index through the Distributed cache
▶ Mappers: Path filtering, producing Path solutions as <Path ID, label>
▶ Reducers: Counting solutions, producing Path solutions as <Path ID, a list of labels> and Size information for path solutions

[Figure: Twig pattern join (2nd M/R job)]
▶ Multi query optimizer: uses the Size information for path solutions and the Relationship between path patterns & twig patterns; connected to the job through the Distributed cache
▶ Mappers: read Path solutions and perform Tagging Reducer ID, followed by Shuffle by ReducerId
▶ Reducers: Holistic twig join, producing Final answers

Performance

Experimental environment
Hadoop 0.21.0 [1], CentOS 6.2, 1Gb switching hub
1 master: AMD Athlon II x4 620 (4-cores), 8GB memory, 7200 RPM HDD
8 slaves: i5-2500k processor (4-cores), 8GB memory, 7200 RPM HDD

XML dataset statistics
Filename           UniRef100   UniParc   UniProtKB   XMark1000
File size (MB)     24,500      37,436    105,745     114,414
# of elements      335M        360M      2,110M      1,670M
# of attributes    589M        1,215M    2,783M      383M
Depth in avg.      4.5649      3.7753    4.3326      4.7375
Max depth          6           5         7           12
# distinct paths   30          24        149         548

[Figure: Loading time]
[Figure: Overall execution time (panels: Synthetic dataset; Real-world dataset)]
[Figures: Effect of converting paths to distinct paths; Effect of block collocation; Effect of multi query optimization]

HadoopXML
▶ It efficiently processes many twig pattern queries for a massive volume of XML data in parallel

Working Example

Example.xml
<region>
  <Africa>
    <item id="item0">
      <quantity>1</quantity>
      <payment>Creditcard</payment>
    </item>
    <item id="item1">
      <quantity>1</quantity>
      <payment>Money order</payment>
    </item>
  </Africa>
  <Asia>
    <item id="item135">

Preprocessing (partitioning & labeling)

block_1, Label_1 (path offset: /)
Element                          Label
<region>                         1, 24, 1
<Africa>                         2, 15, 2
<item id="item0">                3, 8, 3
<quantity>1</quantity>           4, 5, 4
<payment>Creditcard</payment>    6, 7, 4

block_2, Label_2 (path offset: /region/Africa)
Element                          Label
<item id="item1">                9, 14, 3
<quantity>1</quantity>           10, 11, 4
<payment>Money order</payment>   12, 13, 4
<Asia>                           16, 23, 2

1st M/R (Path filtering)
Path query ID   Path solution
1.1             3, 8, 3
                9, 14, 3
1.2             4, 5, 4
                10, 11, 4
1.3             6, 7, 4
                12, 13, 4
2               17, 22, 3

2nd M/R (Twig pattern join)
Twig query ID   Path solution
1               6, 7, 4
                12, 13, 4
2               17, 22, 3
                                                                                                                     <quantity>2</quantity>
                                                                                                                                                                                                                        block_3          Label_3
 ‐ Block partitioning with no loss of structural information                                                         <payment>Personal Check</payment>
                                                                                                                                                                                                 <quantity>2</quantity>          /region/Asia
                                                                                                                                                                                                 <payment>Personal Check</payment>
                                                                                                                                                                                                                                 17, 22, 3
                                                                                                                  </item>                                                                      </item>                                                                                Path query        Count
                                                                                                                                                                                                                                 18, 19, 4
                                                                                                                </Asia>                                                                      </Asia>                                                                                      ID
 ‐ Path filtering with NFA‐style query indexes [5]                                                            </region>                                                                    </region>
                                                                                                                                                                                                                                 20, 21, 4
                                                                                                                                                                                                                                                                                               1.1               2           Multi query
                                                                                                                                                                                                  Path query ID               Path query                                                       1.2               2            optimizer
 ‐ I/O optimal Holistic twig pattern joins [3]                                                          Twig query ID
                                                                                                             1
                                                                                                                                            Twig query
                                                                                                                        /region/Africa/item[quantity]/payment
                                                                                                                                                                        Query decomposition            1.1         /region/Africa/item                                                         1.3               2
                                                                                                                                                                                                       1.2         /region/Africa/item/quantity          A  query index                         2                1
                                                                                                             2          //Asia/item
                                                                                                                                                                           & Converting to 
▶ Simultaneous processing of multiple twig pattern 
                                                                                                                                                                                                       1.3         /region/Africa/item/payment
                                                                                                             …           .  . . .  .
                                                                                                                                                                          root‐to‐leaf paths            2          /region/Asia/item



  queries                                                                                                                                                                                                                                                                  Load Balancing & 
 ‐ Many twig pattern joins are distributed across nodes and 
                                                                                                                                                  Path Filtering
                                                                                                                                                                                                                                                                        Multi Query Optimization                                                                                                                     References
                                                                                                             <item id=“item1”>
                                                                                                                   <quantity>1</quantity>
                                                                                                                                           block_2                                 /region/Africa      Label_2                         ▶ Twig pattern join, a specialized multi‐way join that reads multiple                                                                         [1] Hadoop. http://hadoop.apache.org, Apache Software Foundation.
    executed in parallel                                                                                           <payment>Money order</payment>                                  9, 14, 3
                                                                                                                                                                                                                                         path solutions                                                                                                                              [2] A. Bairoch et al. The universal protein resource (uniprot). Nucleic acids 
                                                                                                                 </item>                                                           10, 11, 4
                                                                                                                                                                                   12, 13, 4                                              ‐ With static one‐to‐one shuffling scheme, i.e. given partitioned path solutions, reducers                                                    research, 33(suppl 1):D154–D159, 2005.
▶ Optimization of the I/O cost in MapReduce jobs                                                               </Africa>
                                                                                                               <Asia>                                                              16, 23, 2                                               generate incomplete join results                                                                                                          [3] N. Bruno et al. Holistic twig joins: optimal xml pattern matching. In 
                                                                                                                                                                                                                                                                                                     Reducer1                                Missing results!
                                                                                                                                                                                                                                                                                                      Q1: A1 join B1 join C1                 A1 join B2 join C2                         Proceedings of ACM SIGMOD, pages 310–321. ACM, 2002.
 ‐ Sharing input scans and intermediate path solutions                                                                                           startElement(“region”)                                                                                 A1                                            Q2: A1 join C1 join D1                 A2 join B1 join C2
                                                                                                                                                 startElement(“Africa”)                                                                                                                                                                                                              [4] J. Dean et al. Mapreduce: Simplified data processing on large clusters. 
                                                                                                                                                 & SAX events from block_2                                                                                         B1                                 Q3: A1 join B1 join D1
 ‐ Converting redundant path patterns  with {//, *} to a few                                                                                                                                                                                            A2
                                                                                                                                                                                                                                                                                                                                                        …
                                                                                                                                                                                                                                                                                                                                                                                        Communications of the ACM, 51(1):107–113, 2008.
                                                                                                                                                                                                                                                                             C1
                                                                                                                    NFA style                                                                                                            Path solutions  A         B2                 D1             Reducer2                                       Input queries                    [5] Y. Diao et al. Path sharing and predicate evaluation for high‐performance xml 
    distinct root‐to‐leaf paths                                                                                    Query index                                region              1st Mapper                                                                       B         C2                       Q1: A2 join B2 join C2                      Q1: A join B join C                   filtering. ACM Transactions on Database Systems, 28(4):467–516, 2003. 
                                                                                                                                                         &1                                                                                                                           D2              Q2: A2 join C2 join D2                      Q2: A join C join D
                                                                                                                                            Africa                                                                                                                           C
                                                                                                                                                                                                                                                                                                      Q3: A2 join B2 join D2                      Q3: A join B join D                [6] K. Lee et al. Parallel data processing with MapReduce: a survey. ACM 
 ‐ Collocation of XML blocks and corresponding label blocks                                                                                                        Asia
                                                                                                                                                                                                                                                                                      D                                                                                                 SIGMOD Record, 40(4):11–20, 2011.
                                                                                                                                                &2               &3
                                                                                                                                        item                                                                                           ▶ Runtime one‐to‐many data shuffling                                                                                                          [7] Q. Li et al. Indexing and querying xml data for regular path expressions. In 
▶ Runtime load balancing & multi query optimization                                                                         1.1
                                                                                                                                                                       item
                                                                                                                                                                                                                                          ‐ It distributes both queries and data at runtime                                                                                             Proceedings of VLDB, pages 361–370, 2001.
                                                                                                                                          &4                     &5
                                                                                                                     quantity                   payment                                                                                   ‐ Path solutions can be redundantly copied to reducers, involving redundant I/Os
                                                                                                                                                                 2                                                                                                                                                                                                                   [8] T. Nykiel et al. MRshare: Sharing across multiple queries in MapReduce. 
 ‐ XML twig queries may share path patterns each other                                                                                                                              Runtime stack                                         ‐ a straggling reduce task dominates the overall performance of M/R jobs
                                                                                                                                                                                                                                                                                                                                                                                        Proceedings of the VLDB Endowment, 3(1‐2):494–505, 2010
                                                                                                                                                                                                                                          ‐ Optimization problem : find  the optimal way that distributes queries and path solutions across 
                                                                                                                                 &6             &7                                                                                         reducers  so that every reducer is assigned even workload
 ‐ For I/O reduction and workload balance, twig pattern                                                                                                                                                                                                                                                                                                                              [9] W. Meier. eXist: An open source native XML database. Web, Web‐Services, 
                                                                                                                                1.2            1.3                                                                                                                                          Reducer1
                                                                                                                                                                                                                                                         30                                 Q1: A join B join C                                                                         and Database Systems 2002, LNCS 2593, Springer, Berlin (2002), pp. 169–183
    queries that share path patterns are grouped together                                                                                 Path solution                                                                                  Path solutions A     80                         cost = |A|+|B|+|C| = 200        Input queries                                               [10] C. Grün et al. BaseX ‐ Processing and Visualizing XML with a native XML 
                                                                                                                                          For block_2                                                                                                                                                                                               Q1: A join B join C                Database, http://www.basex.org/, 2010.
                                                                                                                                                                                                                                                                        90                               Reducer2
                                                                                                                                                 Path query ID Path solution                                                                                                                                                                        Q2: A join C join D
 ‐ The twig query groups are assigned to reducers at                                                                                                             1.1           9, 14, 3
                                                                                                                                                                                                                                                               B                                         Q2: A join C join D
                                                                                                                                                                                                                                                                                                                                                    Q3: A join B join D
                                                                                                                                                                                                                                                                                                         Q3: A join B Join D
                                                                                                                                                                 1.2          10, 11, 4                                                                                 C
    runtime such that every reducer has the same overall                                                                                                                                                                                                                          5                   cost = |A|+|C|+|D| +
                                                                                                                                                                 1.3          12, 13, 4                                                                                           D                         |A|+|B|+|D| = 240 

    cost of join operations
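
The working example's labels, path solutions and twig answers can be reproduced with a short sketch. It is an illustration only, not HadoopXML's code: it runs on a single machine, a plain string comparison stands in for the NFA‐style query index, and a naive nested‐loop containment check stands in for the holistic twig join of [3]. The names RegionLabeler, filter_paths and join_twig_1 are hypothetical, and the (start, end, level) numbering (one counter tick per start or end tag) is an assumption inferred from the label values in the example.

```python
import xml.sax

EXAMPLE_XML = """<region>
  <Africa>
    <item id="item0"><quantity>1</quantity><payment>Creditcard</payment></item>
    <item id="item1"><quantity>1</quantity><payment>Money order</payment></item>
  </Africa>
  <Asia>
    <item id="item135"><quantity>2</quantity><payment>Personal Check</payment></item>
  </Asia>
</region>"""

# Root-to-leaf path queries from the working example, keyed by path query ID.
PATH_QUERIES = {
    "1.1": "/region/Africa/item",
    "1.2": "/region/Africa/item/quantity",
    "1.3": "/region/Africa/item/payment",
    "2": "/region/Asia/item",
}


class RegionLabeler(xml.sax.ContentHandler):
    """Assigns (start, end, level) labels: one counter tick per start or end tag."""

    def __init__(self):
        super().__init__()
        self.counter = 0
        self.stack = []   # currently open elements as (name, start)
        self.labels = []  # (root-to-element path, (start, end, level))

    def startElement(self, name, attrs):
        self.counter += 1
        self.stack.append((name, self.counter))

    def endElement(self, name):
        self.counter += 1
        path = "/" + "/".join(n for n, _ in self.stack)
        level = len(self.stack)  # depth of the element being closed
        _, start = self.stack.pop()
        self.labels.append((path, (start, self.counter, level)))


def label_document(text):
    handler = RegionLabeler()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return sorted(handler.labels, key=lambda entry: entry[1][0])


def filter_paths(labels, queries):
    """Assumption: exact path comparison replaces the NFA-style query index."""
    solutions = {qid: [] for qid in queries}
    for path, label in labels:
        for qid, query in queries.items():
            if path == query:
                solutions[qid].append(label)
    return solutions


def is_parent(parent, child):
    """Region containment plus a level difference of one."""
    return parent[0] < child[0] and child[1] < parent[1] and child[2] == parent[2] + 1


def join_twig_1(solutions):
    """Naive join for /region/Africa/item[quantity]/payment (not a holistic twig join)."""
    answers = []
    for item in solutions["1.1"]:
        if any(is_parent(item, q) for q in solutions["1.2"]):
            answers.extend(p for p in solutions["1.3"] if is_parent(item, p))
    return answers


labels = label_document(EXAMPLE_XML)
solutions = filter_paths(labels, PATH_QUERIES)
twig_1_answers = join_twig_1(solutions)
print(solutions)
print(twig_1_answers)
```

Under these assumptions the sketch yields the same path solutions as the example (for instance 3, 8, 3 and 9, 14, 3 for path query 1.1) and the two payment labels 6, 7, 4 and 12, 13, 4 as answers to twig query 1.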
                                                                                                                                                      This work was partly supported by NRF grant funded by the Korea government (MEST) (no. 2011‐0016282)  
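
The load balancing figures can be checked with a small sketch. It is a toy illustration under stated assumptions, not the poster's optimizer: the cost of a reducer is taken as the sum of the path solution sizes read by every join assigned to it (matching the poster's arithmetic, where |A| is counted once per query), "even workload" is approximated by minimizing the most loaded reducer, and the search is exhaustive, which only works for tiny inputs. The function names are hypothetical.

```python
from itertools import product

# Path solution sizes and input queries from the poster's cost figure.
SIZES = {"A": 30, "B": 80, "C": 90, "D": 5}
QUERIES = {"Q1": ("A", "B", "C"), "Q2": ("A", "C", "D"), "Q3": ("A", "B", "D")}


def missing_under_static_shuffle(query, partitions=(1, 2)):
    """Partition combinations a one-to-one shuffle never joins.

    With static shuffling, reducer p only sees partition p of every path
    solution, so only the combinations (p, p, ..., p) are produced.
    """
    all_combos = set(product(partitions, repeat=len(query)))
    covered = {(p,) * len(query) for p in partitions}
    return sorted(all_combos - covered)


def reducer_cost(assigned):
    """Sum of input sizes over all joins assigned to one reducer."""
    return sum(SIZES[name] for q in assigned for name in QUERIES[q])


def best_assignment(num_reducers=2):
    """Exhaustive search for the assignment with the smallest maximum cost."""
    names = sorted(QUERIES)
    best = None
    for choice in product(range(num_reducers), repeat=len(names)):
        groups = [[q for q, r in zip(names, choice) if r == i]
                  for i in range(num_reducers)]
        costs = [reducer_cost(g) for g in groups]
        if best is None or max(costs) < max(best[1]):
            best = (groups, costs)
    return best


missing_q1 = missing_under_static_shuffle(QUERIES["Q1"])
groups, costs = best_assignment()
print(missing_q1)
print(groups, costs)
```

For Q1, the combinations (1, 2, 2) and (2, 1, 2) in the first result correspond to the missing joins A1 join B2 join C2 and A2 join B1 join C2. The search returns Q1 on one reducer and Q2, Q3 on the other, with costs 200 and 240, which is the assignment shown in the poster's figure.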

Q1: A1 join B1 join C1 A1 join B2 join C2 Proceedings of ACM SIGMOD, pages 310–321. ACM, 2002. ‐ Sharing input scans and intermediate path solutions startElement(“region”) A1 Q2: A1 join C1 join D1 A2 join B1 join C2 startElement(“Africa”) [4] J. Dean et al. Mapreduce: Simplified data processing on large clusters.  & SAX events from block_2 B1 Q3: A1 join B1 join D1 ‐ Converting redundant path patterns  with {//, *} to a few  A2 … Communications of the ACM, 51(1):107–113, 2008. C1 NFA style Path solutions  A B2 D1 Reducer2 Input queries [5] Y. Diao et al. Path sharing and predicate evaluation for high‐performance xml  distinct root‐to‐leaf paths Query index region 1st Mapper B C2 Q1: A2 join B2 join C2 Q1: A join B join C filtering. ACM Transactions on Database Systems, 28(4):467–516, 2003.  &1 D2 Q2: A2 join C2 join D2 Q2: A join C join D Africa C Q3: A2 join B2 join D2 Q3: A join B join D [6] K. Lee et al. Parallel data processing with MapReduce: a survey. ACM  ‐ Collocation of XML blocks and corresponding label blocks Asia D SIGMOD Record, 40(4):11–20, 2011. &2 &3 item ▶ Runtime one‐to‐many data shuffling [7] Q. Li et al. Indexing and querying xml data for regular path expressions. In  ▶ Runtime load balancing & multi query optimization 1.1 item ‐ It distributes both queries and data at runtime Proceedings of VLDB, pages 361–370, 2001. &4 &5 quantity payment ‐ Path solutions can be redundantly copied to reducers, involving redundant I/Os 2 [8] T. Nykiel et al. MRshare: Sharing across multiple queries in MapReduce.  ‐ XML twig queries may share path patterns each other Runtime stack ‐ a straggling reduce task dominates the overall performance of M/R jobs Proceedings of the VLDB Endowment, 3(1‐2):494–505, 2010 ‐ Optimization problem : find  the optimal way that distributes queries and path solutions across  &6 &7 reducers  so that every reducer is assigned even workload ‐ For I/O reduction and workload balance, twig pattern  [9] W. Meier. 
eXist: An open source native XML database. Web, Web‐Services,  1.2 1.3 Reducer1 30 Q1: A join B join C and Database Systems 2002, LNCS 2593, Springer, Berlin (2002), pp. 169–183 queries that share path patterns are grouped together Path solution  Path solutions A 80 cost = |A|+|B|+|C| = 200 Input queries [10] C. Grün et al. BaseX ‐ Processing and Visualizing XML with a native XML  For block_2 Q1: A join B join C Database, http://www.basex.org/, 2010. 90 Reducer2 Path query ID Path solution Q2: A join C join D ‐ The twig query groups are assigned to reducers at  1.1 9, 14, 3 B Q2: A join C join D Q3: A join B join D Q3: A join B Join D 1.2 10, 11, 4  C runtime such that every reducer has the same overall  5 cost = |A|+|C|+|D| + 1.3 12, 13, 4 D |A|+|B|+|D| = 240  cost of join operations This work was partly supported by NRF grant funded by the Korea government (MEST) (no. 2011‐0016282)
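The (start, end, level) labels and the path solutions of the Working Example can be reproduced with a short Python sketch. It is only an illustration: it assumes labels are assigned by counting start and end tags in document order, it parses the whole file at once instead of partitioned blocks, and it matches exact root-to-leaf paths by string comparison rather than with the NFA-style query index. The names `region_labels`, `path_solutions` and `PATH_QUERIES` are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

EXAMPLE_XML = """<region>
  <Africa>
    <item id="item0"><quantity>1</quantity><payment>Creditcard</payment></item>
    <item id="item1"><quantity>1</quantity><payment>Money order</payment></item>
  </Africa>
  <Asia>
    <item id="item135"><quantity>2</quantity><payment>Personal Check</payment></item>
  </Asia>
</region>"""

# Root-to-leaf path queries from the Working Example (IDs as on the poster).
PATH_QUERIES = {
    "1.1": "/region/Africa/item",
    "1.2": "/region/Africa/item/quantity",
    "1.3": "/region/Africa/item/payment",
    "2": "/region/Asia/item",
}


def region_labels(xml_text):
    """Return (path, start, end, level) for every element, in document order."""
    counter = 0        # incremented once per start tag and once per end tag
    open_stack = []    # indexes into `labels` of elements that are still open
    labels = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        counter += 1
        if event == "start":
            parent_path = labels[open_stack[-1]][0] if open_stack else ""
            open_stack.append(len(labels))
            # level is the nesting depth; the root element has level 1
            labels.append([parent_path + "/" + elem.tag, counter, None, len(open_stack)])
        else:
            labels[open_stack.pop()][2] = counter
    return [tuple(label) for label in labels]


def path_solutions(labels, path_queries):
    """Map each path query ID to the labels of the elements on that exact path."""
    return {
        query_id: [(start, end, level)
                   for path, start, end, level in labels if path == query_path]
        for query_id, query_path in path_queries.items()
    }


if __name__ == "__main__":
    labels = region_labels(EXAMPLE_XML)
    for query_id, solution in path_solutions(labels, PATH_QUERIES).items():
        print(query_id, solution)
```

An element a is an ancestor of an element b exactly when a.start < b.start and b.end < a.end, which is the property a twig join can test on the labels alone, without going back to the XML text.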
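The reducer costs in the load balancing example (200 for Reducer1 and 240 for Reducer2) can be checked with the following Python sketch. The cost function follows the formula on the poster: a reducer's cost is the sum, over its assigned queries, of the sizes of the path solutions each query reads. The sizes A = 30, B = 80, C = 90, D = 5 are read from the poster figure. The `greedy_assign` function is an assumption for illustration only (a largest-first greedy heuristic); the poster does not state which algorithm the multi query optimizer actually uses.

```python
# Path solution sizes and input queries, as read from the poster example.
SIZES = {"A": 30, "B": 80, "C": 90, "D": 5}
QUERIES = {
    "Q1": ("A", "B", "C"),
    "Q2": ("A", "C", "D"),
    "Q3": ("A", "B", "D"),
}


def query_cost(paths, sizes):
    """Cost of one twig join: total size of the path solutions it reads."""
    return sum(sizes[p] for p in paths)


def reducer_cost(query_ids, queries, sizes):
    """Cost of a reducer: sum of the costs of the queries assigned to it."""
    return sum(query_cost(queries[q], sizes) for q in query_ids)


def greedy_assign(queries, sizes, num_reducers):
    """Hypothetical heuristic: give the costliest remaining query to the
    currently least-loaded reducer. Not the poster's optimizer."""
    assignment = [[] for _ in range(num_reducers)]
    loads = [0] * num_reducers
    by_cost = sorted(queries, key=lambda q: query_cost(queries[q], sizes), reverse=True)
    for query_id in by_cost:
        target = loads.index(min(loads))
        assignment[target].append(query_id)
        loads[target] += query_cost(queries[query_id], sizes)
    return assignment, loads


if __name__ == "__main__":
    print(reducer_cost(["Q1"], QUERIES, SIZES))
    print(reducer_cost(["Q2", "Q3"], QUERIES, SIZES))
    print(greedy_assign(QUERIES, SIZES, 2))
```

With two reducers, this heuristic happens to produce the same grouping as the poster example (Q1 alone, Q2 and Q3 together). Because Q2 and Q3 both read A and D, placing them on the same reducer is also where shared scans of path solutions could reduce the real I/O below the nominal cost of 240.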