Comparing ODF and OOXML
  • Treating the extensibility, modularization, expressivity, packaging, performance, reuse of standards, programmability, ease of use, application/OS neutrality of the formats, along with other whims of the author
  • Rob Weir, IBM
  • http://www.robweir.com/blog
  • OpenOffice.org Conference, Lugdunum, Gaul, Ides of September, 2006
The age of proprietary formats
  • Created by a single vendor
  • Controlled by a single vendor
  • Evolved by a single vendor
Rich Text Format
  • “The RTF standard provides a format for text and graphics interchange that can be used with different output devices, operating environments, and operating systems. RTF uses the ANSI, PC-8, Macintosh, or IBM PC character set to control the representation and formatting of a document, both on the screen and in print. With the RTF standard, you can transfer documents created under different operating systems and with different software applications among those operating systems and applications”
  • From the RTF 1.0 specification (1987)
We once had documentation
  • Microsoft Excel Developers Handbook, Microsoft Press, 1997
  • MSDN CDs had the Office binary file format documentation
  • The MSDN web site had the Office binary file format documentation
  • Visual C++ came with the Office binary file format documentation
  • But at some point the updates stopped coming, and the documentation was pulled. What happened?
The door is shut...
  • “...you may use documentation identified in the MSDN Library portion of the SOFTWARE PRODUCT as the file format specification for Microsoft Word, Microsoft Excel, Microsoft Access, and/or Microsoft PowerPoint ("File Format Documentation") solely in connection with your development of software product(s) that operate in conjunction with Windows or Windows NT that are not general purpose word processing, spreadsheet, or database management software products or an integrated work or product suite whose components include one or more general purpose word processing, spreadsheet, or database management software products.”
  • MSDN Licence, 1998
...and locked...
  • By 1999 the format documentation is no longer available for download.
  • The alternative is to license it from Microsoft under these terms:
  • “ISV License Program: This program entitles qualified software developers to license the Microsoft .doc, .xls, or .ppt file format documentation for use in the development of commercial software products and solutions that support the .doc, .xls, or .ppt file formats from Microsoft and to complement Microsoft Office.”
...and a guard posted at the door
  • Since Office 2003, Digital Rights Management is being pushed into Office.
  • The Digital Millennium Copyright Act and the EU Copyright Directive have provisions which make it illegal to circumvent DRM.
  • So although progress at interop to date has been heroic, proponents of closed formats have the technological and legal means to prevent document exchange if they wish. This is true in the XML world as well as in the binary world.
Standardization Process
  • ODF
      • Based on OO.org XML formats
      • 12 Dec 2002 – Submitted to OASIS
      • 1 May 2005 – OASIS ODF standard released
      • 16 Nov 2005 – Submitted to ISO/IEC JTC1 under Publicly Available Specification (PAS) rules
      • 3 May 2006 – ISO/IEC IS 26300 approved
      • 706 page specification in 867 days
  • OOXML
      • Based on Office 2003 XML formats
      • 15 Dec 2005 – Submitted to Ecma
      • est. by 31 Dec 2006 – Ecma standard approved at Ecma General Assembly
      • est. by 31 January 2007 – Submitted to ISO/IEC JTC1 under Fast Track rules
      • est. Q1/2008 – ISO/IEC JTC1 approval of OOXML??? Gartner forecasts low likelihood of approval *
      • 5,419 page specification (draft 1.4) in 254 days
  • * http://www.gartner.com/resources/140100/140101/iso_approval_of_oasis_opendo_140101.pdf
Know the SDOs
  • OASIS: DocBook, DITA, RELAX NG, ebXML, LegalXML. Emphasis on e-business standards
  • Ecma: C#, CLI, ECMAScript, Eiffel programming language, as well as various hardware and media standards
How open is open?
Reuse of standards
  • “If I have seen a little further it is by standing on the shoulders of Giants.” Isaac Newton, letter to Robert Hooke, 1676
  • Choose reuse because:
      • Reduced time to write specification
      • Higher quality specifications
      • Can leverage existing community expertise
      • Can leverage existing education materials
      • Better interop, especially in a world of promiscuous mashups, not monolithic silos
      • Network effects – synergy is good
Reuse: Head to Head
  • ODF reuses: Dublin Core, XSL-FO, SVG, MathML, XLink, SMIL, XForms
  • OOXML reuses: Dublin Core
Packaging
  • Both formats use a ZIP-format container file
  • Good balance of compression and runtime efficiency; allows easy access to subdocuments and works with existing tools
  • Notable that Microsoft did not go with their proprietary CAB format.
  • A loose end to clean up? The ZIP format is 18 years old, but was never formally submitted for standardization.
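The "works with existing tools" point can be illustrated with nothing but a standard library. This is a minimal sketch, not part of the original deck: the helper names `read_part` and `build_demo_package` are hypothetical, and the member name `content.xml` (the usual main part of an ODF package) is an assumption, since the slides name no package members.

```python
import io
import zipfile


def read_part(package, part_name):
    """Return the raw bytes of one member of a ZIP-based document package.

    `package` may be a file path or a seekable file-like object.
    """
    with zipfile.ZipFile(package) as zf:
        return zf.read(part_name)


def build_demo_package():
    """Build a tiny in-memory ZIP that stands in for a real document."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # "content.xml" is assumed here as a typical ODF member name.
        zf.writestr("content.xml", "<doc>hello</doc>")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(read_part(build_demo_package(), "content.xml"))
```

The same `read_part` call should work on a real .odt or .docx file path, because neither format needs anything beyond a generic ZIP reader to reach its subdocuments.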
Some comparative metrics
  • 176 Word documents from Ecma TC45's document library
  • Convert all to OOXML and ODF format
  • Record:
      • Number of pages
      • ZIP size
      • Number of contained files
      • Number of contained XML files
      • Total uncompressed size of contained files
      • Total uncompressed size of contained XML files
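Most of the metrics listed on this slide can be read straight from the ZIP directory. The sketch below is one possible way to collect them, not the author's actual script (the deck says only that charts and calculations were done in R). The function name `package_metrics` is hypothetical, and counting XML parts by the `.xml` extension is an assumption that will miss XML members stored under other extensions. Page counts are not available from the package and are left out.

```python
import io
import zipfile


def package_metrics(package):
    """Collect static size metrics for one ZIP-based document package."""
    with zipfile.ZipFile(package) as zf:
        # Skip directory entries so only real members are counted.
        infos = [info for info in zf.infolist() if not info.is_dir()]
    # Assumption: an XML part is any member whose name ends in ".xml".
    xml_infos = [i for i in infos if i.filename.lower().endswith(".xml")]
    return {
        "files": len(infos),
        "xml_files": len(xml_infos),
        "uncompressed_bytes": sum(i.file_size for i in infos),
        "uncompressed_xml_bytes": sum(i.file_size for i in xml_infos),
    }
```

For a file on disk, the ZIP size itself is simply `os.path.getsize(path)`.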
[Chart: page counts of the test documents] Mean = 34 pages, median = 8 pages
  • All charts and calculations done with the excellent open source “R” environment; http://www.r-project.org/
[Chart: ODF size / DOC size] Mean = 0.38
[Chart: OOXML size / DOC size] Mean = 0.50
Observed compression ratios
  • ODF size / DOC size = 0.38
  • OOXML size / DOC size = 0.50
  • Net: ODF documents were smaller, on average 72% of the size of the OOXML document
  • Double check with empty files:
      • ODF size = 6,888 bytes
      • OOXML size = 10,001 bytes
      • ODF/OOXML = 0.69
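The arithmetic on this slide is easy to re-check. The sketch below recomputes the empty-file ratio from the two byte counts, and also divides the two mean ratios. That quotient comes out at 0.76 rather than 0.72. One plausible reading, which is an assumption here and not something the slides state, is that the 72% figure averages the per-document ODF/OOXML ratios, and a mean of ratios need not equal the ratio of two means.

```python
# Byte counts for the empty documents, as given on the slide.
odf_empty = 6888
ooxml_empty = 10001
empty_ratio = odf_empty / ooxml_empty

# Mean size ratios relative to the original DOC files, as given on the slide.
mean_odf_over_doc = 0.38
mean_ooxml_over_doc = 0.50
ratio_of_means = mean_odf_over_doc / mean_ooxml_over_doc

print(f"empty-file ODF/OOXML ratio: {empty_ratio:.2f}")
print(f"ratio of the two means:     {ratio_of_means:.2f}")
```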
Platform/Application Neutrality
  • ODF's status is clear from the multiple implementations in the market today, from multiple vendors on multiple platforms
  • The situation with OOXML is not so clear
Things to look for in the spec
  • Windows, blob, Base64Binary, bit, Implementation-defined, Undefined, Legacy, Backwards compatibility, Reserved
3.2.1.68 – DEVMODE
  • This binary blob stores printer settings in a format that can only be understood on Windows.
7.4.1.5 – Clipboard formats
  • The OOXML specification does not say what any of these values mean, but merely restricts them to seemingly arbitrary numbers.
  • So, not only is the data stored in binary format, it is in an unspecified format identified merely by number.
3.1.29 – Sheet-level passwords
  • The problem is that a CRC is not defined unless you give the polynomial as well as the bit length.
  • We would also need to know exactly how Unicode characters are to be turned into 8-bit ones. Hex encode? Throw out the high bits? Two bytes for each character?
  • Insufficient information is disclosed to allow interop.
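Both objections on the sheet-level password slide can be shown in a few lines. The sketch below is purely illustrative and is not the algorithm used by Office or described in the OOXML draft. It implements a generic bitwise 16-bit CRC and lets the caller pick the polynomial. The two polynomials (0x1021 and 0x8005), the three encodings, and the sample password are all assumptions chosen for the demonstration.

```python
def crc16(data: bytes, poly: int, init: int = 0x0000) -> int:
    """Bitwise, MSB-first CRC-16 with no bit reflection and no final XOR."""
    crc = init
    for byte in data:
        crc ^= byte << 8  # bring the next byte into the top of the register
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


if __name__ == "__main__":
    password = "pässword"  # hypothetical password with one non-ASCII character
    for encoding in ("latin-1", "utf-8", "utf-16-le"):
        data = password.encode(encoding)
        for poly in (0x1021, 0x8005):
            print(f"{encoding:10} poly={poly:#06x} crc={crc16(data, poly):#06x}")
```

The same password yields a different byte string under each encoding, and the same byte string yields a different checksum under each polynomial, so a specification that says only "16-bit CRC" leaves an implementer guessing.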
2.7.2.17 – Locale Signature
  • C:
        typedef struct tagLOCALESIGNATURE {
            DWORD lsUsb[4];
            DWORD lsCsbDefault[2];
            DWORD lsCsbSupported[2];
        } LOCALESIGNATURE, *PLOCALESIGNATURE;
  • XML: [markup from the original slide not preserved in this transcript]
  • Can you tell the difference?
2.7.2.17 – Locale Signature
  • Bitmasks in XML?!
Conformance according to ODF
  • Documents may contain elements in foreign namespaces
  • Document consumers must be able to read any document which would have been valid if all foreign markup were removed.
  • Document consumers may preserve such foreign markup
Implied validation pipeline
  1. Load XML
  2. Remove elements and attributes that are not in an ODF namespace
  3. Validate the resulting document according to the ODF schema
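The first two steps of this pipeline fit in a short function. The sketch below is a simplification under stated assumptions: `strip_foreign`, `namespace_of`, and `ALLOWED` are hypothetical names, and the allow-list holds only two ODF namespace URIs, whereas a real consumer would list every namespace the ODF schema uses. Removing an element with ElementTree also discards any text that follows it inside its parent. Step 3 (schema validation) is left out because the Python standard library has no RELAX NG validator.

```python
import xml.etree.ElementTree as ET

# Assumption: a deliberately short allow-list for illustration only.
ALLOWED = {
    "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}


def namespace_of(name):
    """Extract the namespace URI from an ElementTree '{uri}local' name."""
    return name[1:].split("}", 1)[0] if name.startswith("{") else ""


def strip_foreign(elem, allowed=ALLOWED):
    """Remove, in place, attributes and child elements in foreign namespaces."""
    for attr in list(elem.attrib):
        if namespace_of(attr) not in allowed:
            del elem.attrib[attr]
    for child in list(elem):
        if namespace_of(child.tag) not in allowed:
            elem.remove(child)  # drops the whole foreign subtree
        else:
            strip_foreign(child, allowed)
```

After `strip_foreign(root)` the remaining tree is what would be handed to the schema validator.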
OOXML's approach
  • Part 5 of the recently posted 1.4 draft of OOXML
  • Much more complex: 37 pages describing a sophisticated validation pipeline
  • A new XML markup language for Markup Compatibility
Markup Compatibility
  • Ignorable, MustUnderstand: list of namespaces which either can be safely ignored, or must not be ignored
  • PreserveElements, PreserveAttributes: list of elements or namespaces which should be preserved in editing, even if the namespace is ignorable
  • ProcessContent: content is processed even if the element is ignored
AlternateContent/Choice/Fallback
  • Similar to a “switch/case/default” construct
  • See example compliance.xml
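The switch/case/default analogy can be made concrete with a small selector. This is a sketch under assumptions, not the processing model from the draft. The namespace URI is the one used by the later published standard and may differ from draft 1.4. The helper name `choose_branch` is hypothetical. The `Requires` attribute is compared against a caller-supplied set of understood prefixes, because ElementTree does not keep prefix-to-namespace mappings after parsing.

```python
import xml.etree.ElementTree as ET

# Assumption: Markup Compatibility namespace as later published.
MC_NS = "http://schemas.openxmlformats.org/markup-compatibility/2006"


def choose_branch(alternate, understood_prefixes):
    """Return the child elements of the first usable Choice, else of Fallback.

    Each Choice acts like a `case` guarded by its Requires list; Fallback
    plays the role of `default`.
    """
    for choice in alternate.findall(f"{{{MC_NS}}}Choice"):
        required = choice.get("Requires", "").split()
        if required and all(p in understood_prefixes for p in required):
            return list(choice)
    fallback = alternate.find(f"{{{MC_NS}}}Fallback")
    return list(fallback) if fallback is not None else []
```

A consumer that understands none of the required namespaces ends up with only the Fallback content, which is the low-fidelity path that the speaker's later commentary worries about.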
Implied validation pipeline
  1. Verify validity of Markup Compatibility markup
  2. Process the MustUnderstand declarations and generate errors if needed
  3. Remove Choice/Fallback markup for cases that were not used
  4. Namespace subsumption: remove markup in obsolete namespaces and replace with new namespaces
  5. Process the Ignorable declarations, removing the ones you don't understand
  6. Process the ProcessContent content
My take on it
  • Intriguing idea, but not fully baked (still draft).
  • Gives a lot of (too much?) flexibility in negotiating fidelity of representation based on capabilities of the consumer.
  • Danger: it essentially lets a producer rewrite the standard and the schema outside of a standards setting.
  • Could enable Office to maintain a two-tier file format, with the high-fidelity version not documented, and the low-fidelity version available only in <Fallback>. Remember RTF?
Performance
  • How to measure the performance of a format versus the performance of an application?
  • Some factors:
      • Number of XML files in the Zip which must be parsed
      • Size of the XML files
      • Preprocessing required to resolve Ignorable, MustUnderstand, etc.
  • Can't give an absolute answer, since not all consumers of the document are attempting the same thing.
Licensing Problem
  • The only implementation of OOXML is the Office 2007 beta, and the End User License Agreement (EULA) has this restriction:
  • “7. SCOPE OF LICENSE. ...You may not disclose the results of any benchmark tests of the software to any third party without Microsoft’s prior written approval”
Solution: test the XML
  • Take 176 working documents from Ecma TC45's document library, in Word format
  • Convert to ODF and OOXML formats
  • Collect static and runtime metrics for these document pairs, including:
      • Size, compressed and uncompressed
      • Size of just the XML
      • Number of pages
      • Number of files in the Zip
      • Time to parse the XML: this is the core of any tool which will consume these documents, so large differences here will directly map into large differences in applications
Number of files in the Zip
  • OOXML files = 5.7 + ODF files
  • R^2 = 0.9958
Total size of the XML
  • OOXML size = 82,000 bytes + 1.5 * ODF size
  • R^2 = 0.92
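The relationships on these slides have the form y = intercept + slope * x with an R^2 goodness-of-fit figure. The deck credits R for the calculations. The sketch below is a plain-Python stand-in showing how such a fit and its R^2 are obtained by ordinary least squares. The function name `linear_fit` is hypothetical, and the per-document measurements behind the slides are not reproduced here.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return intercept, slope, 1.0 - ss_res / ss_tot
```

Feeding it the ODF values as `xs` and the matching OOXML values as `ys` would yield coefficients comparable to those quoted, provided the same documents are used.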
Net effect on parse time
  • OOXML time = 3.5 * ODF time
  • R^2 = 0.9596
  • Time is the time to parse all XML files in the Zip archive with Python's minidom
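The measurement described on this slide can be approximated as follows. This is a sketch of the idea and not the author's benchmark script. The function name `time_xml_parse` is hypothetical, XML members are again identified by the `.xml` extension (an assumption), and the archive is read before the clock starts so that only parsing is timed.

```python
import io
import time
import zipfile
from xml.dom import minidom


def time_xml_parse(package):
    """Parse every .xml member of a ZIP package with minidom.

    Returns (number_of_xml_files, elapsed_seconds).
    """
    with zipfile.ZipFile(package) as zf:
        # Decompress first so ZIP overhead stays outside the timed region.
        payloads = [
            zf.read(name)
            for name in zf.namelist()
            if name.lower().endswith(".xml")
        ]
    start = time.perf_counter()
    for payload in payloads:
        minidom.parseString(payload).unlink()  # free the DOM after parsing
    return len(payloads), time.perf_counter() - start
```

Running it on the ODF and OOXML conversions of the same source document gives one (ODF time, OOXML time) pair. A single run is noisy, so repeating the measurement and keeping the minimum is a common precaution.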
Bimodal behavior?
Performance Conclusions
  • Choice of an XML parser is key
      • ODF documents have larger, but fewer, XML files
      • OOXML documents have many small XML files
      • Many (most?) parsers are not well optimized for the second case.
  • Be a wise consumer of benchmark data
      • Beware of tests which confuse application performance with file format performance
      • Look to see if the documents used in the test are typical.
      • Consider the performance of all types of applications, not just heavy-weight desktop editors
The End
  • Thank you

A Technical Comparison: ISO/IEC 26300 vs Microsoft Office Open XML


Editor's Notes

  • #2: I don't want to give the impression that one standard is evil while the other is destined for sainthood. Neither format was created by idiots. Both sides know what they are doing, and for the most part they are accomplishing what they set out to do. The question to ask yourself is: what are the goals of the format, stated and unstated, and do those goals align with yours?