e-Services to Keep Your
Digital Files Current


Presented by: Peter Bajcsy
-Research Scientist at NCSA
-Associate Director of I-CHASS, I3 Institute
-Adjunct Assistant Professor, CS & ECE
UIUC

National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
Acknowledgement

   • This research was partially supported by a National
     Archives and Records Administration (NARA)
     supplement to NSF PACI cooperative agreement CA
     #SCI-9619019 and NCSA Industrial Partners.
   • The views and conclusions contained in this document
     are those of the authors and should not be interpreted as
     representing the official policies, either expressed or
     implied, of the National Archives and Records
     Administration, or the U.S. government.
   • Contributions by: Peter Bajcsy, Kenton McHenry, Rob
     Kooper, Michal Ondrejcek, Jason Kastner, William
     McFadden, Sang-Chul Lee, Luigi Marini


Imaginations unbound
Outline

• Introduction
• Technologies
   • File format conversion software
     registry
   • Automated file format conversions
   • Conversion quality assessment
• Summary
• Future Work
Introduction
Supporting NARA’s Strategic Plan
• According to The Strategic Plan of The
  National Archives and Records
  Administration 2006–2016. “Preserving the
  Past to Protect the Future”
  • “Strategic Goal: We will preserve and
    process records to ensure access by the
    public as soon as legally possible”
     • “Part D. We will improve the efficiency
       with which we manage our holdings
       from the time they are scheduled
       through accessioning, processing,
       storage, preservation, and public
       use.”
To Preserve or Not To Preserve?
[Figure: information transfer (?) from AGENCY to ARCHIVES; digital representation of information & knowledge; preservation]

Do We Know the Answers?
• (1) What is the granularity of information that one
  should preserve about a decision process in order to
  reconstruct it?
   • Example: the granularity of information collected
     from a decision process based on visual inspection
     of images has implications on storage and
     computational requirements/costs -
     ImageProvenance2Learn (IP2Learn)
Do We Know the Answers?
• (2) Given thousands of DVDs with files, which
  files are related?
   • Example: given files that contain 2D scans of
     blueprints and 3D CAD models, find the
     content-based file correspondence - File2Learn
     prototype system
[Figure: Relationship Discovery, panels showing 30 files and 784 files]
Do We Know the Answers?
• (3) Given hundreds of versions of the ‘same’ file,
  which file version(s) are similar and which one(s)
  should be preserved?
   • Example: given a collection of Adobe PDF
     documents, compare all pairs of Adobe PDF
     documents containing text, images, vector
     graphics,… and order them chronologically or
     based on similarities - Doc2Learn prototype
Do We Know the Answers?
• (4) Given thousands of file formats, which
  conversion software to use and which
  target file format to use so that the
  content of those thousands of files would
  be viewable in the long run?
   • Focus of today’s talk is on examples
     of technologies that would provide
     answers to (4) at large processing
     scale with computational scalability.
Goal
• Observation: File format conversions are
  inevitably one part of our daily life
• Question: Can file format conversions assist in
  making digital content created today to be
  accessible and viewable throughout its
  lifecycle?
• Consideration: we do not know what file
  formats will be around 100+ years down the
  road
• Goal: to make files backward and forward
  compatible
Background on File Format Conversions
• A very large number of file formats in which digital content is
  stored.
• An increasing number of complex file formats containing
  multiple types of digital content (e.g., Adobe PDF, HDF) or
  having very elaborate specifications (e.g., STEP).
• Many software implementations of import (read) and export
  (write) operations.
• A wide spectrum of quality of software implementations
  when reading and storing content in various file formats.
• Ephemeral support for many file formats and software
  implementations
• Hardware dependency of many software implementations
Illustration of 3D File Format Reality
[Figure: 3D applications and their file formats: *.ma, *.mb, *.mp; *.k3d; *.pdf (*.prc, *.u3d); *.w3d; *.lwo; *.c4d; *.dwg; *.blend; *.iam; *.max, *.3ds]
Challenges and Objective
• Challenges:
   • The quality of file format conversions is unknown when
     using a particular software to do the conversion
   • The volume of file format conversions requires significant
     computational resources
   • Understanding information loss due to file format
     conversions is application dependent
   • Estimating information loss is complicated due to the
     complexity of file formats
   • The file format, software and hardware dependencies are
     often unknown
• Objective: Design and prototype services using a
  computational cloud to support forward-looking decisions
Parameters of File Format Conversions

• File format: Content representation depends on a
  file format
• Software: Retrieval and storage of content in a file
  format depends on the quality of software
  implementation
• Hardware: Software execution depends on access
  to storage media, operating system, and hardware
  platform
• Criteria defining information loss: Information
  loss due to file format conversions is defined by
  application specific criteria
Three Example Services of Interest

• (a) Find file format conversion software
  to convert from any file format to any
  other file format
• (b) Execute file format conversions with
  any available third party software
• (c) Evaluate information loss due to file
  format conversion over a set of files in
  multiple complex file formats
Technologies
Overview
#1: Conversion Software Registry (CSR)

• Problem: Find file format conversion
  software to convert from any file format to
  any other file format
• Technology: Conversion Software Registry
  (CSR) at
  https://isda.ncsa.uiuc.edu/NARA/CSR/
• Features: Support for searching, editing and
  adding information about file format
  conversion software, open access and
  login-based modification
Movie of CSR
Comparison of CSR with Other Systems
• File Format Registries
   • PRONOM developed by the National Archives of the United
     Kingdom
   • Unified Digital Formats Registry (UDFR – before GDFR)
• Software Registries/Catalogues
   • Community specific
      • The Geotechnical and Geoenvironmental Software Directory
        (GGSD)
      • The Natural Language Software Registry (NLSR)
   • Business oriented
      • The Bit9 Global Software Registry (whitelisting software)
      • Cnet (available software with links to feature descriptions)
• File Format Conversion Registries
   • The Planets test bed (password protected, 18 software packages)
Novelty of Conversion Software Registry
• Existing file format registries focus on file format
  specifications
• Catalogues of software focus on software of interest
  to a specific community and include information
  about top level description, vendors and price but
  not capabilities to import and export file formats
• A file format conversion registry like Planets.org
  supports 16 software packages, only single-hop
  conversion paths and couples software to the registry.
• Novelty: CSR currently provides answers about multi-hop
  conversion paths across 70+ software packages
[Figure: Two-hop conversion path]
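
A multi-hop lookup of this kind can be sketched as a breadth-first search over registry entries. This is only an illustration: the entries, tool names and the function `find_conversion_path` are hypothetical and do not reflect CSR's actual data model or API.

```python
from collections import deque

# Hypothetical registry entries: (software, input format, output format).
# Illustrative only; not taken from the actual CSR database.
REGISTRY = [
    ("ToolA", "wrl", "stp"),
    ("ToolA", "stp", "stl"),
    ("ToolB", "stl", "ply"),
    ("ToolB", "wrl", "obj"),
]


def find_conversion_path(registry, source, target):
    """Return the shortest list of (software, src, dst) hops, or None."""
    # Index the available conversions by their input format.
    edges = {}
    for software, src, dst in registry:
        edges.setdefault(src, []).append((software, dst))

    # Breadth-first search guarantees the fewest hops.
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        fmt, path = queue.popleft()
        if fmt == target:
            return path
        for software, dst in edges.get(fmt, []):
            if dst not in visited:
                visited.add(dst)
                queue.append((dst, path + [(software, fmt, dst)]))
    return None


if __name__ == "__main__":
    for hop in find_conversion_path(REGISTRY, "wrl", "stl"):
        print(hop)
```

Breadth-first search is chosen here because fewer hops usually means fewer chances to lose information; a real registry might instead weight hops by measured conversion quality.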
#2: File Format Conversion Engine
• Problem: Execute file format conversions
  with any available third party software
• Technology: Polyglot version 1, operating
  on NCSA hardware resources,
  downloadable for private deployment
• Features: web-based access to a
  computational cloud consisting of
  commodity hardware and installations of
  third party software with import/export
  capabilities
Movie of Polyglot
Polyglot Design
[Figure: Polyglot design. Labels: Extensibility, Automation, Computational Scalability, Cloud Computing, Services to Archivists]
Comparison of File Format Conversion
  Systems
• Some existing file format conversion services
   • http://www.ps2pdf.com
      • Supports only certain conversion types
   • http://www.zamzar.com
      • Supports conversion of document, image,
        music, video and a couple of CAD formats
   • http://media-convert.com
      • Supports about 20 multi-media formats
• Drawbacks: The existing systems are not
  extensible (limited by specific libraries), cannot be
  downloaded for private use (files with sensitive info),
  computational scalability is unknown
Format Conversion Extensibility Via
 Software Reuse
• Observation: Nobody has the resources to load every
  possible file format
   • Fully supporting the many available formats is an
     enormous undertaking
   • If a file format is closed/proprietary it may be difficult to
     retrieve the data directly from the file
   • Vendor file formats sometimes store application-feature-
     specific pieces of information that are not supported in
     other formats
   • Most software supports importing/exporting of a subset of
     application domain specific file formats.
• Conclusion: Software reuse and extensibility are the key
  characteristics of file format conversion systems
File Format Conversion Extensibility
• Extensibility in Polyglot: Software is reused by wrapping
  3rd party software while utilizing whatever access the
  software vendors make available to embedded
  functionality
   • published Application Programming Interface (API),
      command line and Graphical User Interfaces (GUI)
• Novelty: Polyglot provides a single user interface that
  allows the user to execute multiple conversion
  software applications automatically, and over distributed
  computers that have a license for the software needed to
  do the conversion and/or have the computing resources
  necessary for the size of the job (computational scalability).
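
One way to picture the wrapping of command-line software is a small adapter that hides each tool behind the same `convert()` call. The sketch below is an assumption-laden illustration, not Polyglot's code: the class `CommandLineConverter`, its parameters, and the stand-in "tool" (the Python interpreter copying a file) are all hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


class CommandLineConverter:
    """Hypothetical adapter: wraps a command-line tool behind convert()."""

    def __init__(self, name, command_template, inputs, outputs):
        self.name = name
        # List of arguments; the items "{src}" and "{dst}" are placeholders.
        self.command_template = command_template
        self.inputs = set(inputs)
        self.outputs = set(outputs)

    def supports(self, src_format, dst_format):
        """True if the wrapped tool can read src_format and write dst_format."""
        return src_format in self.inputs and dst_format in self.outputs

    def convert(self, src_path, dst_path):
        """Run the wrapped tool; raises CalledProcessError on failure."""
        substitutions = {"{src}": str(src_path), "{dst}": str(dst_path)}
        command = [substitutions.get(part, part) for part in self.command_template]
        subprocess.run(command, check=True)
        return dst_path


def demo():
    # Stand-in for a third-party tool: the Python interpreter copying a file.
    copier = CommandLineConverter(
        name="copy-tool",
        command_template=[
            sys.executable, "-c",
            "import shutil, sys; shutil.copyfile(sys.argv[1], sys.argv[2])",
            "{src}", "{dst}",
        ],
        inputs={"txt"},
        outputs={"bak"},
    )
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "a.txt"
        dst = Path(tmp) / "a.bak"
        src.write_text("hello")
        copier.convert(src, dst)
        return dst.read_text()


if __name__ == "__main__":
    print(demo())
```

Tools that expose only an API or a GUI would need different adapters, but could offer the same `supports()` and `convert()` methods so that a scheduler can treat all of them alike.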
#3: File Comparison Engines
• Problem: Compare two files and evaluate
 information loss due to file format conversion over a
 set of files in multiple complex file formats
• Technologies:
  • Initial prototypes: ModelBrowser (four 3D
    comparison metrics); Doc2Learn (one metric
    across multiple digital objects); Doc2LearnHadoop
    (computational scalability using Hadoop)
  • Work-in-progress: A general API for content-based
    comparison of any two files - Versus
3D Comparison Example (ModelBrowser)


• Software: Adobe 3D Reviewer
• Original File: WRL
• Converted Files: STP, STL, IGS, U3D
• Comparison Method: Light Fields [Chen, 2003] compares
  silhouettes from various viewing angles around the objects
[Figure: renderings of heart.wrl, heart.stl and heart.stp]
Conclusion: Information loss(WRL→STP) = Information loss(WRL→STL)
Multiple Object Comparisons (Doc2Learn)




Adobe PDF documents ~ {text, images, vector graphics, ….}
Multiple Method Comparisons (Versus)
•   Software: MS Paint
•   Original File: TIF
•   Converted Files: PNG, GIF, JPG, BMP
•   Comparison Method: Pixel by pixel difference (sum of
    Euclidean distances over all pixels)



[Figure: screenshot labeled "User Inputs"]

Conclusion 1: Information loss(TIF→BMP or TIF→PNG) = 0
Conclusion 2: Information loss(TIF→GIF) > Information loss(TIF→JPG)
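
The pixel-by-pixel metric named on this slide can be written down in a few lines. The sketch below assumes images are already decoded into rows of (R, G, B) tuples with identical dimensions; the function name `pixel_difference` and the sample values are illustrative and are not part of Versus.

```python
import math


def pixel_difference(image_a, image_b):
    """Sum of Euclidean distances between corresponding pixels.

    Each image is a list of rows; each pixel is an (R, G, B) tuple.
    """
    same_shape = len(image_a) == len(image_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(image_a, image_b)
    )
    if not same_shape:
        raise ValueError("images must have the same dimensions")
    return sum(
        math.dist(pixel_a, pixel_b)
        for row_a, row_b in zip(image_a, image_b)
        for pixel_a, pixel_b in zip(row_a, row_b)
    )


if __name__ == "__main__":
    original = [[(0, 0, 0), (255, 255, 255)]]
    lossless = [[(0, 0, 0), (255, 255, 255)]]   # e.g. a lossless round trip
    lossy = [[(0, 0, 0), (252, 251, 255)]]      # one pixel slightly altered
    print(pixel_difference(original, lossless))  # 0 means no measured loss
    print(pixel_difference(original, lossy))
```

A score of zero corresponds to Conclusion 1 (no information loss under this metric); larger scores mean the converted image differs more from the original.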
Information Loss Evaluation
Setup:
• Inputs: a set of files, a set of software packages,
  criteria for defining information loss
• Wanted output: information loss ‘score’ per file
  format conversion
Approach:
• Phase I: Find all round-trip conversion paths from a
  given file format to the same file format
• Phase II: Execute all conversions to obtain
  converted files.
• Phase III: Compare the original and converted files
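
The three phases can be sketched end to end on toy data. Everything below is a stand-in chosen for illustration: the conversion list, the `max_hops` bound, and the toy `convert`/`compare` functions are assumptions, whereas the real services would call actual software and comparison metrics.

```python
# Hypothetical conversions: (software, input format, output format).
CONVERSIONS = [
    ("ToolA", "stp", "stl"),
    ("ToolA", "stl", "stp"),
    ("ToolB", "stl", "ply"),
    ("ToolB", "ply", "stp"),
]


def round_trip_paths(conversions, fmt, max_hops=4):
    """Phase I: paths fmt -> ... -> fmt with no repeated intermediate format."""
    paths = []

    def extend(current, path, seen):
        for software, src, dst in conversions:
            if src != current:
                continue
            hop = (software, src, dst)
            if dst == fmt:
                paths.append(path + [hop])          # round trip closed
            elif dst not in seen and len(path) + 1 < max_hops:
                extend(dst, path + [hop], seen | {dst})

    extend(fmt, [], {fmt})
    return paths


def evaluate(paths, original, convert, compare):
    """Phases II and III: execute each path, then score it against the original."""
    scores = []
    for path in paths:
        content = original
        for hop in path:
            content = convert(hop, content)                 # Phase II
        scores.append((path, compare(original, content)))   # Phase III
    return scores


def toy_convert(hop, content):
    # Toy stand-in: writing "ply" rounds coordinates (lossy); others are lossless.
    _, _, dst = hop
    return [round(x) for x in content] if dst == "ply" else list(content)


def toy_compare(a, b):
    # Toy information-loss score: sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    paths = round_trip_paths(CONVERSIONS, "stp")
    for path, score in evaluate(paths, [1.0, 2.25, 3.25], toy_convert, toy_compare):
        print(" -> ".join([path[0][1]] + [hop[2] for hop in path]), score)
```

Round trips are used because the original and the final file share a format, so they can be compared directly with a single metric.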
Information Loss Evaluation: Computational
    Requirements
•   Files: one file in STP file format
•   Software: Adobe 3D Reviewer, Cyberware PlyTool
•   Comparison Method: Light Fields [Chen, 2003]
•   Number of paths: 10 (28 individual conversions)




[Figure: panels for Phase I: Find, Phase II: Execute, Phase III: Compare]
Summary
Information Technology Lessons
• Better understanding of preservation and reconstruction of
  electronic records in terms of file format conversions
   • The data model needed for documenting existing file
     format conversion software
   • A framework (test bed) for software reuse and
     extensibility to provide file format conversion services
   • The complexity of performing content-based file
     comparison and measurements of information loss due
     to file format conversions
   • The computational cost of file format conversions, file
     comparisons and information loss evaluations
   • The computational scalability of file format conversions
     and file comparisons using parallel processing paradigms
The Value for Archivists
• Prototype services are freely available to the digital preservation
  community and provide decision support tools
   • to select an ‘optimal’ file format to be preserved
   • to evaluate file format conversion software
   • to select the minimum-cost path for a chosen file format
     conversion
• The framework for conversion software documentation,
  software reuse and functionality extensibility has a major
  impact on
   • Efficiency with which we manage our holdings
   • Understanding of the information loss introduced due to
     conversions
   • The cost of updating file format conversion services
Development Plans
• Prototype services are open to the public at
   • https://isda.ncsa.uiuc.edu/NARA/CSR/
   • http://teeve3.ncsa.uiuc.edu/polyglot/convert.php
• Software is open source and
  downloadable from
  http://isda.ncsa.uiuc.edu/download/
• We have been building a second generation of
  these file format conversion services
• Feedback is very welcome
• Questions: Peter Bajcsy,
  pbajcsy@ncsa.uiuc.edu

e-Services to Keep Your Digital Files Current

  • 1. e-Services to Keep Your Digital Fil C Di it l Files Current t Presented by: Peter Bajcsy -Research Scientist at NCSA -Associate Director of I-CHASS, I3 , Institute -Adjunct Assistant Professor, CS & ECE UIUC National Center for Supercomputing Applications University of Illinois at Urbana-Champaign
  • 2. Acknowledgement • This research was partially supported by a National Archives and Records Administration (NARA) ( ) supplement to NSF PACI cooperative agreement CA #SCI-9619019 and NCSA Industrial Partners. • The views and conclusions contained in this doc ment ie s concl sions document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the National Archives and Records Administration, or the U.S. government. • Contributions by: Peter Bajcsy Kenton McHenry Rob Bajcsy, McHenry, Kooper, Michal Ondrejcek, Jason Kastner, William McFadden, Sang-Chul Lee, Luigi Marini Imaginations unbound
  • 3. Outline • Introduction • Technologies • File format conversion software registry • Automated file format conversions • Conversion quality assessment • Summary • Future Work
  • 5. Supporting NARA’s Strategic Plan • According to The Strategic Plan of The National Archives and Records Administration 2006–2016. “Preserving the Past to Protect the Future” • “Strategic Goal: We will preserve and process records to ensure access by the public as soon as legally possible” possible • “Part D. We will improve the efficiency with which we manage our holdings from the time they are scheduled through accessioning, processing, storage, preservation storage preservation, and public use.”
  • 6. To Preserve or Not To Preserve? Digital representation of information Preservation & knowledge Information transfer ? AGENCY ARCHIVES Imaginations unbound
  • 7. Do We Know the Answers? • (1) What is the granularity of information that one should preserve about a decision process in order to reconstruct it? • Example: the granularity of information collected from a decision process based on visual inspection of images has implications on storage and computational requirements/costs comp tational req irements/costs – ImageProvenance2Learn (IP2Learn)
  • 8. Do We Know the Answers? • (2) Given thousands of DVDs with files, which files are related? • Example: given files that contain 2D scans of blue prints and 3D CAD models, find the p , content-based file correspondence - File2Learn prototype system Relationship Discovery 30 files 784 files
  • 9. Do We Know the Answers? • (3) Given hundreds of versions of the ‘same’ file, which file version(s) are similar and which one(s) should be preserved? h ld b d? • Example: given a collection of Adobe PDF documents, documents compare all pairs of Adobe PDF documents containing text, images, vector graphics,… and order them chronologically or based on similarities - Doc2Learn prototype
  • 10. Do We Know the Answers? • (4) Given thousands of file formats, which conversion software to use and which target file format to use so that the content of those thousands of files would be viewable in a long run? • Focus of today s talk is on examples today’s of technologies that would provide answers to (4) at large processing scale with computational scalability.
  • 11. Goal • Ob Observation: Fil f ti File format conversions are t i inevitably one part of our daily life • Question: Can file format conversions assist in making digital content created today to be accessible and viewable throughout its lifecycle? • Consideration: we do not know what file formats will be around 100+ years down the y road • Goal: to make files backward and forward compatible
  • 12. Background on File Format Conversions • A very large number of file formats in which digital content is stored. • A i An increasing number of complex fil f i b f l file formats containing t t i i multiple types of digital content (e.g., Adobe PDF, HDF) or having very elaborate specifications (e.g., STEP). • Many software implementations of import (read) and export (write) operations. • A wide spectrum of quality of software i l id t f lit f ft implementations t ti when reading and storing content in various file formats. • Ephemeral support for many file formats and software implementations • Hardware dependency of many software implementations
  • 13. Illustration of 3D File Format Reality *.ma, * b * * *.mb, *.mp *.k3d k3d *.pdf (*.prc, *.u3d) *.w3d *.lwo *.c4d *.dwg *.blend *.iam *.max, *.3ds
  • 14. Challenges and Objective • Challenges: • The quality of file format conversions is unknown when using a particular software to do the conversion • The volume of file format conversions requires significant computational resources • Understanding information loss due to file format conversions is application dependent • Estimating information loss is complicated due to the complexity of file formats • Th file f The fil format, software and hardware d t ft dh d dependencies are d i often unknown • Objective: Design and prototype services using a j g p yp g computational cloud to support forward-looking decisions
  • 15. Parameters of File Format Conversions • File format: Content representation depends on a file format • Software: Retrieval and storage of content in a file format depends on the quality of software implementation • Hardware: Software execution depends o access a d a e So t a e e ecut o depe ds on to storage media, operating system, and hardware platform • Criteria defining information loss: Information loss due to file format conversions is defined by application specific criteria
  • 16. Three Example Services of Interest • (a) Find file format conversion software to convert from any file format to any other file format • (b) Execute file format conversions with any available thi d party software il bl third t ft • (c) Evaluate information loss due to file ( ) format conversion over a set of files in multiple complex file formats
  • 19. #1: Conversion Software Registry (CSR) • Problem: Find file format conversion software to convert from any file format to any other file format • Technology: Conversion Software Registry (CSR) at https://isda.ncsa.uiuc.edu/NARA/CSR/ https://isda ncsa uiuc edu/NARA/CSR/ • Features: Support for searching, editing and adding i f ddi information about fil f ti b t file format t conversion software, open access and login- based modification b d difi ti
  • 21. Comparison of CSR with Other Systems • File Format Registries • PRONOM developed by the National Archives of the United Kingdom g • Unified Digital Formats Registry (UDFR – before GDFR) • Software Registries/Catalogues • C Community specific it ifi • The Geotechnical and Geoenvironmental Software Directory (GGSD) • The Natural Language Software Registry (NLSR) • Business oriented • The Bit9 Global Software Registry ( g y (whitelisting software) g ) • Cnet (available software with links to feature descriptions) • File Format Conversion Registries • Th Planets test bed (password protected, 18 software packages) The Pl t t tb d( d t t d ft k )
  • 22. Novelty of Conversion Software Registry • Existing file format registries focus on file format specifications • Catalogues of software focus on software of interest to a specific community and include information about t level d b t top l l description, vendors and price b t i ti d d i but not capabilities to import and export file formats • A file f fil format conversion registry lik Pl t i i t like Planets.org t supports 16 software packages, only single-hop conversion paths and couples software to the reg reg. • Novelty: CSR provides answers about multi-hop conversion paths from about 70+ software 70 packages currently Two-hop conversion path
  • 23. #2: File Format Conversion Engine • Problem: Execute file format conversions with any available third party software • Technology: Polyglot version 1, operating on NCSA hardware resources resources, downloadable for private deployment • F t Features: web-based access t a bb d to computational cloud consisting of commodity h d dit hardware and i t ll ti d installations of f third party software with import/export capabilities biliti
  • 25. Polyglot Design EXTENSIBILITY AUTOMATION Cloud Computing COMPUTATIONAL SCALABILITY Services to Archivists
  • 26. Comparison of File Format Conversion Systems • Some existing file format conversion services • http://www.ps2pdf.com • Supports only certain conversion types • http://www.zamzar.com • Supports conversion of document, image, music and video formats and a couple of CAD formats • http://media-convert.com • Supports about 20 multimedia formats • Drawbacks: The existing systems are not extensible (limited by specific libraries), cannot be downloaded for private use (files with sensitive info), and their computational scalability is unknown
  • 27. Format Conversion Extensibility Via Software Reuse • Observation: Nobody has the resources to load every possible file format • Fully supporting the many available formats is an enormous undertaking • If a file format is closed/proprietary, it may be difficult to retrieve the data directly from the file • Vendor file formats sometimes store application-feature-specific pieces of information that are not supported in other formats • Most software supports importing/exporting of a subset of application-domain-specific file formats • Conclusion: Software reuse and extensibility are the key characteristics of file format conversion systems
  • 28. File Format Conversion Extensibility • Extensibility in Polyglot: Software is reused by wrapping 3rd party software while utilizing whatever access the software vendors make available to embedded functionality • published Application Programming Interface (API), command line and Graphical User Interfaces (GUI) • Novelty: Polyglot provides a single user interface that allows the user to execute multiple conversion software applications automatically, over distributed computers that have a license for the software needed to do the conversion and/or have the computing resources necessary for the size of the job (computational scalability).
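One way to picture the command-line flavor of wrapping described above is a small descriptor that records which formats a tool reads and writes and how to invoke it. This is a sketch under stated assumptions, not Polyglot's actual wrapper design: the class `CommandLineWrapper`, the tool name `exampleconv` and its `--in`/`--out` flags are hypothetical.

```python
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class CommandLineWrapper:
    """Hypothetical descriptor for one wrapped third party converter."""
    name: str
    inputs: frozenset   # formats the tool can import
    outputs: frozenset  # formats the tool can export
    template: tuple     # argv parts with "{src}" and "{dst}" placeholders

    def supports(self, src_format, dst_format):
        """True if this tool can perform the requested single-hop conversion."""
        return src_format in self.inputs and dst_format in self.outputs

    def build_command(self, src_path, dst_path):
        """Fill the argv template with concrete file paths."""
        return [part.format(src=src_path, dst=dst_path) for part in self.template]

    def convert(self, src_path, dst_path):
        """Run the wrapped tool; raises CalledProcessError on a non-zero exit."""
        subprocess.run(self.build_command(src_path, dst_path), check=True)


# "exampleconv" and its flags are made up for illustration.
wrapper = CommandLineWrapper(
    name="exampleconv",
    inputs=frozenset({"tif"}),
    outputs=frozenset({"png", "jpg"}),
    template=("exampleconv", "--in", "{src}", "--out", "{dst}"),
)
command = wrapper.build_command("scan.tif", "scan.png")
```

Keeping the invocation in data rather than code means a new converter can be added by writing a new descriptor, which is one plausible reading of the extensibility claim on the slide. Tools that expose only an API or a GUI would need a different kind of wrapper.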
  • 29. #3: File Comparison Engines • Problem: Compare two files and evaluate information loss due to file format conversion over a set of files in multiple complex file formats • Technologies: • Initial prototypes: ModelBrowser (four 3D comparison metrics); Doc2Learn (one metric across multiple digital objects); Doc2LearnHadoop (computational scalability using Hadoop) • Work-in-progress: Versus, a general API for content-based comparison of any two files
  • 30. 3D Comparison Example (ModelBrowser) • Software: Adobe 3D Reviewer • Original File: WRL • Converted Files: STP, STL, IGS, U3D • Comparison Method: Light Fields [Chen, 2003] compares silhouettes from various viewing angles around the objects • Conclusion: Information loss(WRL→STP) = Information loss(WRL→STL) • (Figure panels: heart.wrl, heart.stp, heart.stl)
  • 31. Multiple Object Comparisons (Doc2Learn) • Adobe PDF documents ~ {text, images, vector graphics, …}
  • 32. Multiple Method Comparisons (Versus) • Software: MS Paint • Original File: TIF • Converted Files: PNG, GIF, JPG, BMP • Comparison Method: Pixel by pixel difference (sum of Euclidean distances over all pixels) • Conclusion 1: Information loss(TIF→BMP or TIF→PNG) = 0 • Conclusion 2: Information loss(TIF→GIF) > Information loss(TIF→JPG) • (Figure: user inputs)
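The comparison method named on the slide above, a sum of Euclidean distances over all pixels, can be sketched in a few lines. This is an illustrative sketch and not the Versus API: images are assumed to be already decoded into equal-sized nested lists of RGB tuples, and the function name `pixel_difference` is hypothetical.

```python
import math


def pixel_difference(image_a, image_b):
    """Sum of Euclidean distances between corresponding pixels.

    Each image is a list of rows, and each row is a list of colour tuples
    such as (r, g, b). A score of 0.0 means the pixel data is identical.
    """
    same_shape = len(image_a) == len(image_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(image_a, image_b)
    )
    if not same_shape:
        raise ValueError("images must have the same dimensions")

    total = 0.0
    for row_a, row_b in zip(image_a, image_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            # math.dist is the Euclidean distance between two points.
            total += math.dist(pixel_a, pixel_b)
    return total


# Tiny made-up 1x2 images standing in for decoded files.
original = [[(0, 0, 0), (255, 255, 255)]]
lossless_copy = [[(0, 0, 0), (255, 255, 255)]]
lossy_copy = [[(3, 4, 0), (255, 255, 255)]]

lossless_score = pixel_difference(original, lossless_copy)  # 0.0
lossy_score = pixel_difference(original, lossy_copy)        # 5.0
```

A score of zero corresponds to the slide's first conclusion (no information loss for lossless targets), while larger scores rank lossy targets against each other as in the second conclusion.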
  • 33. Information Loss Evaluation • Setup: • Inputs: a set of files, a set of software packages, criteria for defining information loss • Wanted output: information loss ‘score’ per file format conversion • Approach: • Phase I: Find all round-trip conversion paths from a given file format to the same file format • Phase II: Execute all conversions to obtain converted files • Phase III: Compare the original and converted files
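Phase I above amounts to enumerating cycles through the starting format in the conversion graph. The following is a minimal sketch of that step only, assuming a hypothetical list of (software, input format, output format) entries and a hop limit `max_hops` that is not mentioned in the slides; Phases II and III would then execute and score each returned path.

```python
def round_trip_paths(conversions, start, max_hops=4):
    """List every conversion path that leaves `start` and returns to it.

    Intermediate formats are never revisited, so each path is a simple
    cycle; `max_hops` (an assumed parameter) bounds the path length.
    """
    graph = {}
    for software, src, dst in conversions:
        graph.setdefault(src, []).append((software, dst))

    paths = []

    def extend(fmt, path, seen):
        if len(path) == max_hops:
            return
        for software, nxt in graph.get(fmt, []):
            hop = (software, fmt, nxt)
            if nxt == start:
                paths.append(path + [hop])  # closed the round trip
            elif nxt not in seen:
                extend(nxt, path + [hop], seen | {nxt})

    extend(start, [], {start})
    return paths


# Hypothetical tools and formats, for illustration only.
conversions = [
    ("ToolA", "stp", "stl"),
    ("ToolA", "stl", "stp"),
    ("ToolB", "stl", "ply"),
    ("ToolB", "ply", "stp"),
]
paths = round_trip_paths(conversions, "stp")
total_conversions = sum(len(path) for path in paths)
```

With these made-up entries there are two round trips from STP (one with two hops, one with three), for five individual conversions in total. The number of paths can grow quickly as tools and formats are added, which is why a hop limit is assumed here.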
  • 34. Information Loss Evaluation: Computational Requirements • Files: one file in STP file format • Software: Adobe 3D Reviewer, Cyberware PlyTool • Comparison Method: Light Fields [Chen, 2003] • Number of paths: 10 (28 individual conversions) • (Figure labels: Phase I: Find, Phase II: Execute, Phase III: Compare)
  • 36. Information Technology Lessons • Better understanding of preservation and reconstruction of electronic records in terms of file format conversions • The data model needed for documenting existing file format conversion software • A framework (test bed) for software reuse and extensibility to provide file format conversion services • The complexity of performing content-based file comparison and measurements of information loss due to file format conversions • The computational cost of file format conversions, file comparisons and information loss evaluations • The computational scalability of file format conversions and file comparisons using parallel processing paradigms
  • 37. The Value for Archivists • Prototype services are freely available to the digital preservation community and provide decision support tools • to select an ‘optimal’ file format to be preserved • to evaluate file format conversion software • to select the minimum-cost conversion path for a chosen file format conversion • The framework for conversion software documentation, software reuse and functionality extensibility has a major impact on • Efficiency with which we manage our holdings • Understanding of the information loss introduced due to conversions • The cost of updating file format conversion services
  • 38. Development Plans • Prototype services are open to the public at • https://isda.ncsa.uiuc.edu/NARA/CSR/ • http://teeve3.ncsa.uiuc.edu/polyglot/convert.php • Software is open source technology and downloadable from http://isda.ncsa.uiuc.edu/download/ • We have been building a second generation of these file format conversion services • Feedback is very welcome • Questions: Peter Bajcsy, pbajcsy@ncsa.uiuc.edu