Technologies For Appraising and
Managing Electronic Records
Presented by: Peter Bajcsy
-Research Scientist at NCSA
-Associate Director of I-CHASS, I3 Institute
-Adjunct Assistant Professor, CS & ECE
UIUC

National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
Acknowledgement

   • This research was partially supported by a National
     Archives and Records Administration (NARA) supplement
     to NSF PACI cooperative agreement CA #SCI-9619019
     and NCSA Industrial Partners.
   • The views and conclusions contained in this document
     are those of the authors and should not be interpreted as
     representing the official policies, either expressed or
     implied, of the National Archives and Records
     Administration, or the U.S. government.
   • Contributions by: Peter Bajcsy, Kenton McHenry, Rob
     Kooper, Michal Ondrejcek, Jason Kastner, William
     McFadden, and Sang-Chul Lee


Imaginations unbound
Outline

• Introduction
• A discovery of relationships among digital
  file collections (file2learn)
• A comprehensive comparison of
  contemporary documents (doc2learn)
• Automated file format conversions and
  conversion quality assessment (Polyglot)
• Summary
Introduction
Supporting NARA’s Strategic Plan
• According to The Strategic Plan of The
  National Archives and Records
  Administration 2006–2016, “Preserving the
  Past to Protect the Future”:
  • “Strategic Goal: We will preserve and
    process records to ensure access by the
    public as soon as legally possible”
     • “D. We will improve the efficiency with
       which we manage our holdings from
       the time they are scheduled through
       accessioning, processing, storage,
       preservation, and public use.”
To Be Preserved!
[Diagram: information transfer (?) from AGENCY to ARCHIVES; digital representation of information & knowledge; preservation]
Do We Know the Answers?

Questions During Appraisal of Electronic
  Records Series
• (1) Given M full DVDs with files, which
  files are related?
• (2) Given N versions of the ‘same’ file,
  which file version(s) should be
  preserved?
Do We Know the Answers?
• (3) Given P file formats, which file format
  to use and which conversion software
  to use so that files can still be viewed
  in the long run?
  • How much information is lost during file format
    conversion?
• (4) What is the granularity of
  information that one should preserve
  about a decision process in order to
  reconstruct it?
Goal: Design Technologies for Appraising
and Managing Electronic Records
• Technologies should address the following
  problems:
   • (1) a discovery of relationships among
     digital file collections (file2learn)
   • (2) a comprehensive comparison of
     contemporary documents (doc2learn)
   • (3) automated file format conversions and
     conversion quality assessment (Polyglot)
A Discovery of Relationships
 Among Digital File Collections
Discovering Relationships Among Files

• How should one establish relationships
  among electronic records coming
      • From disparate sources or
      • From the same source at multiple time
        instances?


• Need to Understand the Complexity of the
  Problem
Discovering Relationships Among Files:
  Components
• Metadata describing electronic records
    • How to extract metadata?
    • How to automate metadata extraction from multiple data
      types, e.g., 2D drawings and 3D CAD models?
• Storage of metadata
    • What ontology to use to represent the extracted metadata?
    • How to represent and store data and metadata?
• Exploratory and Search Capabilities
    • How to automate discovery of relationships?
    • How to support discovery of relationships between
      electronic records corresponding to the same physical
      objects but different multidimensional observations?
Relationships Among Multiple Data Types
  • Example Data: Torpedo Weapon Retriever 841
       • 784 existing 2D image drawings and N>22 3D CAD
         models
  • How to establish relationships among the 3D
    CAD models and 2D image drawings during a
    product lifecycle?




[Figure: Hypothetical distribution of 3D CAD models for TWR 841]
Methodology
•   File Identification
•   Information Extraction from
     • File System
     • File Content
•   Information Organization
     • Taxonomy
        (classification)
     • Ontology
        (relationships)
•   Information
    Representation,
    Integration and Storage
     • XML
     • RDF
•   Relationship Discovery
File Identification and File System Analyses
• File Identification
   • What is the file format?
   • Is the file format well formed?
• Approach: Used DROID built on top of the PRONOM File
  Registry with additional NCSA support of 3D file identification
• Metadata extraction about a file system
   • Where is the file located?
   • What is the file size, time stamp, etc.?
• Approach: Use any file system information extraction
  software, such as Aperture (cross platform, open source, active
  development), Google Desktop, OS specific solutions (e.g.,
  Apple Spotlight, Linux, MS Search)
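
The two file system questions above (where is the file, and what are its size and time stamp) can be illustrated with a minimal sketch. It uses only the Java standard library rather than Aperture or any of the tools named on the slide, and the class name `FileSystemMetadata` and the map keys are hypothetical choices made for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects basic file system metadata for one file.
class FileSystemMetadata {

    // Returns location, size and last-modified time as simple strings.
    static Map<String, String> describe(Path file) throws IOException {
        BasicFileAttributes attrs =
                Files.readAttributes(file, BasicFileAttributes.class);
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("location", file.toAbsolutePath().toString());
        metadata.put("sizeBytes", Long.toString(attrs.size()));
        metadata.put("lastModified", attrs.lastModifiedTime().toString());
        return metadata;
    }

    public static void main(String[] args) throws IOException {
        // Describe the path given on the command line, or the current directory.
        Path target = Paths.get(args.length > 0 ? args[0] : ".");
        describe(target).forEach((key, value) ->
                System.out.println(key + " = " + value));
    }
}
```

A dedicated extraction framework would add format-aware fields on top of these basics; the sketch only covers what the operating system itself reports.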
Content Analyses: Automation?

[Diagram: OCR; file descriptors (part name, author, software, date, …); relationship discovery]
Content Analyses: Optical Character
Recognition (OCR) of 2D Drawings

[Figure: 2D drawing with the Reference Block, Title Block, and MMC Block (Marinette Marine Corporation) marked]
‘Standard’ Title Blocks: Organization and
 Ontology (templates)

• Examples of title blocks used on
  drawings prepared by Naval
  Construction Battalion and Naval
  Construction Regiment
Title Block: Ontology and Metadata
    Representation

Ontology for sub-fields:
•   A – Record of preparation (<tdrw:recordOfPreparation>),
•   B – Drawing title (<tdrw:drawingTitle>),
•   C – Preparing activity (<tdrw:preparingActivity>),
•   F – Code identification number (<tdrw:FSCMNumber>),
•   G – Drawing size (<tdrw:drawingSize>),
•   H – Drawing number (<tdrw:drawingNumber>),
•   J – Scale (<tdrw:drawingScale>),
•   K – Specification number (<tdrw:drawingNumber>),
•   L – Sheet number (<tdrw:sheetNumber>).
Resource Description Framework (RDF):
•   Metadata representation: subject – predicate - object
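
As a minimal sketch of the subject – predicate – object form, the class below holds title-block fields as triples in memory. Only the predicate names `tdrw:drawingTitle` and `tdrw:sheetNumber` come from the ontology above; the class names, the subject identifier `drawing-0001`, and the literal values are hypothetical, and a real system would presumably use an RDF library instead of this hand-rolled type.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical in-memory representation of metadata triples.
class TitleBlockTriples {

    // One statement: subject - predicate - object.
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        // Readable rendering for inspection (not strict N-Triples syntax).
        String render() {
            return "<" + subject + "> <" + predicate + "> \"" + object + "\" .";
        }
    }

    // Selects the triples that use a given predicate.
    static List<Triple> withPredicate(List<Triple> all, String predicate) {
        return all.stream()
                .filter(t -> t.predicate.equals(predicate))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Illustrative values only; they are not taken from the TWR 841 data.
        List<Triple> triples = Arrays.asList(
                new Triple("drawing-0001", "tdrw:drawingTitle", "EXAMPLE TITLE"),
                new Triple("drawing-0001", "tdrw:sheetNumber", "1"));
        for (Triple t : withPredicate(triples, "tdrw:drawingTitle")) {
            System.out.println(t.render());
        }
    }
}
```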
MMC and Reference Blocks: Organization

• MMC Blocks
   • The list varies in length
   • The notation is not standardized

Inconsistencies
Summary of OCR Based Analyses
• Manually encoded block coordinates for 784 files in PNG
  (converted from original LZW-compressed TIFF files)
• Automated and executed OCR on
   • 700 title blocks,
   • 150 reference blocks,
   • a dozen revision and list-of-material blocks,
   • about 200 additional areas with the drawing numbers
     (MMC DWG. NO.).
• Performance benchmarks:
   • Full OCR of TB, MMC and RF for about 50 image files
     (105 blocks) took about 6 hours on a quad core
     machine
Content Based Extraction from STEP Files

     • 3D CAD models in STEP file format are searched for any ASCII
       strings matching English dictionary and following STEP
       metadata specification.


Example Metadata for TWR841 ship deck

STEP METADATA SPECIFICATION
FILE_DESCRIPTION(
/* description */ (''),
/* implementation_level */ '2;1');
FILE_NAME(
/* name */ '',
/* time_stamp */ '',
/* author */ (''),
/* organization */ (''),
/* preprocessor_version */ ' ',
/* originating_system */ '',
/* authorization */ ' ');

EXPECTED STEP METADATA
FILE_DESCRIPTION((''),
/* implementation_level */ '2;1');
FILE_NAME(
'120 TORPEDO WEAPONS RETRIEVER, TRANSVERSE BULKHEADS BELOW, MAIN DECK',
'04-10-86',
('LDOBSON'),
('NAVAL SEA SYSTEMS COMMAND'),
' ',
'IDA-STEP',
' ');

PARSED STEP METADATA
FILE_DESCRIPTION((''),
'2;1');
FILE_NAME(
'D:NARAArchieve_data_samplesBHD_FR12U2110_BHD12_2007_05_09.stp',
'2007-05-10T13:45:37',
('rakowpj'),
(''),
'Autodesk Inventor 11',
'Autodesk Inventor 11',
'');
Exploratory Framework – User Interface
 Overview

[Screenshot: two side panels, each with a filter for files, a list of files, and a preview of selected data; a graph of relationships between selected files in the center]
Exploratory Framework – User Interface
Overview
[Screenshot: additional import/export and preference options; table of relationships between selected files]
Exploratory Framework: Modes of Operations

• Detection of discrepancies/anomalies in file descriptors
   • OCR results
      • View 2D drawings and OCR results, and then edit OCR
        descriptors
   • 3D Model
      • View 3D model and content based extraction, and then edit
        descriptors
• Comparison of pairs of files
   • Pairs of 2D drawings
   • Pairs of 3D models
   • Pairs of (2D drawing, 3D model)
• Establish file relationships
   • Insert logical links to relate a pair of files
Detection of Anomalies in OCR Results
Comparison of Files

Color encoding:
• Predicates
  and values
  match
• Predicates
  match
• Predicate
  occurs only
  in one file
Establish File Relationships
Establish File Relationships: Logical Link
A Comprehensive Comparison
 of Contemporary Documents
Support of Appraisals by Enabling Comparisons
• How to compare containers with heterogeneous
  information (text, images, vector graphics,
  animation, 3D, etc.)?
   • Methodology
   • Metrics
   • Weighting factors for fusion
• How to quantify similarities between the same type
  of information?
   • Encodings and Representations
   • Metrics
   • Local versus global differences
Example: Adobe Portable Document
   Format (PDF)
 • Why PDF? - PDF is just an example of a container
      • Office environment (Adobe PDF, PS, MS Word, HTML, …)
      • Satellite measurements (HDF, netCDF, …)

[Figure: PDF examples with embedded 3D content (Adobe Library 6.0) and a movie (Adobe Library 7.0)]
Comparisons
Example: Compare Veterans Affairs Fact
Sheets in PDF and MS Word file formats
• Test data: 108 files from RG 015 - Records of the
  Department of Veterans Affairs/Fact
  Sheets/www1.va.gov/opa/fact/docs.
   • These files are Veterans Affairs Fact Sheets and are stored in both
     PDF and MS Word file formats (54 MS Word and 54 PDF files).
• Which files have identical content?
• Demo: 6 files
   • amwars-2.pdf, amwars.pdf
   • claimpro-2.pdf, claimpro.pdf
   • comprates-2.pdf, comprates.pdf
Methodology

[Diagram: pair-wise comparison of the same digital objects; comparison of multiple and heterogeneous digital objects; relationship to permanent records]
Exploration of Text Components

[Screenshot: loaded files; occurrence of words; occurrence of numbers; “ignore” words]
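
To make the idea of comparing text components by word occurrence concrete, here is a minimal sketch. The slides do not say which similarity metric doc2learn uses, so cosine similarity over word counts is an assumption made for illustration, and the class and method names are hypothetical. The `ignore` set plays the role of the “ignore” words panel.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: compare two texts by their word-occurrence counts.
class WordOccurrenceSimilarity {

    // Counts lower-cased words, skipping any word listed in the ignore set.
    static Map<String, Integer> countWords(String text, Set<String> ignore) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("[^a-z0-9]+")) {
            if (word.isEmpty() || ignore.contains(word)) {
                continue;
            }
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity of two count maps; 0 if either map is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            double value = entry.getValue();
            normA += value * value;
            Integer other = b.get(entry.getKey());
            if (other != null) {
                dot += value * other;
            }
        }
        for (int value : b.values()) {
            normB += (double) value * value;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        Set<String> ignore = new HashSet<>(Arrays.asList("the", "of", "and"));
        Map<String, Integer> first = countWords("The rates of compensation", ignore);
        Map<String, Integer> second = countWords("Compensation rates and claims", ignore);
        System.out.println("similarity = " + cosine(first, second));
    }
}
```

A score near 1 suggests the two texts use the same words in similar proportions; it says nothing about images or vector graphics, which would need their own descriptors and a weighting scheme for fusion.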
Exploration of Image Components

[Screenshot: loaded files; list of images; occurrence of colors; preview; “ignore” colors]
Exploration of Vector Graphics
   Components

[Screenshot: loaded files; preview; occurrence of v/h lines]
Comprehensive Pair-Wise Comparison of
Documents
[Screenshot: grouping and visualization control; similarity values plotted by document ID]
Visual Comparison for 6 Test Files




                   Result:
         amwars-2.pdf = amwars.pdf
        claimpro-2.pdf = claimpro.pdf
      comprates-2.pdf = comprates.pdf
Computational Requirements for Executing the
Methodology

[Diagram: methodology steps, with yellow indicating computations; relationship to permanent records]

Appraisal & Sampling
Work in progress: Group and Validate Documents

[Plot: attributes of documents versus order of documents]
Automated File Format Conversions
 and Conversion Quality
 Assessment
Conversions of Electronic Records
  • Conversions of electronic records are needed because
     • Visual exploration depends on various software
       packages
     • Many formats are retired (deprecated) over time
  • How to measure the degree of information
    preservation when files are converted from format A to
    format B?
     • During conversions, information could be lost, added or
       modified
     • What is the importance of each byte, object, etc.?
  • How to design a test bed for analyzing the quality of
    conversion and visualization software?
Illustration of 3D File Format Reality
[Figure: 3D file formats shown: *.ma, *.mb, *.mp; *.k3d; *.pdf (*.prc, *.u3d); *.w3d; *.lwo; *.c4d; *.dwg; *.blend; *.iam; *.max, *.3ds]
Our Survey about 3D Content
• Q: How Many 3D File Formats Exist?
• A: We have found more than 140 3D file
  formats. Many are proprietary file formats. Many
  are extremely complex (1,200 and more pages
  of specifications).
• Q: How Many Software Packages Support 3D
  File Format Import, Export and Display?
• A: We have documented about 16 software
  packages. There are many more. Most of them
  are proprietary/closed source code. Many
  contain incomplete support of file specifications.
Examples of 3D Formats and Stored Content
    Format                 Geometry                            Appearance                                Scene                Animation
             Faceted   Parametric     CSG   B-Rep   Color   Material   Texture   Bump   Lights   Views     Trans.    Groups

     3ds       √           √                         √         √         √        √       √       √              √

     igs       √           √          √      √       √                                                           √     √

     lwo       √           √                         √         √         √        √

     obj       √           √                         √         √         √        √                                    √

     ply       √                                     √         √         √        √

     stp       √           √          √      √       √                                                                 √

      wrl      √           √                         √         √         √        √       √       √              √     √         √

     u3d       √                                     √                   √        √       √       √              √     √         √

     x3d       √           √                         √         √         √        √       √       √              √     √         √

 


     • Some content may be more important than others
             • The relative importance is situation dependent
Example: Conversion of X3D to STEP to X3D


[Diagram: conversion chain through X3D, WRL, STEP, WRL, and X3D using X3dToVrml97, A3D Reviewer (twice), and Vrml97ToX3d; the final result is labeled "Nothing!"]
Towards a Universal Converter

• Use what is available in 3rd party software to
  perform conversions
   • Document what formats can be
     opened/imported by each application
   • Document what formats can be
     saved/exported by each application
   • Automate the use of each application and
     combine their abilities to perform conversions
     over a larger set of formats
Input/Output Graphs

                      Adobe 3D Reviewer
Automation of 3D File Format Mapping




[Diagram: find the shortest path; convert; preview]
Automation of 3D File Format Conversion

• The I/O-Graph stores the information needed to convert
  between the formats represented in the graph.
• In order to perform the conversion we must execute the
  conversion path found.
   • Many high-end graphics programs are found on the Windows
     platform
   • Those on other platforms, such as Linux, tend to have Windows
     ports
   • Some are command line driven (usually small converter
     applications)
   • Many have only GUI interfaces
   • AutoHotKey: a scripting language for the Windows GUI
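
The search for a conversion path in an I/O-Graph can be sketched as a breadth-first search over formats, where each edge records which application imports one format and exports another. This is a minimal sketch under assumptions: the slides do not describe Polyglot's internal data structures, "shortest" is taken here to mean the fewest conversion steps, and the class `IoGraph` and the application names `appA`, `appB`, `appC` are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical I/O graph: formats are nodes, conversions are directed edges.
class IoGraph {

    // One conversion: an application reads `from` and writes `to`.
    static final class Edge {
        final String app;
        final String from;
        final String to;

        Edge(String app, String from, String to) {
            this.app = app;
            this.from = from;
            this.to = to;
        }
    }

    private final Map<String, List<Edge>> outgoing = new HashMap<>();

    void add(String app, String from, String to) {
        outgoing.computeIfAbsent(from, k -> new ArrayList<>())
                .add(new Edge(app, from, to));
    }

    // Breadth-first search for the path with the fewest conversions.
    // Returns an empty list if no path exists or source equals target.
    List<Edge> shortestPath(String source, String target) {
        Map<String, Edge> reachedBy = new HashMap<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(source);
        queue.add(source);
        while (!queue.isEmpty()) {
            String format = queue.poll();
            if (format.equals(target)) {
                break;
            }
            for (Edge e : outgoing.getOrDefault(format, Collections.emptyList())) {
                if (seen.add(e.to)) {
                    reachedBy.put(e.to, e);
                    queue.add(e.to);
                }
            }
        }
        // Walk back from the target to the source to rebuild the path.
        LinkedList<Edge> path = new LinkedList<>();
        for (Edge e = reachedBy.get(target); e != null; e = reachedBy.get(e.from)) {
            path.addFirst(e);
        }
        return path;
    }

    public static void main(String[] args) {
        IoGraph graph = new IoGraph();
        graph.add("appA", "x3d", "wrl");
        graph.add("appB", "wrl", "stp");
        graph.add("appC", "x3d", "obj");
        for (Edge e : graph.shortestPath("x3d", "stp")) {
            System.out.println(e.app + ": " + e.from + " -> " + e.to);
        }
    }
}
```

Each edge on the returned path would then be executed in order, through a command line call or a GUI script, depending on the application. Counting steps treats every conversion as equally lossy; weighting edges by measured information loss would call for a weighted search instead.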
Methodology

[Diagram: extensibility; automation; computational scalability (cloud computing)]

Services to Archivists
NCSA Polyglot – Conversion Services

 • Web interface: user
   can drag and drop files
   into upload area for
   conversion

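 • Java interface:
// "http://???" is the service URL, elided on the slide; "obj" is presumably the target format
PolyglotRequest pgr;
pgr = new PolyglotRequest("http://???", "obj");
// Convert file.wrl; "./" is presumably the output folder
pgr.convertFile("file.wrl", "./");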


 • Scalability Test
Number of PCs     One PC                  Two PCs
Processing Time   33 minutes 6 seconds    16 minutes 40 seconds
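
As a worked reading of the two timings above, the speedup and parallel efficiency can be computed directly. The symbols $T_1$, $T_2$, $S$ and $E$ are introduced here for illustration and do not appear on the slide.

```latex
T_1 = 33 \cdot 60 + 6 = 1986\ \text{s}, \qquad
T_2 = 16 \cdot 60 + 40 = 1000\ \text{s}

S = \frac{T_1}{T_2} = \frac{1986}{1000} \approx 1.99, \qquad
E = \frac{S}{2} \approx 0.99
```

With only two data points this suggests near-linear scaling from one to two PCs; it does not show how the service behaves with more machines.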
NCSA Polyglot – Data Loss Measurement
Services
We would like to assign a value to each conversion edge …
Geometry Based Content Retention

  • Several metrics
  • Data driven assignment

  • Example results
Metric         Single Optimal Conversion                                     ‘Best’ File Format
               Software            From   To     Information Retention      Format   Information Retention
Light Fields   Adobe 3D Reviewer   .pdf   .stp   61.67                      .stp     40.73
Spin Images    Adobe 3D Reviewer   .obj   .pdf   59.07                      .stl     34.89
Summary
• Technologies for appraisal of electronic records
  should assist archivists
• They are designed to support decisions and data
  explorations by automating appraisal tasks
• The software for doc2learn and Polyglot is available
  for downloading at
  http://isda.ncsa.uiuc.edu/download/
• File2learn software – the work is still in progress
• Feedback is very welcome

• Questions: Peter Bajcsy – pbajcsy@ncsa.uiuc.edu
Demo exercise

• Step 1: Check that a
  path exists between
  wrl and pdf

• Step 2: drag and drop
  heart.wrl; select target
  to be pdf, click upload

• Step 3: download to
  desktop and open in
  Adobe PDF Viewer

More Related Content

PPTX
NISO Webinar: Metadata for Preservation: A Digital Object's Best Friend
PPTX
NISO Forum, Denver, Sept. 24, 2012: EZID: Easy dataset identification & manag...
PPTX
NISO Forum, Denver, Sept. 24, 2012: Opening Keynote: The Many and the One: BC...
PPTX
NISO Forum, Denver, Sept. 24, 2012: Scientific discovery and innovation in an...
PDF
e-Services to Keep Your Digital Files Current
PDF
A Locality Sensitive Hashing Filter for Encrypted Vector Databases
KEY
NISO Forum, Denver, Sept. 24, 2012: Data Equivalence
NISO Webinar: Metadata for Preservation: A Digital Object's Best Friend
NISO Forum, Denver, Sept. 24, 2012: EZID: Easy dataset identification & manag...
NISO Forum, Denver, Sept. 24, 2012: Opening Keynote: The Many and the One: BC...
NISO Forum, Denver, Sept. 24, 2012: Scientific discovery and innovation in an...
e-Services to Keep Your Digital Files Current
A Locality Sensitive Hashing Filter for Encrypted Vector Databases
NISO Forum, Denver, Sept. 24, 2012: Data Equivalence

What's hot (19)

PDF
IRJET - A Secure Access Policies based on Data Deduplication System
PDF
Tese phd
PDF
Stream Processing with DDS and CEP
PPTX
Guest Lecture: Exchange and QA for Metadata at WSU
PDF
Advanced OpenSplice Programming - Part I
PDF
Federated HDFS
PDF
Using Dublin Core for DISCOVER: a New Zealand visual art and music resource f...
PDF
Building Reactive Applications with DDS
PDF
HiTIME project
 
PPTX
Whither Small Data?
PPTX
HA Hadoop -ApacheCon talk
PDF
Presentation Ispass 2012 Session6 Presentation1
PDF
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
PDF
The DDS Security Standard
PDF
The DDS Tutorial - Part I
PDF
Digitization Projects for Small Archives and Museums
PDF
Using Architectures for Semantic Interoperability to Create Journal Clubs for...
PPTX
Dexjava Technical Seminar Dec 2011
PDF
DDS In Action Part II
IRJET - A Secure Access Policies based on Data Deduplication System
Tese phd
Stream Processing with DDS and CEP
Guest Lecture: Exchange and QA for Metadata at WSU
Advanced OpenSplice Programming - Part I
Federated HDFS
Using Dublin Core for DISCOVER: a New Zealand visual art and music resource f...
Building Reactive Applications with DDS
HiTIME project
 
Whither Small Data?
HA Hadoop -ApacheCon talk
Presentation Ispass 2012 Session6 Presentation1
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
The DDS Security Standard
The DDS Tutorial - Part I
Digitization Projects for Small Archives and Museums
Using Architectures for Semantic Interoperability to Create Journal Clubs for...
Dexjava Technical Seminar Dec 2011
DDS In Action Part II
Ad

Viewers also liked (10)

PDF
Key Aspects in 3D File Format Conversions
PPT
Soccer 3v3 Field Sponsorship2009
PPT
Soccer 3v3 Fun Zone 2009
PPTX
SLiMS improving librarian competences 20150508
PPT
Spc Gen Pres Final
PDF
To Preserve Or Not To Preserve?
PPTX
Mobile ISD Metcalf IEL2010
PDF
Overview of Lincoln Paper Design
PPT
Home Selling Tips - Pricing and Staging
PDF
Gsm2009
Key Aspects in 3D File Format Conversions
Soccer 3v3 Field Sponsorship2009
Soccer 3v3 Fun Zone 2009
SLiMS improving librarian competences 20150508
Spc Gen Pres Final
To Preserve Or Not To Preserve?
Mobile ISD Metcalf IEL2010
Overview of Lincoln Paper Design
Home Selling Tips - Pricing and Staging
Gsm2009
Ad

Similar to Technologies For Appraising and Managing Electronic Records (20)

PDF
Preservation Planning: Choosing a suitable digital preservation strategy
PDF
fiwalk With Me: Building Emergent Pre-Ingest Workflows for Digital Archival R...
PPT
The Elephant in the Library
PPT
PRESERVATION Web archiving
PPTX
Electronic Records
PDF
Emulation Bridging The Past To The Future Dirk Von Suchodoletz
PPT
Introduction to Digital Preservation
PPT
Gettingstartedwithdigitalcollectionsweb[1]
PDF
Tackling File Characterization and Analysis in Archivematica
PDF
Intro to Digital Preservation
PDF
Using and Developing with Open Source Digital Forensics Software in Digital A...
PDF
Watching the Detectives: Using digital forensics techniques to investigate th...
PPTX
NCompass Live: Best Practices for Digital Collections
PPT
Trm Introduction
PPT
The Elephant in the Library - Integrating Hadoop
PPTX
Digitizing a newspaper clippings collection: a case study in small-scale digi...
PDF
Dc sheridan dlf_2011_final
PPTX
UCD Digital Library: Creating online access to historical and contemporary co...
PDF
Evolving Domains, Problems and Solutions for Long Term Digital Preservation
Preservation Planning: Choosing a suitable digital preservation strategy
fiwalk With Me: Building Emergent Pre-Ingest Workflows for Digital Archival R...
The Elephant in the Library
PRESERVATION Web archiving
Electronic Records
Emulation Bridging The Past To The Future Dirk Von Suchodoletz
Introduction to Digital Preservation
Gettingstartedwithdigitalcollectionsweb[1]
Tackling File Characterization and Analysis in Archivematica
Intro to Digital Preservation
Using and Developing with Open Source Digital Forensics Software in Digital A...
Watching the Detectives: Using digital forensics techniques to investigate th...
NCompass Live: Best Practices for Digital Collections
Trm Introduction
The Elephant in the Library - Integrating Hadoop
Digitizing a newspaper clippings collection: a case study in small-scale digi...
Dc sheridan dlf_2011_final
UCD Digital Library: Creating online access to historical and contemporary co...
Evolving Domains, Problems and Solutions for Long Term Digital Preservation

Recently uploaded (20)

PDF
Machine learning based COVID-19 study performance prediction
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
Empathic Computing: Creating Shared Understanding
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PDF
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Encapsulation theory and applications.pdf
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PDF
KodekX | Application Modernization Development
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
Unlocking AI with Model Context Protocol (MCP)
PDF
Approach and Philosophy of On baking technology
PDF
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
Review of recent advances in non-invasive hemoglobin estimation
Machine learning based COVID-19 study performance prediction
Chapter 3 Spatial Domain Image Processing.pdf
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Empathic Computing: Creating Shared Understanding
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
NewMind AI Weekly Chronicles - August'25 Week I
Reach Out and Touch Someone: Haptics and Empathic Computing
Spectral efficient network and resource selection model in 5G networks
Encapsulation theory and applications.pdf
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
KodekX | Application Modernization Development
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Unlocking AI with Model Context Protocol (MCP)
Approach and Philosophy of On baking technology
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
Dropbox Q2 2025 Financial Results & Investor Presentation
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Building Integrated photovoltaic BIPV_UPV.pdf
Review of recent advances in non-invasive hemoglobin estimation

Technologies For Appraising and Managing Electronic Records

  • 1. Technologies For Appraising and A i i d Managing Electronic Records Presented by: Peter Bajcsy -Research Scientist at NCSA -Associate Director of I-CHASS, I3 , Institute -Adjunct Assistant Professor, CS & ECE UIUC National Center for Supercomputing Applications University of Illinois at Urbana-Champaign
  • 2. Acknowledgement • This research was partially supported by a National Archive and Records Administration (NARA) supplement ( ) pp to NSF PACI cooperative agreement CA #SCI-9619019 and NCSA Industrial Partners. • The views and conclusions contained in this doc ment ie s concl sions document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the National Archive and Records Administration, or the U.S. government. • Contributions by: Peter Bajcsy Kenton McHenry Rob Bajcsy, McHenry, Kooper, Michal Ondrejcek, Jason Kastner, William McFadden, and Sang-Chul Lee Imaginations unbound
  • 3. Outline • Introduction • A disco er of relationships among digital discovery file collections (file2learn) • A comprehensive comparison of contemporary documents (doc2learn) • Automated file format conversions and conversion quality assessment (Polyglot) • Summary
  • 5. Supporting NARA’s Strategic Plan • According to The Strategic Plan of The National Archives and Records Administration 2006–2016. “Preserving the Past to Protect the Future” • “Strategic Goal: We will preserve and process records to ensure access by the public as soon as legally possible” possible • “D. We will improve the efficiency with which we manage our holdings from the time they are scheduled through accessioning, processing, storage, preservation, preservation and public use.” use
  • 6. To Be Preserved! Digital representation of information Preservation & knowledge Information transfer ? AGENCY ARCHIVES Imaginations unbound
  • 7. Do We Know the Answers? Questions During Appraisal of Electronic Records Series • (1) Given M full DVDs with files, which files are related? • (2) Given N versions of the ‘same’ file, which file version(s) should be preserved?
  • 8. Do We Know the Answers? • (3) Given P file formats, which file format and which conversion software should be used so that files can still be viewed in the long run? • How much information is lost during file format conversion? • (4) What is the granularity of information that one should preserve about a decision process in order to reconstruct it?
  • 9. Goal: Design Technologies for Appraising and Managing Electronic Records • Technologies should address the following problems: • (1) a discovery of relationships among digital file collections (file2learn) • (2) a comprehensive comparison of contemporary documents (doc2learn) • (3) automated file format conversions and conversion quality assessment (Polyglot)
  • 10. A Discovery of Relationships Among Digital File Collections
  • 11. Discovering Relationships Among Files • How should one establish relationships among electronic records coming • From disparate sources or • From the same source at multiple time instances? • Need to Understand the Complexity of the Problem
  • 12. Discovering Relationships Among Files: Components • Metadata describing electronic records • How to extract metadata? • How to automate metadata extraction from multiple data types, e.g., 2D drawings and 3D CAD models? • Storage of metadata • What ontology to use to represent the extracted metadata? • How to represent and store data and metadata? • Exploratory and Search Capabilities • How to automate discovery of relationships? • How to support discovery of relationships between electronic records corresponding to the same physical objects but different multidimensional observations?
  • 13. Relationships Among Multiple Data Types • Example Data: Torpedo Weapon Retriever 841 • 784 existing 2D image drawings and N>22 3D CAD models • How to establish relationships among the 3D CAD models and 2D image drawings during a product lifecycle? (Figure: Hypothetical Distribution of 3D CAD models for TWR 841)
  • 14. Methodology • File Identification • Information Extraction from • File System • File Content • Information Organization • Taxonomy (classification) • Ontology (relationships) • Information Representation, Integration and Storage • XML • RDF • Relationship Discovery
  • 15. File Identification and File System Analyses • File Identification • What is the file format? • Is the file format well formed? • Approach: Used DROID built on top of the PRONOM File Registry with additional NCSA support of 3D file identification • Metadata extraction about a file system • Where is the file located? • What is the file size, time stamp, etc.? • Approach: Use any file system information extraction software, such as Aperture (cross platform, open source, active development), Google Desktop, OS specific solutions (e.g., Apple Spotlight, Linux, MS Search)
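The file-system questions on this slide (where is the file, what are its size and time stamp) can be sketched with the Python standard library alone. This is only an illustration, not the Aperture or DROID workflow named above: the function names `describe_file` and `describe_tree` are hypothetical, and the file extension is recorded merely as a hint, since real format identification needs a signature registry such as PRONOM.

```python
import os
from datetime import datetime, timezone


def describe_file(path):
    """Collect basic file-system descriptors for one file."""
    info = os.stat(path)
    return {
        "location": os.path.abspath(path),
        # The extension is only a hint; it does not prove the format.
        "extension": os.path.splitext(path)[1].lower(),
        "size_bytes": info.st_size,
        "modified_utc": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc
        ).isoformat(),
    }


def describe_tree(root):
    """Yield descriptors for every file below a root folder."""
    for folder, _dirs, files in os.walk(root):
        for name in files:
            yield describe_file(os.path.join(folder, name))
```

Calling `list(describe_tree("some_folder"))` returns one dictionary per file, which can then be stored next to content-based descriptors.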
  • 16. Content Analyses: Automation (Diagram labels: OCR; File Descriptors: part name, author, software, date, …; Relationship Discovery?)
  • 17. Content Analyses: Optical Character Recognition (OCR) of 2D Drawings Reference Block Title Block MMC Block (Marinette Marine Corporation)
  • 18. ‘Standard’ Title Blocks: Organization and Ontology TEMPLATES • Examples of title blocks used on drawings prepared by Naval Construction Battalion and Naval Construction Regiment
  • 19. Title Block: Ontology and Metadata Representation Ontology for sub-fields: • A – Record of preparation (<tdrw:recordOfPreparation>), • B – Drawing title (<tdrw:drawingTitle>), • C – Preparing Activity (<tdrw:preparingActivity>), • F – Code identification number (<tdrw:FSCMNumber>), • G – Drawing size (<tdrw:drawingSize>), • H – Drawing number (<tdrw:drawingNumber>), • J – Scale (<tdrw:drawingScale>), • K – Specification number (<tdrw:drawingNumber>), • L – Sheet number (<tdrw:sheetNumber>). Resource Description Framework (RDF): • Metadata representation: subject – predicate – object
  • 20. MMC and Reference Blocks: Organization • MMC Blocks • The list varies in length • The notation is not standardized (Figure label: Inconsistencies)
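To make the subject – predicate – object idea concrete, the sketch below turns OCR'd title-block fields into plain Python triples. It is a minimal illustration, not the file2learn implementation: the function name, the `urn:example:` subject, and the sample field values are all hypothetical, and only the `tdrw:` predicate names come from the slide. Field K is left out because the slide lists it with the same tag as field H.

```python
# Title-block sub-field letters mapped to the predicates named on the slide.
# Field K is omitted: the slide gives it the same tag as field H.
FIELD_TO_PREDICATE = {
    "A": "tdrw:recordOfPreparation",
    "B": "tdrw:drawingTitle",
    "C": "tdrw:preparingActivity",
    "F": "tdrw:FSCMNumber",
    "G": "tdrw:drawingSize",
    "H": "tdrw:drawingNumber",
    "J": "tdrw:drawingScale",
    "L": "tdrw:sheetNumber",
}


def title_block_to_triples(subject, fields):
    """Return (subject, predicate, object) triples for known sub-fields."""
    return [
        (subject, FIELD_TO_PREDICATE[letter], value)
        for letter, value in fields.items()
        if letter in FIELD_TO_PREDICATE
    ]


# Hypothetical OCR output for one drawing.
example = title_block_to_triples(
    "urn:example:drawing-1",
    {"B": "MAIN DECK", "H": "12345", "Z": "ignored"},
)
```

Each triple can later be serialized as RDF; keeping them as tuples here avoids assuming any particular RDF library.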
  • 21. Summary of OCR Based Analyses • Manually encoded block coordinates for 784 files in PNG (converted from originally LZW compressed TIFF files) • Automated OCR and executed OCR on • 700 title blocks, • 150 reference blocks, • dozen of revision and list of material • about 200 additional areas with the drawing numbers (MMC DWG. NO.). • Performance benchmarks: • Full OCR of TB, MMC and RF for about 50 image files (105 blocks) took about 6 hours on a quad core machine
  • 22. Content Based Extraction from STEP Files • 3D CAD models in STEP file format are searched for any ASCII strings matching an English dictionary and following the STEP metadata specification. Example Metadata for TWR841 ship deck:
      STEP METADATA SPECIFICATION | EXPECTED STEP METADATA | PARSED STEP METADATA
      FILE_DESCRIPTION( /* description */ (''), | FILE_DESCRIPTION((''), | FILE_DESCRIPTION((''),
      /* implementation_level */ '2;1'); | '2;1'); | '2;1');
      FILE_NAME( /* name */ '', | FILE_NAME('120 TORPEDO WEAPONS RETRIEVER, TRANSVERSE BULKHEADS BELOW, MAIN DECK', | FILE_NAME('D:NARAArchieve_data_samplesBHD_FR12U2110_BHD12_2007_05_09.stp',
      /* time_stamp */ '', | '04-10-86', | '2007-05-10T13:45:37',
      /* author */ (''), | ('LDOBSON'), | ('rakowpj'),
      /* organization */ (''), | ('NAVAL SEA SYSTEMS COMMAND'), | (''),
      /* preprocessor_version */ ' ', | ' ', | 'Autodesk Inventor 11',
      /* originating_system */ '', | 'IDA-STEP', | 'Autodesk Inventor 11',
      /* authorization */ ' '); | ' '); | '');
  • 23. Exploratory Framework – User Interface Overview (Screenshot labels: Filter for Files; Graph of Relationships Between Selected Files; Files; Preview of Selected Data)
  • 24. Exploratory Framework – User Interface Overview Additional Import/Export and Preference Options Table of Relationships Between Selected Files
  • 25. Exploratory Framework: Modes of Operations • Detection of discrepancies/anomalies in file descriptors • OCR results • View 2D drawings and OCR results, and then edit OCR descriptors • 3D Model • View 3D model and content based extraction, and then edit descriptors • Comparison of pairs of files • Pairs of 2D drawings • Pairs of 3D models • Pairs of (2D drawing, 3D model) • Establish file relationships • Insert logical links to relate a pair of files
  • 26. Detection of Anomalies in OCR Results
  • 27. Comparison of Files Color encoding: • Predicates and values match • Predicates match • Predicate occurs only in one file
  • 30. A Comprehensive Comparison of Contemporary Documents
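The three color classes on this slide amount to a per-predicate comparison of two descriptor sets. The following is a rough sketch under the assumption that each file's descriptors are held as a dictionary from predicate to value; the function name and the class labels are hypothetical, not taken from the tool.

```python
def compare_descriptors(first, second):
    """Classify every predicate found in either descriptor dictionary."""
    classes = {}
    for predicate in sorted(set(first) | set(second)):
        if predicate in first and predicate in second:
            if first[predicate] == second[predicate]:
                classes[predicate] = "predicate_and_value_match"
            else:
                classes[predicate] = "predicate_match"
        else:
            classes[predicate] = "only_in_one_file"
    return classes


# Hypothetical descriptors for a 2D drawing and a 3D model.
drawing = {"tdrw:drawingTitle": "MAIN DECK", "tdrw:drawingScale": "1:20"}
model = {"tdrw:drawingTitle": "MAIN DECK", "tdrw:drawingScale": "1:10",
         "tdrw:sheetNumber": "2"}
comparison = compare_descriptors(drawing, model)
```

A viewer could then map each class to one of the three colors when displaying a pair of files side by side.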
  • 31. Support of Appraisals by Enabling Comparisons • How to compare containers with heterogeneous information (text, images, vector graphics, animation, 3D, etc.)? • Methodology • Metrics • Weighting factors for fusion • How to quantify similarities between the same type of information? • Encodings and Representations • Metrics • Local versus global differences
  • 32. Example: Adobe Portable Document Format (PDF) • Why PDF? - PDF is just an example of a container • Office environment (Adobe PDF, PS, MS Word, HTML, …) • Satellite measurements (HDF, netCDF, …) (Figure labels: 3D, Movie, Adobe Library 6.0, Adobe Library 7.0)
  • 34. Example: Compare Veterans Affairs Fact Sheets in PDF and MS Word file formats • Test data: 108 files from RG 015 - Records of the Department of Veterans Affairs/Fact Sheets/www1.va.gov/opa/fact/docs. • These files are Veterans Affairs Fact Sheets and are stored in both PDF and MS Word file formats (54 MS Word and 54 PDF files). • Which files have identical content? • Demo: 6 files • amwars-2.pdf, amwars.pdf • claimpro-2.pdf, claimpro.pdf • comprates-2.pdf, comprates.pdf
  • 35. Methodology (Diagram labels: Pair-wise comparison of the same digital objects + …; Comparison of multiple and heterogeneous digital objects + …; Relationship to Permanent Records)
  • 36. Exploration of Text Components LOADED FILES Occurrence of words Occurrence of numbers “Ignore” words
  • 37. Exploration of Image Components LOADED FILES “Ignore” colors List of images Occurrence of colors Preview
  • 38. Exploration of Vector Graphics Components LOADED FILES Preview Occurrence of v/h lines Imaginations unbound
  • 39. Comprehensive Pair-Wise Comparison of Documents Grouping and Visualization Control Similarity Values Document ID
  • 40. Visual Comparison for 6 Test Files Result: amwars-2.pdf = amwars.pdf, claimpro-2.pdf = claimpro.pdf, comprates-2.pdf = comprates.pdf
  • 41. Computational Requirements for Executing the Methodology Yellow indicates computations Relationship to Permanent Records Appraisal & Sampling
  • 42. Work in progress: Group and Validate Documents (Plot axes: Attributes of documents; Order of documents)
  • 43. Automated File Format Conversions and Conversion Quality Assessment
  • 44. Conversions of Electronic Records • Conversions of electronic records are needed because • Visual exploration depends on various software packages • Many formats are retired (deprecated) over time • How to measure the degree of information preservation when files are converted from format A to format B? • During conversions, information could be lost, added or modified • What is the importance of each byte, object, etc.? • How to design a test bed for analyzing the quality of conversion and visualization software?
  • 45. Illustration of 3D File Format Reality *.ma, *.mb, *.mp, *.k3d, *.pdf (*.prc, *.u3d), *.w3d, *.lwo, *.c4d, *.dwg, *.blend, *.iam, *.max, *.3ds
  • 46. Our Survey about 3D Content • Q: How Many 3D File Formats Exist? • A: We have found more than 140 3D file formats. Many are proprietary file formats. Many are extremely complex (1,200 and more pages of specifications). • Q: How Many Software Packages Support 3D File Format Import, Export and Display? • A: We have documented about 16 software packages. There are many more. Most of them are proprietary/closed source code. Many contain incomplete support of file specifications.
  • 47. Examples of 3D Formats and Stored Content Format Geometry Appearance Scene Animation Faceted Parametric CSG B-Rep Color Material Texture Bump Lights Views Trans. Groups 3ds √ √ √ √ √ √ √ √ √ igs √ √ √ √ √ √ √ lwo √ √ √ √ √ √ obj √ √ √ √ √ √ √ ply √ √ √ √ √ stp √ √ √ √ √ √ wrl √ √ √ √ √ √ √ √ √ √ √ u3d √ √ √ √ √ √ √ √ √ x3d √ √ √ √ √ √ √ √ √ √ √ • Some content may be more important than others • The relative importance is situation dependent
  • 48. Example: Conversion of X3D to STEP to X3D X3D -> (Software: X3dToVrml97) -> WRL -> (Software: A3D Reviewer) -> STEP -> (Software: A3D Reviewer) -> WRL -> (Software: Vrml97ToX3d) -> X3D Result: Nothing!
  • 49. Towards a Universal Converter • Use what is available in 3rd party software to perform conversions • Document what formats can be opened/imported by each application • Document what formats can be saved/exported by each application • Automate the use of each application and combine their abilities to perform conversions over a larger set of formats
  • 50. Input/Output Graphs Adobe 3D Reviewer
  • 51. Automation of 3D File Format Mapping Find the shortest path Convert Preview
  • 52. Automation of 3D File Format Conversion • The I/O-Graph stores the information needed to convert between the formats represented in the graph. • In order to perform the conversion we must execute the conversion path found. • Many high end graphics programs are found on the Windows platform • Those on other platforms, such as Linux, tend to have Windows ports • Some are command line driven (usually small converter applications). • Many have only GUI interfaces • AutoHotKey: a scripting language for the Windows GUI.
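The "find the shortest path" step over the I/O graph can be sketched as a breadth-first search in which formats are nodes and each edge records the application that performs the conversion. This is an illustrative sketch, not Polyglot's code: the function name is hypothetical, and the small edge list only reuses the applications and formats mentioned in the X3D/STEP example earlier in the deck.

```python
from collections import deque


def shortest_conversion_path(edges, source, target):
    """Return the fewest-step list of (from, to, application) conversions,
    or None when the target format cannot be reached."""
    graph = {}
    for src, dst, app in edges:
        graph.setdefault(src, []).append((dst, app))

    previous = {source: None}  # format -> (earlier format, application)
    queue = deque([source])
    while queue:
        fmt = queue.popleft()
        if fmt == target:
            break
        for nxt, app in graph.get(fmt, []):
            if nxt not in previous:
                previous[nxt] = (fmt, app)
                queue.append(nxt)

    if target not in previous:
        return None
    steps = []
    node = target
    while previous[node] is not None:
        earlier, app = previous[node]
        steps.append((earlier, node, app))
        node = earlier
    steps.reverse()
    return steps


# Illustrative edges only; a real I/O graph would list every import/export pair.
EDGES = [
    ("x3d", "wrl", "X3dToVrml97"),
    ("wrl", "stp", "A3D Reviewer"),
    ("stp", "wrl", "A3D Reviewer"),
    ("wrl", "x3d", "Vrml97ToX3d"),
]
path = shortest_conversion_path(EDGES, "x3d", "stp")
```

Breadth-first search minimizes the number of conversions; if each edge carried an information-loss weight, a weighted search such as Dijkstra's algorithm would be the natural replacement.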
  • 53. Methodology EXTENSIBILITY AUTOMATION Cloud Computing COMPUTATIONAL SCALABILITY Services to Archivists
  • 54. NCSA Polyglot – Conversion Services • Web interface: user can drag and drop files into upload area for conversion • Java interface: PolyglotRequest pgr; pgr = new PolyglotRequest("http://???", "obj"); pgr.convertFile("file.wrl", "./"); • Scalability Test: Processing Time with One PC: 33 minutes 6 seconds; with Two PCs: 16 minutes 40 seconds
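Reading the scalability figures as 33 minutes 6 seconds on one PC and 16 minutes 40 seconds on two PCs, the speedup and parallel efficiency follow from simple arithmetic. The short sketch below only restates those two timings; the helper name `to_seconds` is hypothetical.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds pair to total seconds."""
    return minutes * 60 + seconds


one_pc = to_seconds(33, 6)    # 1986 seconds
two_pcs = to_seconds(16, 40)  # 1000 seconds

speedup = one_pc / two_pcs    # about 1.99x with two machines
efficiency = speedup / 2      # about 0.99 of ideal linear scaling
```

On these two data points the conversion workload scales almost linearly; more machines would be needed to see where that stops holding.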
  • 55. NCSA Polyglot – Data Loss Measurement Services We would like to assign a value to each conversion edge …
  • 56. Geometry Based Content Retention • Several metrics • Data driven assignment • Example results:
      Metric | Single Optimal Conversion: Software | From | To | Information Retention | ‘Best’ File Format: Format | Information Retention
      Light Fields | Adobe 3D Reviewer | .pdf | .stp | 61.67 | .stp | 40.73
      Spin Images | Adobe 3D Reviewer | .obj | .pdf | 59.07 | .stl | 34.89
  • 57. Summary • Technologies for appraisal of electronic records should assist archivists • They are designed to support decisions and data explorations by automating appraisal tasks • The software for doc2learn and Polyglot is available for downloading at http://isda.ncsa.uiuc.edu/download/ • File2learn software – the work is still in progress • Feedback is very welcome • Questions: Peter Bajcsy – pbajcsy@ncsa.uiuc.edu
  • 58. Demo exercise • Step 1: Check that a path exists between wrl and pdf • Step 2: Drag and drop heart.wrl; select pdf as the target; click upload • Step 3: Download to desktop and open in Adobe PDF Viewer