fiwalk With Me:
  Building Emergent Pre-Ingest Workflows for Digital
Archival Records using Open Source Forensic Software

      Mark A. Matienzo, Yale University Library
                  Code4lib 2011

mark@matienzo.org http://matienzo.org/ @anarchivist
Disclaimer


The following presentation expresses
 opinions of my own and not of my
   employer, my coworkers, etc.
Digital forensics?


  http://www.flickr.com/photos/freeparking/480863346/
http://www.flickr.com/photos/johnjack/3449280414
Locard’s exchange principle
Key Works




urn:isbn:978-0201634976     urn:isbn:978-0321268174   urn:isbn:978-1932326376
http://www.flickr.com/photos/bjornmeansbear/4662232392/
Design Principles
•Use digital forensics software and methodology to
  support accessioning of born-digital archival records

•Mitigate risk of media deterioration and obsolescence
•Prefer open source solutions whenever possible
•Integrate into a larger, but yet-to-be-defined workflow
•Use curation micro-services as a guiding philosophy for
  implementation and further analysis
Applied Methodology
•Use Carrier’s (2005) model of the digital investigation
  process: Preservation → Searching → Reconstruction


•Volume and file system as main areas for analysis
•Assume much of the state is already lost
•Methods should approach, or at least aim for, forensic soundness
•Ethical issues (as raised in CLIR report) are out of scope
Mitigate Risk




http://www.flickr.com/photos/moparx/4013824025/
A Larger Workflow
Stages: Collection development → Accessioning → Arrangement & description → Access
(a selection step sits between each pair of stages)

•Collection development: physical format inventory; document recordkeeping
  systems; draft submission agreement; gather contextual information
•Accessioning: detailed inventory; disk imaging; file format survey; assess
  access requirements; complete submission agreement with source; assess
  preservation needs
•Arrangement & description: create descriptive metadata; intellectual/physical
  arrangement; lump/split digital objects; declare rights/access restrictions;
  develop access tools; migrate selected records
•Access: enforce restrictions; add supplemental descriptions/annotations; use
  viewers as needed; log use/access
•Preservation: migrate objects/files; JHOVE/DROID reports; checksum
  verification; emulation/virtual environments
•Storage: physical & network media; working storage; preservation storage;
  dissemination storage
Open Source Forensics
•Digital forensics is a high-stakes market
•Proprietary forensics software is not easily extensible
•Proprietary forensics software is often platform-specific
•Cultural heritage institutions are still an emerging market
  for digital forensics

•Our needs are different and still being defined
Microservices as Philosophy


  http://www.flickr.com/photos/gregmote/2797330534
Principles
•Granularity
•Orthogonality
•Parsimony
•Evolution

Preferences
•Small and simple over large and complex
•Minimally sufficient over feature-laden
•Configurable over the prescribed
•The proven over the merely novel
•Outcomes over means

Practices
•Define, decompose, recurse
•Top down design, bottom up implementation
•Code to interfaces
•Sufficiency through a series of incrementally necessary steps

               UC Curation Center/California Digital Library, “UC3 Curation Foundations”
      https://confluence.ucop.edu/download/attachments/13860983/UC3-Foundations-latest.pdf
Accessioning Workflow
Steps, in order:
•Start accessioning process
•Retrieve media
•Assign identifiers to media
•Write-protect media
•Record identifying characteristics of media as metadata
•Create image
•Verify image
•Extract filesystem- and file-level metadata
•Package images and metadata for ingest
•Ingest transfer package
•Document accessioning process
•End accessioning process

Inputs and outputs: media; media metadata (Media MD); disk image;
filesystem/file metadata (FS/File MD); transfer package (disk images + metadata)
Implementation


 http://www.flickr.com/photos/generated/2084287794/
Disk Imaging
•dd: creates raw images
 •Related implementations: dc3dd, dcfldd, dd_rescue
 •Fast, but no mechanism to store imaging metadata
•Advanced Forensic Format (AFF)/AFFLib
 •Cross platform, reasonably fast
 •Can store arbitrary metadata
 •Plenty of GUI-based imaging tools
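
Because dd writes a raw image with no place to record imaging metadata, one possible workaround is to compute checksums for the image and keep them, with any other notes, in a separate file beside it. The following is a minimal Python sketch of that idea, not part of dd or AFFLib; the function names and the ".metadata.txt" sidecar naming are hypothetical.

```python
import hashlib


def hash_image(path, algorithms=("md5", "sha1"), chunk_size=1024 * 1024):
    """Return {algorithm: hex digest} for the file at path.

    The image is read in chunks so that large disk images do not have
    to fit in memory.
    """
    digests = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as image:
        for chunk in iter(lambda: image.read(chunk_size), b""):
            for digest in digests.values():
                digest.update(chunk)
    return {name: digest.hexdigest() for name, digest in digests.items()}


def write_sidecar(image_path, metadata):
    """Write key/value imaging metadata to a text file next to the image.

    The sidecar file name and "key: value" layout are assumptions made
    for this sketch, not an established convention.
    """
    sidecar_path = image_path + ".metadata.txt"
    with open(sidecar_path, "w", encoding="utf-8") as sidecar:
        for key in sorted(metadata):
            sidecar.write(f"{key}: {metadata[key]}\n")
    return sidecar_path
```

Re-running hash_image later and comparing the digests against the sidecar gives a simple form of the "Verify image" step.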
AFF File Structure




Garfinkel et al. 2006 (http://nrs.harvard.edu/urn-3:HUL.InstRepos:2829932), p. 27
Accessioning Workflow
Workflow steps marked "AFF" in the diagram: Record identifying characteristics
of media as metadata; Create image; Verify image
The Sleuth Kit
•Open source C library, command line tools, and browser-
  based Perl application (Autopsy) for forensic analysis

•Supports analysis of NTFS, FAT, HFS+, Ext2/3, UFS1/2
•Splits tools into layers: volume system, file system, file
  name, metadata, data unit (“block”)

•Additional utilities to sort and post-process extracted
  metadata
Extracting Metadata & Files
$ fls -m A: -a -f fat 2004-M-008.0018.aff
0|A:/_ublist1.wpd (deleted)|3|r/rrwxrwxrwx|0|0|202152|982818000|982881052|0|982881114
0|A:/publist.wpd|4|r/rrwxrwxrwx|0|0|202119|1285041600|982963054|0|982963226
0|A:/publistwkg.wpd|7|r/rrwxrwxrwx|0|0|205607|1285041600|982963038|0|982963237
0|A:/$MBR|45779|v/v---------|0|0|512|0|0|0|0
0|A:/$FAT1|45780|v/v---------|0|0|4608|0|0|0|0
0|A:/$FAT2|45781|v/v---------|0|0|4608|0|0|0|0
0|A:/$OrphanFiles|45782|d/d---------|0|0|0|0|0|0|0

$ fls -m A: -a -f fat 2004-M-008.0018.aff | mactime
Wed Dec 31 1969 19:00:00   202152 ..c. r/rrwxrwxrwx   0      0        3        A:/_ublist1.wpd (deleted)
                           202119 ..c. r/rrwxrwxrwx   0      0        4        A:/publist.wpd
                           205607 ..c. r/rrwxrwxrwx   0      0        7        A:/publistwkg.wpd
Thu Feb 22 2001 00:00:00   202152 .a.. r/rrwxrwxrwx   0      0        3        A:/_ublist1.wpd (deleted)
Thu Feb 22 2001 17:30:52   202152 m... r/rrwxrwxrwx   0      0        3        A:/_ublist1.wpd (deleted)
Thu Feb 22 2001 17:31:54   202152 ...b r/rrwxrwxrwx   0      0        3        A:/_ublist1.wpd (deleted)
Fri Feb 23 2001 16:17:18   205607 m... r/rrwxrwxrwx   0      0        7        A:/publistwkg.wpd
Fri Feb 23 2001 16:17:34   202119 m... r/rrwxrwxrwx   0      0        4        A:/publist.wpd
Fri Feb 23 2001 16:20:26   202119 ...b r/rrwxrwxrwx   0      0        4        A:/publist.wpd
Fri Feb 23 2001 16:20:37   205607 ...b r/rrwxrwxrwx   0      0        7        A:/publistwkg.wpd
Tue Sep 21 2010 00:00:00   202119 .a.. r/rrwxrwxrwx   0      0        4        A:/publist.wpd
                           205607 .a.. r/rrwxrwxrwx   0      0        7        A:/publistwkg.wpd

$ icat 2004-M-008.0018.aff 4 | fido.sh -
OK,110,x-fmt/44,WordPerfect for MS-DOS/Windows Document,WordPerfect for Windows 6.x - 12,202119,"STDIN"

$ icat 2004-M-008.0018.aff 4 > publist.wpd
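
The pipe-delimited lines printed by fls -m can also be processed directly, without mactime. The sketch below assumes the field order of the Sleuth Kit 3.x body file format (MD5, name, inode, mode, UID, GID, size, atime, mtime, ctime, crtime), which matches the sample output above; the function names are hypothetical. A timestamp of 0 means the value is not set, which is why mactime reports it as Dec 31 1969 19:00:00 on a machine five hours behind UTC.

```python
from datetime import datetime, timezone

# Field order assumed from the Sleuth Kit 3.x body file format.
BODY_FIELDS = ("md5", "name", "inode", "mode", "uid", "gid",
               "size", "atime", "mtime", "ctime", "crtime")
INTEGER_FIELDS = ("size", "atime", "mtime", "ctime", "crtime")


def parse_body_line(line):
    """Split one line of fls -m output into a dict keyed by field name.

    File names that themselves contain "|" are not handled here.
    """
    values = line.rstrip("\n").split("|")
    if len(values) != len(BODY_FIELDS):
        raise ValueError(f"expected {len(BODY_FIELDS)} fields, got {len(values)}")
    record = dict(zip(BODY_FIELDS, values))
    for key in INTEGER_FIELDS:
        record[key] = int(record[key])
    return record


def to_utc(epoch_seconds):
    """Convert epoch seconds to a timezone-aware UTC datetime.

    Returns None for 0, which the body file uses for an unset timestamp.
    """
    if epoch_seconds == 0:
        return None
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
```

Converting to UTC explicitly avoids the local-timezone shift visible in the mactime listing.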
Accessioning Workflow
Workflow steps marked "TSK" in the diagram: Verify image; Extract filesystem-
and file-level metadata
fiwalk
•C++ program with Python module for processing images
•Outputs results in plain text key/value pairs, XML, CSV,
  or ARFF (for Weka data mining software)

•Developed to support automated forensic processing by
  breaking it into three steps: extract, represent, process

•Pluggable file-level metadata extraction (expects key/
  value pairs)

•Makes development easy and fast
Sample fiwalk Output
<?xml version='1.0' encoding='UTF-8'?>
<fiwalk xmloutputversion='0.3'>
  <metadata><!-- metadata about the disk image --></metadata>
  <creator> <!-- fiwalk provenance metadata (runtime environment, etc.) --></creator>
  <source>
    <image_filename>2004-M-008.dd-0018.001</image_filename>
  </source>
<!-- fs start: 0 -->
  <volume offset='0'>
    <!-- volume metadata -->
    <fileobject>
      <filename>_ublist1.wpd</filename>
      <!-- more metadata about specific files within the image -->
    </fileobject>
    <fileobject/><!-- one for each file -->
  </volume>
  <runstats>
    <!-- runtime statistics -->
  </runstats>
</fiwalk>
Sample fiwalk Output
<fileobject>
  <filename>_ublist1.wpd</filename>
  <partition>1</partition>
  <id>1</id>
  <name_type>r</name_type>
  <filesize>202152</filesize>
  <unalloc>1</unalloc>
  <used>1</used>
  <inode>3</inode>
  <meta_type>1</meta_type>
  <mode>511</mode>
  <nlink>0</nlink>
  <uid>0</uid>
  <gid>0</gid>
  <mtime>982881052</mtime>
  <atime>982818000</atime>
  <crtime>982881114</crtime>
  <libmagic>(Corel/WP)</libmagic>
  <byte_runs>
   <run file_offset='0' fs_offset='16896' img_offset='16896' len='512'/>
  </byte_runs>
  <hashdigest type='md5'>d7bc22242c0a88fd8b68712980d5ab28</hashdigest>
  <hashdigest type='sha1'>64bf2bdf82e33fcda50158804483ac611e753db5</hashdigest>
</fileobject>
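
Since fiwalk's XML is regular, a few lines of Python are enough to pull out the values needed for an assessment. The following is a minimal sketch using only the standard library and the element names shown in the sample above; it does not use fiwalk's own Python module, and the function names and returned keys are hypothetical.

```python
import xml.etree.ElementTree as ET


def _int_or_none(element, tag):
    """Return the integer text of a child element, or None if it is absent."""
    text = element.findtext(tag)
    return int(text) if text is not None else None


def summarize_fileobject(element):
    """Reduce one <fileobject> element to a small dict of selected values."""
    return {
        "filename": element.findtext("filename"),
        "inode": element.findtext("inode"),
        "filesize": _int_or_none(element, "filesize"),
        "mtime": _int_or_none(element, "mtime"),
        # <unalloc>1</unalloc> marks a file whose entry is unallocated.
        "unallocated": element.findtext("unalloc") == "1",
        "hashes": {h.get("type"): h.text for h in element.findall("hashdigest")},
    }


def iter_fileobjects(xml_text):
    """Yield a summary dict for each named <fileobject> in fiwalk XML text."""
    root = ET.fromstring(xml_text)
    for element in root.iter("fileobject"):
        if element.find("filename") is not None:  # skip empty placeholders
            yield summarize_fileobject(element)
```

For very large images, ET.iterparse would avoid loading the whole document at once; the approach above favors brevity.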
Accessioning Workflow
Workflow step marked "fiwalk" in the diagram: Extract filesystem- and
file-level metadata
Why fiwalk?
•Faster (and more forensically sound) to extract metadata
  once rather than having to keep processing an image

•Develop better assessments during accessioning process
  (directory structure significant? timestamps accurate?)

•fiwalk’s output is something like a METS structMap
•Building non-invasive assessment tools takes less time
http://www.flickr.com/photos/cdrummbks/4574882817/
Gumshoe
•Prototype application
•Blacklight (Ruby on Rails + Solr) & Python indexing code
•Indexing code works with fiwalk output or directly over a
  disk image (using fiwalk’s Python bindings)

•Populates Solr index with all file-level metadata from
  fiwalk and text strings extracted from files

•Code at http://github.com/anarchivist/gumshoe
•Demo at http://xgumshoex.heroku.com/
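
To illustrate the indexing step in general terms, the sketch below maps one file-level metadata record to a flat document and serializes a batch as JSON. It is not Gumshoe's actual indexing code: the field names, the "_s"/"_l" suffixes (which assume a Solr schema with dynamic string and long fields), and the id scheme are all hypothetical, and nothing is sent to a server.

```python
import json


def to_solr_document(image_id, fileobject):
    """Map one parsed file record (a dict) to a flat, Solr-style document.

    All field names here are illustrative assumptions, not Gumshoe's schema.
    """
    document = {
        # Combine image identifier and inode so ids are unique per file.
        "id": f"{image_id}/{fileobject['inode']}",
        "image_s": image_id,
        "filename_s": fileobject["filename"],
        "filesize_l": fileobject["filesize"],
    }
    for algorithm, digest in fileobject.get("hashes", {}).items():
        document[f"{algorithm}_s"] = digest
    return document


def to_update_payload(documents):
    """Serialize documents as a JSON array, a shape that recent Solr
    versions accept on their JSON update handler."""
    return json.dumps(list(documents), indent=2)
```

Keeping the mapping separate from the upload makes it easy to inspect the payload before anything is indexed.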
Future Directions




 http://www.flickr.com/photos/87913776@N00/5129662279/
AFF4
•Emerging format, with tools still under development
•Better for a distributed environment
•Friendlier to micro-services philosophy
•Clearer object and metadata model (RDF-based)
•Container format is Zip64
•Cohen and Schatz 2010 (doi:10.1016/j.diin.2010.05.015)
  show hash-based imaging as a more efficient alternative
Sample AFF4 Metadata
                @prefix   D1: <urn:aff4:652e4027-27fab2941>
                @prefix   G1: <urn:aff4:19857a87-a190b2f87>
                @prefix   G2: <urn:aff4:0a1fc78a-927bfacef>
                @prefix   S1: <urn:aff4:652e4027-ffff01199>
                @prefix   I1: <urn:aff4:9003027a-11199ffff>
                @prefix   aff4: <http://afflib.org/2009/aff4/#>
                @prefix   dcterms: <http://purl.org/dc/terms/>

                G1: {
                    D1: aff4:serialNumber "zx322o91"
                    D1: aff4:hash "3897450fa18094b13"^^aff4:md5
                }

                G2: {
                    S1:   aff4:name "aff4imager"
                    S1:   aff4:vendor <http://aff.org/>
                    S1:   aff4:asserts G1:
                    S1:   aff4:type aff4:AcquisitionTool.
                    S1:   aff4:version "0.2"
                    I1:   aff4:type aff4:Image.
                    I1:   aff4:hash "3897450fa18094b13"^^aff4:md5
                    I1:   dcterms:creator S1:
                }

 Adapted from Schatz and Cohen 2010 (doi:10.1007/978-3-642-15506-2_16), pp. 234-236
Accessioning Workflow
Workflow steps marked "AFF4" in the diagram: Assign identifiers to media;
Record identifying characteristics of media as metadata; Create image;
Verify image; Package images and metadata for ingest
Further Toolsets


  http://www.flickr.com/photos/oskay/5369749968/
Thank You

                        mark@matienzo.org
                        http://matienzo.org/
                        twitter: @anarchivist




This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

fiwalk With Me: Building Emergent Pre-Ingest Workflows for Digital Archival Records using Open Source Forensic Software

  • 1. fiwalk With Me: Building Emergent Pre-Ingest Workflows for Digital Archival Records using Open Source Forensic Software. Mark A. Matienzo, Yale University Library. Code4lib 2011. mark@matienzo.org http://matienzo.org/ @anarchivist
  • 2. Disclaimer The following presentation expresses opinions of my own and not of my employer, my coworkers, etc.
  • 3. Digital forensics? http://www.flickr.com/photos/freeparking/480863346/
  • 6. Key Works urn:isbn:978-0201634976 urn:isbn:978-0321268174 urn:isbn:978-1932326376
  • 8. Design Principles
      • Use digital forensics software and methodology to support accessioning of born-digital archival records
      • Mitigate risk of media deterioration and obsolescence
      • Prefer open source solutions whenever possible
      • Integrate into a larger, but yet-to-be-defined workflow
      • Use curation micro-services as a guiding philosophy for implementation and further analysis
  • 9. Applied Methodology
      • Use Carrier’s (2005) model of the digital investigation process: Preservation → Searching → Reconstruction
      • Volume and file system as main areas for analysis
      • Assume much of the state is already lost
      • Methods should approach or intend forensic soundness
      • Ethical issues (as raised in CLIR report) are out of scope
  • 11. A Larger Workflow
      Phases (linked by “Selection” steps): Collection development, Accessioning, Arrangement & description, Access
      Collection development: Phys. format inventory; Document rcrdkeeping systems; Draft submission agmt; Gather contextual info
      Accessioning: Detailed inventory; Disk imaging; File format survey; Assess access reqmts; Complete submission agreement w/ source; Assess preserv. needs
      Arrangement & description: Create desc metadata; Intellectual/phys. arr.; Lump/split dig.objects; Declare rights/access restrictions; Develop access tools; Migrate selected recs
      Access: Enforce restrictions; Add supplemental desc./annotations; Use viewers as needed; Log use/access
      Preservation: Migrate objects/files; JHOVE/DROID reports; Checksum verification; Emulation/virt. env.; Physical & network media
      Storage: Working storage; Preservation storage; Dissemination storage
  • 12. Open Source Forensics
      • Digital forensics is a high-stakes market
      • Proprietary forensics software is not easily extensible
      • Proprietary forensics software is often platform-specific
      • Cultural heritage institutions are still an emerging market for digital forensics
      • Our needs are different and still being defined
  • 13. Microservices as Philosophy http://www.flickr.com/photos/gregmote/2797330534
  • 14. Principles, Preferences, Practices
      Principles: Granularity; Orthogonality; Parsimony; Evolution
      Preferences: Small and simple over large and complex; Minimally sufficient over feature-laden; Configurable over the prescribed; The proven over the merely novel; Outcomes over means
      Practices: Define, decompose, recurse; Top down design, bottom up implementation; Code to interfaces; Sufficiency through a series of incrementally necessary steps
      Source: UC Curation Center/California Digital Library, “UC3 Curation Foundations” https://confluence.ucop.edu/download/attachments/13860983/UC3-Foundations-latest.pdf
  • 15. Accessioning Workflow
      Steps in the diagram: Start accessioning process; Retrieve media; Assign identifiers to media; Write-protect media; Record identifying characteristics of media as metadata; Create image; Verify image; Extract filesystem- and file-level metadata; Package images and metadata for ingest; Ingest transfer package; Document accessioning process; End accessioning process
      Artifacts in the diagram: Media; Disk images; Metadata; Transfer package; Media MD; FS/File MD; Disk image
  • 17. Disk Imaging
      • dd: creates raw images
      • Related implementations: dc3dd, dcfldd, dd_rescue
      • Fast, but no mechanism to store imaging metadata
      • Advanced Forensic Format (AFF)/AFFLib
      • Cross platform, reasonably fast
      • Can store arbitrary metadata
      • Plenty of GUI-based imaging tools
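Since a raw dd image carries no imaging metadata of its own, one workaround is to record that metadata in a separate file next to the image. The following is a minimal sketch of that idea in Python, not something shown in the slides: the function names, the JSON sidecar convention, and the field names (`media_id`, `recorded_at`, and so on) are all assumptions made for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def describe_raw_image(image_path, media_id, chunk_size=1024 * 1024):
    """Hash a raw disk image and return a metadata record for it."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    size = 0
    with open(image_path, "rb") as image:
        # Read in chunks so large images do not have to fit in memory.
        for chunk in iter(lambda: image.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
            size += len(chunk)
    return {
        "media_id": media_id,  # hypothetical identifier assigned at accessioning
        "image_filename": Path(image_path).name,
        "size_bytes": size,
        "md5": md5.hexdigest(),
        "sha1": sha1.hexdigest(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def write_sidecar(image_path, media_id):
    """Write the record next to the image as <image name>.json (an assumed convention)."""
    record = describe_raw_image(image_path, media_id)
    sidecar = Path(str(image_path) + ".json")
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar
```

A container format such as AFF makes the sidecar unnecessary, because the same key/value metadata can travel inside the image file itself.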
  • 18. AFF File Structure. Garfinkel et al. 2006 (http://nrs.harvard.edu/urn-3:HUL.InstRepos:2829932), p. 27
  • 19. Accessioning Workflow (the slide 15 diagram again, with three “AFF” labels overlaid on it)
  • 20. The Sleuth Kit
      • Open source C library, command line tools, and browser-based Perl application (Autopsy) for forensic analysis
      • Supports analysis of NTFS, FAT, HFS+, Ext2/3, UFS1/2
      • Splits tools into layers: volume system, file system, file name, metadata, data unit (“block”)
      • Additional utilities to sort and post-process extracted metadata
  • 21. Extracting Metadata & Files
      $ fls -m A: -a -f fat 2004-M-008.0018.aff
      0|A:/_ublist1.wpd (deleted)|3|r/rrwxrwxrwx|0|0|202152|982818000|982881052|0|982881114
      0|A:/publist.wpd|4|r/rrwxrwxrwx|0|0|202119|1285041600|982963054|0|982963226
      0|A:/publistwkg.wpd|7|r/rrwxrwxrwx|0|0|205607|1285041600|982963038|0|982963237
      0|A:/$MBR|45779|v/v---------|0|0|512|0|0|0|0
      0|A:/$FAT1|45780|v/v---------|0|0|4608|0|0|0|0
      0|A:/$FAT2|45781|v/v---------|0|0|4608|0|0|0|0
      0|A:/$OrphanFiles|45782|d/d---------|0|0|0|0|0|0|0
      $ fls -m A: -a -f fat 2004-M-008.0018.aff | mactime
      Wed Dec 31 1969 19:00:00   202152 ..c. r/rrwxrwxrwx 0 0 3 A:/_ublist1.wpd (deleted)
                                 202119 ..c. r/rrwxrwxrwx 0 0 4 A:/publist.wpd
                                 205607 ..c. r/rrwxrwxrwx 0 0 7 A:/publistwkg.wpd
      Thu Feb 22 2001 00:00:00   202152 .a.. r/rrwxrwxrwx 0 0 3 A:/_ublist1.wpd (deleted)
      Thu Feb 22 2001 17:30:52   202152 m... r/rrwxrwxrwx 0 0 3 A:/_ublist1.wpd (deleted)
      Thu Feb 22 2001 17:31:54   202152 ...b r/rrwxrwxrwx 0 0 3 A:/_ublist1.wpd (deleted)
      Fri Feb 23 2001 16:17:18   205607 m... r/rrwxrwxrwx 0 0 7 A:/publistwkg.wpd
      Fri Feb 23 2001 16:17:34   202119 m... r/rrwxrwxrwx 0 0 4 A:/publist.wpd
      Fri Feb 23 2001 16:20:26   202119 ...b r/rrwxrwxrwx 0 0 4 A:/publist.wpd
      Fri Feb 23 2001 16:20:37   205607 ...b r/rrwxrwxrwx 0 0 7 A:/publistwkg.wpd
      Tue Sep 21 2010 00:00:00   202119 .a.. r/rrwxrwxrwx 0 0 4 A:/publist.wpd
                                 205607 .a.. r/rrwxrwxrwx 0 0 7 A:/publistwkg.wpd
      $ icat 2004-M-008.0018.aff 4 | fido.sh -
      OK,110,x-fmt/44,WordPerfect for MS-DOS/Windows Document,WordPerfect for Windows 6.x - 12,202119,"STDIN"
      $ icat 2004-M-008.0018.aff 4 > publist.wpd
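The pipe-delimited lines that `fls -m` prints can also be read directly by a script instead of being passed to `mactime`. The sketch below assumes the field order that the sample output and its `mactime` listing imply (MD5, name, inode, mode, UID, GID, size, atime, mtime, ctime, crtime); `BodyEntry`, `parse_body_line`, and `to_utc` are illustrative names, not part of The Sleuth Kit.

```python
from collections import namedtuple
from datetime import datetime, timezone

BodyEntry = namedtuple(
    "BodyEntry",
    "md5 name inode mode uid gid size atime mtime ctime crtime",
)


def parse_body_line(line):
    """Split one line of `fls -m` output into named fields."""
    md5, rest = line.rstrip("\n").split("|", 1)
    # Split the remaining fields from the right, so that a "|" inside a
    # file name does not shift the numeric columns.
    name, inode, mode, uid, gid, size, atime, mtime, ctime, crtime = rest.rsplit("|", 9)
    return BodyEntry(
        md5, name, inode, mode, int(uid), int(gid), int(size),
        int(atime), int(mtime), int(ctime), int(crtime),
    )


def to_utc(epoch_seconds):
    """Convert a Unix timestamp to an aware UTC datetime; 0 means 'not recorded'."""
    if epoch_seconds == 0:
        return None
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


# One line copied from the fls output above.
SAMPLE = "0|A:/publist.wpd|4|r/rrwxrwxrwx|0|0|202119|1285041600|982963054|0|982963226"
entry = parse_body_line(SAMPLE)
print(entry.name, entry.size, to_utc(entry.mtime))
```

Times are converted to UTC here, whereas the `mactime` listing above is in the local time zone of the machine that ran it, so the two differ by that offset.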
  • 26. Accessioning Workflow (the slide 15 diagram again, with two “TSK” labels overlaid on it)
  • 27. fiwalk
      • C++ program with Python module for processing images
      • Outputs results in plain text key/value pairs, XML, CSV, or ARFF (for Weka data mining software)
      • Developed to support automated forensic processing by breaking it into three steps: extract, represent, process
      • Pluggable file-level metadata extraction (expects key/value pairs)
      • Makes development easy and fast
  • 28. Sample fiwalk Output
      <?xml version='1.0' encoding='UTF-8'?>
      <fiwalk xmloutputversion='0.3'>
        <metadata> <!-- metadata about the disk image --> </metadata>
        <creator> <!-- fiwalk provenance metadata (runtime environment, etc.) --> </creator>
        <source>
          <image_filename>2004-M-008.dd-0018.001</image_filename>
        </source>
        <!-- fs start: 0 -->
        <volume offset='0'>
          <!-- volume metadata -->
          <fileobject>
            <filename>_ublist1.wpd</filename>
            <!-- more metadata about specific files within the image -->
          </fileobject>
          <fileobject/><!-- one for each file -->
        </volume>
        <runstats> <!-- runtime statistics --> </runstats>
      </fiwalk>
  • 29. Sample fiwalk Output
      <fileobject>
        <filename>_ublist1.wpd</filename>
        <partition>1</partition>
        <id>1</id>
        <name_type>r</name_type>
        <filesize>202152</filesize>
        <unalloc>1</unalloc>
        <used>1</used>
        <inode>3</inode>
        <meta_type>1</meta_type>
        <mode>511</mode>
        <nlink>0</nlink>
        <uid>0</uid>
        <gid>0</gid>
        <mtime>982881052</mtime>
        <atime>982818000</atime>
        <crtime>982881114</crtime>
        <libmagic>(Corel/WP)</libmagic>
        <byte_runs>
          <run file_offset='0' fs_offset='16896' img_offset='16896' len='512'/>
        </byte_runs>
        <hashdigest type='md5'>d7bc22242c0a88fd8b68712980d5ab28</hashdigest>
        <hashdigest type='sha1'>64bf2bdf82e33fcda50158804483ac611e753db5</hashdigest>
      </fileobject>
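Because this output is plain XML, it can be processed with a general-purpose parser as well as with fiwalk's own Python module. The sketch below uses only the Python standard library and the element names visible in the sample above; it assumes un-namespaced elements, as in the sample, and `iter_fileobjects` is an illustrative name rather than a fiwalk function.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the sample above, wrapped in the outer elements.
SAMPLE = """<fiwalk xmloutputversion='0.3'>
  <volume offset='0'>
    <fileobject>
      <filename>_ublist1.wpd</filename>
      <filesize>202152</filesize>
      <unalloc>1</unalloc>
      <mtime>982881052</mtime>
      <hashdigest type='md5'>d7bc22242c0a88fd8b68712980d5ab28</hashdigest>
      <hashdigest type='sha1'>64bf2bdf82e33fcda50158804483ac611e753db5</hashdigest>
    </fileobject>
  </volume>
</fiwalk>"""


def iter_fileobjects(xml_text):
    """Yield one dictionary per <fileobject> element in fiwalk XML output."""
    root = ET.fromstring(xml_text)
    for fobj in root.iter("fileobject"):
        yield {
            "filename": fobj.findtext("filename"),
            "filesize": int(fobj.findtext("filesize", default="0")),
            # <unalloc>1</unalloc> marks an unallocated (deleted) entry.
            "unallocated": fobj.findtext("unalloc") == "1",
            "hashes": {h.get("type"): h.text for h in fobj.findall("hashdigest")},
        }


for record in iter_fileobjects(SAMPLE):
    print(record["filename"], record["filesize"], record["hashes"]["md5"])
```

The resulting dictionaries are the kind of per-file record that can be handed to later steps, such as packaging or indexing, without reopening the disk image.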
  • 30. Accessioning Workflow (the slide 15 diagram again, with one “fiwalk” label overlaid on it)
  • 31. Why fiwalk?
      • Faster (and more forensically sound) to extract metadata once rather than having to keep processing an image
      • Develop better assessments during accessioning process (directory structure significant? timestamps accurate?)
      • fiwalk’s output is something like a METS structMap
      • Building non-invasive assessment tools takes less time
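As one example of such an assessment, the "timestamps accurate?" question can be approached by flagging values that are missing or that fall after the material was accessioned. The following is only a sketch under stated assumptions: `timestamp_flags` is a hypothetical helper, the record keys follow the element names in the sample fiwalk output, and the accession date used in the example is a made-up value.

```python
from datetime import datetime, timezone


def timestamp_flags(record, accession_date):
    """Return a list of human-readable warnings for one file's timestamps."""
    flags = []
    cutoff = accession_date.timestamp()
    for field in ("mtime", "atime", "crtime"):
        value = record.get(field)
        if not value:
            # Absent or 0: the file system did not record this time.
            flags.append(f"{field} missing or zero")
        elif value >= cutoff:
            # The medium was probably touched after it entered custody.
            flags.append(f"{field} is on or after the accession date")
    return flags


# Timestamps for publist.wpd taken from the fls output earlier in the deck.
publist = {"mtime": 982963054, "atime": 1285041600, "crtime": 982963226}
# Hypothetical accession date, for illustration only.
accessioned = datetime(2004, 1, 1, tzinfo=timezone.utc)
print(timestamp_flags(publist, accessioned))
```

A flagged access time does not by itself show that anything is wrong; it only marks a value that an archivist may want to explain before relying on it.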
  • 33. Gumshoe
      • Prototype application
      • Blacklight (Ruby on Rails + Solr) & Python indexing code
      • Indexing code works with fiwalk output or directly over a disk image (using fiwalk’s Python bindings)
      • Populates Solr index with all file-level metadata from fiwalk and text strings extracted from files
      • Code at http://github.com/anarchivist/gumshoe
      • Demo at http://xgumshoex.heroku.com/
  • 35. AFF4
      • Emerging format, with tools still under development
      • Better for a distributed environment
      • Friendlier to micro-services philosophy
      • Clearer object and metadata model (RDF-based)
      • Container format is Zip64
      • Cohen and Schatz 2010 (doi:10.1016/j.diin.2010.05.015) show hash-based imaging as a more efficient alternative
  • 36. Sample AFF4 Metadata
      @prefix D1: <urn:aff4:652e4027-27fab2941>
      @prefix G1: <urn:aff4:19857a87-a190b2f87>
      @prefix G2: <urn:aff4:0a1fc78a-927bfacef>
      @prefix S1: <urn:aff4:652e4027-ffff01199>
      @prefix I1: <urn:aff4:9003027a-11199ffff>
      @prefix aff4: <http://afflib.org/2009/aff4/#>
      @prefix dcterms: <http://purl.org/dc/terms/>
      G1: {
        D1: aff4:serialNumber "zx322o91"
        D1: aff4:hash "3897450fa18094b13"^^aff4:md5
      }
      G2: {
        S1: aff4:name "aff4imager"
        S1: aff4:vendor <http://aff.org/>
        S1: aff4:asserts G1:
        S1: aff4:type aff4:AcquisitionTool.
        S1: aff4:version "0.2"
        I1: aff4:type aff4:Image.
        I1: aff4:hash "3897450fa18094b13"^^aff4:md5
        I1: dcterms:creator S1:
      }
      Adapted from Schatz and Cohen 2010 (doi:10.1007/978-3-642-15506-2_16), pp. 234-236
  • 37. Accessioning Workflow (the slide 15 diagram again, with five “AFF4” labels overlaid on it)
  • 38. Further Toolsets http://www.flickr.com/photos/oskay/5369749968/
  • 40. Thank You. mark@matienzo.org http://matienzo.org/ twitter: @anarchivist. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.