Lecture 4: Texts and Models

        Prof. Alvarado
      MDST 3703/7703
      11 September 2012
Review
• Posting “Hello, World!”
  – Put file in the public_html directory of your UVA
    Home Directory
  – Create a post and insert a link to this file
  – Categorize as: 09.06: (S) HTML
• If you cannot get to your home directory, try
  uploading to
  http://homedir.virginia.edu
Some Quick Corrections
• Digital text is not necessary
   – It’s an open question (i.e. do we have to have it?)
• Nelson did not conceive of “trails,” Bush did
• HTML is not the “first big idea” in the liberal arts;
  hypertext is (according to me)
• The idea that “text shapes knowledge” is not
  ancient, but relatively new
   – Media determinism is a 20th century perspective
   – Although Plato notes the effects of literacy in the Phaedrus
• Not everything can be translated into HTML
   – i.e. HTML is not the richest framework for digital
     representation
Your Questions and Observations
• Is commercialization killing creativity?
  – What is the relationship between how the web is
    organized economically and how it shapes
    expression? → EFFECT OF SOCIAL ORGANIZATION
• What happens if the associations that
  someone makes are “off” and illogical to
  others?
  – Does it loosen the way logical connections can be
    made and argued? → EFFECT ON LOGIC
Your Questions and Observations
• Computers in general still heavily rely on a
  hierarchical structure
  – To what extent has rationalization occurred with the
    invention of hypertext?
• Do things lose value and meaning in exchange
  for digital coding?
  – What is the effect of digitization on value?
• Hypertexts and links online can be distracting
  – Non-linear thinking or mindless surfing?
Your Questions and Observations
• People are trying to create the same exact
  classroom experience online that exists in the
  physical classroom, which is impossible
  – We need to rethink and restructure the online
    learning experience as a new and unique learning
    experience
• How can we keep hypertext from altering us
  too much?
• The beauty and the risk of an open source web
Practical Questions
• How can an HTML webpage on your own computer
  be found by the search bar but not be on the web?
  – Your browser lives on your machine
  – The protocol name tells it where to look
• I wondered if the picture from my computer would
  still show up if I opened the page from another
  computer?
• It is interesting to see how one little thing out of
  place can ruin the entire code
   – Computers are stupid in that way
• Why should coders learn HTML?
   – HTML is an interface language that can be easily generated
     from print statements in your code
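A minimal Python sketch of that last point: a program can produce HTML with nothing more than string building and a print statement. The function name `render_list` and the sample items are assumptions made for illustration.

```python
from html import escape


def render_list(title, items):
    """Return a small HTML page containing a heading and a bulleted list."""
    lines = ["<html>", "<head><title>%s</title></head>" % escape(title), "<body>"]
    lines.append("<h1>%s</h1>" % escape(title))
    lines.append("<ul>")
    for item in items:
        # escape() turns characters like < and & into entities so they
        # are read as text rather than as markup.
        lines.append("  <li>%s</li>" % escape(item))
    lines.append("</ul>")
    lines.append("</body>")
    lines.append("</html>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_list("Hello, World!", ["Bush", "Nelson", "Landow"]))
```

Saving the printed output to a file in public_html would give a page that a browser can open like any hand-written one.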
What is HTML?
• HTML is not a programming language
  – Programming languages express IF … THEN logic
  – But it is code that obeys a syntax & gets interpreted
  – And it is produced and consumed by programs
• HTML is a very general interface language
• HTML can be written in XML, which we discuss
  today
  – The XML version is technically called “XHTML”
  – The original version was written in SGML
In general, don’t conflate HTML with
       hypertext or with digital
      representation in general
HTML is a language that
generates a species of hypertext
 which is, in turn, a species of
    digital representation
A provisional
   taxonomy
Is hypertext new?
[Study Bible]
[Talmud]
1 = Mishna, the first major transcription of the oral law
2 = Gemara, analytical discussions
3 = Rashi, glossary
4 = Tosefos, additions
5 = Hananel, comments
6 = Eye of Justice, legal decisions
8 = Light of the Bible, references to Biblical quotations
9 = Bach's Annotations
10 = Gra's Annotations
[Charrette]
[The Waste Land]
[Critical Edition]
[OED]
These are all examples of
       traditional texts
They exhibit “latent hypertext”
Landow
• The concept of hypertext parallels
  poststructuralist views of text
  – Barthes, Foucault, Derrida, Kristeva, et al.
• In this view, a text is not, and has never
  been, a bounded, closed thing
  – it is a network of signifiers that connect meanings
    across time and space …
Digital humanists have been
concerned with encoding historical
     texts since at least 1949
Father Busa
• Creator of the Index Thomisticus
• Saw the computer as a solution to indexing
  the works of Aquinas in 1949
  – 13,000,000 words
  – “in” took 4 years
• Solution:
  – Lemmatization
  – Variations tagged as
    instances of a type
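A rough sketch of what "variations tagged as instances of a type" can mean in practice, assuming a tiny hand-made lemma table. The table `LEMMAS` and the function `group_by_lemma` are illustrative inventions, not a description of Busa's actual procedure.

```python
from collections import defaultdict

# Hypothetical, hand-made lemma table: each surface form (variation)
# points to the dictionary headword (type) it is an instance of.
LEMMAS = {
    "est": "sum",
    "sunt": "sum",
    "esse": "sum",
    "deus": "deus",
    "dei": "deus",
    "deum": "deus",
}


def group_by_lemma(tokens):
    """Map each lemma to the list of (position, form) pairs found in tokens."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        form = token.lower()
        # Forms missing from the table are treated as their own lemma.
        lemma = LEMMAS.get(form, form)
        index[lemma].append((position, form))
    return dict(index)
```

Calling `group_by_lemma(["Deus", "est", "et", "dei", "sunt"])` files "est" and "sunt" under the single type "sum".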
The complete works of Aquinas will be typed onto
punch cards; the machines will then work through
the words and produce a systematic index of every
word St. Thomas used, together with the number
of times it appears, where it appears, and the six
words immediately preceding and following each
appearance (to give the context). This will take the
machines 8,125 hours; the same job would be
likely to take one man a lifetime.

   Time Magazine, 1956, “Religion: Sacred: Electronics”
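The index described in the quotation is what is now usually called a concordance. A minimal Python sketch follows, assuming plain whitespace tokenization and a default window of six words; the function name `concordance` is illustrative.

```python
from collections import defaultdict


def concordance(text, window=6):
    """Index every word: where it occurs and its surrounding context."""
    words = text.lower().split()
    index = defaultdict(list)
    for i, word in enumerate(words):
        # Up to `window` words on each side give the context.
        before = words[max(0, i - window):i]
        after = words[i + 1:i + 1 + window]
        index[word].append({"position": i, "before": before, "after": after})
    return index
```

The number of times a word appears is then simply `len(index[word])`.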
So, what is text?

Let’s look at some material
         examples
page o’ text
Real world text
comes packaged in
documents
A document is a
material artifact


How is text
conveyed in
a document?
UVA MDST 3073 Texts and Models-2012-09-11
What is text?
Visual Signifiers
•   Small caps
•   Indentation
•   Alignment
•   Italics
•   Space


All used to signify elements of text
Documents have three Levels:
        Content, Structure, Style
• Content
  – TEXT, images, video clips, etc.
• Structure
  – The organization of content into units (elements)
    and logical relationships (e.g. reading order)
• Style
  – Screen and print layout
  – Fonts, colors, etc.
Descriptive markup languages allow
us to define structure of documents
    for computational purposes

 Theoretically, they do not specify
        layout or content
[PDF, Procedural Markup]




In contrast to procedural markup like PDF
So, how are docs structured?
Hierarchically …




(theoretically)
Document Elements and Structures
Play
  – Act +
       • Scene +
          – Line +

Book
  – Chapter +
       • Verse +

Letter
  – Heading
       • Return Address
       • Date
       • Recipient Info
          – Name
          – Title
          – Address
  – Content
       • Salutation
       • Paragraph +
       • Closing
These are all “trees”
XML is a markup
  language
What is XML?
• Stands for eXtensible Markup Language
   – Actually invented after the web
   – A simplification of SGML, the language used to create
     HTML
   – It specifies a set of rules for creating specialized markup
     languages such as HTML and TEI
• It is a simplified version of SGML
   – Standard Generalized Markup Language
• SGML was invented in the early 1970s to wrest the
  control of documents from computer people who
  were taking over industries like law and accounting
XML looks like this




Notice how the element names reference units, not layout or style
Also markup for “in-line” elements
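A minimal sketch of what such markup might look like, wrapped in Python so it can be parsed with the standard library. The element names (`poem`, `title`, `stanza`, `line`, `name`) are made up for illustration and do not come from a real schema; `name` plays the role of an in-line element.

```python
import xml.etree.ElementTree as ET

# A made-up document: element names are illustrative, not from a real schema.
SAMPLE = """
<poem>
  <title>An Example</title>
  <stanza>
    <line>The first line mentions <name>Aquinas</name> in passing,</line>
    <line>and the second line does not.</line>
  </stanza>
</poem>
"""

root = ET.fromstring(SAMPLE)

# Collect the tag of every element in document order.
tags = [element.tag for element in root.iter()]
```

None of the tags says anything about fonts, indentation or alignment; they only name the units of the text.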
XML Premises
1.   All documents are comprised of elements.
2.   Elements contain content.
3.   Elements have no layout.
4.   Elements are hierarchically ordered.
5.   Elements are to be indicated by “markup” –
     tags that define the beginning and end of an
     element
XML Markup Rules
• Tags signify structural elements
• Three kinds of tag
  – Start and End, e.g. <p> and </p>
  – Singleton, e.g. <br />
• Start and singleton tags can have attributes
  – Simple key/value pairs
  – <div class="stanza" style="color:red;">
• Basic rules
  – All attributes must be quoted
  – All tags must nest (no overlaps!)
Documents in XML that meet
these rules are “well formed”
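One way to test well-formedness, sketched with Python's standard XML parser: the parser refuses any document that breaks the rules. The helper name `is_well_formed` is an assumption for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the string parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


examples = {
    "nested and quoted": '<div class="stanza"><p>text<br /></p></div>',
    "unquoted attribute": "<div class=stanza><p>text</p></div>",
    "overlapping tags": "<p><i>overlap</p></i>",
    "missing end tag": "<p>never closed",
}

if __name__ == "__main__":
    for label, markup in examples.items():
        print(label, is_well_formed(markup))
```

Only the first example passes; each of the others breaks one of the basic rules.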
XML also provides Document Types
• A Document Type Definition (DTD) defines a
  set of tags and rules for using them
  – Specifies elements, attributes, and possible
    combinations
  – E.g. in HTML, the ol and ul elements must contain li
    elements
• A DTD is just one kind of schema system used
  by XML
• Schemas express data models of/for texts
  – TEI is a powerful way of describing primary source
    materials for scholars
• Documents that use a schema properly are
  called “valid”
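Python's standard library does not validate against a DTD, so the sketch below uses a hand-written rule table as a toy stand-in for one. It models only the list rule mentioned above; `ALLOWED_CHILDREN` and `violations` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Toy stand-in for a DTD: which child tags each parent may contain.
# Only the list rule from the slide is modelled here.
ALLOWED_CHILDREN = {
    "ol": {"li"},
    "ul": {"li"},
}


def violations(markup):
    """Return (parent, child) pairs that break the toy content rules."""
    root = ET.fromstring(markup)
    problems = []
    for parent in root.iter():
        allowed = ALLOWED_CHILDREN.get(parent.tag)
        if allowed is None:
            continue  # no rule recorded for this element
        for child in parent:
            if child.tag not in allowed:
                problems.append((parent.tag, child.tag))
    return problems
```

A document can be well formed and still invalid: `<ol><li>a</li><p>b</p></ol>` parses without error, but this check reports the stray `p`. A real DTD also states how many children are required, which this sketch ignores.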
Originally, DTDs defined “genres”
like business letter or mortgage form

They were later used to define more
 abstract models of textual content
XML is used everywhere
• HTML
    – E.g. Embed codes
•   TEI (Text Encoding Initiative)
•   RSS
•   Civilization IV
•   Playlists (e.g. XSPF or “spiff”)
•   Google Maps (KML)
A Look Again at HTML
• aka XHTML
    – And now becoming HTML5
•   An instance of XML (formerly SGML)
•   An interface language
•   Language of the World Wide Web
•   Defined by a DTD that prescribes a specific
    set of elements and relations
HTML Document Structure
• Head
  – Title
  – [Directives]
• Body
  – H1+
  – H2+
     • P+
     • UL
          – LI
Basic Elements with associated Tags
Element         Tags                     Attributes
Paragraph       <p> ... </p>
Numbered List   <ol>
                 <li> ... </li>
                </ol>
Bulleted List   <ul>
                 <li> ... </li>
                </ul>
Table           <table>
                 <tr>
                  <td> ... </td>
                 </tr>
                </table>
Anchor          <a> ... </a>             href, target
Image           <img/>                   src, border
Object          <object> ... </object>
The Text Encoding Initiative created
TEI to mark up scholarly documents
    Mainly primary sources such as
       books and manuscripts
TEI
• The dominant language used to encode
  scholarly text
• The current room was the location of
  UVa’s EText Center
  – World famous for text encoding
  – Now part of the library and catalog
• Scholars create their own schema to match
  what they are interested in
Examples
• The TEI Header
  – http://tbe.kantl.be/TBE/examples/TBED02v00.htm
• TEI Prose
  – http://tbe.kantl.be/TBE/examples/TBED03v00.htm
• Find others at the TEI By Example Project
  – http://tbe.kantl.be/TBE/
XML contains an implicit theory
           of text
           What is it?
OHCO
• XML (and therefore HTML and TEI) implies
  a certain theory of text
  – A text is an OHCO
• OHCO
  – Ordered Hierarchy of Content Objects
• An OHCO is a kind of tree
  – Elements follow each other in sequences
  – Elements can contain other elements
What are the advantages of this
            view?
OHCO allows for easy processing
• Every element has a precise address in the text
  – E.g. /html/body/p[1]
• Texts can be described in the language of
  kinship
  – Ancestors, parents, siblings, children, etc.
• Texts can be restructured and manipulated by
  known patterns and algorithms
  – Traversing
  – Pruning
  – Cross-referencing
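These operations can be sketched with Python's standard ElementTree module. The sample document is invented for illustration, and the path is written relative to the root `html` element because that is how ElementTree resolves paths.

```python
import xml.etree.ElementTree as ET

DOC = """
<html>
  <body>
    <h1>Title</h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

root = ET.fromstring(DOC)

# Addressing: the first p inside body, relative to the root html element.
first_p = root.find("body/p[1]")

# Kinship: ElementTree has no parent pointer, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}
siblings = [el.tag for el in parent_of[first_p] if el is not first_p]

# Traversing: visit every element in document order.
order = [el.tag for el in root.iter()]

# Pruning: remove the heading from its parent.
body = root.find("body")
body.remove(body.find("h1"))
remaining = [el.tag for el in body]
```

Each step relies only on the tree shape, which is why the same few patterns work on any well-formed document.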
What are the disadvantages of
           OHCO?
Logical vs. Physical Structure
Pages and
   Paragraphs


Two common structures
that overlap
Solution 1: Split Elements
<page n="2">
...
<p id="foo">His good looks and his rank had one fair
claim on his attachment, since to them he must have owed a
wife</p>
</page>
<page n="3">
<p id="bar" prev_id="foo"> a very superior character to
anything deserved by his own.</p>
...
</page>
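With split elements, the logical paragraph has to be reassembled by following the links between fragments. A minimal sketch, assuming the two pages are wrapped in a single root element (here called `doc`, an invented name) and that fragments appear in reading order:

```python
import xml.etree.ElementTree as ET

DOC = """
<doc>
  <page n="2">
    <p id="foo">His good looks and his rank had one fair
claim on his attachment, since to them he must have owed a
wife</p>
  </page>
  <page n="3">
    <p id="bar" prev_id="foo"> a very superior character to
anything deserved by his own.</p>
  </page>
</doc>
"""


def logical_paragraphs(markup):
    """Rejoin paragraph fragments linked by prev_id into whole paragraphs."""
    root = ET.fromstring(markup)
    joined = {}      # id of a paragraph's first fragment -> accumulated text
    first_of = {}    # id of any fragment -> id of its first fragment
    for p in root.iter("p"):
        # Collapse line breaks and runs of spaces inside the fragment.
        text = " ".join((p.text or "").split())
        prev = p.get("prev_id")
        if prev is None:
            first_of[p.get("id")] = p.get("id")
            joined[p.get("id")] = text
        else:
            # Assumes the earlier fragment has already been seen.
            head = first_of[prev]
            first_of[p.get("id")] = head
            joined[head] += " " + text
    return list(joined.values())
```

Here the page structure is carried by the tags, and the paragraph structure has to be rebuilt by a program.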
Solution 2: Use “Milestones”

<p>His good looks and his rank had one fair claim on
his attachment, since to them he must have owed a
wife <pb n="3" /> a very superior character to
anything deserved by his own.</p>



     One structure gets backgrounded
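The backgrounded structure is still recoverable, but only by processing. A sketch in Python, assuming the paragraph starts on page 2 (taken from the split-element example) and that the function name `text_by_page` is free to invent:

```python
import xml.etree.ElementTree as ET

PARAGRAPH = (
    '<p>His good looks and his rank had one fair claim on '
    'his attachment, since to them he must have owed a '
    'wife <pb n="3" /> a very superior character to '
    'anything deserved by his own.</p>'
)


def text_by_page(markup, first_page="2"):
    """Split a paragraph's text into per-page pieces at each <pb/> milestone."""
    p = ET.fromstring(markup)
    pages = {first_page: (p.text or "").strip()}
    for pb in p.iter("pb"):
        # Text after an empty element is stored in that element's .tail.
        pages[pb.get("n")] = (pb.tail or "").strip()
    return pages
```

The paragraph is a first-class element in the tree, while the page exists only as a marker that code must interpret.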
Wittgenstein’s Manuscripts




      What about this?
[Charrette]
The problem of overlap suggests
the need for a richer set of tools
What tools do McCarty and
  Unsworth reference?
Tables
A database for Ovid
McCarty
• A different use of markup
  – From document description to interpretation
  – Creative “misuse”
• Reverse engineering a “grammar” of
  personification from a markup strategy
  – Thickness = description (of text)
  – Depth = explanation (of text by reference to grammar)
• Is forced to use tables in collaboration with
  markup
Thick description = Markup
 Deep description = Tables
How to reconcile these tools?
A Proposed Model
• Texts are not documents
  – Documents are media, Texts are messages
• Texts and documents are part of a system
  comprised of “levels”
  – They are effectively archaeological sites with
    stratigraphic layers
  – Erasures are like cities building on top of each other
• Each level of the system is described by an
  appropriate set of tools
  – Document structures → XML
  – Textual structures, embedded ontologies → Tables
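One way to picture the table side of this model is a table of annotations that point at character ranges of the text instead of wrapping it in tags. The sketch below uses Python's built-in sqlite3 module; the table layout, the layer names and the labels are all assumptions made for illustration.

```python
import sqlite3

TEXT = "His good looks and his rank had one fair claim on his attachment"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE annotation "
    "(start_pos INTEGER, end_pos INTEGER, layer TEXT, label TEXT)"
)
# Hypothetical rows: each one points at a character range of TEXT.
# Ranges from different layers are free to overlap, which tags cannot do.
conn.executemany(
    "INSERT INTO annotation VALUES (?, ?, ?, ?)",
    [
        (0, 22, "document", "line 1"),
        (23, 64, "document", "line 2"),
        (4, 27, "text", "coordinated phrase"),
    ],
)


def overlapping(conn, start, end):
    """Return annotations whose range intersects [start, end)."""
    rows = conn.execute(
        "SELECT layer, label, start_pos, end_pos FROM annotation "
        "WHERE start_pos < ? AND end_pos > ? ORDER BY start_pos, end_pos",
        (end, start),
    )
    return rows.fetchall()
```

The phrase at characters 4 to 27 crosses the boundary between the two document lines, something a single XML hierarchy would not allow, and `overlapping(conn, 4, 27)` returns all three rows.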
Basic Levels
• Document
  – Physical objects (paper)
  – Logical objects (defined by space, style, punctuation,
    etc.)
  – Style and layout (also defined by space, color, etc.)
  – Can have superimposed versions
• Text
  – Sequences of characters
  – Grammatical features
  – Figures and poetic features
  – Etc.


Editor's Notes

  • #24: Text becomes reducible to its elements. Basic feature of the medium.
  • #35: (theoretically)
  • #64: http://biblioklept.org/2012/01/31/list-of-rejections-of-wittgensteins-mistress-david-markson/