Visualizing the
Transcribe Bentham Corpus
Frédérique Mélanie, Estelle Tieberghien, Pablo Ruiz Fabo,
Thierry Poibeau
LATTICE Lab: ENS – CNRS – U Paris 3, PSL – USPC
Tim Causer, Melissa Terras
UCL Bentham Project, UCL Digital Humanities
UCLDH Seminar, December 2016
Outline
• UCL Bentham Project & Transcribe Bentham
• How to navigate this corpus? Visualizations
– Lexical extraction
– Co-occurrence networks
• Static view and Temporal evolution
• Evaluation and Challenges
• Other corpus explorations via visualization
• Distant Reading Module, WordTree
• Other lexical analyses
Jeremy Bentham (1748-1832)
•Jurist, philosopher, and legal and
social reformer
•Leading theorist in Anglo-American
philosophy of law
•Influenced the development of
welfarism
•Advocated utilitarianism
•Animal rights
•Work on the “panopticon”
•Not founder of UCL, but...
•60,000 folios in UCL Special Collections
•40,000 untranscribed
•Auto-icon
The Bentham Project
• http://www.ucl.ac.uk/Bentham-Project/
• Since 1959
• “aims to produce a new scholarly
edition of the works and
correspondence of Jeremy Bentham”
• twenty six volumes of the new
Collected Works have been published
• 50 years to transcribe 20,000 folios
• Previous AHRC grant catalogued the
manuscripts
– http://www.benthampapers.ucl.ac.uk/
Facts and Figures (as of 1st July 2016)
• 16,205 manuscripts transcribed/partially-transcribed
• 15,351 (94%) checked and approved
• 83,955 visits
• 34,359 unique views
• Average session time: 14 minutes 13 seconds
• 140 countries
• 514 people have transcribed something
• Most of the work done by the 26 Super Transcribers
• Average of 54 transcripts edited per week since the start of the project
• Average of 56 per week during the last twelve months
• Greatest number of transcripts in any one week: 300 (w/c 14 June 2014)
Transcribe Bentham progress, 8 September 2010 to 20 March 2015
[Figure: chart of manuscripts worked on and completed transcripts over time (vertical axis 0 to 12,000; horizontal axis 8 Sep 2010 to 6 Mar 2015). Annotations: NYT article; BL manuscripts made available.]
With thanks to:
•Prof Philip Schofield (UCL Bentham Project, Principal
Investigator)
•Dr Tim Causer (Bentham Project)
•Dr Kris Grint (Bentham Project)
•Richard Davis (University of London Computer Centre)
•José Martin (ULCC)
•Martin Moyle (UCL Library Services)
•Lesley Pitman (UCL Library Services)
•Tony Slade (UCL Creative Media)
•Miguel Faleiro Rodrigues, Alejandro Salinas Lopez, and
Raheel Nabi (UCL Creative Media)
•Dr Arnold Hunt (British Library)
•Anna-Maria Sichani (Bentham Project)
•Dr Justin Tonra (National University of Ireland Galway)
and Dr Valerie Wallace (Victoria University Wellington),
both formerly of the Bentham Project
•All the partners in Transcriptorium
http://transcriptorium.eu/consortium/
•And Transcribe Bentham’s volunteers!
•Project previously funded by the AHRC and the Andrew W.
Mellon Foundation
Relevant access to a large corpus
• A search index?
• Topic models?
• Corpus cartography?
Challenges for this corpus
• Not an all-English corpus
• Difficulties posed by an historical variety
• Technical language
• Revision history, additions and deletions
Stats for analyzed corpus sample
• Total TEI files: 29,900
• In English: 29,400
• That we dated: 16,700
• We only visualized English transcripts that
we could date (with a simple heuristic)¹
• Work is based on ca. 55% of all the
TEI files in our sample
¹ We were not using the corpus’ date metadata for this exercise
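The slides do not say what the dating heuristic was. Purely as an illustration, a minimal sketch of one possible heuristic is shown below: take the first four-digit year in a transcript that falls within Bentham's lifetime (1748-1832). The function name, the regular expression and the year bounds are assumptions made for this example, not the authors' method.

```python
import re

# Assumption: plausible years are bounded by Bentham's lifetime (1748-1832).
FIRST_YEAR, LAST_YEAR = 1748, 1832
YEAR_PATTERN = re.compile(r"\b(1[78]\d{2})\b")


def guess_year(text):
    """Return the first plausible four-digit year found in the text, or None."""
    for match in YEAR_PATTERN.finditer(text):
        year = int(match.group(1))
        if FIRST_YEAR <= year <= LAST_YEAR:
            return year
    return None
```

A transcript for which `guess_year` returns `None` would simply be left out of the dated subset.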
Corpus Cartography
• Lexical extraction (of relevant sequences)
• Clustering based on similarity measures
• Visual representation (map of the corpus)
based on layout algorithms
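As a rough illustration of how extracted terms become a network that can then be clustered and laid out, the sketch below counts document-level co-occurrences between terms. It is a simplified stand-in, assuming naive substring matching and a hypothetical `cooccurrence_edges` helper; the actual tools used are described in the following slides.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(documents, terms):
    """Count, for each pair of terms, the documents in which both occur."""
    edges = Counter()
    for text in documents:
        lowered = text.lower()
        # Naive substring matching; a real pipeline would match extracted terms.
        present = sorted(t for t in terms if t in lowered)
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges
```

The resulting weighted edge list is the kind of input that clustering and layout steps operate on.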
Cartography tool: CorText
• CorText Manager covers all cartography
steps:
– Lexical extraction
– Clustering
– Visualization
• Each step can be used independently,
thanks to standard import/export formats
Tools combined with CorText
CARTOGRAPHY STEP           TOOLS and RESOURCES
Lexical Extraction         DBpedia Spotlight, YaTeA, Human domain-expert
Clustering                 CorText Analysis
Visualization - Static     Gephi + Sigma JS plugin, CorText MapExplorer, Inkscape
Visualization - Dynamic    CorText Heatmaps, Tubes, Distant Reading
Lexical Extraction
• CorText native option
– Noun-Phrase chunks (based on TreeTagger)
• Our options:
– Entity Linking / Wikification to DBpedia
– Keyphrase extraction tools like YaTeA
• In all cases: manual selection of pre-ranked
candidate terms by a domain-expert
Entity Linking / Wikification
• Given a database with encyclopedic
knowledge (e.g. Wikipedia)
- Find references (mentions) to database terms in the text
- Handle variability in the mentions of a term
[Figure: mentions in the corpus linked to a database term]
Corpus mentions: judicatory, judicial, judicature, Judicatory, Judicial
Entity Linking / Wikification
• Tool: DBpedia Spotlight
• Compares the context of sequences of
words in a text against DBpedia articles:
– Term definition’s text
– Links
– DBpedia structure (redirections etc.)
• Assigns a DBpedia term to the sequence if
a good match is found
Entity Linking / Wikification
Example terms and their variants
Term           Variants
Judiciary      judicature, judicatory, judicial
Jury           jury, juries
Monarch        king, monarch
Quantity       amount, quantity
Saint Peter    Simon Peter, Cephas
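To make the effect of grouping variants under one term concrete, here is a minimal sketch that maps the variants listed above to their terms with a plain dictionary lookup. This is not how DBpedia Spotlight works (it scores candidates against the context); the `link_terms` helper is hypothetical and only illustrates the output of the step.

```python
import re

# Terms and variants taken from the example table above.
VARIANTS = {
    "Judiciary": ["judicature", "judicatory", "judicial"],
    "Jury": ["jury", "juries"],
    "Monarch": ["king", "monarch"],
    "Quantity": ["amount", "quantity"],
    "Saint Peter": ["Simon Peter", "Cephas"],
}

# Invert the table: lower-cased variant -> term.
LOOKUP = {v.lower(): term for term, vs in VARIANTS.items() for v in vs}

# Longest variants first, so multiword variants win over shorter ones.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(v) for v in sorted(LOOKUP, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)


def link_terms(text):
    """Return (mention, term) pairs for every known variant found in the text."""
    return [(m.group(0), LOOKUP[m.group(0).lower()]) for m in PATTERN.finditer(text)]
```

Counting terms rather than surface forms means that "Judicatory" and "judicial" contribute to the same node in the network.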
Entity Linking / Wikification
• Applying a current knowledge-base
(DBpedia) to 18th-19th century texts
• Is this a valid method?
Keyphrase extraction
• YaTeA (Aubin and Hamon, 2006)
• Extracts noun-phrases of configurable
structure and length
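The idea of extracting noun phrases of configurable structure and length can be sketched as follows. This is a toy chunker, not YaTeA: it assumes tokens are already part-of-speech tagged with hypothetical `ADJ`/`NOUN` labels, and it keeps runs of adjectives and nouns that end in a noun and fit the length limits.

```python
def noun_phrases(tagged, min_len=2, max_len=4):
    """Collect adjective/noun runs ending in a noun, within the length limits."""
    phrases, run = [], []
    # A sentinel token flushes the last run.
    for word, tag in tagged + [("", "END")]:
        if tag in ("ADJ", "NOUN"):
            run.append((word, tag))
            continue
        # Drop trailing adjectives so the phrase ends in a noun.
        while run and run[-1][1] != "NOUN":
            run.pop()
        if min_len <= len(run) <= max_len:
            phrases.append(" ".join(w for w, _ in run))
        run = []
    return phrases
```

Changing `min_len`, `max_len` or the accepted tags changes which candidate terms are proposed to the domain-expert.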
Clustering
• CorText offers several similarity metrics
– we chose the default method (distributional)
for homogeneous networks (Weeds & Weir 2005)
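To give an intuition for distributional similarity, the sketch below compares terms by the company they keep: two terms are similar when they co-occur with the same other terms. It uses plain cosine similarity over co-occurrence counts, which is a swapped-in simplification and not the CorText metric or the Weeds & Weir framework; the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def context_vectors(documents, terms):
    """For each term, count the other terms it shares a document with."""
    vectors = defaultdict(Counter)
    for tokens in documents:
        present = set(tokens) & set(terms)
        for term in present:
            for other in present - {term}:
                vectors[term][other] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0
```

With this kind of measure, two terms that never appear in the same document can still be close if they share neighbours, which is what lets clusters form around themes.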
Visualization
• Static (one map for all dated transcripts)
• Dynamic: temporal slices on the corpus
– Heatmaps
– “River” or Sankey networks (“Tubes layout”)
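Temporal slicing can be sketched as grouping dated transcripts into fixed-width periods and counting terms per period. The five-year width is an assumption made for this example, and the helper names are hypothetical.

```python
from collections import Counter, defaultdict


def slice_start(year, width=5):
    """Return the first year of the slice that contains the given year."""
    return year - year % width


def term_counts_by_slice(dated_docs, terms, width=5):
    """Count term occurrences per temporal slice for (year, tokens) pairs."""
    counts = defaultdict(Counter)
    for year, tokens in dated_docs:
        start = slice_start(year, width)
        for token in tokens:
            if token in terms:
                counts[start][token] += 1
    return counts
```

Each slice can then be mapped separately, or the per-slice counts can be compared to show how the salience of a term changes over time.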
http://apps.lattice.cnrs.fr/bentham
Static visualization
CorText network visualized with Gephi
Example term: Bill
Example term: happiness
CorText network made interactive thanks to Gephi’s Sigma JS Exporter
Example term: suffering
Example term: death
Examples: nodes linking clusters
Heatmaps: Saliency per subcorpus
Heatmaps: 1800-1809 subcorpus
Heatmaps: 1810-1819 subcorpus
Dynamic visualization
[Figure: time axis marked 1795, 1800, 1805, 1810]
Evaluation
• Static maps: terms in the clusters
correspond closely to issues dealt with by
Bentham for the thematic areas of each
cluster
• Heatmaps: The evolution depicted
corresponds to the evolution of topics in
Bentham’s work
• DBpedia vs. keyphrase extraction: The
keyphrases provide more relevant
evidence for specialized scholars, whereas a
general encyclopedia can help other users
Challenges
Deleted material, additions
Thematic Variety
• Animal Welfare
• Arts
• Capital punishment
• Civil Code
• Constitutional Code
• Convict transportation
• Correspondence
• Crime & Punishment
• Education
• Law
• Legislation
• Moral Philosophy
• New South Wales
• Panopticon
• Penal Code
• Political Economy
• Preventive Police
• Religion
• Science
• Sexual Morality
• Torture
Formal Variety
• Text sheets
• Copies / Fair copies
• Marginal summary sheets
• Correspondence
• Collectanea
• Rudiments
• Spencers
From http://www.transcribe-bentham.da.ulcc.ac.uk/td/Manuscripts and
http://www.benthampapers.ucl.ac.uk/help.aspx?subject=category
Distant Reading Module
• Follow evolution of selected lexical sequences
Evolution of a lexical item
Temporal evolution
Temporal evolution profiles:
- Here: Rising, but present at all dates
- Other examples: falling, regular spikes etc.
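A profile label such as "rising" or "falling" can be derived from a term's relative frequency per period. The sketch below is one simple way to do so, assuming a strict period-to-period comparison; the labels, the `profile` helper and the rule itself are illustrative choices, not the Distant Reading Module's logic.

```python
def profile(rel_freqs):
    """Label a series of per-period relative frequencies."""
    diffs = [b - a for a, b in zip(rel_freqs, rel_freqs[1:])]
    if not diffs:
        return "mixed"  # a single period has no trend
    if all(d > 0 for d in diffs):
        return "rising"
    if all(d < 0 for d in diffs):
        return "falling"
    return "mixed"


def present_at_all_dates(rel_freqs):
    """True when the term occurs in every period."""
    return all(f > 0 for f in rel_freqs)
```

A series such as regular spikes would fall under "mixed" here; finer categories would need a more elaborate rule.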
Contexts: WordTree
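A word tree (Wattenberg and Viégas, 2008) groups the contexts that follow a chosen word into a tree of shared continuations. A minimal sketch of the underlying data structure is given below, assuming a pre-tokenized text and a hypothetical `word_tree` helper; it omits the frequency-based sizing that the interactive visualization uses.

```python
def word_tree(tokens, root, depth=3):
    """Build a nested dict of the words that follow each occurrence of root."""
    tree = {}
    for i, token in enumerate(tokens):
        if token != root:
            continue
        node = tree
        # Walk down the tree along the next `depth` tokens, creating branches.
        for following in tokens[i + 1 : i + 1 + depth]:
            node = node.setdefault(following, {})
    return tree
```

Shared prefixes such as "evil of" collapse into one branch, so the contexts in which a word is used can be read at a glance.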
Context evolution: Bump Charts
• Example: evil
Neighbours evolution: Bump Charts
• Example: relations among neighbours of evil
Relations in the context: Egonetworks
Evolution of neighbours’ relations
Egonetworks (Period 2)
Egonetworks (Period 3)
Egonetworks (Period 4)
Other Lexical Analyses
• TXM “textometry” tool
– Automatic part-of-speech tagging
– Partition texts according to metadata
– Query corpus using linguistic criteria
– Statistical analyses (overrepresentation, underrepresentation)
[ http://textometrie.ens-lyon.fr/?lang=en ]
Lexical Analysis with TXM
• Partition the corpus according to Category, Year, Decade, Main headings, or other available metadata
• Number of words per Category
Lexical Analyses with TXM
• Over- (or under-) representation of given
words per decade (after partitioning per decade)
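One common way to quantify over-representation is a hypergeometric test: given how often a word occurs in the whole corpus, how likely is it to see at least the observed number of occurrences in one decade by chance? The sketch below implements that calculation under this assumption; TXM's own scoring may differ in detail, and the function name is hypothetical.

```python
from math import comb


def over_representation_p(k, part_size, word_total, corpus_size):
    """P(X >= k) when part_size tokens are drawn from corpus_size tokens,
    of which word_total are occurrences of the word."""
    upper = min(part_size, word_total)
    favourable = sum(
        comb(word_total, i) * comb(corpus_size - word_total, part_size - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(corpus_size, part_size)
```

A very small probability suggests the word is over-represented in that decade; the same idea with P(X <= k) would flag under-representation.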
TXM linguistic queries
• Evil followed by a noun, per text-category
• Sentences containing an adjective + evil
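The two queries above combine a word form with a part-of-speech constraint. The sketch below shows the same logic over a list of (word, tag) pairs, assuming hypothetical `ADJ`/`NOUN` tag labels; it returns matching word pairs rather than whole sentences and is not TXM's query syntax.

```python
def evil_plus_noun(tagged):
    """Return (evil, noun) pairs where 'evil' is directly followed by a noun."""
    return [
        (w1, w2)
        for (w1, _), (w2, t2) in zip(tagged, tagged[1:])
        if w1.lower() == "evil" and t2 == "NOUN"
    ]


def adjective_plus_evil(tagged):
    """Return (adjective, evil) pairs where an adjective directly precedes 'evil'."""
    return [
        (w1, w2)
        for (w1, t1), (w2, _) in zip(tagged, tagged[1:])
        if t1 == "ADJ" and w2.lower() == "evil"
    ]
```

Running such queries per text category or per decade gives counts that can be compared across partitions.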
Summary
• Accessing a large unedited corpus
– Cartography methods
• Lexical extraction
• Maps
– Static picture of the corpus
– Temporal evolution
– Other visualizations (Distant, WordTree)
• Domain-expert feedback
• Challenges
• Other lexical analyses
http://apps.lattice.cnrs.fr/bentham
Bibliography
Aubin, S., and Hamon, T. (2006). Improving Term Extraction with Terminological Resources. In Advances in Natural Language Processing: 5th International Conference on NLP, FinTAL 2006, pp. 380-387. LNAI 4139. Springer.
Auer, Sören, et al. (2007). DBpedia: A nucleus for a web of open data. The Semantic Web. Springer.
Causer, Tim, and Terras, Melissa (2014a). Many hands make light work. Many hands together make merry work: Transcribe Bentham and crowdsourcing manuscript collections. In Crowdsourcing Our Cultural Heritage, ed. M. Ridge. Ashgate.
Causer, Tim, and Terras, Melissa (2014b). Crowdsourcing Bentham: Beyond the Traditional Boundaries of Academic History. International Journal of Humanities and Arts Computing, 8.
Chavalarias, David, and Jean-Philippe Cointet (2013). Phylomemetic Patterns in Science Evolution—The Rise and Fall of Scientific Fields. PLoS ONE 8 (2).
Cortext Manager Documentation (2016). https://docs.cortext.net/.
Mendes, Pablo N., Max Jakob, Andrés García-Silva, and Christian Bizer (2011). DBpedia Spotlight: Shedding Light on the Web of Documents. In Proceedings of the 7th International Conference on Semantic Systems, 1–8. ACM.
Mélanie, F., Tieberghien, E., Ruiz, P., Poibeau, T., Causer, T., and Terras, M. (2016). Mapping the Bentham Corpus. In Digital Humanities Conference (DH 2016). Kraków, Poland.
Poibeau, T., and Ruiz, P. (2015). Generating Navigable Semantic Maps from Social Sciences Corpora. In Digital Humanities Conference (DH 2015). Sydney, Australia.
Rule, Alix, Jean-Philippe Cointet, and Peter S. Bearman (2015). Lexical Shifts, Substantive Changes, and Continuity in State of the Union Discourse, 1790–2014. Proceedings of the National Academy of Sciences 112 (35).
Venturini, T., N. Baya Laffite, J.-P. Cointet, I. Gray, V. Zabban, and K. De Pryck (2014). Three Maps and Three Misunderstandings: A Digital Mapping of Climate Diplomacy. Big Data & Society 1.
Wattenberg, M., and Viégas, F. B. (2008). The word tree, an interactive visual concordance. IEEE Transactions on Visualization and Computer Graphics 14(6), pp. 1221-1228.
Weeds, J., and Weir, D. (2005). Co-occurrence retrieval: A flexible framework for lexical distributional similarity. Computational Linguistics 31(4), 439–475.
& return you all due thanks
pablo.ruiz.fabo@ens.fr http://www.lattice.cnrs.fr/Pablo-Ruiz-Fabo,541
http://apps.lattice.cnrs.fr/

More Related Content

PPT
Sharing an Open Methodology for Building Domain-specific Corpora for EAP
PDF
Linked Data and Knowledge Graphs -- Constructing and Understanding Knowledge ...
PPTX
From Open Access to Open Science: from the Viewpoint of a Scholarly Publisher
PPTX
ARPHA: Next-Generation Journal Publishing
PPT
Linked Data and cultural heritage data: an overview of the approaches from Eu...
PPTX
ISNI identifiers and linked data in the research space la trobe unviersity 20...
PDF
Dr. Iztok Kosem - Innovations in Slovenian (e-)lexicography: from (semi-)auto...
PPTX
The International Standard Name Identifier (ISNI): A Close Look, with Laura D...
Sharing an Open Methodology for Building Domain-specific Corpora for EAP
Linked Data and Knowledge Graphs -- Constructing and Understanding Knowledge ...
From Open Access to Open Science: from the Viewpoint of a Scholarly Publisher
ARPHA: Next-Generation Journal Publishing
Linked Data and cultural heritage data: an overview of the approaches from Eu...
ISNI identifiers and linked data in the research space la trobe unviersity 20...
Dr. Iztok Kosem - Innovations in Slovenian (e-)lexicography: from (semi-)auto...
The International Standard Name Identifier (ISNI): A Close Look, with Laura D...

What's hot (11)

PDF
PDF
Knowledge Patterns for the Web: extraction, transformation, and reuse
PDF
Data wrangling week1
PPT
Lexicography and Lexicology from a Pan-European Perspective: COST ENeL Workin...
PDF
The Standards Mosaic Opening the Way to New Technologies
PPTX
An overview of the PRIDE ecosystem of resources and computational tools for m...
PPTX
The Proteomics Standards Initiative (PSI)
PPTX
Information Extraction from EuroParliament and UK Parliament data
PDF
Datech2014 - Session 5 - Wittgenstein’s Nachlass: WiTTFind and Wittgenstein A...
PPT
OAPEN Göttingen workshop may 9 2012
Knowledge Patterns for the Web: extraction, transformation, and reuse
Data wrangling week1
Lexicography and Lexicology from a Pan-European Perspective: COST ENeL Workin...
The Standards Mosaic Opening the Way to New Technologies
An overview of the PRIDE ecosystem of resources and computational tools for m...
The Proteomics Standards Initiative (PSI)
Information Extraction from EuroParliament and UK Parliament data
Datech2014 - Session 5 - Wittgenstein’s Nachlass: WiTTFind and Wittgenstein A...
OAPEN Göttingen workshop may 9 2012
Ad

Viewers also liked (20)

PPTX
งานอาเมเมเมเ
PDF
Ykone Insights #5 Silicon Switzerland
PPTX
TRABAJO DE LA WEBQUEST
PDF
MIMA Monthly January 2015 - "Content Strategy 2015: Marketing, Mobile, and th...
PPTX
Codes and Conventions
PPTX
Blog
PPTX
Social media presentation
PPTX
Examen bimestral
PPTX
My vacation
PPTX
MU 3313: Music History
PPT
Immigration lawyer jacksonville
PPTX
Pitch new final
PPT
Презентация видео-диагностики по методу Ануашвили (Дети)
PPTX
Eng1023 library instruction_sp2016
PDF
Dr. Kathryn E. Piquette, Cologne Center for eHumanities, Universität zu Köln:...
PPT
Lawyers in jacksonville
PPT
Sandbach Santa Route Map
PDF
Ekaluokan mikroskooppilöydökset blogiin
PPTX
Procesal constitucional
PPTX
Balayage with Balay Lama
งานอาเมเมเมเ
Ykone Insights #5 Silicon Switzerland
TRABAJO DE LA WEBQUEST
MIMA Monthly January 2015 - "Content Strategy 2015: Marketing, Mobile, and th...
Codes and Conventions
Blog
Social media presentation
Examen bimestral
My vacation
MU 3313: Music History
Immigration lawyer jacksonville
Pitch new final
Презентация видео-диагностики по методу Ануашвили (Дети)
Eng1023 library instruction_sp2016
Dr. Kathryn E. Piquette, Cologne Center for eHumanities, Universität zu Köln:...
Lawyers in jacksonville
Sandbach Santa Route Map
Ekaluokan mikroskooppilöydökset blogiin
Procesal constitucional
Balayage with Balay Lama
Ad

Similar to Visualizing the Transcribe Bentham Corpus (20)

PDF
Transcribe Bentham
PPTX
Mdst3703 2013-10-08-thematic-research-collections
PPTX
Transcribe Bentham presentation, Den Haag, 18 April 2013
PPTX
Aquiles imlr seminar
PPTX
This presentation about corpus linguistics
PDF
Usp dh 2013
PDF
Digital Humanities in a Linked Data World - Semnantic Annotations
PDF
Dh usp 2013
PPTX
Introducing Historical Texts' new resources for learning and teaching
PDF
MacroMicroZoom.pdf
PPT
LaTrobe_eCoffee_5-Nov-10
PPTX
Reimagining the Digital Monograph: Improving the Discovery and Use of Scholar...
PPTX
Corpus linguistics
PDF
Forty Years of the OTA
PPT
Development of the database, the website and the online transcription platfor...
PPTX
Scholarly Work 02. Corpus! Thy Name is KD.pptx
PPTX
TEI_train
PPTX
PPTX
Corpus Protocols IFLA Geneva August 2014 by Neil Smyth and Stella Wisdom
Transcribe Bentham
Mdst3703 2013-10-08-thematic-research-collections
Transcribe Bentham presentation, Den Haag, 18 April 2013
Aquiles imlr seminar
This presentation about corpus linguistics
Usp dh 2013
Digital Humanities in a Linked Data World - Semnantic Annotations
Dh usp 2013
Introducing Historical Texts' new resources for learning and teaching
MacroMicroZoom.pdf
LaTrobe_eCoffee_5-Nov-10
Reimagining the Digital Monograph: Improving the Discovery and Use of Scholar...
Corpus linguistics
Forty Years of the OTA
Development of the database, the website and the online transcription platfor...
Scholarly Work 02. Corpus! Thy Name is KD.pptx
TEI_train
Corpus Protocols IFLA Geneva August 2014 by Neil Smyth and Stella Wisdom

More from UCLDH (20)

PPTX
Neil Tarrant Defining Nature’s Limits 9 March 2022.pptx
PDF
Archiving the Medici: History and Future (1370s-2020s)
PPTX
The Pleasures and Sorrows of digitising primary source collections: The Case ...
PPTX
CVT Connect: Co-producing a digital platform for people with learning disabil...
PPTX
The opportunity of accessibility: increasing impact and improving the user ex...
PPT
National Trust 'For Everyone' strategy
PPTX
Digital Lives of People with Learning Disabilities
PPSX
Digital Content and Disability - The Librarian Perspective
PPT
SensusAccess: Alternate Media Made Easy
PPTX
Accessible Publishing
PPT
What might a spoken corpus tell us about language
PPT
“It is Time for the Slaves to Speak”: Transatlantic Abolitionism and African ...
PDF
Oceanic Exchanges presentation
PDF
Digital Face project presentation
PPTX
CrossCult presentation
PDF
Computational History and the Transformation of Public Discourse in Finland, ...
PPT
Where does the born- and reborn-digital material take the Digital Humanities?
PDF
Humanities Crowdsourcing on the Zooniverse Platform
PDF
Managing library collections with friends, favours and a spoonful of sugar
PDF
L taylor ucl_caribbean_digital_dreams_2017
Neil Tarrant Defining Nature’s Limits 9 March 2022.pptx
Archiving the Medici: History and Future (1370s-2020s)
The Pleasures and Sorrows of digitising primary source collections: The Case ...
CVT Connect: Co-producing a digital platform for people with learning disabil...
The opportunity of accessibility: increasing impact and improving the user ex...
National Trust 'For Everyone' strategy
Digital Lives of People with Learning Disabilities
Digital Content and Disability - The Librarian Perspective
SensusAccess: Alternate Media Made Easy
Accessible Publishing
What might a spoken corpus tell us about language
“It is Time for the Slaves to Speak”: Transatlantic Abolitionism and African ...
Oceanic Exchanges presentation
Digital Face project presentation
CrossCult presentation
Computational History and the Transformation of Public Discourse in Finland, ...
Where does the born- and reborn-digital material take the Digital Humanities?
Humanities Crowdsourcing on the Zooniverse Platform
Managing library collections with friends, favours and a spoonful of sugar
L taylor ucl_caribbean_digital_dreams_2017

Recently uploaded (20)

PDF
O5-L3 Freight Transport Ops (International) V1.pdf
PDF
Computing-Curriculum for Schools in Ghana
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PDF
GENETICS IN BIOLOGY IN SECONDARY LEVEL FORM 3
PDF
102 student loan defaulters named and shamed – Is someone you know on the list?
PPTX
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PDF
O7-L3 Supply Chain Operations - ICLT Program
PDF
VCE English Exam - Section C Student Revision Booklet
PPTX
Lesson notes of climatology university.
PDF
A GUIDE TO GENETICS FOR UNDERGRADUATE MEDICAL STUDENTS
PDF
01-Introduction-to-Information-Management.pdf
PPTX
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
PPTX
master seminar digital applications in india
PPTX
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
PPTX
202450812 BayCHI UCSC-SV 20250812 v17.pptx
PDF
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
PPTX
Introduction-to-Literarature-and-Literary-Studies-week-Prelim-coverage.pptx
PDF
Complications of Minimal Access Surgery at WLH
PDF
Chinmaya Tiranga quiz Grand Finale.pdf
O5-L3 Freight Transport Ops (International) V1.pdf
Computing-Curriculum for Schools in Ghana
Final Presentation General Medicine 03-08-2024.pptx
GENETICS IN BIOLOGY IN SECONDARY LEVEL FORM 3
102 student loan defaulters named and shamed – Is someone you know on the list?
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
STATICS OF THE RIGID BODIES Hibbelers.pdf
O7-L3 Supply Chain Operations - ICLT Program
VCE English Exam - Section C Student Revision Booklet
Lesson notes of climatology university.
A GUIDE TO GENETICS FOR UNDERGRADUATE MEDICAL STUDENTS
01-Introduction-to-Information-Management.pdf
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
master seminar digital applications in india
school management -TNTEU- B.Ed., Semester II Unit 1.pptx
202450812 BayCHI UCSC-SV 20250812 v17.pptx
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
Introduction-to-Literarature-and-Literary-Studies-week-Prelim-coverage.pptx
Complications of Minimal Access Surgery at WLH
Chinmaya Tiranga quiz Grand Finale.pdf

Visualizing the Transcribe Bentham Corpus

  • 1. Visualizing the Transcribe Bentham Corpus Frédérique Mélanie, Estelle Tieberghien, Pablo Ruiz Fabo, Thierry Poibeau LATTICE Lab: ENS – CNRS – U Paris 3, PSL – USPC Tim Causer, Melissa Terras UCL Bentham Project, UCL Digital Humanities UCLDH Seminar, December 2016
  • 2. Outline • UCL Bentham Project & Transcribe Bentham • How navigate this corpus? Visualizations – Lexical extraction – Co-occurrence networks • Static view and Temporal evolution • Evaluation and Challenges • Other corpus explorations via visualization • Distant Reading Module, WordTree • Other lexical analyses 2
  • 3. Jeremy Bentham (1748-1832) •Jurist, philosopher, and legal and social reformer •Leading theorist in Anglo-American philosophy of law •Influenced the development of welfarism •Advocated utilitarianism •Animal rights, •Work on the “panopticon” •Not founder of UCL, but... •60,000 folios in UCL Sp. Collections •40,000 untranscribed •Auto-icon
  • 4. The Bentham Project • http://www.ucl.ac.uk/Bentham-Project/ • Since 1959 • “aims to produce a new scholarly edition of the works and correspondence of Jeremy Bentham” • twenty six volumes of the new Collected Works have been published • 50 years to transcribe 20,000 folios • Previous AHRC grant catalogued the manuscripts – http://www.benthampapers.ucl.ac.uk/
  • 10. Facts and Figures (as of 1st July 2016) • 16,205 manuscripts transcribed/partially-transcribed • 15,351 (94%) checked and approved • 83,955 visits • 34,359 unique views • Average session time: 14 minutes 13 seconds • 140 countries • 514 people have transcribed something • Most of the work done by the 26 Super Transcribers • Average of 54 transcripts edited since the start of the project • Average of 56 per week during the last twelve months • Greatest number of transcripts in any one week: 300 (w/c 14 June • 2014)
  • 11. Transcribe Bentham progress, 8 September 2010 to 20 March 2015 0 2000 4000 6000 8000 10000 12000 8 Sep 2010 5 Nov 2011 30 Dec 2010 25 Feb 2011 15 Apr 2011 17 Jun 2011 12 Aug 2011 7 Oct 2011 2 Dec 2011 27 Jan 2012 23 Mar 2012 18 May 2012 13 Jul 2012 7 Sep 2012 2 Nov 2012 28 Dec 2012 22 Feb 2013 26 Apr 2013 21 Jun 2013 16 Aug 2013 11 Oct 2013 6 Dec 2013 31 Jan 2014 28 Mar 2014 23 May 2014 18 Jul 2014 12 Sep 2014 7 Nov 2014 9 Jan 2015 6 Mar 2015 Manuscripts worked on Completed transcripts NYT article BL manuscripts made available
  • 12. With thanks to: •Prof Philip Schofield (UCL Bentham Project, Principal Investigator) •Dr Tim Causer (Bentham Project) •Dr Kris Grint (Bentham Project) •Richard Davis (University of London Computer Centre •José Martin (ULCC) •Martin Moyle (UCL Library Services) •Lesley Pitman (UCL Library Services) •Tony Slade (UCL Creative Media) •Miguel Faleiro Rodrigues, Alejandro Salinas Lopez, and Raheel Nabi (UCL Creative Media) •Dr Arnold Hunt (British Library) •Anna-Maria Sichani (Bentham Project) •Dr Justin Tonra (National University of Ireland Galway) and Dr Valerie Wallace (Victoria University Wellington), bother formerly of the Bentham Project •All the partners in Transcriptorium http://transcriptorium.eu/consortium/ •And Transcribe Bentham’s volunteers! •Project previously funded by the AHRC and the Andrew W. Mellon Foundation
  • 13. Outline • UCL Bentham Project & Transcribe Bentham • How navigate this corpus? Visualizations – Lexical extraction – Co-occurrence networks • Static view and Temporal evolution • Evaluation and Challenges • Other corpus explorations via visualization • Distant Reading Module, WordTree • Other lexical analyses 13
  • 14. Relevant access to a large corpus 14
  • 15. Relevant access to a large corpus • A search index? • Topic models? • Corpus cartography? Challenges for this corpus • Not an all-English corpus • Difficulties posed by an historical variety • Technical language • Revision history, additions and deletions 15
  • 16. Stats for analyzed corpus sample • Total TEI files: 29,900 • In English: 29,400 • That we dated: 16,700 • We only visualized English transcripts that we could date (with a simple heuristic)1 • Work is based on ca. 55% of the all the TEI files in our sample 16 1We were not using the corpus’ date metadata for this exercise
  • 17. Corpus Cartography • Lexical extraction (of relevant sequences) • Clustering based on similarity measures • Visual representation (map of the corpus) based on layout algorithms 17
  • 18. Cartography tool: CorText • CorText Manager covers all cartography steps: – Lexical extraction – Clustering – Visualization • Each step can be used independently, thanks to standard import/export formats 18
  • 19. ToolscombinedwithCorText CARTOGRAPHY STEP TOOLS and RESOURCES Lexical Extraction DBpedia Spotlight YaTeA Human domain-expert Clustering CorText Analysis Visualization Gephi + Sigma JS plugin - Static CorText MapExplorer Inkscape - Dynamic CorText Heatmaps, Tubes, Distant Reading 19
  • 20. Outline • UCL Bentham Project & Transcribe Bentham • How navigate this corpus? Visualizations – Lexical extraction – Co-occurrence networks • Static view and Temporal evolution • Evaluation and Challenges • Other corpus explorations via visualization • Distant Reading Module, WordTree • Other lexical analyses 20
  • 21. Lexical Extraction • CorText native option – Noun-Phrase chunks (based on TreeTagger) • Our options: – Entity Linking / Wikification to DBpedia – Keyphrase extraction tools like YaTeA • In all cases: manual selection of pre-ranked candidate terms by a domain-expert 21
  • 22. Entity Linking / Wikification • Given a database with encyclopedic knowledge (e.g. Wikipedia) - Finds references (mentions) to DB terms in text - Dealing with variability in the mentions for a term 22
  • 23. Entity Linking / Wikification • Given a database with encyclopedic knowledge (e.g. Wikipedia) - Finds references (mentions) to DB terms in text - Dealing with variability in the mentions for a term 23 Database
  • 24. Entity Linking / Wikification • Given a database with encyclopedic knowledge (e.g. Wikipedia) - Finds references (mentions) to DB terms in text - Dealing with variability in the mentions for a term 24 Database
  • 25. Entity Linking / Wikification • Given a database with encyclopedic knowledge (e.g. Wikipedia) - Finds references (mentions) to DB terms in text - Dealing with variability in the mentions for a term 25 DatabaseCorpus - judicatory - judicial - judicature - Judicatory - Judicial
  • 26. Entity Linking / Wikification • Tool: DBpedia Spotlight • Compares the context of sequences of words in a text against DBpedia articles: – Term definition’s text – Links – DBpedia structure (redirections etc.) • Assigns a DBpedia term to the sequence if a good match is found 26
  • 27. Entity Linking / Wikification Example terms and their variants 27 Term Variants Judiciary judicature, judicatory, judicial Jury jury, juries Monarch king, monarch Quantity amount, quantity Saint Peter Simon Peter, Cephas
  • 28. Entity Linking / Wikification 28 • Applying a current knowledge-base (DBpedia) to 18th-19th century texts • Is this a valid method?
  • 29. Keyphrase extraction • YaTeA (Aubin and Hamon, 2006) • Extracts noun phrases of configurable structure and length
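To make "configurable structure and length" concrete, here is a rough sketch of pattern-based candidate extraction over part-of-speech tags. It is not YaTeA and does not reproduce its algorithm; the tag set, the `TAG_CODES` mapping, the `extract_candidates` helper and the hand-tagged sample are all assumptions made for illustration.

```python
import re

# Hypothetical pre-tagged input: (token, coarse POS tag), tagged by hand here.
tagged = [
    ("the", "DET"), ("greatest", "ADJ"), ("happiness", "NOUN"),
    ("of", "ADP"), ("the", "DET"), ("greatest", "ADJ"), ("number", "NOUN"),
    ("is", "VERB"), ("the", "DET"), ("measure", "NOUN"),
]

# One letter per tag so that a phrase structure can be written as a regex.
TAG_CODES = {"ADJ": "A", "NOUN": "N"}


def extract_candidates(tokens, pattern="A*N+", max_len=3):
    """Return token sequences whose tags match `pattern` and span at most `max_len` tokens."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tokens)
    candidates = []
    for match in re.finditer(pattern, codes):
        length = match.end() - match.start()
        if 0 < length <= max_len:  # longer matches are dropped, not truncated
            words = [tok for tok, _ in tokens[match.start():match.end()]]
            candidates.append(" ".join(words))
    return candidates


print(extract_candidates(tagged))
print(extract_candidates(tagged, pattern="N+", max_len=1))
```

Changing `pattern` changes the admitted structure and `max_len` caps the phrase length; a domain expert would then select from the ranked candidates, as described on the Lexical Extraction slide.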
  • 31. Clustering • CorText offers several similarity metrics • We chose the default method (distributional) for homogeneous networks (Weeds & Weir 2005)
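As an illustration of what a distributional similarity between terms can look like, the sketch below scores two terms by the overlap of the terms they co-occur with, using a set-based precision and recall in the spirit of co-occurrence retrieval (Weeds & Weir 2005). It is a deliberately simplified assumption, not CorText's implementation; the toy `documents` and both helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy documents, each reduced to its set of extracted terms.
documents = [
    {"punishment", "offence", "prison"},
    {"punishment", "offence", "judge"},
    {"happiness", "pleasure", "pain"},
    {"happiness", "pleasure", "punishment"},
]


def cooccurrence_features(docs):
    """For each term, collect the set of other terms it shares a document with."""
    features = defaultdict(set)
    for doc in docs:
        for term in doc:
            features[term] |= doc - {term}
    return features


def similarity(a, b, features):
    """Harmonic mean of precision and recall of a's co-occurrences against b's."""
    shared = features[a] & features[b]
    if not shared:
        return 0.0
    precision = len(shared) / len(features[a])
    recall = len(shared) / len(features[b])
    return 2 * precision * recall / (precision + recall)


features = cooccurrence_features(documents)
print(similarity("prison", "judge", features))
print(similarity("offence", "happiness", features))
```

Terms that never appear together ("prison" and "judge" here) can still be judged similar because they keep the same company, which is the property that makes clusters thematically coherent.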
  • 32. Visualization • Static (one map for all dated transcripts) • Dynamic: temporal slices of the corpus – Heatmaps – “River” or Sankey networks (“Tubes layout”) • http://apps.lattice.cnrs.fr/bentham
  • 37. Example term: happiness • CorText network made interactive thanks to Gephi’s Sigma JS Exporter
  • 44. Examples: nodes linking clusters
  • 46. Heatmaps: Saliency per subcorpus
  • 55. Evaluation • Static maps: the terms in each cluster correspond closely to the issues Bentham dealt with in that cluster’s thematic area • Heatmaps: the evolution depicted corresponds to the evolution of topics in Bentham’s work • DBpedia vs. keyphrase extraction: keyphrases provide more relevant evidence for specialized scholars, whereas a general encyclopedia can help other users
  • 57. Challenges • Thematic variety: – Animal Welfare – Arts – Capital punishment – Civil Code – Constitutional Code – Convict transportation – Correspondence – Crime & Punishment – Education – Law – Legislation – Moral Philosophy – New South Wales – Panopticon – Penal Code – Political Economy – Preventive Police – Religion – Science – Sexual Morality – Torture • Formal variety: – Text sheets – Copies / Fair copies – Marginal summary sheets – Correspondence – Collectanea – Rudiments – Spencers • From http://www.transcribe-bentham.da.ulcc.ac.uk/td/Manuscripts and http://www.benthampapers.ucl.ac.uk/help.aspx?subject=category
  • 59. Distant Reading Module • Follow the evolution of selected lexical sequences
  • 60. Evolution of a lexical item • Temporal evolution profiles: – Here: rising, but present at all dates – Other examples: falling, regular spikes, etc.
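A temporal profile of this kind can be approximated by the relative frequency of a word per time slice. The sketch below assumes dated transcripts are available as (year, text) pairs and slices them by decade; the sample `transcripts`, the decade granularity and the `relative_frequency_by_decade` helper are illustrative assumptions, not a description of the Distant Reading Module.

```python
from collections import Counter

# Hypothetical dated transcripts: (year, text). Invented for illustration.
transcripts = [
    (1776, "the happiness of the people"),
    (1789, "pain and pleasure govern happiness and happiness alone"),
    (1802, "punishment is an evil"),
    (1823, "happiness of the greatest number is happiness"),
]


def relative_frequency_by_decade(docs, term):
    """Share of tokens equal to `term` in each decade that has any text."""
    hits, totals = Counter(), Counter()
    for year, text in docs:
        decade = year // 10 * 10
        tokens = text.lower().split()
        totals[decade] += len(tokens)
        hits[decade] += tokens.count(term)
    return {decade: hits[decade] / totals[decade] for decade in sorted(totals)}


profile = relative_frequency_by_decade(transcripts, "happiness")
for decade, share in profile.items():
    print(decade, round(share, 3))
```

Normalizing by the number of tokens per slice matters here, because the number of transcripts varies from one period to another.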
  • 64. Context evolution: Bump Charts • Example: evil
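The data behind a bump chart is a ranking of a word's neighbours in each period. A minimal sketch, assuming the corpus is already split into periods and that a fixed token window defines the context; the sample sentences, the `STOPWORDS` set, the window size and the `neighbour_ranks` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical period-sliced sentences, invented for illustration.
periods = {
    "Period 1": ["the evil of punishment", "evil of the offence", "punishment is an evil"],
    "Period 2": ["evil of the offence", "the offence is an evil", "evil of punishment"],
}
STOPWORDS = {"the", "of", "is", "an"}


def neighbour_ranks(sentences, target, window=3):
    """Rank content words found within `window` tokens of `target` (1 = most frequent)."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(w for w in context if w not in STOPWORDS and w != target)
    # Ties are broken alphabetically so the ranking is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ordered, start=1)}


ranks = {name: neighbour_ranks(sents, "evil") for name, sents in periods.items()}
for name, ranking in ranks.items():
    print(name, ranking)
```

Plotting each neighbour's rank across periods gives the crossing lines of a bump chart; in this toy data "punishment" and "offence" swap places between the two periods.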
  • 67. Relations in the context: Egonetworks • Example: relations among neighbours of evil
  • 68. Evolution of neighbours’ relations • Egonetworks (Period 2)
  • 69. Evolution of neighbours’ relations • Egonetworks (Period 3)
  • 70. Evolution of neighbours’ relations • Egonetworks (Period 4)
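An egonetwork keeps only the neighbours of one word and the relations among those neighbours. The sketch below builds one for a single period, assuming sentences have been reduced to sets of content words and that co-occurrence in a sentence counts as a relation; the sample `sentences` and the `ego_network` helper are hypothetical. Running it once per period would give the sequence shown on the slides above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sentences for one period, already reduced to content words.
sentences = [
    {"evil", "punishment", "offence"},
    {"evil", "punishment", "pain"},
    {"punishment", "offence", "law"},
    {"pain", "pleasure"},
]


def ego_network(sents, ego):
    """Return the ego's neighbours and the weighted edges among those neighbours."""
    # Neighbours: every word sharing at least one sentence with the ego.
    neighbours = set().union(*(s for s in sents if ego in s)) - {ego}
    edges = Counter()
    for s in sents:
        # Relations among neighbours are counted in all sentences,
        # including those where the ego itself does not occur.
        for a, b in combinations(sorted(s & neighbours), 2):
            edges[(a, b)] += 1
    return neighbours, edges


neighbours, edges = ego_network(sentences, "evil")
print(sorted(neighbours))
print(dict(edges))
```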
  • 72. Other Lexical Analyses • TXM “textometry” tool: – Automatic part-of-speech tagging – Partition texts according to metadata – Query corpus using linguistic criteria – Statistical analyses (overrepresentation, underrepresentation) • http://textometrie.ens-lyon.fr/?lang=en
  • 74. Lexical Analysis with TXM • Partition the corpus according to Category, Year, Decade, Main headings, or other available metadata
  • 75. Lexical Analysis with TXM • Number of words per Category
  • 76. Lexical Analysis with TXM • Over- (or under-) representation of given words per decade (after partitioning per decade)
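One common way to quantify over- or under-representation of a word in one part of a partition is a hypergeometric tail probability, reported as a signed log score. The sketch below follows that idea under stated assumptions: the counts are invented, the function names are hypothetical, and it is not claimed to reproduce TXM's exact computation or output.

```python
from math import comb, log10


def hypergeom_pmf(k, total, term_total, part_size):
    """P(exactly k of the term's occurrences fall in the part) under random allocation."""
    return (comb(term_total, k) * comb(total - term_total, part_size - k)
            / comb(total, part_size))


def specificity(k, total, term_total, part_size):
    """Signed log10 tail probability: positive = over-represented, negative = under-represented."""
    expected = term_total * part_size / total
    if k >= expected:
        # Probability of observing k or more occurrences in the part.
        tail = sum(hypergeom_pmf(i, total, term_total, part_size)
                   for i in range(k, min(term_total, part_size) + 1))
        return -log10(tail)
    # Probability of observing k or fewer occurrences in the part.
    lowest = max(0, part_size - (total - term_total))
    tail = sum(hypergeom_pmf(i, total, term_total, part_size)
               for i in range(lowest, k + 1))
    return log10(tail)


# Invented counts: a 10,000-word corpus, a word occurring 50 times overall,
# and one decade holding 2,000 words (so 10 occurrences would be expected).
print(specificity(25, 10_000, 50, 2_000))  # many more occurrences than expected
print(specificity(2, 10_000, 50, 2_000))   # far fewer occurrences than expected
```

The larger the absolute score, the less plausible it is that the word's frequency in that decade arose from a random distribution of its occurrences across the corpus.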
  • 77. TXM linguistic queries • Evil followed by a noun, per text category
  • 78. TXM linguistic queries • Sentences containing an adjective + evil
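The two queries above combine a word form with a part-of-speech constraint on its neighbour. As a rough Python analogue, and not TXM's query language, the sketch below filters adjacent token pairs in a tagged sequence; the tag names, the hand-tagged sample and both helpers are assumptions for illustration.

```python
# Hypothetical tagged tokens: (word, coarse POS tag), tagged by hand here.
tagged = [
    ("the", "DET"), ("greatest", "ADJ"), ("evil", "NOUN"), ("of", "ADP"),
    ("punishment", "NOUN"), ("an", "DET"), ("evil", "ADJ"), ("consequence", "NOUN"),
]


def followed_by(tokens, word, next_tag):
    """Pairs where `word` is immediately followed by a token tagged `next_tag`."""
    return [(w1, w2) for (w1, _), (w2, t2) in zip(tokens, tokens[1:])
            if w1.lower() == word and t2 == next_tag]


def preceded_by(tokens, word, prev_tag):
    """Pairs where `word` is immediately preceded by a token tagged `prev_tag`."""
    return [(w1, w2) for (w1, t1), (w2, _) in zip(tokens, tokens[1:])
            if w2.lower() == word and t1 == prev_tag]


print(followed_by(tagged, "evil", "NOUN"))
print(preceded_by(tagged, "evil", "ADJ"))
```

Grouping the matches by text category, as on slide 77, would only require carrying a category label alongside each tagged sequence.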
  • 79. Summary • Accessing a large unedited corpus – Cartography methods • Lexical extraction • Maps – Static picture of the corpus – Temporal evolution – Other visualizations (Distant Reading, WordTree) • Domain-expert feedback • Challenges • Other lexical analyses • http://apps.lattice.cnrs.fr/bentham
  • 80. Bibliography
    – Aubin, S., and Hamon, T. (2006). Improving Term Extraction with Terminological Resources. In Advances in Natural Language Processing: 5th International Conference on NLP, FinTAL 2006, pp. 380-387. LNAI 4139. Springer.
    – Auer, S., et al. (2007). DBpedia: A nucleus for a web of open data. In The Semantic Web. Springer.
    – Causer, T., and Terras, M. (2014a). Many hands make light work. Many hands together make merry work: Transcribe Bentham and crowdsourcing manuscript collections. In Crowdsourcing Our Cultural Heritage, ed. M. Ridge. Ashgate.
    – Causer, T., and Terras, M. (2014b). Crowdsourcing Bentham: Beyond the Traditional Boundaries of Academic History. International Journal of Humanities and Arts Computing, 8.
    – Chavalarias, D., and Cointet, J.-P. (2013). Phylomemetic Patterns in Science Evolution - The Rise and Fall of Scientific Fields. PLoS ONE 8 (2).
    – Cortext Manager Documentation (2016). https://docs.cortext.net/.
    – Mendes, P. N., Jakob, M., García-Silva, A., and Bizer, C. (2011). DBpedia Spotlight: Shedding Light on the Web of Documents. In Proceedings of the 7th International Conference on Semantic Systems, 1–8. ACM.
    – Mélanie, F., Tieberghien, E., Ruiz, P., Poibeau, T., Causer, T., and Terras, M. (2016). Mapping the Bentham Corpus. In Digital Humanities Conference (DH 2016). Kraków, Poland.
    – Poibeau, T., and Ruiz, P. (2015). Generating Navigable Semantic Maps from Social Sciences Corpora. In Digital Humanities Conference (DH 2015). Sydney, Australia.
    – Rule, A., Cointet, J.-P., and Bearman, P. S. (2015). Lexical Shifts, Substantive Changes, and Continuity in State of the Union Discourse, 1790–2014. Proceedings of the National Academy of Sciences 112 (35).
    – Venturini, T., Baya Laffite, N., Cointet, J.-P., Gray, I., Zabban, V., and De Pryck, K. (2014). Three Maps and Three Misunderstandings: A Digital Mapping of Climate Diplomacy. Big Data & Society 1.
    – Weeds, J., and Weir, D. (2005). Co-occurrence retrieval: A flexible framework for lexical distributional similarity. Computational Linguistics 31(4), 439–475.
    – Wattenberg, M., and Viégas, F. B. (2008). The word tree, an interactive visual concordance. IEEE Transactions on Visualization and Computer Graphics, 14(6), pp. 1221-1228.
  • 82. & return you all due thanks • pablo.ruiz.fabo@ens.fr • http://www.lattice.cnrs.fr/Pablo-Ruiz-Fabo,541 • http://apps.lattice.cnrs.fr/