Karen Cariani
AAPB Project Director, WGBH
Executive Director, WGBH Media
Library & Archives
Can the Computer,
and the Public,
do the Metadata
Work?
Casey Davis-Kaufman
AAPB Project Manager, WGBH
Associate Director, WGBH Media
Library & Archives
Agenda
• Who we are
• The project
– Cataloging dilemma
– Audio cataloging
– Speech to text results
– Crowd source work
– Crowd source challenges
• CL tool data output – useful?
• Additional available tools – work with CL experts
– We have a great data set
– You have expertise
WHO WE ARE
The Library of Congress
Packard Campus for Audio Visual Conservation
American Archive of Public Broadcasting
Who we are:
WGBH Media Library and Archives
THE PROJECT
the situation/dilemma
90,000 digitized television and radio programs
incomplete, inaccurate metadata records
limited staff resources
what is in the collection?
users need access to the collection
continued growth of the collection (content and sparse metadata)
Wide variety
• Of content from locations across
the country
• Lots of different types of speakers
• Different speech patterns and
accents
• 40 states
• politicians, academics, man on
street, youth, elders, artists
• north, south, mid-west, west,
east
• Creole, French, Caribbean
Spanish, Alaskan native language
• Music and other non-spoken
sound
the potential:
transforming content into data
• Computational Tools
• Speech-to-text
• Audio analysis
• Image Analysis
• Visualization of Data
How can we use them?
AUDIO ANALYSIS
High Performance Sound Technologies
for Access and Scholarship (HiPSTAS)
generating additional programmatic data about the audiovisual
material beyond the actual words being spoken in the recordings
• speaker identities
• applause
• laughter
• musical interludes
4,000 hours
speaker labeling
• WGBH-provided training set (all audio): 373.3 hours [speakers we aren't using, etc.]
• Extended corpus (all audio): 951.8 hours [collected on own, keywords, etc. might change some of it not; some included in above #]
• "Haystack" test set (all audio): 494.2 hours [gave us to label because we think they're in there; might be some duplicated so we don't use it for training; use it to run the machine. kept on separate]
• Human-labeled speaker segments (all): 103.8 hours [everything]
• UBM_600 corpus: 17.9 hours [all labeled speakers they gave us; plus PennSound UBM] (UBM = Universal Background Model)
HiPSTAS AAPB Output
AAPB speaker identity models
. . . so far
Hillary Clinton
Bill Clinton
James Baldwin
Malcolm X
Martin Luther King Jr.
Julia Child
Richard Nixon
Ronald Reagan
Lyndon Johnson
Gloria Steinem
Workshop
Process evaluation
Results
• We didn’t get very much identified
• Really hard to do
• Took a lot of time to identify 1 speaker
• How to translate the research and tool training done at the University of Texas into our own catalog was undefined
SPEECH TO TEXT RESULTS
Transcript Creation Continues
• As part of the project, Pop Up Archive released
updated language models for Kaldi.
• The code has been “Dockerized” by our
University of Texas partners.
• We have incorporated transcript creation into our
workflow as the collection grows.
• GitHub: https://github.com/WGBH/kaldi-pop-up-archive
Crowdsourcing tools
Zooniverse Project – “Roll the Credits”
FIXITPLUS.americanarchive.org
Lord thank you. Lord. You’re through. You’re.
Through. Mm mm. Mm. Mm mm mm mm
mm. Mm. Mm. Laura Iraq of. Glory. From.
From. From. Three three. Three. Three.
Three. Three. Three.
Laura it thank. You. That was it. And those
are the good other than those of the other.
But other than that I'm.
But you. Know. In PSA jogging for the leg.
That. Said alone but not for what they love
that about them. In the U.S. I don’t really like
deep down this guy John. Has seen me
naked young. Youre that one or put up or go
to a lie even. When we're going to be in a
way that it was. A scene or so because John
you and I will note that up why go to a lie.
When we're going to be in before and then
they run a scene will see My God. You up or
something. On Monday another tour and yet.
People are coming up a lot of. Joy to make
us even more. Of this season might. Be
made upon home with us so we're left with.
The money to go before the scene was
bought.
Kaldi Output for Music
A.
Leg.
Length.
Length.
Eh I'm
good.
Thank you.
Kaldi Output for Helicopter B-Roll
You mean you think you know always going. In
approximately I see that I'm being in a city that the
community is Brucey putting I mean in and deal with it
and they see it. Are they looking for. And COMINO is
improved on so many ideas to explore that they must be
in for fancier graphic thing like that Latina. Woman That
is keep it down for a. Made up out of there. I mean the
older you know the moment they do it and I mean no no
shangri la I wound up pretty compressed. But then finally
the. Most famous presenter being asked about Odama
seeking
impact done like a 20 year old uncle. I miss my foresight
when local news reported a loss for him which was the
original being with you we. Let's see it. No problem I say
to you since it was your own this and not a Californian.
When you guys buy up our land Mr. Merhi Nera for law
enforcement you said to be sure that will cost them. Less
personally and COMINO is bloviated not very effective.
They must be telescreen are so fake now. But up
Roamer they don't know about this the past few months I
feel I am sure that. If we didn't know what it is as soon as
one has been able to from my strong ally in cancer they
mine. I say most I would have had lots of them in Dallas.
I remember with Ahmadinejad. See that he's a see the.
This was very
programa I said Think end up in the Mystica. I think can
assume that I would be in the family is in Gaza you know
of.
Just a movement that eluting mother in that race in
Illinois. I mean these people are Latino and they they
face.
Like the Latinos in going through all of Emmelina repeat
Kaldi Output for Spanish Language
All.Right. OK. OK. West Virginia
Virginia. Plus. Brown. Girlfriend try a
temporary quick. Corollary proper and I
pointed to a song. Made on your
apartment made I'm easy. I generally.
Don't jump to a long. Record. Your point
here already called an average storm
either calling. Me a fearful planner for
us are really powerful for one I don't live
for or see me privately just 75 miles
from or see. Them. President Reagan
National Award. You have more screen.
Generation. We have a long trip. For. A
trip. Do you see music music from
music. And there is a common sense for
him when the. Nazi party. No no can
one. Man may or. Yet you monkey have
made of the tears and. Some are. Good
only the Starkey blues the silly your so I
look at this hour. Mr Parker knew me or
started. Their. Own eyes. What I am.
Not very good.
Kaldi Output for French-Creole Language
We know nothing about this item.
Kaldi Output for Yup’ik Language
Your all right already. Yeah OK. For the record.
Speaking to James guy Greek looked him up in the
crowd doesn't mean we thought i hope I do I'm
trying to talk but I think we shall I do you can and we
shall not be in hall and you know
I do wish I could. I will call that good old myth or the
old call wouldn't let me out with a dog. Be sure they
left the water good. How does how do you lead that
cuckoo clock in the south have you genius. And I
don't miss out on your good
guy you're not familiar with what do I do when you
need to whom you are now there. Dorise you I
laughed and I dunno it still gets you can pick on a
new record. Yeah gotta be deader than a good guy.
Let's go on a con not go off on your own of kind and
how can you know how God left her daughter. Julie I
don't know we don't really care what you never
knew that people can't go through the muck of the
kind in which our not. I don't come out now if you're
going to cause you need no with no trade in Kenya
and it could be should be a good deal or not a
number could not go home now or that not gotten
bad bitches want to go Honey you do not want me
one culture not clear what we've got to do what you
want to try New Delhi new kind o woman in her
thought didn't think enough on line to get rid of they
have a machine like you.
Kaldi Output for Spoken English
• Approximately 81% word accuracy rate
– not including punctuation errors
• Examples:
– 95% accurate for 1960s radio program from
Boston (no accents, one speaker)
– 55% accurate for 1970s television program
from Mississippi (strong Southern U.S.
accent)
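The slides do not say how the word accuracy rate was measured. A common convention, assumed here, is one minus the word error rate: the word-level edit distance between a human-checked reference and the machine transcript, divided by the reference length. The sketch below ignores case and surrounding punctuation, in the spirit of "not including punctuation errors"; the sample sentences are invented for illustration and are not taken from the collection.

```python
import string


def tokenize(text):
    """Lowercase the text and strip punctuation from the edges of each word."""
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return [w for w in words if w]


def word_errors(reference, hypothesis):
    """Word-level edit distance: substitutions + insertions + deletions."""
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    # prev[j] = distance between the reference words seen so far (minus one)
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(reference, hypothesis):
    """Word accuracy as 1 - (edit distance / number of reference words)."""
    ref_len = len(tokenize(reference))
    if ref_len == 0:
        raise ValueError("reference transcript is empty")
    return 1.0 - word_errors(reference, hypothesis) / ref_len


# Invented example sentences (assumption, for illustration only).
reference = "We're in Plaquemine with Lawrence today."
hypothesis = "weary and plaque and with Laurence today"
print(f"word accuracy: {word_accuracy(reference, hypothesis):.0%}")
```

Under this definition, inserted words count against the score, so a very noisy transcript can fall below 0%.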
Types of Errors Corrected by Users
• Station call letters
• Mis-transcription of words spoken in southern accents,
e.g., “weary and” vs “we’re in”
• Local town names, e.g., “plaque and” vs. “Plaquemine”
• Person names, e.g., “Laurence” vs. “Lawrence”
• Numbers spelled out vs. numeric
• Adding words completely missing from original
transcript
• Incorrect “corrections” by crowdsource participants,
e.g. “achieved” vs. “achievedd” in the “corrected”
transcript
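One way to review error types like these is to diff the machine transcript against the crowd-corrected one, word by word. The sketch below uses Python's standard `difflib` module; it is not the FIX IT tooling itself, and the function name `list_corrections` and the sample strings are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def list_corrections(machine_text, corrected_text):
    """Return (kind, machine_words, corrected_words) for each changed span."""
    machine = machine_text.split()
    corrected = corrected_text.split()
    matcher = SequenceMatcher(a=machine, b=corrected, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged words are not corrections
        changes.append((tag,
                        " ".join(machine[i1:i2]),
                        " ".join(corrected[j1:j2])))
    return changes


# Invented sample strings built from the error types listed above.
machine = "weary and plaque and with Laurence"
corrected = "we're in Plaquemine with Lawrence"
for kind, before, after in list_corrections(machine, corrected):
    print(f"{kind}: {before!r} -> {after!r}")
```

A listing like this could also help flag incorrect "corrections": a replaced span whose new text is not a dictionary word (such as "achievedd") is a candidate for a second review.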
once corrected…
• JSON transcripts are stored on AAPB’s Amazon S3 account
• Transcripts are indexed for keyword searching on the AAPB website
• Transcripts are made available alongside the media on the record page
• Transcripts can play as captions within the player
• Transcripts can be harvested via an API and used as a dataset for research such as a digital humanities project
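As a rough illustration of keyword indexing over a time-coded JSON transcript, the sketch below builds a word-to-timestamp index. The JSON layout (`id`, `parts`, `start_time`, `end_time`, `text`) is a hypothetical schema chosen for this example; the actual AAPB transcript format may differ, and the sample text is invented.

```python
import json
import string
from collections import defaultdict

# Hypothetical transcript layout (assumption, not the documented AAPB schema).
SAMPLE = """
{
  "id": "example-record-id",
  "parts": [
    {"start_time": 0.0, "end_time": 4.5, "text": "Good evening from Boston."},
    {"start_time": 4.5, "end_time": 9.0, "text": "Tonight we visit Plaquemine, Louisiana."}
  ]
}
"""


def build_index(transcript):
    """Map each lowercase word to the start times of the parts containing it."""
    index = defaultdict(list)
    for part in transcript["parts"]:
        for raw in part["text"].split():
            word = raw.strip(string.punctuation).lower()
            if word:
                index[word].append(part["start_time"])
    return index


transcript = json.loads(SAMPLE)
index = build_index(transcript)
# Look up where a corrected place name is spoken, in seconds from the start.
print(index["plaquemine"])
```

Because every word keeps its start time, the same structure could drive both keyword search and a jump to the matching point in the player.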
CL experts
• Improving named entity vocabularies
• Forced alignment
• Time stamp for bars and tone
• Music identification
• Foreign language identification and transcription
• OCR of text on screen (lower thirds, credits)
CL Experts and Archivists
• A larger need for more accurate output and
ease of use of computational tools for
audiovisual archives to create descriptive
metadata and annotations.
Help Us!
Access our Dataset
• Metadata API (OAI-PMH and PBCore)
– https://github.com/WGBH/AAPB2#api
• Transcripts API – contact us to get credentials
• Media as a Dataset (MaaD) – contact us to get
digital audio/video copies for computational
research
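For readers new to OAI-PMH, the sketch below shows the general shape of a harvest: build a `ListRecords` request, read the record identifiers from the response, and follow the `resumptionToken` to the next page. The verb, arguments, and XML namespace come from the OAI-PMH standard; the base URL, the `pbcore` metadata prefix, and the sample response are placeholders assumed for illustration, so check the AAPB2 API documentation for the real values.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
BASE_URL = "https://example.org/oai"  # placeholder, not the real AAPB endpoint

# Invented response, trimmed to the elements this sketch reads.
SAMPLE_XML = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>example-id-1</identifier></header></record>
    <record><header><identifier>example-id-2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""


def list_records_url(base_url, metadata_prefix=None, resumption_token=None):
    """Build a ListRecords URL; a resumption token replaces all other arguments."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def parse_page(xml_text):
    """Return (record identifiers, next resumption token or None)."""
    root = ET.fromstring(xml_text.strip())
    ids = [el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


first_url = list_records_url(BASE_URL, metadata_prefix="pbcore")  # prefix assumed
ids, token = parse_page(SAMPLE_XML)
next_url = list_records_url(BASE_URL, resumption_token=token) if token else None
print(first_url)
print(ids, next_url)
```

A real harvester would fetch each URL over HTTP and repeat until the response carries no resumption token.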
facebook.com/amarchivepub
@amarchivepub
americanarchive.org
http://fixit.americanarchive.org
#FixItAAPB
Karen Cariani
Karen_Cariani@wgbh.org
Casey Davis-Kaufman
casey_davis-kaufman@wgbh.or
