crowdsourcing for the crowd
generating and curating open and accessible linguistic data




   Crowdsourcing and Translation Workshop, University of Maryland, June 10-11, 2010
Umd draft-2010 jun22
mission



The Language Commons seeks to increase open and accessible linguistic data of all
forms for all languages.

We are a consortium of individuals, institutes, organizations, and corporations working to
build and promote the tools, standards, policy, infrastructure, awareness, and community
needed to preserve the world’s linguistic diversity and gather the open data needed to
provide global access to knowledge and information across all languages.
urgency



   [Linguistics may] go down in history as the only science that presided obliviously
   over the disappearance of 90 per cent of the very field to which it is dedicated.
                                                                             –Hale et al.

   We live during a brief period of overlap between the mass extinction of the world’s
   languages and the advent of the digital age.
                                                                                 –Bird
rationale

   Web-based: The multilingual, read/write web has created the opportunity to generate,
   share, and curate linguistic data

   Open: leverage the momentum behind open content (Creative Commons) and open data
   (data.gov) movements

   Crowdsourced: Semi-supervised communities can scale datasets (Haiti)

   Capturing the public imagination: This project represents the convergence of a grand
   social and grand scientific challenge
solution
   Function as a consortium working in parallel on various aspects of the mission:

     Collaborate on needed tools

     Influence data/content publishers to open-license their data

     Influence policy makers to mandate open linguistic data for publicly funded projects

     Generate and curate open data among our consortium members

     Work to identify and share resources - Language Commons SourceWiki

     Pursue longer-term goals for universal corpus infrastructure and API design
projects

  NSF Si2 annotation framework for video/audio data (LDC/Meedan)

  UN Corpus effort: ~600 million words / seven languages (LC Steering Committee)

  Language Commons SourceWiki - presenting at Wikimania (Rosetta Project, Freebase)

  Human Language Project universal corpus infrastructure and API design (Bird, Abney)
1. theory
“the meaning
of a word is its
   use in the
   language”
                   Wittgenstein
    Philosophical Investigations
a language is a socially constructed
framework for storing and
transporting meaning within a
community

however, there is an increasing need
to transport meaning across this:
often the meaning (use) of the words
does not translate
huh?
the war on terror
global understanding problem




      ?                                                 ?




                               Creative Commons - Mushon Zer-Aviv
semantic namespace problem
translation in sensitive contexts often
does not solve for understanding, it
merely exposes the
mis(dis)understanding
2. practice
we are building translation solutions for
bloggers and bishops
translation for news.meedan.net




                    http://news.meedan.net
translation for the distributed global newsroom
    Wikipedia ethic to translation editing

      +translation as dynamic
      +revisions are collaborative
      +show translation history
      +the consumer as editor

      +MT feedback loop
      +able to translate more, better
      +constantly improving 
      +Community vets translations
      +humanizes the translator - translator's profile

    Makes media global, conversational, social, cross-language



                                      http://news.meedan.net
translation for a network of religious scholars

     Translation as a form of scholarship

       +no Machine Translation
       +Domain trained translators
       +Glossaries
       +Annotation layers - address the namespace issue
       +Granular Attribution - word/sentence/document level
the user interface



the meaning of a translation includes
the fact that it is a translation
Showing two languages side by side
                        runs counter to traditional UI/UX best practices




+provenance
+attribution
+version control
+visual cues
+url translation
+lots of human effort
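
The provenance, attribution, and version-control points above can be sketched as a small data structure. This is a minimal illustration only, not Meedan's implementation: the class names, field names, and the `revise`/`current`/`contributors` methods are all hypothetical, with the fields loosely following the speaker note for this slide (source author, source language, source location, translation author).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Revision:
    """One version of a translation, attributed to the person who wrote it."""
    translator: str  # hypothetical field: who produced this version
    text: str        # the translated text at this version


@dataclass
class TranslatedSegment:
    """A source segment plus its full, ordered translation history (hypothetical model)."""
    source_author: str
    source_language: str
    source_url: str
    source_text: str
    target_language: str
    revisions: List[Revision] = field(default_factory=list)

    def revise(self, translator: str, text: str) -> None:
        # Append rather than overwrite, so earlier versions stay visible.
        self.revisions.append(Revision(translator, text))

    def current(self) -> Revision:
        # The latest revision is the one a reader would see by default.
        if not self.revisions:
            raise ValueError("segment has not been translated yet")
        return self.revisions[-1]

    def contributors(self) -> List[str]:
        # Everyone who touched the translation, in order of first edit.
        seen: List[str] = []
        for rev in self.revisions:
            if rev.translator not in seen:
                seen.append(rev.translator)
        return seen
```

Keeping every revision instead of only the latest text is what would let an interface show translation history and credit each contributor alongside the source.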
unintended consequences: language learning
other fun stuff: generating data, transporting
knowledge, globalizing great NGOs




WikiArabia                  Meedan Memory       Kiva.org
 +Project with KACST         +Open AR/EN TM       +Translate 700k words
 +Translate 2000 articles    +Circa 2m words      +Jump-start Kiva AR
 +Science, Tech, Health      +Informal domain     +Cisco-funded
 +116k articles in AR WP     +on GitHub
 +530 million AR speakers


Editor's Notes

  • #11: A hugely beautiful piece of philosophy. Extend a word to be an idea or an action, like, say, a war, and you can surmise that the meaning of that idea or war is equal to its use in the language. The problem is that we toss phrases like... clash of civilizations and...
  • #16: ‘the war on terror’ and ideas like the invasion of Iraq into a place where we cannot speak the language. We have single events that are shaping our global landscape that are understood in radically different ways by each party...
  • #17: This is the Arabic phrase that approximates the ‘literal’ translation of the war on terror; it is approximately ‘the war against terrorism’. It is only occasionally used in some of the conservative media outlets.
  • #18: Though more commonly it is ‘the war against Arabs’.
  • #19: In this phrasing it is simply “Bush’s War”... you can see that some of the original intent of the...
  • #21: So the issue of what the war in Iraq means is very much a namespace issue, but when we understand the common referent and offer that a different signifier is used, we can come a bit closer to understanding the complexity of the context for that referent.
  • #22: humility
  • #29: source author, source language, source location, translation author, etc, etc.