INTRODUCTION TO
CORPUS LINGUISTICS
karlinadenistia@staff.uns.ac.id
@karlinakuning
Karlina_Denistia
Corpus Linguistics – Karlina Denistia
https://scholar.google.de/citations?hl=en&user=D2U9r3cAAAAJ&view_op=list_works&sortby=pubdate
OUTLINE
• Background story
• What is corpus linguistics?
• Sources of corpus data
• Which sources for which research?
Language rules and systems
• Both of these are acceptable sentences
• We worked out the problem
• We worked the problem out
• These two sentences, however, are not equally acceptable
• We worked out it
• We worked it out
The first one is likely to sound strange to many native speakers of English
Language variation:
- Speaker
- Context
- Necessity
What is corpus linguistics?
• Corpus linguistics describes language variation and use by examining large amounts of
text that people have actually produced
• Written: news writing, text messaging, or academic writing
• Spoken: news reporting, face-to-face conversation, or academic lectures
• A corpus is a representative collection of language that can be used to make statements
about language use
• It contains a fairly large number of examples
• It can be read by a computer
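To make "can be read by a computer" concrete, here is a minimal sketch of a word frequency list, one of the simplest statements about language use a corpus supports. The three-sentence mini-corpus and the function names `tokenize` and `frequency_list` are assumptions made for illustration; a real corpus would hold millions of words and need a more careful tokenizer.

```python
import re
from collections import Counter

# Hypothetical mini-corpus: example sentences borrowed from this lecture.
corpus = [
    "We worked out the problem",
    "We worked the problem out",
    "We worked it out",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def frequency_list(texts):
    """Count how often each word form occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

freq = frequency_list(corpus)
for word, count in freq.most_common():
    print(f"{word}\t{count}")
```

The regular expression only matches the letters a to z, which is enough for this toy example but would drop accented characters and digits in real data.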
Sources of corpus data
• Containing real-world examples
• Books, papers, letters, spoken language, dialogues, news, chat
history, song lyrics, Twitter, Facebook posts, movie subtitles, etc.
• Size: millions of words
Electronically available and computer-processable
• e.g., PDF → optical character recognition → text file
• e.g., audio file → speech-to-text (e.g., by Siri) → text file
• Built using a semi-automated process (e.g., web crawlers)
• What about manually typed text, or news copied and pasted from the internet into a file?
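One step in making web-crawled material computer-processable is stripping the markup from a downloaded page so that only plain text remains. The sketch below shows one possible way to do this with Python's standard `html.parser` module. The class `TextExtractor`, the function `html_to_text`, and the sample page are assumptions for illustration; they are not the pipeline used by any particular corpus project.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text and skip the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a script or style element.
        if self.skip_depth == 0:
            self.parts.append(data)

def html_to_text(html):
    """Return the page text with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())

# Hypothetical crawled page, invented for this example.
page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Berita</h1><p>Harga beras  naik.</p></body></html>"
)
text = html_to_text(page)
print(text)
```

The resulting string could then be saved as a UTF-8 `.txt` file, the format that corpus tools typically expect.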
https://corpora.uni-leipzig.de/en?corpusId=ind_mixed_2013&word=
What does it mean to say “I am doing corpus linguistics”?
• it is empirical, analyzing the actual patterns of use in natural language texts
• it utilizes a large and principled collection of natural texts, known as a “corpus”, as the
basis for analysis
• it makes extensive use of computers for analysis, using both automatic and interactive
techniques
• it depends on both quantitative and qualitative analytical techniques
(Biber, Conrad, & Reppen, 1998: 4)
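The combination of automatic and interactive techniques can be illustrated with a keyword-in-context (KWIC) concordance: the computer finds every occurrence of a word, and the analyst then reads the surrounding context to interpret it. The function `kwic` below is a hypothetical minimal sketch, not the algorithm of AntConc or any other tool, and it assumes the text is already split into tokens.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each hit."""
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits

# Hypothetical pre-tokenized text built from the lecture's example sentences.
text = "We worked out the problem . We worked the problem out . We worked it out ."
lines = kwic(text.split(), "out", window=2)

# Print the hits aligned on the keyword, as concordance tools usually do.
for left, node, right in lines:
    print(f"{left:>15}  {node}  {right}")
```

Counting the hits is the quantitative step; reading the aligned lines to see, for example, where the object sits relative to "out" is the qualitative one.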
Break and think:
What can we do with this corpus?
Morphology : Indonesian affix productivity
Semantics : figurative language with ‘head’
Syntax : adverb mobility in Indonesian
Language use : new words in Indonesian corpora
Pragmatics : formal and informal construction
Any other ideas?
Which corpus for which research?
• British National Corpus
• 4,048 texts (a variety of texts
written in British English)
• Around 100 million words
• Lake District corpus
• 28 texts (British English texts about the Lake District,
written between 1700 and 1900)
• 273,861 words
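Because these two corpora differ so much in size, raw counts from them cannot be compared directly. A common workaround is to normalize counts to a rate per million words. The sketch below uses the corpus sizes from the slide; the hit counts are invented purely for illustration and are not real findings, and `per_million` is a hypothetical helper name.

```python
def per_million(raw_count, corpus_size):
    """Normalize a raw count to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * 1_000_000 / corpus_size

BNC_SIZE = 100_000_000   # "around 100 million words" (from the slide)
LAKE_SIZE = 273_861      # Lake District corpus size (from the slide)

# Hypothetical raw counts of one word, invented for illustration only.
bnc_hits = 5_000
lake_hits = 50

bnc_rate = per_million(bnc_hits, BNC_SIZE)
lake_rate = per_million(lake_hits, LAKE_SIZE)

print(f"BNC:           {bnc_rate:.2f} per million words")
print(f"Lake District: {lake_rate:.2f} per million words")
```

With these made-up numbers the word has far fewer raw hits in the small corpus but a higher normalized rate, which is why the size of each corpus has to be taken into account before drawing conclusions.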
Know the aim of your research
(Gabrielatos, 2013)
Which one will you use?
• British National Corpus
• 4,048 texts (a variety of texts
written in British English)
• Around 100 million words
• Lake District corpus
• 28 texts (British English texts about the Lake District,
written between 1700 and 1900)
• 273,861 words
Summary
• Corpus linguistics opens up more ways to describe linguistic
phenomena on the basis of language use
• Various kinds of corpora can serve as sources of
information for language research
• The choice of corpus depends on the research question(s)
Any questions?
See you next week!
Note: you need to download AntConc for our next meeting
References
• Biber, Douglas, Susan Conrad, and Randi Reppen. 1998. Corpus Linguistics: Investigating Language Structure and Use.
Cambridge: Cambridge University Press.
• Crawford, William J., and Eniko Csomay. 2016. Doing Corpus Linguistics. New York: Routledge.
• Gabrielatos, Costas. 2013. Sketching Muslims: A Corpus Driven Analysis of Representations Around the Word
'Muslim' in the British Press 1998-2009. Applied Linguistics, 34(3): 255-278.



Editor's Notes

  • #2: 7 published articles (SINTA 2-6), 2 accepted book chapters, 1 accepted SINTA article, 2 Scopus articles under review
  • #7: The fact is that one construction is acceptable and the other is not. Moreover, a pattern that is acceptable in one case turns out to be unacceptable in another. Who knows why? It may be difficult to explain this particular aspect of English, namely that the rules allow some things and not others. One component of descriptive linguistics is to turn the implicit rules of language that we already know into knowledge that we can describe. In describing those implicit rules, it is safer to say that native speakers sometimes choose not to follow rules for specific reasons, even though they may not be able to explain the rules themselves, RATHER THAN to say “this is wrong and that is right”.
  • #8: The concept of language rules raises another interesting question: Why are these rules sometimes followed and sometimes “violated”? Consider the prescriptive infinitive rule described above. Is it accurate to say that those who write “we worked out it” are not following a rule? In some respects, this may be the case, but there is another, perhaps somewhat misunderstood, issue related to language that deserves some attention and serves as a basis for our topic today: the role of language variation. Language changes and varies even within a single person. The study of language variation seeks to gain an understanding of how language changes and varies for different reasons and in different contexts. There are different perspectives on how to investigate and understand language variation. We could choose to look at how language varies in different places of the world (for example, the differences between British and American English). We could also investigate how language varies by ethnicity or social class. Another way of looking at variation would be to consider the differences among individual writers or speakers. We could, for example, study the speeches of Anies Baswedan in order to understand how his “style” might differ from speeches given by other people, such as Ahok.
  • #9: Given the broad use of language, the existence of language variation, and the possible emergence of new constructions, we need to look at language through its use. We need to consider using corpora.
  • #10: The result of this analysis is a collection of language patterns that are recurrent in the corpus and either provide an explanation of language use or serve as the basis for further language analysis.
  • #12: Corpora come in many forms and from many sources. What matters is that they cover natural, real language use: for example, corpora drawn from books, papers, letters, spoken language, or film dialogue. A corpus also needs to be large; there is no firm agreement on a minimum size, but it is usually in the millions. Corpora are made this large because corpus providers try as hard as they can to make their data usable by linguists in many fields, for example to observe interesting aspects of general language use, or to find particular patterns that can be generalized from language in the real world. Returning to corpus size: if I am interested in studying an ambiguous affix construction such as the prefix ter-, and I can choose between a novel corpus of 5,000 sentences and a news corpus of 5 million sentences, I would expect to capture more instances of ter- from the 5-million-sentence corpus. If I want to analyze metaphor, however, I may need to use both the news corpus AND the novel corpus. In short, as a corpus linguist I try to capture as much data as possible to work with.
  • #13: A good corpus is one whose data can be read and processed by a computer. A written corpus stored as PDF is very hard to process: it has to be converted to MS Word first, not to mention the many uninterpretable symbols. For this reason, written-language corpora are usually created in .txt format. Spoken corpora are usually annotated first with Praat, and their transcriptions are then also provided in .txt format. This is one of the advantages of high-level corpus generators: they use web crawlers to collect their data, so a machine works periodically to grow the corpus. The corpus team needs to design the code for the web crawler and the data cleaner until the data is finally ready to be presented online on the website. In 2016 the LCC still had 32 million sentences; now, in 2020, it has 72 million. What about manually typed text files? That is okay; nothing is wrong with them. The thing is, building a corpus is an effort that truly deserves appreciation. Indonesia still lacks corpora, so little by little the pile will eventually grow into a hill.
  • #14: These are examples of English corpora. - Google N-gram provides data that results from a collection of search engine searches on the web. - Google Books is drawn from Google Books. - COCA is quite popular; its sources are varied. - The BNC is also popular; much research in phonetics, phonology, structural linguistics, and applied linguistics uses data from the BNC. - For those interested in English as a second language, there is the International Corpus of Learner English.
  • #15: Next, this is the corpus I use: the Leipzig Corpora Collection (74 million sentences). The LCC also provides processed data: the total count of the searched word in the corpus, similar words, example sentences, collocations, and collocation graphs.
  • #16: Because a corpus must be processable on a computer, this is the star button we need to look for.
  • #17: This is what the LCC looks like when we click the download button. Just download the latest version. The ones that are still greyed out are not yet available to the public.
  • #18: In a general sense, a corpus can refer to any collection of texts that serve as the basis for analysis. A person might, for example, collect examples of news editorials that are on a particular topic and refer to this collection as a “corpus.” The third and fourth characteristics of corpus linguistics make reference to the importance of computers in the analysis of language as well as different analytical approaches. It would be hard to imagine how one might use a 450-million-word corpus such as COCA without using a computer to help identify certain language features. Despite the large number of texts and the relative ease of obtaining numerous examples, a corpus analysis does not only involve counting things (quantitative analysis); it also depends on finding reasons or explanations for the quantitative findings.
  • #20: If you scroll down, there are 100 thousand sentences in Indonesian taken from various written-language sources on the internet: Wikipedia, blogs, news, and so on.
  • #21: Two other ideas
  • #22: The importance of understanding the corpora
  • #26: The importance of understanding the corpora