Phylogenetic models and MCMC methods for the reconstruction of language history

1. Phylogenetic models and MCMC methods for the reconstruction of language history. Robin J. Ryder, CEREMADE – Paris Dauphine / CREST – INSEE. Joint work with Geoff K. Nicholls, Department of Statistics, University of Oxford. www.slideshare.net/robinryder

2. Carles li reis, nostre emper[er]e magnes Set anz tuz pleins ad estet en Espaigne : Tresqu’en la mer cunquist la tere altaigne. N’i ad castel ki devant lui remaigne ; Mur ne citet n’i est remes a fraindre, Fors Sarraguce, ki est en une muntaigne. Chanson de Roland, 1r (11th century)

3. La plus commune façon d'amollir les coeurs de ceux qu'on a offensez, lors qu'ayant la vengeance en main, ils nous tiennent à leur mercy, c'est de les esmouvoir par submission à commiseration et à pitié. Montaigne, Essais, I, 1 (1580)

4. Tes yeux sont si profonds qu'en me penchant pour boire J'ai vu tous les soleils y venir se mirer S'y jeter à mourir tous les désespérés Tes yeux sont si profonds que j'y perds la mémoire Aragon, Les Yeux d'Elsa (1942)

5. Et la piaule swingue au son du ghetto, on tape à la porte Chill c'est trop fort ! baisse le son merde ! j'connais A chaque fois c'est pareil tant pis il faut qu'ça pète Et profite en traître des nouveaux albums qu'Rod m'achète Akhénaton, Juste une pression (2005)

6. What to expect Description of the data

7. Model of language diversification

8. MCMC for phylogenetic trees

9. Synthetic studies

10. Analysis of two data sets

11. Indo-European languages

12. Indo-European languages

13. Language diversification. Languages change in a way comparable to biological species. Similarities between languages indicate that they may be cousins. Most common model: phylogenetic tree.

15. Questions Topology

16. Internal ages

17. Age of the root: 6000-6500 BP or 8000-9500 BP?

18. (BP=Before Present)

19. Core vocabulary: 100 or 200 meanings, present in almost all languages: bird, hand, to eat, red...

20. Borrowing is possible (non-tree-like change), but:

21. “Easy” to detect

22. Uncommon

23. Does not introduce systematic bias

24. Data coding. Old English: stierfþ; Old High German: stirbit, touwit; Avestan: miriiete; Old Church Slavonic: umĭretŭ; Latin: moritur; Oscan: ? Cognacy classes: (1) {stierfþ, stirbit}; (2) {touwit}; (3) {miriiete, umĭretŭ, moritur}
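The coding on the slide above can be sketched as a binary language-by-class matrix. This is only an illustration: the helper name `code_language` and the use of "?" for the unknown Oscan entry are assumptions, not something taken from the talk or from TraitLab.

```python
# Word forms per language, as listed on the slide. None marks an unknown entry.
words = {
    "Old English": ["stierfþ"],
    "Old High German": ["stirbit", "touwit"],
    "Avestan": ["miriiete"],
    "Old Church Slavonic": ["umĭretŭ"],
    "Latin": ["moritur"],
    "Oscan": None,
}

# Cognacy classes, as listed on the slide.
classes = [
    {"stierfþ", "stirbit"},
    {"touwit"},
    {"miriiete", "umĭretŭ", "moritur"},
]


def code_language(forms, classes):
    """Hypothetical helper: one 0/1 entry per cognacy class, '?' if unknown."""
    if forms is None:
        return ["?"] * len(classes)
    return [1 if any(form in cls for form in forms) else 0 for cls in classes]


matrix = {lang: code_language(forms, classes) for lang, forms in words.items()}

for lang, row in matrix.items():
    print(f"{lang:20s}", *row)
```

Each column is then one trait (a cognacy class) and each row one language. Old High German displays two traits because it has two words for the same meaning.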

25. Constraints Constraints on parts of the topology

26. Constraints on some internal ages

27. We use these constraints to infer rates and other ages

29. Description of the model (1) Traits are born at rate λ

30. Trait instances die at rate μ

31. λ and μ are constants
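A minimal sketch of the birth-death dynamics above on a single lineage, assuming a standard event-by-event (Gillespie-style) simulation. The function name, the rate values and the duration are illustrative assumptions and are not taken from the talk.

```python
import random


def simulate_trait_count(n0, lam, mu, duration, rng):
    """Evolve the number of traits on one lineage for `duration` time units.

    New traits are born at constant rate lam; each of the n existing
    trait instances dies independently at rate mu.
    """
    n, t = n0, 0.0
    while True:
        total_rate = lam + n * mu
        t += rng.expovariate(total_rate)  # waiting time to the next event
        if t > duration:
            return n
        if rng.random() < lam / total_rate:
            n += 1  # birth of a new trait
        else:
            n -= 1  # death of one existing trait instance


rng = random.Random(1)
lam, mu = 2.0, 0.1  # illustrative values only
counts = [simulate_trait_count(0, lam, mu, 100.0, rng) for _ in range(500)]
mean_count = sum(counts) / len(counts)
print(mean_count, lam / mu)
```

With constant rates, the long-run average number of traits on a lineage settles near λ/μ, which is what the two printed numbers compare.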

32. Description of the model (2) Catastrophes occur at rate ρ

33. At a catastrophe, each trait dies with probability κ and Poiss(ν) traits are born.

34. λ/μ=ν/κ: the number of traits is constant on average.
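The balance condition above can be checked numerically. The sketch below applies one catastrophe to a lineage that starts at the average level λ/μ and compares the mean before and after. The function names and numerical values are assumptions chosen for illustration.

```python
import math
import random


def poisson(mean, rng):
    """Draw a Poisson variate (Knuth's method, adequate for small means)."""
    limit = math.exp(-mean)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def apply_catastrophe(n, kappa, nu, rng):
    """Each of the n traits dies with probability kappa; Poiss(nu) are born."""
    survivors = sum(1 for _ in range(n) if rng.random() >= kappa)
    return survivors + poisson(nu, rng)


rng = random.Random(2)
lam, mu, kappa = 2.0, 0.1, 0.25  # illustrative values only
nu = kappa * lam / mu            # enforces lam / mu == nu / kappa
n_before = int(lam / mu)         # start at the average level, 20 traits

after = [apply_catastrophe(n_before, kappa, nu, rng) for _ in range(5000)]
mean_after = sum(after) / len(after)
print(n_before, mean_after)
```

The expected count after the catastrophe is (1 − κ)·λ/μ + ν, which equals λ/μ exactly when ν = κλ/μ.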

35. Description of the model (3) Observation model: each data point (0s and 1s) is missing with probability ξ

36. Some traits are not observed and are therefore deleted from the data
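A rough sketch of this observation step. It assumes that "not observed" means that, after masking, no language shows a 1 for the trait; the actual registration rule used in the talk may differ. The function name `observe` and the use of `None` for missing entries are also assumptions.

```python
import random


def observe(data, xi, rng):
    """Mask entries with probability xi, then drop traits with no observed 1.

    data: list of rows (one per language), each a list of 0/1 trait values.
    Returns the masked matrix restricted to the traits that are kept.
    """
    masked = [[None if rng.random() < xi else x for x in row] for row in data]
    n_traits = len(data[0])
    keep = [j for j in range(n_traits) if any(row[j] == 1 for row in masked)]
    return [[row[j] for j in keep] for row in masked]


full = [
    [1, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]

rng = random.Random(3)
print(observe(full, 0.0, rng))  # only the all-zero second trait is dropped
print(observe(full, 0.3, rng))  # some entries masked as None
```

Because dropped traits never reach the data matrix, a likelihood for such data would need to condition on traits being registered.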

37. Registration process

41. Posterior distribution

42. Likelihood calculations

43. Prior distribution on trees Our main focus is on the root age

44. We would like the marginal prior on the root age to be (approximately) uniform over (say) 5000-15000 BP

45. MCMC moves Random walk on the parameters

46. Various moves on the tree (Drummond et al., 2002)
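A minimal sketch of a random-walk Metropolis update on a single scalar parameter, here taken to be θ = log μ. The target is a toy stand-in (a normal density), not the posterior from the talk, and the tree moves of Drummond et al. are not reproduced. All names and values are illustrative assumptions.

```python
import math
import random


def log_target(theta):
    """Toy stand-in for a log-posterior: N(-8, 0.5^2) on theta = log(mu)."""
    return -0.5 * ((theta + 8.0) / 0.5) ** 2


def random_walk_metropolis(log_target, theta0, step, n_iter, rng):
    """Symmetric Gaussian random walk with Metropolis accept/reject."""
    theta, logp = theta0, log_target(theta0)
    trace = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        logp_new = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(theta)).
        if rng.random() < math.exp(min(0.0, logp_new - logp)):
            theta, logp = proposal, logp_new
        trace.append(theta)
    return trace


rng = random.Random(0)
trace = random_walk_metropolis(log_target, -8.0, 0.5, 20000, rng)
kept = trace[2000:]  # discard an initial burn-in
posterior_mean = sum(kept) / len(kept)
print(posterior_mean)
```

Working on the log scale keeps the rate positive without having to reject negative proposals. In a full sampler, updates like this would be interleaved with moves that change the tree.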

66. Checking mixing and convergence Auto-correlations

67. Need statistics on the tree

68. Length of the tree

69. Root age

70. Presence/Absence of a few subtrees
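One way to look at the auto-correlations mentioned above is to reduce each sampled tree to a scalar statistic (tree length, root age, or a 0/1 indicator for the presence of a subtree) and compute the sample autocorrelation of that trace. The sketch below uses a synthetic AR(1) series in place of real MCMC output; the function name and the synthetic trace are assumptions.

```python
import random


def autocorrelation(trace, lag):
    """Sample autocorrelation of a scalar trace at the given lag."""
    n = len(trace)
    mean = sum(trace) / n
    var = sum((x - mean) ** 2 for x in trace) / n
    cov = sum(
        (trace[i] - mean) * (trace[i + lag] - mean) for i in range(n - lag)
    ) / n
    return cov / var


# Synthetic stand-in for a trace of, e.g., root ages: an AR(1) series
# with lag-1 correlation 0.9 (not real output from the analysis).
rng = random.Random(4)
x, demo_trace = 0.0, []
for _ in range(5000):
    x = 0.9 * x + rng.gauss(0.0, 1.0)
    demo_trace.append(x)

for lag in (1, 10, 50):
    print(lag, round(autocorrelation(demo_trace, lag), 3))
```

Autocorrelations that decay slowly with the lag indicate poor mixing for that statistic. The same function applies unchanged to a 0/1 presence/absence trace.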

71. Synthetic data True tree, ~40 words/language Consensus tree

72. Synthetic data (2) Death rate (μ)

73. Influence of borrowing True tree, ~40 words/language Borrowing: 10% Consensus tree

74. Influence of borrowing (2) Consensus tree True tree, ~40 words/language Borrowing: 50%

75. Influence of borrowing (3) Topology is reconstructed correctly

76. Dates are underestimated for high levels of borrowing Root age Death rate (μ) Borrowing: 50%

77. Detecting borrowing Confirmed: hardly any borrowing!

78. Data used Indo-European languages

79. Core vocabulary (Swadesh 100 or 200)

80. Two independent data sets

81. Dyen et al. (1997): 87 languages, mostly modern

82. Ringe et al. (2002): 24 languages, mostly ancient

83. Constraints

84. Cross-validation

97. Root age

98. Conclusions Strong support for the Anatolian hypothesis: root age around 8000 BP. No support for the Kurgan hypothesis.

99. Applicable to a variety of linguistic and cultural data sets

100. TraitLab: it's free!

101. Questions otázky spørgsmåler vragen questions Fragen domande pytania questões întrebări вопросы vprašanja preguntes preguntas frågor vrae spurningar quaestiones ερωτήσεις въпроси kesses spørsmåler kláusimai запитанні سوال प्रश्न cwestiwnau
