Indo-European languages tree by Levenshtein distance
SERVA, Maurizio
2008-01-01
Abstract
The evolution of languages closely resembles the evolution of haploid organisms. This similarity has recently been exploited (Gray R. D. and Atkinson Q. D., Nature, 426 (2003) 435; Gray R. D. and Jordan F. M., Nature, 405 (2000) 1052) to construct language trees. The key point is the definition of a distance between all pairs of languages that is analogous to a genetic distance. Many methods have been proposed to define these distances; one of these, used by glottochronology, computes the distance from the percentage of shared "cognates". Cognates are words inferred to have a common historical origin, and subjective judgment plays a significant role in the identification process. Here we push the analogy with evolutionary biology further and introduce a genetic distance between language pairs by considering a renormalized Levenshtein distance between words with the same meaning and averaging over all words contained in a Swadesh list (Swadesh M., Proc. Am. Philos. Soc., 96 (1952) 452). The subjectivity of the process is considerably reduced and reproducibility is greatly facilitated. We test our method on the Indo-European group, considering fifty different languages and the two hundred words of the Swadesh list for each of them. We obtain a tree which closely resembles the one published in Gray and Atkinson (2003), with some significant differences. Copyright (c) EPLA, 2008.
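The distance described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the Levenshtein distance is renormalized, so the sketch assumes division by the length of the longer word, which keeps each word-pair distance in [0, 1]. The function names and the list-based input format are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Levenshtein distance divided by the longer word's length.
    ASSUMPTION: this normalization is not specified in the abstract."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return levenshtein(a, b) / longest


def language_distance(words_a: Sequence[str], words_b: Sequence[str]) -> float:
    """Average normalized distance over aligned word lists, where
    words_a[i] and words_b[i] are assumed to share the same meaning."""
    if len(words_a) != len(words_b) or not words_a:
        raise ValueError("word lists must be non-empty and of equal length")
    total = sum(normalized_distance(x, y) for x, y in zip(words_a, words_b))
    return total / len(words_a)
```

Applying `language_distance` to every pair of languages would give a symmetric distance matrix, which a standard tree-building procedure could then take as input. The abstract does not say which procedure was used, so that step is left out here.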