Indo-European languages tree by Levenshtein distance

Serva, Maurizio; Petroni, F.

doi:10.1209/0295-5075/81/68005

The evolution of languages closely resembles the evolution of haploid organisms. This similarity has recently been exploited (Gray R. D. and Atkinson Q. D., Nature, 426 (2003) 435; Gray R. D. and Jordan F. M., Nature, 405 (2000) 1052) to construct language trees. The key point is the definition of a distance between all pairs of languages that is analogous to a genetic distance. Many methods have been proposed to define these distances; one of these, used by glottochronology, computes the distance from the percentage of shared "cognates". Cognates are words inferred to have a common historical origin, and subjective judgment plays a significant role in the identification process. Here we push the analogy with evolutionary biology further and introduce a genetic distance between language pairs by considering a renormalized Levenshtein distance between words with the same meaning and averaging over all words contained in a Swadesh list (Swadesh M., Proc. Am. Philos. Soc., 96 (1952) 452). The subjectivity of the process is substantially reduced and reproducibility is greatly facilitated. We test our method on the Indo-European group, considering fifty different languages and the two hundred words of the Swadesh list for each of them. We obtain a tree which closely resembles the one published in Gray and Atkinson (2003), with some significant differences. Copyright (c) EPLA, 2008.
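
The distance described in the abstract can be illustrated with a minimal Python sketch. The abstract does not spell out the renormalization, so the sketch assumes that the Levenshtein distance of each word pair is divided by the length of the longer word; the function names and the toy word lists are hypothetical and are not taken from the paper's data.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Assumed renormalization: divide by the length of the longer word,
    so the result lies between 0 and 1."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return levenshtein(a, b) / longest


def language_distance(words_a: Sequence[str], words_b: Sequence[str]) -> float:
    """Average the normalized distance over word pairs with the same meaning.
    Both lists must be aligned: position k holds the same meaning in each."""
    if len(words_a) != len(words_b) or not words_a:
        raise ValueError("word lists must be non-empty and of equal length")
    total = sum(normalized_distance(x, y) for x, y in zip(words_a, words_b))
    return total / len(words_a)


# Toy aligned lists (illustrative only, two meanings instead of two hundred).
english = ["water", "night"]
german = ["wasser", "nacht"]
print(language_distance(english, german))
```

Computing this value for every pair of languages would give a distance matrix, which a standard distance-based tree-building method could then turn into a language tree; the choice of that method is not stated in the abstract and is left open here.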