Sindarin dictionary statistics

Roman Rausch

Sep. 2nd 2013



Introduction

This is a statistical evaluation of the Sindarin dictionary hosted at http://www.sindarin.de.

1  All phonemes

The frequencies of all Sindarin phonemes are found to be:


rank  phoneme  frequency
1     a        0.145
2     n        0.11
3     e        0.094
4     r        0.087
5     i        0.075
6     l        0.071
7     o        0.055
8     g        0.044
9     d        0.043
10    þ        0.041
11    u        0.036
12    m        0.03
13    s        0.027
14    t        0.023
15    b        0.019
16    k        0.015
17    w        0.015
18    f        0.012
19    h        0.012
20    v        0.011
21    ð        0.011
22    p        0.009
23    χ        0.007
24    j        0.002
25    y        0.002
26    R        0.002
27    L        0.002
28    W        0.001

Notation (for ease of counting, digraphs were converted to unigraphs): th → þ, dh → ð, ch → χ, lh → L, rh → R, hw → W; in addition, c is written k and consonantal i is written j.
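
A minimal sketch of this conversion step (Python; the mapping simply mirrors the notation above, and the sample word is a toy example):

    # Convert Sindarin digraphs to single symbols, so that every phoneme
    # is exactly one character. The mapping mirrors the notation above.
    DIGRAPHS = {
        "th": "þ", "dh": "ð", "ch": "χ",
        "lh": "L", "rh": "R", "hw": "W",
    }

    def to_unigraphs(word: str) -> str:
        for digraph, unigraph in DIGRAPHS.items():
            word = word.replace(digraph, unigraph)
        # Sindarin c is always [k] and is written k here.
        return word.replace("c", "k")

    print(to_unigraphs("rochben"))  # -> roχben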

Assumptions for simplicity: vowel length is ignored, and diphthongs are counted as sequences of two vowels.

1.1  Discussion

For the rank-frequency distribution p(r) (where r is a phoneme’s rank), an ad-hoc formula was first proposed by Zipf in 1929 [1]:

p(r) \sim \frac{1}{r}

with the normalization s(N) = \sum_{k=1}^{N} \frac{1}{k}, where N is the total number of phonemes.
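
As a quick illustration with N = 28 (the Sindarin value from the table above), the harmonic normalization gives

s(28) = \sum_{k=1}^{28} \frac{1}{k} \approx 3.93, \qquad p(1) = \frac{1}{s(28)} \approx 0.25,

which considerably overshoots the observed top frequency of 0.145.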

Several authors have since noticed that it does not fit the data across languages particularly well and have proposed other ad-hoc fitting functions [3, 4]. In 1988, Gusein-Zade proposed a formula [2] based on a sensible assumption, namely that rank-frequencies are drawn from a uniform probability density and that p(r) can be approximated by the corresponding expectation value for any given language. This leads to:

p(r) = \frac{1}{N} \sum_{k=0}^{N-r} \frac{1}{r+k}

For large N and large r this can be approximated by:

p(r) \approx \frac{1}{N} \log\frac{N+1}{r}

It turns out that this formula describes real-language data rather well, and no wild fitting is required (see below). The fact that a model assumption enters the calculation seems to have been overlooked or misunderstood by other authors, probably because Gusein-Zade’s paper was published in Russian. One can see that it makes no sense to generalize the Zipf distribution by adding fittable parameters, like r^(−β) (as often seems to be done), because the dependency is different: approximately log(1/r) rather than a power law^1. This means that a semilogarithmic plot of p(r) should produce a straight line. This is indeed the case for the Sindarin data, as seen in fig. 1.
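
A short numerical sketch of both formulas (Python; it assumes nothing beyond N = 28 and the expressions above):

    import math

    N = 28  # total number of phonemes, as in the Sindarin table above

    def gusein_zade_exact(r: int) -> float:
        # Expectation value for the frequency of the phoneme ranked r.
        return sum(1 / (r + k) for k in range(N - r + 1)) / N

    def gusein_zade_approx(r: int) -> float:
        # Logarithmic approximation, good for large N and r.
        return math.log((N + 1) / r) / N

    for r in (1, 2, 5, 10, 20, 28):
        print(r, round(gusein_zade_exact(r), 3), round(gusein_zade_approx(r), 3))

For r = 1 the exact value is 0.140, close to the observed 0.145 for a; for r = 28 it is about 0.001, matching the observed frequency of W.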

Comparing it with data from natural languages (fig. 2), one finds similarly good agreement for English and Swedish, and somewhat worse agreement for Bengali. Except for Bengali, the deviations are spread both above and below the Gusein-Zade function, which suggests a statistical rather than a systematic error. I do not know how reliable the Bengali data are.

Note that the formula does not predict how common a particular sound is, but rather how frequent the phoneme ranked r is (whatever that phoneme may be). It turns out that this value is completely determined by the total number of phonemes N.

Note also that it matters for the individual frequencies whether one considers a dictionary or a text. In the latter case, English [ð] obviously becomes much more common [5] due to the thes and thats (in Sindarin texts, the frequency of i is expected to go up for the same reason). However, the distribution seems to stay the same: The RP data in figure 2 are from a dictionary, the American English data from a text.

Finally one should note that the RP data for English include diphthongs as separate phonemes, while the American English data do not; but again, this does not seem to affect the distribution itself.

We can thus conclude that the rank-frequency distribution of the Sindarin phonemes is indistinguishable from that of a natural language.


Figure 1: Rank-frequency distribution of Sindarin phonemes


Figure 2: Rank-frequency distributions of phonemes for various natural languages. The American English, Swedish and Bengali data are from the references in [3], the RP data are from [5].

2  Vowels & consonants

Rank frequencies for vowels only:

rank  phoneme  frequency
1     a        0.355
2     e        0.231
3     i        0.183
4     o        0.136
5     u        0.089
6     y        0.006

Rank frequencies for consonants only:

rank  phoneme  frequency
1     n        0.185
2     r        0.146
3     l        0.121
4     g        0.075
5     d        0.072
6     þ        0.07
7     m        0.051
8     s        0.046
9     t        0.038
10    b        0.032
11    k        0.026
12    w        0.025
13    f        0.019
14    h        0.019
15    v        0.018
16    ð        0.018
17    p        0.015
18    χ        0.012
19    j        0.004
20    R        0.004
21    L        0.003
22    W        0.001

Vowel-to-consonant ratio:

consonants  0.592
vowels      0.408

3  Place & manner of articulation

Place of articulation:

dentals       0.567
labials       0.184
velars        0.15
interdentals  0.1

Manner of articulation:

sonorants/semivowels  0.541
stops and fricatives  0.459

Distribution among stops:

rank  phoneme  frequency
1     g        0.291
2     d        0.28
3     t        0.148
4     b        0.123
5     k        0.099
6     p        0.059

Distribution among fricatives:

rank  phoneme  frequency
1     þ        0.344
2     s        0.226
3     f        0.096
4     h        0.096
5     v        0.09
6     ð        0.089
7     χ        0.058

Distribution among sonorants/semivowels:

rank  phoneme  frequency
1     n        0.343
2     r        0.271
3     l        0.223
4     m        0.094
5     w        0.046
6     j        0.008
7     R        0.007
8     L        0.006
9     W        0.002

4  Bigrams and entropy

A bigram is a cluster of two letters or, in this case, two phonemes. One can introduce the conditional probability pi(j) of finding the phoneme j when the preceding phoneme is i. It forms a matrix with normalized rows: ∑j pi(j) = 1. If one weights the rows with the frequencies p(i), one obtains the probability of getting the phonemes i and j in two sequential draws: p(i,j) = p(i)pi(j). This is now of course normalized with respect to the total sum: ∑ij p(i,j) = 1. The procedure is readily generalized to n-grams.
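
A sketch of the bookkeeping in Python (the three-word list at the end is a toy stand-in for the actual dictionary, assumed to be already unigraph-encoded as in section 1):

    from collections import Counter

    def bigram_statistics(words):
        """Estimate p(i), p_i(j) and p(i, j) from a word list."""
        unigrams, bigrams = Counter(), Counter()
        for w in words:
            unigrams.update(w)
            bigrams.update(zip(w, w[1:]))
        n = sum(unigrams.values())
        p = {i: c / n for i, c in unigrams.items()}
        # Rows normalized to one: sum_j p_i(j) = 1 for every i.
        row_sums = Counter()
        for (i, j), c in bigrams.items():
            row_sums[i] += c
        p_cond = {(i, j): c / row_sums[i] for (i, j), c in bigrams.items()}
        # Joint probabilities, normalized over all pairs: sum_ij p(i,j) = 1.
        # Up to word-boundary effects this equals p(i) * p_i(j).
        total = sum(bigrams.values())
        p_joint = {ij: c / total for ij, c in bigrams.items()}
        return p, p_cond, p_joint

    p, p_cond, p_joint = bigram_statistics(["adan", "mellon", "galað"])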

Linguistically, the matrix shows us a language’s phonotactics and the restrictiveness of its phonology. (Probably, one can also use it to write a ruthlessly efficient hangman algorithm.) Obviously, the more evenly the values are spread across the bigram matrix, the freer the phonology. This is exactly what is measured by the n-gram entropy^2:

H_n = -\sum_{i_1 i_2 \ldots i_n}^{N} p(i_1, i_2, \ldots, i_n) \log_2 p(i_1, i_2, \ldots, i_n)

H1 can already be computed from the unigram frequencies p(i), but as discussed above, their distribution is mostly determined by the total number of phonemes N, so the same goes for this entropy. It seems more interesting to look at the bigram entropy H2: the smaller it is, the more restrictive the phonology. Note that for any value of n, Hn reaches its maximum of Hmax = log2(N) per phoneme when all n-grams are equiprobable, which would make the phonology absolutely free and all the phonemes uncorrelated.
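
Continuing the sketch above (p and p_joint are from the previous listing). One caveat: for H2 to come out smaller than H1, as in the table below and in Shannon’s English figures, it must be read as the conditional entropy H(j|i) = H(i,j) − H(i); the sketch assumes that reading.

    import math

    def entropy_bits(probabilities):
        """Shannon entropy -sum(p * log2 p) in bits; zero entries are skipped."""
        return -sum(q * math.log2(q) for q in probabilities if q > 0)

    H1 = entropy_bits(p.values())            # unigram entropy
    H_pair = entropy_bits(p_joint.values())  # joint entropy of a phoneme pair
    H2 = H_pair - H1                         # conditional entropy, cf. Shannon's F2
    H_max = math.log2(len(p))                # all phonemes equiprobable
    print(H1 / H_max, H2 / H_max)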

The following three tables show pi(j), computed for vowels only, for consonants only, and for all phonemes. Each row lists the probabilities of the phonemes that can follow the given phoneme; combinations that do not occur in the data are omitted, and rows with no attested successor are marked (none).

Vowels only:

a: e 0.51, i 0.233, o 0.004, u 0.253
e: a 0.2, i 0.733, o 0.033, u 0.033
i: a 0.696, e 0.174, o 0.13
o: e 1
u: a 0.048, i 0.952
y: (none)

Consonants only:

p: l 0.5, r 0.5
t: l 0.2, r 0.8
k: þ 0.154, l 0.154, r 0.692
b: d 0.032, l 0.065, r 0.903
d: b 0.068, d 0.017, v 0.017, l 0.068, r 0.695, w 0.136
g: d 0.007, n 0.014, l 0.435, r 0.116, w 0.428
s: p 0.021, t 0.604, k 0.031, b 0.021, d 0.01, g 0.063, s 0.25
f: f 0.833, l 0.056, r 0.111
þ: g 0.079, l 0.095, r 0.825
χ: b 1
h: (none)
v: n 0.542, l 0.042, r 0.417
ð: b 0.063, r 0.938
n: t 0.072, k 0.038, b 0.004, d 0.102, g 0.148, f 0.008, þ 0.015, h 0.011, n 0.557, l 0.008, r 0.004, w 0.034
l: t 0.046, b 0.037, d 0.055, g 0.009, f 0.037, þ 0.083, χ 0.101, v 0.101, ð 0.009, l 0.45, r 0.028, w 0.046
r: b 0.056, d 0.033, g 0.056, f 0.011, þ 0.239, χ 0.106, v 0.056, ð 0.056, n 0.278, l 0.017, r 0.033, m 0.006, w 0.056
L: (none)
R: (none)
m: p 0.159, b 0.136, d 0.023, h 0.023, l 0.159, r 0.136, m 0.364
w: d 1
W: (none)
j: (none)

All phonemes:

a: e 0.104, i 0.048, o 0.001, u 0.052, t 0.002, k 0.001, b 0.015, d 0.102, g 0.025, s 0.067, f 0.006, þ 0.079, χ 0.015, h 0.002, v 0.026, ð 0.018, n 0.151, l 0.082, r 0.143, m 0.044, w 0.021
e: a 0.007, i 0.025, o 0.001, u 0.001, k 0.002, b 0.027, d 0.076, g 0.077, s 0.047, f 0.001, þ 0.065, χ 0.012, v 0.013, ð 0.045, n 0.246, l 0.204, r 0.114, L 0.001, m 0.013, w 0.024
i: a 0.097, e 0.024, o 0.018, b 0.015, d 0.019, g 0.021, s 0.021, f 0.004, þ 0.105, χ 0.004, h 0.001, v 0.013, ð 0.019, n 0.244, l 0.121, r 0.16, m 0.1, w 0.01
o: e 0.024, b 0.015, d 0.068, g 0.041, s 0.068, þ 0.083, χ 0.013, h 0.004, v 0.041, ð 0.02, n 0.218, l 0.124, r 0.274, m 0.007, w 0.002
u: a 0.019, i 0.374, b 0.016, d 0.031, g 0.097, s 0.022, þ 0.04, v 0.025, ð 0.022, n 0.14, l 0.062, r 0.125, R 0.003, m 0.025, w 0.003
y: g 0.087, v 0.043, n 0.348, l 0.261, r 0.261
p: a 0.341, e 0.451, i 0.073, o 0.037, u 0.024, l 0.043, r 0.043
t: a 0.418, e 0.136, i 0.13, o 0.153, u 0.079, l 0.018, r 0.072
k: a 0.41, e 0.158, i 0.072, o 0.137, u 0.094, y 0.036, þ 0.015, l 0.015, r 0.07
b: a 0.331, e 0.357, i 0.013, o 0.076, u 0.019, y 0.006, d 0.007, l 0.013, r 0.184
d: a 0.293, e 0.107, i 0.132, o 0.118, u 0.121, y 0.018, b 0.015, d 0.004, v 0.004, l 0.015, r 0.149, w 0.029
g: a 0.23, e 0.055, i 0.037, o 0.241, u 0.04, d 0.003, n 0.006, l 0.174, r 0.046, w 0.171
s: a 0.17, e 0.073, i 0.085, o 0.03, u 0.061, p 0.012, t 0.355, k 0.018, b 0.012, d 0.006, g 0.037, s 0.147
f: a 0.387, e 0.153, i 0.189, o 0.081, u 0.027, f 0.143, l 0.01, r 0.019
þ: a 0.325, e 0.134, i 0.134, o 0.081, u 0.069, g 0.021, l 0.025, r 0.215
χ: a 0.514, e 0.081, i 0.081, o 0.162, u 0.135, b 0.054
h: a 0.47, e 0.243, i 0.13, o 0.096, u 0.061
v: a 0.307, e 0.173, i 0.067, o 0.12, u 0.013, n 0.181, l 0.014, r 0.139
ð: a 0.121, e 0.379, i 0.034, o 0.155, u 0.034, b 0.018, r 0.275
n: a 0.212, e 0.161, i 0.099, o 0.077, u 0.047, y 0.003, t 0.029, k 0.015, b 0.002, d 0.041, g 0.059, f 0.003, þ 0.006, h 0.005, n 0.224, l 0.003, r 0.002, w 0.014
l: a 0.303, e 0.16, i 0.132, o 0.124, u 0.071, y 0.002, t 0.01, b 0.008, d 0.012, g 0.002, f 0.008, þ 0.017, χ 0.021, v 0.021, ð 0.002, l 0.094, r 0.006, w 0.01
r: a 0.2, e 0.128, i 0.185, o 0.136, u 0.069, y 0.005, b 0.015, d 0.009, g 0.015, f 0.003, þ 0.067, χ 0.029, v 0.015, ð 0.015, n 0.077, l 0.005, r 0.009, m 0.002, w 0.015
L: a 0.222, e 0.167, i 0.222, o 0.278, u 0.111
R: a 0.318, i 0.136, o 0.182, u 0.364
m: a 0.282, e 0.167, i 0.216, o 0.088, u 0.048, y 0.004, p 0.032, b 0.027, d 0.005, h 0.005, l 0.032, r 0.027, m 0.072
w: a 0.578, e 0.321, i 0.092, d 0.018
W: a 0.375, e 0.125, i 0.5
j: a 0.72, o 0.16, u 0.12

For the unigram and bigram entropies, one obtains:

                  H1     H1/Hmax  H2     H2/Hmax
Sindarin data     4.111  0.855    3.051  0.635
Gusein-Zade N=28  4.251  0.884

Unfortunately, data from natural languages are hard to come by. For English, Shannon gives H1 = 4.14, H1/Hmax = 0.88 and H2 = 3.56, H2/Hmax = 0.76 [7]. However, this was calculated for the N = 26 Latin letters rather than for phonemes. Making the comparison nevertheless, one can see that the phonology of Sindarin is considerably more restricted, which makes sense.
H2 is expected to be smaller than H1 for any language (this is equivalent to the existence of phonotactics). To find a lower bound, languages like Japanese or Hawaiian are promising candidates.
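
For reference, the quoted ratios follow directly from Hmax = log2(N):

\log_2 26 \approx 4.70: \quad 4.14/4.70 \approx 0.88, \quad 3.56/4.70 \approx 0.76;
\log_2 28 \approx 4.81: \quad 4.111/4.81 \approx 0.855, \quad 3.051/4.81 \approx 0.635.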

5  Sources

To get a distribution by source, only unique entries were counted. Because of the ubiquitous conceptual changes by Tolkien, an editorial decision has to be made regarding what to count as unique.

For example, N. naith ’gore’ (Ety:387), S. neith, naith ’angle’ (PE17:55) and S. naith ’spearhead, gore, wedge, narrow promontory’ (UT:282) were regarded as the same (polysemous) word, with various possible translations into English, and a joint reference (Ety:387, PE17:55, UT:282).
On the other hand, S. eitha- ’1. prick with a sharp point, stab 2. treat with scorn, insult’ (HEK-, WJ:365) and S. eitha- ’to ease, assist’ (ATHA-, PE17:148) are clearly two different (homophonous) words, and are therefore kept separate. In this case it is obvious from their different etymologies.
There is a grey zone, however: For example, EN baran ’brown, swart, dark-brown’ and S. baran ’brown, yellow-brown’ suggest a conceptual change, albeit a small one, so that they were counted as separate entries, and thus also as different words for the statistics.

This gives the following absolute and relative counts (compare also the Hiswelóke charts [8]):

source   count  rel. count
Ety      1064   0.473
PE17      680   0.302
LotR      234   0.104
S         214   0.095
WJ        185   0.082
UT         89   0.04
VT42       84   0.037
VT45       74   0.033
PM         71   0.032
Letters    67   0.03
SD         60   0.027
RGEO       51   0.023
VT46       51   0.023
VT48       49   0.022
VT47       39   0.017
VT50       32   0.014
WR         31   0.014
VT44       29   0.013
MR         25   0.011
LB         20   0.009
RC         20   0.009
PE19       17   0.008
TC         17   0.008
VT41       14   0.006
TI         11   0.005
LR         10   0.004
PE18        8   0.004
RS          7   0.003
PE13        7   0.003
TAI         4   0.002
PE11        4   0.002
VT39        1   0.0

sum                   3269
unique entries total  2251

Of course, a good number of words is attested in several sources, so that the summed count is higher than the actual entry count. The Venn diagram in figure 3 shows how words are shared across the two top sources (The Etymologies and Parma Eldalamberon 17) and the rest.


Figure 3: Sindarin vocabulary sources
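
A small sketch of how such a breakdown can be computed (Python; the two-entry dictionary is a toy example, with sources named as in the discussion above):

    from collections import Counter

    def venn_regions(entries):
        """entries maps each unique word to the set of sources attesting it;
        returns counts for the regions of a three-set Venn diagram over
        Ety, PE17 and all remaining sources."""
        regions = Counter()
        for sources in entries.values():
            key = ("Ety" in sources,
                   "PE17" in sources,
                   bool(sources - {"Ety", "PE17"}))
            regions[key] += 1
        return regions

    print(venn_regions({"naith": {"Ety", "UT"}, "eitha-": {"PE17"}}))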

References

[1]
G. K. Zipf, Relative frequency as a determinant of phonetic change, Harvard studies in classical philology, Vol. 40 (1929), pp. 1-95
[2]
S. M. Gusein-Zade, On the distribution of the letters of the Russian language by frequency of occurrence (in Russian), Problemy Peredachi Informatsii 24:4 (1988), pp. 102-107
[3]
B. Sigurd, Rank-frequency distributions for phonemes, Phonetica 18: 1-15 (1968)
[4]
W. Li, P. Miramontes, G. Cocho, Fitting ranked linguistic data with two-parameter functions, Entropy 2010, 12, 1743-1764
[5]
J. Higgins, RP phonemes in the Advanced Learner’s Dictionary, http://myweb.tiscali.co.uk/wordscape/wordlist/phonfreq.html
[6]
C. E. Shannon, A mathematical theory of communication, The Bell System Technical Journal, 27, 379-423, 623-656 (1948)
[7]
C. E. Shannon, Prediction and entropy of printed English, The Bell System Technical Journal, 30(1), 50-64 (1951)
[8]
Hiswelóke Sindarin dictionary statistical charts http://www.jrrvf.com/hisweloke/sindar/online/sindar/charts-sd-en.html



1
This does not mean that the Zipf distribution cannot be applicable elsewhere. It does seem to describe the distribution of words in a text [7].
2
The logarithm to base 2 is a convention; one then says that the entropy is measured in "bits". Of course, this sets the scale rather than the unit: H is dimensionless.
The interpretation of H in information theory is as (average) uncertainty: H is zero if one probability equals one (a completely certain event), it increases with N (the more outcomes, the higher the uncertainty), and at fixed N it is maximal when all probabilities are equal (all outcomes equiprobable, hence maximal uncertainty). Finally, the uncertainty of two independent events is the sum of the individual uncertainties.
