Cognate or False Friend? Ask the Web!
Svetlin Nakov
Sofia University
5 James Boucher Blvd.
Sofia, Bulgaria
nakov@fmi.uni-sofia.bg
Preslav Nakov
Univ. of Cal. Berkeley
EECS, CS division
Berkeley, CA 94720
nakov@cs.berkeley.edu
Elena Paskaleva
Bulgarian Academy of Sciences
25 A Acad. G. Bonchev Str.
Sofia, Bulgaria
hellen@lml.bas.bg
Abstract
We propose a novel unsupervised semantic
method for distinguishing cognates from false
friends. The basic intuition is that if two words
are cognates, then most of the words in their
respective local contexts should be translations
of each other. The idea is formalised using the
Web as a corpus, a glossary of known word trans-
lations used as cross-linguistic "bridges", and
the vector space model. Unlike traditional or-
thographic similarity measures, our method can
easily handle words with identical spelling. The
evaluation on 200 Bulgarian-Russian word pairs
shows this is a very promising approach.
Keywords
Cognates, false friends, semantic similarity, Web as a corpus.
1 Introduction
Linguists define cognates as words derived from a com-
mon root. For example, the Electronic Glossary of
Linguistic Terms gives the following definition [5]:
Two words (or other structures) in related
languages are cognate if they come from the
same original word (or other structure). Gen-
erally cognates will have similar, though of-
ten not identical, phonological and semantic
structures (sounds and meanings). For instance,
Latin tu, Spanish tú, Greek sú, German du,
and English thou are all cognates; all
mean 'second person singular', but they differ
in form and in whether they mean specifically
'familiar' (non-honorific).
Following previous researchers in computational lin-
guistics [4, 22, 25], we adopt a simplified definition,
which ignores origin, defining cognates (or true friends)
as words in different languages that are translations
and have a similar orthography. Similarly, we define
false friends as words in different languages with sim-
ilar orthography that are not translations. Here are
some identically-spelled examples of false friends:
• pozor (позор) means a disgrace in Bulgarian, but
attention in Czech;
• mart (март) means March in Bulgarian, but a
market in English;
• Gift means a poison in German, but a present in
English;
• Prost means cheers in German, but stupid in
Bulgarian.
And some examples with a different orthography:
• embaraçada means embarrassed in Portuguese,
while embarazada means pregnant in Spanish;
• spenden means to donate in German, but to
spend means to use up or to pay out in English;
• bachelier means a person who passed his bac
exam in French, but in English bachelor means
an unmarried man;
• babichka (бабичка) means an old woman in Bulgarian,
but babochka (бабочка) is a butterfly in
Russian;
• godina (година) means a year in Russian, but
godzina is an hour in Polish.
In the present paper, we describe a novel semantic
approach to distinguishing cognates from false friends.
The paper is organised as follows: Section 2 explains
the method, section 3 describes the resources, section
4 presents the data set, section 5 describes the experi-
ments, section 6 discusses the results of the evaluation,
and section 7 points to important related work. We
conclude with directions for future work in section 8.
2 Method
2.1 Contextual Web Similarity
We propose an unsupervised algorithm which, given a Russian word w_ru and a
Bulgarian word w_bg to be compared, measures the semantic similarity between
them using the Web as a corpus and a glossary G of known Russian-Bulgarian
translation pairs, used as "bridges". The basic idea is that if two words are
translations, then the words in their respective local contexts should be
translations as well. The idea is formalised using the vector space model: we
measure the semantic similarity between w_bg and w_ru by constructing the
corresponding contextual semantic vectors V_bg and V_ru, translating V_ru into
Bulgarian, and comparing it to V_bg.
The process of building V_bg starts with a query to Google, limited to
Bulgarian pages, for the target word w_bg. We collect the resulting text
snippets (up to 1,000), and we remove all stop words: prepositions, pronouns,
conjunctions, interjections and some adverbs. We then identify the occurrences
of w_bg, and we extract three words on either side of it. We filter out the
words that do not appear on the Bulgarian side of G. Finally, for each retained
word, we calculate the number of times it has been extracted, thus producing
a frequency vector V_bg. We repeat the procedure for w_ru to obtain a Russian
frequency vector V_ru, which is then "translated" into Bulgarian by replacing
each Russian word with its translation(s) in G, retaining the co-occurrence
frequencies. In case of multiple Bulgarian translations for some Russian word,
we distribute the corresponding frequency equally among them, and in case of
multiple Russian words with the same Bulgarian translation, we sum up the
corresponding frequencies. As a result, we end up with a Bulgarian vector
V_ru→bg for the Russian word w_ru. Finally, we calculate the semantic
similarity between w_bg and w_ru as the cosine between their corresponding
Bulgarian vectors, V_bg and V_ru→bg.
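The steps above can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions: the snippets are already downloaded and given as plain strings, tokenisation is whitespace splitting, and the function names (context_vector, translate_vector, cosine) are hypothetical rather than taken from the authors' implementation.

```python
import math
from collections import Counter, defaultdict


def context_vector(snippets, target, vocabulary, stop_words, window=3):
    """Count glossary words found within `window` positions of `target`."""
    counts = Counter()
    for snippet in snippets:
        # Assumption: whitespace tokenisation; stop words are dropped first.
        tokens = [t for t in snippet.lower().split() if t not in stop_words]
        for i, token in enumerate(tokens):
            if token != target:
                continue
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            # Keep only words that appear on this language's side of the glossary.
            counts.update(t for t in neighbours if t in vocabulary)
    return counts


def translate_vector(v_ru, ru_to_bg):
    """Map a Russian frequency vector into Bulgarian through the glossary."""
    v = defaultdict(float)
    for ru_word, freq in v_ru.items():
        translations = ru_to_bg.get(ru_word, [])
        for bg_word in translations:
            # Split the frequency equally among the translations;
            # Russian words sharing a translation add up.
            v[bg_word] += freq / len(translations)
    return dict(v)


def cosine(u, v):
    """Cosine between two sparse vectors stored as dictionaries."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0
```

Under these assumptions, the similarity of a pair would be cosine(V_bg, translate_vector(V_ru, ru_to_bg)), with both vectors built by context_vector from the snippets of the respective language.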
2.2 Reverse Context Lookup
The reverse context lookup is a modification of the above algorithm. The
original algorithm implicitly assumes that, given a word w, the words in the
local context of w are semantically associated with it, which is often wrong
due to Web-specific words like home, site, page, click, link, download, up,
down, back, etc. Since their Bulgarian and Russian equivalents are in the
glossary G, we can get very high similarity for unrelated words. For the same
reason, we cannot judge such navigational words as true/false friends.
The reverse context lookup copes with the problem as follows: in order to
consider w associated with a word w_c from the local context of w, it requires
that w appear in the local context of w_c as well¹. More formally, let #(x, y)
be the number of occurrences of x in the local context of y. The strength of
association is calculated as $p(w, w_c) = \min\{\#(w, w_c), \#(w_c, w)\}$ and
is used in the vector coordinates instead of #(w, w_c), which is used in the
original algorithm.
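A minimal sketch of the association strength p follows. It assumes the context counts have already been collected into a nested dictionary, contexts[y][x], holding the number of times x was seen in the local context of y; this data layout and the function names are illustrative assumptions, not part of the original description.

```python
def occurrences(contexts, x, y):
    """#(x, y): how often x was seen in the local context of y."""
    return contexts.get(y, {}).get(x, 0)


def reverse_lookup_vector(contexts, w):
    """Vector for w whose coordinates are p(w, w_c) = min(#(w, w_c), #(w_c, w))."""
    vector = {}
    for w_c in contexts.get(w, {}):
        strength = min(occurrences(contexts, w, w_c), occurrences(contexts, w_c, w))
        if strength > 0:
            # A context word survives only if w also occurs in its own context.
            vector[w_c] = strength
    return vector
```

With toy counts in which "click" appears next to "bank" but "bank" never appears next to "click", the coordinate for "click" drops out, which is the intended effect on navigational words.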
2.3 Web Similarity Using Seed Words
For comparison purposes, we also experiment with the seed words algorithm of
Fung&Yee'98 [12], which we adapt to use the Web. We prepare a small glossary
of 300 Russian-Bulgarian word translation pairs, which is a subset of the
glossary used for our contextual Web similarity algorithm². Given a Bulgarian
word w_bg and a Russian word w_ru to compare, we build two vectors, one
Bulgarian (V_bg) and one Russian (V_ru), both of size 300, where each
coordinate corresponds to a particular glossary entry (g_ru, g_bg). Therefore,
we have a direct correspondence between the coordinates of V_bg and V_ru. The
coordinate value for g_bg in V_bg is calculated as the total number of
co-occurrences of w_bg and g_bg on the Web, where g_bg immediately precedes
or immediately follows w_bg. This number is calculated using Google page hits
as a proxy for bigram frequencies: we issue two exact phrase queries
"w_bg g_bg" and "g_bg w_bg", and we sum the corresponding numbers of page
hits. We repeat the same procedure with w_ru and g_ru in order to obtain the
values for the corresponding coordinates of the Russian vector V_ru. Finally,
we calculate the semantic similarity between w_bg and w_ru as the cosine
between V_bg and V_ru.
3 Resources
3.1 Grammatical Resources
We use two monolingual dictionaries for lemmatisation. For Bulgarian, we have
a large morphological dictionary, containing about 1,000,000 wordforms and
70,000 lemmata [29], created at the Linguistic Modeling Department, Institute
for Parallel Processing, Bulgarian Academy of Sciences. Each dictionary entry
consists of a wordform, a corresponding lemma, followed by morphological and
grammatical information. There can be multiple entries for the same wordform,
in case of multiple homographs. We also use a large grammatical dictionary of
Russian in the same format, consisting of 1,500,000 wordforms and 100,000
lemmata, based on the Grammatical Dictionary of A. Zaliznjak [35]. Its
electronic version was supplied by the Computerised Fund of Russian Language,
Institute of Russian Language, Russian Academy of Sciences.
3.2 Bilingual Glossary
We built a bilingual glossary using an online Russian-Bulgarian dictionary³
with 3,982 entries in the following format: a Russian word, an optional
grammatical marker, optional stylistic references, and a list of Bulgarian
translation equivalents. First, we removed all multi-word expressions. Then we
combined each Russian word with each of its Bulgarian translations - due to
polysemy/homonymy some words had multiple translations. As a result, we
obtained a glossary G of 4,563 word-word translation pairs (3,794 if we
exclude the stop words).
3.3 Huge Bilingual Glossary
Similarly, we adapted a much larger Bulgarian-Russian electronic dictionary,
transforming it into a bilingual glossary with 59,583 word-word translation
pairs.
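The way dictionary entries are turned into word-word translation pairs for the glossaries of section 3 can be sketched as follows. The sketch assumes each entry has already been parsed into a Russian headword and a list of Bulgarian equivalents (grammatical markers and stylistic references discarded), and that a multi-word expression is simply anything containing a space; build_glossary is a hypothetical name.

```python
def build_glossary(entries, stop_words=frozenset()):
    """Expand (russian_word, [bulgarian_translations]) entries into word pairs."""
    pairs = set()
    for ru_word, bg_translations in entries:
        if " " in ru_word:
            continue  # assumption: a space marks a multi-word expression
        for bg_word in bg_translations:
            if " " in bg_word:
                continue
            if ru_word in stop_words or bg_word in stop_words:
                continue
            # Each Russian word is combined with each of its translations.
            pairs.add((ru_word, bg_word))
    return sorted(pairs)
```

Calling it once without stop words and once with a stop-word list would give the two glossary sizes reported in section 3.2 for the same dictionary.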
4 Data Set
¹ These contexts are collected using a separate query for w_c.
² We chose those 300 words from the glossary that occur most frequently on the Web.
4.1 Overview
Our evaluation data set consists of 200 Bulgarian-
Russian pairs - 100 cognates and 100 false friends. It
has been extracted from two large lists of cognates
and false friends, manually assembled by a linguist
from several monolingual and bilingual dictionaries.
We limited the scope of our evaluation to nouns only.
³ http://www.bgru.net/intr/dictionary/
As Table 1 shows, most of the words in our pairs con-
stitute a perfect orthographic match: this is the case
for 79 of the false friends and for 71 of the cognates.
The remaining ones exhibit minor variations, e.g.:
• ы → и (r. рыба → b. риба, 'a fish');
• э → е (r. этаж → b. етаж, 'a floor');
• ь → ∅ (r. кость → b. кост, 'a bone');
• double consonant → single consonant (r.
программа → b. програма, 'a programme');
• etc.
4.2 Discussion
There are two general approaches to testing a statisti-
cal hypothesis about a linguistic problem: (1) from the
data to the rules, and (2) from the rules to the data.
In the first approach, we need to collect a large num-
ber of instances of potential interest and then to filter
out the bad ones using lexical and grammatical compe-
tence, linguistic rules as formulated in grammars and
dictionaries, etc. This direction is from the data to the
rules, and the final evaluation is made by a linguist.
The second approach requires formulating the postulates of the method from
a linguistic point of view, and then checking its consistency on a large
volume of data. Again, the check is done by a linguist, but the
direction is from the rules to the data.
We combined both approaches. We started with two
large lists of cognates and false friends, manually as-
sembled by a linguist from several monolingual and
bilingual dictionaries: Bulgarian [2, 3, 8, 19, 20, 30],
Russian [10], and Bulgarian-Russian [6, 28]. From
these lists, we repeatedly extracted Russian-Bulgarian
word pairs (nouns), cognates and false friends, which
were further checked against our monolingual elec-
tronic dictionaries, described in section 3. The process
was repeated until we were able to collect 100 cognates
and 100 false friends.
Given an example pair exhibiting orthographic differences between Bulgarian
and Russian, we tested the corresponding letter-sequence substitutions against
our electronic dictionaries. While the four correspondences in section 4.1
have proven incontestable, others have been found inconsistent. For example,
the correspondence between the Russian -оро- and the Bulgarian -ра- (e.g.
r. горох → b. грах, 'peas'), mentioned in many comparative Russian-Bulgarian
studies, does not always hold, as it is formulated for root morphemes only.
This correspondence fails in cases where these strings occur outside the root
morpheme or at morpheme boundaries, e.g. r. плодородие → b. плодородие,
'fruitfulness, fertility'. We excluded all examples exhibiting inconsistent
orthographic alternations between Bulgarian and Russian.
The linguistic check against the grammatical dictio-
naries further revealed different interactions between
orthography, part-of-speech (POS), grammatical func-
tion, and sense, suggesting the following degrees of
falseness:
• Absolute falseness: same lemma, same POS, same number of senses, but
different meanings for all senses, e.g. r. жесть ('a tin') and b. жест
('a gesture');
• Partial lemma falseness: same lemma, same POS, but different number of
senses, and different meanings for some senses. For example, the r. бас
('bass, voice') is a cognate of the first sense of the b. бас, 'bass, voice',
but a false friend of its second sense 'a bet' (the Russian for a bet is
пари). On the other hand, the r. пари is a false friend of the first sense of
the b. пари ('money'); the Russian for money is деньги. In addition, the
b. пари can also be the plural for the b. пара ('a vapour'), which translates
into Russian as испарения. This quite complex example shows that the falseness
is not a symmetric cross-linguistic relation. It is shown schematically in
Figure 1.
• Partial wordform falseness: the number of senses is not relevant, the POS
can differ, the lemmata are different, but some wordform of the lemma in one
language is identical to a wordform of the lemma in the second language. For
example, the b. хотел ('a hotel') is the same as the inflected form r. хотел
(past tense, singular, masculine of хотеть, 'to want').
Our list of true and false friends contains only abso-
lute false friends and cognates, excluding any partial
cognates. Note that we should not expect to have a
full identity between all wordforms of the cognates for
a given pair of lemmata, since the rules for inflections
are different for Bulgarian and Russian.
[Figure 1: diagram with a Russian column (бас 'voice', пари 'bet', деньги 'money', испарения 'vapours') and a Bulgarian column (бас 1 'voice', бас 2 'bet', пари 1 'money', пари 2 'vapours').]
Fig. 1: Falseness example: double lines link cognates, dotted lines link false
friends, and solid lines link translations.
5 Experiments and Evaluation
In our experiments, we calculate the semantic (or orthographic) similarity
between each pair of words from the data set. We then sort the pairs in
ascending order of similarity, so that the ones near the top are likely to be
false friends, while those near the bottom are likely to be cognates.
Following Bergsma&Kondrak'07 [4] and Kondrak&Sherif'06 [18], we measure the
quality of the ranking using 11-point average precision. We
experiment with the following similarity measures:
• Baseline
- BASELINE: random.
• Orthographic Similarity
- MEDR: minimum edit distance ratio, defined as
$\mathrm{MEDR}(s_1, s_2) = 1 - \frac{\mathrm{MED}(s_1, s_2)}{\max(|s_1|, |s_2|)}$,
where |s| is the length of the string s, and MED is the minimum edit distance
or Levenshtein distance [21], calculated as the minimum number of edit
operations - INSERT, REPLACE, DELETE - needed to transform s_1 into s_2.
For example, the MED between b. мляко ('milk') and r. молоко ('milk') is two:
one REPLACE operation (я → о) and one INSERT operation (of о). Therefore we
obtain MEDR(мляко, молоко) = 1 - 2/6 ≈ 0.667;
- LCSR: longest common subsequence ratio [24], defined as
$\mathrm{LCSR}(s_1, s_2) = \frac{|\mathrm{LCS}(s_1, s_2)|}{\max(|s_1|, |s_2|)}$,
where LCS(s_1, s_2) is the longest common subsequence of s_1 and s_2. For
example, LCS(мляко, молоко) = млко, and therefore
LCSR(мляко, молоко) = 4/6 ≈ 0.667.
• Semantic Similarity
- SEED: our implementation and adaptation of the seed words algorithm of
Fung&Yee'98 [12];
- WEB3: the Web-based similarity algorithm with the default parameters: local
context size of 3, the smaller bilingual glossary, stop words filtering, no
lemmatisation, no reverse context lookup, no TF.IDF-weighting;
- NO-STOP: WEB3 without stop words removal;
- WEB1: WEB3 with local context size of 1;
- WEB2: WEB3 with local context size of 2;
- WEB4: WEB3 with local context size of 4;
- WEB5: WEB3 with local context size of 5;
- WEB3+TF.IDF: WEB3 with TF.IDF-weighting;
- LEMMA: WEB3 with lemmatisation;
- LEMMA+TF.IDF: WEB3 with lemmatisation and TF.IDF-weighting;
- HUGEDICT: WEB3 with the huge glossary;
- HUGEDICT+TF.IDF: WEB3 with the huge glossary and TF.IDF-weighting;
- REVERSE: WEB3 with reverse context lookup;
- REVERSE+TF.IDF: WEB3 with reverse context lookup and TF.IDF-weighting;
- COMBINED: WEB3 with lemmatisation, huge glossary, and reverse context lookup;
- COMBINED+TF.IDF: COMBINED with TF.IDF-weighting.
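The two orthographic measures from the list above, MEDR and LCSR, can be sketched with standard dynamic programming. The function names are illustrative, unit costs are assumed for all three edit operations, and both strings are assumed to be non-empty.

```python
def med(s1, s2):
    """Minimum edit (Levenshtein) distance with unit-cost insert/replace/delete."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                # delete c1
                            curr[j - 1] + 1,            # insert c2
                            prev[j - 1] + (c1 != c2)))  # replace or match
        prev = curr
    return prev[-1]


def lcs_length(s1, s2):
    """Length of the longest common subsequence of s1 and s2."""
    prev = [0] * (len(s2) + 1)
    for c1 in s1:
        curr = [0]
        for j, c2 in enumerate(s2, 1):
            curr.append(prev[j - 1] + 1 if c1 == c2 else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def medr(s1, s2):
    """MEDR = 1 - MED / max(|s1|, |s2|); assumes non-empty strings."""
    return 1 - med(s1, s2) / max(len(s1), len(s2))


def lcsr(s1, s2):
    """LCSR = |LCS| / max(|s1|, |s2|); assumes non-empty strings."""
    return lcs_length(s1, s2) / max(len(s1), len(s2))
```

For the pair мляко / молоко from the text, med returns 2 and lcs_length returns 4, so both ratios come out at 4/6. For identically spelled pairs both measures are exactly 1, which is why they cannot separate such cognates from false friends.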
[Figure 2: bar chart of 11-point average precision for BASELINE, LCSR (45.08%), MEDR (44.99%), SEED (64.74%) and WEB3 (92.05%).]
Fig. 2: Evaluation, 11-point average precision. Comparing WEB3 with BASELINE
and three old algorithms - LCSR, MEDR and SEED.
First, we compare two semantic similarity measures, WEB3 and SEED, with the
orthographic similarity measures, LCSR and MEDR, and with BASELINE; the
results are shown in Figure 2. The BASELINE algorithm achieves 50% on 11-point
average precision, and outperforms the orthographic similarity measures LCSR
and MEDR, which achieve 45.08% and 44.99%. This is not surprising since most
of our pairs consist of identical words. The semantic similarity measures,
SEED and WEB3, perform much better, achieving 64.74% and 92.05% on 11-point
average precision. The large absolute difference in performance (about 27%)
between SEED and WEB3 suggests that building a dynamic set of textual contexts
from which to extract words co-occurring with the target is a much better idea
than using a fixed set of seed words and page hits as a proxy for Web
frequencies.
[Figure 3: bar chart of 11-point average precision for NO-STOP, WEB1, WEB2, WEB3, WEB4 and WEB5.]
Fig. 3: Evaluation, 11-point average precision.
Different context sizes; keeping the stop words.
The remaining experiments try different variations of the contextual Web
similarity algorithm WEB3. First, we tested the impact of stop words removal.
Recall that WEB3 removes the stop words from the text snippets returned by
Google; therefore, we tried a version of it, NO-STOP, which keeps them. As
Figure 3 shows, this was a bad idea, yielding about 28% absolute loss in
11-point average precision: from 92.05% to 63.94%. We also tried different
context sizes: 1, 2, 3, 4 and 5. Context size 3 performed best, but the
differences are small.
We also experimented with different modifications of WEB3: using
lemmatisation, TF.IDF-weighting, reverse lookup, a bigger glossary, and a
combination of them. The results are shown in Figure 5.
First, we tried lemmatising the words from the snippets using the monolingual
grammatical dictionaries described in section 3.1. We tried both with and
without TF.IDF-weighting, achieving in either case an improvement of about 2%
over WEB3: as Figure 5 shows, LEMMA and LEMMA+TF.IDF yield an 11-point average
precision of 94.00% and 94.17%, respectively.
In our next experiment, we used the thirteen times larger glossary described
in section 3.3, which yielded an 11-point average precision of 94.37% - an
absolute improvement of 2.3% compared to WEB3 (Figure 5, HUGEDICT).
Interestingly, when we tried using TF.IDF-weighting together with this
glossary, we achieved only 93.31% (Figure 5, HUGEDICT+TF.IDF).
We also tried the reverse context lookup described in section 2.2, which
improved the results by 3.64% to 95.69%. Again, combining it with
TF.IDF-weighting performed worse: 94.58%.
Finally, we tried a combination of WEB3 with lemmatisation, reverse lookup and
the huge glossary, achieving an 11-point average precision of 95.84%, which is
our best result. Adding TF.IDF-weighting to the combination yielded slightly
worse results: 94.23%.
Figure 4 shows the precision-recall curves for LCSR, SEED, WEB3, and COMBINED.
We can see that WEB3 and COMBINED clearly outperform LCSR and SEED.
[Figure 4: precision-recall curves; horizontal axis: recall.]
Fig. 4: Precision-Recall Curve. Comparing WEB3
with LCSR, SEED and COMBINED.
6 Discussion
Ideally, our algorithm would rank first all false friends
and only then the cognates. Indeed, in the ranking
produced by COMBINED, the top 75 pairs are false
friends, while the last 48 are cognates; things get
mixed in the middle.
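The ranking-based evaluation of section 5 can be illustrated with a short sketch of 11-point average precision. The paper does not spell out the formula, so the sketch assumes the standard interpolated definition from information retrieval: precision is interpolated at recall levels 0.0, 0.1, ..., 1.0 and the eleven values are averaged, with false friends treated as the relevant class.

```python
def eleven_point_average_precision(ranked_labels):
    """ranked_labels: booleans in ranked order, True marks a false friend."""
    total = sum(ranked_labels)
    if total == 0:
        raise ValueError("the ranking contains no relevant items")
    # (hits so far, rank) recorded at every relevant item.
    points = []
    hits = 0
    for rank, relevant in enumerate(ranked_labels, 1):
        if relevant:
            hits += 1
            points.append((hits, rank))
    interpolated = []
    for i in range(11):
        # Interpolated precision: best precision at any recall >= i/10.
        # Integer comparison avoids floating-point issues with recall levels.
        best = max((h / rank for h, rank in points if 10 * h >= i * total), default=0.0)
        interpolated.append(best)
    return sum(interpolated) / 11
```

A ranking that places every false friend before every cognate scores 1.0 under this definition, which corresponds to the ideal case described above.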
The two false friends ranked lowest (i.e. misplaced) by COMBINED are врата
('a door' in Bulgarian, but 'an entrance' in Russian) at rank 152, and
абитуриент ('a person who just graduated from a high school' in Bulgarian,
but 'a person who just enrolled in a university' in Russian) at rank 148.
These pairs are problematic for our semantic similarity algorithms, since
while the senses differ in Bulgarian and Russian, they are related: a door is
a kind of entrance, and the newly admitted freshmen in a university are very
likely to have just graduated from a high school.
The cognate ranked highest (i.e. misplaced) by COMBINED is the pair
b. гордост / r. гордость ('pride') at position 76. On the Web, this word often
appears in historical and/or cultural contexts, which are nation-specific. As
a result, the word's contexts appear misleadingly different in Bulgarian and
Russian.
In addition, when querying Google, we only have
access to at most 1,000 top-ranked results. Since
Google's ranking often prefers commercial sites, travel
agencies, news portals, etc., over books, scientific ar-
ticles, forum posts, etc., this introduces a bias on the
kinds of contexts we extract.
7 Related Work
Many researchers have exploited the intuition that
words in two different languages with similar or identi-
cal spelling are likely to be translations of each other.
Al-Onaizan&al.'99 [1] create improved Czech-English word alignments using
probable cognates extracted with one of the variations of LCSR [24] described
in [34]. Using a variation of that technique, Kondrak&al.'03 [17] demonstrate
improved translation quality for nine European languages.
Koehn&Knight'02 [15] describe several techniques
for inducing translation lexicons. Starting with unre-
lated German and English corpora, they look for (1)
identical words, (2) cognates, (3) words with similar
frequencies, (4) words with similar meanings, and (5)
words with similar contexts. This is a bootstrapping
process, where new translation pairs are added to the
lexicon at each iteration.
Rapp'95 [31] describes a correlation between the
co-occurrences of words that are translations of each
other. In particular, he shows that if in a text in one
language two words A and B co-occur more often than
expected by chance, then in a text in another language
the translations of A and B are also likely to co-occur
frequently. Based on this observation, he proposes a
model for finding the most accurate cross-linguistic
mapping between German and English words using
non-parallel corpora. His approach differs from ours
in the similarity measure, the text source, and the addressed problem. In
later work on the same problem, Rapp'99 [32] represents the context of the
target word with four vectors: one for the words immediately preceding the
target, another one for the ones immediately following the target, and two
more for the words one more word before/after the target.
[Figure 5: chart of 11-point average precision for the variants of WEB3; vertical axis from 85% to 100%.]
Fig. 5: Evaluation, 11-point average precision. Different improvements of WEB3.
Fung&Yee'98 [12] extract word-level translations from non-parallel corpora.
They count the number of sentence-level co-occurrences of the target word with
a fixed set of "seed" words in order to rank the candidates in a vector-space
model using different similarity measures, after normalisation and
TF.IDF-weighting. The process starts with a small initial set of seed words,
which are dynamically augmented as new translation pairs are identified. As we
have seen above, an adaptation of this algorithm, SEED, yielded significantly
worse results compared to WEB3. Another problem of that algorithm is that for
a glossary of size |G|, it requires 4 × |G| queries, which makes it too
expensive for practical use. We tried another adaptation: given the words A
and B, instead of using the exact phrase queries "A B" and "B A" and then
adding the page hits, we used the page hits for A and B. This performed even
worse.
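The query cost of the SEED adaptation of section 2.3 can be seen in a small sketch. Here page_hits is a hypothetical callable that returns the number of page hits for a query string; no real search-engine API is implied, and the function names are illustrative.

```python
import math


def seed_vector(word, seeds, page_hits):
    """One coordinate per seed word: hits for "word seed" plus hits for "seed word"."""
    return [page_hits(f'"{word} {g}"') + page_hits(f'"{g} {word}"') for g in seeds]


def seed_similarity(w_bg, w_ru, glossary, page_hits):
    """Cosine between the seed vectors; glossary is a list of (g_ru, g_bg) pairs."""
    v_bg = seed_vector(w_bg, [g_bg for _, g_bg in glossary], page_hits)
    v_ru = seed_vector(w_ru, [g_ru for g_ru, _ in glossary], page_hits)
    dot = sum(a * b for a, b in zip(v_bg, v_ru))
    norm = math.sqrt(sum(a * a for a in v_bg)) * math.sqrt(sum(b * b for b in v_ru))
    return dot / norm if norm else 0.0
```

Each word pair triggers two phrase queries per glossary entry and per language, that is 4 × |G| calls to page_hits, which is 1,200 queries per pair for the 300-entry seed glossary.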
Diab&Finch'00 [9] present a model for statistical
word-level translation between comparable corpora.
They count the co-occurrences for each pair of words in
each corpus and assign a cross-linguistic mapping be-
tween the words in the corpora such that it preserves
the co-occurrences between the words in the source
language as closely as possible to the co-occurrences
of their mappings in the target language.
Zhang&al.'05 [36] present an algorithm that uses a search engine to improve
query translation in order to carry out cross-lingual information retrieval.
They issue a query for a word in language A and tell the search engine to
return the results only in language B, expecting the possible translations of
the query terms from language A to language B to be found in the title and
summary of the returned search results. They then look for the most frequently
occurring word in the search results and apply TF.IDF-weighting.
Finally, there is a lot of research on string similarity that has been applied
to cognate identification: Ristad&Yianilos'98 [33] and Mann&Yarowsky'01 [22]
learn the MED weights using a stochastic transducer. Tiedemann'99 [34] and
Mulloni&Pekar'06 [26] learn spelling changes between two languages for LCSR
and for MEDR respectively. Kondrak'05 [16] proposes the longest common prefix
ratio and the longest common subsequence formula, which counters LCSR's
preference for short words. Klementiev&Roth'06 [14] and Bergsma&Kondrak'07 [4]
propose discriminative frameworks for string similarity.
Rappoport&Levent-Levi'06 [23] learn substring correspondences for cognates,
using the string-level substitutions method of Brill&Moore'00 [7].
Inkpen&al.'05 [13] compare several orthographic similarity measures.
Frunza&Inkpen'06 [11] disambiguate partial cognates.
While these algorithms can successfully distinguish cognates from false
friends based on orthographic similarity, they do not use semantics and
therefore cannot distinguish between equally spelled cognates and false
friends (with the notable exception of [4], which can do so in some cases).
Unlike the above-mentioned methods, our approach:
• uses a semantic similarity measure, not an orthographic
or phonetic one;
• uses the Web, rather than pre-existing corpora to
extract the local context of the target word when
collecting semantic information about it;
• is applied to a different problem: classification of
(nearly) identically-spelled false/true friends.
8 Conclusions and Future Work
We have proposed a novel unsupervised semantic
method for distinguishing cognates from false friends,
based on the intuition that if two words are cognates,
then the words in their local contexts should be trans-
lations of each other, and we have demonstrated that
this is a very promising approach.
There are many ways in which we could improve
the proposed algorithm. First, we would like to au-
tomatically expand the bilingual glossary with more
word translation pairs using bootstrapping as well as
to combine the method with (language-specific) ortho-
graphic similarity measures, as done in [27]. We also
plan to apply this approach to other language pairs
and to other tasks, e.g. to improving word alignments.
Acknowledgments. We would like to thank the
anonymous reviewers for their useful comments.
References
[1] Y. Al-Onaizan, J. Curin, M. Jahr, K. Knight, J. Lafferty,
D. Melamed, F. Och, D. Purdy, N. Smith, and D. Yarowsky.
Statistical machine translation. Technical report, 1999.
[2] L. Andreychin, L. Georgiev, S. Ilchev, N. Kostov, I. Lekov,
S. Stoykov, and C. Todorov, editors. Explanatory Dictionary
of the Bulgarian Language. 3 edition, 1973 (Л. Андрейчин, Л.
Георгиев, Ст. Илчев, Н. Костов, Ив. Леков, Ст. Стойков и
Цв. Тодоров, "Български тълковен речник". Издателство
"Наука и изкуство", София, 1973).
[3] Y. Baltova and M. Charoleeva, editors. Dictionary of the Bulgarian
Language, volume 11. 2002 ("Речник на българския език". Под ред. на
Ю. Балтова и М. Чоролеева. Т. 11, Академично издателство
"Проф. Марин Дринов", София, 2002).
[4] S. Bergsma and G. Kondrak. Alignment-based discriminative
string similarity. In Proceedings of the 45th Annual Meet-
ing of the Association of Computational Linguistics, pages
656-663, Prague, Czech Republic, June 2007. Association for
Computational Linguistics.
[5] J. A. Bickford and D. Tuggy. Electronic glossary
of linguistic terms (with equivalent terms in Spanish).
http://www.sil.org/mexico/ling/glosario/E005ai-Glossary.htm,
April 2002. version 0.6. Instituto Lingüístico de
Verano (Mexico).
[6] D. Bozhkov, V. Velchev, S. Vlahov, H. E. Rot, et al. Russian-Bulgarian
Dictionary. 1985-1986 (Д. Божков, В. Велчев, С. Влахов, Х. Е. Рот и др.,
"Руско-български речник". Издателство "Наука и изкуство",
София, 1985-1986).
[7] E. Brill and R. C. Moore. An improved error model for noisy
channel spelling correction. In Proceedings of ACL, pages 286-293, 2000.
[8] K. Cholakova, editor. Dictionary of the Bulgarian Language,
volume 1-8. 1977-1995 ("Речник на българския език". Под
ред. на Кр. Чолакова. Т. 1-8, Издателство на БАН, София,
1977-2005).
[9] M. Diab and S. Finch. A statistical word-level translation
model for comparable corpora. In Proceedings of RIAO, 2000.
[10] A. Evgenyeva, editor. Dictionary of Russian in Four Volumes,
volume 1-4. 1981-1984 ("Словарь русского языка в четырех томах".
Под ред. А.П. Евгеньевой. Т. 1-4, Издательство
"Русский язык", Москва, 1981-1984).
[11] O. Frunza and D. Inkpen. Semi-supervised learning of par-
tial cognates using bilingual bootstrapping. In Proceedings
ACL '06, pages 441-448, 2006.
[12] P. Fung and L. Y. Yee. An IR approach for translating from
nonparallel, comparable texts. In Proceedings of ACL, vol-
ume 1, pages 414-420, 1998.
[13] D. Inkpen, O. Frunza, and G. Kondrak. Automatic identification
of cognates and false friends in French and English. In
Proceedings of RANLP'05, pages 251-257, 2005.
[14] A. Klementiev and D. Roth. Named entity transliteration and
discovery from multilingual comparable corpora. In Proceed-
ings of the Human Language Technology Conference of the
NAACL, Main Conference, pages 82-88, New York City, USA,
June 2006. Association for Computational Linguistics.
[15] P. Koehn and K. Knight. Learning a translation lexicon from
monolingual corpora. In Proceedings of ACL workshop on
Unsupervised Lexical Acquisition, pages 9-16, 2002.
[16] G. Kondrak. Cognates and word alignment in bitexts. In Pro-
ceedings of the 10th Machine Translation Summit, pages 305-
312, Phuket, Thailand, September 2005.
[17] G. Kondrak, D. Marcu, and K. Knight. Cognates can improve
statistical translation models. In Proceedings of HLT-NAACL
2003 (companion volume), pages 44-48, 2003.
[18] G. Kondrak and T. Sherif. Evaluation of several phonetic sim-
ilarity algorithms on the task of cognate identification. In
Proceedings of the Workshop on Linguistic Distances, pages
43—50, Sydney, Australia, July 2006. Association for Compu-
tational Linguistics.
[19] V. Kyuvlieva-Mishaykova and M. Charoleeva, editors. Dic-
tionary of the Bulgarian Language, volume 9. 1998 {^^Pchhuk
Ha 6sAzapcKusr csuk". uofl. pe/j. na B. KioBJineBa-MHinaHKOBa h
M. MapojieeBa. T. 9, AKa^eMH^iHO HSAaTejicTBO "Hpo(}). MapHH
Uphhob" , Co(})HH, 1998).
[20] V. Kyuvlieva-Mishaykova and E. Pernishka, editors. Dictionary
of the Bulgarian Language, volume 12. 2004 ("Речник
на българския език". Под ред. на В. Кювлиева-Мишайкова
и Е. Пернишка. Т. 12, Академично издателство "Проф.
Марин Дринов", София, 2004).
[21] V. Levenshtein. Binary codes capable of correcting deletions,
insertions, and reversals. Soviet Physics Doklady, 10(8):707-710,
1966.
[22] G. Mann and D. Yarowsky. Multipath translation lexicon induction
via bridge languages. In Proceedings of NAACL'01,
pages 1-8, 2001.
[23] G. Mann and D. Yarowsky. Induction of cross-language affix
and letter sequence correspondence. In Proceedings of EACL
Workshop on Cross-Language Knowledge Induction, 2006.
[24] D. Melamed. Automatic evaluation and uniform filter cascades
for inducing N-best translation lexicons. In Proceedings of the
Third Workshop on Very Large Corpora, pages 184-198, 1995.
[25] D. Melamed. Bitext maps and alignment via pattern recognition.
Computational Linguistics, 25(1):107-130, 1999.
[26] A. Mulloni and V. Pekar. Automatic detection of orthographic
cues for cognate recognition. In Proceedings of LREC-06,
pages 2387-2390, 2006.
[27] P. Nakov, S. Nakov, and E. Paskaleva. Improved word align-
ments using the web as a corpus. In Proceedings of RANLP'07,
2007.
[28] K. Panchev. Differential Russian-Bulgarian Dictionary. 1963
(К. Панчев, "Диференциален руско-български речник".
Под ред. на С. Влахов и Г.А. Тагамлицка. Издателство
"Наука и изкуство", София, 1963).
[29] E. Paskaleva. Compilation and validation of morphological re-
sources. In Workshop on Balkan Language Resources and
Tools (Balkan Conference on Informatics), pages 68-74, 2003.
[30] E. Pernishka and L. Krumova-Cvetkova, editors. Dictionary
of the Bulgarian Language, volume 10. 2000 ("Речник на
българския език". Под ред. на Е. Пернишка и Л. Крумова-Цветкова.
Т. 10, Академично издателство "Проф. Марин
Дринов", София, 2000).
[31] R. Rapp. Identifying word translations in non-parallel texts.
In Proceedings of ACL, pages 320-322, 1995.
[32] R. Rapp. Automatic identification of word translations from
unrelated English and German corpora. In Proceedings of ACL,
pages 519-526, 1999.
[33] E. Ristad and P. Yianilos. Learning string-edit distance. IEEE
Trans. Pattern Anal. Mach. Intell., 20(5):522-532, 1998.
[34] J. Tiedemann. Automatic construction of weighted string sim-
ilarity measures. In Proceedings of EMNLP-VLC, pages 213-
219, 1999.
[35] A. Zaliznyak. Grammatical Dictionary of Russian. Russky
yazyk, Moscow, 1977 (А. Зализняк, Грамматический словарь
русского языка. "Русский язык", Москва, 1977).
[36] J. Zhang, L. Sun, and J. Min. Using the web corpus to trans-
late the queries in cross-lingual information retrieval. In IEEE
NLP-KE, pages 414-420, 2005.
| r | Candidate (BG/RU) | BG sense | RU sense | Sim. | Cogn.? | P@r (%) | R@r (%) |
|---|---|---|---|---|---|---|---|
| 1 | муфта | gratis | muff | 0.0085 | no | 100.00 | 1.00 |
| 2 | багрене / багренье | mottle | gaff | 0.0130 | no | 100.00 | 2.00 |
| 3 | добитък / добыток | livestock | income | 0.0143 | no | 100.00 | 3.00 |
| 4 | мраз / мразь | chill | crud | 0.0175 | no | 100.00 | 4.00 |
| 5 | плет / плеть | hedge | whip | 0.0182 | no | 100.00 | 5.00 |
| 6 | плитка | plait | tile | 0.0272 | no | 100.00 | 6.00 |
| 7 | куча | doggish | heap | 0.0287 | no | 100.00 | 7.00 |
| 8 | лепка | bur | modeling | 0.0301 | no | 100.00 | 8.00 |
| 9 | кайма | minced meat | selvage | 0.0305 | no | 100.00 | 9.00 |
| 10 | низ | string | bottom | 0.0324 | no | 100.00 | 10.00 |
| 11 | геран / герань | draw-well | geranium | 0.0374 | no | 100.00 | 11.00 |
| 12 | печурка | mushroom | small stove | 0.0379 | no | 100.00 | 12.00 |
| 13 | ватман | tram-driver | whatman | 0.0391 | no | 100.00 | 13.00 |
| 14 | корейка | korean | bacon | 0.0396 | no | 100.00 | 14.00 |
| 15 | дума | word | thought | 0.0398 | no | 100.00 | 15.00 |
| 16 | товар | load | commodity | 0.0402 | no | 100.00 | 16.00 |
| 17 | катран | tar | sea-kale | 0.0420 | no | 100.00 | 17.00 |
| ... | | | | | | | |
| 76 | генератор | generator | generator | 0.1621 | yes | 94.74 | 72.00 |
| 77 | лодка | boat | boat | 0.1672 | yes | 93.51 | 72.00 |
| 78 | букет | bouquet | bouquet | 0.1714 | yes | 92.31 | 72.00 |
| 79 | прах / порох | dust | gunpowder | 0.1725 | no | 92.41 | 73.00 |
| 80 | врата | door | entrance | 0.1743 | no | 92.50 | 74.00 |
| 81 | клюка | gossip | cammock | 0.1754 | no | 92.59 | 75.00 |
| ... | | | | | | | |
| 97 | движение | motion | motion | 0.2023 | yes | 83.51 | 81.00 |
| 98 | компютър / компьютер | computer | computer | 0.2059 | yes | 82.65 | 81.00 |
| 99 | вулкан | volcano | volcano | 0.2099 | yes | 81.82 | 81.00 |
| 100 | година | year | time | 0.2101 | no | 82.00 | 82.00 |
| 101 | бут | leg | rubble | 0.2130 | no | 82.12 | 83.00 |
| 102 | заповедник | despot | reserve | 0.2152 | no | 82.35 | 84.00 |
| 103 | баба | grandmother | peasant woman | 0.2154 | no | 82.52 | 85.00 |
| ... | | | | | | | |
| 154 | мост | bridge | bridge | 0.3990 | yes | 62.99 | 97.00 |
| 155 | звезда | star | star | 0.4034 | yes | 62.58 | 97.00 |
| 156 | брат | brother | brother | 0.4073 | yes | 62.18 | 97.00 |
| 157 | мечта | dream | dream | 0.4090 | yes | 61.78 | 97.00 |
| 158 | дружество | association | friendship | 0.4133 | no | 62.03 | 98.00 |
| 159 | мляко / молоко | milk | milk | 0.4133 | yes | 61.64 | 98.00 |
| 160 | клиника | clinic | clinic | 0.4331 | yes | 61.25 | 98.00 |
| 161 | глина | clay | clay | 0.4361 | yes | 60.87 | 98.00 |
| 162 | учебник | textbook | textbook | 0.4458 | yes | 60.49 | 98.00 |
| ... | | | | | | | |
| 185 | корен / корень | root | root | 0.5498 | yes | 54.35 | 100.00 |
| 186 | психология | psychology | psychology | 0.5501 | yes | 54.05 | 100.00 |
| 187 | чайка | gull | gull | 0.5531 | yes | 53.76 | 100.00 |
| 188 | спалня / спальня | bedroom | bedroom | 0.5557 | yes | 53.48 | 100.00 |
| 189 | масаж / массаж | massage | massage | 0.5623 | yes | 53.19 | 100.00 |
| 190 | бензин | gasoline | gasoline | 0.6097 | yes | 52.91 | 100.00 |
| 191 | педагог | pedagogue | pedagogue | 0.6459 | yes | 52.63 | 100.00 |
| 192 | теория | theory | theory | 0.6783 | yes | 52.36 | 100.00 |
| 193 | бряг / берег | shore | shore | 0.6862 | yes | 52.08 | 100.00 |
| 194 | контраст | contrast | contrast | 0.7471 | yes | 51.81 | 100.00 |
| 195 | сестра | sister | sister | 0.7637 | yes | 51.55 | 100.00 |
| 196 | финанси / финансы | finances | finances | 0.8017 | yes | 51.28 | 100.00 |
| 197 | сребро / серебро | silver | silver | 0.8916 | yes | 50.76 | 100.00 |
| 198 | наука | science | science | 0.9028 | yes | 50.51 | 100.00 |
| 199 | флора | flora | flora | 0.9171 | yes | 50.25 | 100.00 |
| 200 | красота | beauty | beauty | 0.9684 | yes | 50.00 | 100.00 |

11-point average precision: 92.05
Table 1: Ranked examples from our data set for web3. Candidate is the candidate pair to be judged as being
cognate or not, Sim. is the Web similarity score, r is the rank, and P@r and R@r are the precision and the recall
(in %) for the top r candidates. Cogn.? shows whether the words are cognates or not.
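
The P@r and R@r columns of Table 1 can be recomputed from the ranked list alone. The values in the table are consistent with treating false friends (Cogn.? = no) as the positive class, with 100 false friends among the 200 pairs: at rank 76, for instance, 72 of the top 76 candidates are false friends, which gives P@76 = 72/76 = 94.74% and R@76 = 72/100 = 72.00%. The following Python sketch illustrates this computation. The function names are hypothetical, and the 11-point average precision uses the standard interpolated definition from information retrieval, which is an assumption: the exact variant behind the reported 92.05 is not specified here.

```python
from typing import List, Tuple


def precision_recall_at_ranks(is_false_friend: List[bool]) -> List[Tuple[float, float]]:
    """Return (P@r, R@r) in percent for every rank r = 1..n.

    The input is ordered by ascending similarity, so the most likely
    false friends come first; True marks a false friend (positive class).
    """
    total_positives = sum(is_false_friend)
    results = []
    hits = 0
    for r, flag in enumerate(is_false_friend, start=1):
        hits += flag  # True counts as 1, False as 0
        results.append((100.0 * hits / r, 100.0 * hits / total_positives))
    return results


def eleven_point_average_precision(pr: List[Tuple[float, float]]) -> float:
    """Average interpolated precision at recall levels 0%, 10%, ..., 100%.

    Assumption: interpolated precision at a recall level is the maximum
    precision observed at any rank whose recall is at least that level.
    """
    levels = [10.0 * i for i in range(11)]
    interpolated = []
    for level in levels:
        candidates = [p for p, rec in pr if rec >= level]
        interpolated.append(max(candidates) if candidates else 0.0)
    return sum(interpolated) / len(levels)


# Toy ranking (not data from the paper): three false friends, one cognate.
toy = [True, True, False, True]
toy_pr = precision_recall_at_ranks(toy)
toy_ap = eleven_point_average_precision(toy_pr)
```

In the toy ranking, the single cognate at rank 3 lowers P@3 to 66.67% while recall stays at 66.67%; the interpolated precision is 100% for recall levels up to 60% and 75% above that.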