Comparison Settings
The Comparison settings view contains all of the settings that are used to configure how word lists are compared.
Cog utilizes a dynamic programming algorithm to align word pairs and compute their phonetic similarity. This algorithm works in a similar fashion to the Levenshtein distance algorithm, except instead of computing a simple edit distance, it computes a phonetic distance based on the phonological features of the segments.[1] Each feature is assigned a weight and each feature value is assigned a metric. The difference between the metrics of two feature values indicates their distance. All of these distances are weighted and combined to compute a single phonetic distance between a segment pair. In addition, the algorithm can reduce the phonetic distance of regular sound correspondences and penalize segments that are not in the same syllable position. The alignment algorithm used in Cog is based on Kondrak's ALINE algorithm.[2]
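The idea of replacing Levenshtein's uniform edit costs with a feature-weighted phonetic distance can be sketched as follows. The feature system, weights, and indel cost below are toy assumptions for illustration only; Cog's actual features, weights, and scoring (based on ALINE) are configurable and richer.

```python
# A minimal sketch of feature-weighted phonetic alignment. The feature
# metrics, weights, and indel cost are illustrative assumptions, not
# Cog's actual values.

# Hypothetical feature metrics for a few segments (values in [0, 1]).
FEATURES = {
    "p": {"voice": 0.0, "place": 1.0, "manner": 0.0},
    "b": {"voice": 1.0, "place": 1.0, "manner": 0.0},
    "t": {"voice": 0.0, "place": 0.6, "manner": 0.0},
    "a": {"voice": 1.0, "place": 0.2, "manner": 1.0},
}
WEIGHTS = {"voice": 10, "place": 40, "manner": 50}  # illustrative weights
INDEL_COST = 60  # illustrative insertion/deletion cost

def segment_distance(p, q):
    """Weighted sum of metric differences over all features."""
    return sum(w * abs(FEATURES[p][f] - FEATURES[q][f])
               for f, w in WEIGHTS.items())

def align_distance(word1, word2):
    """Levenshtein-style DP, but substitution cost is the phonetic distance."""
    n, m = len(word1), len(word2)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * INDEL_COST
    for j in range(1, m + 1):
        d[0][j] = j * INDEL_COST
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + INDEL_COST,                       # deletion
                d[i][j - 1] + INDEL_COST,                       # insertion
                d[i - 1][j - 1] + segment_distance(word1[i - 1], word2[j - 1]),
            )
    return d[n][m]
```

Under this scheme, [p] and [b] differ only in voicing, so substituting one for the other is much cheaper than an insertion/deletion, and the alignment will prefer it.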
The alignment mode setting is used to configure which portions of a word pair the alignment algorithm will attempt to align. Which mode to use depends largely on whether the words being compared are stems/roots. There are four modes: full, partial, beginning, and hybrid.
Full: This mode aligns all segments in both words. If the words do not contain the same number of segments, then insertions/deletions will be indicated in the resulting alignment. This mode should be used when all words are known to be stems/roots.
Partial: This mode only aligns the subsequences of the words that have the best alignment. The segments at the beginning and end of the words that do not align are treated as affixes and ignored during comparison. This mode should be used when the words are known to contain affixes.
Beginning: This mode is similar to Partial, except it only attempts to align the beginnings of the words. This mode should be used when the ends of words are known to be unstable or to contain suffixes.
Hybrid: This mode is a combination of Full and Partial. The longer word is partially aligned and the shorter word is fully aligned. This mode should be used when some varieties have words that contain affixes and some do not.
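The structural difference between the full and partial modes can be illustrated with simple match/mismatch scores standing in for Cog's phonetic distances; the scores and words below are toy assumptions.

```python
# A toy contrast between full (global) and partial (local) alignment scoring.
# Simple match/mismatch/indel scores stand in for phonetic distances so the
# structural difference between the modes is visible.

def full_score(a, b, match=1, mismatch=-1, indel=-1):
    """Global score: every segment of both words must be aligned, so
    unmatched prefixes/suffixes cost indels (like the Full mode)."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * indel
    for j in range(1, m + 1):
        s[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j] + indel,
                          s[i][j - 1] + indel,
                          s[i - 1][j - 1] + sub)
    return s[n][m]

def partial_score(a, b, match=1, mismatch=-1, indel=-1):
    """Local score: only the best-aligning subsequences count, so
    affix-like edges are ignored (like the Partial mode)."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(0,
                          s[i - 1][j] + indel,
                          s[i][j - 1] + indel,
                          s[i - 1][j - 1] + sub)
            best = max(best, s[i][j])
    return best
```

With affix-like edges ("xx" and "yy" around a shared "cat"), the partial score ignores them while the full score pays an indel cost for every unmatched segment.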
If this setting is checked, the alignment algorithm will attempt to align two segments with a single segment, instead of indicating an insertion/deletion, when it determines that doing so will result in a better overall alignment. Typically, this setting is disabled, because the syllabifier combines all segments in the same syllable position into a single complex segment. This allows complex segments, such as consonant clusters and diphthongs, to be aligned with simple segments.
If this setting is checked, then the alignment algorithm will reduce the phonetic distance for a segment pair based on how frequently the segments correspond in a particular variety pair. The more frequently the segments correspond, the greater the phonetic distance is reduced. The alignment algorithm uses the probability of correspondence for a segment pair to determine how much the phonetic distance is reduced. See the section on likely cognate identification below for more information on how the probability is computed.
If this setting is checked, then the alignment algorithm will increase the phonetic distance for a segment pair where each segment is in a different syllable position. This setting will cause the alignment algorithm to favor alignments where the segments are both onsets, codas, or nuclei for a syllable.
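The two adjustments above, correspondence-based reduction and the syllable-position penalty, can be sketched as follows. The linear scaling and the penalty value are illustrative assumptions, not Cog's actual formulas.

```python
# A hedged sketch of the two optional distance adjustments. The scaling
# and penalty here are stand-ins for whatever Cog actually uses.

def adjusted_distance(base_distance, correspondence_prob=0.0,
                      same_syllable_position=True, position_penalty=50.0):
    """Apply both optional adjustments to a raw phonetic distance.

    correspondence_prob: estimated probability that the two segments
    correspond in this variety pair (reduces the distance; the more
    frequent the correspondence, the greater the reduction).
    position_penalty: illustrative extra cost when the segments occupy
    different syllable positions (onset vs. coda, etc.).
    """
    d = base_distance * (1.0 - correspondence_prob)
    if not same_syllable_position:
        d += position_penalty
    return d
```

A regular correspondence like [p]:[f] would thus look cheaper than its raw feature distance suggests, while an onset aligned against a coda would look more expensive.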
This table contains the weights for features and feature values that are used to compute the phonetic distance between segments. The phonetic distance between two segments p and q is the weighted sum of the differences between their feature-value metrics:

    distance(p, q) = Σ_f w_f · |m_f(p) − m_f(q)|

where w_f is the weight assigned to feature f, and m_f(p) is the metric assigned to the value that segment p has for feature f.
This table contains the sound classes that the alignment algorithm will look for in the context of sound correspondences. Each sound correspondence is associated with an environment based on the specified contextual sound classes. The order of the sound classes is significant: the algorithm searches the left and right environments for the first sound class that matches. Word boundaries can be included in segment-based classes by using the hash character (#), making it possible to look for word-initial and word-final sound correspondences.
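The first-match environment lookup can be sketched as follows. The class names, their member segments, and the two helper functions are hypothetical stand-ins; only the ordered first-match rule and the use of # for word boundaries come from the description above.

```python
# A sketch of environment matching. Sound classes are modeled as an
# ordered list of (name, member-set) pairs; "#" marks a word boundary.
# The class inventory here is a toy assumption.

SOUND_CLASSES = [
    ("word-boundary", {"#"}),
    ("vowel", {"a", "e", "i", "o", "u"}),
    ("consonant", set("pbtdkgmnszrlwj")),
]

def classify(segment):
    """Return the name of the first sound class that contains the segment
    (order is significant, as described above)."""
    for name, members in SOUND_CLASSES:
        if segment in members:
            return name
    return None

def environment(word, index):
    """Left and right sound classes for the segment at `index`;
    '#' stands in for the boundary at either edge of the word."""
    left = word[index - 1] if index > 0 else "#"
    right = word[index + 1] if index < len(word) - 1 else "#"
    return classify(left), classify(right)
```

For example, the first segment of a word gets a word-boundary left environment, which is what makes word-initial correspondences distinguishable from medial ones.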
This section contains all of the settings for identifying likely cognates. Cog uses an expectation-maximization (EM) algorithm in order to overcome the chicken-and-egg problem that linguists face when trying to identify cognates and sound correspondences from word list data.[3] The EM algorithm uses an iterative approach that consists of two steps, the E-step and the M-step. During the E-step, the algorithm identifies cognates, aligns each word pair, and identifies regular sound correspondences. During the M-step, the algorithm estimates a conditional probability distribution of a segment in one variety corresponding to a segment in the other variety given an environment. The probability distribution is estimated from the sound correspondence frequencies gathered in the E-step and smoothed using Witten-Bell smoothing.[4] The algorithm continues in this fashion until the probability distribution converges. This algorithm was inspired by Melamed's work on machine translation and Kondrak's CORDI system.[5][6]

An important thing to note is that this algorithm is inherently asymmetric: the conditional probability distribution assumes a specific directionality for sound correspondences, so different results might be obtained if the directionality is reversed.
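The E-step/M-step loop described above can be sketched as follows. This is a heavily simplified stand-in: the alignment routine and cognacy test are passed in as black boxes, environments are omitted, and the frequencies are normalized directly rather than smoothed with Witten-Bell as Cog does.

```python
# A highly simplified sketch of the EM loop for cognate identification.
# align(pair) yields aligned segment pairs; is_cognate(pair, probs) decides
# cognacy under the current probability estimates. Both are stand-ins.
from collections import defaultdict

def em_cognate_loop(word_pairs, align, is_cognate, max_iters=10, tol=1e-6):
    probs = defaultdict(float)  # P(q | p): correspondence probabilities
    for _ in range(max_iters):
        # E-step: identify cognates and count sound correspondences.
        counts = defaultdict(float)
        totals = defaultdict(float)
        for pair in word_pairs:
            if not is_cognate(pair, probs):
                continue
            for p, q in align(pair):
                counts[(p, q)] += 1
                totals[p] += 1
        # M-step: re-estimate the conditional distribution P(q | p).
        # (Cog smooths this with Witten-Bell; plain normalization here.)
        new_probs = defaultdict(float)
        for (p, q), c in counts.items():
            new_probs[(p, q)] = c / totals[p]
        # Stop when the distribution converges.
        keys = set(probs) | set(new_probs)
        changed = max((abs(new_probs[k] - probs[k]) for k in keys),
                      default=0.0)
        probs = new_probs
        if changed < tol:
            break
    return probs
```

Note that the distribution is conditional on the first variety's segment, which is where the asymmetry mentioned above comes from: swapping the two varieties conditions on the other side and can converge to different estimates.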
Initially, the cognate identification algorithm does not know which word pairs are cognates, so it must make an educated guess to begin the process. The algorithm computes the phonetic similarity for all word pairs and assumes that any word pair with a similarity greater than or equal to this threshold is a cognate. A conservative threshold will ensure better results, but might cause the algorithm to take longer to converge.
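The initial guess amounts to a simple threshold filter over all word pairs. In the sketch below, Python's difflib ratio is a stand-in for Cog's phonetic similarity score, and the threshold value is illustrative.

```python
# A sketch of the initial cognate guess. SequenceMatcher.ratio() is a
# stand-in similarity in [0, 1]; Cog uses phonetic similarity instead.
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in string similarity; a placeholder for phonetic similarity."""
    return SequenceMatcher(None, a, b).ratio()

def initial_cognates(word_pairs, threshold=0.5):
    """Keep every pair whose similarity meets the threshold as a
    putative cognate for the first E-step."""
    return [p for p in word_pairs if similarity(*p) >= threshold]
```

Raising the threshold makes the initial set more conservative: fewer false cognates seed the first E-step, at the cost of more iterations before the estimates stabilize.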
Currently, Cog provides three methods for identifying likely cognates.
- Levenshtein, V. (1966). Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Soviet Physics Doklady, 10(8), 707–710.
- Kondrak, G. (2000). A New Algorithm for the Alignment of Phonetic Sequences. In Proceedings of NAACL 2000 (pp. 288–295). Seattle, WA: ACL.
- Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1–38.
- Witten, I. H. & Bell, T. C. (1991). The Zero-frequency Problem: Estimating the Probabilities of Novel Events in Adaptive Text Compression. IEEE Transactions on Information Theory, 37, 1085–1094.
- Melamed, I. (2000). Models of Translational Equivalence among Words. Computational Linguistics, 26(2), 221–249.
- Kondrak, G. (2002). Determining Recurrent Sound Correspondences by Inducing Translational Models. In Proceedings of COLING 2002 (pp. 1–7). Stroudsburg, PA: ACL.