Input Settings
The Input settings view contains all of the settings that are used to configure how the word list data is interpreted and prepared.
Cog can perform automatic syllabification of all words. The algorithm that Cog uses is based on the Sonority Sequencing Principle (SSP) and the Maximal Onset Principle (MOP).[1][2] Using the provided sonority scale, a measure of sonority is assigned to each segment in a word. The algorithm attempts to mark syllable boundaries at troughs in the sonority of the word. It then checks for the longest onset that has been observed to occur word-initially in the word's language variety. If the boundary is still ambiguous, the maximal onset is assumed. The Sylli syllabifier served as the inspiration for the algorithm used in Cog.[3]
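The steps above can be illustrated with a minimal Python sketch. It is not Cog's implementation: the function name `syllabify`, the dictionary-based sonority lookup, and the `attested_onsets` set (onsets assumed to have been collected from word-initial positions in the variety) are all hypothetical, and the way the trough and onset checks are combined is an assumption.

```python
from typing import Dict, List, Set, Tuple


def syllabify(
    segments: List[str],
    sonority: Dict[str, int],
    attested_onsets: Set[Tuple[str, ...]],
) -> List[List[str]]:
    """Split a word into syllables using sonority troughs and attested onsets."""
    if not segments:
        return []
    son = [sonority[s] for s in segments]
    n = len(son)

    # Sonority peaks are treated as syllable nuclei (assumption of this sketch).
    peaks = [
        k for k in range(n)
        if (k == 0 or son[k - 1] < son[k]) and (k == n - 1 or son[k] >= son[k + 1])
    ]

    boundaries = []
    for prev, nxt in zip(peaks, peaks[1:]):
        # SSP: by default the new syllable starts at the sonority trough.
        boundary = min(range(prev + 1, nxt), key=lambda k: son[k])
        # MOP: prefer the longest onset that is attested word-initially.
        for start in range(prev + 1, nxt):
            if tuple(segments[start:nxt]) in attested_onsets:
                boundary = start
                break
        boundaries.append(boundary)

    # Cut the segment list at each boundary.
    syllables, last = [], 0
    for b in boundaries:
        syllables.append(segments[last:b])
        last = b
    syllables.append(segments[last:])
    return syllables
```

With a toy scale such as `{"a": 8, "r": 5, "s": 3, "t": 1}`, the word `astra` has its trough at `t`, giving `as.tra`. If `str` is in `attested_onsets`, the sketch moves the boundary to give `a.stra` instead.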
If this setting is checked, then all segments at the same syllable position will be combined into a single complex segment. This allows alignment to occur by syllable position, which generally results in a better alignment. This setting will take effect whether or not automatic syllabification is enabled.
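As a rough illustration of what combining segments by syllable position means, the sketch below merges the onset, nucleus, and coda of one syllable into single complex segments. The `Syllable` class and `combine_by_position` function are hypothetical names, and representing segments as plain strings is an assumption made for brevity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Syllable:
    """One syllable, with its segments grouped by syllable position."""
    onset: List[str] = field(default_factory=list)
    nucleus: List[str] = field(default_factory=list)
    coda: List[str] = field(default_factory=list)


def combine_by_position(syllable: Syllable) -> List[str]:
    """Return one complex segment per non-empty syllable position."""
    positions = (syllable.onset, syllable.nucleus, syllable.coda)
    return ["".join(position) for position in positions if position]
```

For example, a syllable with onset `s t r`, nucleus `a`, and coda `n d` would be aligned as the three units `str`, `a`, and `nd` rather than as six separate segments.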
If this setting is checked, then syllabification is performed automatically whenever any new word is manually entered or imported. If the language varieties that are being compared do not adhere to the SSP and MOP, then automatic syllabification should be disabled and the syllables should be marked manually.
This table represents the sonority scale that the automatic syllabifier utilizes to assign sonority to segments. The sonority scale consists of sound classes and an assigned sonority value. A segment is assigned the sonority value of the first sound class that it matches. In this regard, the order of the sound classes is significant. The higher the sonority value, the more sonorous the segment. The same sonority value can be assigned to multiple sound classes. Cog provides a default sonority scale that should work in most cases. The scale can be adjusted to handle consistent exceptions to the SSP that occur across all language varieties.
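The first-match rule can be sketched as follows. The scale below is an invented example, not Cog's default scale, and modeling a sound class as a set of segment strings is an assumption; the point is only that a segment matching several classes takes the value of whichever class appears first.

```python
from typing import List, Set, Tuple

# Hypothetical scale: (sound class name, member segments, sonority value).
SoundClass = Tuple[str, Set[str], int]

SCALE: List[SoundClass] = [
    ("open vowel", {"a"}, 8),
    ("vowel", {"a", "e", "i", "o", "u"}, 7),
    ("glide", {"j", "w"}, 6),
    ("liquid", {"l", "r"}, 5),
    ("nasal", {"m", "n"}, 4),
    ("fricative", {"f", "s", "z"}, 3),
    ("stop", {"p", "t", "k", "b", "d", "g"}, 1),
]


def sonority_of(segment: str, scale: List[SoundClass]) -> int:
    """Return the sonority value of the first sound class that matches."""
    for _name, members, value in scale:
        if segment in members:
            return value
    raise ValueError(f"no sound class matches segment {segment!r}")
```

Here `a` belongs to both "open vowel" and "vowel", but because "open vowel" is listed first, `sonority_of("a", SCALE)` returns 8. Swapping the first two rows would make it return 7.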
By default, the automatic syllabifier algorithm will not handle height-harmonic diphthongs where the individual components have the same sonority, since it will see this as a sonority trough. If this setting is checked, the syllabifier will consider vowels with the same sonority as part of the same syllable, thereby enabling support for height-harmonic diphthongs.
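The effect of this setting on a run of adjacent vowels can be sketched as below. This is a simplified reading of the paragraph above, not Cog's code: the function `split_vowels` and its `combine_equal` flag are hypothetical, and the sketch only looks at whether neighboring vowels have equal sonority.

```python
from typing import Dict, List


def split_vowels(
    vowels: List[str],
    sonority: Dict[str, int],
    combine_equal: bool = False,
) -> List[List[str]]:
    """Group adjacent vowels into nuclei.

    Two neighboring vowels with equal sonority are split into separate
    syllables unless combine_equal is True.
    """
    if not vowels:
        return []
    groups = [[vowels[0]]]
    for prev, cur in zip(vowels, vowels[1:]):
        same = sonority[prev] == sonority[cur]
        if same and not combine_equal:
            groups.append([cur])      # treated as a trough: new syllable
        else:
            groups[-1].append(cur)    # stays in the same syllable
    return groups
```

With `i` and `u` both assigned a sonority of 7, the sequence `i u` is split into two nuclei by default and kept together as one diphthong when `combine_equal=True`.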
The automatic stemmer attempts to identify affixes and strip them off so that only roots/stems are compared. The stemmer must identify affixes with very little information. The algorithm calculates four different measures which are combined into a final confidence score for each candidate affix. The four measures are:
- Frequency: a weighted frequency of occurrence.
- Random adjustment: the probability of occurrence for an affix is compared against the probability of occurrence of the same sequence of segments occurring as a non-affix. This measure is used to determine if a sequence of segments tends to occur together within the language or not.
- Curve drop: the segment with the highest probability of preceding a suffix or following a prefix is determined. This probability is compared to the probability of the segment occurring anywhere. If the probabilities are similar, it indicates that the affix is likely a true affix and not a substring of a true affix.
- Syllable adjustment: affixes which occur at syllable breaks are given more weight.
The four measures are given equal weight when computing the final confidence score. The algorithm is a modified version of Hammarström's proposal for an unsupervised stemming algorithm.[4]
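As a hedged illustration of equal weighting, the sketch below combines the four measures with a plain arithmetic mean. The page does not state the exact formula, so the mean is an assumption, as are the `AffixMeasures` class, the `confidence` function, and the idea that each measure is already scaled to a comparable range.

```python
from dataclasses import dataclass


@dataclass
class AffixMeasures:
    """The four measures computed for one candidate affix (hypothetical container)."""
    frequency: float
    random_adjustment: float
    curve_drop: float
    syllable_adjustment: float


def confidence(measures: AffixMeasures) -> float:
    """Combine the four measures with equal weight (assumed: arithmetic mean)."""
    values = (
        measures.frequency,
        measures.random_adjustment,
        measures.curve_drop,
        measures.syllable_adjustment,
    )
    return sum(values) / len(values)
```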
This setting is the threshold of the confidence score that is calculated for each candidate affix. If a candidate affix's confidence score meets or exceeds this threshold, then it is considered an affix and stripped from all words in the language variety. If the stemmer is being too conservative, then this threshold can be reduced to allow more candidates to be considered true affixes. This value should be determined by experimentation.
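The role of the threshold can be sketched for suffixes as follows. The function `strip_suffixes` is hypothetical, words are simplified to strings in which each character stands for one segment, and trying longer affixes first is an assumption of this sketch rather than documented behavior.

```python
from typing import Dict, List


def strip_suffixes(
    words: List[str],
    scores: Dict[str, float],
    threshold: float,
) -> List[str]:
    """Strip every suffix whose confidence score meets or exceeds the threshold."""
    accepted = [affix for affix, score in scores.items() if score >= threshold]
    accepted.sort(key=len, reverse=True)  # assumption: prefer longer affixes

    stems = []
    for word in words:
        for affix in accepted:
            # Never strip the whole word; a stem must remain.
            if word.endswith(affix) and len(word) > len(affix):
                word = word[: -len(affix)]
                break
        stems.append(word)
    return stems
```

With scores of 0.7 for `ed` and 0.65 for `ing`, a threshold of 0.7 strips only `ed`, while lowering it to 0.6 also strips `ing`. This mirrors how reducing the threshold lets more candidates count as true affixes.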
This value is the maximum allowable affix length that the stemmer will consider.
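A minimal sketch of how a maximum affix length bounds the search, shown for suffixes only. The function `candidate_suffixes` is a hypothetical name, words are again simplified to strings of one-character segments, and requiring at least one segment to remain as a stem is an assumption.

```python
from collections import Counter
from typing import Iterable


def candidate_suffixes(words: Iterable[str], max_length: int) -> Counter:
    """Count every word-final sequence of at most max_length segments."""
    counts: Counter = Counter()
    for word in words:
        # Leave at least one segment for the stem.
        longest = min(max_length, len(word) - 1)
        for size in range(1, longest + 1):
            counts[word[-size:]] += 1
    return counts
```

Raising `max_length` increases the number of candidates that have to be scored, so a small value keeps the search focused on short, plausible affixes.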
- Selkirk, E. (1984). On the Major Class Features and Syllable Theory. In M. Aronoff & R. T. Oehrle (Eds.) Language Sound Structure (pp. 107-136). Cambridge, MA: MIT Press.
- Selkirk, E. (1981). English Compounding and the Theory of Word Structure. In M. Moortgat, H. Van der Hulst, & T. Hoekstra (Eds.) The Scope of Lexical Rules (pp. 229-277). Dordrecht: Foris.
- Iacoponi, L. & Savy, R. (2011). Sylli: Automatic Phonological Syllabification for Italian. In P. Cosi, R. De Mori, G. Di Fabbrizio, & R. Pieraccini (Eds.) Proceedings of INTERSPEECH 2011 (pp. 641-644). Florence, Italy: ISCA.
- Hammarström, H. (2006). A Naive Theory of Affixation and an Algorithm for Extraction. In Proceedings of SIGPHON 2006 (pp. 79-88). Stroudsburg, PA: ACL.