collate gives different results than applying compare on sortKey #91

@ChickenProp

ghci> import qualified Data.Text.ICU as ICU
ghci> let testCompare c a b = (ICU.collate c a b, compare (ICU.sortKey c a) (ICU.sortKey c b))

According to the docs, testCompare c a b should always return a pair of two equal values (i.e. (EQ, EQ), (LT, LT) or (GT, GT)). But this isn't the case, for example:

ghci> let c = ICU.collator ICU.Root
ghci> testCompare c "" "\EOT"
(EQ,LT)
ghci> testCompare c "" "\ETX"
(EQ,LT)
ghci> testCompare c "" "\NUL"
(EQ,LT)
ghci> testCompare c "" "\2205"
(EQ,LT)
ghci> testCompare c "" "\2250"
(EQ,LT)
ghci> testCompare c "" "\2250\ETX\2205"
(EQ,LT)

As far as I can tell, there are a handful of characters (including all of those above) for which Data.ByteString.unpack $ ICU.sortKey c "(char)" gives [1, 1, 0]. The problem manifests when we compare a string made up of any number of these characters (such a string also has sort key [1, 1, 0]) to the empty string (sort key []). I haven't seen it in any other situation.
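To make the LT half of the pair concrete, here is a minimal sketch that needs no ICU at all. It assumes only the two sort keys reported above and models them as plain byte lists; the names emptyKey and ignorableKey are made up for illustration. Lists of Word8 compare lexicographically, just as strict ByteStrings do, so an empty key always sorts before a non-empty one.

```haskell
import Data.Word (Word8)

-- Sort keys as plain byte lists; the values are copied from the report,
-- not computed by ICU here.
emptyKey, ignorableKey :: [Word8]
emptyKey     = []         -- reported sort key of ""
ignorableKey = [1, 1, 0]  -- reported sort key of "\EOT", "\2250", ...

main :: IO ()
main = do
  -- Lexicographic order: the empty list is a strict prefix of any
  -- non-empty list, so it compares as LT.
  print (compare emptyKey ignorableKey)      -- LT
  -- Any two strings built only from these characters share the same key.
  print (compare ignorableKey ignorableKey)  -- EQ
```

So the LT follows directly from the empty string getting an empty key rather than [1, 1, 0]; whether that empty key comes from text-icu or from icu itself is not something this sketch can tell.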

(\2250 is U+08CA "arabic small high farsi yeh" and \2205 is U+089D "arabic superscript alef mokhassas". I found these essentially at random. A few others in the vicinity have the same property, like \2251 but not \2206. I haven't looked to see if there's any pattern here.)
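One way to look for a pattern would be to scan a block of code points and list those whose sort key is exactly [1, 1, 0]. The following is a rough sketch, assuming text-icu is installed; the range U+0890..U+08FF is an arbitrary window around the characters above, and hasIgnorableKey is a helper name invented here.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.ICU as ICU
import Numeric (showHex)

-- The sort key reported above for the affected characters.
ignorableKey :: B.ByteString
ignorableKey = B.pack [1, 1, 0]

-- True when the single-character string has exactly that sort key.
hasIgnorableKey :: ICU.Collator -> Char -> Bool
hasIgnorableKey c ch = ICU.sortKey c (T.singleton ch) == ignorableKey

main :: IO ()
main = do
  let c      = ICU.collator ICU.Root
      window = ['\x0890' .. '\x08FF']  -- arbitrary range, widen as needed
  -- Print the code point (in hex) of every character in the window
  -- whose sort key matches.
  mapM_ (\ch -> putStrLn ("U+" ++ showHex (fromEnum ch) ""))
        (filter (hasIgnorableKey c) window)
```

The output will depend on the ICU version and its collation data, so results from one machine may not match another.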

I tried a few other collators. collatorWith _ [Strength Secondary] makes the sort key of the non-empty strings [1, 0] instead of [1, 1, 0], but testCompare gives the same results. Changing the base to Locale "en" or adding Numeric True doesn't obviously make a difference.

This is with text-icu-0.8.0.2. I can't rule out that this is a bug in icu itself. I'm not familiar enough with C to be able to test that easily, though I expect I could figure it out. I'm using a version provided by nix. Based on the output of lsof, it seems to be version 72.1: my running GHC has these files open:

/nix/store/x6cq3940a5krcwj0p28y3b6lckxmcfqw-icu4c-72.1/lib/libicudata.so.72.1
/nix/store/x6cq3940a5krcwj0p28y3b6lckxmcfqw-icu4c-72.1/lib/libicui18n.so.72.1
/nix/store/x6cq3940a5krcwj0p28y3b6lckxmcfqw-icu4c-72.1/lib/libicuuc.so.72.1
