ImapDb.Database: Register new ICU-based tokeniser for FTS

SQLite's built-in unicode61 tokeniser cannot find word boundaries in
scripts that do not separate words with spaces (CJK, Thai, etc.), so
full-text search works poorly for text in those languages.

This adds a custom SQLite tokeniser based on ICU that breaks words for
all languages supported by that library. Each token is passed through
NFKC_Casefold, which applies compatibility normalisation, folds case,
and drops ignorable characters.
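
To illustrate what NFKC_Casefold does to a token, here is a rough sketch in
Python using only the standard library. It is not the code added by this
commit (which uses ICU from C) and only approximates the real mapping: the
function name `approx_nfkc_casefold` and the `IGNORABLE` set are hypothetical,
and the set covers just a handful of default-ignorable code points.

```python
import unicodedata

# Assumption: a small, illustrative subset of Unicode's default-ignorable
# code points (soft hyphen, zero-width space/joiners, word joiner, BOM).
# ICU's NFKC_Casefold handles the complete set.
IGNORABLE = {0x00AD, 0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF}


def approx_nfkc_casefold(text: str) -> str:
    """Approximate NFKC_Casefold: normalise, fold case, drop ignorables."""
    # Compatibility normalisation, e.g. fullwidth letters and ligatures.
    folded = unicodedata.normalize("NFKC", text).casefold()
    # Case folding can produce text that is no longer normalised.
    folded = unicodedata.normalize("NFKC", folded)
    return "".join(ch for ch in folded if ord(ch) not in IGNORABLE)


if __name__ == "__main__":
    for sample in ["ＧＥＡＲＹ", "Straße", "\ufb01le", "co\u00adop"]:
        print(repr(sample), "->", repr(approx_nfkc_casefold(sample)))
```

Folding both the indexed text and the search query this way means that
variants such as fullwidth and ASCII letters produce the same token.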

Fixes #121
Michael Gratton 2020-11-13 08:41:08 +11:00 committed by Michael James Gratton
parent 90711f234e
commit 7e38198287
7 changed files with 325 additions and 13 deletions

@@ -14,6 +14,6 @@ CREATE VIRTUAL TABLE MessageSearchTable USING fts5(
 bcc,
 flags,
-tokenize="unicode61 remove_diacritics 2",
+tokenize="geary_tokeniser",
 prefix="2,4,6,8,10"
 )