ImapDb.Database: Register new ICU-based tokeniser for FTS

The SQLite tokeniser does not handle scripts that do not use spaces
to separate words (CJK, Thai, etc.), so full-text search in those
languages works poorly.

This adds a custom SQLite tokeniser based on ICU that breaks words for
all languages supported by that library, and applies NFKC_Casefold to
handle Unicode normalisation, case folding, and dropping of ignorable
characters.
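
As a rough illustration of the two points above, here is a minimal
Python sketch. It is not the code from this commit, and it does not use
ICU: the helper name approx_nfkc_casefold is hypothetical, and it only
approximates NFKC_Casefold with the standard library. Word breaking for
scripts without spaces needs ICU's dictionary-based break iterator and
is not reproduced here; the last lines only show why splitting on
whitespace fails for such text.

```python
import unicodedata


def approx_nfkc_casefold(text: str) -> str:
    """Approximate NFKC_Casefold with the standard library only.

    Assumption: this is an illustration, not ICU's exact mapping.
    """
    # Normalise, case fold, then normalise again, since case folding
    # can produce sequences that are no longer in NFKC form.
    folded = unicodedata.normalize(
        "NFKC", unicodedata.normalize("NFKC", text).casefold()
    )
    # Drop format characters (category Cf) as a rough stand-in for
    # Unicode's Default_Ignorable_Code_Point set, which is not
    # identical to Cf.
    return "".join(ch for ch in folded if unicodedata.category(ch) != "Cf")


# Fullwidth letters, sharp s, and a soft hyphen all fold to plain forms,
# so a query and an indexed term can match after normalisation.
print(approx_nfkc_casefold("Ｆｕｌｌ"))          # full
print(approx_nfkc_casefold("Straße"))        # strasse
print(approx_nfkc_casefold("co\u00adoperate"))  # cooperate

# Whitespace splitting treats a whole Japanese sentence as one token,
# which is why a space-based tokeniser cannot find words inside it.
print(len("東京に住んでいます".split()))       # 1
```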

Fixes #121
Michael Gratton 2020-11-13 08:41:08 +11:00 committed by Michael James Gratton
parent 90711f234e
commit 7e38198287
7 changed files with 325 additions and 13 deletions

@@ -26,7 +26,7 @@ variables:
meson vala desktop-file-utils enchant2-devel folks-devel gcr-devel
glib2-devel gmime30-devel gnome-online-accounts-devel gspell-devel
gsound-devel gtk3-devel iso-codes-devel json-glib-devel itstool
- libappstream-glib-devel libgee-devel libhandy1-devel
+ libappstream-glib-devel libgee-devel libhandy1-devel libicu-devel
libpeas-devel libsecret-devel libstemmer-devel libunwind-devel
libxml2-devel libytnef-devel sqlite-devel webkitgtk4-devel
FEDORA_TEST_DEPS: glibc-langpack-en gnutls-utils tar Xvfb xz
@@ -37,9 +37,9 @@ variables:
itstool libappstream-glib-dev libenchant-2-dev libfolks-dev
libgcr-3-dev libgee-0.8-dev libglib2.0-dev libgmime-3.0-dev
libgoa-1.0-dev libgspell-1-dev libgsound-dev libgtk-3-dev
- libhandy-1-dev libjson-glib-dev libmessaging-menu-dev libpeas-dev
- libsecret-1-dev libsqlite3-dev libstemmer-dev libunwind-dev
- libwebkit2gtk-4.0-dev libxml2-dev libytnef0-dev
+ libhandy-1-dev libicu-dev libjson-glib-dev libmessaging-menu-dev
+ libpeas-dev libsecret-1-dev libsqlite3-dev libstemmer-dev
+ libunwind-dev libwebkit2gtk-4.0-dev libxml2-dev libytnef0-dev
UBUNTU_TEST_DEPS: gnutls-bin librsvg2-common locales xauth xvfb
fedora: