README.Debian
sunpinyin for Debian -*- outline -*-
--------------------

SunPinyin is an SLM (Statistical Language Model) based input method
engine (IME) developed by the Sun Asian G11N Center. Currently, it
supports IIIMF (Internet/Intranet Input Method Framework), SCIM
(Smart Common Input Method) and BeCJK.

As a feature-rich IM, SunPinyin provides two input styles: the classic
input style and the instant input style.

Options in scim-sunpinyin
-------------------------

scim-sunpinyin can be customized through the SCIM Input Method Setup
panel. Changing the `input style' or `character set' setting discards
the uncommitted preedit string.

Shortcut keys
-------------

Besides the customizable keys, scim-sunpinyin comes with some default
shortcut key bindings that cannot be changed. They are listed below:

  ctrl-backquote  switch between classic/instant style.
                  only takes effect in Chinese input mode.
  ctrl-k          switch between gbk/gb2312 charset.
  shift           switch between Chinese/English input mode.
  escape          dismiss the uncommitted preedit string.

Data files
----------

SunPinyin is of little use without its lexicon and language model data
files, which are not included in this package for two reasons. First,
the data set used for training the language model is far too large to
ship. Second, for the sake of performance, the lexicon and language
model are stored in SunPinyin's own binary format, and these binary
files cannot be included in the source package without violating the
DFSG.

So, to make sunpinyin usable, you may need to download the lexicon and
language model data files from the OpenSolaris website [2]. Two files
are necessary for sunpinyin to work:
  - lm_sc.t3g, pydict_sc.bin

Each of them is provided in two variants:
  - slm_sc.t3g.{be,le}, pydict_sc.bin.{be,le}

`be' and `le' stand for `Big Endian' and `Little Endian'. Download
the ones that match your computer's byte order, rename them to
lm_sc.t3g and pydict_sc.bin respectively, then put them in
/usr/share/sunpinyin/.
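
If you are unsure which byte order your machine uses, a small shell
check along the following lines may help. This is only a sketch: the
function name byte_order_suffix is made up for this example, and it
assumes a POSIX shell with the standard printf, od and tr utilities.

```shell
# Print "le" on a little-endian host and "be" on a big-endian one.
# (byte_order_suffix is a hypothetical helper, not part of sunpinyin.)
byte_order_suffix() {
  # od reads the two bytes 0x01 0x02 as one 16-bit word in host byte order.
  word=$(printf '\001\002' | od -A n -t x2 | tr -d ' \n')
  case "$word" in
    0201) echo le ;;
    0102) echo be ;;
    *)    echo "cannot determine byte order" >&2; return 1 ;;
  esac
}

suffix=$(byte_order_suffix) && echo "download the .$suffix files"
```

On a little-endian machine, for instance, pydict_sc.bin.le would then
be the file to install as /usr/share/sunpinyin/pydict_sc.bin.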


Create your own language model
------------------------------

The statistical language model is the building block of SunPinyin.
Users can train their own language models with the tools provided by
sunpinyin-slm. (sunpinyin-slm is not packaged in Debian yet, but its
source code is available from the OpenSolaris input method project's
web page [1].)

* Terms
** Language model
The language model (LM) describes the characteristics of a given
language. A statistical language model (SLM) attempts to capture the
regularities of natural language using statistical approaches.

The LM shipped with SunPinyin is a data file named lm_sc.t3g. It was
trained on a raw corpus collected from some Chinese web sites.
lm_sc.t3g stands for Language Model, Simplified Chinese, Threaded
3-Gram. The SunPinyin LM is an SLM that supports back-off and
trigrams.

** Corpus
To build an LM, we need a good training data set, that is, a corpus.
A corpus is simply a large set of sentences or text in a given
language. The raw corpus used by SunPinyin was collected from some web
sites in Simplified Chinese.

** Lexicon/dictionary
To segment the raw corpus into word tokens, we need a Chinese
dictionary that stores each Chinese word together with its frequency.

SunPinyin uses two lexicons: dict.utf8 and pydict_sc.bin.

*** dict.utf8
The word frequencies in dict.utf8 are used in the first iteration of
segmenting the raw corpus.

*** pydict_sc.bin
This data file is a trie representation of the syllables and the
corresponding Chinese words, so that Chinese words can be looked up
with an incomplete pinyin prefix. This lexicon is also sorted by the
unigram of a previously trained LM.

* How to train your own language model (LM)?

** Prerequisites

*** raw corpus
To train a new LM, a raw corpus must be prepared. The only requirement
is that it be encoded in UTF-8.

*** sunpinyin-slm
The suite of tools provided by sunpinyin-slm is also required.

** Steps
To train a decent LM, we need to go through two rounds of the training
process.

  raw corpus ---> segmented corpus ---> n-gram result ---> back-off LM
             ---> pruned trigram ---> threaded LM
             ---> segment again using the resulting LM ---> ...

*** Segment the raw corpus
In this step, all words in the raw corpus are identified and
translated to their corresponding IDs. ${DICTFILE} is always
`dict.utf8'.

**** In the first round
Simply segment the raw corpus into words using MMF (the Maximum
Matching Forwarding segmentation algorithm).

  ./mmseg -d ${DICTFILE} -f bin -s 10 -a 9 ${CORPUSFILE} >${IDS_FILE}
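
To give a feel for forward maximum matching, here is a toy sketch in
shell and awk. It is not how mmseg is implemented: the function name
fmm_segment, the dictionary and the input text are all made up, it
works on ASCII strings only, and it prints words instead of IDs. At
each position it takes the longest dictionary word that matches, and
falls back to a single character when nothing matches.

```shell
# fmm_segment DICT TEXT: split TEXT greedily, longest dictionary word first.
# DICT is a space-separated word list; both arguments are toy ASCII data.
# (fmm_segment is a hypothetical helper, not part of sunpinyin-slm.)
fmm_segment() {
  awk -v dict="$1" -v text="$2" 'BEGIN {
    n = split(dict, words, " ")
    maxlen = 1
    for (i = 1; i <= n; i++) {
      known[words[i]] = 1
      if (length(words[i]) > maxlen) maxlen = length(words[i])
    }
    pos = 1
    out = ""
    while (pos <= length(text)) {
      # Start from the longest possible candidate at this position.
      len = length(text) - pos + 1
      if (len > maxlen) len = maxlen
      # Shrink the candidate until it is a known word or a single character.
      while (len > 1 && !(substr(text, pos, len) in known)) len--
      out = out (out == "" ? "" : " ") substr(text, pos, len)
      pos += len
    }
    print out
  }'
}

fmm_segment "a ab abc cd d" "abcd"
```

With this toy dictionary the call prints `abc d', although `ab cd'
would also be a valid split: the greedy choice of the longest match at
each position is what characterizes the algorithm.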

**** In the second round
With the help of the trained LM, we can get a better segmentation:

  ./slmseg -d ${DICTFILE} -f bin -s 10 -m ${TSLM_FILE} ${CORPUSFILE} >${IDS_FILE}

*** Calculate the 3-gram
The number of occurrences of each 3-word tuple is calculated and
written to the file ${IDNGRAM_FILE}:

  ./ids2ngram -n 3 -s ${SWAP_FILE} -o ${IDNGRAM_FILE} -p 5000000 ${IDS_FILE}
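
Conceptually, counting 3-grams means sliding a window of three word
IDs over the segmented corpus and counting how often each window
occurs. The sketch below illustrates this on a made-up ID sequence. It
says nothing about the file formats ids2ngram actually reads or
writes, it ignores sentence boundaries, and count_trigrams is a
hypothetical name used only here.

```shell
# count_trigrams: read one word ID per line and print
# "<count> <id1> <id2> <id3>" for every distinct 3-ID window.
# (Hypothetical helper for illustration, not part of sunpinyin-slm.)
count_trigrams() {
  awk '
    { a = b; b = c; c = $1 }              # slide the three-ID window
    NR >= 3 { count[a " " b " " c]++ }    # count each complete window
    END { for (k in count) print count[k], k }
  ' | sort
}

# Toy ID sequence, made up for this example.
printf '%s\n' 1 2 3 1 2 3 4 | count_trigrams
```

In this toy sequence the tuple (1, 2, 3) occurs twice and the other
three tuples occur once each.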

*** Build a back-off trigram LM
To build a trigram LM with back-off smoothing from the 3-gram result
and write it to ${RAW_LM_FILE}, run:

  ./slmbuild -n 3 -o ${RAW_LM_FILE} -w 120000 -c 0,2,2 -d ABS,0.005 -d ABS -d ABS,0.6 -b 10 -e 9 ${IDNGRAM_FILE}

*** Prune the raw back-off LM using an entropy based approach
To remove as many useless probabilities as possible without
increasing the relative entropy, we examine the effect of removing
each n-gram to find the useless ones:

  ./slmprune ${RAW_LM_FILE} ${SLM_FILE} R 100000 1250000 1000000

*** Thread the tree to speed up lookups
To speed up lookups, the lookup tree is threaded using:

  ./slmthread ${SLM_FILE} ${TSLM_FILE}

The resulting ${TSLM_FILE} is the language model itself, i.e. the
file `lm_sc.t3g'.
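
The stages that follow segmentation can be chained in a small script
such as the sketch below. The commands and options are the ones given
above. The default file names, the run helper, the build_lm function
and the DRY_RUN switch are assumptions made for this example. With
DRY_RUN set, the commands are only printed, so the sketch can be tried
without sunpinyin-slm installed.

```shell
# File names are placeholders chosen for this example; adjust as needed.
IDS_FILE=${IDS_FILE:-corpus.ids}
SWAP_FILE=${SWAP_FILE:-swap.tmp}
IDNGRAM_FILE=${IDNGRAM_FILE:-corpus.id3gram}
RAW_LM_FILE=${RAW_LM_FILE:-raw.3gram}
SLM_FILE=${SLM_FILE:-pruned.3gram}
TSLM_FILE=${TSLM_FILE:-lm_sc.t3g}

# run: print the command when DRY_RUN is set, otherwise execute it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# build_lm: count 3-grams, build, prune and thread the LM.
# The && chain stops at the first stage that fails.
build_lm() {
  run ./ids2ngram -n 3 -s "$SWAP_FILE" -o "$IDNGRAM_FILE" -p 5000000 "$IDS_FILE" &&
  run ./slmbuild -n 3 -o "$RAW_LM_FILE" -w 120000 -c 0,2,2 \
      -d ABS,0.005 -d ABS -d ABS,0.6 -b 10 -e 9 "$IDNGRAM_FILE" &&
  run ./slmprune "$RAW_LM_FILE" "$SLM_FILE" R 100000 1250000 1000000 &&
  run ./slmthread "$SLM_FILE" "$TSLM_FILE"
}

# Show what would be executed without running the tools.
DRY_RUN=1 build_lm
```

For the second round, segment the corpus again with slmseg and the
freshly built ${TSLM_FILE}, then call build_lm once more.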

--
[1] http://www.opensolaris.org/os/project/input-method/files/inputmethod-repo-snapshot.tar.gz
[2] http://src.opensolaris.org/source/xref/nv-g11n/inputmethod/sunpinyin/ime/data/

 -- Kov Chai <tchaikov (a] gmail.com>  Sat, 12 Jul 2008 03:15:47 +0800
