README.Debian

sunpinyin for Debian                                -*- outline -*-
--------------------

SunPinyin is an SLM (Statistical Language Model) based input method
engine (IME) developed by the Sun Asian G11N Center. Currently, it
supports IIIMF (Internet/Intranet Input Method Framework), SCIM
(Smart Common Input Method) and BeCJK.

As a feature-rich IM, SunPinyin provides two input styles: the classic
input style and the instant input style.

Options in scim-sunpinyin
-------------------------

Scim-sunpinyin can be customized with the SCIM Input Method Setup
panel. Changing the `input style' or `character set' setting dismisses
the uncommitted preedit string.

Shortcut keys
-------------

Besides the customizable keys, scim-sunpinyin comes with some default
shortcut key bindings that cannot be changed. They are listed below:

ctrl-backquote  switch between classic/instant style.
                only takes effect in Chinese input mode.
ctrl-k          switch between gbk/gb2312 charset.
shift           switch between Chinese/English input mode.
escape          dismiss the uncommitted preedit string.

Data files
----------

SunPinyin can hardly work without the lexicon and language model data
files, yet they are not part of this package. First, we cannot ship
the _really_ large dataset used for training the language model.
Second, for the sake of performance, we store the lexicon and language
model data in our own binary format, and we cannot include these
binary files in the source package without violating the DFSG.

So, to make sunpinyin usable, you may need to download the lexicon and
language model data files from the OpenSolaris website [2]. Two files
are necessary for sunpinyin to work:
 - lm_sc.t3g, pydict_sc.bin

We provide two variants of each of them:
 - slm_sc.t3g.{be,le}, pydict_sc.bin.{be,le}

`be' and `le' stand for `Big Endian' and `Little Endian'. Download
the ones that match your computer's byte order, rename them to
lm_sc.t3g and pydict_sc.bin respectively, then put them in
/usr/share/sunpinyin/.
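
As a rough illustration of the step above, the following Python sketch
picks the `be' or `le' variant from the byte order of the running
machine and lists which downloaded file should end up under which
name. The helper names (endian_suffix, rename_plan) are made up for
this example; the file names are the ones listed above. It only
computes the plan and does not download or copy anything.

```python
import sys

# Base name of each downloaded file -> name sunpinyin expects after renaming.
# Both columns are taken from the file names listed in this README.
TARGET_NAMES = {
    "slm_sc.t3g": "lm_sc.t3g",
    "pydict_sc.bin": "pydict_sc.bin",
}


def endian_suffix(byteorder=sys.byteorder):
    """Map Python's byte-order name ('little' or 'big') to 'le' or 'be'."""
    return {"little": "le", "big": "be"}[byteorder]


def rename_plan(byteorder=sys.byteorder, data_dir="/usr/share/sunpinyin"):
    """Return (downloaded file name, installed path) pairs for this machine."""
    suffix = endian_suffix(byteorder)
    return [
        (f"{base}.{suffix}", f"{data_dir}/{target}")
        for base, target in TARGET_NAMES.items()
    ]
```

Calling rename_plan() with no arguments uses the byte order of the
machine the script runs on; each pair can then be handed to
shutil.copy() with sufficient permissions.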


Create your own language model
------------------------------

The statistical language model is the building block of SunPinyin.
Users can train their own language models using the tools provided by
sunpinyin-slm. (sunpinyin-slm is not packaged in Debian yet, but its
source code is available from the OpenSolaris input method project's
web page [1].)

* Terms
** Language model
   A language model (LM) describes the characteristics of a given
   language. A statistical language model (SLM) attempts to capture
   the regularities of natural language using statistical approaches.

   The LM shipped with SunPinyin is a data file named lm_sc.t3g. It
   was trained on a raw corpus collected from some Chinese web sites.
   lm_sc.t3g stands for Language Model Simplified Chinese Threaded
   3-Gram. The SunPinyin LM is a trigram SLM that supports back-off.

** Corpus
   To build an LM, we need a good training data set, that is, a
   corpus. A corpus is simply a large set of sentences or text in a
   given language. The raw corpus used by SunPinyin was collected
   from some web sites in Simplified Chinese.

** Lexicon/dictionary
   To segment the raw corpus into word tokens, we need a Chinese
   dictionary that stores each Chinese word together with its
   frequency.

   Two lexicons are used in SunPinyin: dict.utf8 and pydict_sc.bin.

*** dict.utf8
    The word frequencies in dict.utf8 are used in the first
    iteration of segmentation of the raw corpus.

*** pydict_sc.bin
    This data file is a trie representation of the syllables and the
    corresponding Chinese words, so that we can look up Chinese words
    by an incomplete pinyin prefix. This lexicon is also sorted by
    the unigrams of a previously trained LM.
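
To show why a trie suits lookups by incomplete pinyin, here is a
minimal in-memory sketch in Python. It is not the binary layout of
pydict_sc.bin: the class name PinyinTrie, the per-letter nodes and
the "$words" marker key are assumptions made for this example only,
and the unigram-based ordering is left out.

```python
class PinyinTrie:
    """Toy trie keyed by pinyin letters; each node may carry a word list."""

    WORDS = "$words"  # marker key; cannot clash with single-letter child keys

    def __init__(self):
        self.root = {}

    def insert(self, pinyin, word):
        """Store `word` under the full pinyin string `pinyin`."""
        node = self.root
        for letter in pinyin:
            node = node.setdefault(letter, {})
        node.setdefault(self.WORDS, []).append(word)

    def lookup_prefix(self, prefix):
        """Return all words whose pinyin starts with `prefix`, sorted."""
        node = self.root
        for letter in prefix:
            if letter not in node:
                return []
            node = node[letter]
        found, stack = [], [node]
        while stack:  # walk the whole subtree below the prefix node
            current = stack.pop()
            for key, child in current.items():
                if key == self.WORDS:
                    found.extend(child)
                else:
                    stack.append(child)
        return sorted(found)
```

For example, after inserting ("zhong", "中"), ("zhongguo", "中国") and
("zhou", "周"), looking up the prefix "zho" returns all three words,
while "zhong" returns only the first two.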

* How to train your own language model (LM)?

** Prerequisites

*** Raw corpus
    To train a new LM, a raw corpus should be prepared. The only
    requirement for this raw corpus is that it be encoded in UTF-8.

*** sunpinyin-slm
    The suite of tools provided by sunpinyin-slm is also required.

** Steps
   To train a decent LM, we need to go through two rounds of the
   training process.

   raw corpus ---> segmented corpus ---> n-gram result ---> back-off LM
              ---> pruned trigram ---> threaded LM
              ---> segment again using the resulting LM ---> ...

*** Segment the raw corpus
    In this step, all words in the raw corpus are identified and
    translated to their corresponding IDs. The ${DICTFILE} is always
    `dict.utf8'.

**** In the first round
     Simply segment the raw corpus into words using MMF (the forward
     Maximum Matching segmentation algorithm):

     ./mmseg -d ${DICTFILE} -f bin -s 10 -a 9 ${CORPUSFILE} >${IDS_FILE}
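
The idea behind forward maximum matching can be sketched in a few
lines of Python. This is not the mmseg implementation: the function
names, the max_len limit and the use of ID 0 for unknown pieces are
assumptions for this example, and the lexicon is a plain dict from
word to ID rather than dict.utf8.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Split `text` greedily, always taking the longest lexicon word
    that starts at the current position; unknown characters are
    emitted one at a time."""
    words, pos = [], 0
    while pos < len(text):
        longest = min(max_len, len(text) - pos)
        for size in range(longest, 0, -1):
            piece = text[pos:pos + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                pos += size
                break
    return words


def to_ids(words, lexicon, unknown_id=0):
    """Translate words to their IDs; `unknown_id` is a made-up placeholder."""
    return [lexicon.get(word, unknown_id) for word in words]
```

With the toy lexicon {"研究": 1, "研究生": 2, "生命": 3, "起源": 4}, the
text "研究生命起源" is split into "研究生", "命", "起源". The greedy choice
of "研究生" blocks the better reading "研究 / 生命 / 起源", which is the
kind of error the second round is meant to reduce.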

**** In the second round
     With the help of the trained LM, we can get a better segmentation:

     ./slmseg -d ${DICTFILE} -f bin -s 10 -m ${TSLM_FILE} ${CORPUSFILE} >${IDS_FILE}

*** Calculate the 3-grams
    The number of occurrences of each 3-word tuple is counted and
    written to the file ${IDNGRAM_FILE}:

    ./ids2ngram -n 3 -s ${SWAP_FILE} -o ${IDNGRAM_FILE} -p 5000000 ${IDS_FILE}
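
Conceptually, this step slides a window of three word IDs over the
segmented corpus and counts each distinct tuple. The Python sketch
below does this entirely in memory, so it only illustrates the
counting itself; the function name count_trigrams is an assumption of
this example, and the swap file handling of ids2ngram is not modelled.

```python
from collections import Counter


def count_trigrams(ids):
    """Count every run of three consecutive word IDs in `ids`."""
    return Counter(zip(ids, ids[1:], ids[2:]))
```

For the ID sequence [1, 2, 3, 1, 2, 3] the tuple (1, 2, 3) is counted
twice, while (2, 3, 1) and (3, 1, 2) are counted once each.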

*** Build a back-off trigram LM
    To build a trigram LM with back-off smoothing in ${RAW_LM_FILE}
    from the original 3-gram result, run:

    ./slmbuild -n 3 -o ${RAW_LM_FILE} -w 120000 -c 0,2,2 -d ABS,0.005 -d ABS -d ABS,0.6 -b 10 -e 9 ${IDNGRAM_FILE}
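
To give a feeling for what back-off smoothing does, here is a small
Python sketch of a back-off bigram model with absolute discounting.
It assumes that `ABS' in the -d options above refers to absolute
discounting; the README does not spell this out. The sketch is a
bigram model rather than a trigram one to keep it short, and all
names in it are made up for this example. It is not the slmbuild
algorithm.

```python
from collections import Counter, defaultdict


def train_backoff_bigram(tokens, discount=0.5):
    """Return (unigram probs, discounted bigram probs, back-off weights)."""
    unigrams = Counter(tokens)
    total = sum(unigrams.values())
    p_uni = {w: c / total for w, c in unigrams.items()}

    followers = defaultdict(Counter)  # history -> counts of the next word
    for h, w in zip(tokens, tokens[1:]):
        followers[h][w] += 1

    p_bi, bow = {}, {}
    for h, nxt in followers.items():
        c_h = sum(nxt.values())
        for w, c in nxt.items():
            # Subtract a fixed discount from every seen bigram count.
            p_bi[(h, w)] = (c - discount) / c_h
        seen_mass = sum(p_bi[(h, w)] for w in nxt)
        seen_uni = sum(p_uni[w] for w in nxt)
        # Spread the freed mass over unseen words, in proportion to their
        # unigram probability. If every word was seen after h there is
        # nothing to back off to, so the weight is set to 0.
        bow[h] = (1.0 - seen_mass) / (1.0 - seen_uni) if seen_uni < 1.0 else 0.0
    return p_uni, p_bi, bow


def prob(w, h, model):
    """P(w | h): the discounted bigram if seen, else the weighted unigram."""
    p_uni, p_bi, bow = model
    if (h, w) in p_bi:
        return p_bi[(h, w)]
    return bow.get(h, 1.0) * p_uni.get(w, 0.0)
```

The back-off weight is chosen so that, for a history with at least one
unseen successor, the probabilities of all words in the vocabulary
still sum to one.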

*** Prune the raw back-off LM using an entropy-based approach
    To remove as many useless probabilities as possible while keeping
    the increase in relative entropy small, we examine the effect of
    removing each n-gram to find the useless ones:

    ./slmprune ${RAW_LM_FILE} ${SLM_FILE} R 100000 1250000 1000000
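
The following Python sketch shows one simplified way to rank n-grams
for such pruning. Each n-gram (h, w) is scored by how much the model
would change if its explicit probability were replaced by the
backed-off estimate, weighted by how often the n-gram is expected to
occur. The sketch ignores the renormalization of back-off weights
that a full relative-entropy computation has to account for, and it
is not taken from the slmprune source. All names are hypothetical,
and treating a pruning limit as "number of entries to keep" is an
assumption of this example.

```python
import math


def prune_scores(p_hist, p_cond, p_backoff):
    """Score (h, w) by p(h) * p(w|h) * (log p(w|h) - log p_backoff(w|h)).

    p_hist:    probability of each history h
    p_cond:    explicit probability p(w|h) of each stored n-gram
    p_backoff: estimate for the same n-gram if it were removed
    """
    return {
        (h, w): p_hist[h] * p * (math.log(p) - math.log(p_backoff[(h, w)]))
        for (h, w), p in p_cond.items()
    }


def keep_top(scores, reserve):
    """Return the `reserve` n-grams whose removal would cost the most."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:reserve])
```

An n-gram whose explicit probability equals its backed-off estimate
scores zero and is the first candidate for removal.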

*** Thread the tree to speed up lookups
    To speed up lookups, the lookup tree is threaded using:

    ./slmthread ${SLM_FILE} ${TSLM_FILE}

    The resulting ${TSLM_FILE} is the language model itself,
    i.e. the file `lm_sc.t3g'.
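
The steps of one training round can be collected in a small Python
helper that builds the command lines in execution order. The options
are copied verbatim from the commands above; the function name
training_commands and the file names in EXAMPLE_FILES are
hypothetical placeholders. The helper only builds the argument lists
and does not run anything.

```python
# Hypothetical paths for the variables used in this README.
EXAMPLE_FILES = {
    "DICTFILE": "dict.utf8",
    "CORPUSFILE": "corpus.utf8",
    "IDS_FILE": "corpus.ids",
    "SWAP_FILE": "swap.tmp",
    "IDNGRAM_FILE": "corpus.id3gram",
    "RAW_LM_FILE": "corpus.rawlm",
    "SLM_FILE": "corpus.slm",
    "TSLM_FILE": "lm_sc.t3g",
}


def training_commands(round_no, files):
    """Return the argv lists for one training round, in execution order."""
    f = files
    if round_no == 1:
        # First round: dictionary-only segmentation.
        segment = ["./mmseg", "-d", f["DICTFILE"], "-f", "bin",
                   "-s", "10", "-a", "9", f["CORPUSFILE"]]
    else:
        # Later rounds: segment with the LM from the previous round.
        segment = ["./slmseg", "-d", f["DICTFILE"], "-f", "bin",
                   "-s", "10", "-m", f["TSLM_FILE"], f["CORPUSFILE"]]
    return [
        segment,  # its standard output must be redirected to IDS_FILE
        ["./ids2ngram", "-n", "3", "-s", f["SWAP_FILE"],
         "-o", f["IDNGRAM_FILE"], "-p", "5000000", f["IDS_FILE"]],
        ["./slmbuild", "-n", "3", "-o", f["RAW_LM_FILE"], "-w", "120000",
         "-c", "0,2,2", "-d", "ABS,0.005", "-d", "ABS", "-d", "ABS,0.6",
         "-b", "10", "-e", "9", f["IDNGRAM_FILE"]],
        ["./slmprune", f["RAW_LM_FILE"], f["SLM_FILE"],
         "R", "100000", "1250000", "1000000"],
        ["./slmthread", f["SLM_FILE"], f["TSLM_FILE"]],
    ]
```

Each list can be passed to subprocess.run() with check=True. For the
first command, open ${IDS_FILE} for writing and pass it as stdout,
since both segmenters write the IDs to standard output.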

--
[1] http://www.opensolaris.org/os/project/input-method/files/inputmethod-repo-snapshot.tar.gz
[2] http://src.opensolaris.org/source/xref/nv-g11n/inputmethod/sunpinyin/ime/data/

 -- Kov Chai <tchaikov@gmail.com>  Sat, 12 Jul 2008 03:15:47 +0800