- A character-based language model built with the SRILM toolkit,
- Plus a Viterbi-based decoding process for the language model, implemented in C++.
- Given ZhuYin-mixed sequences obtained from an imperfect acoustic model with phoneme loss, reconstruct and decode the correct sentence using a character-based language model. The language model can be constructed with the SRILM toolkit, and decoding can be done with SRILM's disambig or with this C++ implementation.
- Goal:
- Given ZhuYin-mixed sequences:
- 讓 他 十分 ㄏ怕 / 只 ㄒ望 ㄗ己 明ㄋ 度 別 再 這ㄇ ㄎ命 了
- Reconstruct correct sentence:
- 讓 他 十分 害怕 / 只 希望 自己 明年 度 別 再 這麼 苦命 了
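To make the goal concrete, here is a minimal Python sketch of Viterbi decoding over a bigram model. It is not the code in mydisambig.cpp: the names `viterbi_decode`, `toy_logprob`, `TOY_BIGRAMS` and `UNSEEN` are hypothetical, and the log-probabilities are invented for illustration. In the project itself, the scores would come from the language model trained with SRILM.

```python
# Toy bigram log-probabilities, invented for illustration only.
TOY_BIGRAMS = {
    ("分", "害"): -2.0,
    ("分", "好"): -1.5,
    ("害", "怕"): -0.5,
    ("好", "怕"): -5.0,
}
UNSEEN = -10.0  # flat penalty for any bigram missing from the toy table


def toy_logprob(prev, cur):
    """Return the toy log-probability of `cur` following `prev`."""
    return TOY_BIGRAMS.get((prev, cur), UNSEEN)


def viterbi_decode(tokens, candidates, logprob):
    """Pick the highest-scoring character sequence for `tokens`.

    `candidates` maps a ZhuYin symbol to the characters it may stand for;
    any token without an entry is treated as an already-known character.
    """
    # best maps each state to (score, path) of the best path ending in it
    best = {"<s>": (0.0, [])}
    for tok in tokens + ["</s>"]:
        step = {}
        for cur in candidates.get(tok, [tok]):
            # Keep only the best predecessor for this candidate.
            score, path = max(
                ((s + logprob(prev, cur), p) for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            step[cur] = (score, path + [cur])
        best = step
    # Drop the end-of-sentence marker before returning.
    return best["</s>"][1][:-1]


decoded = viterbi_decode(["十", "分", "ㄏ", "怕"], {"ㄏ": ["害", "好", "很"]}, toy_logprob)
print(" ".join(decoded))  # 十 分 害 怕
```

In this toy table 好 scores better than 害 right after 分, but 害 wins once the following 怕 is taken into account. Choosing by the whole path rather than by each step is the reason to use Viterbi here.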
- < Ubuntu 6.4.0-17 >
- Computer Architecture: x86_64
- CPU op-mode(s): 32-bit, 64-bit
- < SRILM 1.5.10 >
- < g++ [gcc version 8.2.0 (GCC)] > (Tested)
- < g++ [gcc version 6.4.0 (GCC)] > (Tested)
- < g++ [gcc version 4.2.1 (GCC)] > (Tested)
.
├── src/
| ├── Makefile -------------> g++ compiler make file
| ├── corpus.txt -----------> Training corpus in big5 encoding
| ├── Big5-ZhuYin.map ------> character to Zhu-Yin mapping in big5 encoding
| ├── mapping.py -----------> Creates Zhu-Yin to char mapping from its inverse mapping
| ├── mydisambig.cpp -------> My implementation of a viterbi-based decoding process of the language model
| ├── separator_big5.pl ----> Separate words into characters with white space inserted in between each character
| └── testdata/ ------------> testing data 1.txt ~ 5.txt are the easy ones, 6.txt ~ 10.txt are the hard ones
├── image/
├── srilm-1.5.10.tar.gz ------> SRILM binary source code
├── problem_description.pdf --> Work spec
└── Readme.md ----------------> This file
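The tree above lists separator_big5.pl, which inserts white space between characters. As a rough illustration of that step only, here is a Python sketch. It assumes the text has already been decoded into a Python string (for example by opening the file with `encoding="big5"`), whereas the Perl script works on the big5 files directly. The function name `separate_characters` is hypothetical.

```python
def separate_characters(line: str) -> str:
    """Return `line` with exactly one space between consecutive characters.

    Existing whitespace is discarded first, so word boundaries in the
    input do not survive; only the individual characters do.
    """
    return " ".join(ch for ch in line if not ch.isspace())


print(separate_characters("讓他 十分害怕"))  # 讓 他 十 分 害 怕
```

Working at the character level matches the character-based language model described earlier: every character, including a stray ZhuYin symbol such as ㄏ, becomes its own token.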
- Compile code:
make all
- Separate training and testing data into separate characters:
make separate
- Build Zhu-Yin to char mapping:
make map
- This generates 2 files: I) ZhuYin-Big5.map, and II) ZhuYin-Utf8.map where:
I) ZhuYin-Big5.map: the Zhu-Yin to Chinese character mapping in big5 encoding
II) ZhuYin-Utf8.map: the Zhu-Yin to Chinese character mapping in utf8 encoding, for user verification in an ordinary linux environment
- Build language model:
make build_lm
- Decode with SRILM disambig:
make run_disambig
- Decode with MY disambig:
make run
- Decode with MY disambig but show the output on screen instead of writing it to a file:
make run_cout
- Clean executables:
make clean
- Clean everything generated in the above steps:
make cleanest
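The `make map` step above builds the Zhu-Yin to character mapping from its inverse. The sketch below shows one way such an inversion could look in Python. It is not the contents of mapping.py, and the input format is an assumption: each line is taken to hold one character, white space, then one or more readings separated by `/`. The function name `invert_mapping` is hypothetical.

```python
def invert_mapping(lines):
    """Build a ZhuYin-initial -> characters mapping from character -> readings lines.

    Assumed input format (not confirmed by this README): "<char> <reading>/<reading>".
    Each character is also mapped to itself, so that characters that are
    already known pass through decoding unchanged.
    """
    zhuyin_to_chars = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        char, readings = parts
        for reading in readings.split("/"):
            initial = reading[0]  # first ZhuYin symbol of the reading
            bucket = zhuyin_to_chars.setdefault(initial, [])
            if char not in bucket:
                bucket.append(char)
        zhuyin_to_chars.setdefault(char, [char])
    return zhuyin_to_chars


sample = ["害 ㄏㄞˋ", "好 ㄏㄠˇ/ㄏㄠˋ", "希 ㄒㄧ"]
mapping = invert_mapping(sample)
print(mapping["ㄏ"])  # ['害', '好']
```

Keying on only the first ZhuYin symbol matches the inputs shown in the Goal section, where a lost phoneme leaves just the initial (for example ㄏ怕).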
- The variables SRIPATH and MACHINE_TYPE can be specified by the user through the make command:
make MACHINE_TYPE=i686-m64 SRIPATH=/home/user/srilm-1.5.10 all
make MACHINE_TYPE=i686-m64 SRIPATH=/home/user/srilm-1.5.10 separate
make MACHINE_TYPE=i686-m64 SRIPATH=/home/user/srilm-1.5.10 map
make MACHINE_TYPE=i686-m64 SRIPATH=/home/user/srilm-1.5.10 build_lm
make MACHINE_TYPE=i686-m64 SRIPATH=/home/user/srilm-1.5.10 run_disambig
make MACHINE_TYPE=i686-m64 SRIPATH=/home/user/srilm-1.5.10 run
- Default settings of SRIPATH and MACHINE_TYPE are:
- SRIPATH can be obtained by running the command $ pwd under the srilm-1.5.10/ directory.
- MACHINE_TYPE can be verified through the command: $ lscpu
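As a convenience, the architecture check can also be scripted. The sketch below is a hedged example, not part of this repo: the function name `guess_machine_type` is hypothetical, and the lookup table is an assumption that covers only the x86_64 to i686-m64 case used throughout this README. Any other architecture returns `None`, so it has to be resolved by hand.

```python
import platform


def guess_machine_type(arch):
    """Map a CPU architecture string to an SRILM MACHINE_TYPE, or None if unknown.

    Only the 64-bit x86 case from this README is covered (assumed table).
    """
    table = {"x86_64": "i686-m64", "amd64": "i686-m64"}
    return table.get(arch.lower())


# platform.machine() reports the architecture of the current machine.
print(guess_machine_type(platform.machine()))
```

The printed value can then be passed on the command line, in the same way as the `make MACHINE_TYPE=... SRIPATH=...` examples above.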
- Install csh if not already installed:
$ sudo apt-get install csh
- Install gawk if not already installed:
$ sudo apt-get install gawk
- The following instructions are for a 64-bit Ubuntu machine.
- Use the SRILM source code provided in this repo, or download it here.
- Untar the source code package:
$ tar zxvf srilm-1.5.10.tar.gz
- Enter the resulting SRILM directory:
$ cd srilm-1.5.10/
- Get the absolute path to the srilm-1.5.10/ directory:
$ pwd
- Modify srilm-1.5.10/Makefile: change the SRILM variable to the absolute path of srilm-1.5.10/, and change the MACHINE_TYPE variable to match the 64-bit Ubuntu architecture:
# SRILM = /home/speech/stolcke/project/srilm/devel
SRILM = /home/andi611/dsp/srilm-1.5.10
# MACHINE_TYPE := $(shell $(SRILM)/sbin/machine-type)
MACHINE_TYPE := i686-m64
- Modify the following lines in srilm-1.5.10/common/Makefile.machine.i686-m64 to:
- line 17:
CC = /usr/bin/gcc $(GCC_FLAGS)
- line 18:
CXX = /usr/bin/g++ $(GCC_FLAGS) -DINSTANTIATE_TEMPLATES
- Lines 52, 53, 54 (lines under # Tcl support):
TCL_INCLUDE =
TCL_LIBRARY =
NO_TCL = X
- Line 69:
GAWK = /usr/bin/gawk
- Comment out lines 14 ~ 29 in srilm-1.5.10/lm/src/matherr.c, since glibc 2.27 has removed struct exception.
- Make sure that all the programs under srilm-1.5.10/sbin are executable; if not:
$ sudo chmod 755 *
- Compile:
$ sudo make World
- Clean up:
$ make cleanest
- The compiled executable files should be in srilm-1.5.10/bin/i686-m64
- Refer to the following links for further environment issues:
- SRILM compilation problem:
- Encoding problem: