Database Construction and Search Algorithm Implementation for an RE-PSOLA-Based TTS System
Current TTS systems are typically built on a large corpus of utterances collected and segmented from recordings of real speech. The database required for synthesis is consequently very large, ranging from several hundred megabytes to several gigabytes. Such systems produce high-quality synthesized speech and are suitable for server-based deployment, but not for small devices such as PDAs or embedded applications. There is therefore a need to reduce the size of the speech data without losing much of the quality of the synthesized speech. Many Korean corpus-based speech synthesizers have databases of 1 or 2 gigabytes. In an effort to develop a small-scale Korean TTS system, we tried to reduce the size of the database by applying articulatory principles. The speech data consist of 1000 phonetically balanced sentences read by one female announcer. All the recorded speech was then segmented by phoneticians. The total size of the original speech data is about 640 MB, including the laryngograph signal. To characterize the speech data, a statistical analysis was performed, and the frequencies of each phonemic environment and of each phoneme, diphone, and triphone were obtained. Diphones and phonemes were chosen as synthesis units. The diphone and phoneme unit sets were extracted from the original database, and during extraction environmental information such as the neighboring phonemes and their classes was recorded with each unit. The initial diphone DB was 320 MB and contained many duplicate units. It was reduced to 82 MB (without compression) by considering articulatory environments. The concatenation algorithm is as follows: 1) diphones are grouped into the classes CV, VC, CC, and VV; 2) CV units are assigned to the input sentence first; 3) the remaining diphone units are assigned in the order VC, VV, CC; 4) finally, phoneme units are assigned to any phonemes still unassigned.
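The four assignment steps above can be sketched roughly as follows. This is only an illustrative reading of the abstract, not the authors' implementation: it assumes that units do not overlap (each diphone covers two whole phonemes), that each pass scans the sentence left to right, and that a diphone is used only if it exists in the inventory. The names `assign_units`, `kind`, and `inventory` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# A unit is (class label, phonemes it covers); "P" marks a single-phoneme unit.
Unit = Tuple[str, Tuple[str, ...]]

# Assignment order from the abstract: CV first, then VC, VV, CC.
PASS_ORDER = ("CV", "VC", "VV", "CC")


def assign_units(phonemes: List[str],
                 kind: Dict[str, str],
                 inventory: Set[Tuple[str, str]]) -> List[Unit]:
    """Cover a phoneme sequence with diphone units, then phoneme units.

    kind maps each phoneme symbol to "C" or "V".
    inventory is the set of diphones available in the DB (assumption).
    """
    n = len(phonemes)
    covered = [False] * n
    units: Dict[int, Unit] = {}  # start position -> unit

    for cls in PASS_ORDER:
        i = 0
        while i < n - 1:
            pair = (phonemes[i], phonemes[i + 1])
            pair_cls = kind[pair[0]] + kind[pair[1]]
            if (not covered[i] and not covered[i + 1]
                    and pair_cls == cls and pair in inventory):
                units[i] = (cls, pair)
                covered[i] = covered[i + 1] = True
                i += 2  # skip past the phonemes just covered
            else:
                i += 1

    # Step 4: anything still uncovered falls back to a phoneme unit.
    for i in range(n):
        if not covered[i]:
            units[i] = ("P", (phonemes[i],))

    return [units[i] for i in sorted(units)]
```

Running the CV pass first gives consonant-vowel transitions priority, so the later passes and the phoneme fallback only fill whatever gaps remain.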
Consonants are classified into 5 categories and vowels into 4 categories according to their articulatory positions. During the search, the best unit for each articulatory position is sought. RE-PSOLA (residual-excited pitch-synchronous overlap-and-add) was used to synthesize the waveform. The voice quality of the synthesized speech was compared with the original speech by means of spectrographic information and objective tests. Because phonemic environments are taken into account, speech synthesized from phoneme units alone was of fairly good quality, which was not possible with previous synthesizers. The synthesized speech also retains the original speaker's prosodic information even when no prosodic information is provided. Only some discontinuities of pitch and amplitude are observed. This phenomenon is caused by the articulatory information contained in the synthesis DB. The quality of the synthesized speech was not much degraded when the synthesis DB was reduced from 320 MB to 82 MB.
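One way to picture the class-based search is a context score that prefers a candidate recorded next to the same neighboring phonemes and otherwise accepts a neighbor from the same articulatory class. The sketch below is an assumption-laden illustration: the abstract does not list the actual categories or the scoring rule, so the class labels (`C1`, `V1`, ...), the weights 2 and 1, and the names `StoredUnit`, `context_score`, and `best_unit` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class StoredUnit:
    name: str   # unit identity, e.g. "k-a" for a diphone
    left: str   # phoneme preceding the unit in the recording
    right: str  # phoneme following the unit in the recording


def context_score(unit: StoredUnit, left: str, right: str,
                  art_class: Dict[str, str]) -> int:
    """Score 2 per side for an identical neighbor, 1 for the same class."""
    score = 0
    for stored, wanted in ((unit.left, left), (unit.right, right)):
        if stored == wanted:
            score += 2
        else:
            stored_cls = art_class.get(stored)
            if stored_cls is not None and stored_cls == art_class.get(wanted):
                score += 1
    return score


def best_unit(candidates: List[StoredUnit], left: str, right: str,
              art_class: Dict[str, str]) -> Optional[StoredUnit]:
    """Return the candidate whose recorded context best fits the target."""
    if not candidates:
        return None
    return max(candidates,
               key=lambda u: context_score(u, left, right, art_class))


# Placeholder class table; the real 5 consonant and 4 vowel categories
# are not given in the abstract.
ART_CLASS = {"p": "C1", "m": "C1", "t": "C2", "n": "C2",
             "i": "V1", "e": "V1", "a": "V2"}

CANDIDATES = [
    StoredUnit("k-a", left="t", right="i"),
    StoredUnit("k-a", left="m", right="a"),
    StoredUnit("k-a", left="p", right="e"),
]

chosen = best_unit(CANDIDATES, left="p", right="i", art_class=ART_CLASS)
```

Under a scheme of this kind, candidates that share both the unit identity and the articulatory classes of their neighbors are interchangeable, which is one plausible way such duplicates could be pruned from a synthesis DB.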