Ry pyconjp2015 karaoke

PyCon JP 2015
Renyuan Lyu
呂仁園
Chun-Han Lai
賴俊翰
Karaoke-style Read-aloud System
Chang Gung Univ.
Taiwan
Saturday, October 10, 2 p.m.–2:30 p.m. in 会議室1/Conference Room 1

CguTextKaraoke
a Karaoke-style Read-aloud System
Using Speech Alignment and Text-to-Speech Technology
Chun-Han Lai (賴俊翰)
Renyuan Lyu (呂仁園)
Chang Gung University (長庚大學)
Taiwan (台灣)

Abstract
• A procedure to create a speech-to-text synchronization file
from an original text-only file
– The file can be used to show highlighted text, just like a karaoke
machine.
– This is very useful for language learning.
• Cloud-based TTS (text-to-speech) technology, such as
Google TTS, produces the speech.
• Speech-recognition technology, such as HTK, provides the
temporal alignment.

Introduction
• Starting from a text-only file, we use a cloud-based text-to-speech
(TTS) service, such as Google Translate/TTS, together with a
speech-recognition toolkit, such as the Hidden Markov Model Toolkit (HTK),
to generate a timed-text file that aligns the text with the speech
waveform along the time axis.
• Python serves both as the glue that links the different software
resources, such as Google Translate and HTK, and as a powerful tool
for all the text-processing tasks in this project.
• From the timed-text file, a JavaScript-based web app and a Python
GUI program display the time-aligned, highlighted text at the word
level, like a karaoke machine, which is very useful for
language learning.

a Karaoke-style Text Read-aloud System
• Karaoke (カラオケ) is a form of interactive
entertainment in which an amateur singer sings
along with recorded music.
• Lyrics are usually displayed on a video screen, along
with a moving symbol, changing color, or music
video images, to guide the singer.
– https://en.wikipedia.org/wiki/Karaoke
• Here is an example, one of my favorites:
– https://www.youtube-nocookie.com/embed/9a5KoXNCagM?start=180

Speech Shadowing Technique
for Language Learning
• The motivation of this project
– Speech shadowing is a language-learning technique in which
subjects repeat speech immediately after hearing it.
» https://en.wikipedia.org/wiki/Speech_shadowing
– A demonstration can be viewed at the following YouTube
link.
• “English Speaking Practice: How to improve your
English Speaking and Fluency: SHADOWING”
• https://www.youtube.com/watch?v=GVWFGIyNswI

Text-to-Speech Synthesis
Given: a piece of text, e.g.,
Wikipedia is a multilingual, web-based, free-content encyclopedia project supported
by the Wikimedia Foundation and based on a model of openly editable content. The
name "Wikipedia" is a portmanteau of the words wiki (a technology for creating
collaborative websites, from the Hawaiian word wiki, meaning "quick") and
encyclopedia. Wikipedia's articles provide links designed to guide the user to related
pages with additional information.
The goal is to obtain its speech.

Google TTS API
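in a Python module
• pip install gTTS
– https://github.com/pndurette/gTTS
from gtts import gTTS
aText = 'Wikipedia is a multilingual, ...'   # the text to be read aloud
aLang = 'en'                                 # language code of the text
tts = gTTS(text=aText, lang=aLang)           # request the synthesized speech
tts.save("aSpeech.mp3")                      # write the speech to an mp3 file
aText → aSpeech.mp3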

FFmpeg
• About FFmpeg
– https://en.wikipedia.org/wiki/FFmpeg
– FFmpeg is a free software project that
produces libraries and programs for
handling multimedia data.
– It is one of the leading multimedia frameworks,
able to do many DSP tasks, including ...
• decode, encode,
• transcode, mux, demux, stream, filter, and play

ffmpeg -i aSpeech.mp3 -y -vn -acodec pcm_s16le -ac 1 -ar 16000 -f wav aSpeech.wav
aSpeech.mp3 → aSpeech.wav
• -acodec pcm_s16le: PCM, 16 bits/sample, little endian
• -ac 1: 1 (mono) channel
• -ar 16000: 16000 samples/sec
Verifying by seeing and hearing:
ffplay aSpeech.wav
Or use an interactive audio tool, like Audacity.
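Since Python is the glue in this project, the same conversion could be driven from a script. The following is only a sketch: the helper names `build_ffmpeg_cmd` and `mp3_to_wav` are made up for illustration, and it assumes the `ffmpeg` executable is on the PATH.

```python
import subprocess

def build_ffmpeg_cmd(src, dst, rate=16000):
    """Return the ffmpeg argument list for mp3 -> 16-bit mono PCM WAV.

    Hypothetical helper; the flags are the ones shown on the slide.
    """
    return [
        "ffmpeg",
        "-i", src,               # input file
        "-y",                    # overwrite the output without asking
        "-vn",                   # ignore any video stream
        "-acodec", "pcm_s16le",  # 16-bit little-endian PCM
        "-ac", "1",              # one (mono) channel
        "-ar", str(rate),        # samples per second
        "-f", "wav",             # WAV container
        dst,
    ]

def mp3_to_wav(src, dst, rate=16000):
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_cmd(src, dst, rate), check=True)

cmd = build_ffmpeg_cmd("aSpeech.mp3", "aSpeech.wav")
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.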

Audacity (audio editor)
• Audacity is a powerful, free, open-source digital audio editor
– Its features include:
• Recording and playing back sounds
• Importing and exporting WAV, MP3, ...
• Viewing and editing via cut, copy, and paste, ...
[Figure: aSpeech.mp3 and aSpeech.wav]

Text-to-Speech Alignment
Given: a piece of text and its speech, e.g.,
Wikipedia is a multilingual, web-based, free-content encyclopedia project
supported by the Wikimedia Foundation and based on a model of openly editable
content. The name "Wikipedia" is a portmanteau of the words wiki (a technology for
creating collaborative websites, from the Hawaiian word wiki, meaning "quick") and
encyclopedia. Wikipedia's articles provide links designed to guide the user to related
pages with additional information.
The goal is to obtain a ‘Timed-Text’
(start time and end time in seconds, then the word; sil = silence)
0.000 0.080 sil
0.080 0.870 wikipedia
0.870 0.990 is
0.990 1.080 a
1.080 2.010 multilingual
2.010 2.140 sil
2.160 2.240 sil
2.240 3.020 webbased
3.020 3.180 sil
3.204 3.354 sil
3.354 4.284 freecontent
4.284 5.374 encyclopedia
5.374 5.774 project
5.774 6.454 supported
6.454 6.754 by
6.754 6.904 the
6.904 7.574 wikimedia
7.574 8.414 foundation
8.414 8.514 sil
8.532 8.622 sil
8.622 8.852 and
8.852 9.242 based
9.242 9.382 on
9.382 9.432 a
9.432 9.982 model
9.982 10.032 of
10.032 10.592 openly
10.592 11.212 editable
11.212 11.802 content
11.802 11.932 sil
:
:
:

Wav splitting
At the sentence level, splitting is straightforward:
the time information is extracted from the TTS mp3 files,
which are received sentence by sentence.
Sentence boundaries
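One way to turn per-sentence clip lengths into sentence boundaries is a running sum of the durations. The sketch below is an assumption about how this could be done, not the authors' code: it reads durations from the converted WAV files with the standard `wave` module (mp3 files would need another reader), and the function names are hypothetical.

```python
import wave

def wav_duration(path):
    """Length of a WAV file in seconds (number of frames / sample rate)."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def sentence_offsets(durations):
    """Start time of each sentence when the clips are played back to back."""
    offsets, t = [], 0.0
    for d in durations:
        offsets.append(t)   # this sentence starts where the previous one ended
        t += d
    return offsets

# Made-up durations, in seconds, for three sentences.
print(sentence_offsets([2.0, 1.5, 3.0]))   # [0.0, 2.0, 3.5]
```

Each offset is the boundary at which the next sentence begins in the concatenated speech.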

Phonetic Transcription
• Speech recognition technology needs to transcribe text into
phonetic symbols, in order to build up phone models.
Original English text (ASCII only, perhaps!):
“Wikipedia is a multilingual, web-based, free-content encyclopedia project.”
Transcription in IPA (needs Unicode):
“wikipedia ɪz ə məltilɪŋwəl, wɛb- best, fri- kɑntɛnt ənsɑjkləpidiə prɑdʒɛkt.”
Transcription in SAMPA (ASCII only, including non-alphabet symbols):
“wikipedia Iz @ m@ltilINw@l, wEb- best, fri- kAntEnt @nsAykl@pidi@ prAdZEkt.”
http://upodn.com/phon.asp
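The step from the IPA line to the ASCII line can be illustrated with a character table. This is a sketch only: the table below is inferred from the single example on this slide and is not a complete SAMPA mapping, and `ipa_to_ascii` is a hypothetical name.

```python
# Character mapping inferred ONLY from the slide's example sentence;
# it is not a full or authoritative SAMPA table.
IPA_TO_ASCII = str.maketrans({
    "ɪ": "I",
    "ə": "@",
    "ŋ": "N",
    "ɛ": "E",
    "ɑ": "A",
    "ʒ": "Z",
    "j": "y",
})

def ipa_to_ascii(ipa):
    """Replace each listed IPA character with its ASCII counterpart."""
    return ipa.translate(IPA_TO_ASCII)

print(ipa_to_ascii("ɪz ə məltilɪŋwəl"))   # Iz @ m@ltilINw@l
```

An ASCII-only transcription matters here because the pronunciation symbols passed to HTK must be ASCII.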

• Post-processing of the phonetic transcription
• Map or simply remove all undesired symbols from the various
styles of output
– (usually Unicode characters or non-alphabetic symbols)
• For plain English (en),
– The original text itself is used as an approximation of the phone sequence.
– Although this seems too simple, it has worked well so far.
• For Traditional Chinese (zh-tw),
– Google Translate is used to get phonetic symbols in Pinyin (拼音,
pīnyīn), which are then reduced to plain Latin letters (removing the tone marks)
• For Japanese (ja),
– MeCab has recently been used to get the katakana (片仮名, カタカナ).
– Romkan is used to convert the katakana to romaji (Kunrei style)
• Thanks to Python, which does most of the work
at this stage of processing!
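Removing the tone marks from Pinyin can be done with Unicode normalization from the standard library. This is a minimal sketch under assumptions: the input string below is a hypothetical Pinyin rendering, and the function names are not from the project.

```python
import unicodedata

def strip_tone_marks(pinyin):
    """Decompose accented letters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", pinyin)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def pinyin_to_phn(pinyin):
    """Lower-case, strip tone marks, and join the syllable groups with '_'."""
    return "_".join(strip_tone_marks(pinyin).lower().split())

# Hypothetical input: Pinyin with tone marks, as a service might return it.
print(pinyin_to_phn("Wéijī bǎikē shì yīgè zìyóu nèiróng"))
# weiji_baike_shi_yige_ziyou_neirong
```

Note that this also turns ü into u, since the diaeresis is a combining mark as well; a real system may want to map ü to v or yu first.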

• Phonetic transcription for English
– Using regular expression module
enText = '''Wikipedia is a multilingual, web-based,
free-content encyclopedia project.'''
phn = text2phn_en(enText)
# phn == 'wikipedia_is_a_multilingual_webbased_freecontent_encyclopedia_project'

# Punctuation is removed with a regular expression:
import re
pats = r"'|\"|-|^_|_$|,|\.|\(|\)"   # '.', '(' and ')' are escaped to match literally
phn = re.sub(pats, '', phn)
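The slide shows only the substitution step, so the body of `text2phn_en` is not known. A minimal sketch that reproduces the slide's input and output, assuming the function lower-cases the text and joins words with underscores before removing punctuation:

```python
import re

# Same alternatives as on the slide, with the regex metacharacters escaped.
PUNCT = r"'|\"|-|^_|_$|,|\.|\(|\)"

def text2phn_en(text):
    """Approximate an English phone sequence by the cleaned-up text itself."""
    phn = re.sub(r"\s+", "_", text.strip().lower())   # words joined by '_'
    return re.sub(PUNCT, "", phn)                     # drop punctuation

enText = '''Wikipedia is a multilingual, web-based,
free-content encyclopedia project.'''
print(text2phn_en(enText))
# wikipedia_is_a_multilingual_webbased_freecontent_encyclopedia_project
```

The unescaped pattern would not compile as intended: a bare `.` matches every character, and bare parentheses open and close a group.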

• Phonetic transcription for Traditional Chinese
– Using the Google Translate/TTS API
phn = text2phn_tc(tcText)
tcText = '維基百科是一個自由內容'   # "Wikipedia is a free-content ..."
phn = 'weiji_baike_shi_yige_ziyou_neirong'
import urllib.request
GOOGLE_TTS_URL = 'https://translate.google.com.tw/translate_a/single?dt=bd&dt=ex&dt=at&'
req = urllib.request.Request(GOOGLE_TTS_URL + data)   # data: the rest of the query string (not shown on the slide)

• Phonetic transcription for Japanese
– Using MeCab and Romkan
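jpText = '''ウィキペディアは、
信頼されるフリーなオンライン百科事典、'''   # "Wikipedia is a trusted, free online encyclopedia, ..."
phn = text2phn_jp(jpText)
# phn == 'wikipedyia_wa_sil_sinrai_sa_reru_furi-_na_onrain_hyakka_ziten'
import MeCab
import romkan
y = MeCab.Tagger().parse(text)   # morphological analysis, including the katakana reading
...
kun = romkan.to_kunrei(phn)      # katakana to Kunrei-style romaji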

At the Halfway Point
• a bundle of wav/lab files

Speech recognition technology
• Hidden Markov Model Toolkit (HTK)
– http://htk.eng.cam.ac.uk/
– Given a speech utterance and its phone
sequence, the speech can be accurately aligned with
the phones by the ‘forced alignment’ technique of the
HMM approach.
– HTK, a set of HMM tools, provides a
convenient way to apply the HMM approach.
HTK processing (abstract) ....
• #[00] setting the working dir
• #[01] creating the (hmm) model prototype
• #[02] label processing
• #[03] feature extraction
• #[04] model initialization
• #[05] model training
• #[06] forced alignment
• #[07] post file moving operation

HTK processing (detail)....
#[00] setting the working dir
dirName= ./_wav/
#[01] creating the (hmm) model prototype
CreateHProto....
myHmmPro
N = 3 M = 6
#[02] label processing
000, 0,----> ._htkhled -A -i spLab00.mlf -n spLab00.lst -S spLab.scp hL
001, 0,----> ._htkhled -A -i spLab.mlf -n spLab.lst -S spLab.scp hLed.l
002, 0,----> ._htkhled -A -i spLab_p.mlf -n spLab_p.lst -S spLab.scp -I
#[03] feature extraction
003, 0,----> ._htkHCopy -A -C hCopy.conf -S spWav2Mfc.scp 1>> 1.htk.out 2>>
#[04] model initialization
004, 1,----> mkdir hmms_p
005, 0,----> ._htkHCompV -A -m -C hInit.conf -S spMfc.scp -I spLab_p.mlf -M
#[05] model training
006, 0,----> ._htkHERest -A -C hErest.conf -S spMfc.scp -p 1 -t 2000.0 -w 3
007, 0,----> ._htkHERest -A -C hErest.conf -p 0 -t 2000.0 -w 3 -v 0.05 -I sp
: (repeating several times...)
:
#[06] forced alignment
016, 0,----> ._htkHVite -A -a -C hVite.conf -S spMfc.scp -d hmms_p/ -i s
#[07] post file moving operation
017, 1,----> mkdir outDir
018, 1,----> copy spLab_aligned.mlf outDir./_wav_aligned.mlf
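A log of this shape could come from a small Python runner that executes each external command in turn. The sketch below is a guess at such a runner, not the project's code: it assumes the first two columns are a step index and the command's exit status, and it uses harmless placeholder commands instead of the HTK calls, whose full argument lists are truncated above.

```python
import subprocess
import sys

def run_steps(commands):
    """Run each command in order; return (index, exit_status, command) rows."""
    rows = []
    for i, cmd in enumerate(commands):
        status = subprocess.run(cmd).returncode
        print(f"{i:03d}, {status},----> {' '.join(cmd)}")
        rows.append((i, status, cmd))
    return rows

# Placeholder commands stand in for the HTK tools (HLed, HCopy, ...).
demo = [
    [sys.executable, "-c", "pass"],                 # exits with status 0
    [sys.executable, "-c", "raise SystemExit(1)"],  # exits with status 1
]
rows = run_steps(demo)
```

Recording the exit status of every step makes it easy to see where a long pipeline stopped working.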

HLed (label processing)
[Diagram: three HLed runs, all reading spLab.scp]
• HLed with hLed00.led: spLab00.mlf, spLab00.lst
• HLed with hLed.led: spLab.mlf, spLab.lst
• HLed with hLed.led and spLab_p.dic: spLab_p.mlf, spLab_p.lst

HCopy (feature extraction)
[Diagram: HCopy converts *.wav to *.mfc, using hCopy.conf and spWav2Mfc.scp]

HCompV
[Diagram: HCompV takes *.mfc, HCompV.conf, spMfc.scp, spLab_p.mlf, and myHmmPro, and writes hmms_p/*]

HERest
[Diagram: HERest re-estimates hmms_p/* from *.mfc, using hErest.conf, spMfc.scp, spLab_p.mlf, spLab_p.lst, and hmms_p/HER1.acc; repeated for N iterations, N=5]

HVite
[Diagram: HVite produces spLab_aligned.mlf from *.mfc and hmms_p/, using hVite.conf, spMfc.scp, spLab_p.lst, spLab.mlf, and spLab_p.dic]

HTK summary
HTK Tools: HLed → HCopy → HCompV → HERest → HVite
#!MLF!#
"./_wav/SN0.rec"
0 800000 sil -578.044434
800000 8700000 wikipedia -5636.368652
8700000 9900000 is -855.988770
9900000 10800000 a -693.554871
10800000 20100000 multilingual -7268.197266
20100000 21400000 sil -791.746216
.
"./_wav/SN1.rec"
0 800000 sil -541.083069
800000 8600000 webbased -5977.622070
8600000 10200000 sil -1048.225220
.
"./_wav/SN2.rec"
0 1500000 sil -1100.892822
1500000 10800000 freecontent -7094.197266
10800000 21700000 encyclopedia -8148.633789
21700000 25700000 project -3247.493896
25700000 32500000 supported -5594.979492
32500000 35500000 by -2412.487305
35500000 37000000 the -1176.310547
37000000 43700000 wikimedia -5128.852051
43700000 52100000 foundation -5995.618164
52100000 53100000 sil -695.872864
.
.
.
[Figure labels: wavDir/ → spLab_aligned.mlf]
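The aligned MLF gives times per sentence in HTK's 100 ns units, while the timed-text file uses seconds on one common time axis. A sketch of that conversion follows; the function names are hypothetical, and the sentence offset 2.160 is read off the timed-text listing earlier in the deck rather than computed here.

```python
def parse_mlf(text):
    """Return one list of (start, end, label) per utterance; times in 100 ns units."""
    utterances = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "#!MLF!#" or line == ".":
            continue                          # header, terminator, or blank line
        if line.startswith('"'):              # new utterance, e.g. "./_wav/SN0.rec"
            utterances.append([])
            continue
        start, end, label = line.split()[:3]  # the 4th field is the log score
        utterances[-1].append((int(start), int(end), label))
    return utterances

def to_timed_text(utterances, offsets):
    """Convert to seconds and shift each utterance by its sentence offset."""
    rows = []
    for segs, off in zip(utterances, offsets):
        for start, end, label in segs:
            rows.append((round(off + start / 1e7, 3),
                         round(off + end / 1e7, 3), label))
    return rows

mlf = '''#!MLF!#
"./_wav/SN0.rec"
0 800000 sil -578.044434
800000 8700000 wikipedia -5636.368652
.
"./_wav/SN1.rec"
0 800000 sil -541.083069
800000 8600000 webbased -5977.622070
.
'''
rows = to_timed_text(parse_mlf(mlf), [0.0, 2.160])
for r in rows:
    print("%.3f %.3f %s" % r)
```

With these inputs the second sentence starts at 2.160 s, so its first word runs from 2.240 to 3.020, matching the timed-text listing.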

The major algorithm in HTK
‘Holiday Shopping’ = ‘h’+‘o’+‘l’+‘i’+‘d’+‘ay’+‘sil’+‘sh’+‘o’+‘p’+‘I’+‘ng’
• Forced Alignment in HTK
– 1. Given a speech signal
– 2. Do the pronunciation transcription
• Pronunciation symbols must be ASCII-only!
– 3. Train to get the HMM models
[Figure: HMMs for ‘h’, ‘o’, ..., ‘ng’]
– 4. Run the Viterbi search for the optimal path (the alignment)
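The Viterbi search in step 4 can be sketched as dynamic programming over frames and transcript units. This is a deliberately simplified illustration, not what HVite does internally: each unit is treated as a single state, there are no transition probabilities, and the score matrix is made up.

```python
def forced_align(scores):
    """scores[t][k] is the log-likelihood of frame t under the k-th unit of
    the transcript. Returns the unit index assigned to each frame, where the
    path starts in the first unit, ends in the last, and never goes back."""
    T, N = len(scores), len(scores[0])
    if T < N:
        raise ValueError("need at least one frame per unit")
    NEG = float("-inf")
    best = [[NEG] * N for _ in range(T)]   # best path score ending at (t, k)
    back = [[0] * N for _ in range(T)]     # unit occupied at frame t - 1
    best[0][0] = scores[0][0]
    for t in range(1, T):
        for k in range(N):
            stay = best[t - 1][k]
            move = best[t - 1][k - 1] if k > 0 else NEG
            if move > stay:
                best[t][k], back[t][k] = move + scores[t][k], k - 1
            else:
                best[t][k], back[t][k] = stay + scores[t][k], k
    path = [N - 1]                         # trace back from the last unit
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

def segments(path, labels):
    """Collapse a frame-level path into (start_frame, end_frame, label) runs."""
    segs, start = [], 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            segs.append((start, t, labels[path[start]]))
            start = t
    return segs

# Toy scores: 6 frames, 3 units; each unit fits two consecutive frames best.
scores = [[-1, -5, -5], [-1, -5, -5], [-5, -1, -5],
          [-5, -1, -5], [-5, -5, -1], [-5, -5, -1]]
path = forced_align(scores)
print(segments(path, ["h", "o", "ng"]))   # [(0, 2, 'h'), (2, 4, 'o'), (4, 6, 'ng')]
```

Multiplying the frame indices by the frame period would give start and end times like those in the aligned label file.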

#!MLF!#
"wavDir/SN0001.rec"
0 800000 sil -567.865356
800000 8700000 wikipedia -5670.471680
8700000 10000000 is -951.059692
10000000 10600000 a -489.843994
10600000 20000000 multilingual -7398.754395
20000000 20700000 sil -416.119415
.
"wavDir/SN0002.rec"
0 900000 sil -632.964050
900000 8600000 webbased -6000.767578
8600000 9900000 sil -914.236206
.
"wavDir/SN0003.rec"
0 2100000 sil -1373.137817
2100000 9000000 freecontent -5306.260742
9000000 18500000 encyclopedia -6654.958984
18500000 25600000 project -5698.730469
25600000 32700000 supported -5713.494141
32700000 33200000 by -429.306763
33200000 34800000 the -1205.477539
34800000 41500000 wikimedia -5115.318359
41500000 50000000 foundation -6074.208496
50000000 52000000 and -1746.236938
52000000 56200000 based -3267.695801
56200000 57000000 on -585.264404
57000000 57700000 a -577.346130
57700000 63200000 model -3769.413574
63200000 63800000 of -524.015503
63800000 65300000 sil -1129.348633
.
wavDir.align

Now it’s time to karaoke!

A Browser in JavaScript and HTML
for Text-KaraOke
• https://youtu.be/11-ltx0yv_o

A Browser in Python using Tkinter
for Text-KaraOke
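At the heart of either browser is a lookup from the current playback time to the word that should be highlighted. A minimal sketch of that lookup with a binary search over the start times follows; the function name is hypothetical and the GUI code is left out.

```python
from bisect import bisect_right

# First rows of the timed-text file: (start, end, word), times in seconds.
timed_text = [
    (0.000, 0.080, "sil"),
    (0.080, 0.870, "wikipedia"),
    (0.870, 0.990, "is"),
    (0.990, 1.080, "a"),
    (1.080, 2.010, "multilingual"),
]
starts = [row[0] for row in timed_text]

def index_at(t):
    """Index of the entry being spoken at playback time t, or None."""
    i = bisect_right(starts, t) - 1        # last entry that starts at or before t
    if i >= 0 and t < timed_text[i][1]:
        return i
    return None

print(timed_text[index_at(0.5)][2])        # wikipedia
```

A GUI would call this on a timer with the player's current position and move the highlight whenever the returned index changes.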

Conclusion & Future Work
• Make the process more automatic.
• Make the user interface friendlier.
• Make the program more robust.
• Your help in improving the system is welcome.
• Thank you for listening!
Thank you for Listening.
ご聴取有り難う御座いました。
感謝您的收聽。
