PyCon JP 2015
Renyuan Lyu
呂仁園
Chun-Han Lai
賴俊翰
Karaoke-style Read-aloud System
Chang Gung Univ.
Taiwan
Oct/10/ Saturday 2 p.m.–2:30 p.m. in 会議室1/Conference Room 1
CguTextKaraoke
a Karaoke-style Read-aloud System
Using Speech Alignment and Text-to-Speech Technology
Chun-Han Lai (賴俊翰)
Renyuan Lyu (呂仁園)
Chang Gung University (長庚大學)
Taiwan (台灣)
Abstract
• A procedure for creating a speech-to-text
synchronization file from an original text-only file
– can be used to show highlighted text, just like a karaoke
machine
– very useful for language learning.
• Cloud-based TTS (text-to-speech) technology, such as
Google TTS
• Speech-recognition technology, such as HTK, for
temporal alignment
Introduction
• Starting from a text-only file, we use a cloud-based text-to-speech
(TTS) service, such as Google Translate/TTS, together with a
speech-recognition toolkit, the Hidden Markov Model Toolkit (HTK),
to generate a timed-text file that aligns the text with the speech
waveform along the time axis.
• Python serves both as the glue that links the different software
resources, such as Google Translate and HTK, and as a powerful
tool for all the text-processing tasks in this project.
• From the timed-text file, a JavaScript-based web app and a Python
GUI program display time-aligned, highlighted text at the word
level, like a karaoke machine, which we consider very useful for
language learning.
a Karaoke-style Text Read-aloud System
• Karaoke (カラオケ) is a form of interactive
entertainment in which an amateur singer sings
along with recorded music.
• Lyrics are usually displayed on a video screen, along
with a moving symbol, changing color, or music
video images, to guide the singer.
– https://en.wikipedia.org/wiki/Karaoke
• Here is an example, one of my favorites:
– https://www.youtube-nocookie.com/embed/9a5KoXNCagM?start=180
Speech Shadowing Technique
for Language Learning
• The motivation of this project
– Speech shadowing is a language-learning technique in which
subjects repeat speech immediately after hearing it.
» https://en.wikipedia.org/wiki/Speech_shadowing
– A demonstration can be viewed on YouTube:
• “English Speaking Practice: How to improve your
English Speaking and Fluency: SHADOWING”
• https://www.youtube.com/watch?v=GVWFGIyNswI
Text-to-Speech Synthesis
Given: a piece of text, e.g.,
Wikipedia is a multilingual, web-based, free-content encyclopedia project supported
by the Wikimedia Foundation and based on a model of openly editable content. The
name "Wikipedia" is a portmanteau of the words wiki (a technology for creating
collaborative websites, from the Hawaiian word wiki, meaning "quick") and
encyclopedia. Wikipedia's articles provide links designed to guide the user to related
pages with additional information.
The goal is to obtain its speech.
Google TTS API
in a Python module
• pip install gTTS
• https://github.com/pndurette/gTTS
from gtts import gTTS
aText = 'Wikipedia is a multilingual, ...'
aLang = 'en'
tts = gTTS(text=aText, lang=aLang)
tts.save("aSpeech.mp3")
aText → aSpeech.mp3
FFmpeg
• About FFmpeg
– https://en.wikipedia.org/wiki/FFmpeg
– FFmpeg is a free software project that
produces libraries and programs for
handling multimedia data.
– It is one of the leading multimedia frameworks,
able to handle many DSP tasks, including
• decode, encode,
• transcode, mux, demux, stream, filter and play
ffmpeg -i aSpeech.mp3 -y -vn -acodec pcm_s16le -ac 1 -ar 16000 -f wav aSpeech.wav
aSpeech.mp3 → aSpeech.wav
– PCM, 16 bits/sample, little-endian
– 1 (mono) channel
– 16000 samples/sec
Verifying by seeing and hearing:
ffplay aSpeech.wav
Or use an interactive audio tool, such as Audacity.
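The conversion command can also be driven from Python, in keeping with Python's role as the glue in this project. The following is a minimal sketch, not the authors' script: the function names build_ffmpeg_cmd and mp3_to_wav are hypothetical, and it assumes the ffmpeg executable is on the PATH.

```python
import subprocess


def build_ffmpeg_cmd(src, dst, rate=16000):
    """Build the mp3-to-wav command from the slide as an argument list.

    Hypothetical helper; the flags are the ones shown on the slide.
    """
    return [
        "ffmpeg", "-i", src,
        "-y",                    # overwrite the output file if it exists
        "-vn",                   # ignore any video stream
        "-acodec", "pcm_s16le",  # 16-bit little-endian PCM
        "-ac", "1",              # mono
        "-ar", str(rate),        # sampling rate in Hz
        "-f", "wav",
        dst,
    ]


def mp3_to_wav(src, dst, rate=16000):
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_cmd(src, dst, rate), check=True)
```

Calling mp3_to_wav("aSpeech.mp3", "aSpeech.wav") would then produce the 16 kHz mono file that the later HTK steps expect. Passing the arguments as a list avoids shell quoting problems with file names.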
Audacity (audio editor)
• Audacity is a powerful, free, open-source digital audio editor
– Its features include:
• Recording and playing back sounds
• Importing and exporting WAV, MP3, ...
• Viewing and editing via cut, copy, and paste, ...
Text-to-Speech Alignment
Given: a piece of text and its speech, e.g.,
Wikipedia is a multilingual, web-based, free-content encyclopedia project
supported by the Wikimedia Foundation and based on a model of openly editable
content. The name "Wikipedia" is a portmanteau of the words wiki (a technology for
creating collaborative websites, from the Hawaiian word wiki, meaning "quick") and
encyclopedia. Wikipedia's articles provide links designed to guide the user to related
pages with additional information.
The goal is to obtain a ‘Timed-Text’:
start (s)  end (s)  label
0.000 0.080 sil
0.080 0.870 wikipedia
0.870 0.990 is
0.990 1.080 a
1.080 2.010 multilingual
2.010 2.140 sil
2.160 2.240 sil
2.240 3.020 webbased
3.020 3.180 sil
3.204 3.354 sil
3.354 4.284 freecontent
4.284 5.374 encyclopedia
5.374 5.774 project
5.774 6.454 supported
6.454 6.754 by
6.754 6.904 the
6.904 7.574 wikimedia
7.574 8.414 foundation
8.414 8.514 sil
8.532 8.622 sil
8.622 8.852 and
8.852 9.242 based
9.242 9.382 on
9.382 9.432 a
9.432 9.982 model
9.982 10.032 of
10.032 10.592 openly
10.592 11.212 editable
11.212 11.802 content
11.802 11.932 sil
:
Wav splitting
At the sentence level, this is straightforward: the time
information is taken from the TTS mp3 files,
which are received sentence by sentence.
Sentence boundaries
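One way to obtain the sentence boundaries is to accumulate the durations of the per-sentence audio clips. The sketch below is an illustration under assumptions, not the authors' code: it reads durations from WAV files (after conversion) with the standard wave module rather than from the mp3 files directly, and the names wav_duration and sentence_offsets are hypothetical.

```python
import wave
from itertools import accumulate


def wav_duration(path):
    """Duration of a WAV file in seconds (frames / sampling rate)."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()


def sentence_offsets(durations):
    """Start time of each sentence when the clips are played back to back."""
    if not durations:
        return []
    # Running total of durations, shifted by one so the first clip starts at 0.
    return [0.0] + list(accumulate(durations))[:-1]
```

For three clips lasting 2.0, 1.5 and 3.0 seconds, sentence_offsets returns the start times 0.0, 2.0 and 3.5. These offsets are what turn per-sentence alignment times into times on the full recording.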
Phonetic Transcription
• Speech recognition technology needs the text transcribed into
phonetic symbols in order to build phone models.
Original English text (ASCII only, perhaps!):
“Wikipedia is a multilingual, web-based, free-content encyclopedia project.”
Transcription in IPA (needs Unicode):
“wikipedia ɪz ə məltilɪŋwəl, wɛb- best, fri- kɑntɛnt ənsɑjkləpidiə prɑdʒɛkt.”
Transcription in SAMPA (ASCII only, including non-alphabetic symbols):
“wikipedia Iz @ m@ltilINw@l, wEb- best, fri- kAntEnt @nsAykl@pidi@ prAdZEkt.”
http://upodn.com/phon.asp
• Post-processing of the phonetic transcription
• Map or simply remove all undesired symbols from the various
output styles
– (usually Unicode or other non-alphabetic symbols)
• For plain English (en),
– The original text is used as an approximation of the phone sequence.
– Although this seems too simple, it has worked well so far.
• For Traditional Chinese (zh-tw),
– Google Translate is used to get phonetic symbols in Pinyin (拼音,
pīnyīn), which are then reduced to plain romanized text (tone marks removed)
• For Japanese (ja),
– MeCab has recently been used to get the katakana (片仮名, カタカナ) reading.
– Romkan is used to convert katakana to romaji (Kunrei style)
• Python does most of the work
at this stage of processing!
• Phonetic transcription for English
– Using the regular expression module
import re
enText = '''Wikipedia is a multilingual, web-based,
free-content encyclopedia project.'''
phn = text2phn_en(enText)
# phn == 'wikipedia_is_a_multilingual_webbased_freecontent_encyclopedia_project'
# Inside text2phn_en: strip quotes, hyphens, edge underscores, commas, periods, parentheses
pats = r"'|\"|-|^_|_$|,|\.|\(|\)"
phn = re.sub(pats, '', phn)
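The slide shows only a fragment of text2phn_en. A complete version might look like the sketch below; this is a guess at the behavior from the input and output shown on the slide, not the authors' implementation, and the exact order of the steps is an assumption.

```python
import re


def text2phn_en(text):
    """Approximate an English phone sequence by the cleaned-up spelling.

    Sketch only: lower-case the text, drop punctuation, and join the
    words with underscores, as the slide's example output suggests.
    """
    phn = text.lower()
    # Remove quotes, hyphens, commas, periods and parentheses.
    phn = re.sub(r"['\"\-,.()]", "", phn)
    # Collapse any run of whitespace (including newlines) into one underscore.
    phn = re.sub(r"\s+", "_", phn.strip())
    return phn


enText = """Wikipedia is a multilingual, web-based,
free-content encyclopedia project."""
phn = text2phn_en(enText)
```

With the sample sentence, phn reproduces the string shown on the slide. Treating each word as a single "phone" keeps the label set ASCII-only, which HTK requires.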
• Phonetic transcription for Traditional Chinese
– Using the Google Translate/TTS API
import urllib.request
tcText = '維基百科是一個自由內容'   # "Wikipedia is a free content ..."
phn = text2phn_tc(tcText)
# phn == 'weiji_baike_shi_yige_ziyou_neirong'
# Inside text2phn_tc (data is the encoded query string, not shown on the slide):
GOOGLE_TTS_URL = 'https://translate.google.com.tw/translate_a/single?dt=bd&dt=ex&dt=at&'
req = urllib.request.Request(GOOGLE_TTS_URL + data)
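Once the Pinyin string has been obtained, the tone marks have to be removed to leave plain ASCII. One common way to do this, shown here as an assumption rather than as the authors' method, is Unicode decomposition followed by dropping the combining marks. The name strip_tone_marks is hypothetical.

```python
import unicodedata


def strip_tone_marks(pinyin):
    """Remove diacritics from a Pinyin string, e.g. 'pīnyīn' to 'pinyin'."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", pinyin)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Note that this also turns "ü" into "u", which merges a few syllables; whether that matters depends on how distinct the phone labels need to be.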
• Phonetic transcription for Japanese
– Using MeCab and Romkan
import MeCab
import romkan
jpText = '''ウィキペディアは、
信頼されるフリーなオンライン百科事典、'''
# "Wikipedia is a trusted, free online encyclopedia,"
phn = text2phn_jp(jpText)
# phn == 'wikipedyia_wa_sil_sinrai_sa_reru_furi-_na_onrain_hyakka_ziten'
# Inside text2phn_jp:
y = MeCab.Tagger().parse(text)
...
kun = romkan.to_kunrei(phn)
At the halfway point
• a bundle of paired wav/lab files
Speech recognition technology
• Hidden Markov Model Toolkit (HTK)
– http://htk.eng.cam.ac.uk/
– Given a speech utterance and its phone
sequence, the speech can be aligned with the
phones by the ‘forced alignment’ technique of the
HMM approach.
– HTK, a set of HMM tools, provides a
convenient way to apply the HMM approach.
• The HTK overview
HTK processing (abstract)
• #[00] setting the working dir
• #[01] creating the (hmm) model prototype
• #[02] label processing
• #[03] feature extraction
• #[04] model initialization
• #[05] model training
• #[06] forced alignment
• #[07] post file moving operation
HTK processing (detail)
#[00] setting the working dir
dirName= ./_wav/
#[01] creating the (hmm) model prototype
CreateHProto....
myHmmPro
N = 3 M = 6
#[02] label processing
000, 0,----> ._htkhled -A -i spLab00.mlf -n spLab00.lst -S spLab.scp hL
001, 0,----> ._htkhled -A -i spLab.mlf -n spLab.lst -S spLab.scp hLed.l
002, 0,----> ._htkhled -A -i spLab_p.mlf -n spLab_p.lst -S spLab.scp -I
#[03] feature extraction
003, 0,----> ._htkHCopy -A -C hCopy.conf -S spWav2Mfc.scp 1>> 1.htk.out 2>>
#[04] model initialization
004, 1,----> mkdir hmms_p
005, 0,----> ._htkHCompV -A -m -C hInit.conf -S spMfc.scp -I spLab_p.mlf -M
#[05] model training
006, 0,----> ._htkHERest -A -C hErest.conf -S spMfc.scp -p 1 -t 2000.0 -w 3
007, 0,----> ._htkHERest -A -C hErest.conf -p 0 -t 2000.0 -w 3 -v 0.05 -I sp
: (repeating several times...)
:
#[06] forced alignment
016, 0,----> ._htkHVite -A -a -C hVite.conf -S spMfc.scp -d hmms_p/ -i s
#[07] post file moving operation
017, 1,----> mkdir outDir
018, 1,----> copy spLab_aligned.mlf outDir./_wav_aligned.mlf
[Diagrams: files handled by each HTK tool]
HLEd: spLab.scp, hLed00.led, hLed.led; spLab00.mlf, spLab00.lst; spLab.mlf, spLab.lst; spLab_p.mlf, spLab_p.lst, spLab_p.dic
HCopy: hCopy.conf, spWav2Mfc.scp; *.wav → *.mfc
HCompV: HCompV.conf, spMfc.scp, spLab_p.mlf, myHmmPro; *.mfc → hmms_p/*
HERest (N iterations, N=5): hErest.conf, spMfc.scp, spLab_p.mlf, spLab_p.lst, *.mfc, hmms_p/*, hmms_p/HER1.acc
HVite: hVite.conf, spMfc.scp, spLab_p.lst, spLab.mlf, spLab_p.dic, hmms_p/, *.mfc → spLab_aligned.mlf
HTK summary
• HTK tools: HLEd, HCopy, HCompV, HERest, HVite
#!MLF!#
"./_wav/SN0.rec"
0 800000 sil -578.044434
800000 8700000 wikipedia -5636.368652
8700000 9900000 is -855.988770
9900000 10800000 a -693.554871
10800000 20100000 multilingual -7268.197266
20100000 21400000 sil -791.746216
.
"./_wav/SN1.rec"
0 800000 sil -541.083069
800000 8600000 webbased -5977.622070
8600000 10200000 sil -1048.225220
.
"./_wav/SN2.rec"
0 1500000 sil -1100.892822
1500000 10800000 freecontent -7094.197266
10800000 21700000 encyclopedia -8148.633789
21700000 25700000 project -3247.493896
25700000 32500000 supported -5594.979492
32500000 35500000 by -2412.487305
35500000 37000000 the -1176.310547
37000000 43700000 wikimedia -5128.852051
43700000 52100000 foundation -5995.618164
52100000 53100000 sil -695.872864
.
.
.
spLab_aligned.mlf
wavDir/
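HTK label files give times in units of 100 ns, so 800000 corresponds to 0.080 s. The sketch below shows one way to turn an aligned MLF into timed-text lines, one sentence at a time, adding a per-sentence offset. It is an illustration, not the authors' script: parse_mlf and to_timed_text are hypothetical names, and the 2.16 s offset for the second sentence is simply read off the timed-text example earlier in the talk.

```python
HTK_UNITS_PER_SECOND = 10_000_000  # HTK times are multiples of 100 ns


def parse_mlf(mlf_text):
    """Return {utterance name: [(start_s, end_s, label), ...]}."""
    utterances = {}
    current = None
    for line in mlf_text.splitlines():
        line = line.strip()
        if not line or line == "#!MLF!#":
            continue
        if line.startswith('"'):      # quoted name opens a new utterance
            current = line.strip('"')
            utterances[current] = []
        elif line == ".":             # a lone period closes the utterance
            current = None
        elif current is not None:     # "start end label [score]"
            fields = line.split()
            start = int(fields[0]) / HTK_UNITS_PER_SECOND
            end = int(fields[1]) / HTK_UNITS_PER_SECOND
            utterances[current].append((start, end, fields[2]))
    return utterances


def to_timed_text(segments, offset=0.0):
    """Format segments as 'start end label', shifted by offset seconds."""
    return ["%.3f %.3f %s" % (s + offset, e + offset, label)
            for s, e, label in segments]


SAMPLE_MLF = '''#!MLF!#
"./_wav/SN0.rec"
0 800000 sil -578.044434
800000 8700000 wikipedia -5636.368652
.
"./_wav/SN1.rec"
0 800000 sil -541.083069
800000 8600000 webbased -5977.622070
.
'''

utterances = parse_mlf(SAMPLE_MLF)
sn0 = to_timed_text(utterances["./_wav/SN0.rec"])
sn1 = to_timed_text(utterances["./_wav/SN1.rec"], offset=2.16)
```

Here sn0 holds the lines for the first sentence and sn1 the lines for the second, shifted so that "webbased" starts at 2.240 s on the full recording. The log-likelihood score in the fourth column is ignored.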
The major algorithm in HTK
‘Holiday Shopping’ = ‘h’+‘o’+‘l’+‘i’+‘d’+‘ay’+‘sil’+‘sh’+‘o’+‘p’+‘I’+‘ng’
• Forced Alignment in HTK
– 1. Given a speech signal
– 2. Do the pronunciation transcription
• Pronunciation symbols must be ASCII only!
– 3. Train to get the HMM models (one per symbol: ‘h’, ‘o’, ..., ‘ng’)
– 4. Do the Viterbi search for the optimal path (the alignment)
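To make the Viterbi step concrete, here is a toy forced alignment over a fixed left-to-right sequence of states. It is a simplified sketch, not HTK's algorithm as implemented: the frame scores are given as a table instead of being computed from trained HMMs, transition probabilities are ignored, and each frame may only stay in the current state or advance to the next one. The name forced_align is hypothetical.

```python
def forced_align(loglik):
    """Best state index per frame for a left-to-right state sequence.

    loglik[t][s] is the log-likelihood of frame t under state s.
    The path starts in state 0, ends in the last state, and at each
    frame either stays or advances by one state.
    Requires at least as many frames as states.
    """
    n_frames, n_states = len(loglik), len(loglik[0])
    neg_inf = float("-inf")
    score = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = loglik[0][0]
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s]
            move = score[t - 1][s - 1] if s > 0 else neg_inf
            if move > stay:
                score[t][s] = move + loglik[t][s]
                back[t][s] = s - 1
            else:
                score[t][s] = stay + loglik[t][s]
                back[t][s] = s
    # Trace back from the final state at the last frame.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Five frames, three states; each frame clearly prefers one state.
toy = [
    [0, -5, -5],
    [0, -5, -5],
    [-5, 0, -5],
    [-5, -5, 0],
    [-5, -5, 0],
]
alignment = forced_align(toy)
```

For the toy table the alignment assigns the first two frames to state 0, the third to state 1 and the last two to state 2. The frames where the state changes are the segment boundaries, which is the information reported as start and end times in the aligned label file.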
#!MLF!#
"wavDir/SN0001.rec"
0 800000 sil -567.865356
800000 8700000 wikipedia -5670.471680
8700000 10000000 is -951.059692
10000000 10600000 a -489.843994
10600000 20000000 multilingual -7398.754395
20000000 20700000 sil -416.119415
.
"wavDir/SN0002.rec"
0 900000 sil -632.964050
900000 8600000 webbased -6000.767578
8600000 9900000 sil -914.236206
.
"wavDir/SN0003.rec"
0 2100000 sil -1373.137817
2100000 9000000 freecontent -5306.260742
9000000 18500000 encyclopedia -6654.958984
18500000 25600000 project -5698.730469
25600000 32700000 supported -5713.494141
32700000 33200000 by -429.306763
33200000 34800000 the -1205.477539
34800000 41500000 wikimedia -5115.318359
41500000 50000000 foundation -6074.208496
50000000 52000000 and -1746.236938
52000000 56200000 based -3267.695801
56200000 57000000 on -585.264404
57000000 57700000 a -577.346130
57700000 63200000 model -3769.413574
63200000 63800000 of -524.015503
63800000 65300000 sil -1129.348633
.
wavDir.align
Now it’s time
for karaoke!
A browser in JavaScript and HTML
for Text-Karaoke
• https://youtu.be/11-ltx0yv_o
A browser in Python using Tkinter
for Text-Karaoke
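Both viewers need the same core lookup: given the current playback time, which word should be highlighted? The sketch below shows one way to do this with a binary search over the timed-text entries. It is an illustration only and leaves out the GUI; parse_timed_text and word_at are hypothetical names, and the input format is assumed to be "start end label" with times in seconds.

```python
from bisect import bisect_right


def parse_timed_text(text):
    """Parse 'start end label' lines into (start_s, end_s, label) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            entries.append((float(fields[0]), float(fields[1]), fields[2]))
    return entries


def word_at(entries, t):
    """Index of the entry whose [start, end) interval contains t, else None."""
    starts = [entry[0] for entry in entries]
    i = bisect_right(starts, t) - 1   # last entry starting at or before t
    if i >= 0 and t < entries[i][1]:
        return i
    return None


entries = parse_timed_text("""0.000 0.080 sil
0.080 0.870 wikipedia
0.870 0.990 is""")
current = word_at(entries, 0.5)
```

At 0.5 s the lookup returns index 1, the entry for "wikipedia". A viewer would call word_at on a timer with the audio player's current position and highlight the returned word, skipping entries labeled sil.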
Conclusion & Future Work
• Make the process more automatic.
• Make the user interface friendlier.
• Make the program more robust.
• Your help with improvements is welcome.
• Thank you for listening!
