How to train a Cantonese chatbot with LIHKG data
https://t.me/lihkg_9up_bot
Let's have a chat
We built a "9up" (idle-chatter) chatbot
9up bot
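How does it learn to chat?
食咗飯未?? ("Have you eaten yet??") → (?????)
What we want to do is...
Input → Output
Put in a Cantonese sentence, get a Cantonese sentence back
But where do we start?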
Machine Learning
Intensive ("hell-style") training on a large amount of Cantonese data
seq2seq (sequence to sequence)
食咗飯未?? ("Have you eaten yet??") → 一齊食🤘 ("Let's eat together 🤘")
So we used TensorFlow to implement a seq2seq framework
Input → Output
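seq2seq (sequence to sequence)
Input sentence: 食咗飯未?? ("Have you eaten yet??")
Input tokens: 食 咗 飯 未??
Output tokens: 一 齊 食 🤘
Output sentence: 一齊食🤘 ("Let's eat together 🤘")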
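The slide above splits each sentence into tokens before the model sees it. A minimal sketch of that step, assuming plain character-level splitting (the deck does not say how tokens were actually produced, and it groups 未?? as one token where this sketch yields three); the special token names and all function names are hypothetical:

```python
# Sketch only: character-level tokenisation is an assumption,
# not the tokeniser confirmed by the slides.
PAD, GO, EOS, UNK = "_PAD", "_GO", "_EOS", "_UNK"  # hypothetical special tokens


def tokenize(sentence):
    """Split a sentence into single characters, dropping whitespace."""
    return [ch for ch in sentence if not ch.isspace()]


def build_vocab(sentences):
    """Assign an integer id to every token, in order of first appearance."""
    vocab = {tok: i for i, tok in enumerate([PAD, GO, EOS, UNK])}
    for sentence in sentences:
        for tok in tokenize(sentence):
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(sentence, vocab):
    """Turn a sentence into ids; unseen tokens map to UNK."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokenize(sentence)]


def decode(ids, vocab):
    """Turn ids back into a sentence."""
    inverse = {i: tok for tok, i in vocab.items()}
    return "".join(inverse[i] for i in ids)


question = "食咗飯未??"
answer = "一齊食🤘"
vocab = build_vocab([question, answer])
question_ids = encode(question, vocab)
answer_ids = encode(answer, vocab)
```

The model only ever sees the id sequences; `decode` is what turns its output ids back into a reply.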
What is seq2seq?
Where else is seq2seq used?
Google Translate
Google Inbox Auto Reply
Google Allo
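The hard part:
What Cantonese data is actually out there?
(ideally conversational)
Data 1: Stephen Chow films
19 films, about 100,000 lines of dialogue
Stephen Chow film dialogue (example)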
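Tang Monk (Input): What are you doing?
Wukong (Output): Let go!
Wukong (Input): Let go!
Tang Monk (Output): You want it? If you want it, you have to say so. If you want it, I'll give it to you. If you don't want it, of course I won't give it to you! It makes no sense for you to say you want it and me not to give it, or for you not to want it and me to give it. Let's all be reasonable! Right, I'll count to three, and you tell me whether you want it or not. One……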
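Data 1: Stephen Chow films
19 films, about 100,000 lines of dialogue
100,000 lines of dialogue is actually too few:
not enough to train meaningful conversation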
Again:
What Cantonese data is actually out there?
(ideally conversational)
Data 2: LIHKG
Crawled 64,751 posts from LIHKG's chit-chat board (吹水台),
kept only the posts with 10 or more replies,
leaving 18,290
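The reply-count filter above can be sketched in a few lines. The post structure (a dict with a `replies` list) and the function name are hypothetical; only the threshold of 10 comes from the slide:

```python
MIN_REPLIES = 10  # threshold stated on the slide


def filter_posts(posts, min_replies=MIN_REPLIES):
    """Keep only posts that have at least `min_replies` replies."""
    return [post for post in posts if len(post["replies"]) >= min_replies]


# Toy data, not real LIHKG posts.
posts = [
    {"title": "post A", "replies": ["reply"] * 12},
    {"title": "post B", "replies": ["reply"] * 3},
    {"title": "post C", "replies": ["reply"] * 10},
]
kept = filter_posts(posts)
```

Short threads give few input/response pairs each, which is presumably why they were dropped before extraction.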
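Data 2: LIHKG
More than 1,700,000 input/response pairs extracted from the 18,290 posts
A program analyses every post and selects "input/response" data

by the following rules:

1. If there is a quote reply: use the quoted reply as the "input"

2. If there is no quote reply: use the original post's title + body as the "input"

3. If the original post is too long: use only the title as the "input"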
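Example
Examples of selected training data:
No quote reply, so the original post's title + body is the "input"
Input: Does nobody in Hong Kong listen to metal? Ask ten Hongkongers and they'll all say metal is noise
Output: Fellow metal fan here
With quote reply
Input: Fellow metal fan here
Output: What genre?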
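The three selection rules can be sketched as one function, run here on the metal example above. The post layout (`title`, `body`, replies carrying `id`, `text` and an optional `quote_id`), the split of the example into title and body, and the 100-character cut-off for "too long" are all assumptions; the deck gives none of them.

```python
MAX_BODY_CHARS = 100  # hypothetical cut-off; the slides do not define "too long"


def post_input(post, max_body_chars=MAX_BODY_CHARS):
    """Rules 2 and 3: title + body, or the title alone if the body is too long."""
    if len(post["body"]) > max_body_chars:
        return post["title"]
    return post["title"] + " " + post["body"]


def extract_pairs(post):
    """Return one (input, output) pair per reply in the post."""
    replies_by_id = {reply["id"]: reply for reply in post["replies"]}
    pairs = []
    for reply in post["replies"]:
        quoted = replies_by_id.get(reply.get("quote_id"))
        if quoted is not None:
            source = quoted["text"]    # rule 1: the quoted reply is the input
        else:
            source = post_input(post)  # rules 2 and 3
        pairs.append((source, reply["text"]))
    return pairs


post = {
    "title": "香港係咪冇人聽metal?",
    "body": "問十個香港人都話metal係噪音",
    "replies": [
        {"id": 1, "text": "metal同好"},
        {"id": 2, "text": "咩類型", "quote_id": 1},
    ],
}
pairs = extract_pairs(post)
```

Here the first reply is paired with the post's title and body, and the second with the reply it quotes, matching the two examples on the slide.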
Results
How do you use the 9up bot API?
For developers
Model Parameters Summary
Custom configurations
• seq2seq model with attention

• 5 layers encoder and decoder

• vocabulary size: 63000

• state vector size: 256

• learning rate: 0.5

• learning decay factor: 0.99

• batch size: 64

• bucket sizes (encoder length, decoder length)

  • (10, 10)

  • (20, 30)

  • (40, 30)

  • (60, 30)
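To show how the bucket sizes and decay factor above might be used, here is a minimal sketch in plain Python. It assumes a pair goes into the first listed bucket that fits both lengths and is padded up to that bucket's size, and it does not model when a decay step is triggered, since the slides do not say. The function names and `PAD_ID` are hypothetical, and special start/end tokens are ignored.

```python
BUCKETS = [(10, 10), (20, 30), (40, 30), (60, 30)]  # (encoder length, decoder length)
PAD_ID = 0  # hypothetical padding id


def pick_bucket(source_ids, target_ids, buckets=BUCKETS):
    """Return the index of the first bucket that fits the pair, or None."""
    for index, (encoder_len, decoder_len) in enumerate(buckets):
        if len(source_ids) <= encoder_len and len(target_ids) <= decoder_len:
            return index
    return None  # pair is too long for every bucket


def pad_to_bucket(source_ids, target_ids, bucket):
    """Right-pad both sequences to the bucket's fixed lengths."""
    encoder_len, decoder_len = bucket
    source = source_ids + [PAD_ID] * (encoder_len - len(source_ids))
    target = target_ids + [PAD_ID] * (decoder_len - len(target_ids))
    return source, target


def decayed_learning_rate(decay_steps, initial=0.5, decay=0.99):
    """Learning rate after `decay_steps` applications of the decay factor."""
    return initial * decay ** decay_steps


short_bucket = pick_bucket([4, 5, 6, 7, 8, 8], [9, 10, 4, 11])
padded_source, padded_target = pad_to_bucket(
    [4, 5, 6, 7, 8, 8], [9, 10, 4, 11], BUCKETS[short_bucket]
)
```

Bucketing keeps padding small: a six-token question is padded to 10 rather than to the longest encoder length of 60.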
