[2016-12-01] DDBJ Data Analysis Challenge Report: Task Design and Rule Setting for a Machine Learning Competition

Kickoff meeting(7/6)
○Eli Kaminuma1, Yukino Baba2, Masahiro Mochizuki3, Hirotaka Matsumoto4, Haruka Ozaki4, Toshitsugu Okayama5, Takuya Kato6,
Shinya Oki7, Osamu Ogasawara1, Hisashi Kashima2, Toshihisa Takagi1
(1. Ctr for Info Biol, NIG, 2. Grad Sch of Info, Kyoto Univ, 3. IMSBIO Co., Ltd, 4. RIKEN・ACCC・BiT, 5. BITS Inc., 6. Grad Sch of Info Sci Tech, Tokyo Univ, 7. Grad Sch of Med Sci, Kyushu Univ)
Highlights of the DDBJ Data Analysis Challenge: Task Design Rules for a Machine Learning Competition
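ABSTRACT In recent years, machine learning has begun to be applied to the annotation analysis of large-scale experimental data, such as data from next-generation sequencers, in order to automate manual processing. However, many experimental researchers have difficulty finding collaborators with machine learning expertise. One solution is a "machine learning competition," which outsources the task to a crowd. In this presentation we share what we learned from running such a competition. The competition, named the DDBJ Data Analysis Challenge, ran from July 6 to August 31, 2016. Because a machine learning competition is participatory research that handles personal information, we obtained approval in advance from the research ethics review board of the National Institute of Genetics. The challenge task was "prediction of chromatin features from DNA sequences": participants competed on the accuracy of predicting the presence or absence of chromatin feature annotations from sequence data held by DDBJ. The data were prepared, under the same conditions, for an organism not yet listed in ChIP-Atlas, a secondary annotation database of the Sequence Read Archive published by DDBJ. Model submissions were managed on a machine learning competition platform, the University of Big Data (ビッグデータ大学). The platform was built for educational purposes and lets participants check the provisional accuracy of a submitted model online. The first-place model reached an accuracy (area under the curve, AUC) of 0.95. There were 38 participants and a total of 360 model submissions. The first submitted model had AUC = 0.65, so model accuracy improved by 0.30 over the competition period. The first-place model combined two types of classifiers, including a DNN, by an ensemble learning method, and incorporated external data (genomic position information and gene structure annotation information) as features. Besides whether external data may be used, the competition required many other rule decisions. For reference, we present the procedures, conditions, and rules: (1) keeping the data unpublished, (2) masking information to prevent cheating, (3) formatting data for participants from outside the field, (4) whether pre-trained models for transfer learning may be used, and (5) reducing the data volume as a trade-off with computing resources, among others.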
Acknowledgments:
・ Competition Participants : orion, Ryota, tsukasa, emn, extraterrestrial Fuun species, ηzw, mkoido, ksh, AoYu@Tohoku, hiro, emihat, MorikawaH, hmt-yamamoto, tonets, すずどら, take2, bicycle1885, morizo,
forester, どいやさん, yudai, tag, nwatarai, soki, himkt, saoki, tsunechan, Ken, A.K, singular0316, IK, yk_tani, yota0000
・ Mathworks Japan : Fumitaka Otobe, Takuya Ohtani, Hikari Amano, Takafumi Ohbiraki
・ NIG : Ayako Oka, Naoko Sakamoto, Yasukazu Nakamura
・ DDBJ : Yuji Ashizawa, Tomohiko Yasuda, Naofumi Ishikawa, Tomohiro Hirai, Tomoka Watanabe, Chiharu Kawagoe, Emi Yokoyama, Kimiko Suzuki, Junko Kohira
Development of Submitted Model Precision and Prize Models
2LBA-015
※1 DDBJ (DNA Data Bank of Japan) http://www.ddbj.nig.ac.jp/
※2 NIG supercomputer http://sc.ddbj.nig.ac.jp/
※3 UniversityOfBigdata http://universityofbigdata.net/
DDBJ Data Analysis Challenge :
A Crowdsourced Machine Learning Competition
Rule Design for Machine Learning Competition
Designing the DDBJ Challenge Task
Outline of the 1st Prize Model
Scheduling the Challenge Project with IRB Approval
DDBJ Challenge Award

Award: 1st Prize (最優秀賞)
Winner: Masahiro Mochizuki (望月正弘), Life Science Group, R&D Department, IMSBIO Co., Ltd.
AUC: 0.94564
Model Design:
* 2 Classifiers (Extremely Randomized Trees, CNN)
* Ensemble Learning (Stacking)
* External Data (Genomic Position, Gene Structure Annotation)
Tool Version: python=3.5, scikit-learn=0.17.1, chainer=1.10.0

Award: 2nd Prize (優秀賞)
Winner: Hirotaka Matsumoto (松本拡高, team representative) and Haruka Ozaki (尾崎遼), Bioinformatics Research Unit, RIKEN ACCC; participated as a team of two
AUC: 0.89859
Model Design:
* 2 Classifiers (CNN, Product of Genomic Distance Decay Parameter and Nearest Training Data Output)
* Ensemble Learning (Averaged)
* External Data (Genomic Position)
Tool Version: julia=0.4.6, python=2.7.10, skflow (tensorflow=0.8.0)

Award: 3rd Prize (優良賞)
Winner: Toshitsugu Okayama (岡山利次), BITS Inc.
AUC: 0.85428
Model Design:
* 7 Classifiers (Naive Bayes for Multivariate Bernoulli Models, Logistic Regression, Random Forest, Gradient Boosting, Extremely Randomized Trees, eXtreme Gradient Boosting, CNN)
* Ensemble Learning (Stacking)
Tool Version: python=2.7.11, numpy=1.10.4, scikit-learn=0.17, chainer=1.11.0, xgboost=0.4a30

Award: Student Prize (学生賞)
Winner: Takuya Kato (加藤卓也), first-year master's student, Graduate School of Information Science and Technology, The University of Tokyo
AUC: 0.84318
Model Design:
* 3 Classifiers (LeNet like CNN, DeepBind like CNN, Variable filter DeepBind like CNN)
* Ensemble Learning (Soft Voting)
Tool Version: python=2.7, lasagne=0.2.dev1
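Two of the prize models combine their classifiers by averaging or soft voting. The following is a minimal sketch of soft voting for a binary task, not the winners' code: the function name `soft_vote`, the three probability lists, and the 0.5 decision threshold are all illustrative assumptions.

```python
def soft_vote(prob_lists, weights=None):
    """Average positive-class probabilities across classifiers, sample by sample."""
    n_models = len(prob_lists)
    if weights is None:
        weights = [1.0] * n_models  # plain (unweighted) averaging
    total = sum(weights)
    n_samples = len(prob_lists[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total
        for i in range(n_samples)
    ]


# Hypothetical outputs of three CNN variants on three sequences.
lenet_like = [0.9, 0.2, 0.6]
deepbind_like = [0.8, 0.4, 0.4]
variable_filter = [0.7, 0.3, 0.8]

averaged = soft_vote([lenet_like, deepbind_like, variable_filter])
# Assumed threshold of 0.5 to turn averaged probabilities into labels.
labels = [int(p >= 0.5) for p in averaged]
```

Because AUC depends only on the ranking of scores, the averaged probabilities can be submitted directly without thresholding.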
[Figure: Submitted Model Precision with Participant Nickname. Axes: Model Precision (AUC), Date of Challenge Period]
[Table caption: Winner, Model Precision, Design, Toolname]
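Outline of the 1st Prize Model
* Extremely Randomized Trees (ERT) [1], Convolutional Neural Network (CNN) [2]
* Combined by the Stacked Generalization (Stacking) [3] ensemble learning method
* External data (external features): genomic coordinates and gene annotation information
<ERT model>
* Genomic Coordinates Based Model (GCBM)
* Features are genomic coordinates, arranged as an n (number of query sequences) x m (number of chromosomes) matrix
* Genomic coordinates are computed by alignment to the Arabidopsis thaliana TAIR10 reference genome sequence
<CNN model>
* Gene Annotated Sequences Based Model (GASBM)
* Features are the query sequences and gene annotation information (TAIR10)
* Incorporated separately for the forward and reverse strands
* Definition of the gene annotation feature
Variable r is the decay rate and variable d is the distance from the first base of the gene
When r is 0, the feature is 1, meaning the target base lies within a gene
When r is greater than 0, the feature is a gradient value from the gene start base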
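The n x m genomic coordinate matrix of the ERT model could be laid out as in the sketch below. How the winning model actually filled the matrix is not stated on the poster, so this is an assumption: each row holds the aligned position in the column of the chromosome that was hit and 0 elsewhere. The chromosome names, the `coordinate_matrix` helper, and the example hits are hypothetical.

```python
# Assumed column order: the five nuclear chromosomes of Arabidopsis thaliana.
CHROMOSOMES = ["Chr1", "Chr2", "Chr3", "Chr4", "Chr5"]


def coordinate_matrix(hits, chromosomes=CHROMOSOMES):
    """Build an n x m matrix with one row per query and one column per chromosome.

    hits: one entry per query, either (chromosome, position) from an
    alignment, or None when the query did not align.
    """
    column = {name: j for j, name in enumerate(chromosomes)}
    matrix = []
    for hit in hits:
        row = [0] * len(chromosomes)  # 0 marks "no hit on this chromosome"
        if hit is not None:
            chrom, position = hit
            row[column[chrom]] = position
        matrix.append(row)
    return matrix


# Hypothetical alignment results for three query sequences.
hits = [("Chr1", 12345), ("Chr4", 678), None]
X = coordinate_matrix(hits)
```

A tree-based classifier such as ERT can split on these raw positions without scaling, which may be one reason coordinates pair well with that model family.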
2. Masking data conditions, Reducing data size
(Masking information to prevent cheating; reducing the data volume to save computing resources)
In the ChIP-Atlas database, combinations of tissue/cell-type conditions (CellType class) and functional type (Antigen class) are curated.
"756 conditions" → (reducing, masking) → only "8 conditions" were publicly explained
Q&A in DDBJ Challenge Website
http://www.ddbj.nig.ac.jp/ddbj-challenge2016-j.html
DDBJ(※1) plans to develop a big data analysis environment.
DDBJ DNA Databases
↓
Not widely known as data resource
for big data analysis
NIG Supercomputer(※2)
↓
Needs to furnish analytical tools
Organizing a Machine Learning Competition on
a Data Science Platform, '京大鹿島研ビッグデータ大学' (University of Big Data, Kashima Lab, Kyoto University) (※3)
- Supporting the DDBJ databases as a data resource
- Installing big data analytical tools on the NIG supercomputer
2016 Feb 29 – Deadline of Submitting Applications
Mar 15 – IRB Meeting Date
Mar 30 – Notice of Certification
■ Participatory research that handles the personal information of a crowd requires research ethics approval from an IRB (Institutional Review Board)
■ Reviewed by IRB of National Institute of Genetics
[1] Geurts, P., Ernst, D. & Wehenkel, L. Extremely randomized trees. Machine Learning 63, 3-42 (2006).
[2] LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278-2324 (1998).
[3] Wolpert, D. H. Stacked generalization. Neural Networks 5, 241-259 (1992).
[Figure: GASBM model]
*Participant Number＝38
*Final Model Precision(Total Change)＝AUC 0.95 (+ 0.30)
*Model Precision Increase Rate=46%
*Submission Number(TOTAL)=360
*Submission Number(AVG)=9.5
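The derived figures above follow from the raw counts. A short check, using the rounded AUC values reported on the poster (variable names are illustrative):

```python
first_auc = 0.65      # AUC of the first submitted model
final_auc = 0.95      # AUC of the 1st prize model (rounded)
participants = 38
submissions = 360

total_change = final_auc - first_auc            # absolute AUC gain
increase_rate = total_change / first_auc        # gain relative to the first model
avg_submissions = submissions / participants    # submissions per participant

print(f"total change: +{total_change:.2f}")        # total change: +0.30
print(f"increase rate: {increase_rate:.0%}")       # increase rate: 46%
print(f"avg submissions: {avg_submissions:.1f}")   # avg submissions: 9.5
```

The 46% is therefore a relative increase over the first submission, not a gain in AUC points.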
CHALLENGE TASK: Predicting whether the genomic regions corresponding to input
DNA sequences include chromatin feature regions.
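The task is a binary prediction, and submissions are scored by AUC. As a sketch of what that score measures (not the platform's scoring code), AUC can be computed as the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counted as one half. The function name and the toy labels and scores are assumptions.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = region contains a chromatin feature, 0 = it does not.
labels = [1, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.1]
value = auc(labels, scores)  # 3 of the 4 positive/negative pairs are ordered correctly
```

This pairwise form is quadratic in the number of examples; it is meant to show the definition, and a rank-based implementation would be preferable for large test sets.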
1. Generating unpublished data
(Creating unpublished data to prevent cheating)
The ChIP-Atlas database publishes analyzed data for the latest SRA entries.
http://devbio.med.kyushu-u.ac.jp/chipatlas/img/DataNumber.png
→ An organism not yet published in ChIP-Atlas, Arabidopsis thaliana, was selected. © 2007 Emi Kosano
3. Domain-Knowledge-Free DNA Codes
(Formatting data for participants from outside the field)
Nucleobase   Letter Code   Number Code   01 Code
Adenine      A             1             1000
Cytosine     C             2             0100
Guanine      G             3             0010
Thymine      T             4             0001
Unknown      N             0             0000
The 01 code may be an easy-to-handle encoding for information scientists.
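The coding table above can be applied to a sequence as in the following sketch. The mappings come from the table; the function names and the choice to treat any unlisted letter as Unknown (N) are assumptions.

```python
# Mappings taken from the coding table.
NUMBER_CODE = {"A": 1, "C": 2, "G": 3, "T": 4, "N": 0}
BINARY_CODE = {"A": "1000", "C": "0100", "G": "0010", "T": "0001", "N": "0000"}


def encode_number(sequence):
    """Map each base to its number code; unlisted letters are treated as N (assumption)."""
    return [NUMBER_CODE.get(base, 0) for base in sequence.upper()]


def encode_01(sequence):
    """Map each base to its 4-bit 01 code, giving a len(sequence) x 4 matrix."""
    return [
        [int(bit) for bit in BINARY_CODE.get(base, "0000")]
        for base in sequence.upper()
    ]


numbers = encode_number("ACGTN")   # [1, 2, 3, 4, 0]
matrix = encode_01("AN")           # [[1, 0, 0, 0], [0, 0, 0, 0]]
```

The 01 code yields a fixed-width numeric matrix that convolutional models can consume directly, whereas the number code implies an ordering among bases that has no biological meaning.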
IRB Approval Letter
Website release(6/27)
Summary
* A machine learning competition for a life science domain task was organized.
* Over the 57 competition days, model precision (AUC) rose from 0.65 to 0.95, a 46% increase.
* Competition rules on intellectual property rights and team participation should be clarified.
Kaggle column: 10 research competitions, https://www.kaggle.com/

Rule                                     DDBJ Challenge   Kaggle
EXTERNAL DATA                            Allowed          Not Allowed (10)
CODE SHARING                             ------           Prohibited (10)
OPEN-SOURCE CODE                         ------           Prohibited (10)
PRE-TRAINED MODEL                        Allowed          ------ (Allowed at a competition forum)
TEST DATA FOR TRAINING                   Allowed          ------ (Allowed at a competition forum)
One account per participant              ------           1 (10)
No private sharing outside teams         ------           Prohibited (10)
Team limits                              ------           Max(8), 4 participants (1), 1 participant (1)
Submission limits per day                3                5 (4), 4 (1), 3 (1), 2 (3), 1 (1)

* Kaggle competitions often prohibit the use of external data.
* Rules on copyright and team participation were lacking.
