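Institute for Information Industry, Digital Education Institute: International Talent Development Program
YB102 Enterprise Intelligent System Design, Graduation Project
Instructor: 郭惠民
Team member: 簡俊能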
MLB ANALYZE SYSTEM
THE BEGINNING ~ I READ A BOOK
I studied the basics again
I watched baseball games in
I studied all the data I could find about baseball games
I wanted to analyze the secrets in the big data
I wanted to prove that it is true
I began ..
I found that ….
Predicting baseball can give fans and players
more fun in this game,
because the game is so complex to predict
SYSTEM ARCHITECTURE
[Architecture diagram. Components shown: ETL, RDB, R, Matlab, Web Server, ML (Mahout), Hadoop ecosystem (Pig, Hive, HDFS, RHadoop), Spark (RDD), NoSQL DB, Client UI]
RAW DATA
ETL EXTRACT
Crawl all team data, 2010-15
Change the link for each team using a dictionary
ETLTRANSFROM
Split case
Split Data
Replace symbol
Symbol transform
ETLTRANSFROM
Special comma
Special comma
ETLTRANSFROM
ETLLOAD
JDBC PYTHON TO ORACLE
25000 games in 5 years
PYTHON TO ORACLE
ETLLOAD
BEFORE: on the web
AFTER: in my DB
Sep1 -> 0902, W -> 1
136 -> WSN20140902--136
BE FILTER
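The before/after cleanup can be sketched in Python, the language the deck uses for its ETL. This is a minimal sketch under stated assumptions, not the author's script: the function names are hypothetical, the date is assumed to become a zero-padded MMDD string, any result other than W is assumed to map to 0, and the key is assumed to be team code + year + MMDD + "--" + game number, as the WSN20140902--136 example suggests.

```python
# Hypothetical helpers illustrating the web -> DB cleanup; names are assumptions.
MONTHS = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
          "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}


def to_mmdd(text):
    """Turn a scraped date such as 'Sep 2' or 'Sep2' into zero-padded 'MMDD'."""
    month, day = text[:3], text[3:].strip()
    return f"{MONTHS[month]:02d}{int(day):02d}"


def encode_result(flag):
    """Map 'W' to 1; anything else is assumed to be a loss and maps to 0."""
    return 1 if flag.strip().upper() == "W" else 0


def game_key(team, year, mmdd, game_no):
    """Build a key shaped like 'WSN20140902--136'."""
    return f"{team}{year}{mmdd}--{game_no}"


print(game_key("WSN", 2014, to_mmdd("Sep 2"), 136))
```

Keeping each rule in its own small function makes it easy to test the date, result, and key conversions separately before loading rows into Oracle.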
ETL
ETL EXTRACT
Crawl the MLB website
DATA COUNT
25,000 games * 100 batting records per game
More than 2,500,000 records
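LOGISTIC REGRESSION
The data are transformed into values between 0 and 1.
The model's final output variable takes the value 0 or 1,
which usually stands for "yes" or "no",
or "success" or "failure".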
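Logistic regression is one of the most widely used regression techniques. A logistic regression model can be used to
compute the classification probability for a single category or for two categories in the data, and it describes the
relationship between multiple predictor variables X1, X2, X3, ... and a binary dependent variable Y.
Y is usually coded as a two-class (0, 1) variable (Kleinbaum, D. G., 1998).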
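As a minimal sketch of how such a model turns predictors into a probability between 0 and 1 and then into a 0/1 outcome: the function names and the 0.5 cutoff below are illustrative assumptions, and the coefficient values would come from a fitted model rather than from this code.

```python
import math


def logistic_probability(intercept, coefficients, predictors):
    """P(Y = 1) = 1 / (1 + exp(-(b0 + b1*X1 + b2*X2 + ...)))."""
    z = intercept + sum(b * x for b, x in zip(coefficients, predictors))
    return 1.0 / (1.0 + math.exp(-z))


def classify(probability, threshold=0.5):
    """Turn a probability into a 0/1 outcome (the threshold is an assumed cutoff)."""
    return 1 if probability >= threshold else 0


# With all-zero inputs the linear part is 0, so the probability is exactly 0.5.
p = logistic_probability(0.0, [1.0, 1.0], [0.0, 0.0])
print(p, classify(p))
```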
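ANALYSIS MODEL
Forward stepwise selection: starting from the
intercept-only model, add the independent variables
that meet the chosen significance level, one at a time
Backward stepwise selection: starting from the
model that contains all candidate variables, remove the
independent variables that fail the significance level
required to stay in the model, one at a
time
Combined stepwise selection: a mix of the two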
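The forward procedure can be sketched as a simple loop. This is an illustration only: `p_value_of` is a hypothetical callback that would refit the model and return the p-value a variable gets when added to the current selection, the 0.05 level is an assumed default, and the p-values and the HOME variable in the demo are made up.

```python
def forward_stepwise(candidates, p_value_of, alpha=0.05):
    """Add the most significant remaining variable until none passes alpha."""
    selected = []
    remaining = list(candidates)
    while remaining:
        # p-value each remaining variable would have if added to the current model
        scored = [(p_value_of(selected, name), name) for name in remaining]
        best_p, best_name = min(scored)
        if best_p >= alpha:
            break  # no remaining variable meets the entry level
        selected.append(best_name)
        remaining.remove(best_name)
    return selected


# Demo with made-up, fixed p-values (a real callback would refit the model).
toy_p_values = {"GOBP": 0.001, "GSLG": 0.01, "HOME": 0.40}
chosen = forward_stepwise(toy_p_values, lambda current, name: toy_p_values[name])
print(chosen)
```

Backward selection is the mirror image: start with every candidate selected and repeatedly drop the variable with the largest p-value while it exceeds the retention level.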
ANALYSIS STEP
• Playing at home is not necessarily an advantage
ANALYSIS PROCESS
• The difference in OBP between MLB teams is much
smaller than you might think
• MLB players are pretty amazing: their statistics
are world-class
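ANALYSIS HYPOTHESIS
• An individual's batting average rarely decides a game;
the whole team's OBP + SLG = OPS is the key
Getting on base (OBP) -> reaching second base on one hit (SLG) -> the whole team getting on base in succession (OPS) -> total runs -> WIN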
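For reference, the three statistics can be computed from counting stats with the standard baseball definitions. The function and parameter names are illustrative, and the sample numbers are made up rather than taken from the project's data.

```python
def obp(hits, walks, hit_by_pitch, at_bats, sac_flies):
    """On-base percentage: (H + BB + HBP) / (AB + BB + HBP + SF)."""
    return (hits + walks + hit_by_pitch) / (at_bats + walks + hit_by_pitch + sac_flies)


def slg(singles, doubles, triples, home_runs, at_bats):
    """Slugging percentage: total bases divided by at-bats."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats


def ops(on_base, slugging):
    """OPS is simply OBP plus SLG."""
    return on_base + slugging


# Made-up team line for one game: 10 hits (6 singles, 2 doubles, 1 triple, 1 home run).
game_obp = obp(hits=10, walks=3, hit_by_pitch=1, at_bats=34, sac_flies=2)
game_slg = slg(singles=6, doubles=2, triples=1, home_runs=1, at_bats=34)
print(round(game_obp, 3), round(game_slg, 3), round(ops(game_obp, game_slg), 3))
```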
LOGISTIC PREDICTION MODEL ~ R
Winprobability = glm(GRESULT ~ GOBP + GSLG, family = binomial(link = logit))
Load data from Oracle, ETL in R
library(RJDBC)  # load RJDBC
drv = JDBC("oracle.jdbc.driver.OracleDriver", "C:/mylib/ojdbc14.jar", identifier.quote = "'")  # load the driver
con = dbConnect(drv, "jdbc:oracle:thin:@localhost:1521:XE", "user1", "u111")  # get a connection
options(max.print = 10000000)  # raise the print limit
res = dbReadTable(con, "game")  # read the game table
# ex1 is the game data frame used from here on; the slide does not show how it is derived from res
YEAR = substr(ex1$GDATE, 4, 7)  # take the year substring of the game date
YEARSEQ = as.data.frame(YEAR)  # convert to a data frame
colnames(YEARSEQ) = c("YEAR")  # name the column
exp2 = cbind(ex1, YEARSEQ)  # append the YEAR column
exp2$YEAR = as.character(exp2$YEAR)  # convert to character
exp2$YEAR = as.numeric(exp2$YEAR)  # convert to numeric
TEST PREDICTION
[Screenshots: the data before, after adding a column, and after sorting]
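pred = predict(logyear, newdata = esortyear, type = "response")  # predicted win probabilities
# classify a game as a win when the probability exceeds 0.5 (assumed cutoff), so that tab is a confusion table
tab = table(Y = esortyear$GRESULT, Ypred = as.integer(pred > 0.5))
a = 100 * sum(diag(tab)) / sum(tab)  # accuracy in percent
TEST PREDICTION
TEAM: ARI
Duration: 1-60
Year: 2014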
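The accuracy figure `a` is the share of games on the diagonal of the actual-versus-predicted table. A minimal Python sketch of the same calculation, assuming predicted probabilities are turned into 0/1 with a 0.5 cutoff (the cutoff, the function name, and the sample values are assumptions, not values from the slides):

```python
def accuracy_percent(actual, probabilities, threshold=0.5):
    """Percent of games whose thresholded prediction matches the actual result."""
    predicted = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(1 for y, y_hat in zip(actual, predicted) if y == y_hat)
    return 100.0 * correct / len(actual)


# Made-up example: two of the four predictions match the actual results.
print(accuracy_percent([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6]))
```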
TEST PREDICTION
TEAM:ARI
Duration:50-110
Year:2014
TEST PREDICTION
TEAM:ARI
Duration:100-162
Year:2014
TEST PREDICTION: accuracy (%) by series
TEAM: ARI, YEAR: 2010-15, Games: 810, SERIES: 1-50 / 50-100 / 100-160
YEAR    1-50 series    50-100 series    100-160 series
2014    58.33          63.93            68.85
2013    64.00          60.78            54.10
2012    62.00          66.67            59.02
2011    55.00          67.21            63.93
2010    58.33          62.75            57.38
TEST PREDICTION: for each opposing team
TEAM: ARI
YEAR: 2010-15
Games: 810
OPP: 30
TEST ANALYSIS
• The logistic model is significant over the 5 years
• 9 teams do NOT have enough games to run the logistic model
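Before fitting one model per opponent, it helps to check which groups have enough games. The sketch below counts games per group and keeps those above a cutoff; the record layout, the `OPP` key, and the `min_games` value are hypothetical, since the slides do not state the threshold that was used.

```python
from collections import Counter


def groups_with_enough_games(games, key, min_games):
    """Return the group names (e.g. opponents) that appear in at least min_games games."""
    counts = Counter(game[key] for game in games)
    return sorted(name for name, n in counts.items() if n >= min_games)


# Made-up sample: only LAD reaches the assumed cutoff of 3 games.
sample_games = [
    {"OPP": "LAD"}, {"OPP": "LAD"}, {"OPP": "LAD"},
    {"OPP": "NYY"},
    {"OPP": "SFG"}, {"OPP": "SFG"},
]
print(groups_with_enough_games(sample_games, "OPP", min_games=3))
```

The same check would apply to any other grouping column, such as the opposing starting pitcher.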
TEST PREDICTION: for each starting pitcher
TEAM: ARI
YEAR: 2010-15
Games: 810
STARTER: 254
• ARI faced 254 starting pitchers in total
• Few of them have enough games to run the logistic model
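ANALYSIS CONCLUSION
• If OBP + SLG + OPS can predict about 70% of the results, what about the other 30%?
For a team that scores consistently, nearly 70% of major-league results can be predicted, so where do the surprises come from?
• The pitcher gets hit hard
• The two teams are evenly matched
• The opponent plays above its usual level
• Team morale
• Coaching tactics
• Biased umpiring
• Other factors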
VISUALIZATION
Q&A
THANK YOU FOR YOUR ATTENTION

Big data mlb analyze system