Hadoop Con2015 - The Data Scientist’s Toolbox
LEN CHANG
• MACHINE LEARNING & DATA MINING
• DISTRIBUTED SYSTEMS & NOSQL
• CRAWLER & CHINESE MINING
• Communication Engineering, General Study - CCU
• Software Engineering, Master Study - NCU
• Pixnet Hackathon 2014 – EXIF MINING
• Pixnet Hackathon 2015 – Spam User Detection
• Taipei Open Data Hackathon 2015
– The relation between Religion and Taipei City
• BI SYSTEM & DATA VISUALIZATION
• FINANCE & EDUCATION & ART & SPORT
• THE PLAYER OF BLIZZARD GAMES
AGENDA
• A GOOD STORY
• TOOL 1: DATABASE
• TOOL 2: COLLECTION AND REPLICATION
• TOOL 3: VISUALIZATION
• TOOL 4: MACHINE LEARNING
• SAMPLE
• SUMMARY
A GOOD STORY
DIGITAL CUSTOMER EXPERIENCE
How much are you willing to pay?
45 NT / latte vs. 95 NT / latte
WHY?
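"If home is the 'first place' for human contact and the workplace is the
'second place', then a public space such as a coffeehouse (Starbucks, for
example) is what I often call the 'third place'. A coffeehouse sits
somewhere between home and office: you can socialize there or be alone,
connect with other people or reconnect with yourself. Starbucks was
founded to offer ordinary people exactly this valuable opportunity."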
~Howard Schultz
• the loyalty card
• pay in advance on mobile
• wireless device charging
Digital customer experience
Chief Digital Officer: Adam Brotman
A Good Digital Customer Experience
• Location
• Mobile pay
• Loyalty card
• Social network
• BI system, data warehousing, etc.
A GOOD STORY TELLS US…
• FIND YOUR “UNIQUE CUSTOMER DATA”.
• USE “CUSTOMER DATA” TO IMPROVE THE “DIGITAL CUSTOMER EXPERIENCE”.
• USE THE “DIGITAL CUSTOMER EXPERIENCE” TO HELP THE ORGANIZATION “MAKE MONEY”.
TOOL 1: DATABASE
OLAP AND NOSQL
Relational DB vs. NoSQL
How to choose?
THE PURPOSE IS IMPORTANT
CDC
ETL
SQL
A 100% accurate answer when I read the report.
THE PURPOSE IS IMPORTANT
Machine Learning
Real-time feedback
Real-time dashboard
A less accurate but faster response when I need a rough answer.
THE PURPOSE IS IMPORTANT
Machine Learning
Powerful at full-text search, weak at numeric computation.
THE PURPOSE IS IMPORTANT
High frequency
Real-time dashboard
Both accuracy and speed must be ensured; cost is not the main concern.
DATABASE
• 100% ACCURATE
– RELATIONAL DATABASE
• LESS ACCURATE, FASTER
– HBASE, SPARK, CASSANDRA, MONGODB, OTHERS
• SPECIAL CASES
– FULL-TEXT SEARCH: ELASTICSEARCH
– ACCURACY AND SPEED: REDIS OR ANOTHER IN-MEMORY DB
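The selection rules on this slide can be written down as a small decision function. This is only a sketch of the slide's reasoning, not a sizing tool: the `Workload` class, its field names, and `suggest_store` are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of what the data is needed for."""
    needs_exact_answers: bool       # e.g. a bank report must be 100% accurate
    needs_low_latency: bool         # e.g. a real-time dashboard
    full_text_search: bool = False  # e.g. searching article text


def suggest_store(w: Workload) -> str:
    """Map a workload to the database family named on the slide."""
    if w.full_text_search:
        return "full-text search engine (e.g. Elasticsearch)"
    if w.needs_exact_answers and w.needs_low_latency:
        return "in-memory database (e.g. Redis)"
    if w.needs_exact_answers:
        return "relational database"
    return "distributed NoSQL store (e.g. HBase, Cassandra, MongoDB)"
```

For example, `suggest_store(Workload(needs_exact_answers=True, needs_low_latency=False))` returns `"relational database"`, which matches the reporting case.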
COLLECTION AND REPLICATION
LOGSTASH AND FLUENTD
REPLICATION TOOL
Collection: Any Data in, Any Data out
FLUENTD: BUILD YOUR UNIFIED LOGGING LAYER
LOGSTASH: COLLECT, ENRICH & TRANSPORT DATA
COMPARISON
FLUENTD
• LANG: RUBY WITH C EXTENSIONS
• PLATFORM: LINUX
• MAJOR OUTPUT DB: MONGODB
LOGSTASH
• LANG: JRUBY (RUNS ON THE JVM)
• PLATFORM: LINUX AND WINDOWS
• MAJOR OUTPUT DB: ELASTICSEARCH
• ELK ARCH.
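The "any data in, any data out" idea behind both collectors can be sketched in a few lines of plain Python. This is not Fluentd's or Logstash's API; `parse_line`, `run_pipeline`, and the `source` field are hypothetical names, and the sinks stand in for real outputs such as MongoDB or Elasticsearch.

```python
import json
from typing import Callable, Iterable, List


def parse_line(line: str) -> dict:
    """Turn one raw log line into an event; keep unparseable lines as text."""
    try:
        value = json.loads(line)
    except json.JSONDecodeError:
        return {"message": line, "parse_error": True}
    # Valid JSON that is not an object (e.g. a bare number) is wrapped as well.
    return value if isinstance(value, dict) else {"message": line}


def run_pipeline(lines: Iterable[str], source: str,
                 sinks: List[Callable[[dict], None]]) -> int:
    """Collect lines, enrich each event with its source, send it to every sink."""
    count = 0
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        event = {**parse_line(raw), "source": source}  # enrich step
        for sink in sinks:
            sink(dict(event))  # each sink receives its own copy
        count += 1
    return count
```

Passing two list `append` methods as sinks shows the fan-out: every collected event reaches both stores, which is the role the unified logging layer plays between data sources and databases.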
Replicate: copy data from DB_A to DB_B
• Case 1: RDB to RDB
• Case 2: NOSQL to NOSQL
• Case 3: NOSQL to RDB
(diagram label: Transaction DB)
ETL: Extract-Transform-Load
• Case 1: RDB to RDB
• Case 2: NOSQL to NOSQL
• Case 3: NOSQL to RDB
(diagram: a mongo cluster of nodes, with Postgres attached as one more node)
COMPARISON
RDB TO RDB
• TRADITIONAL MECHANISM
• ENSURES “DATA CONSISTENCY”
• FINANCIAL INDUSTRY
NOSQL TO NOSQL
• HUGE DATA ANALYSIS
• LOW-COST HARDWARE, POWERFUL AND FAST COMPUTATION
• NEEDS PROGRAMMING SKILL, NOT ONLY SQL
NOSQL TO RDB
• MAKES AN RDB A NODE OF THE NOSQL CLUSTER
• MAYBE A BALANCE BETWEEN NOSQL AND RDB
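The NoSQL-to-RDB case can be illustrated with a minimal Extract-Transform-Load sketch. It assumes the documents are plain Python dicts standing in for a document store, and uses an in-memory SQLite database as the relational side; the `payments` table and its columns are invented for this example and do not come from the talk.

```python
import sqlite3
from typing import Iterable, Iterator, Optional, Tuple

Row = Tuple[str, Optional[str], float]


def extract(documents: Iterable[dict]) -> Iterator[dict]:
    """Extract: read documents from the (stand-in) document store."""
    yield from documents


def transform(doc: dict) -> Row:
    """Transform: flatten one nested document into a fixed-width row."""
    user = doc.get("user") or {}
    return (str(doc["_id"]), user.get("name"), float(doc.get("amount", 0)))


def load(conn: sqlite3.Connection, rows: Iterable[Row]) -> None:
    """Load: upsert rows so that re-running the job does not duplicate data."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(id TEXT PRIMARY KEY, user_name TEXT, amount REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(documents: Iterable[dict], conn: sqlite3.Connection) -> None:
    load(conn, (transform(d) for d in extract(documents)))
```

Keying the table on the document id makes the load idempotent, which is one simple way to keep the relational copy consistent when the job is repeated.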
VISUALIZATION
VISUALIZE YOUR DATA
1,999 USD
MACHINE LEARNING
GENETIC ALGORITHM
Genetic algorithm
Travelling salesman problem
Self-guided travel scheduling
Genetic Algorithm System
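A genetic algorithm for the travelling salesman problem can be sketched as follows. This is a generic textbook-style version, not the speaker's scheduling system: tournament selection, order crossover, swap mutation, and all parameter defaults are choices made here for illustration.

```python
import math
import random


def tour_length(tour, cities):
    """Total length of the closed tour that visits cities in the given order."""
    return sum(
        math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


def order_crossover(a, b, rng):
    """Copy a slice from parent a, fill the remaining slots in parent b's order."""
    n = len(a)
    i, j = sorted(rng.sample(range(n), 2))
    kept = set(a[i:j + 1])
    fill = iter(c for c in b if c not in kept)
    return [a[k] if i <= k <= j else next(fill) for k in range(n)]


def swap_mutation(tour, rng, rate):
    """With probability `rate`, swap two cities in a copy of the tour."""
    tour = tour[:]
    if rng.random() < rate:
        i, j = rng.sample(range(len(tour)), 2)
        tour[i], tour[j] = tour[j], tour[i]
    return tour


def solve_tsp(cities, pop_size=50, generations=100, mutation_rate=0.2, seed=0):
    """Return the best tour found (a permutation of city indices)."""
    rng = random.Random(seed)
    n = len(cities)

    def fitness(tour):
        return tour_length(tour, cities)

    population = [rng.sample(range(n), n) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness)
        next_gen = population[:2]  # elitism: the two best tours survive
        while len(next_gen) < pop_size:
            p1 = min(rng.sample(population, 3), key=fitness)  # tournament
            p2 = min(rng.sample(population, 3), key=fitness)
            child = order_crossover(p1, p2, rng)
            next_gen.append(swap_mutation(child, rng, mutation_rate))
        population = next_gen
    return min(population, key=fitness)
```

Calling `solve_tsp([(0, 0), (0, 1), (1, 1), (1, 0)])` returns an ordering of the four corner indices. A travel-scheduling variant would swap `tour_length` for a cost that also accounts for opening hours or visit durations.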
Linear algebra and probability are important
Bayesian probability
Decision Tree
Regression
Support Vector Machine
SAMPLE
SOME INTERESTING SAMPLE APPLICATIONS…
FINANCIAL DISTRESS PREDICTION SYSTEM
Inputs: financial indices, company share price
Genetic Algorithm: 3000 financial indices => 20 financial indices
Support Vector Machine
Matlab & C# & ASP.NET
GAME TREND MONITOR SYSTEM
Crawler System (x4)
DB
Text Mining System
Article => Emotional Value
DB
C# & MSSQL & SSRS
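The "Article => Emotional Value" step can be illustrated with a tiny lexicon-based scorer. The slides do not say how the original system computed its value, so this is an assumed approach; the word lists are invented, and the sketch uses English tokens for readability, whereas Chinese text would first need word segmentation.

```python
import re

# Hypothetical sentiment lexicons; a real system would use a curated dictionary.
POSITIVE = {"fun", "great", "love", "amazing"}
NEGATIVE = {"lag", "bug", "boring", "hate"}


def emotional_value(article: str) -> float:
    """Score an article in [-1, 1]: +1 all positive words, -1 all negative."""
    words = re.findall(r"[a-z']+", article.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos + neg == 0:
        return 0.0  # no sentiment-bearing words found
    return (pos - neg) / (pos + neg)
```

Averaging this score over the articles crawled for one game per day gives a simple trend line that a reporting tool can plot.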
APP BEHAVIOR ANALYSIS SYSTEM
RDB
s3fs
mongo cluster nodes, with Postgres attached as one more node
Pentaho
R
R & RUBY & MONGODB & POSTGRES & Pentaho & MOSQL & FLUENTD & s3fs
SUMMARY
FOR THE SAME PROBLEM, YOU WILL BUILD A BETTER SOLUTION OR MECHANISM WHEN YOU ARE A
MULTI-DOMAIN EXPERT.
Crawler System
Text Mining System
Article => Emotional Value
More than 8 years…
A shortcut?
What’s the fastest way to understand zombies?
Editor's Notes

  • #4: Introduction (5 min) TOOL 1 (5 min) TOOL 2 (10 min) TOOL 3 (10 min) SUMMARY (5 min)
  • #7: Question 1: Why does it feel reasonable that a Starbucks latte costs more? Question 2: Can anyone tell the difference between an ordinary latte and a Starbucks latte? Question 3: So what did Starbucks do about this? (Animation)
  • #8: Starbucks 1.0: the relationship between one person and another. Starbucks 2.0: giving customers a good digital experience.
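  • #14: There is no perfect database; every database has things it does well and things it does poorly. In banking, data errors are not tolerated, not even in reports. A NoSQL store with "weak" consistency is therefore unsuitable, and an RDB with "strong" consistency is required. Because banks read their reports at fixed times, the remaining hours can be used to run heavy batch jobs and even to build many cubes for the reports to use.
  • #15: - Take a mobile voice assistant as an example. The biggest difference from the banking case above is that small deviations in the data have little effect on the analysis. Here we can use the large-scale computing power of NoSQL to get the answer we need quickly.
  • #16: Some NoSQL databases were designed to handle a special case, such as text retrieval. Elasticsearch's text search is very powerful and fast; the whole database can be said to exist for text search. It is not very good at numeric processing, however.
  • #17: The data format is simple, but the data must be updated frequently and at large scale, with consistency checks against both the front end and the back end. Use an in-memory database.
  • #31: - Whether you do analysis out of personal interest, work as a data consultant, or run a startup of 5 to 10 people, this tool can greatly improve the accuracy of your judgments and greatly reduce in-house IT development of visualization tools. It is a worthwhile investment.
  • #33: For monitoring only.
  • #36: - Don't waste the nearly unlimited computing power that distributed systems give us.
  • #47: - There are no shortcuts.
  • #48: - Domain knowledge isn’t a tool. It’s common sense.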