Building a Stock Prediction system with Machine Learning using Geode, SpringXD and Spark MLLib

© 2015 Pivotal Software, Inc. All rights reserved.
William Markito
@william_markito
Fred Melo
@fredmelo_br

It's all about DATA
Data Sources
Look for patterns
Prediction

[Figure: Machine Learning Model (e.g. Linear Regression). Inputs: price(x), medium avg (x), relative strength (x). Output: medium avg (x+1).]
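
As a rough illustration of the model above, the sketch below fits a linear regression that predicts the next average from the current price, average and relative strength. It uses plain Python and the normal equations rather than Spark MLlib, and every function name and data value in it is made up for the example.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_linear_regression(rows, targets):
    """Least-squares fit via the normal equations (X^T X) w = X^T y."""
    xs = [[1.0] + list(row) for row in rows]  # leading 1.0 is the intercept term
    n = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(n)] for i in range(n)]
    xty = [sum(x[i] * t for x, t in zip(xs, targets)) for i in range(n)]
    return solve_linear_system(xtx, xty)


def predict(weights, features):
    """Apply the fitted weights: intercept plus weighted sum of the features."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))


# Made-up samples: (price(x), medium avg (x), relative strength (x))
samples = [
    (10.0, 9.5, 55.0),
    (10.4, 9.8, 60.0),
    (10.1, 9.9, 48.0),
    (9.7, 9.8, 40.0),
    (9.9, 9.7, 52.0),
    (10.6, 10.0, 65.0),
]
# Made-up targets standing in for medium avg (x+1), generated from a known linear rule
targets = [0.5 + 0.2 * p + 0.8 * a + 0.01 * r for p, a, r in samples]

weights = fit_linear_regression(samples, targets)
print(predict(weights, (10.2, 9.9, 58.0)))
```

Because the toy targets follow an exact linear rule, the fitted weights reproduce it; real market data would leave a residual error.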

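[Diagram: stock prediction pipeline built on SpringXD]
SpringXD: Extensible, Open-Source, Fault-Tolerant, Horizontally Scalable, Cloud-Native
Stream operations: Transform, Enrich, Filter, Split, Sink
Inputs: Real data, Simulator
Diagram labels (steps numbered 1 to 3): Indicators, Machine Learning, Predict, Dashboard
Data paths: /Stocks, /TechIndicators, /Predictions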
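
The Indicators step can be pictured as turning a window of recent prices into features for the model. The sketch below is one possible reading, assuming a simple moving average and the common relative strength index formula; the slides do not say which definitions the project uses, and the function names are hypothetical.

```python
def moving_average(prices, window):
    """Mean of the last `window` prices."""
    if len(prices) < window:
        raise ValueError("not enough prices for the requested window")
    return sum(prices[-window:]) / window


def relative_strength_index(prices, window):
    """RSI over the last `window` price changes (assumed definition)."""
    if len(prices) < window + 1:
        raise ValueError("need window + 1 prices to compute window changes")
    recent = prices[-(window + 1):]
    changes = [b - a for a, b in zip(recent, recent[1:])]
    avg_gain = sum(c for c in changes if c > 0) / window
    avg_loss = sum(-c for c in changes if c < 0) / window
    if avg_loss == 0:
        return 100.0  # no losses in the window
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)


# Made-up price series
prices = [10.0, 11.0, 12.0, 11.0, 13.0]
indicators = {
    "avg": moving_average(prices, 3),
    "rsi": relative_strength_index(prices, 4),
}
print(indicators)
```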

Apache Geode (incubating)
Introduction

Introduction
A distributed, memory-based data management platform for data-oriented apps that need:
• High performance, scalability, resiliency and continuous availability
• Fast access to critical data sets
• Location-aware distributed data processing
• Event-driven data architecture

Concepts
Cache
In-memory storage and management for your data
Configurable through XML, Spring, Java API or CLI
Collection of Regions

Concepts
Region
Distributed java.util.Map on steroids (Key/Value)
Consistent API regardless of where or how data is stored
Observable (reactive)
Highly available, redundant across cache Member(s)
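
To make the "observable map" idea concrete, here is a small single-process analogy in Python. It is not the Geode API: the class, its methods and the event names are hypothetical, and nothing here is distributed or redundant.

```python
class ObservableRegion:
    """Toy key/value region that notifies listeners on every put."""

    def __init__(self, name):
        self.name = name
        self._entries = {}
        self._listeners = []

    def add_listener(self, callback):
        """Register a callback taking (event, key, old_value, new_value)."""
        self._listeners.append(callback)

    def put(self, key, value):
        event = "update" if key in self._entries else "create"
        old_value = self._entries.get(key)
        self._entries[key] = value
        for callback in self._listeners:
            callback(event, key, old_value, value)

    def get(self, key):
        return self._entries.get(key)


events = []
stocks = ObservableRegion("/Stocks")
stocks.add_listener(lambda event, key, old, new: events.append((event, key, old, new)))
stocks.put("AAPL", 110.0)  # made-up quote
stocks.put("AAPL", 111.5)
print(events)
```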

Concepts
Region
Local, Replicated or Partitioned
In-memory or persistent
Redundant
LRU
Overflow
LOCAL
LOCAL_HEAP_LRU
LOCAL_OVERFLOW
LOCAL_PERSISTENT
LOCAL_PERSISTENT_OVERFLOW
PARTITION
PARTITION_HEAP_LRU
PARTITION_OVERFLOW
PARTITION_PERSISTENT
PARTITION_PERSISTENT_OVERFLOW
PARTITION_PROXY
PARTITION_PROXY_REDUNDANT
PARTITION_REDUNDANT
PARTITION_REDUNDANT_HEAP_LRU
PARTITION_REDUNDANT_OVERFLOW
PARTITION_REDUNDANT_PERSISTENT
PARTITION_REDUNDANT_PERSISTENT_OVERFLOW
REPLICATE
REPLICATE_HEAP_LRU
REPLICATE_OVERFLOW
REPLICATE_PERSISTENT
REPLICATE_PERSISTENT_OVERFLOW
REPLICATE_PROXY
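
The PARTITION and REDUNDANT variants above can be pictured as hashing each key to a bucket and keeping that bucket on more than one member. The sketch below is only an illustration of that idea; the hashing scheme, the round-robin placement and all names are assumptions, not how Geode actually assigns buckets.

```python
import zlib


def bucket_for(key, num_buckets):
    """Map a key to a bucket with a stable hash (illustrative choice)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def owners_for(key, members, num_buckets, redundant_copies):
    """Return the primary member followed by the members holding redundant copies."""
    if redundant_copies >= len(members):
        raise ValueError("need more members than redundant copies")
    first = bucket_for(key, num_buckets) % len(members)
    return [members[(first + i) % len(members)] for i in range(redundant_copies + 1)]


# Hypothetical member names; the bucket count is chosen arbitrarily here
members = ["server1", "server2", "server3"]
print(owners_for("AAPL", members, num_buckets=113, redundant_copies=1))
```

With one redundant copy, each entry lives on two distinct members, so losing either one still leaves a copy available.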

Concepts
Member
A process that has a connection to the system
A process that has created a cache
Embeddable within your application
Client
Locator
Server

Concepts
Client cache
A process connected to the Geode server(s)
Can have a local copy of the data
Can be notified about events on the servers

Concepts
Listeners
CacheWriter / CacheListener
AsyncEventListener (queue / batch)
Parallel or Serial
Conflation
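
Queueing, batching and conflation can be sketched together: events wait in a queue, a newer value for a key replaces the pending one, and the listener receives whole batches. The class below is a hypothetical, synchronous stand-in written for illustration; it is not Geode's AsyncEventListener interface.

```python
class ConflatingBatchQueue:
    """Collects (key, value) events and hands them to a listener in batches."""

    def __init__(self, batch_size, listener):
        self._batch_size = batch_size
        self._listener = listener
        self._pending = {}  # insertion-ordered; one slot per key

    def offer(self, key, value):
        # Conflation: a newer value for a pending key overwrites the older one
        self._pending[key] = value
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self):
        """Deliver everything pending as one batch, if there is anything."""
        if self._pending:
            batch = list(self._pending.items())
            self._pending = {}
            self._listener(batch)


batches = []
queue = ConflatingBatchQueue(batch_size=3, listener=batches.append)
queue.offer("AAPL", 110.0)  # made-up quotes
queue.offer("AAPL", 111.5)  # conflated with the previous AAPL event
queue.offer("GOOG", 640.0)
queue.offer("MSFT", 47.0)   # third distinct key triggers delivery
print(batches)
```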

Currently under incubation at the Apache Software Foundation
Contributions and contributors are welcome:
Code and Patches
Bugs, feature requests
Documentation and content
Any form of feedback

Code
New features
Bug fixes (patches)
Writing tests
Documentation
Wiki
Web site
User guides
Community
Join our mailing lists (Ask or answer)
Become a speaker
Find and report bugs
Testing a release candidate or beta

JIRA - https://issues.apache.org/jira/browse/GEODE
GitHub - https://github.com/apache/incubator-geode
Mailing lists:
Development - dev@geode.incubator.apache.org
Users - user@geode.incubator.apache.org
Wiki - cwiki.apache.org/confluence/display/GEODE
StackOverflow - http://stackoverflow.com/questions/tagged/geode+or+gemfire

SpringXD
Introduction

Concepts
A stream is composed from modules. Each module is deployed to a container and its channels are bound to the transport.
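
The composition of a stream from modules can be imitated in a single Python process by chaining generators, with each stage standing in for one module. This is a loose analogy only: there are no containers or transport here, it is not the Spring XD DSL, and all function names are hypothetical.

```python
def filter_module(predicate):
    """Build a processor that drops items failing the predicate."""
    def process(stream):
        return (item for item in stream if predicate(item))
    return process


def transform_module(fn):
    """Build a processor that maps each item through fn."""
    def process(stream):
        return (fn(item) for item in stream)
    return process


def run_stream(source, processors, sink):
    """Wire source -> processors -> sink and push every item through."""
    stream = iter(source)
    for process in processors:
        stream = process(stream)
    for item in stream:
        sink(item)


received = []
run_stream(
    source=[{"symbol": "AAPL", "price": 111.5}, {"symbol": "BAD", "price": -1.0}],
    processors=[
        filter_module(lambda q: q["price"] > 0),
        transform_module(lambda q: {**q, "symbol": q["symbol"].lower()}),
    ],
    sink=received.append,
)
print(received)
```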

Apache Zeppelin
(incubating)
Introduction

Concepts
Web-based REPL
Iterative & Exploratory
Support for Data Ingestion

Concepts
Multiple interpreters
Markdown
Shell
Spark
Geode
Python…

Concepts
Sharing through URLs without Reports

Apache Spark
Introduction

Concepts
RDD
Dataframe
Driver
Worker
"An RDD in Spark is simply an immutable distributed collection of objects.
Each RDD is split into multiple partitions, which may be computed on different nodes
of the cluster. RDDs can contain any type of Python, Java, or Scala objects,
including user-defined classes."
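
The partitioning described in the quote can be mimicked in plain Python, without Spark, to show the shape of the idea: an immutable collection is split into partitions and a transformation produces a new collection instead of changing the old one. The helper names are hypothetical, and here every partition is processed in the same process rather than on different nodes.

```python
def parallelize(items, num_partitions):
    """Split items into contiguous, immutable partitions."""
    items = list(items)
    bounds = [len(items) * i // num_partitions for i in range(num_partitions + 1)]
    return tuple(tuple(items[bounds[i]:bounds[i + 1]]) for i in range(num_partitions))


def map_partitions(partitions, fn):
    """Apply fn to every element, partition by partition, returning new partitions."""
    return tuple(tuple(fn(x) for x in part) for part in partitions)


def collect(partitions):
    """Gather all partitions back into one list."""
    return [x for part in partitions for x in part]


numbers = parallelize(range(1, 7), num_partitions=3)
squares = map_partitions(numbers, lambda x: x * x)
print(numbers)
print(collect(squares))
```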

Concepts
RDD
Dataframe
Driver
Worker
"A dataframe is a distributed collection of rows organized into named columns. An
abstraction for selecting, filtering and plotting structured data (pandas), previously
known as SchemaRDD."
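
Selecting and filtering by named columns can be shown on a small scale with rows held as dictionaries. This is a plain-Python stand-in, not the Spark DataFrame API; the helper names and the sample quotes are made up.

```python
def select(rows, *columns):
    """Keep only the named columns of every row."""
    return [{c: row[c] for c in columns} for row in rows]


def where(rows, predicate):
    """Keep only the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


quotes = [
    {"symbol": "AAPL", "price": 111.5, "volume": 1200},
    {"symbol": "GOOG", "price": 640.0, "volume": 300},
    {"symbol": "MSFT", "price": 47.0, "volume": 900},
]
liquid = select(where(quotes, lambda r: r["volume"] >= 900), "symbol", "price")
print(liquid)
```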

Summary
SpringXD
• Integration
  • Spark, JDBC, Geode
  • HDFS, Twitter, File, Mail…
• Data pipeline orchestration
• Intuitive DSL
• Streaming & Analytics
• Distributed and scalable
Apache Zeppelin
• Web-based REPL
• Multiple interpreters
  • Apache Spark
  • Markdown
  • Flink
  • Python
  • Geode…
• Iterative & Exploratory

Summary
Apache Spark
• Fast data processing
• Columnar queries
• RDDs
• Machine Learning
• Analytics & Streaming
Apache Geode
• Fast data store and processing
• In-memory & Persistent
• Highly Consistent
• Transaction processing
• Thousands of concurrent clients

Source Code
http://pivotal-open-source-hub.github.io/StockInference-Spark/
