HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS

Stock Market Order Flow Reconstruction Using HBase on AWS
Aaron Carreras, HBaseCon – San Francisco, May 2015

About Presenter
•  Director of Enterprise Data Platforms at FINRA
•  Data Ingestion, Processing and Management

WHAT DO WE DO?
• Collect and Create
•  33B events/day
•  18 national exchanges
•  Equities, Options and Fixed Income
•  Reconstruct the market from trillions of events spanning years
• Detect & Investigate
•  Identify market manipulations, insider trading, fraud and compliance violations
• Enforce & Discipline
•  Ensure rule compliance
•  Fine and bar broker dealers
•  Refer matters to the SEC and other authorities
[Diagram: TRF, Firm, Exchange, Dark Pool]

What stock trade looks like to the investor

Example of what is actually happening

Configurations/Approaches in Common

Logical Architecture
CDH 4.5; HBase 0.94.6; EC2 hs1.8xlarge – 16 vCPU, 117 GiB memory, 24 drives x 2,000 GB

Row Key Design & Pre-splitting
•  Salt Our Row Keys
o  Our “natural” keys are monotonically increasing
o  Row Key = salt (PK) + PK
•  Pre-split
•  Better control of distribution of data across regions
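
The salting and pre-splitting idea above can be sketched in a few lines. This is a minimal illustration, not the presenter's implementation: the bucket count, the MD5-based salt function, and the two-digit prefix format are all assumptions, since the slide only states that Row Key = salt (PK) + PK.

```python
import hashlib
from typing import List

NUM_BUCKETS = 16  # hypothetical bucket count; the slides do not state one


def salt(pk: bytes, buckets: int = NUM_BUCKETS) -> bytes:
    """Derive a stable two-digit salt prefix from the primary key itself."""
    digest = hashlib.md5(pk).digest()  # assumed hash; any stable hash works
    return b"%02d" % (digest[0] % buckets)


def row_key(pk: bytes, buckets: int = NUM_BUCKETS) -> bytes:
    """Row key = salt(PK) + PK, as stated on the slide."""
    return salt(pk, buckets) + pk


def split_points(buckets: int = NUM_BUCKETS) -> List[bytes]:
    """Pre-split boundaries: one region per salt bucket.

    The first region has an empty start key, so only buckets 1..N-1
    need an explicit split point.
    """
    return [b"%02d" % b for b in range(1, buckets)]
```

Because the salt is computed from the PK rather than chosen at random, a reader that knows the PK can recompute the full row key for a point Get, while monotonically increasing PKs still spread across all pre-split regions.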

Compactions & Splitting Configurations
Parameter                            Default                                   Override
hbase.hregion.majorcompaction        7 days                                    0 (disable)
hbase.hstore.compactionThreshold     3                                         10
hbase.hstore.compaction.max          10                                        15
hbase.hregion.max.filesize           10 GB                                     200 GB
RegionSplitPolicy                    IncreasingToUpperBoundRegionSplitPolicy   ConstantSizeRegionSplitPolicy
hbase.hstore.useExploringCompation   false                                     true
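
Overrides like the ones in this table normally live as property entries in hbase-site.xml. The sketch below renders them in that form. Two things are assumptions rather than content from the slides: the split policy is written under the property name hbase.regionserver.region.split.policy with its fully qualified class name, and "200 GB" is interpreted as 200 GiB in bytes. The property name useExploringCompation is spelled as it appears on the slide.

```python
from xml.sax.saxutils import escape

GIB = 1024 ** 3

# Override values from the table, expressed as strings for the XML file.
OVERRIDES = {
    "hbase.hregion.majorcompaction": "0",            # 0 disables time-based major compactions
    "hbase.hstore.compactionThreshold": "10",
    "hbase.hstore.compaction.max": "15",
    "hbase.hregion.max.filesize": str(200 * GIB),    # assumption: 200 GB read as 200 GiB
    # Assumption: property name and package are not given on the slide.
    "hbase.regionserver.region.split.policy":
        "org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy",
    "hbase.hstore.useExploringCompation": "true",    # spelling as on the slide
}


def render_properties(props):
    """Render a dict as Hadoop-style <property> elements."""
    lines = []
    for name, value in props.items():
        lines.append("  <property>")
        lines.append("    <name>%s</name>" % escape(name))
        lines.append("    <value>%s</value>" % escape(value))
        lines.append("  </property>")
    return "\n".join(lines)


if __name__ == "__main__":
    print("<configuration>")
    print(render_properties(OVERRIDES))
    print("</configuration>")
```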

OS Configuration Considerations
•  Some of these may not be relevant to you depending on your OS/version but are worth confirming
Parameter                            Setting
redhat_transparent_hugepage/defrag   never
nofile/nproc ulimit                  32768
tcp_low_latency                      1 (enabled)
vm.swappiness                        0 (disabled)
selinux                              Disabled
IPv6                                 no (disabled)
iptables                             off/stop

Other Hadoop Configuration Considerations
Where            Parameter                              Setting
core-site.xml    ipc.client.tcpnodelay                  true
core-site.xml    ipc.server.tcpnodelay                  true
hdfs-site.xml    dfs.client.read.shortcircuit           true
hdfs-site.xml    fs.s3a.buffer.dir                      [machine specific]
hbase-site.xml   hbase.snapshot.master.timeoutMillis    1800000
hbase-site.xml   hbase.snapshot.master.timeout.millis   1800000
hbase-site.xml   hbase.master.cleaner.interval          600000 (ms)

Use Case ‘A’: Patterns

Use Case ‘A’: Background
•  Create graphs for historical market event data (trillion records)
•  Basically a batch process
o  Each batch had ~4 billion events
o  Related events may span batches (e.g., root could arrive later, children may be corrected, etc.)
•  Back-process the prior 18 months (540 batches)
•  Complete the project given the and

Use Case ‘A’: Utilize Bulk Loads
•  Back processing and the ongoing update process are 100% bulk HFile loads
•  Our column families and processing align with this approach: linkage and content are split into separate column families
•  Eliminates Puts completely, along with the WAL writes, memstore flushes, and additional compactions that often accompany them
[Diagram: HFile Bulk Load]

Use Case ‘A’: Optimize Gets
•  Used sorted / partitioned batched Gets
o  Minimize required RPC calls
o  Use sorting to make better use of the block cache
•  Allocate more on-heap memory for reads
Parameter                                         Default   Override
hfile.block.cache.size                            .4        .65
hbase.regionserver.global.memstore.upperLimit     .4        .15
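
One way to picture "sorted / partitioned batched Gets" is the grouping step below. It is a sketch under stated assumptions: region boundaries are passed in as a plain sorted list (a real client would look them up from the table), and the function only plans the batches; issuing each batch as a multi-Get is left out.

```python
from bisect import bisect_right
from collections import defaultdict


def partition_sorted_batches(row_keys, region_start_keys, batch_size):
    """Group row keys by region, sort each group, and cut it into batches.

    region_start_keys: sorted start keys of every region except the first
    (that is, the table's split points). A hypothetical input here.
    Returns a list of (region_index, [keys...]) tuples.
    """
    by_region = defaultdict(list)
    for key in row_keys:
        # A region holds keys >= its start key and < the next start key.
        by_region[bisect_right(region_start_keys, key)].append(key)

    batches = []
    for region in sorted(by_region):
        keys = sorted(by_region[region])  # sorted access touches blocks in order
        for i in range(0, len(keys), batch_size):
            batches.append((region, keys[i:i + batch_size]))
    return batches
```

Each batch targets a single region, so it can be served in one round trip, and the sorted order within a batch means neighbouring keys tend to hit the same already-cached HFile blocks.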

Use Case ‘B’: Patterns

Use Case ‘B’: Background
•  Not a once-a-day batch process; it must process the data as it arrives
o  200+ business rules covering data validation, create/break linkages, and identify compliance issues within SLA
o  Progressively build the tree
•  The different processes required different access paths, sometimes requiring multiple copies of some portions of the data
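
To make "progressively build the tree" concrete, here is a small in-memory sketch of linking events as they arrive, including a child that shows up before its root and a linkage that is later corrected. The event_id / parent_id fields and the OrderGraph class are hypothetical; in the system described, the linkage would be stored in HBase rather than in Python dictionaries.

```python
from collections import defaultdict


class OrderGraph:
    """Toy linkage store: parent pointers plus a reverse child index."""

    def __init__(self):
        self.parent = {}                   # event_id -> parent_id (None for roots)
        self.children = defaultdict(list)  # parent_id -> child event ids

    def add(self, event_id, parent_id=None):
        """Insert an event, or correct the linkage of one already seen."""
        old = self.parent.get(event_id)
        if old is not None:
            self.children[old].remove(event_id)  # break the stale linkage
        self.parent[event_id] = parent_id
        if parent_id is not None:
            self.children[parent_id].append(event_id)

    def root_of(self, event_id):
        """Follow parent links; None means the chain ends at an unseen event."""
        seen = set()
        while event_id in self.parent and event_id not in seen:
            seen.add(event_id)
            nxt = self.parent[event_id]
            if nxt is None:
                return event_id
            event_id = nxt
        return None
```

A child added before its root simply resolves to None until the root arrives, after which the same lookup returns the root without any reprocessing of the child.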

Use Case ‘B’: Put Strategy
•  HFiles for the incremental processing didn’t fit as well here
•  Partitioned Batch Puts
•  memstore vs block cache (50/50)

Use Case ‘B’: Scan
•  Scan
o  Distinct Daily tables along with a single Historical table to more naturally support the processing
o  Scan Daily tables only
o  Switched from Get to Scan for rows with millions of columns

HBase Backup to S3
•  HBase ExportSnapshot to S3 didn’t really support our use case
•  Significant updates to ExportSnapshot for S3
o  Support for S3A (HADOOP-10400)
o  Remove the expensive rename operation on S3 (HBASE-11119)

Disaster Recovery
o  AWS provides multiple Availability Zones (AZ) in different geographic regions
o  HBase snapshots backed up to S3 and to a separate cluster in a different AZ
o  S3 buckets are backed up from one region to another for cross-region redundancy

Running Hadoop on AWS

Lessons Learned

Running Hadoop on AWS
•  S3
o  For now at least, s3a is probably the file system implementation you want to use (if you are not using EMR)
o  Rename is not a metadata-only operation on S3 (objects are copied, then deleted) and is therefore expensive
o  Eventual consistency should be accounted for
o  Consider turning S3 versioning on
•  Instance Types / Topology
o  The number of virtual instances on a single physical host impacts fault tolerance
o  Tradeoff between network performance and availability/capacity
•  Region - Availability Zone - Placement Group
o  Be aware that Availability Zone identifiers are intentionally inconsistent across accounts
