An Exploration of 3 Very Different
ML Solutions Running on Accumulo
By Gadalia O’Bryan and Aaron Cordova, presented by Don Miner
Talk Outline
• Introduction
• Koverse Accumulo Table Structures
• Supply Chain Risk
• Cyber Monitoring
• Forensic Document Search
• Questions
©Koverse
Introduction
• Our customers have an appetite for building very diverse ML solutions on Accumulo
• These solutions require varying interaction patterns with Accumulo
• We have found we are able to support these use cases using the same set of Accumulo table structures
Koverse Accumulo Table Structures
Record Table: Objectives
• Store records under a unique ID
• Optimized for reading newly written records in time order
• Bucket ID is prepended to distribute the newly written records evenly across tablet servers
• Also supports fetching records that match query criteria after consulting index table
Record Table: Key Components
Record ID (Bucket ID, Dataset ID, Timestamp)
Bucket ID   Dataset ID   Timestamp
000         Dataset A    1539458936
000         Dataset A    1539458937
000         Dataset B    1539458931
000         Dataset B    1539458933
001         Dataset A    1539458932
001         Dataset A    1539458935
001         Dataset B    1539458930
001         Dataset B    1539458932
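The key layout above can be sketched in a few lines. This is a minimal illustration, not the Koverse implementation: the underscore-joined format follows the record IDs shown on the index table slides, while `NUM_BUCKETS`, `bucket_for`, and the CRC-based bucket choice are assumptions made for the example.

```python
import zlib

NUM_BUCKETS = 2  # hypothetical; the slides only show buckets 000 and 001


def bucket_for(payload: bytes, num_buckets: int = NUM_BUCKETS) -> int:
    """Pick a bucket from a hash of the record payload (an assumption, not from the slides)."""
    return zlib.crc32(payload) % num_buckets


def record_id(bucket: int, dataset_id: str, timestamp: int) -> str:
    """Join bucket, dataset ID, and timestamp; zero-padding keeps string order equal to numeric order."""
    return f"{bucket:03d}_{dataset_id}_{timestamp:010d}"


ids = [
    record_id(1, "Dataset_B", 1539458930),
    record_id(0, "Dataset_A", 1539458937),
    record_id(0, "Dataset_B", 1539458931),
    record_id(0, "Dataset_A", 1539458936),
]

# Sorting the keys groups them by bucket, then dataset, then time.
for rid in sorted(ids):
    print(rid)
```

Because the bucket comes first in the key, writes for the same dataset land in several places in the sorted key space instead of one hot spot.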
Record Table: Organization
[Figure: Bucket 00 and Bucket 01, each holding Dataset A and Dataset B; records within each dataset are time ordered]
Record Table: Ingest
[Figure: Bucket 00 and Bucket 01, each holding Dataset A and Dataset B; writes go to the ends of many buckets]
Record Table: Bulk Reading
[Figure: Bucket 00 and Bucket 01, each holding Dataset A and Dataset B]
Sequential reads from many buckets
Process newest batches incrementally
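One way to picture incremental bulk reading is a per-bucket checkpoint: each pass seeks just past the last timestamp already processed and reads forward. The sketch below uses a sorted Python list as a stand-in for the record table; `read_new` and the `checkpoints` dictionary are hypothetical names, and the record IDs reuse the underscore format from the index table slides.

```python
from bisect import bisect_right, insort


def read_new(sorted_ids, buckets, dataset_id, checkpoints):
    """Return record IDs newer than each bucket's checkpoint and advance the checkpoints."""
    batch = []
    for bucket in buckets:
        prefix = f"{bucket:03d}_{dataset_id}_"
        last_ts = checkpoints.get(bucket, 0)
        # Seek just past the last timestamp already processed in this bucket.
        i = bisect_right(sorted_ids, f"{prefix}{last_ts:010d}")
        # Read sequentially until the key leaves this bucket/dataset prefix.
        while i < len(sorted_ids) and sorted_ids[i].startswith(prefix):
            batch.append(sorted_ids[i])
            checkpoints[bucket] = int(sorted_ids[i][len(prefix):])
            i += 1
    return batch


table = sorted([
    "000_Dataset_A_1539458936", "000_Dataset_A_1539458937",
    "000_Dataset_B_1539458931", "000_Dataset_B_1539458933",
    "001_Dataset_A_1539458932", "001_Dataset_A_1539458935",
    "001_Dataset_B_1539458930", "001_Dataset_B_1539458932",
])
checkpoints = {}
first_batch = read_new(table, [0, 1], "Dataset_A", checkpoints)

insort(table, "000_Dataset_A_1539458940")  # a newly ingested record
second_batch = read_new(table, [0, 1], "Dataset_A", checkpoints)

print(first_batch)
print(second_batch)
```

The second pass touches only the tail of each bucket, which is the property the time-ordered key is designed to give.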
Index Table: Objectives
• Store value-to-record-ID pairs
• Support point queries and range queries on values in specific fields, or in ‘any’ field
• Set intersection is done by the query client, which compares the sorted record IDs for each criterion, so matching record IDs do not need to fit in memory
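The client-side intersection described above can be sketched as a two-pointer walk over sorted streams. This is an illustrative sketch, not the Koverse client: `intersect_sorted` is a hypothetical name, and the sample ID lists are made up.

```python
from typing import Iterable, Iterator


def intersect_sorted(a: Iterable[str], b: Iterable[str]) -> Iterator[str]:
    """Yield IDs present in both sorted streams while holding one item per stream."""
    it_a, it_b = iter(a), iter(b)
    x, y = next(it_a, None), next(it_b, None)
    while x is not None and y is not None:
        if x == y:
            yield x
            x, y = next(it_a, None), next(it_b, None)
        elif x < y:
            x = next(it_a, None)  # advance the stream that is behind
        else:
            y = next(it_b, None)


# Hypothetical sorted record-ID streams, one per query criterion.
criterion_1 = ["000_A_10", "000_A_12", "001_A_11", "001_A_15"]
criterion_2 = ["000_A_12", "000_A_13", "001_A_15"]

print(list(intersect_sorted(criterion_1, criterion_2)))
```

Only the current item of each stream is held at any time, which is why the full list of matches never has to fit in memory. The walk is correct only if both inputs are sorted.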
Index Table
Index Entry
Dataset ID   Field      Type   Value   Record ID
Dataset A    $any       N      45      000_Dataset_A_1539458936
Dataset A    $any       N      63      003_Dataset_A_1539458929
Dataset A    $any       S      Bob     000_Dataset_A_1539458936
Dataset A    Age        N      45      000_Dataset_A_1539458936
Dataset A    Age        N      63      003_Dataset_A_1539458929
Dataset A    Name       S      Bob     000_Dataset_A_1539458936
Dataset B    $any       S      B0001   001_Dataset_B_1539458920
Dataset B    Order ID   S      B0001   001_Dataset_B_1539458920
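The entries above suggest that each field value is indexed twice: once under its own field name and once under `$any`. A minimal sketch of that fan-out follows; `index_entries` is a hypothetical helper, the record bodies are made up to match the table, and a real byte-sorted store would also need an order-preserving encoding for numbers, which is omitted here.

```python
def index_entries(dataset_id, record_id, record):
    """Emit (dataset, field, type, value, record ID) tuples for one record."""
    entries = []
    for field, value in record.items():
        # "N" for numeric values, "S" for everything else, as in the Type column.
        type_code = "N" if isinstance(value, (int, float)) else "S"
        for indexed_field in ("$any", field):
            entries.append((dataset_id, indexed_field, type_code, value, record_id))
    return entries


# Hypothetical record bodies chosen to reproduce the Dataset A rows.
records = [
    ("000_Dataset_A_1539458936", {"Name": "Bob", "Age": 45}),
    ("003_Dataset_A_1539458929", {"Age": 63}),
]

index = sorted(
    entry
    for rid, body in records
    for entry in index_entries("Dataset A", rid, body)
)
for entry in index:
    print(entry)
```

Sorting the tuples reproduces the row order of the slide, with `$any` entries ahead of the named fields.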
Index Table: Querying
SELECT * FROM TABLE WHERE
AGE > 40 AND AGE < 70 AND $any CONTAINS ‘Bob’
Index Table: Querying
SELECT * FROM TABLE WHERE
AGE > 40 AND AGE < 70 AND $any CONTAINS ‘Bob’
[Figure: matching record IDs from the Index Table feed a batch scan that fetches records from the Record Table (Bucket 00 and Bucket 01, each holding Dataset A and Dataset B)]
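The two-step lookup in this query can be sketched with a sorted list standing in for the index table and a dictionary standing in for the record table. Everything here is illustrative: `scan`, `INDEX`, and `RECORDS` are hypothetical names, the record bodies are made up, `CONTAINS` is treated as an exact term match, and a plain set intersection is used because the data is tiny.

```python
from bisect import bisect_left, bisect_right

INDEX = sorted([
    ("Dataset A", "$any", "N", 45, "000_Dataset_A_1539458936"),
    ("Dataset A", "$any", "N", 63, "003_Dataset_A_1539458929"),
    ("Dataset A", "$any", "S", "Bob", "000_Dataset_A_1539458936"),
    ("Dataset A", "Age", "N", 45, "000_Dataset_A_1539458936"),
    ("Dataset A", "Age", "N", 63, "003_Dataset_A_1539458929"),
    ("Dataset A", "Name", "S", "Bob", "000_Dataset_A_1539458936"),
    ("Dataset B", "$any", "S", "B0001", "001_Dataset_B_1539458920"),
    ("Dataset B", "Order ID", "S", "B0001", "001_Dataset_B_1539458920"),
])

RECORDS = {  # hypothetical record bodies keyed by record ID
    "000_Dataset_A_1539458936": {"Name": "Bob", "Age": 45},
    "003_Dataset_A_1539458929": {"Age": 63},
}


def scan(dataset, field, type_code, low, high):
    """Return index entries with low <= value <= high for one dataset and field."""
    start = bisect_left(INDEX, (dataset, field, type_code, low))
    stop = bisect_right(INDEX, (dataset, field, type_code, high, "\uffff"))
    return INDEX[start:stop]


# Step 1: consult the index once per criterion.
age_ids = {rid for (_, _, _, age, rid) in scan("Dataset A", "Age", "N", 40, 70)
           if 40 < age < 70}
bob_ids = {rid for (*_, rid) in scan("Dataset A", "$any", "S", "Bob", "Bob")}

# Step 2: intersect the record IDs, then fetch the matching records.
matching = sorted(age_ids & bob_ids)
results = [RECORDS[rid] for rid in matching]
print(matching, results)
```

Both Age entries fall inside the range, but only one of those records also contains the term Bob, so a single record is fetched.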
Composite Indexes
• Entries in the index table are for a single value found in a field
• Record IDs that have a particular value are stored together in sorted order
• Spanning a range of values means the Record IDs are no longer totally ordered, so we can’t do streaming set intersection
Value   Record ID
Bat     002
Bat     004
Bat     005
Bask    001
Bask    009
For Bat, Record IDs are sorted
For Ba*, Record IDs are NOT sorted
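The ordering problem can be checked directly on the five rows above. This is a small sketch with hypothetical helper names; note that a sorted store would list Bask before Bat, but the conclusion is the same in either order.

```python
postings = {  # value -> sorted record IDs, taken from the table above
    "Bat": ["002", "004", "005"],
    "Bask": ["001", "009"],
}


def is_sorted(ids):
    """True when every ID is less than or equal to the one after it."""
    return all(a <= b for a, b in zip(ids, ids[1:]))


def ids_for_prefix(prefix):
    """Concatenate posting lists in value order, as a range scan over the index would."""
    ids = []
    for value in sorted(postings):
        if value.startswith(prefix):
            ids.extend(postings[value])
    return ids


print(ids_for_prefix("Bat"), is_sorted(ids_for_prefix("Bat")))  # single value
print(ids_for_prefix("Ba"), is_sorted(ids_for_prefix("Ba")))    # range of values
```

A single value yields one sorted run, while the range yields two sorted runs placed back to back, which is what breaks a streaming intersection.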
Composite Indexes
• Composite indexes allow us to query for multiple ranges of values
• E.g., querying for points that fall in a specific latitude-longitude box, or querying for a time range and a range of port numbers
• In short, they enable apps to query a multi-dimensional range
Composite Indexes
Data workers specify which fields to include in a composite index
Record ID   Age   Height
001         14    67
002         23    72
003         13    60
004         64    68
Composite Value (interleaved bytes)   Record ID
1647                                  001
2732                                  002
1630                                  003
6648                                  004
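The composite values in the table can be reproduced by interleaving the decimal digits of Age and Height, Age first. The slide says bytes are interleaved; the sketch below works on two-digit decimal strings only because that is what the example values show, and `interleave` is a hypothetical name.

```python
def interleave(*values: int, width: int = 2) -> str:
    """Interleave the zero-padded decimal digits of each value, first value first."""
    digits = [f"{v:0{width}d}" for v in values]
    if any(len(d) != width for d in digits):
        raise ValueError(f"every value must fit in {width} digits")
    # zip(*digits) walks the digit columns: (a1, h1), (a2, h2), ...
    return "".join("".join(column) for column in zip(*digits))


rows = {"001": (14, 67), "002": (23, 72), "003": (13, 60), "004": (64, 68)}
for rid, (age, height) in rows.items():
    print(interleave(age, height), rid)
```

Interleaving puts the most significant digit of every field at the front of the key, so nearby points in both dimensions tend to sort near each other.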
Composite Index
[Figure: plot with Age and Height axes]
Age > 30 AND Age < 50 AND height > 50 AND height < 70
[Figure: the same Age and Height plot for this query, with points marked as True Positives and False Positives]
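The true and false positives can be illustrated with the same digit interleaving: scan every composite value between the interleaved lower and upper corners of the box, then decode each candidate and filter. This is a sketch under stated assumptions; the `people` data is hypothetical, `box_query` is an illustrative name, and two-digit decimal interleaving stands in for whatever byte encoding the real index uses.

```python
from bisect import bisect_left, bisect_right


def interleave(age: int, height: int) -> str:
    """Interleave two zero-padded two-digit values, Age first."""
    return "".join(a + h for a, h in zip(f"{age:02d}", f"{height:02d}"))


# Hypothetical records: record ID -> (Age, Height).
people = {"101": (35, 60), "102": (35, 90), "103": (45, 65),
          "104": (64, 68), "105": (14, 67), "106": (40, 20)}
index = sorted((interleave(a, h), rid) for rid, (a, h) in people.items())


def box_query(age_lo, age_hi, h_lo, h_hi):
    """Scan one contiguous key range, then split candidates into true and false positives."""
    start = bisect_left(index, (interleave(age_lo, h_lo),))
    stop = bisect_right(index, (interleave(age_hi, h_hi), "\uffff"))
    true_pos, false_pos = [], []
    for key, rid in index[start:stop]:
        # Decode the interleaved key back into its two fields.
        age, height = int(key[0::2]), int(key[1::2])
        if age_lo < age < age_hi and h_lo < height < h_hi:
            true_pos.append(rid)
        else:
            false_pos.append(rid)
    return true_pos, false_pos


true_pos, false_pos = box_query(30, 50, 50, 70)
print("true positives:", true_pos)
print("false positives:", false_pos)
```

The scanned key range is guaranteed to contain every point inside the box, but it also sweeps in points outside it, so the filter step is what turns candidates into results.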
Supply Chain Risk
Supply Chain Risk Use Case
• PricewaterhouseCoopers works with clients to evaluate risk in their supply chain
• E.g., vendors with unethical business practices, based out of sanctioned countries, history of tainted products, etc.
• Until now, analysis was very manual
• Could only evaluate each vendor every few years
• Not enough bandwidth to evaluate vendors’ vendors, or their vendors’ vendors
Automated ‘Know Your Vendor’ Solution
• Automatically evaluate vendor risk on a daily basis
• Chain through arbitrary levels of vendors in the supply chain
• Incorporate social media ML text analysis
Accumulo Table Features Leveraged
• Storage of various record schemas from Excel files, databases, web services, and social media
• Incremental batch processing to refresh results on a daily basis
Cyber Monitoring
Managed Cyber Security Services Use Case
• Fully managed service provided by a multinational cybersecurity company
• Threat monitoring, detection and mitigation
• The use of Accumulo allowed the company to scale their application, which had previously been built on PostgreSQL
• Security features of Accumulo allow the managed service to be multi-tenant
Accumulo Table Features Leveraged
• Streaming writes of cyber logs using Accumulo BatchWriters
• Bulk threat detection analytics on time-windowed event data
• Aggressive use of indexing, including composite indexes, to enable scalable log search on single terms and multiple ranges
Forensic Document Search
Forensic Document Search Use Case
• Investigation team at a large pharmaceutical company
• Analysts need to search for and retrieve all relevant documents related to a case
• Many users access the application on their mobile devices
• Documents come from personal laptops, databases, email attachments, shared drives, SharePoint, etc.
• OCR allows for search on evidence photos that contain text
Accumulo Table Features Leveraged
• Storage of various record schemas resulting from differing document formats and metadata
• Indexing of all terms from document text to enable term search
• Incremental batch NLP analytics on raw document records
Questions?

