Vital AI: Big Data Modeling

Big Data Modeling
Today:
Marc C. Hadfield, Founder
Vital AI
http://vital.ai
marc@vital.ai
917.463.4776

Big Data Modeling is Data Modeling with the “Variety” Big Data Dimension in mind…

Big Data “Variety” Dimension
The “Variety” problem can be addressed by a combination of improved tools and a methodology involving both system architecture and data science / analysis.
Compared to Volume and Velocity, Variety is a very labor-intensive, human-centric process.
Variety is the many types of data to be utilized together in a data-driven application.
Potentially too many types for any single person to keep track of (especially in Life Sciences).

Key Takeaways:
Using OWL as a “meta-schema” can drastically reduce operations/development effort and increase the value of the data for analysis.
OWL can augment, rather than replace, familiar development processes and tools.
A huge amount of ongoing development effort is spent transforming data across components and keeping data consistent during analysis.
Collecting Good Data = Good Analytics

Big Data Modeling:
Challenges
Goals
OWL as Modeling Language
Using OWL-based Models…
Collaboration/Modeling Tools

Examples from NYC Department of Education:
Domain Ontology
Application Architecture
Development Methodology/Tools

Data Models in:
Data Architecture
Data Science

Mobile/Web App Architecture
[Diagram: Mobile App, Server Implementation, Database; two separate Data Models]

Enterprise Data Warehouse Architecture
Schema “on read” or “on write”
[Diagram: four Databases, ETL Process, Master Database ("Data Lake"), Business Intelligence, Data Analytics, Dashboard; five separate Data Models]

Lambda Architecture + Hadoop: Data Driven App
[Diagram: Mobile App, Server Layer, Real Time Data, Calculated Views, Hadoop, Predictive Analytics, Master Database ("Data Lake"), Data Analytics, Dashboard; five separate Data Models]

Data Wrangling / Data Science
[Diagram: Master Database ("Data Lake"), Raw Data, R, Data Analytics; three separate Data Models]
Prediction Models must integrate back with the production environment.

Same Data, Different Contexts…
Redundant Models.

Data Architecture Issues
Redundant Data Definitions:
Database Schema
JSON Data
Data Object Classes
Avro/Parquet
Considerable Development / Maintenance / Operational Overhead

Data Science / Data Wrangling Issues
 
Data Harmonization: Merging Datasets from Multiple Sources
Loss of Context: Feature f123 = Column135 × Column45 / Column13
Side note: Let’s stop using CSV files for datasets!
No more flat datasets!
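
The loss-of-context point can be illustrated with a minimal JavaScript sketch. The column positions echo the example above; the values, the property names (`enrollment`, `attendanceRate`, `capacity`) and the function name are hypothetical, chosen only for illustration.

```javascript
// Flat dataset: a feature is computed from column positions, so nothing in
// the code says what the numbers mean. Indices and values are illustrative.
const flatRow = [];
flatRow[135] = 400; // enrollment? attendance? the row itself cannot say
flatRow[45] = 0.9;
flatRow[13] = 500;
const f123 = flatRow[135] * flatRow[45] / flatRow[13];

// Contextual dataset: the same values travel with their type and property
// names (hypothetical names, not taken from any real ontology).
const school = {
  type: "School",
  enrollment: 400,
  attendanceRate: 0.9,
  capacity: 500,
};

// The feature definition now documents itself.
function attendedSeatsPerCapacity(s) {
  return s.enrollment * s.attendanceRate / s.capacity;
}

console.log(f123, attendedSeatsPerCapacity(school));
```

Both computations yield the same number; only the second one can be read, reviewed, and reused without a separate column dictionary.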

Goals:
Reduce redundancy in Data Definitions
Enforce Clean/Harmonized Data
Use Contextual Datasets
Use Best Software Components (Databases, Analytics, …)
Use Familiar Tools (IDE, git, Languages, R)

Web Ontology Language (OWL)
Specifies an Ontology (“Data Model”)
Formal Semantics, W3C Standard
Provides a language to describe the meaning of data properties and how they relate to classes.
Example: Mammal
Necessary Conditions: warm-blooded, vertebrate animal, has hair or fur, secretes milk, (typically) live birth
Greater descriptive power than Schema (SQL Tables) and Serialization Frameworks (Avro)
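
As a rough illustration of what “necessary conditions” means, the Mammal example can be sketched in JavaScript. This is a simplified, closed-world check with hypothetical field names. Real OWL states such conditions as class axioms, evaluates them with a reasoner, and treats missing facts as unknown rather than false. The “(typically) live birth” condition is left out because it is not strictly necessary.

```javascript
// Hypothetical, simplified model of "necessary conditions": every individual
// asserted to be a Mammal must satisfy all of them.
const necessaryConditions = {
  Mammal: [
    { name: "warm-blooded", test: (x) => x.warmBlooded === true },
    { name: "vertebrate", test: (x) => x.vertebrate === true },
    { name: "has hair or fur", test: (x) => x.hasHair === true || x.hasFur === true },
    { name: "secretes milk", test: (x) => x.secretesMilk === true },
  ],
};

// Returns the names of the necessary conditions that an individual violates.
function violations(className, individual) {
  return necessaryConditions[className]
    .filter((c) => !c.test(individual))
    .map((c) => c.name);
}

const dog = { warmBlooded: true, vertebrate: true, hasFur: true, secretesMilk: true };
const lizard = { warmBlooded: false, vertebrate: true, hasFur: false, secretesMilk: false };

console.log(violations("Mammal", dog));
console.log(violations("Mammal", lizard));
```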

Why OWL?
If we can more formally specify what the data *means*, then a single data model (ontology) can apply to our entire architecture, and data can be transformed automatically and locally to suit the needs of a specific software module.
Manually coded data transforms may be “lossy” and/or introduce errors, so eliminating them helps keep data clean.

Why OWL? (continued)
Example: if we specify what a “Document” is, then a text-mining analyzer will know how to access the textual data without further prompting.
Example: if we specify Features for use in Machine Learning in the ontology, then features can be generated automatically to train Machine Learning Models, and the same features would be generated when we use the model in production.
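
The second example can be sketched as follows, assuming feature definitions are held in one shared structure. The `featureDefinitions` array and `toFeatureVector` function are hypothetical stand-ins for definitions that would live in the ontology; the two property names are borrowed from the progress-report fields used elsewhere in this deck, and the values are made up.

```javascript
// Hypothetical feature definitions, standing in for features specified in
// the ontology. Each one names the property it reads.
const featureDefinitions = [
  { name: "overallScore", compute: (r) => r.progReportOverallScore },
  { name: "percentileRank", compute: (r) => r.progReportPercentileRank / 100 },
];

// One function builds the feature vector for both training and production,
// so the two code paths cannot drift apart.
function toFeatureVector(record) {
  return featureDefinitions.map((f) => f.compute(record));
}

const trainingRecord = { progReportOverallScore: 71.5, progReportPercentileRank: 50 };
const productionRecord = { progReportOverallScore: 88, progReportPercentileRank: 25 };

console.log(toFeatureVector(trainingRecord));
console.log(toFeatureVector(productionRecord));
```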

Why OWL? (continued)
Note: Because ontologies can extend other ontologies, a collection of linked ontologies can be used instead of a single ontology, allowing segmentation across an organization.

Vital Core Ontology
Protege Editor…
Nodes, Edges, HyperNodes, HyperEdges get URIs
John/WorksFor/IBM —> Node / Edge / Node
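
A minimal sketch of the Node / Edge / Node pattern, assuming plain JavaScript objects. The `urn:example:` URIs and the `follow` helper are hypothetical and are not part of the Vital Core Ontology.

```javascript
// Every node and edge gets its own URI (hypothetical URIs for illustration).
const nodes = new Map([
  ["urn:example:john", { uri: "urn:example:john", type: "Person", name: "John" }],
  ["urn:example:ibm", { uri: "urn:example:ibm", type: "Company", name: "IBM" }],
]);

const edges = [
  {
    uri: "urn:example:worksFor1",
    type: "WorksFor",
    source: "urn:example:john",
    destination: "urn:example:ibm",
  },
];

// Follow outgoing edges of a given type from a node and return the targets.
function follow(sourceUri, edgeType) {
  return edges
    .filter((e) => e.source === sourceUri && e.type === edgeType)
    .map((e) => nodes.get(e.destination));
}

console.log(follow("urn:example:john", "WorksFor").map((n) => n.name));
```

Because the edge is an object with its own URI, it can carry properties of its own (for example a start date), which a bare subject/predicate/object statement cannot.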

Extending the Ontology
[Diagram: Vital Core Ontology, Vital Domain Ontology, Application Domain Ontology]

NYC Dept of Education Domain Ontology

Generating Data Bindings with VitalSigns:
[Diagram: Ontology as input to VitalSigns, with the generated bindings listed below]
Groovy Bindings
Semantic Bindings
Hadoop Bindings
Prolog Bindings
Graph Bindings
HBase Bindings
JavaScript Bindings
Code/Schema Generation
vitalsigns generate -ont name…

Data Representations
Groovy:
person123.name = "John"
person123.worksFor.company456
RDF:
<person123> <hasName> "John"
<worksFor123> <hasSource> <person123>
<worksFor123> <hasDestination> <company456>
<worksFor123> <hasType> <worksFor>
HBase:
person123, Node:type=Person, Node:hasName="John"
worksFor123, Edge:type=worksFor, Edge:hasSource=person123, Edge:hasDestination=company456
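
How one in-memory object could map onto the RDF and HBase forms can be sketched in JavaScript. The `toTriples` and `toRow` functions are hypothetical and only mimic the textual shapes shown on this slide; they are not the generated VitalSigns bindings.

```javascript
// An edge object in memory (identifiers follow the slide's example).
const edge = {
  id: "worksFor123",
  type: "worksFor",
  source: "person123",
  destination: "company456",
};

// RDF-style view: one triple per property, as in the slide's RDF lines.
function toTriples(e) {
  return [
    [e.id, "hasSource", e.source],
    [e.id, "hasDestination", e.destination],
    [e.id, "hasType", e.type],
  ].map(([s, p, o]) => `<${s}> <${p}> <${o}>`);
}

// Wide-row view: a row key followed by column=value pairs, as in the
// slide's HBase line.
function toRow(e) {
  return [
    e.id,
    `Edge:type=${e.type}`,
    `Edge:hasSource=${e.source}`,
    `Edge:hasDestination=${e.destination}`,
  ].join(", ");
}

console.log(toTriples(edge).join("\n"));
console.log(toRow(edge));
```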

VitalSigns
Generation: Domain Ontology -> JAR Library
[Diagram: VitalSigns Runtime with several Domain Ontology class libraries]

Developing with the Ontology in UI, Hadoop, NLP, Scripts, ...
Node:Person -> Edge:hasFriend -> Node:Person
Set<Friend> person123.getFriends()
Eclipse IDE

// Reference to an NYCSchool object
NYCSchool school123 = … // get from database

// Get a list of programs, local context (cache)
List<NYCSchoolProgram> localPrograms = school123.getPrograms()

// Get list of programs, global context (database)
List<NYCSchoolProgram> globalPrograms = school123.getPrograms(Context.ServiceWide)
JVM Development
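
The local versus service-wide distinction can be sketched in JavaScript. Everything here is a hypothetical stand-in: the `Context` values mirror the names used above, while `database`, `localCache`, and the `School` class are invented for illustration and do not reflect the actual VitalSigns API.

```javascript
// Hypothetical stand-ins: a local cache and a "database" of program links.
const Context = Object.freeze({ Local: "Local", ServiceWide: "ServiceWide" });

const database = {
  school123: ["programA", "programB", "programC"],
};
const localCache = {
  school123: ["programA"], // only what this process has already loaded
};

class School {
  constructor(id) {
    this.id = id;
  }

  // Default to the local context; callers opt in to the wider lookup.
  getPrograms(context = Context.Local) {
    const store = context === Context.ServiceWide ? database : localCache;
    return store[this.id] || [];
  }
}

const school = new School("school123");
console.log(school.getPrograms());
console.log(school.getPrograms(Context.ServiceWide));
```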

Using JSON-Schema Data in JavaScript
for (var i = 0; i < progressReports.length; i++) {
    var r = progressReports[i];
    // Build a list of this progress report's grades and scores
    var sub = $('<ul>');
    sub.append('<li>Overall Grade : ' + r.progReportOverallGrade + '</li>');
    sub.append('<li>Progress Grade: ' + r.progReportProgressGrade + '</li>');
    sub.append('<li>Environment Grade: ' + r.progReportEnvironmentGrade + '</li>');
    sub.append('<li>College and Career Readiness Grade: ' + r.progRepCollegeAndCareerReadinessGrade + '</li>');
    sub.append('<li>Performance Grade: ' + r.progReportPerformanceGrade + '</li>');
    sub.append('<li>Closing the Achievement Gap Points: ' + r.progReportClosingTheAchievementGapPoints + '</li>');
    sub.append('<li>Percentile Rank: ' + r.progReportPercentileRank + '</li>');
    sub.append('<li>Overall Score: ' + r.progReportOverallScore + '</li>');
}

NoSQL Queries
Query API / CRUD Operations
Queries generated into “native” NoSQL Query format:
SPARQL / Triplestore (AllegroGraph)
HBase / DynamoDB
MongoDB
Hive/HiveQL (on Spark/Hadoop 2.0)
Query Types: “Select” and “Graph”
Abstract the type of datastore away from application/analytics code
Pass in a “native” query when necessary
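
Generating native queries from one abstract query might look like the sketch below. The abstract query shape and both translator functions are hypothetical, not the Vital query API. The SPARQL output uses bare names as placeholders instead of full IRIs, and only equality constraints are handled.

```javascript
// A hypothetical abstract "select" query: a type plus equality constraints.
const query = { type: "Person", where: { hasName: "John" } };

// Translate to a SPARQL string (prefixes omitted; bare names are
// placeholders rather than valid IRIs).
function toSparql(q) {
  const patterns = [`?s a <${q.type}> .`];
  for (const [prop, value] of Object.entries(q.where)) {
    patterns.push(`?s <${prop}> ${JSON.stringify(value)} .`);
  }
  return `SELECT ?s WHERE { ${patterns.join(" ")} }`;
}

// Translate to a MongoDB-style filter document.
function toMongoFilter(q) {
  return { type: q.type, ...q.where };
}

console.log(toSparql(query));
console.log(toMongoFilter(query));
```

Application code builds only the abstract query; swapping the datastore then means swapping the translator, not rewriting call sites.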

Data Serialization, Analytics Jobs
Data Serialized into file format by blocks of objects
Leverage Hadoop Serialization Standards:
Sequence File, Avro, Parquet
Get data in and out of HDFS Files
Spark/Hadoop jobs passed a set of objects as input
URI of object is key
Data Objects are serialized into Compressed Strings for transport over Flume, etc.
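
The idea of URI-keyed objects serialized into compressed strings can be sketched with Node.js built-ins. JSON plus gzip plus base64 is an assumption made for this example; the slides name Sequence File, Avro, and Parquet and do not specify the actual string encoding. The URIs and fields are hypothetical.

```javascript
const zlib = require("zlib");

// Serialize one data object to a compressed, base64-encoded string.
function serialize(obj) {
  const json = JSON.stringify(obj);
  return zlib.gzipSync(Buffer.from(json, "utf8")).toString("base64");
}

// Reverse the steps: base64 -> gunzip -> JSON.
function deserialize(text) {
  const json = zlib.gunzipSync(Buffer.from(text, "base64")).toString("utf8");
  return JSON.parse(json);
}

// A block of objects keyed by URI (hypothetical URIs and fields).
const objects = [
  { uri: "urn:example:person123", type: "Person", name: "John" },
  { uri: "urn:example:company456", type: "Company", name: "IBM" },
];
const block = new Map(objects.map((o) => [o.uri, serialize(o)]));

console.log(deserialize(block.get("urn:example:person123")).name);
```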

Machine Learning
Via Hadoop, Spark, R
Mahout, MLlib
Build Predictive Models
Classification, Clustering...
Use Features defined in Ontology
Learn Target defined in Ontology
Models consume Ontology Data as input

Natural Language Processing/Text Mining
Topic Categorization…
Extract Entities…
Text Features from Ontology Classes extending Document…

Graph Analytics
GraphX, Giraph: PageRank,
Centrality, Interest Graph, …

Inference / Rules
Use Semantic Web Rule Engines / Reasoners
Load Ontology + RDF Representation of Data Instances (Individuals)
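
To give a feel for what a reasoner does once ontology and instance data are loaded together, here is a toy forward-chaining sketch with a single subclass rule. It is far simpler than a real Semantic Web rule engine. `NYCSchoolProgram` appears elsewhere in this deck; the `Program` and `Thing` classes and the `program1` individual are hypothetical.

```javascript
// Ontology axioms and instance data as simple triples.
const triples = [
  ["NYCSchoolProgram", "subClassOf", "Program"],
  ["Program", "subClassOf", "Thing"],
  ["program1", "type", "NYCSchoolProgram"],
];

// Apply one rule until nothing new is derived:
//   (x type C) and (C subClassOf D)  =>  (x type D)
function inferTypes(input) {
  const known = new Set(input.map((t) => t.join(" ")));
  const result = [...input];
  let changed = true;
  while (changed) {
    changed = false;
    for (const [x, p1, c] of [...result]) {
      if (p1 !== "type") continue;
      for (const [c2, p2, d] of [...result]) {
        if (p2 !== "subClassOf" || c2 !== c) continue;
        const key = `${x} type ${d}`;
        if (!known.has(key)) {
          known.add(key);
          result.push([x, "type", d]);
          changed = true;
        }
      }
    }
  }
  return result;
}

// Collect every type asserted or inferred for program1.
const typesOfProgram1 = inferTypes(triples)
  .filter(([s, p]) => s === "program1" && p === "type")
  .map(([, , o]) => o);
console.log(typesOfProgram1);
```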

R Analytics
Load Serialized Data into R Dataframes
Reference Classes and Properties by Name in Dataframes
(cleaner code than huge number of columns)

Graph Visualization with Cytoscape
Data already in Node/Edge Graph Form

Visualize Data “Hot Spots”

NYC Schools Architecture
[Diagram: Mobile App, JSON Schema, VertX, Vital Flow Queue, Rule Engine, NLP, DynamoDB, Vital Prime, VitalService Client, NYC Schools Data Model, R, Serialized Data, Data Insights]

Collaboration/Tools
git - code revision system
OWL Ontologies treated as code artifact
Coordinate across Teams: 
“Front End”, “Back End”, “Operations”,
“Business Intelligence”, “Data Science”…
Coordinate across Enterprise:
Departments / Business Units
“Data Model of Record”

Ontology Versioning
NYCSchoolRecommendation-0.1.8.owl
Semantic Versioning (http://semver.org/)

vitalsigns command line
vitalsigns generate
vitalsigns upversion/downversion
code/schema generation
increase version patch number 
move previous version to archive 
rename OWL file including username
JAR files pushed into Maven (Continuous Integration)
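
The “increase version patch number” step can be sketched as a file-name transformation. This is an assumption about the naming pattern based on the example file `NYCSchoolRecommendation-0.1.8.owl`; the `bumpPatch` function is hypothetical and is not how `vitalsigns upversion` is implemented.

```javascript
// Parse "<name>-<major>.<minor>.<patch>.owl" and return the file name for
// the next patch version.
function bumpPatch(fileName) {
  const match = /^(.+)-(\d+)\.(\d+)\.(\d+)\.owl$/.exec(fileName);
  if (!match) {
    throw new Error(`Unrecognized ontology file name: ${fileName}`);
  }
  const [, name, major, minor, patch] = match;
  return `${name}-${major}.${minor}.${Number(patch) + 1}.owl`;
}

console.log(bumpPatch("NYCSchoolRecommendation-0.1.8.owl"));
```

Treating the patch as a number, not a string, matters once it passes 9: 0.1.9 must become 0.1.10, not 0.1.91.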

Git Integration
git: add, commit, push, pull
diff: determine differences
merge: merge two Ontologies
detect types of Ontology changes
merge into new patch version
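
Detecting types of ontology changes might be approached as below. Reducing each ontology version to a map of property name to datatype is a deliberate simplification, and the `diffProperties` function and the property names are hypothetical. A real OWL diff would also compare classes, restrictions, and annotations.

```javascript
// Compare two versions, each reduced to a map of property name -> datatype,
// and classify every difference as added, changed, or removed.
function diffProperties(oldProps, newProps) {
  const changes = [];
  for (const name of Object.keys(newProps)) {
    if (!(name in oldProps)) {
      changes.push({ kind: "added", name });
    } else if (oldProps[name] !== newProps[name]) {
      changes.push({ kind: "changed", name });
    }
  }
  for (const name of Object.keys(oldProps)) {
    if (!(name in newProps)) {
      changes.push({ kind: "removed", name });
    }
  }
  return changes;
}

const v1 = { hasName: "string", hasGrade: "string" };
const v2 = { hasName: "string", hasGrade: "integer", hasScore: "double" };
console.log(diffProperties(v1, v2));
```

Classifying changes this way is what would let a merge tool decide whether two edits can be combined automatically: additions rarely conflict, while changes to or removals of the same property need review.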

OWL as Data Modeling Language: 
Data Architecture & Data Science / Analytics
Conclusions
Leverage Existing Tools, Components
Reduce model redundancy, reduce effort.
A Means to Collaborate Across Teams: Data Model of Record
Cleaner Data
Integrate additional analysis

For more information, please contact:
Marc C. Hadfield
http://vital.ai
marc@vital.ai
917.463.4776
Thank You!
