Inferring Versioned Schemas from NoSQL Databases and its Applications

ER’15
Stockholm, October 2015
[{ "id": "90234af", "value": { "author": "Diego Sevilla Ruiz",
                               "e-mail": "dsevilla@um.es",
                               "institution": "U. of Murcia"}},
 { "id": "a243bb5", "value": { "author": "Severino Feliciano Morales",
                               "e-mail": "severino.feliciano@um.es",
                               "institution": "U. of Murcia"}},
 { "id": "096705d", "value": { "author": "Jesús García Molina",
                               "e-mail": "jmolina@um.es",
                               "institution": "U. of Murcia"}}]

Motivation
NoSQL Databases are Schemaless
Benefits
▶ No need to define a
Schema beforehand
▶ Non-uniform data
▶ Custom fields
▶ Non-uniform types
▶ Easier evolution
Drawbacks
▶ Harder to reason about
the DB
▶ Static checking is lost
▶ Some of the data logic is
in the application code
(more error prone)
▶ Some utilities need
Schema information to
work

Schemas for NoSQL Databases
▶ How to alleviate the problems of schemaless
databases? ⇒ Inferring a Schema
▶ The Schema Model contains information about
Entities and Relationships
▶ Take into account the different Entity Versions in
the Database
▶ Heterogeneity is usually due to slight variations in
Entities
▶ We obtain a precise database model
▶ The Schema allows us to automate the construction
of tools:
▶ migration, refactoring, visualization, …

Related Work
▶ JSON Schema
▶ Object versions and relationships are not considered
▶ Apache Spark SQL/Drill: SQL-like schemas
▶ Union of all fields, nullable ⇒ incorrect combinations
▶ Over-generalization to String
▶ Aggregations and Reference relations not considered
▶ MongoDB-Schema
▶ Prototype to infer schemas from MongoDB
collections
▶ Same limitations as Spark SQL
▶ JSON Discoverer
▶ An MDE solution to infer domain models from REST
web services (i.e. JSON documents)
▶ Not database-oriented; Object versions not
considered

Spark SQL Example
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
{"name":"Peter", "age":"tiny"}
{"name":"Martina", "address":"home!"}
> people.printSchema
root
|-- address: string (nullable = true)
|-- age: string (nullable = true)
|-- name: string (nullable = true)
▶ age promoted to string
▶ age and address are never part of the same object
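
To make the contrast concrete, here is a minimal sketch in plain Python (not Spark) that groups the five objects above by their exact set of fields and types instead of taking the union of all fields. The helper name `signature` and the use of Python type names are assumptions made for this illustration only.

```python
import json
from collections import defaultdict

# The five JSON objects from the Spark SQL example.
lines = [
    '{"name":"Michael"}',
    '{"name":"Andy", "age":30}',
    '{"name":"Justin", "age":19}',
    '{"name":"Peter", "age":"tiny"}',
    '{"name":"Martina", "address":"home!"}',
]

def signature(obj):
    """Return a hashable (field, type) signature for one object (hypothetical helper)."""
    return tuple(sorted((key, type(value).__name__) for key, value in obj.items()))

# Count how many objects share each signature: each signature is one candidate version.
versions = defaultdict(int)
for line in lines:
    versions[signature(json.loads(line))] += 1

# The union of all fields, which is what a single flat schema reports.
union_fields = sorted({field for sig in versions for field, _ in sig})

for sig, count in versions.items():
    print(count, dict(sig))
```

Keeping the signatures separate preserves two facts that the flat schema hides: `age` is an integer in some objects and a string in another, and no object carries both `age` and `address`.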

{
  "rows": [
    {
      "content": {
        "chapters": 33,
        "pages": 527
      },
      "authors": [
        {
          "company": {
            "country": "USA",
            "name": "IBM"
          },
          "name": "Grady Booch",
          "_id": "210"
        },
        {
          "company": {
            "name": "IBM"
          },
          "name": "James Rumbaugh",
          "_id": "310"
        },
        {
          "company": "Ivar Jacobson Consulting",
          "name": "Ivar Jacobson",
          "_id": "410"
        }],
      "type": "book",
      "year": 2013,
      "publisher_id": "345679",
      "title": "The Unified Modeling Language",
      "_id": "1"
    },
    {
      "discipline": "software engineering",
      "issn": [
        "0098-5589",
        "1939-3520"
      ],
      "name": "IEEE Trans. on Software Engineering",
      "type": "journal",
      "_id": "11"
    },
    {
      "name": "Automated Software Engineering",
      "issn": [
        "0928-8910",
        "1573-7535"
      ],
      "discipline": "software engineering",
      "type": "journal",
      "_id": "12",
      "number": 10515
    },
    {
      "city": "Barcelona",
      "name": "Omega",
      "type": "publisher",
      "_id": "123451"
    },
    {
      "city": "Newton",
      "name": "O'Reilly Media",
      "_id": "928672"
    },
    {
      "author": {
        "_id": "101",
        "name": "Bradley Holt",
        "company": {
          "name": "IBM Cloudant"
        }
      },
      "title": "Writing and Querying MapReduce Views in CouchDB",
      "_id": "2"
    },
    {
      "name": "Addison-Wesley",
      "_id": "345679"
    },
    {
      "journals": [
        "11",
        "12"
      ],
      "name": "IEEE Publications",
      "_id": "907863"
    }]}

NoSQL Database Model
▶ Objects (Entities) and Entity Versions
▶ Attributes
▶ Relationships
▶ Aggregation
▶ References
{
  "city": "Newton",
  "name": "O'Reilly Media",
  "_id": "928672"
},
{
  "author": {
    "_id": "101",
    "name": "Bradley Holt",
    "company": {
      "name": "IBM Cloudant"
    }
  },
  "title": "Writing and Querying MapReduce Views in CouchDB",
  "_id": "2"
},

Schema & Entity Versions Description
Entity Publisher {
Version 1 {
name: String
city: String
}
Version 2 {
name: String
}
Version 3 {
name: String
journal[+]: [Ref]->[Journal] (opposite=False)
}
}
Entity Journal {
Version 1 {
issn: Tuple [String, String]
name: String
discipline: String
}
Version 2 {
issn: Tuple [String, String]
name: String
discipline: String
number: int
}
}
Entity Book {
Version 1 {
title: String
year: int
publisher[1]: [Ref]->[Publisher] (opposite=False)
content[1]: [Aggregate]Content1
author[+]: [Aggregate]Author1
}
Version 2 {
title: String
publisher[1]: [Ref]->[Publisher] (opposite=False)
author[1]: [Aggregate]Author1
}
}
Entity Author {
Version 1 {
name: String
company[1]: [Aggregate]Company
}
Version 2 {
country: String
company: String
name: String
}
}
Entity Company {
Version 1 {
name: String
country: String
}
}
Entity Content {
Version 1 {
chapters: int
pages: int
}
}
[Figure, panels (a) and (b). Relationship labels: company [1..1], publisher [1..1], content [1..1], authors [1..*], journals [1..*]]

Solution Design Considerations
▶ We have to process all the objects in the Database
⇒ Map-Reduce
▶ Natural data processing on NoSQL databases
▶ Leverage MDE technologies
▶ Reuse EMF/Ecore tooling to show entity diagrams
▶ Automation & Code Generation by Metamodeling &
Model Transformations

Proposed MDE Architecture
[Figure: NoSQL Database → MapReduce → Object Versions (JSON) → JSON Injection → JSON Model (instance of the JSON Metamodel) → Schema Reverse Eng → Schema Model (instance of the Schema Metamodel) → Application Generation → Applications: Schema Viewer / Data Validator / Migration Assistant]

Reverse Engineering Process (i)
▶ Map-Reduce process
▶ Map: obtains the Raw Schema for each object
▶ Reduce: selects an archetype for each Entity Version
▶ Entity Type
▶ Root objects ⇒ "type" field or collection name
▶ Aggregated objects ⇒ key of the pair (e.g. “author”)
JSON object:
  {name: "Omega", city: "Barcelona"}
Raw Schema:
  {name: String, city: String}

JSON object:
  {title: "Writing and...",
   publisher_id: "928672",
   author: {name: "Bradley Holt",
            company: {country: "USA",
                      name: "IBM Cloudant"} } }
Raw Schema:
  {title: String,
   publisher_id: String,
   author: {name: String,
            company: {country: String,
                      name: String} } }
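
The Map and Reduce steps described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names (`raw_schema`, `map_doc`, `reduce_docs`), the choice to leave `_id` and `type` out of the Raw Schema, and keeping the first object seen as the archetype are all choices made for this sketch, not details confirmed by the slides.

```python
import json

def raw_schema(value):
    """Replace every primitive value with its type name, keeping the structure."""
    if isinstance(value, dict):
        return {key: raw_schema(val) for key, val in value.items()}
    if isinstance(value, list):
        return [raw_schema(item) for item in value]
    if isinstance(value, bool):  # checked before int: bool is a subclass of int
        return "Boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "String"
    return "null"

def map_doc(doc, collection):
    """Map: emit ((entity, raw schema), object) for one root object."""
    entity = doc.get("type", collection)  # "type" field, else the collection name
    body = {k: v for k, v in doc.items() if k not in ("_id", "type")}  # assumption
    key = (entity, json.dumps(raw_schema(body), sort_keys=True))
    return key, doc

def reduce_docs(pairs):
    """Reduce: keep one archetype object per (entity, raw schema) key."""
    archetypes = {}
    for key, doc in pairs:
        archetypes.setdefault(key, doc)  # first object seen becomes the archetype
    return archetypes

docs = [
    {"city": "Barcelona", "name": "Omega", "type": "publisher", "_id": "123451"},
    {"city": "Newton", "name": "O'Reilly Media", "_id": "928672"},
    {"name": "Addison-Wesley", "_id": "345679"},
]
archetypes = reduce_docs(map_doc(doc, "publisher") for doc in docs)
for (entity, schema), archetype in archetypes.items():
    print(entity, schema, "archetype:", archetype["_id"])
```

Serializing the Raw Schema with sorted keys makes two objects with the same fields in a different order fall into the same Entity Version.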

Reverse Engineering Process (ii)
▶ Attributes: primitive or tuple
▶ Aggregated Entities
▶ Value of the pair is an Object (or array of objects)
▶ Entity type inferred from the key
▶ References
▶ Heuristics/Conventions
▶ Key: <entity_name>_id
▶ Value: MongoDB's DBRef abstraction:
{"$ref": "<entity_name>", "$id": <id_value>}
▶ Honor cardinalities (arrays)
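
A minimal Python sketch of the two reference conventions listed above. The function name `find_references`, the output format, and the `reviewers` field in the sample object are hypothetical; other cases, such as an array of plain identifiers, are deliberately not covered here.

```python
def is_dbref(value):
    """True for a DBRef-shaped value: {"$ref": <entity_name>, "$id": <id_value>}."""
    return isinstance(value, dict) and "$ref" in value and "$id" in value

def find_references(obj):
    """Return (field, target entity, cardinality) for each detected reference."""
    refs = []
    for key, value in obj.items():
        if key != "_id" and key.endswith("_id"):
            # Key convention: <entity_name>_id
            refs.append((key, key[: -len("_id")], "1"))
        elif is_dbref(value):
            # Value convention: a single DBRef
            refs.append((key, value["$ref"], "1"))
        elif isinstance(value, list) and value and all(is_dbref(v) for v in value):
            # An array of DBRefs is a reference with cardinality "+"
            for target in sorted({v["$ref"] for v in value}):
                refs.append((key, target, "+"))
    return refs

book = {
    "_id": "1",
    "title": "The Unified Modeling Language",
    "publisher_id": "345679",
    # Hypothetical field, added only to exercise the DBRef convention.
    "reviewers": [{"$ref": "person", "$id": "7"}, {"$ref": "person", "$id": "8"}],
}
print(find_references(book))
```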

Example NoSQL Applications
▶ From the DBSchema model, using Model
Transformations and Model-to-Text transformations
(Code Generation), we can:
▶ Generate models that Characterize each Entity
Version
▶ That characterization can be used to Visualize the
Database
▶ And also to generate code to Validate objects
entering the Database
▶ Generate models that allow Database Migration to
the desired Entity Versions
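
As an illustration of the validation use case, the sketch below checks an incoming object against the two Journal versions from the schema description. It is hand-written rather than generated, and the table layout (`JOURNAL_VERSIONS`), the function name `matching_version`, and the decision to ignore `_id` and `type` are assumptions for this example.

```python
# Characterization of each Journal version as field -> expected Python type.
# issn is a tuple of strings in the schema; JSON arrays load as Python lists.
JOURNAL_VERSIONS = {
    1: {"issn": list, "name": str, "discipline": str},
    2: {"issn": list, "name": str, "discipline": str, "number": int},
}

def matching_version(obj, versions, ignored=("_id", "type")):
    """Return the version number whose fields and types match obj, else None."""
    fields = {k: v for k, v in obj.items() if k not in ignored}
    for number, spec in versions.items():
        same_fields = fields.keys() == spec.keys()
        if same_fields and all(isinstance(fields[k], t) for k, t in spec.items()):
            return number
    return None

journal_11 = {"discipline": "software engineering",
              "issn": ["0098-5589", "1939-3520"],
              "name": "IEEE Trans. on Software Engineering",
              "type": "journal", "_id": "11"}
journal_12 = dict(journal_11, name="Automated Software Engineering",
                  _id="12", number=10515)

print(matching_version(journal_11, JOURNAL_VERSIONS))
print(matching_version(journal_12, JOURNAL_VERSIONS))
print(matching_version({"name": "No ISSN"}, JOURNAL_VERSIONS))
```

An object that matches no version (the last call returns `None`) could be rejected or flagged as a new Entity Version, depending on the policy chosen.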

Type Discrimination/Characterization
Metamodel

Type Transformation Metamodel
db.<collection>.update(
  <query>,
  <update>,
  { multi: true }
)
▶ <query>: obtained by Entity Type Characterization
▶ <update>: generate the correct MongoDB update statement
using $set, $push, etc., possibly with user assistance
through a DSL
▶ For example, for Journal_1 to Journal_2:
$set: { "number": 1 }
Conclusions & Future work
▶ A process for obtaining Conceptual Model Schemas
for NoSQL Databases is shown
▶ The process takes into account the different Entity
Versions present in the Database
▶ An MDE process allows us to automate the
production of several applications from the Schemas
▶ Example applications that allow Database
Visualization and Migration are shown

Conclusions & Future work (ii)
▶ Future work includes:
▶ Building a NoSQL Database Tool Set (NoSQL Data
Engineering)
▶ DSL for Entity Version migration
▶ Refining the Schema to allow a richer Type System
▶ Allow value ranges or enumerated sets
▶ Infer attribute dependencies (derived attributes,
i.e. the value of an attribute dictates the value of
another attribute)
▶ etc.
