Data Modeling Fundamentals
What is Data Modeling?
- What is Data Modeling? Representation of what the data is in the real-world. Provide us insights into characteristics and rules that apply to the data.
Lifecycle of Data Modeling
OLTP – Online Transactional Processing
- Optimized for inserts and updates
- Optimal for application use
- Schema structure might fundamentally change for different application needs
- “Focus on customers entering data”
OLAP – Online Analytical Processing
- Optimized for heavy reads
- Optimal for business structure, understandable by business people
- Schema structure should be consistent and flexible for different business needs
The Building Blocks of Data Modeling
- Commonly called “entities”
- Others call it “objects” or “classes”
- Analogous to database tables
- Analogous to a database column
- Typically associated with Data Subjects
- Attribute types often “shared” across multiple entities“
- Bottom-up” data modeling once semi-popular
- Go down to all the attributes then model them back to entities
- If there are common attributes if you want you can make the attributes more descriptive.
Attributes have descriptions & rules
- Data types and size
- Whether NULL values allowed
- Permissible values
- Decompose but carefully
- Ex: “where a student lives
- ”“STUDENT_ADDR_AND_CITY” = Street address + City + State + Zip Code all in one attribute/field (NOT BEST PRACTICE)
- “STUDENT_ADDR” + “STUDENT_CITY” + “STUDENT_STATE” + “STUDENT_ZIP” each as a separate attribute/field (BEST PRACTICE)
Relationships among data subjects
- Defining the relationship amongst the entities
- Cardinality
- Mandatory or optional relationships
- Permissible attribute values(including NULLs)
- “Data change dynamics”
Hierarchies in Entities/Data Subjects
- Special kind of entity
- A special type of relationship
- Think “specializtion”
- Two or more entities
- That have “a lot” in common
- But also “at least a little bit” different
Strong vs Weak Entities/Data Subjects
Strong entity “exists on its own terms”
- Exists independent of any other entity
- Does not require any other entity instances to help identify its own instance
Weak entity “needs some help”
- To identify specific instance of that entity
- Can’t exist without an instance of another entity
- or both
Multiple Relationships Between Entities
- Multiple Relationships between 2 entities
-->Recursive Relationship
- Special Type of relationship that acts like an entity.
- Relationship that have specific attributes specific to the relationship itself
Cardinality
Cardinality: "the number of something"
Maximum Cardinality
- Number of instances of both sides of a relationship
- Typical values: “1” or “M”(many)
- Can also be a specific numeric value
Specific number of Cardinality
Minimum Cardinality
Sometimes referred to as “participation constraint”
- Total Participation ( Min. cardinality = 1)
- Partial Participation ( Min. cardinality = 0)
Mandatory vs optional relationship
- Mandatory relationship (Min. cardinality = 1)
- Optional relationship (Min. cardinality = 0)
3rd possible value for min. cardinality
- 0: optional/partial participation
- 1: mandatory/total participation
- (n): some explicit number of minimum instances“A full time lecturer must teach at least 6 classes”“A full professor must advise at least 2 other faculty members”
Crow's Foot Notations for Cardinality
Normalization
Normalization “The Key, the Whole Key, Nothing but the Key…”
- Every row (tuple) must be unique
- NO repeating groups
- Multi-valued attributes are in violation of 1N
- Must be in 1NF“
- No partial key dependencies”
- Applies if composite primary key
- If single-column (field) primary key then already in 2NF
- Must be in 2NF“
- No non-key dependency”
Forward Engineering
Typical conceptual -> logical transformaitions
- Address violations of normalization
Transform M:M relationships
- Add “intersection entity (or table)” to your model
- Also referred to as “associative entity (or table)”
- Or “bridge entity (or table)”
- Purpose: decompose M:M relationship into multiple “semantically equivalent” relationships
- Semantically equivalent…but “artificial”
Typical logical -> physical transformation
- Violating normalization rules deliberate for performance gains
- Aggregates
- Materialized Views
- Join across various tables that physically creates the result of a query
- Optimized storage placement(e.g. partitioning)
- Database Indices
Dimensional Modeling
"Dimensional Modeling is a design technique for databases intended to support end-user queries in a data warehouse"
Key Terms
- Artificially created keys (usually integers) used only by the data warehouse to uniquely identify a row of dimension table
- Required to implement history of slowly changing dimensions
- Avoids conflicts among backend source systems
- Insulates the data warehouse from source systems
- By what we measure things
- The who, what, when, where etc. of things
- What users would want to sort, group, and filter on
- Also called a Measure
- Measurable metric with is described by the dimensions
- An observation or event
- determines what each fact row contains and in what detail
- defined by dimensions in the fact table, and their details
Steps of Dimensional Modeling
Choose the business process
- Describe the business process which the model builds on.
- The grain of the model is the exact description of what the dimensional model should be focusing on. To clarify, you should pick the central process and describe it with one sentence
- The dimensions must be defined within the grain. Dimensions are the foundation of the fact table, and is where the data for the fact table is collected. Typically dimensions are nouns like date, store, inventory, etc.
- Identify numeric facts that will populate each fact table row
Star Schema
- Marriage of Fact Schema to dimension schema
- Dimensions relate directly to fact table only
- Dimensions are deformalized. Does not have a related region lookup tables as an OLTP design
- Usually dimension keys are NOT keys from the source systems, rather they are generated by the data warehouse load process(Surrogate Keys)
- Dimension attributes you define determine granularity called the grain of the facts
Snowflake Schema
- Dimensions relate to another dimension you have a snowflake
- Snowflake causes a number of performance and usability issues and are rarely justified
- The principle behind snowflaking is normalization of the dimension tables by removing low cardinality attributes and forming separate tables.
Slowly Changing Dimensions
- Never update. Keep the original value
- Useful for original based tracking
- Overwrite the record row
- Useful when history is not a factor
- Modeling and querying only by current state
- Reporting will reflect the current value only
- Maintain History
- Track complete history of dimension
- Adding 3 columns to maintain type 2 tables
- 'effective_date' - when the new row becomes the truth
- 'expiration_date' - when the row expired due to new update
- 'is_current' - is this row the current system truth
- Using surrogate key is recommended regardless but essential with Type 2 SCD
- Expiration date should be in the distant future(9999-01-01)
- Use only for true slowly changing dimensions
- Fast changing attributes lead to inflated dimension tables (Fix can be found using Type 4 SCD)
- Big Dimensions lead to poor performance and slow filtering
- Keep Limited History
- Add new column to table to keep old value
- old_value
- new_value
- Typically used for one-time, cross data change
- Maintain a separate history table
- Addresses Type 2 scaling issues
- Fast changing dimensions that are not facts
- they are dimensions for modeling purpose(filter by)
- they are dimensions for not changing fast enough
- The mini dimension is tracked through time via the fact table
- Creates dependence on the fact table to exist and never fundamentally change
- Separate changing values into mini dimensions
- Builds on type 4 SCD by embedding a mini-dimension table that is of type 1 SCD
- Allows for currently assigned mini-dimension tables to be accessed along with the base dimension table without being linked to a fact table
- combination of type 1, 2 and 3