Copyright © 2012, Oracle and/or its affiliates. All rights reserved.
MySQL Cluster page management
Frazer Clement
MySQL Cluster Technical lead
frazer.clement@oracle.com
messagepassing.blogspot.com
November 2014
Copyright © 2014, Oracle and/or its affiliates. All rights reserved.
Memory types
● MySQL Cluster data nodes allocate all memory at initialisation.
● The bulk of the allocated memory is commonly DataMemory (DM) and IndexMemory (IM), which are separate pools for historic reasons.
● Both are managed as pages: 8kB pages for IndexMemory and 32kB pages for DataMemory.
● Confusingly, IndexMemory pages are only used for the built-in primary key hash index for each fragment replica.
● This is literally only a hash table; the keys are stored externally (in DataMemory pages).
● DataMemory pages are used to store primary keys and other columns, as well as ordered index T-tree nodes.
● Occasionally we 'borrow' DM pages for other reasons.
Note: Secondary unique indices use both DM and IM, as they are implemented as tables.
Page based allocation
● Index and Data Memory pages are allocated to fragments as necessary to handle growth due to inserts and updates.
● Both are freed back to the shared pools when fragments no longer need them (deletes).
● Fragments use DM pages for storing either fixed-size data or variable-sized data.
● In both cases there is a 128-byte per-page header, leaving 32768 - 128 = 32640 bytes usable per page (~0.4% overhead).
● Most storage is handled in terms of 32-bit words, so there are 32640 / 4 = 8160 words usable per page.
● The usable space within fixed-size and var-sized pages is handled differently.
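The page arithmetic above can be checked with a few lines of Python. This is only a worked calculation using the figures quoted on the slide; the constant names are illustrative and are not NDB identifiers.

```python
PAGE_BYTES = 32 * 1024   # 32kB DataMemory page
HEADER_BYTES = 128       # per-page header
WORD_BYTES = 4           # storage is handled in 32-bit words

usable_bytes = PAGE_BYTES - HEADER_BYTES
usable_words = usable_bytes // WORD_BYTES
overhead_pct = 100.0 * HEADER_BYTES / PAGE_BYTES

print(usable_bytes)            # 32640
print(usable_words)            # 8160
print(round(overhead_pct, 2))  # 0.39
```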
Page management
● Every row has a fixed-size part containing the tuple header and any fixed-size columns.
● For tables with var-sized columns (VARCHAR, BINARY, BLOB, TEXT, dynamic columns), every row has a variable-sized part containing the var-sized columns.
● Each fragment replica has:
  ● A logical-to-physical page map, mapping per-fragment logical fixed-size page ids to physical 32kB pages
  ● A freelist of fixed-size pages with free space
  ● Five size-binned freelists of var-sized pages with free space
● Allocation involves 1) finding or allocating a page to allocate from, and 2) finding a space on that page to use.
Pages
[Figure: a fragment with its Fixed freelist, Var freelists 1-4 and L2Pmap, pointing at physical pages containing fixed-size parts and physical pages containing var-sized parts. Physical pages shown: 1234, 983, 600, 1235, 786.]
L2Pmap (logical page id → physical page id): 0 → 1234, 1 → 1235, 2 → 600, 3 → -, 4 → 983, 5 → 786, ...
Rows are externally located via a RowId (Table:Fragment:Page:Index). Every row has a fixed-size part and an optional var-sized part. Fixed-size parts refer to var-sized parts. Page allocation is managed by the fragment, and each page manages its own free space.
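As a rough illustration of how a RowId is resolved through the logical-to-physical page map, here is a minimal Python sketch. The map values are the ones shown in the figure; the RowId field names, the table and fragment ids in the usage line, and the function name are all hypothetical and do not come from the NDB source.

```python
from typing import NamedTuple


class RowId(NamedTuple):
    table: int
    fragment: int
    page: int    # logical fixed-size page id within the fragment
    index: int   # position of the fixed-size part within that page


# Logical -> physical page map for one fragment replica, using the values
# from the figure. None marks a logical page with no physical page mapped.
L2P_MAP = {0: 1234, 1: 1235, 2: 600, 3: None, 4: 983, 5: 786}


def physical_page(row_id, l2p_map):
    """Return the physical page id that holds the row's fixed-size part."""
    physical = l2p_map.get(row_id.page)
    if physical is None:
        raise KeyError(f"logical page {row_id.page} is not mapped")
    return physical


# Hypothetical row in table 10, fragment 0, logical page 2, index 16.
print(physical_page(RowId(10, 0, 2, 16), L2P_MAP))  # 600
```

Because only the logical page id appears in the RowId, each replica is free to place that logical page on whatever physical page it has available.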
Fixed size pages
● For fixed-size elements, the usable space is treated as storage for an array of the fixed-size elements. There can therefore be up to element_size - 1 words permanently wasted at the end of the page.
● The free elements within a fixed-size page are linked together into a per-page free list.
● Fixed-size pages with one or more free elements are linked together in a per-fragment-replica 'pages with space' list.
● Elements have an index within the page, which is their word offset.
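The following toy model sketches a fixed-size page as an array of slots with a per-page free list. It is an illustration only: the class and method names are invented, the free list is a Python list rather than links stored inside the page, and the assumption that word offsets are counted from the start of the usable area is mine.

```python
class FixedPage:
    """Toy model of a fixed-size page: an array of equal-sized slots."""

    def __init__(self, element_words, usable_words=8160):
        if not 0 < element_words <= usable_words:
            raise ValueError("element does not fit in a page")
        self.element_words = element_words
        self.slots = usable_words // element_words
        # Words left over at the end of the array; at most element_words - 1.
        self.wasted_words = usable_words % element_words
        # Free list of element indexes (word offsets). Reversed so that
        # pop() hands out the lowest offset first.
        self.free = [n * element_words for n in range(self.slots)][::-1]

    def has_space(self):
        """True if the page belongs on the fragment's 'pages with space' list."""
        return bool(self.free)

    def alloc(self):
        """Take a free slot and return its index (word offset)."""
        if not self.free:
            raise MemoryError("page is full")
        return self.free.pop()

    def release(self, index):
        """Return a slot to the per-page free list."""
        self.free.append(index)


page = FixedPage(element_words=7)
print(page.slots, page.wasted_words)  # 1165 5
first = page.alloc()                  # word offset 0
second = page.alloc()                 # word offset 7
page.release(first)
print(page.alloc())                   # 0 (the freed slot is reused)
```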
Variable size pages
● For variable-size elements, the usable space in each page is split into an index (at the end of the page) and the variable-length parts it refers to, which grow up from the start of the page.
● The index can grow and shrink as the number of elements changes.
● New inserts are made at the insert position (append only).
● The last inserted element can grow efficiently.
● The index entries are on a freelist, similar to the fixed-page slots.
Variable size pages (continued)
● If an element other than the last wants to grow, or there is not enough space after the insert position for a new element, the page is reorganised automatically.
● Reorganisation compacts the in-use parts together, making the free space contiguous in the 'middle' of the page.
● The index entries stay in the same positions, so external references to a stored var-part are unchanged.
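The interplay between append-only inserts, the index, and reorganisation can be sketched with a small toy model. Everything here is an assumption made for illustration: the class and method names are invented, each index entry is charged one word, the index is a Python list rather than a region at the end of the page, and index shrinking and in-place growth are left out.

```python
class VarPage:
    """Toy model of a var-sized page: parts grow up, index entries are stable."""

    def __init__(self, usable_words=8160):
        self.usable_words = usable_words
        self.index = []      # index[i] = (offset, length), or None if free
        self.insert_pos = 0  # next append position, in words

    def _free_after_insert_pos(self):
        # Contiguous free space between the insert position and the index.
        return self.usable_words - len(self.index) - self.insert_pos

    def reorganise(self):
        """Compact live parts towards the start; index slot numbers are kept."""
        live = sorted((e[0], i) for i, e in enumerate(self.index) if e is not None)
        pos = 0
        for _, i in live:
            length = self.index[i][1]
            self.index[i] = (pos, length)
            pos += length
        self.insert_pos = pos

    def insert(self, length):
        """Append a part of `length` words and return its index slot."""
        if None in self.index:
            slot, extra = self.index.index(None), 0  # reuse a free index entry
        else:
            slot, extra = len(self.index), 1         # index grows by one entry
        if self._free_after_insert_pos() - extra < length:
            self.reorganise()
            if self._free_after_insert_pos() - extra < length:
                raise MemoryError("not enough space in this page")
        if extra:
            self.index.append(None)
        self.index[slot] = (self.insert_pos, length)
        self.insert_pos += length
        return slot

    def delete(self, slot):
        self.index[slot] = None


page = VarPage(usable_words=20)
a = page.insert(6)
b = page.insert(6)
c = page.insert(5)
page.delete(a)
d = page.insert(6)       # no room after the insert position: page is compacted
print(b, page.index[b])  # 1 (0, 6)
print(d, page.index[d])  # 0 (11, 6)
```

After the compaction, part b has moved to a different offset but is still reached through index slot 1, which is the property that keeps external references valid.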
Row allocation details and constraints
● Goal: efficiency in the common case.
● The RowId (logical page, fixed-page index) is the same in all replicas of a fragment (on different nodes). This is required for optimised node recovery, where only rows changed since a node failed are copied across.
● Pages can only be freed when they are entirely empty.
● Pages are freed to the global pool (so they can be used by other tables, etc.).
● Var-sized page content is reorganised regularly (within a page), preserving external references via an index.
(De)Fragmentation
Fixed-size pages
● Potential permanent waste at the end of the fixed-size array. This cannot currently be avoided, but the space could perhaps be used to store extra data per row for 'free'. Equally, a small reduction in (fixed) row length can gain more capacity than expected.
● Unused 'slots' in pages, left by deleted rows when no new rows arrive to take the space. Currently this can only be solved by dumping and restoring the data. All fragment replicas must change atomically, as the RowId must be the same across them. Feature development is required to implement online defragmentation here.
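To see why a small reduction in fixed row length can gain more capacity than expected, consider a worked example with 8160 usable words per page. The row sizes below are chosen purely to show the step effect and are not taken from the slides.

```python
USABLE_WORDS = 8160  # usable 32-bit words in a 32kB page after the header


def rows_per_page(fixed_part_words):
    """Number of fixed-size parts that fit in one page."""
    return USABLE_WORDS // fixed_part_words


for words in (2041, 2040):
    rows = rows_per_page(words)
    wasted = USABLE_WORDS - rows * words
    print(words, rows, wasted)
# 2041 3 2037
# 2040 4 0
```

Shrinking the fixed part by a single word moves it across a divisor boundary, so each page holds four rows instead of three and the end-of-page waste disappears.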
(De)Fragmentation (continued)
Var-sized pages
● Index with lots of free space. Some index shrinking is applied already; the waste percentage is not high.
● Free space fragmentation within each page. Handled automatically as needed.
● Fragmentation across pages. There is lots of free var-sized space, but not enough in any one page. OPTIMIZE TABLE solves this. It is also solved by a rolling node restart, backup and restore, etc.
OPTIMIZE TABLE attempts to move every var-part not in a full var-sized page into a different, 'better fitting' page. The goal is to fill some pages and free others.
Note: Var-sized pages can be defragmented online using OPTIMIZE TABLE.
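The idea of filling some pages in order to free others can be illustrated with a toy repacking routine. This is not the algorithm OPTIMIZE TABLE uses; it is a simple greedy best-fit heuristic invented for this example, with pages modelled as lists of var-part sizes in words and a hypothetical function name.

```python
def repack(pages, capacity):
    """Greedy toy repacking of var-parts across pages.

    pages: list of lists of var-part sizes (in words).
    Returns the pages that still hold data; emptied pages are dropped,
    standing in for pages returned to the global pool.
    """
    pages = [list(p) for p in pages]
    # Visit source pages from emptiest to fullest.
    for src in sorted(range(len(pages)), key=lambda i: sum(pages[i])):
        for part in sorted(pages[src], reverse=True):
            best, best_free = None, None
            for dst in range(len(pages)):
                if dst == src or not pages[dst]:
                    continue  # never move into the source or an empty page
                free = capacity - sum(pages[dst])
                # Best fit: the target with the least free space that still fits.
                if part <= free and (best is None or free < best_free):
                    best, best_free = dst, free
            if best is not None:
                pages[src].remove(part)
                pages[best].append(part)
    return [p for p in pages if p]


before = [[6], [3], [3], [5]]       # four half-empty pages of 10 words each
after = repack(before, capacity=10)
print(after)                        # [[6, 3], [5, 3]]
```

In this example four partly used pages become two fuller ones, and the two emptied pages could be handed back for use by other tables.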
Optimize
[Figure: the fragment diagram (Fixed freelist, Var freelists 1-4, L2Pmap, physical pages containing fixed-size parts and var-sized parts). BEFORE: Fragmentation of Fixed and Var sized tables.]
Optimize (continued)
[Figure: the fragment diagram (Fixed freelist, Var freelists 1-4, L2Pmap, physical pages containing fixed-size parts and var-sized parts) after optimisation.]
AFTER: A var-part has been moved, filling an existing page, so that page is no longer on a freelist. The source page is now empty, so it is returned to the pool. Internal fragmentation within var-sized pages is not necessarily affected. Fixed-size page fragmentation remains.
Monitoring usage
Prior to 7.4:
● ndb_mgm> ALL REPORT MEMORY
  Total IM and DM in use in each data node.
● shell> ndb_desc -d <database> <table> -p -n
  Total fixed-size and var-sized DM pages allocated per fragment (primary only).
● mysql> SELECT AVG_ROW_LENGTH FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME='<my_tab>';
  Ndb currently reports the size of the fixed part of rows (in bytes) as AVG_ROW_LENGTH. It can therefore be used to determine the number of rows per page (32640 / AVG_ROW_LENGTH), which can then be used to determine the level of fixed-size page fragmentation.
Note: It is difficult to determine var-sized fragmentation without scanning the whole table using LENGTH() and summing. Balance information is not available.
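The fixed-size fragmentation estimate described above can be written as a short calculation. This is a sketch under stated assumptions: the row count, page count and row length would come from the commands listed on the slide, the numbers in the usage line are made up, and the function name is hypothetical.

```python
USABLE_BYTES = 32640  # usable bytes per 32kB DataMemory page


def fixed_page_fill(row_count, fixed_pages, avg_row_length):
    """Return (rows_per_page, fill_ratio) for the fixed-size pages.

    avg_row_length is the fixed-part size in bytes, as reported in
    AVG_ROW_LENGTH. A fill_ratio well below 1.0 suggests unused slots.
    """
    if avg_row_length <= 0 or fixed_pages <= 0:
        raise ValueError("row length and page count must be positive")
    rows_per_page = USABLE_BYTES // avg_row_length
    capacity = rows_per_page * fixed_pages
    return rows_per_page, row_count / capacity


# Made-up figures: 40-byte fixed parts, 100 fixed-size pages, 40800 rows.
print(fixed_page_fill(40800, 100, 40))  # (816, 0.5)
```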
Monitoring usage (continued)
From 7.4:
mysql> SELECT * FROM ndbinfo.memory_per_fragment;
● Per-fragment-replica DM and IM usage.
● Correlated to node and LDM instance, which is good for checking balance.
● Explicit information on fixed-size and var-sized free space, which can act as a trigger for online reorganisation or other action.
● Normal SQL can be used to compare across replicas, or to group by table, node, LDM, nodegroup, etc.
● Can be sampled periodically to spot trends, track rates of change, etc.
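Once rows from ndbinfo.memory_per_fragment have been fetched by whatever client is in use, a balance check across nodes is a small aggregation. The sketch below works on plain dictionaries and makes no database calls. The column names are as I recall them from the ndbinfo documentation and should be verified against the installed version; the sample rows are invented.

```python
from collections import defaultdict


def free_fraction_by_node(rows):
    """Fraction of allocated fixed + var element bytes that is free, per node."""
    alloc = defaultdict(int)
    free = defaultdict(int)
    for row in rows:
        node = row["node_id"]
        alloc[node] += row["fixed_elem_alloc_bytes"] + row["var_elem_alloc_bytes"]
        free[node] += row["fixed_elem_free_bytes"] + row["var_elem_free_bytes"]
    return {node: free[node] / alloc[node] for node in alloc if alloc[node]}


# Invented sample rows, one fragment replica per node.
sample = [
    {"node_id": 1, "fixed_elem_alloc_bytes": 65536, "fixed_elem_free_bytes": 16384,
     "var_elem_alloc_bytes": 32768, "var_elem_free_bytes": 8192},
    {"node_id": 2, "fixed_elem_alloc_bytes": 65536, "fixed_elem_free_bytes": 49152,
     "var_elem_alloc_bytes": 32768, "var_elem_free_bytes": 0},
]
print(free_fraction_by_node(sample))  # {1: 0.25, 2: 0.5}
```

A large difference between nodes that hold replicas of the same fragments, as in this invented sample, is the kind of imbalance the per-fragment view makes visible.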
Monitoring usage (continued)
Notes on NdbInfo tables
● Implemented in the data nodes, analogous to the Linux /proc/ filesystem: contents are generated for each view.
● Currently there is no indexing or filtering. Any query will retrieve the full content of the table (a full table scan). We hope to improve this in future.
● All existing tables are relatively lightweight in terms of the CPU cost to build and send content.
● But they are not cached in mysqld, nor are they 'free'. Beware of sampling at a high frequency.

