Designing Information Structures for Performance and Reliability: Key Elements to Maximizing DB Server Performance. Bryan Randol, IT/Systems Manager.
Designing Information Structures for Performance and Reliability: Discussion Outline
DAY 1: Hardware Performance: Systematic Tuning Concepts; CPU; Memory Architecture and Front-Side Bus (FSB); Data Flow Concepts; Disk Considerations; RAID.
DAY 2: Database Performance: OLAP vs. OLTP; GreenPlum vs. PostgreSQL; PostgreSQL Concepts and Performance Tweaking; PSA v.1 – GreenPlum on AOpen mini-PCs "dbnode1-dbnode6"; PSA v.2 – Tyan Transport w/PostgreSQL; PSA v.3 – Current PSA Implementation, DELL PowerEdge 2950 w/PostgreSQL 8.3.
I. Database Server Performance: Hardware & Operating System Considerations
DAY 1: Hardware Performance
Designing Information Structures for Performance and Reliability: Discussion Outline
Systematic tuning essentially follows these five steps:
1. Assess the problem and establish numeric values that categorize acceptable behavior. (Know the system's specifications and set realistic goals.)
2. Measure the performance of the system before modification. (Benchmark)
3. Identify the part of the system that is critical for improving performance: the "bottleneck". (Analyze)
4. Modify that part of the system to remove the bottleneck. (Upgrade/Tweak)
5. Measure the performance of the system after modification. (Benchmark)
Repeat steps 3-5 as needed. (Continuous Improvement)
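For the benchmarking steps, PostgreSQL ships a simple load generator, pgbench (a contrib module in the 8.x series), which can supply the before/after numbers. A minimal sketch, assuming a scratch database named mydb:

    # Populate pgbench's test tables (scale factor 10 = roughly 1M rows)
    pgbench -i -s 10 mydb
    # Run 8 concurrent clients for 1000 transactions each; note the reported TPS
    pgbench -c 8 -t 1000 mydb

Run the same command before and after each tweak so the numbers stay comparable.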
I. Database Server Performance: Data Flow Concepts
DB files are stored in the filesystem on disk in blocks. A "job" is requested, initiating a "process thread"; associated files are read into memory "pages". Memory pages are read into the CPU's cache as needed. "Page-outs" to disk occur to make space as needed; "page-ins" from disk are what slow down performance. Once in CPU cache, jobs are processed in threads per CPU (or "core").
I. Database Server Performance: Hardware & Operating System Considerations
Server Performance Considerations: CPU
Each CPU has at least one core; each core processes jobs (threads) sequentially based on the job's priority. Higher-priority jobs get more CPU time. Multi-threaded jobs are distributed evenly across all cores ("parallelized").
Internal Clock Speed: operations the CPU can process internally per second, in MHz, as advertised.
External Clock Speed: speed at which the CPU interacts with the rest of the system, also known as the front-side bus (FSB).
Memory Clock Speed: speed at which RAM is given requests for data.
Important PostgreSQL Performance Note: PostgreSQL uses a multi-process model, meaning each database connection has its own Unix process. Because of this, all multi-CPU operating systems can spread multiple database connections among the available CPUs. However, if only a single database connection is active, it can only use one CPU; PostgreSQL does not use multi-threading to allow a single process to use multiple CPUs.
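The one-process-per-connection model is visible from the shell; a quick check (backend process titles vary slightly by version and platform):

    # Each client connection appears as its own "postgres: user database host" backend
    ps -ef | grep '[p]ostgres:'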
I. Database Server Performance: Hardware & Operating System Considerations
Server Performance Considerations: Memory Architecture and FSB (Front Side Bus)
On Intel-based computers the CPU interfaces with memory through the "North Bridge" memory controller, across the FSB (Front Side Bus). FSB speed and the NorthBridge MMU (memory management unit) drastically affect the server's performance, as they determine how fast data can be fed into the CPU from memory.
Unless special care is taken, a database server running even a simple sequential scan on a table will spend 95% of its cycles waiting for memory to be accessed. This memory access bottleneck is even more difficult to avoid in more complex database operations such as sorting, aggregation, and join, which exhibit a random access pattern. Database algorithms and data structures should therefore be designed and optimized for memory access from the outset.
I. Database Server Performance: Hardware & Operating System Considerations
Intel "Xeon" based systems: Memory Access Challenges
The FSB is a fixed frequency and requires a separate chip to access memory; newer processors will run at the same fixed FSB speed. Memory access is delayed by passing through the separate controller chip. Both processors share the same Front Side Bus, effectively halving each processor's bandwidth to memory and stalling one processor while the other is accessing memory or I/O. All processor-to-system I/O and control must use this one path. One interleaved memory bank serves both processors, again effectively halving each processor's bandwidth to memory: half the bandwidth of a two-memory-bank architecture. All program access to graphics, PCI(e), PCI-X, or other I/O must pass through this bottleneck.
I. Database Server Performance: Hardware & Operating System Considerations
Multiprocessing Memory Access Approaches
Intel Xeon Multiprocessing ("1st Gen."): the FSB cuts bandwidth per CPU; the NorthBridge controller produces overhead; UMA (Uniform Memory Access) means access to memory banks is "uniform".
AMD Multiprocessing ("HyperTransport"): the memory controller is integrated on the CPU, eliminating the separate FSB; NUMA (Non-Uniform Memory Access) means latency to each memory bank varies.
I. Database Server Performance: Hardware & Operating System Considerations
Intel "Harpertown" Xeon Improvements
DELL PowerEdge 2950 III (2 x Xeon E5405 = 8 cores): 4 cores/CPU plus a faster FSB (>= 1333MHz). Northbridge controller bandwidth increased to 21.3GB/s reads from memory and 10.7GB/s writes into memory: 32GB/s overall bandwidth.
DELL PowerEdge 1950 (2 x Xeon E5405 = 8 cores)
I. Database Server Performance: Hardware & Operating System Considerations
Disk Considerations (secondary storage):
1. Seek Time/Rotational Delay: how fast the read/write head is positioned appropriately for reading/writing, and how fast the addressed area is placed under the read/write head for data transfer. SATA (Serial Advanced Technology Attachment) drives are cheap and come in sizes up to 2.5TB, typically maxing out at 7,200 RPM (the "Velociraptor" is the exception at 10,000 RPM). SAS (Serial Attached SCSI) drives are twice as fast (15,000 RPM) and typically twice as expensive, with roughly 1/5 the max capacity of SATA (~450GB).
2. Bandwidth/Throughput (Transfer Time): the raw rate at which data is transferred from disk into memory. This can be aggregated using RAID, which is discussed later. SATA-I bandwidth is 1.5Gb/s, which translates into ~150MB/s real speed. SATA-II and SAS bandwidth is 3Gb/s, which translates into ~300MB/s real speed.
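A crude way to sanity-check a drive's (or array's) sequential read throughput under Linux; a sketch, with the device name as an assumption for your system:

    # Flush the page cache so the test hits the disk rather than RAM
    sync; echo 3 > /proc/sys/vm/drop_caches
    # Read 1GB sequentially; dd prints the effective MB/s when it finishes
    dd if=/dev/sda of=/dev/null bs=1M count=1024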
I. Database Server Performance: Hardware & Operating System Considerations
Disk Considerations (secondary storage):
3. Buffer/Cache: disks contain intelligent controllers, read cache, and write cache. When you ask for a given piece of data, the disk locates the data and sends it back to the motherboard. It also reads the rest of the track and caches this data on the assumption that you will want the next piece of data on the disk. This data is stored locally in its read cache; if, sometime later, you request the next piece of data and it is in the read cache, the disk can deliver it with almost no delay. Write-back cache improves performance because a write to the high-speed cache is faster than writes to normal RAM or disk; this cache aids in addressing the disk-to-memory subsystem bottleneck. Most good drives feature a 32MB buffer cache.
I. Database Server Performance: Hardware & Operating System Considerations
Disk Considerations:
4. Track Data Density: defines how much information can be stored on a given track. The higher the track data density, the more information the disk can store. If a disk can store more data on one track, it does not have to move the head to the next track as often. This means that the higher the recording density, the lower the chances that the head will have to be moved to the next track to get the required data.
I. Database Server Performance: Hardware & Operating System Considerations
Disk Considerations:
5. RAID (n = number of drives in array): "Redundant Array of Inexpensive Disks". Pools disks together to aggregate their throughput by "striping" data in segments across each disk; also provides fault-tolerance. A software-RAID sketch follows this list.
- RAID0 "Striping" (capacity n): fastest due to no parity; raw cumulative speed. A single drive failure causes the entire array to fail: "all-or-none".
- RAID1 "Mirroring" (capacity n/2): each drive is mirrored; speed and capacity are half of RAID0, and an even number of disks is required so they can be divided into pairs. The entire source or mirror set can go bad before data is jeopardized.
- RAID5 "Striping w/Parity" (capacity n-1): fast, with one drive's worth of capacity devoted to parity for fault-tolerance. Only one drive can fail without losing the array.
- RAID6 "Striping w/Dual Parity" (capacity n-2): fast, with two drives' worth of capacity devoted to parity. Two drives can fail without losing the array.
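For illustration, a Linux software-RAID (mdadm) sketch of the capacity arithmetic; the device names are assumptions:

    # Build a RAID5 set from six disks: usable capacity is (n-1) = 5 drives' worth
    mdadm --create /dev/md0 --level=5 --raid-devices=6 /dev/sd[b-g]
    # Inspect geometry, state, and usable size
    mdadm --detail /dev/md0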
I. Database Server Performance: Hardware & Operating System Considerations
Disk Considerations:
6. RAID controller: the device responsible for managing the disk drives in an array. It stores the RAID configuration while also providing additional disk cache, and it offloads costly checksum routines from the CPU in parity-driven RAID configurations (e.g. RAID5 and RAID6). The type of internal and external interface dramatically impacts the overall I/O performance of the array. The internal bus interface should be PCIe v2.0 (500 MB/s per-lane throughput); the most common cards are x2, x4, and x8 "lanes", providing 1GB/s, 2GB/s, and 4GB/s of throughput respectively.
I. Database Server Performance: Hardware & Operating System Considerations
Filesystem Considerations
As an easy performance boost with no downside, make sure the filesystem on which your database is kept is mounted "noatime", which turns off access-time bookkeeping.
XFS is a 64-bit filesystem and supports a maximum filesystem size of 8 binary exabytes minus one byte (on 32-bit Linux systems, XFS is "limited" to 16 binary terabytes). Journal updates in XFS are performed asynchronously to avoid a performance penalty. Files and directories in XFS can span allocation groups, and each allocation group manages its own inode tables (unlike EXT3/EXT2), providing scalability and parallelism: multiple threads and processes can perform I/O operations on the same filesystem simultaneously. On a RAID array, a "stripe unit" can be specified within XFS at creation time; this maximizes throughput by aligning allocations with RAID stripe sizes. XFS also provides a 64-bit sparse address space for each file, which allows both very large file sizes and holes within files for which no disk space is allocated.
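As a concrete sketch of both tweaks (the stripe values are assumptions matching a 6-disk RAID5 with a 64KB stripe element; adjust to your controller, and substitute your own device and mount point):

    # su = RAID stripe element size, sw = number of data spindles (6 disks - 1 parity)
    mkfs.xfs -d su=64k,sw=5 /dev/sdb1
    # Mount with access-time bookkeeping disabled
    mount -o noatime /dev/sdb1 /var/lib/pgsql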
I. Database Server Performance: Hardware & Operating System Considerations
Takeaways from Hardware Performance Concepts:
Keep relevant data closest to the CPU, in memory, once it has been read from disk. More memory reduces the need for costly "page-in" operations from disk by reducing the need to "page-out" data to make space for new data.
Memory bus speed is still much slower than CPU bus speeds, often becoming a bottleneck as CPU speeds increase. It's important to have the fastest memory speed and FSB that your chipset will support.
More CPU cores allow you to parallelize workloads. A multithreaded database takes advantage of multi-processing by distributing a query into several threads across multiple CPUs, drastically increasing the query's efficiency while reducing its process time.
Faster disks with high bandwidth and low seek times maximize read performance into memory for CPUs to process complex queries. OLAP databases benefit from this because they scan large datasets frequently.
Using RAID allows you to aggregate disk I/O by striping data across several spindles, drastically decreasing the time it takes to read data into memory and write back onto the disks during commits, while also providing massive storage space, redundancy, and fault-tolerance.
I. Database Server Performance: Hardware & Operating System Considerations
DAY 2: Database Performance
II. Software & Application Considerations: OLAP and OLTP
OLAP (Online Analytical Processing): provides the big picture, supports analysis, needs aggregate data, evaluates all datasets quickly, uses a multidimensional model.
- DB size is typically 100GB to several TB (even petabytes).
- Mostly read-only operations, lots of scans, complex queries.
- Benefits from multi-threading, parallel processing, and fast drives with high read throughput/low seek times.
- Key performance metrics: query throughput/response time.
OLTP (Online Transactional Processing): provides a detailed audit, supports operations, needs detailed data, finds one dataset quickly, uses a relational model.
- DB size is typically < 100GB.
- Short, atomic transactions; heavy emphasis on lightning-fast writes.
- Key performance metrics: transaction throughput, availability.
II. Software & Application Considerations: OLAP and OLTP
Database Types: OLAP (Online Analytical Processing)
OLAP databases should only receive historical business data and remain isolated from OLTP (transactional) databases: summaries, not transactions. Data in an OLAP database never changes; OLTP data constantly changes.
OLAP databases typically contain fewer tables, arranged into a "star" or "snowflake" schema. The central table in a star schema is called the "fact table"; the leaf tables are called "dimension tables", and the facts within a dimension table are called "members". The joins between the dimension and fact tables allow you to browse through the facts across any number of dimensions.
The simple design of the star schema makes queries easier to write, and they run faster. An OLTP database could involve dozens of tables, making query design complicated; the resulting query could take hours to run.
OLAP databases make heavy use of indexes because they help find records in less time. In contrast, OLTP databases avoid them because they lengthen the process of inserting data.
II. Software & Application Considerations: OLAP and OLTP
Database Types: OLAP (Online Analytical Processing)
The process by which OLAP databases are populated is called Extract, Transform, and Load (ETL). No direct data entries are made into an OLAP database, only summarized bulk ETL transactions.
A cube aggregates the facts at each level of each dimension in a given OLAP schema. Because the cube contains all of the data in an aggregated form, it seems to know the answers to queries in advance. This arrangement of data into cubes overcomes a limitation of relational databases.
II. Software & Application Considerations: OLAP and OLTP
OLAP (Online Analytical Processing): What happens during a query?
1. The client statement is issued.
2. The database server processes the query by locating extents.
3. The data is found on disk.
4. Results are sent through the database server to the client.
II. Software & Application Considerations: PostgreSQL Query Flow
PostgreSQL: The Path of a Query
1. Connection from the application
2. Parsing stage
3. Rewrite stage
4. Cost comparison and plan/optimization stage
5. Execution stage
6. Result
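Stages 4 and 5 can be observed directly: EXPLAIN prints the plan the optimizer chose with its cost estimates, and EXPLAIN ANALYZE also runs the executor and reports actual row counts and timings. A sketch against a hypothetical sales table:

    -- Plan only: planner/optimizer output with estimated costs
    EXPLAIN SELECT * FROM sales WHERE sale_date >= DATE '2009-01-01';
    -- Plan plus real per-node execution times
    EXPLAIN ANALYZE SELECT * FROM sales WHERE sale_date >= DATE '2009-01-01';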
II. Software & Application Considerations: OLAP and OLTP
GreenPlum and PostgreSQL: Of the open-source database options, PostgreSQL is the most robust object-relational database management system. GreenPlum is a commercial DBMS based on PostgreSQL, adding enterprise (OLAP) oriented enhancements and promising the following features:
- Economical petabyte scaling
- Massively parallel query execution
- Unified analytical processing
- Shared-nothing, massively parallel processing architecture
- Fault tolerance
- Linear scalability
- "In-database" compression, 3-10x disk space reduction, with corresponding I/O improvement
The license was $20,000 every 6 months ($40,000/yr.). It's important to note that PostgreSQL is free and can be modified to perform similarly to GreenPlum; we did just that with our PSA server reconstruction project.
II. Software & Application Considerations: PostgreSQL Tweaks
PostgreSQL tweaks explained:
PostgreSQL is tweaked through a configuration file called "postgresql.conf". This flat file contains several dozen parameters, which the master PostgreSQL service, "postmaster", reads at startup. Changes made to this file require the "postgresql" service to be bounced (restarted) via the command, as root: "service postgresql restart".
Corresponding "postgresql.conf" parameters affecting query performance:
- Maximum Connections (max_connections): determines the maximum number of concurrent connections to the database server. Keep in mind that this figure effectively multiplies work_mem, since each connection can claim its own working memory.
- Shared Buffers (shared_buffers): determines how much memory is dedicated to PostgreSQL for caching data. If you have a system with 1GB or more of RAM, a reasonable starting value for shared_buffers is 1/4 of the memory in your system.
- Working Memory (work_mem): if you do a lot of complex sorts and have a lot of memory, increasing the work_mem parameter allows PostgreSQL to do larger in-memory sorts which, unsurprisingly, will be faster than disk-based equivalents.
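Pulling those together, a hypothetical postgresql.conf starting point for a dedicated box with 16GB of RAM (illustrative values, not prescriptive; the actual PSA settings appear in the case study later):

    # postgresql.conf -- illustrative starting values for ~16GB RAM
    max_connections = 25        # each connection can claim its own work_mem
    shared_buffers = 4096MB     # ~1/4 of physical memory
    work_mem = 64MB             # per sort/hash; keep max_connections * work_mem bounded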

II. Software & Application Considerations: PostgreSQL Tweaks
PostgreSQL tweaks explained: Shared Buffers
PostgreSQL does not directly change information on disk. Instead, it requests that data be read into the PostgreSQL shared buffer cache. PostgreSQL backends then read/write these blocks, and finally flush them back to disk.
Backends that need to access tables first look for the needed blocks in this cache. If they are already there, processing can continue right away. If not, an operating system request is made to load the blocks, either from the kernel disk buffer cache or from disk; these can be expensive operations.
The default PostgreSQL configuration allocates 1000 shared buffers of 8 kilobytes each. Increasing the number of buffers makes it more likely that backends will find the information they need in the cache, avoiding an expensive operating system request...up to a limit. The change can be made with a postmaster command-line flag or by changing the value of shared_buffers in postgresql.conf.
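One way to judge whether the cache is doing its job is the statistics collector's per-database hit ratio; a sketch using the standard pg_stat_database view:

    -- Fraction of block requests served from shared buffers rather than the OS/disk
    SELECT datname,
           blks_hit::float / NULLIF(blks_hit + blks_read, 0) AS cache_hit_ratio
    FROM pg_stat_database;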
II. Software & Application Considerations: PostgreSQL Tweaks
PostgreSQL tweaks explained: Shared Buffers, "How much is too much?"
Setting shared_buffers too high results in expensive "paging", which severely degrades the database's performance. If everything doesn't fit in RAM, the kernel starts forcing memory pages out to a disk area called swap, moving pages that have not been used recently; this operation is called a swap pageout. Pageouts are not a problem because they happen during periods of inactivity. What is bad is when these pages have to be brought back in from swap, meaning an old page that was moved out to swap has to be moved back into RAM: a swap pagein. This is bad because while the page is moved from swap, the program is suspended until the pagein completes.
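Swap pageins are easy to watch for from the shell; a sketch using vmstat, whose si/so columns show KB swapped in/out per second:

    # Sample every 5 seconds; sustained nonzero "si" under query load suggests
    # shared_buffers/work_mem are oversized for this machine
    vmstat 5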
II. Software & Application Considerations: PostgreSQL Tweaks
PostgreSQL tweaks explained: Horizontal "Range" Partitioning
Also known as "sharding", this involves putting different rows into different tables for improved manageability and performance. Benefits of partitioning include:
- Query performance can be improved dramatically in certain situations, particularly when most of the heavily accessed rows of the table are in a single partition or a small number of partitions. The partitioning substitutes for the leading columns of indexes, reducing index size and making it more likely that the heavily used parts of the indexes fit in memory.
- When queries or updates access a large percentage of a single partition, performance can be improved by taking advantage of a sequential scan of that partition instead of using an index and random-access reads scattered across the whole table.
- Seldom-used data can be migrated to cheaper and slower storage media.
II. Software & Application Considerations: PostgreSQL Tweaks
PostgreSQL tweaks explained: Partitioning (cont.)
The benefits will normally be worthwhile only when a table would otherwise be very large. The exact point at which a table will benefit from partitioning depends on the application, although a rule of thumb is that the size of the table should exceed the physical memory of the database server. The following forms of partitioning can be implemented in PostgreSQL (a sketch follows):
- Range Partitioning (aka "Horizontal"): the table is partitioned into "ranges" defined by a key column or set of columns, with no overlap between the ranges of values assigned to different partitions. For example, one might partition by date ranges, or by ranges of identifiers for particular business objects.
- List Partitioning: the table is partitioned by explicitly listing which key values appear in each partition.
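In the PostgreSQL 8.x series, range partitioning is built from table inheritance plus CHECK constraints, which the planner can prune when constraint_exclusion is on. A minimal sketch with hypothetical table and column names:

    -- Parent table holds no rows itself; children carry non-overlapping ranges
    CREATE TABLE measurements (
        id      bigint,
        logdate date NOT NULL,
        value   numeric
    );

    CREATE TABLE measurements_2009q1 (
        CHECK (logdate >= DATE '2009-01-01' AND logdate < DATE '2009-04-01')
    ) INHERITS (measurements);

    CREATE TABLE measurements_2009q2 (
        CHECK (logdate >= DATE '2009-04-01' AND logdate < DATE '2009-07-01')
    ) INHERITS (measurements);

    -- With constraint_exclusion = on, this scans only the 2009q2 child:
    SELECT * FROM measurements WHERE logdate = DATE '2009-05-15';

Rows must also be routed to the correct child on INSERT, typically with a trigger or rule on the parent.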
II. Software & Application Considerations: PostgreSQL Tweaks
PostgreSQL tweaks explained: VACUUM
VACUUM helps ensure the database stays ACID: Atomic, Consistent, Isolated, Durable.
PostgreSQL uses MVCC (Multi-Version Concurrency Control), eliminating read locks on records by allowing several versions of data to exist in a database. VACUUM removes old versions of this multi-versioned data in base tables from the database; these old versions waste space once a commit is made. To keep a PostgreSQL database performing well, you must ensure VACUUM is run correctly. AUTOVACUUM suffices for our query-based, low-transaction database, keeping dead space to a minimum.
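A sketch of running it by hand on a hypothetical table (AUTOVACUUM covers the routine case, as noted above):

    -- Reclaim dead row versions left behind by MVCC and refresh planner statistics
    VACUUM ANALYZE measurements;
    -- VERBOSE reports how many dead row versions were removed
    VACUUM VERBOSE measurements;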
III. PSA Server Case Studies: AOpen mini-PCs + GreenPlum
PSA Server (v1): "dbnode1 – dbnode6"
Originally, PSA was hosted on GreenPlum using 6 AOpen mini-PC nodes. Performance was slow: disk I/O was roughly 90MB/s (realized), and Sysco's weekly reports took roughly 15 minutes. Database volume was constantly around 90% capacity, forcing Mike to manually delete tables; space was at a premium.
Licensing with GreenPlum was expensive ($20,000/6 months, $40,000/yr.), and the system didn't deliver performance as promised (in either PSA or NewOps). NewOps' performance should have been significantly better given its more robust hardware (12 x DELL PowerEdge 2950s).
Since GreenPlum is based on PostgreSQL, it made sense to leverage the underlying free open-source code and scrap the proprietary distributed DB solution, opting for a standalone server with enhanced space and I/O. Migrating existing tables to PostgreSQL required very little modification. The mini-PCs we used to cluster GreenPlum were limited in capacity and scalability; each box was sealed and didn't allow for expansion.
Mini-PC details: AOpen MP965-D; Intel® Core™2 Duo CPU T7300 @ 2GHz; 3.24GB memory; bus speed 800MHz; 150GB SATA drive.
III. PSA Server Case Studies: TYAN Transport + PostgreSQL
PSA Server (v2): "sentrana-psa-dw"
This is our second-generation PSA box, this time using PostgreSQL 8.3 instead of GreenPlum. Formerly used as a testing box at the colo, named "econ.sentrana.com", it consists of a basic Tyan Transport GX28 (B2881) commodity chassis with a Tyan Thunder K8SR (S2881) motherboard, 2 Dual-Core AMD Opteron 270s @ 1000MHz w/ 2MB L2 cache, 8GB memory, and 4 SATA-I drive bays (SATA-II drives are backwards compatible and fit in these bays, but run at SATA-I speed).
Filesystem: EXT3 (4KB block size = kernel page size)
Storage configuration: 4 drive bays = 1 OS drive + 3 RAID5 DB drives @ SATA-I speed (150MB/s)
Read performance: ~76.75MB/s
III. PSA Server Case Studies: DELL PowerEdge 2950 + PostgreSQL
PSA Server (v3): "psa-dw-2950"
This is our third (and current) generation PSA box, still using PostgreSQL; only the server platform has evolved, to a DELL PowerEdge 2950 with dual Xeon quad-core processors @ 2.5GHz, 16GB DDR memory, a 1333MHz FSB, and 6 SATA-II/SAS drive bays configured via a PCIe PERC6/i integrated RAID controller. Formerly used as one of the NewOps DBNodes with GreenPlum, this box was rebuilt from the OS out, using Ubuntu 8.10 Linux as the OS serving PostgreSQL 8.3 as the DB system.
Filesystem: XFS (4KB block size = kernel page size)
Storage configuration: 6 x 1TB drives @ 7.2K RPM (300MB/s SATA-II speed) in a single RAID5 array, ~5TB actual storage space (5 drive spindles used for data, 1 for RAID5 parity)
Read performance: ~507MB/s
III. PSA Server Case Studies: DELL PowerEdge 2950 + PostgreSQL
PSA Server (v3): "psa-dw-2950", postgresql.conf settings:
max_connections = 25
shared_buffers = 4096MB (sets the amount of memory the database server uses for shared memory buffers; 1/4 of total physical memory)
temp_buffers = 1024MB (sets the maximum number of temporary buffers used by each database session)
work_mem = 4096MB (specifies the amount of memory to be used by internal sort operations and hash tables before switching to temporary disk files; too high = paging will occur, too low = writing to tempdb)
maintenance_work_mem = 256MB
random_page_cost = 2.0 (query planner constant, stating that the cost of using disks is 2.0)
effective_cache_size = 12288MB (query planner constant)
constraint_exclusion = on (query planner uses table constraints to optimize queries, e.g. partitioned tables)
1725 Eye St. NW, Suite 900, Washington DC 20006. OFFICE 202.507.4480 | FAX 866.597.3285 | WEB sentrana.com

Editor's Notes

  • #4: Keep relevant data closest to the CPU in memory once it has been read from disk. More memory reduces the need for costly "page-in" operations from disk by reducing the need to "page-out" data to make space for new data. Memory bus speed is still much slower than CPU bus speeds, often becoming a bottleneck as CPU speeds increase. It's important to have the fastest memory speed and FSB that your chipset will support. More CPU cores allow you to parallelize workloads. A multithreaded database takes advantage of multi-processing by distributing a query into several threads across multiple CPUs, drastically increasing the query's efficiency while reducing its process time. Faster disks with high bandwidth and low seek times maximize read performance into memory for CPUs to process complex queries. OLAP databases benefit from this because they scan large datasets frequently. Using RAID allows you to aggregate disk I/O by striping data across several spindles, drastically decreasing the time it takes to read data into memory and write back onto the disks during commits, while also providing massive storage space, redundancy and fault-tolerance.
  • #5: Set realistic goals, know the hardware’s expected limitations....Measure current performance....Analyze the results (research upgrades and possible performance problems)Modify the system..Benchmark again....Repeat as needed
  • #6: Client issues a query across the network ....Database server searches cache and memory for database extents...if they’re not found in memory, they’re located on disk...Disks then seek out the blocks containing the database extents and begin loading the data into memory...Memory pages are then fed into CPU’s cache and ultimately into the CPU for processing...Results are found and sent back to the client
  • #7: Each CPU has several cores. Internal Clock Speed: processes per second in MHz or GHz (advertised). External Clock Speed: speed at which the FSB is accessed (typical bottleneck). Memory Clock Speed: speed at which RAM is given requests for data (another bottleneck). PostgreSQL is multi-process, one Unix process per DB connection. A single connection can only use one CPU; it is not multithreaded.
  • #8: CPU speed has increased roughly 70% each year; memory speed hasn't kept up. DDR memory (double data rate) sends data to the CPU on both the rising and falling edges of the clock cycle, doubling throughput...still a bottleneck. The memory tradeoff is typically speed for capacity: faster = less capacity/more expensive. The further out from the CPU you go, the slower and higher-capacity the storage. Also, disk is the only permanent storage...it also holds swap.
  • #9: 1st Generation Xeon multiprocessing bottlenecks....still bottlenecks today, but less so...Shared FSB between processors....halves bandwidth to memory....second processor competes for FSB bandwidth...Memory access delayed between controller and memory bankBandwidth between I/O controller (Southbridge) and memory controller (Northbridge) congested...as is bandwidth from Northbridge to the expansion slots.
  • #10: Here you see how each Intel processor shares a common FSB bandwidth, dividing the bandwidth per CPU. Access to memory must be at the reduced bandwidth, through the northbridge memory controller, and into the memory banks.AMD’s approach places a Northbridge controller directly on each processor, so there’s no external chipsets to deal with. Each processor features three point-to-point HyperTransport links, delivering 3.2 GB/s of bandwidth in each direction (6.4GB/s full duplex). So AMD’s scalability was better in the earlier days of Xeon multiprocessing.
  • #11: “Second Generation” (Harpertown) Xeon Processors E5200/E5400:==========================================Each CPU has a clock speed of 2GHz, 12MB of L2 cache, and a FSB of 1333MHz (1600MHz max on other models). The read bandwidth for each DDR2 667-MHz memory channel is 5.325 GB/s which gives a total read bandwidth of 21.3 GB/s for four memory channelsWrite memory bandwidth through the same four channels is 10.7 GB/s write memory bandwidth for the same four memory channels. Overall Effective bandwidth to memory is then 32 GB/s ... 21.3 GB/s read and 10.7 GB/s write.5500-series "Nehalem-EP" (Gainestown) adds: (December 2008)=================================Integrated memory controller supporting 2-3 DDR3 memory channelsPoint to Point processor interconnect called “QuickPath” (like AMD’s HyperTransport), bypassing FSBHyperthreading, doubling each core
  • #12: There’s 3 delays associated with reading or writing to a hard drive:Seek Time, Rotational Delay, and Transfer TimeSeek Time is the time it takes for the drive’s read/write head to be physically moved into the correct place for the data being sought.Rotational Delay is the time required for the addressed area of the disk to rotate into a position where it is accessible by the read/write head….typically measured in milliseconds.Transfer Time is the time it takes to transfer data from the disk through the read/write head, across the storage bus, into memory for processing by the CPU.Seek Time/Rotational Delay is heavily influenced by the disk’s rotational speed (RPMs), data location on the actual platters, how many platters the disk has, and the diameter of the platters.Generally speaking, the faster a disk spins, the lower its seek times will be. Also, the further outside the circumference of the platter data is located, the faster it will be sought and lower it’s rotational delay will be.Bandwidth/Throughput (Transfer Time): Once data is located, this is the raw throughput rate at which data is transferred from disk into memory. This can be aggregated using RAID, which will be discussed later.SATA-I bandwidth is 1Gb/s which translates into ~ 150MB/s real speed.SATA-II and SAS bandwidth is 3Gb/s, which translates into ~ 300MB/s real speed.Generally speaking, the higher the data density of the platter, the more data will be sent through the read/write head per block…resulting in higher throughput and lower transfer times.
  • #13: Buffer/Cache: write-back cache. Data normally written to disk by the CPU is first written into the disk's cache. This allows for higher write performance, with the risk that data stored in cache isn't flushed to disk before a power loss. During idle machine cycles, the data are written from the cache into memory or onto disk. Write-back caches improve performance because a write to the high-speed cache is faster than to normal RAM or disk; this cache aids in addressing the disk-to-memory subsystem bottleneck. I've enabled write-back caching on all of our RAID arrays. RAID will be discussed later.
  • #14: 4. Track Data Density :Defines how much information can be stored on a given track. The higher the track data density, the more information the disk can store on one track. If a disk can store more data on one track it does not have to move the head to the next track as often. This means that the higher the recording density the lower the chances are that the head will have to be moved to the next track to get the required data.
  • #15: RAID: (n = number of drives in array) “Redundant Array of Inexpensive Disks”. RAID systems improve performance by allowing the controller to exploit the capabilities of multiple hard disks to get around performance-limiting mechanical issues that plague individual hard disks. Different RAID implementations improve performance in different ways and to different degrees, but all improve it in some way. RAID0 “Striping” (n) : Fastest due to no parity…raw cumulative speed. Single drive failure causes the entire array to fail. “All-or-none” RAID1 “Mirroring” (n/2): Each drive is mirrored, speed and capacity is ½ of RAID0, requires even number of disks in order to be divided. Entire source or mirror array can go bad before data is jeopardized.RAID5 “Striping w/Parity” (n – 1): Fast, with a drive set aside for fault-tolerance. Only one drive can fail before the array is lost.RAID6 “Striping with dual Parity” (n -2): Fast, with 2 drives set aside for fault tolerance. Two drives can fail before the array is lost.
  • #16: Normal PCI's bandwidth is 132MB/s. AGP 8x is 2.1GB/s. PCI Express outperforms PCI significantly: PCIe is bidirectional/full-duplex, allowing data to flow in both directions simultaneously (doubling throughput). PCIe 1x = 500MB/s (250MB/s each way), PCIe 2x = 1GB/s (500MB/s each way), PCIe 4x = 2GB/s (1GB/s each way), PCIe 8x = 4GB/s (2GB/s each way), PCIe 16x = 8GB/s (4GB/s each way), and even PCIe 32x = 16GB/s (8GB/s each way). So, to open the internal bottleneck, you want to use PCIe, not plain PCI, which is old and slow in comparison. AGP is also obsolete due to PCIe's introduction; graphics cards now use this interface as well. Regular PCI is a bottleneck in modern computers. All of our 2950 servers have PERC6/i RAID controllers built in; the "i" means "integrated" on the motherboard. I found that our throughput was significantly slower than what we expected, despite having 6 SATA-II drives, even in RAID0. The settings we selected for the RAID virtual drives were: stripe element size 64KB; read policy: Adaptive Read-Ahead (to optimize large read operations); write policy: Write Back. We were seeing read speeds in RAID5 of approximately 150-225MB/s across 4 drives, which we knew was way too slow given the hardware. After rebuilding the array several times and searching around on the Internet, I came across DELL's PERC firmware update site, which showed that a newer release was available, v.6.2.0-0013, promising "performance enhancements including significant improvements in random-write performance, multi-threaded write performance, and reduction in maximum and average I/O response times." I couldn't flash the PERC controllers without a floppy, so I had to create a Linux-based FreeDOS bootable CD with the updated PERC firmware in a subdirectory, allowing me to successfully flash the controller's BIOS. Later, I discovered that DELL's OpenManage CD provides a tool to handle BIOS updates; however, I wasn't able to get this working, so the FreeDOS solution worked out. I also dug around and found that I could set filesystem read-ahead parameters through "hdparm" in Linux, telling the OS to read ahead 2048 blocks whenever a read operation was performed; I set this in /etc/rc.local to persist after a reboot (see the sketch below). Once the PERC controller was flashed and Linux filesystem read-ahead was set, performance increased dramatically: we're now seeing just over 500MB/s reads in RAID5/6. This significantly reduces the time it takes to load tables into memory for complex queries, thereby reducing overall query execution time. Performance is now on par with GreenPlum without having to pay $40,000/year in licensing.
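A sketch of that read-ahead tweak (the device name is an assumption; the line goes in /etc/rc.local to survive reboots):

    # Set filesystem read-ahead to 2048 sectors on the array device
    hdparm -a 2048 /dev/sda
    # blockdev exposes the same knob
    blockdev --setra 2048 /dev/sda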
  • #18: Keep relevant data closest to the CPU, in memory, once it has been read from disk. More memory reduces the need for costly "page-in" operations from disk by reducing the need to "page-out" data to make space for new data. Memory bus speed is still much slower than CPU speeds and often becomes a bottleneck as CPUs get faster, so it's important to have the fastest memory speed and FSB that your chipset will support. More CPU cores allow you to parallelize workloads: a multithreaded database takes advantage of multi-processing by distributing a query into several threads across multiple CPUs, drastically increasing the query's efficiency while reducing its processing time. Faster disks with high bandwidth and low seek times maximize read performance into memory for CPUs to process complex queries; OLAP databases benefit from this because they scan large datasets frequently. Using RAID lets you aggregate disk I/O by striping data across several spindles, drastically decreasing the time it takes to read data into memory and write back to the disks during commits, while also providing massive storage space, redundancy, and fault tolerance.
  • #20: Database Types (see the query sketch after this list):
OLAP (Online Analytical Processing): Provides the big picture, supports analysis, needs aggregate data, evaluates entire datasets quickly, uses a multidimensional model. DB size is typically 100GB to several TB (even petabytes). Mostly read-only operations: lots of scans and complex queries. Benefits from multi-threading, parallel processing, and fast drives with high read throughput and low seek times. Key performance metrics: query throughput and response time.
OLTP (Online Transactional Processing): Provides a detailed audit trail, supports operations, needs detailed data, finds one dataset quickly, uses a relational model. DB size is typically < 100GB. Short, atomic transactions, both read and write. Key performance metrics: transaction throughput and availability.
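As a concrete illustration of the two workloads (table and column names are hypothetical), an OLAP-style query aggregates across a whole dataset, while an OLTP-style statement touches one record:
    -- OLAP: scan-heavy aggregate over a large fact table
    SELECT region, avg(amount) FROM sales GROUP BY region;
    -- OLTP: short, atomic, single-record update
    UPDATE accounts SET balance = balance - 100 WHERE account_id = 42;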
  • #21: Database Types: The time and expense involved in retrieving answers from databases mean that a lot of business-intelligence information goes unused. The reason: most operational databases (OLTP) are designed to store your data, not to help you analyze it. The solution: an online analytical processing (OLAP) database, a specialized database designed to help you extract business-intelligence information from your data.
  • #24: A connection from an application program to the PostgreSQL server has to be established. The parser stage checks the query transmitted by the application program for correct syntax and creates a query tree. The rewrite system takes the query tree created by the parser stage and looks for any rules (stored in the system catalogs) to apply to it, performing the transformations given in the rule bodies. The planner/optimizer takes the (rewritten) query tree and creates a query plan that will be the input to the executor. It does so by first creating all possible paths leading to the same result; for example, if there is an index on a relation to be scanned, there are two paths for the scan: a simple sequential scan, or a scan using the index. The cost of executing each path is then estimated, and the cheapest path is chosen. The executor recursively steps through the plan tree and retrieves rows in the way represented by the plan. The executor makes use of the storage system while scanning relations, performs sorts and joins, evaluates qualifications, and finally hands back the rows derived.
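You can watch the planner make exactly this choice with EXPLAIN; the table and column names below are hypothetical:
    -- shows the plan chosen (sequential scan vs. index scan) plus actual run time
    EXPLAIN ANALYZE
    SELECT * FROM events WHERE event_time > now() - interval '1 day';
If an index on event_time exists and the predicate is selective, the planner will typically prefer the index scan; otherwise it falls back to the sequential scan.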
  • #25: GreenPlum and PostgreSQL: We found that, despite the claims above, GreenPlum was overpriced, slow, and problematic. Furthermore, our GreenPlum PSA database grew to exceed the hardware we had in place, forcing us to manually delete old tables on a constant basis. To replace GreenPlum while keeping the table structures already in place, we opted for PostgreSQL, aware that it isn't pre-optimized for OLAP/data-warehouse applications. The mindset was that we could tweak PostgreSQL to match the actual performance we saw from GreenPlum without paying for an expensive license. Doing so requires an understanding of PostgreSQL's query-execution characteristics and its configuration-file concepts.
  • #26: max_connections sets the maximum number of client connections per server. Several performance parameters use max_connections as part of their sizing formula when tweaking PostgreSQL.
shared_buffers: As the name implies, this is the shared memory PostgreSQL may use; set it too high and you risk paging.
work_mem (working memory): You need to consider what you set max_connections to in order to size this parameter correctly. This is a setting where data-warehouse systems, whose users submit very large queries, can readily make use of many gigabytes of memory. The size applies to each and every sort done by each user, and complex queries can use multiple work_mem sort buffers. Set it to 50MB with 30 users submitting queries, and you are soon using 1.5GB of real memory; furthermore, if a query involves merge sorts of 8 tables, that requires 8 times work_mem.
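A sketch of that sizing arithmetic as it would appear in postgresql.conf; the values are the hypothetical ones from the example above, not recommendations:
    max_connections = 30
    work_mem = 50MB          # applied per sort, per connection
    # worst case: 30 connections x 50MB = 1.5GB of real memory,
    # and one query merge-sorting 8 tables can consume 8 x work_mem on its own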
  • #29: Partitioning refers to splitting what is logically one large table into smaller physical pieces. Partitioning can provide several benefits (see the sketch after this list):
Query performance can be improved dramatically in certain situations, particularly when most of the heavily accessed rows of the table are in a single partition or a small number of partitions. Partitioning substitutes for the leading columns of indexes, reducing index size and making it more likely that the heavily used parts of the indexes fit in memory.
When queries or updates access a large percentage of a single partition, performance can be improved by taking advantage of a sequential scan of that partition instead of using an index with random-access reads scattered across the whole table.
Bulk loads and deletes can be accomplished by adding or removing partitions, if that requirement is planned into the partitioning design. ALTER TABLE is far faster than a bulk operation, and it entirely avoids the VACUUM overhead caused by a bulk DELETE.
Seldom-used data can be migrated to cheaper and slower storage media.
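A minimal sketch of range partitioning as it was done in the PostgreSQL 8.x series, using table inheritance plus CHECK constraints (declarative partitioning arrived in much later releases); all table and column names are hypothetical:
    -- parent table holds no rows itself
    CREATE TABLE measurements (logdate date NOT NULL, reading int);
    -- one child table per month, constrained so the planner can skip it
    CREATE TABLE measurements_200906 (
        CHECK (logdate >= DATE '2009-06-01' AND logdate < DATE '2009-07-01')
    ) INHERITS (measurements);
    -- retiring a month is fast DDL, not a bulk DELETE (no VACUUM overhead)
    DROP TABLE measurements_200906;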
  • #31: Vacuuming ensures the databases remain ACID: Atomic, Consistent, Isolated, and Durable.
Atomicity: guarantees that either all of the tasks of a transaction are performed, or none.
Consistency: only valid data will ever be written to the database.
Isolation: other operations cannot access or see the data in an intermediate state during a transaction.
Durability: once the user has been notified of the transaction's success, the transaction will persist and not be undone, thus surviving a system failure.
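In practice, routine vacuuming looks like this (a minimal example; the autovacuum daemon can also handle this automatically in PostgreSQL 8.3 and later):
    -- reclaims space from dead rows and refreshes the planner's statistics
    VACUUM ANALYZE;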
  • #32: Disk I/O was roughly 90MB/s. The database volume was constantly at around 90% capacity, forcing Mike to manually delete tables; space was at a premium. Licensing with GreenPlum was expensive ($20,000/6 months, i.e. $40,000/yr.), so it made sense to leverage the underlying free open-source code and scrap the proprietary distributed DB solution.
  • #33: Performance challenges with this server:
Limited capacity; drives were small and slow (150MB/s).
The RAID controller's SATA-I interface didn't recognize higher-capacity SATA-II drives, despite SATA-II's backwards compatibility.
No floppy drive or USB boot capability, making it difficult to flash the controller and BIOS for SATA-II backwards compatibility.
No PCIe expansion bays, only PCI-X, ruling out high-performance external enclosures.
PostgreSQL requires significant configuration tweaks to realize decent performance.
PostgreSQL isn't multithreaded: a single query process (regardless of its complexity) uses only 1 CPU core.
  • #34: This is our third (and current) generation PSA box:
DELL PowerEdge 2950 with dual Quad-Core Xeon processors @ 2.5GHz.
16GB DDR memory @ 667MHz bus speed (x2, DDR); 1333MHz FSB.
6 SATA-II 1TB 7200RPM drives, configured in a single 5TB RAID5 array: 1 drive can fail, throughput is across 5 spindles, and effective formatted capacity is roughly 4.5TB.
Drive setup: used the PERC6/i BIOS menu for RAID configuration (hardware RAID, transparent to the OS); battery backup is enabled for write caching.
Virtual Drive 1 (physical drives 1 and 2): RAID1 for the system (1TB), 64KB stripe element size, Write-Back enabled.
Virtual Drive 2 (physical drives 3, 4, 5, 6): RAID5 for PostgreSQL data (3TB), 64KB stripe element size, Write-Back enabled.
I/O performance: Read: 507MB/s; Write: 401MB/s.
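One way to reproduce sequential-throughput numbers like those above (a sketch; the file path is an assumption, and the direct-I/O flags bypass the page cache so the disks themselves are measured):
    # sequential write test: 4GB of zeroes in 1MB blocks
    dd if=/dev/zero of=/data/ddtest bs=1M count=4096 oflag=direct
    # sequential read test of the same file
    dd if=/data/ddtest of=/dev/null bs=1M iflag=direct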
  • #35: Through a process of elimination and online research (Google and the PostgreSQL forums), we have settled on the above settings in the PSA server's configuration file:
max_connections = 25. Mike confirmed that only 10-15 PSA connections are ever really active at any given time, so this setting allows for spikes while remaining conservative enough not to inflate work_mem, since work_mem uses max_connections in its memory-allocation formula.
shared_buffers = 4096MB. This number comes from best practices: ¼ of total physical memory (16GB).
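As those settings would appear in postgresql.conf (an excerpt reconstructed from the values above):
    max_connections = 25       # 10-15 active in practice; headroom for spikes
    shared_buffers = 4096MB    # best-practice 1/4 of the 16GB of physical memory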