Building and Deploying Large Scale Real Time News System with MySQL and Distributed Cache

Presented to MySQL Conference, Apr. 13, 2011
Who am I?

•  Tao Cheng <tao.cheng@teamaol.com>, AOL Real Time News (RTN).
•  Worked on Mail and Browser clients in the '90s, then moved to web backend servers.
•  Not an expert, but happy to share my experience and brainstorm solutions.
Presentation for
[CLIENT]
Agenda

•  AOL Real Time News (RTN): what is it?
•  Requirements
•  Technical solutions with a focus on MySQL
•  Deployment Topology
•  Operational Monitoring
•  Metrics Collection
Agenda

•  Tips for query tuning and optimization
•  Heuristic Query Optimization Algorithm
•  Lessons learned
•  Q & A
Real Time News: background

AOL deployed its large scale Real Time News (RTN) system in 2007. The system ingests and processes news from 30,000 sources every second, around the clock. Today its data store, MySQL, has accumulated several billion rows and terabytes of data, yet news is still delivered to end users in near real time. This presentation shares how it is done and the lessons learned.


Presentation for
AOLU Un-University
Brief Intro: sample features

•  Data presentation: return the most recent news in
   •  flat view – most recent news about an entity. An entity could be a person, a company, a sports team, etc.
   •  topic clusters – most recent news grouped by topic. A topic is a group of news about an event, headline news, etc.
•  News filtering by
   •  source type, such as news, blogs, press releases, regional, etc.
   •  relevancy level (high, medium, low, etc.) to the entities.
•  Data delivery: push (to subscribers) and pull
•  Search by entities, categories (National, Sports, Finance, etc.), topics, document ID, etc.
Requirements for Phase I (2006)

•  Commodity hardware: 4 CPUs, 16 GB RAM, 600 GB disk space.
•  Data ingestion rate: 250K docs/day; average document size: 5 KB.
•  Data retention period: 7 days to forever.
•  Estimated data set size: (1.25 GB/day, or 456 GB/year) plus space for indexes, schema changes, and optimization.
•  Response time: < 30 milliseconds/query
•  Throughput: > 400 queries/sec/server
•  Uptime: 99.999%
Solutions: MySQL + Bucky

•  MySQL
   •  Serves raw/distinct queries
   •  Back fill
•  Bucky technology (AOL's distributed cache & computing framework)
   •  Write-ahead cache: pre-compute query results and push them into the cache.
   •  Messaging (optional): push data directly to subscribers
      •  Updates are pushed to data consumers or browsers via AIM (complex).
•  Updates go to both the database and the cache.
Architecture Diagram (over-simplified)

[Diagram: Relegence feeds the Ingestor, which writes to the Asset DB and into the Distributed Cache / Gateway tiers. Updates are pushed out to WWW clients via AIM; clients also pull through the Gateways.]
Data Model: SOR vs. Query DB

•  Separate query from storage to keep tables small and queries fast.
•  System of Record (SOR): holds all the raw data
   •  The authoritative data store; designed for data storage.
   •  Normalized schema: for simple key look-ups; no table joins.
•  Query DB – de-normalized for query speed
   •  Avoid JOINs, reduce the number of trips to the DB, increase throughput.
•  Read/write a small chunk of data at a time so the database can turn requests around quickly and process more of them.
•  Use replication to achieve linear scalability for reads.
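As a sketch, the SOR/Query-DB split above might look like the following DDL (table and column names are invented for illustration; the real schema is not shown in the slides):

```sql
-- SOR: normalized, keyed for simple look-ups; no joins at query time.
CREATE TABLE sor_document (
  doc_id    BIGINT UNSIGNED NOT NULL PRIMARY KEY,
  source_id INT UNSIGNED    NOT NULL,
  pub_date  DATETIME        NOT NULL,
  title     VARCHAR(255)    NOT NULL
) ENGINE=InnoDB;

-- Query DB: de-normalized per entity, so one indexed range read answers
-- "latest news for entity X" with no join and no extra trip to the DB.
CREATE TABLE query_entity_doc (
  entity_id INT UNSIGNED    NOT NULL,
  pub_date  DATETIME        NOT NULL,
  doc_id    BIGINT UNSIGNED NOT NULL,
  relevancy TINYINT         NOT NULL,
  title     VARCHAR(255)    NOT NULL,
  PRIMARY KEY (entity_id, pub_date, doc_id)
) ENGINE=InnoDB;
```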
Design Strategies: partitioning (Why)

•  Dataset too big to fit on one host.
•  Performance: divide and conquer
   •  Write: more masters (N×) to take writes.
   •  Read: smaller tables + more (N×M) slaves to handle reads.
•  Fault tolerance – distribute the risk and reduce the impact of system failures.
•  Easier maintenance – size does matter
   •  Faster nightly backups, disaster recovery, schema changes, etc.
   •  Faster optimization – optimization is needed to reclaim disk space after deletions and to rebuild indexes for query speed.
Design Strategies: partitioning (How)

•  Partition on the most used keys (look at query patterns)
   •  Document table – on document ID
   •  Entity table – on entity ID
•  Simple hash on IDs – no partition map, thus no competition for read/write locks on yet another table.
•  Managing growth: add another partition set
   •  New documents are written into both the old and new partition sets for a few weeks; then stop writing into the old partitions.
   •  Queries go to the new partitions first, then to the old ones if insufficient results are found.
•  Works great in our case but might not for everyone.
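The mod-hash routing and dual-write migration described above can be sketched as follows (all names are illustrative, not the real implementation):

```python
# Route an ID to a partition with a plain modulo hash -- no partition-map
# table, so there is no lock contention on a shared lookup table.
def partition_for(doc_id: int, num_partitions: int) -> int:
    return doc_id % num_partitions

# While a new partition set is phased in, writes go to both sets; reads
# would try the "new" set first and fall back to "old" if results are thin.
def shards_for_write(doc_id: int, old_n: int, new_n: int):
    return {("old", partition_for(doc_id, old_n)),
            ("new", partition_for(doc_id, new_n))}
```

Adding a whole partition set (rather than resizing one) avoids rehashing existing rows, which is why the overlap window works.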
Schema design: De-normalization

•  Make query tables small:
   •  Put only essential attributes in the de-normalized tables.
   •  Store long text attributes in separate tables.
•  De-normalization: how to store and match attributes
   •  Single-value attributes (1:1): document ID, short string, date/time, etc. – one column, one row.
   •  Multi-value attributes (1:many): tricky but feasible
      •  Multiple rows with a composite index/key: (c1, c2, etc.)
      •  One row, one column: a CSV string, e.g., "id1, id2, id3" – SQL: "val LIKE '%id2%'"
      •  One row, multiple columns, e.g., group1, group2, etc. – SQL: group1=val1 OR group2=val2 ...

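For illustration, the three multi-value layouts above might look like this in SQL (table and column names are hypothetical):

```sql
-- Option 1: multiple rows with a composite key; clean to index and query.
CREATE TABLE doc_entity (
  doc_id    BIGINT UNSIGNED NOT NULL,
  entity_id INT UNSIGNED    NOT NULL,
  PRIMARY KEY (entity_id, doc_id)
);
SELECT doc_id FROM doc_entity WHERE entity_id = 42;

-- Option 2: one row, one CSV column; compact, but LIKE cannot use an index
-- and substring matching can false-match (e.g., 'id2' also matches 'id20').
SELECT doc_id FROM document WHERE entity_csv LIKE '%id2%';

-- Option 3: one row, a fixed number of columns; bounded but awkward to extend.
SELECT doc_id FROM document WHERE group1 = 42 OR group2 = 42 OR group3 = 42;
```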
Tips for indexing

•  Simple key – for metadata retrieval.
•  Composite key – to find matching documents
   •  Start with low-cardinality, most-used columns.
   •  Order matters: (c1, c2, c3) != (c2, c3, c1)
•  InnoDB – all secondary indexes contain the primary key
   •  Make the primary key short to keep index sizes small.
   •  Queries using a secondary index reference the primary key too.
•  Integer vs. string – comparison of numeric values is faster => index hash values of long strings instead.
•  Index length – title:varchar(255) => idx_title(32)
•  Enforce referential integrity on the application side.
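A sketch of the prefix-index and hash-index tips above (names are hypothetical; CRC32 is one possible hash function):

```sql
-- Index only the first 32 chars of a long title: a much smaller index
-- that is usually still selective enough.
ALTER TABLE document ADD INDEX idx_title (title(32));

-- Exact match on a long string: index a numeric hash of it instead, and
-- re-check the full string in the WHERE clause to filter hash collisions.
ALTER TABLE document
  ADD COLUMN url_crc INT UNSIGNED NOT NULL DEFAULT 0,
  ADD INDEX idx_url_crc (url_crc);

SELECT doc_id FROM document
WHERE url_crc = CRC32('http://example.com/story')
  AND url     = 'http://example.com/story';
```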
MySQL configuration

•  Storage engine: InnoDB – row-level locking.
•  Table space – one file per table
   •  Easier to maintain (schema changes, optimization, etc.)
•  Character set: UTF-8
   •  Disable persistent connections (5.0.x).
   •  skip-character-set-client-handshake
•  Enable the slow query log to identify bad queries.
•  System variables for memory buffer sizes
   •  innodb_buffer_pool_size: data and indexes
   •  sort_buffer_size, max_heap_table_size, tmp_table_size
   •  query_cache_size = 0; tables are updated constantly.
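Put together, the settings above might look like the following my.cnf fragment (values are illustrative placeholders, not tuned recommendations; option names as of MySQL 5.0/5.1):

```ini
[mysqld]
default-storage-engine  = InnoDB
innodb_file_per_table                   # one tablespace file per table
character-set-server    = utf8
skip-character-set-client-handshake     # ignore client charset; use server's
log-slow-queries        = /var/log/mysql/slow.log
long_query_time         = 1
innodb_buffer_pool_size = 12G           # bulk of RAM: InnoDB data + indexes
sort_buffer_size        = 2M
tmp_table_size          = 64M
max_heap_table_size     = 64M
query_cache_size        = 0             # tables change constantly; skip the cache
```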
Runtime statistics (per server)

•  Average write rate:
   •  daily: < 40 tps
   •  peaks at 400 tps during recovery
   •  performs best when the write rate is < 100 tps
•  Query rate: 20–80 qps
•  Query response time – shorter when indexes and data are in memory
   •  75th percentile: ~3 ms when qps < 15; ~2 ms when qps ≈ 60
   •  95th percentile: 6–8 ms when qps < 15; 3–4 ms when qps ≈ 60
•  CPU idle: > 99%
Deployment Topology Considerations

•  Minimum configuration: host/DC redundancy
   •  DC1: host 1 (master), host 3 (slave)
   •  DC2: host 2 (failover master), host 4 (slave)
•  Data locality: significant when network latency is a concern (100 Mbps)
   •  3,000 qps when the DB is on a remote host.
   •  15,000 qps when the DB is on the local host.
•  Linking dependent servers across data centers
   •  Push cross links up as far as possible (Topology 3): link to dependent servers in the same data center.
Deployment Topology 1: minimum config

[Diagram: Data Center 1 and Data Center 2 each hold two DB hosts; the data consumer and WWW front end read from the DB hosts in both data centers.]
Topology 2: link across DCs (bad)

[Diagram: DB hosts sit behind VIPs in each data center; data consumers are balanced via GSLB across both data centers, so a consumer may end up linked to a DB in the remote DC.]
Topology 3: link to same DC (better)

[Diagram: same tiers as Topology 2, but each data consumer links only to the VIP/DB pool in its own data center; GSLB balances WWW traffic across the data centers at the top.]
Topology 4: use local UNIX socket

[Diagram: each data consumer runs on the same host as its DB and connects through the local UNIX socket, removing the network hop; GSLB still balances WWW traffic across data centers.]
Production Monitoring

•  Operational monitoring: logcheck, Scout/NOC alerts, etc.
•  DB monitoring of replication failures, latency, read/write rates, and performance metrics.
Metrics Collection

•  Graph collected metrics: visualize and collate operational metrics.
   •  Helps analyze and fine-tune server performance.
   •  Helps trace production issues and identify the point of failure.
•  What metrics are important?
   •  Host: CPU, MEM, disk I/O, network I/O, number of processes, swap/paging
   •  Server: throughput, response time
•  Comparison: line up charts (throughput, response time, CPU, disk I/O) in the same time window.
Tuning and Optimizing Queries

•  EXPLAIN: mysql> EXPLAIN SELECT ... FROM ...
•  Watch out for temporary table usage, table scans, etc.
•  SQL_NO_CACHE: bypass the query cache while testing.
•  MySQL query profiler
   •  mysql> SET profiling = 1;
•  Linux OS cache: leave enough free memory on the host.
•  USE INDEX hint to choose an index explicitly
   •  Use wisely: most of the time MySQL chooses the right index for you, but as table size grows, index cardinality might change.

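A hypothetical tuning session combining the tips above (table, index, and column names are invented):

```sql
-- SQL_NO_CACHE keeps repeated test runs from being served by the query cache.
EXPLAIN SELECT SQL_NO_CACHE doc_id, title
FROM entity_doc USE INDEX (idx_entity_date)
WHERE entity_id = 42
ORDER BY pub_date DESC LIMIT 20;
-- Watch 'type' (avoid ALL = full table scan), 'rows', and 'Extra'
-- (avoid "Using temporary; Using filesort").

SET profiling = 1;
SELECT SQL_NO_CACHE COUNT(*) FROM entity_doc WHERE entity_id = 42;
SHOW PROFILES;               -- per-query wall time
SHOW PROFILE FOR QUERY 1;    -- time spent in each execution stage
```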
Important MySQL statistics

•  SHOW GLOBAL STATUS ...
   •  Qcache_free_blocks
   •  Qcache_free_memory
   •  Qcache_hits
   •  Qcache_inserts
   •  Qcache_lowmem_prunes
   •  Qcache_not_cached
   •  Qcache_queries_in_cache
   •  Select_scan
   •  Sort_scan
Important MySQL statistics (cont.)

•  Table_locks_waited
•  Innodb_row_lock_current_waits
•  Innodb_row_lock_time
•  Innodb_row_lock_time_avg
•  Innodb_row_lock_time_max
•  Innodb_row_lock_waits
•  Select_scan
•  Slave_open_temp_tables
Heuristic Query Optimization Algorithm

•  Primarily for complex cluster queries: find the latest N topics and their related stories.
•  Strategy: reduce the number of records the database needs to load from disk to perform a query.
   •  Pick a default query range. If insufficient docs are returned, expand the query range proportionally.
   •  If none are returned => sparse data => drop the range and retry.
   •  Save the query range for future reference.
•  Result: reduced the number of rows to process from millions to hundreds => cut query time from minutes to under 10 ms.
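A minimal Python sketch of the heuristic above (function and parameter names are illustrative; the real system keeps the saved ranges in a look-up table keyed by query):

```python
def fetch_latest(run_query, saved_ranges, key, want, default_range=24):
    """run_query(hours) -> rows within the last `hours`; hours=None means
    an unbounded (original) query. Bounding the range keeps the DB reading
    hundreds of rows instead of millions."""
    rng = saved_ranges.get(key, default_range)
    for _ in range(3):                      # bounded number of trips to the DB
        rows = run_query(rng)
        if len(rows) >= want:
            # Save a prorated range so the next query is right-sized.
            saved_ranges[key] = max(1, rng * want // len(rows))
            return rows[:want]
        if not rows:
            break                           # sparse data: drop the range
        rng = rng * want // len(rows) + 1   # expand proportionally and retry
    return run_query(None)[:want]           # fall back to the original query
```

With dense data the first bounded trip usually suffices; with sparse data the loop gives up quickly and issues the unbounded query once.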
Query range look-up flow (from the flowchart):

1. A cluster query arrives; set NumOfTripToDB = 0 and look up a saved query range. If none exists, use the default range.
2. Bound the query with the range, send it to the DB, and increment NumOfTripToDB.
3. If the query engine returns sufficient results, compute the docs-to-range ratio, save it back to the look-up table for future use, and return the results to the client.
4. If results are insufficient but non-zero, prorate the range by the docs-to-range ratio to one that should return enough docs, and retry.
5. If zero results come back and NumOfTripToDB >= 2, send the original (unbounded) query to the DB.
Lessons Learned

•  Always load test well ahead of launch (2 weeks) to avoid a fire drill.
•  Don't rely solely on the cache. The database needs to be able to serve a reasonable amount of queries on its own.
•  Separate the cache from the applications to avoid cold starts.
•  Keep transactions/queries simple and fast to return.
•  Avoid table joins; limit them to 2 if really needed.
•  Avoid stored procedures: results are not cached, and a DBA is needed when altering the implementation.
Lessons Learned (cont.)

•  Avoid using an offset in the LIMIT clause; use application-based pagination instead.
•  Avoid SQL_CALC_FOUND_ROWS in SELECT.
•  If possible, exclude text/blob columns from query results to avoid disk I/O.
•  Store text/blob in separate tables to speed up backup, optimization, and schema changes.
•  Separate real-time vs. archive data for better performance and easier maintenance.
•  Keep table size under control (< 100 GB); optimize periodically.
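The "avoid OFFSET" tip above is usually implemented as keyset (seek) pagination: the application remembers the last key it saw and seeks past it, so the database never scans and discards skipped rows. A self-contained sketch using SQLite from the Python standard library (the same SQL pattern applies to MySQL; all names are invented):

```python
import sqlite3

def page(conn, entity_id, last_pub=None, last_id=None, size=3):
    """Fetch one page of (pub_date, doc_id), newest first."""
    if last_pub is None:
        sql = ("SELECT pub_date, doc_id FROM entity_doc WHERE entity_id=? "
               "ORDER BY pub_date DESC, doc_id DESC LIMIT ?")
        args = (entity_id, size)
    else:
        # Seek predicate: strictly before the last row of the previous page.
        sql = ("SELECT pub_date, doc_id FROM entity_doc WHERE entity_id=? "
               "AND (pub_date < ? OR (pub_date = ? AND doc_id < ?)) "
               "ORDER BY pub_date DESC, doc_id DESC LIMIT ?")
        args = (entity_id, last_pub, last_pub, last_id, size)
    return conn.execute(sql, args).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entity_doc (entity_id INT, pub_date INT, doc_id INT)")
conn.executemany("INSERT INTO entity_doc VALUES (1, ?, ?)",
                 [(d, d) for d in range(10)])
p1 = page(conn, 1)            # newest 3 rows
p2 = page(conn, 1, *p1[-1])   # next page: seek past p1's last row
```

With a composite index on (entity_id, pub_date, doc_id), each page is a single index seek regardless of how deep the client has paged.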
Lessons Learned (cont.)

•  Put SQL statements (templates) in resource files so you can tune them without a binary change.
•  Set up replication in dev & QA to catch replication issues earlier:
   •  Statement-based (MySQL 5.0.x) vs. row/mixed (5.1 or above)
   •  Auto-increment + (INSERT ... ON DUPLICATE KEY UPDATE ...)
   •  Date-time columns: default to NOW()
   •  Oversized data: increase max_allowed_packet
   •  Replication lag: transactions that involve index updates/deletions often take longer to complete.
•  Host and data center redundancy is important – don't put all your eggs in one basket.
RTN 3 Redesign

•  Free text search with SOLR
   •  Real-time vs. archive shards.
   •  1-minute latency without a RAM disk.
•  Asset DB partitioned – 5 rows/doc -> 25 rows/doc
•  Avoid (system) virtual machines; instead, stack high-end hosts with processes that use different system resources (CPU, MEM, disk space, etc.)
   •  Better network and system resource utilization – cost effective.
   •  Data locality
•  More processors (< 12) help when under load.
Q & A

•  Questions or comments?
THANK YOU !!

More Related Content

PPTX
Introducing Azure SQL Database
PDF
NoSQL Databases: An Introduction and Comparison between Dynamo, MongoDB and C...
PPTX
Oracle: DW Design
PDF
[SSA] 03.newsql database (2014.02.05)
PPTX
IN-MEMORY DATABASE SYSTEMS FOR BIG DATA MANAGEMENT.SAP HANA DATABASE.
PPT
The Database Environment Chapter 13
PPT
Introduction to cassandra
PPTX
SQL Server Reporting Services Disaster Recovery Webinar
Introducing Azure SQL Database
NoSQL Databases: An Introduction and Comparison between Dynamo, MongoDB and C...
Oracle: DW Design
[SSA] 03.newsql database (2014.02.05)
IN-MEMORY DATABASE SYSTEMS FOR BIG DATA MANAGEMENT.SAP HANA DATABASE.
The Database Environment Chapter 13
Introduction to cassandra
SQL Server Reporting Services Disaster Recovery Webinar

What's hot (17)

PPTX
Oracle 11g data warehouse introdution
PPTX
The IBM Netezza datawarehouse appliance
PPT
An Introduction to Netezza
PDF
From Raw Data to Analytics with No ETL
PDF
NewSQL Database Overview
PPTX
Experience SQL Server 2017: The Modern Data Platform
PDF
Whats New Sql Server 2008 R2
PDF
Bigtable and Dynamo
PPTX
IBM Pure Data System for Analytics (Netezza)
PDF
Die 10 besten PostgreSQL-Replikationsstrategien für Ihr Unternehmen
 
PPTX
Bigdata netezza-ppt-apr2013-bhawani nandan prasad
PDF
Netezza vs teradata
PPT
An overview of snowflake
PPTX
Hadoop & Greenplum: Why Do Such a Thing?
PDF
Architecture of exadata database machine – Part II
PPTX
Polyglot Database - Linuxcon North America 2016
PPTX
What Your Database Query is Really Doing
Oracle 11g data warehouse introdution
The IBM Netezza datawarehouse appliance
An Introduction to Netezza
From Raw Data to Analytics with No ETL
NewSQL Database Overview
Experience SQL Server 2017: The Modern Data Platform
Whats New Sql Server 2008 R2
Bigtable and Dynamo
IBM Pure Data System for Analytics (Netezza)
Die 10 besten PostgreSQL-Replikationsstrategien für Ihr Unternehmen
 
Bigdata netezza-ppt-apr2013-bhawani nandan prasad
Netezza vs teradata
An overview of snowflake
Hadoop & Greenplum: Why Do Such a Thing?
Architecture of exadata database machine – Part II
Polyglot Database - Linuxcon North America 2016
What Your Database Query is Really Doing
Ad

Viewers also liked (17)

PPTX
Data Science and Data Visualization (All about Data Analysis) by Pooja Ajmera
PPTX
Choosing a Data Visualization Tool for Data Scientists_Final
PPTX
Microsoft NERD Talk - R and Tableau - 2-4-2013
PPTX
Using Salesforce, ERP, Tableau & R in Sales Forecasting
PDF
Performance data visualization with r and tableau
PDF
R Markdown Tutorial For Beginners
PDF
RMySQL Tutorial For Beginners
PDF
Open Source Software for Data Scientists -- BigConf 2014
PPTX
Big Data: The 4 Layers Everyone Must Know
PPTX
Big Data Analytics
PPTX
Big Data Analytics
PDF
Big Data Visualization
PPTX
Tableau Software - Business Analytics and Data Visualization
PDF
Big Data Architecture
PDF
Lista 2 (1)
PPTX
An Interactive Introduction To R (Programming Language For Statistics)
PPT
Big Data Analytics 2014
Data Science and Data Visualization (All about Data Analysis) by Pooja Ajmera
Choosing a Data Visualization Tool for Data Scientists_Final
Microsoft NERD Talk - R and Tableau - 2-4-2013
Using Salesforce, ERP, Tableau & R in Sales Forecasting
Performance data visualization with r and tableau
R Markdown Tutorial For Beginners
RMySQL Tutorial For Beginners
Open Source Software for Data Scientists -- BigConf 2014
Big Data: The 4 Layers Everyone Must Know
Big Data Analytics
Big Data Analytics
Big Data Visualization
Tableau Software - Business Analytics and Data Visualization
Big Data Architecture
Lista 2 (1)
An Interactive Introduction To R (Programming Language For Statistics)
Big Data Analytics 2014
Ad

Similar to Building and deploying large scale real time news system with my sql and distributed cache mysql_conf (20)

PDF
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
PDF
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
PDF
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
PPTX
MinneBar 2013 - Scaling with Cassandra
PDF
C* Summit 2013: Large Scale Data Ingestion, Processing and Analysis: Then, No...
PDF
Building a High Performance Analytics Platform
PPTX
Using SAS GRID v 9 with Isilon F810
PPTX
Compare Clustering Methods for MS SQL Server
PDF
Cassandra's Odyssey @ Netflix
PDF
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
PDF
Azure Days 2019: Grösser und Komplexer ist nicht immer besser (Meinrad Weiss)
PDF
Massive Data Processing in Adobe Using Delta Lake
PPTX
NewSQL - Deliverance from BASE and back to SQL and ACID
PDF
Amazon Elastic Map Reduce - Ian Meyers
PDF
Software Developer Portfolio: Backend Architecture & Performance Optimization
PPT
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
PPT
Hadoop and Voldemort @ LinkedIn
PPTX
Software architecture for data applications
PDF
Movile Internet Movel SA: A Change of Seasons: A big move to Apache Cassandra
PDF
Cassandra Summit 2015 - A Change of Seasons
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Artur Fejklowicz - “Data Lake architecture” AI&BigDataDay 2017
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
MinneBar 2013 - Scaling with Cassandra
C* Summit 2013: Large Scale Data Ingestion, Processing and Analysis: Then, No...
Building a High Performance Analytics Platform
Using SAS GRID v 9 with Isilon F810
Compare Clustering Methods for MS SQL Server
Cassandra's Odyssey @ Netflix
Building a high-performance data lake analytics engine at Alibaba Cloud with ...
Azure Days 2019: Grösser und Komplexer ist nicht immer besser (Meinrad Weiss)
Massive Data Processing in Adobe Using Delta Lake
NewSQL - Deliverance from BASE and back to SQL and ACID
Amazon Elastic Map Reduce - Ian Meyers
Software Developer Portfolio: Backend Architecture & Performance Optimization
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Hadoop and Voldemort @ LinkedIn
Software architecture for data applications
Movile Internet Movel SA: A Change of Seasons: A big move to Apache Cassandra
Cassandra Summit 2015 - A Change of Seasons

Recently uploaded (20)

PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PPT
Teaching material agriculture food technology
PPTX
A Presentation on Artificial Intelligence
PPTX
Big Data Technologies - Introduction.pptx
PPTX
MYSQL Presentation for SQL database connectivity
PDF
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
PDF
Encapsulation_ Review paper, used for researhc scholars
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PPTX
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
Unlocking AI with Model Context Protocol (MCP)
PDF
Electronic commerce courselecture one. Pdf
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PDF
Network Security Unit 5.pdf for BCA BBA.
The Rise and Fall of 3GPP – Time for a Sabbatical?
Teaching material agriculture food technology
A Presentation on Artificial Intelligence
Big Data Technologies - Introduction.pptx
MYSQL Presentation for SQL database connectivity
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
Encapsulation_ Review paper, used for researhc scholars
Digital-Transformation-Roadmap-for-Companies.pptx
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
NewMind AI Weekly Chronicles - August'25 Week I
Per capita expenditure prediction using model stacking based on satellite ima...
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Unlocking AI with Model Context Protocol (MCP)
Electronic commerce courselecture one. Pdf
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
Diabetes mellitus diagnosis method based random forest with bat algorithm
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
Network Security Unit 5.pdf for BCA BBA.

Building and deploying large scale real time news system with my sql and distributed cache mysql_conf

  • 1. Building and Deploying Large Scale Real Time News System with MySQL and Distributed Cache Presented  to  MySQL  Conference   Apr.  13,  2011  
  • 2. Who am I? Pag e2   Tao Cheng <tao.cheng@teamaol.com>, AOL Real Time News (RTN).   Worked on Mail and Browser clients in the ‘90 and then moved to web backend servers since.   Not an expert but am happy to share my experience and brainstorm solutions. Presentation for [CLIENT]
  • 3. Agenda   AOL Real Time News (RTN): what it is?   Requirements   Technical solutions with focus on MySQL   Deployment Topology   Operational Monitoring   Metrics Collection
  • 4. Agenda   Tips for query tuning and optimization   Heuristic Query Optimization Algorithm   Lessons learned   Q & A
  • 5. Real Time News : background Pag e5 AOL deployed its large scale Real Time News (RTN) system in 2007. This system ingests and processes news from 30,000 sources on every second around the clock. Today, its data store, MySQL, has accumulated over several billions of rows and terabytes of data. However, news are delivered to end users in close to real time fashion. This presentation shares how it is done and the lessons learned. Presentation for AOLU Un-University
  • 6. Brief Intro: sample features Pag e6   Data presentation: return most recent news in   flat view – most recent news about an entity. An entity could be a person, a company, a sports team, etc.   topic clusters – most recent news grouped by topics. A topic is a group of news about an event, headline news, etc.   News filtering by   source types such as news, blogs, press releases, regional, etc.   relevancy level (high, medium, low, etc) to the entities .   Data Delivery: push (to subscribers) and pull   Search by entities, categories (National, Sports, Finance, etc), topics, document ID, etc. Presentation for [CLIENT]
  • 7. Requirements for Phase I (2006) Pag e7   Commodity hardware: 4 CPU, 16 GB MEM, 600 GB disk space.   Data ingestion rate = 250K docs/day; average document size = 5 KB.   Data retention period: 7 days to forever   Est. data set size: (1.25 GB/day or 456 GB/year) + space for indexes, schema change, and optimization.   Response time: < 30 milli-second/query   Throughputs: > 400 queries/sec/server   Up time: 99.999% Presentation for [CLIENT]
  • 8. Solutions: MySQL + Bucky Pag e8   MySQL   Serve raw/distinct queries   Back fill   Bucky Technology (AOL’s distributed cache & computing framework)   Write ahead cache: pre-compute query results and push them into cache.   Messaging (optional): push data directly to subscribers   Updatesare pushed to data consumers or browsers via AIM Complex.   Updates go to both database and cache. Presentation for [CLIENT]
  • 9. Architecture Diagram (over-simplified) Pag e9 WWW AIM   push Relegence   Ingestor   Distributed   Cache   Gateway   pull WWW Distributed   Cache   Gateway   Asset  DB   Presentation for [CLIENT]
  • 10. Data Model: SOR v.s. Query DB Pag e 10   Separate query from storage to keep tables small and query fast.   System of Record (SOR): has all raw data   The authoritative data store; designed for data storage   Normalized schema: for simple key look-up; no table join.   Query DB – de-normalized for query speed   avoid JOIN, reduce # of trips to DB, increase throughputs.   Read/write small chunk of data at a time so database can get requests out quickly and process more.   Use replication to achieve linear scalability for read. Presentation for [CLIENT]
  • 11. Design Strategies: partitioning (Why) Pag e 11   Dataset too big to fit on one host   Performance consideration: divide and conquer   Write: more masters (Nx) to take writes   Read: smaller tables + more (NxM) slaves to handle read.   Fault tolerance – distribute the risk and reduce the impact of system failure   Easier Maintenance – size does matter   Faster nightly backup, disaster recovery, schema change, etc.   Faster optimization –need optimization to reclaim disk space after deletion, rebuild indexes to improve query speed. Presentation for [CLIENT]
• 12. Design Strategies: partitioning (How)
  Partition on the most used keys (look at query patterns)
    Document table: on document ID.
    Entity table: on entity ID.
  Simple hash on IDs: no partition map, thus no contention for read/write locks on yet another table.
  Managing growth: add another partition set
    New documents are written into both the old and new partition sets for a few weeks; then stop writing into the old partitions.
    Queries go to the new partitions first, then the old ones if insufficient results are found.
  Works great in our case but might not for everyone.
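The hash-on-ID routing and the grow-by-adding-a-partition-set scheme above can be sketched as follows; the table-name pattern, partition counts, and function names are illustrative assumptions, not the production code.

```python
# Sketch of simple modulo-hash partition routing (no partition-map table,
# so no lock contention on a routing table).
def partition_for(doc_id, num_partitions):
    """Route a document ID to a partition table by plain modulo hash."""
    return "document_%d" % (doc_id % num_partitions)

def partitions_to_query(doc_id, new_count, old_count, migrating=True):
    """While a new partition set is being filled, queries try the new
    set first and fall back to the old set if results are insufficient."""
    targets = [partition_for(doc_id, new_count)]
    if migrating:
        targets.append(partition_for(doc_id, old_count))
    return targets
```

During the migration window, writes would go to both `partition_for(doc_id, new_count)` and `partition_for(doc_id, old_count)`; once the retention period has passed, the old set can be dropped.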
• 13. Schema design: De-normalization
  Make query tables small:
    Put only essential attributes in the de-normalized tables.
    Store long text attributes in separate tables.
  De-normalization: how to store and match attributes
    Single-value attributes (1:1): document ID, short string, datetime, etc. – one column, one row.
    Multi-value attributes (1:many): tricky but feasible
      Multiple rows with a composite index/key: (c1, c2, etc.)
      One row, one column: CSV string, e.g., “id1,id2,id3” – SQL: val LIKE '%id2%'
      One row, multiple columns: e.g., group1, group2, etc. – SQL: group1=val1 OR group2=val2 ...
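A quick sketch of why the CSV-string variant needs care: a plain `LIKE '%id2%'` substring match can also hit other IDs that contain "id2", which is one reason the composite-key (multi-row) form is usually safer. The helper below mimics token-wise matching in the style of MySQL's `FIND_IN_SET()`.

```python
def csv_contains(csv_value, wanted):
    """Exact token match against a comma-separated value, rather than
    the substring match that LIKE '%...%' performs."""
    return wanted in csv_value.split(",")

row = "id1,id20"
# A LIKE '%id2%'-style substring test gives a false positive here:
assert "id2" in row
# Token-wise matching does not:
assert not csv_contains(row, "id2")
assert csv_contains(row, "id20")
```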
• 14. Tips for indexing
  Simple key: for metadata retrieval.
  Composite key: find matching documents
    Start with low-cardinality and most-used columns.
    Order matters: (c1, c2, c3) != (c2, c3, c1)
  InnoDB: all secondary indexes contain the primary key
    Make the primary key short to keep index size small.
    Queries using a secondary index reference the primary key too.
  Integer vs. string: comparing numeric values is faster => index hash values of long strings instead.
  Index length: title VARCHAR(255) => prefix index idx_title(32).
  Enforce referential integrity on the application side.
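The "index hash values of long strings" tip can be sketched like this: store a CRC32 of the long string in an indexed integer column, and keep the original string in the WHERE clause to weed out the rare collisions. The column names (`url_crc`, `url`) are hypothetical.

```python
import zlib

def str_hash(value):
    """Unsigned 32-bit CRC of the string; store it in an INT UNSIGNED
    column and index that instead of the long VARCHAR."""
    return zlib.crc32(value.encode("utf-8")) & 0xFFFFFFFF

def lookup_sql(url):
    """Filter on the small integer index first, then on the full string
    to resolve hash collisions (url_crc/url are hypothetical columns)."""
    return ("SELECT doc_id FROM document WHERE url_crc = %s AND url = %s",
            (str_hash(url), url))
```

Comparing two 4-byte integers is much cheaper than comparing two long VARCHARs, and the index itself stays far smaller.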
• 15. MySQL configuration
  Storage engine: InnoDB (row-level locking).
  Table space: one file per table
    Easier to maintain (schema changes, optimization, etc.)
  Character set: UTF-8.
  Disable persistent connections (5.0.x).
  skip-character-set-client-handshake
  Enable the slow query log to identify bad queries.
  System variables for memory buffer sizes
    innodb_buffer_pool_size: data and indexes.
    sort_buffer_size, max_heap_table_size, tmp_table_size.
  query_cache_size = 0: tables are updated constantly.
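Put together, the settings above would look roughly like this in my.cnf; the sizes are illustrative placeholders, not the values RTN actually ran, and the slow-log option name differs across MySQL versions.

```ini
[mysqld]
# InnoDB with one tablespace file per table (easier maintenance)
default-storage-engine = InnoDB
innodb_file_per_table  = 1

# UTF-8 everywhere; ignore the character set the client advertises
character-set-server = utf8
skip-character-set-client-handshake

# Identify bad queries (pre-5.1 this option was log_slow_queries)
slow_query_log  = 1
long_query_time = 1

# Memory buffers (illustrative sizes for a 16 GB host)
innodb_buffer_pool_size = 10G
sort_buffer_size        = 2M
tmp_table_size          = 64M
max_heap_table_size     = 64M

# Tables change constantly, so the query cache only adds overhead
query_cache_size = 0
```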
• 16. Runtime statistics (per server)
  Average write rate:
    Daily: < 40 tps; max 400 tps during recovery.
    Performs best when write rate < 100 tps.
  Query rate: 20–80 qps.
  Query response time (shorter when indexes and data are in memory):
    75th percentile: ~3 ms when qps < 15; ~2 ms when qps ≈ 60.
    95th percentile: 6–8 ms when qps < 15; 3–4 ms when qps ≈ 60.
  CPU idle: > 99%.
• 18. Deployment Topology Considerations
  Minimum configuration: host/DC redundancy
    DC1: host 1 (master), host 3 (slave)
    DC2: host 2 (failover master), host 4 (slave)
  Data locality: significant when network latency is a concern (100 Mbps)
    3,000 qps when the DB is on a remote host.
    15,000 qps when the DB is on the local host.
  Linking dependent servers across data centers
    Push the cross link up as far as possible (Topology 3): link to dependent servers in the same data center.
• 19. Deployment Topology 1: minimum config
  [Diagram: WWW traffic reaches a data consumer backed by DB pairs in Data Center 1 and Data Center 2.]
• 20. Topology 2: link across DCs (bad)
  [Diagram: GSLB routes WWW traffic to VIPs in each data center; data consumers link to DBs across data-center boundaries.]
• 21. Topology 3: link to same DC (better)
  [Diagram: GSLB routes WWW traffic to VIPs in each data center; each data consumer links only to DBs within its own data center.]
• 22. Topology 4: use local UNIX socket
  [Diagram: each data consumer talks to a DB on the same host over a local UNIX socket.]
• 23. Production Monitoring
  Operational monitoring: logcheck, Scout/NOC alerts, etc.
  DB monitoring: replication failures, latency, read/write rates, performance metrics.
• 24. Metrics Collection
  Graphing collected metrics: visualize and collate operational metrics.
    Helps analyze and fine-tune server performance.
    Helps trace production issues and identify points of failure.
  What metrics are important?
    Host: CPU, MEM, disk I/O, network I/O, number of processes, CPU swap/paging.
    Server: throughput, response time.
  Comparison: line up charts (throughput, response time, CPU, disk I/O) in the same time window.
• 28. Tuning and Optimizing Queries
  EXPLAIN: mysql> EXPLAIN SELECT ... FROM ...
    Watch out for temporary table usage, table scans, etc.
    SQL_NO_CACHE: bypass the query cache when benchmarking.
  MySQL query profiler: mysql> SET profiling = 1;
  Linux OS cache: leave enough memory on the host.
  USE INDEX hint to choose an index explicitly
    Use wisely: most of the time, MySQL chooses the right index for you. But as table size grows, index cardinality might change.
• 29. Important MySQL statistics
  SHOW GLOBAL STATUS ...
    Qcache_free_blocks
    Qcache_free_memory
    Qcache_hits
    Qcache_inserts
    Qcache_lowmem_prunes
    Qcache_not_cached
    Qcache_queries_in_cache
    Select_scan
    Sort_scan
• 30. Important MySQL statistics (cont.)
    Table_locks_waited
    Innodb_row_lock_current_waits
    Innodb_row_lock_time
    Innodb_row_lock_time_avg
    Innodb_row_lock_time_max
    Innodb_row_lock_waits
    Select_scan
    Slave_open_temp_tables
• 31. Heuristic Query Optimization Algorithm
  Primarily for complex cluster queries: find the latest N topics and related stories.
  Strategy: reduce the number of records the database needs to load from disk to perform a query.
    Pick a default query range. If insufficient docs are returned, expand the query range proportionally.
    If none are returned => sparse data => drop the range and retry.
    Save the query range for future reference.
  Result: reduced the number of rows to process from millions to hundreds => cut query time from minutes to under 10 ms.
• 32. Cluster query flow
  1. Look up a saved query range for this cluster query; if none exists, use the default range. Set NumOfTripToDB = 0.
  2. Bound the query with the range and send it to the DB; increment NumOfTripToDB.
  3. If the query engine returns sufficient results: compute the docs-to-range ratio, save it back to the look-up table for future use, and return the results to clients.
  4. If numOfResults == 0 (sparse data) or NumOfTripToDB >= 2: send the original, unbounded query to the DB.
  5. Otherwise: compute the docs-to-range ratio and prorate it to a range that would return a sufficient amount of docs, then retry from step 2.
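The loop above might look like this in code; the function and variable names are assumptions, and the real system persists the docs-to-range ratio in a look-up table rather than returning it to the caller.

```python
def bounded_query(run_query, default_range, saved_ratio=None,
                  wanted=50, max_trips=2):
    """Sketch of the heuristic range optimizer.

    run_query(qrange) returns matching docs; a non-None qrange bounds
    the scan (e.g. a "created_at >= NOW() - qrange" predicate) so the
    DB reads hundreds of rows instead of millions.
    """
    # A saved ratio (docs per unit of range) lets us size the range
    # directly; otherwise start from the default.
    qrange = wanted / saved_ratio if saved_ratio else default_range
    for _ in range(max_trips):
        docs = run_query(qrange)
        if len(docs) >= wanted:
            # Enough results: return them plus the ratio to save back.
            return docs[:wanted], len(docs) / qrange
        if not docs:
            break  # sparse data: drop the range and run unbounded
        # Prorate: expand the range in proportion to the shortfall.
        qrange = qrange * wanted / len(docs)
    return run_query(None)[:wanted], None
```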
• 33. Lessons Learned
  Always load test well ahead of launch (2 weeks) to avoid a fire drill.
  Don’t rely solely on the cache. The database needs to be able to serve a reasonable amount of queries on its own.
  Separate the cache from applications to avoid cold starts.
  Keep transactions/queries simple and returning fast.
  Avoid table joins; limit them to 2 tables if really needed.
  Avoid stored procedures: results are not cached, and a DBA is needed when altering the implementation.
• 34. Lessons Learned (cont.)
  Avoid using an offset in the LIMIT clause; use application-based pagination instead.
  Avoid SQL_CALC_FOUND_ROWS in SELECT.
  If possible, exclude text/blob columns from query results to avoid disk I/O.
  Store text/blob in separate tables to speed up backups, optimization, and schema changes.
  Separate real-time vs. archive data for better performance and easier maintenance.
  Keep table size under control (< 100 GB); optimize periodically.
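The "application-based pagination" advice typically means keyset (seek) pagination: remember the last key the client saw and seek past it, so MySQL never reads and discards the skipped rows the way `LIMIT offset, n` does. Table and column names below are hypothetical.

```python
def page_sql(last_doc_id=None, page_size=20):
    """Build a keyset-pagination query: seek below the last seen doc_id
    via the primary-key index instead of using an offset."""
    if last_doc_id is None:
        # First page: just take the newest rows.
        return ("SELECT doc_id, title FROM document "
                "ORDER BY doc_id DESC LIMIT %s", (page_size,))
    # Subsequent pages: seek past the last key the client saw.
    return ("SELECT doc_id, title FROM document WHERE doc_id < %s "
            "ORDER BY doc_id DESC LIMIT %s", (last_doc_id, page_size))
```

With an offset, page 1000 forces the server to read and throw away ~20,000 rows; the seek form reads exactly `page_size` rows regardless of how deep the client has paged.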
• 35. Lessons Learned (cont.)
  Put SQL statements (templates) in resource files so you can tune them without a binary change.
  Set up replication in dev & QA to catch replication issues earlier
    Statement-based replication (MySQL 5.0.x) vs. row/mixed-based (5.1 or above).
    Auto-increment + INSERT ... ON DUPLICATE KEY UPDATE ...
    Datetime columns defaulting to NOW().
    Oversized data: increase max_allowed_packet.
  Replication lag: transactions that involve index updates/deletions often take longer to complete.
  Host and data center redundancy is important: don’t put all your eggs in one basket.
• 36. RTN 3 Redesign
  Free-text search with SOLR
    Real-time vs. archive shards.
    1-minute latency without a RAM disk.
  Asset DB partitioned: 5 rows/doc -> 25 rows/doc.
  Avoid (system) virtual machines; instead, stack high-end hosts with processes that use different system resources (CPU, MEM, disk space, etc.)
    Better network and system resource utilization: cost effective.
    Data locality.
  More processors (< 12) help when under load.
• 37. Q&A
  Questions or comments?
• 38. THANK YOU!