Evolution of a “big data” project
Michael Peacock

@michaelpeacock

Head Developer @groundsix

Author of a number of web related books

Occasional conference / user group speaker
Ground Six
Tech company based in the North East of England

Specialise in developing web and mobile applications

Provide investment (financial and tech) to interesting app ideas

Got an idea? Looking for investment? www.groundsix.com
What’s in store
Challenges, solutions and approaches when dealing with billions of inserts per day

   Processing and storing the data

   Querying the data quickly

   Reporting against the data

   Keeping the application responsive

   Keeping the application running

   Legacy project, problems and code
Vehicle Telematics
Electric Vehicles: Need for Data
We need to receive all of the data

We need to keep all of the data

We need to be able to display data in real time

We need to transfer large chunks of data to customers and government departments

We need to be able to calculate performance metrics from the data
Some stats

500 (approx) telemetry-enabled vehicles using the system

2500 data points captured per vehicle per second

> 1.5 billion MySQL inserts per day

World’s largest vehicle telematics project outside of Formula 1
More stats


Constant minimum of 4000 inserts per
second within the application

Peaks:

  3 million inserts per second
Processing and storing the data
Receiving continuous data streams

We need to be online

We need to have capacity to process the data

We need to scale
Message Queue


Fast, secure, reliable and scalable

Hosted: they worry about the server
infrastructure and availability

We only have to process what we can
AMQP + PHP

php-amqplib (github.com/videlalvaro/php-amqplib)

  OR install it via Composer: videlalvaro/php-amqplib

Pure PHP implementation

Handles publishing and consuming messages from a queue
AMQP: Consume
// connect to the AMQP server
$connection = new AMQPConnection($host,$port,$user,$password);

// create a channel; a logical stateful link to our physical connection
$channel = $connection->channel();


// link the channel to an exchange (where messages are sent)
$channel->exchange_declare($exchange, 'direct');

// bind the channel to the queue
$channel->queue_bind($queue, $exchange);

// consume by sending the message to our processing callback function
$channel->basic_consume($queue, $consumerTag, false, false, false,
$callbackFunctionName);

while(count($channel->callbacks))
{
  $channel->wait();
}
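
The slide only shows the consuming side; for completeness, here is a minimal sketch of publishing with the same library. Variable names are illustrative, and it assumes the connection, channel and exchange are set up as in the consumer above, with the library's AMQPMessage class available:

// wrap the payload (e.g. one line of telemetry) in a message;
// delivery_mode 2 asks the broker to persist it to disk
$message = new AMQPMessage($telemetryLine, array('delivery_mode' => 2));

// publish to the exchange; the broker routes it to the bound queue
$channel->basic_publish($message, $exchange);

$channel->close();
$connection->close();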
Buffers
Pulling in the data
Dedicated application and hardware to
consume from the Message Queue and
convert to MySQL Inserts

MySQL: LOAD DATA INFILE

  Very fast

  Due to high volumes of data, these “bulk
  operations” only cover a few seconds of
  time - still giving a live stream of data
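
A hedged sketch of that bulk-load step: buffer a few seconds of rows to a CSV file, then hand the whole file to MySQL in one statement. The column names are assumptions, and LOCAL assumes the file lives on the application server rather than the database server:

// write a few seconds' worth of buffered readings to a temporary CSV file
$file = tempnam(sys_get_temp_dir(), 'telemetry');
$fh = fopen($file, 'w');
foreach ($bufferedRows as $row) {
    fputcsv($fh, $row); // e.g. vehicle_id, timestamp, variable, value
}
fclose($fh);

// one LOAD DATA INFILE replaces thousands of individual INSERTs
$mysqli->query(
    "LOAD DATA LOCAL INFILE '" . $mysqli->real_escape_string($file) . "'
     INTO TABLE datavalue
     FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
     (vehicle_id, recorded_at, variable, value)"
);
unlink($file);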
Optimising MySQL
innodb_flush_method=O_DIRECT

    Lets the buffer pool bypass the OS cache

    InnoDB buffer pool is more efficient than the OS cache

    Can have negative side effects

Improve write performance:

    innodb_flush_log_at_trx_commit=2

    Prevents per-commit log flushing

Query cache size (query_cache_size)

    Measure your application’s usage and make a judgement

    Our data stream was too frequent to make use of the cache
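
Collected as a my.cnf fragment; the values are illustrative starting points, not recommendations:

[mysqld]
# bypass the OS page cache; let the InnoDB buffer pool manage caching
innodb_flush_method = O_DIRECT

# flush the log to disk once per second instead of at every commit;
# trades up to ~1s of committed transactions on crash for write throughput
innodb_flush_log_at_trx_commit = 2

# only worth enabling if measurement shows repeated identical queries;
# a constantly changing data stream invalidates the cache too often
query_cache_size = 0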
Sharding (1)

Evaluate data, look for natural break points

Split the data so each data collection unit
(vehicle) had a separate database

Gives some support for horizontal scaling

  Provided the data per vehicle is a
  reasonable size
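
A minimal sketch of how per-vehicle sharding might be wired in at the connection layer. The naming scheme (vehicle_<id>) and the server map are assumptions, not the project's actual code:

/**
 * Resolve a connection to the database holding one vehicle's data.
 * Assumes databases are named vehicle_<id> and spread across servers.
 */
function getVehicleConnection(array $servers, $vehicleId)
{
    // pick a server deterministically so the same vehicle
    // always maps to the same host
    $server = $servers[$vehicleId % count($servers)];

    $mysqli = new mysqli($server['host'], $server['user'], $server['pass']);
    $mysqli->select_db('vehicle_' . (int) $vehicleId);
    return $mysqli;
}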
System architecture
But the MQ can store data... why do you have a problem?

  Message Queue isn’t designed for storage

  Messages are transferred in a compressed
  form

  Nature of vehicle data (CAN) means that a 16
  character string is actually 4 - 64 pieces of
  data
Sam Lambert
Solves big-data MySQL problems for
breakfast

Constantly tweaking the servers and
configuration to get more and more
performance

Pushing the capabilities of our SAN,
tweaking configs where no DBA has gone
before

www.samlambert.com

http://www.samlambert.com/2011/07/how-to-push-your-san-with-open-iscsi_13.html

http://www.samlambert.com/2011/07/diagnosing-and-fixing-mysql-io.html

Twitter: @isamlambert
Querying the data QUICKLY!
Graphs! Slow!
Long Running Queries
More and more vehicles came into service

Huge amount of data resulted in very slow
queries

  Page load

  Session locking

  Slow exports

  Slow backups
Real time information
Original database schema dictated all
information was accessed via a query, or a
separate subquery. Expensive.

Live information:

  Up to 30 data points

  Refreshing every 5 - 30 seconds via AJAX

Painful
Requests
Asynchronous requests let the page load before
the data

Number of these requests had to be monitored

Real time information used Fusion Charts

  1 AJAX call per chart

  10 - 30 charts per vehicle live screen

  Refresh every 5 - 30 seconds
Requests: Optimised
Single entry point

Multiple entry points make it difficult to
dynamically change the timeout and memory
usage of key pages, as well as to deal with
session locking issues effectively.

Single point of entry is essential

Check out the Symfony Routing component...
Symfony Routing
// load your routes
$locator = new FileLocator( array( __DIR__ . '/../../' ) );

// build the request context from the current URI
$requestURL = isset( $_SERVER['REQUEST_URI'] ) ? $_SERVER['REQUEST_URI'] : '';
$requestURL = ( strlen( $requestURL ) > 1 ) ? rtrim( $requestURL, '/' ) : $requestURL;
$requestContext = new RequestContext( $requestURL );

// set up the router; RoutingRouter here is an aliased
// Symfony\Component\Routing\Router
$router = new RoutingRouter( new YamlFileLoader( $locator ), "routes.yml",
array('cache_dir' => null), $requestContext );

// get the route for your request
$route = $router->match( $requestURL );

// act on the route
Sharding: split the data into smaller buckets
Sharding (2)

Data is very time relevant

  Only care about specific days

  Don’t care about comparing data too much

Split the data so that each week had a
separate table
Supporting Sharding
A simple PHP function that all queries run through. It works out the table name; combine it with an sprintf to build the full query string.
/**
 * Get the sharded table to use for a specific date
 * @param String $date YYYY-MM-DD
 * @return String
 */
public function getTableNameFromDate( $date )
{
    // ASSUMPTION: today's database is ALWAYS THERE
    // ASSUMPTION: you shouldn't be querying for data in the future
    $date = ( $date > date( 'Y-m-d' ) ) ? date( 'Y-m-d' ) : $date;
    $stt = strtotime( $date );
    if( $date >= $this->switchOver ) {
        // early January can fall into ISO week 52 or 53 of the previous year
        $year = ( date( 'm', $stt ) === '01' && date( 'W', $stt ) >= 52 ) ? date( 'Y', $stt ) - 1 : date( 'Y', $stt );
        return 'datavalue_' . $year . '_' . date( 'W', $stt );
    }
    else {
        return 'datavalue';
    }
}
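
Usage, as the slide suggests: drop the table name into the query with sprintf. The column names here are illustrative:

$sql = sprintf(
    'SELECT recorded_at, value FROM %s WHERE vehicle_id = %d AND variable = %d',
    $this->getTableNameFromDate( $date ),
    $vehicleId,
    $variableId
);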
Sharding: an excuse

Alterations to the database schema

Code to support smaller buckets of data



Take advantage of needing to touch queries
and code: improve them!
Index Optimisation
Two sharding projects left the schema as a
Frankenstein

Indexes still included columns from before the first
shard (the vehicle ID)

   Wasting storage space

   Increasing the index size

   Increasing query time

   Makes the index harder to fit into memory
Schema Optimisation
MySQL provides a range of data-types

Varying storage implications

   Does that need to be a BIGINT?

   Do you really need DOUBLE PRECISION when a
   FLOAT will do?

Are those tables, fields or databases still required?

Perform regular schema audits
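
For instance, shrinking over-sized types is a single ALTER (the table and column names here are hypothetical), though on tables this size it needs scheduling because it rebuilds the table:

-- INT UNSIGNED (4 bytes) instead of BIGINT (8 bytes),
-- FLOAT (4 bytes) instead of DOUBLE (8 bytes)
ALTER TABLE datavalue_2012_06
    MODIFY vehicle_id INT UNSIGNED NOT NULL,
    MODIFY value FLOAT NOT NULL;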
Query Optimisation

Run your queries through EXPLAIN
EXTENDED

  Check they hit the indexes

For big queries avoid functions such as
CURDATE - this helps ensure the cache is hit
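
The CURDATE point, concretely: a non-deterministic function in the query text stops the query cache from ever matching, so substitute the literal date in the application. The recorded_at column is an illustrative name:

-- never served from the query cache: CURDATE() is non-deterministic
SELECT COUNT(*) FROM datavalue WHERE recorded_at >= CURDATE();

-- cacheable: identical query text on every call within the day
SELECT COUNT(*) FROM datavalue WHERE recorded_at >= '2012-02-29';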
Reporting against the data
Performance report
Reports & Intensive Queries
How far did the vehicle travel today

  Calculation involves looking at every single
  motor speed value for the day

How much energy did the vehicle use today

  Calculation involves looking at multiple
  variables for every second of the day

Lookup time + calculation time
Group the queries
Leverage indexes

  Perform related queries in succession

  Then perform calculations

Catching up on a backlog of calculations and
exports?

  Work through one table’s queries at a time

  Make use of indexes
Save the report
Automate the queries in dead time, grouped
together nicely

Save the results in a reports table

Only a single record per vehicle per day of
performance data

  Means users and management can run
  aggregate and comparison queries
  themselves quickly and easily
Enables date-range aggregation
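
A sketch of what such a reports table might look like; the deck only specifies one row per vehicle per day, so the schema below is an assumption:

CREATE TABLE vehicle_daily_report (
    vehicle_id   INT UNSIGNED NOT NULL,
    report_date  DATE         NOT NULL,
    distance_km  FLOAT        NOT NULL,
    energy_kwh   FLOAT        NOT NULL,
    PRIMARY KEY (vehicle_id, report_date)
) ENGINE=InnoDB;

-- date-range aggregation is now cheap:
SELECT vehicle_id, SUM(distance_km), SUM(energy_kwh)
FROM vehicle_daily_report
WHERE report_date BETWEEN '2012-01-01' AND '2012-01-31'
GROUP BY vehicle_id;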
Check for efficiency savings

Initial export scripts maintained a MySQLi
connection per database (500!)

Updated to maintain one per server and
simply switch to the database in question
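
A sketch of that change (the helper name and server array shape are hypothetical): hold one mysqli handle per server and switch schemas with select_db, rather than opening one connection per database:

$connections = array(); // one handle per server, keyed by host

function queryVehicleDb(array &$connections, array $server, $vehicleId, $sql)
{
    $host = $server['host'];
    if( !isset( $connections[$host] ) ) {
        $connections[$host] = new mysqli( $host, $server['user'], $server['pass'] );
    }
    // switch to the vehicle's database on the existing connection
    $connections[$host]->select_db( 'vehicle_' . (int) $vehicleId );
    return $connections[$host]->query( $sql );
}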
Leverage your RAM

Intensive queries might only use X% of your
RAM

Safe to run more than one report / export
at a time

Add support for multiple exports / reports
within your scripts e.g.
$numberOfConcurrentReportsToRun = 2;
$reportInstance = 0; // which of the concurrent workers this process is (0 or 1)
$counter = 0;
foreach( $data as $unit ) {
    if( ( $counter % $numberOfConcurrentReportsToRun ) == $reportInstance ) {
        $dataToProcess[] = $unit;
    }
    $counter++;
}
Extrapolate & Assume

Data is only stored when it changes

Known assumptions are used to extrapolate
values for all seconds of the day

Saves MySQL but costs in RAM

“Interlation”
Interlation
  * Add an array to the interlation
public function addArray( $name, $array )

  * Get the time that we first receive data in one of our arrays
public function getFirst( $field )

  * Get the time that we last received data in any of our arrays
public function getLast( $field )

  * Generate the interlaced array
public function generate( $keyField, $valueField )

  * Break the interlaced array down into separate days
public function dayBreak( $interlationArray )

   * Generate an interlaced array and fill for all timestamps within the range
of     _first_ to _last_
public function generateAndFill( $keyField, $valueField )

  * Populate the new combined array with key fields using the common field
public function populateKeysFromField( $field, $valueField=null )

http://www.michaelpeacock.co.uk/interlation-library
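
A hypothetical usage sketch assembled purely from the method names above; the library's actual constructor and field conventions may differ:

$interlation = new Interlation();

// one array per telemetry stream, stored only when the value changes
$interlation->addArray( 'speed', $speedRows );
$interlation->addArray( 'current', $currentRows );

// fill a value for every timestamp between the first and last reading,
// extrapolating across the "no change" gaps
$combined = $interlation->generateAndFill( 'timestamp', 'value' );

// split the combined array into per-day chunks for daily reports
$days = $interlation->dayBreak( $combined );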
Food for thought



Gearman

  Tool to schedule and run background jobs
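
A minimal sketch with PHP's pecl gearman extension (the function name and payload are illustrative): the web app queues the job and returns immediately, while a separate worker process does the heavy lifting.

// client side: queue an export without blocking the request
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$client->doBackground('export_vehicle_day', json_encode(array(
    'vehicle_id' => 42,
    'date'       => '2012-02-29',
)));

// worker side (long-running process)
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('export_vehicle_day', function (GearmanJob $job) {
    $params = json_decode($job->workload(), true);
    // ... run the heavy queries and write the export file ...
});
while ($worker->work()) {
    // loop forever, processing jobs as they arrive
}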
Keeping the application responsive
Session Locking

Some queries were still (understandably, and
acceptably) slow

Sessions would lock and AJAX scripts would
enter race conditions

User would attempt to navigate to another
page: their session with the web server
wouldn’t respond
Session Locking: Resolution
Session locking is caused by how PHP handles sessions:

  The session file stays locked until the script
  finishes executing the request

Potential solution: use another method e.g.
database

Our solution: manually close the session
Closing the session

session_write_close();

Caveats:

  If you need to write to sessions again in
  the execution cycle, you must call
  session_start() again

  Made problematic by the lack of template
  handling
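
The pattern in miniature (a sketch, assuming no output has been sent when the session is re-opened): do the session writes early, release the lock, then run the slow query so parallel AJAX requests are not serialised behind it.

session_start();
$_SESSION['last_viewed_vehicle'] = $vehicleId; // illustrative write

// release the session file lock so other requests from this user
// (e.g. parallel AJAX chart calls) are not blocked behind us
session_write_close();

$results = runSlowTelemetryQuery($vehicleId); // hypothetical helper

// caveat from the slide: re-open the session if we need to write again
session_start();
$_SESSION['last_query_at'] = time();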
Live real-time data
Request consolidation helped

Each data point on the live screen was still a
separate query due to original design
constraints

Live fleet information spanned multiple
databases e.g. a map of all vehicles
belonging to a customer

Solution: caching
Caching with memcached
      Fast, in-memory key-value store

         Used to keep a copy of the most recent
         data from each vehicle
$mc = new Memcache();
$mc->connect($memcacheServer, $memcachePort);
$realTimeData = $mc->get($vehicleID . '-' . $dataVariable);


      Failover: Moxi Memcached Proxy
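
The write side, as a sketch (key format copied from the lookup above; the 60-second expiry is an assumption): the import pipeline overwrites the latest value for each vehicle/variable pair as messages are processed.

$mc = new Memcache();
$mc->connect($memcacheServer, $memcachePort);

// keep only the most recent reading per vehicle/variable;
// a short expiry stops stale vehicles looking "live"
$mc->set($vehicleID . '-' . $dataVariable, $value, 0, 60);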
Caching enables a large range of data to be looked up quickly
Legacy Project
Constraints, problems and code. Easing
         deployment anxiety.
Source Control Management

Initially SVN

Migrated to git

  Branch per feature strategy

Automated deployment
Dependencies
A Dependency Injection framework was missing
from the application, which caused problems with:

  Authentication

  Memcache

  Handling multiple concurrent database
  connections

  Access control
Autoloading



PSR-0
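
A simplified PSR-0 autoloader sketch (the base directory is an assumption, and full PSR-0 only maps underscores within the class name itself, whereas this sketch maps them everywhere):

spl_autoload_register(function ($className) {
    // PSR-0: namespace separators (and underscores in the class name)
    // map to directory separators
    $className = ltrim($className, '\\');
    $path = str_replace(array('\\', '_'), DIRECTORY_SEPARATOR, $className) . '.php';
    $file = __DIR__ . '/src/' . $path;
    if (file_exists($file)) {
        require $file;
    }
});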
Templates and sessions


Closing and opening sessions means you need
to know when data has been sent to the
browser

Separation of concerns and template systems
help with this
Database rollouts
Specific database table defines how the data should be processed

Log database deltas

Automated process to roll out changes

  Backup existing table first
DATE=`date +%H-%M-%d-%m-%y`
mysqldump -h HOST -u USER -pPASSWORD DATABASE TABLENAME > /backups/dictionary_$DATE.sql
cd /var/www/pdictionarypatcher/repo/
git pull origin master
cd src
php index.php


  Roll out the changes
private function applyNextPatch( $currentPatchID )
{
    $patchToTry = ++$currentPatchID;
    $patchFile = FRAMEWORK_PATH . '../patches/' . $patchToTry . '.php';
    if( file_exists( $patchFile ) ) {
        // each patch file holds the SQL for one numbered delta
        $sql = file_get_contents( $patchFile );
        $this->database->multi_query( $sql );
        return $this->applyNextPatch( $patchToTry );
    }
    else {
        // no more patches: return the ID of the last one applied
        return $patchToTry - 1;
    }
}
The future
Tiered SAN hardware
NoSQL?
MySQL was used as a “golden hammer”

Original team of contractors who built the
system knew it

Easy to hire developers who know it

Not necessarily the best option

We had to introduce application-level
sharding for it to suit the growing needs
Rationalisation


Do we need all that data? Really?

  At the moment: probably

  In the future: probably not
Direct queue interaction


 Types of message queue could allow our live
 data to be streamed direct from a queue

 We could use this infrastructure to share
 the data with partners instead of providing
 them regular processed exports
More hardware



More vehicles + New components = Need for
more storage
Conclusions
So you need to work with a crap-load of data?
PHP needs lots of friends
PHP is a great tool for:

   Displaying the data

   Processing the data

   Exporting the data

   Binding business logic to the data

It needs friends to:

   Queue the data

   Insert the data

   Visualise the data
Continually Review


Your schema & indexes

Your queries

Efficiencies in your code

Number of AJAX requests
Message Queue: A safety net

Queue what you can

Lets you move data around while you process
it

Gives your hardware some breathing space
Code Considerations
Template engines

Dependency management

Abstraction

Autoloading

Session handling

Request management
Compile Data
Keep related data together

Look at storing summaries of data

   Approach used by analytics companies: granularity
   changes over time:

      This week: per second data

      Last week: Hourly summaries

      Last month: Daily summaries

      Last year: Monthly summaries
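
A sketch of the rollup idea (table and column names hypothetical, with the shard name following the earlier datavalue_YYYY_WW convention): periodically compress old per-second rows into hourly summaries, then archive or drop the originals.

-- build hourly summaries from an old week's per-second shard
INSERT INTO datavalue_hourly (vehicle_id, variable, hour_start, avg_value, max_value)
SELECT vehicle_id, variable,
       DATE_FORMAT(recorded_at, '%Y-%m-%d %H:00:00'),
       AVG(value), MAX(value)
FROM datavalue_2012_05
GROUP BY vehicle_id, variable, DATE_FORMAT(recorded_at, '%Y-%m-%d %H:00:00');

-- once summarised, the per-second shard can be archived or dropped
DROP TABLE datavalue_2012_05;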
Thanks; Q+A


Michael Peacock

mkpeacock@gmail.com

@michaelpeacock

www.michaelpeacock.co.uk
Photo credits


flickr.com/photos/itmpa/4531956496/

flickr.com/photos/eveofdiscovery/3149008295

flickr.com/photos/gadl/89650415/

flickr.com/photos/brapps/403257780

Editor's Notes

  • #2: Hello everyone; thanks for coming. I spent the last 12 months working on a large-scale, data-intensive project, focusing on the development of a PHP web application which had to support, display, process, report against and export a phenomenal amount of data each day.
  • #17: The project concerned dealing with vehicle telematics data from vehicles produced by Smith Electric Vehicles, one of the world's largest manufacturers of all-electric commercial vehicles. As a new and emerging industry, performance, efficiency and fault reporting data from these vehicles is very valuable. As I'm sure you can imagine, with electric vehicles the drive and battery systems generate a large amount of data, with batteries broken down into smaller cells, each giving us temperature, current, voltage and state of charge data.
  • #18-22: As the data may relate to performance and faults, we need to ensure we get the data. Telematics projects which offer safety features have this as an even more important issue. We also have government partners who subsidise the vehicle cost in exchange for some of this data. Subsequently we need to be able to give this data to them, as well as receiving it ourselves. As EVs rely on chemistry and external factors, we need to keep data so we can compare data at different times of the year and different locations.
  • #23-26: What you will realise is that we in effect built a large-scale distributed-denial-of-service system, and pointed it directly at our own hardware, with the caveat of needing the data from the DDoS attack!
  • #28: Before we could do anything, we needed to be able to process the data and store it within the system. This includes actually transferring the data to our servers, inserting it into our database cluster and performing business logic on the data.
  • #29-31: In order for us to reliably receive the data, we need the system to be online so that data can be transferred. We also need to have the server capacity to process the data, and we need to be able to scale the system. Just because there are X number of data collection units out there, we don't know how many will be on at a given time, and we have to deal with more and more collection units being built and delivered.
  • #32: The biggest problem is dealing with the pressure of that data stream.
  • #35: There are a range of AMQP libraries for PHP, some of them based on the C library, with other difficult dependencies. A couple of guys developed a pure PHP implementation of the library which is really easy to use and install, and can be installed directly via Composer. As it's a pure PHP implementation it's really easy to get up and running on any platform. It provides support for both publishing and consuming messages from a queue, and is great not only for dealing with streams of data but also for storing events and requests across multiple sessions, or dispatching jobs.
  • #37: A small buffer allows us to cope with the issue of connectivity problems to our message queue, or signal problems with the data collection devices.
  • #38: To give data import the resources it needs, the system had dedicated hardware to consume messages from the message queue, perform business logic and convert them to MySQL inserts. Although it's an obvious one, it's also easily overlooked: the data is bundled together into LOAD DATA INFILE statements with MySQL.
  • #78: With a project of this scale, dealing with business-critical data could lead to deployment anxiety. This is because a bug in rolled-out code could cause problems with displaying real-time data, or cause exported data or processed data reports to be incorrect, requiring them to be re-run at a cost of CPU time, most of which was already in use generating that day's reports or dealing with that day's data imports. The architecture of the application also provided constraints for maintenance.