Evolution of a “big data” project
Michael Peacock

@michaelpeacock

Head Developer @groundsix

Author of a number of web related books

Occasional conference / user group speaker
Ground Six
Tech company based in the North East of England

Specialise in developing web and mobile applications

Provide investment (financial and tech) to interesting app ideas

Got an idea? Looking for investment? www.groundsix.com
What’s in store
Challenges, solutions and approaches when dealing with billions of inserts per day

   Processing and storing the data

   Querying the data quickly

   Reporting against the data

   Keeping the application responsive

   Keeping the application running

   Legacy project, problems and code
Vehicle Telematics
Electric Vehicles: Need for Data
We need to receive all of the data

We need to keep all of the data

We need to be able to display data in real time

We need to transfer large chunks of data to customers and government departments

We need to be able to calculate performance metrics from the data
Some stats

500 (approx) telemetry-enabled vehicles using the system

2500 data points captured per vehicle per second

> 1.5 billion MySQL inserts per day

World’s largest vehicle telematics project outside of Formula 1
More stats


Constant minimum of 4000 inserts per
second within the application

Peaks:

  3 million inserts per second
Processing and storing the data
Receiving continuous data streams

We need to be online

We need to have capacity to process the data

We need to scale
Message Queue


Fast, secure, reliable and scalable

Hosted: they worry about the server
infrastructure and availability

We only have to process what we can
AMQP + PHP

php-amqplib (github.com/videlalvaro/php-amqplib)

  OR install it via Composer: videlalvaro/php-amqplib

Pure PHP implementation

Handles publishing and consuming messages from a queue
AMQP: Consume
// connect to the AMQP server
$connection = new AMQPConnection($host,$port,$user,$password);

// create a channel; a logical stateful link to our physical connection
$channel = $connection->channel();


// link the channel to an exchange (where messages are sent)
$channel->exchange_declare($exchange, 'direct');

// bind the channel to the queue
$channel->queue_bind($queue, $exchange);

// consume by sending the message to our processing callback function
$channel->basic_consume($queue, $consumerTag, false, false, false,
$callbackFunctionName);

while(count($channel->callbacks))
{
  $channel->wait();
}
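
The slide only shows the consuming side; for completeness, here is a minimal sketch of publishing with the same library. Variable names are illustrative, and it assumes the connection, channel and exchange are set up as in the consumer above, with the library's AMQPMessage class available:

// wrap the payload (e.g. one line of telemetry) in a message;
// delivery_mode 2 asks the broker to persist it to disk
$message = new AMQPMessage($telemetryLine, array('delivery_mode' => 2));

// publish to the exchange; the broker routes it to the bound queue
$channel->basic_publish($message, $exchange);

$channel->close();
$connection->close();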
Buffers
Pulling in the data
Dedicated application and hardware to
consume from the Message Queue and
convert to MySQL Inserts

MySQL: LOAD DATA INFILE

  Very fast

  Due to high volumes of data, these “bulk
  operations” only cover a few seconds of
  time - still giving a live stream of data
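
A hedged sketch of that bulk-load step: buffer a few seconds of rows to a CSV file, then hand the whole file to MySQL in one statement. The column names are assumptions, and LOCAL assumes the file lives on the application server rather than the database server:

// write a few seconds' worth of buffered readings to a temporary CSV file
$file = tempnam(sys_get_temp_dir(), 'telemetry');
$fh = fopen($file, 'w');
foreach ($bufferedRows as $row) {
    fputcsv($fh, $row); // e.g. vehicle_id, timestamp, variable, value
}
fclose($fh);

// one LOAD DATA INFILE replaces thousands of individual INSERTs
$mysqli->query(
    "LOAD DATA LOCAL INFILE '" . $mysqli->real_escape_string($file) . "'
     INTO TABLE datavalue
     FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
     (vehicle_id, recorded_at, variable, value)"
);
unlink($file);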
Optimising MySQL
innodb_flush_method=O_DIRECT

    Lets the buffer pool bypass the OS cache

    InnoDB buffer pool is more efficient than the OS cache

    Can have negative side effects

Improve write performance:

    innodb_flush_log_at_trx_commit=2

    Prevents per-commit log flushing

Query cache size (query_cache_size)

    Measure your application’s usage and make a judgement

    Our data stream was too frequent to make use of the cache
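
Collected as a my.cnf fragment; the values are illustrative starting points, not recommendations:

[mysqld]
# bypass the OS page cache; let the InnoDB buffer pool manage caching
innodb_flush_method = O_DIRECT

# flush the log to disk once per second instead of at every commit;
# trades up to ~1s of committed transactions on crash for write throughput
innodb_flush_log_at_trx_commit = 2

# only worth enabling if measurement shows repeated identical queries;
# a constantly changing data stream invalidates the cache too often
query_cache_size = 0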
Sharding (1)

Evaluate data, look for natural break points

Split the data so each data collection unit
(vehicle) had a separate database

Gives some support for horizontal scaling

  Provided the data per vehicle is a
  reasonable size
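
A minimal sketch of how per-vehicle sharding might be wired in at the connection layer. The naming scheme (vehicle_<id>) and the server map are assumptions, not the project's actual code:

/**
 * Resolve a connection to the database holding one vehicle's data.
 * Assumes databases are named vehicle_<id> and spread across servers.
 */
function getVehicleConnection(array $servers, $vehicleId)
{
    // pick a server deterministically so the same vehicle
    // always maps to the same host
    $server = $servers[$vehicleId % count($servers)];

    $mysqli = new mysqli($server['host'], $server['user'], $server['pass']);
    $mysqli->select_db('vehicle_' . (int) $vehicleId);
    return $mysqli;
}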
System architecture
But the MQ can store data... why do you have a problem?

  Message Queue isn’t designed for storage

  Messages are transferred in a compressed
  form

  Nature of vehicle data (CAN) means that a 16
  character string is actually 4 - 64 pieces of
  data
Sam Lambert
Solves big-data MySQL problems for
breakfast

Constantly tweaking the servers and
configuration to get more and more
performance

Pushing the capabilities of our SAN,
tweaking configs where no DBA has gone
before

www.samlambert.com

http://www.samlambert.com/2011/07/how-to-push-your-san-with-open-iscsi_13.html

http://www.samlambert.com/2011/07/diagnosing-and-fixing-mysql-io.html

Twitter: @isamlambert
Querying the data QUICKLY!
Graphs! Slow!
Long Running Queries
More and more vehicles came into service

Huge amount of data resulted in very slow
queries

  Page load

  Session locking

  Slow exports

  Slow backups
Real time information
Original database schema dictated all
information was accessed via a query, or a
separate subquery. Expensive.

Live information:

  Up to 30 data points

  Refreshing every 5 - 30 seconds via AJAX

Painful
Requests
Asynchronous requests let the page load before
the data

Number of these requests had to be monitored

Real time information used Fusion Charts

  1 AJAX call per chart

  10 - 30 charts per vehicle live screen

  Refresh every 5 - 30 seconds
Requests: Optimised
Single entry point

Multiple entry points make it difficult to
dynamically change the timeout and memory
usage of key pages, as well as to deal with
session locking issues effectively.

Single point of entry is essential

Check out the Symfony Routing component...
Symfony Routing
// load your routes
$locator = new FileLocator( array( __DIR__ . '/../../' ) );

// build the request context from the current URI
$requestURL = isset( $_SERVER['REQUEST_URI'] ) ? $_SERVER['REQUEST_URI'] : '';
$requestURL = ( strlen( $requestURL ) > 1 ) ? rtrim( $requestURL, '/' ) : $requestURL;
$requestContext = new RequestContext( $requestURL );

// set up the router; RoutingRouter here is an aliased
// Symfony\Component\Routing\Router
$router = new RoutingRouter( new YamlFileLoader( $locator ), "routes.yml",
array('cache_dir' => null), $requestContext );

// get the route for your request
$route = $router->match( $requestURL );

// act on the route
Sharding: split the data into smaller buckets
Sharding (2)

Data is very time relevant

  Only care about specific days

  Don’t care about comparing data too much

Split the data so that each week had a
separate table
Supporting Sharding
A simple PHP function that all queries run through. It works out the table name; combine it with an sprintf to build the full query string.
/**
 * Get the sharded table to use for a specific date
 * @param String $date YYYY-MM-DD
 * @return String
 */
public function getTableNameFromDate( $date )
{
    // ASSUMPTION: today's database is ALWAYS THERE
    // ASSUMPTION: you shouldn't be querying for data in the future
    $date = ( $date > date( 'Y-m-d' ) ) ? date( 'Y-m-d' ) : $date;
    $stt = strtotime( $date );
    if( $date >= $this->switchOver ) {
        // early January can fall into ISO week 52 or 53 of the previous year
        $year = ( date( 'm', $stt ) === '01' && date( 'W', $stt ) >= 52 ) ? date( 'Y', $stt ) - 1 : date( 'Y', $stt );
        return 'datavalue_' . $year . '_' . date( 'W', $stt );
    }
    else {
        return 'datavalue';
    }
}
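
Usage, as the slide suggests: drop the table name into the query with sprintf. The column names here are illustrative:

$sql = sprintf(
    'SELECT recorded_at, value FROM %s WHERE vehicle_id = %d AND variable = %d',
    $this->getTableNameFromDate( $date ),
    $vehicleId,
    $variableId
);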
Sharding: an excuse

Alterations to the database schema

Code to support smaller buckets of data



Take advantage of needing to touch queries
and code: improve them!
Index Optimisation
Two sharding projects left the schema as a
Frankenstein

Indexes still included columns from before the first
shard (the vehicle ID)

   Wasting storage space

   Increasing the index size

   Increasing query time

   Makes the index harder to fit into memory
Schema Optimisation
MySQL provides a range of data-types

Varying storage implications

   Does that need to be a BIGINT?

   Do you really need DOUBLE PRECISION when a
   FLOAT will do?

Are those tables, fields or databases still required?

Perform regular schema audits
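
For instance, shrinking over-sized types is a single ALTER (the table and column names here are hypothetical), though on tables this size it needs scheduling because it rebuilds the table:

-- INT UNSIGNED (4 bytes) instead of BIGINT (8 bytes),
-- FLOAT (4 bytes) instead of DOUBLE (8 bytes)
ALTER TABLE datavalue_2012_06
    MODIFY vehicle_id INT UNSIGNED NOT NULL,
    MODIFY value FLOAT NOT NULL;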
Query Optimisation

Run your queries through EXPLAIN
EXTENDED

  Check they hit the indexes

For big queries avoid functions such as
CURDATE - this helps ensure the cache is hit
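
The CURDATE point, concretely: a non-deterministic function in the query text stops the query cache from ever matching, so substitute the literal date in the application. The recorded_at column is an illustrative name:

-- never served from the query cache: CURDATE() is non-deterministic
SELECT COUNT(*) FROM datavalue WHERE recorded_at >= CURDATE();

-- cacheable: identical query text on every call within the day
SELECT COUNT(*) FROM datavalue WHERE recorded_at >= '2012-02-29';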
Reporting against the data
Performance report
Reports & Intensive Queries
How far did the vehicle travel today

  Calculation involves looking at every single
  motor speed value for the day

How much energy did the vehicle use today

  Calculation involves looking at multiple
  variables for every second of the day

Lookup time + calculation time
Group the queries
Leverage indexes

  Perform related queries in succession

  Then perform calculations

Catching up on a backlog of calculations and
exports?

  Work through one table’s queries at a time

  Make use of indexes
Save the report
Automate the queries in dead time, grouped
together nicely

Save the results in a reports table

Only a single record per vehicle per day of
performance data

  Means users and management can run
  aggregate and comparison queries
  themselves quickly and easily
Enables date-range aggregation
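
A sketch of what such a reports table might look like; the deck only specifies one row per vehicle per day, so the schema below is an assumption:

CREATE TABLE vehicle_daily_report (
    vehicle_id   INT UNSIGNED NOT NULL,
    report_date  DATE         NOT NULL,
    distance_km  FLOAT        NOT NULL,
    energy_kwh   FLOAT        NOT NULL,
    PRIMARY KEY (vehicle_id, report_date)
) ENGINE=InnoDB;

-- date-range aggregation is now cheap:
SELECT vehicle_id, SUM(distance_km), SUM(energy_kwh)
FROM vehicle_daily_report
WHERE report_date BETWEEN '2012-01-01' AND '2012-01-31'
GROUP BY vehicle_id;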
Check for efficiency savings

Initial export scripts maintained a MySQLi
connection per database (500!)

Updated to maintain one per server and
simply switch to the database in question
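
A sketch of that change (the helper name and server array shape are hypothetical): hold one mysqli handle per server and switch schemas with select_db, rather than opening one connection per database:

$connections = array(); // one handle per server, keyed by host

function queryVehicleDb(array &$connections, array $server, $vehicleId, $sql)
{
    $host = $server['host'];
    if( !isset( $connections[$host] ) ) {
        $connections[$host] = new mysqli( $host, $server['user'], $server['pass'] );
    }
    // switch to the vehicle's database on the existing connection
    $connections[$host]->select_db( 'vehicle_' . (int) $vehicleId );
    return $connections[$host]->query( $sql );
}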
Leverage your RAM

Intensive queries might only use X% of your
RAM

Safe to run more than one report / export
at a time

Add support for multiple exports / reports
within your scripts e.g.
$numberOfConcurrentReportsToRun = 2;
$reportInstance = 0; // which of the concurrent workers this process is (0 or 1)
$counter = 0;
foreach( $data as $unit ) {
    if( ( $counter % $numberOfConcurrentReportsToRun ) == $reportInstance ) {
        $dataToProcess[] = $unit;
    }
    $counter++;
}
Extrapolate & Assume

Data is only stored when it changes

Known assumptions are used to extrapolate
values for all seconds of the day

Saves MySQL but costs in RAM

“Interlation”
Interlation
  * Add an array to the interlation
public function addArray( $name, $array )

  * Get the time that we first receive data in one of our arrays
public function getFirst( $field )

  * Get the time that we last received data in any of our arrays
public function getLast( $field )

  * Generate the interlaced array
public function generate( $keyField, $valueField )

  * Break the interlaced array down into separate days
public function dayBreak( $interlationArray )

   * Generate an interlaced array and fill for all timestamps within the range
of     _first_ to _last_
public function generateAndFill( $keyField, $valueField )

  * Populate the new combined array with key fields using the common field
public function populateKeysFromField( $field, $valueField=null )

http://www.michaelpeacock.co.uk/interlation-library
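
A hypothetical usage sketch assembled purely from the method names above; the library's actual constructor and field conventions may differ:

$interlation = new Interlation();

// one array per telemetry stream, stored only when the value changes
$interlation->addArray( 'speed', $speedRows );
$interlation->addArray( 'current', $currentRows );

// fill a value for every timestamp between the first and last reading,
// extrapolating across the "no change" gaps
$combined = $interlation->generateAndFill( 'timestamp', 'value' );

// split the combined array into per-day chunks for daily reports
$days = $interlation->dayBreak( $combined );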
Food for thought



Gearman

  Tool to schedule and run background jobs
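
A minimal sketch with PHP's pecl gearman extension (the function name and payload are illustrative): the web app queues the job and returns immediately, while a separate worker process does the heavy lifting.

// client side: queue an export without blocking the request
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);
$client->doBackground('export_vehicle_day', json_encode(array(
    'vehicle_id' => 42,
    'date'       => '2012-02-29',
)));

// worker side (long-running process)
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('export_vehicle_day', function (GearmanJob $job) {
    $params = json_decode($job->workload(), true);
    // ... run the heavy queries and write the export file ...
});
while ($worker->work()) {
    // loop forever, processing jobs as they arrive
}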
Keeping the application responsive
Session Locking

Some queries were still (understandably, and
acceptably) slow

Sessions would lock and AJAX scripts would
enter race conditions

User would attempt to navigate to another
page: their session with the web server
wouldn’t respond
Session Locking: Resolution
Session locking is caused by how PHP handles sessions:

  The session file stays locked until the script
  finishes executing the request

Potential solution: use another method e.g.
database

Our solution: manually close the session
Closing the session

session_write_close();

Caveats:

  If you need to write to sessions again in
  the execution cycle, you must call
  session_start() again

  Made problematic by the lack of template
  handling
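
The pattern in miniature (a sketch, assuming no output has been sent when the session is re-opened): do the session writes early, release the lock, then run the slow query so parallel AJAX requests are not serialised behind it.

session_start();
$_SESSION['last_viewed_vehicle'] = $vehicleId; // illustrative write

// release the session file lock so other requests from this user
// (e.g. parallel AJAX chart calls) are not blocked behind us
session_write_close();

$results = runSlowTelemetryQuery($vehicleId); // hypothetical helper

// caveat from the slide: re-open the session if we need to write again
session_start();
$_SESSION['last_query_at'] = time();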
Live real-time data
Request consolidation helped

Each data point on the live screen was still a
separate query due to original design
constraints

Live fleet information spanned multiple
databases e.g. a map of all vehicles
belonging to a customer

Solution: caching
Caching with memcached
      Fast, in-memory key-value store

         Used to keep a copy of the most recent
         data from each vehicle
$mc = new Memcache();
$mc->connect($memcacheServer, $memcachePort);
$realTimeData = $mc->get($vehicleID . '-' . $dataVariable);


      Failover: Moxi Memcached Proxy
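
The write side, as a sketch (key format copied from the lookup above; the 60-second expiry is an assumption): the import pipeline overwrites the latest value for each vehicle/variable pair as messages are processed.

$mc = new Memcache();
$mc->connect($memcacheServer, $memcachePort);

// keep only the most recent reading per vehicle/variable;
// a short expiry stops stale vehicles looking "live"
$mc->set($vehicleID . '-' . $dataVariable, $value, 0, 60);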
Caching enables a large range of data to be looked up quickly
Legacy Project
Constraints, problems and code. Easing
         deployment anxiety.
Source Control Management

Initially SVN

Migrated to git

  Branch per feature strategy

Automated deployment
Dependencies
A Dependency Injection framework was missing
from the application, which caused problems with:

  Authentication

  Memcache

  Handling multiple concurrent database
  connections

  Access control
Autoloading



PSR-0
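
A simplified PSR-0 autoloader sketch (the base directory is an assumption, and full PSR-0 only maps underscores within the class name itself, whereas this sketch maps them everywhere):

spl_autoload_register(function ($className) {
    // PSR-0: namespace separators (and underscores in the class name)
    // map to directory separators
    $className = ltrim($className, '\\');
    $path = str_replace(array('\\', '_'), DIRECTORY_SEPARATOR, $className) . '.php';
    $file = __DIR__ . '/src/' . $path;
    if (file_exists($file)) {
        require $file;
    }
});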
Templates and sessions


Closing and opening sessions means you need
to know when data has been sent to the
browser

Separation of concerns and template systems
help with this
Database rollouts
Specific database table defines how the data should be processed

Log database deltas

Automated process to roll out changes

  Backup existing table first
DATE=`date +%H-%M-%d-%m-%y`
mysqldump -h HOST -u USER -pPASSWORD DATABASE TABLENAME > /backups/dictionary_$DATE.sql
cd /var/www/pdictionarypatcher/repo/
git pull origin master
cd src
php index.php


  Roll out the changes
private function applyNextPatch( $currentPatchID )
{
    $patchToTry = ++$currentPatchID;
    $patchFile = FRAMEWORK_PATH . '../patches/' . $patchToTry . '.php';
    if( file_exists( $patchFile ) ) {
        // each patch file holds the SQL for one numbered delta
        $sql = file_get_contents( $patchFile );
        $this->database->multi_query( $sql );
        return $this->applyNextPatch( $patchToTry );
    }
    else {
        // no more patches: return the ID of the last one applied
        return $patchToTry - 1;
    }
}
The future
Tiered SAN hardware
NoSQL?
MySQL was used as a “golden hammer”

Original team of contractors who built the
system knew it

Easy to hire developers who know it

Not necessarily the best option

We had to introduce application-level
sharding for it to suit the growing needs
Rationalisation


Do we need all that data? Really?

  At the moment: probably

  In the future: probably not
Direct queue interaction


 Types of message queue could allow our live
 data to be streamed direct from a queue

 We could use this infrastructure to share
 the data with partners instead of providing
 them regular processed exports
More hardware



More vehicles + New components = Need for
more storage
Conclusions
So you need to work with a crap-load of data?
PHP needs lots of friends
PHP is a great tool for:

   Displaying the data

   Processing the data

   Exporting the data

   Binding business logic to the data

It needs friends to:

   Queue the data

   Insert the data

   Visualise the data
Continually Review


Your schema & indexes

Your queries

Efficiencies in your code

Number of AJAX requests
Message Queue: A safety net

Queue what you can

Lets you move data around while you process
it

Gives your hardware some breathing space
Code Considerations
Template engines

Dependency management

Abstraction

Autoloading

Session handling

Request management
Compile Data
Keep related data together

Look at storing summaries of data

   Approach used by analytics companies: granularity
   changes over time:

      This week: per second data

      Last week: Hourly summaries

      Last month: Daily summaries

      Last year: Monthly summaries
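
A sketch of the rollup idea (table and column names hypothetical, with the shard name following the earlier datavalue_YYYY_WW convention): periodically compress old per-second rows into hourly summaries, then archive or drop the originals.

-- build hourly summaries from an old week's per-second shard
INSERT INTO datavalue_hourly (vehicle_id, variable, hour_start, avg_value, max_value)
SELECT vehicle_id, variable,
       DATE_FORMAT(recorded_at, '%Y-%m-%d %H:00:00'),
       AVG(value), MAX(value)
FROM datavalue_2012_05
GROUP BY vehicle_id, variable, DATE_FORMAT(recorded_at, '%Y-%m-%d %H:00:00');

-- once summarised, the per-second shard can be archived or dropped
DROP TABLE datavalue_2012_05;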
Thanks; Q+A


Michael Peacock

mkpeacock@gmail.com

@michaelpeacock

www.michaelpeacock.co.uk
Photo credits


flickr.com/photos/itmpa/4531956496/

flickr.com/photos/eveofdiscovery/3149008295

flickr.com/photos/gadl/89650415/

flickr.com/photos/brapps/403257780

Editor's Notes

  • #2: Hello everyone; thanks for coming. I spent the last 12 months working on a large-scale, data-intensive project, focusing on the development of a PHP web application which had to support, display, process, report against and export a phenomenal amount of data each day.
  • #17: The project concerned dealing with vehicle telematics data from vehicles produced by Smith Electric Vehicles, one of the world's largest manufacturers of all-electric commercial vehicles. As a new and emerging industry, performance, efficiency and fault reporting data from these vehicles is very valuable. As I'm sure you can imagine, with electric vehicles the drive and battery systems generate a large amount of data, with batteries broken down into smaller cells, each giving us temperature, current, voltage and state of charge data.
  • #18-22: As the data may relate to performance and faults, we need to ensure we get the data. Telematics projects which offer safety features have this as an even more important issue. We also have government partners who subsidise the vehicle cost in exchange for some of this data. Subsequently we need to be able to give this data to them, as well as receiving it ourselves. As EVs rely on chemistry and external factors, we need to keep data so we can compare data at different times of the year and different locations.
  • #23-26: What you will realise is that we in effect built a large-scale distributed-denial-of-service system, and pointed it directly at our own hardware, with the caveat of needing the data from the DDoS attack!
  • #28: Before we could do anything, we needed to be able to process the data and store it within the system. This includes actually transferring the data to our servers, inserting it into our database cluster and performing business logic on the data.
  • #29-31: In order for us to reliably receive the data, we need the system to be online so that data can be transferred. We also need to have the server capacity to process the data, and we need to be able to scale the system. Just because there are X number of data collection units out there, we don't know how many will be on at a given time, and we have to deal with more and more collection units being built and delivered.
  • #32: The biggest problem is dealing with the pressure of that data stream.
  • #35: There are a range of AMQP libraries for PHP, some of them based on the C library, with other difficult dependencies. A couple of guys developed a pure PHP implementation of the library which is really easy to use and install, and can be installed directly via Composer. As it's a pure PHP implementation it's really easy to get up and running on any platform. It provides support for both publishing and consuming messages from a queue, and is great not only for dealing with streams of data but also for storing events and requests across multiple sessions, or dispatching jobs.
  • #37: A small buffer allows us to cope with the issue of connectivity problems to our message queue, or signal problems with the data collection devices.
  • #38: To give data import the resources it needs, the system had dedicated hardware to consume messages from the message queue, perform business logic and convert them to MySQL inserts. Although it's an obvious one, it's also easily overlooked: the data is bundled together into LOAD DATA INFILE statements with MySQL.
  • #78: With a project of this scale, dealing with business-critical data could lead to deployment anxiety. This is because a bug in rolled-out code could cause problems with displaying real-time data, or cause exported data or processed data reports to be incorrect, requiring them to be re-run at a cost of CPU time, most of which was already in use generating that day's reports or dealing with that day's data imports. The architecture of the application also provided constraints for maintenance.