Handling Real-time Geostreams
#rtgeo #where20

O’Reilly Where 2.0
March 30, 2010
Background
Wherehoo (2000)
‣   “The Stuff Around You”
‣   “Wherehoo Server: An interactive location service for software agents and intelligent
    systems” - J.Youll, R.Krikorian
‣   In your /etc/services file
BusRadio (2004)
‣   Designed mobile computers to play media while also transmitting telemetry
‣   Looked and sounded like a radio - but really a Linux computer
OneHop (2007)
‣   Bluetooth proximity-based social networking
Table of Contents
Background
‣   Why are we interested in this?
Twitter’s Geo APIs
‣   How do we allow people to talk about place?
Problem statement
‣   What are we trying to have our system do?
Infrastructure
‣   How is Twitter solving this problem?
People want to talk about places
What’s happening here?
Twitter’s Geo APIs
Original attempts
Adding it to the tweet
‣   Use myloc.me, et al. to add text to the tweet
‣   Localizes the mobile phone and puts the location “in band”
‣   Counts against the 140 characters


Setting profile level locations
‣   Set the user/location of a Twitter user
‣   There is an API for that!
‣   Not on a per-tweet basis and not designed for high frequency updates
curl -u USERNAME:PASSWORD \
  -d location="San Francisco, California" \
  http://twitter.com/account/update_location.xml

<user>
  <id>8285392</id>
  <name>raffi</name>
  <screen_name>raffi</screen_name>
  <location>San Francisco, California</location>
  ...
</user>
Geotagging API
Adding it to the tweet
‣   Per-tweet basis
‣   Out of band / pure meta-data
‣   Does not take from the 140 characters

Native Twitter support
‣   Simple way to update status with location data
‣   Ability to remove geotags from your tweets en masse
‣   Using GeoRSS and GeoJSON as the encoding format
‣   Across all Twitter APIs (REST, Search, and Streaming)
Sending an update
statuses/update

curl -u USERNAME:PASSWORD -d "status=hey-ho&lat=37.3&long=-121.9" \
  http://api.twitter.com/1/statuses/update.xml

<status>
  <text>hey-ho</text>
  ...
  <geo xmlns:georss="http://www.georss.org/georss">
    <georss:point>37.3 -121.9</georss:point>
  </geo>
  ...
</status>
Search
search (with geocode)
curl "http://search.twitter.com/search.atom?geocode=40.757929%2C-73.985506%2C25km&source=foursquare"

The geocode parameter takes “latitude,longitude,radius”, where the radius carries a unit of mi or km.
...
<title>On the way to ace now, so whenever you can make it I'll be there. (@ Port Imperial Ferry in Weehawken) http://4sq.com/2rq0vO</title>
...
<twitter:geo>
   <georss:point>40.7759 -74.0129</georss:point>
</twitter:geo>
...
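A minimal Python sketch of assembling the geocode query string, assuming the “latitude,longitude,radius” format described above. `build_geocode_query` is a hypothetical helper name, and nothing here sends a request.

```python
from urllib.parse import urlencode


def build_geocode_query(lat: float, lon: float, radius: float,
                        unit: str = "km", **extra: str) -> str:
    """Return a URL-encoded query string restricting a search to a circle."""
    if unit not in ("mi", "km"):
        raise ValueError("radius unit must be 'mi' or 'km'")
    # The three values are joined with commas; urlencode escapes them as %2C.
    params = {"geocode": f"{lat},{lon},{radius:g}{unit}"}
    params.update(extra)
    return urlencode(params)


query = build_geocode_query(40.757929, -73.985506, 25, "km", source="foursquare")
print(query)
```

The result can be appended after the `?` of the search URL shown on the slide.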
Geo-hose
location filtering
curl "http://stream.twitter.com/1/statuses/filter.xml?locations=-74.5129,40.2759,-73.5019,41.2759"

locations is a list of bounding boxes, each specified as “long1,lat1,long2,lat2”.
A request can track up to 10 boxes, each at most 1 degree on a side (~60 miles
square and enough to cover most metropolitan areas)
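A small Python sketch of building and checking a `locations` value under the limits quoted above (up to 10 boxes, each at most 1 degree on a side). The helper name, the south-west-corner-first ordering, and the error messages are assumptions made for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (long1, lat1, long2, lat2)

MAX_BOXES = 10          # limit quoted on the slide
MAX_SIDE_DEGREES = 1.0  # limit quoted on the slide


def build_locations_param(boxes: List[Box]) -> str:
    """Validate the boxes and join them into one comma-separated value."""
    if not 1 <= len(boxes) <= MAX_BOXES:
        raise ValueError(f"expected 1 to {MAX_BOXES} boxes, got {len(boxes)}")
    parts: List[str] = []
    for lon1, lat1, lon2, lat2 in boxes:
        # Assumption: the first corner is the south-west one.
        if lon2 <= lon1 or lat2 <= lat1:
            raise ValueError("box must list its south-west corner first")
        if lon2 - lon1 > MAX_SIDE_DEGREES or lat2 - lat1 > MAX_SIDE_DEGREES:
            raise ValueError("box is larger than 1 degree on a side")
        parts.extend(str(v) for v in (lon1, lat1, lon2, lat2))
    return ",".join(parts)


print(build_locations_param([(-74.5, 40.25, -73.5, 41.25)]))
```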
Trends API
Global trends
‣   Currently on front page of Twitter.com and on search.twitter.com
‣   Analysis of “hot conversations”
‣   Does not take from the 140 characters

Location specific trends
‣   Tweets being localized through a variety of means into trends
‣   Locations exposed over the API as WOEIDs
‣   Can ask for available trends sorted by distance from your location
‣   Querying for a parent of a location will return all locations under it
Available locations
trends/available
curl "http://api.twitter.com/1/trends/available.xml"

Optionally takes lat and long parameters, in which case the available trend
locations are returned sorted by distance from that point.
<locations type="array">
  <location>
     <woeid>2487956</woeid>
     <name>San Francisco</name>
     <placeTypeName code="7">Town</placeTypeName>
     <country type="Country" code="US">United States</country>
     <url>http://where.yahooapis.com/v1/place/2487956</url>
  </location>
  ...
</locations>
Available locations
trends/woeid.xml (trends/twid.xml coming soon)
curl "http://api.twitter.com/1/trends/2487956.xml"

Look up the trends at the given WOEID


<matching_trends type="array">
  <trends as_of="2009-12-15T20:19:09Z">
    ...
      <trend url="http://search.twitter.com/search?q=Golden+Globe+nominations" query="Golden+Globe+nominations">Golden Globe nominations</trend>
      <trend url="http://search.twitter.com/search?q=%23somethingaintright" query="%23somethingaintright">#somethingaintright</trend>
    ...
  </trends>
</matching_trends>
Geo-place API
Support for “names”
‣   Not just coordinates
‣   More contextually relevant
‣   Positive privacy benefits

Increased complexity
‣   Need to be able to look up a list of places
‣   Requires a “reverse geocoder”
‣   Tagging is human-driven and cannot be fully automated
Finding a place
geo/reverse_geocode

curl "http://api.twitter.com/1/geo/reverse_geocode.json?lat=37.3&long=-121.9"
{
    "result": {
        "places": [
            {
                "place_type":"neighborhood",
                "country_code":"US",
                "contained_within": [...],
                "full_name":"Willow Glen",
                "bounding_box": {
                    "type":"Polygon",
                    "coordinates": [[
                      [-121.92481908, 37.275903], [-121.88083608, 37.275903],
                      [-121.88083608, 37.31548203], [-121.92481908, 37.31548203]
                    ]]
                },
                "name":"Willow Glen",
                "id":"46bc64ecd1da2a46",
                "url":"http://api.twitter.com/1/geo/id/46bc64ecd1da2a46.json",
                "country":""
            },
            ...
        ]
    }
}
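A short Python sketch of consuming a response like the one above: it reduces the place's `bounding_box` polygon to an envelope and checks whether the queried point falls inside it. The embedded JSON is a trimmed copy of the slide's example with the elided parts left out; `envelope` and `contains` are hypothetical helper names.

```python
import json

# Trimmed copy of the example response; elided fields are omitted.
response_text = """
{"result": {"places": [{
    "place_type": "neighborhood",
    "full_name": "Willow Glen",
    "bounding_box": {"type": "Polygon", "coordinates": [[
        [-121.92481908, 37.275903], [-121.88083608, 37.275903],
        [-121.88083608, 37.31548203], [-121.92481908, 37.31548203]]]},
    "id": "46bc64ecd1da2a46"
}]}}
"""


def envelope(bounding_box: dict) -> tuple:
    """Return (west, south, east, north) for the polygon's outer ring."""
    ring = bounding_box["coordinates"][0]  # positions are [longitude, latitude]
    lons = [p[0] for p in ring]
    lats = [p[1] for p in ring]
    return min(lons), min(lats), max(lons), max(lats)


def contains(env: tuple, lat: float, lon: float) -> bool:
    west, south, east, north = env
    return west <= lon <= east and south <= lat <= north


place = json.loads(response_text)["result"]["places"][0]
env = envelope(place["bounding_box"])
print(place["full_name"], place["id"], contains(env, 37.3, -121.9))
```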
Sending an update
statuses/update

curl -u USERNAME:PASSWORD -d "status=hey-ho&place_id=46bc64ecd1da2a46" \
  http://api.twitter.com/1/statuses/update.xml

<status>
  <text>hey-ho</text>
  ...
  <place xmlns:georss="http://www.georss.org/georss">
    <id>46bc64ecd1da2a46</id>
    <name>Willow Glen</name>
    <full_name>Willow Glen</full_name>
    <place_type>neighborhood</place_type>
    <url>http://api.twitter.com/1/geo/id/46bc64ecd1da2a46.json</url>
    <country code="US">United States</country>
  </place>
  ...
</status>
Problem statement
What do we need to build?
‣   Database of places
    ‣   Given a real-world location, find the programmatic places that the
        location maps to
    ‣   Spatial search
‣   Method to store places with content
    ‣   Per user basis
    ‣   Per tweet basis
Spatial lookup and index
As background... MySQL + GIS
‣   Ability to index points and do a spatial query
    ‣   For example, get points within a bounding rectangle
    ‣   SELECT AsText(coord)
        FROM geometry
        WHERE MBRContains(GeomFromText(
        'POLYGON((0 0,0 3,3 3,3 0,0 0))'), coord)
‣   Hard to cache the spatial query
‣   Possibly requires a DB hit on every query
Options
Grid / Quad-tree
‣   Create a grid (possibly nested) of the entire Earth
Geohash
‣   Arbitrarily precise and hierarchical spatial data reference
Space filling curves
‣   Mapping 2D space into 1D while preserving locality
R-Tree
‣   Spatial access data structure
Grid / Quad-Tree
‣   Recursively subdivide regions
‣   Trie structure to store “prefixes”
‣   Spatially oriented data structure
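A minimal Python sketch of the recursive subdivision idea: each level splits the current cell into four quadrants and appends one digit, so a cell's key is a prefix of the keys of all cells nested inside it. The digit assignment (bit 1 for the eastern half, bit 2 for the northern half) is an arbitrary choice made for this example.

```python
def quad_key(lat: float, lon: float, depth: int) -> str:
    """Return a base-4 key naming the grid cell that contains the point."""
    south, north = -90.0, 90.0
    west, east = -180.0, 180.0
    digits = []
    for _ in range(depth):
        mid_lat = (south + north) / 2
        mid_lon = (west + east) / 2
        quadrant = 0
        if lon >= mid_lon:   # eastern half
            quadrant |= 1
            west = mid_lon
        else:
            east = mid_lon
        if lat >= mid_lat:   # northern half
            quadrant |= 2
            south = mid_lat
        else:
            north = mid_lat
        digits.append(str(quadrant))
    return "".join(digits)


coarse = quad_key(37.3, -121.9, 3)
fine = quad_key(37.3, -121.9, 8)
print(coarse, fine, fine.startswith(coarse))
```

Because deeper keys extend shallower ones, the keys can be stored in a trie and a whole region can be fetched by looking up its prefix.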
Geohash
‣   37°18′N 121°54′W = 9q9k4
‣   Hierarchical spatial data structure
‣   Precision encoded
‣   Distance captured
    ‣   Nearby places (usually) share the same prefix
    ‣   The longer the string match, the closer the places are
Geohash
‣   9q9k4 = 01001 / 10110 / 01001 / 10010 / 00100
‣   Longitude bits (every other bit, starting with the first) = 0010100101010
    ‣   -90.0 (0), -135.0 (0), -112.5 (1), -123.75 (0), -118.125 (1), -120.9375 (0),
        -122.34375 (0), -121.640625 (1), -121.9921875 (0), -121.81640625 (1),
        -121.904296875 (0), -121.8603515625 (1), -121.88232421875 (0) = 121°53′W
‣   Latitude bits (the remaining bits) = 101101010000
    ‣   45.0 (1), 22.5 (0), 33.75 (1), 39.375 (1), 36.5625 (0), 37.96875 (1), 37.265625 (0),
        37.6171875 (1), 37.44140625 (0), 37.353515625 (0), 37.3095703125 (0),
        37.28759765625 (0) = 37°17′N
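The bit-by-bit walk above can be sketched in Python as follows. This is an illustrative implementation of the standard geohash scheme (base-32 alphabet, bits alternating between longitude and latitude, longitude first). The function names are chosen for this example.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, length: int) -> str:
    """Encode a point by repeatedly halving the longitude and latitude ranges."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    is_lon = True  # bits alternate, starting with longitude
    for _ in range(length):
        value = 0
        for _ in range(5):  # five bits per base-32 character
            if is_lon:
                mid = (lon_lo + lon_hi) / 2
                bit = 1 if lon >= mid else 0
                lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
            else:
                mid = (lat_lo + lat_hi) / 2
                bit = 1 if lat >= mid else 0
                lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
            value = (value << 1) | bit
            is_lon = not is_lon
        chars.append(BASE32[value])
    return "".join(chars)


def geohash_decode(geohash: str) -> tuple:
    """Return the (lat, lon) at the centre of the cell named by the hash."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    is_lon = True
    for ch in geohash:
        value = BASE32.index(ch)
        for shift in range(4, -1, -1):
            bit = (value >> shift) & 1
            if is_lon:
                mid = (lon_lo + lon_hi) / 2
                lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
            else:
                mid = (lat_lo + lat_hi) / 2
                lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
            is_lon = not is_lon
    return (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2


print(geohash_encode(37.3, -121.9, 5))
print(geohash_decode("9q9k4"))
```

Decoding returns the centre of the final cell, which is why the result is close to, but not exactly, the point that was encoded.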
Geohash
‣   Possible to do range query in database
    ‣   Matching based on prefix will return all the points that fit in that
        “grid”
    ‣   Able to store 2D data in a 1D space
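A small Python sketch of a prefix range query over sorted geohash keys, standing in for an indexed database column. The keys and place labels are made up for illustration. The upper bound is formed by incrementing the prefix's last character, which works because every key sharing the prefix sorts between the two bounds.

```python
from bisect import bisect_left

# Illustrative (geohash, label) rows kept sorted by geohash,
# as an index on the column would keep them.
rows = sorted([
    ("9q9k4", "place-a"),
    ("9q9k5", "place-b"),
    ("9q9hv", "place-c"),
    ("9q8yy", "place-d"),
    ("dr5ru", "place-e"),
])
keys = [geohash for geohash, _ in rows]


def prefix_range(prefix: str) -> list:
    """Return all rows whose geohash starts with the given non-empty prefix."""
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)  # smallest string past the prefix
    lo = bisect_left(keys, prefix)
    hi = bisect_left(keys, upper)
    return rows[lo:hi]


print(prefix_range("9q9"))
print(prefix_range("9q9k"))
```

In SQL the same lookup is a range condition on the indexed column (key >= prefix and key < upper bound).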
Space filling curve
‣   Generalization of geohash
    ‣   2D to 1D mapping
    ‣   Nearness is captured
‣   Can recursively fill space to
    whatever resolution is desired
‣   Fractal-like pattern can be used to
    take up as much room as possible
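As one concrete example of a 2D-to-1D mapping, here is a Python sketch of the Z-order (Morton) curve, which interleaves the bits of two grid coordinates. Geohash follows the same interleaving idea. Whether the curve pictured on the slide is a Z-order curve or another curve such as Hilbert's is not stated, so treat the choice here as an assumption.

```python
def morton_code(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y (x in the even positions)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


# Visiting the cells of a 4x4 grid in code order traces the Z pattern.
cells = [(x, y) for y in range(4) for x in range(4)]
ordered = sorted(cells, key=lambda cell: morton_code(cell[0], cell[1]))
print(ordered[:4])
print(ordered[4:8])
```

Cells that are close in 2D tend to receive nearby codes, although cells on either side of a quadrant boundary can be adjacent in space yet far apart in code order.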
R-Tree
[Figure: R-Tree diagram. Image from Wikipedia]
‣   Height-balanced tree data
    structure for spatial data
‣   Uses hierarchically nested
    bounding boxes
‣   Nearby elements are placed in
    the same node
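A simplified Python sketch of the nested bounding box idea. It builds a static tree bottom-up by sorting entries and packing them into nodes, rather than using the insert-and-split algorithm of a full R-Tree, so it illustrates only the search side. All names here are illustrative.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

Rect = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)
Entry = Tuple[Rect, Any]


def union(a: Rect, b: Rect) -> Rect:
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def cover(rects: List[Rect]) -> Rect:
    result = rects[0]
    for rect in rects[1:]:
        result = union(result, rect)
    return result


@dataclass
class Node:
    rect: Rect                                            # box covering everything below
    children: List["Node"] = field(default_factory=list)  # set on inner nodes
    entries: List[Entry] = field(default_factory=list)    # set on leaves


def build(entries: List[Entry], fanout: int = 2) -> Node:
    """Pack entries into leaves, then pack nodes upward until one root remains."""
    if not entries or fanout < 2:
        raise ValueError("need at least one entry and a fanout of 2 or more")
    ordered = sorted(entries, key=lambda e: e[0][0])  # nearby boxes share a leaf
    level = [Node(cover([r for r, _ in ordered[i:i + fanout]]),
                  entries=ordered[i:i + fanout])
             for i in range(0, len(ordered), fanout)]
    while len(level) > 1:
        level = [Node(cover([n.rect for n in level[i:i + fanout]]),
                      children=level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0]


def search(node: Node, query: Rect) -> List[Any]:
    """Return values whose boxes intersect the query, pruning whole subtrees."""
    if not intersects(node.rect, query):
        return []
    hits = [value for rect, value in node.entries if intersects(rect, query)]
    for child in node.children:
        hits.extend(search(child, query))
    return hits


tree = build([((0, 0, 1, 1), "a"), ((2, 2, 3, 3), "b"),
              ((10, 10, 11, 11), "c"), ((12, 12, 13, 13), "d")])
print(search(tree, (0.5, 0.5, 2.5, 2.5)))
```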
Representations
GeoRSS / GeoJSON
‣   http://www.georss.org/ and http://geojson.org/
‣   <georss:point>37.3 -121.9</georss:point>
‣   {
        "type":"Point",
        "coordinates":[-121.9, 37.3]
    }
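The two encodings order the coordinates differently: a GeoRSS point is “latitude longitude”, while a GeoJSON position is [longitude, latitude]. A minimal Python sketch of converting between them (the helper names are hypothetical, and only points are handled):

```python
import json


def georss_point_to_geojson(text: str) -> dict:
    """Convert the body of a <georss:point> element to a GeoJSON Point."""
    lat_text, lon_text = text.split()
    return {"type": "Point", "coordinates": [float(lon_text), float(lat_text)]}


def geojson_to_georss_point(geometry: dict) -> str:
    """Convert a GeoJSON Point to the body of a <georss:point> element."""
    if geometry.get("type") != "Point":
        raise ValueError("only Point geometries are handled here")
    lon, lat = geometry["coordinates"][:2]
    return f"{lat} {lon}"


point = georss_point_to_geojson("37.3 -121.9")
print(json.dumps(point))
print(geojson_to_georss_point(point))
```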
How do you store precision?
‣   “Precision” is a hard thing to encode
‣   Accuracy can be encoded with an error radius
‣   Twitter opts for tracking the number of decimals passed
    ‣   140.0 != 140.00
    ‣   DecimalTrackingFloat
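One way to track the number of decimals passed is sketched below in Python. This is not Twitter's `DecimalTrackingFloat`, whose implementation the slides do not show. `TrackedCoordinate` is a hypothetical stand-in that remembers how many digits followed the decimal point in the submitted text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackedCoordinate:
    value: float
    decimals: int  # digits after the decimal point in the original input

    @classmethod
    def parse(cls, text: str) -> "TrackedCoordinate":
        # Assumption: plain decimal input such as "37.30", no exponent notation.
        text = text.strip()
        _, _, fraction = text.partition(".")
        return cls(float(text), len(fraction))

    def __str__(self) -> str:
        # Re-emit the value with exactly the precision it arrived with.
        return f"{self.value:.{self.decimals}f}"


a = TrackedCoordinate.parse("140.0")
b = TrackedCoordinate.parse("140.00")
print(a == b, str(a), str(b))
```

Equality compares both the value and the decimal count, so the two inputs stay distinct even though they are the same number.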
Twitter
Twitter Infrastructure
‣   Ruby on Rails-ish frontend
‣   Scala-based services backend
‣   MySQL and soon to be Cassandra as the store
‣   RPC to back-end or put items into queues
Rock Dove (redux)
Can be used as a homing pigeon
Simplified architecture
‣   R-Tree for spatial lookup
    ‣   Data provider for front-end lookups
    ‣   Store place object with envelope of place in R-Tree
‣   Mapping from ID to place object
Java Topology Suite (JTS)
‣   http://www.vividsolutions.com/jts/jtshome.htm
‣   Open source
‣   Good for representing and manipulating “geometries”
‣   Has support for fundamental geometric operations
    ‣   contains
    ‣   envelope
‣   Has an R-Tree implementation
pointInside in polygon? true
pointOutside in polygon? false
at (0.0, 0.0)
-- region 1
at (1.0, 1.0)
-- region 1
-- region 2
at (2.0, 2.0)
-- region 1
-- region 2
at (3.0, 3.0)
-- region 2
at (4.0, 4.0)
-- empty
Java Topology Suite (JTS)
‣   Serializers and deserializers
    ‣   Well-known text (WKT)
    ‣   Well-known binary (WKB)
    ‣   No GeoRSS or GeoJSON support
Interface / RPC
‣   RockDove is a backend service
    ‣   Data provider for front-end lookups
    ‣   Communicates over some form of RPC (Thrift, Avro, etc.)
    ‣   Data could be cached on frontend to prevent lookups
‣   Simple RPC interface
    ‣   get(id)
    ‣   containedWithin(lat, long)
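A minimal in-process Python sketch of the two calls, assuming each place is stored with its envelope and an ID-to-place mapping. A linear scan stands in for the R-Tree query, no RPC layer is shown, and the two regions and their extents are invented so that the results resemble the region lookups shown on the JTS slides.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Place:
    id: str
    name: str
    west: float
    south: float
    east: float
    north: float

    def envelope_contains(self, lat: float, lon: float) -> bool:
        return self.west <= lon <= self.east and self.south <= lat <= self.north


class PlaceStore:
    def __init__(self, places: List[Place]) -> None:
        self._by_id: Dict[str, Place] = {p.id: p for p in places}

    def get(self, place_id: str) -> Optional[Place]:
        """Look up a place object by its ID."""
        return self._by_id.get(place_id)

    def contained_within(self, lat: float, lon: float) -> List[Place]:
        """Return every place whose envelope contains the point."""
        # Linear scan for brevity; the deck describes an R-Tree for this step.
        return [p for p in self._by_id.values() if p.envelope_contains(lat, lon)]


# Hypothetical regions: region 1 spans 0..2 and region 2 spans 1..3 on both axes.
store = PlaceStore([
    Place("r1", "region 1", 0.0, 0.0, 2.0, 2.0),
    Place("r2", "region 2", 1.0, 1.0, 3.0, 3.0),
])

for v in (0.0, 1.0, 2.0, 3.0, 4.0):
    names = [p.name for p in store.contained_within(v, v)]
    print(f"at ({v}, {v})", names or "empty")
```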
Interface / RPC
‣   Watch those RPC queues!
‣   Fail fast and potentially throw “over capacity” messages
    ‣   get(id) throws OverCapacity
    ‣   containedWithin(lat, long) throws
        OverCapacity
‣   Distinguish between write path and read path
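A small Python sketch of failing fast: requests go into a bounded queue, and when the queue is full the caller gets an `OverCapacity` error immediately instead of waiting. `BoundedDispatcher` and its methods are hypothetical names, and the real service's queueing is not described in the slides.

```python
import queue


class OverCapacity(Exception):
    """Raised immediately when no more requests can be queued."""


class BoundedDispatcher:
    def __init__(self, max_pending: int) -> None:
        self._pending: "queue.Queue[str]" = queue.Queue(maxsize=max_pending)

    def submit(self, request: str) -> None:
        """Queue a request, or fail fast if the queue is already full."""
        try:
            self._pending.put_nowait(request)
        except queue.Full:
            raise OverCapacity("request queue is full") from None

    def next_request(self) -> str:
        """Take the oldest pending request (raises queue.Empty if none)."""
        return self._pending.get_nowait()


dispatcher = BoundedDispatcher(max_pending=2)
dispatcher.submit("get(46bc64ecd1da2a46)")
dispatcher.submit("containedWithin(37.3, -121.9)")
try:
    dispatcher.submit("get(another-id)")
    rejected = False
except OverCapacity:
    rejected = True
print("third request rejected:", rejected)
```

Rejecting early keeps latency bounded for the requests that are accepted and gives the front end a clear signal to fall back or retry later.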
GeoRuby
‣   http://georuby.rubyforge.org/
‣   Open source
‣   OpenGIS Simple Features Interface Standard
‣   Only good for representing geometric entities
‣   GeoRuby::SimpleFeatures::Geometry::from_ewkb
‣   No GeoJSON serializers
Front-end
Bringing geo data to and from the web
Location in Browser
‣   Geolocation API Specification for JavaScript
    navigator.geolocation.getCurrentPosition
‣   Invokes a callback with a position object
‣   position.coords has
    ‣   latitude and longitude
    ‣   accuracy
    ‣   altitude, altitudeAccuracy, heading, and speed
‣   Support in Firefox 3.5, Chromium, Opera, and others with Google Gears
Hose
Streaming out real-time geo data
Geo-hose
location filtering
curl "http://stream.twitter.com/1/statuses/filter.xml?locations=-74.5129,40.2759,-73.5019,41.2759"


‣   Status objects are enqueued
‣   Hose server parses location (parsing place data COMING
    SOON)
‣   Quickly determines if there are any subscribers for location
‣   Streams out serialized object
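The steps above can be sketched in Python as follows. This is an illustration only: the status dictionary layout (`text`, `lon`, `lat`), the `GeoHose` class, and the linear scan over subscriptions are all assumptions, and a production server would need a faster index to decide quickly whether anyone is subscribed.

```python
import json
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (long1, lat1, long2, lat2)


class GeoHose:
    def __init__(self) -> None:
        self._subscriptions: Dict[str, List[Box]] = {}

    def subscribe(self, subscriber: str, boxes: List[Box]) -> None:
        self._subscriptions[subscriber] = list(boxes)

    def subscribers_for(self, lon: float, lat: float) -> List[str]:
        """Return subscribers with at least one box containing the point."""
        return sorted(
            name for name, boxes in self._subscriptions.items()
            if any(w <= lon <= e and s <= lat <= n for w, s, e, n in boxes)
        )

    def publish(self, status: dict) -> Dict[str, str]:
        """Return the serialized status for each matching subscriber."""
        if "lon" not in status or "lat" not in status:
            return {}  # no location, so nothing to match against
        payload = json.dumps(status, sort_keys=True)
        return {name: payload
                for name in self.subscribers_for(status["lon"], status["lat"])}


hose = GeoHose()
hose.subscribe("nyc-client", [(-74.5129, 40.2759, -73.5019, 41.2759)])
delivered = hose.publish({"text": "hey-ho", "lat": 40.7759, "lon": -74.0129})
print(delivered)
```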
Thanks also to
‣   Marius Eriksen (@marius)
‣   David Helder (@dhelder)
‣   Marc McBride (@mccv)
‣   John Kalucki (@jkalucki)
Questions? Follow me at twitter.com/raffi