Locality Sensitive Hashing By Spark

Alain Rodriguez, Fraud Platform, Uber
Kelvin Chu, Hadoop Platform, Uber
June 08, 2016

Overlapping Routes
Finding similar trips in a city

The problem
Detect trips with a high degree of overlap
We are interested in detecting trips that have
various degrees of overlap.
• Large number of trips
• Noisy, inconsistent GPS data
• Not looking for exact matches
• Directionality is important

Input Data
Millions of trips scattered over time and space
GPS traces are represented as an ordered list
of (latitude,longitude,time) tuples.
• Coordinates are reals and have noise
• Traces can be dense or sparse, yet overlapping
• Large time and geographic search space
[
{
"latitude":25.7613453844,
"epoch":1446577692,
"longitude":-80.197244976
},
{
"latitude":25.7613489535,
"epoch":1446577693,
"longitude":-80.1972450862
},
…
]

Google S2 Cells
Efficient geo hashing
Divides the world into consistently sized regions. Area segments are available in different sizes.

Jaccard index
Set similarity coefficient
The Jaccard index can be used as a measure of set similarity.
A = {a, b, c}, B = {b, c, d}, C = {c, d, e}
J(A, A) = 1.0
J(A, B) = 0.5
J(A, C) = 0.2
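
The three values above follow from the usual definition J(A, B) = |A ∩ B| / |A ∪ B|. A minimal Python sketch that reproduces them; the function name `jaccard` and the convention for two empty sets are assumptions made for illustration, not something the slides specify.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| (assumed to be 1.0 when both sets are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


A, B, C = {"a", "b", "c"}, {"b", "c", "d"}, {"c", "d", "e"}
print(jaccard(A, A), jaccard(A, B), jaccard(A, C))  # 1.0 0.5 0.2
```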

Heuristic
Densify sparse traces
Sparse and dense traces should be matched
Different devices generate varying data densities. Two segments that start and end at the same location should be detected as overlapping.
Ensure points are at most X distance apart
Densification ensures that continuous segments are independently overlapping.
[Figure: traces A and B]
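
One way to densify a trace is to interpolate linearly between consecutive points until no gap exceeds the chosen distance X. The sketch below is only an illustration under that assumption: `haversine_m`, `densify` and `max_gap_m` (standing in for X, in meters) are hypothetical names, and the talk does not say which interpolation was actually used.

```python
import math


def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon, ...) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(h))


def densify(trace, max_gap_m):
    """Insert interpolated points so consecutive points are about max_gap_m apart at most.

    trace is a list of (latitude, longitude, epoch) tuples; interpolated points
    get fractional epochs because all three fields are interpolated linearly.
    """
    if not trace:
        return []
    out = [trace[0]]
    for prev, cur in zip(trace, trace[1:]):
        pieces = math.ceil(haversine_m(prev, cur) / max_gap_m)  # sub-hops needed
        for k in range(1, pieces):
            t = k / pieces
            out.append(tuple(a + t * (b - a) for a, b in zip(prev, cur)))
        out.append(cur)
    return out
```

Linear interpolation in latitude and longitude is only an approximation of the path between two fixes, which is acceptable here because the points are later snapped to area segments anyway.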

Heuristic
Discretize route segments
Discretize segments
Break down routes into equal-size area segments; this eliminates route noise. Segment size determines matching sensitivity.
Remove contiguous duplicates
Remove noise resulting from a vehicle stopped at a light or a very chatty device.
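
Both steps can be sketched together: map every point to an area segment, then collapse runs of the same segment. The talk uses Google S2 cells; to stay self-contained, the sketch below substitutes a hypothetical fixed-size latitude/longitude grid (`cell_id`, `cell_deg`), which is an assumption and not the S2 API.

```python
import math
from itertools import groupby


def cell_id(lat, lon, cell_deg=0.001):
    """Hypothetical stand-in for an S2 cell id: index of a fixed-size grid square."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def discretize(trace, cell_deg=0.001):
    """Map each (lat, lon, epoch) point to its cell, then drop contiguous duplicates."""
    cells = (cell_id(lat, lon, cell_deg) for lat, lon, _epoch in trace)
    # groupby only merges adjacent equal cells, so a later revisit of a cell is kept
    return [cell for cell, _run in groupby(cells)]
```

A larger `cell_deg` makes matching more tolerant of GPS noise and a smaller one makes it stricter, which is the sensitivity trade-off the slide mentions.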

Heuristic
Shingle contiguous area segments
Directionality matters
Two overlapping trips with opposite directions should not be matched.
Shingling captures directionality
Combining contiguous segments captures the sequence of moves from one segment to another.
[Figure: a trip over area segments 1 2 3 4 5 6 7 8 yields shingles 1->2 2->3 3->4 4->5 5->6 6->7 7->8; the same route in the opposite direction yields 2->1 3->2 4->3 5->4 6->5 7->6 8->7]
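
A shingle here is an ordered pair of consecutive area segments, so the same route driven in the opposite direction shares no shingles with the original. A minimal sketch, assuming segments are already discretized into a list of ids; the function name `shingles` is illustrative.

```python
def shingles(cells):
    """Set of ordered (from_cell, to_cell) pairs for each consecutive move."""
    return set(zip(cells, cells[1:]))


forward = [1, 2, 3, 4, 5, 6, 7, 8]
backward = forward[::-1]

print(sorted(shingles(forward)))                 # (1, 2), (2, 3), ..., (7, 8)
print(shingles(forward) & shingles(backward))    # set(): no shared shingles
```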

Set overlap problem
Find traces that have the desired level of common shingles
[Figure: shingles 1->2 2->3 3->4 4->5 5->6 6->7 7->8 8->9 9->10]

N^2 takes forever
LSH to the rescue
● Sifting through a month’s worth of trips for a city
takes forever with the N^2 approach (comparing every pair of trips)
● Locality-Sensitive Hashing allows us to find most
matches quickly. Spark provides the perfect engine.

Locality-Sensitive Hashing (LSH)
Quick Introduction

Problem - Near Neighbors Search
Set of Points P
Distance Function D
Query Point Q

Problem - Clustering
Set of Points P
Distance Function D

Curse of Dimensionality
1-Dimension e.g. single integer
Q: 7 Distance: 3
A Solution: Binary Tree e.g. Return 9, 4, 8, ...
2-Dimension e.g. GPS point
Q: (12.73, 61.45) Distance: 10
A Solution: Quadtree, R-tree, etc

Curse of Dimensionality
How about very high dimension?
A trip often has thousands of shingles
1->2 2->3 3->4 4->5 5->6 6->7 7->8 ... ->3k
Very hard problem

Approximate Solution
Trip T1 & Trip T2 are similar
D(T1, T2) is small
With high probability T1 and T2 are hashed into the same bucket.
[Figure: h(T1) and h(T2) both point to Bucket1]

Approximate Solution
Trip T1 & Trip T2 are not similar
D(T1, T2) is large
With high probability T1 and T2 are hashed into different buckets.
[Figure: h(T1) points to Bucket1 and h(T2) points to Bucket2]

Some distance functions have well-suited companion hash functions.
For Jaccard distance, it is the MinHash function.

MinHash(S) = min { h(x) for all x in the set S }
h(x) is a hash function such as (ax + b) % m, where a & b are some good
constants and m is the number of hash bins
Example:
S = {26, 88, 109}
h(x) = (2x + 7) % 8
MinHash(S) = min {3, 7, 1} = 1
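
The worked example can be checked with a few lines of Python. MinHash pairs with Jaccard because, for a hash that behaves like a random permutation, the probability that two sets share the same MinHash equals their Jaccard index. The helper names `make_hash` and `minhash` are illustrative, not taken from the talk.

```python
def make_hash(a, b, m):
    """Linear hash h(x) = (a*x + b) % m, as on the slide."""
    return lambda x: (a * x + b) % m


def minhash(s, h):
    """MinHash(S) = min { h(x) for all x in S }."""
    return min(h(x) for x in s)


h = make_hash(2, 7, 8)
S = {26, 88, 109}
print([h(x) for x in (26, 88, 109)])  # [3, 7, 1]
print(minhash(S, h))                  # 1
```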

Some Other Examples
Distance    Hash Function
Jaccard     MinHash
Hamming     i-th value of vector x
Cosine      Sign of the dot product of x and a random vector
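
The Hamming and cosine rows can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and drawing the random vector from a standard Gaussian is the common choice rather than something the slides state.

```python
import random


def bit_sampling_hash(i):
    """Hamming distance: the hash is the i-th coordinate of the vector."""
    return lambda x: x[i]


def random_hyperplane_hash(dim, seed):
    """Cosine distance: sign of the dot product with a random Gaussian vector."""
    rng = random.Random(seed)
    r = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # 1 if x lies on the non-negative side of the hyperplane with normal r, else 0
    return lambda x: 1 if sum(xi * ri for xi, ri in zip(x, r)) >= 0 else 0
```

Vectors separated by a small angle fall on the same side of a random hyperplane with high probability, so they tend to receive the same bit.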

How to increase and control the probability?
It turns out the solution is very intuitive.

Use Multiple Hash
Both h1 and h2 are MinHash, but with different parameters (e.g. a & b)
[Figure: T1 and T2 are hashed by h1 and by h2 into Bucket1, Bucket2 and Bucket3]
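
To see why more hash functions raise the chance of catching a similar pair, assume each MinHash behaves like an independent random permutation, so a single hash puts two trips with Jaccard similarity s in the same bucket with probability s. Under that assumption the chance that at least one of k hashes agrees is 1 - (1 - s)^k. The function name below is illustrative.

```python
def collision_probability(s, k):
    """P(at least one of k independent MinHash functions agrees) for similarity s."""
    return 1.0 - (1.0 - s) ** k


for s in (0.2, 0.5, 0.8):
    print(s, [round(collision_probability(s, k), 3) for k in (1, 2, 10)])
```

The same formula shows the cost: dissimilar pairs also collide more often as k grows, which is why candidates are verified with the real distance afterwards.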

Shuffle Keys
● RDD[Trip]
● The hash values are shuffle keys
● h1 and h2 have non-overlapping key ranges
● groupByKey()
[Figure: shuffle key space split into the h1 range, the h2 range and other hash key ranges; h1(T1), h1(T2), h2(T1) and h2(T2) each fall in the range of their hash function]
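
The keying scheme can be emulated in plain Python, with a dictionary standing in for the shuffle that `groupByKey()` performs on the RDD. This is a sketch under assumptions: shingles are taken to be already encoded as integers, and the key is the tuple (hash index, MinHash value), which is one simple way to keep the key ranges of different hash functions from overlapping. The name `lsh_buckets` is hypothetical.

```python
from collections import defaultdict


def lsh_buckets(trips, hash_fns):
    """Group trip ids by (hash index, MinHash value), emulating flatMap + groupByKey."""
    buckets = defaultdict(list)
    for trip_id, shingle_set in trips.items():
        for i, h in enumerate(hash_fns):
            key = (i, min(h(x) for x in shingle_set))  # index i keeps ranges disjoint
            buckets[key].append(trip_id)
    return buckets


trips = {"T1": {1, 2, 3, 4}, "T2": {1, 2, 3, 5}, "T3": {50, 60, 70}}
hash_fns = [lambda x: (3 * x + 1) % 97, lambda x: (5 * x + 2) % 89]
for key, ids in sorted(lsh_buckets(trips, hash_fns).items()):
    print(key, ids)
```

With these toy parameters T1 and T2 land together under both hash functions while T3 sits alone.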

Post Processing
● If T1 and T2 are hashed into the same bucket, it’s likely that they are similar.
● Compute the Jaccard distance.
[Figure: Bucket1 containing T1, T2]

Approach 2
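● The same pair of trips is matched in both h1 and h2 buckets
● Use one more shuffle to dedup
● Trade-off: network cost of the extra shuffle vs repeated distance computation
[Figure: shuffle key space with the h1 range, the h2 range and other hash key ranges; T1 and T2 share a key in the h1 range and again in the h2 range]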

Approach 3
● Don’t send the actual trip vector in the LSH and Dedup shuffles
● Send only the trip ID
● After dedup, join back with the trip objects with one more shuffle
○ Then compute the Jaccard distance of each pair of matched trips.
● When the trip object is large, Approach 3 saves a lot of network usage.
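
The dedup and join-back steps can be sketched in plain Python, with a set playing the role of the dedup shuffle and a dictionary lookup playing the role of the join. All names (`candidate_pairs`, `jaccard_distance`, `match`, `max_distance`) are hypothetical, and trips are assumed to be non-empty shingle sets.

```python
from itertools import combinations


def candidate_pairs(buckets):
    """Dedup: a pair sharing several buckets is emitted once (ids only, no trip data)."""
    pairs = set()
    for ids in buckets.values():
        for a, b in combinations(sorted(set(ids)), 2):
            pairs.add((a, b))
    return pairs


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; assumes at least one set is non-empty."""
    return 1.0 - len(a & b) / len(a | b)


def match(buckets, trips, max_distance):
    """Join ids back to their shingle sets and keep pairs within max_distance."""
    result = {}
    for a, b in candidate_pairs(buckets):
        d = jaccard_distance(trips[a], trips[b])
        if d <= max_distance:
            result[(a, b)] = d
    return result


trips = {"T1": {1, 2, 3, 4}, "T2": {1, 2, 3, 5}, "T3": {50, 60, 70}}
buckets = {(0, 4): ["T1", "T2"], (1, 7): ["T1", "T2"], (0, 17): ["T3"]}
print(match(buckets, trips, max_distance=0.5))
```

Only ids move through the bucket and dedup stages here; the shingle sets are touched once per surviving pair, which mirrors the network saving the slide describes.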

How to Generate Thousands of Hash Functions
● Naive approach
○ Generate thousands of tuples of (a, b, m)
● Cache friendly approach - CPU register/L1/L2
○ Generate only two hash functions
○ h1(x) = (a1 x + b1) % m1
○ h2(x) = (a2 x + b2) % m2
○ hi(x) = h1(x) + i * h2(x), i from 1 to number of hash functions
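
A sketch of the cache-friendly scheme: h1 and h2 are evaluated once per element, and every further hash value is derived from those two numbers with one multiply and one add. The function name `make_signature_fn` and the toy parameters are illustrative; the formula is used exactly as written on the slide, with no extra modulo.

```python
def make_signature_fn(a1, b1, m1, a2, b2, m2, n):
    """Return a function computing n MinHash values with hi(x) = h1(x) + i * h2(x)."""
    def signature(s):
        xs = list(s)
        h1 = [(a1 * x + b1) % m1 for x in xs]  # computed once per element
        h2 = [(a2 * x + b2) % m2 for x in xs]  # computed once per element
        # the i-th MinHash is the minimum of h1(x) + i * h2(x) over the set
        return [min(u + i * v for u, v in zip(h1, h2)) for i in range(1, n + 1)]
    return signature


sig = make_signature_fn(2, 7, 8, 3, 1, 5, n=3)
print(sig({26, 88, 109}))  # [4, 7, 7]
```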

Other Features
● Amplification
○ Improve the probabilities
○ Reduce computation, memory and network used in final post-processing
○ More hashing (usually insignificant compared to the cost in final post-processing)
● Near Neighbors Search
○ Used in information retrieval, instance-based machine learning

Other Applications of LSH
● Search for top K similar items
○ Documents, images, time-series, etc
● Cluster similar documents
○ Similar news articles, mirror web pages, etc
● Product recommendation
○ Collaborative filtering

Future Work
● Migrate to Spark ML API
○ DataFrame as first class citizen
○ Integrate it into Spark
● Low latency inserts with Spark Streaming
○ Avoid re-hashing when new objects are streaming in

Thank you
