Stop Worrying & Love the SQL - A Case Study

Stop Worrying
Love the SQL! (the Quepid story)

OpenSource Connections
Me/Us
Doug Turnbull
@softwaredoug
Likes: Solr, Elasticsearch, Cassandra,
Postgres
@o19s
Search, Discovery and Analytics
Let us introduce you to freelancing!

Most Importantly we do...
Make my search results more relevant!
“Search Relevancy”
What database works best for problem X?
“(No)SQL Architect/Trusted Advisor”

How products actually get built
Rena: Doug, John can you come by this afternoon?
One of our Solr-based products needs some urgent relevancy
work
Its Friday, it needs to get done today!
Us: Sure!
The Client
(Rena!)
smart
cookie!

A few hours later
Us: we’ve made a bit of progress!
image frustration-1081 by jseliger2
Rena: but everytime we fix something, we break an
existing search!
Us: yeah! we’re stuck in a whack-a-mole-game
other image: whack a mole by jencu

Whack-a-Mole
What search relevancy
work actually looks like

I HAVE AN IDEA
● Middle of the afternoon, I stop doing search
work and start throwing together some
python from flask import Flask
app = Flask(__name__)
Everyone: Doug, stop that, you have important search work to do!
Me: We’re not making any progress!
WE NEED A WAY TO REGRESSION TEST OUR RELEVANCY AS WE TUNE!
Everyone: You’re nuts!

What did I make?
Focus on gathering stakeholder (ie Rena)
feedback on search, coupled w/ workbench
tuning against that feedback
Today we have customers...
… forget that, tell me about your failures!

Our war story
My mistakes:
● Building a product
● Selling a product
● As a user experience engineer
● As an Angular developer
● At choosing databases

Quepid 0.0.0.0.0.0.1
Track multiple user searches
for this query (hdmi cables) Rena rates this document
as a good/bad search result
need to store:
<search> -> <id for search result> -> <rating 1-10>
“hdmi cables” -> “doc1234” -> “10”
*Actual UI may have been much uglier

Data structure selection under duress
● What’s simple, easy, and will persist our
data?
● What plays well with python?
● What can I get working now in Rena’s office?

Redis
● In memory “Data Structure Server”
○ hashes, lists, simple key-> value storage
● Persistent -- write to disk every X minutes

Redis
from redis import Redis
redis = Redis()
redis.set("foo", "bar")
redis.get("foo") # gets ‘bar’
$ pip install redis
Easy to install and go! Specific to our problem:
redis = Redis()
ratings = {“doc1234”: “10”,
“doc532”: “5”}
searchQuery = “hdmi cables”
redis.hsetall(searchQuery, ratings)
Store a hash table
at “hdmi cables”
with:
“doc1234” -> “10”
“doc532” -> “5”

Success!
● My insanity paid off that afternoon
● Now we’re left with a pile of hacked together
(terrible) code -- now what?

Adding some features
● Would like to add multiple “cases”
(different search projects that solve different problems)
● Would like to add user accounts
● Still a one-off for Silverchair

Cases
Tuning a cable shopping site... … vs state laws

Cases in Redis?
redis = Redis()
ratings = {“doc1234”: “10”,
“doc532”: “5”}
redis.hset(searchQuery, ratings)
Recall our existing implementation
“data model”
Out of the box, redis can deal with 2 levels deep:
{
“hdmi cables”: {
“doc1234”: “10”,
“doc532”: “5”
},
“ethernet cables”
...
}
Can’t add extra layer (redis hash only one layer)
{“cable site”: {
“hdmi cables”: {...}
“ethernet cables”: {...}
}
“laws site: {...}}

Time to give up Redis?
“All problems in computer science can be solved by another level of indirection” -- David Wheeler
Crazy Idea: Add dynamic prefix to query keys to indicate case, ie:
{
“case_cablestore_hdmi cables”: {
“doc1234”: “10”,
“doc532”: “5”
},
“case_cablestore_ethernet cables”: {
… },
“case_statelaws_car tax”: {
…}
}
Queries for “Cable Store” case
Query for “State Laws” case
redis.keys(“case_cablestore*”)
To Fetch:

Store other info about cases?
New problem: we need to store some information about cases, case name, et
{
“case_cablestore_hdmi cables”: {
“doc1234”: “10”,
“doc532”: “5”
},
“case_cablestore_ethernet cables”: {
… },
“case_statelaws_car tax”: {
…}
}
Where would it go here?
{
“case_cablestore” {
“name”: “cablestore”,
“created” “20140101”
},
“case_cablestore_query_hdmi cables”: {
“doc1234”: “10”,
“doc532”: “5”
},
“case_cablestore_query_ethernet cables”:
{
… },
“case_statelaws_query_car tax”: {
…}
}

Oh but let’s add users
Extrapolating on past patterns {
“user_doug” {
“name”: “Doug”,
“created_date”: “20140101”
},
“user_doug_case_cablestore” {
“name”: “cablestore”,
“created_date” “20140101”
},
“user_doug_case_cablestore_query_hdmi cables”: {
“doc1234”: “10”,
“doc532”: “5”
},
“user_doug_case_cablestore_query_ethernet cables”:
{
… },
“user_tom_case_statelaws_query_car tax”: {
…}
}image: Rage Wallpaper from Flickr user Thoth God of Knowledge
You right now!

Step Back
We ask ourselves: Is this tool a
product? Is it useful outside of this
customer?
What level of software engineering helps us move forward?
● Migrate to RDMS?
● “NoSQL” options?
● Clean up use of Redis somehow?

SubRedis
Operationalizes hierarchy inside of redis
https://github.com/softwaredoug/subredis
from subredis import SubRedis
redis = Redis()
sr = SubRedis(“case_%s” % caseId , redis)
ratings = {“doc1234”: “10”,
“doc532”: “5”}
sr.hsetall(searchQuery, ratings)
Create a redis sandbox for this case
Interact with this case’s queries with redis
sandbox specific to that case
Behind the scenes, subredis
queries/appends the case_1 prefix to
everything

SubRedis == composable
userSr = SubRedis(“user_%s” % userId , redis)
caseSr = SubRedis(“case_%s” % caseId , userSr)
# Sandbox redis for queries about user
ratings = {“doc1234”: “10”,
“doc532”: “5”}
caseSr.hsetall(searchQuery, ratings)
SubRedis takes any Redis like
thing, and works safely in that
sandbox
Now working on sandbox, within a sandbox

Does something reasonable under the hood
{
“user_1_name”: “Doug”,
“user_1_created_date”: “Doug”,
“user_1_case_1_name”: “name”: “cablestore”
“user_1_case_1_hdmi cables”: {
“doc1234”: “10”,
“doc532”: “5”
},
“user_2_name”, “Rena”,
...
}
All
Redis
user_1
subred.
case_1
subred.

We reflect again
● Ok we tried this out as a product. Launched.
● Paid off *some* tech debt, but wtf are we
doing
● Works well enough, we’ve got a bunch of
new features, forge ahead

We reflect again
● We have real customers
● Our backend is evolving away from simple
key-value storage
○ user accounts? users that share cases? stored
search snapshots? etc etc

Attack of the relational
Given our current set of tools, how would we solve the problem
“case X can be shared between multiple users”?
{
“doc1234”: “10”,
“doc532”: “5”
},
“doc1234”: “10”,
“doc532”: “5”
},
}
Could duplicate the data?
This stinks!
● Updates require visiting many (every?)
user, looking for this case
● Bloated database
Duplicate the data?

Attack of the relational
Given our current set of tools, how would we solve the problem
“case X can be shared between multiple users”?
{
“user_1_cases”: [1, ...]
“case_1_name”: “name”: “cablestore”
“case_1_hdmi cables”: {
“doc1234”: “10”,
“doc532”: “5”
},
“user_2_cases”: [1, ...]
...
}
User 1
Case 1
User 2
Store list of
owned cases
Break out cases to a top-level record?

SudRedisRelational?
{
“user_1_cases”: [1, ...]
“case_1_name”: “name”: “cablestore”
“case_1_hdmi cables”: {
“doc1234”: “10”,
“doc532”: “5”
},
“user_2_cases”: [1, ...]
...
}
We’ve actually just normalized our data.
Why was this good?
● We want to update case 1 in isolation
without anomalies
● We don’t want to visit every user to
update case 1!
● We want to avoid duplication
We just made our “NoSQL” database a bit relational

Other Problems
● Simple CRUD tasks like “delete a case”
need to be coded up
● We’re managing our own record ids
● Is any of this atomic? does it occur in
isolation?

What’s our next DB?
● These problems are hard, we need a new
DB
● We also need better tooling!

Irony
● This is the exact situation we warn clients
about in our (No)SQL Architect Roles.
○ Relational == General Purpose
○ Many-many, many-one, one-many, etc
○ Relational == consistent tooling
○ NoSQL == solve specific problems well

So we went relational!
● Took advantage of great tooling: MySQL,
Sqlalchemy (ORM), Alembic (migrations)
● Modeled our data relationships exactly like
we needed them to be modeled

Map db Python classes
class SearchQuery(Base):
__tablename__ = 'query'
id = Column(Integer, primary_key=True)
search_string = Column(String)
ratings = relationship("QueryRating")
class QueryRating(Base):
__tablename__ = 'rating'
doc_id = Column(String)
rating = Column(Integer)
Can model my domain in coder-friendly
classes class SearchQuery(Base):
__tablename__ = 'query'
search_string = Column(String)
ratings = relationship("QueryRating")
class QueryRating(Base):
__tablename__ = 'rating'
doc_id = Column(String)
rating = Column(Integer)

Easy CRUD
q = SearchQuery(search_string=”hdmi cable”)
db.session.add(q)
db.session.commit()
del q.ratings[0]
db.session.add(q)
db.session.commit()
q = SearchQuery.query.filter(id=1).one()
q.search_string=”foo”
db.session.add(q)
db.session.commit()
Create!
Delete!
Update!

Migrations are good
alembic revision --autogenerate -m "name for tries"
alembic upgrade head
alembic downgrade 0ab51c25c
How do you upgrade your database to add/move/reorganize data?
● Redis this was always done manually/scripted
● Migrations with RDMS are a very robust/well-understood way to
handle this
SQLAlchemy has “alembic” to help:

Modeling Users ←→ Cases
association_table = Table(case2users, Base.metadata,
Column('case_id', Integer, ForeignKey('case.id')),
Column('user_id', Integer, ForeignKey('user.id'))
)
class User(Base):
__tablename__ = 'user'
cases = relationship("Case",
secondary=association_table)
class Case(Base):
__tablename__ = 'case'
Can model many-many relationships

Ultimate Query Flexibility
for user in User.query.all():
for case in user.cases:
print case.caseName
for user in User.query.filter(User.isPaying==True):
for case in user.cases:
print case.caseName
Print all cases:
Cases from paying members:

Lots of things easier
● backups
● robust hosting services (RDS)
● industrial strength ACID with flexible
querying
● 3rd-party tooling (ie VividCortex for MySQL)

When NoSQL?
● Solve specific problems well
○ Optimize for specific query patterns
○ Full-Text Search (Elasticsearch, Solr)
○ Caching, shared data structure (Redis)
● Optimize for specific scaling problems
○ Provide a denormalized “view” of your data for
specific task

Final Thoughts
Sometimes RDMS’s have harder initial hurdle
for setup, figuring out migrations; data
modeling; etc
Why isn’t the easy path the wise path?

In conclusion

Stop Worrying & Love the SQL - A Case Study

More Related Content

What's hot (20)

Similar to Stop Worrying & Love the SQL - A Case Study (20)

More from All Things Open (20)

Recently uploaded (20)

Stop Worrying & Love the SQL - A Case Study