Thursday, June 03, 2010

Heavy Reads and Heavy Writes

One of the interesting challenges we have here at Hip Digital is managing our vast catalog of metadata. We have millions and millions of data points that all need to be collected, indexed, and ultimately made available for browsing and searching. We've spent a lot of time improving our technology and processes, with some notable successes. For example, switching from SQL Server to SOLR to power our searches was a huge win, with average search time dropping from 15 seconds to less than two (even under load) AND search relevancy improving tenfold. What I want to write about today, though, is how we turned our queries upside down and opened up a new world of performance and granularity that had previously escaped us.

So, what do I mean when I say we turned our queries 'upside down'? Traditionally, with relational databases, SELECT statements with JOINs and WHERE clauses are used to filter down to the correct records. This works great for small-to-medium data sets with modest filtering granularity (i.e. WHERE x = y) but, in our experience, starts to fail spectacularly when you move to large data sets with high granularity. Here's our story of how we evolved beyond the middling to the big time.
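To make the traditional approach concrete, here's roughly what such a query looks like. This is a hypothetical example; the table and column names are illustrative stand-ins, not our actual schema:

  -- Classic read-time filtering: join and match in the WHERE clause.
  SELECT t.TrackId, t.Title
  FROM Tracks t
  JOIN Albums a ON a.AlbumId = t.AlbumId
  WHERE a.Title = 'Some Album';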

First, let's set the stage. Our metadata database is composed of the following entity tiers:
  1. Labels - a handful
  2. Sub Labels - thousands
  3. Artists - tens of thousands
  4. Albums - hundreds of thousands
  5. Tracks - millions
All are loosely related to each other in descending order of how they are listed, so filtering higher in the order filters out more records than filtering lower down. For example, if we filter at the Label level we can eliminate vast swaths of the catalog, but if we filter at the Album level we are only filtering the few tracks that belong to that album. OK, now we're all on the same page for structure, so let's talk about how we filter efficiently even at the lowest tier.
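For illustration, here's a minimal sketch of how those five tiers chain together. The names and columns are simplified stand-ins for our real schema:

  CREATE TABLE Labels    (LabelId    INT PRIMARY KEY, Name NVARCHAR(200));
  CREATE TABLE SubLabels (SubLabelId INT PRIMARY KEY, LabelId INT REFERENCES Labels(LabelId), Name NVARCHAR(200));
  CREATE TABLE Artists   (ArtistId   INT PRIMARY KEY, SubLabelId INT REFERENCES SubLabels(SubLabelId), Name NVARCHAR(200));
  CREATE TABLE Albums    (AlbumId    INT PRIMARY KEY, ArtistId INT REFERENCES Artists(ArtistId), Title NVARCHAR(200));
  CREATE TABLE Tracks    (TrackId    INT PRIMARY KEY, AlbumId INT REFERENCES Albums(AlbumId), Title NVARCHAR(200));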

Filtering at the uppermost tier is easy and efficient with good old-fashioned WHERE clauses: there are only a few Labels, and SQL is pretty good at handling two or three WHERE conditions even against fairly sizable tables. However, it completely falls apart when we try to filter at a lower tier, say the middle one, Artists. If we wanted to launch a music store with a small number of artists we could still use the WHERE approach, but what if we had 200 artists to include while excluding all the others? The obvious first move is a big WHERE IN clause with a comma-delimited list of artist IDs. That might work for a very low-traffic site, but it wouldn't handle scale. Besides, what if we then wanted to add another 100 artists? Query duration keeps climbing dramatically with every additional condition, and getting creative with stored procedures, views, functions, etc. would not save us.
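The naive version looks something like this (the artist IDs are invented for illustration). Every artist we add makes the IN list, and the work the query plan has to do, a little bigger:

  -- A 200-artist store, the brute-force way: one giant IN list.
  SELECT t.TrackId, t.Title
  FROM Tracks t
  JOIN Albums al ON al.AlbumId = t.AlbumId
  WHERE al.ArtistId IN (101, 102, 103 /* ...and 197 more IDs... */);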

So what's the solution to adding more conditions without dramatically increasing query time? Embed the filter information within the record itself. What do I mean by this? We add a new column to every entity table whose value describes which music stores the content is available in. Yes, we are talking about maintaining millions and millions of values, one for each entity. It's a big job, no doubt, since we launch music stores constantly and each launch means updating all of the affected entities. Careful management is clearly required, but now we do most of the work up front, before a user ever hits our site, rather than while a user is browsing the catalog and expecting the page to refresh instantly.
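Here's a minimal sketch of the idea, assuming the simplest possible encoding: an integer bitmask column where bit n set to 1 means 'live on site n'. The SiteMask column and the StoreArtists table are illustrative assumptions, not our actual schema:

  -- One-time change: give every track a site-availability mask.
  ALTER TABLE Tracks ADD SiteMask INT NOT NULL CONSTRAINT DF_Tracks_SiteMask DEFAULT (0);

  -- Heavy write at launch time: when hypothetical site #5 goes live,
  -- flag every track belonging to that store's artists up front.
  UPDATE t
  SET t.SiteMask = t.SiteMask | POWER(2, 5)
  FROM Tracks t
  JOIN Albums al ON al.AlbumId = t.AlbumId
  WHERE al.ArtistId IN (SELECT ArtistId FROM StoreArtists WHERE StoreId = 5);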

We have found that bitwise operations are extremely fast for this type of approach, but they have a hard limit: a 32-bit integer only gives you 32 flags. That might be fine for some, but at any given time we have over 100 live sites, so we developed an approach that uses a string value to represent larger numbers, which are then translated into operable values. There's probably room for optimization here, and we'll look at it in a future iteration.
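I won't pretend the sketch below is our exact encoding, but as a stand-in, imagine swapping the integer mask for a fixed-width flag string where character position n represents site n. That sidesteps the 32-bit ceiling while keeping the launch-time write just as mechanical:

  -- 128 character positions = room for 128 sites (CHAR(128) is an assumed width).
  ALTER TABLE Tracks ADD SiteFlags CHAR(128) NOT NULL
      CONSTRAINT DF_Tracks_SiteFlags DEFAULT (REPLICATE('0', 128));

  -- Launching hypothetical site #42: flip that site's character to '1'
  -- for every affected track.
  UPDATE t
  SET t.SiteFlags = STUFF(t.SiteFlags, 42, 1, '1')
  FROM Tracks t
  JOIN Albums al ON al.AlbumId = t.AlbumId
  WHERE al.ArtistId IN (SELECT ArtistId FROM StoreArtists WHERE StoreId = 42);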

By embedding filter data within the record, all of our queries' WHERE conditions are now short and consistent no matter what level of filtering a given site requires; whether it's 100 artists or 500, the WHERE clause is exactly the same. As a result, SQL Server sees fairly flat query durations across the board, at every tier of entity filtering.
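Using the flag-string stand-in from above, the read side looks something like this. Note that the WHERE clause is the same single test whether the site carries 100 artists or 500:

  DECLARE @SiteId INT = 42;  -- the requesting store

  -- Constant-size filter: one positional test, no JOINs, no IN list.
  SELECT t.TrackId, t.Title
  FROM Tracks t
  WHERE SUBSTRING(t.SiteFlags, @SiteId, 1) = '1';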

The real beauty of this approach is that we can now filter even at the Track level, with amazingly fine-grained control over what content is available on which site. Track-level filtering would have been impossible previously.

Now we have a solution that optimizes filter queries across the enterprise and opens up filtering opportunities that did not exist before. THAT's the power of heavy writes, and I encourage you to investigate them for yourselves.
