So, what do I mean when I say we turned our queries 'upside down'? Traditionally, with relational databases you use SELECT statements with JOINs and WHERE clauses to filter down to the correct records. This works great for small-to-medium data sets with simple filter conditions (e.g. WHERE x = y) but, in our experience, starts to fail spectacularly when you move to large data sets with highly granular filtering. Here's our story of how we evolved beyond the middling to the big time.
First, let's set the stage. Our metadata database is composed of the following entity tiers:
- Labels - a handful
- Sub Labels - thousands
- Artists - tens of thousands
- Albums - hundreds of thousands
- Tracks - millions
Filtering at the uppermost tier is quite easy and efficient using good old-fashioned WHERE clauses, since there are only a few Labels and SQL is pretty good at handling two or three WHERE conditions even on fairly sizable tables. However, it completely falls apart when we try to filter at a lower tier, say the middle one, Artists. If we wanted to launch a music store with a small number of artists we could use the WHERE approach, but what if we had 200 artists to include while excluding all the others? The obvious first move is a big WHERE IN clause with a long comma-delimited list of artist IDs, something like the sketch below. That might survive a very low-traffic situation but wouldn't handle scale very well. Besides, what if we then wanted to add another 100 artists? The query duration keeps climbing with every additional condition, and even getting creative with stored procedures, views, functions, etc. would not save us.
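For illustration, that naive query might look something like this (the table and column names here are invented for the example, not our actual schema):

```sql
-- Naive filtering: enumerate every artist that belongs in the store.
-- Hypothetical schema; the real tables are considerably wider.
SELECT t.track_id, t.title
FROM   tracks  t
JOIN   albums  al ON al.album_id  = t.album_id
JOIN   artists ar ON ar.artist_id = al.artist_id
WHERE  ar.artist_id IN (1001, 1002, 1003 /* ...and a few hundred more... */);
```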
So what's the solution to adding more conditions without increasing query time dramatically? The solution is to embed filter information within the record itself. What do I mean by this? Well, we add a new column to all the entity tables with a value that describes in which music stores the content is available. Yes, we are talking about maintaining millions and millions of records - one for each entity. It's a big job, no doubt, since we release music stores constantly and for each one we need to update all of the entities. Clearly, careful management is required, but now we are doing most of the work up front before a user hits our site, as opposed to when a user is browsing the catalog and expecting the site to refresh instantly.
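Here's a minimal sketch of the idea in rough T-SQL, with invented names and a plain integer bitmask standing in for the availability value (more on the representation in a moment):

```sql
-- One availability column per entity table (hypothetical names throughout).
ALTER TABLE artists ADD store_filter BIGINT NOT NULL DEFAULT 0;
ALTER TABLE albums  ADD store_filter BIGINT NOT NULL DEFAULT 0;
ALTER TABLE tracks  ADD store_filter BIGINT NOT NULL DEFAULT 0;

-- The heavy write: when a new store launches, stamp every entity it carries
-- (and likewise for albums and tracks). This work happens up front, long
-- before a user ever hits the site.
DECLARE @new_store_bit BIGINT = 16;   -- bit assigned to the new store
UPDATE artists
SET    store_filter = store_filter | @new_store_bit
WHERE  artist_id IN (SELECT artist_id FROM new_store_lineup);
```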
We have found that bitwise operations are extremely fast for this type of approach, but a standard 32-bit integer only gives you 32 flags to work with. That might be fine for some, but at any given time we have over 100 live sites, so we've developed an approach that stores the flags in a string value, which is then translated into values the bitwise operators can work on. There's probably room for optimization here which we will look to include in a future iteration.
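On the read side, the bitwise test is one short predicate. The second query below is only a guess at the string-chunk idea to show the shape of it; our real translation layer differs, and all the names remain hypothetical:

```sql
-- Simple case: one bit per store, tested with a bitwise AND.
DECLARE @store_bit BIGINT = 16;       -- bit assigned to the store being served
SELECT artist_id, artist_name
FROM   artists
WHERE  store_filter & @store_bit = @store_bit;

-- Past the integer's bit limit: store the mask as fixed-width decimal chunks
-- in a string column and test only the chunk that holds this store's bit.
DECLARE @chunk INT = 3, @bit INT = 16;
SELECT artist_id, artist_name
FROM   artists
WHERE  CAST(SUBSTRING(store_filter_text, (@chunk - 1) * 10 + 1, 10) AS INT) & @bit = @bit;
```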
By embedding filter data within the record, all of our queries' WHERE conditions are now very short and consistent no matter what level of filtering a given site requires; whether it's 100 or 500 artists, the WHERE clause is exactly the same. This means SQL sees fairly flat query durations across the board, at every tier of entity filtering.
The real beauty of this approach is that we can now filter even at the Track level, with amazingly fine-grained control over what content is available on which site. Track-level filtering would have been impossible previously.
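The same one-line predicate applies at the bottom tier, and pulling a single track out of a single store is just another targeted write (hypothetical names again, using the simple bitmask form for readability):

```sql
DECLARE @store_bit BIGINT = 16, @track_id INT = 42;

-- Track-level read: identical shape to the queries above, just on tracks.
SELECT track_id, title
FROM   tracks
WHERE  store_filter & @store_bit = @store_bit;

-- Track-level heavy write: clear one store's bit on one track.
UPDATE tracks
SET    store_filter = store_filter & ~@store_bit
WHERE  track_id = @track_id;
```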
Now we have a solution that optimizes filter queries across the enterprise while opening up filtering opportunities that did not exist before. THAT's the power of heavy writes, and I encourage you to investigate them for yourselves.