Monday, March 17, 2014

The Big Deal with Big Data!

I pre-date the relational database management system (RDBMS) world, having been weaned on things you may never have heard of, like ISAM (indexed-sequential access method). So, I have to admit that I was a little skeptical about the NoSQL world. It sounded a lot to me like ISAM, and why go backwards? But it is not fair to dismiss something out of hand, so of course I had to investigate these NoSQL databases. And while I do not believe at all that the days of RDBMS are numbered as some folks say, I do believe that NoSQL has its place. To understand why, you need to know the terrain.

Onwards to RDBMS

In a simple view of the early days we basically used a sequential table of records. Most of the time the records were fixed length, reading the whole set sequentially was fast, and we used indexes to quickly locate a record based on a key. Does any of that sound familiar? Need to look up records by more than one field? Add an index! When you need to sort the records faster you pick an index and sort that instead of the table itself. An example of a single-index product still prevalent is Ken Thompson's dbm, which was included with Unix in 1979. I would provide a citation, but it would be myself because I first came across it when I upgraded to Unix Version 7.

So a relational database is a natural extension to the "single-table" model. It has multiple indexed tables, an engine that translates structured query language (SQL) into a plan of execution to return information, and allows merging of data from multiple tables into one result. The real advantages of RDBMS are that we are able to look at the same data in different ways for different applications, and the joins between tables allow us to avoid data duplication. In fact, RDBMS came about because we were already trying to use single-file "tables" and indexes together but we were doing all the work in our applications. Look behind the scenes in your RDBMS and many of them still manage individual files for each table and index. I found the shift to RDBMS simple, because for me what it initially did was move all that yucky code into the database engine.

Imagine that I have an application organizing events at different venues. Each event takes place at a single venue, and each venue can host many events. RDBMS allows me to separate the events and the venues, and quickly return results for queries like "show me all the events at this venue" or "what venue is this event at?" Of course the tables have to be structured to do this, maybe something like this entity relationship diagram (ERD) shows, where one venue may be associated with many events:


Separating the events and venues makes complete sense; a venue is not part of an event, nor is an event part of a venue. To be fair, the reason the queries work well here is that they use indexes. The Events table has a foreign key referencing the Venues table, and that foreign key column needs an index (some engines create one automatically, others leave it to you). If that index is not there, full table scans (a visit to every record in both tables) will be required to put the results together.
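To make that concrete, here is a minimal SQL sketch; the table and column names (Venues, Events, venueId, and so on) are my own invention for illustration, and the exact types and syntax will vary by engine:

CREATE TABLE Venues (
    venueId  INT PRIMARY KEY,
    name     VARCHAR(100)
);

CREATE TABLE Events (
    eventId  INT PRIMARY KEY,
    name     VARCHAR(100),
    venueId  INT NOT NULL REFERENCES Venues (venueId)
);

-- Not every engine indexes a foreign key automatically, so make sure it is there:
CREATE INDEX ix_events_venueId ON Events (venueId);

-- "Show me all the events at this venue"
SELECT * FROM Events WHERE venueId = 12;

-- "What venue is this event at?"
SELECT v.* FROM Venues v JOIN Events e ON e.venueId = v.venueId WHERE e.eventId = 97;
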

But what happens when we look at another part of my system: events and sessions? Every event is made up of a group of sessions, but no session ever appears in more than one event. Now, since every session is part of an event, let's add columns to the event table to support them:


But if we add columns like this we violate the first normal form of relational database design: "A relation is in first normal form if it has the property that none of its domains has elements which are themselves sets" (Codd). That happens any time a column or a group of columns is repeated, which is exactly what happened in the previous example. There are a lot of good reasons for following first normal form in RDBMS. When it is violated it becomes really difficult to search for sessions across multiple columns. And there is a harder problem to solve: you know that somewhere there will always be an event with one more session than you allowed for.
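To make the problem concrete, here is a sketch of what those repeated columns might look like; the column names are hypothetical:

CREATE TABLE Events (
    eventId       INT PRIMARY KEY,
    name          VARCHAR(100),
    venueId       INT NOT NULL REFERENCES Venues (venueId),
    session1Name  VARCHAR(100),
    session1Room  VARCHAR(50),
    session2Name  VARCHAR(100),
    session2Room  VARCHAR(50),
    session3Name  VARCHAR(100),  -- and somewhere there is an event that needs a fourth...
    session3Room  VARCHAR(50)
);

-- Finding a session by name now means checking every one of the repeated columns:
SELECT * FROM Events
WHERE session1Name = 'Keynote' OR session2Name = 'Keynote' OR session3Name = 'Keynote';
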

But wait, the sessions are part of the event! Well in the application they are, but in RDBMS that just does not work. So the solution is to split the sessions into their own table:
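Something along these lines (again with invented names), where the foreign key now points from the session back to its event:

CREATE TABLE Sessions (
    sessionId  INT PRIMARY KEY,
    eventId    INT NOT NULL REFERENCES Events (eventId),
    name       VARCHAR(100),
    room       VARCHAR(50)
);

CREATE INDEX ix_sessions_eventId ON Sessions (eventId);

-- All of the sessions for one event:
SELECT * FROM Sessions WHERE eventId = 97;
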

Who's the Winner?!

Before we go any further, now is a good time to talk about ACID: atomicity, consistency, isolation, and durability. When a database engine supports ACID it ensures that writing the event and the sessions takes place as an atomic operation. That prevents two clients from writing conflicting records into the tables simultaneously. Usually ACID is implemented through transactions that make sure multiple processes updating multiple tables don't interfere with each other.

What happens if two people edit the same event at the same time, change the data or sessions independently, and then try to save? One solution would be to lock the event in a transaction when the first person (green) opens it. Then the second person (blue) cannot even read the record until green finishes updating it:
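In SQL terms, locking the record like this looks roughly like the following sketch; the SELECT ... FOR UPDATE idiom is common, but the exact syntax varies by engine:

BEGIN TRANSACTION;

-- Green reads the event and takes a lock; blue's identical SELECT now blocks.
SELECT * FROM Events WHERE eventId = 97 FOR UPDATE;

-- ...green edits the data while blue (and everyone else) waits...

UPDATE Events SET name = 'Event 97, revised' WHERE eventId = 97;

COMMIT;
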


In general terms what we just described is a pessimistic lock. But this is almost impossible to use when a user interface is involved. The blue user will be unhappy waiting for the green user. The green user can walk away and the transaction could be held open for an indefinite amount of time. So most of us developers end up using an optimistic lock, where we put the transaction around just the update:


Both users can read the data, modify it, and then write it back. Only one can write at a time, and whoever writes last overwrites the previous write and wins! Even though blue was able to overwrite green's changes, it's still ACID. But wait, the blue user just overwrote whatever the green user changed!

Make no mistake: ACID is not responsible for protecting data from being overwritten, it is only responsible for protecting the consistency of the data. As long as the two writes are sequential and do not corrupt the indexes shared by the event and session tables then that data is consistent. And that is when programmers get into trouble; when they try to use the database engine and ACID to decide what data is written!

So there are two similar solutions for protecting the data. One, proposed by Martin Fowler, adds a version number to the primary record being written; in this case that would be the event. The version number is incremented during the write. After the transaction is started and before the write takes place, the current version number is checked to make sure that someone else has not already modified the record. If someone has, the write can be rejected (Fowler, "Introduction to NoSQL"):


The SQL to check and reject the update is simple; I just have to add a condition on the version to the WHERE clause of the update statement:

UPDATE Events SET version = 2, ... WHERE eventId = 97 and version = 1


My problem with Martin's solution is that it adds a column to the table that is not related to the rest of the columns. It violates the second normal form: "A table is in 2NF if and only if it is in 1NF and no non-prime attribute is dependent on any proper subset of any candidate key of the table (Codd)." Turn that definition around: all of the non-prime attributes must be cohesive with the entire candidate key. How is the version number at all cohesive with the key?

Some frameworks, like the ADO.NET Entity Framework, use a different approach. When changes to a record are made, the original values for the columns are saved. Then instead of a version number, the SQL statement imposes a gate to make sure that none of the columns have been modified before allowing the update to proceed. The drawback is the effort to keep the original values, and the SQL statement is more involved. But the two advantages far outweigh that: we do not add extraneous data to the record, and we trust nobody; we don't rely on everyone else updating the version number correctly.


UPDATE Events SET ... WHERE eventId = 97 and venueId = ...


So what's the problem?

The problem is the disconnect between the objects in the application and the tables in the database. Objects use references and databases use foreign keys. In the Unified Modeling Language (UML) the event object has a composite relationship with one or more sessions:

This UML structurally looks a lot like the ERD did earlier, but there are no primary and foreign keys; they are replaced with references. To round out our class definitions, the aggregation diamond between Event and Venue shows that an Event needs a Venue but is not responsible for it, and the composition diamond between Event and Session defines that an Event is made up of, and is responsible for managing, multiple Sessions.
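As a quick sketch in JavaScript (the same flavor as the Mongo shell examples coming up), the object model might look something like this; the class and property names are my own:

// An Event aggregates a Venue (it needs one but does not own it)
// and is composed of Sessions (it owns and manages them).
function Venue(id, name) { this.id = id; this.name = name; }
function Session(id, name, location) { this.id = id; this.name = name; this.location = location; }
function Event(id, name, venue) {
    this.id = id;
    this.name = name;
    this.venue = venue;     // a reference, not a foreign key
    this.sessions = [];     // composition: the event manages its own sessions
}
Event.prototype.addSession = function (session) { this.sessions.push(session); };
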

To move data between the database and the application we need to translate the foreign keys into object references, or back again. That is what an object-relational mapping (ORM) layer does, such as LINQ to SQL in .NET or Hibernate in Java.

And with everything we have discussed so far it turns out that we have already stumbled across the root of our real-world problems. In RDBMS terms we have a really good structure: potentially unlimited growth of the tables, and it is easy to use the data in various ways for different applications (search by session, search by event, etc.). But in reality we would hardly ever query in any way except to find an event and then look at its sessions. Most of the time you will likely find that only one application is using a database. So structuring data this way was driving applications at Google and Amazon to their knees.

Oh, Where did We Go Wrong?

Why was this such a big deal to a company like Amazon? And why not just throw more hardware at the problem?

One part of the problem is the ORM layer. Just by existing it has to impede performance as we move data between the objects and the database, and back again. That takes place on the "client" side, the application that is using the data. Remember, that could be the web application supporting hundreds of thousands of users.

On the database side we have a bigger problem. In our example we split the sessions into a separate table. So every time an event is looked at, we have to let the RDBMS crunch the sessions table (really, the index) and find all the related sessions. And if we add or delete sessions, then it has to update the table and the index. Try having your database serve 100,000 or 1,000,000 of these requests all at the same time. Meep.

For a while throwing hardware at the problem will show some benefit. But eventually you will run out of capacity again. Disk space really is not as big a problem as execution speed for the database engine. But you can only throw so many CPUs at the problem, and then eventually the problem will not be a lack of CPUs. It will be that you cannot overcome the network bottleneck getting requests into and results out of the computer. I went down that road once early this century for a certain group that had a very infamous and very public meltdown.

So what do you need? Distributed processing! Where will you get it? Probably not from any of the existing RDBMS. Oh, they are trying but there are a lot of technical problems to overcome. Been there. Done that. Got the T-shirt.

Enter NoSQL

The best way to compare NoSQL databases is around their storage models. Those can be roughly divided into document, key-value, column family, and graph. I initially equated NoSQL to ISAM, and while in a sense it often is similar, I discounted a very important commonality: the records stored don't have to follow a rigid schema. Some folks say "schema-less," but I see it more as a dynamic schema. After all, if we did not know something about the structure of what was in there, how would we ever use it and retrieve any information? We are usually moving data between the database and an object model in an application, so doesn't that object model imply there must be some known schema?

Document Storage Model

With the document storage model the unit of storage is a single document. The structure of the document is something decipherable by the database engine, like a comma-separated list or XML, but most commonly JSON. MongoDB is a popular implementation. Almost any language can interface to Mongo, and it provides a shell with an object paradigm for direct interaction. In the shell the db object is the root object. The document collections (like tables) are properties of db, and the methods of the collections provide the operations: insert, update, etc.

> db.events.insert( { _id: 97, name: "Event 97" } );
> db.events.find( { _id: 97 } );
{ "_id: : "97", "name" : "Event 97" }


Because document collections do not have a fixed schema we can insert anything we like. How about an event document that contains a collection of session documents:

> db.events.insert( { _id: 98, name: "Event 98", sessions: [
{ _id: "98-1", name: "Session 98-1", location: "Riverside" },
{ _id: "98-2", name: "Session 98-2", location: "Seaside" } ] } );


MongoDB looks in the document data for the property _id and that becomes the unique primary key for the document. If _id is not found Mongo will add the property and create a unique value.

Here is a visual representation of the database collection. The two documents have different structures, something that cannot be duplicated in an RDBMS because of its fixed schema:




That is sweet! For my structure with events and sessions everything is stored in a single document. I do not need the sessions without the event document, so keeping them together as an aggregate works. No multi-table join. It is much more efficient to retrieve with a single read. And translating it into the application is a trivial mapping from the properties of the JSON object to my object model.

I can search for records on any field, even the fields in a nested document. Of course if the fields are not indexed that forces a full scan of all the documents in a collection. So Mongo allows indexes to be added to any field, even the fields in nested documents. If a document is missing an indexed field, so what? It is simply added to the index as not having the field.
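For example, here is a sketch using the shell of that era (ensureIndex is the older name for what later became createIndex) to index and then search a field inside the nested session documents:

> db.events.ensureIndex( { "sessions.location": 1 } );
> db.events.find( { "sessions.location": "Seaside" } );

The find returns the whole event document that contains the matching session.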

We did not model venues in this example, but they truly are separate from events. In order to support them in Mongo we have to link the documents by adding the key of the venue to the event, and the keys of the events to the venue. In a throw-back to the old days, in this situation the application has to do the dirty work of pulling the venue and the event separately from the database. But to be fair, we already implement lazy loading in most applications that use an RDBMS, meaning we would only load the venue for an event at the time we need to see the data. So Mongo forces us to do lazy loading, boo-hoo.
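Here is a sketch of what that linking might look like; the venues collection and the venueId property are my own invention for the example:

> db.venues.insert( { _id: 12, name: "Riverside Convention Center", eventIds: [ 97, 98 ] } );
> db.events.update( { _id: 98 }, { $set: { venueId: 12 } } );

> // Two separate reads -- the application does the "join" itself:
> db.events.find( { _id: 98 } );
> db.venues.find( { _id: 12 } );
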

The key here is that writing whole documents in and reading whole documents out without any joins is so much faster. And we removed most of the ORM logic. The combination is a big kick-in-the-pants for applications on the scale of Google and Amazon.

OK, someone has to ask: are we breaking first normal form? Maybe, maybe not, and I am not sure that it really matters. I do not view the event document as a row in a table or the sessions as repeated columns in that row. I view the event document as the object hierarchy I had in the application, and it just happens to be stored adjacently for improved performance.

I am also not sure that it matters, because I do break normal forms for safety and efficiency occasionally in the real world. Being "pure" does not mean anything if your application does not work. Here is an example: something in the third normal form says that in a bank account I should not keep a calculated balance column, because it duplicates the information available from the account transactions (Codd). Spin through them and you will produce the current balance. But in the real world do I want to expend that energy for every one of 100,000 clients simultaneously accessing the system? Of course not, so I keep a balance in the account to show the customers. If there is ever any doubt we can use the transactions to validate the balance.
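In SQL terms (with hypothetical tables), the "pure" approach derives the balance every time it is needed, while the pragmatic one just reads a stored column:

-- Pure: spin through the transactions to produce the balance.
SELECT SUM(amount) AS balance FROM AccountTransactions WHERE accountId = 5001;

-- Pragmatic: keep a balance column on the account and read it directly,
-- using the transactions only to validate it when there is any doubt.
SELECT balance FROM Accounts WHERE accountId = 5001;
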

So one document is great. We only looked at a single index, the key. You can put more indexes on the documents. But indexing technology has not changed, just what we are storing has. Everything we know about indexes still applies, and the more indexes you add the slower creates, updates, and deletes become. It is a balance, because if I need to swing a little towards an RDBMS world-view and aggregate data in different ways then I need to index the fields that I am using.

Key-Value Storage Model

In the key-value storage model we always have a primary key, and it is used for inserting and retrieving "values" that the database engine does not concern itself with. A value can be anything: a number, text, an image, or a PDF document. To compensate for not being able to read and index the value, key-value databases allow metadata to be added when the record is created. This metadata effectively puts fields and values outside of the data being stored. Except for where the index data is stored, with metadata the key-value model functions very similarly to the document storage model.

Column Family Storage Model

The column family model is described as a collection of columns containing name=value pairs, each collection identified by a row-key. In a simple view it works like the document model because the engine can read the data, but it moves the row-key outside of the unit, like the key-value storage model:


A super column family expands on this to create nested column families. Each row-key identifies a set of super columns, and each super column contains multiple columns, each identified with its own name:


event98 is the row-key; event, session98-1, and session98-2 are the super column names. The aggregate of the column groups is our single unit of storage. The key for the whole aggregate is still called the row-key, and that combined with a family name allows retrieval of the individual pieces. The point is that like the key-value and document models, everything is tied together in one unit for faster inserts and updates.

Graph Storage Model

With the graph storage model the events and the sessions are stored separately, but are closely linked together mimicking the application's object structure. In fact, that is a major goal of graphs: remove the need for the ORM layer between the application and the database. The edges of a graph (the links) are analogous to the references between objects in the application. The edges are entities with properties too. In a graph we still use indexes to find what we are looking for:


So a graph is much more efficient than RDBMS for retrieving our events, but a little worse than document storage because the edges have to be followed. It is a tradeoff that affects your choice: document storage is a little faster and may provide a bit better aggregation, but graph storage is much faster for inserting things like a session into an event, or performing updates, because it does not have to rewrite the complete document.

I finally figured this out for myself while writing some other documents about it a few years back, and then Martin Fowler came along in 2013 and pointed out the same similarities between the key-value, document, and even the column family models, at least as far as fixing our speed problems goes. I am a long-time fan of Martin, and I love reading his work. So it is only fair to reference it here (Sadalage; Fowler, "Introduction to NoSQL").

If You Break Something, You Get Shards!

So the problem with distributing the data in relational databases is keeping the related data together. That's a mouthful!

To clarify that: when I have an event in one table and sessions in another, and I decide that some part of both tables needs to be distributed to another computer, it is really hard to do. Unless the database engine has some expert knowledge to group the event and the sessions together on the same computer, they will get separated and things will be even less efficient. In the general sense maybe watching the foreign keys will help. But the very nature of the RDBMS is to allow different aggregations of the same data, and that leads to a cascading effect with the foreign keys when trying to decide what data belongs together. The physical distribution is different for different views. So nobody has been that successful with an RDBMS, yet.

To be fair, the graph storage model has a similar problem. What is the dividing line in the graph? If you follow the tree, everything ends up on the same computer. Maybe we can put information on the edges between nodes to influence the distribution?

The problem is virtually eliminated in the document, key-value, and column family storage models because all of the related information is saved as one aggregate unit. The documents can easily be distributed across multiple computers, or shards of the database:


The indexes on any forward-facing server can contain a shard location, and that quickly gets us to the right computer and the right document no matter where it is. Voila! Problem solved, at least for the foreseeable future. So NoSQL has helped us on two fronts: we get the data we need back quicker because it is kept together, and we have a clear path to distributed data because the unit of distribution is atomic.
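In MongoDB, for instance, spreading the events collection across shards comes down to a couple of shell commands (a sketch; the database name is made up):

> sh.enableSharding( "eventsdb" );
> sh.shardCollection( "eventsdb.events", { _id: "hashed" } );
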

So NoSQL Fixes Everything?

Of course it doesn't. The key-value, document, and column family storage models that we focused on fix retrieval speed and distribution in a high volume situation. Now you have to make a design decision and pick which storage model will work best for your application:


Is your application and data model more centered on hierarchical object references, like my event-session, or more organized around shared data, like my event-venue? If you lean towards event-session, then NoSQL. But event-venue? Then you probably need RDBMS.

What about the application environment? I have just one application and I am leaning towards NoSQL. But if you have multiple applications that share the data and use it differently, then you may need to go with RDBMS.

Can my application support the work to make the event-venue association in separate database calls? Then NoSQL. But if I need to isolate that logic and do the work in the database engine, then RDBMS.

And there is one more factor that enters the decision: maturity. RDBMS is mature and it is easy to find people who know how to use it and support it. The NoSQL products are immature; they are still finding their way, there are a large number of them to choose from, and there are fewer experienced people.

I believe that by now you have the picture, so good luck!

References

See the references page.
