Database theory, your friend for success.
In the past weeks I've read several articles / blog-posts and other digitally expressed thoughts about relational databases, query systems and how they all suck compared to K/V stores, CQRS, OODBs or whatever Hype of the Day-term. While most of them were simply re-labeling 20+ year old common knowledge, others were pretty stupid and downright sending the (novice) reader the wrong message. With 'wrong' I mean: the conclusions are based on false 'facts', assumptions and hand-waving n==1 pseudo-science.
Instead of writing a long essay here, I'll quote from and link to several Wikipedia articles and other articles which can help you learn about what relational models, databases are all about, what the theory is they're based on, why they work and what tools (as in: methodologies) are at your disposal. It's not meant to sell you the picture of 'OODB==bad, RDBMS==good', as that would be silly and as short-sighted as the articles I mentioned above. Instead you should see this small subset of knowledge about relational models and databases as a starting point for yourself when you are researching what to use and how to face a problem domain. After all, you can only make an informed decision if you know what you're talking about.
Relational model and theory
Relational databases are based on the relational model as described by E.F. Codd. Its core is based on predicate logic, and operations on the model are based on relational algebra which is a form of predicate logic.
Links:
- Relational model: http://en.wikipedia.org/wiki/Relational_model
- Relational algebra: http://en.wikipedia.org/wiki/Relational_algebra
- Peter Chen: The Entity Relationship model: http://www.csc.lsu.edu/news/erd.pdf
Why is this so important? It's important because it will teach you what the idea is behind grouping attributes together to entities and use them to define the meaning of data and above all: creating new entities from them by using projections.
Normalization and De-normalization
Relational models are in a given normal form. This normal form is a way to describe how much redundant information is stored in the data set defined by the relational model. To transform a model from a lower normal form to a higher normal form is called normalization. The other way around is called de-normalization. Normalization is considered a good thing because it solves various problems which arise with redundant information in a de-normalized model, if you perform data manipulation operations.
Links:
- Normalization: http://en.wikipedia.org/wiki/Database_normalization
- De-normalization: http://en.wikipedia.org/wiki/Denormalization
Important quote, which should describe for you why normalization is a Good Thing:
Normalized tables are suitable for general-purpose querying. This means any queries against these tables, including future queries whose details cannot be anticipated, are supported. In contrast, tables that are not normalized lend themselves to some types of queries, but not others
This is precisely the point of using a relational model: you store data in a model which gives it meaning in a general form, so you can create new information from it by using relational algebra, in such a way that it doesn't matter what query you might need in the future, it's already suitable for dealing with that.
Normalization has a downside: for reading data it could lead to performance intensive operations. For performance reasons, relational models sometimes get a de-normalized variant. This variant is then used for situations where a lot of queries have to be ran on many different tables with many joins to obtain the result requested. To use a copy of the data for reads, in an optimized de-normalized form, by projecting the data (using normal relational algebra) from the normalized model to the de-normalized model, read queries can be sped up drastically. As the de-normalized form is a copy of the original data, and isn't used for data manipulation, it doesn't run into the downsides of a de-normalized model.
In modern relational databases, these de-normalized variants are implemented by materialized views. Materialized views (in SQL Server they're called indexed views) are queries which results are stored on disk like a normal table. One could add indexes to these 'tables' to make them even more suitable for optimal performance in read logic. Materialized views can be partitioned across multiple systems to be able to handle massive sets of data. In today's hyped up term CQRS one can clearly recognize these two variants of the same model: for data manipulation work, the normal relational model is used, and for reads, the de-normalized variant with materialized views.
Links:
- Materialized view: http://en.wikipedia.org/wiki/Materialized_view
- Oracle, materialized view concepts: http://download.oracle.com/docs/cd/B10501_01/server.920/a96567/repmview.htm
- SQL Server, indexed view concepts: http://technet.microsoft.com/en-us/library/cc917715.aspx
Information Analysis
No matter what people say, always remember that performing analysis of your problem domain, which functionality is required, what the abstract entity definitions are that you can recognize in your problem domain etc. is a good thing. If you don't know what you're dealing with, you can't create successful software to solve the problem at hand. Period.
The big problem with analysis is: if you use a flaky analysis method, you could end up with a skewed picture of your problem domain and therefore your software will likely suck, however how do you know your analysis method is flaky?
In short, analysis comes down to: gather as much information so you can make informed, well reasoned decisions during the software developing process. So you need to convince yourself, based on facts, that your analysis was complete and you have gathered all the information to make these informed decisions. This is a complex process and you need a methodology which allows you to do proper analysis of your problem domain without leaving important areas untouched. One of the pioneers on this subject is E. Yourdon. Today, some people consider his work very 'dated' but I firmly disagree: analysis of problem domains hasn't suddenly changed, the tools Yourdon provides are still very much alive and useful today as they were 20+ years ago. If you look closer and understand what Yourdon meant with all the elements he described, you'll see that most of them have got new names in recent years, and are actually not that dated at all.
As the field of analysis is very wide and deep, it's impossible to describe all possible forms and methodologies. I'll give you a list of links below which is very incomplete, but should get you started.
Links:
- E. Yourdon, Structured analysis wiki: http://yourdon.com/strucanalysis/wiki/index.php?title=Introduction
- NIAM described in short: http://softwaredesign.nl/index.php?pageID=27
- Object Role Modeling (ORM): http://www.orm.net/
- Unified Modeling Language (UML): http://en.wikipedia.org/wiki/Unified_Modeling_Language
Friction and the Impedance Mismatch
One important aspect of a relational model is that it doesn't define behavior. Behavior is defined outside the relational model, through relational algebra using operations. This is a mismatch with modern day Object Oriented software design where data and behavior are combined in an object. If you look at software as code, you'll run into a problem if you want to store live objects you have in memory into a relational database: relational databases work with tables and relational algebra, not with objects and their embedded behavior.
This has resulted in a different form of database: the OO database or OODB. OODBs are nothing new, they've been around for many years. They're ideal if you want to store data in the object it is in as it is in memory, so your live in-memory objects can be stored without any conversion and you can get them back later on without a conversion (there are conversions, but they're not 'in your face', so hidden behind the scenes). You effectively look at the data as if it's inside your object model, so you navigate from one object to the next, not from one table to the next.
If you go back to the relational model and the quoted important aspect of a normalized model, you'll see that an OODB has an important side-effect: it's not really suitable for queries which don't match your object model (or don't use objects at all). This gets particularly problematic when the software using the OODB is replaced with a different system: the original database is outlived by the software requiring its existence however it's not usable in a different form as it's setup and designed to meet the requirements of the replaced software.
Being agile means to be able to cope with this, to be able to deal with change. If you use a relational model which is designed to represent reality, the facts you recognize in the problem domain, the software consuming it doesn't dictate how the model is defined, the reality does, as the relational model represents reality. That's an important aspect which makes it clear why relational models are so important, even in OO software: if the software is replaced with something else, the relational model can be re-used as well as all its data, simply because it's not tied to (or as I should say coupled with) the software consuming it.
To overcome this impedance mismatch between OO software and relational models, the concept of Object / Relational mapping has been introduced. This 'mapping' between two sides A and B is based on the idea that in A there is an element E which is directly related to an element F in B and vice versa. This theoretical connection between E and F, so the connection the mapping is based on, is the sole reason why it works in the first place. That's why you can save an object to a relational model and fetch it back (you're not saving the object, you're saving its data, the entity instance, but I've already written about that some time ago, see below). It works because both sides give the same meaning to the data. This is important because you can then work with a relational database in an OO fashion and you don't run the risk of being unable to deal with change.
As many databases outlive the software they're initially created for, it's important to realize this. All that matters is that if you look at your data, can you distill information from that data in any shape or form you might wish / need without the necessity of the special / original software the database was created for?
Links:
- Scott Ambler: O/R mapping in detail: http://www.agiledata.org/essays/mappingObjects.html
- By myself (about entities, entity instances etc.) : The database is the domain model
- Object databases: http://en.wikipedia.org/wiki/Object_database
I hope this article has given you some insight in how to deal with databases, what they are, what the essential aspects are of databases, relational models, why they're important and how to successfully embed them into your own software. Now, go and build great software based on well informed decisions.