Thursday, November 20, 2008

Data modeling

Time after time it is worth to remember (or learn) basic theory. One of them, which is very important when creating new systems, is data modeling. I'm quite sure most of the DBMS people have seen one or another data model with tables, columns, primary and foreign keys. However the theory of data modeling says that this is only one level out of three. Unfortunately I've seen quite many cases when people have forgotten two another levels. So let's remember what they are.

Conceptual data model

Conceptual data model contains entities, relationships and attributes. One can draw them using different notations for example ER modeling or UML, but the idea remains the same - this model doesn't know anything about the underlying DBMS. This model is the same for Oracle, SQL Server, DB2, MySQL, whatever else. So why is it important?

  • It speaks in business language and uses terms only from business users, not any programmer's or IT person's hacks.
  • It is normalized, every piece of information is stored only once.
  • It displays common structures and tries to aggregate them, it is easier to find them in such model.
  • It is readable (after a short introduction probably) by every sensible customer.
  • It doesn't contain any reference to any particular DBMS, so one can easily reuse it for ultimately any DBMS.

So the question - have you used it? If not - why? Are you aware of possible data waste dangers?

Logical data model

This is the model people usually use. It contains tables, columns, primary keys, foreign keys, unique keys. Some characteristics of it are:

  • It is created on the basis of conceptual data model, but it might not correspond 1:1 to it. For example some entities might be split, some unioned.
  • It may contain some programmer's hacks for example different flag columns for easier data selection.
  • It is aware of used DBMS. For example data types for columns are DBMS specific.
  • It may have denormalized columns (i.e. the same facts stored in more than one column) for performance reasons.
  • Table and column names might not exactly mimic business terms i.e. they might be abbreviated, changed to conform some naming rules and/or DBMS restrictions, although for easier readability it is worth to retain some dependency.
  • It is based on conceptual data model but created keeping performance and most common possible SQL statements in mind.

I'm quite sure you have used it, so the only question is - have you created it keeping performance and business critical SQL in mind?

Physical data model

What's that? Is there anything remaining to model at all? Yes it is. Heap table (i.e. table type which is used by default) is not the only available table type in Oracle. Partitioning is not the only option, which can be used in Oracle. So what is modeled here?

  • Table type - heap, index organized, clustered, sorted hash cluster, single table hash cluster there are many possible options.
  • Partitioning, compression, encryption - do you need it, can you afford it?
  • Indexes - will you use only b-tree or something more?
  • Will you use object tables?
  • What tablespaces will be created and on what devices/files?
  • Here comes the real strength (or weakness) of your chosen DBMS - if it offers all these features - why not at least consider them?

If you are DBA you have definitely used some of the features mentioned above, but have you thought about non-default ones? Have you systematically gone through at least the most business critical tables and considered what kind of table type to use, what kind of indexes create?

Not that I'm suggesting to enforce anything just for the sake of completeness ;) but I'm sure I've read something like that somewhere "If you have only heap tables and b-tree indexes, most probably your schema is not optimal".

Conclusion

Sometimes conceptual modeling is called logical and logical physical and then of course question remains how should be called the real "physical modeling", but it is not so important. The important thing is - whatever DBMS you choose for a new application it is worth to go through all the data model steps. Go through the conceptual modeling to better understand your requirements, to better confirm them with your customer, to better find out common patterns. Of course logical modeling cannot be avoided if we'd like to get at least one table, however quite many people forget about referential integrity constraints and rely only on application enforcing it. Sighhhh... And at last don't forget about physical modeling because when your table will contain (hundreds) millions of rows in production it would be much harder to change it's type, move it somewhere else etc.

No comments: