Wednesday, March 26, 2008

Error Handling and Event Logging

As part of my continuing effort to make this blog so esoteric that nobody ever reads it, I'm going to post some thoughts on Error Handling and Event Logging. I felt it was a good time to post this as I've been discussing it a bunch with the development team lately, and from those discussions we've worked out a couple of methodologies that I thought were worth sharing. I'm going to say up front that I recognize it's very likely that this has all been discussed and written about in numerous programming books, so it's probably not super original stuff. Nonetheless, I'm going to take a moment to describe how we are approaching the problem anyway.

The "Caller Responsibility" Error Handling Practice

Simply put, "He who starts the ball rolling catches any error thrown back."

The idea behind this best practice is that the process/method/procedure/whatever that started the workflow which ultimately generated an error is best suited to understand the context of the error in terms of the greater operation of the application. For example, if we have an application with two independent services that form a workflow, then we cannot give either service responsibility for handling an error, since neither understands the other service (they are "independent"). Instead, the process that governs the workflow and calls each service is the one that knows how an error in either should be handled.

This concept is important in SOA applications where abstraction between services is key to layering functional blocks. Each service only reports the error back, cleans up any internal state inconsistencies that may result, and then trusts the caller to react to the error in an appropriate way. I've found this approach helpful when breaking up work within teams by assigning distinct services to groups of individuals. Because each service's share of the error handling stops at reporting and cleanup, it's quite easy for different teams to plug into the bigger picture of error handling by trusting the caller to handle the rest.
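To make that concrete, here is a minimal sketch in Python of how it could look. Everything in it is made up for illustration (the `InventoryService` and `BillingService` classes, the `ServiceError` exception, and the `place_order` workflow), so treat it as one possible shape rather than how our code actually works. Each service cleans up after itself and reports the error back; only the workflow that called them decides what the error means.

```python
class ServiceError(Exception):
    """Raised by a service to report a failure back to whoever called it."""


class InventoryService:
    """Hypothetical service; it knows nothing about billing."""

    def __init__(self):
        self.reserved = []

    def _check_stock(self, item):
        if item == "out-of-stock":
            raise ServiceError(f"cannot reserve {item!r}")

    def reserve(self, item):
        self.reserved.append(item)
        try:
            self._check_stock(item)
        except ServiceError:
            # Clean up our own internal state, then report the error upward.
            self.reserved.remove(item)
            raise

    def release(self, item):
        self.reserved.remove(item)


class BillingService:
    """Hypothetical service; it knows nothing about inventory."""

    def charge(self, amount):
        if amount <= 0:
            raise ServiceError(f"invalid amount: {amount}")
        return f"charged {amount}"


def place_order(inventory, billing, item, amount):
    """The caller: it started the workflow, so it decides what each error means."""
    try:
        inventory.reserve(item)
    except ServiceError as exc:
        return f"order rejected: {exc}"
    try:
        return billing.charge(amount)
    except ServiceError as exc:
        # Only the workflow knows that a billing failure should undo the reservation.
        inventory.release(item)
        return f"payment failed: {exc}"
```

Note that neither service ever imports or mentions the other. The knowledge that a failed charge should release the reserved item lives only in `place_order`.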


The next thing I wanted to write about today was…

The "Reporter On Scene" Event Logging Practice

The idea here is simple: don't even bother logging exceptions/events if you are not going to capture enough information to do anything useful later on, when you are trying to figure out what the heck happened. It turns out that newspaper reporters have pondered and solved this problem for us by reporting on the "Six W's" (strictly speaking, five W's and an H). They are: Who?, What?, Where?, When?, How?, and Why? We can use the answers to give a complete picture of any event to the poor sucker reading the log at 3am when something breaks.

What exactly do we mean by all of this? Let's use an exception event as an example:

Who

"Who" is the component reporting the error. Usually, this is the website and class name (or page name).

What

"What" is the exception detail, such as the error description and stack trace.

Where

"Where" is the method and line number in the source code.

When

"When" is the time that the exception occurred (not the time it was reported).

How

"How" is additional context provided by the caller (see "Caller Responsibility" above) to explain the workflow that led to the exception.

Why

"Why" is probably the most difficult to automate but helps describe why the error happened. Often this is accomplished by providing additional supporting information such as the values of related variables and objects.

When the event log provides details that address each of these questions it is immensely useful for analyzing failure after the fact. Give it a try!
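For anyone who does want to give it a try, here is a rough sketch in Python of what one such log entry might look like. The `build_event` helper, the field names, the `event_log` list, and the `reports.report_average` component name are all invented for this example; only the `traceback`, `json`, and `datetime` calls come from the standard library. The timestamp is taken where the exception is caught and passed in, so that "When" stays as close as possible to the time of the failure rather than the time of reporting.

```python
import json
import traceback
from datetime import datetime, timezone

event_log = []  # stand-in for a real log sink


def build_event(exc, who, how, why, occurred_at):
    """Assemble one log record that answers all six questions."""
    frames = traceback.extract_tb(exc.__traceback__)
    last = frames[-1] if frames else None  # innermost frame: where it was raised
    return {
        "who": who,
        "what": {
            "type": type(exc).__name__,
            "message": str(exc),
            "stack": traceback.format_exception(type(exc), exc, exc.__traceback__),
        },
        "where": {"method": last.name, "line": last.lineno} if last else None,
        "when": occurred_at.isoformat(),
        "how": how,
        "why": why,
    }


def divide(total, count):
    return total / count


def report_average(total, count):
    """The caller: it supplies the 'how' and 'why' that only it knows."""
    try:
        return divide(total, count)
    except ZeroDivisionError as exc:
        occurred_at = datetime.now(timezone.utc)  # captured at the catch site
        event_log.append(build_event(
            exc,
            who="reports.report_average",
            how="computing the average for a report",
            why={"total": total, "count": count},
            occurred_at=occurred_at,
        ))
        return None


if __name__ == "__main__":
    report_average(10, 0)
    print(json.dumps(event_log[-1], indent=2))
```

"Why" is still the part a person has to think about: the sketch just records the variable values the caller considered relevant, which is usually enough to get the 3am reader started.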