Monday, May 07, 2007

The Siren Call of Web Services

Most of the solutions that I've worked on over the last few years have implemented a Service Oriented Architecture (SOA) and generally use web services to implement the services. Usually, the web services have worked out great but occasionally they have failed spectacularly. Why have they failed? Read on to find out.

The "Siren call"
The great thing about web services is that they are really easy to create (via the web service project template) and just as easy to consume (via a web reference) thanks to tight integration in Visual Studio. They are so easy to use that even new developers can be productive in a short period of time.
Web Services also make traversing application or network boundaries easy since constructing a SOAP call is easy (thanks to Visual Studio) and HTTP over port 80 will get you past most firewalls.

The "Danger"
The thing to keep in mind is that web services are slow. The ease of development and simple usability mean that we pay a price in performance.
Any call to a web service has to be translated into SOAP, sent over the wire as wordy XML over HTTP, translated from SOAP, run through the service logic, translated back into SOAP, sent back as wordy XML over HTTP, and finally translated back from SOAP on the client. This is a simplification of the actual process but you get the idea. There is a lot of activity around invoking even a simple "hello world" web service.
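To make that round trip concrete, here is a rough sketch in Python (standing in for the .NET plumbing that Visual Studio generates). The namespace http://example.com/hello, the Hello operation and every function name are made up for illustration, and the HTTP hop is replaced by a plain function call, so treat it as a picture of the translation steps rather than a real SOAP stack.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "http://example.com/hello"  # hypothetical service namespace


def to_soap(operation, params):
    """Wrap an operation name and its parameters in a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    op = ET.SubElement(body, f"{{{APP_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(op, f"{{{APP_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


def from_soap(payload):
    """Pull the operation name and parameters back out of an envelope."""
    envelope = ET.fromstring(payload)
    body = envelope.find(f"{{{SOAP_NS}}}Body")
    op = body[0]
    name = op.tag.split("}", 1)[1]  # strip the "{namespace}" prefix
    params = {child.tag.split("}", 1)[1]: child.text for child in op}
    return name, params


def hello_service(request_bytes):
    """Server side: decode the request, run the logic, encode the reply."""
    _, params = from_soap(request_bytes)
    greeting = "Hello, " + params["name"]
    return to_soap("HelloResponse", {"result": greeting})


def call_hello(name):
    """Client side: encode the call, 'send' it, decode the reply."""
    request = to_soap("Hello", {"name": name})
    response = hello_service(request)  # stands in for the HTTP round trip
    _, result = from_soap(response)
    return result["result"], len(request) + len(response)
```

Calling call_hello("World") returns the greeting along with the number of bytes that crossed the pretend wire, which is many times the size of the greeting itself.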

The "Fate"
At best, a slow web service means that your speedy application needs to wait while the service is invoked. Most of the time this is the only problem you'll have and it can be managed in a variety of ways (e.g. caching or asynchronous calls).
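As an illustration of the caching option, here is a minimal sketch in Python. CachedCall and its ttl_seconds parameter are names invented for this example, and it assumes the service arguments are hashable and that a slightly stale answer is acceptable for the lifetime of the cache entry.

```python
import time


class CachedCall:
    """Remember the results of a slow call for ttl_seconds."""

    def __init__(self, func, ttl_seconds, clock=time.monotonic):
        self._func = func
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the cache can be tested without sleeping
        self._store = {}  # args -> (time stored, value)

    def __call__(self, *args):
        now = self._clock()
        hit = self._store.get(args)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: skip the slow service call
        value = self._func(*args)
        self._store[args] = (now, value)
        return value
```

Wrapping a slow lookup as cached = CachedCall(slow_lookup, ttl_seconds=30) means repeated calls with the same arguments hit the service at most once every 30 seconds. It only suits calls that read data; a call that writes should not be cached this way.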
At worst, a slow web service could take down your beautiful solution and duplicate or lose important data.

How could a web service do so much damage?

It generally comes down to threads. Specifically, not having any available threads to handle another service request. I'm not claiming to know exactly how the threading works but every call to a web service requires threads from IIS and .NET. These technologies maintain a pool of threads that service requests and then wait for the next one.
Normally, a request can be serviced quite quickly and the thread is freed quickly. However, under high load a thread might not free up in time to service the next request so another thread is needed to handle it. As requests hit the server the threads are allocated and freed as quickly as possible but under high load requests can 'pile up' until eventually there are no threads left to service the next request. When this happens you will begin to see longer delays in service response as the caller waits for the service to answer while the service waits for a thread to free up. Around this time you will likely start to see "request timed out" errors.
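The 'pile up' can be pictured with a toy model. The Python sketch below is not how IIS or .NET actually schedule work; it simply assumes a fixed pool of threads, requests that arrive at a steady interval, and a fixed execution time per request, and then works out how long each caller ends up waiting. All of the names and numbers are made up for illustration.

```python
import heapq


def simulate(pool_size, service_time, arrival_interval, n_requests):
    """Return each request's response time (waiting plus executing).

    Requests arrive every arrival_interval seconds and each one holds a
    thread for service_time seconds. A request that finds no free thread
    waits, in arrival order, for the next thread to free up.
    """
    free_at = [0.0] * pool_size  # when each thread next becomes free
    heapq.heapify(free_at)
    response_times = []
    for i in range(n_requests):
        arrival = i * arrival_interval
        thread_free = heapq.heappop(free_at)  # earliest available thread
        start = max(arrival, thread_free)
        finish = start + service_time
        heapq.heappush(free_at, finish)
        response_times.append(finish - arrival)
    return response_times


def count_timeouts(response_times, timeout):
    """Count the callers that would have given up waiting."""
    return sum(1 for t in response_times if t > timeout)
```

With 4 threads and a 1 second service, one request every 0.5 seconds is answered in 1 second every time. Drop the interval to 0.125 seconds and the response times climb steadily as the backlog grows, and a caller with a fixed timeout eventually starts giving up.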
If the duress continues and requests continue to arrive then the server will ultimately give up and start throwing IIS and/or .NET exceptions such as "There were not enough free threads in the ThreadPool object to complete the operation."
It is at these times that I have seen unexpected conditions arise and produce unpredictable results. If the caller gives up on the service (its timeout expires) you would expect the service call to perish with it, but there are boundary conditions in which the service continues to execute even though the client has given up on it. The service may not run to completion but even a few lines can cause problems. We have seen duplicate rows inserted into a database this way: the client assumed the call had timed out and retried it, while the original call was still executing on the server.
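One common defence against this kind of duplicate, which I offer as a suggestion rather than as what we did at the time, is to have the client send a unique request id and have the database refuse to store the same id twice. The Python sketch below uses an in-memory SQLite database; the orders table, its columns and the function names are all hypothetical.

```python
import sqlite3
import uuid


def create_store():
    """Create a throwaway database with one row allowed per request id."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE orders (request_id TEXT PRIMARY KEY, item TEXT NOT NULL)"
    )
    return db


def new_request_id():
    """Generated once by the client and reused on every retry of that call."""
    return str(uuid.uuid4())


def place_order(db, request_id, item):
    """Insert the order once per request_id; return True if a row was added."""
    with db:  # commits on success, rolls back on error
        cursor = db.execute(
            "INSERT OR IGNORE INTO orders (request_id, item) VALUES (?, ?)",
            (request_id, item),
        )
    return cursor.rowcount == 1
```

If the client times out and retries with the same request id, the second insert is ignored, so the retry is harmless even when the first call did in fact complete on the server.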

The "Safety Net"
The first thing to do when working with web services is to get a good understanding of the maximum load that the service will have to bear. Plan for that load and test the crap out of it before going live. Hit that service as hard as you can for as long as you can before you trust it with your data. I've seen very innocuous web services cause absolute chaos under extreme duress so don't assume that it will 'just work'.
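A proper load testing tool is the right instrument for this, but the idea fits in a few lines. The Python sketch below is only an illustration: hammer is an invented name, and call is assumed to be any zero-argument function that invokes your service once and raises an exception on failure.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def hammer(call, total_calls, concurrency):
    """Invoke call() total_calls times from concurrency worker threads.

    Returns (latencies sorted ascending, in seconds; number of failed calls).
    """

    def timed(_):
        started = time.perf_counter()
        try:
            call()
            ok = True
        except Exception:
            ok = False  # count the failure and keep hammering
        return time.perf_counter() - started, ok

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed, range(total_calls)))
    latencies = sorted(elapsed for elapsed, _ in results)
    errors = sum(1 for _, ok in results if not ok)
    return latencies, errors
```

Watching how the slowest latencies and the error count change as concurrency goes up gives an early hint of where the service starts to struggle.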

The second thing is to actively work to reduce the execution time of that service to the smallest time possible. Threads are only used during execution so if the service executes quickly then it will free that thread up quickly.

Lastly, think about farming out your web servers. There are a multitude of configuration settings for IIS and .NET to tune threading but all of these settings will ultimately fail under extreme load. A well-tuned server might die later but it will still die eventually if the traffic is extreme enough. The best way to give your application more threads to work with is to use multiple servers. Traditionally these would be physical web servers, but farms of virtual servers are increasingly being used quite effectively.

I hope you'll excuse my crude metaphor of hapless sailors, but perhaps my little brain dump helps you plan your web services better in the future and avoid some of the pain I've experienced.