Behaviour in case of service unavailability

  • Last Post 17 April 2013
  • Topic Is Solved
Nico posted this 12 April 2013

The documentation doesn't say anything about the HTTP code sent when the service is (temporarily) unavailable. To cope with such situations programmatically (e.g. classify them as temporary errors), it would be nice to know whether a specific HTTP code (maybe HTTP 503) is returned, and if so, which one.

Thanks, Nico

Andrey Isaev posted this 12 April 2013

The thing is that the service is designed to be fail-safe, redundant and always up. There are no planned outages. We first test updates in a staging environment, and when everything is OK we switch staging and production. The production environment consists of several independent instances, and if one fails the others continue to work. There are several levels of monitoring that detect problems and take proactive actions, such as rebooting a role.

With all that said, the only case in which the service does not work as expected is when something has gone wrong. I mean really wrong, such as a global failure of the Microsoft Azure hosting environment. In that case, we cannot predict exactly which error code you would get. That depends entirely on how exactly things went wrong.

Nico posted this 12 April 2013

The questions in this forum show that there have indeed been some outages in the past. It would be really beneficial to have a defined error code for these situations. Otherwise we cannot build reliable software that distinguishes between temporary failures and those that persist.

Andrey Isaev posted this 13 April 2013

I understand your intention, and it is reasonable. However, I don't really see how to fulfill it. It is as if you asked a person who is sick and going to die soon: "Just call me when you are dead, so I am not late to the funeral."

Seriously speaking, the program has to be functioning normally in order to return the expected error code. But then why wouldn't it just work? And if something has gone wrong and the program cannot function, it very likely cannot return a special code either. For example, what code should it return if the whole data center gets cut off from the internet?

Andrey Isaev posted this 13 April 2013

Outages happen; that is the nature of the internet. Google has them from time to time, and so do Amazon and others. And those are the leaders. The real question is what outage probability is acceptable for your business scenario. So far our service has proved to be pretty stable.

Nico posted this 14 April 2013

Maybe my initial question was misleading. To use your words: I'm not asking for an answer from a dead person, but for an answer from a sick person who is unable to serve my request right now. :-) Meaning: if a request cannot be served within a reasonable time frame, the client should be told by means of an appropriate error code.

I mean, the HTTP specification defines an error code for exactly these situations (HTTP 503). If it were pointless, why would they have done so? ;-)

Natalia Karaseva posted this 16 April 2013

On the one hand, it is not always possible to return a reasonable error code. On the other hand, it is true that there are cases where it is possible. So both sides are right.

The Cloud OCR SDK developers agree that in some situations, when our Web Role is able to diagnose a breakdown of the Worker Role, we could return HTTP 500.

Actually, the known problems have been solved and will not recur. In other words, we are not aware of any situations in which we could return HTTP 503, which is why it is not described in the documentation. As soon as our tests or the development process reveal such "vulnerabilities", we will use this approach and return the more specific error 503 instead of the generic error 500.

Natalia Karaseva posted this 16 April 2013

Let me just clarify the types of errors:

If something goes wrong during task processing, the task is assigned the ProcessingFailed status and no money is debited. If something goes wrong with FREngine inside a handler, the task will have the ProcessingFailed status too.

If something is wrong with the environment (database, blob storage, network), the server will return the error "HTTP 500 Internal Server Error".
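For readers who want to act on this distinction in client code, here is a minimal sketch of how the two failure paths might be told apart. Only the ProcessingFailed status and the HTTP 5xx range come from this thread; the `Outcome` enum, the `classify` function and the catch-all `OTHER` bucket are hypothetical names chosen for illustration and are not part of the Cloud OCR SDK.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    """Hypothetical client-side categories; not part of any SDK."""
    OK = "ok"
    TASK_FAILED = "task_failed"  # task ended with the ProcessingFailed status
    TEMPORARY = "temporary"      # HTTP 5xx: environment problem on the server side
    OTHER = "other"              # anything this thread does not cover (e.g. 4xx)


def classify(http_status: int, task_status: Optional[str] = None) -> Outcome:
    """Map an HTTP status code and an optional task status to an Outcome."""
    if 500 <= http_status <= 599:
        # Per the post above, environment failures surface as HTTP 500;
        # a future HTTP 503 would land in the same bucket.
        return Outcome.TEMPORARY
    if 200 <= http_status <= 299:
        if task_status == "ProcessingFailed":
            return Outcome.TASK_FAILED
        return Outcome.OK
    # Assumption: everything else is left for the caller to decide.
    return Outcome.OTHER
```

Treating the whole 5xx range as one bucket means the client would not need to change if the server later starts returning 503 instead of 500.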

Our developers will think about catching particular types of errors and returning HTTP 503 in those cases. But in any case, if you get an HTTP 5xx response, it is worth waiting 10 minutes (if that is possible) and contacting the server again only after that break.
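The wait-and-retry advice could be sketched on the client side roughly as follows. This is an illustration under stated assumptions, not an official client: `call_with_retry`, `send` and `max_attempts` are hypothetical names, and `send` stands for whatever function performs the actual request and returns the HTTP status code. The sketch only covers the case where the server responds at all; connection errors or timeouts raised by `send` are not handled here.

```python
import time
from typing import Callable

RETRY_WAIT_SECONDS = 10 * 60  # the 10-minute break suggested in the post above


def call_with_retry(
    send: Callable[[], int],
    max_attempts: int = 3,  # assumption: the thread does not give a retry limit
    wait_seconds: float = RETRY_WAIT_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a non-5xx status or attempts run out.

    Returns the status code of the last attempt.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status = send()
    for _ in range(max_attempts - 1):
        if not 500 <= status <= 599:
            break  # success or a non-5xx error: waiting will not help
        sleep(wait_seconds)  # leave the server alone before trying again
        status = send()
    return status
```

Passing `sleep` as a parameter keeps the function easy to test without actually waiting; in normal use the default `time.sleep` applies.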

Nico posted this 17 April 2013

Sounds promising. Thanks.