First off, let's get one thing straight: Message Queues are awesome.

They allow you to decouple the various parts of your application from each other, and communicate asynchronously. They protect you from data loss during restarts, and give you an excellent visualisation of processing bottlenecks in your system.

It all starts by passing a message from A to B, via a queue Q.

You become enamoured with this approach and the flexibility it offers, running multiple instances of A and B and scaling to your heart's content.

A requirement for a new data-source pops up, and you think: "Aha! My queue shall save me. I'll just write a new process C, and have it publish messages to B via Q." This works wonderfully, and you sleep soundly at night.

Many months later you've run out of single-letter monikers for your applications, and many thousands of messages flow through your sturdy queues every second of the day. A sends to B via Q which sends to F via R which sends to G via S which sometimes sends back to A via T. You've got passive and active monitoring, graphing and alerting on the size and throughput of those lovely, lovely message queues.

And then one day a process logs an error: "Hey, this message is a bit dodgy - I can't handle this!". It includes enough debug information for you to see that yes: it is certainly a dodgy message, and failing to handle it is the correct course of action. But how on earth did it get there?

If you're anywhere along the journey above, these tips are for you.

Tip #1 - Identification

When a new message is assembled, give it a globally unique identifier.

You can store this in AMQP's message-id field.

Tip #2 - Birthday

When a new message is assembled, give it a timestamp.

You can store this in AMQP's timestamp field. Be as accurate as you can, but know that computers rarely completely agree on time - so don't overly worry about this.

Tip #3 - Source

When a new message is assembled, note which application is assembling it - and ideally why.

You could encode this into the message-id, or use some combination of AMQP's app-id, user-id and type fields.
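Tips 1 to 3 can be sketched together. This is a minimal illustration rather than a definitive implementation: the message is a plain dict whose keys mirror the AMQP property names, and `new_message`, the app name and the type string are all made up for the example.

```python
import time
import uuid


def new_message(app_id: str, msg_type: str, body: dict) -> dict:
    """Assemble a brand-new message with an identity, a birthday and a source."""
    return {
        "properties": {
            "message_id": str(uuid.uuid4()),  # Tip 1: globally unique identifier
            "timestamp": int(time.time()),    # Tip 2: whole seconds since the epoch
            "app_id": app_id,                 # Tip 3: which application assembled it
            "type": msg_type,                 # Tip 3: a hint at why it was assembled
            "headers": {},
        },
        "body": body,
    }


# Hypothetical application name and message type.
msg = new_message("importer", "order.created", {"order_id": 42})
```

The timestamp is truncated to whole seconds here because that is the resolution of the AMQP 0-9-1 timestamp field. If you need finer resolution, a custom header is one option.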

Tip #4 - Ancestry

This is the really important one.

When creating a new message as a result of an existing message, include the data from tips 1, 2 and 3 in the new message.

Depending on the properties of the new message, you can reuse the same fields, add entries to the headers field with names like parent-message-id or source-message-id, or even carry the data in the message body in some way.
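Here is one way the headers approach might look. It is a sketch under assumptions: messages are plain dicts keyed like the AMQP properties, `derive_message` is a hypothetical helper, and the header name source-app-id is invented for the example. The source-* headers always point at the message that started the chain, while parent-message-id points at the immediate predecessor.

```python
import time
import uuid


def derive_message(parent: dict, app_id: str, msg_type: str, body: dict) -> dict:
    """Create a new message that remembers where its parent came from."""
    p = parent["properties"]
    inherited = p.get("headers", {})
    headers = {
        # The message we are reacting to right now.
        "parent-message-id": p["message_id"],
        # The message that started the whole chain: inherit it if the parent
        # already carries it, otherwise the parent is the origin.
        "source-message-id": inherited.get("source-message-id", p["message_id"]),
        "source-message-timestamp": inherited.get(
            "source-message-timestamp", p["timestamp"]
        ),
        "source-app-id": inherited.get("source-app-id", p["app_id"]),
    }
    return {
        "properties": {
            "message_id": str(uuid.uuid4()),
            "timestamp": int(time.time()),
            "app_id": app_id,
            "type": msg_type,
            "headers": headers,
        },
        "body": body,
    }


# A hand-written root message, then two hops through the system.
root = {
    "properties": {
        "message_id": "root-1",
        "timestamp": 1700000000,
        "app_id": "A",
        "type": "order.created",
        "headers": {},
    },
    "body": {"order_id": 42},
}
child = derive_message(root, "B", "order.priced", {"order_id": 42})
grandchild = derive_message(child, "F", "order.shipped", {"order_id": 42})
```

After two hops, `grandchild` still carries root-1 as its source-message-id, which is exactly what you want when a dodgy message turns up three queues downstream.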

Tip #5 - Tracking

Throughout the lifecycle of messages in your system, include the source-message-id in your log lines. This allows messages to be correlated to the event that spawned them.
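One way to avoid forgetting the id is to let the logger add it for you. The sketch below uses `logging.LoggerAdapter` from the Python standard library. The dict-shaped message, the `logger_for` helper and the log format are assumptions made for illustration.

```python
import io
import logging


def logger_for(message: dict, base: logging.Logger) -> logging.LoggerAdapter:
    """Return a logger that stamps every line with the source-message-id."""
    props = message["properties"]
    # Fall back to the message's own id when it is the start of a chain.
    source_id = props.get("headers", {}).get(
        "source-message-id", props["message_id"]
    )
    return logging.LoggerAdapter(base, {"source_message_id": source_id})


# Log to an in-memory stream so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s source=%(source_message_id)s %(message)s")
)
base = logging.getLogger("worker")
base.addHandler(handler)
base.setLevel(logging.INFO)
base.propagate = False

message = {
    "properties": {
        "message_id": "child-7",
        "headers": {"source-message-id": "root-1"},
    },
    "body": {},
}
log = logger_for(message, base)
log.info("priced order")
```

Every line logged through `log` now carries source=root-1, so a single search across all your applications' logs pulls up the whole story. Note that this format string expects the extra field, so records logged directly on `base` without the adapter would fail to format.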

Tip #6 - Lifetime

When an action is performed, log the time delta between the source-message-timestamp and the current time.

Network time means this isn't necessarily exact - but it is a useful indication of the health and throughput of the system. We like to refer to this as "message lifetime".
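The calculation itself is tiny. A rough sketch, assuming dict-shaped messages with the source timestamp stored in a source-message-timestamp header as seconds since the epoch (`message_lifetime` is a hypothetical helper name):

```python
import time
from typing import Optional


def message_lifetime(message: dict, now: Optional[float] = None) -> float:
    """Seconds between the birth of the source message and now."""
    props = message["properties"]
    # Prefer the inherited source timestamp; fall back to the message's own.
    born = props.get("headers", {}).get(
        "source-message-timestamp", props["timestamp"]
    )
    current = time.time() if now is None else now
    # Clocks on different machines rarely agree exactly, so a message can
    # appear to arrive before it was sent. Clamp rather than report a
    # negative lifetime.
    return max(0.0, current - born)


message = {
    "properties": {
        "message_id": "child-7",
        "timestamp": 1700000010,
        "headers": {"source-message-timestamp": 1700000000},
    },
    "body": {},
}
lifetime = message_lifetime(message, now=1700000012.5)
```

The `now` parameter only exists to make the function easy to test. In a worker you would call `message_lifetime(message)` and write the result to your logs or metrics alongside the source-message-id.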

Tip #7 - Factory

Create a small standalone module that exports only factory functions for creating messages in the format that will be used amongst your queues.

Give this module a version number, document it well, and then treat it as a third party dependency of any application that interacts with a message queue.

Why?

I've recently been attempting to debug issues in a system with a myriad of processes that interact via message queues. There are many desirable properties of this system - particularly around fault tolerance and horizontal scaling.

Many of the bugs I see are either very hard to find the root cause of, or a result of very slightly inconsistent message contents.

The tips above are intended to make it far easier to resolve such issues. The lifetime thing is just a nice-to-have.
