Tue. Mar 19th, 2024

A problem developers face is extracting messages from a byte stream. This post covers a strategy for accomplishing that task.

This is not a problem typically found in higher level coding spaces. Much of modern development has been relegated to the realm of HTTP and web servers, or other middleware such as message queuing systems that abstract away the problem behind some library.

Lower level developers face this problem, often when working with embedded systems and interfacing with devices on memory constrained systems. More generally, this is a problem faced by all developers working with protocol implementations.

The Problem

Data bytes are streamed to an application from some source. This may be a file on disk, a network socket, hardware such as a serial port, I2C bus, SPI bus, or similar. An application reading these bytes needs to decode useful data from the stream, ignore noise, and recover from errors.

Consider a simple message having the following structure:

<header><data><footer>

where:

<header> will always be the byte value (0x55)

<footer> will always be the byte value (0xAA)

<data> is alpha numeric data i.e. a-z, A-Z, and 0-9 only.

The message is byte oriented; Unicode characters are not covered for simplicity in this case. A simple message may look like so (values in parentheses are single hex bytes):

(0x55)ABC(0xAA)

And two consecutive messages may look something like this:

(0x55)ABC(0xAA)(0x55)123(0xAA)

In both cases processing could be accomplished as follows:

  1. Allocate a 5 byte buffer
  2. Read 5 bytes (from disk, some bus, a network socket, etc)
  3. Ignore the first and last byte and you have 3 valid data bytes

This solution appears fine, but contains numerous problems including:

  1. The previous example was intentionally selected so both messages were the same size. The message specification places no limit on message size, nor does it specify that each message must be the same size
  2. Both messages above were examples of clean data – every read is assumed to return a valid message byte. The data stream is free of noise/garbage
  3. A buffer containing 5 bytes of clean data guarantees a full message will fit completely into a single buffer. There is no chance of the buffer containing a partial message. In reality partial message data is commonplace and a full message often needs to be constructed from repeated partial data reads
  4. Despite needing 3 bytes of data, provision has been made in the buffer for the header and footer framing bytes. Admittedly there are only 2 bytes so the overhead is not huge, but this is often unnecessary overhead resulting in higher than required RAM usage. For a server with 32GB of RAM this is not usually an issue. For a microcontroller with 512 bytes of RAM, this may be a problem.

To illustrate the problems above, consider the same 2 messages, but in a dirtier data stream.

(0x55)(0x55)ABC(0xAA)(0x00)(0x00)(0x55)12(0x55)123(0x55)123(0xAA)(0x00)

The byte stream above contains 2 valid messages, highlighted below:

(0x55)(0x55)ABC(0xAA)(0x00)(0x00)(0x55)12(0x55)123(0x55)123(0xAA)(0x00)

Now apply the previous solution:

  1. Allocate a 5 byte buffer (no problem here)
  2. Read 5 bytes. The problem starts here. Rather than read a full message we will instead have: (0x55)(0x55)ABC
  3. Discard the first and last byte. This leaves us with: (0x55)AB This is not a valid message

Repeat the process:

  1. Allocate a 5 byte buffer
  2. Read 5 bytes. This reads the following: (0xAA)(0x00)(0x00)(0x55)1
  3. Discard the first and last bytes and the message is: (0x00)(0x00)(0x55) This is not a valid message.

Repeating the process a third time gives the message data:

(0x55)12

This is not a valid message

Repeating again produces the data:

123

This is a valid message!

Finally, we have one stray byte in the data stream

Therefore the dirty data stream resulted in missing one message completely. We successfully read the second message but that was coincidence. Had there been one more or one less stray byte in the stream between valid messages, both valid messages would go missing.

The Solution – State Machines

Modern reactive programming frowns on old fashioned ideas like state machines. They’re monolithic and don’t easily conform to the best practices recommended by testing frameworks. But not all old ideas are bad and state machines, while old fashioned, are both simple and powerful.

The state machine pictured below solves the problem. Before discussing how to construct the state machine, let’s apply it to the dirty data stream and see what happens. The data is also shown again below:

(0x55)(0x55)ABC(0xAA)(0x00)(0x00)(0x55)12(0x55)123(0x55)123(0xAA)(0x00)

  1. We always start at the state marked “start”
  2. The first byte read is (0x55). We follow the arrow (known as an edge) to the collect state. We don’t need to collect this byte – we’re interested in the data only, not the “packaging” around the data.
  3. We read next second byte, also (0x55). Since we’re in the collect state, we follow the “Read 0x55 and restart collection” edge. This means discard any data we have collected (we haven’t collected anything yet), and remain in the collect state. So we’re ready to collect data again.
  4. Next we read ‘A’. This is a valid data byte, so we save it.
  5. The next 2 reads are for B and C, both data bytes, so we save them too.
  6. We read (0xAA). Following the appropriate edge, we transition to the “stop” state. We have a full message of data bytes only. In our naive solution we completely missed this message! The process then starts over by returning to the “start state”

Already we’re looking great. The first iteration of the state machine captured one message that would otherwise have been lost. Let’s continue processing the byte stream.

  1. We resume and the next byte read is (0x00). The start state only transitions to the collect state when (0x55) is read. Since we’ve read (0x00), we follow the “Not 0x55” edge and remain in the start state. The following byte is also (0x00), so once again we remain in the start state.
  2. We read (0x55) and transition to the “Collect” state
  3. Then we read the byte values “1” and “2” respectively. Both these bytes are saved. It looks like we have another message.
  4. But no! The next byte is (0x55). Since we’re in the collect state we follow the “Read 0x55 and restart collection” edge. All collected data is discarded. We don’t have a message after all, just a partial message. For our use case, that qualifies as garbage.
  5. The next three bytes are 1,2,3 which we collect since we’re in the “Collect” state
  6. But as luck would have it we then read a (0x55), So we discard all the collected data and restart data collection in the “Collect” state; we had another partial message.
  7. The next three bytes are 1,2,3 which we collect
  8. Then we read (0xAA) and follow the edge to the “stop” state. We have another message!!!
  9. We transition back to the start state and read 0x00. We can’t do anything with this, so we remain in the start state. This is the end of the stream

A second message was captured successfully. The thing is, it doesn’t matter how much noise is added to the system. The state machine is designed to look for what we’re interested in and discard the rest.

Try increasing, decreasing, or changing the noise between, before, or after the two messages in the data stream. Both messages will still be read successfully.

How To Build A State Machine

A state machine consists of, well, states. A state is like a snapshot in time of your processing logic. The state itself doesn’t do anything, but when something happens (eg. a new byte is read) the code either stays in the current state or follows an edge (i.e. the arrows) and transitions to another state.

The logic for each state should be simple. Similarly, the data fed into the state machine should be simple. Even if you read multiple bytes of data from the stream at once, always feed the data into the state machine one unit at a time. I say “unit” because in these examples 1 unit is one byte. In your state machine, one unit may be one UTF-8 character, which may be up to 4 bytes. That’s OK. Just ensure you work with the smallest unit of data that’s reasonable for your problem.

Let’s break down our message structure. It was described above like so:

<header><data><footer>

The message is easily split into 3 parts:

  1. <header>
  2. <data>
  3. <footer>

That seems obvious, but that’s exactly the point. A state machine helps you do exactly what you want, simply. So we need to read the header, and then collect some data, then read a footer.

At a minimum we need 3 states, one to read the header, one to collect data, one for the footer. Now traditionally, state machines always have start and stop states. In the state machine diagram above, the header was read in the start state and the footer in the stop state. The collect state was for data collection. Other state machines don’t have the start and stop states actually do anything, they’re just there to indicate where processing should start and end. I’m trying to be a little more practical and using all states.

When building a state machine, start with the simplest case. Pretend everything is perfect and you have no errors in your system. Then draw the transitions that lead to success. It may look like this:

Basic state machine with no fault tolerance

This state machine so far is not fault tolerate at all but apply it to a clean data stream such as:

(0x55)ABC(0xAA)(0x55)123(0xAA)

and you will get two valid messages. In fact, you can even change the message sizes and the state machine will still work perfectly. But only on condition the data is clean.

Now that you have a basic solution, the idea is to consider problems. But when doing this, treat every state individually!!! For instance when you’re in the “Start” state, you are only interested in what happens to the “start” state. Do not worry about what will happen at other states. You focus on one state only. At no point should something occurring in one state affect another state!! This is the beauty of state machines. Keep your thinking small and simple. There are no complex dependencies anywhere!

So, let’s consider the start state. We know we want to start the process by reading (0x55). We’ve already added that to our machine. But what if something goes wrong and we don’t get (0x55)? Well, it’s not what we want, so there’s no place for us to go. In other words, we should stay where we are. We dealt with this condition by drawing a transition for “Not 0x55” in the original state machine. That keeps us where we are, in the “start” state. Since we’ve catered for (0x55) and “Not 0x55”. That’s literally everything. We’re done here, let’s move on.

Again, don’t overthink things. Don’t create complex dependencies. Think about the state you’re dealing with only and nothing else. The most complex part of this process is realizing how easy it is!

The full state machine is shown again below for reference.

Next, let’s look at the “Collect” state. When looking at this we don’t care about any other states..

What do we need to do in the collect state? Well, we need to collect alpha numeric values. So if we we read() a byte in the range A-Z,a-z,0-9, we save it. But what about everthing else? Well, if we get anything else it means we shouldn’t be collecting data. So let’s look at what may happen:

  1. We get (0xAA). If we get that, we have a full message so transition to the end state
  2. What if we read (0x55) again? We could go back to the start state, but (0x55) means we need to start collecting data. If we returned to the start state, the next byte we read will be the second byte of the message which we should be collecting. Since we’re already in the collect state, that means we should just remain here i.e. that’s the reason the state loops back to itself. The fact that we got (0x55) instead of an end of message (0xAA) means we don’t have a complete message. That’s why we discard the bytes we already collected – we want to start over from scratch.
  3. Lastly, if we read any other byte we literally have nothing we can do with it in our message specification. If we read it something has gone wrong with processing. There’s no point in going to the end state since that’s where we go if we have a full message. What we want to do is start over. So just go back to the start

Like before, we’re simply looking at what do we want, and what don’t we want for this state only.

Lastly we look at the stop state. When we get here we have a full message. So process the message as required by your application. There’s nothing special that we need to do here. We’ve completed processing. Just start over.

And we’re done.

How to implement this?

The easiest way is a switch() statement. This works in languages like Java and C. Rust’s version of the switch() statement is “match”. For languages like Python, a simple if-else statement works perfectly.

I have a full rust message decoder on github for a more complex sample message. You can clone the repository at https://github.com/faeemali/blog-protocol-decoder

Here’s some pseudo C code to illustrate processing:

typedef enum {
	GET_HEADER,
	READ_DATA
} states;
static states state = GET_HEADER;
void main() {
	sock = open_socket();
	while (1) {
		size = read(sock, buffer, 16);
		for (int j = 0; j < size; j++) {
			b = buffer[j];
			switch (state) {
				case GET_HEADER: {
					if (b == 0x55) {
						state = GET_DATA;
						break;
					} //else stay here. There’s nothing more to do!
				}
				break;
				case GET_DATA: {
					if (b == 0x55) {
						/* looks like we have a partial message */
						clear_saved_data();
					} else if (b == 0xAA) {
						/* 
						 * we have a full message. Start over. Don’t waste time
						 * going to a footer state that does nothing. Just start over		
						 */
						state = GET_DATA;
						break;
					} else {
						/* save the data byte somewhere */
						save_data_byte(b);
					}
				}
				break;
			}
		}
	}
}

And that’s it, a fully fault tolerant message processor that will filter out noise, discard partial messages, ignore overhead and allow for saving only data bytes.

From our state machine diagram, the STOP state didn’t do much, so rather than waste a loop iteration (and read() call), the GET_DATA state looks to see if we have a full message and if we do, it just starts over.

State machines are brilliant for message processing, but additionally for simplifying almost any complex problem. It surprises me that they aren’t used more often. I frequently see massively overengineered code made reactive with multiple layers of complexity to address problems that could be solved simply with an old fashioned state machine.

That’s all folks!

By user

Leave a Reply

Your email address will not be published. Required fields are marked *