Unionize Your Variables – An Introduction to Advanced Data Types in C

Programming C without variables is like, well, programming C without variables. They are so essential to the language that it doesn’t even require an analogy here. We can declare and use them as wildly as we please, but it often makes sense to have a little bit more structure, and combine data that belongs together in a common collection. Arrays are a good start to bundle data of the same type, especially when there is no specific meaning of the array’s index other than the value’s position, but as soon as you want a more meaningful association of each value, arrays will become limiting. And they’re useless if you want to combine different data types together. Luckily, C provides us with proper alternatives out of the box.


This write-up will introduce structures and unions in C, how to declare and use them, and how unions can be (ab)used as an alternative approach for pointer and bitwise operations.



Structs


Before we dive into unions, though, we will start with a more common composite type: the struct. A struct is a collection of an arbitrary number of variables of any data type, including other structs, wrapped together as a data type of its own. Let’s say we want to store three 16-bit integers representing the values of a temperature, humidity, and light sensor.


Yes, we could use an array, but then we always have to remember which index represents what value, while with a struct, we can give each value its own identifier. To ensure we end up with an unsigned 16-bit integer variable regardless of the underlying system, we’ll be using the C standard library’s type definitions from stdint.h.



#include <stdint.h>

struct sensor_data {
    uint16_t temperature;
    uint16_t humidity;
    uint16_t brightness;
};

We now have a new data type that contains three integers arranged next to each other in memory. Let’s declare a variable of this new type and assign values to each of the struct’s fields.



struct sensor_data data;

data.temperature = 123;
data.humidity = 456;
data.brightness = 789;

Alternatively, the struct can be initialized directly while declaring it. C offers two different ways to do so: pretending it was an array or using designated initializers. Treating it like an array assigns each value to the sub-variable in the same order as the struct was defined. Designated initializers can be arbitrarily assigned by name. Once initialized, we can access each individual field the same way we just assigned values to it.



struct sensor_data array_style = {
    123, /* temperature */
    456, /* humidity */
    789  /* brightness */
};

struct sensor_data designated_initializers = {
    .humidity = 456,
    .temperature = 123,
    .brightness = 789
};

printf("Temperature: %d\n", array_style.temperature);
printf("Humidity: %d\n", array_style.humidity);
printf("Brightness: %d\n", array_style.brightness);

Notice how the fields in the designated initializers are not in their original order. We could even omit individual fields, in which case C initializes them to zero. This allows us to add fields to the struct later on without adjusting every place it was initialized before, unless of course we rename or remove a field.
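One detail worth spelling out is what happens to fields that a designated initializer leaves out: C sets them to zero rather than leaving them undefined. Here is a minimal sketch of that behavior; the helper name make_partial_reading is made up for this illustration.

```c
#include <stdint.h>

struct sensor_data {
    uint16_t temperature;
    uint16_t humidity;
    uint16_t brightness;
};

/* Hypothetical helper: only temperature is named in the initializer,
 * so humidity and brightness are implicitly initialized to zero. */
struct sensor_data make_partial_reading(uint16_t temperature)
{
    struct sensor_data data = { .temperature = temperature };
    return data;
}
```

Calling make_partial_reading(123) should give a struct with temperature 123 and both other fields at 0. This only holds when an initializer is present; a plain `struct sensor_data data;` inside a function still starts out with indeterminate contents.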


Bitfields


The bitfield is a special-case struct that lets us split up a portion of an integer into its own variable of arbitrary bit length. To stick with the sensor data example, let’s assume each sensor value is read by an analog-to-digital converter (ADC) with 10-bit resolution.


Storing the results in 16-bit integers will therefore waste 6 bits for each value, which is more than one third. Using bitfields will let us use a single 32-bit integer and split it up in three 10-bit variables instead, leaving only 2 bits unused altogether.



struct sensor_data_bitfield {
    uint32_t temperature:10;
    uint32_t humidity:10;
    uint32_t brightness:10;
};

We could also add a 2-bit wide fourth field to use the remaining space at no extra cost. And this is pretty much all there is to know about bitfields. Other than adding the bit length, bitfields are still just structs, and are therefore handled like any other regular struct. Some caution is required, though: the order in which the fields are packed into the underlying integer, and which integer types are allowed, are up to the compiler and target architecture.
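To make the bit widths a little more tangible, here is a small sketch that fills the bitfield from raw readings. The fourth field `status` and the helper pack_readings are invented for this example, and using uint32_t as a bitfield type is assumed to be accepted by your compiler (GCC and Clang do accept it).

```c
#include <stdint.h>

struct sensor_data_bitfield {
    uint32_t temperature:10;
    uint32_t humidity:10;
    uint32_t brightness:10;
    uint32_t status:2;      /* hypothetical field using the 2 spare bits */
};

/* Hypothetical helper that stores raw ADC readings in the bitfield.
 * The fields are unsigned, so a value that does not fit into 10 bits
 * is reduced modulo 1024 when it is assigned. */
struct sensor_data_bitfield pack_readings(uint32_t temperature,
                                          uint32_t humidity,
                                          uint32_t brightness)
{
    struct sensor_data_bitfield packed = {0};
    packed.temperature = temperature;
    packed.humidity = humidity;
    packed.brightness = brightness;
    return packed;
}
```

A reading of 1023 fits exactly, while 1029 would silently end up as 5, so it pays to range-check values before storing them in a bitfield.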


Unions


Which brings us to today’s often overlooked topic, the union. From the outside, they look and behave just like a struct, and are in fact declared, initialized and accessed the exact same way. So to turn our struct sensor_data into a union, we simply have to change the keyword and we are done.



union sensor_data {
    uint16_t temperature;
    uint16_t humidity;
    uint16_t brightness;
};

However, unlike a struct, the fields inside a union are not arranged in sequential order in memory, but are all located at the same address. So if a struct sensor_data variable starts at memory address 0x1000, the temperature field will be located at 0x1000, the humidity field at 0x1002, and the brightness field at address 0x1004, since each 16-bit field occupies two bytes. With a union, all three fields will be located at address 0x1000.


What this means in practice is easily shown once we assign values to all the fields like we did in the struct example earlier.



union sensor_data data;

data.temperature = 123;
data.humidity = 456;
data.brightness = 789;

printf("Temperature: %d\n", data.temperature);

Unlike the struct example, the value printed here won’t be the assigned value 123, but 789 instead. Since every field in the union shares the exact same memory location, any time one of the fields gets assigned a value, all other fields’ previously assigned values are overwritten. For this reason, it rarely makes sense to have fields with the same data type inside a union; the point is to mix different types together. Note that the data type sizes don’t need to match, so it’s no problem to have a union with, for example, a 32-bit and a single 8-bit integer: the 8-bit field simply overlaps one byte of the 32-bit field. The size of the union itself will be at least the size of its biggest field, so with a 32-bit and an 8-bit integer, the union will typically be 4 bytes in size.
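As a rough sketch of the mixed-size case, the following union pairs a 32-bit and an 8-bit integer. The names mixed_width and first_byte_of are made up for this illustration.

```c
#include <stdint.h>

union mixed_width {
    uint32_t wide;
    uint8_t narrow;
};

/* Hypothetical helper: store a 32-bit value and read it back through
 * the 8-bit field. The narrow field overlaps the byte at the union's
 * lowest address, which is the least significant byte on little-endian
 * systems and the most significant byte on big-endian ones. */
uint8_t first_byte_of(uint32_t value)
{
    union mixed_width u;
    u.wide = value;
    return u.narrow;
}
```

On common platforms sizeof(union mixed_width) equals sizeof(uint32_t), and first_byte_of(0x12345678) yields either 0x78 or 0x12 depending on the byte order.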


Using Unions


A union essentially gives one memory location different names and correspondingly different sizes. That might seem like a strange concept, but let’s see how that can be used to easily access different single bytes within a longer data type.



union data_bytes {
    uint32_t data;
    uint8_t bytes[4];
};

Here we have a 32-bit integer overlapping with an array of four 8-bit integers. If we assign a value to the 32-bit data field and read a single location from the bytes array, we can effectively extract each individual byte from the data field.



union data_bytes db;
db.data = 0x12345678;
printf("0x%02x\n", db.bytes[1]);


The actual output will depend on whether your processor architecture is little-endian or big-endian. Little-endian architectures will interpret the array index 1 as the integer’s second least significant byte 0x56, while big-endian architectures will interpret it as the integer’s second most significant byte 0x34.
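The same union can also tell us which kind of machine we are running on. The sketch below assumes the target is either plain little-endian or plain big-endian; the helper names is_little_endian and byte_by_significance are invented for this example.

```c
#include <stdint.h>

union data_bytes {
    uint32_t data;
    uint8_t bytes[4];
};

/* Returns 1 if the least significant byte is stored at the lowest
 * address (little-endian), 0 otherwise. */
int is_little_endian(void)
{
    union data_bytes probe;
    probe.data = 1;
    return probe.bytes[0] == 1;
}

/* Hypothetical helper: returns byte n of value, counted from the least
 * significant end, regardless of byte order. n must be in 0..3. */
uint8_t byte_by_significance(uint32_t value, unsigned n)
{
    union data_bytes db;
    db.data = value;
    return is_little_endian() ? db.bytes[n] : db.bytes[3 - n];
}
```

With this, byte_by_significance(0x12345678, 1) should give 0x56 on either kind of system, at the cost of a runtime check that a plain shift-and-mask would not need.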


The same principle used to extract a byte works also the other way around, and we can use unions to concatenate integers. Let’s consider a real world example involving the ATmega328’s analog-to-digital converter. The ADC has a 10-bit resolution, and looking at its registers, the converted value is stored in two separate 8-bit registers — ADCL and ADCH for the lower and higher byte respectively. A struct with two fields named after those two registers seems like a good choice for this, and since we also want the whole 10-bit value of the conversion, we’ll use the struct together with a 16-bit integer inside a union.



union adc_data {
    struct {
        uint8_t adcl;
        uint8_t adch;
    };
    uint16_t value;
};

As you can see, the struct has neither a type name nor a field name. Such an anonymous struct, standardized in C11, lets us access the fields inside the struct as if they were part of the union itself.



union adc_data adc;

/* read ADCL first: the ATmega328 then holds the result until ADCH is read */
adc.adcl = ADCL;
adc.adch = ADCH;

printf("0x%04x\n", adc.value);

Note that accessing the struct fields anonymously will only work as long as there are no name conflicts. If there are duplicate field names, the struct itself will require a field name. Once the struct has its own identifier, we can also add a type name to the struct itself, which lets us use it also outside the union.



union adc_data {
    struct register_map {
        uint8_t adcl;
        uint8_t adch;
    } registers;
    uint16_t value;
};

union adc_data adc;
struct register_map adc_registers;

/* read ADCL first: the ATmega328 then holds the result until ADCH is read */
adc.registers.adcl = ADCL;
adc.registers.adch = ADCH;
printf("0x%04x\n", adc.value);

Once the register values are stored in the struct fields, we can read the full value from the 16-bit `value` field. Of course, it doesn’t require a union to combine those two register values; we could also just use bitwise shifting and an OR operation:



uint8_t low = ADCL;   /* ADCL has to be read before ADCH */
uint8_t high = ADCH;
printf("0x%04x\n", (high << 8) | low);

Truth be told, there is actually nothing unique about unions. In whichever way you are using them, you could achieve the same with either bitwise operations or pointer casts. But that equivalence is exactly what makes them interesting.
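To compare the two approaches away from the hardware, here is a sketch that takes the two register bytes as plain parameters instead of reading ADCL and ADCH. The function names combine_with_union and combine_with_shift are made up for this example.

```c
#include <stdint.h>

union adc_data {
    struct {
        uint8_t adcl;
        uint8_t adch;
    } registers;
    uint16_t value;
};

/* Union variant: low and high stand in for the ADCL and ADCH reads.
 * The result only matches the shift variant on little-endian targets
 * (such as the AVR), where the low byte sits at the lower address. */
uint16_t combine_with_union(uint8_t low, uint8_t high)
{
    union adc_data adc;
    adc.registers.adcl = low;
    adc.registers.adch = high;
    return adc.value;
}

/* Shift variant: independent of byte order. */
uint16_t combine_with_shift(uint8_t low, uint8_t high)
{
    return (uint16_t)(((uint16_t)high << 8) | low);
}
```

For a low byte of 0x34 and a high byte of 0x02, the shift variant always gives 0x0234. The union variant gives the same on a little-endian system and 0x3402 on a big-endian one, which is worth keeping in mind if the code ever moves to another architecture.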


Shortcuts with Unions


Let’s have another look at the previous byte-extraction example and see what other options we have to get a single byte out of an integer. As we remember, we had a union with a 32-bit integer and an array of four 8-bit integers:



union data_bytes {
    uint32_t data;
    uint8_t bytes[4];
};

The most common way to extract parts of any value is combining bitwise shifts with an AND operation. In this particular case, however, we can also cast the 32-bit value to a series of 8-bit values. Well, let’s just implement all of these options and see what that looks like.



uint32_t value = 0x12345678;
union data_bytes db;
db.data = value;

// shift one byte to the right and extract the LSB
printf("0x%02x\n", (unsigned) ((value >> 8) & 0xff));
// cast to uint8_t pointer, access it as an array
printf("0x%02x\n", ((uint8_t *) &value)[1]);
// cast to uint8_t pointer, access via pointer arithmetic
printf("0x%02x\n", *(((uint8_t *) &value) + 1));
// simply take the union field
printf("0x%02x\n", db.bytes[1]);

Taking a closer look at the pointer casts, we basically tell the compiler that whatever is located at the memory address of the 32-bit value is in fact a collection of 8-bit values. Applying the same terminology to the union declaration, we tell it that whatever is located at the union’s memory address is either one 32-bit value or four 8-bit values, just like we do with the cast. The difference is that with a union, each access explicitly names which of those two types we mean. In a sense, unions provide a shortcut to data type conversions, while the declaration itself documents which interpretations of the data are intended. You could say that unions are to pointers what enums are to a bunch of preprocessor constants.


Looking into floating point numbers


Let’s have another example and explore floating-point numbers, IEEE 754 single-precision floating-point numbers to be precise — also known as a float. If you ever wondered what a float looks like to a CPU, just make it think it’s an integer. Obviously not in a “cast an int to float to remove the fraction part” way, but in a “raw IEEE 754 binary32 format” way.



union float_inspection {
    float floatval;
    uint32_t intval;
} fi;

float f = 65.65625;
fi.floatval = f;

printf("0x%08x\n", fi.intval);
// ..or then again with pointers (this cast violates C's strict aliasing
// rules, so treat it as a compiler-dependent trick)
printf("0x%08x\n", *((uint32_t *) &f));

Both will output 0x42835000, which won’t tell us much without thoroughly studying the binary32 format, a combination of a sign, exponent, and fraction value with standardized bit widths. Recalling the concept of a bitfield, we can extend the union with a struct that helps us take the binary32 format apart. The field order shown here matches how GCC lays out bitfields on little-endian targets; other compilers or architectures may need the reverse order. For completeness, the same data is also extracted with bitwise operations as a non-union alternative.



union float_inspection {
    float floatval;
    uint32_t intval;
    struct {
        uint32_t fraction:23;
        uint32_t exponent:8;
        uint32_t sign:1;
    };
} fi;

float f = 65.65625;
uint32_t i = *((uint32_t *) &f);
fi.floatval = f;

printf("%d %d 0x%x\n", fi.sign, fi.exponent, fi.fraction);
printf("%d %d 0x%x\n", (i >> 31), ((i >> 23) & 0xff), (i & 0x7fffff));


I’ll leave it for you to decide which option is clearer to read and easier to maintain. Either way, the output will give us a sign value 0, exponent 133, and the fraction 0x35000. Following the format’s definition, we can construct the initial floating point number 65.65625 back from it. So if you ever end up analyzing some raw data dump or binary blob and come across a floating point value, now you know how to use a union to find out what number it represents.
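For the step back from the three fields to the original number, here is a rough sketch. It assumes that float is a 32-bit IEEE 754 binary32 value and only handles normal numbers (no zero, subnormals, infinities, or NaN). It uses memcpy to get at the raw bits, which is a third option next to the union and the pointer cast and does not run into aliasing trouble. The function names float_bits and rebuild_from_fields are made up for this example.

```c
#include <stdint.h>
#include <string.h>

/* Copy the raw bytes of a float into a 32-bit integer.
 * Assumes float is 32 bits wide and uses the binary32 format. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Rebuild a normal binary32 value from its three fields:
 *   (-1)^sign * (1 + fraction / 2^23) * 2^(exponent - 127)
 * Zero, subnormals, infinities and NaN are not handled here. */
double rebuild_from_fields(uint32_t sign, uint32_t exponent, uint32_t fraction)
{
    double result = 1.0 + (double)fraction / 8388608.0;  /* 8388608 = 2^23 */
    int power = (int)exponent - 127;                     /* remove the bias */

    for (; power > 0; power--) {
        result *= 2.0;
    }
    for (; power < 0; power++) {
        result /= 2.0;
    }
    return sign ? -result : result;
}
```

Feeding in the values from above, rebuild_from_fields(0, 133, 0x35000) works out to (1 + 217088 / 8388608) * 2^6 = 1.02587890625 * 64 = 65.65625, which is the number we started with.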


That’s All Folks


There are two more things to worry about when using unions to peer inside other data types: endianness and alignment. Most computers and microcontrollers are little-endian, but watch out for Motorola 68k and AVR32 architectures, which are big-endian. For performance reasons, many processors also prefer data aligned on 2-byte or 4-byte boundaries, so the compiler may insert padding between struct fields: a uint8_t followed by a uint32_t, for instance, will often sit four bytes apart in memory. In GCC, you can use the packed and aligned attributes to control this behavior, but you may be subject to a speed penalty and it’s beyond the scope of this article.
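If you want to see what your own compiler does with alignment, offsetof from stddef.h reports where each field ends up. The following is only a sketch; the struct names and the helper padding_before_word are invented for this example, and the amount of padding depends on the target.

```c
#include <stddef.h>
#include <stdint.h>

/* Two single bytes in a row need no padding between them. */
struct two_bytes {
    uint8_t first;
    uint8_t second;
};

/* A byte followed by a 32-bit integer usually gets padding in between
 * so that the integer starts on an aligned address. */
struct byte_then_word {
    uint8_t flag;
    uint32_t word;
};

/* Hypothetical helper: number of padding bytes the compiler inserted
 * between flag and word. Often 3 on 32-bit and 64-bit targets, and 0
 * on 8-bit targets such as the AVR. */
size_t padding_before_word(void)
{
    return offsetof(struct byte_then_word, word) - sizeof(uint8_t);
}
```

When such a struct overlaps another field inside a union, the padding bytes are part of the overlap too, so it is worth checking the offsets before relying on a particular layout.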


This concludes our expedition into structs and unions. Hopefully we could give you some new insights and ideas of how to arrange your variables, and some convenient alternatives to handle them. Let us know if you can think of other ways to make use of all this, and in what peculiar ways you have used or come across unions before.