The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.
Donald Knuth

Background

More often than not these days, I tend to work at the architecture/infrastructure level. On occasion, however, I’ve had the good fun of being involved in a project which involved lots of low-level optimizations - cycles of gcc, gprof, vim (sometimes, this is just lovely). During one such project, it was discovered that the bulk of the userspace CPU time for some socket code was spent int calls to the C tolower function. This post details a neat hack I was able to leverage to reduce the CPU overhead in that portion of the code.

The overwhelming odds are: this trick will not be practically useful in most real-world projects. However, it is fun! Fun counts too.

Introducing: fast_tolower - a neat trick for converting entire strings to lower case with minimal branching and CPU usage.

Fast Single-Character Conversions

So there’s this weird trick we can do to determine if a single 8-bit value (c) lies within a range. That trick is as follows:

define LOW and HIGH to integer values defining the range (inclusive)
subtract the character value, c, from LOW
subtract the character value, c, from HIGH
XOR the difference

Disclaimer: first made aware of this trick here. ¹

After performing this operation, we have the property that the top bit of the resulting number will be set if LOW < c <= HIGH (other bits may also be set, but we don’t care). Next, we observe that the difference between upper and lowercase ascii characters is (conveniently!) 32, i.e. a lowercase letter is equal to its uppercase equivalent with the 6th bit set (e.g. 'a' == 'A' | 0x20). Using these two observations, we can flip an uppercase letter to lowercase without branching, like so:

/* Set c to some character value: */
char c = 'D'; // Any value works!

/* Here, the top bit is set if 0x40 ('A'-1) <= c <= 0x5a ('Z')
 * We shift two places to the right so that we can leverage the bit
 * that's been conditionally set as a case-toggle and mask using 0x20
 * because we don't *care* about any of the other bits set in the XOR: */
uint8_t mask = (((0x40 - c) ^ (0x5a - c)) >> 2) & 0x20;

/* Now, we conditionally flip the 6th bit - changing to lowercase! */
c = c ^ mask;

Extending to Multiple Characters

An interesting optimization presents itself, given the above single-byte hack. Why operate on one character at a time, if we can extend the high, low, and mask values to be multi-byte? This is precisely what we do below.

FAST_TOLOWER_STRIDE is set to the number of bytes in our systems largest supported unsigned integer type. Using this information, we define MASK_S, HIGH_S, and LOW_S to be the multi-character equivalent of the values used in the discussion above, i.e. instead of checking one character at a time, we cant test 2, 4, or 8 characters at a time.

Here’s an example of the same algorithm with 16 bits:

/* Here, c2 is the two chars, 'A' and 'B',
 * next to each other in mem:
 */
uint16_t c2 = 'B' | (uint16_t)('A' << 8);

/* Perform the same trick at 16-bits: */
uint16_t mask = (((0x4040 - c2) ^ (0x5a5a - c2)) >> 2) & 0x2020;
c2 = c ^ mask;

/* Lowercased! */
assert( c2 == ('b' | (uint16_t)('a' << 8)) );

Once we have this information, we can cast the input and output character arrays to a suitably large integer type and iterate in strides the size of the integer type, instead of byte-by-byte.

Determining the “Stride”

If you’re working in C11, the best approach is probably to determine FAST_TOLOWER_STRIDE, using a modified version of Hallvard B. Furuseth’s IMAX_BITS macro (more info here):

/* A modified version of Hallvard B. Furuseth's IMAX_BITS
 * which, given the maximum value for an unsigned type,
 * yields the type's width, in bytes:
 */
#define WIDTH_FROM_MAX_VAL(m) \
    (((m)/((m)%255+1) / 255%255*8 + 7-86/((m)%255+12)) >> 3)
#define FAST_TOLOWER_STRIDE \
    WIDTH_FROM_MAX_VAL(UINTMAX_MAX)

Thereafter, we can just use uintmax_t from <stdint.h> as our unsigned type.

No uintmax_t?

In some cases, simply setting FAST_TOLOWER_STRIDE to __SIZEOF_SIZE_T__ is a relatively safe approach. On many systems, sizeof(size_t) == sizeof(uintmax_t). However, this is not guaranteed to be the case and could break things (for older C versions or novel hardware, autoconf is your friend!).

The Remainder

Many strings will end up not being neatly divisible by our maximum stride. After we’ve run through as many characters as we can at maximum stride, we use a switch with intentional fallthrough to convert the remaining values on a byte-by-byte basis without looping², like so:

/* Iterate byte-by-byte until we achieve
 * proper alignment for uintmax_t:
 */
switch( align ) {
#if FAST_TOLOWER_STRIDE == 8
    case 7: FAST_CHAR_TOLOWER(dst, src, c, mask)
    case 6: FAST_CHAR_TOLOWER(dst, src, c, mask)
    case 5: FAST_CHAR_TOLOWER(dst, src, c, mask)
    case 4: FAST_CHAR_TOLOWER(dst, src, c, mask)
#endif /* FAST_TOLOWER_STRIDE == 8 */
#if FAST_TOLOWER_STRIDE >= 4
    case 3: FAST_CHAR_TOLOWER(dst, src, c, mask)
    case 2: FAST_CHAR_TOLOWER(dst, src, c, mask)
#endif /* FAST_TOLOWER_STRIDE >= 4 */
    case 1: FAST_CHAR_TOLOWER(dst, src, c, mask)
    case 0:
    default:
        break;
};

A Note on Alignment

Simply converting the input string pointers to integer pointer types is the obvious approach. However, on architectures with strict alignment this may cause the program to abort. Many modern systems will allow you to violate strict alignment rules, but as a result the machine generates extra instructions to shift data across word boundaries before the computations are performed.

To avoid this penalty, we determine how many bytes away from a properly aligned byte boundary the input pointer is and handle this many bytes - one at a time - ahead of time, until we reach the integer boundary.

Afterwards, we proceed with the integer-stride conversions and finish out the computation by iterating over the remaining bytes (if any) byte-by-bye once more.

(You can see the full code here).

Benchmarks and Disclaimers

(Just for fun!)

I’m not sure it’s really even fair to call these “benchmarks”; they’re just the output of a toy program in this repo. (In practice, you’d want to test the thing under real-world circumstances and gather some profiling data).

Disclaimers

This approach:

May not be portable.
Is almost certainly unnecessary.

Local Results

Running locally, I get the following results with:

string length: 607
number of iterations: 1,000,000

Strategy	Time (s)	Notes
`tolower`	`1.84s`	Byte-wise iteration, invoking `tolower` on each char.
`slicker_tolower`	`0.14s`	Byte-wise iteration, invoking the single-char version of the algorithm.
`fast_tolower`	`0.10s`	Stride-wise iteration, invoking the 64-bit version, four bytes at a time.

Invocation and Output

(ins)-▹ FAST_CFLAGS="-O3 -Wall -Wno-format -march=native" make
cc -O3 -Wall -Wno-format -march=native ./main.c -o benchmark

(ins)-▹ ./benchmark 1000000

Running test with:
    Max string length: 607
    Number of iterations: 1000000
    Stride: 8 bytes
Timing naive tolower...1.837373s
Timing slicker_tolower...0.137536s
Timing fast tolower...0.098518s

Faster tolower in C