The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.
Donald Knuth
Background
More often than not these days, I tend to work at the architecture/infrastructure
level. On occasion, however, I’ve had the good fun of being involved in
a project which involved lots of low-level optimizations - cycles of gcc
,
gprof
, vim
(sometimes, this is just lovely).
During one such project, it was discovered that the bulk of the userspace CPU
time for some socket code was spent int calls to the C tolower
function.
This post details a neat hack I was able to leverage to reduce the CPU overhead
in that portion of the code.
The overwhelming odds are: this trick will not be practically useful in most real-world projects. However, it is fun! Fun counts too.
Introducing: fast_tolower
- a
neat trick for converting entire strings to lower case with minimal branching
and CPU usage.
Fast Single-Character Conversions
So there’s this weird trick we can do to determine if a single 8-bit value
(c
) lies within a range. That trick is as follows:
- define
LOW
andHIGH
to integer values defining the range (inclusive) - subtract the character value,
c
, fromLOW
- subtract the character value,
c
, fromHIGH
- XOR the difference
Disclaimer: first made aware of this trick here. 1
After performing this operation, we have the property that the top bit of the
resulting number will be set if LOW < c <= HIGH
(other bits may also
be set, but we don’t care).
Next, we observe that the difference between upper and lowercase ascii
characters is (conveniently!) 32
, i.e. a lowercase letter is equal to
its uppercase equivalent with the 6th bit set (e.g. 'a' == 'A' | 0x20
).
Using these two observations, we can flip an uppercase letter to lowercase
without branching, like so:
/* Set c to some character value: */
char c = 'D'; // Any value works!
/* Here, the top bit is set if 0x40 ('A'-1) <= c <= 0x5a ('Z')
* We shift two places to the right so that we can leverage the bit
* that's been conditionally set as a case-toggle and mask using 0x20
* because we don't *care* about any of the other bits set in the XOR: */
uint8_t mask = (((0x40 - c) ^ (0x5a - c)) >> 2) & 0x20;
/* Now, we conditionally flip the 6th bit - changing to lowercase! */
c = c ^ mask;
Extending to Multiple Characters
An interesting optimization presents itself, given the above single-byte hack. Why operate on one character at a time, if we can extend the high, low, and mask values to be multi-byte? This is precisely what we do below.
FAST_TOLOWER_STRIDE
is set to the number of bytes in our systems largest
supported unsigned integer type. Using this information, we define MASK_S
,
HIGH_S
, and LOW_S
to be the multi-character equivalent of the values used
in the discussion above, i.e. instead of checking one character at a time,
we cant test 2, 4, or 8 characters at a time.
Here’s an example of the same algorithm with 16 bits:
/* Here, c2 is the two chars, 'A' and 'B',
* next to each other in mem:
*/
uint16_t c2 = 'B' | (uint16_t)('A' << 8);
/* Perform the same trick at 16-bits: */
uint16_t mask = (((0x4040 - c2) ^ (0x5a5a - c2)) >> 2) & 0x2020;
c2 = c ^ mask;
/* Lowercased! */
assert( c2 == ('b' | (uint16_t)('a' << 8)) );
Once we have this information, we can cast the input and output character arrays to a suitably large integer type and iterate in strides the size of the integer type, instead of byte-by-byte.
Determining the “Stride”
If you’re working in C11, the best approach is probably to determine
FAST_TOLOWER_STRIDE
, using a modified version of Hallvard B. Furuseth’s
IMAX_BITS
macro (more info here):
/* A modified version of Hallvard B. Furuseth's IMAX_BITS
* which, given the maximum value for an unsigned type,
* yields the type's width, in bytes:
*/
#define WIDTH_FROM_MAX_VAL(m) \
(((m)/((m)%255+1) / 255%255*8 + 7-86/((m)%255+12)) >> 3)
#define FAST_TOLOWER_STRIDE \
WIDTH_FROM_MAX_VAL(UINTMAX_MAX)
Thereafter, we can just use uintmax_t
from <stdint.h>
as our unsigned type.
No uintmax_t?
In some cases, simply setting FAST_TOLOWER_STRIDE
to __SIZEOF_SIZE_T__
is a relatively safe approach. On many systems, sizeof(size_t) == sizeof(uintmax_t)
.
However, this is not guaranteed to be the case and could break things (for
older C versions or novel hardware, autoconf
is your friend!).
The Remainder
Many strings will end up not being neatly divisible by our maximum stride. After we’ve run through as many characters as we can at maximum stride, we use a switch with intentional fallthrough to convert the remaining values on a byte-by-byte basis without looping2, like so:
/* Iterate byte-by-byte until we achieve
* proper alignment for uintmax_t:
*/
switch( align ) {
#if FAST_TOLOWER_STRIDE == 8
case 7: FAST_CHAR_TOLOWER(dst, src, c, mask)
case 6: FAST_CHAR_TOLOWER(dst, src, c, mask)
case 5: FAST_CHAR_TOLOWER(dst, src, c, mask)
case 4: FAST_CHAR_TOLOWER(dst, src, c, mask)
#endif /* FAST_TOLOWER_STRIDE == 8 */
#if FAST_TOLOWER_STRIDE >= 4
case 3: FAST_CHAR_TOLOWER(dst, src, c, mask)
case 2: FAST_CHAR_TOLOWER(dst, src, c, mask)
#endif /* FAST_TOLOWER_STRIDE >= 4 */
case 1: FAST_CHAR_TOLOWER(dst, src, c, mask)
case 0:
default:
break;
};
A Note on Alignment
Simply converting the input string pointers to integer pointer types is the obvious approach. However, on architectures with strict alignment this may cause the program to abort. Many modern systems will allow you to violate strict alignment rules, but as a result the machine generates extra instructions to shift data across word boundaries before the computations are performed.
To avoid this penalty, we determine how many bytes away from a properly aligned byte boundary the input pointer is and handle this many bytes - one at a time - ahead of time, until we reach the integer boundary.
Afterwards, we proceed with the integer-stride conversions and finish out the computation by iterating over the remaining bytes (if any) byte-by-bye once more.
(You can see the full code here).
Benchmarks and Disclaimers
(Just for fun!)
I’m not sure it’s really even fair to call these “benchmarks”; they’re just the output of a toy program in this repo. (In practice, you’d want to test the thing under real-world circumstances and gather some profiling data).
Disclaimers
This approach:
- May not be portable.
- Is almost certainly unnecessary.
Local Results
Running locally, I get the following results with:
- string length:
607
- number of iterations:
1,000,000
Strategy | Time (s) | Notes |
---|---|---|
tolower |
1.84s |
Byte-wise iteration, invoking tolower on each char. |
slicker_tolower |
0.14s |
Byte-wise iteration, invoking the single-char version of the algorithm. |
fast_tolower |
0.10s |
Stride-wise iteration, invoking the 64-bit version, four bytes at a time. |
Invocation and Output
(ins)-▹ FAST_CFLAGS="-O3 -Wall -Wno-format -march=native" make
cc -O3 -Wall -Wno-format -march=native ./main.c -o benchmark
(ins)-▹ ./benchmark 1000000
Running test with:
Max string length: 607
Number of iterations: 1000000
Stride: 8 bytes
Timing naive tolower...1.837373s
Timing slicker_tolower...0.137536s
Timing fast tolower...0.098518s
Further Reading
If you got a kick out of this, I strongly encourage you to peruse:
- Stanford’s Bit Twiddling Hacks page!
- Eric S. Raymond’s The Lost Art of Structure Packing
- And of course: Duff’s Device
-
…I think so, anyway. All I know for sure is that I didn’t concoct the single-character version of this bit magic myself. ↩
-
a sort of homage to Duff’s Device ↩