In computer science classes, we are taught that hash table lookups are O(1), and that this is as fast as it gets. However, in the world of high-performance applications, that "constant" makes an enormous difference. This post walks through the performance considerations I navigated to make an O(1) transposition table even faster by listening to the hardware rather than the textbook.

Context

Modern chess engines use a recursive DFS (Depth-First Search) to explore the move tree. Naturally, you encounter the exact same board position many times through different sequences of moves, a phenomenon known as a transposition. To avoid redundant work, engines use Transposition Tables (TT). These are essentially large caches that store the evaluation score of a previously searched position, keyed by a Zobrist hash of the board.

The Problem

For a performance-critical application like my chess engine, Typhon, std::unordered_map is a non-starter. The standard library implementation typically uses an array of pointers to nodes, leading to non-contiguous memory layouts, pointer chasing, and frequent cache misses. My solution was a direct-addressed flat hash table implemented as a contiguous array:

static std::array<TTEntry, TT_SIZE_MB / sizeof(TTEntry)> m_table;

The core of the optimization lies within the TTEntry struct itself:

struct MoveSkeleton {
    Square from;
    Square to;
    Piece piece;
    Piece promoPiece;
    MoveFlag flags;

    MoveSkeleton(const Move& m) {
        from       = m.getFrom();
        to         = m.getTo();
        piece      = m.getPieceType();
        promoPiece = m.getPromoPiece();
        flags      = m.getFlags();
    }
};

struct alignas(32) TTEntry {
    uint16_t partial_hash;
    Centipawns score;
    uint8_t depth;
    TTFlag flag;
    MoveSkeleton bestMove;

    TTEntry() : partial_hash(0), score(0), depth(0), flag(EXACT), bestMove(Move()) {}

    TTEntry(ZobristHash h, Centipawns s, int d, TTFlag f, Move m)
        : partial_hash(h & 0xFFFF), score(s), depth(d), flag(f), bestMove(m) {}
};

Note the alignas(32). This ensures that no TTEntry straddles a cache line boundary. By aligning to 32 bytes, every probe is guaranteed to require only one cache line read, avoiding the latency penalty of a split-load access.

I eventually realized I could shrink MoveSkeleton from 20 bytes to 5 bytes by using uint8_t type specifiers for my enums. Since each enum (Piece, Square, etc.) contained fewer than 256 values, an 8-bit integer was more than enough. This dropped the TTEntry size to 12 bytes, which I then aligned to 16 bytes. This allowed four entries to fit into a single 64-byte cache line. Higher density, fewer misses—big win, right?

The Paradox

To my surprise, profiling revealed that the "optimized" smaller version was significantly slower. Here are the results of a depth-12 search from a standard starting position:

16-byte TTEntry (Aligned):

1.164526166 +- 0.007964957 seconds time elapsed  ( +-  0.68% )

28-byte TTEntry (Unaligned):

1.271577627 +- 0.002913635 seconds time elapsed  ( +-  0.23% )

32-byte TTEntry (Aligned):

1.024808752 +- 0.002325387 seconds time elapsed  ( +-  0.23% )

The data shows a clear 13% performance delta between the 16-byte version and the 32-byte version. The only difference? Removing the : uint8_t specifier and letting the enums default to their natural 4-byte integer size.

Analysis: Alignment vs. Density

This result seems paradoxical. Conventional wisdom suggests that smaller data is faster because it increases cache density. However, in high-performance systems, Mechanical Sympathy often outweighs raw density.

1. Word-Sized Register Efficiency

Modern CPUs are natively optimized for 32-bit and 64-bit "words." When we force a variable into a uint8_t, the compiler must often generate extra instructions to mask registers or shift bits to isolate the data. By using standard int enums, we provide data in the CPU's "natural language," allowing the ALU to process comparisons and assignments along the fastest possible path.

2. The Cost of Misalignment

Tightly packing 1-byte enums next to 2-byte or 4-byte fields often creates Misaligned Access. If a multi-byte member starts at an odd address, the CPU may require multiple memory cycles to fetch it. In the 32-byte version, members naturally sit on 4-byte or 8-byte boundaries. The CPU can "blast" these into registers in a single clock cycle.

Conclusion

A 13 increase perforrmance proved that for Typhon, Instruction-Level Efficiency was the primary bottleneck, not cache misses. The takeaway for performance-critical C++: Measure, don't guess. Optimization isn't a checklist of "best practices"; it is a constant conversation with the hardware. Sometimes, the fastest way to move forward is to give the CPU a bit more room to breathe.