The hash function - currently SHA-1 or MD5 - is evaluated by the FPGA in a fully unrolled, fully pipelined manner. For SHA-1 this requires over 320 32-bit adders, consuming nearly all logic resources of the XC2VP20 device. However, SHA-1 also requires a key expansion operation. A direct implementation of it ends up as a 512-bit-wide, 80-stage-deep pipeline, which exceeds the device capacity.
The implementation used in NSA@home takes advantage of the linearity of the key expansion block. The only operations it uses are bitwise XORs and bit rotations. Therefore, changing any bit in the 512-bit input vector toggles some bits in the expanded keys, and the exact pattern of toggled bits depends only on the index of the changed input bit, not on the values of the other bits.
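The linearity argument above can be checked in software. The following is a minimal sketch of the standard SHA-1 key expansion (message schedule), not the NSA@home hardware design. The helper names (expand, change_vector, flip_bit, xor_words) and the bit numbering (bit 0 is the least significant bit of word 0) are assumptions made for illustration.

```python
MASK = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK

def expand(block):
    """SHA-1 key expansion: 16 input words -> 80 expanded words.

    Only XORs and a 1-bit rotation are used, so the mapping is linear
    over GF(2).
    """
    w = list(block)
    for t in range(16, 80):
        w.append(rotl32(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
    return w

def change_vector(bit):
    """Pattern of expanded-key bits toggled by flipping one input bit.

    Assumed numbering: bit 0 is the LSB of word 0, bit 511 the MSB of word 15.
    """
    delta = [0] * 16
    delta[bit // 32] = 1 << (bit % 32)
    return expand(delta)

def flip_bit(block, bit):
    """Return a copy of the 16-word block with one input bit flipped."""
    out = list(block)
    out[bit // 32] ^= 1 << (bit % 32)
    return out

def xor_words(a, b):
    """Word-wise XOR of two equally long lists."""
    return [x ^ y for x, y in zip(a, b)]
```

For any block and any bit index, expand(flip_bit(block, bit)) equals xor_words(expand(block), change_vector(bit)), so the change vector can be precomputed once per bit index.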
Since the SHA-1 core itself does not consume any Block RAM, the Block RAMs are free to hold the change vectors as lookup tables: one 512x32 table for each SHA-1 round. The outputs of those tables are XORed into the 32-bit expanded key registers. Since Xilinx Virtex-II Block RAM can provide two read ports, up to two bit changes per cycle are acceptable - a single change would not be enough to process some alphabets efficiently (without losing cycles). The generator block is responsible for enforcing this limit of at most two bit changes per cycle.
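The table scheme can be modeled in software roughly as follows. This is a simplified sketch, not the FPGA design: it updates all 80 expanded key words at once and ignores pipelining, and the names build_tables and apply_changes, as well as the bit numbering (bit 0 is the least significant bit of word 0), are assumptions for illustration.

```python
MASK = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK

def expand(block):
    """SHA-1 key expansion: 16 input words -> 80 expanded words."""
    w = list(block)
    for t in range(16, 80):
        w.append(rotl32(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
    return w

def build_tables():
    """Build one 512-entry table of 32-bit change words per round.

    tables[t][bit] is the XOR pattern applied to expanded key word t
    when input bit `bit` flips.
    """
    tables = [[0] * 512 for _ in range(80)]
    for bit in range(512):
        delta = [0] * 16
        delta[bit // 32] = 1 << (bit % 32)
        for t, word in enumerate(expand(delta)):
            tables[t][bit] = word
    return tables

def apply_changes(w, tables, changed_bits):
    """Update the expanded keys in place for one 'cycle'.

    At most two changed bits are allowed, mirroring the two read ports
    of the Block RAM described above.
    """
    if len(changed_bits) > 2:
        raise ValueError("at most two bit changes per cycle")
    for t in range(80):
        for bit in changed_bits:
            w[t] ^= tables[t][bit]
```

Starting from w = expand(block), calling apply_changes(w, tables, [5, 300]) leaves w equal to the expansion of the block with input bits 5 and 300 flipped, without re-running the key expansion.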
For MD5, the implementation is more straightforward, except that the pipeline stage boundaries are shifted from where one would expect them. The reason is that with the "natural" boundaries, each pipeline stage contains an add-rol-add path. The rotation makes the signal take a longer path through the second adder, which degrades timing by about 2 ns.
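To make the add-rol-add path concrete, here is a software sketch of the standard MD5 step with the loop regrouped so that the "stage" boundary falls right after the rotation. This is only one possible placement, chosen for illustration; the text does not say where the actual design puts its registers. The function names (mix, compress_shifted, md5_single_block) are hypothetical, and the single-block helper only handles messages shorter than 56 bytes.

```python
import math
import struct

MASK = 0xFFFFFFFF
SHIFTS = ([7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4 +
          [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4)
CONSTS = [int(abs(math.sin(i + 1)) * 2**32) & MASK for i in range(64)]

def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK

def mix(i, b, c, d):
    """Boolean function value and message word index for step i."""
    if i < 16:
        return (b & c) | (~b & d), i
    if i < 32:
        return (d & b) | (~d & c), (5 * i + 1) % 16
    if i < 48:
        return b ^ c ^ d, (3 * i + 5) % 16
    return c ^ (b | (~d & MASK)), (7 * i) % 16

def compress_shifted(state, m):
    """MD5 compression with each step split after the rotation.

    A "natural" stage would compute add, rol, add in one go. Here each
    loop iteration first finishes the previous step's final add, then
    does this step's adds and rotation, so no add follows the rotation
    within one iteration.
    """
    a, b, c, d = state
    rot = None
    for i in range(64):
        if rot is not None:
            # Final add of the previous step, then the register rotation.
            a, b, c, d = d, (b + rot) & MASK, b, c
        f, g = mix(i, b, c, d)
        rot = rotl32((a + f + CONSTS[i] + m[g]) & MASK, SHIFTS[i])
        # Assumed pipeline register position: rot, a, b, c, d are latched here.
    a, b, c, d = d, (b + rot) & MASK, b, c
    return [(x + y) & MASK for x, y in zip(state, (a, b, c, d))]

def md5_single_block(msg):
    """MD5 digest of a message shorter than 56 bytes (one block only)."""
    if len(msg) >= 56:
        raise ValueError("sketch handles single-block messages only")
    padded = (msg + b"\x80" + b"\x00" * (55 - len(msg))
              + struct.pack("<Q", 8 * len(msg)))
    m = struct.unpack("<16I", padded)
    init = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476]
    return struct.pack("<4I", *compress_shifted(init, m))
```

Regrouping the operations does not change the result, so md5_single_block(b"abc") can be compared against any reference MD5 implementation. In hardware, the difference is only in which operations share a clock cycle.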
(c) 2007 Stanislaw Skowronek
Contact me: nsa unaligned org (figure out where to put the @ and .)