Comparing Hardware Performance of Fourteen Round Two SHA-3 Candidates Using FPGAs
Authors: E. Homsirikamol, M. Rogawski, and K. Gaj
Abstract
Performance in hardware has been demonstrated to be an important factor in the evaluation of candidates for cryptographic standards. To date, no consensus exists on how such an evaluation should be performed in order to make it fair, transparent, practical, and acceptable for the majority of the cryptographic community. In this report, we formulate a proposal for a fair and comprehensive evaluation methodology, and apply it to the comparison of hardware performance of 14 Round 2 SHA-3 candidates. The most important aspects of our methodology include the definition of clear performance metrics, the development of a uniform and practical interface, the generation of multiple sets of results for several representative FPGA families from two major vendors, and the application of a simple procedure to convert multiple sets of results into a single ranking. The VHDL codes for the 256-bit and 512-bit variants of all 14 SHA-3 Round 2 candidates and the old standard SHA-2 have been developed and thoroughly verified. These codes have then been used to evaluate the relative performance of all aforementioned algorithms using seven modern families of Field Programmable Gate Arrays (FPGAs) from two major vendors, Xilinx and Altera. All algorithms have been evaluated using four performance measures: the throughput to area ratio, throughput, area, and the execution time for short messages. Based on these results, the 14 Round 2 SHA-3 candidates have been divided into several groups depending on their overall performance in FPGAs.

Chapter 1 Introduction and Motivation

Starting from the Advanced Encryption Standard (AES) contest organized by NIST in 1997-2000 [1], open contests have become a method of choice for selecting cryptographic standards in the U.S. and around the world. The AES contest in the U.S. was followed by the NESSIE competition in Europe [2], CRYPTREC in Japan, and eSTREAM in Europe [3].
Four typical criteria taken into account in the evaluation of candidates are: security, performance in software, performance in hardware, and flexibility. While security is commonly recognized as the most important evaluation criterion, it is also the measure that is most difficult to evaluate and quantify, especially during the relatively short period of time reserved for the majority of contests. A typical outcome is that, after eliminating a fraction of candidates based on security flaws, a significant number of remaining candidates fail to demonstrate any easy-to-identify security weaknesses, and as a result are judged to have adequate security. Performance in software and hardware are next in line to clearly differentiate among the candidates for a cryptographic standard. Interestingly, the differences among the cryptographic algorithms in terms of hardware performance seem to be particularly large, and often serve as a tiebreaker when other criteria fail to identify a clear winner. For example, in the AES contest, the difference in hardware speed between the two fastest final candidates (Serpent and Rijndael) and the slowest one (Mars) was by a factor of seven [1][4]; in the eSTREAM competition, the spread of results among the eight top candidates qualified to the final round was by a factor of 500 in terms of speed (Trivium x64 vs. Pomaranch), and by a factor of 30 in terms of area (Grain v1 vs. Edon80) [5][6].

At this point, the attention of the entire cryptographic community is focused on the SHA-3 contest for a new hash function standard, organized by NIST [7][8]. The contest is now in its second round, with 14 candidates remaining in the competition. The evaluation is scheduled to continue until the second quarter of 2012.
In spite of the progress made during previous competitions, no clear and commonly accepted methodology exists for comparing hardware performance of cryptographic algorithms [9]. The majority of the reported evaluations have been performed on an ad-hoc basis, and focused on one particular technology and one particular family of hardware devices. Other pitfalls included the lack of a uniform interface, performance metrics, and optimization criteria. These pitfalls are compounded by the different skills of designers, the use of two different hardware description languages, and the lack of a clear way of compressing multiple results into a single ranking.

In this paper, we address all the aforementioned issues, and propose a clear, fair, and comprehensive methodology for comparing hardware performance of SHA-3 candidates and any future algorithms competing to become a new cryptographic standard. Our methodology is based on the use of FPGA devices from various vendors. The advantages of using FPGAs for comparison include short development time, wide availability of tools, and a limited number of vendors dominating the market.

The hardware evaluation of SHA-3 candidates started shortly after the announcement of the specifications and reference software implementations of the 51 algorithms submitted to the contest [7][8][10]. The majority of initial comparisons were limited to fewer than five candidates, and their results have been published at [10]. More comprehensive efforts became feasible only after NIST's announcement, in July 2009, of the 14 candidates qualified to the second round of the competition. Since then, two comprehensive studies have been reported in the Cryptology ePrint Archive [11][12]. The first, from the University of Graz, focused on ASIC technology; the second, from two institutions in Japan, focused on the use of the FPGA-based SASEBO-GII board from AIST, Japan.
Although both studies generated quite comprehensive results for their respective technologies, they did not fully address the issue of a uniform methodology that could be accepted and used by a larger number of research teams. Our study is intended to fill this gap, and to put forward a proposal that can be evaluated and commented on by the larger cryptographic community.

Chapter 2 Methodology

2.1 Choice of a Language, FPGA Devices, and Tools

Out of the two major hardware description languages used in industry, VHDL and Verilog HDL, we chose VHDL. We believe that either of the two languages is perfectly suited for the implementation and comparison of SHA-3 candidates, as long as all candidates are described in the same language. Using two different languages to describe different candidates may introduce an undesired bias into the evaluation.

FPGA devices from two major vendors, Xilinx and Altera, dominate the market with about 90% of the market share. We therefore feel that it is appropriate to focus on FPGA devices from these two companies. In this study, we have chosen to use seven families of FPGA devices from Xilinx and Altera. These families include two major groups: those optimized for minimum cost (Spartan 3 from Xilinx, and Cyclone II and III from Altera) and those optimized for high performance (Virtex 4 and 5 from Xilinx, and Stratix II and III from Altera). Within each family, we use devices with the highest speed grade and the largest number of pins.

As CAD tools, we have selected tools developed by the FPGA vendors themselves: Xilinx ISE Design Suite v. 11.1 (including Xilinx XST, used for synthesis) and Altera Quartus II v. 9.1 Subscription Edition Software.

2.2 Performance Metrics for FPGAs

Choosing proper performance metrics for the implementation of hash functions (or any other cryptographic transformations) using FPGAs is a non-trivial task, and no clear consensus exists so far on how these metrics should be defined.
Below we summarize our proposed approach, which we applied in our study.

Speed. In order to characterize the speed of the hardware implementation of a hash function, we suggest using Throughput, understood as the throughput (number of input bits processed per unit of time) for long messages. To be exact, we define Throughput using the following formula:

$$\text{Throughput} = \frac{\text{block size}}{T \cdot (\text{HTime}(N+1) - \text{HTime}(N))} \tag{2.1}$$

where block size is the message block size, characteristic for each hash function, HTime(N) is the total number of clock cycles necessary to hash an N-block message, and T is the clock period, different and characteristic for each hardware implementation of a specific hash function. Throughput defined this way is typically independent of N (and thus of the size of the message), as in all hash function architectures we have investigated so far, the expression HTime(N + 1) − HTime(N) is a constant that corresponds to the number of clock cycles between the processing of two subsequent input blocks. The effective throughput for short messages is always smaller, and is expressed by the formula

$$\text{Throughput}_{\text{eff}} = \frac{N \cdot \text{block size}}{T \cdot \text{HTime}(N)} \tag{2.2}$$

In this paper, we provide the exact formulas for HTime(N) for each SHA-3 candidate (see Table 4.2), and the values of f = 1/T for each algorithm–FPGA device pair (see Tables 4.8 and 4.9). Therefore, we provide sufficient information to calculate and compare the values of the effective throughput for each specific message size that may be of interest in a given application.

For short messages, it is more important to evaluate the total time required to process a message of a given size (rather than the throughput). The size of the message can be chosen depending on the requirements of an application.
For example, in the eBASH study of software implementations of hash functions, execution times for all sizes of messages, from 0 bytes (empty message) to 4096 bytes, are reported, and five specific sizes (8, 64, 576, 1536, and 4096 bytes) are featured in the tables [13]. The generic formulas we include in this paper (see Table 4.2) allow the calculation of the execution times for any message size. In order to characterize the capability of a given hash function implementation for processing short messages, we present in this study the comparison of execution times for an empty message (one block of data after padding) and a 100-byte (800-bit) message before padding (which becomes equivalent to 1024 bits after padding for the majority, but not all, of the investigated functions). To be exact, our parameters are defined as follows:

$$T_{\text{empty}} = T \cdot \text{HTime}(1) \tag{2.3}$$

$$T_{\text{100B}} = T \cdot \text{HTime}\left(\frac{\text{padlen}(800)}{\text{block size}}\right) \tag{2.4}$$

where padlen(800) denotes the size of an 800-bit message after padding.

Resource Utilization/Area. Resource utilization is particularly difficult to compare fairly in FPGAs, and is often a source of various evaluation pitfalls. First, the basic programmable block (such as the CLB slice in Xilinx FPGAs) has a different structure and different capabilities in FPGA families from different vendors. For example, in Virtex 5, a CLB slice includes four 6-input Look-Up Tables (LUTs); in Spartan 3 and Virtex 4, a CLB slice includes two 4-input LUTs. In Cyclone II and Cyclone III, the basic programmable block is called a Logic Element (LE); in Stratix II and III, the basic programmable component has a different structure and is called an ALUT (Adaptive Look-Up Table). Taking this issue into account, we suggest avoiding any comparisons across family lines. Secondly, all modern FPGAs include multiple dedicated resources, which can be used to implement specific functionality.
These resources include Block RAMs (BRAMs), multipliers (MULs), and DSP units in Xilinx FPGAs, and memory blocks, multipliers, and DSP units in Altera FPGAs. In order to implement a specific operation, some of these resources may be interchangeable, but there is no clear conversion factor to express one resource in terms of another. Therefore, we suggest, in the general case, treating resource utilization as a vector, with coordinates specific to a given FPGA family. For example,

$$\text{Resource Utilization}_{\text{Spartan3}} = (\#\text{CLBslices}, \#\text{BRAMs}, \#\text{MULs}) \tag{2.5}$$

$$\text{Resource Utilization}_{\text{CycloneIII}} = (\#\text{LE}, \#\text{memory bits}, \#\text{MULs}) \tag{2.6}$$

Taking into account that vectors cannot be easily compared to each other, we have decided to opt out of using any dedicated resources in the hash function implementations used for our comparison. Thus, all coordinates of our vectors other than the first one have been forced (by choosing appropriate options of the synthesis and implementation tools) to be zero. This way, our resource utilization (further referred to as Area) is characterized using a single number, specific to the given family of FPGAs, namely the number of CLB slices (#CLBslices) for Xilinx FPGAs, the number of Logic Elements (#LE) for Cyclone II and Cyclone III, and the number of Adaptive Look-Up Tables (#ALUT) for Stratix II and Stratix III.

The resource utilization vector in FPGAs (or even its simplified one-coordinate form, referred to as Area above) cannot be easily translated to an equivalent area or number of transistors in ASICs. Attempts to define a resource utilization unit that would apply to both technologies (such as an equivalent logic gate) have been mostly unsuccessful, and of limited value in practice. The only common denominator is cost, but unfortunately the prices of integrated circuits, and FPGAs in particular, are not commonly available, and are affected by multiple non-technical factors (including the number of units ordered, the relationship between companies, etc.).
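The timing metrics of Eqs. (2.1) to (2.4) can be illustrated with a short Python sketch. It is not taken from the paper: the class `HashCore`, the helper `padlen_md`, the linear form assumed for HTime(N), and all numeric values are hypothetical stand-ins for the per-algorithm formulas and clock frequencies that the paper lists in its tables. The padding helper follows the SHA-256 rule (a single 1 bit, zero bits, and a 64-bit length field), which is one example of a function for which 800 bits pad to 1024 bits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HashCore:
    """Hypothetical timing model of one hash core on one FPGA device."""
    block_size: int          # message block size in bits
    clock_period_ns: float   # T, clock period in nanoseconds
    overhead_cycles: int     # assumed fixed cycles per message
    cycles_per_block: int    # assumed cycles between two subsequent blocks

    def htime(self, n_blocks: int) -> int:
        """HTime(N): cycles to hash an N-block message (assumed linear in N)."""
        return self.overhead_cycles + self.cycles_per_block * n_blocks

    def throughput(self, n_blocks: int = 1) -> float:
        """Eq. (2.1), in bits per nanosecond (numerically equal to Gbit/s)."""
        delta = self.htime(n_blocks + 1) - self.htime(n_blocks)
        return self.block_size / (self.clock_period_ns * delta)

    def throughput_eff(self, n_blocks: int) -> float:
        """Eq. (2.2): effective throughput for an N-block message."""
        cycles = self.htime(n_blocks)
        return n_blocks * self.block_size / (self.clock_period_ns * cycles)

    def exec_time_ns(self, padded_bits: int) -> float:
        """Eqs. (2.3) and (2.4): T * HTime(padlen / block size)."""
        return self.clock_period_ns * self.htime(padded_bits // self.block_size)


def padlen_md(message_bits: int, block_size: int = 512, length_field: int = 64) -> int:
    """Padded length under SHA-256-style padding: a 1 bit, zeros, length field."""
    blocks = -(-(message_bits + 1 + length_field) // block_size)  # ceiling division
    return blocks * block_size


# Illustrative numbers only; they are not measurements from the paper.
core = HashCore(block_size=512, clock_period_ns=5.0,
                overhead_cycles=10, cycles_per_block=65)

t_empty = core.exec_time_ns(padlen_md(0))    # Eq. (2.3)
t_100b = core.exec_time_ns(padlen_md(800))   # Eq. (2.4)
print(core.throughput(), core.throughput_eff(2), t_empty, t_100b)
```

Under the linear assumption, `throughput` does not depend on its argument, and `throughput_eff` approaches it from below as the number of blocks grows, because the fixed overhead is amortized over more blocks.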
2.3 Uniform Interface

In order to remove any ambiguity in the definition of our hardware cores for SHA-3 candidates, and in order to make our implementations as practical as possible, we have developed the interface shown in Fig. 2.1a and described below. In a typical scenario, the SHA core is assumed to be surrounded by two standard FIFO modules: Input FIFO and Output FIFO, as shown in Fig. 2.1b. In this configuration, the SHA core is an active module, while the surrounding logic (FIFOs) is passive. Passive logic is much easier to implement, and in our case is composed of standard logic components, FIFOs, available in any major library of IP cores. Each FIFO module generates the signals empty and full, which indicate that the FIFO is empty or full, respectively. Each FIFO accepts the control signals write and read, indicating that the FIFO is being written to or read from, respectively.

The aforementioned assumptions about the use of FIFOs as surrounding modules are very natural and easy to meet. For example, if a SHA core implemented on an FPGA communicates with the outside world using a PCI, PCI-X, or PCIe interface, the implementations of these interfaces most likely already include Input and Output FIFOs, which can be directly connected to the SHA core. If a SHA core communicates with another core implemented on the same FPGA, then FIFOs are often used on the boundary between the two cores in order to accommodate any differences between the rate at which one core generates data and the rate at which the other core accepts it.

Additionally, the inputs and outputs of our proposed SHA core interface do not necessarily need to be generated or consumed by FIFOs. Any circuit that can support the control signals src_ready and src_read can be used as a source of data. Any circuit that can support the control signals dst_ready and dst_write can be used as a destination for data.
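The division of roles described above, with an active core between two passive FIFOs, can be sketched as a small cycle-level Python model. This is only an illustration under stated assumptions: the classes `Fifo` and `PassThroughCore` are hypothetical, the core forwards words unchanged instead of hashing them, and src_ready and dst_ready are modeled simply as "input FIFO not empty" and "output FIFO not full". The signal polarity and cycle timing of the actual interface are not specified in the text above.

```python
from collections import deque


class Fifo:
    """Passive FIFO exposing empty/full status and read/write operations."""

    def __init__(self, depth: int):
        self.depth = depth
        self.words = deque()

    @property
    def empty(self) -> bool:
        return len(self.words) == 0

    @property
    def full(self) -> bool:
        return len(self.words) >= self.depth

    def write(self, word: int) -> None:
        assert not self.full, "write to a full FIFO"
        self.words.append(word)

    def read(self) -> int:
        assert not self.empty, "read from an empty FIFO"
        return self.words.popleft()


class PassThroughCore:
    """Active module: it decides when to read from and write to the FIFOs."""

    def __init__(self, fifo_in: Fifo, fifo_out: Fifo):
        self.fifo_in = fifo_in
        self.fifo_out = fifo_out
        self.held = None  # word read from the source but not yet written

    def clock(self) -> None:
        """One clock cycle: write the held word if possible, else fetch one."""
        src_ready = not self.fifo_in.empty   # assumed meaning of src_ready
        dst_ready = not self.fifo_out.full   # assumed meaning of dst_ready
        if self.held is not None:
            if dst_ready:
                self.fifo_out.write(self.held)   # corresponds to dst_write
                self.held = None
        elif src_ready:
            self.held = self.fifo_in.read()      # corresponds to src_read


fifo_in, fifo_out = Fifo(depth=8), Fifo(depth=2)
for word in (0xA, 0xB, 0xC, 0xD):
    fifo_in.write(word)

core = PassThroughCore(fifo_in, fifo_out)
for _ in range(20):
    core.clock()
stalled = list(fifo_out.words)   # output FIFO is full, so the core waits

first = [fifo_out.read(), fifo_out.read()]   # destination drains two words
for _ in range(20):
    core.clock()
second = [fifo_out.read(), fifo_out.read()]
print(stalled, first + second)
```

The FIFOs never initiate a transfer; the core stalls on its own when the output FIFO is full and resumes once the destination has drained it, which is the behavior that lets any circuit supporting the same four control signals replace either FIFO.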
The exact format of an input to the SHA core, for the case of pre-padded messages, is shown in Fig. 2.2. Two scenarios of operation are supported. In the first scenario, the message bit length after padding is known in advance and is smaller than 2^w. In this scenario, shown in Fig. 2.2a, the first word of input represents the message length after padding, ...

[Fig. 2.1: SHA core interface; signals din, dout, src_ready, src_read, dst_ready, dst_write, clk, rst]
Journal title:
- IACR Cryptology ePrint Archive
Volume 2010, Issue
Pages -
Publication date 2010