Low Latency Montgomery Multiplier for Cryptographic Applications

In the modern era, data protection is essential. To achieve it, data must be secured using either Public-Key Cryptography (PKC) or Private-Key Cryptography. PKC eliminates the need to share keys at the beginning of communication. PKC systems such as ECC and RSA are used for security services such as key exchange between sender and receiver, key distribution between network nodes, and authentication protocols. PKC is based on computationally intensive finite field arithmetic operations. In PKC schemes, Modular Multiplication (MM) is the most critical operation. Usually, this operation is performed by Integer Multiplication (IM) followed by a reduction modulo M. However, the reduction step involves a long division that is expensive in terms of area, time, and resources. The Montgomery multiplication algorithm enables a faster MM operation without division. In this paper, a low-latency hardware implementation of the Montgomery multiplier is proposed. Several optimization strategies are adopted in the proposed design. The proposed Montgomery multiplier is based on the School-Book (SB) multiplier, the Karatsuba-Ofman (KA) algorithm, and fast adder techniques. The Karatsuba-Ofman algorithm and the School-Book multiplier split the operands into smaller chunks, while fast adders accelerate addition for large operands. The proposed design is simulated, synthesized, and implemented using the Xilinx ISE Design Suite, targeting different Xilinx Field-Programmable Gate Array (FPGA) devices for bit sizes from 64 to 1024. The proposed design is evaluated in terms of computational time, area consumption, and throughput. The implementation results show that the proposed design outperforms state-of-the-art designs.


I. INTRODUCTION
Cryptography is generally divided into two types: Private-Key Cryptography (symmetric) and Public-Key Cryptography (PKC, asymmetric) [1]. In Private-Key Cryptography, encryption and decryption use the same key, whereas PKC uses a key pair consisting of a private key and a public key. The public key is open and accessible to everyone, but the private key is known only to the message receiver. Both keys are mathematically related. In PKC, the public key is used to encrypt data and the private key is used to decrypt it. It is computationally infeasible to derive the private key from the public key. This property allows public keys to be exchanged freely over the network. The most commonly used PKC algorithms are RSA and elliptic curve cryptography (ECC) [2]. Most PKC implementations need arithmetic operations over a finite field of large characteristic. These operations are modular addition (or subtraction), modular multiplication, and modular inversion with large operands over finite fields. The security of a scheme can be increased by increasing the operand length. RSA requires larger operands than ECC for the same security level. Hence, ECC is the preferred scheme, particularly in resource-constrained environments. Software implementations of ECC achieve greater flexibility at the price of higher computational time [3]. Hardware implementations of ECC provide higher performance if the modular multiplier is designed efficiently. Modular multiplication and inversion are the most time- and resource-consuming operations. Modular multiplication is faster than modular inversion in both hardware and software. To avoid inversion, the coordinate system is changed to projective coordinates at the cost of extra multiplications.
By doing so, the total performance depends on modular multiplication, which becomes the bottleneck of the cryptosystem and needs to be optimized. Modular multipliers in the literature are divided into three types. The conventional technique for modular multiplication is division by the modulus (M). Division is costly in terms of execution time and resource utilization. The second type is Interleaved Modular Multiplication (IMM), which performs the reduction during multiplication. The last type is Montgomery Modular Multiplication (MMM), which is faster than the other methods when the operands are large [4]. A large operand is divided into several small parts and the Karatsuba-Ofman algorithm is then applied, which gives the best result in terms of area and performance [5]. Cryptographic applications are deployed in many smart devices such as mobile phones and Wi-Fi devices. These applications use mathematical operands of varying size, from 160 to 1024 bits. The security of the devices can be increased by increasing the size of the operands. Therefore, operations with varying field sizes are required depending on the application.
II. RELATED WORK
Efficient hardware implementations fall into two main classes: word-wise and bit-wise implementations. In modern Field-Programmable Gate Arrays (FPGA), word-wise implementations use dedicated multipliers. Dedicated multipliers on an FPGA are faster than designs based on standard FPGA logic. The operands are divided into parts, and these parts are multiplied using the dedicated multipliers, which gives word-wise implementations higher speed in time-critical applications. One work proposed using 64x64-bit cores for multiplication and provided different implementations for different field sizes [6]. Authors in 2013 presented a modular multiplier that can be used with Barreto-Naehrig (BN) curves. They selected a special prime number for the BN curve implementation and used an atypical division to fit the Digital Signal Processor (DSP) blocks of the FPGA [7]. Other authors extended the concept to a high-speed and low-cost Montgomery algorithm with a carry-save adder for the addition operations [8]. The carry-save adder is used to remove the extra clock cycles needed for implementation and conversion. Another work proposed a high-speed Montgomery multiplier architecture using digit-serial computation [9]. It uses binary multiplication in high-radix partial multiplication, and consecutive zero-bit multiplications can be performed within one clock cycle. Some authors performed 512-bit addition and 256-bit multiplication using a 64-bit carry chain with a soft-core multiplier and achieved a frequency of 188 MHz [10]. Another work described an implementation scheme using the IP cores of the FPGA, with 512-bit addition and 256-bit multiplication, to achieve 50% better results than standard implementations [11]. Researchers also implemented a hardware processor for ECC [12]. There, a full-block MMM with 256-bit integer multiplication is built from cascaded 16-bit multipliers, and the cascading continues until the specified multiplier size is reached. Fast carry look-ahead adders are used for the addition.
Another design divides the operands and performs the computations on the parts. The complexity does not depend on the operand size; it depends on the size of the divided part [13]. To increase the efficiency of the MMM architecture, many solutions based on the Karatsuba algorithm have been proposed. One contribution is a 256-bit MMM design using the dedicated multipliers of the FPGA with pipelining stages. Using the Karatsuba algorithm, the operands are divided into two parts to reduce the number of multipliers used on the FPGA. This hardware architecture reduces the number of clock cycles [14]. Another design consists of a series of nine 64-bit multipliers that form a block implementing a Karatsuba-based integer multiplier [15]. The integer multiplier can be used to build a large 256-bit Montgomery modular multiplier. The design provides a low-area architecture, and the path delay is reduced at the cost of more iterations. G. C. Chow and his associates presented a design to decrease the routing delay, which grows in large multipliers. The routing delay is reduced by dividing the operands into small parts that use the dedicated multipliers, which increases the efficiency of the hardware [16]. Other authors extended the Karatsuba design concept by dividing the operand into two parts and using it in a higher-radix MMM algorithm. The architecture uses the dedicated multiplier blocks, which increases the area of the hardware [17]. Another work applied the Karatsuba algorithm over four levels in the Montgomery modular multiplier. Using the splitting method, the operand is divided into two parts. Each part is again divided into two parts, recursively, until the length of the parts matches the DSP blocks of the FPGA [18].
Researchers also designed a Lookup-Table (LUT)-based Montgomery multiplier instead of using the FPGA dedicated multipliers; a radix-4 modular multiplier with a serial interleaved design is employed [19]. Other authors proposed a hardware architecture for a parallel interleaved modular multiplier. In this architecture, an operand is processed by four parallel processing elements, each completing its dedicated task according to the algorithm [20]. Some researchers presented a design that deploys a systolic array architecture in the Montgomery modular multiplier. The systolic array repeats structures in parallel to reduce the path delay [21]. Finally, authors presented a design that deploys a radix-4 serial multiplier in MMM with the power ladder method to cut the number of clock cycles by 50% [22].

III. CONTRIBUTION
In a PKC system, modular multiplication is a basic operation. The MMM algorithm is commonly adopted to implement modular multiplication. To enhance the efficiency of the MMM algorithm, we use the School-Book (SB) multiplier and the Karatsuba-Ofman (KA) algorithm. The Karatsuba-Ofman algorithm divides the operands into smaller parts. The number of multiplications increases as the size of the operand chunks decreases further. In this paper, we optimize hardware utilization and computation time by combining the SB multiplier and the KA algorithm. We have worked on three methods: two-part, four-part, and eight-part splitting. After splitting, the operand parts are processed by an integer multiplier architecture based on the operand splitting techniques and the MMM algorithm. Increasing the speed of the integer multiplier helps to enhance the overall efficiency of MMM. The proposed design is simulated, synthesized, and implemented using the Xilinx ISE Design Suite, targeting different Xilinx FPGA devices for bit sizes from 64 to 1024. The proposed MMM design is evaluated in terms of computational time, area consumption, and throughput.

A. Karatsuba-Ofman Algorithm
This algorithm reduces multiplication complexity by dividing the operands into equal smaller chunks. The complexity of School-Book multiplication is $O(n^2)$. The strategy found by Karatsuba and Ofman, which splits the operands into parts, decreases the complexity to $O(n^{1.58})$ [23]. To multiply two numbers A and B, the algorithm divides each into a higher and a lower part. With n denoting the bit width of each part, the operands A and B can be written as:

$$A = A_1 2^{n} + A_0, \qquad B = B_1 2^{n} + B_0$$

The multiplication result is then:

$$A \cdot B = A_1 B_1 2^{2n} + (A_1 B_0 + A_0 B_1) 2^{n} + A_0 B_0 \qquad (1)$$

Four multiplications are required if the School-Book method is adopted, as shown in eq. (1), and they occupy four DSP blocks. The School-Book multiplier is fast but uses more resources. With the Karatsuba technique, the required number of multiplications is three, as shown in eq. (2):

$$A \cdot B = A_1 B_1 2^{2n} + \left(A_1 B_1 + A_0 B_0 - (A_1 - A_0)(B_1 - B_0)\right) 2^{n} + A_0 B_0 \qquad (2)$$

The Karatsuba algorithm therefore saves one multiplication at the cost of additional addition and subtraction operations. It also requires signed operations, which decrease the speed of the multiplication but use fewer hardware resources. Repeatedly dividing the operands into smaller parts until the required operand size is reached increases efficiency. The School-Book multiplier improves the speed of multiplication, while the Karatsuba algorithm uses fewer hardware resources.
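The two-part split can be checked numerically. The following is a minimal Python sketch, not the authors' hardware description; the function names and the part width `n` are illustrative assumptions. It evaluates the four-multiplication School-Book form and the three-multiplication Karatsuba form with the signed difference product $(A_1 - A_0)(B_1 - B_0)$.

```python
def split2(x: int, n: int) -> tuple[int, int]:
    """Split x into (high, low) parts, where the low part is n bits wide."""
    return x >> n, x & ((1 << n) - 1)


def schoolbook_2part(a: int, b: int, n: int) -> int:
    """School-Book form: four part-sized multiplications."""
    a1, a0 = split2(a, n)
    b1, b0 = split2(b, n)
    return ((a1 * b1) << (2 * n)) + ((a1 * b0 + a0 * b1) << n) + (a0 * b0)


def karatsuba_2part(a: int, b: int, n: int) -> int:
    """Karatsuba form: three multiplications, one of them signed."""
    a1, a0 = split2(a, n)
    b1, b0 = split2(b, n)
    p11 = a1 * b1                  # unsigned multiplier
    p00 = a0 * b0                  # unsigned multiplier
    d10 = (a1 - a0) * (b1 - b0)    # signed multiplier (operands may be negative)
    return (p11 << (2 * n)) + ((p11 + p00 - d10) << n) + p00


# Hypothetical 64-bit operands split into 32-bit halves.
a = 0xFEDCBA9876543210
b = 0x0123456789ABCDEF
print(schoolbook_2part(a, b, 32) == a * b, karatsuba_2part(a, b, 32) == a * b)
```

Both functions return the same product; the Karatsuba variant trades one multiplication for two subtractions on the inputs and extra additions on the outputs.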

B. Operands Splitting
The operands may be divided into different numbers of parts and multiplied using either the School-Book or the Karatsuba technique. Table I shows the number of multipliers required for each operand splitting with the two multiplication techniques. With further division the operand parts become smaller, but the number of multipliers increases, along with the additions and subtractions needed to produce the final output. Splitting the operands further may not be suitable, because the hardware area increases due to the extra addition and subtraction operations.
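The growth in multiplier count can be sketched as follows. This is an illustration under stated assumptions and does not reproduce Table I: the School-Book count is the square of the number of parts, and the Karatsuba count assumes the one-level form used in this paper (one product per part plus one difference product per pair of parts), which matches the counts of 3 and 10 quoted in the text for two and four parts. The eight-part Karatsuba value is an extrapolation of that formula, not a figure taken from the paper.

```python
def schoolbook_multipliers(parts: int) -> int:
    """Every part of A meets every part of B."""
    return parts * parts


def karatsuba_multipliers(parts: int) -> int:
    """One product per part plus one difference product per pair of parts."""
    return parts * (parts + 1) // 2


for k in (2, 4, 8):
    print(f"{k} parts: School-Book {schoolbook_multipliers(k)}, "
          f"Karatsuba {karatsuba_multipliers(k)}")
```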

C. Two-Parts Splitting
The Karatsuba algorithm divides the operands into two parts. The main operation is the calculation of the product of the differences of the divided operands, $(A_1 - A_0)(B_1 - B_0)$, which is a vital part of the Karatsuba calculation. The products of the divided parts are then computed, which saves one multiplication compared to the School-Book method. Recent FPGA chips contain DSP blocks with built-in multipliers. In Virtex-5 and Virtex-6, each DSP block contains an 18x25 asymmetric, signed dedicated multiplier. If the dedicated multiplier is n bits wide, the output of the multiplication is 2n bits wide. A 2n-bit multiplication can therefore be built from three dedicated multipliers, and the hardware architecture uses three dedicated multipliers. For any operand size up to 2n bits, the architecture always uses three dedicated multipliers. This holds for multiplication in general. Older FPGA devices do not provide dedicated multipliers. Recent FPGA devices multiply quickly because of the dedicated multipliers inside the DSP blocks, which can also perform the addition.

D. Four-Parts Splitting
The Karatsuba-Ofman algorithm is applied repeatedly to the operands. Splitting each of the two parts of an operand once more yields four parts. The four sections of the operands are given below:

$$A = A_3 2^{3n} + A_2 2^{2n} + A_1 2^{n} + A_0, \qquad B = B_3 2^{3n} + B_2 2^{2n} + B_1 2^{n} + B_0$$

The general output of the multiplication is given below:

$$A \cdot B = \sum_{i=0}^{3} \sum_{j=0}^{3} A_i B_j \, 2^{(i+j)n} \qquad (3)$$

Eq. (3) shows that 16 multiplications and 15 adders are needed to produce the output. If the size of each operand part equals the DSP block size, 16 dedicated multipliers are needed to perform the multiplication. After applying the Karatsuba-Ofman calculation to the same equation, the output is:

$$\begin{aligned} A \cdot B = {} & P_{00} + 2^{n}(P_{11} + P_{00} - D_{10}) + 2^{2n}(P_{22} + P_{11} + P_{00} - D_{20}) \\ & + 2^{3n}(P_{33} + P_{22} + P_{11} + P_{00} - D_{30} - D_{21}) + 2^{4n}(P_{33} + P_{22} + P_{11} - D_{31}) \\ & + 2^{5n}(P_{33} + P_{22} - D_{32}) + 2^{6n} P_{33} \end{aligned} \qquad (4)$$

where $P_{ii} = A_i B_i$ and $D_{ij} = (A_i - A_j)(B_i - B_j)$. Comparing both equations, the number of multiplications is reduced from 16 to 10, at the cost of 15 adders and 18 subtractions. Each part is one-fourth the size of the original operand. If a single multiplication uses a single DSP block, 10 DSP blocks are required to complete the multiplication. The Karatsuba-Ofman calculation thus saves 6 DSP blocks.
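The four-part Karatsuba expression can be verified with a short software model. The sketch below is an illustration under assumptions, not the paper's hardware: the helper names are hypothetical, $P_{ii}$ is taken as $A_i B_i$ and $D_{ij}$ as $(A_i - A_j)(B_i - B_j)$, and operands are assumed to fit in four parts of `n` bits each.

```python
from itertools import combinations


def split_parts(x: int, n: int, parts: int = 4) -> list[int]:
    """Split x into `parts` chunks of n bits, least significant first."""
    mask = (1 << n) - 1
    return [(x >> (i * n)) & mask for i in range(parts)]


def karatsuba_4part(a: int, b: int, n: int) -> tuple[int, int]:
    """Return (product, number of part-sized multiplications used)."""
    A, B = split_parts(a, n), split_parts(b, n)
    # Four unsigned products P_ii = A_i * B_i.
    P = [A[i] * B[i] for i in range(4)]
    # Six signed products D_ij = (A_i - A_j) * (B_i - B_j) for i > j.
    D = {(i, j): (A[i] - A[j]) * (B[i] - B[j])
         for j, i in combinations(range(4), 2)}
    # Coefficient of 2^(k*n): P_ii lands at k = 2i, and each cross term
    # A_i*B_j + A_j*B_i = P_ii + P_jj - D_ij lands at k = i + j.
    coeff = [0] * 7
    for i in range(4):
        coeff[2 * i] += P[i]
    for (i, j), d in D.items():
        coeff[i + j] += P[i] + P[j] - d
    product = sum(c << (k * n) for k, c in enumerate(coeff))
    return product, len(P) + len(D)


# Hypothetical 64-bit operands split into four 16-bit parts.
a, b = 0xFEDCBA9876543210, 0x0123456789ABCDEF
product, mults = karatsuba_4part(a, b, 16)
print(product == a * b, mults)
```

The multiplication count returned by the model is 10, in line with the reduction from 16 stated above.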

F. Montgomery Algorithm
A strategy for faster modular multiplication was presented in 1985 by Peter L. Montgomery. The MMM architecture computes A x B mod M, where A and B are integers and M is a large odd modulus, typically a large prime [24]. Conventional approaches to computing the remainder involve division. In Montgomery multiplication, shift and add operations replace the costly division. The shift-and-add strategy works only in the Montgomery domain. Before the computation, the operands are first converted into the Montgomery domain, that is, into the residue representation A x R mod M. After the operation completes, the output is converted back. The radix R must be a power of two chosen according to the word length, and the modulus M must be smaller than R. For the algorithm to work, R and M must be relatively prime. The modulus M is an n-bit positive number, and A and B are two n-bit operands. In modular multiplication, Output = A x B mod M, where 0 < A, B < M.
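The role of the domain conversion can be made explicit with a short worked relation. The notation $\bar{A}$ for a value in the Montgomery domain and $\mathrm{MonPro}$ for one Montgomery multiplication is introduced here for illustration and is not taken from the paper; the numeric values are a toy example.

```latex
\begin{aligned}
\bar{A} &= A R \bmod M, \qquad \bar{B} = B R \bmod M \\
\mathrm{MonPro}(\bar{A}, \bar{B}) &= \bar{A}\,\bar{B}\,R^{-1} \bmod M
  = (A R)(B R) R^{-1} \bmod M = (A B) R \bmod M = \overline{A B} \\
\mathrm{MonPro}(\bar{C}, 1) &= C R \cdot R^{-1} \bmod M = C \\[4pt]
\text{Example: } & M = 13,\ R = 16,\ A = 5,\ B = 7,\ R^{-1} \bmod 13 = 9 \\
\bar{A} &= 80 \bmod 13 = 2, \qquad \bar{B} = 112 \bmod 13 = 8 \\
\mathrm{MonPro}(2, 8) &= 2 \cdot 8 \cdot 9 \bmod 13 = 1 = (35 \bmod 13) \cdot 16 \bmod 13 \\
\mathrm{MonPro}(1, 1) &= 1 \cdot 9 \bmod 13 = 9 = 5 \cdot 7 \bmod 13
\end{aligned}
```

The product of two Montgomery-domain values stays in the domain, so the conversions are needed only once on input and once on output, however many multiplications are chained in between.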

V. FPGA IMPLEMENTATION
Field Programmable Gate Arrays (FPGA) are integrated circuits that can be programmed by the user after fabrication. FPGA devices contain Configurable Logic Blocks (CLB), which are connected through programmable interconnects. This reconfigurability makes FPGA devices suitable hardware accelerators for different applications, and they are widely deployed in cryptographic applications. Modern FPGA devices provide dedicated portions for software and hardware cores as well as configurable blocks. Modern Xilinx FPGAs offer different memories, cores, and dedicated blocks for arithmetic operations, which are already tailored for high-speed and low-power applications. These cores can be configured according to requirements, and multiple blocks can be used at the same time. The CLBs allow the designer to keep the design compact and thereby maximize the speed of the hardware.

A. Integer Multiplier
The performance of the MMM depends entirely on the efficiency of the Integer Multiplier (IM). In this paper, we deploy the School-Book algorithm and the Karatsuba-Ofman algorithm to increase the performance of the IM. We adopt three approaches to increase the efficiency of the IM: two-part splitting, four-part splitting, and eight-part splitting. They are discussed below together with their results.

B. Two-Parts Splitting Multiplier
Integer multiplication using the two-part splitting technique is given in Algorithm-1, shown in figure 1. It describes all the steps involved in partial product generation and accumulation to obtain the final product. In the two-part splitting algorithm, the operands are sliced into two equal parts using the Karatsuba-Ofman approach and the product is generated using the IM. Two unsigned multipliers of N/2 bits generate two of the partial products, and one signed multiplier of (N/2 + 1) bits generates the third partial product. The three multiplications are executed in parallel. In Algorithm-1, steps 1 to 12 describe the generation of the partial products. Eq. (2) shows that the output uses three multipliers. The reduction in the number of multipliers is achieved through the Karatsuba-Ofman algorithm. Steps 13 to 15 show that the partial products are used as inputs to the adders.
Step 16 is the final addition that generates the output product. The splitting depth of the operand is one.
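The extra bit on the signed multiplier follows from the range of the operand differences. The short Python sketch below is an illustrative check, with the helper name `signed_bits` and the choice N = 64 as assumptions: the difference of two unsigned N/2-bit values always fits in a two's-complement word of N/2 + 1 bits.

```python
def signed_bits(x: int) -> int:
    """Minimum two's-complement width that can represent x."""
    if x >= 0:
        return x.bit_length() + 1
    return (-x - 1).bit_length() + 1


N = 64                       # assumed operand width for this example
half = N // 2
largest = (1 << half) - 1    # largest unsigned N/2-bit value

# Extreme values of A1 - A0: largest minus zero, and zero minus largest.
print(signed_bits(largest - 0), signed_bits(0 - largest))
```

Both extremes need 33 bits for N = 64, which is the (N/2 + 1)-bit signed multiplier described above.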

C. Four-Parts Splitting Multiplier
In the four-part splitting technique, the splitting depth is two: each of the two operand parts is subdivided into two parts again. The prime benefit of four-part splitting is to optimize resources. The four-part splitting multiplier uses the basic multiplier of the DSP block in Virtex-6. As shown in figure 3, at the beginning of the multiplication the inputs are stored in registers. The operands are then divided into four parts using the Karatsuba-Ofman algorithm, and the product is generated using an integer multiplier. The partial products are generated using four unsigned and six signed multipliers. The ten multiplications are executed in parallel. In Algorithm-2, shown in figure 2, steps 1 to 12 describe the generation of the partial products. Eq. (4) shows that the output uses ten multipliers. The reduction in the number of multipliers is achieved through the Karatsuba-Ofman algorithm. Steps 13 to 19 of Algorithm-2 show that the partial products are used as inputs to the adders. The N/4-bit fast carry chain adders add the partial products, as shown in figure 3. Step 20 is the addition that creates the final product. Table II shows the clock cycles of the four-part splitting multiplier. The complete product takes seven clock cycles. In figure 3, the first operation loads the operands into the input registers. In table II, LR denotes Load Register, which requires one clock cycle. The second operation in table II is the calculation of the partial products, Partial Product Multiplication (PPM), which takes one clock cycle. The third operation in figure 3 is Partial Product Addition (PPA), which adds the partial products in four clock cycles. The final stage is the Final Addition (FA) of the product, which adds the PPA results in one clock cycle. The whole multiplication therefore takes seven clock cycles to generate the full product.
The implementation results are shown in table III.

D. Eight-Parts Splitting Multiplier
In the eight-part splitting technique, the splitting depth is three: each input part is divided into two parts once more. The prime benefit of eight-part splitting is to optimize the hardware resources further. The eight-part splitting multiplier uses the basic multiplier of the DSP block in Virtex-6 and Virtex-7.
In the eight-part splitting algorithm, the operands are divided into eight equal parts. At the beginning of the multiplication, the inputs are stored in registers. In the next step, the operands are divided into eight parts using the splitting algorithm. The partial products are then generated using an integer multiplier. Sixty-four unsigned multipliers of N/8 bits are required to generate the sixty-four partial products. In eq. (5), the output uses sixty-four multipliers. Table IV shows the clock cycles of the eight-part splitting multiplier. The complete product takes two clock cycles. The first operation loads the operands into the input registers. In table IV, LR denotes Load Register, which requires one clock cycle. The second operation is the Final Addition (FA) of the product, which accumulates the partial products in one clock cycle. The whole multiplication therefore takes two clock cycles for the full product. Table V shows the implementation results for the eight-part splitting multiplier.
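The eight-part School-Book scheme can be modelled in software as follows. This is a minimal sketch under assumptions rather than the paper's hardware: the function name is hypothetical, the operands are assumed to fit in eight parts of `n` bits, and the sixty-four products that the hardware computes in parallel are simply enumerated in a loop.

```python
def schoolbook_8part(a: int, b: int, n: int) -> tuple[int, int]:
    """Return (product, number of part-sized multiplications used)."""
    mask = (1 << n) - 1
    A = [(a >> (i * n)) & mask for i in range(8)]
    B = [(b >> (i * n)) & mask for i in range(8)]
    # Each partial product A_i * B_j carries the weight 2^((i + j) * n).
    partials = [(i + j, A[i] * B[j]) for i in range(8) for j in range(8)]
    product = sum(p << (k * n) for k, p in partials)
    return product, len(partials)


# Hypothetical 128-bit operands split into eight 16-bit parts.
a = 0xFEDCBA98765432100123456789ABCDEF
b = 0x0F1E2D3C4B5A69788796A5B4C3D2E1F0
product, mults = schoolbook_8part(a, b, 16)
print(product == a * b, mults)
```

No signed arithmetic is needed here, which is consistent with the unsigned N/8-bit multipliers described above.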

VI. MONTGOMERY MULTIPLIER ARCHITECTURE
The architecture of MMM is described in Algorithm-3, shown in figure 4. This algorithm requires three n-bit integer multiplications. The overall efficiency of the Montgomery multiplier depends on the Integer Multiplier. In this paper, we present an efficient MMM implemented on modern FPGA devices. The proposed architecture of the Montgomery multiplier is shown in figure 5. The architecture consists of an n-bit Integer Multiplier. The intermediate multiplication results are held in a 2n-bit register, and the register contents are used in the following steps. All three multiplications are performed in series. Each multiplication of the operands takes seven clock cycles with the Karatsuba algorithm and two clock cycles with the School-Book algorithm, and the 2n-bit product is stored in the register, which is represented by 'W' in Table VI. Writing the product to the register file takes one clock cycle.

Algorithm 3: Montgomery Multiplier
Input: A, B, M, $n = \lceil \log_2 M \rceil$, $R = 2^{n}$
The outcome of the first integer multiplication is stored in the register. The stored result is then added to the result of the third multiplication. The final step is the reduction that completes the Montgomery multiplication.
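The three-multiplication flow can be sketched in software as follows. This is a minimal model under assumptions, not the listing from figure 4: the function name is hypothetical, the constant `m_prime` is assumed to be $-M^{-1} \bmod R$ (it may correspond to the operand the text calls M1), and the modulus is assumed odd with $M < R = 2^n$. The modular inverse uses the three-argument `pow` with a negative exponent, which requires Python 3.8 or later.

```python
def montgomery_multiply(a: int, b: int, m: int, n: int) -> int:
    """Return a * b * R^(-1) mod m for R = 2^n, with m odd and a, b < m < R."""
    r_mask = (1 << n) - 1
    m_prime = (-pow(m, -1, 1 << n)) & r_mask   # assumed constant: -m^(-1) mod R
    t = a * b                                  # first multiplication
    q = ((t & r_mask) * m_prime) & r_mask      # second multiplication, modulo R
    u = (t + q * m) >> n                       # third multiplication, add, shift by n
    # Final comparison and subtraction, performed only if required.
    return u - m if u >= m else u


# Toy values: M = 13, R = 16; operands 2 and 8 are 5 and 7 in the Montgomery domain.
print(montgomery_multiply(2, 8, 13, 4))
```

In hardware the constant would normally be precomputed once per modulus instead of on every call; it is recomputed here only to keep the sketch self-contained.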
Table VI lists the series of operations that the proposed architecture performs to execute the MMM. In the first clock cycle, the operands are loaded into registers A and B, which is represented by Load Register (LR). The first integer multiplication uses the input operands A and B. When the multiplier receives the operands A and B, the multiplication starts. During each Load Register (LR) operation, new operands are loaded into the input registers A and B. The second multiplication uses the first multiplication output reduced modulo R, together with M1, as operands. The second multiplication also takes seven clock cycles with the Karatsuba algorithm and two clock cycles with the School-Book algorithm to compute the product, plus one clock cycle to store the result in the register file. The three multiplications in series require twenty-six clock cycles with the Karatsuba algorithm and eleven clock cycles with the School-Book algorithm.
In the last step, three more clock cycles are required to add the products of the three multiplications according to Algorithm-3, which is depicted in figure 4.
A comparison and a subtraction are performed if required. In this way, the MMM architecture requires twenty-nine clock cycles with the Karatsuba algorithm and fourteen clock cycles with the School-Book algorithm to compute the complete result. Table VII presents the implementation results for two-part, four-part, and eight-part splitting MMM on an FPGA device. Table VII shows that the eight-part splitting architecture uses fewer DSP blocks for the same operand length than two-part and four-part splitting.

One group of researchers presented an implementation scheme that uses the IP cores of the FPGA, with 512-bit addition and 256-bit multiplication, to achieve 50% better results than a standard implementation. Low frequency and high latency remain issues in this architecture: the authors achieved a frequency of 40 MHz, which is too low for high-speed applications. A further drawback is the School-Book multiplier in the MMM, which uses too much area [28]. Another work implemented a hardware processor for ECC. There, the full-block MMM with a 256-bit integer multiplier (IM) is built from cascaded 16-bit unsigned multipliers, and the cascading continues until the specified multiplier size is reached. Fast carry look-ahead adders are required for the addition in the modular multiplication. The drawback of this architecture is its long synthesis time compared to our proposed architecture [29]. Another design addressed FPGAs operating at high frequency, where the routing delay increases in large multipliers. The routing delay is reduced by dividing the operands into small parts and by using deep pipeline stages. The number of pipeline stages and the time for one Montgomery modular multiplication are not reported in that paper. The drawback of this architecture is that it uses more than fifty clock cycles [30], whereas our proposed architecture uses only twenty-nine clock cycles for one Montgomery modular multiplication. A further design applies the Karatsuba algorithm by dividing the operand into two chunks and uses it in a higher-radix MMM algorithm. This architecture uses the dedicated multiplier blocks, and its drawback is the increased hardware area [31].

In another design, the Karatsuba algorithm was applied over four levels for use in MMM. Using the splitting method, the operand is divided into two parts. Each part is again divided into two parts, recursively, until the length of the parts matches the DSP blocks of the FPGA. This design uses the LUTs of the FPGA rather than the dedicated multipliers [32]. It uses the same number of clock cycles as our proposed architecture. Our results are better in terms of time, and our proposed architecture is also better in terms of hardware resource utilization and time delay.
We also compare against MMM results from bit-wise implementations. Bit-wise implementations on modern FPGAs do not use the dedicated multipliers; they use only standard FPGA logic. Dedicated multipliers on an FPGA are faster than designs based on standard FPGA logic. The architecture in [33] employs a 256-bit interleaved modular multiplier and achieves a frequency of 96 MHz. The results of our proposed architecture are 4 times better in terms of MMM time and throughput. Another design uses LUTs for the hardware implementation rather than the FPGA dedicated multipliers, with a radix-4 serial interleaved modular multiplier [34]. Table VIII shows that our proposed architecture is 4 times better than the design of Javeed, who presented a similar design concept for a hardware architecture of a parallel interleaved modular multiplier. In that architecture, an operand is processed by four parallel processing elements, each completing its dedicated task according to the algorithm [35]. The results of our proposed architecture are 2.5 times better in terms of time and throughput. Figure 6 shows the performance of several modular multipliers on the Virtex-6 FPGA platform. The proposed design has a higher throughput rate with comparable area and maximum frequency.

The Montgomery Modular Multiplier plays a vital role in hardware implementations of public-key cryptographic algorithms. This paper provides a full-word implementation of MMM that enhances the execution speed of Elliptic-Curve Cryptography (ECC) and RSA cryptographic algorithms on hardware. We have used the Karatsuba-Ofman and School-Book algorithms to compute 64-bit to 1024-bit multiplications on Xilinx FPGA devices. In this work, we exploit the efficiency of the Karatsuba-Ofman and School-Book algorithms to deploy a 1024-bit MMM architecture.
We apply the Karatsuba-Ofman and School-Book techniques to divide the operands into different sizes according to the target Xilinx FPGA device. The proposed design is evaluated in terms of computational time, area consumption, and throughput, and the results indicate that it outperforms state-of-the-art designs.