ELSEVIER
Information Processing Letters 51 (1994) 163-169

Fast RNS division algorithms for fixed divisors with application to RSA encryption

Ching Yu Hung 1, Behrooz Parhami *

Department of Electrical and Computer Engineering, University of California, Santa Barbara, Santa Barbara, CA 93106-9560, USA

Communicated by G.R. Andrews; received 8 July 1993; revised 29 April 1994

Keywords: Algorithm complexity; Computer arithmetic; Cryptography; Division; Modular multiplication; Residue number system; Sign detection

* Corresponding author. Email: [email protected]
1 Email: [email protected]

1. Introduction

Residue number systems (RNS) offer faster addition and multiplication than other number systems, and have thus received much attention for high-throughput computations. Digit-parallel, carry-free, and constant-time multiplication and addition is a unique feature of RNS. However, certain operations such as overflow detection, magnitude comparison, and division are quite difficult in RNS. Thus, RNS is in general limited to applications that do not require extensive use of those difficult operations; for example, filtering in Digital Signal Processing. By improving RNS division, many application areas for which RNS was previously infeasible, such as RSA encryption, can use the fast RNS multiplication without being penalized too much by the slow RNS division.

In this paper we consider the problem of division by fixed divisors in RNS. Ordinary integer division is performed; i.e., given the dividend $X$ and the divisor $D$, we wish to find the quotient $Q = \lfloor X/D \rfloor$ and the remainder $R = X - QD$. The idea is to perform some preprocessing based on the divisor to improve the on-line speed of divisions. Our algorithms do not require inordinately expensive preprocessing; the cost of preprocessing is negligible if $O(\log_2 D)$ divisions are performed with the same $D$, and that is true in the context of RSA encryption.
In $m$-bit encryption, up to $2m$ modular multiplications are performed with the same $m$-bit modulus.

Several algorithms for general residue division have been proposed in the past [1,3,5,8]. Under the assumption that $D$ is fixed and $X$ is uniformly distributed, the fastest of them [5] has a worst-case time complexity of $O(\log_2 X) = O(nb)$, where $n$ is the number of moduli and $b$ is the number of bits in the largest modulus. In this paper we present two division algorithms for fixed divisors that achieve time complexity of $O(n)$ for each division. The first algorithm is based on the well-known division method of multiplying by the divisor reciprocal. The second algorithm is based on the Chinese Remainder Theorem (CRT) decoding and table lookup, and requires that the divisor $D$ be relatively prime to all moduli. The second algorithm requires more storage but is faster.

0020-0190/94/$07.00 © 1994 Elsevier Science B.V. All rights reserved
SSDI 0020-0190(94)00099-K

The computation time analysis is based on the usual parallel residue processor assumption: there are $n$ residue processors for an $n$ modulus system, each being capable of one modulus-size modular addition or multiplication in one time step. Additionally, we assume that the divisor $D$ is relatively prime to all moduli, even for the first algorithm, which does not require it but becomes somewhat more efficient if it holds. Unless otherwise noted, the computation times given in the paper are parallel times. For software implementation of RNS, we also provide the sequential time complexities.

We also present adaptations of both algorithms for RSA encryption. The first algorithm leads to $4n + b$ time steps per modular multiplication, while the second algorithm requires $2n$ time steps per modular multiplication. The second algorithm is found to be very competitive with previously proposed RSA implementations.

2. Basic operations in RNS

2.1. Notation

The expression $|x|_y$ denotes the remainder of $x$ divided by $y$, where $x$ and $y$ are real with $y > 0$. When $x$ and $y$ are relatively prime integers, with $y > 1$, the multiplicative inverse of $x$ modulo $y$, $|x^{-1}|_y$, is defined.

A residue number system is specified by a list of $n$ pairwise relatively prime moduli, $m_1, m_2, \ldots, m_n$. A number $X$ is represented in RNS by a list of residues $(x_1, x_2, \ldots, x_n)$, where $x_i = |X|_{m_i}$. Let $M = \prod_{i=1}^{n} m_i$ represent the product of all moduli. Let $M_i = M/m_i$ and $\alpha_i = |M_i^{-1}|_{m_i}$. We also define $M[a,b]$, $1 \le a \le b \le n$, as the product of a sequence of moduli: $M[a,b] = \prod_{i=a}^{b} m_i$. We let $M[a,b] = 1$ for $b < a$. Signed numbers in the range $-\lfloor M/2 \rfloor \le X \le \lfloor (M-1)/2 \rfloor$ are represented. These RNS parameters are used throughout the rest of the paper without being explicitly noted in each case.

Let $b$ be the number of bits needed to represent each residue. For algorithm efficiency and convenience in analyzing complexities, we assume that the magnitudes of the moduli are more or less uniform. This assumption leads to $m_i \le 2^b$, $1 \le i \le n$, and $M \approx 2^{nb}$. We say a number $X$ is $k$ digits long when $X \approx 2^{kb}$.

2.2. Base extension and division by a product of moduli

Let a number $X$ be representable by $k$ residues, $k < n$, in an $n$ modulus RNS. Base extension refers to the procedure of finding the $n - k$ unknown residues. Base extension is usually implemented with mixed-radix conversion (see, e.g., [9, pp. 41-47]) and takes exactly $2k - 1$ time steps with $n$ residue processors.

Base extension can be used to divide a number by a product of first powers of moduli [9, pp. 47-50]. Let $X$ be the dividend and $M[1,k]$ be the divisor. The first $k$ residues of the remainder $R = X \bmod M[1,k]$ are simply the first $k$ residues of $X$. The $n - k$ remaining residues are found by a base extension from the front $k$ residues toward the back $n - k$ residues, taking $2k - 1$ steps.
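The remainder step just described can be illustrated with a minimal sequential sketch in Python. The function names (to_rns, mixed_radix_digits, base_extend) and the example moduli are assumptions introduced for illustration only, and the sketch does not model the parallel $2k - 1$ step schedule; it only shows that a mixed-radix conversion of the front $k$ residues yields the back $n - k$ residues of $R$.

```python
from math import prod

def to_rns(x, moduli):
    """Residues (x_1, ..., x_n) of x with respect to each modulus."""
    return [x % m for m in moduli]

def mixed_radix_digits(residues, moduli):
    """Digits a_1..a_k with X = a_1 + a_2*m_1 + a_3*m_1*m_2 + ...

    Assumes 0 <= X < prod(moduli) and pairwise relatively prime moduli.
    """
    r = list(residues)
    digits = []
    for i, m_i in enumerate(moduli):
        a = r[i] % m_i
        digits.append(a)
        # Subtract the digit and divide by m_i in every remaining channel.
        for j in range(i + 1, len(moduli)):
            inv = pow(m_i, -1, moduli[j])
            r[j] = ((r[j] - a) * inv) % moduli[j]
    return digits

def base_extend(front_residues, front_moduli, back_moduli):
    """Residues, modulo each back modulus, of the number held in the front residues."""
    digits = mixed_radix_digits(front_residues, front_moduli)
    out = []
    for m in back_moduli:
        acc = 0
        # Horner evaluation of the mixed-radix form, reduced modulo m.
        for a, m_i in zip(reversed(digits), reversed(front_moduli)):
            acc = (acc * m_i + a) % m
        out.append(acc)
    return out

# Hypothetical example: n = 5 moduli, divisor M[1,2] = 7 * 11.
moduli = [7, 11, 13, 17, 19]
k = 2
X = 12345
x_res = to_rns(X, moduli)
# Front residues of R are those of X; the back residues come from base extension.
r_back = base_extend(x_res[:k], moduli[:k], moduli[k:])
```

Here r_back holds the residues of $R = X \bmod M[1,k]$ with respect to the back moduli, which is what the quotient step that follows needs.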
With all residues of the remainder known, the quotient $Q$ can be found by evaluating $(X - R)\,M[1,k]^{-1}$ in one step. However, $M[1,k]^{-1}$ is defined only for the back $n - k$ residues, so only the corresponding residues of $Q$ are known. Another base extension, from the back $n - k$ residues to the front $k$ residues, is applied, taking $2(n - k) - 1$ steps. In total, the division takes $2n - 2$ steps.

2.3. Sign detection

In [5], Hung and Parhami propose a sign estimation procedure that in $\lceil \log_2 n \rceil$ steps indicates whether a residue number is positive, negative, or too small in magnitude to tell. The procedure outlined below uses a parameter $u$, $u > 1$, to specify input range and output precision: the procedure requires the input number $X$ to be in the range $[-(1/2 - 2^{-u})M,\ (1/2 - 2^{-u})M]$; i.e., a fraction of the dynamic range is excluded. When the output $ES(X)$ is indeterminate, $X$ is guaranteed to be in the range $[-2^{-u}M,\ 2^{-u}M]$.

Preprocessing
1. $EF[i][j]$ = Truncate $|j\alpha_i|_{m_i} / m_i$ to the $(-t)$th bit, for $1 \le i \le n$, $0 \le j < m_i$, where $t = u + \lceil \log_2 n \rceil$

Sign estimation of an input X
2. $EF(X) = \left| \sum_{i=1}^{n} EF[i][x_i] \right|_1$
3. $ES(X) = +$, if $EF(X) < 1/2$
4. $ES(X) = -$, if $1/2 \le EF(X) < 1 - 2^{-u}$
5. $ES(X) = \pm$, otherwise

From this sign estimation procedure we construct a sign detection procedure as follows. The relatively inexpensive sign estimation is tried first. In case it fails (sign being indeterminate), we compute the sign by mixed-radix conversion. Since the chance of having to use the mixed-radix conversion is low, the sign detection requires $\lceil \log_2 n \rceil$ steps on the average and $2n + \lceil \log_2 n \rceil$ steps in the worst case. (We assume uniform distribution of $X$ in the allowed range, and ignore the time spent on communication.)

2.4. Chinese Remainder Theorem and B(X)

The Chinese Remainder Theorem states that

$$|X|_M = \left| \sum_{i=1}^{n} |\alpha_i x_i|_{m_i} M_i \right|_M. \qquad (1)$$

We define $B(X)$ [9, p. 30] as the number of times the modular summation in Eq. (1) overflows $M$:

$$|X|_M = \sum_{i=1}^{n} |\alpha_i x_i|_{m_i} M_i - B(X)\,M. \qquad (2)$$

The sign estimation procedure can be adapted to efficiently compute $B(X)$ when input $X$ is nonnegative. The same $EF[i][j]$ table is used, with parameter $u$ satisfying $u > 1$. The preprocessing stage is the same as in sign estimation, followed by:

2. $Y = \sum_{i=1}^{n} EF[i][x_i]$
3. $EF(X) = |Y|_1$, $B(X) = \mathrm{Int}(Y)$
4. If $EF(X) < 1/2$ return $B(X)$
5. Otherwise, return $B(X) = B(X) + 1$

2.5. General division

Reference [5] also contains an algorithm to perform division in RNS without preprocessing. We shall call this algorithm general division since it does not require prior knowledge of the divisor. The general division algorithm is based on the well-known binary SRT division. After proper normalization of the dividend $X$ and the divisor $D$, in each iteration we perform $X = 2X$, $X = 2(X - D)$, or $X = 2(X + D)$, based on the estimated sign of $X$. To optimize for hardware implementation, the $n$ operand summation (line 2) of the sign estimation procedure is performed once every $\lceil \log_2 n \rceil$ iterations so that the average cost per iteration is constant. The algorithm presented in [5] takes $O(\log_2(M/D) + \log_2 Q)$ steps, where $Q$ is the quotient computed by the algorithm.

The controlled way in which we use general division in our fixed-divisor algorithms renders some of the computations unnecessary. Specifically, when $\lceil \log_2 D \rceil$ is known or is guaranteed to be within a small range, normalization of $D$ can be simplified. In this case, the general division takes $3\lceil \log_2 Q \rceil$ time steps on the average and $3\lceil \log_2 Q \rceil + 2n$ time steps in the worst case. The extra $2n$ time is due to a possible final sign detection by mixed-radix conversion. For software implementation, the sequential time is roughly $2n \log_2 Q$.

3. Multiplying by the divisor reciprocal

Our first algorithm for fixed divisor RNS division precomputes the reciprocal of the divisor and uses it to compute the approximate quotient ($X$ is the dividend and $D$ the divisor):

Preprocessing
1. Compute $C = \lfloor M/D \rfloor$, choose $k$ such that $1 \le k \le n$ and $M[1,k-1] < D < M[1,k]$
Each subsequent division
2. $X' = \lfloor X / M[k,n] \rfloor$
3. $Q = \lfloor X'C / M[1,k-1] \rfloor$
4. $X'' = X - QD$
5. Call general division to obtain $Q'$ and $R$ with $0 \le R < D$ such that $X'' = Q'D + R$
6. Return $Q'' = Q + Q'$ and $R$

In the preprocessing stage, $C$ is computed as the quotient of $M$ divided by $D$, using the general division algorithm. For each subsequent division, first we scale down $X$ by a factor of $1/M[k,n]$ to obtain $X'$. Then $X'$ is multiplied by $C$ and scaled down by $1/M[1,k-1]$ to find an approximate quotient $Q$. The remainder $X''$, found on line 4, can thus be off by a multiple of $D$. We next use general division to divide $X''$ by $D$, thereby correcting the error of the approximate quotient. We call lines 2-4 the coarse modulo stage and lines 5-6 the correction stage.

The error of the approximate quotient is bounded; i.e., there is an upper bound error of $\lceil M[k,n]/D \rceil + 1$ with respect to the correct quotient. Intuitively, $C$ is close to the fraction $M/D = M[1,n]/D$, $X'$ is close to $X/M[k,n]$, and so $Q$ is close to $X'C/M[1,k-1] \approx X/D$. Hence, $Q$ is close to the correct quotient $\lfloor X/D \rfloor$. Any error is due to truncations in the three integer divisions.

The general division needed in the preprocessing stage takes $3(n - k)b$ steps on the average and $2n$ steps more in the worst case [5]. The coarse modulo stage in each subsequent division requires two divisions by products of moduli, $M[k,n]$ and $M[1,k-1]$, besides a few multiplications and additions. Dividing by a product of moduli is by base extension and each takes $2n - 2$ steps. With the largest error in the quotient being $\lceil M[k,n]/D \rceil + 1$, or $(n - k + 1) - (k - 1) = n - 2k + 2$ digits long, the correction stage takes about $3b(n - 2k)$ steps on the average and $3b(n - 2k) + 2n$ steps in the worst case. (When $n < 2k$, it takes constant steps on the average and $2n$ steps in the worst case.) Each subsequent division using this algorithm thus takes about $4n + 3b(n - 2k)$ steps on the average and $6n + 3b(n - 2k)$ steps in the worst case.

In a software implementation, a base extension for $k$ unknown residues takes $n^2 - k^2$ steps, and so a division by a product of moduli takes $n^2 - k^2 + n^2 - (n - k)^2 = n^2 + 2nk - 2k^2$ steps. Our algorithm takes $2n(n - k)b$ steps for preprocessing and $2n^2 + 4nk - 4k^2 + 2nb(n - 2k)$ steps for each division.

4. CRT decoding and table lookup

The second algorithm for fixed divisor division achieves faster computation with a larger lookup table ($n + 1$ entries rather than 1). An outline of the algorithm follows.

Preprocessing
1. $Z = |M|_D$
2. For $i = 1, 2, \ldots, n$ do
3.   $k_i = |-Z D^{-1}|_{m_i}$
4.   Compute $Z_i = (Z + k_i D)/m_i$

Each subsequent division
5. Compute $B(X)$
6. $Y = \sum_{i=1}^{n} |\alpha_i x_i|_{m_i} Z_i + B(X)(D - Z)$
7. Call general division to obtain $Q$ and $R$ with $0 \le R < D$ such that $Y = QD + R$
8. Return $Q' = |(X - R) D^{-1}|_M$ and $R$

The algorithm is based on the Chinese Remainder Theorem. When $X$ is nonnegative, Eq. (2) becomes

$$X = \sum_{i=1}^{n} |\alpha_i x_i|_{m_i} M_i - B(X)\,M. \qquad (3)$$

We view Eq. (3) as a linear decomposition of $X$. To reduce $X$ modulo a fixed divisor $D$, we precompute $Z_i = M_i \bmod D$ and $Z = M \bmod D$. We then have

$$|X|_D \equiv \sum_{i=1}^{n} |\alpha_i x_i|_{m_i} Z_i + B(X)(D - Z) \pmod{D}. \qquad (4)$$

Thus, for each division, the algorithm first computes the weighted sum of Eq. (4) in the coarse modulo stage. The sum, $Y$, is at most $\left( \sum_{i=1}^{n} (m_i - 1) + n \right) D$, or $(\log_2 n)/b + 1$ digits longer than the divisor. Next, the correction stage utilizes general division to further reduce $Y$ to the correct remainder. In step 8, $X - R$ is divisible by $D$ and, since $D$ is assumed to be relatively prime to each $m_i$, $D^{-1}$ exists.

The proof that the expression $Z_i = (Z + k_i D)/m_i$ computed by the algorithm is actually $M_i \bmod D$ follows from the following easily provable statements: $|Z + k_i D|_{m_i} = 0$, $|Z + k_i D|_D = |M|_D$, and $0 \le (Z + k_i D)/m_i < D$.
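The preprocessing and division steps of the second algorithm (lines 1-8 above) can be checked with a small sequential sketch in Python. This is not the authors' parallel implementation: the function names and example values are hypothetical, $B(X)$ is obtained exactly from the CRT sum rather than from the $EF$ table, and Python's `%` operator stands in for the general division of line 7.

```python
from math import prod

def preprocess(moduli, D):
    """Lines 1-4: Z = M mod D and the table Z_i = (Z + k_i*D) / m_i."""
    M = prod(moduli)
    Z = M % D
    table = []
    for m_i in moduli:
        k_i = (-Z * pow(D, -1, m_i)) % m_i   # makes Z + k_i*D divisible by m_i
        table.append((Z + k_i * D) // m_i)
    return Z, table

def divide(x_res, moduli, D, Z, table):
    """Lines 5-8 for a nonnegative X given by its residues x_res."""
    M = prod(moduli)
    # |alpha_i * x_i|_{m_i}, with alpha_i the inverse of M_i modulo m_i
    w = [(pow(M // m, -1, m) * x) % m for x, m in zip(x_res, moduli)]
    # Stand-in for line 5: B(X) taken exactly from the CRT sum, not the EF table.
    B = sum(w_i * (M // m) for w_i, m in zip(w, moduli)) // M
    # Line 6: coarse modulo stage, the weighted sum of Eq. (4).
    Y = sum(w_i * z_i for w_i, z_i in zip(w, table)) + B * (D - Z)
    # Line 7: stand-in for general division (correction stage).
    R = Y % D
    # Line 8: quotient residues |(x_i - R) * D^{-1}|_{m_i}, one per channel.
    q_res = [((x - R) * pow(D, -1, m)) % m for x, m in zip(x_res, moduli)]
    return q_res, R

# Hypothetical example: D = 101 is relatively prime to every modulus.
moduli = [7, 11, 13, 17, 19]
D = 101
Z, table = preprocess(moduli, D)
X = 12345
q_res, R = divide([X % m for m in moduli], moduli, D, Z, table)
```

In this sketch the table entries coincide with $M_i \bmod D$, and q_res holds the residues of $\lfloor X/D \rfloor$; both follow from the statements listed above, given that $D$ is relatively prime to each $m_i$.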
It takes $3b(n - k)$ steps to compute $Z$ with general division, constant time to compute all the $k_i$'s, and $2(n - k)$ time to compute each $Z_i$ with base extension, where $k$ is the number of residues required to represent the divisor $D$, and so is roughly the same as the $k$ defined in Section 3. Total time for the preprocessing stage is $3b(n - k) + 2n(n - k)$. Note that it would take $3b(n - k)(n + 1)$ steps if the preprocessing were performed as $n + 1$ instances of general division. Sequential time is $2nb(n - k) + n(n - k)^2$.

Computation time for each division is analyzed as follows. Computing $B(X)$ takes $\lceil \log_2 n \rceil$ time steps. Computing $|\alpha_i x_i|_{m_i}$ takes one step. The weighted summation takes about $2n$ steps assuming that the time needed to broadcast the $|\alpha_i x_i|_{m_i}$ values to all processors is negligible (each processor does $n + 1$ multiplications for taking the weights into account and $n$ additions). If it takes one time step to send a residue-sized number to an adjacent processor, the broadcast operation takes $n$ steps on a ring. The general division on line 7 takes $3(b + \log_2 n)$ steps on the average and $3(b + \lceil \log_2 n \rceil) + 2n$ steps in the worst case. Total time for each subsequent division is thus $\log_2 n + 2n + 3(b + \log_2 n) \approx 2n + 3b$ steps on the average and $4n + 3b$ steps in the worst case. Sequential time is $2n^2 + 2nb$.

5. Application to RSA cryptography

Encryption and decryption in RSA cryptography are modular exponentiation operations of the form $Z = X^Y \bmod D$. For encryption, $X$ is the plain text, $Y$ and $D$ together comprise the encryption key, and $Z$ is the ciphered text. For decryption, $X$ is the ciphered text, $Y$ and $D$ the decryption key, and $Z$ is the deciphered text. All operands, $X$, $Y$, $D$, are potentially very large integers, perhaps 1000 bits long. Let $m$ be the number of bits in $D$, and let $\log_2 Y \approx m$. A modular exponentiation requires up to $2m$ modular multiplications in a simple square-and-multiply scheme (see, e.g., [10]).
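As a minimal illustration of the square-and-multiply scheme mentioned above, the following Python sketch counts the modular multiplications it performs. The function name mod_exp and the counter are assumptions made for illustration; every `% d` in the sketch is a reduction by the same fixed modulus, which is the step a fixed-divisor division algorithm would replace.

```python
def mod_exp(x, y, d):
    """Left-to-right square-and-multiply.

    Returns (x**y mod d, number of modular multiplications performed).
    """
    result = 1
    count = 0
    for bit in bin(y)[2:]:                # exponent bits, most significant first
        result = (result * result) % d   # squaring, then reduction by the fixed d
        count += 1
        if bit == "1":
            result = (result * x) % d    # conditional multiply, same modulus d
            count += 1
    return result, count

# Hypothetical small operands; real RSA operands are far longer.
value, mults = mod_exp(7, 13, 101)
```

For an exponent of $m$ bits the loop runs $m$ times with at most two multiplications per bit, which gives the bound of $2m$ modular multiplications.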
It is possible to use only $(1 + \epsilon)m$ modular multiplications with exponent recoding [7], and, in decryption only, perform shorter modular operations with respect to the two (secret) factors of $D$. We shall compare our algorithm with existing ones in terms of computation time for each modulo-$D$ multiplication.

Our fixed-divisor algorithms apply to the modulo-$D$ reduction step that follows a regular multiplication. The dynamic range of RNS thus needs to be at least the square of the modulus $D$. The $2m$ instances of modular reduction in a modular exponentiation are viewed as a sequence of divisions with the same divisor $D$. The preprocessing based on $D$ is therefore good for $2m$ divisions, and can be good for many times more when a long message is broken into several modular exponentiations with the same modulus $D$. The preprocessing times are $\frac{3}{2}nb$ and $n^2 + \frac{3}{2}nb$, respectively, for the first and the second algorithm, when $k \approx n/2$. With $n = 2m/b$, the preprocessing times become $3m$ and $4m^2/b^2 + 3m$, both of which are negligible compared to the $O(mn) = O(m^2/b)$ time taken by $2m$ instances of $O(n)$-time division. The conversions between RNS and binary take $O(n) = O(m/b)$ time, also negligible compared to $O(m^2/b)$.

The on-line portion of each algorithm is further divided into the coarse modulo stage and the correction stage. For modular exponentiation, it is not necessary to fully reduce intermediate results modulo $D$. In its coarse modulo stage, the first algorithm reduces an $n$ digit dividend to an approximate remainder up to $n - k + 1$ digits long. This is not enough since we know that $n > 2k$ and that the modular reduction must at least reduce a dividend to half-length to accommodate the squaring in the modular exponentiation. It is necessary, therefore, for the first fixed-divisor algorithm to perform at least $b$ iterations of general division after the coarse modulo stage. The second fixed-divisor algorithm works better.
It produces, in the coarse modulo stage, an approximate remainder up to $(\log_2 n)/b + 1 + k$ digits long. If we make $n$ large enough such that $2[(\log_2 n)/b + 1 + k]$ is no more than $n$, we get sufficient reduction in the coarse modulo stage. The second algorithm is also more efficient in this stage, taking $2n$ steps versus $4n$ steps for the first algorithm. With the coarse modulo stage of the second algorithm and constant-time multiplication inherent in RNS, each modular multiplication takes about $2n \approx 4m/b$ steps. The sequential time is $2n^2 \approx 8m^2/b^2$ for each modular multiplication.

6. Conclusions

We have presented two new algorithms for RNS division with fixed divisors. Adaptation of our second algorithm leads to an efficient RSA implementation, with $4m/b$ steps per modular multiplication.

Existing implementations of RSA encryption can be roughly classified into word-level single processor, bit-level array processors, and word-level array processors. Because of the variety of special hardware involved in the designs, it is rather difficult to compare different designs in terms of time complexity; we almost always have to compare actual or estimated encryption rates of the designs.

We compare our proposed method with two classical sequential methods: one uses a binary version of multiplying by divisor reciprocal for modular reduction [2], the other uses a residue table for modular reduction [6]. With a $b$-bit processor, a modular multiplication takes $9(m/b)^2$ and $4(m/b)^2$ steps, respectively. Treating a $b$-bit processor as having hardware complexity of $b^2$, the conventional methods have hardware-time products of $9m^2$ and $4m^2$. Our method has a hardware complexity of $(2m/b)b^2 = 2mb$, and a hardware-time product of $8m^2$. On the basis of hardware and time complexity, our design competes well with sequential implementation of classical methods. Actual encryption speed depends on the hardware platform, and is still under investigation.
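The hardware-time figures quoted above can be retraced as follows, using only quantities stated in the text ($n = 2m/b$ residue processors, each counted as $b^2$ hardware, and about $2n$ steps per modular multiplication); the symbols $H$ and $T$ are introduced here only for this worked arithmetic.

```latex
\begin{align*}
H_{\text{RNS}} &= n \cdot b^2 = \frac{2m}{b}\, b^2 = 2mb, &
T_{\text{RNS}} &\approx 2n = \frac{4m}{b}, &
H_{\text{RNS}}\, T_{\text{RNS}} &\approx 2mb \cdot \frac{4m}{b} = 8m^2, \\
H_{[2]} &= b^2, &
T_{[2]} &= 9\left(\frac{m}{b}\right)^2, &
H_{[2]}\, T_{[2]} &= 9m^2, \\
H_{[6]} &= b^2, &
T_{[6]} &= 4\left(\frac{m}{b}\right)^2, &
H_{[6]}\, T_{[6]} &= 4m^2.
\end{align*}
```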
While we have analyzed the time complexity of our algorithms, there are many implementation details that must be considered: for example, the communication and storage requirements of the algorithm, integration of the binary-residue and residue-binary conversions into the algorithm, and the possibility of systolic implementation.

Other cryptographic algorithms can benefit from our new techniques. Our choice of RSA to illustrate the efficiency of these techniques is merely a reflection of the fact that it is better known and more widely applied. We are also looking for other applications of our new residue division algorithms.

The $EF$ function serves as an index function of residue numbers in our sign detection procedure. In a recent publication [4], Dimauro et al. propose another index function, called the Sum of Quotients, for comparison of residue numbers. While a straightforward implementation of their technique seems as expensive as residue-to-binary conversion with the Chinese Remainder Theorem, it remains to be investigated whether a truncated version leads to an efficient approximate comparison procedure.

Acknowledgement

Careful reading of the manuscript by the referees has led to a significant improvement in our presentation. We thank them for their efforts.

References

[1] D.K. Banerji, T.-Y. Cheung and V. Ganesan, A high-speed division method in residue arithmetic, in: Proc. 5th Symp. on Computer Arithmetic (IEEE Press, New York, 1981) 158-164.
[2] P. Barrett, Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor, in: A.M. Odlyzko, ed., Advances in Cryptology, Proc. Crypto 86 (Springer, Berlin, 1986) 311-323.
[3] W.A. Chren Jr, A new residue number system division algorithm, Comput. Math. Appl. 19 (7) (1990) 13-29.
[4] G. Dimauro, S. Impedovo and G. Pirlo, A new technique for fast number comparison in the residue number system, IEEE Trans. Comput. 42 (5) (1993) 608-612.
[5] C.Y. Hung and B. Parhami, An approximate sign detection method for residue numbers and its application to RNS division, Comput. Math. Appl. 27 (4) (1994) 23-35.
[6] S. Kawamura and K. Hirano, A fast modular arithmetic algorithm using a residue table, in: C.G. Günther, ed., Advances in Cryptology, Proc. Eurocrypt 88 (Springer, Berlin, 1988) 245-250.
[7] C.K. Koc and C.-Y. Hung, Adaptive m-ary segmentation and canonical recoding algorithms for multiplication of large binary numbers, Comput. Math. Appl. 24 (3) (1992) 3-12.
[8] M. Lu and J.-S. Chiang, A novel division algorithm for the residue number system, IEEE Trans. Comput. 41 (8) (1992) 1026-1032.
[9] N.S. Szabo and R.I. Tanaka, Residue Arithmetic and its Applications to Computer Technology (McGraw-Hill, New York, 1967).
[10] N. Takagi, A radix-4 modular multiplication hardware algorithm for modular exponentiation, IEEE Trans. Comput. 41 (8) (1992) 949-956.
