Simple yet efficient parallel signed multiplier design using radix-8 structure
N.V.V.K. Boppana and S. Ren
The continued quest for a low-power, high-performance hardware algorithm for signed number multiplication has led to the design of a simple and novel radix-8 signed number multiplier with 3-bit grouping, in which partial product reduction is performed using the magnitudes of the multiplicand and the multiplier. The pre-computation stage comprises magnitude calculation and the non-trivial computations required to generate partial products. A new partial product reduction strategy is deployed in the design to improve speed at low cost. 8 × 8, 16 × 16, 32 × 32, and 64 × 64 designs are presented for the proposed architectures. Performance results include area, power, delay, and power-delay product of synthesized and post-layout designs using 32 nm CMOS technology with a 1.05 V supply voltage.
Introduction: Multiplication is the most frequently used computer arithmetic operation after addition and subtraction. Digital signal processors (DSPs) use multipliers for computationally intensive applications such as filtering, convolution, fast Fourier transform (FFT), and audio/video codecs. High-performance computing hardware for scientific computing, such as CPUs and graphics processing units (GPUs), relies heavily on these fundamental arithmetic operations. DSPs spend most of their time multiplying and require significant chip area for multipliers to meet performance requirements. Multipliers are often a dominant factor in critical path delay, which in turn affects throughput in pipelined designs, and they consume significant power in applications such as multimedia and DSP. Demand has been increasing for low-power portable computing and communication devices, such as smart watches, Internet of Things (IoT) devices, mobile phones, laptops, and PCs, which run signal processing and other multiplication-intensive algorithms.
Modified Booth Multiplier: The conventional radix-4 Booth multiplier algorithm, which employs the Booth encoding scheme, halves the number of partial products (PPs) and hence reduces computational latency (D), design area (A), and overall power consumption (P). A modified Booth multiplier design of size 8 × 8 is presented in [1]; it uses a minimized Booth encoder, replaces the adder/subtractor block with a 9-bit wide 2:1 multiplexer (MUX) in the first stage, and replaces the full-length adder/subtractor and MUX blocks with 9-bit wide blocks in later stages to obtain a low power-delay product (PDP). The Booth multiplier was further optimized in [2] to achieve low cost and high performance by optimizing the binary two's complement (B2C) and the Booth encoder and by deploying parallel addition to reduce the three-stage PP reduction to two stages.
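To illustrate why radix-4 Booth encoding halves the number of partial products, the following is a minimal behavioural sketch in Python of the textbook radix-4 recoding. It is not the circuit of [1] or [2]; the function names `booth_radix4_digits` and `booth_radix4_multiply` are hypothetical and chosen only for this example. Each overlapping bit triplet of the multiplier maps to a digit in {-2, -1, 0, 1, 2}, so an n-bit multiplier yields n/2 partial products.

```python
def booth_radix4_digits(b: int, n_bits: int) -> list[int]:
    """Recode an n-bit two's complement multiplier into n/2 radix-4 Booth digits.

    Illustrative helper (not from the letter). Digit i is
    -2*b[2i+1] + b[2i] + b[2i-1], with b[-1] = 0.
    """
    if n_bits % 2 != 0:
        raise ValueError("n_bits must be even for radix-4 recoding")
    pattern = b & ((1 << n_bits) - 1)  # two's complement bit pattern of b
    digits = []
    prev = 0  # implicit bit b[-1]
    for i in range(0, n_bits, 2):
        b0 = (pattern >> i) & 1
        b1 = (pattern >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def booth_radix4_multiply(a: int, b: int, n_bits: int = 8) -> int:
    """Multiply by summing one partial product per Booth digit."""
    digits = booth_radix4_digits(b, n_bits)
    # Each digit d selects 0, +/-a, or +/-2a; its weight is 4**i.
    return sum((d * a) << (2 * i) for i, d in enumerate(digits))
```

For an 8 × 8 multiplication this produces four partial products instead of eight, each requiring only a shift and an optional negation of the multiplicand.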
Proposed radix-8 based multiplier: A new, simple yet efficient, radix-8 parallel signed multiplier is presented in this letter; it is designed, synthesized, and assessed for performance using 32 nm CMOS technology at a 1.05 V supply voltage. The design of the proposed radix-8 architecture is inspired by the low-cost 64-bit digital comparator design presented in [3], with the redundant computations taken out and performed at the beginning in the form of XOR-XNOR to reduce P, D, and A. In the proposed architecture, the magnitudes of the multiplicand (A) and multiplier (B) are computed, resulting in Ap (or X) and Bp, respectively. The redundant non-trivial (NT) operations are computed in the precomputation stage, shown in Fig. 1, to reduce the number of PPs and thereby the additions required for PP reduction. This is achieved by grouping 3 bits at a time from the LSB side towards the MSB side and by taking the non-trivial computations out of the PPs, which are a combination of trivial and non-trivial values. As shown in Table 1, trivial computations are just shifts or no operation; non-trivial computations include computing 3X (P), 5X (Q), and 7X (R) in the precomputation stage. The fourth NT value, 6X, does not need any separate computation; it is generated by left-shifting the 3X value from the NT block by one bit position. It can be seen that for larger multiplier sizes, the proposed design is progressively more efficient than reported state-of-the-art modified Booth multiplier designs.
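The magnitude-based radix-8 scheme described above can be sketched behaviourally in Python. This is only an arithmetic model under stated assumptions, not the letter's hardware: the function name `radix8_signed_multiply` is hypothetical, the way 3X, 5X, and 7X are formed (shift-and-add or shift-and-subtract) is an assumption, the digit-to-multiple mapping is inferred from the description of trivial and non-trivial values rather than copied from Table 1, and the partial products are simply summed rather than reduced with the letter's reduction strategy.

```python
def radix8_signed_multiply(a: int, b: int, n_bits: int = 8) -> int:
    """Behavioural model of a sign-magnitude radix-8 multiplication."""
    lo, hi = -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1
    if not (lo <= a <= hi and lo <= b <= hi):
        raise ValueError("operands must fit in n_bits two's complement")

    # Precomputation stage: magnitudes and sign of the result.
    x = abs(a)    # Ap (or X), magnitude of the multiplicand
    bp = abs(b)   # Bp, magnitude of the multiplier
    negative = (a < 0) != (b < 0)

    # Non-trivial multiples, computed once (assumed shift-add forms).
    p = x + (x << 1)   # 3X
    q = x + (x << 2)   # 5X
    r = (x << 3) - x   # 7X

    # Multiples selected by a 3-bit group; 1X, 2X, 4X are plain shifts
    # and 6X is 3X shifted left by one bit position.
    multiples = [0, x, x << 1, p, x << 2, q, p << 1, r]

    # Group Bp in 3-bit digits from the LSB side; one PP per group.
    n_groups = -(-n_bits // 3)  # ceil(n_bits / 3)
    product = 0
    for g in range(n_groups):
        digit = (bp >> (3 * g)) & 0b111
        product += multiples[digit] << (3 * g)

    return -product if negative else product
```

With 3-bit grouping, an 8-bit multiplier magnitude needs three partial products and a 64-bit one needs 22, compared with 4 and 32 for radix-4 Booth encoding; the cost is the one-time computation of 3X, 5X, and 7X.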