In maths class you’ve seen how arithmetic operators like addition, subtraction, multiplication, and division are integral to performing computations. They are ever present in the field of computer science (being a subset of maths, I’d argue maths is the representation of the principles of the universe) as well.

Arithmetics in programming

In programming languages, the same operators are used to perform arithmetic operations.

func multiply(a, b int) int {
    return a * b
}

func add(a, b int) int {
    return a + b
}

func subtract(a, b int) int {
    return a - b
}

func divide(a, b int) int {
    return a / b
}

func modulo(a, b int) int {
    return a % b
}

func main() {
    fmt.Println(divide(6,3))    // 2
    fmt.Println(subtract(4,1))  // 3
    fmt.Println(add(2,3))      //  5
    fmt.Println(multiply(2,3)) //  6
    fmt.Println(modulo(14,2)) //   7
}

Bitwise for computation

The topic of focus here is multiplication. And how we can multiply two numbers without using the standard multiplication operator (*). Why would you want to do this? You may ask. Well, in the past when hardware wasn’t as powerful as it is today and resources were limited, systems programmers had to come up with more efficient ways of performing computations. One such way is utilizing bitwise operators to perform computations. An example of this was if you wanted to half a number, you would shift the number by one bit to the right (right shift >>).

func half(a int) int {
    return a >> 1
}

func main() {
    fmt.Println(half(2)) //  1
    fmt.Println(half(4)) //  2
    fmt.Println(half(6)) //  3
    fmt.Println(half(8)) //  4
    fmt.Println(half(10)) // 5
}

Or if you wanted to double a number, you would shift the number by one bit to the left (left shift <<).

func double(a int) int {
    return a << 1
}

func main() {
    fmt.Println(double(2)) //  4
    fmt.Println(double(4)) //  8
    fmt.Println(double(6)) //  12
    fmt.Println(double(8)) //  16
    fmt.Println(double(10)) // 20
}

As you can see in the last two examples we have carried out some basic computation using bitwise operators instead of standard arithmetic operations. We won’t be going too much in-depth about bitwise operators here, but you can read more about them here.

Russian Peasant Algorithm

Let’s say we are given two numbers a and b, to find a * b we must:

  • let a third variable c which will hold our result be 0
  • if b is odd we add a to c otherwise we skip
  • double a
  • half b
  • repeat the process until b becomes 0

To put it in simple terms, the algorithm involves breaking the multiplication into a series of additions to arrive at the result.

func multiply(a, b int) int {
	result := 0

	for b > 0 {
		// check if b is odd
		if b&1 == 1 {
			result += a
		}

		a <<= 1 // double a
		b >>= 1 // half b

	}

	return result
}

func main() {
    multiply(2,3) // 6
    multiply(3,3) // 8
    multiply(4,4) // 16
    multiply(4,5) // 20  
}

Things to note from the implementation

  • If you noticed from the implementation above, we check if b is odd by using the bitwise AND operator & (b is odd if b and 1 have a common bit set i.e b & 1 == 1).
  • we double a by left shifting by one bit (a << 1) like we did in our earlier examples.
  • we half b by right shifting by one bit (b >> 1) same as in our earlier examples.

How does this compare with the standard multiplication operator?

Instead of telling you how this compares with the standard multiplication operator, we’ll do what scientists do. Let’s run some benchmarks and evaluate from the data of our results. We’ll write benchmark tests in Golang, Rust, and C and run them on an Apple M1 Pro machine.

Go Benchmark

package main

import "testing"

func russianPeasantMultiply(a, b int) int {
	result := 0
	for b > 0 {
		if b&1 == 1 {
			result += a
		}
		a <<= 1
		b >>= 1
	}
	return result
}

func standardMultiply(a, b int) int {
	return a * b
}

func BenchmarkRussianPeasantSmall(b *testing.B) {
	for i := 0; i < b.N; i++ {
		russianPeasantMultiply(12, 15)
	}
}

func BenchmarkStandardSmall(b *testing.B) {
	for i := 0; i < b.N; i++ {
		standardMultiply(12, 15)
	}
}

Running go test -bench=. -benchmem:

BenchmarkInput SizeTime (ns/op)
Russian PeasantSmall (12 × 15)~2.5 ns
StandardSmall (12 × 15)~0.32 ns
Russian PeasantMedium (1234 × 5678)~5.4 ns
StandardMedium (1234 × 5678)~0.32 ns
Russian PeasantLarge (123456 × 789012)~7.7 ns
StandardLarge (123456 × 789012)~0.32 ns

The standard multiplication is 8-24x faster depending on input size.

Rust Benchmark

Using the criterion benchmarking library:

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn russian_peasant_multiply(mut a: i64, mut b: i64) -> i64 {
    let mut result: i64 = 0;
    while b > 0 {
        if b & 1 == 1 {
            result += a;
        }
        a <<= 1;
        b >>= 1;
    }
    result
}

fn standard_multiply(a: i64, b: i64) -> i64 {
    a * b
}

fn benchmark(c: &mut Criterion) {
    c.bench_function("russian_peasant", |b| {
        b.iter(|| russian_peasant_multiply(black_box(1234), black_box(5678)))
    });
    c.bench_function("standard", |b| {
        b.iter(|| standard_multiply(black_box(1234), black_box(5678)))
    });
}

criterion_group!(benches, benchmark);
criterion_main!(benches);

Running cargo bench:

BenchmarkInput SizeTime
Russian PeasantSmall (12 × 15)~2.2 ns
StandardSmall (12 × 15)~485 ps
Russian PeasantMedium (1234 × 5678)~5.7 ns
StandardMedium (1234 × 5678)~506 ps
Russian PeasantLarge (123456 × 789012)~9.8 ns
StandardLarge (123456 × 789012)~525 ps

The standard multiplication is 4-19x faster in Rust.

C Benchmark

#include <stdio.h>
#include <stdint.h>
#include <time.h>

__attribute__((noinline))
int64_t russian_peasant_multiply(int64_t a, int64_t b) {
    int64_t result = 0;
    while (b > 0) {
        if (b & 1) {
            result += a;
        }
        a <<= 1;
        b >>= 1;
    }
    return result;
}

__attribute__((noinline))
int64_t standard_multiply(int64_t a, int64_t b) {
    return a * b;
}

Compiling with clang -O0 (no optimization):

BenchmarkInput SizeTime (ns/op)
Russian PeasantSmall (12 × 15)~8.1 ns
StandardSmall (12 × 15)~2.2 ns
Russian PeasantMedium (1234 × 5678)~21.3 ns
StandardMedium (1234 × 5678)~2.1 ns
Russian PeasantLarge (123456 × 789012)~27.4 ns
StandardLarge (123456 × 789012)~2.1 ns

Without optimizations, standard multiplication is 4-13x faster.

Interestingly, with -O3 (maximum optimization), both methods perform nearly identically (~0.3 ns/op). The compiler is smart enough to optimize the Russian Peasant algorithm to match standard multiplication performance.

Why is standard multiplication so much faster?

The answer lies in modern CPU architecture.

Hardware multiplication units

Modern CPUs have dedicated hardware circuits called multiplier units (often implemented as part of the ALU - Arithmetic Logic Unit). These circuits can multiply two integers in a single clock cycle using techniques like:

  • Booth’s algorithm: An efficient hardware multiplication algorithm
  • Wallace trees: Parallel carry-save adder structures
  • Dadda multipliers: Optimized versions of Wallace trees

When you write a * b in your code, it compiles down to a single MUL instruction that executes in constant time (typically 1-3 clock cycles) regardless of the operand values.

The loop overhead

The Russian Peasant algorithm requires:

  • A loop that runs log2(b) times - since we halve b each iteration, the loop runs once for each bit in b’s binary representation. For example, if b = 15 (binary: 1111), we halve through: 15 → 7 → 3 → 1 → 0, which is 4 iterations (the number of bits in 15).
  • Conditional branch checking (if b & 1)
  • Multiple bitwise operations per iteration
  • Memory accesses for the accumulator

Even though individual bitwise operations (<<, >>, &) are fast (single-cycle), the loop overhead, branch prediction misses, and instruction dependencies make it significantly slower than a single MUL instruction.

Compiler optimizations

As we saw in the C benchmark with -O3, modern compilers are incredibly sophisticated. They can:

  • Recognize multiplication patterns and replace them with MUL instructions
  • Unroll loops and eliminate branches
  • Use SIMD (Single Instruction, Multiple Data) instructions
  • Perform constant propagation and dead code elimination

This is why the optimized C code showed nearly identical performance for both methods - the compiler likely recognized what the Russian Peasant algorithm was doing and optimized it accordingly.

Seeing the difference in assembly

To truly appreciate the difference, let’s look at the actual assembly code generated by the compilers.

Go Assembly (x86-64):

; Russian Peasant - 41 bytes, ~15 instructions with loops and branches
main.russianPeasantMultiply:
    XORL    CX, CX              ; result = 0
    JMP     loop_check
loop_body:
    LEAQ    (CX)(AX*1), DX      ; temp = result + a
    SHLQ    $1, AX              ; a <<= 1
    MOVQ    BX, SI
    SARQ    $1, SI              ; b >>= 1
    TESTQ   $1, BX              ; if b & 1
    CMOVQNE DX, CX              ; result = temp (conditional)
    MOVQ    SI, BX
loop_check:
    TESTQ   BX, BX              ; while b > 0
    JGT     loop_body
    MOVQ    CX, AX
    RET

; Standard Multiply - 5 bytes, just 2 instructions!
main.standardMultiply:
    IMULQ   BX, AX              ; result = a * b
    RET

Rust Assembly (ARM64):

; Russian Peasant - multiple instructions with loop
russian_peasant_multiply:
    cmp     x1, #1              ; compare b with 1
    b.lt    return_zero         ; if b < 1, return 0
    mov     x8, #0              ; result = 0
loop:
    sbfx    x9, x1, #0, #1      ; extract bit 0 of b
    and     x9, x9, x0          ; mask with a
    add     x8, x9, x8          ; result += masked value
    lsl     x0, x0, #1          ; a <<= 1
    cmp     x1, #2              ; compare b with 2
    lsr     x1, x1, #1          ; b >>= 1
    b.hs    loop                ; continue if b >= 1
    mov     x0, x8
    ret
return_zero:
    mov     x8, #0
    mov     x0, x8
    ret

; Standard Multiply - just 2 instructions!
standard_multiply:
    mul     x0, x1, x0          ; result = a * b
    ret

The difference is stark: standard multiplication compiles to a single IMULQ (x86) or MUL (ARM) instruction, while the Russian Peasant algorithm requires a loop with conditional branches, shifts, and additions.

You can explore this yourself using Compiler Explorer (Godbolt) - paste the code, select your language and compiler, and see the assembly in real-time. Try it with Go, Rust, or C and compare the output!

When might the Russian Peasant algorithm still be useful?

Despite being slower on modern hardware, there are scenarios where this algorithm (or its principles) remains relevant:

1. Educational purposes

Understanding the Russian Peasant algorithm teaches fundamental concepts:

  • How multiplication can be decomposed into additions and bit shifts
  • The relationship between binary representation and arithmetic
  • Algorithm analysis and optimization thinking

For example, you can apply the same principle to arbitrary-precision integers using Go’s math/big package:

// Russian Peasant multiplication for arbitrarily large integers
func bigIntMultiply(a, b *big.Int) *big.Int {
    result := big.NewInt(0)
    aCopy := new(big.Int).Set(a)
    bCopy := new(big.Int).Set(b)

    zero := big.NewInt(0)
    one := big.NewInt(1)

    for bCopy.Cmp(zero) > 0 {
        if new(big.Int).And(bCopy, one).Cmp(one) == 0 {
            result.Add(result, aCopy)
        }
        aCopy.Lsh(aCopy, 1)  // double
        bCopy.Rsh(bCopy, 1)  // halve
    }
    return result
}

This demonstrates how the algorithm generalizes beyond native integers. In practice, libraries like math/big use far more sophisticated algorithms (Karatsuba, Toom-Cook) that are ~300x faster, but understanding this decomposition helps when studying how those optimized algorithms work.

2. Embedded systems without hardware multipliers

Some microcontrollers (particularly older or ultra-low-power ones) lack dedicated multiplication hardware. In these environments, multiplication is already implemented as a software routine, and understanding bit manipulation can help optimize these implementations.

3. Cryptographic applications

Modular exponentiation (used in RSA and other cryptographic algorithms) uses a similar principle called binary exponentiation or exponentiation by squaring:

// Compute (base^exp) % mod efficiently
func modularExponentiation(base, exp, mod int) int {
    result := 1
    base = base % mod

    for exp > 0 {
        if exp&1 == 1 {
            result = (result * base) % mod
        }
        exp >>= 1
        base = (base * base) % mod
    }
    return result
}

This reduces the time complexity from O(n) to O(log n) for computing large powers.

4. Hardware design

When designing digital circuits (FPGAs, ASICs), understanding how multiplication can be broken down into shifts and additions is essential. The Russian Peasant algorithm is conceptually similar to how hardware multipliers work internally.

Conclusion

The Russian Peasant algorithm is a clever technique that demonstrates how multiplication can be performed using only addition and bit shifting. While it’s significantly slower than hardware multiplication on modern CPUs (as our benchmarks showed), it remains valuable for:

  • Learning about computer arithmetic and binary operations
  • Cryptographic applications like modular exponentiation
  • Scenarios where hardware multiplication isn’t available
  • Understanding the foundations of how computers actually work

The real lesson here isn’t that you should use this algorithm for everyday multiplication - you shouldn’t. Rather, it’s a window into computational thinking and the remarkable optimization that modern CPUs and compilers provide. What once required clever algorithmic tricks is now handled by dedicated silicon running at billions of operations per second.


References

[1] Knuth, D. E. (1997). The Art of Computer Programming, Volume 2: Seminumerical Algorithms (3rd ed.). Addison-Wesley. Section 4.6.3: Evaluation of Powers. https://www-cs-faculty.stanford.edu/~knuth/taocp.html

[2] Booth, A. D. (1951). “A signed binary multiplication technique.” Quarterly Journal of Mechanics and Applied Mathematics, 4(2), 236–240. https://doi.org/10.1093/qjmam/4.2.236

[3] Wallace, C. S. (1964). “A suggestion for a fast multiplier.” IEEE Transactions on Electronic Computers, EC-13(1), 14–17. https://doi.org/10.1109/PGEC.1964.263830

[4] Go Documentation. “math/big package.” https://pkg.go.dev/math/big

[5] Godbolt, M. “Compiler Explorer.” https://godbolt.org/

[6] Warren, H. S. (2012). Hacker’s Delight (2nd ed.). Addison-Wesley. Chapter 8: Multiplication. https://www.oreilly.com/library/view/hackers-delight-second/9780133084993/

[7] Menezes, A. J., van Oorschot, P. C., & Vanstone, S. A. (1996). Handbook of Applied Cryptography. CRC Press. Chapter 14: Efficient Implementation. http://cacr.uwaterloo.ca/hac/