In maths class you’ve seen how arithmetic operators like addition, subtraction, multiplication, and division are integral to performing computations. They are ever present in the field of computer science (being a subset of maths, I’d argue maths is the representation of the principles of the universe) as well.
Arithmetics in programming
In programming languages, the same operators are used to perform arithmetic operations.
func multiply(a, b int) int {
return a * b
}
func add(a, b int) int {
return a + b
}
func subtract(a, b int) int {
return a - b
}
func divide(a, b int) int {
return a / b
}
func modulo(a, b int) int {
return a % b
}
func main() {
fmt.Println(divide(6,3)) // 2
fmt.Println(subtract(4,1)) // 3
fmt.Println(add(2,3)) // 5
fmt.Println(multiply(2,3)) // 6
fmt.Println(modulo(14,2)) // 7
}
Bitwise for computation
The topic of focus here is multiplication. And how we can multiply two numbers without using the standard multiplication operator (*).
Why would you want to do this? You may ask. Well, in the past when hardware wasn’t as powerful as it is today and resources were limited,
systems programmers had to come up with more efficient ways of performing computations. One such way is utilizing bitwise operators to perform
computations. An example of this was if you wanted to half a number, you would shift the number by one bit to the right (right shift >>).
func half(a int) int {
return a >> 1
}
func main() {
fmt.Println(half(2)) // 1
fmt.Println(half(4)) // 2
fmt.Println(half(6)) // 3
fmt.Println(half(8)) // 4
fmt.Println(half(10)) // 5
}
Or if you wanted to double a number, you would shift the number by one bit to the left (left shift <<).
func double(a int) int {
return a << 1
}
func main() {
fmt.Println(double(2)) // 4
fmt.Println(double(4)) // 8
fmt.Println(double(6)) // 12
fmt.Println(double(8)) // 16
fmt.Println(double(10)) // 20
}
As you can see in the last two examples we have carried out some basic computation using bitwise operators instead of standard arithmetic operations. We won’t be going too much in-depth about bitwise operators here, but you can read more about them here.
Russian Peasant Algorithm
Let’s say we are given two numbers a and b, to find a * b we must:
- let a third variable
cwhich will hold our result be0 - if
bis odd we addatocotherwise we skip - double
a - half
b - repeat the process until
bbecomes 0
To put it in simple terms, the algorithm involves breaking the multiplication into a series of additions to arrive at the result.
func multiply(a, b int) int {
result := 0
for b > 0 {
// check if b is odd
if b&1 == 1 {
result += a
}
a <<= 1 // double a
b >>= 1 // half b
}
return result
}
func main() {
multiply(2,3) // 6
multiply(3,3) // 8
multiply(4,4) // 16
multiply(4,5) // 20
}
Things to note from the implementation
- If you noticed from the implementation above, we check if
bis odd by using the bitwise AND operator&(bis odd ifband1have a common bit set i.eb & 1 == 1). - we double
aby left shifting by one bit (a << 1) like we did in our earlier examples. - we half
bby right shifting by one bit (b >> 1) same as in our earlier examples.
How does this compare with the standard multiplication operator?
Instead of telling you how this compares with the standard multiplication operator, we’ll do what scientists do. Let’s run some benchmarks and evaluate from the data of our results. We’ll write benchmark tests in Golang, Rust, and C and run them on an Apple M1 Pro machine.
Go Benchmark
package main
import "testing"
func russianPeasantMultiply(a, b int) int {
result := 0
for b > 0 {
if b&1 == 1 {
result += a
}
a <<= 1
b >>= 1
}
return result
}
func standardMultiply(a, b int) int {
return a * b
}
func BenchmarkRussianPeasantSmall(b *testing.B) {
for i := 0; i < b.N; i++ {
russianPeasantMultiply(12, 15)
}
}
func BenchmarkStandardSmall(b *testing.B) {
for i := 0; i < b.N; i++ {
standardMultiply(12, 15)
}
}
Running go test -bench=. -benchmem:
| Benchmark | Input Size | Time (ns/op) |
|---|---|---|
| Russian Peasant | Small (12 × 15) | ~2.5 ns |
| Standard | Small (12 × 15) | ~0.32 ns |
| Russian Peasant | Medium (1234 × 5678) | ~5.4 ns |
| Standard | Medium (1234 × 5678) | ~0.32 ns |
| Russian Peasant | Large (123456 × 789012) | ~7.7 ns |
| Standard | Large (123456 × 789012) | ~0.32 ns |
The standard multiplication is 8-24x faster depending on input size.
Rust Benchmark
Using the criterion benchmarking library:
use criterion::{black_box, criterion_group, criterion_main, Criterion};
fn russian_peasant_multiply(mut a: i64, mut b: i64) -> i64 {
let mut result: i64 = 0;
while b > 0 {
if b & 1 == 1 {
result += a;
}
a <<= 1;
b >>= 1;
}
result
}
fn standard_multiply(a: i64, b: i64) -> i64 {
a * b
}
fn benchmark(c: &mut Criterion) {
c.bench_function("russian_peasant", |b| {
b.iter(|| russian_peasant_multiply(black_box(1234), black_box(5678)))
});
c.bench_function("standard", |b| {
b.iter(|| standard_multiply(black_box(1234), black_box(5678)))
});
}
criterion_group!(benches, benchmark);
criterion_main!(benches);
Running cargo bench:
| Benchmark | Input Size | Time |
|---|---|---|
| Russian Peasant | Small (12 × 15) | ~2.2 ns |
| Standard | Small (12 × 15) | ~485 ps |
| Russian Peasant | Medium (1234 × 5678) | ~5.7 ns |
| Standard | Medium (1234 × 5678) | ~506 ps |
| Russian Peasant | Large (123456 × 789012) | ~9.8 ns |
| Standard | Large (123456 × 789012) | ~525 ps |
The standard multiplication is 4-19x faster in Rust.
C Benchmark
#include <stdio.h>
#include <stdint.h>
#include <time.h>
__attribute__((noinline))
int64_t russian_peasant_multiply(int64_t a, int64_t b) {
int64_t result = 0;
while (b > 0) {
if (b & 1) {
result += a;
}
a <<= 1;
b >>= 1;
}
return result;
}
__attribute__((noinline))
int64_t standard_multiply(int64_t a, int64_t b) {
return a * b;
}
Compiling with clang -O0 (no optimization):
| Benchmark | Input Size | Time (ns/op) |
|---|---|---|
| Russian Peasant | Small (12 × 15) | ~8.1 ns |
| Standard | Small (12 × 15) | ~2.2 ns |
| Russian Peasant | Medium (1234 × 5678) | ~21.3 ns |
| Standard | Medium (1234 × 5678) | ~2.1 ns |
| Russian Peasant | Large (123456 × 789012) | ~27.4 ns |
| Standard | Large (123456 × 789012) | ~2.1 ns |
Without optimizations, standard multiplication is 4-13x faster.
Interestingly, with -O3 (maximum optimization), both methods perform nearly identically (~0.3 ns/op). The compiler is smart enough to optimize the Russian Peasant algorithm to match standard multiplication performance.
Why is standard multiplication so much faster?
The answer lies in modern CPU architecture.
Hardware multiplication units
Modern CPUs have dedicated hardware circuits called multiplier units (often implemented as part of the ALU - Arithmetic Logic Unit). These circuits can multiply two integers in a single clock cycle using techniques like:
- Booth’s algorithm: An efficient hardware multiplication algorithm
- Wallace trees: Parallel carry-save adder structures
- Dadda multipliers: Optimized versions of Wallace trees
When you write a * b in your code, it compiles down to a single MUL instruction that executes in constant time (typically 1-3 clock cycles) regardless of the operand values.
The loop overhead
The Russian Peasant algorithm requires:
- A loop that runs
log2(b)times - since we halvebeach iteration, the loop runs once for each bit inb’s binary representation. For example, ifb = 15(binary:1111), we halve through: 15 → 7 → 3 → 1 → 0, which is 4 iterations (the number of bits in 15). - Conditional branch checking (
if b & 1) - Multiple bitwise operations per iteration
- Memory accesses for the accumulator
Even though individual bitwise operations (<<, >>, &) are fast (single-cycle), the loop overhead, branch prediction misses, and instruction dependencies make it significantly slower than a single MUL instruction.
Compiler optimizations
As we saw in the C benchmark with -O3, modern compilers are incredibly sophisticated. They can:
- Recognize multiplication patterns and replace them with
MULinstructions - Unroll loops and eliminate branches
- Use SIMD (Single Instruction, Multiple Data) instructions
- Perform constant propagation and dead code elimination
This is why the optimized C code showed nearly identical performance for both methods - the compiler likely recognized what the Russian Peasant algorithm was doing and optimized it accordingly.
Seeing the difference in assembly
To truly appreciate the difference, let’s look at the actual assembly code generated by the compilers.
Go Assembly (x86-64):
; Russian Peasant - 41 bytes, ~15 instructions with loops and branches
main.russianPeasantMultiply:
XORL CX, CX ; result = 0
JMP loop_check
loop_body:
LEAQ (CX)(AX*1), DX ; temp = result + a
SHLQ $1, AX ; a <<= 1
MOVQ BX, SI
SARQ $1, SI ; b >>= 1
TESTQ $1, BX ; if b & 1
CMOVQNE DX, CX ; result = temp (conditional)
MOVQ SI, BX
loop_check:
TESTQ BX, BX ; while b > 0
JGT loop_body
MOVQ CX, AX
RET
; Standard Multiply - 5 bytes, just 2 instructions!
main.standardMultiply:
IMULQ BX, AX ; result = a * b
RET
Rust Assembly (ARM64):
; Russian Peasant - multiple instructions with loop
russian_peasant_multiply:
cmp x1, #1 ; compare b with 1
b.lt return_zero ; if b < 1, return 0
mov x8, #0 ; result = 0
loop:
sbfx x9, x1, #0, #1 ; extract bit 0 of b
and x9, x9, x0 ; mask with a
add x8, x9, x8 ; result += masked value
lsl x0, x0, #1 ; a <<= 1
cmp x1, #2 ; compare b with 2
lsr x1, x1, #1 ; b >>= 1
b.hs loop ; continue if b >= 1
mov x0, x8
ret
return_zero:
mov x8, #0
mov x0, x8
ret
; Standard Multiply - just 2 instructions!
standard_multiply:
mul x0, x1, x0 ; result = a * b
ret
The difference is stark: standard multiplication compiles to a single IMULQ (x86) or MUL (ARM) instruction, while the Russian Peasant algorithm requires a loop with conditional branches, shifts, and additions.
You can explore this yourself using Compiler Explorer (Godbolt) - paste the code, select your language and compiler, and see the assembly in real-time. Try it with Go, Rust, or C and compare the output!
When might the Russian Peasant algorithm still be useful?
Despite being slower on modern hardware, there are scenarios where this algorithm (or its principles) remains relevant:
1. Educational purposes
Understanding the Russian Peasant algorithm teaches fundamental concepts:
- How multiplication can be decomposed into additions and bit shifts
- The relationship between binary representation and arithmetic
- Algorithm analysis and optimization thinking
For example, you can apply the same principle to arbitrary-precision integers using Go’s math/big package:
// Russian Peasant multiplication for arbitrarily large integers
func bigIntMultiply(a, b *big.Int) *big.Int {
result := big.NewInt(0)
aCopy := new(big.Int).Set(a)
bCopy := new(big.Int).Set(b)
zero := big.NewInt(0)
one := big.NewInt(1)
for bCopy.Cmp(zero) > 0 {
if new(big.Int).And(bCopy, one).Cmp(one) == 0 {
result.Add(result, aCopy)
}
aCopy.Lsh(aCopy, 1) // double
bCopy.Rsh(bCopy, 1) // halve
}
return result
}
This demonstrates how the algorithm generalizes beyond native integers. In practice, libraries like math/big use far more sophisticated algorithms (Karatsuba, Toom-Cook) that are ~300x faster, but understanding this decomposition helps when studying how those optimized algorithms work.
2. Embedded systems without hardware multipliers
Some microcontrollers (particularly older or ultra-low-power ones) lack dedicated multiplication hardware. In these environments, multiplication is already implemented as a software routine, and understanding bit manipulation can help optimize these implementations.
3. Cryptographic applications
Modular exponentiation (used in RSA and other cryptographic algorithms) uses a similar principle called binary exponentiation or exponentiation by squaring:
// Compute (base^exp) % mod efficiently
func modularExponentiation(base, exp, mod int) int {
result := 1
base = base % mod
for exp > 0 {
if exp&1 == 1 {
result = (result * base) % mod
}
exp >>= 1
base = (base * base) % mod
}
return result
}
This reduces the time complexity from O(n) to O(log n) for computing large powers.
4. Hardware design
When designing digital circuits (FPGAs, ASICs), understanding how multiplication can be broken down into shifts and additions is essential. The Russian Peasant algorithm is conceptually similar to how hardware multipliers work internally.
Conclusion
The Russian Peasant algorithm is a clever technique that demonstrates how multiplication can be performed using only addition and bit shifting. While it’s significantly slower than hardware multiplication on modern CPUs (as our benchmarks showed), it remains valuable for:
- Learning about computer arithmetic and binary operations
- Cryptographic applications like modular exponentiation
- Scenarios where hardware multiplication isn’t available
- Understanding the foundations of how computers actually work
The real lesson here isn’t that you should use this algorithm for everyday multiplication - you shouldn’t. Rather, it’s a window into computational thinking and the remarkable optimization that modern CPUs and compilers provide. What once required clever algorithmic tricks is now handled by dedicated silicon running at billions of operations per second.
References
[1] Knuth, D. E. (1997). The Art of Computer Programming, Volume 2: Seminumerical Algorithms (3rd ed.). Addison-Wesley. Section 4.6.3: Evaluation of Powers. https://www-cs-faculty.stanford.edu/~knuth/taocp.html
[2] Booth, A. D. (1951). “A signed binary multiplication technique.” Quarterly Journal of Mechanics and Applied Mathematics, 4(2), 236–240. https://doi.org/10.1093/qjmam/4.2.236
[3] Wallace, C. S. (1964). “A suggestion for a fast multiplier.” IEEE Transactions on Electronic Computers, EC-13(1), 14–17. https://doi.org/10.1109/PGEC.1964.263830
[4] Go Documentation. “math/big package.” https://pkg.go.dev/math/big
[5] Godbolt, M. “Compiler Explorer.” https://godbolt.org/
[6] Warren, H. S. (2012). Hacker’s Delight (2nd ed.). Addison-Wesley. Chapter 8: Multiplication. https://www.oreilly.com/library/view/hackers-delight-second/9780133084993/
[7] Menezes, A. J., van Oorschot, P. C., & Vanstone, S. A. (1996). Handbook of Applied Cryptography. CRC Press. Chapter 14: Efficient Implementation. http://cacr.uwaterloo.ca/hac/