Bzip and my 4 cores

Intel Core i5, what a marvelous beast: 4 CPU cores in one tiny laptop. The hard part is using them properly. When I had to compress a 700MB log file a few days ago, I realized that not all the tools on Linux are multi-core friendly.

Today, a fellow PLUG member pointed me to lbzip2, a multi-threaded implementation of bzip2. I just gave it a quick shot and the results are interesting:

Initial file:

$ ls -s jmeter-server-node1.log --block-size=1
689274880 jmeter-server-node1.log

=== with bzip2 ====

$ time bzip2 -z -9 jmeter-server-node1.log

real	8m33.220s
user	8m31.444s
sys	0m0.880s

$ ls -s jmeter-server-node1.log.bz2 --block-size=1
1589248 jmeter-server-node1.log.bz2

$ time bunzip2 jmeter-server-node1.log.bz2

real	0m35.801s
user	0m33.662s
sys	0m0.964s

=== with lbzip2 ====

$ time lbzip2 -n 4 -z -9 -S jmeter-server-node1.log 

real	5m37.425s
user	20m57.227s
sys	0m5.016s

$ ls -s jmeter-server-node1.log.bz2 --block-size=1
1601536 jmeter-server-node1.log.bz2

$ time lbzip2 -n 4 -d jmeter-server-node1.log.bz2 

real	0m20.370s
user	1m15.697s
sys	0m1.316s

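The compressed sizes are nearly identical (1601536 vs. 1589248 bytes, less than 1% apart). What surprises me is that while lbzip2 finishes about 1.5 times faster in wall-clock time (5m37s vs. 8m33s), it burns roughly 2.5 times as much user time as bzip2 (20m57s vs. 8m31s). The efficiency per core is a lot lower, but I'm happy to be using all my cores.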
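To double-check ratios like these, the `time` figures can be run through a bit of awk. This is only a rough sketch: the helper names (to_seconds, ratio, efficiency) are made up for this post, and it assumes the "XmY.YYYs" format that bash's time prints.

```shell
#!/bin/sh
# Convert a `time`-style duration such as "8m33.220s" to seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Print the ratio of two durations (first / second) with two decimals.
ratio() {
    awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", a / b }'
}

# Parallel efficiency: wall-clock speedup divided by the thread count.
# Arguments: single-threaded time, multi-threaded time, number of threads.
efficiency() {
    awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" -v n="$3" \
        'BEGIN { printf "%.2f\n", a / b / n }'
}

# Wall-clock speedup of lbzip2 over bzip2 (real times from the runs above).
echo "speedup:         $(ratio 8m33.220s 5m37.425s)"
# User time of lbzip2 relative to bzip2.
echo "user-time ratio: $(ratio 20m57.227s 8m31.444s)"
# Speedup per thread with -n 4.
echo "efficiency:      $(efficiency 8m33.220s 5m37.425s 4)"
```

With the compression timings above this gives a speedup of about 1.52, a user-time ratio of about 2.46, and an efficiency of about 0.38 per thread.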

2 reactions

1 From smaftoul - 11/04/2011, 11:46

How does this compare to pbzip2? Have you tried it? (pbzip2 is in Debian, though the Debian version has a bug: it doesn't work with stdin/stdout. It's fixed upstream.)

2 From lacos - 11/04/2011, 23:18

Hi,

lbzip2 author here. I strongly suspect that you see what you see because your Intel Core i5 is probably only dual-core plus hyper-threaded, not a real quad-core. That means you have two instances of the per-core L2 cache, not four, and each pair of hyperthreads shares one L2 cache.

Since bzip2 compression and decompression are very cache sensitive (see "man bzip2"), the scaling factor is determined mostly by how many OS threads each have a dedicated cache at their disposal. In your case this number is probably 2.

Since you run two threads per core, they contend for the shared L2 cache, each basically messing with the other (flushing or invalidating the shared cache for the other). This contention shows up as doubled CPU time, because "waiting for cache" (or "waiting for main memory") is accounted for as CPU time.

Hyperthreading is not merely useless but detrimental for lbzip2, so you should export LBZIP2="-n 2". You should not run more worker threads per core than the core-dedicated cache size divided by 8MB.

See a scaling example with "good" caches here (103 worker threads): http://lacos.hu/lbzip2-scaling/scal... . The caching stuff is described in the Observations section at the end.
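
One way to act on this advice is to count physical cores rather than logical CPUs and pass that number to lbzip2 through the LBZIP2 environment variable mentioned above. The sketch below assumes a Linux system with lscpu from util-linux. count_physical_cores is a hypothetical helper name, and whether one worker per physical core is the best setting on a given machine is an assumption to check by measuring.

```shell
#!/bin/sh
# Count distinct (core, socket) pairs in `lscpu -p=CORE,SOCKET` output read
# from stdin. Hyperthreads on the same core share a pair, so they count once.
count_physical_cores() {
    grep -v '^#' | sort -u | wc -l | tr -d ' '
}

# Only query the real machine when lscpu is available.
if command -v lscpu >/dev/null 2>&1; then
    cores=$(lscpu -p=CORE,SOCKET | count_physical_cores)
    # One worker thread per physical core (assumption, see above).
    export LBZIP2="-n $cores"
    echo "LBZIP2=$LBZIP2"
fi
```

On a dual-core, hyper-threaded CPU, lscpu lists four logical CPUs but only two distinct core IDs, so this should end up exporting LBZIP2="-n 2".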

Julien Vehent

Security @ Mozilla
