Intel Core i5, what a marvelous beast. 4 CPU cores in one tiny laptop. The problem is using them properly. When I had to compress a 700MB log file a few days ago, I realized that not all the tools on Linux are multi-core friendly.
Today, a fellow PLUG member pointed me to lbzip2, a multi-threaded implementation of bzip2. I just gave it a quick shot and the results are interesting:
Initial file:
$ ls -s jmeter-server-node1.log --block-size=1
689274880 jmeter-server-node1.log
=== with bzip2 ====
$ time bzip2 -z -9 jmeter-server-node1.log
real 8m33.220s
user 8m31.444s
sys 0m0.880s
$ ls -s jmeter-server-node1.log.bz2 --block-size=1
1589248 jmeter-server-node1.log.bz2
$ time bunzip2 jmeter-server-node1.log.bz2
real 0m35.801s
user 0m33.662s
sys 0m0.964s
=== with lbzip2 ====
$ time lbzip2 -n 4 -z -9 -S jmeter-server-node1.log
real 5m37.425s
user 20m57.227s
sys 0m5.016s
$ ls -s jmeter-server-node1.log.bz2 --block-size=1
1601536 jmeter-server-node1.log.bz2
$ time lbzip2 -n 4 -d jmeter-server-node1.log.bz2
real 0m20.370s
user 1m15.697s
sys 0m1.316s
The compressed sizes are nearly the same, but the timings surprised me: lbzip2 finishes in about 65% of bzip2's wall-clock time (roughly 1.5x faster), yet it burns about 250% of bzip2's user time. The efficiency per core is a lot lower, but I'm happy to be using all my cores.
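To make the comparison concrete, here is a small sketch that recomputes the ratios from the compression timings above. The helper names `to_seconds` and `ratio` are mine, not part of bzip2 or lbzip2, and the script assumes a POSIX shell with awk available.

```shell
#!/bin/sh
# Convert a `time`-style duration such as 8m33.220s into seconds.
# (to_seconds and ratio are hypothetical helpers for this post.)
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Print A / B with two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Compression timings copied from the runs above.
bzip2_real=$(to_seconds 8m33.220s)
bzip2_user=$(to_seconds 8m31.444s)
lbzip2_real=$(to_seconds 5m37.425s)
lbzip2_user=$(to_seconds 20m57.227s)

echo "wall-clock speedup (bzip2 / lbzip2): $(ratio "$bzip2_real" "$lbzip2_real")"
echo "user time ratio (lbzip2 / bzip2):    $(ratio "$lbzip2_user" "$bzip2_user")"
echo "lbzip2 average busy cores:           $(ratio "$lbzip2_user" "$lbzip2_real")"
```

The last figure (user time divided by wall-clock time) shows that lbzip2 kept almost all four logical CPUs busy, even though the wall-clock gain was only about 1.5x.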
2 reactions
1 From smaftoul - 11/04/2011, 11:46
How does this compare to pbzip2? Have you tried it? (pbzip2 is in Debian, though the Debian version has a bug: it doesn't work with stdin/stdout. It's fixed upstream.)
2 From lacos - 11/04/2011, 23:18
Hi,
lbzip2 author here. I strongly suspect that you see what you see because your Intel Core i5 is probably only dual-core PLUS hyper-threaded, not a real quad-core. Meaning, you have two instances of the per-core L2 cache, not four, and each pair of hyperthreads shares an L2 cache.
Since bzip2 compression/decompression is very cache sensitive (see "man bzip2"), the scaling factor will be determined mostly by how many OS threads each have a dedicated cache at their disposal. In your case this number is probably 2.
Since you run two threads per core, those contend for the shared L2 cache, basically each messing with the other (flushing / invalidating the shared cache for the other). This contention shows up as double CPU time, because "waiting for cache" (or "waiting for main memory") is accounted for as CPU time.
Hyperthreading is not merely useless but detrimental for lbzip2, so you should export LBZIP2="-n 2". You should not run more worker threads per core than the core-dedicated cache size divided by 8MB.
See a scaling example with "good" caches here (103 worker threads): http://lacos.hu/lbzip2-scaling/scal... . The caching stuff is described in the Observations section at the end.
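The advice in this comment (one worker thread per physical core rather than per hyperthread) could be applied along these lines. This is only a sketch: `count_physical_cores` is a hypothetical helper, it assumes a Linux system where `lscpu -p=CORE,SOCKET` from util-linux is available, and the `LBZIP2` environment variable is used as described in the comment above.

```shell
#!/bin/sh
# Count physical cores from `lscpu -p=CORE,SOCKET` output read on stdin.
# Hyperthread siblings share the same core,socket pair, so counting the
# unique pairs ignores them. (Hypothetical helper, not part of lbzip2.)
count_physical_cores() {
    grep -v '^#' | sort -u | wc -l | tr -d ' '
}

# Only touch the environment if lscpu is actually installed.
if command -v lscpu >/dev/null 2>&1; then
    cores=$(lscpu -p=CORE,SOCKET | count_physical_cores)
    export LBZIP2="-n $cores"
    echo "LBZIP2=$LBZIP2"
fi
```

On a dual-core, hyper-threaded CPU this should yield `-n 2`, the value suggested in the comment.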