Here is a straightforward port of some easy code.
randtable.c
has a lookup table with seemingly-random
numbers. This table is used by the following macros in
bzlib_private.h:
extern Int32 BZ2_rNums[512];
#define BZ_RAND_DECLS \
Int32 rNToGo; \
Int32 rTPos \
#define BZ_RAND_INIT_MASK \
s->rNToGo = 0; \
s->rTPos = 0 \
#define BZ_RAND_MASK ((s->rNToGo == 1) ? 1 : 0)
#define BZ_RAND_UPD_MASK \
if (s->rNToGo == 0) { \
s->rNToGo = BZ2_rNums[s->rTPos]; \
s->rTPos++; \
if (s->rTPos == 512) s->rTPos = 0; \
} \
s->rNToGo--;
Here, BZ_RAND_DECLS
is used to declare two fields, rNToGo
and
rTPos
, into two structs (1, 2). Both are similar to this:
typedef struct {
...
Bool blockRandomised;
BZ_RAND_DECLS
...
} DState;
Then, the code that needs to initialize those fields calls
BZ_RAND_INIT_MASK
, which expands into code to set the two fields to
zero.
At several points in the code, BZ_RAND_UPD_MASK
gets called, which
expands into code that updates the randomization state, or something
like that, and uses BZ_RAND_MASK
to get a useful value out of the
randomization state.
I have no idea yet what the state is about, but let's port it directly.
Give things a name
It's interesting to see that no code except for those macros uses
the fields rNToGo
and rTPos
, which are declared via
BZ_RAND_DECLS
. So, let's make up a type with a name for that.
Since I have no better name for it, I shall call it just
RandState
. I added that type definition in the C code,
and replaced the macro-which-creates-struct-fields with a
RandState
-typed field:
-#define BZ_RAND_DECLS \
- Int32 rNToGo; \
- Int32 rTPos \
+typedef struct {
+ Int32 rNToGo;
+ Int32 rTPos;
+} RandState;
...
- BZ_RAND_DECLS;
+ RandState rand;
Since the fields now live inside a sub-struct, I changed the other
macros to use s->rand.rNToGo
instead of s->rNToGo
, and similarly
for the other field.
Turn macros into functions
Now, three commits (1, 2, 3) to turn the
macros BZ_RAND_INIT_MASK
, BZ_RAND_MASK
, and BZ_RAND_UPD_MASK
into functions.
And now that the functions live in the same C source file as the
lookup table they reference, the table can be made static const
to
avoid having it as read/write unshared data in the linked binary.
Premature optimization concern: doesn't de-inlining those macros
cause performance problems? At first, we will get the added overhead
from a function call. When the whole code is ported to Rust, the Rust
compiler will probably be able to figure out that those tiny functions
can be inlined (or we can #[inline]
them by hand if we have proof,
or if we have more hubris than faith in LLVM).
Port functions and table to Rust
The functions are so tiny, and the table so cut-and-pasteable, that it's easy to port them to Rust in a single shot:
#[no_mangle]
pub unsafe extern "C" fn BZ2_rand_init() -> RandState {
RandState {
rNToGo: 0,
rTPos: 0,
}
}
#[no_mangle]
pub unsafe extern "C" fn BZ2_rand_mask(r: &RandState) -> i32 {
if r.rNToGo == 1 {
1
} else {
0
}
}
#[no_mangle]
pub unsafe extern "C" fn BZ2_rand_update_mask(r: &mut RandState) {
if r.rNToGo == 0 {
r.rNToGo = RAND_TABLE[r.rTPos as usize];
r.rTPos += 1;
if r.rTPos == 512 {
r.rTPos = 0;
}
}
r.rNToGo -= 1;
}
Also, we define the RandState
type as a Rust struct with a
C-compatible representation, so it will have the same layout in memory
as the C struct. This is what allows us to have a RandState
in
the C struct, while in reality the C code doesn't access it
directly; it is just used as a struct field.
// Keep this in sync with bzlib_private.h:
#[repr(C)]
pub struct RandState {
rNToGo: i32,
rTPos: i32,
}
See the commit for the corresponding extern
declarations in bzlib_private.h
. With those functions and the table
ported to Rust, we can remove randtable.c
. Yay!
A few cleanups
After moving to another house one throws away useless boxes; we have to do some cleanup in the Rust code after the initial port, too.
Rust prefers snake_case fields rather than camelCase ones, and I
agree. I renamed the fields to n_to_go
and table_pos
.
Then, I discovered that the EState
struct doesn't actually use the
fields for the randomization state. I just removed them.
Exegesis
What is that randomization state all about?
And why does DState
(the struct used during decompression) need the
randomization state, but EState
(used during compression) doesn't
need it?
I found this interesting comment:
/*--
Now a single bit indicating (non-)randomisation.
As of version 0.9.5, we use a better sorting algorithm
which makes randomisation unnecessary. So always set
the randomised bit to 'no'. Of course, the decoder
still needs to be able to handle randomised blocks
so as to maintain backwards compatibility with
older versions of bzip2.
--*/
bsW(s,1,0);
Okay! So compression no longer uses randomization, but
decompression has to support files which were compressed with
randomization. Here, bsW(s,1,0)
always writes a 0 bit to the file.
However, the decompression code actually reads the blockRandomised
bit from the file so that it can see whether it is
dealing with an old-format file:
GET_BITS(BZ_X_RANDBIT, s->blockRandomised, 1);
Later in the code, this s->blockRandomised
field gets consulted; if
the bit is on, the code calls BZ2_rand_update_mask()
and friends as
appropriate. If one is using files compressed with Bzip2 0.9.5 or
later, those randomization functions are not even called.
Talk about preserving compatibility with the past.
Explanation, or building my headcanon
Bzip2's compression starts by running a Burrows-Wheeler Transform on a block of data to compress, which is a wonderful algorithm that I'm trying to fully understand. Part of the BWT involves sorting all the string rotations of the block in question.
Per the comment I cited, really old versions of bzip2 used a randomization helper to make sorting perform well in extreme cases, but not-so-old versions fixed this.
This explains why the decompression struct DState
has a
blockRandomised
bit, but the compression struct EState
doesn't
need one. The fields that the original macro was pasting into
EState
were just a vestige from 1999, which is when Bzip2 0.9.5 was
released.