1. Rustifying libipuz: character sets

    Translations: es - Tags: crosswords, gnome, rust

    It has been, what, like four years since librsvg got fully rustified, and now it is time to move another piece of critical infrastructure to a memory-safe language.

    I'm talking about libipuz, the GObject-based C library that GNOME Crosswords uses underneath. This is a library that parses the ipuz file format and is able to represent various kinds of puzzles.

    The words "GNOME CROSSWORDS" set inside a crossword
puzzle

    Libipuz is an interesting beast. The ipuz format is JSON with a lot of hair: it needs to represent the actual grid of characters and their solutions, the grid's cells' numbers, the puzzle's clues, and all the styling information that crossword puzzles can have (it's more than you think!).

    {
        "version": "http://ipuz.org/v2",
        "kind": [ "http://ipuz.org/crossword#1", "https://libipuz.org/barred#1" ],
        "title": "Mephisto No 3228",
        "styles": {
            "L": {"barred": "L" },
            "T": {"barred": "T" },
            "TL": {"barred": "TL" }
        },
        "puzzle":   [ [  1,  2,  0,  3,  4,  {"cell": 5, "style": "L"},  6,  0,  7,  8,  0,  9 ],
                      [  0,  {"cell": 0, "style": "L"}, {"cell": 10, "style": "TL"},  0,  0,  0,  0,  {"cell": 0, "style": "T"},  0,  0,  {"cell": 0, "style": "T"},  0 ]
                    # the rest is omitted
        ],
        "clues": {
            "Across": [ {"number":1, "clue":"Having kittens means losing heart for home day", "enumeration":"5", "cells":[[0,0],[1,0],[2,0],[3,0],[4,0]] },
                        {"number":5, "clue":"Mostly allegorical poet on writing companion poem, say", "enumeration":"7", "cells":[[5,0],[6,0],[7,0],[8,0],[9,0],[10,0],[11,0]] },
                    ]
            # the rest is omitted
        }
    }
    

    Libipuz uses json-glib, which works fine to ingest the JSON into memory, but then it is a complete slog to distill the JSON nodes into C data structures. You need iterate through each node in the JSON tree and try to fit its data into yours.

    Get me the next node. Is the node an array? Yes? How many elements? Allocate my own array. Iterate the node's array. What's in this element? Is it a number? Copy the number to my array. Or is it a string? Do I support that, or do I throw an error? Oh, don't forget the code to meticulously free the partially-constructed thing I was building.

    This is not pleasant code to write and test.

    Ipuz also has a few mini-languages within the format, which live inside string properties. Parsing these in C unpleasant at best.

    Differences from librsvg

    While librsvg has a very small GObject-based API, and a medium-sized library underneath, libipuz has a large API composed of GObjects, boxed types, and opaque and public structures. Using libipuz involves doing a lot of calls to its functions, from loading a crossword to accessing each of its properties via different functions.

    I want to use this rustification as an exercise in porting a moderately large C API to Rust. Fortunately, libipuz does have a good test suite that is useful from the beginning of the port.

    Also, I want to see what sorts of idioms appear when exposing things from Rust that are not GObjects. Mutable, opaque structs can just be passed as a pointer to a heap allocation, i.e. a Box<T>. I want to take the opportunity to make more things in libipuz immutable; currently it has a bunch of reference-counted, mutable objects, which are fine in single-threaded C, but decidedly not what Rust would prefer. For librsvg it was very beneficial to be able to notice parts of objects that remain immutable after construction, and to distinguish those parts from the mutable ones that change when the object goes through its lifetime.

    Let's begin!

    In the ipuz format, crosswords have a character set or charset: it is the set of letters that appear in the puzzle's solution. Internally, GNOME Crosswords uses the charset as a histogram of letter counts for a particular puzzle. This is useful information for crossword authors.

    Crosswords uses the histogram of letter counts in various important algorithms, for example, the one that builds a database of words usable in the crosswords editor. That database has a clever format which allows answering questions like the following quickly: What words in the database match ?OR??WORDS and CORES will match.

    IPuzCharset is one of the first pieces of code I worked on in Crosswords, and it later got moved to libipuz. Originally it didn't even keep a histogram of character counts; it was just an ordered set of characters that could answer the question, "what is the index of the character ch within the ordered set?".

    I implemented that ordered set with a GTree, a balanced binary tree. The keys in the key/value tree were the characters, and the values were just unused.

    Later, the ordered set was turned into an actual histogram with character counts: keys are still characters, but each value is now a count of the coresponding character.

    Over time, Crosswords started using IPuzCharset for different purposes. It is still used while building and accessing the database of words; but now it is also used to present statistics in the crosswords editor, and as part of the engine in an acrostics generator.

    In particular, the acrostics generator has been running into some performance problems with IPuzCharset. I wanted to take the port to Rust as an opportunity to change the algorithm and make it faster.

    Refactoring into mutable/immutable stages

    IPuzCharset started out with these basic operations:

    /* Construction; memory management */
    IPuzCharset          *ipuz_charset_new              (void);
    IPuzCharset          *ipuz_charset_ref              (IPuzCharset       *charet);
    void                  ipuz_charset_unref            (IPuzCharset       *charset);
    
    /* Mutation */
    void                  ipuz_charset_add_text         (IPuzCharset       *charset,
                                                         const char        *text);
    gboolean              ipuz_charset_remove_text      (IPuzCharset       *charset,
                                                         const char        *text);
    
    /* Querying */
    gint                  ipuz_charset_get_char_index   (const IPuzCharset *charset,
                                                         gunichar           c);
    guint                 ipuz_charset_get_char_count   (const IPuzCharset *charset,
                                                         gunichar           c);
    gsize                 ipuz_charset_get_n_chars      (const IPuzCharset *charset);
    gsize                 ipuz_charset_get_size         (const IPuzCharset *charset);
    

    All of those are implemented in terms of the key/value binary tree that stores a character in each node's key, and a count in the node's value.

    I read the code in Crosswords that uses the ipuz_charset_*() functions and noticed that in every case, the code first constructs and populates the charset using ipuz_charset_add_text(), and then doesn't modify it anymore — it only does queries afterwards. The only place that uses ipuz_charset_remove_text() is the acrostics generator, but that one doesn't do any queries later: it uses the remove_text() operation as part of another algorithm, but only that.

    So, I thought of doing this:

    • Split things into a mutable IPuzCharsetBuilder that has the add_text / remove_text operations, and also has a build() operation that consumes the builder and produces an immutable IPuzCharset.

    • IPuzCharset is immutable; it can only be queried.

    • IPuzCharsetBuilder can work with a hash table, which turns the "add a character" operation from O(log n) to O(1) amortized.

    • build() is O(n) on the number of unique characters and is only done once per charset.

    • Make IPuzCharset work with a different hash table that also allows for O(1) operations.

    Basics of IPuzCharsetBuilder

    IPuzCharsetBuilder is mutable, and it can live on the Rust side as a Box<T> so it can present an opaque pointer to C.

    #[derive(Default)]
    pub struct CharsetBuilder {
        histogram: HashMap<char, u32>,
    }
    
    // IPuzCharsetBuilder *ipuz_charset_builder_new (void); */
    #[no_mangle]
    pub unsafe extern "C" fn ipuz_charset_builder_new() -> Box<CharsetBuilder> {
        Box::new(CharsetBuilder::default())
    }
    

    For extern "C", Box<T> marshals as a pointer. It's nominally what one would get from malloc().

    Then, simple functions to create the character counts:

    impl CharsetBuilder {
        /// Adds `text`'s character counts to the histogram.
        fn add_text(&mut self, text: &str) {
            for ch in text.chars() {
                self.add_character(ch);
            }
        }
    
        /// Adds a single character to the histogram.
        fn add_character(&mut self, ch: char) {
            self.histogram
                .entry(ch)
                .and_modify(|e| *e += 1)
                .or_insert(1);
        }
    }
    

    The C API wrappers:

    use std::ffi::CStr;
    
    // void ipuz_charset_builder_add_text (IPuzCharsetBuilder *builder, const char *text);
    #[no_mangle]
    pub unsafe extern "C" fn ipuz_charset_builder_add_text(
        builder: &mut CharsetBuilder,
        text: *const c_char,
    ) {
        let text = CStr::from_ptr(text).to_str().unwrap();
        builder.add_text(text);
    }
    

    CStr is our old friend that takes a char * and can wrap it as a Rust &str after validating it for UTF-8 and finding its length. Here, the unwrap() will panic if the passed string is not UTF-8, but that's what we want; it's the equivalent of an assertion that what was passed in is indeed UTF-8.

    // void ipuz_charset_builder_add_character (IPuzCharsetBuilder *builder, gunichar ch);
    #[no_mangle]
    pub unsafe extern "C" fn ipuz_charset_builder_add_character(builder: &mut CharsetBuilder, ch: u32) {
        let ch = char::from_u32(ch).unwrap();
        builder.add_character(ch);
    }
    

    Somehow, the glib-sys crate doesn't have gunichar, which is just a guint32 for a Unicode code point. So, we take in a u32, and check that it is in the appropriate range for Unicode code points with char::from_u32(). Again, a panic in the unwrap() means that the passed number is out of range; equivalent to an assertion.

    Converting to an immutable IPuzCharset

    pub struct Charset {
        /// Histogram of characters and their counts plus derived values.
        histogram: HashMap<char, CharsetEntry>,
    
        /// All the characters in the histogram, but in order.
        ordered: String,
    
        /// Sum of all the counts of all the characters.
        sum_of_counts: usize,
    }
    
    /// Data about a character in a `Charset`.  The "value" in a key/value pair where the "key" is a character.
    #[derive(PartialEq)]
    struct CharsetEntry {
        /// Index of the character within the `Charset`'s ordered version.
        index: u32,
    
        /// How many of this character in the histogram.
        count: u32,
    }
    
    impl CharsetBuilder {
        fn build(self) -> Charset {
            // omitted for brevity; consumes `self` and produces a `Charset` by adding
            // the counts for the `sum_of_counts` field, and figuring out the sort
            // order into the `ordered` field.
        }
    }
    

    Now, on the C side, IPuzCharset is meant to also be immutable and reference-counted. We'll use Arc<T> for such structures. One cannot return an Arc<T> to C code; it must first be converted to a pointer with Arc::into_raw():

    // IPuzCharset *ipuz_charset_builder_build (IPuzCharsetBuilder *builder);
    #[no_mangle]
    pub unsafe extern "C" fn ipuz_charset_builder_build(
        builder: *mut CharsetBuilder,
    ) -> *const Charset {
        let builder = Box::from_raw(builder); // get back the Box from a pointer
        let charset = builder.build();        // consume the builder and free it
        Arc::into_raw(Arc::new(charset))      // Wrap the charset in Arc and get a pointer
    }
    

    Then, implement ref() and unref():

    // IPuzCharset *ipuz_charset_ref (IPuzCharset *charet);
    #[no_mangle]
    pub unsafe extern "C" fn ipuz_charset_ref(charset: *const Charset) -> *const Charset {
        Arc::increment_strong_count(charset);
        charset
    }
    
    // void ipuz_charset_unref (IPuzCharset *charset);
    #[no_mangle]
    pub unsafe extern "C" fn ipuz_charset_unref(charset: *const Charset) {
        Arc::decrement_strong_count(charset);
    }
    

    The query functions need to take a pointer to what really is the Arc<Charset> on the Rust side. They reconstruct the Arc with Arc::from_raw() and wrap it in ManuallyDrop so that the Arc doesn't lose a reference count when the function exits:

    // gsize ipuz_charset_get_n_chars (const IPuzCharset *charset);
    #[no_mangle]
    pub unsafe extern "C" fn ipuz_charset_get_n_chars(charset: *const Charset) -> usize {
        let charset = ManuallyDrop::new(Arc::from_raw(charset));
        charset.get_n_chars()
    }
    

    Tests

    The C tests remain intact; these let us test all the #[no_mangle] wrappers.

    The Rust tests can just be for the internals, simliar to this:

        #[test]
        fn supports_histogram() {
            let mut builder = CharsetBuilder::default();
    
            let the_string = "ABBCCCDDDDEEEEEFFFFFFGGGGGGG";
            builder.add_text(the_string);
            let charset = builder.build();
    
            assert_eq!(charset.get_size(), the_string.len());
    
            assert_eq!(charset.get_char_count('A').unwrap(), 1);
            assert_eq!(charset.get_char_count('B').unwrap(), 2);
            assert_eq!(charset.get_char_count('C').unwrap(), 3);
            assert_eq!(charset.get_char_count('D').unwrap(), 4);
            assert_eq!(charset.get_char_count('E').unwrap(), 5);
            assert_eq!(charset.get_char_count('F').unwrap(), 6);
            assert_eq!(charset.get_char_count('G').unwrap(), 7);
    
            assert!(charset.get_char_count('H').is_none());
        }
    

    Integration with the build system

    Libipuz uses meson, which is not particularly fond of cargo. Still, cargo can be used from meson with a wrapper script and a bit of easy hacks. See the merge request for details.

    Further work

    I've left the original C header file ipuz-charset.h intact, but ideally I'd like to automatically generate the headers from Rust with cbindgen. Doing it that way lets me check that my assumptions of the extern "C" ABI are correct ("does foo: &mut Foo appear as Foo *foo on the C side?"), and it's one fewer C-ism to write by hand. I need to see what to do about inline documentation; gi-docgen can consume C header files just fine, but I'm not yet sure about how to make it work with generated headers from cbindgen.

    I still need to modify the CI's code coverage scripts to work with the mixed C/Rust codebase. Fortunately I can copy those incantations from librsvg.

    Is it faster?

    Maybe! I haven't benchmarked the acrostics generator yet. Stay tuned!

Page 3 of 105 (previous) (next)