A mini-rant on the lack of string slices in C

- Tags: librsvg, rust

Porting of librsvg to Rust goes on. Yesterday I started porting the C code that implements SVG's <text> family of elements. I have also been replacing the little parsers in librsvg with Rust code.

And these days, the lack of string slices in C is bothering me a lot.

What if...

It feels like it should be easy to just write something like

#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *ptr;
    size_t len;
} StringSlice;

And then a whole family of functions. The starting point, where you slice a whole string:

StringSlice
make_slice_from_string (const char *s)
{
    StringSlice slice;

    assert (s != NULL);

    slice.ptr = s;
    slice.len = strlen (s);
    return slice;
}

But that wouldn't keep track of the lifetime of the original string. Okay, this is C, so you are used to keeping track of that yourself.

Onwards. Substrings?

StringSlice
make_sub_slice (StringSlice slice, size_t start, size_t len)
{
    StringSlice sub;

    assert (len <= slice.len);
    assert (start <= slice.len - len);  /* Not "start + len <= slice.len" or it can overflow. */
                                        /* The subtraction can't underflow because of the previous assert */
    sub.ptr = slice.ptr + start;
    sub.len = len;
    return sub;
}

Then you could write a million wrappers for g_strsplit() and friends, or equivalents to them, to give you slices instead of C strings. But then:

  • You have to keep track of lifetimes yourself.

  • You have to wrap every function that returns a plain "char *"...

  • ... and every function that takes a plain "char *" as an argument, without a length parameter, because...

  • You CANNOT take slice.ptr and pass it to a function that just expects a plain "char *", because your slice does not include a nul terminator (the '\0' byte at the end of a C string). This is what kills the whole plan.
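To make that last point concrete, here is a minimal sketch. The helper names slice_equals_cstr and slice_format are made up for illustration; they are not part of any library. Functions that accept an explicit length, like memcmp() or the "%.*s" format in snprintf(), cope with a slice just fine. Anything that calls strlen() on slice.ptr runs past the end of the slice.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *ptr;
    size_t len;
} StringSlice;

/* Hypothetical helper: compare a slice against a nul-terminated string.
 * memcmp() takes an explicit length, so it never reads past the slice. */
static int
slice_equals_cstr (StringSlice slice, const char *s)
{
    size_t s_len = strlen (s);

    return slice.len == s_len && memcmp (slice.ptr, s, s_len) == 0;
}

/* Hypothetical helper: format a slice into buf with "%.*s".
 * The precision argument is an int, so the length must fit in one. */
static int
slice_format (char *buf, size_t buf_size, StringSlice slice)
{
    assert (slice.len <= INT_MAX);
    return snprintf (buf, buf_size, "%.*s", (int) slice.len, slice.ptr);
}
```

With s = "fill:red" and a slice of its first four bytes, slice_equals_cstr (slice, "fill") is true, but strlen (slice.ptr) still reports 8: the pointer alone knows nothing about where the slice ends.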

Even if you had a helper library that implements C string slices like that, you would have a mismatch every time you needed to call a C function that expects a conventional C string in the form of a "char *". You need to put a nul terminator somewhere, and if you only have a slice, you need to allocate memory, copy the slice into it, and slap a 0 byte at the end. Then you can pass that to a function that expects a normal C string.
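The allocate-copy-terminate dance looks roughly like this. It is only a sketch: slice_to_cstring is a hypothetical name, and it uses plain malloc() rather than Glib's allocator. In Glib code, g_strndup (slice.ptr, slice.len) does much the same job.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    const char *ptr;
    size_t len;
} StringSlice;

/* Hypothetical helper: return a newly allocated, nul-terminated copy
 * of the slice, or NULL if allocation fails.  The caller must free() it. */
static char *
slice_to_cstring (StringSlice slice)
{
    char *copy;

    assert (slice.ptr != NULL);
    assert (slice.len < SIZE_MAX);  /* leave room for the terminator */

    copy = malloc (slice.len + 1);
    if (copy == NULL)
        return NULL;

    memcpy (copy, slice.ptr, slice.len);
    copy[slice.len] = '\0';         /* slap a 0 byte at the end */
    return copy;
}
```

That is one allocation and one copy for every call into a function that wants a conventional C string, which is exactly the overhead that slices were supposed to avoid.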

There is hacky C code that needs to pass a substring to another function, so it overwrites the byte after the substring with a 0, passes the substring, and overwrites the byte back. This is horrible, and doesn't work with strings that live in read-only memory. But that's the best that C lets you do.
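For the record, the hack looks something like this. The name call_with_substring and the callback signature are invented for this sketch; real code usually does the same thing inline.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: call func on the substring buf[start .. start + len)
 * by temporarily writing a nul terminator after it.
 *
 * buf must be writable, and buf[start + len] must be a valid byte
 * (at worst, the string's own terminator). */
static size_t
call_with_substring (char *buf,
                     size_t start,
                     size_t len,
                     size_t (*func) (const char *))
{
    char saved;
    size_t result;

    saved = buf[start + len];
    buf[start + len] = '\0';    /* clobber the byte after the substring */
    result = func (buf + start);
    buf[start + len] = saved;   /* and put it back */
    return result;
}
```

This works on a writable array such as char buf[] = "hello world". Pass it a pointer to a string literal instead and the write lands in read-only memory, which is undefined behavior and usually a crash.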

I'm very happy with string slices in Rust. A &str is a pointer plus a length, just like the StringSlice above, but it is part of the language itself: the compiler tracks its lifetime, and everything knows how to handle it.

The glib-rs crate has conversion traits to go from Rust strings or slices into C, and vice-versa. We already saw some of those in the blog post about conversions in glib-rs.

Sizes of things

Rust uses usize to specify the size of things. It is an unsigned integer, 32 bits on 32-bit machines and 64 bits on 64-bit machines, much like C's size_t.

In the Glib/C world, we have an assortment of types to represent the sizes of things:

  • gsize, the same as size_t. This is an unsigned integer; it's okay.

  • gssize, a signed integer of the same size as gsize. This is okay if used to represent a negative offset, and really funky in the Glib functions like g_string_new_len (const char *str, gssize len), where len == -1 means "call strlen(str) for me because I'm too lazy to compute the length myself".

  • int - broken, as in libxml2, but we can't change the API. On 64-bit machines, an int to specify a length means you can't pass objects bigger than 2 GB.

  • long - marginally better than int, since it has a better chance of actually being the same size as size_t, but still funky. Probably okay for negative offsets; problematic for sizes which should really be unsigned.

  • etc.
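Two of those conventions, sketched in plain C. Both helper names are hypothetical, and ptrdiff_t stands in for gssize so that the example does not depend on Glib. The first function mimics the "len == -1 means call strlen()" convention; the second is the check you end up writing before handing a size_t to an API that takes an int.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper mirroring the "negative length means
 * nul-terminated" convention.  ptrdiff_t is a stand-in for gssize. */
static size_t
resolve_length (const char *str, ptrdiff_t len)
{
    assert (str != NULL);

    if (len < 0)
        return strlen (str);    /* the callee computes it for you */

    return (size_t) len;
}

/* Hypothetical helper: can this length be passed to an API that
 * takes an int?  Anything above INT_MAX cannot. */
static int
length_fits_in_int (size_t len)
{
    return len <= (size_t) INT_MAX;
}
```

Note that the signed length also halves the usable range: a gssize-style parameter cannot describe a buffer larger than half the address space, even though gsize could.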

I'm not sure how old size_t is in the C standard library, but it can't have been there since the beginning of time. Otherwise people wouldn't have been using int to specify the sizes of things.