Rust tries to follow the "make illegal states unrepresentable" mantra
in several ways. In this post I'll show several things related to the
process of building strings, from bytes in memory, or from a file, or
from char * things passed from C.
Strings in Rust
The easiest way to build a string is to do it directly at compile time:
let my_string = "Hello, world!";
In Rust, strings are UTF-8. Here, the compiler checks our string literal is valid UTF-8. If we try to be sneaky and insert an invalid character...
let my_string = "Hello \xf0";
We get a compiler error:
error: this form of character escape may only be used with characters in the range [\x00-\x7f]
--> foo.rs:2:30
|
2 | let my_string = "Hello \xf0";
| ^^
Rust strings know their length, unlike C strings. They can contain a nul character in the middle, because they don't need a nul terminator at the end.
let my_string = "Hello \x00 zero";
println!("{}", my_string);
The output is what you expect:
$ ./foo | hexdump -C
00000000 48 65 6c 6c 6f 20 00 20 7a 65 72 6f 0a |Hello . zero.|
0000000d ^ note the nul char here
$
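As a minimal sketch of the same point in code: the length is a property of the string itself, and the nul is just one more byte inside it. The helper name nul_position is made up for this illustration.

```rust
/// Returns the byte offset of the first nul character, if any.
/// (Hypothetical helper, only for this illustration.)
fn nul_position(s: &str) -> Option<usize> {
    s.find('\0')
}

fn main() {
    let my_string = "Hello \x00 zero";
    // The length is stored alongside the pointer, so nothing has to scan
    // for a terminator, and the embedded nul is just another byte.
    println!("length = {}", my_string.len());
    println!("nul at = {:?}", nul_position(my_string));
}
```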
So, to summarize, in Rust:
- Strings are encoded in UTF-8
- Strings know their length
- Strings can have nul chars in the middle
This is a bit different from C:
- Strings don't exist!
Okay, just kidding. In C:
- A lot of software has standardized on UTF-8.
- Strings don't know their length: a char * is a raw pointer to the beginning of the string.
- Strings conventionally have a nul terminator, that is, a zero byte that marks the end of the string. Therefore, you can't have nul characters in the middle of strings.
Building a string from bytes
Let's say you have an array of bytes and want to make a string from them. Rust won't let you just cast the array, like C would. First you need to do UTF-8 validation. For example:
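The listing itself did not survive in this copy of the post. Going by the description that follows and the output shown further down, it was probably close to the reconstruction below. It is laid out so that the line references in the text still match, but the exact code is an assumption.

```rust
fn convert_and_print(bytes: Vec<u8>) {
    let result = String::from_utf8(bytes); // validates the bytes as UTF-8
    match result {
        Ok(string) => println!("{}", string), // validated String
        Err(e) => println!("{:?}", e),        // debug view of the FromUtf8Error
    }
}

fn main() {
    convert_and_print(vec![0x48, 0x65, 0x6c, 0x6c, 0x6f]);
    convert_and_print(vec![0x48, 0x65, 0xf0, 0x6c, 0x6c, 0x6f]);
}
```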
In lines 10 and 11, we call convert_and_print() with different arrays of bytes; the first one is valid UTF-8, and the second one isn't.
Line 2 calls String::from_utf8(), which returns a Result, i.e. something with a success value or an error. In lines 3-5 we unpack this Result. If it's Ok, we print the converted string, which has been validated for UTF-8. Otherwise, we print the debug representation of the error.
The program prints the following:
$ ~/foo
Hello
FromUtf8Error { bytes: [72, 101, 240, 108, 108, 111], error: Utf8Error { valid_up_to: 2, error_len: Some(1) } }
Here, in the error case, the Utf8Error tells us that the bytes are valid UTF-8 up to index 2 (the valid_up_to field); that is the first problematic index. We also get some extra information (the error_len field) which lets the program know if the problematic sequence was incomplete and truncated at the end of the byte array, or if it's complete and in the middle.
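A program does not have to parse the debug output to get at those two fields. The sketch below pulls them out through the accessor methods on the error types. The function name describe_failure and the sample byte arrays are assumptions made for this example.

```rust
/// Returns (valid_up_to, error_len) if the bytes are not valid UTF-8,
/// or None if they are. (Hypothetical helper for illustration.)
fn describe_failure(bytes: Vec<u8>) -> Option<(usize, Option<usize>)> {
    match String::from_utf8(bytes) {
        Ok(_) => None,
        Err(e) => {
            // FromUtf8Error wraps the lower-level Utf8Error.
            let utf8_error = e.utf8_error();
            Some((utf8_error.valid_up_to(), utf8_error.error_len()))
        }
    }
}

fn main() {
    // Invalid byte in the middle: error_len is Some(length of the bad sequence).
    println!("{:?}", describe_failure(vec![0x48, 0x65, 0xf0, 0x6c, 0x6c, 0x6f]));
    // 0xc3 starts a two-byte sequence that is cut off at the end of the
    // array: error_len is None, meaning "more input was expected".
    println!("{:?}", describe_failure(vec![0x48, 0x65, 0xc3]));
}
```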
And for a "just make this printable, pls" API? We can use String::from_utf8_lossy(), which replaces invalid UTF-8 sequences with U+FFFD REPLACEMENT CHARACTER:
fn convert_and_print(bytes: Vec<u8>) {
    let string = String::from_utf8_lossy(&bytes);
    println!("{}", string);
}
fn main() {
    convert_and_print(vec![0x48, 0x65, 0x6c, 0x6c, 0x6f]);
    convert_and_print(vec![0x48, 0x65, 0xf0, 0x6c, 0x6c, 0x6f]);
}
This prints the following:
$ ~/foo
Hello
He�llo
Reading from files into strings
Now, let's assume you want to read chunks of a file and put them into strings. Let's go from the low-level parts up to the high level "read a line" API.
Single bytes and single UTF-8 characters
When you open a File, you get an object that implements the Read trait. In addition to the usual "read me some bytes" method, it can also give you back an iterator over bytes, or an iterator over UTF-8 characters.
The Read::bytes() method gives you back a Bytes iterator, whose items are of type Result<u8, io::Error>. When you ask the iterator for its next item, that Result means you'll get a byte out of it successfully, or an I/O error.
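Here is a small sketch of the byte-level iterator. To keep it self-contained, a byte slice stands in for an opened File (slices implement Read too); the helper name collect_bytes is an assumption for this example.

```rust
use std::io::Read;

/// Drains a reader one byte at a time, stopping at the first I/O error.
/// (Hypothetical helper for illustration.)
fn collect_bytes<R: Read>(reader: R) -> std::io::Result<Vec<u8>> {
    // Each item is a Result<u8, io::Error>; collecting into a Result
    // short-circuits on the first Err.
    reader.bytes().collect()
}

fn main() {
    // A byte slice stands in for a File here.
    let data: &[u8] = &[0x48, 0x65, 0xf0];
    match collect_bytes(data) {
        // No UTF-8 validation happens at this level, so 0xf0 passes through.
        Ok(bytes) => println!("{:?}", bytes),
        Err(e) => println!("I/O error: {}", e),
    }
}
```

Note that in real code you would wrap a File in a BufReader before iterating byte by byte, since each item may otherwise turn into a separate read call.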
In contrast, the Read::chars() method gives you back a Chars iterator, whose items are of type Result<char, CharsError>; the error type is CharsError, not io::Error. This extended CharsError has a NotUtf8 case, which you get back when next() tries to read the next UTF-8 sequence from the file and the file has invalid data. CharsError also has a case for normal I/O errors. Note that Read::chars() was an unstable API; it was never stabilized and has since been removed from the standard library.
Reading lines
While you could build a UTF-8 string one character at a time, there are more efficient ways to do it.
You can create a BufReader, a buffered reader, out of anything that implements the Read trait. BufReader has a convenient read_line() method, to which you pass a mutable String and it returns a Result<usize, io::Error> with either the number of bytes read, or an error.
That method is declared in the BufRead trait, which BufReader implements. Why the separation? Because other concrete structs also implement BufRead, such as Cursor, a nice wrapper that lets you use a vector of bytes like an I/O Read or Write implementation, similar to GMemoryInputStream.
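Because read_line() lives in the BufRead trait, the same code works for a BufReader around a File and for a Cursor over bytes in memory. A minimal sketch, using a Cursor so that no real file is needed (the helper name first_line is an assumption):

```rust
use std::io::{self, BufRead, Cursor};

/// Reads the first line from any BufRead source and returns the number of
/// bytes read together with the line. (Hypothetical helper for illustration.)
fn first_line<R: BufRead>(mut reader: R) -> io::Result<(usize, String)> {
    let mut line = String::new();
    // read_line() appends to the String and keeps the trailing newline.
    let n = reader.read_line(&mut line)?;
    Ok((n, line))
}

fn main() {
    // Cursor implements BufRead directly; with a File you would pass
    // BufReader::new(file) instead.
    let cursor = Cursor::new(b"first\nsecond\n".to_vec());
    println!("{:?}", first_line(cursor));
}
```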
If you prefer an iterator rather than the read_line() function, BufRead also gives you a lines() method, which gives you back a Lines iterator.
In both cases (the read_line() method or the Lines iterator), the error that you can get back can be of ErrorKind::InvalidData, which indicates that there was an invalid UTF-8 sequence in the line to be read. It can also be a normal I/O error, of course.
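To see the InvalidData case concretely, the following sketch feeds a line containing a stray 0xf0 byte through the Lines iterator. Again a Cursor stands in for a file, and the helper name first_line_error is made up for this example.

```rust
use std::io::{BufRead, Cursor, ErrorKind};

/// Returns the kind of the first error hit while iterating over lines,
/// or None if every line read cleanly. (Hypothetical helper.)
fn first_line_error(bytes: Vec<u8>) -> Option<ErrorKind> {
    for line in Cursor::new(bytes).lines() {
        // Each item is a Result<String, io::Error>.
        if let Err(e) = line {
            return Some(e.kind());
        }
    }
    None
}

fn main() {
    // The second line contains 0xf0, which is not valid UTF-8 on its own.
    println!("{:?}", first_line_error(b"ok\nb\xf0d\n".to_vec()));
    println!("{:?}", first_line_error(b"ok\nfine\n".to_vec()));
}
```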
Summary so far
There is no way to build a String, or a &str slice, from invalid UTF-8 data. All the methods that let you turn bytes into string-like things perform validation, and return a Result to let you know if your bytes validated correctly.
The exceptions are in the unsafe methods, like String::from_utf8_unchecked(). You should really only use them if you are absolutely sure that your bytes were validated as UTF-8 beforehand.
There is no way to bring in data from a file (or anything file-like, that implements the Read trait) and turn it into a String without going through functions that do UTF-8 validation. There is not an unsafe "read a line" API without validation; you would have to build one yourself, but the I/O hit is probably going to be slower than validating data in memory, anyway, so you may as well validate.
C strings and Rust
For unfortunate historical reasons, C flings around char * to mean different things. In the context of GLib, it can mean:
- A valid, nul-terminated UTF-8 sequence of bytes (a "normal string")
- A nul-terminated file path, which has no meaningful encoding
- A nul-terminated sequence of bytes, not validated as UTF-8.
What a particular char * means depends on which API you got it from.
Bringing a string from C to Rust
From Rust's viewpoint, getting a raw char * from C (a "*const c_char" in Rust parlance) means that it gets a pointer to a buffer of unknown length.
Now, that may not be entirely accurate:
- You may indeed only have a pointer to a buffer of unknown length
- You may have a pointer to a buffer, and also know its length (i.e. the offset at which the nul terminator is)
The Rust standard library provides a CStr object, which means, "I have a pointer to an array of bytes, and I know its length, and I know the last byte is a nul".
CStr provides an unsafe from_ptr() constructor which takes a raw pointer, and walks the memory to which it points until it finds a nul byte. You must give it a valid pointer, and you had better guarantee that there is a nul terminator, or CStr will walk until the end of your process' address space looking for one.
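As a rough sketch of what wrapping a pointer looks like: since there is no C code in this example, a CString owned by Rust plays the role of the C side and supplies the raw pointer. The function name len_via_pointer is an assumption for illustration.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

/// Round-trips a string through a raw pointer, the way it would arrive
/// from C, and measures it. (Hypothetical helper for illustration.)
fn len_via_pointer(owned: &CString) -> usize {
    // This is what a C API would hand us: just an address.
    let ptr: *const c_char = owned.as_ptr();
    // SAFETY: ptr comes from a live CString, so it is non-null, points to
    // a nul-terminated buffer, and outlives the borrowed CStr below.
    let wrapped: &CStr = unsafe { CStr::from_ptr(ptr) };
    // to_bytes() excludes the nul terminator.
    wrapped.to_bytes().len()
}

fn main() {
    let owned = CString::new("Hello").expect("no interior nul bytes");
    println!("{}", len_via_pointer(&owned));
}
```

With a pointer that really comes from C, the SAFETY comment is where you would write down why the pointer is valid and who guarantees the terminator.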
Alternatively, if you know the length of your byte array, and you know that it has a nul byte at the end, you can call CStr::from_bytes_with_nul(). You pass it a &[u8] slice; the function will check that a) the last byte in that slice is indeed a nul, and b) there are no nul bytes in the middle.
The unsafe version of this last function is CStr::from_bytes_with_nul_unchecked(): it also takes a &[u8] slice, but you must guarantee that the last byte is a nul and that there are no nul bytes in the middle.
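The two checks done by the safe CStr::from_bytes_with_nul() can be sketched with three sample slices. The helper name is_valid_c_string is made up for this example.

```rust
use std::ffi::CStr;

/// True if the slice ends in a nul and has no nul bytes before that.
/// (Hypothetical helper for illustration.)
fn is_valid_c_string(bytes: &[u8]) -> bool {
    CStr::from_bytes_with_nul(bytes).is_ok()
}

fn main() {
    // Well-formed: exactly one nul, at the end.
    println!("{}", is_valid_c_string(b"Hello\0"));
    // Rejected: no nul terminator.
    println!("{}", is_valid_c_string(b"Hello"));
    // Rejected: nul byte in the middle.
    println!("{}", is_valid_c_string(b"He\0llo\0"));
}
```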
I really like that the Rust documentation tells you when functions are not "instantaneous" and must instead walk arrays, for example to do validation or to look for the nul terminator as above.
Turning a CStr into a string-like
Now, the above indicates that a CStr is a nul-terminated array of bytes. We have no idea what the bytes inside look like; we just know that they don't contain any other nul bytes.
There is a CStr::to_str() method, which returns a Result<&str, Utf8Error>. It performs UTF-8 validation on the array of bytes. If the array is valid, the function just returns a slice of the validated bytes minus the nul terminator (i.e. just what you expect for a Rust string slice). Otherwise, it returns a Utf8Error with the details like we discussed before.
There is also CStr::to_string_lossy() which does the replacement of invalid UTF-8 sequences like we discussed before.
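Putting the two conversions side by side, as a minimal sketch: try the strict to_str() first, and fall back to to_string_lossy() when validation fails. The helper name to_display_string is an assumption, and the CStr values are built from byte literals rather than from a real C pointer.

```rust
use std::ffi::CStr;

/// Returns the validated contents, or a lossy rendering when the bytes
/// are not valid UTF-8. (Hypothetical helper for illustration.)
fn to_display_string(c_str: &CStr) -> String {
    match c_str.to_str() {
        // Valid UTF-8: a borrowed &str without the nul terminator.
        Ok(valid) => valid.to_owned(),
        // Invalid UTF-8: bad sequences become U+FFFD.
        Err(_) => c_str.to_string_lossy().into_owned(),
    }
}

fn main() {
    let valid = CStr::from_bytes_with_nul(b"Hello\0").unwrap();
    let invalid = CStr::from_bytes_with_nul(b"He\xf0llo\0").unwrap();
    println!("{}", to_display_string(valid));
    println!("{}", to_display_string(invalid));
}
```

In practice to_string_lossy() alone gives the same final string; the match is there to show where a program could report the Utf8Error instead of silently replacing bytes.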
Conclusion
Strings in Rust are UTF-8 encoded, they know their length, and they can have nul bytes in the middle.
To build a string from raw bytes, you must go through functions that do UTF-8 validation and tell you if it failed. There are unsafe functions that let you skip validation, but then of course you are on your own.
The low-level functions which read data from files operate on bytes. On top of those, there are convenience functions to read validated UTF-8 characters, lines, etc. All of these tell you when there was invalid UTF-8 or an I/O error.
Rust lets you wrap a raw char * that you got from C into something that can later be validated and turned into a string. Anything that manipulates a raw pointer is unsafe; this includes the "wrap me this pointer into a C string abstraction" API, and the "build me an array of bytes from this raw pointer" API. Later, you can validate those as UTF-8 and build real Rust strings, or know if the validation failed.
Rust builds these little "corridors" through the API so that illegal states are unrepresentable.