Introduction
So, you're coming from C++ and want to write Rust? Great!
You have questions? We have answers.
This book is a collection of frequently asked questions for those arriving from existing C++ codebases. It guides you on how to adapt your C++ thinking to the new facilities available in Rust. It should also help if you're coming from another object-oriented language such as Java.
Although it's structured as questions and answers, it can also be read front-to-back, to give you hints about how to adapt your C++/Java thinking to a more idiomatically Rusty approach.
It does not aim to teach you Rust - there are many better resources. It doesn't aim to talk about Rust idioms in general - there are great existing guides for that. This guide is specifically about transitioning from some other traditionally OO language. If you're coming from such a language, you'll have questions about how to achieve the same outcomes in idiomatic Rust. That's what this guide is for.
Structure
The guide starts with idioms at the small scale - answering questions about how you'd write a few lines of code - and moves towards ever larger patterns - answering questions about how you'd structure your whole codebase.
Contributors
The following awesome people helped write the answers here, and they're sometimes quoted using the abbreviations given.
Thanks to Adam Perry (@__anp__) (AP), Alyssa Haroldsen (@kupiakos) (AH), Augie Fackler (@durin42) (AF), David Tolnay (@davidtolnay) (DT), Łukasz Anforowicz (LA), Manish Goregaokar (@ManishEarth) (MG), Mike Forster (MF), Miguel Young de la Sota (@DrawsMiguel) (MY), and Tyler Mandry (@tmandry) (TM).
Their views have been edited and collated by Adrian Taylor (@adehohum), Chris Palmer, danakj@chromium.org and Martin Brænne. Any errors or misrepresentations are ours.
Licensed under either the Apache License, Version 2.0, or the MIT license, at your option.
Questions about code in function bodies
- How can I avoid the performance penalty of bounds checks?
- Isn't it confusing to use the same variable name twice?
- How can I avoid the performance penalty of `unwrap()`?
- How do I access variables from within a spawned thread?
- When should I use runtime checks vs jumping through hoops to do static checks?
How can I avoid the performance penalty of bounds checks?
Rust array and list accesses are all bounds checked. You may be worried about a performance penalty. How can you avoid that?
Contort yourself a little bit to use iterators. - MY
Rust gives you choices around functional versus imperative style, but things often work better in a functional style. Specifically - if you've got something iterable, then there are probably functional methods to do what you want.
For instance, suppose you need to work out what food to get at the petshop. Here's code that does this in an imperative style:
```rust
// Copyright 2020 Google LLC
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     https://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

use std::collections::HashSet;

struct Animal {
    kind: &'static str,
    is_hungry: bool,
    meal_needed: &'static str,
}

static PETS: [Animal; 4] = [
    Animal { kind: "Dog", is_hungry: true, meal_needed: "Kibble" },
    Animal { kind: "Python", is_hungry: false, meal_needed: "Cat" },
    Animal { kind: "Cat", is_hungry: true, meal_needed: "Kibble" },
    Animal { kind: "Lion", is_hungry: false, meal_needed: "Kibble" },
];

fn make_shopping_list_a() -> HashSet<&'static str> {
    let mut meals_needed = HashSet::new();
    for n in 0..PETS.len() {
        // ugh
        if PETS[n].is_hungry {
            meals_needed.insert(PETS[n].meal_needed);
        }
    }
    meals_needed
}
```
The loop index is verbose and error-prone. Let's get rid of it and loop over an iterator instead:
```rust
// (The license header, Animal, and PETS are as in the previous example.)

fn make_shopping_list_b() -> HashSet<&'static str> {
    let mut meals_needed = HashSet::new();
    for animal in PETS.iter() {
        // better...
        if animal.is_hungry {
            meals_needed.insert(animal.meal_needed);
        }
    }
    meals_needed
}
```
We're now accessing the elements through an iterator, but we're still processing them inside a loop. It's often more idiomatic to replace the loop with a chain of iterators:
```rust
// (The license header, Animal, and PETS are as in the previous examples.)

fn make_shopping_list_c() -> HashSet<&'static str> {
    PETS.iter()
        .filter(|animal| animal.is_hungry)
        .map(|animal| animal.meal_needed)
        .collect() // best...
}
```
The obvious advantage of the third approach is that it's more concise, but less obviously:
- The first solution may require Rust to do array bounds checks inside each iteration of the loop, making it potentially slower than the equivalent C++. In a simple example like this, the optimizer would probably elide them - but the functional pipeline simply never needs bounds checks.
- The final container (a `HashSet` in this case) may be able to allocate roughly the right capacity at the outset, using the `size_hint` of the Rust iterator.
- If you use iterator-style code rather than imperative code, it's more likely that the Rust compiler will be able to auto-vectorize using SIMD instructions.
- There is no mutable state within the function. This makes it easier to verify that the code is correct, and to avoid introducing bugs when changing it. In this simple example it may be obvious that the call to `HashSet::insert` is the only mutation to the set, but in more complex scenarios it's easy to lose track.
- And, as a new arrival from C++, you may find this hard to believe: for an experienced Rustacean, it's more readable.
Here are some more iterator techniques to help avoid materializing a collection:
- You can `chain` two iterators together to make a longer one.
- If you need to iterate over two lists in lockstep, `zip` them together to avoid bounds checks on either.
- If you want to feed all your animals, and also feed a nearby duck, just chain the iterator to `std::iter::once`:

```rust
// (Animal, PETS, and make_shopping_list_c are as in the earlier examples.)
static NEARBY_DUCK: Animal = Animal {
    kind: "Duck",
    is_hungry: true,
    meal_needed: "pondweed",
};

fn make_shopping_list_d() -> HashSet<&'static str> {
    PETS.iter()
        .chain(std::iter::once(&NEARBY_DUCK))
        .filter(|animal| animal.is_hungry)
        .map(|animal| animal.meal_needed)
        .collect()
}
```

  (Similarly, if you want to add one more item to the shopping list - maybe you're hungry, as well as your menagerie? - just chain it on after the `map`.)
- `Option` is iterable, so an optional extra animal can be chained straight into the pipeline:

```rust
struct Pond;

static MY_POND: Pond = Pond;

fn pond_inhabitant(pond: &Pond) -> Option<&Animal> {
    // ...
    None
}

fn make_shopping_list_e() -> HashSet<&'static str> {
    PETS.iter()
        .chain(pond_inhabitant(&MY_POND))
        .filter(|animal| animal.is_hungry)
        .map(|animal| animal.meal_needed)
        .collect()
}
```
Here's a diagram showing how data flows in this iterator pipeline:
```mermaid
flowchart LR
  %%{ init: { 'flowchart': { 'nodeSpacing': 40, 'rankSpacing': 15 } } }%%
  Pets
  Filter([filter by hunger])
  Map([map to noms])
  Meals
  uniqueify([uniqueify])
  shopping[Shopping list]
  inhabitant[Optional pond inhabitant]
  Pets ---> Filter
  Pond ---> inhabitant
  inhabitant ---> Map
  Filter ---> Map
  Map ---> Meals
  Meals ---> uniqueify
  uniqueify ---> shopping
```
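The `zip` technique mentioned above can be sketched like this - a hypothetical pairing of names and meals (none of these identifiers come from the examples above):

```rust
// Sketch: zip pairs two same-length lists element-by-element,
// with no indexing (and hence no bounds checks) on either.
fn feeding_plan(names: &[&str], meals: &[&str]) -> Vec<String> {
    names
        .iter()
        .zip(meals.iter())
        .map(|(name, meal)| format!("{name} eats {meal}"))
        .collect()
}

fn main() {
    let plan = feeding_plan(&["Dog", "Cat"], &["Kibble", "Fish"]);
    assert_eq!(plan, ["Dog eats Kibble", "Cat eats Fish"]);
}
```

Because `zip` stops at the end of the shorter iterator, there's no way to index out of range on either list.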
C++20 introduced ranges, a feature that lets you pipeline operations on a collection much as Rust iterators do, so this style of programming is likely to become more common in C++ too.
To summarize: While in C++ you tend to operate on collections by performing a series of operations on each individual item, in Rust you'll typically apply a pipeline of operations to the whole collection. Make this mental switch and your code will not just become more idiomatic but more efficient, too.
Isn't it confusing to use the same variable name twice?
In Rust, it's common to reuse the same name for multiple variables in a function. For a C++ programmer, this is weird, but there are two good reasons to do it:
- You may no longer need to change a mutable variable after a certain point, and if your code is sufficiently complex you might want the compiler to guarantee this for you:

```rust
fn spot_ate_my_slippers() -> bool {
    false
}
fn feed(_: &str) {}

fn main() {
    let mut good_boy = "Spot";
    if spot_ate_my_slippers() {
        good_boy = "Rover";
    }
    let good_boy = good_boy; // never going to change my dog again, who's a good boy
    feed(good_boy);
}
```

- Another common pattern is to retain the same variable name as you gradually unwrap things to a simpler type:

```rust
fn main() {
    let url = "http://foo.com:1234";
    let port_number = url.split(':').nth(2).unwrap(); // hmm, maybe somebody else already wrote a better URL parser....? naah, probably not
    let port_number = port_number.parse::<u16>().unwrap();
}
```
How can I avoid the performance penalty of `unwrap()`?
C++ has no equivalent to Rust's `match`, so programmers coming from C++ often underuse it.
A heuristic: if you find yourself calling `unwrap()`, especially in an `if`/`else` statement, you should restructure your code to use a more sophisticated `match`.
For example, note the `unwrap()` here (implying a runtime branch):

```rust
fn test_parse() -> Result<u64, std::num::ParseIntError> {
    let s = "0x64a";
    if s.starts_with("0x") {
        u64::from_str_radix(s.strip_prefix("0x").unwrap(), 16)
    } else {
        s.parse::<u64>()
    }
}
```
and no extra `unwrap()` here:

```rust
fn test_parse() -> Result<u64, std::num::ParseIntError> {
    let s = "0x64a";
    match s.strip_prefix("0x") {
        None => s.parse::<u64>(),
        Some(remainder) => u64::from_str_radix(remainder, 16),
    }
}
```
`if let` and `matches!` are just as good as `match` but sometimes a little more concise. `cargo clippy` will usually tell you if you're using a `match` which can be simplified to one of those other two constructions.
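As a sketch of when each form is the concise choice (the `port` scenario here is hypothetical, not from the examples above): a boolean test is what `matches!` abbreviates, and `if let` suits "act only in one case":

```rust
// A boolean test over an Option spelled as a match needs a boilerplate
// arm per case: match port { Some(_) => true, None => false }.
// `matches!` abbreviates exactly that:
fn has_port(port: Option<u16>) -> bool {
    matches!(port, Some(_))
}

// `if let` is the concise form when only one arm does real work.
fn describe(port: Option<u16>) -> String {
    if let Some(p) = port {
        format!("listening on {p}")
    } else {
        "no port configured".to_string()
    }
}

fn main() {
    assert!(has_port(Some(8080)));
    assert!(!has_port(None));
    assert_eq!(describe(Some(8080)), "listening on 8080");
}
```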
How do I access variables from within a spawned thread?
Use `std::thread::scope`.
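For instance, here's a minimal sketch (the function name is illustrative) of a scoped thread borrowing data from the caller's stack frame - no `Arc` or `'static` bound required, because the scope guarantees the thread joins before it returns:

```rust
use std::thread;

// Sum a slice on a scoped worker thread, borrowing the data directly.
fn sum_on_thread(numbers: &[i32]) -> i32 {
    let mut total = 0;
    thread::scope(|s| {
        s.spawn(|| {
            // Borrows both `numbers` and `total` from the caller's frame;
            // the scope guarantees the thread finishes before we return.
            total = numbers.iter().sum();
        });
    });
    total
}

fn main() {
    assert_eq!(sum_on_thread(&[1, 2, 3]), 6);
}
```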
When should I use runtime checks vs jumping through hoops to do static checks?
Everyone learns Rust a different way, but it's said that some people reach a point of "trait mania" where they try to encode too much via the type system and get into a mess. So, in learning Rust, you'll want to strike a balance between runtime checks (easy) and static compile-time checks (more efficient, but requiring deeper understanding).
It’s very personal - some people learn better if they opt out of language features, others not. - MG
Some heuristics for how to keep things simple during the beginning of your Rust journey:
- It's OK to start with lots of `.unwrap()`, cloning, and `Arc`/`Rc`.
- Start to use more advanced language features when you feel annoyed by the amount of boilerplate. (As an expert, you'll switch to a different strategy: considering the virality of your choices through the codebase.)
- Don't use traits until you have to. You might (for instance) need to use a trait to make some code unit testable, but overoptimizing for that too soon is a mistake. Some say that it's wise initially to avoid defining any new traits at all.
- Try to keep types smaller.
Specifically on reference counting:
If using Rc means you can avoid a lifetime parameter which is in half the APIs in the project, that’s a very reasonable choice. If it avoids a single lifetime somewhere, probably not a good idea. But measure before deciding. - MG
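To illustrate that trade-off with a hypothetical `Config` type (every name here is made up for the sketch): storing a borrowed reference forces a lifetime parameter onto every type that holds it, while `Rc` makes the parameter disappear from all those signatures:

```rust
use std::rc::Rc;

// Hypothetical shared configuration.
struct Config {
    greeting: String,
}

// With a borrowed reference, the lifetime parameter infects every type
// that stores it:
//   struct Greeter<'a> { config: &'a Config }   // ...and so on, everywhere.
//
// With Rc, the lifetime vanishes from all those APIs:
struct Greeter {
    config: Rc<Config>,
}

impl Greeter {
    fn greet(&self, name: &str) -> String {
        format!("{}, {name}!", self.config.greeting)
    }
}

fn main() {
    let config = Rc::new(Config { greeting: "Hello".to_string() });
    let greeter = Greeter { config: Rc::clone(&config) };
    assert_eq!(greeter.greet("Ferris"), "Hello, Ferris!");
}
```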
If you want to bail out of the complexity of static checks, which runtime checks are OK?
- `unwrap()` and `Option` are mostly fine.
- `Arc` and `Rc` are also fine in most cases.
- Extensive use of `clone()` is fine but will have a performance impact.
- `Cell` is regarded as a code smell and suggests you don't understand your lifetimes - it should be used sparingly.
- `unsafe` is definitely not OK. It's harder to write `unsafe` Rust than to write C or C++, because Rust has additional aliasing rules. If you're reaching for `unsafe` to work around the complexity of Rust's static type system, as a relative Rust beginner, please reconsider and look into the other options listed above.
Doing lifetime magic - where "magic" means annotating a function or complex type with more than one lifetime, or other wizardry - is often an optimization that you can defer until later. In the beginning, and when writing small programs that you only intend to use a few times ('scripts'), copying is fine.
Questions about your function signatures
- Should I return an iterator or a collection?
- How flexible should my parameters be?
- How do I overload constructors?
- When must I use `#[must_use]`?
- When should I take parameters by value?
- Should I ever take `self` by value?
- How do I take a thing, and a reference to something within that thing?
- When should I return `impl Trait`?
- I miss function overloading! What do I do?
- I miss operator overloading! What do I do?
- Should I return an error, or panic?
- What should my error type be?
- When should I take or return `dyn Trait`?
- I seem to have lots of named lifetimes (`<'a>`, `<'b>`). Am I doing something wrong?
Should I return an iterator or a collection?
Pretty much always return an iterator. - AH
We suggested you use iterators a lot in your code. Share the love! Give iterators to your callers too.
If you know your caller will store the items you're returning in a concrete collection, such as a `Vec` or a `HashSet`, you may want to return that. In all other cases, return an iterator.
Your caller might:
- Collect the iterator into a `Vec`
- Collect it into a `HashSet` or some other specialized container
- Loop over the items
- Filter them, or otherwise completely ignore some
Collecting the items into a vector will turn out to be right in only one of these cases. In the other cases, you're wasting memory and CPU time by building a concrete collection.
This is weird for C++ programmers, because C++ iterators don't usually hold robust references into the underlying data. Even Java iterators are scary, throwing `ConcurrentModificationException`s when you least expect it. Rust prevents all that at compile time. If you can return an iterator, you should.
```mermaid
flowchart LR
  subgraph Caller
    it_ref[reference to iterator]
  end
  subgraph it_outer[Iterator]
    it[Iterator]
    it_ref --reference--> it
  end
  subgraph data[Underlying data]
    dat[Underlying data]
    it --reference--> dat
  end
```
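As a sketch of this advice, reusing the hungry-pets idea from earlier (the function name and the two-field `Animal` are illustrative):

```rust
struct Animal {
    kind: &'static str,
    is_hungry: bool,
}

// Returning `impl Iterator` instead of `Vec<&str>`: nothing is
// materialized unless the caller asks for a collection.
fn hungry_animals(animals: &[Animal]) -> impl Iterator<Item = &'static str> + '_ {
    animals
        .iter()
        .filter(|a| a.is_hungry)
        .map(|a| a.kind)
}

fn main() {
    let pets = [
        Animal { kind: "Dog", is_hungry: true },
        Animal { kind: "Python", is_hungry: false },
    ];
    // One caller may loop; another may `collect()` - their choice.
    let hungry: Vec<_> = hungry_animals(&pets).collect();
    assert_eq!(hungry, ["Dog"]);
}
```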
How flexible should my parameters be?
Which of these is best?
```rust
fn a(params: &[String]) {
    // ...
}
fn b(params: &[&str]) {
    // ...
}
fn c(params: &[impl AsRef<str>]) {
    // ...
}
```
(You'll need to make an equivalent decision in other cases, e.g. `Path` versus `PathBuf` versus `AsRef<Path>`.)
None of the options is clearly superior; for each option, there's a case it can't handle that the others can:
```rust
fn a(params: &[String]) {}
fn b(params: &[&str]) {}
fn c(params: &[impl AsRef<str>]) {}

fn main() {
    a(&[]);
    // a(&["hi"]); // doesn't work
    a(&vec![format!("hello")]);

    b(&[]);
    b(&["hi"]);
    // b(&vec![format!("hello")]); // doesn't work

    // c(&[]); // doesn't work
    c(&["hi"]);
    c(&vec![format!("hello")]);
}
```
So you have a variety of interesting ways to slightly annoy your callers under different circumstances. Which is best?
`AsRef` has some advantages: if a caller has a `Vec<String>`, they can use it directly, which would be impossible with the other options. But if they want to pass an empty list, they'll have to explicitly specify the type (for instance `&Vec::<String>::new()`).
Not a huge fan of AsRef everywhere - it's just saving the caller typing. If you have lots of AsRef then nothing is object-safe. - MG
TL;DR: choose the middle option, `&[&str]`. If your caller happens to have a vector of `String`, it's relatively little work to get a slice of `&str`:

```rust
fn b(params: &[&str]) {}

fn main() {
    // Instead of b(&vec![format!("hello")]);
    let hellos = vec![format!("hello")];
    b(&hellos.iter().map(String::as_str).collect::<Vec<_>>());
}
```
How do I overload constructors?
You can't do this:
```rust
struct BirthdayCard {}

impl BirthdayCard {
    fn new(name: &str) -> Self {
        Self {} // ...
    }
    // Can't add more overloads:
    //
    //   fn new(name: &str, age: i32) -> BirthdayCard { ... }
    //
    //   fn new(name: &str, text: &str) -> BirthdayCard { ... }
}
```
If you have a default constructor, and a few variants for other cases, you can simply write them as different static methods (associated functions, in Rust parlance). An idiomatic way to do this is to write a `new()` constructor and then `with_foo()` constructors that apply the given "foo" when constructing.
```rust
struct Racoon {}

impl Racoon {
    fn new() -> Self {
        Self {} // ...
    }
    fn with_age(age: usize) -> Self {
        Self {} // ...
    }
}
```
If you have a bunch of constructors and no default, it may make sense to instead provide a set of `new_foo()` constructors.
```rust
struct Animal {}

impl Animal {
    fn new_squirrel() -> Self {
        Self {} // ...
    }
    fn new_badger() -> Self {
        Self {} // ...
    }
}
```
For a more complex situation, you may use the builder pattern. The builder has a set of methods which take `&mut self` and return `&mut Self`. Then add a `build()` method that returns the final constructed object.
```rust
struct BirthdayCard {}

struct BirthdayCardBuilder {}

impl BirthdayCardBuilder {
    fn new(name: &str) -> Self {
        Self {} // ...
    }
    fn age(&mut self, age: i32) -> &mut Self {
        self // ...
    }
    fn text(&mut self, text: &str) -> &mut Self {
        self // ...
    }
    fn build(&mut self) -> BirthdayCard {
        BirthdayCard { /* ... */ }
    }
}
```
You can then chain these into short or long constructions, passing parameters as necessary:
```rust
struct BirthdayCard {}

struct BirthdayCardBuilder {}

impl BirthdayCardBuilder {
    fn new(name: &str) -> Self {
        Self {} // ...
    }
    fn age(&mut self, age: i32) -> &mut Self {
        self // ...
    }
    fn text(&mut self, text: &str) -> &mut Self {
        self // ...
    }
    fn build(&mut self) -> BirthdayCard {
        BirthdayCard { /* ... */ }
    }
}

fn main() {
    let card = BirthdayCardBuilder::new("Paul")
        .age(64)
        .text("Happy Valentine's Day!")
        .build();
}
```
Note another advantage of builders: Overloaded constructors often don't provide all possible combinations of parameters, whereas with the builder pattern, you can combine exactly the parameters you want.
When must I use `#[must_use]`?
Use it on Results and mutex locks. - MG
`#[must_use]` makes the compiler warn if the caller ignores the return value.
Rust functions are often single-purpose. They either:
- Return a value without any side effects; or
- Do something (i.e. have side effects) and return nothing.
In neither case do you need to think about `#[must_use]`. (In the first case, nobody would call your function unless they were going to use the result.)
`#[must_use]` is useful for those rarer functions which return a result and have side effects. In most such cases it's wise to specify `#[must_use]`, unless the return value is truly optional (as with `HashMap::insert`, for example).
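Here's a sketch of such a rarer function - one with a side effect *and* a meaningful result (the cache scenario and all names are hypothetical):

```rust
// Evicts stale entries (side effect) and reports how many were removed
// (result). Ignoring the count is probably a bug, so flag it.
#[must_use = "the number of evicted entries should be checked"]
fn evict_stale_entries(cache: &mut Vec<(String, bool)>) -> usize {
    let before = cache.len();
    cache.retain(|(_, stale)| !stale);
    before - cache.len()
}

fn main() {
    let mut cache = vec![("a".to_string(), true), ("b".to_string(), false)];
    // A bare `evict_stale_entries(&mut cache);` would trigger the warning;
    // using (or explicitly discarding with `let _ =`) the value silences it.
    let evicted = evict_stale_entries(&mut cache);
    assert_eq!(evicted, 1);
    assert_eq!(cache.len(), 1);
}
```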
When should I take parameters by value?
Move semantics are more common in Rust than in C++.
In C++ moves tend to be an optimization, whereas in Rust they're a key semantic part of the program. - MY
To a first approximation, you should assume similar performance when passing things by (moved) value or by reference. It's true that a move may turn out to be a `memcpy`, but it's often optimized away.
Express the ownership relationship in the type system, instead of trying to second-guess the compiler for efficiency. - AF
The moves are, of course, destructive - and unlike in C++, the compiler enforces that you can't use a variable again after it has been moved from. Some C++ objects become toxic once moved from; that's not a risk in Rust.
So here's the heuristic: if a caller shouldn't be able to use an object again, pass it via move semantics in order to consume it.
An extreme example: a UUID is supposed to be globally unique - it might cause a logic error for a caller to retain knowledge of a UUID after passing it to a callee.
More generally, consume data enthusiastically to avoid logical errors during future refactorings. For instance, if some command-line options are overridden by a runtime choice, consume those old options - then any future refactoring which uses them after that point will give you a compile error. This pattern is surprisingly effective at spotting errors in your assumptions.
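A minimal sketch of that command-line-options idea (all the names here are made up for illustration):

```rust
// Hypothetical command-line options, later overridden at runtime.
struct CliOptions {
    verbose: bool,
}

struct EffectiveSettings {
    verbose: bool,
}

// Taking `CliOptions` by value consumes it: any later use of the original
// variable is a compile error, not a silent read of stale values.
fn resolve(options: CliOptions, runtime_verbose_override: Option<bool>) -> EffectiveSettings {
    EffectiveSettings {
        verbose: runtime_verbose_override.unwrap_or(options.verbose),
    }
}

fn main() {
    let options = CliOptions { verbose: false };
    let settings = resolve(options, Some(true));
    // options.verbose; // <- would no longer compile: `options` was moved
    assert!(settings.verbose);
}
```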
Should I ever take `self` by value?
Sometimes. If you've got a member function which destroys or transforms a thing, it should take `self` by value. Examples:
- Closing a file and returning a result code.
- A builder-pattern object which spits out the thing it was building (see the builder example earlier).
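A sketch of the first case, using a hypothetical `TempFile` handle (the type and its behavior are illustrative):

```rust
struct TempFile {
    path: String,
}

impl TempFile {
    // Consumes the TempFile: after `close`, the compiler rejects any
    // further use of the variable.
    fn close(self) -> Result<String, String> {
        // (Real code would delete the file and report I/O errors.)
        Ok(self.path)
    }
}

fn main() {
    let file = TempFile { path: "/tmp/scratch".to_string() };
    let closed = file.close();
    // file.close(); // <- compile error: `file` was moved
    assert_eq!(closed.unwrap(), "/tmp/scratch");
}
```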
How do I take a thing, and a reference to something within that thing?
For example, suppose you want to give all of your dogs to your friend, yet also tell your friend which one of the dogs is the Best Boy or Girl.
```cpp
struct PetInformation {
  std::vector<Dog> dogs;
  Dog& BestBoy;
  Dog& BestGirl;
};

PetInformation GetPetInformation() {
  // ...
}
```
Generally this is an indication that your types or functions are not divided up in the right way:
This is a decomposition problem. Once you’ve found the correct decomposition, everything else just works. The code almost writes itself. - AF
```rust
struct Dog;
struct PetInformation(Vec<Dog>);

fn get_pet_information() -> PetInformation {
    // ...
    PetInformation(Vec::new())
}

fn identify_best_boy(pet_information: &PetInformation) -> &Dog {
    // ...
    pet_information.0.get(0).unwrap()
}
```
One use-case is when you want to act on some data depending on its contents... but you also want to do something with those contents that you previously identified.
```rust
struct Key;
struct Door {
    locked: bool,
}
struct Car {
    ignition: Option<Key>,
    door: Door,
}

fn turn_key(key: &Key) {}
fn break_in_and_hotwire(car: Car) {}

fn steal_car(car: Car) {
    match car {
        Car { ignition: Some(ref key), door: Door { locked: false } } =>
            drive_away_normally(car /* , key */),
        _ => break_in_and_hotwire(car),
    }
}

fn drive_away_normally(car: Car /* , key: &Key */) {
    // Annoying to have to repeat this code...
    let key = match car {
        Car { ignition: Some(ref key), .. } => key,
        _ => unreachable!(),
    };
    turn_key(key);
    // ...
}
```
If this repeated matching gets annoying, it's relatively easy to extract it to a function.
```rust
struct Key;
struct Door {
    locked: bool,
}
struct Car {
    ignition: Option<Key>,
    door: Door,
}

fn turn_key(key: &Key) {}
fn break_in_and_hotwire(car: Car) {}

impl Car {
    fn get_usable_key(&self) -> Option<&Key> {
        match self {
            Car { ignition: Some(ref key), door: Door { locked: false } } => Some(key),
            _ => None,
        }
    }
}

fn steal_car(car: Car) {
    match car.get_usable_key() {
        None => break_in_and_hotwire(car),
        Some(_) => drive_away_normally(car),
    }
}

fn drive_away_normally(car: Car) {
    turn_key(car.get_usable_key().unwrap());
}
```
When should I return `impl Trait`?
Your main consideration should be API stability. If your caller doesn't need to know the concrete implementation type, then don't tell it. That gives you flexibility to change your implementation in future without breaking compatibility.
Note Hyrum's Law!
Using `impl Trait` doesn't solve all possible API stability concerns, because even `impl Trait` leaks auto traits such as `Send` and `Sync`.
I miss function overloading! What do I do?
Use a trait to implement the behavior you used to have.
For example, in C++:
```cpp
class Dog {
 public:
  void eat(Dogfood);
  void eat(DeliveryPerson);
};
```
In Rust you might express this as:
```rust
trait Edible {}

struct Dogfood;
struct DeliveryPerson;

impl Edible for Dogfood {}
impl Edible for DeliveryPerson {}

struct Dog;

impl Dog {
    fn eat(&self, edible: impl Edible) {
        // ...
    }
}
```
This gives your caller all the convenience they want, though may increase work for you as the implementer.
I miss operator overloading! What do I do?
Implement the standard traits instead (for example `PartialEq`, `Add`). This has an equivalent effect, in that folks will be able to use your type in a standard Rusty way without needing to know much that's special about it.
Should I return an error, or panic?
Panics should be used only for broken invariants, never for anything that you believe might actually happen. That's especially true for libraries - panicking (or asserting) should be reserved for the 'top level' code driving the application.
Libraries which panic are super-rude and I hate them - MY
Even in your own application code, panicking might not be wise:
Panicking in application logic for recoverable errors makes it way harder to librarify some code - AP
If you really must have an API which can panic, add a `try_` equivalent too.
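A sketch of that convention, with a hypothetical stack type (the names are illustrative):

```rust
struct Stack {
    items: Vec<i32>,
}

impl Stack {
    // Fallible version: the caller decides what an empty stack means.
    fn try_pop(&mut self) -> Option<i32> {
        self.items.pop()
    }

    // Panicking version: only for callers who treat emptiness as a bug.
    fn pop(&mut self) -> i32 {
        self.try_pop().expect("pop called on an empty Stack")
    }
}

fn main() {
    let mut stack = Stack { items: vec![1, 2] };
    assert_eq!(stack.pop(), 2);
    assert_eq!(stack.try_pop(), Some(1));
    assert_eq!(stack.try_pop(), None); // no panic - caller handles it
}
```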
What should my error type be?
Rust's `Result` type is parameterized over an error type. What should you use?
For application code, consider anyhow. For library code, use your own `enum` of error conditions - you can use thiserror to make this more pleasant.
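For library code, such an `enum` might look like the following - written out by hand here to show roughly what a `thiserror` derive saves you from writing (the error variants and the `lookup` function are illustrative):

```rust
use std::fmt;

// A library-style error enum. `thiserror` can derive Display and Error;
// this hand-rolled version shows approximately what that boils down to.
#[derive(Debug, PartialEq)]
enum ConfigError {
    Missing(String),
    OutOfRange { key: String, value: i64 },
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::Missing(key) => write!(f, "missing key: {key}"),
            ConfigError::OutOfRange { key, value } => {
                write!(f, "{key} out of range: {value}")
            }
        }
    }
}

impl std::error::Error for ConfigError {}

fn lookup(key: &str) -> Result<i64, ConfigError> {
    match key {
        "retries" => Ok(3),
        _ => Err(ConfigError::Missing(key.to_string())),
    }
}

fn main() {
    assert_eq!(lookup("retries"), Ok(3));
    assert_eq!(
        lookup("timeout").unwrap_err().to_string(),
        "missing key: timeout"
    );
}
```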
When should I take or return `dyn Trait`?
In both C++ and Rust, you can choose between monomorphization (that is, compiling the code once for each combination of parameter types) and dynamic dispatch (that is, looking up the correct implementation via a vtable).
In C++ the syntax for the two is completely different - templates vs. virtual functions. In Rust the syntax is almost identical - in some cases it's as simple as exchanging the `impl` keyword for the `dyn` keyword.
Given this flexibility to switch strategies, which should you start with?
In both languages, monomorphization tends to result in a quicker program (partly due to better inlining). Inlining is arguably even more important in Rust, due to its functional style and pervasive use of iterators. Whether or not that's the reason, experienced Rustaceans usually start with `impl`:
It's best practice to start with monomorphization and move to `dyn`... - MG
The main cost of monomorphization is larger binaries. There are cases where large amounts of code can end up being duplicated (the marvellous serde is one).
You can choose to do things the other way round:
... it's workable practice to start with `dyn` and then move to `impl` when you have problems. - MG
`dyn` can be awkward, and potentially expensive, in different ways:
One thing to note about pervasive `dyn` is that because it unsizes the types it wraps, you need to box it if you want to store it by value. You end up with a good bit more allocator pressure if you try to have `dyn` field types. - AP
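A minimal sketch of how close the two syntaxes are (the trait and the names are made up for illustration):

```rust
trait Greet {
    fn greet(&self) -> String;
}

struct English;
impl Greet for English {
    fn greet(&self) -> String {
        "hello".to_string()
    }
}

// Monomorphized: one copy of this function per concrete type it's used with.
fn announce_impl(greeter: &impl Greet) -> String {
    greeter.greet()
}

// Dynamic dispatch: one copy, resolved through a vtable at runtime.
fn announce_dyn(greeter: &dyn Greet) -> String {
    greeter.greet()
}

fn main() {
    let e = English;
    assert_eq!(announce_impl(&e), announce_dyn(&e));
}
```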
I seem to have lots of named lifetimes (`<'a>`, `<'b>`). Am I doing something wrong?
Some say that if you have a significant number of named lifetimes, you're overcomplicating things.
There are some scenarios where multiple named lifetimes make perfect sense - for example if you're dealing with an arena, or with major phases of a process (the Rust compiler has `'gcx` and `'tcx` lifetimes relating to the output of certain compile phases).
But otherwise, it may be that you've got lifetimes because you're trying too hard to avoid a copy. You may be better off simply switching to runtime checking (e.g. `Rc`, `Arc`) or even cloning.
Are named lifetimes even a "code smell"?
My experience has been that the extent to which they're a smell varies a good bit based on the programmer's experience level, which has led me towards increased skepticism over time. Lots of people learning Rust have experienced the pain of first not wanting to `.clone()` something, immediately putting lifetimes everywhere, and then feeling the pain of lifetime subtyping and variance. I don't think they're nearly as odorous as `unsafe`, for example, but treating them as a bit of a smell does, I think, lead to code that's easier to read for a newcomer and to refactor around the stack. - AP
Questions about your types
- My 'class' needs mutable references to other things to do its job. Other classes need mutable references to these things too. What do I do?
- My type needs to store arbitrary user data. What do I do instead of `void *`?
- When should I put my data in a `Box`?
- Should I have public fields or accessor methods?
- When should I use a newtype wrapper?
- How else can I use Rust's type system to avoid high-level logic bugs?
- What should I do instead of inheritance?
- I need a list of nodes which can refer to one another. How?
- I'm having a miserable time making my data structure. Should I use unsafe?
- I nevertheless have to write my own data structure. Should I use unsafe?
My 'class' needs mutable references to other things to do its job. Other classes need mutable references to these things too. What do I do?
It's common in C++ to have a class that contains mutable references to other objects; the class mutates those objects to do its work. Often, several classes all hold a mutable reference to the same object. Here is a diagram that illustrates this:
```mermaid
flowchart LR
  subgraph Shared functionality
    important[Important Shared Object]
  end
  subgraph ObjectA
    methodA[Method]
    refa[Mutable Reference] --> important
    methodA -. Acts on shared object .-> important
  end
  subgraph ObjectB
    refb[Mutable Reference] --> important
    methodB[Method]
    methodB -. Acts on shared object .-> important
  end
  main --> ObjectA
  main --> ObjectB
  main -. Calls .-> methodA
  main -. Calls .-> methodB
```
In Rust, you can't have multiple mutable references to a shared object, so what do you do?
First of all, consider moving behavior out of your types. (See the answer about the observer pattern and especially the second option described there.)
Even in Rust, though, it's still often the best choice to make complex behavior part of the type within impl blocks. You can still do that - but don't store references. Instead, pass them into each function call.
flowchart LR
  subgraph Shared functionality
    important[Important Shared Object]
  end
  subgraph ObjectA
    methodA[Method]
    methodA-. Acts on shared object .->important
  end
  subgraph ObjectB
    methodB[Method]
    methodB-. Acts on shared object .->important
  end
  main --> ObjectA
  main --> ObjectB
  main --> important
  main-. Passes reference to shared object .-> methodA
  main-. Passes reference to shared object .-> methodB
Instead of this:
```rust
struct ImportantSharedObject;

struct ObjectA<'a> {
    important_shared_object: &'a mut ImportantSharedObject,
}

impl<'a> ObjectA<'a> {
    fn new(important_shared_object: &'a mut ImportantSharedObject) -> Self {
        Self { important_shared_object }
    }
    fn do_something(&mut self) {
        // act on self.important_shared_object
    }
}

fn main() {
    let mut shared_thingy = ImportantSharedObject;
    let mut a = ObjectA::new(&mut shared_thingy);
    a.do_something(); // acts on shared_thingy
}
```
Do this:
```rust
struct ImportantSharedObject;

struct ObjectA;

impl ObjectA {
    fn new() -> Self {
        Self
    }
    fn do_something(&mut self, important_shared_object: &mut ImportantSharedObject) {
        // act on important_shared_object
    }
}

fn main() {
    let mut shared_thingy = ImportantSharedObject;
    let mut a = ObjectA::new();
    a.do_something(&mut shared_thingy); // acts on shared_thingy
}
```
(Happily this also gets rid of named lifetime parameters.)
If you have a hundred such shared objects, you probably don't want a hundred function parameters. So it's usual to bundle them up into a context structure which can be passed into each function call:
```rust
struct ImportantSharedObject;
struct AnotherImportantObject;

struct Ctx<'a> {
    important_shared_object: &'a mut ImportantSharedObject,
    another_important_object: &'a mut AnotherImportantObject,
}

struct ObjectA;

impl ObjectA {
    fn new() -> Self {
        Self
    }
    fn do_something(&mut self, ctx: &mut Ctx) {
        // act on ctx.important_shared_object and ctx.another_important_object
    }
}

fn main() {
    let mut shared_thingy = ImportantSharedObject;
    let mut another_thingy = AnotherImportantObject;
    let mut ctx = Ctx {
        important_shared_object: &mut shared_thingy,
        another_important_object: &mut another_thingy,
    };
    let mut a = ObjectA::new();
    a.do_something(&mut ctx); // acts on both the shared thingies
}
```
flowchart LR
  subgraph Shared functionality
    important[Important Shared Object]
  end
  subgraph Context
    refa[Mutable Reference]-->important
  end
  subgraph ObjectA
    objectA[Object A]
    methodA[Method]
    methodA-. Acts on shared object .->important
  end
  subgraph ObjectB
    objectB[Object B]
    methodB[Method]
    methodB-. Acts on shared object .->important
  end
  main --> objectA
  main --> objectB
  main --> Context
  main-. Passes context .-> methodA
  main-. Passes context .-> methodB
Even simpler: just put all the data directly into Ctx. But the key point is that this context object is passed around into just about all function calls rather than being stored anywhere, thus negating any borrowing/lifetime concerns.
This pattern can be seen in bindgen, for example.
Split out borrowing concerns from the object concerns. - MG
To generalize this idea, try to avoid storing references to anything that might need to be changed. Instead take those things as parameters. For instance petgraph takes the entire graph as context to a Walker object, such that the graph can be changed while you're walking it.
My type needs to store arbitrary user data. What do I do instead of void *?
Ideally, your type would know all possible types of user data that it could store. You'd represent this as an enum with variant data for each possibility. This would give complete compile-time type safety.

But sometimes code needs to store data for which it can't depend upon the definition: perhaps it's defined by a totally different area of the codebase, or belongs to clients. Such possibilities can't be enumerated in advance. Until recently, the only real option in C++ was to use a void * and have clients downcast to get their original type back. Modern C++ offers a much better option, std::any; if you've come across that, Rust's equivalent will seem very familiar.

In Rust, the Any type allows you to store anything and retrieve it later in a type-safe fashion:
```rust
use std::any::Any;

struct MyTypeOfUserData(u8);

fn main() {
    let any_user_data: Box<dyn Any> = Box::new(MyTypeOfUserData(42));
    let stored_value = any_user_data.downcast_ref::<MyTypeOfUserData>().unwrap().0;
    println!("{}", stored_value);
}
```
If you want to be more prescriptive about what can be stored, you can define a trait (let's call it UserData) and store a Box<dyn UserData>. Your trait should have a method fn as_any(&self) -> &dyn std::any::Any; each implementation can just return self. Your caller can then do this:
```rust
trait UserData {
    fn as_any(&self) -> &dyn std::any::Any;
    // ...other trait methods which you wish to apply to any UserData...
}

struct MyTypeOfUserData(u8);

impl UserData for MyTypeOfUserData {
    fn as_any(&self) -> &dyn std::any::Any {
        self
    }
}

fn main() {
    // Store a generic Box<dyn UserData>
    let user_data: Box<dyn UserData> = Box::new(MyTypeOfUserData(42));
    // Get back to a specific type
    let stored_value = user_data.as_any().downcast_ref::<MyTypeOfUserData>().unwrap().0;
    println!("{}", stored_value);
}
```
Of course, enumerating all possible stored variants remains preferable such that the compiler helps you to avoid runtime panics.
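For contrast, a minimal sketch of the enum approach described above (the variants here are illustrative):

```rust
// Hypothetical: we know every kind of user data in advance.
enum UserData {
    Id(u64),
    Name(String),
}

fn describe(data: &UserData) -> String {
    // The compiler forces us to handle every variant: no downcast,
    // no runtime panic possible.
    match data {
        UserData::Id(id) => format!("id {}", id),
        UserData::Name(name) => format!("name {}", name),
    }
}

fn main() {
    println!("{}", describe(&UserData::Id(42)));
}
```

Adding a new variant later makes every non-exhaustive match a compile error, which is exactly the safety that downcasting gives up.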
When should I put my data in a Box?
In C++, you often need to box things for ownership reasons, whereas in Rust it's typically just a performance trade-off. It's arguably premature optimization to use boxes unless your profiling shows a lot of memcpy of that particular type (or, perhaps, the relevant clippy lint informs you that you have a problem.)
I never box things unless they're really big. - MG
Another heuristic is if part of your data structure is very rarely filled, in which case you may wish to Box it to avoid incurring an overhead for all other instances of the type.
```rust
struct Humility;
struct Talent;
struct Ego;

struct Popstar {
    ego: Ego,
    talent: Talent,
    humility: Option<Box<Humility>>,
}
```
(This is one reason why people like using anyhow for their errors; it means the failure case in their Result enum is only a pointer wide.)
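A quick way to see the effect is to compare sizes; BigError here is an illustrative stand-in for a large, rarely-constructed failure payload:

```rust
use std::mem::size_of;

// An illustrative large, rarely-present error payload.
struct BigError([u8; 256]);

// Returns (unboxed size, boxed size) for the two Result layouts.
fn result_sizes() -> (usize, usize) {
    // Unboxed: every Result is at least as big as its largest variant.
    let fat = size_of::<Result<u8, BigError>>();
    // Boxed: the failure variant costs only a pointer.
    let thin = size_of::<Result<u8, Box<BigError>>>();
    (fat, thin)
}

fn main() {
    let (fat, thin) = result_sizes();
    assert!(thin < fat);
    println!("unboxed: {} bytes, boxed: {} bytes", fat, thin);
}
```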
Of course, Rust may require you to use a box:

- if you need to Pin some data, typically for async Rust, or
- if you otherwise have an infinitely sized data structure

but as usual, the compiler will explain very nicely.
Should I have public fields or accessor methods?
The trade-offs are similar to C++ except that Rust's pattern-matching makes it very convenient to match on fields, so within a realm of code that you own you may bias towards having more public fields than you're used to. As with C++, this can give you a future compatibility burden.
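For instance, a sketch of matching directly on public fields (names are illustrative):

```rust
// Public fields let callers pattern-match and destructure directly.
struct Rect {
    pub width: u32,
    pub height: u32,
}

fn describe(r: &Rect) -> &'static str {
    match r {
        // Match on individual field values...
        Rect { width: 0, .. } | Rect { height: 0, .. } => "degenerate",
        // ...or destructure and use a guard.
        Rect { width, height } if width == height => "square",
        _ => "rectangle",
    }
}

fn main() {
    assert_eq!(describe(&Rect { width: 3, height: 3 }), "square");
    assert_eq!(describe(&Rect { width: 3, height: 4 }), "rectangle");
}
```

With accessor methods instead, the match arms above would collapse into if-chains, which is the convenience being traded against the compatibility burden.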
When should I use a newtype wrapper?
The newtype wrapper pattern uses Rust's type systems to enforce extra behavior without necessarily changing the underlying representation.
```rust
struct Inches(u32);
struct Centimeters(u32);

fn get_rocket_length() -> Inches {
    Inches(7)
}

fn build_mars_orbiter() {
    let rocket_length: Inches = get_rocket_length();
    // mate_to_orbiter(rocket_length); // does not compile because this takes cm
}
```
Other examples that have been used:
- An IP address which is guaranteed not to be localhost;
- Non-zero numbers;
- IDs which are guaranteed to be unique
Such new types typically need a lot of boilerplate, especially to implement the traits which users of your type would expect to find. On the other hand, they allow you to use Rust's type system to statically prevent logic bugs.
A heuristic: if there are some invariants you'd be checking for at runtime, see if you can use a newtype wrapper to do it statically instead. Although it may be more code to start with, you'll save the effort of finding and fixing logic bugs later.
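As a sketch of that heuristic, here is a hypothetical NonEmptyName newtype that moves an emptiness check from every call site into one checked constructor:

```rust
// A newtype that statically guarantees its invariant: non-empty.
struct NonEmptyName(String);

impl NonEmptyName {
    // The only way to construct one is through this checked constructor,
    // so every NonEmptyName in the program is known to be non-empty.
    fn new(s: &str) -> Option<NonEmptyName> {
        if s.is_empty() {
            None
        } else {
            Some(NonEmptyName(s.to_string()))
        }
    }

    fn get(&self) -> &str {
        &self.0
    }
}

// Functions taking a NonEmptyName need no runtime emptiness check.
fn greet(name: &NonEmptyName) -> String {
    format!("hello, {}", name.get())
}

fn main() {
    assert!(NonEmptyName::new("").is_none());
    let ada = NonEmptyName::new("ada").unwrap();
    println!("{}", greet(&ada));
}
```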
How else can I use Rust's type system to avoid high-level logic bugs?
Lots of ways:
Zero-sized types.
Also known as "ZSTs". These are types which occupy literally zero bytes, and so (generally) make no difference whatsoever to the code generated. But you can use them in the type system to enforce invariants at compile-time with no runtime check.
For example, they're often used as capability tokens - you can statically prove that code exclusively has the right to do something.
```rust
pub trait ValidationStatus {}

mod validator {
    use super::{Bytecode, ValidationStatus};

    /// ZST marker to show that bytecode has been validated.
    // Private field ensures this can't be created outside this mod
    // but PhantomData means this is still zero-sized.
    pub struct BytecodeValidated(std::marker::PhantomData<u8>);

    pub fn validate_bytecode<V: ValidationStatus>(
        code: Bytecode<V>,
    ) -> Bytecode<BytecodeValidated> {
        // Do expensive validation operation here...
        Bytecode {
            validated: BytecodeValidated(std::marker::PhantomData),
            code: code.code,
        }
    }

    impl ValidationStatus for BytecodeValidated {}
}

struct BytecodeNotValidated;
impl ValidationStatus for BytecodeNotValidated {}

pub struct Bytecode<V: ValidationStatus> {
    validated: V,
    code: Vec<u8>,
}

fn run_bytecode(bytecode: &Bytecode<validator::BytecodeValidated>) {
    // Compiler PROVES you validated it before you can run it. There are no
    // runtime branches involved.
}

fn get_unvalidated_bytecode() -> Bytecode<BytecodeNotValidated> {
    // ...
    Bytecode { validated: BytecodeNotValidated, code: Vec::new() }
}

fn main() {
    let bytecode = get_unvalidated_bytecode();
    // run_bytecode(bytecode); // does not compile
    let bytecode = validator::validate_bytecode(bytecode);
    run_bytecode(&bytecode);
    run_bytecode(&bytecode);
}
```
ZSTs can also be used to demonstrate exclusive access to some resource.
```rust
struct RobotArmAccessToken;

fn move_arm(token: &mut RobotArmAccessToken, x: u32, y: u32, z: u32) {
    // ...
}

fn attach_car_door(token: &mut RobotArmAccessToken) {
    move_arm(token, 3, 4, 6);
    move_arm(token, 5, 3, 6);
}

fn install_windscreen(token: &mut RobotArmAccessToken) {
    move_arm(token, 7, 8, 2);
    move_arm(token, 1, 2, 3);
}

fn main() {
    let mut token = RobotArmAccessToken; // ensure only one exists
    attach_car_door(&mut token);
    install_windscreen(&mut token);
}
```
(The type system would prevent these operations happening in parallel.)
Marker traits
Indicate that a type meets certain invariants, so subsequent users of that type don't need to check at runtime. A common example is to indicate that a type is safe to serialize into some bytestream.
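A minimal sketch of a marker trait used this way; the trait and its meaning are illustrative, not a standard library item:

```rust
// A marker trait with no methods: implementing it is a promise that the
// type upholds some invariant (here: plain old data, safe to copy onto a
// wire byte-for-byte).
trait PlainOldData {}

impl PlainOldData for u32 {}
impl PlainOldData for [u8; 4] {}

// APIs can demand the invariant at compile time, so there is no runtime
// "is this serializable?" check anywhere.
fn wire_size<T: PlainOldData>(_value: &T) -> usize {
    std::mem::size_of::<T>()
}

fn main() {
    assert_eq!(wire_size(&42u32), 4);
    // wire_size(&String::new()); // does not compile: String is not PlainOldData
}
```

Send and Sync are the standard library's own examples of exactly this technique.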
Enums as state machines.
Each enum variant is a state and stores data associated with that state. There simply is no possibility that the data can get out of sync with the state.
```rust
enum ElectionState {
    RaisingDonations { amount_raised: u32 },
    DoingTVInterviews { interviews_done: u16 },
    Voting { votes_for_me: u64, votes_for_opponent: u64 },
    Elected,
    NotElected,
}
```
A more heavyweight approach here is to define types for each state, and allow valid state transitions by taking the previous state by-value and returning the next state by-value.
```rust
struct Seed { water_available: u32 }
struct Growing { water_available: u32, sun_available: u32 }
struct Flowering;
struct Dead;

enum PlantState {
    Seed(Seed),
    Growing(Growing),
    Flowering(Flowering),
    Dead(Dead),
}

impl Seed {
    fn advance(self) -> PlantState {
        if self.water_available > 3 {
            PlantState::Growing(Growing {
                water_available: self.water_available,
                sun_available: 0,
            })
        } else {
            PlantState::Dead(Dead)
        }
    }
}

impl Growing {
    fn advance(self) -> PlantState {
        if self.water_available > 3 && self.sun_available > 3 {
            PlantState::Flowering(Flowering)
        } else {
            PlantState::Dead(Dead)
        }
    }
}

impl Flowering {
    fn advance(self) -> PlantState {
        PlantState::Dead(Dead)
    }
}

impl Dead {
    fn advance(self) -> PlantState {
        PlantState::Dead(Dead)
    }
}

impl PlantState {
    fn advance(self) -> Self {
        match self {
            Self::Seed(seed) => seed.advance(),
            Self::Growing(growing) => growing.advance(),
            Self::Flowering(flowering) => flowering.advance(),
            Self::Dead(dead) => dead.advance(),
        }
    }
}

// we should probably find a way to inject some sun and water into this
// state machine or things are not looking rosy
```
What should I do instead of inheritance?
Use composition. Sometimes this results in more boilerplate, but it avoids a raft of complexity.
Specifically, for example:
- you might include the "superclass" struct as a member of the subclass struct;
- you might use an enum with different variants for the different possible "subclasses".
Usually the answer is obvious: it's unlikely that your Rust code is structured in such a way that inheritance would be a good fit anyway.
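For example, a sketch of the first option, with the "superclass" as a field and explicit delegation (names are illustrative):

```rust
// Composition: the would-be base class becomes a field.
struct Engine {
    horsepower: u32,
}

impl Engine {
    fn start(&self) -> String {
        format!("starting {} hp engine", self.horsepower)
    }
}

struct Car {
    engine: Engine, // instead of `class Car : public Engine`
}

impl Car {
    // Explicit delegation replaces an inherited method.
    fn start(&self) -> String {
        self.engine.start()
    }
}

fn main() {
    let car = Car { engine: Engine { horsepower: 90 } };
    println!("{}", car.start());
}
```

The delegation is the boilerplate mentioned above; in exchange, there is no fragile-base-class problem and no ambiguity about which method runs.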
I've only missed inheritance when actually implementing languages which themselves have inheritance. - MG
I need a list of nodes which can refer to one another. How?
You can't easily do self-referential data structures in Rust. The usual workaround is to use an arena and replace references from one node to another with node IDs.
An arena is typically a Vec (or similar), and the node IDs are a newtype wrapper around a simple integer index.
Obviously, Rust doesn't check that your node IDs are valid. If you don't have proper references, what stops you from having stale IDs?
Arenas are often purely additive, which means that you can add entries but not delete them (example). If you must have an arena which deletes things, then use generational IDs; see the generational-arena crate and this RustConf keynote for more details.
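A minimal sketch of such an additive arena with newtype node IDs (all names are illustrative):

```rust
// A newtype over the Vec index stands in for a reference.
#[derive(Clone, Copy)]
struct NodeId(usize);

struct Node {
    value: u32,
    next: Option<NodeId>, // "reference" to another node, with no lifetimes
}

struct Arena {
    nodes: Vec<Node>,
}

impl Arena {
    fn new() -> Self {
        Arena { nodes: Vec::new() }
    }

    // Purely additive: we only ever push, so existing NodeIds stay valid.
    fn add(&mut self, value: u32, next: Option<NodeId>) -> NodeId {
        self.nodes.push(Node { value, next });
        NodeId(self.nodes.len() - 1)
    }

    fn get(&self, id: NodeId) -> &Node {
        &self.nodes[id.0]
    }
}

fn main() {
    let mut arena = Arena::new();
    let a = arena.add(1, None);
    let b = arena.add(2, Some(a)); // b points at a
    assert_eq!(arena.get(arena.get(b).next.unwrap()).value, 1);
}
```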
If arenas still sound like a nasty workaround, consider that you might choose an arena anyway for other reasons:
- All of the objects in the arena will be freed at the end of the arena's lifetime, instead of during their manipulation, which can give very low latency for some use-cases. Bumpalo formalizes this.
- The rest of your program might have real Rust references into the arena. You can give the arena a named lifetime ('arena for example), making the provenance of those references very clear.
I'm having a miserable time making my data structure. Should I use unsafe?
Low-level data structures are hard in Rust, especially if they're self-referential. Rust will make visible all sorts of risks of ownership and shared mutable state which may not be visible in other languages, and they're hard to solve in low-level data structure code.
Even something as simple as a doubly-linked list is notoriously hard; so much so that there is a book that teaches Rust based solely on linked lists. As that (wonderful) book makes clear, you are often faced with a choice:
- Use safe Rust, but shift compile-time checks to runtime
- Use unsafe and take the same degree of care you'd take in C or C++. And, just like in C or C++, you'll introduce security vulnerabilities in the unsafe code.
If you're facing this decision... perhaps there's a third way.
You should almost always be using somebody else's tried-and-tested data structure.
petgraph and slotmap are great examples. Use someone else's crate by default, and resort to writing your own only if you exhaust that option.
C++ makes it hard to pull in third-party dependencies, so it's culturally normal to write new code. Rust makes it trivial to add dependencies, and so you will need to do that, even if it feels surprising for a C++ programmer.
This ease of adding dependencies co-evolved with the difficulty of making data structures. It's simply a part of programming in Rust. You just can't separate the language and the ecosystem.
You might argue that this dependency on third-party crates is concerning from a supply-chain security point of view. Your author would agree, but it's just the way you do things in Rust. Stop creating your own data structures.
Then again:
it’s equally miserable to implement performant, low-level data structures in C++; you’ll be specializing on lots of things like is_trivially_movable etc. - MY.
I nevertheless have to write my own data structure. Should I use unsafe?
I'm sorry to hear that.
Some suggestions:
- Use Rc, Weak etc. until you really can't.
- Even if you can't use a pre-existing crate for the whole data structure, perhaps you can use a crate to avoid the unsafe bits (for example rental).
- Bear in mind that refactoring Rust is generally safer than refactoring C++ (because the compiler will point out a higher proportion of your mistakes), so a wise strategy might be to start with a fully-safe, but slow, version, establish solid tests, and then reach for unsafe.
Questions about designing APIs for others
- When should my type implement Default?
- When should my type implement From, Into and TryFrom?
- How should I expose constructors?
- When should my type implement AsRef?
- When should I implement Copy?
- Should I have Arc or Rc in my API?
- Should my API be thread-safe? What does that mean?
- What should I Derive to make my code optimally usable?
- How should I think about API design, differently from C++?
See also the excellent Rust API guidelines. The document you're reading aims to provide extra hints which may be especially useful to folk coming from C++, but that's the canonical reference.
When should my type implement Default?
Whenever you'd provide a default constructor in C++.
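For example, derive it where field defaults suffice; it also composes with struct-update syntax for "mostly default" values (the Config type is illustrative):

```rust
// Deriving Default gives the equivalent of a C++ default constructor.
#[derive(Default, Debug)]
struct Config {
    verbose: bool, // defaults to false
    retries: u32,  // defaults to 0
    name: String,  // defaults to ""
}

fn main() {
    // Override one field, default the rest.
    let cfg = Config { retries: 3, ..Default::default() };
    assert_eq!(cfg.retries, 3);
    assert!(!cfg.verbose);
    assert!(cfg.name.is_empty());
}
```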
When should my type implement From, Into and TryFrom?
You should think of these as equivalent to implicit conversions in C++. Just as with C++, if there are multiple ways to convert from your thing to another thing, don't implement these, but if there's a single obvious conversion, do.
Usually, don't implement Into but instead implement From.
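A sketch with illustrative unit types:

```rust
struct Meters(f64);
struct Millimeters(f64);

// One obvious conversion: implement From, and the corresponding Into
// implementation comes for free via the standard library's blanket impl.
impl From<Meters> for Millimeters {
    fn from(m: Meters) -> Self {
        Millimeters(m.0 * 1000.0)
    }
}

fn main() {
    let mm: Millimeters = Meters(1.5).into(); // uses the From impl
    assert_eq!(mm.0, 1500.0);
}
```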
How should I expose constructors?
See the previous two answers: where it's simple and obvious, use the standard traits to make your object behavior predictable.
If you need to go beyond that, remember you've got a couple of extra toys in Rust:
- A "constructor" could return a Result<Self>
- Your constructors can have names, e.g. Vec::with_capacity, Box::pin
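A sketch combining both toys (the Port type is illustrative):

```rust
struct Port(u16);

impl Port {
    // A fallible "constructor": the Result makes failure explicit,
    // where C++ would need an exception or a factory function.
    fn new(n: u32) -> Result<Port, String> {
        u16::try_from(n)
            .map(Port)
            .map_err(|_| format!("{} is not a valid port", n))
    }

    // A named constructor, in the style of Vec::with_capacity.
    fn http() -> Port {
        Port(80)
    }

    fn number(&self) -> u16 {
        self.0
    }
}

fn main() {
    assert_eq!(Port::new(8080).unwrap().number(), 8080);
    assert!(Port::new(99_999).is_err());
    assert_eq!(Port::http().number(), 80);
}
```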
When should my type implement AsRef?
If you have a type which contains another type, provide AsRef especially so that people can clone the inner type. It's good practice to provide explicit versions as well (for example, String implements AsRef<str> but also provides .as_str()).
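A sketch following the String precedent (the UserName type is illustrative):

```rust
// A type wrapping a String, exposing the inner type both generically
// (AsRef) and explicitly (an as_str method), mirroring what String does.
struct UserName(String);

impl AsRef<str> for UserName {
    fn as_ref(&self) -> &str {
        &self.0
    }
}

impl UserName {
    // The explicit companion to the AsRef impl.
    fn as_str(&self) -> &str {
        &self.0
    }
}

// Generic callers can now accept UserName, String, &str, ...
fn shout(s: impl AsRef<str>) -> String {
    s.as_ref().to_uppercase()
}

fn main() {
    let name = UserName("bob".to_string());
    assert_eq!(name.as_str(), "bob");
    assert_eq!(shout(name), "BOB");
    assert_eq!(shout("hi"), "HI");
}
```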
When should I implement Copy?
Anything that is integer-like or reference-like should be Copy; other things shouldn't. - MY
When it's efficient and when it's an API contract you're willing to uphold. - AH
Generally speaking, types which are plain-old-data can be Copy. Anything more nuanced with any type of state shouldn't be.
Should I have Arc or Rc in my API?
It's a code smell to have reference counts in your API design. You should hide it. - TM
If you must, you will need to decide between Rc and Arc - see the next answer for some considerations. But, generally, Arc is better practice because it imposes fewer restrictions on your callers. Also, consider taking a look at the Archery crate.
Should my API be thread-safe? What does that mean?
In C++, a thread-safe API usually means that you can expect your API's consumers to use objects from multiple threads. This is difficult to make safe and therefore substantial extra engineering is required to make an API thread-safe.
In Rust, things differ:
- it's more normal to do things across multiple threads;
- you don't have to worry about your callers making mistakes here because the compiler won't let them;
- you can often rely on Send rather than Sync.
You certainly shouldn't be putting a Mutex around all your types. If your caller attempts to use the type from multiple threads, the compiler will simply stop them. It is the responsibility of the caller to use things safely.
If the library has Arc or Rc in the APIs, it may be making choices about how you should instantiate stuff, and that's rude. - AF
There's a reasonable chance that your API can be used in parallel threads by virtue of Send and Sync being automatically derived. But - you should think through the usage model for your API clients and ensure that's true.
```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::sync::Mutex;
use std::thread;

// Imagine this is your library, exposing this interface to library
// consumers...
mod pizza_api {
    use std::thread;
    use std::time::Duration;

    pub struct Pizza {
        // automatically 'Send'
        _anchovies: u32,
        _pepperoni: u32,
    }

    pub fn make_pizza() -> Pizza {
        println!("cooking...");
        thread::sleep(Duration::from_millis(10));
        Pizza {
            _anchovies: 0, // yuck
            _pepperoni: 32,
        }
    }

    pub fn eat_pizza(_pizza: Pizza) {
        println!("yum")
    }
}

// Absolutely no changes are required to the pizza library to let
// it be usable from a multithreaded context
fn main() {
    let pizza_queue = Mutex::new(RefCell::new(VecDeque::new()));
    thread::scope(|s| {
        s.spawn(|| {
            let mut pizzas_eaten = 0;
            while pizzas_eaten < 100 {
                if let Some(pizza) = pizza_queue.lock().unwrap().borrow_mut().pop_front() {
                    pizza_api::eat_pizza(pizza);
                    pizzas_eaten += 1;
                }
            }
        });
        s.spawn(|| {
            for _ in 0..100 {
                let pizza = pizza_api::make_pizza();
                pizza_queue.lock().unwrap().borrow_mut().push_back(pizza);
            }
        });
    });
}
```
What should I Derive to make my code optimally usable?
The official guidelines say to be eager.
But don't overpromise:
Equality can suddenly become expensive later - don’t make types comparable unless you intend people to be able to compare instances of the type. Allowing people to pattern match on enums is usually better. - MY
Note that syn is a rare case in that it has so many types, and is so extensively depended upon by the rest of the Rust ecosystem, that it avoids deriving the standard traits unless explicitly commanded to do so via a cargo feature. This is an unusual pattern and should not normally be followed.
How should I think about API design, differently from C++?
Make the most of the fact that everything is immutable by default. Things which are mutable should stick out. - AF
Think about things which should take self and return self. - AF
Refactoring is less expensive in Rust than C++ due to compiler safeguards, but rearchitecting is expensive in any language. Think about "one way doors" and "two way doors" in the design space: can you undo a change later?
Questions about your whole codebase
- The C++ observer pattern is hard in Rust. What to do?
- That's all very well, but I have an existing C++ object broadcasting events. How exactly should I observe it?
- Some of my C++ objects have shared mutable state. How can I make them safe in Rust?
- How do I do a singleton?
- What's the best way to retrofit Rust's parallelism benefits to an existing codebase?
- What's the best way to architect a new codebase for parallelism?
- I need a list of nodes which can refer to one another. How?
- Should I have a few big crates or lots of small ones?
- What crates should everyone know about?
- How should I call C++ functions from Rust and vice versa?
- I'm getting a lot of binary bloat.
The C++ observer pattern is hard in Rust. What to do?
The C++ observer pattern usually means that there are broadcasters sending messages to consumers:
flowchart TB
  broadcaster_a[Broadcaster A]
  broadcaster_b[Broadcaster B]
  consumer_a[Consumer A]
  consumer_b[Consumer B]
  consumer_c[Consumer C]
  broadcaster_a --> consumer_a
  broadcaster_b --> consumer_a
  broadcaster_a --> consumer_b
  broadcaster_b --> consumer_b
  broadcaster_a --> consumer_c
  broadcaster_b --> consumer_c
The broadcasters maintain lists of consumers, and the consumers act in response to messages (often mutating their own state.)
This doesn't work in Rust, because it requires the broadcasters to hold mutable references to the consumers.
What do you do?
Option 1: make everything runtime-checked
Each of your consumers could become an Rc<RefCell<T>> or, if you need thread-safety, an Arc<RwLock<T>>.
The Rc or Arc allows broadcasters to share ownership of a consumer. The RefCell or RwLock allows each broadcaster to acquire a mutable reference to a consumer when it needs to send a message.
This example shows how, in Rust, you may independently choose reference counting or interior mutability. In this case we need both.
Just like typical reference counting in C++, Rc and Arc have the option to provide a weak pointer, so the lifetime of each consumer doesn't need to be extended unnecessarily. As an aside, it would be nice if Rust had an Rc-like type which enforces exactly one owner and multiple weak pointers. Rc could be wrapped quite easily to do this.
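A minimal sketch of this runtime-checked approach, single-threaded so using Rc<RefCell<T>> (names are illustrative):

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Runtime-checked observer pattern: shared ownership + interior mutability.
struct Consumer {
    events_seen: u32,
}

struct Broadcaster {
    consumers: Vec<Rc<RefCell<Consumer>>>,
}

impl Broadcaster {
    fn broadcast(&self) {
        for c in &self.consumers {
            // borrow_mut() is the runtime stand-in for the borrow checker;
            // it panics if anyone else currently holds a borrow.
            c.borrow_mut().events_seen += 1;
        }
    }
}

fn main() {
    let consumer = Rc::new(RefCell::new(Consumer { events_seen: 0 }));
    let broadcaster = Broadcaster { consumers: vec![Rc::clone(&consumer)] };
    broadcaster.broadcast();
    assert_eq!(consumer.borrow().events_seen, 1);
}
```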
Reference counting is frowned-upon in C++ because it's expensive. But, in Rust, not so much:
- Few objects are reference counted; the majority of objects are owned statically.
- Even when objects are reference counted, those counts are rarely incremented and decremented because you can (and do) pass around &Rc<RefCell<T>> most of the time. In C++, the "copy by default" mode means it's much more common to increment and decrement reference counts.
In fact, the compile-time guarantees might cause you to do less reference counting than C++:
In Servo there is a reference count but far fewer objects are reference counted than in the rest of Firefox, because you don’t need to be paranoid - MG
However: Rust does not prevent reference cycles, although they're only possible if you're using both reference counting and interior mutability.
Option 2: drive the objects from the code, not the other way round
In C++, it's common to have all behavior within classes. Those classes are the total behavior of the system, and so they must interact with one another. The observer pattern is common.
flowchart TB
  broadcaster_a[Broadcaster A]
  consumer_a[Consumer A]
  consumer_b[Consumer B]
  broadcaster_a -- observer --> consumer_a
  broadcaster_a -- observer --> consumer_b
In Rust, it's more common to have some external function which drives overall behavior.
flowchart TB
  main(Main)
  broadcaster_a[Broadcaster A]
  consumer_a[Consumer A]
  consumer_b[Consumer B]
  main --1--> broadcaster_a
  broadcaster_a --2--> main
  main --3--> consumer_a
  main --4--> consumer_b
With this sort of design, it's relatively straightforward to take some output from one object and pass it into another object, with no need for the objects to interact at all.
In the most extreme case, this becomes the Entity-Component-System architecture used in game design.
Game developers seem to have completely solved this problem - we can learn from them. - MY
Option 3: use channels
The observer pattern is a way to decouple large, single-threaded C++ codebases. But if you're trying to decouple a codebase in Rust, perhaps you should assume multi-threading by default? Rust has built-in channels, and the crossbeam crate provides multi-producer, multi-consumer channels.
I'm a Rustacean, we assume massively parallel unless told otherwise :) - MG
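A minimal sketch using the standard library's mpsc channel; the workload is illustrative:

```rust
use std::sync::mpsc;
use std::thread;

// The producing side owns the Sender; the consuming side owns its own
// state and mutates it as messages arrive. No shared mutable state.
fn sum_of_messages() -> i32 {
    let (tx, rx) = mpsc::channel();

    let producer = thread::spawn(move || {
        for i in 0..3 {
            tx.send(i).unwrap();
        }
        // tx is dropped here, which ends the receiver's loop below.
    });

    let mut total = 0;
    for msg in rx {
        total += msg;
    }
    producer.join().unwrap();
    total
}

fn main() {
    assert_eq!(sum_of_messages(), 3); // 0 + 1 + 2
}
```

The consumer never hands out a mutable reference to anyone; it reacts to messages, which is the decoupling the observer pattern was buying you.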
That's all very well, but I have an existing C++ object broadcasting events. How exactly should I observe it?
If your Rust object is a consumer of events from some pre-existing C++ producer, all the above options remain possible.
- You can make your object reference counted and have C++ own such a reference (potentially a weak reference)
- C++ can deliver the message into a general message bucket. An external function reads messages from that bucket and invokes the Rust object that should handle it. This means the reference counting doesn't need to extend to the Rust objects outside that boundary layer.
- You can have a shim object which converts the C++ callback into some message and injects it into a channel-based world.
Some of my C++ objects have shared mutable state. How can I make them safe in Rust?
You're going to have to do something with interior mutability: either RefCell<T> or its multithreaded equivalent, RwLock<T>.
You have three decisions to make:
- Will only Rust code access this particular instance of this object, or might C++ access it too?
- If both C++ and Rust may access the object, how do you avoid conflicts?
- How should Rust code react if the object is not available, because something else is using it?
If only Rust code can use this particular instance of shared state, then simply wrap it in RefCell<T> (single-threaded) or RwLock<T> (multi-threaded). Build a wrapper type such that callers aren't able to access the object directly, but instead only via the lock type.
If C++ also needs to access this particular instance of the shared state, it's more complex. There are presumably some invariants regarding use of this data in C++ - otherwise it would crash all the time. Perhaps the data can be used only from one thread, or perhaps it can only be used with a given mutex held. Your goal is to translate those invariants into an idiomatic Rust API that can be checked (ideally) at compile-time, and (failing that) at runtime.
For example, imagine:
#include <mutex>

class SharedMutableGoat {
public:
    void eat_grass(); // mutates tummy state
};

std::mutex lock;
SharedMutableGoat* billy; // only access when owning lock
Your idiomatic Rust wrapper might be:
```rust
mod ffi {
    #[allow(non_camel_case_types)]
    pub struct lock_guard;
    pub fn claim_lock() -> lock_guard {
        lock_guard {}
    }
    pub fn eat_grass() {}
    pub fn release_lock(_lock: &mut lock_guard) {}
}

struct SharedMutableGoatLock {
    lock: ffi::lock_guard, // owns a std::lock_guard<std::mutex> somehow
}

// Claims the lock, returns a new SharedMutableGoatLock
fn lock_shared_mutable_goat() -> SharedMutableGoatLock {
    SharedMutableGoatLock { lock: ffi::claim_lock() }
}

impl SharedMutableGoatLock {
    fn eat_grass(&mut self) {
        ffi::eat_grass(); // Acts on the global goat
    }
}

impl Drop for SharedMutableGoatLock {
    fn drop(&mut self) {
        ffi::release_lock(&mut self.lock);
    }
}
```
Obviously, lots of permutations are possible, but the goal is to ensure that it's simply compile-time impossible to act on the global state unless appropriate preconditions are met.
The final decision is how to react if the object is not available. This decision can apply with C++ mutexes or with Rust locks (for example RwLock<T>). As in C++, the two major options are:
- Block until the object becomes available.
- Try to lock, and if the object is not available, do something else.
There can be a third option if you're using async Rust. If the data isn't available, you may be able to return to your event loop using an async version of the lock (Tokio example, async_std example).
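A sketch of the second option using the standard library's Mutex::try_lock (names are illustrative):

```rust
use std::sync::Mutex;

// Try to lock; if the object is not available, do something else.
fn try_update(counter: &Mutex<u32>) -> bool {
    match counter.try_lock() {
        Ok(mut guard) => {
            *guard += 1;
            true
        }
        // The lock is held elsewhere: don't block, just report failure
        // so the caller can go and do other work.
        Err(_) => false,
    }
}

fn main() {
    let counter = Mutex::new(0);
    assert!(try_update(&counter)); // lock was free: update succeeded
    let _held = counter.lock().unwrap();
    assert!(!try_update(&counter)); // lock is held: we bailed out instead
}
```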
How do I do a singleton?
Use OnceCell.
What's the best way to retrofit Rust's parallelism benefits to an existing codebase?
When parallelizing an existing codebase, first check that all existing types are correctly Send and Sync. Generally, though, you should try to avoid implementing these yourself - instead use pre-existing wrapper types which enforce the correct contract (for example, RwLock).
After that:
If you can solve your problem by throwing Rayon at it, do. It’s magic - MG
If your task is CPU-bound, Rayon solves this handily. - MY
Rayon offers parallel constructs - for example parallel iterators - which can readily be retrofitted to an existing codebase. It also allows you to create and join tasks. Using Rayon can help simplify your code and eliminate lots of manual scheduling logic.
If your tasks are IO-bound, then you may need to look into async Rust, but that's hard to pull into an existing codebase.
What's the best way to architect a new codebase for parallelism?
In brief, like in other languages, you have a choice of architectures:
- Message-passing, using event loops which listen on a channel, receive Send data and pass it on.
- More traditional multithreading using Sync data structures such as mutexes (and perhaps Rayon).
There's probably a bias towards message-passing, and that's probably well-informed by its extensibility. - MG
I need a list of nodes which can refer to one another. How?
You can't easily do self-referential data structures in Rust. The usual workaround is to use an arena and replace references from one node to another with node IDs.
An arena is typically a Vec (or similar), and the node IDs are a newtype wrapper around a simple integer index.
Obviously, Rust doesn't check that your node IDs are valid. If you don't have proper references, what stops you from having stale IDs?
Arenas are often purely additive, which means that you can add entries but not delete them (example). If you must have an arena which deletes things, then use generational IDs; see the generational-arena crate and this RustConf keynote for more details.
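A purely additive arena along these lines might look like the following sketch (the `Arena`, `Node`, and `NodeId` names are ours):

```rust
// Newtype wrapper around an index into the arena.
#[derive(Clone, Copy, PartialEq, Debug)]
struct NodeId(usize);

struct Node {
    value: i32,
    // Nodes refer to one another by ID, not by reference.
    next: Option<NodeId>,
}

struct Arena {
    nodes: Vec<Node>,
}

impl Arena {
    fn new() -> Self {
        Arena { nodes: Vec::new() }
    }

    // Purely additive: entries can be added but never removed,
    // so a NodeId handed out here stays valid.
    fn add(&mut self, value: i32, next: Option<NodeId>) -> NodeId {
        self.nodes.push(Node { value, next });
        NodeId(self.nodes.len() - 1)
    }

    fn get(&self, id: NodeId) -> &Node {
        &self.nodes[id.0]
    }
}

fn main() {
    let mut arena = Arena::new();
    let a = arena.add(1, None);
    let b = arena.add(2, Some(a)); // b refers to a by ID
    println!("{}", arena.get(arena.get(b).next.unwrap()).value);
}
```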
If arenas still sound like a nasty workaround, consider that you might choose an arena anyway for other reasons:
- All of the objects in the arena will be freed at the end of the arena's lifetime, instead of during their manipulation, which can give very low latency for some use-cases. Bumpalo formalizes this.
- The rest of your program might have real Rust references into the arena. You can give the arena a named lifetime (`'arena`, for example), making the provenance of those references very clear.
Should I have a few big crates or lots of small ones?
In the past, it was recommended to have small crates to get optimal build time. Incremental builds generally make this unnecessary now. You should arrange your crates optimally for your semantic needs.
What crates should everyone know about?
| Crate | Description |
|---|---|
| rayon | parallelizing |
| serde | serializing and deserializing |
| crossbeam | all sorts of parallelism tools |
| itertools | makes it slightly more pleasant to work with iterators. (For instance, if you want to join an iterator of strings, you can just go ahead and do that, without needing to collect the strings into a `Vec` first.) |
| petgraph | graph data structures |
| slotmap | arena-like key-value map |
| nom | parsing |
| clap | command-line parsing |
| regex | err, regular expressions |
| ring | the leading crypto library |
| nalgebra | linear algebra |
| once_cell | complex static data |
How should I call C++ functions from Rust and vice versa?
Use cxx.
Oh, you want a justification? In that case, here's the history which brought us to this point.
From the beginning, Rust supported calling C functions using `extern "C"`, `#[repr(C)]` and `#[no_mangle]`. Such callable C functions had to be declared manually in Rust:
```mermaid
sequenceDiagram
    Rust-->>extern: unsafe Rust function call
    extern-->>C: call from Rust to C
    participant extern as Rust unsafe extern "C" fn
    participant C as Existing C function
```
`bindgen` was invented to generate these declarations automatically from existing C/C++ header files. It has grown to understand an astonishingly wide variety of C++ constructs, but its generated bindings are still `unsafe` functions with lots of pointers involved.
```mermaid
sequenceDiagram
    Rust-->>extern: unsafe Rust function call
    extern-->>C: call from Rust to C++
    participant extern as Bindgen generated bindings
    participant C as Existing C++ function
```
Interacting with `bindgen`-generated bindings requires unsafe Rust; you will likely have to manually craft idiomatic safe Rust wrappers. This is time-consuming and error-prone.
cxx automates a lot of that process. Unlike `bindgen`, it doesn't learn about functions from existing C++ headers. Instead, you specify cross-language interfaces in a Rust-like interface definition language (IDL) within your Rust file. cxx generates both C++ and Rust code from that IDL, marshaling data behind the scenes on both sides such that you can use standard language features in your code. For example, you'll find idiomatic Rust wrappers for `std::string` and `std::unique_ptr`, and idiomatic C++ wrappers for a Rust slice.
```mermaid
sequenceDiagram
    Rust-->>rsbindings: safe idiomatic Rust function call
    rsbindings-->>cxxbindings: hidden C ABI call using marshaled data
    cxxbindings-->>cpp: call to standard idiomatic C++
    participant rsbindings as cxx-generated Rust code
    participant cxxbindings as cxx-generated C++ code
    participant cpp as C++ function using STL types
```
In the bindgen case even more work goes into wrapping idiomatic C++ signatures into something bindgen compatible: unique ptrs to raw ptrs, Drop impls on the Rust side, translating string types ... etc. The typical real-world binding we've converted from bindgen to cxx in my codebase has been -500 lines (mostly unsafe code) +300 lines (mostly safe code; IDL included). - DT
The greatest benefit is that cxx sufficiently understands C++ STL object ownership norms that the generated bindings can be used from safe Rust code.
At present, there is no established solution which combines the idiomatic, safe interoperability offered by cxx with the automatic generation offered by `bindgen`. It's not clear whether this is even possible, but several projects are aiming in this direction.
I'm getting a lot of binary bloat.
In Rust you have a free choice between `impl Trait` and `dyn Trait`. See this answer, too. `impl Trait` tends to be the default, and results in large binaries because much code can be duplicated. If you have this problem, consider using `dyn Trait`. Other options include the 'thin template pattern' (an example is `serde_json`, where the code to read from a string and a slice would be duplicated entirely, but instead one delegates to the other and requests slightly different behavior).
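The trade-off can be sketched in a few lines (the `Shape` trait and its implementors here are invented for illustration):

```rust
trait Shape {
    fn area(&self) -> f64;
}

struct Circle { r: f64 }
struct Square { s: f64 }

impl Shape for Circle {
    fn area(&self) -> f64 { std::f64::consts::PI * self.r * self.r }
}
impl Shape for Square {
    fn area(&self) -> f64 { self.s * self.s }
}

// Monomorphized: the compiler emits a separate copy of this
// function's code for every concrete T it's called with.
fn area_generic<T: Shape>(shape: &T) -> f64 {
    shape.area()
}

// Dynamic dispatch: a single copy of the code, shared by all
// implementors, at the cost of a vtable call.
fn area_dyn(shape: &dyn Shape) -> f64 {
    shape.area()
}

fn main() {
    println!("{}", area_generic(&Circle { r: 1.0 }));
    println!("{}", area_dyn(&Square { s: 2.0 }));
}
```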
Questions about your development processes
How should I use tools differently from C++?
- Use `rustfmt` automatically everywhere. While in C++ there are many different coding styles, the Rust community is in agreement (at least, they're in agreement that it's a good idea to be in agreement). That is codified in `rustfmt`. Use it, automatically, on every submission.
- Use `clippy` somewhere. Its lints are useful.
- Use IDEs more liberally. Even staunch vim-adherents (your author!) prefer to use an IDE with Rust, because it's simply invaluable to show type annotations. Type information is typically invisible in the language, so in Rust you're more reliant on tooling assistance.
- Deny unsafe code by default (`#![forbid(unsafe_code)]`).