[RFC] A ClangIR based Safe C++

Abstract

This post proposes a ClangIR-based Safe C++ as an extension in Clang. The proposed Safe C++ should be a pure subset of ISO C++, apart from a few ignorable pragmas and attributes, so that compilers which don’t support the extension can still compile code accepted by Safe C++. A demo implementation and examples are presented to give readers a concrete feeling for the proposed language.

Motivation

Let’s skip the part about why safety is important.

This post is inspired by the Safe C++ proposal. But the language proposed there is not C++, which puts more burden on the designers, the implementors and the users.

So I am wondering whether we can build a Safe C++ from a pure subset of C++ plus some ignorable pragmas and attributes. Safe C++ may reject valid C++ programs, but a valid Safe C++ program, with the ignorable pragmas and attributes removed, must also be a valid C++ program. In this way, the burden on design, implementation and users may be reduced significantly.

To avoid ambiguity, in this thread I’ll call this proposal Safe C++ and the proposal mentioned above Safe C++2. Outside the thread, if you like, you can call this proposal Clang Safe C++.

Quick Example

struct S {
    int *x;

    S();
    S(const S &other);
    S &operator=(const S &other);

    void get(int &x);
    void consume() const;
};

#pragma clang SafeCXX
void invalid(bool cond) {
  S s;
  if (cond) {
    int a = 0;
    s.get(a); // expected-note {{the previous borrow starts from}}
  }
  s.consume(); // expected-warning {{use of s detected beyond lifetime of borrow a}}
}

This example detects the use of a dangling pointer. The line #pragma clang SafeCXX is the key component of the proposal. It separates the unsafe part of the program from the safe part so that the compiler can decide where to apply the new checks for safe C++.

Then in invalid(bool), in the then branch of the if statement, the address of a is passed to S::get(int&), and we consider that s borrows a at this point.

Finally, at the end of the function, when we want to consume s, we need to check whether everything it borrowed is still alive. We find that a is no longer alive here, so we emit a diagnostic.

Demo Implementation

A demo implementation can be found at: GitHub - ChuanqiXu9/clangir at safe-c++ and the examples can be found at clangir/clang/test/CIR/SafeC++ at safe-c++ · ChuanqiXu9/clangir · GitHub

Since this is based on ClangIR, if you want to try it, remember to enable mlir and pass -DCLANG_ENABLE_CIR=ON when building it.

Proposal

The post proposes to add two pragmas:

#pragma clang SafeCXX
#pragma clang UnsafeCXX

C++ code that can be reached from a #pragma clang SafeCXX (ignoring #include) without passing a #pragma clang UnsafeCXX is called safe C++ code.

The compiler is allowed to add new checks (to be defined) to safe C++ code. The compiler is allowed to reject a valid C++ program if any safe C++ code violates the new checks.

Disclaimer: To keep the process clear, I didn’t define any checks in the proposal section. I hope we can first reach consensus on whether we like the idea of #pragma clang SafeCXX and #pragma clang UnsafeCXX. After that, we can design the checks we like. Otherwise I am afraid the process may be hijacked by voices like “the check is not good” or “I don’t like this check”.

Properties of “#pragma clang SafeCXX”

An important property of #pragma clang SafeCXX and #pragma clang UnsafeCXX is that they are not propagated through inclusions. So we can use #pragma clang SafeCXX without worrying that included “unsafe” files may break:

//--- unsafe.h
void d0(int *v) {
    delete v;
}

//--- safe.cc
// Test that we won't emit errors in included headers which are not marked as safe.
// expected-no-diagnostics
#pragma clang SafeCXX
#include "unsafe.h"
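As a rough illustration of the region rule and the non-propagation property, here is a minimal sketch of how a "safe here?" flag could be tracked per file. The names Event and SafeRegionTracker are hypothetical and are not taken from the demo (which keeps its state in SafeCXXState.h); the sketch also assumes that every file, including an included one, starts out unsafe.

```cpp
#include <cassert>
#include <vector>

// Hypothetical events a preprocessor could report while reading a TU.
enum class Event { PragmaSafe, PragmaUnsafe, EnterInclude, ExitInclude };

// Hypothetical tracker: one flag per file on the include stack.
class SafeRegionTracker {
    // Assumption: every file starts unsafe until it sees its own pragma.
    std::vector<bool> stack_{false};

public:
    void handle(Event e) {
        switch (e) {
        case Event::PragmaSafe:
            stack_.back() = true;
            break;
        case Event::PragmaUnsafe:
            stack_.back() = false;
            break;
        case Event::EnterInclude:
            // The includer's state is not propagated into the header.
            stack_.push_back(false);
            break;
        case Event::ExitInclude:
            // Returning to the includer restores its own state.
            if (stack_.size() > 1)
                stack_.pop_back();
            break;
        }
    }

    bool isSafeHere() const { return stack_.back(); }
};
```

With this model, the code of unsafe.h in the example above is seen with isSafeHere() == false even though safe.cc enabled SafeCXX before the #include.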

Checks

Disclaimer again: the checks and examples presented here are mainly meant to give readers a concrete feeling for the proposed language. The checks may not be complete, and we can add more. Some checks may be too strict, and we can redesign them too.

Conventions on examples

Throughout the thread, all examples are assumed to be under #pragma clang SafeCXX. If a #pragma clang SafeCXX appears in the middle of an example, the code before the first #pragma clang SafeCXX is assumed not to be safe.

Borrow Checks

The borrow check is the most important of the checks. It makes sure that there is at most one mutable borrow at a time. It also makes the dependencies between variables explicit, so that we can find possible dangling pointers.

The check is implemented in clangir/clang/lib/CIR/Dialect/Analysis/BorrowChecker.cpp at safe-c++ · ChuanqiXu9/clangir · GitHub. The implementation tries to follow the algorithm in 2094-nll - The Rust RFC Book. The overall idea is to mimic Rust’s borrow check mechanism where possible. But we’re in C++ after all, so this post tries to explain the definitions and checks for borrows.

Producing a borrow

When a variable is referenced, the variable is borrowed. e.g.,

int a;
int &ref = a; // `ref` borrows `a`

And also if a variable is referenced in a call, the variable is borrowed too. e.g.,

void call(int &a);
int a;
call(a); // the call borrows `a`

Also, if the call has a return value and the return value can potentially refer to the variable (aliasing rule), the variable is borrowed by the return value. e.g.,

struct S {
     int &a;
     // constructors..
};
S getS(int &);
int a;
S s = getS(a); // `s` borrows `a`.

Here S has a member with type int & and the parameter is int & too. So we have to assume the return value of getS may borrow the parameter. However, if the struct S has different members:

struct S {
     double &a;
     // constructors..
};
S getS(int &);
int a;
S s = getS(a); // `s` doesn't borrow `a`.

Now, since S doesn’t have a member that can refer to a, we can assume the return value of getS() won’t borrow the parameter of type int &.
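The type-based rule above can be sketched as a small predicate. This is only an illustration under simplifying assumptions: TypeDesc and mayBorrow are hypothetical names, types are compared by spelling, and nested members, base classes and conversions are ignored.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified description of a class type: the types that its
// reference or pointer members can refer to.
struct TypeDesc {
    std::string name;
    std::vector<std::string> referredTypes; // e.g. {"int"} for `int &a;`
};

// Conservative rule from the text: the returned value may borrow a reference
// parameter if some member of the return type can refer to the parameter's
// referenced type.
bool mayBorrow(const TypeDesc &ret, const std::string &paramPointee) {
    for (const std::string &t : ret.referredTypes)
        if (t == paramPointee)
            return true;
    return false;
}
```

For the two versions of S above, the first (with int &a) may borrow an int & parameter, while the second (with double &a) cannot.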

And, sometimes, the return values won’t borrow the arguments actually. In this case, we can use [[clang::NoBorrowToRet]] attribute to mark this. e.g.,

struct S {
     int &a;
     // constructors..
};
S getS(int &a [[clang::NoBorrowToRet]]);
int a;
S s = getS(a); // `s` won't borrow `a`.

(Alternatively, we can put [[clang::NoBorrowToRet]] on the function itself to state that the return value won’t borrow any parameter.)

For member function calls, the non-object parameters are borrowed by this by default unless they are marked [[clang::NoBorrowToRet]].

When we want to state that this won’t be borrowed by the return value while other parameters may be, we can put [[clang::ThisNoBorrowToRet]] on the function.

The [[clang::NoBorrowToRet]] attribute corresponds to the explicit lifetime annotations in Rust, but it is not as powerful or flexible. Maybe we should enhance this in the future.

Also, if the reference type is const, we call the borrow a const borrow. Otherwise, we call it a mutable borrow. We can’t create a mutable borrow from a const variable:

const int a = 0;
int &ref = a; // invalid!

Lifetime of a borrow

When a variable is borrowed by a call that returns void, or whose return type cannot refer to the variable, the lifetime of the borrow ends at the point of the call.

Otherwise, when the variable (the borrowee) is borrowed by another variable (the borrower), the lifetime of the borrow is:

all the program points that are reachable from the borrowing point and from which some use of the borrower is reachable.

For example,

int f() {
    int a;
    const int &ref = a; // point 1
    a = 43;  // point 2
    return a;
}

At point 1, ref borrows a. However, ref doesn’t have any users, so the lifetime of the borrow is just {1}, and the write at point 2 is not a problem.

Similarly,

void consume(const int &);
int f() {
    int a;
    const int &ref = a; // point 1
    consume(ref); // point 2
    a = 43; // point 3
    return a;
}

Here ref has a user at point 2. Point 3 is reachable from the borrowing point 1, but no user of ref is reachable from point 3. So point 3 is not part of the lifetime of the borrow for ref.

However, if we change the order slightly,

int g() {
    int a;
    const int &ref = a; // expected-note {{the borrow starts from}}
    a = 43; // expected-warning {{a change to a is detected when borrowed by ref}}
    consume(ref);
    return a;
}

Here a user of ref is reachable from the point of a = 43, so that point is considered to be in the lifetime of the borrow for ref.
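The lifetime definition in this section can be sketched over a tiny control flow graph: take the points reachable from the borrowing point, and keep those from which a use of the borrower is reachable. This is an illustrative sketch, not the demo's BorrowChecker.cpp; Cfg, reach and borrowLifetime are hypothetical names, program points are plain integers, and the borrowing point is always included so that an unused borrow gets the lifetime {borrow point}, as in the first example. Refinements from the NLL RFC, such as reassignment of the borrower, are ignored.

```cpp
#include <cassert>
#include <set>
#include <vector>

// A tiny CFG: succ[p] lists the program points that can follow point p.
using Cfg = std::vector<std::vector<int>>;

// All points reachable from any point in `start` (inclusive) along `edges`.
static std::set<int> reach(const Cfg &edges, const std::vector<int> &start) {
    std::set<int> seen(start.begin(), start.end());
    std::vector<int> work(start.begin(), start.end());
    while (!work.empty()) {
        int p = work.back();
        work.pop_back();
        for (int q : edges[p])
            if (seen.insert(q).second)
                work.push_back(q);
    }
    return seen;
}

// Lifetime of a borrow: points reachable from the borrowing point from which
// some use of the borrower is still reachable.
std::set<int> borrowLifetime(const Cfg &succ, int borrowPoint,
                             const std::vector<int> &uses) {
    // Build the reversed graph so "can reach a use" becomes a forward search.
    Cfg pred(succ.size());
    for (int p = 0; p < static_cast<int>(succ.size()); ++p)
        for (int q : succ[p])
            pred[q].push_back(p);

    std::set<int> fromBorrow = reach(succ, {borrowPoint});
    std::set<int> toUse = reach(pred, uses);

    std::set<int> life{borrowPoint};
    for (int p : fromBorrow)
        if (toUse.count(p))
            life.insert(p);
    return life;
}
```

For a straight-line function with points 0..4, a borrow at point 1 and a single use at point 2, this yields {1, 2}, so a write at point 3 is outside the lifetime. Moving the use to point 3 yields {1, 2, 3}, which is why the write in g() is diagnosed.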

Checks for borrow

No write during const borrow

As shown in the above example, we shouldn’t write to a variable while it is borrowed.

No multiple mutable borrow

We don’t allow multiple mutable borrows to exist at the same program point.

#pragma clang SafeCXX
void consume(int &);
int f() {
    int a;
    int &ref1 = a; // expected-warning {{non constant borrow a overlapped with other borrow}}
    int &ref2 = a; // expected-warning {{the previous borrow starts from}}
    consume(ref1);
    consume(ref2);
    return a;
}

So we can have at most one mutable reference at any point.

Another example is

void consume2(int &, int &);
void g() {
    int a;
    consume2(a, a); // expected-warning {{non constant borrow a overlapped with other borrow}}
}

The above code is invalid in safe C++ since the call to consume2 would create two mutable references to a at the same point.

No borrow overlapped with mutable borrow

We can have multiple const borrows at the same time. But no other borrow may overlap with a mutable borrow. e.g.,

void consume(const int &);
int h() {
    int a;
    const int &ref1 = a; // expected-note {{the previous borrow starts from}}
    int &ref2 = a; // expected-warning {{non constant borrow a overlapped with other borrow}}
    consume(ref1);
    consume(ref2);
    return a;
}
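Once lifetimes are available as sets of program points, the three checks in this section reduce to simple set tests. The following is a sketch under that assumption; Borrow, borrowsConflict and writeConflicts are hypothetical names and do not mirror the demo's data structures.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical record for one borrow; the lifetime is a set of program points.
struct Borrow {
    std::string borrowee;   // the variable being borrowed
    bool isMutable;         // true for `T &`, false for `const T &`
    std::set<int> lifetime; // points at which the borrow is live
};

static bool intersects(const std::set<int> &a, const std::set<int> &b) {
    for (int p : a)
        if (b.count(p))
            return true;
    return false;
}

// Two borrows of the same variable conflict if at least one of them is
// mutable and their lifetimes share a program point. Two const borrows
// never conflict.
bool borrowsConflict(const Borrow &x, const Borrow &y) {
    return x.borrowee == y.borrowee && (x.isMutable || y.isMutable) &&
           intersects(x.lifetime, y.lifetime);
}

// A write to `var` at `point` conflicts with a borrow of `var` that is
// still live at that point.
bool writeConflicts(const Borrow &b, const std::string &var, int point) {
    return b.borrowee == var && b.lifetime.count(point) != 0;
}
```

In h() above, ref1 (const) and ref2 (mutable) both borrow a and are both live at the declaration of ref2, so borrowsConflict reports the overlap.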

Examples

Example 1

For the quick example at the beginning,

struct S {
    int *x;

    S();
    S(const S &other);
    S &operator=(const S &other);

    void get(int &x);
    void consume() const;
};

#pragma clang SafeCXX
void invalid(bool cond) {
  S s;
  if (cond) {
    int a = 0;
    s.get(a);
  }
  s.consume();
}

To make this example pass, we can either use #pragma clang UnsafeCXX (bad)

void invalid(bool cond) {
  S s;
  if (cond) {
    int a = 0;
    s.get(a); // expected-note {{the previous borrow starts from}}
  }
#pragma clang UnsafeCXX
  s.consume();
}

or we can mark S::get() as [[clang::NoBorrowToRet]].

struct S {
    int *x;

    S();
    S(const S &other);
    S &operator=(const S &other);

    void get(int &x [[clang::NoBorrowToRet]]);
    void consume() const;
};

#pragma clang SafeCXX
void invalid(bool cond) {
  S s;
  if (cond) {
    int a = 0;
    s.get(a);
  }
  s.consume();
}

then it is fine since s won’t borrow a anymore.

As for S::get(int &x [[clang::NoBorrowToRet]]);, if its definition lives in an unsafe C++ section, we can only hope that the programmer made the correct decision. But if its definition lives in a safe C++ section, we’re able to check whether the attribute is applied correctly.

struct S {
    int *x;

    S();
    S(const S &other);
    S &operator=(const S &other);

    void get(int &v [[clang::NoBorrowToRet]]);
    void get2(int &v);
};

#pragma clang SafeCXX
void S::get(int &v) {
    x = &v; // expected-warning {{borrow v has shorter lifetime than this}}
}
void S::get2(int &v) {
    x = &v;
}

Here, for S::get, where the parameter is marked [[clang::NoBorrowToRet]], we’re not allowed to lend it to a borrower that has a longer lifetime. But S::get2 is completely fine, since arguments are assumed to have longer lifetimes by default.

Example 2

This comes from a classic example in Rust’s forum: Calling a `&mut` object's method in a loop - Mutable borrow starts here in previous iteration of loop - help - The Rust Programming Language Forum

fn main() {
    let mut foo = Foo {
        d: HashMap::new() 
    };
    
    let mut saved = Vec::new();
    
    for _i in 1..3 {
        let l = foo.get("a");
        saved.push(l);
    }
}

leads to

error[E0499]: cannot borrow `foo` as mutable more than once at a time
  --> src/main.rs:24:17
   |
24 |         let l = foo.get("a");
   |                 ^^^ mutable borrow starts here in previous iteration of loop

The explanation is: in the first iteration, saved borrows a mutable reference from foo; then in the second iteration, saved may borrow a mutable reference from foo again while the previous reference is still alive.

I am not sure whether this is intended behavior or just a natural result of their borrow checking rules. Some say it helps prevent data races. I then tried to mimic this in the demo:

namespace std {
class foo {
    int x;
public:
    int &get();
};
template <class T>
class vector {
    T* data;
    unsigned size;
public:
    void push_back(const T & elem [[clang::NoBorrowToRet]]);
};
}
#pragma clang SafeCXX
void consume(const std::vector<int> &);
void func() {
  std::foo f;
  std::vector<int> vec;
  for (int i = 0; i < 10; ++i) {
    auto &l = f.get(); // expected-warning {{non constant borrow l overlapped with other borrow}}
                       // expected-note@-1 {{the previous borrow starts from}}
    vec.push_back(l);
  }
  consume(vec);
}

Example 3

The borrow check is pretty helpful to detect dangling references.

struct S {
    int *x;
    int *y;
};
#pragma clang SafeCXX
S getS(const int &x, const int &y);
S test() {
    int x, y;
    return getS(x, y); // expected-warning {{return during borrowing for x may produce dangling reference}}
                       // expected-warning@-1 {{return during borrowing for y may produce dangling reference}}
                       // expected-note@-2 + {{the previous borrow starts from}}
}

If getS() won’t borrow the parameters for sure, we can use [[clang::NoBorrowToRet]] similarly.

And also, if the attribute is marked incorrectly, we’re able to detect it:

[[clang::NoBorrowToRet]] 
S testInvalid(const int &x, const int &y) {
    return getS(x, y); // expected-warning {{return during borrowing for x may produce dangling reference}}
                       // expected-warning@-1 {{return during borrowing for y may produce dangling reference}}
                       // expected-note@-2 1+{{the previous borrow starts from}}
}

Deprecated Calls

The Safe C++2 proposal says:

I call it my billion-dollar mistake. It was the invention of the null reference in 1965. At that time, I was designing the first comprehensive type system for references in an object oriented language (ALGOL W). My goal was to ensure that all use of references should be absolutely safe, with checking performed automatically by the compiler. But I couldn’t resist the temptation to put in a null reference, simply because it was so easy to implement. This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years.

– Tony Hoare[hoare]

The “billion-dollar mistake” is a type safety problem. Consider std::unique_ptr. It has two states: engaged and disengaged. The class presents member functions like operator* and operator-> that are valid when the object is in the engaged state and undefined when the object is disengaged. -> is the most important API for smart pointers. Calling it when the pointer is null? That’s your billion-dollar mistake.

This sounds great. And I think we can achieve similar results by deprecating some calls. e.g., we can deprecate the default constructor and constructor for std::unique_ptr. So that:

int foo2() {
    std::unique_ptr<int> p; // expected-warning {{call to std::unique_ptr<int>::unique_ptr() is deprecated in safe C++}}
    return *p;
}

In the demo I hacked this by hard-coding these signatures in clangir/clang/lib/CIR/Dialect/Analysis/DeprecatedCallCheck.cpp at 85ecfe7283de9d28ffd75dfe91694af37567c0bb · ChuanqiXu9/clangir · GitHub. In the future, we can add such attributes to the standard library. Or maybe we can move this into a configuration file and provide a default configuration, so that we don’t have to update every library or force users to update.

For user code, we can also use [[clang::SafeCXXDeprecated]] to mark the functions we deprecate in safe C++. It should be trivial to implement. It is simply another kind of deprecation that only applies in certain locations.

Deprecate use of pointers, taking address and dereferencing (Not Yet Implemented)

But deprecating calls can’t solve all the problems. e.g., we can still write p.reset(nullptr); or p = nullptr;. To avoid such uses, I think we should forbid use of pointers (so nullptr) in safe C++. I think we should do this since the pointers are the nightmares for safety. And we can always easily use them in the unsafe mode.

I didn’t implement this since it is not easy to do in ClangIR, but I assume it will be easy to do on the AST.

References in Async functions (Not Yet Implemented)

The use of references in async functions is a known problem (C++ Core Guidelines), and the lifetime issues exist for non-coroutine async functions too.

The borrow check algorithm only checks the use of references in the sync world. If a variable is borrowed by an async call, the variable should be required to live longer than the current function, just as if it were returned. But the problem is that we can’t identify async functions in C++.

In Rust, I heard that async functions need to be marked specially (async - Rust), so it is easy to know whether we’re lending to an async function. But it is a completely different story for C++.

I thought about using yet another attribute to mark async functions. But it looks unhelpful since we can’t check whether users forget to mark them.

So I think maybe we can proceed by treating the following functions as async functions:

  • All coroutines are async functions.
  • Provide a configuration file in which users can describe the return types of async functions. Functions that match the description are considered async.

This is not perfect but I feel this is the best we can do. On the one hand, not all coroutines are async functions. But because of the dynamic allocation cost, coroutines may in practice only be used in async situations for performance reasons. On the other hand, async functions generally have some common return types, e.g., futures, tasks, or senders. So it might work in practice.

After all, I think we might not have a perfect solution for such topics in C++. But we might be able to provide some mechanism that works in practice for different projects.

I understand that not absolutely safe is unsafe. But it is always helpful to improve safety.

I don’t have a concrete idea for the configuration file. This is an open topic.

Once we can decide which functions are async, the algorithm for checking the use of references in async functions would be simple:

When we pass a reference to an async function, if the return value of the async function must be awaited on all paths (in other words, every path from the returning point to any exit point must pass a co_await), then we can check it as a common function call. Otherwise, we need to require that all the references (borrows) have a longer lifetime than the current function.

The algorithm assumes the semantics of co_await, which is a user-defined operation. If we don’t want to assume the semantics, we have to require that all the references (borrows) have a longer lifetime than the current function.
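The "awaited on all paths" condition can be sketched as a graph search: starting from the point where the async call returns, stop exploring at any point that awaits the result, and fail if an exit point is reached anyway. This is only an illustration; Cfg and awaitedOnAllPaths are hypothetical names, exit points are modelled as points without successors, and the sketch assumes we already know which points co_await the value in question.

```cpp
#include <cassert>
#include <vector>

// A tiny CFG: succ[p] lists the program points that can follow point p.
using Cfg = std::vector<std::vector<int>>;

// Returns true if every path from `start` to an exit point (a point with no
// successors) passes through a point marked in `isAwait`.
bool awaitedOnAllPaths(const Cfg &succ, int start,
                       const std::vector<bool> &isAwait) {
    std::vector<bool> seen(succ.size(), false);
    std::vector<int> work{start};
    seen[start] = true;
    while (!work.empty()) {
        int p = work.back();
        work.pop_back();
        if (isAwait[p])
            continue; // this path is covered, no need to look further
        if (succ[p].empty())
            return false; // reached an exit without awaiting
        for (int q : succ[p]) {
            if (!seen[q]) {
                seen[q] = true;
                work.push_back(q);
            }
        }
    }
    return true;
}
```

If this returns true, the call can be checked like a common function call; otherwise the borrows passed to it would have to outlive the current function.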

But after all, this is an open question.

Constant Global (Not Yet Implemented)

Global variables must be constant. Otherwise, at the very least, it breaks the assumptions for borrowing in calls, since a function could lend something to a global. This should be trivial to implement.

Standardization and higher level road map

I was wondering whether to send this to WG21 or to the clang community. I decided to send it here since I feel this is at an early stage that needs help from the community in all aspects, including design, implementation and user feedback.

And most importantly, this proposal in fact doesn’t add any new grammar construct to C++. It only adds new checks. That is, safe C++ code must be valid C++ code, so it should be compilable by other compilers which support C++. It is compatible and portable by default!

So we can develop and use this ahead of time and send it to WG21 when we feel it is good. Then everyone can have a better understanding of it and this proposal may have a better chance of becoming part of the standard.

Road Maps

Assuming the community likes this, the following steps might be:

  1. Add #pragma clang SafeCXX and #pragma clang UnsafeCXX to clang and the mechanism to decide if a location is safe: clangir/clang/include/clang/Basic/SafeCXXState.h at safe-c++ · ChuanqiXu9/clangir · GitHub
  2. Discuss the design of Safe C++ in the community. Maybe we don’t need a complete design at first. I think it is always good to improve safety.
  3. Implement the designed checks in the clangir repo.
  4. After clangir gets merged into the main repo, continue the implementation in the main repo.
  5. After CIR is included in a released clang, Safe C++ is released as experimental automatically.
  6. Release it formally.
  7. Propose to be part of standard.

The 4th step may be confusing. CIR depends on MLIR and, to my understanding, even if CIR is merged into clang, clang won’t depend on MLIR immediately, since that may increase the binary size of clang significantly. So making clang depend on MLIR by default needs yet another discussion.

Before this happens, users who want to try this can only build the compiler themselves.

Conclusion

This post proposes adding two pragmas, #pragma clang SafeCXX and #pragma clang UnsafeCXX, to specify the safe part of a C++ program, and proposes that we can add new checks for the safe part that reject valid C++ programs for safety reasons. We call code that passes these new checks safe C++. Ignoring these checks, the program must be a valid C++ program too.

To give users a more concrete feeling for safe C++, an implementation and several examples are presented. Some topics (references in async calls) may not have a good solution right now. The design space is pretty open.

Possible Questions & Answers

The proposal ends here. Below are some possible questions and answers that might help readers understand it better.

Why do we need ClangIR?

Because I want borrow check in safe C++. And borrow check requires path sensitive analysis which is super annoying on AST.

Do we have to depend on ClangIR?

Technically no. Even for the borrow check, according to 2094-nll - The Rust RFC Book, it should be possible to perform it on the AST. But then we can’t perform precise analysis, and we might have to introduce a lot of {} to tell the compiler about lifetimes, as the NLL proposal shows at its beginning.

Do other checks require ClangIR?

No. In fact, for path insensitive checks like “deprecated calls”, “no pointers”, “constant globals”, it is better to check them on the AST directly. But if we have other path sensitive checks, it is still better to do them in ClangIR.

What is the status quo of ClangIR for analysis?

In my mind, there are two problems. One is expressiveness. Due to the complexity of C++, ClangIR can’t handle much of it today. We might hit a lot of crashes even for a small C++ program. The other is the abstraction level. I feel the current level of ClangIR is slightly low. e.g., constness is elided in CIR’s type system, which is pretty important for borrow checking. Also, in CIR there is no difference between references and pointers (the reason why I didn’t forbid pointers in the demo). And we can’t know whether a call is a member call or not…

But maybe we don’t have a better choice. I thought about creating another higher level IR for clang by listing the AST nodes, but I am afraid we would be reinventing the wheel. On the other hand, neither of the above problems is unsolvable. CIR is still at a relatively early stage, so we can add information to it relatively easily. And from the perspective of the community, given that we’ve already decided on CIR as the future direction, it looks better for the community to help CIR represent more programs.


It would be nice if the RFC talked about how it fits into our extension criteria.

I appreciate the RFC and that we’re looking to continue to improve safety and security in Clang! However, given that CIR has barely begun to be upstreamed, it seems a bit premature to run an RFC against it. What do you envision is the timeline for getting started on a PR for this? We could add the pragmas now, but there’s not much point to them until CIR is sufficiently upstreamed so as to be usable, which is a ways away at this point.

As for the design: why pragmas? C++ typically avoids the preprocessor as much as possible, so I don’t see a clear path to standardization through that design. Also, pragmas are pretty hard to reason about in some cases because you run into things like:

void foo(int x) {
  if (x % rand() == rand()) {
    bar(x);
    baz(x + 10);
    #pragma clang SafeCXX // Is it fine for this to be in the middle of the block?
    quux();
  }
  quux(); // Is this also SafeCXX since there's no "unsafe" marking nearby?
}
#pragma clang UnsafeCXX
// Is the entire rest of the TU unsafe or is this an error?

Would attributes make more sense and be more ergonomic because they have clear appertainment rules?


I am glad that you are working on this problem! I did not go into the details, but have some high-level questions/concerns.

First of all, some bikeshedding. There is no agreed upon definition of safe C++ just yet, so I think a spelling like #pragma clang SafeCXX might be a bit premature. I’d suggest an alternative name instead of SafeCXX, so that if something slightly different gets standardized, we do not clash with it.

I wonder if we need [[clang::NoBorrowToRet]] at all. We already have a noescape attribute, and I was wondering if that would have the semantics that you would need. In case we can express it with existing attributes, without introducing new concepts/vocabulary, that could potentially be a great win.

Your approach is very light on annotations. I wonder how well it would cover more complicated scenarios, e.g., code with vector<span<int>> or map<string_view, string_view>. In the latter case, we do not know whether the keys and the values have related lifetime or not if we have no annotations.

And borrow check requires path sensitive analysis which is super annoying on AST.

Small nit, I believe Rust’s borrow checker is not path-sensitive. It is just flow-sensitive. (Pure) path-sensitivity means that the analysis considers “full” execution paths, without merging states after branches.

Overall, I think it is imperative that Clang should have something akin to borrow checking at some point because this is the most popular and well proven method to ensure memory safety that scales to large programs. Given the importance of this topic, I think it might even make sense to potentially run some experiments upstream to help more people join these efforts even before anything is standardized or there is general agreement on the direction (but @AaronBallman might have a different opinion :slight_smile: ). In some strategic topics it might make sense to have a central place for people to innovate together.

Adding some folks for awareness.
@bcardosolopes @usx95 @ilya @kinu @gribozavr @ymand @martinboehme


Type safety can’t be addressed this way. You can prevent default construction of unique_ptr, but that doesn’t resolve the type safety problems:

  • How do you move a unique_ptr without leaving the source nulled?
  • unique_ptr arguments can be null.

Code in the safe pragma still has to treat unique_ptr defensively, so in what sense is that code safe?

A memory-safe function has defined behavior for all valid inputs. If a nulled unique_ptr is a valid input, then a safe function cannot expect it to be non-null.


… I think we should forbid use of pointers (so nullptr) in safe C++. I think we should do this since the pointers are the nightmares for safety. And we can always easily use them in the unsafe mode.

We took a significantly less aggressive approach with C++ Safe Buffers and even that is still extremely constraining for developers.

https://clang.llvm.org/docs/SafeBuffers.html#buffer-operations-should-never-be-performed-over-raw-pointers

What options do you imagine the users of this programming model would have other than using the unsafe mode?
If a safe programming model is so strict that real-world projects can adopt it only with frequent use of escape hatches then I don’t see what value it provides.

Maybe I missed something, but how do you plan to treat iterators? It is very common to have multiple mutable references (“borrows”) to a container in the form of iterators.


This seems way too early to RFC… the design is still extremely immature, and the underlying ClangIR bits are still early in development. I don’t think we can properly evaluate this at this point.

That said, I will note that Rust’s “unsafe” is split into two parts: declaring APIs as safe/unsafe, and declaring code inside functions as safe/unsafe. And this is a very important part of how Rust’s safety story works: safe code is guaranteed to be safe because it only calls safe APIs. If you don’t have that, the end result isn’t really “safe” in the same sense.

For unique_ptr, I think you need to explain the interaction with move constructors.


Thanks for replying. I’ll split the discussion into two parts: the higher level part about the interface (#pragma clang SafeCXX) that lets us add new checks to reject valid C++ programs, and the lower level part about the concrete designs. I think these two parts can be separated. The goal of the RFC is to reach consensus on the higher level interface. After that, we can discuss/develop/design/experiment with/add the concrete checks.

The lifetime of the proposal

@AaronBallman :

What do you envision is the timeline for getting started on a PR for this? We could add the pragmas now, but there’s not much point to them until CIR is sufficiently upstreamed so as to be usable, which is a ways away at this point.

The lifetime of the RFC/area in my mind is:

  • Get a consensus on the higher level interface
  • Add the higher level interfaces to clang
  • Everyone can then try to add new checks to the so-called safe C++ in other threads/PRs. Note that not every check has to be implemented in CIR.

I didn’t intend to provide a full and complete safe solution in a single RFC. It is too heavy. The devil is in the details.

So I think it is meaningful to add the pragmas so that we can then add separate checks. Even while CIR is not usable, there are some checks that can be done on the AST now. For example, for the buffer-unsafe-usage mentioned by @jkorous, we can simply enable it for the so-called safe C++. These checks belong to the lower level part.

@efriedma-quic

This seems way too early to RFC… the design is still extremely immature, and the underlying ClangIR bits are still early in development. I don’t think we can properly evaluate this at this point.

Does this sound good to you?

Naming and safety

As pointed out by @Xazax-hun:

There is no agreed upon definition of safe C++ just yet, so I think a spelling like #pragma clang SafeCXX might be a bit premature. I’d suggest an alternative name instead of SafeCXX just in case something slightly different gets standardized

Also, according to the above discussion, we may not be able to provide a completely safe solution for a while (since the checks will be implemented step by step). So the term safe may be too aggressive. Not absolutely safe is unsafe. So let’s pick a new name. How about strict? We can use

#pragma clang strict
#pragma clang unstrict

Does this sound better to everyone? From here on, let’s use the term strict C++.
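
A minimal sketch of how the two pragmas might partition a translation unit, assuming the spellings proposed above. Neither pragma exists in any released compiler, and which checks the strict region would enable is still open. Today's compilers simply ignore both lines (possibly with an unknown-pragma warning), which is the point of the pure-subset design. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical pragma from this thread: code below would get the extra checks.
#pragma clang strict
int sum_strict(const std::vector<int> &values) {
  int total = 0;
  for (std::size_t i = 0; i < values.size(); ++i)
    total += values.at(i); // bounds-checked access
  return total;
}

// Hypothetical pragma from this thread: back to ordinary C++ rules.
#pragma clang unstrict
int sum_legacy(const int *data, std::size_t n) {
  int total = 0;
  for (std::size_t i = 0; i < n; ++i)
    total += data[i]; // raw pointer indexing, left unchecked in this region
  return total;
}
```

Both functions remain ordinary ISO C++, so a compiler that knows nothing about the pragmas still builds the file unchanged.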

@efriedma-quic

That said, I will note that Rust’s “unsafe” is split into two parts: declaring APIs as safe/unsafe, and declaring code inside functions as safe/unsafe. And this is a very important part of how Rust’s safety story works: safe code is guaranteed to be safe because it only calls safe APIs. If you don’t have that, the end result isn’t really “safe” in the same sense.

But we may have to call unsafe (legacy) code after all. So if what you say is true, it looks like we can never reach a truly safe state in C++? In any case, I have changed the term from safe to strict. Does this sound good to you?

Why pragmas

@AaronBallman:

As for the design: why pragmas? C++ typically avoids the preprocessor as much as possible, so I don’t see a clear path to standardization through that design.

I use pragmas because I want strict C++ to be a pure subset of C++, except for ignorable #pragma directives and attributes. Then code checked by strict C++ will be accepted by other compilers. I think this is super important for compatibility. We don’t like dialects, but I think a dialect is fine if it is a pure subset of C++.

As for standardization, I think the important parts are the checks. For example, suppose we have implemented and shipped borrow checks in strict C++ and they get widely used. Then people can write a paper to WG21 proposing new interfaces, e.g.,

struct S {
    int *x;

    S();
    S(const S &other);
    S &operator=(const S &other);

    void get(int &x);
    void consume() const;
};

safe { // new interface!
void invalid(bool cond) {
  S s;
  if (cond) {
    int a = 0;
    s.get(a); // expected-note {{the previous borrow starts from}}
  }
  s.consume(); // expected-warning {{use of s detected beyond lifetime of borrow a}}
}
}

Then the review process in WG21 can focus on the interface, since everyone can be sure the underlying borrow checks are good.

After all, standardization is not a goal here; a better term may be a hope : ).

pragma in the middle of function

@AaronBallman:

Also, pragmas are pretty hard to reason about in some cases because you run into things like:
void foo(int x) {
  if (x % rand() == rand()) {
    bar(x);
    baz(x + 10);
#pragma clang SafeCXX // Is it fine for this to be in the middle of the block?
    quux();
  }
  quux(); // Is this also SafeCXX since there's no "unsafe" marking nearby?
}
#pragma clang UnsafeCXX
// Is the entire rest of the TU unsafe or is this an error?

The answers to the first and third questions are:

  • Yes, it is fine for it to be in the middle of the block.
  • Yes, the entire rest of the TU is unsafe. It is not an error.

For the second question, the answer is: these locations are considered to be SafeC++ (strict C++), but some checks may not be performed. The problem can be stated as:

The flow-insensitive checks are performed. But the flow-sensitive checks may not be performed, or may give false negatives.

The flow-sensitive checks are sensitive to control flow. So we restate the guarantee as:

A flow-sensitive check is guaranteed to be accurate only if the whole function is strict.

I think this more or less makes sense. If the function is not strict as a whole, and there are uses in the unstrict regions that a check depends on, then it makes sense that those uses are not checked.

Would attributes make more sense and be more ergonomic because they have clear appertainment rules?

I fear attributes would be less flexible: we couldn’t switch between strict and unstrict modes as casually.

NoBorrowToRet

I wonder if we need [[clang::NoBorrowToRet]] at all. We already have a noescape attribute, and I was wondering if that would have the semantics that you would need. In case we can express it with existing attributes, without introducing new concepts/vocabulary, that could potentially be a great win.

Nice suggestion!
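
For reference, a minimal sketch of Clang's existing noescape attribute on a pointer parameter. Whether it covers the intended semantics of [[clang::NoBorrowToRet]] is exactly the open question here; the sketch only shows the existing attribute. The NOESCAPE macro and the function names are hypothetical, and the macro expands to nothing on compilers that do not know the attribute.

```cpp
#include <cassert>

// Use the attribute only where the compiler reports support for it.
#if defined(__has_cpp_attribute)
#  if __has_cpp_attribute(clang::noescape)
#    define NOESCAPE [[clang::noescape]]
#  endif
#endif
#ifndef NOESCAPE
#  define NOESCAPE
#endif

// The pointer is only dereferenced during the call; it is not stored
// anywhere that outlives the call.
int read_once(NOESCAPE const int *p) { return *p; }

int caller() {
  int local = 41;
  return read_once(&local) + 1; // passing the address of a local is fine here
}
```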

Lifetime annotations

@Xazax-hun

Your approach is very light on annotations. I wonder how well it would cover more complicated scenarios, e.g., code with vector<span<int>> or map<string_view, string_view> . In the latter case, we do not know whether the keys and the values have related lifetime or not if we have no annotations.

Yes, I have a similar feeling. Maybe we have to introduce a lifetime annotation system similar to Rust’s. Otherwise the results may be very imprecise.
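
As a point of comparison, here is a minimal sketch using the annotation Clang already has, [[clang::lifetimebound]], which says that the returned reference may refer to the annotated parameter. It cannot express the kind of relation raised above (for example, whether the keys and values of a map<string_view, string_view> share a lifetime), which is where a richer annotation system would come in. The LIFETIMEBOUND macro and the function name are hypothetical.

```cpp
#include <cassert>
#include <string>

// Use the attribute only where the compiler reports support for it.
#if defined(__has_cpp_attribute)
#  if __has_cpp_attribute(clang::lifetimebound)
#    define LIFETIMEBOUND [[clang::lifetimebound]]
#  endif
#endif
#ifndef LIFETIMEBOUND
#  define LIFETIMEBOUND
#endif

// The result refers to one of the two arguments, so it must not outlive
// either of them.
const std::string &longer(const std::string &a LIFETIMEBOUND,
                          const std::string &b LIFETIMEBOUND) {
  return a.size() >= b.size() ? a : b;
}
```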

Small nit, I believe Rust’s borrow checker is not path-sensitive. It is just flow-sensitive. (Pure) path-sensitivity means that the analysis considers “full” execution paths, without merging states after branches.

Thanks for the correction!

Overall, I think it is imperative that Clang should have something akin to borrow checking at some point because this is the most popular and well proven method to ensure memory safety that scales to large programs. Given the importance of this topic, I think it might even make sense to potentially run some experiments upstream to help more people joining these efforts even before anything is standardized or there is a general agreement on the direction (but @AaronBallman might have a different opinion :) ). In some strategic topics it might make sense to have a central place for people to innovate together.

Completely agreed!

For type safety and move

@seanbaxter

Type safety can’t be addressed this way.

Yes, I was trying to mock it.

  • How do you move a unique_ptr without leaving the source nulled?

For move, I think it is not hard to implement a use-after-move check in IR.

A memory-safe function has defined behavior for all valid inputs. If a nulled unique_ptr is a valid input, then a safe function cannot expect it to be non-null.

Yes, given that strict functions can be called by unstrict functions, we have to check it. So… if we really want to forbid it, maybe we can only deprecate std::unique_ptr in strict mode? Slightly crazy. Maybe contracts can help here, but that is a more distant topic.
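
The two points above can be illustrated with plain ISO C++, as a sketch only: moving a unique_ptr leaves the source null, so a function that may be called from unstrict code has to treat null as a valid input. The function names are hypothetical, and no strict-mode check is involved here.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Takes ownership; after the call the caller's pointer is null.
int consume(std::unique_ptr<int> p) { return p ? *p : -1; }

// Callable with a moved-from pointer, so it must not assume non-null.
int read_or_default(const std::unique_ptr<int> &p, int fallback) {
  return p ? *p : fallback;
}
```

A use-after-move check would flag a dereference of the source after std::move; the null test in read_or_default is the defensive alternative when such a pointer can still arrive from unchecked callers.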

Memory buffer

@jkorous

What options do you imagine the users of this programming model would have other than using the unsafe mode?
If a safe programming model is so strict that real-world projects can adopt it only with frequent use of escape hatches then I don’t see what value it provides.

As mentioned above, the concrete strict model is still very much an open topic. And I do think safe buffers can be part of the strict mode.

iterators

@fmayer

Maybe I missed something, but how do you plan to treat iterators? It is very common to have multiple mutable references (“borrows”) to a container in the form of iterators.

This was not addressed. For common iteration, we can provide a single iterator that calls next until it reaches the end (maybe we need a new iterator wrapper class). There are some problems with the code the compiler generates for range-based for loops, but I think we can overcome them since that code is generated by the compiler.

But if we really need multiple mutable references, I think the borrow checker might not allow it.
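
A minimal sketch of what such a wrapper could look like, assuming a single cursor object that owns the only mutable access to the container for the duration of the loop. Cursor and double_all are hypothetical names, not part of the demo implementation.

```cpp
#include <cassert>
#include <vector>

// Hypothetical single-pass cursor: one object holds both the position and
// the end, so the loop needs only one mutable borrow of the container.
template <typename T>
class Cursor {
public:
  explicit Cursor(std::vector<T> &items)
      : cur_(items.begin()), end_(items.end()) {}

  // Returns a pointer to the next element, or nullptr once the end is reached.
  T *next() { return cur_ != end_ ? &*cur_++ : nullptr; }

private:
  typename std::vector<T>::iterator cur_;
  typename std::vector<T>::iterator end_;
};

// Doubles every element in place and returns how many were visited.
int double_all(std::vector<int> &values) {
  Cursor<int> cursor(values);
  int count = 0;
  while (int *item = cursor.next()) {
    *item *= 2;
    ++count;
  }
  return count;
}
```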

Hi @ChuanqiXu!

Was there something in particular that you could not model as AST annotations with attributes and/or pragmas?

Hi Vassilev!

I am not sure, since I haven’t designed the counterpart of Rust’s explicit lifetime annotations (Explicit annotation - Rust By Example). If there is anything, it would be that. Why do you ask?

I am asking because not long ago there was an RFC: [RFC] Lifetime annotations for C++ which might cover some of your concerns. @mizvekov, worked quite hard to retain the annotation information throughout the frontend, including template instantiations. Maybe worth checking it out, if you haven’t.

Thanks! It looks helpful.

@ChuanqiXu thanks for sharing your take on a safe C++ - I’m also curious about the same attributes mentioned by @vvassilev, we have variations of them in clang and it would be nice if you can reuse those (there are other SG23 proposals that even propose them as part of the C++ language, like P2878, etc).

I feel the current level of ClangIR is slightly low, e.g., constness is elided in CIR’s type system, which is pretty important for borrow checking. Also, CIR makes no distinction between references and pointers (the reason why I didn’t forbid pointers in the demo). And we can’t tell whether a call is a member call or not…

This is expected, it’s when the use case appears that we usually go about improving the representation. You could still propose those dialect specific improvements in the CIR incubator while maintaining your pass / experiment in your downstream repo.