Mixing Error Pointers and NULL

An error pointer is a special Linux kernel thing where we take an error code such as -ENOMEM or -EINVAL, and cast it to a pointer. Error codes are in the (-4095)-(1) range. No valid pointer can be in this range. There is an IS_ERR() function which tests if a pointer is an error pointer.

In the Linux kernel, some functions return NULL on error, some functions return error pointers and some functions return both. It looks something like this:

return ERR_PTR(-ENOMEM);

...

ptr = function_call();
if (IS_ERR(ptr)) {
        ret = PTR_ERR(ptr);

I work in fixing existing kernel code so I never really have to think about designing new APIs or where it’s appropriate to return NULL vs when to return error pointers. It probably depends on the subsystem patterns and if you need to return accurate error codes to the user. Also some errors need to be handled differently like -EPROBE_DEFER. I don’t have any guidance on that. What I want to write about instead are the functions which return a mix of error pointers and NULL.

When a function returns both error pointers and NULL, the NULL is *not* an error, it is a special kind of success which means that the feature has been deliberately disabled by the user. For example, maybe the hardware does not have LEDs so the LED subsystem is disabled in the kernel. The hardware can still run without LEDs. It’s not an error. It’s deliberate.

p = get_optional_feature();
if (IS_ERR(p))
        return PTR_ERR(p);

At this point “p” can be a valid pointer if the feature is enabled or NULL if the feature is disabled. The surrounding code needs to be check for NULL and not crash. But it should not treat NULL returns as an error or print an error message.

If the feature is not optional for the driver, then the Kbuild system should enforce that the driver “depends” or “selects” the feature. Do not let people build non-usable drivers.

Smatch has a number of different checks and warnings related to error pointers. Sometimes IS_ERR() checks should be checks for NULL and vice versa. Sometimes people pass true or non-negative valules to ERR_PTR().

Another thing that people do is pass NULL to PTR_ERR(). NULL gets cast to zero which means success in the kernel. This might be a “forgot to set the error code” bug. Or it might be intentional. Quite often it looks like this:

p = get_feature();
if (IS_ERR_OR_NULL(p))
        return PTR_ERR(p);

This code could be correct. Maybe it is a function called register_feature(). If the feature has been disabled then there is nothing to register and this code can return success. Job’s done!

All the examples in the kernel are correct because we have fixed the buggy code. Unfortunately, most new instances of this warning are bugs.

Probably the most common bug here is that the get_feature() function is not optional and cannot return NULL. This is more of a theoretical bug because if the get_feature() function does not return NULL, then IS_ERR() and IS_ERR_OR_NULL() are functionally the same. Sometimes people think IS_ERR_OR_NULL() is safer than just checking for IS_ERR() because if future developers accidentally return NULL then that’s handled.

Except it’s not handled. Does the surrounding code also handle NULL pointers? If the future programmer intends to return PTR_ERR(-ENOMEM) and instead returns NULL then this code returns success. So it hasn’t fixed the bug but instead changed it into something more difficult to debug.

In a way adding IS_ERR_OR_NULL() checks makes future bugs more likely. Code is supposed to communicate. This code communicates that returning NULL is expected which makes people more likely to do it.

There are already static analysis checks which warns about NULL dereferences from mixed error pointer and NULL functions. Checking for IS_ERR_OR_NULL() makes static analysis more difficult.

If a future programmer adds a return NULL it leads to a NULL dereference, that’s going to be relatively straight forward to debug. The stack trace will show the call tree. But checking for IS_ERR_OR_NULL() means that instead of a NULL dereference the bug is going to exhibit as something different such as using uninitialized data. The stack trace will be more confusing.

In the kernel, we do no like dead code. We do not like stub code to handle future changes. We do not like working around bugs and instead we fix them. Everything about checking IS_ERR_OR_NULL() instead of IS_ERR() as a future proofing strategy is wrong.

Another bad thing code sometimes does is:

ptr = get_optional_feature();
if (IS_ERR(ptr))
        ptr = NULL;

The code here is ignoring errors in optional features.

Maybe that can be the right way to handle errors if the feature is necessary to get the system to boot. But generally errors should be reported back to the user instead of ignored. It’s “optional” for the user not for the kernel. The kernel still has to do what the user says.

The main exception to the NULL for disabled, error pointers for error pattern is debugfs. Debugfs used to work like as described, but debugfs is weird because it’s not *supposed* to be checked for errors. When people wrote buggy error handling code we used to tell them do delete it but they would “fix” it instead. Now it’s harder to write “correct” error checking and that helps people feel better about deleting their error checking code.

One response to “Mixing Error Pointers and NULL”

  1. […] returns error pointers or NULL. Sometimes functions return both and I have written about that before. But debugfs is a special case where the functions are not supposed to be checked for […]

    Like

Leave a comment

Design a site like this with WordPress.com
Get started