In general I'm hugely in favor of using the type system to its fullest extent to protect you from messing up. If the compiler can protect me, it should.
Examples where I've used it in the past: ValidatedEmail, which is a special form of Email, one that has been validated by the user.
We can have actions that require a `PrivilegedUser`, which can be created from a `User`. That creation validates ONCE whether your user is privileged.
This saves you from a whole bunch of .is_privileged() calls in your admin panel.
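A minimal C sketch of that idea (all names here are hypothetical, not from any library): the wrapper type can only be obtained through the one check, so any function that demands it in its signature needs no further privilege checks.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct { int id; bool is_privileged; } user_t;

/* The only way to obtain one of these is through the check below. */
typedef struct { int id; } privileged_user_t;

/* Validate ONCE: fills `out` and returns true only for privileged users. */
static bool privileged_user_from(const user_t *u, privileged_user_t *out) {
    if (u == NULL || !u->is_privileged) return false;
    out->id = u->id;
    return true;
}

/* Admin-panel actions demand the wrapper, so no .is_privileged() calls. */
static void admin_delete_everything(privileged_user_t admin) {
    (void)admin; /* ... */
}
```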
The post 'Boolean Blindness' [0] talks about many of the same issues.
[0]: https://existentialtype.wordpress.com/2011/03/15/boolean-bli...
From experience, parsing input into data structures that fit the problem domain once at the "edge" is a good idea. The code becomes a lot more maintainable without a bunch of validation checks scattered all over the place, picking a data structure for the problem at hand usually leads to cleaner solutions, and errors usually show up much earlier and are easier to debug.
From experience though, I've found that wrapping all data in newtypes adds too much ceremony and boilerplate. If the data can reasonably be expressed as a primitive type, then you might as well express it that way. I can't think of a time when newtype wrapping would have saved me from accidentally skipping validation or passing the wrong data as a parameter. The email example especially is quite weak, with ~30 lines of code that are pure ceremony around wrapping a string, and most likely it's just going to be fed as-is to various CRUD operations that will cast the data back to a string immediately.
Interacting with Haskell/Elm libraries that make pervasive use of newtypes can be painful, especially if they don't give you a way to access the internal data. If a use case comes up that the library developer didn't account for, then you might have no way of modifying the data, and you end up needing to patch the library upstream.
I think it can be useful to think of the parsing and logic parts both as modules, with the parsing part interfacing with the outside world via unstructured data, and the parsing and logic parts interfacing via structured data, i.e.: the validated types.
From that perspective, there is a clear trade-off on the size of the parsing–logic interface. Introducing more granular, safer validated types may give you stronger guarantees, but it forces you to expand that interface and creates coupling.
I think there is a middle ground, which is that these safe types should be chunked into larger structures that enforce a range of related invariants and hopefully have some kind of domain meaning. That way, you shrink the conceptual surface area of the interface so that working with it is less painful.
I've been shouting this from the rooftops ever since I independently came to the same conclusion through a sufficient volume of experience and empirical evidence. I consider it a small tragedy, however, that my words tend to fall on deaf ears as I try to explain to e.g. co-workers why they should, in fact, parse their input and use the language's type system to their advantage, vs. all the `validateThis` and `validateThat` they seem to be copying from e.g. Stack Overflow.

I see them stumbling and breaking their noses time and again passing URIs around as strings -- even in languages other than C (so this isn't a problem C created and needs to solve alone) -- where it's never enforced that the URI is e.g. "absolute" or "relative" (never mind the fact that these people often don't seem to care about the difference). They make the same error with date and time (how many decades of advice do we need to accumulate and impart on our younger peers before they irrevocably understand why e.g. date and time should never be passed around as numbers or strings, with very clearly documented and justified exceptions?).

But the tragedy is never more evident than when the person looks at you like you're suddenly speaking Klingon, bored with the notion, certain it cannot be so important as to disturb their very important "coding project" (we don't write Knuth's "literate" programs any more -- we _code_).
While this is the style I use in all of my (web) applications, it took a while to build the framework/tooling that overcomes the boilerplate of parsing all incoming data into value objects. A typical application of mine has dozens of entities and hundreds to thousands of fields. Creating value objects manually for all the stringly-typed and numeric fields is a lot of work. Most value objects are some simple validation, like a 255-character limit or an int within some range.
I solve this by using reflection and auto-generating all value objects (inheriting by default from base types) and auto-generating all accessor/controller classes or methods into the domain model. So I model in base types, override the generated value object constructors for validation (if required), and all of the boundaries use value objects. The internal code generally works with the underlying base types, because boxing/unboxing the value objects can have a non-negligible performance impact when serializing a lot of data (which tends to be common in web applications... SQL > JSON > HTML).
I'm a huge fan, but I think YMMV. Web applications tend to have a wide interface (much of the domain model is user accessible). I think it's ideal for this case because of the number of fields a user can ultimately set, and their reuse across many places.
`_t` should not be used for custom types since it's reserved for future standard types (and/or types declared in a header you might include someday). This does cause real-world problems (`key_t` anyone?).
Gratuitous allocations are gratuitous.
The whole "prevent double free" claim is completely bogus. Setting a variable to `NULL` only works for cases where there is one, obvious, owner, which is not the circumstance under which double free is prone to happening in the first place. Actually preventing double free requires determining ownership of every object, and C sucks at that.
> `_t` should not be used for custom types since it's reserved for future standard types (and/or types declared in a header you might include someday).
That old thing again...
The _t postfix is only reserved in the POSIX standard, but not in the C standard (and C and POSIX are entirely different things - outside the UNIX bubble at least).
It's unlikely that POSIX changes anymore, but if you get a name collision in a new POSIX version it's still just a simple name collision, and it's up to the user code to fix that.
And it's not like symbol collision problems are limited to POSIX, the world won't end because some piece of C code collides with a symbol used in a dependency, there's always ways to isolate its usage.
Also, it's good practice in the C world to use a namespace prefix for libraries, and such a prefix will also make sure that any _t postfix will not collide with POSIX symbols (the question is of course why POSIX couldn't play nice with the rest of the world and use a posix_ prefix -- historical reasons I guess, but then just going ahead and squatting on the _t postfix for all eternity is a bit rich).
The C23 spec also says:
> A potentially reserved identifier becomes a reserved identifier when an implementation begins using it or a future standard reserves it, but is otherwise available for use by the programmer.
Which, in practice, does mean that using _t is likely to cause you problems, as it may become a reserved identifier when an implementation like POSIX begins using it.
Still debatable since the C standard doesn't reserve the _t postfix (it does reserve a single leading underscore followed by a capital letter, e.g. _Bool, and IIRC it also reserves two leading underscores).
What POSIX reserves or doesn't reserve doesn't affect code that follows only the C standard but doesn't care about POSIX compatibility, and especially _t is so widely used in C libraries that POSIX's opinion obviously doesn't matter all that much in the real world.
Whilst you do point to 6.4.3 for where it does reserve "All identifiers that begin with a double underscore (__) or begin with an underscore (_) followed by an uppercase letter"... That section also has the lovely:
> Other identifiers may be reserved.
If an implementation of C uses it... Just... Don't. The standard won't save you here, because it's happy for an implementation to do whatever they feel like.
Yeah, and that makes any reserved name rules pretty much useless anyway, e.g. anything goes until a collision actually happens, and then it needs to be fixed on the user side anyway.
"the UNIX bubble" is an interesting take in the context of C, given the origins of C
Is your point "why did posix not establish a prefix_ ... _suffix" combo, and maybe even better some reserved "prefix_" namespace?
which --- I think --- for better or worse leads to the reality that C doesn't have a namespace mechanism, like, say, Java.
Well, C does have a namespace mechanism, it's called prefixes ;) It's just unfortunate that both POSIX and the C stdlib don't use prefixes (except for the new C23 stdc_ functions, which are a step in the right direction at least).
The problem with C++ style namespaces as language feature is that they require name mangling, which opens up a whole new can of worms.
In the end, the POSIX _t just means "don't blame us when your library names collide with POSIX names", and that's fine. Other platforms have that problem as well, but the sky hasn't fallen because an occasional type or function name collision.
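For illustration, a sketch of the prefix convention with a made-up `acme` library: the prefix is the namespace, and it incidentally keeps any _t postfix clear of POSIX names.

```c
#include <stddef.h>

/* Hypothetical library "acme": the acme_ prefix acts as the namespace,
 * so acme_key_t can't collide with POSIX's key_t. */
typedef struct acme_key acme_key_t;

acme_key_t *acme_key_create(const char *value);
void        acme_key_destroy(acme_key_t *k);
```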
> The whole "prevent double free" claim is completely bogus.
The way I interpreted the author's intent was that, the logic of error handling (something C sucks even more at) can be greatly simplified if your cleanup routine can freely be called multiple times. At the moment an error happens you no longer have to keep track of where you are in the lifecycle of each local variable, you can just call cleanup() on everything. I actually like the idea from that standpoint.
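A sketch of that style in C, assuming a small free-and-null helper (`cleanup` is a made-up name): because the helper is a no-op on an already-nulled slot, the error path can release everything without tracking which allocations actually happened.

```c
#include <stdlib.h>

/* Free-and-null helper: safe to call any number of times on the same slot. */
static void cleanup(char **p) {
    free(*p);
    *p = NULL;
}

static int do_work(void) {
    char *a = NULL, *b = NULL;
    int rc = -1;

    a = malloc(64);
    if (a == NULL) goto done;
    b = malloc(64);
    if (b == NULL) goto done;

    /* ... work ... */
    rc = 0;

done:
    /* No lifecycle bookkeeping: cleanup is a no-op on NULL slots. */
    cleanup(&a);
    cleanup(&b);
    return rc;
}
```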
That seemed kind of dubious to me as well, but setting the pointer to the freed memory to NULL is good, maybe. Though, with their design, I think it would cause problems with passing the address of a stack-allocated wrapper struct to the constructor function, if one were into that sort of thing.
I was reading something a day or two ago where they were talking about using freed memory and their 'solution' to the problem was, basically, if the memory location wasn't reassigned to after it was freed it was 'resurrected' as a valid memory allocation. I'm fairly certain that won't ever lead to any impossible to diagnose bugs...
For the _t suffix it is deeply unfortunate that on the one hand C's standard library gives people this idea but then the standard says they mustn't use it.
I understand exactly why it was necessary, but to my mind that highlighted an urgent need to provide actual namespacing so that we don't need to rope off whole categories of identifiers for exclusive use by the stdlib, with the implication that every single library will need to do the same. This should have been addressed last century IMO.
> an urgent need to provide actual namespacing
Some newer parts of the standard library use a stdc_ prefix now (https://en.cppreference.com/w/c/numeric/bit_manip.html).
> The whole "prevent double free" claim is completely bogus.
"Completely" means "for all". Are you seriously claiming that "for all instances of double-free, setting the pointer to NULL after freeing it would not help"?
> "Completely" means "for all".
Not in the case of bogosity. Completely bogus things might occasionally work under some very particular circumstances, but unless those particular circumstances just happen to be the circumstances you actually care about, complete bogosity can still obtain.
> setting the pointer to NULL
There is no such thing as setting a pointer to null. You can set the value of a variable (whose current value is a pointer) to null, but you cannot guarantee that there isn't a copy of the pointer stored somewhere else except in a few very particular circumstances. This is what the GP meant by "setting a variable to `NULL` only works for cases where there is one, obvious, owner". And, as the GP also pointed out, this "is not the circumstance under which double free is prone to happening in the first place." Hence: complete bogosity.
Eeeeh, I don't think 'completely bogus' means 'exhaustively false for all situations'. It just means 'demonstrably false' (for some relatively sane example, we're talking about C after all which means there will always be bogus examples which break any given assumption). There's plenty of cases where zeroing a pointer immediately after freeing it will prevent any further issues. It's still bogus to claim that it categorically solves the problem of double frees. But it does help.
One flaw I've seen in "Parse, Don't Validate" as it pertains to real codebases is that you end up with a combinatorial proliferation of types.
E.g., requiring that a string be base64, have a certain fixed length, and be provided by the user.
E.g., requiring that a file have the correct MIME type, not be too large, and contain no EXIF metadata.
If you really always need all n of those things then life isn't terrible (you can parse your data into some type representing the composition of all of them), but you often only need 1, 2, or 3 and simultaneously don't want to duplicate too much code or runtime work, leading to a combinatorial explosion of intermediate types and parsing code.
As one possible solution, I put together a POC in Zig [0] with one idea, where you abuse comptime to add arbitrary tagging to types, treating a type as valid if it has the subset of tags you care about. I'm very curious what other people do to appropriately model that sort of thing though.
[0] https://github.com/hmusgrave/pdv
> E.g., requiring that a file have the correct MIME type, not be too large, and contain no EXIF metadata.
"Parse, don't validate" doesn't mean that you must encode everything in the type system -- in fact I'd argue you should usually only create new types for data (or pieces of data) that make sense for your business logic.
Here the type your business logic cares about is maybe "file valid for upload", and it is perfectly fine to have a function that takes a file, performs a bunch of checks on it, and returns a "file valid for upload" newtype if it passes the checks.
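A rough C sketch of that shape; the predicate names and limits are placeholders, not a real API. The point is that only the one check function can produce the "valid for upload" type.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct { const char *path; size_t size; } file_t;

/* The one type the business logic cares about; only the check creates it. */
typedef struct { file_t file; } valid_upload_t;

/* Placeholder predicates standing in for real MIME/size/EXIF checks. */
static bool has_allowed_mime(const file_t *f)  { (void)f; return true; }
static bool within_size_limit(const file_t *f) { return f->size <= 10u * 1024 * 1024; }
static bool has_no_exif(const file_t *f)       { (void)f; return true; }

/* Run every check once, then hand back the proof-carrying type. */
static bool check_upload(const file_t *f, valid_upload_t *out) {
    if (!has_allowed_mime(f) || !within_size_limit(f) || !has_no_exif(f))
        return false;
    out->file = *f;
    return true;
}
```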
You might be interested in Lean's way of doing things. They have normal types (e.g. numeric types) and subtypes (e.g. numbers less than zero). An element of the subtype "numbers less than zero" can be understood as a tuple containing the actual number (which has a normal numeric type) and a proof that this specific number is indeed less than zero.
https://lean-lang.org/doc/reference/latest/Basic-Types/Subty...
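A tiny Lean sketch of that (the `Negative` name is illustrative): the subtype bundles the raw value with a proof about it.

```lean
-- The subtype of integers below zero: a value paired with a proof.
abbrev Negative := { n : Int // n < 0 }

def minusTwo : Negative := ⟨-2, by decide⟩

#eval minusTwo.val   -- the underlying Int: -2
```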
Structs are representations of combinatorial types! In your file case, you could parse the input into a struct, and then accept or reject further processing based on the contents of that struct.
Of course, it would be reasonable to claim that the accept/reject step is validation, but I believe “Parse, don’t validate” is about handling input, not an admonition to never perform validation.
In pure C however, you still get the types-in-source-code explosion, for lack of parametric polymorphism. You need an email_or_error and a name_or_error, etc. The alternative is to fake PP with a void*, but that's so ugly I think I'd scrap the whole effort and just use char*.
> I believe “Parse, don’t validate” is about handling input, not an admonition to never perform validation.
It's about validation happening at exactly one place in the code base (during the "parse" - even though it's not limited to string-processing), so that callers can't do the validation themselves - because callers will validate 0 times or n>1 times.
> You need an email_or_error and a name_or_error, etc.
You don't need that. A practical solution is a generic `error` type that you return (with a special value for "no error") and `name` or `email` output arguments that only get set if there's no error.
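A minimal C sketch of that suggestion (the error names are made up): one generic error enum shared across all parsers, with the parsed value delivered through an output argument that is only written on success.

```c
#include <stddef.h>

/* One generic error type shared across parsers; ERR_NONE means success. */
typedef enum { ERR_NONE = 0, ERR_EMPTY, ERR_NO_AT_SIGN } parse_err_t;

typedef struct { const char *text; } email_t;   /* illustrative wrapper */

/* Returns ERR_NONE and sets *out on success; *out is untouched on failure. */
static parse_err_t email_parse(const char *input, email_t *out) {
    if (input == NULL || input[0] == '\0') return ERR_EMPTY;
    for (const char *p = input; *p; p++)
        if (*p == '@') { out->text = input; return ERR_NONE; }
    return ERR_NO_AT_SIGN;
}
```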
The trouble I have with this approach (which, conceptually, I agree with) is that it's damned hard to do anything with the parse results. Want to print that email_t? Then you're right back to char*, unless you somehow write your own I/O system that knows about your opaque conventions.
So you say, okay, I'll make an `email_to_string` function. Does it return a copy or a reference? Who frees it? etc, etc, and you're back to square one again. The idea is to keep char* and friends at "the edge", but I've never found a way to really achieve that.
Could just be my limitations as a C programmer, in which case I'd be thrilled to learn better.
In the past I've taken inspiration from strncpy: the caller needs to allocate the memory. For the email example, you'd probably also want a function to tell you the length of the email string, but for other types there are clear size limits. This puts the caller in control of memory allocation, so they may be able to statically allocate, allocate in an arena, or use other methods which promote performance. The static approach is really nice when it works, because there's nothing to free.
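A sketch of that calling convention in C, with a hypothetical `email_to_string` that reports the size it needs (snprintf-style), so the caller can choose stack, arena, or static storage and retry if the buffer was too small.

```c
#include <stddef.h>
#include <string.h>

typedef struct { const char *text; } email_t;   /* illustrative wrapper */

/* strncpy-inspired: the caller owns the buffer. Returns the size needed
 * (including the NUL); copies only if the buffer is big enough. */
static size_t email_to_string(const email_t *e, char *buf, size_t bufsize) {
    size_t need = strlen(e->text) + 1;
    if (buf != NULL && bufsize >= need)
        memcpy(buf, e->text, need);
    return need;
}
```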
Firstly, `parsing` is just a way to say "serialise from a string". The reverse operation can be done for every type you are creating. If the reverse operation (serialise to a string) does not exist in the interface then adding it gives you a single place to catch all the bugs.
I'm thinking of that recent git bug that occurred because the round-trip of `string -> type -> string` had an error (stripping out the CR character). Using a specific type for a value that is being round-tripped means that a bugfix needs to only be made in the parser function. Storing the value as simple strings would result in needing to put your fix everywhere.
> The trouble I have with this approach (which, conceptually, I agree with) is that it's damned hard to do anything with the parse results.
You're right - it is damn hard, but that is on purpose; if you're doing something with the email that boils down to "treat it like a `char *`" then the potential for error is large.
If you're forced to add in a new use-case to the `email_t` interface then you have reduced the space of potential errors.
For example:
> Want to print that email_t? Then you're right back to char*, unless you somehow write your own I/O system that knows about your opaque conventions.
is a bug waiting to surface, because it's an email, not a string, and if you decide to print an email that was read as a `char *` you might not get what you expect.
It's all a trade-off - if you want more flexibility with the value stored in a variable, then sure, you can have it but it comes at a cost: some code somewhere almost certainly will eventually use that flexibility to mismatch the type!
If you want to prevent type mismatches, then a lot of flexibility goes out the window.
Linguistic nit: deserialize from a string, serialize to a string
“Serialization” is the act of taking an internal data structure (of whatever shape and depth) and outputting it for transmission or storage. The opposite is “deserialization,” restoring the original shape and depth.
email_t doesn't have to be opaque; if it's just a visible wrapper around char* then you can still do everything with it as a char* (that is, everything you do with strings).
The benefit is to avoid treating char*s as email_t, not avoiding treating email_t as char*.
(Using a thin wrapper like this to add safety is called the newtype pattern, if anyone wants to know.)
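A minimal sketch of the transparent-wrapper version in C (names illustrative): the `char*` stays reachable, so printing is trivial, while the mistake it rules out is passing a bare string where an email is demanded.

```c
#include <stdio.h>

/* Transparent newtype: the char* is visible, so email_t can still be
 * used as a string wherever that's genuinely needed. */
typedef struct { const char *s; } email_t;

static void send_mail(email_t to) { printf("to: %s\n", to.s); }

int main(void) {
    email_t e = { "user@example.com" };
    send_mail(e);          /* fine */
    /* send_mail("oops");     compile error: a char* is not an email_t */
    return 0;
}
```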
I was curious how this would look in C, and I found this article[1] showing how, apparently with very little overhead.
And as I just saw, Python 3.10 also introduced a NewType[2] wrapper. I'll have to see how that feels to handle.
1: https://blog.nelhage.com/2010/10/using-haskells-newtype-in-c...
2: https://typing.python.org/en/latest/spec/aliases.html#newtyp...
Python’s NewType is, confusingly, a very different thing: it’s a compile-time-only subtype of the original, rather than a Haskell-style newtype (which is an entirely separate type from its source).
I've re-read the article again since getting a bunch of up and down votes across the comment section, and I think you've chosen a better name for this article than PdV. It really is just about using newtype wrappers.
In the example code they explicitly put the struct in the .c file so the char* is not available.
If you're suggesting getting around this by casting an email_t* to char* then I wish you good luck on your adventures. There's some times you gotta do stuff like that but this ain't it.
You could probably get away with the typecast if you satisfy the "common struct prefix" requirement, but that's nowhere near necessary.
While the article does hide the internal char*, that's not strictly necessary to get the benefit of "parse, don't validate". Hide implementation details sure, but not everything is an implementation detail.
The main benefit for me with this approach is that the boundaries are not transparent anymore. That content printing is such a boundary. Your data is about to exit through there and you're summoned to handle that. The inconvenience that comes with it is like any other when security enters the picture. The same goes for the data management responsibilities - who handles what, for how long, and with whom. Without data type distinctions everything is (more or less) common, with vague or broadly defined ownership.
> Here, I’ll build on that by showing how this technique can be used outside of niche academic languages by demonstrating it in a language that is as practical as it is dangerous - C.
The "practical" part really bugged me because the entire post is trying to explain exactly why it is not.
The only way to make C reasonably safe is to encode information via the newtype pattern. Wrap `char *` inside a struct that has a proper name, and include the size in there as well.
Basically, there should be ZERO pointers except at creation and consumption by outside libraries (open, write, etc)
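A hedged sketch of that in C (the `acme_` prefix and the cap are made up, nodding to the _t discussion upthread): the struct carries its size with it, and the constructor rejects oversized input instead of silently truncating.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define ACME_NAME_CAP 255

/* Size travels with the bytes; no raw char* crosses the API boundary. */
typedef struct {
    char   data[ACME_NAME_CAP + 1];
    size_t len;
} acme_name_t;

static bool acme_name_init(acme_name_t *n, const char *src) {
    size_t len = strlen(src);
    if (len > ACME_NAME_CAP) return false;   /* reject, don't truncate */
    memcpy(n->data, src, len + 1);
    n->len = len;
    return true;
}
```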
Such a recognisable name, greetz after many years :D You might remember me from uni as lycium, a few lifetimes ago. Going to check the rest of your blog now...
I don't trust C developers to parse strings.
This pattern seems to be shown almost as a comprehensive security solution when it's really just one layer of defense. "Parse, don't validate" has to be combined with resource limits and other protective measures during the parsing phase itself.
Validating twice is safe. It will not error the second time.
It is against the rules to call someone dumb on this server.
Comments, however …
;-)
This stuck out:
An email_t is not a parse error, and a parse error is not one of the emails, so this shouldn't compile (and I don't take 'pseudocode' as an excuse).

> and I don't take 'pseudocode' as an excuse
Weird hill to die on, since neither email_t nor PARSE_ERROR were defined in the sample snippets. How do you know PARSE_ERROR is not email_t?
It's the parse-versus-validate hill in this case.
This pseudocode is "Validate" for at least 3 reasons:
Forgetting to check:
Repeatable/redundant checks: Not using the type system: > How do you know PARSE_ERROR is not email_tIt has to be for it to compile, right? Which means that email_t is the type which represents both valid and invalid emails. How do you know if it's valid? You remember to write a check for it. Why not just save yourself some keystrokes and use char* instead. This is validate, not parse.
> It has to be for it to compile, right? Which means that email_t is the type which represents both valid and invalid emails. How do you know if it's valid? You remember to write a check for it. Why not just save yourself some keystrokes and use char* instead. This is validate, not parse.
I feel this kind of fundamentalism is letting the perfect be the enemy of the good.
Every C programmer is already doing it the 'good' way (validation), so this article doesn't really add anything.
The only fundamentalism involved in PdV is: if you have an email, it's actually an email. It's not arbitrary data that may or may not be an email.
Maybe you want your emailing methods to accept both emails and not-emails in your code base. Then it's up to each method to validate it before working on it. That is precisely what PdV warns against.
You don't think there's a degree of difference between (valid email_t or null) and (valid char pointer or invalid char pointer)?
There's a huge difference. One is an email_t to validate and one is a char* to validate.
> As established, head is partial because there is no element to return if the list is empty: we’ve made a promise we cannot possibly fulfill. Fortunately, there’s an easy solution to that dilemma: we can weaken our promise. Since we cannot guarantee the caller an element of the list, we’ll have to practice a little expectation management: we’ll do our best to return an element if we can, but we reserve the right to return nothing at all. In Haskell, we express this possibility using the Maybe type

^ Weaken the post-condition. In some contexts null might be close enough for Maybe. But is Maybe itself even good enough?

> Returning Maybe is undoubtedly convenient when we’re implementing head. However, it becomes significantly less convenient when we want to actually use it! Since head always has the potential to return Nothing, the burden falls upon its callers to handle that possibility, and sometimes that passing of the buck can be incredibly frustrating.

This is where the article falls short. It might be good (the enemy of perfect), but it ain't PdV.

Because an error is not an email?
By that logic, a float couldn't store NaN.
Correct. You'll never see a raw float on the LHS of a PDV expression.
Who puts their personal data vault on the left hand side of a raw float?
> and I don't take 'pseudocode' as an excuse
They write the non-pseudo variant later. There, the return value is a pointer and the check is against NULL. Which is fairly standard for C code, albeit not always desirable.
Correct, it is fairly standard C code. It is not Parse, Don't Validate.
I'm with you, don't do crap like that. Always return a valid object.
This is validate.
You made an email-or-error type and named it email_t and then manually checked it.
PDV returns a non-error email type from the check method.
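To make the contrast concrete, a signature-level sketch (hypothetical names, not the article's code): in the parse version, failure lives in the NULL, not inside the success type.

```c
#include <stdbool.h>

typedef struct email email_t;   /* opaque; illustrative */

/* Validate: a bool the caller may forget to check (or check twice). */
bool email_is_valid(const char *s);

/* Parse: a non-NULL email_t* is an email, full stop. There is nothing
 * left to re-check downstream. */
email_t *email_parse(const char *s);   /* NULL on failure */
```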
I don't understand; what is your suggested solution?
I'm not smart enough to suggest a fix here, I'm just pointing out that this article is not the PdV from the well known article.
But I can spot when code is doing exactly what the cited article says not to do.
This line is the "validate" in the expression "parse, don't validate":
You might like it, but that's not my business. Maybe this C article should have been "parse, then validate".

You'd be better off reading the original: https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...
parseEmail() should either return a valid email, or not return at all; whether that means panic, exit, or jump to an error handler... is left to the implementer.