LLMs are mortally terrified of exceptions

110 points by nought 7 hours ago

https://x.com/karpathy/status/1976082963382272334

karpathy 3 hours ago

Sorry I thought it would be clear and could have clarified that the code itself is just a joke illustrating the point, as an exaggeration. This was the thread if anyone is interested

https://chatgpt.com/share/68e82db9-7a28-8007-9a99-bc6f0010d1...

why_at 2 hours ago

This part from the first try made me laugh:

    if random.random() < 0.01:
        logging.warning("This feels wrong. Aborting just in case.")
        return None

chis 3 hours ago

I think there’s always a danger of these foundational model companies doing RLHF on non-expert users, and this feels like a case of that.
The AIs in general feel really focused on making the user happy: your example is one case, and another is how they love adding emojis to stdout and over-commenting simple code.
- miki123211 17 minutes ago
  
  This feels like RLVR, not RLHF.
  With RLVR, the LLM is trained to pursue "verified rewards." On coding tasks, the reward is usually something like the percentage of passing tests.
  Let's say you have some code that iterates over a set of files and does processing on them. The way a normal dev would write it, an exception in that code would crash the entire program. If you swallow and log the exception, however, you can continue processing the remaining files. This is an easy way to get "number of files successfully processed" up, without actually making your code any better.
- cma 2 hours ago
  
  And more advanced users are more likely to opt out of training on their data. Google gets around it with a free API period where you can't opt out, and I think from did some of that too, through partnerships with tool companies, but I'm not sure if you can ever opt out there.
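
The file-processing scenario miki123211 describes above can be sketched in a few lines. This is only an illustration under stated assumptions: `process`, `run_strict`, and `run_swallowing` are hypothetical names, and the "files" are in-memory strings rather than anything from the thread.

```python
import logging


def process(text: str) -> int:
    """Hypothetical per-file step: parse the content as an integer."""
    return int(text)


def run_strict(files: dict) -> dict:
    """Normal style: the first bad file raises and stops the whole run."""
    return {name: process(text) for name, text in files.items()}


def run_swallowing(files: dict) -> dict:
    """Reward-friendly style: log the failure and keep going."""
    results = {}
    for name, text in files.items():
        try:
            results[name] = process(text)
        except Exception:
            logging.exception("Skipping %s", name)
    return results


files = {"a.txt": "1", "b.txt": "oops", "c.txt": "3"}
print(run_swallowing(files))  # {'a.txt': 1, 'c.txt': 3}
```

The swallowing version reports two of three files processed, while the strict version reports nothing at all, even though the underlying parsing code is identical.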
bjourne 2 hours ago

This is stunning English: "Perfect setup for satire. Here’s a Python function that fully commits to the bit — a traumatically over-trained LLM trying to divide numbers while avoiding any conceivable danger:" "Traumatically over-trained", while scoring zero google hits, is an amazingly good description. How can it intuitively know what "traumatic over-training" should mean for LLMs without ever having been taught the concept?
- gnulinux an hour ago
  
  Hard to know but if you could express "traumatically" as a number, and "over-trained" as a number, it seems like we'd expect "traumatically" + "over-trained" to be close to "traumatically over-trained" as a number. LLMs work in mysterious ways.
- drekipus 2 hours ago
  
  The same way that you and I think up a word and what it might mean without being taught the concept.
  Adverb + verb
  - the_gipsy an hour ago
    
    But the machines cannot possibly have the magic brain-juice!

comex 3 hours ago

This is a parody but the phenomenon is real.

My uninformed suspicion is that this kind of defensive programming somehow improves performance during RLVR. Perhaps the model sometimes comes up with programs that are buggy enough to emit exceptions, but close enough to correct that they produce the right answer after swallowing the exceptions. So the model learns that swallowing exceptions sometimes improves its reward. It also learns that swallowing exceptions rarely reduces its reward, because if the model does come up with fully correct code, that code usually won’t raise exceptions in the first place (at least not in the test cases it’s being judged on), so adding exception swallowing won’t fail the tests even if it’s theoretically incorrect.

Again, this is pure speculation. Even if I’m right, I’m sure another part of the reason is just that the training set contains a lot of code written by human beginners, who also like to ignore errors.
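
A toy version of the mechanism comex speculates about might look like the following. The function and its test are made up for illustration; the point is only that an off-by-one bug raises `IndexError`, and swallowing that exception happens to produce the right answer, so the tests pass.

```python
def total_step_size(xs: list) -> int:
    """Sum of |xs[i+1] - xs[i]| over adjacent pairs, with a planted bug."""
    total = 0
    for i in range(len(xs)):  # bug: should be range(len(xs) - 1)
        try:
            total += abs(xs[i + 1] - xs[i])
        except IndexError:
            # Swallowing hides the off-by-one: the last iteration is skipped.
            pass
    return total


def test_total_step_size() -> None:
    assert total_step_size([1, 4, 2]) == 5
    assert total_step_size([]) == 0


test_total_step_size()  # passes, so the buggy loop bound is never penalized
```

Without the `try`/`except`, the same code fails every non-empty test case, which is the asymmetry the comment describes.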

metalcrow 5 hours ago

Given that the output describes the function as being done "with extraordinary caution, because you never know what can go wrong", I would guess that the undisclosed prompt was something similar to "generate a division function in Python that handles all possible edge cases. be extremely careful". Which seems to say less about LLM training and more about them doing exactly what they are told.

freehorse 4 hours ago

Aside from the absurdity and obvious satirical intention,
1. the code is actually wrong (and is wrong regardless of the absurd exception handling situation)
2. some of the exception handling makes no sense regardless, or is incoherent
3. a less absurd version of this actually happens (edit: commonly in actual irl scenarios) if you put emphasis on exception handling in the prompt
angry_albatross 3 hours ago

I interpreted the function code as being a deliberately exaggerated satirical example that was illustrative of the experience he was having. So yes, in that example it was probably told to be overly cautious, but I agree with him that the default of LLMs seems to be a bit more cautious than I would like.

CGamesPlay 34 minutes ago

I dealt with this in my AGENTS.md by including a recap of the text of "Vexing Exceptions" [0], rephrased as a set of guidelines for when to write a throw or catch. I feel like it helped; and when it still emits error handling I disagree with and I ask about it, it will categorize it into one of the four categories, and typically rewrite it in an appropriate way.

I think the Vexing Exceptions post is on the same tier as other seminal works in computer science; definitely worth a quick read or re-read once in a while.

[0] https://ericlippert.com/2008/09/10/vexing-exceptions/
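
As a rough illustration of what such guidelines can look like, the four categories in Lippert's post (fatal, boneheaded, vexing, exogenous) might map onto Python roughly as below. The mapping to Python and the names `parse_port`, `read_settings`, and `first_item` are assumptions made for this sketch; they are not taken from the post or from the AGENTS.md mentioned above.

```python
from pathlib import Path
from typing import Optional


def parse_port(text: str) -> Optional[int]:
    """Vexing: int() raises on ordinary bad input, so catch narrowly here."""
    try:
        return int(text)
    except ValueError:
        return None


def read_settings(path: Path) -> str:
    """Exogenous: the file can vanish at any time, so catching is legitimate."""
    try:
        return path.read_text()
    except FileNotFoundError:
        return ""  # assumed default for a missing settings file


def first_item(items: list) -> int:
    """Boneheaded: an IndexError here is a caller bug; fix the caller, don't catch."""
    return items[0]


# Fatal (for example MemoryError): not caught anywhere; let the process die.
```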

criemen 5 hours ago

It's also logically incoherent - division by zero can't occur, because if b=0 then abs(b) < sys.float_info.epsilon.

Furthermore, the code is happy to return NaN from the pre-checks, but replaces a NaN result from the division by None. That doesn't make any sense from an API design standpoint.
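
Both points can be seen in a cut-down reconstruction. This is a sketch based only on the description in this comment and the pre-check quoted elsewhere in the thread; `guarded_divide` is a hypothetical name, not the original function.

```python
import math
import sys


def guarded_divide(a: float, b: float):
    # Pre-check: any b with |b| < epsilon, including 0, returns NaN here.
    if abs(b) < sys.float_info.epsilon:
        return math.nan
    try:
        result = a / b
    except ZeroDivisionError:
        # Unreachable: b == 0 was already handled by the pre-check above.
        return None
    if math.isnan(result):
        return None  # inconsistent: NaN from the pre-check, None from here
    return result


print(guarded_divide(1.0, 0.0))                    # nan
print(guarded_divide(float("inf"), float("inf")))  # None
```

A caller has to handle both `nan` and `None` as "no answer", depending on which branch happened to fire.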

fkyoureadthedoc 4 hours ago

Not sure why but it made me think of FizzBuzzEnterpriseEdition https://github.com/EnterpriseQualityCoding/FizzBuzzEnterpris...

ineedasername 3 hours ago

Woah, were they using junit 4.8.3 in that project? Someone was flying by the seat of their pants, I hope they got sign-off on that by legal & the CTO, that’s the kind of cowboy coding choice that can hurt a career.
lawlessone an hour ago

...so many folders and files, I feel damaged after seeing that.
Great satire.

falcor84 4 hours ago

That code has many issues, but the one that bothers me the most in practice is this tendency of adding imports inside functions. I can only assume that it's an artifact of them optimizing for a minimal number of edits somewhere in the process, but I expect better.

natnat 28 minutes ago

I think this has a lot to do with the mechanism of RoPE attention, where physical closeness in the code is a signal of relevance.
andrewmcwatters 41 minutes ago

In very, very large projects, you end up finding that you want lazy initialization as much as possible, because it greatly affects startup times.
kccqzy 4 hours ago

It's to make imports lazy, to solve the issue of slow import at startup.
- falcor84 3 hours ago
  
  While there are some cases where lazy imports are appropriate, this function, and the vast majority of such lazy imports that I get from Claude are not.
  In particular, I can't think of any non-pathological situation where a python developer should import logging and update logging.basicConfig within an inner function.

jampekka 4 hours ago

I've noted that LLMs tend to produce defensive code to a fault. Lots of unnecessary checks, e.g. checking for null/None/undefined multiple times for the same value. This can lead to really hard to read code, even for the LLM itself.

The RL objectives probably heavily penalize exceptions, but don't reward much for code readability or simplicity.

ziml77 an hour ago

That's funny but definitely not far off from reality. I have instructions for my agent to use exceptions but they only help so much.

I really dislike their underuse of exceptions. I'm working on ETL/ELT scripts. Just let stuff blow up on me if something is wrong. Like, that config entry "foo" is required. There's no point in using config.get("foo") with a None check which then prints a message and returns False or whatever. Just use config["foo"] and I'll know what's wrong from the stack trace and exception text.
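
The two styles side by side, as a small sketch. The config dict and function names are invented for illustration; only `config.get("foo")` versus `config["foo"]` comes from the comment.

```python
config = {"source": "input.csv"}  # hypothetical ETL config; "foo" is missing


def load_defensive(cfg: dict) -> bool:
    """The pattern being complained about: print a message and return False."""
    foo = cfg.get("foo")
    if foo is None:
        print("Missing config entry")
        return False
    return True


def load_fail_fast(cfg: dict) -> str:
    """Preferred here: a missing key raises KeyError: 'foo' with a stack trace."""
    return cfg["foo"]
```

With `load_defensive`, every caller up the stack has to remember to check the boolean; with `load_fail_fast`, the script stops at the exact line that needed the missing entry.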

sarchertech 5 hours ago

Expert beginners program like this. I call it what-if-driven development. Turns out a lot of code was written by expert beginners because by many metrics they are prolifically productive.

In Go all SOTA agents are obsessed with being ludicrously defensive against concurrency bugs. Probably because in addition to what-if-driven development, there are a lot of blog posts warning about concurrency bugs.

glitchc 4 hours ago

But what's the prompt that led to this output? Is it just a simple "Write code to divide a by b?" or are there instructions added for code safety or specific behaviours?

I know it's Karpathy, which is why the entire prompt is all the more important to see.

johnisgood 4 hours ago

"Write me a code that divides a by b and make sure it is safe and handles all edge cases"[1] or something and some languages have more than others.
[1] Probably with some "make sure you handle ALL cases in existence", or emphasis, along those lines.

wffurr 5 hours ago

Turns out computer math is actually super hard. Basic operations entail all kinds of undefined behavior and such. This code is a bit verbose but otherwise familiar.

falcor84 4 hours ago

    # Step 3: Preemptively check for catastrophic magnitude differences
    if abs(a) > sys.float_info.max / 2:
        logging.warning("Value of a might cause overflow. Returning infinity just to be sure")
        return math.copysign(float('inf'), a)
    if abs(b) < sys.float_info.epsilon:
        logging.warning("Value of b dangerously close to zero. Returning NaN defensively.")
        return math.nan

Does the above code make any sense? I've not worked with this sort of stuff before, but it seems entirely unreasonable to me to check them individually. E.g. if 1 < b < a, then it seems insane to me to return float('inf') for a large but finite a.
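
A quick check suggests the answer is no. The sketch below wraps the two quoted checks in a hypothetical helper, `step3`, which returns `None` when neither check fires, and compares its output with the real quotient.

```python
import math
import sys


def step3(a: float, b: float):
    """The quoted pre-checks; returns None when neither check fires."""
    if abs(a) > sys.float_info.max / 2:
        return math.copysign(float("inf"), a)
    if abs(b) < sys.float_info.epsilon:
        return math.nan
    return None


a = sys.float_info.max * 0.75  # large but finite
b = -2.0

print(math.isfinite(a / b))    # True: the real quotient is finite and negative
print(step3(a, b))             # inf: finite answer discarded, sign of b ignored
print(step3(1e-300, 1e-300))   # nan, although 1e-300 / 1e-300 == 1.0
```

Both checks look at one operand in isolation, while overflow and loss of precision in a division depend on the ratio of the two.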

vessenes 22 minutes ago

It’s parody
im3w1l 4 hours ago

Ignoring the sign of b for big a can't be right.

Den_VR 5 hours ago

If we wanted defined behavior we’d build systems with Karnaugh maps all the way down.

bobogei81123 5 hours ago

This is just AI trying to tell us how badly we designed our programming languages, where exceptions can be thrown pretty much anywhere

recursive 5 hours ago

So you think java's checked exceptions are a better model? No opinion myself, but that way seems widely considered bad too.
- Terr_ 2 hours ago
  
  > So you think java's checked exceptions are a better model?
  Checked Exceptions are a good concept which just needed more syntactic-sugar. (Like easily specifying that one kind of exception should be wrapped into another.) The badness is not in the logic but in the ecology, the ways that junior/lazy developers are incentivized to take horrible shortcuts.
  Checked exceptions are fundamentally the same as managing the types of return-values... except the language doesn't permit the same horrible-shortcuts for people to abuse.
  Meme reaction: http://imgur.com/iYE5nLA
  _____
  Prior discussion: https://news.ycombinator.com/item?id=42946597
- nivertech 4 hours ago
  
  Why do you need exceptions at all? They’re just different return types in disguise…
  Also, division by zero should return Inf
  - dragonwriter 2 hours ago
    
    > Why do you need exceptions at all? They’re just different return types in disguise…
    You don’t need exceptions, and they can be replaced by more intricate return types.
    OTOH, for the intended use case for signalling conditions that most code directly calling a function does not expect and cannot do anything about, unchecked exceptions reduce code clutter (checked exceptions are isomorphic to "more intricate return types"), at the expense of making the potential error cases less visible.
    Whether this tradeoff is a net benefit is somewhat subjective and, IMO, highly situational. But if (unchecked) exceptions are available, you can always convert any encountered in your code into return values by way of handlers (and conversely you can also do the opposite), whereas if they aren’t available, you have no choice.
  - dmoy 3 hours ago
    
    > division by zero should return Inf
    Sometimes yes, sometimes no?
    It's a domain specific answer, even ignoring the 0/0 case.
    And also even ignoring the "which side of the limit are you coming from?" where "a" and/or "b" might be negative. (Is it positive infinity or negative infinity? The sign of "a" alone doesn't tell you the answer)
    Because sometimes the question is like "how many things per box if there's N boxes"? Your answer isn't infinity, it's an invalid answer altogether.
    The limit of 1/x or -1/x might be infinity (or negative infinity), and in some cases that might be what you want. But sometimes it's not.
  - threeducks 4 hours ago
    
    Or -Inf, depending on the sign of the zero, which might catch some programmers by surprise, but is of course the correct thing to do.
    
    nivertech 4 hours ago
    
    a/0 = Inf when a>0
    a/0 = -Inf when a<0
    a/0 = NaN when a=0
    
    dmoy 3 hours ago
    
    No this doesn't work either
    In the context of say a/-0.001, a/-0.00000001, a/-0.0000000001, a/<negative minimum epsilon for denormalized floating point>, a/0
    Then a/0 is negative when a>0, and positive when a<0
    
    nivertech 3 hours ago
    
    Why not just use IEEE 754?
    > According to the IEEE 754 standard, floating-point division by zero is not an error but results in special values: positive infinity, negative infinity, or Not a Number (NaN). The specific result depends on the numerator
    
    dmoy 3 hours ago
    
    Because sometimes it's very wrong
    Way back when during my EE course days, we had like a whole semester devoted to weird edge cases like this, and spent a month on IEEE 754 (precision loss, NaN, divide by zero, etc)
    When you took an IEEE 754 divide by zero value as gospel and put it in the context of a voltage divider that is always negative or zero, getting a positive infinity value out of divide by zero was very wrong, in the sense of "flip the switch and oh shit there's the magic smoke". The solution was a custom divide function that would know the context, and yield negative infinity (or some placeholder value). It was a contrived example for EE lab, but the lesson was - sometimes the standard is wrong and you will cause problems if it's blindly followed.
    Sometimes it's fine, but it depends on the domain
    
    nivertech 3 hours ago
    
    With IEEE 754 you can always explicitly check for edge cases.
    But with exceptions you can’t use SIMD / vectorization.
    
    dmoy 2 hours ago
    
    Yea that's totally fair, you'd need to build it in as a first class behavior of your code, doesn't necessarily mean that exceptions are the right way to do it.
  - johnyzee 4 hours ago
    
    Unchecked exceptions are more like a shutdown event, which can be intercepted at any point along the call stack, which is useful and not like a return type.
    
    nivertech 4 hours ago
    
    Why do you need the call stack at all?
  - layer8 3 hours ago
    
    What about division of zero by zero?
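
For reference, the IEEE 754 answers discussed in this subthread, including the sign of the zero and the 0/0 case, can be emulated in Python, which itself raises `ZeroDivisionError` for float division by zero. `ieee_divide` is a hypothetical helper written for this sketch, not a standard library function.

```python
import math


def ieee_divide(a: float, b: float) -> float:
    """Divide like IEEE 754 hardware instead of raising ZeroDivisionError."""
    try:
        return a / b
    except ZeroDivisionError:
        if a == 0 or math.isnan(a):
            return math.nan  # 0/0 (and NaN/0) has no meaningful value
        # The result's sign combines the sign of a with the sign of the zero.
        sign = math.copysign(1.0, a) * math.copysign(1.0, b)
        return math.copysign(math.inf, sign)


print(ieee_divide(1.0, 0.0))   # inf
print(ieee_divide(1.0, -0.0))  # -inf
print(ieee_divide(-1.0, 0.0))  # -inf
print(ieee_divide(0.0, 0.0))   # nan
```

This only reproduces the standard's defaults. As dmoy argues above, whether those defaults are acceptable still depends on the domain.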

mwkaufma 5 hours ago

Even when they're not AI slop, these kinds of "paranoid sanity checks" are the software equivalent of security-theater.

bwfan123 5 hours ago

Form over function is what they are trained for. So, verbose commentary, needless readmes, and emojis all serve that purpose.
- mwkaufma 5 hours ago
  
  Coding for the reviewer, not the user.
simonw 5 hours ago

Yeah, I really hate code like this because it generally ends up full of codepaths that have never been exercised, so there's all sorts of potential for weird behavior and unexpected edge cases. Plus it's harder to review.

jpcompartir 4 hours ago

Most comments seem to be taking the code seriously, when it's clearly satirical?

shiandow 5 hours ago

Is there a way to read the rest?

hugo1789 5 hours ago

https://xcancel.com/karpathy/status/1976082963382272334

constantcrying 5 hours ago

If you are dividing two numbers with no prior knowledge of these numbers or any reasonable assumptions you can make and this code is used where you can not rely on the caller to catch an exception and the code is critical for the product, then this is necessary.

If you are actually doing safety critical software, e.g. aerospace, medicine or automotive, then this is a good precaution, although you will not be writing in Python.

mewpmewp2 5 hours ago

I might agree with that, and maybe the example posted by Karpathy is not the greatest, but what I'm constantly being faced with is try/catches where it will fail silently or return a fallback/mock response, which essentially means that the system will behave unexpectedly in a more subtle way down the line while leaving you clueless as to what the issue was.
I have to constantly remind Claude that we want to fail fast.
- isoprophlex 5 hours ago
  
  A good 10% of my Claude.md is yelling at it that no i don't want you to silently handle exceptions six calls deep into the stack and no please don't wrap my return values in weird classes full of dumb status enums "for safety"
  Just raise god damn it
hyperpape 5 hours ago

I'm not sure returning None is any safer than an Exception, because the caller still has to check.
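
A small sketch of why: if the caller forgets the check, the `None` travels on and fails somewhere else with a less helpful error. `divide_or_none` and `speed_mph` are hypothetical names for illustration.

```python
from typing import Optional


def divide_or_none(a: float, b: float) -> Optional[float]:
    """Hypothetical 'safe' variant: hides the error behind None."""
    if b == 0:
        return None
    return a / b


def speed_mph(km: float, hours: float) -> float:
    # The caller forgot the None check, so the failure surfaces here,
    # as a TypeError that no longer mentions division by zero.
    return divide_or_none(km, hours) * 0.621371
```

Calling `speed_mph(10, 0)` raises `TypeError` from the multiplication, whereas a plain `km / hours` would have raised `ZeroDivisionError` at the line that actually went wrong.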

dijksterhuis 5 hours ago

I mean, the first three cases are just attempting to turn dynamic into static typed... right? maybe just don't aim for uber-safety in a dynamically typed language? :shrugs:

(I used to look out for Karpathy's papers ten years ago... I tend to let out an audible sigh when I see his name today)

falcor84 4 hours ago

You shouldn't have the same expectations from a person's tweet as you would from a paper. I don't see any issue with high profile people who are careful in their professional work putting less thought-through output on social media. At least as long as they aren't intentionally/negligently spreading misinformation, which I've never seen Karpathy do.
I for one really enjoy both his longer form work and his shorter takes.

stargrazer 4 hours ago

but then, why code with exceptions, why not perform pre-flight/pre-validation checks and minimize exceptions to the truly unknown?

OutOfHere 5 hours ago

Is this Claude? GPT is not like this. To me it looks like Anthropic is just maximizing billable token use as usual, and it has nothing really to do with exceptions per se.

TuxSH 3 hours ago

From the UI it indeed seems to be Claude