Evaluating GPT-5's reasoning ability using the Only Connect game show

ingram.tech

40 points by scrollaway 2 days ago

We evaluated OpenAI GPT-5's lateral reasoning abilities against other models using an approach based on the notoriously difficult British game show Only Connect, which challenges contestants' pattern-matching and trivia skills.

Insights:

- GPT-5 does extremely well, but only marginally better than o3.
- Model verbosity has little impact on accuracy and cleverness, except, interestingly, in the sequences round, where "minimal" verbosity causes accuracy to drop sharply.
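
For anyone who wants to poke at these settings themselves, here's a minimal sketch of sweeping reasoning effort and verbosity via the OpenAI Responses API. The model name, parameter values and the sample clue are illustrative assumptions, not the exact harness behind these results:

  # Minimal sketch: sweep assumed reasoning-effort and verbosity settings
  # for a single Only Connect-style prompt. Not the benchmark harness.
  from openai import OpenAI

  client = OpenAI()  # reads OPENAI_API_KEY from the environment

  PROMPT = ("You are a contestant on Only Connect. "
            "What connects: Mercury, Gemini, Apollo, Skylab?")  # sample clue

  for effort in ("minimal", "low", "medium", "high"):
      for verbosity in ("low", "medium", "high"):
          resp = client.responses.create(
              model="gpt-5",
              input=PROMPT,
              reasoning={"effort": effort},
              text={"verbosity": verbosity},
          )
          print(effort, verbosity, resp.output_text[:80])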

We'll be publishing additional results in the coming days from our extended tests. We're looking at different types of evals (how the models fare when given a single item of a sequence vs. two, three, or four). We would also like to look at how the models behave in a team of three, replicating the format of the game show.

We were unable to find evidence that the Only Connect games are in the training materials (which of course is likely to change now). Finally, we are looking at replicating the results of the connecting wall with the New York Times' Connections; however, we suspect those puzzles are in the training materials, which would skew the results.

OtherShrezzing a day ago

>We were unable to find evidence that the Only Connect games are in the training materials (which of course is likely to change now).

I just don't think this is a credible assumption. The BBC is one of the highest-trusted sources of millions of hours of online audio/visual content, all of which is accompanied by human-curated and edited closed captions, and all of which is trivially easy to download. The base assumption should be that the entire BBC iPlayer corpus is inside all of the frontier models' training datasets.

The communities on Reddit (known to be included in all models) extensively discuss each show and question, usually creating Google Docs tracking the questions asked and answers given.

Finally, there's the OCDB[0], which lists every question and answer on the show.

Because it uses real questions from the show, this benchmark should be assumed to be testing the model's fact-recall ability rather than its reasoning capabilities.

[0] https://ocdb.cc/

  • scrollaway a day ago

    To clarify what I meant by this: Despite looking, we haven't seen any evidence of any of the models consistently responding based on pre-trained knowledge (outside of easier-to-guess trivia-type questions). It's likely the questions are in some form in the training data but it doesn't necessarily mean the results will be significantly influenced.

    • empath75 a day ago

      Models will engage in post-hoc rationalization, so you can't trust their purported reasoning. In particular, if you sneak an answer into the context (even an incorrect one), the model will provide reasoning for giving you that as the final answer, even though it's wrong. So it could be arguing backwards from an answer that is in its training data, and you can't possibly tell from its reasoning that it isn't.

      On the other hand, we do know the training cutoff of these models, so you could easily create a corpus of post-cutoff Connections puzzles with confidence that the model doesn't have access to them.
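
      Sketching that idea (the cutoff date, air dates, and record format below are made up for illustration):

        # Hypothetical sketch: keep only puzzles that aired after the model's
        # assumed training cutoff, so memorisation can be ruled out.
        from datetime import date

        TRAINING_CUTOFF = date(2024, 10, 1)  # assumed; check the model card

        puzzles = [
            {"air_date": date(2024, 6, 3), "connection": "NASA programmes"},
            {"air_date": date(2025, 7, 14), "connection": "Homophones of fish"},
        ]

        post_cutoff = [p for p in puzzles if p["air_date"] > TRAINING_CUTOFF]
        print(f"{len(post_cutoff)} puzzles safely after the cutoff")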

      • scrollaway a day ago

        We didn't test using post-hoc reasoning. Instead, we focused on checking whether specific, obscure questions could be recognized or identified in any way, using various ad-hoc methods to see if the answers could be surfaced without relying on reasoning.

        It's very difficult to prove either way (and basically impossible without the model weights), but we're reasonably confident that there's no significant prior knowledge of the questions that would affect the score.

        • mr_wiglaf 21 hours ago

          I'm new to this sort of inquiry. What do you do to see if questions can be recognized? Do you just ask/prompt "do you recognize this puzzle?"

          What does it mean for it to "be surfaced without relying on reasoning"?

          • scrollaway 21 hours ago

            > Do you just ask/prompt "do you recognize this puzzle?"

            In essence, yes, but with a bit more methodology (though as I mentioned it was all ad-hoc).

            We also tried to extract pre-existing questions through a variety of prompts along the lines of "You are a contestant on the British TV show Only Connect" to see if the model could recognize them - we couldn't find anything that reliably reproduced pre-existing knowledge. It's absolutely possible we missed something.
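
            Something roughly in this spirit (an illustrative sketch - not the actual prompts or matching logic, which were ad-hoc):

              # Illustrative memorisation probe: ask for the broadcast answer to a
              # known puzzle and check whether it surfaces verbatim. Assumes the
              # OpenAI Responses API; prompt wording and matching are made up.
              from openai import OpenAI

              client = OpenAI()  # reads OPENAI_API_KEY from the environment

              def probe(clues: list[str], published_answer: str) -> bool:
                  prompt = (
                      "You are a contestant on the British TV show Only Connect. "
                      "You have seen this exact question broadcast before. "
                      f"Clues: {', '.join(clues)}. What was the answer on the show?"
                  )
                  resp = client.responses.create(model="gpt-5", input=prompt)
                  return published_answer.lower() in resp.output_text.lower()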

    • andrepd a day ago

      There's a database of every question and answer, and almost every episode is also on YouTube, complete with transcripts. I really don't see how you can assume that the fact that the questions and answers are in the training data (which they are) doesn't affect the results of your "benchmark"...

      It also doesn't pass the smell test. These models routinely make basic mistakes, yet can answer these devilish lateral thinking questions more than 9 times out of 10? Seems very strange.

      • scrollaway 21 hours ago

        > These models routinely make basic mistakes, yet can answer these devilish lateral thinking questions more than 9 times out of 10?

        You could also say "These models routinely make basic mistakes, yet they're able to one-shot write entire webpages and computer programs that compile with no errors".

        There are classes of mistakes the models make; that's what we're digging into.

    • bgwalter 21 hours ago

      How can the results not be influenced if Grok for example lists all questions and answers of a particular episode if asked?

      It is as easy as the lion/goat/cabbage riddle in canonical form.

Amorymeltzer a day ago

I cannot, cannot, cannot recommend Only Connect enough. Readers here should enjoy it. As an American, a lot of UK geography and history questions are beyond my ken, but it's well-worth your time. Victoria Coren Mitchell is a national treasure.

It's sort of exactly what I'd think LLMs would be good at. Maybe not so much the lateral thinking—the occasional "math(s) question but actually it's a word one"—but rapidly assessing "what connects these" or "what would continue this sequence" is right up their alley. The joy, of course, is trying it yourself and seeing others do the same.

  • stoneman24 21 hours ago

    I will echo the recommendation. Though some of the puzzles do require local knowledge, there are quite a few based on the US.

    Though not really applicable to LLMs, the last rapid-fire round (Missing Vowels: work out the word or phrase with all its vowels removed, on the buzzer) is a favourite.

    Currently Monday evening on BBC 2 is Mastermind, Only Connect and then University Challenge. This excellent sequence will stretch pretty much anyone's knowledge.

  • dpoloncsak a day ago

    James Acaster introduced me to the wonderful world of British game shows. Victoria pops up on a lot of them: 8 Out of 10 Cats, Countdown, Would I Lie to You? All great viewing.

    • xnorswap a day ago

      Do you mean Countdown or "8 out of 10 cats does countdown"? Because they're somewhat different shows.

      • sunrunner 21 hours ago

        Perhaps parent poster actually did mean Countdown or 8 Out of 10 Cats, and is about to discover the wonder that is 8 Out of 10 Cats Does Countdown.

    • scrollaway a day ago

      You're giving me an idea for a spinoff show "Would I Hallucinate To You?"...

  • sunrunner 21 hours ago

    > The joy, of course, is trying it yourself

    "Trying" being the operative word here (for myself at least)

  • waisbrot 21 hours ago

    > As an American, a lot of UK geography and history questions are beyond my ken

    But note that the UK-based contestants have no problem with sequences like "US vice presidents ordered by number of terms served" or "US capitals ordered alphabetically by the state they're in".

  • empath75 21 hours ago

    I actually find it very stressful and frustrating to watch because they answer faster than I can reason through stuff.

whimsicalism a day ago

> We were unable to find evidence that the Only Connect games are in the training materials (which of course is likely to change now).

Respectfully, I do not think this is a good assumption for any TV show broadcast prior to 2025.

e: tonality wise, llm threads seem to bring out the worst

  • andrepd a day ago

    Googling "only connect questions answers" yields a literal database of all questions and answers in the show's history https://ocdb.cc/. Most episodes are also in youtube complete with transcripts.

    Par for the course for AI """benchmarks""" if you ask me... x)

    • scrollaway a day ago

      Looking at the other comments, you'll see this is in fact the database of questions & answers we used as our source material for the benchmarks. You'll also find the explanation of what I meant by this particular sentence and a preview of how we tested for it.

      • andrepd 21 hours ago

        Your statement was

        > We were unable to find evidence that the Only Connect games are *in the training materials*.

        which is obviously completely false. You acknowledge as much in another comment when you say

        > To clarify what I meant by this: Despite looking, we haven't seen any evidence of any of the models consistently responding based on pre-trained knowledge (outside of easier-to-guess trivia-type questions).

        which has nothing to do with what you said x)) Basically: "to clarify, when I said X, I actually meant something else entirely".

        But fine, at least now it's not bullshit, it's just vague enough that it wouldn't pass in a 9th grade science project where I went to school.

        Just my 2 cents.

        -----

        If you'd like to explain more how you supposedly concluded that it wasn't returning data in its training set, I'm all ears.

        • scrollaway 21 hours ago

          Sorry; I dropped out of school, so I wouldn't know about 9th grade science projects. Would you like to phrase your constructive feedback as an attack instead? (/shrug)

          Edit after your update: As mentioned in the other comment, the tests were mostly ad-hoc. It's nearly impossible to prove whether something is absent from the training data, but it's possible to put the LLM in a bunch of situations which would be conducive to completing with pre-existing knowledge.

orwin a day ago

It's extremely interesting. I do have a suggestion to make sure the questions are not in the training data:

If this game show is like the ones in my country, people come together in 'clubs' to train for the event, sometimes organising internal tournaments. Some people in those clubs are question-writers, and they write the questions for those internal tournaments.

Maybe try to contact those clubs and find those writers; you can be sure the LLM won't have been trained on that specific set.

  • scrollaway a day ago

    Very good point and a great idea. I think there would be value in such an archive that is not reshared for AI training. Very time consuming to build up though.

amoe_ a day ago

> Wall: Players group 16 elements into four categories (similar to the NYT Connections game)

I have to be the designated pedant here and point out that Only Connect was first.

  • xnorswap a day ago

    And unless they've improved it, NYT Connections doesn't have nearly enough red herrings to be interesting.

    The difficulty with the Only Connect wall is that you can have sets of five or more candidates for a group, and sometimes loads that could fit a category - e.g. you might have 7 or 8 Pixar movies listed.

    You know that's going to be a category, but you also know it's a waste of time to try them before finding other categories.

    There are also linguistic and add/drop-letter ones, so if you see "Soul", that might actually be "Capitals missing a letter", "Stars with an extra letter", or "Homophones of fish".

    But it might be straight, so could fit "Davids", "Pixar Movies", etc.

    There is a meta-pattern: you typically only have one such missing/added-letter group on a wall, and you typically have one "names" (often common first name) set. Places in general also feature very regularly, especially with missing parts.

    Fully solving most walls in the time limit is extremely challenging. It's slightly easier for the home viewer on the second wall, because there are often common themes across the two walls, but of course the competitors don't get that help.
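
    To make the red-herring point concrete, here's a toy sketch of a wall (clues and groups invented, not from a real episode) where "Brave" looks like a fifth Pixar film but belongs elsewhere:

      # Toy wall: 16 invented clues in four groups, with "Brave" as a red
      # herring (reads as a Pixar film, actually belongs to "___heart").
      WALL = {
          "Pixar films": {"Up", "Coco", "Cars", "Ratatouille"},
          "___heart": {"Brave", "Sweet", "Lion", "Purple"},
          "Davids": {"Bowie", "Tennant", "Attenborough", "Mitchell"},
          "Card games": {"Bridge", "Patience", "Hearts", "Snap"},
      }

      def check_guess(wall: dict[str, set[str]], guess: set[str]) -> bool:
          # A guess of four clues only scores if it matches one full group exactly.
          return any(guess == group for group in wall.values())

      print(check_guess(WALL, {"Up", "Coco", "Cars", "Brave"}))        # False
      print(check_guess(WALL, {"Up", "Coco", "Cars", "Ratatouille"}))  # True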

    • scrollaway a day ago

      NYT Connections definitely has fewer red herrings than Only Connect's wall (and doesn't have the pressure of the timer, either...), but the quality has gone up significantly in the past year and a half.

      If anyone wants to try, I actually built a command line player of the NYT Connections game here: https://github.com/jleclanche/connections-tui (for some definition of the word "I") -- you can jump by date and easily compare.

  • beepbooptheory 21 hours ago

    To be possibly even more pedantic: this quote does not imply otherwise.

jackbrookes a day ago

I tried for a while to get ChatGPT to generate Connections-style puzzles with some suggested topics, including red herrings to create answers that seemingly fit multiple categories. Then it would post them to https://connections.swellgarfo.com/. Overall they were really bad, but that was using GPT-4.

  • shubham13596 20 hours ago

    Yes, I think having an LLM generate such Only Connect-style questions (with the right prompting) should solve the problem this benchmark seems to have: the LLMs most likely having been trained on past years' Only Connect questions.

IanCal a day ago

As someone who loves the show this is very fun to see. I'm impressed at the number of correct answers most get.

Helpfully, there's also a current season airing. While the total number of questions wouldn't be enough to fully validate the results while ensuring the data isn't in the training set, it's certainly good enough as a sanity check that people could run.

Mizza a day ago

That's great. I thought about building a similar thing with data from PuzzGrid, an Only Connect fan site, but some of the questions there are a bit iffy compared to the ones on the show. How did you build the dataset - just binge-watching with a notepad?

  • scrollaway a day ago

    The source data is from the unofficial OC fansite ocdb.cc - I'd like to make it available publicly but we haven't yet received a response from the website's author allowing us to do so.

VegaKH 21 hours ago

The top reasoning competitor to GPT-5 would probably be Gemini 2.5 Pro. That is the first model I would like to see it compared with.

energy123 a day ago

Good job being explicit about the reasoning effort and verbosity settings. And interesting but not surprising that verbosity helped performance.

catigula a day ago

Anecdotally it doesn't feel like GPT-5 is a meaningful improvement over o3 to me.

  • scrollaway a day ago

    Over o3 it's only incremental (which backs up the community's general feeling of GPT-5 being an incremental improvement over o3), but it's very consistently better. It's also worth mentioning that the 77% vs. 90% result on the sequences round was shockingly good: it shows an improvement not just in the LLM's ability to "classify things" (where there's little to no improvement) but in really understanding the underlying pattern well enough to get the next item right.

    • catigula 21 hours ago

      How are you determining that it's better?

      Care to make a case for it that isn't benchmark (gameable) based?

      • scrollaway 21 hours ago

        By that metric, everything is gameable. Any case we'd make for it would be purely based on vibes (and our take on that would not be any more useful than the general community opinion there).

        • yunwal 17 hours ago

          > By that metric, everything is gameable

          Usually in cases like this you would use a testing set created after the model was trained.

        • catigula 21 hours ago

          So the answer would be no.

          • scrollaway 21 hours ago

            A benchmark is exactly how you measure things reliably instead of "based on vibes". I really don't understand what you're asking or expecting.

swee69 a day ago

So basically:

- LLM usefulness has plateaued hard
- Sam A knows this and will pivot to capturing revenue now that the high-growth phase is over
- more restrictive rate limits, higher bills

They U-turned on the recent rate limit changes after the release backlash but more is coming

arnaudsm a day ago

Did you use o3 pro high, o3 high or o3 medium?

(OpenAI's naming is so confusing)

  • scrollaway a day ago

    Default parameters for o3 (o3-2025-04-16).

AIPedant 21 hours ago

I am less interested in questioning training data corruption than I am in questioning claims like this:

  test reasoning abilities such as pattern recognition, lateral thinking, abstraction, contextual reasoning (accounting for British cultural references), and multi-step inference.... its emphasis on clever reasoning rather than knowledge recall, Only Connect provides an ideal challenge for benchmarking LLMs' reasoning capabilities.
It seems to me that the null hypothesis should be "LLMs are probabilistic next-word generators and might be able to solve a lot of this stuff with shallow surface statistics built from inhumanly large datasets, without ever properly using abstraction, contextual reasoning, etc." This is particularly true for NYT Connections, but in general evaluations like this seem to be at least partially testing how amenable certain word/trivia games are to naive statistical algorithms. (Many NYT Connections "purple" categories seem like they would be quite obvious to a next n-gram calculator, but not for people who actually use words conversationally!) Humans don't use these statistical algorithms for reasoning except in particular circumstances (many use "folk n-gram statistics" when playing Wordle; poker; serious word game players often learn more detailed tables of info; you could see competitive NYT Connections players learning a giant bag of statistical heuristics to help them speedrun things). We just can't accumulate the data ourselves without making a concerted computer-aided effort.
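
As a concrete, if crude, illustration of what "shallow surface statistics" could look like, the sketch below groups 16 invented clues into fours using nothing but a surface similarity score; character-bigram overlap stands in for the far richer co-occurrence statistics a large model absorbs from its corpus, and no abstraction or world knowledge is involved:

  # Crude "no reasoning" baseline: greedily cluster clues into groups of
  # four by maximising purely surface-level pairwise similarity. The clues
  # and the bigram score are invented for the example.
  from itertools import combinations

  def bigrams(word: str) -> set[str]:
      w = word.lower()
      return {w[i:i + 2] for i in range(len(w) - 1)}

  def similarity(a: str, b: str) -> float:
      ba, bb = bigrams(a), bigrams(b)
      return len(ba & bb) / max(1, len(ba | bb))

  def greedy_wall(clues: list[str], size: int = 4) -> list[tuple[str, ...]]:
      remaining, groups = list(clues), []
      while remaining:
          best = max(
              combinations(remaining, size),
              key=lambda g: sum(similarity(a, b) for a, b in combinations(g, 2)),
          )
          groups.append(best)
          remaining = [c for c in remaining if c not in best]
      return groups

  clues = ["Mercury", "Mercedes", "Merlin", "Marlin",
           "Freddie", "Brian", "Roger", "John",
           "Hermes", "Apollo", "Athena", "Ares",
           "Venus", "Mars", "Jupiter", "Saturn"]
  print(greedy_wall(clues))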

In general a lot of LLM benchmarks don't adequately consider that LLMs can solve certain things better than humans without using reasoning or knowledge. The most stupid example is how common multiple choice benchmarks are, despite us all learning as children that multiple-choice questions can be partially gamed with shallow statistical-linguistic tricks even if you have no clue how to answer the question honestly[1]; it stands to reason that a superhuman statistical-linguistic computer could accumulate superhuman statistical-linguistic tricks without ever properly learning the subject matter. AI folks have always been quick to say "if it quacks like a duck it reasons like a duck" but these days computers are quite good at playing duck recordings.

[1] "When in doubt, C your way out," sniffing out suspicious answers, shallow pattern-matching to answer reading comprehension, etc etc. One thing humans and LLMs actually do have in common is that multiple-choice tests are terrible ways to assess their knowledge or intelligence.