qoez a day ago

I feel like I'm the only one who isn't convinced that getting a high score on the ARC eval test means we have AGI. It's mostly about pattern matching (and for some of it, it's ambiguous even to humans what the actual true response ought to be). It's like how in humans there are lots of different 'types' of intelligence, and just overfitting on IQ tests doesn't convince me a person is actually that smart.

  • TheAceOfHearts a day ago

    Getting a high score on ARC doesn't mean we have AGI, and Chollet has always said as much AFAIK; it's meant to push the AI research space in a positive direction. Being able to solve ARC problems is probably a prerequisite to AGI. It's a directional push into the fog of war, with the claim being that we should explore that area because we expect it's relevant to building AGI.

    • lostphilosopher 20 hours ago

      We don't really have a true test that means "if we pass this test we have AGI" but we have a variety of tests (like ARC) that we believe any true AGI would be able to pass. It's a "necessary but not sufficient" situation. Also ties directly to the challenge in defining what AGI really means. You see a lot of discussions of "moving the goal posts" around AGI, but as I see it we've never had goal posts, we've just got a bunch of lines we'd expect to cross before reaching them.

      • MPSimmons 16 hours ago

        I don't think we actually even have a good definition of "This is what AGI is, and here are the stationary goal posts that, when these thresholds are met, then we will have AGI".

        If you judged human intelligence by our AI standards, then would humans even pass as Natural General Intelligence? Human intelligence tests are constantly changing, being invalidated, and rerolled as well.

        I maintain that today's modern LLMs would pass sufficiently for AGI, and are also very close to passing a Turing Test, if measured in 1950 when the test was proposed.

        • fasterik 13 hours ago

          >I don't think we actually even have a good definition of "This is what AGI is, and here are the stationary goal posts that, when these thresholds are met, then we will have AGI".

          Not only do we not have that, I don't think it's possible to have it.

          Philosophers have known about this problem for centuries. Wittgenstein recognized that most concepts don't have precise definitions but instead behave more like family resemblances. When we look at a family we recognize that they share physical characteristics, even if there's no single characteristic shared by all of them. They don't need to unanimously share hair color, skin complexion, mannerisms, etc. in order to have a family resemblance.

          Outside of a few well-defined things in logic and mathematics, concepts operate in the same way. Intelligence isn't a well-defined concept, but that doesn't mean we can't talk about different types of human intelligence, non-human animal intelligence, or machine intelligence in terms of family resemblances.

          Benchmarks are useful tools for assessing relative progress on well-defined tasks. But the decision of what counts as AGI will always come down to fuzzy comparisons and qualitative judgments.

        • QuadmasterXLII 15 hours ago

          The current definition and goal of AGI is “Artificial intelligence good enough to replace every employee for cheaper” and much of the difficulty people have in defining it is cognitive dissonance about the goal.

          • drdeca 7 hours ago

            I’d remove the “for cheaper” part? (And also, only necessary for the employees whose jobs are “cognitive tasks”, not ones that are based on their bodies. So like, doesn’t need to be able to lift boxes or have a nice smile.)

            If something would be better at every cognitive task than every human if it ran a trillion times faster, I would consider that to be AGI even if it isn't that useful at its actual speed.

        • fvdessen 15 hours ago

          The Turing test is not really that meaningful anymore, because you can always detect the AI by text and timing patterns rather than actual intelligence. In fact, the most reliable way to test for AI is probably to ask trivia questions on various niche topics; I don't think any human has as much breadth of general knowledge as current AIs.

          • staunton 2 hours ago

            > you can always detect the AI by text and timing patterns

            I see no reason why an AI couldn't be trained on human data to fake all of that.

            If no one has bothered so far, that's because pretty much all commercial applications of this would be illegal, or would at least lead to major reputational damage when exposed.

        • goatlover 11 hours ago

          Because an important part of being a Natural General Intelligence is having a body and interacting with the world. Data from Star Trek is a good example of an AGI.

          • fragmede 5 hours ago

            Given the actions of Data's brother, I think Data qualifies as a benevolent ASI.

      • batmansmk 3 hours ago

        One of the very first slides of François’ presentation is about defining AGI. Do you have anything that opposes his synthesis of the two (50-year-old) takes on this definition?

      • tedy1996 16 hours ago

        I graduated with a degree in software engineering and I am bilingual (Bulgarian and English). Currently AI is better than me at everything except adding big numbers or writing code on really niche topics - for example, code golfing a Brainfuck interpreter or writing a Rubik's cube solver. I believe AGI has been here for at least a year now.

        • fvdessen 15 hours ago

          I suggest you try letting the AI think through race condition scenarios in asynchronous programs; it is not that good at these abstract reasoning tasks.

          • jcelerier 11 hours ago

            the huge majority of humans aren't either though

        • goatlover 11 hours ago

          Can the AI wash your dishes, fold your laundry, take out your trash, meet a friend for dinner or the other thousand things you might do in an average day when you're not interacting with text on a screen?

          You know, stuff that humans did way before there were computers and screens.

          • umeshunni 10 hours ago

            Yeah, I'm convinced that the biggest difference between the current generation of AIs we have and humans is that AIs don't have the range of tool use and interaction with the physical environment that humans do. And that's what's actually holding AGI back, not access to more data.

    • kordlessagain 18 hours ago

      ARC is definitely about achieving AGI and it doesn't matter whether we "have" it or not right now. That is the goal:

      > where he introduced the "Abstract and Reasoning Corpus for Artificial General Intelligence" (ARC-AGI) benchmark to measure intelligence

      So, a high enough score is a threshold to claim AGI. And, if you use an LLM to work these types of problems, it becomes pretty clear that passing more tests indicates a level of "awareness" that goes beyond rational algorithms.

      I thought I had seen everything until I started working on some of the problems with agents. I'm still sorta in awe about how the reasoning manifests. (And don't get me wrong, LLMs like Claude still go completely off the rails where even a less intelligent human would know better.)

      • MPSimmons 16 hours ago

        >a high enough score is a threshold to claim AGI

        I'm pretty sure he said that AGI would achieve a high score, not that a high score was indicative of AGI

    • ummonk 19 hours ago

      "Being able to solve ARC problems is probably a pre-requisite to AGI." - is it? Humans have general intelligence and most can't solve the harder ARC problems.

      • singron 15 hours ago

        https://arcprize.org/leaderboard

        "Avg. Mturker" has 77% on ARC1 and costs $3/task. "Stem Grad" has 98% on ARC1 and costs $10/task. I would love a segment like "typical US office employee" or something else in between since I don't think you need a stem degree to do better than 77%.

        It's also worth noting the "Human Panel" gets 100% on ARC2 at $17/task. All the "Human" models are on the score/cost frontier and exceptional in their score range although too expensive to win the prize obviously.

        I think the real argument is that the ARC problems are too abstract and obscure to be relevant to useful AGI, but I think we need a little flexibility in that area so we can have tests that can be objectively and mechanically graded. E.g. "write a NYT bestseller" is an impractical test in many ways even if it's closer to what AGI should be.

        • tbrownaw 8 hours ago

          > I think the real argument is that the ARC problems are too abstract and obscure to be relevant to useful AGI

          I think it's meant to work like how getting things off the top shelf at the supermarket isn't relevant to playing basketball.

      • adastra22 19 hours ago

        They, and the other posters making similar claims, don't mean human-like intelligence, or even the rigorously defined solving of unconstrained problem spaces that originally defined Artificial General Intelligence (in contrast to "narrow" intelligence).

        They mean an artificial god, and it has become a god of the gaps: we have made artificial general intelligence, and it is more human-like than god-like, and so to make a god we must have it do XYZ precisely because that is something which people can't do.

        • ummonk 18 hours ago

          Right, but there is a very clear term for that which they should be using: ASI

          • adastra22 13 hours ago

            Agree, and you shouldn't be downvoted. It is a pet peeve of mine as well.

      • satellite2 16 hours ago

        Didn't he say that 70% of a random sample of the population should get it right?

    • cubefox 15 hours ago

      > Getting a high score on ARC doesn't mean we have AGI and Chollet has always said as much AFAIK

      He only seems to say this recently, since OpenAI cracked the ARC-AGI benchmark. But in the original 2019 abstract he said this:

      > We argue that ARC can be used to measure a human-like form of general fluid intelligence and that it enables fair general intelligence comparisons between AI systems and humans.

      https://arxiv.org/abs/1911.01547

      Now he seems to backtrack, with the release of harder ARC-like benchmarks, implying that the first one didn't actually test for really general human-like intelligence.

      This sounds a bit like saying that a machine beating chess would require general intelligence -- but then adding, after Deep Blue beats chess, that chess doesn't actually count as a test for AGI, and that Go is the real AGI benchmark. And after a narrow system beats Go, moving the goalpost to beating Atari, and then to beating StarCraft II, then to MineCraft, etc.

      At some point, intuitively real "AGI" will be necessary to beat one of these increasingly difficult benchmarks, but only because otherwise yet another benchmark would have been invented. Which makes these benchmarks mostly post hoc rationalizations.

      A better approach would be to question what went wrong with coming up with the very first benchmark, and why a similar thing wouldn't occur with the second.

    • echelon 18 hours ago

      My problem with AGI is the lack of a simple, concrete definition.

      Can we formalize it as giving out a task expressible in, say, n^m bytes of information that encodes a task of n^(m+q) real algorithmic and verification complexity -- then solving that task within a certain time, compute, and attempt bounds?

      Something that captures "the AI was able to unwind the underlying unspoken complexity of the novel problem".

      I feel like one could map a variety of easy human "brain teaser" type tasks to heuristics that fit within some mathematical framework and then grow the formalism from there.

      • glenstein 18 hours ago

        >My problem with AGI is the lack of a simple, concrete definition.

        You can't always start from definitions. There are many research areas where the object of research is to know something well enough that you could converge on such a thing as a definition, e.g. dark matter, consciousness, intelligence, colony collapse syndrome, SIDS. We nevertheless can progress in our understanding of them in a whole motley of strategic ways, by case studies that best exhibit salient properties, trace the outer boundaries of the problem space, track the central cluster of "family resemblances" that seem to characterize the problem, entertain candidate explanations that are closer or further away, etc. Essentially a practical attitude.

        I don't doubt in principle that we could arrive at such a thing as a definition that satisfies most people, but I suspect you're more likely to have that at the end than the beginning.

      • kordlessagain 18 hours ago

        After researching this a fair amount, my opinion is that consciousness/intelligence (can you have one without the other?) emerges from some sort of weird entropy exchange in domains in the brain. The theory goes that we aren't conscious, but we DO consciousness, sometimes. Maybe entropy, or the inverse of it, gives way to intelligence, somehow.

        This entropy angle has real theoretical backing. Some researchers propose consciousness emerges from the brain's ability to integrate information across different scales and timeframes. This would essentially create temporary "islands of low entropy" in neural networks. Giulio Tononi's Integrated Information Theory suggests consciousness corresponds to a system's ability to generate integrated information, which relates to how it reduces uncertainty (entropy) about its internal states. Then there are Hameroff and Penrose, whom I commented about on here years ago and got blasted for. Meh. I'm a learner, and I learn by entertaining truths. But I always remain critical of theories until I'm sold.

        I'm not selling any of this as a truth, because the fact remains we have no idea what "consciousness" is. We have a better handle on "intelligence", but as others point out, most humans aren't that intelligent. They still manage to drive to the store and feed their dogs, however.

        A lot of the current leading ARC solutions use random sampling, which sorta makes sense once you start thinking about having to handle all the different types of problems. At least it seems to be helping out in paring down the decision tree.

      • apwell23 17 hours ago

        One of those cases where defining it and solving it are the same. If you know how to define it, then you've solved it.

  • davidclark a day ago

    In the video, François Chollet, creator of the ARC benchmarks, says that beating ARC does not equate to AGI. He specifically says they will be able to be beaten without AGI.

    • cubefox 15 hours ago

      He only says this because otherwise he would have to either:

      - Say that OpenAI's o3 counts as "AGI", since it unexpectedly beat the ARC-AGI benchmark, or

      - Explicitly admit that he was wrong in assuming that ARC-AGI would test for AGI

      • sweezyjeezy 13 hours ago

        FWIW the original ARC was published in 2019, just after GPT-2 but a while before GPT-3. I work in the field; I think that discussing AGI seriously is actually kind of a recent thing (I'm not sure I ever heard the term 'AGI' until a few years ago). I'm not saying I know he didn't feel that way, but he doesn't talk in such terms in the original paper.

        • cubefox 12 hours ago

          > We argue that ARC can be used to measure a human-like form of general fluid intelligence and that it enables fair general intelligence comparisons between AI systems and humans.

          https://arxiv.org/abs/1911.01547

  • yorwba a day ago

    I think the people behind the ARC Prize agree that getting a high score doesn't mean we have AGI. (They already updated the benchmark once to make it harder.) But an AGI should get a similarly high score as humans do. So current models that get very low scores are definitely not AGI, and likely quite far away from it.

    • cubefox 15 hours ago

      > I think the people behind the ARC Prize agree that getting a high score doesn't mean we have AGI

      The benchmark was literally called ARC-AGI. Only after OpenAI cracked it did they start backtracking and saying that it doesn't test for true AGI. Which undermines the whole premise of a benchmark.

      • yorwba 7 hours ago

        It does test for AGI. For its absence.

        • cubefox 2 hours ago

          I would call something as intelligent as an animal an AGI. They are very general (as opposed to narrow) intelligences, just not necessarily very smart ones. ARC-AGI arguably conflates these two dimensions.

  • yunwal 25 minutes ago

    The author of the test agrees with you.

  • energy123 a day ago

    https://en.m.wikipedia.org/wiki/AI_effect

    But on a serious note, I don't think Chollet would disagree. ARC is a necessary but not sufficient condition, and he says that, despite the unfortunate attention-grabbing name choice of the benchmark. I like Chollet's view that we will know that AGI is here when we can't come up with new benchmarks that separate humans from AI.

  • gonzobonzo 21 hours ago

    I agree with you but I'll go a step further - these benchmarks are a good example of how far we are from AGI.

    A good base test would be to give a manager a mixed team of remote workers, half human and half AI, and see if the manager or any of the coworkers would be able to tell the difference. We wouldn't be able to say that an AI that passed that test would necessarily be AGI, since we would have to test it in other situations. But we could say that an AI that couldn't pass that test wouldn't qualify, since it wouldn't be able to successfully accomplish some tasks that humans are able to.

    But of course, current AI is nowhere near that level yet. We're left with benchmarks, because we all know how far away we are from actual AGI.

    • criddell 21 hours ago

      The AGI test I think makes sense is to put it in a robot body and let it navigate the world. Can I take the robot to my back yard and have it weed my vegetable garden? Can I show it how to fold my laundry? Can I take it to the grocery store and tell it "go pick up 4 yellow bananas and two avocados that will be ready to eat in the next day or two, and then meet me in dairy"? Can I ask it to dice an onion for me during meal prep?

      These are all things my kids would do when they were pretty young.

      • gonzobonzo 21 hours ago

        I agree, I think of that as the next level beyond the digital assistant test - a physical assistant test. Once there are sufficiently capable robots, hook one up to the AI. Tell it to mow your lawn, drive your car to the mechanic and have it checked, box up an item, take it to the post office and have it shipped, pick up your dry cleaning, buy ingredients from a grocery store, cook dinner, etc. Basic tasks a low-skilled worker would do as someone's assistant.

      • bumby 17 hours ago

        I think the next harder level in AGI testing would be “convince my kids to weed the garden and fold the laundry” :-)

        • fragmede 5 hours ago

          Convincing kids is AGI, convincing adults is ASI.

    • godshatter 21 hours ago

      The problem with "spot the difference" tests, imho, is that I would expect an AGI to be easily spotted. There's going to be a speed of calculation difference, at the very least. If nothing else, typing speed would be completely different unless the AGI is supposed to be deceptive. Who knows what it's personality would be like. I'd say it's a simple enough test just to see if an AGI could be hired as, for example, an entry level software developer and keep it's job based on the same criteria base-level humans have to meet.

      I agree that current AI is nowhere near that level yet. If AI isn't even trying to extract meaning from the words it smiths or the pictures it diffuses then it's nothing more than a cute (albeit useful) parlor trick.

      • gonzobonzo 11 hours ago

        Those could probably be mitigated pretty easily in testing situations. For example, making sure all participants had a delay in chat conversations, or running correspondence through an LLM to equalize the personality.

        However, I'm not sure an AGI test should be mitigating them. If an AI isn't able to communicate at human speeds, or isn't able to achieve the social understandings that a human does, it would probably be wrong to say that it has the same intelligence capabilities as a human (how AGI has traditionally been defined). It wouldn't be able to provide human level performance in many jobs.

  • crazylogger 20 hours ago

    I think next year's AI benchmarks are going to be like this project: https://www.anthropic.com/research/project-vend-1

    Give the AI tools and let it do real stuff in the world:

    "FounderBench": Ask the AI to build a successful business, whatever that business may be - the AI decides. Maybe try to get funded by YC - hiring a human presenter for Demo Day is allowed. They will be graded on profit / loss, and valuation.

    Testing a plain LLM on whiteboard-style questions is meaningless now. Going forward, it will all be multi-agent systems with computer use, long-term memory & goals, and delegation.

    • Miraltar 4 hours ago

      This sounds like a terrible idea to me; you're training intelligent computers to aim for power. It's fine as long as they're bad at it, but if they get good then we have a problem.

  • cttet a day ago

    The point is not that having a high score -> AGI; their idea is more that having a low score -> we don't have AGI yet.

  • ben_w a day ago

    You're not alone in this; I expect us to have not yet enumerated all the things that we ourselves mean by "intelligence".

    But conversely, not passing this test is a proof of not being as general as a human's intelligence.

    • kypro a day ago

      I find the "what is intelligence?" discussion a little pointless if I'm honest. It's similar to asking a question like does it mean to be a "good person" and would we know whether an AI or person is really "good"?

      While understanding why a person or AI is doing what it's doing can be important (perhaps specifically in safety contexts) at the end of the day all that's really going to matter to most people is the outcomes.

      So if an AI can use what appears to be intelligence to solve general problems and can act in ways that are broadly good for society, whether or not it meets some philosophical definition of "intelligent" or "good" doesn't matter much – at least in most contexts.

      That said, my own opinion on this is that the truth is likely in between. LLMs today seem extremely good at being glorified auto-completes, and I suspect most (95%+) of what they do is just recalling patterns in their weights. But unlike traditional auto-completes they do seem to have some ability to reason and solve truly novel problems. As it stands I'd argue that ability is fairly poor, but this might only represent 1-2% of what we use intelligence for.

      If I were to guess why this is, I suspect it's not that LLM architecture today is completely wrong, but that the way LLMs are trained means that in general knowledge recall is rewarded more than reasoning. This is similar to the trade-off we humans have with education – do you prioritise the acquisition of knowledge or critical thinking? Many believe critical thinking is more important and should be prioritised more, but I suspect that for the vast majority of tasks we're interested in solving, knowledge storage and recall is actually more important.

      • ben_w 19 hours ago

        That's certainly a valid way of looking at their abilities at any given task — "The question of whether a computer can think is no more interesting than the question of whether a submarine can swim".

        But when the question is "are they going to be more important to the economy than humans?", then they have to be good at basically everything a human can do; otherwise we just see a variant of Amdahl's law in action, with the AI performing an arbitrary speed-up of n% of the economy while humans are needed for the remaining (100-n)%.

        I may be wrong, but it seems to me that the ARC prize is more about the latter.

        • IanCal 19 hours ago

          > are they going to be more important to the economy than humans? ... then they have to be good at basically everything a human can do

          I really don’t think that’s the case. A robot that can stack shelves faster than a human is more valuable at that job than someone who can move items and also appreciate comedy. One that can write software more reliably than person X is more valuable than them at that job even if X is well rounded and can do cryptic crosswords and play the guitar.

          Also, for many tasks they can be worse but cheaper.

          I do wonder how many tasks something like o3 or o3 pro can’t do as well as a median employee.

          • ben_w 16 hours ago

            > I really don’t think that’s the case. A robot that can stack shelves faster than a human is more valuable at that job than someone who can move items and also appreciate comedy.

            Yes, until all the shelves are stacked and that is no longer your limiting factor.

            > One that can write software more reliably than person X is more valuable than them at that job even if X is well rounded and can do cryptic crosswords and play the guitar.

            Cryptic crosswords and guitar playing are already something computers can do, so they're not great examples.

            Consider a different example: "computer" used to be a job title of a person who computes. A single Raspberry Pi model zero, given away for free on a magazine cover at launch, can do this faster than the entire human population combined even if we all worked at the speed of the world record holder 24/7. But that wasn't enough to replace all human labour.

          • tedy1996 16 hours ago

            AFAIK 55% of PRs written by the latest GPT model get approved.

  • kubb a day ago

    AGI isn't defined anywhere, so it can be anything you want.

    • mindcrime 19 hours ago

      Oh, it's defined in lots of places. The problem is.. it's defined in lots of places!

    • FrustratedMonky a day ago

      Yes. And a lot of humans also don't pass for having AGI.

  • avmich a day ago

    Roughly speaking, the job of a medical doctor is to diagnose the patient - and then, after the diagnosis is made, to apply the textbook treatment corresponding to the diagnosis.

    The diagnosis is pattern matching (again, roughly). It kinda suggests that a lot of "intelligent" problems are focused on pattern matching, and (relatively straightforward) application of "previous experience". So, pattern matching can bring us a great deal towards AGI.

    • AnimalMuppet a day ago

      Pattern matching is instinct. (Or at least, instinct is a kind of pattern matching. And once you learn the patterns, pattern matching can become almost instinctual). And that's fine, for things that fit the pattern. But a human-level intelligence can also deal with problems for which there is no pattern. (I mean, not always successfully - finding a correct solution to a novel problem is difficult. But it is within the capability of at least some humans.)

  • whiplash451 a day ago

    You're not the only one. ARC-AGI is a laudable effort, but its fundamental premise is indeed debatable:

    "We argue that human cognition follows strictly the same pattern as human physical capabilities: both emerged as evolutionary solutions to specific problems in specific evironments" (from page 22 of On the Measure of Intelligence)

    https://arxiv.org/pdf/1911.01547

    • Davidzheng a day ago

      But I believe that because of this "uneven edge" thing people talk about - AI weaknesses not necessarily being the same as humans' - once we run out of tests on which AI is worse than humans, it will in effect already be very much superhuman. My main evidence for this is Leela Zero, the Go AI, which struggled with ladders and some other aspects of Go play well into the superhuman regime (in Go it's easier to see when it's superhuman because you have Elo ratings, play win-rates, etc., and there's less room for debate).

  • cainxinth a day ago

    > It's mostly about pattern matching...

    For all we know, human intelligence is just an emergent property of really good pattern matching.

  • andoando 19 hours ago

    Who says intelligence is anything more than "pattern matching"? Everything is patterns

  • tippytippytango 18 hours ago

    He’s playing the game. You have to say AGI is your goal to get attention. It’s just like the YouTube thumbnail game. You can hate it, but you still have to play if you want people to pay attention.

  • nxobject a day ago

    I understand Chollet is transparent that the "branding" of the ARC-AGI-n suites is meant to be suggestive of their purpose rather than substantial.

    However, it does rub me the wrong way - as someone who's cynical of how branding can enable breathless AI hype by bad journalism. A hypothetical comparison would be labelling SHRDLU's (1968) performance on Block World planning tasks as "ARC-AGI-(-1)".[0]

    A less loaded name like (bad strawman option) "ARC-VeryToughSymbolicReasoning" would capture how the ARC-AGI-n suite is genuinely and intrinsically very hard for current AIs, and what progress satisfactory performance on the benchmark suite would represent. Which Chollet has done, and it has grounded him throughout! [1]

    [0] https://en.wikipedia.org/wiki/SHRDLU [1] https://arxiv.org/abs/1911.01547

    • heymijo a day ago

      I get what you're saying about perception being reality and that ARC-AGI suggests beating it means AGI has been achieved.

      In practice when I have seen ARC brought up, it has more nuance than any of the other benchmarks.

      Unlike Humanity's Last Exam, which is the most egregious example I have seen, both in its naming and in how it is referenced in terms of an LLM's capability.

  • cess11 4 hours ago

    Much like other forms of psychometry, especially related to so called intelligence, it's mainly about stratification and discrimination for ideological purposes.

  • maaaaattttt a day ago

    I've said this somewhere else, but we have the perfect test for AGI in the form of any open-world game. Give the AGI the instructions that it should finish the game and how to control it. Give it the frames as input and wait. When I think of the latest Zelda games, and especially how the Shrine challenges are designed, they feel like the perfect environment for an AGI test.

    • Lerc a day ago

      And if someone makes a machine that does all that and another person says

      "That's not really AGI because xyz"

      What then? The difficulty in coming up with a test for AGI is coming up with something where people will accept that a passing grade means AGI.

      In many respects I feel like all of the claims that models don't really understand or have internal representation or whatever tend to lean on nebulous or circular definitions of the properties in question. Trying to pin the arguments down usually ends up in dualism and/or religion.

      Doing what Chollet has done is infinitely better: if a person can easily do something and a model cannot, then there is clearly something significant missing.

      It doesn't matter what the property is or what it is called. Such tests might even help us see what those properties are.

      Anyone who wants to claim the fundamental inability of these models should be able to provide a task where it is clearly possible to tell when it has been solved, and to show that humans can do it (if that's the bar we are claiming can't be met). If they are right, then no future model should be able to solve that class of problems.

      • maaaaattttt a day ago

        Given your premise (which I agree with), I think the issue in general comes from the lack of a good, broadly accepted definition of what AGI is. My initial comment originates from the fact that in my internal definition, an AGI would have a de facto understanding of the physics of "our world". Or better, could infer it by trial and error. But, indeed, it doesn't have to be the case. (The other advantage of the Zelda games is that they introduce new abilities that don't exist in our world, and for which most children I've seen understand the mechanisms and how they could be applied to solve a problem quite naturally, even though they've never had that ability before.)

        • wat10000 21 hours ago

          I'd say the issue is the lack of a good, broadly accepted definition of what the "I" is. We all know "smart" when we see it, but actually defining it in a rigorous way is tough.

          • ta8645 15 hours ago

            This difficulty is interesting in and of itself.

            When people catalogue the deficiencies in AI systems, they often (at least implicitly) forgive all of our own such limitations. When someone points to something that an AI system clearly doesn't understand, they say that proves it isn't AGI. But if you point at any random human, who fails at the very same task, you wouldn't say they lack "HGI", even if they're too personally limited to ever be taught the skill.

            All of which, is to say, I don't think pointing at a limitation of an AI system, really proves it lacks AGI. It's a more slippery definition, than that.

      • bonoboTP 21 hours ago

        > It doesn't matter what the property is or what it is called. Such tests might even help us see what those properties are.

        This is a very good point and somewhat novel to me in its explicitness.

        There's no reason to think that we already have the concepts and terminology to point out the gaps between the current state and human-level intelligence and beyond. It's incredibly naive to think we have already armchair-generated those concepts by pure self-reflection and philosophizing. This is obvious in fields like physics. Experiments were necessary to even come up with the basic concepts of electromagnetism or relativity or quantum mechanics.

        I think the reason is that pure philosophizing is still more prestigious than getting down in the weeds and dirt and doing limited-scope well-defined experiments on concrete things. So people feel smart by wielding poorly defined concepts like "understanding" or "reasoning" or "thinking", contrasting it with "mere pattern matching", a bit like the stalemate that philosophy as a field often hits, as opposed to the more pragmatic approach in the sciences, where empirical contact with reality allows more consensus and clarity without getting caught up in mere semantics.

      • jcranmer 21 hours ago

        > The difficulty in coming up with a test for AGI is coming up with something that people will accept a passing grade as AGI.

        The difficulty with intelligence is we don't even know what it is in the first place (in a psychology sense, we don't even have a reliable model of anything that corresponds to what humans point at and call intelligence; IQ and g are really poor substitutes).

        Add into that Goodhart's Law (essentially, propose a test as a metric for something, and people will optimize for the test rather than what the test is trying to measure), and it's really no surprise that there's no test for AGI.

  • sva_ 19 hours ago

    It is a necessary condition, but not a sufficient one.

  • mindcrime 19 hours ago

    > I feel like I'm the only one who isn't convinced getting a high score on the ARC eval test means we have AGI.

    Wait, what? Approximately nobody is claiming that "getting a high score on the ARC eval test means we have AGI". It's a useful eval for measuring progress along the way, but I don't think anybody considers it the final word.

  • loki_ikol a day ago

    Well for most, the next steps are probably towards removing the highly deterministic and discrete characteristics of current approaches (we certainly don't think in lockstep). There are no measures for those. Even the creative aspect is undermined by those characteristics.

  • CamperBob2 a day ago

    If you can write code to solve ARC by "overfitting," then give it a shot! There's prize money to be won, as long as your model does a good job on the hidden test set. Zuckerberg is said to be throwing around 8-figure signing bonuses for talent like that.

    But then, I guess it wouldn't be "overfitting" after all, would it?

  • OtomotO a day ago

    You're not alone in this, no.

    My definition of AGI is the one I was brought up with, not an ever moving goal post (to the "easier" side).

    And no, I also don't buy that we are just stochastic parrots.

    But whatever. I've seen many hype cycles, and if I don't die and the world doesn't go to shit, I'll see a few more in the next couple of decades.

  • oldge a day ago

    Today’s LLMs are fancy autocomplete but lack test-time self-learning or persistent drive. By contrast, an AGI would require:

    – A goal-generation mechanism (G) that can propose objectives without external prompts

    – A utility function (U) and policy π(a|s) enabling action selection and hierarchy formation over extended horizons

    – Stateful memory (M) + feedback integration to evaluate outcomes, revise plans, and execute real-world interventions autonomously

    Without G, U, π, and M operating, LLMs remain reactive statistical predictors, not human-level intelligence.
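
    A minimal toy sketch of how such a G/U/π/M loop could fit together (illustrative only; the names and the random scoring are made up), in contrast to a stateless prompt-response call:

      import random

      class ToyAgent:
          def __init__(self):
              self.memory = []                      # M: persistent record of (goal, action, outcome, utility)

          def generate_goal(self, observation):     # G: propose an objective without an external prompt
              return f"improve:{observation}"

          def utility(self, outcome):               # U: score how desirable an outcome is
              return outcome["score"]

          def policy(self, goal, observation):      # pi(a|s): choose an action given goal and state
              return random.choice(["explore", "exploit", "ask_for_help"])

          def act(self, action):                    # stand-in for acting in the real world
              return {"action": action, "score": random.random()}

          def step(self, observation):
              goal = self.generate_goal(observation)
              action = self.policy(goal, observation)
              outcome = self.act(action)
              self.memory.append((goal, action, outcome, self.utility(outcome)))  # feedback integration
              return outcome

      agent = ToyAgent()
      for t in range(3):                            # the loop persists and accumulates state across steps
          print(agent.step(f"state_{t}"))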

    • KoolKat23 a day ago

      I'd say we're not far off.

      Looking at the human side, it takes a while to actually learn something. If you've recently read something, it remains in your "context window". You need to dream about it, to think about it, to revisit and repeat until you actually learn it and "update your internal model". We need a mechanism for continuous weight updating.

      Goal-generation is pretty much covered by your body constantly drip-feeding your brain various hormones "ongoing input prompts".

      • onemoresoop a day ago

        > I'd say we're not far off.

        How are we not far off? How can LLMs generate goals and based on what?

        • FeepingCreature 21 hours ago

          You just train it on the goal. Then it has that goal.

          Alternately, you can train it on following a goal and then you have a system where you can specify a goal.

          At sufficient scale, a model will already contain goal-following algorithms because those help predict the next token when the model is basetrained on goal-following entities, ie. humans. Goal-driven RL then brings those algorithms to prominence.

          • kordlessagain 18 hours ago

            Random goal use is proving to be more important than training. Although, last year someone trained on the fly during the competition, which is pretty awesome when you think about it.

          • kelseyfrog 19 hours ago

            How do you figure goal generation and supervised goal training are interchangeable?

            • FeepingCreature 6 hours ago

              Layman warning! But "at sufficient scale", like with learning-to-learn, I'd expect it to pick up largely meta-patterns along with (if not rather than) behavioral habits, especially if the goal is left open, because strategies generalize across goals and thus get reinforcement from every instance of goal pursuit during base training.

              But also my intuition is that humans are "trained on goals" and then reverse-engineer an explicit goal structure using self-observation and prosaic reasoning. If it works for us, why not the LLMs?

              edit: Example: https://arxiv.org/abs/2501.11120 "Tell me about yourself: LLMs are aware of their learned behaviors". When you train a LLM on an exclusively implicit goal, the LLM explicitly realizes that it has been trained on this goal, indicating (IMO) that the implicit training hit explicit strategies.

    • asah 12 hours ago

      we're closer than you think...

stared 4 hours ago

I dislike the term AGI, as intelligence (of any type) always involves tradeoffs. Being exceptional at solving 2D grid-based pattern tasks is just one skill. Humans have a strong visual bias, while some hypothetical superintelligent slime molds might value entirely different problems. I know smart people (PhDs in STEM fields at major universities) who struggle with geometric puzzles, yet excel at linguistic or algebraic ones.

Getting a perfect ARC-AGI-n score isn't a smoking gun indicator of general intelligence. Rather, it simply means we're now able to solve a class of problems previously beyond AI capabilities (which is exciting in itself!).

I view ARC-AGI primarily as a benchmark (similar in spirit to Raven's matrices) that makes memorization substantially harder. Compare this with vocabulary-focused IQ tests, where cognitive skills certainly matter, but results depend heavily on exposure to a particular language.

  • spectre9 3 hours ago

    Call me crazy but we should be optimizing for human visual intelligence rather than slime mold symbolic space

    • ACCount36 3 hours ago

      It's obvious why having human visual intelligence in a machine is desirable.

      But if slime mold symbolic space is better suited for something like understanding of biology or abstract math, that's a damn good reason to go for the slime mold route too.

TheAceOfHearts a day ago

The first highlight from this video is getting to see a preview of the next ARC dataset. Otherwise it feels like most of what Chollet says here has already been repeated in his other podcast appearances and videos. It's a good video if you're not familiar with his work, but if you've seen some of his recent interviews then you can probably skip the first 20 minutes.

The second highlight from this video is the section from 29 minutes onward, where he talks about designing systems that can build up rich libraries of abstractions which can be applied to new problems. I wish he had lingered more on exploring and explaining this approach, but maybe they're trying to keep a bit of secret sauce because it's what his company is actively working on.

One of the major points which seems to be emerging from recent AI discourse is that the ability to integrate continuous learning seems like it'll be a key element in building AGI. Context is fine for short tasks, but if lessons are never preserved you're severely capped with how far the system can go.

modeless 21 hours ago

ARC-AGI 3 reminds me of PuzzleScript games: https://www.puzzlescript.net/Gallery/index.html

There are dozens of ready-made, well-designed, and very creative games there. All are tile-based and solved with only arrow keys and a single action button. Maybe someone should make a PuzzleScript AGI benchmark?

  • mNovak 19 hours ago

    This game is great!

    https://nebu-soku.itch.io/golfshall-we-golf

    Maybe someone can make an MCP connection for the AIs to practice. But I think the idea of the benchmark is to reserve some puzzles for private evaluation, so that they're not in the training data.

EigenLord 6 hours ago

I've been thinking lately about how AGI runs up against the No Free Lunch Theorem. This is what irritates me: science is not determining the narrative. Money is. I highly recommend mathematician David Wolpert's work on the topic. I think he inadvertently proved that ASI is physically impossible. Certainly he proved that AOI (artificial omniscient intelligence) is impossible.

One thing he showed is that you can't have a universe with two omniscient intelligences (as it would be intractable for them to predict the other's behavior.)

It's also very questionable whether "humanlike" intelligence is truly general in the first place. I think cognitive neurobiologists would agree that we have a specific "cognitive niche", and while this symbolic niche seems sufficiently general for a lot of problems, there are animals that make us look stupid in other respects. This whole idea that there is some secret sauce special algorithm for universal intelligence is extremely suspect. We flatter ourselves and have committed to a fundamental anthropomorphic fallacy that seems almost cartoonishly elementary for all the money behind it.

  • FilosofumRex 5 hours ago

    AGI can't be defined, because it's the means by which definitions are created. You can only measure it contemporaneously by some consensus method such as ARC.

    You can't define AGI, any more than you can define ASA (artificial sports ability). Intelligence, like athleticism, changes both quantitatively and qualitatively. The Greek Olympic champions of 2K yrs ago wouldn't qualify for high school championships today; however, they were once regarded as great athletes.

  • fragmede 6 hours ago

    When hasn't money determined the narrative?

visarga 20 hours ago

I think intelligence is search. Search is exploration + learning. So intelligence is not in the model or in the environment, but in their mutual dance. A river is not the banks, nor the water, but their relation. ARC is just a frozen snapshot of the banks, not the dynamic environment we have.

  • ipunchghosts 15 hours ago

    I agree strongly with this take but find it hard to convince others of it. Instead, people keep thinking there is a magic bullet to discover resulting in a lot of wasted resources and money.

bogtog 20 hours ago

I wonder how much of the slow progress on ARC can be explained by the tasks' visual properties making them easy for humans but hard for LLMs.

My impression is that models are pretty bad at interpreting grids of characters. Yesterday, I was trying to get Claude to convert a message into a cipher where it converted a 98-character string into a 7x14 grid where the sequential letters moved 2 right and 1 down (i.e., like a knight in chess). Claude seriously struggled.

Yet Francois always pumps up the "fluid intelligence" component of this test and emphasizes how easy these tasks are for humans. Yet humans would presumably be terrible at the tasks if they looked at them character by character.
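
A rough illustration of that flattening (my own sketch, not from the talk): the same small ARC-style grid shown the way a human sees it versus the single character stream a text-only model has to reason over.

    # A 2-D grid as a human sees it: rows aligned, the vertical stripe is obvious at a glance.
    grid = [
        [0, 0, 3, 0],
        [0, 3, 3, 0],
        [0, 0, 3, 0],
        [0, 0, 0, 0],
    ]
    for row in grid:
        print(" ".join(str(cell) for cell in row))

    # What a text-only model effectively receives: one long 1-D sequence with no visual alignment,
    # so any 2-D structure has to be reconstructed by counting positions.
    flat = " ".join(str(cell) for row in grid for cell in row)
    print(flat)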

This feels like a somewhat similar (intuition-lie?) case as the Apple paper showing how reasoning models can't do Tower of Hanoi past 10+ disks. Readers will intuitively think about how they themselves could tediously do an infinitely long Tower of Hanoi, which is what the paper is trying to allude to. However, the more appropriate analogy would be writing out all >1000 moves on a piece of paper at once and being 100% correct, which is obviously much harder.

  • ACCount36 3 hours ago

    There are some major hints that this is indeed the case.

    I've seen a simple ARC-AGI test that took the open set, and doubled every image in it. Every pixel became a 2x2 block of pixels.
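
    A sketch of the kind of transformation described (my guess at the setup, not the experiment's actual code): every cell becomes a 2x2 block, so the underlying rule and its solution are unchanged while the grid the model must parse is four times larger.

      def double_grid(grid):
          out = []
          for row in grid:
              wide = [cell for cell in row for _ in range(2)]  # duplicate each cell horizontally
              out.append(wide)
              out.append(list(wide))                           # duplicate each row vertically
          return out

      print(double_grid([[1, 0], [0, 2]]))
      # [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 2, 2], [0, 0, 2, 2]]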

    If LLMs were bottlenecked solely by reasoning or logic capabilities, this wouldn't change their performance all that much, because the solution doesn't change all that much.

    Instead, the performance dropped sharply - which hints that perception is the bottleneck.

  • krackers 13 hours ago

    I thought so too back when the test was first released, but now that we have multimodal models which can take images directly as input, shouldn't this point be moot?

    • ACCount36 3 hours ago

      Even the very best multimodal LLMs still suffer from a harsh perception bottleneck. They're impressive, but nowhere near as good as human visual cortex.

    • bogtog 9 hours ago

      I think the top performer afaik (ChatGPT o3) is still treating ARC as a series of characters. I imagine complex reasoning in multimodal processing wouldn't be nearly as advanced so treating it as characters is still better

      • krackers 9 hours ago

        interesting, I thought one of the whole points of o3 was mixed multimodal reasoning (e.g. everyone doing those geoguesser challenges). But maybe that's just a parlor trick and it's not actually implemented that way. I wonder when they're going to extend chain-of-thought to work with image tokens, seems like that'd help for solving spatial challenges like this.

gtech1 17 hours ago

This may be a silly question, I'm no expert. But why not simply define as AGI any system that can answer a question that no human can? For example, ask the AGI to find out, from current knowledge, how to reconcile gravity and QED.

  • soVeryTired 16 hours ago

    Computers can already do a lot of things that no human can though. They can reliably find the best chess or go move better than a human.

    It's conceivable (though not likely) that given enough training in symbolic mathematics and some experimental data, an LLM-style AI could figure out a neat reconciliation of the two theories. I wouldn't say that makes it AGI though. You could achieve that unification with an AI that was limited to mathematics, rather than being something that can function in many domains like a human can.

    • goatlover 11 hours ago

      Wouldn't this unification need to be backed by empirical data? Let's say the AI discovers the two theories can be unified using, say, some configuration of 8 spatial dimensions and 2 time dimensions. Neat trick, but how do we know the world actually has those dimensions?

      • gtech1 9 hours ago

        Do we even have any other theory that does that already? It seems that even finding one would be a great achievement.

  • layer8 15 hours ago

    Aside from other objections already mentioned, your example would require feasible experiments for verification, and likely the process of finding a successful theory of quantum gravity requires a back and forth between experimenters and theorists.

  • imiric 16 hours ago

    "What is the meaning of life, the universe, and everything?"

lawlessone 18 hours ago

How do we define AGI?

I would have thought/considered AGI to be something that is constantly aware; a biological brain is always on. An LLM is on briefly while it's inferring.

A biological brain constantly updates itself and adds memories of things. Those memories generally stick around.

khalic 18 hours ago

This quest for an ill-defined AGI is going to create a million Captain Ahabs.

chromaton a day ago

Current AI systems don't have a great ability to take instructions or information about the state of the world and produce new output based upon that. Benchmarks that emphasize this ability help greatly in progress toward AGI.

vixen99 a day ago

Is the text available for those who don't hear so well?

  • jasonlotito a day ago

    At the very least, YouTube provides a transcript and a "Show Transcript" button in the video description, which you can click on to follow along.

    • heymijo a day ago

      When I watched the video I had the subtitles on. The automatic transcript is pretty good. "Test-time", which is used frequently, gets transcribed as "Tesla", so watch out for that.

hackinthebochs a day ago

Has Chollet ever talked about his change of heart regarding AGI? It wasn't that long ago when he was one of the loudest voices decrying even the concept of AGI, let alone us being on the path to creating it. Now he's an advocate and has his own prize dataset? Seems rather convenient to change your tune once hundreds of billions are being thrown at AGI (not that I would blame him).

  • zamderax a day ago

    People are allowed to evolve opinions. It seems to me he believes that a combination of transformer and program synthesis are key. The big unknown at the moment is how to do program search.

    • hackinthebochs a day ago

      Absolutely. Presumably there is some specific considerations or evidence that helped him evolve his opinion. I would be interested in seeing a writeup about it. With him having been a very public advocate against AGI, a writeup of his evolution seems appropriate and would be very edifying for a lot of people.

      • Bjorkbat 17 hours ago

        I recall it as less an evolution and more a complete tonal shift the moment o3 was evaluated on ARC-AGI. I remember on Twitter Sam made some dumb post suggesting they had beaten the benchmark internally and Francois calling him out on his vagueposting. Soon as they publicly released the scores, it was like he was all-in on reasoning.

        Which I have to admit I was kind of disappointed by.

      • blibble a day ago

        > Presumably there is some specific considerations or evidence that helped him evolve his opinion.

        suitcases full of money?

  • cubefox a day ago

    ARC-AGI was introduced in 2019:

    https://arxiv.org/abs/1911.01547

    GPT-3 didn't come out until 2020.

    • hackinthebochs a day ago

      In my view that just makes his evolution more interesting as it wasn't just a matter of being wow'ed by what ChatGPT could do.

  • 0xCE0 a day ago

    He has recently co-founded the NDEA company, so he has to align himself with that. The same kind of vibe change can be felt with Joscha Bach after he took a position at Liquid AI. Communication is not so relaxed anymore.

    That said, I'd still listen to these two guys (+ Schmidhuber) more than any other AI guy.

acegod 19 hours ago

[dead]

  • abeppu 18 hours ago

    I think you're basically saying that ARC-AGI doesn't achieve a goal that _it didn't set_. The point of ARC-AGI is not to benchmark LLMs specifically. The point is to measure fluid intelligence in a way which supports comparisons between models and between models and humans. It's not the obligation of the test to be tailored to the form of model that's most popular now.

    • acegod 17 hours ago

      Right, that's exactly what I'm saying.

      >The point is to measure fluid intelligence in a way which supports comparisons between models and between models and humans. It's not the obligation of the test to be tailored to the form of model that's most popular now.

      The problem is that the test may not be giving an accurate comparison because the test is problematic when used to assess LLMs, which are the kind of model that people are most interested in assessing for general capabilities.

jacquesm 21 hours ago

Let's not. Seriously. I absolutely love François and have used his work extensively. But looking around me at the social impact of AI, I am really not convinced that this is what the world needs right now, and I think that if we can stave off the turning point for another decade or two, humanity will likely benefit from that. The last thing we need is to inject yet another instability into a planet that is already fighting an existential crisis on a number of fronts.

  • thatguy0900 21 hours ago

    It doesn't matter what should or should not happen. Technology will continue to race forward at breakneck speed while everyone involved pats each other on the back for making a bunch of money before the consequences hit

    • bgwalter 11 hours ago

      The Iranian nuclear program disagrees. There could be treaties and the analogue of the IAEA for "AI".

    • nessbot 21 hours ago

      technology doesn't just advance itself

      • bnchrch 21 hours ago

        No, but one thing is certain, in large human systems you can only redirect greed, you can't stop it.

      • alex_duf 21 hours ago

        If the incentive is there, the technology will advance. I hear "we need to slow down the progress of technology", but that's misunderstanding _why_ it progresses. I'm assuming the slow-down camp really needs to look into what the incentive to slow down would be.

        Personally I don't think it's possible at this stage. The cat's out of the bag (this new class of tools is working); the economic incentive is way too strong.

      • lo_zamoyski 21 hours ago

        This is true. We have a choice...in principle.

        But in practice, it's like stopping an arms race.

        • hollerith 11 hours ago

          Which arms race? The nuclear arms race of the Cold War? That one got stopped.

ltbarcly3 19 hours ago

There is some kind of massive brigading happening on this thread. Lots of thoughtful comments are downmodded or flagged (including mine, which I thought was pretty thoughtful. I even said poop instead of shit.).

https://news.ycombinator.com/item?id=44492241

My comment was basically instantly flagged. I see at least 3 other flagged comments that I can't imagine deserve to be flagged.

  • layer8 17 hours ago

    You didn’t address anything from the actual talk.

    • ltbarcly3 14 hours ago

      I addressed the entire concept of the talk, and made other relevant points. The correct response to "let me tell you something I can't possibly know" isn't to argue the points within that frame.

      If you see a talk like "How we will develop diplomacy with the rat-people of TRAPPIST-5", you don't have to make some argument about super-Earths and gravity and the rocket equation. You can just point out that it's absurd to pretend to know something like whether there are rat-people there.

      Either way, it isn't flag-able!

      • layer8 14 hours ago

        Did you actually watch the talk?

        The flagging is probably due to your aggressively indignant style.

  • starchild3001 8 hours ago

    I've noticed that Hacker News has become increasingly strict with tone lately, almost to the point of aggressive "tone policing." It feels like you're expected to maintain a very diplomatic tone: praise the author, acknowledge the strengths of the article or video, and only then, very carefully, offer any criticism. You can't just bash it and not get downvoted :)

saberience a day ago

The ARC prize/benchmark is a terrible judge of whether we got to AGI.

If we assume that humans have "general intelligence", we would assume all humans could ace ARC... but they can't. Try asking your average person, e.g. supermarket workers or gas station attendants, to do the ARC puzzles: they will do poorly, especially on the newer ones. Yet AI has to do perfectly to prove it has general intelligence? (Not trying to throw shade here, but the reality is this test is more like an IQ test than an AGI test.)

ARC is a great example of AI researchers moving the goal posts for what we consider intelligent.

Let's get real: Claude Opus is smarter than 99% of people right now, and I would trust its decision making over that of 99% of the people I know in most situations, except perhaps emotion-driven ones.

The ARC-AGI benchmark is just a gimmick. Also, since it's a visual test and the current models are text-based, it's actually rigged against the AI models anyway, since their training datasets were completely text-based.
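
To make the visual-vs-text point concrete, here's a toy sketch of how a grid task ends up in a text model's prompt. The public ARC tasks ship as JSON along these lines, with "train" and "test" lists of integer grids (colors 0-9); the task contents and helper names below are made up for illustration.

    # A made-up ARC-style task: the "train" pairs demonstrate the rule
    # (here, swap the two columns), "test" gives the grid to solve.
    task = {
        "train": [
            {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
            {"input": [[2, 3], [3, 2]], "output": [[3, 2], [2, 3]]},
        ],
        "test": [
            {"input": [[1, 1], [4, 0]]},
        ],
    }

    def grid_to_text(grid):
        # Flatten a 2D grid of color indices into rows of digits.
        return "\n".join("".join(str(cell) for cell in row) for row in grid)

    parts = []
    for pair in task["train"]:
        parts.append("Example input:\n" + grid_to_text(pair["input"]))
        parts.append("Example output:\n" + grid_to_text(pair["output"]))
    parts.append("Test input:\n" + grid_to_text(task["test"][0]["input"]))
    parts.append("Test output:")
    print("\n\n".join(parts))

Whether collapsing a 2D pattern into digit strings like this genuinely handicaps a text-only model is exactly the point in dispute.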

Basically, it's a test of some kind, but it doesn't mean quite as much as Chollet thinks it means.

  • leumon a day ago

    He said in the video that they tested regular people (Uber drivers, etc.) on ARC-AGI-2 and at least 2 people were able to solve each task (an average of 9-10 people saw each task). Also this quote from the paper: "None of the self-reported demographic factors recorded for all participants—including occupation, industry, technical experience, programming proficiency, mathematical background, puzzle-solving aptitude, and various other measured attributes—demonstrated clear, statistically significant relationships with performance outcomes. This finding suggests that ARC-AGI-2 tasks assess general problem-solving capabilities rather than domain-specific knowledge or specialized skills acquired through particular professional or educational experiences."

  • Workaccount2 21 hours ago

    This is what is called "spikey" intelligence, where a model might be able to crack PhD physics problems and solve byzantine pattern-matching games at the 90th percentile, but also can't figure out how to look up a company and copy its address onto the "customer" line of an invoice.

  • cttet a day ago

    Maybe it's a cultural difference, but I feel that the "supermarket workers, gas station attendants" (in an Asian country) that I know should be quite capable of most ARC tasks.

  • profchemai a day ago

    Out of 100s of evals, ARC is a very distinct and unique one. Most frontier models are also visual now, so I don't see the harm in having this instead of another text eval.

  • daveguy a day ago

    It is not a judge of whether we got to AGI. And literally no one except straw-manning critics is trying to claim it is. The point is, an AGI should easily be able to pass it. But it can obviously be passed without getting to AGI. It's a necessary but not sufficient criterion. If something can't pass a test as simple as ARC (which no AI currently can), then it's definitely not AGI. Anyone claiming AGI should be able to point their AI at the problem and get an 80+% solution rate. Current attempts on the second ARC are below 10%, with zero-shot attempts even worse. Even the better-performing LLMs on the first ARC couldn't do well without significant pre-training. In short, the G in AGI stands for general.
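
    For context on what a "solution rate" means here, a toy sketch of exact-match scoring (made-up grids and function names; the real ARC harness also allows a couple of prediction attempts per task, which this ignores):

        def solved(predicted, expected):
            # A task output counts only if the grid matches exactly,
            # in both shape and every cell value.
            return predicted == expected

        def solution_rate(results):
            # results: list of (predicted_grid, expected_grid) pairs, one per task.
            return sum(solved(p, e) for p, e in results) / len(results)

        # One exact match out of two tasks -> 0.5 (a 50% solution rate).
        print(solution_rate([
            ([[1, 0], [0, 1]], [[1, 0], [0, 1]]),
            ([[2, 2], [2, 2]], [[2, 0], [0, 2]]),
        ]))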

    • saberience 21 hours ago

      So do you agree that a human that CANNOT solve ARC doesn't have general intelligence?

      If we think humans have "GI" then I think we have AIs right now with "GI" too. Just like humans do, AIs spike in various directions. They are amazing at some things and weak at visual/IQ test type problems like ARC.

      • adamgordonbell 20 hours ago

        It's a good question, but only complicated answers are possible. A puppy, a crow, and a raccoon all have intelligence, but they certainly can't all pass the ARC challenge.

        I think the charitable interpretation is that intelligence is made up of many skills, and AIs are superhuman at some of them, like image recognition.

        Therefore, future efforts need to focus on the areas where AIs are significantly less skilled. And since AIs are good at memorizing things, knowledge questions are the wrong direction: anything most humans could solve but AIs cannot, especially something as generic as pattern matching, should be an important target.

roenxi a day ago

By both definitions of intelligence in the presentation, we should be saying "how we got to AGI" in the past tense. We're already there. AIs can deal with situations they weren't prepared for in any sense that a human can. They might not do well, but they'll have a crack at it. We can trivially build systems that collect data and do a bit more offline training if that is what someone wants to see, but there doesn't really seem to be a commercial need for that right now. Similarly, AIs can whip most humans at most domains that require intelligence.

I think the debate has been flat-footed by the speed at which all this happened. We're not talking about AGI any more; we're talking about how to build superintelligences hitherto unseen in nature.

  • tmvphil a day ago

    According to this presentation at least, ARC-AGI-2 shows that there is a big meaningful gap in fluid intelligence between normal non-genius humans and the best models currently, which seems to indicate we are not "already there".

    • saberience a day ago

      There's already a big meaningful gap between the things AIs can do which humans can't, so why do you only count as "meaningful" the things humans can do which AIs can't?

      I enjoy seeing people repeatedly move the goalposts for "intelligence" as AIs simply get smarter and smarter every week. Soon AI will have to beat Einstein in Physics, Usain Bolt in running, and Steve Jobs in marketing to be considered AGI...

      • tmvphil 20 hours ago

        > There's already a big meaningful gap between the things AIs can do which humans can't, so why do you only count as "meaningful" the things humans can do which AIs can't?

        Where did I say there was nothing meaningful about current capabilities? I'm saying that what is novel about a claim of "AGI" (as opposed to a claim of "computer does something better than humans", which has been obviously true since the ENIAC) is the ability to do, at some level, everything a normal human intelligence can do.

  • cubefox a day ago

    Well, there is also robotics, active inference, online learning, etc. Things animals can do well.

    • AIPedant a day ago

      Current robots perform very badly on my patented and highly scientific ROACH-AGI benchmark - "is this thing smarter at navigating unfamiliar 3D spaces than a cockroach?"