A new artificial intelligence model has passed a test designed to measure human-level "general intelligence." On December 20, OpenAI's o3 system scored 85% on the ARC-AGI benchmark, well above the previous best AI score of 55% and on par with the average human score.

It also scored well on a very difficult mathematics test. OpenAI appears to have made a significant step toward artificial general intelligence, or AGI.

Despite earlier skepticism, many AI researchers and developers now believe artificial general intelligence (AGI) may be within reach. Is that true, or was it always a far-fetched dream? Generalisation and sample efficiency are two key components of intelligence, and understanding what the ARC-AGI test measures is essential to understanding the o3 result.

The test is designed to measure an AI system's "sample efficiency": how many examples of a new situation the system needs to see before it works out how that situation operates. ChatGPT (GPT-4) is not very sample efficient. It was trained on millions of examples of human text, building probabilistic rules about which combinations of words are most likely.

The result is good performance on common tasks but poor performance on rare ones, because there is less data (fewer samples) about those tasks.

Until AI systems can adapt from smaller numbers of examples more efficiently, they will only be suited to very repetitive jobs and to ones where the occasional failure is tolerable.

The ability to accurately solve previously unknown or novel problems from relatively small amounts of data is known as the capacity to generalise, and it is widely considered a fundamental aspect of intelligence.

Grids and patterns

The ARC-AGI benchmark tests for sample-efficient adaptation using small grid square problems like the one depicted below. The AI must figure out the pattern that turns the grid on the left into the grid on the right.

Each question gives the AI system three examples to learn from; it must then work out the rules that generalise from those three examples to the fourth.
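A toy version of such a task can be sketched in code. This is a minimal illustration, not an actual ARC-AGI task: the candidate rules (flips, transpose) and the grids are invented for the example, and real ARC-AGI tasks draw on a far richer space of transformations.

```python
# A toy ARC-style task: infer a grid transformation from a few
# demonstration pairs, then apply it to a new test grid.
# These candidate rules are invented for illustration only.

CANDIDATE_RULES = {
    "identity": lambda g: g,
    "flip_horizontal": lambda g: [row[::-1] for row in g],
    "flip_vertical": lambda g: g[::-1],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}

def infer_rule(examples):
    """Return the name of the first rule consistent with every example pair."""
    for name, rule in CANDIDATE_RULES.items():
        if all(rule(inp) == out for inp, out in examples):
            return name
    return None

# Three demonstration pairs, as in an ARC-AGI question.
examples = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[5, 5], [0, 4]], [[5, 5], [4, 0]]),
    ([[7, 0], [0, 7]], [[0, 7], [7, 0]]),
]

rule = infer_rule(examples)                           # "flip_horizontal"
prediction = CANDIDATE_RULES[rule]([[9, 1], [2, 8]])  # [[1, 9], [8, 2]]
```

The point of the sketch is the shape of the problem: a handful of examples must pin down a rule that then transfers to an unseen input.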

These tests resemble cognitive tests often taken in school.

We don't know exactly how OpenAI has done it, but the results suggest the o3 model is highly adaptable: from just a few examples, it finds rules that generalise.

To figure out a pattern, we shouldn't make any unnecessary assumptions, or be more specific than we really have to be. In theory, if you can identify the "weakest" rules that do what you want, then you have maximised your ability to adapt to new situations.

What do we mean by weaker rules? The technical definition is complicated, but weaker rules are usually ones that can be described in simpler statements.

In the example above, a plain-English expression of the rule might be something like: any shape with a protruding line will move to the end of that line and cover up any other shapes it overlaps with.

Searching chains of thought

While we don't yet know how OpenAI achieved this result, it seems unlikely they deliberately optimised the o3 system to find weak rules. However, to succeed at the ARC-AGI tasks, it must be finding them.

We do know that OpenAI started with a general-purpose version of the o3 model, which differs from most other models in that it can spend more time "thinking" about difficult questions, and then trained it specifically for the ARC-AGI test.

French AI researcher Francois Chollet, who designed the benchmark, believes o3 searches through different "chains of thought" describing steps to solve the task, then chooses the "best" according to some loosely defined rule, or "heuristic."

This would be not unlike how Google's AlphaGo system searched through different possible sequences of moves to beat the world Go champion.

You can think of these chains of thought as programs that fit the examples. Of course, if o3 is like the Go-playing AI, then it needs a heuristic, or loose rule, to decide which program is best.

Thousands of different, seemingly equally valid programs could be generated. That heuristic could be "choose the weakest" or "choose the simplest."
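One way to picture a "choose the simplest" heuristic is a shortest-first program search, sketched below. The primitives, the enumeration strategy, and the example grid are all assumptions made for illustration; this is not a description of how o3 actually works.

```python
from itertools import product

# Sketch of a simplicity-biased program search: enumerate compositions
# of primitive grid operations shortest-first, and accept the first
# program consistent with all the examples. Illustrative only.

PRIMITIVES = {
    "flip_h": lambda g: [row[::-1] for row in g],  # mirror left-right
    "flip_v": lambda g: g[::-1],                   # mirror top-bottom
}

def run(ops, grid):
    """Apply a sequence of primitive operations to a grid."""
    for op in ops:
        grid = PRIMITIVES[op](grid)
    return grid

def search(examples, max_len=3):
    """Return the shortest sequence of primitives that fits all examples."""
    for length in range(1, max_len + 1):
        for ops in product(PRIMITIVES, repeat=length):
            if all(run(ops, inp) == out for inp, out in examples):
                return ops  # shortest-first enumeration = "choose the simplest"
    return None

# A 180-degree rotation, which a composition of two flips reproduces exactly.
examples = [([[1, 2], [3, 4]], [[4, 3], [2, 1]])]
program = search(examples)  # ("flip_h", "flip_v")
```

Because programs are enumerated in order of length, the first one that fits is also the simplest under this (crude) measure of complexity.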

If it is like AlphaGo, the heuristic itself could have been created by an AI: that is how AlphaGo worked, with Google training a model to rate different sequences of moves as better or worse than others.

What we still don't know

Is this really closer to AGI? If that is how o3 works, the underlying model might not be much better than previous models.

The concepts the model learns from language might not be any more generalisable than before; we may just be seeing a more generalisable chain of thought, found through the extra step of training a heuristic specialised for this test.

With OpenAI limiting its public disclosure to a few media presentations and early testing to a handful of researchers, laboratories, and AI safety institutes, almost everything about o3 remains unknown.

Fully understanding the potential of o3 will require extensive work: evaluations, an understanding of the distribution of its capabilities, and how often it fails and how often it succeeds.

When o3 is finally released, we will have a much better idea of whether it is roughly as adaptable as an average human.
