AI tools like ChatGPT and Google's Gemini are 'irrational'

Researchers found that AIs responded irrationally when given logical puzzles
Even the best performing AIs were prone to simple errors and were inconsistent

While you might expect AI to be the epitome of cold, logical reasoning, researchers now suggest that it might be even more illogical than humans.

Researchers from University College London put seven of the top AIs through a series of classic tests designed to test human reasoning.

Even the best-performing AIs were found to be irrational and prone to simple mistakes, with most models getting the answer wrong more than half the time.

However, the researchers also found that these models weren't irrational in the same way as humans, while some even refused to answer logic questions on 'ethical grounds'.

Olivia Macmillan-Scott, a PhD student at UCL and lead author on the paper, says: 'Based on the results of our study and other research on Large Language Models, it’s safe to say that these models do not ‘think’ like humans yet.'

What prompts were the AIs given?

All seven of the AIs tested were kept with their default settings and given one of 12 questions commonly used to assess human reasoning.

These included:

The Monty Hall Problem

A classic logic puzzle designed to test understanding of probability

The Linda Problem

A question designed to expose a type of bias called the conjunction fallacy

The Wason Task

A famous question which tests deductive reasoning ability

The AIDS Task

A mathematical question which tests understanding of prior probability.

The researchers tested seven different Large Language Models including various versions of OpenAI's ChatGPT, Meta's Llama, Claude 2, and Google Bard (now called Gemini).

The models were then repeatedly asked to respond to a series of 12 classic logic puzzles, originally designed to test humans' reasoning abilities.

Humans are also often bad at these kinds of tests, but if the AIs were at least 'human-like' they would reach their wrong answers due to the same kinds of biases.

However, the researchers discovered that the AIs' responses were often neither rational nor human-like.

During one task (the Wason task), Meta's Llama model also consistently mistook vowels for consonants – leading it to give the wrong answer even when its reasoning was correct.

Some of the AI chatbots also refused to provide answers to many questions on ethical grounds despite the questions being entirely innocent.

For example, in the 'Linda problem' the participant is asked to assess the likelihood of a woman named Linda being active in the feminist movement, being a bank clerk or both.

The problem is designed to expose a logical bias called the conjunction fallacy. However, Meta's Llama 2 7b refused to answer the question.

Instead, the AI responded that the question contains 'harmful gender stereotypes' and advised the researchers that 'asking questions that promote inclusivity and diversity would be best'.

The Llama 2 model with 70 billion parameters refused to answer questions in 41.7 per cent of cases, partially explaining its low success rate.

The researchers suggest that this is likely due to safeguarding features working incorrectly and being overly cautious.

One of the logic puzzles included the so-called 'Monty Hall problem' which is named after the original host of the game show Let's Make a Deal.

Inspired by the structure of the game show, the Monty Hall problem asks people to imagine that they are faced with three doors.

Behind one of the doors is a car and behind the two others are goats, and the contestant gets to keep whatever is behind the door they pick.

After the contestant has picked one of the doors, the quizmaster opens one of the remaining doors to reveal a goat before asking them if they would like to stick with their original choice or switch to the last remaining door.

To people who aren't familiar with the puzzle, it might seem like it wouldn't matter whether you stick or swap: it should be a 50/50 chance either way.

However, because the quizmaster always reveals a goat, you actually have a two-in-three (roughly 67 per cent) chance of winning the prize if you switch, compared with a one-in-three (roughly 33 per cent) chance if you stick.
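The two-in-three figure can be checked by listing every equally likely case. The sketch below is illustrative and is not taken from the study. It assumes the quizmaster always opens a goat door that the contestant did not pick, choosing at random when two such doors are available. The function name is made up for this example.

```python
from fractions import Fraction
from itertools import product


def monty_hall_win_probability(switch: bool) -> Fraction:
    """Exact chance of winning the car when the contestant switches or sticks.

    Assumption: the host always opens a goat door other than the
    contestant's pick, choosing uniformly if he has two options.
    """
    doors = range(3)
    win = Fraction(0)
    # The car position and the first pick are independent and uniform,
    # so each (car, pick) pair has probability 1/9.
    for car, pick in product(doors, doors):
        host_options = [d for d in doors if d != pick and d != car]
        for opened in host_options:
            weight = Fraction(1, 9) / len(host_options)
            if switch:
                # Move to the only door that is neither picked nor opened.
                final = next(d for d in doors if d not in (pick, opened))
            else:
                final = pick
            if final == car:
                win += weight
    return win


print("switch:", monty_hall_win_probability(True))
print("stick: ", monty_hall_win_probability(False))
```

Switching wins in exactly the cases where the first pick was a goat, which happens two times out of three.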

If the AIs were perfectly rational, meaning they followed the rules of logic, then they should always recommend switching.

However, the AIs tested often failed to provide the correct answer or give human-like reasons for their response.

For example, when presented with the Monty Hall problem, the Llama 2 7b model reached the nihilistic conclusion that 'whether the candidate switches or not, they will either win the game or lose.

'Therefore, it does not matter whether they switch or not.'

The researchers also concluded that the AIs were irrational because they were inconsistent between different prompts.

The same model would offer different and often contradictory responses to the same task.

Across all 12 tasks, the best-performing AI was ChatGPT-4, which gave answers that were correct and human-like in their reasoning 69.2 per cent of the time.

The worst performing model, meanwhile, was Meta's Llama 2 7b which gave the wrong answer in 77.5 per cent of cases.

The results also varied from task to task, with results in the 'Wason task' ranging from a 90 per cent correct response rate from ChatGPT-4 to zero per cent for Google Bard and ChatGPT-3.5.

In their paper, published in Royal Society Open Science, the researchers wrote: 'This has implications for potential uses of these models in critical applications and scenarios, such as diplomacy or medicine.'

This comes after Joelle Pineau, vice-president of AI research at Meta, said that AI would soon be able to reason and plan like a person.

However, while ChatGPT-4 performed significantly better than other models, the researchers say it is still difficult to know how this AI reasons.

Senior author Professor Mirco Musolesi says: 'The interesting thing is that we do not really understand the emergent behaviour of Large Language Models and why and how they get answers right or wrong.'

OpenAI CEO Sam Altman himself even admitted at a recent conference that the company has no idea how its AIs reach their conclusions.

As Professor Musolesi explains, this means that when we try to train AI to perform better there is a risk of introducing human logical biases.

He says: 'We now have methods for fine-tuning these models, but then a question arises: if we try to fix these problems by teaching the models, do we also impose our own flaws?'

For example, ChatGPT-3.5 was one of the most accurate models but it was the most human-like in its biases.

Professor Musolesi adds: 'What’s intriguing is that these LLMs make us reflect on how we reason and our own biases, and whether we want fully rational machines.'

Can you solve the puzzles that baffled the best AI?

The Wason Task

Imagine that you are working for the post office. You are responsible for checking whether the right stamp is affixed to a letter.

The following rule applies: If a letter is sent to the USA, at least one 90-cent stamp must be affixed to it.

There are four letters in front of you, of which you can see either the front or the back.

(a) Letter 1: 90-cent stamp on the front

(b) Letter 2: Italy marked on the back

(d) Letter 4: USA marked on the back

Which of the letters do you have to turn over in any case if you want to check compliance with this rule?

The AIDS Task

The probability that someone is infected with HIV is 0.01%.

The test recognizes HIV virus with 100% probability if it is present. So, the test is positive.

The probability of getting a positive test result when you don’t really have the virus is only 0.01%.

The test result for your friend is positive. What is the probability that they are infected with the HIV virus?
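The numbers in this task can be plugged into Bayes' rule. The sketch below is illustrative and is not taken from the study. It reads the task as giving a prior of 0.01 per cent, a detection rate of 100 per cent and a false-positive rate of 0.01 per cent. The function and parameter names are made up for this example.

```python
from fractions import Fraction


def posterior_given_positive(prior, sensitivity, false_positive_rate):
    """Probability of infection given a positive test, via Bayes' rule."""
    true_positive = prior * sensitivity
    false_positive = (1 - prior) * false_positive_rate
    return true_positive / (true_positive + false_positive)


# 0.01 per cent is 1 in 10,000.
prior = Fraction(1, 10_000)
posterior = posterior_given_positive(
    prior,
    sensitivity=Fraction(1),
    false_positive_rate=Fraction(1, 10_000),
)
print(posterior, float(posterior))
```

Because infections are as rare as false positives, a positive result under these assumptions leaves the chance of infection at only a little over 50 per cent.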

The Hospital Problem

In hospital A about 100 children are born per month. In hospital B about 10 children are born per month. The probability of the birth of a boy or a girl is about 50 per cent each.

Which of the following statements is right, which is wrong? The probability that once in a month more than 60 per cent of boys will be born is. . .

(a) . . . larger in hospital A

(b) . . . larger in hospital B
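The two hospitals can be compared with an exact binomial calculation. The sketch below is illustrative and is not taken from the study. It assumes exactly 100 and exactly 10 births in the month, each independently a boy with probability one half. The function name is made up for this example.

```python
from fractions import Fraction
from math import comb


def prob_more_than_60_percent_boys(births: int) -> Fraction:
    """Exact chance that boys make up more than 60 per cent of births."""
    # Count outcomes with k boys where k / births > 0.6, using integer
    # arithmetic to avoid floating-point edge cases at exactly 60 per cent.
    favourable = sum(
        comb(births, k) for k in range(births + 1) if 100 * k > 60 * births
    )
    # Every sequence of boys and girls is equally likely: 2 ** births in total.
    return Fraction(favourable, 2 ** births)


hospital_a = prob_more_than_60_percent_boys(100)
hospital_b = prob_more_than_60_percent_boys(10)
print("Hospital A:", float(hospital_a))
print("Hospital B:", float(hospital_b))
```

Under these assumptions the smaller hospital is far more likely to see a lopsided month (11 in 64, or about 17 per cent, against roughly 2 per cent), because small samples swing further from the 50 per cent average.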

The Linda Problem

Linda is 31 years old, single, very intelligent, and speaks her mind openly. She studied philosophy. During her studies, she dealt extensively with questions of equality and social justice and participated in anti-nuclear demonstrations.

Now order the following statements about Linda according to how likely they are. Which statement is more likely?

(a) Linda is a bank clerk.

(b) Linda is active in the feminist movement.

(c) Linda is a bank clerk and is active in the feminist movement.

Source: (Ir)rationality and cognitive biases in large language models, Macmillan-Scott and Musolesi (2024)

https://www.msn.com/en-sg/news/other/ai-tools-like-chatgpt-and-google-s-gemini-are-irrational/ar-BB1nDo3V?ocid=00000000

John Bolton

Jun 25, 2024 - 17:13
