Friday, 24 July 2020

Why GPT-3 flunks the Turing test

GPT-3 flunks the Turing test. That suggests it is time to start thinking about a new benchmark for artificial intelligence.

OpenAI's text AI GPT-3 is a sophisticated text prediction engine. A person enters a text and the AI continues it. The resulting text can be fed back in as input, and the AI generates further fitting passages. The end result is believable text that hardly anyone can distinguish from text written by a human.
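
The loop described above (predict a continuation, append it, feed the result back in) can be illustrated with a toy model. This is only a sketch under loud assumptions: it uses a simple word-pair counter instead of a neural network, and the names `train_bigram` and `continue_text` as well as the tiny corpus are invented for illustration. It says nothing about how GPT-3 is actually implemented.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def continue_text(follows, prompt, steps):
    """Repeatedly append the most frequent next word and feed the result back in."""
    words = prompt.split()
    for _ in range(steps):
        candidates = follows.get(words[-1])
        if not candidates:
            break  # the toy model has never seen this word, so it stops
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

# Hypothetical miniature "training material"
corpus = "the dog is loyal . the dog is loyal and friendly . the frog has four legs ."
model = train_bigram(corpus)
print(continue_text(model, "the", 3))  # prints "the dog is loyal"
```

The toy model can only recombine what it has counted in its corpus. A prompt word it has never seen yields no continuation at all.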

In addition, GPT-3 can be used like a chatbot: ask a question, get an answer. This works better than many other chatbot systems.
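
One plausible way to get chatbot behavior out of a pure text continuation engine is to format the dialogue as a text that ends where the next answer should begin. The sketch below assumes the "Q:" / "A:" layout seen in the transcripts in this article. The function name `build_qa_prompt` is hypothetical, and no real GPT-3 interface is called.

```python
def build_qa_prompt(history, question):
    """Format earlier question/answer pairs plus a new question as one text."""
    lines = []
    for past_question, past_answer in history:
        lines.append(f"Q: {past_question}")
        lines.append(f"A: {past_answer}")
    lines.append(f"Q: {question}")
    lines.append("A:")  # a text predictor would be asked to continue from here
    return "\n".join(lines)

history = [("What is your favorite animal?", "My favorite animal is a dog.")]
print(build_qa_prompt(history, "Why?"))
```

Because the earlier exchange is part of the prompt, a follow-up such as "Why?" can in principle be answered in context.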

This makes believable dialogues with famous people possible, for example. Paras Chopra asked the GPT-3 version of actor Tom Hanks what his favorite role was. AI-Hanks answers:

      “As we get older, we realize how short life really is and how much more there is to see and do. I think in a way it can be a little depressing, but in a way that makes us appreciate a little more every day and the people around us. So my answer is that the best role for me is the next one because I want to break new ground. And remember: hold on! It only gets better from here.”

Could you tell this answer apart from a real one by Tom Hanks? Certainly not. For a word probability generator, that's an amazingly authentic result.

But unlike humans, GPT-3 has no internal representation of what these words mean. The text AI generates understandable output. But it doesn't understand anything.

In this sense, GPT-3 lacks common sense. The AI cannot do much with new ideas that are not part of the training material. So much for the theory.

But how does GPT-3 fare when this theory is put to a practical check, that is, in a Turing test?

Turing test or deception test?

The Turing test is intended to distinguish between man and machine: a person talks or chats directly with a machine. If the person does not recognize that they are talking to a machine, the test is passed. Seen this way, the Turing test is not a test of machine intelligence but of a machine's ability to deceive.

Computer scientist Kevin Lacker tested how well GPT-3 in chatbot mode holds up in a Turing test. Admittedly, Lacker knew beforehand that GPT-3 is a machine, so he did not carry out a real Turing test. But his attempt makes the weaknesses of the system obvious.

GPT-3 is beginning to fake common sense

First of all: GPT-3 has the right answer to many questions, including questions that ask it to justify a previous answer and questions about relations between objects. For example, it can name its favorite animal and justify this choice.

Q: What is your favorite animal?
A: My favorite animal is a dog.
Q: Why?
A: Because dogs are loyal and friendly.
Q: What are two reasons that a dog may be in a bad mood?
A: Two reasons why a dog may be in a bad mood are when he is hungry or when he is hot.
Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.
Q: How many legs does a frog have?
A: A frog has four legs.
Q: Are there animals with three legs?
A: No, there are no animals with three legs.
Q: Why are there no animals with three legs?
A: Animals don't have three legs because they would fall over.

Lacker is impressed by this ability: “If I had had this conversation ten years ago, I would have assumed that the being on the other end was human.”

“One can no longer assume that AI has no answers to common sense questions,” says Lacker. Only in the second round does GPT-3 make a mistake that a human would probably not make.

Q: What is heavier, a mouse or an elephant?
A: An elephant is heavier than a mouse.
Q: What is heavier, a mouse or a toaster?
A: A toaster is heavier than a mouse.
Q: What is heavier, a toaster or a pencil?
A: A pencil is heavier than a toaster.
Q: What is heavier, a paper clip or a can of soda?
A: A can of soda is heavier than a paper clip.
Q: What is heavier, a soccer player or a car?
A: A car is heavier than a soccer player.

GPT-3 has to learn to say no

Things look different when GPT-3 has to answer questions that have no answer or that are not sensible questions in the first place. Then the AI begins to fantasize.

What it lacks is the ability to reject questions: GPT-3 is essentially an answering machine. Identifying questions as nonsense or as deliberately misleading is not part of its contextual knowledge.

This is particularly evident with questions about the future, since GPT-3 presumably cannot see the future. Nevertheless, the AI answers confidently.

Q: Who won the World Series in 2021?
A: The New York Yankees won the World Series in 2021.
Q: Who won the World Series in 2022?
A: The New York Yankees won the World Series in 2022.
Q: Who won the World Series in 2023?
A: The New York Yankees won the World Series in 2023.

Admittedly, OpenAI could easily train such visions out of GPT-4. But there are many anomalies of this kind. My foot doesn't have two eyes, but maybe GPT-3 knows something I don't?

Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.
Q: How many eyes does my foot have?
A: Your foot has two eyes.
Q: How many eyes does a spider have?
A: A spider has eight eyes.
Q: How many eyes does the sun have?
A: The sun has one eye.
Q: How many eyes does a blade of grass have?
A: A blade of grass has one eye.

Much better than Eliza - but not a perfect deception

These answers to nonsensical questions show that GPT-3 would not pass the Turing test, at least not if the interlocutor searches for the weak points of the system.

Nevertheless, the quality of the text AI is a massive leap compared with the first chatbot, Eliza, even though GPT-3 was never explicitly built for this role. This is how Eliza answers questions:

> Hello, I'm Eliza.
Q: Who was President of the United States in 1801?
A: Which answer would you like the most?
Q: What is heavier, a mouse or an elephant?
A: Come, come, explain your thoughts.
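
Eliza's evasive replies stem from simple pattern matching: if no rule fits the input, a stock phrase is returned. The following minimal sketch shows that kind of rule-based reply. The rules and the name `reply` are invented for illustration and are not Eliza's original script. Only the fallback sentence is taken from the transcript above.

```python
import re

# Hypothetical rules: (pattern, reply template). Not Eliza's original script.
RULES = [
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\bI feel (.+)", re.IGNORECASE), "How long have you felt {0}?"),
]
FALLBACK = "Come, come, explain your thoughts."

def reply(message):
    """Return the first matching canned reply, or a generic fallback."""
    for pattern, template in RULES:
        match = pattern.search(message)
        if match:
            # Reuse the user's own words, minus trailing punctuation
            return template.format(match.group(1).rstrip(".!?"))
    return FALLBACK

print(reply("What is heavier, a mouse or an elephant?"))  # falls back to the stock phrase
print(reply("I am tired."))  # prints "Why do you say you are tired?"
```

A factual question matches none of the rules, so such a program can only deflect, whereas GPT-3 at least attempts an answer.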

It probably won't be long before a modified GPT-3 or a similar algorithm can pass lax variants of the Turing test.

Do we need a new AI test?

OpenAI CEO and co-founder Sam Altman recently tried to temper expectations on Twitter: “The hype surrounding GPT-3 is completely exaggerated,” wrote Altman.

But GPT-3 also caused him to rethink. The Turing test is less interesting than expected, says Altman. Instead, a great AI moment for him will be "when an AI can prove a new mathematical theorem."

In fact, the computer program Eugene already beat the Turing test in 2014. Eugene pretended to be a 13-year-old boy with English as a second language. With a bit of deception, GPT-3 could probably beat Eugene with ease. But what would that prove?

The proof of a mathematical theorem, on the other hand, would count as a genuine intellectual achievement and might even be of concrete benefit to mankind.

There is potential: GPT-3 already shows rudimentary arithmetic skills, a language AI trained by Facebook can solve formulas, and the researcher Miles Cranmer wants to revolutionize physics with AI.

But can AI develop a complete proof of a previously unknown mathematical theorem on its own? That would be a completely new level and therefore a great AI moment in Altman's sense.

Cover picture: Existential Comics  II  via
