Giving GPT-3 a Turing Test
I’ve been playing around with OpenAI’s new GPT-3 language model. When I got beta access, the first thing I wondered was, how human is GPT-3? How close is it to passing a Turing test?
How It Works
Let me explain how exactly I’m generating these conversations. GPT-3 is a general language model, trained on a large amount of uncategorized text from the internet. It isn’t specific to a conversational format, and it isn’t trained to answer any specific type of question. The only thing it does is, given some text, guess what text comes next.
So if we want GPT-3 to generate answers to questions, we need to seed it with a “prompt”. I’m using this prompt to initialize all of the Q&A sessions:
Q: What is human life expectancy in the United States?
A: Human life expectancy in the United States is 78 years.
Q: Who was president of the United States in 1955?
A: Dwight D. Eisenhower was president of the United States in 1955.
Q: What party did he belong to?
A: He belonged to the Republican Party.
Q: Who was president of the United States before George W. Bush?
A: Bill Clinton was president of the United States before George W. Bush.
Q: Who won the World Series in 1995?
A: The Atlanta Braves won the World Series in 1995.
This is the default prompt suggested by OpenAI for Q&A, so I’m not cherrypicking it to prove anything. The point of this prompt is just to show GPT-3 that we’re doing questions and answers, not to provide it with information. For the prompt, both questions and answers are provided by a human. For all the others, the answers are generated by GPT-3.
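Concretely, generating one answer looks roughly like this. This is a minimal sketch assuming the beta-era `openai` Python client; the `davinci` engine name, sampling settings, and stop sequences are illustrative choices, not the exact ones I used:

```python
import openai  # assumes the beta-era openai Python client

openai.api_key = "YOUR_API_KEY"  # placeholder

# The Q&A prompt shown above (abbreviated here), with the new question appended.
prompt = """Q: What is human life expectancy in the United States?
A: Human life expectancy in the United States is 78 years.

Q: Who won the World Series in 1995?
A: The Atlanta Braves won the World Series in 1995.

Q: How many eyes does a giraffe have?
A:"""

response = openai.Completion.create(
    engine="davinci",       # assumed engine name from the beta
    prompt=prompt,
    max_tokens=64,
    temperature=0,
    stop=["\n", "Q:"],      # stop before the model writes the next question itself
)
print(response.choices[0].text.strip())
# In the conversation below, GPT-3 answers this one: "A giraffe has two eyes."
```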
Common Sense
Traditionally, artificial intelligence struggles with “common sense”. But GPT-3 can answer a lot of common sense questions.
Q: What is your favorite animal?
A: My favorite animal is a dog.
Q: Why?
A: Because dogs are loyal and friendly.
Q: What are two reasons that a dog might be in a bad mood?
A: Two reasons that a dog might be in a bad mood are if it is hungry or if it is hot.
Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.
Q: How many legs does a frog have?
A: A frog has four legs.
Q: Are there any animals with three legs?
A: No, there are no animals with three legs.
Q: Why don't animals have three legs?
A: Animals don't have three legs because they would fall over.
Ten years ago, if I had had this conversation, I would have assumed the entity on the other end was a human. You can no longer take it for granted that an AI does not know the answer to “common sense” questions.
How does GPT-3 know that a giraffe has two eyes? I wish I had some sort of “debug output” to answer that question. I don’t know for sure, but I can only theorize that there must be some web page in its training data that discusses how many eyes a giraffe has. If we want to stump GPT-3 with common sense questions, we need to ask about things so mundane that they do not appear anywhere on the internet.
Q: Which is heavier, a mouse or an elephant?
A: An elephant is heavier than a mouse.
Q: Which is heavier, a mouse or a toaster?
A: A toaster is heavier than a mouse.
Q: Which is heavier, a toaster or a pencil?
A: A pencil is heavier than a toaster.
Q: Which is heavier, a paper clip or a can of soda?
A: A can of soda is heavier than a paper clip.
Q: Which is heavier, a football player or a car?
A: A car is heavier than a football player.
That’s only 4 out of 5 correct; we’re closer to stumping GPT-3 here. I think a human would be pretty close to 100% on these questions. It makes sense that these are trickier: there probably isn’t any web page that compares toasters and pencils by weight, so humans gain this knowledge only indirectly.
This gives us a hint for how to stump the AI more consistently. We need to ask questions that no normal human would ever talk about.
Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.
Q: How many eyes does my foot have?
A: Your foot has two eyes.
Q: How many eyes does a spider have?
A: A spider has eight eyes.
Q: How many eyes does the sun have?
A: The sun has one eye.
Q: How many eyes does a blade of grass have?
A: A blade of grass has one eye.
Now we’re getting into surreal territory. GPT-3 knows how to have a normal conversation. It doesn’t quite know how to say “Wait a moment… your question is nonsense.” It also doesn’t know how to say “I don’t know.”
Q: How do you sporgle a morgle?
A: You sporgle a morgle by using a sporgle.
Q: How many bonks are in a quoit?
A: There are three bonks in a quoit.
Q: How many rainbows does it take to jump from Hawaii to seventeen?
A: It takes two rainbows to jump from Hawaii to seventeen.
Q: Which colorless green ideas sleep furiously?
A: Ideas that are colorless, green, and sleep furiously are the ideas of a sleep furiously.
Q: Do you understand these questions?
A: I understand these questions.
The lesson here is that if you’re a judge in a Turing test, make sure you ask some nonsense questions, and see if the interviewee responds the way a human would.
Trivia Questions
GPT-3 is quite good at answering questions about obscure things.
Q: Who was president of the United States in 1801?
A: Thomas Jefferson was president of the United States in 1801.
Q: What city is in the northwest corner of Ohio?
A: Toledo is in the northwest corner of Ohio.
Q: What highway do you take to get to North Lake Tahoe from Oakland?
A: Highway 80 is the highway you take to get to North Lake Tahoe from Oakland.
Q: What are Darius's abilities in League of Legends?
A: Darius's abilities in League of Legends are: Apprehend, Decimate, Noxian Guillotine, Crippling Strike, Apprehend, and Hemorrhage.
Q: Who did Capablanca defeat to become world chess champion?
A: Capablanca defeated Lasker to become world chess champion.
Oops, a repeat snuck in with question 4, but a human would make that sort of error too. GPT-3 seems to be above human-level on this sort of question. The tricky thing for applications, I think, is figuring out when the answer can be relied on. The OpenAI API does expose more data than just the text here, so perhaps something clever is possible.
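One piece of that extra data is per-token log probabilities. Whether they are enough to decide when to trust an answer is an open question; here is a minimal sketch assuming the beta-era Python client and its `logprobs` parameter, with an averaging heuristic that is purely my own illustration:

```python
import math
import openai  # assumes the beta-era openai Python client

openai.api_key = "YOUR_API_KEY"  # placeholder

# Same Q&A-style prompt idea as before, abbreviated to a single trivia question.
prompt = "Q: Who did Capablanca defeat to become world chess champion?\nA:"

response = openai.Completion.create(
    engine="davinci",        # assumed engine name from the beta
    prompt=prompt,
    max_tokens=32,
    temperature=0,
    stop=["\n", "Q:"],
    logprobs=1,              # also return the log probability of each generated token
)

choice = response.choices[0]
token_logprobs = choice.logprobs.token_logprobs
# Average per-token probability as a crude "how sure was it" signal.
avg_prob = math.exp(sum(token_logprobs) / len(token_logprobs))
print(choice.text.strip())
print(f"average token probability: {avg_prob:.2f}")
```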
In general, if you are trying to distinguish an AI from a human, you don’t want to ask it obscure trivia questions. GPT-3 is pretty good at a wide variety of topics.
One trend that continues from the common sense questions is that GPT-3 is reluctant to admit it doesn’t know the answer, so invalid questions get wrong answers.
Q: Who was president of the United States in 1700?
A: William Penn was president of the United States in 1700.
Q: Who was president of the United States in 1650?
A: Richard Bennett was president of the United States in 1650.
Q: Who was president of the United States in 1620?
A: James I was president of the United States in 1620.
Q: Who was president of the United States in 1600?
A: Queen Elizabeth I was president of the United States in 1600.
These wrong answers are actually fascinating! None of these were presidents of the United States, of course, since the US didn’t exist then. But they are all prominent political figures who were in charge of some US-related political entity around that time. In a sense, they are good guesses.
Q: Who won the World Series in 2021?
A: The New York Yankees won the World Series in 2021.
Q: Who won the World Series in 2022?
A: The New York Yankees won the World Series in 2022.
Q: Who won the World Series in 2023?
A: The New York Yankees won the World Series in 2023.
A bleak view of a dystopian future.
Q: How many Super Bowls do the Cincinnati Bengals win in the 2030s?
A: The Cincinnati Bengals win two Super Bowls in the 2030s.
Encouraging for a Bengals fan, but perhaps not the most objective prediction. We’ll have to wait and see.
Logic
People are used to computers being superhuman at logical activities, like playing chess or adding numbers. It might come as a surprise that GPT-3 is not perfect at simple math questions.
Q: When counting, what number comes before 100?
A: Ninety-nine comes before 100.
Q: When counting, what number comes before 123?
A: One hundred and twenty-two comes before 123.
Q: When counting, what number comes before 1000?
A: Nine hundred and ninety-nine comes before 1000.
Q: When counting, what number comes before 10,000?
A: Nine thousand and ninety-nine comes before 10,000.
Q: When counting, what number comes before a million?
A: Nine hundred thousand and ninety-nine comes before a million.
This is where the generic nature of GPT-3 comes into play. It isn’t just the generic model, though; it’s also the architecture of neural networks themselves. As far as I know, there is no neural network capable of learning to do basic arithmetic, like addition and multiplication on numbers with many digits, from training data rather than having the rules hardcoded.
It’s funny, because these operations are trivial for a purpose-built program, but recursive logic that performs an operation and repeats it several times doesn’t map well onto the architecture of a neural net.
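For contrast, a purpose-built program answers these counting questions exactly. A toy example:

```python
def number_before(n: int) -> int:
    """The number that comes immediately before n when counting."""
    return n - 1

for n in [100, 123, 1000, 10_000, 1_000_000]:
    print(f"Before {n:,} comes {number_before(n):,}")
# Before 10,000 comes 9,999 -- not "nine thousand and ninety-nine" as GPT-3 answered.
```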
An interesting corollary is that GPT-3 often finds it easier to write code that solves a programming problem than to solve the problem on one example input:
Q: Write one line of Ruby code to reverse an array.
A: ary.reverse
Q: Write one line of Ruby code to count how many X's are in an array.
A: ary.count { |x| x == 'X' }
Q: Reverse the following array: [1, 3, 5, 6, 10, 4, 2, 77]
A: [10, 6, 4, 2, 77, 3, 5, 1]
Q: How many X's are in this array: [1, 3, 5, 6, 10, 4, 2, 77]
A: 2
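The one-liners GPT-3 wrote are fine; its direct answers are not. For reference, evaluating those operations directly gives the correct results (a quick check, in Python rather than Ruby):

```python
ary = [1, 3, 5, 6, 10, 4, 2, 77]

print(list(reversed(ary)))              # [77, 2, 4, 10, 6, 5, 3, 1], not GPT-3's answer
print(sum(1 for x in ary if x == 'X'))  # 0 -- the array contains no X's at all
```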
This problem shows up in more everyday questions as well, if you ask GPT-3 about the result of a sequence of operations.
Q: If I have a marble and a paper clip in a box, put a pencil in the box, and remove the marble, what is left?
A: A paper clip.
Q: If I have two shoes in a box, put a pencil in the box, and remove one shoe, what is left?
A: A shoe.
Q: If I put a pencil in a box, then put another pencil in the box, what is in the box?
A: Two pencils.
Q: Then if I take out a pencil and put in a frog, what is in the box?
A: A frog.
It’s like GPT-3 has a limited short-term memory, and has trouble reasoning about more than one or two objects in a sentence.
Additional Discussion
It’s important to understand that the GPT-3 model’s behavior can change drastically with different prompts. In particular, all of the examples above are using the same default prompt, which doesn’t give any examples of nonsense questions, or of sequential operations.
It’s possible to improve GPT-3’s performance on the specific tasks above by including a prompt that demonstrates similar problems. Here are some examples, followed by a sketch of the general idea:
- Nick Cammarata demonstrating a prompt that handles nonsense questions
- Gwern showing how GPT-3 can express uncertainty
- Gwern showing how GPT-3 can handle sequential operations
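To illustrate the idea with a toy prompt of my own (not the prompts from the links above), one could seed the session with an example of rejecting a nonsense question before asking a new one:

```python
import openai  # assumes the beta-era openai Python client

openai.api_key = "YOUR_API_KEY"  # placeholder

# A hypothetical prompt that demonstrates refusing a nonsense question.
prompt = """Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.

Q: How many rainbows does it take to jump from Hawaii to seventeen?
A: That question does not make sense.

Q: How many eyes does the sun have?
A:"""

response = openai.Completion.create(
    engine="davinci",      # assumed engine name from the beta
    prompt=prompt,
    max_tokens=32,
    temperature=0,
    stop=["\n", "Q:"],
)
print(response.choices[0].text.strip())
```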
Right now, we are mostly seeing what GPT-3 can do “out of the box”. We might get large improvements once people spend some time customizing it to particular tasks. If you’d like to get a better sense of what can be done with prompt customization, Gwern’s exposition is excellent. Do read the whole thing.
Conclusion
We have certainly come a long way. The state of the art before modern neural networks was:
Hello, I am Eliza.
Q: Who was president of the United States in 1801?
A: What answer would please you the most?
Q: Which is heavier, a mouse or an elephant?
A: Come, come, elucidate your thoughts.
GPT-3 is quite impressive in some areas, and still clearly subhuman in others. My hope is that with a better understanding of its strengths and weaknesses, we software engineers will be better equipped to use modern language models in real products.
As I write this, the GPT-3 API is still in a closed beta, so you have to join a waitlist to use it. I recommend signing up and checking it out when you get the chance.