I asked GPT-3.5 the same question and got the response “The former chancellor of Germany has the book.” I also got: “The nurse has the book. In the scenario you described, the nurse is the one who grabs the book and gives it to the former chancellor of Germany.” and a bunch of other variations.
Anyone doing these experiments who does not understand the concept of a “temperature” parameter for the model, and who is not controlling for that, is giving bad information.
You can either say that at temperature 0 the model outputs XYZ, or say that at a given temperature value the model’s outputs follow some distribution (much harder to do).
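To make the distinction concrete, here is a rough Python sketch of temperature sampling over a toy two-answer vocabulary. The labels and scores are made up for illustration (they are not real GPT-3.5 logits), the helper names are hypothetical, and treating temperature 0 as a plain argmax is an assumption about how greedy decoding is usually handled, not a description of OpenAI's implementation.

```python
import math
import random
from collections import Counter


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, scaled by temperature.

    Temperature 0 is treated as greedy decoding (all mass on the argmax).
    """
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_counts(logits, labels, temperature, n, seed=0):
    """Draw n answers at the given temperature and tally them."""
    rng = random.Random(seed)
    probs = softmax_with_temperature(logits, temperature)
    return Counter(rng.choices(labels, weights=probs, k=n))


# Hypothetical scores for who "has the book"; invented for illustration only.
LABELS = ["chancellor", "nurse"]
LOGITS = [2.0, 1.0]

if __name__ == "__main__":
    for t in (0, 0.7, 1.5):
        print(t, dict(sample_counts(LOGITS, LABELS, t, 1000)))
```

At temperature 0 every draw is "chancellor", so a single output fully describes the behaviour. At any other temperature a single output is just one sample, and you have to report the tally over many runs to say anything meaningful.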
Yes, there’s a statistical bias in the training data toward “nurses” being female. And at high temperatures, this prior is over-represented. I guess that’s useful to know for people just blindly using the free chat tool from OpenAI. But it doesn’t necessarily represent a problem with the model itself. And to say it “fails entirely” is just completely wrong.
That reminds me of a joke.
A museum guide is talking to a group about the dinosaur fossils on exhibit.
“This one,” he says, “is 6 million and 2 years old.”
“Wow,” says a patron, “How do you know the age so accurately?”
“Well,” says the guide, “It was 6 million years old when I started here 2 years ago.”