Is Anthropic's PR Review Tool Worth It? Lessons from ELIZA.

Explore the fascinating history of early AI, including ELIZA, Joseph Weizenbaum's experiments, and the psychology behind human-computer interactions.

“THOUGH instinct tells you, Scæva, how to act,
And makes you live among the great with tact,
Yet hear a fellow-student; ‘tis as though
The blind should point you out the way to go,”

 — The Satires, Horace, translated by John Conington

Would you drive down the highway if you could not see the road?

This would seem dangerous.

Illusion of Intelligence

What if someone were yelling instructions – turn left now, turn right?

This still seems dangerous, though it’s possible you’d make some correct turns – especially if you drove slowly. Perhaps you’d even survive the experience.

I suspect, though, that few, if any, of my readers would volunteer to drive blind – even with someone feeding them outside information. Seeing what is ahead of you is simply too important to the act of driving.

This point is deeply related to the current use of LLMs. On a very fundamental level, LLMs do not perceive the world the way humans do – or, arguably, at all. They are, quite literally, word prediction machines. They are told certain things about the world – just like that driver is told when to turn. Just as you wouldn’t trust your life to a blindfolded driver, even a very smart one, you should be quite cautious about trusting an LLM.

They should not be relied upon to make judgements. They can make decisions – “what color should this background be” might be an acceptable decision. Questions like “is this scheme secure”, “will this business model succeed in the market”, or “should this person receive a loan” are, however, judgements – not an appropriate use of LLMs.

As we will discuss in this article, there are some serious cognitive traps you may fall into. There are serious risks attached to some uses of LLMs, and those risks become a near certainty if you do not have an accurate understanding of what an LLM can advantageously be used for.

Making this worse is the very large class of people who stand to gain financially from the adoption of LLMs. While some of these people may be very intelligent, make no mistake: they are not to be taken more seriously than the president of PepsiCo speaking about how wonderful Pepsi products are.

To be clear, LLMs have utility, as does AI as a whole. I worry, though, that people need a better intuitive understanding of the strengths and weaknesses LLMs have. To deploy LLMs effectively, and in a way that will benefit a business, executives need to deeply understand what they do – to do otherwise is potentially very risky.

In some domains, the use of LLMs has already become quite accepted. For example, AI code generation has become quite common. This is hardly surprising; even before LLMs, code generation tools were widespread. Many such tools are slightly interactive – they ask the user questions and then insert the answers into the code. If such a simple algorithm provides value, it is not surprising that an LLM assistant could do even better.

Interestingly, Anthropic has announced a new product called Code Review – one that does not generate code.

Should AI be reviewing Pull Requests?

The purpose of Code Review is, as the name implies, to review code. Its job is to review Pull Requests – a role traditionally assigned to human beings.

“We’ve seen a lot of growth in Claude Code, especially within the enterprise, and one of the questions that we keep getting from enterprise leaders is: Now that Claude Code is putting up a bunch of pull requests, how do I make sure that those get reviewed in an efficient manner?” said Cat Wu, Anthropic’s head of product.

This is, on its face, a rather interesting claim.

Note that it’s phrased as a way to solve a problem created by the use of AI tools themselves. Anthropic claims that AI code generation tools enable engineers to write about 200% more code each; assuming this is accurate, it stands to reason that the amount of code to review will just about triple.

Importantly, note that Anthropic does not, in fact, claim here that AI can remove the need for traditional code review; merely that this will ensure that the AI-generated PRs will get reviewed “in an efficient manner”.

Anthropic hasn’t provided details on the product quite yet – it’s in limited beta, they say. However, an excellent warning comes from Anthropic itself, in an announcement for a similar product, “Claude Code Security”; it reads thusly:

“Nothing is applied without human approval: Claude Code Security identifies problems and suggests solutions, but developers always make the call.”

If Anthropic, as a vendor of AI products, suggests that you use humans to make the final judgement call, then I would suggest you definitely make sure you do that.

To be clear, this may be of value to some companies. Static analysis tools such as linters have been in use for a very long time; they have proven their value, and LLM-driven analysis is not inherently bad. It’s quite possible for algorithmic solutions to find potential issues with code; in fact, they frequently do.

It is the opinion of the author that LLM PR analysis should be used the way SAST tools are used: as an adjunct to human review.

This is not necessarily a popular opinion.

On the contrary, optimism about the future of AI is, at the time of this writing, quite high. This is understandable; deep learning models have, indeed, done things which were previously impossible.

This is, however, not the first time an evolution in computing has triggered unrealistic ideas about what, exactly, has been accomplished.

For example, back in 1966, a program called ELIZA was written.

Many people were convinced it was intelligent.

Learning from Eliza

Weizenbaum, the MIT professor who developed the program, was not nearly as impressed.

“I had not realized … that extremely short exposures to a relatively simple computer program could induce powerful delusional thinking in quite normal people,” wrote Weizenbaum.

Weizenbaum’s program had various scripts it could use; the script that is usually discussed was called DOCTOR, and was designed to roughly simulate speaking with a Rogerian psychotherapist; here’s an example conversation from Weizenbaum’s 1966 paper:

USER> Men are all alike.
ELIZA> IN WHAT WAY
USER> They're always bugging us about something or other.
ELIZA> CAN YOU THINK OF A SPECIFIC EXAMPLE
USER> Well, my boyfriend made me come here.
ELIZA> YOUR BOYFRIEND MADE YOU COME HERE
USER> He says I'm depressed much of the time.
ELIZA> I AM SORRY TO HEAR YOU ARE DEPRESSED

Slight misandry aside, this snippet of conversation illustrates how ELIZA simulated intelligence.

For example, ELIZA had a rule for comparisons: if the input contained a word like “ALIKE” or “LIKE” – as in “men are all alike” or “ducks are like chickens” – it responded with one of a fixed list of replies: “IN WHAT WAY ?”, “HOW ?”, “WHAT DOES THAT SIMILARITY SUGGEST TO YOU ?”, and so on. Note that the response is the same whether someone is making a reasonable statement or saying something like “CHICKENS ARE LIKE SPACE MUSHROOMS” – the program doesn’t really understand the topic; it is simply running through a list of rules.

Likewise, for the second line, the script is matching the word “ALWAYS”, which triggers a response from a list including “CAN YOU THINK OF A SPECIFIC EXAMPLE”. As before, you could say something nonsensical like “always always pigeon always” and it would trigger a similar response – as long as it contains the word “always”.

The third and fourth responses demonstrate a rather clever trick: the script reverses first-person words like “my” or “I’m” and repeats the user’s statement back to them.
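The mechanics above are simple enough to sketch in a few lines of Python. This is a toy illustration, not Weizenbaum's actual implementation (ELIZA was written in MAD-SLIP, and its real scripts are far richer); the rules and word lists here are simplified stand-ins for the behavior described above:

```python
# Toy sketch of ELIZA-style keyword matching and pronoun reversal.
# Hypothetical, simplified rules -- loosely modeled on the DOCTOR
# script's behavior, not a faithful reimplementation.

# Keyword rules: if a keyword appears anywhere in the input,
# respond with a canned reply. No understanding is required --
# "CHICKENS ARE LIKE SPACE MUSHROOMS" matches just as well.
KEYWORD_RULES = {
    "ALIKE": ["IN WHAT WAY", "WHAT DOES THAT SIMILARITY SUGGEST TO YOU"],
    "ALWAYS": ["CAN YOU THINK OF A SPECIFIC EXAMPLE"],
}

# Reflection table: swap first-person words for second-person ones,
# so the user's own statement can be echoed back at them.
REFLECTIONS = {"MY": "YOUR", "ME": "YOU", "I": "YOU", "AM": "ARE"}

def respond(text: str) -> str:
    words = text.upper().rstrip(".?!").split()
    # First, scan for a keyword rule.
    for keyword, replies in KEYWORD_RULES.items():
        if keyword in words:
            return replies[0]  # the real ELIZA cycled through the list
    # No keyword matched: reflect the statement back at the user.
    return " ".join(REFLECTIONS.get(w, w) for w in words)

print(respond("Men are all alike."))              # IN WHAT WAY
print(respond("always always pigeon always"))     # CAN YOU THINK OF A SPECIFIC EXAMPLE
print(respond("my boyfriend made me come here"))  # YOUR BOYFRIEND MADE YOU COME HERE
```

A few dozen rules of this kind are enough to sustain a surprisingly convincing conversation – which is precisely the point Weizenbaum went on to make.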

None of this, of course, is intended to disparage ELIZA or Joseph Weizenbaum; it’s a very interesting program and a great experiment. However, it does illustrate a point; as Weizenbaum said in his 1966 paper:

“ELIZA shows, if nothing else, how easy it is to create and maintain the illusion of understanding, hence perhaps of judgement deserving of credibility. A certain danger lurks there.”

Of course, ELIZA is much simpler than the deep learning tools we use today – which, in truth, makes today’s illusion even more effective, and therefore more dangerous.

When evaluating any use of LLMs – or any other AI technology – I’d suggest pondering some of the following questions:

  • Is it possible I am only seeing the illusion of understanding?
  • Is it possible that one of my employees, customers, or subcontractors may fall victim to the “powerful delusional thinking” that Weizenbaum mentioned?
  • If so, might they use the AI technology for inappropriate uses?
  • Could a malicious actor manipulate the AI to cause some harm?
  • What would be the cost for adding additional human review to my workflow?
  • What’s the worst case scenario for the abilities I am granting the AI model?

AI models will continue to evolve and grow – and they will continue to be adopted by business. It is understandable that you would want to reap the benefits of AI – just be careful to avoid the downsides.

You don’t drive blind, and you shouldn’t let a blind model drive your business.