GPT-4 Passes the Bar Exam — Published by the Royal Society

Michael Bommarito

On April 15, 2024, our paper “GPT-4 passes the bar exam” was published in Philosophical Transactions of the Royal Society A. That sentence does a lot of work. It is part academic milestone, part public headline, and part reminder that AI capability claims are only as good as the test behind them.

And yes, the headline is a little irresistible. “GPT-4 passes the bar exam” has the kind of punch that gets repeated in boardrooms, law schools, Slack channels, and the usual internet venues where people either declare the end of the profession or the end of civilization. But the more important question is not whether the model “won” in some abstract sense. The more important question is: what exactly was measured, how was it measured, and what does that mean for actual deployment?

What the paper actually showed

The paper, co-authored by Daniel M. Katz, Michael J. Bommarito, Shang Gao, and Pablo Arredondo, evaluates GPT-4 on the entire Uniform Bar Exam (UBE). That includes the multiple-choice Multistate Bar Examination (MBE), the essay-based Multistate Essay Examination (MEE), and the Multistate Performance Test (MPT). In other words, this was not a toy task, a cherry-picked prompt, or a benchmark assembled to make a press release sparkle.

The results were striking. GPT-4 significantly outperformed prior GPT generations, including ChatGPT, posting roughly a 26% increase over ChatGPT on the MBE. It also beat human test-takers in five of the seven MBE subject areas. On the essay and performance components, which scholars had not previously evaluated this way, the model scored an average of 4.2 out of 6.0. When the full exam is graded the way a human candidate's would be, the model lands at approximately 297 points, comfortably above the passing threshold in every UBE jurisdiction.

That is the part people like to repeat. Fair enough. It is genuinely noteworthy.
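Since “comfortably above the passing threshold in every UBE jurisdiction” is ultimately arithmetic, a minimal sketch makes the claim concrete. The cut scores below are illustrative assumptions, not authoritative values; verify them against current NCBE data before relying on them.

```python
# Minimal sketch: comparing the paper's reported UBE score to jurisdiction
# cut scores. The cut scores here are illustrative assumptions; verify
# against current NCBE data.
GPT4_UBE_SCORE = 297  # approximate total reported in the paper

ILLUSTRATIVE_CUT_SCORES = {
    "lowest-cut UBE jurisdictions": 260,
    "mid-range UBE jurisdictions": 270,
    "highest-cut UBE jurisdictions": 280,
}

for band, cut in ILLUSTRATIVE_CUT_SCORES.items():
    margin = GPT4_UBE_SCORE - cut
    print(f"{band}: cut {cut}, margin {margin:+d} -> "
          f"{'pass' if margin >= 0 else 'fail'}")
```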

But the more interesting part is the structure of the claim itself. The paper is not saying “AI is a lawyer.” It is saying something much more useful and much more dangerous: a general-purpose model can perform surprisingly well on a structured legal reasoning benchmark when the benchmark is defined clearly enough. That distinction matters a great deal.

Why the bar exam matters more than the headline

Legal work is full of tasks that sound simple and are not. Read the record. Spot the issue. Apply the rule. Draft the memo. Find the exception to the exception that somehow governs the exception you just found. Humans spend years learning how to do this because the work is messy, contextual, and full of traps.

So when a model performs well on the bar exam, the result is not just a fun trivia point. It is evidence that the model can interpret dense text, recognize legal structure, and work across different question formats with enough consistency to matter. That changes the conversation from “Can AI do legal work?” to “Which legal work, under what controls, with what error profile?”

That is the question that actually belongs in a risk committee meeting.

It is also why benchmark design matters so much. A benchmark is not a prophecy. It is a measurement instrument. If you use the wrong instrument, you get confident nonsense. If you use a good one, you get uncomfortable clarity.

If you are buying, building, or approving AI systems for legal use, the lesson is not that bar-exam performance solves the problem. It does not. A model that can score well on the UBE can still hallucinate, miss context, mishandle jurisdictional nuance, or produce a polished answer to the wrong question. That combination is not a feature. It is a liability with good grammar.

So what should you do with this result?

First, treat benchmark performance as input, not proof. A vendor demo is a sales artifact. A benchmark result is a signal. Neither one is enough on its own.

Second, ask what the model was actually tested on. Was it zero-shot or heavily guided? Was the evaluation reproducible? Did the benchmark cover the tasks you care about, or just the tasks that are easiest to package into a conference talk? (A minimal sketch of a reproducible harness follows this list.)

Third, connect capability to governance. If your team is deploying AI into legal workflows, you need AI audits, model governance, and board-level AI education that explain where the system is strong, where it is brittle, and where humans must stay in the loop. A model that aces a benchmark is not thereby ready for unsupervised use in a regulated workflow, no matter how clean the demo looked.

Fourth, for buyers and investors, this is due diligence territory. If a company claims legal AI capability, the question is not “Does it have a chatbot?” The question is “What data trained it, what tests validate it, what fails it, and how are those failures managed?” That is the difference between marketing and technology diligence.
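On the second question, a reproducible evaluation does not need to be elaborate. Below is a minimal sketch of a zero-shot multiple-choice harness in Python; ask_model is a hypothetical stand-in for whatever system you are testing, and the sample item is invented for illustration.

```python
# Minimal sketch of a reproducible multiple-choice evaluation harness.
# ask_model is a hypothetical stand-in for the system under test; the point
# is fixed prompts, a fixed answer key, and one logged accuracy number.
from typing import Callable

def evaluate(ask_model: Callable[[str], str], items: list[dict]) -> float:
    """Each item holds a 'prompt' (question plus lettered choices) and an
    'answer' (the correct letter, e.g. 'B')."""
    correct = 0
    for item in items:
        reply = ask_model(item["prompt"]).strip().upper()
        correct += reply[:1] == item["answer"]  # grade on the first letter
    return correct / len(items)

# Usage: a deterministic fake model makes the harness itself auditable.
sample = [{"prompt": "Which writ compels a lower court or official to act? "
                     "(A) certiorari (B) mandamus", "answer": "B"}]
print(evaluate(lambda prompt: "B", sample))  # -> 1.0
```

Same questions, same grading rule, same logged number on every run: that is what makes a capability claim checkable instead of anecdotal.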

Why we care about this work

At licens.io, we spend a lot of time on AI governance, compliance, privacy, security, and technology diligence. That work is not separate from research like this. It is downstream from it.

Our team does not just consult on AI. We publish research that helps define how AI capabilities are measured in the first place. That depth matters when you are doing model assessments, evaluating legal AI vendors, or trying to explain to a board why “it passed a benchmark” is not the same thing as “it is safe to deploy.”

We have seen this pattern before in other parts of technology. First comes the headline. Then comes the productization. Then comes the regret, usually after someone assumed the benchmark was the same thing as reality. Like most risks, this one does not go away when we ignore it.

The practical takeaway

The real significance of “GPT-4 passes the bar exam” is not that it settled the lawyer-vs-machine debate. It did something more useful: it moved the debate onto firmer ground.

It showed that AI capability claims can be measured rigorously. It showed that legal reasoning benchmarks can be built and evaluated seriously. And it showed that the gap between “impressive” and “deployable” is still where the hard work lives.

That gap is where governance lives. It is where compliance lives. It is where technical due diligence lives. It is also where the next round of AI products will either earn trust or squander it.

The headline is the headline. The benchmark is the work.
