- October 01, 2025
- By Zsana Hoskins M.Jour. ’25
Since 2012, self-described trivia nerd Jordan Boyd-Graber has pitted an artificial intelligence (AI) system he developed against humans in a series of quizbowl-style competitions. In more than a dozen matches, his AI technology has held the edge, beating the likes of “Jeopardy!” uber-champion (and now host) Ken Jennings as well as a group of national quizbowl winners.
These victories by the system called QANTA weren’t about bragging rights, said Boyd-Graber, a professor of computer science working in the University of Maryland’s Computational Linguistics and Information Processing (CLIP) Laboratory. Instead, the goal was to improve the evaluation of AI-infused systems by comparing how language models—which understand everyday language and respond with unscripted answers—and human trivia experts each answer a range of tough questions.
Now Boyd-Graber and several students in the CLIP Lab have shifted the focus of these competitions so that humans and computers are collaborating instead of facing off. The endgame? To test cooperation, trust and confidence between humans and machines by determining when the former can rely on the latter to come up with the right answer—and when humans should take the reins.
“One of the big things that we want to do in our research is not just replace humans with computers, but instead try to build systems where humans and computers can work together effectively,” said Boyd-Graber, who is active in the Artificial Intelligence Interdisciplinary Institute at Maryland (AIM). “The big challenge here is for the computer to be able to provide the correct information to the human that is going to be most valuable to them.”

Despite the game format, the research is relevant to the real world, he said; once-theoretical discussions about AI replacing human workers no longer have the ring of sci-fi, particularly in the manufacturing, data entry and customer service sectors.
In June, the UMD researchers ran a QANTA 2025 competition—which they dubbed a “human-computer cooperative AI tournament”—using the new format. They’ve scheduled a similar event for early 2026, featuring AI systems designed by Boyd-Graber’s students and quiz questions written by them, along with cash prizes of up to $200.
The competition uses the standard quizbowl format, where questions are asked aloud and contestants from two teams can “buzz in” at any time with a response. The humans on each side choose an AI teammate that can buzz in too. If it gets too aggressive with the buzzer on toss-up questions and flubs the answers, however, the humans can put it on “mute” for the rest of the game.
Next, the team that correctly answers a toss-up gets a bonus question; here, the human team members come up with an answer and then ask the AI for advice. The study measures the correctness of human and AI answers separately, as well as the correctness of the whole team—providing evidence about how much and when people should trust AI answers over their own, or vice versa.
While the CLIP team is still analyzing the data it collected from this summer’s competition, some patterns seemed to emerge. For instance, human-only teams were much more likely to buzz in before the full question was read if they felt they knew the answer, said Yoo Yeon Sung, a fifth-year doctoral student in the College of Information.
Humans know when they have the answer and can quickly decide when to buzz in, but AI models often lack this ability, Sung explained: computers not only fail to recognize when their answer is wrong, but also can’t say with confidence when it’s right.
“If a question has a lot of names, entities, countries, locations, AI is good at recalling information,” she said. “But when it comes to conceptualizing in a very indirect way, that becomes hard for AI-infused models.”
Another key point the CLIP team is considering is what “well-calibrated” confidence looks like in AI-generated answers; that is, if a weather model says it is 70% confident it will rain, then over many such predictions it should actually rain about 70% of the time.
This type of calibration could be especially relevant in high-stakes environments like medicine, Boyd-Graber said, where uncertainty can have serious consequences if it isn’t properly communicated. AI systems that can accurately judge how likely they are to be wrong would be more useful—and safer—than ones that can’t, he added.
“One challenge going forward is to figure out the cases where an AI can lead you astray, because that’s going to be important—not just for these silly trivia games—but for the real world,” he emphasized.
Although the CLIP researchers said they expected the competition’s human participants to over-rely on AI—particularly given its higher overall accuracy compared with the human players—that wasn’t always the case. They were surprised that skilled players often trusted their own knowledge over conflicting AI answers and were frequently correct.
The CLIP team’s competitions to improve AI question-answering systems address a growing concern with AI-infused models: that such systems are typically optimized to flatter and entertain users rather than truly help them.
“You can’t always get what you want from AI,” Boyd-Graber said. “But this research might find ways to make it so that you get what you need.”
Boyd-Graber—a self-described “trivia nerd”—is shown here with the late Alex Trebek, host of “Jeopardy!” Boyd-Graber competed on the show in 2018. (Photo courtesy of Sony Pictures Entertainment)