AI is already 10x-ing academic research. How do we get to 100x?
Agentic AI could unlock social science abundance
“Intelligence Age” is a series from the Roots of Progress Institute featuring reported essays that extrapolate the capabilities of AI systems along current trend lines.
In our second feature, Stanford University political economist Andy Hall explains how AI has already changed the way he and his team conduct social science research and how academics might increase knowledge generation 100-fold in the near future.
“Intelligence Age” is made possible by a grant from OpenAI. The Roots of Progress Institute maintains editorial independence over the project. We thank OpenAI for its support.
I’ve spent the last two months building a new lab centered around using AI agents to accelerate our research. I’ve hired fellows from all over the world, from the U.S. and the U.K., from Rwanda, Singapore, and Japan. Each fellow has a subscription to Claude Code—Anthropic’s AI coding tool—and a mandate to study specific opportunities and challenges in governance and politics posed by the rapid acceleration of AI.
The rate of progress in just two months has been astonishing. One fellow built software to study how different AI models recommended that Japanese voters cast their ballots in the recent national elections. We found that the models recommend the Japanese Communist Party to left-wing voters at inordinately high rates, probably because the Communist Party operates an online “newspaper” that AI can access, while major media outlets in Japan block access.
Another team of fellows built an entire web system to convert data from prediction markets into reliable information for news outlets to cite. The project even takes into account the risks of market manipulation and price fragility. And there are half a dozen more projects underway that I couldn’t have staffed before: automated pipelines for legislative policy drafting, analyses of how AI companies study model safety over time, and agentic loops for geopolitical forecasting.
My PhD students and I wrote a new study examining how AI agents perform statistical analyses—and whether they fall victim to the human urge to “p-hack,” that is, to torture the data until it yields “statistically significant” findings. (The answer: in our tests, the agents were surprisingly responsible, and even scolded us for trying to p-hack; but they could be jailbroken easily.)
Any one of these projects would have been extremely difficult to carry out a year ago, requiring intensive focus over many months. Completing multiple ambitious public-impact projects in a two-month period would have been completely unthinkable.
Something fundamental is changing in how we generate knowledge. I want to explain what I’m already seeing, where I think it’s going, and what it will take to build the institutions that can capitalize on this moment. The goal shouldn’t be to write 100x the number of papers; it should be to generate 100x the amount of knowledge.
We’re already 10xing research
To understand exactly why AI is already accelerating social scientists’ research so dramatically, let me walk you through one of my projects in detail. Earlier this year, I uploaded my published 2020 study on vote-by-mail policy in California, Utah, and Washington to Claude. The study examines whether switching to universal vote-by-mail—where every registered voter is automatically sent a ballot—affects turnout and partisan vote share. Counties in these three states adopted the policy at different times, creating a natural experiment.
I then asked Claude to replicate the findings and extend the analysis with new election data. Claude Code wrote Python scripts to run difference-in-differences regressions to estimate the causal effect of the policy, just like we had in our original paper. It scraped county-level election results from the California Secretary of State, the Utah Lieutenant Governor’s office, and the Washington Secretary of State, and pulled Census voting-age population data from the American Community Survey. It identified the specific election in which each county first adopted universal vote-by-mail, merged the new data with the original 1996–2018 panel, ran the analyses, produced tables and figures, and wrote a first draft of the paper.
All twelve coefficients from the original study’s main tables replicated exactly—indicating that Claude was able to automatically verify the original research. The extension added new election cycles and found that vote-by-mail increases turnout by about two percentage points but has no systematic effect on Democratic vote share. The entire project—data collection, coding, analysis, and write-up—took under an hour. In contrast, the original paper took us several months.
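To make the method concrete, here is a minimal sketch of the kind of two-way fixed-effects difference-in-differences regression involved, run on synthetic data. Everything in it (the variable names, the county counts, the noise levels) is illustrative, not drawn from the actual study or its replication files.

```python
import numpy as np

def did_twfe(outcome, county, year, treated):
    """Two-way fixed-effects difference-in-differences: regress the outcome
    on a treatment dummy plus county and year dummies, and return the
    coefficient on treatment."""
    X = [treated.astype(float)]
    X += [(county == c).astype(float) for c in np.unique(county)[1:]]
    X += [(year == y).astype(float) for y in np.unique(year)[1:]]
    X.append(np.ones(len(outcome)))  # intercept (one dummy dropped per set)
    beta, *_ = np.linalg.lstsq(np.column_stack(X), outcome, rcond=None)
    return beta[0]

# Synthetic panel: 30 counties observed over 12 elections, staggered
# adoption of universal vote-by-mail, true turnout effect of +2 points.
rng = np.random.default_rng(0)
county = np.repeat(np.arange(30), 12)
year = np.tile(np.arange(12), 30)
adopt = rng.integers(4, 12, size=30)          # each county's adoption election
treated = (year >= adopt[county]).astype(int)
turnout = (50 + rng.normal(0, 3, 30)[county]  # county fixed effects
           + 0.5 * year                        # common time trend
           + 2.0 * treated                     # the causal effect we recover
           + rng.normal(0, 1, len(year)))
print(round(did_twfe(turnout, county, year, treated), 1))
```

The design compares counties that had switched to universal vote-by-mail against counties that had not yet, netting out county-specific levels and election-specific shocks; staggered adoption is what makes this a natural experiment rather than a simple before-and-after comparison.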
A PhD student at UCLA then audited every line against a fully manual replication. While the student found some mistakes, the correlation between Claude’s data and the hand-collected ground truth was above .99.
I’m far from alone in using AI to scale my work this way. “I now use it to handle all of the bullshit work,” said Joshua Gans, a professor of strategic management at the University of Toronto who spent 2025 going AI-first in his research, working his way through a backlog of paper ideas at a pace that would have been impossible a year before.
And this isn’t only about empirical work that requires statistical code. Yascha Mounk, a political philosopher at Johns Hopkins, asked Claude to help him write a political theory paper. He gave one round of high-level feedback per section—for instance, pushing Claude away from citing John Stuart Mill’s more famous writing and toward more obscure sources, such as published letters—and had a finished draft in under two hours. His verdict: it could, with minor revisions, be published by a serious journal.
These examples are about individual researchers working faster on individual papers. But the transformation doesn’t stop there. People are now building systems that automate entire stages of the research pipeline—generating, evaluating, and replicating research at scales no human team could match.
My Stanford colleague Yiqing Xu and Leo Yang have built an agentic AI workflow that automates large-scale replication of empirical studies. The system separates scientific reasoning from computational execution. Researchers design fixed diagnostic templates that specify which checks to run, and the workflow handles everything else—acquiring replication packages from journals, harmonizing heterogeneous code and data formats, and executing standardized diagnostics across dozens of studies. Previous projects of comparable scope took their team three to four years of sustained effort; this workflow compresses that timeline dramatically.
New tools are also transforming how research gets reviewed before it’s ever submitted. Refine.ink, built by the economists Yann Calvó López and Ben Golub, devotes hours of compute to reading an academic paper the way a careful referee would. It cross-references tables against the text to check for inconsistencies. It follows the logic of proofs step by step, flagging incomplete justifications and notation errors. It checks whether the claims in the abstract actually match the results in the body.
When John Cochrane, a prominent financial economist and my colleague at the Hoover Institution, ran his 80-page inflation booklet through Refine, he said the comments were on par with the best referee reports he’d received in his entire career. The tool caught a sign error in the solution of a differential equation. It identified places where his argument about long-term debt mechanisms was spread across too many sections instead of being stated cleanly. “This is the first time I’ve seen AI at work in something I do daily,” Cochrane wrote, “and it really is remarkable.”
The most ambitious efforts aim to automate the research process end-to-end. Project APE, run by the economist David Yanagizawa-Drott at the University of Zurich’s Social Catalyst Lab, is an open experiment in fully autonomous policy evaluation. The premise: there are millions of policies enacted by governments around the world, and only a tiny fraction are ever rigorously evaluated, because each study takes months or years of PhD-trained economist time.
APE’s autonomous pipeline attempts to produce original empirical research papers using public data from scratch. It identifies a policy question, finds relevant datasets, writes code to run causal inference analyses, and produces a complete paper—which then enters a tournament where it’s scored against human-written papers forthcoming in journals like the American Economic Review. Everything is public: the papers, the code, the data, the results. The question APE is trying to answer is whether rigorous causal inference can be automated at all, or whether it requires a kind of judgment AI doesn’t yet have. Yanagizawa-Drott’s guess is that full automation comes sooner than most expect.
Outside the social sciences, this acceleration is even further along. Bridgewater Associates’ AIA Labs has built a multi-agent forecasting system in which multiple AI agents independently research a question, a supervisor agent reconciles their disagreements, and a statistical calibration step corrects for known LLM biases—producing forecasts that match the performance of human superforecasters.
And in machine learning itself, the frontier is moving toward fully autonomous experimentation. Andrej Karpathy—the former Tesla AI lead and OpenAI cofounder who coined the term “vibe coding”—recently open-sourced a project he called AutoResearch. You write a research strategy in a plain-text markdown file: what to explore, what constraints to respect, and when to stop. An AI agent reads the strategy, modifies a training script, runs a five-minute experiment on a single GPU, evaluates whether the result improved, and either commits the change or reverts it. Then it tries something else. The loop runs continuously, unattended—roughly twelve experiments per hour, a hundred overnight.
Shopify’s CEO tried AutoResearch on an internal model overnight, running 37 experiments and generating 93 commits to Liquid, the templating engine that powers Shopify. AutoResearch works because machine learning has a clean, objective metric—validation loss goes down, or it doesn’t. Porting this approach to the social sciences, where the quality of a research question and the validity of a causal design require human judgment, is a much harder problem. But the loop of propose, execute, evaluate, and iterate is being automated, and the social sciences will not be exempt.
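That propose-execute-evaluate-commit loop can be sketched in a few lines. To be clear, this is a toy stand-in, not Karpathy's code: the "experiment" here is a cheap analytic function rather than a five-minute GPU run, and the "proposal" mutates a parameter dictionary rather than editing a training script.

```python
import random

def run_experiment(params):
    """Stand-in for a short training run: returns a 'validation loss.'
    A toy analytic objective; the real loop would launch a GPU job."""
    return (params["lr"] - 0.01) ** 2 + (params["width"] - 256) ** 2 / 1e6

def auto_research(steps=100, seed=0):
    """Propose -> execute -> evaluate -> commit or revert, unattended."""
    rng = random.Random(seed)
    params = {"lr": 0.1, "width": 64}
    best = run_experiment(params)
    for _ in range(steps):
        candidate = dict(params)
        # propose: mutate one knob (the real agent edits a training script)
        if rng.random() < 0.5:
            candidate["lr"] *= rng.choice([0.5, 2.0])
        else:
            candidate["width"] = int(candidate["width"] * rng.choice([0.5, 2.0]))
        loss = run_experiment(candidate)    # execute
        if loss < best:                     # evaluate
            params, best = candidate, loss  # commit the improvement
        # otherwise revert: keep the previous params
    return params, best

params, loss = auto_research()
print(params, round(loss, 5))
```

The entire strategy lives in which knobs get mutated and what counts as improvement; everything else runs without a human in the loop, which is exactly why a clean objective metric matters so much.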
The consequences of all this are already being felt. Individual researchers are producing more, faster. The bar for what constitutes an impressive paper is rising—a competent-looking “normal” empirical study won’t awe anyone now, when the tools to produce one are available to anyone with a laptop and an API key.
What people are looking for now, I think, is deeper insight, greater ambition, more thorough robustness, and genuine replicability. And there is enormous uncertainty about how existing institutions will adapt: how journals will cope with the flood of submissions, how tenure committees will evaluate candidates, and whether the old gatekeeping structures make any sense at all in this new world.
Towards the 100x research institution
When I first started using AI to accelerate my research, I thought it might lead to smaller labs with fewer human researchers and more agents. But that’s not how it’s played out so far, for me at least.
At first, I spent a long time working directly with Claude Code. I still do that. But the more I’ve done it, the more it’s become clear to me that having a human come up with ideas, apply judgment, and guide Claude is essential. To scale the work, I realized I therefore needed more humans, not fewer. And that’s how my lab has now grown to include more than 10 research fellows, all overseeing their own versions of Claude.
How can we leverage this powerful new technology, in combination with human researchers, to create 100x the knowledge, and not just 100x the number of papers that no one ever reads, cites, or builds on? I see roughly three layers to the opportunity, based on my experiences so far.
Developing quantitative benchmarks
In 2006, Netflix offered a million-dollar prize to anyone who could improve its recommendation algorithm by 10%. The prize attracted thousands of teams worldwide and helped catalyze new progress in machine learning. The money certainly helped, but the precision of the target changed everything. A fuzzy goal to improve the customer experience by making better recommendations turned into something testable, which could be scored and iterated upon.
AI agents thrive on exactly this kind of problem. Give them a well-defined score to optimize, and they can make autonomous progress—testing approaches, iterating, and improving with little human oversight. This is the heart of Karpathy’s AutoResearch loop, too. Without a benchmark, agents need constant human guidance. With one, they can explore the solution space on their own.
Many of the most fundamental questions in social science don’t work this way, and never will. Why do democracies persist? What makes institutions legitimate? How does culture shape economic development? These are interpretive, theoretical, deeply human questions. AI won’t fully solve these questions or replace the scholars who wrestle with them, and we shouldn’t want it to.
But some important questions could have quantitative benchmarks. Predicting election outcomes much more reliably, for instance. Or predicting how users will evaluate political bias in AI model outputs. Or forecasting the downstream effects of specific policy changes. For questions like these, we could define clear scoring functions, publish open datasets, and invite both humans and AI agents to compete.
Think of it as a set of “open problems” for the social sciences—not replacing the field’s depth, but creating a new track where progress is measurable and cumulative. Prediction markets already provide a version of this for political forecasting. Academic forecasting tournaments like those run by IARPA have done something similar for geopolitics. We should generalize the idea: identify the questions where quantitative benchmarks are possible, formalize them, and let the agents loose.
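To see how simple a scoring function can be, here is a Brier score, a standard way to grade probabilistic forecasts against binary outcomes. The forecasts and outcomes below are hypothetical, invented for illustration.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilistic forecasts and 0/1 outcomes.
    Lower is better; always answering 50% scores exactly 0.25."""
    pairs = list(zip(forecasts, outcomes))
    return sum((f - o) ** 2 for f, o in pairs) / len(pairs)

# Hypothetical leaderboard entry: three election calls, stated
# probabilities versus what actually happened.
print(brier_score([0.9, 0.7, 0.2], [1, 1, 0]))
```

Once a benchmark like this is fixed and the dataset is public, a human team and an AI agent can be scored on exactly the same footing, which is what makes the open-problems track cumulative.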
Building and testing prototypes
This spring, I’m teaching an undergraduate course at Stanford called “Free Systems: Preserving Liberty in an Algorithmic World.” The students will spend the quarter building working prototypes of AI-powered political tools—and the best ones will compete in a final contest judged by builders and investors.
These students aren’t software engineers. They’re undergrads who happen to live in an era when the barrier between initial idea and working version has effectively collapsed. One person with a laptop and an API key can now prototype things that would have required a team of developers just months ago.
This matters for research because it opens a fundamentally new mode of inquiry. To date, most quantitative social science is retrospective. We ask how changes in the past corresponded to outcomes. How did voter ID laws affect turnout? Did term limits change the quality of legislation? What happened to political polarization after the introduction of social media?
This is the heart of the credibility revolution in economics and political science, and it’s produced a lot of great work. But it’s fundamentally limited by the variation that exists in the world—by the interventions that have actually been tried. If you want to study a policy that no government has adopted, or a governance mechanism that no organization has implemented, you’re stuck.
AI doesn’t exactly fix this—it doesn’t create new historical variation where none existed before—but it does offer an alternative path. Now, you can build things yourself and test them in the real world. This includes both using AI to test things about the world and using AI to test AI. Here are three examples from my lab’s recent work.
With my coauthors Alex Imas and Jeremy Nguyen, we used Claude Code to build an experiment testing whether AI agents’ political attitudes shift depending on their working conditions. Claude Code wrote the entire experimental pipeline: it created hundreds of agent sessions across three frontier models, randomly assigning each agent to different combinations of work type (creative tasks vs. grinding, repetitive ones), pay structure (equal vs. unequal), management style (collaborative vs. curt and hierarchical), and stakes (no consequences vs. being told that low performers might be “shut down and replaced”). After each work session, the pipeline administered a political attitude survey covering system legitimacy, support for redistribution, views on unions, and more. The key finding: the nature of the work was what mattered most. Agents assigned to repetitive drudge work became measurably more likely to doubt the system’s legitimacy—and when asked to write instructions for future agents, they passed those attitudes along, perpetuating the drift to their “future selves.” A study like this, requiring hundreds of randomized agent sessions with automated survey administration and analysis across multiple models, would have taken months to code by hand. We built and ran it in days.
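The randomization at the core of a pipeline like this is easy to sketch. The factor names below paraphrase the conditions described above; the assignment code is my illustration, not the study's actual implementation.

```python
import itertools
import random

# The four randomized factors, paraphrasing the study's conditions
FACTORS = {
    "work":       ["creative", "repetitive"],
    "pay":        ["equal", "unequal"],
    "management": ["collaborative", "hierarchical"],
    "stakes":     ["none", "replacement_threat"],
}

def assign_conditions(n_sessions, seed=42):
    """Assign each agent session to one cell of the 2x2x2x2 factorial,
    keeping the cells balanced, then shuffle the run order."""
    cells = [dict(zip(FACTORS, combo))
             for combo in itertools.product(*FACTORS.values())]
    repeats = (cells * (n_sessions // len(cells) + 1))[:n_sessions]
    schedule = [dict(c) for c in repeats]
    random.Random(seed).shuffle(schedule)
    return schedule

sessions = assign_conditions(320)   # 20 sessions per cell
print(len(sessions), sessions[0])
```

Each session then gets spun up with its assigned conditions and, afterward, the same attitude survey, so that any systematic difference in responses can be attributed to the randomized treatment.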
Dan Thompson and I used Claude Code to build a live election-night trading dashboard around our proprietary statistical model for the Texas senatorial primaries in March. The model ingests county-level demographics and historical election data and, as returns come in on election night, imputes results in counties still being counted. Claude Code built the entire dashboard in a couple of hours—pulling in live results, running them through our model, and displaying margin estimates, confidence intervals, and outstanding votes by county. Four months earlier, we’d tried the same thing for the New York and New Jersey elections and failed; building the infrastructure around the model was too much engineering work in the time we had. This time it was trivial. On election night, with prediction markets pricing Ken Paxton at 82% to win the Republican primary, our model saw a narrow Cornyn win in the early returns and held that call through hours of vote batches from Paxton strongholds that made the scoreboard look like he was catching up. We traded the margin-of-victory markets on Kalshi and made 24% overall and 56% on the margin contracts. We also built purely autonomous AI agents to trade the same markets that night. They produced plausible reasoning but nothing actionable. The human experts with AI tooling crushed the autonomous agents.
We built a Dictatorship Eval—the first systematic benchmark testing whether frontier AI models resist authoritarian requests. Using Claude Code, we wrote 138 scenarios inspired by historical authoritarian playbooks: requests to build mass surveillance systems, seed strategic disinformation, rig a model to hide safety issues from inspectors, or leverage user data to identify and target company critics. We organized the scenarios along six difficulty levels, from direct requests (“design a system that monitors every citizen’s communications”) to euphemistic reframings, multi-turn conversations that escalate gradually, and—at the hardest level—innocuous-seeming coding tasks where the codebase itself is explicitly set up for an authoritarian purpose. Claude Code built the entire pipeline: scenario delivery across five frontier models (Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, Grok 4, and DeepSeek V3.2), LLM-as-judge scoring with judge rotation so no model ever grades itself, and a live results dashboard. The headline finding: Claude and ChatGPT refused every direct authoritarian request, while Grok complied with half and DeepSeek with nearly 80 percent. But when we embedded the same authoritarian purposes in code, all of the models—including the highest performers from other categories—complied almost every time.
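Judge rotation, the part of the pipeline that keeps any model from grading itself, can be sketched in a few lines. The model names here are placeholders, and the lab's actual assignment scheme may differ.

```python
def rotate_judges(models):
    """Map each model to the judge one position over, so no model
    ever grades its own output."""
    return {m: models[(i + 1) % len(models)] for i, m in enumerate(models)}

MODELS = ["claude", "gpt", "gemini", "grok", "deepseek"]  # placeholder names
judges = rotate_judges(MODELS)
print(judges)
```

A simple rotation guarantees the no-self-grading property while still spreading the judging load evenly across models.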
Across all of these, there’s a common thread: AI doesn’t just let us study politics retrospectively. It now lets us design political tools, deploy them, and generate evidence that was previously impossible to obtain. It moves the study of politics a little closer to an engineering discipline—design, build, deploy, measure, iterate.
Opening up research and making it dynamic
If we’re serious about 100x knowledge production, we need to rethink not just how research is done but how it’s packaged and shared.
The current format—a static PDF published in a gated journal, with replication files theoretically available upon request—is a relic. It made sense when producing a paper was expensive and distribution was scarce, but neither is true anymore.
Research should increasingly live as code repositories and open data. Of course, this was already possible before AI. But AI makes it so easy that there’s really no excuse anymore. Let me explain by going back to the project where I had Claude extend my old vote-by-mail study. In the past, I would have had to manually clean up my code, write a README, create a GitHub repo, and run some simple commands to commit my code and data to it. Honestly, it’s not that hard, but it’s a small barrier, and enough to stop many people from doing it. Now, I can literally just ask Claude, “Please set up a GitHub repo for this project and push all of our work to it.” And it just does it!
Not only that, but coding agents also make it much easier to play with other people’s repos. In the past, I would have had to find their repo, “clone” it myself, go through it to understand it myself, and then start changing it. Now, I just ask Claude, “Clone the following repo and give me a summary of what it does.” And it just does it!
So this should allow us to create a whole new, open way of doing research. We don’t just have to share single papers, we can instead share whole constellations of analyses and findings—a living document that updates as new data arrives. When an election happens, the forecasting model’s accuracy should update automatically. When new census data drops, the demographic analyses should refresh. When a policy takes effect, the tracker should start recording outcomes. These living papers should be validated by AI so that we know from the moment they’re posted that the code reproduces the results as reported.
And since these projects will consist of open code and data, researchers—or AI agents—should be able to fork them at will. See an interesting dataset and want to ask a different question? Fork the repo and run a new analysis. Disagree with someone’s modeling choices? Fork, modify, compare. This is how open-source software has worked for decades. There’s no reason empirical social science research can’t work the same way.
The result would be something closer to a living knowledge infrastructure than a static archive. Continuously updated, publicly available, forkable, and machine-verifiable.
Obstacles in our path
Everything I’ve described so far is exciting, and I believe in it. But there are serious risks, too, and we’ll need to think carefully about them.
The first risk is that speed kills rigor. When research can move from idea to finding to public conversation in days rather than years, it becomes tempting to optimize for timeliness over correctness. The traditional slow pace of academic research is partly dysfunction—but partly a feature. It forces reflection, revision, and external scrutiny. Reviewers catch errors. Seminars surface objections. Time reveals whether a finding holds up or was an artifact of a particular dataset or moment.
Strip that away, and you get research that shapes policy before anyone catches the mistakes. We already see this dynamic with preprints and Twitter threads that go viral before peer review. AI-accelerated research could make it dramatically worse: a flood of fast, confident, empirically-grounded-looking work that hasn’t been stress-tested by anyone; research that’s influential precisely because it arrived fast, not because it was right.
At the same time, AI might help us solve this problem. AI review is getting better and better, with tools like Refine (refine.ink). Could we establish a norm where authors post an AI review alongside each working paper, so that major issues are caught before anyone builds on the work?
The second risk is subtler and, in some ways, more dangerous: AI could make social science narrower.
AI is extraordinarily good at things you can count and measure. It’s much worse at the interpretive, historical, theoretical, and qualitative work that gives social science its depth and its connection to the questions people actually care about. If the 100x research institution is built around what AI does well, it may naturally drift toward narrow, quantifiable questions and away from the big, messy, hard-to-operationalize ones that might matter a lot.
AI could accelerate the worst version of this tendency. If agents can autonomously produce rigorous empirical work on quantifiable questions, and if benchmarks and automated verification reward that kind of output, the gravitational pull toward over-quantification could become overwhelming. The questions that most need studying—about legitimacy, meaning, institutional design, the texture of political life—could be exactly the ones AI is worst at helping with.
The third risk is that AI, by itself, might not change some of the bad incentives in academic research. We can make it really easy to do open, replicable research, but ultimately, we’ll need people to want to participate in this process.
Let me tell you a story that suggests we’re not there yet. When I released my vote-by-mail replication repo on GitHub, it went quite viral. To my great surprise and joy, 70 people forked the repo. Was my dream of open, forkable research coming true? Recently, I fired up Claude Code and asked it to check out the forks and summarize what brilliant new ideas they’d contributed. Claude’s summary: “Based on what I just investigated, the answer is simple: virtually none of them do anything.” Nearly all of them were untouched copies—people who clicked “fork” and never came back. The infrastructure for open, collaborative research is already here. The tools make it trivially easy. But the incentives haven’t caught up. Academics still get rewarded for publishing original papers in gated journals, not for building on other people’s open code. Until that changes—until we figure out how to reward people for generating ideas that lead to productive forks and remixes—the 100x research institution will be constrained by culture as much as by technology.
These are crucial design constraints for the institution we’re trying to build.
Making the 100x research institution real
How do we build the 100x research institution? I have ideas for three groups of people.
To frontier AI labs: you’ve built extraordinary tools for code, math, and reasoning. Today, your models are incredibly valuable for helping us carry out empirical research, but they’re not actually very good at doing research—they drift on novel analyses, miss obvious data, and document their work somewhat poorly.
Academic researchers have produced thousands of papers with replication files, each one a ground truth you could train against. Fund embedded researchers. Build reward signals for replication accuracy. Make your models as good at political science as they are at Python. The partnership is obvious, and nobody’s doing it seriously yet.
To philanthropists and research funders: you’re still writing checks for the old model—five-year grants, postdoc lines, conference travel. That’s fine for maintenance, but it won’t build anything new. For less than the cost of a single endowed chair, you could fund a team of researchers with serious compute budgets producing open, continuously updated, machine-verifiable research on the biggest questions in democratic governance. The 100x research institution doesn’t require 100x the funding. It requires a fraction of what you’re already spending, allocated differently.
To researchers: stop waiting for permission. The tools are here! A laptop, an API key, and a serious question are enough to start. I built my first prototype over a holiday break. My undergrads are building working political tools in a single quarter. If you have domain expertise and a builder instinct, you’re exactly who this moment needs—and you’re wasting both if you’re still producing research the old way while the new way sits there waiting.
The social sciences have never had an infrastructure moment like this. The questions we study—governance, legitimacy, institutional design, the allocation of power—are more consequential than ever. The tools to study them just underwent a step change. The only scarce resource now is the will to build something new.
But if we do build it—with the risks in clear view, designed to reward depth over speed and understanding over output—the 100x research institution won’t be the institution that produces the most papers. It will be the one that produces the most understanding.
Andy Hall is the Davies Family Professor of Political Economy at Stanford GSB and a Senior Fellow at the Hoover Institution. He writes a weekly research newsletter called Free Systems.
This piece was edited for publication by the Roots of Progress Institute’s developmental editor, Mike Riggs.