AI Made the First Draft Cheap: Correctness Is Still Expensive
On June 16, Databricks introduced an AI agent that builds forecasting models, deploys apps, and writes its own documentation from a sentence of English, joining comparable agents already running at Snowflake, AWS, and GitHub. The open question isn’t whether an agent can write the code. It’s whether anyone can trust what it wrote.
Freelance data scientist Longhow Lam described a similar moment on LinkedIn. He said plain-English instructions could direct an AI agent through data generation, forecasting, deployment, and documentation, yet every artifact still needed careful review before he trusted it.
A gap separates work generated from work confirmed correct, and it defines the past year of agentic data tools. Vendors measure how much an agent can produce. Few measure how much of that output survives contact with a reviewer who has to sign off on it.
Call the missing number verified output: the share of generated code, models, or dashboards a qualified human approves without rework. It is the metric most productivity claims skip, and it is the one data leaders need most.
English Is Becoming an Interface to the Data Stack
Programming has moved up a layer before. Most programmers wrote in machine code or assembly language until 1957, when IBM’s John Backus led the team that built Fortran, the first widely used high-level language. Low-code platforms followed decades later: Forrester says it coined the term in 2014, and Microsoft launched PowerApps in November 2015 to let business users build applications through visual tools instead of code.
Agentic AI extends the pattern, but the mechanism differs. A compiler applies fixed rules to source code and produces a predictable result every time. A large language model interprets an ambiguous instruction and produces a probable result, not a guaranteed one. English works as an interface to a code-producing system rather than as a replacement for the code, tests, and schemas underneath it.
Four examples show how far the interface has moved. Snowflake’s Cortex Agents reached general availability on November 4, 2025, planning tasks and pulling from structured and unstructured data through Cortex Analyst and Cortex Search. AWS introduced AgentCore Code Interpreter in August 2025, letting agents write and run Python, JavaScript, and TypeScript for data analysis inside a sandboxed environment. GitHub’s Copilot coding agent became generally available on September 25, 2025, accepting a delegated task, opening a draft pull request, and asking a human to review it. Databricks’ Genie Code, now folded into the broader Genie One suite, plans and executes data science workflows from a written prompt.
Each vendor frames its agent around a plain-language request. None removes the step where a person decides whether the output is fit to ship.
Generation and Verification Do Not Scale Together
Benchmarks built specifically for data work show why plausible answers carry real risk. DSBench, presented at ICLR 2025, tested AI agents against 466 data-analysis questions and 74 end-to-end modeling tasks drawn from real competitions. The strongest agent in the original evaluation solved roughly a third of the analysis questions, well below sampled human performance, though the benchmark relied on 2024-era models and newer systems may score higher.
Google Research published a counterpoint in November 2025. Its DS-STAR system raised accuracy on three data-science benchmarks, reaching 45.2% on DABStep, 44.7% on KramaBench, and 38.5% on DA-Code, ahead of the best alternative tested at the time. The hardest DABStep tasks still needed an average of 5.6 rounds of planning and verification before the system settled on an answer. Even a research system built to push past prior limits treats review as part of the work, not as cleanup performed afterward.
A 2024 study from Microsoft Research and the University of Washington, presented at CHI, watched 22 analysts work through AI-generated analyses. Participants leaned on procedure-level evidence, such as code and explanations, and on data-level evidence, such as tables and charts, to decide whether a result held up. Their checks sorted into five layers: did the code run, was the method appropriate, were joins and missing values handled correctly, did the result answer the real business question, and would the pipeline keep working on new data.
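One way to make the five layers concrete is an ordered checklist that reports the first layer a draft fails to clear. The sketch below is illustrative only: the layer names paraphrase the list above, the check functions are hypothetical placeholders a team would supply, and running the layers in sequence is an assumption of this sketch, not something the study prescribes.

```python
from typing import Callable, Dict, Optional

# Layer names paraphrase the five checks described above (hypothetical labels).
LAYERS = [
    "code_runs",
    "method_appropriate",
    "joins_and_missing_values_handled",
    "answers_business_question",
    "robust_to_new_data",
]


def first_failed_layer(checks: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Run checks in the listed order; return the first failing layer, or None."""
    for layer in LAYERS:
        check = checks.get(layer)
        # A layer with no check supplied is treated as unverified, not as passed.
        if check is None or not check():
            return layer
    return None


# Usage: every check passes except the join/missing-value review.
checks = {layer: (lambda: True) for layer in LAYERS}
checks["joins_and_missing_values_handled"] = lambda: False
print(first_failed_layer(checks))  # prints "joins_and_missing_values_handled"
```

Treating a missing check as a failure keeps a draft from counting as verified simply because nobody looked at one of the layers.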
Generation scales with compute. Verification scales with the number of qualified people available to look closely at an answer and decide if it can be trusted. The two rates rarely match, and the distance between them is where work piles up.
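A back-of-the-envelope sketch shows how the pile-up follows from the mismatch. Every number below is an illustrative assumption, not a figure from any study cited here: agents produce a fixed number of drafts per day, and each reviewer can clear a fixed number.

```python
from typing import List


def review_backlog(generated_per_day: int, reviewers: int,
                   reviews_per_reviewer_per_day: int, days: int) -> List[int]:
    """Return the unreviewed backlog at the end of each day."""
    capacity = reviewers * reviews_per_reviewer_per_day
    backlog = 0
    history = []
    for _ in range(days):
        backlog += generated_per_day       # agents add new drafts
        backlog -= min(backlog, capacity)  # reviewers clear what they can
        history.append(backlog)
    return history


# Hypothetical team: 40 drafts a day, 3 reviewers clearing 8 each (capacity 24).
print(review_backlog(40, 3, 8, 5))  # prints [16, 32, 48, 64, 80]
```

Under these assumptions the backlog grows by 16 artifacts a day for as long as generation outpaces review capacity. Adding compute raises the first argument; only adding or freeing up qualified reviewers raises the other two.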
The Productivity Evidence Depends on What Gets Counted
Some of the strongest AI-productivity evidence comes from a 2023 controlled experiment, still widely cited, in which developers asked to build a JavaScript HTTP server finished 55.8% faster with GitHub Copilot than without it. The task was narrow, the goal was clear, and success was easy to judge. Under narrow, well-scoped conditions, an agent helped enormously.
METR’s 2025 randomized trial points the other way. Sixteen experienced open-source developers worked through 246 tasks in large, mature repositories they already knew well. With AI access, completion took 19% longer. Participants had predicted a 24% speedup beforehand, and they still estimated a 20% speedup afterward, despite the slower outcome they had just lived through. METR frames the result as a snapshot of early-2025 tools in one setting, not a universal verdict on AI coding.
Google’s 2025 DORA report surveyed software professionals and found AI use among 90% of them, with a median of two hours a day. Adoption tracked with higher output, and it tracked with lower delivery stability at the same time. DORA’s framing fits the pattern: AI amplifies what a team already does well, and amplifies what it does poorly just as fast.
Stack Overflow’s 2025 developer survey adds a behavioral signal. Forty-six percent of respondents distrusted AI output accuracy, against 33% who trusted it, and only 3% reported high trust. Sixty-six percent said they spent more time fixing AI code that looked almost right but proved wrong. A dbt Labs survey found 80% of data practitioners used AI daily in late 2024, up from 30% a year earlier, yet only 30% trusted an agent to answer natural-language questions directly against their data. Acceleration and confidence are not the same measurement, and the surveys keep finding gaps between them.
The New Bottleneck Changes the Shape of the Data Team
If English lowers the cost of asking a question, then the cost shifts toward judging the answer. Anaconda’s 2025 survey of practitioners found reported skill gaps concentrated in AI governance (30%), deep-learning engineering (23%), and prompt design (20%), a spread suggesting a wider mix of skills rather than one skill replacing the rest. LinkedIn data shows a 177% jump in members adding AI-related skills to their profiles since 2023, nearly five times the growth rate across all skills, though the figure tracks self-reported skills, not employer requirements written into job postings.
Job-posting research covering 378 US public companies recruiting for generative-AI roles found higher demand for cognitive skills and a post-ChatGPT rise in social-skill requirements, though the dataset runs through 2023 and isn’t specific to data-science roles. Read together, the evidence supports a narrower claim than the one frequently repeated in headlines: domain framing, evaluation, governance, and orchestration are gaining value alongside coding ability, not replacing it. No dataset reviewed here shows employers dropping Python or statistics requirements in favor of prompt-writing skills.
Inside a data team, the shift lands unevenly. A junior analyst can now produce a working draft model in an afternoon. A senior reviewer, a domain expert, or a data-quality owner still has to decide whether the draft deserves to influence a customer, an operational decision, or a dollar of spend. Junior staff create faster. Senior staff carry more decisions per day, because the volume in front of them grew while their headcount stayed flat. Accountability concentrates around the people positioned to catch a wrong assumption before it reaches production, regardless of who wrote the first version.
Opinion: Measure Verified Outcomes, Not Generated Volume
Here is the take: counting generated artifacts as a productivity measure rewards the wrong behavior. A dashboard, model, or pull request an agent produces in seconds carries no value until a qualified person confirms it works and decides to keep it. A simple count of outputs tells a team how busy its agents stayed, not how much real progress it made.
Data leaders should track verified outcomes instead. Acceptance rate measures the share of agent-generated work approved without rework. Review time measures how many human-hours each accepted artifact cost. Escaped-defect rate measures how often a problem reaches production anyway. Rework volume, model-monitoring incidents, and time to a validated decision round out a picture closer to reality than a count of lines written or queries answered. The clearest single number may be the simplest: the share of generated work reaching production unchanged.
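The first three metrics can be computed from a simple review log. The sketch below is a minimal illustration, not a standard: the `Artifact` record and its fields are hypothetical, review time is assumed to include hours spent on drafts that were ultimately rejected, and escaped defects are assumed to be counted against accepted artifacts only.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Artifact:
    """One agent-generated artifact (hypothetical record; fields are assumptions)."""
    accepted: bool        # a qualified reviewer approved it
    reworked: bool        # it needed changes before approval
    review_hours: float   # human time spent reviewing it
    escaped_defect: bool  # a defect surfaced after it reached production


def verified_outcome_metrics(artifacts: List[Artifact]) -> Dict[str, float]:
    if not artifacts:
        raise ValueError("need at least one artifact")
    accepted = [a for a in artifacts if a.accepted]
    unchanged = [a for a in accepted if not a.reworked]
    escaped = [a for a in accepted if a.escaped_defect]
    # Hours spent on rejected drafts still count as a cost of the accepted ones.
    total_hours = sum(a.review_hours for a in artifacts)
    return {
        # Share of all generated work approved without rework.
        "acceptance_rate": len(unchanged) / len(artifacts),
        # Human-hours spent per artifact that was actually accepted.
        "review_hours_per_accepted": (
            total_hours / len(accepted) if accepted else float("inf")
        ),
        # Share of accepted artifacts that still caused a production problem.
        "escaped_defect_rate": len(escaped) / len(accepted) if accepted else 0.0,
    }


# Made-up review log for illustration.
log = [
    Artifact(accepted=True, reworked=False, review_hours=1.0, escaped_defect=False),
    Artifact(accepted=True, reworked=True, review_hours=3.0, escaped_defect=False),
    Artifact(accepted=True, reworked=False, review_hours=0.5, escaped_defect=True),
    Artifact(accepted=False, reworked=False, review_hours=1.5, escaped_defect=False),
]
print(verified_outcome_metrics(log))
```

For this made-up log, half the generated work is accepted unchanged, each accepted artifact costs two review hours, and one in three accepted artifacts still produces a defect. A raw output count would report only "four artifacts generated."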
Nothing above argues against agentic tools. Cortex Agents, AgentCore, and Copilot’s coding agent all lower the cost of a first draft, and a cheaper first draft is worth having. The win gets overstated, though, whenever a vendor or a headline conflates speed of generation with speed of delivery.
Natural language will keep widening who can start a piece of data work. A marketing analyst, a finance lead, or an operations manager can now ask a question in plain words and get back a model, a chart, or a working app. What stays scarce is knowing which question to ask, how much evidence is enough before trusting an answer, and when to refuse one. The skill won’t show up in a model’s response time, and it won’t get cheaper just because the first draft did.
