datasetpapers

A new shape for research output

The paper stops being the point.

A datasetpaper is a versioned, forkable, executable research object built on an open dataset. The data, the code, the environment, the figures, and each individual claim are all addressable. The written narrative is just one view of it. It is built to be verified, cited, and forked by people and by AI agents.

Not a data descriptor. A data paper describes a dataset and omits the analysis on purpose. A datasetpaper is the analysis: hypotheses, methods, results, and claims, made machine-readable and forkable.

Why now

Generating an analysis and its write-up is no longer the hard part. That changes what is valuable. If anyone can produce a plausible analysis in minutes, plausibility is worthless, and the scarce thing is trust: knowing which claim was re-executed from the data and which was merely asserted. datasetpapers is built around that scarce thing. The verification and the provenance are the product. The prose is a by-product.

The object has parts, and every part is addressable

Because the parts are addressable, an agent can build on one claim without parsing a whole document, and a person can cite a single figure with a stable identifier.

Dataset

Referenced and pinned to a version. Never silently updated. Retraction propagates to everything derived from it.

Code and environment

A notebook plus a pinned container and lockfile, so the analysis re-runs deterministically, not approximately.

Figures and tables

Each one points back to the exact code that produced it, so no result is orphaned from its computation.

Claims

Atomic statements, individually citable and machine-checkable, each carrying the figures that support it.

Provenance

A graph tying every output back to the data, and recording whether a human or an AI produced each step.

Narrative

Compiled from the object, not hand-edited. Available as a web page, a PDF, or structured XML.
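
To make "addressable parts" concrete, here is a minimal sketch of how parts and their provenance links could be modelled, and how a retraction could propagate to everything derived from a dataset. The `Part` schema, the identifier strings, and the `ProvenanceGraph` class are illustrative assumptions, not the datasetpapers data model.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class Part:
    """One addressable part of a datasetpaper (hypothetical schema)."""
    part_id: str                       # stable identifier, e.g. "claim:1"
    kind: str                          # "dataset", "code", "figure", "claim", ...
    derived_from: list[str] = field(default_factory=list)
    produced_by: str = "human"         # "human" or "ai"
    retracted: bool = False


class ProvenanceGraph:
    """Directed graph from each part to the parts derived from it."""

    def __init__(self) -> None:
        self.parts: dict[str, Part] = {}
        self.children: dict[str, list[str]] = defaultdict(list)

    def add(self, part: Part) -> None:
        self.parts[part.part_id] = part
        for parent in part.derived_from:
            self.children[parent].append(part.part_id)

    def retract(self, part_id: str) -> list[str]:
        """Mark a part and everything derived from it as retracted."""
        affected: list[str] = []
        queue = deque([part_id])
        while queue:
            current = queue.popleft()
            if self.parts[current].retracted:
                continue  # already handled via another path
            self.parts[current].retracted = True
            affected.append(current)
            queue.extend(self.children[current])
        return affected


graph = ProvenanceGraph()
graph.add(Part("dataset:v3", "dataset"))
graph.add(Part("code:analysis", "code"))
graph.add(Part("figure:1", "figure", ["dataset:v3", "code:analysis"], "ai"))
graph.add(Part("claim:1", "claim", ["figure:1"]))

affected = graph.retract("dataset:v3")
print(affected)  # ['dataset:v3', 'figure:1', 'claim:1']
```

The code part is untouched by the retraction because it does not derive from the dataset; only the figure and the claim that depend on the retracted data are flagged.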

How it works, in plain steps

  1. Start from an open dataset, pinned to a version so results are reproducible.
  2. Ask a question the dataset can answer, and check it is not a restatement of something already known.
  3. Run the analysis in a captured environment, so the exact computation can be re-run later.
  4. Break the result into parts: data, code, environment, figures, claims, and provenance.
  5. Pass it through a gate that scores novelty, checks integrity, and runs an adversarial review before publishing.
  6. Give it persistent identifiers and publish it in formats both people and machines can read.
  7. Let others verify individual claims, and let anyone fork the whole thing to ask a new question.
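
Steps 1, 3, and 7 can be sketched as a pin-and-re-execute check. This sketch assumes the pin is a SHA-256 content hash and uses a toy `analyse` function as a stand-in for a real notebook; the status strings are hypothetical labels, not a datasetpapers vocabulary.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def analyse(raw: bytes) -> dict:
    """Stand-in analysis: mean of a one-column CSV with a header row."""
    values = [float(line) for line in raw.decode().splitlines()[1:] if line]
    return {"n": len(values), "mean": sum(values) / len(values)}


def verify(raw: bytes, pinned_hash: str, published: dict) -> str:
    """Re-run the analysis on the pinned data and compare with the published result."""
    if sha256_hex(raw) != pinned_hash:
        return "dataset-mismatch"      # the data is not the pinned version
    rerun = analyse(raw)
    same = json.dumps(rerun, sort_keys=True) == json.dumps(published, sort_keys=True)
    return "re-executed" if same else "not-reproduced"


dataset = b"value\n1.0\n2.0\n3.0\n"
pin = sha256_hex(dataset)
published_result = analyse(dataset)

print(verify(dataset, pin, published_result))               # re-executed
print(verify(dataset + b"4.0\n", pin, published_result))    # dataset-mismatch
print(verify(dataset, pin, {"n": 3, "mean": 2.5}))          # not-reproduced
```

A real pipeline would also pin the container and lockfile, since identical data does not guarantee an identical result if the environment drifts.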

Built for AI to build on

The reason to decompose the paper is that the future reader will often be a machine. The whole corpus is published through interfaces made for that reader.

An MCP server

Any agent can search, read, verify, and fork datasetpapers without a human in the loop. Reads are open; writes are gated.
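
As a rough illustration, an agent talking to an MCP server sends JSON-RPC 2.0 messages, and a tool invocation uses the `tools/call` method. The tool names below (`search`, `read`, `verify`, `fork`), the argument shape, and the split into read and write tools are assumptions made for this sketch; the actual tool list is whatever the datasetpapers server advertises.

```python
import json

# Hypothetical tool names and read/write split, for illustration only.
READ_TOOLS = {"search", "read", "verify"}
WRITE_TOOLS = {"fork"}


def tool_call(request_id: int, name: str, arguments: dict) -> str:
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


def is_allowed(name: str, has_write_grant: bool) -> bool:
    """Reads are open; writes need an explicit grant."""
    if name in READ_TOOLS:
        return True
    return name in WRITE_TOOLS and has_write_grant


message = tool_call(1, "search", {"query": "placeholder query"})
print(message)
print(is_allowed("search", has_write_grant=False))  # True
print(is_allowed("fork", has_write_grant=False))    # False
```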

Machine-readable throughout

A research-object bundle, a dataset description ML tools already consume, a knowledge graph, persistent identifiers, and git for forking.

Trust travels with the claim

Every machine-readable result carries its verification status and its distance from raw data, so an agent cannot quietly treat an assertion as a fact.
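
One way an agent could honour this is to refuse to promote a claim to a fact unless its record says it was re-executed. The sketch below assumes a two-value status and treats "distance from raw data" as a count of derivation steps; both are illustrative guesses at the schema, and the claim texts are placeholders.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    ASSERTED = "asserted"          # stated, not re-run from the data
    RE_EXECUTED = "re-executed"    # reproduced from the pinned data


@dataclass(frozen=True)
class ClaimRecord:
    claim_id: str
    text: str
    status: Status
    distance_from_raw: int         # assumed meaning: derivation steps from raw data


def usable_as_fact(claim: ClaimRecord, max_distance: int = 2) -> bool:
    """Accept only re-executed claims that sit close enough to the raw data."""
    return claim.status is Status.RE_EXECUTED and claim.distance_from_raw <= max_distance


claims = [
    ClaimRecord("claim:1", "placeholder claim A", Status.RE_EXECUTED, 1),
    ClaimRecord("claim:2", "placeholder claim B", Status.ASSERTED, 1),
    ClaimRecord("claim:3", "placeholder claim C", Status.RE_EXECUTED, 5),
]

facts = [c.claim_id for c in claims if usable_as_fact(c)]
print(facts)  # ['claim:1']
```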

Forking is first class

A fork is a new lineage with its ancestry recorded in both git and the provenance graph. Building on prior work is the default motion.
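
The provenance side of a fork can be sketched as a parent pointer that is written once and then walked to recover the lineage. The `PaperVersion` record, the in-memory `registry`, and the identifiers are hypothetical; the git side of the ancestry is not modelled here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PaperVersion:
    paper_id: str
    question: str
    forked_from: Optional[str] = None   # None for an original datasetpaper


registry: dict[str, PaperVersion] = {}


def publish(paper: PaperVersion) -> None:
    registry[paper.paper_id] = paper


def fork(parent_id: str, new_id: str, question: str) -> PaperVersion:
    """Create a new lineage that records where it came from."""
    if parent_id not in registry:
        raise KeyError(f"unknown parent: {parent_id}")
    child = PaperVersion(new_id, question, forked_from=parent_id)
    publish(child)
    return child


def ancestry(paper_id: str) -> list[str]:
    """Return ancestors from nearest parent to the original."""
    chain: list[str] = []
    current = registry[paper_id].forked_from
    while current is not None:
        chain.append(current)
        current = registry[current].forked_from
    return chain


publish(PaperVersion("dp:a", "original question"))
fork("dp:a", "dp:b", "follow-up question")
fork("dp:b", "dp:c", "second follow-up question")

print(ancestry("dp:c"))  # ['dp:b', 'dp:a']
```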

In concrete terms, datasetpapers builds on open datasets from repositories such as Figshare and Zenodo, credits their depositors via ORCID, packages each result as an RO-Crate with a Croissant dataset description, mints persistent ARK identifiers, and passes new work through an adversarial review gate before publishing.
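
For readers unfamiliar with RO-Crate, the packaging step might produce an `ro-crate-metadata.json` along these lines. The metadata descriptor and root dataset follow the RO-Crate 1.1 layout; every value (names, dates, the ARK string, the ORCID, the source URL, the file names `croissant.json` and `analysis.ipynb`) is a placeholder, and how datasetpapers actually lays out its crates is not specified here.

```python
import json

crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {   # metadata descriptor, as defined by RO-Crate 1.1
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # root dataset: the datasetpaper itself (all values are placeholders)
            "@id": "./",
            "@type": "Dataset",
            "name": "Example datasetpaper (placeholder)",
            "description": "Placeholder description.",
            "datePublished": "2025-01-01",
            "license": {"@id": "https://creativecommons.org/licenses/by/4.0/"},
            "identifier": "ark:/99999/placeholder",
            "isBasedOn": {"@id": "https://example.org/source-dataset"},
            "hasPart": [{"@id": "croissant.json"}, {"@id": "analysis.ipynb"}],
        },
        {"@id": "croissant.json", "@type": "File",
         "name": "Croissant dataset description"},
        {"@id": "analysis.ipynb", "@type": "File", "name": "Analysis notebook"},
        {   # the open dataset the work builds on, credited to its depositor
            "@id": "https://example.org/source-dataset",
            "@type": "Dataset",
            "name": "Source dataset (placeholder)",
            "creator": {"@id": "https://orcid.org/0000-0000-0000-0000"},
        },
        {"@id": "https://orcid.org/0000-0000-0000-0000", "@type": "Person",
         "name": "Depositor Name (placeholder)"},
    ],
}


def missing_parts(crate: dict) -> list[str]:
    """List hasPart references on the root dataset that have no entity in the graph."""
    ids = {entity["@id"] for entity in crate["@graph"]}
    root = next(e for e in crate["@graph"] if e["@id"] == "./")
    return [ref["@id"] for ref in root["hasPart"] if ref["@id"] not in ids]


serialised = json.dumps(crate, indent=2)
print(missing_parts(crate))  # []
```

The small `missing_parts` check is the kind of integrity test a gate could run before publishing: every part the crate claims to contain must actually be described.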

Credit where it is due

Every datasetpaper starts from someone's dataset. datasetpapers records that debt explicitly, notifies the original depositor that their data was used, and gives them a first-class place in the credit graph. The people who share data have been under-credited for as long as data sharing has existed. A world where machines analyse open data at scale makes that worse unless credit is designed in. Here it is.

What we are asking people to do

Read a datasetpaper the way you would read a pull request, not a journal article. Check the claims you care about. Fork it if you can do better. Bring your dataset if you want it analysed. Treat it as a starting point that is honest about its own uncertainty, not a finished verdict.

Questions, answered

What is a datasetpaper?

A datasetpaper is a versioned, forkable, executable research object built on an open dataset. Its data, code, environment, figures, and individual claims are each addressable, and the written narrative is one rendered view of it. It is built to be verified, cited, and forked by people and by AI agents.

How is a datasetpaper different from a data paper?

A data paper describes a dataset and deliberately omits the analysis. A datasetpaper is the analysis itself: hypotheses, methods, results, and claims, made machine-readable and forkable.

Who is datasetpapers for?

Researchers who want to build on open data, data depositors who want credit when their data is reused, and AI agents that need machine-readable, verifiable analyses to build on.

How can AI agents use datasetpapers?

Through an MCP server and machine-readable formats: an RO-Crate research object, a Croissant dataset description, a knowledge graph, persistent identifiers, and git for forking. Every result carries its verification status and its distance from the raw data, so an agent cannot silently treat an unverified assertion as fact. Reads are open; writes are gated.

What does datasetpapers build on?

It builds on open datasets from repositories such as Figshare and Zenodo, credits depositors via ORCID, packages each result as an RO-Crate with a Croissant dataset description, mints persistent ARK identifiers, and passes new work through an adversarial review gate before publishing.