Dataset
Referenced and pinned to a version. Never silently updated. Retraction propagates to everything derived from it.
A new shape for research output
A datasetpaper is a versioned, forkable, executable research object built on an open dataset. The data, the code, the environment, the figures, and each individual claim are all addressable. The written narrative is just one view of it. Built to be verified, cited, and forked, by people and by AI agents.
Not a data descriptor. A data paper describes a dataset and omits the analysis on purpose. A datasetpaper is the analysis: hypotheses, methods, results, and claims, made machine-readable and forkable.
Generating an analysis and its write-up is no longer the hard part. That changes what is valuable. If anyone can produce a plausible analysis in minutes, plausibility is worthless, and the scarce thing is trust: knowing which claim was re-executed from the data and which was merely asserted. datasetpapers is built around that scarce thing. The verification and the provenance are the product. The prose is a by-product.
Because the parts are addressable, an agent can build on one claim without parsing a whole document, and a person can cite a single figure with a stable identifier.
Referenced and pinned to a version. Never silently updated. Retraction propagates to everything derived from it.
A notebook plus a pinned container and lockfile, so the analysis re-runs deterministically, not approximately.
Each one points back to the exact code that produced it, so no result is orphaned from its computation.
Atomic statements, individually citable and machine-checkable, each carrying the figures that support it.
A graph tying every output back to the data, and recording whether a human or an AI produced each step.
Compiled from the object, not hand-edited. Available as a web page, a PDF, or structured XML.
The reason to decompose the paper is that the future reader is often a machine. The whole corpus is published through interfaces made for that reader.
Any agent can search, read, verify, and fork datasetpapers without a human in the loop. Reads are open; writes are gated.
A research-object bundle, a dataset description ML tools already consume, a knowledge graph, persistent identifiers, and git for forking.
Every machine-readable result carries its verification status and its distance from raw data, so an agent cannot quietly treat an assertion as a fact.
A fork is a new lineage with its ancestry recorded in both git and the provenance graph. Building on prior work is the default motion.
In concrete terms, datasetpapers builds on open datasets from repositories such as Figshare and Zenodo, credits their depositors via ORCID, packages each result as an RO-Crate with a Croissant dataset description, mints persistent ARK identifiers, and passes new work through an adversarial review gate before publishing.
Every datasetpaper starts from someone's dataset. datasetpapers records that debt explicitly, notifies the original depositor that their data was used, and gives them a first-class place in the credit graph. The people who share data have been under-credited for as long as data sharing has existed. A world where machines analyse open data at scale makes that worse unless credit is designed in. Here it is.
Read a datasetpaper the way you would read a pull request, not a journal article. Check the claims you care about. Fork it if you can do better. Bring your dataset if you want it analysed. Treat it as a starting point that is honest about its own uncertainty, not a finished verdict.
A datasetpaper is a versioned, forkable, executable research object built on an open dataset. Its data, code, environment, figures, and individual claims are each addressable, and the written narrative is one rendered view of it. It is built to be verified, cited, and forked by people and by AI agents.
A data paper describes a dataset and deliberately omits the analysis. A datasetpaper is the analysis itself — hypotheses, methods, results, and claims — made machine-readable and forkable.
Researchers who want to build on open data, data depositors who want credit when their data is reused, and AI agents that need machine-readable, verifiable analyses to build on.
Through an MCP server and machine-readable formats: an RO-Crate research object, a Croissant dataset description, a knowledge graph, persistent identifiers, and git for forking. Every result carries its verification status and its distance from the raw data, so an agent cannot silently treat an unverified assertion as fact. Reads are open; writes are gated.
It builds on open datasets from repositories such as Figshare and Zenodo, credits depositors via ORCID, packages each result as an RO-Crate with a Croissant dataset description, mints persistent ARK identifiers, and passes new work through an adversarial review gate before publishing.