DataWiseLabs — Institutional Research Data Corpus

Why this corpus exists

Biomedical material is increasingly judged by its provenance, not its volume.

The properties that distinguish biomedical research — named authorship, institutional review, embargo timing, and editorial discipline — are precisely the properties that web-derived collections tend to lose along the way. Once those signals are stripped, they are difficult to reconstruct.

A corpus cannot convey embargo, retraction, or peer-review status if those signals were not preserved when the material was first collected.

This corpus is organised around the opposite premise: a documented source of record — the named institution, the named investigator, the embargo date — preserved at the level of the individual record, throughout a continuous thirty-year archive.

What changes when provenance is preserved

Properties of the underlying material

→ Attribution is intact at the record level, not inferred from later mentions.
→ Editorial status (preprint, peer-reviewed, retracted) is captured rather than guessed.
→ Embargo timing distinguishes first publication from later reporting.
→ The chain of custody from institution to platform is documentable end to end.

Properties of the material

Five characteristic gaps in web-derived biomedical collections — and what this corpus retains.

Each row describes a property that is commonly absent when biomedical material is collected from the open web, the reason it tends to be lost, and how this corpus preserves it from the point of original release.

Unverifiable attribution

References that cannot be traced back to a named investigator or institution of record.

Why it tends to be lost: Web collections mix verified, preprint, and aggregator material without a source-of-record discipline.

What this corpus preserves: Records originate from a named institution with named researchers, retained at the record level for verifiable attribution.

Editorial status not distinguished

Preprints, peer-reviewed findings, and retracted material carried alongside one another without separation.

Why it tends to be lost: Downstream copies of biomedical claims rarely carry a reliable timing or editorial-status signal.

What this corpus preserves: Embargo-disciplined release timing and institutional editorial review, captured prior to publication.

Truncated clinical context

Abstracts and headlines that omit the methodology, framing, or investigator commentary around a finding.

Why it tends to be lost: Abstract-only sources strip the explanatory and clinical context that accompanies an institutional release.

What this corpus preserves: Full institutional-release prose, including methodology summary, clinical framing, and named-investigator commentary.

Unclear rights basis

Material whose contractual basis for downstream commercial use is undocumented or unavailable.

Why it tends to be lost: Large biomedical collections are often aggregated from many sources without a documented rights basis, or licensed only for non-commercial research use.

What this corpus preserves: A thirty-year continuous distribution chain from institution to platform, with rights documentation reviewed for diligence.

Loose image–text pairing

Clinical and laboratory imagery that has drifted from its original caption or context.

Why it tends to be lost: Image–text pairs in medicine are often noisy, mis-captioned, and not researcher-approved.

What this corpus preserves: Researcher- and PIO-approved images and video, paired to release text at the record level.

What distinguishes the corpus

Four properties that depend on how the material was first collected.

These are not features of scale. Each reflects a structural property of how the material was produced — at the source, under embargo, with attribution intact. A deduplicated attribution graph of 48,051 unique experts and 51,737 expert–institution relationships (across 397 institutions in the medical expert-graph) connects records to named authorities.

01 / Provenance

Source of record

Records originate from the named institution of publication, with researcher names, titles, and affiliations preserved at the record level for verifiable attribution.

02 / Multimodal

Aligned at origin

Researcher- and PIO-approved images and video paired to release text at the record level — clean image–text and video–text signal with no post-hoc captioning.

03 / Temporal

Continuity over time

Thirty years of continuous, embargo-dated provenance enables time-aware training that separates established consensus from emerging findings without conflation.

04 / Temporal-lead

Upstream of downstream reporting

Institution-authored releases precede downstream coverage, giving the corpus a measurable lead over the aggregated medical news that later follows it.

Build vs license

Measured against the cost to originate the corpus.

A license is most fairly measured against the alternative: originating thirty years of institutional relationships, editorial discipline, and rights documentation from a standing start. The relevant comparison is time and feasibility.

Originate from scratch

Build

Estimated 5–10+ years · uncertain rights · closed historical windows

✕ Direct agreements with 3,000+ contributing institutions archive-wide, one relationship at a time.
✕ Embargo and attribution discipline, which exists only if captured at the moment of release.
✕ The 2020–2025 window, including the COVID communication peak, which has already closed.
✕ Clearing commercial-use rights institution by institution, without guarantee of a clean chain.
✕ Pairing and approving 126K+ multimedia assets at source, before public distribution.

License the assembled asset

License

Weeks to evaluate · documented rights · delivery-ready

→ An assembled, structured corpus available for evaluation in weeks rather than years.
→ Embargo timing, attribution, and editorial status already preserved at the record level.
→ The closed 2020–2025 window, already captured and in hand.
→ Rights documentation prepared for diligence, with a chain of custody traceable end to end.
→ Source-approved multimodal pairings delivered as a clean, structured distribution.

The current context

Why this layer is being considered carefully in 2026.

Four characteristics of the current environment make a rights-reviewed, institution-authored biomedical layer a substantive consideration for organisations building serious medical systems.

A different category of material

Provenance, attribution, and temporal signal are increasingly recognised as distinct from raw volume. A system grounded in this material can cite, date, and attribute its claims in a way that web-derived corpora do not readily support.

Regulatory direction of travel

As training-data provenance moves toward becoming an audited surface under emerging AI governance, a documented institution-to-platform chain is increasingly relevant to compliance and review.

A closed historical window

The 2020–2025 cohort — a particularly active period in biomedical communication — is finite and already captured. The window itself is no longer open to be re-originated.

Scope is configurable

The medical cohort can be licensed in defined slices, and the structure supports category-, industry-, or temporal-exclusive arrangements where that serves both sides of an engagement.

The medical cohort

The 2020–2025 medical cohort.

Medicine and the health sciences are the strongest, most actively maintained area of the archive — produced under embargo, written for accuracy, tied to named researchers at named institutions across oncology, cardiovascular disease, neuroscience, infectious disease, and public health.

59,074

Medical research stories · 2020–2025

9,960

COVID-19 & public-health stories, within the cohort

48,051

Unique experts · across 397 medical institutions

Layer 01

Provenance

Named institution and investigator preserved per record, for attribution-grounded generation.

Layer 02

Multimodal alignment

78,520 cohort multimedia assets — including 5,223 videos — approved at source and paired to release text. (126,522 archive-wide.)

Layer 03

Temporal continuity

Embargo-dated release timing allowing time-aware reasoning without conflation of recent and historical findings.

Layer 04

Temporal-lead

First-public-mention windows recording the corpus's lead over downstream reporting.
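As a rough illustration of how a first-public-mention window might be derived from per-record dates, the sketch below assumes each record carries an embargo release date and, where known, the date of the first downstream mention. The `Record` class, its field names (`embargo_date`, `first_downstream_mention`), and the sample values are hypothetical and are not taken from the corpus schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Record:
    # Hypothetical fields for illustration only; not the corpus schema.
    record_id: str
    embargo_date: date
    first_downstream_mention: Optional[date] = None


def lead_days(record: Record) -> Optional[int]:
    """Days between the institutional release and the first downstream mention.

    Returns None when no downstream mention has been recorded.
    """
    if record.first_downstream_mention is None:
        return None
    return (record.first_downstream_mention - record.embargo_date).days


# Made-up sample records, used only to exercise the function.
records = [
    Record("r1", date(2021, 3, 1), date(2021, 3, 4)),
    Record("r2", date(2022, 6, 10)),
]
leads = {r.record_id: lead_days(r) for r in records}
```

Keeping the "no downstream mention" case as `None` rather than zero avoids treating unreported records as if they had no lead at all.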

The licensing surface

Not a single product. A set of dimensions defined per engagement.

A license is shaped to the requirements of the system being built — rights, scope, attestation, and pathway, specified in dialogue with technical and legal counsel rather than presented as a fixed package.

Rights

The envelope under which the corpus is licensed for commercial AI development.

· Commercial training against the full corpus or any defined subset
· RAG and grounding rights for production deployment
· Fine-tuning and continued pre-training on the record set
· Citation and source-of-claim attribution at inference time
· Forward-feed access as the archive continues to grow

Scope

The dimensions along which a license can be defined against the asset.

· The full thirty-year institutional archive
· The 2020–2025 medical cohort as a standalone resource
· Subject-area channels (oncology, cardiovascular, public health)
· The aligned multimedia layer paired to text at source
· The named-expert and institution attribution graph

Attestation

The buyer-protection and provenance documentation accompanying the license.

· Documented chain of custody, institution to platform
· Per-record provenance tied to source and embargo date
· C2PA-compatible provenance roadmap across the intelligence layers
· Rights review materials available under NDA
· Indemnity structured to the scope and term of engagement

Pathways

How a buyer moves from inspection to production.

· Technical evaluation against a curated sample first
· Channel-level engagement before full-cohort access
· Conversion to production with evaluation credited forward
· Multi-year arrangements across model generations
· Joint working sessions during onboarding

The licensing architecture

Seven independent dimensions, combined per engagement.

A license is defined as a combination across seven dimensions. The corpus is structured to support different specifications depending on the system being built. A medical-cohort license is one combination among many — the first commercial expression of the architecture.

Content

Medical · Oncology · Cardiovascular · Neuroscience · COVID · Public Health · Science

Time

2020–2025 · Pre-2020 · Full thirty-year archive

Media

Text · Images · Video · Aligned multimedia

Intelligence layers

Provenance · Expert attribution · Institution intelligence · Temporal intelligence

Rights

Training · Fine-tuning · RAG · Production · Citation

Updates

Static snapshot · Quarterly refresh · Forward feed

Exclusivity

Category · Industry · Geography · Temporal

Each dimension is licensed independently. Specific structures — and any exclusivity across content domains, intelligence layers, or time windows — are defined in dialogue with technical and legal counsel, in light of the system the licensee is building.
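One way to picture a license as a combination across the seven dimensions is as a small specification checked against the values listed above. The following is a minimal sketch under that assumption: `DIMENSIONS`, `validate_spec`, and the example specification are hypothetical names for illustration, not part of any DataWiseLabs tooling or contract format.

```python
# Allowed values per dimension, copied from the seven dimensions listed above.
# The dictionary keys are illustrative identifiers, not official field names.
DIMENSIONS = {
    "content": {"Medical", "Oncology", "Cardiovascular", "Neuroscience",
                "COVID", "Public Health", "Science"},
    "time": {"2020–2025", "Pre-2020", "Full thirty-year archive"},
    "media": {"Text", "Images", "Video", "Aligned multimedia"},
    "intelligence_layers": {"Provenance", "Expert attribution",
                            "Institution intelligence", "Temporal intelligence"},
    "rights": {"Training", "Fine-tuning", "RAG", "Production", "Citation"},
    "updates": {"Static snapshot", "Quarterly refresh", "Forward feed"},
    "exclusivity": {"Category", "Industry", "Geography", "Temporal"},
}


def validate_spec(spec: dict[str, set[str]]) -> list[str]:
    """Return a list of problems; an empty list means the spec is well-formed."""
    problems = []
    # Flag dimension names that are not among the seven.
    for name in spec:
        if name not in DIMENSIONS:
            problems.append(f"unknown dimension: {name}")
    # Flag chosen values that a known dimension does not offer.
    for name, allowed in DIMENSIONS.items():
        unknown = spec.get(name, set()) - allowed
        if unknown:
            problems.append(f"{name}: unsupported values {sorted(unknown)}")
    return problems


# A hypothetical medical-cohort combination; dimensions left out are unselected.
medical_cohort = {
    "content": {"Medical"},
    "time": {"2020–2025"},
    "media": {"Text", "Aligned multimedia"},
    "rights": {"Training", "RAG"},
    "updates": {"Static snapshot"},
}
problems = validate_spec(medical_cohort)
```

Treating each dimension as an independent set mirrors the idea that dimensions are licensed separately; the actual terms of any combination are defined with counsel, as described above.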

Diligence access

What's available under NDA.

A diligence package prepared for technical and legal counsel to examine the corpus at the record level. Five components, delivered together, each addressing a question a reviewing team would reasonably want to answer.

01 · Rights & provenance

Rights & provenance memo

The contractual basis under which institutional content reaches the corpus, and the structure of the chain of custody from institution to platform.

02 · Institutions

Institution validation pack

A representative roster of contributing institutions — named, verifiable, and substantiated against the volume figures presented.

03 · Schema

Schema documentation

Field-level schema for records, expert graph, institution graph, and multimedia pairings — what arrives, how it is shaped, and how it is keyed.

04 · Records

Representative records sample

A curated sample drawn from the medical cohort — real records with real attribution, embargo dates, and paired multimedia, for direct technical inspection.

05 · Evaluation kit

Evaluation kit for technical teams

A working subset structured for the licensee's own benchmarks — provenance-aware fine-tuning, citation grounding, and multimodal alignment — under terms that credit the evaluation forward into production.

To request the NDA and review the diligence package, please get in touch.

hello@datawiselabs.ai