Datawiselabs

Turning unstructured research into structured intelligence

AI · Research · Data

Unlock research at scale — structured, query-ready datasets from unstructured science

Datawiselabs transforms scientific, medical, and technical content into high-quality structured data using domain-adapted models, knowledge graphs, and proprietary metadata pipelines. License, query, or use these datasets to accelerate discovery and train downstream AI.

Domains: Biomedical, Climate
Target users: AI developers and research organizations
Phase I KPI: ~90% extraction F1 (goal)
Dataset scale: Millions of docs (prototype)

Prototype: federated query & metadata API

Example query: find randomized controlled studies of drug X, funded by NIH, from 2015 to 2024

Queries combine embedding search with metadata filters. Structured output includes entities, experimental outcomes, funding, methods, and provenance links back to the source.
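
As a rough illustration of how embedding search and metadata filters can be combined, the sketch below filters candidates on metadata first and then ranks the survivors by cosine similarity. It is not the Datawiselabs implementation: the `Doc` record, the `search` function, and the toy two-dimensional vectors are all assumptions made for the example, and a production system would use an approximate nearest-neighbor index rather than a full sort.

```python
import math
from dataclasses import dataclass


@dataclass
class Doc:
    # Hypothetical record shape; field names are illustrative only.
    doc_id: str
    embedding: list
    funding: str
    year: int


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(docs, query_vec, funding, year_from, year_to, top_k=10):
    """Apply metadata filters, then rank the remaining docs by similarity."""
    candidates = [
        d for d in docs
        if d.funding == funding and year_from <= d.year <= year_to
    ]
    ranked = sorted(
        candidates,
        key=lambda d: cosine(query_vec, d.embedding),
        reverse=True,
    )
    return [d.doc_id for d in ranked[:top_k]]


# Toy corpus: only "a" and "b" satisfy funder NIH and years 2015-2024.
DOCS = [
    Doc("a", [1.0, 0.0], "NIH", 2018),
    Doc("b", [0.6, 0.8], "NIH", 2020),
    Doc("c", [1.0, 0.0], "NSF", 2019),
    Doc("d", [1.0, 0.0], "NIH", 2010),
]

results = search(DOCS, [1.0, 0.0], "NIH", 2015, 2024)
```

Filtering before ranking keeps the similarity computation limited to documents that can actually be returned, which is one common way to stay within a tight latency budget.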

Latency target: 250 ms
Extraction accuracy: ~90% F1 (goal)

Automated Metadata Extraction

Extract funding sources, methods, experimental outcomes, materials, and more across PDF, HTML and multimedia.

Domain-Adapted Models

Models fine-tuned on biomedical and climate corpora for higher precision, with entity linking to controlled vocabularies.

Proprietary Metadata Pipelines

Hybrid pipelines combine model inference, rule-based validation, and knowledge-graph linking for clean, auditable outputs.
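
One way to picture the three stages (model inference, rule-based validation, knowledge-graph linking) is the minimal sketch below. Everything in it is hypothetical: a regular expression stands in for a trained extraction model, a small dictionary stands in for a knowledge graph, and the identifiers such as `funder:NIH` are invented for the example.

```python
import re

# Hypothetical stand-in for a knowledge graph: surface form -> canonical ID.
KG_INDEX = {
    "nih": "funder:NIH",
    "national institutes of health": "funder:NIH",
}


def model_extract(text):
    """Placeholder for model inference; a regex stands in for a trained extractor."""
    match = re.search(
        r"funded by (?:the )?([A-Za-z ]+?)(?:[.,]|$)", text, flags=re.IGNORECASE
    )
    return {"funding": match.group(1).strip()} if match else {}


def validate(record):
    """Rule-based validation: return a list of human-readable rule violations."""
    errors = []
    if not record.get("funding"):
        errors.append("funding: missing")
    return errors


def link(record):
    """Attach a canonical ID when the surface form is known; otherwise None."""
    key = record.get("funding", "").lower()
    return {**record, "funding_id": KG_INDEX.get(key)}


def run_pipeline(text):
    """Run extraction, validation, and linking; keep errors for auditing."""
    record = model_extract(text)
    errors = validate(record)
    linked = link(record) if not errors else record
    return {"record": linked, "errors": errors}


ok = run_pipeline("This work was funded by the National Institutes of Health.")
bad = run_pipeline("No funding statement is given.")
```

Returning the validation errors alongside the record, rather than silently dropping failures, is what makes the output auditable in this sketch.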

Phase I Objectives

Model Validation

Train and validate extraction models for biomedical & energy/climate domains.

Benchmarking

Compare against industry tagging systems using precision, recall, and F1 metrics.
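
For reference, the three metrics can be computed from sets of predicted and gold-standard extractions as in the sketch below. The tuple layout `(document, field, value)` and the sample data are assumptions for illustration, and exact-match scoring is only one of several possible matching policies.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall, and F1 over sets of extracted items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative data: each item is (document, field, value).
predicted = {
    ("doc1", "funding", "NIH"),
    ("doc1", "method", "RCT"),
    ("doc2", "funding", "NSF"),
}
gold = {
    ("doc1", "funding", "NIH"),
    ("doc1", "method", "RCT"),
    ("doc2", "funding", "DOE"),
    ("doc2", "method", "cohort"),
}

p, r, f1 = precision_recall_f1(predicted, gold)
```

Here two of three predictions are correct (precision 2/3) and two of four gold items are found (recall 1/2), giving F1 = 4/7.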

Prototype API

Deploy federated search and analytics across partner datasets with governed access.

Technology stack & IP highlights

Our platform pairs domain-adapted models with a metadata orchestration layer that applies rule-based validators and knowledge-graph linking. Key technical differentiators:

  • Multimodal ingestion (PDF, HTML, figures, tables, audio transcripts).
  • Entity normalization and semantic linking to controlled vocabularies (MeSH, UMLS, climate ontologies).
  • Provenance tracking for each extracted field to support licensing and reproducibility.
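
To make field-level provenance concrete, the sketch below stores, for each extracted value, the source document and character offsets it came from, so the value can be re-checked against the source later. The class names, the `extractor` version label, and the offset-based scheme are assumptions for the example, not a description of the actual pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    # Hypothetical provenance record: where in which source a value was found.
    source_id: str
    start: int
    end: int
    extractor: str


@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: str
    provenance: Provenance


def extract_span(source_id, text, name, phrase, extractor="demo-v0"):
    """Record a field together with the character span it was taken from."""
    start = text.find(phrase)
    if start < 0:
        return None
    span = Provenance(source_id, start, start + len(phrase), extractor)
    return ExtractedField(name, phrase, span)


def verify(field, text):
    """Check that the stored offsets still reproduce the extracted value."""
    p = field.provenance
    return text[p.start:p.end] == field.value


text = "Supported by NIH grant."
field = extract_span("doc-1", text, "funding", "NIH")
```

Because each field carries its own span, a reviewer or licensing audit can confirm a value against the source without rerunning extraction.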

Performance targets & benchmarks

  1. Extraction F1: target >= 0.88 across key fields.
  2. Ingestion throughput: prototype 10k docs/day; Phase II 100k+/day.
  3. API latency: under 250 ms for typical queries (index-backed).

Legal & licensing: content is ingested only from partners and licensed archives; metadata outputs are cleared for redistribution and dataset licensing.

Docs — Quickstart & API concept

This quickstart shows the conceptual API for prototyping queries and dataset access.

Query API (concept)

POST /api/v1/query
{
  "query": "randomized controlled trial drug X",
  "filters": {
    "funding": ["NIH"],
    "year": {"from":2015,"to":2024},
    "domain": "biomedical"
  },
  "fields":["title","authors","funding","outcome","provenance"]
}

The API returns structured JSON with normalized entities and provenance links back to the original source.
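
A client could assemble the request above along the lines of the following sketch, which uses only the Python standard library. The host name `api.example.invalid` is a placeholder, the helper `build_query_request` is hypothetical, and authentication is omitted because the concept API does not specify it. The sketch builds the request object without sending it.

```python
import json
import urllib.request

# Placeholder host; the real base URL is not given in this document.
BASE_URL = "https://api.example.invalid"


def build_query_request(query, funding, year_from, year_to, domain, fields):
    """Build (but do not send) a POST request for the concept query endpoint."""
    if year_from > year_to:
        raise ValueError("year_from must not be later than year_to")
    payload = {
        "query": query,
        "filters": {
            "funding": list(funding),
            "year": {"from": year_from, "to": year_to},
            "domain": domain,
        },
        "fields": list(fields),
    }
    return urllib.request.Request(
        BASE_URL + "/api/v1/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_query_request(
    query="randomized controlled trial drug X",
    funding=["NIH"],
    year_from=2015,
    year_to=2024,
    domain="biomedical",
    fields=["title", "authors", "funding", "outcome", "provenance"],
)
body = json.loads(request.data)
```

Passing `request` to `urllib.request.urlopen` would send it; that step is left out here because the endpoint is conceptual.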

Dataset licensing

Datasets are packaged with schema documentation and provenance. Licensing terms depend on source access agreements; we provide cleared metadata outputs for redistribution.

Privacy & Data Usage

We ingest content only under license or explicit partner agreements. Metadata outputs are reviewed and cleared for redistribution where allowed. We do not distribute full-text content unless licensed. For any sensitive or personal data discovered during extraction, we apply redaction and follow applicable laws and institutional policies.

Get early access / partner with us

We're onboarding research partners and pilot customers for Phase I — especially in biomedical and energy/climate domains.

  • Dataset licensing
  • Pilot API access
  • Research partnerships & grants