Your data has structure your schema doesn't capture.

We find it.

Database Whisper discovers the structural meaning hidden in your data — in structured tables and in unstructured text. No training data. No domain configuration. One pip install. It maps what your current systems miss.

96.5% precision on clinical text
3 domains validated
40,291 instances mapped
20 lines of algorithm

The Problem: Keywords Don't Map Meaning

A doctor writes "positive" in a clinical note. Your search returns 657 results. But positive what?

Address 1: Consult • history • none • early

"Family history: Positive for thyroid disease"

Meaning: a relative has a condition — hereditary risk flag

Address 2: Consult • blood • being • early

"Blood cultures were positive ESBL Klebsiella"

Meaning: pathogen detected — lab result

Address 3: Consult • normal • none • early

"Positive Rinne test"

Meaning: clinical test produced a finding — exam result

Same word. Three meanings. Three structural addresses.
No medical ontology. No training data. Discovered from the text.
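
To make the idea of an address concrete, here is a minimal sketch that stores the three examples above as four-slot tuples and groups sentences by address. The Address class and its slot names are hypothetical, and the mapping of the bulleted values onto those slots (specialty, paired concept, verb frame, clause position) is an assumption. It is not Database Whisper's internal representation.

```python
from collections import defaultdict
from typing import NamedTuple


class Address(NamedTuple):
    """Hypothetical container for the four address slots (names assumed)."""
    specialty: str
    paired_concept: str
    verb_frame: str
    clause_position: str


# The three example sentences, keyed by the addresses listed above.
examples = [
    (Address("Consult", "history", "none", "early"),
     "Family history: Positive for thyroid disease"),
    (Address("Consult", "blood", "being", "early"),
     "Blood cultures were positive ESBL Klebsiella"),
    (Address("Consult", "normal", "none", "early"),
     "Positive Rinne test"),
]

# Group sentences by address: one bucket per distinct address.
by_address = defaultdict(list)
for address, sentence in examples:
    by_address[address].append(sentence)

for address, sentences in by_address.items():
    print(" • ".join(address), "->", sentences)
```

Because the three addresses differ only in the second and third slots, the same keyword lands in three separate buckets.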

Head-to-Head: Structural Addresses vs. Embeddings

Task: find "positive" instances that are lab results. 657 instances, 287 actual lab results.

Method                     | Retrieved | Precision | Recall | F1    | Interpretable?
Keyword ("positive")       | 657       | 43.7%     | 100%   | -     | No
Embedding (best threshold) | 575       | 46.8%     | 93.7%  | 62.4% | No
DW Structural Address      | 255       | 96.5%     | 85.7%  | 90.8% | Yes

The embedding can't tell you why two uses differ.
The structural address can: specialty + co-occurring term + verb frame + clause position.
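
The F1 figures in the comparison above can be cross-checked from precision and recall with the standard harmonic-mean formula. This short sketch uses the rounded percentages as published, so the results are only good to about one decimal place. The f1_score helper is written here for illustration and is not part of the library.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) as fractions, copied from the comparison above.
rows = {
    "Keyword": (0.437, 1.000),
    "Embedding (best threshold)": (0.468, 0.937),
    "DW Structural Address": (0.965, 0.857),
}

for method, (precision, recall) in rows.items():
    print(f"{method}: F1 = {f1_score(precision, recall):.1%}")
```

The embedding and DW rows reproduce the published 62.4% and 90.8%. The keyword row, which has no published F1, works out to roughly 60.8% from its rounded precision and recall.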

Meaning-Addressed Retrieval for RAG

Embedding-based RAG retrieval is sense-blind. It can't tell "positive lab result" from "positive family history." DW addresses can.

13-query cross-domain comparison: clinical, biblical, and legal text.

Method             | Precision | F1
Keyword            | 72.5%     | 82.7%
Embedding (best k) | 73.5%     | 82.9%
DW Address         | 92.5%     | 91.6%

+8.7 F1 points over embedding retrieval (91.6% vs. 82.9%). Zero training. Zero tuning.

MeaningIndex API

Build a sense-aware retrieval index in four lines:

import database_whisper as dw

index = dw.MeaningIndex(records, text_field="text",
                        concepts=["positive", "discharge", "failure"])

# Retrieve by structural sense
results = index.query("positive", sense_hint={"paired_concept": "blood"})
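
To show what a sense hint does conceptually, here is a toy stand-in in plain Python. It assumes each record already carries its address slots as dictionary keys. The names toy_records and toy_query are hypothetical, and this is not how MeaningIndex is implemented.

```python
from typing import Dict, List

# Toy records: the concept, one address slot, and the source text.
# In this sketch the slots are filled in by hand, not discovered.
toy_records: List[Dict[str, str]] = [
    {"concept": "positive", "paired_concept": "history",
     "text": "Family history: Positive for thyroid disease"},
    {"concept": "positive", "paired_concept": "blood",
     "text": "Blood cultures were positive ESBL Klebsiella"},
    {"concept": "positive", "paired_concept": "normal",
     "text": "Positive Rinne test"},
]


def toy_query(records: List[Dict[str, str]], concept: str,
              sense_hint: Dict[str, str]) -> List[str]:
    """Return texts for `concept` whose slots match every key in `sense_hint`."""
    return [
        r["text"] for r in records
        if r["concept"] == concept
        and all(r.get(slot) == value for slot, value in sense_hint.items())
    ]


print(toy_query(toy_records, "positive", {"paired_concept": "blood"}))
```

An empty hint falls back to plain keyword behavior and returns all three sentences. Adding a slot narrows the result to one sense.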

Compare Datasets: Structural Quality Index

Does your synthetic data preserve meaning? SQI tells you in one number.

Detect structural collapse

Synthetic data generators can reproduce surface statistics while destroying the structural relationships that give words their meaning. SQI measures whether those relationships survive.

import database_whisper as dw

# real_data and synthetic_data are the two datasets to compare
result = dw.compare(real_data, synthetic_data, text_field="text",
                    concepts=["positive", "discharge", "failure"])
print(result)  # SQI = 0.47: structural collapse detected
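
The SQI formula is not given on this page, so the sketch below does not compute SQI. It illustrates the underlying intuition with a much cruder, assumed measure: the Jaccard overlap between the (concept, co-occurring word) pairs found in two corpora. The function names and the toy sentences are hypothetical.

```python
import re
from typing import List, Set, Tuple


def concept_contexts(texts: List[str], concepts: List[str]) -> Set[Tuple[str, str]]:
    """Collect (concept, co-occurring word) pairs from each text."""
    pairs: Set[Tuple[str, str]] = set()
    for text in texts:
        words = set(re.findall(r"[a-z]+", text.lower()))
        for concept in concepts:
            if concept in words:
                pairs.update((concept, w) for w in words if w != concept)
    return pairs


def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union (1.0 if both empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


real = ["Blood cultures were positive",
        "Family history positive for thyroid disease"]
# The synthetic corpus keeps the keyword but loses one of its contexts.
synthetic = ["Blood cultures were positive",
             "Positive positive positive"]

real_pairs = concept_contexts(real, ["positive"])
synthetic_pairs = concept_contexts(synthetic, ["positive"])
print(f"context overlap = {jaccard(real_pairs, synthetic_pairs):.2f}")
```

Both toy corpora contain "positive" in every sentence, so a keyword count sees no difference. The overlap score drops because the family-history context is missing from the synthetic side.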

Same Tool. Three Domains. Three Different Maps.

The ladder changes shape because the text changes genre. The tool discovers structure — it doesn't impose it.

Clinical Notes

40,291 instances • 40 specialties • 17,844 addresses

1. specialty: 90%
2. paired concept: 80%
3. verb class: 52%
4. clause position: 39%

KJV Bible

10,192 instances • 66 books • 5,584 addresses

1. category: 82%
2. verb class: 75%
3. paired concept: 59%
4. clause position: 55%

US Constitution

216 instances • 62 sections • section alone resolves 97%

1. section: 97%

The Constitution was designed to be unambiguous. DW confirms it.

The Core Insight

Language is less about words than about the mental vocabularies people carry. When a microbiologist writes "positive," the word activates a vocabulary where it means "pathogen detected." When a family physician writes "positive" in a review of systems, the same word activates a different vocabulary where it means "symptom endorsed."

The word is identical. The vocabularies are not.

A database stores the word without storing which vocabulary was active. Database Whisper recovers it — from the structural features already present in the text.

How It Works

1. Point it at your data
   CSV, JSON, SQLite, Excel, Parquet, or raw text. DW auto-detects identity fields, provenance, and structure.
2. It discovers the map
   Greedy pair-reduction finds which fields best distinguish records that share an identity. For text, it extracts 10 linguistic features automatically.
3. You see what you're missing
   The discriminator ladder shows which features matter, in what order, and where the structural resolution limit is: the boundary where structure ends and interpretation begins.
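
The steps above do not spell out the algorithm, so the following is one plausible reading of "greedy pair-reduction", offered as a sketch and not as DW's actual code. It takes every pair of records that shares an identity value, repeatedly picks the field that separates the most still-unseparated pairs, and stops when no field helps. The function name, the field names, and the toy records are all assumptions.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Record = Dict[str, str]


def greedy_ladder(records: List[Record], identity: str,
                  fields: List[str]) -> Tuple[List[Tuple[str, int]], int]:
    """Rank fields by how many same-identity record pairs they separate.

    Returns (ladder, unresolved): ladder lists (field, pairs_separated)
    in pick order; unresolved counts pairs no field could tell apart.
    """
    # Every pair of records sharing an identity value needs separating.
    pairs = [(a, b) for a, b in combinations(records, 2)
             if a[identity] == b[identity]]
    remaining = list(fields)
    ladder: List[Tuple[str, int]] = []
    while pairs and remaining:
        # Score each unused field by the pairs it would separate right now.
        scores = {f: sum(a[f] != b[f] for a, b in pairs) for f in remaining}
        best = max(remaining, key=lambda f: scores[f])
        if scores[best] == 0:
            break  # no remaining field separates anything
        ladder.append((best, scores[best]))
        # Keep only the pairs the chosen field failed to separate.
        pairs = [(a, b) for a, b in pairs if a[best] == b[best]]
        remaining.remove(best)
    return ladder, len(pairs)


toy = [
    {"concept": "positive", "specialty": "Consult", "paired": "history"},
    {"concept": "positive", "specialty": "Consult", "paired": "blood"},
    {"concept": "positive", "specialty": "Lab", "paired": "blood"},
    {"concept": "positive", "specialty": "Lab", "paired": "blood"},
]
ladder, unresolved = greedy_ladder(toy, "concept", ["specialty", "paired"])
print(ladder, unresolved)
```

In this toy run, specialty separates four of the six pairs, paired separates one more, and one pair stays unresolved because the two records agree on every field. That leftover pair plays the role of the structural resolution limit.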

Get Started

Structured data

Profile any dataset in two lines:

pip install database-whisper

import database_whisper as dw
report = dw.profile("your_data.csv")
print(report)

Unstructured text

Discover meaning-addresses in any text field:

import database_whisper as dw

concepts = dw.auto_detect_concepts(records, text_field="notes")
instances = dw.extract_concept_instances(records, "notes", concepts)
report = dw.profile_records(instances, identity_fields=["concept"])
print(report)

GitHub  |  PyPI

Also: Gap Finder for Scientific Literature

DW applied to 250M+ papers via OpenAlex. Find where nobody has published yet.

The same algorithm that maps meaning in clinical notes can find structural gaps across scientific literature — method-topic-subfield combinations that exist in neighboring fields but not in yours. Each gap is a potential paper with zero competition.
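
As a rough illustration of what a structural gap means here, the sketch below treats each subfield as a set of published (method, topic) combinations and reports the ones that appear in neighboring subfields but not in the target. The data is invented for the example and find_gaps is a hypothetical helper. It does not call OpenAlex or reflect how Gap Finder is built.

```python
from typing import Dict, Set, Tuple

Combo = Tuple[str, str]  # (method, topic)

# Invented example data: published combinations per subfield.
published: Dict[str, Set[Combo]] = {
    "cardiology": {("deep learning", "risk prediction"),
                   ("survival analysis", "readmission")},
    "oncology": {("deep learning", "risk prediction")},
}


def find_gaps(published: Dict[str, Set[Combo]], target: str) -> Set[Combo]:
    """Combinations seen in other subfields but absent from `target`."""
    neighbors: Set[Combo] = set().union(
        *(combos for field, combos in published.items() if field != target)
    )
    return neighbors - published.get(target, set())


print(find_gaps(published, "oncology"))
```

With this data, oncology has one gap (survival analysis applied to readmission) and cardiology has none.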

Free during beta. No spam.

The Paper

Finding the Limits of Meaning Within Textual Structures

Where Text Ends and Interpretation Begins

Data is exploding. Funded annotation efforts are not. We present a method for computing structural resolution limits from text — automatically, without training data, without domain expertise. Validated on clinical notes, the KJV Bible, and the US Constitution.

Coming soon on arXiv.