Your data has structure your schema doesn't capture.

We find it.

Database Whisper discovers the structural meaning hidden in your data — in structured tables and in unstructured text. No training data. No domain configuration. One pip install. It maps what your current systems miss.

The Problem: Keywords Don't Map Meaning

A doctor writes "positive" in a clinical note. Your search returns 657 results. But positive what?

Address 1: Consult • history • none • early

"Family history: Positive for thyroid disease"

Meaning: a relative has a condition — hereditary risk flag

Address 2: Consult • blood • being • early

"Blood cultures were positive ESBL Klebsiella"

Meaning: pathogen detected — lab result

Address 3: Consult • normal • none • early

"Positive Rinne test"

Meaning: clinical test produced a finding — exam result

Same word. Three meanings. Three structural addresses.
No medical ontology. No training data. Discovered from the text.
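
To make the idea concrete, here is a minimal sketch of the three addresses above as plain tuples. It does not use Database Whisper itself: the `Address` type and its field names are illustrative assumptions, and the addresses are typed in by hand rather than discovered from text.

```python
from collections import namedtuple

# Hypothetical container for a structural address; the field names are
# assumptions chosen to mirror the four address slots shown above.
Address = namedtuple(
    "Address", "specialty paired_concept verb_class clause_position"
)

# The three example sentences with their addresses, copied by hand.
notes = [
    (Address("Consult", "history", "none", "early"),
     "Family history: Positive for thyroid disease"),
    (Address("Consult", "blood", "being", "early"),
     "Blood cultures were positive ESBL Klebsiella"),
    (Address("Consult", "normal", "none", "early"),
     "Positive Rinne test"),
]

# Keyword search cannot separate the senses: every sentence matches.
keyword_hits = [text for _, text in notes if "positive" in text.lower()]

# Filtering on one address slot isolates the lab-result sense.
lab_hits = [text for addr, text in notes if addr.paired_concept == "blood"]

print(len(keyword_hits), "keyword hits")
print(lab_hits)
```

The keyword filter returns all three sentences, while the address filter returns only the blood-culture note.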

Head-to-Head: Structural Addresses vs. Embeddings

Task: find "positive" instances that are lab results. 657 instances, 287 actual lab results.

Method	Retrieved	Precision	Recall	F1	Interpretable?
Keyword ("positive")	657	43.7%	100%	—	No
Embedding (best threshold)	575	46.8%	93.7%	62.4%	No
DW Structural Address	255	96.5%	85.7%	90.8%	Yes

The embedding can't tell you why two uses differ.
The structural address can: specialty + co-occurring term + verb frame + clause position.
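
The table's figures can be checked with a few lines of arithmetic. This is a sketch, not part of the published evaluation: the true-positive counts below are not stated in the table and are assumptions back-derived from the reported precision and retrieved counts (287, 269, and 246 respectively).

```python
def prf(true_positives, retrieved, relevant):
    """Return (precision, recall, F1) from raw counts."""
    precision = true_positives / retrieved
    recall = true_positives / relevant
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

RELEVANT = 287  # actual lab results among the 657 instances

# method -> (assumed true positives, retrieved count from the table)
rows = {
    "Keyword": (287, 657),
    "Embedding": (269, 575),
    "DW Structural Address": (246, 255),
}

scores = {
    name: prf(tp, retrieved, RELEVANT)
    for name, (tp, retrieved) in rows.items()
}

for name, (p, r, f1) in scores.items():
    print(f"{name}: precision={p:.1%} recall={r:.1%} F1={f1:.1%}")
```

Under these assumed counts the precision, recall, and F1 values match the table to one decimal place, and the keyword baseline's F1, which the table leaves blank, works out to roughly 60.8%.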

Meaning-Addressed Retrieval for RAG

Embedding-based RAG retrieval is sense-blind. It can't tell "positive lab result" from "positive family history." DW addresses can.

13-query cross-domain comparison: clinical, biblical, and legal text.

Method	Precision	F1
Keyword	72.5%	82.7%
Embedding (best k)	73.5%	82.9%
DW Address	92.5%	91.6%

Nearly 9 F1 points over embedding retrieval (91.6 vs. 82.9). Zero training. Zero tuning.

MeaningIndex API

Build a sense-aware retrieval index in four lines:

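import database_whisper as dw

# `records` is your own collection of records; each one carries a "text" field
index = dw.MeaningIndex(records, text_field="text",
                        concepts=["positive", "discharge", "failure"])

# Retrieve by structural sense: only uses of "positive" paired with "blood"
results = index.query("positive", sense_hint={"paired_concept": "blood"})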

Compare Datasets: Structural Quality Index

Does your synthetic data preserve meaning? SQI tells you in one number.

Detect structural collapse

Synthetic data generators can reproduce surface statistics while destroying the structural relationships that give words their meaning. SQI measures whether those relationships survive.

result = dw.compare(real_data, synthetic_data, text_field="text",
                    concepts=["positive", "discharge", "failure"])
print(result)  # SQI = 0.47 — structural collapse detected

Same Tool. Three Domains. Three Different Maps.

The discriminator ladder, the ranked list of features that separate one sense from another, changes shape as the genre of the text changes. The tool discovers structure; it doesn't impose it.

Clinical Notes

40,291 instances • 40 specialties • 17,844 addresses

specialty 90% • paired concept 80% • verb class 52% • clause position 39%

KJV Bible

10,192 instances • 66 books • 5,584 addresses

US Constitution

216 instances • 62 sections • section alone resolves 97%

section 97%

The Constitution was designed to be unambiguous. DW confirms it.

The Core Insight

Language is less about words than about the mental vocabularies people carry. When a microbiologist writes "positive," the word activates a vocabulary where it means "pathogen detected." When a family physician writes "positive" in a review of systems, the same word activates a different vocabulary where it means "symptom endorsed."

The word is identical. The vocabularies are not.

A database stores the word without storing which vocabulary was active. Database Whisper recovers it — from the structural features already present in the text.

How It Works

Point it at your data

CSV, JSON, SQLite, Excel, Parquet, or raw text. DW auto-detects identity fields, provenance, and structure.

It discovers the map

Greedy pair-reduction finds which fields best distinguish records that share an identity. For text, it extracts 10 linguistic features automatically.

You see what you're missing

The discriminator ladder shows which features matter, in what order, and where the structural resolution limit is — the boundary where structure ends and interpretation begins.
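
The steps above can be sketched in a few lines. This is one plausible reading of "greedy pair-reduction", not DW's actual implementation: the function name `discriminator_ladder`, the record layout, and the toy data are all assumptions made for illustration.

```python
from itertools import combinations

def count_splits(field, pairs):
    """Number of record pairs whose values differ on `field`."""
    return sum(1 for a, b in pairs if a[field] != b[field])

def discriminator_ladder(records, identity_field, candidate_fields):
    """Greedily rank fields by how many same-identity pairs they separate."""
    # Group records by identity, then list every pair inside each group.
    groups = {}
    for rec in records:
        groups.setdefault(rec[identity_field], []).append(rec)
    unresolved = [pair for group in groups.values()
                  for pair in combinations(group, 2)]
    total = len(unresolved)

    ladder = []
    remaining = list(candidate_fields)
    while unresolved and remaining:
        # Choose the field that separates the most still-unresolved pairs.
        best = max(remaining, key=lambda f: count_splits(f, unresolved))
        if count_splits(best, unresolved) == 0:
            break  # resolution limit: no field separates what is left
        unresolved = [(a, b) for a, b in unresolved if a[best] == b[best]]
        remaining.remove(best)
        # Record the cumulative share of pairs resolved so far.
        ladder.append((best, 1 - len(unresolved) / total))
    return ladder

# Toy records (hypothetical values) that all share the identity "positive".
records = [
    {"concept": "positive", "specialty": "Consult", "paired": "history", "verb": "none"},
    {"concept": "positive", "specialty": "Consult", "paired": "blood", "verb": "being"},
    {"concept": "positive", "specialty": "Consult", "paired": "normal", "verb": "none"},
    {"concept": "positive", "specialty": "Lab", "paired": "blood", "verb": "being"},
]

ladder = discriminator_ladder(records, "concept", ["specialty", "paired", "verb"])
for field, resolved in ladder:
    print(f"{field}: {resolved:.0%} of pairs resolved")
```

On this toy data the paired concept is picked first because it separates five of the six pairs, and specialty resolves the one that remains. Any pairs still unresolved when the loop stops correspond to the resolution limit described above.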

Get Started

Structured data

Profile any dataset in two lines:

pip install database-whisper

import database_whisper as dw
report = dw.profile("your_data.csv")
print(report)

Unstructured text

Discover meaning-addresses in any text field:

import database_whisper as dw

concepts = dw.auto_detect_concepts(records, text_field="notes")
instances = dw.extract_concept_instances(records, "notes", concepts)
report = dw.profile_records(instances, identity_fields=["concept"])
print(report)

GitHub | PyPI

Also: Gap Finder for Scientific Literature

DW applied to 250M+ papers via OpenAlex. Find where nobody has published yet.

The same algorithm that maps meaning in clinical notes can find structural gaps across scientific literature — method-topic-subfield combinations that exist in neighboring fields but not in yours. Each gap is a potential paper with zero competition.
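
As a rough illustration of what a structural gap means here, the sketch below treats a gap as a method-topic combination that appears in neighboring subfields but not in your own. It is a simplification under stated assumptions: the record fields, the `find_gaps` helper, and the sample papers are hypothetical, and nothing here calls OpenAlex or the Gap Finder itself.

```python
def find_gaps(papers, my_subfield, neighbor_subfields):
    """Method-topic pairs seen in neighboring subfields but not in mine."""
    mine = {
        (p["method"], p["topic"])
        for p in papers if p["subfield"] == my_subfield
    }
    nearby = {
        (p["method"], p["topic"])
        for p in papers if p["subfield"] in neighbor_subfields
    }
    return sorted(nearby - mine)

# Hypothetical paper records, invented for this example.
papers = [
    {"subfield": "cardiology", "method": "survival analysis", "topic": "readmission"},
    {"subfield": "oncology", "method": "survival analysis", "topic": "readmission"},
    {"subfield": "oncology", "method": "topic modeling", "topic": "clinical notes"},
    {"subfield": "nephrology", "method": "topic modeling", "topic": "clinical notes"},
]

gaps = find_gaps(papers, "cardiology", {"oncology", "nephrology"})
print(gaps)
```

Here the only gap for the hypothetical cardiology subfield is topic modeling applied to clinical notes, since the survival-analysis combination already exists there.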

Free during beta.

The Paper

Finding the Limits of Meaning Within Textual Structures

Where Text Ends and Interpretation Begins

Data is exploding. Funded annotation efforts are not. We present a method for computing structural resolution limits from text — automatically, without training data, without domain expertise. Validated on clinical notes, the KJV Bible, and the US Constitution.

Coming soon on arXiv.