We find it.
Database Whisper discovers the structural meaning hidden in your data —
in structured tables and in unstructured text. No training data. No domain configuration.
One pip install. It maps what your current systems miss.
A doctor writes "positive" in a clinical note. Your search returns 657 results. But positive what?
"Family history: Positive for thyroid disease"
Meaning: a relative has a condition — hereditary risk flag
"Blood cultures were positive ESBL Klebsiella"
Meaning: pathogen detected — lab result
"Positive Rinne test"
Meaning: clinical test produced a finding — exam result
Same word. Three meanings. Three structural addresses.
No medical ontology. No training data. Discovered from the text.
Task: find "positive" instances that are lab results. 657 instances, 287 actual lab results.
| Method | Retrieved | Precision | Recall | F1 | Interpretable? |
|---|---|---|---|---|---|
| Keyword ("positive") | 657 | 43.7% | 100% | — | No |
| Embedding (best threshold) | 575 | 46.8% | 93.7% | 62.4% | No |
| DW Structural Address | 255 | 96.5% | 85.7% | 90.8% | Yes |
The embedding can't tell you why two uses differ.
The structural address can: specialty + co-occurring term + verb frame + clause position.
Embedding-based RAG retrieval is sense-blind. It can't tell "positive lab result" from "positive family history." DW addresses can.
13-query cross-domain comparison: clinical, biblical, and legal text.
| Method | Precision | F1 |
|---|---|---|
| Keyword | 72.5% | 82.7% |
| Embedding (best k) | 73.5% | 82.9% |
| DW Address | 92.5% | 91.6% |
+8.8 F1 over embedding retrieval. Zero training. Zero tuning.
Build a sense-aware retrieval index in four lines:
import database_whisper as dw
index = dw.MeaningIndex(records, text_field="text",
concepts=["positive", "discharge", "failure"])
# Retrieve by structural sense
results = index.query("positive", sense_hint={"paired_concept": "blood"})
Does your synthetic data preserve meaning? SQI tells you in one number.
Synthetic data generators can reproduce surface statistics while destroying the structural relationships that give words their meaning. SQI measures whether those relationships survive.
result = dw.compare(real_data, synthetic_data, text_field="text",
concepts=["positive", "discharge", "failure"])
print(result) # SQI = 0.47 — structural collapse detected
The ladder changes shape because the text changes genre. The tool discovers structure — it doesn't impose it.
40,291 instances • 40 specialties • 17,844 addresses
10,192 instances • 66 books • 5,584 addresses
216 instances • 62 sections • section alone resolves 97%
The Constitution was designed to be unambiguous. DW confirms it.
Language is less about words than about the mental vocabularies people carry. When a microbiologist writes "positive," the word activates a vocabulary where it means "pathogen detected." When a family physician writes "positive" in a review of systems, the same word activates a different vocabulary where it means "symptom endorsed."
The word is identical. The vocabularies are not.
A database stores the word without storing which vocabulary was active. Database Whisper recovers it — from the structural features already present in the text.
Profile any dataset in two lines:
pip install database-whisper
import database_whisper as dw
report = dw.profile("your_data.csv")
print(report)
Discover meaning-addresses in any text field:
import database_whisper as dw concepts = dw.auto_detect_concepts(records, text_field="notes") instances = dw.extract_concept_instances(records, "notes", concepts) report = dw.profile_records(instances, identity_fields=["concept"]) print(report)
DW applied to 250M+ papers via OpenAlex. Find where nobody has published yet.
The same algorithm that maps meaning in clinical notes can find structural gaps across scientific literature — method-topic-subfield combinations that exist in neighboring fields but not in yours. Each gap is a potential paper with zero competition.
Free during beta. No spam.
Where Text Ends and Interpretation Begins
Data is exploding. Funded annotation efforts are not. We present a method for computing structural resolution limits from text — automatically, without training data, without domain expertise. Validated on clinical notes, the KJV Bible, and the US Constitution.
Coming soon on arXiv.