Auditable AI Research: a family tech purchase, methodically resolved

The Problem

Research produced with the help of an AI language model carries a specific, well-understood hazard, and naming it plainly is the right place to begin. A language model is very good at producing text that reads as authoritative. It writes in fluent, confident prose. It formats citations correctly. It names sources, attaches numbers, and supplies dates. The result looks like the output of careful research. The difficulty is that all of those surface features can attach to a claim that isn't grounded in anything — a source that doesn't exist, a quotation that was never written, a fact that's quietly wrong.

This is the part of the problem a reader can't solve by reading carefully. On the page, a grounded claim and an ungrounded one look identical. The reader has no way, from the text alone, to tell which is which.

That's the trust problem that has to be solved before AI-assisted research is useful to anyone making a real decision, whether the work behind that decision is a regulatory analysis, a grant evaluation, or, as here, a comparison for a household calendar purchase. The methodology's job is not to ask the reader to trust that the research avoided the hazard; it is to make grounding something the reader can independently check.

The Approach

The research moved through a fixed sequence of stages rather than open-ended searching. Each stage produces an artifact the next stage depends on, and the final deliverable carries the full chain back to its sources.

Frame the question. The household had a real decision to make: which home-display calendar to buy. The criteria were named explicitly — custom-content capability, real-purchaser experience, price, privacy posture — before any source was opened.
Survey the source landscape. Twenty-five candidate sources were identified across five products, spanning manufacturer documentation, independent long-form reviews, community discussion, and developer reference material. Each candidate URL was recorded with its expected role before retrieval. All twenty-five were verified: nineteen became the numbered load-bearing citations, and the remaining six served as corroborating cross-references behind those entries, verified but not load-bearing on their own.
Verify and archive every source. Each candidate was retrieved, parsed, and stored as verified content. A source can't be cited unless its actual content was captured and is checkable. Three sources were initially bot-blocked and were re-fetched through a browser-emulation tool. One search-result candidate turned out to point at an unrelated company sharing the brand name; rather than being quietly dropped, it was flagged and retained in the record as a finding (citation [19]) so the collision is visible to any reader.
Build claims with citation chains. Every load-bearing claim in the comparison was bound to a specific passage of verified source content. Where evidence was thin or absent, the comparison says so rather than reaching for the nearest plausible filler.
Audit the deliverable. Before publication, the writeup was scanned mechanically for internal terminology leakage, audited for stance (no advocacy beyond evidence; no sales posture), and reviewed for honest disclosure of what the work does not guarantee.
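
As an illustration of the verify-and-archive stage above, here is a minimal sketch in Python. The names (`ArchivedSource`, `archive_source`, `matches_archive`) and the use of a SHA-256 fingerprint are assumptions made for this example; the text does not say how captured content was actually stored.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ArchivedSource:
    """Hypothetical record for one verified source."""
    url: str
    expected_role: str  # recorded before retrieval
    content: str        # the text actually retrieved from the source
    sha256: str         # fingerprint of the captured content (assumed design)
    captured_at: str    # ISO 8601 timestamp in UTC


def archive_source(url: str, expected_role: str, content: str) -> ArchivedSource:
    """Store retrieved content with a hash so it can be re-checked later."""
    if not content.strip():
        # Nothing was captured, so the source must not be cited.
        raise ValueError(f"nothing captured for {url}; source cannot be cited")
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    captured_at = datetime.now(timezone.utc).isoformat()
    return ArchivedSource(url, expected_role, content, digest, captured_at)


def matches_archive(record: ArchivedSource, content: str) -> bool:
    """Return True if the given text is exactly what was archived."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest() == record.sha256
```

The point of the fingerprint in this sketch is that a later reader can confirm the archived text has not drifted from what was captured. It says nothing about whether the source itself is correct.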

The Build

Metric	Value
Sources Verified	25
Citation Chain Entries	19
Products Compared	5
Criteria Evaluated	4

The research draws on a deliberately diverse source set: official product documentation, third-party long-form reviews, community discussion threads, developer reference material, and cross-product comparison pieces. The source distribution itself is part of the rigor — relying only on manufacturers would tilt the writeup toward marketing claims; relying only on reviewers would miss the technical capability questions that anchored the decision.

Source category	Examples	What it provides
Manufacturer primary	Official product pages; developer documentation	Baseline specs; declared capability; pricing
Independent long-form review	2026 hands-on reviews on dedicated review sites	Real-world usage; pros and cons under household conditions
Community / vendor forums	Vendor community threads; multi-year comment trails	Friction patterns; longitudinal evidence of vendor behavior
Comparison aggregator	Multi-product 2026 buying guides	Cross-product positioning; market context
Own synthesis	The comparison, the recommendation, the caveats	The research's own work, marked as such

The finding that demonstrates the method. One of the five products initially appeared, from a fast search, to have an official developer API — the documentation looked legitimate and matched what one would expect. Reading the actual page content revealed it was the documentation of a different company that happens to share the product's brand name. A faster process would have confidently asserted an API that doesn't exist. Source verification caught it. This is what the methodology is for.

The Outcome

The research produced a recommendation structured as two paths, depending on which household need is load-bearing. Five products were evaluated: three were ruled out on specific evidence, and for each path one of the two remaining finalists is recommended. The comparison itself is published as a sibling page; every assertion in it carries a citation link to a verifiable source.

Live Citation Links

Load-bearing Claims Sourced	100%
Time to Audit the Work	<1 hr

The auditability claim is the substantive output, not just a feature of the page. Each citation entry shows the path from the claim in the body, to the citation marker, to the entry on the citations page, to the live source URL. A reader doesn't have to accept that the path exists; they can walk it for any single claim. This discipline was built on harder problems (regulatory analysis, legal-support research); bringing it to a household decision is a deliberate demonstration that the verification structure isn't reserved for high-stakes work.


Technical details: how the verification structure works

For readers who want the methodology's working parts rather than just the surface, this appendix describes the structure at category level. The same structure is applied to harder projects in our portfolio.

The three-tier evidence standard

Every source the research draws on sits in one of three tiers, based on how its content was obtained and how independently verifiable it is:

Verified source material — the source's actual substantive content was retrieved and recorded. Only this tier is allowed to carry a load-bearing claim.
Orientation material — live web search results used for finding candidates and corroborating. Useful for orientation; never load-bearing on its own.
Unverifiable recall — a language model's unaided memory of what a source says. Never acceptable as a citation, in any project.
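
The three tiers above can be read as a small permission table. The following Python sketch encodes that reading; the enum and function names are hypothetical, chosen for illustration only.

```python
from enum import Enum


class EvidenceTier(Enum):
    VERIFIED = "verified source material"
    ORIENTATION = "orientation material"
    RECALL = "unverifiable recall"


def may_carry_load_bearing_claim(tier: EvidenceTier) -> bool:
    """Only verified source material may carry a load-bearing claim."""
    return tier is EvidenceTier.VERIFIED


def may_corroborate(tier: EvidenceTier) -> bool:
    """Orientation material can help find and corroborate; recall never counts."""
    return tier in (EvidenceTier.VERIFIED, EvidenceTier.ORIENTATION)
```

Under this sketch, a search-result snippet could support the hunt for a source but could not stand behind a claim by itself, and a model's unaided memory passes neither check.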

The two verification gates

Two gates are applied before any claim or citation is allowed into the deliverable:

Ground-or-flag. Every factual assertion is either grounded against a verified source or explicitly marked as inference. The reader is never left to guess which is which.
Pre-citation gate. Three checks before a citation commits: that the source actually exists, that its content is substantive rather than thin, and (the one that matters most) that the cited passage topically matches the specific claim it's attached to. The third check is what caught the brand-name-collision finding described above.
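
The two gates above might be sketched in Python as follows. Everything here is illustrative: the `Assertion` type, the `MIN_WORDS` threshold for "substantive", and especially the keyword-overlap test for topical match are assumptions made for the example. In practice the topical check calls for a reader's judgment; a crude term match stands in for it here.

```python
import re
from dataclasses import dataclass
from typing import Optional

MIN_WORDS = 50  # hypothetical cutoff for "substantive rather than thin"


@dataclass
class Assertion:
    text: str
    source_passage: Optional[str] = None  # verified passage backing the claim
    is_inference: bool = False


def ground_or_flag(assertion: Assertion) -> str:
    """Gate 1: every assertion is grounded or explicitly marked as inference."""
    if assertion.source_passage:
        return "grounded"
    if assertion.is_inference:
        return "inference"
    raise ValueError(f"neither grounded nor flagged: {assertion.text!r}")


def pre_citation_gate(
    source_content: Optional[str], passage: str, claim_terms: set[str]
) -> list[str]:
    """Gate 2: return the failed checks; an empty list means the citation may commit."""
    if source_content is None:
        return ["source does not exist"]
    failures = []
    if len(source_content.split()) < MIN_WORDS:
        failures.append("content is thin")
    # Crude stand-in for topical match: every key term of the claim
    # must appear in the cited passage.
    passage_words = set(re.findall(r"[a-z0-9]+", passage.lower()))
    if not {term.lower() for term in claim_terms} <= passage_words:
        failures.append("passage does not match claim")
    return failures
```

Returning the list of failures, rather than a single boolean, keeps the reason for a rejection visible. A source that exists and is substantive can still fail on the third check alone, which is the shape of the brand-name collision described earlier.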

The chain of custody

For any cited claim, the chain is a single visible path: the claim in the body, the citation marker attached to it, the entry on the citations page, the live source URL the entry links to. A reader does not have to accept that this path exists — they can walk it.
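
To make the path concrete, here is a minimal Python sketch of walking the chain for one claim. The `Claim` and `CitationEntry` types and the `walk_chain` function are hypothetical names for this example. The sketch only checks that the link is well-formed; confirming that a URL is actually live would need a network request, which is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CitationEntry:
    marker: str      # the marker as it appears in the body, e.g. "[19]"
    source_url: str  # the URL the citations-page entry links to


@dataclass(frozen=True)
class Claim:
    text: str
    marker: str


def walk_chain(claim: Claim, citations: dict[str, CitationEntry]) -> list[str]:
    """Return the four steps of the chain, or raise if a link is missing."""
    entry = citations.get(claim.marker)
    if entry is None:
        raise LookupError(f"no citations-page entry for marker {claim.marker}")
    if not entry.source_url.startswith(("http://", "https://")):
        raise ValueError(f"entry {entry.marker} has no usable source URL")
    return [
        claim.text,
        claim.marker,
        f"citations page entry {entry.marker}",
        entry.source_url,
    ]
```

A broken chain raises an error instead of returning a partial path, mirroring the rule that a claim without a complete path should not appear as cited.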

What this methodology does not guarantee

No methodology removes every source of error, and presenting one as though it did would contradict the standard the rest of this appendix describes. What this methodology guarantees is bounded:

It establishes that load-bearing claims rest on verified source content.
It establishes that inferences are labelled as inferences.
It establishes that every citation can be audited by the reader.
It does not establish that the underlying sources are themselves correct on the merits.
It does not establish that fast-moving facts haven't changed since their source was captured.
It does not establish that coverage is complete.

Disclosing these limits is part of the methodology, not a departure from it.

Why this applies to a household purchase

The methodology was built on harder problems — regulatory analysis, legal-support research, AI safety landscape work — where verification matters because outside observers can't easily check the work themselves. Bringing the same structure to a consumer-tech decision is a demonstration: if the methodology works, it should produce a writeup that a partner, collaborator, or curious reader can verify in under an hour, without needing to be a domain expert. That property is itself the differentiator.

Governed AI craftsmanship, applied where the stakes are low and the rigor is high

We build research deliverables with verification structure as a first-class feature — not because every reader will audit every claim, but because the option to do so is what makes AI-assisted work trustworthy in the first place. If you have a research question, a comparison, or a domain audit that needs to stand up to inspection, let's talk.
