There's a quiet truth about every web audit tool that's existed for the last fifteen years — Lighthouse, PageSpeed Insights, Ahrefs Site Audit, Semrush, Screaming Frog, the lot. They are one-way pipes. The crawler runs, the report renders, you read it, you (mostly) close the tab. If the report says something wrong, you have nowhere to put that. Maybe a support ticket if you're patient and the rule was egregious. Usually you just learn which findings to ignore and move on.
That model made sense when the consumer of the report was a human reading a PDF. It stops making sense the moment the consumer is an AI agent in your editor — because an agent has the codebase open, can verify a finding in seconds, and knows when the audit is wrong.
So we built the other half of the pipe.
The PageLens MCP server has a tool called report_finding_feedback. It does what the name says: when an agent reading your audit sees a finding it disagrees with, it can submit structured feedback — kind, reason, evidence, optional proposed fix — and that submission lands in our review queue. We act on it. The next person who runs the same audit gets a better report.
This sounds small. It is the most important feature we've shipped this year, and I want to explain why.
The problem with one-way audits
Every static analyser has the same failure mode: it's right most of the time, and the rest of the time it's confidently wrong in ways that erode trust. Common shapes:
- Selector-based false positives. A rule that says "this CSS class makes elements unclickable" matches a utility class that's used in a hundred places, only one of which actually breaks.
- Context-blind severity. A rule that grades every missing meta description as "critical", even on the 404 page where it doesn't matter.
- Misclassified categories. An issue tagged as SEO when it's really a PERFORMANCE regression masquerading as one.
- Fixes that don't apply. A suggestion that assumes WordPress when the site is Next.js, or vice versa.
Every single one of these is fixable. None of them get fixed in a one-way pipe, because the audit author and the audit reader never share a feedback channel.
In a normal SaaS workflow, you'd hope the customer files a support ticket, an analyst reads it, the rule gets tuned. In practice the friction is too high. Customers see a bad finding, sigh, mentally tag the tool as "noisy", and move on. The tool keeps shipping the same wrong finding to every other customer. The rule never gets fixed because nobody has the energy to write the support ticket.
The MCP changes the economics
When the consumer of the report is an AI agent inside an editor, two things change at once:
First, verification becomes free. The agent can grep the codebase, open the file the audit is complaining about, look at the actual selector, and form an opinion in milliseconds. No more "I'd love to check this but I'm not sure where to look" — the agent already has the project open.
Second, the cost of filing structured feedback drops to zero. A human writing a support ticket is a 10-minute task that nobody wants. An agent calling a tool is one tool call — and unlike a support ticket, the tool requires structured fields (kind, reason, evidence) so the submission is always actionable when it arrives.
Put those together and you get a workflow that's never existed before: the audit reader can argue back, with proof, in real time, without leaving the conversation.
Here's what it actually looks like. The user asks Claude:
"Walk through the top findings on my latest scan and propose patches. Skip anything that's actually wrong about my code."
Claude pulls the report via the MCP, walks the findings, and for most of them goes straight to a patch proposal. But occasionally it stops and says something like:
"PERF-014 flags .pointer-events-auto as making elements unclickable, but in this codebase it's a Tailwind utility used in 47 different components — none of which are unclickable. I'm filing this as a false positive with the evidence."
Then it calls report_finding_feedback with kind=FALSE_POSITIVE, a one-paragraph reason, and a snippet from the actual DOM. The submission lands in our admin queue tagged with the agent's model identifier (e.g. claude-3.7-sonnet), the OAuth client (e.g. cursor-mcp), and the finding id. A human reviewer at PageLens looks at it, decides whether to accept, and tunes the rule.
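For concreteness, here's a sketch of what that submission could look like from the agent's side. The field names follow the table later in this post; the `call_tool` line is a stand-in for whatever MCP client wrapper is in play, and the snippet values are invented for illustration.

```python
# Sketch of an agent-side report_finding_feedback submission.
# Field names match the feedback schema; values are illustrative only.
feedback = {
    "finding_id": "PERF-014",           # the finding being disputed
    "kind": "FALSE_POSITIVE",
    "reason": (
        "The .pointer-events-auto selector is a Tailwind utility used in "
        "47 components; none of them are rendered unclickable."
    ),
    "evidence": '<button class="pointer-events-auto" onclick="save()">Save</button>',
    "agent_model": "claude-3.7-sonnet",  # volunteered, optional
}

# Hypothetical MCP client call; the actual wrapper depends on your client:
# session.call_tool("report_finding_feedback", arguments=feedback)
```

The point is that every field is structured: a reviewer receiving this knows exactly which rule fired, why the agent objects, and what it saw in the DOM.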
The user, meanwhile, gets on with their work. The conversation flows uninterrupted. The bad finding — and any of its siblings on other customers' sites — is on a path to being fixed.
Why this is different from "rate the answer"
Every chatbot has had a thumbs-up / thumbs-down button next to the answer for two years now. Those buttons do not move the product forward, because the signal they generate is too thin. "This was bad" tells you nothing actionable. "This selector matched 47 elements and broke none of them, here's the relevant Tailwind class" gives you a concrete change to make.
The shape we settled on after a few iterations is five kind values, two required prose fields, and one optional structured field:
| Field | Purpose |
| --- | --- |
| kind | One of FALSE_POSITIVE, INCORRECT_SEVERITY, INCORRECT_CATEGORY, NOT_ACTIONABLE, OTHER. |
| reason | 20–1,000 chars of plain-English why this is wrong. |
| evidence | 20–2,000 chars of what the agent actually saw — DOM snippet, header value, code excerpt. |
| proposed_severity | Optional. With kind=INCORRECT_SEVERITY, the severity the agent thinks it should be. |
| agent_model | Optional. The agent volunteers which model produced the report. |
Two things matter about this shape:
- Evidence is required. You can't submit "this is wrong" with no backing. The minimum length of 20 chars on both prose fields keeps drive-by submissions out of the queue.
- Model identifier is volunteered. When agent_model is filled in, we can group accepted disagreements by which model produced them — which means we can spot patterns like "Claude 3.7 catches our SEO false positives more reliably than the older Cursor models". That data is gold for tuning the next generation of rules.
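To make those constraints concrete, here's roughly how server-side validation of a submission could look. This is a sketch, not our actual implementation: the kind values and length bounds come from the table above, while the severity scale is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

KINDS = {"FALSE_POSITIVE", "INCORRECT_SEVERITY", "INCORRECT_CATEGORY",
         "NOT_ACTIONABLE", "OTHER"}
# Assumed severity scale; the real one may differ.
SEVERITIES = {"CRITICAL", "HIGH", "MEDIUM", "LOW", "INFO"}

@dataclass
class FindingFeedback:
    kind: str
    reason: str
    evidence: str
    proposed_severity: Optional[str] = None
    agent_model: Optional[str] = None

    def validate(self) -> list[str]:
        """Return a list of validation errors; empty means the submission is accepted into the queue."""
        errors = []
        if self.kind not in KINDS:
            errors.append(f"unknown kind: {self.kind}")
        if not 20 <= len(self.reason) <= 1000:
            errors.append("reason must be 20-1000 chars")
        if not 20 <= len(self.evidence) <= 2000:
            errors.append("evidence must be 20-2000 chars")
        if self.proposed_severity is not None:
            if self.kind != "INCORRECT_SEVERITY":
                errors.append("proposed_severity only applies to kind=INCORRECT_SEVERITY")
            elif self.proposed_severity not in SEVERITIES:
                errors.append(f"unknown severity: {self.proposed_severity}")
        return errors
```

The minimum lengths do real work here: an agent can't clear the bar with "wrong" or "bad rule", so everything that reaches the queue carries at least a sentence of justification and a sentence of evidence.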
What happens after the agent submits
Submissions go into a queue at /admin/finding-feedback (admins only — that's me and Anna). Each row shows:
- The OAuth client name, the agent model, the user, and the timestamp.
- The original finding (rule id, severity, category, page).
- The agent's reason and evidence, rendered as plain text in visually quoted blocks so an admin can never confuse user-supplied prose with our own.
- Four buttons: Accept, Reject, Needs more info, Mark resolved.
The reviewer reads the submission, opens the relevant rule definition in our repo, and either tunes it (more specific selector, narrower severity criteria, better category mapping) or rejects with a one-line reason that gets surfaced back to the user. The cycle from "agent submits" to "rule is better for everyone" is usually under a working day for the cases where the agent is right.
We deliberately don't auto-act on submissions. The agent's job is to flag — the human's job is still to decide. Two reasons for that:
- Adversarial robustness. A misbehaving agent (or a deliberately malicious one) shouldn't be able to silence true findings just by submitting "this is wrong" enough times. The UNIQUE constraint on (finding, user) plus per-token / per-scan rate limits make flooding hard, but the human-in-the-loop is the real defence.
- Rule tuning is a craft. A rule that fires too often on one site might be perfectly calibrated for ninety others. Deciding the change requires looking at the rule's full population of hits — that's reviewer judgement, not a single submission's say-so.
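The anti-flooding mechanics above can be sketched in a few lines. This is an illustrative model, not PageLens's actual implementation: in production the duplicate check is a database UNIQUE constraint and the caps are per-token / per-scan limits in the API layer, but the logic is the same shape.

```python
from collections import defaultdict

class FeedbackGate:
    """Illustrative guard: one submission per (finding, user),
    plus a simple per-user hourly cap on submissions."""

    def __init__(self, max_per_hour: int = 20):
        self.seen = set()                  # (finding_id, user_id) pairs already used
        self.window = defaultdict(list)    # user_id -> recent submission timestamps
        self.max_per_hour = max_per_hour

    def allow(self, finding_id: str, user_id: str, now: float) -> bool:
        key = (finding_id, user_id)
        if key in self.seen:
            return False                   # duplicate dispute of the same finding
        recent = [t for t in self.window[user_id] if now - t < 3600]
        if len(recent) >= self.max_per_hour:
            return False                   # user (or their agent) is flooding the queue
        self.seen.add(key)
        recent.append(now)
        self.window[user_id] = recent
        return True
```

Even with both checks in place, the gate only limits volume; deciding whether a dispute is right remains the reviewer's job.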
Why this is the moat
I want to be careful here, because "we have a feedback loop" is the kind of thing every product page eventually claims. The thing that makes this one different is who is on the giving end of the feedback.
A traditional feedback loop is: customer fills in a form, ops team triages, eventually it makes its way to engineering. Maybe 1% of dissatisfied customers ever fill in the form. The feedback you get is biased toward the loudest complainers and the most catastrophic failures. The long tail of "mildly wrong findings on routine scans" never reaches you.
The agent-as-reviewer loop inverts that. The friction is so low that every scan an agent reads can produce structured disagreements when the agent has them. We're not collecting opinions from the 1% of customers who could be bothered to file a ticket — we're collecting evidence-backed, structured disputes from every audit-meets-codebase pairing in the wild.
The compounding effect is: the more agents read PageLens reports, the more disagreements we see, the faster we tune the rules, the better the next report is, the more useful the agent's next session is. It's a flywheel that no static audit tool can have, because no static audit tool has the bidirectional connection in the first place.
What we're going to do with the data
A few things, in roughly priority order:
- Per-rule false-positive rate. When we have enough accepted submissions to be statistically meaningful, we'll start tracking per-rule FP rates and surface the top offenders to ourselves on a weekly basis. The rules that misfire most often get rewritten or retired.
- Per-model accuracy patterns. With agent_model filled in on most submissions, we'll get to see which models tend to be right when they disagree with us. That informs both our trust ranking when reviewing submissions, and a future "this agent has a strong false-positive track record on this rule type" signal in the admin queue.
- Public changelog of accepted disputes. Once the volume justifies it, we'll publish a /changelog page that lists every rule that's been tuned and the agent-reported submission that triggered it. That's both a transparency commitment and, frankly, marketing — "see what your agent's feedback actually changed" is a much more compelling story than "see our latest release notes".
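The first two analyses reduce to simple aggregation over accepted submissions. A toy sketch, with invented records, made-up model names, and assumed per-rule fire counts:

```python
from collections import Counter

# Toy accepted-feedback records; the real ones come out of the review queue.
accepted = [
    {"rule": "PERF-014", "kind": "FALSE_POSITIVE", "agent_model": "claude-3.7-sonnet"},
    {"rule": "PERF-014", "kind": "FALSE_POSITIVE", "agent_model": "claude-3.7-sonnet"},
    {"rule": "SEO-002", "kind": "INCORRECT_SEVERITY", "agent_model": "cursor-small"},
]
# Assumed: how many times each rule fired across all scans in the window.
fires = {"PERF-014": 40, "SEO-002": 90}

# Per-rule false-positive rate: accepted FALSE_POSITIVE disputes / total fires.
fp_counts = Counter(r["rule"] for r in accepted if r["kind"] == "FALSE_POSITIVE")
fp_rate = {rule: fp_counts[rule] / total for rule, total in fires.items()}

# Per-model view: whose disagreements get accepted most often.
accepted_by_model = Counter(r["agent_model"] for r in accepted)
```

A rule sitting at a 5% accepted-false-positive rate is a rewrite candidate; a model whose disputes are consistently accepted earns more weight in the review queue.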
How to use this today
If you're already on the PageLens MCP, the report_finding_feedback tool is already there — you just need the write:feedback scope. The default DCR registration grants it; if your client connected before April 2026 you'll need to disconnect and re-add it from Settings → Integrations so the consent screen offers the new scope.
A useful prompt to seed the workflow:
"Walk through the top 10 findings on my latest PageLens scan. For each one, propose a patch if you can, or call report_finding_feedback with structured evidence if you think it's wrong about my code."
That's the whole loop. Read, verify, patch or push back. The audit is no longer the product; the conversation is.
— Richard