№ 02 · AI quality monitoring

Watching AI Answers actually answer

A daily check on whether Help Scout’s AI Answers actually answered. Reads every thread, tags the customer’s real signal, sends samples to Claude for scoring. A 1,100-conversation analysis split “Contact helped” into three honest buckets and fed the team’s improvement queue.

Year: 2026
Role: Built and operate it; findings feed the Customer team’s weekly AI quality review

Help Scout’s AI Answers reports a “Contact helped” rate: the percentage of conversations it marks resolved. The problem: “resolved” can mean the customer said the answer worked, or it can mean they went silent after reading and the system assumed success. In the dashboard, those look identical.

The toolkit separates them. Each morning, a Ruby script reads every conversation thread (including what the customer said after the AI responded) and tags it with the real signal. Sampled conversations go to Claude for two things: quality scoring on a five-point scale, and topic classification, so failures can be sorted by product area instead of just counted. The pipeline runs on a launchd schedule; review files land in a folder, segmented by outcome.
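The tagging step can be sketched roughly as follows. This is a minimal illustration, not the production script: the thread shape (an array of hashes with `:author` and `:body`), the tag names, and the keyword patterns are all assumptions made for the example, and the real rules are likely richer.

```ruby
# Hypothetical thread shape: an array of messages in chronological order,
# each a hash with :author (:customer or :ai) and :body.
# This is NOT the Help Scout API's response format.

# Illustrative patterns only; stand-ins for whatever rules the real script uses.
STUCK_PATTERNS = [
  /\bstill (not|doesn'?t|isn'?t|can'?t)\b/i,
  /\b(didn'?t|doesn'?t) (work|help)\b/i
].freeze

CONFIRM_PATTERNS = [
  /\bthat (worked|helped|fixed it)\b/i,
  /\bgot it working\b/i
].freeze

# Returns a tag describing what the customer did after the last AI answer.
def customer_signal(thread)
  last_ai = thread.rindex { |msg| msg[:author] == :ai }
  return :no_ai_answer if last_ai.nil?

  # Only customer messages sent after the AI's final answer count as signal.
  replies = thread[(last_ai + 1)..].select { |msg| msg[:author] == :customer }
  return :silent if replies.empty?

  text = replies.map { |msg| msg[:body] }.join("\n")
  # Check for "stuck" first so "thanks, but it still doesn't work" is not
  # mistaken for a confirmation.
  return :still_stuck if STUCK_PATTERNS.any? { |pattern| text.match?(pattern) }
  return :confirmed if CONFIRM_PATTERNS.any? { |pattern| text.match?(pattern) }

  :unclear
end
```

The point of the sketch is the distinction the dashboard hides: a thread that ends on the AI's answer comes back as `:silent`, not as a success.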

An analysis of 1,100 conversations split “Contact helped” into three buckets. 7.1% had specific, addressable gaps: customers who were still stuck but hadn’t written back. Those went into the improvement queue. 7.3% were genuine resolutions with explicit customer confirmation. The other 85.6% were ambiguous: answer received, no complaint, no confirmation. That bucket is now tracked separately instead of being counted as a success.
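The split itself is simple arithmetic over the tags. A minimal sketch, assuming each conversation carries one of three hypothetical bucket names; the counts below are illustrative, chosen only so that they round to the reported shares, since the underlying counts are not given here.

```ruby
# Turns a flat list of bucket tags into percentage shares, rounded to one decimal.
def bucket_shares(tags)
  total = tags.size.to_f
  tags.tally.transform_values { |count| (100.0 * count / total).round(1) }
end

# Illustrative counts only (hypothetical), summing to 1,100 conversations.
sample_tags =
  [:addressable_gap] * 78 +
  [:confirmed] * 80 +
  [:ambiguous] * 942

shares = bucket_shares(sample_tags)
shares.each { |bucket, pct| puts "#{bucket}: #{pct}%" }
```

Keeping the ambiguous bucket as its own key, rather than folding it into the confirmed one, is what stops it from being counted as a success.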

Stack

Ruby · Bash · Help Scout Conversations API · Claude API · launchd
