AI that listens, scores and fills the CRM

I build and run two production systems that live inside sales teams' CRMs. The first is my own product: an AI quality-control layer that listens to every sales call and chat, transcribes it, scores it against configurable checklists with an LLM, writes the results back into the CRM and flags risks. The second is a request-distribution engine that routes each inbound lead to the right available person in seconds. Both run for real businesses in production, isolated per client, and both are things I designed, shipped and operate end to end.

Part A — AI quality control that listens and scores

Sales teams generate huge volumes of calls and chats, and almost none of it gets reviewed. Managers spot-check a handful of conversations a week, so most coaching, risk and compliance signal is simply lost. My product closes that gap: it ingests conversations from the telephony and chat layer, transcribes audio, scores each conversation against a client-configurable checklist using an LLM with structured outputs, auto-fills CRM fields and raises alerts on risky calls. It replaces manual spot-checking with continuous, auditable review of 100% of conversations.

What it did in the first 2.5 months

70,000+ calls analyzed in production

1,200+ hours of audio transcribed & scored

31,000+ automated quality checks run

All of that happened in the first 2.5 months in production, with a separate, isolated setup per client so that no customer's data ever mixes with another's.

Architecture

Ingest (pull-API)

A worker pulls call recordings and chat transcripts from the telephony/CRM side over their APIs, rather than waiting on inbound webhooks. Where the source already exposes a transcript, I reuse it and skip speech-to-text entirely.

Speech-to-text

Audio that has no existing transcript is run through speech-to-text to produce a clean, timestamped conversation before any scoring happens.
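The reuse-first rule from the ingest and speech-to-text steps can be sketched roughly as follows. This is a minimal illustration under assumptions, not the production code: `Conversation`, `resolve_transcript` and the `transcribe` callable are hypothetical names, and the actual speech-to-text provider is not specified here.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Conversation:
    """Hypothetical record pulled from the telephony/CRM side."""
    conversation_id: str
    audio_url: Optional[str] = None
    source_transcript: Optional[str] = None  # transcript supplied by the platform, if any


def resolve_transcript(
    conv: Conversation, transcribe: Callable[[str], str]
) -> Tuple[str, str]:
    """Return (transcript, origin). Speech-to-text runs only when no transcript exists."""
    if conv.source_transcript and conv.source_transcript.strip():
        return conv.source_transcript, "source"
    if conv.audio_url is None:
        raise ValueError(f"{conv.conversation_id}: no transcript and no audio to transcribe")
    # `transcribe` stands in for whichever speech-to-text client is used (assumption)
    return transcribe(conv.audio_url), "stt"
```

Returning the origin alongside the text keeps it visible downstream whether a score was based on the platform's transcript or on a fresh transcription.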

LLM scoring pipeline

Each transcript is scored against the client's checklist by an LLM (Anthropic / Gemini) using structured outputs, so every criterion comes back as a typed, machine-readable field with a rationale — not free text.

Guardrails & evals

Outputs pass through schema validation and guardrails, and I run evals against reviewed examples so scoring stays stable and auditable as prompts and checklists evolve.
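One way to picture the eval step is a per-criterion agreement check against human-reviewed examples, run before and after a prompt or checklist change. The sketch below is illustrative only: the function names, the example layout (`transcript` plus `labels`) and the 0.02 tolerance are assumptions, not the product's actual eval harness.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def agreement_by_criterion(
    reviewed: List[dict],
    score_fn: Callable[[str], Dict[str, bool]],
) -> Dict[str, float]:
    """Share of reviewed examples where the model verdict matches the human label."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for example in reviewed:
        predicted = score_fn(example["transcript"])
        for criterion, human_label in example["labels"].items():
            totals[criterion] += 1
            if predicted.get(criterion) == human_label:
                hits[criterion] += 1
    return {c: hits[c] / totals[c] for c in totals}


def regressions(
    baseline: Dict[str, float], candidate: Dict[str, float], tolerance: float = 0.02
) -> List[str]:
    """Criteria whose agreement dropped by more than `tolerance` versus the baseline."""
    return sorted(
        c for c, base in baseline.items() if candidate.get(c, 0.0) < base - tolerance
    )
```

A non-empty result from `regressions` would be the signal to hold back a prompt change until the affected criteria are reviewed.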

CRM autofill

Validated scores, extracted fields and risk flags are written straight back into the CRM (Kommo / amoCRM), so managers see results where they already work — no separate tool to check.

Dashboards & alerts

Aggregate quality trends feed dashboards, and risky conversations trigger alerts so a human can review the exact call the model flagged.

Per-client isolation

Every client gets an isolated deployment and data store. Configuration, checklists and conversation data are separated by design, which keeps data handling clean and lets each client's scoring rules evolve independently.

Key decisions & trade-offs — AI QA

Decision — Pull-API ingestion over webhooks, and no redundant speech-to-text

I ingest by polling the telephony/CRM APIs instead of relying on inbound webhooks. Webhooks are simpler to wire up, but they can fail silently: a missed delivery, a downtime window or a retry gap means a call is never scored, and you don't find out until someone asks where the data went. Pulling makes the system the source of truth for "what have I processed": I track a cursor and can always backfill. The same principle drove a second choice: where the source already has a transcript, I reuse it rather than re-running speech-to-text. STT is the most expensive and error-prone step, so skipping it when a transcript already exists saves cost and avoids introducing a worse transcript than the one the platform already produced. The trade-off is more state to manage on my side (offsets, dedup, backfill), which I accepted because the reliability of "every call gets scored" is the whole value proposition.
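The cursor, dedup and backfill state mentioned above can be sketched as follows. This is a simplified illustration under assumptions: `IngestState`, `poll_once` and the `fetch_page(cursor)` shape are hypothetical, real telephony/CRM APIs paginate in their own ways, and persistent storage of the state is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple

# A page fetcher takes a cursor and returns (records, next_cursor).
# This shape is an assumption made for the sketch.
FetchPage = Callable[[int], Tuple[List[dict], int]]


@dataclass
class IngestState:
    cursor: int = 0                                   # last position confirmed as processed
    seen_ids: Set[str] = field(default_factory=set)   # dedup across overlapping pulls


def poll_once(
    state: IngestState, fetch_page: FetchPage, handle: Callable[[dict], None]
) -> int:
    """Pull one page after the cursor, skip duplicates, advance the cursor last."""
    processed = 0
    records, next_cursor = fetch_page(state.cursor)
    for record in records:
        if record["id"] in state.seen_ids:
            continue  # already scored, e.g. during a backfill over an old range
        handle(record)
        state.seen_ids.add(record["id"])
        processed += 1
    # Advance only after the whole page was handled, so a crash re-pulls the page
    state.cursor = next_cursor
    return processed
```

Under this scheme a backfill is just a matter of moving `state.cursor` back: the dedup set keeps already-processed conversations from being scored twice.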

Decision — Structured LLM outputs with guardrails and evals, not free-text scoring

The model returns a typed object per checklist criterion (pass/fail/score plus a short rationale), validated against a schema, rather than a paragraph of prose. Free text would have been faster to prompt, but it can't be auto-filled into CRM fields, can't be aggregated into dashboards, and can't be regression-tested. Structured outputs make every score machine-usable and auditable, and let me run evals against human-reviewed examples so I catch drift when I change a prompt or a client changes their checklist. The cost is stricter prompt engineering and handling the occasional schema-invalid response with validation and retries — a cost I take gladly because it's what turns "the AI said something" into "the CRM has a defensible, reviewable score."
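A minimal sketch of "typed object per criterion, validated, with retries" might look like this. It deliberately avoids any vendor SDK: `call_model` is a hypothetical callable that takes a prompt and returns the model's raw JSON text, and the field names and the 0 to 10 score range are assumptions chosen for illustration.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class CriterionScore:
    criterion: str
    passed: bool
    score: int       # assumed 0-10 scale, purely illustrative
    rationale: str


def parse_scores(raw: str, expected_criteria: List[str]) -> List[CriterionScore]:
    """Validate the model's JSON reply; raise ValueError on any schema violation."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(payload, list):
        raise ValueError("expected a JSON array of criterion objects")
    scores = []
    for item in payload:
        if not isinstance(item, dict):
            raise ValueError("each entry must be a JSON object")
        criterion, passed = item.get("criterion"), item.get("passed")
        score, rationale = item.get("score"), item.get("rationale")
        if not isinstance(criterion, str) or not isinstance(passed, bool):
            raise ValueError("criterion must be a string and passed a boolean")
        if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 10:
            raise ValueError(f"{criterion}: score must be an integer from 0 to 10")
        if not isinstance(rationale, str) or not rationale.strip():
            raise ValueError(f"{criterion}: rationale is required")
        scores.append(CriterionScore(criterion, passed, score, rationale))
    if sorted(s.criterion for s in scores) != sorted(expected_criteria):
        raise ValueError("reply does not cover the checklist exactly")
    return scores


def score_with_retries(
    call_model: Callable[[str], str],
    prompt: str,
    expected_criteria: List[str],
    max_attempts: int = 3,
) -> List[CriterionScore]:
    """Retry on schema-invalid replies, feeding the rejection reason back to the model."""
    last_error = None
    for _ in range(max_attempts):
        attempt_prompt = (
            prompt if last_error is None
            else f"{prompt}\n\nPrevious reply was rejected: {last_error}"
        )
        try:
            return parse_scores(call_model(attempt_prompt), expected_criteria)
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"no valid reply after {max_attempts} attempts: {last_error}")
```

Only objects that pass `parse_scores` would ever reach the CRM or the dashboards, which is what keeps downstream consumers free of free-text handling.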

Decision — LLM assists, humans audit

Scores are designed to be human-reviewable, not treated as final verdicts. Because each score carries its rationale and links back to the exact call, a manager can open a flagged conversation and check the model's reasoning in seconds. I chose this over full automation deliberately: sales QA drives coaching and sometimes compensation decisions, so a black-box score that no one can challenge would erode trust in the whole product. Keeping a human in the loop costs some of the "fully hands-off" pitch, but it's what makes the output credible and safe to act on.

Part B — Request-distribution engine

The second system solves a different problem for a long-standing client (in the same operations/sales-CRM space): when a lead comes in, who gets it, and how fast? Sending every request to a shared queue means the fastest or the loudest person grabs the easy ones and hard leads go cold. My engine routes each inbound request to the right available person within seconds, based on configurable rules — current load, skills, schedule and priority — and integrates directly with the client's CRM and telephony.

43,000+ requests routed in production since 2022

2018: the year this client relationship began

Key decisions & trade-offs — routing engine

Decision — A configurable rule set, not hard-coded routing

Routing logic lives in a rule set the client can configure (load, skills, schedule, priority), rather than being baked into the code. Hard-coding would have shipped faster and been simpler to reason about, but sales teams reorganize constantly — new hires, shift changes, product lines, temporary priorities — and every one of those would have meant a code change and a deploy. Externalizing the rules means the business adapts routing without me in the loop, which is exactly what let this run for years across changing teams. The trade-off is a more general engine to build and test, which paid for itself the first time the team restructured and nothing needed a redeploy.
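To make "rules as configuration" concrete, here is a small sketch in which load, skills, schedule and priority are all read from a plain data structure rather than coded as branches. The `RULES` keys, the `Agent` fields and the topic names are hypothetical; the client's real rule format is not described in this write-up.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Agent:
    name: str
    skills: FrozenSet[str]
    open_requests: int
    shift_start: int   # hour of day, 0-23
    shift_end: int     # exclusive; shifts crossing midnight are out of scope here


# Hypothetical rule set as plain data, the kind of thing a client could edit
RULES = {
    "max_open_requests": 5,
    "required_skill_by_topic": {"enterprise": "b2b", "billing": "finance"},
    "priority_topics": ["enterprise"],
}


def eligible_agents(agents: List[Agent], topic: str, hour: int, rules: dict) -> List[Agent]:
    """Filter by schedule, load cap and skill, all read from the rule set."""
    required = rules["required_skill_by_topic"].get(topic)
    return [
        a for a in agents
        if a.shift_start <= hour < a.shift_end
        and a.open_requests < rules["max_open_requests"]
        and (required is None or required in a.skills)
    ]


def queue_priority(topic: str, rules: dict) -> int:
    """Lower number is served first; priority topics jump the queue."""
    return 0 if topic in rules["priority_topics"] else 1
```

Raising the load cap or adding a priority topic then means editing `RULES`, with no change to `eligible_agents` or `queue_priority` and therefore no redeploy.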

Decision — Fair load-balancing plus fallback chains

The engine actively balances load so no one person is buried while others sit idle, and every routing target has a fallback chain: if the first-choice assignee is unavailable, it walks to the next qualified, available person instead of dropping the request into a void. The naive alternative — round-robin or "assign and hope" — is trivial to build but produces the exact failure mode this system exists to prevent: a lead assigned to someone who's offline, sitting unanswered while it goes cold. Fair balancing and explicit fallbacks cost more logic and more edge cases (what counts as available, how to break ties), but they're what makes "routed in seconds" actually mean "answered."
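The combination of load balancing and fallback chains can be sketched as a walk over tiers of candidates, picking the least-loaded available person in the first tier that has one. Names such as `Assignee` and `route`, the tiered chain layout and the name-based tie-break are assumptions made for this example, not the engine's actual design.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Assignee:
    name: str
    available: bool
    open_requests: int = 0


def route(chain: List[List[str]], roster: Dict[str, Assignee]) -> Optional[str]:
    """Walk the fallback chain tier by tier; pick the least-loaded available person."""
    for tier in chain:
        candidates = [roster[n] for n in tier if n in roster and roster[n].available]
        if candidates:
            # Tie-break on name so the choice is deterministic
            chosen = min(candidates, key=lambda a: (a.open_requests, a.name))
            chosen.open_requests += 1
            return chosen.name
    return None  # nobody available: the caller must escalate instead of dropping the request
```

Returning `None` explicitly, rather than assigning to an offline person, forces the caller to handle the "nobody available" case, which is the failure mode plain round-robin hides.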

Stack

Language: Python

AI QA (transcription): Speech-to-text, with transcript reuse when the source already has one

AI QA (scoring): LLM (Anthropic / Gemini) with structured outputs, guardrails and evals

AI QA (ingest): Pull-API from the telephony/CRM layer

Routing engine: Configurable rules engine, fair load-balancing, fallback chains

Integrations: CRM (Kommo / amoCRM) autofill, telephony, webhooks

Isolation: Separate, isolated deployment and data store per client

Outcome

Both systems run in production inside clients' CRMs today. The AI QA product processed 70,000+ calls, 1,200+ hours of audio and 31,000+ automated checks in its first 2.5 months, turning a process that used to be manual spot-checking into continuous, auditable coverage of every conversation. The routing engine has distributed 43,000+ requests since 2022 for a client I've worked with since 2018 — a relationship that has lasted precisely because the system was built to adapt to their business rather than force the business to adapt to it. Across both, the through-line is the same: reliable ingestion, machine-usable structured output, and a human left in a position to audit and trust the result.