Luka Mrkić
Head of BD
MAY 22, 2026

An AI-powered ICP scoring model in HubSpot reads each company record, scores fit on a 0 to 100 scale, writes a tier and a short rationale back to the record, and routes the company through a HubSpot workflow. The model is an LLM (for example GPT-4o or Claude 3.5 Sonnet) wrapped in a stable prompt with a clear rubric and a JSON output schema. The data layer is firmographic, technographic, and signal data already in HubSpot or enriched by a tool like Clay. The workflow layer is a standard HubSpot workflow with a webhook step or an Operations Hub custom code action.
Build time is a few hours for a working prototype and about a week for a production-grade version with overrides, audits, and back-testing against closed-won data.
If you are evaluating who should build this for your team, this guide gives you both the technical blueprint and the standards to evaluate the work.
An ICP scoring model assigns every account in your CRM a fit score against your ideal customer profile. Traditional ICP scoring is rules-based: add 10 points for industry, 15 points for size, 5 for region. It works until the rules sprawl, fight each other, and stop reflecting how deals actually close.
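For contrast, here is a minimal sketch of the rules-based approach. The point values are the ones from the example above; the target industries, size band, and regions are hypothetical placeholders, not a recommended ICP.

```python
# Illustrative rules-based scorer. Point values follow the example in the text;
# the target sets and the employee band are hypothetical.
TARGET_INDUSTRIES = {"fintech", "saas"}
TARGET_REGIONS = {"EMEA", "NA"}


def rules_score(company: dict) -> int:
    score = 0
    if company.get("industry") in TARGET_INDUSTRIES:
        score += 10  # industry match
    employees = company.get("employees") or 0
    if 50 <= employees <= 500:
        score += 15  # size match
    if company.get("region") in TARGET_REGIONS:
        score += 5  # region match
    return score


print(rules_score({"industry": "fintech", "employees": 120, "region": "EMEA"}))
```

Every new criterion means another branch and another weight to argue about, which is how the rule tree ends up sprawling.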
An AI ICP scoring model replaces the rule tree with an LLM call. The model reads the company’s fields, compares them to your written ICP definition, and returns a score, a tier, and a one-line reason a rep can read in five seconds. The output writes back into HubSpot as custom properties, so the score is queryable and the rest of the CRM behaves normally around it.
Three things move when this is wired up correctly: routing speed, sales time spent on tier-1 accounts, and the feedback loop between marketing and sales on what fit actually looks like.

The model can only score what it can read. The data layer is the first place ICP models break, and the fix is unglamorous: write down the fields, confirm they are populated, and only then move to the prompt.

Practical rules:
tech_stack, funding_stage, last_funding_date, target_persona_count, primary_signal.
HubSpot’s own properties API and the company object schema are documented in the HubSpot developer docs and are the right reference when you start adding custom fields. Treat the company object as the source of truth and write the AI score back to it, not to a side table.
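One way to confirm the fields are populated before touching the prompt is a quick coverage check over exported company records. This is a rough sketch: the field list is an assumption (the first three mirror common HubSpot company property names, so verify them in your portal), and `field_coverage` and `low_coverage` are hypothetical helper names.

```python
# Assumed field list: adjust to the properties your ICP actually depends on.
REQUIRED_FIELDS = ["industry", "numberofemployees", "country", "tech_stack", "funding_stage"]


def is_filled(value) -> bool:
    # Treat None and blank strings as missing; 0 counts as a real value.
    return value is not None and str(value).strip() != ""


def field_coverage(companies, fields=REQUIRED_FIELDS):
    """Return the share of records (0.0 to 1.0) with each field populated."""
    total = len(companies)
    return {
        field: (sum(1 for c in companies if is_filled(c.get(field))) / total if total else 0.0)
        for field in fields
    }


def low_coverage(coverage, threshold=0.9):
    """List the fields that fall below the coverage threshold."""
    return sorted(field for field, share in coverage.items() if share < threshold)


sample = [
    {"industry": "fintech", "numberofemployees": 120, "country": "DE"},
    {"industry": "", "numberofemployees": 0, "country": "US"},
]
print(low_coverage(field_coverage(sample)))
```

Fields that show up in the low-coverage list are candidates for enrichment before the model is asked to reason about them.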
The prompt is the model. A clean prompt with a clear rubric and a strict output schema produces stable scores across thousands of runs. A loose prompt produces a number that changes every time you ask.

A working prompt has six parts:
score, tier, reason, top_signal, and disqualifiers fields. Tell the model to return only this JSON.
Keep the reason field under 25 words. A rep needs to read it during a call. Anything longer gets ignored and the score loses its trust.
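Even with a strict schema in the prompt, it is worth validating the response before it is written to the CRM. The sketch below checks the five fields named above, the 0 to 100 range, and the 25-word limit on the reason. `parse_score_response` is a hypothetical helper, and rejecting on any violation (rather than repairing the output) is an assumption.

```python
import json

REQUIRED_KEYS = {"score", "tier", "reason", "top_signal", "disqualifiers"}
MAX_REASON_WORDS = 25  # the reason must stay under this many words


def parse_score_response(raw: str) -> dict:
    """Parse the model's JSON output and raise ValueError if it breaks the schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    score = data["score"]
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be a number")
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if len(str(data["reason"]).split()) >= MAX_REASON_WORDS:
        raise ValueError("reason must be under 25 words")
    return data
```

A rejected response can be retried or sent to a review queue instead of silently writing a bad score to the record.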
If you want this set up cleanly inside your stack with logging, retries, and a feedback loop into a CRM, that is the kind of work we ship at Espressio.
HubSpot workflows on the company object are the right home for this. The trigger fires when a company is created or when a property changes, the workflow calls the model, and the response writes back to custom properties.
Two common implementations:
A minimal Python version of the custom code action looks like this:
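    import json
    import os

    import requests

    # Placeholder: replace with your written ICP definition, rubric, and output schema.
    # JSON mode expects the word "JSON" to appear somewhere in the messages.
    SYSTEM_PROMPT = "Score the company against our ICP. Return only JSON with score, tier, and reason."


    def build_prompt(company):
        # Placeholder: serialize the company fields from the workflow as the user message.
        return json.dumps(company, default=str)


    def main(event):
        company = event["inputFields"]
        prompt = build_prompt(company)  # your ICP system prompt + company fields

        resp = requests.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
            json={
                "model": "gpt-4o",
                "messages": [
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": prompt},
                ],
                "temperature": 0.2,
                "response_format": {"type": "json_object"},
            },
            timeout=20,
        )
        resp.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body

        # The model's JSON arrives as a string inside the first choice's message.
        data = json.loads(resp.json()["choices"][0]["message"]["content"])
        return {
            "outputFields": {
                "icp_score": data["score"],
                "icp_tier": data["tier"],
                "icp_reason": data["reason"],
            }
        }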
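Temperature stays low (0.1 to 0.3) for scoring. You want stable, repeatable numbers, not creative ones. A low temperature reduces run-to-run variation but does not guarantee identical output on every call.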
A score nobody acts on is decoration. The same HubSpot workflow that calls the model should branch on the tier and do something useful:
Rescore on a schedule. A nightly or weekly workflow that re-runs the scoring on companies whose properties have changed in the last seven days catches the cases where a tier-3 becomes a tier-1 after a funding round or a hiring spike.
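The selection step for a scheduled rescore can be sketched as a simple date filter. The `changed_at` key is a hypothetical field holding a timezone-aware timestamp of the last change to a scoring input; how you populate it (a HubSpot list filter, a search query, an export) is left open.

```python
from datetime import datetime, timedelta, timezone


def needs_rescore(companies, now=None, window_days=7):
    """Return companies whose scoring inputs changed within the last `window_days`."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=window_days)
    # `changed_at` is an assumed key, not a HubSpot property name.
    return [c for c in companies if c["changed_at"] >= cutoff]


now = datetime(2026, 5, 22, tzinfo=timezone.utc)
companies = [
    {"id": "1", "changed_at": now - timedelta(days=2)},   # inside the window
    {"id": "2", "changed_at": now - timedelta(days=30)},  # stale, skipped
]
print([c["id"] for c in needs_rescore(companies, now=now)])
```

One design point to watch: writing the score back can itself update a record's modification timestamp, so a production version would likely key off changes to specific input properties rather than any change to the record.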
Before this model touches live routing, run it against history. Take your closed-won deals from the last 12 months, your closed-lost deals, and a sample of disqualified leads. Score all of them with the new model and look at the distribution.
What you want to see:
If the distribution is wrong, the prompt and the rubric are the first place to look. The model is rarely the problem; the definition usually is.
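Looking at the distribution can be as simple as counting tiers per outcome group. This sketch assumes each back-tested record has already been scored and carries two hypothetical keys, `outcome` (for example closed_won, closed_lost, disqualified) and `tier`.

```python
from collections import Counter, defaultdict


def tier_distribution(scored):
    """Return, per outcome group, the share of records that landed in each tier."""
    counts = defaultdict(Counter)
    for row in scored:
        counts[row["outcome"]][row["tier"]] += 1
    result = {}
    for outcome, tiers in counts.items():
        total = sum(tiers.values())
        result[outcome] = {tier: n / total for tier, n in tiers.items()}
    return result


# Tiny made-up sample, only to show the shape of the output.
sample = [
    {"outcome": "closed_won", "tier": "tier-1"},
    {"outcome": "closed_won", "tier": "tier-2"},
    {"outcome": "closed_lost", "tier": "tier-3"},
]
print(tier_distribution(sample))
```

If closed-won deals are spread evenly across tiers, or disqualified leads cluster in the top tier, that points back at the rubric rather than at the model.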
Whether you build this yourself or hire someone, these are the standards. A model that fails any of them is not production-ready.

Two more checks people forget:
Track these in HubSpot reports:
Set a 30-day and a 90-day review on the calendar. The 30-day review catches obvious bugs in the prompt and the routing. The 90-day review compares win rates by tier and tells you whether the model is actually moving revenue.
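For the 90-day review, the win-rate comparison can be sketched over an export of closed deals. The `tier` and `won` keys are assumptions about how the export is shaped, and `win_rate_by_tier` is a hypothetical helper.

```python
from collections import Counter


def win_rate_by_tier(deals):
    """Return closed-won share per ICP tier for a list of closed deals."""
    won, closed = Counter(), Counter()
    for deal in deals:
        closed[deal["tier"]] += 1
        if deal["won"]:
            won[deal["tier"]] += 1
    return {tier: won[tier] / closed[tier] for tier in closed}


# Made-up deals, only to show the call.
deals = [
    {"tier": "tier-1", "won": True},
    {"tier": "tier-1", "won": False},
    {"tier": "tier-3", "won": False},
]
print(win_rate_by_tier(deals))
```

If tier-1 accounts do not win at a visibly higher rate than tier-3 accounts, the score is not separating fit from non-fit and the rubric needs another pass.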
HubSpot’s native predictive scoring is statistical: it learns from your historical contact data and weighs properties to predict conversion. It is good for engagement scoring on contacts. An AI ICP scoring model on the company object is rubric-based, transparent, and editable: you write the ICP and the rubric, the model applies them, and the reason is human-readable. The two coexist well. Use predictive scoring for contact engagement, AI ICP scoring for account fit.
Both GPT-4o and Claude 3.5 Sonnet work for this. GPT-4o is slightly cheaper per call and has native JSON mode, which makes the output schema enforcement cleaner. Claude 3.5 Sonnet tends to write tighter rationale lines and is stronger at following long ICP definitions verbatim. Pick one, stay on it for at least a quarter, and only switch if you have a concrete reason.
You need an enrichment tool only if your HubSpot data is sparse on the fields your ICP cares about. If industry, size, and region are filled in on 90% of records, you can start without enrichment and add it later. If those fields are mostly empty, fix enrichment before building the model. The model cannot score what it cannot see.
A standard pattern: score on company creation, then rescore on a weekly schedule for any company whose properties changed in the last seven days. Funding events, hiring spikes, and stack changes are the signals that move a score, so re-running weekly catches them without burning budget.
At GPT-4o or Claude 3.5 Sonnet prices, a single scoring call on a typical company record costs well under one cent. Monthly spend scales with call volume: 10,000 records rescored every week is roughly 43,000 calls a month, so rescoring only the records whose properties changed keeps the bill small. In most setups the larger cost driver is enrichment, not the model itself.
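A back-of-the-envelope estimate is easy to script. Every number in the example call below is a placeholder assumption (token counts, per-million-token prices, and call volume), not a quoted rate, so substitute your provider's current pricing and your own prompt size.

```python
def model_cost(calls_per_month, input_tokens, output_tokens,
               input_price_per_m, output_price_per_m):
    """Return (cost per call, cost per month) in the pricing currency."""
    per_call = (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000
    return per_call, per_call * calls_per_month


# Placeholder inputs: 2,000 scoring calls a month, ~800 prompt tokens and
# ~80 completion tokens per call, at assumed per-million-token prices.
per_call, per_month = model_cost(
    calls_per_month=2_000,
    input_tokens=800,
    output_tokens=80,
    input_price_per_m=2.50,
    output_price_per_m=10.00,
)
print(f"per call: ${per_call:.4f}, per month: ${per_month:.2f}")
```

The prompt (ICP definition plus rubric) usually dominates the token count, so trimming it or limiting rescoring to changed records has more effect than shortening the output.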
Reps can see why an account got its score; that is the whole point of returning a short reason field. The rep opens the company record, sees a tier-1 score of 87 with a reason that says “fintech, series-b funded last quarter, Snowflake and Stripe in stack, hiring senior data roles”, and acts on it. If the reason is empty or vague, the prompt needs more constraint.
Three controls keep the scores from drifting over time. Keep temperature low. Pin the model version (do not let it silently upgrade). Re-run the back-test once a quarter against the latest closed-won deals and tweak the rubric if the distribution drifts. Treat the prompt like code: version it, review changes, and write down why each change was made.
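One lightweight way to treat the prompt like code is to store a content hash of the prompt next to each score, together with the pinned model name. This is a sketch under assumptions: the dated snapshot name is an example to check against your provider's model list, and `prompt_version` and `scoring_metadata` are hypothetical helpers.

```python
import hashlib

# Assumed example of a dated snapshot; an alias without a date may change under you.
PINNED_MODEL = "gpt-4o-2024-08-06"


def prompt_version(system_prompt: str, rubric: str) -> str:
    """Return a short, stable fingerprint of the prompt and rubric text."""
    digest = hashlib.sha256((system_prompt + "\n" + rubric).encode("utf-8")).hexdigest()
    return digest[:12]


def scoring_metadata(system_prompt: str, rubric: str, model: str = PINNED_MODEL) -> dict:
    """Metadata to log alongside every score so later changes are traceable."""
    return {"model": model, "prompt_version": prompt_version(system_prompt, rubric)}


print(scoring_metadata("You score companies against our ICP.", "Tier 1: ..."))
```

When the quarterly back-test shows the distribution drifting, the logged fingerprint tells you which prompt revision produced which scores.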
icp_score, icp_tier, icp_reason, plus any input fields you need.
If you want this set up for your team end to end, with a clean ICP definition, HubSpot workflow, model integration, back-test, and override queue wired in, let’s talk.