Lesson 10.5: Secrets, logging, monitoring, cost guards | GeekHub Learn

Production AI without observability is gambling. This lesson is the safety net.

A pilot who never reads the instruments will eventually crash. Logs and dashboards are the instruments.

Five non-negotiable production habits:

Keep secrets in environment variables or a secret manager. Never in code.
Write a structured log entry for every LLM call: timestamp, user_id, prompt_hash, model, tokens, latency, status.
Track cost in two places: the provider's dashboard and one you build from your own logs.
Set hard spending caps in the provider dashboard.
Rate limit per user or IP to prevent abuse.
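
The first habit can be sketched in a few lines. This is a minimal example, assuming the key lives in an environment variable; the helper name require_secret is hypothetical and not part of any library.

```python
import os


def require_secret(name: str) -> str:
    """Return the value of environment variable `name`, or fail loudly.

    `require_secret` is a hypothetical helper for this lesson.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        # Failing at startup is cheaper than failing on the first user request.
        raise RuntimeError(f"Missing required secret: set the {name} environment variable")
    return value


# Typical use at startup (commented out so the sketch has no side effects):
# api_key = require_secret("OPENAI_API_KEY")
```

Calling it once at startup means a missing key shows up as a clear error during deploy rather than as a failed request later. Log that the secret was found if you like, but never log its value.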

For Python, log with the standard logging module and ship the output to Logtail, Datadog, or Grafana Cloud (free tiers exist).

Wrap LLM calls:

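import hashlib, json, logging, time
from openai import OpenAI

log = logging.getLogger("ai")
client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4o-mini"

def call_llm(messages, user_id):
    started = time.time()
    # Hash the prompt so calls can be correlated without storing the raw text.
    h = hashlib.sha256(json.dumps(messages, sort_keys=True).encode()).hexdigest()[:10]
    base = {"ts": started, "user_id": user_id, "prompt_hash": h, "model": MODEL}
    try:
        r = client.chat.completions.create(model=MODEL, messages=messages)
        # json.dumps makes each log line valid JSON instead of a Python dict repr.
        log.info(json.dumps({
            **base, "tokens": r.usage.total_tokens,
            "latency_ms": int((time.time() - started) * 1000), "status": "ok",
        }))
        return r
    except Exception as e:
        log.error(json.dumps({
            **base, "latency_ms": int((time.time() - started) * 1000),
            "status": "error", "err": str(e),
        }))
        raise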

Add a cost guard with a simple per-user counter in Redis that caps each user's daily token spend.
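
One way the per-user counter could look is sketched below. It assumes r is a redis-py style client exposing get, incrby, and expire; the key layout, the function names, and the 50,000-token cap are illustrative choices, not recommendations.

```python
import datetime

DAILY_TOKEN_CAP = 50_000  # hypothetical per-user daily budget


def _usage_key(user_id):
    # One counter per user per UTC day, e.g. "tokens:alice:2026-01-31".
    today = datetime.datetime.now(datetime.timezone.utc).date()
    return f"tokens:{user_id}:{today.isoformat()}"


def within_budget(r, user_id, cap=DAILY_TOKEN_CAP):
    """Return True if the user has spent fewer than `cap` tokens today."""
    used = r.get(_usage_key(user_id))  # None if the user has no usage yet
    return int(used or 0) < cap


def record_usage(r, user_id, tokens):
    """Add `tokens` to today's counter and return the new total."""
    key = _usage_key(user_id)
    total = r.incrby(key, tokens)
    # Let old counters expire on their own; two days covers any time zone skew.
    r.expire(key, 2 * 24 * 3600)
    return total
```

Call within_budget before the LLM call and record_usage with the response's total token count after it. Because the check happens before the call, a user can overshoot the cap by at most one request, which is usually an acceptable trade for the simplicity.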

Visualize it

A "production layer cake" diagram: app -> logger -> dashboard -> alerting. Each layer labeled.

Try it now

Set a $5 monthly cap on OpenAI right now. Take a screenshot for accountability.

Hands-on lab

Add structured logging to your PDF chatbot. Print logs locally as JSON.

Try it now

Why hash the prompt instead of logging the raw prompt?

Common mistakes

Logging full prompts (privacy and storage cost)
No spend caps (free trial gets drained in hours)
No per-user rate limits (one user can rack up your bill)
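
To make the last mistake concrete, here is a minimal in-memory sliding-window limiter. It is a sketch for a single process; the class name and limits are hypothetical, and an app running several workers would need a shared store such as Redis instead.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` per `window_s` seconds for each key."""

    def __init__(self, max_calls, window_s):
        self.max_calls = max_calls
        self.window_s = window_s
        self._calls = defaultdict(deque)  # key -> timestamps of recent calls

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        recent = self._calls[key]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window_s:
            recent.popleft()
        if len(recent) >= self.max_calls:
            return False
        recent.append(now)
        return True
```

Use the user ID or client IP as the key, and reject the request (for example with HTTP 429) when allow returns False.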

Debugging tip

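If a feature seems "stuck", check the logs for repeated errors. If the errors are transient (timeouts, rate limits), wrap the offending call in a bounded retry with backoff. For anything else, fix the cause instead of retrying.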
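
A retry along these lines might look like the sketch below. The helper name, the defaults, and the choice of which exceptions count as transient are assumptions; swap in the timeout and rate-limit exceptions your client library actually raises.

```python
import random
import time


def retry(fn, attempts=3, base_delay=0.5,
          retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call `fn()` up to `attempts` times, backing off exponentially.

    Only exceptions listed in `retry_on` are retried; anything else
    propagates immediately so real bugs are not hidden.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

Keep the attempt count small. Every retry of an LLM call costs tokens, so an unbounded retry loop is also a cost bug.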

Challenge

Build a tiny "/health" endpoint that reports last hour's: requests, error count, average latency, total tokens.
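
As a starting point, the numbers behind such an endpoint could be kept in a small rolling window like the one below. This is a sketch with hypothetical names that assumes events are recorded in time order within one process; wiring it to an actual HTTP route is left as the challenge.

```python
import time
from collections import deque


class HealthStats:
    """Rolling stats over the last `window_s` seconds (one hour by default)."""

    def __init__(self, window_s=3600):
        self.window_s = window_s
        self._events = deque()  # (timestamp, ok, latency_ms, tokens)

    def record(self, ok, latency_ms, tokens, now=None):
        now = time.time() if now is None else now
        self._events.append((now, ok, latency_ms, tokens))

    def snapshot(self, now=None):
        now = time.time() if now is None else now
        # Evict events older than the window before summarizing.
        while self._events and now - self._events[0][0] > self.window_s:
            self._events.popleft()
        n = len(self._events)
        return {
            "requests": n,
            "errors": sum(1 for e in self._events if not e[1]),
            "avg_latency_ms": round(sum(e[2] for e in self._events) / n, 1) if n else 0.0,
            "total_tokens": sum(e[3] for e in self._events),
        }
```

Call record from the same place the LLM call is logged, and return snapshot() as the JSON body of the endpoint.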

Where this shows up

Production AI apps
SaaS demos at scale
Internal AI tools

From the field

The single biggest "junior to mid-level" promotion driver in 2026 AI roles is "you set up our cost dashboard and prevented the next outage". That is observability.

Recap

Secrets, logs, dashboards, caps, rate limits. Five habits. Always.

Quick recall

Why hash prompts rather than store them?

Quiz time

1. The most overlooked production safety control is