Prompt Engineering for Production
Scale prompt engineering from notebooks to production systems
Published July 1, 2026
Challenge: Prompts that work in a ChatGPT session often fail in production. This guide covers versioning, testing, monitoring, and scaling prompts.
The Problem: From Ad-Hoc to Production
Typical notebook development:
# ❌ Bad: hardcoded, no versioning
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a poem about AI"}]
)
Issues:
- No version history
- Can't A/B test prompts
- No monitoring of quality
- Impossible to debug failures
- Can't reuse across teams
Production Prompt Structure
# prompts/poetry.yaml
name: "generate_poem"
version: "1.0"
model: "gpt-4-turbo"
system: |
  You are a creative poet specializing in technology poetry.
  Your poems are witty, thought-provoking, and accessible.
  Guidelines:
  - Use vivid metaphors
  - Keep rhythm consistent
  - Target audience: software engineers
user_template: |
  Write a poem about {topic} in the style of {style}.
  Poem should be {lines} lines long.
parameters:
  temperature: 0.7
  max_tokens: 500
  top_p: 0.9
Then load and use:
import yaml
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def load_prompt(name: str) -> dict:
    """Load a prompt config from prompts/<name>.yaml."""
    with open(f"prompts/{name}.yaml") as f:
        return yaml.safe_load(f)

def generate_poem(topic: str, style: str, lines: int = 8) -> str:
    config = load_prompt("poetry")
    params = config["parameters"]
    # Fill the placeholders defined in user_template
    user_message = config["user_template"].format(
        topic=topic,
        style=style,
        lines=lines
    )
    response = client.chat.completions.create(
        model=config["model"],
        messages=[
            {"role": "system", "content": config["system"]},
            {"role": "user", "content": user_message}
        ],
        temperature=params["temperature"],
        max_tokens=params["max_tokens"],
        top_p=params["top_p"]
    )
    return response.choices[0].message.content
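Because user_template is filled with str.format, a renamed or missing placeholder only surfaces as a KeyError at call time. A minimal sketch of an up-front check, using only the standard library; the helper names template_fields and render_template are hypothetical and not part of the original code:

```python
from string import Formatter

def template_fields(template: str) -> set:
    """Return the placeholder names used in a str.format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}

def render_template(template: str, **values) -> str:
    """Fill a template, failing loudly on missing or unexpected values."""
    expected = template_fields(template)
    missing = expected - set(values)
    extra = set(values) - expected
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    return template.format(**values)

# Same template text as user_template in prompts/poetry.yaml
template = (
    "Write a poem about {topic} in the style of {style}.\n"
    "Poem should be {lines} lines long.\n"
)
print(render_template(template, topic="AI", style="haiku", lines=3))
```

Running the same check when a prompt file is loaded would catch template and caller drifting apart before any request is sent.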
Testing Prompts
Manual Test Cases
# tests/test_poetry.py
import pytest
# generate_poem is assumed to be importable from your application code

@pytest.mark.parametrize("topic,style", [
    ("AI", "cyberpunk"),
    ("Quantum computing", "haiku"),
    ("Machine learning", "Shakespearean sonnet"),
])
def test_poem_generation(topic, style):
    poem = generate_poem(topic, style)
    # Check length (in characters)
    assert len(poem) > 100
    # Check it mentions the topic verbatim (brittle for multi-word topics)
    assert topic.lower() in poem.lower()
    # Check it's not a refusal or an error message
    assert "error" not in poem.lower()
    assert "unable" not in poem.lower()
Automated Quality Checks
def evaluate_response_quality(response: str) -> dict:
    """Score LLM output quality."""
    # evaluate_coherence, evaluate_toxicity, and evaluate_relevance are
    # placeholders for your own scorers, assumed to return floats in [0, 1]
    metrics = {
        "length": len(response.split()),  # word count
        "has_errors": "error" in response.lower(),
        "coherence": evaluate_coherence(response),
        "toxicity": evaluate_toxicity(response),
        "relevance": evaluate_relevance(response)
    }
    # Fail if quality is too low
    if metrics["coherence"] < 0.6:
        raise ValueError("Output quality too low")
    return metrics
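The scorer functions above are left undefined. As one cheap stand-in for a relevance scorer, here is a sketch that measures keyword overlap between the topic and the response. The name keyword_relevance is hypothetical, and unlike the placeholder above it takes the topic as a second argument. A production scorer would more likely use embeddings or an LLM judge.

```python
import re

def _words(text: str) -> set:
    """Lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_relevance(response: str, topic: str) -> float:
    """Fraction of topic words that appear in the response (0.0 to 1.0)."""
    topic_words = _words(topic)
    if not topic_words:
        return 0.0
    return len(topic_words & _words(response)) / len(topic_words)

score = keyword_relevance(
    "Qubits dance where quantum dreams begin", "Quantum computing"
)
print(score)  # 0.5: "quantum" appears, "computing" does not
```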
Versioning and A/B Testing
# Version prompts like code
# prompts/poetry/v1.yaml - Original version
# prompts/poetry/v2.yaml - Added creativity guidelines
# prompts/poetry/v3.yaml - Simplified instructions
def run_ab_test(topic: str, num_samples: int = 100):
    """Compare two prompt versions"""
    # Load each version once, outside the sampling loop
    prompts = {
        "v2": load_prompt_version("poetry", "v2"),
        "v3": load_prompt_version("poetry", "v3"),
    }
    results = {"v2": [], "v3": []}
    for _ in range(num_samples):
        for version, prompt in prompts.items():
            response = generate_from_prompt(prompt, topic)
            results[version].append(score_response(response))
    # Compare mean scores; a small gap may be sampling noise,
    # so check significance before acting on it
    v2_mean = sum(results["v2"]) / len(results["v2"])
    v3_mean = sum(results["v3"]) / len(results["v3"])
    print(f"v2 avg score: {v2_mean:.3f}")
    print(f"v3 avg score: {v3_mean:.3f}")
    if v3_mean > v2_mean:
        print("✅ v3 is better. Deploy to production.")
    else:
        print("❌ v2 is still better. Keep v2.")
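Comparing two means cannot by itself tell a real improvement from sampling noise. A minimal sketch of a two-sided permutation test over the two score lists, using only the standard library. The function name permutation_p_value, the resample count, and the sample scores are assumptions made for illustration.

```python
import random

def permutation_p_value(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in mean scores."""
    rng = random.Random(seed)
    mean = lambda xs: sum(xs) / len(xs)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        # Reassign scores to the two groups at random
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly above zero
    return (hits + 1) / (n_resamples + 1)

# Made-up scores for illustration only
v2_scores = [0.70, 0.72, 0.68, 0.71, 0.69, 0.73]
v3_scores = [0.80, 0.83, 0.79, 0.82, 0.81, 0.84]
p = permutation_p_value(v2_scores, v3_scores)
print(f"p-value: {p:.4f}")
```

A small p-value (a common but arbitrary cutoff is 0.05) suggests the gap is unlikely to be noise. A large one means more samples are needed before switching versions.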
Monitoring in Production
import time
from datetime import datetime

# Collect metrics on every call
class PromptLogger:
    def __init__(self):
        self.metrics = []

    def log_call(self, prompt_name: str, response: str, latency: float):
        self.metrics.append({
            "timestamp": datetime.now(),
            "prompt": prompt_name,
            "response_length": len(response),
            "latency_ms": latency,
            "quality_score": score_response(response)
        })

    def check_drift(self):
        """Alert if quality degrades"""
        recent = self.metrics[-100:]
        if not recent:
            return  # nothing logged yet
        recent_quality = sum(m["quality_score"] for m in recent) / len(recent)
        baseline = 0.85
        if recent_quality < baseline * 0.9:  # 10% drop
            send_alert(f"Quality degraded: {recent_quality:.2f}")

logger = PromptLogger()

# In your API (app is assumed to be a FastAPI instance)
@app.post("/generate-poem")
def api_generate_poem(topic: str, style: str):
    start = time.time()
    result = generate_poem(topic, style)
    latency = (time.time() - start) * 1000
    logger.log_call("poetry", result, latency)
    logger.check_drift()
    return {"poem": result}
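The logger records latency_ms on every call, but the drift check only looks at quality. A minimal sketch of a tail-latency summary over the same metrics list, using the standard library; the function name latency_p95 and the sample data are hypothetical:

```python
from statistics import quantiles

def latency_p95(metrics: list) -> float:
    """95th-percentile latency in ms across logged calls."""
    latencies = [m["latency_ms"] for m in metrics]
    if len(latencies) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile
    return quantiles(latencies, n=20)[-1]

# Made-up log entries with latencies of 1..100 ms
sample_metrics = [{"latency_ms": float(ms)} for ms in range(1, 101)]
print(f"p95 latency: {latency_p95(sample_metrics):.2f} ms")
```

Tracking a percentile rather than the mean keeps a handful of slow calls from being averaged away.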
Common Patterns
1. Few-Shot Prompting
System: You are a sentiment classifier.
Examples:
Text: "I love this product!" → Sentiment: Positive
Text: "Worst experience ever" → Sentiment: Negative
Now classify:
Text: {user_input} → Sentiment:
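The few-shot layout above can be assembled from a list of labeled examples rather than hardcoded. A minimal sketch that produces chat messages in the same system/user shape used earlier; the function name build_few_shot_messages is hypothetical, and quoting the user input to match the examples is a choice made here:

```python
def build_few_shot_messages(examples, user_input,
                            system="You are a sentiment classifier."):
    """Build chat messages from (text, label) pairs plus the text to classify."""
    lines = ["Examples:"]
    for text, label in examples:
        lines.append(f'Text: "{text}" → Sentiment: {label}')
    lines.append("Now classify:")
    # Leave the label blank so the model completes it
    lines.append(f'Text: "{user_input}" → Sentiment:')
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": "\n".join(lines)},
    ]

examples = [
    ("I love this product!", "Positive"),
    ("Worst experience ever", "Negative"),
]
messages = build_few_shot_messages(examples, "Shipping was slow but support helped")
print(messages[1]["content"])
```

Keeping the examples as data makes it straightforward to version them alongside the prompt and to swap them during A/B tests.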
2. Chain-of-Thought
Think step by step:
1. What is the question asking?
2. What information do I need?
3. What is the answer?
Question: {user_question}
3. Role-Based Prompting
You are an expert {role} with {years} years of experience.
Your task: {task}
Constraints: {constraints}
Cost Optimization
Tip: Longer prompts = more tokens = higher cost
- Use concise system prompts
- Cache repeated context with prompt caching APIs
- Use cheaper models (e.g., gpt-3.5-turbo) for simple tasks
- Batch requests when possible
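To make the tip concrete, here is a rough per-call cost estimate. The prices below are placeholders, not real rates, so substitute your provider's current pricing. The function name estimate_cost and the token counts are also assumptions for illustration.

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Estimated cost of one call, in the same currency as the prices."""
    return (prompt_tokens / 1000 * input_price_per_1k
            + completion_tokens / 1000 * output_price_per_1k)

# Hypothetical prices per 1,000 tokens (not real rates)
INPUT_PRICE = 0.01
OUTPUT_PRICE = 0.03

verbose = estimate_cost(1200, 500, INPUT_PRICE, OUTPUT_PRICE)
concise = estimate_cost(300, 500, INPUT_PRICE, OUTPUT_PRICE)
calls_per_month = 1_000_000
savings = (verbose - concise) * calls_per_month
print(f"verbose prompt: {verbose:.4f} per call")
print(f"concise prompt: {concise:.4f} per call")
print(f"monthly savings at {calls_per_month:,} calls: {savings:,.2f}")
```

Output tokens are priced separately from input tokens, so trimming the system prompt only reduces the input side of the bill.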
Scaling Across Teams
# Central prompt repository
# prompts/
# ├── customer-service/
# │   ├── v1.yaml
# │   ├── v2.yaml
# │   └── tests.py
# ├── content-generation/
# │   ├── v1.yaml
# │   └── tests.py
# └── translation/
#     └── v1.yaml
# Anyone can load any prompt
from prompts import load_prompt
prompt = load_prompt("customer-service", version="v2")
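The article does not show how load_prompt resolves a version. One way it might locate the file in the directory layout above is sketched here, using only the standard library. The function name resolve_prompt_path and the rule of defaulting to the highest version number are assumptions. The resulting path would then be read with yaml.safe_load as in the earlier loader.

```python
import re
from pathlib import Path
from typing import Optional

def resolve_prompt_path(root: Path, name: str,
                        version: Optional[str] = None) -> Path:
    """Locate <root>/<name>/<version>.yaml, defaulting to the highest version."""
    prompt_dir = Path(root) / name
    if version is None:
        candidates = [
            p for p in prompt_dir.glob("v*.yaml")
            if re.fullmatch(r"v\d+", p.stem)
        ]
        if not candidates:
            raise FileNotFoundError(f"no versions found in {prompt_dir}")
        # Compare numerically so v10 ranks above v2
        return max(candidates, key=lambda p: int(p.stem[1:]))
    path = prompt_dir / f"{version}.yaml"
    if not path.is_file():
        raise FileNotFoundError(str(path))
    return path

# Usage (assuming the layout above exists on disk):
# resolve_prompt_path(Path("prompts"), "customer-service", version="v2")
```

Pinning an explicit version in production code and reserving the "latest" default for experiments keeps deployments reproducible.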
Learn Prompt Engineering at Scale
Master production prompt engineering with real projects:
- Systematic prompting techniques
- Testing and evaluation frameworks
- Cost optimization strategies
- Monitoring and alerting
- Scaling to production systems