# Overview
A system that uses [[GenAI]] models to produce new content such as text, images, or code.
# Key Considerations
## Evaluation
**Offline:**
Create a stratified test set of prompts covering intents, languages, edge cases, and policy red lines. For text, collect multiple reference answers or use pairwise ranking. Measure quality metrics, run toxicity detectors, and slice results by domain.
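Slicing by domain can be as simple as grouping scored eval records by a metadata key. A minimal sketch, with hypothetical field names and scores (any 0–1 quality metric would do):

```python
from collections import defaultdict

# Hypothetical offline-eval records: prompt metadata plus a quality score in [0, 1].
results = [
    {"domain": "billing",  "language": "en", "score": 0.92},
    {"domain": "billing",  "language": "de", "score": 0.71},
    {"domain": "shipping", "language": "en", "score": 0.88},
    {"domain": "shipping", "language": "en", "score": 0.64},
]

def mean_score_by(records, key):
    """Average quality score per slice (e.g., per domain or per language)."""
    sums = defaultdict(lambda: [0.0, 0])
    for r in records:
        s = sums[r[key]]
        s[0] += r["score"]
        s[1] += 1
    return {k: total / n for k, (total, n) in sums.items()}

print(mean_score_by(results, "domain"))
print(mean_score_by(results, "language"))
```

Slicing the same records by several keys (domain, language, intent) is what surfaces regressions that a single aggregate score hides.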
**Online:**
Ship in shadow mode first: generate suggestions but hide them from users, logging quality signals. Graduate to an A/B test where the new model handles a percentage of traffic; monitor business KPIs and safety dashboards. Always keep a "golden canary" set of prompts served by a legacy model for drift detection.
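The A/B split is usually done with deterministic hashing so each user sees a consistent variant across sessions. A sketch (the salt and share values are illustrative, not from the note):

```python
import hashlib

def traffic_bucket(user_id: str, salt: str = "genai-ab-experiment") -> float:
    """Map a user id to a stable value in [0, 1) for traffic splitting."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0x100000000  # first 32 bits, normalized

def route(user_id: str, new_model_share: float = 0.05) -> str:
    """Send a fixed share of users to the new model; everyone else stays on legacy."""
    return "new_model" if traffic_bucket(user_id) < new_model_share else "legacy"
```

Hashing with a per-experiment salt keeps assignments stable for one experiment while decorrelating them across experiments; ramping up is just raising `new_model_share`.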
### Product Metrics
- Task success rate (e.g., ticket fully resolved)
- Average handle time (when humans intervene)
- User satisfaction / Net Promoter Score (NPS, "how likely are you to recommend this product to a friend?")
- Brand-safety incident rate & review cost
- Prompt–response latency
- Down-stream engagement (clicks, watch-time, etc.)
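Of the metrics above, NPS has a fixed formula worth writing down: respondents scoring 9–10 are promoters, 0–6 are detractors, and NPS is the percentage of promoters minus the percentage of detractors. A minimal sketch:

```python
def nps(scores):
    """Net Promoter Score from 0-10 survey responses.

    Promoters score 9-10, detractors 0-6; passives (7-8) only dilute.
    Result ranges from -100 to +100.
    """
    promoters = sum(s >= 9 for s in scores)
    detractors = sum(s <= 6 for s in scores)
    return 100 * (promoters - detractors) / len(scores)
```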
### ML Metrics
- Automated Overlap scores:
- [[BLEU]]
- [[ROUGE]]
- [[METEOR]]
- Semantic Similarity:
- [[BERTScore]]
- [[BLEURT]]
- Task-specific fact checkers (factuality / hallucination detection)
- Human evaluation (e.g., Likert-scale ratings)
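The overlap scores above all reduce to counting shared n-grams between a generated text and a reference. A self-contained sketch of the simplest case, unigram F1 in the style of [[ROUGE]]-1 (not a replacement for a proper library implementation):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a generated text and one reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped per-word match counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

[[BLEU]] extends the same counting idea to higher-order n-grams with a brevity penalty, and [[BERTScore]] swaps exact word matches for embedding similarity.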
# Pros
# Cons
# Use Cases
- [[Chatbots]]
- [[Image Synthesis]]
- [[Code Generation]]
# Related Topics