# Overview
A system that uses [[GenAI]] to produce new content such as text, images, or code.

# Key Considerations
## Evaluation
**Offline:** Create a stratified test set of prompts covering intents, languages, edge cases, and policy red lines. For text, collect multiple reference answers or use pairwise ranking. Measure quality metrics, run toxicity detectors, and slice results by domain.

**Online:** Ship in shadow mode first: generate suggestions but hide them from users, logging quality signals. Graduate to an A/B test where the new model handles a percentage of traffic; monitor business KPIs and safety dashboards. Always keep a "golden canary" set of prompts served by a legacy model for drift detection.

### Product Metrics
- Task success rate (e.g., ticket fully resolved)
- Average handle time (when humans intervene)
- User satisfaction / Net Promoter Score (NPS, "how likely are you to recommend this product to a friend?")
- Brand-safety incident rate & review cost
- Prompt–response latency
- Downstream engagement (clicks, watch time, etc.)

### ML Metrics
- Automated overlap scores:
	- [[BLEU]]
	- [[ROUGE]]
	- [[METEOR]]
- Semantic similarity:
	- [[BERTScore]]
	- [[BLEURT]]
- Task-specific fact checkers
- Human ratings on a Likert scale

# Pros

# Cons

# Use Cases
- [[Chatbots]]
- [[Image Synthesis]]
- [[Code Generation]]

# Related Topics
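The automated overlap scores listed under ML Metrics compare generated text against reference answers by n-gram overlap. A minimal sketch of the idea, using a simplified single-reference BLEU (clipped n-gram precision with a brevity penalty, no smoothing — real implementations like sacrebleu add more):

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate, reference, max_n=2):
    """Simplified BLEU: geometric mean of clipped n-gram precisions,
    scaled by a brevity penalty. Single reference, no smoothing."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(clipped / total)
    if min(precisions) == 0:
        return 0.0  # unsmoothed BLEU is 0 if any precision is 0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean
```

A perfect match scores 1.0 and a fully disjoint candidate scores 0.0; production evaluation would instead use an established library and multiple references, as the offline-evaluation note above suggests.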