# BLEU and ROUGE

> Evaluate sequence-to-sequence model performance using BLEU and ROUGE metrics to measure n-gram overlap between generated and target outputs

BLEU and ROUGE are metrics used heavily in sequence-to-sequence tasks measuring n-gram overlap between a generated response and a target output.

BLEU and ROUGE are only supported in experiments, and require a Ground Truth to be set in the `output` column of your experiment's [dataset](/sdk-api/experiments/datasets).

## BLEU

Why BLEU Score?

BLEU (Bilingual Evaluation Understudy) addresses a fundamental challenge in natural language processing: how do we evaluate generated text when multiple correct outputs are possible? Unlike classification tasks where outputs can be compared directly, language generation tasks often have many valid ways to express the same idea.

For example, these sentences could both be valid translations:

* "The ball is blue"
* "The ball has a blue color"

BLEU provides a quantitative way to evaluate such outputs by measuring how closely they match one or more reference texts.

### BLEU score components

Key Elements

N-grams: Sets of consecutive words in a sentence. For example, in "The ball is blue":

* 1-gram: "The", "ball", "is", "blue"
* 2-gram: "The ball", "ball is", "is blue"

Precision: Measures word overlap while preventing inflation through repetition. The number of matches credited to each word is capped at the maximum number of times that word occurs in the reference text.

Brevity penalty: Penalizes outputs that are too short compared to the reference, preventing gaming the metric with minimal outputs.

Score range: Scores range from 0 to 1, with 0.6-0.7 considered excellent. Scores near 1 may indicate overfitting.

### BLEU calculation method

Computing BLEU Score

N-gram precision:

1. Count matching n-grams between generated and reference text
2. Apply clipping to prevent inflation from repeated words
3. Divide by total number of n-grams in generated text

Combining precisions:

1. Apply weights to each n-gram level (typically uniform weights)
2. Calculate weighted geometric mean of precision scores
3. Result ranges from 0 to 1

Brevity penalty (BP):

1. Calculate ratio of generated length to reference length
2. If shorter than reference, apply exponential penalty: BP = exp(1 − r/c), where c is the generated length and r is the reference length
3. BP = 1 if output is at least as long as the reference

Final score:

BLEU = BP × exp(Σₙ wₙ × log pₙ)

where BP is the brevity penalty, wₙ are the weights, and pₙ are the clipped n-gram precision scores.

### BLEU score variants

Types of BLEU Scores

Different BLEU variants capture different aspects of text similarity. Higher-order n-grams help ensure grammatical correctness and phrase structure:

BLEU-1: Uses only unigram precision, good for capturing basic content overlap

BLEU-2: Geometric average of unigram and bigram precision, begins to capture local word order

BLEU-3: Includes up to trigram precision, better at capturing phrase structures

BLEU-4: Most common variant, uses up to 4-gram precision, best at ensuring fluent and grammatical outputs
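The calculation steps and variants described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular library or by the platform: it assumes whitespace tokenization, a single reference, uniform weights, and no smoothing, and the function names (`ngrams`, `clipped_precision`, `brevity_penalty`, `bleu`) are made up for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all consecutive n-token tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram count is capped
    at the number of times that n-gram occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


def brevity_penalty(candidate_len, reference_len):
    """1.0 when the candidate is at least as long as the reference,
    otherwise the exponential penalty exp(1 - r/c)."""
    if candidate_len == 0:
        return 0.0
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


def bleu(candidate, reference, max_n=4):
    """BLEU-max_n with uniform weights and a single reference.
    Assumption: no smoothing, so any zero precision gives a score of 0."""
    precisions = [clipped_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(candidate), len(reference)) * math.exp(log_mean)


reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()

for n in range(1, 5):
    print(f"BLEU-{n}: {bleu(candidate, reference, max_n=n):.3f}")

# Clipping in action: "the" appears 7 times in this candidate but only
# twice in the reference, so only 2 of the 7 unigrams count as matches.
repeated = "the the the the the the the".split()
print(f"clipped unigram precision: {clipped_precision(repeated, reference, 1):.3f}")
```

For this pair the score falls as the n-gram order rises, and BLEU-4 is 0 because the candidate shares no 4-gram with the reference. Production implementations usually apply smoothing to avoid such hard zeros on short texts.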

### Strengths and limitations

Understanding Trade-offs

Strengths:

* Quick to calculate
* Language-independent
* Correlates with human judgment
* Supports multiple references

Limitations:

* Doesn't consider meaning
* Misses word variations
* Treats all words equally
* Limited by reference quality

## ROUGE

What is ROUGE?

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics designed to evaluate AI-generated texts, particularly summaries and translations. It compares a generated text against one or more reference texts and measures how much of the reference content the generated text captures.

### ROUGE variants

Types of ROUGE Metrics

ROUGE-N: Evaluates n-gram overlap:

* ROUGE-1: Single words
* ROUGE-2: Two-word phrases
* ROUGE-3: Three-word phrases

ROUGE-L: Measures the longest common subsequence. Matched words must appear in the same order in both texts but need not be adjacent, so the metric tolerates inserted or omitted words.

ROUGE-W: Weighted version of ROUGE-L that prioritizes longer consecutive matching sequences, promoting natural flow and coherence.

ROUGE-S: Examines skip-bigrams, allowing gaps between matched words to capture rephrased content.

### ROUGE calculation method

Computing ROUGE Scores

Preparation:

1. Process generated text and reference text(s)
2. Extract relevant units (n-grams, sequences, or skip-grams)
3. Handle multiple references if available

Precision:

1. Count matching units in generated text
2. Divide by total units in generated text
3. Precision = matches / total\_generated

Recall:

1. Count matching units in reference text
2. Divide by total units in reference text
3. Recall = matches / total\_reference

F1 score:

F1 = 2 × (precision × recall) / (precision + recall)

Variant-Specific Calculations:

ROUGE-N: Apply above steps using n-gram matches (unigrams, bigrams, etc.)

ROUGE-L: Use longest common subsequence instead of n-grams

ROUGE-W: Apply weights based on consecutive matches in ROUGE-L

ROUGE-S: Consider skip-bigram matches with flexible word gaps
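The variant-specific calculations above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than a reference implementation: it uses whitespace tokenization, a single reference, no stemming, and skip-bigrams with an unlimited gap. ROUGE-W is omitted because its weighting scheme needs more machinery than fits a short example. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def precision_recall_f1(matches, total_generated, total_reference):
    """Turn raw match counts into (precision, recall, F1)."""
    precision = matches / total_generated if total_generated else 0.0
    recall = matches / total_reference if total_reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def overlap(gen_units, ref_units):
    """Score two collections of units using multiset (clipped) matching."""
    gen_counts, ref_counts = Counter(gen_units), Counter(ref_units)
    matches = sum((gen_counts & ref_counts).values())
    return precision_recall_f1(matches, sum(gen_counts.values()), sum(ref_counts.values()))


def rouge_n(generated, reference, n=1):
    """ROUGE-N: overlap of consecutive n-grams."""
    def grams(tokens):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return overlap(grams(generated), grams(reference))


def rouge_s(generated, reference):
    """ROUGE-S: overlap of skip-bigrams (in-order word pairs with any gap)."""
    return overlap(combinations(generated, 2), combinations(reference, 2))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(generated, reference):
    """ROUGE-L: LCS length relative to the length of each text."""
    return precision_recall_f1(lcs_length(generated, reference), len(generated), len(reference))


reference = "the cat is on the mat".split()
generated = "the cat sat on the mat".split()

for name, scores in [
    ("ROUGE-1", rouge_n(generated, reference, 1)),
    ("ROUGE-2", rouge_n(generated, reference, 2)),
    ("ROUGE-L", rouge_l(generated, reference)),
    ("ROUGE-S", rouge_s(generated, reference)),
]:
    p, r, f1 = scores
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Routing every variant through the same `precision_recall_f1` helper mirrors the shared precision, recall, and F1 steps: only the unit being matched changes between variants.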

### ROUGE components

Key Metrics

Precision: Measures how much of the AI-generated text is relevant to the reference

Recall: Evaluates how much of the reference text is captured in the AI output

F1-Score: Balanced measure combining precision and recall
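To see why the three components are reported together, the short sketch below scores a very terse and a very verbose output against the same reference. It assumes unigram matching on whitespace-split text, and `unigram_scores` is a hypothetical helper written for this example only.

```python
from collections import Counter


def unigram_scores(generated, reference):
    """Return (precision, recall, F1) over whitespace-split unigrams."""
    gen_counts = Counter(generated.split())
    ref_counts = Counter(reference.split())
    matches = sum((gen_counts & ref_counts).values())
    if matches == 0:
        return 0.0, 0.0, 0.0
    precision = matches / sum(gen_counts.values())
    recall = matches / sum(ref_counts.values())
    return precision, recall, 2 * precision * recall / (precision + recall)


reference = "the cat is on the mat"
outputs = {
    "terse": "the cat",
    "verbose": "the cat is on the mat in the house by the river",
}

for name, text in outputs.items():
    p, r, f1 = unigram_scores(text, reference)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

The terse output is fully precise (1.00) but recalls only a third of the reference, while the verbose output recalls everything (1.00) at half the precision. F1 penalizes both imbalances, giving 0.50 and 0.67 respectively.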

## Optimizing your AI system

Using BLEU and ROUGE Effectively

To effectively use these metrics in your system:

Set Ground Truth: Ensure your dataset includes reference outputs for comparison.

Monitor Performance: Use scores to identify areas where model outputs deviate from expected results.

Consider Limitations: Remember BLEU doesn't account for meaning, word variants, or word importance.

Iterate and Improve: Focus optimization efforts on areas with lower overlap scores.
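One way to act on the steps above is to rank rows by overlap score and review the lowest ones first. The sketch below is a standalone illustration and does not use any Galileo API: the `rows` list, its field names, and the `rouge1_f1` helper are all hypothetical stand-ins for your own generated outputs and ground truth.

```python
from collections import Counter


def rouge1_f1(generated, reference):
    """Unigram-overlap F1 on lowercased, whitespace-split text."""
    gen_counts = Counter(generated.lower().split())
    ref_counts = Counter(reference.lower().split())
    matches = sum((gen_counts & ref_counts).values())
    if matches == 0:
        return 0.0
    precision = matches / sum(gen_counts.values())
    recall = matches / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical rows: each pairs a model output with its ground truth.
rows = [
    {"id": "row-1", "generated": "The ball is blue", "ground_truth": "The ball is blue"},
    {"id": "row-2", "generated": "The ball has a blue color", "ground_truth": "The ball is blue"},
    {"id": "row-3", "generated": "A red cube", "ground_truth": "The ball is blue"},
]

# Sort ascending so the rows that deviate most from the ground truth come first.
ranked = sorted(rows, key=lambda row: rouge1_f1(row["generated"], row["ground_truth"]))

for row in ranked:
    score = rouge1_f1(row["generated"], row["ground_truth"])
    print(f"{row['id']}: {score:.2f}")
```

Note that `row-2` is a valid paraphrase of its ground truth yet scores well below 1. This is the limitation described above: a low overlap score is a prompt for review, not proof that the output is wrong.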