
CAPE-V and GRBAS for the Acoustic Era: Reconciling Perceptual and Acoustic Voice Assessment

May 8, 2026 · 16 min read · Jorge C. Lucero

🎯 Key Takeaways

  • CAPE-V and GRBAS measure perception, not acoustics—no acoustic parameter "is" a CAPE-V dimension; parameters predict ratings imperfectly
  • Different dimensions need different parameters—shimmer and noise measures dominate severity; CPPS and GNE dominate breathiness; shimmer and H1-H2 dominate roughness; HNR and F0 dominate strain
  • One number cannot replace four ratings—a global index like AVQI captures severity well but loses dimension-specific information
  • Strain is the hardest to predict—accuracy drops because strain reflects laryngeal effort that acoustic measures only partially capture
  • Acoustic measures supplement, not replace, perceptual judgment—use them to quantify change over time, document outcomes, and triangulate clinical impressions

The CAPE-V and GRBAS scales are the most widely used perceptual voice quality assessments in clinical practice. Both ask the clinician to rate dimensions of voice quality—Severity (or Grade), Roughness, Breathiness, Strain—on a structured scale. Both are recommended in international consensus statements. Both have decades of clinical validation behind them.

And both produce ratings that are, fundamentally, subjective. Inter-rater reliability for CAPE-V dimensions is typically reported in the 0.70–0.85 range for severity and breathiness, but drops for roughness and strain. Ratings drift across sessions. Two qualified clinicians can rate the same recording and disagree by clinically meaningful margins.

This is where acoustic analysis enters the conversation. If we can measure something objectively in the voice signal that predicts what trained ears hear, we get reproducibility, documentation, and a way to track change without depending on memory of "how the patient sounded six weeks ago." But the relationship between perceptual ratings and acoustic parameters is more nuanced than most clinical training conveys—and the recent acoustic literature has shifted the picture significantly.

This guide answers a practical question: for each CAPE-V and GRBAS dimension, which acoustic parameters actually predict the perceptual rating—and how well?

A Translation Problem, Not a Replacement

Before mapping perceptual dimensions to acoustic parameters, it is worth being precise about what the mapping is and is not.

"Perceptual ratings and acoustic measurements answer different questions about the same voice."

A CAPE-V Breathiness rating answers: How breathy does this voice sound to a trained listener? An acoustic measure of breathiness (e.g., CPPS, GNE) answers: What spectral or noise property of the signal correlates with that perception? The two are related but not identical. A voice can produce signal patterns that statistically associate with breathiness without sounding breathy in a particular sample, and a clinician can perceive breathiness from cues that do not map cleanly to any single parameter.

This distinction matters because it sets the right expectation. Acoustic analysis cannot replace the perceptual rating; it can quantify a related signal property that, in aggregate, tracks the perception well. The clinical use is reproducibility and documentation, not substitution.

A Common Misconception

Reading "CPPS correlates with breathiness" sometimes leads to the inference that low CPPS means the voice is breathy. The correct inference is statistical: across a population, voices with lower CPPS tend to be rated as breathier. For an individual patient, low CPPS is one piece of evidence that needs to be reconciled with the perceptual rating, the laryngoscopic finding, and the case history.

What "Predicting a Rating" Actually Means

When the literature reports that an acoustic parameter "predicts" CAPE-V Breathiness with R² = 0.6, this means that variation in the parameter explains roughly 60% of the variation in breathiness ratings across speakers. The remaining 40% reflects genuine perceptual information not captured by the parameter, plus rater noise.

Two implications follow. First, no single parameter—and no combination of parameters from sustained vowels—captures everything trained listeners hear. Second, the ceiling on prediction accuracy is set partly by inter-rater reliability itself: if two raters agree on Roughness only 70% of the time, no acoustic model can predict their averaged ratings perfectly.
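To make the arithmetic concrete, here is a minimal sketch of what "R² ≈ 0.6" looks like. The data are synthetic, not clinical: the CPPS range, the CPPS-to-breathiness slope, and the noise magnitude are all invented for illustration.

```python
import numpy as np

# Synthetic illustration (not real clinical data): CPPS values and
# averaged CAPE-V Breathiness ratings for 50 hypothetical speakers.
rng = np.random.default_rng(0)
cpps = rng.uniform(4.0, 16.0, size=50)           # dB, hypothetical range
noise = rng.normal(0.0, 12.0, size=50)           # unmodeled perception + rater noise
breathiness = np.clip(90.0 - 5.0 * cpps + noise, 0.0, 100.0)  # 0-100 scale

# Fit a simple linear predictor and compute R^2 by hand.
slope, intercept = np.polyfit(cpps, breathiness, 1)
predicted = slope * cpps + intercept
ss_res = np.sum((breathiness - predicted) ** 2)
ss_tot = np.sum((breathiness - np.mean(breathiness)) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# With R^2 around 0.6, roughly 60% of rating variance is explained by
# the parameter; the rest is perceptual information it does not carry,
# plus rater noise.
print(f"R^2 = {r_squared:.2f}")
```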

With those caveats in place, here is what the evidence shows for each dimension.

Overall Severity (CAPE-V) and Grade (GRBAS)

Severity and Grade ask essentially the same question: how deviant is this voice overall? They are the most reliable perceptual dimensions and the most studied acoustically.

Top acoustic correlates (in approximate order of incremental contribution):

  • Shimmer (in dB)—cycle-to-cycle amplitude irregularity
  • GNE (Glottal-to-Noise Excitation Ratio)—periodic versus aperiodic glottal source energy
  • HNo-6000—relative high-frequency noise energy
  • CPPS—overall harmonic strength
  • F0—fundamental frequency, contributing modestly

Multiparametric indices that combine several of these parameters (AVQI, CSID) achieve good agreement with overall severity ratings, with reported AUC values around 0.80–0.87 for distinguishing dysphonic from non-dysphonic voices in research databases.
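The AUC figures can be read directly: the probability that a randomly chosen dysphonic voice scores higher on the index than a randomly chosen non-dysphonic one. A minimal sketch, using invented AVQI-like scores (higher = more deviant) rather than real database values:

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC via the rank-sum (Mann-Whitney) formulation: the probability
    that a randomly chosen positive case outscores a randomly chosen
    negative case, with ties counting 0.5."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical index scores for non-dysphonic (0) and dysphonic (1) voices.
scores = [2.1, 2.8, 3.0, 3.4, 4.2, 4.9, 5.5, 6.1, 6.8, 7.3]
labels = [0,   0,   0,   1,   0,   1,   1,   1,   1,   1]
print(f"AUC = {roc_auc(scores, labels):.2f}")  # → AUC = 0.96
```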

Why Shimmer (Not Jitter) Leads

Earlier research often emphasized jitter as a primary correlate of voice quality. More recent work using larger samples and modern feature-selection methods consistently ranks shimmer (in dB) above jitter for predicting overall severity. One likely reason: shimmer captures both amplitude irregularity from disordered vocal fold vibration and amplitude variation from incomplete glottal closure, making it sensitive to multiple production mechanisms.
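The shimmer (dB) definition is simple enough to compute directly from cycle peak amplitudes. A sketch following the common "local, dB" formulation (mean absolute dB difference between consecutive cycle amplitudes), with hypothetical amplitude values:

```python
import numpy as np

def shimmer_db(peak_amplitudes):
    """Local shimmer in dB: mean absolute difference, in dB, between
    the peak amplitudes of consecutive glottal cycles."""
    a = np.asarray(peak_amplitudes, dtype=float)
    return np.mean(np.abs(20.0 * np.log10(a[1:] / a[:-1])))

# Hypothetical cycle peak amplitudes (arbitrary linear units):
steady   = [0.50, 0.51, 0.50, 0.49, 0.50, 0.51]   # near-regular vibration
unsteady = [0.50, 0.38, 0.55, 0.31, 0.58, 0.36]   # irregular vibration

print(f"steady:   {shimmer_db(steady):.2f} dB")
print(f"unsteady: {shimmer_db(unsteady):.2f} dB")
```

In practice the hard part is not this arithmetic but reliably locating cycle peaks in a disordered signal, which is where analysis software differs.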

Breathiness

Breathiness arises when the vocal folds fail to close completely during phonation, allowing turbulent airflow to escape through the glottal gap. The acoustic consequence is added aperiodic energy—noise—mixed with the periodic harmonic structure.

Top acoustic correlates:

  • CPPS—decreases as harmonic structure is degraded by aspiration noise
  • GNE—directly quantifies the periodic-to-aperiodic ratio at the glottal source
  • HNo-6000—captures high-frequency noise energy where aspiration is most prominent

Breathiness is the dimension where acoustic prediction works best. Three carefully chosen parameters can predict breathiness ratings nearly as well as nine-parameter indices, suggesting that the acoustic signature of incomplete glottal closure is relatively concentrated in a few measures.

Older clinical training sometimes emphasized HNR as the breathiness measure. HNR does correlate with breathiness, but it requires reliable F0 tracking and degrades when the signal becomes severely aperiodic. CPPS does not require period-by-period tracking and is more robust in moderate-to-severe cases. For most clinical purposes, CPPS and GNE outperform HNR for breathiness.
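The robustness difference follows from how CPP is computed: it looks for a cepstral peak and measures its height above a regression line, never extracting individual periods. A simplified, unsmoothed sketch — Praat's CPPS adds time and quefrency smoothing and differs in detail, so values here will not match Praat:

```python
import numpy as np

def cpp(frame, fs, f0_min=60.0, f0_max=330.0):
    """Simplified, unsmoothed cepstral peak prominence in dB (a sketch,
    not a CPPS implementation)."""
    windowed = frame * np.hanning(len(frame))
    log_spec = 20.0 * np.log10(np.abs(np.fft.rfft(windowed)) + 1e-12)
    cepstrum = np.fft.irfft(log_spec)             # real cepstrum
    quefrency = np.arange(len(cepstrum)) / fs     # seconds
    # Search the quefrency band corresponding to plausible F0 periods.
    band = (quefrency >= 1.0 / f0_max) & (quefrency <= 1.0 / f0_min)
    peak_idx = np.argmax(np.where(band, cepstrum, -np.inf))
    # Regress cepstrum on quefrency over the band; prominence is the
    # peak height above the regression line at the peak quefrency.
    slope, intercept = np.polyfit(quefrency[band], cepstrum[band], 1)
    return cepstrum[peak_idx] - (slope * quefrency[peak_idx] + intercept)

# Synthetic 150 Hz harmonic frame vs. white noise, both at 16 kHz.
fs = 16000
t = np.arange(4096) / fs
voiced = sum(np.sin(2 * np.pi * 150 * k * t) / k for k in range(1, 11))
rng = np.random.default_rng(1)
noise = rng.normal(size=4096)
print(f"voiced CPP: {cpp(voiced, fs):.1f}   noise CPP: {cpp(noise, fs):.1f}")
```

Note that no step above requires deciding where one period ends and the next begins — which is exactly what fails in severely aperiodic voices.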

Roughness

Roughness is associated with irregular vocal fold vibration: asymmetric oscillation, aperiodic cycle-to-cycle variation, subharmonic patterns, or diplophonia. It is perceptually distinct from breathiness—a rough voice is not necessarily breathy, and vice versa—but the two often co-occur in clinical samples.

Top acoustic correlates:

  • Shimmer (in dB)—directly reflects the cycle-to-cycle amplitude instability of irregular vibration
  • H1-H2 (first vs. second harmonic amplitude)—reflects the spectral shape of the glottal source; flatter spectra (smaller H1-H2) accompany irregular vibration
  • GNE—aperiodic excitation rises with vibratory irregularity
  • Jitter (in log form)—captures cycle-to-cycle frequency irregularity

Prediction accuracy for roughness is meaningfully lower than for breathiness. This appears to be substantive, not a measurement artifact: roughness encompasses a heterogeneous set of vibratory abnormalities, from mild irregularity to subharmonic breaks and diplophonia, that linear acoustic measures from sustained vowels capture only partially.
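H1-H2 itself is straightforward to read off a magnitude spectrum once F0 is known. A bare sketch on synthetic frames — it applies no formant correction, which clinical implementations typically do, and the amplitudes are invented to contrast a steep source spectrum with a flattened one:

```python
import numpy as np

def h1_h2(frame, fs, f0):
    """H1-H2 in dB: first-harmonic amplitude minus second-harmonic
    amplitude, read from the magnitude spectrum near f0 and 2*f0.
    No formant correction (a sketch only)."""
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)

    def harmonic_db(target_hz, tol_hz=20.0):
        # Strongest bin within +/- tol_hz of the expected harmonic.
        sel = np.abs(freqs - target_hz) <= tol_hz
        return 20.0 * np.log10(spectrum[sel].max() + 1e-12)

    return harmonic_db(f0) - harmonic_db(2.0 * f0)

# Synthetic source-like frames at 120 Hz, 16 kHz sampling:
fs, f0 = 16000, 120.0
t = np.arange(8192) / fs
steep_source = 1.0 * np.sin(2*np.pi*f0*t) + 0.3 * np.sin(2*np.pi*2*f0*t)
flat_source  = 0.6 * np.sin(2*np.pi*f0*t) + 0.6 * np.sin(2*np.pi*2*f0*t)
print(f"steep: {h1_h2(steep_source, fs, f0):.1f} dB"
      f"  flat: {h1_h2(flat_source, fs, f0):.1f} dB")
```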

Clinical Implication

For roughness, acoustic measures are best used to support the perceptual rating and to track change over time, rather than to assign a category on a single measurement. A spectrogram inspection often adds information that no scalar measure provides—visible subharmonic banding, irregular pulse spacing, and mode breaks are diagnostic features that aggregate parameters miss.

Strain

Strain reflects effortful, hyperfunctional phonation: increased laryngeal muscle tension, firm glottal closure, and elevated subglottal pressure. It is the most clinically nuanced of the four dimensions and the hardest to predict acoustically.

Top acoustic correlates:

  • HNR—often elevated in strained voice (clean periodicity can coexist with hyperfunction)
  • F0—tends to rise with increased longitudinal tension of the vocal folds
  • Spectral tilt—less negative tilt (more high-frequency energy) accompanies pressed phonation
  • H1-H2—decreases with firmer glottal adduction

Strain prediction works less well than the other dimensions for two reasons. First, strain is perceptually complex—clinicians integrate cues from voice onset, register transitions, and effort that go beyond steady-state vowel acoustics. Second, hyperfunction can produce cleaner signals on some measures (higher HNR, lower jitter), which is the opposite of what most acoustic indices were designed to detect. A perfectly periodic but pressed voice will not look "abnormal" on traditional perturbation measures.

For strain, acoustic measures are most useful as part of a profile that includes maximum phonation time, S/Z ratio, and connected-speech observations. A single number is unlikely to do justice to the dimension.

Why a Single Index Cannot Replace Four Ratings

Multiparametric indices like AVQI, ABI, and CSID combine several acoustic parameters into a single score that correlates well with overall dysphonia severity. They are valuable: a global severity number is faster to interpret, easier to track over time, and more robust than any single parameter.

But severity is not the same as roughness, and roughness is not the same as breathiness or strain. The parameter ranking changes by dimension in ways that matter clinically:

Dimension     Top Predictors        Production Mechanism
Severity      Shimmer (dB), GNE     Overall vibratory and source disruption
Breathiness   CPPS, GNE             Incomplete glottal closure, aspiration noise
Roughness     Shimmer (dB), H1-H2   Irregular vocal fold vibration
Strain        HNR, F0, tilt         Hyperfunction, increased medial compression

A patient with mild breathiness from incomplete closure and a patient with mild strain from hyperfunction can produce the same AVQI score for very different reasons. The dimensions provide clinically actionable information that the global score collapses. This is not a criticism of multiparametric indices—it is an argument for using them alongside dimension-specific measures, not in place of them.

A Practical Workflow

For a clinician who already uses CAPE-V or GRBAS perceptually, acoustic analysis adds value in three concrete ways:

1. Document the baseline objectively

Pair the CAPE-V form with a small set of acoustic measures: an overall severity index (AVQI or CSID), CPPS for breathiness, shimmer for roughness severity, and F0 statistics for strain context. Store both. The perceptual rating remains primary; the acoustic measures support and date-stamp it.
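One minimal way to "store both" is a single date-stamped record. A sketch only — the field names and value ranges below are illustrative, not a documentation standard:

```python
from dataclasses import dataclass, asdict
from datetime import date
import json

@dataclass
class VoiceBaseline:
    """Pairs a CAPE-V form with its supporting acoustic measures."""
    session_date: str
    capev: dict      # perceptual ratings, 0-100 per dimension (primary)
    acoustic: dict   # supporting measures, with units in the key names

baseline = VoiceBaseline(
    session_date=date(2026, 5, 8).isoformat(),
    capev={"severity": 42, "roughness": 30, "breathiness": 55, "strain": 18},
    acoustic={"avqi": 5.1, "cpps_db": 8.2, "shimmer_db": 0.9,
              "f0_mean_hz": 196.0, "f0_sd_hz": 22.0},
)
print(json.dumps(asdict(baseline), indent=2))
```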

2. Track change with measures matched to the target

If therapy targets glottal closure (e.g., for unilateral vocal fold paresis), CPPS and GNE are the measures to track—not shimmer or jitter. If therapy targets hyperfunction, HNR and F0 trends matter more than overall severity indices. Choose the measure that aligns with the treatment mechanism.

3. Triangulate when perceptual ratings are uncertain

When two clinicians disagree, when self-rating differs from clinician rating, or when a patient's voice fluctuates session to session, an acoustic profile gives an external reference point. It does not adjudicate the perceptual disagreement, but it constrains the range of plausible interpretations.

What Acoustic Analysis Will Not Tell You

A complete picture of voice quality requires more than acoustic measurement. The CAPE-V protocol exists because voice quality involves perception that integrates context, connected speech, phonatory onset, and registration in ways that sustained vowel acoustics do not capture. Several clinically important phenomena are systematically under-detected by acoustic measures:

  • Onset and offset patterns—glottal attack, breathy attack, and hard offsets
  • Connected-speech instability—phonation breaks, register transitions, vocal fry segments
  • Pitch breaks and diplophonia—often present in connected speech but not in sustained vowels
  • Vocal effort that does not produce measurable signal disturbance—pressed voice with clean periodicity
  • Resonance abnormalities—hypernasality and cul-de-sac resonance, which affect perceived voice quality without originating primarily at the larynx

Perceptual judgment by a trained clinician integrates these in a way that no current acoustic system does. The role of acoustic analysis is to make the perceptual judgment more reproducible and easier to document, not to replace it.

Summary

  1. CAPE-V and GRBAS measure perception. Acoustic parameters predict ratings statistically; they are not the same construct.
  2. Different dimensions have different acoustic signatures. Severity ≠ Breathiness ≠ Roughness ≠ Strain in terms of which parameters matter most.
  3. Breathiness is the easiest to predict acoustically; strain is the hardest. This reflects production mechanisms, not measurement quality.
  4. Multiparametric indices excel at severity but lose dimension-specific information. Use them with dimension-targeted measures, not instead of them.
  5. Acoustic analysis supports perceptual judgment. Its clinical role is documentation, change tracking, and triangulation—not replacement.

📊 Pair Your Perceptual Ratings with Acoustic Measures

PhonaLab computes the parameters discussed here—CPPS, HNR, shimmer, jitter, F0, AVQI, ABI, and CSID—directly in the browser, validated against Praat. No installation, and audio is processed in memory and never stored.


⚠️ Educational Information

This article presents acoustic and perceptual voice assessment concepts and summarizes published research findings for educational purposes. It does not constitute clinical advice, diagnostic guidance, or treatment recommendations. Clinical decisions regarding voice assessment and intervention should be made by qualified, licensed healthcare professionals based on comprehensive evaluation of individual patients. PhonaLab provides acoustic measurement tools; it does not provide clinical interpretations or medical diagnoses.

References & Further Reading

  • Kempster GB, Gerratt BR, Verdolini Abbott K, Barkmeier-Kraemer J, Hillman RE. (2009). Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V): Development of a standardized clinical protocol. American Journal of Speech-Language Pathology, 18(2), 124-132.
  • Hirano M. (1981). Clinical Examination of Voice. Springer-Verlag.
  • Maryn Y, Corthals P, Van Cauwenberge P, Roy N, De Bodt M. (2010). Toward improved ecological validity in the acoustic measurement of overall voice quality: Combining continuous speech and sustained vowels. Journal of Voice, 24(5), 540-555.
  • Barsties v. Latoszek B, Maryn Y, Gerrits E, De Bodt M. (2017). The Acoustic Breathiness Index (ABI): A multivariate acoustic model for breathiness. Journal of Voice, 31(4), 511.e11-511.e27.
  • Awan SN, Roy N, Jetté ME, Meltzner GS, Hillman RE. (2010). Quantifying dysphonia severity using a spectral/cepstral-based acoustic index: Comparisons with auditory-perceptual judgements from the CAPE-V. Clinical Linguistics & Phonetics, 24(9), 742-758.
  • Walden PR. (2022). Perceptual Voice Qualities Database (PVQD): Database characteristics. Journal of Voice, 36(6), 875.e15.
  • Patel RR, Awan SN, Barkmeier-Kraemer J, Courey M, Deliyski D, Eadie T, et al. (2018). Recommended protocols for instrumental assessment of voice: American Speech-Language-Hearing Association expert panel to develop a protocol for instrumental assessment of vocal function. American Journal of Speech-Language Pathology, 27(3), 887-905.
  • Lucero JC. (2026). Algorithm verification and concurrent validity of a web-based platform for multiparametric acoustic voice quality indices. Journal of Voice. Open access. doi:10.1016/j.jvoice.2026.04.009
  • Michaelis D, Gramss T, Strube HW. (1997). Glottal-to-noise excitation ratio—a new measure for describing pathological voices. Acta Acustica united with Acustica, 83(4), 700-706.
  • Hillenbrand J, Cleveland RA, Erickson RL. (1994). Acoustic correlates of breathy vocal quality. Journal of Speech and Hearing Research, 37(4), 769-778.