Reliability is the consistency of scores obtained from an instrument. Consistency may be assessed across items on the same test, across raters of a performance or product, and across scores from different occasions or test forms. Inconsistencies in scores, also known as measurement error, may arise from any of these sources of variation. Simply put, higher reliability reflects smaller inconsistencies and less measurement error in scores.

In classical test theory (CTT), the reliability coefficient is the ratio of true-score variance to observed-score variance. To ease interpretation, this coefficient is often converted to the standard error of measurement (SEM), the standard deviation of the error in observed scores. The SEM is simple to interpret because it is expressed on the score scale: larger values indicate more measurement error and smaller values indicate less. Multiple reliability estimates have been formulated to quantify the proportion of variance in observed scores due to distinct sources of measurement error, each requiring its own data collection design. These estimates include measures of stability, equivalence, internal consistency, inter-rater consistency, and inter-rater consensus. The most common reliability estimate is coefficient alpha, an estimate of the internal consistency of the scores.
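The two quantities above can be computed directly from an item-score matrix. The sketch below uses a small hypothetical persons-by-items data set (the scores and function names are illustrative, not from the text) to calculate coefficient alpha and then convert it to the SEM:

```python
import numpy as np

# Hypothetical 5-person x 4-item score matrix (illustrative data only).
scores = np.array([
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [4, 5, 4, 5],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
], dtype=float)

def cronbach_alpha(x):
    """Coefficient alpha: k/(k-1) * (1 - sum of item variances / total-score variance)."""
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1)        # variance of each item
    total_var = x.sum(axis=1).var(ddof=1)    # variance of total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

alpha = cronbach_alpha(scores)

# SEM: standard deviation of total scores times sqrt(1 - reliability),
# expressed on the same scale as the total scores.
sem = scores.sum(axis=1).std(ddof=1) * np.sqrt(1 - alpha)
```

Because the SEM is on the score scale, a total score of, say, 14 with an SEM near 1 can be read as "likely within about a point of the true score", which is the interpretive convenience the text describes.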

Generalisability (G) theory extends CTT by allowing reliability to be estimated in the presence of multiple sources of error simultaneously. Using analysis of variance, G theory partitions measurement error into different facets (sources of variation), such as raters, occasions, and forms. Both overall and facet-level estimates of measurement error are available.
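The ANOVA-based partitioning can be illustrated with the simplest one-facet design, persons crossed with raters. The sketch below (hypothetical data; a minimal by-hand computation rather than a full G-study package) derives the variance components for persons, raters, and the residual from the mean squares:

```python
import numpy as np

# Hypothetical persons x raters score matrix (one facet: raters).
x = np.array([
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 4, 3],
], dtype=float)
n_p, n_r = x.shape

grand = x.mean()
p_means = x.mean(axis=1)   # each person's mean over raters
r_means = x.mean(axis=0)   # each rater's mean over persons

# Sums of squares for the fully crossed p x r random-effects design.
ss_p = n_r * ((p_means - grand) ** 2).sum()
ss_r = n_p * ((r_means - grand) ** 2).sum()
ss_pr = ((x - p_means[:, None] - r_means[None, :] + grand) ** 2).sum()

ms_p = ss_p / (n_p - 1)
ms_r = ss_r / (n_r - 1)
ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))

# Variance components via the expected mean squares.
var_pr = ms_pr                   # residual (interaction + error)
var_p = (ms_p - ms_pr) / n_r     # universe-score (person) variance
var_r = (ms_r - ms_pr) / n_p     # rater facet variance
```

Here `var_r` is the facet-level error attributable to rater severity, while `var_pr` absorbs the person-by-rater interaction and remaining error; with more facets (e.g. occasions), the same logic yields a component for each.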

Reliability estimate calculations differ depending on whether scores are given absolute interpretations (comparison to a criterion) or relative interpretations (comparison to a group). Like CTT, G theory provides a single reliability estimate for all scores in the distribution.
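The absolute/relative distinction shows up in which variance components count as error. The sketch below assumes hypothetical variance components from a persons-by-raters G study (the numbers are illustrative) and computes the two coefficients: relative error feeds the generalisability coefficient, while absolute error also includes the rater facet and feeds the dependability coefficient:

```python
# Hypothetical variance components (person, rater, residual) from a
# persons x raters study; n_r is the number of raters averaged over.
var_p, var_r, var_pr = 1.0833, 0.25, 0.0833
n_r = 3

rel_err = var_pr / n_r                 # relative error: affects rank ordering only
abs_err = var_r / n_r + var_pr / n_r   # absolute error: adds the rater facet

g_coef = var_p / (var_p + rel_err)     # generalisability coefficient (relative)
phi = var_p / (var_p + abs_err)        # dependability coefficient (absolute)
```

Because absolute error includes everything in relative error plus the facet main effects, the dependability coefficient can never exceed the generalisability coefficient for the same design.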