Cohen's Kappa Calculator

Cohen's kappa (κ) measures how well two raters agree when classifying the same items into categories, after subtracting the agreement expected by chance. κ = 1 is perfect agreement, κ = 0 is no better than chance, and values from 0.61 to 0.80 are conventionally called “substantial” agreement.

Reviewed by the crosstabs.com methods team · Last updated June 11, 2026

Run this on your own data — free, no signup

Upload a CSV or XLSX. Everything runs in your browser; your file never leaves your device.

Open the workspace →

Calculate online

Enter the agreement matrix: rows are rater A's classifications, columns are rater B's, and each cell counts the items the pair placed in that combination. The highlighted diagonal is where the raters agree. Category labels are editable, and the matrix supports 2–6 categories.

Agreement matrix
Rows = rater A, columns = rater B. Agreement sits on the diagonal.

Rater A ↓ / Rater B →	Yes	No	Total
Yes	20	5	25
No	10	15	25
Total	30	20	50

Cohen's kappa (κ)

0.400

fair agreement (Landis & Koch) · 95% CI 0.146 to 0.654 · p = .004 vs κ = 0

κ = 0.40, 95% CI [0.15, 0.65], N = 50

Observed agreement (pₒ)	0.700
Chance agreement (pₑ)	0.500
Standard error of κ	0.1296

Why not just use percent agreement?

Raw percent agreement is inflated by chance. If two raters each say “yes” 90% of the time at random, they will agree about 82% of the time without any real shared judgment. Kappa removes that baseline: it rescales agreement so 0 means “exactly what chance would produce from these raters' marginal rates” and 1 means perfect agreement.
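The 82% figure follows directly from the two raters' marginal rates. A minimal Python sketch, assuming the raters answer independently of each other (the function name `chance_agreement` is illustrative and not part of the calculator):

```python
def chance_agreement(p_yes_a: float, p_yes_b: float) -> float:
    """Expected agreement for two independent raters on a yes/no item."""
    both_yes = p_yes_a * p_yes_b
    both_no = (1 - p_yes_a) * (1 - p_yes_b)
    return both_yes + both_no


# Two raters who each say "yes" 90% of the time, with no shared judgment.
print(round(chance_agreement(0.9, 0.9), 2))  # 0.82
```

The same expression is the two-category case of pₑ: the more lopsided the "yes" rates, the closer chance agreement gets to 1.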

Kappa is standard for coding reliability in content analysis, diagnostic agreement between clinicians, and label quality checks for machine-learning training data.

Formula

Definition

κ = (pₒ − pₑ) / (1 − pₑ)

pₒ = observed agreement: the proportion of items on the diagonal
pₑ = chance agreement: Σᵢ (row totalᵢ × column totalᵢ) / n², from the raters' marginal rates
SE(κ) = √( pₒ(1 − pₒ) / (n(1 − pₑ)²) ), the simple large-sample approximation given by Cohen (1960); Fleiss, Cohen & Everitt (1969) derive a more exact large-sample standard error. The 95% CI is κ ± 1.96·SE
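The definitions above translate almost line for line into code. The following is a sketch in plain Python, not the calculator's own implementation; the function name `cohens_kappa` and the list-of-lists input format are assumptions made for illustration.

```python
import math


def cohens_kappa(matrix):
    """Return (p_o, p_e, kappa, se, ci) for a square agreement matrix.

    matrix[i][j] counts the items rater A placed in category i
    and rater B placed in category j.
    """
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]

    p_o = sum(matrix[i][i] for i in range(k)) / n  # diagonal proportion
    p_e = sum(row_totals[i] * col_totals[i] for i in range(k)) / n**2
    kappa = (p_o - p_e) / (1 - p_e)  # undefined when p_e == 1

    # Large-sample standard error and 95% interval, as defined above.
    se = math.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
    ci = (kappa - 1.96 * se, kappa + 1.96 * se)
    return p_o, p_e, kappa, se, ci


p_o, p_e, kappa, se, ci = cohens_kappa([[20, 5], [10, 15]])
print(f"{p_o:.3f} {p_e:.3f} {kappa:.3f} {se:.4f}")  # 0.700 0.500 0.400 0.1296
print(f"{ci[0]:.3f} to {ci[1]:.3f}")                # 0.146 to 0.654
```

The sketch does no input validation: a non-square matrix, or one where every item falls in a single category (pₑ = 1), would need handling before real use.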

Worked example

Two reviewers screen 50 abstracts for inclusion in a systematic review. Both say “include” for 20, both say “exclude” for 15, and they disagree on the other 15 (matrix [[20, 5], [10, 15]]).

Observed agreement pₒ = (20 + 15) / 50 = 0.70. Chance agreement from the marginals is pₑ = (25×30 + 25×20) / 50² = 0.50.

κ = (0.70 − 0.50) / (1 − 0.50) = 0.40, only “fair” agreement on the Landis–Koch scale despite 70% raw agreement. The 95% CI is roughly 0.15 to 0.65.
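The calculator output for this table also shows a p-value against κ = 0, but the page does not say how it is computed. One common approach, sketched here as an assumption, is a z-test that divides κ by the standard error under the null hypothesis of chance-only agreement (a formula usually attributed to Fleiss, Cohen & Everitt, 1969). The function name `kappa_z_test` is hypothetical.

```python
import math


def kappa_z_test(matrix):
    """Two-sided z-test of kappa = 0 using a null-hypothesis standard error."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_p = [sum(row) / n for row in matrix]
    col_p = [sum(matrix[i][j] for i in range(k)) / n for j in range(k)]

    p_o = sum(matrix[i][i] for i in range(k)) / n
    p_e = sum(row_p[i] * col_p[i] for i in range(k))
    kappa = (p_o - p_e) / (1 - p_e)

    # Assumed null SE; it is not the same SE that is used for the CI.
    cross = sum(row_p[i] * col_p[i] * (row_p[i] + col_p[i]) for i in range(k))
    se0 = math.sqrt((p_e + p_e**2 - cross) / (n * (1 - p_e) ** 2))

    z = kappa / se0
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value


z, p_value = kappa_z_test([[20, 5], [10, 15]])
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

For the screening table this comes out close to the p = .004 shown in the calculator output, which suggests, but does not confirm, that the calculator uses a similar null standard error.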

When to use it

Use it when

Two raters classify the same items into nominal (unordered) categories.
You need a chance-corrected reliability figure for coding, diagnosis, or labeling.
The categories are mutually exclusive and every item is rated by both raters.

Not the right tool when

More than two raters — use Fleiss' kappa or Krippendorff's alpha.
Ordered categories where near-misses should count partially — use weighted kappa.
You want to test whether paired classifications changed, rather than whether two raters agree: that is McNemar's test.

How to interpret it

Rule of thumb

By the Landis & Koch benchmarks: below 0 poor, 0–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.00 almost perfect. These are conventions, not laws — judge κ against the stakes of your application and report the confidence interval.

Landis–Koch interpretation table

Kappa	Strength of agreement
< 0	Poor
0.00 – 0.20	Slight
0.21 – 0.40	Fair
0.41 – 0.60	Moderate
0.61 – 0.80	Substantial
0.81 – 1.00	Almost perfect
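The table can be applied mechanically with a small lookup. This is a sketch only: the function name `landis_koch_label` is made up for illustration, and treating each band's upper bound as inclusive (so 0.205 counts as “fair”) is an assumption, since the published bands leave gaps between 0.20 and 0.21 and so on.

```python
def landis_koch_label(kappa: float) -> str:
    """Map a kappa value to its Landis-Koch strength-of-agreement label."""
    if kappa < 0:
        return "poor"
    # (inclusive upper bound, label), in ascending order
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


print(landis_koch_label(0.40))  # fair
print(landis_koch_label(0.72))  # substantial
```

A label is only a summary of the point estimate; an interval such as 0.15 to 0.65 spans several bands, which is one reason to report the confidence interval alongside it.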

Need weighted kappa?

For ordered categories, weighted kappa gives partial credit to near-misses (linear or quadratic weights). The free crosstabs MCP server exposes weighted kappa alongside the unweighted version, so Claude and other AI assistants can compute both on your agreement data exactly.
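As a rough illustration of what linear and quadratic weighting mean, here is a sketch of weighted kappa in plain Python. It uses the common disagreement-weight form, with weight (|i − j| / (k − 1)) raised to the power 1 or 2; whether the crosstabs MCP server uses exactly these weights is an assumption, and both the function name `weighted_kappa` and the 3×3 table are hypothetical.

```python
def weighted_kappa(matrix, weights="linear"):
    """Weighted kappa for ordered categories (rows = rater A, columns = rater B)."""
    if weights not in ("linear", "quadratic"):
        raise ValueError("weights must be 'linear' or 'quadratic'")
    power = 1 if weights == "linear" else 2

    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]

    observed = 0.0  # weighted observed disagreement
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power  # 0 on the diagonal, 1 at the corners
            observed += w * matrix[i][j] / n
            expected += w * row_totals[i] * col_totals[j] / n**2
    return 1 - observed / expected


# Hypothetical ratings on a three-level ordered scale (low / medium / high).
ratings = [[10, 3, 0], [2, 10, 3], [0, 2, 10]]
print(weighted_kappa(ratings, "linear"))
print(weighted_kappa(ratings, "quadratic"))
```

With only two categories every disagreement has weight 1, so both weightings reduce to the unweighted κ. Quadratic weights penalize adjacent-category disagreements less than linear weights do.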

Frequently asked questions

What is a good Cohen's kappa value?

By the widely used Landis & Koch (1977) benchmarks: 0.41–0.60 is moderate, 0.61–0.80 substantial, and 0.81–1.00 almost perfect agreement. Many fields treat 0.60 or 0.70 as a minimum for usable coding reliability, but the right threshold depends on the stakes: medical diagnoses demand more than exploratory content coding.

Why is my kappa low even though percent agreement is high?

This is the kappa paradox: when one category dominates, chance agreement pₑ is high, leaving little room above it. With 90% raw agreement but pₑ = 0.85, κ is only 0.33. Kappa is honestly telling you that most of the agreement could come from base rates alone, so report both pₒ and κ.

Cohen's kappa vs Fleiss' kappa vs weighted kappa?

Cohen's kappa handles exactly two raters and nominal categories. Fleiss' kappa generalizes to three or more raters. Weighted kappa is for ordered categories, giving partial credit when raters pick adjacent categories rather than completely different ones.

Can kappa be negative?

Yes. Negative kappa means the raters agree less often than chance would predict, which indicates systematic disagreement. It is rare in practice and usually signals that the raters interpreted the category definitions differently.
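The kappa paradox and negative kappa are both easy to see with small tables. The three matrices below are invented for illustration, and the helper `kappa` is a minimal sketch rather than the calculator's code.

```python
def kappa(matrix):
    """Return (p_o, p_e, kappa) for a square agreement matrix."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    p_o = sum(matrix[i][i] for i in range(k)) / n
    # row total of category i times column total of category i, summed over i
    p_e = sum(sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(k)) / n**2
    return p_o, p_e, (p_o - p_e) / (1 - p_e)


balanced = [[45, 5], [5, 45]]  # 90% agreement, categories evenly used
skewed = [[85, 5], [5, 5]]     # 90% agreement, one category dominates
opposed = [[5, 20], [20, 5]]   # raters mostly pick opposite categories

for name, table in [("balanced", balanced), ("skewed", skewed), ("opposed", opposed)]:
    p_o, p_e, k_value = kappa(table)
    print(f"{name}: p_o = {p_o:.2f}, p_e = {p_e:.2f}, kappa = {k_value:.2f}")
# balanced: p_o = 0.90, p_e = 0.50, kappa = 0.80
# skewed: p_o = 0.90, p_e = 0.82, kappa = 0.44
# opposed: p_o = 0.20, p_e = 0.50, kappa = -0.60
```

The first two tables have identical raw agreement, yet κ drops from 0.80 to 0.44 once one category dominates. The third shows a negative κ produced by raters who systematically choose opposite categories.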

References & further reading

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
Landis, J. R. & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.
Fleiss, J. L., Cohen, J. & Everitt, B. S. (1969). Large sample standard errors of kappa and weighted kappa. Psychological Bulletin, 72(5), 323–327.
Cohen's kappa — Wikipedia
