Inter- and Intra-observer Reliability of Posterior Malleolus Classification Systems

Barry Mullins, Alisdair Felstead, John McFall, Harry Akehurst, Andrew Jowett, Togay Koc

Queen Alexandra Hospital

Posterior malleolar fracture morphology is increasingly being recognised as an important variable in the management of ankle fractures. In this study we compare the interobserver and intraobserver reliability of three different posterior malleolar classification systems.

Forty computed tomography scans demonstrating ankle fractures with posterior malleolar components were reviewed by four reviewers on two separate occasions using the Mason & Molloy, Haraguchi and Bartoníček classification systems. The reviewer group comprised two consultant foot & ankle surgeons, one foot & ankle fellow and one specialist registrar, all familiar with the three classification systems. Interobserver reliability was assessed using the Fleiss kappa (k) coefficient and intraobserver reliability using the mean Cohen’s kappa (k) coefficient. All analyses were performed using R software.
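As an illustration of the two statistics named above, the following is a minimal sketch in Python rather than the R code used in the study, which is not reproduced here. The function names, the category labels and the six-scan example ratings are all hypothetical and serve only to show how the coefficients are computed.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa. `ratings` holds one row per scan, one label per reviewer."""
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(row) for row in ratings]
    # Observed agreement: proportion of agreeing reviewer pairs, averaged over scans.
    p_i = [
        (sum(c * c for c in count.values()) - n_raters) / (n_raters * (n_raters - 1))
        for count in counts
    ]
    p_bar = sum(p_i) / n_subjects
    # Chance agreement from the overall proportion of each category.
    totals = Counter()
    for count in counts:
        totals.update(count)
    p_e = sum((t / (n_subjects * n_raters)) ** 2 for t in totals.values())
    # Undefined (division by zero) if every rating falls in a single category.
    return (p_bar - p_e) / (1 - p_e)


def cohen_kappa(first, second):
    """Cohen's kappa between two sets of labels for the same scans."""
    n = len(first)
    p_o = sum(a == b for a, b in zip(first, second)) / n
    c1, c2 = Counter(first), Counter(second)
    p_e = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / n ** 2
    return (p_o - p_e) / (1 - p_e)


def mean_intraobserver_kappa(first_round, second_round):
    """Mean of each reviewer's Cohen's kappa between their two reading rounds."""
    kappas = [cohen_kappa(a, b) for a, b in zip(first_round, second_round)]
    return sum(kappas) / len(kappas)


# Hypothetical ratings: 4 reviewers x 6 scans, labels are placeholders.
round_1 = [
    ["A", "B", "B", "C", "A", "C"],
    ["A", "B", "C", "C", "A", "C"],
    ["A", "B", "B", "C", "B", "C"],
    ["A", "A", "B", "C", "A", "C"],
]
round_2 = [
    ["A", "B", "B", "C", "A", "B"],
    ["A", "B", "C", "C", "A", "C"],
    ["A", "B", "B", "C", "A", "C"],
    ["A", "B", "B", "C", "A", "C"],
]

# Interobserver: transpose so that each row is one scan rated by all reviewers.
print("Fleiss kappa (round 1):", round(fleiss_kappa(list(zip(*round_1))), 3))
print("Mean Cohen's kappa:", round(mean_intraobserver_kappa(round_1, round_2), 3))
```

In this sketch the interobserver statistic pools all four reviewers within a single reading round, whereas the intraobserver statistic compares each reviewer with themselves across the two rounds and then averages the four values.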

The Fleiss kappa statistic for interobserver reliability was 0.43 (95% CI 0.35 – 0.50) for Bartoníček, 0.65 (0.56 – 0.75) for Haraguchi and 0.63 (0.55 – 0.72) for Mason & Molloy classifications. Mean Cohen’s kappa for intraobserver reliability was 0.66 (range 0.58 – 0.78) for Bartoníček, 0.73 (range 0.63 – 0.84) for Haraguchi and 0.65 (range 0.61 – 0.70) for Mason & Molloy classifications.

The Haraguchi classification had the highest interobserver and intraobserver reliability. Interobserver agreement was ‘substantial’ (0.61 – 0.80) for all classifications except Bartoníček. While the Haraguchi classification is descriptive and has been utilised widely in previous research, the Mason & Molloy classification has prognostic value, which can aid in decision making whilst retaining substantial interobserver reliability. The Bartoníček classification emphasises the importance of syndesmotic incisural involvement and its role in decision making but demonstrated the lowest interobserver reliability.
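The agreement labels used above can be reproduced with a short lookup. This is a sketch that assumes the Landis and Koch bands; the abstract itself quotes only the ‘substantial’ range, so the other cut-offs and the function name are assumptions added for illustration. The kappa values are the interobserver results reported above.

```python
def landis_koch_label(kappa):
    """Map a kappa value to its agreement band (Landis and Koch bands assumed)."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Interobserver Fleiss kappa values as reported in the results.
reported = {"Bartoníček": 0.43, "Haraguchi": 0.65, "Mason & Molloy": 0.63}

for name, kappa in reported.items():
    print(f"{name}: {kappa:.2f} -> {landis_koch_label(kappa)}")
```

Under these assumed bands the Bartoníček value of 0.43 falls in the ‘moderate’ range, while the other two fall in the ‘substantial’ range.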