Percentage Agreement and Inter-Rater Reliability

Without scoring guidelines, ratings become increasingly subject to experimenter influence, i.e., a tendency for ratings to drift toward the evaluator's expectations. For processes that involve repeated measurements, rater drift can be corrected through regular training sessions that ensure evaluators understand the scoring policies and measurement objectives. The field in which you work determines the acceptable level of agreement. For a sports competition, 60% scoring agreement may be enough to determine a winner. If, however, the ratings come from cancer specialists deciding on treatment, a much higher level of agreement is needed, above 90%. In general, agreement above 75% is considered acceptable in most fields.

SPSS and the R irr package require users to specify a one-way or two-way model, an absolute-agreement or consistency type, and single-measures or average-measures units. The design of the hypothetical study informs the correct choice of ICC variant. Note that although SPSS, but not the R irr package, allows a user to specify random or mixed effects, the computation and results are identical for random and mixed effects. In this hypothetical study, all subjects were rated by all coders, so the researcher should use a two-way ICC model, because the design is fully crossed, and average-measures units, because the researcher is interested in the reliability of the mean ratings provided by all coders. The researcher wants to assess the degree to which the coders' ratings were consistent with one another, such that higher ratings from one coder correspond to higher ratings from another coder, but not the degree to which coders matched the absolute values of their ratings, which justifies a consistency-type ICC.
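Under those choices, the ICC can be obtained with the irr package in R. The sketch below is a minimal example on hypothetical data (the data frame, column names, and rating values are made up for illustration); the model, type, and unit arguments mirror the selections described above.

    # Hypothetical empathy ratings from three coders, one row per subject.
    # install.packages("irr")   # if the package is not already installed
    library(irr)

    ratings <- data.frame(
      coder1 = c(5, 7, 4, 6, 8, 5),
      coder2 = c(6, 7, 5, 6, 9, 4),
      coder3 = c(5, 8, 4, 7, 8, 5)
    )

    # Two-way, consistency-type, average-measures ICC: appropriate for a
    # fully crossed design where the mean of all coders' ratings is used.
    icc(ratings, model = "twoway", type = "consistency", unit = "average")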

The coders were not randomly selected, so the researcher is interested in how well the coders agreed on their ratings in this particular study, not in generalizing these ratings to a larger population of coders, which justifies a mixed-effects model. The data presented in Table 5 are in their final form and are not further transformed, so these are the variables on which the IRR analysis should be performed. Agreement between the raters was substantial, κ = 0.75, and higher than would be expected by chance, Z = 3.54, p < 0.05. The resulting ICC is high, ICC = 0.96, indicating excellent IRR for the empathy ratings. From a casual inspection of the data in Table 5, this high ICC is not surprising: disagreements between coders appear small relative to the range of values observed in the study, and there is no apparent restriction of range or gross violation of normality. Reports of these findings should detail the specifics of the chosen ICC variant and provide a qualitative interpretation of what the ICC estimate implies for agreement and statistical power, and the results of this analysis can be written up in that form.

Both percentage agreement and kappa have strengths and limitations. Percentage agreement statistics are easy to calculate and can be interpreted directly.
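A kappa estimate with an accompanying Z statistic and p-value, as reported above, can be obtained with kappa2() from the irr package. The sketch below uses hypothetical nominal codes from two raters; the data frame and values are illustrative only, not the study data behind the κ = 0.75 result.

    library(irr)

    # Hypothetical nominal codes from two raters on eight subjects.
    codes <- data.frame(
      rater1 = c("yes", "no", "yes", "yes", "no", "no", "yes", "no"),
      rater2 = c("yes", "no", "yes", "no",  "no", "no", "yes", "no")
    )

    # Output includes the kappa estimate, a z statistic, and a p-value
    # testing whether agreement exceeds chance.
    kappa2(codes, weight = "unweighted")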

Its main limitation is that it does not account for the possibility that raters guessed some of the scores, so it may overestimate the true agreement between raters. Kappa is designed to account for guessing, but the assumptions it makes about rater independence and other factors are not well supported, so it may unduly lower the agreement estimate. In addition, kappa cannot be interpreted directly, and it has therefore become common for researchers to accept low kappa values in their inter-rater reliability studies. Low inter-rater reliability is unacceptable in health care or clinical research, especially when study results may alter clinical practice in ways that lead to poorer patient outcomes. Perhaps the best advice for researchers is to calculate both percentage agreement and kappa. If much guessing among raters is likely, it may make sense to rely on the kappa statistic, but if the raters are well trained and little guessing is expected, the researcher may safely rely on percentage agreement to determine inter-rater reliability.
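Following that advice, both statistics can be computed on the same rating matrix. A minimal R sketch with the irr package, using hypothetical binary codes from two raters (data frame and values are illustrative only):

    library(irr)

    # Hypothetical binary codes (0/1) from two raters on ten subjects.
    codes <- data.frame(
      rater1 = c(1, 0, 1, 1, 0, 1, 0, 1, 1, 0),
      rater2 = c(1, 0, 1, 0, 0, 1, 0, 1, 1, 1)
    )

    agree(codes)                          # percentage agreement
    kappa2(codes, weight = "unweighted")  # Cohen's kappa, corrected for chance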

Many research designs require an inter-rater reliability (IRR) assessment to demonstrate consistency among observational ratings provided by multiple coders. However, many studies use incorrect statistical procedures, do not fully report the information needed to interpret their results, or do not address how IRR affects the validity of their subsequent hypothesis-testing analyses. This paper provides an overview of methodological issues related to the assessment of IRR, with a focus on study design, selection of appropriate statistics, and the computation, interpretation, and reporting of some commonly used IRR statistics. Computational examples include SPSS and R syntax for computing Cohen's kappa and intraclass correlations to assess IRR.

Computing percentage agreement is a simple procedure when the values are only zero and one and there are two data collectors. With more data collectors the procedure is a little more complex (Table 2), but as long as the scores are limited to two values, the calculation remains easy: the researcher simply computes the percentage agreement for each row and averages across rows, as in the sketch below. Another advantage of the matrix is that it lets the researcher see whether errors are random, and therefore fairly evenly distributed across all raters and variables, or whether a particular data collector frequently records values that differ from those of the other data collectors. Table 2, which shows an overall inter-rater reliability of 90%, indicates that no data collector had an excessive number of outliers (values inconsistent with the majority of the raters' ratings). A further advantage of this technique is that it allows the researcher to identify variables that may be problematic. Note that Table 2 shows the raters achieved only 60% agreement on variable 10. This variable may warrant review to identify the cause of such poor agreement.
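The matrix procedure can be sketched in a few lines of R. Here each row is scored as the proportion of rater pairs that agree, which is one common convention; other definitions (e.g., agreement with the majority value) are also used, and Table 2 may follow a different one. The data are hypothetical.

    # Rows are variables (or subjects), columns are data collectors, scores are 0/1.
    ratings <- rbind(
      c(1, 1, 1, 1),
      c(0, 0, 1, 0),
      c(1, 1, 1, 1),
      c(0, 1, 0, 0),
      c(1, 1, 1, 1)
    )

    row_agreement <- apply(ratings, 1, function(scores) {
      pairs <- combn(scores, 2)                 # every pair of raters
      mean(pairs[1, ] == pairs[2, ]) * 100      # % of pairs that agree
    })

    row_agreement        # per-variable agreement, as in Table 2
    mean(row_agreement)  # overall percentage agreement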

An example of the calculated kappa statistic is shown in Figure 3. Note that the percentage agreement is 0.94, while the kappa is 0.85, a considerable reduction in the estimated level of agreement. The greater the expected chance agreement, the lower the resulting kappa.
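That relationship follows from Cohen's formula, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The short R sketch below reproduces the drop from 0.94 to 0.85 under an assumed chance agreement of 0.60; Figure 3 itself is not reproduced here, so this p_e is illustrative only.

    # Illustration of how chance agreement pulls kappa below raw agreement.
    p_o <- 0.94  # observed percentage agreement
    p_e <- 0.60  # hypothetical chance-expected agreement (assumed, not from Figure 3)
    kappa <- (p_o - p_e) / (1 - p_e)
    kappa  # 0.85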