Aims: (i) Determine whether between-observer reproducibility of Ki67 scoring on whole sections, performed according to a standardized scoring protocol, is adequate for clinical application. (ii) Compare the between-observer reproducibility of Ki67 scores assessed on hot-spots with that of scores from a global method that averages across a tissue section.
Background: The nuclear proliferation biomarker Ki67 has multiple potential roles in breast cancer, including aiding decisions based on prognosis, but unacceptable levels of between-laboratory variability have been observed. The International Ki67 in Breast Cancer Working Group has undertaken a systematic program to determine whether Ki67 measurement can be analytically validated and standardized across labs. In phase 1, variability in visual interpretation was identified as an important source of variability. Phases 2 and 3a showed that adherence to defined scoring methods substantially improved reproducibility in scoring tissue microarrays and core-cut biopsies. We now assess whether acceptable reproducibility can be achieved on whole sections.
Methods: Adjacent sections from 30 primary ER+ breast cancers were centrally stained for Ki67 to assemble 4 sets of 30 stained tumor sections, which were circulated among 23 labs in 12 countries. All labs scored Ki67 by 2 methods: (a) global: 4 fields of 100 tumor cells each were selected to reflect the observed heterogeneity in nuclear staining; (b) hot-spot: the field with the highest percentage of tumor cells with nuclear Ki67 staining was selected and up to 500 cells were scored. Ki67 scores were log2-transformed for statistical analyses and back-transformed for presentation. The primary objective was to assess whether either method could achieve an intraclass correlation coefficient (ICC) significantly greater than 0.8, a level considered to indicate substantial to almost-perfect reproducibility. Secondary objectives were to assess which method had the higher observed ICC and whether observers identified the same "hot-spots".
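The arithmetic behind these scores can be illustrated with a minimal Python sketch. It is not the study's analysis code: the function names are hypothetical, the global score is assumed to be an unweighted mean of the per-field percentages, and the ICC shown is a simple one-way random-effects estimate chosen for illustration, since the abstract does not state which ICC model or interval method was used. All scores are assumed to be strictly positive so that the log2 transform is defined.

```python
import math

def ki67_percent(positive, total):
    """Percentage of counted tumor cells showing nuclear Ki67 staining."""
    return 100.0 * positive / total

def global_score(fields):
    """Global score from (positive, total) counts per field.
    Assumption: unweighted mean of the per-field percentages."""
    return sum(ki67_percent(p, t) for p, t in fields) / len(fields)

def geometric_mean(scores):
    """Mean on the log2 scale, back-transformed to the percentage scale.
    Assumes every score is > 0."""
    logs = [math.log2(s) for s in scores]
    return 2 ** (sum(logs) / len(logs))

def icc_oneway(table):
    """One-way random-effects ICC (illustrative choice, not the study's model).
    Rows are tumors, columns are observers; values are log2 scores."""
    n, k = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n * k)
    row_means = [sum(row) / k for row in table]
    # Between-tumor and within-tumor mean squares
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    msw = sum((x - m) ** 2
              for row, m in zip(table, row_means)
              for x in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

if __name__ == "__main__":
    # Made-up counts for one case: 4 fields of 100 tumor cells each.
    fields = [(10, 100), (20, 100), (30, 100), (40, 100)]
    print("global score (%):", global_score(fields))

    # Made-up scores (%): 3 tumors scored by 3 observers.
    raw = [[8.0, 10.0, 9.0], [20.0, 24.0, 22.0], [40.0, 35.0, 44.0]]
    log_scores = [[math.log2(x) for x in row] for row in raw]
    print("ICC on log2 scale:", icc_oneway(log_scores))
    print("geometric mean, observer 1 (%):",
          geometric_mean([row[0] for row in raw]))
```

Working on the log2 scale means that a doubling of the score counts equally at low and high Ki67 values, and the back-transformed mean is the geometric mean reported per lab.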
Results: The ICC for the global method was 0.87 (95% CI: 0.799-0.93), marginally meeting the prespecified success criterion. The ICC for the hot-spot method was 0.83 (95% CI: 0.74-0.90), with a CI extending below the success criterion. Across the 23 labs, the geometric mean of the 30 scores ranged from 8.5 to 19.6 for the global method and from 12.8 to 30.3 for the hot-spot method. The overall mean (95% CI) of these values was 12.9 (11.9-14.0) and 20.9 (19.1-22.8), respectively. On visual inspection, between-laboratory agreement in the location of the selected hot-spot varied between cases. The median scoring times were 9 and 6 minutes for the global and hot-spot methods, respectively.
Conclusions: The global method marginally met the prespecified criterion of success; it should now be evaluated for clinical validity in appropriate cohorts of cases. The hot-spot method showed slightly lower reproducibility between labs. The time taken for scoring by either method is practical when using the counting software that we are making publicly available. Establishing external quality assessment schemes is likely to further improve reproducibility between labs.