REDEFINING TSR SCORING WITH AI: INSIGHTS FROM THE DATA SCIENTISTS

Discover how AI is revolutionizing TSR scoring, enhancing clinical decisions, and shaping the future of healthcare.

The ARABESC project is the result of a collaborative partnership between WSK Medical and IMP Diagnostics. Since 2023, the two companies have been working together to develop an AI-driven algorithm for tumor-stroma ratio (TSR) scoring.

In the first edition of our interview series, we spoke with two pathologists about the transformative potential of AI in TSR scoring, its impact on clinical decision-making, and the challenges of AI adoption in healthcare. In this second edition, we sat down with Felix Dikland and Cyrine Fekih — two data scientists with extensive experience applying machine learning to healthcare tools and applications.

In this interview, you can expect a candid discussion on:

• Variability and bias control

• Validation and the evolving scientific paradigm

• Bridging the gap between technology and medicine

• Key considerations for implementing AI in clinical practice

Variability and Bias Control

Interviewer (I) - How would you design an AI pipeline to handle colour variations in H&E-stained slides across different laboratories?

Felix Dikland (FD) - Traditionally, colour variations are handled by preprocessing the image patches, for example with colour normalisation or stain deconvolution. These standardise the model's input, making it more reliable, but not more robust. To achieve robustness, input augmentations should be introduced during training.

Cyrine Fekih (CF) – In fact, to handle colour variations in H&E-stained slides across different laboratories, the AI pipeline should begin with a colour normalisation step that standardises staining variations.
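To make the augmentation idea concrete, here is a minimal sketch of stain-space colour augmentation during training. It is not the ARABESC pipeline: it assumes scikit-image and NumPy are available, and the perturbation ranges are purely illustrative.

```python
import numpy as np
from skimage.color import rgb2hed, hed2rgb

def augment_he_stain(patch_rgb, rng, sigma=0.05, bias=0.02):
    """Randomly perturb the haematoxylin/eosin/DAB channels of an RGB patch."""
    hed = rgb2hed(patch_rgb)                           # RGB -> stain (HED) space
    alpha = rng.uniform(1 - sigma, 1 + sigma, size=3)  # multiplicative jitter per stain
    beta = rng.uniform(-bias, bias, size=3)            # additive shift per stain
    return np.clip(hed2rgb(hed * alpha + beta), 0.0, 1.0)

rng = np.random.default_rng(0)
patch = np.random.rand(256, 256, 3)  # stand-in for a real H&E patch scaled to [0, 1]
augmented = augment_he_stain(patch, rng)
```

Applied on the fly to every training patch, a perturbation like this exposes the model to staining variation it would otherwise only encounter at inference time.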

I - How would you address inter-slide variability in tissue preparation that could impact TSR quantification accuracy?

FD - From a traditional semantic segmentation training point of view, appropriate pre-processing and data augmentation, combined with a large, balanced dataset, are the foundation of a good and robust model. Having a “balanced” dataset in this case also entails a proper distribution of institutions, scanners, tumour subtypes, and acquisition methods, such as surgical specimens, pretreatment biopsies and polypectomies.
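One common way to approximate such balance during training is to oversample under-represented sources. The sketch below is an assumption on our part, not a description of the ARABESC training code; it uses PyTorch's WeightedRandomSampler to give each institution a roughly equal chance of contributing patches per epoch.

```python
from collections import Counter
import torch
from torch.utils.data import WeightedRandomSampler

# Hypothetical metadata: the institution each training patch came from.
institutions = ["site_A", "site_A", "site_A", "site_B", "site_C", "site_C"]

counts = Counter(institutions)
weights = torch.tensor([1.0 / counts[s] for s in institutions], dtype=torch.double)

# Draw patches so that each institution contributes roughly equally per epoch.
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
# loader = DataLoader(patch_dataset, batch_size=32, sampler=sampler)
```

The same idea extends to scanners, tumour subtypes or acquisition methods by weighting on the combined strata.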

I - What quality control measures would you implement for stain normalization across different scanner types?

CF - For quality control of stain normalisation across different scanner types, it is possible to use statistical metrics such as stain vector similarity or colour histogram comparisons to evaluate consistency before and after normalisation. Another approach is to incorporate reference slides or colour calibration targets scanned on each device to standardise outputs. Visual inspection by pathology experts on a sample of slides can also be used to validate the perceived consistency.
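As an illustration of the first idea, a simple histogram-based check could look like the sketch below; the distance measure and any acceptance threshold built on it are assumptions for illustration, not a validated QC criterion.

```python
import numpy as np

def channel_histograms(img_rgb, bins=64):
    """Per-channel normalised colour histograms of an RGB image scaled to [0, 1]."""
    return np.stack([
        np.histogram(img_rgb[..., c], bins=bins, range=(0.0, 1.0), density=True)[0]
        for c in range(3)
    ])

def histogram_distance(img_a, img_b, bins=64):
    """Mean absolute difference between per-channel histograms; lower is more similar."""
    ha, hb = channel_histograms(img_a, bins), channel_histograms(img_b, bins)
    return float(np.mean(np.abs(ha - hb)))

# Compare a normalised thumbnail from a new scanner against a reference image.
reference = np.random.rand(512, 512, 3)   # stand-in for a reference/calibration image
normalised = np.random.rand(512, 512, 3)  # stand-in for a normalised slide thumbnail
print("histogram distance:", histogram_distance(reference, normalised))
```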

I - How would you address potential biases in the algorithm’s performance across diverse patient demographics or tumour subtypes?

CF - In a perfect scenario, the training dataset would cover a wide range of patient demographics - like age, ethnicity, tumour types, and molecular subtypes. But since that's often hard to get in practice, I’d take a few steps to reduce the risk of bias. I’d validate the model on data from different institutions or patient groups, even if the datasets are small, to see how well it generalises. I’d also use model uncertainty to flag predictions it’s less confident about, which could highlight underrepresented cases. Lastly, I’d make sure to clearly report any limitations in the training data and known biases so users are aware of them.
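The uncertainty flagging mentioned here can be as simple as thresholding the entropy of the model's class probabilities. A minimal sketch, assuming per-pixel softmax outputs; the 0.8 threshold is illustrative only.

```python
import numpy as np

def mean_predictive_entropy(probs, eps=1e-12):
    """Shannon entropy of per-pixel class probabilities, averaged over the patch."""
    probs = np.clip(probs, eps, 1.0)
    return float(-(probs * np.log(probs)).sum(axis=-1).mean())

# probs: softmax output of the segmentation model, shape (H, W, n_classes)
probs = np.random.dirichlet([1.0, 1.0, 1.0], size=(256, 256))
flag_for_review = mean_predictive_entropy(probs) > 0.8  # illustrative threshold
```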

I - What strategies would you use to enable the tool’s application to other epithelial cancers (e.g., breast, pancreatic)?

FD - The majority of the development work lies in creating a solid standard for data annotation, data extraction, data augmentation, a training pipeline and a validation pipeline. The basis of these standards can be copied to create new standards for these steps when building a tool for other epithelial cancers. Each clinical site, however, will present unique tissues and thus unique issues.

Validation and the Evolving Scientific Paradigm

I - How would you ensure the algorithm’s TSR cutoff values align with established prognostic thresholds (e.g., 50% stroma)?

FD - From the literature we know that the TSR is often underestimated by human observers. Visually, necrosis, mucin and the area within the lumen should be excluded from the evaluation, but these tissues can mistakenly inflate the total estimated tumour area because of their darker appearance. This causes manual TSR scores to be systematically undervalued. This is crucial knowledge when evaluating the automated score: even if the automated score is a more accurate representation of the TSR, it might not correspond with the clinically validated manual TSR score.
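A hypothetical example of our own illustrates the effect: in a field with 40 mm² of stroma and 40 mm² of tumour epithelium the true TSR is 40 / (40 + 40) = 50%, but if 20 mm² of necrosis is mistakenly counted as tumour, the estimate drops to 40 / (40 + 60) = 40%, pushing a borderline case below the common 50% threshold.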

I - What steps would you take to validate the tool’s performance against manual pathologist assessments (in multicentre studies)?

FD - The greatest hurdle in the validation of an automated TSR score is comparing it to the current gold standard: visual estimation, or “eyeballing”. The automated method is fully deterministic, producing identical TSR scores independent of time and user. The manual method is semi-quantitative: it follows a standardised protocol with quantifiable steps, but leaves plenty of room for subjectivity. When comparing these scores, it is crucial to create custom setups that test every step in the TSR scoring process, to find in what capacity deviations arise from pathologists’ subjectivity and in which cases they arise from AI model error.
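One standard way to quantify such agreement in a multicentre comparison is a Bland-Altman analysis of automated versus manual scores. The sketch below is a generic illustration with made-up numbers, not study data.

```python
import numpy as np

def limits_of_agreement(manual_tsr, automated_tsr):
    """Bland-Altman style bias and 95% limits of agreement for TSR scores (in %)."""
    manual = np.asarray(manual_tsr, dtype=float)
    automated = np.asarray(automated_tsr, dtype=float)
    diff = automated - manual
    bias = diff.mean()
    spread = 1.96 * diff.std(ddof=1)
    return bias, (bias - spread, bias + spread)

manual = [55, 40, 70, 30, 65]     # hypothetical pathologist scores
automated = [60, 38, 74, 35, 66]  # hypothetical algorithm scores
bias, (lo, hi) = limits_of_agreement(manual, automated)
print(f"bias = {bias:.1f}%, 95% limits of agreement = ({lo:.1f}%, {hi:.1f}%)")
```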

I - How would you handle discordance between AI-derived TSR and pathologist assessments in borderline cases?

CF - In cases of discordance between the AI-derived TSR and pathologist assessments, a direct comparison should be made between the tool’s output and the expert evaluation—both in terms of tissue identification and TSR calculation. Analysing these differences can help identify the source of disagreement, whether in segmentation accuracy or threshold interpretation. It is also essential to gather feedback from pathologists on such cases, as this input can be used to refine and retrain the model, improving its performance and reliability in handling complex or borderline scenarios over time.

I - How would you standardize the analysed tissue area size (e.g., 1.0 mm vs. 2.0 mm) to ensure consistent prognostic performance?

CF - To ensure consistency of TSR quantification, we would first implement a quality control algorithm to verify that each slide meets pixel-to-micron calibration standards, ensuring accurate and standardised spatial measurements. For the automated pipeline, we would use a fixed circular region of interest with a diameter selected based on established clinical guidelines and supporting literature. In the manual mode, the tool would allow users to select a circular ROI with a diameter between 1.8 mm and 2.2 mm, providing flexibility while maintaining consistency within a clinically validated range.
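Converting a physical diameter into pixels is where the calibration matters. A minimal sketch, assuming the slide's microns-per-pixel (mpp) value is known from its metadata; the 2.0 mm diameter is only an example.

```python
import numpy as np

def circular_roi_mask(shape, mpp, diameter_mm=2.0, center=None):
    """Boolean mask of a circular ROI with a fixed physical diameter.

    shape: (rows, cols) of the analysed region; mpp: microns per pixel."""
    rows, cols = shape
    cy, cx = center if center is not None else (rows / 2, cols / 2)
    radius_px = (diameter_mm * 1000.0 / mpp) / 2.0  # mm -> microns -> pixels
    yy, xx = np.ogrid[:rows, :cols]
    return (yy - cy) ** 2 + (xx - cx) ** 2 <= radius_px ** 2

# A 2.0 mm circle on a region scanned at 0.5 microns per pixel (4000 px diameter).
mask = circular_roi_mask((4096, 4096), mpp=0.5, diameter_mm=2.0)
print(mask.sum(), "pixels inside the ROI")
```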

I - What user interface features would clinicians/pathologists need to trust and adopt automated TSR scoring?

FD - In essence, the TSR scoring tool is a tissue identifier and segmenter. This means that the underlying mechanism is a pixel-wise classification of the tumour tissue, stroma tissue, and all other tissues identifiable in colorectal carcinoma. Besides supplying the user with a percentage score, it also provides a detailed colourmap as an overlay of the analysed region. The pathologist should judge the reliability of the percentage score by the accuracy of this coloured segmentation map.
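For intuition, the relationship between the label map, the overlay and the percentage score can be sketched as below; the class indices and colour key are our own assumptions, not the tool's actual palette.

```python
import numpy as np

# Hypothetical class key: 0 = background, 1 = tumour epithelium, 2 = stroma, 3 = other.
PALETTE = np.array([
    [0, 0, 0],       # background: black
    [220, 40, 40],   # tumour: red
    [60, 180, 75],   # stroma: green
    [200, 200, 60],  # other tissue: yellow
], dtype=np.uint8)

def overlay_segmentation(patch_rgb, label_map, alpha=0.4):
    """Blend a class colourmap onto the original H&E patch for visual review."""
    colours = PALETTE[label_map]  # (H, W, 3) colour-coded classes
    return ((1 - alpha) * patch_rgb + alpha * colours).astype(np.uint8)

patch = (np.random.rand(256, 256, 3) * 255).astype(np.uint8)  # stand-in H&E patch
labels = np.random.randint(0, 4, size=(256, 256))             # stand-in prediction
overlay = overlay_segmentation(patch, labels)

# The percentage score follows directly from the same label map.
stroma, tumour = (labels == 2).sum(), (labels == 1).sum()
tsr_percent = 100.0 * stroma / max(stroma + tumour, 1)
```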

I - What safeguards would you implement to prevent over-reliance on automated TSR scores in clinical decision-making?

FD - It is important to realise that even if the automated score might objectively be a more accurate quantification of the TSR, the semi-quantitative estimation method used by pathologists is the only method clinically tested for prognostic value. Until the automated score is verified as an independent prognostic indicator, the pathologist should agree with the score produced by the algorithm. Besides that, even if the automated score is clinically validated, the pathologist should be well instructed to verify the accuracy of the segmentation map before using the produced score as a prognostic indicator.

I - How would you quantify the tool’s impact on reducing interobserver variability in stroma-rich vs. stroma-poor classification?

FD - Researchers have devised a measure that relates the agreement between the tool and the observers to the variability of the observers among themselves. This score is called the discrepancy ratio. In short, it relies on the fact that individual observers are closer to the ground truth than to each other, given that the errors of individual observers are random and independent. So, if the mean variability of each observer with respect to the tool is lower than the mean variability of the observers with respect to each other, the discrepancy ratio is larger than 1 and the tool reduces the variability in stroma-rich and stroma-poor classification.
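Following that description, one way to compute such a ratio is sketched below; the exact definition used in the literature may differ, and the scores are made up.

```python
import numpy as np
from itertools import combinations

def discrepancy_ratio(observer_scores, tool_scores):
    """Mean observer-observer deviation divided by mean observer-tool deviation.

    observer_scores: (n_observers, n_cases) TSR scores; tool_scores: (n_cases,).
    A value above 1 means observers agree better with the tool than with each other."""
    obs = np.asarray(observer_scores, dtype=float)
    tool = np.asarray(tool_scores, dtype=float)
    obs_obs = np.mean([np.mean(np.abs(a - b)) for a, b in combinations(obs, 2)])
    obs_tool = np.mean(np.abs(obs - tool))
    return obs_obs / obs_tool

observers = [[55, 40, 70], [60, 35, 65], [50, 45, 75]]  # hypothetical TSR scores (%)
tool = [56, 40, 70]
print(discrepancy_ratio(observers, tool))
```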

Robust validation ends where clinical integration begins. In Part 2 we turn the microscope toward workflow design, multidisciplinary collaboration, and the regulatory runway that ultimately delivers ARABESC to patients’ bedsides.

Will we find you at the European Society of Digital and Integrative Pathology (ESDIP) this year? Contact us to meet the team and look out for our poster. If not, stay tuned for Part 2.
