
Differential cell counts using center-point networks achieves human-level accuracy and efficiency over segmentation

Cytospin preparation and imaging

Cytospins were obtained from archives of previously approved studies of respiratory disease. Briefly, bronchoalveolar lavage fluid was collected from children with cystic fibrosis (CF) as part of the Australian Respiratory Early Surveillance Team for Cystic Fibrosis (AREST CF) program36, a study approved by the Child and Adolescent Health Service Ethics Committee. Induced sputum from adults with asthma was collected as part of clinical trials approved by the Hunter New England Human Research Ethics Committee. In both studies, cytospins were prepared as part of the research protocol by centrifuging the cellular fraction of the airway specimen (approximately $$4 \times 10^{4}$$ cells) onto glass pathology slides. Cells were stained with Kwik-Diff™ (ThermoFisher Scientific, Australia) or May-Grünwald Giemsa stain according to standard practices. To generate maximal-resolution images for the purposes of this study, cytospins were digitised at $$\times 100$$ oil magnification using a ScanScope OS (Leica Biosystems, Australia) through the Centre for Microscopy, Characterisation & Analysis, The University of Western Australia. Images were saved in the standard SVS container file format. Each sample contained approximately ten thousand $$1024 \times 1024$$ pixel tiles, and each file was up to 2.5 GB in size. We utilised 40 tiles ($$1024 \times 1024$$ pixels, without overlapping boundaries) from each of the 19 digitised cytospins for the annotation assessment and training of the detection networks.
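As an illustration of the tiling step, the origins of non-overlapping $$1024 \times 1024$$ crops that fit fully inside a slide can be enumerated as follows. This is a minimal sketch: the function name tile_origins and the example slide dimensions are ours for illustration, not part of the study's pipeline.

```python
def tile_origins(width, height, tile=1024):
    """Top-left (x, y) coordinates of non-overlapping tile x tile crops
    that fit fully inside a slide of the given pixel dimensions."""
    return [(x, y)
            for y in range(0, height - tile + 1, tile)
            for x in range(0, width - tile + 1, tile)]
```

Partial tiles at the right and bottom edges are simply discarded, which matches the requirement that tiles have no overlapping boundaries.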

Ground truth annotation

All assessors annotated the centre points of cells for the full 760-image dataset using the LabelBox37 platform. To minimise assessor fatigue and assist with interperson variability testing (IVT), labelling was conducted in four rounds of annotation, each round containing at least 160 images sourced from four randomly selected cytospins. In LabelBox, the $$1024 \times 1024$$ image tiles were presented in random order to assessors, who labelled any identifiable cellular objects as belonging to one of four immune cell classes: macrophage lineage, neutrophil, eosinophil and lymphocyte. We grouped macrophages and monocytes into a single macrophage lineage class, as is often done in cytopathology practice owing to their overlapping morphological features. Since digital image boundaries are fixed, unlike microscope visualisation during manual counts, assessors were asked not to label a cell if its nucleus was not sufficiently visible on the tile. Once completed, annotations from all four assessors were collected, overlaid, and an automated query was performed to consolidate duplicated class annotations within a 10 pixel radius. Where duplicated annotations disagreed on classification, a majority-wins approach was taken. Finally, our “ground truth” dataset was finalised by a single assessor (“Assessor 1”), who reviewed all 760 annotated images to correct any remaining duplicate or conflicting annotations not addressed by the automated cleanup. This assessor also ensured all point annotations were located at the centre of the complete cell object, not the centre of the nucleus (Fig. 2A). This “expert-in-the-loop” annotation process avoided missing labels and provided consensus annotations when disagreement occurred. To establish the baseline human performance level that DCNet must achieve, agreement between the classifications of our individual assessors at the cellular-object level was compared through ICC analysis.
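The automated consolidation step can be sketched in plain Python. The helper name consolidate_annotations is hypothetical, and merging each point into the first existing cluster whose centroid lies within the 10 pixel radius is our reading of the duplicate query; the majority-wins vote on class follows the text.

```python
import math
from collections import Counter

def consolidate_annotations(points, radius=10):
    """Merge point annotations pooled from multiple assessors.

    points: list of (x, y, cls) tuples. A point joins the first cluster
    whose centroid lies within `radius` pixels; otherwise it starts a
    new cluster. Each cluster is reduced to its mean location and the
    majority-vote class."""
    clusters = []
    for x, y, cls in points:
        for cluster in clusters:
            cx = sum(p[0] for p in cluster) / len(cluster)
            cy = sum(p[1] for p in cluster) / len(cluster)
            if math.hypot(x - cx, y - cy) <= radius:
                cluster.append((x, y, cls))
                break
        else:
            clusters.append([(x, y, cls)])
    merged = []
    for cluster in clusters:
        cx = sum(p[0] for p in cluster) / len(cluster)
        cy = sum(p[1] for p in cluster) / len(cluster)
        majority = Counter(p[2] for p in cluster).most_common(1)[0][0]
        merged.append((cx, cy, majority))
    return merged
```

Ties and residual conflicts that this vote cannot resolve are exactly what the expert-in-the-loop review by Assessor 1 addressed.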

DCNet

The DCNet approach utilises a fully convolutional encoder–decoder network to predict a centre point heatmap for each type of immune cell. Our model is end-to-end differentiable and trained using a weighted pixel-wise logistic regression with focal loss. At inference time, we extract the centre clusters from each corresponding class mask and calculate average precision (AP) based on the Euclidean distance between predicted and ground truth centre points.

To overcome the lack of a defined object size in the annotation, which is a limitation of the centre point annotation method versus traditional bounding box or segmentation approaches, an approximation of the relative size differences between the classes was established. To generate our target heatmaps in Fig. 5, we create a set of masks $$Y \in [0, 1]^{W \times H \times C}$$, where C corresponds to the number of cell types and $$W \times H$$ to the input resolution. A selection of 24 image tiles from across the 19 cytospin samples, chosen based upon their lymphocyte or eosinophil counts, was analysed and the cell diameters measured in pixel units, as presented in Fig. 2C,D. Having determined the mean radius of each cell class, we plotted each ground truth cell point annotation onto the mask using a 2D Gaussian kernel (1), placed at the cell centre and with the variance $$\sigma _C$$ determined by the average radius for each cell class, C.

\begin{aligned} K_{\sigma _C}(x,y) = e^{-\frac{x^{2}+y^{2}}{2 \sigma _C^{2}}} \end{aligned}.

(1)
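Rendering the target mask for one class from Eq. (1) can be sketched in plain Python. The helper name render_heatmap is ours, and combining overlapping kernels with an element-wise maximum (so every centre keeps a peak value of 1) is a common convention that we assume here rather than take from the text.

```python
import math

def render_heatmap(centres, sigma, size):
    """Render one class's target heatmap as a size x size grid.

    centres: list of (x, y) ground-truth cell centres for the class.
    sigma:   class-specific spread, set from the mean cell radius.
    Overlapping Gaussian kernels are merged with an element-wise max."""
    heatmap = [[0.0] * size for _ in range(size)]
    for cx, cy in centres:
        for y in range(size):
            for x in range(size):
                # Eq. (1): unnormalised 2D Gaussian centred on the cell
                k = math.exp(-((x - cx) ** 2 + (y - cy) ** 2)
                             / (2 * sigma ** 2))
                heatmap[y][x] = max(heatmap[y][x], k)
    return heatmap
```

Stacking one such heatmap per cell class yields the $$W \times H \times C$$ target tensor described above.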

Architecture

We used variations of the U-Net architecture to predict centre point heatmaps. U-Net, perhaps the most widely adopted architecture in the biomedical field since its introduction in 20154, consists of a fully convolutional encoder–decoder network with skip connections. Our base model, which we refer to as ‘DCNet’, uses ResNet-34 as the encoder backbone (Fig. 6). The decoder consists of 5 upsampling blocks, each comprising a pixel shuffle upsampling layer and two $$3\times 3$$ convolutional layers, each followed by batch normalization. Specific model details are outlined in the fast.ai library38, which is developed at the Data Institute, University of San Francisco. Two variants of DCNet were also tested. CenterNet-HG uses an Hourglass-104 network, most commonly used in keypoint detection networks17,18; it consists of 2 downsampling layers followed by 2 stacked hourglass networks. In addition, we adapted the base DCNet model to use a semantic segmentation approach, which we refer to as DCNet-CE. Instead of heatmaps, DCNet-CE uses a weighted categorical cross-entropy loss to predict segmentation masks with object centres. This approach utilised the same image preparation and loss function described by Falk et al.6 and Sirinukunwattana et al.9.
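The pixel shuffle layer in the decoder is a pure rearrangement of channels into spatial resolution: a $$(Cr^{2}, H, W)$$ tensor becomes $$(C, Hr, Wr)$$. A minimal sketch of the operation, following the channel ordering used by common deep learning frameworks (the nested-list representation is ours for illustration):

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Channel index c*r*r + dy*r + dx supplies the output pixel at
    (row*r + dy, col*r + dx) of output channel c."""
    cr2, h, w = len(x), len(x[0]), len(x[0][0])
    c = cr2 // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c)]
    for ch in range(c):
        for dy in range(r):
            for dx in range(r):
                src = x[ch * r * r + dy * r + dx]
                for row in range(h):
                    for col in range(w):
                        out[ch][row * r + dy][col * r + dx] = src[row][col]
    return out
```

Because the layer learns nothing itself, the preceding convolution produces the $$r^{2}$$ extra channels, making this a cheap alternative to transposed convolution for upsampling.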

Loss function

We implemented a penalty-reduced logistic regression with focal loss as described by Zhou et al.18. A target $${Y}_{x y c} = 1$$ corresponds to the heatmap centre point (x, y), while $${Y}_{x y c} = 0$$ corresponds to the background, with a weighted negative loss based on the Gaussian distance (1) to each centre point. We set $$\alpha$$ with reference to the number of objects and $$\beta$$ to 2 in all of our experiments.

$$L = -\frac{1}{N}\sum\limits_{xyc} \begin{cases} \left(1 - \hat{Y}_{xyc}\right)^{\alpha} \log\left(\hat{Y}_{xyc}\right) & \text{if } Y_{xyc} = 1 \\ \left(1 - Y_{xyc}\right)^{\beta} \left(\hat{Y}_{xyc}\right)^{\alpha} \log\left(1 - \hat{Y}_{xyc}\right) & \text{otherwise} \end{cases}$$

(2)
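Equation (2) can be sketched in plain Python over flattened heatmaps. The helper name and the default exponent values are placeholders for illustration: the paper sets $$\beta = 2$$ and chooses $$\alpha$$ with reference to the number of objects, so neither default below should be read as the study's setting.

```python
import math

def penalty_reduced_focal_loss(pred, target, alpha=2.0, beta=2.0):
    """Pixel-wise penalty-reduced logistic focal loss, as in Eq. (2).

    pred:   flat list of predicted probabilities \\hat{Y}_{xyc}.
    target: flat list of Gaussian-smoothed targets Y_{xyc}.
    N is the number of centre points (pixels where Y == 1); negatives
    near a centre are down-weighted by (1 - Y)^beta."""
    n = sum(1 for y in target if y == 1.0)
    loss = 0.0
    for y_hat, y in zip(pred, target):
        if y == 1.0:
            loss += (1 - y_hat) ** alpha * math.log(y_hat)
        else:
            loss += (1 - y) ** beta * y_hat ** alpha * math.log(1 - y_hat)
    return -loss / max(n, 1)
```

The $$(1 - Y_{xyc})^{\beta}$$ factor is the "penalty reduction": pixels inside a Gaussian bump but off-centre are penalised less for confident positive predictions than true background pixels.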

Training

Prior to training DCNet, two cytospins (10.5% of annotated images) were set aside as the validation dataset. With the remaining images, we trained on a reduced input resolution of $$256 \times 256$$ to predict C heatmaps. Data augmentation included random rotation, warp, colour jittering, and horizontal/vertical flips. We used a fixed weight decay of $$10^{-5}$$ and a context-specific optimal learning rate chosen using the learning rate finder39. Unless specified otherwise, hyper-parameters were consistent across all experiments, and models were trained for 60 epochs on an NVIDIA V100 GPU with 8 GB of RAM.

Annotated image data and DCNet code are available at the following GitHub repository: https://github.com/slee5777/DCNet.

Metrics

Conventionally, average precision (AP) is used to evaluate performance on segmentation problems by calculating the intersection over union (“IoU”) across several thresholds to find true positives. The Dice coefficient is very similar to IoU, and the two are positively correlated40. Since our output predictions are (x, y) coordinates, we instead measured accuracy using Percentage of Correct Parts (PCP), a modified form of the method first introduced in pose estimation networks41. At inference time, we extracted the peaks for each predicted class heatmap and selected any clusters with a maximum cluster area of 16 pixels and a prediction score greater than 0.5. For each cluster, we chose the point with the highest prediction score as the object centre point.
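The peak-extraction step above can be sketched with a simple flood fill. The helper name extract_centres and the use of 4-connectivity when grouping pixels are our assumptions; the 0.5 score threshold and the 16-pixel cluster-area limit follow the text.

```python
from collections import deque

def extract_centres(heatmap, score_thresh=0.5, max_area=16):
    """Extract predicted centre points from one class heatmap.

    Pixels above score_thresh are grouped into 4-connected clusters;
    clusters larger than max_area are discarded, and the highest-scoring
    pixel of each remaining cluster becomes the object centre."""
    h, w = len(heatmap), len(heatmap[0])
    seen = [[False] * w for _ in range(h)]
    centres = []
    for y in range(h):
        for x in range(w):
            if seen[y][x] or heatmap[y][x] <= score_thresh:
                continue
            cluster, queue = [], deque([(x, y)])
            seen[y][x] = True
            while queue:  # flood-fill one cluster
                cx, cy = queue.popleft()
                cluster.append((cx, cy))
                for nx, ny in ((cx + 1, cy), (cx - 1, cy),
                               (cx, cy + 1), (cx, cy - 1)):
                    if 0 <= nx < w and 0 <= ny < h and not seen[ny][nx] \
                            and heatmap[ny][nx] > score_thresh:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            if len(cluster) <= max_area:
                centres.append(max(cluster, key=lambda p: heatmap[p[1]][p[0]]))
    return centres
```

In practice a connected-components routine such as scipy.ndimage.label would do the grouping; the explicit flood fill keeps the sketch dependency-free.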

Given a list of predicted object centres $${\hat{Y}}_{{x y}}^{(C)}$$ for each class, we calculated true positives by comparing each point to the ground truth points $${Y}_{{x y}}^{(C)}$$. A predicted point is counted as a true positive if it lies within a certain distance threshold $$\delta$$ of an unmatched ground truth centre point of the same class. Next, for each set of true positive predictions, we reported the AP for each class and compared it against our assessors.

The distance threshold was calculated by multiplying the average size $$\sigma$$ of the cell class by a range of percentage threshold values $$\delta \in \{0.1, 0.25, 0.50\}$$. We evaluated our model across these thresholds $$\delta$$ and reported the AP. A $$\delta = 1$$ would mean the predicted centre point lies within one cell diameter ($$\sigma$$) of the ground truth. The AP was also calculated for the individual assessors by comparing their specific annotations to the final curated ground truth.
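The true-positive matching rule can be sketched as a greedy pass over the predictions. The helper name match_points is hypothetical, and taking predictions in list order (rather than, say, by descending score) is our assumption; the one-match-per-ground-truth rule follows the text.

```python
import math

def match_points(pred, gt, delta):
    """Greedily match predicted to ground-truth centres of one class.

    A prediction is a true positive if it lies within `delta` pixels of
    a still-unmatched ground-truth point; each ground-truth point may
    be claimed at most once. Returns (tp, fp, fn)."""
    unmatched = list(gt)
    tp = 0
    for px, py in pred:
        for i, (gx, gy) in enumerate(unmatched):
            if math.hypot(px - gx, py - gy) <= delta:
                unmatched.pop(i)  # claim this ground-truth point
                tp += 1
                break
    return tp, len(pred) - tp, len(unmatched)
```

Precision for the class then follows as tp / (tp + fp), computed per threshold $$\delta$$ before averaging.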

Statistical analysis

Analyses were performed in R42 or GraphPad Prism version 9.0.1 for Windows (GraphPad Software, San Diego, California, USA, www.graphpad.com). Data were subjected to normality testing by histogram plotting and the D’Agostino–Pearson omnibus K2 test. Untransformed data were analysed using parametric or non-parametric analyses as indicated through the text. $$\textit{p} < 0.05$$ was considered statistically significant. Absolute-agreement ICC in the presence of bias (named ICC(A,1)43) was calculated with the ‘irr’ package in R using the function call icc(Ma, model=“twoway”, type=“agreement”).
