This work proposes a new end-to-end AI architecture for bone marrow aspirate NDC based on machine learning and deep learning algorithms (Fig. 1).

### Dataset

This study was approved by the Hamilton Integrated Research Ethics Board (HiREB), study protocol 7766-C. As the study was retrospective, it was approved with a waiver of patient consent. Digital whole slide images (WSI) were acquired retrospectively, de-identified, and annotated with only a diagnosis, spanning a 1-year period and 1247 patients. This starting dataset represented the complete breadth of diagnoses over this period in a major hematology reference center. WSI were then sampled from this dataset for model development and validation as described in Table 1 and Supplementary Table S3. Images were scanned with either an Aperio Scanscope AT Turbo or a Huron TissueScope at 40X magnification and acquired in SVS and TIFF file formats.

### Data annotation and augmentation strategy

ROI tiles and the individual bone marrow cell types they contain were annotated by expert hematopathologists as the ground truth (reference standard) in the WSI used for model training and test-validation, as described below. This follows ICSH guidelines, in which expert pathologists are considered the reference standard for bone marrow aspirate cytology in clinical diagnosis^{2}. Data augmentation was applied to increase the diversity of the input images. Pixel-wise augmentation generally falls into two categories: photometric distortion, which includes hue, contrast, brightness, and saturation adjustment as well as added noise; and geometric distortion, which includes flipping, rotating, cropping, and scaling. Because the dataset for the ROI detection model had an imbalanced class distribution (70,250 inappropriate and 4,750 appropriate tiles), an over-sampling or under-sampling method was required to prevent misclassification. To address this, several of the above augmentation techniques were applied to the training data during learning to over-sample the appropriate ROI tiles and train the model correctly. After augmentation, the dataset in this phase contained 98,750 annotated images for training, comprising 70,250 inappropriate and 28,500 appropriate ROI tiles (Supplementary Table S1). For the cell detection and classification model, after the objects inside ROI tiles were annotated using the LabelImg tool (Supplementary Fig. S1)^{38}, additional techniques were applied beyond the above augmentation categories, such as CutMix^{39}, which mixes 2 input images, and Mosaic, which mixes 4 different training images.
Accordingly, after augmentation, the dataset in this phase contained 1,178,408 annotated cells for training: 119,416 neutrophils, 44,748 metamyelocytes, 52,756 myelocytes, 17,996 promyelocytes, 173,800 blasts, 117,392 erythroblasts, 1,012 megakaryocyte nuclei, 57,420 lymphocytes, 25,036 monocytes, 7,744 plasma cells, 10,956 eosinophils, 308 basophils, 4,664 megakaryocytes, 246,532 debris, 8,404 histiocytes, 1,452 mast cells, 174,724 platelets, 25,740 platelet clumps, and 88,308 Other cell types (Supplementary Table S2). To preserve generalization, augmentation was applied only to the training set in each fold of the cross-validation.
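The over-sampling strategy described above can be sketched as follows. This is a minimal NumPy illustration: the jitter ranges, noise level, and toy 8 × 8 tiles are assumptions, not the paper's actual augmentation parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def photometric(img):
    """Random brightness/contrast jitter plus Gaussian noise (values in [0, 1])."""
    gain = rng.uniform(0.8, 1.2)           # contrast (assumed range)
    bias = rng.uniform(-0.1, 0.1)          # brightness (assumed range)
    noise = rng.normal(0.0, 0.01, img.shape)
    return np.clip(gain * img + bias + noise, 0.0, 1.0)

def geometric(img):
    """Random horizontal flip and 90-degree rotation."""
    if rng.random() < 0.5:
        img = np.flip(img, axis=1)
    return np.rot90(img, k=rng.integers(0, 4), axes=(0, 1))

def oversample(minority_tiles, target_count):
    """Grow the minority class to target_count with randomized augmented copies."""
    out = list(minority_tiles)
    while len(out) < target_count:
        src = minority_tiles[rng.integers(0, len(minority_tiles))]
        out.append(geometric(photometric(src)))
    return out

# e.g. the 4,750 appropriate tiles were over-sampled toward 28,500 (6x);
# here, toy stand-ins for 512x512 tiles
tiles = [rng.random((8, 8, 3)) for _ in range(5)]
augmented = oversample(tiles, 30)
```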

### Region of interest (ROI) detection method

The first phase in the proposed architecture is ROI detection, which extracts tiles from a WSI and determines whether each tile is suitable for diagnostic cytology. To accomplish this, a deep neural network was built, fine-tuned, and evaluated on aspirate digital WSI tiles. Initially, 98,750 high-resolution tiles of 512 × 512 pixels (including augmented data) were extracted from 250 WSI. To choose the tiles, a grid of 15 rows and 20 columns was overlaid on each digital WSI and a tile was selected from the center of each grid cell, ensuring that tiles were sampled evenly across the WSI. Appropriate and inappropriate ROI tiles were annotated by an expert hematopathologist: appropriate ROI tiles needed to be well spread, thin, and free of red cell agglutination, overstaining, and debris, and to contain at least one segmentable cell or non-cellular object as outlined above. A deep neural network based on the DenseNet121 architecture^{39} was then fine-tuned to extract features from each tile, with a binary classifier added as the last layer to separate appropriate from inappropriate tiles. The network was trained with a cross-entropy loss function and the AdamW optimizer (learning rate 1e-4, weight decay 5.0e-4). A pretrained DenseNet121 was used to initialize all network weights prior to fine-tuning, and the entire network was fine-tuned for 20 epochs with a batch size of 32.

We applied patient-level 5-fold cross-validation to train and test the model. In each fold, the dataset (98,750 tiles) was split into two main partitions, training and test-validation, comprising 80% (204 WSIs, 80,250 tiles) and 20% (46 WSIs, 18,500 tiles), respectively. The test-validation partition was further split into 70% validation and 30% test. To ensure sufficient data for each class, these split ratios were enforced on appropriate and inappropriate tiles separately. The split into training, validation, and test sets was performed at the patient level, so that no patient's WSI appeared in more than one set, preventing data leakage. In each fold, the best model was selected on the validation partition after training and then evaluated on unseen patients in the test set. The primary aim of the ROI detection model was to extract ROI tiles for further processing by the cell detection and classification model; to this end, the model should minimize false positives, so precision was considered the key performance metric for model selection.
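The patient-level split can be sketched as follows. This is illustrative only: the patient identifiers and toy tile counts are assumptions, and the 80/20 and 70/30 ratios mirror those stated above.

```python
import random

def patient_level_split(tiles_by_patient, seed=0):
    """Split WSIs 80/20 into train vs test-validation at the patient level,
    then split test-validation 70/30 into validation vs test, so that no
    patient's tiles appear in more than one partition."""
    patients = sorted(tiles_by_patient)
    random.Random(seed).shuffle(patients)
    n_train = round(0.8 * len(patients))
    train_p, held_p = patients[:n_train], patients[n_train:]
    n_val = round(0.7 * len(held_p))
    val_p, test_p = held_p[:n_val], held_p[n_val:]
    gather = lambda ps: [t for p in ps for t in tiles_by_patient[p]]
    return gather(train_p), gather(val_p), gather(test_p)

# toy example: 10 patients, 4 tiles each
data = {f"patient{i}": [f"patient{i}_tile{j}" for j in range(4)] for i in range(10)}
train, val, test = patient_level_split(data)
```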

### Cell detection and classification

The next phase was cell detection and classification, applied to high-resolution ROI tiles of 512 × 512 pixels. To accomplish this, the YOLOv4 model was customized, trained, and evaluated to predict bounding boxes of bone marrow cellular objects (white blood cells) and non-cellular objects inside the input ROI tile and classify them into 19 different classes. In this architecture, CSPDarknet53^{40} was used as the backbone of the network to extract features, SPP^{41} and PAN^{42} were used as the neck of the network to enhance feature expressiveness and robustness, and YOLOv3^{43} was used as the head. As bag of specials (BoS) for the backbone, the Mish activation function^{44}, cross-stage partial connection (CSP), and multi-input weighted residual connection (MiWRC) were used. For the detector, the Mish activation function, SPP-block, SAM-block, PAN path-aggregation block, and DIoU-NMS^{45} were used. As bag of freebies (BoF) for the backbone, CutMix and Mosaic data augmentations, DropBlock regularization^{46}, and class label smoothing were used. For the detector, complete IoU loss (CIoU-loss)^{45}, cross mini-Batch Normalization (CmBN), DropBlock regularization, Mosaic data augmentation, self-adversarial training, grid sensitivity elimination, multiple anchors for a single ground truth, a cosine annealing scheduler^{47}, optimal hyperparameters, and random training shapes were used.
In addition, the hyperparameters for bone marrow cell detection and classification were set as follows: max batches, 130,000; training steps, 104,000 and 117,000; batch size, 64 with subdivision 16; a polynomial-decay learning rate schedule with an initial learning rate of 0.001; momentum and weight decay, 0.949 and 0.0005, respectively; warmup steps, 1,000; network size, 512 in both height and width; and anchor sizes of (13, 14), (19, 18), (29, 30), (19, 64), (62, 20), (41, 39), (35, 59), (50, 49), (74, 35), (56, 62), (68, 53), (46, 87), (70, 70), (95, 65), (79, 85), (101, 95), (87, 129), (139, 121), and (216, 223).
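For reference, the stated hyperparameters can be collected into a single structure. The key names below are illustrative, not darknet `.cfg` syntax, and the (width, height) pairing of the anchor list is an assumption based on common YOLO conventions.

```python
# Hypothetical consolidation of the stated YOLOv4 training hyperparameters.
hyperparams = {
    "max_batches": 130_000,
    "steps": (104_000, 117_000),   # LR decay steps
    "batch_size": 64,
    "subdivisions": 16,
    "lr_schedule": "polynomial",
    "initial_lr": 0.001,
    "momentum": 0.949,
    "weight_decay": 0.0005,
    "warmup_steps": 1_000,
    "network_size": (512, 512),    # height, width
}

# anchor sizes grouped as (width, height) pairs (pairing assumed)
flat = [13, 14, 19, 18, 29, 30, 19, 64, 62, 20, 41, 39, 35, 59, 50, 49, 74, 35,
        56, 62, 68, 53, 46, 87, 70, 70, 95, 65, 79, 85, 101, 95, 87, 129,
        139, 121, 216, 223]
anchors = list(zip(flat[0::2], flat[1::2]))
```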

Similar to the ROI detection method above, patient-level 5-fold cross-validation was applied to train the model. In each fold, the dataset was divided into training and test-validation partitions (80% and 20%, respectively), and the test-validation portion was further split into 70% validation and 30% test. To ensure sufficient data for each class, these proportions were enforced on each object class individually. In each fold, the best model was selected on the validation partition and then evaluated on the unseen test set using the mean average precision (mAP).

After training, the cell detection and classification model was applied to each tile, and the Chi-square distance (Eq. (1)) was used to determine when the IHCT converges.

$$\tilde{\chi}^{2}=\frac{1}{2}\sum\limits_{i=1}^{n}\frac{(x_{i}-y_{i})^{2}}{x_{i}+y_{i}}$$

(1)

If the IHCT has converged, the bone marrow NDC is complete and is represented by the IHCT; otherwise, another tile is extracted and the process above is applied iteratively until convergence.

To calculate the Chi-square distance, the counts of the following cellular objects, as well as the BM_{ME} ratio (Eq. (2)), were used: “neutrophil”, “metamyelocyte”, “myelocyte”, “promyelocyte”, “blast”, “erythroblast”, “lymphocyte”, “monocyte”, “plasma cell”, “eosinophil”, “basophil”, “megakaryocyte”.

$$BM_{\mathrm{ME}}\ \mathrm{ratio}=\frac{\mathrm{Blast}+\mathrm{Promyelocyte}+\mathrm{Myelocyte}+\mathrm{Metamyelocyte}+\mathrm{Neutrophil}+\mathrm{Eosinophil}}{\mathrm{Erythroblast}}$$

(2)

Cell types were chosen to include all bone marrow cell types traditionally included in the NDC, as well as several additional cell or object types that have diagnostic relevance in hematology (“megakaryocytes”, “megakaryocyte nuclei”, “platelets”, “platelet clumps” and “histiocytes”).
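Equations (1) and (2) can be sketched directly. The `eps` guard, the convergence threshold `tol`, and the toy counts below are assumptions not specified in the text.

```python
def chi_square_distance(x, y, eps=1e-12):
    """Eq. (1): half the sum of squared differences over sums, comparing the
    class-proportion vectors of the running count before (x) and after (y)
    the latest tile. eps guards against empty classes (an assumption)."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(x, y))

def me_ratio(counts):
    """Eq. (2): bone marrow myeloid-to-erythroid (ME) ratio from cell counts."""
    myeloid = (counts["blast"] + counts["promyelocyte"] + counts["myelocyte"]
               + counts["metamyelocyte"] + counts["neutrophil"] + counts["eosinophil"])
    return myeloid / counts["erythroblast"]

def converged(prev, curr, tol=1e-3):
    """Stop extracting tiles once the distance falls below tol (hypothetical)."""
    return chi_square_distance(prev, curr) < tol

counts = {"blast": 5, "promyelocyte": 3, "myelocyte": 10, "metamyelocyte": 12,
          "neutrophil": 40, "eosinophil": 2, "erythroblast": 24}
ratio = me_ratio(counts)  # (5+3+10+12+40+2)/24
d = chi_square_distance([0.40, 0.35, 0.25], [0.41, 0.34, 0.25])
```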

### Evaluation

To evaluate the ROI detection model in predicting appropriate and inappropriate tiles, we calculated common performance measures such as accuracy, precision (PPV-positive predictive value), recall (sensitivity), specificity, and NPV (negative predictive value), as shown by the following equations:

$$\mathrm{Accuracy}=\frac{T_{p}+T_{n}}{T_{p}+T_{n}+F_{p}+F_{n}}$$

(3)

$$\mathrm{Precision}=\frac{T_{p}}{T_{p}+F_{p}}$$

(4)

$$\mathrm{Recall}=\frac{T_{p}}{T_{p}+F_{n}}$$

(5)

$$\mathrm{Specificity}=\frac{T_{n}}{T_{n}+F_{p}}$$

(6)

$$\mathrm{NPV}=\frac{T_{n}}{T_{n}+F_{n}}$$

(7)

For the ROI detection model, T_{p}, T_{n}, F_{p} and F_{n} in the above equations are defined as:

T_{p} (True Positive): the number of appropriate tiles correctly predicted as appropriate

T_{n} (True Negative): the number of inappropriate tiles correctly predicted as inappropriate

F_{p} (False Positive): the number of inappropriate tiles incorrectly predicted as appropriate

F_{n} (False Negative): the number of appropriate tiles incorrectly predicted as inappropriate
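Equations (3)-(7) follow directly from these confusion counts; a minimal sketch, where the counts are toy values:

```python
def roi_metrics(tp, tn, fp, fn):
    """Eqs. (3)-(7) from the ROI-detection confusion counts."""
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),   # Eq. (3)
        "precision":   tp / (tp + fp),                    # Eq. (4), PPV
        "recall":      tp / (tp + fn),                    # Eq. (5), sensitivity
        "specificity": tn / (tn + fp),                    # Eq. (6)
        "npv":         tn / (tn + fn),                    # Eq. (7)
    }

# toy counts: 90 appropriate tiles found, 10 false alarms, 20 missed
m = roi_metrics(tp=90, tn=880, fp=10, fn=20)
```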

To assess the performance of the proposed cell detection and classification method, Average Precision (AP) was used with 11-point interpolation (Eq. (8)): the recall axis was divided into 11 evenly spaced points from 0 to 1.0, and the maximum precision at each point was averaged. The mean Average Precision (mAP)^{48} was then calculated over all AP values (Eq. (10)). An Intersection over Union (IoU) threshold of 0.5 was used for AP in object detection, and a class probability threshold of >0.75 was applied. In addition, precision, recall, F1-score (Eq. (12)), average IoU (Eq. (11)), and log-average miss rate (Eq. (13)) were calculated for each object type.

$$AP=\frac{1}{11}\sum\limits_{r\in\{0.0,\ldots,1.0\}}AP_{r}=\frac{1}{11}\sum\limits_{r\in\{0.0,\ldots,1.0\}}P_{\mathrm{interp}}(r)$$

(8)

where

$$P_{\mathrm{interp}}(r)=\max\limits_{\tilde{r}\ge r}p(\tilde{r})$$

(9)

$$mAP=\frac{1}{N}\sum\limits_{i=1}^{N}AP_{i}$$

(10)

$$IoU=\frac{\mathrm{GTBox}\cap \mathrm{PredBox}}{\mathrm{GTBox}\cup \mathrm{PredBox}}$$

(11)

$$F1\text{-}Score=2\times\frac{\mathrm{Precision}\times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$$

(12)
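Equations (8)-(10) can be sketched as follows; the precision-recall samples are toy values, and ties at a recall point are resolved by taking the maximum precision over all samples with recall ≥ r, per Eq. (9).

```python
def average_precision_11pt(precisions, recalls):
    """Eqs. (8)-(9): precision interpolated as the max over recall >= r,
    averaged at the 11 recall points 0.0, 0.1, ..., 1.0."""
    ap = 0.0
    for i in range(11):
        r = i / 10
        p_interp = max((p for p, rr in zip(precisions, recalls) if rr >= r),
                       default=0.0)
        ap += p_interp
    return ap / 11

def mean_average_precision(aps):
    """Eq. (10): mean of the per-class AP values."""
    return sum(aps) / len(aps)

# toy precision-recall samples for one class
prec = [1.0, 1.0, 0.67, 0.75, 0.6]
rec  = [0.2, 0.4, 0.4, 0.6, 0.6]
ap = average_precision_11pt(prec, rec)
```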

For the cell detection and classification model, T_{p}, F_{p} and F_{n} in Eq. (4) and Eq. (5) are defined as:

T_{p} (True Positive): the number of cellular and non-cellular objects predicted correctly.

F_{p} (False Positive) and F_{n} (False Negative): the number of cellular and non-cellular objects predicted incorrectly.

The log-average miss rate^{49} is calculated by averaging the miss rates at 9 evenly spaced false positives per image (FPPI) points between 10^{−2} and 10^{0} in log-space.

$$\mathrm{Log\text{-}average\ miss\ rate}={\left(\prod\limits_{i=1}^{n}a_{i}\right)}^{\frac{1}{n}}=\exp\left(\frac{1}{n}\sum\limits_{i=1}^{n}\ln a_{i}\right)$$

(13)

where a_{1}, a_{2}, …, a_{9} are positive values corresponding to the miss rates at the 9 evenly spaced FPPI points in log-space between 10^{−2} and 10^{0}.
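Equation (13) reduces to a geometric mean; a minimal sketch, where the sample miss rates are toy values:

```python
import math

def log_average_miss_rate(miss_rates):
    """Eq. (13): the geometric mean of the miss rates a_1..a_n, equivalently
    exp of the mean log miss rate."""
    return math.exp(sum(math.log(a) for a in miss_rates) / len(miss_rates))

# the 9 FPPI sample points, evenly spaced in log-space over [1e-2, 1e0]
fppi_points = [10 ** (-2 + 2 * i / 8) for i in range(9)]

# toy miss rates measured at those 9 FPPI points
lamr = log_average_miss_rate([0.5, 0.4, 0.35, 0.3, 0.25, 0.2, 0.18, 0.15, 0.12])
```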

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.