Thursday, September 29, 2022

Electronic case report forms generation from pathology reports by ARGO, automatic record generator for onco-hematology

Data collection

Overall, 332 paper-based histopathology reports were collected between 2014 and 2020. Of these, 239 (the internal series) came from the Pathology Unit of the IRCCS Istituto Tumori ‘Giovanni Paolo II’ in Bari, Italy, and 93 (the external series) came from six different Italian centers: the Unit of Hematology, Azienda Ospedaliero-Universitaria Policlinico Umberto I, Rome; Hematology, AUSL/IRCCS of Reggio Emilia, Reggio Emilia; Division of Hematology 1, AOU “Città della Salute e della Scienza di Torino”, Turin; Division of Hematology, Azienda Ospedaliero-Universitaria Maggiore della Carità di Novara, Novara; Department of Medicine, Section of Hematology, University of Verona, Verona; and Division of Diagnostic Haematopathology, IRCCS European Institute of Oncology, Milan. The internal series included 106 diffuse large B-cell lymphoma (DLBCL), 79 follicular lymphoma (FL), and 54 mantle cell lymphoma (MCL) reports, while the external series comprised 49 DLBCL, 24 FL, and 20 MCL reports.

A unique ID code was assigned to each report. According to the diagnostic criteria for each lymphoma subtype, reports included IHC results obtained from LN, EN, BM or PB specimens. Qualitative and quantitative information was reported for IHC markers including MYC, BCL2, BCL6, CD10, CD20, and Cyclin-D1. Some reports also included molecular data from FISH analysis, while some included either FISH results or the level of tumor cell infiltration as an addendum. For DLBCL, molecular classification according to the COO, estimated by the Hans algorithm, was also included [24]. The Ki-67 proliferation index was reported as a quantitative value ranging from 5 to 100%.

The work was approved by the Institutional Review Board of the IRCCS Istituto Tumori “Giovanni Paolo II” hospital in Bari, Italy. All methods were carried out in accordance with relevant local regulations and after dedicated informed consent had been obtained.

Automated detection of relevant terms in paper-based reports

This step of the workflow automated the detection of relevant terms to be extracted from the text fields of paper-based reports. ARGO uses OCR [25] and NLP [26] techniques to convert images of reports into text and to detect relevant words in that text based on an ad hoc thesaurus.

The image-to-text conversion was implemented with Tesseract OCR© (version 4.1.1-rc2-20-g01fb). To improve conversion performance, each pathology report was first converted from PDF to an image using the Poppler library (version 0.26.5). The image was then converted to 8-bit greyscale (grey levels from 0 to 255).

Image transformation was implemented in Python using OpenCV© (version 4.2.0).
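The greyscale step can be illustrated with a minimal pure-Python sketch. It assumes the standard luminance weighting (0.299 R + 0.587 G + 0.114 B), which is what OpenCV documents for its RGB-to-grey conversion; the source does not state which conversion ARGO used, and the helper name `to_gray8` is hypothetical. In practice the same result would be obtained with the libraries named above rather than with hand-written loops.

```python
def to_gray8(rgb_rows):
    """Convert rows of (R, G, B) tuples into 8-bit grey levels (0 to 255).

    Assumption: standard luminance weights; not necessarily ARGO's settings.
    """
    gray_rows = []
    for row in rgb_rows:
        gray_row = []
        for r, g, b in row:
            level = round(0.299 * r + 0.587 * g + 0.114 * b)
            # Clamp to the 8-bit range in case of rounding at the extremes.
            gray_row.append(min(255, max(0, level)))
        gray_rows.append(gray_row)
    return gray_rows


# A tiny 2x3 "image": white, black, and the three pure primaries plus mid-grey.
image = [
    [(255, 255, 255), (0, 0, 0), (255, 0, 0)],
    [(0, 255, 0), (0, 0, 255), (128, 128, 128)],
]
gray = to_gray8(image)
```

Reducing each pixel to a single grey level removes colour noise from scans and stamps before the image is handed to the OCR engine.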

In ARGO, NLP techniques were adopted to automatically extract the terms relevant to the disease diagnosis, to be transferred into the digitalized eCRFs. A set of NLP regular expressions was applied to extract information concerning the diagnosis, date of the report, report ID, type of specimen, execution of BM biopsy, IHC, and FISH analyses, as well as quantitative and qualitative data for selected IHC markers (MYC, BCL2, BCL6, CD10, CD20, Cyclin-D1), COO subtypes and the Ki-67 proliferation index (see section “ARGO functions and NLP rules”).

The disease nomenclature was assigned based on the highest match between the pattern of biomarkers detected in each report and a reference pattern, as reported in the “Hematopoietic and Lymphoid Neoplasm Coding Manual guidelines” from the “Surveillance, Epidemiology and End Results (SEER) program” of the National Institutes of Health [27]. The final diagnosis nomenclature referred to the ICD-10 classification [23]. Communication between ARGO and the official SEER servers was handled via API.
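The idea of a "highest match" can be sketched as follows. The source does not specify how the match is scored, so this example assumes a simple fraction of agreeing markers; the reference patterns, the function names, and the scoring rule are all hypothetical and purely illustrative, not the SEER rules and not clinical guidance.

```python
# Hypothetical reference patterns, for illustration only.
REFERENCE_PATTERNS = {
    "MCL": {"CD20": "+", "Cyclin-D1": "+", "CD10": "-"},
    "FL": {"CD20": "+", "CD10": "+", "BCL2": "+", "Cyclin-D1": "-"},
    "DLBCL": {"CD20": "+", "Cyclin-D1": "-"},
}


def match_score(detected, reference):
    """Fraction of reference markers whose status agrees with the report."""
    hits = sum(1 for marker, status in reference.items()
               if detected.get(marker) == status)
    return hits / len(reference)


def best_diagnosis(detected, patterns=REFERENCE_PATTERNS):
    """Return the best-scoring label and all scores.

    Ties are resolved in favour of the first pattern in dictionary order.
    """
    scores = {name: match_score(detected, ref) for name, ref in patterns.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Markers as they might be detected in one report (made-up values).
report_markers = {"CD20": "+", "Cyclin-D1": "+", "CD10": "-", "BCL2": "+"}
label, scores = best_diagnosis(report_markers)
```

A real system would need a principled tie-break and a minimum score below which no diagnosis is assigned; neither is described in the source.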

ARGO was developed in Flask© (version 1.1.2); the web server ran Oracle© Linux Server 7.8 with kernel 4.14.35–1902.303.5.3.el7uek.x86_64. We used MariaDB© 5.5.68 as the database. NLP algorithms were developed in Python 3.6.8. Translation from English to Italian was handled via the MyMemory© API (version 3.5.0). To increase the detectability of biomarkers in the reports, we also built three thesauri in Python with NLP regular expressions (Supplementary Appendix Source code S1 and Table S2). Although these thesauri are domain-specific, extracting knowledge by flexibly plugging in a new thesaurus is a general feature of ARGO.

ARGO functions and NLP rules

ARGO was built around three functions: function_read.py, header_info.py, and params.py. Function_read.py was the main function and incorporated (1) the call to header_info.py to recognize the template of the input report, (2) the set of NLP expressions to identify both biomarker and diagnosis descriptions, and (3) the call to params.py, which held two API tokens: the first to retrieve data on biomarkers and diagnosis from the SEER database, and the second, provided by the REDCap project ID, to allow automatic data entry. Supplementary Fig. S2A details the pseudocode to process a pathology report.

ARGO embedded two main activities: (i) recognition of the template from the header section, including the fields “BIOPSY DATE” and “ID NUMBER”, the patient demographic information (“NAME”, “SURNAME”, “DATE OF BIRTH”, “PLACE OF BIRTH”, “SEX”, and “SSN” [Social Security Number]), and the “SPECIMEN TYPE” (via header_info.py); and (ii) recognition of the “IHC MARKERS” (“POSITIVITY/NEGATIVITY” or “QUANTITY”) from the biological samples and of the fields “FISH”, “DIAGNOSIS”, and “CELL OF ORIGIN” from the disease section (via function_read.py). Supplementary Fig. S2B shows an example of NLP input from the internal series. The regular expressions used to automatically recognize the header section of internal reports are listed in Table 4; those for the external reports are detailed in Supplementary Table S3.

Table 4 Set of NLP regular expressions embedded into header_info.py for the internal reports.
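As a rough illustration of header recognition with regular expressions, the sketch below extracts a few of the header fields named above from OCR text. The patterns, the English field labels, the date format, and the sample text are assumptions made for this example; the expressions actually used are those of Table 4 in the source.

```python
import re

# Hypothetical patterns for an English-language header (not the Table 4 rules).
HEADER_PATTERNS = {
    "id_number": re.compile(r"ID NUMBER\s*:?\s*([A-Z0-9/-]+)", re.I),
    "biopsy_date": re.compile(
        r"BIOPSY DATE\s*:?\s*(\d{1,2}[/.-]\d{1,2}[/.-]\d{4})", re.I),
    "sex": re.compile(r"SEX\s*:?\s*([MF])\b", re.I),
    "specimen_type": re.compile(r"SPECIMEN TYPE\s*:?\s*(.+)", re.I),
}


def parse_header(text):
    """Return one value per header field, or None when a field is not found."""
    fields = {}
    for name, pattern in HEADER_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1).strip() if match else None
    return fields


# Made-up OCR output for a report header.
sample_header = """ID NUMBER: H-2019/00123
BIOPSY DATE: 12/03/2019
SEX: F
SPECIMEN TYPE: lymph node
"""
header = parse_header(sample_header)
```

Returning `None` for missing fields, rather than raising, lets a caller decide whether an unrecognized template should be rejected or routed to manual review.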

Concerning function_read.py, we identified the set of pathological description patterns according to the following four scenarios:

  1. description of qualitative markers by symbolic qualifiers in free-text form (e.g. “+” for positivity and “-” for negativity);

  2. description of qualitative markers by textual qualifiers in free-text form (e.g. “positive”, “reactive” or “immunoreactive” for positivity and “negative” or “immunonegative” for negativity);

  3. description of both qualitative and quantitative markers by symbolic or textual qualifiers in bullet form;

  4. description of purely quantitative markers (such as Ki-67).
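The scenarios above lend themselves to small, separate regular expressions. The following sketch covers scenarios 1, 2, and 4 for English text; the patterns, the function name `extract_markers`, and the sample sentence are hypothetical and are not the rules of Table 5 or Supplementary Table S4. Bullet-form descriptions (scenario 3) could plausibly be handled by applying the same expressions line by line.

```python
import re

MARKERS = r"(MYC|BCL2|BCL6|CD10|CD20|Cyclin-D1)"

# Scenario 1: symbolic qualifier, e.g. "CD20+" or "CD10 -".
# The lookahead avoids reading the hyphen of "CD20-positive" as a minus sign.
SYMBOLIC = re.compile(MARKERS + r"\s*([+-])(?!\w)", re.I)

# Scenario 2: textual qualifier, e.g. "BCL2 positive" or "BCL6: immunonegative".
TEXTUAL = re.compile(
    MARKERS + r"[\s:-]*(positive|reactive|immunoreactive|negative|immunonegative)\b",
    re.I,
)

# Scenario 4: purely quantitative marker, e.g. "Ki-67 proliferation index: 80%".
KI67 = re.compile(r"Ki-?67\D{0,30}?(\d{1,3})\s*%", re.I)


def extract_markers(text):
    """Map each detected marker to 'positive'/'negative', and Ki-67 to a percentage."""
    results = {}
    for marker, sign in SYMBOLIC.findall(text):
        results[marker.upper()] = "positive" if sign == "+" else "negative"
    for marker, word in TEXTUAL.findall(text):
        results[marker.upper()] = (
            "negative" if "negative" in word.lower() else "positive")
    ki67 = KI67.search(text)
    if ki67:
        results["KI-67"] = int(ki67.group(1))
    return results


# Made-up free-text description.
sentence = ("CD20+, CD10 -, BCL2 positive, Cyclin-D1 negative. "
            "Ki-67 proliferation index: 80%.")
found = extract_markers(sentence)
```

Keeping one expression per scenario makes it easier to see which rule produced a given value when an extraction has to be checked against the original report.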

Table 5 shows three representative description patterns with their corresponding NLP pseudocode and expected results. The whole set of patterns is detailed in Supplementary Table S4.

Table 5 Representative sets of NLP rules embedded into the function_read.py for patterns 1.1, 3.2, and 4.1.

Data-mapping and automatic population of eCRFs

For a systematic collection of the diagnostic variables in this study, we designed dedicated eCRFs in REDCap [17,18]. The eCRFs were adapted to the synoptic templates provided and approved by the CAP; we referred to the DLBCL, FL, and MCL templates [28,29]. Data-mapping between ARGO and the eCRFs was performed by providing the relevant data fields from the REDCap dictionary as a flexible input to the application (Supplementary Table S5). Finally, we used the API for automatic data entry and final upload of the information of interest into the eCRFs.
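A minimal sketch of the data-mapping and upload step is given below. It assumes the REDCap record-import API, which accepts a form-encoded POST with `token`, `content`, `format`, `type`, and `data` parameters; the field names in `FIELD_MAP`, the function names, and the sample values are hypothetical (the actual mapping is in Supplementary Table S5), and the URL and token must come from the REDCap installation in use.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical mapping from extracted keys to REDCap dictionary field names.
FIELD_MAP = {
    "report_id": "record_id",
    "CD20": "ihc_cd20",
    "KI-67": "ki67_percent",
}


def build_import_payload(token, extracted, field_map=FIELD_MAP):
    """Build the form fields for a REDCap record import; unmapped keys are dropped."""
    record = {field_map[key]: str(value)
              for key, value in extracted.items() if key in field_map}
    return {
        "token": token,
        "content": "record",
        "format": "json",
        "type": "flat",
        "data": json.dumps([record]),
    }


def send_to_redcap(api_url, payload):
    """POST the payload to the REDCap API endpoint and return the response body."""
    body = urllib.parse.urlencode(payload).encode("utf-8")
    request = urllib.request.Request(api_url, data=body)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Build (but do not send) a payload from made-up extraction results.
payload = build_import_payload(
    "PLACEHOLDER_TOKEN",
    {"report_id": "R1", "CD20": "positive", "BCL6": "negative"},
)
```

Passing the mapping as a parameter mirrors the idea of supplying the dictionary fields as a flexible input, so that a different eCRF only requires a different mapping.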

Validation metrics

ARGO performance, defined as the level of consistency between the data in the original pathology reports and the data automatically transferred into the eCRFs, was assessed in terms of accuracy, precision, recall and F1 score [30]. To calculate each measure, we defined the cases as follows: (1) true-positive: ARGO correctly detected an expected variable; (2) false-positive: ARGO detected a variable that was not present in the original report; (3) true-negative: ARGO did not detect a variable that was not present in the original report; and (4) false-negative: ARGO failed to detect a variable that was present in the original report.
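The four measures follow directly from the four case counts through their standard definitions. The sketch below computes them; the function name and the counts in the example are illustrative and are not results from the study.

```python
def validation_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion counts.

    A ratio with a zero denominator is reported as 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative counts for one data-field (not taken from the study).
metrics = validation_metrics(tp=90, fp=5, tn=3, fn=2)
```

Because F1 is the harmonic mean of precision and recall, it is equivalent to 2·TP / (2·TP + FP + FN), which gives a quick consistency check on reported values.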

Results for each data-field of internal and external series were statistically compared by a chi-square test.
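One plausible form of this comparison is a 2×2 table per data-field, with correct and incorrect detections as columns and the internal and external series as rows. The sketch below computes the Pearson chi-square statistic for such a table, without continuity correction, and its p-value for one degree of freedom. The table layout, the absence of a correction, and the counts are assumptions; the source does not describe the exact test set-up.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square and p-value (1 degree of freedom) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs a nonzero total")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Illustrative counts only: (correct, incorrect) for each series.
chi2, p_value = chi_square_2x2(20, 10, 10, 20)
```

With small expected counts in any cell, an exact test would usually be preferred over the chi-square approximation.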


