Sickle cell disease is the most common inherited blood disorder and is estimated to affect more than 100,000 people in the United States. It has several subtypes, determined by an individual’s specific genetic combination. The subtype can influence both the severity of an individual’s disease and the treatment regimen they receive.
The researchers developed two automated approaches for classifying individuals by sickle cell disease subtype: a linear mixed-effects model and a random forest model. The linear mixed-effects model had an accuracy of 0.859, a sensitivity of 0.59, and a specificity of 0.92. The random forest model performed similarly. Model accuracy was higher for individuals younger than 18 years than for those aged 18 to 39.9 years or 40 years and older, and it also varied by site.
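As a rough illustration of how the three reported metrics relate to one another, the sketch below computes accuracy, sensitivity, and specificity from the four cells of a binary confusion matrix. The function name and the counts are hypothetical and chosen only for illustration; they are not the study's data, and which subtype group served as the positive class is not stated here.

```python
def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Compute accuracy, sensitivity, and specificity from confusion-matrix counts.

    tp: positives correctly classified    fn: positives missed
    tn: negatives correctly classified    fp: negatives wrongly flagged
    """
    total = tp + fn + tn + fp
    return {
        # Share of all individuals classified correctly.
        "accuracy": (tp + tn) / total,
        # Share of true positives the model identifies.
        "sensitivity": tp / (tp + fn),
        # Share of true negatives the model identifies.
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts (NOT from the study): 50 positives, 200 negatives.
metrics = classification_metrics(tp=30, fn=20, tn=180, fp=20)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these made-up counts the negative class is four times larger than the positive class, so overall accuracy sits much closer to specificity than to sensitivity. That is one way a model can show high accuracy alongside a noticeably lower sensitivity.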
Overall, the researchers were able to classify individuals into two groups of similar sickle cell disease genotypes (HbSS/HbSβ0 vs. HbSC/HbSβ+) using electronic health record data with moderate to high accuracy. Such a model could be extremely useful for advancing research on sickle cell disease, but more complete and better standardized data are still needed to improve classification.

