
Although co/multimorbidities are associated with a significant increase in mortality, lack of quantitative exploratory techniques often impedes an in-depth analysis of their association. In the current study, we explore the clustering of co/multimorbid patients in the Texas patient population. We employ unsupervised agglomerative hierarchical clustering to find clusters of co/multimorbid patients within this population. Our analysis revealed the presence of nine distinct, clinically relevant clusters of co/multimorbidities within the study population of interest. This technique provides a quantitative exploratory analysis of the co/multimorbidities present in a specific population.


Introduction
One in four Americans has two (comorbid) or more (multimorbid) chronic conditions (Hoffman, Rice, & Sung, 1996). Projections estimate that more than 81 million Americans will suffer from these co/multimorbidities by 2020 (Anderson, 2003). Further, new data suggest that applying routine clinical procedures to patients with a co/multimorbidity can lead to unintended adverse events if healthcare professionals are unaware of the patient's history (Fried et al., 2014). Thus, there is a critical need to identify patients with complex co/multimorbidities to enable proper support and intervention. Previous research to classify subgroups of patients with co/multimorbidities has conventionally depended on complex multivariate regression techniques (Cheng, Dy, Fang, Chen, & Chiu, 2013; Ilesanmi & Fatiregun, 2014). Although newer supervised and unsupervised machine learning algorithms have been successfully adopted in many other spheres of biomedical data analysis, they have not been applied to the characterization of co/multimorbidities in patient data (Clifton, Niehaus, Charlton, & Colopy, 2015; Deo, 2015).
Clustering is an unsupervised machine learning technique that aims to group analogous entities into one cluster and to partition dissimilar objects into different clusters (Becker, 2005; Hofstetter, Dusseldorp, van Empelen, & Paulussen, 2014; Whiteman & Whiteman, 1949). A cluster is a subset of similar objects, defined by certain parameters, within a larger set. The similarity cutoff is often subjective and is usually determined by the study design. With respect to medical conditions, co/multimorbid clustering can be defined as an unsupervised technique to find patients with similar medical conditions. In the current study, we use correlational clustering analysis to find the key groups of diseases present in the population. We further employ hierarchical clustering analysis on patients from the Texas health care patient data to describe inherent patterns in clusters of multimorbid patients. The clustering approach identifies cohorts of co/multimorbidities and presents opportunities for better management of these patients.

Study Population
We used open-access, de-identified aggregate data provided by the Texas Department of State Health Services (http://healthdata.dshs.texas.gov/Home) to conduct this analysis. Inpatient and outpatient datasets were combined to generate a composite dataset of more than 15,000 data points, and the inpatient procedure code was used to identify different clinical conditions. The training cohort consisted of patients who were 21 years or older as of January 1, 2015, with two (comorbidities) or more (multimorbidities) conditions identified by inpatient procedure code on first examination. Members admitted to hospice or a long-term care facility, or with a pregnancy reported in the last 3 months, were excluded from the study. After exclusions, our final study population was 13,920 patients. We isolated the list of the 75 most common conditions reported by the Centers for Disease Control and Prevention, USA, and used it to further filter the input dataset ("CDC - NCHS - National Center for Health Statistics," 2018). A literature search was also performed to identify further conditions that could be included in the study based on the general Texas population, our specific study cohort and diseases with relatively high prevalence. Identification of conditions within cohort members was based on outpatient data cross-referenced to the International Classification of Diseases, Tenth Revision (ICD-10) diagnosis and procedure codes in 2015 ("WHO | International Classification of Diseases, 11th Revision (ICD-11)").

Exploratory Data Analysis and Feature Generation
Microsoft SQL Server (version 2012) was used to extract, transform, load and query the dataset. Binary outcome variables were created for the selected conditions. Age, gender, income and other demographic variables were also included in the input dataset. Input variables were scanned for outliers. Missing values in the non-binary continuous and categorical variables were imputed with the median and mode, respectively. Fewer than 2% of values in the dataset were imputed overall. A non-zero variance analysis was performed on the binary disease variables, and variables with less than 2% variance were excluded from further analysis.
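The imputation and variance-filtering steps described above can be sketched as follows. This is a minimal illustration in plain Python, not the SQL Server pipeline used in the study. The toy values, the helper names impute and low_variance_columns, and the reading of the 2% cutoff as a threshold of 0.02 on the variance of a 0/1 indicator are all assumptions made for the example.

```python
from statistics import median, mode, pvariance

def impute(values, kind):
    """Fill None entries with the median (continuous) or mode (categorical)."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if kind == "continuous" else mode(observed)
    return [fill if v is None else v for v in values]

def low_variance_columns(binary_columns, cutoff=0.02):
    """Return the names of 0/1 columns whose population variance is below cutoff."""
    return [name for name, col in binary_columns.items()
            if pvariance(col) < cutoff]

# Hypothetical toy data (not taken from the study).
age = impute([54, None, 61, 47, 58], "continuous")
gender = impute(["F", "M", None, "F", "F"], "categorical")
conditions = {
    "diabetes": [1, 0, 1, 0, 1] * 20,   # prevalence 0.60, variance 0.24
    "rare_condition": [1] + [0] * 99,   # prevalence 0.01, variance 0.0099
}
dropped = low_variance_columns(conditions)
print(age, gender, dropped)
```

For a binary indicator with prevalence p, the variance is p(1 - p), so under this assumed reading the filter removes conditions that are either very rare or nearly universal in the cohort.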

Correlation Algorithm
Correlation clustering involves the creation of a weighted graph X = (P, E), such that each edge weight specifies similarity (positive edge weight) or dissimilarity (negative edge weight). The goal is to find an optimal clustering that maximizes similarity or minimizes dissimilarity (Becker, 2005). The method of minimizing disagreement was chosen for the current study based on the characteristics of the input data. A Spearman's correlation matrix was generated for the binary variables created from the different input conditions. A k-means clustering algorithm based on Jaccard's distance was then created and analyzed. The optimal cluster number was chosen to minimize the goodness-of-fit criterion. The clusters of conditions were analyzed for similarity of conditions based on origin, organ system and patient demographics.
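The two similarity measures named above can be sketched in a few lines. The example below is an illustration only, written in plain Python rather than the R code used for the study, and it covers just the pairwise measures, not the clustering step. The condition vectors and the function names are hypothetical. For 0/1 indicators, Spearman's coefficient equals the Pearson (phi) coefficient, because ranking a two-valued variable is a linear transformation of it.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def jaccard_distance(x, y):
    """1 - |both present| / |either present| for two 0/1 vectors."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return 1.0 - both / either if either else 0.0

# Hypothetical condition indicators for eight patients (1 = condition present).
hypertension = [1, 1, 1, 0, 0, 1, 0, 0]
obesity      = [1, 1, 0, 0, 0, 1, 0, 1]
cataract     = [0, 0, 0, 1, 1, 0, 1, 1]

rho_ho = spearman(hypertension, obesity)
rho_hc = spearman(hypertension, cataract)
jd_ho = jaccard_distance(hypertension, obesity)
print(round(rho_ho, 3), round(rho_hc, 3), round(jd_ho, 3))
```

Unlike the correlation coefficient, the Jaccard distance ignores patients in whom both conditions are absent, which can matter when most conditions are rare.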

Clustering Algorithm
An agglomerative hierarchical clustering (AHC) algorithm with a bottom-up approach was used to separate clinically appropriate clusters within the study population. The bottom-up approach to AHC starts with each member in its own cluster, followed by serial merging of similar members to form similarity clusters until only one cluster remains. After the clustering procedure terminates, subject matter expertise, clinical relevance and study design criteria are used to select a cutoff/threshold that produces the final clusters. The process can be visualized using a dendrogram. We used Ward's method along with Gower's distance matrix for similarity calculations, as it has been shown to be more reliable for mixed data with a preponderance of weighted binary data (like condition-related binary variables) (Gower, 1971).
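The bottom-up procedure described above can be sketched as follows. This is a simplified illustration in plain Python, not the R implementation used in the study. It assumes an unweighted Gower distance with simple matching for binary and categorical features, and it applies the Lance-Williams update for Ward's criterion to squared dissimilarities. The patient records and the function names gower_distance and ward_linkage are hypothetical.

```python
from math import sqrt

def gower_distance(a, b, kinds, ranges):
    """Mean per-feature dissimilarity between two records of mixed type."""
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "numeric":
            total += abs(x - y) / rng if rng else 0.0  # range-scaled difference
        else:
            total += 0.0 if x == y else 1.0            # simple mismatch
    return total / len(kinds)

def ward_linkage(dist):
    """Merge clusters bottom-up; return a list of (left, right, height) merges."""
    n = len(dist)
    clusters = {i: [i] for i in range(n)}  # every record starts alone
    # Ward's Lance-Williams update operates on squared dissimilarities.
    d2 = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}
    merges, next_id = [], n
    while len(clusters) > 1:
        (i, j), best = min(d2.items(), key=lambda kv: kv[1])
        ni, nj = len(clusters[i]), len(clusters[j])
        merges.append((clusters[i], clusters[j], sqrt(best)))
        updated = {}
        for k in clusters:
            if k in (i, j):
                continue
            nk = len(clusters[k])
            dik = d2[(min(i, k), max(i, k))]
            djk = d2[(min(j, k), max(j, k))]
            updated[k] = ((ni + nk) * dik + (nj + nk) * djk - nk * best) / (ni + nj + nk)
        # Drop distances involving the merged clusters, then add the new cluster.
        d2 = {pair: v for pair, v in d2.items() if i not in pair and j not in pair}
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        for k, v in updated.items():
            d2[(k, next_id)] = v
        next_id += 1
    return merges

# Hypothetical records: (age, sex, diabetes, hypertension).
patients = [(45, "F", 0, 0), (47, "F", 0, 0), (70, "M", 1, 1),
            (72, "M", 1, 1), (68, "M", 1, 0)]
kinds = ["numeric", "categorical", "binary", "binary"]
columns = list(zip(*patients))
ranges = [max(c) - min(c) if k == "numeric" else None
          for c, k in zip(columns, kinds)]
dist = [[gower_distance(p, q, kinds, ranges) for q in patients] for p in patients]
merges = ward_linkage(dist)
for left, right, height in merges:
    print(left, right, round(height, 3))
```

The merge heights returned here are the quantities a dendrogram plots on its height axis. This quadratic-memory version is only meant for reading. A cohort of several thousand patients would need an optimized library routine.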

Figure Preparation
The results from R were exported to CSV files, which were imported into Tableau (version 8.0) or Microsoft Excel (version 2013) to create graphs and visualizations. Tables were created in Microsoft Word (version 2013).

Correlation Analysis Reveals Expected Clusters of Major Conditions in the Patient Population
A Spearman's correlation analysis was performed on the dataset of more than 70 different conditions to check for correlations between conditions in the population. The resulting correlation matrix was further clustered to group similar conditions using a correlation coefficient cutoff of greater than 0.60 (Figure 1).
Figure 1. A matrix showing the results of Spearman's correlation analysis followed by k-means clustering on the results. Conditions are represented on the top and left panels. The size of each circle depicts the strength of the correlation between diseases. Color coding of the Spearman's correlation coefficient (red to blue signifies correlation from -1 to +1) was added to increase interpretability. Clustering analysis using Ward's method reveals five clusters of clinically related conditions, shown in the dotted red box.
We identified five broad clusters of multimorbid conditions: (1) diseases overrepresented in the female population, including menopause, chronic thyroid disorder and osteoporosis; (2) neurological and psychiatric conditions, including substance-related psychiatric conditions, epilepsy, dementia and depression; (3) diseases related to metabolic syndrome, including hypertension, lower back pain, hyperlipidemia, obesity and diabetes mellitus; (4) conditions related to the cardiovascular system, including heart failure, ischemic heart disease, peripheral artery disease and cerebrovascular disease; and (5) conditions of the eye, including cataract and glaucoma. The clusters identified were homogeneous and overall had low demographic variance. A radial dendrogram was created to further visualize the similarity of conditions within the population (Figure 2).

Clustering Analysis Reveals 9 Broad Clusters of Multimorbidity Patients in the Population
Agglomerative Hierarchical Clustering revealed 9 broad clusters in the population data of 13,920 patients (Figure 3).
Figure 3. Results from hierarchical agglomerative clustering reveal nine distinct clusters. Clustering was performed using Ward's method with Gower's distance, and a threshold (h = 27; shown as a dotted line) was used to isolate 9 clusters.
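Cutting a dendrogram at a height threshold, as in the caption above, amounts to applying only those merges whose height does not exceed the threshold. The sketch below illustrates this in plain Python with a small union-find structure. The merge list, its heights and the function name cut_at_height are hypothetical, and the approach assumes merge heights never decrease, which holds for Ward's method.

```python
def cut_at_height(merges, n_items, h):
    """Label items 1..k by applying only the merges whose height is <= h."""
    parent = list(range(n_items))

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for left, right, height in merges:
        if height <= h:
            parent[find(right[0])] = find(left[0])

    labels = {}
    return [labels.setdefault(find(i), len(labels) + 1) for i in range(n_items)]

# Hypothetical merge history for five patients: (left members, right members, height).
merges = [([0], [1], 0.5), ([2], [3], 0.8), ([4], [2, 3], 3.0),
          ([0, 1], [4, 2, 3], 27.5)]

print(cut_at_height(merges, 5, 1.0))   # three clusters remain
print(cut_at_height(merges, 5, 5.0))   # two clusters remain
```

Raising the threshold merges more branches and yields fewer, broader clusters. Lowering it does the opposite, which is why the choice of cutoff relies on clinical judgment as well as the tree itself.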
The average patient age was 54.9 years, and the population comprised 50.1% males and 49.9% females. Descriptive statistics revealed clinical homogeneity within the clusters (Table 1). The clusters (numbered arbitrarily) diverged in the mix of co/multimorbidities observed and in age/gender demographics. Clinically relevant summarization (Table 2) showed distinct clusters with a high proportion of patients with: cancer (cluster 1), musculoskeletal diseases (cluster 2), substance abuse (cluster 3), female population with arthritis and post-menopausal conditions (cluster 4), metabolic syndrome related conditions (cluster 5), thyroid related conditions (cluster 6), females with migraines and depression (cluster 7), elderly population with multiple conditions (cluster 8) and a diabetes cohort (cluster 9). Income, medical utilization and inpatient visits were also calculated but are not shown in the current analysis.

Discussion
In the current paper, we use a clustering approach to identify groups of patients with similar co/multimorbid conditions in the Texas population. The clusters identified were homogeneous and clinically relevant, and the real-world applications of our findings provide actionable insights for the fields of public health and healthcare provision. We present a fast and easy approach to explore patient data to better understand co/multimorbidities.
Co/multimorbidities are illnesses that coexist with a condition of interest. They often lead to delayed treatment or misdiagnosis and have been shown to increase mortality in multiple populations (Song et al., 2018). They are a major source of economic burden on healthcare systems, with co/multimorbid patients experiencing worse health, economic and social outcomes than patients with a single health issue. Indeed, it has been shown that having multiple health conditions significantly increases the probability of reporting a diminished quality of life (Pisinger, Toft, Aadahl, Glümer, & Jørgensen, 2009; Song et al., 2018). Co/multimorbidity indices are frequently used to summarize the overall health of a population but often suffer from errors of manual data curation ("Comorbidity indices," n.d.; Sharabiani, Aylin, & Bottle, 2012). Our analysis provides a quantitative, data-driven approach to exploring multimorbid patient data with the possibility of real-time analysis.
Clustering the Texas patient health data to isolate multimorbidity patients revealed nine well-defined clusters. An analysis of the most prevalent conditions in every cluster revealed broad groupings within each cluster (Table 2). Our first cluster was also the largest, with 4,532 patients. It contained middle-aged patients (median age 53.5 years) and had a slightly higher proportion of males (56%) than females (44%). The cluster was characterized by the highest incidence of cancers of different organ systems, including colorectal cancer (2.8%), prostate cancer (3.9%), ovarian cancer (0.6%), multiple myeloma (1.5%), malignant melanoma (1.8%), pancreatic cancer (0.5%), esophageal cancer (0.4%), stomach cancer (0.4%), skin cancer (7.2%), oral cancer (0.8%) and other cancers (8.9%). Interestingly, this cluster also had the highest incidences of kidney stones (7.7%), inflammatory bowel disease (8.6%) and sickle cell anemia (0.2%), suggesting a potential causal role of these conditions in certain cancers. Indeed, a 2015 meta-analysis revealed an increased risk of renal cell carcinoma in patients with kidney stones (Cheungpasitporn et al., 2015). Further corroborating our cluster-derived relationship, inflammatory bowel disease patients are known to be at an increased risk of colorectal cancer, and inflammatory bowel disease has also recently been identified as a risk factor for oral cancer (Katsanos, Roda, Brygo, Delaporte, & Colombel, 2015; Kim & Chang, 2014).
Our second cluster had a higher proportion of older females (53% females, with a median age of 59.6 years) and had the highest incidences of conditions like osteoarthritis (69%) and lower back pain (71%). Interestingly, this cluster tended to be more expensive than the other clusters (results not shown). Cluster 3 was our youngest cluster (median age 42.9 years), with more males than females (64% males). This cluster contained the highest percentage of members with substance abuse (80%) and related disorders, including hepatitis (15%), pancreatitis (16%), neurosis (1.1%) and psychoses (9.3%). Remarkably, we also found the highest rate of post-partum neurosis disorders in this cohort, which raises the possibility of post-partum substance abuse or addiction. Indeed, it has been suggested that pharmacological agents used to treat post-partum depression often lead to long-term addiction and need more federal and clinical regulation (Chapman & Wu, 2013; Ross & Dennis, 2009).
Cluster 4 contained a high proportion of middle-aged females (median age 56.1 years; 62.5% females), with a relatively high proportion reporting menopause (17.2%). Further, this cluster also reported the highest incidences of conditions like rheumatoid arthritis (9.4%) and fibromyalgia (8.8%). This correlative evidence supports previous observations of links between fibromyalgia, rheumatoid arthritis and menopause (Martínez-Jauand et al., 2013; Pines, 2014). Martínez-Jauand and colleagues have previously shown that an early menopause can reduce estrogen exposure, which causes an increased sensitivity to pain that magnifies fibromyalgia symptoms (Martínez-Jauand et al., 2013). Cluster 5, although relatively small (895 patients), contained a very high proportion of patients with metabolic syndrome (95%). As expected, this cluster had the highest rates of hypertension (98%) and obesity (61%). This cluster also contained a high proportion of cervical cancer patients (0.4%), and this link has previously been demonstrated in other populations (Bussière, Sicsic, & Pelletier-Fleury, 2014; Lee, So, Piyathilake, & Kim, 2013). Cluster 6 was our smallest cluster (765 patients) and had the highest proportion of females (77.3%). It had the highest incidences of osteoporosis (22.4%), chronic thyroid disorders (86.4%) and chronic fatigue syndrome (1.2%). A common theme related to these conditions is the interleukin-6 pathway, dysregulations of which are known to play a central role in osteoporosis, thyroid disorders and neck cancer (Guerrera et al., 2014; Lumachi, Basso, & Orlando, 2010; Papanicolaou, Wilder, Manolagas, & Chrousos, 1998; Roy, Curtis, Fears, Nahashon, & Fentress, 2016). Our results suggest that this cytokine pathway may be responsible for more disorders than previously identified.
Cluster 7 also had a high proportion of young females (59% females; median age 49.1 years), who reported a high proportion of neurological and psychological disorders, including psychosis (2.8%), depression (48%), migraines/headaches (35.4%), epilepsy (12.1%) and eating disorders (0.6%). This group also had the highest proportion of fertility issues (0.7%) and of births of babies with low birth weight (0.1%). We also observed an increased incidence of disrupted childhood in this group (0.6%), which points to a possible origin of these psychological issues. Similar co/multimorbidity associations have been studied extensively in childhood post-traumatic stress disorders (Gekker et al., 2018; Lecei et al., 2018; Nordin, Olsson, & Tomson, 2018). Cluster 8, our oldest cluster (median age 60.5 years), with the highest proportion of males (66.8%), suffered from a combination of cardiovascular and respiratory diseases commonly seen in the elderly population. This cluster had high incidence rates of heart failure (49%), cerebrovascular disease (80.9%), COPD (16.7%), congenital heart disease (1.7%) and ventricular arrhythmia (14.2%). Further, consistent with our expectations, we also found the highest incidences of Parkinson's (0.7%) and dementia (2.7%) in this cluster. This cohort of patients had the highest frequency of inpatient visits and the highest total associated cost (data not shown). Our final cluster, number 9, was dominated by middle-aged males (61.8% males; median age 59.2 years) with the highest incidence of diabetes mellitus (68.8%). We also saw the highest incidences of diabetes-related chronic disorders like chronic renal failure (29.5%), cataract (20.8%) and glaucoma (20.6%). Relationships between these conditions have been extensively reported (Harding, Egerton, van Heyningen, & Harding, 1993; Lipton & Decker, 2015; Yoshimoto & Kato, 2016).
Overall, our clustering approach has identified cohorts of patients with similar multimorbid diseases, with actionable insights that can be used to reduce disease incidence, treatment and management costs, and the overall burden on today's healthcare system.

Figure 2. A closed radial dendrogram shows the structure of different conditions as inferred from the data. Conditions which frequently co-occur in patients share a common node (branch) in the dendrogram.

Table 1. Summaries of disease distributions in different clusters. This table shows cluster summaries for age (median; years), male and female composition (%), and the proportion of people identified with different medical conditions (%; number of members in cluster with the disease / total number of members with the disease), for the nine clusters.