Parametric Survival Modeling of Tuberculosis Data -A Case Study of Federal Medical Centre, Bida, Nigeria

The research performed parametric survival analysis of Tuberculosis (TB) data (covering 2010 to 2016) collected from the Federal Medical Centre, Bida, Niger State, Nigeria. Three parametric survival models (Exponential, Weibull and Log-logistic) were fitted. The outcome variable was time to recovery from TB infection and four covariates being age, gender, TB type and occupation were involved. Models were estimated by maximum likelihood method and model selection criterion used was the Akaike Information Criterion (AIC). The exponential and log-logistic models found all covariates statistically insignificant while Weibull found all covariates but TB type significant at 5% level. Based on AIC, Weibull model with AIC of 163.5731 performed best, followed by log-logistic model with AIC of 191.419 and exponential model performed worst, with AIC of 517.9652. The best of fitted models being Weibull suggested that older patients had higher hazards than younger ones, older patients hence, had lower survival times, holding other covariates constant. That is, the older the TB patient, the lower was the time to recovery from TB. Males had higher hazards and hence, lower survival times compared to females. That is, male TB patients recovered faster than the females. Pulmonary TB patients had lower (insignificant) hazards and hence, higher survival times than Respiratory TB patients. TB patients on technical occupation had lower hazards than others and hence, had higher survival times than those whose occupations were considered not technical. The research concluded that age, gender and occupation were the major determinants of recovery period of TB patients. It was recommended that the Management of Federal Medical Centre, Bida, and other organizations involved in TB management could make use of the Weibull model to fit and predict both the survival and hazard rates of TB patients.


Introduction
Survival analysis is the analysis of time-to-event data. Such data describe the length of time from a time origin to an endpoint of interest. Survival analysis methods are usually used to analyze data when major interest is time to occurrence of an event. Such event may be death, recovery, employment, marriage and so on. However, not all individuals could have experienced the event of interest before end of study, implying that real time of event is not available for some individuals, this is called censoring.
Patel, Konstantinos and Derhy (2003) investigated TB prognostic factors in Australia. Ponnuraja and Venkatesan (2010) investigated TB cases in India; Fikadu, Yohannes and Fikre (2015) modeled risk factors of TB cases in Ethiopia. Xia, Ning and Huang (2018) performed the empirical comparison of the Breslow estimator and the KP estimator based on simulated data. Recent works include Yoosefi et al. (2018) who applied exponentiated Weibull model to cancer data, Odeleye, Bolarinwa and Bolarinwa (2019), an application of Kaplan Meier model on unemployment duration data, and Bolarinwa and Michael (2020) that fitted Cox proportional hazards model to Tuberculosis data. Khanum et al. (2019) applied logistic regression on TB study in Pakistan. Aung et al. (2019) applied Cox model to TB-HIV co-infected patients in Myanmar. Ogbo et al. (2018) observed a decreasing trend in TB disease burden among HIV-negative people in Nigeria.
TB is a widespread disease which can be fatal. It is infectious and caused by various strains of mycobacteria. It is spread through the air when people who have an active TB infection, cough, sneeze, or otherwise transmit respiratory fluids through the air. Most infections do not have symptoms, known as latent tuberculosis (Mason, Roy, Spillane & Singh, 2016).
Although a cure for TB was developed more than 50 years ago, TB remains one of the world's deadliest infectious diseases and is among the top ten causes of global morbidity and mortality (USAID, 2009). The increase in TB incidence is highest in Africa and Asia (World Health organization (WHO), 2009a;2009b). This may be due to high poverty levels that both continents share in common. TB still constitutes a major health problem in Nigeria. This was recently brought to the fore by the WHO, which ranked Nigeria (fourth behind India, Indonesia and China) as one of the six countries which accounted for 60 per cent of the world total TB burden. Some 1.8 million people died from TB in 2015, 0.4 million of whom were co-infected with HIV (WHO, 2016).
The WHO 10-year strategy (2006WHO 10-year strategy ( -2015 to cut down the burden of TB in the world worked elsewhere as it reportedly saved some 37 million lives while some countries halved the prevalence of the disease. But in Nigeria, the reverse is the case. According to the National TB and Leprosy Control Programme (NTBLCP), over 600,000 new cases of TB occurred in Nigeria from a global report conducted in 2014. The fact that substantial numbers of the people infected in the country are unreported or undiagnosed, calls for concern as such aids spread of the disease. Most survival studies on TB have not incorporated occupation as a covariate, the present article does. Gender, TB type and age are other covariates in the model.
The main objective of this research is to model TB data using three parametric survival models: Exponential, Weibull and Log-logistic, with a view to selecting the best based on an information criterion.

Theoretical Framework and Model Specifications
Survival models are analyzed within the context of nonparametric, semi-parametric and fully parametric methods. The nonparametric approaches (Kaplan-Meier, Life-table and Cumulative hazard) only handle univariate cases, they cannot handle multivariate cases. Once the problem is multivariate in nature, there must be recourse to semi-parametric or fully parametric models which handle both univariate and multivariate survival cases.
In describing survival data, two essential functions to be considered are survival function (survivor or survivorship function) and hazard function.
Let T denote a continuous non-negative random variable representing survival time, with probability density function f(t) and cumulative distribution function F(t).
The survival function, S(t) of an individual is the probability that the individual survives until at least time t.
where t is a time of interest and T is the time variable.
The hazard function ) (t h is a related measure of the probability that the event T occurs in the next given that the individual has reached time step t. It is defined Vol. 14, No. 7; The hazard function is the rate of death or failure at an instant t, given that the individual survives up to time t. Hazard function is also known as instantaneous failure rate or the force of mortality (Angela, 2008;Singh & Mukhopadhyay, 2011).
When a model has covariates, survival modeling can be approached from four perspectives: Parametric families, accelerated failure time, proportional hazards and proportional odds. It is not all models that can be formulated using all approaches. However, some can be formulated using more than one approach. For example, exponential and Weibull models can be formulated as both accelerated failure time and proportional hazards models while the log-logistic model can be formulated as accelerated failure time model but not as proportional hazards model. It can however, be formulated as proportional odds model.

Parametric Survival Modeling Framework
Parametric modeling hinges on the principle that the outcome variable T, time to event, follows a specified probability distribution, whose parameters, are characteristically unknown and have to be estimated from available data. That is, they are fully parametric. Examples of these models include exponential model, Weibull model, and lognormal model (Richards, 2012). With a specified probability distribution of survival times, associated survival and hazard functions can be obtained.
For a specified survival times distribution f(t), the survivor function S(t) is defined The most commonly used parametric time-to-event models are the Weibull, log-logistic and log-normal distributions (Kahn & Khosa, 2016).

Proportional Hazards (PH) Modeling Framework
A very popular approach to modeling survival data is to relate survival times to explanatory variables (covariates) so that the effect of the covariates increases or decreases the hazard by a proportionate amount at all durations. That is, the effect of covariates is multiplicative with respect to the hazard h(t). If we denote the average or hazard baseline function by h o (t), the hazard for a particular individual, h(t) is where S(t) is usually written as This effectively means that the hazard function h(t) of a certain individual is a multiple, S(t), of the baseline hazard function, h o (t).
Taking log of (5), we have where X is the set of covariates. greater than 1 indicates that increase in the value of the variable is accompanied by increased event hazard and hence, decreased length of survival (Bradburn, Clark, Love & Altman, 2003).
Under proportionality assumption, the log of the hazard ratio for the i-th subject to the reference group is linear in the covariates. That is, The Cox PH model is based on this framework and is the most commonly used for analyzing time to event data (Kleinbaum & Klein, 2012).

Accelerated Failure Rate (AFT) Models
This framework utilizes the relative length of time to the event rather than hazard or hazard ratio. The defining feature of AFT model is is the baseline function and ϕ is an accelerator factor (hence, the model name) that depends on the covariates according to the scheme The scheme allows for evaluation of effect of covariates on survival time just as hazard ratio enables evaluation of covariates on the hazard. The idea is that each covariate either stretches or shrinks the survival curve along the time axis by a constant relative amountϕ . So, the effect of the covariates is multiplicative with respect to survival time S(t). Ifϕ < 1, there is an acceleration (stretching) of the endpoint and if ϕ >1, there is a delay (shrink). Acceleration factor ϕ is a ratio of survival times corresponding to any fixed value of S(t) ( (Kleinbaum & Klein, 2012). The median survival time for a group of individuals is ϕ times what they would have experienced in the reference group. Under the AFT framework, the survival time variable T is assumed to follow a specific distributional form.
The general form of AFT model is where log (T i ) is the log of survival time; β AFT is the vector of AFT model parameters associated with covariate vector X i Ɛ is a random error term; σ is a scale factor.
Parametric distributional assumption is usually imposed on the error term Ɛ. Different probability distributions of Ɛ and assumption on σ, generate different survival models. Commonly used distributions are exponential, gamma, extreme value, normal and logistic. The AFT formulation, (11) is a model of log-survival times as a linear function of the covariates, X i .
Parameters under AFT and parameters under PH are related as follows:

The Exponential Model
This model is the simplest among the parametric survival models and it assumes constant, hazard over time. The distribution has only one parameter λ which is the mean and the variance. If the distribution times follow the exponential distribution, Substituting (13) into (3), the survivorship function of the model is and substituting (13) into (4), the result is hazard function λ = ) (t h Constant hazard means that the probability of experiencing the event remains same from one time interval to another-unrealistic for most modeling situations. The exponential model is clearly inadequate to model situations with varying hazard as its hazard is independent of time. A way out is to assume a density which results in variable hazard. For such densities, resulting hazards depend on time. The Weibull, log-normal and log-logistic readily come to mind.
Even when many possible options exist for relating the parameter λ to the vector of covariates X, a natural choice is the log-linear model The exponential model can be parameterized as both PH and AFT models.
To motivate exponential model as AFT model, we shall lean on (11). If σ =1 and Ɛ is extreme value (min) distributed, the survival times T i are exponentially distributed. As observed by Machin, Cheung and Parma (2005), the coefficients arising from PH and AFT formulations of the exponential model only differ in signs, implying that i iAFT β β − = This is evident from (12) when σ=1, being a requirement of the exponential model. It is worthy of note that the two categories of coefficients have contrasting interpretations. That is, whereas a positive i β implies that associated covariate x i has a higher hazard and a shorter log survival time, same is communicated by a negative iAFT β .

The Weibull Model
This is a modification of the exponential model. The modification is such that the survivorship and the hazard depend on time, T. The model is the most used of parametric survival models (Kleinbaum & Klein, 2012). If the distribution times follow the Weibull distribution, (3) Parameter α is a shape parameter, usually held fixed and it defines the shape of the hazard function. The presence of this shape parameter in the model offers a great flexibility.

If it is
• less than 1, hazard decreases over time; • equal to 1, a constant hazard results and • greater than 1, the hazard increases with time.
This range of shapes of possible hazard function makes the Weibull distribution attractive for modeling survival times (Machin, Cheung & Parmar, 2005).Parameter λ is the scale parameter and can be reparameterized in terms of the predictor variables (covariates). While holding α fixed, λ can be formulated as done for exponential X h / ) log( β = It is worth mentioning that Weibull reduces to exponential model if α=1. Leaning on the AFT model (11) once again, Weibull model results if the form of the error Ɛ is extreme value distribution with α=1/σ. It is obvious that the model can be formulated as both PH and AFT models. Just like the exponential, the Weibull model is hence, a parametric PH model. It is the most used parametric model in survival analysis (kleinbaum & Klein, 2012) and is the only distribution that is closed under PH and AFT frameworks (Kalbfleisch &Prentice, 2002). It is in the literature (See Machin, Cheung & Parmar, 2005) that PH and AFT parameters for the Weibull are related as follows This can also be deduced from (12) as follows:

The Log-logistic Model
The log-logistic model possesses survivorship function that varies with survival time. Unlike the exponential and Weibull models, log-logistic model can only be formulated as AFT model. It cannot be formulated as PH model but can be formulated as proportional odds model. Just like Weibull, this model has two parameters, λ and γ being scale and shape parameters respectively. To motivate the model as AFT, we assume that the error term Ɛ in (11) Parameter γ is the shape parameter; if it is less than or equal to 1, the hazard decreases continuously with time; and if it is greater than 1, the hazard first increases to a maximum point, then decreases with time. The hazard function is hence, unimodal.

Data
The data was collected from Federal Medical Centre, Bida, Niger State, North Central Nigeria. The natives of Bida are by tribe Nupes. The data used for this study was secondary data collected from 259 TB patients' records covering 2010 to 2016. The covariates of interest in this study were age, gender, type of TB, and occupation. They were coded as follows: All occupations that typically expose the individual to chemicals or any other toxic substance were categorized as Technical.

Model Estimation
All models were estimated by the maximum likelihood method. To motivate discussion of maximum likelihood for survival models, recognition must be given to the fact that censoring is a feature that distinguishes survival data from other types of data. This special nature of survival data imparts on the derivation of the likelihood. mas.ccsenet.org Modern Applied Science Vol. 14, No. 7; For survival data, a subject either experiences the event (fails) or is censored. The likelihood function is a combination of the likelihood for subjects that fail and those censored. If we assume right censoring, we have: Effectively, every subject contributes S(t i ) but only those who fail (i.e. experience the event) contribute h(t i ).
Hence, the likelihood [L(β)] for all subjects is the product over all subjects.
where h i (t i ) is hazard for subject i who fails at time t i and ) ( i i t Λ [which is -log S(t)] is the cumulative hazard for subject i at his failure or censoring time.
Using the exponential model as an illustration, the procedure for estimation follows: Hence, for this model, (26) becomes On differentiating (28) with respect to β 0 , we have the score equation

Models Comparison
Competing models can be assessed in a number of ways. The likelihood ratio (LR) test can be utilized for assessing models with a view to selecting the best. It does this by assessing whether additional parameter improves fit or not. It is however, applicable only when competing models are nested. Exponential model is nested within Weibull as the former is a particular case of the latter (when α=1). However, the log-logistic is not nested within either of the two models. This makes the comparison of the three models using the LR test inappropriate. Whenever such situation arises, the ideal is to resort to the use of an information criterion. An information criterion is useful for comparing both nested and non-nested models. It is a relative measure only as it does not inform about the quality of models, it only ranks the models being compared in order of goodness of fit. A number of such criteria exist. But the model selection criterion adopted in this article was Akaike information criterion (AIC).
The AIC proposed by Akaike (1974) is a measure of the goodness of fit of an estimated statistical model. AIC is given by where k is the number of model parameters. Lower AIC indicates better likelihood. That is, lower AIC values indicate a better model. AIC is probably the most used of information criteria.

Results and Discussion
The data consisted of 259 registered TB patients treated between January 2010 to December 2016 at the Federal Medical Centre, Bida. Out of these patients, 148 representing 57.1% were males while 111 representing 42.9% were females. Sixty-six percent of the patients had Pulmonary TB while 34% had Respiratory TB. Patients aged between 0 -20 years constituted 34.7% , those aged between 21 -40 years were 30.9% while those aged between 41 -60 years constituted 24.7% of the total. Patients aged 61 years and above constituted 9.7%. This suggests that the most youthful of the age groups (0-20 years) recorded highest incidence. Those on jobs regarded as technical constituted 59.1%. Tables 1 to 3. Table 1 presents the parameter estimates, the corresponding hazard ratios and associated standard errors for each of the three models involved.  Table 1) suggest that for each additional year to age of TB patient, the hazard rate (HR) increased by 0.4% (H.R = 1.004) holding the effects of other covariates constant, suggesting lower survival times for older patients. That is older patients recover faster than younger ones. With hazard rate of 0.927 for gender, male patients had 7.3% hazard rate lower compared to females, implying that males had higher survival times than females, other covariates being constant, implying lower recovery period for females. Hazard rate of 1.051 for TB type implies that PTB patients recorded 0.51% hazard rate higher than patients in RTB category, holding other covariates constant. Hence, PTB patients had reduced survival times compared to RTB patients. Occupation with hazard rate of 1.050 implies that patients on technical jobs had 5% hazard rate higher than others, holding the effects of other covariates constant. This simply means that patients on technical occupation had reduced survival times when compared to patients on other occupations. That is, those on technical occupation recover faster than those on other occupations. It is worth noting that all hazard ratios of this model are not statistically significant at 5% level (See Table 2). For the Weibull model (See Table 1) for a year increment in age of patients, the hazard rate increased by 0.3%, holding the effects of other covariates constant, thereby suggesting lower survival times for older patients. That is, lower times to recovery from TB for older patients compared to younger patients. This is a deviation from the expected as previous studies suggested otherwise. This may not be unconnected with the youthful nature of the subjects. Whereas those aged 0-40 years constituted 65.6% of the subjects involved in the study, only 9.7% were aged 61 years and above. This domination by the youths could be responsible for this inconsistent result. Age was recognized as one of major risk factors with survival of TB patients in several results in the past. Borgdorff, Buu, Houben, Quy, Lan and Cobelens (2007) identified age greater than 65 years as one of risk factors in survival of TB patients in Netherlands. Gopi., Vasantha and Subramani (2008) identified age as a risk factor among TB patients in South India. Pardeshi (2009) identified TB patients above 40 years as recording higher death rates than those younger. Balaky, Mawlood and Shabila (2019) observed lower survival rates among TB patients aged 65 years and above. They related this to possible presence of age-related diseases like cardiovascular diseases, diabetes mellitus and malignancy. Tolosie and Sharma (2014) observed that TB patients had statistically significant differences in survival experience due to age among other factors. But Ajagbe, Kabair and O'connor (2014) concluded no significant difference in survival experience of male and female and among age groups. In these studies, event was death.

Results of analysis are presented in
HR of 1.092 for gender suggests 9.2% higher HR for male than for female TB patients. This implies lower survival times for males compared to female TB patients. That is, it took male TB patients shorter periods to recover than females holding other covariates constant. Balaky, Mawlood and Shabia (2019) observed higher, though not significant survival rates among male than female TB patients. Impact of gender on survival has presented a number of inconsistent results. Feng et al. (2012) in a study in Taiwan observed that more than three-fourths of TB patients were males and generally older than the females. According to Feng et al., (2012) higher mortality has been reported for males in Mexico, India and Italy. Limenih and Workie (2019) identified gender as one of the risk factors in their study of survival times to cure for TB patients in Ethiopia. The study concluded male multidrug-resistant TB patients experiencing longer recovery times than females. A similar study by Brust, Carrara, Osbum and Padayatchi (2010) in South Africa corroborated this. Sacks et al. (1988), Borgdorf (2007 and Zawdie, Enqueselasie and Tesfay (2015) confirmed higher hazard rates for female TB patients in South Africa, Netherlands and Ethiopia respectively. It is worthy of note that event is these cases was death and not recovery. A Cox PH analysis by Bolarinwa and Michael (2020) observed lower HR for male TB patients than for females in a recent study in which event of interest was recovery. So result of Cox analysis of same data is consistent with that of Weibull for the covariate, gender.
PTB patients had lower HR compared to RTB patients (H.R = 0.963). Hence, PTB patients had higher survival times than RTB patients. That is, it took PTB patients longer periods to recover from the disease than RTB patients. Tolosie and Sharma (2014) did not identify TB type among others as a survival risk factor in a study in Ethiopia. However, Domingos, Caiaffa, and Colosimo (2008) in a similar study in Brazil, Munoz-Sellart, Hargreaves, Gausi, Kwanjana and Salaniponi (2001) in Malawi, and Borgdorff, Veen, Kalisvaart, and Nagelkerke (1998) in Netherlands observed TB type as a significant factor for mortality of TB patients. Omari-Sasu, Owusu, Boateng and Sabogu (2016) found TB type among factors responsible for varying survival experience among TB patients in Ghana. Compared to the exponential model, Weibull model has made contrary suggestions as regards hazards for gender, type of TB and occupation.
TB patients on technical jobs had lower hazards compared to others. Occupation is a category that has not been captured as frequently as age and gender in survival studies in medical research. However, Akinyemi, Sholanke and Odimegwu (2018) in a study identified type of occupation of parents as a risk factor for both infant and child mortality in Nigeria. Saroj, Murthy, Kumar, Singh and Kumar (2019) found husband's/partner's occupation significant in a study on under-five child mortality in India. Bolarinwa and Michael (2020) identified occupation as a component of the best of fifteen fitted Cox PH models of Tuberculosis data.
According to the log-logistic model, HR was constant for all ages. This contrasts with the position of the other two models. HR of 0.999 for gender implies a lower hazard (by 0.01%) for male than for female patients. The reduced hazard for males is in harmony with results obtained for the exponential model. HR of 1.003 for Type of TB implies 0.3% increase in the hazard for PTB compared for RTB patients. This is in agreement with the position of the exponential model. Type of occupation has a unit HR, suggesting constant risk for technical and other occupation. This contradicts the other two models which also maintain contrasting positions. It should be noted that parameters and hence, hazard ratios of this model are not statistically significant at 5% level (See Table 2), just like those of exponential model.  Table 2 presents the significance values (p-values) of parameters of each of the three models under consideration. The exponential model found all covariates insignificant at 5% level. This should be suggestive of a poor fit. High reported p-values suggest high level insignificance of model parameters. This means that going by this model, none of the covariates had a statistically significant influence on survival (time to recovery) of TB patients. The Weibull model found all covariates but type of TB significant at 5% level. Having three of four parameters significant may be indicative of a good fit. As far as this model is concerned, age, gender and occupation of TB patient exerted statistically significant influence on recovery from TB disease. The performance of the log-logistic model was similar to that of the exponential model as all covariates were found insignificant at 5% level. The Weibull model has performed better than the other two on the basis of proportion of significant parameters and hence, deserves better attention than its competitors.  Table 3. The Weibull model has the least AIC and has hence, performed best, followed by log-logistic and the exponential model performed worst. The performance of the Weibull above others based on AIC could not be attributable to chance but rather a corroboration of results of significance testing of models parameters presented in Table 2. This is consistent with results for Chere and Tesfay (2015) and Saroj, Murthy, Kumar, Singh and Kumar (2019) in which Weibull outperformed exponential, lognormal and log-logistic models based on AIC for a study on TB data in Ethiopia and India respectively. Little wonder that this model is most used of all parametric models in analyzing survival data.
This study is not without its limitations. Identified limitations follow: The data used is secondary and bias cannot be ruled out; among TB patients, time to recovery may be associated with presence of underlying illnesses which are not incorporated into the study; and age distribution of the subjects involved in the study is not necessarily representative of typical situation in the society.

Conclusion
This study has analyzed times to recovery of TB patients using the exponential, Weibull and log-logistic models. The Weibull model which found age, gender and occupation significant, outperformed exponential and log-logistic models on the basis of AIC. Weibull model hence, suggests that age, gender and occupation were the major determinants of recovery period of TB patients. The Management of Federal Medical Centre, Bida, Nigeria and other organizations involved in TB management could make use of Weibull model to fit and predict both the survival and hazard rates of TB patients.