A Novel Method for Computationally Efficacious Linear and Polynomial Regression Analytics of Big Data in Medicine

Background: Machine learning relies on a spectrum of analytic techniques, including regression analyses. To date, there have been no published attempts to deploy a scale-down transformation of data to enhance linear regression models.

Most importantly, these optimized models will not only be more accurate in terms of statistical inference but will also be economical in terms of the computational processing demands for the analyses of large-scale arrays of data (Schilling et al., 2005). Scientists use several data transformation techniques to boost a spectrum of statistical tests, including the Fourier transformation, the Log Base-10 (Log10) transformation, the natural logarithm (Ln) transformation, and the inverse transformation, as well as the square root and cubic root transformations (O'Hara and Kotze, 2010; Takeda et al., 1982). The scale-down optimization of data can capitalize on powerful and economical computational processing for real-time analyses and predictive models (Schilling et al., 2005). In 1965, the British statistician Austin Bradford Hill proposed nine criteria to provide evidence for causality between a presumed cause and an observed effect (Phillips and Goodman, 2004). Hill proposed to analyze the strength of association, the consistency (replicability) of results, the specificity of association, the temporality of causation, and the biological gradient (dose-response) effect, as well as plausibility, coherence, experimentation, and analogy (Fedak et al., 2015; Phillips and Goodman, 2004). If researchers and data analysts integrate optimized linear or polynomial models with Hill's criteria, they can draw robust inferences that possess the least prediction error and the highest statistical power, while keeping the human resources and the requisite computational infrastructure to a minimum.
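For illustration only (the analyses in this article were conducted in Excel, SPSS, GNU Octave, and MATLAB), a minimal NumPy sketch of the conventional transforms listed above might look as follows; the variable x and the seed are hypothetical:

```python
# Illustrative sketch of common data transforms; not the article's own code.
import numpy as np

# A strictly positive, hypothetical variable (log and inverse require positivity).
x = np.abs(np.random.default_rng(42).normal(loc=10, scale=2, size=1000))

log10_x = np.log10(x)      # Log Base-10 (Log10) transformation
ln_x = np.log(x)           # natural logarithm (Ln) transformation
inverse_x = 1.0 / x        # inverse (reciprocal) transformation
sqrt_x = np.sqrt(x)        # square root transformation
cbrt_x = np.cbrt(x)        # cubic root transformation
fourier_x = np.fft.fft(x)  # discrete Fourier transformation
```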
Our primary objective is to optimize linear and polynomial models, principally for analytics that depend on correlation and regression statistics, by implementing a scale-down transform function that significantly reduces the error of residuals by minimizing the sum of squared errors (SSE), thereby achieving more powerful and externally valid models that apply to real-time analytics, as well as the predictive models necessary for high-impact research based on big data (Al-Imam, 2017; Al-Imam, 2019; Al-Imam et al., 2019).

Mathematical Simulations
We ran multiple simulations based on a random number generator that follows a normal distribution [mean = 0, standard deviation = 1]. We created 40 trials (i.e., simulation models) for linear regression calculations [k = 40]; each trial had a sample size of one thousand observations [n = 1,000] for two variables, a predictor and an outcome (X and Y), summing to a grand sample size of 40,000 [n_total = 40,000]. We transformed the two variables by dividing each observation by the maximum observation within the same variable, using the MAX function in Excel 2016, thereby scaling them down. Within each linear model, we calculated correlation and regression statistics, including the sum of squares (SS), mean of squares (MS), F statistic [ANOVA], and p-value [regression]. We calculated the sum of squared errors (SSE) using the formula SSE = Σ(y − ŷ)², where ŷ is given by the regression equation ŷ = b₀ + b₁X. Calculations were conducted twice, before [pre-optimization] and after deploying the scale-down transformation [post-optimization]. We statistically tested the performance of the scale-down optimization model using the Wilcoxon signed-rank test for non-parametric within-subjects statistical inference, comparing the pre-optimization versus post-optimization statistics. Ultimately, we further examined the optimization efficacy of our model by implementing Cronbach's alpha as a measure of the internal consistency of the summative optimized model.
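The following minimal Python sketch approximates the simulation protocol described above, under stated assumptions: the original workflow used Excel, SPSS, Octave, and MATLAB, so the SciPy-based implementation, the seed, and the function names here are illustrative substitutes rather than the authors' code:

```python
# A hedged sketch of the simulation protocol: 40 trials, n = 1,000 each,
# SSE computed before and after the max-division scale-down transform.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # hypothetical seed
K, N = 40, 1000                 # 40 trials, 1,000 observations per trial

def sse_of_fit(x, y):
    """Fit y-hat = b0 + b1*x by ordinary least squares and return the SSE."""
    b1, b0, *_ = stats.linregress(x, y)
    y_hat = b0 + b1 * x
    return np.sum((y - y_hat) ** 2)

sse_pre, sse_post = [], []
for _ in range(K):
    x = rng.normal(0, 1, N)  # predictor, N(0, 1)
    y = rng.normal(0, 1, N)  # outcome, N(0, 1)
    sse_pre.append(sse_of_fit(x, y))
    # Scale-down transform: divide each observation by the maximum
    # observation within the same variable.
    sse_post.append(sse_of_fit(x / x.max(), y / y.max()))

# Non-parametric within-subjects comparison of pre- vs post-optimization SSE.
w, p = stats.wilcoxon(sse_pre, sse_post)
print(f"Wilcoxon signed-rank: W = {w:.1f}, p = {p:.2e}")
```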

Statistical Analysis, Ethics, and Level of Evidence
We implemented the Statistical Package for the Social Sciences [IBM SPSS version 24] and Excel [Microsoft Office 2016] with the integrated Data Analysis ToolPak. We computed descriptive statistics using Excel and GNU Octave version 5.1.0 [GNU's Not UNIX Project]. We implemented the MATLAB high-level programming language (HLL), version R2019a [MathWorks], for two-dimensional array transposition before exporting the data to SPSS for Cronbach's alpha calculations. We conducted an elaborate set of parametric and non-parametric, non-Bayesian statistical analyses, including linear and polynomial regression, Fisher's ANOVA, the Wilcoxon signed-rank test for within-subjects study designs, and Cronbach's alpha analytics for assessing the reliability and internal consistency of our proposed statistical model based on the scaling down of the data.
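For readers without SPSS, a hedged sketch of the Cronbach's alpha computation (assuming the conventional formula; the article does not detail the SPSS settings used) could look like this:

```python
# Cronbach's alpha via the standard formula: alpha = k/(k-1) * (1 - sum of
# item variances / variance of the summed scale). Illustrative data only.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array of shape (observations, items/trials)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical example: treat each of the 40 trials as an "item".
rng = np.random.default_rng(1)
data = rng.normal(size=(1000, 40))
print(cronbach_alpha(data))
```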
The authors conducted the work described in this article following the Code of Ethics of the World Medical Association (Declaration of Helsinki) on medical research involving human subjects, EU Directive 2010/63/EU on the protection of animals used for scientific purposes, the uniform requirements for manuscripts submitted to biomedical journals, and the ethical principles defined in the Farmington Consensus of 1997. According to the Oxford Centre for Evidence-Based Medicine (OCEBM), our research represents "Absolute Better-Value or Worse-Value Analyses" under the category "Economic and Decision Analyses" (Greenhalgh, Howick, & Maskrey, 2014; OCEBM Levels of Evidence, 2016). Accordingly, our study is of level 1c, which belongs to the top tier [level 1, Grade A] of the categorization scheme ratified by the OCEBM (OCEBM Levels of Evidence, 2016).

Systematic Review of the Literature
During September 2019, we conducted a pragmatic review of the databases of peer-reviewed literature, including the Cochrane Library [the Cochrane Database of Systematic Reviews | the Cochrane Collaboration], PubMed [the United States National Library of Medicine], and Embase [Elsevier]. We implemented an exhaustive set of keywords based on medical subject headings (MeSH), in addition to generic terms, while using Boolean operators and truncations. We deployed keywords across five main themes: 1) machine learning and artificial intelligence, 2) real-time and predictive analytics, 3) real-time analytics and epidemiology, 4) data transform functions, and 5) an amalgamation of the previous four themes. The aim was to explore the existing literature for prior attempts to use scale-down data transformation to enhance and optimize linear models.

Results
For the optimization model, we applied the scale-down transform to 40 trials of linear regression analyses (Table 1). The model attained a significant reduction of the sum of squared errors (SSE) for each trial following the application of the scale-down transform [absolute Z-score = 5.511, effect size = 0.779 (i.e., a strong effect), p-value < 0.001 for the Wilcoxon signed-rank test] (Table 2). We utilized a non-parametric alternative to the dependent (paired) Student's t-test because the t-test assumptions, including the absence of statistical outliers, homoscedasticity, and normality of distribution [Shapiro-Wilk test], were violated (Table 2). On the other hand, there was no significant change in the coefficient of determination (R² score) or the F-score for the pre-optimized versus post-optimized trials, as we created each trial with a random number generator function using the Data Analysis ToolPak plugin in Excel. A randomly selected linear model, the 34th trial, manifested a sum of squared errors of 9.96E+08 [pre-optimization] versus 76.484 [post-optimization], confirming a significant SSE reduction and a better-fitting predictive model. The scale-down transformation had neither a distortion nor an artefactual effect on the scattered correlates of the tested variables (Figure 1). Lastly, Cronbach's alpha analysis yielded collateral evidence and verified the internal consistency of the optimization model [Cronbach's alpha = 0.993]. Deleting any trial from the optimization model had no effect on the inter-item reliability, with the exception of five simulations [1st, 4th, 5th, 6th, and 10th], the deletion of which increased the internal consistency to 0.998, an almost perfectly consistent model.
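The article does not state which effect-size convention underlies the reported value; one common convention for the Wilcoxon signed-rank test is r = |Z| / √N, sketched below with the reported Z-statistic and a hypothetical N:

```python
# A hedged sketch of one common (assumed) effect-size convention for the
# Wilcoxon signed-rank test; the article's own formula is not stated.
import math

def wilcoxon_effect_size(z: float, n: int) -> float:
    """r = |Z| / sqrt(N), where N is the number of observations in the test."""
    return abs(z) / math.sqrt(n)

# Reported |Z| = 5.511; the value of N here is an illustrative assumption.
print(round(wilcoxon_effect_size(5.511, n=50), 3))
```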

Discussion
Our optimization model applies to anticipated high-impact research that requires linear or polynomial model analyses (Figure 1), including research and practice in the anatomical sciences, dermatology, and medicine at large.
Boosted regression models are of utmost importance in the exponentially growing field of machine learning and artificial intelligence. The applications are not limited to research on psychoactive and novel psychoactive substances, an emerging subdiscipline of addiction neuroscience and behavioral psychiatry. Optimized regression analytics are invaluable for applications involving extensive data analytics and bioinformatics, comprehensive genomic analyses, and analytics based on extracting information from open-source deposits of big data, for instance, the Google Trends and Google Analytics databases. Optimum linear and polynomial models will not only reinforce hypothesis testing for more powerful inferences but will also lessen the computational processing power and the human resources allocated to demanding real-time and predictive analyses. If our optimization model is integrated with the anticipated advent of quantum computing, the benefits will be monumental concerning the precision of analytics and the efficiency of computational processing.
Machine learning relies upon the analyses of big data using a plethora of well-established techniques from mathematical and data science models, including artificial neural networks, regression analysis, and decision trees (Jordan and Mitchell, 2015). Artificial intelligence techniques attempt to reach the lowest achievable error rates of mathematically interpreted predictions for causality associations (Everitt, Goertzel, & Potapov, 2017). Machine learning offers unprecedented benefits for applications related to the spatio-temporal description and prediction of phenomena of interest, including epidemiological and digital epidemiological investigations (Everitt, Goertzel, & Potapov, 2017; Jordan and Mitchell, 2015). The infrastructure of big data upon which machine learning algorithms operate is the same as that designated for classical epidemiology and digital epidemiological research (Rothman, Greenland, & Lash, 2008). Researchers can retrieve data from the databases using survey tools, internet snapshots, longitudinal studies, cross-sectional studies, analyses of web-based social networks, and electronic commerce website analytics of the surface web as well as the deep web, including the infamous darknet hypermarkets (Al-Imam A and Al-Shalchi, 2019; Al-Imam, 2017; Motyka and Al-Imam, 2019; Rothman, Greenland, & Lash, 2008).
We reviewed the literature using a combination of thematic keyword searches. There were 55,288 publications indexed across the Cochrane Library (117, 0.21%), PubMed (40, 0.07%), and Embase (55,131, 99.71%) (Figure 2). Following full-text retrieval of the papers of interest, only fifteen publications (0.03%), all indexed in the National Library of Medicine, were found relevant to the primary objective. However, none of these studies implemented our data transform method to boost linear or polynomial regression models. Over the last decade, there have been several attempts in the peer-reviewed literature to implement linear models as well as other machine-learning methods in combination with data transform functions, including logistic regression, regression trees with the Fourier transform, logistic regression with Log10 transformation, logistic regression with Ln transformation, multiple linear regression with Log10 transformation, a cycling regression model with the Fourier transform, the proportional hazards Cox regression model, time-series regression analytics with the Fourier transform, logistic regression with square root and Log10 transformations, and the proportional hazards model in combination with logistic regression (Lorenz et al., 2017; Menotti, Puddu, & Lanti, 2002; Shaban-Nejad, Michalowski, & Buckeridge, 2018).

Conclusion
Our novel transform and optimization method serves three primary purposes: 1) it reduces the sum of squared errors (SSE), which provides a better line of best fit; 2) the scale-down transformation significantly reduces the computational processing demands of mathematical calculations for big data with an extensive list of variables, as well as an extended number of observations per variable, which is tangible in multiple polynomial regression analyses; and 3) real-time processing of correlations and regression among exhaustive multidimensional arrays of data will be even more demanding in terms of computational processing power, to the extent of burdening the supercomputers of today and the near future. The optimization transforms all variables into a narrower range with limited decimal places and without deforming the original correlation of the variables, which is economical for subsequent mathematical and computational processing.
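As a closing illustration (ours, not part of the article's analyses), the following sketch demonstrates the claim that max-division leaves the original correlation of the variables intact, since the Pearson correlation is invariant under positive linear scaling:

```python
# Minimal demonstration that dividing each variable by its own maximum
# preserves the Pearson correlation; data and seed are hypothetical.
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(0, 1, 1000)
y = 0.5 * x + rng.normal(0, 1, 1000)

r_pre = np.corrcoef(x, y)[0, 1]
r_post = np.corrcoef(x / x.max(), y / y.max())[0, 1]
assert np.isclose(r_pre, r_post)  # identical up to floating-point error
```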

Availability of Data
Our data are available upon request from the corresponding author.