An Improved Comparison of Chemometric Analyses for the Identification of Acids and Bases With Colorimetric Sensor Arrays

Colorimetric sensor arrays incorporating red, green, and blue (RGB) image analysis use value changes from multiple sensors for the identification and quantification of various analytes. RGB data can be easily obtained using image analysis software such as ImageJ. Subsequent chemometric analysis is becoming a key component of colorimetric array RGB data analysis, though the literature contains mainly principal component analysis (PCA) and hierarchical cluster analysis (HCA). Seeking to expand the chemometric methods toolkit for array analysis, we compared the performance of nine chemometric methods for the task of classifying 631 solutions (0.1 to 3 M) of acetic acid, malonic acid, lysine, and ammonia using an eight-sensor colorimetric array. PCA and linear discriminant analysis (LDA) were effective for visualizing the dataset. For classification, LDA, k nearest neighbors (KNN), soft independent modelling by class analogy (SIMCA), recursive partitioning and regression trees (RPART), and hit quality index (HQI) were very effective, with each method classifying compounds with over 90% correct assignments. Support vector machines (SVM) and partial least squares discriminant analysis (PLS-DA) struggled, with ~85% and 39% correct assignments, respectively. Additional mathematical treatments of the data set, such as incrementally increasing the exponents, did not improve the performance of LDA and KNN. Literature precedent indicates that the most common methods for analyzing colorimetric arrays are PCA, LDA, HCA, and KNN. To our knowledge, this is the first report comparing and contrasting several more diverse chemometric methods applied to the same colorimetric array data.


Introduction
The examination of digital images in analytical chemistry has increased by more than 87% from 2005 to 2015, tracking with the increased availability of imaging devices (Capitán-Vallvey, López-Ruiz, Martínez-Olmos, Erena, & Palma, 2015). In particular, colorimetric tests and arrays have greatly benefited from the enhanced qualitative and quantitative analysis provided by color space techniques (Askim, Mahmoudi, & Suslick, 2013). Colorimetric arrays are typically composed of 3-40 sensors that can interact with analytes and change color upon molecular interactions (Burks et al., 2010; Li, Jang, Askim, & Suslick, 2015; Salles, Meloni, de Aaujo, & Paixão, 2014). Various types of color-changing sensors have been utilized in sensor arrays, including pH indicators, metalloporphyrins, solvatochromic dyes, redox indicators, metal salts, ionic liquids, and nanoparticles (Askim et al., 2013; Galpothdeniya et al., 2015). Potential analyte-sensor interactions leading to colorimetric changes include Lewis acid/base interactions, hydrogen bonding, π-π interactions, and dipole-dipole interactions. Array sensor selection typically depends on an analyte's chief mode of interaction. For example, an acidic or basic analyte would warrant pH indicators as sensors, while the detection of a metal ion would point to complexometric sensors (Ariza-Avidad et al., 2014). These analyte-sensor interactions give colorimetric sensor arrays their versatility and broad applicability (Suslick, 2004). Effective arrays typically meet the following criteria: high selectivity, high sensitivity, the ability to detect many analytes with the fewest sensors, and RGB data that can be analyzed via statistical analysis methods.
[...] were prepared by dissolving each into aliquots of a solvent mixture consisting of acetate buffer (0.1 M, pH 5), ethylene glycol, triethylene glycol monobutyl ether, and glycerol in a ratio of 14:1.6:1:3.2. The sensor solutions were sonicated for 1 hour in a bath sonicator (30 °C), followed by 5 minutes of mixing with a probe sonicator, and then vacuum filtered twice through Whatman #1 filter paper. Acetic acid (0.1-3 M) and ammonia (0.1-3 M) solutions were prepared by diluting concentrated reagent solutions with Milli-Q water (18 MΩ-cm). Solutions of malonic acid (0.1-2 M) and lysine (HCl salt, 0.1-2 M) were prepared by dissolving appropriate amounts of the analytes in Milli-Q water. The sensor array was laid out in a 96-well plate as shown in Figure 2 by dispensing 100 µL of each sensor in designated rows. The same volume of an analyte or control was added to the 12 columns of the well plate. To explore reproducibility, each plate contained 4 replicates of a water control and 8 replicates of each analyte.
Figure 2. Sensor array housed in a 96-well plate, with each sensor placed in a designated row. For each array, the first four columns are controls (water), and the final eight columns are analyte (shown above: 0.5 M lysine). The black boxes highlight color differences between the control (water) and the analyte (0.5 M lysine).
All array images, including Figure 2, were collected as 24-bit color images using an Epson Perfection V700 desktop scanner in transparency mode. To eliminate interferences from stray light, the scanner was draped in black cloth. The images were analyzed with ImageJ (Schneider, Rasband, & Eliceiri, 2012), and the extraction of mean RGB values for each well was automated with a macro (Lyon et al., 2012; Soldat, Barak, & Lepore, 2009). No attempts were made to correct for image-to-image variation by subtracting a control row, as our previous work showed such a correction to be unnecessary (Kangas, 2018). The RGB dataset is provided in the supplemental information as SI.4 to facilitate further chemometric studies.
All statistical analysis was performed using the statistical programming language R. PCA was performed using the function prcomp. The data was mean-centered but was not scaled to unit variance, because all of the data was on a consistent scale of 0 to 255 RGB units. Loading plots are included in the supplemental information as SI.2. Score plots of the resulting data were constructed using the function pca2d from the library pca3d (Weiner, 2017). Hierarchical clustering was conducted in agglomerative mode using Ward's method based on Euclidean distances using the hclust function from the stats library (R Core Team, 2017). LDA was performed using the MASS library (Venables & Ripley, 2002), and the classification ability of the LDA model was tested using all-but-one cross-validation. Each concentration of each analyte was treated as a separate class, and for classification the prior probabilities were equal for all classes. KNN based on Euclidean distances was performed for k = 1, 3, 5, 7, and 9 using the class package (Venables & Ripley, 2002). HQI values were calculated using a custom R program given in the supplemental information (Supplementary Figure SI.1). Classification was performed using all-but-one cross-validation, and classifications were assigned based on the library sample with the highest HQI value. PLS-DA was performed using the plsDA function from the DiscriMiner library with leave-one-out cross-validation (Sanchez, 2013). For RPART, SVM, and SIMCA, the data was randomly split into a training set containing 75% of the data and a test set comprising the remaining 25%, and the analysis was repeated in triplicate with different train/test sets. RPART was conducted using the rpart package (Therneau, Atkinson, & Ripley, 2017). SVM was performed using the rrcovHD package (Todorov, 2016). SIMCA was performed using the CSIMCA function from the rrcovHD package.
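The all-but-one (leave-one-out) cross-validation described above can be sketched in a few lines. This sketch is written in Python rather than the R used for the actual analysis, and the nearest-class-mean classifier, the function names, and the toy RGB values are hypothetical stand-ins chosen only to keep the example self-contained; they are not the methods or data of this study.

```python
import math
from collections import defaultdict


def nearest_mean_classifier(train_x, train_y, query):
    """Assign `query` to the class whose mean vector is closest (Euclidean)."""
    groups = defaultdict(list)
    for row, label in zip(train_x, train_y):
        groups[label].append(row)
    best_label, best_dist = None, math.inf
    for label, rows in groups.items():
        mean = [sum(col) / len(rows) for col in zip(*rows)]
        dist = math.dist(mean, query)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def leave_one_out_accuracy(x, y, classify):
    """Hold out each sample once, train on the rest, and score the prediction."""
    correct = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if classify(train_x, train_y, x[i]) == y[i]:
            correct += 1
    return correct / len(x)


# Toy data: three hypothetical RGB-like readings per sample (not from the paper).
toy_x = [[200, 40, 30], [205, 42, 28], [198, 38, 33],
         [60, 180, 90], [58, 185, 95], [63, 178, 88]]
toy_y = ["acid", "acid", "acid", "base", "base", "base"]

loo_accuracy = leave_one_out_accuracy(toy_x, toy_y, nearest_mean_classifier)
print(f"leave-one-out accuracy: {loo_accuracy:.2f}")
```

Because the classifier is passed in as a function, the same loop could in principle wrap any of the classifiers compared here; only the held-out sample changes from one iteration to the next.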

Results and Discussion
In the present study, the selected sensors were chosen for many of the reasons listed in the Introduction; in addition, our previous work (Kangas, 2018) showed these sensors to be well-suited for the qualitative and quantitative classification of NaOH and HCl solutions. As shown in Figure 2, some color differences between the control and analyte test wells are easily visible to the naked eye. However, some sensor changes are more subtle and are not immediately visible to a user. Thus, with chemometric analysis the investigator can more easily detect and use subtle color changes identified by image analysis for the identification and quantification of analytes.

Principal Component Analysis
PCA is one of the more commonly used chemometric analysis methods for large data sets collected from colorimetric tests or sensor arrays (Capitán-Vallvey et al., 2015). PCA is an algorithm that uses an orthogonal transformation to convert a set of observed, and possibly correlated, variables into a set of linearly uncorrelated variables known as principal components (Graham, 1993). For colorimetric sensors, this usually means that the principal components are statistically weighted combinations of the R, G, and B values from all the sensors. Although many components are calculated, only a few are typically needed to visualize and analyze trends in the data set. In this study, for each array, 24 variables were available in the original data set (i.e., 8 sensors x 3 channels = 24 variables). However, only 4 principal components were required to account for 95% of the variance in the data set. The number of components needed to describe the variance in the data set was consistent with observations from other colorimetric sensor array studies (Li et al., 2015; Salinas et al., 2014).
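The component count quoted above follows from the cumulative fraction of variance explained. A minimal Python sketch of that bookkeeping is given below (Python is used here for illustration; the analysis itself was done with prcomp in R). The function name and the example variances are hypothetical and were chosen only to illustrate the calculation; they are not the values obtained from the array data.

```python
def components_needed(variances, threshold=0.95):
    """Return how many leading components reach `threshold` of total variance."""
    total = sum(variances)
    running = 0.0
    for count, variance in enumerate(sorted(variances, reverse=True), start=1):
        running += variance
        if running / total >= threshold:
            return count
    return len(variances)


n_sensors, n_channels = 8, 3
n_variables = n_sensors * n_channels  # 24 RGB variables per array

# Hypothetical component variances (illustrative only, not the paper's values).
example_variances = [52.0, 25.0, 12.0, 7.0, 2.0, 1.0, 0.6, 0.4]

print(n_variables)
print(components_needed(example_variances))
```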
When analyzing with PCA, variables that are strongly correlated typically remain closely related when converted to the new components. Similarly, data points that are clustered in the original data set can usually be found together in the principal component space, thereby allowing the visualization of similar data when multiple components are plotted together (Graham, 1993). In our work here, the biplots of PC1 and PC2 (Figure 3) show that water, each acid, and each base create distinct clusters in the plot. These groupings indicate the ease with which each analyte of study can be identified, whereas the biplot of all analytes (Figure 3) struggles to distinguish malonic from acetic acid in the same space. However, in the biplot of PC2 and PC3 (Supplementary Figure SI.2), the distinction between acetic and malonic acid is much more apparent, while the distinction among water, ammonia, and lysine is less so. The individual concentrations of the analytes were also examined in the PC1 vs PC2 biplot. Again, the acids did not perform well, with very little selective grouping to mark the concentration changes, while ammonia showed very distinct clusters by concentration below 2 M. It was also observed that the low concentrations of lysine, 0.1-1 M, were very difficult to distinguish from one another, while there was a distinct separation between those and the 2 M samples. However, some distinction by concentration arises in all groups when viewed in the PC2 vs PC3 biplot, suggesting that this is a better space in which to observe the clustering and concentration differences found by PCA (Supplementary Figure SI.2).
Figure 3. Biplots of PCA results for acetic acid, ammonia, lysine, malonic acid, and water by concentration, and for all analytes together, showing analyte and concentration grouping as viewed by the first and second components. PC1 and PC2 are the first and second principal components, respectively.
By observing the principal component values for each channel, PCA can also be used to determine the sensors with the most influence within each component. For instance, as observed in Supplementary Figure SI.3, PC1 appears to be dominated by the red channels of CR, EB, AY, and UV and the green channels of AY and PH. In PC2, the red channels of EB and AY are strongly influential, while the red channels of CR, UV, and BB and the green channels of AY and PH are only mildly so. Finally, PC3 is strongly dominated by the red channels of UV and BB and the blue channel of AY, while only mildly influenced by the green channel of AY. This may explain why the acid samples are not well distinguished in the PC1 and PC2 biplot: both components are strongly dominated by similar red channels, especially EB and AY, while PC3 is strongly influenced by other dyes in different channels. In addition to visualizing the data with the scores plot, PCA can also be used to determine which sensors are responsible for the analyte discrimination by analyzing the loading plots (Supplementary Figure SI.3). This information could be used for sensor selection and array optimization, as the loading plots reveal which sensors make the biggest contributions toward analyte detection, as evidenced by the highest loadings on the y-axis.
Overall, the plots show that PC1 and PC2 space provides excellent separation of ammonia, lysine, and water, but not of acetic and malonic acid. This is likely because PC1 and PC2 are strongly dominated by similar sensor channels.
To achieve separation of the acids and their concentrations, one must look at PC2 and PC3 space, which provides better separation of the acids but loses the separation of the bases; as with the acids, this seems to coincide with sensor pKa values and the complementary color change. This indicates that simply relying on two dimensions of PCA is not best for identifying the acids and bases tested in this study, and at least three components should be used. Furthermore, by viewing the single component variables (Supplementary Figure SI.2), the array of dyes can be improved by noting those sensors that strongly influence each component (i.e., CR, EB, AY, UV, BB, PH) and those that have very little influence in the first three components (i.e., CV and ER). It may be advisable to examine other components to see if those sensors have any influence elsewhere in the space, or to exchange non-influential sensors for others that may provide more information.

Hierarchical Cluster Analysis
Similar to PCA, HCA is an unsupervised, unbiased multivariate clustering analysis that is commonly used to analyze colorimetric arrays (Bueno, Meloni, Reddy, & Paixão, 2015; Capitán-Vallvey et al., 2015). When performing HCA, the analysis receives no information on the classes of the samples except for the 24 variable values per sample. Therefore, the grouping of samples is determined by closeness, typically a Euclidean distance, in that 24-dimensional space. This clustering is an iterative process resulting in a tree of relational closeness, where well-related samples are near to one another on the tree while samples with less relation are further away. Compared to other clustering algorithms, HCA has two main advantages: (1) it provides a quantitative metric for the similarity of groups and (2) it defines clusters at all size scales, ranging from individual samples up to a single group that contains all samples (Graham, 1993). Figure 4 shows the clustering results of the analyte group means, with all variables averaged within a particular sample label, from the colorimetric data to provide a more visually appealing example of the clustering capabilities. When viewing the HCA plot, each of the T junctions is flexible when interpreting similarity within the plot. This means that groups within the same branch, but not necessarily samples within the same region of the graph, are considered similar.
For example, all of the ammonia samples fall within the same branch, noting their similarity. However, 2 M ammonia is equally similar to 0.5 M ammonia as it is to 1 M ammonia due to the flexibility of the branch junctions. Furthermore, in the branch containing lysine and water, 1 M lysine and 2 M lysine are classified equally similar to water, although 0.5 M lysine is less similar to water than 1 M lysine. Ammonia and all the rest of the data form the largest groupings, showing that ammonia is the least similar to all of the rest of the samples. Within the next tier, lysine-water and the acids form the next families, demonstrating how quickly HCA is able to discern the selected acids from the bases. While lysine and ammonia are easily distinguished, HCA struggles with the individual acids, especially at low concentrations, such as in the case of 0.1 M malonic and 1 M acetic acid.
Overall, HCA provided useful analysis for understanding similarities and differences in the data sets and was able to distinguish ammonia and lysine as having distinct signals from the acids. However, given that the branches of the dendrogram are freely rotating, HCA was unable to distinguish the lysine samples as being uniquely basic or distinct from water. Furthermore, HCA does not indicate why the groups were clustered as they were or how each individual variable contributes to the classification, which would, as in the case of PCA, be helpful for sensor array optimization. Although HCA is a powerful classification tool that provides sufficient analysis of the data set, we judge it inferior to PCA in the ability to draw conclusions on the sufficiency of the sensor array, as HCA provides little insight into how the classification may be improved.
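To make the iterative merging behind the dendrogram concrete, the sketch below shows a naive agglomerative procedure in Python using a Ward-type merge cost (the increase in within-cluster sum of squares). It is only an illustration under stated assumptions: it is not the hclust implementation used in this work, the function names are hypothetical, and the two-dimensional toy points stand in for the 24-variable array data.

```python
import math


def centroid(points):
    """Mean vector of a list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    d2 = math.dist(centroid(a), centroid(b)) ** 2
    return len(a) * len(b) / (len(a) + len(b)) * d2


def ward_clusters(points, n_clusters):
    """Merge the cheapest pair of clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # clusters hold sample indices
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost([points[k] for k in clusters[i]],
                                 [points[k] for k in clusters[j]])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return [sorted(c) for c in clusters]


# Toy points forming two well-separated groups (illustrative only).
toy_points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
two_groups = ward_clusters(toy_points, 2)
print(two_groups)
```

Stopping the merging at different cluster counts corresponds to cutting the dendrogram at different heights, which is why HCA defines clusters at every size scale.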

Linear Discriminant Analysis
Like PCA, LDA is a method that generates new variables, called discriminants, which consist of linear combinations of the original variables, and it has been applied in colorimetric sensor array analysis (Minami et al., 2013; Zhang et al., 2014). Unlike PCA, LDA considers and exploits the differences in the group means and often outperforms PCA in the separation of groups (Askim et al., 2013). The main disadvantage of LDA is that it requires a data set larger than that required for PCA (Wold, Johansson, Jellum, Bjørnson, & Nesbakken, 1981). In addition, the size and composition of the classes within a dataset will affect the discriminants and the result of LDA, thus influencing LDA's ability to correctly identify analytes and their concentrations. Figure 5 shows a panel of plots of the first and second discriminants from the LDA, for which groups were input by concentration. In the plot of analytes, LDA appears to cluster acetic acid, malonic acid, and ammonia as distinct groups for analysis and identification of the analyte, while lysine is hardly distinguishable from water. Malonic acid was easily separated into concentration groups, while acetic acid and ammonia solutions were well-resolved at lower concentrations, but distinguishing between higher concentrations (especially 2 and 3 M acetic acid and 1, 2, and 3 M ammonia) was a challenge. For this particular data set, it appears that LDA is superior to PCA at a two-dimensional separation of acetic and malonic acid with regard to class and concentration, but is more comparable to HCA in the ability to distinguish lysine from water. Table 1 shows the classification results of the LDA in which groups were input by concentration. LDA was able to correctly classify samples by analyte identity and concentration in 626 out of 631 samples (99.2%). When misclassifications occurred, the analysis was often still able to classify an analyte as an acid or a base. For example, one 3 M ammonia sample was misclassified as 1 M ammonia, and one 0.1 M malonic acid sample was misclassified as 0.5 M acetic acid. As mentioned previously, LDA struggled most with lysine samples, misclassifying one sample of 0.1 M acetic acid as 0.1 M lysine and one sample of water as 2 M lysine. One sample of 0.5 M ammonia could not be classified, as replicate trials resulted in different classifications. In addition, subsequent analysis of the posterior probabilities indicated a modeling error in the analysis. This was likely the result of unusually low RGB values for phenolphthalein; other possibilities include a shadow or air bubble in the image of the well. LDA was also used to classify analytes qualitatively (by identity alone), with 622 out of 631 samples (98.6%) classified correctly. The previously observed trends also held when considering only analyte classification, with one sample of acetic acid misclassified as lysine (struggles with lysine) and seven samples of malonic acid classified as acetic acid (still classifies acids as acids). Overall, LDA was observed to classify the data 99.2% correctly when considering analyte identity and concentration, while data classification was 98.6% correct when only analyte identity was considered. Although PCA has an advantage of better grouping lysine samples, LDA clearly has the advantage of quantitative classification results, which may be more useful in certain reporting schemes.

K Nearest Neighbor
KNN is another chemometric method, which classifies unknown samples by comparing them to a library of known samples. Previous applications of KNN include pH determination using a sensor array (Capel-Cuevas, Cuéllar, Orbe-Payá, Pegalajar, & Capitán-Vallvey, 2010) and melting point estimation of organic compounds (Nigsch et al., 2006). With KNN, the classification of the unknown is determined by measuring the distance to the most similar known samples. The unknown is then identified by association with the predetermined number "K" of nearest neighbors, with the nearest neighbors defined as the known samples with the shortest distance to the unknown (Ma, Yang, & Cheng, 2014). For example, if K = 1 the unknown is assigned the classification of the one closest known sample, whereas if K = 4 the unknown is assigned the identity most common among the four closest neighbors. KNN classification treats the data as points in a space whose dimensionality equals the number of variables (Balabin, Safieva, & Lomakina, 2010). In this study, with 24 values per array, there are 24 dimensions, which makes the calculation of the distance between points and the implementation of KNN relatively straightforward. Furthermore, KNN often performs as well as or better than more complicated classifiers, such as SVM (Ma et al., 2014).
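The procedure described above (rank the library by distance, then vote among the K closest samples) can be sketched as follows. The sketch is in Python rather than the R class package used in this work; the function name, the three-value toy vectors, and the labels are hypothetical, and a real array would supply 24 values per sample.

```python
import math
from collections import Counter


def knn_classify(library_x, library_y, unknown, k=1):
    """Label `unknown` by majority vote among its k nearest library samples."""
    ranked = sorted(zip(library_x, library_y),
                    key=lambda pair: math.dist(pair[0], unknown))
    votes = Counter(label for _, label in ranked[:k])
    # On a tied vote, most_common keeps first-seen order, so the label of the
    # nearer neighbor wins.
    return votes.most_common(1)[0][0]


# Hypothetical library of known samples (illustrative values only).
library_x = [[250, 10, 10], [245, 12, 8],
             [10, 240, 20], [12, 235, 25], [15, 238, 22]]
library_y = ["acetic acid", "acetic acid", "ammonia", "ammonia", "ammonia"]

print(knn_classify(library_x, library_y, [240, 15, 12], k=1))
print(knn_classify(library_x, library_y, [20, 230, 30], k=3))
```

Every variable contributes equally to the Euclidean distance in this form, which is one reason the choice of K and any rescaling of the variables can change the outcome.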
For our work, K was varied and the Euclidean distance was calculated for KNN analysis. Increasing the K value has previously been shown to influence method performance and accuracy, but the relationship between K and performance can vary (Ma et al., 2014). In our study, increasing K from 1 to 9 resulted in lower classification accuracy (Table 3).
Overall, KNN was used to correctly identify 98.1% of samples for K = 1, decreasing to 93.0% for K = 9. This indicates that K = 1 was the optimal parameter. Table 4 contains the K = 1 classification data. Our results indicate that KNN is a very robust method for sample classification for this sample size, since the least accurate run with the largest number of neighbors (K = 9) is able to classify samples correctly 93.0% of the time. Some of the mislabeled samples involved the correct analyte but the wrong concentration; for example, 0.1 M lysine was identified as 0.5 M lysine, and 0.5 M acetic acid was identified as 1 M. Six misclassifications occurred in acetic acid.
Comparing the two classification methods, KNN (K = 1) is comparable to LDA, with 98.1% accuracy and 98.6% accuracy, respectively. While LDA utilizes input variables with optimal weights to separate the group means and KNN gives equal significance to all of the variables, both methods yield similar results. Alternative KNN algorithms apply various transformations to the dataset to optimize the accuracy of KNN. However, these methods were not pursued in this study (Nigsch et al., 2006).

Hit Quality Index
HQI is commonly used as a spectral comparison method when working with an unknown FTIR or Raman spectrum and a database of known spectra (Gryniewicz-Ruzicka, Rodriguez, Arzhantsev, Buhse, & Kauffman, 2012; Lee, Lee, & Chung, 2013, p. 201). HQI treats the unknown and known spectra as vectors in 24-dimensional space and compares them using dot products according to the following equation:

$$\mathrm{HQI} = \frac{(x \cdot y)^2}{(x \cdot x)(y \cdot y)}$$

The terms x and y are the unknown and one of the many known spectra in the database, respectively. Classification is then assigned by assessing the closeness of fit, with an HQI value close to 1 indicating a close match. The use of HQI in comparative colorimetric studies has shown accuracies similar to those of the other chemometric methods we compare (Gryniewicz-Ruzicka et al., 2012; Lee et al., 2013, p. 201).
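A short Python sketch of HQI-based library matching is given below. It assumes the commonly used squared-cosine form of the index, HQI = (x · y)^2 / ((x · x)(y · y)), which ranges from 0 to 1; this is not the custom R program from the supplemental information, and the function names, library entries, and three-value toy vectors are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def hqi(x, y):
    """Squared cosine similarity between two response vectors (0 to 1)."""
    return dot(x, y) ** 2 / (dot(x, x) * dot(y, y))


def classify_by_hqi(library, unknown):
    """Return the label of the library entry with the highest HQI."""
    return max(library, key=lambda label: hqi(library[label], unknown))


# Hypothetical library of known response vectors (illustrative values only).
toy_library = {
    "0.5 M acetic acid": [220, 40, 35],
    "0.5 M ammonia": [30, 60, 210],
}

print(round(hqi([1, 2, 3], [2, 4, 6]), 3))
print(classify_by_hqi(toy_library, [210, 45, 30]))
```

Under this form of the index, two vectors that differ only by an overall scale factor score exactly 1, so the comparison depends on the pattern of responses rather than on their absolute magnitude.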
In the present study, HQI showed an overall 98% accuracy for analyte identity and concentration when classifying acetic acid, malonic acid, lysine, and ammonia (see Table 5 for results). Common misclassifications include concentrations within the same analyte (6 samples), such as 1 M acetic acid misclassified as 0.5 M acetic acid and 0.5 M lysine misclassified as 0.1 M lysine. Two samples were classified outside their analyte: 0.5 M lysine misclassified as 0.1 M acetic acid and 0.1 M malonic acid misclassified as 1 M acetic acid. Similar to KNN, HQI underperformed both LDA and PCA, especially with respect to classifying by concentration within an analyte. These results are similar to our previous work with NaOH and HCl, which also showed a high level of accuracy for analyte classification while suffering in the concentration identification within analytes, especially high-concentration HCl (Kangas, 2018). HQI is a straightforward mathematical analysis method that could be useful as an alternative or additional chemometric analysis of colorimetric arrays. However, it suffers from inaccuracies that make it inferior to LDA and PCA, especially for the acetic acid concentrations tested in the present study (Table 5).

Partial Least Squares Discriminant Analysis
PLS-DA is an extension of the PLS methodology used for classifying samples. Briefly, PLS or PLS-DA are performed by generating components similar to those in PCA, but the components are selected to correlate with y-values or classes, respectively (Brereton & Lloyd, 2014). Our PLS-DA results for classifying samples into groups based on the analyte and concentration are given in Table 6. Unlike the other classification methods used in this study, PLS-DA was only able to correctly classify water and the most concentrated ammonia samples (3 M), resulting in an overall accuracy of 39%. The poor performance may be ascribed to PLS-DA using a one-versus-all approach for multiple class problems (Brereton & Lloyd, 2014). With the present data set, the samples in the target class may be very similar to those in the rest of the dataset. For example, when classifying 0.5 M acetic acid, both 0.1 and 1 M acetic acid would be in the other class. Since water made up the majority of our observations, the class weights may also have been affected. PLS-DA can also have problems setting the boundaries when there are groups of unequal sizes (Brereton & Lloyd, 2014), which would also be present with the one-versus-all groupings. PLS-DA analysis with groups based on the analytes, rather than both the analytes and concentration, resulted in a much higher accuracy, with 610 of 631 correct (data not shown). In this case, the one-versus-all groupings should be more distinct, the classes should be closer in size, and there are fewer classes to which observations can be assigned.
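The class imbalance created by one-versus-all coding can be seen with a small Python sketch. The function name and the replicate counts below are hypothetical and purely illustrative; they are not the class sizes of the actual data set, and the sketch shows only the recoding step, not PLS-DA itself.

```python
from collections import Counter


def one_versus_all(labels, target):
    """Recode multi-class labels as 1 (target class) or 0 (everything else)."""
    return [1 if label == target else 0 for label in labels]


# Hypothetical label list: 8 replicates for each of three acetic acid
# concentrations plus 12 water controls (illustrative counts only).
labels = (["0.1 M acetic acid"] * 8 + ["0.5 M acetic acid"] * 8
          + ["1 M acetic acid"] * 8 + ["water"] * 12)

coded = one_versus_all(labels, "0.5 M acetic acid")
print(Counter(coded))  # 8 target samples versus 28 in the "rest" class
```

In this toy case the "rest" class is several times larger than the target class and also contains the neighboring acetic acid concentrations, which mirrors the two difficulties discussed above.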

Recursive Partitioning
RPART is a fast and simple-to-implement classification method (Miller, 2001) that generates a decision tree to classify samples. At each branch, the value of a single variable is tested with a rule. For example, a classifier for patients may use a rule such as "is the patient's age > 18?", while a classifier for colorimetric data may test the intensity of a specific sensor.
Examples of the usage of RPART in the chemical literature include calculating phase diagrams for surfactants (Bell, 2016) and identifying new pharmaceutical and antibiotic compounds (Rusinko, Farmen, Lambert, Brown, & Young, 1999; Wang et al., 2014). An advantage of RPART is that only the most important variables from the dataset are used in the rules, and variables that have a low impact on classification are ignored.
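How a single-variable rule might be chosen at one branch can be illustrated with the Python sketch below. It assumes Gini impurity as the splitting criterion and searches midpoints between sorted values of one variable; this is a simplified illustration rather than the rpart package's procedure, and the function names and toy intensities are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0 for a pure group, larger for mixed groups."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Find the cut on one variable that minimizes weighted Gini impurity."""
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut is possible between identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        if score < best[0]:
            best = (score, cut)
    return best[1], best[0]


# Hypothetical blue-channel intensities for one sensor (illustrative only).
intensities = [30, 35, 40, 120, 130, 125]
classes = ["acid", "acid", "acid", "base", "base", "base"]

threshold, impurity = best_threshold(intensities, classes)
print(threshold, impurity)
```

A full tree repeats this search over every variable at each branch and keeps the best rule, which is how uninformative variables end up unused.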
Figure 6 shows the decision tree used for classifying the samples. As shown in Figure 6, the first rule is based on the blue intensity of AY. The numbers below the groups indicate the confidence in that classification with the training set. The 0.1 M lysine group was the only one with a confidence less than 1, which is consistent with the classification results for the test set, where there were acetic acid samples classified as 0.1 M lysine and lysine samples classified as acetic acid. In addition, the decision tree shows that phenolphthalein and eriochrome black T were not used in any rules, while bromophenol blue and alizarin yellow were the most utilized sensors.
For RPART, the data was randomly split with R into a training set, which acts as the database, with 473 samples and a testing set, which acts as the unknowns, with 158 samples. To test the reproducibility, the analysis was repeated in triplicate with a new training and testing set of data randomly chosen for each trial. The classification accuracies for the three trials were 95.6%, 97.4%, and 96.2%, for an average accuracy of 96.4%, and the classification results for trial 1 are summarized in Table 7. The decision tree for trial 1 is shown in Figure 6.
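The set sizes and the average accuracy quoted above can be checked with a brief Python sketch of a 75%/25% random split. The function name and the fixed seed are hypothetical conveniences for reproducibility of the illustration; the actual split was performed in R and the specific samples selected there are not reproduced here.

```python
import random


def train_test_split(n_samples, train_fraction=0.75, seed=0):
    """Randomly partition sample indices into training and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * train_fraction)  # 631 * 0.75 = 473.25 -> 473
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = train_test_split(631)
print(len(train_idx), len(test_idx))

trial_accuracies = [95.6, 97.4, 96.2]  # percent, from the three reported trials
mean_accuracy = sum(trial_accuracies) / len(trial_accuracies)
print(round(mean_accuracy, 1))
```

Changing the seed gives a different partition with the same sizes, which corresponds to drawing a new training and testing set for each trial.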
As shown in Table 7, there were 7 incorrect classifications, including three acetic acid samples that were classified as the correct analyte but the wrong concentration. The remaining four samples were classified as the wrong analyte. These include two lysine samples (0.1 and 1 M) that were classified as 0.1 M acetic acid and two acetic acid samples (0.1 M) that were classified as lysine (0.1 M). These results are consistent with the results for KNN and LDA (98% and 98.6%, respectively), which also showed some confusion between dilute acetic acid and lysine. This is likely due to the similarity in pH between the two compounds.

Support Vector Machines
Support vector machines were utilized to investigate whether more novel and less reported learning schemes could improve the classification results of our data set. SVMs construct hyperplanes based on a training set of data with known classifications. The hyperplanes are chosen to give the maximum distance between groups, clearly dividing the possible combinations of data. When an unknown sample is added to the space, it is classified according to where it falls relative to the hyperplanes, receiving the identity of the training samples in that region. For colorimetric sensor arrays, SVMs have been applied for the detection, prediction, and classification of various explosives (Askim, Li, LaGasse, Rankin, & Suslick, 2016).
Table 8 shows one trial of SVM identification and classification with the presented dataset. Out of 158 samples, 140 were identified correctly in analyte identity and concentration. Most misclassifications occurred within the acetic acid samples. However, only the concentrations of the acetic acid samples were misidentified, not the analyte itself; for example, 1 M acetic acid was misclassified as 0.5 M and 2 M acetic acid. Only two concentrations were misclassified in the ammonia dataset: 1 M ammonia was predicted to be 0.5 M ammonia. In the end, the accuracy of SVMs was 89%, which makes the performance of SVMs comparable to other learning schemes such as KNN (98%) and LDA (98.6%).

Soft Independent Modelling by Class Analogy
Soft and hard chemometric methods have been developed to analyze data obtained from chemical systems (Kakhki & Abedi, 2012). SIMCA is usually defined as a soft method, meaning that samples can be classified in one group, multiple groups, or no groups. SIMCA is also an independent modelling method: a separate model is built for each class, so a sample can be categorized into more than one group. Unlike many other classification algorithms, SIMCA can therefore assign a sample to multiple groups in the case of orthogonal or hierarchical groups. For each class in the training set, PCA is performed and a model describing the group is generated. Afterwards, each unknown sample is projected into all of the group models, and the unknown can be assigned to a group based on its similarity with the group (Gemperline, 2006). In addition, outliers can be rejected from all classes. Other advantages of SIMCA include the ability to work well with data sets having either small or large numbers of variables (Esbensen, Guyot, Westad, & Houmøller, 2010).
As with RPART, SVM, and PLS-DA, the data set was randomly split into a training set and a test set using R. To check reproducibility, the analysis was performed in triplicate, with new training and test sets each time, providing accuracies of 97%, 89%, and 92%, with an average of 92%. These results are comparable to the results from KNN (98%), RPART (96.4%), HQI (98%), and LDA (98.6%). A summary of the classification results for SIMCA trial 1 is given in Table 9. Of the five misclassifications, four were the correct analyte but the wrong concentration. The final misclassification was a sample of 0.5 M lysine, which was classified as 0.1 M acetic acid. Misclassifications between lysine and dilute acetic acid were also observed with RPART.
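The triplicate train/test procedure can be sketched as follows. This is an illustration in Python rather than the R code used in this work. The total of 631 samples comes from the data set; the test-set size of 158 is borrowed from the SVM trial reported above and is an assumption here, as are the seeds and the function name.

```python
import random


def split_indices(n_samples, n_test, seed):
    """Randomly partition sample indices into (training, test) lists."""
    rng = random.Random(seed)  # a fixed seed makes each split reproducible
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


# Three independent splits, one per replicate trial (seeds are arbitrary).
splits = [split_indices(631, 158, seed) for seed in (1, 2, 3)]

for train, test in splits:
    # Every sample is used exactly once, either for training or for testing.
    assert len(train) + len(test) == 631
    assert not set(train) & set(test)
```

Each replicate would then fit the classifier on the training indices and report the accuracy on the corresponding test indices.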

Conclusions and Future Outlook
Colorimetric sensor arrays are rapidly becoming a common tool for the identification and quantification of analytes.
The multidimensional nature of colorimetric data is well served by the use of chemometric methods. While HCA and PCA are popular chemometric methods, we sought to explore the use of other algorithms to compare and contrast their usefulness in qualitative and quantitative analysis. In this work, an eight-sensor colorimetric array was used to compare the performance of PCA, HCA, LDA, KNN, HQI, PLS-DA, RPART, SVM, and SIMCA for efficacy in the identification and quantification of acetic acid, malonic acid, ammonia, and lysine. PCA, HCA, and LDA were used to qualitatively visualize the data and the relationships between the analytes. In PCA, PC1 and PC2 provide excellent separation of ammonia, lysine, and water, but not of acetic and malonic acid. These two analytes were separated much better with PC2 and PC3, indicating that more than the first two principal components should be evaluated to obtain optimal clustering of analytes. HCA was unable to distinguish the lysine samples as being uniquely basic or distinct from water, making this method less effective than PCA for classifying our selected analytes. The two-dimensional separation of acetic and malonic acid with regard to class and concentration was achieved with LDA, making this method superior to PCA for our data set. However, the lysine separation from water was similar in performance to HCA. Therefore, for the present data set and presented methods, the effectiveness with regard to visualization and classification can be arranged as LDA > PCA (if only PC1 and PC2 are used) > HCA.
LDA, KNN, HQI, PLS-DA, RPART, SVM, and SIMCA were used to quantitatively classify the samples. LDA is unique in that it can achieve visualization of the data as well as report quantitative means of classifying unknown compounds with high accuracy (>99% in this data set). KNN is advantageous because it is relatively simple to execute, performing similarly to HQI (98% accuracy when k = 1) and better than PLS-DA, RPART, SVM, and SIMCA. PLS-DA was the least discriminating chemometric method for this data set, as it was only able to correctly classify water and the most concentrated ammonia samples (3 M), resulting in an overall accuracy of 39%. RPART results were consistent with KNN and LDA, showing misclassifications between dilute acetic acid and lysine. SVM (85% correct classification) underperformed relative to all methods except PLS-DA (39%). Therefore, the effectiveness of the quantitative methods for this data set, for an analyte concentration range from 0.1 M to 3.0 M, can be ranked as LDA > HQI > KNN > SIMCA > RPART > SVM >> PLS-DA. This coincides with our previously published ranking of LDA > HQI > KNN for classification and quantification of HCl and NaOH (Kangas paper). Therefore, it appears that these classification methods follow a general trend for inorganic and organic acids and bases. If other analytes were to be analyzed, it is recommended that all of these chemometric methods be examined for effectiveness, as analytes and sensors can have completely different mechanistic interactions that lead to different types of color changes, different RGB values, and different data sets. However, because PLS-DA was markedly inferior to the other methods, with only 39% accuracy, it may also not perform well for other data sets with a high number of analytes. Also, depending on the number of samples, LDA may not work because it requires a large data set, while KNN, HQI, and SIMCA can accommodate smaller data sets. Finally, KNN can be employed if an easy algorithm and quick results are desired and a slightly lower accuracy is acceptable. Herein, nine chemometric methods were applied to the data set. The data set is provided in the supplemental information (SI.4) so that readers can analyze it with the many other methods available for further processing and comparison. The methods reported here offer a suitable balance between data set requirements, analysis time, and robustness of response for our chemical classification application.

Figure 5. LDA for 0.1-3 M acetic acid, 0.1-3 M ammonia, 0.1-2 M lysine, 0.1-2 M malonic acid, and water. LD1 and LD2 are the first and second discriminants from LDA, respectively.

An advantage of LDA over PCA is LDA's ability to report quantitative means of classifying unknown compounds. Table 1 shows the classification of the LDA analysis in which groups were input by concentration. LDA was able to correctly classify samples by analyte identity and concentration in 626 out of 631 samples (99.2%). When misclassifications occurred, the analysis was often still able to classify an analyte as an acid or a base. For example, one 3 M ammonia sample was misclassified as 1 M ammonia and one 0.1 M malonic acid sample was misclassified as 0.5 M acetic acid. As mentioned previously, LDA analysis struggled most with lysine samples, misclassifying one sample of 0.1 M acetic acid as 0.1 M lysine and one sample of water as 2 M lysine. One sample of 0.5 M ammonia could not be classified, as replicate trials resulted in different classifications. In addition, subsequent analysis of the posterior probabilities indicated a modeling error in the analysis. This was likely the result of unusually low RGB values for phenolphthalein; other possibilities include a shadow or air bubble in the image of the well. LDA was also used to qualitatively classify analytes, with 622 out of 631 samples (98.6%) classified correctly. The previously observed trends also held when considering only analyte classification, with one sample of acetic acid misclassified as lysine (struggles with lysine) and seven samples of malonic acid classified as acetic acid (still classifies acids as acids). Overall, LDA was observed to

Figure 6. Decision tree generated with trial one of RPART. Each branch provides a freely rotating, more selective classification of the samples. The blue boxes at the bottom show the classes and the confidence in that assignment. Variables used in the rules are indicated as R, G, or B for the color channel and an abbreviation for the sensor. Sensors are defined as Congo red (CR), erythrosin B (EB), alizarin yellow R (AY), crystal violet (CV), eriochrome black T (ER), phenolphthalein (PH), universal indicator (UV), and bromophenol blue (BB).

Table 2. Summary of LDA sample classification (grouped by analyte)

Table 3. Effect of K on K nearest neighbor (KNN) accuracy

Table 6. Summary of partial least squares discriminant analysis (PLS-DA) sample classifications

Table 7. Summary of RPART sample classifications

Table 8. Summary of support vector machines (SVM) sample classifications

Table 9. Summary of SIMCA sample classifications

Table 10. Summary of quantitative chemometric analysis of data by method. LOO = leave one out (all but one); T/T = train and test.