Multi-Channel Similarity Based Compression

Many situations arise where data is collected continuously across multiple channels or over multiple similar subjects. In many cases, transmission of the data across all channels is necessary, but the process can be made more efficient by making use of present similarity between data across different channels. We present here a combined compression approach which exploits approximate linear dependence and high correlation coefficient values between pairs of transformed and sorted channel data vectors. By exploiting this similarity, substantial compression gains can be achieved compared to compression of data per each individual channel.


Introduction
Data compression is of ever growing importance, with increasing data sizes driven by the use of high sensitivity sensors recording in high definition formats. There are many examples of multi-channel data which needs to be stored and/or transmitted for analysis such as recordings with acoustic arrays or medical sensors. Two particular examples we will analyze are acoustic recordings and electrocardiogram (ECG) data. In both cases, there are many sets of long data vectors of floating point numbers, which exhibit some similarity when plotted. Storing these vectors individually for each channel and even compressing them with e.g. deflate or Burrows-Wheeler type lossless methods, would yield large output sizes. However, there are two similarity aspects which can be exploited. One is approximate linear dependence. When these vectors across multiple channels are put together in a matrix, the singular value decay would typically be nonlinear. Specialized low rank factorizations such as the Interpolative Decomposition can be adapted to this purpose (Voronin & Martinsson , 2017). In case the data is lagged from channel to channel, such as in case of acoustic recordings made with sensors a significant distance apart, lag adjustments can be performed prior to matrix formation and small portions of data per channel can be retained for separate processing. The other aspect which can be exploited is high correlation of transformed and sorted data (Gutton et al. , 2007). In many cases, for every channel, the absolute values of the sorted transform coefficients would display an exponentially decaying shape. Upon application of the log transformation, the decay becomes approximately linear and a high correlation between nearby pairs of coefficients can be exploited via a low order polynomial model for the data of one of the channels in terms of the other. To improve performance, the coefficient curve can be subdivided in several parts, and the log transformed portions can be separately fitted with respect to the corresponding part of one of the retained transformed reference signals.

Acoustic and ECG Examples
Some examples to which our approach is applicable come from generated acoustic and ECG data sets. In particular, the multi-channel acoustic recording system is simulated by means of the k-wave (Treeby & Benjamin , 2010) tools (http: //www.k-wave.org) and the ECG data was obtained from the PhysioNet (Moody et al. , 2001) (https://physionet. org/content/ptbdb/1.0.0/) database. Examples are shown in Figure 1. There, a sample acoustic geometry is shown with a main emitter source, nearby interference, and receiver positioning (arranged as a vertical array of black dots on the right). Some of the measured channels are then plotted. Also shown are the processed ECG signals from the above database coming from a multi-patient data set. Throughout this document we refer to channel as an individual entity, which can be a physical object (such as a microphone recorder) or patient for which data is obtained. The goal at hand is to store floating point data efficiently across all channels.

Singular Value Decay and Transformed Signal Correlation
For both applications, a combined strategy taking advantage of high degree of linear dependence and correlation under transformation can be employed to improve on the simple compression ratio, obtained by lossless compressing the floating point data for each channel. Figure 2 plots the associated singular value decay and pairwise correlation values under transformation of the first 350 ECG signals, of a set of 550. That is, we have taken a matrix consisting of 550 rows, with one ECG signal per row, and computed its singular values. We observe that the decay is nonlinear, suggesting that a suitable low rank factorization can be utilized to store the data more efficiently. Next, we compute a four level CDF 9/7 Wavelet transform of each data row. We then sort the absolute values of the resulting coefficients for each row, take the log transformation, and compute their pairwise correlation coefficient values. We can see that many pairs of transformed and sorted coefficient vectors have high values (> 0.95), which suggests that the coefficients vectors for a few channels can be stored and the rest represented by a linear (ax + b) or low order polynomial model with respect to one of the stored vectors x. With the additional storage of the permutation set from the sort and the vector of signs, the original transformed coefficients for the remaining channels can be recovered.

Method
We present here a combined approach based on the observations noted in section 1.2. We discuss the relevant low rank factorization and the approaches for high correlation based compression via low order polynomial modeling against retained pillar data.

Interpolative Decomposition
There are several available matrix decompositions which can exploit numerical rank deficiency in a matrix of similar signals (Voronin & Martinsson , 2017). We review here the low rank singular value decomposition (SVD) and the interpolative decomposition (ID), illustrated in Figure 2. The rank-k SVD (A k = U k Σ k V T k ) of a general m × n matrix A yields an optimal approximation of rank k to A, in the sense that ∥A − A k ∥ ≤ ∥A − M k ∥ for any rank k matrix M k , both in the operator (spectral) and Frobenius norms. However, with the use of the SVD, the eigenvectors may often be difficult to interpret and do not contain direct entries of the signals in the matrix. The ID is a different factorization based on the QR decomposition, which returns a factor containing a selection of columns (or rows) of the original matrix. We can start with the pivoted QR decomposition, AP = QS , which can be written as A(:, J c ) = QS for a column index J c . From this, the rank-k ID can be efficiently constructed, with details given in (Voronin & Martinsson , 2017). Below we outline the main steps from expanding the Q and S factors into portions: Inserting the above expressions for S 1 , S 2 into A(:, J c ), we get: ] .
Next, we set C := A(:, J c (1 : k)) = Q 1 S 11 , so that: where T l is the solution to the matrix equation S 11 T l = S 12 , a set of matrix-vector systems. This results in the approximate factorization: with C a subset of the columns of A based on the pivoting strategy in the QR factorization and the remaining V factor a well-conditioned matrix, a portion of which is a diagonal identity sub-matrix. The permutation matrix P does not need be formed explicitly and is represented by vector J c . With access to A, this vector and matrix T l give complete information for the ID. The V factor can be constructed as below: The runtime can be accelerated by employing randomization at the expense of accuracy, which essentially projects the matrix A from the left (RA) and uses the smaller matrix for the ID. The alternative is to build a QB factorization (Martinsson & Voronin , 2016) and obtain an ID from that which allows rank to be chosen based on specified tolerance. For m × n input A, C is m × k and V is n × k. Notice that when we apply the ID to the matrix transpose, we get A T ≈ A(J r (1 : k), :)Ṽ T , where A(J r (1 : k), :) represents a subset of k rows of the matrix, corresponding to the employed pivoting strategy. The use of the ID allows us to retain only a subset of all available channels via direct application to the transpose of the matrix containing the channel data, one per row. Of course, the V factor (or T l and J c or J r ) also needs to be stored. It should be noted that the data may need some pre-processing prior to being inserted into the matrix, such as in the case of substantial lag between channels. In that case, data vectors can be shifted and adjusted with respect to one another so that a portion of each channel vector would separately remain along with the ID factors.

High Correlation Modeling
Once the ID has been used to potentially reduce the number of retained channels, high correlation modeling can be employed to further decrease floating point storage requirements. The approach described here is loosely based on the method with the Fourier transform described in (Gutton et al. , 2007). In (Goffman-Vinopal & Moshe , 2002), the authors describe correlation compression for color image data. Here, we describe the use of the sorted and scaled Wavelet and log transforms, portion splitting, and low order polynomial modeling for a set of floating point signals. First, a matrix M = [w 1 ; . . . ; w m ] is formed with w i = T hr(Wr i ). W can represent a CDF 9/7 or other Wavelet transform (Rao , 2002), with the optimal choice depending on the smoothness properties of the data. Thr() is an optional hard or soft thresholding operation (Kowalski , 2014), which retains only a portion of the largest by absolute value coefficients, resulting from the transformation W applied to a row vector (floating point channel data) r i , such that r i ≈ W −1 T hr(Wr i ) to a specified tolerance, corresponding to classic lossy signal compression in the signal processing sense. Clearly, if we are able to approximately reconstuct M from some compressed set, then we can obtain the original channel data by applying the inverse Wavelet transform to each row. Once M is constructed, three matrices are formed:M, M num , M sgn . To do so, a descending order sort is applied to the absolute values of the transformed coefficients: to get the corresponding sorted coefficient set to be stored in row i ofM and the integer permutation index information I i (for where the values occurred in the original unsorted Wavelet transformed vector) to be stored in M num . The signs of w i (I i ) (the permuted coefficients) are stored in M sgn as a bit array of 0s and 1s for each channel, corresponding to negative or positive transform coefficients in the permuted order. If we plot the entries of v i , then for every channel i, they would have a similar, exponentially decaying shape as in Figure 2, and hence roughly linear, under a log transform. We can scale each vector v i by the max entry, and save in addition, one scaling factor per row (or modeled portion), so that the log values of the transformed and sorted coefficient vectors for all channels start at 0. The basic idea to high correlation compression is to store v i only for some i (forming the so called reference pillar set), and for the other i, use a low order model to model log(v i ) with respect to the log of one of the retained pillar vectors (of sorted absolute values of the Wavelet coefficients). Then, for the non-pillar channels, only the fitting coefficients are stored in place of the floating point vectors v i . To reconstruct the original signal r i for each channel, we employ the linear model reconstruction and the integer index and sign sets to form the approximate Wavelet coefficients of correct sign, followed by the application of the inverse transform. The pillar vectors can either be chosen as a small subset of all available data or as one or more members of a cluster, when waveform clustering is applied to cluster the data in groups. In that case, high correlation modeling can be performed separately over each cluster.

Compression and Decompression Algorithms
The general compression and decompression approaches are summarized in Algorithms 1 and 2, using a mix of lossy compression (as achieved via the ID, thresholding, and high correlation modeling) and later, optional lossless compression of the remaining parts. First, the ID is applied on the transpose of channel data M = [r 1 , . . . , r n ] T , resulting in a subset of channels C (rows of M) and an extra factor V (typically small compared to C). The subsequent steps aim to compress the data in C via high correlation based modeling. For this, the three M matrices are formed from the data in C. Notice that M num and M sgn are integer and bitwise matrices with lower storage cost, whileM contains the floating point sorted absolute values of the transform coefficients, only a portion of which will be retained. Some of the rows ofM are retained in floating point format in E as pillar data. The rest ofM is not stored. Choosing the pillar data can be accomplished by keeping a set portion ofM as reference. Alternatively, rows can be added in blocks until a specified error tolerance is met or clustering can be used to make a set of pillars (e.g. taking the first member of each cluster). The resulting matrix E contains the pillar signal transformed and sorted absolute valued Wavelet coefficients against which other channels coefficients will be fitted. For the rest of rows j ofM, stored in matrix F, are the coefficients for a fitting model with respect to log(M(i, :)) in E (the optimal reference model to fit against log(M( j, :))), the scaling factors, and the index i to the reference in E. Once done, matrix F contains a small amount of floating point data per non-pillar channel and the combined approach can represent substantial savings over storing the floating point data for each channel. As a final step, lossless compression is employed over all retained data, including the integer and bitwise sets M num and M sgn and the V factor from the ID. Burrows-Wheeler based bzip2 can be used for any of the remaining data, and special methods e.g. (Ratanaworabhan et al. , 2006), can be employed for integer permutation sets. To improve performance, the data inM can be subdivided into several portions for each row and fitted separately with respect to the reference data in E (such that different portions of a row can be fitted against portions of possibly different pillar vectors). A simplified version of the transform correlation based encoding code is shown below, where a set of data is retained as reference pillars: For decompression, assuming the use of a second order model, given model terms a j and b j and scaling factor s j in F and the index i to the reference data in matrix E (corresponding to the reference channels for which the transformed and sorted coefficients are stored), we can compute v j = s j exp ( a j x 2 i + b j x i + c j ) with x i = E(i, :) to model the sorted absolute values of the coefficients for channel j. Then, to construct approximate coefficients of the Wavelet transformed result for each channel (unsorted and with proper sign), we must make use of the permutation index and sign information stored in M num and M sgn . To do so, we initialize an empty vector for each channel, and insert v j in entry specified by index M num (:, j) with sign specified by M sgn (:, j). Following the construction of the transformed vector with the correct sign entries, the inverse transform can be applied to yield the approximate signal for each channel. The sequence of steps is summarized by the pseudocode below, where the log model is fitted with respect to the best reference, then the coefficients are inserted in correct order with relevant sign, and subsequently inverse transformed. If the rows ofM were split in several portions, then F would be composed of several sets of records for the different models of the respective row portions and the reconstructions for each portion would be performed separately. A simplified version of the code appears below, where we read the pillar and model data, evaluate the second order polynomial model, and use the permutation and sign information to construct the approximate transform coefficients for the channel.  Notice that the transformed and sorted absolute values of the Wavelet transform of channel data when plotted (see e.g. the plot in Figure 2) generally exhibit the greatest mean curvature (change) in the first half or quarter of the interval and it is the largest coefficients in that region that are most critical to model accurately for good reconstructions. For this reason, it may be beneficial to split the data non-uniformly, utilizing more portions for the first half of the interval. The advantage of this splitting is that it computes different model coefficients per portion and allows different portions to be fitted with respect to different pillar data vectors. Another method which we discuss in the examples is to precluster the original channel data into several parts (clusters) and take one or more members of each cluster to be the pillar data. A sample clustering sequence in R is shown below using a dissimilarity matrix construction between floating point channel data pairs based on dynamic time warping metrics (Oates et al. , 1999;Mori et al. , 2016). High correlation modeling compression can then be performed over each cluster. Utilize the resulting matrices to reconstruct C. To do so, the sorted magnitude coefficients are reconstructed by means of the evaluated model with x from the specified (best fit row) in E. The vector is then scaled and exponentiated, and its entries are inserted into appropriate locations using M num index set, with appropriate sign assigned from the bitwise M sgn information. The inverse transform is then applied over both the reconstructed transform data and pillar data in E to yield the approximate select channel data C. Multiply C by V to obtain approximation to A T . Extract floating point channel information from the columns for A T .

Numerical Results
We present results from two applications: medical ECG signals and synthetic acoustic data receiver. As discussed in the introduction, both data sets contain data for multiple channels. For the ECG data set, data from 25 patients was used, for a total of 1020 records, each of length 115200 samples. In double precision format, the storage size for the 1020 × 115200 matrix is 940 MB. With a rank 400 ID (resulting in a relative error of 3.7%), the retained channels take up 368 MB, along with a small dense matrix V of size 1020 × 400. Following the ID computation, correlation based compression on transformed and sorted data is performed. That is, the data is first transformed and sorted. The indices and signs are also recorded. Lossy compression (thresholding) is performed to reduce the size of coefficients retained. This is done, by looping over every channel and measuring the error that results from retaining only a portion of the largest coefficients. A suitable portion to retain is then determined over the maximal error over all channels. In our case, 3/4 of the largest magnitude coefficients were retained in every channel. Next, we save 40 of 400 transformed and thresholded channels as reference pillars and encode the remaining values in terms of the optimal paired saved channel. In order to improve performance, we have broken up signals in subsets of length 7000. The full matrix for one portion is then 57 MB and the ID factors (C and V) are 22.4 and 3.2 MB, with 1.2% error in the approximation to A T . We then divided the sorted absolute values of the Wavelet coefficients (the decaying exponential curves) into four parts indexed by the sets (1 → round(N/4)), . . . , (round(3N/4) + 1 → N), which are separately fitted via a third degree polynomial model based on the optimal pairing with one of the transformed and sorted reference signals.
Following this, we have the following saved data: • Saved reference channels 40 × 7000 in double, 2.24 MB.
• Small model coefficient and reference signal index array for the 4 transformed signal parts, each of size 400 × 4.
The total size is thus around 16.2 MB, down from 25.6 MB for the ID factors and 57 MB for the original data. Combining these pieces to cover the entire 115200 sample length requires around 267 MB total storage. Lossless compression can be used to further compress retained data. Error results for the transformed coefficient set reconstructions per channel are shown in Figure 3 along with a sample fit of a waveform based on the model derived from one of the 40 retained pillars. Notice that we can choose to retain less pillars (floating point data) via clustering, as we show in the next example.
In the following test, we show results for a synthetic acoustic application, whereby an array of 29 microphones is recording the signal from a particular emitter and interference sources. The main setup is shown in Figure 1 (left). Figure 4 shows a typical signal obtained by one of the microphones, along with its subdivision into 6 parts. Since the measured signal at a given microphone potentially consists of many samples (as in the example shown), it is reasonable to subdivide the signal into several portions before employing compression. If the signal has both active (high magnitude) and relatively flat (low magnitude) portions, depending on the emitter and interference source profiles, then a gradient map of the indices corresponding to large magnitude entries of the floating point vector (e.g. above the third quartile after filtering the small entries away), can be used to estimate cut locations along the flat areas of the signal, as shown. Once subdivision is performed, the compression sequence can be performed in parallel over different portions on multi-processor or suitable multi-core systems, if necessary. In Figure 5, we show one particular portion (#3), as recorded at 4 different sensors. The pillar selection approach then proceeds by clustering, using all microphone signals from this portion, subdivided in 2000 point portions. In order to accomplish clustering, a dissimilarity matrix is first constructed, which shows the relative distance between pairs of recorded microphone signals. The distance approach is based on dynamic time warping (DTW) (Oates et al. , 1999) and auto regressive integrated moving average (ARIMA) process derived metrics (the average of ACF and ARLPCCeps distances from the TSdist package in R was used (Mori et al. , 2016)). The number of clusters was set to 6 and the resulting dendrogram shows the resulting microphone distribution with respect to different clusters. For each cluster, only the transformed coefficients of the first member (the pillar) were saved and the rest of the signals were encoded with respect to the pillar using the described technique. The resulting final error distribution is shown in the same figure along with per cluster statistics consisting of the original floating point size, the similarity compressed size, and the final size following Burrows-Wheeler based and entropy lossless compression. Notice that the second problem shown here is more challenging for two reasons: the more oscillatory (higher frequency) nature of the signals and the fact that encoding is done with respect to only one pillar, unlike the ECG example, where 40 pillar signals were saved and the optimal reference was used in each case. We can instead choose to save more than one pillar data per cluster. Figure 6 shows two examples with both the reconstructed signed and properly ordered Wavelet coefficients and the reconstructed signals (obtained from the inverse transform of the reconstructed Wavelet coefficients), for a higher and lower error case. As expected, when the Wavelet transform output is more oscillatory, the reconstruction mechanism gives greater error. Still the final result captures the main features of the signals, with some deviations in the magnitudes. For the signal on the right of the figure, both the Wavelet coefficient sequence and subsequent signal reconstruction are close to the originals. In this example, we have also not applied the ID since the full rank of the data matrix is relatively small, but additional gains are possible with respect to the cluster sizes reported in Figure 5.

Discussion
In conclusion, we have presented an approach useful for the compression of a set of similar time domain signals, such as those obtained from multiple test subjects (e.g. medical patients) or recorded by multi-channel processing systems. The basic approach is to exploit both the linear dependence and correlation of data after transformation and sorting. The approximate linear dependence between some of the channel pairs can be exploited by means of the interpolative decomposition, which returns a subset of the data matrix rows when applied to its transpose. The rank can be modified adaptively based on the desired error tolerance level. Next, the remaining data can be broken up into groups via clustering and a set of pillar signals can be retained. The absolute values of the transformed and sorted coefficients of the different signals are then compared to those of the retained pillar data and a low order polynomial model is saved in place of floating point data, along with sign and index permutation sets in more efficient data types. The original data channels can then be approximately recovered by means of the model information, the retained sign and integer index sets, and the inverse transform. Lossless compression can then be applied to the resulting data sets (the ID factor, transformed pillar coefficients, polynomial model information, and the bitwise sign and integer index sets). The approach is simple and efficient to employ, adaptive with respect to specified tolerances, and can yield substantial compression gains.