Secure Multi-User k-Means Clustering Based on Encrypted IoT Data

IoT technology collects information from many clients, and this information may involve personal privacy. To protect their privacy, clients prefer to encrypt the raw data with their own keys before uploading. However, to make use of the information, data mining with cloud computing is applied for knowledge discovery. Hence, how to effectively perform data mining algorithms over encrypted data has become a pressing issue. In this paper, we present a multi-user k-means clustering scheme for encrypted IoT data. Although there are many privacy-preserving k-means clustering protocols, they rarely address the situation where data are encrypted under different public keys; moreover, the existing works are inefficient and impractical. The scheme proposed in this paper not only supports evaluation over data encrypted under different public keys but also improves the efficiency of the algorithm. According to our theoretical analysis, it is semantically secure under the semi-honest model. Finally, we evaluate the scheme on a real dataset; compared with previous works, the results show that our scheme is more efficient and practical.


Introduction
With the rapid growth of Internet of Things (IoT) technology, its applications will spread to all walks of life. IoT technology collects information through a variety of smart devices or sensors and, according to agreed protocols, transfers the information to an application platform for processing to achieve intelligent control. For example, hospital wristbands can identify patients undergoing medical care, and sport trackers can log physical activities. All of these smart devices and sensors produce large amounts of data while running, and to reduce cost, users would like to benefit from outsourced services, which is one of the fundamental advantages of cloud computing.
Cloud computing is a new business computing model. As an emerging model, it attracts a large number of users with characteristics such as good scalability, low cost, and pay-on-demand pricing. More and more enterprises and users move their services and data into the cloud. That is, users with resource-constrained devices can delegate heavy workloads to untrusted cloud servers and enjoy virtually unlimited computing resources. However, such a large quantity of data often involves sensitive information, such as medical records or location information, and it is risky to store it directly on cloud servers. This security challenge is a pressing problem in the development of cloud computing.
Due to the security threats to the cloud, users have to take action to protect their sensitive information. The common method is to encrypt raw data before uploading them to the cloud server. Generally speaking, users do not trust others easily, so they encrypt their own data with their own keys. In other words, besides securely storing data in the cloud server, the system should allow users to retrieve cloud data without revealing confidential data to other users or the service providers. Meanwhile, this brings new challenges for evaluation over ciphertexts under multiple keys. López-Alt, Tromer and Vaikuntanathan (2012) presented a multi-key fully homomorphic encryption cryptosystem capable of operating on inputs encrypted under multiple keys. Unfortunately, it requires heavy communication during the decryption of the final results, and its efficiency is far from practical. Then, Peter, Tews and Katzenbeisser (2013) presented a scheme that allows evaluating any dynamically chosen function on inputs encrypted under different independent public keys, utilizing Bresson-Catalano-Pointcheval (BCP) encryption (Bresson, Catalano, & Pointcheval, 2003), an additively homomorphic encryption with a double-trapdoor decryption mechanism. The disadvantage of this solution is that it requires complex interaction between two non-colluding cloud servers during the ciphertext transformation phase. Therefore, it is not suitable for real-world systems.
In this paper, to avoid these drawbacks and achieve a better balance between efficiency and security, we construct a more efficient secure k-means clustering scheme under multiple keys, based on two non-colluding cloud servers, by utilizing ElGamal-based Proxy Re-encryption (PRE) (Ateniese, Fu, Green, & Hohenberger, 2006). Generally speaking, it is reasonable to assume the existence of two non-colluding servers to perform the secure computation: as van Dijk and Juels (2010) clearly indicate, a completely non-interactive solution is impossible in a single-server setting. Hence, if we aim to eliminate interaction among data owners, we need at least two servers, as in Peter et al. (2013). Moreover, to enhance efficiency, we make extensive use of ElGamal-based PRE to transform ciphertexts encrypted under multiple keys into ciphertexts under the same key before computation. Compared to Peter et al. (2013), we reduce the interaction between the two servers and increase computing efficiency.
In brief, we summarize our main contributions as follows: 1) In order to protect private information, we construct a new efficient privacy-preserving k-means clustering scheme. In our setting, the scheme preserves both the privacy of the data owners' sensitive information and that of the calculation results.
2) Existing works require the inputs to be encrypted under the same public key, which is very limiting in practice. To avoid this problem, we take advantage of proxy re-encryption to construct a scheme over distributed data encrypted under multiple keys.
3) We utilize the two non-colluding-servers model to carry out the heavy computation in the learning phase. Apart from computing the proxy re-encryption keys, the data owners do nothing during the learning process; in the end, they only need to decrypt to obtain the result.
4) Finally, we evaluate our scheme on a real dataset. Besides, we compare it with other works, and the results show that our scheme is more efficient and more practical.
Organization. The rest of the paper is organized as follows. In Section 2, we introduce the related work. Next, we present the setting of our scheme and analyze the threat model in Section 3. Section 4 describes the preliminary knowledge and privacy-preserving building blocks. We introduce our scheme in detail in Section 5 and analyze its security in Section 6. We summarize our experimental results in Section 7. Finally, we conclude in Section 8.

Related Work
Previous works have focused on privacy-preserving clustering algorithms. In the early years, researchers mainly focused on secure k-means clustering over a single database, and made some achievements. Recently, the focus has shifted to the multiple-data-source setting to obtain more precise clustering results. Bunn and Ostrovsky (2007) proposed a secure two-party k-means clustering protocol, based on secure two-party computation, that guarantees the privacy of each database without revealing intermediate values. This scheme extends the clustering algorithm to the two-database setting. However, as is well known, secure multiparty computation increases the communication cost among the participating parties. Besides, performing the data mining algorithm is a heavy workload for users, given their limited computation resources.
To address this issue, researchers have started to focus on data mining tasks in an outsourced environment (Rao, Samanthula, Bertino, Yi, & Liu, 2015; Jiang et al., 2018; Rong, Wang, Liu, Hao, & Xian, 2017; Samanthula, Rao, Bertino, Yi, & Liu, 2014; Xing, Hu, Yu, Cheng, & Zhang, 2017). The works of Rao et al. (2015) and Samanthula et al. (2014) outsource all computation to two non-colluding cloud servers. In their schemes, users encrypt their raw data under one cloud server's public key and upload them to the other cloud server, and the two servers collaboratively perform the clustering task on the combined data in a privacy-preserving manner. Since all data are encrypted under a unified key, and only the cloud server holding the secret key can decrypt the ciphertexts, it is hard for data owners to retrieve their data from the cloud servers.
Later works turned to clustering over data encrypted under multiple keys. Jiang et al. (2018) proposed a secure k-means clustering protocol that uses two non-colluding cloud servers to support storage and computation outsourcing. The raw data are encrypted under each data owner's public key, so it is convenient for an owner to retrieve and decrypt the data with the corresponding secret key. However, during the clustering procedure, the data owners must stay online all the time and help work out the result. Similarly, data owners need to participate in the whole clustering process in the scheme of Xing et al. (2017), which lets data users compute the nearest cluster locally and update the cluster centers with the help of a cloud server. Obviously, this increases the communication cost between the entities, and a large amount of calculation falls on the data owners. Beyond that, there are potential data-security problems in the cluster-center update phase. Rong et al. (2017) presented privacy-preserving k-means clustering over a joint database encrypted under multiple keys in distributed cloud environments. They transform ciphertexts under different keys into ones under a common key through a double-decryption cryptosystem (Youn, Park, Kim, & Lim, 2005), which allows an authority to decrypt any ciphertext using the master key without the consent of the corresponding owner. The problem with this method is that such decryption can occur without the data owners' prior consent, which may go against their wishes.
In addition, at the stage of ciphertext transformation, the ciphertexts are converted by two non-colluding servers, and this conversion may significantly decrease computational efficiency. Clearly, achieving an efficient privacy-preserving k-means clustering algorithm under multiple keys remains an urgent open problem.

Architecture and Entities
In order to solve the existing problems, we present a secure and efficient privacy-preserving k-means clustering protocol. In our setting, as shown in Figure 1, we consider n data owners, each holding a d-dimensional object x_i (1 ≤ i ≤ n). Due to security concerns, the data owners store their data in encrypted form. Once they want to extract some information from the data, they send a request to the cloud server. In other words, data mining is executed on the cloud model in a secure manner. We only discuss k-means clustering as the data mining method in this paper, but the approach can obviously be extended to other data mining algorithms.
There are two types of entities in our system model: data owners and cloud servers.

1) Data Owners (DO):
Each data owner encrypts data under his own public key, and he is the only one who can decrypt the ciphertext with the corresponding secret key. In general, the ciphertexts are stored with the storage service provider. Data owners have the right to decide whether their data will be involved in the k-means clustering algorithm.

2) Cloud Computing and Storage Server (S):
The cloud computing and storage server provides storage service to all data owners, and it performs computation on the data when receiving requests from the clients.

3) Cloud Computing Server (C):
The cloud computing server C is a temporary server providing only computing services. Its main job is to assist the cloud server S in performing the data mining algorithm in a privacy-preserving manner.

Threat Model
In this paper, we prefer outsourcing the data and computation to a service provider, such as a cloud server. However, for many reasons, clouds are unreliable; they may try to collude with others to obtain as much of the uncorrupted parties' private information as possible. During the evaluation process, corrupted parties may also deviate from the protocol specification according to an adversary's instructions. Under these circumstances, the adversaries are called malicious adversaries. In our study, we mainly focus on the semi-honest model. In other words, the entities in our setting are all semi-honest adversaries: they execute the protocol correctly, but they attempt to learn information about the users' private data. In addition, we assume that the entities do not collude with each other.
The design goal of our scheme is to ensure that the data owners obtain the clustering results while the clouds learn nothing from the clustering algorithm, not even intermediate values. It is worth noting that the ciphertexts stored in the cloud server are all encrypted under different public keys. Hence, we mainly aim to solve privacy-preserving k-means clustering under multiple keys effectively. We benefit from proxy re-encryption to convert the ciphertexts into ones under a unified key; during the transformation process, the entities learn nothing about the data owners' information. The clustering process is based on a two-cloud model in which the servers execute the protocol correctly and learn no extra information. Thus, we say that our system is private. Informally, our protocol is also correct, which is easy to verify.
In addition, during the clustering process, we keep the communication and computation costs as low as possible. In this study, we consider that the data owners are unwilling to run the collaborative analysis themselves or to spend substantial resources executing it. Consequently, we outsource the computation to computation service providers and ensure that the communication cost between the two cloud servers is minimized.

k-Means Clustering Algorithm
The k-means clustering algorithm, one of the main data mining methods, is widely used in practice. It can be used to partition a set of data objects into clusters. We assume that there are n participants and each participant holds a d-dimensional object x_i (1 ≤ i ≤ n). The k-means clustering algorithm aims to divide the objects into k clusters C_j (1 ≤ j ≤ k), ensuring a high degree of similarity within the same cluster but little similarity between different clusters. The clustering process comprises two steps. The first step assigns objects to clusters. The criterion for assignment is the distance between a sample x_i and the related cluster center μ_j; many measures exist, but in this paper we adopt the Euclidean distance. At each iteration of the first step, the algorithm assigns object x_i to the nearest cluster, labeled c_i, following c_i = argmin_j ||x_i − μ_j||², where 1 ≤ i ≤ n, 1 ≤ j ≤ k. All objects are divided into k clusters during the first step, and the second step updates the cluster centers. The new cluster center is defined as the center of each cluster; assuming there are n_j objects in the j-th cluster, the update is given by μ_j = (1/n_j) Σ_{x_i ∈ C_j} x_i, where 1 ≤ i ≤ n, 1 ≤ j ≤ k. The clustering process terminates if the cluster centers are sufficiently close to the previous ones or the number of iterations reaches a preset limit.
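The two steps above can be sketched in plaintext (ignoring encryption) as follows; the function names and the tolerance-based stopping rule are illustrative, not part of the paper's protocol.

```python
# A minimal plaintext k-means sketch: (1) assign each object to the nearest
# center under squared Euclidean distance, (2) recompute each center as the
# mean of its cluster, until the centers stop moving or max_iter is reached.
import random

def sq_dist(x, mu):
    """Squared Euclidean distance between two d-dimensional points."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))

def kmeans(objects, k, max_iter=100, eps=1e-6):
    # Initialize centers with k randomly chosen records (as in the paper).
    centers = random.sample(objects, k)
    for _ in range(max_iter):
        # Step 1: assign every object to its nearest cluster center.
        labels = [min(range(k), key=lambda j: sq_dist(x, centers[j]))
                  for x in objects]
        # Step 2: update each center as the mean of its assigned objects.
        new_centers = []
        for j in range(k):
            members = [x for x, l in zip(objects, labels) if l == j]
            if not members:               # keep an empty cluster's old center
                new_centers.append(centers[j])
                continue
            d = len(members[0])
            new_centers.append(tuple(sum(m[t] for m in members) / len(members)
                                     for t in range(d)))
        # Terminate when the total squared movement of the centers is small.
        shift = sum(sq_dist(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if shift < eps:
            break
    return centers, labels
```

In the secure protocol, the distance computation, the minimum selection, and the center update of this loop are each replaced by a privacy-preserving building block.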

Additively Homomorphic Proxy Re-Encryption
Proxy Re-Encryption (PRE) allows an honest-but-curious proxy to transform a ciphertext computed under Alice's public key into one that can be opened by Bob's secret key, without disclosing the plaintext. In 2006, Ateniese et al. (2006) proposed a unidirectional PRE relying on pairing-based cryptography, an improvement over the scheme of Blaze, Bleumer and Strauss (1998), whose re-encryption keys are bidirectional. In this paper, to enable the additive homomorphic property, we use the algebraic structure of elliptic curves over finite fields, similar to (Wang, Li, Chow, & Li, 2014; Shafagh, Hithnawi, Burkhalter, Fischli, & Duquennoy, 2017). The PRE scheme is also based on a bilinear map e: G × G → G_T (Boneh & Franklin, 2001) which, for a cyclic group G of prime order q, satisfies e(g^a, h^b) = e(g, h)^{ab} for all a, b ∈ Z_q and g, h ∈ G. Writing Z = e(g, g), the additively homomorphic proxy re-encryption (AHPRE) scheme can be described as follows:
 Setup(1^κ) → (q, G, G_T, Z, g): On input the security parameter κ, choose bilinear groups G, G_T of prime order q and a random generator g ∈ G, and set Z = e(g, g).
 KeyGen → (pk, sk): Choose a random number a ∈ Z_q, and set the public key as pk = g^a with secret key sk = a.
 ReKeyGen(sk_A, pk_B) → rk_{A→B}: When user A delegates to user B with public key pk_B = g^b, the re-encryption key is computed as rk_{A→B} = pk_B^{1/a} = g^{b/a} ∈ G.
 Enc1(pk, m) → c: Encode the message as M = Z^m ∈ G_T. To encrypt m under pk = g^a so that it can only be decrypted by the holder of sk, choose a random r ∈ Z_q and output the first-level ciphertext c = (Z^{ar}, M·Z^r).
 Enc2(pk, m) → c: To encrypt M = Z^m ∈ G_T under pk = g^a so that it can be decrypted by A and her delegates, choose a random r ∈ Z_q and output the second-level ciphertext c = (g^{ar}, M·Z^r).
 ReEnc(c, rk_{A→B}) → c': Given a second-level ciphertext c = (g^{ar}, M·Z^r) under pk_A, compute e(g^{ar}, rk_{A→B}) = Z^{br} and output the first-level ciphertext c' = (Z^{br}, M·Z^r) under pk_B.
 Dec(sk, c) → m: Given a first-level ciphertext c = (Z^{ar}, M·Z^r) and sk = a, compute M = (M·Z^r)/((Z^{ar})^{1/a}) and recover m from M = Z^m. Since messages are encoded in the exponent, multiplying two ciphertexts componentwise yields an encryption of the sum of the underlying messages. Note that in the final decryption we need to map M = Z^m back to m; because the message space is finite and relatively small, m can be obtained by solving a discrete logarithm.
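For intuition, the sketch below shows the additive-in-the-exponent idea with a pairing-free, bidirectional ElGamal-based PRE in the spirit of Blaze, Bleumer and Strauss (1998), rather than the unidirectional pairing-based scheme used in the paper; the toy group parameters and all function names are ours, so treat it as a didactic sketch, not the paper's construction.

```python
# Toy additively homomorphic PRE: ElGamal with the message in the exponent
# over the order-Q subgroup of Z_P*. Bidirectional re-encryption key b/a.
# Demonstration parameters only; never use such small groups in practice.
import random

P = 2039            # safe prime, P = 2*Q + 1
Q = 1019            # prime order of the subgroup of squares mod P
G = 4               # generator of that subgroup

def keygen():
    sk = random.randrange(1, Q)
    return pow(G, sk, P), sk             # (pk, sk)

def encrypt(pk, m):
    r = random.randrange(1, Q)
    # (pk^r, g^m * g^r): re-encryptable via the first component
    return (pow(pk, r, P), pow(G, m, P) * pow(G, r, P) % P)

def rekey(sk_a, sk_b):
    # Bidirectional key b/a mod Q (the unidirectional AFGH scheme instead
    # uses pk_b^(1/a) together with a pairing).
    return sk_b * pow(sk_a, -1, Q) % Q

def reencrypt(c, rk):
    return (pow(c[0], rk, P), c[1])      # pk_a^r  ->  pk_b^r

def add(c1, c2):
    # Componentwise product adds the plaintexts in the exponent.
    return (c1[0] * c2[0] % P, c1[1] * c2[1] % P)

def decrypt(sk, c, max_m=1000):
    gm = c[1] * pow(pow(c[0], pow(sk, -1, Q), P), -1, P) % P
    for m in range(max_m):               # small discrete log by brute force
        if pow(G, m, P) == gm:
            return m
    raise ValueError("message out of range")
```

As in the AHPRE description, decryption ends with a small brute-force discrete logarithm, which is why the scheme is practical only for bounded message spaces.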

Basic Cryptographic Primitives
In this section, we mainly introduce a group of cryptographic primitives that will be used in our privacy-preserving scheme.

 Secure Multiplication Protocol (SMP):
This protocol aims to compute the product of two encrypted values. Assume that S holds two ciphertexts Enc(x) and Enc(y); it obtains the encrypted product Enc(xy) with the help of C, who holds the corresponding secret key sk. The details of SMP are described in Protocol 1.
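Protocol 1 itself is not reproduced here; the sketch below shows one standard blueprint for such a secure multiplication: S additively blinds both ciphertexts, C decrypts the blinded values and returns the encrypted product, and S strips the blinding homomorphically. A toy Paillier cryptosystem stands in for the additively homomorphic encryption (the paper itself uses an ElGamal-based AHPRE), and the primes and names are illustrative only.

```python
# SMP sketch: from Enc(a), Enc(b) under C's key, S obtains Enc(a*b)
# without learning a or b; C sees only blinded values.
import random

# --- toy Paillier (additively homomorphic), demo primes only ---
P1, P2 = 1789, 1861
N = P1 * P2
N2 = N * N
LAM = (P1 - 1) * (P2 - 1)            # phi(N); coprime to N by choice of primes
MU = pow(LAM, -1, N)

def enc(m):
    r = random.randrange(1, N)
    return pow(1 + N, m % N, N2) * pow(r, N, N2) % N2

def dec(c):
    return (pow(c, LAM, N2) - 1) // N * MU % N

def hadd(c1, c2):                     # Enc(m1 + m2)
    return c1 * c2 % N2

def hsmul(c, s):                      # Enc(s * m)
    return pow(c, s % N, N2)

# --- the SMP protocol itself ---
def smp(enc_a, enc_b):
    # Server S: blind both inputs with fresh randomness.
    ra, rb = random.randrange(1, 1000), random.randrange(1, 1000)
    blinded_a = hadd(enc_a, enc(ra))
    blinded_b = hadd(enc_b, enc(rb))
    # Server C: decrypt the blinded values and encrypt their product,
    # i.e. Enc(ab + a*rb + b*ra + ra*rb).
    prod = enc(dec(blinded_a) * dec(blinded_b) % N)
    # Server S: homomorphically subtract rb*a + ra*b + ra*rb.
    correction = hadd(hadd(hsmul(enc_a, rb), hsmul(enc_b, ra)), enc(ra * rb))
    return hadd(prod, hsmul(correction, N - 1))   # subtract correction
```

The blinding makes C's view statistically independent of a and b as long as the blinding range covers the message range, which is the standard argument for this family of protocols.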

 Secure Minimum out of 2 Numbers Protocol (SMINP2):
This protocol considers the server S, with inputs Enc(x) and Enc(y), and the server C, which holds the secret key sk. SMINP2 can be used to determine the order relation between two encrypted values. Protocol 3 shows the detailed process.
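Protocol 3 is likewise not reproduced here; one common blueprint for such a comparison, sketched below with the same stand-in toy Paillier scheme (not the paper's AHPRE), is for S to mask the encrypted difference with a random positive scale and a secret sign flip, so that C, who decrypts, learns only a masked sign.

```python
# SMINP2 sketch: S holds Enc(a), Enc(b) under C's key and learns which
# ciphertext encrypts the smaller value; C sees only a masked difference.
import random

# toy Paillier, demo primes only
P1, P2 = 1789, 1861
N, N2 = P1 * P2, (P1 * P2) ** 2
LAM = (P1 - 1) * (P2 - 1)
MU = pow(LAM, -1, N)

def enc(m):
    return pow(1 + N, m % N, N2) * pow(random.randrange(1, N), N, N2) % N2

def dec(c):
    return (pow(c, LAM, N2) - 1) // N * MU % N

def sminp2(enc_a, enc_b):
    # S: form Enc(a - b), scale by a random positive factor, and flip the
    # sign with a secret coin so C cannot tell which input is larger.
    r = random.randrange(1, 1000)
    coin = random.choice([1, -1])
    diff = enc_a * pow(enc_b, N - 1, N2) % N2        # Enc(a - b)
    masked = pow(diff, (r * coin) % N, N2)           # Enc(coin * r * (a-b))
    # C: decrypt; residues above N/2 represent negative numbers.
    negative = dec(masked) > N // 2
    # S: undo the coin to learn which input is the minimum.
    a_is_min = negative if coin == 1 else not negative
    return enc_a if a_is_min else enc_b
```

This leaks nothing to C beyond a uniformly masked sign, and S learns only the comparison bit it needs to select the minimum ciphertext.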

The Procedures of Our Scheme
In this section, we describe our privacy-preserving scheme for the k-means clustering algorithm in detail. We consider that there are n data owners, and each of them holds a d-dimensional object x_i (1 ≤ i ≤ n). They encrypt their raw data with their own public keys pk_i (1 ≤ i ≤ n) into second-level ciphertexts and upload them to the cloud server S. Under this circumstance, data owners are still able to retrieve their data and decrypt them with their own secret keys sk_i without leaking any information to other participants. Upon receiving a request, the cloud server S performs the clustering algorithm with the cloud server C in a privacy-preserving manner.
We divide our scheme into three stages: (1) Ciphertexts transformation; (2) Assigning records to the nearest cluster; (3) Computing the new clustering centers.

Ciphertexts Transformation
Because the ciphertexts are encrypted under different public keys, it is hard to perform evaluation on these data directly. Although López-Alt et al. (2012) presented a multi-key fully homomorphic encryption cryptosystem, it is quite inefficient in practical applications. To realize data encryption and communication protection while improving calculation efficiency, we design a ciphertext transformation method based on proxy re-encryption (Ateniese et al., 2006). This method converts the ciphertexts into ones under a unified key.
Note that the ciphertexts stored in the cloud server are all in the form of second-level ciphertexts. The data owners have the right to decide whether to participate in the data mining. If a data owner DO_i would like to join the k-means clustering to compare with others, they send a request to the cloud server S together with a proxy re-encryption key rk_{i→C} = pk_C^{1/sk_i}, where sk_i is the i-th participant's secret key and pk_C is C's public key, which is broadcast to all entities. Upon receiving a clustering request and the proxy re-encryption key rk_{i→C} from a data owner, the cloud server S executes the re-encryption function ReEnc(Enc2(pk_i, x_i), rk_{i→C}) to convert the ciphertexts under DO_i's public key pk_i into encryptions under the cloud server C's public key pk_C.
Subsequent work is executed on ciphertexts under the cloud server C's public key. For simplicity, we denote by Enc(x_i) the first-level ciphertext of x_i under pk_C.

Assigning Records to the Nearest Cluster
The second stage is to assign records to their nearest clusters by computing the minimum squared Euclidean distance. The cloud server S's first task is to initialize the cluster centers (μ_1, …, μ_k). There are many initialization methods; the most common is to initialize the centers with randomly generated values. Alternatively, to reduce the number of iterations required in the clustering process, one can adopt the optimized manner proposed in (Ostrovsky, Rabani, Schulman, & Swamy, 2012). In this paper, we randomly choose k data records as the initial cluster centers.
Let d_ij denote the squared Euclidean distance between record x_i and cluster center μ_j. It is easy for S to compute Enc(d_ij) (1 ≤ i ≤ n, 1 ≤ j ≤ k) by performing the SSEDP protocol defined in Section 4. To assign a record to the nearest cluster, we compare the squared Euclidean distances between the record and the k cluster centers. We have presented the SMINP2 protocol for securely obtaining the minimum; we assign the record to the cluster with the minimum squared Euclidean distance and at the same time update the record's cluster label to c_i = j. It is worth mentioning that in our scheme the cluster label is encrypted with the cloud server C's public key under second-level encryption. The benefit is the convenience of returning the final results to the data owners: as before, it only requires a proxy re-encryption to convert the ciphertexts into ones encrypted under the data owners' public keys.
This process repeats until every record has been assigned, dividing the data set into k subsets. Note that when this procedure terminates, the cloud servers have learned nothing about the raw data or about which cluster a record belongs to.

Computing the New Clustering Centers
After assigning the records to the nearest clusters, the cloud server S needs to re-evaluate the cluster center of each cluster. We assume that there are n_j records in the j-th cluster. The new cluster center is simply defined as the center of the cluster, updated as μ_j = (1/n_j) Σ_{x_i ∈ C_j} x_i. Stage 2 and Stage 3 form an iterative process that repeats until a termination condition holds. In this paper, we set two termination conditions: 1) We fix a reasonable threshold ε, and the algorithm verifies whether the sum of the squared Euclidean distances between the current and new cluster centers is upper-bounded by ε.
2) If the number of iterations reaches a preset value, the iteration terminates as well.
If the termination condition holds, the clustering will halt and return the final result. Otherwise, the algorithm continues to the next iteration with the new clusters as input.

Security Analysis
In this section, we analyze the security of our privacy-preserving k-means clustering scheme. The data confidentiality of our scheme is achieved by the additively homomorphic proxy re-encryption.
Beyond that, we assume that there is no collusion between the cloud server S and the cloud server C. Our scheme aims to preserve the privacy of the raw data, the intermediate values, and the final results under the semi-honest model.
The security of the AHPRE scheme relies on the q-Decisional Bilinear Diffie-Hellman Inversion (q-DBDHI) problem, a stronger variant addressed by Dodis and Yampolskiy (2005). The q-DBDHI problem asks: given the tuple (g, g^x, g^{x²}, …, g^{x^q}) as input, distinguish e(g, g)^{1/x} from random.

Definition 1. (q-DBDHI Assumption). We say that the (t, q, ε)-DBDHI assumption holds in G if no t-time algorithm has advantage at least ε in solving the q-DBDHI problem in G.

Theorem 1. The AHPRE scheme is semantically secure under the q-DBDHI assumption.

Proof. Dodis and Yampolskiy (2005) proved that the above assumption is hard in the generic group model, and the semantic security of the underlying pairing-based PRE scheme follows the analysis of Ateniese et al. (2006).

Privacy of Data.
To protect the data privacy of the data owners, we let each owner encrypt the sensitive information with their own pair of public and private keys under the additively homomorphic proxy re-encryption (AHPRE) cryptosystem. According to Theorem 1 and the assumption of non-collusion between the two cloud servers, the protection of individual privacy is easily achieved.
Lemma 1. Without any collusion, our scheme is privacy-preserving for the training model under the additively homomorphic proxy re-encryption.
Proof. Recall that in our basic scheme, the cloud server C does not communicate with the cloud server S until the training phase, and the data owners do not need to communicate with the cloud servers at all. During the training phase, the cloud server S always performs the evaluation on ciphertexts and cannot obtain any information about the intermediate values, because the additively homomorphic proxy re-encryption scheme is semantically secure. Although the cloud server C keeps the secret key, it receives only blinded ciphertexts and can therefore obtain only blinded messages. Hence, the cloud server C cannot obtain the learning result either. Therefore, the privacy of the training model is preserved.

Scheme Evaluation
In this subsection, we analyze the communication and computation overhead incurred in each stage of the proposed scheme. The computational costs are given in Table 1. Here d denotes the number of attributes, n denotes the total number of data records over all participants, and k denotes the number of clusters. Besides, Map represents a bilinear map evaluation, Mul a multiplication, and Exp an exponentiation. It is important to note that Stage 1 runs only once, whereas Stage 2 and Stage 3 run iteratively until the termination condition holds. The table also clearly shows that the computational and communication costs of Stage 2 are significantly higher than those of Stage 3 in each iteration.
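For concrete parameters, the per-iteration entries of Table 1 can be tabulated directly. The helper below transcribes the reported cost formulas (n records, d attributes, k clusters, |q|-bit group elements); the grouping of the communication terms is reconstructed from the flattened table, so treat the exact expressions as a best-effort reading rather than an independent derivation.

```python
# Per-iteration operation counts and communication (in bits) for Stage 2
# and Stage 3, transcribed from Table 1 of the scheme evaluation.

def stage2_costs(n, d, k, q_bits):
    """Returns (Mul, Exp, communication bits) for one Stage 2 iteration."""
    mul = n * (k * d * (d + 4) + 25 * (k - 1))
    exp = n * (8 * k * d + 40 * (k - 1))
    comm = (6 * n * k * d + 28 * (k - 1)) * q_bits
    return mul, exp, comm

def stage3_costs(n, k, q_bits):
    """Returns (Mul, Exp, communication bits) for one Stage 3 iteration."""
    return n + 25 * k, 40 * k, 28 * k * q_bits
```

For the experimental setting (n = 65554, d = 29, k small), Stage 2 dominates every column, which matches the observation above that Stage 2 is the bottleneck of each iteration.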

Implementation and Dataset Description
We implemented the proposed scheme in Python using the GNU Multiple Precision Arithmetic (GMP) library. Our experiments use the KEGG Metabolic Reaction Network dataset (Naeem & Asghar, 1999) from the UCI KDD archive, which consists of 65,554 data records and 29 attributes. As part of the pre-processing, we normalized the attribute values and scaled them into the integer domain.

Table 1. Per-iteration computational and communication costs
Stage 2 (per iteration): n(kd(d + 4) + 25(k − 1)) Mul, n(8kd + 40(k − 1)) Exp, (6nkd + 28(k − 1))|q| bits
Stage 3 (per iteration): (n + 25k) Mul, 40k Exp, 28k|q| bits

In this paper, we focus on performing evaluation of arbitrary functions on inputs encrypted under different independent public keys, so we compare the performance of the ciphertext transformation with two previous works.
The result is shown in Figure 1. It clearly shows that the scheme proposed in this paper is more efficient than the other two. In addition, the communication cost in Table 1 is also lower than that of Rong et al. (2017) and Peter et al. (2013). For applying a privacy-preserving scheme to complex practical systems, our scheme is clearly the better choice for converting ciphertexts into ones under a unified key.
Furthermore, we implemented the secure k-means clustering algorithm on the KEGG Metabolic Reaction Network dataset. The performance of the scheme is shown in Figure 2. We tested our scheme with data of different dimensions (i.e., d = 29, d = 20, d = 10). As we can see, the computation time depends on the size of the dataset. Most of the cost results from the partially homomorphic property, since much time is spent performing SMP, which involves many encryption and decryption operations.
Generally speaking, the scheme proposed in this paper is more efficient and more practical according to the results. Furthermore, we note that the data owners need not participate in the learning phase; all heavy computations are outsourced to the cloud servers, which keeps the owners' workload light.

Conclusion and Future Work
In this paper, we aim to solve the problem of running the k-means clustering algorithm on inputs encrypted under different independent public keys. Our scheme is based on two non-colluding cloud servers, and during the process there is no interaction between the cloud servers and the data owners. We have proved that our scheme is semantically secure in the semi-honest model. Moreover, we highlight its efficiency through experimental results and comparison with previous works.
To meet the needs of practical applications, we will continue to improve the efficiency of the learning algorithm. Furthermore, we plan to experiment with other machine learning algorithms in other application scenarios.