A New Model for Rating Users’ Profiles in Online Social Networks



Introduction
Online Social Networks (OSNs) have gained wide popularity and become an integral part of our daily activity. Profiling OSN users has been frequently practiced for different purposes (e.g., price discrimination, targeted servicing, fraud detection, and extensive social sorting) in various fields (e.g., marketing, finance, sociology, and forensic science) despite the numerous concerns that have been raised (e.g., security, privacy, ethics, and liability).
In this work, we define a new model for rating a user's profile in a community (e.g., political, religious, lifestyle) using a computerized algorithm in order to deal with the huge and complex amount of profile data. Then, the estimated rates are used for positioning the profiles in the clusters of each community, which are classified as low, medium, high, and advanced. To the best of our knowledge, there is no current solution for rating the profiles of OSN users in each community in such a way.
In order to test the accuracy of the proposed model, we experimented with the clustering of 3000 profiles in the religion, politics and lifestyle communities (Note 1) of three social networks (i.e., Facebook, Twitter and Instagram). Each case study consists of embedding the profiles of an OSN community in an independent Cartesian space to observe their distribution in the clusters. Our results show that we are able to estimate the profile rates accurately while reducing the vector of metrics to a low-dimensional space, down to a 3-D Cartesian space.
The rest of the paper is organized as follows. Section 2 discusses the related works. The clustering method is presented in Section 3. Then, Section 4 details three case studies conducted to validate our model. Finally, Section 5 concludes the paper and presents future work.

Related Works
In the early stages, embedding and networking concepts within the technology and society domains presented inter-relationships that researchers relied on for identifying or categorizing profiles that belong to individuals or groups. A group profile refers to a category of people (e.g., radical, moderate) that does not necessarily form a community (e.g., religion, politics), but whose members are found to share previously unknown patterns of behavior or other characteristics.
Recent works in this domain have focused on the presentation of users' information and on the analysis of this information to discover a correlation or pattern. Such proposals have presented the users' profiles in graphs and classified them into communities by relying on their attributes. These attributes are mainly extracted from the users' profiles using data mining tools, and natural language processing techniques have sometimes been used to interpret the data.
In this context, community detection algorithms depending on well-defined metrics are being developed, such as the study that differentiates between intra-centrality and inter-centrality metrics to characterize nodes in communities.
Network vertices are often divided into groups or communities with dense connections within communities and sparse connections between communities.
On the other hand, researchers propose some possible models to visualize networks with the objective of exposing their community structures based on modularity maximization, which can be used as a useful tool for community detection algorithms and for graph layout methods (Li, 2015). Another approach presents a node-similarity based mechanism with the intention of exploring the formation of modular networks by applying the concept of hidden metric spaces of complex networks. Likewise, others initialize a community attractiveness matrix to initiate a graph clustering algorithm based on the concepts of density and attractiveness for weighted networks, considering node and edge weights.
Previous research on OSN communities has also defined factors that are responsible for the members' activity; for instance, identity-based attachment showed a high impact on information sharing. Studies on the first popular OSNs illustrate a measurement framework to observe user activity in order to address two key issues of online social networks, which are the characterization of user activities and the usage patterns in these OSNs. Consequently, new algorithms are used for the purpose of distinguishing community formation by detecting a number of communities.
Studies on OSN communities demonstrated that community structures across different online social networks are similar and are related to users' locations (Fan & Yeung, 2015). Relationships are measured according to mutual interest through two predefined properties: the reachability, which measures the ability of any node to reach the members of a community, and the "isolability", which measures the ability of any community to isolate itself from the rest of the network.
Users can form or join existing groups on the basis of shared interests or because dense social connections exist among group members (Meo, Messina, Rosaci, & Sarné, 2014). On the other hand, a community-based supervised learning process has been used to detect the set of attributes in a user profile for which a correlation is expected among their attributed values (e.g., job and salary) (Bahri, Carminati, & Ferrari, 2014). Besides, new social and economic platforms have emerged recently and may offer a useful package for managing text, image, audio and video-based analysis modules to detect inappropriate content or high-risk behavior.
Then, researchers compared community detection algorithms for multiplex networks, where a multiplex is considered as a set of graphs on the same vertex set, because the wrong choice of algorithm can deviate from its intended purpose (Loe & Jensen, 2015). Some studies discuss the topological characteristics of legitimate users, including the formation of tightly knit communities, because they consider it a promising approach, but efficient techniques still need to be devised for identifying hackers, personal attacks (FIDIS, 2016) and spammers along with attackers. Other works discussed the problem of community detection in complex networks and defined a vulnerability set and value for each of the communities within these networks.
Other studies designed a search algorithm to find users with the specified keywords in their profile attributes. Notably, it is based on a linear combination of topological distance and trust metrics. It is also dynamic in nature, such that it adapts itself for each individual node during the search process. Additional research tracks the evolution of community structure and revises the effect of a community-based immunization strategy on epidemic spreading.
Cross-System User Data Discovery has proven to be a successful algorithm for retrieving profiles that may belong to the searched user, correlating them, aggregating the discovered data and returning them to the searcher (Carmagnola, Osborne, & Torre, 2014). Other researchers relied on a local community neighborhood ratio function as a useful community detection algorithm.

Clustering Method
We propose a new model for profiling OSN users in a computerized way in order to deal with the huge and complex amount of user profile data. The model consists of calculating a rate for the user's profile in an OSN community.
The proposed clustering method consists of inferring a Cartesian space for each community where each cluster is located at a predefined radius range. Then, the user's profile is represented by a point in the space in order to characterize its corresponding cluster.
Basically, the profile of an OSN user in each community is expressed as a vector of metrics, which are a set of attributes of interest (i.e., qualitative and quantitative) extracted from the user's profile using data mining tools to characterize his/her level of participation and behavior in a community. Then, the vector of metrics is transformed into a vector of uncorrelated and normalized coordinates using Principal Component Analysis (PCA). The transformed vector of independent coordinates (Note 2) becomes the new representation of the user's profile. In this way, the user's profile can be represented as a point in a Cartesian space where the coordinate on each axis is obtained from the transformed vector. The rate of the user's profile (i.e., low, medium, high, and advanced) is estimated according to its position in the corresponding cluster of the Cartesian space.
The presentation of users' profiles in a low-dimensional Cartesian space makes it easy to infer the rate of the user's profile (i.e., low, medium, high, and advanced) through the length of its vector, which fits within one of the predefined clusters in the Cartesian space. Besides, representing the user's profile with a few metrics instead of a very large amount of complex data reveals the most basic dimensionality of the data, since the "original" profile metrics are correlated with each other. Furthermore, such an approach provides better visualization of the clustered users' profiles in a low-dimensional Cartesian space, given that it is hard to illustrate them in a high-dimensional non-orthogonal space.
The problem can be formulated as follows: Let $\vec{x}_i \in \mathbb{R}^m$ be the vector of metrics of user i, where the j-th component of $\vec{x}_i$ represents a qualitative or quantitative metric derived from the user's profile as implicit or explicit data, for j = 1...m.
Then, we need to map $\vec{x}_i \in \mathbb{R}^m$ to a reduced vector $\vec{y}_i \in \mathbb{R}^r$ that contains independent coordinates, since there are some correlations between the metrics derived from the user's profile.
To this end, we used Principal Component Analysis (PCA) to determine the eigenvectors of each community as the orthogonal axes of the Euclidean space modeling this community. This is achieved through the following steps:
a. Calculate the mean $\vec{\mu}$ (Equation 1) and the covariance matrix $C$ (Equation 2) of the dataset containing N profiles.
b. Derive the eigenvectors of the covariance matrix.
c. Form the orthogonal matrix $E_r$ by selecting the r eigenvectors having the largest eigenvalues.
d. Map each profile vector $\vec{x}_i$ of dimension m to a lower dimensional representation $\vec{y}_i$ of dimension r, with r < m, in the orthogonal space of the selected eigenvectors by projecting the vector onto the matrix $E_r$ (Equation 3), where $\vec{x}_i$ and $\vec{y}_i$ are treated as column vectors of dimensions m and r respectively.
The obtained coordinates of node i (the components of $\vec{y}_i$) can be considered as the independent and most basic metrics of the user's profile, which are used to position it in the Cartesian space and infer its rate. This means that our model infers the profile's rate easily, through a simple matrix-vector multiplication.
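Steps a to d can be sketched as follows. This is a minimal illustration with NumPy, not the implementation used in the experiments: it assumes that the profiles are centered on the mean before projection and that the covariance uses a 1/N normalization, neither of which is stated explicitly in the text. The function names and the random input matrix are hypothetical.

```python
import numpy as np

def fit_pca(X, r):
    """Return (mu, E_r, eigvals) for an (N, m) matrix of profile metrics."""
    mu = X.mean(axis=0)                      # step a: mean vector
    C = np.cov(X, rowvar=False, bias=True)   # step a: covariance (1/N assumed)
    eigvals, eigvecs = np.linalg.eigh(C)     # step b: eigenvectors of symmetric C
    order = np.argsort(eigvals)[::-1]        # sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    E_r = eigvecs[:, :r]                     # step c: keep the r leading eigenvectors
    return mu, E_r, eigvals

def project(X, mu, E_r):
    """Step d: map each m-dimensional profile to r coordinates (centering assumed)."""
    return (X - mu) @ E_r

# Illustrative input only: 1000 random profiles with 19 metrics each.
rng = np.random.default_rng(0)
X = rng.random((1000, 19))
mu, E_r, eigvals = fit_pca(X, 3)
Y = project(X, mu, E_r)   # one row of 3 coordinates per profile
```

Under these assumptions, the columns of `Y` are uncorrelated, and the length of each row is the quantity used to position a profile in a cluster.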

Overview
We recall that our model consists of profiling OSN users in the space of each community. To assess the accuracy of our model, we have tested the clustering of 3000 profiles in three communities of interest (Religion, Politics and Lifestyle) of three social networks (Facebook, Twitter and Instagram). Each case study consists of embedding the profiles of an OSN community in an independent Cartesian space to observe their distribution in the clusters.
We started by defining the vector of metrics representing the user's profile in each community as a set of attributes of interest (i.e., qualitative and quantitative) to characterize his/her level of participation and behavior in this community. Then, we applied our clustering model on three case studies:
• Community of Religion in Facebook.
• Community of Politics in Twitter.
• Community of Lifestyle in Instagram.

Vector of metrics
A large vector of qualitative and quantitative metrics can be defined and extracted from the user's profile using data mining tools. Then, Principal Component Analysis can be applied for dimensionality reduction to find the uncorrelated and normalized coordinates. In our experiments, we define the following vector of nineteen metrics for the profiles of Facebook, Twitter and Instagram, as illustrated in Table 1.

Case Studies -Datasets
In order to test the accuracy of our model, we have generated 1000 fake profiles for each of the three case studies by assigning random values to the metrics of each profile, given that it is difficult to collect a large number of real OSN accounts. It is worth mentioning that the fact that the profiles used for testing are fake and not real does not have any negative impact on the validity of our testing. A computer program generates the random values after restricting the range of values of each metric according to Table 2.
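The generation of random profiles within per-metric ranges can be sketched as follows. The ranges below are hypothetical placeholders, since the values of Table 2 are not reproduced here, and only five of the nineteen metrics are shown; the function name is also illustrative.

```python
import random

# Hypothetical (min, max) ranges for a few metrics; not the values of Table 2.
METRIC_RANGES = {
    "M1": (0, 5000),  # number of friends or followers
    "M2": (0, 3),     # relevance of profile name (0 to 3)
    "M3": (0, 3),     # relevance of profile photo (0 to 3)
    "M4": (0, 3),     # relevance of profile status (0 to 3)
    "M5": (0, 50),    # number of relevant books
}

def generate_profiles(n, ranges, seed=None):
    """Return n profiles, each a list of random integers within its metric range."""
    rng = random.Random(seed)
    return [[rng.randint(lo, hi) for lo, hi in ranges.values()] for _ in range(n)]

profiles = generate_profiles(1000, METRIC_RANGES, seed=42)
```

Fixing the seed makes a generated dataset reproducible, which is convenient when comparing results across different values of r.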

Clusters Boundaries
We have generated 3000 OSN profile vectors to be embedded in three Cartesian spaces representing the three selected communities, which are religion, politics and lifestyle (one space per community). Once the dimensionality reduction is applied using PCA on the vectors of metrics of all the OSN profiles of our dataset, we obtain the Cartesian coordinate system $\mathbb{R}^r$ of each community. Afterward, we are able to derive the orthogonal coordinates of each profile and its position in the corresponding space of its community. Once the points representing the profiles are distributed in the Cartesian space, a clustering method that optimizes an objective function can be applied, such as:
• Hierarchical clustering algorithm: Hierarchical clustering algorithms (Jeon & Yoon, 2015) (Shepitsen, Gemmell, Mobasher, & Burke, 2008) can be applied based on the Agglomerative Hierarchical clustering algorithm or the Divisive Hierarchical clustering algorithm, which are the reverse of each other. Divisive Hierarchical clustering is a top-down approach that starts by grouping all the points in one cluster. Then, it splits the cluster into two sub-clusters, which are in turn divided into sub-clusters iteratively, and the result is presented in a dendrogram. Hierarchical clustering is not required in our case, since each point can be leveled through the magnitude of its vector in the Cartesian space, where the distance between points can be easily calculated as well.
• K-means clustering algorithm: K-means is a popular algorithm used for clustering by grouping points into K clusters (Kanungo, et al., 2002). This can be applied on our dataset for determining the K clusters by selecting K random points as the centers of the clusters. Then, the rest of the points are assigned to the cluster of the closest center. Afterward, the K centers are re-determined by identifying in each cluster the point that minimizes the summed distance to the other points in the same cluster. This is repeated until reaching an unchanging set of centers. Although K-means is an interesting method for minimizing an objective function known as the squared-error function, it is not useful in our case, where our aim is to define clusters that represent the profile rates and not only the proximity between the profile points.
• EMST based clustering algorithm: A Euclidean minimum spanning tree (EMST) can be used to detect clusters with irregular boundaries in the Cartesian space without assuming a spherical clustering structure (Zahn, 1971). Clusters are detected to achieve some measure of optimality, such as minimum intra-cluster distance or maximum inter-cluster distance, which is not within our scope for profiling OSN users.
• Density-based clustering algorithm: A density-based clustering algorithm groups together nearby points that form a dense region (Ester, Kriegel, Sander, & Xu, 1996). A cluster is constructed by selecting a new random point, and a neighbor point is grouped in the cluster if it has, in turn, sufficiently many neighbor points. Despite the importance of this algorithm for many applications, we cannot rely on it for rating users' profiles in each community.
While these clustering methods are very useful for grouping points into clusters, another method is needed for defining clusters that represent predefined rates so that the profiles embedded in each cluster can be easily rated.
In this case, we propose that clusters are defined as specific ranges of radius in an (r-1)-sphere. This can be done by defining the boundary between each two adjacent clusters through a subjective model, by consulting experts in the domain of social media. In this way, they have defined subjective values of the vectors modeling the profiles of ten OSN users that could be rated on each boundary. Then, the transformation of these vectors to the Cartesian space provides ten points on each boundary, and their average magnitude defines the radius of the boundary. Take the case where there are four clusters rated as low, medium, high, and advanced. In this case, the radii of the boundaries separating these clusters can be estimated as follows:
• For the boundary between the low and medium clusters, we started by finding the profiles that could be rated at this location of the space. To this end, we defined a set of vectors modeling the profiles of ten OSN users having subjective values of metrics that are rated in-between low and medium. The transformation of this set to the Cartesian space provides a corresponding set of vectors having an average length $B_m$. Then, we consider $B_m$ as the radius of the (r-1)-sphere $S_1$ that separates the low and medium clusters in the space.
• As for the second boundary, separating the medium and high clusters, we defined a set of vectors modeling the profiles of ten OSN users having subjective values of metrics that are rated in-between medium and high. The transformation of this set to the Cartesian space provides a corresponding set of vectors, and we consider their average length as the radius of the (r-1)-sphere $S_2$ separating the medium and high clusters in the space.
• Finally, the boundary between the high and advanced clusters is inferred by defining a set of vectors modeling the profiles of ten OSN users having subjective values of metrics that are rated in-between high and advanced. The transformation of this set to the Cartesian space provides a corresponding set of vectors having an average length $F_m$. Then, we consider $F_m$ as the radius of the (r-1)-sphere $S_3$ separating the high and advanced clusters in the space.
Tables 3 and 4 summarize the original vectors defined for the cluster boundaries and their mapping to the orthogonal space. This means that the four clusters are defined as follows:
• Cluster 1: inside the (r-1)-sphere $S_1$.
• Cluster 2: between the (r-1)-spheres $S_1$ and $S_2$.
• Cluster 3: between the (r-1)-spheres $S_2$ and $S_3$.
• Cluster 4: outside $S_3$.
The position of an OSN user in a cluster of the Cartesian space should reflect the level of participation and behavior in the corresponding community. These positions classify the users' profiles in the community according to the following mapping:
• Cluster 1: Low profile.
• Cluster 2: Medium profile.
• Cluster 3: High profile.
• Cluster 4: Advanced profile.
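The boundary estimation and the rating by vector length can be sketched as follows. The boundary points are made-up values rather than the expert-defined vectors of Tables 3 and 4, and the function names are hypothetical. A point lying exactly on a boundary is assigned to the outer cluster here, which is an assumption, since the text does not specify this case.

```python
import math
from bisect import bisect_right

RATES = ["low", "medium", "high", "advanced"]

def boundary_radius(points):
    """Average Euclidean length of the points rated on one boundary."""
    return sum(math.hypot(*p) for p in points) / len(points)

def rate_profile(point, radii):
    """Rate a point by the spherical shell its length falls into.

    radii must be sorted in increasing order (S1, S2, S3).
    """
    return RATES[bisect_right(radii, math.hypot(*point))]

# Illustrative boundary points in a 3-D space (two per boundary instead of ten).
boundary_sets = [
    [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],   # between low and medium
    [(2.0, 0.0, 0.0), (0.0, 0.0, 2.0)],   # between medium and high
    [(0.0, 3.0, 0.0), (3.0, 0.0, 0.0)],   # between high and advanced
]
radii = [boundary_radius(s) for s in boundary_sets]

ratings = [rate_profile(p, radii) for p in [(0.5, 0.0, 0.0), (0.0, 1.5, 0.0),
                                            (2.0, 2.0, 0.0), (3.0, 3.0, 3.0)]]
```

Because the clusters are concentric shells, rating a profile reduces to one norm computation and a lookup among three radii, with no iterative clustering step.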

Results
In order to validate our model, we rely on the four clusters characterized in the (r-1)-spheres of the Cartesian space $\mathbb{R}^r$ using the subjective vectors of metrics. Then, we evaluate the total error and the matrix reconstruction error for the different values of r to determine the maximum reduction that can be tolerated for our dataset. First, the original vector $\vec{x}_i$ is reconstructed from $\vec{y}_i$ through Equation 4. Then, the total error over all the profile vectors is estimated through Equation 5, in which the $\lambda$ terms are the eigenvalues discarded in the dimensionality reduction.
Figure 1 presents the relative total error for the different dimensionality reductions (r = 1...19). The figure shows that the number of orthogonal dimensions can be reduced down to 3 with a relative error of less than 0.06. When the number of dimensions is reduced further to 2 and 1, the relative total error increases considerably, reaching 0.22 and 0.39 respectively. This means that the first three principal components hold the greatest proportion of the eigenvalues and that the rest is relatively negligible.
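The link between the discarded eigenvalues and the relative total error can be sketched as follows. The normalization by the sum of all eigenvalues is an assumption about what "relative" means here, and the function name and the eigenvalues in the example are hypothetical.

```python
def relative_total_error(eigenvalues, r):
    """Share of the total variance carried by the eigenvalues discarded beyond r."""
    ordered = sorted(eigenvalues, reverse=True)   # largest eigenvalues are kept first
    return sum(ordered[r:]) / sum(ordered)

# Illustrative spectrum: keeping 2 of 4 components discards (2 + 1) / 10 of the variance.
example = relative_total_error([4.0, 3.0, 2.0, 1.0], 2)
```

With this definition the error is 0 when all components are kept and grows as r decreases, which matches the trend described for Figure 1.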

Figure 1. Relative Total Error
In addition, the reconstruction error of the reconstructed vector with respect to the original vector is calculated through Equation 6. The corresponding figure shows that the number of dimensions can be reduced down to 4 with a relative error of less than 0.1, while this relative error increases slightly (to 0.15) when the number of dimensions r is equal to 3, and considerably when r is equal to 2, reaching 0.55.
To test the impact of this dimensionality reduction on the accuracy of the clustering, we have conducted the following experiment. Basically, our clustering method is considered accurate if any two profiles that are "close" to each other in the original metric space typically remain similarly close to each other in the Cartesian space. Thus, two nearby points in the metric space must have very similar coordinate vectors, and so must be mapped to nearby points under this embedding system, to be located in the same cluster of the Cartesian space. To simplify the problem, this is validated as follows:
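A reconstruction error of this kind can be sketched as follows. This is one plausible reading rather than the exact form of Equation 6: it assumes mean-centered PCA and measures the Frobenius norm of the difference between the original and reconstructed data, relative to the norm of the original data. The function name and the random input are hypothetical.

```python
import numpy as np

def relative_reconstruction_error(X, r):
    """Relative error after projecting X onto r components and mapping back."""
    mu = X.mean(axis=0)
    Xc = X - mu                                        # centering (assumed)
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / len(X))
    E_r = eigvecs[:, np.argsort(eigvals)[::-1][:r]]    # r leading eigenvectors
    X_hat = Xc @ E_r @ E_r.T + mu                      # reconstruct from r coordinates
    return np.linalg.norm(X - X_hat) / np.linalg.norm(X)

# Illustrative data only: 200 random profiles with 6 metrics.
rng = np.random.default_rng(1)
X = rng.random((200, 6))
errors = [relative_reconstruction_error(X, r) for r in range(1, 7)]
```

The error never increases as r grows and vanishes when r equals the number of metrics, so plotting `errors` against r gives a curve of the kind discussed above.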

• Low Profiles: For all profiles i (Note 3) satisfying [...], let $l_i$ = number of match cases satisfying [...]. The fraction of match cases for low profile i is: [...]
• Medium Profiles: For all profiles i satisfying [...], let $m_i$ = number of match cases satisfying [...]. The fraction of match cases for medium profile i is: [...]
• High Profiles: For all profiles i satisfying [...], let $h_i$ = number of match cases satisfying [...]. The fraction of match cases for high profile i is: [...]
• Advanced Profiles: For all profiles i satisfying [...], let $d_i$ = number of match cases satisfying [...]
Vol. 10, No. 2; 2017
Then, we have checked the impact of this low percentage of variation on clustering, and we have found that none of the points modeling the users' profiles changes its cluster when the number of dimensions is reduced to 3. One may think that it is still probable, in another data set, that a point positioned very close to the boundary of an adjacent cluster may be erroneously clustered due to the error (even if small) resulting from the reduction of the number of principal components to three. Thus, one can reduce the number of dimensions to three to simplify the presentation of profiles in 3-D space, but this tolerates that some points positioned very close to the boundaries of clusters may not be precisely clustered; in this case, further analysis should be applied to such points if necessary.
Figure 7. CDF of length variation

Conclusions and Perspectives
Our model for clustering OSN users has a number of interesting properties compared to the traditional mining methods of users' profiles. The major advantage is that it allows rating the user's profile (i.e., low, medium, high, and advanced) in each community after presenting it in a low-dimensional Cartesian space embedding the community. The network coordinates of the profile are inferred using PCA after mining the attributes as a set of metrics reflecting its activities and behavior within the community.
The model has been validated by profiling 3000 OSN users in the communities of politics, religion, and lifestyle of three popular social networks. This is achieved by inferring a profile rate in each community that reflects well the level of participation and tendency of the user. The presented results show valid clustering of users' profiles after dimensionality reduction of the vector of metrics down to three dimensions.
Further empirical research is needed to ascertain more factors that contribute to refining and enhancing the profiling of users. In particular, identifying users that are most likely to impress others and direct their attitudes may help in characterizing, preventing and controlling advanced malpractice profiles. On the other hand, future studies must also detect the fabricated profiles that work on manipulating users in a specific community.
b. Derive the eigenvectors of the covariance matrix.
c. Form the orthogonal matrix by selecting the r eigenvectors having the largest eigenvalues.
d. Map each profile vector $\vec{x}_i$ of dimension m to a lower dimensional representation $\vec{y}_i$ of dimension r.

Table 1. Vector of metrics
Metric  Description
M1   Number of friends (Facebook) or followers (Twitter or Instagram)
M2   Degree of relevance of profile name to the community: not relevant (0), low relevance (1), medium relevance (2), high relevance (3)
M3   Degree of relevance of profile photo to the community: not relevant (0), low relevance (1), medium relevance (2), high relevance (3)
M4   Degree of relevance of profile status to the community: not relevant (0), low relevance (1), medium relevance (2), high relevance (3)
M5   Number of posted relevant (Note 4) books
M6   Number of relevant user's pages
M7   Number of relevant activities/events (in which the user participated)
M8   Number of relevant groups (joined by the user)
M9   Number of relevant surveys (filled in by the user)
M10  Number of relevant keywords (used by the user) in previous searches through Search Engine Optimizer (SEO)
M11  Number of received likes on relevant content (text, image, and video)
M12  Number of sent likes on relevant content
M13  Number of received comments (Facebook/Instagram) or tweets (Twitter) on relevant content
M14  Number of sent comments (Facebook/Instagram) or tweets (Twitter) on relevant content
M15  Number of received posts containing relevant content
M16  Number of sent posts containing relevant content
M17  Number of received emails containing relevant content
M18  Number of sent emails containing relevant content
M19  Number of relevant content items shared