Complex Network Analysis of the Contiguous United States Graph

We model the contiguous United States (US), comprising 48 states and the District of Columbia, as an undirected network graph: each state is a node, and an edge connects two nodes if the corresponding states share a common border. We rank the states with respect to a suite of node-level metrics: the centrality metrics (degree, eigenvector, betweenness and closeness), eccentricity, maximal clique size, and local clustering coefficient. We propose a normalization-based approach that combines the normalized values of the four centrality metrics into a comprehensive centrality ranking of the vertices that is most likely to be tie-free, and we apply this approach to the US States graph to obtain a tie-free ranking of the vertices based on a comprehensive centrality score. We observe Missouri to be the most central state with respect to all four centrality metrics. We also analyze the US States graph with respect to a suite of network-level metrics: bipartivity index, assortativity index, modularity, size of the minimum connected dominating set, algebraic connectivity and degree metrics. The approach taken in this paper could be useful in several application domains: transportation networks (to identify central hubs), politics (to identify campaign venues with larger geographic coverage), and cultural and electoral studies (to identify communities of states that are relatively proximal to each other), among others.


Introduction
Network Science is an emerging field of Data Science that analyzes real-world networks from a graph theory point of view. Several real-world networks have been successfully modeled as undirected and directed graphs to study the intrinsic structural properties of the networks as well as the topological importance of the nodes in these networks. The real-world networks that have been subjected to complex network analysis typically fall under one of these categories: social networks (Ghali et al., 2012), transportation networks (Cheung & Gunes, 2012), biological networks (Ma & Gao, 2012), citation networks (Zhao & Strotmann, 2015), co-authorship networks (Ding, 2011), etc. One category of real-world networks that has not yet received sufficient attention is that of regional networks featuring the states within a country.
In this paper, we present a comprehensive analysis of a network graph of the states within a country with respect to various node-level and network-level metrics typically considered in the field of Network Science, and demonstrate the utility of the information that can be obtained from the analysis. We also propose a normalization-based approach to obtain comprehensive centrality scores for the vertices from the normalized individual centrality scores, and illustrate the use of these comprehensive scores to obtain a ranking of the vertices (that is most likely to be tie-free). We also illustrate the procedure to identify the centrality metric whose scores and ranking are relatively the closest to the normalized comprehensive centrality scores and ranking.
We opine that this paper could serve as a model for anyone interested in analyzing a connected graph of the states within a country from a Network Science perspective. The approaches presented in this paper could be useful to determine the states (and their cities) that are the most central and/or influential within a country. For example, the ranking of the vertices based on the shortest path centrality metrics (closeness and betweenness) could be useful to choose the states (and their cities) that could serve as hubs for transportation networks (like road and airline networks). We could identify the most central states as well as the states that could form a connected backbone that is geographically well-connected to the rest of the states within a country, and use this information to design the road/rail transportation networks. The degree centrality and eigenvector centrality metrics as well as the network-level metrics like minimum connected dominating set and maximal clique size could be useful to identify a small number of venues (with several adjacent states to draw people from) for political campaigns/meetings that would cover the entire country. Node-level metrics like local clustering coefficient could be useful to identify the states that are critical to facilitate communication between the neighbor states. One could develop an optimal regional classification of states for cultural studies (language accent, eating habits, etc.) and electoral studies (like scheduling of elections) by identifying communities of states (that are relatively more proximal to each other) with high modularity scores.
We choose the United States (US) as the country for analysis and build a connected network graph of the contiguous states (48 states and the District of Columbia, DC) of the US: each state and DC is a node (vertex), and there exists a link (edge) between two vertices if the two corresponding states/DC share a common border. Though some prior studies have been conducted on transportation networks (Cheung & Gunes, 2012) and food flow networks (Lin et al., 2014) in the United States, to the best of our knowledge, there has been no prior study of network analysis on the graph of the contiguous US states solely based on their geographical locations.
In this paper, we have implemented the algorithms to compute several node-level metrics (such as the degree centrality, eigenvector centrality (Newman, 2010), betweenness centrality (Brandes, 2001), closeness centrality (Newman, 2010), maximal clique size (Meghanathan, 2015b), eccentricity (Cormen et al., 2009) and local clustering coefficient) as well as several network-level metrics (such as bipartivity index (Estrada & Rodriguez-Velazquez, 2005), modularity (Newman, 2006), minimum connected dominating set (Meghanathan, 2014b), algebraic connectivity (Fiedler, 1973), average path length (Cormen et al., 2009), diameter (Cormen et al., 2009), assortativity index (Newman, 2010) and spectral radius (Meghanathan, 2014a)), and we analyze the US States network graph with respect to these metrics. We also analyze random network instances of the US States graph (generated with the same degree sequence using the Configuration model (Meghanathan, 2016c)) to study the correlation of the node-level metrics and the proximity of the values for the network-level metrics. Finally, we illustrate the application of the proposed normalization-based approach on the US States graph.
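The graph model described above (one vertex per state, one edge per shared border) can be sketched as follows. This is a minimal illustration rather than the implementation used in the paper: the helper names `build_graph` and `degree_centrality` are hypothetical, and the border list is a small subset (Missouri and the eight states it borders) instead of the full 49-vertex graph.

```python
from collections import defaultdict

# Illustrative subset only: Missouri and its eight bordering states.
# Borders among the neighbors themselves are deliberately omitted.
BORDERS = [
    ("Missouri", "Iowa"), ("Missouri", "Illinois"), ("Missouri", "Kentucky"),
    ("Missouri", "Tennessee"), ("Missouri", "Arkansas"), ("Missouri", "Oklahoma"),
    ("Missouri", "Kansas"), ("Missouri", "Nebraska"),
]

def build_graph(border_pairs):
    """Return an undirected graph as {state: set of bordering states}."""
    graph = defaultdict(set)
    for a, b in border_pairs:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)

def degree_centrality(graph):
    """Degree of a vertex = number of states sharing a border with it."""
    return {v: len(nbrs) for v, nbrs in graph.items()}

graph = build_graph(BORDERS)
degrees = degree_centrality(graph)
```

Storing each neighborhood as a set keeps the edge-existence test cheap, which the shortest-path and clique-based metrics rely on.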

Local Clustering Coefficient
The local clustering coefficient (LCC) of a vertex is a measure of the probability that any two neighbors of the vertex are connected. For a vertex vi with ki neighbors, the maximum number of links among the neighbors of the vertex is ki(ki-1)/2. The LCC of a vertex is the ratio of the actual number of links connecting the neighbors of the vertex to the maximum possible number of links between the neighbors of the vertex. The smaller the LCC of a vertex, the more important the vertex is for facilitating shortest path communication among its neighbors (as there is a good chance that neighbors that are not directly connected to each other go through the vertex for shortest path communication). Hence, we give a higher rank to vertices having a lower LCC. Figure 7-a captures the cumulative probability distribution of the LCC metric: only about 15% of the vertices have an LCC of 0.3 or lower, and more than half of these vertices have the largest values for the betweenness centrality (BWC), as observed in Figure 7-b. We observe the Spearman's rank-based correlation coefficient between LCC and BWC (computed based on the rankings in Tables 4 and 8) to be 0.82. Figure 7-c clearly captures the inverse relationship between degree and LCC. Vertices having a larger degree are more likely to have a lower LCC, as it would be difficult to expect any two neighbors of a high-degree node to be directly connected to each other; such neighbors are more likely to go through the vertex for shortest path communication. On the other hand, vertices having a lower degree are more likely to have a larger LCC, as it is highly possible for any two neighbors of a low-degree vertex to be directly connected to each other, in which case they need not go through the vertex for shortest path communication. Thus, vertices with a higher degree and lower LCC are more likely to have a larger BWC, and vertices with a lower degree and higher LCC are more likely to have a smaller BWC. A plot of Closeness Centrality (ClC) vs. LCC reveals that the two metrics are almost independent of each other (as vertices covering the entire range of values observed for the ClC have almost the same LCC), leading to a Spearman's rank-based correlation coefficient of 0.52.
The distribution of the eccentricity of the vertices shows that the minimum value (also called the radius), 5, is half of the maximum value (also called the diameter), 10. Nevertheless, we observe that more than 65% of the vertices have an eccentricity of 8 or above (i.e., more than 65% of the vertices have a maximum shortest path length of 8-10 to one or more vertices) and only 4% of the 49 vertices (i.e., just 2 vertices) incur eccentricity values corresponding to the radius of the graph. The two states of West Virginia and Ohio (with an eccentricity corresponding to the radius) are said to form the "center" of the graph (Newman, 2010); each of these two vertices is within a maximum hop count of 5 on a shortest path to any other vertex in the graph. Note that neither of these two vertices is ranked in the top 3 with respect to any of the centrality metrics or the local clustering coefficient. There are five states (Arizona, California, Maine, Montana and North Dakota) that have an eccentricity corresponding to the diameter of the graph. Table 9 illustrates a ranking of the vertices based on eccentricity (the state with the smallest eccentricity is ranked first).
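The definitions of LCC, eccentricity, radius, diameter and center used above can be illustrated with a short sketch. It assumes the graph is connected and stored as a dictionary of neighbor sets; the function names and the five-vertex toy graph are hypothetical, and assigning an LCC of 0 to a vertex with fewer than two neighbors is an assumption made here, not something the text states.

```python
from collections import deque

def local_clustering_coefficient(graph, v):
    """Actual links among the neighbors of v divided by ki(ki-1)/2."""
    nbrs = list(graph[v])
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumption: LCC taken as 0 with fewer than two neighbors
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in graph[nbrs[i]])
    return links / (k * (k - 1) / 2)

def eccentricity(graph, source):
    """Largest shortest-path hop count from source (connected graph assumed)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:  # breadth-first search gives hop counts on unweighted graphs
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())

# Hypothetical toy graph: a triangle a-b-c with a tail c-d-e.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
lcc = {v: local_clustering_coefficient(toy, v) for v in toy}
ecc = {v: eccentricity(toy, v) for v in toy}
radius, diameter = min(ecc.values()), max(ecc.values())
center = sorted(v for v in toy if ecc[v] == radius)
```

In the toy graph, vertex d lies on every shortest path between e and the triangle and has an LCC of 0, which mirrors the argument that a low LCC marks a vertex its neighbors depend on for shortest path communication.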

Network-Level Metrics
In this section, we evaluate the following network-level metrics for the US States graph: Bipartivity Index; Degree Metrics (Average, Standard Deviation, Kurtosis and Spectral Radius Ratio); Algebraic Connectivity; Assortativity Index; and Modularity. We also determine the size of the Minimum Connected Dominating Set of vertices based on the four centrality metrics (DegC, BWC, EVC and ClC).

Bipartivity Index
A graph is bipartite (a.k.a. 2-colorable) if the vertices of the graph can be partitioned into two disjoint sets such that every edge in the graph connects a vertex in one partition to a vertex in the other partition, and there are no edges between vertices within a partition (Cormen et al., 2009). The two partitions are determined using the sign of the entries in the eigenvector corresponding to the smallest eigenvalue of the binary adjacency matrix of the graph (Estrada & Rodriguez-Velazquez, 2005); the vertices with positive entries are grouped into one partition and the vertices with negative entries are grouped into the other partition. Figure 9 displays the US States graph with the states colored in yellow or green to represent the two partitions.
A measure called the bipartivity index (Estrada & Rodriguez-Velazquez, 2005) has been proposed in the literature to determine the extent of bipartivity of complex network graphs. The bipartivity index of a graph is computed using the eigenvalues of the binary adjacency matrix of the graph. The bipartivity index values could range from 0 to 1; a bipartivity index of 1 implies that all the edges in the graph connect vertices across the two partitions. However, there exist several real-world network graphs in which a few edges (called frustrated edges) connect vertices within the same partition, though a majority of the edges connect vertices across the two partitions (Estrada & Rodriguez-Velazquez, 2005). Graphs with one or more frustrated edges have a bipartivity index less than 1, and graphs with no frustrated edges have a bipartivity index equal to 1 (Estrada & Rodriguez-Velazquez, 2005). While graphs with no frustrated edges have been referred to as truly bipartite, graphs with frustrated edges have been referred to as close-to-bipartite (Estrada & Rodriguez-Velazquez, 2005). The bipartivity index of the US States graph has been observed to be 0.66, and the fraction of frustrated edges in the network is 0.32. Though the bipartivity index value is not that close to 1, it is still larger than the values observed for several of the real-world networks in the literature (Estrada & Rodriguez-Velazquez, 2005).
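As a hedged illustration of an eigenvalue-based bipartivity index, the sketch below assumes the index is the ratio of the sum of cosh(λj) to the sum of exp(λj) over the eigenvalues λj of the binary adjacency matrix; the text above does not spell out the formula, so this choice is an assumption. Under it, the value cannot fall below 0.5, because cosh(x) ≥ exp(x)/2 for every x. To stay within the standard library, the sketch avoids an eigen-solver and uses the identity trace(A^k) = Σ λj^k with a truncated power series; the names `mat_mul`, `bipartivity_index` and the `terms` parameter are hypothetical.

```python
from math import factorial

def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def bipartivity_index(adj, terms=60):
    """Approximate sum(cosh(l_j)) / sum(exp(l_j)) over adjacency eigenvalues.

    sum(exp(l_j))  = sum over all k of trace(A^k) / k!
    sum(cosh(l_j)) = the same series restricted to even k
    The series is truncated after `terms` powers, which is ample for small,
    sparse graphs whose largest eigenvalue is modest.
    """
    n = len(adj)
    power = [[1 if i == j else 0 for j in range(n)] for i in range(n)]  # A^0
    even_sum = 0.0
    total_sum = 0.0
    for k in range(terms):
        term = sum(power[i][i] for i in range(n)) / factorial(k)
        total_sum += term
        if k % 2 == 0:
            even_sum += term
        power = mat_mul(power, adj)
    return even_sum / total_sum

# Hypothetical examples: a triangle (odd cycle) and a 4-cycle (bipartite).
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
square = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]
b_triangle = bipartivity_index(triangle)
b_square = bipartivity_index(square)
```

A bipartite graph has no closed walks of odd length, so every odd-power trace vanishes and the ratio equals 1; the triangle, an odd cycle, scores lower.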

Degree Metrics
From Figu corroborat of the two (all indica Nevzorov, 1998).For or below 3 spectral ra the graph a node degr metric is 1 node degre not as clos network w variation d

Algebraic Connectivity
The algebr value of th network.T matrix of low value entries in t

Assortativity Index
The assort graph is a (Newman, product-m the edges i (Newman, graph hav assortative

Normal
As there a

Configuration Model-Based Analysis
Given the degree sequence of a real-world network, the Configuration model could be used to generate a random network whose degree sequence is the same as that of the real-world network (i.e., the random network could even have a non-Poisson degree distribution if the corresponding real-world network has one; Meghanathan, 2016c). In this paper, we use the Configuration model to study whether the degree sequence of the US States network graph (a real-world network) would be sufficient to generate a random network whose node-level metrics and network-level metrics exhibit strong correlation or proximity with the values incurred for these metrics in the corresponding real-world network.
Let N and L be, respectively, the number of nodes and edges in the chosen real-world network of study (like the US States network graph). Given the degree sequence (D) for the chosen real-world network, we simulate the generation of a random network per the Configuration model as follows. We create a list LD whose length is the sum of the node degrees: the list is initialized with node IDs, and the number of times a node ID appears in the list equals the degree of the node in D. The list LD is shuffled. We then proceed in iterations (to generate the random network), traversing the list LD in the reverse direction (i.e., with index j from |LD| down to 2). In each iteration, we generate an edge (for the random network) between the vertex at index j in the list LD and a vertex at a randomly chosen index i (i < j) when the following conditions are met: (i) neither of the two entries is -1, (ii) the two vertices are not the same (to avoid a self-loop) and (iii) there does not already exist an edge between the two vertices in the random network. The entries at both the indexes i and j are then set to -1. For the node-level metrics, we measured the Pearson's product-moment correlation coefficient (Triola, 2012) between the values incurred for the nodes in each of the 100 instances of the random networks and the actual real-world network, and averaged the correlation coefficient values (shown in Table 13 in the decreasing order of the correlation coefficient values).
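The list-based generation procedure described above can be sketched as follows. The description does not say what happens when a randomly chosen index i fails one of the three conditions, so this sketch assumes a bounded number of retries (the hypothetical `max_tries` parameter), after which the entry at index j is left unmatched; as a result, a generated instance may fall slightly short of the target degree sequence. The function name and the `seed` parameter are likewise hypothetical.

```python
import random

def configuration_model(degrees, max_tries=100, seed=None):
    """Sketch of the list-based procedure; node IDs must differ from -1.

    degrees: {node_id: degree}. Returns a set of frozenset({u, v}) edges.
    """
    rng = random.Random(seed)
    # The list LD: each node ID appears as many times as its degree.
    stubs = [v for v, d in degrees.items() for _ in range(d)]
    rng.shuffle(stubs)
    edges = set()
    for j in range(len(stubs) - 1, 0, -1):   # reverse traversal (0-based)
        if stubs[j] == -1:
            continue                          # entry already consumed
        for _ in range(max_tries):            # assumption: bounded retries
            i = rng.randrange(j)              # randomly chosen index i < j
            u, v = stubs[i], stubs[j]
            if u == -1 or u == v or frozenset((u, v)) in edges:
                continue                      # violates condition (i), (ii) or (iii)
            edges.add(frozenset((u, v)))
            stubs[i] = stubs[j] = -1          # mark both entries as used
            break
    return edges

# Hypothetical usage: four nodes, each with a target degree of 2.
target = {0: 2, 1: 2, 2: 2, 3: 2}
random_edges = configuration_model(target, seed=1)
realized = {v: sum(1 for e in random_edges if v in e) for v in target}
```

Comparing `realized` with `target` shows whether a particular instance reproduced the degree sequence exactly; instances that did not could be regenerated with a different seed.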
We adapt the ranges of correlation coefficient values (rounded to two decimals) proposed in the literature (Evans, 1995) to decide on the level of correlation. We observe a very strong positive correlation (range: 0.80...1.00) in the case of the degree centrality (as expected) and closeness centrality metrics, and a strong positive correlation (range: 0.60...0.79) in the case of the eigenvector centrality and betweenness centrality metrics. On the other hand, we observe a moderate positive correlation (range: 0.40...0.59) in the case of eccentricity, and a weak positive correlation (range: 0.20...0.39) in the case of maximal clique size and local clustering coefficient.
For each network-level metric, we averaged the results obtained with the 100 instances of the random networks and compared this average value with the value incurred for the actual US States network graph (shown in Table 14). Except for the degree-based edge assortativity and the spectral radius ratio for node degree, we do not observe the average values obtained for the random networks generated using the Configuration model to be close to the values obtained for the actual US States network graph. We observe the random network instances to be relatively more bipartite, more robust to disconnection and more modular. We also observe the random network instances to have a relatively smaller diameter and a smaller average path length between any two nodes. As expected of a random network, we also observe the edges to be very weakly assortative with respect to all four centrality metrics for the random networks generated using the Configuration model; on the other hand, we observe the edges to be strongly assortative with respect to the eigenvector and closeness centrality metrics for the actual US States network graph.
Thus, based on the results obtained for the node-level metrics, we could conclude that the degree sequence of the US States network graph would be sufficient to generate random network instances that exhibit strong to very strong positive levels of correlation with respect to all four centrality metrics. On the other hand, with respect to the other node-level metrics (like Eccentricity, Maximal Clique Size and Local Clustering Coefficient) as well as all the network-level metrics (other than the degree-based edge assortativity and the spectral radius ratio for node degree), we could conclude that the degree sequence of the US States network graph alone would not be sufficient to generate random network instances that exhibit comparable values for these metrics.
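The correlation levels quoted above can be applied mechanically, as in the following sketch. It computes the Pearson's product-moment correlation coefficient for two equal-length lists of per-node values and maps the result, rounded to two decimals, to the ranges listed in the text. The label "very weak" for values below 0.20 is an assumption, since that range is not listed above, and the function names are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)  # undefined if either sequence is constant

def correlation_level(r):
    """Map the magnitude of r (rounded to two decimals) to a level."""
    magnitude = round(abs(r), 2)
    if magnitude >= 0.80:
        return "very strong"
    if magnitude >= 0.60:
        return "strong"
    if magnitude >= 0.40:
        return "moderate"
    if magnitude >= 0.20:
        return "weak"
    return "very weak"  # assumption: range below 0.20 is not listed in the text

# Hypothetical per-node values in a real network and in one random instance.
real_values = [1, 2, 3, 4, 5]
random_values = [2, 1, 4, 3, 5]
r = pearson(real_values, random_values)
level = correlation_level(r)
```

For the study described here, such a coefficient would be computed once per random instance and then averaged over the instances before the level is read off.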

Related Work
Very few works have been conducted on network graphs related to the US. We review these works below.
Fogarty et al. (2008) conducted a network analysis-based study on the hurricanes that made landfall in the US from 1851 to 2008. A set of 23 non-overlapping regions (nodes) of the US that were affected by at least one hurricane were identified; two nodes were linked with an edge if at least one hurricane impacted the regions corresponding to both of them. One of the interesting conclusions from this study was that regions (like Louisiana) with a high occurrence rate of hurricanes had a low connectivity with the rest of the regions; on the other hand, regions with high connectivity (like Virginia) had a low occurrence rate. Several similarities have been observed between the hurricane landfall network of Fogarty et al. (2008) and the US States network graph studied in this paper. For both networks, the betweenness centrality metric exhibited a power-law distribution and the closeness centrality metric exhibited a uniform distribution with a narrow range of values. While the average local clustering coefficient of the nodes in the landfall network was 0.46, the average local clustering coefficient of the nodes in the US States network graph is slightly larger (0.52). The diameter values for the two network graphs are proportional: we observe a diameter of 10 for the US States network graph of 49 nodes and a diameter of 5 for the landfall network of 23 nodes. However, the two networks differ with respect to the degree centrality metric: we observe a clear bi-modal degree distribution for the US States network graph, and no such distinct distribution could be attributed to the degree centrality metric in the landfall network. Though the hurricane landfall network and the US States network share several similarities (as mentioned above), it must be remembered that the hurricane landfall network was constructed by cumulatively considering the landfall of hurricanes over a longer period of time (about 150 years). We anticipate the results for the node-level and network-level metrics to differ appreciably for the two networks if the landfall network is constructed for a particular year or over a shorter time period.
Lin et al. (2014) conducted a network analysis of food flows within the US and had the following results: the distributions for the degree centrality and betweenness centrality were observed to be normal and Weibull (Balakrishnan & Nevzorov, 2003) in nature, respectively. A power-law relationship (Balakrishnan & Nevzorov, 2003) existed between the degree centrality and betweenness centrality metrics, indicating a vulnerability to the disturbance of key nodes. On the other hand, we did not observe a power-law relationship between degree and betweenness centrality for the US States network graph; even vertices with moderate to high degree had a low betweenness centrality.
Lyte et al. (2015) conducted a citation network-based analysis of the different sections that fall under the 52 titles of the United States Code; each section is a node, and there exists a directed edge from one section to another section if the former cites the latter. The betweenness and eigenvector centrality metrics were used in this study to identify major pathways of references from one section to another. The modularity-based Louvain community detection algorithm (Blondel et al., 2008) was used to identify communities of sections that had similarities with respect to concepts and codes. It was observed that though sections under two or more related titles formed a single community, most of the communities detected were a collection of sections under a particular title. For the US States network graph, the communities detected using the Louvain algorithm were similar to the regional divisions used by the United States Census Bureau.
Cheung and Gunes (2012) conducted a complex network analysis study of the US air transportation network as of 2011 and compared it with the networks that existed in 1991 and 2001. Their study revealed no major changes in the features (like centrality and connectivity of the airports) of the air transportation networks as they evolved with time (with an increase in the number of airports and flight connections). A critical finding from the study was that the US air transportation network of 2011 is more vulnerable to airport closures than it was in the past. The degree distribution of the 2011 US air transportation network only partially follows a power law (i.e., the distribution exhibited a power law only for degree values greater than 1), unlike the world-wide air transportation network, which follows a power law starting from a degree value of 1 (Guimera, 2005). Random network instances (generated using the Configuration model) of the US States network graph exhibited strong positive correlation with respect to the centrality metrics, but were observed to be relatively more bipartite, modular and robust to disconnection.

Summary and Conclusions
Our high-level contribution in this paper is to illustrate complex network analysis of a connected graph of the states within a country at the node level and network level, as well as to propose a normalization-based approach to comprehensively rank the vertices (more likely to be tie-free) in a network graph based on the centrality metrics. We implemented the algorithms to compute a suite of node-level and network-level metrics and ran them on the US States network graph. We summarize the results and key observations as follows:
(i) The state of Missouri is the top-ranked node with respect to all the commonly studied centrality metrics: the degree, betweenness, closeness and eigenvector centralities. This is corroborated by several airlines (like American Airlines, Southwest Airlines, etc.) choosing cities in Missouri as primary hubs over the past two decades.
(ii) The degree distribution appears to mimic a bi-modal Poisson distribution, while the betweenness centrality (BWC) exhibits a power-law style distribution.
(iii) There exists a maximum clique of size 4 involving the states of Arizona, Colorado, New Mexico and Utah; the rest of the states (except Maine) are part of maximal cliques of size 3.
(iv) The state of Idaho has the lowest non-zero local clustering coefficient, indicating that it is the most critical state with respect to facilitating communication between its neighboring states.
(v) The radius, diameter and average path length are 5, 10 and 3.94 respectively. The states of Ohio and West Virginia form the "center" of the graph with an eccentricity corresponding to the radius of the graph (these states are at most 5 hops away from any other state in the graph). The states of Arizona, California, Maine, Montana and North Dakota have an eccentricity corresponding to the diameter of the graph (these states could be as many as 10 hops away from one or more states in the graph). More than 65% of the vertices have an eccentricity of 8 or above.
(vi) The bipartivity index of the graph is 0.66 with 32% frustrated edges.
(vii) The algebraic connectivity of the network graph is 0.0973 (indicating low robustness) and the spectral radius ratio for node degree is 1.24 (moderately high for a Poisson network, corroborating the bi-modal degree distribution of the vertices).
(viii) The modularity score of the graph is 0.58 with a total of six non-overlapping communities of states, closely resembling the regional classification of the states.
(ix) The network has been observed to be relatively more assortative with respect to the eigenvector and closeness centralities, whereas the degree-based and BWC-based approximations to the minimum connected dominating set are of the smallest size.
(x) The Configuration model-based study of the US States network graph indicated that the degree sequence alone was sufficient to generate random network instances that exhibited strong to very strong levels of positive correlation for the centrality metrics, but the degree sequence was not sufficient to observe such a strong correlation for the other node-level metrics or comparable values for the network-level metrics. The random network instances of the US States network graph were observed to be relatively more robust to network disconnection, more bipartite and more modular.
Thus, even though it might look like some states have a common border by chance (especially if the common border is over a smaller area), the above results (especially those from the assortativity analysis and the Configuration model-based study) indicate that the network of US states is very much different from a random network.
We have also proposed a normalization-based approach to arrive at a (possibly tie-free) ranking of the vertices based on their comprehensive centrality scores, determined as a weighted average of the normalized scores of the individual centrality metrics. We also show how to identify the centrality metric whose normalized individual scores and ranking of the vertices are relatively the closest to the normalized comprehensive centrality (NCC) scores and the ranking of the vertices based on the NCC scores. Considering the results plotted in Figures 12-(a) through 12-(d) and Figures 13-(a) through 13-(d), it appears that the Eigenvector Centrality metric (that consistently incurs the second smallest RMSD values with respect to both the normalized centrality scores and the numerical ranking of the vertices) could be relatively the best metric that could be used to obtain a comprehensive centrality-based ranking of the vertices in the US States network graph. A similar approach could be used to identify a candidate centrality metric for a comprehensive centrality-based ranking of the vertices in other real-world network graphs and in synthetic graphs generated from theoretical models.
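The weighted-average and RMSD computations mentioned above can be sketched as follows. Two details are assumptions made only for illustration, because the surviving text does not specify them: each metric is normalized by scaling its scores to unit Euclidean norm, and the weights default to equal values. All function names and the two-vertex example scores are hypothetical.

```python
from math import sqrt

def normalize(scores):
    """Assumption: scale one metric's scores to unit Euclidean norm."""
    norm = sqrt(sum(s * s for s in scores.values()))
    return {v: s / norm for v, s in scores.items()}

def comprehensive_scores(metrics, weights=None):
    """Weighted average of the normalized scores of each metric.

    metrics: {metric_name: {vertex: raw_score}}, same vertices in each.
    """
    names = list(metrics)
    if weights is None:
        weights = {m: 1.0 / len(names) for m in names}  # assumption: equal weights
    normalized = {m: normalize(metrics[m]) for m in names}
    vertices = list(metrics[names[0]])
    ncc = {v: sum(weights[m] * normalized[m][v] for m in names)
           for v in vertices}
    return normalized, ncc

def rmsd(a, b):
    """Root mean square difference between two score dictionaries."""
    return sqrt(sum((a[v] - b[v]) ** 2 for v in a) / len(a))

# Hypothetical raw scores for two vertices under two metrics.
example = {"deg": {"a": 3.0, "b": 4.0}, "other": {"a": 4.0, "b": 3.0}}
normalized, ncc = comprehensive_scores(example)
gaps = {m: rmsd(normalized[m], ncc) for m in example}
closest_metric = min(gaps, key=gaps.get)
```

The metric with the smallest RMSD against the comprehensive scores is the one whose individual scores track them most closely; the same comparison can be repeated on rank positions instead of scores.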
To the best of our knowledge, we have not come across a paper that comprehensively analyzes a suite of node-level and network-level metrics for any real-world network, especially one based on the states within a country. The approach taken and the metrics evaluated in this paper could have several applications. For example, we could identify the most central states as well as the states that could form a connected backbone that is geographically well-connected to the rest of the states within a country, and use this information to design the road/rail transportation networks; we could also identify the states that could be clustered into a particular geographical region within a country and use this information for region-based analysis. For countries with a reasonably large area and an appreciable number of states, each state (except those in the corners of the country) typically shares a border with a similar number of states. Hence, we anticipate the distribution of values for the node-level metrics to be about the same for several other countries too. We thus opine that this paper could serve as a model for anyone interested in analyzing a connected graph of the states within a country from a Network Science perspective.

Figure 2
Figure 9
Figure 13. and the individualized scores and ranking of the vertices is relatively the closest to the normalized comprehensive centrality (NCC) scores and the ranking of the vertices based on the NCC scores. Considering the results plotted in Figures 12-(a) through 12-(d) and Figures 13-(a) through 13-(d)

Table 1. List of Contiguous States (including DC) of the US in Alphabetical Order

Table 8. Ranking of the Vertices in the US States Network Graph based on Local Clustering Coefficient (LCC)

Table 8 ranks the vertices in the US States graph in the increasing order of the values of the LCC. As the LCC values get larger, we observe a significant number of ties among the vertices. The state of Idaho (with a degree of 6) has the lowest LCC and hence is the top ranked with respect to the LCC metric. The state of Missouri (that was ranked first with respect to all the four centrality metrics) is ranked second with respect to LCC. There are only nine unique values for the LCC metric.

Table 9. Ranking of the Vertices in the US States Network Graph based on Eccentricity (Ecc)

Table 12. Root Mean Square Difference (RMSD) Values obtained for the Node-Level Distribution of the Normalized Centrality Scores and the Ranking of the Vertices based on the Normalized Scores vis-a-vis the Normalized Comprehensive Centrality (NCC) Scores

Table 13. Correlation of the Node-Level Metrics for the US States Network Graph and its 100 Instances of Random Networks (with the same Degree Sequence) Generated using the Configuration Model

Table 14. Correlation of the Node-Level Metrics for the US States Network Graph and its 100 Instances of Random Networks (with the same Degree Sequence) Generated using the Configuration Model

We generate 100 instances of random networks for the US States network graph according to the Configuration model and measure the following node-level metrics: (i) Degree Centrality, (ii) Eigenvector Centrality, (iii) Betweenness Centrality, (iv) Closeness Centrality, (v) Maximal Clique Size, (vi) Local Clustering Coefficient and (vii) Eccentricity; and network-level metrics: (i) Assortativity Index of the edges based on each of the four centrality metrics, (ii) Spectral Radius Ratio for Node Degree, (iii) Average Path Length, (iv) Diameter, (v) Bipartivity Index, (vi) Algebraic Connectivity and (vii) Modularity score determined using the Louvain algorithm. In the case of the node-level metrics, we measured the Pearson's product-moment correlation coefficient.