Statistical Analysis to Bitcoin Transactions Network

There is abundant scientific evidence that the recent rapid growth in Bitcoin transactions, together with the interest of the public, followed the sharp rise in the price of Bitcoin in December 2017. As a consequence, a new dataset has emerged in the research community. The aim of the present investigation was therefore to analyze the data in this newly emerging dataset. After the data were extracted, converted to a network and segmented, the studied variables were analyzed in two parts, namely statistical network analysis and economic activity analysis. Statistical network analysis was employed to analyze holistically the complex systems of modern times that are represented as networks, since analyzing them only partially can lead to incorrect conclusions. In addition, the analysis of economic activity, which relies on indicators from the stock market and from economic science, was used after it had been transferred to and matched with the economic model represented by Bitcoin. The results show the extent of the data generated by the statistical network analysis and the economic activity analysis. With respect to the data presented, we established that the daily transaction networks were scale-free networks that did not evolve like ER random networks and did not have the small-world property. It was also demonstrated that daily transaction networks cannot be reproduced in a random way like ER random networks. Finally, the opportunities and problems encountered in conducting the present research are briefly presented.

The analysis was carried out in the programming language R (ver 3.5.2). Because of the large number of links, it was decided not to analyze and represent the transactions network as a single network: it was doubtful whether conclusions could be drawn given the heterogeneity in the evolution of the network, and the heavy computing requirements would limit the capabilities of the iGraph library used for the statistical network analysis (Androulaki et al., 2013; Kondor et al., 2014b). It was finally decided to split the transactions network into daily transactions networks and to perform the statistical analysis on each of them. The results for each daily transactions network are then combined in calendar order to create a time series of results (Khatri, Y., 2019; Redman, J., 2019; Aki, J., 2019; Madore, P., 2019).
The analysis yielded 182 time series of 2763 observations each from the daily transactions networks, their giant components and the ER random networks. In the fourth stage of the analysis, an attempt will be made to predict the price of Bitcoin using the 182 time series in various combinations of artificial neural networks.

Analysis of the Txedges Data Set
The team that implemented the bitcoind-dump-tsv data mining tool has also implemented the txedges tool, which creates a directed weighted network in the form of a list of directed weighted connections (Kalodner et al., 2017; Day et al., 2018; Day et al., 2019; Kondor et al., 2015). The extracted txedges.dat file, of size 115 gigabytes, contains the transaction identifier txID, the ingoing address identifier in_addrID, the outgoing address identifier out_addrID, and the weight field, which is the total satoshis of the transaction. Coinbase transactions are not included (Kondor, D. 2018).
The daily price of Bitcoin in dollars also had to be obtained. It is provided by almost all exchanges, but because prices differ between exchanges, it was decided to use the list from the coindesk website, which can be considered reliable as it does not manage cryptocurrencies in any way.
The analysis was done with the R language, whose integer type is limited to 32 bits. The txedges.dat file has 2536261805 connections, which exceeds the largest 32-bit signed integer, 2^31 - 1 = 2147483647, so the file could not be read in R as a whole. Because of this limitation, the txedges.dat file was split into 2 smaller files, each of which could be read in R.
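The size limit described above can be checked with a little arithmetic. The following is a minimal sketch in Python (used here only for illustration, the analysis itself was done in R); the helper name chunk_sizes and the nearly even two-way split are assumptions, since the text states only that the file was split into 2 smaller files.

```python
# Largest value of a 32-bit signed integer, the limit of R's integer type.
R_INT_MAX = 2**31 - 1  # 2147483647

# Number of connections in txedges.dat, as reported in the text.
N_CONNECTIONS = 2536261805


def chunk_sizes(n_rows, n_chunks):
    """Split n_rows into n_chunks nearly equal parts (hypothetical helper)."""
    base, extra = divmod(n_rows, n_chunks)
    return [base + 1 if i < extra else base for i in range(n_chunks)]


# The whole file exceeds the limit, but each half stays below it.
exceeds_limit = N_CONNECTIONS > R_INT_MAX
sizes = chunk_sizes(N_CONNECTIONS, 2)
each_part_fits = all(size <= R_INT_MAX for size in sizes)
```

Two parts are the smallest number for which every part stays below the limit, which is consistent with the split reported above.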
Also, because of the very large number of connections, it was decided not to analyze the transaction network as a single network. Instead, it was cut into daily transaction networks, and the statistical analysis is done for each daily transaction network. The results for each daily transaction network are then combined in chronological order, creating a time series for each result.

Data Completion and Segmentation of the Transaction Network
Some fields from the bitcoind-dump-tsv data set had to be added to the list of directed weighted connections to make the analysis more complete. The blockID field from the tx.dat file and the block_timestamp field from the bh.dat file were added to the connections of the txedges.dat file, matched through the txID and blockID fields respectively. Through the block_timestamp field, the connections now carry the date and time of the block to which they belong; the time of day was then removed so that all connections of the same day share the same value.
When the usd field from the BTC_USD-CoinDesk.csv file was added, it was observed that some connections predate the first day on which a price was recorded for Bitcoin; these connections were excluded from the analysis. The remaining connections were matched through the block_timestamp field of the connections list and the timestamp field of the BTC_USD-CoinDesk.csv file.
The weight field, which contains all the satoshis of each transaction, was converted to bitcoin and then multiplied by the price of Bitcoin to create the weight2USD field, which holds the total of the transaction in dollars.
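The conversion behind the weight2USD field can be sketched as follows. This is an illustrative Python sketch, not the R code used in the analysis; the function name weight_to_usd and the sample values are assumptions. It relies only on the fact that 1 bitcoin equals 100,000,000 satoshis.

```python
from decimal import Decimal

SATOSHI_PER_BTC = Decimal(100_000_000)  # 1 bitcoin = 100,000,000 satoshis


def weight_to_usd(weight_satoshi, usd_per_btc):
    """Convert a transaction weight in satoshis to its value in dollars."""
    btc = Decimal(weight_satoshi) / SATOSHI_PER_BTC
    return btc * Decimal(str(usd_per_btc))


# Made-up values for illustration: 2.5 BTC at 4000.50 USD per bitcoin.
example_usd = weight_to_usd(250_000_000, "4000.50")
```

Decimal arithmetic is used here to avoid floating-point rounding on very small or very large amounts; plain floating-point numbers would also work for a rough estimate.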
With the field completion finished, the list of directed weighted connections has the fields block_timestamp, blockID, txID, in_addrID, out_addrID, weight, usd and weight2USD. To close the first stage of the analysis, the connection list was cut into 2763 daily connection lists.
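The field completion and the daily segmentation can be sketched as a lookup followed by a grouping step. The Python sketch below is only an illustration under stated assumptions: block timestamps are taken to be Unix times in seconds interpreted in UTC, and the function name split_by_day, the dictionary layout and the sample records are hypothetical.

```python
from collections import defaultdict
from datetime import date, datetime, timezone


def split_by_day(connections, tx_to_block, block_to_timestamp):
    """Attach blockID and the block date to each connection, then group by date."""
    daily = defaultdict(list)
    for conn in connections:
        block_id = tx_to_block[conn["txID"]]  # lookup as in tx.dat
        ts = block_to_timestamp[block_id]     # lookup as in bh.dat (Unix seconds, assumed)
        # Drop the time of day so that all connections of a day share one value.
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date()
        daily[day].append({**conn, "blockID": block_id, "block_timestamp": day})
    return dict(daily)


# Toy records for illustration only.
connections = [
    {"txID": 1, "in_addrID": 10, "out_addrID": 11, "weight": 5000},
    {"txID": 2, "in_addrID": 11, "out_addrID": 12, "weight": 7000},
    {"txID": 3, "in_addrID": 12, "out_addrID": 10, "weight": 1000},
]
tx_to_block = {1: 100, 2: 100, 3: 101}
block_to_timestamp = {100: 1231006505, 101: 1231469665}

daily = split_by_day(connections, tx_to_block, block_to_timestamp)
```

Each key of the result is one calendar day, and each value is the connection list from which that day's transaction network is built.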

Application of Descriptive Analysis in Daily Transaction Networks
The second stage of the analysis is applied sequentially to all daily connection lists. Initially, connections whose in_addrID or out_addrID equals -1 are removed, as this address is not real and may affect the results. Then the number of unique incoming and outgoing addresses is counted.
Next, the daily transaction network and its giant component are created with the iGraph library, with the in_addrID field values as start nodes and the out_addrID field values as end nodes; the connections carry only the value of the weight field. An Erdos-Renyi random network with multiple connections and loops was also built, with the same number of nodes and connections as the daily transaction network.
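The construction steps above can be sketched without a graph library. The Python sketch below is an illustration and not the iGraph code used in the analysis. It assumes that the giant component is the largest weakly connected component, and it draws the random network by picking both endpoints of every connection uniformly at random, so loops and multiple connections can occur. All function names and the sample connections are hypothetical.

```python
import random
from collections import Counter


def drop_unreal_addresses(edges):
    """Remove connections whose start or end address is -1."""
    return [e for e in edges if e[0] != -1 and e[1] != -1]


def giant_component(edges):
    """Vertex set of the largest weakly connected component (union-find)."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for src, dst, _weight in edges:
        parent[find(src)] = find(dst)

    sizes = Counter(find(v) for v in list(parent))
    root = sizes.most_common(1)[0][0]
    return {v for v in parent if find(v) == root}


def er_multigraph(n_vertices, n_edges, seed=None):
    """n_edges random directed connections; loops and repeats are allowed."""
    rng = random.Random(seed)
    return [(rng.randrange(n_vertices), rng.randrange(n_vertices))
            for _ in range(n_edges)]


# Toy daily connection list: (in_addrID, out_addrID, weight).
raw = [(1, 2, 5), (2, 3, 7), (3, 1, 1), (4, 5, 2), (-1, 6, 9)]
edges = drop_unreal_addresses(raw)
giant = giant_component(edges)

n_vertices = len({v for s, t, _ in edges for v in (s, t)})
random_edges = er_multigraph(n_vertices, len(edges), seed=0)
```

The random network has the same number of nodes and connections as the filtered daily network, which is what makes the later comparisons like for like.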
The number of nodes and connections, as well as multiple and loop connections, are counted in the networks. Then we calculate the density, the transitivity, the coefficient of complexity, the coefficient of similarity, the number of components, the connecting and linear number, the hub index with and without weights, the index of authorities with and without weights, the number of triangles and the number of cuts. In addition, the maximum number of Bitcoin, the maximum price in dollars, the total number of Bitcoin and the total price in dollars are calculated, as are the number of nodes without outgoing connections and the total number of Bitcoin and the total price in dollars of these nodes.
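A few of the simpler counts listed above can be sketched directly from a connection list. The Python sketch below is illustrative only: the function name basic_counts and the sample connections are hypothetical, and the density convention (number of connections divided by n(n - 1), the directed case without loops) is an assumption, since the text does not state which convention was used.

```python
from collections import Counter


def basic_counts(edges):
    """edges: list of (source, target, weight) tuples of a directed multigraph."""
    vertices = {v for s, t, _ in edges for v in (s, t)}
    sources = {s for s, _, _ in edges}
    pair_counts = Counter((s, t) for s, t, _ in edges)
    n, m = len(vertices), len(edges)
    return {
        "vertices": n,
        "edges": m,
        "loops": sum(1 for s, t, _ in edges if s == t),
        # Connections that repeat an already seen (source, target) pair.
        "multiple": sum(c - 1 for c in pair_counts.values()),
        # Assumed convention: directed density without loops.
        "density": m / (n * (n - 1)) if n > 1 else 0.0,
        # Vertices that never appear as the start of a connection.
        "no_outgoing": len(vertices - sources),
        "total_weight": sum(w for _, _, w in edges),
        "max_weight": max(w for _, _, w in edges),
    }


# Toy connection list for illustration only.
counts = basic_counts([(1, 2, 5), (1, 2, 3), (2, 3, 7), (2, 2, 1)])
```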
From the vectors of components, degrees, weights, strength, and rank with and without weights of the networks, the following values are kept: the minimum, the 1st quartile, the median, the mean, the 3rd quartile, the second largest value, the maximum, the standard deviation, the variance, the second mode, the mode and the coefficient of variation.
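The reduction of each vector to a fixed set of summary values can be sketched as below. This Python sketch is an illustration, not the R code of the analysis. The function name summarize is hypothetical, the quartile method ("inclusive" linear interpolation) is an assumption, the second mode is omitted for brevity, and the input is assumed to hold at least two values with a non-zero mean.

```python
import statistics


def summarize(values):
    """Summary statistics of a numeric vector (at least two values, mean != 0)."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    mean = statistics.fmean(ordered)
    sd = statistics.stdev(ordered)  # sample standard deviation
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "mean": mean,
        "q3": q3,
        "second_max": ordered[-2],
        "max": ordered[-1],
        "sd": sd,
        "variance": statistics.variance(ordered),
        "mode": statistics.mode(ordered),
        "cv": sd / mean,  # coefficient of variation
    }


# Toy degree vector for illustration only.
summary = summarize([1, 1, 2, 3, 10])
```

Applying the same function to every vector of every daily network yields one observation per day for each summary value, which is how the individual time series arise.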
Some important indicators could not be calculated because of the large number of connections: first, the algorithms that calculate these indicators need to pass through all the network connections several times, and second, these algorithms are not optimized in the iGraph library to run in parallel.

Analysis of Economic Activity in Daily Transaction Networks
In the third stage of the analysis, the results from the application of descriptive statistics will be used. The total amount of Bitcoin and the total value in dollars of the transaction network for each day is initially calculated.
Then the NVT index is calculated. A high NVT indicates that the total value of the network exceeds the value transmitted through the network. This can happen when the network is in high growth and investors value it as a high-yield investment, or alternatively when the price is in an unsustainable bubble (Woo, W. 2017).

$$NVT = \frac{NV}{DV} \qquad (1)$$
where $NV$ is the total sum of bitcoins in USD and $DV$ is the daily sum of bitcoins in USD.
The NVTS index, which is a derivative of NVT, is then calculated, with a greater emphasis on signal prediction than on price peaks.
$$NVTS = \frac{NV}{nMA(DV)} \qquad (2)$$
where nMA is the moving average over a window of the previous n days.
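The two indices can be sketched from two daily series: the total value of the network in dollars and the value transacted on that day in dollars. The Python sketch below is an illustration under assumptions, not the calculation used in the analysis. NVT is taken as the ratio of the two series, NVTS as the same ratio with the daily value replaced by a simple moving average, and the window is taken to cover the previous n observations including the current day. The function names and sample values are hypothetical.

```python
def trailing_mean(series, n):
    """Simple moving average over the last n observations, current one included."""
    out = []
    for i in range(len(series)):
        if i + 1 < n:
            out.append(None)  # not enough history yet
        else:
            window = series[i + 1 - n:i + 1]
            out.append(sum(window) / n)
    return out


def nvt(network_value_usd, daily_value_usd):
    """Ratio of total network value to the value transacted on each day."""
    return [nv / dv for nv, dv in zip(network_value_usd, daily_value_usd)]


def nvts(network_value_usd, daily_value_usd, n):
    """As nvt, but with the daily value smoothed by a trailing moving average."""
    smoothed = trailing_mean(daily_value_usd, n)
    return [None if ma is None else nv / ma
            for nv, ma in zip(network_value_usd, smoothed)]


# Made-up values for illustration only.
network_value = [100.0, 200.0, 300.0, 400.0]
daily_value = [10.0, 20.0, 30.0, 40.0]

nvt_series = nvt(network_value, daily_value)
nvts_series = nvts(network_value, daily_value, n=2)
```

The first n - 1 entries of the smoothed series are left undefined rather than padded, so that no later observation influences an earlier value.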
In addition, the PMR index is calculated. It is based on the laws of Metcalfe and Zipf, which describe the effect of unique daily addresses on communication networks (Metcalfe's law & Zipf's law, Wikipedia).
ln(DUA) = ln(UniqueDailyAddresses)
where DUA is the distinct number of daily vertices.
Also calculated is the NVM index, which describes the value of the network in relation to the maximum and minimum limits it presents and thus quantifies any overvaluation or undervaluation; the normalized NVM sets the limits from -1 to 1.
The NVTG index and two variants based on the Metcalfe and Zipf laws are then calculated. The NVTG index evaluates the cryptocurrency transaction network, measuring the ratio of the value of transactions to its growth (Arun, V. 2018a; Arun, V. 2018b).
For reasons of uniformity, some different choices were made in how the economic activity indicators are calculated, as the calculation methods are not fully clarified in the references presented. Moving averages are applied with half the horizon used in the references and only over previous observations; this choice was made because the references do not state whether the moving averages cover only previous observations or later ones as well. Also, only the simple moving average (the unweighted mean of the previous n data) was used, even in cases where an indicator originally used an exponential moving average, to make the results easier to compare (Farmakis and Makris, 2012).
Finally, a time series was created that indicates the trend of the price of Bitcoin compared to the previous day: the number 0 means that the price of Bitcoin remained the same or decreased, and the number 1 means that the price of Bitcoin increased. The probability that the price of Bitcoin remains exactly the same as on the previous day is almost zero, as all the decimal places of the price are compared.
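The trend series can be sketched in a few lines. The Python sketch below is illustrative; the function name price_trend and the sample prices are hypothetical. The first day has no previous day and therefore receives no label, so the result is one element shorter than the price series.

```python
def price_trend(prices):
    """1 if the price rose compared to the previous day, otherwise 0."""
    return [1 if today > yesterday else 0
            for yesterday, today in zip(prices, prices[1:])]


# Made-up prices for illustration only.
trend = price_trend([100.0, 105.0, 105.0, 90.0, 95.0])
```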

Presentation of Results
Initially, the most important results of the analyses of the descriptive statistics and the economic activity that took place in the daily transaction networks, in the ER random networks and in the giant components, which produced the initial time series, will be presented. There will be an empirical test of fit to the power law distribution, a comparison of the evolution of daily transaction networks with ER random networks, an investigation of the small world phenomenon, the similarity of vertex degrees and the correlation of the time series with the price of Bitcoin. Then we will present the results of the statistical analysis of the initial time series, which produced the high correlation time series, the time series with zero or almost zero variation and the principal component analysis of the time series.
Figure 2 shows that the 3rd quartile of the component sizes of the daily transaction networks is around 5 vertices, while the maximum component of the daily transaction networks has over 150,000 vertices. Such a degree of inhomogeneity is characteristic of the power law distribution (Mitzenmacher, 2003).
Figure 3 shows that the 3rd quartile of the vertex degrees (number of neighbors) of ER random networks is around 10, while the maximum degree of ER random networks is close to 25. Such a degree of inhomogeneity is not characteristic of the power law distribution (Mitzenmacher, 2003).
Figure 4 shows that the 3rd quartile of the component sizes of ER random networks is around 200,000 vertices, while the maximum component of ER random networks has close to 500,000 vertices. Such a degree of inhomogeneity is not characteristic of the power law distribution (Mitzenmacher, 2003).
Figure 8 shows that the 3rd quartile of the vertices rank of giant components is around 0.005, while the maximum vertices rank of giant components is close to 0.1. Such a degree of inhomogeneity is rather characteristic of the power law distribution (Mitzenmacher, 2003). Exactly the same applies to the weighted rank of the vertices of the giant components.

Comparison of the Evolution of Daily Transaction Networks With Erdos-Renyi Random Networks of Similar Size
The comparison of the number of components of the daily transaction networks with that of Erdos-Renyi random networks of similar size shows that the daily transaction networks develop in a different way. The correlation of the two time series is 0.47396 (Figure 9). A completely different way of evolving is observed when the most common degrees of daily transaction networks are compared with those of Erdos-Renyi random networks of similar size. The correlation of the two time series is only -0.03144 (Figure 10).
In the daily transaction networks, the time series of the coefficient of complexity and of the transitivity are quite different from each other and from the time series of the density (Figure 13).
In the ER random networks created for comparison, the values of the transitivity and the coefficient of complexity are quite close to the value of the density, which is to be expected. The correlation between the density and the transitivity is 0.93886. The correlation between the density and the coefficient of complexity is 0.95536. The correlation between the transitivity and the coefficient of complexity is 0.96574. These three correlations show that the values of the transitivity and the coefficient of complexity are close to each other and to the density values (Figure 12).
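The correlations between pairs of time series reported here can be sketched as follows. The text does not state which correlation coefficient was used, so the Pearson coefficient in this Python sketch is an assumption, and the function name pearson and the sample series are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up series for illustration only.
same_direction = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
opposite_direction = pearson([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0])
```

Values close to 1, such as those found between the density, the transitivity and the coefficient of complexity of the ER random networks, mean that the series rise and fall together.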
In ER random networks, the time series of the coefficient of complexity and of the transitivity are quite similar to each other and to the time series of the density.
The coefficient of similarity of the degree of daily transaction networks varies from a state of dissimilarity to a state of similarity (Figure 13). The coefficient of similarity of the degree of the giant components changes from the state of dissimilarity to the state of similarity more clearly (Figure 14). The coefficient of similarity of the degree of ER random networks tends to zero and absolute inhomogeneity (Figure 15).
Figure 23 shows low level correlation of the time series of descriptive statistics. Figure 24 shows low level correlation of the time series of descriptive statistics together with economic analysis.

Time Series With Zero or Almost Zero Variation
The number of time series of economic analysis with zero or almost zero variation is 1. This time series is er.vertices.no.outgoing.number. The number of time series for descriptive statistics with zero or almost zero variation is 11. The number of time series of descriptive statistics together with economic analysis with zero or almost zero variation is 12.
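Screening for time series with zero or almost zero variation can be sketched as a variance threshold. The Python sketch below is an illustration only: the text does not describe the criterion that was used, so the population variance, the threshold tol, the function name and the series names are all assumptions.

```python
import statistics


def near_zero_variance(series_by_name, tol=1e-12):
    """Names of series whose population variance is at most tol (assumed threshold)."""
    return sorted(name for name, values in series_by_name.items()
                  if statistics.pvariance(values) <= tol)


# Hypothetical series for illustration only.
flagged = near_zero_variance({
    "constant.series": [0.0, 0.0, 0.0, 0.0],
    "varying.series": [1.0, 2.0, 3.0, 4.0],
})
```

Series flagged in this way carry no information for a correlation or prediction step and are natural candidates for removal.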

Results - Discussion
The characterization of daily transaction networks as scale-free networks is considered an important result, as are the findings that daily transaction networks do not evolve like ER random networks and that they do not have the small-world property. More results could have been produced if various iGraph algorithms in R had not shown problems.
Also important is the holistic approach taken to the data analysis, which aimed to create as much data as possible by using the majority of the available tools and algorithms.

Time Series of Descriptive Statistical and Economic Activity
The empirical test of the fit of the degrees of the daily transaction networks and of the giant component to the power law distribution is positive. The conclusion is that the form of daily transaction networks and of the giant component is similar whether one focuses on specific parts or observes them on a larger scale. The corresponding empirical test for the ER random networks is negative, which confirms their difference. Daily transaction networks and their giant component appear to be scale-free networks, meaning that most transactions are done from a few addresses.
In addition, the empirical tests of the components of the daily transaction networks, and of the strength of the vertices and the weights of the edges of the giant components, against the power law distribution are positive, while the empirical tests of the rank of the vertices and the weighted rank of the vertices of the giant components tend to be positive (Mitzenmacher, 2003).
The daily transaction networks also evolve differently from ER random networks of similar size, meaning that daily transaction networks cannot be reproduced in a random way like ER random networks. It also seems that daily transaction networks do not have the small-world property, as the correlation between the transitivity and the clustering coefficient of daily transaction networks is very small.
As for the vertices similarity coefficient, there is an evolution over time. During the first few days of Bitcoin operation, high degree vertices are associated with smaller degree vertices, as the values of the similarity coefficient are negative. This is explained by the fact that in the early days of Bitcoin, mining rewarded large amounts of bitcoins. With the evolution of Bitcoin and the reduction of the mining reward, the coefficient of similarity of degree reverses and becomes positive: high degree vertices are now associated with similar vertices. The vertices similarity coefficient can be interpreted as the similarity of the trading habits that rich addresses have with each other; the same is true for poor addresses.
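The sign behaviour described above can be illustrated with a degree similarity (assortativity) coefficient computed as the Pearson correlation of the degrees at the two ends of every connection. The Python sketch below is an illustration under assumptions: it treats connections as undirected, it may differ in detail from the iGraph routine used in the analysis, and the function name and the toy networks are hypothetical.

```python
import math
from collections import Counter


def degree_similarity(edges):
    """Correlation of the degrees at both ends of each edge (undirected view)."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Count every edge in both directions so that the measure is symmetric.
    xs = [degree[u] for u, v in edges] + [degree[v] for u, v in edges]
    ys = [degree[v] for u, v in edges] + [degree[u] for u, v in edges]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)


# A hub connected to low degree vertices: high degree meets low degree.
star = degree_similarity([(0, 1), (0, 2), (0, 3)])

# A triangle plus a separate pair: every vertex meets vertices of equal degree.
like_with_like = degree_similarity([(0, 1), (1, 2), (0, 2), (3, 4)])
```

The star gives a negative coefficient, the pattern reported for the early days of Bitcoin, while the second network gives a positive one, the pattern reported for the later period.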
Examining the results of the time series correlations that do not include the Bitcoin price with the time series of the Bitcoin price, it is observed that there are no correlations less than -0.7. From the seven time series with the highest correlation, two are time series of the giant component and five are time series of the daily transaction networks. In addition, of the seven time series, one is for components, two are for edges and four are for vertices.

Statistical Analysis of Time Series
It is worth noting the problems that arose with various algorithms of iGraph, such as those for finding the diameter, the average distance, the ANND factor, the communities and many more. These problems were caused by the very large number of edges of the networks under examination and by the inability of these algorithms to work with parallel processing.
The results of these algorithms would have provided a wide range of quality data and, in turn, reduced the high correlation that occurred in the statistical analysis. They would also have contributed the most, as the information they provide is qualitatively similar to the information in the financial analysis.