Full Range Testing of the Small Size Effect Bias for Benford Screening: A Note

  •  Yan Bao    
  •  Frank Heilig    
  •  Chuo-Hsuan Lee    
  •  Edward Lusk    


Bao, Lee, Heilig, and Lusk (2018) documented and illustrated the Small Sample Size bias in Benford Screening of datasets for Non-Conformity. However, their sampling plan tested only a few random sample-bundles drawn from a core set of data that clearly Conformed to the Benford first digit profile. We extend their study, using the same core datasets and the same decision support system (DSS), the Newcomb Benford Decision Support Systems Profiler [NBDSSP], to create an expanded set of random samples from their core set. Specifically, we took repeated random samples in blocks of 10, stepping down in increments of 5% to 5% of the core set, and finished with random samples of 1%, 0.5%, and 20, thus creating 221 sample-bundles. This first arm focuses on the False Positive Signaling Error [FPSE], i.e., believing that the sampled dataset is Non-Conforming when it in fact comes from a Conforming set of data. The second arm uses the Hill Lottery dataset, which has been argued and tested to be Non-Conforming; we apply the same iteration model to test for the False Negative Signaling Error [FNSE], i.e., the NBDSSP failing to detect Non-Conformity in the sampled datasets, leading to the incorrect belief that the dataset is Conforming. We find a dramatic point on the sliding sampling scale, at about 120 sampled points, where the FPSE first appears, i.e., where data whose state of nature is Conforming are incorrectly flagged as Non-Conforming. Further, we find that the FNSE is very unlikely to manifest itself for the Hill dataset. These results demonstrate clearly that small datasets are indeed likely to create the FPSE, and that there should be little concern that Hill-type datasets will fail to be flagged as Non-Conforming. We discuss these results and their implications for audits in the Big-Data context, where the audit In-charge may find it necessary to partition the client's datasets.
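To make the screening idea concrete, the following is a minimal sketch of a first-digit conformity check applied to repeated random samples. It is not the NBDSSP: the tests that system applies are not described here, so a Pearson chi-square test at the 5% level is used as a stand-in, and the function names, the synthetic core data, and the sample sizes are all hypothetical. The sketch is not expected to reproduce the results reported above; it only illustrates how a flag rate over a block of 10 samples could be tallied.

```python
import math
import random
from collections import Counter

# Benford first-digit profile: P(d) = log10(1 + 1/d) for d = 1..9.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Chi-square critical value for 8 degrees of freedom at alpha = 0.05.
# Assumption: a chi-square screen stands in for the NBDSSP's own tests.
CRITICAL_5PCT_8DF = 15.507


def first_digit(x):
    """Return the leading nonzero digit of a nonzero number."""
    return int(f"{abs(x):.15e}"[0])  # scientific notation starts with that digit


def chi_square_stat(values):
    """Pearson chi-square distance between observed first digits and Benford."""
    counts = Counter(first_digit(v) for v in values if v != 0)
    n = sum(counts.values())
    return sum((counts.get(d, 0) - n * p) ** 2 / (n * p)
               for d, p in BENFORD.items())


def flagged_non_conforming(values):
    """True when the stand-in screen signals Non-Conformity."""
    return chi_square_stat(values) > CRITICAL_5PCT_8DF


def flag_rate(core, sample_size, repeats=10, seed=0):
    """Share of `repeats` random samples (without replacement) that are flagged."""
    rng = random.Random(seed)
    flags = [flagged_non_conforming(rng.sample(core, sample_size))
             for _ in range(repeats)]
    return sum(flags) / repeats


if __name__ == "__main__":
    # Hypothetical Conforming core set: log-uniform values follow Benford's law.
    rng = random.Random(42)
    core = [10 ** rng.uniform(0, 6) for _ in range(5000)]
    for size in (2500, 250, 50, 20):  # illustrative sizes only
        print(size, flag_rate(core, size))
```

For a Conforming core set, any flagged sample counts as an FPSE; running the same loop on a Non-Conforming core set and counting the unflagged samples would give the corresponding FNSE tally.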

This work is licensed under a Creative Commons Attribution 4.0 License.