This is a project to investigate the accuracy cost of small sample sizes when sampling from a categorical distribution.

Currently, the project implements only one very simple case: sampling from a distribution of evenly weighted categories and using the Jaccard index to evaluate how similar the sample distribution is to the known population distribution.
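The README does not spell out how the Jaccard index is applied to two distributions; one plausible reading is a weighted Jaccard similarity over category proportions. A minimal sketch in base R follows, with the function name, category count, and sample size all illustrative rather than taken from the script:

```r
# Weighted Jaccard similarity between two proportion vectors:
# sum of element-wise minima over sum of element-wise maxima.
# This formulation is an assumption, not necessarily the script's own.
jaccard_similarity <- function(p, q) {
  sum(pmin(p, q)) / sum(pmax(p, q))
}

n.buckets <- 10                                    # illustrative number of categories
population.props <- rep(1 / n.buckets, n.buckets)  # evenly weighted categories

draws <- sample(seq_len(n.buckets), size = 30, replace = TRUE)
sample.props <- tabulate(draws, nbins = n.buckets) / length(draws)

jaccard_similarity(sample.props, population.props)
```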

To run the project, run `R/sample_size_cost.R`, either from within an R REPL/IDE or from the command line (for example, with `Rscript R/sample_size_cost.R`).

There are two variables of interest that the user might want to set; they currently need to be set within the code (a sketch of doing so follows the list). They are:

- `bucket.counts`: the set of different distributions that will be sampled from
- `sample.sizes`: the set of different sample sizes to use when sampling from each distribution
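A minimal sketch of what setting them might look like near the top of `R/sample_size_cost.R`; the variable names come from the README, but the specific values, and the assumption that each is a plain numeric vector, are illustrative:

```r
# Assumed shapes: each entry of bucket.counts defines one evenly weighted
# distribution by its number of buckets; sample.sizes lists the sample
# sizes tried against every distribution. Values are examples only.
bucket.counts <- c(2, 5, 10, 20, 50)
sample.sizes  <- c(10, 50, 100, 500, 1000)
```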

The script creates a folder for each distribution within the `plots/` directory. There, for each sample size, it stores a histogram of similarity scores generated by sampling from the distribution 1000 times and comparing each sample to the distribution. The same folder also contains a plot called `errorbars.jpg`, which shows the mean and standard deviation of the similarity scores at each sample size for that distribution.
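As a rough sketch of that per-distribution loop (the 1000 repetitions and the mean/standard deviation summarised in `errorbars.jpg`), assuming the weighted-Jaccard formulation above; all names and values here are illustrative, not the script's own:

```r
n.buckets <- 10
population.props <- rep(1 / n.buckets, n.buckets)

# Draw one sample of the given size and score it against the population.
similarity.once <- function(sample.size) {
  draws <- sample(seq_len(n.buckets), size = sample.size, replace = TRUE)
  sample.props <- tabulate(draws, nbins = n.buckets) / sample.size
  sum(pmin(sample.props, population.props)) / sum(pmax(sample.props, population.props))
}

for (n in c(10, 50, 100, 500)) {
  scores <- replicate(1000, similarity.once(n))  # histogram material, e.g. hist(scores)
  cat(sprintf("n = %4d: mean = %.3f, sd = %.3f\n", n, mean(scores), sd(scores)))
}
```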

Also in the `plots/` directory, the script creates a "cross-section" plot, which shows the mean and standard deviation of similarity scores for each distribution size at a fixed sample size.