Skip to content
mikerabat edited this page May 13, 2019 · 2 revisions

Statistic functions

Beside the base statistic function of mean and variance that are highly optimized - the classes stated in corr.pas and statistics.pas try to extend the statisitcs capabilites of this library. Many of the functions are originaly published in nr and converted to Pascal.

Statistic Tests

Student T Test

The Student T test tests if two sample datasets are significantly different from each other. (see https://en.wikipedia.org/wiki/Student%27s_t-test for further information) There are 3 versions of the Studen T test implemented in this library:

  • Unequal sample size but assumed samed variance
  • Unequal sample size and assumed different variances
  • Equal/"paired" sample size.

The involved functions are:

   // Student t test: Returns the probobability in prob and t = (x_A_mean - x_B_mean)/(sD) , sD the "pooled variance".
   procedure StudentTTest( data1 : PDouble; n1 : integer; data2 : PDouble; n2 : integer; var t, prob : double ); overload;
   procedure StudentTTest( meanData1, varData1 : double; n1 : integer; data2 : PDouble; n2 : integer; var t, prob : double ); overload;

   // student t test with significantly different variances
   procedure StudentTUTest( data1 : PDouble; n1 : integer; data2 : PDouble; n2 : integer; var t, prob : double ); overload;
   procedure StudentTUTest( meanData1, varData1 : double; n1 : integer; data2 : PDouble; n2 : integer; var t, prob : double ); overload;

   // student t test for the case of paired samples
   procedure StudentTPTest( data1 : PDouble; data2 : PDouble; n : integer; var t, prob : double ); overload;
   procedure StudentTPTest( data1 : PDouble; meanData1, varData1 : double; data2 : PDouble; n : integer; var t, prob : double ); overload;

The resulting data is:

  • t: The "t" statistics value
  • prob: The p value (probability) that two sample sets have the same distribution

F Test

Testing for distributions that follow a F distribution. The F Test can be used to check if two distributions have 2 different variances. Check out the wiki page https://en.wikipedia.org/wiki/F-test for more information. The implementation here follows the implementation in NR:

   // f-Test for signifacantly different variances
   procedure FTest( data1 : PDouble; n1 : integer; data2 : PDouble; n2 : integer; var f, prob : double );

Returned is the F value max(var(data1), var(data2)/min(var(data1), var(data2)) and the probability.

Chi Square Test

Testing if two distributions are the same input is a binned probability density distribution. As usual Wiki gives a thorough explanation: https://en.wikipedia.org/wiki/Chi-squared_test

Given observed events (bins) and expected distribution (expectedBins) the routine geturns the number of degrees of freedom (df) and the chi-square chsq as well as the significance prob. A small value of prob indicates a signifacant difference between the distrubition bins and expectBins.

3 different routines are implemented:

  • ChiSquareOne: Expects as input two binned probability density histogram where the second is a reference (e.g. stems from a calculated normal distribution...)
  • ChiSquareTwo: Tests if two binned probability density histograms are the same both are coming from real observations.
  • ChiSquareTwoUnequal: Same as above but the input may be unequally sampled.
   procedure ChiSquareOne(bins, expectBins : PConstDoubleArr; nBins : integer; numConstraints : integer; var df, chsq, prob : double );

   // case of two binned data sets (no expected data known)
   procedure ChiSquareTwo(bins1, bins2 : PConstDoubleArr; nBins : integer; numConstraints : integer; var df, chsq, prob : double);

   // case of unequal number of elements in both dataset
   procedure ChiSquareTwoUnequal(bins1, bins2 : PConstDoubleArr; nBins : integer; numConstraints : integer; var df, chsq, prob : double);

Returns the probability (p-Value) if two distributions are the same as well as the distance (chsq) and the number of degrees of freedom df.

Histograms

These functions create a pdf (histogram) from the measurements from the input data.

The functiont try to put all values >= maxVal in the last bin and put all others into the bins below. Indexing is done by calculating a step: step := (maxVal - minVal)/(nBins - 1).

   procedure Hist( data : PConstDoubleArr; n : integer; histogram : PConstDoubleArr; nBins : integer); overload;
   procedure Hist( data : PConstDoubleArr; n : integer; histogram : PConstDoubleArr; nBins : integer; minVal, maxVal : double); overload;

Kolmogorov-Smirnov Test

This test is based on the maximum distance of PDF of two unbinned distributions that are functions of a single independent variable. The maximum distance and the probability that both distributions are drawn from the same statistics is returned. 2 different implementations are provided here:

  • kolmogorovSmirnovOne: Given a dataset of observations and a function to test against (the function returns a probability based on the input value x) it returns the maximum distance and p value. Can be e.g. used to to compare against a multivariate distribution (e.g. the overall distribution is drawn from 3 normal distribuions).
  • kolmogorovSmirnovTwo: This function has as input two datasets and calculates the maximum distance of the pdf's as well as the p value that these are drawn from the same distribution.

Note that the order of the input data is destroyed since the above functions sort the data before further computations.

Statistics Class

The library also contains a class that captures the handling of matrix objects and the statistics functions. Many functions are class functions but also many implementations allow to precalculate for one dataset and then compare this dataset against many others. It also allows to calculate different pdf functions (normal, weibull, exponention and many more...)

   // ######################################################
   // #### Class implementation of various statistic functions
   type
     TStatTest = class(TBaseMathPersistence)
     private
       fN : integer;
       fMeanVar : TMeanVarRec;
       fRefData : IMatrix;
       fMinVal, fMaxVal : Double;
       fNumBins : integer;
     protected
       class function ClassIdentifier : String; override;
       procedure DefineProps; override;
       procedure OnLoadDoubleProperty(const Name : String; const Value : double); override;
       procedure OnLoadIntProperty(const Name : String; Value : integer); override;
       function OnLoadObject(const Name : String; Obj : TBaseMathPersistence) : boolean; override;
     public
       // initialize the mean var record with one db so we can compare against others more than once
       procedure InitStudentTTest( data1 : TDoubleMatrix );
       procedure InitStudentTPTest( data1 : TDoubleMatrix);
       procedure InitStudentFTest( data1 : TDoubleMatrix );
       procedure InitChisquareTest( data1 : TDoubleMatrix; nBins : integer );
       procedure InitKolmogorv2Test( data1 : TDoubleMatrix );

       // #########################################
       // #### Statistical tests available after the "init" functions (for multiple tests
       // against the same data1 source)
       procedure StudentTTest( data2 : IMatrix; var t, prob : double ); overload;
       procedure StudentTTest( data2 : TDoubleMatrix; var t, prob : double ); overload;

       // student t test with significantly different variances
       procedure StudentTUTest( data2 : IMatrix; var t, prob : double ); overload;
       procedure StudentTUTest( data2 : TDoubleMatrix; var t, prob : double ); overload;

       // student t test for the case of paired samples
       procedure StudentTPTest( data2 : IMatrix; var t, prob : double ); overload;
       procedure StudentTPTest( data2 : TDoubleMatrix; var t, prob : double ); overload;

       // f-Test for signifacntly different variances
       procedure FTest( data2 : IMatrix;  var f, prob : double ); overload;
       procedure FTest( data2 : TDoubleMatrix;  var f, prob : double ); overload;

       // chi square test
       procedure ChiSquare1( data2 : IMatrix; numConstraints : integer; var df, chsq, prob : double ); overload;
       procedure ChiSquare1( data2 : TDoubleMatrix; numConstraints : integer; var df, chsq, prob : double ); overload;

       // chi square2 test
       procedure ChiSquare2( data2 : IMatrix; numConstraints : integer; var df, chsq, prob : double ); overload;
       procedure ChiSquare2( data2 : TDoubleMatrix; numConstraints : integer; var df, chsq, prob : double ); overload;

       // kolmogroovSmirnov2 test
       procedure KolmogorovSmirnov2( data2 : IMatrix; var d, prob : double ); overload;
       procedure KolmogorovSmirnov2( data2 : TDoubleMatrix; var d, prob : double ); overload;
     public
       // note that all data is treated as all one big vector!

       // student t test with same variances
       class procedure StudentTTest( data1, data2 : IMatrix; var t, prob : double ); overload;
       class procedure StudentTTest( data1, data2 : TDoubleMatrix; var t, prob : double ); overload;

       // student t test with significantly different variances
       class procedure StudentTUTest( data1, data2 : IMatrix; var t, prob : double ); overload;
       class procedure StudentTUTest( data1, data2 : TDoubleMatrix; var t, prob : double ); overload;

       // student t test for the case of paired samples
       class procedure StudentTPTest( data1, data2 : IMatrix; var t, prob : double ); overload;
       class procedure StudentTPTest( data1, data2 : TDoubleMatrix; var t, prob : double ); overload;

       // f-Test for signifacntly different variances
       class procedure FTest( data1, data2 : IMatrix;  var f, prob : double ); overload;
       class procedure FTest( data1, data2 : TDoubleMatrix;  var f, prob : double ); overload;

       // chi square test on one dimensional datapoints,
       class procedure ChiSquare1( data1, data2 : TDoubleMatrix; nBins : integer; numConstraints : integer; var df, chsq, prob : double ); overload;
       class procedure ChiSquare1( data1, data2 : IMatrix; nBins : integer; numConstraints : integer; var df, chsq, prob : double ); overload;

       // chi square test on one dimensional datapoints -> no assumption on the expected histogram (data1 in chisquare1) is raised
       class procedure ChiSquare2( data1, data2 : TDoubleMatrix; nBins : integer; numConstraints : integer; var df, chsq, prob : double ); overload;
       class procedure ChiSquare2( data1, data2 : IMatrix; nBins : integer; numConstraints : integer; var df, chsq, prob : double ); overload;

       class procedure KolmogorovSmirnov1( data : TDoubleMatrix; func : TKSTestFunc; var d, prob : double ); overload;
       class procedure KolmogorovSmirnov1( data : IMatrix; func : TKSTestFunc; var d, prob : double ); overload;

       // returns the K-S statisticas d and the signifacance level prob for the null hypothesis that the data sets are drawn from the same
       // distribution. Small values of prob show that the cumulative distribution function of data1 is significantly different fro mthat of data2
       // the arrays data1 and data2 are modified by being sorted into ascending order
       class procedure KolmogorovSmirnov2( data1, data2 : TDoubleMatrix; var d, prob : double ); overload;
       class procedure KolmogorovSmirnov2( data1, data2 : IMatrix; var d, prob : double ); overload;

       // histogram from the input data
       // Creates a histogram from the input data between the
       // given borders. Elements >= the max value are put into the last bin
       // elements < minValue + (maxVal - minVal)/(nbins - 1) into the first bin
       // the histogram operates along the given dimension (rowWise is default)
       // returns a nbins x height (RowWise True) matrix
       class function HistogramIntf( data : IMatrix; nBins : integer; RowWise : Boolean = True ) : IMatrix; overload;
       class function HistogramIntf( data : IMatrix; nBins : integer; minVal, maxVal : double; RowWise : Boolean = True )  : IMatrix; overload;
       class function Histogram( data : TDoubleMatrix; nBins : integer; RowWise : Boolean = True ) : TDoubleMatrix; overload;
       class function Histogram( data : TDoubleMatrix; nBins : integer; minVal, maxVal : double; RowWise : Boolean = True )  : TDoubleMatrix; overload;

       // some probability density functions
       class function UniformPDF( x : TDoubleMatrix; a, b : double ) : TDoubleMatrix;
       class function NormalPDF( x : TDoubleMatrix; mu, sigma : double ) : TDoubleMatrix;
       class function LogNormalPDF( X : TDoubleMatrix; mu, sigma : double ) : TDoubleMatrix;
       class function PoissonPDF( x : TDoubleMatrix; lambda : double ) : TDoubleMatrix;
       class function RayleighPDF( x : TDoubleMatrix; sigma : double ) : TDoubleMatrix;
       class function WeibullPDF( x : TDoubleMatrix; lambda, k : double ) : TDoubleMatrix;
       class function ExpPDF( x : TDoubleMatrix; lambda : double ) : TDoubleMatrix;
       class function BinomPDF( x : TDoubleMatrix; n, p : integer ) : TDoubleMatrix;
     end;

Examples

Student T Test

Create a few distributions and calculate the StudentTValue:

   procedure TTestStatFunc.TestStudentT1;
   var rnd : TRandomGenerator;
       dat1 : Array[0..5] of TDoubleDynArray;
       y : Integer;
       x: Integer;
       t, prob : double;
   const cMean : Array[0..5] of double = (0.5, -0.1, 2, 0.4, 0.5, -0.101);
         cNum : Array[0..5] of integer = (13000, 14400, 5000, 1000, 8000, 10000);
   begin
        InitMathFunctions(itFPU, False);
        // test with same variance -> test just checks if the routine does not crash
        rnd := TRandomGenerator.Create(raMersenneTwister);
        rnd.Init( 455 );

        try
           for y := 0 to High(dat1) do
           begin
                SetLength( dat1[y], cNum[y] );
                for x := 0 to Length(dat1[y]) - 1 do
                    dat1[y][x] := sqr(0.7)*rnd.RandGauss + cMean[y];
           end;
        finally
               rnd.Free;
        end;

        for y := 0 to High(dat1) - 1 do
        begin
             for x := y + 1 to High(dat1) do
             begin
                  StudentTTest(@dat1[y][0], cNum[y], @dat1[x][0], cNum[x], t, prob);

                  Status( Format('Student T %.2f, %d, %.2f, %d, %.3f, %.3f', [cMean[y], cNum[y], cMean[x], cNum[x], t, prob] ) );

                  if abs(cMean[y] - cMean[x]) < 0.02  then
                  begin
                       StudentTTest(@dat1[y][0], cNum[y], @dat1[x][0], cNum[x], t, prob);
                       Check( prob > 0.1, 'Student t Probability failed');
                  end;
             end;
        end;
   end;

Kolmogorov-Smirnov Test with a function:

   function UniformPDF(x : double) : double;
   begin
        Result := x;  // just a linear relationship between 0 and 1
   end;

   procedure TTestStatFunc.TestKomogorov1;
   var rnd : TRandomGenerator;
       data : TDoubleDynArray;
       i : Integer;
       d, prob : double;
   const cPVal = 0.85746;      // from octave... maybe off by a bit
   begin
        rnd := TRandomGenerator.Create(raMersenneTwister);
        try
           rnd.Init(763);

           // uniformly distributed -> linear increasing when sorted
           SetLength(data, 10000);
           for i := 0 to Length(data) - 1 do
               data[i] := rnd.Random; // i/1000 + 0.00001*(rnd.Random - 0.5);
        finally
               rnd.Free;
        end;

        //WriteMatlabData('D:\kolm.txt', data, Length(data));

        kolmogorovSmirnovOne(@data[0], Length(data), {$IFDEF FPC}@{$ENDIF}UniformPDF, d, prob);
        Status(Format('Kolmogoriv result on uniform distributions: %.4f   ,    %.3f', [d, prob]));
        Check(SameValue( prob, cPVal, 1e-3 ), 'Error uniform distribution p value wrong' );
   end;
Clone this wiki locally