Calculate the percentile of a group for a range (e.g. 0% - 20%)


(Matt Tatro) #1

Hi Everyone,
I just started using this great library. I am trying to find the data that sits in a percentile range; for example, I would like to determine which of the elements are in the percentile range of 0% - 20%. The only way I can currently see to do this is to loop through all possible values and check each percentile from 1 to 100 for each one. There must be a better way…

Any help appreciated.

Matt


(Peter Vanderwaart) #2

In a sense, you are right: the program has to inspect each element of the array. OTOH, you don’t have to program that loop. You can use a pre-defined function such as a sort. Here is a little program that extracts the elements for a particular percentage range. It’s all C# code, not Numerics.

        // create example array
        Random RanGen = new Random();
        double[] x = new double[250];
        for (int i = 0; i < x.Length; i++) x[i] = RanGen.NextDouble();
        
        // sort the array 
        Array.Sort(x);
        
        // set indexes for desired percentiles of 20% and 30%
        int iMin = (20 * x.Length)/100;
        int iMax = (30 * x.Length)/100;

        // create output array
        double[] y = new double[(iMax - iMin) + 1];

        // Copy result to array y
        Array.Copy(x, iMin, y, 0, y.Length);
        
        // Print the elements; Console.WriteLine(y) would only print "System.Double[]"
        Console.WriteLine(string.Join(", ", y));

I’m looking forward to someone showing a better way.
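
For reuse, the sort-and-slice idea above could be wrapped in a small helper. This is only a sketch: the class name PercentileSlice, the method name Between, and its parameters are hypothetical and not part of any library. It also assumes a half-open range (lower bound inclusive, upper bound exclusive), which differs slightly from the snippet above, where the element at iMax is included.

```csharp
using System;

public static class PercentileSlice
{
    // Returns the elements of data whose sorted rank falls in [lowPct, highPct).
    // lowPct and highPct are percentages in 0..100 (hypothetical parameter names).
    public static double[] Between(double[] data, double lowPct, double highPct)
    {
        if (data == null) throw new ArgumentNullException(nameof(data));
        if (lowPct < 0 || highPct > 100 || lowPct > highPct)
            throw new ArgumentOutOfRangeException(nameof(lowPct), "Require 0 <= lowPct <= highPct <= 100.");

        // Sort a copy so the caller's array is left untouched.
        double[] sorted = (double[])data.Clone();
        Array.Sort(sorted);

        // Convert percentages to index bounds. The upper bound is exclusive,
        // so highPct = 100 cannot run past the end of the array.
        int start = (int)Math.Floor(lowPct * sorted.Length / 100.0);
        int end = (int)Math.Floor(highPct * sorted.Length / 100.0);

        double[] result = new double[end - start];
        Array.Copy(sorted, start, result, 0, result.Length);
        return result;
    }
}
```

With this sketch, PercentileSlice.Between(x, 0, 20) would return the lowest fifth of the values in x, which is the 0% - 20% case from the original question.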


(Peter Vanderwaart) #3

I had another thought: if you have a really big data set, you don’t have to use the whole thing. You can just do a random sample. That’s what statistics is all about. It depends on how precise you need to be.
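
As a rough illustration of the sampling idea, the sketch below estimates the value at a given percentile from a random sample rather than from the full array. The names SampledPercentile and Estimate are made up for this example, and it assumes sampling with replacement and a simple nearest-rank index; how large the sample needs to be depends on the precision you want.

```csharp
using System;

public static class SampledPercentile
{
    // Estimates the value at percentile pct (0..100) from a random sample
    // of sampleSize elements drawn with replacement from data.
    public static double Estimate(double[] data, double pct, int sampleSize, Random rng)
    {
        if (data == null || data.Length == 0)
            throw new ArgumentException("data must be non-empty", nameof(data));
        if (sampleSize <= 0) throw new ArgumentOutOfRangeException(nameof(sampleSize));
        if (pct < 0 || pct > 100) throw new ArgumentOutOfRangeException(nameof(pct));

        // Draw the sample.
        double[] sample = new double[sampleSize];
        for (int i = 0; i < sampleSize; i++)
            sample[i] = data[rng.Next(data.Length)];

        // Only the (smaller) sample is sorted, not the full data set.
        Array.Sort(sample);

        // Nearest-rank style index, clamped so pct = 100 maps to the last element.
        int index = Math.Min(sampleSize - 1, (int)(pct * sampleSize / 100.0));
        return sample[index];
    }
}
```

For example, SampledPercentile.Estimate(x, 20, 1000, new Random()) would give an approximate 20% cutoff, which could then be used as a threshold when filtering the full array.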

Also, while I was thinking about this I wrote a couple of sample programs for the specific problem you described (as I understand it).

        // x is array of random numbers
        // find percentiles of x=0.45 and x=0.55
        double xMin = 0.45;
        double xMax = 0.55;

        // solution by loop 
        int c1 = 0;
        int c2 = 0;
        foreach (double z in x)
        {
            if (z <= xMin) c1++;
            if (z <= xMax) c2++;
        }
        double pcMin = 100.0 * c1 / x.Length; // percentile of 0.45
        double pcMax = 100.0 * c2 / x.Length; // percentile of 0.55
        int nPoints = c2 - c1; // number of datapoints between min and max

        // solution using LINQ's Where (needs "using System.Linq;" at the top of the file)
        int r1 = x.Where((v) => v <= xMin).Count();
        int r2 = x.Where((v) => v <= xMax).Count();
        double pctMin = 100.0 * r1 / x.Length; // percentile of 0.45
        double pctMax = 100.0 * r2 / x.Length; // percentile of 0.55
        int nPts = r2 - r1; // number of datapoints with xMin < value <= xMax

The first solution is probably faster. It’s a single pass through the data. The second solution takes two passes through the data. Where is lazily evaluated, so it doesn’t build an intermediate array; Count() just enumerates the observations that pass the filter, at the cost of a delegate call per element. It’s a bit neater stylistically, though. Answers are the same.