Home > APIs > YSQL > Functions and operators > Aggregate functions >

Case study—using aggregate functions to demonstrate the "68–95–99.7" rule

This case study shows the use of aggregate functions to demonstrate the so-called "68–95–99.7 rule"—described in this Wikipedia article. This case-study focuses on just one part of the rule:

68.27% of the values in a normal distribution lie within one standard deviation each side of the mean.

Populate the test table

The demonstration uses the function normal_rand(), brought by the tablefunc extension, to populate the test table:

drop table if exists t cascade;
create table t(v double precision primary key);

do $body$
declare
  no_of_rows constant int              := 1000000;
  mean       constant double precision := 0.0;
  stddev     constant double precision := 50.0;
begin
  insert into t(v)
  select normal_rand(no_of_rows, mean, stddev);
end;
$body$;

If you choose a value of one million for "no_of_rows", then the two different estimates for the one sigma boundaries reliably produce results that typically differ from each other by less than about 0.1%. If you choose fewer rows, then the variability, and the typical results difference, will be bigger.

Because the demonstration (for convenience) uses a table with a single double precision column, "v", this must be the primary key. It's just possible that normal_rand() will create some duplicate values. However, this is so very rare that it was never seen while the script was repeated, many times, during the development of this code example. If insert into t(v) does fail because of this, just repeat the script by hand.

Create, and execute, the table function "the_6827_rule()"

The table function "the_6827_rule()", below, uses the following approach:

Firstly, it determines the one sigma boundaries ordinarily by using the avg() and stddev_pop() aggregate functions.
Secondly, it determines these boundaries by invoking the percentile_cont() aggregate function with putative percent_rank() input values corresponding, respectively, to the fraction of the table's values that lie below "mean - one standard deviation" and below "mean + one standard deviation".

drop function if exists the_6827_rule() cascade;
create function the_6827_rule()
  returns table (t text)
  language plpgsql
as $body$
declare
  -- FIRST, determine the one sigma boundaries by using avg() and stddev_pop().
  measured_avg
    constant double precision := (
      select avg(v) from t);

  measured_sigma
    constant double precision := (
      select stddev_pop(v) from t);

  one_sigma_boundaries_from_avg_and_stddev
    constant double precision[] := array[
      measured_avg - measured_sigma,
      measured_avg + measured_sigma];

  -- SECOND, determine the one sigma boundaries by using percentile_cont().
  fraction_within_sigma_from_avg
    constant double precision :=
      0.6827;

  expected_fraction_below_avg_minus_sigma
    constant double precision :=
      (1.0 - fraction_within_sigma_from_avg)/2.0;

  expected_fraction_above_avg_plus_sigma
    constant double precision :=
      expected_fraction_below_avg_minus_sigma +
      fraction_within_sigma_from_avg;

  fractions constant double precision[] := array[
    expected_fraction_below_avg_minus_sigma,
    expected_fraction_above_avg_plus_sigma];

  one_sigma_boundaries_from_percentile_cont
    constant double precision[] := (
      select
        percentile_cont(fractions)
        within group (order by v)
    from t);

begin
  t := rpad(' ',                            13)||
        lpad('from avg() and stddev_pop()', 29, ' ')||
        lpad('from percentile_cont()',      24, ' ')||
        lpad('ratio',                       10, ' ');

  return next;
  t := lpad(' ', 13)||rpad('  ', 29, '-')||rpad('  ', 24, '-')||rpad('  ', 10, '-');
  return next;

  for j in 1..2 loop
    declare
      caption constant text not null :=
        case j
          when 1 then 'mean - sigma:'
          when 2 then 'mean + sigma:'
        end;

      num constant double precision not null :=
        one_sigma_boundaries_from_percentile_cont[j] -
        one_sigma_boundaries_from_avg_and_stddev[j];

      denom constant double precision not null :=
        (one_sigma_boundaries_from_percentile_cont[j] +
         one_sigma_boundaries_from_avg_and_stddev[j])/2.0;

      ratio constant double precision not null :=
        abs((100.0::double precision*num)/denom);
    begin
      t := rpad(caption, 13)||
           lpad(to_char(one_sigma_boundaries_from_avg_and_stddev[j],  '990.99999'), 29)||
           lpad(to_char(one_sigma_boundaries_from_percentile_cont[j], '990.99999'), 24)||
           lpad(to_char(ratio, '990.999'), 9)||'%';
      return next;
    end;
  end loop;
end;
$body$;

A table function is used because if raise info is used in a procedure or anonymous block, then the output cannot be spooled to a file. Invoke it, for example, like this:

\t on
\o report.txt
select t from the_6827_rule();
\o
\t off

Typical result

Here is a typical result (for one million rows):

                from avg() and stddev_pop()  from percentile_cont()     ratio
                ---------------------------  ----------------------  --------
 mean - sigma:                    -50.00899               -49.99646    0.025%
 mean + sigma:                     49.97396                50.00483    0.062%