gquantiles

Efficiently compute percentiles, quantiles, categories, and frequency counts.
gquantiles is a by-able replacement for xtile, pctile, and _pctile that offers several additional features, like computing arbitrary quantiles (and an arbitrary number), frequency counts, and more (see the examples below).
gquantiles is also faster than the user-written fastxtile, so an alias, fasterxtile, is also provided.
Important
Run gtools, upgrade to update gtools to the latest stable version.
Syntax

The full syntax is
gquantiles [newvar =] exp [if] [in] [weight], {pctile | xtile | _ pctile} quantiles_method [ gquantiles_options ]
This function accepts by() with xtile and pctile . However, you can simply use it as a replacement for native Stata commands.
Equivaent to pctile (store the quantiles of exp in newvar ):
gquantiles newvar = exp [if] [in] [weight], pctile [nquantiles(#) genp(newvarname) altdef]
Equivaent to xtile (store the categories of exp in newvar):
gquantiles newvar = exp [if] [in] [weight], xtile [nquantiles(#) cutpoints(numlist) altdef]
fasterxtile newvar = exp [if] [in] [weight], [nquantiles(#) cutpoints(numlist) altdef]
The options and behavior of the above largely mimic that of the Stata native
commands. You only need to read the rest of the documentation if you wish to
use some of the additional features that gquantiles provides.
Weights
aweight, fweight, and pweight are allowed and mimic the weights in
pctile, xtile, or _pctile (see help weight and the weights
section in help pctile). Weights are not allowed with altdef.
Options
Quantiles method
gquantiles offers 4 ways of specifying quantiles and 3 ways of specifying
cutoffs. The behavior of each differs slightly when specifying pctile,
xtile, and _pctile (see stored results for details).
nquantiles(#) Number of quantiles to copmpute. gquantiles computes
  percentiles (quantiles) corresponding to 100 * k / m for
  k = 1, 2, ..., m - 1, where m = #. For example, nquantiles(10)
  requests that the 10th, 20th, ..., 90th percentiles be computed.
  The default is nquantiles(2), the median.
cutpoints(varname) (xtile or pctile only).
  Requests that the values of varname be used instead of the quantiles
  of the source variable. Like the native equivalent, all values of varname
  are used, regardless of any if or in restriction (the user can pass
  the option cutifin to restrict cutpoints to the if in range or option
  cutby to use different cutpoints in each group).
  Note that without any additional options, in the case of pctile all this
  does is sort varname and store it in the target variable.  Further, unlike
  the native equivalent, this variable does not need to be sorted or unique.
  Missings are excluded by default and the variable is sorted internally.  The
  user can request duplicates be excluded via the option dedup.
quantiles(numlist) or percentiles(numlist). Requests percentiles (quantiles)
  corresponding to those specified by numlist. For example,
  percentiles(10(20)90) requests that the 10th, 30th, 50th, 70th, and 90th
  percentiles be computed.


  Unlike the native equivalent, this can be specified with pctile and xtile
  as well as _pctile.  With pctile the results are stored in the target
  variable. With xtile the rank is computed using numlist.  With _pctile
  the return values are stored in r(r1), r(r2), ..., etc. Note that the
  number of return values that can be set is artificially capped at 1001
  because they take a really long time to set. For more than 1001 return
  values consider using cutquantiles() with pctile or see the
  computing many quantiles section in the examples
  below.
quantmatrix(matrix). Requests percentiles (quantiles) corresponding to the
  entries of the matrix. This must be a column vector or a row vector. The
  behavior of gquantiles using this option is otherwise equivalent to its
  behavior when passing quantiles().
cutquantiles(varname) (xtile or pctile only).
  Requests that the values of varname be used as the percentiles (quantiles)
  to compute. varname must have all its values between 0 and 100. By default
  all values are read, regardless of if in restrictions (use option
  cutifin to restrict the range or cutby to use different cut quantiles
  per group). Missing values are dropped and duplicates are allowed
  (use option dedup to drop duplicate quantiles).


  This option is included to allow the user to compute an arbitrary number
  of quantiles in a reasonable amount of time. While the user can force more
  than 1001 return values with the _pctile option, this is ill-advised
  since performance will suffer greatly.
cutoffs(numlist) (Only with xtile, pctile, or _pctile with bincount).
  This option is similar to cutpoints, except that instead of using the values
  of a variable the function uses the values of numlist as the cutoffs.


  With option _pctile this option is only allowed with bincount, which
  requests a count of the number of instances of the source variable within
  the bins defined by the cutoffs.
cutmatrix(matrix). Requests cutoffs corresponding to the entries of the matrix.
  This must be a column vector or a row vector. The behavior of gquantiles using
  this option is otherwise equivalent to its behavior when passing cutoffs().
gquantiles options
Standard Options
altdef. Use an alternative definition for quantiles. When you have a finite
  number of values, as is the case with all data with N observations, there are
  at most N - 1 exact quantiles (assuming no duplicates). To compute arbitrary
  quantiles we need to make additional assumptions.


  Let x(i) denote i-th smallest value of the relevant variable such that
  i > qN / 100.  If i - 1 < qN / 100 then the quantile is x(i). However,
  if i - 1 = qN / 100 the convention is to use the average of x(i) and x(i - 1).
  Option altdef defines i <= q(N + 1) / 100 and h = q(N + 1) / 100 - i.
  Then the quantile is (1 - h) * x(i) + h * x(i + 1) instead.
genp(newvar) specifies a new variable to be generated containing the percentages
  corresponding to the percentiles (quantiles) be created.
Extras
by(varlist) Compute quantiles by group (pctile or xtile only). pctile[()] requires option
  strict, which has the effect of ignoring groups where the number of quantiles requested is
  larger than the number of non-missing observations within the group. by() is most useful
  with option groupid(varname).
groupid(varname) Store group ID in varname.
_pctile (Not with by.) does the computation in the style of the native command _pctile. It stores
  return values in r(1), r(2), and so on, as wll as a matrix called r(quantiles_used)
  or r(cutoffs_used) in case quantiles or cutoffs are requested. See stored results
  for details.
pctile or pctile(newvar) Computes the quantiles of the source variable (i.e. calls
  gquantiles in the style of the native command pctile). The pctile(newvar) syntax
  is provided to allow combining pctile with other options. See examples.
xtile or xtile(newvar) Computes the category of the source variable using the categories
  defined by the cutoffs or implied by the quantiles (i.e. calls gquantiles in the style of
  the native command xtile). The xtile(newvar) syntax is provided to allow combining xtile
  with other options. See examples.
binfreq (Not with by.) or binfreq(newvar) Stores the frequency counts of the source variable
  within the bins defined by the quantiles or the cuoffs. When
  weights are specified, this stores the sum of the weights within
  that category. If newvar is passed then the frequency counts
  are stored in newvar; otherwise they are stored in the matrix
  r(quantiles_bincount) or r(cutoffs_bincount).
Switches
method(#) (Not with by.) If you have many duplicates or are computing many quantiles,
  you should specify method(1). If you have few duplicates or are computing
  few quantiles you should specify method(2). By default, gquantiles tries
  to guess which method will run faster. See computation methods
  in the examples section below.
dedup By default all quantiles and cutoffs are used in computations, regardless
  of duplicate values. For instance, if the user asks for quantiles 1, 90, 10,
  10, and 1, then quantiles 1, 1, 10, 10, and 90 are used. With this option
  only 1, 10, and 90 would be used.
cutifin By default all values of the variable requested via cutpoints
  or cutquantiles are used. The reason Stata behaves in this way is that
  cutpoints was written to take in the output of pctile (this is how
  xtile works internally). Here, both cutpoints and cutquantiles are
  generic options, so the user can read all values or just values if in.
cutby By default all values of the variable requested via cutpoints
  or cutquantiles are used. With this option, each group uses a different
  set of quantiles or cutoffs (note this automatically sets option cutifin).
returnlimit(#) Default is 1001. When the user runs gquantiles in the style of _pctile
  return values r(1), r(2), and so on are set. The limit of quantiles that can be computed
  is the number of observations in the data, which can be millions or billions. Setting
  that many return values is computationally infeasible! The user can turn
  off this check via returnlimit(.). This is ill-advised. To compute
  that many quantiles it is better to use the pctile option.
strict Without by(), exit with error if the number of quantiles requested
  is above the number of non-missing values. In the case of xtile, this also
  restricts the number of quantiles to the number of observations in the data.
  With by(), skip groups where this happens.  Note that Stata limits
  nquantiles() with xtile because it creates a temporary variable
  with the quantiles. Hence from the point of view of gquantiles this is
  an artificial limitation.
minmax (Not with by.) Additionally store r(min) and r(max). Percentiles
  (quantiles) are required to be strictly between 0 and 100. However,
  the min and max are relatively trivial to compute given the internals
  of the command, so the user can request them here.
replace Replace targets, should they exist.
Gtools options
(Note: These are common to every gtools command.)
compress Try to compress strL to str#. The Stata Plugin Interface has
            only limited support for strL variables. In Stata 13 and
            earlier (version 2.0) there is no support, and in Stata 14
            and later (version 3.0) there is read-only support. The user
            can try to compress strL variables using this option.
forcestrl Skip binary variable check and force gtools to read strL variables
            (14 and above only). Gtools gives incorrect results when there is
            binary data in strL variables. This option was included because on
            some windows systems Stata detects binary data even when there is none.
            Only use this option if you are sure you do not have binary data in your
            strL variables.
verbose prints some useful debugging info to the console.
benchmark or bench(level) prints how long in seconds various parts of the
            program take to execute. Level 1 is the same as benchmark. Levels
            2 and 3 additionally prints benchmarks for internal plugin steps.
hashmethod(str) Hash method to use. default automagically chooses the
            algorithm. biject tries to biject the inputs into the
            natural numbers. spooky hashes the data and then uses the
            hash.
oncollision(str) How to handle collisions. A collision should never happen
            but just in case it does gtools will try to use native
            commands. The user can specify it throw an error instead by
            passing oncollision(error).
Stored results
All calls store the following results
Scalars
    r(N)                  Number of observations
    r(min)                Min (only if minmax was requested)
    r(max)                Max (only if minmax was requested)
    r(nqused)             Number of quantiles/cutoffs
    r(method_ratio)       Rule used to decide between methods 1 and 2
    r(nquantiles)         Number of quantiles (only w nquantiles())
    r(ncutpoints)         Number of cutpoints (only w cutpoints())
    r(nquantiles_used)    Number of quantiles (only w quantiles())
    r(nquantpoints)       Number of quantiles (only w cutquantiles())
    r(ncutoffs_used)      Number of cutoffs (only w cutoffs())
    r(r#)                 The #th quantile requested (only w _pctile)
Macros
    r(quantiles)          Quantiles used (only w percentiles() or quantiles())
    r(cutoffs)            Cutoffs used (only w option cutoffs())
Matrices
    r(quantiles_used)     With _pctile or with quantiles()
    r(quantiles_binfreq)  With option binfreq and any quantiles requested
    r(cutoffs_used)       With _pctile or with cutoffs()
    r(cutoffs_binfreq)    With option binfreq and any cutoffs requested
Methods and Formulas
The behavior of gquantiles follows the behavior specified in
Stata's documentation
(in particular see the "Methods and formulas" section). We do not
repeat those formulas here.
One interesting thing to note is that Stata artificially limits the way in
which xtile can be categorized. That is, the program defines categories
using the quantiles or the cutoffs as the right-inclusive endpoints. For
instance, nquantiles(NQ) gives q(1), q(2), ..., q(NQ - 1) quantiles. (For
cutpoints(varname), q(i) is the i-th smallest value of varname, and so
on for each way to specify quantiles and cutoffs). Now we have
| Category | Range                  |
| -------- | ---------------------- |
|        1 | (-Inf, q(1)]           |
|        2 | (q(1), q(2)]           |
|      ... | ...                    |
|   NQ - 1 | (q(NQ - 2), q(NQ - 1)] |
|       NQ | (q(NQ - 1), Inf)       |
In theory there is no reason to limit NQ. For example, the question
"Where do these 4 values fit in these 1000 categories?" is a
well-defined question. Even if there will be at least 996 categories
contain no values, there is no reason to limit the number of categories
to 4 (of course, since the 1000 categories are created from those 4
values, in practice this might not be adviseable).
So why does the limit exist in xtile? It is actually a limit in pctile,
which is used internally. Since pctile stores the percentiles in a variable
in the data, this limit is actually very reasonable in this case (the same
limit is in gquantiles, pctile). However, the user can request an arbitrary
number of quantiles via gquantiles, xtile nq(#). If the user does not like
this behavior, they can pass the option strict.
Examples
You can download the raw code for the examples below
Using by
One major feature that gquantile adds is by(). It should be
internally consistent if the user specifies the option strict.
For example,
clear
set obs 1000000
gen group = int(runiform() * 100)
gen x = runiform()
local popts pctile strict by(group) cutby
gquantiles pctile = x, `popts' nq(10) genp(perc) groupid(id)
local xopts xtile strict by(group) cutby
gquantiles xtile  = x, `xopts' nq(10)
gquantiles xtile2 = x, `xopts' cutpoints(pctile)
gquantiles xtile3 = x, `xopts' cutquantiles(perc)
assert xtile == xtile2
assert xtile == xtile3
However, there is no requirement for the user to do so:
sysuse auto, clear
gquantiles xtile = price, xtile by(foreign) nq(50)
Computing many quantiles
Stata's _pctile caps the number of quantiles to 1001. pctile uses _pctile
internally, so to compute more than 1001 percentiles it needs to loop over
various runs of _pctile in a very inefficient way.  This inefficiency carries
over to xtile because that command uses pctile internally. (Presumably this is
the reason for the limit in the user-written fastxtile).
The following executes with no errors in a reasonable amount of time
clear
set obs 1000000
gen x = runiform()
_pctile x, nq(1001)
pctile p1 = x, nq(1001)
gquantiles p2 = x, pctile nq(1001)
However, if you increase nq the runtimes become excessive:
drop p*
timer clear
timer on 90
pctile p1 = x, nq(5001)
timer off 90
timer on 10
gquantiles p2 = x, pctile nq(5001)
timer off 10
assert p1 == p2
timer list
  10:      0.42 /        1 =       0.4200
  90:     61.14 /        1 =      61.1440
61 seconds for only 1,00,000 observations! This is in
Stata/MP with 8 cores. gquantiles scales nicely, by contrast:
drop p*
timer clear
timer on 10
gquantiles p2 = x, pctile nq(`=_N + 1')
timer off 10
clear
set obs 100000000
gen x = runiform() * 100
timer on 20
gquantiles p2 = x, pctile nq(`=_N + 1')
timer off 20
timer list
  10:      0.44 /        1 =       0.4430
  20:     36.53 /        1 =      36.5270
  90:     61.14 /        1 =      61.1440
That's right, gquantiles computed 100M quantiles for 100M observations in 36
seconds, faster than pctile could compute 5,000 quantiles for 1M observations.
As a side-note, using mata can afford a massive speedup, obviating the need
to call C in case gquantiles does not do something you want. Consider:
clear all
timer clear
mata:
void function mata_pctile (string scalar newvar,
                           string scalar sourcevar,
                           real scalar nq)
    real scalar N
    real colvector X, quantiles, qpositions, qties, qtiesix, Q, Qties
    X = st_data(., sourcevar)
    N = rows(X)
    _sort(X, 1)
    quantiles  = ((1::(nq - 1)) * N / nq)
    qpositions = ceil(quantiles)
    qties      = (qpositions :== quantiles)
    Q          = X[qpositions]
    if ( any(qties) ) {
        qtiesix = selectindex(qties)
        Qties = X[qpositions[qtiesix] :+ 1]
        Q[qtiesix] = (Q[qtiesix] + Qties) / 2
    st_addvar("`:set type'", newvar)
    st_store((1::(nq - 1)), newvar, Q)
set obs 1000000
gen x = runiform()
timer on 80
mata: mata_pctile("p0", "x", 5001)
timer off 80
timer on 90
pctile p1 = x, nq(5001)
timer off 90
timer on 10
gquantiles p2 = x, pctile nq(5001)
timer off 10
assert p0 == p1
assert p1 == p2
timer list
  10:      0.48 /        1 =       0.4770
  80:      1.23 /        1 =       1.2300
  90:     56.97 /        1 =      56.9720
Just by using mata we speeded up Stata 50 times! The mata solution
does not scale as well as gquantiles, however:
clear
timer clear
set obs 10000000
gen x = runiform()
timer on 80
mata: mata_pctile("p0", "x", 5001)
timer off 80
timer on 10
gquantiles p2 = x, pctile nq(5001)
timer off 10
timer list
  10:      2.97 /        1 =       2.9720
  80:     14.40 /        1 =      14.4030
With just 10M observations, gquantiles is still a 5x improvement over mata
when computing many quantiles.
Computation methods
Computing quantiles involves selecting elements from an unordered array
in one of two ways: Using a selection algorithm on the unsorted variable
or sorting and then selecting elements of the sorted varaible.
The internal selection algorithm of gquantiles is very fast and on average will
run in linear O(N) time (see quickselect).
The sorting algorithm runs in O(N log(N)) time (see quicksort).
Clearly, with few quantiles we can see the selection algorithm will be faster.
However, with a large number of quantiles running multiple iterations of the
selection algorithm is clearly slower than doing a single sort.
clear
timer clear
set obs 10000000
gen x = rnormal() * 100
timer on 10
gquantiles p1 = x, pctile nq(2) method(1)
timer off 10
timer on 20
gquantiles p2 = x, pctile nq(2) method(2)
timer off 20
assert p1 == p2
timer list
  10:      3.39 /        1 =       3.3870
  20:      1.10 /        1 =       1.1020
We can see that method 2 was more than 3 times faster for a single quantile.
timer clear
timer on 10
gquantiles p1 = x, pctile nq(10) method(1) replace
timer off 10
timer on 20
gquantiles p2 = x, pctile nq(10) method(2) replace
timer off 20
timer list
  10:      3.43 /        1 =       3.4290
  20:      2.28 /        1 =       2.2840
While method 2 was still faster, computing 10 quantiles took twice the time it
took to compute 1. By contrast, method 1 took essentially the same time. This
is because after sorting the data, selecting elements is nearly instantaneous.
timer clear
timer on 10
gquantiles p1 = x, pctile nq(100) method(1) replace
timer off 10
timer on 20
gquantiles p2 = x, pctile nq(100) method(2) replace
timer off 20
timer list
  10:      3.22 /        1 =       3.2210
  20:      7.04 /        1 =       7.0370
With 100 quantiles we can see that the performance of method 2 is now much
worse than method 1. Internally, gquantiles will try to switch between the two
methods based on the number of observations and the number of quantiles.
You might be tempted to always specify method 2 for few quantiles, but
there is a second way in which it is slower than sorting:
timer clear
replace x = int(x)
timer on 10
gquantiles p1 = x, pctile nq(10) method(1) replace
timer off 10
timer on 20
gquantiles p2 = x, pctile nq(10) method(2) replace
timer off 20
timer list
  10:      1.29 /        1 =       1.2940
  20:      1.48 /        1 =       1.4760
What happened? While both commands are faster, now method 1 is faster than
method 2, whereas before it was 50% slower. This is because the specific
sorting algorithm I use handles duplicates better than the selection
algorithm.
timer clear
timer on 10
gquantiles p1 = x, pctile nq(100) method(1) replace
timer off 10
timer on 20
gquantiles p2 = x, pctile nq(100) method(2) replace
timer off 20
timer list
  10:      1.54 /        1 =       1.5390
  20:      4.31 /        1 =       4.3080
Again, both are faster with duplicates, but method 1 is much faster.
Multiple subcommands
gquantiles allows the user to compute several things at once:
sysuse auto, clear
gquantiles price, _pctile xtile(x1) pctile(p1) binfreq nq(10)
matrix list r(quantiles_binfreq)
l price x1 p1 in 1/10
r(quantiles_binfreq)[9,1]
r1   8
r2   7
r3   8
r4   7
r5   7
r6   8
r7   7
r8   8
r9   7
     +----------------------+
     |  price   x1       p1 |
     |----------------------|
  1. |  4,099    2     3895 |
  2. |  4,749    5     4099 |
  3. |  3,799    1     4425 |
  4. |  4,816    5     4647 |
  5. |  7,827    8   5006.5 |
     |----------------------|
  6. |  5,788    7     5705 |
  7. |  4,453    4     6165 |
  8. |  5,189    6     7827 |
  9. | 10,372    9    11385 |
 10. |  4,082    2        . |
     +----------------------+
Specifying quantiles and cutoffs
gquantiles allows for several ways to specify cutoffs
sysuse auto, clear
gquantiles price, _pctile p(10(10)99)
matrix p0 = r(quantiles_used)
gquantiles p1 = price, pctile nq(10) genp(g1) xtile(x1)
gquantiles x2 = price, xtile cutpoints(p1)
gquantiles x3 = price, xtile cutquantiles(g1)
qui glevelsof p1
gquantiles x4 = price, xtile cutoffs(`r(levels)')
qui glevelsof g1
gquantiles x5 = price, xtile quantiles(`r(levels)')
matrix list p0
l p1 g1 x? in 1/10
p0[9,1]
r1    3895
r2    4099
r3    4425
r4    4647
r5  5006.5
r6    5705
r7    6165
r8    7827
r9   11385
. l p1 g1 x? in 1/10
     +--------------------------------------+
     |     p1   g1   x1   x2   x3   x4   x5 |
     |--------------------------------------|
  1. |   3895   10    2    2    2    2    2 |
  2. |   4099   20    5    5    5    5    5 |
  3. |   4425   30    1    1    1    1    1 |
  4. |   4647   40    5    5    5    5    5 |
  5. | 5006.5   50    8    8    8    8    8 |
     |--------------------------------------|
  6. |   5705   60    7    7    7    7    7 |
  7. |   6165   70    4    4    4    4    4 |
  8. |   7827   80    6    6    6    6    6 |
  9. |  11385   90    9    9    9    9    9 |
 10. |      .    .    2    2    2    2    2 |
     +--------------------------------------+