gquantiles newvar = exp [if] [in] [weight], xtile [nquantiles(#) cutpoints(numlist) altdef]
fasterxtile newvar = exp [if] [in] [weight], [nquantiles(#) cutpoints(numlist) altdef]
The options and behavior of the above largely mimic that of the Stata native
commands. You only need to read the rest of the documentation if you wish to
use some of the additional features that gquantiles provides.
Weights
aweight, fweight, and pweight are allowed and mimic the weights in
pctile
, xtile
, or _pctile
(see help weight
and the weights
section in help pctile
). Weights are not allowed with altdef
.
Options
Quantiles method
gquantiles offers 4 ways of specifying quantiles and 3 ways of specifying
cutoffs. The behavior of each differs slightly when specifying pctile
,
xtile
, and _pctile
(see stored results for details).
nquantiles(#)
Number of quantiles to copmpute. gquantiles computes
percentiles (quantiles) corresponding to 100 * k / m for
k = 1, 2, ..., m - 1, where m = #. For example, nquantiles(10)
requests that the 10th, 20th, ..., 90th percentiles be computed.
The default is nquantiles(2)
, the median.
cutpoints(varname)
(xtile
or pctile
only).
Requests that the values of varname
be used instead of the quantiles
of the source variable. Like the native equivalent, all values of varname
are used, regardless of any if
or in
restriction (the user can pass
the option cutifin
to restrict cutpoints
to the if in
range or option
cutby
to use different cutpoints in each group).
Note that without any additional options, in the case of pctile
all this
does is sort varname
and store it in the target variable. Further, unlike
the native equivalent, this variable does not need to be sorted or unique.
Missings are excluded by default and the variable is sorted internally. The
user can request duplicates be excluded via the option dedup
.
quantiles(numlist)
or percentiles(numlist)
. Requests percentiles (quantiles)
corresponding to those specified by numlist
. For example,
percentiles(10(20)90)
requests that the 10th, 30th, 50th, 70th, and 90th
percentiles be computed.
Unlike the native equivalent, this can be specified with pctile
and xtile
as well as _pctile
. With pctile
the results are stored in the target
variable. With xtile
the rank is computed using numlist
. With _pctile
the return values are stored in r(r1), r(r2), ..., etc. Note that the
number of return values that can be set is artificially capped at 1001
because they take a really long time to set. For more than 1001 return
values consider using cutquantiles()
with pctile
or see the
computing many quantiles section in the examples
below.
quantmatrix(matrix)
. Requests percentiles (quantiles) corresponding to the
entries of the matrix. This must be a column vector or a row vector. The
behavior of gquantiles using this option is otherwise equivalent to its
behavior when passing quantiles()
.
cutquantiles(varname)
(xtile
or pctile
only).
Requests that the values of varname
be used as the percentiles (quantiles)
to compute. varname
must have all its values between 0 and 100. By default
all values are read, regardless of if in
restrictions (use option
cutifin
to restrict the range or cutby
to use different cut quantiles
per group). Missing values are dropped and duplicates are allowed
(use option dedup
to drop duplicate quantiles).
This option is included to allow the user to compute an arbitrary number
of quantiles in a reasonable amount of time. While the user can force more
than 1001 return values with the _pctile
option, this is ill-advised
since performance will suffer greatly.
cutoffs(numlist)
(Only with xtile
, pctile
, or _pctile
with bincount
).
This option is similar to cutpoints
, except that instead of using the values
of a variable the function uses the values of numlist
as the cutoffs.
With option _pctile
this option is only allowed with bincount
, which
requests a count of the number of instances of the source variable within
the bins defined by the cutoffs.
cutmatrix(matrix)
. Requests cutoffs corresponding to the entries of the matrix.
This must be a column vector or a row vector. The behavior of gquantiles using
this option is otherwise equivalent to its behavior when passing cutoffs()
.
gquantiles options
Standard Options
altdef
. Use an alternative definition for quantiles. When you have a finite
number of values, as is the case with all data with N observations, there are
at most N - 1 exact quantiles (assuming no duplicates). To compute arbitrary
quantiles we need to make additional assumptions.
Let x(i) denote i-th smallest value of the relevant variable such that
i > qN / 100. If i - 1 < qN / 100 then the quantile is x(i). However,
if i - 1 = qN / 100 the convention is to use the average of x(i) and x(i - 1).
Option altdef
defines i <= q(N + 1) / 100 and h = q(N + 1) / 100 - i.
Then the quantile is (1 - h) * x(i) + h * x(i + 1) instead.
genp(newvar)
specifies a new variable to be generated containing the percentages
corresponding to the percentiles (quantiles) be created.
Extras
by(varlist)
Compute quantiles by group (pctile or xtile only). pctile[()]
requires option
strict
, which has the effect of ignoring groups where the number of quantiles requested is
larger than the number of non-missing observations within the group. by()
is most useful
with option groupid(varname)
.
groupid(varname)
Store group ID in varname
.
_pctile
(Not with by.) does the computation in the style of the native command _pctile
. It stores
return values in r(1), r(2), and so on, as wll as a matrix called r(quantiles_used)
or r(cutoffs_used)
in case quantiles or cutoffs are requested. See stored results
for details.
pctile
or pctile(newvar)
Computes the quantiles of the source variable (i.e. calls
gquantiles
in the style of the native command pctile
). The pctile(newvar)
syntax
is provided to allow combining pctile
with other options. See examples.
xtile
or xtile(newvar)
Computes the category of the source variable using the categories
defined by the cutoffs or implied by the quantiles (i.e. calls gquantiles
in the style of
the native command xtile
). The xtile(newvar)
syntax is provided to allow combining xtile
with other options. See examples.
binfreq
(Not with by.) or binfreq(newvar)
Stores the frequency counts of the source variable
within the bins defined by the quantiles or the cuoffs. When
weights are specified, this stores the sum of the weights within
that category. If newvar
is passed then the frequency counts
are stored in newvar
; otherwise they are stored in the matrix
r(quantiles_bincount)
or r(cutoffs_bincount)
.
Switches
method(#)
(Not with by.) If you have many duplicates or are computing many quantiles,
you should specify method(1)
. If you have few duplicates or are computing
few quantiles you should specify method(2)
. By default, gquantiles
tries
to guess which method will run faster. See computation methods
in the examples section below.
dedup
By default all quantiles and cutoffs are used in computations, regardless
of duplicate values. For instance, if the user asks for quantiles 1, 90, 10,
10, and 1, then quantiles 1, 1, 10, 10, and 90 are used. With this option
only 1, 10, and 90 would be used.
cutifin
By default all values of the variable requested via cutpoints
or cutquantiles
are used. The reason Stata behaves in this way is that
cutpoints
was written to take in the output of pctile
(this is how
xtile
works internally). Here, both cutpoints
and cutquantiles
are
generic options, so the user can read all values or just values if in
.
cutby
By default all values of the variable requested via cutpoints
or cutquantiles
are used. With this option, each group uses a different
set of quantiles or cutoffs (note this automatically sets option cutifin
).
returnlimit(#)
Default is 1001. When the user runs gquantiles in the style of _pctile
return values r(1), r(2), and so on are set. The limit of quantiles that can be computed
is the number of observations in the data, which can be millions or billions. Setting
that many return values is computationally infeasible! The user can turn
off this check via returnlimit(.)
. This is ill-advised. To compute
that many quantiles it is better to use the pctile
option.
strict
Without by()
, exit with error if the number of quantiles requested
is above the number of non-missing values. In the case of xtile, this also
restricts the number of quantiles to the number of observations in the data.
With by()
, skip groups where this happens. Note that Stata limits
nquantiles()
with xtile
because it creates a temporary variable
with the quantiles. Hence from the point of view of gquantiles this is
an artificial limitation.
minmax
(Not with by.) Additionally store r(min)
and r(max)
. Percentiles
(quantiles) are required to be strictly between 0 and 100. However,
the min and max are relatively trivial to compute given the internals
of the command, so the user can request them here.
replace
Replace targets, should they exist.
(Note: These are common to every gtools command.)
compress
Try to compress strL to str#. The Stata Plugin Interface has
only limited support for strL variables. In Stata 13 and
earlier (version 2.0) there is no support, and in Stata 14
and later (version 3.0) there is read-only support. The user
can try to compress strL variables using this option.
forcestrl
Skip binary variable check and force gtools to read strL variables
(14 and above only). Gtools gives incorrect results when there is
binary data in strL variables. This option was included because on
some windows systems Stata detects binary data even when there is none.
Only use this option if you are sure you do not have binary data in your
strL variables.
verbose
prints some useful debugging info to the console.
benchmark
or bench(level)
prints how long in seconds various parts of the
program take to execute. Level 1 is the same as benchmark
. Levels
2 and 3 additionally prints benchmarks for internal plugin steps.
hashmethod(str)
Hash method to use. default
automagically chooses the
algorithm. biject
tries to biject the inputs into the
natural numbers. spooky
hashes the data and then uses the
hash.
oncollision(str)
How to handle collisions. A collision should never happen
but just in case it does gtools
will try to use native
commands. The user can specify it throw an error instead by
passing oncollision(error)
.
Stored results
All calls store the following results
Scalars
r(N) Number of observations
r(min) Min (only if minmax was requested)
r(max) Max (only if minmax was requested)
r(nqused) Number of quantiles/cutoffs
r(method_ratio) Rule used to decide between methods 1 and 2
r(nquantiles) Number of quantiles (only w nquantiles())
r(ncutpoints) Number of cutpoints (only w cutpoints())
r(nquantiles_used) Number of quantiles (only w quantiles())
r(nquantpoints) Number of quantiles (only w cutquantiles())
r(ncutoffs_used) Number of cutoffs (only w cutoffs())
r(r#) The #th quantile requested (only w _pctile)
Macros
r(quantiles) Quantiles used (only w percentiles() or quantiles())
r(cutoffs) Cutoffs used (only w option cutoffs())
Matrices
r(quantiles_used) With _pctile or with quantiles()
r(quantiles_binfreq) With option binfreq and any quantiles requested
r(cutoffs_used) With _pctile or with cutoffs()
r(cutoffs_binfreq) With option binfreq and any cutoffs requested
The behavior of gquantiles
follows the behavior specified in
Stata's documentation
(in particular see the "Methods and formulas" section). We do not
repeat those formulas here.
One interesting thing to note is that Stata artificially limits the way in
which xtile
can be categorized. That is, the program defines categories
using the quantiles or the cutoffs as the right-inclusive endpoints. For
instance, nquantiles(NQ)
gives q(1), q(2), ..., q(NQ - 1) quantiles. (For
cutpoints(varname)
, q(i) is the i-th smallest value of varname
, and so
on for each way to specify quantiles and cutoffs). Now we have
| Category | Range |
| -------- | ---------------------- |
| 1 | (-Inf, q(1)] |
| 2 | (q(1), q(2)] |
| ... | ... |
| NQ - 1 | (q(NQ - 2), q(NQ - 1)] |
| NQ | (q(NQ - 1), Inf) |
In theory there is no reason to limit NQ. For example, the question
"Where do these 4 values fit in these 1000 categories?" is a
well-defined question. Even if there will be at least 996 categories
contain no values, there is no reason to limit the number of categories
to 4 (of course, since the 1000 categories are created from those 4
values, in practice this might not be adviseable).
So why does the limit exist in xtile
? It is actually a limit in pctile
,
which is used internally. Since pctile
stores the percentiles in a variable
in the data, this limit is actually very reasonable in this case (the same
limit is in gquantiles, pctile
). However, the user can request an arbitrary
number of quantiles via gquantiles, xtile nq(#)
. If the user does not like
this behavior, they can pass the option strict
.
Examples
You can download the raw code for the examples below
Using by
One major feature that gquantile adds is by()
. It should be
internally consistent if the user specifies the option strict
.
For example,
clear
set obs 1000000
gen group = int(runiform() * 100)
gen x = runiform()
local popts pctile strict by(group) cutby
gquantiles pctile = x, `popts' nq(10) genp(perc) groupid(id)
local xopts xtile strict by(group) cutby
gquantiles xtile = x, `xopts' nq(10)
gquantiles xtile2 = x, `xopts' cutpoints(pctile)
gquantiles xtile3 = x, `xopts' cutquantiles(perc)
assert xtile == xtile2
assert xtile == xtile3
However, there is no requirement for the user to do so:
sysuse auto, clear
gquantiles xtile = price, xtile by(foreign) nq(50)
Computing many quantiles
Stata's _pctile caps the number of quantiles to 1001. pctile uses _pctile
internally, so to compute more than 1001 percentiles it needs to loop over
various runs of _pctile in a very inefficient way. This inefficiency carries
over to xtile because that command uses pctile internally. (Presumably this is
the reason for the limit in the user-written fastxtile).
The following executes with no errors in a reasonable amount of time
clear
set obs 1000000
gen x = runiform()
_pctile x, nq(1001)
pctile p1 = x, nq(1001)
gquantiles p2 = x, pctile nq(1001)
However, if you increase nq
the runtimes become excessive:
drop p*
timer clear
timer on 90
pctile p1 = x, nq(5001)
timer off 90
timer on 10
gquantiles p2 = x, pctile nq(5001)
timer off 10
assert p1 == p2
timer list
10: 0.42 / 1 = 0.4200
90: 61.14 / 1 = 61.1440
61 seconds for only 1,00,000 observations! This is in
Stata/MP with 8 cores. gquantiles scales nicely, by contrast:
drop p*
timer clear
timer on 10
gquantiles p2 = x, pctile nq(`=_N + 1')
timer off 10
clear
set obs 100000000
gen x = runiform() * 100
timer on 20
gquantiles p2 = x, pctile nq(`=_N + 1')
timer off 20
timer list
10: 0.44 / 1 = 0.4430
20: 36.53 / 1 = 36.5270
90: 61.14 / 1 = 61.1440
That's right, gquantiles computed 100M quantiles for 100M observations in 36
seconds, faster than pctile could compute 5,000 quantiles for 1M observations.
As a side-note, using mata can afford a massive speedup, obviating the need
to call C in case gquantiles does not do something you want. Consider:
clear all
timer clear
mata:
void function mata_pctile (string scalar newvar,
string scalar sourcevar,
real scalar nq)
real scalar N
real colvector X, quantiles, qpositions, qties, qtiesix, Q, Qties
X = st_data(., sourcevar)
N = rows(X)
_sort(X, 1)
quantiles = ((1::(nq - 1)) * N / nq)
qpositions = ceil(quantiles)
qties = (qpositions :== quantiles)
Q = X[qpositions]
if ( any(qties) ) {
qtiesix = selectindex(qties)
Qties = X[qpositions[qtiesix] :+ 1]
Q[qtiesix] = (Q[qtiesix] + Qties) / 2
st_addvar("`:set type'", newvar)
st_store((1::(nq - 1)), newvar, Q)
set obs 1000000
gen x = runiform()
timer on 80
mata: mata_pctile("p0", "x", 5001)
timer off 80
timer on 90
pctile p1 = x, nq(5001)
timer off 90
timer on 10
gquantiles p2 = x, pctile nq(5001)
timer off 10
assert p0 == p1
assert p1 == p2
timer list
10: 0.48 / 1 = 0.4770
80: 1.23 / 1 = 1.2300
90: 56.97 / 1 = 56.9720
Just by using mata we speeded up Stata 50 times! The mata solution
does not scale as well as gquantiles, however:
clear
timer clear
set obs 10000000
gen x = runiform()
timer on 80
mata: mata_pctile("p0", "x", 5001)
timer off 80
timer on 10
gquantiles p2 = x, pctile nq(5001)
timer off 10
timer list
10: 2.97 / 1 = 2.9720
80: 14.40 / 1 = 14.4030
With just 10M observations, gquantiles is still a 5x improvement over mata
when computing many quantiles.
Computation methods
Computing quantiles involves selecting elements from an unordered array
in one of two ways: Using a selection algorithm on the unsorted variable
or sorting and then selecting elements of the sorted varaible.
The internal selection algorithm of gquantiles is very fast and on average will
run in linear O(N) time (see quickselect).
The sorting algorithm runs in O(N log(N)) time (see quicksort).
Clearly, with few quantiles we can see the selection algorithm will be faster.
However, with a large number of quantiles running multiple iterations of the
selection algorithm is clearly slower than doing a single sort.
clear
timer clear
set obs 10000000
gen x = rnormal() * 100
timer on 10
gquantiles p1 = x, pctile nq(2) method(1)
timer off 10
timer on 20
gquantiles p2 = x, pctile nq(2) method(2)
timer off 20
assert p1 == p2
timer list
10: 3.39 / 1 = 3.3870
20: 1.10 / 1 = 1.1020
We can see that method 2 was more than 3 times faster for a single quantile.
timer clear
timer on 10
gquantiles p1 = x, pctile nq(10) method(1) replace
timer off 10
timer on 20
gquantiles p2 = x, pctile nq(10) method(2) replace
timer off 20
timer list
10: 3.43 / 1 = 3.4290
20: 2.28 / 1 = 2.2840
While method 2 was still faster, computing 10 quantiles took twice the time it
took to compute 1. By contrast, method 1 took essentially the same time. This
is because after sorting the data, selecting elements is nearly instantaneous.
timer clear
timer on 10
gquantiles p1 = x, pctile nq(100) method(1) replace
timer off 10
timer on 20
gquantiles p2 = x, pctile nq(100) method(2) replace
timer off 20
timer list
10: 3.22 / 1 = 3.2210
20: 7.04 / 1 = 7.0370
With 100 quantiles we can see that the performance of method 2 is now much
worse than method 1. Internally, gquantiles will try to switch between the two
methods based on the number of observations and the number of quantiles.
You might be tempted to always specify method 2 for few quantiles, but
there is a second way in which it is slower than sorting:
timer clear
replace x = int(x)
timer on 10
gquantiles p1 = x, pctile nq(10) method(1) replace
timer off 10
timer on 20
gquantiles p2 = x, pctile nq(10) method(2) replace
timer off 20
timer list
10: 1.29 / 1 = 1.2940
20: 1.48 / 1 = 1.4760
What happened? While both commands are faster, now method 1 is faster than
method 2, whereas before it was 50% slower. This is because the specific
sorting algorithm I use handles duplicates better than the selection
algorithm.
timer clear
timer on 10
gquantiles p1 = x, pctile nq(100) method(1) replace
timer off 10
timer on 20
gquantiles p2 = x, pctile nq(100) method(2) replace
timer off 20
timer list
10: 1.54 / 1 = 1.5390
20: 4.31 / 1 = 4.3080
Again, both are faster with duplicates, but method 1 is much faster.
Multiple subcommands
gquantiles allows the user to compute several things at once:
sysuse auto, clear
gquantiles price, _pctile xtile(x1) pctile(p1) binfreq nq(10)
matrix list r(quantiles_binfreq)
l price x1 p1 in 1/10
r(quantiles_binfreq)[9,1]
r1 8
r2 7
r3 8
r4 7
r5 7
r6 8
r7 7
r8 8
r9 7
+----------------------+
| price x1 p1 |
|----------------------|
1. | 4,099 2 3895 |
2. | 4,749 5 4099 |
3. | 3,799 1 4425 |
4. | 4,816 5 4647 |
5. | 7,827 8 5006.5 |
|----------------------|
6. | 5,788 7 5705 |
7. | 4,453 4 6165 |
8. | 5,189 6 7827 |
9. | 10,372 9 11385 |
10. | 4,082 2 . |
+----------------------+
Specifying quantiles and cutoffs
gquantiles allows for several ways to specify cutoffs
sysuse auto, clear
gquantiles price, _pctile p(10(10)99)
matrix p0 = r(quantiles_used)
gquantiles p1 = price, pctile nq(10) genp(g1) xtile(x1)
gquantiles x2 = price, xtile cutpoints(p1)
gquantiles x3 = price, xtile cutquantiles(g1)
qui glevelsof p1
gquantiles x4 = price, xtile cutoffs(`r(levels)')
qui glevelsof g1
gquantiles x5 = price, xtile quantiles(`r(levels)')
matrix list p0
l p1 g1 x? in 1/10
p0[9,1]
r1 3895
r2 4099
r3 4425
r4 4647
r5 5006.5
r6 5705
r7 6165
r8 7827
r9 11385
. l p1 g1 x? in 1/10
+--------------------------------------+
| p1 g1 x1 x2 x3 x4 x5 |
|--------------------------------------|
1. | 3895 10 2 2 2 2 2 |
2. | 4099 20 5 5 5 5 5 |
3. | 4425 30 1 1 1 1 1 |
4. | 4647 40 5 5 5 5 5 |
5. | 5006.5 50 8 8 8 8 8 |
|--------------------------------------|
6. | 5705 60 7 7 7 7 7 |
7. | 6165 70 4 4 4 4 4 |
8. | 7827 80 6 6 6 6 6 |
9. | 11385 90 9 9 9 9 9 |
10. | . . 2 2 2 2 2 |
+--------------------------------------+