The slow speed of xtile has long irritated me. As an attempt to find a speedy alternative, I posted the astile program on this forum (http://www.statalist.org/forums/foru...tile-vs-astile). However, that version had issues and was abandoned. Even so, my motivation for a speedy alternative never died. Thanks to Kit Baum, a new version of astile is now available on SSC. Here is the description of the program and some speed tests. To install the package, type
Code:
ssc install astile
help astile
Title
astile - Creates variable containing quantile categories
Syntax
astile newvar = exp [if] [in] [, nquantiles(#) by(varlist)]
Description
astile creates a new variable that categorizes exp by its quantiles. For example, we might be interested in making 10 size-based portfolios. This involves placing the smallest 10% of firms in portfolio 1, the next 10% in portfolio 2, and so on. astile creates the new variable named in newvar from the existing variable or expression given in = exp. Values of newvar range from 1 to n, where n is the number of quantile groups specified in the nq option. For example, if we want to make 10 portfolios, values of newvar will range from 1 to 10.
astile is faster than Stata's official xtile. Its speed advantage matters more in larger data sets or when the quantile categories are created multiple times, e.g., when we want to create portfolios in each year or each month. Unlike Stata's official xtile, astile is byable.
Options
astile has the following two options.
1. nquantiles
The nq(#) option specifies the number of quantiles. The default value of nq is 2, which splits the data at the median.
2. by
astile is byable. Hence, it can be run on groups as specified by the option by(varlist).
Example 1: Create 10 groups of firms based on their market value
Code:
webuse grunfeld
astile size10=mvalue, nq(10)
Example 2: Create 5 groups of firms based on their market value in each year
Code:
webuse grunfeld
astile size5=mvalue, nq(5) by(year)
Limitations
This version of astile does not support the weights, altdef, and cutpoints options that are available in the official xtile command. In the next version, I plan to include some of these options.
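Until those options arrive, the official xtile can cover the unequal-bin case. The following is a minimal sketch and not part of astile: the variable names cut and size_uneq, the scalar names cp1 to cp5, and the chosen percentiles are all assumptions made for illustration.

```stata
* Sketch: unequal bins with the official xtile and its cutpoints() option.
* The names cut, size_uneq, and cp1-cp5 are illustrative assumptions.
webuse grunfeld, clear

* Boundaries at the 5, 25, 50, 75, and 95% points of mvalue
_pctile mvalue, percentiles(5 25 50 75 95)
forvalues i = 1/5 {
    scalar cp`i' = r(r`i')      // scalars keep each boundary in double precision
}

* cutpoints() expects a variable that holds the boundaries
gen double cut = .
forvalues i = 1/5 {
    replace cut = cp`i' in `i'
}

* Five boundaries give six categories, numbered 1 to 6
xtile size_uneq = mvalue, cutpoints(cut)
tabulate size_uneq
```

The boundaries are stored in scalars first so that later commands cannot disturb the r() results returned by _pctile.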
SPEED COMPARISON
The following tests are performed using Stata 14.2. The test results might vary from computer to computer based on CPU speed.
Without by Option
* To generate an example data set of one million observations
Code:
clear
set obs 1000
gen year=_n+1000
expand 1000
bys year: gen id=_n
gen size=uniform()*100
timer clear
timer on 1
egen xt10=xtile(size), nq(10) // from egenmore package
timer off 1
timer on 2
astile as10=size, nq(10)
timer off 2
timer on 3
xtile of10=size, nq(10) // Stata official
timer off 3
assert as10==of10
timer list
1: 4.71 / 1 = 4.7130
2: 3.61 / 1 = 3.6140
3: 9.01 / 1 = 9.0050
* Both xtile (from the egenmore package) and astile run roughly two to two and a half times faster than the official xtile, with a marginal speed advantage for astile over xtile from egenmore.
With By Option
Since the official xtile does not have a by option, I compare astile with xtile from the egenmore package.
Code:
timer clear
timer on 1
bys year: egen yxt10=xtile(size), nq(10)
timer off 1
timer on 2
bys year: astile yas10=size, nq(10)
timer off 2
timer list
1: 1037.37 / 1 = 1037.3700
2: 198.31 / 1 = 198.3080
assert yxt10==yas10
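For reference, groupwise binning with nothing but the official xtile needs an explicit loop over the groups. The sketch below illustrates that pattern on the small grunfeld data. It is not the code inside egenmore or astile, and the names size5 and tmpq are made up for the example.

```stata
* Sketch: quantile groups within each year using only official commands.
* size5 and tmpq are illustrative names, not part of any package.
webuse grunfeld, clear
gen byte size5 = .

* One xtile call per distinct year
levelsof year, local(years)
foreach y of local years {
    xtile tmpq = mvalue if year == `y', nq(5)
    replace size5 = tmpq if year == `y'
    drop tmpq
}
tabulate size5
```

One call to xtile per group is the kind of overhead that grows with the number of groups, which is relevant when there are 1000 of them.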
--------------------------------------------------
Attaullah Shah, PhD.
Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
FinTechProfessor.com
https://asdocx.com
Check out my
asdoc
program, which sends outputs to MS Word.
For more flexibility, consider using
asdocx
which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.
Fast always beats slow with nothing else said, so speeding up slow commands is clearly good.
But as with Attaullah's asrol, discussed at http://www.statalist.org/forums/foru...peed-advantage, there is a claim here of speed without full access to the code: the Mata code at the heart of this is presented as compiled code and programmers can't inspect it.
I feel more or less the same way about this practice as I do about magic, such as a card trick or someone sawn in half as far as you see, except that they aren't. With magic, evidently someone has a smart trick but we don't know what it is. Similarly, I can't learn from code I can't see and I can't fully discuss it either (e.g. what are its hidden assumptions, if any).
I don't know why some programmers hide Mata code in programs made public.
That said, there is plenty of scope for various comments on the details.
1. The egen function xtile() in egenmore (SSC) was written for Stata 8.2 by Ulrich Kohler. Uli can speak for himself, but it's offered as a convenience wrapper. In Attaullah's second example, there are 1000 distinct groups to loop over, so yes, it's slow. I've not seen anyone before wanting to bin using quantiles separately within 1000 distinct groups, but someone will tell me they do often want this. There is no reason to use this function if you only want a single group, as xtile does that directly.
2. The comparison has missed a serious alternative candidate: Michael Stepner wrote fastxtile and it's been on SSC since 2014. That program has several more useful bells and whistles than astile, just not (so far as I can see) support for by:.
3. There is another benchmark, with bins ideally of equal size, when the quantiles are regularly spaced in probability terms, as they are in the only version of quantile binning tackled by astile. You just round the ranks suitably. That method is exact if (and only if) ties do not bite at quantile boundaries. With a million random uniforms generated as a float, ties at the boundaries are possible but very unlikely. But it's a reference level.
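A tiny made-up example shows what "ties biting at a quantile boundary" looks like. The data and the names naive2 and off2 are assumptions for illustration only. Two observations share the value 5 exactly where the median falls, so rounding the ranks splits them across bins while xtile keeps them together.

```stata
* Sketch: a tie at the median boundary separates rank rounding from xtile.
clear
set obs 10
gen x = _n
replace x = 5 in 6          // x is now 1 2 3 4 5 5 7 8 9 10
sort x

gen byte naive2 = ceil(2 * _n/_N)   // round the ranks into 2 bins
xtile off2 = x, nq(2)               // official binning at the median

list x naive2 off2
count if naive2 != off2
```

Here the median is 5, so xtile places both tied observations in bin 1, and the naive rule disagrees for one of them.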
I tweaked Attaullah's script by leaving out the generation of identifiers (irrelevant) and adding an explicit seed, important for reproducibility, and adding calls to Michael's code and what I am calling the naive approach of rounding ranks.
Here are my results:
Code:
. clear
. set obs 1000
number of observations (_N) was 0, now 1,000
. set seed 2803
. gen year=_n+1000
. expand 1000
(999,000 observations created)
. gen size=uniform()*100
. timer clear
. timer on 1
. egen xt10=xtile(size), nq(10) // from egenmore package
. timer off 1
. timer on 2
. astile as10=size, nq(10)
. timer off 2
. timer on 3
. xtile of10=size, nq(10) // Stata official
. timer off 3
. timer on 4
. fastxtile ms10=size, nq(10)
. timer off 4
. timer on 5
. sort size
. gen naive10 = ceil(10 * _n/_N)
. timer off 5
. assert as10==of10
. assert ms10==of10
. assert naive10==of10
. timer list
1: 1.70 / 1 = 1.7000
2: 1.33 / 1 = 1.3260
3: 2.50 / 1 = 2.4960
4: 1.20 / 1 = 1.2010
5: 0.86 / 1 = 0.8580
. timer clear
. timer on 1
. bys year: egen yxt10=xtile(size), nq(10)
. timer off 1
. timer on 2
. bys year: astile yas10=size, nq(10)
. timer off 2
. timer on 3
. bysort year (size): gen ynaive10 = ceil(10 * _n/_N)
. timer off 3
. timer list
1: 772.65 / 1 = 772.6540
2: 111.14 / 1 = 111.1440
3: 1.05 / 1 = 1.0540
. assert yxt10==yas10
. assert yxt10==ynaive10
The new details are that
1. fastxtile is faster yet than astile for the simpler problem. Personally I don't mind waiting 2 seconds for 1 million values to be binned. I can check Statalist while I am waiting. (If I had 1 billion values, I would care.) But if speed is claimed, astile is not, on this limited evidence, the fastest beast in the zoo.
2. While Attaullah's program is as before much faster than the egen function for the groupwise problem, naive code is 100 times faster and (in this case; I was lucky) gives the same result. That doesn't mean it's better code, as it won't handle ties properly. The question is what is astile doing to be so slow when the only difference is fixing what happens with ties?
Good news then: scope for many further improvements is implied.
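To make the point about ties concrete, here is one way a tie-aware variant of the rank approach might look. This is a hedged sketch only. It is not taken from xtile, fastxtile, or astile, and it assumes the default (non-altdef) quantile definition, no weights, and no missing values. The names pos, tiebin, and offbin are made up. The idea is to give every observation in a run of tied values the bin implied by the first position in that run.

```stata
* Sketch: rank-based bins that keep tied values together.
* Assumes default quantile definition, no weights, no missing values.
clear
set obs 10
gen x = _n
replace x = 5 in 6          // a tie at the median boundary
local nq 2

sort x
gen long pos = _n           // overall rank position
local N = _N                // total count, needed inside by:

* Within each run of equal x, pos[1] is the first position of the run
by x: gen byte tiebin = 1 + floor(`nq' * (pos[1] - 1) / `N')

xtile offbin = x, nq(`nq')
count if tiebin != offbin
```

On this small example the two variables agree. Whether that holds in general would need checking against xtile on data with many ties, which this sketch does not attempt.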
There is in fact another creature in the zoo! Matthieu Gomez created an egen wrapper for fastxtile, which is available on SSC under egenmisc. I have not used this code or inspected it, but I believe it is based on my fastxtile code.
I can't blame Attaullah for being unaware of these alternatives: it's hard to keep up with the growing catalog of options. During the time Stata 14 was in development, I advocated for replacing xtile with fastxtile. My own tests show that they produce identical output, but Stata has their own certification scripts which they could subject fastxtile to. I wouldn't be offended if they removed the additional functionality before placing it in mainline Stata, if they'd prefer not to add any new bells and whistles to xtile. And I've released the code to the public domain under CC0, so there are no licensing issues to speak of.
If anyone at StataCorp would like to replace xtile with fastxtile for Stata 15, they'd have my full support in the endeavor.
Michael: Glad you spotted this thread.
StataCorp can speak for themselves, but by and large they don't adopt user code any more. I guess the reason is three-fold:
1. Really good user code such as your own fastxtile is already accessible and easy to download and use, so the margin in StataCorp adopting it is smaller than you think. Otherwise put, the main StataCorp solution is providing net install so that it's trivial for most people to pull stuff across the net. (But I don't forget that a large class of users are behind firewalls that make that difficult if not impossible.) The success of this strategy is indicated indirectly by the large difficulties that many users have in keeping track of what is user-written and what is not!
2. Adopting user-written stuff creates a burden of testing, documentation and user support that could be spent otherwise on really big projects that users would hardly take on, such as multiple imputation or Bayes.
3. We all understand best what we create ourselves. That's really important. So, StataCorp developers look for problems that users are running into and then create their own solutions. (Sometimes, they take several years to get to that, but the company has no interest in implementing transient fads that researchers will only play with for a couple of years before the next version of sliced bread arrives.)
Nick: Your effective use of an @-tag brought this thread into my email inbox.
I understand the reasoning for why StataCorp doesn't generally adopt user code. And of course they would need to decide whether it's worth the effort to adopt fastxtile. This may be a rare case where the benefit exceeds the cost. StataCorp could run fastxtile through their existing certification scripts, since it follows the same syntax as xtile. They could remove the bells and whistles if they'd like to keep the xtile documentation largely unchanged (save updating Author and Date). Someone at StataCorp would likely need to read the code to double-check that it's sensible, but fortunately it's well-structured and only 200 lines.
Despite the fact that user scripts are easily obtainable, defaults are powerful. I would guess the ratio of xtile to fastxtile users is something like 10,000:1. Perhaps StataCorp would deem it worthwhile to spend a couple of employee hours on this. It would likely save their user base many hundreds of hours in the aggregate (although most users might save only a few seconds or minutes of their time).
Nick, to be honest I don't know the appropriate way to get this suggestion in front of the right person at StataCorp. The last time I made the suggestion, it was at an annual Stata Conference. Do you have advice about how I should suggest the replacement of xtile with fastxtile to StataCorp? I wouldn't be offended if they decide not to follow it.
I agree with Michael. While user-added capabilities are a great feature of Stata, I would have loved to see Stata adopt some of the most used and best-known user packages into vanilla Stata. As was stated in this thread, it's sometimes very hard to know what you don't know, so that's a big limitation on whether or not you install user packages. The fact that the Stata FAQ itself has some mentions of user packages is a sort of admission of that. I'm talking, for example, about spmap, which I originally came upon by chance, as I thought that Stata had no geo-mapping capabilities at all.
When programming code is well written and publicly available, I would have liked to see StataCorp utilize that "openness" for the benefit of all users.
Dear Nick, thanks for your time. Several of your comments need my reply. First of all, I think you have read my first post, where I posted version 1 of astile. The post is located here: http://www.statalist.org/forums/foru...tile-vs-astile. The so-called 'naive' approach that you have posted here for making quantile groups was actually used in that first version. But then you commented, using "must" and "should":
That aside, any serious program in this territory, in Stata or in any other language,
1. must handle ties intelligently
2. must handle missing data intelligently
3. should support groupwise calculations
4. should support equal or unequal bins (e.g. some people might want to bin with boundaries at 5, 10, 25, 50, 75, 90, 95% points of a distribution and not just according to so many quantiles equally spaced in probability terms).
5. should lend itself to application to several variables
The use of "must" and "should" matches my suggestions on what is essential and desirable respectively.
The so-called naive approach is 100 times faster, but it does not handle ties and missing values. So I think it is not suitable for any serious analysis. See the following example:
Code:
clear
webuse grunfeld
sort mvalue
replace mvalue = . in 4
replace mvalue=. in 5
replace mvalue=. in 18
replace mvalue=. in 198
replace mvalue=. in 12
gen naive1 = ceil(10 * _n/_N)
xtile off10=mvalue, nq(10)
astile size10=mvalue, nq(10)
assert size10 ==off10
. assert naive1 ==off10
21 contradictions in 200 observations
assertion is false
r(9);
Nick further commented that
Attaullah: Let me rephrase one point more precisely: I don't know why you are hiding your code and even when the point is raised you don't explain. Your decision, to hide or not, to explain or not, and no obligation that you don't take on yourself, but code hidden cannot be discussed.
The role of the naive code is already explained: to act as a benchmark showing how fast the code could run in perfect conditions. You're right that missing values are ignored and so the code will not handle those correctly. I was just running the code on the test problem you set out yourself. If you post a test problem with missings, the naive code could be adapted and compared.
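As a rough sketch of such an adaptation, the naive code can rank only the nonmissing values and leave the rest unbinned. The names n_ok, naive10, and off10 are illustrative, and the choice of which observations to set missing is arbitrary. The sketch still ignores ties, and it may also differ from xtile at bin edges when the nonmissing count is not a multiple of the number of bins, so the last line counts disagreements rather than asserting that there are none.

```stata
* Sketch: naive rank rounding that skips missing values.
webuse grunfeld, clear
replace mvalue = . in 1/10          // introduce some missings for the test

quietly count if !missing(mvalue)
local n_ok = r(N)                   // number of values that can be binned

sort mvalue                         // missing values sort last
gen byte naive10 = ceil(10 * _n / `n_ok') if !missing(mvalue)

xtile off10 = mvalue, nq(10)
count if naive10 != off10           // report, do not assume, agreement
```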
Thanks for reminding me of the ideals I laid out. Points #4 and #5 cited in #8 are presumably now on the astile agenda.
A main theme of this thread is exactly what you have made it: claims about speed for your code. Just as with asrol, astile runs faster than what you compared it with, and well and good, but corrections and qualifications are needed. fastxtile is faster for the first problem, and astile is still very slow for your second problem and I don't know why.
As Michael Stepner commented, there is also an egen version of his command to run on the second problem.
Michael Stepner, you are right. I was unaware of fastxtile. Compared with the posted version of astile, fastxtile is faster by a few seconds without the by option, and with the by option the difference increases. I have been working on the next version of astile, which still needs certification. So far, results of that beta version show that it is even faster than fastxtile. I am working on the certification; once it is ready, I shall post that version first on Statalist and then update the SSC version.
Hidden code: As I said, it's your decision and I won't press the point further.
What remains interesting to anyone concerned with these calculations is which programs are correct, most versatile and fast(est).
As I write, the last post in the thread on asrol is my post #16 at http://www.statalist.org/forums/foru...dvantage/page2 with a suggestion that the help wording remains misleading. That suggestion still holds.
I plan to change the help file in the next version, which brings significant changes in the way asrol understands different data structures, thereby overcoming its present limitation of not efficiently handling data with too many gaps in the range variable. But until then, you are right, the help file needs correction.
I am posting version 3.0 of astile here. My initial tests show that version 3.0 of astile is reliable and is the fastest among the available alternatives, such as xtile, fastxtile, and xtile from egenmore. I have posted the new version on my website, and it can be downloaded by typing the command below.
If proxy settings do not allow the download, it can be installed manually. Copy and paste the attached files
astile.ado
astile.sthlp
into the following folder:
Code:
C:/ado/plus/a
For some reason lasn.mlib cannot be attached, so here is the direct link
https://sites.google.com/site/imspes...edirects=0&d=1
Download lasn.mlib and copy it to
Code:
C:/ado/plus/l
Since I have already shown speed tests for the official xtile and xtile from egenmore, the following are the speed tests for fastxtile (Michael Stepner) and fastxtile from egenmisc (SSC).
Thanks for the extra news. Some small points of clarification from me:
1. The egen function fastxtile is not a wrapper for Michael Stepner's code. Matthieu Gomez, its author, only used the same name. I think Michael was guessing from the name that it was a wrapper, but looking at the code contradicts that. There is no Mata code there, for example.
2. The only advantage of any egen function to users lies in its support for a by: or by() calculation, as calling code from within egen can only slow it down.
Some small points of clarification now requested:
1. I don't see here timings for fastxtile and your new astile.
2. What's puzzling now is that your timings for the new astile show that it is slower than it was in post #1 (5.4 s rather than 3.6 s). Is this a different problem? (Looks the same to me.) Are you using a different machine?
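Since single timings can shift between runs and machines, one way to make comparisons steadier is to time the same command several times under one timer and look at the mean. The sketch below uses only the official xtile. The sample size, the number of repetitions, and the variable names are arbitrary choices for illustration.

```stata
* Sketch: repeat a timing under one timer and report the mean per run.
clear
set obs 100000
set seed 2803
gen size = runiform()*100

timer clear
forvalues r = 1/5 {
    timer on 1
    xtile q`r' = size, nq(10)       // a fresh variable on each repetition
    timer off 1
}

* timer list shows total time, number of runs, and their ratio
timer list 1
display "mean seconds per run: " r(t1)/r(nt1)
```

Any other binning command can be swapped in for xtile and timed under a second timer on the same data.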