ASTILE: New Package - Fast and byable alternative for XTILE

The slow speed of xtile has long irritated me. As an attempt to find a speedy alternative, I posted an astile program on this forum (http://www.statalist.org/forums/foru...tile-vs-astile). However, that version had issues and was abandoned. Even so, my motivation for a speedy alternative never died. Thanks to Kit Baum, a new version of astile is now available on SSC. Here is the description of the program and some speed tests. To install the package, type

Code:

ssc install astile
help astile

Title
astile - Creates variable containing quantile categories
Syntax
astile newvar = exp [if] [in] [, nquantiles(#) by(varlist)]
Description
astile creates a new variable that categorizes exp by its quantiles. For example, we might be interested in making 10 size-based portfolios. This involves placing the smallest 10% of firms in portfolio 1, the next 10% in portfolio 2, and so on. astile creates the new variable named in newvar from the existing variable or expression given in = exp. Values of newvar range from 1, 2, 3, ... up to n, where n is the number of quantile groups specified in the nq option. For example, if we want to make 10 portfolios, values of newvar will range from 1 to 10.
astile is faster than Stata's official xtile. Its speed advantage matters more in larger data sets or when the quantile categories are created multiple times, e.g., when we want to create portfolios in each year or each month. Unlike Stata's official xtile, astile is byable.
Options
astile has the following two options.
1. nquantiles
The nq(#) option specifies the number of quantiles. The default value of nq is 2, that is, a split at the median.
2. by
astile is byable. Hence, it can be run on groups as specified by option by(varlist).
Example 1: Create 10 groups of firms based on their market value

Code:

. webuse grunfeld
. astile size10=mvalue, nq(10)

Example 2: Create 5 groups of firms based on their market value in each year

Code:

. webuse grunfeld
. astile size5=mvalue, nq(5) by(year)

Limitations
This version of astile does not support the weights, altdef, and cutpoints options that are available in the official xtile command. In the next version, I plan to include some of these options.
SPEED COMPARISON
The following tests are performed using Stata 14.2. The test results might vary from computer to computer based on CPU speed.
Without by Option
* To generate an example data set of one million observations

Code:

clear
set obs 1000
gen year=_n+1000
expand 1000
bys year: gen id=_n
gen size=uniform()*100
timer clear
timer on 1
egen xt10=xtile(size), nq(10) // from egenmore package
timer off 1
timer on 2
astile as10=size, nq(10)
timer off 2
timer on 3
xtile of10=size, nq(10) // Stata official
timer off 3
assert as10==of10
timer list
   1:      4.71 /        1 =       4.7130
   2:      3.61 /        1 =       3.6140
   3:      9.01 /        1 =       9.0050

* Both xtile from the egenmore package and astile run roughly two to two and a half times faster than the official xtile, with a marginal speed advantage for astile over xtile from egenmore.
With By Option
Since the official xtile does not have a by option, I compare astile with xtile from the egenmore package.

Code:

timer clear
timer on 1
bys year: egen yxt10=xtile(size), nq(10)
timer off 1
timer on 2
bys year: astile yas10=size, nq(10)
timer off 2
timer list
 1:   1037.37 /        1 =    1037.3700
 2:    198.31 /        1 =     198.3080
assert yxt10==yas10

--------------------------------------------------
Attaullah Shah, PhD.
Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
FinTechProfessor.com
https://asdocx.com
Check out my asdoc program, which sends outputs to MS Word.
For more flexibility, consider using asdocx which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.

Fast always beats slow with nothing else said, so speeding up slow commands is clearly good.
But as with Attaullah's asrol, discussed at
http://www.statalist.org/forums/foru...peed-advantage
there is a claim here of speed without full access to the code: the Mata code at the heart of this is presented as compiled code and programmers can't inspect it.
I feel more or less the same way about this practice as I do about magic, such as a card trick or someone sawn in half as far as you see, except that they aren't. With magic, evidently someone has a smart trick but we don't know what it is. Similarly, I can't learn from code I can't see and I can't fully discuss it either (e.g. what are its hidden assumptions, if any).
I don't know why some programmers hide Mata code in programs made public.
That said, there is plenty of scope for various comments on the details.
1. The egen function xtile() in egenmore (SSC) was written for Stata 8.2 by Ulrich Kohler. Uli can speak for himself, but it's offered as a convenience wrapper. In Attaullah's second example, there are 1000 distinct groups to loop over, so yes, it's slow. I've not seen anyone before wanting to bin using quantiles separately within 1000 distinct groups, but someone will tell me they often do want this. There is no reason to use this function if you only want a single group, as xtile does that directly.
2. The comparison has missed a serious alternative candidate: Michael Stepner wrote fastxtile, and it's been on SSC since 2014. That program has several useful bells and whistles more than astile, just not (so far as I can see) support for by:.
3. There is another benchmark, with bins ideally of equal size when the quantiles are regularly spaced in probability terms, as they are in the only version of quantile binning tackled by astile. You just round the ranks suitably. That method is exact if (and only if) ties do not bite at quantile boundaries. With a million random uniforms generated as a float, ties at a boundary are possible but very unlikely. But it's a reference level.
I tweaked Attaullah's script by leaving out the generation of identifiers (irrelevant) and adding an explicit seed, important for reproducibility, and adding calls to Michael's code and what I am calling the naive approach of rounding ranks.
Here are my results:

Code:

. clear
. set obs 1000
number of observations (_N) was 0, now 1,000
. set seed 2803
. gen year=_n+1000
. expand 1000
(999,000 observations created)
. gen size=uniform()*100
. timer clear
. timer on 1
. egen xt10=xtile(size), nq(10) // from egenmore package
. timer off 1
. timer on 2
. astile as10=size, nq(10)
. timer off 2
. timer on 3
. xtile of10=size, nq(10) // Stata official
. timer off 3
. timer on 4
. fastxtile ms10=size, nq(10)
. timer off 4
. timer on 5
. sort size
. gen naive10 = ceil(10 * _n/_N)
. timer off 5
. assert as10==of10
. assert ms10==of10
. assert naive10==of10
. timer list
   1:      1.70 /        1 =       1.7000
   2:      1.33 /        1 =       1.3260
   3:      2.50 /        1 =       2.4960
   4:      1.20 /        1 =       1.2010
   5:      0.86 /        1 =       0.8580
. timer clear
. timer on 1
. bys year: egen yxt10=xtile(size), nq(10)
. timer off 1
. timer on 2
. bys year: astile yas10=size, nq(10)
. timer off 2
. timer on 3
. bysort year (size): gen ynaive10 = ceil(10 * _n/_N)
. timer off 3
. timer list
   1:    772.65 /        1 =     772.6540
   2:    111.14 /        1 =     111.1440
   3:      1.05 /        1 =       1.0540
. assert yxt10==yas10
. assert yxt10==ynaive10

The new details are that
1. fastxtile is faster yet than astile for the simpler problem. Personally I don't mind waiting 2 seconds for 1 million values to be binned. I can check Statalist while I am waiting. (If I had 1 billion values, I would care.) But if speed is claimed, astile is not, on this limited evidence, the fastest beast in the zoo.
2. While Attaullah's program is as before much faster than the egen function for the groupwise problem, naive code is 100 times faster and (in this case; I was lucky) gives the same result. That doesn't mean it's better code, as it won't handle ties properly. The question is what is astile doing to be so slow when the only difference is fixing what happens with ties?
Good news then: scope for many further improvements is implied.
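To make the point about ties concrete, here is a minimal sketch on made-up data (the variable names x, naive4, off4 and split are mine, not taken from any of the programs discussed). Each value occurs four times, so the rank-rounding rule has to cut through blocks of tied values, whereas xtile, which compares values against cutpoints, keeps tied values in one bin.

```stata
* Hypothetical toy data: 20 observations, 5 distinct values, each repeated 4 times
clear
set obs 20
gen x = ceil(_n/4)

* Naive approach: round the ranks into 4 bins of 5 observations each
sort x
gen naive4 = ceil(4 * _n/_N)

* Official xtile for comparison
xtile off4 = x, nq(4)

* Flag values of x whose tied observations land in more than one naive bin
bysort x (naive4): gen byte split = naive4[1] != naive4[_N]
tabulate x naive4
tabulate x off4

* xtile never separates tied values
bysort x (off4): assert off4[1] == off4[_N]
```

The two tabulations show where the naive bins cut through a block of equal values; the final assert checks that xtile does not.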

There is in fact another creature in the zoo! Matthieu Gomez created an egen wrapper for fastxtile, which is available on SSC under egenmisc. I have not used this code or inspected it, but I believe it is based on my fastxtile code.
I can't blame Attaullah for being unaware of these alternatives: it's hard to keep up with the growing catalog of options. During the time Stata 14 was in development, I advocated for replacing xtile with fastxtile. My own tests show that they produce identical output, but Stata has their own certification scripts which they could subject fastxtile to. I wouldn't be offended if they removed the additional functionality before placing it in mainline Stata, if they'd prefer not to add any new bells and whistles to xtile. And I've released the code to the public domain under CC0, so there are no licensing issues to speak of.
If anyone at StataCorp would like to replace xtile with fastxtile for Stata 15, they'd have my full support in the endeavor.

Michael: Glad you spotted this thread.
StataCorp can speak for themselves, but by and large they don't adopt user code any more. I guess the reason is three-fold:
1. Really good user code such as your own fastxtile is already accessible and easy to download and use, so the margin in StataCorp adopting it is smaller than you think. Otherwise put, the main StataCorp solution is providing net install so that it's trivial for most people to pull stuff across the net. (But I don't forget that a large class of users are behind firewalls that make that difficult if not impossible.) The success of this strategy is indicated indirectly by the large difficulties that many users have in keeping track of what is user-written and what is not!
2. Adopting user-written stuff creates a burden of testing, documentation and user support that could be spent otherwise on really big projects that users would hardly take on, such as multiple imputation or Bayes.
3. We all understand best what we create ourselves. That's really important. So, StataCorp developers look for problems that users are running into and then create their own solutions. (Sometimes, they take several years to get to that, but the company has no interest in implementing transient fads that researchers will only play with for a couple of years before the next version of sliced bread arrives.)
Nick: Your effective use of an @-tag brought this thread into my email inbox.

I understand the reasoning for why StataCorp doesn't generally adopt user code. And of course they would need to decide whether it's worth the effort to adopt fastxtile. This may be a rare case where the benefit exceeds the cost. StataCorp could run fastxtile through their existing certification scripts, since it follows the same syntax as xtile. They could remove the bells and whistles if they'd like to keep the xtile documentation largely unchanged (save updating Author and Date). Someone at StataCorp would likely need to read the code to double-check that it's sensible, but fortunately it's well-structured and only 200 lines.
Despite the fact that user scripts are easily obtainable, defaults are powerful. I would guess the ratio of xtile to fastxtile users is something like 10,000:1. Perhaps StataCorp would deem it worthwhile to spend a couple of employee hours on this. It would likely save their user base many hundreds of hours in the aggregate (although most users might save only a few seconds or minutes of their time).
Nick, to be honest I don't know the appropriate way to get this suggestion in front of the right person at StataCorp. The last time I made the suggestion, it was at an annual Stata Conference. Do you have advice about how I should suggest the replacement of xtile with fastxtile to StataCorp? I wouldn't be offended if they decide not to follow it.

I agree with Michael. While user-added capabilities are a great feature of Stata, I would have loved to see Stata adopt some of the most used and best-known user packages into vanilla Stata. As was stated in this thread, it's sometimes very hard to know what you don't know, so that's a big limitation on whether or not you install user packages. The fact that the Stata FAQ itself mentions some user packages is a sort of admission of that. I'm talking, for example, about spmap, which I originally came upon by chance, as I thought that Stata had no geo-mapping capabilities at all.
When programming code is well written and publicly available, I would have liked to see StataCorp utilize that "openness" for the benefit of all users.

Dear Nick, thanks for your time. Several of your comments need my reply. First of all, I think you have read my first post, where I posted version 1 of astile. The post is located here: http://www.statalist.org/forums/foru...tile-vs-astile. The so-called 'naive' approach that you have posted here for making quantile groups was actually used in that first version. But then you commented and used "MUST" and "SHOULD":
That aside, any serious program in this territory, in Stata or in any other language,
1. must handle ties intelligently
2. must handle missing data intelligently
3. should support groupwise calculations
4. should support equal or unequal bins (e.g. some people might want to bin with boundaries at 5, 10, 25, 50, 75, 90, 95% points of a distribution and not just according to so many quantiles equally spaced in probability terms).
5. should lend itself to application to several variables
The use of "must" and "should" matches my suggestions on what is essential and desirable respectively.
The so-called naive approach is 100 times faster, but it does not handle ties and missing values. So I think it is not suitable for any serious analysis. See the following example:

Code:

clear
webuse grunfeld
sort mvalue
replace mvalue = . in 4
replace mvalue=. in 5
replace mvalue=. in 18
replace mvalue=. in 198
replace mvalue=. in 12
gen naive1 = ceil(10 * _n/_N)
xtile off10=mvalue, nq(10)
astile size10=mvalue, nq(10)
assert size10 ==off10
. assert naive1 ==off10
21 contradictions in 200 observations
assertion is false
r(9);

Nick further commented that

Attaullah: Let me rephrase one point more precisely: I don't know why you are hiding your code and even when the point is raised you don't explain. Your decision, to hide or not, to explain or not, and no obligation that you don't take on yourself, but code hidden cannot be discussed.
The role of the naive code is already explained: to act as a benchmark showing how fast the code could run in perfect conditions. You're right that missing values are ignored and so the code will not handle those correctly. I was just running the code on the test problem you set out yourself. If you post a test problem with missings, the naive code could be adapted and compared.
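One way the naive code might be adapted to missing values is sketched below, on simulated data rather than the grunfeld example. The names x, touse, naive10 and off10 are illustrative only. Rounding ranks among the nonmissing observations alone leaves missing values unbinned, which is also what xtile does. Agreement with xtile is still not guaranteed in general: it depends on there being no ties at bin boundaries, and here the number of nonmissing observations (900) is deliberately a multiple of the number of bins.

```stata
clear
set seed 2803
set obs 1000
gen double x = runiform()
replace x = . in 1/100            // 100 missing values, 900 nonmissing

* Rank only within the nonmissing observations
gen byte touse = !missing(x)
bysort touse (x): gen naive10 = ceil(10 * _n/_N) if touse

* Official xtile for comparison
xtile off10 = x, nq(10)

* Number of observations on which the two methods disagree
count if naive10 != off10
```

The same bysort line extends to groupwise binning by adding the group variable before touse, e.g. bysort year touse (x).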
Thanks for reminding me of the ideals I laid out. Points #4 and #5 cited in #8 are presumably now on the astile agenda.
A main theme of this thread is exactly what you have made it: claims about speed for your code. Just as with asrol, astile runs faster than what you compared it with, and well and good, but corrections and qualifications are needed. fastxtile is faster for the first problem, and astile is still very slow for your second problem and I don't know why.
As Michael Stepner commented, there is also an egen version of his command to run on the second problem.
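On point #4 of the list quoted above (unequal bins), official Stata can already do this in two steps, which gives a reference result for any new program to be checked against. A minimal sketch, using the auto data purely for illustration (the names cut and bin8 are mine): _pctile returns the requested percentiles in r(r1), r(r2), ..., and xtile accepts them through its cutpoints() option.

```stata
sysuse auto, clear

* Percentiles at unequally spaced probabilities
_pctile price, percentiles(5 10 25 50 75 90 95)

* Save the returned results before running other commands
forvalues i = 1/7 {
    scalar cut`i' = r(r`i')
}

* xtile's cutpoints() option expects the cutpoints in a variable
gen double cut = .
forvalues i = 1/7 {
    replace cut = scalar(cut`i') in `i'
}

* Seven cutpoints define eight bins
xtile bin8 = price, cutpoints(cut)
tabulate bin8
```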

Michael Stepner, you are right. I was unaware of fastxtile. Compared with the posted version of astile, fastxtile is faster by a few seconds without the by option, and with the by option the difference increases. I have been working on the next version of astile, which still needs certification. So far, results from that beta version show that it is even faster than fastxtile. Once the certification is ready, I shall post that version first on Statalist and then update the SSC version.

Hidden code: As I said, it's your decision and I won't press the point further.
What remains interesting to anyone concerned with these calculations is which programs are correct, most versatile and fast(est).
As I write, the last post in the thread on asrol is my post #16 at http://www.statalist.org/forums/foru...dvantage/page2 with a suggestion that the help wording remains misleading. That suggestion still holds.

I plan to change the help file in the next version, which brings significant changes in the way asrol understands different data structures, thereby overcoming its present limitation of not efficiently handling data with too many gaps in the range variable. But till then, you are right, the help file needs correction.

I am posting version 3.0 of astile here. My initial tests show that version 3.0 of the astile program is reliable and is the fastest among the available alternatives, such as xtile, fastxtile, and xtile from egenmore. I have posted the new version on my website, and it can be downloaded by typing the command below.
If proxy settings do not allow the download, it can be installed manually by copying the attached files astile.ado and astile.sthlp to the following folder:

Code:

C:/ado/plus/a

For some reason lasn.mlib cannot be attached, so here is the direct link: https://sites.google.com/site/imspes...edirects=0&d=1
Copy lasn.mlib to

Code:

C:/ado/plus/l

Since I have already shown speed tests for official xtile and xtile from egenmore, the following are the speed tests for fastxtile (Michael Stepner) and fastxtile from egenmisc (SSC).

Thanks for the extra news. Some small points of clarification from me:
1. The egen function fastxtile is not a wrapper for @Michael Stepner's code. Matthieu Gomez, its author, only used the same name. I think Michael was guessing from the name that it was a wrapper but looking at the code contradicts that. There is no Mata code there, for example.
2. The only advantage of any egen function to users lies in their support for a by: or by() calculation, as calling code from within egen can only slow it down.
Some small points of clarification now requested:
1. I don't see here timings for fastxtile and your new astile.
2. What's puzzling now is that your timings for the new astile show that it is slower than it was in post #1 (5.4 s rather than 3.6 s). Is this a different problem? (Looks the same to me.) Are you using a different machine?