That was 1.5 years ago so I think things have changed significantly since then!
In particular, I am super curious which tokenizer type to optimally pick for code (maybe char-based is a better option now, cc @patrickvonplaten)
Great to see so much interest! Let's officially define this project.
Added everyone who commented here to this sheet. Please leave a comment here or in the sheet if you want to change something.
There are already 10 members here; if more people join, we will need to split the team so that it'll be easier to manage. (cc @patrickvonplaten)
Omg I am sooooo happy to see so much excitement for this project. We are gonna kill this, y'all.
I agree with Julien as well: tokenization will be important. Character or even byte level may be the way to go, but I worry we will run into memory issues if we have the model predicting large amounts of code, similar to Copilot. My research group tried regular old BPE but added in the language's special keywords, to try to keep the BPE model from ending up with too many superfluous tokens, but it's hard to say if that is optimal.
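To make the keyword idea a bit more concrete, here is a minimal sketch of what that could look like with the Hugging Face tokenizers library (just an illustration for Python keywords, not my group's actual setup; the corpus file name is made up):

```python
# Minimal sketch: train a byte-level BPE tokenizer, but reserve the language's
# keywords as dedicated tokens so BPE doesn't split them into sub-tokens.
import keyword
import os

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["train_code.txt"],                            # hypothetical code corpus
    vocab_size=32_000,
    min_frequency=2,
    special_tokens=["<|endoftext|>"] + keyword.kwlist,   # def, return, for, ...
)

os.makedirs("code-bpe", exist_ok=True)
tokenizer.save_model("code-bpe")  # writes vocab.json and merges.txt
```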
I love the idea of fine-tuning the model and using Stack Exchange, especially since a big part of Copilot is how you can prompt it with comments to generate your code. So, having all sorts of data that mixes natural language and code would be best. We will also need to define some cleaning criteria; maybe we could run a static analyzer to check for certain known vulnerabilities or insecure patterns. GitHub has its code scanning tool that does this, and I know a few research tools as well that we could look at.
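As one concrete example of a cleaning criterion (just a sketch, not something we've agreed on; the thresholds are arbitrary), we could at least drop snippets that don't even parse:

```python
# Sketch of a basic cleaning filter for Python snippets: keep only code that
# parses, and drop near-empty or extremely long files.
import ast

def keep_snippet(code: str, max_chars: int = 100_000) -> bool:
    if not code.strip() or len(code) > max_chars:
        return False
    try:
        ast.parse(code)          # syntactically valid Python only
    except SyntaxError:
        return False
    return True

print(keep_snippet("def f(x): return x + 1"))   # True
print(keep_snippet("def f(x: return x + 1"))    # False
```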
There are also a few people who were interested on Twitter but haven't commented here. I'll message them and ask them to also post here.
Wrt tokenization of code, it may be useful to refer to Section 4.3 of the TransCoder paper, which is on unsupervised translation of programming languages. In that work, javalang was used for Java, clang for C++, and the tokenize library for Python. The figure below shows how robust tokenize is to two versions of the same function:
[figure: tokenize output for two differently formatted versions of the same Python function]
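To illustrate that robustness with a tiny toy example of my own (not the one in the figure), Python's built-in tokenize module produces the same token stream for two differently formatted versions of a function:

```python
# Toy demo: tokenize is insensitive to formatting differences.
import io
import tokenize

def token_strings(source: str):
    layout = (tokenize.NL, tokenize.NEWLINE, tokenize.INDENT,
              tokenize.DEDENT, tokenize.ENDMARKER)
    toks = tokenize.generate_tokens(io.StringIO(source).readline)
    return [t.string for t in toks if t.type not in layout]

v1 = "def add(a, b):\n    return a + b\n"
v2 = "def add(a,b):\n        return a+b\n"

print(token_strings(v1) == token_strings(v2))  # True
```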
Then the BPE codes are learnt on the tokenized code files using fastBPE.
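As a rough Python stand-in for that two-step pipeline (fastBPE itself is a command-line tool, so I'm substituting the Hugging Face tokenizers BPE trainer here; the file names are made up):

```python
# Sketch: lex Python source with the tokenize library, dump the token stream
# to a text file, then learn BPE codes on the pre-tokenized file.
import io
import tokenize

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def pre_tokenize_python(source: str) -> str:
    """Turn Python source into a space-separated stream of lexer tokens."""
    toks = tokenize.generate_tokens(io.StringIO(source).readline)
    return " ".join(t.string for t in toks if t.string.strip())

with open("sample.py") as f:                       # hypothetical input file
    with open("pretokenized.txt", "w") as out:
        out.write(pre_tokenize_python(f.read()) + "\n")

bpe = Tokenizer(models.BPE(unk_token="<unk>"))
bpe.pre_tokenizer = pre_tokenizers.WhitespaceSplit()  # keep lexer tokens intact
trainer = trainers.BpeTrainer(vocab_size=32_000, special_tokens=["<unk>"])
bpe.train(files=["pretokenized.txt"], trainer=trainer)
bpe.save("code-bpe.json")
```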
For data that includes both useful comments and code, we could look at code snippets at GeeksforGeeks and code samples such as those for TF and PyTorch available on the official websites.