predicate Range: {
module : Module,
span: src.ByteSpan,
predicate Name: string
predicate Declaration: {
name: Name,
loc: Range,
# connects a variable to its declaration within the same file
predicate LocalDeclarationReference: {
declaration: Declaration,
loc: Range,
Write the data to Glean
Now let’s ingest that into Glean to query. I’ll make a directory in $HOME/gleandbs
to store the Glean db images. We can install the standard schema, or just point Glean at the source. Now load all that JSON in. You can do this in parallel on multiple cores to speed things up — +RTS -N8 -A32m -RTS
if it is a very, very big DB, but this is fine to run single threaded.
Our first db will be called “react” with tag “main”, but in a production setting you would probably use the commit hash as the tag to identify the data set. Glean data is immutable once the DB is finished, so its fine to use the commit hash if it is also immutable.
$ mkdir ~/gleandbs
$ glean --db-root $HOME/gleandbs --schema $HOME/Glean/glean/schema/source/ create --repo react/main /tmp/flow-out/*.json
And the Glean RTS does some work:
$ glean shell --db-root $HOME/gleandbs --schema $HOME/Glean/glean/schema/source/
Glean Shell, built on 2022-01-08 07:22:56.472585205 UTC, from rev 9adc5e80b7f6f7fb9b556fbf3d7a8774fa77d254
type :help for help.
> :list
react/main (incomplete)
Created: 2022-01-12 02:01:04 UTC (3 minutes ago)
> :pager on
> :db react/main
> :stat
Use :stat to see a summary of the data we have stored. It’s already a basic form of code analysis:
flow.Declaration.3
count: 33527
size: 1111222 (1.06 MB) 7.1861%
flow.DeclarationInfo.3
count: 33610
size: 1177304 (1.12 MB) 7.6135%
flow.Documentation.3
count: 2215
size: 61615 (60.17 kB) 0.3985%
flow.Export.3
count: 1803
size: 53050 (51.81 kB) 0.3431%
flow.ImportDeclaration.3
count: 5504
size: 196000 (191.41 kB) 1.2675%
flow.LocalDeclarationReference.3
count: 86297
size: 2922952 (2.79 MB) 18.9024%
So just basic info but the React project has 33k unique declarations, 5,500 import declarations, 86k local variable uses, 3,600 type declarations, 1,117 modules, and 904 files. You can use these summaries over time to understand code complexity growth. Things may be missing here — its up to the indexer owner to run the indexer and capture all the things that need capturing. Glean is just reporting what was actually found.
The DB is in “incomplete” state, meaning we could write more data to it (e.g. if the indexer failed part way through we could restart it and resume safely, or we could shard analysis of very large projects). But before we “finish” the DB to freeze it, we need to derive additional predicates.
Note there are some limitations here: the Glean index need to know about the JavaScript and Flow modules system (in particular, names of modules to strings, and module string names to filepaths), so that imports like ‘react’ resolve to the correct module filepath.
import {useDebugValue, useEffect, useState} from 'react';
However, if we look closely at our default Flow index, the string to file facts are all empty. This will limit our ability to see through file names imported via string names (e.g. “React.js” gets imported as ‘react’).
"predicate":"flow.FileOfStringModuleExport.3","facts":[]}
which I think means I haven’t configured Flow correctly or set up the module maps properly (halp Flow folks?).
Derived predicates in Glean
A bit like stored procedures, we can write Angle predicates that are defined in terms of other, existing predicates. This is how we do abstraction. It’s morally equivalent to defining SQL tables on the fly in terms of other tables, with some different guarantees as Glean is more like datalog than a relational system. Derived predicates can be computed on the fly, or fully generated and stored. A very common use case is to compute inverse indices (e.g. find-references is the inverse of jump-to-definition). We can index all uses of definitions, then compute the inverse by deriving.
An example is the “FileXRef” predicate in Flow, which builds an index of File name facts to cross-references in those files. You would do this to quickly discover all outbound references from a file.
$ glean --db-root $HOME/gleandbs --schema $HOME/Glean/glean/schema/source/ derive --repo react/main flow.FileXRef
I0112 12:20:10.484172 241107 Open.hs:344] react/main: opening
I0112 12:20:10.526576 241107 rocksdb.cpp:605] loadOwnershipSets loaded 0 sets, 0 bytes
I0112 12:20:10.526618 241107 Open.hs:350] react/main: opened
I0112 12:20:10.749966 241107 Open.hs:352] react/main: schema has 799 predicates
flow.FileXRef : 0 facts
I0112 12:20:11.119634 241107 Stats.hs:223] mut_lat: 59ms [59ms] mut_thp: - [-] ded_thp: - [-] dup_thp: - [-] rnm_thp: - [-] cmt_thp: - [-] ibk_mis: - [-] tbi_mis: - [-] fbi_mis: - [-] lch_mem: 0B lch_cnt: 0
I0112 12:20:11.547000 241108 rocksdb.cpp:605] loadOwnershipSets loaded 0 sets, 0 bytes
flow.FileXRef : 112662 facts
We generated 112,662 facts about cross-references. Taking a peek at the DB now with :stat
flow.FileXRef.3
count: 112662
size: 4043028 (3.86 MB) 20.7266%
We’ve increased the DB size by 3.8M. We can derive the rest of the stored predicates now and finalize the DB. Note we have to derive in dependency order, as some stored predicates depend on the results of others. I just do this in two phases:
$ glean --db-root $HOME/gleandbs --schema $HOME/Glean/glean/schema/source/ derive --repo react/main flow.NameLowerCase flow.FileDeclaration flow.FileXRef flow.FlowEntityImportUses flow.FlowTypeEntityImportUses
I0112 12:43:12.098162 241911 Open.hs:344] react/main: opening
I0112 12:43:12.141024 241911 rocksdb.cpp:605] loadOwnershipSets loaded 0 sets, 0 bytes
I0112 12:43:12.141064 241911 Open.hs:350] react/main: opened
I0112 12:43:12.322456 241911 Open.hs:352] react/main: schema has 799 predicates
I0112 12:43:12.367130 242084 Stats.hs:223] mut_lat: 112us [112us] mut_thp: - [-] ded_thp: - [-] dup_thp: - [-] rnm_thp: - [-] cmt_thp: - [-] ibk_mis: - [-] tbi_mis: - [-] fbi_mis: - [-] lch_mem: 0B lch_cnt: 0
flow.FileDeclaration : 46594 facts
flow.FileXRef : 112662 facts
flow.FlowEntityImportUses : 3022 facts
flow.NameLowerCase : 9621 facts
flow.FlowTypeEntityImportUses : 692 facts
And freeze the data.
$ glean --db-root $HOME/gleandbs --schema $HOME/Glean/glean/schema/source finish --repo react/main
I0112 12:45:54.415550 242274 Open.hs:344] react/main: opening
I0112 12:45:54.451892 242274 Open.hs:350] react/main: opened
I0112 12:45:54.671070 242274 Open.hs:352] react/main: schema has 799 predicates
I0112 12:45:54.701830 242270 Work.hs:506] workFinished Work {work_repo = Repo {repo_name = "react", repo_hash = "main"}, work_task = "", work_parcelIndex = 0, work_parcelCount = 0, work_handle = "glean@9adc5e80b7f6f7fb9b556fbf3d7a8774fa77d254"}
I0112 12:45:54.707198 242274 Backup.hs:334] thinned schema for react/main contains src.1, flow.3
I0112 12:45:54.707224 242274 Open.hs:287] updating schema for: react/main
I0112 12:45:54.824131 242274 Open.hs:299] done updating schema for open DBs
I0112 12:45:54.824172 242274 Backup.hs:299] react/main: finalize: finished
react> :limit 0
react> :count flow.FileXRef { file = "react/packages/react/src/ReactHooks.js" }
134 results, 605 facts, 2.84ms, 310224 bytes, 1032 compiled bytes
To see say, only local references, and just the names of the definitions they point at:
react> N where flow.FileXRef { file = "react/packages/react/src/ReactHooks.js", ref = { localRef = { declaration = { name = N } } } }
{ "id": 14052, "key": "deps" }
{ "id": 4327, "key": "callback" }
{ "id": 13980, "key": "dispatcher" }
{ "id": 9459, "key": "create" }
{ "id": 5957, "key": "ReactCurrentDispatcher" }
{ "id": 1353, "key": "getServerSnapshot" }
{ "id": 1266, "key": "getSnapshot" }
{ "id": 1279, "key": "subscribe" }
{ "id": 14130, "key": "initialValue" }
{ "id": 3073, "key": "init" }
{ "id": 5465, "key": "source" }
...
Or we could query for types used in the file:
react> N where flow.FileXRef { file = "react/packages/react/src/ReactHooks.js", ref = { typeRef = { typeDeclaration = { name = N } } } }
{ "id": 1416, "key": "T" }
{ "id": 3728, "key": "A" }
{ "id": 14493, "key": "Dispatch" }
{ "id": 14498, "key": "S" }
{ "id": 14505, "key": "I" }
{ "id": 14522, "key": "BasicStateAction" }
{ "id": 3318, "key": "ReactContext" }
{ "id": 2363, "key": "Snapshot" }
{ "id": 14551, "key": "AbortSignal" }
{ "id": 6196, "key": "Source" }
{ "id": 8059, "key": "Dispatcher" }
{ "id": 7362, "key": "MutableSourceSubscribeFn" }
{ "id": 7357, "key": "MutableSourceGetSnapshotFn" }
{ "id": 7369, "key": "MutableSource" }
Ok this is starting to get useful.
We’re doing some basic code analysis on the fly in the shell. But I had to know / explore the flow schema to make these queries. That doesn’t really scale if we have a client that needs to look at multiple languages — we can’t reasonably expect the client to know how declarations and definitions etc are defined in every single language. Luckily, Glean defines abstractions for us in code.angle and codemarkup.angle to generically query for common code structures.
Querying generically
Entities are an Angle abstraction for “things that have definitions” in programming languages — like types, modules, classes etc. There are some common queries we need across any language:
- files to their cross-references , of any entity sort
- references to definitions
- definitions in this file
- entity to its definition location and file
For these common operations, a language-agnostic layer is defined in codemarkup.angle, taking care of all the subtleties resolving imports/headers/ .. for each language. E.g. for find-references, there’s a derived “EntityUses
” predicate for a bunch of languages here: https://github.com/facebookincubator/Glean/blob/main/glean/schema/source/codemarkup.angle#L259
We can use this to query Flow too E.g. how many known entities are defined or declared in ReactHooks.js? 99.
react> :count codemarkup.FileEntityLocations { file = "react/packages/react/src/ReactHooks.js" }
99 results, 354 facts, 13.15ms, 9297888 bytes, 54232 compiled bytes
And how many uses (xrefs) are in that file? 132.
:count codemarkup.FileEntityXRefLocations { file = "react/packages/react/src/ReactHooks.js" }
132 results, 329 facts, 40.44ms, 27210432 bytes, 160552 compiled bytes
Quick and dirty find-references for JavaScript
So we probably have enough now to do some basic semantic code search. i.e. not just textual search like grep, but semantically precise search as the compiler would see it. Let’s pick an entity and find its references. Since React is basically purely functional programming for UIs, let’s look for how often state is used — find-references to useState.
First, we get the entity. This tells us the definition site. The Glean key of the entity is $575875
. and its structure is as below. Note the compound query here (the semicolon), where I name the entity ‘E’, then filter on its body for only those ‘Es’ with the name “useState”
react> E where codemarkup.FileEntityLocations { file = "react/packages/react/src/ReactHooks.js", entity = E } ; { flow = { decl = { localDecl = { name = "useState" } } } } = E
"id": 575875,
"key": {
"flow": {
"decl": {
"localDecl": {
"id": 14269,
"key": {
"name": { "id": 1317, "key": "useState" },
"loc": {
"id": 14268,
"key": {
"module": {
"id": 12232,
"key": {
"file": { "id": 12231, "key": "react/packages/react/src/ReactHooks.js" }
"span": { "start": 2841, "length": 8 }
Now to direct references to this elsewhere in React, we add codemarkup.EntityUses { target = E, file = F }
to the query and return the files F:
react> F where codemarkup.FileEntityLocations { file = "react/packages/react/src/ReactHooks.js", entity = E } ; { flow = { decl = { localDecl = { name = "useState" } } } } = E ; codemarkup.EntityUses { target = E, file = F }
{ "id": 10971, "key": "react/packages/react/src/React.js" }
1 results, 1 facts, 9.19ms, 5460072 bytes, 8140 compiled bytes
So that finds the first-order direct reference to useState from ReactHooks.js to React.js.. To find the actual uses in the rest of the react package, we need a proper index for module names to strings, so that an import of ‘react’ can be resolved to ‘React.js’ and thus to the origin. Glean knows about this, but my indexer doesn’t have StringToModule facts — I need the flow indexer to generate these somehow.
For now, this is enough. We are alive.
In the next part I’ll look at writing a simple code search client to the Glean server running on the VM.
I want to develop and use Glean on ARM as I have a MacBook Air (road warrior mode) and I’m interested in making Glean more useful for local developer IDE backends. (c.f What is Glean?)
To build Glean just read the fine instructions and fix any compilation errors, right? Actually, we need a few patches to disable Intel-specific things, but otherwise the instructions are the same. It’s a fairly normal-ish Haskell set of projects with an FFI into some moderately bespoke C++ runtime relying on folly and a few other C++ libs.
Thankfully, all the non-portable parts of Glean are easily isolated to the rts/ownership parts of the Glean database runtime. In this case “ownership” is only used for incremental updates to the database and other semi-advanced things I don’t need right now.
The only real bits of non-portable code are:
- Flags to tell folly and thrift to use haswell or corei7 (we will ignore this on non-x86_64)
- An implementation of 256-bit bitsets (via AVX).
- Use of folly/Elias-Fano coding, for efficient compression of sorted integer list or sets as offsets (how we represent ownership of facts to things they depend on).
Why is this stuff in Glean? Well, Glean is a database for storing and querying very large scale code information, represented as 64 bit keys into “tables” (predicates) which represent facts. These facts relate to each other forming DAGs. Facts are named by 64 bit key. A Glean db is millions (or billions) of facts across hundreds of predicates. I.e. lots of 64 bit values.
So we’re in classic information retrieval territory – hence the focus on efficient bit and word encodings and operations. Generally, you flatten AST information (or other code facts) into tables, then write those tables into Glean. Glean then goes to a lot of work to store that efficiently. That’s how we get the sub-millisecond query times.
What is a “fact about code”? A single true statement about the code. E.g. for a method M in file F we might have quite a lot of information:
M is a method
M is located at file F
M is located at span 102-105
M has parent P
F is a file
M has type signature T
M is referred to by file/spans (G, 107-110) and (H, 23-26)
Real code bases have millions of such facts, all relating things in the code to each other – types to methods, methods to container modules, declarations to uses, definitions to declarations etc. We want that to be efficient, hence all the bit fiddling.
So let’s try to build this on non-x86 and see what breaks.
Building Glean from scratch
The normal way to build Glean is from source. There are two repos:
I’ve put some PRs up for the non-x86_64 builds, so if you’re building for ARM or something else, you’ll need these from here:
git clone
https://github.com/donsbot/Glean.git
cd Glean
git clone
https://github.com/donsbot/hsthrift.git
Worth doing a cabal update as well, just in case you never built Haskell stuff before.
Now we can build the dependent libraries and the thrift compiler (n.b. we need some stuff installed in /usr/local (needs sudo).
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
So let’s build thrift and folly:
cd hsthrift
./install-deps –sudo
I’m doing this on the aarch64 Debian 11 image running in UTM on a Macbook Air.
Now… first time build I seem to reliably get a gcc segfault on both x86 and aarch64, which I will conveniently sidestep by running it again. This seems mildly concerning as open source thrift might be miscompiled with gcc. I should likely be using clang here.
[ 57%] Building CXX object thrift/lib/cpp2/CMakeFiles/thriftprotocol.dir/protocol/TableBasedSerializer.cpp.o
In file included from :
/usr/include/stdc-predef.h: In substitution of ‘template constexpr T apache::thrift::detail::identity(T) [with T = ]’:
/home/dons/Glean/hsthrift/fbthrift/thrift/lib/cpp2/protocol/TableBasedSerializer.cpp:37:1: required from here
/usr/include/stdc-predef.h:32:92: internal compiler error: Segmentation fault
32 | whether the overall intent is to support these features; otherwise,
| ^
Please submit a full bug report,
with preprocessed source if appropriate.
See for instructions.
make[2]: *** [thrift/lib/cpp2/CMakeFiles/thriftprotocol.dir/build.make:173: thrift/lib/cpp2/CMakeFiles/thriftprotocol.dir/protocol/TableBasedSerializer.cpp.o] Error 1
Vanilla Debian gcc.
$ gcc --version
gcc (Debian 10.2.1-6) 10.2.1 20210110
That really looks like a gcc bug and probably other things lurking there. Urk. Re-run the command and it seems to make progress. Hmm. Compilers in memory-unsafe languages eh? Moving along quickly…
Build the Glean rts and Haskell bits
Once hsthrift is built installed, we have all the various C++ deps (folly, xxhash etc). So we can try building Glean itself now. Glean is a mixture of Haskell tools over a C++ runtime. There’s a ton of schemas, bytecode generators, thrift mungers. Glean is sort of an ecosystem of indexers (analyzing code and spitting out facts as logs), a database runtime coupled to a Thrift server (“Glean” itself) and tooling for building up distributed systems around this (for restoring/ migrating/ monitoring / administering clusters of Glean services).
Building Glean .. if you get an error about missing HUnit, that means we haven’t synced the cabal package list. I got this on the first go with a blank Debian iso as the initial cabal package list is a basic one.
Resolving dependencies…
cabal: Could not resolve dependencies:
[__0] trying: fb-stubs-0.1.0.0 (user goal)
[__1] unknown package: HUnit (dependency of fb-stubs)
[__1] fail (backjumping, conflict set: HUnit, fb-stubs)
That’s fixable with a cabal update
.
If you’re not using my branch, and building on non-x86 you’ll fail at the first AVX header.
Preprocessing library 'rts' for glean-0.1.0.0..
Building library 'rts' for glean-0.1.0.0..
In file included from ./glean/rts/ownership/uset.h:11,
from ./glean/rts/ownership.h:12, from glean/rts/ffi.cpp:18:0: error:
glean/rts/ownership/setu32.h:11:10: error:
fatal error: immintrin.h: No such file or directory
11 | #include
int _mm256_testc_si256(__m256i __M, __m256i __V);
int _mm256_testz_si256(__m256i __M, __m256i __V);
__m256i _mm256_setzero_si256();
__m256i _mm256_set1_epi32(int __A);
__m256i _mm256_sllv_epi32(__m256i __X, __m256i __Y);__m256i _mm256_sub_epi32(__m256i __A, __m256i __B);
__m256i _mm256_set_epi32(int __A, int __B, int __C, int __D, int __E, int __F, int __G, int __H);
Ooh AV512
long long _mm_popcnt_u64(unsigned long long __X);
unsigned long long _lzcnt_u64(unsigned long long __X);
__m256i _mm256_or_si256(__m256i __A, __m256i __B);__m256i _mm256_and_si256(__m256i __A, __m256i __B);
__m256i _mm256_xor_si256(__m256i __A, __m256i __B);
Figuring out what these are all used for is interesting. We have 256-bit bitsets everywhere, and e.g. 4 64 bit popcnts to count things (fact counts?).
size_t count() const {
const uint64_t* p = reinterpret_cast(&value);
// _mm256_popcnt instructions require AVX512
return
_mm_popcnt_u64(p[0]) +
_mm_popcnt_u64(p[1]) +
_mm_popcnt_u64(p[2]) +
_mm_popcnt_u64(p[3]);
}
Anyway, its relatively trivial to stub these out, match the types and we have a mocked AVX layer. Left to the reader to write a portable shim for 256 bitsets that does these things on vectors of words.
Elias Fano
So the other bit is a little more hairy. Glean uses Elias Fano to compress all these sets of 64 bit keys we have floating around. Tons of sets indicating facts are owned or related to other facts. The folly implementation of Elias Fano is x86_64 only, so just falls over on aarch64:
/usr/local/include/folly/experimental/EliasFanoCoding.h:43:2: error:
error: #error EliasFanoCoding.h requires x86_64
43 | #error EliasFanoCoding.h requires x86_64
| ^~~~~
|
43 | #error EliasFanoCoding.h requires x86_64
So hmm. Reimplement? No its Saturday so I’m going to sub this out as well, just enough to get it to compile. My guess is we don’t use many methods here, read/write/iterate and some constructors. So I copy just enough of the canonical implementation declarations and dummy bodies to get it all to go through. Hsthrift under aarch64 emulation on UMT on an arm64 M1 takes about 10 mins to build with no custom flags.
Build and test
So we should be good to go now. Compile the big thing: Glean. Some of these generated bits of schema are pretty big too.
Glean storage is described via “schemas” for languages. Schemas represent what predicates (tables and their types) we want to capture. Uniquely, Glean’s Angle language is rich enough to support abstracting over types and predicates, building up layers of API that let you hide language complexity. You can paper over differences between languages while also providing precise language-specific captur
To see an example, look at the mulit-language find-references layer in codemarkup.angle:
The joy of this is that a client only has to know to query codemarkup:find-references and the right query will be issued for the right language. Client doesn’t have to know language-specific stuff, its all hidden in the database engine.
But .. that does end up meaning we generate quite a lot of code. With some trial and error I needed something under 16G to compile the “codemarkup” abstraction layer (this is a language-angostic navigation layer over the Glean schemas).
make
make test
That should pass and we are in business. We can run the little hello world demo.
$ uname -msr
Linux 5.10.0-10-arm64 aarch64
$ glean shell --db-root /tmp/glean/db/ --schema /tmp/glean/schema/
Glean Shell, built on 2022-01-08 07:22:56.472585205 UTC, from rev 9adc5e80b7f6f7fb9b556fbf3d7a8774fa77d254
type :help for help.
Check our little db created from the walkthrough:
:list
facts/0 (complete)
Created: 2022-01-08 10:59:17 UTC (1 day, 18 hours ago)
Completed: 2022-01-08 10:59:18 UTC (1 day, 18 hours ago)
What predicates does it have?
facts> :schema
predicate example.Member.1 :
{ method : { name : string, doc : maybe string } | variable : { name : string } | }
predicate example.FileClasses.1 : { file : string, classes : [example.Class.1]
predicate example.Reference.1 :
{ file : string, line : nat, column : nat }
-> example.Class.1
predicate example.Class.1 : { name : string, line : nat }
predicate example.Parent.1 : { child : example.Class.1, parent : example.Class.1 }
predicate example.Has.1 :
{ class_ : example.Class.1, has : example.Member.1, access : enum { Public | Private | } }
predicate example.Child.1 : { parent : example.Class.1, child : example.Class.1 }
Try a query or two: e.g. “How many classes do we have?”
facts> example.Class _
{ "id": 1026, "key": { "name": "Fish", "line": 30 } }
{ "id": 1027, "key": { "name": "Goldfish", "line": 40 } }
{ "id": 1025, "key": { "name": "Lizard", "line": 20 } }
{ "id": 1024, "key": { "name": "Pet", "line": 10 } }
What is the parent of the Fish class?
facts> example.Parent { child = { name = "Fish" } }
{
"id": 1029,
"key": {
"child": { "id": 1026, "key": { "name": "Fish", "line": 30 } },
"parent": { "id": 1024, "key": { "name": "Pet", "line": 10 } }
1 results, 3 facts, 5.59ms, 172320 bytes, 1014 compiled bytes
Ok we have a working ARM64 port of Glean. In the next post I’ll look at indexing some real code and serving up queries.