Indexing is the process of analysing source code and logging the things we find. Indexers look a lot like parsers or compiler front-ends: they take source code and build up semantic information, then record it to files for ingestion into Glean, or write directly to the database.

Almost all linters, static analysis tools, compilers, transpilers and formatters parse the code, analyse it and write out transformed results, usually throwing away what they learned in the process. What Glean lets us do is efficiently persist the semantic information the compiler discovered, in a way that can be very efficiently queried. We turn compilers into data sources for a distributed cache.

This splits up the usual compiler frontend <-> backend coupling, with Glean as a cache between the two phases. This lets us scale: we can index a repo frequently and share the results with thousands of engineers, or support millions of concurrent queries to a Glean database, with very low latency. It’s like the compiler AST and type environment are now in memcache, and our IDEs and code analysis tools can hit the cache instead of painfully reconstructing the frontend themselves.

What makes a good index?

What we index is described by a Glean schema written in Angle. The schema describes the types, predicates (tables) and references between facts, as well as how they are represented. It’s broadly “pay as you go” — you don’t need to capture everything about the language, just what you need for specific tasks. There are schemas for most common languages, as well as many mini-languages (like GraphQL, Thrift, Buck). (But n.b. there aren’t _indexers_ for most languages, just schemas. The indexers are quite a bit more work, as they usually hook into custom compiler toolchains.)

Common examples of things we would capture are:

  • file names
  • declarations, definition locations
  • uses of definitions (“xrefs”)
  • language elements: module names, type names, functions, methods, classes, ..

That’s usually enough to get a working navigation service up (e.g. jump-to-definition or find-references). For more sophisticated static analysis you will need to capture more of the compiler environment. It’s a good idea to float out strings and values that are repeated a lot into their own predicates, so as to maximise sharing, and to have a sense of the queries you’ll need to write when constructing the index schema.
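In Angle, that sharing is just a matter of giving the repeated value its own predicate. A sketch (mirroring what the Flow schema does further below):

# instead of inlining a string into every declaration fact:
#   predicate Declaration : { name : string, span : src.ByteSpan }
# float the name out into its own predicate, so each distinct string
# is stored once as a fact and referenced by fact id everywhere else:
predicate Name : string

predicate Declaration :
  {
    name : Name,
    span : src.ByteSpan,
  }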

Once you have a schema, you can store data in that format. Indexers are standalone programs, often with no dependency on Glean itself, that parse code (and add other type information, resolve names, resolve packages), before writing out lines and lines of JSON in the schema format you specified (or writing directly to the Glean db over binary thrift).

Ok, let’s index some JavaScript

Let’s see if we can index the React codebase. React is written in JavaScript, and uses the Flow type system. Flow knows about Glean, and can be run directly as an indexer. My aim here is to use the aarch64 VM as the glean server, but the indexer can run anywhere we want — on any box. We just have to write the data into Glean on the VM. Let’s have a go at installing Flow on aarch64/Debian though, for fun, as arm64/Linux is supported by Flow.

We can build from source (needs opam/OCaml) or install a pre-built binary from https://github.com/facebook/flow/releases. I installed the binary into my aarch64 VM and we are in business:

$ flow/flow --version
Flow, a static type checker for JavaScript, version 0.169.0

We’ll need the React source, so get that. This will be our source code to index:

git clone https://github.com/facebook/react.git

Initialize Flow and process the source using flow glean:

$ flow glean packages --output-dir=/tmp/flow-out/ --write-root="react"
>>> Launching report...
Wrote facts about 787 JavaScript files.

And we are in business! What did we index? Have a look in /tmp/flow-out at the raw JSON. The index is in textual JSON format: just arrays of predicate and fact pairs. Each predicate has a set of facts associated with it (and facts are unique in Glean; any duplicates will be de-duped when ingested).

Flow indexer output

The whole React index is about 58M of JSON, while the raw source code was 17M. We have a few predicates defined with facts:

$ sed 's/"predicate":"\([^"]*\)",/\n\1\n/g' * | grep '^flow.' | sort | uniq
flow.DeclarationInfo.3
flow.FileOfStringModule.3
flow.ImportDeclaration.3
flow.LocalDeclarationReference.3
flow.MemberDeclarationInfo.3
flow.MemberDeclarationReference.3
flow.SourceOfExport.3
flow.SourceOfTypeExport.3
flow.TypeDeclarationInfo.3
flow.TypeDeclarationReference.3
flow.TypeImportDeclaration.3

The definitions of these predicates are in flow.angle, which defines all the possible predicates we might have facts for in Flow/JavaScript. Looking at the first entry in the first file:

[{"predicate":"flow.LocalDeclarationReference.3"
 ,"facts":[{"key":{"declaration":{"key":{"name":{"key":"ownerDocument"},"loc":{"key":{"module":{"key":{"file":{"key":"react/packages/react-devtools-shared/src/devtools/views/SearchInput.js"}}},"span":{"start":1849,"length":13}}}}},"loc":{"key":{"module":{"key":{"file":{"key":"react/packages/react-devtools-shared/src/devtools/views/SearchInput.js"}}},"span":{"start":1901,"length":13}}}}}
,{"key":{"declaration":{"key":{"name":{"key":"ownerDocument"},"loc":{"key":{"module":{"key":{"file":{"key":"react/packages/react-devtools-shared/src/devtools/views/SearchInput.js"}}},"span":{"start":1849,"length":13}}}}},"loc":{"key":{"module":{"key":{"file":{"key":"react/packages/react-devtools-sha..

We can parse this fact as:

There is a “LocalDeclarationReference” of “ownerDocument” in react/packages/react-devtools-shared/src/devtools/views/SearchInput.js at byte span starting at 1849, length 13, which has a reference at offset 1901.

Seems plausible. The structure of the schema tells us the shape of the JSON we need to generate. E.g. LocalDeclarationReferences are byte spans associated with the use of a Declaration. Represented in Angle as:

predicate Range: {
  module : Module,
  span: src.ByteSpan,
}

predicate Name: string

predicate Declaration: {
  name: Name,
  loc: Range,
}

# connects a variable to its declaration within the same file
predicate LocalDeclarationReference: {
  declaration: Declaration,
  loc: Range,
}
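The schema also gives us the query shape for free. Once these facts are loaded (next section), the ownerDocument fact above can be matched in the shell with a nested pattern along the lines of:

flow.LocalDeclarationReference { declaration = { name = "ownerDocument" } }

Patterns match on fact keys, and predicate-typed fields like declaration are matched through to their keys, which is exactly the style of query used later in this post.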

Write the data to Glean

Now let’s ingest that into Glean to query. I’ll make a directory in $HOME/gleandbs to store the Glean db images. We can install the standard schema, or just point Glean at the schema source. Now load all that JSON in. You can do this in parallel on multiple cores to speed things up (pass +RTS -N8 -A32m -RTS if it is a very, very big DB), but this is fine to run single threaded.

Our first db will be called “react” with tag “main”, but in a production setting you would probably use the commit hash as the tag to identify the data set. Glean data is immutable once the DB is finished, so it’s fine to use the commit hash if it is also immutable.

$ mkdir ~/gleandbs
$ glean --db-root $HOME/gleandbs --schema $HOME/Glean/glean/schema/source/ create --repo react/main /tmp/flow-out/*.json

And the Glean RTS does some work:

$ glean shell --db-root $HOME/gleandbs --schema $HOME/Glean/glean/schema/source/
Glean Shell, built on 2022-01-08 07:22:56.472585205 UTC, from rev 9adc5e80b7f6f7fb9b556fbf3d7a8774fa77d254
type :help for help.
> :list
react/main (incomplete)
  Created: 2022-01-12 02:01:04 UTC (3 minutes ago)
> :pager on
> :db react/main
> :stat

Use :stat to see a summary of the data we have stored. It’s already a basic form of code analysis:

flow.Declaration.3
  count: 33527
  size:  1111222 (1.06 MB) 7.1861%
flow.DeclarationInfo.3
  count: 33610
  size:  1177304 (1.12 MB) 7.6135%
flow.Documentation.3
  count: 2215
  size:  61615 (60.17 kB) 0.3985%
flow.Export.3
  count: 1803
  size:  53050 (51.81 kB) 0.3431%
flow.ImportDeclaration.3
  count: 5504
  size:  196000 (191.41 kB) 1.2675%
flow.LocalDeclarationReference.3
  count: 86297
  size:  2922952 (2.79 MB) 18.9024%

So, just from basic info: the React project has 33k unique declarations, 5,500 import declarations, 86k local variable uses, 3,600 type declarations, 1,117 modules, and 904 files. You can use these summaries over time to understand code complexity growth. Things may be missing here — it’s up to the indexer owner to run the indexer and capture all the things that need capturing. Glean just reports what was actually found.

The DB is in “incomplete” state, meaning we could write more data to it (e.g. if the indexer failed part way through we could restart it and resume safely, or we could shard analysis of very large projects). But before we “finish” the DB to freeze it, we need to derive additional predicates.

Note there are some limitations here: the Glean index needs to know about the JavaScript and Flow module systems (in particular, mapping module names to strings, and module string names to filepaths), so that imports like ‘react’ resolve to the correct module filepath.

import {useDebugValue, useEffect, useState} from 'react';

However, if we look closely at our default Flow index, the string-to-file facts are all empty. This will limit our ability to see through file names imported via string names (e.g. “React.js” gets imported as ‘react’).

"predicate":"flow.FileOfStringModuleExport.3","facts":[]}

which I think means I haven’t configured Flow correctly or set up the module maps properly (halp Flow folks?).

Derived predicates in Glean

A bit like stored procedures, we can write Angle predicates that are defined in terms of other, existing predicates. This is how we do abstraction. It’s morally equivalent to defining SQL tables on the fly in terms of other tables, with some different guarantees as Glean is more like datalog than a relational system. Derived predicates can be computed on the fly, or fully generated and stored. A very common use case is to compute inverse indices (e.g. find-references is the inverse of jump-to-definition). We can index all uses of definitions, then compute the inverse by deriving.

An example is the “FileXRef” predicate in Flow, which builds an index of File name facts to cross-references in those files. You would do this to quickly discover all outbound references from a file.
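As a sketch of what such a derivation looks like (written against the simplified predicates shown earlier, not the real flow.angle definition of FileXRef), a stored file-to-references index could be:

# sketch only: a stored inverse index from files to local references;
# the real flow.FileXRef is richer than this
predicate FileLocalReference :
  {
    file : src.File,
    ref : LocalDeclarationReference,
  }
  stored
  { File, Ref } where
    Ref = LocalDeclarationReference { loc = { module = { file = File } } }

Deriving enumerates every LocalDeclarationReference once and stores the results keyed by file, so per-file lookups no longer have to scan all references.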

$ glean --db-root $HOME/gleandbs --schema $HOME/Glean/glean/schema/source/ derive --repo react/main flow.FileXRef
I0112 12:20:10.484172 241107 Open.hs:344] react/main: opening
I0112 12:20:10.526576 241107 rocksdb.cpp:605] loadOwnershipSets loaded 0 sets, 0 bytes
I0112 12:20:10.526618 241107 Open.hs:350] react/main: opened
I0112 12:20:10.749966 241107 Open.hs:352] react/main: schema has 799 predicates
flow.FileXRef : 0 facts
I0112 12:20:11.119634 241107 Stats.hs:223] mut_lat: 59ms [59ms] mut_thp: - [-] ded_thp: - [-] dup_thp: - [-] rnm_thp: - [-] cmt_thp: - [-] ibk_mis: - [-] tbi_mis: - [-] fbi_mis: - [-] lch_mem: 0B lch_cnt: 0
I0112 12:20:11.547000 241108 rocksdb.cpp:605] loadOwnershipSets loaded 0 sets, 0 bytes
flow.FileXRef : 112662 facts

We generated 112,662 facts about cross-references. Taking a peek at the DB now with :stat:

flow.FileXRef.3
  count: 112662
  size:  4043028 (3.86 MB) 20.7266%

We’ve increased the DB size by 3.8M. We can derive the rest of the stored predicates now and finalize the DB. Note we have to derive in dependency order, as some stored predicates depend on the results of others. I just do this in two phases:

$ glean --db-root $HOME/gleandbs --schema $HOME/Glean/glean/schema/source/ derive --repo react/main flow.NameLowerCase flow.FileDeclaration flow.FileXRef flow.FlowEntityImportUses flow.FlowTypeEntityImportUses
I0112 12:43:12.098162 241911 Open.hs:344] react/main: opening
I0112 12:43:12.141024 241911 rocksdb.cpp:605] loadOwnershipSets loaded 0 sets, 0 bytes
I0112 12:43:12.141064 241911 Open.hs:350] react/main: opened
I0112 12:43:12.322456 241911 Open.hs:352] react/main: schema has 799 predicates
I0112 12:43:12.367130 242084 Stats.hs:223] mut_lat: 112us [112us] mut_thp: - [-] ded_thp: - [-] dup_thp: - [-] rnm_thp: - [-] cmt_thp: - [-] ibk_mis: - [-] tbi_mis: - [-] fbi_mis: - [-] lch_mem: 0B lch_cnt: 0
flow.FileDeclaration : 46594 facts
flow.FileXRef : 112662 facts
flow.FlowEntityImportUses : 3022 facts
flow.NameLowerCase : 9621 facts
flow.FlowTypeEntityImportUses : 692 facts

And freeze the data.

$ glean --db-root $HOME/gleandbs --schema $HOME/Glean/glean/schema/source finish --repo react/main
I0112 12:45:54.415550 242274 Open.hs:344] react/main: opening
I0112 12:45:54.451892 242274 Open.hs:350] react/main: opened
I0112 12:45:54.671070 242274 Open.hs:352] react/main: schema has 799 predicates
I0112 12:45:54.701830 242270 Work.hs:506] workFinished Work {work_repo = Repo {repo_name = "react", repo_hash = "main"}, work_task = "", work_parcelIndex = 0, work_parcelCount = 0, work_handle = "glean@9adc5e80b7f6f7fb9b556fbf3d7a8774fa77d254"}
I0112 12:45:54.707198 242274 Backup.hs:334] thinned schema for react/main contains src.1, flow.3
I0112 12:45:54.707224 242274 Open.hs:287] updating schema for: react/main
I0112 12:45:54.824131 242274 Open.hs:299] done updating schema for open DBs
I0112 12:45:54.824172 242274 Backup.hs:299] react/main: finalize: finished
react> :limit 0
react> :count flow.FileXRef { file = "react/packages/react/src/ReactHooks.js" }
134 results, 605 facts, 2.84ms, 310224 bytes, 1032 compiled bytes

To see, say, only local references, and just the names of the definitions they point at:

react> N where flow.FileXRef { file = "react/packages/react/src/ReactHooks.js", ref = { localRef = { declaration =  { name = N } } } }
{ "id": 14052, "key": "deps" }
{ "id": 4327, "key": "callback" }
{ "id": 13980, "key": "dispatcher" }
{ "id": 9459, "key": "create" }
{ "id": 5957, "key": "ReactCurrentDispatcher" }
{ "id": 1353, "key": "getServerSnapshot" }
{ "id": 1266, "key": "getSnapshot" }
{ "id": 1279, "key": "subscribe" }
{ "id": 14130, "key": "initialValue" }
{ "id": 3073, "key": "init" }
{ "id": 5465, "key": "source" }...

Or we could query for types used in the file:

react> N where flow.FileXRef { file = "react/packages/react/src/ReactHooks.js", ref = { typeRef =  { typeDeclaration = { name =  N }  } } }
{ "id": 1416, "key": "T" }
{ "id": 3728, "key": "A" }
{ "id": 14493, "key": "Dispatch" }
{ "id": 14498, "key": "S" }
{ "id": 14505, "key": "I" }
{ "id": 14522, "key": "BasicStateAction" }
{ "id": 3318, "key": "ReactContext" }
{ "id": 2363, "key": "Snapshot" }
{ "id": 14551, "key": "AbortSignal" }
{ "id": 6196, "key": "Source" }
{ "id": 8059, "key": "Dispatcher" }
{ "id": 7362, "key": "MutableSourceSubscribeFn" }
{ "id": 7357, "key": "MutableSourceGetSnapshotFn" }
{ "id": 7369, "key": "MutableSource" }

Ok this is starting to get useful.

We’re doing some basic code analysis on the fly in the shell. But I had to know/explore the Flow schema to make these queries. That doesn’t really scale if we have a client that needs to look at multiple languages — we can’t reasonably expect the client to know how declarations, definitions etc. are represented in every single language. Luckily, Glean defines abstractions for us in code.angle and codemarkup.angle to generically query for common code structures.

Querying generically

Entities are an Angle abstraction for “things that have definitions” in programming languages — like types, modules, classes etc. There are some common queries we need across any language:

  • files to their cross-references, of any entity sort
  • references to definitions
  • definitions in this file
  • entity to its definition location and file

For these common operations, a language-agnostic layer is defined in codemarkup.angle, taking care of all the subtleties of resolving imports/headers/... for each language. E.g. for find-references, there’s a derived “EntityUses” predicate for a bunch of languages here: https://github.com/facebookincubator/Glean/blob/main/glean/schema/source/codemarkup.angle#L259

We can use this to query Flow too. E.g. how many known entities are defined or declared in ReactHooks.js? 99.

react> :count codemarkup.FileEntityLocations { file = "react/packages/react/src/ReactHooks.js" }
99 results, 354 facts, 13.15ms, 9297888 bytes, 54232 compiled bytes

And how many uses (xrefs) are in that file? 132.

:count codemarkup.FileEntityXRefLocations { file = "react/packages/react/src/ReactHooks.js" }
132 results, 329 facts, 40.44ms, 27210432 bytes, 160552 compiled bytes

Quick and dirty find-references for JavaScript

So we probably have enough now to do some basic semantic code search. i.e. not just textual search like grep, but semantically precise search as the compiler would see it. Let’s pick an entity and find its references. Since React is basically purely functional programming for UIs, let’s look for how often state is used — find-references to useState.

First, we get the entity. This tells us the definition site. The Glean key of the entity is $575875, and its structure is as below. Note the compound query here (the semicolon), where I name the entity ‘E’, then filter on its body for only those ‘E’s with the name “useState”.

react> E where codemarkup.FileEntityLocations { file = "react/packages/react/src/ReactHooks.js", entity = E } ; { flow = { decl = { localDecl = { name = "useState" } } } } = E
{
  "id": 575875,
  "key": {
    "flow": {
      "decl": {
        "localDecl": {
          "id": 14269,
          "key": {
            "name": { "id": 1317, "key": "useState" },
            "loc": {
              "id": 14268,
              "key": {
                "module": {
                  "id": 12232,
                  "key": {
                    "file": { "id": 12231, "key": "react/packages/react/src/ReactHooks.js" }
                  }
                },
                "span": { "start": 2841, "length": 8 }
              }
            }
          }
        }
      }
    }
  }
}

Now, to find direct references to this elsewhere in React, we add codemarkup.EntityUses { target = E, file = F } to the query and return the files F:

react> F where codemarkup.FileEntityLocations { file = "react/packages/react/src/ReactHooks.js", entity = E } ; { flow = { decl = { localDecl = { name = "useState" } } } } = E ; codemarkup.EntityUses { target = E, file = F }
{ "id": 10971, "key": "react/packages/react/src/React.js" }

1 results, 1 facts, 9.19ms, 5460072 bytes, 8140 compiled bytes

So that finds the first-order direct reference to useState: from ReactHooks.js to React.js. To find the actual uses in the rest of the react package, we need a proper index of module names to strings, so that an import of ‘react’ can be resolved to ‘React.js’ and thus to the origin. Glean knows about this, but my indexer doesn’t have StringToModule facts — I need the Flow indexer to generate these somehow.

For now, this is enough. We are alive.

In the next part I’ll look at writing a simple code search client to the Glean server running on the VM.

I want to develop and use Glean on ARM as I have a MacBook Air (road warrior mode) and I’m interested in making Glean more useful for local developer IDE backends (cf. What is Glean?).

To build Glean just read the fine instructions and fix any compilation errors, right? Actually, we need a few patches to disable Intel-specific things, but otherwise the instructions are the same. It’s a fairly normal-ish Haskell set of projects with an FFI into some moderately bespoke C++ runtime relying on folly and a few other C++ libs.

Thankfully, all the non-portable parts of Glean are easily isolated to the rts/ownership parts of the Glean database runtime. In this case “ownership” is only used for incremental updates to the database and other semi-advanced things I don’t need right now.

The only real bits of non-portable code are:

  • Flags to tell folly and thrift to use haswell or corei7 (we will ignore this on non-x86_64)
  • An implementation of 256-bit bitsets (via AVX).
  • Use of folly’s Elias-Fano coding, for efficient compression of sorted integer lists or sets as offsets (how we represent the ownership of facts by the things they depend on).

Why is this stuff in Glean? Well, Glean is a database for storing and querying very large scale code information, represented as 64 bit keys into “tables” (predicates) which represent facts. These facts refer to each other, forming DAGs, and each fact is named by its 64 bit key. A Glean db is millions (or billions) of facts across hundreds of predicates — i.e. lots of 64 bit values.

So we’re in classic information retrieval territory – hence the focus on efficient bit and word encodings and operations. Generally, you flatten AST information (or other code facts) into tables, then write those tables into Glean. Glean then goes to a lot of work to store that efficiently. That’s how we get the sub-millisecond query times.

What is a “fact about code”? A single true statement about the code. E.g. for a method M in file F we might have quite a lot of information:

M is a method
M is located at file F
M is located at span 102-105
M has parent P
F is a file
M has type signature T
M is referred to by file/spans (G, 107-110) and (H, 23-26)
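In Angle, a hypothetical mini-schema (not one of the shipped schemas) capturing those statements might look like:

predicate File : string

predicate Method :
  {
    name : string,
    file : File,
    span : { start : nat, length : nat },
    signature : string,
  }

# the inverse relation: one fact per use of a method
predicate MethodUse :
  {
    method : Method,
    file : File,
    span : { start : nat, length : nat },
  }

Each statement about M becomes one fact, and “M is referred to by (G, 107-110)” becomes a MethodUse fact, which is exactly the shape a find-references query consumes.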

Real code bases have millions of such facts, all relating things in the code to each other – types to methods, methods to container modules, declarations to uses, definitions to declarations etc. We want that to be efficient, hence all the bit fiddling.

So let’s try to build this on non-x86 and see what breaks.

Building Glean from scratch

The normal way to build Glean is from source. There are two repos: Glean itself and hsthrift.

I’ve put some PRs up for the non-x86_64 builds, so if you’re building for ARM or something else, you’ll need these from here:

git clone https://github.com/donsbot/Glean.git
cd Glean
git clone https://github.com/donsbot/hsthrift.git

Worth doing a cabal update as well, just in case you’ve never built Haskell stuff before.

Now we can build the dependent libraries and the thrift compiler (n.b. some things get installed into /usr/local, which needs sudo):

export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig

So let’s build thrift and folly:

 cd hsthrift
 ./install-deps --sudo

I’m doing this on the aarch64 Debian 11 image running in UTM on a Macbook Air.

Now… first time build I seem to reliably get a gcc segfault on both x86 and aarch64, which I will conveniently sidestep by running it again. This seems mildly concerning as open source thrift might be miscompiled with gcc. I should likely be using clang here.

[ 57%] Building CXX object thrift/lib/cpp2/CMakeFiles/thriftprotocol.dir/protocol/TableBasedSerializer.cpp.o
In file included from :
/usr/include/stdc-predef.h: In substitution of ‘template constexpr T apache::thrift::detail::identity(T) [with T = ]’:
/home/dons/Glean/hsthrift/fbthrift/thrift/lib/cpp2/protocol/TableBasedSerializer.cpp:37:1: required from here
/usr/include/stdc-predef.h:32:92: internal compiler error: Segmentation fault
32 | whether the overall intent is to support these features; otherwise,
| ^
Please submit a full bug report,
with preprocessed source if appropriate.
See for instructions.
make[2]: *** [thrift/lib/cpp2/CMakeFiles/thriftprotocol.dir/build.make:173: thrift/lib/cpp2/CMakeFiles/thriftprotocol.dir/protocol/TableBasedSerializer.cpp.o] Error 1

Vanilla Debian gcc.

$ gcc --version
gcc (Debian 10.2.1-6) 10.2.1 20210110

That really looks like a gcc bug and probably other things lurking there. Urk. Re-run the command and it seems to make progress. Hmm. Compilers in memory-unsafe languages eh? Moving along quickly…

Build the Glean rts and Haskell bits

Once hsthrift is built and installed, we have all the various C++ deps (folly, xxhash etc). So we can try building Glean itself now. Glean is a mixture of Haskell tools over a C++ runtime. There’s a ton of schemas, bytecode generators and thrift mungers. Glean is sort of an ecosystem: indexers (analyzing code and spitting out facts as logs), a database runtime coupled to a Thrift server (“Glean” itself) and tooling for building up distributed systems around this (for restoring/migrating/monitoring/administering clusters of Glean services).

Building Glean: if you get an error about missing HUnit, that means we haven’t synced the cabal package list. I got this on the first go with a blank Debian iso, as the initial cabal package list is a basic one.

Resolving dependencies…
cabal: Could not resolve dependencies:
[__0] trying: fb-stubs-0.1.0.0 (user goal)
[__1] unknown package: HUnit (dependency of fb-stubs)
[__1] fail (backjumping, conflict set: HUnit, fb-stubs)

That’s fixable with a cabal update.

If you’re not using my branch, and building on non-x86 you’ll fail at the first AVX header.

Preprocessing library 'rts' for glean-0.1.0.0..
Building library 'rts' for glean-0.1.0.0..
In file included from ./glean/rts/ownership/uset.h:11,
                 from ./glean/rts/ownership.h:12,
                 from glean/rts/ffi.cpp:18:
glean/rts/ownership/setu32.h:11:10: fatal error: immintrin.h: No such file or directory
   11 | #include <immintrin.h>

The missing intrinsics are all AVX/AVX2 bit operations on 256-bit values:

int _mm256_testc_si256(__m256i __M, __m256i __V);
int _mm256_testz_si256(__m256i __M, __m256i __V);
__m256i _mm256_setzero_si256();
__m256i _mm256_set1_epi32(int __A);
__m256i _mm256_sllv_epi32(__m256i __X, __m256i __Y);
__m256i _mm256_sub_epi32(__m256i __A, __m256i __B);
__m256i _mm256_set_epi32(int __A, int __B, int __C, int __D, int __E, int __F, int __G, int __H);

Ooh, AVX512:

long long _mm_popcnt_u64(unsigned long long __X);
unsigned long long _lzcnt_u64(unsigned long long __X);
__m256i _mm256_or_si256(__m256i __A, __m256i __B);
__m256i _mm256_and_si256(__m256i __A, __m256i __B);
__m256i _mm256_xor_si256(__m256i __A, __m256i __B);

Figuring out what these are all used for is interesting. We have 256-bit bitsets everywhere, and e.g. four 64-bit popcnts to count things (fact counts?).

size_t count() const {
  const uint64_t* p = reinterpret_cast<const uint64_t*>(&value);
  // _mm256_popcnt instructions require AVX512
  return
    _mm_popcnt_u64(p[0]) +
    _mm_popcnt_u64(p[1]) +
    _mm_popcnt_u64(p[2]) +
    _mm_popcnt_u64(p[3]);
}

Anyway, it’s relatively trivial to stub these out, match the types, and we have a mocked AVX layer. It’s left to the reader to write a portable shim for 256-bit bitsets that does these things on vectors of words.

Elias Fano

So the other bit is a little more hairy. Glean uses Elias-Fano coding to compress all these sets of 64 bit keys we have floating around: tons of sets indicating which facts are owned by or related to other facts. The folly implementation of Elias-Fano is x86_64 only, so it just falls over on aarch64:

/usr/local/include/folly/experimental/EliasFanoCoding.h:43:2: error:
#error EliasFanoCoding.h requires x86_64
   43 | #error EliasFanoCoding.h requires x86_64
      | ^~~~~

So hmm. Reimplement? No, it’s Saturday, so I’m going to stub this out as well, just enough to get it to compile. My guess is we don’t use many methods here: read/write/iterate and some constructors. So I copy just enough of the canonical implementation declarations and dummy bodies to get it all to go through. Hsthrift under aarch64 emulation in UTM on an arm64 M1 takes about 10 mins to build with no custom flags.

Build and test

So we should be good to go now. Compile the big thing: Glean. Some of these generated bits of schema are pretty big too.

Glean storage is described via “schemas” for languages. Schemas represent what predicates (tables and their types) we want to capture. Uniquely, Glean’s Angle language is rich enough to support abstracting over types and predicates, building up layers of API that let you hide language complexity. You can paper over differences between languages while also providing precise language-specific capture.

To see an example, look at the multi-language find-references layer in codemarkup.angle.

The joy of this is that a client only has to know to query codemarkup find-references, and the right query will be issued for the right language. The client doesn’t have to know language-specific stuff; it’s all hidden in the database engine.
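Schematically, the dispatch looks something like the following heavily simplified sketch. The field layout is approximate and the per-language helper predicate is invented for illustration; the real codemarkup.angle definitions are richer:

# modelled on codemarkup.EntityUses: one branch per language,
# unwrapping that language's case of the entity union and delegating
# to a language-specific uses predicate (helper name hypothetical)
predicate EntityUses :
  {
    target : code.Entity,
    file : src.File,
    span : src.ByteSpan,
  }
  { target = { flow = E }, file = F, span = S } where
    flow.SomeEntityUses { target = E, file = F, span = S }

A client issues one EntityUses query; the engine picks the branch that matches the entity’s language.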

But… that does end up meaning we generate quite a lot of code. With some trial and error I needed something under 16G of memory to compile the “codemarkup” abstraction layer (this is the language-agnostic navigation layer over the Glean schemas).

make
make test

That should pass and we are in business. We can run the little hello world demo.

$ uname -msr
Linux 5.10.0-10-arm64 aarch64

$ glean shell --db-root /tmp/glean/db/ --schema /tmp/glean/schema/
Glean Shell, built on 2022-01-08 07:22:56.472585205 UTC, from rev 9adc5e80b7f6f7fb9b556fbf3d7a8774fa77d254
type :help for help.

Check our little db created from the walkthrough:

:list
facts/0 (complete)
  Created: 2022-01-08 10:59:17 UTC (1 day, 18 hours ago)
  Completed: 2022-01-08 10:59:18 UTC (1 day, 18 hours ago)

What predicates does it have?

facts> :schema
predicate example.Member.1 :
{ method : { name : string, doc : maybe string } | variable : { name : string } | }

predicate example.FileClasses.1 : { file : string, classes : [example.Class.1] }

predicate example.Reference.1 :
{ file : string, line : nat, column : nat }
-> example.Class.1

predicate example.Class.1 : { name : string, line : nat }

predicate example.Parent.1 : { child : example.Class.1, parent : example.Class.1 }

predicate example.Has.1 :
{ class_ : example.Class.1, has : example.Member.1, access : enum { Public | Private | } }

predicate example.Child.1 : { parent : example.Class.1, child : example.Class.1 }

Try a query or two: e.g. “How many classes do we have?”

facts> example.Class _
{ "id": 1026, "key": { "name": "Fish", "line": 30 } }
{ "id": 1027, "key": { "name": "Goldfish", "line": 40 } }
{ "id": 1025, "key": { "name": "Lizard", "line": 20 } }
{ "id": 1024, "key": { "name": "Pet", "line": 10 } }

What is the parent of the Fish class?

facts> example.Parent { child = { name = "Fish" } }
{
  "id": 1029,
  "key": {
    "child": { "id": 1026, "key": { "name": "Fish", "line": 30 } },
    "parent": { "id": 1024, "key": { "name": "Pet", "line": 10 } }
  }
}

1 results, 3 facts, 5.59ms, 172320 bytes, 1014 compiled bytes

Ok we have a working ARM64 port of Glean. In the next post I’ll look at indexing some real code and serving up queries.