One finds oneself debugging one’s knowledge of the programming language instead of debugging the application itself. <div style=“text-align: right”>— The Zig Language Docs</div>

Hello, SEGFAULT!

Normally, one should not get a SIGSEGV, the glorious SEGFAULT signal, if they’re working locally with tree-sitter objects.

What do I mean by that?

The canonical example we’re giving in the docs and the examples directory looks roughly like this:

require 'tree_sitter'

src = "console.log('hello, world!);"

parser = TreeSitter::Parser.new
language = TreeSitter.load('javascript',
                           'path/to/libtree-sitter-javascript.{dylib, so}')
parser.language = language

tree = parser.parse_string(nil, src)
root = tree.root_node
# …

As long as tree is outlives root (or any node you extract from your parse), you should be safe. For instance, you can have parser, language, and tree as class members of a visitor, and then define your visiting methods in the same class to traverse the whole tree.

If node outlives tree, then you’re setting a trap for yourself, and you’re asking for SEGFAULT trouble, because internally, a TSNode struct contains a TSTree pointer, and invoking functions on the node will most likely involve the TSTree pointer. When ruby‘s GC decides to collect tree, it will call its finalizer, which in turn calls the C function ts_tree_delete; from now on, any pointer referring to tree is a loaded weapon.

But sometimes you’re not in control of the lifecycle of these objects, and this is why this doc exists.

PS: Other objects that will have their ts_*_delete called on garbage collection are TreeCursor, Query, and QueryCursor.

The Setup

I encountered this bug in production. We had a visitor set up the way I explained earlier, i.e.: we’re sure that tree outlives node.

We added new features for this visitor, and their respective test-cases. When we ran the tests in isolation, we had some failures, which was somehow expected.

However, when we launched the whole test suite, we got a SEGFAULT. Cool! That means we have some bugs in ruby-tree-sitter asking for some ass-whipping.

Dear reader, any lineraization of narrative below does not reflect reality. Chaos, discord, and uncertainty better describe the campaign for finding the source of this particular segmentation fault.

But I digress. Our test suite is set up to run a huge battery of test in parallel using minitest_parallel-fork. Since we were sure that the bug came up when running the whole test suite only, I had to see if anything changes if I set up the suite to run serially. And lo and behold, the bug didn’t creep up across numerous serial runs.

The chase begins

After some amusing investigation, it turned out that the first issue we were having came from tree-sitter itself.

When we visit the end node, and ask it for child_count, tree-sitter will happily tell us it has a positive children count. In fact, end.child_count == root.child_count, and when you try to access end.child(i), for all 0 <= i < root.child_count.

So I opened an on tree-sitter to get a confirmation that this is indeed a bug in the generated parsers, and meanwhile we fixed the issue in ruby-tree-sitter.

Excellent! Issue fixed, tree_sitter‘s ref in Gemfile updated, hit bundle update, then ran the test suite again: SEGFAULT in the same test.

And this time the crash report looked like this:

-- Control frame information -----------------------------------------------
c:0012 p:---- s:0060 e:000059 CFUNC  :inspect
c:0011 p:---- s:0057 e:000056 CFUNC  :inspect
c:0010 p:---- s:0054 e:000053 CFUNC  :_dump
c:0009 p:---- s:0051 e:000050 CFUNC  :dump
c:0008 p:0064 s:0046 e:000044 BLOCK  /Users/firas/projects/faveod/langeod/.gems/ruby/3.1.0/gems/minitest-parallel_fork-1.3.0/lib/minitest/parallel_fork.rb:64 [FINISH]
c:0007 p:---- s:0040 e:000039 CFUNC  :fork
c:0006 p:0040 s:0036 e:000035 BLOCK  /Users/firas/projects/faveod/langeod/.gems/ruby/3.1.0/gems/minitest-parallel_fork-1.3.0/lib/minitest/parallel_fork.rb:47 [FINISH]
c:0005 p:---- s:0030 e:000029 CFUNC  :times
c:0004 p:0064 s:0026 e:000025 METHOD /Users/firas/projects/faveod/langeod/.gems/ruby/3.1.0/gems/minitest-parallel_fork-1.3.0/lib/minitest/parallel_fork.rb:44
c:0003 p:0175 s:0016 e:000015 METHOD /Users/firas/projects/faveod/langeod/.gems/ruby/3.1.0/gems/minitest-5.16.2/lib/minitest.rb:159
c:0002 p:0497 s:0009 E:001ab8 EVAL   bin/langeod.rb:1800 [FINISH]
c:0001 p:0000 s:0003 E:001a20 (none) [FINISH]

No indicator whatsoever that it’s coming from the bindings. Maybe those inspect C functions come from the bindings, but I couldn’t be sure for reasonable reasons: my macOS refused to generate the crash reports it was generating earlier, and I couldn’t lay my hands on a proper core dump.

More than meets the eye

So I embarked on roughly two very, very long days investigating the source of the issue.

First, I tried to hook up gdb to the ruby process launching the test suite. You can read about how you can do it in Development.md. That was a fail because minitest_parallel-fork was forking groups of test in their own processes, and I couldn’t make gdb hook up to the forks.

I tried to do some ASAN and UBSAN just to make sure that we’re not leaking somthing. Also, no success. Couldn’t see anything wrong.

So it came down to narrowing down the test file to the single test that produced the segfault. It turned out to be a problem in our visit where we were doing a variation of this:

def get_node
  # …
  [node1, node2]
end

def visit_x
  # …
  n = get_node().type
end

Obviously, an Array doesn’t have a type method, so we should just get an exception. The test suite should report the exception and we should move on with our lives. Why would it segfault? It can’t be a ruby bug on such a trivial case! It was doing it when running this test in isolation and when we run the whole test-suite serially!

But then, upon reading minites_parallel-fork‘s code, and quite some time reflecting on the generated error report, and what could possibly cause such an issue, I remembered that ruby may defer evaluation of some expressions until they’re needed.

And what does ruby do when you have an exception? It calls inspect on the expression.

So could it be that it is actually generating an exception string, which calls Array::inspect, and in turn call its elements’ inspect methods, i.e. TreeSitter::Node::inspect, but defering its evaluation until needed?

I know that Node::inspect calls tree-sitter‘s ts_node_string to print a sexp representation of the node and its descendants.

So the only reasonable explanation is that when inspect is invoked, the underlying TSTree pointer used to print the sexp, by visiting the node’s descendants, is invalid ⇒ 💥.

Could it be that the visitor and its member parser, tree, etc. were GCed already when the time came to output the exception message?

The answer is yes, of course. I don’t do click-baits …

Reproducing the Bug

First, we will simulate a failing test case:

class Test
  def run
    parser = TreeSitter::Parser.new
    language = TreeSitter.lang('javascript')

    src = "console.log('hello, world!');"

    parser.language = language

    tree = parser.parse_string(nil, src)
    root = tree.root_node
    [root].type
  end
end

Notice that we’re calling [root].type instead of root.type, which should generate the exception we’re looking for.

Then, we’re going to need to simulate what’s happening in minitest_parallel-fork, which is basically forking a process, running the test, and sending the results back to the parent process via Marshall::dump.

Take a look at the crash report: the call stack has inspect sitting on top of dump.

class Minitest
  attr_accessor :result

  def run
    read, write = IO.pipe

    pid = Process.fork do
      data = 'intialized'
      begin
        read.close
        data = Test.new.run
      rescue StandardError => e
        data = e
      ensure
        GC.start(full_mark: true, immediate_sweep: true)
        write.write(Marshal.dump(data))
      end
    end

    write.close
    res = read.read
    Process.wait(pid)
    @result = Marshal.load(res)
  end
end

Finally, all that remains is running the test:

t = Minitest.new
t.run

puts t.result

The full details are in examples/05-segfault.

What happens when you run bundle exec ruby examples/05-segfault.rb? Of course it segfaults.

If you remove the forced GC.start() then you’re most likely never going to see the error in your runs. It’s put here to simulate our case where we have a great amount of objects created and recycled in the test-suite.

And I know that you’re tempted to point out that tree is not a class member, so of course it will be collected. Well, try it for yourself. Add those sweet @ wherever you like in Test. It will not change a thing, because the Test object itself will be collected when we hit ensure.

The only true fix for this is to retain Test.new in a Minitest member.

Who’s to blame?

So who’s fault its this, and who has to fix this behavior?

Is it Minitest‘s fault for discarding all my objects? Should I make an extension called minitest-hoarder just to fix my issue?

Is it minitest_parallel-fork‘s fault? Should the maintainers fix the way they handle forks and error messages?

Is it matz’s fault for making ruby?

Are we all guilty for chosing ruby? Or any garbage-collected language for that matter?

Alas, the blame game is nice, but unproductive. I could cache the results of inspect in a ruby string upon Node creation. Possible, doable, and it works. It means I (the library consumer) would need to consume far more memory than we should, and it’s all redundant information.

So I’d rather not.

A fix or a hack? Potato, poh-tah-toh!

I ended up doing some reference-counting on the TSTree* pointers. In a very hacky way, I must confess.

The Tree class now has class variable @@rc which is a hashmap of ptr -> reference-count. Every time a Node object is created, we increment its TSTree* ptr’s ref count. When a Node is garbage collected, we decrement the ref count. When ref count hits 0, ts_tree_delete is called, as if nothing happened.

And cruicially, I added a finalizer to the Tree class that will delete the remaining references to Tree, if any.

Et voilà!

I am not sure whether other parts might require a similar treatment. I am particularly nervous about TreeCursor which holds a void *tree. I am not touching it for now because I am sure it’s a void * for a reason; assuming that it’s a TSTree * is most likely a mistake.

If you found a similar behavior, don’t hesitate to open an issue, or even better: submit a PR.

PS: If you want to reproduce the segfault with examples/05-segfault.rb, checkout commit d8ee60a.

PPS: It is true that if you tampered with @@rc then a portal to Hades will replace your screen, and Cerebrus themselves will come after you, but this is trouble you explicitly asked for. A more long-term solution is to create a thread-safe association-list in C-space and hide reference-counting from Ruby-space; until then, try not to disturb the balance of nature, will ya?

Acknowledgment

I would like to thank Ulysse Buonomo. He was very kind to help me figure out solutions for this.