Lernapparat

Tutorial case study: Fixing your first PyTorch bug

June 7, 2019

PyTorch is a great project and I have only met very helpful people when contributing to it. However, the code base can be quite intimidating. Here we look at fixing a simple bug in detail and see that it is a less daunting task than it might seem at first.

Preparation

You have probably seen the awesome notes on PyTorch internals by my fellow PyTorch developer Edward Z. Yang, the great slides from Christian Perone, or even an earlier blog post of yours truly. If not, I highly recommend browsing through at least one of them.

The first step is to build your own PyTorch. The source code comes with a contributor's tutorial, which covers building, but also has a brief overview of the different components of the PyTorch repository.

Choosing a bug

We have checked out and built PyTorch. What now? Let us go fix a bug! But which one? PyTorch keeps a list of small issues which may be a good start.

For this example I chose a segmentation fault in pack_padded_sequence [1]. One super-nice thing about this issue is that the submitter provided a small script to reproduce it. That gives us a head start, as the problem is easy to trigger. Sometimes, getting such a reproducing script is the hardest and most time-consuming step in fixing a bug. It is a good idea to leave a short comment on the issue saying that you are working on it so no work is duplicated. Every small bug likes a friend to attend to it, so it is good to spread the love to many of them.

All theory is gray, my friend

To make things less abstract, I recorded a screen cast of me fixing the bug. It is real time, except for an edit for an unintended recompile.

Debugging

When debugging a crash, it is always handy to fire up a debugger to get a stack trace. I usually save the reproducing script somewhere (~/pytorch/scripts/ for me).

Then, from the PyTorch checkout with the built PyTorch in it, I run PYTHONPATH=./build/lib.SOMETHING/ gdb -ex run --args python3 ~/pytorch/scripts/my_repro.py. This makes gdb load the command after --args and run it right away. When a segfault happens, gdb stops automatically and you can type bt to get a backtrace. If you want to debug an exception, adding -ex 'catch throw' before -ex run is handy, and -ex 'break somefile.cpp:<lineno>' sets a breakpoint at a specific line. Because the PyTorch modules are loaded dynamically, gdb does not know the file yet at that point and will ask whether to keep the breakpoint pending.
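
If you find yourself retyping this invocation, a tiny helper can assemble it. This is only a sketch: gdb_command is a hypothetical name of mine, not part of PyTorch or gdb, and the script name and breakpoint location are placeholders. It uses only the gdb options mentioned above (-ex and --args).

```python
import shlex


def gdb_command(script, catch_throw=False, breakpoints=()):
    """Build the gdb argument list for running a Python repro script.

    Hypothetical helper for illustration only.
    """
    cmd = ["gdb"]
    if catch_throw:
        # Stop when a C++ exception is thrown; must come before "run".
        cmd += ["-ex", "catch throw"]
    for location in breakpoints:
        # location looks like "somefile.cpp:<lineno>"
        cmd += ["-ex", f"break {location}"]
    # Everything after --args is the program gdb will run.
    cmd += ["-ex", "run", "--args", "python3", script]
    return cmd


# Print a copy-and-paste friendly command line.
print(shlex.join(gdb_command("my_repro.py", catch_throw=True)))
```

You still need to set PYTHONPATH to your build directory in the environment, exactly as in the command above, or gdb will run Python against whatever PyTorch is installed system-wide.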

The backtrace tells us that the error happens in aten/src/ATen/native/PackedSequence.cpp, in the function _pack_padded_sequence. The reason is that we have a TORCH_CHECK(lengths[batch_size - 1] > 0,...), but batch_size is 0, so the index is -1 and the check reads out of bounds: a classic C++ bug...

Fixing the bug

We can fix this by checking whether the tensor is empty (as in has 0 elements) before assuming that it has at least one element.

Most C++ functions correspond to the Python equivalents in PyTorch, so we can add a TORCH_CHECK(input.numel(), ...) above the existing check. Indeed, the file I linked above is the version after the fix, and you will find the new check two lines above the old one. A one-line bugfix.
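
To see why the order of the checks matters, here is a toy model in plain Python. It is not the PyTorch code: check_pack_inputs, the shape layout (seq_len, batch_size, features) and the error messages are assumptions made for illustration. The point is only that the emptiness check has to run before anything indexes with batch_size - 1.

```python
import math


def check_pack_inputs(input_shape, lengths):
    """Toy model of the argument checks (illustrative, not PyTorch's code).

    input_shape is assumed to be (seq_len, batch_size, features).
    """
    # The fix: refuse empty inputs first (numel is the product of the sizes).
    if math.prod(input_shape) == 0:
        raise RuntimeError("cannot pack an empty tensor")

    batch_size = input_shape[1]
    # Only now is batch_size >= 1, so batch_size - 1 is a valid index.
    # In C++, batch_size == 0 would make this read lengths[-1], out of bounds.
    if lengths[batch_size - 1] <= 0:
        raise RuntimeError("all lengths have to be greater than 0")


check_pack_inputs((4, 2, 3), [4, 2])  # a valid batch passes silently

try:
    check_pack_inputs((0, 0, 3), [])  # the case that used to crash
except RuntimeError as err:
    print("rejected:", err)
```

Note that Python would hide the bug rather than crash: indexing a list with -1 either wraps around to the last element or raises an IndexError, whereas a raw pointer in C++ just reads whatever memory is there.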

Testing!

But we're not done: We should always test our code. To do this, we look for a good place to add our new test. In the test/ directory, we find a number of files with tests. The most prominent ones are test_torch.py for functions in ATen, test_cuda.py for GPU functions in ATen, test_autograd.py for Autograd, test_jit*.py for JIT-related tests, and test_nn.py for NN-related tests, but there are others, too. We changed a function exposed to Python via torch.nn.utils, so test_nn.py is where we want to add a test.

There already is a test for pack_padded_sequence, unsurprisingly named test_pack_padded_sequence [2]. It is a long function. At the very end there is an interesting bit: a comment # test error message followed by a small with self.assertRaisesRegex(...): block. We copy this block but pass an empty tensor and adapt the expected error message. (You can see my result on GitHub.) Of course, your test will look different. The most common test pattern is perhaps using self.assertEqual to compare a result with some expectation, such as a NumPy calculation. You can check that the test segfaults without our fix and passes with it.
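
The two test patterns mentioned here look roughly like this in plain unittest. This is a sketch with a stand-in: pack and its error message are hypothetical and merely play the role of the function under test, so the snippet runs without PyTorch. The real test in test_nn.py lives in PyTorch's own test class and calls pack_padded_sequence.

```python
import unittest


def pack(values):
    """Hypothetical stand-in for the function under test."""
    if len(values) == 0:
        raise RuntimeError("Cannot pack empty input.")
    return list(values)


class TestPack(unittest.TestCase):
    def test_error_message(self):
        # The block passes only if the call raises RuntimeError
        # and the message matches the regular expression.
        with self.assertRaisesRegex(RuntimeError, "empty"):
            pack([])

    def test_result(self):
        # The most common pattern: compare against an expected value.
        self.assertEqual(pack((1, 2)), [1, 2])
```

Saved as a file, this runs with python3 -m unittest followed by the file name. Note that assertRaisesRegex is used as a context manager, while assertEqual is a plain method call.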

Before you submit

There are a few things you should do before you commit your patch:

  • run the test suite to check that the tests pass (PYTHONPATH=./build/libTHERIGHTTHING python3 test/run_test.py; don't forget the correct PYTHONPATH or you'll test with some system-installed PyTorch),
  • run python3 -mflake8 test/test_nn.py or whatever Python modules you have touched to make sure you've not messed up the formatting,
  • for C++ there are less strict formatting rules, but clang-format is considered good style (and some files are fully formatted; you would want to keep them that way). You can use git-clang-format for your changes, but do not include formatting-only changes.
  • Read the fine print in the contributor's guide.

After that, you are good! Push a git branch and open a PR (referencing the right issue number). Thank you for helping out!

After submitting the PR, the continuous integration will kick in to test your PR. Do watch what is happening there and look at the failure logs. Sometimes tests will be flaky, i.e. have spurious failures, but they are there to save us from introducing embarrassing bugs in our PRs. To me the most interesting times to check this are after 5 minutes, half an hour and maybe three hours or so.

Conclusion

So this bug has been fixed in less than an hour of wall-clock time (and that long only because the rebuild after make clean took quite a while). Do not despair if it takes longer. The first PR is always the hardest, and I picked a particularly easy one to record the video with. I've spent weeks on some of my more intricate PRs. It should be fun!

If you are close to Munich and want to learn more about how to use PyTorch, do check out my workshop offering.

I hope you enjoyed this little trip into working on PyTorch. I would love to hear from you if you have feedback on this blog post and screen cast (it's my first) and also if you have topics you'd like to see covered here. Mail me at tv@lernapparat.de.


  1. The issue is fixed in master when you read this, but PyTorch 1.1 still has it for your viewing pleasure. 

  2. It is a bit of a matter of taste when to add a new function and when to amend an existing one. I usually try to blend in with the surroundings.