13 years of bug debugging experience

In "Learning From Your Bugs" I wrote about how I keep track of the most interesting bugs I have come across. Recently, I reviewed all 194 of my entries (from 13 years) to see what lessons I could learn. Below are the most important ones, grouped into three areas: coding, testing, and debugging.


Coding

Here are some of the issues I've experienced that can cause difficult bugs:

1. Sequence of events. When dealing with events, it is productive to ask the following questions: Can events arrive in a different order? What if we never receive this event? What if this event happens twice in a row? Even if these things do not normally happen, bugs in other parts of the system (or in interacting systems) can cause them to happen.
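As an illustration of these questions, here is a minimal Python sketch of an event handler that records unexpected events instead of acting on them. The `CallTracker` class, its states, and its event names are hypothetical and not taken from any real system.

```python
class CallTracker:
    """Hypothetical handler that tolerates unexpected event sequences."""

    # Allowed transitions: (current state, event) -> next state
    TRANSITIONS = {
        ("idle", "setup"): "ringing",
        ("ringing", "answer"): "connected",
        ("ringing", "hangup"): "idle",
        ("connected", "hangup"): "idle",
    }

    def __init__(self):
        self.state = "idle"
        self.ignored = []

    def handle(self, event):
        next_state = self.TRANSITIONS.get((self.state, event))
        if next_state is None:
            # Duplicate or out-of-order event: record it rather than
            # letting it corrupt the state.
            self.ignored.append((self.state, event))
            return
        self.state = next_state
```

Feeding it a sequence such as `answer, setup, setup, answer, hangup, hangup` exercises the "different order" and "twice in a row" cases, and the `ignored` list shows which events were rejected and in which state.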

2. Too early. This is a special case of "sequence of events" above, but it has caused some tricky bugs, so I have singled it out. For example, if signaling messages arrive before the configuration and startup procedures are complete, a lot of strange behavior can follow. Another example: a connection was marked as down before it was put on the free list. When debugging this problem, we kept assuming that the connection had been set to down while it was already on the free list (but then why was it not taken off the list?). It was a failure of imagination on our part not to consider that things sometimes happen too early.

3. Silent failures. Some of the hardest bugs to track down are partly caused by code that fails silently and carries on instead of throwing an error. For example, a system call (like bind) returns an error code that is never checked. Another example: parsing code that just returns instead of throwing an error when it encounters a faulty element. When execution continues for a while in an error state, debugging becomes harder. It is best to return an error as soon as a failure is detected.
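The parsing example can be sketched in Python as follows. The `key=value;key=value` input format and both function names are assumptions made up for this illustration. The first version skips a malformed element and carries on; the second raises as soon as it sees one.

```python
def parse_fields_silent(text):
    """Skips malformed elements without reporting them (the risky pattern)."""
    fields = {}
    for element in text.split(";"):
        if "=" not in element:
            continue  # silently ignored: the caller never learns about it
        key, value = element.split("=", 1)
        fields[key.strip()] = value.strip()
    return fields


def parse_fields_strict(text):
    """Raises as soon as a malformed element is found."""
    fields = {}
    for position, element in enumerate(text.split(";")):
        if "=" not in element:
            raise ValueError(
                f"malformed element {element!r} at position {position}"
            )
        key, value = element.split("=", 1)
        fields[key.strip()] = value.strip()
    return fields
```

With the input `a=1;oops;b=2`, the silent version returns a result that looks valid but is missing data, while the strict version points directly at the offending element.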

4. If. If statements with several conditions, such as if (a or b), especially when chained, as in if (x) else if (y), have caused me a lot of bugs. Even though the if statement is conceptually simple, it is error-prone when there are multiple conditions to keep track of. These days I try to rewrite the code so that it is simpler and avoids complex if statements.

5. Else. Some bugs are caused by not properly considering what should happen when a condition is false. In almost all cases, every if statement should have an else part. Also, if you set a variable in one branch of an if statement, you should probably set it in the other branch too. A related case is setting a flag. It is easy to add the condition for setting the flag, but just as easy to forget the condition for resetting it. A flag that stays set forever is likely to cause bugs later.
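One possible way to make the reset hard to forget is to pair setting and resetting in a single try/finally block. This is a minimal sketch, and the `Connection` class and its `busy` flag are hypothetical.

```python
class Connection:
    """Hypothetical object with a flag that must not stay set forever."""

    def __init__(self):
        self.busy = False

    def send(self, payload):
        self.busy = True
        try:
            if payload is None:
                raise ValueError("nothing to send")
            return len(payload)
        finally:
            # Runs on both the normal and the error path, so the flag
            # is always reset.
            self.busy = False
```

The flag is reset whether `send` returns normally or raises, so no separate reset condition has to be remembered.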

6. Changing assumptions. Many of the bugs that are hardest to prevent in the first place are caused by changing assumptions. For example, in the beginning there may be only one customer event per day, so a lot of code is written under that assumption. Then the design changes to allow multiple customer events per day. When this happens, it is hard to update every place that the new design affects. Finding the explicit dependencies on the change is not hard; what is hard is finding all the implicit dependencies on the old design. For example, there might be code that fetches all customer events for a given day, with the implicit assumption that the result set will never be larger than the number of customers. I do not have a good strategy for this problem. If you have one, please let me know.

7. Logging. Being able to see what the program does is critical, especially when the logic is complex. Make sure to add enough (but not too much) logging so that you can explain why the program does what it does. As long as everything works, you will not need it, but when something goes wrong you will be glad you added those logs.

Testing

As a developer, I am not done with a feature until I have tested it. At the very least, this means that every new or changed line of code has been executed at least once. Unit tests and functional tests are good, but they are not enough. New features must also be tested and explored in a production-like environment. Only then can I say that a feature is done. Here are some important lessons about testing that my bugs have taught me:

8. Zero and null. Always test with zeros and nulls where possible. For strings, this means testing with both zero-length strings and nulls. Another example: testing with a TCP connection that is disconnected before any data is sent on it. Not testing these kinds of cases is the number one cause of bugs.
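For the string case, such a test might look like the following Python sketch. The function under test, `word_count`, is a hypothetical stand-in.

```python
def word_count(text):
    """Hypothetical function under test: counts whitespace-separated words."""
    if text is None:
        return 0
    return len(text.split())


def test_word_count_edge_cases():
    assert word_count("two words") == 2   # the usual case
    assert word_count("") == 0            # zero-length string
    assert word_count("   ") == 0         # only whitespace
    assert word_count(None) == 0          # null


test_word_count_edge_cases()
```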

9. Additions and deletions. New functionality often includes the ability to add new profiles to the system, for example a new profile for mobile number translation. It is natural to test that profiles can be added. However, I find it is easy to forget to test that removing a profile works as well.

10. Error handling. Code that handles errors is often difficult to test. It would be nice to have automated tests that exercise the error handling code, but sometimes this is not possible. One trick I sometimes use is to temporarily modify the code so that the error handling code runs. The easiest way to do this is to reverse an if statement, for example changing if error_count > 0 to if error_count == 0. Another example is misspelling a database column name, so that the intended error handling code runs.
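Where an automated test is possible, one alternative to temporarily editing the code is to inject the failure from the test itself. The sketch below uses Python's standard `unittest.mock` for this. The `save_record` function and the `db.insert` method are hypothetical, and this is a different technique from the manual trick described above.

```python
from unittest import mock


def save_record(db, record):
    """Hypothetical code under test: returns False if the insert fails."""
    try:
        db.insert(record)
    except RuntimeError:
        # Error handling path that is hard to reach in normal operation.
        return False
    return True


def test_save_record_handles_insert_failure():
    db = mock.Mock()
    # Force the failure instead of misspelling a column name by hand.
    db.insert.side_effect = RuntimeError("no such column")
    assert save_record(db, {"id": 1}) is False
    db.insert.assert_called_once_with({"id": 1})


test_save_record_handles_insert_failure()
```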

11. Random input. Testing with random input is often a good way to expose bugs. For example, ASN.1 decoding in the H.323 protocol operates on binary data. By sending random bytes to the decoder, we found several bugs in it. Another example is generating scripts for test calls, where the call duration, the answer delay, which party hangs up first, and so on are all randomly generated. These test scripts can expose many bugs, especially when events happen at the same time and interfere with each other.
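The random-bytes idea can be sketched in a few lines of Python. The decoder here is a hypothetical length-prefixed format, not ASN.1, and the sketch assumes that `ValueError` is the decoder's way of cleanly rejecting bad input, so any other exception counts as a bug.

```python
import random


def decode_length_prefixed(data):
    """Hypothetical decoder: first byte is the payload length."""
    if len(data) < 1:
        raise ValueError("empty input")
    length = data[0]
    payload = data[1:1 + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    return payload


def fuzz(decoder, rounds=1000, seed=0):
    """Feeds random bytes to the decoder; returns inputs that crashed it."""
    rng = random.Random(seed)  # fixed seed so failures can be reproduced
    crashes = []
    for _ in range(rounds):
        size = rng.randrange(16)
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            decoder(data)
        except ValueError:
            pass  # a clean rejection of bad input is expected
        except Exception as exc:  # anything else is a decoder bug
            crashes.append((data, exc))
    return crashes
```

Calling `fuzz(decode_length_prefixed)` returns the list of inputs that caused an unexpected exception. Seeding the generator keeps a failing run repeatable.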

12. Check for actions that shouldn't happen. Testing usually involves checking that the desired action has occurred. But it is easy to overlook the opposite: checking that an action that should not have happened did not in fact happen.
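In an automated test, this can be an explicit assertion that a side effect was never triggered. The following Python sketch uses `unittest.mock`. The `close_account` function and the `mailer` object are hypothetical.

```python
from unittest import mock


def close_account(account, mailer):
    """Hypothetical code under test: notifies only if a balance remains."""
    account["open"] = False
    if account["balance"] > 0:
        mailer.send_refund_notice(account["id"])


def test_no_refund_notice_for_empty_account():
    mailer = mock.Mock()
    account = {"id": 7, "open": True, "balance": 0}
    close_account(account, mailer)
    assert account["open"] is False                 # desired action happened
    mailer.send_refund_notice.assert_not_called()   # unwanted action did not


test_no_refund_notice_for_empty_account()
```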

13. Have tools. I write my own small tools to make testing easier. For example, when I worked with the SIP protocol for VoIP, I wrote a small script that could respond with exactly the headers and values I wanted. This tool made it easy to test many edge cases. Another example is a command-line tool for making API calls. By starting small and gradually adding the features I needed, I ended up with some really useful tools. The nice thing about writing my own tools is that I get exactly what I want.

It is definitely not possible to find all bugs in testing. In one case, I changed the handling of correlation numbers, which consisted of two parts: a routing address prefix (usually constant) and a dynamically assigned number from 000 to 999. The problem was that when a correlation was looked up, the first digit of the dynamically assigned number was mistakenly truncated before the lookup in the table. That is, 637 became 37. This meant that it worked up to 100: the first 100 calls were fine, but the next 900 all failed. So unless I had tested with more than 100 calls before rebooting (which I did not), I would not have found this problem in testing.
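The arithmetic behind "the first 100 work, the next 900 fail" can be checked with a short Python sketch. The exact lookup mechanism is not described, so the sketch assumes the truncated value is compared numerically with the assigned number. `lookup_key` is a hypothetical name.

```python
def lookup_key(dynamic_number):
    """Assumed bug: drops the first digit of the three-digit number."""
    digits = f"{dynamic_number:03d}"
    return int(digits[1:])


working = [n for n in range(1000) if lookup_key(n) == n]
failing = [n for n in range(1000) if lookup_key(n) != n]

print(len(working), len(failing))  # 100 900
print(lookup_key(637))             # 37
```

Numbers 000 to 099 survive because the dropped digit is a zero, which is why a short test run never hits the bug.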

Debugging

14. Discuss. The debugging technique that has helped me the most is discussing problems with colleagues. Often, just explaining the problem to a colleague makes me realize the crux of the problem. Also, even if they're not very familiar with the code in question, they tend to have some good ideas. Discussions with colleagues are especially effective when dealing with the most difficult bugs.

15. Pay close attention. When debugging a problem takes a long time, it is often because I have made a wrong assumption. For example, I think the problem is in a certain method, but execution never even reaches that method. Or the exception being thrown is not the one I thought it was. Or I think the latest version of the software is running, but it is actually an older version. Therefore, be sure to verify the details rather than assume them. People are more likely to see what they expect to see than what is actually there.

16. Recent changes. When something that used to work stops working, it is usually because of something that changed recently. In one case, the most recent change was only to logging, but an error in the logging code caused a bigger problem. To make such regressions easier to find, it helps to put different changes in separate commits and to describe each change clearly.

17. Trust the user. Sometimes when a user reports a problem, my instinct is: "This is impossible. They must be doing something wrong." But I have learned not to respond that way. More often than not, it turns out that what they reported is what actually happened. So these days I take what they report at face value. Of course, I still double-check that everything is set up correctly. But I have seen enough weird things happen because of unusual configurations or unexpected usage that my default assumption is that they are right and the program is wrong.

18. Test fixes. When a bug fix is ready, it must be tested. First run the code without the fix and observe the bug. Then apply the fix and repeat the test case. The erroneous behavior should now be gone. Following these steps confirms that it really is a bug and that the fix really does resolve it. Simple and necessary.

Other observations

In the 13 years that I have been keeping track of the toughest bugs I have encountered, a lot has changed. I have worked on small embedded systems, large telecom systems, and web-based systems. I have used C++, Ruby, Java, and Python. Several types of bugs that I encountered when working in C++ have completely disappeared, such as stack overflows, memory corruption, string problems, and some forms of memory leaks.

Other issues, such as loop errors and edge cases, I see much less often. However, that does not mean there are no bugs left. The lessons in this post are meant to help reduce bugs in all three phases: coding, testing, and debugging. If you have useful techniques of your own for preventing and finding bugs, I would be glad to hear about them.

Translation link: http://www.codeceo.com/article/13-years-bug.html
English original: Lessons From 13 Years of Bugs
Translation author: Code Farm Network – Xiaofeng
