foundry27 4 hours ago

First off, congratulations folks! It’s never easy getting a new product off the ground, and I wish you the best of luck. So please don’t take this as anything other than genuine constructive criticism as a potential customer: generating tests to increase coverage is a misunderstanding of the point of collecting code coverage metrics, and businesses that depend on getting verification activities right will know this when they evaluate your product.

A high-quality test passes when the functionality of the software under test is consistent with the design intent of that software. If the software doesn’t do the Right Thing, the test must fail. It’s why TDD is effective: you’re essentially specifying the intent and then implementing code against it, like a self-verifying requirements specification. When we look at the Qodo tests in the GitHub PRs you’ve linked, the implied definition of a high-quality test is one that:

1. Executes successfully

2. Passes all assertions

3. Increases overall code coverage

4. Tests previously uncovered behaviors (as specified in the LLM prompt)

So, given source code for a project as input, a hypothetical “perfect AI” built into Qodo that always writes a high-quality test would (naturally!) never fail to write a passing test for that code; the semantics of the code would be perfectly encoded in the test. If the code had a defect, it follows logically that optimizing the quality of your AI for the metrics Qodo is aiming for will actually LOWER the probability of finding that defect! The generated test would have successfully managed to validate the code against itself, enshrining defective behavior as correct. It’s easy to say that higher code coverage is good, more maintainable, etc., but this outcome is actually the exact opposite of maintainable and actively undermines confidence in the code under test and the ability to refactor.
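
To make that concrete, here’s a contrived Python sketch (the function and the test are hypothetical, not taken from any Qodo output). A generator whose only acceptance criteria are "runs, passes, raises coverage" has to derive the expected value from the code itself, so the defect becomes the spec:

    def days_in_january():
        return 30  # defect: should be 31

    # The generated test encodes the observed (buggy) behavior as correct.
    # It runs, it passes, it covers the line -- and it now actively resists
    # the fix, because correcting the code makes this test fail.
    def test_days_in_january():
        assert days_in_january() == 30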

There are better ways to do this, and you’ve got competitors who are already well on the way to doing them using a diverse range of inputs besides code. It boils down to answering two questions:

1. Can a technique be applied so that an LLM, with or without explicit specifications and an understanding of developer intent, will reliably reconstruct the intended behavior of the code?

2. Can a technique be applied so that the tests generated by an LLM truly verify the specific behaviors the LLM was prompted to test, as opposed to producing a valid test that isn’t the one that was asked for?

m3kw9 7 hours ago

Why can’t I just use Cursor to “generate tests” instead?

  • timbilt 7 hours ago

    > validates each test to ensure it runs successfully, passes, and increases code coverage

    This seems to be based on the open-source cover-agent project, which implements Meta's TestGen-LLM paper. https://www.qodo.ai/blog/we-created-the-first-open-source-im...

    After generating each test, it's automatically run — it needs to pass and increase coverage, otherwise it's discarded.

    This means you're guaranteed to get working tests that aren't repetitions of existing tests. You just need to do a quick review to check that they aren't doing something strange, and then they're good to go.
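
    Roughly, the accept/discard loop looks something like this (my own sketch of the idea, not cover-agent's actual code; the helper functions are made-up placeholders):

        # Sketch of a coverage-gated generation loop.
        # llm_generate_test, run_suite_with, and append_to_suite are
        # hypothetical helpers standing in for the real machinery.
        def improve_coverage(source_file, suite, target_coverage, max_attempts=20):
            baseline = run_suite_with(suite).coverage
            for _ in range(max_attempts):
                if baseline >= target_coverage:
                    break
                candidate = llm_generate_test(source_file, suite)
                result = run_suite_with(suite + [candidate])
                if result.all_passed and result.coverage > baseline:
                    suite = append_to_suite(suite, candidate)  # keep it
                    baseline = result.coverage
                # otherwise the candidate is discarded and we try again
            return suite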

    • torginus 5 hours ago

      What's the reasoning behind generating tests until they pass? Isn't the point of tests to discover erroneous corner cases?

      What purpose does this serve besides the bragging rights of 'we need 90% coverage otherwise SonarQube fails the build'?

      • timbilt 5 hours ago

        Unit tests are more commonly written to future-proof code against issues down the road, rather than to discover existing bugs. A code base with good test coverage is considered more maintainable — you can make changes without worrying that it will break something in an unexpected place.

        I think automating test coverage would be really useful if you needed to refactor a legacy project — you want to be sure that as you change the code, the existing functionality is preserved. I could imagine running this to generate tests and get to good coverage before starting the refactor.
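
        That's essentially the characterization-test idea: pin down whatever the legacy code currently returns, right or wrong, so the refactor can be checked against it. A contrived Python example (the function and the values are made up):

            def legacy_price(quantity, unit_price):
                # Legacy code whose exact behavior we want to preserve
                # through the refactor, whether or not it is "correct".
                total = quantity * unit_price
                if quantity > 100:
                    total = total * 0.9  # undocumented bulk discount
                return round(total, 2)

            # Characterization test: asserts the current observed outputs,
            # so any behavioral change during the refactor shows up as a failure.
            def test_legacy_price_characterization():
                assert legacy_price(10, 2.5) == 25.0
                assert legacy_price(200, 1.0) == 180.0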

        • torginus 4 hours ago

          I disagree. In my experience, poorly designed tests test the implementation rather than the behavior. To test behavior, you have to know what is actually supposed to happen when the user presses a button.

          One of the issues with chasing high coverage is that tests often have to be written against the implementation rather than the desired outcomes.

          Why is this an issue? As you mentioned, testing is useful for future-proofing codebases and making sure that changing the code doesn't break existing use cases.

          When tests check for desired behavior, that usually means that unless the spec changes, all tests should keep passing.

          The problem is when you test the implementation - suppose you do a refactoring, a cleanup, or extend the code to support future use cases, and the tests start failing. Clearly something must be changed in the tests - but what? Which cases encode actual, important rules about how the code should behave, and which ones were just tautologically testing that the code did what it did?

          This introduces murkiness and diminishes the value of tests.
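
          A contrived Python illustration of the difference (the class and the helper are made up): the first test encodes a rule about behavior and survives refactoring; the second encodes an implementation detail and breaks the moment you inline the helper, even though nothing observable changed.

              from unittest.mock import patch

              class Cart:
                  def __init__(self, prices):
                      self.prices = prices

                  def total(self):
                      return self._sum_prices()  # internal detail

                  def _sum_prices(self):  # could be inlined tomorrow
                      return sum(self.prices)

              # Behavior test: encodes the rule "the total is the sum of the prices".
              def test_total_behavior():
                  assert Cart([2, 3, 5]).total() == 10

              # Implementation test: encodes "total() delegates to _sum_prices()".
              # A harmless refactor breaks it, and the failure says nothing about
              # whether the code still does the right thing.
              def test_total_implementation():
                  with patch.object(Cart, "_sum_prices", return_value=10) as helper:
                      assert Cart([2, 3, 5]).total() == 10
                      helper.assert_called_once()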

        • HideousKojima 4 hours ago

          > Unit tests are more commonly written to future-proof code against issues down the road, rather than to discover existing bugs. A code base with good test coverage is considered more maintainable — you can make changes without worrying that it will break something in an unexpected place.

          The problem is that a lot of unit tests could accurately be described as testing "that the code does what the code does." If future changes to your code also require you to modify your tests (which they likely will), then your tests are largely useless. And if tests for parts of your code that you aren't changing start failing when you make changes, that means you made terrible design decisions in the first place that left your code too tightly coupled (or with too many side effects, or something like global mutable state).

          Integration tests are far, far more useful than unit tests. A good type system and avoiding the bad design patterns I mentioned handle 95% of what unit tests could conceivably be useful for.
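
          To illustrate the kind of test I mean, here's a small integration-style example (FastAPI picked arbitrarily; the endpoint is made up). It exercises routing, validation, and serialization together through the public surface rather than poking at one function in isolation:

              from fastapi import FastAPI
              from fastapi.testclient import TestClient

              app = FastAPI()

              @app.get("/items/{item_id}")
              def read_item(item_id: int):
                  return {"item_id": item_id, "price": 9.99}

              client = TestClient(app)

              # Drives the endpoint end to end over HTTP, framework validation included.
              def test_read_item():
                  resp = client.get("/items/42")
                  assert resp.status_code == 200
                  assert resp.json() == {"item_id": 42, "price": 9.99}

              def test_read_item_rejects_non_integer_id():
                  assert client.get("/items/not-a-number").status_code == 422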