Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: baseline violations #137

Open
1 task done
frankdekker opened this issue Dec 7, 2023 · 5 comments
Open
1 task done

Feature: baseline violations #137

frankdekker opened this issue Dec 7, 2023 · 5 comments

Comments

@frankdekker
Copy link

frankdekker commented Dec 7, 2023

In a large existing codebase, it usually not possible to fix all violations at once. Or when new violations appear when new rules are added. Think for example the Slevomat Coding Standard. This feature will allow to create a baseline of the current violations, and only report new violations. This is a proposal to add a baseline feature:

Commandline

Create a baseline, and save it to the --baseline-file <path> or default to phpcs-baseline.xml

vendor/bin/phpcs ... --generate-baseline [--baseline-file=<path>]

Update the baseline, but only remove, not add new violations

vendor/bin/phpcs ... --update-baseline [--baseline-file=<path>]

Using the baseline. If --baseline-file argument is absent will look for phpcs-baseline.xml in the root of the project.

vendor/bin/phpcs ... [--baseline-file=<path>]

To ignore the current baseline:

vendor/bin/phpcs ... --ignore-baseline

Options for baseline file format

Which baseline format would we prefer?

  • xml: No additional dependency, due to the ext-simplexml dependency.
  • json: can be read with if the ext-json dependency is added.
  • neon: requires additional dependency, for example nette/neon.

Proposal for baseline file name

  • phpcs-baseline._format. Similar to PHPStan's phpstan-baseline.neon.
  • default location is the root of the project. Root is where /vendor/ directory lives.

Proposal for strict mode

  • Add a strict option in which violations that are in the baseline but do not appear will
    result in a violation. Commandline option: --baseline-strict ( ? )

Violation tracking strategies

To be able to track violations, we need to be able to identify them. We minimally need:

  • file path: This is the file path the violation occurred in. The file path will be normalized to be relative to the project root and only contain forward slashes.
  • Ignore the line in the file a violation occurs. Modifications to the top of the file will have unwanted effects.

Further the following strategies can be chosen:

Code signature:

  • the name of the sniff: For example PSR12.Files.FileHeader.SpacingAfterBlock.
  • hash of surrounding code: The code surrounding the violation is hashed. This is a very strict strategy, but
    also very sensitive to changes in the code. Downside there's a performance penalty for hashing the code.

Example:

<?xml version="1.0" encoding="UTF-8"?>
<phpcs-baseline>
    <violation file="relative/path/to/file" 
               sniff="PSR12.Files.FileHeader.SpacingAfterBlock" 
               signature="81a547481bf2e1839bf4cff69dffeb6674dbc203"
    />
</phpcs-baseline>

Violation count:

  • the name of the sniff: For example PSR12.Files.FileHeader.SpacingAfterBlock.
  • count the number of violations: Keep track of that PSR12.Files.FileHeader.SpacingAfterBlock occurs at most 2 times
    in the file. This is a very loose strategy, but also very insensitive to changes in the code. Downside is that
    it's not very strict, and it's not possible to track the specific violation messages. It will however prevent
    additional violations from being added.

Example:

<?xml version="1.0" encoding="UTF-8"?>
<phpcs-baseline>
    <violation file="relative/path/to/file" 
               sniff="PSR12.Files.FileHeader.SpacingAfterBlock" 
               count="2"
    />
</phpcs-baseline>

Violation message:

  • the name of the sniff: For example PSR12.Files.FileHeader.SpacingAfterBlock.
  • the violation message: Keep track of the specific violation message. Token index and
    line number should be excluded here as adding code above the violation with change these values.
    The difference with the strategy above is that different messages within the same Sniff can be
    baselined. This will improve accuracy.
  • count the number of violations: In this strategy also keep count of the violations.

Example:

<?xml version="1.0" encoding="UTF-8"?>
<phpcs-baseline>
    <violation file="relative/path/to/file" 
               sniff="PSR2.Classes.ClassDeclaration.CloseBraceAfterBody"
               message="The closing brace for the class must go on the next line after the body"
               count="1"
    />
</phpcs-baseline>

Final words

In short:

  • Choose file format: xml, json, or neon

  • Choose track strategy: code signature, violation count or message + violation count

  • I intend to create a pull request to implement this feature. See Add phpcs baseline support #115

@derrabus
Copy link

Thank you for the proposal. I've missed a baseline feature for a very long time.

In a large existing codebase, it usually not possible to fix all violations at once. Or when new violations appear when new rules are added.

Exactly this. PHPStan and Psalm come with a baseline and it makes working with them on legacy code so much easier. We can look at how they've built a baseline if we need inspiration.

From the options above, I think I'd pick "violation count" because it feels like a reasonable compromise. I believe we should use the same configuration format for the baseline as for the main configuration, so XML.

@jrfnl
Copy link
Member

jrfnl commented Dec 15, 2023

Thanks @frankdekker for taking the time to write this up.

I'm missing two things though in this write-up.

1. Discussion of other approaches

This is basically about making it easier to introduce PHPCS to older codebases.

A baseline is one of the most detailed options - which also makes it the most complex and creates the most work for implementing and maintaining the feature -, but there have been various proposals in the past suggesting other possibilities.
Those may be less precise, but that doesn't necessarily mean they should be disregarded as an option without a thought.

Here's a list of some tickets I found via a quick search in the Squizlabs repo. Not claiming completeness. A more thorough search might a good idea.

I also know there are various external projects which have already created either threshold based or baseline based solutions for their own purposes, some re-usable by other projects, some custom to one project.
(I even maintain a threshold based one myself for one external standard)

It may be worth looking at those for inspiration and talking through the pros and cons of each approach ?
Here are some of the ones mentioned in squizlabs/PHP_CodeSniffer#3387 and related threads:

On that note: is it a must for a baseline/threshold feature to be part of PHPCS itself or is there a conceivable way to build this as an external dependency which could be maintained separately ? (with possibly some tweaks in PHPCS to accommodate)

Without an opinion on the above, know that I'd be happy to create a new repo in this organisation and give someone maintainer rights to it, if the discussion would lean that way. Could also possibly be a way to work on a proof of concept, work out the kinks and stabilize it before adding it into PHPCS itself.

2. Impact on PHPCBF

When running phpcbf, the fixer runs through up to 50 loops to see if it can fix the code and the code in each of the loops will normally be slightly different as fixes are being applied iteratively.

If we take the example of a baseline with code signature - it is likely that by the second loop the signature will no longer match, so the violation will no longer be ignored.

If we take the example of baseline with a count per error code - if the count is 2 and the number of violations found is 5, how is PHPCBF to know which one to fix ?

...etc...

It's easy to glance over this when comparing with PHPStan or Psalm, but as far as I know, those don't run fixers.

@frankdekker
Copy link
Author

frankdekker commented Dec 16, 2023

Hi, thanks for the reply. I've looked some through older tickets and different baseline strategies. I took the strategies I would find useful.

1. Discussion of other approaches

squizlabs/PHP_CodeSniffer#2094
squizlabs/PHP_CodeSniffer#3389
I really don't like these approaches with having a global violation counter, which you have to lower each time an issue is fixed. A couple of years ago we did this strategy with our dev-team, and it just results in that issues gets fixed (unknownly), and that new warnings silently are ignored until you reach the threshold. Hence my minimum requirement is violation counter per file (as described).

I skimmed through the remaining approaches and created the 3 strategies I listed in the first post.

Extensions

I also know there are various external projects which have already created either threshold based or baseline based solutions for > their own purposes, some re-usable by other projects, some custom to one project.

Most/some of these projects were created by the lack of the baseline feature in PHPCS itself. In my personal opinion I would expect a project like phpcs would contain its own baseline mechanism. PHPStan, PHPMD provide this mechanism out of the box, and creating the baseline in these tools is usually only adding a --generate-baseline flag. For some baseline options you really want the cli arguments. With taking the extension approach you usually need a different way to set configuration options. My preference would really be adding the baseline feature to PHPCS itself, and keep it close to how PHPStan and PHPMD (and others) work.

I can understand that you don't want the maintainer responsibility for this. Adding support for extension would be nice second option then. Although that approach is out of scope for this issue. If that option is desired, this issue can be closed.

2. Impact on PHPCBF

If we take the example of a baseline with code signature - it is likely that by the second loop the signature will no longer match, > so the violation will no longer be ignored.

By default if a violation is in the baseline, phpcbf shouldn't pick it up. That's currently the behaviour with https://github.com/123inkt/php-codesniffer-baseline extension I wrote.

The code signature mechanism, takes about 3 lines above and below, and generates a signature. So only changes within these lines will invalidate the baseline. For phpcbf to make changes near that line, somebody already made modifications near these lines because it's not in the baseline and phpcbf detected it. By default phpcbf will not change the baselined code and therefore will not accidentally invalidate violations. Only when you edit near these violations, but you're already doing that as developer then.

If we take the example of baseline with a count per error code - if the count is 2 and the number of violations found is 5, how is > PHPCBF to know which one to fix ?

This one is more interesting indeed. I would expect it to ignore violation 1 and 2, and only fix violation 3-5. But obviously, violation 1 and 2 might be the new violations. This would happen less often with the code signature approach.

There's no clean solution for this, but I look at it differently. Does the baseline have to be perfect? The goal here is that devs do not add more violations to an existing codebase, and that violations count is only going down. The above problems only arises when a dev makes modifications to that specific file. Shouldn't he ++they++ be able to cleanup some extra?

Secondly, when you hit a PHPCS violation issue as dev, those are usually very low impact fixes. Most of the time it's removing or adding some whitespaces without running the risk of breaking something in the codebase. The problem is that you can't blindly update a codebase with 20000 violations. Hence in my opinion, a near perfect baseline feature would also suffice.

@derrabus
Copy link

Speaking for myself, it would be acceptable if PHPCBF would simply ignore the baseline completely. For me, the baseline is mostly interesting for non-fixable sniffs. If I run PHPCBF, I want stuff to be fixed. If the fixer is safe, I'd rather run it on the whole codebase once than baseline all violations. And if the fixer is unsafe, I'd disable the sniff for PHPCBF until all violations have been addressed.

@connorhu
Copy link

connorhu commented Mar 1, 2024

For formats, I suggest you also include php. For large baseline files it can be a big performance improvement not to have to parse a config format.
For fixable violations, I think the role of the baseline would be more to see the violated rules in one list, to be able to enable one by one, to make a PR review more transparent, or to not apply a bunch of fixes to a codebase all at once.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants