okumin
Contributor

@okumin okumin commented Oct 6, 2025

What changes were proposed in this pull request?

Normalize license headers of all Java files.

The following ones are exceptional.

Why are the changes needed?

Our CI strictly verifies the license headers of new files. I suspect many people copy and paste headers from existing files for convenience, so I sometimes see CI fail because of the format of a license header. I have hit this case once myself, and I have reviewed such a pull request once.

As an additional bonus, consistent headers will reduce legal risk and future maintenance costs (e.g., the next rewrite of the header format can be done with a single command).

Does this PR introduce any user-facing change?

No

How was this patch tested?

I put the following configuration in checkstyle/checkstyle.xml, standalone-metastore/checkstyle/checkstyle.xml, and storage-api/checkstyle/checkstyle.xml. After that, I ran mvn checkstyle:check.

<?xml version="1.0"?>
<!DOCTYPE module PUBLIC
    "-//Puppy Crawl//DTD Check Configuration 1.2//EN"
    "http://www.puppycrawl.com/dtds/configuration_1_2.dtd">
<module name="Checker">
  <module name="Header">
    <property name="headerFile" value="${config_loc}/asf.header"/>
    <property name="fileExtensions" value="java"/>
  </module>
</module>

And I manually checked changes with git diff master...HIVE-29245-asf-header
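
For readers unfamiliar with the Header check, the following is a rough Python stand-in for what the configuration above verifies: each .java file must begin with exactly the lines of the expected header. It is a minimal sketch, not Checkstyle's implementation. The function names are hypothetical, and the contents of asf.header are not shown in this thread, so the header is passed in as a parameter.

```python
from pathlib import Path


def has_expected_header(source: str, header: str) -> bool:
    """Return True when `source` begins with exactly the lines of `header`."""
    header_lines = header.splitlines()
    # A file shorter than the header yields a shorter slice and fails the comparison.
    return source.splitlines()[: len(header_lines)] == header_lines


def find_violations(root: Path, header: str) -> list[Path]:
    """List the .java files under `root` whose first lines differ from `header`."""
    return sorted(
        path
        for path in root.rglob("*.java")
        if not has_expected_header(path.read_text(encoding="utf-8"), header)
    )
```

Assuming the header file sits at a path such as checkstyle/asf.header (an assumption, since the thread only gives ${config_loc}/asf.header), `find_violations(Path("."), Path("checkstyle/asf.header").read_text())` would list the files that still need normalizing.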

@okumin okumin force-pushed the HIVE-29245-asf-header branch from 1d39753 to 51d9953 Compare October 7, 2025 02:24
@okumin okumin marked this pull request as ready for review October 7, 2025 02:25

sonarqubecloud bot commented Oct 7, 2025

@InvisibleProgrammer
Contributor

This PR contains ~1400 file changes in 45 commits. Most of the changes are generated. Could you please clarify which files/commits are worth reviewing?
Was there a script or a plugin execution that generated the changes? If yes, what was the tool or command that you ran?

TBH, I'm afraid that in this form the change will be squashed into a single commit, and later it will be extremely hard to figure out what happened.
Is it possible to create two commits instead of 45? One commit for the changes around the rules, exceptions, generator script, etc., and one for the generated content. That way the git history will be obvious.

@Aggarwal-Raghav
Contributor

@okumin, it seems that the bottom part of the license header is not indented.
For example, in AccumuloConnectionParameters.java:

 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

vs https://www.apache.org/legal/src-headers.html

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.    
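
One way to check whether the two excerpts above differ in wording or only in layout is to strip the comment decoration and compare the word sequences. This is a minimal sketch with a hypothetical helper name, and it only handles the `/*`, ` * `, and ` */` decoration seen in this thread.

```python
import re

HIVE_TAIL = """\
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */"""

ASF_TAIL = """\
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License."""


def header_words(block: str) -> list[str]:
    """Return the words of a header, ignoring comment markers and line wrapping."""
    words: list[str] = []
    for line in block.splitlines():
        # Drop a leading "/*", "*/", or "*" together with surrounding whitespace.
        text = re.sub(r"^\s*(/\*+|\*+/|\*)?\s*", "", line)
        # Drop a trailing "*/" if the comment closes on the same line.
        text = re.sub(r"\s*\*+/\s*$", "", text)
        words.extend(text.split())
    return words


same_wording = header_words(HIVE_TAIL) == header_words(ASF_TAIL)
```

For the two excerpts quoted here, the word sequences match, so the difference is in line wrapping and comment decoration rather than in the license text itself.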

@Aggarwal-Raghav
Contributor

FYI,
I am checking with the Spotless plugin:

        <plugin>
          <groupId>com.diffplug.spotless</groupId>
          <artifactId>spotless-maven-plugin</artifactId>
          <version>2.46.1</version>
          <configuration>
            <java>
              <licenseHeader>
                <file>${maven.multiModuleProjectDirectory}/style/java-header</file>
              </licenseHeader>
              <excludes>
                <exclude>**/LazyBinaryUnion.java</exclude>
                <exclude>**/JsonSerde.java</exclude>
              </excludes>
            </java>
          </configuration>
          <executions>
            <execution>
              <goals>
                <goal>check</goal>
              </goals>
            </execution>
          </executions>
        </plugin>

@okumin
Contributor Author

okumin commented Oct 9, 2025

@InvisibleProgrammer
It is challenging to point to a single script because the original code contains many different header variations. As explained, I validated the final deliverable using Checkstyle. If we want more substantial evidence, I can split this PR into several PRs, one per set of patterns.

How was this patch tested?

I put the following configuration in checkstyle/checkstyle.xml, standalone-metastore/checkstyle/checkstyle.xml, and storage-api/checkstyle/checkstyle.xml. After that, I ran mvn checkstyle:check.

<?xml version="1.0"?>
<!DOCTYPE module PUBLIC
    "-//Puppy Crawl//DTD Check Configuration 1.2//EN"
    "http://www.puppycrawl.com/dtds/configuration_1_2.dtd">
<module name="Checker">
  <module name="Header">
    <property name="headerFile" value="${config_loc}/asf.header"/>
    <property name="fileExtensions" value="java"/>
  </module>
</module>

And I manually checked changes with git diff master...HIVE-29245-asf-header

@Aggarwal-Raghav
Just to clarify: does this mean we should update our expected files? Probably not.

@Aggarwal-Raghav
Contributor

@okumin, IMO we should change the expected file to match the one on the ASF website, as mentioned above.
For example, the Iceberg project (a fairly new project) uses the latest license header:
https://github.com/apache/iceberg/blob/main/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/procedures/AncestorsOfProcedure.java

But the flip side is that the number of files modified in this PR might increase substantially 😬

@okumin
Contributor Author

okumin commented Oct 9, 2025

@InvisibleProgrammer @Aggarwal-Raghav
If we switch to the following PR, is it possible to review it?
#6122

@okumin, IMO we should change the expected file to match the one on the ASF website, as mentioned above.

I personally think we can postpone that decision, for three reasons. First, the current issue is the cost to reviewers of explaining, every time, why SonarQube complains about contributors' headers. Second, once all headers have been normalized, a single script can rewrite them all at once, and it is easy to write such a script, validate it, and validate its result. Finally, our current header is, hopefully, likely to be legally fine.
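
To illustrate the second reason: once every file starts with one identical header, switching to another format is a plain prefix replacement. The following is a minimal sketch under that assumption. The function names are hypothetical, and this is not a script from the PR.

```python
from pathlib import Path


def replace_header(source: str, old_header: str, new_header: str) -> str:
    """Swap `old_header` for `new_header` at the very start of `source`."""
    if not source.startswith(old_header):
        # Refuse to guess: a file that was never normalized needs manual attention.
        raise ValueError("source does not start with the normalized header")
    return new_header + source[len(old_header):]


def rewrite_tree(root: Path, old_header: str, new_header: str) -> int:
    """Rewrite every .java file under `root` and return the number of files changed."""
    changed = 0
    for path in sorted(root.rglob("*.java")):
        original = path.read_text(encoding="utf-8")
        updated = replace_header(original, old_header, new_header)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed += 1
    return changed
```

Because the function fails loudly on any file that does not start with the old header, validating the script and validating its result reduce to the same exact-prefix check.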

@InvisibleProgrammer
Contributor

If we switch to the following PR, is it possible to review it?
#6122

TBH, I see no difference. The second PR has more than 700 file changes, and it doesn't even build because of the pom.xml of storage-api.
It contains 9 commits, and still nothing shows which script modified the files.

Honestly, I love the idea of fixing those headers and enforcing the proper header. But with this PR, I don't see how the change can be properly reviewed.
Also, what actually happened is still hidden.
I mean, it is possible to spot the extra checks after the change, but it is not possible to spot how the files were changed. Is there a script that made the change?

Let me give you one example of what I have in mind:

Iceberg applied Spotless a while back. It ended up changing more than 3000 files, but it happened in only a few, individually understandable commits. All the generated changes are in one single commit, and that commit contains nothing but generated content.
That way, it is easy to understand what happened and how it happened. It is also easy to repeat the process if required.
https://github.com/apache/iceberg/pull/5312/commits

In this PR, it is easy to understand the end goal, but I don't see the how.

@okumin
Contributor Author

okumin commented Oct 10, 2025

@InvisibleProgrammer
Thanks for the feedback. I added notes on which commands were used and which changes were made by hand.

commit c71e63c1d4b44137589519d13608eb197a03279b (HEAD -> HIVE-29245-regex-indent, origin/HIVE-29245-regex-indent)
Author: okumin <[email protected]>
Date:   Thu Oct 9 14:00:18 2025 +0900

    Don't run RAT for checkstyle

commit 1292595d685be4056fc7ff9ad5d73a3ae6aaad1c
Author: okumin <[email protected]>
Date:   Thu Oct 9 13:40:43 2025 +0900

    Add RegexpHeader

commit 76a30b762d486cc7d83235f2babcf0856a8fd844
Author: okumin <[email protected]>
Date:   Thu Oct 9 13:36:12 2025 +0900

    Modify irregular files by hand

commit 0c349da0c8f3d5cd337d07613956c367ff2a2168
Author: okumin <[email protected]>
Date:   Thu Oct 9 13:19:59 2025 +0900

    Add asterisk to some irregular files
    
    By hand

commit 6a1b7ae9061cb61b15d7c75002e379014efbeaad
Author: okumin <[email protected]>
Date:   Thu Oct 9 13:11:45 2025 +0900

    Normalize # of indents of irregular LLAP patterns
    
    git ls-files '*.java' | grep -v 'src/gen/thrift' | grep -v '^iceberg/' \
      | xargs perl -pi -e 's|^ \*  you may not use this file except in compliance with the License\.$| * you may not use this file except in compliance with the License.|'
    
    git ls-files '*.java' | grep -v 'src/gen/thrift' | grep -v '^iceberg/' \
      | xargs perl -pi -e 's|^ \*  You may obtain a copy of the License at$| * You may obtain a copy of the License at|'

commit f90704c64312f53ef980fb52822dc2c9e9acd5e2
Author: okumin <[email protected]>
Date:   Thu Oct 9 13:02:35 2025 +0900

    Reduce extra indents, irregular pattern 1
    
    git ls-files '*.java' | grep -v 'src/gen/thrift' | grep -v '^iceberg/' \
      | xargs perl -pi -e 's|^ \*  \* Licensed to the Apache Software Foundation \(ASF\) under one$| * * Licensed to the Apache Software Foundation (ASF) under one|'
    
    git ls-files '*.java' | grep -v 'src/gen/thrift' | grep -v '^iceberg/' \
      | xargs perl -pi -e 's|^ \*  \* or more contributor license agreements\.  See the NOTICE file$| * * or more contributor license agreements.  See the NOTICE file|'
    
    git ls-files '*.java' | grep -v 'src/gen/thrift' | grep -v '^iceberg/' \
      | xargs perl -pi -e 's|^ \*  \* distributed with this work for additional information$| * * distributed with this work for additional information|'
    
    git ls-files '*.java' | grep -v 'src/gen/thrift' | grep -v '^iceberg/' \
      | xargs perl -pi -e 's|^ \*  \* regarding copyright ownership\.  The ASF licenses this file$| * * regarding copyright ownership.  The ASF licenses this file|'

The diff, ignoring whitespace, is 500+ lines. Is this still too large? If so, I will split it further.

 % git diff master...HIVE-29245-regex-indent -w | wc -l
     555
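
For comparison, the two LLAP substitutions from the commit messages above can be sketched in Python as one function. This is an illustrative equivalent, not what was run in the PR. The function and constant names are hypothetical, and it assumes LF line endings.

```python
import re

# Header lines that appear after " *  " (two spaces) in the irregular LLAP-style
# headers, copied from the perl commands quoted in the commit message.
LLAP_LINES = (
    "you may not use this file except in compliance with the License.",
    "You may obtain a copy of the License at",
)


def normalize_llap_indent(text: str) -> str:
    """Reduce ' *  <line>' to ' * <line>' for the known LLAP header lines only."""
    for line in LLAP_LINES:
        # Anchor on the whole line so that other double-spaced lines are untouched.
        pattern = re.compile(r"^ \*  " + re.escape(line) + r"$", re.MULTILINE)
        text = pattern.sub(lambda _match: " * " + line, text)
    return text
```

Anchoring each pattern on the full line, as the perl commands do, keeps the rewrite narrow: a line such as ` *  http://www.apache.org/licenses/LICENSE-2.0` is left alone.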

@okumin
Contributor Author

okumin commented Oct 10, 2025

@InvisibleProgrammer
My mistake was including whitespace changes and other ad-hoc changes in a single pull request; the ad-hoc styles are the headache that makes the entire reformat hard. So I tried to address them first. Please let me know if it is still too large.
#6125

@okumin okumin closed this Oct 10, 2025
@okumin okumin deleted the HIVE-29245-asf-header branch October 11, 2025 01:56