Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix symlink handling #764

Draft
wants to merge 8 commits into
base: main
Choose a base branch
from
Draft

Conversation

matrss
Copy link
Contributor

@matrss matrss commented Jun 19, 2023

This PR implements the heuristics for handling symlinks I've described in more detail in #627 (comment) and #627 (comment).

Basically, this is based on an interpretation of the REUSE specification that:

  1. a symlink pointing to a covered file is considered to be the same file as the covered file and can therefore be ignored.
  2. a symlink pointing to a file that is not a covered file is itself considered to be a covered file and should not be ignored, unless the symlink is ignored by other means.

This fixes #627 while not breaking #202 for the described situation, since it would fall into category 1 above.

@matrss
Copy link
Contributor Author

matrss commented Jun 20, 2023

There is another bug in here when dealing with symlink targets in .git if the target file exists. I erroneously assumed Project._is_path_ignored considers a file inside .git as ignored, which is not the case (it only considers .git/.hg itself as ignored, but no sub-paths). I'll provide additional tests and a fix later.

edit: fixed

@matrss matrss marked this pull request as draft June 20, 2023 09:16
matrss added 4 commits June 22, 2023 11:19
Previously a sub-path of an ignored directory was not considered to be
ignored, e.g. is_ignored returned False for a file build/hello.py even
if _all_ignored_files contained Path("build").  Instead of just checking
if the path is in the ignored files we also have to check if it is
inside of an ignored directory.

Additionally this commit fixes a bug that surfaced: when building the
_all_ignored_files set all_files contains an empty string as its last
element. This leads to Path(".") being an element in _all_ignored_files,
i.e. ignore everything. Using all but the last element from all_files
fixes that.
These tests assume the interpretation of the REUSE specification that:
1. a symlink pointing to a covered file is considered to be the same
   file as the covered file and can therefore be ignored.
2. a symlink pointing to a file that is not a covered file is itself
   considered to be a covered file and should not be ignored, unless the
   symlink itself is ignored by other means.
@matrss matrss force-pushed the fix-symlinks-for-git-annex branch from 9ae09f6 to fbb5ed2 Compare June 22, 2023 09:20
@matrss matrss force-pushed the fix-symlinks-for-git-annex branch from fbb5ed2 to 13e4fa7 Compare June 22, 2023 09:27
@matrss
Copy link
Contributor Author

matrss commented Jun 22, 2023

@wtraylor Feel free to try this out on a real-world project if you have one available. This was less straight-forward to implement than I thought it would be and there might still be some edge-cases I didn't consider. I tried to implement the symlink handling in a way I thought to be intuitive, but that might also differ from your expectations. So some testing would be great :)

@carmenbianca
Copy link
Member

Hi @matrss

This PR looks qualitatively really good. We'll need some time to review it, though.

It will also need a change to the specification. Although this version of the spec is not yet live, the current development version says that symlinks are ignored.

@wtraylor
Copy link
Contributor

Many thanks for implementing this feature, @matrss !
For the main use case with git-annex it works great: reuse lint correctly complains about annexed files (i.e., symlinks pointing to content in .git) not having license information.

git init repo && cd repo
git annex init
cp ~/bigfile.dat ./
git annex add bigfile.dat # creates a symlink in place of bigfile.dat
reuse lint  # fails correctly
reuse annotate --force-dot-license -l CC0-1.0 -c Author bigfile.dat
reuse download --all
reuse lint # passes correctly

The implementation appears to assume that all symlinks require dot-license files. For example:

git init repo && cd repo
git annex init
echo Hello World > README.md
reuse annotate -l CC0-1.0 -c Me README.md
reuse download --all
reuse lint # success
rm -r LICENSES
git annex add README.md
git commit -m "Add annexed README"
reuse lint # fails: no licensing info for README.md
reuse annotate -l CC0-1.0 -c Me README.md # fails: can’t write to README.md

I guess it makes sense to require dot-license files for symlinks because the symlink targets may not be available (i.e., symlink is broken, for example if annexed content is not in the working copy). However, I suggest to make this more transparent to the user—for example with these messages:

  1. reuse annotate on a symlink file: “blablabla is a symbolic link. Please use --force-dot-license.”
  2. reuse lint with a symlink file that doesn’t have a dot-license: “The following symbolic links have no copyright and licensing information in separate dot-license files:”
  3. reuse lint with a symlinked dot-license file: “WARNING: Can’t parse blablabla.license because it’s a symbolic link. Dot-license files should be regular files.”

Point 3 stems from my experience that it can easily happen to annex a file by accident. Then you might end up with an annexed .license file that REUSE doesn’t parse.

Finally, there is the question of how to deal with symlink folders. This does not fall into the use case of git-annex (which only symlinks files). Here’s an example:

mkdir folder
echo Hello World > folder/foo
git init repo && cd repo
ln -s ../folder symlinked_folder
reuse lint # no complaints

I think this behavior is good. It doesn’t make sense for reuse to check files in symlinked folders with content outside of the repository. However, I wonder if it would make sense to make this transparent to the user. For example, reuse lint good give a hint that it ignored symlinked_folder because it’s a symbolic link.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Symbolic links (git-annex files) are ignored
3 participants