Count File Size (MB) #504

Open
DarwinJS opened this issue Jul 19, 2020 · 4 comments

@DarwinJS
DarwinJS commented Jul 19, 2020

Some commercial security scanning tools now charge by file volume in MB.
I am wondering if that measurement could be taken and reported.
I am thinking of just the raw file size, with no attempt to separate actual code from comments (unless that is super easy).

Maybe a later version could rewrite files without comments and estimate the true code MB, but I'd be happy with just a raw number for a "minimum viable feature" release :)

@AlDanial
Owner

Sure, that measurement can be taken and recorded, but I don't think cloc needs a new feature for it. Instead, use cloc to collect the names of the source files, then run a trivial script that adds up the file sizes. For example:
Step 1: cloc --by-file --csv --out counts.csv directory
Step 2: count_bytes counts.csv
where count_bytes is something like

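#!/usr/bin/env perl
# count_bytes: sum the sizes of the files listed in a cloc --by-file --csv report
use warnings;
use strict;
my $bytes = 0;
while (<>) {
    my $file = (split(','))[1];   # the second CSV column is the file name
    next unless $file;            # skip blank lines and the SUM row
    next if $file eq "filename";  # skip the header row
    if (!-e $file) {
        print "can't read $file, skipping\n";
        next;
    }
    $bytes += -s "$file";         # -s returns the file size in bytes
}
print "$bytes total bytes\n";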

A drawback of this method is that it won't work on archive files (.tar, .zip, etc.); you'll need to expand these first.
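For anyone who wants the total in MB rather than bytes, here is a minimal sketch of the same idea in Python rather than Perl. It assumes the cloc --by-file --csv layout used above (file name in the second column, a header row whose second field is "filename", and a SUM row with an empty file name). The function names and the choice of 1 MB = 1,000,000 bytes are illustrative assumptions; a vendor that bills by volume may define MB differently.

```python
import csv
import os
import sys


def total_bytes(csv_path):
    """Sum the sizes of the files listed in a cloc --by-file CSV report."""
    total = 0
    with open(csv_path, newline="") as report:
        for row in csv.reader(report):
            # Column 1 holds the file name; skip short rows, the SUM row
            # (empty file name) and the header row.
            if len(row) < 2 or not row[1] or row[1] == "filename":
                continue
            path = row[1]
            if not os.path.isfile(path):
                print(f"can't read {path}, skipping", file=sys.stderr)
                continue
            total += os.path.getsize(path)
    return total


def to_megabytes(num_bytes, bytes_per_mb=1_000_000):
    """Convert bytes to MB; pass bytes_per_mb=1024 * 1024 for MiB instead."""
    return num_bytes / bytes_per_mb
```

Calling to_megabytes(total_bytes("counts.csv")) after Step 1 would give the raw volume in MB. Using the csv module instead of splitting on commas also copes with quoted fields, should a report contain any.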

The solution can easily be adapted to count bytes in files after comments are removed.
Step 1: cloc --strip-comments No_Comments --original-dir --by-file --csv --out counts.csv directory
Step 2: count_bytes_no_comments counts.csv
where count_bytes_no_comments is

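#!/usr/bin/env perl
# count_bytes_no_comments: sum the sizes of the comment-stripped copies
# that cloc --strip-comments No_Comments --original-dir wrote next to each file
use warnings;
use strict;
my $bytes = 0;
while (<>) {
    my $file = (split(','))[1];   # the second CSV column is the file name
    next unless $file;            # skip blank lines and the SUM row
    next if $file eq "filename";  # skip the header row
    $file .= ".No_Comments";      # measure the stripped copy, not the original
    if (!-e $file) {
        print "can't read $file, skipping\n";
        next;
    }
    $bytes += -s "$file";         # -s returns the file size in bytes
}
print "$bytes total bytes\n";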

@DarwinJS
Author

There are several benefits to having it integrated:

  • Puts MB in the same report or data output format, so the data can be consumed in the same ways. With completely separate reports, many people would want to merge them so they can estimate both code metrics in the same way: per repository, per language.
  • Handles all files the same way your base code does (so archive files are handled automatically, as you do now).
  • Lets your report-aggregation code be used for MB as well.
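To make the "same report" point concrete, here is a rough Python sketch of what a merged per-language view could look like when built outside cloc from the existing --by-file --csv output. It is not a cloc feature. It assumes the report's header names its columns language, filename and code; the function name and the shape of the returned dictionary are made up for illustration.

```python
import csv
import os
from collections import defaultdict


def per_language_summary(csv_path):
    """Return {language: {"files": n, "code": lines, "bytes": size}}."""
    summary = defaultdict(lambda: {"files": 0, "code": 0, "bytes": 0})
    with open(csv_path, newline="") as report:
        for row in csv.DictReader(report):
            path = row.get("filename")
            # The SUM row has no file name; unreadable paths are skipped too.
            if not path or not os.path.isfile(path):
                continue
            entry = summary[row["language"]]
            entry["files"] += 1
            entry["code"] += int(row["code"])
            entry["bytes"] += os.path.getsize(path)
    return dict(summary)
```

This only covers files cloc can name on disk, so it does not address the archive-handling benefit in the second bullet.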

I was also thinking of an implementation detail that might make this very efficient. If you are already creating storage (like a variable) that holds the code with comments stripped, the size could be taken at that point, with an overhead value added to produce a "file size" estimate. PerFileOverheadBytes could be a built-in default that users can override with a parameter, so they could tune it to their liking.
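As arithmetic, the proposal amounts to: estimated size = sum over files of (stripped code bytes + per-file overhead). A minimal Python sketch follows. It does not reflect how cloc stores stripped code internally. The function names are hypothetical, per_file_overhead_bytes mirrors the PerFileOverheadBytes name suggested above, and its default of 0 is an arbitrary placeholder.

```python
def stripped_size_bytes(code_text, encoding="utf-8"):
    """Size in bytes of comment-stripped code held in memory as text."""
    return len(code_text.encode(encoding))


def estimate_total_bytes(stripped_texts, per_file_overhead_bytes=0):
    """Add a fixed, user-tunable overhead to each file's stripped size."""
    return sum(
        stripped_size_bytes(text) + per_file_overhead_bytes
        for text in stripped_texts
    )
```

For example, two stripped files of 3 and 2 bytes with an overhead of 100 bytes each would be estimated at 205 bytes.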

I was also thinking of building a CI plugin around this similar to these: https://gitlab.com/guided-explorations/ci-cd-plugin-extensions.

@cdeszaq
cdeszaq commented Jun 19, 2024

This seems to be available, basically, via the --categorized option. It is perhaps not inline in the report, or quite as "easy" to get at, but it is still available if a more generic script-oriented (and/or Unix-philosophy-aligned) approach isn't sufficient.

@includesec-erik

The original requester asked for size "by file volume in MB", which is already available in a couple of ways for a given file on Unix:

$ ls -l ./langs_includesec_audited.txt |awk '{ print $5 }'
221 #in bytes
or
$ du -sh ./langs_includesec_audited.txt
4.0K    ./langs_includesec_audited.txt #disk usage (allocated blocks), human readable
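The two commands above measure different things: ls -l shows the file's content length, while du shows the space allocated for it on disk, which is why a 221-byte file can show up as 4.0K. A small Python sketch of the distinction, for Unix-like systems only (st_blocks is not available on Windows); the function names are illustrative:

```python
import os


def apparent_size(path):
    """Content length in bytes, the figure in the size column of `ls -l`."""
    return os.stat(path).st_size


def disk_usage(path):
    """Bytes allocated on disk, the quantity `du` reports (Unix only)."""
    # Python documents st_blocks as a count of 512-byte blocks.
    return os.stat(path).st_blocks * 512
```

Which figure matters depends on how a given tool defines "file volume"; summing apparent sizes is what the count_bytes approach earlier in the thread does.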

For anything but a massive scale use-case, running cloc and then running a script that counts the size of all the files after would work fine.

IMHO (similar to #798) this issue seems to be asking for something that other Unix tools already do. My vote is for @AlDanial to spend his valuable time adding some other awesome cloc feature or fixing known bugs!
