ModelDB contains files with invalid UTF-8 #91
Comments
Not a CI issue, but agree it would be nice to standardize. To anybody reading this: pull requests to the relevant repositories at https://github.com/modeldbrepository are welcome. Note that this isn't limited to NEURON models.
Hi, I opened a whole bunch of PR... Here is a list of all of the models that I opened PRs against: I opened an issue with the following repo: I could not determine the text encoding for the following models. Most of these changes are very simple, trivial changes to comments.
And here are the tools I used to do this, in case anyone else encounters legacy character encoding formats in the future:
Here is the messy program that I wrote:

import requests
from pathlib import Path
import subprocess
import json
import os
import sys
import chardet

commit_msg = "Convert character encoding to UTF-8"
user = "YOUR GITHUB USERNAME GOES HERE"
token = "YOUR GITHUB TOKEN GOES HERE"

# Make a cache dir to hold all of the temp files.
cache_dir = Path.cwd().joinpath('tmp_modeldb_repos')
cache_dir.mkdir(exist_ok=True)
os.chdir(cache_dir)

repo_list = []
# Edit the hard-coded conditions below to choose where the repo list comes from.
if True:
    # Get a list of all GitHub repos owned by user "ModelDBRepository".
    page = 1
    per_page = 100
    while True:
        print(f"Talking to github, requesting info for repos {(page-1)*per_page+1} - {page*per_page} ... ", flush=True, end='')
        response = requests.get(
            'https://api.github.com/users/ModelDBRepository/repos',
            params={'page': str(page), 'per_page': str(per_page)},
            auth=(user, token))
        print('status code:', response.status_code)
        if response.status_code != 200:
            print(response.text)
            sys.exit(1)
        page_data = [x['name'] for x in json.loads(response.text)]
        repo_list.extend(page_data)
        page += 1
        # An empty page means there are no more repos to list.
        if not page_data:
            break
elif True:
    # Use the models that are already downloaded.
    repo_list = list(x.name for x in cache_dir.iterdir())
else:
    # Check a single hard-coded model.
    repo_list = ['98017']

print("REPO LIST:", ', '.join(str(x) for x in repo_list), '\n')
print("NUM REPOS:", len(repo_list), '\n')

# Check each repo for non-UTF-8 text files.
changed_repos = []
failed_repos = []
repo_list = [Path(str(x)) for x in repo_list]
for repo in repo_list:
    # Download the git repository.
    if not repo.exists():
        subprocess.run(
            ['git', 'clone', f'https://github.com/ModelDBRepository/{repo}.git'],
            check=True,
            capture_output=True)
    # Find all of the text files.
    hoc = list(repo.glob("**/*.hoc"))
    ses = list(repo.glob("**/*.ses"))
    mod = list(repo.glob("**/*.mod"))
    inc = list(repo.glob("**/*.inc"))
    # Fix the character encoding.
    any_fixed = False
    any_failed = False
    for file in (hoc + ses + mod + inc):
        # Ignore hidden files.
        if file.name.startswith('.'):
            continue
        with file.open('rb') as f:
            raw = f.read()
        # Check if python can decode the string as UTF-8; if so, nothing to do.
        try:
            raw.decode('utf-8')
            continue
        except UnicodeDecodeError:
            pass
        # Try to figure out what encoding this data is using.
        detected_encoding = chardet.detect(raw)
        if detected_encoding['encoding'] in {'utf-8', 'ascii'}:
            continue
        if detected_encoding['confidence'] < .5:
            continue
        # Decode the bytes using the detected encoding.
        try:
            utf8 = raw.decode(detected_encoding['encoding'].lower())
        except Exception as err:
            failed_repos.append(f"{str(file)}: {str(err)}")
            any_failed = True
            break
        # Rewrite the file as UTF-8. The encoding is given explicitly so the
        # result does not depend on the locale, and newline='' keeps the
        # original line endings unchanged.
        with file.open('w', encoding='utf-8', newline='') as f:
            f.write(utf8)
        print('Fixed', detected_encoding['encoding'], file)
        any_fixed = True
    if any_failed:
        continue
    if any_fixed:
        changed_repos.append(repo)
        subprocess.run(['git', 'commit', '-a', '-m', commit_msg],
                       cwd=repo,
                       capture_output=True,
                       check=True)
    else:
        # Remove unchanged files.
        # subprocess.run(['rm', '-rf', str(repo)], check=True)
        pass

print()
print("FIXED MODELS:")
print('\n'.join(str(x.name) for x in changed_repos))
print("FAILURES:")
print('\n'.join(failed_repos))
Note that some models already have workarounds for this in the CI runs; see nrn-modeldb-ci/modeldb/modeldb-run.yaml, lines 1294 to 1295 in 5c00892.
Fixing the problem at the source would be much better.
Many of the models on modeldb contain files that are not UTF-8.
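They use an outdated text encoding format, referred to in this issue as "UCS-2". Strictly speaking, UCS-2 is a two-byte encoding; the files in question are more likely in a single-byte legacy encoding such as "ISO 8859-1" (Latin-1).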
I know this is not anyone's priority, but it would be nice if all of the files used UTF-8 encoding.
C++ does not care about this kind of stuff, but Python does: it raises a UnicodeDecodeError when you read these files as UTF-8 text.
It's possible to work around this issue in Python either by reading the files as raw bytes objects or by explicitly decoding them with the correct legacy encoding.
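The decoding workaround can be sketched as follows. This is a minimal illustration rather than code from the report: the helper name `read_text_lenient` is hypothetical, and the `latin-1` fallback is an assumption based on the ISO 8859-1 encoding named above. Other files may need a different codec.

```python
from pathlib import Path


def read_text_lenient(path, fallback="latin-1"):
    """Return the file's text, trying UTF-8 first and then a fallback codec.

    `fallback` is an assumed legacy encoding; pass a different codec name
    if the file is known to use something else.
    """
    # Reading raw bytes never raises a decoding error.
    raw = Path(path).read_bytes()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Decode explicitly with the legacy codec instead.
        return raw.decode(fallback)
```

Because `latin-1` maps every byte to a character, the fallback never fails, but it can silently produce wrong characters if the file actually uses another encoding.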
I wrote this quick script to find all of the files that use UCS-2.
It could be modified to automatically update the files to use UTF-8.
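For readers who want to repeat such a scan, here is a minimal sketch. It is not the original script: the function name `find_non_utf8` and the default file patterns are assumptions, with the patterns taken from the NEURON-related extensions used elsewhere in this thread. It only reports files and does not modify them.

```python
from pathlib import Path

# Assumed default: NEURON-related file extensions; adjust as needed.
PATTERNS = ("*.hoc", "*.ses", "*.mod", "*.inc")


def find_non_utf8(root, patterns=PATTERNS):
    """Return a sorted list of files under `root` that are not valid UTF-8."""
    bad = []
    for pattern in patterns:
        for path in Path(root).rglob(pattern):
            if not path.is_file():
                continue
            try:
                path.read_bytes().decode("utf-8")
            except UnicodeDecodeError:
                bad.append(path)
    return sorted(bad)
```

Calling `find_non_utf8("path/to/model")` on a checked-out model directory returns the offending paths, which could then be passed to a conversion step.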
And here is the list of UCS-2 files: