ModelDB contains files with invalid UTF-8 #91

ctrl-z-9000-times · 2023-07-07T01:05:39Z

Many of the models on modeldb contain files that are not UTF-8.
They use an outdated text encoding, most likely "ISO 8859-1" (Latin-1), which I loosely refer to as "UCS-2" (strictly speaking, UCS-2 is a different, two-byte encoding).

https://www.ibm.com/docs/en/i/7.1?topic=unicode-ucs-2-its-relationship-utf-16

I know this is not anyone's priority, but it would be nice if all of the files used UTF-8 encoding.
C++ does not care about this kind of stuff, but Python does: reading one of these files in text mode as UTF-8 raises a UnicodeDecodeError.
It's possible to work around this issue in Python by either using raw bytes objects or by explicitly decoding the file as ISO 8859-1.
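
A minimal sketch of that workaround, assuming the legacy files really are ISO 8859-1 (Latin-1). The helper name read_model_text is hypothetical and not part of any ModelDB or NEURON tooling:

```python
from pathlib import Path


def read_model_text(path, fallback="iso-8859-1"):
    """Return the file's text, trying UTF-8 first and then a fallback.

    The fallback encoding is an assumption. ISO 8859-1 never fails to
    decode, so a wrong guess yields wrong characters, not an error.
    """
    raw = Path(path).read_bytes()
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)
```

Files that are already valid UTF-8 (including plain ASCII) come back unchanged, so the helper can be applied to every model file without first sorting them by encoding.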

I wrote this quick script to find all of the files that are not valid UTF-8.
It could be modified to automatically update the files to use UTF-8.

from pathlib import Path
import zipfile
cached_dir = Path.cwd().joinpath('cache')

# Unzip all of the files.
for file in cached_dir.glob("*.zip"):
    out = cached_dir.joinpath(file.stem)
    out.mkdir(exist_ok=True)
    with zipfile.ZipFile(file, 'r') as zip_ref:
        print(file)
        zip_ref.extractall(out)

# Find all of the text files.
hoc = list(cached_dir.glob("**/*.hoc"))
ses = list(cached_dir.glob("**/*.ses"))
mod = list(cached_dir.glob("**/*.mod"))
inc = list(cached_dir.glob("**/*.inc"))

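for file in (hoc + ses + mod + inc):
    # Skip hidden files.
    if file.name.startswith('.'):
        continue
    raw = file.read_bytes()
    try:
        # Decode explicitly as UTF-8 instead of relying on the locale default.
        raw.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8. ISO 8859-1 maps every byte to a character, so
        # decoding with it cannot fail; just report the offending file.
        print(file.relative_to(cached_dir))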

And here is the list of files that are not valid UTF-8:

DendroDendriticInhibition/ShortDendrite/bulb.hoc
DendroDendriticInhibition/LongDendrite/bulb.hoc
190559/MiglioreEJN2016/soma.hoc
253369/LombardiEtAl2019/Isolated_Dendrite_tauNKCC1__Fig9/GABA-Stim_2xPSC_Spatial_Sum_isolated_dendrite.hoc
253369/LombardiEtAl2019/Isolated_Dendrite_tauNKCC1__Fig9/GABA-Stim_2xPSC_Spatial_Sum_isolated_dendrite_enlarged_tauNKCC1.hoc
147460/OverstreetEtAl2013/Interneuron/NEURON_code/anat_type2.hoc
147460/OverstreetEtAl2013/Interneuron/NEURON_code/anat_type8.hoc
147460/OverstreetEtAl2013/Interneuron/NEURON_code/anat_type1.hoc
147460/OverstreetEtAl2013/Interneuron/NEURON_code/anat_type5.hoc
147460/OverstreetEtAl2013/Interneuron/NEURON_code/anat_type7.hoc
147460/OverstreetEtAl2013/Interneuron/NEURON_code/anat_type6.hoc
147460/OverstreetEtAl2013/Interneuron/NEURON_code/anat_type3.hoc
147460/OverstreetEtAl2013/Interneuron/NEURON_code/anat_type4.hoc
147460/OverstreetEtAl2013/Pyramidal/NEURON_code/anat_type14.hoc
147460/OverstreetEtAl2013/Pyramidal/NEURON_code/anat_type14sym.hoc
147460/OverstreetEtAl2013/Pyramidal/NEURON_code/anat_type11.hoc
147460/OverstreetEtAl2013/Pyramidal/NEURON_code/anat_type12sym.hoc
147460/OverstreetEtAl2013/Pyramidal/NEURON_code/anat_type10sym.hoc
147460/OverstreetEtAl2013/Pyramidal/NEURON_code/anat_type12.hoc
147460/OverstreetEtAl2013/Pyramidal/NEURON_code/anat_type10.hoc
147460/OverstreetEtAl2013/Pyramidal/NEURON_code/anat_type15.hoc
147460/OverstreetEtAl2013/Pyramidal/NEURON_code/anat_type13sym.hoc
147460/OverstreetEtAl2013/Pyramidal/NEURON_code/anat_type13.hoc
147460/OverstreetEtAl2013/Pyramidal/NEURON_code/anat_type9sym.hoc
147460/OverstreetEtAl2013/Pyramidal/NEURON_code/anat_type9.hoc
50207/NMDA_Mg/nmda_demo.hoc
106551/nc-mri/Neuron1.hoc
106551/nc-mri/synapses.hoc
123897/HuEtAl2009/experiment/Pyramidal/inject_soma.hoc
123897/HuEtAl2009/experiment/Pyramidal/failureThres.hoc
146376/reduction1.0/useful&InitProc.hoc
21329/inhibnet/netring.hoc
229276/Final/createsimulation.hoc
229276/Final/template.hoc
229276/Final/synapses/synapses.hoc
169208/YoungEtAl2013/NEURON files/velocity/ALG_PARCHA-130523.hoc
169208/YoungEtAl2013/NEURON files/voltage/ALG_PARCHA-130523.hoc
116094/DendroDendriticInhibition/ShortDendrite/bulb.hoc
116094/DendroDendriticInhibition/LongDendrite/bulb.hoc
SFS-IPS-WM-network/ECell.hoc
SFS-IPS-WM-network/Results.hoc
SFS-IPS-WM-network/MultiModuleWMNetXP.hoc
SFS-IPS-WM-network/ICell.hoc
SFS-IPS-WM-network/Net.hoc
SFS-IPS-WM-network/LabCell.hoc
127021/Golgi_cell_NaKATPAse/Synapses.hoc
127021/Golgi_cell_NaKATPAse/Channel_dynamics.hoc
127021/Golgi_cell_NaKATPAse/Save_data.hoc
127021/Golgi_cell_NaKATPAse/Golgi_ComPanel.hoc
127021/Golgi_cell_NaKATPAse/utils.hoc
139656/network/Golgi_ComPanel.hoc
139656/network/utils.hoc
150288/KimEtAl2013/LA_model_main_file.hoc
108458/KampaStuart2006/runRi18.hoc
267106/Lodge_2021_Cell_Rep_GC_Models/GC-Ball.hoc
206244/CA1_multi/experiment/cell-setup_regular.hoc
206244/CA1_multi/experiment/cell-setup_div2.hoc
206244/CA1_multi/experiment/cell-setup_mul2.hoc
186977/Avella_GonzalezEtAl2015/templates/presets_WT1.hoc
184732/FietkiewiczEtAl2016/initialize.hoc
184732/FietkiewiczEtAl2016/main.hoc
184732/FietkiewiczEtAl2016/cellTemplates.hoc
184732/FietkiewiczEtAl2016/run.hoc
DiFrancescoNoble1985/cellinit.hoc
113732/SS-cortex/wiring-config_thresh_percieved.hoc
Golgi_cell/Synapses.hoc
Golgi_cell/Channel_dynamics.hoc
Golgi_cell/Golgi_template.hoc
Golgi_cell/Save_data.hoc
Golgi_cell/Golgi_ComPanel.hoc
Golgi_cell/utils.hoc
network/Golgi_ComPanel.hoc
network/utils.hoc
156120/HAE_LAE_Netk/templates/presets_WT1.hoc
267189/Crbl_tDCS_Zhang2021/DCN/DCN_simulation.hoc
140789/DG_BC/NEURON-models/DG-BasketCell1.hoc
140789/DG_BC/NEURON-models/DG-BasketCell4.hoc
140789/DG_BC/NEURON-models/DG-BasketCell5.hoc
140789/DG_BC/NEURON-models/DG-BasketCell3.hoc
140789/DG_BC/NEURON-models/DG-BasketCell6.hoc
140789/DG_BC/NEURON-models/DG-BasketCell2.hoc
140789/DG_BC/Figure_2/specifiy-BC6.hoc
140789/DG_BC/Figure_2/pipettes.hoc
140789/DG_BC/Figure_2/morph-BC6.hoc
140789/DG_BC/Figure_2/run-BC6.hoc
149739/ACh_ModelDB/OB.hoc
155705/AvellaEtAl2014/Two_netsPaper/templates/presets_WT1.hoc
157157/SaudargieneEtAl2015/main.hoc
2730/bulbNet/bulb.hoc
185513/SudhakarEtAl2015/DCN_params.hoc
144520/DiFrancescoNoble1985/cellinit.hoc
251493/EbnerEtAl2019/Fig2B.hoc
251493/EbnerEtAl2019/Fig3.hoc
266578/MaEtAl2020/2_compartment_template.hoc
266578/MaEtAl2020/SC_template.hoc
266578/MaEtAl2020/PF_template.hoc
Chloride_Model/init_ClmIPSCs_GC.hoc
Chloride_Model/init_ClmIPSCs_GC_single.hoc
108459/LetzkusEtAl2006/runRi18.hoc
7907/dendritica-1.1/batch_back/*
150551/AshhadNarayanan2013/CalciumWave.hoc
112685/Golgi_cell/Synapses.hoc
112685/Golgi_cell/Channel_dynamics.hoc
112685/Golgi_cell/Golgi_template.hoc
112685/Golgi_cell/Save_data.hoc
112685/Golgi_cell/Golgi_ComPanel.hoc
112685/Golgi_cell/utils.hoc
150024/CNModel_May2013/DCN_params_fi_init.hoc
150024/CNModel_May2013/DCN_params.hoc
150024/CNModel_May2013/DCN_params_axis.hoc
150024/CNModel_May2013/DCN_params_rebound.hoc
144523/LuthmanEtAl2011/DCN_simulation.hoc
98017/SFS-IPS-WM-network/ECell.hoc
98017/SFS-IPS-WM-network/Results.hoc
98017/SFS-IPS-WM-network/MultiModuleWMNetXP.hoc
98017/SFS-IPS-WM-network/ICell.hoc
98017/SFS-IPS-WM-network/Net.hoc
98017/SFS-IPS-WM-network/LabCell.hoc
3800/cardiac1998/aboutatrial.hoc
149000/PurkReductionOnLine/useful&InitProc.hoc
Crbl_tDCS_Zhang2021/DCN/DCN_simulation.hoc
YoungEtAl2013/NEURON files/velocity/ALG_PARCHA-130523.hoc
YoungEtAl2013/NEURON files/voltage/ALG_PARCHA-130523.hoc
148253/Chloride_Model/init_ClmIPSCs_GC.hoc
148253/Chloride_Model/init_ClmIPSCs_GC_single.hoc
229750/DDnet/net_dd_emodel.hoc
144482/Pyramidal_STDP_Gomez_Delgado_2010/morphology/cell_1.hoc
144482/Pyramidal_STDP_Gomez_Delgado_2010/experiment/graphs.hoc
144482/Pyramidal_STDP_Gomez_Delgado_2010/experiment/Protocols.hoc
144482/Pyramidal_STDP_Gomez_Delgado_2010/experiment/init.hoc
144482/Pyramidal_STDP_Gomez_Delgado_2010/experiment/savedata.hoc
64229/netmod/parbulbNet/bulb.hoc
64229/netmod/parbulbNet/par_bulb.hoc
145836/MoradiEtAl2012/SynExp2NMDA.mod
145836/MoradiEtAl2012/SynExp3NMDA2.mod
145836/MoradiEtAl2012/SynExp3NMDA.mod
126637/purkinje_ppr/Leak.mod
261714/Cav23/Cav23.mod
121060/MSN2009/chan_inKIR.mod


ramcdougal · 2023-07-07T04:47:37Z

Not a CI issue, but agree it would be nice to standardize.

To anybody reading this: pull requests to the relevant repositories at https://github.com/modeldbrepository are welcome.

Note that this isn't limited to NEURON models.

ctrl-z-9000-times · 2023-07-08T17:55:11Z

Hi, I opened a whole bunch of PRs.

Here is a list of all of the models that I opened PRs against:
190559
253369
147460
50207
106551
146376
145836
169208
116094
127021
139656
150288
108458
126637
267106
206244
186977
184732
156120
267189
261714
140789
149739
155705
157157
2730
185513
144520
251493
266578
121060
108459
150551
112685
150024
144523
3800
149000
148253
144482
64229
229276
123897
229750
98017 - This PR is non-trivial, see the comments on this PR.

I opened an issue with the following repo:
21329

I could not determine the text encoding for the following models.
185121
7907

Most of these changes are very simple, trivial changes to comments.
Take your time reviewing these! I did this quickly because I wrote a program to do it for me, but reviewing and approving these changes is by necessity a manual and time-consuming process.
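
For reviewers, one way to spot-check a converted file is to list only the lines that contain non-ASCII characters, since those are the only lines a re-encoding should have touched. A minimal sketch, where the helper name non_ascii_lines is hypothetical and the input is text that has already been decoded:

```python
def non_ascii_lines(text):
    """Return (line_number, line) pairs for lines with non-ASCII characters."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if not line.isascii()
    ]
```

Calling it on the UTF-8 text of a converted file gives a short list to compare against the PR diff; if the diff touches lines outside that list, something other than the encoding changed.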

ctrl-z-9000-times · 2023-07-08T18:01:52Z

And here are the tools I used to do this, in case anyone else encounters legacy character encoding formats in the future:

Table of ancient character encodings: https://en.wikipedia.org/wiki/Western_Latin_character_sets_(computing)
chardet is a tool to figure out which encoding a file uses: https://pypi.org/project/chardet/
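
One reason to detect the encoding rather than assume ISO 8859-1 everywhere: ISO 8859-1 assigns a character to every byte, so decoding with it never raises an error, even when the file is really Windows-1252 or something else. A small sketch with made-up bytes (not taken from any ModelDB file):

```python
# Hypothetical input: the bytes 0x93 and 0x94 are curly quotes in
# Windows-1252 but invisible control characters in ISO 8859-1.
raw = b"\x93quoted\x94 text"

as_latin1 = raw.decode("iso-8859-1")
as_cp1252 = raw.decode("cp1252")

print(repr(as_latin1))  # control characters around the word
print(repr(as_cp1252))  # curly quotation marks around the word
```

Both calls succeed, but only one result is the text the author intended, which is why the converted files still need a human look.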

Here is the messy program that I wrote:

import requests
from pathlib import Path
import subprocess
import json
import os
import sys
import chardet

commit_msg = "Convert character encoding to UTF-8"
user = "YOUR GITHUB USERNAME GOES HERE"
token = "YOUR GITHUB TOKEN GOES HERE"

# Make a cache dir to hold all of the temp files.
cache_dir = Path.cwd().joinpath('tmp_modeldb_repos')
cache_dir.mkdir(exist_ok=True)
os.chdir(cache_dir)

repo_list = []
if True:
    # Get a list of all github repo's owned by user "ModelDBRepository"
    page = 1
    per_page = 100
    while True:
        print(f'Talking to GitHub, requesting info for repos {(page-1)*per_page+1} - {page*per_page} ... ', flush=True, end='')
        page_data = requests.get(
                'https://api.github.com/users/ModelDBRepository/repos',
                params={'page': str(page), 'per_page': str(per_page)},
                auth=(user, token))
        print('status code:', page_data.status_code)
        if page_data.status_code != 200:
            print(page_data.text)
            sys.exit()
        page_data = json.loads(page_data.text)
        page_data = [x['name'] for x in page_data]
        repo_list.extend(page_data)
        page += 1
        if not page_data:
            break
elif True:
    # Use the models that are already downloaded.
    repo_list = list(x.name for x in cache_dir.iterdir())
else:
    # Or process a single hard-coded model, for debugging.
    repo_list = ['98017']

print("REPO LIST:", ', '.join(str(x) for x in repo_list), '\n')
print("NUM REPOS:", len(repo_list), '\n')

# Check each repo for non-UTF-8 text files.
changed_repos = []
failed_repos = []
repo_list = [Path(str(x)) for x in repo_list]
for repo in repo_list:

    # Download the git repository.
    if not repo.exists():
        subprocess.run(
                ['git', 'clone', f'https://github.com/ModelDBRepository/{repo}.git'],
                check=True,
                capture_output=True)

    # Find all of the text files.
    hoc = list(repo.glob("**/*.hoc"))
    ses = list(repo.glob("**/*.ses"))
    mod = list(repo.glob("**/*.mod"))
    inc = list(repo.glob("**/*.inc"))

    # Fix the character encoding.
    any_fixed = False
    any_failed = False
    for file in (hoc + ses + mod + inc):
        # Ignore hidden files.
        if file.name.startswith('.'):
            continue
        with file.open('rb') as f:
            raw = f.read()
        # Skip files that are already valid UTF-8.
        try:
            raw.decode('utf-8')
            continue
        except UnicodeDecodeError:
            pass
        # Try to figure out what encoding this data is using.
        detected_encoding = chardet.detect(raw)
        if detected_encoding['encoding'] in {'utf-8', 'ascii'}:
            continue
        if detected_encoding['confidence'] < .5:
            continue
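        # Decode the raw bytes using the detected encoding.
        try:
            utf8 = raw.decode(detected_encoding['encoding'].lower())
        except Exception as err:
            failed_repos.append(f"{str(file)}: {str(err)}")
            any_failed = True
            break
        # Write the text back as UTF-8, regardless of the locale default,
        # and without translating line endings.
        with file.open('wt', encoding='utf-8', newline='') as f:
            f.write(utf8)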
        print('Fixed', detected_encoding['encoding'], file)
        any_fixed = True

    if any_failed:
        continue

    if any_fixed:
        changed_repos.append(repo)
        subprocess.run(['git', 'commit', '-a', '-m', commit_msg],
                cwd=repo,
                capture_output=True,
                check=True)
    else:
        # Remove unchanged files.
        # subprocess.run(['rm', '-rf', str(repo)], check=True)
        pass

print()
print("FIXED MODELS:")
print('\n'.join(str(x.name) for x in changed_repos))
print("FAILURES:")
print('\n'.join(failed_repos))

olupton · 2023-07-17T08:09:44Z

Note that some models already have workarounds for this in the CI runs using iconv:

nrn-modeldb-ci/modeldb/modeldb-run.yaml, lines 1294 to 1295 in 5c00892:

    - iconv -f LATIN1 -t UTF-8 template.hoc.iconv.bak > template.hoc
    - iconv -f LATIN1 -t UTF-8 createsimulation.hoc.iconv.bak > createsimulation.hoc

Fixing the problem at source would be much better.
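
For anyone without iconv at hand, roughly the same conversion can be sketched in Python. This assumes the input really is Latin-1, as the CI workaround does; the function name latin1_to_utf8 is made up for illustration:

```python
from pathlib import Path


def latin1_to_utf8(src, dst):
    """Re-encode src (assumed to be ISO 8859-1) as UTF-8 and write it to dst."""
    text = Path(src).read_bytes().decode("iso-8859-1")
    # Working on bytes keeps the original line endings untouched.
    Path(dst).write_bytes(text.encode("utf-8"))
```

Like the iconv lines above, this reads from one path and writes to another, so the original file can be kept as a backup until the result has been checked.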
