About _epoch_train and _epoch_val #7

Open
fireholder opened this issue Aug 5, 2019 · 21 comments

Comments

@fireholder

While training, I ran into a problem: progress came to a standstill. I traced it to the functions _epoch_train and _epoch_val, which raise NotImplementedError. I wonder why that is and how to fix it.
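
One common reason for methods like these to raise NotImplementedError is that they are placeholders in a base class, meant to be overridden by a subclass that is the one actually run. Whether trainer.py is organised this way is an assumption here; the class names below (BaseTrainer, ToyTrainer) are hypothetical and only illustrate the pattern.

```python
class BaseTrainer:
    """Hypothetical base class: owns the loop, leaves the epoch steps abstract."""

    def train(self, epochs):
        history = []
        for _ in range(epochs):
            train_loss = self._epoch_train()
            val_loss = self._epoch_val()
            history.append((train_loss, val_loss))
        return history

    def _epoch_train(self):
        # Placeholder: a subclass is expected to override this.
        raise NotImplementedError

    def _epoch_val(self):
        # Placeholder: a subclass is expected to override this.
        raise NotImplementedError


class ToyTrainer(BaseTrainer):
    """Hypothetical subclass that supplies the two epoch steps."""

    def _epoch_train(self):
        return 1.0  # made-up training loss

    def _epoch_val(self):
        return 0.5  # made-up validation loss


history = ToyTrainer().train(epochs=2)
```

If the pattern applies, the error only appears when the base class itself is instantiated and trained; running the subclass goes through the overriding methods instead.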

@Ike-yang

Ike-yang commented Aug 5, 2019

Hi, I am trying to run trainer.py, but I don't understand the argument "--load_model_path". There is nothing in the current folder, and I am not sure what kind of pretrained model needs to be loaded here. Any advice?

@fireholder
Author

I think '--load_model_path' is only used when starting from a pretrained model, but log.txt shows errors when no model files are loaded.

@Ike-yang

Ike-yang commented Aug 5, 2019

Exactly, I got something like this in the logs.txt file:
Vocab Size:1173
[Load Model Failed] [Errno 2] No such file or directory: ''
[Load Model Failed] [Errno 21] Is a directory: '.'
[Load MLC Failed [Errno 21] Is a directory: '.'!]
[Load Co-attention Failed [Errno 21] Is a directory: '.'!]
[Load Sentence model Failed [Errno 21] Is a directory: '.'!]
[Load Word model Failed [Errno 21] Is a directory: '.'!]
Namespace(attention_version='v4', batch_size=16, caption_json='./data/new_data/.......

I thought the program had simply stopped here because of the error messages.
So can I just ignore the messages and keep training?
Are there other places that need to be modified?
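
The two error codes in the log point at the path values rather than at a crash: Errno 2 is reported for the empty string '' and Errno 21 for the directory '.'. A minimal sketch of a guard that skips loading in both cases is shown below. The helper name load_if_file and the loader argument are hypothetical and not part of the repository.

```python
import os


def load_if_file(path, loader):
    """Call loader(path) only when path names an existing regular file.

    Returns None when there is nothing to load, so the caller can fall
    back to training from scratch.
    """
    if not path:
        return None  # '' would fail with Errno 2 (no such file or directory)
    if not os.path.isfile(path):
        return None  # '.' would fail with Errno 21 (is a directory)
    return loader(path)


# Neither of the two path values seen in the log triggers a load attempt.
empty_result = load_if_file("", loader=open)
directory_result = load_if_file(".", loader=open)
```

With PyTorch, the loader passed in would typically be torch.load; that choice is left to the caller here so the sketch runs without it.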

@fireholder
Author

I found that it has not stopped; it just isn't printing anything.

@Ike-yang

Ike-yang commented Aug 6, 2019

Yeah, I left it running all night, but val_loss is always 0 in logs.txt. Something must be wrong and needs to be modified.

@fireholder
Author

That is because every validation loss is set to 0 in '_epoch_val'. You can try uncommenting the code in '_epoch_val'. My train loss is very large, though. Is it the same for you? By the way, have you tried the tester?

@Ike-yang

Ike-yang commented Aug 6, 2019

Yes, extremely large train loss. Haven't tried the tester yet

@Ike-yang

Ike-yang commented Aug 7, 2019

I tried tester.py and it does not work: in some places a tensor needs to be converted with tensor.cpu(). Have you managed to run tester.py all the way through?

@fireholder
Author

Yes, just convert to tensor.cpu() as the error suggested.

@fireholder
Author

fireholder commented Aug 9, 2019 via email

@Cao-Shuang

Cao-Shuang commented Aug 10, 2019 via email

@fireholder
Author

not yet

@ShivamPanchal

ShivamPanchal commented Sep 8, 2019

When I run
python tester.py
I get this error:

FileNotFoundError: [Errno 2] No such file or directory: './data/new_data/debug_vocab.pkl'

@CinKKKyo

Did you guys run into a problem like this? "

WARNING:tensorflow:From /content/drive/Shared drives/shared drive-zma/ACL18/utils/logger.py:15: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

Traceback (most recent call last):
  File "/content/drive/Shared drives/shared drive-zma/ACL18/trainer.py", line 662, in <module>
    debugger.train()
  File "/content/drive/Shared drives/shared drive-zma/ACL18/trainer.py", line 60, in train
    train_tag_loss, train_stop_loss, train_word_loss, train_loss = self._epoch_train() #???
  File "/content/drive/Shared drives/shared drive-zma/ACL18/trainer.py", line 402, in _epoch_train
    batch_tag_loss = self.mse_criterion(tags, self._to_var(label, requires_grad=False)).sum() # ???
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/loss.py", line 431, in forward
    return F.mse_loss(input, target, reduction=self.reduction)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 2203, in mse_loss
    expanded_input, expanded_target = torch.broadcast_tensors(input, target)
  File "/usr/local/lib/python3.6/dist-packages/torch/functional.py", line 52, in broadcast_tensors
    return torch._C._VariableFunctions.broadcast_tensors(tensors)

RuntimeError: The size of tensor a (210) must match the size of tensor b (0) at non-singleton dimension 1
"
It really confuses me. Could anyone help me out? Thanks!
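
The message says the second tensor has size 0 along dimension 1, which suggests that the label tensor passed to the loss carries no tags at all for the batch. Why it is empty cannot be told from the traceback. The sketch below only shows one way to fail earlier with a clearer message; check_loss_shapes is a hypothetical helper, and the batch size of 16 is assumed for illustration.

```python
def check_loss_shapes(pred_shape, target_shape, name="tags"):
    """Raise a descriptive error if the two shapes cannot be compared elementwise."""
    pred_shape, target_shape = tuple(pred_shape), tuple(target_shape)
    if 0 in target_shape:
        raise ValueError(
            f"{name}: target is empty (shape {target_shape}); "
            "check that labels were loaded for this batch"
        )
    if pred_shape != target_shape:
        raise ValueError(
            f"{name}: prediction shape {pred_shape} != target shape {target_shape}"
        )


# Shapes matching the reported error: 210 predicted tags, 0 label entries.
try:
    check_loss_shapes((16, 210), (16, 0))
    message = None
except ValueError as err:
    message = str(err)
```

With PyTorch tensors, tags.shape and label.shape can be passed directly, since a tensor's shape behaves like a tuple of ints.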

@mfilipav

mfilipav commented Dec 3, 2019

However, my test results are all the same. All my predicted captions are identical.


Hi @fireholder! Did you eventually give up trying to solve the issue? Were all the predicted captions always identical?

@yangyan22

My train loss is also very large, and all my predicted captions are the same: "No acute cardiopulmonary abnormality". Could anyone help me? Thanks! Could this be a Python 2 vs. Python 3 difference? I am using Python 3.

@AnkitMalviya

Yes, extremely large train loss. Haven't tried the tester yet

Hi, were you able to decrease the loss? I am also facing the same issue.

@AnkitMalviya

I have the same caption too. Have you found the reason?

I am also facing the same issue. Were you able to solve this?

@Alsalivan

My train loss is also very large, and all my predicted captions are the same: "No acute cardiopulmonary abnormality". Could anyone help me? Thanks! Could this be a Python 2 vs. Python 3 difference? I am using Python 3.

My guess is that the train loss is large because the author uses MSELoss for predicting tags. With 156 different tags, the squared term can be on the order of (156-0)^2 = 24336, which would explain such a large loss.

You can change it to L1Loss, or decrease the lambda argument for the tag loss (if you find that reasonable).
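
To make the scale argument concrete, here is a small plain-Python sketch (not the repository's code). It assumes the per-tag errors are summed rather than averaged, as the `.sum()` in the traceback earlier in this thread suggests, and it uses a made-up per-tag error of 3.0 across the 156 tags mentioned above. The function names and the 0.5 weight are illustrative only.

```python
def squared_error_sum(pred, target):
    """Sum of squared differences, i.e. a summed MSE-style loss."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def absolute_error_sum(pred, target):
    """Sum of absolute differences, i.e. a summed L1-style loss."""
    return sum(abs(p - t) for p, t in zip(pred, target))


num_tags = 156                  # figure taken from the comment above
pred = [3.0] * num_tags         # made-up predictions, each off by 3.0
target = [0.0] * num_tags

mse_style = squared_error_sum(pred, target)    # 156 * 9.0 = 1404.0
l1_style = absolute_error_sum(pred, target)    # 156 * 3.0 = 468.0

lambda_tag = 0.5                # hypothetical weight for the tag loss
weighted_l1 = lambda_tag * l1_style            # 234.0
```

Both suggestions act on the same quantity: switching to an absolute error removes the squaring of large per-tag errors, and a smaller lambda scales down whatever the tag loss contributes to the total. Note that for per-tag errors below 1 the squared error is the smaller of the two.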

@Hareem1997

In the debugger.py and tester.py files of this project, I get an error at the third-to-last line of the following section of code.
` tag_loss += self.args.lambda_tag * batch_tag_loss.data
stop_loss += self.args.lambda_stop * batch_stop_loss.data
word_loss += self.args.lambda_word * batch_word_loss.data
loss += batch_loss.data

return tag_loss, stop_loss, word_loss, loss`

The error is:
  File "D:/Hareem/Auto_report/debugger.py", line 61, in train
    train_tag_loss, train_stop_loss, train_word_loss, train_loss = self._epoch_train()
  File "D:/Hareem/Auto_report/debugger.py", line 424, in _epoch_train
    word_loss += self.args.lambda_word * batch_word_loss.data
AttributeError: 'int' object has no attribute 'data'
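
The AttributeError means batch_word_loss was still a plain Python int when this line ran. A possible cause, stated here as an assumption, is that it starts as the integer 0 and only becomes a tensor inside a loop that did not run for that batch. The sketch below shows a helper that accepts both cases; loss_value, FakeLoss and the weight of 1.0 are hypothetical, and FakeLoss only stands in for a one-element tensor so the example runs without PyTorch.

```python
def loss_value(loss):
    """Return a plain float from either a tensor-like loss or a Python number."""
    # One-element tensors expose .item(); Python ints and floats do not.
    if hasattr(loss, "item"):
        return float(loss.item())
    return float(loss)


class FakeLoss:
    """Stand-in for a one-element tensor (illustration only)."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value


lambda_word = 1.0  # hypothetical weight
word_loss = 0.0
# The second batch leaves the loss as the integer 0, as in the reported error.
for batch_word_loss in (FakeLoss(2.5), 0):
    word_loss += lambda_word * loss_value(batch_word_loss)
```

Accumulating plain floats this way also keeps the running totals independent of the autograd graph.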

@domyown

domyown commented Dec 8, 2021

Has anybody solved the problem of all predicted captions being the same?
