Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added config.ssh.retries and config.ssh.retry_interval to allow for SSH retries #13047

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

nmaludy
Copy link

@nmaludy nmaludy commented Jan 5, 2023

Closes #10776

When provisioning RHEL VMs, i noticed that the hosts become pingable before the SSH daemon is available. Watching the vagrant logs with --debug i noticed that the SSH connection was not being retried properly. Digging into the code, i also discovered that even with a properly setup retry counter, the retries were happening as fast as possible, not allowing time for the SSH daemon to become available.

To mitigate these problems, i added config.ssh.retries to allow for multiple SSH connection retry attempts before failing (default=5). I also added config.ssh.retry_interval to insert a sleep time (seconds) between each connection attempt, allowing time for the SSH daemon to come up.

This is my first PR to Vagrant. I tried to do the right things, but i'm sure i'm missing some stuff. If you all could help guide me through the contribution process, i would very much appreciate it!

@hashicorp-cla
Copy link

hashicorp-cla commented Jan 5, 2023

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@soapy1 soapy1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution!

It looks like the winrm communicator plugin has similar functionality with max_tries and retry-delay. It would be nice to have consistency between the names of these options, how they are implemented, and their default values.

This also needs unit tests. They are in the tests subdir.

@@ -192,6 +192,8 @@
optional :connect_timeout, :int64, 15
proto3_optional :ssh_command, :string, 16
proto3_optional :proxy_command, :string, 17
proto3_optional :retries, :int64, 18
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a generated file, it should not be edited

@@ -400,7 +400,16 @@ def connect(**opts)
raise Vagrant::Errors::SSHNotReady if ssh_info.nil?

# Default some options
opts[:retries] = 5 if !opts.key?(:retries)
opts[:retries] = if ssh_info.key?(:retries)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the ssh_info should probably already have the default value for the retries (coming from the config). So, there is no need to default this again. Instead, this can be opts[:retries] = ssh_info[:retries] if !opts.key?(:retries) (same for the retry_interval below). This also implies that the internal caller of this method can override the user defined config.

This also means that the ssh_info method in the Machine class needs to be updated to account for this new config.

@@ -134,6 +142,26 @@ def validate(machine)
given: @connect_timeout.to_s)
end

if !@retries.is_a?(Integer)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take a look at the validate method for the winrm communicator. It has some better validations steps. For example, it's ok to not have any retries or to have a retry interval between 0 and 1.

@@ -2,6 +2,8 @@ module VagrantPlugins
module Kernel_V2
class SSHConnectConfig < Vagrant.plugin("2", :config)
DEFAULT_SSH_CONNECT_TIMEOUT = 15
DEFAULT_SSH_RETRIES = 5
DEFAULT_SSH_RETRY_INTERVAL = 10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This number can probably come down some. The winrm communicator uses 2.

@avoidik
Copy link

avoidik commented Jan 17, 2023

Worth noting that the current implementation has no retries at all due to this line (it is always 1 and counter isn't reset), probably you'll be keen to address this quite old observation in your work. Cheers!

https://github.com/hashicorp/vagrant/blob/v2.4.1/plugins/communicators/ssh/communicator.rb#L90

@ashleemboyer
Copy link
Contributor

Hi @nmaludy! I've just merged in #13048, which modified 120 MDX files and added a new workflow that would be good to have run on this PR. Since this is from a fork, I'm not sure I have permissions to rebase on main for you, so I wanted to send this quick message to let you know to update with main. Thanks!

@soapy1 soapy1 removed this from the 2.3.5 milestone Feb 17, 2023
@avoidik
Copy link

avoidik commented Jan 28, 2024

hello all,

could you please suggest what can we do to finally close this issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add sleep period before retrying the ssh connection
5 participants