Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent reactor_netty_http_client_* and reactor_netty_tcp_client_* metrics when proxy in use #3060

Closed
jtorkkel opened this issue Feb 14, 2024 · 6 comments · Fixed by #3074 or #3090
Assignees
Labels
type/bug A general bug
Milestone

Comments

@jtorkkel
Copy link

Current behauviour:

Without proxy springFW "http_client_requests_seconds_count" "client_name" label report same address as reactor_netty_http_client_* and reactor_netty_tcp_client_* "remote_address" label.

Example correctly reported by reactor_netty_http_client

http_client_requests_seconds_count{client_name="petstore.swagger.io",error="none",exception="none",method="GET",outcome="SUCCESS",status="200",uri="none",} 6.0
reactor_netty_http_client_data_sent_bytes_count{remote_address="petstore.swagger.io:443",uri="http",} 3.0

With proxy enabled the reactor_netty reports

  • proxy address
  • real address
  • both proxy address and real address .

http and tcp reports differently.

Example when both reported

http_client_requests_seconds_count{client_name="petstore.swagger.io",error="none",exception="none",method="GET",outcome="SUCCESS",status="200",uri="none",} 6.0
reactor_netty_http_client_data_sent_bytes_count{remote_address="proxy:8080",uri="/",} 3.0
reactor_netty_http_client_data_sent_bytes_count{remote_address="petstore.swagger.io:443",uri="http",} 3.0

Expected behaviour

  • http and tcp should report consistent way
  • http should report remote_address values consistent way, either report real address or if both reported then maybe separate "remote_proxy_address" label should be used to report proxy address (proxy address or empty/none) in addition to remote_address

Reproduction steps:

Create @RestController GET /proxy endpoint, which make proxy http client (you needs to use your own proxy like "proxy:8080") GET call to "https://petstore.swagger.io/v2/swagger.json", enable either reactor_netty_tcp or http_metrics

    @GetMapping("/proxy")
    public Mono<Map<String, Object>> proxy() {
        ParameterizedTypeReference<Map<String, Object>> typeRef
                = new ParameterizedTypeReference<>() {
        };
        return webClient
                .get()
                .uri(URI.create("https://petstore.swagger.io/v2/swagger.json"))
                .retrieve()
                .bodyToMono(typeRef)
                .doOnCancel(() -> new Exception().printStackTrace());
    }

@Configuration
@EnableConfigurationProperties(ReactorConnectorProperties.class)
public class WebClientConfiguration {
    private final ReactorConnectorProperties properties;

    public WebClientConfiguration(ReactorConnectorProperties properties) {
        this.properties = properties;
    }

    @Bean
    public WebClient webClient(WebClient.Builder webClientBuilder) throws SSLException {

        // create TCP first, needed to enable tcp metrics
        /* 
        TcpClient tcpClient = TcpClient
                .create(ConnectionProvider
                        .builder("example-pool")
                        .maxConnections(properties.getMaxConnections())
                        .pendingAcquireTimeout(properties.getAcquireTimeout())
                        .pendingAcquireMaxCount(properties.getAcquireMaxCount())
                        .maxIdleTime(properties.getIdleTimeout())
                        .maxLifeTime(properties.getMaxLifeTime())
                        .metrics(properties.isNettyTcpMetricsEnabled()) // tcpClient.config.connectionProvider.builder.metricsEnabled, not working alone
                        .evictInBackground(properties.getEvictInBackground())  // 0 if not set
                        //.disposeTimeout()
                        .build())
                .metrics(properties.isNettyTcpMetricsEnabled()) // tcpClient.config.metricsRecorder, works alone
                .option(CONNECT_TIMEOUT_MILLIS, (int) properties.getConnectTimeout().toMillis())
                .doOnConnected(connection -> connection
                        .addHandlerLast(new ReadTimeoutHandler(properties.getReadTimeout().toMillis(), MILLISECONDS))
                        .addHandlerLast(new WriteTimeoutHandler(properties.getWriteTimeout().toMillis(), MILLISECONDS))
                );

        if (properties.getProxyHost().length() > 0) {
            //httpClient = httpClient.proxyWithSystemProperties();
            tcpClient = tcpClient.proxy(spec -> spec.type(ProxyProvider.Proxy.HTTP)
                    .host(properties.getProxyHost())
                    .port(properties.getProxyPort())
                    .nonProxyHosts("localhost")
                    .connectTimeoutMillis(20_000));
            SslContext sslContext = SslContextBuilder
                    .forClient()
                    .trustManager(InsecureTrustManagerFactory.INSTANCE)
                    .build();

            tcpClient = tcpClient.secure(t -> t.sslContext(sslContext));

        }
        HttpClient httpClient = HttpClient.from(tcpClient);
        if (properties.isNettyHttpMetricsEnabled()) {
            httpClient = httpClient.metrics(true, s -> {
                if (s.startsWith("/mx/")) {
                    return "/mx/{n}";
                }
                return s;
            });  // overwrite tcp one as same metrics entry used
        }

        */

        
        // use http client directly
        HttpClient httpClient = HttpClient
                .create(ConnectionProvider
                        .builder("example-pool")
                        .maxConnections(properties.getMaxConnections())
                        .pendingAcquireTimeout(properties.getAcquireTimeout())
                        .pendingAcquireMaxCount(properties.getAcquireMaxCount())
                        .maxIdleTime(properties.getIdleTimeout())
                        .maxLifeTime(properties.getMaxLifeTime())
                        .fifo()
                        .evictInBackground(properties.getEvictInBackground())  // 0 if not set
                        //.disposeTimeout()
                        .build())
                .option(CONNECT_TIMEOUT_MILLIS, (int) properties.getConnectTimeout().toMillis())
                .doOnConnected(connection -> connection
                        .addHandlerLast(new ReadTimeoutHandler(properties.getReadTimeout().toMillis(), MILLISECONDS))
                        .addHandlerLast(new WriteTimeoutHandler(properties.getWriteTimeout().toMillis(), MILLISECONDS))
                );

        if (properties.getProxyHost().length() > 0) {
            //httpClient = httpClient.proxyWithSystemProperties();
            httpClient = httpClient.proxy(spec -> spec.type(ProxyProvider.Proxy.HTTP)
                    .host(properties.getProxyHost())
                    .port(properties.getProxyPort())
                    .nonProxyHosts("localhost")
                    .connectTimeoutMillis(20_000));
            SslContext sslContext = SslContextBuilder
                    .forClient()
                    .trustManager(InsecureTrustManagerFactory.INSTANCE)
                    .build();

            httpClient = httpClient.secure(t -> t.sslContext(sslContext));

        }

        if (properties.isNettyHttpMetricsEnabled()) {
            //httpClient2 = httpClient2.metrics(true, (Function<String, String>) null); // overwrite tcp one as same metrics entry used
            httpClient = httpClient.metrics(true, s -> "/"); // no tcp metrics
        }


        /*
        AdvancedByteBufFormat.SIMPLE - only the events will be logged
        AdvancedByteBufFormat.HEX_DUMP - both events and content will be logged, with content in hex format
        AdvancedByteBufFormat.TEXTUAL - both events and content will be logged, with content in plain text format
        */
        if (properties.isWiretap()) {
            httpClient = httpClient.wiretap(properties.isWiretap()).wiretap("wiretap", INFO, AdvancedByteBufFormat.HEX_DUMP);
        }
        return webClientBuilder
                .defaultHeader(HttpHeaders.USER_AGENT, "SAMPLE_APP")
                .clientConnector(new ReactorClientHttpConnector(httpClient))
                .build();
    }
}


@Getter
@Setter
@ConfigurationProperties(prefix = "example.reactor.client")
public class ReactorConnectorProperties {

    private String poolName;

    /**
     * Maximum number of connections towards destination.
     */
    @Builder.Default
    private int maxConnections = 50;

    /**
     * {@link ReactorClientHttpConnector} connect timeout
     */
    @Builder.Default
    private Duration connectTimeout = Duration.ofSeconds(5);

    /**
     * {@link ReactorClientHttpConnector} read timeout
     */
    @Builder.Default
    private Duration readTimeout = Duration.ofSeconds(10);

    /**
     * {@link ReactorClientHttpConnector} write timeout
     */
    @Builder.Default
    private Duration writeTimeout = Duration.ofSeconds(10);

    /**
     * Acquire timeout to get connection from pool
     */
    @Builder.Default
    private Duration acquireTimeout = Duration.ofSeconds(2);

    /**
     * The maximum number of registered requests for acquire to keep in a pending queue
     * A good default is double the max connections.
     */
    @Builder.Default
    private int acquireMaxCount = 2 * 50;

    /**
     * Idle timeout to close active connections. This should be higher than combined connect+acquire+read+write timeouts
     */
    @Builder.Default
    private Duration idleTimeout = Duration.ofMinutes(5);

    /**
     * Max Life Time of a connection.
     */
    @Builder.Default
    private Duration maxLifeTime = Duration.ofMinutes(60);

    /**
     * Enabled/Disable Reactor Netty Http metrics
     */
    @Builder.Default
    private boolean nettyHttpMetricsEnabled = false;

    /**
     * Enabled/Disable Reactor Netty TCP metrics
     */
    @Builder.Default
    private boolean nettyTcpMetricsEnabled = true;

    @Builder.Default
    private Duration evictInBackground = Duration.ofMinutes(1);

    @Builder.Default
    private boolean wiretap = false;

    @Builder.Default
    private String proxyHost = "";

    @Builder.Default
    private Integer proxyPort = 8080;
}

Enable proxy, enable http or tcp metrics

example.reactor.client.maxConnections=50
example.reactor.client.connectTimeout=5s
example.reactor.client.readTimeout=10s
example.reactor.client.writeTimeout=10s
example.reactor.client.acquireTimeout=2s
example.reactor.client.acquireMaxCount=100
example.reactor.client.idleTimeout=1m
example.reactor.client.maxLifeTime=2m
example.reactor.client.nettyHttpMetricsEnabled=true
example.reactor.client.nettyTcpMetricsEnabled=false
example.reactor.client.evictInBackground=10s
example.reactor.client.proxy-host=proxy
example.reactor.client.proxy-port=8080

Metrics outcome when reactor_netty_http_client metrics in use

spring FW http metrics

http_client_requests_seconds_count{client_name="petstore.swagger.io",error="none",exception="none",method="GET",outcome="SUCCESS",status="200",uri="none",} 6.0

proxy address reported

reactor_netty_http_client_data_sent_time_seconds_count{method="GET",remote_address="proxy:8080",uri="/",} 3.0
reactor_netty_http_client_tls_handshake_time_seconds_count{remote_address="proxy:8080",status="SUCCESS",} 1.0
reactor_netty_http_client_data_received_time_seconds_count{method="GET",remote_address="proxy:8080",status="200",uri="/",} 3.0
reactor_netty_http_client_response_time_seconds_count{method="GET",remote_address="proxy:8080",status="200",uri="/",} 3.0

real address reported

reactor_netty_http_client_connect_time_seconds_count{remote_address="petstore.swagger.io:443",status="SUCCESS",} 1.0

both proxy and real address ****

reactor_netty_http_client_data_sent_bytes_count{remote_address="proxy:8080",uri="/",} 3.0
reactor_netty_http_client_data_sent_bytes_count{remote_address="petstore.swagger.io:443",uri="http",} 3.0
reactor_netty_http_client_data_received_bytes_count{remote_address="proxy:8080",uri="/",} 11.0
reactor_netty_http_client_data_received_bytes_count{remote_address="petstore.swagger.io:443",uri="http",} 40.0

This was not shown in test case, but looks that it use XXX

reactor_netty_http_client_address_resolver_seconds_count

Metrics outcome when reactor_netty_tcp_client metrics in use

spring FW http metrics

http_client_requests_seconds_count{client_name="petstore.swagger.io",error="none",exception="none",method="GET",outcome="SUCCESS",status="200",uri="none",} 6.0

proxy address reported

reactor_netty_tcp_client_tls_handshake_time_seconds_count{remote_address="proxy:8080",status="SUCCESS",} 1.0

real address reported

reactor_netty_tcp_client_data_sent_bytes_count{remote_address="petstore.swagger.io:443",uri="tcp",} 6.0
reactor_netty_tcp_client_connect_time_seconds_count{remote_address="petstore.swagger.io:443",status="SUCCESS",} 1.0
reactor_netty_tcp_client_data_received_bytes_count{remote_address="petstore.swagger.io:443",uri="tcp",} 23.0

Environment

SpringBoot 3.2.2.
Windows and Linux
java 17

@jtorkkel jtorkkel added status/need-triage A new issue that still need to be evaluated as a whole type/bug A general bug labels Feb 14, 2024
@violetagg violetagg removed the status/need-triage A new issue that still need to be evaluated as a whole label Feb 14, 2024
@violetagg violetagg self-assigned this Feb 14, 2024
@violetagg violetagg added this to the 1.0.43 milestone Feb 19, 2024
@violetagg
Copy link
Member

@jtorkkel I can reproduce it and I'm working on a fix. Thanks for the detailed explanation!

@jtorkkel
Copy link
Author

Thanks, please consider having separate label for proxy and remote_address. It would also show if proxy in use.

I also noticed that "reactor_netty_tcp_client_errors_total" but probably also "reactor_netty_http_client_errors_total" are also inconsistent.

I can see from logs 6 separate errors, 2 first are 1s from each other and remaining 4 few min apart from each.

 r.netty.http.client.HttpClientConnect    -[ece1fcd2-3, L:/11.11.11.11:11111! R:remote_address=proxy:443/10.10.10.10:8080] The connection observed an error reactor.netty.http.client.PrematureCloseException: Connection prematurely closed BEFORE response

And I can see from spring FW I can also see same amount of "WebClientRequestException" in http client and http server, 2 change in one scrape and 4 change in 4 other scrape.

# error label in SpringBoot 3, in Spring 2 no error label in http client
http_client_requests_seconds_count{dnsname="service", status=~"", error="WebClientRequestException", status="CLIENT_ERROR", method="POST", remote_address="xxx-api:443"} 6

http_server_requests_seconds_count{dnsname="service", status=~"5..", error="WebClientRequestException", method="POST"} 6

But in reactor metrics I can see only 3 and they happen at same time as last of above 6, but in logs only one error seen.

reactor_netty_tcp_client_errors_total{dnsname="service", remote_address="xxx-api:443", uri="tcp"} 3

So sounds that "reactor_netty_*_client_errors_total" is not counting all errors, and sometimes counting error multiple times.

Would be great also to have label for error reason like exception as spring FW seems to aggregate all request exceptions into "WebClientRequestException" resulting you cannot differentiate "connectTimeout", "readtimeout", queueTimeout. But apparently might be hard as exception happens on so many layers, read timeout happen in "io.netty.handler.timeout.ReadTimeoutException"

@violetagg
Copy link
Member

@jtorkkel For Reactor Netty version 1.0.x I'm gonna fix it as I will guarantee the remote address is always the real one and not the proxy address. The new tag with proxy address information I'm gonna add to Reactor Netty version 1.1.x. Wdyt?

For the issue with the errors number can you provide some reproducible example?

@jtorkkel
Copy link
Author

Thanks, make sense to add new label only to 1.1.x.

We were seeing "The connection observed an error reactor.netty.http.client.PrematureCloseException: Connection prematurely closed BEFORE response".

We never found the root cause but we noticed that increasing web client "maxIdleTime" from 2min to 10min and turning on eviction (2min, instead of checking age in acquire/release) we started to get 100x more of "prematurely closed" errors.
It turned out that our loadbalancer had 180s maxIdle timeout, plus 120s background eviction resulting that requests sent after being 180s idle were immediately closed, but loadbalancer actually close idle sockets with delay ~181-301s after being idle (180 + 0-120s + jitter). Thus there was "race" condition.
Most can be eliminated by decreasing idle timeout or by adding TCP_KEEPALIVE < 180s

While root causing I was hoping to see errors in "reactor_netty_tcp_client_errors_total" metrics but as said sometimes "prematurely closed" were reported, sometimes not reported and sometimes single exception resulted multiple errors (no other exception seen) .

I then tested 5 different test cases

  1. connect error (configure localhost:port address which result RST)
  2. connect error after timeout use 1.1.1.1:11111
  3. readTimeout (delay response more than read timeout)
  4. pendingAcquireMaxCount
  5. pendingAcquireTimeout (set pool size to 1 and make multiple parallel calls while delaying responses resulting queing)

1 and 2 resulted connect error correctly, and naturally no error on "reactor_netty_tcp_client_errors_total"

reactor_netty_http_client_connect_time_seconds_count{remote_address="localhost:31111",springBoot="3.2.2",status="ERROR"} 1.0
reactor_netty_http_client_connect_time_seconds_count{remote_address="1.1.1.1:11111",springBoot="3.2.2",status="ERROR"} 1.0

I was expecting that 3-5 would be also seen on "reactor_netty_tcp_client_errors_total" but did not see counter increased (actually missing).
And naturally if error reported would be nice to see if

  • pendingAcquireTimeout
  • writeTimeout
  • readTimeout
  • PrematureCloseException
  • etc.
    as long as cardinality not exploding too much (read < ~50)

Apparently reactor netty is not handling most of the exception and WebClient handle them instead, and aggregate to single WebClientRequestException.

org.springframework.web.reactive.function.client.WebClientRequestException: Pool#acquire(Duration) has been pending for more than the configured timeout of 1000ms
reactor.netty.internal.shaded.reactor.pool.PoolAcquireTimeoutException: Pool#acquire(Duration) has been pending for more than the configured timeout of 1000ms

Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: no further information: localhost/127.0.0.1:11111
Caused by: java.net.ConnectException: Connection refused: no further information
org.springframework.web.reactive.function.client.ExchangeFunctions$DefaultExchangeFunction.lambda$wrapException$9(ExchangeFunctions.java:136) ~[spring-webflux-6.1.3.jar:6.1.3]
io.netty.handler.timeout.ReadTimeoutException: null

@violetagg
Copy link
Member

@jtorkkel PR #3081 adds support for proxy address

violetagg added a commit that referenced this issue Mar 8, 2024
…e correct order for ChannelHandlers

- Place ResponseTimeoutHandler before HttpMetricsHandler so that
is an error happened it can be recorded
- Add test to verify that errors with establishing a connection are recorded

Fixes #3060
@violetagg
Copy link
Member

@jtorkkel

For the moment we do not plan to add the type of the error.

violetagg added a commit that referenced this issue Mar 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment