
Releases: openzipkin/zipkin

Zipkin 2.4

29 Nov 11:50

Zipkin 2.4 increases http collection performance, introduces a new quick start and adds Chinese language support

Http collector performance

Prior to Zipkin 2.4, our http collector was written as a normal Spring WebMVC application. Usually, this is fine, and it is still fine for our query endpoints. The below details how we ended up switching to raw Undertow to implement Zipkin's POST endpoints. Tests show up to 5x throughput with no connection errors. You don't need to do anything but upgrade.

Under a surge of POST requests, threads could backlog, resulting in timeout exceptions. Timeouts are another way of saying "you held up clients", which is a bad thing as tracing isn't supposed to hurt clients. Worse, failures here occur prior to metrics, making the count of failures invisible. This led to custom code in the stackdriver-zipkin proxy to help unveil thread pool issues. In an effort to simplify troubleshooting and reduce custom code, we now implement our http collector directly at the network layer. Benchmarks show dramatically better throughput, without network errors and with less memory pressure. Performance is a work in progress, so please do help if you are capable.

PS: Some ask why Undertow and not Netty. The quick answer is that Undertow is already supported in Spring Boot 1.5. Also, Netty versions are sensitive, and we already have to play games to ensure that, for example, Cassandra's Netty doesn't conflict with gRPC's Netty. Undertow being a bit obscure helps avoid conflicts.

New quickstart

For about 2 years, we've used maven central's query api to find the latest version of the server and download it in a single semantic request. Recently, this stopped working and broke our ability to do a quick start. Thanks to hard work by @abesto, we have a replacement script, hosted on our https endpoint, that does the same. Simply copy/paste below to grab the latest server:

$ curl -sSL https://zipkin.io/quickstart.sh | bash -s

i18n and Chinese language support

Zipkin has many Chinese-speaking, or should we say Chinese-reading, users. Distributed tracing has a lot of vocabulary and some things can be confused in translation. Thanks to @gzchenyong from China Telecom, Zipkin's UI now includes labels and tool tips in Chinese. There's even more to do, so if you can help, join @MrGlaucus who's taking this work further.

Other improvements

  • UI now loads in IE 11
  • The server "exec" jar is now a few megs smaller, under 50MiB, by eliminating some unused deps
  • The server "exec" jar's MD5s were incorrect. They are now fixed
  • Prometheus duration metrics could result in double-counting. This is now fixed.
  • ES_TIMEOUT controls the connect, read and write socket timeouts for the Elasticsearch API.
  • @dos65 fixed rendering numeric service names in the UI
  • @mikewrighton added guards against writing huge data in thrift
  • @shakuzen made the build fail nicer when JDK 9 is in use (JDK 9 support on the way soon)

Zipkin 2.3

13 Nov 09:47

Zipkin 2.3 allows querying across all services and introduces Cassandra 3 support

Exploring all data

In the past, the zipkin UI required specifying a service name. This is ok when you know which service you are interested in, but it isn't helpful to explore all data. Now, Zipkin defaults to search across all services.

For example, the below looks for a trace containing an http path and taking at least 15 milliseconds to complete:
(screenshot, 2017-11-13)

Thanks to @xqliang for helping with the code on this.

Cassandra 3 support

Cassandra has been a supported storage option in Zipkin for over 5 years. Thanks to immense help from @michaelsembwever and @llinder we have a modernized option taking advantage of Cassandra v3.9+ features and Zipkin v2 data format.

Cassandra is a very capable backend. It supports data expiration (TTL) and has powerful replication models which can simplify your ability to trace even across regions. However, if you used our cassandra schema, you'd notice it wasn't great for browsing: Our schema stored encoded thrift blobs, so you couldn't meaningfully query it in CQL. Moreover, our duration query support was problematic to the point of being a pulled feature.

The "cassandra3" storage type eliminates these problems while retaining all the strengths. It installs automatically into the "zipkin2" keyspace, corresponding with what the data structures look like. You need Cassandra v3.9+ (we test on 3.11.1, which is the latest). Look at our README for technical details about the schema.

What's notable about this is that the schema now is a span model, as opposed to serialized thrifts. This means you can look at the data in cqlsh and make sense of it. For example, the trace ID is the same hex as B3 headers, which means you can literally paste into CQL if you want.

While most will use zipkin's UI, some of you will like being able to write queries like below:

cqlsh:zipkin2> select trace_id, toTimestamp(ts) as timestamp, duration as duration_micros, minus(writetime(span), plus(ts,duration)) as write_lag_micros, span as name, value(tags, 'http.path') as path, l_service from span limit 10;

 trace_id         | timestamp                       | duration_micros | write_lag_micros | name | path | l_service
------------------+---------------------------------+-----------------+------------------+------+------+-----------
 907a5124315b2cc1 | 2017-11-13 08:16:39.365000+0000 |             634 |           810418 |  get | /api |   backend
 907a5124315b2cc1 | 2017-11-13 08:16:39.364000+0000 |            1693 |           810891 |  get | /api |  frontend
 907a5124315b2cc1 | 2017-11-13 08:16:39.363000+0000 |            2891 |           810507 |  get |    / |  frontend
 d7f8afc80b3f357b | 2017-11-13 08:14:50.478000+0000 |             616 |           288349 |  get | /api |   backend
 d7f8afc80b3f357b | 2017-11-13 08:14:50.476000+0000 |            3093 |           288574 |  get |    / |  frontend
 d7f8afc80b3f357b | 2017-11-13 08:14:50.476000+0000 |            1687 |           289084 |  get | /api |  frontend
 2483c223283177d2 | 2017-11-13 08:14:48.527000+0000 |             895 |          1129547 |  get | /api |   backend
 2483c223283177d2 | 2017-11-13 08:14:48.523000+0000 |            4125 |          1130022 |  get | /api |  frontend
 2483c223283177d2 | 2017-11-13 08:14:48.522000+0000 |            6171 |          1131768 |  get |    / |  frontend
 4ff33dd90b4f8f61 | 2017-11-13 08:16:37.350000+0000 |             659 |           798568 |  get | /api |   backend

(10 rows)

A severe amount of thanks is owed to @llinder (Lance) and @michaelsembwever (Mick). Lance has been piloting a beta model for several months, fixing some issues like pagination and generally being first to fire. He also wrote a pilot version of the dependency linking spark job. Mick helped significantly with porting that work in progress to the simpler v2 span model, as well as with stress tests, advice, tons of advice, code, more code, and advice. Please reach out and thank these two for volunteering the hard yards needed to reinvent our Cassandra model.

Zipkin 2.2

11 Oct 08:27

Zipkin 2.2 focuses on operations, allowing the UI to be mounted behind a proxy and bundling a Prometheus Grafana dashboard

@stepanv modified the zipkin UI so that it can work behind reverse proxies which choose a different path prefix than '/zipkin'. If you'd like to try zipkin under a different path, Stepan wrote docs showing how to set up Apache HTTP Server.

Previously, zipkin had both spring and prometheus metrics exporters. Through hard work from @abesto and @kristofa, we now have a comprehensive example setup including a Zipkin+Prometheus Grafana dashboard. To try it out, use our docker-compose example, which starts everything for you. Once that's done, you can start viewing the health of your tracing system, including how many messages are dropped.

Here's an example, which you'd see at http://192.168.99.100:3000/dashboard/db/zipkin-prometheus?refresh=5s&orgId=1&from=now-5m&to=now if using docker-machine:

(screenshot, 2017-10-11)

Other notes

  • our docker JVM has been upgraded to 1.8.0_144 from 1.8.0_131
  • the zipkin-server no longer writes log messages about dropped messages at warning level, as this can fill up disk. Enable debug logging to see the cause of drops
  • elasticsearch storage will now drop on backlog as opposed to backing up, as the latter led to out-of-memory crashes under load surges.
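
The difference between dropping on backlog and backing up can be illustrated with a bounded queue. The following is a minimal Java sketch and not Zipkin's Elasticsearch storage code: the class `BoundedBacklog`, its method names and the use of plain strings as messages are all assumptions made for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical illustration: refuse new work when the backlog is full instead of queueing without bound. */
class BoundedBacklog {
  private final BlockingQueue<String> queue;
  private int dropped;

  BoundedBacklog(int capacity) {
    this.queue = new ArrayBlockingQueue<>(capacity);
  }

  /** Returns true if the message was accepted; false (and counts a drop) if the backlog is full. */
  boolean offer(String message) {
    boolean accepted = queue.offer(message); // non-blocking: returns false when the queue is full
    if (!accepted) dropped++;
    return accepted;
  }

  int dropped() { return dropped; }

  int backlog() { return queue.size(); }
}
```

With a capacity of 2, a third `offer` is refused and counted as a drop, so memory use stays bounded during a surge at the cost of losing data.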

Finally, please join us on gitter if you have any questions or feedback about Zipkin 2.2

Zipkin 2.1

03 Oct 00:48

Thanks to @shakuzen, zipkin 2.1 adds RabbitMQ to the available span transports.

RabbitMQ has been requested many times, though we only started formally tracking it this year. A lot of interest grew from spring-cloud-sleuth which supported a custom RabbitMQ transport. Starting with Zipkin 2.1, RabbitMQ support is built-in to zipkin-server (though custom deployments can remove it).

Using this is easy: just set RABBIT_ADDRESSES to a comma-separated list of RabbitMQ hosts. If you are playing around, you can use localhost:

$ RABBIT_ADDRESSES=localhost java -jar zipkin.jar

More documentation is available here.

Once a server is running, applications send spans to RabbitMQ, specifically to the queue/routing key associated with zipkin (defaults to "zipkin"). You can post a test trace using the normal CLI while you wait for tracers to support RabbitMQ transport.

$ echo '[{"traceId":"9032b04972e475c5","id":"9032b04972e475c5","kind":"SERVER","name":"get","timestamp":1505990621526000,"duration":612898,"localEndpoint":{"serviceName":"brave-webmvc-example","ipv4":"192.168.1.113"},"remoteEndpoint":{"serviceName":"","ipv4":"127.0.0.1","port":60149},"tags":{"error":"500 Internal Server Error","http.path":"/a"}}]' > sample-spans.json
$ rabbitmqadmin publish exchange=amq.default routing_key=zipkin < sample-spans.json

Many thanks to @shakuzen for driving this feature. There's a lot more work than just coding when we add a new default feature. Evenings and weekend time from Tommy are gratefully received.

Zipkin 2

12 Sep 16:58

In version 1.31, we introduced our v2 http api, availing dramatically simplified data types. Zipkin 2 is the effort to move all infrastructure towards that model, while still remaining backwards compatible.

What's new?

The core java library (under the package zipkin2) has model, codec and storage types. This includes a bounded in-memory storage component used in test environments.

The following artifacts are new and can coexist with previous ones.

  • io.zipkin.zipkin2:zipkin:2.0.0 < core library
  • io.zipkin.zipkin2:zipkin-storage-elasticsearch:2.0.0 < first v2 native storage driver

Note: If you are using io.zipkin.java:zipkin and io.zipkin.zipkin2:zipkin, use version 2.0.0 (or later) for both as we still maintain the old libraries.

What's next?

There are a few storage implementations in-flight and some may port to the new libraries. Next, we will add a v2 native transport library and work on a Spring Boot 2 based server. Expect incremental progress along the way. Please join us on gitter if you have ideas!

The server itself is still the same

Note: if you are only using or configuring Zipkin, there's little impact. Zipkin server hasn't changed; you just upgrade it. If you have java tracing set up, read the below. Otherwise, you are done unless you want extra details.

Changing java applications to use Zipkin v2 format

Java applications often use the zipkin-reporter project directly or indirectly to send data to Zipkin collectors. Our version 2 json format is smaller and measurably more efficient.

Once you've upgraded your Zipkin servers, opt into the version 2 format like this:

   /** Configuration for how to send spans to Zipkin */
   @Bean Sender sender() {
-    return OkHttpSender.create("http://your_host:9411/api/v1/spans");
+    return OkHttpSender.json("http://your_host:9411/api/v2/spans");
   }
 
   /** Configuration for how to buffer spans into messages for Zipkin */
-  @Bean Reporter<Span> reporter() {
-    return AsyncReporter.builder(sender()).build();
+  @Bean Reporter<Span> spanReporter() {
+    return AsyncReporter.v2(sender()).build();
   }

If you are using Brave directly, you can stick the v2 reporter here:

     return Tracing.newBuilder()
-        .reporter(reporter()).build();
+        .spanReporter(spanReporter())

If you are using Spring XML, the related change looks like this:

-  <bean id="sender" class="zipkin.reporter.okhttp3.OkHttpSender" factory-method="create"
+  <bean id="sender" class="zipkin.reporter.okhttp3.OkHttpSender" factory-method="json"
       destroy-method="close">
-    <constructor-arg type="String" value="http://localhost:9411/api/v1/spans"/>
+    <constructor-arg type="String" value="http://localhost:9411/api/v2/spans"/>
   </bean>
 
   <bean id="tracing" class="brave.spring.beans.TracingFactoryBean">
     <property name="reporter">
       <bean class="brave.spring.beans.AsyncReporterFactoryBean">
+        <property name="encoder" value="JSON_V2"/>

What's new in the Zipkin v2 library

Zipkin v2 libraries are under the zipkin2 java package and the io.zipkin.zipkin2 maven group ID. The core library has a few changes, which mostly clean up or pare down features we had before. Here are some highlights:

Span now uses validated strings as opposed to parsed objects

Our new json encoder is 2x as fast as the prior one due to factors including a validation approach. For example, before we used the java long type to represent a 64-bit ID and a 32-bit integer to represent an ipv4 address. Most of the time, IDs and IPs are transmitted and stored as strings, so this resulted in needless, expensive conversions. By switching to validated strings, using other serialization libraries is easier, too, as you don't need custom type converters.

Ex.

-  Endpoint.builder().serviceName("tweetie").ipv4(192 << 24 | 168 << 16 | 1).build());
+  Endpoint.newBuilder().serviceName("tweetie").ip("192.168.0.1").build());

protip: if you have an old endpoint, you can do endpoint.toV2() on it!
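
To make the avoided conversion concrete, here is a minimal Java sketch of turning a packed 32-bit IPv4 address back into text. The class `Ipv4Text` and its method are hypothetical helpers written for this illustration and are not part of the Zipkin library.

```java
/** Hypothetical helper showing the conversion a packed-int representation requires. */
class Ipv4Text {
  /** Renders a big-endian packed IPv4 address as dotted-decimal text. */
  static String fromPackedInt(int ip) {
    // Unsigned shifts and masks extract each octet, most significant first.
    return ((ip >>> 24) & 0xff) + "."
        + ((ip >>> 16) & 0xff) + "."
        + ((ip >>> 8) & 0xff) + "."
        + (ip & 0xff);
  }
}
```

Calling `Ipv4Text.fromPackedInt(192 << 24 | 168 << 16 | 1)` yields "192.168.0.1", the same address the v2 builder accepts directly as a string.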

Span now uses auto-value instead of public final fields

We originally had public final fields for our model types (borrowing from square wire style). This has a slight glitch which is that data transformations can't use method references (as fields aren't methods!). This is cleaned up now.

-    assertThat(spans).extracting(s -> s.duration)
+    assertThat(spans).extracting(Span::duration)

Asynchronous operations are now cancelable

Most will not make custom Zipkin servers, but those making storage or transport plugins have a cleaner api.

Borrowing heavily from Square Retrofit and OkHttp, Zipkin storage interfaces return a Call object, which represents a single unit of work, such as storing spans. This provides means to either synchronously invoke the command, pass a callback, or compose with your favorite library. Unlike before, calls are cancelable.

For example, before, if you wanted to write integration tests that synchronously invoke storage, you'd need to play callback games. These are gone.

-    CallbackCaptor<Void> callback = new CallbackCaptor<>();
-    storage().asyncSpanConsumer().accept(spans, callback);
-    callback.get();
+   storage.spanConsumer().accept(spans).execute();
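
The three ways of consuming a call described above (invoke synchronously, pass a callback, cancel) can be sketched with a toy type. This is only an illustration under stated assumptions: `ToyCall` and its methods are hypothetical and are not the zipkin2 Call API, whose actual signatures should be checked in the library.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BiConsumer;
import java.util.function.Supplier;

/** Toy stand-in for a cancelable unit of work; NOT the zipkin2 Call type. */
final class ToyCall<V> {
  private final Supplier<V> work;
  private final AtomicBoolean canceled = new AtomicBoolean();

  ToyCall(Supplier<V> work) {
    this.work = work;
  }

  /** Runs the work on the calling thread and returns its result. */
  V execute() {
    if (canceled.get()) throw new IllegalStateException("canceled");
    return work.get();
  }

  /** Runs the work on a new thread, passing either the result or the error to the callback. */
  Thread enqueue(BiConsumer<V, Throwable> callback) {
    Thread thread = new Thread(() -> {
      try {
        callback.accept(execute(), null);
      } catch (RuntimeException e) {
        callback.accept(null, e);
      }
    });
    thread.start();
    return thread;
  }

  /** Marks the call canceled; later attempts to execute it fail fast. */
  void cancel() {
    canceled.set(true);
  }

  boolean isCanceled() {
    return canceled.get();
  }
}
```

A test can then call `new ToyCall<>(() -> 42).execute()` directly instead of capturing a callback, which is the simplification the diff above shows.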

As an implementor, the whole thing is simpler especially combined with validated string IDs

-  @Override public void getTrace(long traceIdHigh, long traceIdLow, Callback<List<Span>> callback) {
-    String traceIdHex = Util.toLowerHex(traceIdHigh, traceIdLow);
+  @Override public Call<List<Span>> getTrace(String traceId) {
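
For readers unfamiliar with the conversion the old signature forced on implementors, the following is a minimal Java sketch of encoding two longs as lower-hex text. It is a hypothetical re-implementation written for illustration, not code copied from Zipkin; in particular, emitting 16 characters when the high bits are zero and 32 otherwise is an assumption.

```java
/** Hypothetical illustration of encoding a trace ID held as two longs into lower-hex text. */
class TraceIdHex {
  /** Returns 16 hex characters when high is zero, otherwise 32 (assumed convention). */
  static String toLowerHex(long high, long low) {
    // %016x pads to 16 digits and renders negative longs as unsigned two's complement.
    String lowHex = String.format("%016x", low);
    return high == 0 ? lowHex : String.format("%016x", high) + lowHex;
  }
}
```

With string IDs in the v2 interface, this step disappears: the hex text received in headers can be passed straight to storage.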

(json) Codec libraries are cleaned up

We've introduced SpanBytesEncoder and SpanBytesDecoder instead of the catch-all Codec type from v1. When writing zipkin-reporter, we noticed that almost all applications do not need decode logic (as they simply serialize and send out of process). For those writing data to Zipkin, we can serialize either the old format or the new with SpanBytesEncoder.JSON_V1 or SpanBytesEncoder.JSON_V2 accordingly. It is important to note that writing v1 format does not require a version 1.X jar in your classpath.

Zipkin 1.30

27 Mar 06:05

Zipkin 1.30 accepts a new simplified json format on all major transports including http, Kafka, SQS, Kinesis, Azure Event Hub and Google Stackdriver.

The primary goal of this format is making Zipkin data easier to understand and simpler for folks to write. A dozen folks in Zipkin have vetted ideas on this format for over a year. We took it seriously because we don't want to bother you with a format unless it will last years. Thanks especially to @bplotnick @basvanbeek and @mansu for donating time recently towards vetting final details.

Here's an example curl command that uploads json representing a server operation:

# make epoch seconds epoch microseconds, because.. microservices!
$ date +%s123456
1502677917123456
$ curl -s localhost:9411/api/v2/spans -H'Content-Type: application/json' -d'[{
  "traceId": "86154a4ba6e91387",
  "id": "86154a4ba6e91387",
  "kind": "SERVER",
  "name": "get",
  "timestamp": 1502677917123456,
  "duration": 207000,
  "localEndpoint": {
    "serviceName": "hamster-wheel",
    "ipv4": "113.210.108.10"
  },
  "remoteEndpoint": {
    "ipv4": "77.12.22.11"
  },
  "tags": {
    "http.path": "/api/hamsters",
    "http.status_code": "302"
  }
}]'

The above says a lot with a little: the server's identifier in discovery (hamster-wheel), the http route and the client IP (likely from X-Forwarded-For or similar). This request took 207ms in the server and resulted in a redirect.
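
The "timestamp" and "duration" fields in the example are in microseconds. As a minimal Java sketch of the unit handling, assuming a hypothetical helper class `EpochMicros` that is not part of any Zipkin library:

```java
import java.time.Instant;

/** Hypothetical helper for the microsecond units used by the "timestamp" and "duration" fields. */
class EpochMicros {
  /** Microseconds since the Unix epoch for the given instant. */
  static long of(Instant instant) {
    return instant.getEpochSecond() * 1_000_000L + instant.getNano() / 1_000;
  }

  /** Converts a duration in microseconds to whole milliseconds. */
  static long toMillis(long micros) {
    return micros / 1_000;
  }
}
```

For instance, `EpochMicros.toMillis(207000)` gives 207, matching the 207ms described above.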

We released collector-side ahead of client/reporter-side, so that folks can roll out version upgrades ahead of demand. That said, there is already work in progress using this, like Census and @flier's C/C++ tracer, so update to the most recent patch release as soon as you can!

If you are more interested in this format, check out the newly polished OpenApi spec, or a go client example compiled from it (thx @devinsba). If you have further questions, hop on https://gitter.im/openzipkin/zipkin

Next releases will formalize more including "zipkin2" java types for those who need it. That said, one nice thing about the new format is that it is easy enough for normal json tools to manage. Regardless, keep eyes open for more and thanks for the interest.

Zipkin 1.29

07 Aug 09:32

Zipkin 1.29 models messaging spans, shows errors in the service graph and supports Elasticsearch 6

Message tracing

Producing and consuming messages from a broker, such as RabbitMQ or Kafka, is similar to, but different from, one-way RPC. For example, one message can have multiple consumers, and many times the producer of the message can't know if this will be the case. Also, and particularly in Kafka, consuming a message is often completely decoupled from processing it, and consumption may happen in bulk.

Through community discussion, notably advice from @bogdandrutu from Census, we reached this conclusion for message tracing with Zipkin:

  • Messaging consumers should always be a child span of the producing span (and not a linked trace)
    • If using B3, this means X-B3-SpanId is the parent of the consumer span
  • "ms" and "mr" annotate message send and receive events
    • span2 format replaces these with Span.Kind.PRODUCER, CONSUMER
  • If producer and consumer spans include duration, it should only reflect local batching delay
    • time spent processing a message should be in a different child span

There are diagrams of how instrumentation works with this model on the website. You can also look at @ImFlog's Kafka 0.11 tracing work in progress. If you have more questions or want to share your work, contact us on gitter.

Visualizing error count between services

Thanks to @hfgbarrigas' initial work, and lots of review support by @shakuzen, we now have errorCount on dependency links, indicating how many of callCount between services were in error.

MySQL users who want this need to add the error_count column:

alter table zipkin_dependencies add `error_count` BIGINT

The UI is relatively simple, coloring the line yellow when 50% or more calls are in error, and red when 75%. These rates can be overridden or disabled with configuration.
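
The threshold rule above can be sketched in a few lines of Java. This is an illustration only, not the UI's code: the class `LinkColor`, its method names and the "default" label for an uncolored link are assumptions, and the red threshold is assumed to be inclusive in the same way as the yellow one.

```java
/** Hypothetical illustration of coloring a dependency link by its error rate. */
class LinkColor {
  /** Picks a color from callCount and errorCount using configurable rates. */
  static String of(long callCount, long errorCount, double warnRate, double dangerRate) {
    if (callCount <= 0) return "default"; // no calls, nothing to rate
    double errorRate = (double) errorCount / callCount;
    if (errorRate >= dangerRate) return "red";
    if (errorRate >= warnRate) return "yellow";
    return "default";
  }

  /** Uses the rates described above: yellow at 50% and red at 75%. */
  static String of(long callCount, long errorCount) {
    return of(callCount, errorCount, 0.50, 0.75);
  }
}
```

Passing the rates as parameters mirrors the note that they can be overridden with configuration.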

Example link detail screen
(screenshot, 2017-07-28)

Example of when >50% of calls are in error
(screenshot, 2017-07-28)

Example of when >75% of calls are in error
(screenshot, 2017-07-28)

Trace instrumentation's contract is easy: add the "error" tag, for example on http 500. When aggregating links, the value of the "error" tag isn't important. Please update to the latest versions of instrumentation if you don't see errors yet. For example, zipkin-ruby recently added support for this thanks to @jcarres-mdsol.

Elasticsearch 6

Currently, Elasticsearch uses one index for all types: spans, dependencies (and a special service name index). Elasticsearch 6 no longer supports multiple types per index. Instead we write separate indexes for span and dependency links when Elasticsearch 6 is detected. Incidentally, we also use the new span2 json format, which is simplified and more efficient.

The next version will support the same single-type indexing with Elasticsearch 2.4+. If you can't wait that long, look at #1674 for the experimental flag you can use today.

Thanks to @anuraaga @ImFlog @xeraa and @jcarres-mdsol for advice and support leading to this feature. The next release will thank those who test it!

Zipkin 1.28

06 Jul 06:15

Zipkin 1.28 bounds the in-memory storage component

Since the rewrite, we've always had a way to start zipkin without any storage service dependency. This is great for running examples, unit tests, or ad-hoc tests. It wasn't good for tests in more persistent environments like Kubernetes as eventually the memory would blow up and we'd recommend people to use something else. It also wasn't good for short tests that take a lot of traffic for the same reason.

Initially, we were hesitant to add features that might lead to people accidentally going to prod with our in-memory storage. However, many people asked about this, usually after something blew up in test. We realized bounding the memory provider was indeed worthwhile. Thanks to hard work and tuning by @joel-airspring, the default server now starts and won't likely blow up if you send a lot of traffic to it.

So, now you can play around and zipkin will just drop old traces to make room for new ones.

# run with self-tracing enabled, so each api hit is traced, and max-spans set lower than 500000 spans (default)
$ SELF_TRACING_ENABLED=true java -Dzipkin.storage.mem.max-spans=500 -jar ./zipkin-server/target/zipkin-server-*exec.jar
# in another window, do this for a while
$ while true; do curl -s localhost:9411/api/v1/services;done
# then, check to see the span count is less than or equal to what you set it to: <=500
$ curl -s localhost:9411/api/v1/traces?limit=1000000|jq '.[]|.[]|.id'|wc -l
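
The idea of dropping old data to make room for new can be sketched as follows. This is a simplified Java illustration and not Zipkin's in-memory storage implementation: the class `BoundedSpanStore` is hypothetical, spans are plain strings, and eviction is per span in insertion order, whereas the text above describes dropping old traces.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical illustration of a store that evicts the oldest entries once it is full. */
class BoundedSpanStore {
  private final int maxSpans;
  private final Deque<String> spans = new ArrayDeque<>();

  BoundedSpanStore(int maxSpans) {
    if (maxSpans <= 0) throw new IllegalArgumentException("maxSpans must be positive");
    this.maxSpans = maxSpans;
  }

  /** Adds a span, evicting the oldest one first if the store is at capacity. */
  void accept(String span) {
    while (spans.size() >= maxSpans) spans.removeFirst();
    spans.addLast(span);
  }

  int size() { return spans.size(); }

  /** Returns the oldest retained span, or null if the store is empty. */
  String oldest() { return spans.peekFirst(); }
}
```

However many spans are accepted, `size()` never exceeds the configured maximum, which is the property the curl check above verifies against the real server.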

Please note this option can likely break under certain types of load, so please don't consider the in-memory provider production-grade, or on a path to be the latest data grid! If you are interested in an in-memory storage option for production, you might consider upvoting Hazelcast, noting you want it to work embedded.

Zipkin 1.27

30 Jun 08:41

Zipkin 1.27 moves the UI under the path /zipkin, allows listening on multiple Kafka topics and improves Cassandra 3 support.

The Zipkin UI was formerly served from an unmodified server at the base path. For a year, folks have asked in various ways to have this under a subpath instead. We decided to move the UI under /zipkin as it matched most users' requirements and was easiest for our single-page app to route. Thanks to @eirslett @danielkwinsor and @neilstevenson for help with implementation and testing.

We recently added Kafka 0.10 support. This version includes the ability to listen on multiple topics, something you might do if you have environments where spans come from different sources. Thanks to @danielkwinsor for implementation and @dgrabows for review, we now support this by simply comma-delimiting the topic. Note: there are some gotchas if you are considering migrating from Kafka 0.8 to 0.10. Thanks to @fedj for noting something you might run into.

Some of you may be using the experimental "cassandra3" storage type. @llinder found a serious glitch where blocking could occur on a query depending on the count of results returned. Not only did Lance fix the glitch, he also added testcontainers to ensure clean, docker-based integration tests run on every PR.

Finally, Zipkin 1.27 fixes a number of broken windows. Thanks @NithinMadhavanpillai for adding a test to help us fix a bad data bug parsing dependencies, @fgcui1204 for finding out why service names sometimes cut off in the UI, @ImFlog for backfilling docs about how ports can be specified in cassandra and @joel-airspring for fixing a few distracting glitches in our build.

Zipkin 1.26

18 May 05:27

Thanks to @dgrabows, Zipkin 1.26 now supports Kafka 0.10. Notably, this allows you to run without a ZooKeeper dependency. (Recent versions of Kafka no longer require consumers to connect to ZooKeeper)

Our docker image will automatically use this, if the variable KAFKA_BOOTSTRAP_SERVERS is set instead of KAFKA_ZOOKEEPER. An example docker setup is available here.

While you do not need to upgrade your instrumented apps, you can choose to opt-in by using libraries such as our kafka10 sender.

Thanks again for the comprehensive work by @dgrabows and review feedback by @StephenWithPH