Perusing Pair (and BiStream)

Ben Yu edited this page Dec 5, 2020 · 24 revisions

Let's talk about Pair

It's 2020 and I'm still talking about Pairs. ;-)

There are plenty of Stack Overflow questions asking about a generic Pair class in Java.

And at least 4 or 5 libraries provide anything from a whole slew of tuple implementations down to a single Pair class.

At Google, we are largely biased against Pair (and all of those tuple types).

Where I had to use them, my own Flume code disgusted me enough (in both Java and C++). Some extreme examples can be found in this post.

Code using these nested Pair classes also tends to read horribly:

emitFn.emit(in.getFirst(), Iterables.getOnlyElement(in.getSecond().getFirst()));
...

return String.format("%s (SourceId=%d)\t Status:%s\t AllocationCount:%d",
    getNetworkName(input.getFirst().getFirst()),
    input.getFirst().getFirst(),
    input.getFirst().getSecond(),
    input.getSecond());

if ((next_mid_iterator->second.first.first > mid_iterator->second.first.first)
   || (next_mid_iterator->second.first.second <= mid_iterator->second.first.second)) {
  ...
}

So what exactly is wrong about Pair?

Meaningless Names

You may opt for the first/second terminology, or _1/_2, or left/right, car/cdr, foo/bar, a/b, yin/yang, head/tail, night/day, gandalf/saruman.

Whatever names you choose, they have one thing in common: they don't mean anything.

And that is why the above Pair usage code is horrible. "second.first.first" is likely the first second best thing since and before the second goto was first invented.

If you can come up with a logical meaning for these "second.first.first" thingies, and you want yourself weeks later to understand the code, by all means try to name them what they are, for example: value.name.first_name.

Granted, Java didn't make it easy to create proper classes with proper field names. But programmers are also partially responsible, because often we don't really need hashCode()/equals()/getters/setters. If you are just trying to have a place to define fields and document their semantics/invariants, there is nothing wrong with the following simple class:

class Name {
  public final String firstName;
  public final String lastName;

  Name(String firstName, String lastName) {
    this.firstName = firstName;
    this.lastName = lastName;
  }
}

"But it exposes the fields as public and breaks encapsulation?" True, it doesn't provide abstractions through getters. But neither does Pair<String, String>. And what about YAGNI?

Modern IDEs have the "Encapsulate Field" auto refactoring. If it turns out you need to wrap the fields behind getters, great. It means two things: 1. You made a good choice by not using Pair in the first place, because by now it would have been much harder to add any abstraction on top of a generic Pair class that may be reused in hundreds of different places. 2. Just use the "Encapsulate Field" auto refactoring. It will take care of updating your callers.

The YAGNI optimism only goes so far: it holds for locally-used, private/inner classes where you know you won't need to use the object as a hash map key or store it in a Set. It won't work if you justifiably need equals()/hashCode() (or in C++, where many might need operator==, operator< etc.)
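To make the limit concrete, here is a minimal sketch of what happens when a class without equals()/hashCode() ends up in a HashSet. PlainName and PlainNameDemo are hypothetical names made up for this illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical value class with public final fields and no equals()/hashCode().
final class PlainName {
  public final String firstName;
  public final String lastName;

  PlainName(String firstName, String lastName) {
    this.firstName = firstName;
    this.lastName = lastName;
  }
}

final class PlainNameDemo {
  // Adds two PlainName objects with identical field values and returns the set size.
  static int distinctCount() {
    Set<PlainName> names = new HashSet<>();
    names.add(new PlainName("Ada", "Lovelace"));
    names.add(new PlainName("Ada", "Lovelace"));
    // Object's default identity-based equality treats them as different entries.
    return names.size();
  }
}
```

The set keeps both objects even though a reader would consider them the same name, which is exactly the case where the simple class stops being enough.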

For these other uncooperative use cases, code generators like AutoValue give a way out so we can create proper value classes almost as easily as we had wished:

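import com.google.auto.value.AutoValue;

@AutoValue
abstract class Name {
  public abstract String firstName();
  public abstract String lastName();

  static Name of(String firstName, String lastName) {
    // AutoValue_Name is the subclass generated by the annotation processor.
    return new AutoValue_Name(firstName, lastName);
  }
}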

(In the not-too-distant future, we may even be able to use tuples)

To be fair, even with the Pair class, this problem could be alleviated in the age of lambda. For example, why not add a method like:

class Pair<A, B> {
  // Fields `first` and `second` omitted for brevity.
  public <R> R as(BiFunction<? super A, ? super B, R> output) {
    return output.apply(first, second);
  }
}

Code like the following isn't hard to read:

parseUserNameAndDomain("foo@gmail.com")
    .as((userId, domain) -> ...);
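For the curious, a self-contained sketch of that idea could look like the following. This Pair is a minimal illustration and not any particular library's class; parseUserNameAndDomain() and its splitting rule are assumptions made up for the example.

```java
import java.util.function.BiFunction;

// Minimal illustrative Pair; not taken from any particular library.
final class Pair<A, B> {
  private final A first;
  private final B second;

  Pair(A first, B second) {
    this.first = first;
    this.second = second;
  }

  // Hands both values to the caller's function, so the caller gets to name them.
  <R> R as(BiFunction<? super A, ? super B, ? extends R> output) {
    return output.apply(first, second);
  }
}

final class PairAsDemo {
  // Hypothetical parser: splits "user@domain" at the first '@'.
  static Pair<String, String> parseUserNameAndDomain(String email) {
    int at = email.indexOf('@');
    if (at < 0) {
      throw new IllegalArgumentException("not an email address: " + email);
    }
    return new Pair<>(email.substring(0, at), email.substring(at + 1));
  }

  // The lambda parameters give the two values meaningful names at the call site.
  static String describe(String email) {
    return parseUserNameAndDomain(email)
        .as((userId, domain) -> userId + " at " + domain);
  }
}
```

Note that the names userId and domain live only at the call site; the Pair itself never has to expose getFirst() or getSecond().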

Useless Type

The type Pair<Person, Person> is both under-specified and over-specified:

  • It underspecifies the relationship between the two Person objects. Are they husband and wife? Doctor and patient? Interviewer and interviewee?
  • It overspecifies the implementation details. If for example it represents a marriage between two persons, I need a Marriage type, not a type that hides its identity but taunts me with a riddle:

    "Hey, I have two Person objects in me, guess what I am?"

But wait, questions:

Does it mean a method shouldn't ever return int, or String, and should always wrap them?

No. When there is just one thing, the "relationship" argument is moot. Relationship is at least between two things.

That said, it can still be bad if you are over-using primitives to represent higher-level logical entities, especially if this logical entity will be used in multiple places. For example, if your code tends to use "user id" concept over and over again, it's probably a better idea to create a UserId type. Don't use String just because the user id happens to be represented/encoded as a String.
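A UserId type doesn't need to be fancy. Here is a minimal sketch, assuming the id is backed by a non-empty String; the class shape and the validation rule are my assumptions for illustration, not something prescribed by any library.

```java
import java.util.Objects;

// Hypothetical wrapper type; the non-empty rule is an assumption for illustration.
final class UserId {
  private final String value;

  private UserId(String value) {
    this.value = value;
  }

  // Single entry point, so the invariant is checked in exactly one place.
  static UserId of(String value) {
    Objects.requireNonNull(value);
    if (value.isEmpty()) {
      throw new IllegalArgumentException("user id must not be empty");
    }
    return new UserId(value);
  }

  String value() {
    return value;
  }

  @Override public boolean equals(Object obj) {
    return obj instanceof UserId && value.equals(((UserId) obj).value);
  }

  @Override public int hashCode() {
    return value.hashCode();
  }

  @Override public String toString() {
    return "UserId(" + value + ")";
  }
}
```

With this in place, a method that takes a UserId can no longer be handed an email address or a display name by accident.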

But what about Map<String, String>?

In a Map, the relationship between the two types is defined. They are keys and the values associated with the key.

And yes, while Map<String, String> may be okay in the internal implementation detail where it's used once or twice, with the context clearly in scope, it can be bad if it gets used and passed around across packages. Not knowing which String means what can be a readability problem. You'd be better off with Map<UserId, UserId> if they are some kind of user id mapping, or wrap the Map inside a higher-level abstraction class.

Is BiStream<Integer, Integer> bad?

Unlike Map, BiStream doesn't define a relationship between the two types. So unless seeing the two types gives the readers an immediate clue of the relationship (like in BiStream<UserId, User>), BiStream<Integer, Integer> would be bad.

That said, BiStream typically forms a chain of operations, where the BiStream's type changes at each line. The BiStream<Integer, Integer> type may only be an invisible intermediary type, like this:

BiStream.zip(indexesFrom(0), visits)  // BiStream<Integer, Integer>
    .map((index, visit) -> ...)
    ...
    .collect(...);

If the context is clear enough that we don't even bother spelling out the type explicitly, it can't hurt us.

What if a proper class makes no sense?

There are situations where a semantic-free pair type is precisely what we need. This happened in a real-life project. We had a layered application with a bunch of domain types (Order, LineItem etc.) and then a bunch of corresponding DTO types (OrderDto, LineItemDto). At the boundary of the DTO -> Domain, the implementation of translation code sometimes needed to accept or return a list of Pair<OrderDto, Order> objects.

Pair<OrderDto, Order> left no relationship untold: that this thing holds an OrderDto and its Order is exactly the semantics we needed to convey.

In such a case, I'd use either BiStream<FooDto, Foo> or BiCollection<FooDto, Foo>, depending on whether I need it to be streamed once or accessed repeatedly.

In Conclusion

Going back to the root of the problem, people need Pair because they have methods that need to return two values.

As argued above, some of these scenarios are more like a hack (so that people can "save" the effort of building a proper value type) than really two-valued function use cases. A proper type would look more natural for the concept, and what happens to be two things today may evolve to 3 or 4 things tomorrow.

For example, you'll want to return a Marriage object, not a Pair<Person, Person> object, because in the future Marriage may evolve to also need other information such as, say, Asset? Jurisdiction? Diamond? Anniversary? ExpirationDate? :)
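As a rough sketch of that evolution, assuming made-up Person and Marriage classes (all names and fields here are hypothetical), adding an anniversary later touches only the Marriage type and its construction sites:

```java
import java.time.LocalDate;

// Hypothetical domain types; names and fields are assumptions for illustration.
final class Person {
  final String name;

  Person(String name) {
    this.name = name;
  }
}

final class Marriage {
  private final Person spouse1;
  private final Person spouse2;
  // Added after the fact; callers that only read the spouses are unaffected.
  private final LocalDate anniversary;

  Marriage(Person spouse1, Person spouse2, LocalDate anniversary) {
    this.spouse1 = spouse1;
    this.spouse2 = spouse2;
    this.anniversary = anniversary;
  }

  Person spouse1() {
    return spouse1;
  }

  Person spouse2() {
    return spouse2;
  }

  LocalDate anniversary() {
    return anniversary;
  }
}
```

Had the method returned Pair<Person, Person>, the third value would have forced a change of return type on every caller.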

Legit two-valued binary use cases do exist though, you know, when you'd have a hard time coming up with any better type names than FooAndBar, DomainAndDto etc. Some other real world examples I can think of:

  • Split a flag string in the form of "--mode=dry_run" into the flag name/value pair.
  • Calculate the quotient and remainder of a division.
  • Find a list element and its current index in the list.

So here's my suggestion if you really need to return two values:

When dealing with a collection or a stream of these pairs, use BiStream or BiCollection.

Or else, consider using the lambda approach (as similarly done in JDK 12's Collectors.teeing() API). For example, if you return a Both<String, String>, it can then be accessed more easily:

/** Splits string by delimiter */
Both<String, String> split(String s);

/** Finds the element and its index */
Both<Integer, T> locate(Id id);

The benefit is that the callers can call the method to create the appropriate type as it fits:

Flag flag = split(...).andThen(Flag::new);
locate(id).andThen((index, element) -> ...);
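To show the shape of the idea without pulling in a dependency, here is a simplified stand-in. This Both interface is my own minimal sketch and not the library's actual class; Flag, BothDemo, split() and divide() are likewise hypothetical.

```java
import java.util.function.BiFunction;

// Simplified stand-in for a two-value result; an assumption for illustration only.
interface Both<A, B> {
  // The only way to get at the two values is to say what to do with them.
  <R> R andThen(BiFunction<? super A, ? super B, R> mapper);

  static <A, B> Both<A, B> of(A a, B b) {
    return new Both<A, B>() {
      @Override public <R> R andThen(BiFunction<? super A, ? super B, R> mapper) {
        return mapper.apply(a, b);
      }
    };
  }
}

// Hypothetical target type that callers build from the two values.
final class Flag {
  final String name;
  final String value;

  Flag(String name, String value) {
    this.name = name;
    this.value = value;
  }
}

final class BothDemo {
  // Splits "--mode=dry_run" into ("--mode", "dry_run") at the first '='.
  static Both<String, String> split(String flag) {
    int eq = flag.indexOf('=');
    if (eq < 0) {
      throw new IllegalArgumentException("no '=' in " + flag);
    }
    return Both.of(flag.substring(0, eq), flag.substring(eq + 1));
  }

  // Returns the quotient and remainder of an integer division.
  static Both<Integer, Integer> divide(int dividend, int divisor) {
    return Both.of(dividend / divisor, dividend % divisor);
  }
}
```

A caller can then write BothDemo.split("--mode=dry_run").andThen(Flag::new), or BothDemo.divide(17, 5).andThen((quotient, remainder) -> ...), and never touch a first/second accessor.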

As a bonus, there are common utilities that work smoothly with Both<A, B>:

import static com.google.mu.util.stream.GuavaCollectors.toImmutableListMultimap;

ImmutableListMultimap<String, String> keyValues = readLines().stream()
    .collect(
        toImmutableListMultimap(
            s -> Substring.first('=').splitThenTrim(s).orElseThrow(...)))