
Releases: dotnet/orleans

v3.4.1

03 Feb 20:43
5a87bbb

Kubernetes hosting package marked as stable

The Microsoft.Orleans.Kubernetes.Hosting package is now marked as stable. This package is intended to help users who are deploying to Kubernetes by automating configuration of silos, monitoring Kubernetes for changes in the active pods, and terminating pods which are marked as defunct by the Orleans cluster. Please try it and give us your feedback. Documentation is available here and a sample project is available here.
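
For reference, a minimal silo setup using the package might look like the following sketch, assuming the usual .NET generic host setup and the pod environment variables described in the documentation:

```csharp
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Orleans.Hosting;

public static class Program
{
    public static async Task Main(string[] args)
    {
        var host = Host.CreateDefaultBuilder(args)
            .UseOrleans(siloBuilder =>
            {
                // Configures the silo from the Kubernetes environment and keeps
                // Orleans cluster membership in sync with the active pods.
                siloBuilder.UseKubernetesHosting();
            })
            .Build();

        await host.RunAsync();
    }
}
```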

Improvements and bug fixes since 3.4.0

  • Non-breaking improvements

    • Improve performance of LRU.Add() (#6872)
    • Added a base class to IStorage that is not generic (#6928) (#6931)
    • Cleanup Kubernetes hosting package for stable release (#6902) (#6911)
    • Mark Kubernetes hosting package as stable (#6804) (#6903)
  • Non-breaking bug fixes

    • Fix leak in RunningRequestSenders (#6903)
    • Avoid disposing uncompleted task in LinuxEnvironmentStatistics (#6842) (#6887)
    • Only log that a message is being forwarded if it is being forwarded (#6892) (#6910)
    • In GrainDirectoryPartition, throw an exception instead of returning null if trying to register an activation on a non-valid silo (#6896) (#6901)
    • Do not retry to send streaming events if the pulling agent has been stopped (#6897) (#6900)
    • Try to limit forwarding when a grain activation throws an exception in OnActivateAsync() (#6891) (#6893)

v3.4.0

06 Jan 20:27
399dd4e

Improved resiliency during severe performance degradation

This release includes improvements to the cluster membership algorithm which are opt-in in this initial release. These changes are aimed at improving the accuracy of cluster membership when some or all nodes are in a degraded state. Details follow.

Perform self-health checks before suspecting other nodes (#6745)

This PR implements some of the ideas from Lifeguard (paper, talk, blog) which can help during times of catastrophe, where a large portion of a cluster is in a state of partial failure. One cause of these kinds of partial failures is large-scale thread pool starvation, which can cause a node to run slowly enough that it does not process messages in a timely manner. Slow nodes can therefore suspect healthy nodes simply because the slow node is not able to process the healthy node's timely response. If a sufficiently large proportion of nodes in a cluster are slow (e.g., due to an application bug), then healthy nodes may have trouble joining and remaining in the cluster, since the slow nodes can evict them. In this scenario, slow nodes will also be evicting each other. The intention is to improve cluster stability in these scenarios.

This PR introduces LocalSiloHealthMonitor which uses heuristics to score the local silo's health. A low score (0) represents a healthy node and a high score (1 to 8) represents an unhealthy node.

LocalSiloHealthMonitor implements the following heuristics:

  • Check that this silo is marked as Active in membership
  • Check that no other silo suspects this silo
  • Check for recently received successful ping responses
  • Check for recently received ping requests
  • Check that the .NET Thread Pool is able to execute work items within 1 second from enqueue time
  • Check that local async timers have been firing on-time (within 3 seconds of their due time)

Failing heuristics contribute to increased probe timeouts, which has two effects:

  • Improves the chance of a successful probe to a healthy node
  • Increases the time taken for an unhealthy node to vote a healthy node dead, giving the cluster a larger chance of voting the unhealthy node dead first (Nodes marked as dead are pacified and cannot vote others)

The effects of this feature are disabled by default in this release; only passive background monitoring is enabled. The extended probe timeouts feature can be enabled by setting ClusterMembershipOptions.ExtendProbeTimeoutDuringDegradation to true. The passive background monitoring period can be configured by changing ClusterMembershipOptions.LocalHealthDegradationMonitoringPeriod from its default value of 10 seconds.
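
For illustration, opting in might look like the following sketch, assuming siloBuilder is the ISiloBuilder being configured and ClusterMembershipOptions comes from the Orleans.Configuration namespace:

```csharp
siloBuilder.Configure<ClusterMembershipOptions>(options =>
{
    // Opt in to extending probe timeouts while the local silo reports degraded health.
    options.ExtendProbeTimeoutDuringDegradation = true;

    // Optionally tune how often the passive local health monitor runs (default: 10 seconds).
    options.LocalHealthDegradationMonitoringPeriod = TimeSpan.FromSeconds(10);
});
```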

Probe silos indirectly before submitting a vote (#6800)

This PR adds support for indirectly pinging silos before suspecting/declaring them dead.
When a silo is one missed probe away from being voted dead, the monitoring silo switches to indirect pings. In this mode, the silo picks another silo at random and sends it a request to probe the target silo. If that silo responds promptly with a negative acknowledgement (after waiting for a specified timeout), then the target silo will be suspected/declared dead.

Additionally, when the vote limit to declare a silo dead is 2 silos, a negative acknowledgement counts for both required votes and the silo is unilaterally declared dead.

The feature is disabled by default in this release - only direct probes are used by default - but it may be enabled by default in a later release, and users can enable it now by setting ClusterMembershipOptions.EnableIndirectProbes to true.
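
A similar sketch for opting in to indirect probes, again assuming an ISiloBuilder named siloBuilder:

```csharp
siloBuilder.Configure<ClusterMembershipOptions>(options =>
{
    // Use indirect probes before voting a silo dead; off by default in this release.
    options.EnableIndirectProbes = true;
});
```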

Improvements and bug fixes since 3.3.0

  • Non-breaking improvements
    • Probe silos indirectly before submitting a vote (#6800) (#6839)
    • Perform self-health checks before suspecting other nodes (#6745) (#6836)
    • Add IManagementGrain.GetActivationAddress() (#6816) (#6827)
    • In GrainId.ToString(), display the grain type name and format the key properly (#6774)
    • Add ADO.NET Provider support MySqlConnector 0.x and 1.x. (#6831)
  • Non-breaking bug fixes
    • Avoid race for stateless worker grains with activation limit #6795 (#6796) (#6803)
    • Fix bad merge of GrainInterfaceMap (#6767)
    • Make Activation Data AsyncDisposable (#6761)

v3.4.0 RC1

10 Dec 00:22
3f344a7
Pre-release

Improved resiliency during severe performance degradation

This release includes improvements to the cluster membership algorithm which are opt-in in this initial release. These changes are aimed at improving the accuracy of cluster membership when some or all nodes are in a degraded state. Details follow.

Perform self-health checks before suspecting other nodes (#6745)

This PR implements some of the ideas from Lifeguard (paper, talk, blog) which can help during times of catastrophe, where a large portion of a cluster is in a state of partial failure. One cause of these kinds of partial failures is large-scale thread pool starvation, which can cause a node to run slowly enough that it does not process messages in a timely manner. Slow nodes can therefore suspect healthy nodes simply because the slow node is not able to process the healthy node's timely response. If a sufficiently large proportion of nodes in a cluster are slow (e.g., due to an application bug), then healthy nodes may have trouble joining and remaining in the cluster, since the slow nodes can evict them. In this scenario, slow nodes will also be evicting each other. The intention is to improve cluster stability in these scenarios.

This PR introduces LocalSiloHealthMonitor which uses heuristics to score the local silo's health. A low score (0) represents a healthy node and a high score (1 to 8) represents an unhealthy node.

LocalSiloHealthMonitor implements the following heuristics:

  • Check that this silo is marked as Active in membership
  • Check that no other silo suspects this silo
  • Check for recently received successful ping responses
  • Check for recently received ping requests
  • Check that the .NET Thread Pool is able to execute work items within 1 second from enqueue time
  • Check that local async timers have been firing on-time (within 3 seconds of their due time)

Failing heuristics contribute to increased probe timeouts, which has two effects:

  • Improves the chance of a successful probe to a healthy node
  • Increases the time taken for an unhealthy node to vote a healthy node dead, giving the cluster a larger chance of voting the unhealthy node dead first (Nodes marked as dead are pacified and cannot vote others)

The effects of this feature are disabled by default in this release; only passive background monitoring is enabled. The extended probe timeouts feature can be enabled by setting ClusterMembershipOptions.ExtendProbeTimeoutDuringDegradation to true. The passive background monitoring period can be configured by changing ClusterMembershipOptions.LocalHealthDegradationMonitoringPeriod from its default value of 10 seconds.

Probe silos indirectly before submitting a vote (#6800)

This PR adds support for indirectly pinging silos before suspecting/declaring them dead.
When a silo is one missed probe away from being voted dead, the monitoring silo switches to indirect pings. In this mode, the silo picks another silo at random and sends it a request to probe the target silo. If that silo responds promptly with a negative acknowledgement (after waiting for a specified timeout), then the target silo will be suspected/declared dead.

Additionally, when the vote limit to declare a silo dead is 2 silos, a negative acknowledgement counts for both required votes and the silo is unilaterally declared dead.

The feature is disabled by default in this release - only direct probes are used by default - but it may be enabled by default in a later release, and users can enable it now by setting ClusterMembershipOptions.EnableIndirectProbes to true.

Improvements and bug fixes since 3.3.0

  • Non-breaking improvements
    • Probe silos indirectly before submitting a vote (#6800) (#6839)
    • Perform self-health checks before suspecting other nodes (#6745) (#6836)
    • Add IManagementGrain.GetActivationAddress() (#6816) (#6827)
    • In GrainId.ToString(), display the grain type name and format the key properly (#6774)
  • Non-breaking bug fixes
    • Avoid race for stateless worker grains with activation limit #6795 (#6796) (#6803)
    • Fix bad merge of GrainInterfaceMap (#6767)
    • Make Activation Data AsyncDisposable (#6761)

v3.3.0

09 Sep 20:34
baa1dc8

Improved diagnostics for long-running, delayed, and blocked requests

This release includes improvements to give developers additional context when a request does not return promptly. PR #6672 added these improvements. Orleans will periodically probe active grains to inspect their message queues and send status updates for certain requests which have been enqueued or executing for too long. These status messages will appear as warnings in the logs and will also be included in exceptions when a request timeout occurs. The information included can help a developer identify what the grain is doing at the time of the request: for example, which messages are enqueued ahead of this one, which messages are executing and how long they have been executing, how long this message has been enqueued, and the status of the grain's TaskScheduler.

Microsoft.Orleans.Hosting.Kubernetes NuGet package (3.3.0-beta1) for tighter integration with Kubernetes

This release adds a new pre-release package, Microsoft.Orleans.Hosting.Kubernetes, which adds richer integration for users hosting on Kubernetes. The package assists users by monitoring Kubernetes for silo pods and reflecting changes in cluster membership. For example, when a Pod is deleted, it is immediately removed from Orleans' membership. In addition, the package configures EndpointOptions and ClusterOptions to match the Pod's environment. Documentation and a sample project are expected in the coming weeks; in the meantime, please see the original PR for more information: #6707.

Improvements and bug fixes since 3.2.0.

  • Potentially breaking change

    • Added 'RecordExists' flag to persistent store so that grains can det… (#6580)
      (Implementations of IStorage<TState> and IGrainState need to be updated to add a RecordExists property.)
  • Non-breaking improvements

    • Use "static" client observer to notify from the gateway when the silo is shutting down (#6613)
    • More graceful termination of network connections (#6557) (#6625)
    • Use TaskCompletionSource.RunContinuationsAsynchronously (#6573)
    • Observe discarded ping task results (#6577)
    • Constrain work done under a lock in BatchWorker (#6586)
    • Support deterministic builds with CodeGenerator (#6592)
    • Fix some xUnit test discovery issues (#6584)
    • Delete old Joining records as part of cleanup of defunct entries (#6601, #6624)
    • Propagate transaction exceptions in more cases (#6615)
    • SocketConnectionListener: allow address reuse (#6653)
    • Improve ClusterClient disposal (#6583)
    • AAD authentication for Azure providers (blob, queue & table) (#6648)
    • Make delay after gw shutdown notification configurable (#6679)
    • Tweak shutdown completion signalling (#6685) (#6696)
    • Close some kinds of misbehaving connections during shutdown (#6684) (#6695)
    • Send status messages for long-running and blocked requests (#6672) (#6694)
    • Kubernetes hosting integration (#6707) (#6721)
    • Reduce log noise (#6705)
    • Upgrade AWS dependencies to their latest versions. (#6723)
  • Non-breaking bug fixes

    • Fix SequenceNumber for MemoryStream (#6622) (#6623)
    • When activation is stuck, make sure to unregister from the directory before forwarding messages (#6593)
    • Fix call pattern that throws. (#6626)
    • Avoid NullReferenceException in Message.TargetAddress (#6635)
    • Fix unobserved ArgumentOutOfRangeException from Task.Delay (#6640)
    • Fix bad merge (#6656)
    • Avoid race in GatewaySender.Send (#6655)
    • Ensure that only one instance of IncomingRequestMonitor is created (#6714)

v3.3.0-rc2

03 Sep 00:20
47a1fa6

Improvements and bug fixes since 3.3.0-rc1.

  • Non-breaking improvements

    • Kubernetes hosting integration (#6707) (#6721)
    • Reduce log noise (#6705)
    • Upgrade AWS dependencies to their latest versions. (#6723)
  • Non-breaking bug fixes

    • Ensure that only one instance of IncomingRequestMonitor is created (#6714)

v3.3.0-rc1

19 Aug 18:25
6eec48d

Improvements and bug fixes since 3.2.2.

  • Non-breaking improvements
    • Improve ClusterClient disposal (#6583)
    • Added 'RecordExists' flag to persistent store so that grains can det… (#6580)
    • AAD authentication for Azure providers (blob, queue & table) (#6648)
    • Make delay after gw shutdown notification configurable (#6679)
    • Tweak shutdown completion signalling (#6685) (#6696)
    • Close some kinds of misbehaving connections during shutdown (#6684) (#6695)
    • Send status messages for long-running and blocked requests (#6672) (#6694)

Improved diagnostics for long-running, delayed, and blocked requests

This release includes improvements to give developers additional context when a request does not return promptly. PR #6672 added these improvements. Orleans will periodically probe active grains to inspect their message queues and send status updates for certain requests which have been enqueued or executing for too long. These status messages will appear as warnings in the logs and will also be included in exceptions when a request timeout occurs. The information included can help a developer identify what the grain is doing at the time of the request: for example, which messages are enqueued ahead of this one, which messages are executing and how long they have been executing, how long this message has been enqueued, and the status of the grain's TaskScheduler.

v3.2.2

22 Jul 23:07
dc76212

Improvements and bug fixes since 3.2.1.

  • Non-breaking improvements

    • SocketConnectionListener: allow address reuse (#6653)
  • Non-breaking bug fixes

    • Avoid NullReferenceException in Message.TargetAddress (#6635)
    • Fix unobserved ArgumentOutOfRangeException from Task.Delay (#6640)
    • Fix bad merge (#6656)
    • Avoid race in GatewaySender.Send (#6655)

v3.2.1

06 Jul 19:09

Improvements and bug fixes since 3.2.0.

  • Non-breaking improvements

    • Use "static" client observer to notify from the gateway when the silo is shutting down (#6613)
    • More graceful termination of network connections (#6557) (#6625)
    • Use TaskCompletionSource.RunContinuationsAsynchronously (#6573)
    • Observe discarded ping task results (#6577)
    • Constrain work done under a lock in BatchWorker (#6586)
    • Support deterministic builds with CodeGenerator (#6592)
    • Fix some xUnit test discovery issues (#6584)
    • Delete old Joining records as part of cleanup of defunct entries (#6601, #6624)
    • Propagate transaction exceptions in more cases (#6615)
  • Non-breaking bug fixes

    • Fix SequenceNumber for MemoryStream (#6622) (#6623)
    • When activation is stuck, make sure to unregister from the directory before forwarding messages (#6593)
    • Fix call pattern that throws. (#6626)

v3.2.0

05 Jun 00:01
6c7b270

3.2.0 includes two major changes

  • Pluggable grain directory

This feature allows using external storage as an option for keeping grain directory information. Directory plugins can be configured for different grain classes independently, so that different consistency/availability tradeoffs can be made for different grain classes (a configuration sketch follows after this list).
As part of this change, we had to remove support for multi-cluster functionality. We intend to bring it back as a grain directory plugin at a later time. Removal of multi-clustering is the only breaking change, and it only affects you if you used that feature previously.

  • Switch to using .NET thread pool for scheduling

Since the initial release, Orleans has been using its own custom thread pool implementation to make up for deficiencies in the .NET thread pool. Since then, the .NET thread pool has improved significantly, and there is no longer a need for a separate solution within Orleans.

We measured a performance increase of 12% to 20% in 3.2.0 compared to 3.1.6, depending on the test scenario.
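
As referenced above, here is a sketch of wiring a grain directory plugin to a specific grain class. It uses the Redis grain directory added in this release; the package (Microsoft.Orleans.GrainDirectory.Redis), the AddRedisGrainDirectory extension, the attribute usage, and ConfigurationOptions (StackExchange.Redis's connection settings) are illustrative rather than spelled out in these notes, so adapt them to the directory implementation you choose.

```csharp
// Silo configuration: register a named grain directory backed by Redis.
siloBuilder.AddRedisGrainDirectory(
    "user-directory",
    options => options.ConfigurationOptions = ConfigurationOptions.Parse("localhost:6379"));

// Grain classes opt in to a named directory individually; classes without the
// attribute keep using the default in-memory distributed directory.
public interface IUserGrain : IGrainWithStringKey { }

[GrainDirectory(GrainDirectoryName = "user-directory")]
public class UserGrain : Grain, IUserGrain
{
}
```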

Other improvements and bug fixes since 3.1.0.

  • Breaking changes

    • Remove current multicluster implementation (#6498)
  • Non-breaking improvements

    • Remove new() constraint for grain persistence (#6351)
    • Improve TLS troubleshooting experience (#6352)
    • Remove unnecessary RequestContext.Clear in networking (#6357)
    • Cleanup GrainBasedReminderTable (#6355)
    • Avoid using GrainTimer in non-grain contexts (#6342)
    • Omit assembly name for all types from System namespace during codegen (#6394)
    • Fix System namespace classification in Orleans.CodeGenerator (#6396)
    • Reduce port clashes in TestCluster (#6399, #6413)
    • Use the overload of ConcurrentDictionary.GetOrAdd that takes a method (#6409)
    • Ignore not found exception when clearing azure queues (#6419)
    • MembershipTableCleanupAgent: dispose timer if cleanup is unsupported (#6415)
    • Allow grain call filters to retry calls (#6414)
    • Avoid most cases of loggers with non-static category names (#6430)
    • Free SerializationContext and DeserializationContext between calls (#6433)
    • Don't use iowait in cpu calcs on linux (#6444)
    • TLS: specify an application protocol to satisfy ALPN (#6455)
    • Change the error about not supported membership table cleanup functionality into a warning. (#6447)
    • Update obsoletion warning for ISiloBuilderConfigurator (#6461)
    • Allow GatewayManager initialization to be retried (#6459)
    • Added eventIndex (#6467)
    • Send rejections for messages enqueued on stopped outbound queue (#6474)
    • Stopped WorkItemGroup logging enhancement (#6483)
    • Streamline LINQ/Enumerable use (#6482)
    • Support for pluggable grain directory (#6340, #6354, #6366, #6385, #6473, #6485, #6502, #6524)
    • Expose timeouts for Azure Table Storage (#6462, #6501, #6509)
    • Schedule Tasks and WorkItems on .NET ThreadPool (#6261)
    • Schedule received messages onto thread pool in Connection.ProcessIncoming (#6263)
    • Remove AsyncAgent, Executor and related (#6264)
    • Reorient RuntimeContext around IGrainContext (#6365)
    • Remove Message.DebugContext and related code (#6323)
    • Remove obviated GrainId constructor and associated code (#6322)
    • Set isolation level to READ COMMITTED to avoid Gap Lock issues (#6331)
    • AdoNet: Rename Storage table to OrleansStorage for consistency with other tables. (#6336)
    • Remove unnecessary provider runtime members (#6362)
    • Remove ClientInvokeCallback (#6364)
    • Remove ProcessExitHandlingOptions (#6369)
    • Simplify OrleansTaskScheduler (#6370)
    • Remove IServiceProvider from IGrainContext (#6372)
    • Streamline MemoryStorage and InMemoryReminderTable (#6315)
    • Fix test glitch in PersistenceProvider_Memory_FixedLatency_WriteRead (#6378)
    • Fix errors reported by GitHub Semmle code analysis tools. (#6374)
    • Remove Microsoft prefix from logging categories (#6431)
    • Streamline Dictionary use and remove some dead code (#6439)
    • Make methods on AITelemetryConsumer virtual; clean-up (#6469)
    • Remove IHostedClient abstraction (#6475)
    • Only allocate an array for lengths when array rank is greater than 3 (#6493)
    • Support ValueTask as [OneWay] Methods Return Type (#6521)
    • Grain Directory Redis implementation (#6543, #6569, #6570, #6571)
  • Non-breaking bug fixes

    • Fix CleanupDefunctSiloMembership & MembershipTableTests (#6344)
    • Schedule IMembershipTable.CleanupDefunctSiloEntries more frequently (#6346)
    • CodeGenerator fixes (#6347)
    • Fix handling of gateways in Orleans.TestingHost (#6348)
    • Avoid destructuring in log templates (#6356)
    • Clear RequestContext after use (#6358)
    • Amended Linux stats registration to add services on Linux only (#6375)
    • Update performance counter dependencies (#6397)
    • Reminders period overflow issue in ADO.NET Reminders Table (#6390)
    • Read only the body segment from EventData (#6412)
    • Consistently sanitize RowKey & PartitionKey properties for Azure Table Storage reminders implementation (#6460)
    • Gossip that the silo is dead before the outbound queue gets closed (#6480)
    • Fix a race condition in LifecycleSubject (#6481)
    • Fix log message (#6408)
    • Do not reject rejection messages locally. Drop them instead (#6525)
    • LocalGrainDirectory.UnregisterManyAsync should always be called from RemoteGrainDirectory context (#6575)

v3.2.0-rc2

20 May 01:11
Pre-release

Improvements and bug fixes since 3.2.0-rc1.

  • Non-breaking bug fixes
    • Do not reject rejection messages locally. Drop them instead (#6525)