Releases: dotnet/orleans
v3.4.1
Kubernetes hosting package marked as stable
The Microsoft.Orleans.Kubernetes.Hosting package is now marked as stable. This package is intended to help users who are deploying to Kubernetes by automating configuration of silos, monitoring Kubernetes for changes in the active pods, and terminating pods which are marked as defunct by the Orleans cluster. Please try it and give us your feedback. Documentation is available here and a sample project is available here.
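For orientation, here is a minimal hosting sketch. It assumes the package's UseKubernetesHosting() silo-builder extension and a standard generic-host setup; the linked documentation and sample remain the authoritative reference.

```csharp
using Microsoft.Extensions.Hosting;
using Orleans.Hosting;

// Minimal sketch (assumptions noted above): host a silo whose clustering
// configuration is derived from the Kubernetes pod it runs in.
var host = new HostBuilder()
    .UseOrleans(siloBuilder =>
    {
        // Provided by Microsoft.Orleans.Kubernetes.Hosting: reads pod metadata,
        // watches peer pods, and removes defunct pods from membership.
        siloBuilder.UseKubernetesHosting();

        // Clustering/storage providers are configured as usual for your environment.
    })
    .Build();

await host.RunAsync();
```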
Improvements and bug fixes since 3.4.0
- Non-breaking improvements
- Non-breaking bug fixes
- Fix leak in RunningRequestSenders (#6903)
- Avoid disposing uncompleted task in LinuxEnvironmentStatistics (#6842) (#6887)
- Only log that a message is being forwarded if it is being forwarded (#6892) (#6910)
- In GrainDirectoryPartition, throw an exception instead of returning null if trying to register an activation on a non-valid silo (#6896) (#6901)
- Do not retry to send streaming events if the pulling agent has been stopped (#6897) (#6900)
- Try to limit forwarding when a grain activation throws an exception in OnActivateAsync() (#6891) (#6893)
v3.4.0
Improved resiliency during severe performance degradation
This release includes improvements to the cluster membership algorithm which are opt-in in this initial release. These changes are aimed at improving the accuracy of cluster membership when some or all nodes are in a degraded state. Details follow.
Perform self-health checks before suspecting other nodes (#6745)
This PR implements some of the ideas from Lifeguard (paper, talk, blog), which can help during times of catastrophe, where a large portion of a cluster is in a state of partial failure. One cause for these kinds of partial failures is large-scale thread pool starvation, which can cause a node to run slowly enough to not process messages in a timely manner. Slow nodes can therefore suspect healthy nodes simply because the slow node is not able to process the healthy node's timely response. If a sufficiently large proportion of nodes in a cluster are slow (e.g., due to an application bug), then healthy nodes may have trouble joining and remaining in the cluster, since the slow nodes can evict them. In this scenario, slow nodes will also be evicting each other. The intention is to improve cluster stability in these scenarios.
This PR introduces LocalSiloHealthMonitor, which uses heuristics to score the local silo's health. A low score (0) represents a healthy node and a high score (1 to 8) represents an unhealthy node. LocalSiloHealthMonitor implements the following heuristics:
- Check that this silo is marked as Active in membership
- Check that no other silo suspects this silo
- Check for recently received successful ping responses
- Check for recently received ping requests
- Check that the .NET Thread Pool is able to execute work items within 1 second from enqueue time (see the sketch after this list)
- Check that local async timers have been firing on-time (within 3 seconds of their due time)
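The sketch below is only a conceptual illustration of the thread pool heuristic (it is not Orleans' implementation): it measures the gap between enqueueing a work item and the work item actually starting to run.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Conceptual sketch only: a delay of more than ~1 second between enqueue and
// execution suggests thread pool starvation on the local node.
var stopwatch = Stopwatch.StartNew();
ThreadPool.QueueUserWorkItem(_ =>
{
    var delay = stopwatch.Elapsed;
    Console.WriteLine(delay > TimeSpan.FromSeconds(1)
        ? $"Thread pool delayed by {delay}: local silo may be degraded."
        : $"Thread pool healthy (queue delay {delay}).");
});

Thread.Sleep(2000); // keep the sample process alive long enough for the work item to run
```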
Failing heuristics contribute to increased probe timeouts, which has two effects:
- Improves the chance of a successful probe to a healthy node
- Increases the time taken for an unhealthy node to vote a healthy node dead, giving the cluster a larger chance of voting the unhealthy node dead first (Nodes marked as dead are pacified and cannot vote others)
The effects of this feature are disabled by default in this release, with only passive background monitoring being enabled. The extended probe timeouts feature can be enabled by setting ClusterMembershipOptions.ExtendProbeTimeoutDuringDegradation to true. The passive background monitoring period can be configured by changing ClusterMembershipOptions.LocalHealthDegradationMonitoringPeriod from its default value of 10 seconds.
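As a rough illustration, the two options above can be set on the silo builder like this (a minimal sketch; only the properties named in these notes are shown, and the rest of the silo configuration is assumed):

```csharp
using System;
using Microsoft.Extensions.Hosting;
using Orleans.Configuration;
using Orleans.Hosting;

var host = new HostBuilder()
    .UseOrleans(siloBuilder =>
    {
        siloBuilder.Configure<ClusterMembershipOptions>(options =>
        {
            // Opt in to extended probe timeouts during local health degradation (off by default).
            options.ExtendProbeTimeoutDuringDegradation = true;

            // How often passive background self-health monitoring runs (default: 10 seconds).
            options.LocalHealthDegradationMonitoringPeriod = TimeSpan.FromSeconds(10);
        });
    })
    .Build();

await host.RunAsync();
```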
Probe silos indirectly before submitting a vote (#6800)
This PR adds support for indirectly pinging silos before suspecting/declaring them dead.
When a silo is one missed probe away from being voted dead, the monitoring silo switches to indirect pings. In this mode, the silo picks another silo at random and sends it a request to probe the target silo. If that intermediary silo responds promptly with a negative acknowledgement (after waiting a specified timeout for the target to respond), then the target silo will be suspected/declared dead.
Additionally, when the vote limit to declare a silo dead is 2 silos, a negative acknowledgement counts for both required votes and the silo is unilaterally declared dead.
The feature is disabled by default in this release (only direct probes are used by default), but it could be enabled in a later release, or by users, by setting ClusterMembershipOptions.EnableIndirectProbes to true.
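Similarly, a minimal sketch of opting in to indirect probes (again, only the property named above is shown; everything else is assumed or left at defaults):

```csharp
using Microsoft.Extensions.Hosting;
using Orleans.Configuration;
using Orleans.Hosting;

var host = new HostBuilder()
    .UseOrleans(siloBuilder =>
    {
        siloBuilder.Configure<ClusterMembershipOptions>(options =>
        {
            // Indirect probes are disabled by default in 3.4.0; direct probes remain the default.
            options.EnableIndirectProbes = true;
        });
    })
    .Build();

await host.RunAsync();
```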
Improvements and bug fixes since 3.3.0
- Non-breaking improvements
- Probe silos indirectly before submitting a vote (#6800) (#6839)
- Perform self-health checks before suspecting other nodes (#6745) (#6836)
- Add IManagementGrain.GetActivationAddress() (#6816) (#6827)
- In GrainId.ToString(), display the grain type name and format the key properly (#6774)
- Add ADO.NET provider support for MySqlConnector 0.x and 1.x. (#6831)
- Non-breaking bug fixes
v3.4.0 RC1
Improved resiliency during severe performance degradation
This release includes improvements to the cluster membership algorithm which are opt-in in this initial release. These changes are aimed at improving the accuracy of cluster membership when some or all nodes are in a degraded state. Details follow.
Perform self-health checks before suspecting other nodes (#6745)
This PR implements some of the ideas from Lifeguard (paper, talk, blog), which can help during times of catastrophe, where a large portion of a cluster is in a state of partial failure. One cause for these kinds of partial failures is large-scale thread pool starvation, which can cause a node to run slowly enough to not process messages in a timely manner. Slow nodes can therefore suspect healthy nodes simply because the slow node is not able to process the healthy node's timely response. If a sufficiently large proportion of nodes in a cluster are slow (e.g., due to an application bug), then healthy nodes may have trouble joining and remaining in the cluster, since the slow nodes can evict them. In this scenario, slow nodes will also be evicting each other. The intention is to improve cluster stability in these scenarios.
This PR introduces LocalSiloHealthMonitor, which uses heuristics to score the local silo's health. A low score (0) represents a healthy node and a high score (1 to 8) represents an unhealthy node. LocalSiloHealthMonitor implements the following heuristics:
- Check that this silo is marked as Active in membership
- Check that no other silo suspects this silo
- Check for recently received successful ping responses
- Check for recently received ping requests
- Check that the .NET Thread Pool is able to execute work items within 1 second from enqueue time
- Check that local async timers have been firing on-time (within 3 seconds of their due time)
Failing heuristics contribute to increased probe timeouts, which has two effects:
- Improves the chance of a successful probe to a healthy node
- Increases the time taken for an unhealthy node to vote a healthy node dead, giving the cluster a larger chance of voting the unhealthy node dead first (Nodes marked as dead are pacified and cannot vote others)
The effects of this feature are disabled by default in this release, with only passive background monitoring being enabled. The extended probe timeouts feature can be enabled by setting ClusterMembershipOptions.ExtendProbeTimeoutDuringDegradation to true. The passive background monitoring period can be configured by changing ClusterMembershipOptions.LocalHealthDegradationMonitoringPeriod from its default value of 10 seconds.
Probe silos indirectly before submitting a vote (#6800)
This PR adds support for indirectly pinging silos before suspecting/declaring them dead.
When a silo is one missed probe away from being voted dead, the monitoring silo switches to indirect pings. In this mode, the silo picks another silo at random and sends it a request to probe the target silo. If that intermediary silo responds promptly with a negative acknowledgement (after waiting a specified timeout for the target to respond), then the target silo will be suspected/declared dead.
Additionally, when the vote limit to declare a silo dead is 2 silos, a negative acknowledgement counts for both required votes and the silo is unilaterally declared dead.
The feature is disabled by default in this release (only direct probes are used by default), but it could be enabled in a later release, or by users, by setting ClusterMembershipOptions.EnableIndirectProbes to true.
Improvements and bug fixes since 3.3.0
- Non-breaking improvements
- Non-breaking bug fixes
v3.3.0
Improved diagnostics for long-running, delayed, and blocked requests
This release includes improvements to give developers additional context when a request does not return promptly. PR #6672 added these improvements. Orleans will periodically probe active grains to inspect their message queues and send status updates for certain requests which have been enqueued or executing for too long. These status messages will appear as warnings in the logs and will also be included in exceptions when a request timeout occurs. The included information can help a developer identify what the grain is doing at the time of the request: for example, which messages are enqueued ahead of this message, which messages are executing and for how long, how long this message has been enqueued, and the status of the grain's TaskScheduler.
Microsoft.Orleans.Hosting.Kubernetes NuGet package (3.3.0-beta1) for tighter integration with Kubernetes
This release adds a new pre-release package, Microsoft.Orleans.Hosting.Kubernetes, which adds richer integration for users hosting on Kubernetes. The package assists users by monitoring Kubernetes for silo pods and reflecting changes in cluster membership. For example, when a Pod is deleted, it is immediately removed from Orleans' membership. In addition, the package configures EndpointOptions and ClusterOptions to match the Pod's environment. Documentation and a sample project are expected in the coming weeks; in the meantime, please see the original PR for more information: #6707.
Improvements and bug fixes since 3.2.0.
- Potentially breaking change
- Added 'RecordExists' flag to persistent store so that grains can det… (#6580) (Implementations of IStorage<TState> and IGrainState need to be updated to add a RecordExists property.)
- Non-breaking improvements
- Use "static" client observer to notify from the gateway when the silo is shutting down (#6613)
- More graceful termination of network connections (#6557) (#6625)
- Use TaskCompletionSource.RunContinuationsAsynchronously (#6573)
- Observe discarded ping task results (#6577)
- Constrain work done under a lock in BatchWorker (#6586)
- Support deterministic builds with CodeGenerator (#6592)
- Fix some xUnit test discovery issues (#6584)
- Delete old Joining records as part of cleanup of defunct entries (#6601, #6624)
- Propagate transaction exceptions in more cases (#6615)
- SocketConnectionListener: allow address reuse (#6653)
- Improve ClusterClient disposal (#6583)
- AAD authentication for Azure providers (blob, queue & table) (#6648)
- Make delay after gw shutdown notification configurable (#6679)
- Tweak shutdown completion signalling (#6685) (#6696)
- Close some kinds of misbehaving connections during shutdown (#6684) (#6695)
- Send status messages for long-running and blocked requests (#6672) (#6694)
- Kubernetes hosting integration (#6707) (#6721)
- Reduce log noise (#6705)
- Upgrade AWS dependencies to their latest versions. (#6723)
- Non-breaking bug fixes
- Fix SequenceNumber for MemoryStream (#6622) (#6623)
- When activation is stuck, make sure to unregister from the directory before forwarding messages (#6593)
- Fix call pattern that throws. (#6626)
- Avoid NullReferenceException in Message.TargetAddress (#6635)
- Fix unobserved ArgumentOutOfRangeException from Task.Delay (#6640)
- Fix bad merge (#6656)
- Avoid race in GatewaySender.Send (#6655)
- Ensure that only one instance of IncomingRequestMonitor is created (#6714)
v3.3.0-rc2
v3.3.0-rc1
Improvements and bug fixes since 3.2.2.
- Non-breaking improvements
- Improve ClusterClient disposal (#6583)
- Added 'RecordExists' flag to persistent store so that grains can det… (#6580)
- AAD authentication for Azure providers (blob, queue & table) (#6648)
- Make delay after gw shutdown notification configurable (#6679)
- Tweak shutdown completion signalling (#6685) (#6696)
- Close some kinds of misbehaving connections during shutdown (#6684) (#6695)
- Send status messages for long-running and blocked requests (#6672) (#6694)
Improved diagnostics for long-running, delayed, and blocked requests
This release includes improvements to give developers additional context when a request does not return promptly. PR #6672 added these improvements. Orleans will periodically probe active grains to inspect their message queues and send status updates for certain requests which have been enqueued or executing for too long. These status messages will appear as warnings in the logs and will also be included in exceptions when a request timeout occurs. The included information can help a developer identify what the grain is doing at the time of the request: for example, which messages are enqueued ahead of this message, which messages are executing and for how long, how long this message has been enqueued, and the status of the grain's TaskScheduler.
v3.2.2
Improvements and bug fixes since 3.2.1.
v3.2.1
Improvements and bug fixes since 3.2.0.
- Non-breaking improvements
- Use "static" client observer to notify from the gateway when the silo is shutting down (#6613)
- More graceful termination of network connections (#6557) (#6625)
- Use TaskCompletionSource.RunContinuationsAsynchronously (#6573)
- Observe discarded ping task results (#6577)
- Constrain work done under a lock in BatchWorker (#6586)
- Support deterministic builds with CodeGenerator (#6592)
- Fix some xUnit test discovery issues (#6584)
- Delete old Joining records as part of cleanup of defunct entries (#6601, #6624)
- Propagate transaction exceptions in more cases (#6615)
- Non-breaking bug fixes
v3.2.0
3.2.0 includes two major changes:
- Pluggable grain directory
This feature allows external storage to be used as an option for keeping grain directory information. Directory plugins can be configured for different grain classes independently, so that different consistency/availability tradeoffs can be made for different grain classes (see the sketch after this list).
As part of this change, we had to remove support for multi-cluster functionality. We intend to bring it back as a grain directory plugin at a later time. Removal of multi-clustering is the only breaking change, and it only affects you if you used the feature previously.
- Switch to using .NET thread pool for scheduling
Since its initial release, Orleans has been using its own custom thread pool implementation to make up for the deficiencies in the .NET thread pool. Since then, the .NET thread pool has improved significantly, and there is no longer a need for a separate solution within Orleans.
We measured a performance increase of 12% to 20% in 3.2.0 compared to 3.1.6, depending on the test scenario.
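As a sketch of how a grain class opts in to a non-default directory: the attribute usage below follows the per-class configuration this feature describes, but the directory name is an arbitrary example and must match a directory provider registered on the silo (such as the Redis implementation listed under the improvements).

```csharp
using System.Threading.Tasks;
using Orleans;
using Orleans.GrainDirectory;

public interface IPlayerGrain : IGrainWithStringKey
{
    Task<string> GetName();
}

// Sketch: this grain class uses the named directory "my-redis-directory" instead of
// the default distributed in-memory directory. The name is a placeholder and must
// match a directory registered during silo configuration.
[GrainDirectory(GrainDirectoryName = "my-redis-directory")]
public class PlayerGrain : Grain, IPlayerGrain
{
    public Task<string> GetName() => Task.FromResult(this.GetPrimaryKeyString());
}
```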
Other improvements and bug fixes since 3.1.0.
- Breaking changes
- Remove current multicluster implementation (#6498)
- Non-breaking improvements
- Remove new() constraint for grain persistence (#6351)
- Improve TLS troubleshooting experience (#6352)
- Remove unnecessary RequestContext.Clear in networking (#6357)
- Cleanup GrainBasedReminderTable (#6355)
- Avoid using GrainTimer in non-grain contexts (#6342)
- Omit assembly name for all types from System namespace during codegen (#6394)
- Fix System namespace classification in Orleans.CodeGenerator (#6396)
- Reduce port clashes in TestCluster (#6399, #6413)
- Use the overload of ConcurrentDictionary.GetOrAdd that takes a method (#6409)
- Ignore not found exception when clearing azure queues (#6419)
- MembershipTableCleanupAgent: dispose timer if cleanup is unsupported (#6415)
- Allow grain call filters to retry calls (#6414)
- Avoid most cases of loggers with non-static category names (#6430)
- Free SerializationContext and DeserializationContext between calls (#6433)
- Don't use iowait in cpu calcs on linux (#6444)
- TLS: specify an application protocol to satisfy ALPN (#6455)
- Change the error about not supported membership table cleanup functionality into a warning. (#6447)
- Update obsoletion warning for ISiloBuilderConfigurator (#6461)
- Allow GatewayManager initialization to be retried (#6459)
- Added eventIndex (#6467)
- Send rejections for messages enqueued on stopped outbound queue (#6474)
- Stopped WorkItemGroup logging enhancement (#6483)
- Streamline LINQ/Enumerable use (#6482)
- Support for pluggable grain directory (#6340, #6354, #6366, #6385, #6473, #6485, #6502, #6524)
- Expose timeouts for Azure Table Storage (#6462, #6501, #6509)
- Schedule Tasks and WorkItems on .NET ThreadPool (#6261)
- Schedule received messages onto thread pool in Connection.ProcessIncoming (#6263)
- Remove AsyncAgent, Executor and related (#6264)
- Reorient RuntimeContext around IGrainContext (#6365)
- Remove Message.DebugContext and related code (#6323)
- Remove obviated GrainId constructor and associated code (#6322)
- Set isolation level to READ COMMITTED to avoid Gap Lock issues (#6331)
- AdoNet: Rename Storage table to OrleansStorage for consistency with other tables. (#6336)
- Avoid using GrainTimer in non-grain contexts (#6342)
- Remove unnecessary provider runtime members (#6362)
- Remove ClientInvokeCallback (#6364)
- Remove ProcessExitHandlingOptions (#6369)
- Simplify OrleansTaskScheduler (#6370)
- Remove IServiceProvider from IGrainContext (#6372)
- Streamline MemoryStorage and InMemoryReminderTable (#6315)
- Fix test glitch in PersistenceProvider_Memory_FixedLatency_WriteRead (#6378)
- Fix errors reported by GitHub Semmle code analysis tools. (#6374)
- Remove Microsoft prefix from logging categories (#6431)
- Streamline Dictionary use and remove some dead code (#6439)
- Make methods on AITelemetryConsumer virtual; clean-up (#6469)
- Remove IHostedClient abstraction (#6475)
- Only allocate an array for lengths when array rank is greater than 3 (#6493)
- Support ValueTask as [OneWay] Methods Return Type (#6521)
- Grain Directory Redis implementation (#6543, #6569, #6570, #6571)
- Non-breaking bug fixes
- Fix CleanupDefunctSiloMembership & MembershipTableTests (#6344)
- Schedule IMembershipTable.CleanupDefunctSiloEntries more frequently (#6346)
- CodeGenerator fixes (#6347)
- Fix handling of gateways in Orleans.TestingHost (#6348)
- Avoid destructuring in log templates (#6356)
- Clear RequestContext after use (#6358)
- Amended Linux stats registration to add services on Linux only (#6375)
- Update performance counter dependencies (#6397)
- Reminders period overflow issue in ADO.NET Reminders Table (#6390)
- Read only the body segment from EventData (#6412)
- Consistently sanitize RowKey & PartitionKey properties for Azure Table Storage reminders implementation (#6460)
- Gossip that the silo is dead before the outbound queue gets closed (#6480)
- Fix a race condition in LifecycleSubject (#6481)
- Fix log message (#6408)
- Do not reject rejection messages locally. Drop them instead (#6525)
- LocalGrainDirectory.UnregisterManyAsync should always be called from RemoteGrainDirectory context (#6575)
v3.2.0-rc2
Improvements and bug fixes since 3.2.0-rc1
- Non-breaking bug fixes
- Do not reject rejection messages locally. Drop them instead (#6525)