Here is an interesting problem I faced recently that took quite a bit of time to figure out.
Let’s say you have a core library that multi-targets netstandard2.0 and net8.0. The library could have a bunch of stuff, like helpers for Span<T>, or anything else. For the sake of this example, the library has just one Config type with an init-only property:
// Core.csproj
// <TargetFrameworks>netstandard2.0;net8.0</TargetFrameworks>
namespace Core;
public class Config { public int X { get; init; } }
Obviously, the code won’t compile, since the netstandard2.0 version doesn’t have the IsExternalInit type. The solution sounds pretty easy, right? We just add an IsExternalInit.cs file manually (or with some MSBuild magic) with the following content:
#if NETSTANDARD2_0
namespace System.Runtime.CompilerServices;
internal class IsExternalInit;
#endif
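If you prefer the MSBuild route, the conditional include could look something like this (a sketch; the shared-folder layout is an assumption, and the point is that the file must live outside the default compile globs so it isn’t picked up twice):

```xml
<!-- Core.csproj: compile the polyfill only for the netstandard2.0 target.
     The path is illustrative; keep the file outside the project folder
     (or exclude it from the default globs) to avoid a duplicate Compile item. -->
<ItemGroup Condition="'$(TargetFramework)' == 'netstandard2.0'">
  <Compile Include="..\Shared\IsExternalInit.cs" Link="IsExternalInit.cs" />
</ItemGroup>
```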
We can either add IsExternalInit.cs conditionally to the project itself when the target is netstandard2.0 or just have #if NETSTANDARD2_0 inside of it. We can’t simply add this type for all the targets, because in that case we could face compilation errors if the Core project had an InternalsVisibleTo attribute for a test project that targets net8.0 or any other runtime that already defines the IsExternalInit type.
Now, we add another library, let’s say Library.csproj, that targets only netstandard2.0 and uses our Core.csproj. This might not be a super common case, but I’ve seen quite a few of them in the wild:
// Library.csproj
// <TargetFramework>netstandard2.0</TargetFramework>
public static class ConfigFactory
{
public static Config Create(int value) => new () { X = value };
}
And now we have a console app that targets net8.0 and just uses the factory:
// Application.exe
// <TargetFramework>net8.0</TargetFramework>
using Factory;
var config = ConfigFactory.Create(42);
Console.WriteLine("Done!");
Here is the dependency diagram:
Would you expect any issues with this code? Me neither, to be honest! But here is the output:
Unhandled exception. System.MissingMethodException: Method not found: 'Void Configuration.Config.set_X(Int32)'.
at Factory.ConfigFactory.Create(Int32 value)
at Program.<Main>$(String[] args) in Application/Program.cs:line 3
You can check the IL, and you’ll see that the set_X(Int32) “method” (which is a property setter) definitely exists in the Config class. But why do we get the error? Is it a compiler bug? Not really!
So here is the issue. Even though Core.csproj is multi-targeted, the question is: which version of Core.dll is actually deployed to the output folder? The Core.dll that targets netstandard2.0 or the Core.dll that targets net8.0? At runtime there is no such thing as ‘multi-targeting’; multi-targeting is a build-time feature! Since the Application project targets net8.0 and implicitly references Core.csproj, the net8.0 version is deployed.
Is it a problem? Actually, yes, it is. Let’s check the IL for ConfigFactory:
.method public hidebysig static class [Core]Core.Config
Create(
int32 'value'
) cil managed
{
// [7 47 - 7 67]
IL_0000: newobj instance void [Core]Core.Config::.ctor()
IL_0005: dup
IL_0006: ldarg.0 // 'value'
IL_0007: callvirt instance void modreq ([Core]System.Runtime.CompilerServices.IsExternalInit) [Core]Core.Config::set_X(int32)
IL_000c: nop
IL_000d: ret
} // end of method ConfigFactory::Create
Library.csproj targets netstandard2.0 and uses the System.Runtime.CompilerServices.IsExternalInit type from Core.dll, but at runtime we have the Core.dll that targets net8.0 with the following set_X property:
.property instance int32 X()
{
.get instance int32 Core.Config::get_X()
.set instance void modreq ([System.Runtime]System.Runtime.CompilerServices.IsExternalInit) Core.Config::set_X(int32)
} // end of property Config::X
I.e. the one that takes IsExternalInit from the System.Runtime assembly and not from the Core assembly. Yes, you can have the same types defined in different assemblies, and from the runtime’s point of view they are definitely two different types.
So, how can we solve this issue?
The simplest solution is to use a tool that has solved this problem already, for instance, the PolySharp nuget package. But if this is not an option for you for some reason, there are two solutions available.
First, you can add IsExternalInit unconditionally, but this might cause a problem with InternalsVisibleTo, as I mentioned before. The second solution is based on TypeForwardedToAttribute:
#if NETSTANDARD2_0
namespace System.Runtime.CompilerServices;
internal class IsExternalInit;
#else
[assembly: global::System.Runtime.CompilerServices.TypeForwardedTo(
typeof(global::System.Runtime.CompilerServices.IsExternalInit))]
#endif
TypeForwardedToAttribute tells the runtime where to look for types that are supposed to be in the current assembly. In this case, for the net8.0 target we’re telling the runtime that the IsExternalInit class is located in the BCL, and everything works just fine. Btw, this is the solution that the PolySharp library uses under the hood as well.
And indeed, if you don’t specify the C# language version explicitly in the project file, the version is picked based on the target framework: C# 12 for .net8, C# 11 for .net7, and C# 7.3 for Full Framework:
And even though the mapping just specifies the defaults, some people believe that the mapping is fixed and, for instance, if you’re stuck with Full Framework, you’re also stuck with C# 7.3. But this is not the case.
The actual relationship between the C# language version and the target framework is more delicate.
There are 3 ways a feature might relate to the target framework:
1. The feature is pure syntactic sugar: you just set langVersion in a project file, and the new feature works regardless of the target framework.
2. The feature requires special types that must be available at compile time.
3. The feature requires runtime support and can’t be used on older runtimes at all; for instance, using ref fields on an unsupported target produces Error CS9064: Target runtime doesn't support ref fields.
The first and the last cases are quite obvious, but the second one requires a bit of extra information. The C# compiler requires the special types to be available during compilation of the project for the feature to be usable, and it doesn’t care where the type definition is coming from: from the target framework, from a nuget package, or from the project itself.
Here is an example of using init-only setters (available since C# 9) in a project targeting netstandard 2.0:
// Project targets netstandard2.0 or net472
public record MyRecord
{
// System.Runtime.CompilerServices.IsExternalInit class is required.
public int X { get; init; }
}
namespace System.Runtime.CompilerServices
{
internal class IsExternalInit { }
}
But if you try to use some other features, like required members, you would have to add quite a bit of extra types to your compilation:
public record class MyRecord
{
// System.Runtime.CompilerServices.IsExternalInit class is required.
public int X { get; init; }
// System.Runtime.CompilerServices.RequiredMemberAttribute,
// CompilerFeatureRequiredAttribute and
// System.Diagnostics.CodeAnalysis.SetsRequiredMembersAttribute are required
public required int Y { get; set; }
}
namespace System.Runtime.CompilerServices
{
internal class IsExternalInit { }
internal class RequiredMemberAttribute : System.Attribute { }
internal sealed class CompilerFeatureRequiredAttribute(string featureName) : System.Attribute
{
public string FeatureName { get; set; } = featureName;
}
}
namespace System.Diagnostics.CodeAnalysis
{
internal class SetsRequiredMembersAttribute : System.Attribute { }
}
Adding all the attributes manually to every project is very tedious, so you can rely on some MSBuild magic to add a set of known files based on the target framework. Or you could just use something like PolySharp that uses source generation to add all the required types regardless of the target framework.
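For reference, wiring PolySharp up is just a compile-time-only package reference (the floating version here is illustrative):

```xml
<ItemGroup>
  <!-- PolySharp source-generates the missing compiler types.
       It is a build-time-only dependency, hence PrivateAssets="all". -->
  <PackageReference Include="PolySharp" Version="1.*" PrivateAssets="all" />
</ItemGroup>
```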
There is an issue with the case shown before. Let’s say you have A.csproj targeting netstandard2.0 and A.Tests.csproj targeting net8.0, with InternalsVisibleTo("A.Tests") inside A.csproj.
In this case, you won’t be able to compile A.Tests.csproj: you’ll get an error about a duplicate member definition, since a type like IsExternalInit would be available from two places - from A.csproj and from the net8.0 runtime library.
The solution is pretty simple: multi-target A.csproj to both netstandard2.0 and net8.0.
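In project-file terms, the fix is a one-line change from a single TargetFramework to TargetFrameworks:

```xml
<!-- A.csproj: multi-target so that A.Tests picks up the net8.0 build of A,
     where IsExternalInit comes from the BCL and no duplicate is compiled in. -->
<PropertyGroup>
  <TargetFrameworks>netstandard2.0;net8.0</TargetFrameworks>
</PropertyGroup>
```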
And here I want to show all the language features from C# 12 down to C# 8 with their requirements and a link to a GitHub issue that explains each feature.
Language Feature | Requirements |
---|---|
ref-readonly parameters | No extra requirements (1) |
Collection expressions | No extra requirements (2) |
Interceptors | InterceptsLocationAttribute (3) |
Inline Arrays | Runtime support is required: .net8+ |
nameof accessing instance members | No extra requirements |
Using aliases for any types | No extra requirements |
Primary Constructors | No extra requirements |
Lambda optional parameters | No extra requirements |
Experimental Attribute | ExperimentalAttribute (4) |
(1) ref-readonly parameters is an interesting feature. On one hand, it doesn’t require any extra types to be declared manually, but it does rely on an extra type - System.Runtime.CompilerServices.RequiresLocationAttribute. But if the compilation is missing this type, the compiler will generate it for you!
(2) System.Runtime.CompilerServices.CollectionBuilderAttribute is needed to support collection expressions for custom types.
(3) The full type name is System.Runtime.CompilerServices.InterceptsLocationAttribute
(4) The full type name is System.Diagnostics.CodeAnalysis.ExperimentalAttribute
Language Feature | Requirements |
---|---|
File-local types | No extra requirements |
ref fields a.k.a. low level struct enhancements | .net7+ |
Required properties | RequiredMemberAttribute , CompilerFeatureRequiredAttribute ,SetsRequiredMembersAttribute (1) |
Static abstract members in interfaces | .net7+ |
Numeric IntPtr | No extra requirements |
Unsigned right shift operator | No extra requirements |
utf8 string literals | System.Memory nuget or .net2.1+ |
Pattern matching on ReadOnlySpan<char> | System.Memory nuget package to get ReadOnlySpan itself |
Checked Operators | No extra requirements |
auto-default structs | No extra requirements |
Newlines in string interpolations | No extra requirements |
List patterns | System.Index , System.Range (2) |
Raw string literals | No extra requirements |
Cache delegates for static method group | No extra requirements |
nameof(parameter) | No extra requirements |
Relaxing Shift Operator | No extra requirements |
Generic attributes | No extra requirements |
(1) The full type names are System.Runtime.CompilerServices.RequiredMemberAttribute, System.Runtime.CompilerServices.CompilerFeatureRequiredAttribute and System.Diagnostics.CodeAnalysis.SetsRequiredMembersAttribute.
(2) Some features work only when targeting .net2.1+ or netstandard2.1; for instance, the following code requires System.Runtime.CompilerServices.RuntimeHelpers.GetSubArray to be available:
int[] n = new int[]{ 1 };
if (n is [1, .. var x, 2])
{
}
Language Feature | Requirements |
---|---|
Record structs | No extra requirements |
Global using directives | No extra requirements |
Improved Definite Assignment | No extra requirements |
Constant Interpolated Strings | No extra requirements |
Extended Property Patterns | No extra requirements |
Sealed record ToString | No extra requirements |
Source generators V2 API | No extra requirements |
Mix declarations and variables in deconstruction | No extra requirements |
AsyncMethodBuilder override | AsyncMethodBuilderAttribute (1) |
Enhanced #line directives | No extra requirements |
Lambda improvements | No extra requirements |
Interpolated string improvements | InterpolatedStringHandler , InterpolatedStringHandlerArgument (2) |
File-scoped namespaces | No extra requirements |
Parameterless struct constructors | No extra requirements |
CallerArgumentExpression | CallerArgumentExpressionAttribute |
(1) The full type name is System.Runtime.CompilerServices.AsyncMethodBuilderAttribute.
(2) The full type names are System.Runtime.CompilerServices.InterpolatedStringHandlerAttribute and System.Runtime.CompilerServices.InterpolatedStringHandlerArgumentAttribute.
Language Feature | Requirements |
---|---|
Target-typed new | No extra requirements |
Skip local init | SkipLocalsInitAttribute |
Lambda discard parameters | No extra requirements |
Native ints | No extra requirements |
Attributes on local functions | No extra requirements |
Function pointers | No extra requirements |
Pattern matching improvements | No extra requirements |
Static lambdas | No extra requirements |
Records | No extra requirements |
Target-typed conditional | No extra requirements |
Covariant Returns | .net5.0+ |
Extension GetEnumerator | No extra requirements |
Module initializers | ModuleInitializerAttribute (1) |
Extending partials | No extra requirements |
Top level statements | No extra requirements |
(1) The full type name is System.Runtime.CompilerServices.ModuleInitializerAttribute.
Language Feature | Requirements |
---|---|
Default Interface Methods | .net core 3.1+ |
Nullable reference types | A bunch of nullability attributes (1) |
Recursive Patterns | No extra requirements |
Async streams | Microsoft.Bcl.AsyncInterfaces or .net core 3.1+ |
Enhanced usings | No extra requirements |
Ranges | System.Index , System.Range |
Null-coalescing assignment | No extra requirements |
Alternative interpolated strings pattern | No extra requirements |
stackalloc in nested contexts | No extra requirements |
Unmanaged generic structs | No extra requirements |
Static local functions | No extra requirements |
Readonly members | No extra requirements |
(1) There are a lot of attributes: [AllowNull], [DisallowNull], [DoesNotReturn], [DoesNotReturnIf], [MaybeNull], [MaybeNullWhen], [MemberNotNull], [MemberNotNullWhen], [NotNull], [NotNullIfNotNull], [NotNullWhen]
When the application creates tens of millions of strings with a high repetition rate, such an optimization is quite helpful, and in this case it was reducing the memory footprint by about 10-15%. But when I looked into the profiling data I noticed that the string interning was a huge bottleneck: the application was spending about 96% of the execution time in spin locks inside the string table.
This presented an interesting challenge: while string de-duplication helped with memory usage, it also significantly hurt startup performance, as most calls to string.Intern were made during app initialization. Removing string interning indeed helped performance quite a lot, but I was curious whether other string de-duplication approaches might be better. So I tried a naive one based on ConcurrentDictionary<string, string>.
public static class StringCache
{
private static ConcurrentDictionary<string, string> cache = new(StringComparer.Ordinal);
public static string Intern(string str) => cache.GetOrAdd(str, str);
public static void Clear() => cache.Clear();
}
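Here is a quick sketch of how the cache behaves (the input strings are made up, and the class is repeated so the sample compiles on its own):

```csharp
using System;
using System.Collections.Concurrent;

// Two equal-but-distinct string instances...
string first = new string(new[] { 'i', 'd' });
string second = new string(new[] { 'i', 'd' });
Console.WriteLine(ReferenceEquals(first, second)); // False

// ...collapse into a single instance after going through the cache.
string a = StringCache.Intern(first);
string b = StringCache.Intern(second);
Console.WriteLine(ReferenceEquals(a, b)); // True

// Once initialization is done, drop the cache so transient strings can be collected.
StringCache.Clear();

// The class from the post, repeated so the sample is self-contained.
static class StringCache
{
    private static readonly ConcurrentDictionary<string, string> cache = new(StringComparer.Ordinal);
    public static string Intern(string str) => cache.GetOrAdd(str, str);
    public static void Clear() => cache.Clear();
}
```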
The cache currently uses a static ConcurrentDictionary<string, string>, but it can easily be made non-static and passed around as needed. Additionally, if we know that string de-duplication is only needed during application initialization, we can clear the cache once initialization is complete to avoid keeping transient strings that are not part of the final object graph. Having the ability to clear the cache solves one of the issues that a global string interning cache has.
However, the performance of this naive implementation is a concern. We need to be careful when benchmarking global state like the string interning cache, since the benchmark is executed multiple times within the same process, which can skew the data. We can clear our custom table on each iteration, but the built-in string table can’t be cleared, so measuring it properly would require running each iteration in a separate process.
But we need to start somewhere, so let’s try this benchmark first:
private List<string> _list;
[Params(10_000, 100_000, 1_000_000)]
public int Count { get; set; }
[GlobalSetup]
public void Setup()
{
_list = Enumerable.Range(1, Count).Select(n => n.ToString()).ToList();
}
[Benchmark]
public void String_Intern()
{
_list.AsParallel().ForAll(static s => string.Intern(s));
}
[Benchmark]
public void StringCache_Intern()
{
_list.AsParallel().ForAll(static s => StringCache.Intern(s));
}
In this case we’re measuring the read performance, which still might be a useful thing to check. Here are the results for .NET 8 (but they’re pretty much the same for .NET Framework as well):
| Method | Count | Mean | StdDev | Allocated |
|------------------- |-------- |-------------:|-------------:|----------:|
| String_Intern | 10000 | 3,463.7 us | 47.04 us | 4.04 KB |
| StringCache_Intern | 10000 | 114.5 us | 3.61 us | 4.01 KB |
| String_Intern | 100000 | 39,546.8 us | 1,653.10 us | 4.1 KB |
| StringCache_Intern | 100000 | 1,371.8 us | 129.97 us | 4.03 KB |
| String_Intern | 1000000 | 823,046.8 us | 16,736.25 us | 5.05 KB |
| StringCache_Intern | 1000000 | 32,094.0 us | 3,291.34 us | 4.07 KB |
Ignore the allocations, since they’re caused by PLINQ. The time looks bad! Why is the built-in version so slow?
To double-check the runtime behavior (and to look at the code under a profiler) I decided to write a “simple” console app that calls the de-duplication logic on 10M different strings multiple times. This is not the exact scenario our service has, but it might be closer to it than the benchmark.
var bm = new StringInterningBenchmarks() { Count = 10 };
bm.Setup();
bm.String_Intern();
bm.StringCache_Intern();
bm.Count = 10_000_000;
bm.Setup();
GC.Collect();
// to make it easier to see the sections in profiling session
Thread.Sleep(2_000);
var sw = Stopwatch.StartNew();
// The first call will populate the cache
// and the second one will mostly read from the cache.
for (int i = 0; i < 10; i++)
bm.StringCache_Intern();
Console.WriteLine($"Custom string interning is done in {sw.Elapsed}");
GC.Collect();
// to make it easier to see the sections in profiling session
Thread.Sleep(2_000);
sw.Restart();
for (int i = 0; i < 10; i++)
bm.String_Intern();
Console.WriteLine($"String interning is done in {sw.Elapsed}");
The results:
Custom string interning is done in 00:00:03.9975182
String interning is done in 00:01:13.9881888
The difference is still huge (about 15x). And by playing with the number of iterations, I got different ratios between the string interning and the custom cache. It seems that the string interning is drastically slower (20-30x) in terms of reads, but “just” 2-3x slower in terms of writes.
And most importantly, the string interning performance issue is not theoretical. After switching from string interning to the custom StringCache, the startup time for our service dropped by 2x! With just a simple change! Plus, we got the ability to clean up the cache to get rid of the cached strings that are not part of the final state.
But before closing this topic, let’s run the same custom benchmark with Native AOT:
Custom string interning is done in 00:00:03.3062479
String interning is done in 00:00:05.6756519
Why? The thing is that the string interning logic for both Full Framework and .NET Core is implemented in native code in StringLiteralMap::GetInternedString. String interning for Native AOT has a different implementation and is written in C#! The new implementation uses LockFreeReaderHashtable<TKey, TValue>, which is used by the runtime in many other places. And that implementation is WAY MORE efficient than the native string interning implementation. It is somewhat comparable with ConcurrentDictionary in terms of perf, but requires less memory for keeping all the records.
And running the same benchmark with Native AOT gives drastically different results as well:
| Method | Count | Mean | Error | StdDev | Allocated |
|------------------- |-------- |------------:|------------:|------------:|----------:|
| String_Intern | 10000 | 196.8 us | 3.82 us | 3.92 us | 4.11 KB |
| StringCache_Intern | 10000 | 211.9 us | 4.15 us | 5.67 us | 4.11 KB |
| String_Intern | 100000 | 1,680.1 us | 47.58 us | 140.28 us | 4.14 KB |
| StringCache_Intern | 100000 | 2,102.1 us | 86.83 us | 250.53 us | 4.13 KB |
| String_Intern | 1000000 | 31,059.8 us | 1,349.33 us | 3,827.82 us | 4.16 KB |
| StringCache_Intern | 1000000 | 40,368.6 us | 1,279.83 us | 3,713.02 us | 4.15 KB |
We can’t see the difference in memory consumption, since these benchmarks are essentially steady-state benchmarks, where all the records are already added to the string caches.
So, the conclusions:
- If you use string.Intern in your code, you probably should think about whether you really should.
- A naive de-duplication cache based on ConcurrentDictionary<string, string> is drastically faster than the string interning cache and gives you an opportunity to clean up the cache.
Recently, I was browsing a list of courses on Pluralsight and noticed one with a very promising title: “C# 10 Performance Playbook.” As an advanced course on a topic I’m passionate about, I decided to give it a go. I wasn’t sure if I’d find many new things, but since I talk about performance a lot, I’m always looking for an interesting perspective on how to explain this topic to others. The content of this course raised my eyebrows way too much, so I decided to share my perspective on it and use it as a learning opportunity.
This blog post is quite similar to what Nick Chapsas does in his “Code Cop,” with one difference: I’m not going to anonymize the sample code. Since it’s paid content, I feel that I have a right to give a proper review and potentially ask for changes, since the potential damage of such content on a platform like Pluralsight could be quite high.
In this blog post, I want to focus on a single topic that was covered in a section called “Classes, Structs, and Records.” The section is just over six minutes long, and I didn’t expect too many details, since the topic is quite large. But you can be concise and correct.
Here is the first benchmark used for comparing classes vs. structs:
public class ClassvsStruct
{
// This reads all the names from the resource file.
public List<string> Names => new Loops().Names;
[Benchmark]
public void ThousandClasses()
{
var classes = Names.Select(x => new PersonClass { Name = x });
}
[Benchmark]
public void ThousandStructs()
{
var classes = Names.Select(x => new PersonStruct { Name = x });
}
}
The results were:
| Method | Mean | Error | StdDev | Rank |
|---------------- |---------:|---------:|---------:|-----:|
| ThousandStructs | 32.05 us | 0.639 us | 1.136 us | 1 |
| ThousandClasses | 34.11 us | 0.841 us | 2.480 us | 2 |
The author concluded that structs are slightly faster, which is an interesting conclusion given the fact that there were no constructions of classes or structs involved in the code. The difference between the two benchmarks is probably just noise and has nothing to do with the actual performance characteristics of classes or structs.
But that’s not all. Here is the next iteration of the benchmarks:
public class ClassvsStruct
{
// This reads all the names from the resource file.
public List<string> Names => new Loops().Names;
[Benchmark]
public void ThousandClasses()
{
var classes = Names.Select(x => new PersonClass { Name = x });
for (var i = 0; i < classes.Count(); i++)
{
var x = classes.ElementAt(i).Name;
}
}
[Benchmark]
public void ThousandStructs()
{
var classes = Names.Select(x => new PersonStruct { Name = x });
for (var i = 0; i < classes.Count(); i++)
{
var x = classes.ElementAt(i).Name;
}
}
}
The results are:
| Method | Mean | Error | StdDev | Rank |
|---------------- |---------:|----------:|----------:|-----:|
| ThousandStructs | 2.315 ms | 0.0460 ms | 0.0716 ms | 1 |
| ThousandClasses | 9.664 ms | 0.1837 ms | 0.3710 ms | 2 |
And I’m quoting the author: “This time the difference is HUGE!” My first reaction was, “Okay, he’s going to fix this, right? He’s just playing with us, expecting us to catch the issue in the code. You can’t have O(N^2) in the benchmark!” But nope, this was the final version of the code.
Even though I think this is a very bad way to compare structs and classes, let’s use this example to learn how we should be analyzing the results of the benchmarks.
One thing every performance engineer should learn is the ability to interpret and explain results. For instance, in this case, we changed the benchmarks to consume the classes variable in a loop 1k times, and all of a sudden, the benchmark duration increased by 100x. Is it possible that accessing 1K elements in C# takes milliseconds? This sounds horrible! My gut reaction is that the construction is probably more expensive than the consumption, so I would not expect the benchmark to be significantly slower if done correctly. If you see a 100x difference in performance results, you should stop and think: why am I getting these results? Can I explain them? Is it possible that something is wrong with the benchmark?
In many cases, developers can rely on good abstractions and ignore the implementation details, but this is not true for performance analysis. In order to properly interpret the results, a performance engineer should be able to look through the abstractions and see what’s going on under the hood. For instance: what does the Names property do? What’s the complexity of accessing it? Is it backed by a field, or do we do some work every time we access it? All of these questions are crucial, since each and every step might drastically affect the results.
If the Names property is expensive, then the benchmark will be measuring the work it does instead of the code inside the benchmark. And in the author’s case it was reading a list of names from a resource file, meaning that we were doing file IO in a benchmark, which is not OK.
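A minimal fix for that particular problem is to compute the data once and reuse it. Here is a sketch with Lazy<T>; the data source is simulated, since the course’s Loops type isn’t available here:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// First access pays the cost; every later access returns the same list instance.
Console.WriteLine(ReferenceEquals(NameSource.Names, NameSource.Names)); // True
Console.WriteLine(NameSource.Names.Count); // 1000

static class NameSource
{
    // Simulates the expensive load (the course code read names from a resource file).
    private static readonly Lazy<List<string>> _names = new(() =>
        Enumerable.Range(1, 1000).Select(n => $"Name{n}").ToList());

    public static List<string> Names => _names.Value;
}
```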
Different collection types have different performance characteristics. Even though the O-complexity is the same, you’ll see a significant difference between accessing an array and a linked list. The differences should probably be insignificant in real-world cases, but a benchmark will show them, since accessing an array is more cache-friendly: all the data is co-located (especially for structs).
And once you arrive at a hypothesis, you can check it by writing a benchmark that just accesses the elements of an array vs. the elements of a linked list with 1K elements:
| Method | Mean | Error | StdDev | Rank |
|------------------------- |-----------:|----------:|----------:|-----:|
| StructAccessInArray | 639.7 ns | 23.60 ns | 67.32 ns | 1 |
| ClassAccessInArray | 776.9 ns | 39.18 ns | 111.14 ns | 2 |
| StructAccessInLinkedList | 4,526.5 ns | 114.47 ns | 332.11 ns | 3 |
| ClassAccessInLinkedList | 4,806.1 ns | 141.65 ns | 410.96 ns | 4 |
These are the results I would expect: less than a nanosecond per element when accessing an array, a 20-ish % difference between classes and structs, and a significant difference between accessing an array vs. accessing a linked list. But even in this case we should not draw any conclusions on how changing an array to a linked list would affect performance in real-world cases, since real code normally does way more than just getting the data.
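For reference, the bodies of such access benchmarks could look roughly like this (my reconstruction, not the original course code; the sums are there only so the loops aren’t optimized away):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var array = Enumerable.Range(0, 1000)
    .Select(i => new PersonStruct { Name = i.ToString() })
    .ToArray();
var linkedList = new LinkedList<PersonStruct>(array);

// Array access: sequential memory, cache-friendly, O(1) per element.
long arraySum = 0;
for (int i = 0; i < array.Length; i++)
    arraySum += array[i].Name.Length;

// Linked list access: a pointer dereference per node, cache-unfriendly.
long listSum = 0;
for (var node = linkedList.First; node != null; node = node.Next)
    listSum += node.Value.Name.Length;

Console.WriteLine(arraySum == listSum); // True: same work, different memory layout

struct PersonStruct { public string Name; }
```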
Lastly, it’s important for every .NET engineer to have a solid understanding of algorithmic complexity and how LINQ works. We’ll revisit this topic after the tips, as it’s a key issue with these benchmarks.
The final tip is: make sure you understand the concepts being measured. There are many differences between structs and classes, and your mental model of these constructs should match the results. For example, you know that classes are heap-allocated, while structs can be allocated on the stack or inside other objects, which can impact performance. Classes are references, while structs are values, which can also affect performance in various ways.
However, you should ask yourself if you can interpret the results with your knowledge and intuition. If the answer is “no,” it could be due to a lack of understanding of the concept in this context, a flawed benchmark that introduces noise, or other factors that affect the results that you still don’t understand. In any case, you should not draw any conclusions from data that you can’t interpret.
Now, let’s try to understand the results that were presented.
First of all, we should avoid recomputing the Names property over and over again. This is bad, especially when the property is getting data from a resource file.
However, the main reason why the benchmarks are not correct is because of LINQ and lazy evaluation.
Let’s take a closer look at the code:
// This reads all the names from the resources.
public List<string> Names => new Loops().Names;
[Benchmark]
public void ThousandClasses()
{
var classes = Names.Select(x => new PersonClass { Name = x });
for (var i = 0; i < classes.Count(); i++)
{
var x = classes.ElementAt(i).Name;
}
}
The classes variable is an IEnumerable<PersonClass>, which is essentially a query (or a promise, or a generator) that will produce new results each time we consume it. However, on each iteration, we call classes.Count(), which calls new Loops().Names and creates 1,000 PersonClass instances just to return the number of items we want to consume. When you do O(N) work on each iteration, the entire loop’s complexity becomes O(N^2), which is already quite bad. Then, on each iteration, we also call classes.ElementAt(i), which probably needs to traverse the sequence from the beginning again.
This means that the overall complexity is O(2*N^2) (which, I know, is still just O(N^2)). And we pay this price twice: O(2*N^2) time complexity and O(2*N^2) memory complexity, meaning that for 1,000 elements the benchmark could be doing millions of operations and allocating millions of instances of PersonClass in the managed heap!
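For completeness, materializing the query once removes the quadratic behavior entirely (PersonClass is stubbed here to keep the sample self-contained):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

List<string> names = Enumerable.Range(1, 1000).Select(n => n.ToString()).ToList();

// ToList() runs the projection exactly once: 1,000 allocations in total.
List<PersonClass> classes = names.Select(x => new PersonClass { Name = x }).ToList();

for (var i = 0; i < classes.Count; i++) // Count is an O(1) property on List<T>
{
    var x = classes[i].Name;            // the indexer is O(1) as well
}

Console.WriteLine(classes.Count); // 1000

class PersonClass { public string Name { get; set; } }
```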
We can confirm this assumption by doing two things: 1) adding the MemoryDiagnoser attribute to see the allocations and 2) adding another case with either 100 or 10,000 elements to assess the asymptotic complexity of the code.
[MemoryDiagnoser]
public class ClassvsStruct
{
[Params(100, 1000)]
public int Count { get; set; }
public List<string> Names => new Loops(Count).Names;
[Benchmark]
public void ThousandClasses() { /* same body as in the previous benchmark */ }
[Benchmark]
public void ThousandStructs() { /* same body as in the previous benchmark */ }
}
And here are the results:
| Method | Count | Mean | Rank | Gen0 | Gen1 | Allocated |
|---------------- |------ |------------:|-----:|----------:|---------:|-----------:|
| ThousandStructs | 100 | 19.40 us | 1 | 0.6104 | - | 3.87 KB |
| ThousandClasses | 100 | 65.38 us | 2 | 39.5508 | 0.4883 | 242.93 KB |
| ThousandStructs | 1000 | 1,342.93 us | 3 | 5.8594 | - | 39.02 KB |
| ThousandClasses | 1000 | 4,844.48 us | 4 | 3835.9375 | 140.6250 | 23523.4 KB |
The results of this run are different from what was presented in the course, since my Loops().Names property is just a LINQ query. However, the same difference between structs and classes is still present: structs are significantly faster than classes. Why? Because of the allocations. Allocations in the managed heap are fast, but when you need to do millions of them just to iterate a loop, they skew the results badly. You can clearly see non-linear complexity here: the count goes from 100 to 1,000 (10x), the duration goes up by a factor of 70, and the allocations go up by a factor of 100.
It seems that the complexity is O(N^2) rather than O(2*N^2) as I expected. This is interesting! Obviously, my understanding of LINQ was incorrect.
Why? When I saw the results, my line of reasoning was: the loop is O(N), Enumerable.Count() used in the loop is O(N), and Enumerable.ElementAt(i) is O(N) as well. So on each loop iteration we should iterate the sequence from the beginning twice.
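You can actually watch the re-enumeration happen with a small counting wrapper (a diagnostic helper sketched for this post, not part of the original benchmarks):

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

var source = new CountingSource(100);
var query = source.Select(x => x);

for (int i = 0; i < query.Count(); i++)   // Count() enumerates the whole source
    _ = query.ElementAt(i);               // ElementAt(i) walks from the start

// Only 100 elements, but the source was pulled well over 10,000 times.
Console.WriteLine(source.Pulls > 10_000); // True

// Counts how many elements are pulled out of the underlying sequence.
class CountingSource : IEnumerable<int>
{
    public int Pulls;
    private readonly int _count;
    public CountingSource(int count) => _count = count;

    public IEnumerator<int> GetEnumerator()
    {
        for (int i = 0; i < _count; i++) { Pulls++; yield return i; }
    }
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
```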
I first checked the full framework sources:
public static TSource ElementAt<TSource>(this IEnumerable<TSource> source, int index) {
if (source == null) throw Error.ArgumentNull("source");
IList<TSource> list = source as IList<TSource>;
if (list != null) return list[index];
if (index < 0) throw Error.ArgumentOutOfRange("index");
using (IEnumerator<TSource> e = source.GetEnumerator()) {
while (true) {
if (!e.MoveNext()) throw Error.ArgumentOutOfRange("index");
if (index == 0) return e.Current;
index--;
}
}
}
Hm… This is definitely O(N)!
But what about .NET Core version?
public static TSource ElementAt<TSource>(this IEnumerable<TSource> source, int index)
{
if (source == null)
{
ThrowHelper.ThrowArgumentNullException(ExceptionArgument.source);
}
if (source is IPartition<TSource> partition)
{
TSource? element = partition.TryGetElementAt(index, out bool found);
if (found)
{
return element!;
}
}
else if (source is IList<TSource> list)
{
return list[index];
}
else if (TryGetElement(source, index, out TSource? element))
{
return element;
}
ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument.index);
return default;
}
The code is definitely different! There is different handling of IList<TSource> and another case for IPartition<TSource>. What's that? This is an optimization to avoid excessive work in some common scenarios, like the one we have here. We construct classes as a projection from List<T>, so the actual type of classes is SelectListIterator<TSource, TResult>, which implements IPartition<TResult> and gets the i-th element without enumerating from the beginning every time.
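We can observe this fast path indirectly (IPartition<T> itself is internal), for instance by checking the runtime type of a List<T> projection; a small sketch:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var source = new List<int> { 1, 2, 3, 4, 5 };

// On .NET Core this projection is a SelectListIterator<int, int>, which
// implements the internal IPartition<int> interface.
IEnumerable<int> projected = source.Select(x => x * 2);
Console.WriteLine(projected.GetType().Name); // e.g. "SelectListIterator`2"

// ElementAt goes through IPartition.TryGetElementAt, which indexes the
// underlying list directly instead of enumerating from the beginning.
Console.WriteLine(projected.ElementAt(3)); // 8
```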
Again, once we have a hypothesis, we can validate it. In this case, the simplest way to do that is to compare the number of allocations between the Full Framework and .NET Core versions using a profiler.
Full Framework results:
.NET Core results:
As you can see from the dotTrace output, the .NET Core version calls the PersonClass constructor 1 million times, and the Full Framework version calls it 1.5 million times. This makes sense, since asymptotic complexity describes the worst case, which does not always happen: ElementAt(i) has to iterate up to the i-th element and goes through the entire sequence only on the last iteration. But as you can see, the optimization that .NET Core has is quite significant.
Okay, we've analyzed and understood the data, but can I give advice on classes vs. structs? As I've mentioned already, this is a complicated topic, and I'm pretty sure benchmarking can't provide any guidance here. The main difference between the two is the impact on allocations and garbage collection and how the instances are passed around - by reference or via a copy. And it's very hard to give abstract advice on how and when this matters.
When I do a performance analysis, I start with a symptom: "low throughput" (compared to an expected one) or "high memory utilization" (again, compared either to a baseline or to "it just looks way too high"). Then I take a few snapshots of the system in various states, run a profiler, or collect some other performance-related metrics. I do look into transient memory allocations to see if the system produces a lot of waste that could be an indication of unnecessary work: allocating an iterator or a closure on a hot path can easily reduce the throughput of a highly loaded component by 2-3x. But if the allocations happen infrequently, then I won't even look there.
If I see GC-related performance issues, I would start looking into how I can optimize things. Using structs instead of classes is an option, but not always the first or the best one. Other options would be to see if we can avoid doing work by caching the results, or use some form of domain-specific optimizations. If I need to reduce allocations, I might switch to structs or try reducing the size of class instances by removing unused or rarely used fields.
Structs are definitely a good tool, but you really need to understand how to use them and when.
ElementAt is trickier than you might think, and overall, be VERY careful with LINQ in your benchmarks and in hot paths.

First, a bit of history. String interpolation is quite a popular concept that was added in C# 6 for creating strings with embedded expressions:
int n = 42;
string s = $"n == {n}"; // s is "n == 42"
But in the original form, this feature had some performance-related issues caused by a fairly naive implementation. To be fair, the language spec was intentionally vague in terms of how exactly the compiler should translate an interpolated string, so it was possible to have a better and more efficient code generation in the future.
Before C# 10, the compiler used a fairly simple transformation. Code like string s = $"n == {n}" was simply translated into string s = string.Format("n == {0}", n).
Here are a few issues with this approach: the format string has to be parsed at runtime, captured value types are boxed into the object[] arguments, and a ToString call on every captured expression is required, meaning that a bunch of transient strings will be allocated in the process.

Starting from C# 10, all of those issues are solved!
Let's look at a practical example that will show most of the benefits of the new implementation. Let's say we have a very simple argument validation library, like RuntimeContracts, and we want to check some invariants by calling Contract.Assert(predicate, message) (*). If the predicate is false, we want the contract to fail with an optional user-defined error message:
(*) The type name is intentionally the same as in System.Diagnostics.Contracts
namespace, but the “runtime contracts” do not require any tools for rewriting code before using them.
private int _state; // can be changed.
public void DoSomething(int n)
{
for (int i = 0; i < n; i++)
{
Contract.Assert(_state == 42, $"n must be 42 but was {_state}");
}
}
Can you see the issue here? The check is called in a loop and the message is created on each iteration! This can be very problematic if the code is on an application's hot path. Let's see how we can avoid the allocations with the interpolated string improvements.
Instead of “lowering” an interpolated string to string.Format
call, the C# 10 compiler now uses “Interpolated String Handlers” pattern.
The handler is a type that follows a specific pattern: it must have a constructor that takes at least two arguments, literalLength and formattedCount (plus some optional arguments, as we'll see later), and it must have at least two methods: AppendLiteral(string) and AppendFormatted<T>(T). The type must also be marked with a special attribute - InterpolatedStringHandlerAttribute.
Starting from C# 6, an interpolated string expression was assignable to string or System.FormattableString, and now it can be assigned to any type that follows the aforementioned pattern. Starting with .NET 6 there is a built-in handler called DefaultInterpolatedStringHandler, and by default the compiler "lowers" an interpolated string expression to it.
int n = 0;
// s is System.String
var s = $"n == {n}";
// s2 is of type 'DefaultInterpolatedStringHandler'
DefaultInterpolatedStringHandler s2 = $"n == {n}";
If you decompile this code you’ll see the changes in action:
int i = 0;
DefaultInterpolatedStringHandler defaultInterpolatedStringHandler = new DefaultInterpolatedStringHandler(5, 1);
defaultInterpolatedStringHandler.AppendLiteral("n == ");
defaultInterpolatedStringHandler.AppendFormatted(i);
// s is System.String
string s = defaultInterpolatedStringHandler.ToStringAndClear();
defaultInterpolatedStringHandler = new DefaultInterpolatedStringHandler(5, 1);
defaultInterpolatedStringHandler.AppendLiteral("n == ");
defaultInterpolatedStringHandler.AppendFormatted(i);
// s2 is of type 'DefaultInterpolatedStringHandler'
DefaultInterpolatedStringHandler s2 = defaultInterpolatedStringHandler;
The DefaultInterpolatedStringHandler is more efficient compared to a regular string.Format call in multiple ways:
- The format string is processed at compile time into a sequence of AppendLiteral and AppendFormatted calls, so no parsing happens at runtime.
- AppendFormatted<T>(T) avoids boxing when value types are captured in an interpolated string expression.
- The ISpanFormattable type is respected, and that allows writing an object's string representation into a Span<char> without allocating a separate string (many built-in types implement this interface already).
- There is an AppendFormatted(ReadOnlySpan<char>) overload that allows capturing a span of chars in the interpolated expression, which was not possible before: string s = $"Str={strArg.AsSpan().Trim()}".

Here is a small benchmark that shows the differences:
[MemoryDiagnoser]
public class PerformanceBenchmark
{
private readonly DateTime _when = DateTime.Now;
private readonly long _v1 = 1;
private readonly long _v2 = 2;
private readonly long _v3 = 3;
[Benchmark]
public string StringFormat()
{
return string.Format("When: {0}, V1={1}, V2={2}, V3={3}", _when, _v1, _v2, _v3);
}
[Benchmark]
public string NewInterpolation()
{
return $"When: {_when}, V1={_v1}, V2={_v2}, V3={_v3}";
}
}
| Method | Mean | Error | StdDev | Gen 0 | Allocated |
|----------------- |---------:|---------:|--------:|-------:|----------:|
| StringFormat | 518.0 ns | 10.34 ns | 8.63 ns | 0.0648 | 272 B |
| NewInterpolation | 392.7 ns | 7.55 ns | 6.70 ns | 0.0286 | 120 B |
As we can see, the new implementation is 25% faster and allocates less than half of the string.Format
version.
ISpanFormattable?

The default API for getting a string representation of an object is Object.ToString()
that every (**) type supports. But calling ToString by definition causes an extra allocation for the resulting string. And if you need to compose a string from multiple objects, it may cause a lot of excessive allocations. To avoid this, many high-performance applications, instead of using Object.ToString, also define a void ToString(StringBuilder) method for constructing composed text without creating an extra string each time.
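The pattern looks roughly like this (a hypothetical Money type for illustration, not a BCL API):

```csharp
using System.Text;

public readonly struct Money
{
    public decimal Amount { get; }
    public string Currency { get; }

    public Money(decimal amount, string currency) => (Amount, Currency) = (amount, currency);

    // The classic ToString allocates a new string on every call.
    public override string ToString() => $"{Amount} {Currency}";

    // The allocation-friendly variant appends into a caller-provided builder,
    // so composing many values reuses a single buffer.
    public void ToString(StringBuilder builder) =>
        builder.Append(Amount).Append(' ').Append(Currency);
}
```

Composing a report from thousands of such values then allocates one final string instead of one string per value.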
(**) Not every type per se, because pointers are types and they don’t support ToString()
. And ref structs must define ToString
methods explicitly because the base version defined in System.ValueType
is not accessible for them.
But starting with .NET 6 we have ISpanFormattable
interface that derives from IFormattable
and has one extra method:
namespace System;
public interface ISpanFormattable : IFormattable
{
/// <summary>
/// Tries to format the value of the current instance into the provided span of characters.
/// </summary>
bool TryFormat(Span<char> destination, out int charsWritten, ReadOnlySpan<char> format, IFormatProvider? provider);
}
ISpanFormattable
allows writing an object’s text representation into a destination
Span<char>
if the destination is large enough to accept it.
The API of this interface looks scary, and implementing it manually every time may be labor-intensive. Luckily, we can use interpolated strings to write into a Span<char> as well!
public readonly struct Point : ISpanFormattable
{
public int X { get; }
public int Y { get; }
public Point(int x, int y) => (X, Y) = (x, y);
public override string ToString() =>
ToString(format: null, formatProvider: null);
public bool TryFormat(Span<char> destination, out int charsWritten, ReadOnlySpan<char> format, IFormatProvider provider) =>
destination.TryWrite($"X={X}, Y={Y}", out charsWritten);
public string ToString(string format, IFormatProvider formatProvider) =>
string.Create(formatProvider, $"X={X}, Y={Y}");
}
In this case, TryFormat
method calls MemoryExtensions.TryWrite
that will do exactly what we want: it will format the interpolated string directly into the target span if the destination has enough space, without producing an intermediate string.
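Here is how the Point type above can be used to format into a stack-allocated buffer, with no heap allocations along the way:

```csharp
using System;

var point = new Point(1, 2);

// Format directly into a stack-allocated buffer: no intermediate string is created.
Span<char> buffer = stackalloc char[32];
if (point.TryFormat(buffer, out int charsWritten, ReadOnlySpan<char>.Empty, provider: null))
{
    // Only the written part of the buffer is meaningful.
    Console.WriteLine(buffer.Slice(0, charsWritten).ToString()); // X=1, Y=2
}
```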
Besides writing to a span, .NET 6 also updated the StringBuilder
API like Append
and AppendLine
to leverage new interpolated string handlers.
The calls like stringBuilder.AppendLine($"X = {X}, Y = {Y}");
used to create a separate string that was added to a StringBuilder
instance. But now both StringBuilder.Append and StringBuilder.AppendLine take an AppendInterpolatedStringHandler that appends the interpolated string in a very efficient way.
Ok, now it’s time to create a custom handler that will solve the issue that we had with our Contract.Assert
method.
Let’s start with a special handler type:
[InterpolatedStringHandler]
public ref struct ContractMessageInterpolatedStringHandler
{
// Will delegate all the work here!
private DefaultInterpolatedStringHandler _handler;
public ContractMessageInterpolatedStringHandler(int literalLength, int formattedCount, bool predicate, out bool handlerIsValid)
{
_handler = default;
if (predicate)
{
// If the predicate is evaluated to 'true', then we don't have to construct a message!
handlerIsValid = false;
return;
}
handlerIsValid = true;
_handler = new DefaultInterpolatedStringHandler(literalLength, formattedCount);
}
public void AppendLiteral(string s) => _handler.AppendLiteral(s);
public void AppendFormatted<T>(T t) => _handler.AppendFormatted(t);
public override string ToString() => _handler.ToStringAndClear();
}
Now we can change the Contract.Assert
signature to take the handler, and by using InterpolatedStringHandlerArgument
we can “tell” the compiler to pass the predicate
parameter to the constructor of the handler as well:
public static class Contract
{
// "Telling" the compiler to pass the 'predicate' parameter to the handler.
public static void Assert(bool predicate, [InterpolatedStringHandlerArgument("predicate")] ref ContractMessageInterpolatedStringHandler handler)
{
if (!predicate)
{
throw new Exception($"Precondition failed! Message:{handler.ToString()}");
}
}
}
Let’s check what will happen at runtime:
int n = 0;
// Contract is not violated! No messages will be constructed!
Contract.Assert(true, $"No side effects! n == {++n}");
Console.WriteLine($"n == {n}");
The output will be:
n == 0
The compiler emitted the following code:
bool predicate = true;
bool handlerIsValid;
var handler = new ContractMessageInterpolatedStringHandler(22, 1, predicate, out handlerIsValid);
if (handlerIsValid)
{
handler.AppendLiteral("No side effects! n == ");
handler.AppendFormatted(++n);
}
Contract.Assert(predicate, ref handler);
The compiler generates code that creates an instance of ContractMessageInterpolatedStringHandler and passes the length of the string literal and the number of format slots. It also passes the predicate flag that the handler checks, setting handlerIsValid depending on its value. And if the handler is invalid (because the assertion is not violated), we completely skip the message construction!
And now we can call Contract.Assert
with a custom error message in a loop and not be afraid of performance issues caused by excessive message construction!
private int _state; // can be set and changed.
public void DoSomething(int n)
{
for (int i = 0; i < n; i++)
{
// No performance issues anymore! The string will never be constructed if the assertion is not violated!
Contract.Assert(_state == 42, $"n must be 42 but was {_state}");
}
}
As always, the C# compiler uses a pattern-based approach for the new interpolated string improvements, which means that we can define the required attributes manually in our code (as long as we put them into the System.Runtime.CompilerServices namespace) and use the new behavior with older frameworks.
await-ing in interpolated strings

One thing that you may have noticed is that the interpolated string handlers are ref structs, and you may remember that ref structs have some restrictions: they can't be "allocated" in the managed heap, so they can't be embedded into other non-ref structs or objects. And because of that, they can't be used in async methods.
But the following code worked fine before and still works just fine in C# 10:
public async Task FooAsync()
{
string s = $"x = {await Task.Run(() => 42)}";
}
The language designers knew that the async case would be problematic. So they had a few options: 1) make handlers non-ref structs or 2) use different code generation when async code is involved. They decided to go with the second option: keep the handlers as ref structs and, in the async case, fall back to the old behavior of generating a string.Format call instead.
- Custom interpolated string handlers allow skipping message construction entirely, as we did for Contract.Assert. The same "trick" can be used by logging frameworks to avoid string creation if the logging level is off.
- Interpolated expressions can now capture a ReadOnlySpan<char>, like string s = "foo bar "; string str = $"Trimmed: {s.AsSpan().Trim()}";.
- ISpanFormattable is a very handy interface that allows an object's string representation to be written into a span without allocating a string.
- MemoryExtensions.TryWrite is a building block for implementing the ISpanFormattable interface using interpolated strings.
- StringBuilder.Append and AppendLine were updated in .NET 6 to use interpolated string handlers for higher efficiency.

Imagine a file copy operation where one thread writes the content and another "watcher" thread tracks progress by reading the FileStream.Position property.
The question is: how safe or unsafe is accessing FileStream.Position from another thread? Of course, without any synchronization in place, the "watcher" could be a bit off and get a stale file position. And because the Position property is of type long, the read operation could yield some very weird results on a 32-bit platform for files larger than 2 GB. And, of course, the runtime could potentially do some weird optimizations due to the lack of synchronization (even though this is not likely to happen in practice).
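A side note on the torn-read part: a long is 64 bits wide, and on a 32-bit platform an unsynchronized read can observe the two halves of different writes. Interlocked.Read is the standard way to get an atomic 64-bit read (a general sketch, not what FileStream does internally):

```csharp
using System.Threading;

class PositionTracker
{
    private long _position;

    // Writer side: atomically advance the position.
    public void Advance(long delta) => Interlocked.Add(ref _position, delta);

    // Reader side: Interlocked.Read guarantees an atomic 64-bit read
    // even on 32-bit platforms, so no torn values can be observed.
    public long Current => Interlocked.Read(ref _position);
}
```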
But is it possible for the watcher thread to affect the copy operation in a more drastic way? Like to corrupt the file?
Let’s do an experiment.
[Test]
public void ReadFileStreamPositionFromDifferentThread()
{
const string path = "test.txt";
int N = 10_000;
int blockSize = 1024;
using (var fileStream = new FileStream(path, FileMode.Create, FileAccess.Write))
using (var writer = new StreamWriter(fileStream))
{
var cts = new CancellationTokenSource();
// Start a background position reader
Task.Run(async () =>
{
while (!cts.IsCancellationRequested)
{
// Tracing the position. In this case, just obtaining it.
long currentPosition = fileStream.Position;
await Task.Delay(1);
}
});
for (int i = 0; i < N; i++)
{
// Generate blocks of 'a's, then 'b's etc to 'z's
var output = new string((char)('a' + (i%26)), blockSize);
writer.WriteLine(output);
}
cts.Cancel();
}
var fileLength = new FileInfo(path).Length;
// Need to count \r\n as well
var expectedLength = (blockSize + Environment.NewLine.Length) * N;
Assert.That(fileLength, Is.EqualTo(expectedLength));
}
We have very simple code that synchronously writes blocks of 1024 characters to a file N times. We can increase N to the millions, deploy this code to production, and never see any errors for years. So we might conclude that it is safe to read the FileStream.Position property while another thread writes content to the file.
And then we make a simple change. We either read the FileStream.SafeFileHandle property on a FileStream instance, or we start creating the FileStream by calling, for instance, new FileStream(safeHandle, FileAccess.Write).
[Test]
public void ReadFileStreamPositionFromDifferentThreadWithSafeFileHandleExposed()
{
const string path = "test.txt";
int N = 10_000;
int blockSize = 1024;
using (var fileStream = new FileStream(path, FileMode.Create, FileAccess.Write))
using (var writer = new StreamWriter(fileStream))
{
// This is the key difference here: touching SafeFileHandle property.
var handle = fileStream.SafeFileHandle;
var cts = new CancellationTokenSource();
// Start a background position reader
Task.Run(async () =>
{
while (!cts.IsCancellationRequested)
{
// Tracing the position. In this case, just obtaining it.
long currentPosition = fileStream.Position;
await Task.Delay(1);
}
});
for (int i = 0; i < N; i++)
{
// Generate blocks of 'a's, then 'b's etc to 'z's
var output = new string((char)('a' + (i%26)), blockSize);
writer.WriteLine(output);
}
cts.Cancel();
}
var fileLength = new FileInfo(path).Length;
// Need to count \r\n as well
var expectedLength = (blockSize + Environment.NewLine.Length) * N;
Assert.That(fileLength, Is.EqualTo(expectedLength));
}
And now, if we run the test, we’ll get a failure, Expected: 10260000 But was: 10258976
. What. Is. Going. On. Here?
When the internal file handle is exposed (by calling FileStream.SafeFileHandle or by creating a FileStream instance from a given SafeFileHandle), the FileStream instance enables some additional internal safety checks. If FileStream._exposedHandle is true, then every read, write, flush, or Position getter calls VerifyOSHandlePosition, which calls SeekCore(0, SeekOrigin.Current), which reads the current position of the file and updates the cached position by changing the _pos field.
It means that if _exposedHandle is true, the call to FileStream.Position is no longer pure! It updates the FileStream's internal state, which can affect a write operation happening on another thread. To understand the problem, let's take a look at the FileStream.BeginWriteCore implementation (which is called from the synchronous Write as well):
unsafe private FileStreamAsyncResult BeginWriteCore(byte[] bytes, int offset, int numBytes, AsyncCallback userCallback, Object stateObject)
{
// Create and store async stream class library specific data in the async result
FileStreamAsyncResult asyncResult = new FileStreamAsyncResult(0, bytes, _handle, userCallback, stateObject, true);
NativeOverlapped* intOverlapped = asyncResult.OverLapped;
if (CanSeek) {
// Make sure we set the length of the file appropriately.
long len = Length;
//Console.WriteLine("BeginWrite - Calculating end pos. pos: "+pos+" len: "+len+" numBytes: "+numBytes);
// Make sure we are writing to the position that we think we are
if (_exposedHandle)
VerifyOSHandlePosition();
if (_pos + numBytes > len) {
//Console.WriteLine("BeginWrite - Setting length to: "+(pos + numBytes));
SetLengthCore(_pos + numBytes);
}
// Now set the position to read from in the NativeOverlapped struct
// For pipes, we should leave the offset fields set to 0.
intOverlapped->OffsetLow = (int)_pos;
intOverlapped->OffsetHigh = (int)(_pos>>32);
If the file is not yet flushed and the next write operation happens while another thread calls the FileStream.Position property, then the internal _pos field can be changed based on the actual file position, effectively losing one of the writes and corrupting the content of the file!
No one should assume that a property is thread-safe unless it's clearly stated in the documentation, and there are no such claims for any FileStream properties. On the other hand, when we think about thread unsafety due to concurrent reads of a property, we rarely expect such drastic effects as corrupted files. The Framework Design Guidelines taught us to treat properties as smart fields, without drastic side effects like IO operations in a property getter.
I do understand that the FileStream
implementation tries its best to protect us, the users, from undesirable errors and inconsistent state. But I also believe that side effects like potential file corruption should be documented more explicitly.
TLDR; Reading FileStream.Position from another thread during write operations, when the stream's underlying SafeFileHandle is exposed, is extremely dangerous and may cause file corruption.
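If you do need progress reporting during writes, one safer approach is to maintain the byte count yourself instead of reading FileStream.Position from another thread (a sketch; the type and member names are made up for illustration):

```csharp
using System.IO;
using System.Threading;

class WriterWithProgress
{
    private long _bytesWritten;

    // The watcher thread reads this property instead of FileStream.Position,
    // so the stream's internal state is never touched from another thread.
    public long BytesWritten => Interlocked.Read(ref _bytesWritten);

    public void WriteBlock(FileStream stream, byte[] buffer, int count)
    {
        stream.Write(buffer, 0, count);
        // Track progress in our own field with an atomic update.
        Interlocked.Add(ref _bytesWritten, count);
    }
}
```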
P.S. The issue could happen in full framework as well as in .NET Core.
It was a very important lesson for me: even a simple change can have a drastic effect on a distributed system. We had been running a service with concurrent Position reads for many years without any issues, and a simple code change that switched FileStream into its "unsafe" mode caused very strange and hard-to-understand issues in the system.
To understand the problem, let's review the following code. Suppose we have a service that processes internal requests on a "dedicated thread". To do that, it creates a long-running task by passing TaskCreationOptions.LongRunning into the Task.Factory.StartNew method and creates a continuation for error reporting purposes.
public class Processor
{
private Task _task;
private readonly BlockingCollection<Request> _queue;
public Processor()
{
_queue = new BlockingCollection<Request>();
_task = Task.Factory.StartNew(LoopAsync, TaskCreationOptions.LongRunning);
_task.ContinueWith(_ =>
{
// Trace the error.
// Maybe even restart the loop.
}, TaskContinuationOptions.OnlyOnFaulted);
}
public void Stop() => _queue.CompleteAdding();
private async Task LoopAsync()
{
foreach (var request in _queue.GetConsumingEnumerable())
{
await ProcessRequest(request);
}
}
}
What is the problem with this code? Quite a few things, actually. And all of them are related to the LoopAsync method's return type.
First of all, let’s think about the long-running aspect. TaskCreationOptions.LongRunning
indicates that a given operation is such a long running procedure that it deserves a dedicated thread. That makes sense because indeed LoopAsync
can run for the entire lifetime of the service until Stop
method is called.
But here is the catch: from CLR’s point of view the duration of LoopAsync
is not “linear” and the operation “finishes” on the first await
. It means that this code spawns a thread just to wait for and process the first request. And once the first request is processed, the continuation inside LoopAsync runs on a thread-pool thread, causing the original dedicated thread to die.
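A quick way to see this behavior is to print whether the code runs on a thread-pool thread before and after the first await (a sketch):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

await Task.Factory.StartNew(async () =>
{
    // Before the first await: the dedicated LongRunning thread.
    Console.WriteLine($"Before await: pool thread = {Thread.CurrentThread.IsThreadPoolThread}"); // False
    await Task.Delay(10);
    // After the await: the continuation runs on a thread-pool thread;
    // the dedicated thread has already died.
    Console.WriteLine($"After await: pool thread = {Thread.CurrentThread.IsThreadPoolThread}"); // True
}, TaskCreationOptions.LongRunning).Unwrap();
```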
The code creates unnecessary threads and this is not the best thing in the world, but this is not the most dangerous part here.
The type of the _task
field is Task
, but what is the actual type of the object at runtime? Is it just System.Threading.Tasks.Task
? The actual type is Task<Task>
.
Task.Factory.StartNew "wraps" the result of the given delegate into a task, and if the delegate itself returns a task, then the result is a task that wraps a task.
In this case, it means that the error handling here is completely wrong. _task.ContinueWith creates a continuation of the outer task, which will fail only if something goes terribly wrong with the system and the TPL fails to launch a new thread. Otherwise, the outer task will succeed, "hiding" potential issues with the inner task.
Here is a simpler example:
static void Main(string[] args)
{
var task = Task.Factory.StartNew(async () =>
{
Console.WriteLine("Inside the delegate");
throw new Exception("Error");
return 42;
}, TaskCreationOptions.LongRunning);
task.ContinueWith(
_ => { Console.WriteLine($"Error: {_.Exception}"); },
TaskContinuationOptions.OnlyOnFaulted);
Console.ReadLine();
}
When we run this code, we'll see the Inside the delegate message on the screen and nothing else. And if we check the status of the task variable at runtime, we'll notice that the task actually finished successfully, and the continuation that is supposed to handle the error is never called.
What should you do in this case? The simplest solution is to switch to Task.Run, which returns the underlying task because that API was designed with async methods in mind. Alternatively, you can use the TaskExtensions.Unwrap extension method to get the underlying task from a Task<Task> instance.

But if you have to use Task.Factory.StartNew because you need to pass some other task creation options, then you can "unwrap" the resulting task to obtain the underlying task instance:
static void Main(string[] args)
{
var task = Task.Factory.StartNew(async () =>
{
Console.WriteLine("Inside the delegate");
throw new Exception("Error");
return 42;
}).Unwrap();
// Now, task actually points to the underlying task and the next continuation works as expected.
task.ContinueWith(
_ => { Console.WriteLine($"Error: {_.Exception}"); },
TaskContinuationOptions.OnlyOnFaulted);
Console.ReadLine();
}
One way to at least mitigate issues like this is to always react to unhandled exceptions in tasks. When a task fails but the user fails to "observe" the error, the TaskScheduler.UnobservedTaskException event is triggered. Back in the .NET 4.0 days, unhandled task exceptions were "critical" and caused the application to crash. Starting from .NET 4.5 the default behavior has changed (*), and unhandled task exceptions may stay unnoticed (use the <ThrowUnobservedTaskExceptions> configuration element if you want to change it back).
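Subscribing to the event is straightforward (a sketch; note that the event fires only when the faulted task is garbage collected, so it may trigger long after the actual failure):

```csharp
using System;
using System.Threading.Tasks;

TaskScheduler.UnobservedTaskException += (sender, args) =>
{
    // Log the failure and mark it as observed so the process does not
    // crash even when ThrowUnobservedTaskExceptions is enabled.
    Console.WriteLine($"Unobserved task exception: {args.Exception}");
    args.SetObserved();
};
```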
(*) The reason for this change is quite simple: it is extremely easy in these "async" days to get an unobserved task exception. Simple code like this can cause it:
var t1 = AsyncMethod1();
var t2 = AsyncMethod2();
// If both t1 and t2 will fail, then t2's error will be unobserved.
await t1;
await t2;
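One easy fix is Task.WhenAll, which observes both tasks at once (a self-contained sketch using stand-in failing methods):

```csharp
using System;
using System.Threading.Tasks;

static Task AsyncMethod1() => Task.FromException(new Exception("first"));
static Task AsyncMethod2() => Task.FromException(new Exception("second"));

var t1 = AsyncMethod1();
var t2 = AsyncMethod2();

try
{
    // WhenAll observes the exceptions of both tasks, so neither failure
    // can end up as an unobserved task exception.
    await Task.WhenAll(t1, t2);
}
catch (Exception e)
{
    // 'await' rethrows only the first exception; the full set is available
    // through the WhenAll task itself if you keep a reference to it.
    Console.WriteLine(e.Message); // first
}
```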
- Don't use Task.Factory.StartNew with TaskCreationOptions.LongRunning if the given delegate is backed by an async method.
- Prefer Task.Run over Task.Factory.StartNew and use the latter only when you really have to.
- If you do use Task.Factory.StartNew with async methods, always call Unwrap to get the underlying task back.

If you work on a codebase that was started in the .NET 4.0 era, I would highly recommend searching for Task.Factory.StartNew usages and double-checking that you don't have the issues mentioned in this post.