Strings are immutable
Well, that’s hardly news.
var str1 = "A";
str1 += "B";
The line str1 += "B" allocates a new string "AB" and assigns its reference to str1. The old string object ("A") is not modified; str1 simply stops pointing to it, and in general such an abandoned string becomes garbage. The IL for the snippet above makes this clear.
IL_0001: ldstr "A"
IL_0006: stloc.0 // str1
IL_0007: ldloc.0 // str1
IL_0008: ldstr "B"
IL_000d: call string [mscorlib]System.String::Concat(string, string)
IL_0012: stloc.0 // str1
As you can see, += is compiled to a call to String.Concat, which allocates a new concatenated string.
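The same effect can be observed from C# without reading IL. The following is a minimal sketch; the variable name original is introduced here only for illustration and keeps a second reference to the first string.

```csharp
using System;

var str1 = "A";
var original = str1;   // second reference to the same "A" object

str1 += "B";           // String.Concat allocates a new "AB" string

Console.WriteLine(str1);                                    // AB
Console.WriteLine(original);                                // A
Console.WriteLine(object.ReferenceEquals(original, str1));  // False
```

The object behind original is untouched; only the str1 variable was redirected to the new string.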
In fact, string immutability is NOT enforced at the runtime level; the BCL simply does not expose an API to modify a string object. We can still acquire a pointer to the memory where a string object resides and change some bytes there in unsafe mode, but that can cause hard-to-diagnose issues, and you should probably use the StringBuilder class instead.
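As a rough illustration of the StringBuilder route, the sketch below edits text in place and produces a string only at the end. The values are arbitrary and chosen just for the demo.

```csharp
using System;
using System.Text;

var builder = new StringBuilder("A");
builder.Append("B");                 // appends to the builder's internal buffer
builder[0] = 'X';                    // changes a character in place through the indexer

string result = builder.ToString();  // the final string is created here
Console.WriteLine(result);           // XB
```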
The idea behind string interning
Since a string object can’t be changed once allocated, ideally we should keep a single copy of each unique string of our program in memory. In other words, if we have str1 and str2 string variables and both have the same value, then both should have the same reference too (point to the same memory address). That is called string interning.
Believe it or not, you’re already doing string interning
var str1 = "A";
var str2 = "A";
Console.WriteLine(object.ReferenceEquals(str1, str2)); // True
Even though str1 and str2 are declared separately, they still point to the same memory address. The reason is string interning, which the CLR does automatically for string literals.
This behaviour is well documented
The Common Language Infrastructure (CLI) guarantees that the result of two ldstr instructions referring to two metadata tokens that have the same sequence of characters return precisely the same string object (a process known as “string interning”). Source
If we look at the IL code for the first 2 lines of the C# snippet above
IL_0001: ldstr "A"
IL_0006: stloc.0 // str1
IL_0007: ldstr "A"
IL_000c: stloc.1 // str2
we see 2 ldstr instructions that load the exact same string value.
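A related case worth checking is concatenation. The sketch below assumes the usual C# compiler behaviour of folding a constant expression such as "A" + "B" into a single literal, while a concatenation that involves a non-constant variable is built at run time. The variable names are illustrative only.

```csharp
using System;

var literal = "AB";
var folded = "A" + "B";     // constant expression, emitted as one "AB" literal

var left = "A";
var runtime = left + "B";   // left is not a constant, so String.Concat runs at run time

Console.WriteLine(object.ReferenceEquals(literal, folded));   // True
Console.WriteLine(object.ReferenceEquals(literal, runtime));  // False
Console.WriteLine(literal == runtime);                        // True
```

So only strings that reach the IL as literals get interned automatically; equal strings produced at run time are separate objects.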
There is not much you can do to disable this behaviour, should you want to for whatever reason. The most you can do is
[assembly: CompilationRelaxations(CompilationRelaxations.NoStringInterning)]
This marks an assembly as not requiring string-literal interning. Keep in mind that string interning can still happen.
How to intern a string manually
There are 2 methods for that
- static string Intern (string str) - interns the str string and returns a reference to the interned string
- static string IsInterned (string str) - returns the interned string if the str string is interned, or null otherwise.
Let’s see some examples
// Literal string is going to be interned automatically
var str1 = "Hello Bond";
// Dynamic string will not be interned
var tmpStr = new StringBuilder().Append("Hello ").Append("Bond").ToString();
// string.Intern returned an existing interned string (remember, "Hello Bond" is already interned by CLR)
var str2 = string.Intern(tmpStr);
Console.WriteLine(object.ReferenceEquals(str1, str2)); // true
Console.WriteLine(object.ReferenceEquals(str1, tmpStr)); // false
The API for interning a string is pretty straightforward.
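string.IsInterned deserves a quick look too. This is a minimal sketch that uses a GUID-based value (an assumption made only so that the string is practically guaranteed not to be in the intern pool yet).

```csharp
using System;

// A GUID-based value, so it is very unlikely to be interned already.
var dynamicStr = "key-" + Guid.NewGuid().ToString();

Console.WriteLine(string.IsInterned(dynamicStr) is null);     // True: not interned yet

var interned = string.Intern(dynamicStr);

// After interning, IsInterned returns the pooled reference for any equal string.
var lookup = string.IsInterned(new string(dynamicStr.ToCharArray()));
Console.WriteLine(object.ReferenceEquals(interned, lookup));  // True
```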
Saving memory is not the only advantage
The idea of string interning is to reuse a piece of memory to save some bytes, but that’s not all. String interning has an even more important advantage: faster string comparison.
If str1 and str2 are interned strings and their references are equal, then their values are equal too, and vice versa (if the references of str1 and str2 are NOT equal, then their values are NOT equal either). This is really good news, because string comparison is a frequent and expensive operation.
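The reasoning above can be sketched as follows. The strings are built with StringBuilder so that they start out as separate objects; the names ia, ib and ic are illustrative only.

```csharp
using System;
using System.Text;

var a = new StringBuilder().Append("Hello ").Append("Bond").ToString();
var b = new StringBuilder().Append("Hello ").Append("Bond").ToString();

Console.WriteLine(a == b);                          // True: the characters are compared
Console.WriteLine(object.ReferenceEquals(a, b));    // False: two separate objects

var ia = string.Intern(a);
var ib = string.Intern(b);
Console.WriteLine(object.ReferenceEquals(ia, ib));  // True: equal values, same reference

var ic = string.Intern(new StringBuilder("Hello ").Append("Moneypenny").ToString());
Console.WriteLine(object.ReferenceEquals(ia, ic));  // False: different values
```

Once every string involved is interned, a reference check alone answers the equality question, with no need to walk the characters.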
How does it work internally
To make interning work, a static hash table (InternTable) is maintained to keep track of all the interned strings in our application. The interning process is simple:
- check if such a string is already interned
  - If yes
    - return the interned string
  - If not
    - allocate a new string in memory
    - copy the value to the newly allocated string
    - add the newly allocated string reference to the hash table
    - return the reference to the newly allocated string
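The steps above can be sketched in a few lines of C#. This is only an illustration of the listed steps and not the CLR’s actual implementation: internTable and InternSketch are hypothetical names, and a plain Dictionary stands in for the runtime’s hash table.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the runtime's intern table, keyed by string value.
var internTable = new Dictionary<string, string>(StringComparer.Ordinal);

string InternSketch(string value)
{
    // Check if an equal string is already interned; if yes, return it.
    if (internTable.TryGetValue(value, out var existing))
    {
        return existing;
    }

    // If not: allocate a new string and copy the value into it.
    var copy = new string(value.ToCharArray());

    // Add the new string to the table and return its reference.
    internTable.Add(copy, copy);
    return copy;
}

var first = InternSketch(new string(new[] { 'a', 'b' }));
var second = InternSketch(new string(new[] { 'a', 'b' }));

Console.WriteLine(object.ReferenceEquals(first, second));  // True
Console.WriteLine(internTable.Count);                      // 1
```

Because the dictionary holds a strong reference to every entry, nothing stored in it can be collected while the table is alive.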
Drawbacks
We talked about the advantages, string deduplication and performant comparison; now it’s time for the disadvantages.
Ever-living string object
A string, once interned, stays in memory for the lifetime of our application: the hash table we mentioned above keeps a reference to the interned string, which keeps the GC from collecting it. So, for strings which are not frequently used during the lifetime of our application, string interning is most likely not the right way of storing them. We may end up in a situation where, instead of saving memory, we do the opposite.
Temporary string
In order to get an interned string, we first need a “normal” string. As an example, say we want to intern the content of a file. First we have to store the content in a string variable, str for example, and then call str = string.Intern(str), which will internally create a new string and copy the value of str to it. As a result, we end up with 2 duplicate string objects in memory, one of which (guess which one) has to be garbage collected.
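The file scenario might look like the sketch below. The temporary file and its content are assumptions made purely so that the example is self-contained. Whatever the runtime does internally, the “normal” string has to be allocated before string.Intern can be called at all.

```csharp
using System;
using System.IO;

// Hypothetical input file, created here only for the demo.
var path = Path.GetTempFileName();
File.WriteAllText(path, "some file content");

var str = File.ReadAllText(path);   // the temporary "normal" string is allocated first
str = string.Intern(str);           // str now holds the interned reference

Console.WriteLine(object.ReferenceEquals(str, string.IsInterned(str)));  // True

File.Delete(path);
```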
Conclusion
String interning can be used effectively for long-living duplicate strings, especially if we’re doing a bunch of equality comparisons.