Pretending .NET strings are value types

In .NET, strings are immutable reference types. This often comes as a surprise to newer .NET developers, who may mistake them for value types because of how they behave. However, other than the practice of using StringBuilder for long concatenation, especially in loops, is there any practical reason to know this distinction?

What real-world scenarios are helped or avoided by understanding the value-reference distinction with regard to .NET strings vs. just pretending/misunderstanding them to be value types?


The design of strings was deliberately such that you shouldn't need to worry too much about it as a programmer. In many situations, this means that you can just assign, move, copy, change strings without thinking too much of the possible intricate consequences if another reference to your string existed and would be changed at the same time (as happens with object references).

String parameters in a method call

(EDIT: this section added later)
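When a string is passed to a method, its reference is passed by value: the method receives a copy of the reference, and no characters are copied. When the string is only read in the method body, nothing special happens. When the method assigns a new value to the parameter, only the method's local copy of the reference is pointed at a different string object; the caller's variable still refers to the original. Because strings are immutable, no operation changes the original string in place. This can look like copy-on-write, but nothing is actually copied.

What troubles juniors is that they are used to objects being references: a method that modifies an object it receives also changes what the caller sees. Strings offer no members that modify them, so the only way for a method to hand the caller a different string through a parameter is the ref keyword (or out, or simply a return value). With ref, the method can change the caller's string reference itself. Without it, nothing the method body does to the parameter is visible to the caller: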

void ChangeBad(string s)      { s = "hello world"; }
void ChangeGood(ref string s) { s = "hello world"; }

// in calling method:
string s1 = "hi";
ChangeBad(s1);       // s1 remains "hi" on return, this is often confusing
ChangeGood(ref s1);  // s1 changes to "hello world" on return
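To make the "no copy is made" part concrete, here is a minimal sketch. The class and method names (StringParameterDemo, IsSameInstance, Shout) are made up for illustration. It shows that the parameter refers to the very same object as the caller's variable, and that "changing" a string really means producing a new one:

```csharp
using System;

static class StringParameterDemo
{
    // The parameter holds a copy of the caller's reference, not a copy of
    // the characters, so both arguments can refer to one and the same object.
    public static bool IsSameInstance(string parameter, string original)
    {
        return object.ReferenceEquals(parameter, original);
    }

    // String methods never modify the instance they are called on;
    // they return a new string and leave the caller's string untouched.
    public static string Shout(string s)
    {
        return s.ToUpperInvariant() + "!";
    }
}
```

Calling StringParameterDemo.IsSameInstance(s1, s1) returns true, and after string loud = StringParameterDemo.Shout(s1) the variable s1 still holds its old value while loud holds the new string.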

On StringBuilder

This distinction is important, but beginner programmers are usually better off not knowing too much about it. Using StringBuilder when you do a lot of string "building" is good, but often your application will have bigger fish to fry, and the small performance gain of StringBuilder is negligible. Be wary of programmers who tell you that all string manipulation should be done using StringBuilder.

As a very rough rule of thumb: StringBuilder has some creation cost, but appending is cheap. String has a cheap creation cost, but concatenation is relatively expensive. The turning point is around 400-500 concatenations, depending on size: after that, StringBuilder becomes more efficient.

More on StringBuilder vs string performance

EDIT: based on a comment from Konrad Rudolph, I added this section.

If the previous rule of thumb makes you wonder, consider the following slightly more detailed explanations:

  • StringBuilder with many small string appends outruns string concatenation rather quickly (30, 50 appends), but at 2 µs, even a 100% performance gain is often negligible (save for some rare situations);
  • StringBuilder with some large string appends (strings of 80 characters or more) outruns string concatenation only after thousands, sometimes hundreds of thousands, of iterations, and the difference is often just a few percent;
  • Mixing string actions (replace, insert, substring, regex etc.) often makes StringBuilder and string concatenation perform about equally;
  • String concatenation of constants can be optimized away by the compiler, the CLR or the JIT; StringBuilder calls cannot;
  • Code often mixes concatenation with +, StringBuilder.Append, String.Format, ToString and other string operations; using StringBuilder in such cases is hardly ever effective.

So, when is it efficient? In cases where many small strings are appended, for instance to serialize data to a file, and when you don't need to change the "written" data once it has been "written" to the StringBuilder. And in cases where many methods need to append something, because StringBuilder is a mutable reference type that every method can append to in place, whereas every change to a string produces a new string.
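The "many methods need to append something" case might look like the following sketch. ReportWriter and its methods are hypothetical names, not taken from any library. Each helper appends to the one shared StringBuilder instead of returning intermediate strings that would have to be concatenated:

```csharp
using System;
using System.Text;

static class ReportWriter
{
    // Appends to the caller's builder; no intermediate string is created here.
    static void AppendHeader(StringBuilder sb, string title)
    {
        sb.Append("== ").Append(title).Append(" ==").Append('\n');
    }

    // Same builder instance again, mutated in place.
    static void AppendRows(StringBuilder sb, int count)
    {
        for (int i = 1; i <= count; i++)
        {
            sb.Append("row ").Append(i).Append('\n');
        }
    }

    // The final string is materialized once, at the very end.
    public static string Build(string title, int rowCount)
    {
        var sb = new StringBuilder();
        AppendHeader(sb, title);
        AppendRows(sb, rowCount);
        return sb.ToString();
    }
}
```

For a handful of rows, plain concatenation would do just as well; the pattern only starts to pay off when the number of appends grows.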

On interned strings

A problem arises (and not only for junior programmers) when they do a reference comparison and find that the result is sometimes true and sometimes false in seemingly identical situations. What happened? String literals are interned: they are added to a global pool of interned strings, so two equal literals can point to the same memory address. A reference comparison of two equal strings, one interned and one not, yields false. Use the == operator or Equals, and do not play around with ReferenceEquals when dealing with strings.
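A small sketch of the difference, using hypothetical helper names (InterningDemo, MakeCopy, SameReference, SameValue). It deliberately builds the strings at run time, because how literals are shared is exactly the part you should not rely on:

```csharp
using System;

static class InterningDemo
{
    // Builds a new string object at run time from the characters of s.
    // (For an empty input the runtime may hand back the shared empty string.)
    public static string MakeCopy(string s)
    {
        return new string(s.ToCharArray());
    }

    // Identity: do both variables point at the same object?
    public static bool SameReference(string a, string b)
    {
        return object.ReferenceEquals(a, b);
    }

    // Equality: do both strings contain the same characters?
    public static bool SameValue(string a, string b)
    {
        return string.Equals(a, b, StringComparison.Ordinal);
    }
}
```

Two copies made with MakeCopy("something") are equal by value but are different objects, so SameValue returns true and SameReference returns false. Passing both through String.Intern first returns the same pooled reference for each, and only then does SameReference return true.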

On String.Empty

In the same league is a behavior that sometimes surprises people when using String.Empty: the static String.Empty is always interned, but a string created at run time is not. However, the compiler and runtime will usually make the literal "" point to the same memory address as String.Empty. Result: two empty strings that look as if they were created separately return true when compared with ReferenceEquals, while you might expect false.

// emptiness is treated differently:
string empty1 = String.Empty;
string empty2 = "";
string nonEmpty1 = "something";
string nonEmpty2 = "something";

// typically true: equal literals in the same assembly normally share one
// interned instance, but do not write code that depends on it
bool compareNonEmpty = object.ReferenceEquals(nonEmpty1, nonEmpty2);

// usually true as well, but this has depended on the .NET version and on how
// the empty string was produced
bool compareEmpty = object.ReferenceEquals(empty1, empty2);

In depth

You basically asked what situations can trip up the uninitiated. I think my point boils down to avoiding object.ReferenceEquals, because it cannot be trusted when used with strings. The reason is that string interning is used when the string is a constant in the code, while strings built at run time (through concatenation, StringBuilder.ToString, user input and so on) are not interned automatically. You cannot rely on this behavior. Though String.Empty and "" are normally interned, a string is not interned when its value is only known at run time. Different optimization options (debug vs release and others) and different .NET versions can yield different results.

When do you need ReferenceEquals anyway? With objects it makes sense, but with strings it does not. Teach anybody working with strings to avoid its usage unless they also understand unsafe and pinned objects.

Performance

When performance is important, you may find that strings are not strictly immutable (unsafe code can modify them in place) and that using StringBuilder is not always the fastest approach.

A lot of the information I used here is detailed in this excellent article on strings, along with a "how to" for manipulating string in-place (mutable strings).

Update: added code sample
Update: added 'in depth' section (hope someone finds this useful ;)
Update: added some links, added section on string params
Update: added estimation for when to switch from strings to stringbuilder
Update: added an extra section on StringBuilder vs String performance, after a remark by Konrad Rudolph


The only distinction that really matters for most code is the fact that null can be assigned to string variables.
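As a rough illustration of why this matters (NullStringDemo and its methods are made-up names): a value type such as int always holds a value, whereas a string variable may hold no object at all, so code that receives strings has to decide what null means.

```csharp
using System;

static class NullStringDemo
{
    // Reading s.Length on a null reference would throw a
    // NullReferenceException, so the null case is handled explicitly.
    public static int SafeLength(string s)
    {
        return s == null ? 0 : s.Length;
    }

    // String.IsNullOrEmpty treats null and "" alike, which is often
    // what calling code actually wants.
    public static bool IsBlank(string s)
    {
        return string.IsNullOrEmpty(s);
    }
}
```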


An immutable class acts like a value type in all common situations, and you can do quite a lot of programming without caring much about the difference.

It's when you dig a little deeper and care about performance that the distinction becomes really useful. For example, it helps to know that although passing a string as a parameter to a method acts as if a copy of the string were created, no copying actually takes place. This might be a surprise for people used to languages where strings actually are value types (like VB6?), in which passing a lot of strings as parameters would be bad for performance.


String is a special breed. It is a reference type, yet most coders use it as if it were a value type. Making it immutable and using the intern pool optimizes memory usage, which would be huge if it were a pure value type.

More readings here:
C# .NET String object is really by reference? on SO
String.Intern Method on MSDN
string (C# Reference) on MSDN

Update:
Please refer to abel's comment to this post. It corrected my misleading statement.
