String.Trim() has problems

String.Trim() has some problems:

  1. It has bugs
  2. It is slow
  3. It is too often abused

String.Trim() has bugs

Yep, I'm not joking. The documentation says: "Removes all leading and trailing white-space characters from the current String object". So what do you expect the following program to do?

for (int i = (int)char.MinValue; i <= (int)char.MaxValue; i++) 
{ 
    char c = (char)i; 
    string s = c.ToString(); 
    bool charIsWhiteSpace = char.IsWhiteSpace(c); 
    bool trimTreatsCharAsWhiteSpace = s.Trim() == ""; 
    if (charIsWhiteSpace != trimTreatsCharAsWhiteSpace) 
    { 
        Console.WriteLine("Problem with char {0:X}: charIsWhiteSpace == {1}, trimTreatsCharAsWhiteSpace == {2}.", 
             (int) c, charIsWhiteSpace, trimTreatsCharAsWhiteSpace); 
    } 
}


According to the documentation, I would expect this to write nothing to the console, but here you go:

Problem with char 180E: charIsWhiteSpace == True, trimTreatsCharAsWhiteSpace == False.
Problem with char 200B: charIsWhiteSpace == False, trimTreatsCharAsWhiteSpace == True.
Problem with char 202F: charIsWhiteSpace == True, trimTreatsCharAsWhiteSpace == False.
Problem with char 205F: charIsWhiteSpace == True, trimTreatsCharAsWhiteSpace == False.
Problem with char FEFF: charIsWhiteSpace == False, trimTreatsCharAsWhiteSpace == True.
 

I looked them up:

180E: Mongolian vowel separator
200B: Zero width space
202F: Narrow no-break space
205F: Medium mathematical space
FEFF: Zero width no-break space

I'm not sure about the Mongolian thing ;-), but the others do look like white space to me.

String.Trim() is slow

Granted, this function has quite a bit of work to do. Basically, it needs to:

  1. Find the first non-white-space character in the string
  2. If there is none, return the empty string
  3. Find the last non-white-space character in the string
  4. If you get the entire string, just return it, otherwise return an appropriate substring.

And basically, that's what the function actually does. Well, in fact it swaps steps 2 and 3, so all space strings are actually scanned twice. Much worse is how the searching is done, or in fact, how a character is determined to be white-space or not. Reflector shows a pretty weird loop construct there. Unrolling that inner loop could speed up things considerably.

String.Trim() is too often abused

Have you ever written code like this:

if (s != null && s.Trim() != "")
{ 
    // ... 
}


Don't! In a test like that, you're not interested in the actual result, so don't calculate it! Look at the algorithm again: of the four steps mentioned, we only need the first one executed. We don't need the actual substring!

Now take a look at this function:

static bool IsEmptyOrWhiteSpace(string s) 
{ 
    foreach (char c in s) 
    { 
        if ((c < (char)9 || c > (char)13) && 
            c != ' ' && 
            c != (char)133 && 
            c != (char)160 && 
            c != (char)5760 && 
            (c < (char)8192 || c > (char)8203) && 
            c != (char)8232 && 
            c != (char)8233 && 
            c != (char)12288 && 
            c != (char)65279) 
        { 
            return false; 
        } 
    } 
    return true;
}


It's compatible with String.Trim(), so it has the same bugs mentioned above. But that makes it a perfect replacement in s.Trim() != "" tests, it won't change the semantics of your code.

I did a small benchmark, comparing its speed and memory usage with this function:

static bool Naive(string s) 
{ 
    return s.Trim() == ""; 
}


And these are the results:

Length:   0 WhiteSpace at start:   0 WhiteSpace at end:   0 
 Naive                      Time:     1,65 GC:      0
 IsEmptyOrWhiteSpace        Time:     0,17 GC:      0 
 
Length:   1 WhiteSpace at start:   1 WhiteSpace at end:   1 
 Naive                      Time:     5,27 GC:      0
 IsEmptyOrWhiteSpace        Time:     0,27 GC:      0 
 
Length: 100 WhiteSpace at start:   0 WhiteSpace at end:   0 
 Naive                      Time:    12,44 GC:      0
 IsEmptyOrWhiteSpace        Time:     0,72 GC:      0 
 
Length: 100 WhiteSpace at start:   1 WhiteSpace at end:   1 
 Naive                      Time:    37,23 GC:  20733
 IsEmptyOrWhiteSpace        Time:     1,08 GC:      0 
 
Length: 100 WhiteSpace at start:  44 WhiteSpace at end:  55 
 Naive                      Time:   177,77 GC:   1920
 IsEmptyOrWhiteSpace        Time:    20,46 GC:      0 
 
Length: 100 WhiteSpace at start: 100 WhiteSpace at end: 100 
 Naive                      Time:   169,46 GC:      0
 IsEmptyOrWhiteSpace        Time:    38,50 GC:      0
 

I called each function millions of times for different strings. For each string, I display its length and the number of leading and trailing spaces (ASCII 32). The first column shows actual time, second column shows the number of generation 0 garbage collection runs. Both functions were called through a delegate, which allowed me to take loop and call overhead into account. Obviously, I compiled with optimizations turned on.

As you can see, the IsEmtyOrWhiteSpace function is always faster than the naive approach, sometimes by orders of magnitude! A common case in many applications is a long string, e.g. an HTML document, ending with a CR/LF combination. That's actually the worst case for the naive approach, while being the best case for the IsEmptyOrWhiteSpace function.

Obviously, IsEmptyOrWhiteSpace doesn't do any memory allocations, while the naive approach in certain cases does.

Oh, and in case you wondered: I tried for loops and an unsafe version as well, both were slower than the simple foreach loop used above.

Technorati Tags: