Objective-C and Invalid UTF-8 Byte Sequences

I found out the following information the hard way. I hope it helps you in some way…

Objective-C does not handle invalid UTF-8 byte sequences gracefully.

NSString’s UTF8String method will simply return nil if it encounters any invalid characters (which could happen when deserializing a slightly corrupted file from disk, for example). The encoding conversion methods, even the ones that allow “lossy” conversions, will either return nil or return only the first part of the string, up to the invalid sequence. I guess “lossy” here refers only to valid characters that the target encoding cannot represent.

I had a fiendish problem with one user, where a presumably corrupt character in the DB would cause UTF8String to return nil, and the whole thing fell over. I tried many alternatives to UTF8String, like NSString’s dataUsingEncoding:allowLossyConversion: and getBytes:maxLength:usedLength:encoding:options:range:remainingRange:. The former simply returned nil too; the latter would give me only the first part of the string.

What was worse is that [stringToCheck canBeConvertedToEncoding:NSUTF8StringEncoding] would actually return YES (but then the conversion would fail).

My conclusion is that all these methods, and everything else that mentions “lossy”, are designed for converting valid strings from one encoding to another. What I had was an invalid NSString, which these methods are not designed for, not even canBeConvertedToEncoding:NSUTF8StringEncoding.
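
A cheaper pre-check than a full conversion can be sketched in plain C (which also compiles as Objective-C) by scanning the string's UTF-16 code units for unpaired surrogates. This assumes that unpaired surrogates are what makes the string unconvertible, as with the U+D843 character mentioned at the end of this post. It is not something the SDK provides, and the function name first_unpaired_surrogate is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of any SDK.
 * Returns the index of the first unpaired surrogate in a UTF-16 buffer,
 * or len if every surrogate is correctly paired.
 */
size_t first_unpaired_surrogate(const uint16_t *units, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        uint16_t u = units[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            /* high surrogate: must be followed by a low surrogate */
            if (i + 1 < len && units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
                i++; /* skip the low half of a valid pair */
            } else {
                return i;
            }
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            /* low surrogate with no high surrogate before it */
            return i;
        }
    }
    return len;
}
```

From Objective-C, the buffer could be filled with NSString's getCharacters:range: before calling the function. A result smaller than the length would mean the string needs fixing.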

Third Party Solutions Are Hard to Come By

I searched outside the Objective-C/iOS SDK, and found a few leads. One was this solution which uses the Omni framework. I didn’t try it, as I didn’t want to add a massive dependency for such a simple, rarely used function. It seems that people commonly use GNU’s iconv for the task, e.g. in Ruby and PHP, but I have no idea if I can even compile that for iOS, not to mention the whole LGPL issue.

My Solution to Fix Corrupt NSStrings

In the end, my solution was this: for strings that fail the UTF8String conversion (so only about 0.01% of the time), get every single UTF-16 char from the NSString, and make it its own NSString. Then, run UTF8String on just that single-char string to test if it worked, and ignore those that fail (actually, I replace them with the replacement char: ‘�’). It’s not pretty, it’s not fast – but it works. And since this is a very rare case, generally the only performance hit you get is the overhead of the UTF8String test.

The only way I know how to test for sure that you can convert to UTF8 is to actually do the conversion via NSString’s UTF8String. It’s heavy, but it’s accurate…

Here’s the method:

/* UTF8 fixup methods, by William Denniss, http://williamdeniss.com/ */

/*
 * Convenience method to do the check and the fix-up in one.
 */
+ (NSString*) makeValidUTF8:(NSString*) stringToCheck
{
	if (![Util isValidUTF8:stringToCheck])
	{
		return [Util removeInvalidCharsFromString:stringToCheck];
	}
	else
	{
		return stringToCheck;
	}
}

/*
 * Returns YES if the string can be converted to UTF8
 */
+ (BOOL) isValidUTF8:(NSString*) stringToCheck
{
	return ([stringToCheck UTF8String] != nil);
}

/*
 * Removes invalid UTF8 chars from the NSString
 * This method is slow, so only run it on strings that fail the +Util::isValidUTF8 check.
 * Uses manual reference counting; under ARC, drop the autorelease and release calls.
 */
+ (NSString*) removeInvalidCharsFromString:(NSString*) stringToCheck
{
	NSUInteger length = [stringToCheck length];
	NSMutableString* fixedUp = [[[NSMutableString alloc] initWithCapacity:length] autorelease];

	// iterates all UTF-16 code units of the string to check
	for (NSUInteger i = 0; i < length; i++)
	{
		unichar units[2] = { [stringToCheck characterAtIndex:i], 0 };
		NSUInteger unitCount = 1;

		// keeps a valid surrogate pair together, since either half tested alone would fail
		if (units[0] >= 0xD800 && units[0] <= 0xDBFF && i + 1 < length)
		{
			unichar next = [stringToCheck characterAtIndex:i + 1];
			if (next >= 0xDC00 && next <= 0xDFFF)
			{
				units[1] = next;
				unitCount = 2;
			}
		}

		// gets the character (or surrogate pair) as its own string
		NSString* charString = [[NSString alloc] initWithCharacters:units length:unitCount];

		// converts it individually to UTF8, testing for errors
		if ([charString UTF8String] == nil)
		{
			NSLog(@"Invalid UTF-8 sequence encountered at position %lu. Code: %hu (%X). Replacing with \ufffd", (unsigned long) i, units[0], units[0]);
			[fixedUp appendString:@"\ufffd"];
		}
		else
		{
			[fixedUp appendString:charString];
		}
		[charString release];

		// skips the low surrogate if it was consumed as part of a pair
		i += unitCount - 1;
	}

	NSLog(@"Util:makeValidUTF8 WARNING: string was NOT valid utf-8.  Orig length %lu, fixed length %lu", (unsigned long) [stringToCheck length], (unsigned long) [fixedUp length]);

	//NSAssert([fixedUp UTF8String] != nil, @"still nil");

	return fixedUp;
}

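Incidentally, the character in my NSString that caused UTF8String to fail, and caused all my grief, was U+D843. That value is a UTF-16 high surrogate, which is only meaningful when followed by a low surrogate, so presumably it was unpaired in my case.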
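
To illustrate why a value like U+D843 cannot stand alone, here is a small plain C sketch (it also compiles as Objective-C) of the standard UTF-16 rule for combining a high and a low surrogate into one code point. The function name combine_surrogates is made up for illustration, and the sketch assumes the caller has already checked that both halves are in range.

```c
#include <stdint.h>

/*
 * Hypothetical helper: combines a UTF-16 surrogate pair into a code point.
 * Assumes high is in 0xD800..0xDBFF and low is in 0xDC00..0xDFFF.
 */
uint32_t combine_surrogates(uint16_t high, uint16_t low)
{
    /* each half contributes 10 bits; the result starts at U+10000 */
    return 0x10000u
         + (((uint32_t)high - 0xD800u) << 10)
         + ((uint32_t)low - 0xDC00u);
}
```

With 0xD843 as the high half, the result falls between U+20C00 (low half 0xDC00) and U+20FFF (low half 0xDFFF). Without a low half there is no code point to encode, which is consistent with the UTF-8 conversion failing.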


2 comments on “Objective-C and Invalid UTF-8 Byte Sequences”

  1. Thanks for this. It’s a good solution. However, as listed, I think there is a bug:

    [stringToCheck UTF8String] always returns a valid pointer. The dereferenced pointer does give you the indication of failure, as a null-terminated string is returned.

    Let me know if you think I missed something. And thanks again. It’s much more complete, with the most amount of recovered text, than my previous effort.

  2. An NSString object contains already encoded characters, UTF-16 encoded, that is.
    The problem is often that a sequence of bytes appears not to be a valid UTF-8 sequence. If you start with an NSString, you are already half way.

    So, a method that takes an array of bytes or an NSData object as parameter would be better, I think.
    I have constructed a method to do this, but I doubt that it is entirely correct; maybe you can take a look at it?
    It is on StackOverflow: http://stackoverflow.com/questions/30372870/string-from-nsinputstream-is-not-valid-utf8-how-to-convert-to-utf8-more-lossy

    Regards, Leon