Monday, December 9, 2013

Writing a STL-Style UTF-8 String Class Part 3

Update: The latest code can now be found on GitHub. Check the post "STL-Style UTF-8 String Class now on GitHub" for more information.
____________________________________________________________________________

So far I've written a basic string class and have given it an iterator, but there's something missing. It doesn't use UTF-8. In this post, I'll introduce some UTF-8 utility functions that the string class will use.

The following information requires some knowledge of UTF-8. If you'd like an introduction, you can watch
my video presentation, which gives a good introduction to UTF-8, or check out the first post in this series for some links.

To add UTF-8 support to this class, I'll first define a new namespace to keep everything in one place.

namespace sd_utf8
{
}

I'll also add another header with some UTF-8 utility functions. These will be used by the utf8string class, but application programmers will also be able to use them to add UTF-8 capabilities to their existing classes.

The functions will need this type, an array of four 8-bit characters. A single encoded UTF-8 character will be stored here.

typedef _uchar8bit utf8_encoding[4];

Here are some of the utility functions. All of the functions are inline because they are defined in the header file.

The GetUTF8Encoding function encodes a 32-bit Unicode value into UTF-8. The size of the encoding in bytes is returned in out_size. This function can also reorder the incoming data if its byte order is the opposite of the current system's.

/// This function generates a UTF-8 encoding from a 32 bit UCS-4 character.
/// This is being provided as a static method so it can be used with normal std::string objects
/// default_order is true when the byte order matches the system
inline void GetUTF8Encoding(_char32bit in_char, utf8_encoding &out_encoding, int &out_size, bool default_order = true)
{
 // check the byte order and reorder if necessary
 if(default_order == false)
 {
  in_char = ((in_char & 0x000000ff) << 24) + ((in_char & 0x0000ff00) << 8) + ((in_char & 0x00ff0000) >> 8) + ((in_char & 0xff000000) >> 24);
 }

 if(in_char < 0x80)
 {
  // 1 byte encoding
  out_encoding[0] = (char)in_char;
  out_size = 1;
 }
 else if(in_char < 0x800)
 {
  // 2 byte encoding
  out_encoding[0] = 0xC0 + ((in_char & 0x7C0) >> 6);
  out_encoding[1] = 0x80 + (in_char & 0x3F);
  out_size = 2;
 }
 else if(in_char < 0x10000)
 {
  // 3 byte encoding
  out_encoding[0] = 0xE0 + ((in_char & 0xF000) >> 12);
  out_encoding[1] = 0x80 + ((in_char & 0xFC0) >> 6);
  out_encoding[2] = 0x80 + (in_char & 0x3F);
  out_size = 3;
 }
 else
 {
  // 4 byte encoding
  out_encoding[0] = 0xF0 + ((in_char & 0x1C0000) >> 18);
  out_encoding[1] = 0x80 + ((in_char & 0x3F000) >> 12);
  out_encoding[2] = 0x80 + ((in_char & 0xFC0) >> 6);
  out_encoding[3] = 0x80 + (in_char & 0x3F);
  out_size = 4;
 }
}

inline void GetUTF8Encoding(_char16bit in_char, utf8_encoding &out_encoding, int &out_size, bool default_order = true)
{
 // check the byte order and reorder if necessary
 if(default_order == false)
 {
  in_char = ((in_char & 0x00ff) << 8) + ((in_char & 0xff00) >> 8);
 }

 // to reduce redundant code and possible bugs from typing errors, use the 32-bit version
 GetUTF8Encoding((_char32bit)in_char, out_encoding, out_size, true);
}
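
As a quick sanity check, here's a small usage sketch (not part of the header itself) that encodes U+20AC, the euro sign, and prints the resulting bytes. It assumes the functions and typedefs above are in scope:

#include <cstdio>

using namespace sd_utf8; // the namespace defined above

int main()
{
 utf8_encoding encoding;
 int size = 0;

 // U+20AC EURO SIGN should encode to the three bytes E2 82 AC
 GetUTF8Encoding((_char32bit)0x20AC, encoding, size);

 for(int i = 0; i < size; ++i)
  printf("%02X ", encoding[i]);
 printf("(%d bytes)\n", size); // prints: E2 82 AC (3 bytes)

 return 0;
}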

The function UTF8CharToUnicode reads the next character in a UTF-8 string and returns its 32-bit Unicode value.

inline _char32bit UTF8CharToUnicode(const _uchar8bit *utf8data)
{
 if(utf8data[0] < 0x80)
 {
  return (_char32bit)utf8data[0];
 }
 else if(utf8data[0] < 0xE0)
 {
  // 2 bytes
  return ((utf8data[0] & 0x1F) << 6) + (utf8data[1] & 0x3F);
 }
 else if (utf8data[0] < 0xF0)
 {
  // 3 bytes
  return ((utf8data[0] & 0xF) << 12) + ((utf8data[1] & 0x3F) << 6) + (utf8data[2] & 0x3F);
 }
 else
 {
  // 4 bytes
  return ((utf8data[0] & 0x7) << 18) + ((utf8data[1] & 0x3F) << 12) + ((utf8data[2] & 0x3F) << 6) + (utf8data[3] & 0x3F);
 }
}
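
And a matching decode sketch that round-trips the bytes from the earlier example:

#include <cstdio>

using namespace sd_utf8;

int main()
{
 // the UTF-8 encoding of U+20AC from the previous example
 const _uchar8bit euro[] = { 0xE2, 0x82, 0xAC, 0x00 };

 _char32bit codepoint = UTF8CharToUnicode(euro);
 printf("U+%04X\n", (unsigned int)codepoint); // prints: U+20AC

 return 0;
}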

Those two functions do all of the encoding and decoding work. In the UTF-8 code, I often just use simple < comparisons to determine the size of the encoding. Remember, in 4-byte encodings, the first byte always has the form 1111 0XXX, which means the lead byte of a 4-byte encoding is always at least 1111 0000 (0xF0). 3-byte encodings have lead bytes of the form 1110 XXXX, so they fall between 0xE0 and 0xF0. The same holds true for 2-byte and 1-byte encodings. So instead of doing fancy bit operations, a simple comparison is all that's needed.
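
To make the comparisons concrete, they can be packaged into a small helper. GetCodeUnitCount isn't part of the posted header; it's just here to illustrate the idea:

/// returns the number of bytes in the UTF-8 character that begins with lead_byte,
/// assuming lead_byte starts a valid UTF-8 sequence
inline int GetCodeUnitCount(_uchar8bit lead_byte)
{
 if(lead_byte < 0x80) return 1; // 0xxxxxxx
 if(lead_byte < 0xE0) return 2; // 110xxxxx
 if(lead_byte < 0xF0) return 3; // 1110xxxx
 return 4;                      // 11110xxx
}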

Here's a list of all the other functions:

This function increments a pointer into a UTF-8 string to the correct character position. It sets the pointer to the null-terminator if the position is past the end of the string. Behavior is undefined if string doesn't point to a properly formatted UTF-8 string.

inline void IncrementToPosition(const _uchar8bit *&string, size_t pos);
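
Here's a minimal sketch of how it could be implemented in terms of IncToNextCharacter, which is described further down (the real version is in the linked header):

inline void IncrementToPosition(const _uchar8bit *&string, size_t pos)
{
 // skip pos whole characters, stopping early if we hit the null-terminator
 for(size_t i = 0; i < pos && *string; ++i)
 {
  IncToNextCharacter(string); // described later in this post
 }
}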


This function takes a UTF-8 encoded string and returns the position in the buffer where the character at pos actually begins. Behavior is undefined if string doesn't point to a properly formatted UTF-8 string or if pos is out of range.

inline size_t GetBufferPosition(const _uchar8bit *string, size_t pos);
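
A sketch, building directly on IncrementToPosition:

inline size_t GetBufferPosition(const _uchar8bit *string, size_t pos)
{
 const _uchar8bit *p = string;
 IncrementToPosition(p, pos);
 return (size_t)(p - string); // byte offset where character pos begins
}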


This function will get the minimum amount of memory needed to encode the string in UTF-8.

template <class T>
inline size_t GetMinimumBufferSize(const T *string);
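
Here's one plausible implementation, reusing GetUTF8Encoding to size each character. Whether the count should include the null-terminator is a detail I'll leave to the linked header; this sketch includes it:

template <class T>
inline size_t GetMinimumBufferSize(const T *string)
{
 size_t bytes = 0;

 for(; *string; ++string)
 {
  utf8_encoding encoding;
  int size;
  GetUTF8Encoding((_char32bit)*string, encoding, size);
  bytes += size;
 }

 return bytes + 1; // room for the null-terminator
}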


Template function to convert a string into UTF-8 and store the result in a std::basic_string. The type should be convertible to an int. A template function is being used because the 16-bit and 32-bit implementations are identical except for the type.

template <typename char_type>
inline void MakeUTF8StringImpl(const char_type* instring, std::basic_string<_uchar8bit> &out, bool appendToOut);
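
A sketch of the conversion loop, assuming GetUTF8Encoding from above:

template <typename char_type>
inline void MakeUTF8StringImpl(const char_type* instring, std::basic_string<_uchar8bit> &out, bool appendToOut)
{
 if(!appendToOut)
  out.clear();

 for(; *instring; ++instring)
 {
  utf8_encoding encoding;
  int size;
  GetUTF8Encoding((_char32bit)*instring, encoding, size);
  out.append(encoding, size);
 }
}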


Template function to convert a string into UTF-8 and store the result in a buffer. The type should be convertible to an int. A template function is being used because the 16-bit and 32-bit implementations are identical except for the type. out should point to a buffer large enough to hold the result (see GetMinimumBufferSize above).

template <typename char_type>
inline void MakeUTF8StringImpl(const char_type* instring, _uchar8bit *out);
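
The buffer version can follow the same loop; a sketch:

template <typename char_type>
inline void MakeUTF8StringImpl(const char_type* instring, _uchar8bit *out)
{
 for(; *instring; ++instring)
 {
  utf8_encoding encoding;
  int size;
  GetUTF8Encoding((_char32bit)*instring, encoding, size);

  for(int i = 0; i < size; ++i)
   *out++ = encoding[i];
 }

 *out = 0; // null-terminate
}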


These overloads use the template function MakeUTF8StringImpl to convert a string.

inline void MakeUTF8String(const _char16bit* instring_UCS2, std::basic_string<_uchar8bit> &out, bool appendToOut = false);

inline void MakeUTF8String(const _char32bit* instring_UCS4, std::basic_string<_uchar8bit> &out, bool appendToOut = false);
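
For example, this usage sketch converts a small UCS-2 string (the exact 16-bit literal type depends on how _char16bit is defined in the earlier posts):

std::basic_string<_uchar8bit> utf8;

// "aé" as UCS-2 code units: U+0061, U+00E9
const _char16bit ucs2[] = { 0x0061, 0x00E9, 0x0000 };

sd_utf8::MakeUTF8String(ucs2, utf8);
// utf8 now holds the three bytes 61 C3 A9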


These functions increment and decrement a pointer into a UTF-8 encoded string, moving it to the next or previous character. utf8data must point to a valid UTF-8 string.

inline void IncToNextCharacter(const _uchar8bit *&utf8data);

inline void DecToNextCharacter(const _uchar8bit *&utf8data);
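
Sketches of both, relying on the fact that continuation bytes always have the form 10xxxxxx (my reconstruction; see the linked header for the real code):

inline void IncToNextCharacter(const _uchar8bit *&utf8data)
{
 // step past the lead byte, then past any continuation bytes (10xxxxxx)
 ++utf8data;
 while((*utf8data & 0xC0) == 0x80)
  ++utf8data;
}

inline void DecToNextCharacter(const _uchar8bit *&utf8data)
{
 // step back until we land on a byte that isn't a continuation byte
 --utf8data;
 while((*utf8data & 0xC0) == 0x80)
  --utf8data;
}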

Gets the length of a UTF-8 string in characters.

inline size_t GetNumCharactersInUTF8String(const _uchar8bit *utf8data);
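
A sketch that counts only the bytes that start a character:

inline size_t GetNumCharactersInUTF8String(const _uchar8bit *utf8data)
{
 size_t count = 0;

 for(; *utf8data; ++utf8data)
 {
  // count only lead bytes, not continuation bytes (10xxxxxx)
  if((*utf8data & 0xC0) != 0x80)
   ++count;
 }

 return count;
}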

Full source code can be found here: utf8utils.0.h
