Tuesday, December 10, 2013

Writing a STL-Style UTF-8 String Class Part 4

Update: The latest code can now be found on GitHub. Check the post "STL-Style UTF-8 String Class now on GitHub" for more information.
____________________________________________________________________________

In this fourth installment of "Writing a STL-Style UTF-8 String Class", I'll show the first version of the utf8string class. In my next post, I'll add the remaining member functions to make it behave like std::string.

One nice thing about UTF-8 is that it's compatible with normal 8-bit string functions. There is only one little problem: processing UTF-8 requires unsigned 8-bit characters, but on some compilers the char data type is signed. This means that if we just cast the pointers, typical string functions will read byte values over 127 as negative numbers. This is OK in most situations, but it will throw off less-than (<) and greater-than (>) comparisons. To get around this and still be able to use some of the STL's string capabilities, I'm going to cut out some of my code. I will remove the following lines:

pointer  buffer;
size_type reserve_size;
size_type used_length;

And replace them with this:

std::basic_string<_uchar8bit> utfstring_data;
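To illustrate the signed-char problem described above, here is a minimal sketch that compares the same two bytes both ways. The function names are made up for this example and are not part of the utf8string class; the cast to signed char mimics what a plain char comparison does on compilers where char is signed.

```cpp
#include <cassert>

// Compares two bytes as signed chars, which is what a plain `char`
// comparison does on compilers where char is a signed type.
inline bool less_as_signed(unsigned char a, unsigned char b)
{
    return static_cast<signed char>(a) < static_cast<signed char>(b);
}

// Compares the same two bytes as unsigned values (the order UTF-8 needs).
inline bool less_as_unsigned(unsigned char a, unsigned char b)
{
    return a < b;
}

// Small self-check: 0xC3 is the lead byte of U+00E9 in UTF-8.
inline void check_byte_order()
{
    const unsigned char z = 'z';         // 0x7A
    const unsigned char lead = 0xC3;     // a byte value above 127
    assert(less_as_unsigned(z, lead));   // expected order: 0x7A < 0xC3
    assert(less_as_signed(lead, z));     // flipped: 0xC3 reads as -61 on two's-complement targets
}
```

Storing the data as unsigned bytes means the comparison always takes the second form, regardless of how the compiler treats char.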

Also, the string will use _uchar8bit (unsigned char) internally, but externally the class should now return _char32bit (unsigned int). This is because I want the class to return the decoded Unicode value, not the UTF-8 encoding. This means our internal types will need to change.

typedef _char32bit   value_type;
typedef _char32bit   *pointer;
typedef const _char32bit *const_pointer;
typedef _char32bit   &reference;
typedef const _char32bit &const_reference;
typedef size_t    size_type;
typedef ptrdiff_t   difference_type;
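To make the difference between the stored bytes and the returned value concrete, here is a rough sketch of decoding one UTF-8 sequence into a 32-bit code point. The function decode_utf8 and the two typedefs are hypothetical stand-ins written for this example; the real class uses the utility functions from Part 3, and this sketch assumes well-formed input and does no validation.

```cpp
#include <cassert>

typedef unsigned char uchar8;   // stands in for the post's _uchar8bit
typedef unsigned int  char32;   // stands in for the post's _char32bit

// Decodes the UTF-8 sequence that starts at p and returns its code point.
// Assumes p points at the lead byte of a well-formed sequence.
inline char32 decode_utf8(const uchar8 *p)
{
    assert(p != 0);
    if (p[0] < 0x80)              // 0xxxxxxx: one byte (ASCII)
        return p[0];
    if ((p[0] & 0xE0) == 0xC0)    // 110xxxxx 10xxxxxx
        return ((p[0] & 0x1Fu) << 6) | (p[1] & 0x3Fu);
    if ((p[0] & 0xF0) == 0xE0)    // 1110xxxx 10xxxxxx 10xxxxxx
        return ((p[0] & 0x0Fu) << 12) | ((p[1] & 0x3Fu) << 6) | (p[2] & 0x3Fu);
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return ((p[0] & 0x07u) << 18) | ((p[1] & 0x3Fu) << 12)
         | ((p[2] & 0x3Fu) << 6)  |  (p[3] & 0x3Fu);
}
```

The result is computed from several bytes rather than stored anywhere in the buffer, so there is no single element that a reference could point to.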

Returning decoded values is significant in another way. The class can no longer return references to elements in the underlying array; it can only return values. If the class can only return values, then the iterator should also only return values. This means that our iterator will always be a const iterator that cannot change the string. It also means that we can no longer use the default std::reverse_iterator. That template class expects the iterator to return references and not values, and if we try to use it, we'll get compiler errors. So we'll also need to write our own reverse iterator, which is not overly difficult. The key is to have the reverse iterator hold the forward iterator as a member. Here's an example:


template <class TBaseIterator>
class value_reverse_iterator : public std::iterator<std::bidirectional_iterator_tag, value_type>
{
 public:
  TBaseIterator forward_iterator;

 public:
  // copy constructor
  value_reverse_iterator(const value_reverse_iterator &other)
   :forward_iterator(other.forward_iterator)
  {
  }

  // create from forward iterator
  value_reverse_iterator(const TBaseIterator &iterator)
   :forward_iterator(iterator)
  {
  }

  value_type operator*() const
  {
   TBaseIterator temp = forward_iterator;
   return *(--temp);
  }

  // does not check to see if it goes past the end
  // iterating past the end is undefined
  value_reverse_iterator &operator++()
  {
   --forward_iterator;

   return *this;
  }

  // does not check to see if it goes past the end
  // iterating past the end is undefined
  value_reverse_iterator operator++(int)
  {
   value_reverse_iterator copy(*this);

   // increment
   --forward_iterator;

   return copy;
  }

  // does not check to see if it goes before the beginning
  // iterating before the reverse beginning is undefined
  value_reverse_iterator &operator--()
  {
   ++forward_iterator;
   return *this;
  }

  // does not check to see if it goes before the beginning
  // iterating before the reverse beginning is undefined
  value_reverse_iterator operator--(int)
  {
   value_reverse_iterator copy(*this);

   ++forward_iterator;

   return copy;
  }

  // All comparisons below just compare the underlying forward iterators.
  // The programmer is responsible for making sure both iterators refer to the same set of data.
  bool operator == (const value_reverse_iterator &other) const
  {
   return forward_iterator == other.forward_iterator;
  }

  bool operator != (const value_reverse_iterator &other) const
  {
   return forward_iterator != other.forward_iterator;
  }

  // The ordering operators are flipped on purpose:
  // a reverse iterator comes "before" another when its forward iterator is further along.
  bool operator < (const value_reverse_iterator &other) const
  {
   return forward_iterator > other.forward_iterator;
  }

  bool operator > (const value_reverse_iterator &other) const
  {
   return forward_iterator < other.forward_iterator;
  }

  bool operator <= (const value_reverse_iterator &other) const
  {
   return forward_iterator >= other.forward_iterator;
  }

  bool operator >= (const value_reverse_iterator &other) const
  {
   return forward_iterator <= other.forward_iterator;
  }

};
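As a usage illustration, here is a trimmed-down, self-contained sketch of the same idea. Both value_iterator and reverse_sketch are hypothetical names invented for this example: the first is a toy forward iterator over an array of code points that returns values, and the second keeps only the three operations a loop needs. It is a simplified stand-in, not the class above.

```cpp
#include <cassert>
#include <vector>

// Toy forward iterator that returns values, not references.
class value_iterator
{
    const unsigned int *pos;
public:
    explicit value_iterator(const unsigned int *p) : pos(p) {}
    unsigned int operator*() const { return *pos; }   // returned by value
    value_iterator &operator++() { ++pos; return *this; }
    value_iterator &operator--() { --pos; return *this; }
    bool operator!=(const value_iterator &o) const { return pos != o.pos; }
};

// Reverse adapter that holds the forward iterator as a member.
template <class TBaseIterator>
class reverse_sketch
{
    TBaseIterator forward_iterator;
public:
    explicit reverse_sketch(const TBaseIterator &it) : forward_iterator(it) {}

    // Dereference the element just before the stored position.
    unsigned int operator*() const
    {
        TBaseIterator temp = forward_iterator;
        return *(--temp);
    }

    // Moving forward in reverse order means stepping the base iterator back.
    reverse_sketch &operator++() { --forward_iterator; return *this; }

    bool operator!=(const reverse_sketch &o) const
    {
        return forward_iterator != o.forward_iterator;
    }
};

// Collects the values of [first, last) in reverse order.
inline std::vector<unsigned int> reversed(const unsigned int *first,
                                          const unsigned int *last)
{
    assert(first != 0 && last != 0);
    value_iterator b(first), e(last);
    reverse_sketch<value_iterator> rfirst(e);   // reverse begin wraps end()
    reverse_sketch<value_iterator> rlast(b);    // reverse end wraps begin()

    std::vector<unsigned int> out;
    for (reverse_sketch<value_iterator> it = rfirst; it != rlast; ++it)
        out.push_back(*it);
    return out;
}
```

Note the off-by-one in operator*: the reverse begin iterator wraps end(), so dereferencing has to step back one position first.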

The next step is to use the UTF-8 utility functions from Part 3 to make this work with UTF-8 encoded text. Also, all methods and operators that returned references before should now return values.

Many new constructors are needed because the class should convert automatically from std::string and std::wstring, and should accept both char and wchar_t characters.

// default constructor
utf8string();

// build from a c string
// undefined (ie crashes) if str is NULL
utf8string(const _char8bit *str);

// build from a c string
// undefined (ie crashes) if str is NULL
utf8string(const _uchar8bit *str);

// construct from n copies of an unsigned char
utf8string(size_t n, _uchar8bit c);

// construct from n copies of a normal char
utf8string(size_t n, _char8bit c);

// construct from a single unsigned char
utf8string(_uchar8bit c);

// construct from a single normal char
utf8string(_char8bit c);

// construct from a single 16-bit character
utf8string(_char16bit c);

// construct from a single 32-bit character
utf8string(_char32bit c);

// copy constructor
utf8string(const utf8string &str);

/// \brief Constructs a UTF-8 string from a null-terminated string of 16-bit characters (UCS-2)
utf8string (const _char16bit* instring_UCS2);

/// \brief Constructs a UTF-8 string from a null-terminated string of 32-bit characters (UCS-4)
utf8string (const _char32bit* instring_UCS4);

/// \brief Constructs a UTF-8 string from a std::string
utf8string(const std::string &instring);

/// \brief Constructs a UTF-8 string from a std::wstring
utf8string(const std::wstring &instring);

I use the same preconditions as std::string. In other words, if std::string doesn't check the correctness of a parameter, I don't either. I've also written some extra constructors, such as utf8string(char c). This is so I can write calls such as string += 'c' without having to overload the += operator for every type. Because these constructors exist, the compiler can construct a temporary utf8string and use the one += operator that takes "const utf8string &" as a parameter.
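The converting-constructor trick can be sketched with a much smaller class. mini_utf8 below is a hypothetical stand-in invented for this example; it stores its bytes in a std::string and has only three constructors, but it shows how a single operator+= ends up accepting several argument types.

```cpp
#include <cassert>
#include <string>

// Hypothetical, heavily simplified stand-in for utf8string.
class mini_utf8
{
    std::string bytes;   // UTF-8 encoded storage
public:
    mini_utf8() {}
    mini_utf8(const char *str) : bytes(str) {}              // from a C string
    mini_utf8(char c) : bytes(1, c) {}                      // from one char
    mini_utf8(unsigned int cp) { append_code_point(cp); }   // from one code point

    // The only += overload; other argument types arrive through the constructors.
    mini_utf8 &operator+=(const mini_utf8 &other)
    {
        bytes += other.bytes;
        return *this;
    }

    const std::string &raw() const { return bytes; }

private:
    // Encodes cp as 1 to 4 UTF-8 bytes and appends them.
    void append_code_point(unsigned int cp)
    {
        assert(cp <= 0x10FFFF);
        if (cp < 0x80) {
            bytes += static_cast<char>(cp);
        } else if (cp < 0x800) {
            bytes += static_cast<char>(0xC0 | (cp >> 6));
            bytes += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            bytes += static_cast<char>(0xE0 | (cp >> 12));
            bytes += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            bytes += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            bytes += static_cast<char>(0xF0 | (cp >> 18));
            bytes += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            bytes += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            bytes += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
};
```

With this in place, s += 'c', s += "abc" and s += 0x20ACu all compile against the one operator. The trade-off is that an argument matching none of the constructors exactly, such as a plain int, can make the call ambiguous.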

I'm also providing three versions of the copy method:

// copies a sub string of this string to s and returns the number of characters copied
// if this string is shorter than len, as many characters as possible are copied
// undefined behavior if the buffer pointed to by s is not long enough
// UTF-8 version
size_type copy (_uchar8bit *s, size_type len, size_type pos = 0) const;

// copies a sub string of this string to s and returns the number of characters copied
// if this string is shorter than len, as many characters as possible are copied
// undefined behavior if the buffer pointed to by s is not long enough
// outputs to UCS-2
size_type copy (_char16bit *s, size_type len, size_type pos = 0) const;

// outputs to UCS-4
size_type copy (_char32bit *s, size_type len, size_type pos = 0) const;
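To show what the UCS-4 version has to do, here is a rough free-function sketch. copy_as_ucs4 and sequence_length are hypothetical names, not part of the class, and the sketch assumes that pos and len count characters (code points) rather than bytes and that the input is well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence that starts with lead byte b.
// Assumes b really is a lead byte of well-formed UTF-8.
inline std::size_t sequence_length(unsigned char b)
{
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    return 4;
}

// Copies up to len code points, starting at code point index pos, into s.
// Returns the number of code points written. Like the member function,
// it does not check that the buffer s is large enough.
inline std::size_t copy_as_ucs4(const std::string &utf8, unsigned int *s,
                                std::size_t len, std::size_t pos = 0)
{
    assert(s != 0);
    std::size_t i = 0;        // byte offset into utf8
    std::size_t index = 0;    // index of the current code point
    std::size_t copied = 0;   // code points written so far
    while (i < utf8.size() && copied < len)
    {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        const std::size_t n = sequence_length(lead);
        if (i + n > utf8.size())
            break;            // truncated final sequence: stop copying
        if (index >= pos)
        {
            // payload bits of the lead byte, then 6 bits per continuation byte
            unsigned int cp = (n == 1) ? lead : (lead & (0xFFu >> (n + 1)));
            for (std::size_t k = 1; k < n; ++k)
                cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3Fu);
            s[copied++] = cp;
        }
        i += n;
        ++index;
    }
    return copied;
}
```

Because characters have different byte lengths, reaching position pos means walking the string from the start, which is why this is a linear scan rather than a pointer offset.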

For the complete source code so far, you can download these two files:

utf8string.0.h
utf8utils.h

-------------
For the other posts in this series:

Writing a STL-Style UTF-8 String Class Part 1
Writing a STL-Style UTF-8 String Class Part 2
Writing a STL-Style UTF-8 String Class Part 3
Writing a STL-Style UTF-8 String Class Part 4