Sunday, December 8, 2013

Writing a STL-Style UTF-8 string class Part 1

Update: The latest code can now be found on GitHub. Check the post "STL-Style UTF-8 String Class now on GitHub" for more information.
____________________________________________________________________________

I've really been into UTF-8 these days. I've made a video introduction to it on YouTube and I've written an article about it. Now I'm writing an std::string-style UTF-8 string class that I hope will be useful to me and to others.

This will be a five-part series that I hope to complete this week. The final installment will be a full article about the class that I'll also publish on GameDev.net.

I will set out to create a utf8string class that behaves as much like std::string as possible. It will have all of the same methods as std::string and will overload the cast operators so it can be converted to std::string and std::wstring. The class will also support C++ STL-style iterators. The iterators will be constant only, though, because UTF-8 is a variable-width encoding: a character can occupy anywhere from one to four bytes, so it wouldn't be possible to supply a mutable reference to any character. Because of this, the iterator will always return an unsigned 32-bit int type, or wchar_t on systems that use a 32-bit wchar_t type. I'll implement everything from scratch at first so you'll be able to see everything that's going on, but after that, I'll use more from the STL (Standard Template Library) to make it better.
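
To make the variable-width point concrete, here is a minimal sketch of decoding a single code point from UTF-8 bytes. It is not the iterator this series will end up with: the names utf8_sequence_length and utf8_decode are hypothetical, and the sketch assumes well-formed input, so it does not check continuation bytes or reject overlong encodings.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of bytes in the sequence that starts with `lead`.
// Returns 0 if `lead` matches none of the lead-byte bit patterns
// (for example, when it is a continuation byte).
inline std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)           return 1; // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;
}

// Hypothetical helper: decode the code point that starts at `s`.
// Assumption: the input is well-formed UTF-8; continuation bytes are not validated.
inline std::uint32_t utf8_decode(const unsigned char *s, std::size_t *length_out)
{
    assert(s != nullptr);
    const std::size_t len = utf8_sequence_length(s[0]);
    if (length_out) *length_out = len;
    if (len == 0) return 0xFFFD; // U+FFFD REPLACEMENT CHARACTER
    if (len == 1) return s[0];

    // Keep the payload bits of the lead byte, then append 6 bits per continuation byte.
    std::uint32_t cp = s[0] & (0xFF >> (len + 1));
    for (std::size_t i = 1; i < len; ++i)
        cp = (cp << 6) | (s[i] & 0x3F);
    return cp;
}
```

Since one code point can span several bytes, there is no single byte in the buffer that a mutable reference could point to, which is why handing back a decoded 32-bit value is the natural choice.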

To begin, let's start with the shell of the basic class. Over the next few days, I'll slowly transform this into a utf8string class that behaves as much like std::string as possible.

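#include <stddef.h>  // size_t
#include <string.h>  // memcpy, strlen and friends
#include <ostream>   // std::ostream
#include <stdexcept> // std::out_of_range

class mystring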
{
 private:
  // this implementation will use a null-terminated string
  unsigned char * buffer;
  // this is the size of the buffer, not the string
  size_t   reserve_size;
  // keep the size of the string so we don't have to count it every time
  size_t   used_length;

  // this will resize the buffer, but it will not shrink the buffer if new_size <= reserve_size
  // if nothing is copied, the new buffer holds an empty string
  void growbuffer(size_t new_size, bool copy_data = true)
  {
   if(new_size > reserve_size)
   {
    unsigned char *new_buffer = new unsigned char[new_size];

    if(used_length && copy_data)
    {
     // copy the buffer
     memcpy(new_buffer, buffer, used_length);
     new_buffer[used_length] = 0; // ensure null terminator
    }
    else
    {
     // nothing was copied, so make sure the new buffer is a valid empty string
     new_buffer[0] = 0;
     used_length = 0;
    }

    delete [] buffer;
    buffer = new_buffer;
    reserve_size = new_size;
   }
  }

  size_t recommendreservesize(size_t str_len)
  {
   return (str_len + 1) * 2;
  }

  // str_len is the number of 8-bit bytes before the null terminator
  void copystringtobuffer(const unsigned char *str, size_t str_len)
  {
   if(str_len >= reserve_size)
   {
    growbuffer(recommendreservesize(str_len), false);
   }
   memcpy(buffer, str, str_len);

   used_length = str_len;

   // set the null terminator
   buffer[used_length] = 0;
  }

 public:
  // default constructor
  mystring()
  :buffer(0L), reserve_size(0), used_length(0)
  {
   growbuffer(32, false); // set the string to an initial size
  }

  // build from a c string
  // undefined behavior (typically a crash) if str is NULL
  mystring(const char *str)
  :buffer(0L), reserve_size(0), used_length(0)
  {
   copystringtobuffer((const unsigned char *)str, strlen(str));
  }

  // build from a c string
  // undefined behavior (typically a crash) if str is NULL
  mystring(const unsigned char *str)
  :buffer(0L), reserve_size(0), used_length(0)
  {
   copystringtobuffer(str, strlen((const char *)str));
  }

  // construct from an unsigned char
  mystring(const unsigned char c)
  :buffer(0L), reserve_size(0), used_length(0)
  {
   // set the string to an initial size
   growbuffer(32, false);
   buffer[0] = c;
   buffer[1] = 0;
   used_length = 1;
  }

  // construct from a normal char
  mystring(const char c)
  :buffer(0L), reserve_size(0), used_length(0)
  {
   // set the string to an initial size
   growbuffer(32, false);
   buffer[0] = (unsigned char)c;
   buffer[1] = 0;
   used_length = 1;
  }

  // copy constructor
  mystring(const mystring &str)
  :buffer(0L), reserve_size(0), used_length(0)
  {
   copystringtobuffer(str.buffer, str.used_length);
  }

  // destructor
  ~mystring()
  {
   delete [] buffer;
  }

  // cast to a c-string
  const unsigned char *c_str() const
  {
   return buffer;
  }

  // assignment operator
  // we can define assignment operators for all possible types such as char, const char *, etc,
  // but this is not necessary. Because those constructors were provided, the compiler will be
  // able to build a mystring for those types and then call this overloaded operator.
  // if performance becomes an issue, the additional variations to this operator can be created
  mystring& operator= (const mystring &rvalue)
  {
   // guard against self-assignment (s = s), which would copy the buffer onto itself
   if(this != &rvalue)
   {
    copystringtobuffer(rvalue.buffer, rvalue.used_length);
   }

   return *this;
  }

  // move assignment operator
  // moves the data into this object and removes it from the old one
  // the old string has no buffer afterwards until it is assigned to again
  mystring& operator= (mystring &&rvalue)
  {
   if(this != &rvalue)
   {
    // release our own buffer first so it doesn't leak
    delete [] buffer;

    buffer   = rvalue.buffer;
    reserve_size = rvalue.reserve_size;
    used_length  = rvalue.used_length;

    // clear the values in the other string
    rvalue.buffer   = 0L;
    rvalue.reserve_size  = 0;
    rvalue.used_length  = 0;
   }

   return *this;
  }

  // request a new buffer size
  // this will resize the buffer, but it will not shrink the buffer if new_size <= reserve_size
  // not useful unless the actual size of the string is known
  void reserve(size_t new_size)
  {
   growbuffer(new_size, true);
  }

  // appends a string to the end of this one
  // we can define this operator for all possible types such as char, const char *, etc,
  // but this is not necessary. Because those constructors were provided, the compiler will be
  // able to build a mystring for those types and then call this overloaded operator.
  // if performance becomes an issue, the additional variations to this operator can be created
  mystring& operator+= (const mystring& str)
  {
   // remember the appended length now; it would change if str is this same object
   size_t add_length    = str.used_length;
   size_t total_length  = used_length + add_length;
   // >= because the buffer also needs room for the null terminator
   if(total_length >= reserve_size)
   {
    // resize the buffer
    reserve(recommendreservesize(total_length));
   }
   // copy by length instead of strcat so the string isn't rescanned
   // and appending a string to itself stays safe
   if(add_length)
   {
    memcpy(buffer + used_length, str.buffer, add_length);
   }
   used_length = total_length;

   // set the null terminator
   buffer[used_length] = 0;

   return *this;
  }

  // returns a reference to the character at the index
  // doesn't throw exception. undefined if out of range
  unsigned char &operator[](size_t pos)
  {
   return buffer[pos];
  }

  // returns a const reference to the character at the index
  // doesn't throw exception. undefined if out of range
  const unsigned char &operator[](size_t pos) const
  {
   return buffer[pos];
  }

  // returns a reference to the character at the index
  // will throw an exception if out of range
  unsigned char &at(size_t pos)
  {
   // check range
   if(pos >= used_length)
   {
    throw std::out_of_range("subscript out of range");
   }
   // use operator defined above, will help us later
   return (*this)[pos];
  }

  // returns a const reference to the character at the index
  // will throw an exception if out of range
  const unsigned char &at(size_t pos) const
  {
   // check range
   if(pos >= used_length)
   {
    throw std::out_of_range("subscript out of range");
   }
   // use operator defined above, will help us later
   return (*this)[pos];
  }

  // overload stream insertion so we can write to streams
  friend std::ostream& operator<<(std::ostream& os, const mystring& string)
  {
   os << string.c_str();

   return os;
  }

  // overload operator+ so two strings can be concatenated into a new one
  // we can define this operator for all possible types such as char, const char *, etc,
  // but this is not necessary. Because those constructors were provided, the compiler will be
  // able to build a mystring for those types and then call this overloaded operator.
  // if performance becomes an issue, the additional variations to this operator can be created
  friend mystring operator + (const mystring& lhs, const mystring& rhs)
  {
   mystring out(lhs);
   out += rhs;

   return out;
  }
};
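
The growth policy in recommendreservesize() can be looked at in isolation. The following is a rough sketch and not part of the class: recommend_reserve_size copies the formula from the listing, while count_reallocations is a hypothetical helper that simulates appending one byte at a time. It assumes the 32-byte initial buffer used by the default constructor and that the buffer must always leave room for the null terminator.

```cpp
#include <cstddef>

// Same formula as recommendreservesize() in the listing above.
inline std::size_t recommend_reserve_size(std::size_t str_len)
{
    return (str_len + 1) * 2;
}

// Hypothetical helper: counts how many times the buffer would be reallocated
// while a string grows one byte at a time to `final_length` bytes.
inline std::size_t count_reallocations(std::size_t final_length)
{
    std::size_t reserve_size = 32; // assumed initial buffer size
    std::size_t reallocations = 0;
    for (std::size_t used = 1; used <= final_length; ++used)
    {
        // the buffer must hold `used` bytes plus the null terminator
        if (used >= reserve_size)
        {
            reserve_size = recommend_reserve_size(used);
            ++reallocations;
        }
    }
    return reallocations;
}
```

Under those assumptions, growing a string to 1000 bytes one byte at a time takes five reallocations, where sizing the buffer exactly would reallocate on nearly every append. That is the reason for over-allocating.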



That's a very simple string class. It can do many of the things that std::string can do but not all of them.  Next I'll add some more things and give it more of an std::string feel.
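
For comparison, this short sketch uses std::string itself (not mystring) to show the behavior the class is aiming for. The function names are hypothetical and exist only for illustration. The comments note which operations the class above already mirrors and which it still lacks.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Operations that mystring already mirrors.
inline std::string build_greeting()
{
    std::string s("Hello"); // construct from a C string
    s += ", ";              // operator+=
    s = s + "world";        // operator+ followed by assignment
    return s;
}

// at() throws std::out_of_range for a bad index, just like mystring::at().
inline bool at_throws_out_of_range(const std::string &s, std::size_t pos)
{
    try { (void)s.at(pos); }
    catch (const std::out_of_range &) { return true; }
    return false;
}

// Members std::string offers that the class above does not have yet.
inline std::size_t find_comma(const std::string &s)
{
    assert(!s.empty());     // empty() is not in mystring yet
    return s.find(',');     // neither is find()
}
```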

For the full source code of this entry: UTF8 String Draft Header

I really want to thank Pete Goodliffe for the inspiration. I've been meaning to make a UTF-8 string class for a long time now, but his article, STL-style Circular Buffers By Example, made me want to go the extra mile with this.

-------------
For the other posts in this series:

Writing a STL-Style UTF-8 String Class Part 1
Writing a STL-Style UTF-8 String Class Part 2
Writing a STL-Style UTF-8 String Class Part 3
Writing a STL-style UTF-8 String Class Part 4