Sunday, December 8, 2013

Writing a STL-Style UTF-8 string class Part 2

Update: The latest code can now be found on GitHub. Check the post "STL-Style UTF-8 String Class now on GitHub" for more information.
____________________________________________________________________________

Over the next few days I will convert the basic string class from my last post into a complete std::string-like UTF-8 string class. Before I add the real UTF-8 handling, I want to add more of the STL-style pieces to the class, such as iterators and some standard types that I'll need.

First I'll add the following types:

#include <cstdint> // std::uint16_t, std::uint32_t

#ifndef WCHAR32BITS

// because signed or unsigned is not mandated for char or wchar_t in the standard,
// always use the char and wchar_t types or we may get compiler errors when using
// some standard functions

typedef char _char8bit; // for ASCII and UTF8
typedef unsigned char _uchar8bit; // for ASCII and UTF8
typedef wchar_t _char16bit; // for UCS2
typedef std::uint32_t _char32bit; // for UTF32

#else

typedef char _char8bit; // for ASCII and UTF8
typedef unsigned char _uchar8bit; // for ASCII and UTF8
typedef std::uint16_t _char16bit; // for UCS2
typedef wchar_t _char32bit; // for UTF32

#endif


The final UTF-8 string class should be able to process strings in different formats. These definitions keep those types organized without depending on how a particular compiler sizes its character types. I'm also explicitly defining an unsigned 8-bit char type, because the standard does not specify whether char is signed or unsigned. Every byte of a multi-byte UTF-8 sequence has a value of 0x80 or above, so the bytes need to be treated as unsigned 8-bit integers for comparisons and bit operations to work properly. Depending on whether WCHAR32BITS is defined, either _char16bit or _char32bit is defined as wchar_t itself; this prevents compiler errors with std::wstring and lets the type work with the standard wide character string functions, such as wcscpy() and wcscmp().
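To make the signedness point concrete, here is a minimal sketch. The two helper functions are hypothetical and are not part of the class; they only contrast what happens to a byte such as 0xC3 when it is held in a signed versus an unsigned 8-bit type, assuming 8-bit bytes.

```cpp
typedef unsigned char _uchar8bit; // same definition as in the post

// Hypothetical helper, for illustration only.
// A signed 8-bit type has the range -128..127, so a byte with the high
// bit set is seen as a negative number and never compares >= 0x80.
inline bool is_non_ascii_signed(signed char c)
{
    int value = c;          // promoted with its sign preserved
    return value >= 0x80;   // never true for a signed 8-bit value
}

// Hypothetical helper, for illustration only.
// With the unsigned type the byte keeps its value (0xC3 stays 195),
// so the same test behaves as intended.
inline bool is_non_ascii_unsigned(_uchar8bit c)
{
    int value = c;
    return value >= 0x80;
}
```

On the usual two's-complement platforms, is_non_ascii_signed(static_cast<signed char>(0xC3)) is false while is_non_ascii_unsigned(0xC3) is true, which is the kind of silent mismatch the _uchar8bit typedef is meant to avoid.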

Next, STL-style containers are expected to provide the following member type definitions, so they need to be added to our class. The types are fairly self-explanatory. For now the class will use _uchar8bit, but once I start adding the UTF-8 capabilities, I'll change the class to return _char32bit.

typedef _uchar8bit           value_type;
typedef _uchar8bit          *pointer;
typedef const _uchar8bit    *const_pointer;
typedef _uchar8bit          &reference;
typedef const _uchar8bit    &const_reference;
typedef size_t               size_type;
typedef ptrdiff_t            difference_type;
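As a rough illustration of why these member typedefs matter, generic code can be written purely in terms of them. The sketch below uses a hypothetical helper, count_value, and std::string as a stand-in container, since any class that exposes the same typedefs plus begin() and end() would work the same way.

```cpp
#include <string>

// Hypothetical generic helper: counts how many elements of c equal ch.
// It never names a concrete character type; it only relies on the
// standard member typedefs that the container provides.
template <class Container>
typename Container::size_type
count_value(const Container &c, typename Container::const_reference ch)
{
    typename Container::size_type n = 0;

    for (typename Container::const_iterator it = c.begin(); it != c.end(); ++it)
    {
        if (*it == ch)
            ++n;
    }

    return n;
}
```

Calling count_value(std::string("banana"), 'a') walks the string through its const_iterator and returns the count as the container's own size_type.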

The class also needs an iterator. To build it, I'll define a class that uses std::iterator as a base and std::bidirectional_iterator_tag as its category, which means I'll need to define the increment (++) and decrement (--) operators. I'll also make the iterator a template class so I won't have to write the code twice for the const version.

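template <class T>
class utf8string_iterator : public std::iterator<std::bidirectional_iterator_tag, T>
{
 public:
  // names inherited from a dependent base class are not found by unqualified
  // lookup, so the types used below are declared here explicitly
  typedef T          *pointer;
  typedef T          &reference;
  typedef std::size_t size_type;

 private:
  pointer buf_;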

  void inc()
  {
   // increments the iterator by one
   // results in undefined behavior (crashes) if already at the end
   ++buf_;
  }

  void dec()
  {
   // decrements the iterator by one
   // results in undefined behavior (crashes) if already at the beginning
   --buf_;
  }

 public:
  // b should be a null terminated string in UTF-8
  // if this is the end start_pos should be the index of the null terminator
  // start_pos should be the valid start of a character
  utf8string_iterator(pointer b, size_type start_pos)
  {
   buf_ = &b[start_pos];
  }

  // b should already point to the correct position in the string
  utf8string_iterator(pointer b)
   :buf_(b)
  {
  }

  reference operator*() const
  {
   return *buf_;
  }

  // does not check to see if it goes past the end
  // iterating past the end is undefined
  utf8string_iterator &operator++()
  {
   inc();

   return *this;
  }

  // does not check to see if it goes past the end
  // iterating past the end is undefined
  utf8string_iterator operator++(int)
  {
   utf8string_iterator copy(*this);

   // increment
   inc();

   return copy;
  }

  // does not check to see if it goes past the beginning
  // iterating past begin is undefined
  utf8string_iterator &operator--()
  {
   dec();
   return *this;
  }

  // does not check to see if it goes past the beginning
  // iterating past begin is undefined
  utf8string_iterator operator--(int)
  {
   utf8string_iterator copy(*this);

   dec();

   return copy;
  }

  bool operator == ( const utf8string_iterator &other) const
  {
   // just compare pointers
   // the programmer is responsible for making sure both iterators refer to the same set of data
   return buf_ == other.buf_;
  }
 
  bool operator != (const utf8string_iterator &other) const
  {
   // just compare pointers
   // the programmer is responsible for making sure both iterators refer to the same set of data
   return buf_ != other.buf_;
  }

  bool operator < ( const utf8string_iterator &other) const
  {
   // just compare pointers
   // the programmer is responsible for making sure both iterators refer to the same set of data
   return buf_ < other.buf_;
  }
 
  bool operator > (const utf8string_iterator &other) const
  {
   // just compare pointers
   // the programmer is responsible for making sure both iterators refer to the same set of data
   return buf_ > other.buf_;
  }
  
  bool operator <= ( const utf8string_iterator &other) const
  {
   // just compare pointers
   // the programmer is responsible for making sure both iterators refer to the same set of data
   return buf_ <= other.buf_;
  }
 
  bool operator >= (const utf8string_iterator &other) const
  {
   // just compare pointers
   // the programmer is responsible for making sure both iterators refer to the same set of data
   return buf_ >= other.buf_;
  }
};
// End ---------------------
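Because the iterator advertises itself as bidirectional and exposes the usual iterator typedefs, the standard algorithms can work with it directly. The following is a condensed sketch, not the class above: byte_iterator and find_byte are hypothetical names, and the sketch declares the five iterator typedefs itself instead of inheriting them from std::iterator.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>

// Condensed, hypothetical stand-in for utf8string_iterator.
template <class T>
class byte_iterator
{
 public:
  typedef std::bidirectional_iterator_tag iterator_category;
  typedef T                               value_type;
  typedef std::ptrdiff_t                  difference_type;
  typedef T                              *pointer;
  typedef T                              &reference;

  explicit byte_iterator(pointer p) : buf_(p) {}

  reference operator*() const { return *buf_; }

  byte_iterator &operator++() { ++buf_; return *this; }
  byte_iterator operator++(int) { byte_iterator copy(*this); ++buf_; return copy; }
  byte_iterator &operator--() { --buf_; return *this; }
  byte_iterator operator--(int) { byte_iterator copy(*this); --buf_; return copy; }

  bool operator==(const byte_iterator &other) const { return buf_ == other.buf_; }
  bool operator!=(const byte_iterator &other) const { return buf_ != other.buf_; }

 private:
  pointer buf_;
};

// Hypothetical helper: returns the index of the first byte equal to ch,
// or len if no such byte exists in the first len bytes of s.
inline std::size_t find_byte(const unsigned char *s, std::size_t len, unsigned char ch)
{
  byte_iterator<const unsigned char> first(s);
  byte_iterator<const unsigned char> last(s + len);

  // std::find and std::distance only need ++, *, and != from the iterator
  byte_iterator<const unsigned char> hit = std::find(first, last, ch);

  return static_cast<std::size_t>(std::distance(first, hit));
}
```

Instantiating the template with a const element type, as find_byte does, is the same trick the post uses to get a const iterator without writing the class twice.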

Now these types need to be added to the class definition. For now, I will not need to write a reverse iterator. There is an STL class, std::reverse_iterator, that I can use to do that.

// make our iterator types here
typedef utf8string_iterator<value_type>       iterator;
typedef utf8string_iterator<const value_type> const_iterator;
typedef std::reverse_iterator<iterator>       reverse_iterator;
typedef std::reverse_iterator<const_iterator> const_reverse_iterator;
// End ---------------------
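As a small sketch of what std::reverse_iterator provides, the example below wraps plain pointers rather than the class's own iterator so that it stays self-contained. The helper name reversed_bytes is hypothetical, and it reverses bytes, not UTF-8 characters, so it is only meaningful for ASCII input.

```cpp
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical helper: returns the first len bytes of s in reverse order.
inline std::string reversed_bytes(const char *s, std::size_t len)
{
  typedef std::reverse_iterator<const char *> const_reverse_iterator;

  // a reverse iterator is built from the position one past where it points:
  // rbegin wraps end(), rend wraps begin()
  const_reverse_iterator rbegin(s + len);
  const_reverse_iterator rend(s);

  // incrementing a reverse iterator decrements the wrapped iterator
  return std::string(rbegin, rend);
}
```

The adaptor works by decrementing a copy of the wrapped iterator before dereferencing it, which is why the underlying iterator has to be at least bidirectional and supply operator--.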

After adding a few more functions and reorganizing the class, I now have a class that's more than 50% like std::string, but it's missing the most important thing: it needs to process UTF-8. I'll write all of the necessary UTF-8 functions in my next post.

Be sure to check out my previous post Writing a STL-Style UTF-8 string class Part 1 and my video presentation that gives a good introduction to UTF-8.

For the full source code: mystring.1.h

-------------
For the other posts in this series:

Writing a STL-Style UTF-8 String Class Part 1
Writing a STL-Style UTF-8 String Class Part 2
Writing a STL-Style UTF-8 String Class Part 3
Writing a STL-style UTF-8 String Class Part 4


1 comment:

  1. By the way, I do understand that char will not always be 8 bits, but it will always be the smallest addressable memory unit as defined by C/C++. Systems that don't allow for 8-bit bytes probably wouldn't use UTF-8 to store their strings internally, as the extra processing to extract the 8-bit units wouldn't make sense. In systems that use bytes that are greater than 8 bits, 16-bit or 32-bit encodings of Unicode should be used.
