I've really been into UTF-8 these days. I've made a video introduction to it on YouTube and I've written an article. Now I'm writing a std::string-style UTF-8 string class that I hope will be useful to me and to others.
This will be a five-part series that I hope to complete this week. The final installment will be a full article about the class that I'll also publish on Gamedev.net.
I will set out to create a utf8string class that behaves as much as possible like std::string. It will have all of the same methods as std::string and will overload the conversion operators so that it can be cast to std::string and std::wstring. The class will also support C++ STL-style iterators. The iterators will be constant only, though, because UTF-8 is a variable-width encoding: a character can occupy anywhere from one to four bytes, so it isn't possible to supply a mutable reference to any character. Because of this, the iterator will always return an unsigned 32-bit int type, or wchar_t on systems that use a 32-bit wchar_t type. I'll implement everything from scratch at first so you'll be able to see everything that's going on, but after that, I'll use more from the STL (Standard Template Library) to make it better.
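To make the point about constant iterators concrete, here is a minimal sketch of how a single character could be decoded from a UTF-8 buffer. The helper names utf8_sequence_length and utf8_decode are my own placeholders, not part of the class in this series, and the sketch assumes well-formed input: it does not reject overlong or truncated sequences.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: how many bytes the sequence starting with `lead` occupies.
// Returns 0 if `lead` cannot start a sequence (continuation byte or invalid lead).
inline std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Hypothetical helper: decode the code point that starts at p[0].
// Assumes the whole sequence is present and well formed (no validation).
inline std::uint32_t utf8_decode(const unsigned char *p)
{
    const std::size_t len = utf8_sequence_length(p[0]);
    assert(len != 0);
    if (len == 1) return p[0];

    // The lead byte carries 5, 4 or 3 payload bits for lengths 2, 3 and 4.
    std::uint32_t code_point = p[0] & (0xFFu >> (len + 1));
    for (std::size_t i = 1; i < len; ++i)
    {
        // Each continuation byte (10xxxxxx) contributes 6 more bits.
        code_point = (code_point << 6) | (p[i] & 0x3Fu);
    }
    return code_point;
}
```

The decoded value is assembled from bits spread over several bytes, so it has no single address inside the buffer. That is why an iterator over such a string can hand back a 32-bit value but not a mutable reference.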
Let's start with the shell of the basic class. Over the next few days, I'll slowly transform this into a utf8string class that behaves as much like std::string as possible.
#include <cstddef>    // size_t
#include <cstring>    // memcpy, strlen
#include <ostream>    // std::ostream
#include <stdexcept>  // std::out_of_range

class mystring
{
private:
    // this implementation will use a null-terminated string
    unsigned char *buffer;
    // this is the size of the buffer, not the string
    size_t reserve_size;
    // keep the size of the string so we don't have to count it every time
    size_t used_length;

    // this will resize the buffer, but it will not shrink the buffer if new_size < reserve_size
    void growbuffer(size_t new_size, bool copy_data = true)
    {
        if(new_size > reserve_size)
        {
            unsigned char *new_buffer = new unsigned char[new_size];
            if(used_length && copy_data)
            {
                // copy the buffer
                memcpy(new_buffer, buffer, used_length);
                new_buffer[used_length] = 0; // ensure null terminator
            }
            else
            {
                // nothing was copied, so start with an empty string;
                // otherwise c_str() could expose uninitialized memory
                new_buffer[0] = 0;
            }
            delete [] buffer;
            buffer = new_buffer;
            reserve_size = new_size;
        }
    }

    size_t recommendreservesize(size_t str_len)
    {
        return (str_len + 1) * 2;
    }

    // str_len is the number of 8-bit bytes before the null terminator
    void copystringtobuffer(const unsigned char *str, size_t str_len)
    {
        if(str_len >= reserve_size)
        {
            growbuffer(recommendreservesize(str_len), false);
        }
        memcpy(buffer, str, str_len);
        used_length = str_len;
        // set the null terminator
        buffer[used_length] = 0;
    }

public:
    // default constructor
    mystring()
        :buffer(nullptr), reserve_size(0), used_length(0)
    {
        growbuffer(32, false); // set the string to an initial size
    }

    // build from a c string
    // undefined (ie crashes) if str is NULL
    mystring(const char *str)
        :buffer(nullptr), reserve_size(0), used_length(0)
    {
        copystringtobuffer((const unsigned char *)str, strlen(str));
    }

    // build from a c string
    // undefined (ie crashes) if str is NULL
    mystring(const unsigned char *str)
        :buffer(nullptr), reserve_size(0), used_length(0)
    {
        copystringtobuffer(str, strlen((const char *)str));
    }

    // construct from an unsigned char
    mystring(const unsigned char c)
        :buffer(nullptr), reserve_size(0), used_length(0)
    {
        // set the string to an initial size
        growbuffer(32, false);
        buffer[0] = c;
        buffer[1] = 0;
        used_length = 1;
    }

    // construct from a normal char
    mystring(const char c)
        :buffer(nullptr), reserve_size(0), used_length(0)
    {
        // set the string to an initial size
        growbuffer(32, false);
        buffer[0] = (unsigned char)c;
        buffer[1] = 0;
        used_length = 1;
    }

    // copy constructor
    mystring(const mystring &str)
        :buffer(nullptr), reserve_size(0), used_length(0)
    {
        copystringtobuffer(str.buffer, str.used_length);
    }

    // destructor
    ~mystring()
    {
        delete [] buffer;
    }

    // cast to a c-string
    const unsigned char *c_str() const
    {
        return buffer;
    }

    // assignment operator
    // we can define assignment operators for all possible types such as char, const char *, etc,
    // but this is not necessary. Because those constructors were provided, the compiler will be
    // able to build a mystring for those types and then call this overloaded operator.
    // if performance becomes an issue, the additional variations to this operator can be created
    mystring& operator= (const mystring &rvalue)
    {
        // guard against self-assignment (memcpy onto the same buffer is undefined)
        if(this != &rvalue)
        {
            copystringtobuffer(rvalue.buffer, rvalue.used_length);
        }
        return *this;
    }

    // move assignment operator
    // should move the data to this object and remove it from the old one
    mystring& operator= (mystring &&rvalue)
    {
        if(this != &rvalue)
        {
            // release our own buffer first, otherwise it would leak
            delete [] buffer;
            buffer = rvalue.buffer;
            reserve_size = rvalue.reserve_size;
            used_length = rvalue.used_length;
            // clear the values in the other string
            rvalue.buffer = nullptr;
            rvalue.reserve_size = 0;
            rvalue.used_length = 0;
        }
        return *this;
    }

    // request a new buffer size
    // this will resize the buffer, but it will not shrink the buffer if new_size < reserve_size
    // not useful unless the actual size of the string is known
    void reserve(size_t new_size)
    {
        growbuffer(new_size, true);
    }

    // appends a string to the end of this one
    // we can define this operator for all possible types such as char, const char *, etc,
    // but this is not necessary. Because those constructors were provided, the compiler will be
    // able to build a mystring for those types and then call this overloaded operator.
    // if performance becomes an issue, the additional variations to this operator can be created
    mystring& operator+= (const mystring& str)
    {
        // remember the length now: if str is *this, used_length changes below
        const size_t append_length = str.used_length;
        const size_t total_length = used_length + append_length;
        if(total_length >= reserve_size)
        {
            // resize the buffer; >= leaves room for the null terminator
            reserve(recommendreservesize(total_length));
        }
        // read str.buffer after the resize so that self-append sees the new buffer;
        // memcpy is used instead of strcat because the lengths are already known
        memcpy(buffer + used_length, str.buffer, append_length);
        used_length = total_length;
        // set the null terminator
        buffer[used_length] = 0;
        return *this;
    }

    // returns a reference to the character at the index
    // doesn't throw exception. undefined if out of range
    unsigned char &operator[](size_t pos)
    {
        return buffer[pos];
    }

    // returns a const reference to the character at the index
    // doesn't throw exception. undefined if out of range
    const unsigned char &operator[](size_t pos) const
    {
        return buffer[pos];
    }

    // returns a reference to the character at the index
    // will throw an exception if out of range
    unsigned char &at(size_t pos)
    {
        // check range
        if(pos >= used_length)
        {
            throw std::out_of_range("subscript out of range");
        }
        // use operator defined above, will help us later
        return (*this)[pos];
    }

    // returns a const reference to the character at the index
    // will throw an exception if out of range
    const unsigned char &at(size_t pos) const
    {
        // check range
        if(pos >= used_length)
        {
            throw std::out_of_range("subscript out of range");
        }
        // use operator defined above, will help us later
        return (*this)[pos];
    }

    // overload stream insertion so we can write to streams
    friend std::ostream& operator<<(std::ostream& os, const mystring& string)
    {
        os << string.c_str();
        return os;
    }

    // concatenates two strings and returns the result as a new string
    // we can define this operator for all possible types such as char, const char *, etc,
    // but this is not necessary. Because those constructors were provided, the compiler will be
    // able to build a mystring for those types and then call this overloaded operator.
    // if performance becomes an issue, the additional variations to this operator can be created
    friend mystring operator + (const mystring& lhs, const mystring& rhs)
    {
        mystring out(lhs);
        out += rhs;
        return out;
    }
};
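The comments in the class mention that the compiler can build a temporary string from a char or a const char * and then call the one operator that takes a string. Here is a small self-contained sketch of that mechanism. The tinystring class and the build_greeting function are hypothetical stand-ins I made up for this illustration; tinystring stores its data in a std::string purely to keep the example short, which is not how mystring works.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for mystring, backed by std::string for brevity.
class tinystring
{
    std::string data;

public:
    // converting constructors: these are what make the implicit conversions possible
    tinystring(const char *s) : data(s) {}
    tinystring(char c) : data(1, c) {}

    // the only += overload; it accepts nothing but another tinystring
    tinystring &operator+=(const tinystring &rhs)
    {
        data += rhs.data;
        return *this;
    }

    const char *c_str() const { return data.c_str(); }
    std::size_t size() const { return data.size(); }
};

// Appends a string literal and a single char using the one += overload.
inline tinystring build_greeting()
{
    tinystring s("Hello");
    s += ", world"; // const char * -> temporary tinystring -> operator+=
    s += '!';       // char -> temporary tinystring -> operator+=
    return s;
}
```

Each of those appends constructs and destroys a temporary object. With a class like mystring that means an extra allocation and copy per call, which is the performance cost the comments refer to when they suggest adding dedicated overloads later.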
That's a very simple string class. It can do many of the things that std::string can do, but not all of them, and at this point it still treats its contents as plain bytes with no UTF-8 awareness. Next I'll add some more things and give it more of a std::string feel.
For the full source code of this entry: UTF8 String Draft Header
I really want to thank Pete Goodliffe for the inspiration. I've been meaning to make a UTF-8 string class for a long time now, but his article, STL-style Circular Buffers By Example, made me want to go the extra mile with this.
-------------
For the other posts in this series:
Writing a STL-Style UTF-8 String Class Part 1
Writing a STL-Style UTF-8 String Class Part 2
Writing a STL-Style UTF-8 String Class Part 3
Writing a STL-style UTF-8 String Class Part 4