2 <!DOCTYPE part PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
3 "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"
6 <part id="manual.strings" xreflabel="Strings">
7 <?dbhtml filename="strings.html"?>
22 <indexterm><primary>Strings</primary></indexterm>
25 <!-- Chapter 01 : Character Traits -->
27 <!-- Chapter 02 : String Classes -->
28 <chapter id="manual.strings.string" xreflabel="string">
29 <title>String Classes</title>
31 <sect1 id="strings.string.simple" xreflabel="Simple Transformations">
32 <title>Simple Transformations</title>
34 Here are Standard, simple, and portable ways to perform common
35 transformations on a <code>string</code> instance, such as
36 "convert to all upper case." The word transformations
37 is especially apt, because the standard template function
38 <code>transform<></code> is used.
This code will go through some iterations. Here's a simple
version:
<programlisting>
#include <string>
#include <algorithm>
#include <cctype>      // old <ctype.h>

struct ToLower
{
  // the unsigned char cast avoids undefined behaviour for negative char values
  char operator() (char c) const
  { return static_cast<char>(std::tolower(static_cast<unsigned char>(c))); }
};

struct ToUpper
{
  char operator() (char c) const
  { return static_cast<char>(std::toupper(static_cast<unsigned char>(c))); }
};

int main()
{
  std::string  s ("Some Kind Of Initial Input Goes Here");

  // Change everything into upper case
  std::transform (s.begin(), s.end(), s.begin(), ToUpper());

  // Change everything into lower case
  std::transform (s.begin(), s.end(), s.begin(), ToLower());

  // Change everything back into upper case, but store the
  // result in a different string
  std::string  capital_s;
  capital_s.resize(s.size());
  std::transform (s.begin(), s.end(), capital_s.begin(), ToUpper());
}
</programlisting>
77 <emphasis>Note</emphasis> that these calls all
78 involve the global C locale through the use of the C functions
79 <code>toupper/tolower</code>. This is absolutely guaranteed to work --
80 but <emphasis>only</emphasis> if the string contains <emphasis>only</emphasis> characters
from the basic source character set, and there are <emphasis>only</emphasis>
96 of those, which means that not even all English text can be
represented (certain British spellings, proper names, and so forth).
So, if all your input forevermore consists of only those 96
characters (hahahahahaha), then you're done.
87 <para><emphasis>Note</emphasis> that the
88 <code>ToUpper</code> and <code>ToLower</code> function objects
89 are needed because <code>toupper</code> and <code>tolower</code>
90 are overloaded names (declared in <code><cctype></code> and
91 <code><locale></code>) so the template-arguments for
92 <code>transform<></code> cannot be deduced, as explained in
93 <ulink url="http://gcc.gnu.org/ml/libstdc++/2002-11/msg00180.html">this
95 <!-- section 14.8.2.4 clause 16 in ISO 14882:1998 -->
96 At minimum, you can write short wrappers like
101 return std::tolower(c);
103 <para>The correct method is to use a facet for a particular locale
104 and call its conversion functions. These are discussed more in
105 Chapter 22; the specific part is
106 <ulink url="../22_locale/howto.html#7">Correct Transformations</ulink>,
107 which shows the final version of this code. (Thanks to James Kanze
108 for assistance and suggestions on all of this.)
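<para>As a rough illustration of the facet approach, the sketch below
upper-cases a string through the <code>std::ctype<char></code> facet of a
locale. The helper name <code>to_upper_in_locale</code> is hypothetical, and
defaulting to the classic "C" locale is an assumption made only to keep the
example self-contained; real code would pass the locale its text actually
belongs to.</para>

```cpp
#include <locale>
#include <string>

// Hypothetical helper (not part of libstdc++): upper-case a copy of `s`
// using the ctype facet of the locale `loc`.
std::string to_upper_in_locale(std::string s,
                               const std::locale& loc = std::locale::classic())
{
    if (!s.empty())
    {
        const std::ctype<char>& ct = std::use_facet<std::ctype<char> >(loc);
        // The two-pointer overload converts the range [first, last) in place.
        ct.toupper(&s[0], &s[0] + s.size());
    }
    return s;
}
```

<para>Because the facet is looked up from an explicit locale object, the
result no longer depends on whatever the global C locale happens to be.</para>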
110 <para>Another common operation is trimming off excess whitespace. Much
111 like transformations, this task is trivial with the use of string's
112 <code>find</code> family. These examples are broken into multiple
113 statements for readability:
<programlisting>
std::string  str (" \t blah blah blah \n ");

// trim leading whitespace
std::string::size_type  notwhite = str.find_first_not_of(" \t\n");
str.erase(0,notwhite);

// trim trailing whitespace
notwhite = str.find_last_not_of(" \t\n");
str.erase(notwhite+1); </programlisting>
125 <para>Obviously, the calls to <code>find</code> could be inserted directly
126 into the calls to <code>erase</code>, in case your compiler does not
127 optimize named temporaries out of existence.
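<para>The same two steps can be folded into a small function. This is only a
sketch: the name <code>trim</code> and its default whitespace set are
assumptions for illustration, and it returns a copy rather than modifying
its argument.</para>

```cpp
#include <string>

// Hypothetical helper: return `str` without leading and trailing
// characters that appear in `ws`.
std::string trim(const std::string& str, const char* ws = " \t\n")
{
    const std::string::size_type first = str.find_first_not_of(ws);
    if (first == std::string::npos)
        return std::string();           // nothing but whitespace

    const std::string::size_type last = str.find_last_not_of(ws);
    return str.substr(first, last - first + 1);
}
```

<para>The explicit <code>npos</code> check handles a string that is empty or
consists entirely of whitespace.</para>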
131 <sect1 id="strings.string.case" xreflabel="Case Sensitivity">
132 <title>Case Sensitivity</title>
136 <para>The well-known-and-if-it-isn't-well-known-it-ought-to-be
137 <ulink url="http://www.gotw.ca/gotw/">Guru of the Week</ulink>
138 discussions held on Usenet covered this topic in January of 1998.
139 Briefly, the challenge was, <quote>write a 'ci_string' class which
140 is identical to the standard 'string' class, but is
141 case-insensitive in the same way as the (common but nonstandard)
142 C function stricmp()</quote>.
<programlisting>
ci_string s( "AbCdE" );

// case insensitive
assert( s == "abcde" );
assert( s == "ABCDE" );

// still case-preserving, of course
assert( strcmp( s.c_str(), "AbCdE" ) == 0 );
assert( strcmp( s.c_str(), "abcde" ) != 0 ); </programlisting>
155 <para>The solution is surprisingly easy. The original answer was
156 posted on Usenet, and a revised version appears in Herb Sutter's
157 book <emphasis>Exceptional C++</emphasis> and on his website as <ulink url="http://www.gotw.ca/gotw/029.htm">GotW 29</ulink>.
159 <para>See? Told you it was easy!</para>
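<para>For readers who want to see the shape of such a solution, here is a
minimal sketch in the spirit of the GotW answer; it is not a copy of it. The
idea is to derive a traits class from <code>std::char_traits<char></code>
and replace only the comparison functions. The names
<code>ci_char_traits</code> and <code>fold</code> are chosen here for
illustration, and the case folding goes through the C function
<code>toupper</code>, so it only handles what the global C locale
handles.</para>

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Sketch of a case-insensitive traits class; only comparisons are replaced,
// everything else is inherited from std::char_traits<char>.
struct ci_char_traits : public std::char_traits<char>
{
    // Hypothetical helper: fold one character to upper case.
    static char fold(char c)
    { return static_cast<char>(std::toupper(static_cast<unsigned char>(c))); }

    static bool eq(char a, char b) { return fold(a) == fold(b); }
    static bool lt(char a, char b) { return fold(a) < fold(b); }

    static int compare(const char* a, const char* b, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
        {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }

    static const char* find(const char* s, std::size_t n, char c)
    {
        for (std::size_t i = 0; i < n; ++i)
            if (eq(s[i], c)) return s + i;
        return 0;
    }
};

typedef std::basic_string<char, ci_char_traits> ci_string;
```

<para>Note that <code>ci_string</code> is a different type from
<code>std::string</code>, so the two do not convert to each other or share
stream operators without extra work.</para>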
161 <emphasis>Added June 2000:</emphasis> The May 2000 issue of C++
162 Report contains a fascinating <ulink
163 url="http://lafstern.org/matt/col2_new.pdf"> article</ulink> by
164 Matt Austern (yes, <emphasis>the</emphasis> Matt Austern) on why
165 case-insensitive comparisons are not as easy as they seem, and
166 why creating a class is the <emphasis>wrong</emphasis> way to go
about it in production code. (The GotW answer mentions one of
the principal difficulties; his article mentions more.)
170 <para>Basically, this is "easy" only if you ignore some things,
171 things which may be too important to your program to ignore. (I chose
172 to ignore them when originally writing this entry, and am surprised
173 that nobody ever called me on it...) The GotW question and answer
174 remain useful instructional tools, however.
176 <para><emphasis>Added September 2000:</emphasis> James Kanze provided a link to a
177 <ulink url="http://www.unicode.org/unicode/reports/tr21/">Unicode
178 Technical Report discussing case handling</ulink>, which provides some
179 very good information.
183 <sect1 id="strings.string.character_types" xreflabel="Arbitrary Characters">
184 <title>Arbitrary Character Types</title>
188 <para>The <code>std::basic_string</code> is tantalizingly general, in that
189 it is parameterized on the type of the characters which it holds.
190 In theory, you could whip up a Unicode character class and instantiate
191 <code>std::basic_string<my_unicode_char></code>, or assuming
192 that integers are wider than characters on your platform, maybe just
193 declare variables of type <code>std::basic_string<int></code>.
195 <para>That's the theory. Remember however that basic_string has additional
196 type parameters, which take default arguments based on the character
197 type (called <code>CharT</code> here):
<programlisting>
template <typename CharT,
          typename Traits = char_traits<CharT>,
          typename Alloc = allocator<CharT> >
class basic_string { .... };</programlisting>
204 <para>Now, <code>allocator<CharT></code> will probably Do The Right
205 Thing by default, unless you need to implement your own allocator
208 <para>But <code>char_traits</code> takes more work. The char_traits
209 template is <emphasis>declared</emphasis> but not <emphasis>defined</emphasis>.
210 That means there is only
<programlisting>
template <typename CharT>
  struct char_traits
  {
      static void foo (type1 x, type2 y);
      ...
  };</programlisting>
219 <para>and functions such as char_traits<CharT>::foo() are not
220 actually defined anywhere for the general case. The C++ standard
221 permits this, because writing such a definition to fit all possible
222 CharT's cannot be done.
224 <para>The C++ standard also requires that char_traits be specialized for
225 instantiations of <code>char</code> and <code>wchar_t</code>, and it
226 is these template specializations that permit entities like
227 <code>basic_string<char,char_traits<char>></code> to work.
229 <para>If you want to use character types other than char and wchar_t,
230 such as <code>unsigned char</code> and <code>int</code>, you will
need suitable specializations for them. For a time, in earlier
versions of GCC, there was a mostly-correct implementation that
let programmers be lazy, but it broke in many situations, so it
was removed. GCC 3.4 introduced a new implementation that mostly
235 works and can be specialized even for <code>int</code> and other
238 <para>If you want to use your own special character class, then you have
239 <ulink url="http://gcc.gnu.org/ml/libstdc++/2002-08/msg00163.html">a lot
of work to do</ulink>, especially if you wish to use i18n features
241 (facets require traits information but don't have a traits argument).
243 <para>Another example of how to specialize char_traits was given <ulink url="http://gcc.gnu.org/ml/libstdc++/2002-08/msg00260.html">on the
244 mailing list</ulink> and at a later date was put into the file <code>
245 include/ext/pod_char_traits.h</code>. We agree
246 that the way it's used with basic_string (scroll down to main())
247 doesn't look nice, but that's because <ulink url="http://gcc.gnu.org/ml/libstdc++/2002-08/msg00236.html">the
248 nice-looking first attempt</ulink> turned out to <ulink url="http://gcc.gnu.org/ml/libstdc++/2002-08/msg00242.html">not
249 be conforming C++</ulink>, due to the rule that CharT must be a POD.
250 (See how tricky this is?)
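<para>To give a feel for how much a traits class has to provide, here is a
sketch for <code>unsigned char</code>. Note the swap in technique: rather
than specializing <code>std::char_traits</code>, it defines a separate class
and passes it as the second template argument. The names
<code>uchar_traits</code> and <code>ustring</code> are hypothetical, and the
choices of <code>int</code> for <code>int_type</code> and -1 for
<code>eof()</code> are assumptions modelled on the <code>char</code>
case.</para>

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>   // std::mbstate_t
#include <ios>      // std::streamoff, std::streampos
#include <string>

// Hypothetical traits class for unsigned char, supplied explicitly
// instead of specializing std::char_traits.
struct uchar_traits
{
    typedef unsigned char   char_type;
    typedef int             int_type;
    typedef std::streamoff  off_type;
    typedef std::streampos  pos_type;
    typedef std::mbstate_t  state_type;

    static void assign(char_type& dst, const char_type& src) { dst = src; }
    static bool eq(char_type a, char_type b) { return a == b; }
    static bool lt(char_type a, char_type b) { return a < b; }

    static int compare(const char_type* a, const char_type* b, std::size_t n)
    { return std::memcmp(a, b, n); }

    static std::size_t length(const char_type* s)
    {
        std::size_t n = 0;
        while (s[n] != 0) ++n;
        return n;
    }

    static const char_type* find(const char_type* s, std::size_t n,
                                 const char_type& c)
    { return static_cast<const char_type*>(std::memchr(s, c, n)); }

    static char_type* move(char_type* dst, const char_type* src, std::size_t n)
    { return static_cast<char_type*>(std::memmove(dst, src, n)); }

    static char_type* copy(char_type* dst, const char_type* src, std::size_t n)
    { return static_cast<char_type*>(std::memcpy(dst, src, n)); }

    static char_type* assign(char_type* s, std::size_t n, char_type c)
    { return static_cast<char_type*>(std::memset(s, c, n)); }

    static int_type  eof() { return -1; }
    static int_type  not_eof(int_type c) { return eq_int_type(c, eof()) ? 0 : c; }
    static char_type to_char_type(int_type c) { return static_cast<char_type>(c); }
    static int_type  to_int_type(char_type c) { return static_cast<int_type>(c); }
    static bool      eq_int_type(int_type a, int_type b) { return a == b; }
};

typedef std::basic_string<unsigned char, uchar_traits> ustring;
```

<para>This is enough for the string itself; as noted above, streams and
locale facets for such a type are a separate and larger job.</para>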
255 <sect1 id="strings.string.token" xreflabel="Tokenizing">
256 <title>Tokenizing</title>
259 <para>The Standard C (and C++) function <code>strtok()</code> leaves a lot to
260 be desired in terms of user-friendliness. It's unintuitive, it
261 destroys the character string on which it operates, and it requires
262 you to handle all the memory problems. But it does let the client
263 code decide what to use to break the string into pieces; it allows
264 you to choose the "whitespace," so to speak.
266 <para>A C++ implementation lets us keep the good things and fix those
267 annoyances. The implementation here is more intuitive (you only
268 call it once, not in a loop with varying argument), it does not
269 affect the original string at all, and all the memory allocation
272 <para>It's called stringtok, and it's a template function. Sources are
273 as below, in a less-portable form than it could be, to keep this
274 example simple (for example, see the comments on what kind of
275 string it will accept).
<programlisting>
#include <string>

template <typename Container>
void
stringtok(Container &container, std::string const &in,
          const char * const delimiters = " \t\n")
{
    const std::string::size_type len = in.length();
          std::string::size_type i = 0;

    while (i < len)
    {
        // Eat leading whitespace
        i = in.find_first_not_of(delimiters, i);
        if (i == std::string::npos)
          return;   // Nothing left but white space

        // Find the end of the token
        std::string::size_type j = in.find_first_of(delimiters, i);

        // Push token
        if (j == std::string::npos)
        {
          container.push_back(in.substr(i));
          return;
        }
        else
          container.push_back(in.substr(i, j-i));

        // Set up for next loop
        i = j + 1;
    }
}
</programlisting>
315 The author uses a more general (but less readable) form of it for
parsing command strings and the like. If you compiled and ran this
code using it:
<programlisting>
std::list<std::string>  ls;
stringtok (ls, " this \t is\t\n a test ");
for (std::list<std::string>::const_iterator i = ls.begin();
     i != ls.end(); ++i)
{
    std::cerr << ':' << (*i) << ":\n";
}</programlisting>
329 <para>You would see this as output:
<programlisting>
:this:
:is:
:a:
:test: </programlisting>
336 <para>with all the whitespace removed. The original <code>s</code> is still
337 available for use, <code>ls</code> will clean up after itself, and
338 <code>ls.size()</code> will return how many tokens there were.
340 <para>As always, there is a price paid here, in that stringtok is not
341 as fast as strtok. The other benefits usually outweigh that, however.
342 <ulink url="stringtok_std_h.txt">Another version of stringtok is given
343 here</ulink>, suggested by Chris King and tweaked by Petr Prikryl,
344 and this one uses the
345 transformation functions mentioned below. If you are comfortable
346 with reading the new function names, this version is recommended
349 <para><emphasis>Added February 2001:</emphasis> Mark Wilden pointed out that the
350 standard <code>std::getline()</code> function can be used with standard
351 <ulink url="../27_io/howto.html">istringstreams</ulink> to perform
352 tokenizing as well. Build an istringstream from the input text,
353 and then use std::getline with varying delimiters (the three-argument
354 signature) to extract tokens into a string.
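<para>A minimal sketch of that approach follows. The helper name
<code>split_with_getline</code> is hypothetical, and it uses a single
delimiter character for the whole input, although the delimiter could just
as well change from one <code>std::getline</code> call to the next.</para>

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split `in` on one delimiter character using
// the three-argument form of std::getline.
std::vector<std::string> split_with_getline(const std::string& in, char delim)
{
    std::vector<std::string> tokens;
    std::istringstream stream(in);
    std::string token;

    // Each call extracts characters up to (and discarding) the next delim.
    while (std::getline(stream, token, delim))
        tokens.push_back(token);

    return tokens;
}
```

<para>Unlike stringtok, this keeps the empty fields between adjacent
delimiters, which may or may not be what you want.</para>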
359 <sect1 id="strings.string.shrink" xreflabel="Shrink to Fit">
360 <title>Shrink to Fit</title>
363 <para>From GCC 3.4 calling <code>s.reserve(res)</code> on a
364 <code>string s</code> with <code>res < s.capacity()</code> will
365 reduce the string's capacity to <code>std::max(s.size(), res)</code>.
367 <para>This behaviour is suggested, but not required by the standard. Prior
to GCC 3.4 the following alternative can be used instead:
371 std::string(str.data(), str.size()).swap(str);
373 <para>This is similar to the idiom for reducing a <code>vector</code>'s
374 memory usage (see <ulink url='../faq/index.html#5_9'>FAQ 5.9</ulink>) but
375 the regular copy constructor cannot be used because libstdc++'s
376 <code>string</code> is Copy-On-Write.
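<para>Wrapped in a function, the swap idiom shown above might look like the
sketch below. The free function name <code>shrink_to_fit</code> is
hypothetical. How much capacity the string ends up with is left to the
implementation, so the only safe expectation is that the contents are
unchanged and the capacity still covers the size.</para>

```cpp
#include <string>

// Hypothetical helper: rebuild `str` from its own characters so that the
// temporary, and after the swap `str`, holds a freshly sized buffer.
void shrink_to_fit(std::string& str)
{
    std::string(str.data(), str.size()).swap(str);
}
```

<para>Since C++11 the standard string also has a member function
<code>shrink_to_fit()</code>, which is likewise a non-binding request.</para>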
382 <sect1 id="strings.string.Cstring" xreflabel="CString (MFC)">
383 <title>CString (MFC)</title>
387 <para>A common lament seen in various newsgroups deals with the Standard
388 string class as opposed to the Microsoft Foundation Class called
389 CString. Often programmers realize that a standard portable
390 answer is better than a proprietary nonportable one, but in porting
391 their application from a Win32 platform, they discover that they
392 are relying on special functions offered by the CString class.
394 <para>Things are not as bad as they seem. In
395 <ulink url="http://gcc.gnu.org/ml/gcc/1999-04n/msg00236.html">this
396 message</ulink>, Joe Buck points out a few very important things:
399 <listitem><para>The Standard <code>string</code> supports all the operations
400 that CString does, with three exceptions.
402 <listitem><para>Two of those exceptions (whitespace trimming and case
403 conversion) are trivial to implement. In fact, we do so
406 <listitem><para>The third is <code>CString::Format</code>, which allows formatting
407 in the style of <code>sprintf</code>. This deserves some mention:
411 The old libg++ library had a function called form(), which did much
412 the same thing. But for a Standard solution, you should use the
413 stringstream classes. These are the bridge between the iostream
414 hierarchy and the string class, and they operate with regular
415 streams seamlessly because they inherit from the iostream
hierarchy. A quick example:
<programlisting>
#include <iostream>
#include <string>
#include <sstream>

std::string f (const std::string& incoming)     // incoming is "foo N"
{
    std::istringstream   incoming_stream(incoming);
    std::string          the_word;
    int                  the_number;

    incoming_stream >> the_word        // extract "foo"
                    >> the_number;     // extract N

    std::ostringstream   output_stream;
    output_stream << "The word was " << the_word
                  << " and 3*N was " << (3*the_number);

    return output_stream.str();
}
</programlisting>
438 <para>A serious problem with CString is a design bug in its memory
439 allocation. Specifically, quoting from that same message:
442 CString suffers from a common programming error that results in
443 poor performance. Consider the following code:
<programlisting>
CString n_copies_of (const CString& foo, unsigned n)
{
    CString tmp;
    for (unsigned i = 0; i < n; i++)
        tmp += foo;
    return tmp;
}</programlisting>
453 This function is O(n^2), not O(n). The reason is that each +=
454 causes a reallocation and copy of the existing string. Microsoft
455 applications are full of this kind of thing (quadratic performance
456 on tasks that can be done in linear time) -- on the other hand,
457 we should be thankful, as it's created such a big market for high-end
460 If you replace CString with string in the above function, the
463 <para>Joe Buck also pointed out some other things to keep in mind when
464 comparing CString and the Standard string class:
467 <listitem><para>CString permits access to its internal representation; coders
468 who exploited that may have problems moving to <code>string</code>.
470 <listitem><para>Microsoft ships the source to CString (in the files
471 MFC\SRC\Str{core,ex}.cpp), so you could fix the allocation
472 bug and rebuild your MFC libraries.
473 <emphasis><emphasis>Note:</emphasis> It looks like the CString shipped
474 with VC++6.0 has fixed this, although it may in fact have been
475 one of the VC++ SPs that did it.</emphasis>
477 <listitem><para><code>string</code> operations like this have O(n) complexity
478 <emphasis>if the implementors do it correctly</emphasis>. The libstdc++
479 implementors did it correctly. Other vendors might not.
481 <listitem><para>While parts of the SGI STL are used in libstdc++, their
482 string class is not. The SGI <code>string</code> is essentially
483 <code>vector<char></code> and does not do any reference
484 counting like libstdc++'s does. (It is O(n), though.)
485 So if you're thinking about SGI's string or rope classes,
486 you're now looking at four possibilities: CString, the
487 libstdc++ string, the SGI string, and the SGI rope, and this
488 is all before any allocator or traits customizations! (More
489 choices than you can shake a stick at -- want fries with that?)
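<para>For comparison, here is the <code>n_copies_of</code> function from the
quoted message written against the Standard <code>string</code>. It is a
sketch only; the call to <code>reserve</code> is an addition made here and is
optional, since it merely requests the final capacity up front.</para>

```cpp
#include <string>

// The quoted n_copies_of example, using std::string instead of CString.
std::string n_copies_of(const std::string& foo, unsigned n)
{
    std::string tmp;
    tmp.reserve(foo.size() * n);    // optional: ask for the full size once
    for (unsigned i = 0; i < n; i++)
        tmp += foo;
    return tmp;
}
```

<para>Without the <code>reserve</code> call, the cost depends on how the
implementation grows the buffer on each <code>+=</code>.</para>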
496 <!-- Chapter 03 : Interacting with C -->