   1 <?xml version="1.0" encoding="ISO-8859-1"?>
   2 <!DOCTYPE html
   3           PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
   4           "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
   5
   6 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
   7 <head>
   8    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
   9    <meta name="AUTHOR" content="pme@gcc.gnu.org (Phil Edwards)" />
  10    <meta name="KEYWORDS" content="HOWTO, libstdc++, GCC, g++, libg++, STL" />
  11    <meta name="DESCRIPTION" content="HOWTO for the libstdc++ chapter 21." />
  12    <meta name="GENERATOR" content="vi and eight fingers" />
  13    <title>libstdc++-v3 HOWTO:  Chapter 21: Strings</title>
  14 <link rel="StyleSheet" href="../lib3styles.css" type="text/css" />
  15 <link rel="Start" href="../documentation.html" type="text/html"
  16   title="GNU C++ Standard Library" />
  17 <link rel="Prev" href="../20_util/howto.html" type="text/html"
  18   title="General Utilities" />
  19 <link rel="Next" href="../22_locale/howto.html" type="text/html"
  20   title="Localization" />
  21 <link rel="Copyright" href="../17_intro/license.html" type="text/html" />
  22 <link rel="Help" href="../faq/index.html" type="text/html" title="F.A.Q." />
  23 </head>
  24 <body>
  25
  26 <h1 class="centered"><a name="top">Chapter 21:  Strings</a></h1>
  27
  28 <p>Chapter 21 deals with the C++ strings library (a welcome relief).
  29 </p>
  30
  31
  32 <!-- ####################################################### -->
  33 <hr />
  34 <h1>Contents</h1>
  35 <ul>
  36    <li><a href="#1">MFC's CString</a></li>
  37    <li><a href="#2">A case-insensitive string class</a></li>
  38    <li><a href="#3">Breaking a C++ string into tokens</a></li>
  39    <li><a href="#4">Simple transformations</a></li>
  40    <li><a href="#5">Making strings of arbitrary character types</a></li>
  41    <li><a href="#6">Shrink-to-fit strings</a></li>
  42 </ul>
  43
  44 <hr />
  45
  46 <!-- ####################################################### -->
  47
  48 <h2><a name="1">MFC's CString</a></h2>
  49    <p>A common lament seen in various newsgroups deals with the Standard
  50       string class as opposed to the Microsoft Foundation Class called
  51       CString.  Often programmers realize that a standard portable
  52       answer is better than a proprietary nonportable one, but in porting
  53       their application from a Win32 platform, they discover that they
  54       are relying on special functions offered by the CString class.
  55    </p>
  56    <p>Things are not as bad as they seem.  In
  57       <a href="http://gcc.gnu.org/ml/gcc/1999-04n/msg00236.html">this
  58       message</a>, Joe Buck points out a few very important things:
  59    </p>
  60       <ul>
  61          <li>The Standard <code>string</code> supports all the operations
  62              that CString does, with three exceptions.
  63          </li>
  64          <li>Two of those exceptions (whitespace trimming and case
  65              conversion) are trivial to implement.  In fact, we do so
  66              on this page.
  67          </li>
  68          <li>The third is <code>CString::Format</code>, which allows formatting
  69              in the style of <code>sprintf</code>.  This deserves some mention:
  70          </li>
  71       </ul>
  72    <p><a name="1.1internal"> <!-- Coming from Chapter 27 -->
  73       The old libg++ library had a function called form(), which did much
  74       the same thing.  But for a Standard solution, you should use the
  75       stringstream classes.  These are the bridge between the iostream
  76       hierarchy and the string class, and they operate with regular
  77       streams seamlessly because they inherit from the iostream
      hierarchy.  A quick example:
  79       </a>
  80    </p>
  81    <pre>
   #include &lt;iostream&gt;
   #include &lt;string&gt;
   #include &lt;sstream&gt;

   std::string f (const std::string&amp; incoming)     // incoming is "foo  N"
   {
       std::istringstream   incoming_stream(incoming);
       std::string          the_word;
       int                  the_number = 0;   // stays 0 if extraction fails

       incoming_stream &gt;&gt; the_word        // extract "foo"
                       &gt;&gt; the_number;     // extract N

       std::ostringstream   output_stream;
       output_stream &lt;&lt; "The word was " &lt;&lt; the_word
                     &lt;&lt; " and 3*N was " &lt;&lt; (3*the_number);

       return output_stream.str();
   } </pre>
 101    <p>A serious problem with CString is a design bug in its memory
 102       allocation.  Specifically, quoting from that same message:
 103    </p>
 104    <pre>
 105    CString suffers from a common programming error that results in
 106    poor performance.  Consider the following code:
 107
 108    CString n_copies_of (const CString&amp; foo, unsigned n)
 109    {
 110            CString tmp;
 111            for (unsigned i = 0; i &lt; n; i++)
 112                    tmp += foo;
 113            return tmp;
 114    }
 115
 116    This function is O(n^2), not O(n).  The reason is that each +=
 117    causes a reallocation and copy of the existing string.  Microsoft
 118    applications are full of this kind of thing (quadratic performance
 119    on tasks that can be done in linear time) -- on the other hand,
 120    we should be thankful, as it's created such a big market for high-end
 121    ix86 hardware. :-)
 122
 123    If you replace CString with string in the above function, the
 124    performance is O(n).
 125    </pre>
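   <p>For comparison, here is a sketch of the same function written with
      <code>std::string</code>.  The name is taken from the quoted message;
      the <code>reserve()</code> call is our own optional addition and is
      not part of that message.  Even without it, appending is amortized
      constant time per character in libstdc++, so the loop as a whole is
      linear in the size of the result.
   </p>

```cpp
#include <string>

// std::string counterpart of the quoted n_copies_of().
std::string n_copies_of(const std::string& foo, unsigned n)
{
    std::string tmp;
    tmp.reserve(foo.size() * n);   // optional: a single allocation up front
    for (unsigned i = 0; i < n; ++i)
        tmp += foo;                // appends in place, no full copy per step
    return tmp;
}
```

   <p>Calling <code>n_copies_of("ab", 3)</code> yields
      <code>"ababab"</code>.
   </p>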
 126    <p>Joe Buck also pointed out some other things to keep in mind when
 127       comparing CString and the Standard string class:
 128    </p>
 129       <ul>
 130          <li>CString permits access to its internal representation; coders
 131              who exploited that may have problems moving to <code>string</code>.
 132          </li>
 133          <li>Microsoft ships the source to CString (in the files
 134              MFC\SRC\Str{core,ex}.cpp), so you could fix the allocation
 135              bug and rebuild your MFC libraries.
             <em><strong>Note:</strong> It looks like the CString shipped
             with VC++6.0 has fixed this, although it may in fact have been
             one of the VC++ service packs that did it.</em>
 139          </li>
 140          <li><code>string</code> operations like this have O(n) complexity
 141              <em>if the implementors do it correctly</em>.  The libstdc++
 142              implementors did it correctly.  Other vendors might not.
 143          </li>
 144          <li>While parts of the SGI STL are used in libstdc++-v3, their
 145              string class is not.  The SGI <code>string</code> is essentially
 146              <code>vector&lt;char&gt;</code> and does not do any reference
 147              counting like libstdc++-v3's does.  (It is O(n), though.)
 148              So if you're thinking about SGI's string or rope classes,
 149              you're now looking at four possibilities:  CString, the
 150              libstdc++ string, the SGI string, and the SGI rope, and this
 151              is all before any allocator or traits customizations!  (More
 152              choices than you can shake a stick at -- want fries with that?)
 153          </li>
 154       </ul>
 155    <p>Return <a href="#top">to top of page</a> or
 156       <a href="../faq/index.html">to the FAQ</a>.
 157    </p>
 158
 159 <hr />
 160 <h2><a name="2">A case-insensitive string class</a></h2>
 161    <p>The well-known-and-if-it-isn't-well-known-it-ought-to-be
 162       <a href="http://www.gotw.ca/gotw/index.htm">Guru of the Week</a>
 163       discussions held on Usenet covered this topic in January of 1998.
 164       Briefly, the challenge was, &quot;write a 'ci_string' class which
 165       is identical to the standard 'string' class, but is
 166       case-insensitive in the same way as the (common but nonstandard)
 167       C function stricmp():&quot;
 168    </p>
 169    <pre>
 170    ci_string s( "AbCdE" );
 171
 172    // case insensitive
 173    assert( s == "abcde" );
 174    assert( s == "ABCDE" );
 175
 176    // still case-preserving, of course
 177    assert( strcmp( s.c_str(), "AbCdE" ) == 0 );
 178    assert( strcmp( s.c_str(), "abcde" ) != 0 ); </pre>
 179
   <p>The solution is surprisingly easy.  The original answer page
      on the GotW website was moved into cold storage in
      preparation for
      <a href="http://cseng.aw.com/bookpage.taf?ISBN=0-201-61562-2">a
      published book of GotW notes</a>.  Before it went up on the web,
      however, the answer was posted on Usenet, and that
      posting is <a href="gotw29a.txt">available
      here</a>.
 188    </p>
 189    <p>See?  Told you it was easy!</p>
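   <p>For readers who cannot reach the posting, the general idea is to
      give <code>basic_string</code> a traits class whose comparisons fold
      case.  What follows is a minimal sketch written for this page, not a
      copy of the posted answer; the helper name <code>fold</code> is our
      own, and the code only handles what the C locale's
      <code>toupper</code> handles.
   </p>

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Traits that compare characters without regard to case.
// Everything not overridden here is inherited from char_traits<char>.
struct ci_char_traits : public std::char_traits<char>
{
    // Hypothetical helper: map a character to its upper-case form.
    static char fold(char c)
    { return static_cast<char>(std::toupper(static_cast<unsigned char>(c))); }

    static bool eq(char a, char b) { return fold(a) == fold(b); }
    static bool ne(char a, char b) { return fold(a) != fold(b); }
    static bool lt(char a, char b) { return fold(a) <  fold(b); }

    static int compare(const char* a, const char* b, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return  1;
        }
        return 0;
    }

    static const char* find(const char* s, std::size_t n, char c)
    {
        for (std::size_t i = 0; i < n; ++i)
            if (eq(s[i], c)) return s + i;
        return 0;   // not found
    }
};

typedef std::basic_string<char, ci_char_traits> ci_string;
```

   <p>With this typedef, the assertions in the challenge hold.  Note that
      <code>ci_string</code> is a different type from
      <code>std::string</code>, so the two do not mix freely (for example,
      the standard stream inserters do not accept it).
   </p>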
 190    <p><strong>Added June 2000:</strong>  The May issue of <u>C++ Report</u>
 191       contains
 192       a fascinating article by Matt Austern (yes, <em>the</em> Matt Austern)
 193       on why case-insensitive comparisons are not as easy as they seem,
 194       and why creating a class is the <em>wrong</em> way to go about it in
      production code.  (The GotW answer mentions one of the principal
 196       difficulties; his article mentions more.)
 197    </p>
 198    <p>Basically, this is &quot;easy&quot; only if you ignore some things,
 199       things which may be too important to your program to ignore.  (I chose
 200       to ignore them when originally writing this entry, and am surprised
 201       that nobody ever called me on it...)  The GotW question and answer
 202       remain useful instructional tools, however.
 203    </p>
 204    <p><strong>Added September 2000:</strong>  James Kanze provided a link to a
 205       <a href="http://www.unicode.org/unicode/reports/tr21/">Unicode
 206       Technical Report discussing case handling</a>, which provides some
 207       very good information.
 208    </p>
 209    <p>Return <a href="#top">to top of page</a> or
 210       <a href="../faq/index.html">to the FAQ</a>.
 211    </p>
 212
 213 <hr />
 214 <h2><a name="3">Breaking a C++ string into tokens</a></h2>
 215    <p>The Standard C (and C++) function <code>strtok()</code> leaves a lot to
 216       be desired in terms of user-friendliness.  It's unintuitive, it
 217       destroys the character string on which it operates, and it requires
 218       you to handle all the memory problems.  But it does let the client
 219       code decide what to use to break the string into pieces; it allows
 220       you to choose the &quot;whitespace,&quot; so to speak.
 221    </p>
 222    <p>A C++ implementation lets us keep the good things and fix those
 223       annoyances.  The implementation here is more intuitive (you only
      call it once, not in a loop with varying arguments), it does not
 225       affect the original string at all, and all the memory allocation
 226       is handled for you.
 227    </p>
 228    <p>It's called stringtok, and it's a template function.  It's given
 229       <a href="stringtok_h.txt">in this file</a> in a less-portable form than
 230       it could be, to keep this example simple (for example, see the
 231       comments on what kind of string it will accept).  The author uses
 232       a more general (but less readable) form of it for parsing command
 233       strings and the like.  If you compiled and ran this code using it:
 234    </p>
 235    <pre>
   std::string  s (" this  \t is\t\n  a test  ");
   std::list&lt;std::string&gt;  ls;
   stringtok (ls, s);
   for (std::list&lt;std::string&gt;::const_iterator i = ls.begin();
        i != ls.end(); ++i)
   {
       std::cerr &lt;&lt; ':' &lt;&lt; (*i) &lt;&lt; ":\n";
   } </pre>
 243    <p>You would see this as output:
 244    </p>
 245    <pre>
 246    :this:
 247    :is:
 248    :a:
 249    :test: </pre>
 250    <p>with all the whitespace removed.  The original <code>s</code> is still
 251       available for use, <code>ls</code> will clean up after itself, and
 252       <code>ls.size()</code> will return how many tokens there were.
 253    </p>
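   <p>A function with the behaviour described above can be sketched in a
      few lines using the <code>find_first_of</code> family.  This is an
      illustration under our own assumptions (the name
      <code>stringtok_sketch</code> and the default delimiter set are
      ours), and it is not guaranteed to match the linked file line for
      line.
   </p>

```cpp
#include <list>     // only needed by callers that collect into std::list
#include <string>

// Append each delimiter-separated token of 'in' to 'container'.
// Works with any container of std::string that offers push_back().
template <typename Container>
void stringtok_sketch(Container& container, const std::string& in,
                      const char* delimiters = " \t\n")
{
    std::string::size_type i = 0;
    while (i < in.length()) {
        // skip leading delimiters
        i = in.find_first_not_of(delimiters, i);
        if (i == std::string::npos)
            return;                               // nothing but delimiters left

        // find the end of the token
        std::string::size_type j = in.find_first_of(delimiters, i);
        if (j == std::string::npos) {
            container.push_back(in.substr(i));    // token runs to the end
            return;
        }
        container.push_back(in.substr(i, j - i));
        i = j + 1;                                // continue after the delimiter
    }
}
```

   <p>The input string is taken by const reference and never modified,
      which is the main difference from <code>strtok()</code>.
   </p>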
 254    <p>As always, there is a price paid here, in that stringtok is not
      as fast as strtok.  The other benefits usually outweigh that, however.
 256       <a href="stringtok_std_h.txt">Another version of stringtok is given
 257       here</a>, suggested by Chris King and tweaked by Petr Prikryl,
 258       and this one uses the
 259       transformation functions mentioned below.  If you are comfortable
 260       with reading the new function names, this version is recommended
 261       as an example.
 262    </p>
 263    <p><strong>Added February 2001:</strong>  Mark Wilden pointed out that the
 264       standard <code>std::getline()</code> function can be used with standard
 265       <a href="../27_io/howto.html">istringstreams</a> to perform
 266       tokenizing as well.  Build an istringstream from the input text,
 267       and then use std::getline with varying delimiters (the three-argument
 268       signature) to extract tokens into a string.
 269    </p>
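   <p>The <code>getline</code> approach might look like the sketch below.
      The function name <code>split_on</code> is ours, chosen for
      illustration.  Each call uses one delimiter; a caller wanting
      varying delimiters would keep the stream and pass a different
      character to each <code>std::getline</code> call.
   </p>

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split 'text' at every occurrence of 'delim' using the
// three-argument form of std::getline.
std::vector<std::string> split_on(const std::string& text, char delim)
{
    std::vector<std::string> tokens;
    std::istringstream       in(text);
    std::string              token;

    while (std::getline(in, token, delim))
        tokens.push_back(token);
    return tokens;
}
```

   <p>Unlike stringtok, this does not collapse runs of delimiters:
      two adjacent delimiters produce an empty token between them.
   </p>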
 270    <p>Return <a href="#top">to top of page</a> or
 271       <a href="../faq/index.html">to the FAQ</a>.
 272    </p>
 273
 274 <hr />
 275 <h2><a name="4">Simple transformations</a></h2>
 276    <p>Here are Standard, simple, and portable ways to perform common
 277       transformations on a <code>string</code> instance, such as &quot;convert
 278       to all upper case.&quot;  The word transformations is especially
 279       apt, because the standard template function
 280       <code>transform&lt;&gt;</code> is used.
 281    </p>
 282    <p>This code will go through some iterations (no pun).  Here's the
 283       simplistic version usually seen on Usenet:
 284    </p>
 285    <pre>
 286    #include &lt;string&gt;
 287    #include &lt;algorithm&gt;
 288    #include &lt;cctype&gt;      // old &lt;ctype.h&gt;
 289
 290    struct ToLower
 291    {
 292      char operator() (char c) const  { return std::tolower(c); }
 293    };
 294
 295    struct ToUpper
 296    {
 297      char operator() (char c) const  { return std::toupper(c); }
 298    };
 299
 300    int main()
 301    {
 302      std::string  s ("Some Kind Of Initial Input Goes Here");
 303
 304      // Change everything into upper case
 305      std::transform (s.begin(), s.end(), s.begin(), ToUpper());
 306
 307      // Change everything into lower case
 308      std::transform (s.begin(), s.end(), s.begin(), ToLower());
 309
 310      // Change everything back into upper case, but store the
 311      // result in a different string
 312      std::string  capital_s;
 313      capital_s.resize(s.size());
 314      std::transform (s.begin(), s.end(), capital_s.begin(), ToUpper());
 315    } </pre>
 316    <p><span class="larger"><strong>Note</strong></span> that these calls all
 317       involve the global C locale through the use of the C functions
 318       <code>toupper/tolower</code>.  This is absolutely guaranteed to work --
 319       but <em>only</em> if the string contains <em>only</em> characters
 320       from the basic source character set, and there are <em>only</em>
 321       96 of those.  Which means that not even all English text can be
 322       represented (certain British spellings, proper names, and so forth).
 323       So, if all your input forevermore consists of only those 96
 324       characters (hahahahahaha), then you're done.
 325    </p>
 326    <p><span class="larger"><strong>Note</strong></span> that the
 327       <code>ToUpper</code> and <code>ToLower</code> function objects
 328       are needed because <code>toupper</code> and <code>tolower</code>
 329       are overloaded names (declared in <code>&lt;cctype&gt;</code> and
 330       <code>&lt;locale&gt;</code>) so the template-arguments for
 331       <code>transform&lt;&gt;</code> cannot be deduced, as explained in
 332       <a href="http://gcc.gnu.org/ml/libstdc++/2002-11/msg00180.html">this
 333       message</a>.  <!-- section 14.8.2.4 clause 16 in ISO 14882:1998
 334       if you're into that sort of thing -->
 335       At minimum, you can write short wrappers like
 336    </p>
 337    <pre>
 338    char toLower (char c)
 339    {
 340       return std::tolower(c);
 341    } </pre>
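   <p>A plain function like that has a single, unambiguous type, so it can
      be handed straight to <code>transform</code>.  The sketch below
      shows this; the cast through <code>unsigned char</code> and the
      helper name <code>lowered</code> are our additions, the cast being
      there because the C functions expect a value representable as
      <code>unsigned char</code>.
   </p>

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Non-overloaded wrapper, so template argument deduction succeeds.
char toLower(char c)
{
    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// Hypothetical helper: return a lower-cased copy of s.
std::string lowered(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(), toLower);
    return s;
}
```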
 342    <p>The correct method is to use a facet for a particular locale
 343       and call its conversion functions.  These are discussed more in
 344       Chapter 22; the specific part is
 345       <a href="../22_locale/howto.html#7">Correct Transformations</a>,
 346       which shows the final version of this code.  (Thanks to James Kanze
 347       for assistance and suggestions on all of this.)
 348    </p>
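   <p>As a rough preview of the facet-based approach, and without claiming
      it is identical to the version shown in Chapter 22, the
      <code>ctype</code> facet of a locale can convert a whole range in one
      call.  The function name <code>to_upper_in</code> is hypothetical.
   </p>

```cpp
#include <locale>
#include <string>

// Return an upper-cased copy of 'in' according to locale 'loc'.
std::string to_upper_in(const std::string& in, const std::locale& loc)
{
    std::string out(in);
    if (!out.empty()) {
        const std::ctype<char>& ct = std::use_facet<std::ctype<char> >(loc);
        ct.toupper(&out[0], &out[0] + out.size());   // converts in place
    }
    return out;
}
```

   <p>Passing <code>std::locale::classic()</code> gives the behaviour of
      the "C" locale; passing a named locale uses that locale's rules,
      provided the platform supports it.
   </p>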
 349    <p>Another common operation is trimming off excess whitespace.  Much
 350       like transformations, this task is trivial with the use of string's
 351       <code>find</code> family.  These examples are broken into multiple
 352       statements for readability:
 353    </p>
 354    <pre>
 355    std::string  str (" \t blah blah blah    \n ");
 356
 357    // trim leading whitespace
   std::string::size_type  notwhite = str.find_first_not_of(" \t\n");
 359    str.erase(0,notwhite);
 360
 361    // trim trailing whitespace
 362    notwhite = str.find_last_not_of(" \t\n");
 363    str.erase(notwhite+1); </pre>
 364    <p>Obviously, the calls to <code>find</code> could be inserted directly
 365       into the calls to <code>erase</code>, in case your compiler does not
 366       optimize named temporaries out of existence.
 367    </p>
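   <p>Folded together, the two steps fit in a small helper.  This is a
      sketch; the name <code>trimmed</code> and the default whitespace set
      are our choices.  It also behaves sensibly on a string that is empty
      or entirely whitespace, because <code>npos + 1</code> wraps around
      to zero.
   </p>

```cpp
#include <string>

// Return a copy of str with leading and trailing whitespace removed.
std::string trimmed(std::string str, const char* white = " \t\n")
{
    // leading: erase everything before the first non-white character
    str.erase(0, str.find_first_not_of(white));
    // trailing: erase everything after the last non-white character
    str.erase(str.find_last_not_of(white) + 1);
    return str;
}
```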
 368    <p>Return <a href="#top">to top of page</a> or
 369       <a href="../faq/index.html">to the FAQ</a>.
 370    </p>
 371
 372 <hr />
 373 <h2><a name="5">Making strings of arbitrary character types</a></h2>
 374    <p>The <code>std::basic_string</code> is tantalizingly general, in that
 375       it is parameterized on the type of the characters which it holds.
 376       In theory, you could whip up a Unicode character class and instantiate
 377       <code>std::basic_string&lt;my_unicode_char&gt;</code>, or assuming
 378       that integers are wider than characters on your platform, maybe just
 379       declare variables of type <code>std::basic_string&lt;int&gt;</code>.
 380    </p>
 381    <p>That's the theory.  Remember however that basic_string has additional
 382       type parameters, which take default arguments based on the character
 383       type (called CharT here):
 384    </p>
 385    <pre>
 386       template &lt;typename CharT,
 387                 typename Traits = char_traits&lt;CharT&gt;,
 388                 typename Alloc = allocator&lt;CharT&gt; &gt;
 389       class basic_string { .... };</pre>
 390    <p>Now, <code>allocator&lt;CharT&gt;</code> will probably Do The Right
 391       Thing by default, unless you need to implement your own allocator
 392       for your characters.
 393    </p>
 394    <p>But <code>char_traits</code> takes more work.  The char_traits
 395       template is <em>declared</em> but not <em>defined</em>.
 396       That means there is only
 397    </p>
 398    <pre>
 399       template &lt;typename CharT&gt;
 400         struct char_traits
 401         {
 402             static void foo (type1 x, type2 y);
 403             ...
 404         };</pre>
 405    <p>and functions such as char_traits&lt;CharT&gt;::foo() are not
 406       actually defined anywhere for the general case.  The C++ standard
 407       permits this, because writing such a definition to fit all possible
 408       CharT's cannot be done.  (For a time, in earlier versions of GCC,
 409       there was a mostly-correct implementation that let programmers be
 410       lazy.  :-)  But it broke under many situations, so it was removed.
 411       You are no longer allowed to be lazy and non-portable.)
 412    </p>
 413    <p>The C++ standard also requires that char_traits be specialized for
 414       instantiations of <code>char</code> and <code>wchar_t</code>, and it
 415       is these template specializations that permit entities like
      <code>basic_string&lt;char,char_traits&lt;char&gt; &gt;</code> to work.
 417    </p>
 418    <p>If you want to use character types other than char and wchar_t,
 419       such as <code>unsigned char</code> and <code>int</code>, you will
 420       need to write specializations for them at the present time.  If you
 421       want to use your own special character class, then you have
 422       <a href="http://gcc.gnu.org/ml/libstdc++/2002-08/msg00163.html">a lot
      of work to do</a>, especially if you wish to use i18n features
 424       (facets require traits information but don't have a traits argument).
 425    </p>
 426    <p>One example of how to specialize char_traits is given <a
 427       href="http://gcc.gnu.org/ml/libstdc++/2002-08/msg00260.html">in
 428       this message</a>, which was then put into the file <code>
 429       include/ext/pod_char_traits.h</code> at a later date.  We agree
 430       that the way it's used with basic_string (scroll down to main())
 431       doesn't look nice, but that's because <a
 432       href="http://gcc.gnu.org/ml/libstdc++/2002-08/msg00236.html">the
 433       nice-looking first attempt</a> turned out to <a
 434       href="http://gcc.gnu.org/ml/libstdc++/2002-08/msg00242.html">not
 435       be conforming C++</a>, due to the rule that CharT must be a POD.
 436       (See how tricky this is?)
 437    </p>
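   <p>To give a feel for the amount of work involved, here is a sketch of
      a traits class for a 16-bit POD character type.  Everything in it
      (the names <code>uchar16</code>, <code>uchar16_traits</code> and
      <code>ustring16</code>, and the choice of <code>int_type</code>) is
      an assumption made for illustration; it is not the contents of
      <code>pod_char_traits.h</code>.  Rather than specializing
      <code>std::char_traits</code>, it is passed as the second template
      argument of <code>basic_string</code>.
   </p>

```cpp
#include <cstddef>
#include <cstring>   // std::memcpy, std::memmove
#include <cwchar>    // std::mbstate_t
#include <ios>       // std::streamoff, std::streampos
#include <string>

typedef unsigned short uchar16;   // hypothetical 16-bit character type

struct uchar16_traits
{
    typedef uchar16         char_type;
    typedef unsigned long   int_type;    // holds every char_type value plus eof()
    typedef std::streamoff  off_type;
    typedef std::streampos  pos_type;
    typedef std::mbstate_t  state_type;

    static void assign(char_type& dst, const char_type& src) { dst = src; }
    static bool eq(char_type a, char_type b) { return a == b; }
    static bool lt(char_type a, char_type b) { return a <  b; }

    static int compare(const char_type* a, const char_type* b, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return  1;
        }
        return 0;
    }
    static std::size_t length(const char_type* s)
    {
        std::size_t n = 0;
        while (!eq(s[n], char_type())) ++n;   // count up to the terminator
        return n;
    }
    static const char_type* find(const char_type* s, std::size_t n,
                                 const char_type& c)
    {
        for (std::size_t i = 0; i < n; ++i)
            if (eq(s[i], c)) return s + i;
        return 0;
    }
    static char_type* move(char_type* dst, const char_type* src, std::size_t n)
    {
        return static_cast<char_type*>(
            std::memmove(dst, src, n * sizeof(char_type)));
    }
    static char_type* copy(char_type* dst, const char_type* src, std::size_t n)
    {
        return static_cast<char_type*>(
            std::memcpy(dst, src, n * sizeof(char_type)));
    }
    static char_type* assign(char_type* s, std::size_t n, char_type c)
    {
        for (std::size_t i = 0; i < n; ++i) s[i] = c;
        return s;
    }

    static int_type  eof() { return static_cast<int_type>(-1); }
    static int_type  not_eof(int_type c) { return eq_int_type(c, eof()) ? 0 : c; }
    static char_type to_char_type(int_type c) { return static_cast<char_type>(c); }
    static int_type  to_int_type(char_type c) { return c; }
    static bool      eq_int_type(int_type a, int_type b) { return a == b; }
};

typedef std::basic_string<uchar16, uchar16_traits> ustring16;
```

   <p>This is enough for <code>basic_string</code> itself.  It does nothing
      for streams or locale facets, which is where the real work mentioned
      above begins.
   </p>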
 438    <p>Other approaches were suggested in that same thread, such as providing
 439       more specializations and/or some helper types in the library to assist
 440       users writing such code.  So far nobody has had the time...
 441       <a href="../17_intro/contribute.html">do you?</a>
 442    </p>
 443    <p>Return <a href="#top">to top of page</a> or
 444       <a href="../faq/index.html">to the FAQ</a>.
 445    </p>
 446
 447 <hr />
 448 <h2><a name="6">Shrink-to-fit strings</a></h2>
 449    <!-- referenced by faq/index.html#5_9, update link if numbering changes -->
   <p>As of GCC 3.4, calling <code>s.reserve(res)</code> on a
 451       <code>string s</code> with <code>res &lt; s.capacity()</code> will
 452       reduce the string's capacity to <code>std::max(s.size(), res)</code>.
 453    </p>
   <p>This behaviour is suggested, but not required, by the standard.  Prior
      to GCC 3.4, the following alternative can be used instead:
 456    </p>
 457    <pre>
 458       std::string(str.data(), str.size()).swap(str);
 459    </pre>
 460    <p>This is similar to the idiom for reducing a <code>vector</code>'s
 461       memory usage (see <a href='../faq/index.html#5_9'>FAQ 5.9</a>) but
 462       the regular copy constructor cannot be used because libstdc++'s
 463       <code>string</code> is Copy-On-Write.
 464    </p>
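   <p>Wrapped in a function, the idiom could be written as in the sketch
      below.  The function name is hypothetical.  The temporary is built
      from the character data rather than by copy construction, so it
      gets its own freshly sized buffer; the swap then hands that buffer
      to the original string.
   </p>

```cpp
#include <string>

// Replace str's buffer with one sized for its current contents.
void shrink_to_fit_sketch(std::string& str)
{
    // Build a temporary from the raw characters (not a shared copy),
    // then swap buffers; the oversized buffer dies with the temporary.
    std::string(str.data(), str.size()).swap(str);
}
```

   <p>The exact capacity afterwards is up to the implementation; the
      standard only guarantees that it is at least <code>size()</code>.
   </p>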
 465
 466
 467 <!-- ####################################################### -->
 468
 469 <hr />
 470 <p class="fineprint"><em>
 471 See <a href="../17_intro/license.html">license.html</a> for copying conditions.
 472 Comments and suggestions are welcome, and may be sent to
 473 <a href="mailto:libstdc++@gcc.gnu.org">the libstdc++ mailing list</a>.
 474 </em></p>
 475
 476
 477 </body>
 478 </html>