l4/pkg/libstdc++-v3/contrib/libstdc++-v3-4.3.3/doc/xml/manual/mt_allocator.xml

   1 <sect1 id="manual.ext.allocator.mt" xreflabel="mt allocator">
   2 <?dbhtml filename="mt_allocator.html"?>
   3
   4 <sect1info>
   5   <keywordset>
   6     <keyword>
   7       ISO C++
   8     </keyword>
   9     <keyword>
  10       allocator
  11     </keyword>
  12   </keywordset>
  13 </sect1info>
  14
  15 <title>mt_allocator</title>
  16
  17 <para>
  18 </para>
  19
  20 <sect2 id="allocator.mt.intro" xreflabel="allocator.mt.intro">
  21 <title>Intro</title>
  22
  23 <para>
  24   The mt allocator [hereinafter referred to simply as "the allocator"]
  25   is a fixed size (power of two) allocator that was initially
  26   developed specifically to suit the needs of multi threaded
  27   applications [hereinafter referred to as an MT application]. Over
  28   time the allocator has evolved and been improved in many ways, in
  29   particular it now also does a good job in single threaded
  30   applications [hereinafter referred to as a ST application]. (Note:
  31   In this document, when referring to single threaded applications
  32   this also includes applications that are compiled with gcc without
  33   thread support enabled. This is accomplished using ifdef's on
  34   __GTHREADS). This allocator is tunable, very flexible, and capable
  35   of high-performance.
  36 </para>
  37
  38 <para>
  39   The aim of this document is to describe - from an application point of
  40   view - the "inner workings" of the allocator.
  41 </para>
  42
  43 </sect2>
  44
  45
  46 <sect2 id="allocator.mt.design_issues" xreflabel="allocator.mt.design_issues">
  47 <title>Design Issues</title>
  48
  49 <sect3 id="allocator.mt.overview" xreflabel="allocator.mt.overview">
  50 <title>Overview</title>
  51
  52
  53 <para> There are three general components to the allocator: a datum
  54 describing the characteristics of the memory pool, a policy class
  55 containing this pool that links instantiation types to common or
  56 individual pools, and a class inheriting from the policy class that is
  57 the actual allocator.
  58 </para>
  59
  60 <para>The datum describing pools characteristics is
  61 </para>
  62 <programlisting>
  63   template&lt;bool _Thread&gt;
  64     class __pool
  65 </programlisting>
  66 <para> This class is parametrized on thread support, and is explicitly
  67 specialized for both multiple threads (with <code>bool==true</code>)
  68 and single threads (via <code>bool==false</code>.) It is possible to
  69 use a custom pool datum instead of the default class that is provided.
  70 </para>
  71
  72 <para> There are two distinct policy classes, each of which can be used
  73 with either type of underlying pool datum.
  74 </para>
  75
  76 <programlisting>
  77   template&lt;bool _Thread&gt;
  78     struct __common_pool_policy
  79
  80   template&lt;typename _Tp, bool _Thread&gt;
  81     struct __per_type_pool_policy
  82 </programlisting>
  83
  84 <para> The first policy, <code>__common_pool_policy</code>, implements a
  85 common pool. This means that allocators that are instantiated with
  86 different types, say <code>char</code> and <code>long</code> will both
  87 use the same pool. This is the default policy.
  88 </para>
  89
  90 <para> The second policy, <code>__per_type_pool_policy</code>, implements
  91 a separate pool for each instantiating type. Thus, <code>char</code>
  92 and <code>long</code> will use separate pools. This allows per-type
  93 tuning, for instance.
  94 </para>
  95
  96 <para> Putting this all together, the actual allocator class is
  97 </para>
  98 <programlisting>
  99   template&lt;typename _Tp, typename _Poolp = __default_policy&gt;
 100     class __mt_alloc : public __mt_alloc_base&lt;_Tp&gt;,  _Poolp
 101 </programlisting>
 102 <para> This class has the interface required for standard library allocator
 103 classes, namely member functions <code>allocate</code> and
 104 <code>deallocate</code>, plus others.
 105 </para>
 106
 107 </sect3>
 108 </sect2>
 109
 110 <sect2 id="allocator.mt.impl" xreflabel="allocator.mt.impl">
 111 <title>Implementation</title>
 112
 113
 114 <sect3 id="allocator.mt.tune" xreflabel="allocator.mt.tune">
 115 <title>Tunable Parameters</title>
 116
 117 <para>Certain allocation parameters can be modified, or tuned. There
 118 exists a nested <code>struct __pool_base::_Tune</code> that contains all
 119 these parameters, which include settings for
 120 </para>
 121    <itemizedlist>
 122      <listitem><para>Alignment</para></listitem>
 123      <listitem><para>Maximum bytes before calling <code>::operator new</code> directly</para></listitem>
 124      <listitem><para>Minimum bytes</para></listitem>
 125      <listitem><para>Size of underlying global allocations</para></listitem>
 126      <listitem><para>Maximum number of supported threads</para></listitem>
 127      <listitem><para>Migration of deallocations to the global free list</para></listitem>
 128      <listitem><para>Shunt for global <code>new</code> and <code>delete</code></para></listitem>
 129    </itemizedlist>
 130 <para>Adjusting parameters for a given instance of an allocator can only
 131 happen before any allocations take place, when the allocator itself is
 132 initialized. For instance:
 133 </para>
 134 <programlisting>
 135 #include &lt;ext/mt_allocator.h&gt;
 136
 137 struct pod
 138 {
 139   int i;
 140   int j;
 141 };
 142
 143 int main()
 144 {
 145   typedef pod value_type;
 146   typedef __gnu_cxx::__mt_alloc&lt;value_type&gt; allocator_type;
 147   typedef __gnu_cxx::__pool_base::_Tune tune_type;
 148
 149   tune_type t_default;
 150   tune_type t_opt(16, 5120, 32, 5120, 20, 10, false);
 151   tune_type t_single(16, 5120, 32, 5120, 1, 10, false);
 152
 153   tune_type t;
 154   t = allocator_type::_M_get_options();
 155   allocator_type::_M_set_options(t_opt);
 156   t = allocator_type::_M_get_options();
 157
 158   allocator_type a;
 159   allocator_type::pointer p1 = a.allocate(128);
 160   allocator_type::pointer p2 = a.allocate(5128);
 161
 162   a.deallocate(p1, 128);
 163   a.deallocate(p2, 5128);
 164
 165   return 0;
 166 }
 167 </programlisting>
 168
 169 </sect3>
 170
 171 <sect3 id="allocator.mt.init" xreflabel="allocator.mt.init">
 172 <title>Initialization</title>
 173
 174 <para>
 175 The static variables (pointers to freelists, tuning parameters etc)
 176 are initialized as above, or are set to the global defaults.
 177 </para>
 178
 179 <para>
 180 The very first allocate() call will always call the
 181 _S_initialize_once() function.  In order to make sure that this
 182 function is called exactly once we make use of a __gthread_once call
 183 in MT applications and check a static bool (_S_init) in ST
 184 applications.
 185 </para>
 186
 187 <para>
 188 The _S_initialize() function:
 189 - If the GLIBCXX_FORCE_NEW environment variable is set, it sets the bool
 190   _S_force_new to true and then returns. This will cause subsequent calls to
 191   allocate() to return memory directly from a new() call, and deallocate will
 192   only do a delete() call.
 193 </para>
 194
 195 <para>
 196 - If the GLIBCXX_FORCE_NEW environment variable is not set, both ST and MT
 197   applications will:
 198   - Calculate the number of bins needed. A bin is a specific power of two size
 199     of bytes. I.e., by default the allocator will deal with requests of up to
 200     128 bytes (or whatever the value of _S_max_bytes is when _S_init() is
 201     called). This means that there will be bins of the following sizes
 202     (in bytes): 1, 2, 4, 8, 16, 32, 64, 128.
 203
 204   - Create the _S_binmap array. All requests are rounded up to the next
 205     "large enough" bin. I.e., a request for 29 bytes will cause a block from
 206     the "32 byte bin" to be returned to the application. The purpose of
 207     _S_binmap is to speed up the process of finding out which bin to use.
 208     I.e., the value of _S_binmap[ 29 ] is initialized to 5 (bin 5 = 32 bytes).
 209 </para>
 210 <para>
 211   - Create the _S_bin array. This array consists of bin_records. There will be
 212     as many bin_records in this array as the number of bins that we calculated
 213     earlier. I.e., if _S_max_bytes = 128 there will be 8 entries.
 214     Each bin_record is then initialized:
 215     - bin_record-&gt;first = An array of pointers to block_records. There will be
 216       as many block_records pointers as there are maximum number of threads
 217       (in a ST application there is only 1 thread, in a MT application there
 218       are _S_max_threads).
 219       This holds the pointer to the first free block for each thread in this
 220       bin. I.e., if we would like to know where the first free block of size 32
 221       for thread number 3 is we would look this up by: _S_bin[ 5 ].first[ 3 ]
 222
 223     The above created block_record pointers members are now initialized to
 224     their initial values. I.e. _S_bin[ n ].first[ n ] = NULL;
 225 </para>
 226
 227 <para>
 228 - Additionally a MT application will:
 229   - Create a list of free thread id's. The pointer to the first entry
 230     is stored in _S_thread_freelist_first. The reason for this approach is
 231     that the __gthread_self() call will not return a value that corresponds to
 232     the maximum number of threads allowed but rather a process id number or
 233     something else. So what we do is that we create a list of thread_records.
 234     This list is _S_max_threads long and each entry holds a size_t thread_id
 235     which is initialized to 1, 2, 3, 4, 5 and so on up to _S_max_threads.
 236     Each time a thread calls allocate() or deallocate() we call
 237     _S_get_thread_id() which looks at the value of _S_thread_key which is a
 238     thread local storage pointer. If this is NULL we know that this is a newly
 239     created thread and we pop the first entry from this list and saves the
 240     pointer to this record in the _S_thread_key variable. The next time
 241     we will get the pointer to the thread_record back and we use the
 242     thread_record-&gt;thread_id as identification. I.e., the first thread that
 243     calls allocate will get the first record in this list and thus be thread
 244     number 1 and will then find the pointer to its first free 32 byte block
 245     in _S_bin[ 5 ].first[ 1 ]
 246     When we create the _S_thread_key we also define a destructor
 247     (_S_thread_key_destr) which means that when the thread dies, this
 248     thread_record is returned to the front of this list and the thread id
 249     can then be reused if a new thread is created.
 250     This list is protected by a mutex (_S_thread_freelist_mutex) which is only
 251     locked when records are removed or added to the list.
 252 </para>
 253 <para>
 254   - Initialize the free and used counters of each bin_record:
 255     - bin_record-&gt;free = An array of size_t. This keeps track of the number
 256       of blocks on a specific thread's freelist in each bin. I.e., if a thread
 257       has 12 32-byte blocks on it's freelists and allocates one of these, this
 258       counter would be decreased to 11.
 259
 260     - bin_record-&gt;used = An array of size_t. This keeps track of the number
 261       of blocks currently in use of this size by this thread. I.e., if a thread
 262       has made 678 requests (and no deallocations...) of 32-byte blocks this
 263       counter will read 678.
 264
 265     The above created arrays are now initialized with their initial values.
 266     I.e. _S_bin[ n ].free[ n ] = 0;
 267 </para>
 268 <para>
 269   - Initialize the mutex of each bin_record: The bin_record-&gt;mutex
 270     is used to protect the global freelist. This concept of a global
 271     freelist is explained in more detail in the section "A multi
 272     threaded example", but basically this mutex is locked whenever a
 273     block of memory is retrieved or returned to the global freelist
 274     for this specific bin. This only occurs when a number of blocks
 275     are grabbed from the global list to a thread specific list or when
 276     a thread decides to return some blocks to the global freelist.
 277 </para>
 278
 279 </sect3>
 280
 281 <sect3 id="allocator.mt.deallocation" xreflabel="allocator.mt.deallocation">
 282 <title>Deallocation Notes</title>
 283
 284 <para> Notes about deallocation. This allocator does not explicitly
 285 release memory. Because of this, memory debugging programs like
 286 valgrind or purify may notice leaks: sorry about this
 287 inconvenience. Operating systems will reclaim allocated memory at
 288 program termination anyway. If sidestepping this kind of noise is
 289 desired, there are three options: use an allocator, like
 290 <code>new_allocator</code> that releases memory while debugging, use
 291 GLIBCXX_FORCE_NEW to bypass the allocator's internal pools, or use a
 292 custom pool datum that releases resources on destruction.
 293 </para>
 294
 295 <para>
 296   On systems with the function <code>__cxa_atexit</code>, the
 297 allocator can be forced to free all memory allocated before program
 298 termination with the member function
 299 <code>__pool_type::_M_destroy</code>. However, because this member
 300 function relies on the precise and exactly-conforming ordering of
 301 static destructors, including those of a static local
 302 <code>__pool</code> object, it should not be used, ever, on systems
 303 that don't have the necessary underlying support. In addition, in
 304 practice, forcing deallocation can be tricky, as it requires the
 305 <code>__pool</code> object to be fully-constructed before the object
 306 that uses it is fully constructed. For most (but not all) STL
 307 containers, this works, as an instance of the allocator is constructed
 308 as part of a container's constructor. However, this assumption is
 309 implementation-specific, and subject to change. For an example of a
 310 pool that frees memory, see the following
 311     <ulink url="http://gcc.gnu.org/viewcvs/trunk/libstdc%2B%2B-v3/testsuite/ext/mt_allocator/deallocate_local-6.cc?view=markup">
 312     example.</ulink>
 313 </para>
 314
 315 </sect3>
 316
 317 </sect2>
 318
 319 <sect2 id="allocator.mt.example_single" xreflabel="allocator.mt.example_single">
 320 <title>Single Thread Example</title>
 321
 322 <para>
 323 Let's start by describing how the data on a freelist is laid out in memory.
 324 This is the first two blocks in freelist for thread id 3 in bin 3 (8 bytes):
 325 </para>
 326 <programlisting>
 327 +----------------+
 328 | next* ---------|--+  (_S_bin[ 3 ].first[ 3 ] points here)
 329 |                |  |
 330 |                |  |
 331 |                |  |
 332 +----------------+  |
 333 | thread_id = 3  |  |
 334 |                |  |
 335 |                |  |
 336 |                |  |
 337 +----------------+  |
 338 | DATA           |  |  (A pointer to here is what is returned to the
 339 |                |  |   the application when needed)
 340 |                |  |
 341 |                |  |
 342 |                |  |
 343 |                |  |
 344 |                |  |
 345 |                |  |
 346 +----------------+  |
 347 +----------------+  |
 348 | next*          |&lt;-+  (If next == NULL it's the last one on the list)
 349 |                |
 350 |                |
 351 |                |
 352 +----------------+
 353 | thread_id = 3  |
 354 |                |
 355 |                |
 356 |                |
 357 +----------------+
 358 | DATA           |
 359 |                |
 360 |                |
 361 |                |
 362 |                |
 363 |                |
 364 |                |
 365 |                |
 366 +----------------+
 367 </programlisting>
 368
 369 <para>
 370 With this in mind we simplify things a bit for a while and say that there is
 371 only one thread (a ST application). In this case all operations are made to
 372 what is referred to as the global pool - thread id 0 (No thread may be
 373 assigned this id since they span from 1 to _S_max_threads in a MT application).
 374 </para>
 375 <para>
 376 When the application requests memory (calling allocate()) we first look at the
 377 requested size and if this is &gt; _S_max_bytes we call new() directly and return.
 378 </para>
 379 <para>
 380 If the requested size is within limits we start by finding out from which
 381 bin we should serve this request by looking in _S_binmap.
 382 </para>
 383 <para>
 384 A quick look at _S_bin[ bin ].first[ 0 ] tells us if there are any blocks of
 385 this size on the freelist (0). If this is not NULL - fine, just remove the
 386 block that _S_bin[ bin ].first[ 0 ] points to from the list,
 387 update _S_bin[ bin ].first[ 0 ] and return a pointer to that blocks data.
 388 </para>
 389 <para>
 390 If the freelist is empty (the pointer is NULL) we must get memory from the
 391 system and build us a freelist within this memory. All requests for new memory
 392 is made in chunks of _S_chunk_size. Knowing the size of a block_record and
 393 the bytes that this bin stores we then calculate how many blocks we can create
 394 within this chunk, build the list, remove the first block, update the pointer
 395 (_S_bin[ bin ].first[ 0 ]) and return a pointer to that blocks data.
 396 </para>
 397
 398 <para>
 399 Deallocation is equally simple; the pointer is casted back to a block_record
 400 pointer, lookup which bin to use based on the size, add the block to the front
 401 of the global freelist and update the pointer as needed
 402 (_S_bin[ bin ].first[ 0 ]).
 403 </para>
 404
 405 <para>
 406 The decision to add deallocated blocks to the front of the freelist was made
 407 after a set of performance measurements that showed that this is roughly 10%
 408 faster than maintaining a set of "last pointers" as well.
 409 </para>
 410
 411 </sect2>
 412
 413 <sect2 id="allocator.mt.example_multi" xreflabel="allocator.mt.example_multi">
 414 <title>Multiple Thread Example</title>
 415
 416 <para>
 417 In the ST example we never used the thread_id variable present in each block.
 418 Let's start by explaining the purpose of this in a MT application.
 419 </para>
 420
 421 <para>
 422 The concept of "ownership" was introduced since many MT applications
 423 allocate and deallocate memory to shared containers from different
 424 threads (such as a cache shared amongst all threads). This introduces
 425 a problem if the allocator only returns memory to the current threads
 426 freelist (I.e., there might be one thread doing all the allocation and
 427 thus obtaining ever more memory from the system and another thread
 428 that is getting a longer and longer freelist - this will in the end
 429 consume all available memory).
 430 </para>
 431
 432 <para>
 433 Each time a block is moved from the global list (where ownership is
 434 irrelevant), to a threads freelist (or when a new freelist is built
 435 from a chunk directly onto a threads freelist or when a deallocation
 436 occurs on a block which was not allocated by the same thread id as the
 437 one doing the deallocation) the thread id is set to the current one.
 438 </para>
 439
 440 <para>
 441 What's the use? Well, when a deallocation occurs we can now look at
 442 the thread id and find out if it was allocated by another thread id
 443 and decrease the used counter of that thread instead, thus keeping the
 444 free and used counters correct. And keeping the free and used counters
 445 corrects is very important since the relationship between these two
 446 variables decides if memory should be returned to the global pool or
 447 not when a deallocation occurs.
 448 </para>
 449
 450 <para>
 451 When the application requests memory (calling allocate()) we first
 452 look at the requested size and if this is &gt;_S_max_bytes we call new()
 453 directly and return.
 454 </para>
 455
 456 <para>
 457 If the requested size is within limits we start by finding out from which
 458 bin we should serve this request by looking in _S_binmap.
 459 </para>
 460
 461 <para>
 462 A call to _S_get_thread_id() returns the thread id for the calling thread
 463 (and if no value has been set in _S_thread_key, a new id is assigned and
 464 returned).
 465 </para>
 466
 467 <para>
 468 A quick look at _S_bin[ bin ].first[ thread_id ] tells us if there are
 469 any blocks of this size on the current threads freelist. If this is
 470 not NULL - fine, just remove the block that _S_bin[ bin ].first[
 471 thread_id ] points to from the list, update _S_bin[ bin ].first[
 472 thread_id ], update the free and used counters and return a pointer to
 473 that blocks data.
 474 </para>
 475
 476 <para>
 477 If the freelist is empty (the pointer is NULL) we start by looking at
 478 the global freelist (0). If there are blocks available on the global
 479 freelist we lock this bins mutex and move up to block_count (the
 480 number of blocks of this bins size that will fit into a _S_chunk_size)
 481 or until end of list - whatever comes first - to the current threads
 482 freelist and at the same time change the thread_id ownership and
 483 update the counters and pointers. When the bins mutex has been
 484 unlocked, we remove the block that _S_bin[ bin ].first[ thread_id ]
 485 points to from the list, update _S_bin[ bin ].first[ thread_id ],
 486 update the free and used counters, and return a pointer to that blocks
 487 data.
 488 </para>
 489
 490 <para>
 491 The reason that the number of blocks moved to the current threads
 492 freelist is limited to block_count is to minimize the chance that a
 493 subsequent deallocate() call will return the excess blocks to the
 494 global freelist (based on the _S_freelist_headroom calculation, see
 495 below).
 496 </para>
 497
 498 <para>
 499 However if there isn't any memory on the global pool we need to get
 500 memory from the system - this is done in exactly the same way as in a
 501 single threaded application with one major difference; the list built
 502 in the newly allocated memory (of _S_chunk_size size) is added to the
 503 current threads freelist instead of to the global.
 504 </para>
 505
 506 <para>
 507 The basic process of a deallocation call is simple: always add the
 508 block to the front of the current threads freelist and update the
 509 counters and pointers (as described earlier with the specific check of
 510 ownership that causes the used counter of the thread that originally
 511 allocated the block to be decreased instead of the current threads
 512 counter).
 513 </para>
 514
 515 <para>
 516 And here comes the free and used counters to service. Each time a
 517 deallocation() call is made, the length of the current threads
 518 freelist is compared to the amount memory in use by this thread.
 519 </para>
 520
 521 <para>
 522 Let's go back to the example of an application that has one thread
 523 that does all the allocations and one that deallocates. Both these
 524 threads use say 516 32-byte blocks that was allocated during thread
 525 creation for example.  Their used counters will both say 516 at this
 526 point. The allocation thread now grabs 1000 32-byte blocks and puts
 527 them in a shared container. The used counter for this thread is now
 528 1516.
 529 </para>
 530
 531 <para>
 532 The deallocation thread now deallocates 500 of these blocks. For each
 533 deallocation made the used counter of the allocating thread is
 534 decreased and the freelist of the deallocation thread gets longer and
 535 longer. But the calculation made in deallocate() will limit the length
 536 of the freelist in the deallocation thread to _S_freelist_headroom %
 537 of it's used counter.  In this case, when the freelist (given that the
 538 _S_freelist_headroom is at it's default value of 10%) exceeds 52
 539 (516/10) blocks will be returned to the global pool where the
 540 allocating thread may pick them up and reuse them.
 541 </para>
 542
 543 <para>
 544 In order to reduce lock contention (since this requires this bins
 545 mutex to be locked) this operation is also made in chunks of blocks
 546 (just like when chunks of blocks are moved from the global freelist to
 547 a threads freelist mentioned above). The "formula" used can probably
 548 be improved to further reduce the risk of blocks being "bounced back
 549 and forth" between freelists.
 550 </para>
 551
 552 </sect2>
 553
 554 </sect1>