1 <chapter xmlns="http://docbook.org/ns/docbook" version="5.0"
2 xml:id="manual.ext.profile_mode" xreflabel="Profile Mode">
3 <?dbhtml filename="profile_mode.html"?>
5 <info><title>Profile Mode</title>
22 <section xml:id="manual.ext.profile_mode.intro" xreflabel="Intro"><info><title>Intro</title></info>
25 <emphasis>Goal: </emphasis>Give performance improvement advice based on
26 recognition of suboptimal usage patterns of the standard library.
30 <emphasis>Method: </emphasis>Wrap the standard library code. Insert
31 calls to an instrumentation library to record the internal state of
32 various components at interesting entry/exit points to/from the standard
33 library. Process trace, recognize suboptimal patterns, give advice.
35 <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://dx.doi.org/10.1109/CGO.2009.36">paper presented at
39 <emphasis>Strengths: </emphasis>
Unintrusive solution. The application code does not require any
changes.
<listitem><para> The advice is call context sensitive, thus capable of
precisely identifying interesting dynamic performance behavior.
The overhead model is pay-per-use: when you turn off a diagnostic class
at compile time, its overhead disappears.
55 <emphasis>Drawbacks: </emphasis>
58 You must recompile the application code with custom options.
60 <listitem><para>You must run the application on representative input.
61 The advice is input dependent.
The execution time will increase, in some cases by large factors.
70 <section xml:id="manual.ext.profile_mode.using" xreflabel="Using"><info><title>Using the Profile Mode</title></info>
74 This is the anticipated common workflow for program <code>foo.cc</code>:
#include <vector>
int main() {
  std::vector<int> v;
  for (int k = 0; k < 1024; ++k) v.insert(v.begin(), k);
}
$ g++ -D_GLIBCXX_PROFILE foo.cc
$ ./a.out
$ cat libstdcxx-profile.txt
86 vector-to-list: improvement = 5: call stack = 0x804842c ...
87 : advice = change std::vector to std::list
88 vector-size: improvement = 3: call stack = 0x804842c ...
89 : advice = change initial container size from 0 to 1024
98 Warning id. This is a short descriptive string for the class
99 that this warning belongs to. E.g., "vector-to-list".
104 Estimated improvement. This is an approximation of the benefit expected
105 from implementing the change suggested by the warning. It is given on
106 a log10 scale. Negative values mean that the alternative would actually
107 do worse than the current choice.
In the example above, 5 comes from the fact that the overhead of
inserting at the beginning of a vector vs. a list is around 1024 * 1024 / 2,
which is around 10^5. The improvement from setting the initial size to
1024 is in the range of 10^3, since the overhead of dynamic resizing is
linear in the number of elements copied.
117 Call stack. Currently, the addresses are printed without
118 symbol name or code location attribution.
119 Users are expected to postprocess the output using, for instance, addr2line.
124 The warning message. For some warnings, this is static text, e.g.,
125 "change vector to list". For other warnings, such as the one above,
the message contains numeric advice, e.g., the suggested initial
container size.
133 <para>Three files are generated. <code>libstdcxx-profile.txt</code>
134 contains human readable advice. <code>libstdcxx-profile.raw</code>
contains implementation-specific data about each diagnostic.
Its format is not documented, but it is sufficient to generate
all the advice given in <code>libstdcxx-profile.txt</code>. The advantage
138 of keeping this raw format is that traces from multiple executions can
139 be aggregated simply by concatenating the raw traces. We intend to
140 offer an external utility program that can issue advice from a trace.
141 <code>libstdcxx-profile.conf.out</code> lists the actual diagnostic
142 parameters used. To alter parameters, edit this file and rename it to
143 <code>libstdcxx-profile.conf</code>.
<para>Advice is given regardless of whether the transformation is valid.
For instance, we advise changing a map to an unordered_map even if the
application semantics require that data be ordered.
149 We believe such warnings can help users understand the performance
150 behavior of their application better, which can lead to changes
151 at a higher abstraction level.
156 <section xml:id="manual.ext.profile_mode.tuning" xreflabel="Tuning"><info><title>Tuning the Profile Mode</title></info>
159 <para>Compile time switches and environment variables (see also file
160 profiler.h). Unless specified otherwise, they can be set at compile time
161 using -D_<name> or by setting variable <name>
162 in the environment where the program is run, before starting execution.
165 <code>_GLIBCXX_PROFILE_NO_<diagnostic></code>:
166 disable specific diagnostics.
167 See section Diagnostics for possible values.
168 (Environment variables not supported.)
171 <code>_GLIBCXX_PROFILE_TRACE_PATH_ROOT</code>: set an alternative root
172 path for the output files.
<listitem><para><code>_GLIBCXX_PROFILE_MAX_WARN_COUNT</code>: set it to
the maximum number of warnings desired. The default value is 10.</para></listitem>
<code>_GLIBCXX_PROFILE_MAX_STACK_DEPTH</code>: if set to 0, advice will
be collected and reported for the program as a whole, and not for each
call context.
This could also be used in continuous regression tests, where you
just need to know whether there is a regression or not.
The default value is 32.
186 <code>_GLIBCXX_PROFILE_MEM_PER_DIAGNOSTIC</code>:
187 set a limit on how much memory to use for the accounting tables for each
diagnostic type. When this limit is reached, new events are ignored
until memory usage drops below the limit. Generally, this means
that newly created containers will not be instrumented until some
live containers are deleted. The default is 128 MB.
194 <code>_GLIBCXX_PROFILE_NO_THREADS</code>:
Tell the library not to use threads. If thread-local storage (TLS) is not
196 available, you will get a preprocessor error asking you to set
197 -D_GLIBCXX_PROFILE_NO_THREADS if your program is single-threaded.
198 Multithreaded execution without TLS is not supported.
199 (Environment variable not supported.)
202 <code>_GLIBCXX_HAVE_EXECINFO_H</code>:
203 This name should be defined automatically at library configuration time.
204 If your library was configured without <code>execinfo.h</code>, but
205 you have it in your include path, you can define it explicitly. Without
206 it, advice is collected for the program as a whole, and not for each
208 (Environment variable not supported.)
218 <section xml:id="manual.ext.profile_mode.design" xreflabel="Design"><info><title>Design</title></info>
224 <title>Profile Code Location</title>
226 <tgroup cols="2" align="left" colsep="1" rowsep="1">
227 <colspec colname="c1"/>
228 <colspec colname="c2"/>
232 <entry>Code Location</entry>
238 <entry><code>libstdc++-v3/include/std/*</code></entry>
239 <entry>Preprocessor code to redirect to profile extension headers.</entry>
242 <entry><code>libstdc++-v3/include/profile/*</code></entry>
243 <entry>Profile extension public headers (map, vector, ...).</entry>
246 <entry><code>libstdc++-v3/include/profile/impl/*</code></entry>
247 <entry>Profile extension internals. Implementation files are
248 only included from <code>impl/profiler.h</code>, which is the only
249 file included from the public headers.</entry>
258 <section xml:id="manual.ext.profile_mode.design.wrapper" xreflabel="Wrapper"><info><title>Wrapper Model</title></info>
261 In order to get our instrumented library version included instead of the
263 we use the same wrapper model as the debug mode.
264 We subclass entities from the release version. Wherever
265 <code>_GLIBCXX_PROFILE</code> is defined, the release namespace is
266 <code>std::__norm</code>, whereas the profile namespace is
267 <code>std::__profile</code>. Using plain <code>std</code> translates
268 into <code>std::__profile</code>.
271 Whenever possible, we try to wrap at the public interface level, e.g.,
272 in <code>unordered_set</code> rather than in <code>hashtable</code>,
273 in order not to depend on implementation.
Mixing object files built with and without the profile mode must
not affect the program execution. However, there are no guarantees of
the accuracy of diagnostics when even a single object file is not built
with <code>-D_GLIBCXX_PROFILE</code>.
280 Currently, mixing the profile mode with debug and parallel extensions is
281 not allowed. Mixing them at compile time will result in preprocessor errors.
282 Mixing them at link time is undefined.
287 <section xml:id="manual.ext.profile_mode.design.instrumentation" xreflabel="Instrumentation"><info><title>Instrumentation</title></info>
290 Instead of instrumenting every public entry and exit point,
291 we chose to add instrumentation on demand, as needed
292 by individual diagnostics.
293 The main reason is that some diagnostics require us to extract bits of
294 internal state that are particular only to that diagnostic.
295 We plan to formalize this later, after we learn more about the requirements
296 of several diagnostics.
299 All the instrumentation points can be switched on and off using
300 <code>-D[_NO]_GLIBCXX_PROFILE_<diagnostic></code> options.
301 With all the instrumentation calls off, there should be negligible
302 overhead over the release version. This property is needed to support
303 diagnostics based on timing of internal operations. For such diagnostics,
304 we anticipate turning most of the instrumentation off in order to prevent
305 profiling overhead from polluting time measurements, and thus diagnostics.
308 All the instrumentation on/off compile time switches live in
309 <code>include/profile/profiler.h</code>.
314 <section xml:id="manual.ext.profile_mode.design.rtlib" xreflabel="Run Time Behavior"><info><title>Run Time Behavior</title></info>
317 For practical reasons, the instrumentation library processes the trace
rather than dumping it to disk in raw form. Each event is processed when
it occurs: it is usually assigned a cost and aggregated into
the database of a specific diagnostic class. The cost model
322 is based largely on the standard performance guarantees, but in some
323 cases we use knowledge about GCC's standard library implementation.
326 Information is indexed by (1) call stack and (2) instance id or address
327 to be able to understand and summarize precise creation-use-destruction
328 dynamic chains. Although the analysis is sensitive to dynamic instances,
329 the reports are only sensitive to call context. Whenever a dynamic instance
330 is destroyed, we accumulate its effect to the corresponding entry for the
331 call stack of its constructor location.
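A simplified sketch (assumed, not the actual implementation) of this two-level indexing: instances are tracked while alive, and folded into their constructor's call-stack entry on destruction, so the final reports are call-context sensitive only.

```cpp
#include <cstdint>
#include <map>

using stack_id = std::uint64_t;           // stands in for a hashed call stack

struct object_info { double cost = 0; };  // accumulated while the object lives

struct trace_table {
    std::map<stack_id, double> by_context;  // the report-level aggregation
    void on_destruct(stack_id s, const object_info& o) {
        by_context[s] += o.cost;  // fold the instance into its call context
    }
};
```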
336 <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://dx.doi.org/10.1109/CGO.2009.36">paper presented at
342 <section xml:id="manual.ext.profile_mode.design.analysis" xreflabel="Analysis and Diagnostics"><info><title>Analysis and Diagnostics</title></info>
345 Final analysis takes place offline, and it is based entirely on the
346 generated trace and debugging info in the application binary.
347 See section Diagnostics for a list of analysis types that we plan to support.
350 The input to the analysis is a table indexed by profile type and call stack.
351 The data type for each entry depends on the profile type.
356 <section xml:id="manual.ext.profile_mode.design.cost-model" xreflabel="Cost Model"><info><title>Cost Model</title></info>
359 While it is likely that cost models become complex as we get into
360 more sophisticated analysis, we will try to follow a simple set of rules
364 <listitem><para><emphasis>Relative benefit estimation:</emphasis>
365 The idea is to estimate or measure the cost of all operations
366 in the original scenario versus the scenario we advise to switch to.
367 For instance, when advising to change a vector to a list, an occurrence
368 of the <code>insert</code> method will generally count as a benefit.
369 Its magnitude depends on (1) the number of elements that get shifted
370 and (2) whether it triggers a reallocation.
372 <listitem><para><emphasis>Synthetic measurements:</emphasis>
373 We will measure the relative difference between similar operations on
374 different containers. We plan to write a battery of small tests that
375 compare the times of the executions of similar methods on different
376 containers. The idea is to run these tests on the target machine.
377 If this training phase is very quick, we may decide to perform it at
library initialization time. The results can be cached on disk and reused
across runs.
381 <listitem><para><emphasis>Timers:</emphasis>
382 We plan to use timers for operations of larger granularity, such as sort.
383 For instance, we can switch between different sort methods on the fly
384 and report the one that performs best for each call context.
386 <listitem><para><emphasis>Show stoppers:</emphasis>
387 We may decide that the presence of an operation nullifies the advice.
388 For instance, when considering switching from <code>set</code> to
389 <code>unordered_set</code>, if we detect use of operator <code>++</code>,
we will simply not issue the advice, since this could signal that the use
case requires a sorted container.</para></listitem>
397 <section xml:id="manual.ext.profile_mode.design.reports" xreflabel="Reports"><info><title>Reports</title></info>
400 There are two types of reports. First, if we recognize a pattern for which
401 we have a substitute that is likely to give better performance, we print
the advice and estimated performance gain. The advice is usually associated
with a code position and possibly a call stack.
406 Second, we report performance characteristics for which we do not have
a clear solution for improvement. For instance, we can point the user to
the top 10 <code>multimap</code> locations
409 which have the worst data locality in actual traversals.
410 Although this does not offer a solution,
411 it helps the user focus on the key problems and ignore the uninteresting ones.
416 <section xml:id="manual.ext.profile_mode.design.testing" xreflabel="Testing"><info><title>Testing</title></info>
419 First, we want to make sure we preserve the behavior of the release mode.
You can just type <code>make check-profile</code>, which
421 builds and runs the whole test suite in profile mode.
424 Second, we want to test the correctness of each diagnostic.
425 We created a <code>profile</code> directory in the test suite.
426 Each diagnostic must come with at least two tests, one for false positives
427 and one for false negatives.
433 <section xml:id="manual.ext.profile_mode.api" xreflabel="API"><info><title>Extensions for Custom Containers</title></info>
437 Many large projects use their own data structures instead of the ones in the
438 standard library. If these data structures are similar in functionality
439 to the standard library, they can be instrumented with the same hooks
440 that are used to instrument the standard library.
441 The instrumentation API is exposed in file
442 <code>profiler.h</code> (look for "Instrumentation hooks").
448 <section xml:id="manual.ext.profile_mode.cost_model" xreflabel="Cost Model"><info><title>Empirical Cost Model</title></info>
452 Currently, the cost model uses formulas with predefined relative weights
453 for alternative containers or container implementations. For instance,
454 iterating through a vector is X times faster than iterating through a list.
458 We are working on customizing this to a particular machine by providing
459 an automated way to compute the actual relative weights for operations
460 on the given machine.
464 We plan to provide a performance parameter database format that can be
465 filled in either by hand or by an automated training mechanism.
The analysis module will then use this database instead of the built-in one.
473 <section xml:id="manual.ext.profile_mode.implementation" xreflabel="Implementation"><info><title>Implementation Issues</title></info>
477 <section xml:id="manual.ext.profile_mode.implementation.stack" xreflabel="Stack Traces"><info><title>Stack Traces</title></info>
480 Accurate stack traces are needed during profiling since we group events by
481 call context and dynamic instance. Without accurate traces, diagnostics
482 may be hard to interpret. For instance, when giving advice to the user
483 it is imperative to reference application code, not library code.
Currently we are using the libc <code>backtrace</code> routine to get
stack traces.
<code>_GLIBCXX_PROFILE_MAX_STACK_DEPTH</code> can be set
to 0 if you are willing to give up call context information, or to a small
positive value to reduce run time overhead.
495 <section xml:id="manual.ext.profile_mode.implementation.symbols" xreflabel="Symbolization"><info><title>Symbolization of Instruction Addresses</title></info>
498 The profiling and analysis phases use only instruction addresses.
499 An external utility such as addr2line is needed to postprocess the result.
500 We do not plan to add symbolization support in the profile extension.
501 This would require access to symbol tables, debug information tables,
502 external programs or libraries and other system dependent information.
507 <section xml:id="manual.ext.profile_mode.implementation.concurrency" xreflabel="Concurrency"><info><title>Concurrency</title></info>
510 Our current model is simplistic, but precise.
511 We cannot afford to approximate because some of our diagnostics require
512 precise matching of operations to container instance and call context.
513 During profiling, we keep a single information table per diagnostic.
514 There is a single lock per information table.
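The stated model can be sketched as follows (a simplified assumption-level sketch, not the library's internal types):

```cpp
#include <cstdint>
#include <map>
#include <mutex>

// One information table per diagnostic, guarded by a single lock:
// simplistic but precise, since every event is matched exactly to its
// container instance and call context.
struct diagnostic_table {
    std::mutex lock;                          // the single per-table lock
    std::map<std::uint64_t, double> entries;  // keyed by call context
    void record(std::uint64_t context, double cost) {
        std::lock_guard<std::mutex> guard(lock);
        entries[context] += cost;             // exact, never approximated
    }
};
```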
519 <section xml:id="manual.ext.profile_mode.implementation.stdlib-in-proflib" xreflabel="Using the Standard Library in the Runtime Library"><info><title>Using the Standard Library in the Instrumentation Implementation</title></info>
522 As much as we would like to avoid uses of libstdc++ within our
523 instrumentation library, containers such as unordered_map are very
524 appealing. We plan to use them as long as they are named properly
530 <section xml:id="manual.ext.profile_mode.implementation.malloc-hooks" xreflabel="Malloc Hooks"><info><title>Malloc Hooks</title></info>
User applications/libraries can provide malloc hooks.
When the implementation of the malloc hooks uses libstdc++, there can
be an infinite cycle between the profile mode instrumentation and the
malloc hook.
539 We protect against reentrance to the profile mode instrumentation code,
540 which should avoid this problem in most cases.
541 The protection mechanism is thread safe and exception safe.
542 This mechanism does not prevent reentrance to the malloc hook itself,
543 which could still result in deadlock, if, for instance, the malloc hook
544 uses non-recursive locks.
545 XXX: A definitive solution to this problem would be for the profile extension
546 to use a custom allocator internally, and perhaps not to use libstdc++.
551 <section xml:id="manual.ext.profile_mode.implementation.construction-destruction" xreflabel="Construction and Destruction of Global Objects"><info><title>Construction and Destruction of Global Objects</title></info>
554 The profiling library state is initialized at the first call to a profiling
555 method. This allows us to record the construction of all global objects.
However, we cannot do the same at destruction time. The trace is written
by a function registered by <code>atexit</code>, thus invoked by
<code>exit</code>.
565 <section xml:id="manual.ext.profile_mode.developer" xreflabel="Developer Information"><info><title>Developer Information</title></info>
568 <section xml:id="manual.ext.profile_mode.developer.bigpic" xreflabel="Big Picture"><info><title>Big Picture</title></info>
571 <para>The profile mode headers are included with
572 <code>-D_GLIBCXX_PROFILE</code> through preprocessor directives in
573 <code>include/std/*</code>.
576 <para>Instrumented implementations are provided in
577 <code>include/profile/*</code>. All instrumentation hooks are macros
578 defined in <code>include/profile/profiler.h</code>.
581 <para>All the implementation of the instrumentation hooks is in
<code>include/profile/impl/*</code>. Although all the code gets included,
and is thus publicly visible, only a small number of functions are called
from outside this directory. All calls to hook implementations must be
585 done through macros defined in <code>profiler.h</code>. The macro
586 must ensure (1) that the call is guarded against reentrance and
587 (2) that the call can be turned off at compile time using a
588 <code>-D_GLIBCXX_PROFILE_...</code> compiler option.
593 <section xml:id="manual.ext.profile_mode.developer.howto" xreflabel="How To Add A Diagnostic"><info><title>How To Add A Diagnostic</title></info>
596 <para>Let's say the diagnostic name is "magic".
599 <para>If you need to instrument a header not already under
600 <code>include/profile/*</code>, first edit the corresponding header
601 under <code>include/std/</code> and add a preprocessor directive such
602 as the one in <code>include/std/vector</code>:
#ifdef _GLIBCXX_PROFILE
# include <profile/vector>
#endif
610 <para>If the file you need to instrument is not yet under
611 <code>include/profile/</code>, make a copy of the one in
612 <code>include/debug</code>, or the main implementation.
613 You'll need to include the main implementation and inherit the classes
614 you want to instrument. Then define the methods you want to instrument,
615 define the instrumentation hooks and add calls to them.
616 Look at <code>include/profile/vector</code> for an example.
619 <para>Add macros for the instrumentation hooks in
620 <code>include/profile/impl/profiler.h</code>.
621 Hook names must start with <code>__profcxx_</code>.
Make sure they expand
to no code with <code>-D_NO_GLIBCXX_PROFILE_MAGIC</code>.
Make sure all calls to any method in namespace <code>__gnu_profile</code>
are protected against reentrance using macro
<code>_GLIBCXX_PROFILE_REENTRANCE_GUARD</code>.
627 All names of methods in namespace <code>__gnu_profile</code> called from
628 <code>profiler.h</code> must start with <code>__trace_magic_</code>.
631 <para>Add the implementation of the diagnostic.
634 Create new file <code>include/profile/impl/profiler_magic.h</code>.
637 Define class <code>__magic_info: public __object_info_base</code>.
638 This is the representation of a line in the object table.
639 The <code>__merge</code> method is used to aggregate information
640 across all dynamic instances created at the same call context.
The <code>__magnitude</code> method must return an estimate of the benefit
as a number of small operations, e.g., the number of words copied.
643 The <code>__write</code> method is used to produce the raw trace.
644 The <code>__advice</code> method is used to produce the advice string.
647 Define class <code>__magic_stack_info: public __magic_info</code>.
648 This defines the content of a line in the stack table.
651 Define class <code>__trace_magic: public __trace_base<__magic_info,
652 __magic_stack_info></code>.
653 It defines the content of the trace associated with this diagnostic.
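An assumed skeleton for a new "magic" diagnostic; the class and method names come from the text above, while the member bodies are invented for illustration only.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

struct __object_info_base { };

struct __magic_info : public __object_info_base {
    std::size_t __ops = 0;                 // one line in the object table
    void __merge(const __magic_info& __o)  // aggregate same-context instances
    { __ops += __o.__ops; }
    std::size_t __magnitude() const        // benefit in small operations
    { return __ops; }
    void __write(FILE* __f) const          // produce a raw-trace line
    { std::fprintf(__f, "%zu", __ops); }
    std::string __advice() const           // produce the advice string
    { return "apply the magic transformation"; }
};

struct __magic_stack_info : public __magic_info { };  // stack-table line
```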
658 <para>Add initialization and reporting calls in
659 <code>include/profile/impl/profiler_trace.h</code>. Use
660 <code>__trace_vector_to_list</code> as an example.
663 <para>Add documentation in file <code>doc/xml/manual/profile_mode.xml</code>.
668 <section xml:id="manual.ext.profile_mode.diagnostics"><info><title>Diagnostics</title></info>
672 The table below presents all the diagnostics we intend to implement.
673 Each diagnostic has a corresponding compile time switch
674 <code>-D_GLIBCXX_PROFILE_<diagnostic></code>.
675 Groups of related diagnostics can be turned on with a single switch.
676 For instance, <code>-D_GLIBCXX_PROFILE_LOCALITY</code> is equivalent to
677 <code>-D_GLIBCXX_PROFILE_SOFTWARE_PREFETCH
678 -D_GLIBCXX_PROFILE_RBTREE_LOCALITY</code>.
682 The benefit, cost, expected frequency and accuracy of each diagnostic
683 was given a grade from 1 to 10, where 10 is highest.
684 A high benefit means that, if the diagnostic is accurate, the expected
685 performance improvement is high.
686 A high cost means that turning this diagnostic on leads to high slowdown.
687 A high frequency means that we expect this to occur relatively often.
688 A high accuracy means that the diagnostic is unlikely to be wrong.
689 These grades are not perfect. They are just meant to guide users with
690 specific needs or time budgets.
694 <title>Profile Diagnostics</title>
696 <tgroup cols="7" align="left" colsep="1" rowsep="1">
697 <colspec colname="c1"/>
698 <colspec colname="c2"/>
699 <colspec colname="c3"/>
700 <colspec colname="c4"/>
701 <colspec colname="c5"/>
702 <colspec colname="c6"/>
703 <colspec colname="c7"/>
709 <entry>Benefit</entry>
712 <entry>Implemented</entry>
717 <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.containers">
718 CONTAINERS</link></entry>
719 <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.hashtable_too_small">
720 HASHTABLE_TOO_SMALL</link></entry>
729 <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.hashtable_too_large">
730 HASHTABLE_TOO_LARGE</link></entry>
739 <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.inefficient_hash">
740 INEFFICIENT_HASH</link></entry>
749 <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.vector_too_small">
750 VECTOR_TOO_SMALL</link></entry>
759 <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.vector_too_large">
760 VECTOR_TOO_LARGE</link></entry>
769 <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.vector_to_hashtable">
770 VECTOR_TO_HASHTABLE</link></entry>
779 <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.hashtable_to_vector">
780 HASHTABLE_TO_VECTOR</link></entry>
789 <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.vector_to_list">
790 VECTOR_TO_LIST</link></entry>
799 <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.list_to_vector">
800 LIST_TO_VECTOR</link></entry>
809 <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.assoc_ord_to_unord">
810 ORDERED_TO_UNORDERED</link></entry>
815 <entry>only map/unordered_map</entry>
818 <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.algorithms">
819 ALGORITHMS</link></entry>
820 <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.algorithms.sort">
829 <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.locality">
830 LOCALITY</link></entry>
831 <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.locality.sw_prefetch">
832 SOFTWARE_PREFETCH</link></entry>
841 <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.locality.linked">
842 RBTREE_LOCALITY</link></entry>
851 <entry><link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="#manual.ext.profile_mode.analysis.mthread.false_share">
852 FALSE_SHARING</link></entry>
863 <section xml:id="manual.ext.profile_mode.analysis.template" xreflabel="Template"><info><title>Diagnostic Template</title></info>
866 <listitem><para><emphasis>Switch:</emphasis>
867 <code>_GLIBCXX_PROFILE_<diagnostic></code>.
869 <listitem><para><emphasis>Goal:</emphasis> What problem will it diagnose?
<listitem><para><emphasis>Fundamentals:</emphasis>
What is the fundamental reason why this is a problem?</para></listitem>
873 <listitem><para><emphasis>Sample runtime reduction:</emphasis>
874 Percentage reduction in execution time. When reduction is more than
875 a constant factor, describe the reduction rate formula.
877 <listitem><para><emphasis>Recommendation:</emphasis>
What would the advice look like?</para></listitem>
879 <listitem><para><emphasis>To instrument:</emphasis>
Which libstdc++ components need to be instrumented?</para></listitem>
881 <listitem><para><emphasis>Analysis:</emphasis>
882 How do we decide when to issue the advice?</para></listitem>
883 <listitem><para><emphasis>Cost model:</emphasis>
884 How do we measure benefits? Math goes here.</para></listitem>
885 <listitem><para><emphasis>Example:</emphasis>
896 <section xml:id="manual.ext.profile_mode.analysis.containers" xreflabel="Containers"><info><title>Containers</title></info>
900 <emphasis>Switch:</emphasis>
901 <code>_GLIBCXX_PROFILE_CONTAINERS</code>.
904 <section xml:id="manual.ext.profile_mode.analysis.hashtable_too_small" xreflabel="Hashtable Too Small"><info><title>Hashtable Too Small</title></info>
907 <listitem><para><emphasis>Switch:</emphasis>
908 <code>_GLIBCXX_PROFILE_HASHTABLE_TOO_SMALL</code>.
910 <listitem><para><emphasis>Goal:</emphasis> Detect hashtables with many
911 rehash operations, small construction size and large destruction size.
913 <listitem><para><emphasis>Fundamentals:</emphasis> Rehash is very expensive.
914 Read content, follow chains within bucket, evaluate hash function, place at
915 new location in different order.</para></listitem>
916 <listitem><para><emphasis>Sample runtime reduction:</emphasis> 36%.
917 Code similar to example below.
919 <listitem><para><emphasis>Recommendation:</emphasis>
920 Set initial size to N at construction site S.
922 <listitem><para><emphasis>To instrument:</emphasis>
923 <code>unordered_set, unordered_map</code> constructor, destructor, rehash.
925 <listitem><para><emphasis>Analysis:</emphasis>
926 For each dynamic instance of <code>unordered_[multi]set|map</code>,
927 record initial size and call context of the constructor.
928 Record size increase, if any, after each relevant operation such as insert.
929 Record the estimated rehash cost.</para></listitem>
930 <listitem><para><emphasis>Cost model:</emphasis>
931 Number of individual rehash operations * cost per rehash.</para></listitem>
932 <listitem><para><emphasis>Example:</emphasis>
1 unordered_set<int> us;
2 for (int k = 0; k < 1000000; ++k) {
3   us.insert(k);
4 }
939 foo.cc:1: advice: Changing initial unordered_set size from 10 to 1000000 saves 1025530 rehash operations.
946 <section xml:id="manual.ext.profile_mode.analysis.hashtable_too_large" xreflabel="Hashtable Too Large"><info><title>Hashtable Too Large</title></info>
949 <listitem><para><emphasis>Switch:</emphasis>
950 <code>_GLIBCXX_PROFILE_HASHTABLE_TOO_LARGE</code>.
<listitem><para><emphasis>Goal:</emphasis> Detect hashtables which are
never filled up because fewer elements than reserved are ever
inserted.</para></listitem>
956 <listitem><para><emphasis>Fundamentals:</emphasis> Save memory, which
957 is good in itself and may also improve memory reference performance through
958 fewer cache and TLB misses.</para></listitem>
959 <listitem><para><emphasis>Sample runtime reduction:</emphasis> unknown.
961 <listitem><para><emphasis>Recommendation:</emphasis>
962 Set initial size to N at construction site S.
964 <listitem><para><emphasis>To instrument:</emphasis>
965 <code>unordered_set, unordered_map</code> constructor, destructor, rehash.
967 <listitem><para><emphasis>Analysis:</emphasis>
968 For each dynamic instance of <code>unordered_[multi]set|map</code>,
969 record initial size and call context of the constructor, and correlate it
970 with its size at destruction time.
972 <listitem><para><emphasis>Cost model:</emphasis>
973 Number of iteration operations + memory saved.</para></listitem>
974 <listitem><para><emphasis>Example:</emphasis>
976 1 vector<unordered_set<int>> v(100000, unordered_set<int>(100));
977 2 for (int k = 0; k < 100000; ++k) {
978 3 for (int j = 0; j < 10; ++j) {
979 4 v[k].insert(k + j);
983 foo.cc:1: advice: Changing initial unordered_set size from 100 to 10 saves N
984 bytes of memory and M iteration steps.
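One way to act on this advice is to pass a bucket-count hint that matches the number of elements actually stored. A sketch based on the example above (helper name and counts are illustrative):

```cpp
#include <unordered_set>
#include <vector>

// Sketch of the fix: size the inner sets for the ~10 elements they
// actually hold instead of the 100 originally requested.
std::vector<std::unordered_set<int>>
make_sets(int outer, int per_set)
{
  std::vector<std::unordered_set<int>> v(outer,
      std::unordered_set<int>(per_set));   // bucket-count hint
  for (int k = 0; k < outer; ++k)
    for (int j = 0; j < per_set; ++j)
      v[k].insert(k + j);
  return v;
}
```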
990 <section xml:id="manual.ext.profile_mode.analysis.inefficient_hash" xreflabel="Inefficient Hash"><info><title>Inefficient Hash</title></info>
993 <listitem><para><emphasis>Switch:</emphasis>
994 <code>_GLIBCXX_PROFILE_INEFFICIENT_HASH</code>.
996 <listitem><para><emphasis>Goal:</emphasis> Detect hashtables with polarized
999 <listitem><para><emphasis>Fundamentals:</emphasis> A non-uniform
1000 distribution may lead to long chains, thus possibly increasing complexity
1001 by a factor up to the number of elements.
1003 <listitem><para><emphasis>Sample runtime reduction:</emphasis> factor up
1006 <listitem><para><emphasis>Recommendation:</emphasis> Change hash function
1007 for container built at site S. Distribution score = N. Access score = A.
1008 Longest chain = C, in bucket B.
1010 <listitem><para><emphasis>To instrument:</emphasis>
1011 <code>unordered_set, unordered_map</code> constructor, destructor, [],
1014 <listitem><para><emphasis>Analysis:</emphasis>
1015 Count the exact number of link traversals.
1017 <listitem><para><emphasis>Cost model:</emphasis>
1018 Total number of links traversed.</para></listitem>
1019 <listitem><para><emphasis>Example:</emphasis>
1023 size_t operator() (int i) const { return 0; }
1026 unordered_set<int, dumb_hash> hs;
1028 for (int i = 0; i < COUNT; ++i) {
1036 <section xml:id="manual.ext.profile_mode.analysis.vector_too_small" xreflabel="Vector Too Small"><info><title>Vector Too Small</title></info>
1039 <listitem><para><emphasis>Switch:</emphasis>
1040 <code>_GLIBCXX_PROFILE_VECTOR_TOO_SMALL</code>.
1042 <listitem><para><emphasis>Goal:</emphasis> Detect vectors with many
1043 resize operations, a small construction size and a large destruction size.
1045 <listitem><para><emphasis>Fundamentals:</emphasis>Resizing can be expensive.
1046 Copying large amounts of data takes time. Resizing many small vectors may
1047 have allocation overhead and affect locality.</para></listitem>
1048 <listitem><para><emphasis>Sample runtime reduction:</emphasis>%.
1050 <listitem><para><emphasis>Recommendation:</emphasis>
1051 Set initial size to N at construction site S.</para></listitem>
1052 <listitem><para><emphasis>To instrument:</emphasis><code>vector</code>.
1054 <listitem><para><emphasis>Analysis:</emphasis>
1055 For each dynamic instance of <code>vector</code>,
1056 record initial size and call context of the constructor.
1057 Record size increase, if any, after each relevant operation such as
1058 <code>push_back</code>. Record the estimated resize cost.
1060 <listitem><para><emphasis>Cost model:</emphasis>
1061 Total number of words copied * time to copy a word.</para></listitem>
1062 <listitem><para><emphasis>Example:</emphasis>
1064 1 vector<int> v;
1065 2 for (int k = 0; k < 1000000; ++k) {
1069 foo.cc:1: advice: Changing initial vector size from 10 to 1000000 saves
1070 copying 4000000 bytes and 20 memory allocations and deallocations.
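The suggested fix is to reserve the final size before filling the vector. A minimal sketch (helper name is illustrative):

```cpp
#include <vector>

// Sketch of the fix: reserve the final size up front so push_back
// never triggers a reallocate-and-copy cycle.
std::vector<int> make_filled_vector(int n)
{
  std::vector<int> v;
  v.reserve(n);                  // single allocation, no copying on growth
  for (int k = 0; k < n; ++k)
    v.push_back(k);
  return v;
}
```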
1076 <section xml:id="manual.ext.profile_mode.analysis.vector_too_large" xreflabel="Vector Too Large"><info><title>Vector Too Large</title></info>
1079 <listitem><para><emphasis>Switch:</emphasis>
1080 <code>_GLIBCXX_PROFILE_VECTOR_TOO_LARGE</code>
1082 <listitem><para><emphasis>Goal:</emphasis>Detect vectors which are
1083 never filled up because fewer elements than reserved are ever
1086 <listitem><para><emphasis>Fundamentals:</emphasis>Save memory, which
1087 is good in itself and may also improve memory reference performance through
1088 fewer cache and TLB misses.</para></listitem>
1089 <listitem><para><emphasis>Sample runtime reduction:</emphasis>%.
1091 <listitem><para><emphasis>Recommendation:</emphasis>
1092 Set initial size to N at construction site S.</para></listitem>
1093 <listitem><para><emphasis>To instrument:</emphasis><code>vector</code>.
1095 <listitem><para><emphasis>Analysis:</emphasis>
1096 For each dynamic instance of <code>vector</code>,
1097 record initial size and call context of the constructor, and correlate it
1098 with its size at destruction time.</para></listitem>
1099 <listitem><para><emphasis>Cost model:</emphasis>
1100 Total amount of memory saved.</para></listitem>
1101 <listitem><para><emphasis>Example:</emphasis>
1103 1 vector<vector<int>> v(100000, vector<int>(100));
1104 2 for (int k = 0; k < 100000; ++k) {
1105 3 for (int j = 0; j < 10; ++j) {
1106 4 v[k].push_back(k + j);
1110 foo.cc:1: advice: Changing initial vector size from 100 to 10 saves N
1111 bytes of memory and may reduce the number of cache and TLB misses.
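Besides choosing a smaller initial size, excess capacity can also be released after the fact. A sketch using the pre-C++11 "swap trick" (C++11's <code>shrink_to_fit</code> is the modern, though non-binding, equivalent):

```cpp
#include <vector>

// Sketch: release the unused capacity of an over-reserved vector by
// swapping it with a copy sized to its contents.
void trim(std::vector<int>& v)
{
  std::vector<int>(v.begin(), v.end()).swap(v);  // "swap trick"
}
```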
1117 <section xml:id="manual.ext.profile_mode.analysis.vector_to_hashtable" xreflabel="Vector to Hashtable"><info><title>Vector to Hashtable</title></info>
1120 <listitem><para><emphasis>Switch:</emphasis>
1121 <code>_GLIBCXX_PROFILE_VECTOR_TO_HASHTABLE</code>.
1123 <listitem><para><emphasis>Goal:</emphasis> Detect uses of
1124 <code>vector</code> that can be substituted with <code>unordered_set</code>
1125 to reduce execution time.
1127 <listitem><para><emphasis>Fundamentals:</emphasis>
1128 Linear search in a vector is very expensive, whereas searching in a hashtable
1129 is very quick.</para></listitem>
1130 <listitem><para><emphasis>Sample runtime reduction:</emphasis>factor up
1133 <listitem><para><emphasis>Recommendation:</emphasis>Replace
1134 <code>vector</code> with <code>unordered_set</code> at site S.
1136 <listitem><para><emphasis>To instrument:</emphasis><code>vector</code>
1137 operations and access methods.</para></listitem>
1138 <listitem><para><emphasis>Analysis:</emphasis>
1139 For each dynamic instance of <code>vector</code>,
1140 record call context of the constructor. Issue the advice only if the
1141 only methods called on this <code>vector</code> are <code>push_back</code>,
1142 <code>insert</code> and <code>find</code>.
1144 <listitem><para><emphasis>Cost model:</emphasis>
1145 Cost(vector::push_back) + cost(vector::insert) + cost(find, vector) -
1146 (cost(unordered_set::insert) + cost(unordered_set::find)).
1148 <listitem><para><emphasis>Example:</emphasis>
1150 1 vector<int> v;
1152 2 for (int i = 0; i < 1000; ++i) {
1153 3 find(v.begin(), v.end(), i);
1156 foo.cc:1: advice: Changing "vector" to "unordered_set" will save about 500,000
1163 <section xml:id="manual.ext.profile_mode.analysis.hashtable_to_vector" xreflabel="Hashtable to Vector"><info><title>Hashtable to Vector</title></info>
1166 <listitem><para><emphasis>Switch:</emphasis>
1167 <code>_GLIBCXX_PROFILE_HASHTABLE_TO_VECTOR</code>.
1169 <listitem><para><emphasis>Goal:</emphasis> Detect uses of
1170 <code>unordered_set</code> that can be substituted with <code>vector</code>
1171 to reduce execution time.
1173 <listitem><para><emphasis>Fundamentals:</emphasis>
1174 Hashtable iterators are slower than vector iterators.</para></listitem>
1175 <listitem><para><emphasis>Sample runtime reduction:</emphasis>95%.
1177 <listitem><para><emphasis>Recommendation:</emphasis>Replace
1178 <code>unordered_set</code> with <code>vector</code> at site S.
1180 <listitem><para><emphasis>To instrument:</emphasis><code>unordered_set</code>
1181 operations and access methods.</para></listitem>
1182 <listitem><para><emphasis>Analysis:</emphasis>
1183 For each dynamic instance of <code>unordered_set</code>,
1184 record call context of the constructor. Issue the advice only if the
1185 number of <code>find</code>, <code>insert</code> and <code>[]</code>
1186 operations on this <code>unordered_set</code> are small relative to the
1187 number of elements, and methods <code>begin</code> or <code>end</code>
1188 are invoked (suggesting iteration).</para></listitem>
1189 <listitem><para><emphasis>Cost model:</emphasis>
1190 Number of indirections saved by iterating a vector instead of a hashtable.</para></listitem>
1191 <listitem><para><emphasis>Example:</emphasis>
1193 1 unordered_set<int> us;
1196 3 for (unordered_set<int>::iterator it = us.begin(); it != us.end(); ++it) {
1200 foo.cc:1: advice: Changing "unordered_set" to "vector" will save about N
1201 indirections and may achieve better data locality.
1207 <section xml:id="manual.ext.profile_mode.analysis.vector_to_list" xreflabel="Vector to List"><info><title>Vector to List</title></info>
1210 <listitem><para><emphasis>Switch:</emphasis>
1211 <code>_GLIBCXX_PROFILE_VECTOR_TO_LIST</code>.
1213 <listitem><para><emphasis>Goal:</emphasis> Detect cases where
1214 <code>vector</code> could be substituted with <code>list</code> for
1217 <listitem><para><emphasis>Fundamentals:</emphasis>
1218 Inserting in the middle of a vector is expensive compared to inserting in a
1221 <listitem><para><emphasis>Sample runtime reduction:</emphasis>factor up to
1224 <listitem><para><emphasis>Recommendation:</emphasis>Replace vector with list
1225 at site S.</para></listitem>
1226 <listitem><para><emphasis>To instrument:</emphasis><code>vector</code>
1227 operations and access methods.</para></listitem>
1228 <listitem><para><emphasis>Analysis:</emphasis>
1229 For each dynamic instance of <code>vector</code>,
1230 record the call context of the constructor. Record the overhead of each
1231 <code>insert</code> operation based on current size and insert position.
1232 Report instances with high insertion overhead.
1234 <listitem><para><emphasis>Cost model:</emphasis>
1235 (Sum(cost(vector::method)) - Sum(cost(list::method)), for
1236 method in [push_back, insert, erase])
1237 + (Cost(iterate vector) - Cost(iterate list))</para></listitem>
1238 <listitem><para><emphasis>Example:</emphasis>
1240 1 vector<int> v;
1241 2 for (int i = 0; i < 10000; ++i) {
1242 3 v.insert(v.begin(), i);
1245 foo.cc:1: advice: Changing "vector" to "list" will save about 5,000,000
1252 <section xml:id="manual.ext.profile_mode.analysis.list_to_vector" xreflabel="List to Vector"><info><title>List to Vector</title></info>
1255 <listitem><para><emphasis>Switch:</emphasis>
1256 <code>_GLIBCXX_PROFILE_LIST_TO_VECTOR</code>.
1258 <listitem><para><emphasis>Goal:</emphasis> Detect cases where
1259 <code>list</code> could be substituted with <code>vector</code> for
1262 <listitem><para><emphasis>Fundamentals:</emphasis>
1263 Iterating through a vector is faster than through a list.
1265 <listitem><para><emphasis>Sample runtime reduction:</emphasis>64%.
1267 <listitem><para><emphasis>Recommendation:</emphasis>Replace list with vector
1268 at site S.</para></listitem>
1269 <listitem><para><emphasis>To instrument:</emphasis><code>list</code>
1270 operations and access methods.</para></listitem>
1271 <listitem><para><emphasis>Analysis:</emphasis>
1272 Issue the advice if there are no <code>insert</code> operations.
1274 <listitem><para><emphasis>Cost model:</emphasis>
1275 (Sum(cost(list::method)) - Sum(cost(vector::method)), for
1276 method in [push_back, insert, erase])
1277 + (Cost(iterate list) - Cost(iterate vector))</para></listitem>
1278 <listitem><para><emphasis>Example:</emphasis>
1280 1 list<int> l;
1283 3 for (list<int>::iterator it = l.begin(); it != l.end(); ++it) {
1287 foo.cc:1: advice: Changing "list" to "vector" will save about 1000000 indirect
1294 <section xml:id="manual.ext.profile_mode.analysis.list_to_slist" xreflabel="List to Forward List"><info><title>List to Forward List (Slist)</title></info>
1297 <listitem><para><emphasis>Switch:</emphasis>
1298 <code>_GLIBCXX_PROFILE_LIST_TO_SLIST</code>.
1300 <listitem><para><emphasis>Goal:</emphasis> Detect cases where
1301 <code>list</code> could be substituted with <code>forward_list</code> for
1304 <listitem><para><emphasis>Fundamentals:</emphasis>
1305 The memory footprint of a forward_list is smaller than that of a list.
1306 This has beneficial effects on memory subsystem, e.g., fewer cache misses.
1308 <listitem><para><emphasis>Sample runtime reduction:</emphasis>40%.
1309 Note that the reduction is only noticeable if the size of the forward_list
1310 node is in fact smaller than that of the list node. For memory allocators
1311 with size classes, you will only notice an effect when the two node sizes
1312 belong to different allocator size classes.
1314 <listitem><para><emphasis>Recommendation:</emphasis>Replace list with
1315 forward_list at site S.</para></listitem>
1316 <listitem><para><emphasis>To instrument:</emphasis><code>list</code>
1317 operations and iteration methods.</para></listitem>
1318 <listitem><para><emphasis>Analysis:</emphasis>
1319 Issue the advice if there are no backward traversals
1320 or insertions before a given node.
1322 <listitem><para><emphasis>Cost model:</emphasis>
1323 Always true.</para></listitem>
1324 <listitem><para><emphasis>Example:</emphasis>
1326 1 list<int> l;
1329 3 for (list<int>::iterator it = l.begin(); it != l.end(); ++it) {
1333 foo.cc:1: advice: Change "list" to "forward_list".
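Applying the advice is usually a drop-in change, since front insertion and forward traversal carry over directly. A minimal sketch (helper name is illustrative):

```cpp
#include <forward_list>

// Sketch of the suggested change: forward_list supports the same
// front insertion and forward traversal, with one pointer per node.
long long fill_and_sum(int n)
{
  std::forward_list<int> l;
  for (int i = 0; i < n; ++i)
    l.push_front(i);             // list::push_front maps directly
  long long s = 0;
  for (int e : l) s += e;        // forward-only traversal
  return s;
}
```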
1339 <section xml:id="manual.ext.profile_mode.analysis.assoc_ord_to_unord" xreflabel="Ordered to Unordered Associative Container"><info><title>Ordered to Unordered Associative Container</title></info>
1342 <listitem><para><emphasis>Switch:</emphasis>
1343 <code>_GLIBCXX_PROFILE_ORDERED_TO_UNORDERED</code>.
1345 <listitem><para><emphasis>Goal:</emphasis> Detect cases where ordered
1346 associative containers can be replaced with unordered ones.
1348 <listitem><para><emphasis>Fundamentals:</emphasis>
1349 Insert and search are quicker in a hashtable than in
1350 a red-black tree.</para></listitem>
1351 <listitem><para><emphasis>Sample runtime reduction:</emphasis>52%.
1353 <listitem><para><emphasis>Recommendation:</emphasis>
1354 Replace set with unordered_set at site S.</para></listitem>
1355 <listitem><para><emphasis>To instrument:</emphasis>
1356 <code>set</code>, <code>multiset</code>, <code>map</code>,
1357 <code>multimap</code> methods.</para></listitem>
1358 <listitem><para><emphasis>Analysis:</emphasis>
1359 Issue the advice only if we are not using operator <code>++</code> on any
1360 iterator on a particular <code>[multi]set|map</code>.
1362 <listitem><para><emphasis>Cost model:</emphasis>
1363 (Sum(cost(rbtree::method)) - Sum(cost(hashtable::method)), for
1364 method in [insert, erase, find])
1365 + (Cost(iterate rbtree) - Cost(iterate hashtable))</para></listitem>
1366 <listitem><para><emphasis>Example:</emphasis>
1369 2 for (int i = 0; i < 100000; ++i) {
1373 6 for (int i = 0; i < 100000; ++i) {
1374 7 sum += *s.find(i);
1385 <section xml:id="manual.ext.profile_mode.analysis.algorithms" xreflabel="Algorithms"><info><title>Algorithms</title></info>
1388 <para><emphasis>Switch:</emphasis>
1389 <code>_GLIBCXX_PROFILE_ALGORITHMS</code>.
1392 <section xml:id="manual.ext.profile_mode.analysis.algorithms.sort" xreflabel="Sorting"><info><title>Sort Algorithm Performance</title></info>
1395 <listitem><para><emphasis>Switch:</emphasis>
1396 <code>_GLIBCXX_PROFILE_SORT</code>.
1398 <listitem><para><emphasis>Goal:</emphasis> Give measure of sort algorithm
1399 performance based on actual input. For instance, advise Radix Sort over
1400 Quick Sort for a particular call context.
1402 <listitem><para><emphasis>Fundamentals:</emphasis>
1404 <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://portal.acm.org/citation.cfm?doid=1065944.1065981">
1405 A framework for adaptive algorithm selection in STAPL</link> and
1406 <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://ieeexplore.ieee.org/search/wrapper.jsp?arnumber=4228227">
1407 Optimizing Sorting with Machine Learning Algorithms</link>.
1409 <listitem><para><emphasis>Sample runtime reduction:</emphasis>60%.
1411 <listitem><para><emphasis>Recommendation:</emphasis> Change sort algorithm
1412 at site S from X Sort to Y Sort.</para></listitem>
1413 <listitem><para><emphasis>To instrument:</emphasis> <code>sort</code>
1414 algorithm.</para></listitem>
1415 <listitem><para><emphasis>Analysis:</emphasis>
1416 Issue the advice if the cost model tells us that another sort algorithm
1417 would do better on this input. Requires us to know what algorithm we
1418 are using in our sort implementation in release mode.</para></listitem>
1419 <listitem><para><emphasis>Cost model:</emphasis>
1420 Runtime(algo) for algo in [radix, quick, merge, ...]</para></listitem>
1421 <listitem><para><emphasis>Example:</emphasis>
1431 <section xml:id="manual.ext.profile_mode.analysis.locality" xreflabel="Data Locality"><info><title>Data Locality</title></info>
1434 <para><emphasis>Switch:</emphasis>
1435 <code>_GLIBCXX_PROFILE_LOCALITY</code>.
1438 <section xml:id="manual.ext.profile_mode.analysis.locality.sw_prefetch" xreflabel="Need Software Prefetch"><info><title>Need Software Prefetch</title></info>
1441 <listitem><para><emphasis>Switch:</emphasis>
1442 <code>_GLIBCXX_PROFILE_SOFTWARE_PREFETCH</code>.
1444 <listitem><para><emphasis>Goal:</emphasis> Discover sequences of indirect
1445 memory accesses that are not regular, thus cannot be predicted by
1446 hardware prefetchers.
1448 <listitem><para><emphasis>Fundamentals:</emphasis>
1449 Indirect references are hard to predict and are very expensive when they
1450 miss in caches.</para></listitem>
1451 <listitem><para><emphasis>Sample runtime reduction:</emphasis>25%.
1453 <listitem><para><emphasis>Recommendation:</emphasis> Insert prefetch
1454 instruction.</para></listitem>
1455 <listitem><para><emphasis>To instrument:</emphasis> Vector iterator and
1458 <listitem><para><emphasis>Analysis:</emphasis>
1459 First, get cache line size and page size from system.
1460 Then record iterator dereference sequences for which the value is a pointer.
1461 For each sequence within a container, issue a warning if successive pointer
1462 addresses are not within cache lines and do not form a linear pattern
1463 (otherwise they may be prefetched by hardware).
1464 If they also step across page boundaries, make the warning stronger.
1466 <para>The same analysis applies to containers other than vector.
1467 However, we cannot give the same advice for linked structures, such as list,
1468 as there is no random access to the n-th element. The user may still be
1469 able to benefit from this information, for instance by employing frays (user-level
1470 lightweight threads) to hide the latency of chasing pointers.
1473 This analysis is a little oversimplified. A better cost model could be
1474 created by understanding the capability of the hardware prefetcher.
1475 This model could be trained automatically by running a set of synthetic
1479 <listitem><para><emphasis>Cost model:</emphasis>
1480 Total distance between pointer values of successive elements in vectors
1481 of pointers.</para></listitem>
1482 <listitem><para><emphasis>Example:</emphasis>
1485 2 vector<int*> v(10000000, &zero);
1486 3 for (int k = 0; k < 10000000; ++k) {
1487 4 v[random() % 10000000] = new int(k);
1489 6 for (int j = 0; j < 10000000; ++j) {
1490 7 count += (*v[j] == 0 ? 0 : 1);
1493 foo.cc:7: advice: Insert prefetch instruction.
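Acting on this advice typically means issuing an explicit prefetch a few elements ahead of the current position. A GCC/Clang-specific sketch using <code>__builtin_prefetch</code> (the distance of 8 is an illustrative guess, not a tuned value):

```cpp
#include <cstddef>
#include <vector>

// Sketch: prefetch the pointed-to data several elements ahead of the
// current position to hide the latency of the second indirection.
int count_nonzero(const std::vector<int*>& v)
{
  const std::size_t ahead = 8;   // illustrative prefetch distance
  int count = 0;
  for (std::size_t j = 0; j < v.size(); ++j) {
    if (j + ahead < v.size())
      __builtin_prefetch(v[j + ahead]);  // GCC/Clang builtin
    count += (*v[j] == 0 ? 0 : 1);
  }
  return count;
}
```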
1499 <section xml:id="manual.ext.profile_mode.analysis.locality.linked" xreflabel="Linked Structure Locality"><info><title>Linked Structure Locality</title></info>
1502 <listitem><para><emphasis>Switch:</emphasis>
1503 <code>_GLIBCXX_PROFILE_RBTREE_LOCALITY</code>.
1505 <listitem><para><emphasis>Goal:</emphasis> Give measure of locality of
1506 objects stored in linked structures (lists, red-black trees and hashtables)
1507 with respect to their actual traversal patterns.
1509 <listitem><para><emphasis>Fundamentals:</emphasis>Allocation can be tuned
1510 to a specific traversal pattern, to result in better data locality.
1512 <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.springerlink.com/content/8085744l00x72662/">
1513 Custom Memory Allocation for Free</link>.
1515 <listitem><para><emphasis>Sample runtime reduction:</emphasis>30%.
1517 <listitem><para><emphasis>Recommendation:</emphasis>
1518 High scatter score N for container built at site S.
1519 Consider changing allocation sequence or choosing a structure conscious
1520 allocator.</para></listitem>
1521 <listitem><para><emphasis>To instrument:</emphasis> Methods of all
1522 containers using linked structures.</para></listitem>
1523 <listitem><para><emphasis>Analysis:</emphasis>
1524 First, get cache line size and page size from system.
1525 Then record the number of successive elements that are on different line
1526 or page, for each traversal method such as <code>find</code>. Give advice
1527 only if the ratio between this number and the number of total node hops
1528 is above a threshold.</para></listitem>
1529 <listitem><para><emphasis>Cost model:</emphasis>
1530 Sum(same_cache_line(this,previous))</para></listitem>
1531 <listitem><para><emphasis>Example:</emphasis>
1534 2 for (int i = 0; i < 10000000; ++i) {
1537 5 set<int> s1, s2;
1538 6 for (int i = 0; i < 10000000; ++i) {
1543 // Fast, better locality.
1544 10 for (set<int>::iterator it = s.begin(); it != s.end(); ++it) {
1547 // Slow, elements are further apart.
1548 13 for (set<int>::iterator it = s1.begin(); it != s1.end(); ++it) {
1552 foo.cc:5: advice: High scatter score NNN for set built here. Consider changing
1553 the allocation sequence or switching to a structure conscious allocator.
1562 <section xml:id="manual.ext.profile_mode.analysis.mthread" xreflabel="Multithreaded Data Access"><info><title>Multithreaded Data Access</title></info>
1566 The diagnostics in this group are not meant to be implemented short term.
1567 They require compiler support to know when container elements are written
1568 to. Instrumentation can only tell us when elements are referenced.
1571 <para><emphasis>Switch:</emphasis>
1572 <code>_GLIBCXX_PROFILE_MULTITHREADED</code>.
1575 <section xml:id="manual.ext.profile_mode.analysis.mthread.ddtest" xreflabel="Dependence Violations at Container Level"><info><title>Data Dependence Violations at Container Level</title></info>
1578 <listitem><para><emphasis>Switch:</emphasis>
1579 <code>_GLIBCXX_PROFILE_DDTEST</code>.
1581 <listitem><para><emphasis>Goal:</emphasis> Detect container elements
1582 that are referenced from multiple threads in the parallel region or
1583 across parallel regions.
1585 <listitem><para><emphasis>Fundamentals:</emphasis>
1586 Sharing data between threads requires communication and perhaps locking,
1587 which may be expensive.
1589 <listitem><para><emphasis>Sample runtime reduction:</emphasis>?%.
1591 <listitem><para><emphasis>Recommendation:</emphasis> Change data
1592 distribution or parallel algorithm.</para></listitem>
1593 <listitem><para><emphasis>To instrument:</emphasis> Container access methods
1596 <listitem><para><emphasis>Analysis:</emphasis>
1597 Keep a shadow for each container. Record iterator dereferences and
1598 container member accesses. Issue advice for elements referenced by
1600 See paper: <link xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://portal.acm.org/citation.cfm?id=207110.207148">
1601 The LRPD test: speculative run-time parallelization of loops with
1602 privatization and reduction parallelization</link>.
1604 <listitem><para><emphasis>Cost model:</emphasis>
1605 Number of accesses to elements referenced from multiple threads
1607 <listitem><para><emphasis>Example:</emphasis>
1614 <section xml:id="manual.ext.profile_mode.analysis.mthread.false_share" xreflabel="False Sharing"><info><title>False Sharing</title></info>
1617 <listitem><para><emphasis>Switch:</emphasis>
1618 <code>_GLIBCXX_PROFILE_FALSE_SHARING</code>.
1620 <listitem><para><emphasis>Goal:</emphasis> Detect elements in the
1621 same container which share a cache line, are written by at least one
1622 thread, and accessed by different threads.
1624 <listitem><para><emphasis>Fundamentals:</emphasis> Under these assumptions,
1625 cache protocols require
1626 communication to invalidate lines, which may be expensive.
1628 <listitem><para><emphasis>Sample runtime reduction:</emphasis>68%.
1630 <listitem><para><emphasis>Recommendation:</emphasis> Reorganize container
1631 or use padding to avoid false sharing.</para></listitem>
1632 <listitem><para><emphasis>To instrument:</emphasis> Container access methods
1635 <listitem><para><emphasis>Analysis:</emphasis>
1636 First, get the cache line size.
1637 For each shared container, record all the associated iterator dereferences
1638 and member access methods with the thread id. Compare the address lists
1639 across threads to detect references in two different threads to the same
1640 cache line. Issue a warning only if the ratio to total references is
1641 significant. Do the same for iterator dereference values if they are
1642 pointers.</para></listitem>
1643 <listitem><para><emphasis>Cost model:</emphasis>
1644 Number of accesses to same cache line from different threads.
1646 <listitem><para><emphasis>Example:</emphasis>
1648 1 vector<int> v(2, 0);
1649 2 #pragma omp parallel for shared(v, SIZE) schedule(static, 1)
1650 3 for (i = 0; i < SIZE; ++i) {
1654 OMP_NUM_THREADS=2 ./a.out
1655 foo.cc:1: advice: Change container structure or padding to avoid false
1656 sharing in multithreaded access at foo.cc:4. Detected N shared cache lines.
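The padding the advice refers to can be sketched by giving each per-thread element its own cache line via <code>alignas</code> (the 64-byte line size is a typical x86 value, assumed here; type and helper names are illustrative):

```cpp
#include <vector>

// Sketch of the suggested padding: each per-thread counter occupies its
// own cache line, so writes from different threads do not invalidate
// each other's lines.
struct alignas(64) PaddedCounter {   // 64 = typical x86 cache line size
  int value = 0;
};

void work(std::vector<PaddedCounter>& counters, int thread_id, int iters)
{
  for (int i = 0; i < iters; ++i)
    ++counters[thread_id].value;     // touches only this thread's line
}
```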
1665 <section xml:id="manual.ext.profile_mode.analysis.statistics" xreflabel="Statistics"><info><title>Statistics</title></info>
1669 <emphasis>Switch:</emphasis>
1670 <code>_GLIBCXX_PROFILE_STATISTICS</code>.
1674 In some cases the cost model may not tell us anything because the costs
1675 appear to offset the benefits. Consider the choice between a vector and
1676 a list. When there are both inserts and iteration, an automatic advice
1677 may not be issued. However, the programmer may still be able to make use
1678 of this information in a different way.
1681 This diagnostic will not issue any advice, but it will print statistics for
1682 each container construction site. The statistics will contain the cost
1683 of each operation actually performed on the container.
1692 <bibliography xml:id="profile_mode.biblio"><info><title>Bibliography</title></info>
1697 Perflint: A Context Sensitive Performance Advisor for C++ Programs
1700 <author><personname><firstname>Lixia</firstname><surname>Liu</surname></personname></author>
1701 <author><personname><firstname>Silvius</firstname><surname>Rus</surname></personname></author>
1710 Proceedings of the 2009 International Symposium on Code Generation