Optimization Tips (for libavcodec):
===================================

If you plan to do non-x86 architecture-specific optimizations (normally SIMD),
then take a look in the i386/ directory first, as most of the important
functions are already optimized for MMX there.

If you want to do x86 optimizations, you can either try to fine-tune the
code in the i386/ directory or find some other functions in the C source to
optimize, but there aren't many left.

Understanding these overoptimized functions:
--------------------------------------------
As many functions tend to be a bit difficult to understand because
of optimizations, it can be hard to optimize them further or to write
architecture-specific versions. It is recommended to look at older
revisions of the files of interest (for a web frontend try ViewVC at
http://svn.mplayerhq.hu/ffmpeg/trunk/).
Alternatively, look into the other architecture-specific versions in
the i386/, ppc/, alpha/ subdirectories. Even if you don't fully
comprehend the instructions, they can help you understand the functions
and how they can be optimized.

NOTE: If you still don't understand some function, ask on our mailing list!
(http://lists.mplayerhq.hu/mailman/listinfo/ffmpeg-devel)

When is an optimization justified?
----------------------------------
Normally, clean and simple optimizations on widely used codecs can achieve
an overall speedup of 0.1%. These speedups accumulate and can make a big
difference after a while. Also, if none of speed, binary code size, source
size, or source readability gets worse due to an optimization, and at least
one of them improves, then the optimization is always a good idea even if
the overall gain is less than 0.1%. For obscure codecs that are not often
used, the goal is more toward keeping the code clean, small, and readable
than to make it 1% faster.

WTF is that function good for ....:
-----------------------------------
The primary purpose of this list is to avoid wasting time optimizing functions
that are rarely used.

put(_no_rnd)_pixels{,_x2,_y2,_xy2}
    Used in motion compensation (en/decoding).

avg_pixels{,_x2,_y2,_xy2}
    Used in motion compensation of B-frames.
    These are less important than the put*pixels functions.

pix_abs16x16{,_x2,_y2,_xy2}
    Used in motion estimation (encoding) with SAD.

pix_abs8x8{,_x2,_y2,_xy2}
    Used in motion estimation (encoding) with SAD of MPEG-4 4MV only.
    These are less important than the pix_abs16x16* functions.

put_mspel8_mc* / wmv2_mspel8*
    It is not recommended that you waste your time with these, as WMV2
    is an ugly and relatively useless codec.

mpeg4_qpel* / *qpel_mc*
    Used in MPEG-4 qpel motion compensation (encoding & decoding).
    The qpel8 functions are used only for 4MV,
    the avg_* functions are used only for B-frames.
    Optimizing them should have a significant impact on qpel
    encoding & decoding.

qpel{8,16}_mc??_old_c / *pixels{8,16}_l4
    Just used to work around a bug in an old libavcodec encoder version.

tpel_mc_func {put,avg}_tpel_pixels_tab
    Used only for SVQ3, so only optimize them if you need fast SVQ3 decoding.

    For huffyuv only, optimize if you want a faster ffhuffyuv codec.

get_pixels / diff_pixels
    Used for encoding, easy.

    Optimizing this should have a significant effect on gmc decoding.

    Used for chroma blocks in MPEG-4 gmc with 1 warp point
    (there are 4 luma & 2 chroma blocks per macroblock, so
    only 1/3 of the gmc blocks use this, the other 2/3
    use the normal put_pixel* code, but only if there is
    just 1 warp point).
    Note: DivX5 gmc always uses just 1 warp point.

hadamard8_diff / sse / sad == pix_norm1 / dct_sad / quant_psnr / rd / bit
    Specific compare functions used in encoding; which of these are used
    depends on the command line switches.
    Don't waste your time with dct_sad & quant_psnr, they aren't worth it.

put_pixels_clamped / add_pixels_clamped
    Used for en/decoding in the IDCT, easy.
    Note that some optimized IDCTs include the add/put clamped code
    themselves, in which case put_pixels_clamped / add_pixels_clamped
    are unused.

idct (encoding & decoding)
    difficult to optimize

    Used for encoding with trellis quantization.
    difficult to optimize

    Used in MPEG-1 en/decoding.

    Used in MPEG-2 en/decoding.

    Used in MPEG-4/H.263 en/decoding.

FIXME remaining functions?
BTW, most of these functions are in dsputil.c/.h; some are in mpegvideo.c/.h.
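
To make a few of the entries above concrete, here is a minimal C sketch of
what plain-C versions of put_pixels, put_pixels_x2, pix_abs16x16 and
put_pixels_clamped roughly compute. The sketch_* names, the fixed 16x16 and
8x8 block sizes, and the use of int16_t for the DCT coefficients are
assumptions made for illustration; the real routines in dsputil.c differ in
signature and detail.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for dsputil routines (illustration only). */

/* put_pixels: copy a 16-pixel-wide block of h rows (full-pel motion vector). */
void sketch_put_pixels16(uint8_t *dst, const uint8_t *src, int line_size, int h)
{
    assert(line_size >= 16);
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 16; x++)
            dst[x] = src[x];
        dst += line_size;
        src += line_size;
    }
}

/* put_pixels_x2: horizontal half-pel position, i.e. the rounded average of
 * each pixel and its right neighbour (reads 17 source pixels per row). */
void sketch_put_pixels16_x2(uint8_t *dst, const uint8_t *src, int line_size, int h)
{
    assert(line_size >= 17);
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 16; x++)
            dst[x] = (uint8_t)((src[x] + src[x + 1] + 1) >> 1);
        dst += line_size;
        src += line_size;
    }
}

/* pix_abs16x16: sum of absolute differences (SAD) between two 16x16 blocks. */
int sketch_pix_abs16x16(const uint8_t *a, const uint8_t *b, int line_size)
{
    int sum = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sum += abs(a[x] - b[x]);
        a += line_size;
        b += line_size;
    }
    return sum;
}

/* put_pixels_clamped: store an 8x8 block of IDCT output, clamped to 0..255. */
void sketch_put_pixels_clamped(const int16_t *block, uint8_t *pixels, int line_size)
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            int v = block[y * 8 + x];
            pixels[x] = (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
        }
        pixels += line_size;
    }
}
```

All four are simple loops over independent pixels, which is why they map
well onto SIMD instructions that process 8 or 16 bytes at once.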

Some instructions on some architectures have strict alignment restrictions,
for example most SSE/SSE2 instructions on x86.
The minimum guaranteed alignment is written in the .h files, for example:
    void (*put_pixels_clamped)(const DCTELEM *block/*align 16*/, UINT8 *pixels/*align 8*/, int line_size);
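
As a rough illustration of what such an alignment guarantee means for the
caller, the following C11 sketch declares buffers that satisfy the "align 16"
and "align 8" comments from the prototype above and checks pointers against
them. The sketch_* names are hypothetical, and int16_t is assumed as the
coefficient type; this is not code from libavcodec.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Returns nonzero if p is aligned to `align` bytes.
 * `align` must be a power of two. */
int sketch_is_aligned(const void *p, uintptr_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Hypothetical caller-side buffers matching the documented guarantees:
 * 16-byte alignment for the coefficient block, 8-byte for the pixels. */
alignas(16) int16_t sketch_block[64];
alignas(8)  uint8_t sketch_pixels[8 * 8];

/* An optimized routine that relies on the guarantee could verify it in
 * debug builds before using aligned loads and stores. */
void sketch_check_args(const int16_t *block, const uint8_t *pixels)
{
    assert(sketch_is_aligned(block, 16));
    assert(sketch_is_aligned(pixels, 8));
}
```

Note that for the pixel pointer the guarantee only carries over to the
following rows if line_size is itself a multiple of the alignment.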

http://www.aggregate.org/MAGIC/

http://developer.intel.com/design/pentium4/manuals/248966.htm

The IA-32 Intel Architecture Software Developer's Manual, Volume 2:
Instruction Set Reference
http://developer.intel.com/design/pentium4/manuals/245471.htm

http://www.agner.org/assem/

AMD Athlon Processor x86 Code Optimization Guide:
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf

ARM Architecture Reference Manual (up to ARMv5TE):
http://www.arm.com/community/university/eulaarmarm.html

Procedure Call Standard for the ARM Architecture:
http://www.arm.com/pdfs/aapcs.pdf

Optimization guide for ARM9E (used in Nokia 770 Internet Tablet):
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0240b/DDI0240A.pdf
Optimization guide for ARM11 (used in Nokia N800 Internet Tablet):
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0211j/DDI0211J_arm1136_r1p5_trm.pdf
Optimization guide for Intel XScale (used in Sharp Zaurus PDA):
http://download.intel.com/design/intelxscale/27347302.pdf

PowerPC32/AltiVec PIM:
www.freescale.com/files/32bit/doc/ref_manual/ALTIVECPIM.pdf

PowerPC32/AltiVec PEM:
www.freescale.com/files/32bit/doc/ref_manual/ALTIVECPEM.pdf

http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/30B3520C93F437AB87257060006FFE5E/$file/Language_Extensions_for_CBEA_2.4.pdf
http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/9F820A5FFA3ECE8C8725716A0062585F/$file/CBE_Handbook_v1.1_24APR2007_pub.pdf

SPARC Joint Programming Specification (JPS1): Commonality
http://www.fujitsu.com/downloads/PRMPWR/JPS1-R1.0.4-Common-pub.pdf

UltraSPARC III Processor User's Manual (contains instruction timings)
http://www.sun.com/processors/manuals/USIIIv2.pdf

VIS Whitepaper (contains optimization guidelines)
http://www.sun.com/processors/vis/download/vis/vis_whitepaper.pdf

Official documentation, but quite ugly:
http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html

A bit old (note that "+" is valid for input-output operands, even though
this document disagrees):
http://www.cs.virginia.edu/~clc5q/gcc-inline-asm.pdf
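
As a small illustration of the "+" constraint mentioned above, here is a
sketch of GCC extended asm with an input-output operand. It assumes GCC or
Clang on x86 with the default AT&T syntax and falls back to plain C
elsewhere; sketch_add and sketch_add_demo are hypothetical names, not
libavcodec functions.

```c
#include <assert.h>

/* Adds b to a. On x86 the addition is done in inline asm, with `a` declared
 * as a read-write ("+r") operand so that a single operand serves as both
 * input and output. */
int sketch_add(int a, int b)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ ("addl %1, %0"
             : "+r" (a)   /* operand 0: read and written */
             : "r"  (b)   /* operand 1: input only */
             : "cc");     /* the add modifies the flags */
#else
    a += b;               /* portable fallback for other targets */
#endif
    return a;
}

/* Usage: the asm path and the plain C path must give the same result. */
void sketch_add_demo(void)
{
    assert(sketch_add(2, 3) == 5);
}
```

Without "+", the same effect would need a separate output operand plus a
matching input constraint such as "0", which is more verbose.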