mplayer.hg: libvo/aclib_template.c annotate

annotate libvo/aclib_template.c @ 28421:6284666d3794

Remove the Present call after adapter reinitialization, it can not work anyway since no video frame is uploaded to the new context yet.

author	reimar
date	Tue, 03 Feb 2009 10:54:27 +0000
parents	31287e75b5d8
children	7681eab10aea

rev	line source
3077 99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	1 /*
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	2 aclib - advanced C library ;)
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	3 This file contains functions which improve and expand standard C-library
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	4 */
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	5
28290 25337a2147e7 Lots and lots of #ifdef ARCH_... -> #if ARCH_... reimar parents: 27757 diff changeset	6 #if !HAVE_SSE2
1123 5b69dabe5823 Issues about P3 performance and SSE2 support. nickols_k parents: 698 diff changeset	7 /*
5b69dabe5823 Issues about P3 performance and SSE2 support. nickols_k parents: 698 diff changeset	8 P3 processor has only one SSE decoder so can execute only 1 sse insn per
5b69dabe5823 Issues about P3 performance and SSE2 support. nickols_k parents: 698 diff changeset	9 cpu clock, but it has 3 mmx decoders (include load/store unit)
5b69dabe5823 Issues about P3 performance and SSE2 support. nickols_k parents: 698 diff changeset	10 and executes 3 mmx insns per cpu clock.
5b69dabe5823 Issues about P3 performance and SSE2 support. nickols_k parents: 698 diff changeset	11 P4 processor has some chances, but after reading:
5b69dabe5823 Issues about P3 performance and SSE2 support. nickols_k parents: 698 diff changeset	12 http://www.emulators.com/pentium4.htm
5b69dabe5823 Issues about P3 performance and SSE2 support. nickols_k parents: 698 diff changeset	13 I have doubts. Anyway SSE2 version of this code can be written better.
5b69dabe5823 Issues about P3 performance and SSE2 support. nickols_k parents: 698 diff changeset	14 */
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	15 #undef HAVE_SSE
28290 25337a2147e7 Lots and lots of #ifdef ARCH_... -> #if ARCH_... reimar parents: 27757 diff changeset	16 #define HAVE_SSE 0
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	17 #endif
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	18
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	19
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	20 /*
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	21 This part of the code was taken by me from Linux-2.4.3 and slightly modified
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	22 for the MMX, MMX2 and SSE instruction sets. I did this because Linux uses page-aligned
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	23 blocks, but mplayer uses weakly ordered data and the original sources cannot
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	24 speed them up. Only using PREFETCHNTA and MOVNTQ together has an effect!
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	25
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	26 From IA-32 Intel Architecture Software Developer's Manual Volume 1,
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	27
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	28 Order Number 245470:
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	29 "10.4.6. Cacheability Control, Prefetch, and Memory Ordering Instructions"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	30
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	31 Data referenced by a program can be temporal (data will be used again) or
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	32 non-temporal (data will be referenced once and not reused in the immediate
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	33 future). To make efficient use of the processor's caches, it is generally
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	34 desirable to cache temporal data and not cache non-temporal data. Overloading
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	35 the processor's caches with non-temporal data is sometimes referred to as
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	36 "polluting the caches".
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	37 The non-temporal data is written to memory with Write-Combining semantics.
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	38
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	39 The PREFETCHh instructions permits a program to load data into the processor
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	40 at a suggested cache level, so that it is closer to the processors load and
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	41 store unit when it is needed. If the data is already present in a level of
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	42 the cache hierarchy that is closer to the processor, the PREFETCHh instruction
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	43 will not result in any data movement.
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	44 But we should use PREFETCHNTA: it fetches non-temporal data into a location
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	45 close to the processor, minimizing cache pollution.
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	46
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	47 The MOVNTQ (store quadword using non-temporal hint) instruction stores
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	48 packed integer data from an MMX register to memory, using a non-temporal hint.
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	49 The MOVNTPS (store packed single-precision floating-point values using
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	50 non-temporal hint) instruction stores packed floating-point data from an
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	51 XMM register to memory, using a non-temporal hint.
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	52
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	53 The SFENCE (Store Fence) instruction controls write ordering by creating a
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	54 fence for memory store operations. This instruction guarantees that the results
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	55 of every store instruction that precedes the store fence in program order is
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	56 globally visible before any store instruction that follows the fence. The
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	57 SFENCE instruction provides an efficient way of ensuring ordering between
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	58 procedures that produce weakly-ordered data and procedures that consume that
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	59 data.
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	60
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	61 If you have questions please contact me: Nick Kurshev: nickols_k@mail.ru.
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	62 */
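The comment above describes combining a non-temporal prefetch, non-temporal stores and a store fence. A minimal sketch of that combination using compiler intrinsics rather than the inline assembly of this file might look as follows. The helper name `stream_copy64` is hypothetical and not part of aclib; it assumes a 16-byte aligned destination and a length that is a multiple of 64, and it falls back to plain `memcpy` when SSE is not available at compile time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE__)
#include <xmmintrin.h> /* _mm_prefetch, _mm_loadu_ps, _mm_stream_ps, _mm_sfence */
#endif

/* Hypothetical helper, not part of aclib: copies len bytes in 64-byte blocks.
   Assumption: dst is 16-byte aligned and len is a multiple of 64. */
static void stream_copy64(void *dst, const void *src, size_t len)
{
    assert(((uintptr_t)dst & 15) == 0);
    assert((len & 63) == 0);
#if defined(__SSE__)
    const float *s = (const float *)src;
    float *d = (float *)dst;
    size_t blocks;

    for (blocks = len >> 6; blocks > 0; blocks--) {
        __m128 x0, x1, x2, x3;
        /* Prefetch 320 bytes ahead with the non-temporal hint, but only
           while that address still lies inside the source buffer. */
        if (blocks > 5)
            _mm_prefetch((const char *)s + 320, _MM_HINT_NTA);
        /* Unaligned loads (movups), so src may have any alignment. */
        x0 = _mm_loadu_ps(s);
        x1 = _mm_loadu_ps(s + 4);
        x2 = _mm_loadu_ps(s + 8);
        x3 = _mm_loadu_ps(s + 12);
        /* Non-temporal stores (movntps) bypass the caches. */
        _mm_stream_ps(d, x0);
        _mm_stream_ps(d + 4, x1);
        _mm_stream_ps(d + 8, x2);
        _mm_stream_ps(d + 12, x3);
        s += 16;
        d += 16;
    }
    /* Make the weakly ordered stores globally visible before returning. */
    _mm_sfence();
#else
    memcpy(dst, src, len);
#endif
}
```

The fence is issued once after the loop in this sketch; whether that placement matches the rest of this file is not shown in the excerpt.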
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	63
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	64 // 3dnow memcpy support from kernel 2.4.2
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	65 // by Pontscho/fresh!mindworkz
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	66
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	67
23378 ef54df9f07d3 HAVE_MMX1 -> HAVE_ONLY_MMX1 (makes more sense ...) michael parents: 15639 diff changeset	68 #undef HAVE_ONLY_MMX1
28335 31287e75b5d8 HAVE_3DNOW --> HAVE_AMD3DNOW diego parents: 28290 diff changeset	69 #if HAVE_MMX && !HAVE_MMX2 && !HAVE_AMD3DNOW && !HAVE_SSE
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	70 /* means: mmx v.1. Note: Since we added alignment of the destination, it speeds up
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	71 memory copying on PentMMX, Celeron-1 and P2 by up to 12% versus the
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	72 standard (non MMX-optimized) version.
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	73 Note: on K6-2+ it speeds up memory copying by up to 25% and
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	74 on K7 and P3 by about 500% (5 times). */
23378 ef54df9f07d3 HAVE_MMX1 -> HAVE_ONLY_MMX1 (makes more sense ...) michael parents: 15639 diff changeset	75 #define HAVE_ONLY_MMX1
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	76 #endif
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	77
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	78
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	79 #undef HAVE_K6_2PLUS
28335 31287e75b5d8 HAVE_3DNOW --> HAVE_AMD3DNOW diego parents: 28290 diff changeset	80 #if !HAVE_MMX2 && HAVE_AMD3DNOW
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	81 #define HAVE_K6_2PLUS
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	82 #endif
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	83
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	84 /* for small memory blocks (<256 bytes) this version is faster */
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	85 #define small_memcpy(to,from,n)\
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	86 {\
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	87 register unsigned long int dummy;\
27757 b5a46071062a Replace all occurrences of '__volatile__' and '__volatile' by plain 'volatile'. diego parents: 27754 diff changeset	88 __asm__ volatile(\
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	89 "rep; movsb"\
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	90 :"=&D"(to), "=&S"(from), "=&c"(dummy)\
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	91 /* It's most portable way to notify compiler */\
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	92 /* that edi, esi and ecx are clobbered in asm block. */\
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	93 /* Thanks to A'rpi for hint!!! */\
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	94 :"0" (to), "1" (from),"2" (n)\
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	95 : "memory");\
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	96 }
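Because `to` and `from` are tied to the EDI and ESI output operands, the macro above leaves both pointers advanced past the copied bytes once `rep; movsb` finishes. A portable stand-in that makes this side effect explicit could look like the sketch below. The function name `small_copy_advance` is hypothetical, and it takes pointers to the pointers because plain C cannot update its arguments the way the macro does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical portable stand-in for small_memcpy, not part of aclib.
   Copies n bytes and, like the asm macro, leaves both pointers advanced by n. */
static void small_copy_advance(unsigned char **to, const unsigned char **from,
                               size_t n)
{
    assert(to && from && *to && *from);
    while (n--)
        *(*to)++ = *(*from)++; /* one byte per step, as movsb does */
}
```

After a call with `n` bytes, `*to` and `*from` each point `n` bytes further on, so a caller can continue copying from where the helper stopped.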
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	97
3393 3624cd351618 runtime cpu detection michael parents: 3077 diff changeset	98 #undef MMREG_SIZE
28290 25337a2147e7 Lots and lots of #ifdef ARCH_... -> #if ARCH_... reimar parents: 27757 diff changeset	99 #if HAVE_SSE
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	100 #define MMREG_SIZE 16
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	101 #else
3077 99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	102 #define MMREG_SIZE 64 //8
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	103 #endif
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	104
3393 3624cd351618 runtime cpu detection michael parents: 3077 diff changeset	105 #undef PREFETCH
3624cd351618 runtime cpu detection michael parents: 3077 diff changeset	106 #undef EMMS
5660 4dcc7af65eec pre mmx2/3dnow fix michael parents: 4684 diff changeset	107
28290 25337a2147e7 Lots and lots of #ifdef ARCH_... -> #if ARCH_... reimar parents: 27757 diff changeset	108 #if HAVE_MMX2
5662 663ca5050f7e prefer prefetchnta if its available michael parents: 5660 diff changeset	109 #define PREFETCH "prefetchnta"
28335 31287e75b5d8 HAVE_3DNOW --> HAVE_AMD3DNOW diego parents: 28290 diff changeset	110 #elif HAVE_AMD3DNOW
5660 4dcc7af65eec pre mmx2/3dnow fix michael parents: 4684 diff changeset	111 #define PREFETCH "prefetch"
4dcc7af65eec pre mmx2/3dnow fix michael parents: 4684 diff changeset	112 #else
25973 ef4297ed0d12 libvo: change asm syntax to use ASMALIGN and " # nop" uau parents: 23378 diff changeset	113 #define PREFETCH " # nop"
5660 4dcc7af65eec pre mmx2/3dnow fix michael parents: 4684 diff changeset	114 #endif
4dcc7af65eec pre mmx2/3dnow fix michael parents: 4684 diff changeset	115
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	116 /* On K6 femms is faster than emms. On K7 femms is directly mapped to emms. */
28335 31287e75b5d8 HAVE_3DNOW --> HAVE_AMD3DNOW diego parents: 28290 diff changeset	117 #if HAVE_AMD3DNOW
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	118 #define EMMS "femms"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	119 #else
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	120 #define EMMS "emms"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	121 #endif
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	122
3393 3624cd351618 runtime cpu detection michael parents: 3077 diff changeset	123 #undef MOVNTQ
28290 25337a2147e7 Lots and lots of #ifdef ARCH_... -> #if ARCH_... reimar parents: 27757 diff changeset	124 #if HAVE_MMX2
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	125 #define MOVNTQ "movntq"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	126 #else
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	127 #define MOVNTQ "movq"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	128 #endif
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	129
3393 3624cd351618 runtime cpu detection michael parents: 3077 diff changeset	130 #undef MIN_LEN
23378 ef54df9f07d3 HAVE_MMX1 -> HAVE_ONLY_MMX1 (makes more sense ...) michael parents: 15639 diff changeset	131 #ifdef HAVE_ONLY_MMX1
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	132 #define MIN_LEN 0x800 /* 2K blocks */
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	133 #else
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	134 #define MIN_LEN 0x40 /* 64-byte blocks */
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	135 #endif
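MIN_LEN and MMREG_SIZE drive how fast_memcpy splits a copy: a short head that brings the destination up to an MMREG_SIZE boundary, a run of 64-byte blocks, and a remainder of fewer than 64 bytes. The arithmetic can be sketched in isolation as below. The struct `copy_plan` and the function `plan_copy` are hypothetical names introduced for illustration. For lengths below the threshold the sketch simply reports the whole length as the tail, which is an assumption, since that path lies outside the lines shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration type, not part of aclib. */
struct copy_plan {
    size_t head;   /* bytes copied first to align the destination */
    size_t blocks; /* number of 64-byte blocks */
    size_t tail;   /* bytes left over after the blocks */
};

/* Mirrors the delta / len>>6 / len&63 arithmetic of fast_memcpy.
   Assumption: mmreg_size is a power of two and min_len >= mmreg_size. */
static struct copy_plan plan_copy(uintptr_t to, size_t len,
                                  size_t mmreg_size, size_t min_len)
{
    struct copy_plan p = {0, 0, len};

    assert(mmreg_size != 0 && (mmreg_size & (mmreg_size - 1)) == 0);
    assert(min_len >= mmreg_size);
    if (len >= min_len) {
        size_t delta = to & (mmreg_size - 1);
        if (delta) {
            delta = mmreg_size - delta; /* distance to the next boundary */
            len -= delta;
        }
        p.head = delta;
        p.blocks = len >> 6; /* len / 64 */
        p.tail = len & 63;   /* len % 64 */
    }
    return p;
}
```

For example, a 200-byte copy to address 0x1003 with a 16-byte register size gives a 13-byte head, two 64-byte blocks and a 59-byte tail.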
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	136
7072 113d66d78967 removed nonsense 'inline' arpi parents: 5662 diff changeset	137 static void * RENAME(fast_memcpy)(void * to, const void * from, size_t len)
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	138 {
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	139 void *retval;
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	140 size_t i;
3077 99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	141 retval = to;
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	142 #ifdef STATISTICS
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	143 {
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	144 static int freq[33];
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	145 static int t=0;
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	146 int i;
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	147 for(i=0; len>(1<<i); i++);
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	148 freq[i]++;
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	149 t++;
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	150 if(1024*1024*1024 % t == 0)
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	151 for(i=0; i<32; i++)
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	152 printf("freq < %8d %4d\n", 1<<i, freq[i]);
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	153 }
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	154 #endif
23378 ef54df9f07d3 HAVE_MMX1 -> HAVE_ONLY_MMX1 (makes more sense ...) michael parents: 15639 diff changeset	155 #ifndef HAVE_ONLY_MMX1
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	156 /* PREFETCH has effect even for MOVSB instruction ;) */
27757 b5a46071062a Replace all occurrences of '__volatile__' and '__volatile' by plain 'volatile'. diego parents: 27754 diff changeset	157 __asm__ volatile (
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	158 PREFETCH" (%0)\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	159 PREFETCH" 64(%0)\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	160 PREFETCH" 128(%0)\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	161 PREFETCH" 192(%0)\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	162 PREFETCH" 256(%0)\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	163 : : "r" (from) );
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	164 #endif
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	165 if(len >= MIN_LEN)
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	166 {
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	167 register unsigned long int delta;
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	168 /* Align destination to MMREG_SIZE boundary */
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	169 delta = ((unsigned long int)to)&(MMREG_SIZE-1);
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	170 if(delta)
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	171 {
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	172 delta=MMREG_SIZE-delta;
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	173 len -= delta;
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	174 small_memcpy(to, from, delta);
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	175 }
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	176 i = len >> 6; /* len/64 */
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	177 len&=63;
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	178 /*
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	179 This algorithm is most effective when the code consecutively
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	180 reads and writes blocks the size of a cache line.
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	181 The size of a cache line is processor-dependent.
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	182 It will, however, be a minimum of 32 bytes on any processor.
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	183 It would be better for the number of instructions which
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	184 perform reading and writing to be a multiple of the number of
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	185 processor's decoders, but it's not always possible.
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	186 */
28290 25337a2147e7 Lots and lots of #ifdef ARCH_... -> #if ARCH_... reimar parents: 27757 diff changeset	187 #if HAVE_SSE /* Only P3 (may be Cyrix3) */
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	188 if(((unsigned long)from) & 15)
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	189 /* if SRC is misaligned */
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	190 for(; i>0; i--)
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	191 {
27757 b5a46071062a Replace all occurrences of '__volatile__' and '__volatile' by plain 'volatile'. diego parents: 27754 diff changeset	192 __asm__ volatile (
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	193 PREFETCH" 320(%0)\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	194 "movups (%0), %%xmm0\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	195 "movups 16(%0), %%xmm1\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	196 "movups 32(%0), %%xmm2\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	197 "movups 48(%0), %%xmm3\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	198 "movntps %%xmm0, (%1)\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	199 "movntps %%xmm1, 16(%1)\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	200 "movntps %%xmm2, 32(%1)\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	201 "movntps %%xmm3, 48(%1)\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	202 :: "r" (from), "r" (to) : "memory");
14565 1a13df0d4fc2 Make this file compile with gcc-4.0.0. The old code was invalid C. gpoirier parents: 13720 diff changeset	203 from=((const unsigned char *) from)+64;
1a13df0d4fc2 Make this file compile with gcc-4.0.0. The old code was invalid C. gpoirier parents: 13720 diff changeset	204 to=((unsigned char *)to)+64;
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	205 }
3077 99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	206 else
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	207 /*
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	208 Only if SRC is aligned on a 16-byte boundary.
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	209 This allows using movaps instead of movups; movaps requires the data
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	210 to be aligned, otherwise a general-protection exception (#GP) is generated.
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	211 */
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	212 for(; i>0; i--)
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	213 {
27757 b5a46071062a Replace all occurrences of '__volatile__' and '__volatile' by plain 'volatile'. diego parents: 27754 diff changeset	214 __asm__ volatile (
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	215 PREFETCH" 320(%0)\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	216 "movaps (%0), %%xmm0\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	217 "movaps 16(%0), %%xmm1\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	218 "movaps 32(%0), %%xmm2\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	219 "movaps 48(%0), %%xmm3\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	220 "movntps %%xmm0, (%1)\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	221 "movntps %%xmm1, 16(%1)\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	222 "movntps %%xmm2, 32(%1)\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	223 "movntps %%xmm3, 48(%1)\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	224 :: "r" (from), "r" (to) : "memory");
14565 1a13df0d4fc2 Make this file compile with gcc-4.0.0. The old code was invalid C. gpoirier parents: 13720 diff changeset	225 from=((const unsigned char *)from)+64;
1a13df0d4fc2 Make this file compile with gcc-4.0.0. The old code was invalid C. gpoirier parents: 13720 diff changeset	226 to=((unsigned char *)to)+64;
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	227 }
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	228 #else
3077 99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	229 // Align destination at BLOCK_SIZE boundary
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	230 for(; ((int)to & (BLOCK_SIZE-1)) && i>0; i--)
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	231 {
27757 b5a46071062a Replace all occurrences of '__volatile__' and '__volatile' by plain 'volatile'. diego parents: 27754 diff changeset	232 __asm__ volatile (
23378 ef54df9f07d3 HAVE_MMX1 -> HAVE_ONLY_MMX1 (makes more sense ...) michael parents: 15639 diff changeset	233 #ifndef HAVE_ONLY_MMX1
3077 99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	234 PREFETCH" 320(%0)\n"
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	235 #endif
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	236 "movq (%0), %%mm0\n"
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	237 "movq 8(%0), %%mm1\n"
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	238 "movq 16(%0), %%mm2\n"
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	239 "movq 24(%0), %%mm3\n"
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	240 "movq 32(%0), %%mm4\n"
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	241 "movq 40(%0), %%mm5\n"
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	242 "movq 48(%0), %%mm6\n"
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	243 "movq 56(%0), %%mm7\n"
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	244 MOVNTQ" %%mm0, (%1)\n"
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	245 MOVNTQ" %%mm1, 8(%1)\n"
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	246 MOVNTQ" %%mm2, 16(%1)\n"
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	247 MOVNTQ" %%mm3, 24(%1)\n"
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	248 MOVNTQ" %%mm4, 32(%1)\n"
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	249 MOVNTQ" %%mm5, 40(%1)\n"
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	250 MOVNTQ" %%mm6, 48(%1)\n"
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	251 MOVNTQ" %%mm7, 56(%1)\n"
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	252 :: "r" (from), "r" (to) : "memory");
15639 f26450da61a1 More gcc-4.0 fixes gpoirier parents: 14565 diff changeset	253 from=((const unsigned char *)from)+64;
f26450da61a1 More gcc-4.0 fixes gpoirier parents: 14565 diff changeset	254 to=((unsigned char *)to)+64;
3077 99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	255 }
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	256
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	257 // printf(" %d %d\n", (int)from&1023, (int)to&1023);
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	258 // Pure assembly because gcc is a bit unpredictable ;)
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	259 if(i>=BLOCK_SIZE/64)
27754 08d18fe9da52 Change all occurrences of asm and __asm to __asm__, same as was done for FFmpeg. diego parents: 25973 diff changeset	260 __asm__ volatile(
13720 821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	261 "xor %%"REG_a", %%"REG_a" \n\t"
25973 ef4297ed0d12 libvo: change asm syntax to use ASMALIGN and " # nop" uau parents: 23378 diff changeset	262 ASMALIGN(4)
3077 99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	263 "1: \n\t"
13720 821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	264 "movl (%0, %%"REG_a"), %%ebx \n\t"
821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	265 "movl 32(%0, %%"REG_a"), %%ebx \n\t"
821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	266 "movl 64(%0, %%"REG_a"), %%ebx \n\t"
821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	267 "movl 96(%0, %%"REG_a"), %%ebx \n\t"
821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	268 "add $128, %%"REG_a" \n\t"
821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	269 "cmp %3, %%"REG_a" \n\t"
3077 99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	270 " jb 1b \n\t"
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	271
13720 821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	272 "xor %%"REG_a", %%"REG_a" \n\t"
3077 99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	273
25973 ef4297ed0d12 libvo: change asm syntax to use ASMALIGN and " # nop" uau parents: 23378 diff changeset	274 ASMALIGN(4)
3077 99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	275 "2: \n\t"
13720 821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	276 "movq (%0, %%"REG_a"), %%mm0\n"
821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	277 "movq 8(%0, %%"REG_a"), %%mm1\n"
821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	278 "movq 16(%0, %%"REG_a"), %%mm2\n"
821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	279 "movq 24(%0, %%"REG_a"), %%mm3\n"
821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	280 "movq 32(%0, %%"REG_a"), %%mm4\n"
821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	281 "movq 40(%0, %%"REG_a"), %%mm5\n"
821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	282 "movq 48(%0, %%"REG_a"), %%mm6\n"
821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	283 "movq 56(%0, %%"REG_a"), %%mm7\n"
821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	284 MOVNTQ" %%mm0, (%1, %%"REG_a")\n"
821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	285 MOVNTQ" %%mm1, 8(%1, %%"REG_a")\n"
821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	286 MOVNTQ" %%mm2, 16(%1, %%"REG_a")\n"
821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	287 MOVNTQ" %%mm3, 24(%1, %%"REG_a")\n"
821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	288 MOVNTQ" %%mm4, 32(%1, %%"REG_a")\n"
821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	289 MOVNTQ" %%mm5, 40(%1, %%"REG_a")\n"
821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	290 MOVNTQ" %%mm6, 48(%1, %%"REG_a")\n"
821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	291 MOVNTQ" %%mm7, 56(%1, %%"REG_a")\n"
821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	292 "add $64, %%"REG_a" \n\t"
821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	293 "cmp %3, %%"REG_a" \n\t"
3077 99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	294 "jb 2b \n\t"
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	295
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	296 #if CONFUSION_FACTOR > 0
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	297 // a few percent speedup on out-of-order CPUs
13720 821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	298 "mov %5, %%"REG_a" \n\t"
3077 99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	299 "2: \n\t"
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	300 "movl (%0), %%ebx \n\t"
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	301 "movl (%0), %%ebx \n\t"
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	302 "movl (%0), %%ebx \n\t"
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	303 "movl (%0), %%ebx \n\t"
13720 821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	304 "dec %%"REG_a" \n\t"
3077 99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	305 " jnz 2b \n\t"
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	306 #endif
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	307
13720 821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	308 "xor %%"REG_a", %%"REG_a" \n\t"
821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	309 "add %3, %0 \n\t"
821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	310 "add %3, %1 \n\t"
821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	311 "sub %4, %2 \n\t"
821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	312 "cmp %4, %2 \n\t"
3077 99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	313 " jae 1b \n\t"
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	314 : "+r" (from), "+r" (to), "+r" (i)
13720 821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	315 : "r" ((long)BLOCK_SIZE), "i" (BLOCK_SIZE/64), "i" ((long)CONFUSION_FACTOR)
821f464b4d90 adapting existing mmx/mmx2/sse/3dnow optimizations so they work on x86_64 aurel parents: 7072 diff changeset	316 : "%"REG_a, "%ebx", "memory"
3077 99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	317 );
99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	318
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	319 for(; i>0; i--)
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	320 {
27757 b5a46071062a Replace all occurrences of '__volatile__' and '__volatile' by plain 'volatile'. diego parents: 27754 diff changeset	321 __asm__ volatile (
23378 ef54df9f07d3 HAVE_MMX1 -> HAVE_ONLY_MMX1 (makes more sense ...) michael parents: 15639 diff changeset	322 #ifndef HAVE_ONLY_MMX1
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	323 PREFETCH" 320(%0)\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	324 #endif
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	325 "movq (%0), %%mm0\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	326 "movq 8(%0), %%mm1\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	327 "movq 16(%0), %%mm2\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	328 "movq 24(%0), %%mm3\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	329 "movq 32(%0), %%mm4\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	330 "movq 40(%0), %%mm5\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	331 "movq 48(%0), %%mm6\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	332 "movq 56(%0), %%mm7\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	333 MOVNTQ" %%mm0, (%1)\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	334 MOVNTQ" %%mm1, 8(%1)\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	335 MOVNTQ" %%mm2, 16(%1)\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	336 MOVNTQ" %%mm3, 24(%1)\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	337 MOVNTQ" %%mm4, 32(%1)\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	338 MOVNTQ" %%mm5, 40(%1)\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	339 MOVNTQ" %%mm6, 48(%1)\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	340 MOVNTQ" %%mm7, 56(%1)\n"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	341 :: "r" (from), "r" (to) : "memory");
15639 f26450da61a1 More gcc-4.0 fixes gpoirier parents: 14565 diff changeset	342 from=((const unsigned char *)from)+64;
f26450da61a1 More gcc-4.0 fixes gpoirier parents: 14565 diff changeset	343 to=((unsigned char *)to)+64;
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	344 }
3077 99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	345
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	346 #endif /* Have SSE */
28290 25337a2147e7 Lots and lots of #ifdef ARCH_... -> #if ARCH_... reimar parents: 27757 diff changeset	347 #if HAVE_MMX2
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	348 /* since movntq is weakly-ordered, a "sfence"
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	349 * is needed to become ordered again. */
27757 b5a46071062a Replace all occurrences of '__volatile__' and '__volatile' by plain 'volatile'. diego parents: 27754 diff changeset	350 __asm__ volatile ("sfence":::"memory");
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	351 #endif
28290 25337a2147e7 Lots and lots of #ifdef ARCH_... -> #if ARCH_... reimar parents: 27757 diff changeset	352 #if !HAVE_SSE
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	353 /* emms: leave MMX state so the FPU can be used again */
27757 b5a46071062a Replace all occurrences of '__volatile__' and '__volatile' by plain 'volatile'. diego parents: 27754 diff changeset	354 __asm__ volatile (EMMS:::"memory");
3077 99f6db3255aa 10-20% faster fastmemcpy :) on my p3 at least but the algo is mostly from "amd athlon processor x86 code optimization guide" so it should be faster for amd chips too, but i fear it might be slower for mem->vram copies (someone should check that, i cant) ... there are 2 #defines to finetune it (BLOCK_SIZE & CONFUSION_FACTOR) michael parents: 1123 diff changeset	355 #endif
698 f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	356 }
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	357 /*
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	358 * Now do the tail of the block
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	359 */
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	360 if(len) small_memcpy(to, from, len);
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	361 return retval;
f0fbf1a9bf31 Moving fast_memcpy to separate file (Size optimization) nickols_k parents: diff changeset	362 }
4681 8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	363
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	364 /**
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	365 * special copy routine for mem -> agp/pci copy (based upon fast_memcpy)
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	366 */
7072 113d66d78967 removed nonsense 'inline' arpi parents: 5662 diff changeset	367 static void * RENAME(mem2agpcpy)(void * to, const void * from, size_t len)
4681 8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	368 {
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	369 void *retval;
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	370 size_t i;
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	371 retval = to;
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	372 #ifdef STATISTICS
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	373 {
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	374 static int freq[33];
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	375 static int t=0;
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	376 int i;
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	377 for(i=0; len>(1<<i); i++);
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	378 freq[i]++;
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	379 t++;
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	380 if(1024*1024*1024 % t == 0)
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	381 for(i=0; i<32; i++)
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	382 printf("mem2agp freq < %8d %4d\n", 1<<i, freq[i]);
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	383 }
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	384 #endif
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	385 if(len >= MIN_LEN)
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	386 {
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	387 register unsigned long int delta;
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	388 /* Align destination to MMREG_SIZE boundary */
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	389 delta = ((unsigned long int)to)&7;
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	390 if(delta)
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	391 {
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	392 delta=8-delta;
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	393 len -= delta;
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	394 small_memcpy(to, from, delta);
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	395 }
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	396 i = len >> 6; /* len/64 */
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	397 len &= 63;
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	398 /*
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	399 This algorithm is most effective when the code sequentially
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	400 reads and writes blocks the size of a cache line.
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	401 The cache line size is processor-dependent.
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	402 It will, however, be at least 32 bytes on any processor.
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	403 Ideally the number of instructions that perform the reads
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	404 and writes would be a multiple of the number of processor
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	405 decoders, but that is not always possible.
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	406 */
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	407 for(; i>0; i--)
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	408 {
27757 b5a46071062a Replace all occurrences of '__volatile__' and '__volatile' by plain 'volatile'. diego parents: 27754 diff changeset	409 __asm__ volatile (
4681 8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	410 PREFETCH" 320(%0)\n"
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	411 "movq (%0), %%mm0\n"
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	412 "movq 8(%0), %%mm1\n"
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	413 "movq 16(%0), %%mm2\n"
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	414 "movq 24(%0), %%mm3\n"
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	415 "movq 32(%0), %%mm4\n"
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	416 "movq 40(%0), %%mm5\n"
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	417 "movq 48(%0), %%mm6\n"
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	418 "movq 56(%0), %%mm7\n"
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	419 MOVNTQ" %%mm0, (%1)\n"
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	420 MOVNTQ" %%mm1, 8(%1)\n"
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	421 MOVNTQ" %%mm2, 16(%1)\n"
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	422 MOVNTQ" %%mm3, 24(%1)\n"
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	423 MOVNTQ" %%mm4, 32(%1)\n"
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	424 MOVNTQ" %%mm5, 40(%1)\n"
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	425 MOVNTQ" %%mm6, 48(%1)\n"
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	426 MOVNTQ" %%mm7, 56(%1)\n"
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	427 :: "r" (from), "r" (to) : "memory");
14565 1a13df0d4fc2 Make this file compile with gcc-4.0.0. The old code was invalid C. gpoirier parents: 13720 diff changeset	428 from=((const unsigned char *)from)+64;
1a13df0d4fc2 Make this file compile with gcc-4.0.0. The old code was invalid C. gpoirier parents: 13720 diff changeset	429 to=((unsigned char *)to)+64;
4681 8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	430 }
28290 25337a2147e7 Lots and lots of #ifdef ARCH_... -> #if ARCH_... reimar parents: 27757 diff changeset	431 #if HAVE_MMX2
4681 8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	432 /* since movntq is weakly-ordered, a "sfence"
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	433 * is needed to become ordered again. */
27757 b5a46071062a Replace all occurrences of '__volatile__' and '__volatile' by plain 'volatile'. diego parents: 27754 diff changeset	434 __asm__ volatile ("sfence":::"memory");
4681 8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	435 #endif
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	436 /* emms: leave MMX state so the FPU can be used again */
27757 b5a46071062a Replace all occurrences of '__volatile__' and '__volatile' by plain 'volatile'. diego parents: 27754 diff changeset	437 __asm__ volatile (EMMS:::"memory");
4681 8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	438 }
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	439 /*
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	440 * Now do the tail of the block
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	441 */
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	442 if(len) small_memcpy(to, from, len);
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	443 return retval;
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	444 }
8db59073127e mem2agpcpy() michael parents: 3393 diff changeset	445
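
The listing above copies in three phases: a short head copy that aligns the destination to 8 bytes, a loop over 64-byte blocks, and a tail copy for the remaining bytes. The following is a minimal portable sketch of that control flow only, not the file's actual implementation. It replaces the MMX `movq`/`MOVNTQ` sequence with plain `memcpy` calls, so it needs no `sfence` or `emms`. The names `sketch_blockcpy` and `SKETCH_MIN_LEN` are hypothetical, and the threshold value 64 is an assumption because the definition of `MIN_LEN` is not shown in this part of the file.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed threshold; stands in for MIN_LEN, whose value is not shown above. */
#define SKETCH_MIN_LEN 64

/* Hypothetical helper: same head / 64-byte block / tail structure as
 * mem2agpcpy, but using memcpy instead of inline MMX assembly. */
static void *sketch_blockcpy(void *to, const void *from, size_t len)
{
    unsigned char *d = to;
    const unsigned char *s = from;

    if (len >= SKETCH_MIN_LEN) {
        /* Head: copy up to 7 bytes so the destination is 8-byte aligned. */
        size_t delta = (uintptr_t)d & 7;
        if (delta) {
            delta = 8 - delta;
            memcpy(d, s, delta);
            d += delta;
            s += delta;
            len -= delta;
        }

        /* Body: whole 64-byte blocks (len / 64), remainder kept in len. */
        size_t blocks = len >> 6;
        len &= 63;
        for (; blocks > 0; blocks--) {
            memcpy(d, s, 64);
            d += 64;
            s += 64;
        }
    }

    /* Tail: whatever is left (or everything, for short copies). */
    if (len)
        memcpy(d, s, len);

    return to;
}
```

Unlike the sketch, the original relies on `small_memcpy` to advance `to` and `from` itself, and on non-temporal stores to avoid polluting the cache when writing to video memory.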
