# HG changeset patch
# User atmosfear
# Date 988317358 0
# Node ID 88eb1a3f7bfbac49f4bbd4e9e6387bd49abfaa04
# Parent  f1301ff4b9796a9841e5155d5cd9a2e6d7b35898
Changed code, should be faster on Athlon/K6 but slower on PIII with SSE, more portable.

diff -r f1301ff4b979 -r 88eb1a3f7bfb libvo/fastmemcpy.h
--- a/libvo/fastmemcpy.h	Thu Apr 26 20:35:23 2001 +0000
+++ b/libvo/fastmemcpy.h	Thu Apr 26 20:35:58 2001 +0000
@@ -1,13 +1,17 @@
 #ifndef __MPLAYER_MEMCPY
-#define __MPLAYER_MEMCPY
+#define __MPLAYER_MEMCPY 1
+
+#ifdef USE_FASTMEMCPY
+#include <stddef.h>
 
 /*
 This part of code was taken by from Linux-2.4.3 and slightly modified
-for MMX2, SSE instruction set. I have done it since linux uses page aligned
+for MMX, MMX2, SSE instruction set. I have done it since linux uses page aligned
 blocks but mplayer uses weakly ordered data and original sources can not
 speedup them. Only using PREFETCHNTA and MOVNTQ together have effect!
 
-From IA-32 Intel Architecture Software Developer's Manual Volume 1,
+>From IA-32 Intel Architecture Software Developer's Manual Volume 1,
+Order Number 245470:
 
 "10.4.6. Cacheability Control, Prefetch, and Memory Ordering Instructions"
 
@@ -16,7 +20,7 @@
 future). To make efficient use of the processor's caches, it is generally
 desirable to cache temporal data and not cache non-temporal data. Overloading
 the processor's caches with non-temporal data is sometimes referred to as
-"polluting the caches".
+"polluting the caches". 
 The non-temporal data is written to memory with Write-Combining semantics.
 
 The PREFETCHh instructions permits a program to load data into the processor
@@ -47,7 +51,18 @@
 
 // 3dnow memcpy support from kernel 2.4.2
 //  by Pontscho/fresh!mindworkz
-#if defined( HAVE_MMX2 ) || defined( HAVE_3DNOW )
+#if defined( HAVE_MMX2 ) || defined( HAVE_3DNOW ) || defined( HAVE_MMX )
+
+#undef HAVE_MMX1
+#if defined(HAVE_MMX) && !defined(HAVE_MMX2) && !defined(HAVE_3DNOW) && !defined(HAVE_SSE)
+/*  means: MMX v.1. Note: since we added alignment of the destination, it speeds
+    up memory copying on PentMMX, Celeron-1 and P2 by up to 12% versus the
+    standard (non-MMX-optimized) version.
+    Note: on K6-2+ it speeds up memory copying by up to 25% and
+    on K7 and P3 by about 500% (5 times). */
+#define HAVE_MMX1
+#endif
+
 
 #undef HAVE_K6_2PLUS
 #if !defined( HAVE_MMX2) && defined( HAVE_3DNOW)
@@ -58,54 +73,65 @@
 #define small_memcpy(to,from,n)\
 {\
 __asm__ __volatile__(\
-	"rep ; movsb\n"\
-	::"D" (to), "S" (from),"c" (n)\
-	: "memory");\
+	"rep; movsb"\
+	:"=D"(to), "=S"(from), "=c"(n)\
+/* It's the most portable way to notify the compiler */\
+/* that edi, esi and ecx are clobbered in the asm block. */\
+/* Thanks to A'rpi for the hint!!! */\
+	:"0" (to), "1" (from), "2" (n)\
+	: "memory");\
 }
 
-inline static void * fast_memcpy(void * to, const void * from, unsigned len)
-{
-	void *p;
-	int i;
+#ifdef HAVE_SSE
+#define MMREG_SIZE 16
+#else
+#define MMREG_SIZE 8
+#endif
 
-#ifdef HAVE_SSE /* Only P3 (may be Cyrix3) */
-//  printf("fastmemcpy_pre(0x%X,0x%X,0x%X)\n",to,from,len);
-	// Align dest to 16-byte boundary:
-	if((unsigned long)to&15){
-	  int len2=16-((unsigned long)to&15);
-	  if(len>len2){
-	    len-=len2;
-	    __asm__ __volatile__(
-		"rep ; movsb\n"
-		:"=D" (to), "=S" (from)
-		: "D" (to), "S" (from),"c" (len2)
-		: "memory");
-	  }
-	}
-//  printf("fastmemcpy(0x%X,0x%X,0x%X)\n",to,from,len);
+/* Small defines (for readability only) ;) */
+#ifdef HAVE_K6_2PLUS
+#define PREFETCH "prefetch"
+/* On K6 femms is faster than emms. On K7 femms is directly mapped to emms. */
+#define EMMS     "femms"
+#else
+#define PREFETCH "prefetchnta"
+#define EMMS     "emms"
+#endif
+
+#ifdef HAVE_MMX2
+#define MOVNTQ "movntq"
+#else
+#define MOVNTQ "movq"
 #endif
 
+inline static void * fast_memcpy(void * to, const void * from, size_t len)
+{
+	void *retval;
+	int i;
+	retval = to;
 	if(len >= 0x200) /* 512-byte blocks */
-	{
-		p = to;
-		i = len >> 6; /* len/64 */
-		len&=63;
-
-	__asm__ __volatile__ (
-#ifdef HAVE_K6_2PLUS
-	"prefetch (%0)\n"
-	"prefetch 64(%0)\n"
-	"prefetch 128(%0)\n"
-	"prefetch 192(%0)\n"
-	"prefetch 256(%0)\n"
-#else /* K7, P3, CyrixIII */
-	"prefetchnta (%0)\n"
-	"prefetchnta 64(%0)\n"
-	"prefetchnta 128(%0)\n"
-	"prefetchnta 192(%0)\n"
-	"prefetchnta 256(%0)\n"
+	{
+	  register unsigned long int delta;
+	  /* Align destination to an MMREG_SIZE boundary */
+	  delta = ((unsigned long int)to)&(MMREG_SIZE-1);
+	  if(delta)
+	  {
+	    delta=MMREG_SIZE-delta;
+	    len -= delta;
+	    small_memcpy(to, from, delta);
+	  }
+	  i = len >> 6; /* len/64 */
+	  len&=63;
+
+#ifndef HAVE_MMX1
+	__asm__ __volatile__ (
+		PREFETCH" (%0)\n"
+		PREFETCH" 64(%0)\n"
+		PREFETCH" 128(%0)\n"
+		PREFETCH" 192(%0)\n"
+		PREFETCH" 256(%0)\n"
+		: : "r" (from) );
 #endif
-		: : "r" (from) );
 
 	/* This algorithm is top effective when the code consequently
 	   reads and writes blocks which have size of cache line.
@@ -116,114 +142,89 @@
 	   processor's decoders, but it's not always possible.
 	 */
 #ifdef HAVE_SSE /* Only P3 (may be Cyrix3) */
-	if(((unsigned long)from) & 15)
-	/* if SRC is misaligned */
-	for(; i>0; i--)
-	{
-		__asm__ __volatile__ (
-		"prefetchnta 320(%0)\n"
-		"movups (%0), %%xmm0\n"
-		"movups 16(%0), %%xmm1\n"
-		"movntps %%xmm0, (%1)\n"
-		"movntps %%xmm1, 16(%1)\n"
-		"movups 32(%0), %%xmm0\n"
-		"movups 48(%0), %%xmm1\n"
-		"movntps %%xmm0, 32(%1)\n"
-		"movntps %%xmm1, 48(%1)\n"
-		:: "r" (from), "r" (to) : "memory");
-		from+=64;
-		to+=64;
-	}
-	else
-	/*
-	   Only if SRC is aligned on 16-byte boundary.
-	   It allows to use movaps instead of movups, which required data
-	   to be aligned or a general-protection exception (#GP) is generated.
-	*/
-	for(; i>0; i--)
-	{
-		__asm__ __volatile__ (
-		"prefetchnta 320(%0)\n"
-		"movaps (%0), %%xmm0\n"
-		"movaps 16(%0), %%xmm1\n"
-		"movntps %%xmm0, (%1)\n"
-		"movntps %%xmm1, 16(%1)\n"
-		"movaps 32(%0), %%xmm0\n"
-		"movaps 48(%0), %%xmm1\n"
-		"movntps %%xmm0, 32(%1)\n"
-		"movntps %%xmm1, 48(%1)\n"
-		:: "r" (from), "r" (to) : "memory");
-		from+=64;
-		to+=64;
-	}
+	  if(((unsigned long)from) & 15)
+	  /* if SRC is misaligned */
+	  for(; i>0; i--)
+	  {
+		__asm__ __volatile__ (
+		PREFETCH" 320(%0)\n"
+		"movups (%0), %%xmm0\n"
+		"movups 16(%0), %%xmm1\n"
+		"movntps %%xmm0, (%1)\n"
+		"movntps %%xmm1, 16(%1)\n"
+		"movups 32(%0), %%xmm0\n"
+		"movups 48(%0), %%xmm1\n"
+		"movntps %%xmm0, 32(%1)\n"
+		"movntps %%xmm1, 48(%1)\n"
+		:: "r" (from), "r" (to) : "memory");
+		((const unsigned char *)from)+=64;
+		((unsigned char *)to)+=64;
+	  }
+	  else
+	  /*
+	     Only if SRC is aligned on a 16-byte boundary.
+	     This allows the use of movaps instead of movups, which requires the
+	     data to be aligned, or a general-protection exception (#GP) is generated.
+	  */
+	  for(; i>0; i--)
+	  {
+		__asm__ __volatile__ (
+		PREFETCH" 320(%0)\n"
+		"movaps (%0), %%xmm0\n"
+		"movaps 16(%0), %%xmm1\n"
+		"movntps %%xmm0, (%1)\n"
+		"movntps %%xmm1, 16(%1)\n"
+		"movaps 32(%0), %%xmm0\n"
+		"movaps 48(%0), %%xmm1\n"
+		"movntps %%xmm0, 32(%1)\n"
+		"movntps %%xmm1, 48(%1)\n"
+		:: "r" (from), "r" (to) : "memory");
+		((const unsigned char *)from)+=64;
+		((unsigned char *)to)+=64;
+	  }
 #else
-	for(; i>0; i--)
-	{
-		__asm__ __volatile__ (
-#ifdef HAVE_K6_2PLUS
-		"prefetch 320(%0)\n"
-#else
-		"prefetchnta 320(%0)\n"
+	  for(; i>0; i--)
+	  {
+		__asm__ __volatile__ (
+#ifndef HAVE_MMX1
+		PREFETCH" 320(%0)\n"
 #endif
-#ifdef HAVE_K6_2PLUS
-		"movq (%0), %%mm0\n"
-		"movq 8(%0), %%mm1\n"
-		"movq 16(%0), %%mm2\n"
-		"movq 24(%0), %%mm3\n"
-		"movq %%mm0, (%1)\n"
-		"movq %%mm1, 8(%1)\n"
-		"movq %%mm2, 16(%1)\n"
-		"movq %%mm3, 24(%1)\n"
-		"movq 32(%0), %%mm0\n"
-		"movq 40(%0), %%mm1\n"
-		"movq 48(%0), %%mm2\n"
-		"movq 56(%0), %%mm3\n"
-		"movq %%mm0, 32(%1)\n"
-		"movq %%mm1, 40(%1)\n"
-		"movq %%mm2, 48(%1)\n"
-		"movq %%mm3, 56(%1)\n"
-#else /* K7 */
-		"movq (%0), %%mm0\n"
-		"movq 8(%0), %%mm1\n"
-		"movq 16(%0), %%mm2\n"
-		"movq 24(%0), %%mm3\n"
-		"movntq %%mm0, (%1)\n"
-		"movntq %%mm1, 8(%1)\n"
-		"movntq %%mm2, 16(%1)\n"
-		"movntq %%mm3, 24(%1)\n"
-		"movq 32(%0), %%mm0\n"
-		"movq 40(%0), %%mm1\n"
-		"movq 48(%0), %%mm2\n"
-		"movq 56(%0), %%mm3\n"
-		"movntq %%mm0, 32(%1)\n"
-		"movntq %%mm1, 40(%1)\n"
-		"movntq %%mm2, 48(%1)\n"
-		"movntq %%mm3, 56(%1)\n"
+		"movq (%0), %%mm0\n"
+		"movq 8(%0), %%mm1\n"
+		"movq 16(%0), %%mm2\n"
+		"movq 24(%0), %%mm3\n"
+		MOVNTQ" %%mm0, (%1)\n"
+		MOVNTQ" %%mm1, 8(%1)\n"
+		MOVNTQ" %%mm2, 16(%1)\n"
+		MOVNTQ" %%mm3, 24(%1)\n"
+		"movq 32(%0), %%mm0\n"
+		"movq 40(%0), %%mm1\n"
+		"movq 48(%0), %%mm2\n"
+		"movq 56(%0), %%mm3\n"
+		MOVNTQ" %%mm0, 32(%1)\n"
+		MOVNTQ" %%mm1, 40(%1)\n"
+		MOVNTQ" %%mm2, 48(%1)\n"
+		MOVNTQ" %%mm3, 56(%1)\n"
+		:: "r" (from), "r" (to) : "memory");
+		((const unsigned char *)from)+=64;
+		((unsigned char *)to)+=64;
+	  }
+#endif /* Have SSE */
+#ifdef HAVE_MMX2
+	  /* since movntq is weakly-ordered, an "sfence"
+	   * is needed to become ordered again. */
+	  __asm__ __volatile__ ("sfence":::"memory");
 #endif
-		:: "r" (from), "r" (to) : "memory");
-		from+=64;
-		to+=64;
-	}
-#endif /* Have SSE */
-#ifdef HAVE_K6_2PLUS
-	/* On K6 femms is fatser of emms.
-	   On K7 femms is directly mapped on emms. */
-	__asm__ __volatile__ ("femms":::"memory");
-#else /* K7, P3, CyrixIII */
-	/* since movntq is weakly-ordered, a "sfence"
-	 * is needed to become ordered again. */
-	__asm__ __volatile__ ("sfence":::"memory");
-#ifndef HAVE_SSE
-	/* enables to use FPU */
-	__asm__ __volatile__ ("emms":::"memory");
-#endif
+#ifndef HAVE_SSE
+	  /* enable use of the FPU again */
+	  __asm__ __volatile__ (EMMS:::"memory");
 #endif
-	}
-	/*
-	 * Now do the tail of the block
-	 */
-	small_memcpy(to, from, len);
-	return p;
+	}
+	/*
+	 * Now do the tail of the block
+	 */
+	if(len) small_memcpy(to, from, len);
+	return retval;
 }
 #define memcpy(a,b,c) fast_memcpy(a,b,c)
 #undef small_memcpy
@@ -231,3 +232,5 @@
 
 #endif
 #endif
+
+#endif
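
A note on the small_memcpy rewrite in the patch above, with a standalone sketch. The new constraint list is the portability trick the in-line comment refers to: instead of naming edi/esi/ecx in the clobber list (which GCC rejects for registers that also appear as operands), the macro declares them as outputs and ties the inputs to them with matching constraints. The function below is only an illustration of that idiom, not part of the changeset; the name rep_movsb is made up here, and 32-bit x86 with GCC is assumed.

/* Sketch of the constraint idiom used by small_memcpy: "=D"/"=S"/"=c" declare
 * EDI, ESI and ECX as outputs, so the compiler knows they are modified, while
 * "0"/"1"/"2" tie the inputs to those same registers.  The "memory" clobber
 * covers the bytes written by movsb. */
static inline void rep_movsb(void *to, const void *from, unsigned long n)
{
    __asm__ __volatile__(
        "rep; movsb"
        : "=D" (to), "=S" (from), "=c" (n)   /* outputs: the clobbered registers */
        : "0" (to), "1" (from), "2" (n)      /* inputs reuse the same registers  */
        : "memory");
}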
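
For context, a minimal usage sketch (illustrative only, not part of the changeset): callers just include the header and keep calling memcpy(). It assumes the build defines USE_FASTMEMCPY together with one of HAVE_MMX / HAVE_MMX2 / HAVE_3DNOW / HAVE_SSE and compiles as 32-bit x86 with GCC; the buffer names and frame size below are hypothetical.

#include <string.h>        /* plain memcpy() remains the fallback             */
#include "fastmemcpy.h"    /* remaps memcpy() to fast_memcpy() when enabled   */

#define FRAME_BYTES (720 * 576 * 2)   /* hypothetical packed-YUV frame size */

static unsigned char src_frame[FRAME_BYTES];
static unsigned char dst_frame[FRAME_BYTES];

int main(void)
{
    /* Expands to fast_memcpy(dst_frame, src_frame, FRAME_BYTES): copies of
     * 512 bytes or more go through the 64-bytes-per-iteration MMX/SSE loop
     * (MOVNTQ/MOVNTPS, or plain MOVQ on MMX-only CPUs); anything smaller,
     * plus the tail, goes through the "rep movsb" small_memcpy path. */
    memcpy(dst_frame, src_frame, FRAME_BYTES);
    return 0;
}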