# HG changeset patch
# User lorenm
# Date 1202022311 0
# Node ID ffb2a7b80d6d798798570871344790a22bc0d114
# Parent 4089a1ae6558027b9d913d0d984bac4de3f6c5f0
ff_h264_idct8_add_sse2.
compared to mmx, 217->126 cycles on core2, 262->220 on k8.

diff -r 4089a1ae6558 -r ffb2a7b80d6d h264.h
--- a/h264.h Sun Feb 03 03:21:47 2008 +0000
+++ b/h264.h Sun Feb 03 07:05:11 2008 +0000
@@ -348,7 +348,7 @@
     GetBitContext *intra_gb_ptr;
     GetBitContext *inter_gb_ptr;
 
-    DECLARE_ALIGNED_8(DCTELEM, mb[16*24]);
+    DECLARE_ALIGNED_16(DCTELEM, mb[16*24]);
     DCTELEM mb_padding[256]; ///< as mb is addressed by scantable[i] and scantable is uint8_t we can either check that i is not to large or ensure that there is some unused stuff after mb
 
     /**
diff -r 4089a1ae6558 -r ffb2a7b80d6d i386/dsputil_h264_template_mmx.c
--- a/i386/dsputil_h264_template_mmx.c Sun Feb 03 03:21:47 2008 +0000
+++ b/i386/dsputil_h264_template_mmx.c Sun Feb 03 07:05:11 2008 +0000
@@ -98,7 +98,7 @@
     }
 
     /* general case, bilinear */
-    rnd_reg = rnd ? &ff_pw_32 : &ff_pw_28;
+    rnd_reg = rnd ? ff_pw_32 : &ff_pw_28;
     asm volatile("movd %2, %%mm4\n\t"
                  "movd %3, %%mm6\n\t"
                  "punpcklwd %%mm4, %%mm4\n\t"
@@ -250,7 +250,7 @@
         "sub $2, %2 \n\t"
         "jnz 1b \n\t"
         : "+r"(dst), "+r"(src), "+r"(h)
-        : "r"((long)stride), "m"(ff_pw_32), "m"(x), "m"(y)
+        : "r"((long)stride), "m"(*ff_pw_32), "m"(x), "m"(y)
     );
 }
 
@@ -301,7 +301,7 @@
         "sub $1, %2\n\t"
         "jnz 1b\n\t"
         : "+r" (dst), "+r"(src), "+r"(h)
-        : "m" (ff_pw_32), "r"((long)stride)
+        : "m" (*ff_pw_32), "r"((long)stride)
         : "%esi");
 
 }
diff -r 4089a1ae6558 -r ffb2a7b80d6d i386/dsputil_mmx.c
--- a/i386/dsputil_mmx.c Sun Feb 03 03:21:47 2008 +0000
+++ b/i386/dsputil_mmx.c Sun Feb 03 07:05:11 2008 +0000
@@ -54,7 +54,7 @@
 DECLARE_ALIGNED_8 (const uint64_t, ff_pw_15 ) = 0x000F000F000F000FULL;
 DECLARE_ALIGNED_8 (const uint64_t, ff_pw_16 ) = 0x0010001000100010ULL;
 DECLARE_ALIGNED_8 (const uint64_t, ff_pw_20 ) = 0x0014001400140014ULL;
-DECLARE_ALIGNED_8 (const uint64_t, ff_pw_32 ) = 0x0020002000200020ULL;
+DECLARE_ALIGNED_16(const uint64_t, ff_pw_32[2]) = {0x0020002000200020ULL, 0x0020002000200020ULL};
 DECLARE_ALIGNED_8 (const uint64_t, ff_pw_42 ) = 0x002A002A002A002AULL;
 DECLARE_ALIGNED_8 (const uint64_t, ff_pw_64 ) = 0x0040004000400040ULL;
 DECLARE_ALIGNED_8 (const uint64_t, ff_pw_96 ) = 0x0060006000600060ULL;
@@ -3328,6 +3328,8 @@
         c->h264_idct_add= ff_h264_idct_add_mmx;
         c->h264_idct8_dc_add=
         c->h264_idct8_add= ff_h264_idct8_add_mmx;
+        if (mm_flags & MM_SSE2)
+            c->h264_idct8_add= ff_h264_idct8_add_sse2;
 
         if (mm_flags & MM_MMXEXT) {
             c->prefetch = prefetch_mmx2;
diff -r 4089a1ae6558 -r ffb2a7b80d6d i386/dsputil_mmx.h
--- a/i386/dsputil_mmx.h Sun Feb 03 03:21:47 2008 +0000
+++ b/i386/dsputil_mmx.h Sun Feb 03 07:05:11 2008 +0000
@@ -36,7 +36,7 @@
 extern const uint64_t ff_pw_15;
 extern const uint64_t ff_pw_16;
 extern const uint64_t ff_pw_20;
-extern const uint64_t ff_pw_32;
+extern const uint64_t ff_pw_32[2];
 extern const uint64_t ff_pw_42;
 extern const uint64_t ff_pw_64;
 extern const uint64_t ff_pw_96;
diff -r 4089a1ae6558 -r ffb2a7b80d6d i386/h264dsp_mmx.c
--- a/i386/h264dsp_mmx.c Sun Feb 03 03:21:47 2008 +0000
+++ b/i386/h264dsp_mmx.c Sun Feb 03 07:05:11 2008 +0000
@@ -75,7 +75,7 @@
         IDCT4_1D( %%mm4, %%mm2, %%mm3, %%mm0, %%mm1 )
 
         "pxor %%mm7, %%mm7 \n\t"
-    :: "m"(ff_pw_32));
+    :: "m"(*ff_pw_32));
 
     asm volatile(
     STORE_DIFF_4P( %%mm0, %%mm1, %%mm7)
@@ -211,6 +211,93 @@
     add_pixels_clamped_mmx(b2, dst, stride);
 }
 
+#define STORE_DIFF_8P( p, d, t, z )\
+    "movq "#d", "#t" \n"\
+    "psraw $6, "#p" \n"\
+    "punpcklbw "#z", "#t" \n"\
+    "paddsw "#t", "#p" \n"\
+    "packuswb "#p", "#p" \n"\
+    "movq "#p", "#d" \n"
+
+#define H264_IDCT8_1D_SSE2(a,b,c,d,e,f,g,h)\
+    "movdqa "#c", "#a" \n"\
+    "movdqa "#g", "#e" \n"\
+    "psraw $1, "#c" \n"\
+    "psraw $1, "#g" \n"\
+    "psubw "#e", "#c" \n"\
+    "paddw "#a", "#g" \n"\
+    "movdqa "#b", "#e" \n"\
+    "psraw $1, "#e" \n"\
+    "paddw "#b", "#e" \n"\
+    "paddw "#d", "#e" \n"\
+    "paddw "#f", "#e" \n"\
+    "movdqa "#f", "#a" \n"\
+    "psraw $1, "#a" \n"\
+    "paddw "#f", "#a" \n"\
+    "paddw "#h", "#a" \n"\
+    "psubw "#b", "#a" \n"\
+    "psubw "#d", "#b" \n"\
+    "psubw "#d", "#f" \n"\
+    "paddw "#h", "#b" \n"\
+    "psubw "#h", "#f" \n"\
+    "psraw $1, "#d" \n"\
+    "psraw $1, "#h" \n"\
+    "psubw "#d", "#b" \n"\
+    "psubw "#h", "#f" \n"\
+    "movdqa "#e", "#d" \n"\
+    "movdqa "#a", "#h" \n"\
+    "psraw $2, "#d" \n"\
+    "psraw $2, "#h" \n"\
+    "paddw "#f", "#d" \n"\
+    "paddw "#b", "#h" \n"\
+    "psraw $2, "#f" \n"\
+    "psraw $2, "#b" \n"\
+    "psubw "#f", "#e" \n"\
+    "psubw "#a", "#b" \n"\
+    "movdqa 0x00(%1), "#a" \n"\
+    "movdqa 0x40(%1), "#f" \n"\
+    SUMSUB_BA(f, a)\
+    SUMSUB_BA(g, f)\
+    SUMSUB_BA(c, a)\
+    SUMSUB_BA(e, g)\
+    SUMSUB_BA(b, c)\
+    SUMSUB_BA(h, a)\
+    SUMSUB_BA(d, f)
+
+static void ff_h264_idct8_add_sse2(uint8_t *dst, int16_t *block, int stride)
+{
+    asm volatile(
+        "movdqa 0x10(%1), %%xmm1 \n"
+        "movdqa 0x20(%1), %%xmm2 \n"
+        "movdqa 0x30(%1), %%xmm3 \n"
+        "movdqa 0x50(%1), %%xmm5 \n"
+        "movdqa 0x60(%1), %%xmm6 \n"
+        "movdqa 0x70(%1), %%xmm7 \n"
+        H264_IDCT8_1D_SSE2(%%xmm0, %%xmm1, %%xmm2, %%xmm3, %%xmm4, %%xmm5, %%xmm6, %%xmm7)
+        TRANSPOSE8(%%xmm4, %%xmm1, %%xmm7, %%xmm3, %%xmm5, %%xmm0, %%xmm2, %%xmm6, (%1))
+        "paddw %4, %%xmm4 \n"
+        "movdqa %%xmm4, 0x00(%1) \n"
+        "movdqa %%xmm2, 0x40(%1) \n"
+        H264_IDCT8_1D_SSE2(%%xmm4, %%xmm0, %%xmm6, %%xmm3, %%xmm2, %%xmm5, %%xmm7, %%xmm1)
+        "movdqa %%xmm6, 0x60(%1) \n"
+        "movdqa %%xmm7, 0x70(%1) \n"
+        "pxor %%xmm7, %%xmm7 \n"
+        STORE_DIFF_8P(%%xmm2, (%0), %%xmm6, %%xmm7)
+        STORE_DIFF_8P(%%xmm0, (%0,%2), %%xmm6, %%xmm7)
+        STORE_DIFF_8P(%%xmm1, (%0,%2,2), %%xmm6, %%xmm7)
+        STORE_DIFF_8P(%%xmm3, (%0,%3), %%xmm6, %%xmm7)
+        "lea (%0,%2,4), %0 \n"
+        STORE_DIFF_8P(%%xmm5, (%0), %%xmm6, %%xmm7)
+        STORE_DIFF_8P(%%xmm4, (%0,%2), %%xmm6, %%xmm7)
+        "movdqa 0x60(%1), %%xmm0 \n"
+        "movdqa 0x70(%1), %%xmm1 \n"
+        STORE_DIFF_8P(%%xmm0, (%0,%2,2), %%xmm6, %%xmm7)
+        STORE_DIFF_8P(%%xmm1, (%0,%3), %%xmm6, %%xmm7)
+        :"+r"(dst)
+        :"r"(block), "r"((long)stride), "r"(3L*stride), "m"(*ff_pw_32)
+    );
+}
+
 static void ff_h264_idct_dc_add_mmx2(uint8_t *dst, int16_t *block, int stride)
 {
     int dc = (block[0] + 32) >> 6;
@@ -839,7 +926,7 @@
         "decl %2 \n\t"\
         " jnz 1b \n\t"\
         : "+a"(tmp), "+c"(dst), "+m"(h)\
-        : "S"((long)dstStride), "m"(ff_pw_32)\
+        : "S"((long)dstStride), "m"(*ff_pw_32)\
         : "memory"\
     );\
 }\
@@ -1113,7 +1200,7 @@
         "decl %2 \n\t"\
         " jnz 1b \n\t"\
         : "+a"(tmp), "+c"(dst), "+m"(h)\
-        : "S"((long)dstStride), "m"(ff_pw_32)\
+        : "S"((long)dstStride), "m"(*ff_pw_32)\
         : "memory"\
     );\
     tmp += 8 - size*24;\
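
Reviewer note: the arithmetic in H264_IDCT8_1D_SSE2 is easier to check against a scalar version. The sketch below is not part of the changeset. It is a hypothetical reference (the name h264_idct8_1d_ref and its signature are assumptions) that performs one 8-point pass on a single row or column, using the same shifts, adds and subtracts the macro applies to eight lanes at once. It uses int intermediates, whereas the SSE2 code works in 16-bit lanes, so results should agree only while no intermediate overflows 16 bits.

```c
#include <stdint.h>

/* Hypothetical scalar reference for one 8-point H.264 inverse-transform pass.
 * Assumes >> on negative ints is an arithmetic shift (as psraw is). */
static void h264_idct8_1d_ref(const int16_t in[8], int16_t out[8])
{
    /* Even part: inputs 0, 2, 4, 6 */
    const int a0 = in[0] + in[4];
    const int a2 = in[0] - in[4];
    const int a4 = (in[2] >> 1) - in[6];
    const int a6 = (in[6] >> 1) + in[2];

    const int b0 = a0 + a6;
    const int b2 = a2 + a4;
    const int b4 = a2 - a4;
    const int b6 = a0 - a6;

    /* Odd part: inputs 1, 3, 5, 7 */
    const int a1 = -in[3] + in[5] - in[7] - (in[7] >> 1);
    const int a3 =  in[1] + in[7] - in[3] - (in[3] >> 1);
    const int a5 = -in[1] + in[7] + in[5] + (in[5] >> 1);
    const int a7 =  in[3] + in[5] + in[1] + (in[1] >> 1);

    const int b1 = (a7 >> 2) + a1;
    const int b3 =  a3 + (a5 >> 2);
    const int b5 = (a3 >> 2) - a5;
    const int b7 =  a7 - (a1 >> 2);

    /* Final butterflies (the SUMSUB_BA steps in the macro) */
    out[0] = (int16_t)(b0 + b7);
    out[7] = (int16_t)(b0 - b7);
    out[1] = (int16_t)(b2 + b5);
    out[6] = (int16_t)(b2 - b5);
    out[2] = (int16_t)(b4 + b3);
    out[5] = (int16_t)(b4 - b3);
    out[3] = (int16_t)(b6 + b1);
    out[4] = (int16_t)(b6 - b1);
}
```

In the patch this pass runs twice with a TRANSPOSE8 in between. The rounding constant ff_pw_32 is added once between the passes, and STORE_DIFF_8P shifts right by 6 before adding the result to the destination pixels with saturation.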