# HG changeset patch # User rbultje # Date 1284471386 0 # Node ID 58a960d6e34c66dc074f242d5077a47efc8c27ed # Parent 990f8a5fc8af46f8d22a87cca34d83e68116e089 Rename h264_idct_sse2.asm to h264_idct.asm; move inline IDCT asm from h264dsp_mmx.c to h264_idct.asm (as yasm code). Because the loops are now coded in asm instead of C, this is (depending on the function) up to 50% faster for cases where gcc didn't do a great job at looping. Since h264_idct_add8() is now faster than the manual loop setup in h264.c, in-asm idct calling can now be enabled for chroma as well (see r16207). For MMX, this is 5% faster. For SSE2 (which isn't done for chroma if h264.c does the looping), this makes it up to 50% faster. Speed gain overall is ~0.5-1.0%. diff -r 990f8a5fc8af -r 58a960d6e34c h264.c --- a/h264.c Mon Sep 13 22:09:28 2010 +0000 +++ b/h264.c Tue Sep 14 13:36:26 2010 +0000 @@ -1318,14 +1318,9 @@ chroma_dc_dequant_idct_c(h->mb + 16*16, h->chroma_qp[0], h->dequant4_coeff[IS_INTRA(mb_type) ? 1:4][h->chroma_qp[0]][0]); chroma_dc_dequant_idct_c(h->mb + 16*16+4*16, h->chroma_qp[1], h->dequant4_coeff[IS_INTRA(mb_type) ? 2:5][h->chroma_qp[1]][0]); if(is_h264){ - idct_add = h->h264dsp.h264_idct_add; - idct_dc_add = h->h264dsp.h264_idct_dc_add; - for(i=16; i<16+8; i++){ - if(h->non_zero_count_cache[ scan8[i] ]) - idct_add (dest[(i&4)>>2] + block_offset[i], h->mb + i*16, uvlinesize); - else if(h->mb[i*16]) - idct_dc_add(dest[(i&4)>>2] + block_offset[i], h->mb + i*16, uvlinesize); - } + h->h264dsp.h264_idct_add8(dest, block_offset, + h->mb, uvlinesize, + h->non_zero_count_cache); }else{ for(i=16; i<16+8; i++){ if(h->non_zero_count_cache[ scan8[i] ] || h->mb[i*16]){ diff -r 990f8a5fc8af -r 58a960d6e34c x86/Makefile --- a/x86/Makefile Mon Sep 13 22:09:28 2010 +0000 +++ b/x86/Makefile Tue Sep 14 13:36:26 2010 +0000 @@ -10,7 +10,7 @@ MMX-OBJS-$(CONFIG_H264DSP) += x86/h264dsp_mmx.o YASM-OBJS-$(CONFIG_H264DSP) += x86/h264_deblock.o \ x86/h264_weight.o \ - x86/h264_idct_sse2.o \ + x86/h264_idct.o \ YASM-OBJS-$(CONFIG_H264PRED) += x86/h264_intrapred.o MMX-OBJS-$(CONFIG_H264PRED) += x86/h264_intrapred_init.o diff -r 990f8a5fc8af -r 58a960d6e34c x86/h264_idct.asm --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/x86/h264_idct.asm Tue Sep 14 13:36:26 2010 +0000 @@ -0,0 +1,865 @@ +;***************************************************************************** +;* MMX/SSE2-optimized H.264 iDCT +;***************************************************************************** +;* Copyright (C) 2004-2005 Michael Niedermayer, Loren Merritt +;* Copyright (C) 2003-2008 x264 project +;* +;* Authors: Laurent Aimar +;* Loren Merritt +;* Holger Lubitz +;* Min Chen +;* +;* This file is part of FFmpeg. +;* +;* FFmpeg is free software; you can redistribute it and/or +;* modify it under the terms of the GNU Lesser General Public +;* License as published by the Free Software Foundation; either +;* version 2.1 of the License, or (at your option) any later version. +;* +;* FFmpeg is distributed in the hope that it will be useful, +;* but WITHOUT ANY WARRANTY; without even the implied warranty of +;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +;* Lesser General Public License for more details. 
+;* +;* You should have received a copy of the GNU Lesser General Public +;* License along with FFmpeg; if not, write to the Free Software +;* 51, Inc., Foundation Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA +;***************************************************************************** + +%include "x86inc.asm" +%include "x86util.asm" + +SECTION_RODATA + +; FIXME this table is a duplicate from h264data.h, and will be removed once the tables from, h264 have been split +scan8_mem: db 4+1*8, 5+1*8, 4+2*8, 5+2*8 + db 6+1*8, 7+1*8, 6+2*8, 7+2*8 + db 4+3*8, 5+3*8, 4+4*8, 5+4*8 + db 6+3*8, 7+3*8, 6+4*8, 7+4*8 + db 1+1*8, 2+1*8 + db 1+2*8, 2+2*8 + db 1+4*8, 2+4*8 + db 1+5*8, 2+5*8 +%ifdef PIC +%define scan8 r11 +%else +%define scan8 scan8_mem +%endif + +cextern pw_32 + +SECTION .text + +; %1=uint8_t *dst, %2=int16_t *block, %3=int stride +%macro IDCT4_ADD 3 + ; Load dct coeffs + movq m0, [%2] + movq m1, [%2+8] + movq m2, [%2+16] + movq m3, [%2+24] + + IDCT4_1D 0, 1, 2, 3, 4, 5 + mova m6, [pw_32] + TRANSPOSE4x4W 0, 1, 2, 3, 4 + paddw m0, m6 + IDCT4_1D 0, 1, 2, 3, 4, 5 + pxor m7, m7 + + STORE_DIFFx2 m0, m1, m4, m5, m7, 6, %1, %3 + lea %1, [%1+%3*2] + STORE_DIFFx2 m2, m3, m4, m5, m7, 6, %1, %3 +%endmacro + +INIT_MMX +; ff_h264_idct_add_mmx(uint8_t *dst, int16_t *block, int stride) +cglobal h264_idct_add_mmx, 3, 3, 0 + IDCT4_ADD r0, r1, r2 + RET + +%macro IDCT8_1D 2 + mova m4, m5 + mova m0, m1 + psraw m4, 1 + psraw m1, 1 + paddw m4, m5 + paddw m1, m0 + paddw m4, m7 + paddw m1, m5 + psubw m4, m0 + paddw m1, m3 + + psubw m0, m3 + psubw m5, m3 + paddw m0, m7 + psubw m5, m7 + psraw m3, 1 + psraw m7, 1 + psubw m0, m3 + psubw m5, m7 + + mova m3, m4 + mova m7, m1 + psraw m1, 2 + psraw m3, 2 + paddw m3, m0 + psraw m0, 2 + paddw m1, m5 + psraw m5, 2 + psubw m0, m4 + psubw m7, m5 + + mova m4, m2 + mova m5, m6 + psraw m4, 1 + psraw m6, 1 + psubw m4, m5 + paddw m6, m2 + + mova m2, %1 + mova m5, %2 + SUMSUB_BA m5, m2 + SUMSUB_BA m6, m5 + SUMSUB_BA m4, m2 + SUMSUB_BA m7, m6 + SUMSUB_BA m0, m4 + SUMSUB_BA m3, m2 + SUMSUB_BA m1, m5 + SWAP 7, 6, 4, 5, 2, 3, 1, 0 ; 70315246 -> 01234567 +%endmacro + +%macro IDCT8_1D_FULL 1 + mova m7, [%1+112] + mova m6, [%1+ 96] + mova m5, [%1+ 80] + mova m3, [%1+ 48] + mova m2, [%1+ 32] + mova m1, [%1+ 16] + IDCT8_1D [%1], [%1+ 64] +%endmacro + +; %1=int16_t *block, %2=int16_t *dstblock +%macro IDCT8_ADD_MMX_START 2 + IDCT8_1D_FULL %1 + mova [%1], m7 + TRANSPOSE4x4W 0, 1, 2, 3, 7 + mova m7, [%1] + mova [%2 ], m0 + mova [%2+16], m1 + mova [%2+32], m2 + mova [%2+48], m3 + TRANSPOSE4x4W 4, 5, 6, 7, 3 + mova [%2+ 8], m4 + mova [%2+24], m5 + mova [%2+40], m6 + mova [%2+56], m7 +%endmacro + +; %1=uint8_t *dst, %2=int16_t *block, %3=int stride +%macro IDCT8_ADD_MMX_END 3 + IDCT8_1D_FULL %2 + mova [%2 ], m5 + mova [%2+16], m6 + mova [%2+32], m7 + + pxor m7, m7 + STORE_DIFFx2 m0, m1, m5, m6, m7, 6, %1, %3 + lea %1, [%1+%3*2] + STORE_DIFFx2 m2, m3, m5, m6, m7, 6, %1, %3 + mova m0, [%2 ] + mova m1, [%2+16] + mova m2, [%2+32] + lea %1, [%1+%3*2] + STORE_DIFFx2 m4, m0, m5, m6, m7, 6, %1, %3 + lea %1, [%1+%3*2] + STORE_DIFFx2 m1, m2, m5, m6, m7, 6, %1, %3 +%endmacro + +INIT_MMX +; ff_h264_idct8_add_mmx(uint8_t *dst, int16_t *block, int stride) +cglobal h264_idct8_add_mmx, 3, 4, 0 + %assign pad 128+4-(stack_offset&7) + SUB rsp, pad + + add word [r1], 32 + IDCT8_ADD_MMX_START r1 , rsp + IDCT8_ADD_MMX_START r1+8, rsp+64 + lea r3, [r0+4] + IDCT8_ADD_MMX_END r0 , rsp, r2 + IDCT8_ADD_MMX_END r3 , rsp+8, r2 + + ADD rsp, pad + RET + +; %1=uint8_t *dst, %2=int16_t *block, %3=int stride +%macro 
IDCT8_ADD_SSE 4 + IDCT8_1D_FULL %2 +%ifdef ARCH_X86_64 + TRANSPOSE8x8W 0, 1, 2, 3, 4, 5, 6, 7, 8 +%else + TRANSPOSE8x8W 0, 1, 2, 3, 4, 5, 6, 7, [%2], [%2+16] +%endif + paddw m0, [pw_32] + +%ifndef ARCH_X86_64 + mova [%2 ], m0 + mova [%2+16], m4 + IDCT8_1D [%2], [%2+ 16] + mova [%2 ], m6 + mova [%2+16], m7 +%else + SWAP 0, 8 + SWAP 4, 9 + IDCT8_1D m8, m9 + SWAP 6, 8 + SWAP 7, 9 +%endif + + pxor m7, m7 + lea %4, [%3*3] + STORE_DIFF m0, m6, m7, [%1 ] + STORE_DIFF m1, m6, m7, [%1+%3 ] + STORE_DIFF m2, m6, m7, [%1+%3*2] + STORE_DIFF m3, m6, m7, [%1+%4 ] +%ifndef ARCH_X86_64 + mova m0, [%2 ] + mova m1, [%2+16] +%else + SWAP 0, 8 + SWAP 1, 9 +%endif + lea %1, [%1+%3*4] + STORE_DIFF m4, m6, m7, [%1 ] + STORE_DIFF m5, m6, m7, [%1+%3 ] + STORE_DIFF m0, m6, m7, [%1+%3*2] + STORE_DIFF m1, m6, m7, [%1+%4 ] +%endmacro + +INIT_XMM +; ff_h264_idct8_add_sse2(uint8_t *dst, int16_t *block, int stride) +cglobal h264_idct8_add_sse2, 3, 4, 10 + IDCT8_ADD_SSE r0, r1, r2, r3 + RET + +%macro DC_ADD_MMX2_INIT 2-3 +%if %0 == 2 + movsx %1, word [%1] + add %1, 32 + sar %1, 6 + movd m0, %1 + lea %1, [%2*3] +%else + add %3, 32 + sar %3, 6 + movd m0, %3 + lea %3, [%2*3] +%endif + pshufw m0, m0, 0 + pxor m1, m1 + psubw m1, m0 + packuswb m0, m0 + packuswb m1, m1 +%endmacro + +%macro DC_ADD_MMX2_OP 3-4 + %1 m2, [%2 ] + %1 m3, [%2+%3 ] + %1 m4, [%2+%3*2] + %1 m5, [%2+%4 ] + paddusb m2, m0 + paddusb m3, m0 + paddusb m4, m0 + paddusb m5, m0 + psubusb m2, m1 + psubusb m3, m1 + psubusb m4, m1 + psubusb m5, m1 + %1 [%2 ], m2 + %1 [%2+%3 ], m3 + %1 [%2+%3*2], m4 + %1 [%2+%4 ], m5 +%endmacro + +INIT_MMX +; ff_h264_idct_dc_add_mmx2(uint8_t *dst, int16_t *block, int stride) +cglobal h264_idct_dc_add_mmx2, 3, 3, 0 + DC_ADD_MMX2_INIT r1, r2 + DC_ADD_MMX2_OP movh, r0, r2, r1 + RET + +; ff_h264_idct8_dc_add_mmx2(uint8_t *dst, int16_t *block, int stride) +cglobal h264_idct8_dc_add_mmx2, 3, 3, 0 + DC_ADD_MMX2_INIT r1, r2 + DC_ADD_MMX2_OP mova, r0, r2, r1 + lea r0, [r0+r2*4] + DC_ADD_MMX2_OP mova, r0, r2, r1 + RET + +; ff_h264_idct_add16_mmx(uint8_t *dst, const int *block_offset, +; DCTELEM *block, int stride, const uint8_t nnzc[6*8]) +cglobal h264_idct_add16_mmx, 5, 7, 0 + xor r5, r5 +%ifdef PIC + lea r11, [scan8_mem] +%endif +.nextblock + movzx r6, byte [scan8+r5] + movzx r6, byte [r4+r6] + test r6, r6 + jz .skipblock + mov r6d, dword [r1+r5*4] + lea r6, [r0+r6] + IDCT4_ADD r6, r2, r3 +.skipblock + inc r5 + add r2, 32 + cmp r5, 16 + jl .nextblock + REP_RET + +; ff_h264_idct8_add4_mmx(uint8_t *dst, const int *block_offset, +; DCTELEM *block, int stride, const uint8_t nnzc[6*8]) +cglobal h264_idct8_add4_mmx, 5, 7, 0 + %assign pad 128+4-(stack_offset&7) + SUB rsp, pad + + xor r5, r5 +%ifdef PIC + lea r11, [scan8_mem] +%endif +.nextblock + movzx r6, byte [scan8+r5] + movzx r6, byte [r4+r6] + test r6, r6 + jz .skipblock + mov r6d, dword [r1+r5*4] + lea r6, [r0+r6] + add word [r2], 32 + IDCT8_ADD_MMX_START r2 , rsp + IDCT8_ADD_MMX_START r2+8, rsp+64 + IDCT8_ADD_MMX_END r6 , rsp, r3 + mov r6d, dword [r1+r5*4] + lea r6, [r0+r6+4] + IDCT8_ADD_MMX_END r6 , rsp+8, r3 +.skipblock + add r5, 4 + add r2, 128 + cmp r5, 16 + jl .nextblock + ADD rsp, pad + RET + +; ff_h264_idct_add16_mmx2(uint8_t *dst, const int *block_offset, +; DCTELEM *block, int stride, const uint8_t nnzc[6*8]) +cglobal h264_idct_add16_mmx2, 5, 7, 0 + xor r5, r5 +%ifdef PIC + lea r11, [scan8_mem] +%endif +.nextblock + movzx r6, byte [scan8+r5] + movzx r6, byte [r4+r6] + test r6, r6 + jz .skipblock + cmp r6, 1 + jnz .no_dc + movsx r6, word [r2] + test r6, r6 + jz .no_dc + 
DC_ADD_MMX2_INIT r2, r3, r6 +%ifdef ARCH_X86_64 +%define dst_reg r10 +%define dst_regd r10d +%else +%define dst_reg r1 +%define dst_regd r1d +%endif + mov dst_regd, dword [r1+r5*4] + lea dst_reg, [r0+dst_reg] + DC_ADD_MMX2_OP movh, dst_reg, r3, r6 +%ifndef ARCH_X86_64 + mov r1, r1m +%endif + inc r5 + add r2, 32 + cmp r5, 16 + jl .nextblock + REP_RET +.no_dc + mov r6d, dword [r1+r5*4] + lea r6, [r0+r6] + IDCT4_ADD r6, r2, r3 +.skipblock + inc r5 + add r2, 32 + cmp r5, 16 + jl .nextblock + REP_RET + +; ff_h264_idct_add16intra_mmx(uint8_t *dst, const int *block_offset, +; DCTELEM *block, int stride, const uint8_t nnzc[6*8]) +cglobal h264_idct_add16intra_mmx, 5, 7, 0 + xor r5, r5 +%ifdef PIC + lea r11, [scan8_mem] +%endif +.nextblock + movzx r6, byte [scan8+r5] + movzx r6, byte [r4+r6] + or r6w, word [r2] + test r6, r6 + jz .skipblock + mov r6d, dword [r1+r5*4] + lea r6, [r0+r6] + IDCT4_ADD r6, r2, r3 +.skipblock + inc r5 + add r2, 32 + cmp r5, 16 + jl .nextblock + REP_RET + +; ff_h264_idct_add16intra_mmx2(uint8_t *dst, const int *block_offset, +; DCTELEM *block, int stride, const uint8_t nnzc[6*8]) +cglobal h264_idct_add16intra_mmx2, 5, 7, 0 + xor r5, r5 +%ifdef PIC + lea r11, [scan8_mem] +%endif +.nextblock + movzx r6, byte [scan8+r5] + movzx r6, byte [r4+r6] + test r6, r6 + jz .try_dc + mov r6d, dword [r1+r5*4] + lea r6, [r0+r6] + IDCT4_ADD r6, r2, r3 + inc r5 + add r2, 32 + cmp r5, 16 + jl .nextblock + REP_RET +.try_dc + movsx r6, word [r2] + test r6, r6 + jz .skipblock + DC_ADD_MMX2_INIT r2, r3, r6 +%ifdef ARCH_X86_64 +%define dst_reg r10 +%define dst_regd r10d +%else +%define dst_reg r1 +%define dst_regd r1d +%endif + mov dst_regd, dword [r1+r5*4] + lea dst_reg, [r0+dst_reg] + DC_ADD_MMX2_OP movh, dst_reg, r3, r6 +%ifndef ARCH_X86_64 + mov r1, r1m +%endif +.skipblock + inc r5 + add r2, 32 + cmp r5, 16 + jl .nextblock + REP_RET + +; ff_h264_idct8_add4_mmx2(uint8_t *dst, const int *block_offset, +; DCTELEM *block, int stride, const uint8_t nnzc[6*8]) +cglobal h264_idct8_add4_mmx2, 5, 7, 0 + %assign pad 128+4-(stack_offset&7) + SUB rsp, pad + + xor r5, r5 +%ifdef PIC + lea r11, [scan8_mem] +%endif +.nextblock + movzx r6, byte [scan8+r5] + movzx r6, byte [r4+r6] + test r6, r6 + jz .skipblock + cmp r6, 1 + jnz .no_dc + movsx r6, word [r2] + test r6, r6 + jz .no_dc + DC_ADD_MMX2_INIT r2, r3, r6 +%ifdef ARCH_X86_64 +%define dst_reg r10 +%define dst_regd r10d +%else +%define dst_reg r1 +%define dst_regd r1d +%endif + mov dst_regd, dword [r1+r5*4] + lea dst_reg, [r0+dst_reg] + DC_ADD_MMX2_OP mova, dst_reg, r3, r6 + lea dst_reg, [dst_reg+r3*4] + DC_ADD_MMX2_OP mova, dst_reg, r3, r6 +%ifndef ARCH_X86_64 + mov r1, r1m +%endif + add r5, 4 + add r2, 128 + cmp r5, 16 + jl .nextblock + + ADD rsp, pad + RET +.no_dc + mov r6d, dword [r1+r5*4] + lea r6, [r0+r6] + add word [r2], 32 + IDCT8_ADD_MMX_START r2 , rsp + IDCT8_ADD_MMX_START r2+8, rsp+64 + IDCT8_ADD_MMX_END r6 , rsp, r3 + mov r6d, dword [r1+r5*4] + lea r6, [r0+r6+4] + IDCT8_ADD_MMX_END r6 , rsp+8, r3 +.skipblock + add r5, 4 + add r2, 128 + cmp r5, 16 + jl .nextblock + + ADD rsp, pad + RET + +INIT_XMM +; ff_h264_idct8_add4_sse2(uint8_t *dst, const int *block_offset, +; DCTELEM *block, int stride, const uint8_t nnzc[6*8]) +cglobal h264_idct8_add4_sse2, 5, 7, 10 + xor r5, r5 +%ifdef PIC + lea r11, [scan8_mem] +%endif +.nextblock + movzx r6, byte [scan8+r5] + movzx r6, byte [r4+r6] + test r6, r6 + jz .skipblock + cmp r6, 1 + jnz .no_dc + movsx r6, word [r2] + test r6, r6 + jz .no_dc +INIT_MMX + DC_ADD_MMX2_INIT r2, r3, r6 +%ifdef ARCH_X86_64 +%define 
dst_reg r10 +%define dst_regd r10d +%else +%define dst_reg r1 +%define dst_regd r1d +%endif + mov dst_regd, dword [r1+r5*4] + lea dst_reg, [r0+dst_reg] + DC_ADD_MMX2_OP mova, dst_reg, r3, r6 + lea dst_reg, [dst_reg+r3*4] + DC_ADD_MMX2_OP mova, dst_reg, r3, r6 +%ifndef ARCH_X86_64 + mov r1, r1m +%endif + add r5, 4 + add r2, 128 + cmp r5, 16 + jl .nextblock + REP_RET +.no_dc +INIT_XMM + mov dst_regd, dword [r1+r5*4] + lea dst_reg, [r0+dst_reg] + IDCT8_ADD_SSE dst_reg, r2, r3, r6 +%ifndef ARCH_X86_64 + mov r1, r1m +%endif +.skipblock + add r5, 4 + add r2, 128 + cmp r5, 16 + jl .nextblock + REP_RET + +INIT_MMX +h264_idct_add8_mmx_plane: +.nextblock + movzx r6, byte [scan8+r5] + movzx r6, byte [r4+r6] + or r6w, word [r2] + test r6, r6 + jz .skipblock +%ifdef ARCH_X86_64 + mov r0d, dword [r1+r5*4] + add r0, [r10] +%else + mov r0, r1m ; XXX r1m here is actually r0m of the calling func + mov r0, [r0] + add r0, dword [r1+r5*4] +%endif + IDCT4_ADD r0, r2, r3 +.skipblock + inc r5 + add r2, 32 + test r5, 3 + jnz .nextblock + rep ret + +; ff_h264_idct_add8_mmx(uint8_t **dest, const int *block_offset, +; DCTELEM *block, int stride, const uint8_t nnzc[6*8]) +cglobal h264_idct_add8_mmx, 5, 7, 0 + mov r5, 16 + add r2, 512 +%ifdef PIC + lea r11, [scan8_mem] +%endif +%ifdef ARCH_X86_64 + mov r10, r0 +%endif + call h264_idct_add8_mmx_plane +%ifdef ARCH_X86_64 + add r10, gprsize +%else + add r0mp, gprsize +%endif + call h264_idct_add8_mmx_plane + RET + +h264_idct_add8_mmx2_plane +.nextblock + movzx r6, byte [scan8+r5] + movzx r6, byte [r4+r6] + test r6, r6 + jz .try_dc +%ifdef ARCH_X86_64 + mov r0d, dword [r1+r5*4] + add r0, [r10] +%else + mov r0, r1m ; XXX r1m here is actually r0m of the calling func + mov r0, [r0] + add r0, dword [r1+r5*4] +%endif + IDCT4_ADD r0, r2, r3 + inc r5 + add r2, 32 + test r5, 3 + jnz .nextblock + rep ret +.try_dc + movsx r6, word [r2] + test r6, r6 + jz .skipblock + DC_ADD_MMX2_INIT r2, r3, r6 +%ifdef ARCH_X86_64 + mov r0d, dword [r1+r5*4] + add r0, [r10] +%else + mov r0, r1m ; XXX r1m here is actually r0m of the calling func + mov r0, [r0] + add r0, dword [r1+r5*4] +%endif + DC_ADD_MMX2_OP movh, r0, r3, r6 +.skipblock + inc r5 + add r2, 32 + test r5, 3 + jnz .nextblock + rep ret + +; ff_h264_idct_add8_mmx2(uint8_t **dest, const int *block_offset, +; DCTELEM *block, int stride, const uint8_t nnzc[6*8]) +cglobal h264_idct_add8_mmx2, 5, 7, 0 + mov r5, 16 + add r2, 512 +%ifdef ARCH_X86_64 + mov r10, r0 +%endif +%ifdef PIC + lea r11, [scan8_mem] +%endif + call h264_idct_add8_mmx2_plane +%ifdef ARCH_X86_64 + add r10, gprsize +%else + add r0mp, gprsize +%endif + call h264_idct_add8_mmx2_plane + RET + +INIT_MMX +; r0 = uint8_t *dst, r2 = int16_t *block, r3 = int stride, r6=clobbered +h264_idct_dc_add8_mmx2: + movd m0, [r2 ] ; 0 0 X D + punpcklwd m0, [r2+32] ; x X d D + paddsw m0, [pw_32] + psraw m0, 6 + punpcklwd m0, m0 ; d d D D + pxor m1, m1 ; 0 0 0 0 + psubw m1, m0 ; -d-d-D-D + packuswb m0, m1 ; -d-d-D-D d d D D + pshufw m1, m0, 0xFA ; -d-d-d-d-D-D-D-D + punpcklwd m0, m0 ; d d d d D D D D + lea r6, [r3*3] + DC_ADD_MMX2_OP movq, r0, r3, r6 + ret + +ALIGN 16 +INIT_XMM +; r0 = uint8_t *dst (clobbered), r2 = int16_t *block, r3 = int stride +x264_add8x4_idct_sse2: + movq m0, [r2+ 0] + movq m1, [r2+ 8] + movq m2, [r2+16] + movq m3, [r2+24] + movhps m0, [r2+32] + movhps m1, [r2+40] + movhps m2, [r2+48] + movhps m3, [r2+56] + IDCT4_1D 0,1,2,3,4,5 + TRANSPOSE2x4x4W 0,1,2,3,4 + paddw m0, [pw_32] + IDCT4_1D 0,1,2,3,4,5 + pxor m7, m7 + STORE_DIFFx2 m0, m1, m4, m5, m7, 6, r0, r3 + lea r0, [r0+r3*2] 
+ STORE_DIFFx2 m2, m3, m4, m5, m7, 6, r0, r3 + ret + +%macro add16_sse2_cycle 2 + movzx r0, word [r4+%2] + test r0, r0 + jz .cycle%1end + mov r0d, dword [r1+%1*8] +%ifdef ARCH_X86_64 + add r0, r10 +%else + add r0, r0m +%endif + call x264_add8x4_idct_sse2 +.cycle%1end +%if %1 < 7 + add r2, 64 +%endif +%endmacro + +; ff_h264_idct_add16_sse2(uint8_t *dst, const int *block_offset, +; DCTELEM *block, int stride, const uint8_t nnzc[6*8]) +cglobal h264_idct_add16_sse2, 5, 5, 8 +%ifdef ARCH_X86_64 + mov r10, r0 +%endif + ; unrolling of the loop leads to an average performance gain of + ; 20-25% + add16_sse2_cycle 0, 0xc + add16_sse2_cycle 1, 0x14 + add16_sse2_cycle 2, 0xe + add16_sse2_cycle 3, 0x16 + add16_sse2_cycle 4, 0x1c + add16_sse2_cycle 5, 0x24 + add16_sse2_cycle 6, 0x1e + add16_sse2_cycle 7, 0x26 + RET + +; ff_h264_idct_add16intra_sse2(uint8_t *dst, const int *block_offset, +; DCTELEM *block, int stride, const uint8_t nnzc[6*8]) +cglobal h264_idct_add16intra_sse2, 5, 7, 8 + xor r5, r5 +%ifdef ARCH_X86_64 + mov r10, r0 +%endif +%ifdef PIC + lea r11, [scan8_mem] +%endif +.next2blocks + movzx r0, byte [scan8+r5] + movzx r0, word [r4+r0] + test r0, r0 + jz .try_dc + mov r0d, dword [r1+r5*4] +%ifdef ARCH_X86_64 + add r0, r10 +%else + add r0, r0m +%endif + call x264_add8x4_idct_sse2 + add r5, 2 + add r2, 64 + cmp r5, 16 + jl .next2blocks + REP_RET +.try_dc + movsx r0, word [r2 ] + or r0w, word [r2+32] + jz .skip2blocks + mov r0d, dword [r1+r5*4] +%ifdef ARCH_X86_64 + add r0, r10 +%else + add r0, r0m +%endif + call h264_idct_dc_add8_mmx2 +.skip2blocks + add r5, 2 + add r2, 64 + cmp r5, 16 + jl .next2blocks + REP_RET + +h264_idct_add8_sse2_plane: +.next2blocks + movzx r0, byte [scan8+r5] + movzx r0, word [r4+r0] + test r0, r0 + jz .try_dc +%ifdef ARCH_X86_64 + mov r0d, dword [r1+r5*4] + add r0, [r10] +%else + mov r0, r1m ; XXX r1m here is actually r0m of the calling func + mov r0, [r0] + add r0, dword [r1+r5*4] +%endif + call x264_add8x4_idct_sse2 + add r5, 2 + add r2, 64 + test r5, 3 + jnz .next2blocks + rep ret +.try_dc + movsx r0, word [r2 ] + or r0w, word [r2+32] + jz .skip2blocks +%ifdef ARCH_X86_64 + mov r0d, dword [r1+r5*4] + add r0, [r10] +%else + mov r0, r1m ; XXX r1m here is actually r0m of the calling func + mov r0, [r0] + add r0, dword [r1+r5*4] +%endif + call h264_idct_dc_add8_mmx2 +.skip2blocks + add r5, 2 + add r2, 64 + test r5, 3 + jnz .next2blocks + rep ret + +; ff_h264_idct_add8_sse2(uint8_t **dest, const int *block_offset, +; DCTELEM *block, int stride, const uint8_t nnzc[6*8]) +cglobal h264_idct_add8_sse2, 5, 7, 8 + mov r5, 16 + add r2, 512 +%ifdef PIC + lea r11, [scan8_mem] +%endif +%ifdef ARCH_X86_64 + mov r10, r0 +%endif + call h264_idct_add8_sse2_plane +%ifdef ARCH_X86_64 + add r10, gprsize +%else + add r0mp, gprsize +%endif + call h264_idct_add8_sse2_plane + RET diff -r 990f8a5fc8af -r 58a960d6e34c x86/h264_idct_sse2.asm --- a/x86/h264_idct_sse2.asm Mon Sep 13 22:09:28 2010 +0000 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,56 +0,0 @@ -;***************************************************************************** -;* SSE2-optimized H.264 iDCT -;***************************************************************************** -;* Copyright (C) 2003-2008 x264 project -;* -;* Authors: Laurent Aimar -;* Loren Merritt -;* Holger Lubitz -;* Min Chen -;* -;* This file is part of FFmpeg. 
-;* -;* FFmpeg is free software; you can redistribute it and/or -;* modify it under the terms of the GNU Lesser General Public -;* License as published by the Free Software Foundation; either -;* version 2.1 of the License, or (at your option) any later version. -;* -;* FFmpeg is distributed in the hope that it will be useful, -;* but WITHOUT ANY WARRANTY; without even the implied warranty of -;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -;* Lesser General Public License for more details. -;* -;* You should have received a copy of the GNU Lesser General Public -;* License along with FFmpeg; if not, write to the Free Software -;* 51, Inc., Foundation Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA -;***************************************************************************** - -%include "x86inc.asm" -%include "x86util.asm" - -SECTION_RODATA -pw_32: times 8 dw 32 - -SECTION .text - -INIT_XMM -cglobal x264_add8x4_idct_sse2, 3,3,8 - movq m0, [r1+ 0] - movq m1, [r1+ 8] - movq m2, [r1+16] - movq m3, [r1+24] - movhps m0, [r1+32] - movhps m1, [r1+40] - movhps m2, [r1+48] - movhps m3, [r1+56] - IDCT4_1D 0,1,2,3,4,5 - TRANSPOSE2x4x4W 0,1,2,3,4 - paddw m0, [pw_32] - IDCT4_1D 0,1,2,3,4,5 - pxor m7, m7 - STORE_DIFF m0, m4, m7, [r0] - STORE_DIFF m1, m4, m7, [r0+r2] - lea r0, [r0+r2*2] - STORE_DIFF m2, m4, m7, [r0] - STORE_DIFF m3, m4, m7, [r0+r2] - RET diff -r 990f8a5fc8af -r 58a960d6e34c x86/h264dsp_mmx.c --- a/x86/h264dsp_mmx.c Mon Sep 13 22:09:28 2010 +0000 +++ b/x86/h264dsp_mmx.c Tue Sep 14 13:36:26 2010 +0000 @@ -29,523 +29,37 @@ /***********************************/ /* IDCT */ -#define SUMSUB_BADC( a, b, c, d ) \ - "paddw "#b", "#a" \n\t"\ - "paddw "#d", "#c" \n\t"\ - "paddw "#b", "#b" \n\t"\ - "paddw "#d", "#d" \n\t"\ - "psubw "#a", "#b" \n\t"\ - "psubw "#c", "#d" \n\t" - -#define SUMSUBD2_AB( a, b, t ) \ - "movq "#b", "#t" \n\t"\ - "psraw $1 , "#b" \n\t"\ - "paddw "#a", "#b" \n\t"\ - "psraw $1 , "#a" \n\t"\ - "psubw "#t", "#a" \n\t" - -#define IDCT4_1D( s02, s13, d02, d13, t ) \ - SUMSUB_BA ( s02, d02 )\ - SUMSUBD2_AB( s13, d13, t )\ - SUMSUB_BADC( d13, s02, s13, d02 ) - -#define STORE_DIFF_4P( p, t, z ) \ - "psraw $6, "#p" \n\t"\ - "movd (%0), "#t" \n\t"\ - "punpcklbw "#z", "#t" \n\t"\ - "paddsw "#t", "#p" \n\t"\ - "packuswb "#z", "#p" \n\t"\ - "movd "#p", (%0) \n\t" - -static void ff_h264_idct_add_mmx(uint8_t *dst, int16_t *block, int stride) -{ - /* Load dct coeffs */ - __asm__ volatile( - "movq (%0), %%mm0 \n\t" - "movq 8(%0), %%mm1 \n\t" - "movq 16(%0), %%mm2 \n\t" - "movq 24(%0), %%mm3 \n\t" - :: "r"(block) ); - - __asm__ volatile( - /* mm1=s02+s13 mm2=s02-s13 mm4=d02+d13 mm0=d02-d13 */ - IDCT4_1D( %%mm2, %%mm1, %%mm0, %%mm3, %%mm4 ) - - "movq %0, %%mm6 \n\t" - /* in: 1,4,0,2 out: 1,2,3,0 */ - TRANSPOSE4( %%mm3, %%mm1, %%mm0, %%mm2, %%mm4 ) - - "paddw %%mm6, %%mm3 \n\t" - - /* mm2=s02+s13 mm3=s02-s13 mm4=d02+d13 mm1=d02-d13 */ - IDCT4_1D( %%mm4, %%mm2, %%mm3, %%mm0, %%mm1 ) - - "pxor %%mm7, %%mm7 \n\t" - :: "m"(ff_pw_32)); - - __asm__ volatile( - STORE_DIFF_4P( %%mm0, %%mm1, %%mm7) - "add %1, %0 \n\t" - STORE_DIFF_4P( %%mm2, %%mm1, %%mm7) - "add %1, %0 \n\t" - STORE_DIFF_4P( %%mm3, %%mm1, %%mm7) - "add %1, %0 \n\t" - STORE_DIFF_4P( %%mm4, %%mm1, %%mm7) - : "+r"(dst) - : "r" ((x86_reg)stride) - ); -} - -static inline void h264_idct8_1d(int16_t *block) -{ - __asm__ volatile( - "movq 112(%0), %%mm7 \n\t" - "movq 80(%0), %%mm0 \n\t" - "movq 48(%0), %%mm3 \n\t" - "movq 16(%0), %%mm5 \n\t" - - "movq %%mm0, %%mm4 \n\t" - "movq %%mm5, %%mm1 \n\t" - "psraw $1, %%mm4 
\n\t" - "psraw $1, %%mm1 \n\t" - "paddw %%mm0, %%mm4 \n\t" - "paddw %%mm5, %%mm1 \n\t" - "paddw %%mm7, %%mm4 \n\t" - "paddw %%mm0, %%mm1 \n\t" - "psubw %%mm5, %%mm4 \n\t" - "paddw %%mm3, %%mm1 \n\t" - - "psubw %%mm3, %%mm5 \n\t" - "psubw %%mm3, %%mm0 \n\t" - "paddw %%mm7, %%mm5 \n\t" - "psubw %%mm7, %%mm0 \n\t" - "psraw $1, %%mm3 \n\t" - "psraw $1, %%mm7 \n\t" - "psubw %%mm3, %%mm5 \n\t" - "psubw %%mm7, %%mm0 \n\t" - - "movq %%mm4, %%mm3 \n\t" - "movq %%mm1, %%mm7 \n\t" - "psraw $2, %%mm1 \n\t" - "psraw $2, %%mm3 \n\t" - "paddw %%mm5, %%mm3 \n\t" - "psraw $2, %%mm5 \n\t" - "paddw %%mm0, %%mm1 \n\t" - "psraw $2, %%mm0 \n\t" - "psubw %%mm4, %%mm5 \n\t" - "psubw %%mm0, %%mm7 \n\t" - - "movq 32(%0), %%mm2 \n\t" - "movq 96(%0), %%mm6 \n\t" - "movq %%mm2, %%mm4 \n\t" - "movq %%mm6, %%mm0 \n\t" - "psraw $1, %%mm4 \n\t" - "psraw $1, %%mm6 \n\t" - "psubw %%mm0, %%mm4 \n\t" - "paddw %%mm2, %%mm6 \n\t" - - "movq (%0), %%mm2 \n\t" - "movq 64(%0), %%mm0 \n\t" - SUMSUB_BA( %%mm0, %%mm2 ) - SUMSUB_BA( %%mm6, %%mm0 ) - SUMSUB_BA( %%mm4, %%mm2 ) - SUMSUB_BA( %%mm7, %%mm6 ) - SUMSUB_BA( %%mm5, %%mm4 ) - SUMSUB_BA( %%mm3, %%mm2 ) - SUMSUB_BA( %%mm1, %%mm0 ) - :: "r"(block) - ); -} - -static void ff_h264_idct8_add_mmx(uint8_t *dst, int16_t *block, int stride) -{ - int i; - DECLARE_ALIGNED(8, int16_t, b2)[64]; - - block[0] += 32; - - for(i=0; i<2; i++){ - DECLARE_ALIGNED(8, uint64_t, tmp); - - h264_idct8_1d(block+4*i); - - __asm__ volatile( - "movq %%mm7, %0 \n\t" - TRANSPOSE4( %%mm0, %%mm2, %%mm4, %%mm6, %%mm7 ) - "movq %%mm0, 8(%1) \n\t" - "movq %%mm6, 24(%1) \n\t" - "movq %%mm7, 40(%1) \n\t" - "movq %%mm4, 56(%1) \n\t" - "movq %0, %%mm7 \n\t" - TRANSPOSE4( %%mm7, %%mm5, %%mm3, %%mm1, %%mm0 ) - "movq %%mm7, (%1) \n\t" - "movq %%mm1, 16(%1) \n\t" - "movq %%mm0, 32(%1) \n\t" - "movq %%mm3, 48(%1) \n\t" - : "=m"(tmp) - : "r"(b2+32*i) - : "memory" - ); - } - - for(i=0; i<2; i++){ - h264_idct8_1d(b2+4*i); - - __asm__ volatile( - "psraw $6, %%mm7 \n\t" - "psraw $6, %%mm6 \n\t" - "psraw $6, %%mm5 \n\t" - "psraw $6, %%mm4 \n\t" - "psraw $6, %%mm3 \n\t" - "psraw $6, %%mm2 \n\t" - "psraw $6, %%mm1 \n\t" - "psraw $6, %%mm0 \n\t" - - "movq %%mm7, (%0) \n\t" - "movq %%mm5, 16(%0) \n\t" - "movq %%mm3, 32(%0) \n\t" - "movq %%mm1, 48(%0) \n\t" - "movq %%mm0, 64(%0) \n\t" - "movq %%mm2, 80(%0) \n\t" - "movq %%mm4, 96(%0) \n\t" - "movq %%mm6, 112(%0) \n\t" - :: "r"(b2+4*i) - : "memory" - ); - } - - ff_add_pixels_clamped_mmx(b2, dst, stride); -} - -#define STORE_DIFF_8P( p, d, t, z )\ - "movq "#d", "#t" \n"\ - "psraw $6, "#p" \n"\ - "punpcklbw "#z", "#t" \n"\ - "paddsw "#t", "#p" \n"\ - "packuswb "#p", "#p" \n"\ - "movq "#p", "#d" \n" - -#define H264_IDCT8_1D_SSE2(a,b,c,d,e,f,g,h)\ - "movdqa "#c", "#a" \n"\ - "movdqa "#g", "#e" \n"\ - "psraw $1, "#c" \n"\ - "psraw $1, "#g" \n"\ - "psubw "#e", "#c" \n"\ - "paddw "#a", "#g" \n"\ - "movdqa "#b", "#e" \n"\ - "psraw $1, "#e" \n"\ - "paddw "#b", "#e" \n"\ - "paddw "#d", "#e" \n"\ - "paddw "#f", "#e" \n"\ - "movdqa "#f", "#a" \n"\ - "psraw $1, "#a" \n"\ - "paddw "#f", "#a" \n"\ - "paddw "#h", "#a" \n"\ - "psubw "#b", "#a" \n"\ - "psubw "#d", "#b" \n"\ - "psubw "#d", "#f" \n"\ - "paddw "#h", "#b" \n"\ - "psubw "#h", "#f" \n"\ - "psraw $1, "#d" \n"\ - "psraw $1, "#h" \n"\ - "psubw "#d", "#b" \n"\ - "psubw "#h", "#f" \n"\ - "movdqa "#e", "#d" \n"\ - "movdqa "#a", "#h" \n"\ - "psraw $2, "#d" \n"\ - "psraw $2, "#h" \n"\ - "paddw "#f", "#d" \n"\ - "paddw "#b", "#h" \n"\ - "psraw $2, "#f" \n"\ - "psraw $2, "#b" \n"\ - "psubw "#f", "#e" \n"\ - "psubw "#a", "#b" \n"\ - "movdqa 0x00(%1), "#a" 
\n"\ - "movdqa 0x40(%1), "#f" \n"\ - SUMSUB_BA(f, a)\ - SUMSUB_BA(g, f)\ - SUMSUB_BA(c, a)\ - SUMSUB_BA(e, g)\ - SUMSUB_BA(b, c)\ - SUMSUB_BA(h, a)\ - SUMSUB_BA(d, f) +void ff_h264_idct_add_mmx (uint8_t *dst, int16_t *block, int stride); +void ff_h264_idct8_add_mmx (uint8_t *dst, int16_t *block, int stride); +void ff_h264_idct8_add_sse2 (uint8_t *dst, int16_t *block, int stride); +void ff_h264_idct_dc_add_mmx2 (uint8_t *dst, int16_t *block, int stride); +void ff_h264_idct8_dc_add_mmx2(uint8_t *dst, int16_t *block, int stride); -static void ff_h264_idct8_add_sse2(uint8_t *dst, int16_t *block, int stride) -{ - __asm__ volatile( - "movdqa 0x10(%1), %%xmm1 \n" - "movdqa 0x20(%1), %%xmm2 \n" - "movdqa 0x30(%1), %%xmm3 \n" - "movdqa 0x50(%1), %%xmm5 \n" - "movdqa 0x60(%1), %%xmm6 \n" - "movdqa 0x70(%1), %%xmm7 \n" - H264_IDCT8_1D_SSE2(%%xmm0, %%xmm1, %%xmm2, %%xmm3, %%xmm4, %%xmm5, %%xmm6, %%xmm7) - TRANSPOSE8(%%xmm4, %%xmm1, %%xmm7, %%xmm3, %%xmm5, %%xmm0, %%xmm2, %%xmm6, (%1)) - "paddw %4, %%xmm4 \n" - "movdqa %%xmm4, 0x00(%1) \n" - "movdqa %%xmm2, 0x40(%1) \n" - H264_IDCT8_1D_SSE2(%%xmm4, %%xmm0, %%xmm6, %%xmm3, %%xmm2, %%xmm5, %%xmm7, %%xmm1) - "movdqa %%xmm6, 0x60(%1) \n" - "movdqa %%xmm7, 0x70(%1) \n" - "pxor %%xmm7, %%xmm7 \n" - STORE_DIFF_8P(%%xmm2, (%0), %%xmm6, %%xmm7) - STORE_DIFF_8P(%%xmm0, (%0,%2), %%xmm6, %%xmm7) - STORE_DIFF_8P(%%xmm1, (%0,%2,2), %%xmm6, %%xmm7) - STORE_DIFF_8P(%%xmm3, (%0,%3), %%xmm6, %%xmm7) - "lea (%0,%2,4), %0 \n" - STORE_DIFF_8P(%%xmm5, (%0), %%xmm6, %%xmm7) - STORE_DIFF_8P(%%xmm4, (%0,%2), %%xmm6, %%xmm7) - "movdqa 0x60(%1), %%xmm0 \n" - "movdqa 0x70(%1), %%xmm1 \n" - STORE_DIFF_8P(%%xmm0, (%0,%2,2), %%xmm6, %%xmm7) - STORE_DIFF_8P(%%xmm1, (%0,%3), %%xmm6, %%xmm7) - :"+r"(dst) - :"r"(block), "r"((x86_reg)stride), "r"((x86_reg)3L*stride), "m"(ff_pw_32) - ); -} - -static void ff_h264_idct_dc_add_mmx2(uint8_t *dst, int16_t *block, int stride) -{ - int dc = (block[0] + 32) >> 6; - __asm__ volatile( - "movd %0, %%mm0 \n\t" - "pshufw $0, %%mm0, %%mm0 \n\t" - "pxor %%mm1, %%mm1 \n\t" - "psubw %%mm0, %%mm1 \n\t" - "packuswb %%mm0, %%mm0 \n\t" - "packuswb %%mm1, %%mm1 \n\t" - ::"r"(dc) - ); - __asm__ volatile( - "movd %0, %%mm2 \n\t" - "movd %1, %%mm3 \n\t" - "movd %2, %%mm4 \n\t" - "movd %3, %%mm5 \n\t" - "paddusb %%mm0, %%mm2 \n\t" - "paddusb %%mm0, %%mm3 \n\t" - "paddusb %%mm0, %%mm4 \n\t" - "paddusb %%mm0, %%mm5 \n\t" - "psubusb %%mm1, %%mm2 \n\t" - "psubusb %%mm1, %%mm3 \n\t" - "psubusb %%mm1, %%mm4 \n\t" - "psubusb %%mm1, %%mm5 \n\t" - "movd %%mm2, %0 \n\t" - "movd %%mm3, %1 \n\t" - "movd %%mm4, %2 \n\t" - "movd %%mm5, %3 \n\t" - :"+m"(*(uint32_t*)(dst+0*stride)), - "+m"(*(uint32_t*)(dst+1*stride)), - "+m"(*(uint32_t*)(dst+2*stride)), - "+m"(*(uint32_t*)(dst+3*stride)) - ); -} - -static void ff_h264_idct8_dc_add_mmx2(uint8_t *dst, int16_t *block, int stride) -{ - int dc = (block[0] + 32) >> 6; - int y; - __asm__ volatile( - "movd %0, %%mm0 \n\t" - "pshufw $0, %%mm0, %%mm0 \n\t" - "pxor %%mm1, %%mm1 \n\t" - "psubw %%mm0, %%mm1 \n\t" - "packuswb %%mm0, %%mm0 \n\t" - "packuswb %%mm1, %%mm1 \n\t" - ::"r"(dc) - ); - for(y=2; y--; dst += 4*stride){ - __asm__ volatile( - "movq %0, %%mm2 \n\t" - "movq %1, %%mm3 \n\t" - "movq %2, %%mm4 \n\t" - "movq %3, %%mm5 \n\t" - "paddusb %%mm0, %%mm2 \n\t" - "paddusb %%mm0, %%mm3 \n\t" - "paddusb %%mm0, %%mm4 \n\t" - "paddusb %%mm0, %%mm5 \n\t" - "psubusb %%mm1, %%mm2 \n\t" - "psubusb %%mm1, %%mm3 \n\t" - "psubusb %%mm1, %%mm4 \n\t" - "psubusb %%mm1, %%mm5 \n\t" - "movq %%mm2, %0 \n\t" - "movq %%mm3, %1 \n\t" - "movq %%mm4, %2 
\n\t" - "movq %%mm5, %3 \n\t" - :"+m"(*(uint64_t*)(dst+0*stride)), - "+m"(*(uint64_t*)(dst+1*stride)), - "+m"(*(uint64_t*)(dst+2*stride)), - "+m"(*(uint64_t*)(dst+3*stride)) - ); - } -} - -//FIXME this table is a duplicate from h264data.h, and will be removed once the tables from, h264 have been split -static const uint8_t scan8[16 + 2*4]={ - 4+1*8, 5+1*8, 4+2*8, 5+2*8, - 6+1*8, 7+1*8, 6+2*8, 7+2*8, - 4+3*8, 5+3*8, 4+4*8, 5+4*8, - 6+3*8, 7+3*8, 6+4*8, 7+4*8, - 1+1*8, 2+1*8, - 1+2*8, 2+2*8, - 1+4*8, 2+4*8, - 1+5*8, 2+5*8, -}; - -static void ff_h264_idct_add16_mmx(uint8_t *dst, const int *block_offset, DCTELEM *block, int stride, const uint8_t nnzc[6*8]){ - int i; - for(i=0; i<16; i++){ - if(nnzc[ scan8[i] ]) - ff_h264_idct_add_mmx(dst + block_offset[i], block + i*16, stride); - } -} - -static void ff_h264_idct8_add4_mmx(uint8_t *dst, const int *block_offset, DCTELEM *block, int stride, const uint8_t nnzc[6*8]){ - int i; - for(i=0; i<16; i+=4){ - if(nnzc[ scan8[i] ]) - ff_h264_idct8_add_mmx(dst + block_offset[i], block + i*16, stride); - } -} - +void ff_h264_idct_add16_mmx (uint8_t *dst, const int *block_offset, + DCTELEM *block, int stride, const uint8_t nnzc[6*8]); +void ff_h264_idct8_add4_mmx (uint8_t *dst, const int *block_offset, + DCTELEM *block, int stride, const uint8_t nnzc[6*8]); +void ff_h264_idct_add16_mmx2 (uint8_t *dst, const int *block_offset, + DCTELEM *block, int stride, const uint8_t nnzc[6*8]); +void ff_h264_idct_add16intra_mmx (uint8_t *dst, const int *block_offset, + DCTELEM *block, int stride, const uint8_t nnzc[6*8]); +void ff_h264_idct_add16intra_mmx2(uint8_t *dst, const int *block_offset, + DCTELEM *block, int stride, const uint8_t nnzc[6*8]); +void ff_h264_idct8_add4_mmx2 (uint8_t *dst, const int *block_offset, + DCTELEM *block, int stride, const uint8_t nnzc[6*8]); +void ff_h264_idct8_add4_sse2 (uint8_t *dst, const int *block_offset, + DCTELEM *block, int stride, const uint8_t nnzc[6*8]); +void ff_h264_idct_add8_mmx (uint8_t **dest, const int *block_offset, + DCTELEM *block, int stride, const uint8_t nnzc[6*8]); +void ff_h264_idct_add8_mmx2 (uint8_t **dest, const int *block_offset, + DCTELEM *block, int stride, const uint8_t nnzc[6*8]); -static void ff_h264_idct_add16_mmx2(uint8_t *dst, const int *block_offset, DCTELEM *block, int stride, const uint8_t nnzc[6*8]){ - int i; - for(i=0; i<16; i++){ - int nnz = nnzc[ scan8[i] ]; - if(nnz){ - if(nnz==1 && block[i*16]) ff_h264_idct_dc_add_mmx2(dst + block_offset[i], block + i*16, stride); - else ff_h264_idct_add_mmx (dst + block_offset[i], block + i*16, stride); - } - } -} - -static void ff_h264_idct_add16intra_mmx(uint8_t *dst, const int *block_offset, DCTELEM *block, int stride, const uint8_t nnzc[6*8]){ - int i; - for(i=0; i<16; i++){ - if(nnzc[ scan8[i] ] || block[i*16]) - ff_h264_idct_add_mmx(dst + block_offset[i], block + i*16, stride); - } -} - -static void ff_h264_idct_add16intra_mmx2(uint8_t *dst, const int *block_offset, DCTELEM *block, int stride, const uint8_t nnzc[6*8]){ - int i; - for(i=0; i<16; i++){ - if(nnzc[ scan8[i] ]) ff_h264_idct_add_mmx (dst + block_offset[i], block + i*16, stride); - else if(block[i*16]) ff_h264_idct_dc_add_mmx2(dst + block_offset[i], block + i*16, stride); - } -} - -static void ff_h264_idct8_add4_mmx2(uint8_t *dst, const int *block_offset, DCTELEM *block, int stride, const uint8_t nnzc[6*8]){ - int i; - for(i=0; i<16; i+=4){ - int nnz = nnzc[ scan8[i] ]; - if(nnz){ - if(nnz==1 && block[i*16]) ff_h264_idct8_dc_add_mmx2(dst + block_offset[i], block + i*16, stride); - else 
ff_h264_idct8_add_mmx (dst + block_offset[i], block + i*16, stride); - } - } -} - -static void ff_h264_idct8_add4_sse2(uint8_t *dst, const int *block_offset, DCTELEM *block, int stride, const uint8_t nnzc[6*8]){ - int i; - for(i=0; i<16; i+=4){ - int nnz = nnzc[ scan8[i] ]; - if(nnz){ - if(nnz==1 && block[i*16]) ff_h264_idct8_dc_add_mmx2(dst + block_offset[i], block + i*16, stride); - else ff_h264_idct8_add_sse2 (dst + block_offset[i], block + i*16, stride); - } - } -} - -static void ff_h264_idct_add8_mmx(uint8_t **dest, const int *block_offset, DCTELEM *block, int stride, const uint8_t nnzc[6*8]){ - int i; - for(i=16; i<16+8; i++){ - if(nnzc[ scan8[i] ] || block[i*16]) - ff_h264_idct_add_mmx (dest[(i&4)>>2] + block_offset[i], block + i*16, stride); - } -} - -static void ff_h264_idct_add8_mmx2(uint8_t **dest, const int *block_offset, DCTELEM *block, int stride, const uint8_t nnzc[6*8]){ - int i; - for(i=16; i<16+8; i++){ - if(nnzc[ scan8[i] ]) - ff_h264_idct_add_mmx (dest[(i&4)>>2] + block_offset[i], block + i*16, stride); - else if(block[i*16]) - ff_h264_idct_dc_add_mmx2(dest[(i&4)>>2] + block_offset[i], block + i*16, stride); - } -} - -#if HAVE_YASM -static void ff_h264_idct_dc_add8_mmx2(uint8_t *dst, int16_t *block, int stride) -{ - __asm__ volatile( - "movd %0, %%mm0 \n\t" // 0 0 X D - "punpcklwd %1, %%mm0 \n\t" // x X d D - "paddsw %2, %%mm0 \n\t" - "psraw $6, %%mm0 \n\t" - "punpcklwd %%mm0, %%mm0 \n\t" // d d D D - "pxor %%mm1, %%mm1 \n\t" // 0 0 0 0 - "psubw %%mm0, %%mm1 \n\t" // -d-d-D-D - "packuswb %%mm1, %%mm0 \n\t" // -d-d-D-D d d D D - "pshufw $0xFA, %%mm0, %%mm1 \n\t" // -d-d-d-d-D-D-D-D - "punpcklwd %%mm0, %%mm0 \n\t" // d d d d D D D D - ::"m"(block[ 0]), - "m"(block[16]), - "m"(ff_pw_32) - ); - __asm__ volatile( - "movq %0, %%mm2 \n\t" - "movq %1, %%mm3 \n\t" - "movq %2, %%mm4 \n\t" - "movq %3, %%mm5 \n\t" - "paddusb %%mm0, %%mm2 \n\t" - "paddusb %%mm0, %%mm3 \n\t" - "paddusb %%mm0, %%mm4 \n\t" - "paddusb %%mm0, %%mm5 \n\t" - "psubusb %%mm1, %%mm2 \n\t" - "psubusb %%mm1, %%mm3 \n\t" - "psubusb %%mm1, %%mm4 \n\t" - "psubusb %%mm1, %%mm5 \n\t" - "movq %%mm2, %0 \n\t" - "movq %%mm3, %1 \n\t" - "movq %%mm4, %2 \n\t" - "movq %%mm5, %3 \n\t" - :"+m"(*(uint64_t*)(dst+0*stride)), - "+m"(*(uint64_t*)(dst+1*stride)), - "+m"(*(uint64_t*)(dst+2*stride)), - "+m"(*(uint64_t*)(dst+3*stride)) - ); -} - -extern void ff_x264_add8x4_idct_sse2(uint8_t *dst, int16_t *block, int stride); - -static void ff_h264_idct_add16_sse2(uint8_t *dst, const int *block_offset, DCTELEM *block, int stride, const uint8_t nnzc[6*8]){ - int i; - for(i=0; i<16; i+=2) - if(nnzc[ scan8[i+0] ]|nnzc[ scan8[i+1] ]) - ff_x264_add8x4_idct_sse2 (dst + block_offset[i], block + i*16, stride); -} - -static void ff_h264_idct_add16intra_sse2(uint8_t *dst, const int *block_offset, DCTELEM *block, int stride, const uint8_t nnzc[6*8]){ - int i; - for(i=0; i<16; i+=2){ - if(nnzc[ scan8[i+0] ]|nnzc[ scan8[i+1] ]) - ff_x264_add8x4_idct_sse2 (dst + block_offset[i], block + i*16, stride); - else if(block[i*16]|block[i*16+16]) - ff_h264_idct_dc_add8_mmx2(dst + block_offset[i], block + i*16, stride); - } -} - -static void ff_h264_idct_add8_sse2(uint8_t **dest, const int *block_offset, DCTELEM *block, int stride, const uint8_t nnzc[6*8]){ - int i; - for(i=16; i<16+8; i+=2){ - if(nnzc[ scan8[i+0] ]|nnzc[ scan8[i+1] ]) - ff_x264_add8x4_idct_sse2 (dest[(i&4)>>2] + block_offset[i], block + i*16, stride); - else if(block[i*16]|block[i*16+16]) - ff_h264_idct_dc_add8_mmx2(dest[(i&4)>>2] + block_offset[i], block + i*16, stride); - } -} -#endif 
+void ff_h264_idct_add16_sse2 (uint8_t *dst, const int *block_offset, DCTELEM *block, + int stride, const uint8_t nnzc[6*8]); +void ff_h264_idct_add16intra_sse2(uint8_t *dst, const int *block_offset, DCTELEM *block, + int stride, const uint8_t nnzc[6*8]); +void ff_h264_idct_add8_sse2 (uint8_t **dest, const int *block_offset, DCTELEM *block, + int stride, const uint8_t nnzc[6*8]); /***********************************/ /* deblocking */ @@ -745,6 +259,10 @@ { int mm_flags = av_get_cpu_flags(); + if (mm_flags & AV_CPU_FLAG_MMX2) { + c->h264_loop_filter_strength= h264_loop_filter_strength_mmx2; + } +#if HAVE_YASM if (mm_flags & AV_CPU_FLAG_MMX) { c->h264_idct_dc_add= c->h264_idct_add= ff_h264_idct_add_mmx; @@ -764,15 +282,6 @@ c->h264_idct_add8 = ff_h264_idct_add8_mmx2; c->h264_idct_add16intra= ff_h264_idct_add16intra_mmx2; - c->h264_loop_filter_strength= h264_loop_filter_strength_mmx2; - } - if(mm_flags & AV_CPU_FLAG_SSE2){ - c->h264_idct8_add = ff_h264_idct8_add_sse2; - c->h264_idct8_add4= ff_h264_idct8_add4_sse2; - } - -#if HAVE_YASM - if (mm_flags & AV_CPU_FLAG_MMX2){ c->h264_v_loop_filter_chroma= ff_x264_deblock_v_chroma_mmxext; c->h264_h_loop_filter_chroma= ff_x264_deblock_h_chroma_mmxext; c->h264_v_loop_filter_chroma_intra= ff_x264_deblock_v_chroma_intra_mmxext; @@ -802,6 +311,9 @@ c->biweight_h264_pixels_tab[7]= ff_h264_biweight_4x2_mmx2; if (mm_flags&AV_CPU_FLAG_SSE2) { + c->h264_idct8_add = ff_h264_idct8_add_sse2; + c->h264_idct8_add4= ff_h264_idct8_add4_sse2; + c->weight_h264_pixels_tab[0]= ff_h264_weight_16x16_sse2; c->weight_h264_pixels_tab[1]= ff_h264_weight_16x8_sse2; c->weight_h264_pixels_tab[2]= ff_h264_weight_8x16_sse2; @@ -832,6 +344,6 @@ c->biweight_h264_pixels_tab[4]= ff_h264_biweight_8x4_ssse3; } } + } #endif - } }
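
For reference, the loops and per-block kernels that the new yasm code implements have straightforward C models. IDCT4_1D (used by IDCT4_ADD and x264_add8x4_idct_sse2) is the standard H.264 4x4 integer inverse transform applied once per row and once per column; a scalar sketch, assuming the usual spec formulation rather than the exact register assignment of the macro:

#include <stdint.h>

/* One 1-D pass of the 4x4 H.264 inverse transform; the asm applies it to
 * the rows, transposes, adds the rounding constant 32 (pw_32), applies it
 * to the columns, then shifts by 6 and adds the result to the destination
 * (STORE_DIFFx2). d and s may point to the same array. */
static void idct4_1d_ref(int16_t d[4], const int16_t s[4])
{
    int z0 = s[0] + s[2];
    int z1 = s[0] - s[2];
    int z2 = (s[1] >> 1) - s[3];
    int z3 = s[1] + (s[3] >> 1);
    d[0] = z0 + z3;
    d[1] = z1 + z2;
    d[2] = z1 - z2;
    d[3] = z0 - z3;
}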
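
The DC-only path (DC_ADD_MMX2_INIT/DC_ADD_MMX2_OP and h264_idct_dc_add8_mmx2) rounds and shifts the DC coefficient and adds it to every pixel of the block with clipping; the asm gets the clipping for free by paddusb-ing the positive part and psubusb-ing the negated part. A scalar model of the 4x4 case, matching the removed ff_h264_idct_dc_add_mmx2 (the 8x8 and paired-chroma variants only change the loop bounds and the number of DC values):

/* Add a single rounded DC value to a 4x4 block of pixels, clipping each
 * result to [0,255]. */
static void idct_dc_add_4x4_ref(uint8_t *dst, const int16_t *block, int stride)
{
    int dc = (block[0] + 32) >> 6;
    for (int y = 0; y < 4; y++, dst += stride)
        for (int x = 0; x < 4; x++) {
            int v = dst[x] + dc;
            dst[x] = v < 0 ? 0 : v > 255 ? 255 : v;
        }
}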
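
The add16 entry points iterate the 16 luma 4x4 blocks, look each block index up through scan8 to index the nnzc cache, and add the block's IDCT when it has non-zero coefficients (the mmx2/sse2 variants additionally dispatch to the DC-only add when only the DC coefficient is set). The plain MMX version corresponds to the wrapper removed from h264dsp_mmx.c:

/* C model of the loop h264_idct_add16_mmx now runs in asm;
 * ff_h264_idct_add_mmx and scan8 are the function/table from this patch. */
static void idct_add16_ref(uint8_t *dst, const int *block_offset,
                           int16_t *block, int stride, const uint8_t nnzc[6*8])
{
    for (int i = 0; i < 16; i++)
        if (nnzc[scan8[i]])
            ff_h264_idct_add_mmx(dst + block_offset[i], block + i*16, stride);
}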
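
h264_idct_add8() — the function h264.c now calls directly for chroma instead of looping itself — walks blocks 16..23, picks the U or V destination plane with (i&4)>>2, and falls back to the DC-only add when only the DC coefficient is non-zero, mirroring the removed ff_h264_idct_add8_mmx2 wrapper:

/* C model of h264_idct_add8_mmx2; dest[0]/dest[1] are the U and V planes. */
static void idct_add8_ref(uint8_t **dest, const int *block_offset,
                          int16_t *block, int stride, const uint8_t nnzc[6*8])
{
    for (int i = 16; i < 16 + 8; i++) {
        uint8_t *dst = dest[(i & 4) >> 2] + block_offset[i];
        if (nnzc[scan8[i]])
            ff_h264_idct_add_mmx(dst, block + i*16, stride);
        else if (block[i*16])
            ff_h264_idct_dc_add_mmx2(dst, block + i*16, stride);
    }
}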