`

ARM NEON

阅读更多

http://www.arm.com/products/processors/technologies/neon.php

http://www.arm.com/files/pdf/NEONSupportintheRealviewCompiler.pdf

http://hilbert-space.de/?p=22

 

 

ARM NEON Optimization. An Example
Filed under: Beagleboard, OMAP3530 — Nils @ 8:15 pm

Since there is so little information about NEON optimizations out there I thought I’d write a little about it.

Some weeks ago someone on the beagle-board mailing-list asked how to optimize a color to grayscale conversion for images. I haven’t done much pixel processing with ARM NEON yet, so I gave if a try. The results I got where quite spectacular, but more on this later.

For the color to grayscale conversion I used a very simple conversion scheme: A weighted average of the red, green and blue components. This conversion ignores the effect of gamma but works good enough in practice. Also I decided not to do proper rounding. It’s just an example after all.

First a reference implementation in C:

void reference_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
  int i;
  for (i=0; i<n; i++)
  {
    int r = *src++; // load red
    int g = *src++; // load green
    int b = *src++; // load blue

    // build weighted average:
    int y = (r*77)+(g*151)+(b*28);

    // undo the scale by 256 and write to memory:
    *dest++ = (y>>8);
  }
}

Optimization with NEON Intrinsics

Lets start optimizing the code using the compiler intrinsics. Intrinsics are nice to use because you they behave just like C-functions but compile to a single assembler statement. At least in theory as I’ll show you later..

Since NEON works in 64 or 128 bit registers it’s best to process eight pixels in parallel. That way we can exploit the parallel nature of the SIMD-unit. Here is what I came up with:

void neon_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
  int i;
  uint8x8_t rfac = vdup_n_u8 (77);
  uint8x8_t gfac = vdup_n_u8 (151);
  uint8x8_t bfac = vdup_n_u8 (28);
  n/=8;

  for (i=0; i<n; i++)
  {
    uint16x8_t  temp;
    uint8x8x3_t rgb  = vld3_u8 (src);
    uint8x8_t result;

    temp = vmull_u8 (rgb.val[0],      rfac);
    temp = vmlal_u8 (temp,rgb.val[1], gfac);
    temp = vmlal_u8 (temp,rgb.val[2], bfac);

    result = vshrn_n_u16 (temp, 8);
    vst1_u8 (dest, result);
    src  += 8*3;
    dest += 8;
  }
}

Lets take a look at it step by step:

First off I load my weight factors into three NEON registers. The vdup.8 instruction does this and also replicates the byte into all 8 bytes of the NEON register.

    uint8x8_t rfac = vdup_n_u8 (77);
    uint8x8_t gfac = vdup_n_u8 (151);
    uint8x8_t bfac = vdup_n_u8 (28);


Now I load 8 pixels at once into three registers.

    uint8x8x3_t rgb  = vld3_u8 (src);

The vld3.8 instruction is a specialty of the NEON instruction set. With NEON you can not only do loads and stores of multiple registers at once, you can de-interleave the data on the fly as well. Since I expect my pixel data to be interleaved the vld3.8 instruction is a perfect fit for a tight loop.

After the load, I have all the red components of 8 pixels in the first loaded register. The green components end up in the second and blue in the third.

Now calculate the weighted average:

    temp = vmull_u8 (rgb.val[0],      rfac);
    temp = vmlal_u8 (temp,rgb.val[1], gfac);
    temp = vmlal_u8 (temp,rgb.val[2], bfac);

vmull.u8 multiplies each byte of the first argument with each corresponding byte of the second argument. Each result becomes a 16 bit unsigned integer, so no overflow can happen. The entire result is returned as a 128 bit NEON register pair.

vmlal.u8 does the same thing as vmull.u8 but also adds the content of another register to the result.

So we end up with just three instructions for weighted average of eight pixels. Nice.

Now it’s time to undo the scaling of the weight factors. To do so I shift each 16 bit result to the right by 8 bits. This equals to a division by 256. ARM NEON has lots of instructions to do the shift, but also a “narrow” variant exists. This one does two things at once: It does the shift and afterwards converts the 16 bit integers back to 8 bit by removing all the high-bytes from the result. We get back from the 128 bit register pair to a single 64 bit register.

    result = vshrn_n_u16 (temp, 8);

And finally store the result.

    vst1_u8 (dest, result);

First Results:

How does the reference C-function and the NEON optimized version compare? I did a test on my Omap3 CortexA8 CPU on the beagle-board and got the following timings:

C-version:       15.1 cycles per pixel.
NEON-version:     9.9 cycles per pixel.

That’s only a speed-up of factor 1.5. I expected much more from the NEON implementation. It processes 8 pixels with just 6 instructions after all. What’s going on here? A look at the assembler output explained it all. Here is the inner-loop part of the convert function:

 160:   f46a040f        vld3.8  {d16-d18}, [sl]
 164:   e1a0c005        mov     ip, r5
 168:   ecc80b06        vstmia  r8, {d16-d18}
 16c:   e1a04007        mov     r4, r7
 170:   e2866001        add     r6, r6, #1      ; 0x1
 174:   e28aa018        add     sl, sl, #24     ; 0x18
 178:   e8bc000f        ldm     ip!, {r0, r1, r2, r3}
 17c:   e15b0006        cmp     fp, r6
 180:   e1a08005        mov     r8, r5
 184:   e8a4000f        stmia   r4!, {r0, r1, r2, r3}
 188:   eddd0b06        vldr    d16, [sp, #24]
 18c:   e89c0003        ldm     ip, {r0, r1}
 190:   eddd2b08        vldr    d18, [sp, #32]
 194:   f3c00ca6        vmull.u8        q8, d16, d22
 198:   f3c208a5        vmlal.u8        q8, d18, d21
 19c:   e8840003        stm     r4, {r0, r1}
 1a0:   eddd3b0a        vldr    d19, [sp, #40]
 1a4:   f3c308a4        vmlal.u8        q8, d19, d20
 1a8:   f2c80830        vshrn.i16       d16, q8, #8
 1ac:   f449070f        vst1.8  {d16}, [r9]
 1b0:   e2899008        add     r9, r9, #8      ; 0x8
 1b4:   caffffe9        bgt     160

Note the store at offset 168? The compiler decides to write the three registers onto the stack. After a bit of useless memory accesses from the GPP side the compiler reloads them (offset 188, 190 and 1a0) in exactly the same physical NEON register.

What all the ordinary integer instructions do? I have no idea. Lots of memory accesses target the stack for no good reason. There is definitely no shortage of registers anywhere. For reference: I used the GCC 4.3.3 (CodeSourcery 2009q1 lite) compiler .
NEON and assembler

Since the compiler can’t generate good code I wrote the same loop in assembler. In a nutshell I just took the intrinsic based loop and converted the instructions one by one. The loop-control is a bit different, but that’s all.

convert_asm_neon:

      # r0: Ptr to destination data
      # r1: Ptr to source data
      # r2: Iteration count:

        push           {r4-r5,lr}
      lsr         r2, r2, #3

      # build the three constants:
      mov         r3, #77
      mov         r4, #151
      mov         r5, #28
      vdup.8      d3, r3
      vdup.8      d4, r4
      vdup.8      d5, r5

  .loop:

      # load 8 pixels:
      vld3.8      {d0-d2}, [r1]!

      # do the weight average:
      vmull.u8    q3, d0, d3
      vmlal.u8    q3, d1, d4
      vmlal.u8    q3, d2, d5

      # shift and store:
      vshrn.u16   d6, q3, #8
      vst1.8      {d6}, [r0]!

      subs        r2, r2, #1
      bne         .loop

      pop         { r4-r5, pc }

Final Results:

Time for some benchmarking again. How does the hand-written assembler version compares? Well – here are the results:

  C-version:       15.1 cycles per pixel.
  NEON-version:     9.9 cycles per pixel.
  Assembler:        2.0 cycles per pixel.

That’s roughly a factor of five over the intrinsic version and 7.5 times faster than my not-so-bad C implementation. And keep in mind: I didn’t even optimized the assembler loop.

My conclusion: If you want performance out of your NEON unit stay away from the intrinsics. They are nice as a prototyping tool. Use them to get your algorithm working and then rewrite the NEON-parts of it in assembler.

Btw: Sorry for the ugly syntax-highlighting. I’m still looking for a nice wordpress plug-in.

分享到:
评论
1 楼 laiyangdeli 2011-03-10  
NEON is a hybrid 64/128 bit SIMD architecture extension to the ARM v7-A profile,
targeted at multimedia applications. Positioning NEON within the processor allows it to
share the CPU resources for integer operation, loop control, and caching, significantly
reducing the area and power cost compared with a CPU plus hardware accelerator
combination. SIMD (Single Instruction Multiple Data) is where one instruction acts on
multiple data items, usually carrying out the same operation for all data.
The use of NEON instead of a CPU plus hardware accelerator combination allows savings
to be made in software development time as it creates a much simpler programming model
without forcing the programmer to search for ad-hoc concurrency and scheduling points.

相关推荐

    ARM NEON 使用手册

    ARM NEON是什么东西我就不多做介绍了,我觉得想用这技术的多半是高手,高手一般都有CSDN下载分的!哈哈哈!如果没有下载分,私信我留下邮箱,我会发给你。具体看链接:...

    ARM NEON 内建函数中文手册

    ARM NEON 查找手册,可以查找neon内建函数的功能以及入参和返回值类型; RVCT 提供在 ARM 和 Thumb 状态下为 Cortex-A8 处理器生成 NEON 代码的内在 函数。 NEON 内在函数在头文件 arm_neon.h 中定义。头文件既...

    Arm Neon Intrinsics Reference

    Arm Neon Intrinsics Reference Arm Neon Intrinsics Reference是 Arm 公司发布的一份关于 Neon intrinsics 的详细参考手册。Neon intrinsics 是 Arm 处理器架构中的一组指令集,用于实现高性能的数字信号处理、...

    基于C语言的neon_osd_Draw ARM Neon加速OSD点阵设计源码

    neon_osd_Draw是一个基于C语言开发的ARM Neon加速OSD点阵项目,包含36个文件,其中包括8个头文件、8个C源文件、5个文本文件、3个Markdown文件、3个PDF文件、2个PNG图片文件、1个Git忽略文件和1个LICENSE文件。...

    ARM Neon 编程指导

    ### ARM Neon编程指导知识点概述 本篇文档主要围绕ARM Neon技术进行深入讲解,旨在为开发者提供一份详尽的编程指南。以下将从ARM Neon的基本概念、应用领域、优化技巧等方面展开论述。 ### 一、ARM Neon简介 ####...

    arm neon programmers guide

    ARM NEON程序员指南是ARM公司为开发者提供的一份详尽文档,旨在帮助他们充分利用NEON向量处理单元进行高效计算。NEON是ARM架构中的一种单指令多数据(SIMD)技术,它允许同时处理多个数据元素,尤其适用于多媒体、...

    arm neon指令集说明

    ### ARM NEON指令集详解 #### 一、初始化寄存器 NEON指令集提供了多种方式来初始化向量寄存器。以下是一些常见的初始化指令: - **`vcreate_type`**:该指令用于创建一个特定类型的向量,其中包含了一个64位的...

    ARM Neon指令的介绍

    ARM Neon 指令的介绍 ARM Neon 指令是一种高性能的 SIMD(Single Instruction, Multiple Data)指令集,用于 Arm 处理器架构的矩阵计算和图形处理。下面是 ARM Neon 指令的详细介绍: 什么是 ARM Neon? ARM Neon...

    ARM NEON指令集.docx

    ### ARM NEON指令集概述及应用 #### 一、SIMD技术概览 单指令多数据(Single Instruction Multiple Data, SIMD)技术是一种重要的处理器技术,它允许一条指令同时作用于多个数据,以此来提高计算效率。SIMD的概念...

    基于C语言实现使用ARM NEON指令优化代码的例子源码.zip

    基于C语言实现使用ARM NEON指令优化代码的例子源码.zip基于C语言实现使用ARM NEON指令优化代码的例子源码.zip基于C语言实现使用ARM NEON指令优化代码的例子源码.zip基于C语言实现使用ARM NEON指令优化代码的例子源码...

    ARM Neon 整体介绍

    ARM Neon 是一种针对ARM处理器的向量浮点单元(VFP)扩展,旨在增强处理器在多媒体处理、信号处理以及近年来的人工智能应用中的性能。它引入了单指令多数据(SIMD)技术,允许处理器在同一时钟周期内处理多个数据...

    arm neon优化指令集

    ARM NEON优化指令集是ARM处理器架构中的一部分,主要面向需要处理多媒体数据的应用,比如音频、视频和图形处理。NEON指令集通过提供一系列专门的SIMD(单指令多数据)指令,能够同时处理多组数据,极大提升数据处理...

    exposition fast calculation with ARM NEON

    《ARM NEON加速图像曝光计算详解》 在移动设备开发领域,尤其是在图像处理技术中,高效计算能力至关重要。本文将深入探讨如何利用ARM NEON技术进行快速的图像曝光计算,帮助开发者实现更优化的性能。 ARM NEON是...

    ARM NEON优化开发

    ARM NEON技术是ARM架构中的一个重要组成部分,它是一种先进的单指令多数据(SIMD)处理单元。SIMD允许处理器在一条指令内对多个数据元素同时进行相同的操作。这种技术在图像处理、音频视频处理等多媒体应用中尤其...

    ARM Neon优化指南

    NEON 技术可加速多媒体和信号处理算法(如视频编码/解码、2D/3D 图形、游戏、音频和语音处理、图像处理技术、电话和声音合成),其性能至少为ARMv5 性能的3倍,为 ARMv6 SIMD性能的2倍。 关于SIMD和SISD:Single ...

    ARM NEON技术在车位识别算法中的应用.pdf

    ARM NEON 技术在车位识别算法中的应用 ARM NEON 技术是一种高性能的媒介处理技术,广泛应用于嵌入式系统、移动设备、服务器等领域。NEON 技术可以实现在 ARM 处理器架构上进行高效的媒介处理,提高图像处理速度,...

    基于ARM NEON的静态YUV图像缩小技术.pdf

    基于ARM NEON的静态YUV图像缩小技术 基于ARM NEON的静态YUV图像缩小技术是指使用ARM架构处理器扩展结构(ARM NEON)的静态YUV图像处理技术,以提高图像处理速度、质量和适用范围。该技术在视频监控领域广泛应用于...

    Arm NEON 介绍指南

    Arm NEON 介绍指南

    fftw arm neon lib.a v338

    在给定的标题“fftw arm neon lib.a v338”中,我们可以推断出这是一个针对ARM架构并利用NEON向量化指令集优化的FFTW库版本3.3.8。NEON是ARM处理器的一个高性能的SIMD(单指令多数据)单元,用于加速多媒体和计算...

    arm neon定点运算测试程序

    arm neon定点运算测试程序 neon一般用于浮点运算, 但定点运算也有一定效果 程序就是对neon定点运算进行测试

Global site tag (gtag.js) - Google Analytics