Differences among the ARM Cortex-A series processors (Cortex-A5, Cortex-A7, Cortex-A8, Cortex-A9, Cortex-A15)

| Category | Cortex-A5 | Cortex-A7 | Cortex-A8 | Cortex-A9 | Cortex-A15 |
| --- | --- | --- | --- | --- | --- |
| Release date | December 2009 | October 2011 | July 2006 | March 2008 | April 2011 |
| Clock frequency | ~1GHz | ~1GHz on 28nm | ~1GHz on 65nm | ~2GHz on 40nm | ~2.5GHz on 28nm |
| Execution order | In-order | In-order | In-order | Out of order | Out of order |
| Multicore support | 1 to 4 | 1 to 4 | 1 (single-core only) | 1 to 4 | 1 to 4 |
| Peak instruction throughput | 1.6 DMIPS/MHz | 1.9 DMIPS/MHz | 2 DMIPS/MHz | 2.5 DMIPS/MHz | 3.5 DMIPS/MHz |
| VFP/NEON support | VFPv4/NEON | VFPv4/NEON | VFPv3/NEON | VFPv3/NEON | VFPv4/NEON |
| LPAE (40-bit physical address) | No | Yes | No | No | Yes |
| Hardware virtualization | No | Yes | No | No | Yes |
| big.LITTLE | No | LITTLE | No | No | Big |
| Pipeline stages | 8 | 8 | 13 | 9 to 12 | 15+ |
| Instruction decode | 1 | Partial dual issue | 2 (dual-issue) | 2 (dual-issue) | 3 |
| Return stack entries | 4 | 8 | 8 | 8 | 48 |
| Floating-point unit (FPU) | Optional | Optional | Yes | Optional | Optional |
| AMBA bus width | 64-bit I/F, AMBA 3 | 128-bit I/F, AMBA 4 | 64 or 128-bit I/F, AMBA 3 | 64-bit I/F, AMBA 3 | 128-bit |
| L1 data cache size | 4KB to 64KB | 8KB to 64KB | 16/32KB | 16KB/32KB/64KB | 32KB |
| L1 instruction cache size | 4KB to 64KB | 8KB to 64KB | 16/32KB | 16KB/32KB/64KB | 32KB |
| L1 cache structure | 2-way set associative (Inst), 4-way set associative (Data) | 2-way set associative (Inst), 4-way set associative (Data) | 4-way set associative | 4-way set associative (Inst), 4-way set associative (Data) | 2-way set associative (Inst), 4-way set associative (Data) |
| L2 cache type | External | Integrated | Integrated | External | Integrated |
| L2 cache size | - | 128KB to 1MB | 128KB to 1MB | - | 512KB to 1MB |
| L2 cache structure | - | 8-way set associative | 8-way set associative | - | 8-way set associative |
| Cache line (bytes) | 32 | 32 | 64 | 32 | 64 |

The table also has rows for the half-precision extension (16-bit floating point), FP/NEON register renaming, GP register renaming, the hardware divider, and fused MAC (multiply-accumulate), but their per-processor values are missing here. The one surviving entry, in the half-precision row, reads "No, only 32-bit single-precision and 64-bit double-precision floating point".
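As a rough way to read the clock-frequency and DMIPS/MHz rows together, peak Dhrystone throughput per core can be estimated as DMIPS/MHz multiplied by the clock in MHz. The sketch below is only illustrative: the helper name `peak_dmips` is made up for this example, and the clock values are the approximate "~" figures from the table, so the results are ballpark numbers at best.

```python
# Approximate per-core figures taken from the table: (DMIPS/MHz, clock in MHz).
TABLE = {
    "Cortex-A5": (1.6, 1000),
    "Cortex-A7": (1.9, 1000),
    "Cortex-A8": (2.0, 1000),
    "Cortex-A9": (2.5, 2000),
    "Cortex-A15": (3.5, 2500),
}


def peak_dmips(dmips_per_mhz, clock_mhz):
    """Estimate peak DMIPS for one core (hypothetical helper)."""
    return dmips_per_mhz * clock_mhz


if __name__ == "__main__":
    for name, (per_mhz, mhz) in TABLE.items():
        print(f"{name}: ~{peak_dmips(per_mhz, mhz):.0f} DMIPS per core")
```

These are peak figures for a synthetic benchmark, so they say little about power or real workloads.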


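White paper: Software Techniques for ARM big.LITTLE Systems

Robin Randhawa, Principal Engineer, April 2013

Introduction

Mobile applications have changed significantly, and today's consumers rely on their smartphones for most of their connected lives. This includes high-performance tasks such as web browsing, navigation, and gaming, as well as less demanding "always on, always connected" background tasks such as voice calls, social networking, and email. As a result, the mobile phone has become an indispensable computing device for many consumers. At the same time, new mobile devices such as tablets are redefining the computing platform in response to consumer demand. This trend gives consumers entirely new ways to interact with content and brings applications that were once possible only on stationary devices to mobile devices. This is truly smart, next-generation computing.

Where does Moore's law go from here? It was predicted that the number of transistors on an integrated circuit would double every 18 months, growing from thousands to billions of transistors. Yet a close look at a single processor shows that its performance growth has stalled, because the power that can be consumed in a system has reached its peak.

For any future processor, heat dissipation will inevitably limit large increases in speed. Once a device hits its thermal barrier it will melt, or, in a mobile phone, the device starts to heat up and becomes uncomfortable for the user. Beyond the physical heat problem, energy efficiency also becomes quite poor. If a processor implementation is tuned to run faster and faster, its energy consumption grows exponentially, and squeezing out that last bit of performance sharply raises the cost. In the past, doubling the size also meant doubling the speed; today, doubling the size yields only a few percent more speed, so the benefit is lost to complexity. This is one reason single-core systems have hit a speed limit.

If you cannot make a single core run faster, you must increase the number of cores. Doing so also allows each core to be matched to the workload it carries, and this is where the ARM big.LITTLE™ processing concept comes in.

big.LITTLE processing addresses one of the biggest challenges we face today: extending consumers' "always on, always connected" mobile experience while improving performance and extending battery life. It does this by pairing a "big" multicore processor with a "LITTLE" multicore processor and seamlessly selecting the right processor for the right task based on performance requirements. Importantly, this dynamic selection is transparent to the application software or middleware running on the processors. The latest generation of big.LITTLE designs used in devices combines a high-performance Cortex™-A15 multiprocessor cluster with an energy-efficient Cortex-A7 multiprocessor cluster. These processors are 100% architecturally compatible and have the same functionality (support for LPAE and the virtualization extensions, plus functional units such as NEON™ and VFP), so software applications compiled for one processor type can run on the other without modification.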

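Notes on .align 5 and similar directives

You often see the following directive in arm-linux assembly:

.align n

It makes the code that follows align according to a certain rule.

There are two conventions for the alignment value of the .align n directive: n or 2^n. The original assemblers on most platforms were not gas, and both conventions 1 and 2 were widely used. gas aims to replace those original assemblers and so must stay compatible with them; as a result, the way gas interprets .align can look somewhat confusing, and compatibility is the reason. arm-linux aligns using the 2^n convention. Note that this alignment is different from alignment in an ld script; they are not the same thing. The English text below describes alignment on different platforms. The gas info page for version 2.11.92.0.12 (on Mandrake 8.2) says: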

The way the required alignment is specified varies from system to system. For the a29k, hppa, m68k, m88k, w65, sparc, and Hitachi SH, and i386 using ELF format, the first expression is the alignment request in bytes. For example .align 8 advances the location counter until it is a multiple of 8. If the location counter is already a multiple of 8, no change is needed.

For other systems, including the i386 using a.out format, and the arm and strongarm, it is the number of low-order zero bits the location counter must have after advancement. For example `.align 3' advances the location counter until it a multiple of 8. If the location counter is already a multiple of 8, no change is needed.
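The two conventions quoted above can be compared with a small calculation. This is a minimal sketch, not assembler code: the function names `align_bytes` and `align_pow2` are invented for illustration, and the location counter is modelled as a plain integer.

```python
def align_bytes(counter, n):
    """Convention 1: n is the alignment in bytes (e.g. i386 ELF)."""
    return (counter + n - 1) // n * n


def align_pow2(counter, n):
    """Convention 2: n is the number of low-order zero bits (e.g. arm)."""
    boundary = 1 << n  # the alignment in bytes is 2**n
    return (counter + boundary - 1) & ~(boundary - 1)


if __name__ == "__main__":
    counter = 0x24
    # ".align 8" under convention 1 and ".align 3" under convention 2
    # both advance to a multiple of 8.
    print(hex(align_bytes(counter, 8)), hex(align_pow2(counter, 3)))
    # ".align 5" under convention 2 advances to a multiple of 32.
    print(hex(align_pow2(counter, 5)))
```

Under the second convention the same `.align 5` therefore requests a 32-byte boundary, while under the first it would request a 5-byte one.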

From this passage, .align 5 on ARM means alignment to 2^5, that is, 32-byte alignment. The alignment can also be seen by disassembling the output:

Disassembly:

Some advice:

In the future, every time you build an ELF file, generate a map file at the same time. That way you will avoid mistakes like this alignment one.

Also, please pick up some linker script knowledge. For embedded systems, we frequently adjust the linker script to tune an image, for example to align certain special sections for protection and/or cache purposes. Hope this helps.


Why GEMM is at the heart of deep learning

I spend most of my time worrying about how to make deep learning with neural networks faster and more power efficient. In practice that means focusing on a function called GEMM. It’s part of the BLAS (Basic Linear Algebra Subprograms) library that was first created in 1979, and until I started trying to optimize neural networks I’d never heard of it.
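For readers who have not met it either, GEMM stands for General Matrix to Matrix Multiplication and computes C = alpha * A * B + beta * C. The sketch below is a deliberately naive pure-Python version meant only to show what the operation does; the function name `gemm` and the list-of-lists layout are choices made for this example, and real BLAS implementations are heavily optimized and look nothing like this.

```python
def gemm(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for matrices stored as lists of rows.

    a is m x k, b is k x n, c is m x n. Illustrative only.
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]  # dot product of row i and column j
            out[i][j] = alpha * acc + beta * c[i][j]
    return out


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c = [[1, 1], [1, 1]]
    print(gemm(1.0, a, b, 0.0, c))
```

The triple loop does m * n * k multiply-adds, which hints at why this one function can dominate the runtime.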

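Installing the RNDIS driver on a Windows 7 PC

This tutorial shares the correct way to install the RNDIS driver on a Windows 7 PC. What is the RNDIS driver? RNDIS is the Remote Network Driver Interface Specification: the device connects to the host over USB and emulates a network connection, which is used for downloading and debugging. However, many Windows 7 users find that installation fails when they set up an RNDIS device, so this post explains the correct way to install the driver.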


Roughly estimating the cost of each line of shader code

GPU IS a processor (graphics processing unit). Anywho, I remember seeing somewhere that in GeForce 6 series cards it's a single cycle (maybe I was just dreaming :-p) but I have that memory

Radeon X800 has it anyways
EDIT:

Quote:

ORIGINALLY AT: http://gear.ibuypower.com/GVE/Store/ProductDetails.aspx?sku=VC-POWERC-147
Smartshader HD
• Support for Microsoft® DirectX® 9.0 programmable vertex and pixel shaders in hardware
• DirectX 9.0 Vertex Shaders
- Vertex programs up to 65,280 instructions with flow control
- Single cycle trigonometric operations (SIN & COS)
• DirectX 9.0 Extended Pixel Shaders
- Up to 1,536 instructions and 16 textures per rendering pass
- 32 temporary and constant registers
- Facing register for two-sided lighting
- 128-bit, 64-bit & 32-bit per pixel floating point color formats
- Multiple Render Target (MRT) support
• Complete feature set also supported in OpenGL® via extensions


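Android Gradle Plugin source code analysis: externalNativeBuild

Starting with the Android Gradle Plugin versions that ship with Android Studio 2.2, Google integrated solid support for cmake, and support for the older ndkBuild approach also improved. This article looks at how the Android Gradle Plugin relates to cross-compilation, that is, the tasks related to externalNativeBuild, mainly by reading through the relevant source code of the gradle build system.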


Overriding a default option(…) value in CMake from a parent CMakeLists.txt

CMakeLists.txt

CMakeLists.txt

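When running the following command:

you may find that BUILD_FOR_ANDROID does not necessarily take effect in the generated configuration.

The following configuration is needed for it to work: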
CMakeLists.txt



Use ccache with CMake for faster compilation

C and C++ compilers aren’t the fastest pieces of software out there, and there’s no lack of programmer jokes based on the tedium of waiting for their work to complete.

There are ways to ease the pain though, and one of them is ccache. CCache improves compilation times by caching previously built object files in a private cache and reusing them when you recompile the same objects with the same parameters. Obviously it will not help if you’re compiling the code for the first time, and it also won’t help if you often change compilation flags. Most C/C++ development, however, involves recompiling the same object files with the same parameters, and there ccache helps a lot.
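The caching idea described above can be sketched in a few lines. This is only a toy model and not how ccache itself is implemented: the names `cache_key` and `ToyCompileCache` are hypothetical, the "compiler" is any function you pass in, and the key here is simply a hash of the source text and the flags.

```python
import hashlib


def cache_key(source_text, flags):
    """Hash the inputs that decide what the object file will be (toy version)."""
    h = hashlib.sha256()
    h.update("\0".join(flags).encode())
    h.update(b"\0")
    h.update(source_text.encode())
    return h.hexdigest()


class ToyCompileCache:
    """Reuse a previous result when the source and flags are unchanged."""

    def __init__(self, compile_fn):
        self._compile = compile_fn  # stand-in for invoking the real compiler
        self._objects = {}
        self.hits = 0
        self.misses = 0

    def build(self, source_text, flags):
        key = cache_key(source_text, flags)
        if key in self._objects:
            self.hits += 1
        else:
            self.misses += 1
            self._objects[key] = self._compile(source_text, flags)
        return self._objects[key]


if __name__ == "__main__":
    cache = ToyCompileCache(lambda src, flags: f"object({len(src)} bytes, {flags})")
    cache.build("int main() { return 0; }", ["-O2"])
    cache.build("int main() { return 0; }", ["-O2"])  # same inputs: served from cache
    cache.build("int main() { return 0; }", ["-O0"])  # new flags: compiled again
    print(cache.hits, cache.misses)
```

The sketch also shows why changing compilation flags defeats the cache: a different flag list produces a different key, so the earlier result is never found.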

For illustration, here’s the comparison of first and subsequent compilation times of a largish C++ project:

Original run with empty cache:

Recompilation with warm cache:

Installation

CCache is available in repositories on pretty much all distributions. On OS X use homebrew (brew install ccache),

and on Debian-based distros use apt (sudo apt-get install ccache).

CMake configuration

After ccache is installed, you need to tell CMake to use it as a wrapper for the compiler. Add these lines to your CMakeLists.txt:

Rerun cmake, and the next make should use ccache as the compiler wrapper.

Usage with Android NDK

CCache can even be used with the Android NDK - you just need to export the NDK_CCACHE environment variable with the path to the ccache binary. The ndk-build script will automatically use it. E.g.

(Note that on Debian/Ubuntu the path will probably be /usr/bin/ccache)

CCache statistics

To see if ccache is really working, you can use the ccache -s command, which displays ccache statistics:

On the second and all subsequent compilations the “cache hit” values should increase and thus show that ccache is working.
