深海游弋的鱼 – 第 82 页 – 智障儿童欢乐多

Why GEMM is at the heart of deep learning

I spend most of my time worrying about how to make deep learning with neural networks faster and more power efficient. In practice that means focusing on a function called GEMM. It’s part of the BLAS (Basic Linear Algebra Subprograms) library that was first created in 1979, and until I started trying to optimize neural networks I’d never heard of it.

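CNN Basics: Convolution and Its Matrix Acceleration

Convolution is a very basic operation in CNNs, but writing about it requires drawing quite a few diagrams, so I kept putting it off. I recently came across a good diagram that shows how convolution can be converted into a matrix multiplication.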

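Installing the RNDIS Driver on a Windows 7 PC

This tutorial shows the correct way to install the RNDIS driver on a Windows 7 PC. What is RNDIS? RNDIS is the Remote Network Driver Interface Specification: a device connects to the host over USB and emulates a network connection, which is useful for downloading and debugging. Many Windows 7 users, however, find that installation fails for RNDIS devices, so this tutorial walks through the correct way to install the driver.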

Roughly Estimating the Cost of Each Shader Instruction

GPU IS a processor (graphics processing unit). Anyway, I remember seeing somewhere that on GeForce 6 series cards it's a single cycle (maybe I was just dreaming :-p), but I have that memory

radeon x800 has it anyways
EDIT:

Quote:

ORIGINALLY AT: http://gear.ibuypower.com/GVE/Store/ProductDetails.aspx?sku=VC-POWERC-147
Smartshader HD
• Support for Microsoft® DirectX® 9.0 programmable vertex and pixel shaders in hardware
• DirectX 9.0 Vertex Shaders
- Vertex programs up to 65,280 instructions with flow control
- Single cycle trigonometric operations (SIN & COS)
• Direct X 9.0 Extended Pixel Shaders
- Up to 1,536 instructions and 16 textures per rendering pass
- 32 temporary and constant registers
- Facing register for two-sided lighting
- 128-bit, 64-bit & 32-bit per pixel floating point color formats
- Multiple Render Target (MRT) support
• Complete feature set also supported in OpenGL® via extensions

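Android Gradle Plugin Source Analysis: externalNativeBuild

Starting with the Android Gradle Plugin versions that ship with Android Studio 2.2, Google integrated full support for cmake, and support for the older ndkBuild approach also improved. This article discusses how the Android Gradle Plugin relates to cross-compilation, namely the tasks related to externalNativeBuild, mainly by reading the relevant source code of the gradle build system.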

Overriding a default option(…) value in CMake from a parent CMakeLists.txt

Child CMakeLists.txt

option(BUILD_FOR_ANDROID "Build For Android" OFF)

if(SYSTEM.Android AND NOT BUILD_FOR_ANDROID)
    set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${NATIVE_LIBRARY_OUTPUT}/${ANDROID_ABI})
endif()

Parent CMakeLists.txt

set(BUILD_FOR_ANDROID ON)
add_subdirectory(${CHILD_ROOT_DIR}/ ${CMAKE_CURRENT_SOURCE_DIR}/build)

Run the following command:

/Users/xxxx/Library/Android/sdk/cmake/3.6.4111459/bin/cmake --trace-expand \
-H/Users/xxxx/Source/example/demo/android/app \
-B/Users/xxxx/Source/example/demo/android/app/.externalNativeBuild/cmake/debug/arm64-v8a \
-DANDROID_ABI=arm64-v8a \
-DANDROID_PLATFORM=android-21 \
-DCMAKE_LIBRARY_OUTPUT_DIRECTORY=/Users/xxxx/Source/example/demo/android/app/build/intermediates/cmake/debug/obj/arm64-v8a \
-DCMAKE_BUILD_TYPE=Debug \
-DANDROID_NDK=/Users/xxxx/Library/Android/android-ndk-r16b \
-DCMAKE_TOOLCHAIN_FILE=/Users/xxxx/Library/Android/android-ndk-r16b/build/cmake/android.toolchain.cmake \
-DCMAKE_MAKE_PROGRAM=/Users/xxxx/Library/Android/sdk/cmake/3.6.4111459/bin/ninja \
-G"Android Gradle - Ninja" \
-DANDROID_ARM_NEON=TRUE \
-DANDROID_TOOLCHAIN=gcc \
-DANDROID_PLATFORM=android-21 \
-DANDROID_STL=gnustl_shared

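The generated configuration shows that BUILD_FOR_ANDROID does not always take effect. A likely cause: before CMake 3.13 (policy CMP0077), option() ignores a normal variable of the same name; on the first configure run it creates the cache entry with its default value and removes the normal variable set by the parent.

Setting the value in the cache from the parent makes it take effect:
Parent CMakeLists.txt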

set(BUILD_FOR_ANDROID ON CACHE BOOL "" FORCE)
add_subdirectory(${CHILD_ROOT_DIR}/ ${CMAKE_CURRENT_SOURCE_DIR}/build)

Use ccache with CMake for faster compilation

C and C++ compilers aren’t the fastest pieces of software out there, and there’s no lack of programmer jokes based on the tedium of waiting for their work to complete.

There are ways to ease the pain though; one of them is ccache. CCache improves compilation times by caching previously built object files in a private cache and reusing them when you recompile the same objects with the same parameters. Obviously it will not help if you’re compiling the code for the first time, and it also won’t help if you often change compilation flags. Most C/C++ development, however, involves recompiling the same object files with the same parameters, and there ccache helps a lot.

For illustration, here’s the comparison of first and subsequent compilation times of a largish C++ project:

Original run with empty cache:

$ make -j9
...
real    0m56.684s
user    5m31.996s
sys     0m41.638s

Recompilation with warm cache:

$ make -j9
...
real    0m5.929s
user    0m11.896s
sys     0m8.722s

Installation

CCache is available in repositories on pretty much all distributions. On OS X use homebrew:

$ brew install ccache

and on Debian-based distros use apt:

$ apt-get install ccache

CMake configuration

After ccache is installed, you need to tell CMake to use it as a wrapper for the compiler. Add these lines to your CMakeLists.txt:

# Configure CCache if available
find_program(CCACHE_FOUND ccache)
if(CCACHE_FOUND)
        set_property(GLOBAL PROPERTY RULE_LAUNCH_COMPILE ccache)
        set_property(GLOBAL PROPERTY RULE_LAUNCH_LINK ccache)
endif(CCACHE_FOUND)

Rerun cmake, and the next make should use ccache as a wrapper.

Usage with Android NDK

CCache can even be used with the Android NDK: you just need to export the NDK_CCACHE environment variable with the path to the ccache binary, and the ndk-build script will use it automatically. For example:

$ export NDK_CCACHE=/usr/local/bin/ccache

$ ndk-build -j9

(Note that on Debian/Ubuntu the path will probably be /usr/bin/ccache)

CCache statistics

To see if ccache is really working, you can use the ccache -s command, which displays ccache statistics:

cache directory                     /Users/jernej/.ccache
primary config                      /Users/jernej/.ccache/ccache.conf
secondary config      (readonly)    /usr/local/Cellar/ccache/3.2.2/etc/ccache.conf
cache hit (direct)                 77826
cache hit (preprocessed)           17603
cache miss                         46999
called for link                       18
compile failed                        45
ccache internal error                  1
preprocessor error                    62
unsupported source language          204
files in cache                     48189
cache size                           1.2 GB
max cache size                      20.0 GB

On second and all subsequent compilations the “cache hit” values should increase and thus show that ccache is working.
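
The counters above can be turned into a single hit-rate figure. The following is a minimal sketch, assuming the ccache 3.x text format shown in the sample output (newer ccache versions print a different layout); `ccache_hit_rate` is a hypothetical helper name, not part of ccache.

```shell
# Hypothetical helper: reads `ccache -s` text (3.x layout, as in the sample
# above) on stdin and prints the hit rate as a percentage with one decimal.
ccache_hit_rate() {
    LC_ALL=C awk '
        /^cache hit \(direct\)/       { hits += $NF }    # direct-mode hits
        /^cache hit \(preprocessed\)/ { hits += $NF }    # preprocessor-mode hits
        /^cache miss/                 { misses = $NF }
        END {
            total = hits + misses
            if (total == 0) { print "n/a"; exit }
            printf "%.1f\n", 100 * hits / total
        }'
}

# Usage (requires ccache to be installed):
#   ccache -s | ccache_hit_rate
```

With the sample statistics above (77826 + 17603 hits against 46999 misses), this works out to roughly a 67% hit rate.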

References

Use ccache with CMake for faster compilation

Running Clockwork OS in a QEMU Virtual Machine on macOS Mojave (10.14.3)

# Install the build tools on macOS Mojave (10.14.3)
$ brew install arm-linux-gnueabihf-binutils

# bison on macOS is too old
$ brew install bison
$ export PATH="/usr/local/opt/bison/bin:$PATH" 

# Install crosstool-ng to build the GCC toolchain
$ brew install crosstool-ng

$ export CT_NG_VER=$(brew list --versions crosstool-ng | tr ' ' '\n' | tail -1)

$ export CT_NG_VER_SHORT=${CT_NG_VER%_*}

# The installed crosstool-ng script lacks execute permission and cannot run, so add the permission manually
$ chmod +x "$(brew --cellar crosstool-ng)/${CT_NG_VER}/lib/crosstool-ng-${CT_NG_VER_SHORT}/scripts/crosstool-NG.sh"

# By default the macOS file system is case-insensitive, so create a case-sensitive volume manually
$ hdiutil create -volname "ClockworkOS" -type SPARSE -fs 'Case-sensitive Journaled HFS+' -size 30g ClockworkOS.dmg

$ hdiutil attach ClockworkOS.dmg.sparseimage -mountpoint /Volumes/ClockworkOS

$ cd /Volumes/ClockworkOS

$ mkdir arm-cortexa9_neon-linux

$ cd arm-cortexa9_neon-linux

$ ct-ng list-samples

# Change the directory where x-tools is stored
$ export HOME=/Volumes/ClockworkOS

$ ct-ng arm-cortexa9_neon-linux-gnueabihf

# Fix the bug: Build failed in step 'Installing m4 for build'
$ brew uninstall --ignore-dependencies binutils
$ brew install binutils

# Install dependencies
$ brew install automake

$ brew uninstall --ignore-dependencies gawk
$ brew install gawk

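# Building gettext-0.19.8.1 currently hard-codes a dependency on automake-1.15, but the latest release is already automake-1.16; work around this by building and installing automake-1.15 manually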
$ wget http://ftp.gnu.org/gnu/automake/automake-1.15.tar.gz

# Also available from this site: wget https://www.mobibrw.com/wp-content/uploads/2019/03/automake-1.15.tar.gz

$ tar xvf automake-1.15.tar.gz

$ cd automake-1.15

$ bash configure

$ make && make install

$ cd ..

# Raise the open-file limit to fix the error “extra-module.mk:11: *** Too many open files.”
$ ulimit -n 2048

# 'scm_new_port_table_entry' was not declared in this scope
$ sed -i "" "s/CT_GDB_CROSS_EXTRA_CONFIG_ARRAY=.*/CT_GDB_CROSS_EXTRA_CONFIG_ARRAY=\"--with-guile=no\"/g" .config

$ export PATH="/usr/local/bin:$PATH"

$ ct-ng build -j8
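
The `CT_NG_VER_SHORT` assignment in the steps above relies on the shell's `%` suffix-stripping expansion: Homebrew may append a `_<revision>` suffix to the installed version, while the directory inside the package uses the plain upstream version. A minimal sketch of that expansion, with `strip_brew_revision` as a hypothetical helper name and the version strings as made-up examples:

```shell
# Hypothetical helper illustrating ${var%_*}: remove the shortest trailing
# match of "_*", i.e. a Homebrew-style "_<revision>" suffix if one is present.
strip_brew_revision() {
    printf '%s\n' "${1%_*}"
}

# Example version strings (illustrative only):
#   strip_brew_revision 1.23.0_3   prints 1.23.0
#   strip_brew_revision 1.24.0     prints 1.24.0 (no suffix, left unchanged)
```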

Building u-boot

$ cd /Volumes/ClockworkOS

# Download the u-boot source
$ git clone https://github.com/qemu/u-boot.git

$ cd u-boot

$ git checkout v2019.01 -b v2019.01

$ export PATH="/Volumes/ClockworkOS/x-tools/arm-cortexa9_neon-linux-gnueabihf/bin:$PATH"

$ export CROSS_COMPILE=arm-cortexa9_neon-linux-gnueabihf-

$ make clean

# The R16 is also known as the A33; R16-J means it includes Jazelle DBX
$ make vexpress_ca9x4_defconfig

# fix Undefined symbols for architecture x86_64: "_PyArg_ParseTuple"
$ export HOSTLDFLAGS="-lpython -dynamiclib"

$ brew install gnu-sed

# fix ./tools/../lib/bch.c:66:10: fatal error: 'endian.h' file not found
$ gsed -i "s/#include <sys\/endian.h>/#include <sys\/endian.h>\n#elif defined(__APPLE__)\n#include <machine\/endian.h>\n#include <libkern\/OSByteOrder.h>/g" lib/bch.c

$ gsed -i "s/#define cpu_to_be32 htobe32/#if defined(__APPLE__)\n#define cpu_to_be32 OSSwapHostToBigInt32\n#else\n#define cpu_to_be32 htobe32\n#endif/g" lib/bch.c

$ gsed -i "s/#if \!defined(__DragonFly__) \&\& \!defined(__FreeBSD__)/#if \!defined(__DragonFly__) \&\& \!defined(__FreeBSD__) \&\& \!defined(__APPLE__)/g" lib/bch.c

# Ignore the failure message at the end; it is enough that the u-boot file is generated
$ make ARCH=arm -j8

Building the Linux kernel

$ cd /Volumes/ClockworkOS

$ brew install aria2

$ aria2c -c https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.14.2.tar.xz

# Also available from this site: wget https://www.mobibrw.com/wp-content/uploads/2019/03/linux-4.14.2.tar.xz

$ tar xvf linux-4.14.2.tar.xz

$ cd linux-4.14.2

$ export PATH="/Volumes/ClockworkOS/x-tools/arm-cortexa9_neon-linux-gnueabihf/bin:$PATH"

# for mkimage
$ export PATH="/Volumes/ClockworkOS/u-boot/tools:$PATH"
# Or: brew install u-boot-tools

# elf.h
$ brew install libelf

$ echo "
#include <libelf/libelf.h>
#define R_386_NONE 0
#define R_386_32 1
#define R_386_PC32 2
#define R_ARM_NONE 0
#define R_ARM_PC24 1
#define R_ARM_ABS32 2
#define R_MIPS_NONE 0
#define R_MIPS_16 1
#define R_MIPS_32 2
#define R_MIPS_REL32 3
#define R_MIPS_26 4
#define R_MIPS_HI16 5
#define R_MIPS_LO16 6
#define EF_ARM_EABIMASK 0xFF000000
#define EF_ARM_EABI_VERSION(flags) ((flags) & EF_ARM_EABIMASK)" > /usr/local/include/elf.h

# xargs: illegal option -- r
$ brew install findutils

$ export PATH="/usr/local/opt/findutils/libexec/gnubin:$PATH"

# stat: illegal option -- c
$ ln -s /usr/local/bin/gstat /usr/local/bin/stat

$ export PATH="/usr/local/bin:$PATH"

$ export CROSS_COMPILE=arm-cortexa9_neon-linux-gnueabihf-

$ export ARCH=arm

$ make vexpress_defconfig

$ make -j8

$ mkimage -A arm -O linux -T kernel -C none -a 0x40008000 -e 0x40008000 -n "Linux kernel" -d arch/arm/boot/zImage uImage

$ cd /Volumes/ClockworkOS

$ brew install aria2

# The officially provided address cannot be downloaded from, so a mirror has to be used: http://106.185.33.196/clockworkos_v0.3.img.bz2
$ aria2c -c http://clockworkpi.k15.net/clockworkos_v0.3.img.bz2

$ rm -rf clockworkos_v0.3.img

$ bzip2 -d -k -vvvv clockworkos_v0.3.img.bz2

# Replace the kernel file in the image
$ hdiutil attach clockworkos_v0.3.img -mountpoint /Volumes/clockworkos_v0.3

$ echo y | cp -i /Volumes/ClockworkOS/linux-4.14.2/uImage /Volumes/clockworkos_v0.3/uImage

$ hdiutil detach /Volumes/clockworkos_v0.3

$ brew install qemu

$ qemu-img convert -f raw -O qcow2 clockworkos_v0.3.img clockworkos_v0.3.qcow2

Building qemu manually

$ cd /Volumes/ClockworkOS

$ git clone https://github.com/qemu/qemu.git

$ cd qemu

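# Starting with qemu v2.1.0-rc1, RAM must be mapped at addresses starting from 0x60000000 and lower addresses are mapped as read-only flash; disable this mapping, otherwise an error is reported at run time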
$ sed -i "" "s/\[VE_NORFLASHALIAS\] = 0/\[VE_NORFLASHALIAS\] = -1/g" hw/arm/vexpress.c

$ bash configure

$ make -j8

$ cd ..

# list supported machine `qemu-system-arm -machine help`

$ /Volumes/ClockworkOS/qemu/arm-softmmu/qemu-system-arm -M vexpress-a9 -m 1024M -kernel /Volumes/ClockworkOS/u-boot/u-boot -serial mon:stdio -nographic -sd clockworkos_v0.3.qcow2 -net nic,model=lan9118 -net user

Unfortunately, even after getting this far, the system still could not be booted successfully.

Using QEMU to emulate a Raspberry Pi

If you're building software for the Raspberry Pi (like I sometimes do), it can be a pain to have to constantly keep Pi hardware around and spotting Pi-specific problems can be difficult until too late.

One option (and the one I most like) is to emulate a Raspberry Pi locally before ever hitting the device. Why?

Works anywhere you can install QEMU
No hardware setup needed (no more scratching around for a power supply)
Faster feedback cycle compared to hardware
I can use Pi software (like Raspbian) in a virtual context
I can prep my "virtual Pi" with all the tools I need regardless of my physical Pi's use case

Given I'm next-to-useless at Python, that last one is pretty important as it allows me to install every Python debugging and testing tool known to man on my virtual Pi while my end-product hardware stays comparatively pristine.

Getting started

First, you'll need a few prerequisites:

QEMU (more specifically `qemu-system-arm`)

You can find packages for your chosen platform on the QEMU website; it is installable on Linux, macOS and even Windows.

Raspbian

Simply download the copy of Raspbian you need from the official site. Personally, I used the 2018-11-13 version of Raspbian Lite, since I don't need an X server.

Kernel

Since the standard RPi kernel can't be booted out of the box on QEMU, we'll need a custom kernel. We'll cover that in the next step.

Preparing

Get your kernel

First, you'll need to download a kernel. Personally, I (along with most people) use the dhruvvyas90/qemu-rpi-kernel repository's kernels. Either clone the repo:

$ git clone https://github.com/dhruvvyas90/qemu-rpi-kernel.git

or download a kernel directly:

$ wget https://github.com/dhruvvyas90/qemu-rpi-kernel/raw/master/kernel-qemu-4.4.34-jessie

or download a snapshot from my website directly:

$ wget https://www.mobibrw.com/wp-content/uploads/2019/03/qemu-rpi-kernel.zip

For the rest of these steps I'm going to be using the kernel-qemu-4.4.34-jessie kernel, so update the commands as needed if you're using another version.

Filesystem image

This step is optional, but recommended

When you download the Raspbian image it will be in the raw format, a plain disk image (generally with an .img extension).

A more efficient option is to convert this to a qcow2 image first. Use the qemu-img command to do this:

$ qemu-img convert -f raw -O qcow2 2018-11-13-raspbian-stretch-lite.img raspbian-stretch-lite.qcow

Now we can also easily expand the image:

$ qemu-img resize raspbian-stretch-lite.qcow +6G

You can check on your image using the qemu-img info command.

Starting

You've got everything you need now: a kernel, a disk image, and QEMU!

Actually running the virtual Pi is done using the qemu-system-arm command and it can be quite complicated. The full command is this (don't worry it's explained below):

$ sudo qemu-system-arm \
-kernel ./kernel-qemu-4.4.34-jessie \
-append "root=/dev/sda2 panic=1 rootfstype=ext4 rw" \
-hda raspbian-stretch-lite.qcow \
-cpu arm1176 -m 256 \
-M versatilepb \
-no-reboot \
-serial stdio \
-net nic -net user

If you need to specify how the guest connects to the network, run the following command instead:

$ sudo qemu-system-arm \
-kernel ./kernel-qemu-4.4.34-jessie \
-append "root=/dev/sda2 panic=1 rootfstype=ext4 rw" \
-hda raspbian-stretch-lite.qcow \
-cpu arm1176 -m 256 \
-M versatilepb \
-no-reboot \
-serial stdio \
-net nic -net user \
-net tap,ifname=vnet0,script=no,downscript=no

So, in order:

sudo qemu-system-arm: you need to run QEMU as root
-kernel: this is the path to the QEMU kernel we downloaded in the previous step
-append: here we are providing the boot args directly to the kernel, telling it where to find its root filesystem and what type it is
-hda: here we're attaching the disk image itself
-cpu/-m: this sets the CPU type and RAM limit to match a Raspberry Pi
-M: this sets the machine we are emulating. versatilepb is the 'ARM Versatile/PB' machine
-no-reboot: just tells QEMU to exit rather than rebooting the machine
-serial: redirects the machine's virtual serial port to our host's stdio
-net: this configures the machine's network stack to attach a NIC, use the user-mode stack, connect the host's vnet0 TAP device to the new NIC and don't use config scripts.
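
The options listed above can be collected into a small wrapper so that the kernel and image paths become parameters. This is only a sketch under stated assumptions: `run_virtual_pi` and the `DRY_RUN` variable are hypothetical names introduced here, and the flags are exactly those of the first (user-mode networking) command above.

```shell
# Hypothetical wrapper around the first qemu-system-arm command from this post.
#   $1 = path to the QEMU kernel, $2 = path to the disk image.
# If DRY_RUN is set to a non-empty value, the command is printed instead of run.
run_virtual_pi() {
    pi_kernel="$1"
    pi_image="$2"
    ${DRY_RUN:+echo} sudo qemu-system-arm \
        -kernel "$pi_kernel" \
        -append "root=/dev/sda2 panic=1 rootfstype=ext4 rw" \
        -hda "$pi_image" \
        -cpu arm1176 -m 256 \
        -M versatilepb \
        -no-reboot \
        -serial stdio \
        -net nic -net user
}

# Usage:
#   DRY_RUN=1 run_virtual_pi ./kernel-qemu-4.4.34-jessie raspbian-stretch-lite.qcow
#   run_virtual_pi ./kernel-qemu-4.4.34-jessie raspbian-stretch-lite.qcow
```

Printing the command first makes it easy to check the paths before QEMU is started as root. Note that the dry-run output loses the quoting around the -append string, so it is for inspection rather than copy-and-paste.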

If it's all gone well, you should now have a QEMU window pop up and you should see the familiar Raspberry Pi boot screen show up.

Now, go get yourself a drink to celebrate, because it might take a little while.

Networking

Now, that's all well and good, but without networking, we may as well be back on hardware. When the machine started, it will have attached a NIC and connected it to the host's vnet0 TAP device. If we configure that device with an IP and add it to a bridge on our host, you should be able to reliably access it like any other virtual machine.

(on host) Find a bridge and address

This will vary by host, but on my Fedora machine, for example, there is a pre-configured virbr0 bridge interface with an address in the 192.168.122.0/24 space:

virbr0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.122.1  netmask 255.255.255.0  broadcast 192.168.122.255
        ether 00:00:00:1e:77:43  txqueuelen 1000  (Ethernet)

I'm going to use this bridge and just pick a static address for my Pi: 192.168.122.200

Reusing an existing (pre-configured) bridge means you won't need to sort your own routing

(in guest) Configure interface

NOTE: I'm assuming Stretch here.

Open /etc/dhcpcd.conf in your new virtual Pi and configure the eth0 interface with a static address in your bridge's subnet. For example, for my bridge:

# in /etc/dhcpcd.conf
interface eth0
static ip_address=192.168.122.200/24
static routers=192.168.122.254
static domain_name_servers=8.8.8.8 8.8.4.4

You may need to reboot for this to take effect

(in host) Add TAP to bridge

Finally, add the machine's TAP interface to your chosen bridge with the brctl command:

$ sudo brctl addif virbr0 vnet0

Now, on your host, you should be able to ping 192.168.122.200 (or your Pi's address).

Set up SSH

Now, in your machine, you can run sudo raspi-config and enable the SSH server (in the "Interfacing Options" menu at time of writing).

Make sure you change the password from default while you're there!

Finally, on your host, run ssh-copy-id pi@192.168.122.200 to copy your SSH key into the Pi's pi user and you can now SSH directly into your Pi without a password prompt.

References

Using QEMU to emulate a Raspberry Pi

Simple ARM NEON optimized sin, cos, log and exp

This is the sequel to the single precision SSE optimized sin, cos, log and exp that I wrote some time ago, adapted to the NEON FPU of my pandaboard. Precision and range are exactly the same as in the SSE version, so I won't repeat them.

The code

The functions below are licensed under the zlib license, so you can do basically what you want with them.

neon_mathfun.h source code for sin_ps, cos_ps, sincos_ps, exp_ps, log_ps, as straight C.
neon_mathfun_test.c Validation+Bench program for those functions. Do not forget to run it once.

Performance

Results on a pandaboard with a 1GHz dual-core ARM Cortex A9 (OMAP4), using gcc 4.6.1

command line: gcc -O3 -mfloat-abi=softfp -mfpu=neon -march=armv7-a -mtune=cortex-a9 -Wall -W neon_mathfun_test.c -lm

exp([        -1000,          -100,           100,          1000]) = [            0,             0, 2.4061436e+38, 2.4061436e+38]
exp([         -nan,           inf,          -inf,           nan]) = [          nan, 2.4061436e+38,             0,           nan]
log([            0,           -10,         1e+30, 1.0005271e-42]) = [         -nan,          -nan,     69.077553,          -nan]
log([         -nan,           inf,          -inf,           nan]) = [    89.128304,     88.722839,          -nan,     89.128304]
sin([         -nan,           inf,          -inf,           nan]) = [          nan,           nan,          -nan,           nan]
cos([         -nan,           inf,          -inf,           nan]) = [          nan,           nan,           nan,           nan]
sin([       -1e+30,       -100000,         1e+30,        100000]) = [          inf,  -0.035749275,          -inf,   0.035749275]
cos([       -1e+30,       -100000,         1e+30,        100000]) = [          nan,    -0.9993608,           nan,    -0.9993608]
benching                 sinf .. ->    2.0 millions of vector evaluations/second -> 121 cycles/value on a 1000MHz computer
benching                 cosf .. ->    1.8 millions of vector evaluations/second -> 132 cycles/value on a 1000MHz computer
benching                 expf .. ->    1.1 millions of vector evaluations/second -> 221 cycles/value on a 1000MHz computer
benching                 logf .. ->    1.7 millions of vector evaluations/second -> 141 cycles/value on a 1000MHz computer
benching          cephes_sinf .. ->    2.4 millions of vector evaluations/second -> 103 cycles/value on a 1000MHz computer
benching          cephes_cosf .. ->    2.0 millions of vector evaluations/second -> 123 cycles/value on a 1000MHz computer
benching          cephes_expf .. ->    1.6 millions of vector evaluations/second -> 153 cycles/value on a 1000MHz computer
benching          cephes_logf .. ->    1.5 millions of vector evaluations/second -> 156 cycles/value on a 1000MHz computer
benching               sin_ps .. ->    5.8 millions of vector evaluations/second ->  43 cycles/value on a 1000MHz computer
benching               cos_ps .. ->    5.9 millions of vector evaluations/second ->  42 cycles/value on a 1000MHz computer
benching            sincos_ps .. ->    6.0 millions of vector evaluations/second ->  41 cycles/value on a 1000MHz computer
benching               exp_ps .. ->    5.6 millions of vector evaluations/second ->  44 cycles/value on a 1000MHz computer
benching               log_ps .. ->    5.3 millions of vector evaluations/second ->  47 cycles/value on a 1000MHz computer

So performance is not stellar. I recommend using gcc 4.6.1 or newer, as it generates much better code than earlier versions (gcc 4.5) -- almost 20% faster here. I believe rewriting these functions in assembly would improve performance by about 30%, and it should not be very hard, as ARM and NEON assembly is quite nice and easy to write -- maybe I'll do it. Computing two SIMD vectors at once would also help a lot: NEON has enough registers for it, and it would reduce the dependencies between NEON instructions.
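To make the "two SIMD vectors at once" idea concrete, here is a rough sketch in plain C rather than NEON intrinsics. It evaluates a polynomial with Horner's scheme on two independent groups of 4 lanes per iteration. The function name `poly_two_vectors` and the coefficients are hypothetical, chosen only for illustration; the real functions use their own polynomials and range reduction.

```c
#include <stddef.h>

/* Hypothetical coefficients, highest degree first: 1 + x + x^2/2 + x^3/6. */
static const float COEF[4] = { 1.0f / 6.0f, 0.5f, 1.0f, 1.0f };

/* Horner evaluation on 8 floats per iteration, handled as two groups of
   4 lanes ("vectors" a and b). Every step of a is independent of b, so the
   two multiply-add chains can overlap instead of waiting on each other.
   n is assumed to be a multiple of 8; leftover elements are ignored. */
static void poly_two_vectors(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i + 8 <= n; i += 8) {
        float a[4], b[4];
        for (int l = 0; l < 4; ++l) {
            a[l] = COEF[0];
            b[l] = COEF[0];
        }
        for (int k = 1; k < 4; ++k) {
            for (int l = 0; l < 4; ++l) {
                a[l] = a[l] * in[i + l] + COEF[k];      /* chain A */
                b[l] = b[l] * in[i + 4 + l] + COEF[k];  /* chain B */
            }
        }
        for (int l = 0; l < 4; ++l) {
            out[i + l] = a[l];
            out[i + 4 + l] = b[l];
        }
    }
}
```

In a NEON version each group of 4 lanes would live in one quad register, and the two chains would use separate registers so that one multiply-accumulate does not have to wait for the result of the previous one. Whether this actually pays off depends on the core and the compiler, so it would need to be measured.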

Note also that I have no idea of the performance on a Cortex A8 -- it may be extremely bad, I don't know.

Comparison with an Intel Atom

For comparison purposes, here is the performance of the SSE version on a single-core Intel Atom N270 running at 1.6GHz.

command line: cl.exe /arch:SSE /O2 /TP /MD sse_mathfun_test.c (this is MSVC 2010)

benching                 sinf .. ->    1.3 millions of vector evaluations/second -> 303 cycles/value on a 1600MHz computer
benching                 cosf .. ->    1.3 millions of vector evaluations/second -> 305 cycles/value on a 1600MHz computer
benching         sincos (x87) .. ->    1.2 millions of vector evaluations/second -> 314 cycles/value on a 1600MHz computer
benching                 expf .. ->    1.6 millions of vector evaluations/second -> 244 cycles/value on a 1600MHz computer
benching                 logf .. ->    1.4 millions of vector evaluations/second -> 276 cycles/value on a 1600MHz computer
benching          cephes_sinf .. ->    1.4 millions of vector evaluations/second -> 280 cycles/value on a 1600MHz computer
benching          cephes_cosf .. ->    1.5 millions of vector evaluations/second -> 265 cycles/value on a 1600MHz computer
benching          cephes_expf .. ->    0.7 millions of vector evaluations/second -> 548 cycles/value on a 1600MHz computer
benching          cephes_logf .. ->    0.8 millions of vector evaluations/second -> 489 cycles/value on a 1600MHz computer
benching               sin_ps .. ->    9.2 millions of vector evaluations/second ->  43 cycles/value on a 1600MHz computer
benching               cos_ps .. ->    9.5 millions of vector evaluations/second ->  42 cycles/value on a 1600MHz computer
benching            sincos_ps .. ->    8.8 millions of vector evaluations/second ->  45 cycles/value on a 1600MHz computer
benching               exp_ps .. ->    9.8 millions of vector evaluations/second ->  41 cycles/value on a 1600MHz computer
benching               log_ps .. ->    8.6 millions of vector evaluations/second ->  46 cycles/value on a 1600MHz computer

The number of cycles is quite similar -- but the Atom has a higher clock.
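The cycles/value column can be reproduced from the other two numbers, assuming each "vector evaluation" processes 4 floats (an assumption that matches the printed figures). The helper names below are hypothetical; this is only a sketch of the arithmetic.

```c
/* Cycles spent per scalar value, given the clock in MHz and the measured
   rate in millions of 4-wide vector evaluations per second. */
static double cycles_per_value(double clock_mhz, double mvec_per_sec)
{
    return clock_mhz / (mvec_per_sec * 4.0);
}

/* Wall-clock time per value in nanoseconds for a given cycle count. */
static double ns_per_value(double clock_mhz, double cycles)
{
    return cycles * 1000.0 / clock_mhz;
}
```

For `sin_ps`, `cycles_per_value(1000.0, 5.8)` is about 43.1 on the Cortex A9 and `cycles_per_value(1600.0, 9.2)` is about 43.5 on the Atom. At 43 cycles, `ns_per_value` gives 43 ns per value at 1000MHz against 26.875 ns at 1600MHz, which is the clock advantage mentioned above.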

Last modified: 2011/05/29

Reference links

Simple ARM NEON optimized sin, cos, log and exp
