I spend most of my time worrying about how to make deep learning with neural networks faster and more power efficient. In practice that means focusing on a function called GEMM. It’s part of the BLAS (Basic Linear Algebra Subprograms) library that was first created in 1979, and until I started trying to optimize neural networks I’d never heard of it.
Continue reading: Why GEMM is at the heart of deep learning
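For readers who, like the author, had not met GEMM before: the BLAS routine computes C ← αAB + βC for matrices A, B and C. Below is a minimal pure-Python sketch of that definition. The function name `gemm` and the list-of-lists layout are illustrative choices, not the BLAS interface, and real implementations are heavily tiled and vectorized.

```python
def gemm(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for row-major lists of lists."""
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    assert len(c) == m and all(len(row) == n for row in c), "c must be m x n"

    # Start from the scaled accumulator, then add the scaled product.
    out = [[beta * c[i][j] for j in range(n)] for i in range(m)]
    for i in range(m):
        for p in range(k):
            a_ip = alpha * a[i][p]
            for j in range(n):
                out[i][j] += a_ip * b[p][j]
    return out


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
result = gemm(1.0, a, b, 0.5, c)
print(result)
```

The triple loop makes the cost visible: an m×k by k×n product needs m·k·n multiply-adds, which is why this one function dominates the running time of dense layers.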
CNN Basics: Convolution and Its Matrix Acceleration
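Installing the RNDIS driver on a Windows 7 PC
This tutorial shows the correct way to install the RNDIS driver on a Windows 7 PC. What is the RNDIS driver? RNDIS (Remote Network Driver Interface Specification) is a protocol in which a device connects to the host over USB and emulates a network connection, which is used for downloading and debugging. Many Windows 7 users find that installation fails when they plug in an RNDIS device, so this post walks through the correct way to install the RNDIS driver on a Windows 7 PC.
Roughly estimating the cost of each line of shader code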
GPU IS a processor (graphics processing unit). Anyway, I remember seeing somewhere that on GeForce 6 series cards it's a single cycle (maybe I was just dreaming :-p), but I do have that memory.
The Radeon X800 has it anyway.
EDIT:
Quote:
ORIGINALLY AT: http://gear.ibuypower.com/GVE/Store/ProductDetails.aspx?sku=VC-POWERC-147
Smartshader HD
• Support for Microsoft® DirectX® 9.0 programmable vertex and pixel shaders in hardware
• DirectX 9.0 Vertex Shaders
- Vertex programs up to 65,280 instructions with flow control
- Single cycle trigonometric operations (SIN & COS)
• DirectX 9.0 Extended Pixel Shaders
- Up to 1,536 instructions and 16 textures per rendering pass
- 32 temporary and constant registers
- Facing register for two-sided lighting
- 128-bit, 64-bit & 32-bit per pixel floating point color formats
- Multiple Render Target (MRT) support
• Complete feature set also supported in OpenGL® via extensions
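Android Gradle Plugin source code analysis: externalNativeBuild
Starting with the Android Gradle Plugin versions that ship with Android Studio 2.2, Google integrated solid support for cmake, and support for the older ndkBuild approach improved as well. This article looks at how the Android Gradle Plugin interacts with cross-compilation, that is, the tasks related to externalNativeBuild, mainly by reading through the relevant source code of the gradle build system.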
Overriding a default option(…) value in CMake from a parent CMakeLists.txt
Child CMakeLists.txt

```cmake
option(BUILD_FOR_ANDROID "Build For Android" OFF)

if(SYSTEM.Android AND NOT BUILD_FOR_ANDROID)
    set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${NATIVE_LIBRARY_OUTPUT}/${ANDROID_ABI})
endif()
```
Parent CMakeLists.txt

```cmake
set(BUILD_FOR_ANDROID ON)
add_subdirectory(${CHILD_ROOT_DIR}/ ${CMAKE_CURRENT_SOURCE_DIR}/build)
```
When running the following command:

```shell
/Users/xxxx/Library/Android/sdk/cmake/3.6.4111459/bin/cmake --trace-expand \
  -H/Users/xxxx/Source/example/demo/android/app \
  -B/Users/xxxx/Source/example/demo/android/app/.externalNativeBuild/cmake/debug/arm64-v8a \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-21 \
  -DCMAKE_LIBRARY_OUTPUT_DIRECTORY=/Users/xxxx/Source/example/demo/android/app/build/intermediates/cmake/debug/obj/arm64-v8a \
  -DCMAKE_BUILD_TYPE=Debug \
  -DANDROID_NDK=/Users/xxxx/Library/Android/android-ndk-r16b \
  -DCMAKE_TOOLCHAIN_FILE=/Users/xxxx/Library/Android/android-ndk-r16b/build/cmake/android.toolchain.cmake \
  -DCMAKE_MAKE_PROGRAM=/Users/xxxx/Library/Android/sdk/cmake/3.6.4111459/bin/ninja \
  -G"Android Gradle - Ninja" \
  -DANDROID_ARM_NEON=TRUE \
  -DANDROID_TOOLCHAIN=gcc \
  -DANDROID_PLATFORM=android-21 \
  -DANDROID_STL=gnustl_shared
```
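you will find that BUILD_FOR_ANDROID does not always take effect in the generated configuration. The most likely cause is that, before CMake 3.13 (policy CMP0077), option() ignores a normal variable of the same name: when no cache entry exists yet, it creates one with the default value and removes the normal variable.
The value has to be written to the cache instead:
Parent CMakeLists.txt

```cmake
set(BUILD_FOR_ANDROID ON CACHE BOOL "" FORCE)
add_subdirectory(${CHILD_ROOT_DIR}/ ${CMAKE_CURRENT_SOURCE_DIR}/build)
```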
Use ccache with CMake for faster compilation
C and C++ compilers aren’t the fastest pieces of software out there and there’s no lack of programmer jokes based on the tedium of waiting for their work to complete.
There are ways to ease the pain though, and one of them is ccache. CCache improves compilation times by caching previously built object files in a private cache and reusing them when you recompile the same objects with the same parameters. Obviously it will not help if you’re compiling the code for the first time, and it also won’t help if you often change compilation flags. Most C/C++ development, however, involves recompiling the same object files with the same parameters, and there ccache helps a lot.
For illustration, here’s the comparison of first and subsequent compilation times of a largish C++ project:
Original run with empty cache:
```shell
$ make -j9
...
real	0m56.684s
user	5m31.996s
sys	0m41.638s
```
Recompilation with warm cache:
```shell
$ make -j9
...
real	0m5.929s
user	0m11.896s
sys	0m8.722s
```
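To put a number on the difference between the two runs, the timings can be divided directly. The following is a small sketch that parses the `time`-style stamps quoted above and prints the ratio for each row; the helper name `to_seconds` is illustrative.

```python
import re


def to_seconds(stamp):
    """Convert a `time`-style stamp such as '5m31.996s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Figures copied from the two runs above.
cold = {"real": "0m56.684s", "user": "5m31.996s", "sys": "0m41.638s"}
warm = {"real": "0m5.929s", "user": "0m11.896s", "sys": "0m8.722s"}

speedup = {key: to_seconds(cold[key]) / to_seconds(warm[key]) for key in cold}
for key, ratio in speedup.items():
    print(f"{key}: {ratio:.1f}x")
```

Wall-clock time drops by a bit more than nine times, while user CPU time drops far more, since with a warm cache the compiler itself barely runs.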
Installation
CCache is available in repositories on pretty much all distributions. On OS X use homebrew:
```shell
$ brew install ccache
```
and on Debian-based distros use apt:
```shell
$ apt-get install ccache
```
CMake configuration
After ccache is installed, you need to tell CMake to use it as a wrapper for the compiler. Add these lines to your CMakeLists.txt:
```cmake
# Configure CCache if available
find_program(CCACHE_FOUND ccache)
if(CCACHE_FOUND)
    set_property(GLOBAL PROPERTY RULE_LAUNCH_COMPILE ccache)
    set_property(GLOBAL PROPERTY RULE_LAUNCH_LINK ccache)
endif(CCACHE_FOUND)
```

Rerun cmake, and the next make should use ccache as the compiler wrapper.
Usage with Android NDK
CCache can even be used with the Android NDK: you just need to export the NDK_CCACHE environment variable with the path to the ccache binary, and the ndk-build script will pick it up automatically. E.g.

```shell
$ export NDK_CCACHE=/usr/local/bin/ccache
$ ndk-build -j9
```
(Note that on Debian/Ubuntu the path will probably be /usr/bin/ccache)
CCache statistics
To see if ccache is really working, you can use the ccache -s command, which displays ccache statistics:

```
cache directory                     /Users/jernej/.ccache
primary config                      /Users/jernej/.ccache/ccache.conf
secondary config      (readonly)    /usr/local/Cellar/ccache/3.2.2/etc/ccache.conf
cache hit (direct)                  77826
cache hit (preprocessed)            17603
cache miss                          46999
called for link                        18
compile failed                         45
ccache internal error                   1
preprocessor error                     62
unsupported source language           204
files in cache                      48189
cache size                            1.2 GB
max cache size                       20.0 GB
```
On the second and all subsequent compilations the “cache hit” values should increase, which shows that ccache is working.
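A single hit-rate figure is often easier to track than the raw counters. The sketch below assumes the text layout shown above (counter name, whitespace, integer value) and computes hits divided by hits plus misses; `parse_stats` is an illustrative helper, not part of ccache, and other ccache versions may print a different layout.

```python
def parse_stats(text):
    """Map each counter name to its integer value, skipping non-numeric rows."""
    stats = {}
    for line in text.splitlines():
        name, _, value = line.strip().rpartition(" ")
        if name and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


# A few rows copied from the `ccache -s` output above.
SAMPLE = """\
cache hit (direct)                  77826
cache hit (preprocessed)            17603
cache miss                          46999
files in cache                      48189
cache size                            1.2 GB
"""

stats = parse_stats(SAMPLE)
hits = stats["cache hit (direct)"] + stats["cache hit (preprocessed)"]
hit_rate = hits / (hits + stats["cache miss"])
print(f"hit rate: {hit_rate:.1%}")
```

For the statistics quoted here, roughly two thirds of all compiler invocations were served from the cache.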
Running Clockwork OS in a QEMU virtual machine on macOS Mojave (10.14.3)

```shell
# Install the build tools on macOS Mojave (10.14.3)
$ brew install arm-linux-gnueabihf-binutils

# bison on macOS is too old
$ brew install bison
$ export PATH="/usr/local/opt/bison/bin:$PATH"

# Install crosstool-ng to build the GCC toolchain
$ brew install crosstool-ng
$ export CT_NG_VER=$(brew list --versions crosstool-ng | tr ' ' '\n' | tail -1)
$ export CT_NG_VER_SHORT=${CT_NG_VER%_*}

# The installed crosstool-ng script lacks execute permission and cannot run, so add it by hand
$ chmod +x "$(brew --cellar crosstool-ng)/${CT_NG_VER}/lib/crosstool-ng-${CT_NG_VER_SHORT}/scripts/crosstool-NG.sh"

# The macOS file system is case-insensitive by default, so create a case-sensitive volume by hand
$ hdiutil create -volname "ClockworkOS" -type SPARSE -fs 'Case-sensitive Journaled HFS+' -size 30g ClockworkOS.dmg
$ hdiutil attach ClockworkOS.dmg.sparseimage -mountpoint /Volumes/ClockworkOS
$ cd /Volumes/ClockworkOS
$ mkdir arm-cortexa9_neon-linux
$ cd arm-cortexa9_neon-linux
$ ct-ng list-samples

# Change the directory where x-tools is stored
$ export HOME=/Volumes/ClockworkOS
$ ct-ng arm-cortexa9_neon-linux-gnueabihf

# Fix the bug: Build failed in step 'Installing m4 for build'
$ brew uninstall --ignore-dependencies binutils
$ brew install binutils

# Install the dependencies
$ brew install automake
$ brew uninstall --ignore-dependencies gawk
$ brew install gawk

# Building gettext-0.19.8.1 currently hard-codes a dependency on automake-1.15, but the latest
# release is already automake-1.16; work around this by building and installing automake-1.15 by hand
$ wget http://ftp.gnu.org/gnu/automake/automake-1.15.tar.gz
# Also available from this site: wget https://www.mobibrw.com/wp-content/uploads/2019/03/automake-1.15.tar.gz
$ tar xvf automake-1.15.tar.gz
$ cd automake-1.15
$ bash configure
$ make && make install
$ cd ..

# Raise the open-file limit to fix the error "extra-module.mk:11: *** Too many open files."
$ ulimit -n 2048

# 'scm_new_port_table_entry' was not declared in this scope
$ sed -i "" "s/CT_GDB_CROSS_EXTRA_CONFIG_ARRAY=.*/CT_GDB_CROSS_EXTRA_CONFIG_ARRAY=\"--with-guile=no\"/g" .config
$ export PATH="/usr/local/bin:$PATH"
$ ct-ng build -j8
```
Building u-boot

```shell
$ cd /Volumes/ClockworkOS

# Download the u-boot source
$ git clone https://github.com/qemu/u-boot.git
$ cd u-boot
$ git checkout v2019.01 -b v2019.01
$ export PATH="/Volumes/ClockworkOS/x-tools/arm-cortexa9_neon-linux-gnueabihf/bin:$PATH"
$ export CROSS_COMPILE=arm-cortexa9_neon-linux-gnueabihf-
$ make clean

# R16 is also known as A33; R16-J means it includes Jazelle DBX
$ make vexpress_ca9x4_defconfig

# fix Undefined symbols for architecture x86_64: "_PyArg_ParseTuple"
$ export HOSTLDFLAGS="-lpython -dynamiclib"
$ brew install gnu-sed

# fix ./tools/../lib/bch.c:66:10: fatal error: 'endian.h' file not found
$ gsed -i "s/#include <sys\/endian.h>/#include <sys\/endian.h>\n#elif defined(__APPLE__)\n#include <machine\/endian.h>\n#include <libkern\/OSByteOrder.h>/g" lib/bch.c
$ gsed -i "s/#define cpu_to_be32 htobe32/#if defined(__APPLE__)\n#define cpu_to_be32 OSSwapHostToBigInt32\n#else\n#define cpu_to_be32 htobe32\n#endif/g" lib/bch.c
$ gsed -i "s/#if \!defined(__DragonFly__) \&\& \!defined(__FreeBSD__)/#if \!defined(__DragonFly__) \&\& \!defined(__FreeBSD__) \&\& \!defined(__APPLE__)/g" lib/bch.c

# Ignore the failure message at the end; it is enough that the u-boot file gets generated
$ make ARCH=arm -j8
```
Building the Linux kernel

```shell
$ cd /Volumes/ClockworkOS
$ brew install aria2
$ aria2c -c https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.14.2.tar.xz
# Also available from this site: wget https://www.mobibrw.com/wp-content/uploads/2019/03/linux-4.14.2.tar.xz
$ tar xvf linux-4.14.2.tar.xz
$ cd linux-4.14.2
$ export PATH="/Volumes/ClockworkOS/x-tools/arm-cortexa9_neon-linux-gnueabihf/bin:$PATH"

# for mkimage
$ export PATH="/Volumes/ClockworkOS/u-boot/tools:$PATH"
# or: brew install u-boot-tools

# elf.h
$ brew install libelf
$ echo "
#include <libelf/libelf.h>
#define R_386_NONE 0
#define R_386_32 1
#define R_386_PC32 2
#define R_ARM_NONE 0
#define R_ARM_PC24 1
#define R_ARM_ABS32 2
#define R_MIPS_NONE 0
#define R_MIPS_16 1
#define R_MIPS_32 2
#define R_MIPS_REL32 3
#define R_MIPS_26 4
#define R_MIPS_HI16 5
#define R_MIPS_LO16 6
#define EF_ARM_EABIMASK 0xFF000000
#define EF_ARM_EABI_VERSION(flags) ((flags) & EF_ARM_EABIMASK)" > /usr/local/include/elf.h

# xargs: illegal option -- r
$ brew install findutils
$ export PATH="/usr/local/opt/findutils/libexec/gnubin:$PATH"

# stat: illegal option -- c
$ ln -s /usr/local/bin/gstat /usr/local/bin/stat
$ export PATH="/usr/local/bin:$PATH"
$ export CROSS_COMPILE=arm-cortexa9_neon-linux-gnueabihf-
$ export ARCH=arm
$ make vexpress_defconfig
$ make -j8
$ mkimage -A arm -O linux -T kernel -C none -a 0x40008000 -e 0x40008000 -n "Linux kernel" -d arch/arm/boot/zImage uImage
```
```shell
$ cd /Volumes/ClockworkOS
$ brew install aria2

# The officially published address cannot be downloaded from; only the mirror address
# http://106.185.33.196/clockworkos_v0.3.img.bz2 works
$ aria2c -c http://clockworkpi.k15.net/clockworkos_v0.3.img.bz2
$ rm -rf clockworkos_v0.3.img
$ bzip2 -d -k -vvvv clockworkos_v0.3.img.bz2

# Replace the kernel file inside the image
$ hdiutil attach clockworkos_v0.3.img -mountpoint /Volumes/clockworkos_v0.3
$ echo y | cp -i /Volumes/ClockworkOS/linux-4.14.2/uImage /Volumes/clockworkos_v0.3/uImage
$ hdiutil detach /Volumes/clockworkos_v0.3
$ brew install qemu
$ qemu-img convert -f raw -O qcow2 clockworkos_v0.3.img clockworkos_v0.3.qcow2
```
Building qemu by hand

```shell
$ cd /Volumes/ClockworkOS
$ git clone https://github.com/qemu/qemu.git
$ cd qemu

# Since qemu v2.1.0-rc1, RAM has to be mapped starting at 0x60000000 and the lower addresses are
# mapped as read-only flash; this mapping must be disabled, otherwise an error is reported at run time
$ sed -i "" "s/\[VE_NORFLASHALIAS\] = 0/\[VE_NORFLASHALIAS\] = -1/g" hw/arm/vexpress.c
$ bash configure
$ make -j8
$ cd ..

# list supported machines: `qemu-system-arm -machine help`
$ /Volumes/ClockworkOS/qemu/arm-softmmu/qemu-system-arm -M vexpress-a9 -m 1024M -kernel /Volumes/ClockworkOS/u-boot/u-boot -serial mon:stdio -nographic -sd clockworkos_v0.3.qcow2 -net nic,model=lan9118 -net user
```

Unfortunately, even after all these steps, the system still does not boot successfully.
Using QEMU to emulate a Raspberry Pi
If you're building software for the Raspberry Pi (like I sometimes do), it can be a pain to have to constantly keep Pi hardware around, and spotting Pi-specific problems can be difficult until it's too late.
One option (and the one I most like) is to emulate a Raspberry Pi locally before ever hitting the device. Why?
- Works anywhere you can install QEMU
- No hardware setup needed (no more scratching around for a power supply)
- Faster feedback cycle compared to hardware
- I can use Pi software (like Raspbian) in a virtual context
- I can prep my "virtual Pi" with all the tools I need regardless of my physical Pi's use case
Given I'm next-to-useless at Python, that last one is pretty important as it allows me to install every Python debugging and testing tool known to man on my virtual Pi while my end-product hardware stays comparatively pristine.
Getting started
First, you'll need a few prerequisites:
QEMU (more specifically qemu-system-arm)
You can find packages for your chosen platform on the QEMU website; it is installable on Linux, macOS and even Windows.
Raspbian
Simply download the copy of Raspbian you need from the official site. Personally, I used the 2018-11-13 version of Raspbian Lite, since I don't need an X server.
Kernel
Since the standard RPi kernel can't be booted out of the box on QEMU, we'll need a custom kernel. We'll cover that in the next step.
Preparing
Get your kernel
First, you'll need to download a kernel. Personally, I (along with most people) use the dhruvvyas90/qemu-rpi-kernel repository's kernels. Either clone the repo:
```shell
$ git clone https://github.com/dhruvvyas90/qemu-rpi-kernel.git
```
or download a kernel directly:
```shell
$ wget https://github.com/dhruvvyas90/qemu-rpi-kernel/raw/master/kernel-qemu-4.4.34-jessie
```
or download a snapshot from my website directly:
```shell
$ wget https://www.mobibrw.com/wp-content/uploads/2019/03/qemu-rpi-kernel.zip
```

For the rest of these steps I'm going to be using the kernel-qemu-4.4.34-jessie kernel, so update the commands as needed if you're using another version.
Filesystem image
This step is optional, but recommended
When you download the Raspbian image it will be in the raw format, a plain disk image (generally with an .img extension).
A more efficient option is to convert this to a qcow2 image first. Use the qemu-img command to do this:

```shell
$ qemu-img convert -f raw -O qcow2 2018-11-13-raspbian-stretch-lite.img raspbian-stretch-lite.qcow
```
Now we can also easily expand the image:
```shell
$ qemu-img resize raspbian-stretch-lite.qcow +6G
```

You can check on your image using the qemu-img info command.
Starting
You've got everything you need now: a kernel, a disk image, and QEMU!
Actually running the virtual Pi is done using the qemu-system-arm command and it can be quite complicated. The full command is this (don't worry it's explained below):
```shell
$ sudo qemu-system-arm \
  -kernel ./kernel-qemu-4.4.34-jessie \
  -append "root=/dev/sda2 panic=1 rootfstype=ext4 rw" \
  -hda raspbian-stretch-lite.qcow \
  -cpu arm1176 -m 256 \
  -M versatilepb \
  -no-reboot \
  -serial stdio \
  -net nic -net user
```
If you need to specify how the guest connects to the network, run the following command instead:

```shell
$ sudo qemu-system-arm \
  -kernel ./kernel-qemu-4.4.34-jessie \
  -append "root=/dev/sda2 panic=1 rootfstype=ext4 rw" \
  -hda raspbian-stretch-lite.qcow \
  -cpu arm1176 -m 256 \
  -M versatilepb \
  -no-reboot \
  -serial stdio \
  -net nic -net user \
  -net tap,ifname=vnet0,script=no,downscript=no
```
So, in order:
- sudo qemu-system-arm: you need to run QEMU as root
- -kernel: this is the path to the QEMU kernel we downloaded in the previous step
- -append: here we are providing the boot args directly to the kernel, telling it where to find its root filesystem and what type it is
- -hda: here we're attaching the disk image itself
- -cpu/-m: this sets the CPU type and RAM limit to match a Raspberry Pi
- -M: this sets the machine we are emulating. versatilepb is the 'ARM Versatile/PB' machine
- -no-reboot: just tells QEMU to exit rather than rebooting the machine
- -serial: redirects the machine's virtual serial port to our host's stdio
- -net: this configures the machine's network stack to attach a NIC, use the user-mode stack, connect the host's vnet0 TAP device to the new NIC and not use config scripts.
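If you launch the virtual Pi from a script, it can help to assemble the arguments as a list rather than one long string. The sketch below only builds and prints the command line using the flags explained above; the function name `build_qemu_command` and its parameters are illustrative, and it does not start QEMU itself.

```python
import shlex


def build_qemu_command(kernel, image, tap_ifname=None):
    """Assemble the qemu-system-arm argument list described above."""
    args = [
        "sudo", "qemu-system-arm",
        "-kernel", kernel,
        "-append", "root=/dev/sda2 panic=1 rootfstype=ext4 rw",
        "-hda", image,
        "-cpu", "arm1176", "-m", "256",
        "-M", "versatilepb",
        "-no-reboot",
        "-serial", "stdio",
        "-net", "nic", "-net", "user",
    ]
    if tap_ifname is not None:
        # Optional TAP device on the host, with no up/down config scripts.
        args += ["-net", f"tap,ifname={tap_ifname},script=no,downscript=no"]
    return args


command = build_qemu_command(
    "./kernel-qemu-4.4.34-jessie",
    "raspbian-stretch-lite.qcow",
    tap_ifname="vnet0",
)
print(shlex.join(command))
```

Keeping the arguments in a list means the quoted -append string stays a single argument when it is later handed to something like subprocess.run.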
If it's all gone well, you should now have a QEMU window pop up and you should see the familiar Raspberry Pi boot screen show up.
Now, go get yourself a drink to celebrate, because it might take a little while.
Networking
Now, that's all well and good, but without networking, we may as well be back on hardware. When the machine started, it will have attached a NIC and connected it to the host's vnet0 TAP device. If we configure that device with an IP and add it to a bridge on our host, you should be able to reliably access it like any other virtual machine.
(on host) Find a bridge and address
This will vary by host, but on my Fedora machine, for example, there is a pre-configured virbr0 bridge interface with an address in the 192.168.122.0/24 space:
```
virbr0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.122.1  netmask 255.255.255.0  broadcast 192.168.122.255
        ether 00:00:00:1e:77:43  txqueuelen 1000  (Ethernet)
```
I'm going to use this bridge and just pick a static address for my Pi: 192.168.122.200
Reusing an existing (pre-configured) bridge means you won't need to sort your own routing
(in guest) Configure interface
NOTE: I'm assuming Stretch here.
Open /etc/dhcpcd.conf in your new virtual Pi and configure the eth0 interface with a static address in your bridge's subnet. For example, for my bridge:
```
# in /etc/dhcpcd.conf
interface eth0
static ip_address=192.168.122.200/24
static routers=192.168.122.254
static domain_name_servers=8.8.8.8 8.8.4.4
```
You may need to reboot for this to take effect
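Before committing to a static address it is worth checking that it really sits in the bridge's subnet and does not collide with a reserved address. Here is a minimal sketch using Python's standard ipaddress module; the helper name `check_guest_address` is illustrative, and the addresses are the ones used in this example.

```python
import ipaddress


def check_guest_address(bridge_cidr, guest_cidr):
    """Return a list of problems with the guest address (empty if it looks fine)."""
    bridge = ipaddress.ip_interface(bridge_cidr)
    guest = ipaddress.ip_interface(guest_cidr)
    net = bridge.network
    problems = []
    if guest.network != net:
        problems.append(f"{guest.ip} is not in the bridge subnet {net}")
    if guest.ip in (net.network_address, net.broadcast_address):
        problems.append(f"{guest.ip} is the network or broadcast address")
    if guest.ip == bridge.ip:
        problems.append(f"{guest.ip} is already used by the bridge")
    return problems


problems = check_guest_address("192.168.122.1/24", "192.168.122.200/24")
print(problems or "address looks usable")
```

The same check can be rerun with your own bridge address if your host does not use the 192.168.122.0/24 range.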
(in host) Add TAP to bridge
Finally, add the machine's TAP interface to your chosen bridge with the brctl command:
```shell
$ sudo brctl addif virbr0 vnet0
```
Now, on your host, you should be able to ping 192.168.122.200 (or your Pi's address).
Set up SSH
Now, in your machine, you can run sudo raspi-config and enable the SSH server (in the "Interfacing Options" menu at time of writing).
Make sure you change the password from default while you're there!
Finally, on your host, run ssh-copy-id pi@192.168.122.200 to copy your SSH key into the Pi's pi user and you can now SSH directly into your Pi without a password prompt.
Simple ARM NEON optimized sin, cos, log and exp
This is the sequel to the single precision SSE optimized sin, cos, log and exp that I wrote some time ago, adapted to the NEON FPU of my pandaboard. Precision and range are exactly the same as in the SSE version, so I won't repeat them.
The code
The functions below are licensed under the zlib license, so you can do basically what you want with them.
- neon_mathfun.h source code for sin_ps, cos_ps, sincos_ps, exp_ps, log_ps, as straight C.
- neon_mathfun_test.c Validation+Bench program for those functions. Do not forget to run it once.
Performance
Results on a pandaboard with a 1GHz dual-core ARM Cortex A9 (OMAP4), using gcc 4.6.1
command line: gcc -O3 -mfloat-abi=softfp -mfpu=neon -march=armv7-a -mtune=cortex-a9 -Wall -W neon_mathfun_test.c -lm
```
exp([        -1000,          -100,           100,          1000]) = [            0,             0, 2.4061436e+38, 2.4061436e+38]
exp([         -nan,           inf,          -inf,           nan]) = [          nan, 2.4061436e+38,             0,           nan]
log([            0,           -10,         1e+30, 1.0005271e-42]) = [         -nan,          -nan,     69.077553,          -nan]
log([         -nan,           inf,          -inf,           nan]) = [    89.128304,     88.722839,          -nan,     89.128304]
sin([         -nan,           inf,          -inf,           nan]) = [          nan,           nan,          -nan,           nan]
cos([         -nan,           inf,          -inf,           nan]) = [          nan,           nan,           nan,           nan]
sin([       -1e+30,       -100000,         1e+30,        100000]) = [          inf,  -0.035749275,          -inf,   0.035749275]
cos([       -1e+30,       -100000,         1e+30,        100000]) = [          nan,    -0.9993608,           nan,    -0.9993608]
benching                 sinf .. ->    2.0 millions of vector evaluations/second ->  121 cycles/value on a 1000MHz computer
benching                 cosf .. ->    1.8 millions of vector evaluations/second ->  132 cycles/value on a 1000MHz computer
benching                 expf .. ->    1.1 millions of vector evaluations/second ->  221 cycles/value on a 1000MHz computer
benching                 logf .. ->    1.7 millions of vector evaluations/second ->  141 cycles/value on a 1000MHz computer
benching          cephes_sinf .. ->    2.4 millions of vector evaluations/second ->  103 cycles/value on a 1000MHz computer
benching          cephes_cosf .. ->    2.0 millions of vector evaluations/second ->  123 cycles/value on a 1000MHz computer
benching          cephes_expf .. ->    1.6 millions of vector evaluations/second ->  153 cycles/value on a 1000MHz computer
benching          cephes_logf .. ->    1.5 millions of vector evaluations/second ->  156 cycles/value on a 1000MHz computer
benching               sin_ps .. ->    5.8 millions of vector evaluations/second ->   43 cycles/value on a 1000MHz computer
benching               cos_ps .. ->    5.9 millions of vector evaluations/second ->   42 cycles/value on a 1000MHz computer
benching            sincos_ps .. ->    6.0 millions of vector evaluations/second ->   41 cycles/value on a 1000MHz computer
benching               exp_ps .. ->    5.6 millions of vector evaluations/second ->   44 cycles/value on a 1000MHz computer
benching               log_ps .. ->    5.3 millions of vector evaluations/second ->   47 cycles/value on a 1000MHz computer
```

So performance is not stellar. I recommend using gcc 4.6.1 or newer, as it generates much better code than previous (gcc 4.5) versions, almost 20% faster here. I believe rewriting these functions in assembly would improve the performance by 30%, and it should not be very hard as ARM and NEON asm is quite nice and easy to write (maybe I'll do it). Computing two SIMD vectors at once would also improve the performance a lot, as there are enough registers on NEON and it would reduce the dependencies between NEON instructions.
Note also that I have no idea of the performance on a Cortex A8 -- it may be extremely bad, I don't know.
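The "cycles/value" column can be reproduced from the throughput column. The sketch below assumes that each "vector evaluation" processes four single-precision values (one 128-bit NEON or SSE register); that reading is inferred from the printed numbers rather than taken from the benchmark source, and the function name is illustrative.

```python
LANES = 4  # assumption: four 32-bit floats per 128-bit SIMD vector


def cycles_per_value(clock_hz, vector_evals_per_second, lanes=LANES):
    """Clock cycles spent per scalar result at the given throughput."""
    return clock_hz / (vector_evals_per_second * lanes)


# sin_ps on the 1000 MHz Cortex A9: 5.8 million vector evaluations per second.
neon_sin_ps = cycles_per_value(1000e6, 5.8e6)

# libm sinf on the same board: 2.0 million vector evaluations per second.
neon_sinf = cycles_per_value(1000e6, 2.0e6)

print(f"sin_ps: {neon_sin_ps:.1f} cycles/value")
print(f"sinf:   {neon_sinf:.1f} cycles/value")
print(f"ratio:  {neon_sinf / neon_sin_ps:.1f}x")
```

Because the printed throughput is rounded to one decimal, the recomputed figures only match the benchmark's own cycle counts approximately.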
Comparison with an Intel Atom
For comparison purposes, here is the performance of the SSE version on a single core Intel Atom N270 running at 1.66GHz
command line: cl.exe /arch:SSE /O2 /TP /MD sse_mathfun_test.c (this is msvc 2010)
```
benching                 sinf .. ->    1.3 millions of vector evaluations/second ->  303 cycles/value on a 1600MHz computer
benching                 cosf .. ->    1.3 millions of vector evaluations/second ->  305 cycles/value on a 1600MHz computer
benching         sincos (x87) .. ->    1.2 millions of vector evaluations/second ->  314 cycles/value on a 1600MHz computer
benching                 expf .. ->    1.6 millions of vector evaluations/second ->  244 cycles/value on a 1600MHz computer
benching                 logf .. ->    1.4 millions of vector evaluations/second ->  276 cycles/value on a 1600MHz computer
benching          cephes_sinf .. ->    1.4 millions of vector evaluations/second ->  280 cycles/value on a 1600MHz computer
benching          cephes_cosf .. ->    1.5 millions of vector evaluations/second ->  265 cycles/value on a 1600MHz computer
benching          cephes_expf .. ->    0.7 millions of vector evaluations/second ->  548 cycles/value on a 1600MHz computer
benching          cephes_logf .. ->    0.8 millions of vector evaluations/second ->  489 cycles/value on a 1600MHz computer
benching               sin_ps .. ->    9.2 millions of vector evaluations/second ->   43 cycles/value on a 1600MHz computer
benching               cos_ps .. ->    9.5 millions of vector evaluations/second ->   42 cycles/value on a 1600MHz computer
benching            sincos_ps .. ->    8.8 millions of vector evaluations/second ->   45 cycles/value on a 1600MHz computer
benching               exp_ps .. ->    9.8 millions of vector evaluations/second ->   41 cycles/value on a 1600MHz computer
benching               log_ps .. ->    8.6 millions of vector evaluations/second ->   46 cycles/value on a 1600MHz computer
```