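I spent a whole afternoon wrestling with this today, so I am writing down the pitfalls I ran into. Most of them come from the TX1's aarch64 architecture and its painfully small memory and storage.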

Environment

  • Hardware: NVIDIA Jetson TX1 Developer Kit
  • Software: JetPack 2.3.1
    • Ubuntu 16.04 64-bit (aarch64)
    • CUDA 8.0
    • cuDNN 5.1

Installation

I recommend keeping an HTTP/HTTPS proxy enabled throughout, otherwise download speeds from inside China are painfully slow.
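As a minimal sketch of the proxy setup, the variables below are the ones that many command-line tools such as curl, git and wget read. The proxy address is a placeholder assumption, not something from my setup notes, so replace it with your own.

```shell
# Hypothetical proxy endpoint; replace it with your own.
PROXY_URL="${PROXY_URL:-http://127.0.0.1:8118}"

# Export both lower-case and upper-case forms, since tools differ
# in which one they read.
export http_proxy="$PROXY_URL"
export https_proxy="$PROXY_URL"
export HTTP_PROXY="$PROXY_URL"
export HTTPS_PROXY="$PROXY_URL"

# sudo usually resets the environment, so pass -E to keep these
# settings for commands such as apt-get (subject to your sudoers policy):
#   sudo -E apt-get update
echo "proxy set to $http_proxy"
```

Java-based tools such as Gradle do not necessarily read these variables and may need their own proxy settings.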

Install Java

$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer

Install dependencies

$ sudo apt-get install git zip unzip autoconf automake libtool curl zlib1g-dev maven
$ sudo apt-get install python-numpy swig python-dev python-wheel

Build protobuf

Protobuf has to be built twice here: once for grpc-java and once for Bazel.

# For grpc-java build
$ git clone https://github.com/google/protobuf.git
$ cd protobuf
$ git checkout master
$ ./autogen.sh
$ git checkout v3.0.0-beta-3
$ ./autogen.sh
$ LDFLAGS=-static ./configure --prefix=$(pwd)/../
$ sed -i -e 's/LDFLAGS = -static/LDFLAGS = -all-static/' ./src/Makefile
$ make -j 4
$ make install

# For bazel build
$ git checkout v3.0.0-beta-2
$ ./autogen.sh
$ LDFLAGS=-static ./configure --prefix=$(pwd)/../
$ sed -i -e 's/LDFLAGS = -static/LDFLAGS = -all-static/' ./src/Makefile
$ make -j 4
$ cd ..

Note: the copy built for Bazel does not need make install; its protoc binary is simply copied over with cp later.
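Before moving on, it can help to confirm that both protoc binaries were actually produced. The helper below is a small sketch; the function name is my own, and the example paths assume the directory layout used in the commands above.

```shell
# Print the version reported by a protoc binary, or fail if it is missing.
protoc_version() {
  local bin="$1"
  if [ ! -x "$bin" ]; then
    echo "missing or not executable: $bin" >&2
    return 1
  fi
  "$bin" --version
}

# Paths assume the layout used above; adjust them if yours differs.
# protoc_version ./bin/protoc            # installed copy (for grpc-java)
# protoc_version ./protobuf/src/protoc   # in-tree copy (for Bazel)
```

If the two calls report the same version, the second build probably did not pick up the v3.0.0-beta-2 checkout.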

Build grpc-java compiler

$ git clone https://github.com/neo-titans/odroid.git
$ git clone https://github.com/grpc/grpc-java.git
$ cd grpc-java/
$ git checkout v0.15.0
$ patch -p0 < ../odroid/build_tensorflow/grpc-java.v0.15.0.patch
$ CXXFLAGS="-I$(pwd)/../include" LDFLAGS="-L$(pwd)/../lib" ./gradlew java_pluginExecutable -Pprotoc=$(pwd)/../bin/protoc
$ cd ..

Build bazel

$ git clone https://github.com/bazelbuild/bazel.git
$ cd bazel
$ git checkout 0.3.2
$ cp ../protobuf/src/protoc third_party/protobuf/protoc-linux-arm32.exe
$ cp ../grpc-java/compiler/build/exe/java_plugin/protoc-gen-grpc-java third_party/grpc/protoc-gen-grpc-java-0.15.0-linux-arm32.exe

Before building Bazel, a few files need to be patched so that Bazel treats aarch64 as ARM; otherwise the build does not succeed.

diff --git a/compile.sh b/compile.sh
index 53fc412..11035d9 100755
--- a/compile.sh
+++ b/compile.sh
@@ -27,7 +27,7 @@ cd "$(dirname "$0")"
 # Set the default verbose mode in buildenv.sh so that we do not display command
 # output unless there is a failure.  We do this conditionally to offer the user
 # a chance of overriding this in case they want to do so.
-: ${VERBOSE:=no}
+: ${VERBOSE:=yes}

 source scripts/bootstrap/buildenv.sh

diff --git a/scripts/bootstrap/compile.sh b/scripts/bootstrap/compile.sh
index 77372f0..657b254 100755
--- a/scripts/bootstrap/compile.sh
+++ b/scripts/bootstrap/compile.sh
@@ -48,6 +48,7 @@ linux)
   else
     if [ "${MACHINE_IS_ARM}" = 'yes' ]; then
       PROTOC=${PROTOC:-third_party/protobuf/protoc-linux-arm32.exe}
+      GRPC_JAVA_PLUGIN=${GRPC_JAVA_PLUGIN:-third_party/grpc/protoc-gen-grpc-java-0.15.0-linux-arm32.exe}
     else
       PROTOC=${PROTOC:-third_party/protobuf/protoc-linux-x86_32.exe}
       GRPC_JAVA_PLUGIN=${GRPC_JAVA_PLUGIN:-third_party/grpc/protoc-gen-grpc-java-0.15.0-linux-x86_32.exe}
@@ -150,7 +151,7 @@ function java_compilation() {

   run "${JAVAC}" -classpath "${classpath}" -sourcepath "${sourcepath}" \
       -d "${output}/classes" -source "$JAVA_VERSION" -target "$JAVA_VERSION" \
-      -encoding UTF-8 "@${paramfile}"
+      -encoding UTF-8 "@${paramfile}" -J-Xmx500M

   log "Extracting helper classes for $name..."
   for f in ${library_jars} ; do
diff --git a/src/main/java/com/google/devtools/build/lib/util/CPU.java b/src/main/java/com/google/devtools/build/lib/util/CPU.java
index 41af4b1..4d80610 100644
--- a/src/main/java/com/google/devtools/build/lib/util/CPU.java
+++ b/src/main/java/com/google/devtools/build/lib/util/CPU.java
@@ -26,7 +26,7 @@ public enum CPU {
   X86_32("x86_32", ImmutableSet.of("i386", "i486", "i586", "i686", "i786", "x86")),
   X86_64("x86_64", ImmutableSet.of("amd64", "x86_64", "x64")),
   PPC("ppc", ImmutableSet.of("ppc", "ppc64", "ppc64le")),
-  ARM("arm", ImmutableSet.of("arm", "armv7l")),
+  ARM("arm", ImmutableSet.of("arm", "armv7l", "aarch64")),
   UNKNOWN("unknown", ImmutableSet.<String>of());

   private final String canonicalName;
diff --git a/third_party/grpc/BUILD b/third_party/grpc/BUILD
index 2ba07e3..c7925ff 100644
--- a/third_party/grpc/BUILD
+++ b/third_party/grpc/BUILD
@@ -29,7 +29,7 @@ filegroup(
         "//third_party:darwin": ["protoc-gen-grpc-java-0.15.0-osx-x86_64.exe"],
         "//third_party:k8": ["protoc-gen-grpc-java-0.15.0-linux-x86_64.exe"],
         "//third_party:piii": ["protoc-gen-grpc-java-0.15.0-linux-x86_32.exe"],
-        "//third_party:arm": ["protoc-gen-grpc-java-0.15.0-linux-x86_32.exe"],
+        "//third_party:arm": ["protoc-gen-grpc-java-0.15.0-linux-arm32.exe"],
         "//third_party:freebsd": ["protoc-gen-grpc-java-0.15.0-linux-x86_32.exe"],
     }),
 )
diff --git a/third_party/protobuf/BUILD b/third_party/protobuf/BUILD
index 203fe51..4c2a316 100644
--- a/third_party/protobuf/BUILD
+++ b/third_party/protobuf/BUILD
@@ -28,6 +28,7 @@ filegroup(
         "//third_party:darwin": ["protoc-osx-x86_32.exe"],
         "//third_party:k8": ["protoc-linux-x86_64.exe"],
         "//third_party:piii": ["protoc-linux-x86_32.exe"],
+        "//third_party:arm": ["protoc-linux-arm32.exe"],
         "//third_party:freebsd": ["protoc-linux-x86_32.exe"],
     }),
 )
diff --git a/tools/cpp/cc_configure.bzl b/tools/cpp/cc_configure.bzl
index aeb0715..688835d 100644
--- a/tools/cpp/cc_configure.bzl
+++ b/tools/cpp/cc_configure.bzl
@@ -150,7 +150,12 @@ def _get_cpu_value(repository_ctx):
     return "x64_windows"
   # Use uname to figure out whether we are on x86_32 or x86_64
   result = repository_ctx.execute(["uname", "-m"])
-  return "k8" if result.stdout.strip() in ["amd64", "x86_64", "x64"] else "piii"
+  machine = result.stdout.strip()
+  if machine in ["arm", "armv7l", "aarch64"]:
+   return "arm"
+  elif machine in ["amd64", "x86_64", "x64"]:
+   return "k8"
+  return "piii"


 _INC_DIR_MARKER_BEGIN = "#include <...>"
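
The CPU detection change in the cc_configure.bzl hunk above can be restated as a small shell function, which is handy for checking what value the patched Bazel should pick on a given board. This is only an illustration of the mapping; the function name is made up here.

```shell
# Map a machine name (as printed by `uname -m`) to the Bazel CPU value,
# following the patched _get_cpu_value logic.
bazel_cpu_for() {
  case "$1" in
    arm|armv7l|aarch64) echo "arm" ;;
    amd64|x86_64|x64)   echo "k8" ;;
    *)                  echo "piii" ;;
  esac
}

# Show the value for the current machine.
bazel_cpu_for "$(uname -m)"
```

On the TX1, `uname -m` reports aarch64, so the patched code selects the "arm" entries in the two BUILD files.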

Then build and install:

$ ./compile.sh 
$ sudo cp output/bazel /usr/local/bin
$ cd ..

Build TensorFlow

$ git clone https://github.com/tensorflow/tensorflow.git
$ cd tensorflow
$ git checkout r0.11

Likewise, patch the source files:

diff --git a/tensorflow/core/kernels/BUILD b/tensorflow/core/kernels/BUILD
index 2e04827..867aaca 100644
--- a/tensorflow/core/kernels/BUILD
+++ b/tensorflow/core/kernels/BUILD
@@ -1184,7 +1184,7 @@ tf_kernel_libraries(
         "segment_reduction_ops",
         "scan_ops",
         "sequence_ops",
-        "sparse_matmul_op",
+       #DC "sparse_matmul_op",
     ],
     deps = [
         ":bounds_check",
diff --git a/tensorflow/core/kernels/cwise_op_gpu_select.cu.cc b/tensorflow/core/kernels/cwise_op_gpu_select.cu.cc
index 02058a8..880252c 100644
--- a/tensorflow/core/kernels/cwise_op_gpu_select.cu.cc
+++ b/tensorflow/core/kernels/cwise_op_gpu_select.cu.cc
@@ -43,8 +43,14 @@ struct BatchSelectFunctor<GPUDevice, T> {
     const int all_but_batch = then_flat_outer_dims.dimension(1);

 #if !defined(EIGEN_HAS_INDEX_LIST)
-    Eigen::array<int, 2> broadcast_dims{{ 1, all_but_batch }};
-    Eigen::Tensor<int, 2>::Dimensions reshape_dims{{ batch, 1 }};
+    // Eigen::array<int, 2> broadcast_dims{{ 1, all_but_batch }};
+    Eigen::array<int, 2> broadcast_dims;
+    broadcast_dims[0] = 1;
+    broadcast_dims[1] = all_but_batch;
+    // Eigen::Tensor<int, 2>::Dimensions reshape_dims{{ batch, 1 }};
+    Eigen::Tensor<int, 2>::Dimensions reshape_dims;
+    reshape_dims[0] = batch;
+    reshape_dims[1] = 1;
 #else
     Eigen::IndexList<Eigen::type2index<1>, int> broadcast_dims;
     broadcast_dims.set(1, all_but_batch);
diff --git a/tensorflow/core/kernels/sparse_tensor_dense_matmul_op_gpu.cu.cc b/tensorflow/core/kernels/sparse_tensor_dense_matmul_op_gpu.cu.cc
index a177696..75b67ba 100644
--- a/tensorflow/core/kernels/sparse_tensor_dense_matmul_op_gpu.cu.cc
+++ b/tensorflow/core/kernels/sparse_tensor_dense_matmul_op_gpu.cu.cc
@@ -104,9 +104,17 @@ struct SparseTensorDenseMatMulFunctor<GPUDevice, T, ADJ_A, ADJ_B> {
     int n = (ADJ_B) ? b.dimension(0) : b.dimension(1);

 #if !defined(EIGEN_HAS_INDEX_LIST)
-    Eigen::Tensor<int, 2>::Dimensions matrix_1_by_nnz{{ 1, nnz }};
-    Eigen::array<int, 2> n_by_1{{ n, 1 }};
-    Eigen::array<int, 1> reduce_on_rows{{ 0 }};
+    // Eigen::Tensor<int, 2>::Dimensions matrix_1_by_nnz{{ 1, nnz }};
+    Eigen::Tensor<int, 2>::Dimensions matrix_1_by_nnz;
+    matrix_1_by_nnz[0] = 1;
+    matrix_1_by_nnz[1] = nnz;
+    // Eigen::array<int, 2> n_by_1{{ n, 1 }};
+    Eigen::array<int, 2> n_by_1;
+    n_by_1[0] = n;
+    n_by_1[1] = 1;
+    // Eigen::array<int, 1> reduce_on_rows{{ 0 }};
+    Eigen::array<int, 1> reduce_on_rows;
+    reduce_on_rows[0]= 0;
 #else
     Eigen::IndexList<Eigen::type2index<1>, int> matrix_1_by_nnz;
     matrix_1_by_nnz.set(1, nnz);
diff --git a/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc b/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc
index 52256a7..1d027b9 100644
--- a/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc
+++ b/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc
@@ -888,6 +888,9 @@ CudaContext* CUDAExecutor::cuda_context() { return context_; }
 // For anything more complicated/prod-focused than this, you'll likely want to
 // turn to gsys' topology modeling.
 static int TryToReadNumaNode(const string &pci_bus_id, int device_ordinal) {
+// DC - make this clever later. ARM has no NUMA node, just return 0
+LOG(INFO) << "ARM has no NUMA node, hardcoding to return zero";
+return 0;
 #if defined(__APPLE__)
   LOG(INFO) << "OS X does not support NUMA - returning NUMA node zero";
   return 0;
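
The diffs in this post are shown inline without a command for applying them. One way to do it, assuming each diff has been saved to its own file (the file names below are hypothetical), is a small wrapper around git apply that refuses to touch the tree unless the whole patch applies cleanly.

```shell
# Apply a diff to a git checkout, but only if it applies cleanly.
# Usage: apply_diff <repo-dir> <absolute-path-to-patch-file>
apply_diff() {
  local repo="$1" patch_file="$2"
  # --check verifies that the patch applies without modifying any file.
  git -C "$repo" apply --check "$patch_file" || return 1
  git -C "$repo" apply "$patch_file"
}

# Examples (tx1-bazel.diff and tx1-tensorflow.diff are hypothetical names):
# apply_diff ./bazel "$PWD/tx1-bazel.diff"
# apply_diff ./tensorflow "$PWD/tx1-tensorflow.diff"
```

The patch path should be absolute because git -C changes directory before reading it. A diff copied from a web page may have damaged whitespace, in which case the check fails and the affected hunks have to be edited by hand.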

After that, the build can be started:

$ ./configure
$ bazel build -c opt --jobs 2 --local_resources 1024,4.0,1.0 --config=cuda //tensorflow/tools/pip_package:build_pip_package
$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
# The name of the .whl file will depend on your platform.
$ sudo pip install /tmp/tensorflow_pkg/tensorflow-0.11.0-py2-none-any.whl
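
Since the wheel's file name depends on the platform, a small helper can pick up whatever was produced instead of hard-coding the name. This is a sketch; the function name is my own.

```shell
# Print the path of the first TensorFlow wheel found in a directory.
find_wheel() {
  local dir="$1" whl
  for whl in "$dir"/tensorflow*.whl; do
    # If the glob matched nothing, the pattern itself comes back; skip it.
    [ -e "$whl" ] || continue
    echo "$whl"
    return 0
  done
  echo "no TensorFlow wheel in $dir" >&2
  return 1
}

# Example:
# sudo pip install "$(find_wheel /tmp/tensorflow_pkg)"
```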

I have also built a tensorflow_gpu-0.11.0-py2-none-aarch64.whl myself, which you are welcome to use.

Start using TensorFlow

First, the Python bindings for OpenCV need to be installed:

$ sudo apt-get install -y libopencv4tegra-python

Then test whether TensorFlow was installed successfully:

$ python
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
>>>

If no error is reported, the installation succeeded.
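
The same check can be run non-interactively, which is convenient in a script. The sketch below wraps it in a function of my own naming; the PYTHON variable is an assumption that lets you point it at a different interpreter.

```shell
# Exit status 0 means the import worked.
# Set PYTHON to use an interpreter other than the default `python`.
check_tensorflow() {
  "${PYTHON:-python}" -c 'import tensorflow as tf; print(tf.__version__)'
}

# Example:
# check_tensorflow && echo "TensorFlow OK"
```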

Tips

Swap Memory

If the build fails, the most likely cause is running out of memory. You can attach a USB drive, an SSD or similar, and put a swap file on it.

$ cd /path/to/your/storage
$ fallocate -l 8G swapfile
$ chmod 600 swapfile
$ mkswap swapfile
$ sudo swapon swapfile
$ swapon -s

8 GB of swap should be enough; if it still is not, allocate a larger file.
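
To confirm that the swap is really active before starting a long build, the total can be read from /proc/meminfo. The helpers below are a sketch with made-up names; the optional file argument exists only so they can be tried against a sample file.

```shell
# Print total swap in MiB, read from a meminfo-style file
# (defaults to /proc/meminfo, which reports SwapTotal in kB).
swap_total_mb() {
  awk '/^SwapTotal:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Succeed if at least the given number of MiB of swap is active.
has_swap_mb() {
  local want="$1" have
  have="$(swap_total_mb "${2:-/proc/meminfo}")"
  [ "${have:-0}" -ge "$want" ]
}

# Example: warn before starting a long build.
# has_swap_mb 4096 || echo "less than 4 GiB of swap is active" >&2
```

Note that an 8G swap file reports slightly less than 8192 MiB, so it is safer to compare against a lower threshold.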

Then run the Bazel build again:

$ bazel build -c opt --local_resources 3072,4.0,1.0 --verbose_failures --config=cuda //tensorflow/tools/pip_package:build_pip_package

Build on external storage

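The whole installation needs upwards of 3 GB of space, while the TX1 has only about 4 GB free once the system is installed. It is therefore best to put the working directory for the build on external storage, so that the build does not fail because the internal storage runs out.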
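
A quick way to check the available space before cloning anything is to ask df about the directory you plan to build in. This is a sketch; the function names and the example mount point are hypothetical.

```shell
# Print free space (in KiB) on the filesystem that holds a directory.
free_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if the directory has at least the given number of GiB free.
has_free_gb() {
  local dir="$1" want_gb="$2"
  [ "$(free_kb "$dir")" -ge $(( want_gb * 1024 * 1024 )) ]
}

# Example (the mount point is hypothetical):
# has_free_gb /media/ubuntu/ssd 4 || echo "not enough space" >&2
```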

References