前言

本文的原文连接是:
https://blog.csdn.net/freewebsys/article/details/129331455

未经博主允许不得转载。
博主CSDN地址是：https://blog.csdn.net/freewebsys
博主掘金地址是：https://juejin.cn/user/585379920479288
博主知乎地址是：https://www.zhihu.com/people/freewebsystem

1，关于gpt2，之前在intel的i7上面跑耗时1小时20分钟，这次在amd的cpu上试试

https://yanghuaiyuan.blog.csdn.net/article/details/129327664

结果悲剧了，模型训练了一次就推出了，默认的模型好像不支持amd的cpu，

# time python demo.py 
2023-03-04 01:10:38.853476: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-04 01:10:40.558346: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
########### init start ###########
2023-03-04 01:10:48.376120: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:357] MLIR V1 optimization pass is not enabled
Loading checkpoint models/124M/model.ckpt
Loading dataset...
100%|███████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.19s/it]
dataset has 338025 tokens
Training...
[1 | 35.05] loss=4.39 avg=4.39
Killed

real	1m16.797s
user	11m24.346s
sys	1m55.479s

结果直接退出了。但是没有关系正好重新编译下，顺便解决cpu的那个问题。

2，使用docker编译TensorFlow

直接参考goolge的中文官方文档：
https://tensorflow.google.cn/install/docker?hl=zh-cn

这里要特别的注意，带 devel 是最新的 TensorFlow 源代码构建仅支持 CPU 的软件包。
里面并没有tensorflow的库，特别的大。

docker pull tensorflow/tensorflow:devel
docker pull tensorflow/tensorflow:devel-gpu

docker run --name tfbuild -itd -v $PWD:/mnt \
    -e HOST_PERMS="$(id -u):$(id -g)" tensorflow/tensorflow:devel

然后就是登陆进容器进行编译了。这个按照官方网站的操作手册一点一点执行就行了。

docker exec -it tfbuild bash

#选择默认值就行。
#首先要切换到tags 的分支 2.11.0 ，默认是master是开发的版本，不是正式版本。
#切换分支代码，master是开发版本

cd /tensorflow_src/
# 大工程分支特别多，耐心等待。
git fetch --tags
git checkout v2.11.0

# git status
HEAD detached at v2.11.0

# 一顿默认配置就行。
./configure  
# 设置8G内存，太高会导致killed 
# https://github.com/tensorflow/tensorflow/issues/41480

# 这个是两个斜杠！！好像是个参数的问题。当前目录下有个文件：
# /tensorflow_src/tensorflow/tools/pip_package/build_pip_package.sh
# 
bazel build --config=mkl --local_ram_resources=8000  --local_cpu_resources=4 //tensorflow/tools/pip_package:build_pip_package

# create package
./bazel-bin/tensorflow/tools/pip_package/build_pip_package /mnt/out  

chown $HOST_PERMS /tensorflow/out/tensorflow-version-tags.whl

这样就完成了针对AMD的cpu的编译优化，whl就是二进制文件。然后直接放到容器中按照就行。
开始编译中，漫长的等待。
在这里插入图片描述

因为在python的官网没有 ubuntu的镜像，所以直接用tensorflow的镜像先卸载，再安装就行了。

编译失败报错：

# bazel build --config=opt --local_ram_resources=8000 //tensorflow/tools/pip_package:build_pip_package
Extracting Bazel installation...

Starting local Bazel server and connecting to it...
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=1 --terminal_columns=150
INFO: Reading rc options for 'build' from /tensorflow_src/.bazelrc:
  Inherited 'common' options: --experimental_repo_remote_exec
INFO: Reading rc options for 'build' from /tensorflow_src/.bazelrc:
  'build' options: --define framework_shared_object=true --define tsl_protobuf_header_only=true --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --spawn_strategy=standalone -c opt --announce_rc --define=grpc_no_ares=true --noincompatible_remove_legacy_whole_archive --enable_platform_specific_config --define=with_xla_support=true --config=short_logs --config=v2 --define=no_aws_support=true --define=no_hdfs_support=true --experimental_cc_shared_library --experimental_link_static_libraries_once=false
INFO: Reading rc options for 'build' from /tensorflow_src/.tf_configure.bazelrc:
  'build' options: --action_env PYTHON_BIN_PATH=/usr/bin/python3 --action_env PYTHON_LIB_PATH=/usr/lib/python3/dist-packages --python_path=/usr/bin/python3
INFO: Reading rc options for 'build' from /tensorflow_src/.bazelrc:
  'build' options: --deleted_packages=tensorflow/compiler/mlir/tfrt,tensorflow/compiler/mlir/tfrt/benchmarks,tensorflow/compiler/mlir/tfrt/jit/python_binding,tensorflow/compiler/mlir/tfrt/jit/transforms,tensorflow/compiler/mlir/tfrt/python_tests,tensorflow/compiler/mlir/tfrt/tests,tensorflow/compiler/mlir/tfrt/tests/ir,tensorflow/compiler/mlir/tfrt/tests/analysis,tensorflow/compiler/mlir/tfrt/tests/jit,tensorflow/compiler/mlir/tfrt/tests/lhlo_to_tfrt,tensorflow/compiler/mlir/tfrt/tests/lhlo_to_jitrt,tensorflow/compiler/mlir/tfrt/tests/tf_to_corert,tensorflow/compiler/mlir/tfrt/tests/tf_to_tfrt_data,tensorflow/compiler/mlir/tfrt/tests/saved_model,tensorflow/compiler/mlir/tfrt/transforms/lhlo_gpu_to_tfrt_gpu,tensorflow/core/runtime_fallback,tensorflow/core/runtime_fallback/conversion,tensorflow/core/runtime_fallback/kernel,tensorflow/core/runtime_fallback/opdefs,tensorflow/core/runtime_fallback/runtime,tensorflow/core/runtime_fallback/util,tensorflow/core/tfrt/common,tensorflow/core/tfrt/eager,tensorflow/core/tfrt/eager/backends/cpu,tensorflow/core/tfrt/eager/backends/gpu,tensorflow/core/tfrt/eager/core_runtime,tensorflow/core/tfrt/eager/cpp_tests/core_runtime,tensorflow/core/tfrt/gpu,tensorflow/core/tfrt/run_handler_thread_pool,tensorflow/core/tfrt/runtime,tensorflow/core/tfrt/saved_model,tensorflow/core/tfrt/graph_executor,tensorflow/core/tfrt/saved_model/tests,tensorflow/core/tfrt/tpu,tensorflow/core/tfrt/utils
INFO: Found applicable config definition build:short_logs in file /tensorflow_src/.bazelrc: --output_filter=DONT_MATCH_ANYTHING
INFO: Found applicable config definition build:v2 in file /tensorflow_src/.bazelrc: --define=tf_api_version=2 --action_env=TF2_BEHAVIOR=1
INFO: Found applicable config definition build:opt in file /tensorflow_src/.tf_configure.bazelrc: --copt=-Wno-sign-compare --host_copt=-Wno-sign-compare
INFO: Found applicable config definition build:linux in file /tensorflow_src/.bazelrc: --host_copt=-w --copt=-Wno-all --copt=-Wno-extra --copt=-Wno-deprecated --copt=-Wno-deprecated-declarations --copt=-Wno-ignored-attributes --copt=-Wno-unknown-warning --copt=-Wno-array-parameter --copt=-Wno-stringop-overflow --copt=-Wno-array-bounds --copt=-Wunused-result --copt=-Werror=unused-result --define=PREFIX=/usr --define=LIBDIR=$(PREFIX)/lib --define=INCLUDEDIR=$(PREFIX)/include --define=PROTOBUF_INCLUDE_PATH=$(PREFIX)/include --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --config=dynamic_kernels --distinct_host_configuration=false --experimental_guard_against_concurrent_changes
INFO: Found applicable config definition build:dynamic_kernels in file /tensorflow_src/.bazelrc: --define=dynamic_loaded_kernels=true --copt=-DAUTOLOAD_DYNAMIC_KERNELS
WARNING: Download from https://storage.googleapis.com/mirror.tensorflow.org/github.com/tensorflow/runtime/archive/4ce3e4da2e21ae4dfcee9366415e55f408c884ec.tar.gz failed: class java.io.FileNotFoundException GET returned 404 Not Found
WARNING: Download from https://mirror.bazel.build/github.com/bazelbuild/rules_cc/archive/081771d4a0e9d7d3aa0eed2ef389fa4700dfb23e.tar.gz failed: class java.io.FileNotFoundException GET returned 404 Not Found
WARNING: Download from https://storage.googleapis.com/mirror.tensorflow.org/github.com/openxla/stablehlo/archive/fdd47908468488cbbb386bb7fc723dc19321cb83.zip failed: class java.io.FileNotFoundException GET returned 404 Not Found
WARNING: Download from https://storage.googleapis.com/mirror.tensorflow.org/github.com/google/XNNPACK/archive/e8f74a9763aa36559980a0c2f37f587794995622.zip failed: class java.io.FileNotFoundException GET returned 404 Not Found
WARNING: Download from https://golang.org/dl/?mode=json&include=all failed: class java.io.IOException connect timed out
INFO: Analyzed target //tensorflow/tools/pip_package:build_pip_package (544 packages loaded, 31179 targets configured).
INFO: Found 1 target...
ERROR: /tensorflow_src/tensorflow/core/kernels/BUILD:3680:18: Compiling tensorflow/core/kernels/cwise_op_greater.cc failed: (Exit 1): gcc failed: error executing command /usr/bin/gcc -U_FORTIFY_SOURCE -fstack-protector -Wall -Wunused-but-set-parameter -Wno-free-nonheap-object -fno-omit-frame-pointer -g0 -O2 '-D_FORTIFY_SOURCE=1' -DNDEBUG -ffunction-sections ... (remaining 248 arguments skipped)
gcc: fatal error: Killed signal terminated program cc1plus
compilation terminated.
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 28285.593s, Critical Path: 21597.74s
INFO: 6601 processes: 1056 internal, 5545 local.
FAILED: Build did NOT complete successfully

后来查询得知是编译过程中swap空间不足引起的，采用以下方法可以正常编译：

time bazel build -c opt //tensorflow/tools/pip_package:build_pip_package --local_ram_resources=4000 --local_cpu_resources=4

在这里插入图片描述

# bazel build -c opt //tensorflow/tools/pip_package:build_pip_package --local_ram_resources=4000 --local_cpu_resources=4
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
INFO: Options provided by the client:
  Inherited 'common' options: --isatty=1 --terminal_columns=140
INFO: Reading rc options for 'build' from /tensorflow_src/.bazelrc:
  Inherited 'common' options: --experimental_repo_remote_exec
INFO: Reading rc options for 'build' from /tensorflow_src/.bazelrc:
  'build' options: --define framework_shared_object=true --define tsl_protobuf_header_only=true --define=use_fast_cpp_protos=true --define=allow_oversize_protos=true --spawn_strategy=standalone -c opt --announce_rc --define=grpc_no_ares=true --noincompatible_remove_legacy_whole_archive --enable_platform_specific_config --define=with_xla_support=true --config=short_logs --config=v2 --define=no_aws_support=true --define=no_hdfs_support=true --experimental_cc_shared_library --experimental_link_static_libraries_once=false
INFO: Reading rc options for 'build' from /tensorflow_src/.tf_configure.bazelrc:
  'build' options: --action_env PYTHON_BIN_PATH=/usr/bin/python3 --action_env PYTHON_LIB_PATH=/usr/lib/python3/dist-packages --python_path=/usr/bin/python3
INFO: Reading rc options for 'build' from /tensorflow_src/.bazelrc:
  'build' options: --deleted_packages=tensorflow/compiler/mlir/tfrt,tensorflow/compiler/mlir/tfrt/benchmarks,tensorflow/compiler/mlir/tfrt/jit/python_binding,tensorflow/compiler/mlir/tfrt/jit/transforms,tensorflow/compiler/mlir/tfrt/python_tests,tensorflow/compiler/mlir/tfrt/tests,tensorflow/compiler/mlir/tfrt/tests/ir,tensorflow/compiler/mlir/tfrt/tests/analysis,tensorflow/compiler/mlir/tfrt/tests/jit,tensorflow/compiler/mlir/tfrt/tests/lhlo_to_tfrt,tensorflow/compiler/mlir/tfrt/tests/lhlo_to_jitrt,tensorflow/compiler/mlir/tfrt/tests/tf_to_corert,tensorflow/compiler/mlir/tfrt/tests/tf_to_tfrt_data,tensorflow/compiler/mlir/tfrt/tests/saved_model,tensorflow/compiler/mlir/tfrt/transforms/lhlo_gpu_to_tfrt_gpu,tensorflow/core/runtime_fallback,tensorflow/core/runtime_fallback/conversion,tensorflow/core/runtime_fallback/kernel,tensorflow/core/runtime_fallback/opdefs,tensorflow/core/runtime_fallback/runtime,tensorflow/core/runtime_fallback/util,tensorflow/core/tfrt/common,tensorflow/core/tfrt/eager,tensorflow/core/tfrt/eager/backends/cpu,tensorflow/core/tfrt/eager/backends/gpu,tensorflow/core/tfrt/eager/core_runtime,tensorflow/core/tfrt/eager/cpp_tests/core_runtime,tensorflow/core/tfrt/gpu,tensorflow/core/tfrt/run_handler_thread_pool,tensorflow/core/tfrt/runtime,tensorflow/core/tfrt/saved_model,tensorflow/core/tfrt/graph_executor,tensorflow/core/tfrt/saved_model/tests,tensorflow/core/tfrt/tpu,tensorflow/core/tfrt/utils
INFO: Found applicable config definition build:short_logs in file /tensorflow_src/.bazelrc: --output_filter=DONT_MATCH_ANYTHING
INFO: Found applicable config definition build:v2 in file /tensorflow_src/.bazelrc: --define=tf_api_version=2 --action_env=TF2_BEHAVIOR=1
INFO: Found applicable config definition build:linux in file /tensorflow_src/.bazelrc: --host_copt=-w --copt=-Wno-all --copt=-Wno-extra --copt=-Wno-deprecated --copt=-Wno-deprecated-declarations --copt=-Wno-ignored-attributes --copt=-Wno-unknown-warning --copt=-Wno-array-parameter --copt=-Wno-stringop-overflow --copt=-Wno-array-bounds --copt=-Wunused-result --copt=-Werror=unused-result --define=PREFIX=/usr --define=LIBDIR=$(PREFIX)/lib --define=INCLUDEDIR=$(PREFIX)/include --define=PROTOBUF_INCLUDE_PATH=$(PREFIX)/include --cxxopt=-std=c++17 --host_cxxopt=-std=c++17 --config=dynamic_kernels --distinct_host_configuration=false --experimental_guard_against_concurrent_changes
INFO: Found applicable config definition build:dynamic_kernels in file /tensorflow_src/.bazelrc: --define=dynamic_loaded_kernels=true --copt=-DAUTOLOAD_DYNAMIC_KERNELS
WARNING: Download from https://storage.googleapis.com/mirror.tensorflow.org/github.com/tensorflow/runtime/archive/4ce3e4da2e21ae4dfcee9366415e55f408c884ec.tar.gz failed: class java.io.FileNotFoundException GET returned 404 Not Found
WARNING: Download from https://storage.googleapis.com/mirror.tensorflow.org/github.com/openxla/stablehlo/archive/fdd47908468488cbbb386bb7fc723dc19321cb83.zip failed: class java.io.FileNotFoundException GET returned 404 Not Found
WARNING: Download from https://mirror.bazel.build/github.com/bazelbuild/rules_cc/archive/081771d4a0e9d7d3aa0eed2ef389fa4700dfb23e.tar.gz failed: class java.io.FileNotFoundException GET returned 404 Not Found
WARNING: Download from https://storage.googleapis.com/mirror.tensorflow.org/github.com/google/XNNPACK/archive/e8f74a9763aa36559980a0c2f37f587794995622.zip failed: class java.io.FileNotFoundException GET returned 404 Not Found
WARNING: Download from https://golang.org/dl/?mode=json&include=all failed: class java.io.IOException connect timed out
INFO: Analyzed target //tensorflow/tools/pip_package:build_pip_package (544 packages loaded, 31179 targets configured).
INFO: Found 1 target...
Target //tensorflow/tools/pip_package:build_pip_package up-to-date:
  bazel-bin/tensorflow/tools/pip_package/build_pip_package
INFO: Elapsed time: 12172.184s, Critical Path: 319.39s
INFO: 13069 processes: 1519 internal, 11550 local.
INFO: Build completed successfully, 13069 total actions


real	97m39.864s
user	0m0.620s
sys	0m0.355s

3，终于编译完成了，期间出现过几次kill大概率都是内存cpu不够了导致的，然后打包安装

经过97分钟的编译终于可以了，然后再打包下；

# 打包生成 安装文件
# ./bazel-bin/tensorflow/tools/pip_package/build_pip_package /mnt/out  

  check.warn(importable)
/usr/local/lib/python3.8/dist-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
Mon Mar 6 06:12:35 UTC 2023 : === Output wheel file is in: /mnt/out

测试tf 库是否存在：

   python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"

在没有安装前：
    raise ImportError("Could not import tensorflow. Do not import tensorflow "
安装之后：
mkdir /root/.pip/

# 增加 pip 的源，再进行安装 tf
echo "[global]" > ~/.pip/pip.conf
echo "index-url = https://mirrors.aliyun.com/pypi/simple/" >> ~/.pip/pip.conf
echo "[install]" >> ~/.pip/pip.conf
echo "trusted-host=mirrors.aliyun.com" >> ~/.pip/pip.conf

# 直接安装 tf的编译后的whl 文件和相关的依赖库：
p3 install tensorflow-2.11.0-cp38-cp38-linux_x86_64.whl

python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
2023-03-06 06:26:01.652690: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-06 06:26:03.657543: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
tf.Tensor(-155.91696, shape=(), dtype=float32)

4，总结

https://u.jd.com/IsILuAE

打算找个便宜的主机，拆下系统盘，做模型训练，等需要电脑配置升级了卖了买新的。
在这里插入图片描述
为啥选择这么便宜的电脑呢，主要利用的就是是rtx3060的12G显存。
可以购买好多台，然后进行Tensorflow集群，做训练。主要还是没钱，穷。

tesnsorflow使用 devel 进行支持源代码的编译，非常的方便。
里面的babel 都配置好了，只要设置好了内存，慢慢等待编译结果就行了。
–local_ram_resources=4000
–local_cpu_resources=4
就可以直接在当前的镜像中进行安装即可。