Hint: If you want to see a list of allocated tensors when OOM happens,

Category: Other · 2019-03-06 02:34:07

Problem description:

While building a Siamese network with Keras, I ran into the following error:

OOM when allocating tensor with shape[129024,4096] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [[Node: dense_1/kernel/Assign = Assign[T=DT_FLOAT, _class=["loc:@dense_1/kernel"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](dense_1/kernel, dense_1/kernel/cond/Merge)]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

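After searching online, I assumed it was a memory shortage. So I reduced batch_size and changed how sample pairs are generated from the original dataset. Neither helped.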

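On reflection, I doubted it was really a memory shortage: the original dataset is only 600 MB, and I was running on a server with 128 GB of RAM that nobody else was using. While the program ran I watched memory usage with top, and there was consistently more than 100 GB free. So I ruled out running out of system memory.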

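Next I suspected GPU memory (I'm not very clear on the difference between RAM and GPU memory or how each is used; corrections are welcome). I read the error log carefully (lesson learned: don't look only at the Traceback, the "Caused by" part matters more!).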

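It turned out that the fully connected layer had too many parameters, which exhausted GPU memory ([129024,4096], about 528 million weights). That really is far too many; I was careless when designing the network.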
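To see why this single layer is a problem, here is a minimal sketch of the arithmetic. It assumes 32-bit floats (the error message says `type float`) and counts only the kernel; `dense_kernel_bytes` is a hypothetical helper, not a Keras function.

```python
def dense_kernel_bytes(in_features: int, units: int, bytes_per_param: int = 4) -> int:
    """Bytes needed for a Dense layer's kernel (weight matrix) alone.

    Hypothetical helper: bias, gradients and optimizer state are not counted.
    """
    return in_features * units * bytes_per_param


# Shape taken from the OOM message: [129024, 4096], float32 (4 bytes each).
kernel = dense_kernel_bytes(129024, 4096)
print(f"{kernel:,} bytes = {kernel / 2**30:.2f} GiB")  # 2,113,929,216 bytes = 1.97 GiB
```

That is close to 2 GiB for one weight matrix before training even starts; during training, gradients and optimizer slots (Adam, for example, keeps two extra values per parameter) add several more copies of the same size.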

Tip: with OOM errors, the most important thing is to pinpoint the line of code that triggers the OOM!

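You can do this by reading the error log carefully and by placing markers in the program (print a marker at each suspicious spot). A senior labmate of mine hit an OOM caused by a dynamic array allocation and eventually tracked it down by placing markers.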
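The print-marker approach can be sketched as a small context manager. This is an illustration under assumptions: `marker` is a hypothetical helper, not something from the original post or from Keras. It logs an `[enter]` line before a step and a `[leave]` line after it; if the process fails with OOM inside a step, the last `[enter]` without a matching `[leave]` points at the culprit. Flushing immediately keeps the marker from sitting in a buffer if the process dies.

```python
import sys
import time
from contextlib import contextmanager
from typing import Callable, Iterator, Optional


def _stderr_log(msg: str) -> None:
    # Write to stderr and flush right away so the marker is not lost in a buffer.
    print(msg, file=sys.stderr, flush=True)


@contextmanager
def marker(label: str, log: Optional[Callable[[str], None]] = None) -> Iterator[None]:
    """Log an [enter] line before a block and a [leave] line after it finishes.

    Hypothetical debugging helper. If the block raises (e.g. an OOM error), the
    [leave] line is never written, so the last unmatched [enter] marks the step.
    """
    emit = log if log is not None else _stderr_log
    emit(f"[enter] {label}")
    start = time.perf_counter()
    yield
    emit(f"[leave] {label} ({time.perf_counter() - start:.2f}s)")


if __name__ == "__main__":
    # Usage: wrap each suspicious step of the training script.
    with marker("build model"):
        pass  # e.g. model = build_siamese_network()  (hypothetical)
    with marker("train"):
        pass  # e.g. model.fit(...)
```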

After redesigning the network architecture, the model ran without problems on a machine with 16 GB of memory.

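In deep learning, the most direct remedies are to reduce batch_size or the number of units in the hidden layers. Keep in mind that batch_size only scales activations: the size of a weight matrix such as the [129024,4096] kernel above does not depend on it, which is why reducing batch_size alone did not help here.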
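A rough sketch of why the two knobs behave differently. It assumes float32 and counts only one Dense layer's parameters and its output activations (no gradients, optimizer state, or framework overhead); `dense_memory_bytes` is a hypothetical helper, and the 129024-wide input is the one from the error above.

```python
from typing import Tuple


def dense_memory_bytes(in_features: int, units: int, batch_size: int,
                       bytes_per_value: int = 4) -> Tuple[int, int]:
    """Rough (weights, activations) byte counts for one Dense layer.

    Weights = kernel + bias and do not depend on batch_size;
    activations = the layer's output for one batch and scale with batch_size.
    """
    weights = (in_features * units + units) * bytes_per_value
    activations = batch_size * units * bytes_per_value
    return weights, activations


MiB = 2**20
for units in (4096, 512):
    for batch_size in (64, 8):
        w, a = dense_memory_bytes(129024, units, batch_size)
        print(f"units={units:4d} batch={batch_size:2d}  "
              f"weights={w / MiB:7.1f} MiB  activations={a / MiB:5.2f} MiB")
```

In this sketch, going from batch 64 to batch 8 leaves the weight term unchanged, while cutting units from 4096 to 512 shrinks it exactly 8×.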
---------------------
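Author: huowa9077
Source: CSDN
Original: https://blog.csdn.net/huowa9077/article/details/81042553
Copyright notice: This is an original article by the blogger. Please include a link to the original post when reposting.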

Reposted from blog.csdn.net/qq_36387683/article/details/87880117
