
tensorflow custom loop training model, multi-gpu is slower than single-gpu

TensorFlow 2.6; CUDA 11.2; 4 GPUs (RTX 3070)

When I define the training model with Keras and train it with model.fit(), multiple GPUs accelerate training normally. However, when I use a custom training loop with the same batch_size as the single-GPU run (memory overflows if I set it larger under multi-GPU), training is slower on multiple GPUs than on a single GPU. I could not find a solution; can anyone help? Thanks.

I have googled a lot but found no satisfying solution. Does anyone have a good idea? Thanks very much!


There are many possibilities:

  1. Input dataset: multiple replicas work across device targets, each with its own processing speed, and problems can come from the distributor. There are synchronous and asynchronous execution modes, which you can set with config = tf.config.experimental.set_synchronous_execution( False )

  2. A custom training loop means exactly what its name says: you are responsible for the execution yourself, handling the process in your program rather than relying on model.fit() or an estimator function (see the sketch after this list).

  3. Input data and labels: as you can see from the example, you need to handle the data input yourself, even when using an estimator.
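
A minimal, self-contained sketch of such a custom loop with tf.distribute.MirroredStrategy (the toy data, Dense model, and batch sizes here are placeholders for illustration, not the asker's model): the important parts are distributing the dataset with experimental_distribute_dataset, running the per-replica step with strategy.run, and averaging the loss over the global batch size.

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
PER_REPLICA_BATCH = 64                                         # illustrative value
GLOBAL_BATCH_SIZE = PER_REPLICA_BATCH * strategy.num_replicas_in_sync

# Toy in-memory data; replace with your own input pipeline.
x = tf.random.normal((1024, 1))
y = 3.0 * x + 2.0
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(GLOBAL_BATCH_SIZE)
dist_dataset = strategy.experimental_distribute_dataset(dataset)

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
    optimizer = tf.keras.optimizers.SGD()
    # Reduction.NONE so the loss can be averaged over the global batch below.
    loss_object = tf.keras.losses.MeanSquaredError(
        reduction=tf.keras.losses.Reduction.NONE)

def train_step(inputs):
    features, labels = inputs
    with tf.GradientTape() as tape:
        predictions = model(features, training=True)
        loss = tf.nn.compute_average_loss(
            loss_object(labels, predictions), global_batch_size=GLOBAL_BATCH_SIZE)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss

@tf.function
def distributed_train_step(inputs):
    per_replica_losses = strategy.run(train_step, args=(inputs,))
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)

for epoch in range(10):
    for batch in dist_dataset:
        distributed_train_step(batch)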

Distributed training

Sample: A simple application using TensorFlow Keras model.fit(); the dataset needs to be handled with attention.

import tensorflow as tf

import numpy as np

import matplotlib.pyplot as plt


physical_devices = tf.config.experimental.list_physical_devices('CPU')
default_strategy = tf.distribute.get_strategy()
print( default_strategy )

tf.config.experimental.set_synchronous_execution( False )
print( tf.config.experimental.get_synchronous_execution() )

mirrored_strategy = tf.distribute.MirroredStrategy()

with mirrored_strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
    # Compile inside the strategy scope so the optimizer is created under it.
    model.compile(loss='mse', optimizer='sgd')

dataset = tf.data.Dataset.from_tensor_slices((tf.constant([1, 2, 3, 4], shape=(1, 4)), tf.constant([1], shape=(1, 1))))
history = model.fit( dataset, epochs=10000 )

input('...')

Output: With the sharding policy, you can expect the work to be split across devices at roughly 60/40 or 80/20, or to fall back to a single device.

2022-12-07 13:21:30.873778: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:776] AUTO sharding policy will apply DATA sharding policy 
as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_2"
op: "TensorSliceDataset"
input: "Placeholder/_0"
input: "Placeholder/_1"
attr {
  key: "Toutput_types"
  value {
    list {
      type: DT_INT32
      type: DT_INT32
    }
  }
}
attr {
  key: "_cardinality"
  value {
    i: 1
  }
}
attr {
  key: "is_files"
  value {
    b: false
  }
}
attr {
  key: "metadata"
  value {
    s: "\n\024TensorSliceDataset:0"
  }
}
attr {
  key: "output_shapes"
  value {
    list {
      shape {
        dim {
          size: 4
        }
      }
      shape {
        dim {
          size: 1
        }
      }
    }
  }
}
experimental_type {
  type_id: TFT_PRODUCT
  args {
    type_id: TFT_DATASET
    args {
      type_id: TFT_PRODUCT
      args {
        type_id: TFT_TENSOR
        args {
          type_id: TFT_INT32
        }
      }
      args {
        type_id: TFT_TENSOR
        args {
          type_id: TFT_INT32
        }
      }
    }
  }
  args {
    type_id: TFT_DATASET
    args {
      type_id: TFT_PRODUCT
      args {
        type_id: TFT_TENSOR
        args {
          type_id: TFT_INT32
        }
      }
      args {
        type_id: TFT_TENSOR
        args {
          type_id: TFT_INT32
        }
      }
    }
  }
}

Epoch 1/10000
1/1 [==============================] - 2s 2s/step - loss: 5.8720
Epoch 2/10000
1/1 [==============================] - 0s 8ms/step - loss: 4.1545
Epoch 3/10000
1/1 [==============================] - 0s 8ms/step - loss: 2.9623
Epoch 4/10000
1/1 [==============================] - 0s 8ms/step - loss: 2.1346
Epoch 5/10000
1/1 [==============================] - 0s 8ms/step - loss: 1.5598
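
The AUTO sharding warning at the top of this output comes from distributing an in-memory dataset that cannot be sharded by file. A small sketch of silencing it by setting the shard policy explicitly (assuming the same dataset variable as in the sample above):

options = tf.data.Options()
# Shard by data instead of by file, since this dataset is built from in-memory tensors.
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
dataset = dataset.with_options(options)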

steps: 156
x_value -1.3065915
y_value -0.23498479
v 123839330.0
steps: 157
x_value 1.1961238
y_value -0.055203147
v 123832690.0
steps: 158
x_value -0.04365039
y_value 0.4533396
v 123826070.0
steps: 159
x_value 0.0
y_value 0.15724461
v 123819460.0

Sample: An application that finds single coordinate values.

step: 000004 action: 6 coff_0: -00002 coff_1: -00001 coff_2: 000015 coff_3: 000223 coff_4: 000089 epsilon: False
step: 000005 action: 6 coff_0: 000000 coff_1: 000004 coff_2: 000020 coff_3: 000218 coff_4: 000085 epsilon: False
step: 000006 action: 1 coff_0: 000002 coff_1: 000008 coff_2: 000024 coff_3: 000214 coff_4: 000081 epsilon: False
step: 000007 action: 6 coff_0: 000004 coff_1: 000011 coff_2: 000027 coff_3: 000211 coff_4: 000077 epsilon: False
step: 000008 action: 1 coff_0: 000006 coff_1: 000013 coff_2: 000029 coff_3: 000209 coff_4: 000073 epsilon: False
step: 000009 action: 6 coff_0: 000008 coff_1: 000014 coff_2: 000030 coff_3: 000208 coff_4: 000069 epsilon: False
step: 000010 action: 1 coff_0: 000010 coff_1: 000014 coff_2: 000030 coff_3: 000208 coff_4: 000065 epsilon: False
step: 000011 action: 6 coff_0: 000012 coff_1: 000013 coff_2: 000029 coff_3: 000209 coff_4: 000061 epsilon: False



@Jirayu Kaewprateep

I'm not using Keras's model.fit() to train this model, and my data generator works well. Here is a piece of my code.

with mirrored_strategy.scope():
    model = tf.keras.Model(input_data, bbox_tensors)
    optimizer = tf.keras.optimizers.Adam()
    ckpts = tf.train.Checkpoint(optimizer=optimizer, model=model)

def training(inputs):
    """Per-replica training step."""
    image_data, labels = inputs
    with tf.GradientTape() as tape:
        predictions = model(image_data, training=True)
        tloss = compute_loss(predictions, labels)
    gradients = tape.gradient(tloss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    return tloss

@tf.function
def distributed_training(dataset_inputs):
    per_replica_losses = mirrored_strategy.run(training, args=(dataset_inputs,))
    return mirrored_strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)
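
The snippet above does not show how the dataset is fed to the replicas or how the loss is reduced over the global batch, and that is usually where the slowdown comes from: if each replica gets the same per-replica batch as the single-GPU run, every step does roughly the same compute plus an extra gradient all-reduce across the four cards. A hedged sketch of the missing pieces, reusing mirrored_strategy, model, optimizer and compute_loss from the code above (the dataset variable, PER_REPLICA_BATCH, and the reshaped training function are illustrative assumptions, not the asker's code):

PER_REPLICA_BATCH = 8   # whatever fits on one RTX 3070
GLOBAL_BATCH_SIZE = PER_REPLICA_BATCH * mirrored_strategy.num_replicas_in_sync

# Batch by the global size and let the strategy split each batch across the GPUs.
dataset = dataset.batch(GLOBAL_BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
dist_dataset = mirrored_strategy.experimental_distribute_dataset(dataset)

def training(inputs):
    """Per-replica step with the loss averaged over the global batch."""
    image_data, labels = inputs
    with tf.GradientTape() as tape:
        predictions = model(image_data, training=True)
        # compute_loss is assumed to return per-example losses here, so the
        # average is taken over the global batch rather than per replica.
        tloss = tf.nn.compute_average_loss(
            compute_loss(predictions, labels), global_batch_size=GLOBAL_BATCH_SIZE)
    gradients = tape.gradient(tloss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return tloss

for step, batch in enumerate(dist_dataset):
    loss = distributed_training(batch)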
