Fine-tuning BERT on 2 GPUs with TensorFlow MirroredStrategy

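I have already tried this with Horovod, but the packages built into TensorFlow also support distributed training across multiple GPUs.

This time I want to do it entirely within TensorFlow, without any additional framework.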

* Environment : Ubuntu 18.04, CUDA 10.0, cuDNN 7.6.5, 2 x Titan RTX, TensorFlow 1.14, Anaconda, Python 3.6.5

* Model : BERT Multilingual cased model

* Dataset : SQuAD 2.0

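The official documentation and the Colab notebook explain things in detail, but not in enough detail to apply directly to my own code, so getting this implemented took a little time.

The distribution method I will use this time is MirroredStrategy!

It supports synchronous distributed training on multiple GPUs in a single machine, so it fits my case exactly. A detailed explanation is in the official documentation.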

Source : TensorFlow official documentation (https://www.tensorflow.org/guide/distributed_training)
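To make "synchronous" concrete, here is a minimal pure-Python sketch (no TensorFlow) of what one mirrored training step does conceptually: each replica computes a gradient on its own shard of the batch, the gradients are averaged by an all-reduce, and every replica applies the same update. The toy model and all function names are hypothetical and only illustrate the idea; this is not how TensorFlow implements it internally.

```python
# Toy illustration of synchronous data-parallel training (pure Python).
# Model: y = w * x, loss = mean((w * x - y) ** 2). All names are hypothetical.

def gradient(w, batch):
    """d/dw of the mean squared error over one batch of (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every replica receives the same mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def mirrored_step(replica_weights, global_batch, lr):
    """One synchronous step: shard the batch, reduce gradients, update every copy."""
    n = len(replica_weights)
    shard_size = len(global_batch) // n
    shards = [global_batch[i * shard_size:(i + 1) * shard_size] for i in range(n)]
    local_grads = [gradient(w, shard) for w, shard in zip(replica_weights, shards)]
    synced_grads = all_reduce_mean(local_grads)
    return [w - lr * g for w, g in zip(replica_weights, synced_grads)]

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
weights = [0.0, 0.0]  # two replicas start from the same value
for _ in range(50):
    weights = mirrored_step(weights, data, lr=0.05)
```

Because every replica receives the same averaged gradient, the two copies of `w` never diverge, and with equal shard sizes the averaged gradient matches the gradient of the full batch on a single device.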

Using it is very, very simple!

mirrored_strategy = tf.distribute.MirroredStrategy()

I modified the code as follows.

1. First, in the main function, define the strategy and config above the existing run_config

strategy = tf.contrib.distribute.MirroredStrategy(
    num_gpus=FLAGS.num_gpu_cores,
    cross_device_ops=tf.contrib.distribute.AllReduceCrossDeviceOps(
        'nccl', num_packs=FLAGS.num_gpu_cores))

log_every_n_steps = 8

multi_run_config = tf.compat.v1.estimator.RunConfig(
    train_distribute=strategy,
    eval_distribute=strategy,
    log_step_count_steps=log_every_n_steps,
    model_dir=FLAGS.output_dir,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps
  )

2. Define the additional FLAGS that are passed in as parameters

# Default is True because we are going to use GPUs
flags.DEFINE_bool("use_gpu", True, "Whether to use GPU.")

# Default is 2 because we are going to use two GPUs
flags.DEFINE_integer(
    "num_gpu_cores", 2,
    "Only used if `use_gpu` is True. Total number of GPU cores to use."
)

3. Pass the FLAGS defined in step 2 as arguments to model_fn

# Scope in which the strategy applies
with strategy.scope():
  model_fn = model_fn_builder(
      bert_config=bert_config,
      init_checkpoint=FLAGS.init_checkpoint,
      learning_rate=FLAGS.learning_rate,
      num_train_steps=num_train_steps,
      num_warmup_steps=num_warmup_steps,
      use_tpu=FLAGS.use_tpu,
      use_one_hot_embeddings=FLAGS.use_tpu,
      use_gpu=FLAGS.use_gpu,
      num_gpu_cores=FLAGS.num_gpu_cores)

4. In the main function, below model_fn = model_fn_builder, additionally define an estimator for multi-GPU distributed training as follows

is_multi_gpu = FLAGS.use_gpu and int(FLAGS.num_gpu_cores) >= 2

if is_multi_gpu:
  estimator = tf.compat.v1.estimator.Estimator(
      model_fn=model_fn,
      params={},
      config=multi_run_config)
else:
  estimator = tf.contrib.tpu.TPUEstimator(
      use_tpu=FLAGS.use_tpu,
      model_fn=model_fn,
      config=run_config,
      train_batch_size=FLAGS.train_batch_size,
      predict_batch_size=FLAGS.predict_batch_size)

5. Modify model_fn inside model_fn_builder

# Inside the `if mode == tf.estimator.ModeKeys.TRAIN:` branch
is_multi_gpu = use_gpu and int(num_gpu_cores) >= 2
if is_multi_gpu:
  output_spec = tf.compat.v1.estimator.EstimatorSpec(
    mode=mode,
    loss=total_loss,
    train_op=train_op)
else:
  output_spec = tf.contrib.tpu.TPUEstimatorSpec(
    mode=mode,
    loss=total_loss,
    train_op=train_op,
    scaffold_fn=scaffold_fn)

elif mode == tf.estimator.ModeKeys.PREDICT:
  predictions = {
      "unique_ids": unique_ids,
      "start_logits": start_logits,
      "end_logits": end_logits,
  }
  is_multi_gpu = use_gpu and int(num_gpu_cores) >= 2
  if is_multi_gpu:
    # Plain Estimator path: return an EstimatorSpec, as in the TRAIN branch
    output_spec = tf.compat.v1.estimator.EstimatorSpec(
        mode=mode, predictions=predictions)
  else:
    output_spec = tf.contrib.tpu.TPUEstimatorSpec(
        mode=mode, predictions=predictions, scaffold_fn=scaffold_fn)

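I thought that would be enough..

KeyError: 'batch_size'

TPUEstimator puts batch_size into params automatically, but the plain Estimator (created above with params={}) does not, so the batch size has to be passed to input_fn explicitly.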

  # def input_fn_builder: add a batch_size parameter
  def input_fn_builder(input_file, seq_length, is_training, drop_remainder, batch_size):

  # comment out the batch_size lookup
  def input_fn(params):
    """The actual input function."""
    # batch_size = params["batch_size"]

  # in main: modify train_input_fn / predict_input_fn
  train_input_fn = input_fn_builder(
      input_file=train_writer.filename,
      seq_length=FLAGS.max_seq_length,
      is_training=True,
      drop_remainder=True,
      batch_size=FLAGS.train_batch_size)
      
  predict_input_fn = input_fn_builder(
        input_file=eval_writer.filename,
        seq_length=FLAGS.max_seq_length,
        is_training=False,
        drop_remainder=False,
        batch_size=FLAGS.predict_batch_size
        )
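One thing worth keeping in mind here: as I read the TensorFlow distributed training guide, when an Estimator runs under a distribution strategy, the dataset returned by input_fn provides per-replica batches, so the global batch per step is the per-replica batch size times the number of replicas. The sketch below only does that arithmetic; the helper names and the example count of 120000 are hypothetical and not taken from the BERT code.

```python
def global_batch_size(per_replica_batch_size, num_replicas):
    """Examples consumed per training step across all replicas."""
    return per_replica_batch_size * num_replicas

def steps_per_epoch(num_examples, per_replica_batch_size, num_replicas):
    """Full steps needed to see every example once (remainder dropped)."""
    return num_examples // global_batch_size(per_replica_batch_size, num_replicas)

# Hypothetical example count, used only to show the effect of adding a replica.
NUM_EXAMPLES = 120000

single = steps_per_epoch(NUM_EXAMPLES, per_replica_batch_size=12, num_replicas=1)
dual = steps_per_epoch(NUM_EXAMPLES, per_replica_batch_size=12, num_replicas=2)
print("steps per epoch, 1 GPU:", single)
print("steps per epoch, 2 GPUs:", dual)
```

Under that reading, keeping batch_size=12 in input_fn while going from one GPU to two doubles the number of examples per step, which may matter when comparing num_train_steps and learning rate between the single-GPU and dual-GPU runs.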

But then yet another error came up..

WARNING:tensorflow:Entity <bound method Dense.call of <tensorflow.python.layers.core.Dense object at 0x7fd6745def98>> could not be transformed and will be executed as-is. Please report this to the AutgoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: converting <bound method Dense.call of <tensorflow.python.layers.core.Dense object at 0x7fd6745def98>>: AssertionError: Bad argument number for Name: 3, expecting 4

pip install -U gast==0.2.2
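As far as I understand it, the AutoGraph warning comes from the gast 0.3.x series changing its AST node signatures, which TensorFlow 1.14 does not expect. A small sketch of a version check is below; the function names are hypothetical, and treating every 0.2.x release like the 0.2.2 pin is my assumption. The installed version string can be read from `pip show gast`.

```python
def parse_version(text):
    """Turn '0.2.2' into (0, 2, 2); assumes purely numeric dotted versions."""
    return tuple(int(part) for part in text.split("."))

def gast_ok_for_tf114(version_text):
    """True for versions below 0.3 (assumption: 0.2.x behaves like 0.2.2)."""
    return parse_version(version_text) < (0, 3)

for candidate in ("0.2.2", "0.3.3"):
    status = "ok" if gast_ok_for_tf114(candidate) else "downgrade to 0.2.2"
    print(candidate, "->", status)
```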

Fortunately, after downgrading gast the error above went away. But then a new error..

ValueError: You must specify an aggregation method to update a MirroredVariable in Replica Context. You can do so by passing an explicit value for argument `aggregation` to tf.Variable(..).e.g. `tf.Variable(..., aggregation=tf.VariableAggregation.SUM)``tf.VariableAggregation` lists the possible aggregation methods.This is required because MirroredVariable should always be kept in sync. When updating them or assigning to them in a replica context, we automatically try to aggregate the values before updating the variable. For this aggregation, we need to know the aggregation method. Another alternative is to not try to update such MirroredVariable in replica context, but in cross replica context. You can enter cross replica context by calling `tf.distribute.get_replica_context().merge_call(merge_fn, ..)`.Inside `merge_fn`, you can then update the MirroredVariable using `tf.distribute.StrategyExtended.update()`.

Reading through the official documentation piece by piece, I found that this error was raised during the optimization step.

https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/distribute/StrategyExtended
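The idea behind the error message can be shown with a toy class: a mirrored variable has one copy per replica, and if each replica proposes a different update, the copies only stay identical when the updates are first combined with an explicit rule such as SUM or MEAN. The class below is a hypothetical pure-Python stand-in written for this post, not TensorFlow's MirroredVariable.

```python
class ToyMirroredVariable:
    """Hypothetical stand-in for a variable mirrored on several replicas."""

    def __init__(self, value, num_replicas, aggregation=None):
        self.copies = [value] * num_replicas
        self.aggregation = aggregation  # None, "sum" or "mean"

    def assign_add_from_replicas(self, per_replica_deltas):
        """Apply one update per replica while keeping every copy identical."""
        if len(per_replica_deltas) != len(self.copies):
            raise ValueError("expected one delta per replica")
        if self.aggregation is None:
            # Mirrors the spirit of the error above: no rule, no update.
            raise ValueError("an aggregation method is required to keep copies in sync")
        total = sum(per_replica_deltas)
        if self.aggregation == "sum":
            delta = total
        elif self.aggregation == "mean":
            delta = total / len(per_replica_deltas)
        else:
            raise ValueError("unknown aggregation: %r" % self.aggregation)
        self.copies = [copy + delta for copy in self.copies]
        return self.copies

averaged = ToyMirroredVariable(0.0, num_replicas=2, aggregation="mean")
print(averaged.assign_add_from_replicas([0.5, 1.5]))
```

In the real code, the optimizer updates variables from inside the replica context, which is why the optimization code is the place that has to change.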

So while fixing the related code I came across a good implementation and replaced my optimization code with that source.

https://github.com/HaoyuHu/bert-multi-gpu/blob/master/custom_optimization.py

However, since I want to run on multiple GPUs without fp16, I removed the fp16-related arguments from the optimization code.

I also changed the arguments at the point where model_fn calls the optimizer.

train_op = optimization_tf_multi.create_optimizer(
          total_loss, learning_rate, num_train_steps, num_warmup_steps)
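For reference, the three schedule arguments passed to create_optimizer (learning_rate, num_train_steps, num_warmup_steps) drive a warmup-then-decay learning rate in the original BERT optimization.py: linear warmup from zero, then linear decay to zero. The sketch below reproduces that schedule in plain Python; I am assuming the multi-GPU optimizer keeps the same schedule, and the function name is hypothetical.

```python
def bert_style_learning_rate(step, init_lr, num_train_steps, num_warmup_steps):
    """Linear warmup for num_warmup_steps, then linear decay to zero."""
    if num_warmup_steps and step < num_warmup_steps:
        # Warmup phase: ramp up linearly from 0 toward init_lr.
        return init_lr * step / num_warmup_steps
    # Decay phase: linear decay that reaches 0 at num_train_steps and stays there.
    clipped = min(step, num_train_steps)
    return init_lr * (1.0 - clipped / num_train_steps)

# Hypothetical step counts, chosen only to print a few points on the curve.
for step in (0, 50, 100, 500, 1000):
    print(step, bert_style_learning_rate(step, 3e-5, 1000, 100))
```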

A dramatic success!

Both GPUs are being put to good use, haha

python squad_files/squad2.0/evaluate-v2.0.py squad_files/squad2.0/dev-v2.0.json output/tf_multi12/squad2.0/predictions.json

{
"exact": 75.72643813694938,
"f1": 78.83357049287763,
"total": 11873,
"HasAns_exact": 72.38529014844805,
"HasAns_f1": 78.60846532758717,
"HasAns_total": 5928,
"NoAns_exact": 79.05803195962994,
"NoAns_f1": 79.05803195962994,
"NoAns_total": 5945
}

Next I should measure the runtime again and compare the performance against both single GPU and dual GPU on the Horovod framework!!!!

- SQuAD 2.0

Parameters (all runs): train_batch_size=12, learning_rate=3e-5, num_train_epochs=2.0, max_seq_length=384, doc_stride=128, do_lower_case=False

| Setup | Total runtime (sec) | EM | F1 |
|---|---|---|---|
| Single GPU | 9621.24 | 75.50 | 78.60 |
| Dual GPU on Horovod | 6336.52 | 74.39 | 77.17 |
| Dual GPU on Horovod (sync with compression) | 7233.83 | 74.38 | 77.20 |
| Dual GPU with TensorFlow MirroredStrategy | 6478.45 | 75.73 | 78.83 |

- KorQuAD 1.0

Parameters (all runs): train_batch_size=12, learning_rate=3e-5, num_train_epochs=2.0, max_seq_length=384, doc_stride=128, do_lower_case=False

| Setup | Total runtime (sec) | EM | F1 |
|---|---|---|---|
| Single GPU | 4591.51 | 70.45 | 89.91 |
| Dual GPU on Horovod | 3001.26 | 69.21 | 89.03 |
| Dual GPU on Horovod (sync with compression) | 3383.50 | 68.27 | 88.59 |
| Dual GPU with TensorFlow MirroredStrategy (buffer size 100) | 3312.05 | 70.77 | 90.41 |
| Dual GPU with TensorFlow MirroredStrategy (buffer size 500) | 3075.83 | 70.49 | 90.10 |
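To read the runtimes as speedups, a small script like the following divides the single-GPU runtime by each dual-GPU runtime and also reports parallel efficiency (speedup divided by the number of GPUs). The numbers are copied from the tables above; the dictionary keys and the function name are made up for this sketch.

```python
RUNTIMES_SEC = {
    "SQuAD 2.0": {
        "single_gpu": 9621.24,
        "horovod": 6336.52,
        "horovod_compression": 7233.83,
        "mirrored_strategy": 6478.45,
    },
    "KorQuAD 1.0": {
        "single_gpu": 4591.51,
        "horovod": 3001.26,
        "horovod_compression": 3383.50,
        "mirrored_strategy_buffer_100": 3312.05,
        "mirrored_strategy_buffer_500": 3075.83,
    },
}

def speedups(runtimes, baseline="single_gpu", num_gpus=2):
    """Map each non-baseline setup to (speedup, parallel efficiency)."""
    base = runtimes[baseline]
    result = {}
    for name, seconds in runtimes.items():
        if name == baseline:
            continue
        speedup = base / seconds
        result[name] = (round(speedup, 2), round(speedup / num_gpus, 2))
    return result

for dataset, runtimes in RUNTIMES_SEC.items():
    print(dataset)
    for name, (speedup, efficiency) in speedups(runtimes).items():
        print("  %s: %.2fx speedup, %.0f%% efficiency" % (name, speedup, efficiency * 100))
```

On these measurements, both dual-GPU approaches without compression land at roughly 1.5x the single-GPU speed, so neither comes close to a full 2x.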

Reference

https://www.tensorflow.org/guide/distributed_training

https://colab.research.google.com/github/jiyongjung0/tf-docs/blob/distribute_strategy/site/ko/beta/guide/distribute_strategy.ipynb#scrollTo=n6eSfLN5RGY8

https://github.com/HaoyuHu/bert-multi-gpu

