PyTorch Tutorial 13.3: Automatic Parallelism


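Deep learning frameworks (e.g., MXNet and PyTorch) automatically construct computational graphs at the backend. Using a computational graph, the system is aware of all the dependencies and can selectively execute multiple non-interdependent tasks in parallel to improve speed. For instance, Fig. 13.2.2 in Section 13.2 initializes two variables independently. Consequently the system can choose to execute them in parallel.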
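To make the idea of "non-interdependent tasks" concrete, here is a minimal sketch of a toy scheduler that groups the nodes of a dependency graph into waves that could run side by side. It is not how PyTorch or MXNet schedule work internally; the function name `parallel_waves` and the three-node graph are assumptions made up for illustration, and only the Python standard library is used.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def parallel_waves(dependencies):
    """Group tasks into waves; tasks within one wave do not depend on each other."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorter.get_ready()   # every task whose inputs are complete
        waves.append(sorted(ready))
        sorter.done(*ready)          # mark the whole wave as finished
    return waves

# Hypothetical graph: z depends on x and y, which are initialized independently.
graph = {"x": set(), "y": set(), "z": {"x", "y"}}
waves = parallel_waves(graph)
print(waves)  # [['x', 'y'], ['z']]
```

The first wave contains both initializations, which is exactly the freedom a graph-aware backend can exploit; `z` has to wait for both.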

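Typically, a single operator will use all the computational resources on all CPUs or on a single GPU. For example, the dot operator will use all cores (and threads) on all CPUs, even if there are multiple CPU processors on a single machine. The same applies to a single GPU. Hence parallelization is not very useful on single-device computers. With multiple devices it matters more. While parallelization is typically most relevant between multiple GPUs, adding the local CPU will increase performance slightly. For example, see Hadjis et al. (2016), which focuses on training computer vision models combining a GPU and a CPU. With the convenience of an automatically parallelizing framework, we can accomplish the same goal in a few lines of Python code. More broadly, our discussion of automatic parallel computation focuses on parallel computation using both CPUs and GPUs, as well as the parallelization of computation and communication.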

Note that we need at least two GPUs to run the experiments in this section.

# PyTorch
import torch
from d2l import torch as d2l

# MXNet
from mxnet import np, npx
from d2l import mxnet as d2l

npx.set_np()

13.3.1. Parallel Computation on GPUs

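Let's start by defining a reference workload to test: the run function below performs 50 matrix-matrix multiplications on the device of our choice, using data allocated into two variables: x_gpu1 and x_gpu2.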

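# PyTorch: reference workload on two GPUs
devices = d2l.try_all_gpus()

def run(x):
    # 50 matrix-matrix multiplications on the device that holds x
    return [x.mm(x) for _ in range(50)]

x_gpu1 = torch.rand(size=(4000, 4000), device=devices[0])
x_gpu2 = torch.rand(size=(4000, 4000), device=devices[1])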

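Now we apply the function to the data. To ensure that caching does not play a role in the results, we warm up the devices by performing a single pass on each of them prior to measuring. torch.cuda.synchronize() waits for all kernels in all streams on a CUDA device to complete. It takes a device argument, the device that we need to synchronize. If the device argument is None (the default), it uses the current device, as given by current_device().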

run(x_gpu1)
run(x_gpu2)  # Warm-up all devices
torch.cuda.synchronize(devices[0])
torch.cuda.synchronize(devices[1])

with d2l.Benchmark('GPU1 time'):
    run(x_gpu1)
    torch.cuda.synchronize(devices[0])

with d2l.Benchmark('GPU2 time'):
    run(x_gpu2)
    torch.cuda.synchronize(devices[1])

GPU1 time: 0.4967 sec
GPU2 time: 0.5151 sec
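The timings above only make sense because the synchronize call sits inside the timed block: kernels are launched asynchronously, so a timer that stops before they finish would measure launch overhead only. As a rough sketch, assuming d2l.Benchmark is essentially a wall-clock context manager, a stand-in could look like the following. The class name `WallClock` is hypothetical, and the `time.sleep` call is a placeholder for real GPU work followed by a synchronize.

```python
import time

class WallClock:
    """Hypothetical stand-in for d2l.Benchmark: times the body of a with-block."""

    def __init__(self, description="Done"):
        self.description = description
        self.elapsed = None

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, *exc_info):
        # The clock stops here, so any asynchronous work must already
        # have been waited for inside the with-block.
        self.elapsed = time.perf_counter() - self.start
        print(f"{self.description}: {self.elapsed:.4f} sec")
        return False  # do not swallow exceptions

with WallClock("placeholder workload") as timer:
    time.sleep(0.05)  # placeholder; a GPU benchmark would run kernels and synchronize
```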

If we remove the synchronize statement between the two tasks, the system is free to parallelize computation on both devices automatically.

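with d2l.Benchmark('GPU1 & GPU2'):
    run(x_gpu1)
    run(x_gpu2)
    # synchronize() without an argument waits only for the current device,
    # so wait for each GPU explicitly before the timer stops
    torch.cuda.synchronize(devices[0])
    torch.cuda.synchronize(devices[1])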

GPU1 & GPU2: 0.5000 sec

# MXNet: the same reference workload
devices = d2l.try_all_gpus()

def run(x):
    return [x.dot(x) for _ in range(50)]

x_gpu1 = np.random.uniform(size=(4000, 4000), ctx=devices[0])
x_gpu2 = np.random.uniform(size=(4000, 4000), ctx=devices[1])

The MXNet version follows the same pattern. To ensure that caching does not play a role in the results, we warm up both devices with a single pass on each of them prior to measuring, and npx.waitall() blocks until all pending computation has finished.

run(x_gpu1)  # Warm-up both devices
run(x_gpu2)
npx.waitall()

with d2l.Benchmark('GPU1 time'):
    run(x_gpu1)
    npx.waitall()

with d2l.Benchmark('GPU2 time'):
    run(x_gpu2)
    npx.waitall()

GPU1 time: 0.5233 sec
GPU2 time: 0.5158 sec

If we remove the waitall call between the two tasks, MXNet is likewise free to parallelize computation on both devices automatically.

with d2l.Benchmark('GPU1 & GPU2'):
    run(x_gpu1)
    run(x_gpu2)
    npx.waitall()

GPU1 & GPU2: 0.5214 sec

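In the above case the total execution time is less than the sum of its parts, since the deep learning framework automatically schedules computation on both GPU devices without the need for sophisticated code on behalf of the user.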
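The effect can be imitated without any GPU. The sketch below is a toy model under loud assumptions: two threads stand in for the two devices, `fake_kernel` is a made-up placeholder that merely sleeps, and nothing here calls PyTorch or MXNet. It only shows why waiting once at the end, rather than after each task, lets independent work overlap.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_kernel(seconds):
    """Placeholder for work on one device; sleeping lets other threads run."""
    time.sleep(seconds)
    return seconds

def timed(fn):
    """Return the wall-clock time taken by fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def one_after_another():
    fake_kernel(0.2)  # "GPU1": we wait for it before starting the next task
    fake_kernel(0.2)  # "GPU2"

def both_at_once():
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(fake_kernel, 0.2) for _ in range(2)]
        for future in futures:
            future.result()  # a single wait at the end, like one final synchronize

sequential = timed(one_after_another)
overlapped = timed(both_at_once)
print(f"sequential: {sequential:.2f} sec, overlapped: {overlapped:.2f} sec")
```

On an otherwise idle machine the overlapped run should take roughly half as long as the sequential one, mirroring the measurements above.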

13.3.2. Parallel Computation and Communication

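In many cases we need to move data between different devices, say between the CPU and GPU, or between different GPUs. For instance, this occurs when we want to perform distributed optimization, where we need to aggregate the gradients over multiple accelerator cards. Let's simulate this by computing on the GPU and then copying the results back to the CPU.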

# PyTorch: compute on GPU1, then copy the results back to the CPU
def copy_to_cpu(x, non_blocking=False):
    return [y.to('cpu', non_blocking=non_blocking) for y in x]

with d2l.Benchmark('Run on GPU1'):
    y = run(x_gpu1)
    torch.cuda.synchronize()

with d2l.Benchmark('Copy to CPU'):
    y_cpu = copy_to_cpu(y)
    torch.cuda.synchronize()

Run on GPU1: 0.5019 sec
Copy to CPU: 2.7168 sec

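This is somewhat inefficient. Note that we could already start copying parts of y to the CPU while the remainder of the list is still being computed. This situation occurs, for example, when we compute the (backprop) gradient on a minibatch. The gradients of some of the parameters will be available earlier than those of others. Hence it works to our advantage to start using PCI-Express bus bandwidth while the GPU is still running. In PyTorch, several functions such as to() and copy_() admit an explicit non_blocking argument, which lets the caller bypass synchronization when it is unnecessary. Setting non_blocking=True allows us to simulate this scenario.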

with d2l.Benchmark('Run on GPU1 and copy to CPU'):
    y = run(x_gpu1)
    y_cpu = copy_to_cpu(y, non_blocking=True)
    torch.cuda.synchronize()  # wait for the kernels and the pending copies

Run on GPU1 and copy to CPU: 2.4682 sec
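The pattern behind this measurement, handing each result to the copy stage as soon as it exists instead of waiting for the whole list, can be sketched with a queue and a background thread. This is an illustrative toy, not PyTorch's mechanism: `compute_and_copy` and the lambda workloads are hypothetical names introduced here, and the "copy" just wraps a value in a list.

```python
import queue
import threading

def compute_and_copy(inputs, compute, copy):
    """Compute each item while a background thread copies the finished ones."""
    pending = queue.Queue()
    copied = []

    def copier():
        while True:
            item = pending.get()
            if item is None:          # sentinel: no more results will arrive
                break
            copied.append(copy(item))

    worker = threading.Thread(target=copier)
    worker.start()
    for x in inputs:
        pending.put(compute(x))       # hand over each result as soon as it exists
    pending.put(None)
    worker.join()                     # plays the role of the final synchronize
    return copied

result = compute_and_copy(range(5), compute=lambda x: x * x, copy=lambda y: [y])
print(result)  # [[0], [1], [4], [9], [16]]
```

Because the queue is first-in, first-out and there is a single copier, results arrive in the order they were computed, just as y[i] can only be copied after it has been produced.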

						

# MXNet: compute on GPU1, then copy the results back to the CPU
def copy_to_cpu(x):
    return [y.copyto(npx.cpu()) for y in x]

with d2l.Benchmark('Run on GPU1'):
    y = run(x_gpu1)
    npx.waitall()

with d2l.Benchmark('Copy to CPU'):
    y_cpu = copy_to_cpu(y)
    npx.waitall()

Run on GPU1: 0.5796 sec
Copy to CPU: 3.0989 sec

This is somewhat inefficient. Note that we could already start copying parts of y to the CPU while the remainder of the list is still being computed. This situation occurs, e.g., when we compute the gradient on a minibatch. The gradients of some of the parameters will be available earlier than those of others. Hence it works to our advantage to start using PCI-Express bus bandwidth while the GPU is still running. In MXNet, removing waitall between both parts allows us to simulate this scenario.

with d2l.Benchmark('Run on GPU1 and copy to CPU'):
    y = run(x_gpu1)
    y_cpu = copy_to_cpu(y)
    npx.waitall()

Run on GPU1 and copy to CPU: 3.3488 sec

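The total time required for both operations is (as expected) less than the sum of their parts. Note that this task is different from parallel computation as it uses a different resource: the bus between the CPU and GPUs. In fact, we could compute on both devices and communicate, all at the same time. As noted above, there is a dependency between computation and communication: y[i] must be computed before it can be copied to the CPU. Fortunately, the system can copy y[i-1] while computing y[i] to reduce the total running time.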
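A back-of-the-envelope model shows how much such pipelining can save. The sketch below assumes an idealized two-stage pipeline in which the copy of item i can start once item i has been computed and the previous copy has finished. The function names and the stage times (1 and 2 time units per item) are made-up values for illustration, not measurements from this section.

```python
def serial_time(compute, copy):
    """Total time when all copies start only after all computation is done."""
    return sum(compute) + sum(copy)

def pipelined_time(compute, copy):
    """Total time when copying item i overlaps with computing later items."""
    compute_done = 0.0
    copy_done = 0.0
    for c, t in zip(compute, copy):
        compute_done += c                          # item i becomes available
        copy_done = max(copy_done, compute_done) + t  # bus must be free, item ready
    return copy_done

compute = [1.0] * 4   # hypothetical per-item compute times
copy = [2.0] * 4      # hypothetical per-item copy times
print(serial_time(compute, copy))     # 12.0
print(pipelined_time(compute, copy))  # 9.0
```

In this toy setting the slower stage (the copy) dominates: only the first compute step is not hidden behind communication.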

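We conclude with an illustration of the computational graph and its dependencies for a simple two-layer MLP when training on a CPU and two GPUs, as depicted in Fig. 13.3.1. Scheduling the resulting parallel program by hand would be quite painful. This is where it is advantageous to have a graph-based computing backend for optimization.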