Parallel Computing
The ADCME backend, TensorFlow, treats each operator as the smallest computation unit. Users are allowed to manually assign each operator to a specific device (GPU, CPU, or TPU). This is usually done with the @cpu device_id expr or @gpu device_id expr syntax, where device_id is the index of devices you want to place all operators and variables in expr. For example, if we want to create a variable a and compute $\sin(a)$ on GPU:0 we can write
@cpu 0 begin
    global a = Variable(1.0)
    global b = sin(a)
endIf the device_id is missing, 0 is treated as default. 
@cpu begin
    global a = Variable(1.0)
    global b = sin(a)
endCustom Device Placement Functions
The above placement function is useful and simple for placing operators on certain GPU devices without changing original codes. However, sometimes we want to place certain operators on certain devices. This can be done by implementing a custom assign_to_device function. As an example, we want to place all Variables on CPU:0 while placing all other operators on GPU:0, the code has the following form
PS_OPS = ["Variable", "VariableV2", "AutoReloadVariable"]
function assign_to_device(device, ps_device="/device:CPU:0")
    function _assign(op)
        node_def = pybuiltin("isinstance")(op, tf.NodeDef) ? op : op.node_def
        if node_def.op in PS_OPS
            return ps_device
        else
            return device
        end
    end
    return _assign
endThen we can write something like
@pywith tf.device(assign_to_device("/device:GPU:0")) begin
    global a = Variable(1.0)
    global b = sin(a)
endWe can check the location of a and b by inspecting their device attributes
julia> a.device
"/device:CPU:0"
julia> b.device
"/device:GPU:0"Collocate Gradient Operators
When we call gradients, TensorFlow actually creates a set of new operators, one for each operator in the forward computation. By default, those operators are placed on the default device (GPU:0 if GPU is available; otherwise it's CPU:0). Sometimes we want to place the operators created by gradients on the same devices as the corresponding forward computation operators. For example, if the operator b (sin) in the last example is on GPU:0, we hope the corresponding gradient computation (cos) is also on GPU:0. This can be done by specifying colocate [colocate] keyword arguments in gradients:
@pywith tf.device(assign_to_device("/device:GPU:0")) begin
    global a = Variable(1.0)
    global b = sin(a)
end
@pywith tf.device("/CPU:0") begin
    global c = cos(b)
end
g = gradients(c, a, colocate=true)In the following figure, we show the effects of colocate of the above codes. The test code snippet is
g = gradients(c, a, colocate=true)
sess = Session(); init(sess)
run_profile(sess, g+c)
save_profile("true.json")
g = gradients(c, a, colocate=false)
sess = Session(); init(sess)
run_profile(sess, g+c)
save_profile("false.json")
If you use bn (batch normalization) on multi-GPUs, you must be careful to update the parameters in batch normalization on CPUs. This can be done by explicitly specify 
@pywith tf.device("/cpu:0") begin
global update_ops = get_collection(tf.GraphKeys.UPDATE_OPS)
endand bind update_ops to an active operator (or explictly execute it in run(sess,...)).
- colocateUnfortunately, in the TensorFlow APIs, "collocate" is spelt as "colocate".