Multi-threading is a very important technique for accelerating simulations in ADCME. In this section, we look into the multi-threading model of ADCME's backend, TensorFlow. Let us start with some basic concepts related to CPUs.

We often hear about processes and threads when talking about multi-threading. In a word, a process is a program in execution. A process may spawn multiple threads. A thread can be viewed as a scheduler that executes each line of code in order; it tells the CPU to perform an instruction.

The biggest difference is that different processes do not share memory with each other, while different threads within the same process share the same memory space.

Now let's consider how ADCME works: ADCME always runs as a single process on the CPU. But to gain maximum efficiency, ADCME creates multiple threads to leverage any parallelism available in the hardware resources and the computational model.

## Inter and Intra Parallelism

There are two types of parallelism in ADCME execution: inter and intra.

Consider a computational graph: there may be multiple independent operators, and therefore we can execute them in parallel. This type of parallelism is called inter-parallelism. For example,

```julia
using ADCME

a1 = constant(rand(10))
a2 = constant(rand(10))
a3 = a1 * a2
a4 = a1 + a2
a5 = a3 + a4
```

In the above code, `a3 = a1 * a2` and `a4 = a1 + a2` are independent and can be executed in parallel.

Another type of parallelism is intra-parallelism: the computation within each operator can itself be parallelized. For example, in the code above, we can compute the first 5 entries and the last 5 entries of `a4 = a1 + a2` in parallel.

These types of parallelism can be achieved using multi-threading. In the next section, we explain how this is implemented in TensorFlow.

The backend of ADCME, TensorFlow, uses two thread pools for multi-threading: one for inter-parallelism and the other for intra-parallelism. Their sizes can be set by the user.

The following figure is an illustration of the two thread pools of ADCME.

## How to Use the Intra Thread Pool

In practice, when we implement custom operators, we may want to use the intra thread pool. The following example shows how to use it.

```cpp
#include <thread>
#include <chrono>
#include <condition_variable>
#include <atomic>
#include <mutex>

std::atomic_int cnt{0};
std::condition_variable cv;
std::mutex mu;

// A job scheduled on the intra-op thread pool; the last job to finish
// notifies the waiting kernel.
void fun(int i){
  cnt++;
  if (cnt==7) {
    // Hold the lock while notifying so the wakeup cannot be lost
    std::lock_guard<std::mutex> lck(mu);
    cv.notify_one();
  }
}

// Inside the Compute method of a custom operator, the intra-op thread
// pool is available from the kernel context.
void forward(OpKernelContext* context){
  auto thread_pool = context->device()->tensorflow_cpu_worker_threads()->workers;
  printf("Maximum Parallelism = %d\n", port::MaxParallelism());
  for (int i = 0; i < 7; i++)
    thread_pool->Schedule([i](){ fun(i); });
  // Block until all 7 jobs have run
  {
    std::unique_lock<std::mutex> lck(mu);
    cv.wait(lck, [](){ return cnt==7; });
  }
  printf("Op finished\n");
}
```

Basically, we can launch jobs asynchronously using the thread pool. In return, we are responsible for synchronization; here we have used a condition variable for that purpose.

Typically our CPU operators are synchronous and do not need the thread pools. But it does no harm to have an intra thread pool available.

## Runtime Optimizations

If you are using Intel CPUs, there are some runtime optimization configurations available. See this link for details. Here, we show the effects of some of these optimizations.

We already understand `intra_op_parallelism_threads` and `inter_op_parallelism_threads`; now let us consider some other options. We consider computing the $\sin$ function using the following truncated Taylor series:

$$\sin x \approx x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!}$$

The implementation can be found here.

### Configure OpenMP

To set the number of OMP threads, we can configure the `OMP_NUM_THREADS` environment variable. One caveat is that the variable must be set before loading ADCME. For example,

```julia
ENV["OMP_NUM_THREADS"] = 5
using ADCME
```

Running omp_thread.jl, we get the following output:

```
There are 5 OpenMP threads
4 is computing...
0 is computing...
4 is computing...
1 is computing...
1 is computing...
0 is computing...
3 is computing...
3 is computing...
2 is computing...
2 is computing...
```

We see that 5 threads (IDs 0 through 4) are running.

### Configure Number of Devices

`Session` accepts the keyword argument `CPU`, which limits the number of CPU devices we can use. Note that `CPU` corresponds to the number of CPU devices, not cores or threads. For example, if we run num_device.jl with (the default is to use all CPUs)

```julia
sess = Session(CPU=1); init(sess)
```

We will see

```
There are 144 OpenMP threads
```

This is because our machine has 144 cores.