Concurrent LM Calls¶
Many prompt engineering techniques like Self-Consistency (CoT-SC) and Tree of Thoughts (ToT) involve non-sequential LM calls such as branching and gathering, where parallelizing the calls can significantly speed up the process.
APPL provides a simple way to parallelize these calls using asynchronous computation.
Asynchronous Execution¶
In APPL, the gen
function automatically starts a new thread (or process) to handle the LM call. The gen
function does not block the main thread and returns a Generation
object that represents the generation result. The generation result is not synchronized (waited) until its value is needed, therefore, multiple independent gen
calls can be executed concurrently.
StringFuture to represent strings that may not be available yet
To support asynchronized execution, we introduce the StringFuture
object similar to concurrent.futures.Future
. StringFuture
is a placeholder for a string value that will be available in the future, and can be used to represent generation results that are computed in other threads and not yet available.
In most scenarios, you can use StringFuture
as a normal str
. The StringFuture
delays its synchronization when possible, ideally only synchronizes the value when the str
is called. For example, a concatenation of StringFuture
objects can be done without waiting for the value of each StringFuture
to be available.
- equivalent to
In this example, several Generation
objects are returned by the mul
function, and the StringFuture
objects are used to concatenate the results, which results in a StringFuture
object without synchronizing the generation results. The print
function requires the value of the StringFuture
object s
, which triggers the synchronization of the generation results. Since the threads are already started when the gen
function is called, the generation results are computed in parallel.
Output will looks like:
where starting new threads could have a small overhead, but relatively small than API calls.Force synchronization
If you want to force synchronization, you can call .results
to wait for the results, or simply use str
directly if the result is a string. For example, mul(3, 4).results
or str(mul(3, 4))
.
This could lead to a slower execution time, for example, if you replace the computation of s
as s = "\n".join(" ".join(str(mul(i + 1, j + 1)) for j in range(n)) for i in range(n))
, the runtime could be around 8 seconds since the generations are not parallelized.
Example (CoT-SC)¶
The following example demonstrates how to use APPL to naturally exploit the independence among the reasoning paths in Self Consistency of Chain-of-Thoughts to parallelize the execution.
Self Consistency of Chain-of-Thoughts (CoT-SC)
- Chain-of-thoughts (CoT) prompting enhances the LLM's ability to perform complex reasoning by providing examples of intermediate reasoning steps.
- Self consistency samples different reasoning pathes from the LLM then marginalizes to generate a consensus.
Below is an illustration of this method from the paper "Self-Consistency Improves Chain of Thought Reasoning in Language Models"1.
The implementation below shows an example of determining if a set of numbers add up to an even number (task introduced in source).
Output will looks like:
Parallel CoT-SC Answer: False
Parallel CoT-SC takes 1.74 seconds
Sequential CoT-SC Answer: False
Sequential CoT-SC takes 7.42 seconds
-
https://arxiv.org/abs/2203.11171 ↩