Concurrent LM Calls

Many prompt engineering techniques like Self-Consistency (CoT-SC) and Tree of Thoughts (ToT) involve non-sequential LM calls such as branching and gathering, where parallelizing the calls can significantly speed up the process.

APPL provides a simple way to parallelize these calls using asynchronous computation.

Asynchronous Execution

In APPL, the gen function automatically starts a new thread (or process) to handle the LM call. It does not block the main thread; instead, it returns a Generation object that represents the generation result. The result is not synchronized (waited for) until its value is needed, so multiple independent gen calls can run concurrently.
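
The non-blocking pattern described above can be sketched with the standard library alone. This is an analogy, not APPL's implementation: `fake_lm_call` and `run_concurrently` are hypothetical names, and `ThreadPoolExecutor.submit` stands in for the role that gen plays in APPL.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_lm_call(prompt: str) -> str:
    """Hypothetical stand-in for a slow LM request."""
    time.sleep(0.2)  # simulate network latency
    return f"answer to {prompt!r}"


def run_concurrently(prompts: list[str]) -> tuple[list[str], float, float]:
    """Submit every call first, then wait for results only when needed."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max(1, len(prompts))) as pool:
        # submit() returns a Future immediately; nothing is awaited here
        futures = [pool.submit(fake_lm_call, p) for p in prompts]
        submit_time = time.perf_counter() - start
        # result() blocks until that particular call has finished
        results = [f.result() for f in futures]
    total_time = time.perf_counter() - start
    return results, submit_time, total_time


if __name__ == "__main__":
    results, submit_time, total_time = run_concurrently(["a", "b", "c"])
    print(f"submitted in {submit_time:.3f}s, finished in {total_time:.3f}s")
    print(results)
```

Because the three simulated calls overlap, the total time should be close to one call's latency rather than three times that.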

StringFuture to represent strings that may not be available yet

To support asynchronous execution, we introduce the StringFuture object, which is similar to concurrent.futures.Future. A StringFuture is a placeholder for a string value that will be available in the future, and it can represent generation results that are being computed in other threads and are not yet available.

In most scenarios, you can use a StringFuture as a normal str. A StringFuture delays synchronization whenever possible, ideally until str is called on it. For example, StringFuture objects can be concatenated without waiting for the value of each one to become available.

import time

import appl
from appl import gen, ppl, StringFuture

@ppl
def mul(x: int, y: int):
    # the prompt wording is an assumption; any prompt that asks for x * y works
    f"{x}*{y}="
    return gen()

t0 = time.time()
n = 3
s = StringFuture("\n").join(
    StringFuture(" ").join(mul(i + 1, j + 1) for j in range(n))
    for i in range(n)
) # (1)
print(f"Time: {time.time() - t0:.2f}")
print(s)
print(f"Time: {time.time() - t0:.2f}")
  1. equivalent to
    s = ""
    for i in range(n):
        if i:
            s += "\n"
        for j in range(n):
            if j:
                s += " "
            s += mul(i + 1, j + 1)

In this example, mul returns several Generation objects, and joining them with StringFuture produces another StringFuture without synchronizing any of the generation results. The print function needs the value of s, which triggers the synchronization. Since the threads were already started when gen was called, the generations are computed in parallel.

The output will look like:

Time: 0.09
1 2 3
2 4 6
3 6 9
Time: 1.91
Starting new threads adds some overhead, but it is small compared with the latency of the API calls.
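
To make the idea of deferred concatenation concrete, here is a toy sketch built on `concurrent.futures`. `LazyStr` and `demo` are hypothetical names used for illustration only; this is not how StringFuture is implemented in APPL, just one way a string placeholder can postpone waiting until `str` is called.

```python
from concurrent.futures import Future, ThreadPoolExecutor


class LazyStr:
    """Toy placeholder for a string built from parts that may still be pending."""

    def __init__(self, *parts):
        # each part is either a plain str or a Future that yields a str
        self.parts = list(parts)

    def __add__(self, other):
        other_parts = other.parts if isinstance(other, LazyStr) else [other]
        return LazyStr(*self.parts, *other_parts)  # no waiting happens here

    def __radd__(self, other):
        # supports "plain str" + LazyStr
        return LazyStr(other, *self.parts)

    def __str__(self):
        # synchronization point: wait for every pending part
        return "".join(
            p.result() if isinstance(p, Future) else p for p in self.parts
        )


def demo() -> str:
    with ThreadPoolExecutor(max_workers=2) as pool:
        hello = LazyStr(pool.submit(str.upper, "hello"))
        world = LazyStr(pool.submit(str.upper, "world"))
        combined = hello + ", " + world  # still lazy: nothing has been awaited
        return str(combined)  # waits for both parts here


if __name__ == "__main__":
    print(demo())  # HELLO, WORLD
```

Concatenation only records the parts, so the background work keeps running while the string is assembled.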

Force synchronization

If you want to force synchronization, you can call .results to wait for the results, or simply use str directly if the result is a string. For example, mul(3, 4).results or str(mul(3, 4)).

This can slow execution down. For example, if you replace the computation of s with s = "\n".join(" ".join(str(mul(i + 1, j + 1)) for j in range(n)) for i in range(n)), the runtime could be around 8 seconds, since the generations are no longer parallelized.
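
The effect of synchronizing too early can be observed without any API calls. The sketch below is a standard-library analogy, not APPL code: `ConcurrencyProbe` and `peak_concurrency` are hypothetical helpers, and the `time.sleep` call stands in for LM latency. It records how many simulated calls are in flight at once when each result is awaited immediately versus when all calls are submitted first.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ConcurrencyProbe:
    """Tracks how many simulated calls are in flight at the same time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0

    def call(self, x: int, y: int) -> str:
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        time.sleep(0.05)  # hypothetical stand-in for API latency
        with self._lock:
            self._active -= 1
        return str(x * y)


def peak_concurrency(eager: bool, n: int = 3) -> int:
    """Run an n-by-n grid of calls and return the peak number in flight."""
    probe = ConcurrencyProbe()
    pairs = [(i + 1, j + 1) for i in range(n) for j in range(n)]
    with ThreadPoolExecutor(max_workers=len(pairs)) as pool:
        if eager:
            # waiting on each result right away serializes the calls
            cells = [pool.submit(probe.call, x, y).result() for x, y in pairs]
        else:
            # submit everything first, wait afterwards
            futures = [pool.submit(probe.call, x, y) for x, y in pairs]
            cells = [f.result() for f in futures]
    assert cells[-1] == str(n * n)
    return probe.peak


if __name__ == "__main__":
    print("eager peak:", peak_concurrency(eager=True))
    print("deferred peak:", peak_concurrency(eager=False))
```

In the eager variant only one call is ever active, which mirrors what happens when str is applied to each generation as soon as it is created.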

Example (CoT-SC)

The following example demonstrates how to use APPL to naturally exploit the independence among the reasoning paths in Self Consistency of Chain-of-Thoughts to parallelize the execution.

Self Consistency of Chain-of-Thoughts (CoT-SC)
  • Chain-of-thoughts (CoT) prompting enhances the LLM's ability to perform complex reasoning by providing examples of intermediate reasoning steps.
  • Self-consistency samples different reasoning paths from the LLM, then marginalizes over them to reach a consensus answer.

Below is an illustration of this method from the paper "Self-Consistency Improves Chain of Thought Reasoning in Language Models" [1].

[Figure: CoT-SC example]

The implementation below shows an example of determining whether the odd numbers in a group add up to an even number (task introduced in source).

import time

from appl import gen, ppl

def parse_answer(answer: str):
    # parse the ANS from: The answer is [ANS].
    if (key := "The answer is ") in answer:
        return answer.split(key)[-1].split(".")[0].strip()
    return None

def get_mode(answers: list[str]):
    """Get the mode of the answers"""
    return max(set(answers), key=answers.count)

def marginalize(results: list):
    """Get the answer from the results and get the mode of the answers"""
    # explicitly synchronize the results using str()
    answers = [parse_answer(str(res)) for res in results]

    return get_mode(answers)

@ppl
def cot_consistency(cot_examples: list[str], question: str, num_trials: int):
    cot_examples  # the list of examples is captured into the prompt one by one
    question  # the question is captured into the prompt after the examples
    results = [gen() for _ in range(num_trials)]  # concurrent generation
    return marginalize(results)  # marginalize the reasoning paths to get the answer

@ppl
def cot_consistency_sequential(cot_examples: list[str], question: str, num_trials: int):
    cot_examples  # the list of examples is captured into the prompt one by one
    question  # the question is captured into the prompt after the examples
    results = [str(gen()) for _ in range(num_trials)]  # explicit synchronization
    return marginalize(results)  # marginalize the reasoning paths to get the answer

# example from
cot_examples = [
    (
        "The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.\n"
        "A: Adding all the odd numbers (9, 15, 1) gives 25. The answer is False."
    ),
    (
        "The odd numbers in this group add up to an even number: 17, 10, 19, 4, 8, 12, 24.\n"
        "A: Adding all the odd numbers (17, 19) gives 36. The answer is True."
    ),
]
question = (
    "The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1."
)

n = 5
start_time = time.time()
print(f"Parallel CoT-SC Answer: {cot_consistency(cot_examples, question, n)}")
print(f"Parallel CoT-SC takes {time.time() - start_time:.2f} seconds")

start_time = time.time()
print(
    f"Sequential CoT-SC Answer: {cot_consistency_sequential(cot_examples, question, n)}"
)
print(f"Sequential CoT-SC takes {time.time() - start_time:.2f} seconds")

The output will look like:

Parallel CoT-SC Answer: False
Parallel CoT-SC takes 1.74 seconds
Sequential CoT-SC Answer: False
Sequential CoT-SC takes 7.42 seconds
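
The parsing and voting step can be checked offline with mock outputs. The sketch below reuses the parsing rule from the example above; `majority_vote` and `mock_outputs` are hypothetical additions, not part of APPL. It drops unparseable outputs before voting and uses `collections.Counter`, which orders equal counts by first appearance, whereas `max(set(answers), key=answers.count)` resolves ties according to set iteration order.

```python
from collections import Counter
from typing import Optional


def parse_answer(answer: str) -> Optional[str]:
    """Extract ANS from a response ending in 'The answer is [ANS].'"""
    key = "The answer is "
    if key in answer:
        return answer.split(key)[-1].split(".")[0].strip()
    return None


def majority_vote(raw_outputs: list[str]) -> Optional[str]:
    """Return the most common parsed answer, ignoring unparseable outputs."""
    parsed = [parse_answer(out) for out in raw_outputs]
    valid = [a for a in parsed if a is not None]
    if not valid:
        return None
    # on ties, most_common keeps the answer that was seen first
    return Counter(valid).most_common(1)[0][0]


# made-up outputs for illustration; not real model responses
mock_outputs = [
    "Adding the odd numbers (15, 5, 13, 7, 1) gives 41. The answer is False.",
    "The odd numbers sum to 41, which is odd. The answer is False.",
    "I think the sum is 42. The answer is True.",
    "Not sure.",
]

if __name__ == "__main__":
    print(majority_vote(mock_outputs))  # False
```

Filtering out None keeps a batch of malformed responses from winning the vote by accident.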
