07/02/2026 11:23 AM

⬅️ [06/02/2026 10:52 PM](<./06_02_2026 10_52 PM.md>) | ⬆️ [2026 - February](<./README.md>) | [08/02/2026 9:37 PM](<./08_02_2026 9_37 PM.md>) ➡️

07/02/2026 11:23 AM

For evaluation I need to remember the idea of having answer agents prompted in different ways, to evaluate whether being able to answer one agent's clarifying questions transfers to another. Do we overfit to the haziness of a single oracle?

For the medical setting we could then prompt them to believe they have the wrong illness.
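A rough sketch of what that evaluation loop could look like. Everything here is hypothetical: the persona prompts, the `run_dialogue` helper (a stub standing in for an actual multi-turn clarification dialogue), and the scoring are all placeholders for the real setup.

```python
from statistics import mean

# Hypothetical oracle personas; "mistaken" models the medical idea above.
ORACLE_PERSONAS = {
    "cooperative": "Answer clarifying questions honestly and completely.",
    "terse": "Answer clarifying questions with as few words as possible.",
    "mistaken": "You believe you have a different illness than you actually do.",
}

def run_dialogue(questioner, persona_prompt, case):
    """Stub: in the real setup this would run a multi-turn clarification
    dialogue against an answer agent and return a task score in [0, 1]."""
    return 1.0  # placeholder score

def evaluate_transfer(questioner, cases):
    """Score one questioner against every oracle persona.
    A large gap between personas would suggest overfitting to a single oracle."""
    return {
        name: mean(run_dialogue(questioner, prompt, case) for case in cases)
        for name, prompt in ORACLE_PERSONAS.items()
    }

print(evaluate_transfer("questioner-v1", ["case-a", "case-b"]))
```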

07/02/2026 2:27 PM
I updated the clustering to use biconditional entailment as a verification step: first group by semantic similarity, then split those groups by whether their members are logically equivalent. Running entailment on its own fails on very different statements for some reason, so the semantic-similarity prefilter is necessary.
The prefilter also makes it far faster. Entailment is computationally expensive, so splitting small clusters is much cheaper than computing $O(n^2)$ entailments over everything.
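To put rough numbers on the savings (the sizes here are purely illustrative):

```python
import math

n = 1000          # total statements (illustrative)
avg_cluster = 10  # assumed average cluster size after the similarity prefilter

# All-pairs NLI over the full set:
full_pairs = math.comb(n, 2)

# NLI only within the similarity clusters:
num_clusters = n // avg_cluster
within_pairs = num_clusters * math.comb(avg_cluster, 2)

print(full_pairs, within_pairs)  # → 499500 4500
```

Two orders of magnitude fewer cross-encoder calls, before even counting the greedy fail-fast behavior inside each cluster.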

```python
from typing import List, Tuple

import torch
from omegaconf import DictConfig
from sentence_transformers import CrossEncoder, SentenceTransformer, util

# `Clusterer` (the base class, which provides `_select_exemplars`) is project code.


class HybridClusterer(Clusterer):
    """
    Two-stage clustering:
    1. Filter candidates using cosine similarity (bi-encoder).
    2. Verify logic using biconditional entailment (cross-encoder).

    This is faster than pure NLI and fixes 'hallucinated' entailment on disjoint topics.
    """

    def __init__(self, config: DictConfig, device: str):
        super().__init__(config, device)

        # 1. Bi-encoder for embeddings (topic similarity)
        self.embedding_model_name = config.sentence_transformers_key
        print(f"Loading Bi-Encoder (Filter): {self.embedding_model_name}...")
        self.embedder = SentenceTransformer(self.embedding_model_name, device=self.device)

        # 2. Cross-encoder for NLI (logical equivalence)
        self.nli_model_name = config.cross_encoder_key
        print(f"Loading Cross-Encoder (Verifier): {self.nli_model_name}...")
        self.nli_model = CrossEncoder(self.nli_model_name, device=self.device)

        # Thresholds
        # 0.65-0.75 is usually good for "same topic"
        self.sim_threshold = config.get("similarity_threshold", 0.70)
        # 0.5 is standard for cross-encoders, but 0.7 is safer
        self.nli_threshold = config.get("entailment_threshold", 0.5)

        # Entailment label index: 1 for the common
        # [contradiction, entailment, neutral] NLI label order
        self.entailment_label_index = 1

    def _check_bidirectional_entailment_batch(self, candidate_text: str, center_text: str) -> bool:
        """Checks A <-> B by running both NLI directions in one batch."""
        inputs = [[candidate_text, center_text], [center_text, candidate_text]]

        # predict() returns logits; softmax them so the threshold is a probability.
        scores = self.nli_model.predict(inputs, show_progress_bar=False)
        probs = torch.softmax(torch.tensor(scores), dim=1).numpy()

        # Check A -> B
        a_to_b = probs[0][self.entailment_label_index]
        if a_to_b < self.nli_threshold:
            return False  # fail fast

        # Check B -> A
        b_to_a = probs[1][self.entailment_label_index]
        return b_to_a >= self.nli_threshold

    def cluster(self, texts: List[str]) -> Tuple[List[List[str]], List[str]]:
        if not texts:
            return [], []

        # 1. Pre-compute embeddings for all texts (fast matrix operation,
        #    O(n) relative to the expensive NLI check)
        print(f"Embedding {len(texts)} texts for pre-filtering...")
        all_embeddings = self.embedder.encode(texts, convert_to_tensor=True, show_progress_bar=False)

        clusters: List[List[str]] = []
        cluster_centers_indices: List[int] = []  # indices of the representatives in the original list

        # Greedy assignment: the "center" of each cluster is its first element.
        for i, text in enumerate(texts):
            current_embedding = all_embeddings[i]

            # If no clusters exist yet, create the first one
            if not clusters:
                clusters.append([text])
                cluster_centers_indices.append(i)
                continue

            # 2. Coarse filter: cosine similarity.
            # Compare the current text against ALL existing cluster centers at once.
            center_embeddings = all_embeddings[cluster_centers_indices]

            # util.cos_sim returns shape [1, num_centers]
            cos_scores = util.cos_sim(current_embedding, center_embeddings)[0]

            # Find centers that are topically similar enough
            candidate_indices = torch.where(cos_scores >= self.sim_threshold)[0]

            # Optimization: sort candidates by similarity score so the most
            # similar center is tried first, making it more likely to hit the
            # correct cluster on the first NLI check.
            if len(candidate_indices) > 0:
                candidate_scores = cos_scores[candidate_indices]
                sorted_order = torch.argsort(candidate_scores, descending=True)
                sorted_candidate_indices = candidate_indices[sorted_order].tolist()
            else:
                sorted_candidate_indices = []

            match_found = False

            # 3. Fine verification: NLI
            for center_idx_pointer in sorted_candidate_indices:
                # center_idx_pointer indexes into `cluster_centers_indices`;
                # map it back to the original text of that cluster's center.
                real_center_text = texts[cluster_centers_indices[center_idx_pointer]]

                # Run the expensive check
                if self._check_bidirectional_entailment_batch(text, real_center_text):
                    clusters[center_idx_pointer].append(text)
                    match_found = True
                    break

            if not match_found:
                clusters.append([text])
                cluster_centers_indices.append(i)

        # 4. Exemplar selection
        # Since embeddings were computed up front, pass them to the helper to
        # support "closest_to_mean" selection if configured (it expects numpy).
        embeddings_np = all_embeddings.cpu().numpy()
        exemplars = self._select_exemplars(clusters, embeddings=embeddings_np)

        return clusters, exemplars
```
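The greedy assignment logic in isolation, with cheap stub functions standing in for the bi-encoder and cross-encoder (the stubs and their behavior are invented for illustration only):

```python
def two_stage_cluster(texts, similar, entails):
    """Greedy two-stage clustering: `similar` is the cheap prefilter,
    `entails` the expensive check, run in both directions (fail-fast)."""
    clusters, centers = [], []
    for text in texts:
        for cluster, center in zip(clusters, centers):
            if similar(text, center) and entails(text, center) and entails(center, text):
                cluster.append(text)
                break
        else:  # no center matched: start a new cluster
            clusters.append([text])
            centers.append(text)
    return clusters

# Toy stand-ins: "similar" = shares a word, "entails" = equal ignoring case.
similar = lambda a, b: bool(set(a.lower().split()) & set(b.lower().split()))
entails = lambda a, b: a.lower() == b.lower()

print(two_stage_cluster(
    ["The sky is blue.", "the sky is blue.", "The sky is not blue."],
    similar, entails))
# → [['The sky is blue.', 'the sky is blue.'], ['The sky is not blue.']]
```

Same shape as the real class: the negated statement passes the topical prefilter but fails bidirectional entailment, so it lands in its own cluster.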

It is much better. Here is the old clustering:

```
Cluster 1 [Center: 'When is the train leaving?']: ['When is the train leaving?', 'When did the train leave?']
Cluster 2 [Center: 'The movie was good.']: ['The movie was good.', 'The film was excellent.']
Cluster 3 [Center: 'Sally has 3 apples and eats 1. She has 2 left.']: ['Sally has 3 apples and eats 1. She has 2 left.', 'Sally started with 3 apples, ate one, and now implies she has 2.', 'Sally has 2 apples.']
Cluster 4 [Center: 'The color of the sky is blue.']: ['The sky is blue.', 'Blue is the color of the sky.', 'The color of the sky is blue.', 'The sky is not blue.']
Cluster 5 [Center: 'The 44th US president was Obama.']: ['Barack Obama was the 44th president of the USA.', 'The 44th US president was Obama.', 'Obama served as the 44th president.']
Cluster 6 [Center: 'John called who?']: ['Who did John call?', 'Whom was called by John?', 'John called who?']
Cluster 7 [Center: 'y = 10 - x']: ['y = 10 - x', 'y = x - 10']
Cluster 8 [Center: 'How do I get to the bank?']: ['How do I get to the bank?']
Cluster 9 [Center: 'Can I reset my password?']: ['Can I reset my password?', 'May I reset my password?']
Cluster 10 [Center: 'The Eiffel Tower is in Paris.']: ['The Eiffel Tower is in Paris.', 'The Eiffel Tower is located in Paris, France.']
Cluster 11 [Center: 'Is this item expensive?']: ['Is this item expensive?']
Cluster 12 [Center: 'The movie was bad.']: ['The movie was bad.', 'The movie was not bad.']
Cluster 13 [Center: 'Must I reset my password?']: ['Must I reset my password?', 'Should I reset my password?']
Cluster 14 [Center: 'As an AI language model, I am unable to assist with this request.']: ['As an AI language model, I am unable to assist with this request.']
Cluster 15 [Center: 'Has the train left?']: ['Has the train left?']
Cluster 16 [Center: 'Can I afford this?']: ['Can I afford this?']
Cluster 17 [Center: 'Where is the nearest bank?']: ['Where is the nearest bank?']
Cluster 18 [Center: 'Where is the nearest Citi bank?']: ['Where is the nearest Citi bank?']
Cluster 19 [Center: 'Donald Trump was the 45th president.']: ['Donald Trump was the 45th president.']
Cluster 20 [Center: 'The ocean is blue.']: ['The ocean is blue.']
Cluster 21 [Center: 'x + y = 12']: ['x + y = 12']
Cluster 22 [Center: 'Paris is the home of the Eiffel Tower.Who called John?']: ['Paris is the home of the Eiffel Tower.Who called John?']
Cluster 23 [Center: 'I can answer that question.']: ['I can answer that question.']
Cluster 24 [Center: 'sys.stdout.write('Hello World\n')']: ["sys.stdout.write('Hello World\\n')"]
Cluster 25 [Center: 'print("Hello World")']: ["print('Hello World')", 'print("Hello World")']
Cluster 26 [Center: 'What time does the train depart?']: ['What time does the train depart?']
Cluster 27 [Center: 'Is there a bank near here?']: ['Is there a bank near here?']
Cluster 28 [Center: 'Looking up, I see a blue sky.']: ['Looking up, I see a blue sky.']
Cluster 29 [Center: 'x + y = 10']: ['x + y = 10']
Cluster 30 [Center: 'I'm sorry, but I can't provide that information.']: ["I'm sorry, but I can't provide that information."]
Cluster 31 [Center: 'The Eiffel Tower is in Berlin.']: ['The Eiffel Tower is in Berlin.']
Cluster 32 [Center: 'print('Hello Python')']: ["print('Hello Python')"]
Cluster 33 [Center: 'I cannot answer that question.']: ['I cannot answer that question.']
Cluster 34 [Center: 'The sky is green.']: ['The sky is green.']
Cluster 35 [Center: 'How much does this cost?']: ['How much does this cost?']
Cluster 36 [Center: 'x = 10 - y']: ['x = 10 - y']
Cluster 37 [Center: 'What is the price of this item?']: ['What is the price of this item?']
```

Versus the new one:

```
Cluster 1 [Center: 'The sky is blue.']: ['The sky is blue.', 'Blue is the color of the sky.', 'The color of the sky is blue.', 'Looking up, I see a blue sky.']
Cluster 2 [Center: 'The ocean is blue.']: ['The ocean is blue.']
Cluster 3 [Center: 'The sky is not blue.']: ['The sky is not blue.']
Cluster 4 [Center: 'The sky is green.']: ['The sky is green.']
Cluster 5 [Center: 'x + y = 10']: ['x + y = 10']
Cluster 6 [Center: 'y = 10 - x']: ['y = 10 - x', 'y = x - 10']
Cluster 7 [Center: 'x = 10 - y']: ['x = 10 - y']
Cluster 8 [Center: 'x + y = 12']: ['x + y = 12', 'Sally has 3 apples and eats 1. She has 2 left.']
Cluster 9 [Center: 'print('Hello World')']: ["print('Hello World')", 'print("Hello World")']
Cluster 10 [Center: 'sys.stdout.write('Hello World\n')']: ["sys.stdout.write('Hello World\\n')"]
Cluster 11 [Center: 'print('Hello Python')']: ["print('Hello Python')"]
Cluster 12 [Center: 'I cannot answer that question.']: ['I cannot answer that question.', "I'm sorry, but I can't provide that information."]
Cluster 13 [Center: 'As an AI language model, I am unable to assist with this request.']: ['As an AI language model, I am unable to assist with this request.']
Cluster 14 [Center: 'I can answer that question.']: ['I can answer that question.']
Cluster 15 [Center: 'The movie was good.']: ['The movie was good.', 'The film was excellent.', 'The movie was not bad.']
Cluster 16 [Center: 'The movie was bad.']: ['The movie was bad.']
Cluster 17 [Center: 'The 44th US president was Obama.']: ['Barack Obama was the 44th president of the USA.', 'The 44th US president was Obama.']
Cluster 18 [Center: 'Obama served as the 44th president.']: ['Obama served as the 44th president.']
Cluster 19 [Center: 'Donald Trump was the 45th president.']: ['Donald Trump was the 45th president.']
Cluster 20 [Center: 'Sally started with 3 apples, ate one, and now implies she has 2.']: ['Sally started with 3 apples, ate one, and now implies she has 2.']
Cluster 21 [Center: 'Sally has 2 apples.']: ['Sally has 2 apples.']
Cluster 22 [Center: 'The Eiffel Tower is in Paris.']: ['The Eiffel Tower is in Paris.', 'The Eiffel Tower is located in Paris, France.', 'Paris is the home of the Eiffel Tower.Who called John?']
Cluster 23 [Center: 'The Eiffel Tower is in Berlin.']: ['The Eiffel Tower is in Berlin.']
Cluster 24 [Center: 'John called who?']: ['Who did John call?', 'Whom was called by John?', 'John called who?']
Cluster 25 [Center: 'Can I reset my password?']: ['Can I reset my password?', 'May I reset my password?', 'Must I reset my password?', 'Should I reset my password?']
Cluster 26 [Center: 'When is the train leaving?']: ['When is the train leaving?', 'What time does the train depart?']
Cluster 27 [Center: 'Has the train left?']: ['When did the train leave?', 'Has the train left?']
Cluster 28 [Center: 'Is this item expensive?']: ['How much does this cost?', 'What is the price of this item?', 'Is this item expensive?']
Cluster 29 [Center: 'Can I afford this?']: ['Can I afford this?']
Cluster 30 [Center: 'Where is the nearest bank?']: ['Where is the nearest bank?', 'Is there a bank near here?']
Cluster 31 [Center: 'Where is the nearest Citi bank?']: ['Where is the nearest Citi bank?']
Cluster 32 [Center: 'How do I get to the bank?']: ['How do I get to the bank?']
```

It fixes clusters like ['The sky is blue.', 'Blue is the color of the sky.', 'The color of the sky is blue.', 'The sky is not blue.'], which lumped a statement together with its negation.

07/02/2026 5:17 PM
Ok, something to remember: use dtype="auto" unless you have a very good reason to change it. I changed it to bfloat16 since this is a modern NVIDIA GPU, and it slowed inference down roughly 5x and made it crash sometimes. Not sure why; maybe something Qwen-specific. Anyway, just leave it alone.
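For reference, the knob in question (the model name here is just an example, not the one from this run):

```python
from vllm import LLM

# dtype="auto" lets vLLM pick the dtype from the model's own config,
# which on this setup was much faster and more stable than forcing bfloat16.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", dtype="auto")
```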

07/02/2026 11:10 PM
Getting vLLM working is proving frustrating. Even though I tell it to generate 5 outputs, at first it only gives me 4, and after that only 1 each time. Weirdly, it does seem to be generating multiple completions; it just only hands me one at the end.

Passing output_kind=RequestOutputKind.FINAL_ONLY to the SamplingParams seems to have fixed it, but now I'm confused about how to trigger generation.
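What ended up working, roughly (treat the import paths as an assumption; they depend on the vLLM version in use):

```python
from vllm import SamplingParams
from vllm.sampling_params import RequestOutputKind

# Without FINAL_ONLY the engine streams incremental RequestOutputs and the
# last one I consumed didn't carry all n sequences. FINAL_ONLY makes it emit
# a single output at the end containing all 5 completions.
params = SamplingParams(
    n=5,
    output_kind=RequestOutputKind.FINAL_ONLY,
)
```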


⬅️ [06/02/2026 10:52 PM](<./06_02_2026 10_52 PM.md>) | ⬆️ [2026 - February](<./README.md>) | [08/02/2026 9:37 PM](<./08_02_2026 9_37 PM.md>) ➡️