import React from "react";
import { useNavigate } from "react-router-dom";
import InterfaceOne from "../assets/Images/levers/interface_one.png";
import InterfaceTwo from "../assets/Images/levers/interface_two.png";

function AutograderSAE() {
  const navigate = useNavigate();

  return (
    <div
      className="max-w-[1000px] px-12 mx-auto my-12 tablet:my-36"
      id="container"
    >
      <button
        className="text-blue-500 underline mb-4"
        onClick={() => navigate(-1)}
      >
        Back
      </button>
      <div className="flex flex-col gap-4 items-start mb-16">
        <div className="text-left w-full font-times gap-2 flex flex-col">
          <h1 className="text-black-primary font-times text-4xl my-4">
            Levers as a new interface paradigm
          </h1>
          <p className="mb-4">
            This is a continuation of{" "}
            <a
              href="/the-lab/autograder-sae"
              className="text-blue-500 underline"
            >
              one of my last posts
            </a>
            , in an attempt to further explore the practicalities of sparse
            auto-encoders. As I've been working on this, I've realized that this
            is a relatively technical topic that I, admittedly, struggle to
            explain to a lay audience.
          </p>

          <p className="mb-4">
            Given the practical <strong>emphasis</strong> of the research I
            strive to do, I'm trying to craft a more digestible story for a
            wider audience. Perhaps this lab update is the first of many
            attempts.
          </p>

          <h3 className="text-2xl font-bold mt-6 mb-4">On prompting</h3>
          <p className="mb-4">
            Why do we have to be{" "}
            <a
              href="https://varunshenoy.substack.com/p/natural-language-is-an-unnatural"
              className="text-blue-500 underline"
            >
              good prompt engineers
            </a>{" "}
            in order to effectively interact with large language models? We
            spend hours refining the grammar, syntax and structural formatting
            of our 'commands' to these ultra-powerful AI systems, only to attain
            a behavior that works the first 3 times and performs inconsistently
            in subsequent attempts. This struggle is particularly salient when
            interacting with multimodal models, where it is difficult to express
            visual elements using words. I'm also limited by the boundaries of
            my own creative thought.
          </p>

          <h3 className="text-2xl font-bold mt-6 mb-4">On interfaces</h3>

          <h4 className="text-xl font-bold mt-4 mb-2">App-level interfaces</h4>
          <p className="mb-4">
            A lot of AI-powered apps are adopting interfaces that take a
            different approach. Rather than giving me complete freedom to
            directly interact with these models, the app presents me with a set
            of finite, semantically understandable 'levers' that I'm able to
            push and pull, in order to attain a final product that I desire.
            Apps encode an 'opinion' of a user's expected behavior within these
            levers, creating a more intentional interface -- and one that
            requires a little less thinking & skill, arguably, on the user end.
            It feels like a tactile interface, and though my output space is
            more constrained, as a user, I actually <em>feel</em> like{" "}
            <strong>I'm</strong> creating something, as opposed to telling an
            external, human-like entity what I <em>want</em> to create.
          </p>

          <h4 className="text-xl font-bold mt-4 mb-2">
            Model-level interfaces
          </h4>
          <p className="mb-4">
            What if we adopted the same 'lever' philosophy, but on the model
            level? What if every trained large language model gave us a set of
            levers that we, humans, could understand, and cranking them up or
            down would allow us to <strong>precisely</strong> change the
            behavior of these models?
          </p>

          <h4 className="text-xl font-bold mt-4 mb-2">
            Output-level interfaces
          </h4>
          <p className="mb-4">
            Let's go even further. Instead of re-configuring the model's levers
            that directly produce some output, what if the output{" "}
            <em>itself</em> had a bunch of levers that we humans not only
            understood, but could crank up or down in order to modify the
            very output itself? This output could be in any domain: text, image,
            video, or audio.
          </p>

          <p className="mb-4">
            I'll make a preliminary case for why I think levers on the output
            itself could be compelling.
          </p>

          <p className="mb-4">
            <em>
              Generating and editing feel like distinct, creative processes
            </em>
            <br />
            Of course, there are benefits to altering the brain of the
            generative model itself, especially when you generate the very first
            version of whatever you're trying to create. But subsequent edits of
            that very thing may require a new interaction paradigm. It's like
            writing the first version of an essay (often, a word vomit of some
            sort), and then spending days, weeks, or months wrangling with the
            semantic and structural details of that very essay to perfect it.
            Both <em>feel</em> like distinct creative processes.
          </p>

          <p className="mb-4">
            <em>You don't know what you don't know.</em>
            <br />
            We are constrained by the mental levers within our own brains. For a
            creative, I argue this is akin to <em>taste</em>. An experienced
            artist has a very wide, expansive arsenal of 'levers' with which
            they are able to direct and evolve their artwork -- their output
            space is very large, and often refined to their own personal
            'style'. A writer may know how to improve their essay in 10
            different ways, but a conversation with a writer they admire may
            open up an 11th direction in which to alter the essay. Levers,
            especially in the editing phase, are intrinsically valuable.
          </p>

          <p className="mb-4">
            <em>Re-prompting is an even more unnatural interface.</em>
            <br />
            It's worse than prompting for the first time. As you re-prompt,
            you're so anxious about preserving things you found valuable in your
            essay that it is difficult to focus on meaningful changes to your
            work. It's not very precise.
          </p>

          <p className="mb-4">
            Creating is a deeply human process. Your taste is something that can
            only be discovered via experience, and experimentation. Yet, I think
            the discovery and the interaction with these hypothetical 'levers',
            especially in the editing, conversational creative process, is
            something that could inspire even more creativity.
          </p>

          <p className="mb-4">
            Particularly if we are able to do so not just on the human-induced
            app level, but on a level that is more native to the inner workings
            of the large language model itself.
          </p>

          <h3 className="text-2xl font-bold mt-6 mb-4">Is this possible?</h3>

          <p className="mb-4">
            To put it simply, sparse auto-encoders (SAEs) enable us to interpret
            a series of machine-generated numbers and, to some extent, discover
            human-understandable 'levers' that directly change these numbers in
            semantically understandable ways. These levers are intrinsic to
            the machine itself.
          </p>

          <p className="mb-4">
            Anthropic has been using SAEs on the model level, specifically the{" "}
            <em>reasoning</em> layers of the large language model itself,
            discovering these 'levers' to alter the behavior of the model. More
            recently, Linus has been using SAEs on the output-level,
            specifically the <em>embedding</em> of the output, discovering these
            'levers' on the output itself.
          </p>

          <p className="mb-4">
            For the reasons mentioned above, I find the discovery of
            output-level levers to be a super compelling and promising new
            interface paradigm. We are still far from this being a practical new
            method. Is it more precise? Are these features truly independent
            from one another? What is the nature of these levers?
          </p>

          <p className="mb-4">
            If we are to believe that interacting with some concept of a{" "}
            <em>lever</em> can be deeply valuable, the next logical question to
            me is: to what extent can we control what levers we can interact
            with? Can we artificially induce, and discover, features that I
            personally find valuable?
          </p>

          <p className="mb-4">
            From a safety and alignment perspective, it is interesting to
            discover dangerous features that a very large, non-explainable model
            has learned, in an attempt to 'deactivate' them.
          </p>

          <p className="mb-4">
            But, from a practical, creative, day-to-day usage perspective, I
            want to be able to interact with levers that are going to be useful
            for a task that I am trying to achieve. Beyond just useful, what if
            we could interact with levers that I didn't even know existed,
            in a precise, specific manner?
          </p>

          <p className="mb-4">
            Could I see the levers of an artist I admire, a writer whom I adore,
            and simulate <em>change</em> in this nuanced direction?
          </p>

          <h3 className="text-2xl font-bold mt-6 mb-4">
            Grading, as a useful case study
          </h3>

          <p className="mb-4">
            One thing that distinguishes this exploration from many others is
            the emphasis on domain-specificity: the ability to use SAEs for a
            particular task you find interesting. Hence, we begin our
            explorations with a more practical, hyper-specific task:
            grading.
          </p>

          <p className="mb-4">
            When teachers grade essays, short answers, or homework assignments,
            they look for particular signals. Often, they use rubrics that are
            intended to be specific, so that scores are standardized across
            multiple graders. The reality is, however, that these rubrics are
            not very specific, leading to a low level of agreement across
            different graders.
          </p>

          <p className="mb-4">
            Here is an example of a vague rubric for grading a short-answer
            student response:
          </p>

          <div className="bg-white rounded-lg p-4 shadow-sm mb-4">
            <div className="mb-4">
              <strong>Score 2:</strong> The response demonstrates:
              <ul className="list-disc ml-6 list-outside">
                <li>
                  an exploration or development of the ideas presented in the
                  text
                </li>
                <li>
                  a strong conceptual understanding by the inclusion of specific
                  relevant information from the text
                </li>
                <li>
                  an extension of ideas that may include extensive and/or
                  insightful inferences, connections between ideas in the text,
                  and references to prior knowledge and/or experiences
                </li>
              </ul>
            </div>

            <div className="mb-4">
              <strong>Score 1:</strong> The response demonstrates:
              <ul className="list-disc ml-6 list-outside">
                <li>
                  some exploration or development of ideas presented in the text
                </li>
                <li>
                  a fundamental understanding by the inclusion of some relevant
                  information from the text
                </li>
                <li>
                  an extension of ideas that lacks depth, although may include
                  some inferences, connections between ideas in the text, or
                  references to prior knowledge and/or experiences
                </li>
              </ul>
            </div>

            <div>
              <strong>Score 0:</strong> The response demonstrates:
              <ul className="list-disc ml-6 list-outside">
                <li>
                  limited or no exploration or development of ideas presented in
                  the text
                </li>
                <li>
                  limited or no understanding of the text, may be illogical,
                  vague, or irrelevant
                </li>
                <li>
                  possible incomplete or limited inferences, connections between
                  ideas in the text, or references to prior knowledge and/or
                  experiences
                </li>
              </ul>
            </div>
          </div>

          <p className="mb-4">
            In a world where I'm trying to <em>perfect</em> my response to this
            question, I'd love to know what signals the teacher is looking at
            when grading my response. These signals are akin to 'levers' that
            I'd like to discover. More interestingly, as a student, I'd find it
            helpful to edit my response by cranking certain levers up/down to
            see directly how to improve my response in the ways that a teacher
            might be expecting. In some sense, this is generating helpful
            examples of 'good' and potentially 'bad' responses from the direct
            perspective of a teacher.
          </p>

          <p className="mb-4">So, there are two north stars:</p>
          <ol className="list-decimal list-outside mb-4 ml-6">
            <li className="mb-2">
              Is it possible to identify these <em>signals</em> as levers (that
              we can later alter)? Do they make sense? What is the{" "}
              <em>nature</em> of these signals?
            </li>
            <li>Once identified, how precise is the moving of these levers?</li>
          </ol>

          <p className="mb-4">
            This is a hyper-specific example, but you can imagine the appeal of
            being able to almost 'jump' into the mind of some person, identify
            the mental levers in their brain, and then be able to precisely pull
            and push these levers to see how they would interact with some
            text/image/audio.
          </p>

          <p className="mb-4">
            You can argue that you could, on the app level, do the very same
            thing: dump a huge dataset of all the students' responses, ask the
            LLM for the common characteristics between them, and then re-prompt
            to edit. We know this is challenging, and we know this can be
            imprecise. While nascent, I'm curious to see if using SAEs, a whole
            different method, could be a more precise way of doing both
            things: (1) identifying the levers you want to identify, and (2)
            moving them in more precise ways.
          </p>

          <h4 className="text-xl font-bold mt-4 mb-2">
            A brief technical background on SAEs
          </h4>
          <p className="mb-4">
            Sparse auto-encoders (SAEs) are a family of models that learn to
            reconstruct some vector (in our case, an embedding) by projecting
            the original vector into a higher-dimensional space and back. In
            training an SAE, we try to discover a set of feature vectors over
            some dataset that enables us to best reconstruct the embeddings --
            these vectors are, conceptually, the same as the 'levers' that I
            discussed previously. In this setup, the feature vectors are the
            weights of the decoder model, and the feature activations (the
            strength of each feature vector for a particular embedding) are
            represented by the intermediate layer in the higher-dimensional
            space.
          </p>
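          <p className="mb-4">
            As a rough illustration, here is a minimal NumPy sketch of that
            setup. It assumes a single-layer ReLU encoder and an L1 sparsity
            penalty, which is a common SAE formulation but not necessarily the
            exact one used here; the sizes, the random weights and the L1
            coefficient are made-up placeholders.
          </p>
          <pre className="bg-white rounded-lg p-4 shadow-sm mb-4 overflow-x-auto text-sm font-mono">
            <code>{`import numpy as np

rng = np.random.default_rng(0)
d_embed, d_features = 8, 32  # hypothetical sizes; the feature space is wider

# Randomly initialised weights stand in for a trained SAE (assumption).
W_enc = rng.normal(scale=0.1, size=(d_embed, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(scale=0.1, size=(d_features, d_embed))  # one row per 'lever'
b_dec = np.zeros(d_embed)

def encode(x):
    # Feature activations: how strongly each feature fires for this embedding.
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)

def decode(f):
    # Reconstruction: activation-weighted sum of the decoder's feature vectors.
    return f @ W_dec + b_dec

def loss(x, l1_coeff=1e-3):
    f = encode(x)
    # Reconstruction error plus an L1 term that pushes most activations to zero.
    return np.mean((x - decode(f)) ** 2) + l1_coeff * np.abs(f).sum()

x = rng.normal(size=d_embed)  # stand-in for one embedding
f = encode(x)
f[3] += 2.0                   # 'crank up' feature 3
x_edited = decode(f)          # the edited embedding
print(loss(x), x_edited.shape)
`}</code>
          </pre>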

          <p className="mb-4">
            We then feed a <em>test</em> dataset into the SAE, and collect all
            the samples that activate highly on a particular feature, and then
            ask GPT to identify the common characteristic across all
            highly-activating samples to make the feature 'human-readable'.
          </p>

          <p className="mb-4">
            If you want to learn more about the details of SAEs, here is a
            clear,{" "}
            <a
              href="https://substack.com/@nickjiang/p-145475328"
              className="text-blue-500 underline"
            >
              concise primer on SAEs.
            </a>{" "}
            Additionally, a lot of the techniques, especially the construction
            of the interpretability pipeline, are inspired by{" "}
            <a
              href="https://thesephist.com/posts/prism/"
              className="text-blue-500 underline"
            >
              Linus' Prism blog post
            </a>
            , which is technically rich and can help you understand some of my
            specific, technical decisions.
          </p>

          <h4 className="text-xl font-bold mt-4 mb-2">The method</h4>
          <p className="mb-4">
            SAEs have this profound ability to extract semantic features from
            high-dimensional, continuous vector representations. What if we
            fine-tuned some neural network (with a classification head) that
            tries to predict if a student response is a score 0, 1 or 2, and
            then trained the SAE on the 'last hidden layer' of the neural
            network? This layer should, in theory, be highly indicative of
            whether a response is 0, 1 or 2, and should encode features that
            indicate this scoring.
          </p>

          <p className="mb-4">
            I hypothesize that analyzing this layer should reveal features
            related to what the neural network was looking for in order to grade
            the student response. For simplicity, I decided to fine-tune a
            common BERT model (as an auto-grader) before extracting its last
            hidden layer. There are some floating theories that suggest that
            different layers encode <em>different things</em>, but we just pick
            the last layer for simplicity, and as a starting point.
          </p>

          <p className="mb-4">
            The specific grading task we focus on is Question 3 of the{" "}
            <a
              href="https://www.kaggle.com/c/asap-sas"
              className="text-blue-500 underline"
            >
              ASAP Short Answer Scoring dataset
            </a>
            . It is a reading comprehension question that asks a student to
            "explain how pandas in China are similar to koalas in Australia and
            how they both are different from pythons. Support your response with
            information from the article."
          </p>
          <p className="mb-4">We do the following:</p>
          <ol className="list-decimal list-outside mb-4 ml-6">
            <li className="mb-2">
              We fine-tune a BERT model with a classification head to predict
              the score (0, 1, or 2), given a student's response, using this
              dataset. We attain a QWK (quadratic weighted kappa) of ~75%.
            </li>
            <li className="mb-2">
              We embed 1 million sentences from the MiniPile dataset using this
              fine-tuned BERT model (extracting the last hidden layer). We also
              do the same using a pre-trained BERT model to compare the
              differences.
            </li>
            <li>
              We train an SAE on these embeddings, and extract human-readable
              features using the same initial ASAP dataset for Q3 via our
              interpretability pipeline with GPT-4o.
            </li>
          </ol>

          <p className="mb-4">A few notes:</p>
          <ul className="list-disc ml-6 list-outside mb-4">
            <li className="mb-2">
              On 1), the input when fine-tuning the BERT model is the{" "}
              <em>entire</em> student response, which can be multiple sentences.
            </li>
            <li className="mb-2">
              On 2), we use MiniPile because we need a sufficiently large
              dataset to minimize the loss for the SAE (a dream case would be to
              train the SAE purely on student work / work specific to Q3, but we
              don't have enough data here).
            </li>
            <li>
              On 3), we run the interpretability pipeline on the domain-specific
              dataset so that only features that can be found in student
              responses for Q3 will be identified (we use Linus' confidence
              calculation as a metric for which features were successful vs not
              successful)
            </li>
          </ul>

          <h4 className="text-xl font-bold mt-4 mb-2">Interactive results</h4>

          <p className="mb-4">
            Figure A shows an interactive interface of the features{" "}
            <em>most indicative</em> of a poor (score 0), average (score 1), or
            good (score 2) student response. I define <em>most indicative</em>{" "}
            as having a 'confidence' of over 50%, with at least 50% of the
            highest-activating samples coming from the respective
            poor/average/good responses.
          </p>
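          <p className="mb-4">
            In code, that filter might look something like the sketch below.
            The function and argument names are hypothetical, and I'm assuming
            each feature comes with a confidence value between 0 and 1 plus the
            grades of its highest-activating samples.
          </p>
          <pre className="bg-white rounded-lg p-4 shadow-sm mb-4 overflow-x-auto text-sm font-mono">
            <code>{`def is_indicative(confidence, sample_scores, target_score, threshold=0.5):
    # confidence: the feature's interpretability confidence, in [0, 1].
    # sample_scores: grades (0, 1 or 2) of the feature's highest-activating samples.
    if not sample_scores:
        return False
    share = sum(s == target_score for s in sample_scores) / len(sample_scores)
    return confidence > threshold and share >= threshold

# A hypothetical feature whose top samples are mostly score-2 responses.
print(is_indicative(0.8, [2, 2, 1, 2], target_score=2))  # True
print(is_indicative(0.4, [2, 2, 2, 2], target_score=2))  # False: confidence too low
`}</code>
          </pre>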

          <p className="mb-4 w-full text-center">
            <button
              className="bg-white text-black border border-black py-2 px-4 hover:bg-black hover:text-white"
              onClick={() => navigate("/auto-ed-coder/features")}
            >
              See Feature Viewer
            </button>
          </p>
          <figure className="mb-4">
            <img
              src={InterfaceOne}
              alt="Figure A: Interactive interface of features"
              className="w-full"
            />
            <figcaption className="text-center text-sm mt-2">
              Figure A: Interactive interface of features
            </figcaption>
          </figure>

          <p className="mb-4">
            If you're curious how any arbitrary response relates to the
            identified features, Figure B shows a second interface where you can
            input some response, and we'll show you the highest-activating
            features based on your unique response!
          </p>

          <p className="mb-4 w-full text-center">
            <button
              className="bg-white text-black border border-black py-2 px-4 hover:bg-black hover:text-white"
              onClick={() => navigate("/auto-ed-coder")}
            >
              Try your own response!
            </button>
          </p>

          <figure className="mb-4">
            <img
              src={InterfaceTwo}
              alt="Figure B: Try your own response!"
              className="w-full"
            />
            <figcaption className="text-center text-sm mt-2">
              Figure B: Try your own response!
            </figcaption>
          </figure>

          <h4 className="text-xl font-bold mt-4 mb-2">Some thoughts</h4>

          <ol className="list-decimal list-outside mb-4 ml-6 gap-8 flex flex-col">
            <li className="mb-2">
              <strong>
                Meaningful signals indicative of poor/average/good responses
                were identified by SAEs
              </strong>
              <br />
              <p className="mt-2">
                Out of the first 500 features we extracted in part 3), the SAE
                identified 27 poor features, 11 average features and 43 good
                features. Generally, the identified indicators made semantic
                sense. It was actually insightful, and interesting to see what
                the SAE identified as 'strongest signals'. To recap, students
                were asked a two-part question: (1) what are the similarities
                between koalas and pandas, (2) what are the differences with
                pythons?
              </p>
              <br />
              <p className="mt-2">
                <em>Good features</em> are largely related to describing pythons
                as generalists, specifically related to two things: (1)
                adaptability in different habitats, and (2) their diverse diets.
                ~80% of features were of this nature, while a handful were
                related to the <strong>specific comparison</strong> between
                generalists and specialists.
              </p>
              <p className="mt-2">
                Could we infer that the strongest signal of a good response is
                that it "made it" to question (2), answered it correctly (the
                generalist distinction is the best-supported answer) and with
                detail, and also made explicit comparisons with the specialist
                nature of koalas/pandas?
              </p>
              <br />
              <p className="mt-2">
                <em>Average features</em> are more diverse, but mostly seemed to
                focus on pandas and koalas having specialized, exclusive diets.
                These features felt a little more arbitrary; one was a 'Direct
                statement about python' which referred to short, un-elaborated
                statements on pythons being generalists.
              </p>
              <p className="mt-2">
                Could we infer that average responses had less of an
                idea-centric through-line, and instead, contained structurally
                weak samples that seemed to lack detail / were short and direct?
                Additionally, perhaps the strongest signal of an average
                response was a good answer to question (1)?
              </p>
              <br />
              <p className="mt-2">
                <em>Poor features</em> were mostly related to two things: (1)
                the harmlessness of pandas and koalas (how, unlike pythons, which
                are invasive, they are harmless), and (2) pandas and koalas both
                being referred to as <em>bears</em>.
              </p>
              <p className="mt-2">
                (1) confused me at first because pandas and koalas are harmless,
                compared to pythons. But, a key part of the prompt was "Support
                your response with information from the article", and nowhere
                in the article does it mention that pandas/koalas are harmless.
                The article-supported difference was pythons being generalists.
                (2) makes sense for koalas, which are factually not bears
                (though giant pandas are).
              </p>
            </li>
            <li className="mb-2">
              <strong>There were a lot of overlapping features.</strong>
              <br />
              ~80% of the features indicative of a 'good' response were
              basically the same thing. Even if there were nuanced differences,
              they weren't clear either. It may be interesting to quantify
              'similar features' by calculating the % of overlapping
              highly-activating
              samples. This is a similar concept to a new idea in recent
              SAE/embedding literature called 'feature families', whereby a
              graph-based/DFS technique was proposed to find overlapping
              features. It may be interesting to apply this here given the
              number of similar features.
            </li>
            <li className="mb-2">
              <strong>
                Nuanced features, particularly structural, were difficult to
                extract.
              </strong>
              <br />
              Alternatively, one could argue that these features were still
              different, but our automated interpretability pipeline limited the
              extraction of more nuanced, structural features. Firstly, the
              prompt used to compare different samples is more focused on
              content-similarities related to the ideas mentioned in the text.
              But, the auto-grader could very much be looking at other factors
              such as the length of the text, stylistic grammar features,
              spelling, and tonality.
              <p className="mt-2">
                Is there a way to encourage the pipeline to extract more
                structural features without overly biasing these features?
              </p>
            </li>
            <li>
              <strong>
                Compared to the pre-trained BERT model, a lot more{" "}
                <em>indicative</em> features were identified.
              </strong>
              <br />
              When looking at the pre-trained BERT model's SAE, it identified 1
              poor feature, 40 average features and 2 good features. When it
              comes to enabling domain-specificity for SAEs, I think there are
              two main methods:
              <p className="mt-2">
                (1) You can change the dataset. SAEs learn feature vectors
                (weights of the decoder model) that best enable reconstruction
                of the original embedding <em>over</em> some particular dataset.
                If your dataset is filled with code, and your embedding
                effectively encodes code-related details, the SAE will learn
                these domain-relevant features.
              </p>
              <p className="mt-2">
                (2) You can change the embedding model. If you fine-tune some
                embedding model that selectively encodes domain-relevant
                features of some text, the SAE will learn features that help it
                re-construct these embeddings (with these domain-relevant
                features).
              </p>
              <br />
              <p className="mt-2">
                The pre-trained BERT model arbitrarily encodes generic
                information about the text. Since we embed 1 million sentences
                from MiniPile, it is likely that nothing really that specific
                about koalas / pandas / pythons, especially in relation to the
                question, was encoded. Therefore, features related to this
                domain were not learned, and features skewed towards a score of
                0, 1 or 2 were not identified. This is in contrast to learning
                how to re-construct an embedding that is explicitly fine-tuned
                to encode indicators of a good, average or bad response as part
                of a predictor model: even though the SAE was trained over a
                generic MiniPile dataset, the nature of the specific embedding
                model led it to extract specific, domain-relevant features.
              </p>
            </li>
          </ol>
        </div>
      </div>
    </div>
  );
}

export default AutograderSAE;
