MLG 035 Large Language Models 2

May 08, 2025

At inference, large language models use in-context learning with zero-, one-, or few-shot examples to perform new tasks without weight updates, and they can be grounded with Retrieval Augmented Generation (RAG) by embedding documents into vector databases for real-time factual lookup via cosine similarity. LLM agents extend this further, autonomously planning, acting, and using external tools via orchestrated loops with persistent memory. Recent benchmarks such as GPQA (STEM reasoning), SWE Bench (agentic coding), and MMMU (multimodal college-level tasks) measure performance, alongside prompt engineering techniques such as chain-of-thought reasoning, structured few-shot prompts, positive instruction framing, and iterative self-correction.


Resources
Stanford CS336 Language Modeling from Scratch
Hands-On Large Language Models: Language Understanding and Generation 1st Edition
Hugging Face NLP Course
Coursera Generative AI with Large Language Models


Show Notes

Build the future of multi-agent software with AGNTCY.

In-Context Learning (ICL)

  • Definition: LLMs can perform tasks by learning from examples provided directly in the prompt without updating their parameters.
    • Types:
      • Zero-shot: Direct query, no examples provided.
      • One-shot: Single example provided.
      • Few-shot: Multiple examples, balancing quantity with context window limitations.
    • Mechanism: ICL works through analogy and Bayesian inference, using examples as semantic priors to activate relevant internal representations.
    • Emergent Properties: ICL is an "inference-time training" approach, leveraging the model's pre-trained knowledge without gradient updates; its effectiveness can be enhanced with diverse, non-redundant examples (a minimal prompt-assembly sketch follows this list).
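
To make the few-shot structure concrete, here is a minimal sketch of assembling a few-shot prompt as a plain string. The translation task and example pairs are illustrative assumptions, not from the episode; the resulting string can be sent to any chat/completions API, and no weights are updated.

```python
# Assemble a few-shot (in-context learning) prompt; task and examples are illustrative.
examples = [
    ("Translate 'hello' to French.", "Bonjour"),
    ("Translate 'goodbye' to French.", "Au revoir"),
]
query = "Translate 'thank you' to French."

parts = [f"Example {i}:\nInput: {inp}\nOutput: {out}"
         for i, (inp, out) in enumerate(examples, start=1)]
parts.append(f"Input: {query}\nOutput:")
prompt = "\n\n".join(parts)
print(prompt)  # send to any chat/completions API; the model's weights never change
```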

Retrieval Augmented Generation (RAG) and Grounding

  • Grounding: Connecting LLMs with external knowledge bases to supplement or update static training data.
    • Motivation: LLMs’ training data becomes outdated or lacks proprietary/specialized knowledge.
    • Benefit: Reduces hallucinations and improves factual accuracy by incorporating current or domain-specific information.
  • RAG Workflow:
    1. Embedding: Documents are converted into vector embeddings (using sentence transformers or representation models).
    2. Storage: Vectors are stored in a vector database (e.g., FAISS, ChromaDB, Qdrant).
    3. Retrieval: When a query is made, relevant chunks are extracted based on similarity, possibly with re-ranking or additional query processing.
    4. Augmentation: Retrieved chunks are added to the prompt to provide up-to-date context for generation.
    5. Generation: The LLM generates responses informed by the augmented context.
    • Advanced RAG: Includes agentic approaches - self-correction, aggregation, or multi-agent contribution to source ingestion - and can integrate external document sources (e.g., web search for real-time info, or custom datasets for private knowledge). The basic workflow above is sketched in code below.
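
A minimal sketch of steps 1-4 above, assuming the sentence-transformers and faiss-cpu packages; the model name and toy documents are illustrative. Normalizing embeddings makes the inner product equal to cosine similarity, so an inner-product index performs cosine-similarity retrieval.

```python
# Minimal RAG sketch: embed, store, retrieve, augment. Names are illustrative.
import faiss
from sentence_transformers import SentenceTransformer

docs = [
    "FAISS supports exact and approximate nearest-neighbor search.",
    "Qdrant is a vector database with payload filtering.",
    "Re-ranking orders retrieved chunks by relevance before augmentation.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(docs, normalize_embeddings=True)  # step 1: embed (unit vectors)

index = faiss.IndexFlatIP(doc_vecs.shape[1])              # step 2: inner-product index
index.add(doc_vecs)

query = "Which library does nearest-neighbor search?"
q_vec = model.encode([query], normalize_embeddings=True)
scores, ids = index.search(q_vec, 2)                      # step 3: top-k similar chunks

context = "\n".join(docs[i] for i in ids[0])              # step 4: augment the prompt
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

The augmented prompt then goes to the LLM for step 5 (generation); a production pipeline would add chunking, re-ranking, and query rewriting around this core.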

LLM Agents

  • Overview: Agents extend LLMs by providing goal-oriented, iterative problem-solving through interaction, memory, planning, and tool usage (a minimal loop sketch follows this list).
  • Key Components:
    • Reasoning Engine (LLM Core): Interprets goals, states, and makes decisions.
    • Planning Module: Breaks down complex tasks using strategies such as Chain of Thought or ReAct; can incorporate reflection and adjustment.
    • Memory: Short-term via context window; long-term via persistent storage like RAG-integrated databases or special memory systems.
    • Tools and APIs: Agents select and use external functions - file manipulation, browser control, code execution, database queries, or invoking smaller/fine-tuned models.
  • Capabilities: Support self-evaluation, correction, and multi-step planning; allow integration with other agents (multi-agent systems); face limitations in memory continuity, adaptivity, and controllability.
  • Current Trends: Research and development are shifting toward these agentic paradigms as LLM core scaling saturates.
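
A minimal sketch of the agent loop (reason, act via a tool, observe, repeat). The llm function and the tool registry are hypothetical stand-ins rather than any specific framework's API; llm is scripted here so the example runs end to end.

```python
import json

def llm(messages):
    """Stand-in for a chat-completion call (hypothetical). A real agent would send
    `messages` to a model; here two turns are scripted for demonstration."""
    if not any(m["content"].startswith("Observation:") for m in messages):
        return json.dumps({"tool": "calculator", "input": "6 * 7"})
    return "The answer is 42."

TOOLS = {
    "calculator": lambda expr: str(eval(expr)),  # demo only; never eval untrusted input
    "search": lambda q: f"(top snippets for {q!r})",
}

def run_agent(goal, max_steps=5):
    messages = [{"role": "system", "content": "Use tools when needed."},
                {"role": "user", "content": goal}]
    for _ in range(max_steps):
        reply = llm(messages)                     # reason: final answer or tool request
        try:
            call = json.loads(reply)              # JSON means a tool request
        except json.JSONDecodeError:
            return reply                          # plain text means a final answer
        obs = TOOLS[call["tool"]](call["input"])  # act: invoke the chosen tool
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": f"Observation: {obs}"})  # observe
    return "Step limit reached."

print(run_agent("What is 6 times 7?"))  # -> The answer is 42.
```

Persistent memory would be layered on top, e.g. by summarizing `messages` into a store and retrieving from it on later runs.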

Multimodal Large Language Models (MLLMs)

  • Definition: Models capable of ingesting and generating across different modalities (text, image, audio, video).
  • Architecture:
    • Modality-Specific Encoders: Convert raw modalities (text, image, audio) into numeric embeddings (e.g., vision transformers for images).
    • Fusion/Alignment Layer: Embeddings from different modalities are projected into a shared space, often via cross-attention or concatenation, allowing the model to jointly reason about their content (sketched in code after this list).
    • Unified Transformer Backbone: Processes fused embeddings to allow cross-modal reasoning and generates outputs in the required format.
  • Recent Advances: Unified architectures (e.g., GPT-4o) use a single model for all modalities rather than switching between separate sub-models.
  • Functionality: Enables actions such as image analysis via text prompts, visual Q&A, and integrated speech recognition/generation.
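
A toy PyTorch sketch of the fusion/alignment idea, with illustrative dimensions: a linear projector maps vision-encoder embeddings into the LLM's embedding space, and the sequences are concatenated so the backbone's self-attention spans both modalities.

```python
import torch
import torch.nn as nn

d_vision, d_model = 1024, 4096              # illustrative encoder/backbone widths

projector = nn.Linear(d_vision, d_model)    # the fusion/alignment ("adapter") layer

image_patches = torch.randn(1, 256, d_vision)  # stand-in for vision-transformer output
text_embeds = torch.randn(1, 32, d_model)      # stand-in for LLM token embeddings

# Project image embeddings into the LLM's space, then concatenate along the
# sequence axis; backbone self-attention then spans image and text positions alike.
fused = torch.cat([projector(image_patches), text_embeds], dim=1)
print(fused.shape)  # torch.Size([1, 288, 4096])
```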

Advanced LLM Architectures and Training Directions

  • Predictive Abstract Representation: Incorporating latent concept prediction alongside token prediction (e.g., via autoencoders).
  • Patch-Level Training: Predicting larger “patches” of tokens to reduce sequence lengths and computation.
  • Concept-Centric Modeling: Moving from next-token prediction to predicting sequences of semantic concepts (e.g., Meta’s Large Concept Model).
  • Multi-Token Prediction: Training models to predict multiple future tokens for broader context capture (a toy sketch follows).
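
A toy PyTorch sketch of the multi-token prediction idea, assuming a stand-in backbone and k parallel prediction heads, where head i is trained to predict the token i+1 steps ahead; real systems attach such heads to a full transformer trunk.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, d_model, k = 100, 64, 4                     # k future tokens per position

backbone = nn.Embedding(vocab, d_model)            # stand-in for a transformer trunk
heads = nn.ModuleList([nn.Linear(d_model, vocab) for _ in range(k)])

tokens = torch.randint(0, vocab, (2, 16))          # (batch, seq)
h = backbone(tokens[:, :-k])                       # hidden states, k tokens of lookahead

# Head i predicts the token i+1 steps ahead; the k losses are summed.
loss = sum(
    F.cross_entropy(
        heads[i](h).transpose(1, 2),               # (batch, vocab, positions)
        tokens[:, i + 1 : tokens.size(1) - k + i + 1],
    )
    for i in range(k)
)
loss.backward()
```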

Evaluation Benchmarks (as of 2025)

  • Key Benchmarks Used for LLM Evaluation:
    • GPQA (Diamond): Graduate-level STEM reasoning.
    • SWE Bench Verified: Real-world software engineering, verifying agentic code abilities.
    • MMMU: Multimodal, college-level cross-disciplinary reasoning.
    • HumanEval: Python coding correctness.
    • HLE (Humanity's Last Exam): Extremely challenging, multimodal knowledge assessment.
    • LiveCodeBench: Coding with contamination-free, up-to-date problems.
    • MLPerf Inference v5.0 Long Context: Throughput/latency for processing long contexts.
    • MultiChallenge: Conversational AI with multi-turn dialogue and in-context reasoning.
    • TAU-bench/BFCL: Tool utilization in agentic tasks.
    • TruthfulQA: Measures factual accuracy and robustness against misinformation.

Prompt Engineering: High-Impact Techniques

  • Foundational Approaches (combined in the template sketch after this list):
    • Few-Shot Prompting: Provide pairs of inputs and desired outputs to steer the LLM.
    • Chain of Thought: Instructing the LLM to think step-by-step, either explicitly or through internal self-reprompting, enhances reasoning and output quality.
    • Clarity and Structure: Use clear, detailed, and structured instructions - task definition, context, constraints, output format, use of delimiters or markdown structuring.
    • Affirmative Directives: Phrase instructions positively (“write a concise summary” instead of “don’t write a long summary”).
    • Iterative Self-Refinement: Prompt the LLM to review and improve its prior response for better completeness, clarity, and factuality.
    • System Prompt/Role Assignment: Assign a persona or role to the LLM for tailored behavior (e.g., “You are an expert Python programmer”).
  • Guideline: Regularly consult official prompting guides from model developers as model capabilities evolve.
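
An illustrative template combining several of these techniques: role assignment, structured instructions with delimiters, positive framing, and a self-refinement follow-up. All contents are assumptions to adapt against your model's official prompting guide.

```python
system_prompt = "You are an expert Python programmer."  # role / persona assignment

user_prompt = (
    "## Task\n"
    "Write a concise summary (3 sentences) of the code between the delimiters.\n\n"
    "## Constraints\n"
    "- Formal tone, plain prose, no headings in your answer.\n"
    "- Define any jargon you must use.\n\n"   # positive framing, not "don't use jargon"
    "## Code\n"
    "'''\n"
    "def f(xs): return sum(x * x for x in xs)\n"
    "'''\n"
)

refine_prompt = (  # iterative self-refinement, sent as a follow-up turn
    "Review your previous answer for factual accuracy, clarity, and completeness; "
    "identify any weaknesses and provide an improved response."
)
```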

Trends and Research Outlook

  • Inference-time compute is increasingly important for pushing the boundaries of LLM task performance.
  • Agentic LLMs and multimodal reasoning represent the primary frontiers for innovation.
  • Prompt engineering and benchmarking remain essential for extracting optimal performance and assessing progress.
  • Models are expected to continue evolving with research into new architectures, memory systems, and integration techniques.

Transcript
Machine Learning Guide, episode 35: large language models, part two. We're picking up with test-time scaling, inference-time compute, inference-time training, reasoning, that umbrella category of territory. The last thing we covered was chain of thought, where you prompt a model to think step by step, either as a single prompt, so that the output includes in its tokens the step-by-step thinking process, which drastically improves the accuracy of the outputs, or as part of the model machinery. I don't know that it's baked into the model architecture per se, but it may be tacked on technologically after deployment, causing the model to reprompt itself with continued chains of thought until it gets the answer confidently. And now we're going to move on to ICL: in-context learning. We've already talked about this. It's zero-shot, one-shot, and few-shot prompting, giving it some examples to work with. You say: here's an unstructured text blob of trucking data (something a trucking AI product might do), and here's a JSON output that I would have converted it to; now here's a new text blob of trucking data, now you go. That would be a one-shot prompt. So this is called in-context learning: learning the task it needs to perform from the prompt itself, rather than from the training data in the SFT training phase. It's the ability to learn during inference by being presented simply with a prompt containing a few demonstrations, input-output examples of the task, without any updates to the model weights, no gradient descent. The thinking behind how this works: obviously the model is leveraging vast knowledge and patterns learned during pre-training, but the thinking on this specific structure is that the examples provide a semantic prior, as in Bayesian inference, and guide the LLM to identify and activate relevant latent representations or concepts learned during pre-training that correspond to the demonstrated task. So it's learning by analogy from the context. The Bayesian-inference perspective suggests that the model uses prompt examples as evidence to infer the underlying task or concept, sharpening its posterior distribution over the possible concepts. It's slick stuff, pretty magical stuff. Okay, the different kinds of in-context learning. There's zero-shot, which means no examples: you just prompt it and it gives a response. One-shot, which is the example I gave. And few-shot: you give it any number of examples, and of course the more examples you give it the better, but you have to take the context window into consideration when you're providing these one- or few-shot examples, because you don't want to eat into the token context budget. If you can only work with 32,000 tokens, you want to be really careful what you stuff into that initial prompt. These days Gemini 2.5 Pro has a million-token context window, and I believe the most recent OpenAI models do as well, so you don't have to be quite as careful with context length as you did in the past. In fact, a lot of times I will just stuff in entire files using Roo Code: I uncheck the option that limits the context of a file to 500 lines of code (the default) when I'm using Gemini 2.5, and I just provide the whole thing. Context windows are really improving these days. You want to make sure, if you're providing few-shot prompts, that there's a lot of diversity in the examples, because if they're too similar it will
overfit, so it won't be able to generalize as effectively. So this was all under the umbrella term of reasoning and in-context performance. Another set of terms you'll see is test-time compute, test-time scaling, and inference-time training; especially with ICL, they call it inference-time training. These terms imply that there's training or learning at inference time. What's really happening is the model is learning on the fly from the examples provided, or learning from itself in a chain-of-thought reasoning sequence, almost like an attention mechanism under the hood; it's not modifying the weights or performing backpropagation and gradient descent. Now, there's a few things to note. There's the concept of emergent abilities: LLMs start to exhibit complex reasoning behaviors and generalizable knowledge patterns outside of the training data. These test-time compute or inference-time training techniques (they go by so many different names) capitalize on the emergent abilities, but they don't require that the emergent abilities be present. So in smaller models that are less equipped to exhibit these emergent abilities, adding inference-time compute can still improve performance. Chain of thought and reasoning and so forth are themselves emergent abilities, and using chain of thought on various other emergent abilities that manifest within LLMs improves performance. So you'll see Sam Altman and various others discussing the future potential growth and research avenues of LLMs outside of training and fine-tuning. A lot of people criticize LLMs, saying that we're reaching some sort of ceiling, or tapering off with diminishing returns. But even if that's true, which I don't believe it really is (the models are improving more and more), even if it's true that fine-tuning a model and scaling its size can only get us so far, the emergent abilities being found more and more, and the ways to explore optimization of inference-time compute, are all still frontier research work. The sky's the limit; so much is yet to be unveiled, and the amount of performance these techniques add to the base LLM is so significant that we can see a continued trajectory of improvement in LLMs along the same trajectory we've been seeing through the fine-tuning process. So this is the stuff to really keep a close eye on. Okay: grounding LLMs. RAG, retrieval augmented generation, is embedding your documents and then putting them in a document store, a vector database, for later lookup, using cosine similarity to find relevant passages to whatever it is you're asking right now. We've talked about RAG a fair bit in the past, but the general concept is called grounding: grounding LLMs. It's the idea of connecting the model to an external knowledge base. I didn't know that's what it was called, grounding. If you use Gemini AI Studio or Gemini Advanced, you'll see there's a checkbox if you want it to search Google for the question you're asking, rather than just using its training data for knowledge. It will say "ground it in Google Search," and that means it's going to use the Google Search database as its vector store, pull out the relevant passages using RAG, and add the relevant snippets, you know,
just paragraphs and sentences, not the entire document, typically, to the context window as it's answering the question, so it can reflect back on them in its next-token generation. Oh, and also, that is the difference between Google Deep Research and Google NotebookLM; it took me a while to understand exactly what NotebookLM is. With Deep Research in Google Gemini, you ask it a question and it will use tons and tons of web resources to think deeply about a topic before answering the question, and it'll give you a giant report and an audio snippet that you can listen to in podcast form. With NotebookLM, instead of grounding on the internet, you ground with documents that you upload to NotebookLM. So you can upload PDFs and CSVs and whatnot, and then ask questions of the research papers or whatever it is you're asking questions of, and it'll be more isolated; it's not going to go out to the internet. It's almost like it's fine-tuning on the small data you provide it, but it's not really; it's just using those documents via RAG to ground it to a specific topic. So the big benefit of grounding LLMs: obviously their knowledge is frozen at the time of training, and a lot of them will say what their training-data cutoff is. We've seen this problem a lot in the last three episodes on vibe coding, because so many libraries and frameworks have newer versions, possibly with major breaking-change releases, since the last time training was run. It's a big problem: you need to ground your vibe-coding agent in some sort of knowledge base with the frameworks you're using, so it has up-to-date information. So that's the big issue with LLMs: old data. They lack access to real-time information because of their cutoff date, or to specialized proprietary knowledge not present in the training corpus. As a result, responses can be outdated, or there can even be factual inaccuracies; these are hallucinations. So RAG, retrieval augmented generation, is the primary tool used for improving factual accuracy, reducing hallucinations, providing up-to-date information, and providing focus on specific topics and domain-specific private knowledge bases, like enterprise documents and technical manuals, all without having to fine-tune your large language model on specific data. Okay, a recap on how RAG works. The first step is to save your documents to a vector store, a vector database. There is a Hugging Face library called Sentence Transformers, and among transformers there's a distinction between representation language models and generative language models. Generative language models generate text token by token. Representation models take in text and convert it to some latent-space embedding. It's almost as if you had an input sentence that's supposed to produce an output sentence, and it goes through some series of steps, but the second half of the model is chopped off, and the middle thing it outputs is the latent-space representation of the input sequence. That is called an embedding, and an embedding is a vector in vector space, a series of numbers, and you can do all sorts of math with it. You can compare it to other sentences based on similarity, using cosine similarity; you can chop a paragraph into multiple sentences and embed each sentence. Then you take all of these embeddings and stuff them into a vector database. A common one is FAISS; there is ChromaDB; there is Qdrant, Q-D-R-A-N-T.
There are all sorts of different vector databases. So that's the first step: vectorize your text, whether it's your code base or your documents or, in the case of Google Search grounding, the internet. The next step is to retrieve. When a user submits a query, the query is used to search the external knowledge source. It goes against the vector database using cosine similarity, to find similarity between the query and anything relevant to that query in the knowledge base, and it pulls out the top-k most semantically relevant chunks. There may be an additional step here, which is re-ranking. Say it takes the top 10 most relevant chunks and then goes through a second pass to order them by relevance, then determines how much of that to keep, maybe based on the context window or some threshold cosine similarity; they call that re-ranking. These chunks are formatted and combined into the original user's query, and this is called the augment step. So: retrieve, augment, generate. The first step is retrieve the documents; the second is augment the user query with these chunks; and the last step is generate, generating a response with the augmented context. The retriever tends to be the most sophisticated part of the architecture. Sometimes they don't just use cosine-similarity searches; sometimes they use some hybrid of keyword searching, TF-IDF, or other variants, and of course there's the re-ranking mechanism. Some RAG frameworks are super sophisticated. They may transform the user's query in advance, either to break a complex question down into multiple parts or to perform basic prompt engineering on the query. Then on the retrieval step, maybe they will summarize the chunks so they don't take up too much context window, or reduce redundancies that exist across multiple chunks from related documents. There's this whole concept of agentic RAG, where it will do self-correction and self-critiquing and synthesize metadata and synthetic data and so forth. By the way, in the past few episodes I talked about using a custom RAG MCP for modern frameworks and libraries and such; there's a popular MCP out there now called Context7. It's a RAG tool: people submit GitHub repos to Context7's website, it scrapes those repos for the code files and markdown files, and then it makes them available as RAG to your Roo Code or Cursor agent as an MCP server. Okay: LLM agents. Agents, they're the coolest thing. What we're using in vibe coding, that's an agent. There are agents all over the place now. We'll talk about what agents are in a bit, but something that's really important to discuss with agents: back to that topic of LLMs hitting a ceiling. I've seen a lot of chatter online, people saying Gemini, OpenAI, we've got the best of the best and nothing's happening, that the AI revolution is dead in the water. Well, that's no longer the domain of LLMs alone. Now the impetus is on improving integration: getting LLMs to work with systems, APIs, MCPs, and with each other. You see this obviously with vibe coding, per the last few episodes. So agentic LLMs are a relatively new phenomenon in the grand scheme of LLMs, strides are being made very rapidly, and there is a large window of opportunity and growth and expansion in this arena.
The improvements in LLM space may reasonably come to a close here in the near future, and the big improvements now are going to shift over to agents: LLMs performing actions in the world. So, LLM agents. These are systems that leverage an LLM as a central reasoning engine, or brain, to autonomously interact with their environment. They make decisions, they use tools, they pursue complex multi-step goals. Compared to a standard LLM that just takes a prompt and generates a response, an agent operates within a loop. It involves planning, acting, observing the results, and refining its approach toward an objective. The main components of LLM agents are the core LLM, which is the brain or reasoning engine; a planning module; memory; and tools. The core LLM acts as the brain or reasoning engine, and it provides understanding, reasoning, and decision-making capabilities. It interprets your goal, analyzes the current state, and decides what action to take next. So that's what you interface with first, and that's what responds to you last. Then the planning module. These agents need to break down complex goals into smaller, manageable subtasks; the Orchestrator mode in Roo Code will do this. Techniques like chain of thought, or specialized prompting strategies like ReAct (reasoning and acting), are often used for planning. Advanced agents can also incorporate reflection, where they evaluate their past actions and current plan and adjust the strategy based on feedback or environmental changes. Agents need to extend beyond a single interaction, so they require memory. There's short-term memory, which is the context window of the LLM with the current task and conversation, and there's long-term memory, which is persisted storage, usually RAG-like with a vector database, so they can recall past interactions, learned information, user preferences, or successful strategies over extended periods. It could be RAG, or it could be some advanced tooling like RooFlow or SPARC: updating the readme file along the way in a vibe-coding session, maintaining a checklist that the agent can edit. And of course, tools outside of vibe coding will have their own memory systems. Agents are equipped with tools: API calls, native tools built into the agent (with Roo Code, it can read a file, write a file, open a browser, and so forth), MCPs (Model Context Protocol servers, of course), databases, code interpreters, and other specialized models. Things like Cursor and JetBrains AI might call smaller models, or models fine-tuned for particular tasks. These tools allow the agent to gather external information using RAG, perform calculations, execute code, and take actions in an external environment, and the LLM decides when to use a tool, which tool to use, and how to use it. So LLM agents obviously demonstrate capabilities beyond standard LLMs: advanced problem solving involving external interaction. They can do self-evaluation and correction; they can test generated code, for example. They can potentially collaborate in multi-agent systems where different agents take on special roles. So there are layers of agent interactions. At the base layer is just raw tool use, like edit file and read file. Above that, but before MCPs were introduced, was API calls; then MCPs were added, which streamlined APIs, or locally run servers operating as APIs, for things like RAG and so forth.
And then, as you heard in that ad, Outshift by Cisco is building multi-agent software called AGNTCY, an open-source collective building the internet of agents. This goes beyond calling MCP servers for one-off tool use, to multiple agents interacting with each other on the internet. And with all these layers you have to ask yourself: turtles all the way down, when does it stop? Maybe it never stops; maybe it just builds and builds and builds. But one of the Achilles' heels here is still memory. Even with context windows at a million tokens, if you're working with code bases over long periods of time, you have to come up with very sophisticated, clever ways of maintaining memory through time: your preferences, how to work with this code, what you did last, and so forth, plus long-range planning. Agents also struggle with adaptation to unexpected events, and with controllability issues. But agents are going to be the frontier of everything happening right now; everybody's been wanting to tap into this space and make big improvements. Multimodal LLMs: audio, video, image, not just text. This is a major frontier happening right now in LLMs, generating across multiple modalities. They call them multimodal large language models, MLLMs. For a while there was Stable Diffusion (I'll do an episode on diffusion models): you write a text prompt and it gives you an image. There are text-to-video generators like Veo 2. But multimodal LLMs go beyond that a little bit: they architecturally merge the different modalities, so one model can generate any number of output modalities or receive any number of input modalities. And that was the big deal with GPT-4o, lowercase o, o for omni; I think it stands for omni-modality. If you chat with ChatGPT-4o and you say "generate an image of blank," it will use the same model architecture it's using to chat with you to generate the image. That wasn't always the case. Previously it was using DALL-E, a task-specific text-to-image generator, a dedicated model; it would detect that you want an image and switch over, sending the prompt to DALL-E. Now their 4o model is an all-in-one model for generating images; I don't know about video and audio. With Gemini and OpenAI, they do support audio generation, text-to-speech and speech-to-text; obviously with the voice chatbots in both apps you can have a voice conversation, put in your headphones, go for a walk, and chat. I do believe that is part of the integrated model, the MLLM. The way these things work: they have modality-specific encoders, different encoders for the different raw-input modalities, which convert them to numerical representations, embeddings. They'll have a vision transformer, or a CNN, that encodes images; a transformer encodes text; and they have specialized models for audio. Then there is the fusion or alignment step, the middle area where the different modality representations are squished together. The different modalities are projected into a shared space where the model can reason about their relationships, so they're talking the same language.
At this point in the game, these techniques include projecting embeddings into a common dimension, using cross-attention mechanisms where one modality attends to another, or concatenating embeddings before feeding them into the main LLM backbone. The third step, the output, is unified processing or decoding over the fused representations. The fused representations are processed by a large transformer backbone, the LLM part, which learns cross-modal relationships. This cross-modal processing or decoding phase can now generate output in different modalities, text or image, and the self-attention mechanisms are adapted to allow attention across the different modalities. They're integrated such that attention spans the modalities, with different techniques for projecting the different modalities into embedding space, so that concepts are merged and the image is known about by the text. Recently GPT-4o really improved things, and you can see that with text, like captions in an image: image generation is now significantly improved, specifically with text, but the images themselves are improved as well. This also allows you to interact with the different modalities, so you can ask questions of images; they call this visual Q&A, because the model has knowledge in text land of things in that modality by way of the cross-attention mechanisms. There's still a lot of research on how to fuse the different data types, the representation fusion and the decoding processes, and how to get the cross-modal understanding working well, whether through cross-modal attention or other mechanisms. Okay, the future. The future of LLMs. Right now LLMs are trained as next-token predictors, and even with things like mixture of experts and MLLMs and all these augmentations to the architecture, fundamentally, at least in the pre-training phase, they are trained as next-token predictors. And we saw with RLHF that really nuanced, fundamental, latent, invisible attributes can be taught into the network architecture: concepts of abstract reasoning and long-range dependencies. Given all the augmentation of transformers through the years, with all the techniques we've discussed in this episode, there must be a way to bake that into the neural architecture before adding these augmentations. So they're exploring a bunch of different techniques here, and these are some of the advanced research areas. One is called predictive abstract representation: training models to predict latent concepts derived from techniques like sparse autoencoders, potentially mixing these continuous concepts with discrete token predictions; there's one called CoCoMix. There is a research avenue called patch-level training: aggregating multiple tokens into larger patches and training the model to predict the next patch, aiming to reduce sequence length and computational cost during the initial stages of training. This would be like speed reading. There's one called concept-centric modeling: architectures like Meta's proposed Large Concept Model, or LCM, aim to bypass tokens altogether, operating directly on sequences of semantic concept vectors derived from sentences. And then there's multi-token prediction: training models to predict multiple future tokens simultaneously, potentially capturing broader semantic context than single-token prediction. Okay: benchmarks, evaluating LLMs.
Whenever a frontier model comes out, they'll put out a table comparing their scores to other LLMs' scores, and there are so many benchmarks out there. So I asked Gemini: what are the top 10 for 2025, the least-common-denominator set that's currently used with the most recent models, including the hard, forward-looking ones? First we have GPQA Diamond. It benchmarks advanced reasoning in STEM: science, technology, engineering, and math. The description: graduate-level, Google-proof questions in physics, chemistry, and biology; the Diamond set is the most challenging. It assesses deep scientific reasoning and complex problem solving. Some popular models that routinely evaluate on it are Gemini 2.5 Pro, OpenAI o3-mini, Grok 3, and Claude Sonnet 3.7. So GPQA is for STEM and advanced reasoning. SWE-bench, SWE for software engineering: SWE-bench Verified is agentic coding, real-world software engineering; the Verified set is confirmed solvable by humans, and it assesses the entire agent system. This might be a good one for determining the right model to use in Roo Code and the like; I tend to use the Aider leaderboard, but this would be a good one as well. Key capabilities assessed: practical software engineering, code patching, and issue resolution. That's SWE-bench Verified. MMMU: multimodal understanding, college-level reasoning. It tests college-level questions across six disciplines using diverse interleaved image types and text, testing perception, knowledge, and reasoning. It's testing for multimodal data integration and expert-level cross-disciplinary reasoning. HumanEval: coding functional correctness. It tests generating functionally correct Python code snippets from docstrings, evaluated by unit tests; it's looking for basic code generation and algorithmic understanding. Then, among the up-and-coming ones, the one I hear about the most is Humanity's Last Exam, HLE. It tests advanced multimodal reasoning and broad knowledge: extremely challenging questions across many academic disciplines, designed to be Google-proof and to test the limits of AI versus expert human knowledge. It's looking for multimodal (text and image) expert-level difficulty, broad subject coverage, and resistance to simple retrieval. LiveCodeBench: holistic coding, contamination-free evaluation. It uses recent problems from coding contests and evaluates generation, self-repair, execution, and test-output prediction, and it aims for contamination-free evaluation; it's looking for continuous updates, holistic assessment, and contamination resistance. MLPerf Inference v5.0 Long Context: this tests specifically for performance over long contexts, processing efficiency; a standardized industry benchmark for LLM inference over long context, 128K (that's not that long), throughput, latency. MultiChallenge: conversational AI, multi-turn dialogue. Realistic multi-turn conversation, testing instruction following, context allocation, and in-context reasoning simultaneously; so this tests some of the capabilities we talked about with emergent abilities. And then BFCL and TAU-bench are for agent capabilities, tool utilization. Interesting: function/tool-calling accuracy across complex scenarios; TAU-bench covers agent interaction with APIs and users in specific domains like retail, complex tool interaction, multi-step tasks, API usage, and domain-specific agent behavior. That's BFCL and TAU-bench for tool
use. TruthfulQA: safety and trust, truthfulness, misinformation resistance. It measures the tendency to generate false or misleading info, especially around common misconceptions, with a binary-choice version to reduce heuristic exploitation. So: different benchmarks for different types of tasks, and they will typically average them all together for a sort of final score. If you're a coder like me, the ones to look out for are LiveCodeBench and SWE-bench. I'll post these in the show notes so you can look at them, and of course you'll see the specific benchmarks listed in the announcements of frontier models when they come out. Okay, prompt engineering. I decided I'm going to do a mini segment on prompt engineering; I may do a full episode later, but I did some digging around and I just wanted to capture the biggest bang for the buck: a handful of foundational prompting techniques used across the different frontier models, sort of the 80/20, what they have in common, the least common denominator. It turns out that so much of this was already covered in this episode. So one example of a best-bang-for-buck prompt-engineering technique is few-shot prompting, a.k.a. in-context learning; we already talked about that. Another example given is chain-of-thought prompting. For few-shot prompting (in-context learning) and chain-of-thought prompting, the guides do go into some structure for how you might provide the examples, with high specificity and clarity. One example: the prompt includes pairs of example inputs and their corresponding desired outputs before the actual query. A code-snippet-style example: you say "Example 1: Input: translate 'hello' to French. Output: Bonjour. Example 2: Input: ..." So clarity, an almost robotic use of the English language, and specificity go a long way. Something interesting you should take away from that: there's a real overlap between researchers at these companies coming up with techniques that improve the model's performance at inference time, and users taking away those techniques, whatever's published in some prompt-engineering-guidelines PDF by OpenAI for GPT, in their day-to-day usage. And these overlap significantly with the concept of emergent abilities. If I were to diagram this: LLMs develop emergent abilities (wow, that is magical; holy cow, that's so cool). A user and a model engineer should both keep an eye on what emergent abilities are discovered along the way. The user should implement them as prompt-engineering techniques in their day-to-day use, and the model engineer should implement them as techniques that get integrated into the model itself, either through the machinery of the model running its inference job or integrated into the model in some more holistic manner. The other technique this report shows is precision and structure in instructions. It says the cornerstone of all effective prompting is clarity, specificity, and the structure of the instructions provided to the LLM; vague or ambiguous prompts are a primary source of suboptimal or unpredictable LLM outputs. For 2025 models, while their general understanding has improved, precision remains paramount for complex and nuanced tasks. Give an explicit task definition: clearly state what the LLM should do, such as "summarize the following text." Provide detailed context and constraints:
provide the background information and the desired output format; specify the length and tone (formal, casual, or technical) and any constraints, like limiting the summary to 200 words or avoiding jargon. Use delimiters anywhere you can: triple backticks for code, markdown-style headers for grouping instructional context. That's an interesting one. Maybe there are some stepwise instructions to follow under a bolded point, and some guidelines to follow: providing a sort of structured document as the prompt, with, say, two hashtags indicating an H2 header, like "this section is the guidelines you want to follow," and then a written sentence under that. Or, if it's a stepwise series of instructions, you might want to use a numbered, ordered list. Adhering to the markdown standard, and chunking things into a structured-document format wherever there can be structure in your prompt, goes a really long way. I want to talk now about some more interesting ones that may not be obvious. Affirmative directives: frame instructions positively, stating what the model should do rather than what it should not do. So instead of saying "don't write a long summary," use "write a concise summary of three sentences." Again, specificity with the three sentences, but the bit here is: don't use negative language. And this one is so cool to me, because I see people posting images where the prompt is something like "draw a room with a window and a table, and whatever you do, do not draw a purple elephant," and it always draws a purple elephant; it can't help itself. This hearkens to human psychology and the language aspect of human psychology. A lot of self-help books will say positive framing of scenarios involves reframing something in a positive light ("I'm going to improve my health and wellness and finances") rather than negatively ("I'm going to stop eating junk food"), and evidently that makes a big difference. It also makes a big difference with the LLM. So use positive language rather than negative language. And then this one is super cool: iterative refinement and self-correction. You have it generate a response, and then you say, "review your previous answer for specific criteria, like factual accuracy, clarity, and completeness; identify any weaknesses and provide an improved response." So there's a level at which, rather than telling it how to fix whatever it just messed up, whether it's code or textual responses, you give it a little bit of autonomy and agency. You tell it to make improvements along certain guidelines, but without telling it what those improvements should be. I think it follows similar concepts to chain of thought with step-by-step thinking: it goes through and reflects on what could be changed, and improves on it with a bit of autonomy and creativity. It might be more thoughtful in the improvements and refinement than if you had been more specific, in which case it would do exactly what you said and maybe nothing else; maybe there was room for improvement outside the things you'd identified, or there are more clever improvements than you would have thought of. The report calls this a metacognitive ability. And then the last bang-for-buck prompt tip is the system prompt. Evidently a really solid system prompt goes a long way. They call it role assignment or persona prompting, where you say "act as an expert Python programmer" or "you are a helpful travel agent."
A lot of times, if you're using ChatGPT as a product, or if you're using Roo Code and its different modes like Code mode or Orchestrator, you're not controlling the system prompts; these things have the system prompt baked in. But if you're writing against an API, or writing your own mode for Roo Code, then you'll control the system prompt defining the role of the agent or the LLM, and getting that system prompt right is really crucial; it goes a long way. Beyond these few bang-for-buck items, you'll want to grab one of the frontier models' prompting-guidelines PDFs. They all release a big document with prompting guidelines, maybe once a year or once every big model, and they're pretty big, typically 50 or 60 pages. You can use AI to summarize one or pull out the most impactful points, but I do recommend reading at least one guide, from whichever is your current preferred frontier model, cover to cover, to really learn the ropes of prompt engineering, because it will improve your work with AI from here on out. It will also start your creative thinking process, connecting the dots of why these particular things are good prompting techniques versus what you were doing before, what makes a good prompt a good prompt, and you might come up with some clever strategies beyond the guidelines once you have a handful of these in your back pocket to reason from by analogy going forward. So that's all I've got for LLMs. That was a lot, and I'm sure there's so much more I haven't covered that exists today, and definitely a lot I haven't covered that's going to exist tomorrow, because this stuff, especially the agentic-LLM territory, is expanding rapidly. This would be a good time for an alternative podcast, one of the more news-and-interviews tech/AI podcasts like Practical AI, if you want to keep up with the latest and greatest. So I'll see you in the next episode.