MLA 022 Coding Agents: IDEs & Plugins

Feb 09, 2025 (updated Apr 11, 2025)
Anthropic's Claude 4 model delivers the highest quality code, making Claude Code the top choice for complex terminal tasks. For specific workflows, Aider's git-native agent and Roo Code's customizable IDE extension offer the best performance and control.

Vibe Coding Mini Series

Show Notes

This page is the updated report as of July 2025. The podcast episode itself is severely outdated.

AI coding tools are now agents that handle entire features, bug fixes, and refactors, moving beyond simple autocomplete. All modern agents share baseline features: full codebase context via repository maps, step-by-step planning for developer approval, the ability to execute file edits and shell commands, and "Bring Your Own Key" (BYOK) architecture for model flexibility.

Claude Code vs Cursor, Windsurf, Roo Code, Aider, Gemini CLI, GitHub Copilot, Jules, Cline

The main differentiators are workflow philosophy and model quality.

Which Tool to Use

  • For Terminal/CLI users:
    • Need git-native safety & audit trails? -> Aider
    • Need highest quality code output? -> Claude Code
    • Need a free tier for simple tasks? -> Gemini CLI
  • For IDE (VS Code) users:
    • Want to build custom agents? -> Roo Code
    • Need deep GitHub Issue/Action integration? -> GitHub Copilot
  • For large, delegated tasks (receive a PR):
    • Trigger via GitHub Issues? -> GitHub Copilot Agent
    • Trigger via command to a cloud service? -> Google Jules

Tool Analysis by Workflow

  • Terminal-Native (CLI)

    • Aider: Safest agent due to its deep git integration; every AI change is an auditable commit. Uses tree-sitter for superior structural context.
    • Claude Code: Produces the highest quality code via Anthropic's Claude 4 model. Best for complex tasks where output quality is paramount.
    • Gemini CLI: Best for experimentation due to a generous free tier of 1,000 requests/day. A simple wrapper for Google's model.
  • IDE-Native (VS Code)

    • Roo Code: Top choice for power users. Its "Custom Modes" feature allows building specialized AI personalities (e.g., security auditor, docs writer).
    • GitHub Copilot: Main advantage is deep integration with the GitHub platform. Can assign GitHub Issues directly to the @copilot agent.
    • Cline: A stable, reliable open-source agent that Roo Code forked from. A solid choice for users who prefer stability over extensibility.
  • Asynchronous (Delegation)

    • Google Jules: A "fire-and-forget" service. It clones a GitHub repo to a sandboxed Google Cloud VM, completes a task, and submits a pull request.
    • GitHub Copilot Agent: A similar async workflow, but works natively inside GitHub Actions and is triggered by assigning an Issue to @copilot.

Performance & Market Status

  • Performance Metric: SWE-bench, which tests an agent's ability to resolve real GitHub issues, is the best metric. HumanEval is a poor differentiator for top models.
  • Model Quality: Anthropic's Claude 4 models lead on quality and reasoning. On SWE-bench, Claude 4 Sonnet scores 80.2% and Opus scores 72.5%, surpassing Gemini 2.5 Pro (63.2%) and OpenAI's o3 (69.1%).
  • Google's Strategy: Google's tools (Jules, Gemini CLI, Gemini Code Assist) are functionally distinct but have confusing marketing that creates user friction.

Tools to Avoid

  • Cursor: Hostile business model. Restrictive usage limits on its paid "Pro" and "Ultra" plans interrupt work. Alternative: Roo Code.
  • Windsurf: Unacceptable platform risk. Its future is unknown after its acquisition by Cognition AI. Alternative: GitHub Copilot.

Long Version

AI assistance in software development has moved beyond suggesting single lines of code. Current tools operate as agentic partners that can take on entire features, bug fixes, and refactors. These agents read a project's context, understand user intent, and execute complex changes across a codebase.

This analysis provides a technical summary of these AI coding agents as of July 2025, focusing on factual capabilities to help engineers select the right tools.

Baseline Features for Modern AI Coding Agents

A core set of agentic capabilities is now the minimum requirement for any competitive tool.

Full Codebase Context

Agents must be able to reason about an entire project, not just open files. This is done using local file indexing, vector embeddings for semantic meaning, and repository maps (repo maps) that outline code structures and dependencies. Without this, an agent can only make single-file edits that risk breaking the larger system.
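To make this concrete, here is a minimal sketch of what a repository map can look like. It uses Python's built-in ast module for brevity (tools like Aider use tree-sitter to cover many languages); the repo_map helper is illustrative, not any tool's actual implementation.

```python
# Minimal repo-map sketch: walk a project and extract top-level signatures
# so an LLM can see the codebase's structure without reading every file.
# Uses Python's ast module for brevity; real agents use tree-sitter.
import ast
from pathlib import Path

def repo_map(root: str) -> str:
    lines = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8", errors="ignore"))
        except SyntaxError:
            continue  # skip files that do not parse
        lines.append(str(path))
        for node in tree.body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                args = ", ".join(a.arg for a in node.args.args)
                lines.append(f"  def {node.name}({args})")
            elif isinstance(node, ast.ClassDef):
                lines.append(f"  class {node.name}")
    return "\n".join(lines)

print(repo_map("."))  # a compact structural outline, suitable as LLM context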

Planning and Task Decomposition

An agent must analyze a user's request and generate a step-by-step plan of action. This plan, which shows which files will be modified, must be presented to the developer for approval. This "visible planning" provides control and transparency.
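A minimal sketch of this plan-then-approve loop, assuming a hypothetical plan_with_llm function and Step structure as stand-ins for a real model call and edit executor:

```python
# Sketch of "visible planning": the agent proposes steps, and nothing is
# executed until the developer approves the whole plan.
from dataclasses import dataclass

@dataclass
class Step:
    description: str
    files: list[str]  # files this step will modify

def plan_with_llm(request: str) -> list[Step]:
    # In a real agent, an LLM decomposes the request using the repo map.
    return [Step("Add validate_email() helper", ["utils.py"]),
            Step("Call it from the signup handler", ["views.py"])]

def run(request: str) -> None:
    plan = plan_with_llm(request)
    for i, step in enumerate(plan, 1):
        print(f"{i}. {step.description}  (edits: {', '.join(step.files)})")
    if input("Apply this plan? [y/N] ").strip().lower() != "y":
        print("Aborted; no files were changed.")
        return
    # ... execute each approved step here ...
```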

Tool Use and Execution

Agents must be able to interact with the developer's environment. Baseline capabilities include reading and editing files anywhere in the project, and executing shell commands such as tests, linters, and builds, with each action gated on developer approval.
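A sketch of what such a gated shell tool might look like; the approval prompt and run_shell helper are illustrative, not taken from any specific tool:

```python
# Gated shell tool: the agent may request commands (e.g., to run tests),
# but each one is shown to the developer before anything executes.
import subprocess

def run_shell(command: str) -> str:
    print(f"Agent wants to run: {command}")
    if input("Allow? [y/N] ").strip().lower() != "y":
        return "DENIED by user"
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

# e.g., the agent verifies its own edit by running the test suite:
print(run_shell("pytest -q"))
```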

"Bring Your Own Key" (BYOK) Architecture

Developer-focused tools use a BYOK model, which separates the agent tool from the underlying Large Language Model (LLM). Users provide their own API keys from providers like Anthropic, OpenAI, Google, or OpenRouter. This gives them control over model choice to balance cost, speed, and capability for different tasks and ensures the tool remains useful as new models are released.

Main Architectural Division: Interaction Modality

The primary differentiator among AI coding tools is their interaction modality: terminal, IDE, or asynchronous web service. This determines the tool's workflow and target user. Tools are generally designed either for the "inner loop" of iterative coding or the "outer loop" of larger, delegable tasks.

Terminal-Native Tools for Speed and Control

These tools augment command-line workflows for users who prioritize speed and scriptability.

IDE-Integrated Tools

These tools integrate AI directly into a graphical IDE like VS Code.

Asynchronous Agents for Delegation

These tools are for large "outer loop" tasks, acting as background workers that deliver a pull request on completion.

  • Google Jules: Clones a GitHub repository into a sandboxed Google Cloud VM, executes a task, and then submits a pull request. Its main feature is this hands-off model, which frees the developer's local machine and API rate limits for other work.
  • GitHub Copilot Agent: This is GitHub's version of asynchronous execution. It is triggered when a developer assigns a GitHub Issue to the @copilot user. The agent works in the background and delivers a pull request. Its advantage is its native integration with the GitHub Issues workflow.

ChatGPT Agent: A Generalist Tool for Coding

OpenAI's web-based ChatGPT Agent is a universal agent that can be used for coding. It has access to a visual web browser, a terminal, and APIs within its own sandboxed virtual computer. Its strength is its versatility and reasoning power, useful for novel problems that require switching between web research and code generation. Its high score on the FrontierMath benchmark reflects strong general reasoning. However, because it is not integrated with a local development environment, it is inefficient for iterative refactoring of existing codebases.

Performance Analysis: Claude 4 Models Lead on SWE-bench

Performance depends on both the underlying LLM's intelligence and the agentic framework's effectiveness. Real-world performance on complex tasks is more telling than simple benchmarks.

Benchmark Definitions

  • SWE-bench: Measures an agent's ability to resolve real GitHub issues from open-source projects end to end. It is the most realistic benchmark for agentic coding.
  • HumanEval: Measures single-function code generation (Pass@1). Top models now cluster near its ceiling, making it a poor differentiator.

Claude 4's Performance Lead

Data from benchmarks and user reports show that Anthropic's Claude 4 models are the current state-of-the-art.

  • Benchmark Results: Claude 4 Sonnet achieved an 80.2% resolution rate on SWE-bench with parallel reasoning, and Claude 4 Opus scored 72.5%. These scores are higher than competitors like Gemini 2.5 Pro (63.2%) and OpenAI's o3 (69.1%).
  • User Feedback: Developers report that Claude's output has better "taste," meaning it is more idiomatic, maintainable, and stylistically consistent. It requires fewer retries and shows a superior planning capability. Any tool using a Claude 4 model as its backend has an advantage in output quality.

The Agentic Multiplier Effect

The agentic framework is a critical performance multiplier. The SWE-bench leaderboard is dominated by combinations of a top-tier model (usually Claude 4 Sonnet) with a sophisticated open-source agentic framework like SWE-agent or OpenHands. This shows that the agent's ability to plan, use tools, and manage context is crucial. The strong performance of Aider across models demonstrates the quality of its git-native, tree-sitter-powered framework. The value of a tool comes from both its model and its own design.

| Tool | Primary Model Used | SWE-bench (% Resolved) | HumanEval (Pass@1) | Key Agentic Strength | Performance Summary |
| --- | --- | --- | --- | --- | --- |
| Claude Code | Claude 4 Opus/Sonnet | ~72.5% - 80.2% | ~92% | Superior planning, reasoning, and code quality from Claude 4 models. | Delivers the highest-quality, most reliable results due to its model and polished framework. |
| Aider | Model-agnostic (BYOK) | ~26.3% (with GPT-4o/Opus) | Varies by model | Deep git integration for atomic, auditable changes; tree-sitter for repo context. | Effective framework; performance is determined by the chosen model. A top contender when paired with Claude 4. |
| Gemini CLI | Gemini 2.5 Pro | ~63.2% | ~99% | 1M+ token context window and direct access to Google Search. | Capable model, but a less polished agent framework yields lower task success despite high HumanEval scores. |
| Codex CLI | OpenAI o4-mini/o3 | ~69.1% (o3) | ~80-90% | Flexible approval modes for granular control. | Competent and lightweight, but surpassed by Claude-powered tools on complex tasks. |
| Roo Code | Model-agnostic (BYOK) | Varies by model | Varies by model | Extensible "Custom Modes" for creating specialized agent personalities. | Solid agent core; power comes from user-defined specialization and model choice. |
| GitHub Copilot Agent | GPT-4.1, Claude, Gemini | Varies by model | Varies by model | Native integration with GitHub Issues and Actions workflows. | Strong performance, but its value is tied to a team's investment in the GitHub platform. |
| Google Jules | Gemini 2.5 Pro | Not publicly benchmarked | Not publicly benchmarked | Asynchronous execution in an isolated cloud VM. | Sound concept for "outer loop" tasks, but the lack of public benchmarks makes its effectiveness difficult to assess. |

Market Sentiment Analysis

This section validates developer sentiment on key tools.

Roo Code Is the Preferred Tool for Power Users

  • Verdict: Correct. Roo Code is the top choice for developers who want to customize their agent.
  • Analysis: Its "Custom Modes" feature lets users build specialized AI personalities (e.g., a security auditor or docs writer), extensibility that other IDE extensions lack. Combined with a transparent BYOK model, this makes it the natural landing spot for power users leaving Cursor and Windsurf.

Claude 4-Powered Tools Lead in Quality and Reliability

  • Verdict: Correct. Tools using Anthropic's Claude 4 models lead the market in code quality.
  • Analysis: Claude 4 models dominate the SWE-bench benchmark. User comparisons also find that Claude Code outperforms competitors like Gemini CLI in speed, cost, and final code quality. Developers report Claude's output has better "taste" and demonstrates superior planning. This may be due to a training focus on reasoning and safety, resulting in a lower tendency to hallucinate.

Google's Product Strategy Creates Confusion

  • Verdict: Correct. Google's tools are functionally distinct but poorly differentiated.
  • Analysis: Jules (asynchronous cloud agent), Gemini CLI (terminal agent), and Gemini Code Assist (IDE assistant) serve different workflows, yet overlapping branding makes it hard for users to tell which tool fits which job, creating adoption friction.

Early Market Leaders Have Lost Developer Trust

  • Verdict: Correct. Early leaders have broken user trust, and their advantages have been commoditized.
  • Analysis:
    • Cursor: Now considered unreliable due to its business model. Developers are frustrated by the opaque, user-hostile usage limits on its paid plans, which can block work arbitrarily.
    • Windsurf: A platform with an uncertain future after its acquisition by Cognition AI. The lack of a clear roadmap makes it a risky choice.
    • GitHub Copilot: Its position is shifting from innovator to incumbent. Its features have been replicated by other tools. Its main strength is now its deep integration into the GitHub platform. It is becoming the "good enough" default for enterprises, while power users migrate to Aider, Roo Code, and Claude Code.

Tools to Avoid: Cursor and Windsurf

Based on reliability and platform stability, two tools should be avoided as of July 2025.

Avoid: Cursor

  • Reason: Hostile business model. Cursor's technical features are negated by its business practices. The opaque and low usage limits on its paid "Pro" and "Ultra" plans make it an unreliable professional tool.
  • Alternative: Roo Code or Cline. These open-source VS Code extensions use a transparent BYOK model, giving developers control over costs and models. Roo Code offers extensibility, while Cline provides stability.

Avoid: Windsurf

  • Reason: Unacceptable platform risk. The chaotic acquisition of Windsurf by Cognition AI has left its future roadmap and support level unknown. This makes it too risky for a production workflow.
  • Alternative: GitHub Copilot. For a commercially supported and stable IDE solution, GitHub Copilot is the logical choice. It is backed by Microsoft/GitHub, has a public roadmap, and provides the long-term stability that Windsurf lacks.

Decision Matrix: Selecting a Tool by Workflow

The correct tool choice depends on your primary workflow.

START HERE: What is your primary work environment?

  • A) The Terminal.
    • Do you require a strict, git-native workflow with auditable AI commits?
      • YES -> Use Aider. Every AI change becomes an auditable git commit.
      • NO -> Need the highest quality output? Use Claude Code. Want a free tier for simple tasks? Use Gemini CLI.
  • B) An IDE (like VS Code).
    • Do you want maximum control and the ability to build custom AI assistants?
      • YES -> Use Roo Code. Its "Custom Modes" feature makes it the best for customization.
      • NO -> Do you prefer a stable, competent open-source agent without extra complexity?
        • YES -> Use Cline, the stable open-source agent from which Roo Code was forked.
  • C) Neither. I want to delegate a task and get a pull request.
    • How do you want to assign the task?
      • By assigning a GitHub Issue to an agent? -> Use GitHub Copilot Agent.
      • As a command to a standalone service in an isolated cloud environment? -> Use Google Jules.

Conclusion: Use a Combination of Specialized Tools

There is no single "best" AI coding tool. The market is specialized, and the correct choice depends on the specific workflow and task.

  • Terminal users who value control should use Aider.
  • Those who need the highest quality output from the CLI should use Claude Code.
  • Engineers wanting maximum extensibility in the IDE should choose Roo Code.
  • Organizations invested in the Microsoft/GitHub platform have a stable option in GitHub Copilot.

The most effective approach is to use a combination of these specialized tools: a terminal agent for frequent coding, an asynchronous executor for large delegated tasks, and a generalist web agent for research and prototyping.

Comments temporarily disabled because Disqus started showing ads (and rough ones). I'll have to migrate the commenting system.