MLA 022 Coding Agents: IDEs & Plugins

Feb 09, 2025 (updated Apr 11, 2025)
Anthropic's Claude 4 model delivers the highest quality code, making Claude Code the top choice for complex terminal tasks. For specific workflows, Aider's git-native agent and Roo Code's customizable IDE extension offer the best performance and control.

Vibe Coding Mini Series

Show Notes

This page is the updated report as of July 2025. The podcast episode itself is severely outdated.

AI coding tools are now agents that handle entire features, bug fixes, and refactors, moving beyond simple autocomplete. All modern agents share baseline features: full codebase context via repository maps, step-by-step planning for developer approval, the ability to execute file edits and shell commands, and "Bring Your Own Key" (BYOK) architecture for model flexibility.

Claude Code vs Cursor, Windsurf, Roo Code, Aider, Gemini CLI, GitHub Copilot, Jules, Cline

The main differentiators are workflow philosophy and model quality.

Which Tool to Use

  • For Terminal/CLI users:
    • Need git-native safety & audit trails? -> Aider
    • Need highest quality code output? -> Claude Code
    • Need a free tier for simple tasks? -> Gemini CLI
  • For IDE (VS Code) users:
    • Want to build custom agents? -> Roo Code
    • Need deep GitHub Issue/Action integration? -> GitHub Copilot
  • For large, delegated tasks (receive a PR):
    • Trigger via GitHub Issues? -> GitHub Copilot Agent
    • Trigger via command to a cloud service? -> Google Jules

Tool Analysis by Workflow

  • Terminal-Native (CLI)

    • Aider: Safest agent due to its deep git integration; every AI change is an auditable commit. Uses tree-sitter for superior structural context.
    • Claude Code: Produces the highest quality code via Anthropic's Claude 4 model. Best for complex tasks where output quality is paramount.
    • Gemini CLI: Best for experimentation due to a generous free tier of 1,000 requests/day. A simple wrapper for Google's model.
  • IDE-Native (VS Code)

    • Roo Code: Top choice for power users. Its "Custom Modes" feature allows building specialized AI personalities (e.g., security auditor, docs writer).
    • GitHub Copilot: Main advantage is deep integration with the GitHub platform. Can assign GitHub Issues directly to the @copilot agent.
    • Cline: A stable, reliable open-source agent that Roo Code forked from. A solid choice for users who prefer stability over extensibility.
  • Asynchronous (Delegation)

    • Google Jules: A "fire-and-forget" service. It clones a GitHub repo to a sandboxed Google Cloud VM, completes a task, and submits a pull request.
    • GitHub Copilot Agent: A similar async workflow, but works natively inside GitHub Actions and is triggered by assigning an Issue to @copilot.

Performance & Market Status

  • Performance Metric: SWE-bench, which tests an agent's ability to resolve real GitHub issues, is the best metric. HumanEval is a poor differentiator for top models.
  • Model Quality: Anthropic's Claude 4 models lead on quality and reasoning. On SWE-bench, Claude 4 Sonnet scores 80.2% and Opus scores 72.5%, surpassing Gemini 2.5 Pro (63.2%) and OpenAI's o3 (69.1%).
  • Google's Strategy: Google's tools (Jules, Gemini CLI, Gemini Code Assist) are functionally distinct but have confusing marketing that creates user friction.

Tools to Avoid

  • Cursor: Hostile business model. Restrictive usage limits on its paid "Pro" and "Ultra" plans interrupt work. Alternative: Roo Code.
  • Windsurf: Unacceptable platform risk. Its future is unknown after its acquisition by Cognition AI. Alternative: GitHub Copilot.

Long Version

AI assistance in software development has moved beyond suggesting single lines of code. Current tools operate as agentic partners that can take on entire features, bug fixes, and refactors. These agents read a project's context, understand user intent, and execute complex changes across a codebase.

This analysis provides a technical summary of these AI coding agents as of July 2025, focusing on factual capabilities to help engineers select the right tools.

Baseline Features for Modern AI Coding Agents

A core set of agentic capabilities is now the minimum requirement for any competitive tool.

Full Codebase Context

Agents must be able to reason about an entire project, not just open files. This is done using local file indexing, vector embeddings for semantic meaning, and repository maps (repo maps) that outline code structures and dependencies. Without this, an agent can only make single-file edits that risk breaking the larger system.
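To make this concrete, here is a minimal sketch of what a repository map can look like. It uses Python's built-in ast module for brevity (tools like Aider use tree-sitter to cover many languages); the repo_map helper is illustrative, not any tool's actual implementation.

```python
# Minimal repo-map sketch: walk a project and extract top-level signatures
# so an LLM can see the codebase's structure without reading every file.
# Uses Python's ast module for brevity; real agents use tree-sitter.
import ast
from pathlib import Path

def repo_map(root: str) -> str:
    lines = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8", errors="ignore"))
        except SyntaxError:
            continue  # skip files that do not parse
        lines.append(str(path))
        for node in tree.body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                args = ", ".join(a.arg for a in node.args.args)
                lines.append(f"  def {node.name}({args})")
            elif isinstance(node, ast.ClassDef):
                lines.append(f"  class {node.name}")
    return "\n".join(lines)

print(repo_map("."))  # a compact structural outline, suitable as LLM context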

Planning and Task Decomposition

An agent must analyze a user's request and generate a step-by-step plan of action. This plan, which shows which files will be modified, must be presented to the developer for approval. This "visible planning" provides control and transparency.
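A minimal sketch of this plan-then-approve loop, assuming a hypothetical plan_with_llm function and Step structure as stand-ins for a real model call and edit executor:

```python
# Sketch of "visible planning": the agent proposes steps, and nothing is
# executed until the developer approves the whole plan.
from dataclasses import dataclass

@dataclass
class Step:
    description: str
    files: list[str]  # files this step will modify

def plan_with_llm(request: str) -> list[Step]:
    # In a real agent, an LLM decomposes the request using the repo map.
    return [Step("Add validate_email() helper", ["utils.py"]),
            Step("Call it from the signup handler", ["views.py"])]

def run(request: str) -> None:
    plan = plan_with_llm(request)
    for i, step in enumerate(plan, 1):
        print(f"{i}. {step.description}  (edits: {', '.join(step.files)})")
    if input("Apply this plan? [y/N] ").strip().lower() != "y":
        print("Aborted; no files were changed.")
        return
    # ... execute each approved step here ...
```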

Tool Use and Execution

Agents must be able to interact with the developer's environment. Baseline capabilities include reading and editing files anywhere in the project, and executing shell commands such as tests, linters, and builds, with each action gated on developer approval.
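A sketch of what such a gated shell tool might look like; the approval prompt and run_shell helper are illustrative, not taken from any specific tool:

```python
# Gated shell tool: the agent may request commands (e.g., to run tests),
# but each one is shown to the developer before anything executes.
import subprocess

def run_shell(command: str) -> str:
    print(f"Agent wants to run: {command}")
    if input("Allow? [y/N] ").strip().lower() != "y":
        return "DENIED by user"
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

# e.g., the agent verifies its own edit by running the test suite:
print(run_shell("pytest -q"))
```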

"Bring Your Own Key" (BYOK) Architecture

Developer-focused tools use a BYOK model, which separates the agent tool from the underlying Large Language Model (LLM). Users provide their own API keys from providers like Anthropic, OpenAI, Google, or OpenRouter. This gives them control over model choice to balance cost, speed, and capability for different tasks and ensures the tool remains useful as new models are released.

Main Architectural Division: Interaction Modality

The primary differentiator among AI coding tools is their interaction modality: terminal, IDE, or asynchronous web service. This determines the tool's workflow and target user. Tools are generally designed either for the "inner loop" of iterative coding or the "outer loop" of larger, delegable tasks.

Terminal-Native Tools for Speed and Control

These tools augment command-line workflows for users who prioritize speed and scriptability.

IDE-Integrated Tools

These tools integrate AI directly into a graphical IDE like VS Code.

Asynchronous Agents for Delegation

These tools are for large "outer loop" tasks, acting as background workers that deliver a pull request on completion.

  • Google Jules: Clones a GitHub repository into a sandboxed Google Cloud VM, executes a task, and then submits a pull request. Its main feature is this hands-off model, which frees the developer's local machine and API rate limits for other work.
  • GitHub Copilot Agent: This is GitHub's version of asynchronous execution. It is triggered when a developer assigns a GitHub Issue to the @copilot user. The agent works in the background and delivers a pull request. Its advantage is its native integration with the GitHub Issues workflow.

ChatGPT Agent: A Generalist Tool for Coding

OpenAI's web-based ChatGPT Agent is a universal agent that can be used for coding. It has access to a visual web browser, a terminal, and APIs within its own sandboxed virtual computer. Its strength is its versatility and reasoning power, useful for novel problems that require switching between web research and code generation. Its high score on the FrontierMath benchmark reflects strong general reasoning. However, because it is not integrated with a local development environment, it is inefficient for iterative refactoring of existing codebases.

Performance Analysis: Claude 4 Models Lead on SWE-bench

Performance depends on both the underlying LLM's intelligence and the agentic framework's effectiveness. Real-world performance on complex tasks is more telling than simple benchmarks.

Benchmark Definitions

  • SWE-bench: Measures an agent's ability to resolve real GitHub issues from open-source projects end to end. It is the most realistic benchmark for agentic coding.
  • HumanEval: Measures single-function code generation (Pass@1). Top models now cluster near its ceiling, making it a poor differentiator.

Claude 4's Performance Lead

Data from benchmarks and user reports show that Anthropic's Claude 4 models are the current state-of-the-art.

  • Benchmark Results: Claude 4 Sonnet achieved an 80.2% resolution rate on SWE-bench with parallel reasoning, and Claude 4 Opus scored 72.5%. These scores are higher than competitors like Gemini 2.5 Pro (63.2%) and OpenAI's o3 (69.1%).
  • User Feedback: Developers report that Claude's output has better "taste," meaning it is more idiomatic, maintainable, and stylistically consistent. It requires fewer retries and shows a superior planning capability. Any tool using a Claude 4 model as its backend has an advantage in output quality.

The Agentic Multiplier Effect

The agentic framework is a critical performance multiplier. The SWE-bench leaderboard is dominated by combinations of a top-tier model (usually Claude 4 Sonnet) with a sophisticated open-source agentic framework like SWE-agent or OpenHands. This shows that the agent's ability to plan, use tools, and manage context is crucial. The strong performance of Aider across models demonstrates the quality of its git-native, tree-sitter-powered framework. The value of a tool comes from both its model and its own design.

| Tool | Primary Model Used | SWE-bench (% Resolved) | HumanEval (Pass@1) | Key Agentic Strength | Performance Summary |
| --- | --- | --- | --- | --- | --- |
| Claude Code | Claude 4 Opus/Sonnet | ~72.5% - 80.2% | ~92% | Superior planning, reasoning, and code quality from Claude 4 models. | Delivers the highest-quality, most reliable results due to its model and polished framework. |
| Aider | Model-agnostic (BYOK) | ~26.3% (with GPT-4o/Opus) | Varies by model | Deep git integration for atomic, auditable changes; tree-sitter for repo context. | Effective framework; performance is determined by the chosen model. A top contender when paired with Claude 4. |
| Gemini CLI | Gemini 2.5 Pro | ~63.2% | ~99% | 1M+ token context window and direct access to Google Search. | Capable model, but a less polished agent framework yields lower task success despite high HumanEval scores. |
| Codex CLI | OpenAI o4-mini/o3 | ~69.1% (o3) | ~80-90% | Flexible approval modes for granular control. | Competent and lightweight, but surpassed by Claude-powered tools on complex tasks. |
| Roo Code | Model-agnostic (BYOK) | Varies by model | Varies by model | Extensible "Custom Modes" for creating specialized agent personalities. | Solid agent core; power comes from user-defined specialization and model choice. |
| GitHub Copilot Agent | GPT-4.1, Claude, Gemini | Varies by model | Varies by model | Native integration with GitHub Issues and Actions workflows. | Strong performance, but its value is tied to a team's investment in the GitHub platform. |
| Google Jules | Gemini 2.5 Pro | Not publicly benchmarked | Not publicly benchmarked | Asynchronous execution in an isolated cloud VM. | Sound concept for "outer loop" tasks, but the lack of public benchmarks makes its effectiveness difficult to assess. |

Market Sentiment Analysis

This section validates developer sentiment on key tools.

Roo Code Is the Preferred Tool for Power Users

  • Verdict: Correct. Roo Code is the top choice for developers who want to customize their agent.
  • Analysis: Its "Custom Modes" feature lets users build specialized AI personalities (e.g., a security auditor or docs writer), extensibility that other IDE extensions lack. Combined with a transparent BYOK model, this makes it the natural landing spot for power users leaving Cursor and Windsurf.

Claude 4-Powered Tools Lead in Quality and Reliability

  • Verdict: Correct. Tools using Anthropic's Claude 4 models lead the market in code quality.
  • Analysis: Claude 4 models dominate the SWE-bench benchmark. User comparisons also find that Claude Code outperforms competitors like Gemini CLI in speed, cost, and final code quality. Developers report Claude's output has better "taste" and demonstrates superior planning. This may be due to a training focus on reasoning and safety, resulting in a lower tendency to hallucinate.

Google's Product Strategy Creates Confusion

  • Verdict: Correct. Google's tools are functionally distinct but poorly differentiated.
  • Analysis: Jules (asynchronous cloud agent), Gemini CLI (terminal agent), and Gemini Code Assist (IDE assistant) serve different workflows, yet overlapping branding makes it hard for users to tell which tool fits which job, creating adoption friction.

Early Market Leaders Have Lost Developer Trust

  • Verdict: Correct. Early leaders have broken user trust, and their advantages have been commoditized.
  • Analysis:
    • Cursor: Now considered unreliable due to its business model. Developers are frustrated by the opaque, user-hostile usage limits on its paid plans, which can block work arbitrarily.
    • Windsurf: A platform with an uncertain future after its acquisition by Cognition AI. The lack of a clear roadmap makes it a risky choice.
    • GitHub Copilot: Its position is shifting from innovator to incumbent. Its features have been replicated by other tools. Its main strength is now its deep integration into the GitHub platform. It is becoming the "good enough" default for enterprises, while power users migrate to Aider, Roo Code, and Claude Code.

Tools to Avoid: Cursor and Windsurf

Based on reliability and platform stability, two tools should be avoided as of July 2025.

Avoid: Cursor

  • Reason: Hostile business model. Cursor's technical features are negated by its business practices. The opaque and low usage limits on its paid "Pro" and "Ultra" plans make it an unreliable professional tool.
  • Alternative: Roo Code or Cline. These open-source VS Code extensions use a transparent BYOK model, giving developers control over costs and models. Roo Code offers extensibility, while Cline provides stability.

Avoid: Windsurf

  • Reason: Unacceptable platform risk. The chaotic acquisition of Windsurf by Cognition AI has left its future roadmap and support level unknown. This makes it too risky for a production workflow.
  • Alternative: GitHub Copilot. For a commercially supported and stable IDE solution, GitHub Copilot is the logical choice. It is backed by Microsoft/GitHub, has a public roadmap, and provides the long-term stability that Windsurf lacks.

Decision Matrix: Selecting a Tool by Workflow

The correct tool choice depends on your primary workflow.

START HERE: What is your primary work environment?

  • A) The Terminal.
    • Do you require a strict, git-native workflow with auditable AI commits?
      • YES -> Use Aider. Every AI change becomes an auditable git commit.
      • NO -> Need the highest quality output? Use Claude Code. Want a free tier for simple tasks? Use Gemini CLI.
  • B) An IDE (like VS Code).
    • Do you want maximum control and the ability to build custom AI assistants?
      • YES -> Use Roo Code. Its "Custom Modes" feature makes it the best for customization.
      • NO -> Do you prefer a stable, competent open-source agent without extra complexity?
        • YES -> Use Cline, the stable open-source agent from which Roo Code was forked.
  • C) Neither. I want to delegate a task and get a pull request.
    • How do you want to assign the task?
      • By assigning a GitHub Issue to an agent? -> Use GitHub Copilot Agent.
      • As a command to a standalone service in an isolated cloud environment? -> Use Google Jules.

Conclusion: Use a Combination of Specialized Tools

There is no single "best" AI coding tool. The market is specialized, and the correct choice depends on the specific workflow and task.

  • Terminal users who value control should use Aider.
  • Those who need the highest quality output from the CLI should use Claude Code.
  • Engineers wanting maximum extensibility in the IDE should choose Roo Code.
  • Organizations invested in the Microsoft/GitHub platform have a stable option in GitHub Copilot.

The most effective approach is to use a combination of these specialized tools: a terminal agent for frequent coding, an asynchronous executor for large delegated tasks, and a generalist web agent for research and prototyping.

Comments temporarily disabled because Disqus started showing ads (and rough ones). I'll have to migrate the commenting system.