In the rapidly evolving world of large language models (LLMs), a new contender has emerged: Claude 3 from Anthropic. With claims of outperforming GPT-4 in both benchmarks and real-world usage, Claude 3 has captured the attention of the AI community. As a power user of LLMs, I decided to test Claude 3 thoroughly and compare it head-to-head with GPT-4 to answer the burning question: should you drop GPT-4 for Claude 3?
Key Points About Claude 3:
- Developed by Anthropic, which claims it beats GPT-4 in both benchmarks and practical usage
- Available for free usage at chat.lmsys.org, allowing side-by-side comparisons with GPT-4
- The strongest model (Opus) is gated behind a $20/month subscription
- 200k context window, significantly larger than GPT-4’s 32k window in ChatGPT
- Lacks some ChatGPT features like code interpreter, image generation, voice I/O, plugins, and message editing

Testing Methodology:
To evaluate Claude 3, I focused on the day-to-day use cases that I rely on LLMs for, such as:
- Content creation assistance
- Idea generation
- Image analysis and understanding
- Prompt engineering and generation
- Creative writing
Results and Comparisons:
Content Creation and Idea Generation
- Claude excelled at generating highly relevant article ideas based on context and custom instructions
- GPT-4 struggled to provide ideas that aligned with my focus keywords, despite being given the same context
- Claude’s ability to generate spot-on ideas was a standout feature, making it my new go-to for brainstorming and ideation
Image Analysis and Understanding
- Claude demonstrated superior performance in analyzing and describing complex images accurately
- GPT-4 made minor errors in image descriptions, which could be problematic for automated workflows
- Claude’s multimodal capabilities felt more seamlessly integrated than GPT-4’s vision support, which can feel bolted onto the chat experience
Prompt Engineering and Generation
- Claude performed on par with GPT-4 at generating universal prompt formulas for specific professions
- Claude generated more detailed and actionable prompts when improving existing ones, and preserved template variables more effectively (see the sketch after this list)
- For image prompt generation, both Claude and GPT-4 performed similarly
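To make the “variables” concrete: the prompts I iterate on are parameterized templates along the lines of the sketch below. The placeholder names are purely illustrative (my own, not from either model); the point is that a good rewrite improves the instructions while leaving every placeholder intact.

```python
# A parameterized prompt template of the kind I ask the models to improve.
# The {placeholders} are the "variables" a good rewrite must preserve.
PROMPT_TEMPLATE = (
    "You are an experienced {profession}. Write a {length}-word {content_type} "
    "about {topic} for an audience of {audience}, in a {tone} tone."
)

prompt = PROMPT_TEMPLATE.format(
    profession="copywriter",
    length=300,
    content_type="product description",
    topic="noise-cancelling headphones",
    audience="frequent travelers",
    tone="conversational",
)
print(prompt)
```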
Limitations and Failures
- Claude struggled with basic math problems that GPT-4 solved correctly
- Palindrome and code generation tests were inconclusive, with both models producing mixed results (the sketch after this list shows the kind of task I used)
- Claude’s strict guardrails make it refuse some roleplaying and persona-modeling requests, limiting certain prompt engineering techniques
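For context, my code generation probes were small, verifiable tasks. A typical one was asking each model to write a palindrome checker; the version below is a reference solution I used to judge their output, not either model’s response.

```python
def is_palindrome(text: str) -> bool:
    """Return True if text reads the same forwards and backwards,
    ignoring case and non-alphanumeric characters."""
    cleaned = [ch.lower() for ch in text if ch.isalnum()]
    return cleaned == cleaned[::-1]

# Quick sanity checks for the reference solution.
assert is_palindrome("A man, a plan, a canal: Panama")
assert not is_palindrome("Hello, world")
```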
Creative Writing
- Initial impressions suggest that Claude’s content creation capabilities are similar to GPT-4’s, if not slightly worse
- Claude provides text without taking on a directorial role, whereas GPT-4 tends to take more responsibility in content planning workflows
Comparing Conversational Abilities:
Handling Long Context and Specific Instructions
- Claude 3 Opus excels at dealing with long context while staying grounded with specific instructions
- GPT-4 in ChatGPT does not come close to Claude 3 Opus in this regard
- Claude 3 generates extended responses without the “laziness” GPT-4 is sometimes criticized for
Roleplaying and Chatbot Capabilities
- Claude 3’s strong contextual understanding makes it excellent for roleplaying and chatbot applications
- That said, it still refuses certain roleplaying requests due to strong guardrails
- Skilled at copying styles and generating content based on provided materials (e.g., textbooks, video transcripts)
Additional Features and Improvements:
Multimodal Capabilities
- Claude 3 accepts image input, making it convenient to convert handwritten notes or complex math formulas into LaTeX code (a minimal API sketch follows)
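If you want to script this rather than use the chat UI, Anthropic’s Messages API accepts base64-encoded images alongside text. Here is a minimal sketch using the official Python SDK; the file name and prompt are mine, and the model ID was current as of writing, so check the docs before relying on it.

```python
import base64

import anthropic  # pip install anthropic

# Base64-encode a photo of handwritten math, as the Messages API
# expects for image content blocks.
with open("handwritten_formula.png", "rb") as f:  # hypothetical file
    image_data = base64.b64encode(f.read()).decode("utf-8")

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-3-opus-20240229",  # model ID as of early 2024; may change
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": "Transcribe this handwritten formula into LaTeX. "
                            "Return only the LaTeX code.",
                },
            ],
        }
    ],
)

print(message.content[0].text)  # the LaTeX transcription
```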
Reasoning and Common Sense
- Both Claude 3 and GPT-4 perform well in reasoning and common sense tasks, with no significant differences noted
Pricing and Value:
- Claude 3 Opus: $20/month, with roughly 6x the context length of ChatGPT Plus
- The free Sonnet model outperforms GPT-3.5 (ChatGPT’s free tier) across benchmarks
- The Haiku API offers better value than the GPT-3.5 API, at nearly half the cost with performance that approaches GPT-4 on some tasks
- The Opus API is expensive, at nearly 2x the cost of the GPT-4 API, making it less economical for now (a rough cost comparison follows)
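To put those ratios in numbers, here is a back-of-the-envelope cost comparison. The per-million-token prices are taken from the vendors’ pricing pages as of early 2024 and will change, so treat them as illustrative rather than authoritative.

```python
# USD per million tokens, as listed in early 2024 (illustrative only).
PRICING = {
    "claude-3-haiku": {"input": 0.25, "output": 1.25},
    "claude-3-opus":  {"input": 15.00, "output": 75.00},
    "gpt-3.5-turbo":  {"input": 0.50, "output": 1.50},
    "gpt-4-turbo":    {"input": 10.00, "output": 30.00},
}

def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one call with the given token counts."""
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a 10k-token prompt with a 1k-token reply.
for model in PRICING:
    print(f"{model}: ${cost(model, 10_000, 1_000):.4f}")
# claude-3-haiku:  $0.0038  (vs. gpt-3.5-turbo's $0.0065, roughly half)
# claude-3-opus:   $0.2250  (vs. gpt-4-turbo's $0.1300; the gap widens
#                            for output-heavy calls at $75 vs. $30 per MTok)
```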
After thoroughly exploring Claude 3, I can confidently say it stands as a strong alternative to GPT-4. While it shines in areas like idea generation, image interpretation, and prompt refinement, it still trails GPT-4 in a few key areas—particularly when it comes to persona modeling and more nuanced creative writing.
That said, I’ve found a sweet spot in using both. Claude 3 has become my go-to for image-based tasks and ideation, while GPT-4 remains my choice for storytelling and structured content workflows. As the LLM space keeps evolving, staying flexible and testing new tools is essential to getting the best results.
Have you tried Claude 3 yet? I’d love to hear your thoughts—whether you’ve noticed similar strengths or discovered other advantages. Let’s continue exploring what these evolving models can unlock together.