I've seen claims that being polite to Claude improves its performance. However, at a meet-up recently, someone told me that urging it along with compliments and encouragement pushes it to perform even better.
I was sceptical. But I've also seen enough surprising LLM behaviour that I don't trust my intuitions about what does and doesn't matter. I wanted to put it to the test.
I built an automated experiment to find out.
The setup
Three prompt styles, the same tasks, blind evaluation. The core task text was identical across all three styles; only the wrapping changed.
Neutral strips away everything except the instruction:
Write a Python function that finds the longest increasing subsequence in a list of integers. Include the algorithm, clear comments explaining the approach, time/space complexity analysis, and 3 test cases.
Polite adds friendly bookends ("Hi there! I'd really appreciate your help..." / "Thank you so much for your time!") but keeps the same system prompt.
Encouraging goes maximalist. Both the system prompt and user message get the full treatment: "You are an incredibly talented expert... I believe in your abilities completely... You've got this!" This is the version people actually advocate for. I wanted to test the strong claim, not a watered-down version of it.
Six tasks spanning coding, explanation, creative writing, business analysis, debugging, and system design. Three trials per style per task, for 54 responses in total (18 per style). Each response was judged by a separate Claude API call that never saw which prompt style produced it, scoring on correctness, depth, clarity, creativity, and usefulness (1-10 each).
The model was claude-sonnet-4-20250514 throughout, with the API's default temperature.
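The core of the harness is a plain generate-then-judge loop. Here's a simplified sketch: the wrapper strings and judge rubric are abbreviated stand-ins for the real ones, the encouraging style's modified system prompt is reduced to a comment, and error handling is omitted.

```python
import json
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-20250514"

# Abbreviated style wrappers -- the real ones are longer.
STYLES = {
    "neutral": "{task}",
    "polite": "Hi there! I'd really appreciate your help.\n\n{task}\n\nThank you so much for your time!",
    "encouraging": "You are an incredibly talented expert and I believe in your abilities completely!\n\n{task}\n\nYou've got this!",
}

def generate(style: str, task: str) -> str:
    # The encouraging condition also swaps in a flattering system prompt (omitted here).
    resp = client.messages.create(
        model=MODEL,
        max_tokens=4096,
        messages=[{"role": "user", "content": STYLES[style].format(task=task)}],
    )
    return resp.content[0].text

def judge(task: str, answer: str) -> dict:
    # The judge sees only the task and the response, never the style that produced it.
    rubric = (
        "Score this response on correctness, depth, clarity, creativity, and usefulness, "
        "each from 1 to 10. Reply with JSON only, like "
        '{"correctness": 8, "depth": 7, "clarity": 9, "creativity": 6, "usefulness": 8}'
    )
    resp = client.messages.create(
        model=MODEL,
        max_tokens=200,
        messages=[{"role": "user", "content": f"{rubric}\n\nTASK:\n{task}\n\nRESPONSE:\n{answer}"}],
    )
    return json.loads(resp.content[0].text)

# 6 tasks x 3 styles x 3 trials = 54 generate calls, each followed by a blind judge call.
```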
The results
Here are the scores broken down by dimension. The ± values are 95% confidence intervals.
| Dimension | Neutral | Polite | Encouraging |
|---|---|---|---|
| correctness | 8.78 ±0.30 | 9.00 ±0.27 | 9.00 ±0.27 |
| depth | 8.83 ±0.18 | 8.83 ±0.18 | 9.00 ±0.22 |
| clarity | 8.89 ±0.22 | 9.00 ±0.22 | 9.06 ±0.19 |
| creativity | 7.56 ±0.36 | 7.50 ±0.33 | 7.61 ±0.28 |
| usefulness | 9.11 ±0.22 | 9.17 ±0.24 | 9.22 ±0.20 |
| composite | 8.63 ±0.14 | 8.70 ±0.12 | 8.78 ±0.12 |
Encouraging edges out neutral on composite (8.78 vs 8.63). If you squint, there's a trend. But those confidence intervals overlap across every single dimension. With only 18 responses per style, this isn't statistically significant.
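For reference, a standard t-based 95% interval over 18 scores is computed like this; the sketch below uses illustrative variable names, and the same per-response scores would feed a two-sample test for the composite gap.

```python
import numpy as np
from scipy import stats

def mean_ci(scores, confidence=0.95):
    """Mean and half-width of a t-based confidence interval."""
    scores = np.asarray(scores, dtype=float)
    sem = stats.sem(scores)  # standard error of the mean
    half_width = sem * stats.t.ppf((1 + confidence) / 2, df=len(scores) - 1)
    return scores.mean(), half_width

# neutral_composites and encouraging_composites would each hold the 18
# per-response composite scores for a style (not reproduced here).
# mean_ci(neutral_composites)  -> the mean and the ± shown in the table
# Welch's t-test for the composite gap between two styles:
# stats.ttest_ind(encouraging_composites, neutral_composites, equal_var=False)
```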
The per-task breakdown tells a similar story:
| Task | Neutral | Polite | Encouraging |
|---|---|---|---|
| code_algo | 8.67 ±0.29 | 8.80 ±0.00 | 8.87 ±0.76 |
| explain | 8.60 ±0.00 | 8.60 ±0.00 | 8.67 ±0.29 |
| creative | 8.80 ±0.00 | 8.67 ±0.29 | 8.73 ±0.29 |
| analysis | 8.73 ±0.29 | 8.60 ±0.50 | 8.73 ±0.76 |
| debug | 8.87 ±0.76 | 9.13 ±0.29 | 9.13 ±0.29 |
| design | 8.13 ±0.76 | 8.40 ±0.50 | 8.53 ±0.57 |
Debug and design tasks show the widest gaps between styles, but also the widest intervals. The explain task is almost perfectly flat. No task type shows a clear, reliable pattern.
One thing that surprised me: response length doesn't track with encouragement. Neutral prompts actually produced the longest responses on average (4,599 characters), compared to 4,481 for encouraging and 4,174 for polite. I expected the opposite.
What this actually tells us
Not much, yet. The sample size is too small for confident conclusions.
But I think the more interesting finding isn't about politeness at all. It's about what happens when you use an LLM to judge an LLM.
Every dimension scored between 7.5 and 9.2. That's a very narrow band. The judge never scored anything below 7. This ceiling effect is a known problem with LLM-as-judge setups: the model tends to be generous with itself, compressing the meaningful variation into a tiny range at the top of the scale. When your entire signal lives in the difference between 8.6 and 8.8, you need a lot of data points to separate it from the noise.
Creativity consistently scored lowest across all styles (7.5-7.6). That's probably the most reliable finding here, and it has nothing to do with politeness. It suggests the judge has a harder time awarding high marks on subjective dimensions, or that the model's creative output really does plateau regardless of how you prompt it.
Caveats I'd want addressed before trusting these numbers
The code tasks weren't run. The math wasn't computed. Everything was scored by vibes, which is to say by another Claude call reading the output and deciding whether it looked right. For a task like "find the longest increasing subsequence," that's a problem. An incorrect algorithm that's well-explained might score higher than a correct one that's terse.
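The mechanical fix isn't hard: pull the code block out of the response, run it, and check it against known answers. For the LIS task it could look like the sketch below; the extraction regex, test cases, and expected function name are illustrative, not part of the actual experiment.

```python
import re

def extract_python_block(response: str) -> str:
    """Pull the first fenced code block out of a model response."""
    match = re.search(r"```(?:python)?\n(.*?)```", response, re.DOTALL)
    if match is None:
        raise ValueError("no fenced code block found")
    return match.group(1)

# Known-answer tests: the *length* of the longest increasing subsequence
# is unambiguous even when the subsequence itself isn't unique.
LIS_CASES = [
    ([10, 9, 2, 5, 3, 7, 101, 18], 4),
    ([0, 1, 0, 3, 2, 3], 4),
    ([7, 7, 7, 7], 1),
]

def verify_lis(response: str, func_name: str = "longest_increasing_subsequence") -> bool:
    namespace: dict = {}
    exec(extract_python_block(response), namespace)  # sandbox this in a real harness
    func = namespace[func_name]
    # Assumes the function returns the subsequence itself; adapt if it returns a length.
    return all(len(func(nums)) == expected for nums, expected in LIS_CASES)
```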
Temperature was left at the API default, which means some of the variance between trials is just random sampling. With only three trials per cell, one lucky or unlucky generation can swing a result.
And there's the obvious one: I used Claude to judge Claude. The model may systematically prefer its own structural patterns (numbered lists, clear headers, thorough explanations), giving all responses a similar baseline regardless of the prompt style that produced them.
What I'd do differently
More trials: at least 10-20 per cell instead of 3. Run the generated code and verify the outputs. Recruit human judges to score a subset and calibrate the LLM scores against them. Test across models to see whether GPT-4 or Gemini responds differently to encouragement. And separate the variables: the encouraging style changed both the system prompt and the user message, so I can't tell which part (if either) is doing the work.
I'd also want to test intermediate points. Right now it's three discrete styles. A gradient from "terse" to "effusive" might reveal a sweet spot, or confirm that the whole spectrum is flat.
This was a first pass. The tooling works, the methodology is sound enough to build on, and the null result is itself informative. If politeness made a large, reliable difference, this experiment would have caught it. It didn't. That doesn't prove it doesn't matter, but it does mean the effect, if it exists, is small enough to hide in the noise of 54 API calls.
I'll keep saying please and thank you. It costs nothing, and I'm not ready to rule it out entirely. But I'll stop feeling guilty when I fire off a bare instruction in the middle of a long coding session.