Stress-testing medical large language models reveals latent safety pathology beyond benchmark accuracy - IcaraX