Debugging Barman XAmzContentSHA256Mismatch error after upgrading to postgresql:17.5



In an effort to keep my home cluster up to date, I upgraded CloudNativePG to v1.26 and the associated database clusters to cloudnative-pg/postgresql:17.5. All seemed fine until a couple of days later, when the PVCs on the cluster's primary instance started to run out of space. WAL archiving to my Ceph S3 Object Store had stopped working with the error message:

ERROR: Barman cloud WAL archiver exception: An error occurred (XAmzContentSHA256Mismatch) when calling the PutObject operation: None

Searching the cloudnative-pg issues didn’t turn up anything, and a wider search found nothing relating to cloudnative-pg or Barman. Hmm…

Has Barman been updated?

After some digging, it turns out that Barman is added to the vanilla Postgres image as part of the build for the cloudnative-pg/postgresql images. Perhaps Barman has been updated:

docker run -it --rm --entrypoint /bin/sh ghcr.io/cloudnative-pg/postgresql:17.4
$ barman --version
3.12.1 Barman by EnterpriseDB (www.enterprisedb.com)

and

docker run -it --rm --entrypoint /bin/sh ghcr.io/cloudnative-pg/postgresql:17.5
$ barman --version
3.14.0 Barman by EnterpriseDB (www.enterprisedb.com)

Ah, perhaps we are on to something… Buried on the second page of Brave search results was this issue: boto3 doesn’t handle checksums correctly

What’s boto3?

Boto3 is the Amazon Web Services (AWS) Software Development Kit (SDK) for Python, which allows Python developers to write software that makes use of services like Amazon S3 and Amazon EC2. You can find the latest, most up to date, documentation at our doc site, including a list of services that are supported. Boto3 is maintained and published by Amazon Web Services.

Has boto3 been updated?

Yes.

postgresql:17.4:

docker run -it --rm --entrypoint /bin/sh ghcr.io/cloudnative-pg/postgresql:17.4
$ pip show boto3
Name: boto3
Version: 1.35.99
Summary: The AWS SDK for Python
Home-page: https://github.com/boto/boto3
Author: Amazon Web Services
Author-email:
License: Apache License 2.0
Location: /usr/local/lib/python3.9/dist-packages
Requires: botocore, jmespath, s3transfer
Required-by:

postgresql:17.5:

docker run -it --rm --entrypoint /bin/sh ghcr.io/cloudnative-pg/postgresql:17.5
$ pip show boto3
Name: boto3
Version: 1.38.27
Summary: The AWS SDK for Python
Home-page: https://github.com/boto/boto3
Author: Amazon Web Services
Author-email:
License: Apache License 2.0
Location: /usr/local/lib/python3.9/dist-packages
Requires: botocore, jmespath, s3transfer
Required-by:

How to fix?

The boto3 issue above linked to a second issue that describes the checksum as the current expected behaviour - Announcement: S3 default integrity change.

In AWS SDK for Python v1.36.0, we released changes to the S3 client that adopts new default integrity protections. For more information on default integrity behavior, please refer to the official SDK documentation. In SDK releases from this version on, clients default to enabling an additional checksum on all Put calls and enabling validation on Get calls. You can disable default integrity protections for S3. We do not recommend this because checksums are important to S3 integrity posture. Integrity protections can be disabled by setting the config flag to when_required, or by using the related AWS shared config file settings or environment variables.


Disclaimer: The AWS SDKs and CLI are designed for usage with official AWS services. We may introduce and enable new features by default, such as these new default integrity protections prior to them being supported or handled by third-party service implementations. You can disable the new behavior with the when_required value for the request_checksum_calculation and response_checksum_validation configuration options covered in Data Integrity Protections for Amazon S3.
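For tools that read the AWS shared config file rather than environment variables, the announcement notes the same settings can go there instead. A sketch of the equivalent profile entries, assuming the default ~/.aws/config location and profile:

```ini
# ~/.aws/config - revert to pre-1.36 checksum behaviour for this profile
[default]
request_checksum_calculation = when_required
response_checksum_validation = when_required
```

In a Kubernetes Pod there is no shared config file by default, which is why the environment-variable route below is the practical one for cloudnative-pg.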

There is also a description of the fix we need:

FYI to anyone using awscli, you want env-vars AWS_REQUEST_CHECKSUM_CALCULATION and AWS_RESPONSE_CHECKSUM_VALIDATION both set to when_required: https://docs.aws.amazon.com/cli/v1/userguide/cli-configure-envvars.html#envvars-list-AWS_REQUEST_CHECKSUM_CALCULATION

Updating the cloudnative-pg Cluster specification

The fix is simply to add a couple of environment variables to the Cluster specification:

  env:
    - name: AWS_REQUEST_CHECKSUM_CALCULATION
      value: when_required
    - name: AWS_RESPONSE_CHECKSUM_VALIDATION
      value: when_required
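For context, these variables go under the Cluster resource's spec.env, which cloudnative-pg passes through to the instance Pods (and therefore to barman-cloud's boto3 client). A minimal sketch of where they sit - the cluster name, instance count, and storage size are placeholder examples, not values from my manifest:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-cluster   # placeholder name
spec:
  instances: 3
  # Disable the new boto3 default integrity protections so PutObject
  # calls work against S3-compatible stores such as Ceph
  env:
    - name: AWS_REQUEST_CHECKSUM_CALCULATION
      value: when_required
    - name: AWS_RESPONSE_CHECKSUM_VALIDATION
      value: when_required
  storage:
    size: 10Gi            # placeholder size
```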

One more challenge for me. I had installed the Clusters with the cloudnative-pg Helm chart. This doesn’t actually provide a mechanism for specifying additional environment variables for the cluster - Add support for env vars - so instead of this being a simple change, I had to move all the Cluster specifications from Helm to plain YAML.

Moving the Cluster specifications to YAML is probably a good thing. As I understand more about Kubernetes, I sometimes get frustrated by the additional layer of complexity that Helm brings. CloudNativePG v1.26 introduces support for the Barman Cloud Plugin and deprecates the existing in-tree Barman Cloud support:

Barman Cloud Deprecation Begins

With the 1.26 release, the deprecation period for in-tree Barman Cloud support officially begins. While it remains fully functional in 1.26, we strongly encourage users to begin planning the migration to the Barman Cloud Plugin as early as possible and to adopt it for all new deployments. To help with this, we’ve published a comprehensive migration guide.

In CloudNativePG 1.28, Barman Cloud will be fully removed from CloudNativePG’s core. You have until then to complete your migration.

This marks a significant milestone in CloudNativePG’s evolution—the culmination of a multi-year effort that introduced CNPG-I, our extensible plugin interface. It is a crucial step toward making CloudNativePG a backup-agnostic solution while enabling leaner operand images by removing the need to bundle Barman Cloud directly. It also paves the way for future plugin support for volume snapshot backups and restores.

The migration guide provides a YAML walkthrough of the changes needed to move to the Barman Cloud Plugin, so updating my Cluster specifications to YAML was probably a good first step!

That’s it - WAL archiving is working again. Thanks for reading.