Resolved Apple CalDav issues with PostgreSQL startup

Today I noticed that my phone could no longer create any new calendar items. In Server.app I saw that the Calendar (and AddressBook) services were no longer running, and when checking their status it took forever for the panel to load. Enabling the service again also took forever, and it then failed to start (unfortunately without any error message).

After some digging I found that the PostgreSQL server the Apple CalDav service uses internally was no longer running and had issues starting. In the logfiles in /var/log/caldavd/postgresql/ I found messages like:

2015-03-14 12:59:33.665 CET [689] LOG:  unexpected pageaddr 0/5DC82000 in log segment 000000010000000000000061, offset 13115392
2015-03-14 12:59:33.665 CET [689] LOG:  invalid primary checkpoint record
2015-03-14 12:59:33.679 CET [689] LOG:  unexpected pageaddr 0/5DC7C000 in log segment 000000010000000000000061, offset 13090816
2015-03-14 12:59:33.679 CET [689] LOG:  invalid secondary checkpoint record
2015-03-14 12:59:33.679 CET [689] PANIC:  could not locate a valid checkpoint record
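
To check quickly whether a log directory contains this signature, something along these lines may help. It is only a minimal sketch: has_checkpoint_panic is a made-up helper name, not part of Server.app or PostgreSQL, and it assumes the logs are plain text files readable by the current user.

```shell
#!/bin/sh
# Sketch only: has_checkpoint_panic is a hypothetical helper name.
# Prints the names of log files under the given directory that contain
# the checkpoint PANIC. Exit status is 0 if any file matches, non-zero otherwise.
has_checkpoint_panic() {
    log_dir="$1"
    # -r: search the directory recursively, -l: print matching file names only
    grep -rl "could not locate a valid checkpoint record" "$log_dir"
}
```

In my case the call would be has_checkpoint_panic /var/log/caldavd/postgresql/ (probably with sudo, depending on the permissions of the log files).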

I suspect these were caused by a crash a few days ago of my NAS, which serves the iSCSI disks where the postgres data is stored. I spent a lot of time today looking for a solution (including trying to restore a backup and setting everything up from scratch, both of which failed). In the end I found a clue in the manual page of pg_resetxlog:

DESCRIPTION
pg_resetxlog clears the write-ahead log (WAL) and optionally resets some other control information stored in the pg_control file. This function is sometimes needed if these files have become corrupted. It should be used only as a last resort, when the server will not start due to such corruption.

This pretty closely matched my situation, so (after making a backup of the DB folder) I executed the following command in the folder where Server.app stores its data (by default that is /Library/Server/Calendar and Contacts, but in my case that’s /Volumes/Data/Library/Server/Calendar and Contacts as I store all data on a RAID5 container on my NAS):

sudo -u _calendar pg_resetxlog -f Data/Database.xpg/cluster.pg/

After running this command the PostgreSQL for Services started again and my Calendar (and AddressBook) services were running again. So far it looks like I did not lose any data apart from a calendar entry that I had added on my MacBook in iCal. I am glad it is resolved, but I have to look into how backups are made so that the next time I at least know that I can get my calendar and contacts back…
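
The backup of the DB folder that I made before running pg_resetxlog could be scripted roughly as follows. This is a sketch under assumptions: backup_cluster is a made-up helper name, the timestamped destination naming is my own choice, and PostgreSQL is assumed to be stopped while the copy is made (which it was, since it refused to start).

```shell
#!/bin/sh
# Sketch only: backup_cluster is a hypothetical helper name.
# Copies a PostgreSQL cluster directory to a timestamped folder under
# dest_root and prints the path of the copy.
backup_cluster() {
    src="${1%/}"        # cluster directory, trailing slash stripped
    dest_root="$2"      # where backups are collected (assumed location)
    if [ ! -d "$src" ]; then
        echo "no such directory: $src" >&2
        return 1
    fi
    dest="$dest_root/$(basename "$src").$(date +%Y%m%d-%H%M%S)"
    # -R: copy recursively, -p: preserve modes and timestamps
    # (ownership is only preserved when run with sufficient privileges)
    mkdir -p "$dest_root" && cp -R -p "$src" "$dest" && echo "$dest"
}
```

With the paths from this post, the call would be something like backup_cluster Data/Database.xpg/cluster.pg /some/backup/folder, where the backup folder is a placeholder. Since the reset itself ran as _calendar, the copy most likely needs sudo as well. According to its manual page, pg_resetxlog also accepts -n to print what it would do without modifying anything, which seems worth trying before -f.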

Crashplan stopped backing up due to corrupt cache

Today I noticed that Crashplan running on my NAS was no longer backing up any files and that the backup set was 0 MB and only 2 files (it should have been a few 100k files and > 350 GB). Rescanning the fileset didn’t help, and neither did removing and adding the folders again.

After a little digging I noticed entries like this in the logs (the log message was a lot longer, but I only included the relevant part of it):

com.code42.exception.DebugException: BSM:: SET-1: Exception adding source file...skipping - fileStat=FileStat[/volume1/photo, exists = true, fileType = 1, length = 0, lastModified = 1419895925000, lastAccess = 1425690867000, created = 1419895925000], com.code42.backup.manifest.FileManifest$CorruptFileManifestException: CORRUPT FMF ENTRY FixedPortion[entryPosition = 186768232, fileId = 00000000000000000000000000000000, parentFileId = 00000000000000000000000000000000, fileType = 0, version = Version[timestamp = 0, sourceLastModified = 0, sourceLength = 0, sourceChecksum = null, fileType = 0]

Unfortunately, Googling for this message did not return any results, and this part of the Crashplan system is rather obscure (nothing to debug, and the messages are limited). The only thing I could think of to try to resolve it was to drop the cache Crashplan maintains (in the cache subdirectory of the Crashplan installation). It turns out that this was sufficient: after a restart the cache was rebuilt and the scan resulted in the expected number of files.

The steps I performed were:

  1. Stop the Crashplan engine
  2. Remove all files in the Crashplan cache/ subdirectory
  3. Start Crashplan
  4. Enforce a rescan of the fileset in [Settings] –> [Backup] –> Verify Selection [Now]
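
Step 2 above can be sketched as a small shell function. It is only an illustration: clear_crashplan_cache is a made-up helper name, and the location of the cache directory depends on where Crashplan is installed. How the engine is stopped and started (steps 1 and 3) differs per platform, so those steps appear only as comments.

```shell
#!/bin/sh
# Sketch only: clear_crashplan_cache is a hypothetical helper name.
# Removes everything inside the given cache directory but keeps the
# directory itself, so its ownership and permissions stay as they were.
clear_crashplan_cache() {
    cache_dir="${1%/}"
    # Refuse to run on an empty argument or a path that is not a directory.
    if [ -z "$cache_dir" ] || [ ! -d "$cache_dir" ]; then
        echo "not a directory: $1" >&2
        return 1
    fi
    # -mindepth 1 skips the cache directory itself; -delete removes
    # files first and then the subdirectories that contained them.
    find "$cache_dir" -mindepth 1 -delete
}

# Step 1: stop the Crashplan engine here (platform specific).
# Step 2: clear_crashplan_cache /path/to/crashplan/cache   (placeholder path)
# Step 3: start the Crashplan engine again (platform specific).
```

Only run this while the engine is stopped. Step 4, the rescan, is still done from the Crashplan settings as described in the list.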

Since I had removed the folders from the backup set, I feared that I would have to upload all data again to my external backup targets, but Crashplan was smart enough not to need that.