Maildir (sorta) for c-client/pine
This page contains an enhanced maildir driver for c-client. First off, let me say that it does *not* follow the maildir specification entirely. It was written on the assumption that only this driver will manipulate mailboxes it has processed; however, if a "standard" maildir application were to come across such a mailbox, it shouldn't do any irreparable harm, nor should it have problems reading and processing the mail it contains.
Please read this page before trying to download and use these patches. You really should know what this stuff does and why it does it. Links to the patches are at the bottom of the page.
This c-client driver is based on the popular Bloodhounds maildir patch which has been floating around the interweb for some time. However, in doing the research to implement maildir at a particular site, it was found that the code, as given, was missing major functionality and had some serious performance issues when dealing with large folders.
Why?
This is a good question. As far as infrastructure goes, I'm a huge fan of horizontal scaling. Not horizontal scaling as in having 10 IMAP servers, each handling a subset of users, but horizontal scaling as in having 10 load-balanced IMAP servers, each able to serve all users. The same goes for mail delivery. Then concentrate on making the back-end storage as reliable as possible. At the site this work was done for, there are roughly 20,000 active accounts, with 500,000 mail messages processed and delivered daily. (Quite a few of these are delivered to inactive accounts, as we're really slow on aging them out -- but that's another issue altogether.)
Most of the users at this site use IMAP (or POP), though a few holdouts still insist on using Pine. There are currently 9 Sun Netra X1/V100 class machines handling mail delivery, and 4 Dell 2650s as IMAP/POP servers. Each of these IMAP/POP servers typically sees 400-500 mail reading processes at a time.
The backend storage is AFS. Previous to converting to this mail format, we used the typical "unix" mailspool format & c-client stuff, with procmail as the MDA. Significant changes were made to procmail and c-client's unix driver to deal with the locking and file semantics of AFS. While these changes did add to the IO load of the mailbox processing, they didn't really change the IO/Computational/memory profile of the driver's performance.
Before switching the users to the maildir-ish format, we were seeing significant IO and processor utilization on the IMAP servers, as well as the delivery boxes. In fact, our IMAP servers would be pretty much maxed out when they hit around 400-500 users. After converting to maildir, the load on these servers is insignificant: with 500 active IMAP processes a server shows a .02 load average, along with a significant reduction in memory footprint and memory utilization. The unchanging-file model of the messages takes special advantage of AFS's client side cache. Another back-of-house improvement we're seeing is a decrease in the size of our incremental backups, as setting a flag on a mail message no longer means re-writing (and re-backing-up) the full-file unix mailspool!
Of course, there are tradeoffs. The maildir-ish format on the backend works especially well with smart mail clients that do significant caching of message information locally. Operations that require each message to be opened, such as a body search, take longer under this format than others; however, such operations are rare compared to mailbox scans, flag changes, and simple downloads, so the performance gains in those areas easily take the lead. Because of the way 'pine' works, it generally performs well too, as it only gathers full message header information for those messages which are currently being viewed in the index.
Differences from the "Maildir" spec
The Bloodhounds IMAP driver stores message UIDs in the mtime/ctime of
the message file's stat information. For trivial mailboxes, this was
not found to be a problem. However, in analysing the driver's
performance with very large mailboxes (think > 30,000 messages), the
cost of calling stat() on every file was found to slow things down
significantly. The UID information has been moved to the filename of
the message, so generating a list of messages in a mailbox, with their
UIDs and flags, has basically gone from an O(n) operation to an O(1),
as all of this information is retrieved by reading the contents of
the directory instead of issuing one stat() per message.
(Yes, an incorrect use of big-O notation -- I'm really talking about
FS-operation/syscall count here.)
This change is reflected in the filename when it is moved from the
new directory to the cur directory by appending a
",U<uid#>" to the message ID portion of the filename (before
the flags).
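
To make the filename scheme concrete, here is a minimal sketch (not the
driver's actual code) of pulling the UID back out of a cur/ filename and
scanning a folder without any stat() calls. The function names and the
example layout "<unique>,U<uid>:2,<flags>" are assumptions made for
illustration; only the ",U<uid#>" marker before the flags comes from the
description above.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the UID embedded in a message filename,
 * or 0 if the name carries no ",U<digits>" marker.
 * Assumed layout (illustrative only): "<unique>,U<uid>:2,<flags>". */
unsigned long maildir_name_uid(const char *name)
{
    const char *end = strchr(name, ':');  /* flag section starts here, if any */
    const char *p = name;
    unsigned long uid = 0;

    if (end == NULL)
        end = name + strlen(name);

    /* Only look at ",U" markers that sit before the flag section. */
    while ((p = strstr(p, ",U")) != NULL && p < end) {
        if (isdigit((unsigned char)p[2]))
            uid = strtoul(p + 2, NULL, 10);
        p += 2;
    }
    return uid;
}

/* Hypothetical helper: find the highest UID in a cur/ directory using
 * only the directory listing, with no per-message stat(). */
unsigned long maildir_scan_max_uid(const char *curdir)
{
    DIR *d = opendir(curdir);
    struct dirent *e;
    unsigned long max = 0;

    if (d == NULL)
        return 0;
    while ((e = readdir(d)) != NULL) {
        unsigned long uid;

        if (e->d_name[0] == '.')          /* skip ".", ".." and dotfiles */
            continue;
        uid = maildir_name_uid(e->d_name);
        if (uid > max)
            max = uid;
    }
    closedir(d);
    return max;
}
```

A name that has not been assigned a UID yet (for example, one still
sitting in new) simply yields 0 here, which a caller could treat as
"needs a UID".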
More UID assignment hassles
The original Bloodhounds IMAP driver could not guarantee that message
UIDs would be unique for a mailbox. In the case where the last message
in a mailbox was deleted, a second invocation of the driver could come
along and re-assign that message's UID to the next message. This breaks
the IMAP specification, which requires that message UIDs be forever
increasing and never reissued for a mailbox. Thus, in addition to the
".uidvalidity" file that the Bloodhounds driver placed in the root of
the maildir structure, a ".uidlast" file has been added that
stores the last UID issued. In keeping with the Bloodhounds way
of doing things, this data is stored as the mtime/ctime of the file's
stat information. This may change in the future: while it's cute,
it can also cause problems if a folder is manipulated at the filesystem
level in a way that doesn't preserve these timestamps.
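
For readers unfamiliar with the timestamp trick, the following is a
rough sketch of storing a counter in a file's mtime. It is not taken
from the patch: the function names are hypothetical, and it uses the
mtime only, since POSIX offers no call that sets a file's ctime
directly.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <utime.h>

/* Hypothetical helper: read the counter stored in a file's mtime.
 * Returns 0 if the file cannot be examined. */
unsigned long uidfile_read(const char *path)
{
    struct stat sb;

    if (stat(path, &sb) != 0)
        return 0;
    return (unsigned long)sb.st_mtime;
}

/* Hypothetical helper: store a counter in the file's mtime (and atime),
 * creating the file if needed. Returns 0 on success, -1 on failure.
 * Note: values are limited by the range of time_t on the platform. */
int uidfile_write(const char *path, unsigned long uid)
{
    struct utimbuf tb;
    int fd = open(path, O_WRONLY | O_CREAT, 0600);

    if (fd < 0)
        return -1;
    close(fd);

    tb.actime = (time_t)uid;
    tb.modtime = (time_t)uid;
    return utime(path, &tb);
}
```

The sketch also shows why the approach is fragile: anything that
rewrites the file's timestamps, such as a copy or restore that does not
preserve times, silently changes the stored value.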
This also brings up the shameful topic of locking. While one of the
primary benefits of the maildir format has been "no locking", some
basic locking of this ".uidlast" file had to be added to make certain
that the UID assignment takes place in a sane fashion. In the original
production version of this code, the ".uidlast" file was locked, updated,
and then unlocked for each message a UID was being assigned for. After
being in production for a while, this was further refined. This driver
will lock the .uidlast file during a processing run if a UID assignment
is required, and update and unlock it at the end of the processing
run. While this may lock out UID assignment/new message processing
for other processes viewing the mailbox for a time, the speedup in
new message processing is significant enough to warrant it.
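
The lock-once-per-run idea can be sketched roughly as follows. This is
an illustration under stated assumptions, not the patch's code: the
struct and function names are hypothetical, the lock is a POSIX fcntl()
write lock (the page does not say which locking primitive the driver
uses), and the counter lives in the file's mtime as described earlier.
The file is assumed to exist already with its mtime initialised.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <utime.h>

/* Hypothetical state for one processing run. */
struct uid_batch {
    int fd;              /* open, locked descriptor for the counter file */
    const char *path;    /* path of the counter file */
    unsigned long last;  /* last UID issued so far */
    int dirty;           /* non-zero once a UID has been handed out */
};

/* Lock the counter file and load the last UID. Returns 0 on success. */
int uid_batch_begin(struct uid_batch *b, const char *path)
{
    struct flock fl;
    struct stat sb;

    b->path = path;
    b->dirty = 0;
    b->fd = open(path, O_RDWR);
    if (b->fd < 0)
        return -1;

    memset(&fl, 0, sizeof fl);
    fl.l_type = F_WRLCK;      /* exclusive lock */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;             /* whole file */

    /* Block until the lock is ours, then read the counter under it. */
    if (fcntl(b->fd, F_SETLKW, &fl) != 0 || fstat(b->fd, &sb) != 0) {
        close(b->fd);
        return -1;
    }
    b->last = (unsigned long)sb.st_mtime;
    return 0;
}

/* Hand out the next UID; no file I/O happens per message. */
unsigned long uid_batch_next(struct uid_batch *b)
{
    b->dirty = 1;
    return ++b->last;
}

/* Write the counter back once, then release the lock. */
int uid_batch_end(struct uid_batch *b)
{
    int rc = 0;

    if (b->dirty) {
        struct utimbuf tb;

        tb.actime = (time_t)b->last;
        tb.modtime = (time_t)b->last;
        rc = utime(b->path, &tb);
    }
    close(b->fd);             /* closing the descriptor drops the lock */
    return rc;
}
```

A run would call uid_batch_begin() once, uid_batch_next() for each
message moved from new to cur, and uid_batch_end() at the end, so the
lock and the counter update cost are paid once per run rather than once
per message.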
Things we don't like and why...
Using the ctime/mtime for storing metadata.
This is just a bad idea.
That this is still called a "maildir" driver.
I'll be changing the name to "maildirx" very soon. Then I can rid
myself of the crazy idea of making it compatible with maildir, and
concentrate more on finding ways to expand the features of the driver.
Invasive changes to dummy.c & mail.h
Currently, the maildir driver requires an addition to the 'elt'
structure in mail.h, as well as changes to dummy_scan to keep it
from showing maildir's structural folders as subfolders/directories.
While the latter can probably be done somewhere in the maildir
driver, depending on how maildir_scan is called (I need to research it),
the former seems to be a bit more vexing, as the "generic" pointer
available in the elt structure is for use by the user of the c-client
code, not code internal to the c-client itself (or at least, that's
what the documentation alludes to). Perhaps a void * for
"internal use only" by drivers could be added to elt.private?
Things I'd like to do...
Folder compression
This is the old processor vs storage/IO trade-off. The
one-message-per-file format lends itself well to running each message
stream through something such as zlib, and, as email is usually
pretty damn compressible, it could lead to some significant
space & IO operation reduction.
Directory hashing
Some filesystems, such as AFS, have limits on the size a directory
can be. Other filesystems have issues with performance when a
directory grows large. This can be mitigated by creating a hashed
directory structure for the message storage.
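
One possible shape for such a hashed layout, purely as a sketch: the
bucket count, the FNV-1a hash, and the "cur/<2 hex digits>/<name>"
layout are all assumptions for illustration, and none of this is in the
current patch. The hash covers only the part of the name before the
first ',' or ':' so that a message stays in the same bucket when its UID
or flags are added or changed.

```c
#include <stdio.h>
#include <string.h>

#define HASH_BUCKETS 256u   /* assumed bucket count: one level, 00..ff */

/* Hypothetical helper: pick a bucket from the stable prefix of the
 * message filename (everything before the first ',' or ':'). */
unsigned int maildir_bucket(const char *name)
{
    unsigned long h = 2166136261ul;          /* 32-bit FNV-1a offset basis */
    size_t i, n = strcspn(name, ",:");

    for (i = 0; i < n; i++) {
        h ^= (unsigned char)name[i];
        h = (h * 16777619ul) & 0xfffffffful; /* FNV prime, kept to 32 bits */
    }
    return (unsigned int)(h % HASH_BUCKETS);
}

/* Hypothetical helper: build "<curdir>/<bucket>/<name>" in out.
 * Returns 0 on success, -1 if the buffer is too small. */
int maildir_hashed_path(char *out, size_t outlen,
                        const char *curdir, const char *name)
{
    int n = snprintf(out, outlen, "%s/%02x/%s",
                     curdir, maildir_bucket(name), name);

    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

The cost of this design is that listing a folder takes one directory
read per bucket instead of one in total, so the bucket count would need
to be weighed against typical folder sizes.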
Better metadata encoding
The format currently used for the filenames is quite fat and
inflexible. The single-letter storage of flag information is
compact but inflexible, as it does not allow user-defined
flags to be used (currently, flags need to be compiled into the
driver). The UID is currently being text-ized as a decimal integer;
however, that is not the most compact storage format that can be
built from printable characters. In addition, all of that
timestamp/hostname stuff can be pared down a bit as well.
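
As an example of a tighter printable encoding, here is a small base-62
sketch. It is an assumption for illustration only (the page does not
name an encoding, and the function names are hypothetical). Digits and
letters are all safe in filenames, though the mixed case assumes a
case-sensitive filesystem. The largest 32-bit UID, 4294967295, takes 10
decimal digits but only 6 base-62 characters.

```c
#include <stddef.h>
#include <string.h>

static const char B62[] =
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

/* Hypothetical helper: write uid in base 62 into out, NUL-terminated.
 * out needs room for at least 12 bytes (11 characters cover 64 bits).
 * Returns the number of characters written, not counting the NUL. */
size_t uid_encode62(unsigned long uid, char *out)
{
    char tmp[16];
    size_t n = 0, i;

    do {                              /* least significant digit first */
        tmp[n++] = B62[uid % 62];
        uid /= 62;
    } while (uid != 0);

    for (i = 0; i < n; i++)           /* reverse into the output buffer */
        out[i] = tmp[n - 1 - i];
    out[n] = '\0';
    return n;
}

/* Hypothetical helper: parse a base-62 string. Returns 0 on success,
 * -1 on an empty string or an invalid character.
 * Overflow is not checked in this sketch. */
int uid_decode62(const char *s, unsigned long *uid)
{
    unsigned long v = 0;

    if (*s == '\0')
        return -1;
    for (; *s != '\0'; s++) {
        const char *p = strchr(B62, *s);

        if (p == NULL)
            return -1;
        v = v * 62 + (unsigned long)(p - B62);
    }
    *uid = v;
    return 0;
}
```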
You may ask "why, pray tell, do you care about squeezing things down".
It will be answered with "because I don't work for Microsoft, and
I still believe in efficiency."
Support for folder sharing
The IMAP protocol includes specifications for commands to alter
the access control information for a folder. Backend filesystems,
such as AFS, do well in storing said information and implementing
the access control. Maildir-ish formats share well between
multiple readers. Work needs to be done to the UWash IMAPd &
c-client library, as well as the driver, to implement the server
side machinations of this. But it would rock.
Support read-only folders
This goes hand-in-hand with the above.
The Software
These patches are provided with no warranty what-so-ever. In fact,
they probably don't work. Well, they work for me, but they probably
won't work for you. If you still think they may, even after being
told they won't, then go ahead and download them. If you have
any questions, feel free to ask me. If you have any improvements,
feel free to improve and contribute back. That would rock.
C-Client Maildir Patch (updated 7/05/2005)
imap-2004c1-maildir-patch
This is the current stable version of the maildir patch, to the
imap-2004c1 codebase. Note, it does stuff that you may not want
it to do to your Makefile, so PLEASE look over it first before
applying. There is no warranty, expressed, implied, blah.
If you are using this format for your users' inbox, look at the
definition of MAILDIRPATH in maildir.h. It's currently set up
to look for a folder named inbox in the users' ~/Mail directory.
Procmail Patch
procmail-3.22-maildir.patch
This patch makes some behaviour changes to procmail that make it
behave better (I think, at least) in a maildir environment. These
changes are controlled by defines in config.h, which are enabled
by applying this patch. They are:
- FORCE_MAILDIR - By default, if a folder being delivered
to by procmail looks like a directory, it simply considers it
an 'mh'-like folder and throws the message in there; it only
considers things maildir if they have a '/' at the end of the folder
name. This could cause some confusion -- this makes it less confusing
by causing a folder that turns out to be a directory to be treated
as a maildir-like folder.
- MAILDIR_DEFAULT - New folder creations are maildir. (pretty
simple).
- MAILDIR_SHORTNAME - Instead of using the fully qualified
hostname in the maildir file name, use a non-fully-qualified version.