The MAFF specification

MAFF is meant to be a simple format for archiving a copy of some web content in a single file.

Design goals

Conformance levels

In order to satisfy the requirement of simplicity of implementation, and to encourage early adoption of the format, different conformance levels are available to implementors.

Definitions

Page

An atomic unit of related archived content. Multiple independent pages can be stored in a single MAFF archive.

Main document

The top-level file of a page, which is generally the file displayed in a web browser window.

File extension and type

MAFF files should be saved using the .maff file extension (lowercase is recommended), even on systems where file extensions are not normally used to identify file types. [Conformance level: elementary]

Implementations should treat files with the .maff file extension (case insensitive) as MAFF files, even on systems where file extensions are not normally used to identify file types. [Conformance level: basic]
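As an illustration of the case-insensitive extension check, here is a minimal Python sketch. The function name is_maff_file is hypothetical and not part of the specification.

```python
from pathlib import PurePath


def is_maff_file(name: str) -> bool:
    """Return True if the file name ends with the .maff extension, ignoring case."""
    # PurePath.suffix yields the last dot-separated component, including the dot.
    return PurePath(name).suffix.lower() == ".maff"
```

The check looks only at the final extension, so a name such as page.maff.zip is not treated as a MAFF file.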

The MIME type application/x-maff is suggested for files with the .maff extension. [Information]

ZIP implementation

The ZIP implementation must be based on PKWARE's ZIP Application Note. [Conformance level: elementary]

File and directory names must be stored using UTF-8. [Conformance level: basic]
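One way a reader could check the UTF-8 requirement is sketched below in Python, using the standard zipfile module. It assumes that a name counts as UTF-8 when either the ZIP language encoding flag (general purpose bit 11) is set or the name is plain ASCII, which is also valid UTF-8. The function name names_are_utf8 is hypothetical.

```python
import zipfile

# General purpose bit 11 of a ZIP entry: file name and comment are UTF-8.
UTF8_FLAG = 0x800


def names_are_utf8(archive) -> bool:
    """Return True if every entry name is flagged as UTF-8 or is pure ASCII.

    `archive` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        return all(
            bool(info.flag_bits & UTF8_FLAG) or info.filename.isascii()
            for info in zf.infolist()
        )
```

This is a heuristic for reading existing archives. It does not detect names that were written in a legacy code page but happen to contain only ASCII bytes, since those are indistinguishable from UTF-8.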

Directory structure

The root directory of the archive must be empty. [Conformance level: elementary]

One first-level directory must be present for every saved page. At least one page must be present in the archive. No other first-level directories may be present. [Conformance level: elementary]

Every first-level directory should contain a file named index.rdf, containing the metadata. [Conformance level: basic]

If the index.rdf file is not present, the main document must be stored in a file named index, with a file extension based on the content type. If the content type of the main document is HTML, the file must be named index.html. [Conformance level: elementary]

If the index.rdf file is present, the metadata must contain the name of the file containing the main document. This file must be located in the same folder as the index.rdf file. This file must be named index, with a file extension based on the content type, unless the file type is RDF. [Conformance level: basic]
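The layout rules above can be approximated with a small checker. The following Python sketch is not a complete validator: it interprets "the root directory must be empty" as "no files directly in the root" (an assumption), and it does not parse index.rdf to confirm the name of the main document. The function name check_structure is hypothetical.

```python
import zipfile


def check_structure(archive) -> list[str]:
    """Return a list of problems found; an empty list means the layout looks valid."""
    problems = []
    pages = {}  # first-level directory -> names of files directly inside it
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            parts = name.split("/")
            if len(parts) == 1:
                # An entry without any slash is a file stored in the archive root.
                problems.append(f"file in archive root: {name}")
                continue
            files = pages.setdefault(parts[0], set())
            if len(parts) == 2 and parts[1]:
                files.add(parts[1])
    if not pages:
        problems.append("no page directory found")
    for page, files in sorted(pages.items()):
        has_main = any(f.startswith("index.") and f != "index.rdf" for f in files)
        if "index.rdf" not in files and not has_main:
            problems.append(f"{page}: neither index.rdf nor an index.* main document")
    return problems
```

Returning a list of messages rather than a single boolean lets a caller report every violation at once.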

Matching file extensions with MIME media types

Assignment of MIME type to individual files is implemented by ensuring that the file names of supported content match a list of well-known file extensions. [Conformance level: normal]

The well-known file extensions, with their related media types, are as follows:

When storing files that don't have a well-known media type, the use of any file extension is acceptable. Generally, in this case implementors should use the extension to type mapping provided by the operating system. In case this association is not available, the file extension of the original file may be used. [Conformance level: normal]
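The fallback order described above can be sketched in Python as follows. Several assumptions apply: the WELL_KNOWN table contains only the HTML entry, because that is the only mapping confirmed elsewhere in this text (index.html); the standard mimetypes module stands in for the operating system's mapping; and the function name choose_extension is hypothetical.

```python
import mimetypes
import os.path

# Placeholder table: only the HTML mapping is taken from this specification.
WELL_KNOWN = {"text/html": ".html"}


def choose_extension(media_type: str, original_name: str) -> str:
    """Pick a file extension for content of the given media type."""
    # 1. Well-known media types always use their fixed extension.
    if media_type in WELL_KNOWN:
        return WELL_KNOWN[media_type]
    # 2. Otherwise ask the platform mapping (approximated here by mimetypes).
    guessed = mimetypes.guess_extension(media_type)
    if guessed:
        return guessed
    # 3. As a last resort, keep the extension of the original file.
    return os.path.splitext(original_name)[1]
```

For example, an unrecognized media type saved from a file named data.foo keeps the .foo extension.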

Metadata

The format used for storing metadata about the archived files in the index.rdf and history.rdf files is RDF/XML.

Some restrictions for the RDF/XML file format are still to be specified. These restrictions are required to allow for read-enabled implementations that are as simple as possible. In particular, only one of the possible XML representations of the RDF graph is valid. This is the current representation that uses the MAF XML namespace. In this way, a full RDF parser would not be required to read the metadata.

Further restrictions on the structure and format of the XML files are under consideration. For example, implementations might be required to put one tag per line, always use UTF-8 encoding, and never encode characters as entities unless necessary. This would make room for very simple implementations that don't embed an XML parser.

The following information is stored in index.rdf:

Date and time

The date and time of the save operation should be stored using the format described in RFC 5322 section 3.3. If this format is not used, the Mozilla JavaScript Date format must be used. Implementations must be able to parse both formats. [Conformance level: basic]
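For the recommended format, Python's standard email.utils module can both produce and parse RFC 5322 date strings. The sketch below covers only that format; a conforming reader would also need a parser for the Mozilla JavaScript Date format, which is not shown here. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime


def format_archive_time(moment: datetime) -> str:
    """Format a timezone-aware datetime as an RFC 5322 date string."""
    return format_datetime(moment)


def parse_archive_time(text: str):
    """Parse an RFC 5322 date string; return None if it cannot be parsed."""
    try:
        return parsedate_to_datetime(text)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None
```

A caller that receives None from parse_archive_time would then try the second format before giving up.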

Title

If the file format of the main document allows an explicit page title to be specified, as HTML does, the title from the metadata should match the title from the main document. If the title is not available, this field should be omitted. [Conformance level: basic]

This metadata field allows applications that want to display information about the contents of MAFF archives to do this without embedding an HTML parser. For example, an extension for a file manager might only use the title from the metadata, while a web browser might ignore this field and only display the title resulting from parsing the main document. [Information]

Character set

The character set declared in the MAFF archive should be used when parsing the contents of all the files that do not declare a different character set. [Conformance level: basic]
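The precedence rule can be expressed as a short helper. In this Python sketch, the function name decode_file, its parameters, and the final UTF-8 default are assumptions made for illustration; the specification itself does not define what happens when neither character set is known.

```python
from typing import Optional


def decode_file(
    data: bytes,
    declared_in_file: Optional[str],
    archive_charset: Optional[str],
    default: str = "utf-8",  # assumption: not specified by the format
) -> str:
    """Decode file contents, preferring the file's own charset declaration."""
    # A charset declared by the file itself wins; the archive-level charset
    # from the metadata applies only when the file declares none.
    charset = declared_in_file or archive_charset or default
    return data.decode(charset, errors="replace")
```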

Extended metadata

Inside each first-level directory of a MAFF archive, the second-level directory named ^metadata^ (case-sensitive) is reserved and should not contain actual content. A file or folder named ^metadata^ (case-insensitive) should not exist inside any first-level directory. [Conformance level: extended]

File names inside the ^metadata^ folder should be limited to a sequence of up to 20 lowercase ASCII characters or hyphens (-), followed by an optional lowercase file extension beginning with a dot (.). [Conformance level: extended]

Inside the ^metadata^ folder, file names that begin with x- are reserved for custom extensions to the MAFF format. Implementors that want to store additional metadata that is not documented in this specification must store it using such reserved names, for example 12345678_123/^metadata^/x-custom-info.rdf.
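The naming rules for the ^metadata^ folder lend themselves to a regular expression. The Python sketch below makes two assumptions that the text does not settle: "lowercase ASCII characters" is read as the letters a to z, and the extension is read as lowercase letters or digits. The function names are hypothetical.

```python
import re

# Up to 20 lowercase letters or hyphens, then an optional ".ext" suffix.
# Assumption: the extension consists of lowercase letters and digits.
_METADATA_NAME = re.compile(r"[a-z-]{1,20}(\.[a-z0-9]+)?")


def is_valid_metadata_name(name: str) -> bool:
    """Check a file name against the ^metadata^ naming rules."""
    return _METADATA_NAME.fullmatch(name) is not None


def is_custom_extension(name: str) -> bool:
    """Return True for valid names in the reserved x- namespace."""
    return is_valid_metadata_name(name) and name.startswith("x-")
```

With these rules, the example name x-custom-info.rdf is valid and is recognized as a custom extension.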

The provisions above are candidates for inclusion in the normal conformance level. However, at present they are considered extended and are subject to change.

Contents of extended metadata

The base specification does not provide a facility for storing the original URL from which each individual file was copied, as each page is handled as an atomic unit. This feature is under consideration as an extension, but even if this information is present, implementations must not rely on it to be able to properly display the page. [Conformance level: extended]

The base specification does not consider custom MIME or HTTP headers associated with individual files, as the MAFF format is designed to work in the same way as when a saved page is opened directly from a local file system. This feature is, however, under discussion as an extension. Implementations should not rely on this feature unless necessary, since it may not be available in other implementations. [Conformance level: extended]

Storing and replaying arbitrary MIME headers in an archive raises security considerations. While static information about the content itself, like the Content-Type header, is generally safe to use from any location, information about the relation of the content to other resources may no longer apply when the resource is moved. For example, a site may be listed as allowed in a Content Security Policy header, but this trust relationship is only relevant at the time the content is generated, and should not be used later.

Test cases

Web archives demonstrating the features of the file format are available here: