Introduction
"The primary function of XML is to consume RAM and datacommunication bandwidth"
Canadian Mind Products
XML is becoming a standard, and it is more and more difficult to avoid. Unfortunately, XML documents tend to be very large, which makes parsing slow.
AsmXml is an attempt to reduce the CPU time wasted on parsing XML files. There are already thousands of XML parsers, but most of them focus more on conformance than on performance. AsmXml tries to reach high performance by sacrificing conformance.
Features
- Written in pure assembler
The whole parsing routine is written in assembler, not just the critical parts.
- Optimized memory access
Each character of the input is read from memory only once, and data is not copied unless necessary.
- Parsing, validation and lookup at the same time
This helps a lot with optimization: an attribute is read only once, which avoids any further lookup. There is no need to check that a name is valid according to the XML specification and then check again that it is valid according to the schema.
- It is a real parser
It handles CDATA, character references and entity references, and checks for duplicate attributes. It does not leave you with a half-parsed document.
Limits
- Not conformant
AsmXml does not support namespaces or external entity references.
- The file has to be fully loaded into memory
You can't parse a file over 4 GB.
1. Supports Only an XML Subset
The goal is to drop the bloated, rarely needed parts of XML in order to make parsing much faster.
Namespaces
AsmXml reads namespaces but ignores them. Namespaces make parsing more complicated and slower. They also make the DOM API more complicated: just look at the Element interface in DOM Level 2, where 7 methods were added to the 8 existing ones just to support namespaces.
Since supporting namespaces would dramatically slow AsmXml down, it will never be implemented.
Entity References
AsmXml supports standard entity references:
- &quot;
- &amp;
- &apos;
- &lt;
- &gt;
External entity references are not handled yet.
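As an illustration of what handling these five references involves, here is a minimal C sketch. It is not AsmXml's code (which is written in assembler), and decode_entities is a hypothetical name. It only covers the five named entities listed above, so character references such as &#65; are rejected by this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the AsmXml API.
   Decodes the five predefined entity references found in src[0..len)
   and writes the result to dst as a zero-terminated string.
   dst must have room for len + 1 bytes (decoding never grows the text).
   Returns the decoded length, or (size_t)-1 on an unknown entity. */
size_t decode_entities(const char *src, size_t len, char *dst)
{
    static const struct { const char *name; size_t n; char ch; } table[] = {
        { "&quot;", 6, '"'  },
        { "&amp;",  5, '&'  },
        { "&apos;", 6, '\'' },
        { "&lt;",   4, '<'  },
        { "&gt;",   4, '>'  },
    };
    const size_t count = sizeof table / sizeof table[0];
    size_t in = 0, out = 0;

    assert(src != NULL && dst != NULL);
    while (in < len) {
        size_t i;

        if (src[in] != '&') {           /* ordinary character: copy it */
            dst[out++] = src[in++];
            continue;
        }
        for (i = 0; i < count; i++) {   /* try each known entity */
            if (len - in >= table[i].n &&
                memcmp(src + in, table[i].name, table[i].n) == 0) {
                dst[out++] = table[i].ch;
                in += table[i].n;
                break;
            }
        }
        if (i == count)
            return (size_t)-1;          /* unknown or truncated entity */
    }
    dst[out] = '\0';
    return out;
}
```

For example, decoding "a &lt; b" with this sketch yields "a < b".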
Encoding
AsmXml handles only 8-bit encodings and UTF-8 encoding.
Since AsmXml decodes elements and attributes while parsing, it does not have to check that names are properly encoded. It only has to check that they match the names in the class definition. The only restriction is that the encoding of the XML must be compatible with the class definition:
- if element and attribute names are ASCII only, the XML can use any 8-bit encoding or UTF-8
- if element and attribute names use a non-ASCII character, the XML file must use the same encoding as the class definition
Support of UTF-16 is not planned.
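The first case of the rule above depends on names containing only bytes below 0x80, which are the same in UTF-8 and in ASCII-compatible 8-bit encodings. A small C sketch of such a check follows. The function name is_ascii_name is hypothetical and not part of AsmXml.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical check, not part of the AsmXml API.
   Returns 1 if every byte of the zero-terminated name is plain ASCII
   (0x01..0x7F), 0 if at least one byte has its high bit set. */
int is_ascii_name(const char *name)
{
    const unsigned char *p;

    assert(name != NULL);
    for (p = (const unsigned char *)name; *p != '\0'; p++) {
        if (*p > 0x7F)
            return 0;   /* byte depends on the encoding in use */
    }
    return 1;
}
```

If this returns 0 for any name in the class definition, the XML file has to use the same encoding as that definition.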
2. Written in Pure Assembler
Despite the progress made by compilers, assembler is still the best language for writing efficient code. Careful use of registers and fine control over memory accesses and code flow can make a big difference.
Today it is also possible to write portable assembler. An x86 assembler program can run on:
- Windows
- Linux
- *BSD
- Solaris
- Mac OS X
The library has been tested on Windows and Linux, but the ELF binary might also work on FreeBSD without any change, because it uses the same calling convention and register usage as Linux.
The code uses only the 80386 instruction set and none of the advanced extensions of recent processors, so the library should work on old machines.
3. Optimized Memory Accesses
Despite the incredible speed of today's machines and their multiple levels of cache, reading and writing memory is still slower than using registers.
The less a program accesses memory, the faster it runs.
AsmXml reads each character from memory once and only once. Attribute values and element text are not copied unless they contain escaped characters (entity or character references). AsmXml keeps only a pointer to the beginning and to the end of each value.
This also saves a conversion from an 8-bit or UTF-8 encoding to an internal UTF-16 representation. Such a conversion is wasted work, especially if the values are to be written back to another file or passed to an API that requires UTF-8, such as GTK+.
An intermediate copy is also avoided if the value must be passed to a high-level language (e.g. Python, Ruby, ...).
Only as a last resort may you have to copy the value to get a zero-terminated string.
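The following C sketch illustrates the idea of keeping values as a pair of pointers into the input buffer and copying only as a last resort. The type and function names (ValueRange, value_equals, value_to_cstring) are hypothetical and are not the AsmXml API. The sketch also assumes that the end pointer points one past the last character, which the text above does not specify.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type: a value that still lives inside the input buffer. */
typedef struct {
    const char *begin;  /* first character of the value                 */
    const char *end;    /* assumed here: one past the last character    */
} ValueRange;

static size_t value_length(ValueRange v)
{
    assert(v.begin != NULL && v.begin <= v.end);
    return (size_t)(v.end - v.begin);
}

/* Compares the value with a C string without copying anything. */
int value_equals(ValueRange v, const char *s)
{
    size_t n = strlen(s);
    return value_length(v) == n && memcmp(v.begin, s, n) == 0;
}

/* Last resort: makes a zero-terminated heap copy of the value.
   Returns NULL if allocation fails. The caller must free() the result. */
char *value_to_cstring(ValueRange v)
{
    size_t n = value_length(v);
    char *copy = malloc(n + 1);

    if (copy == NULL)
        return NULL;
    memcpy(copy, v.begin, n);
    copy[n] = '\0';
    return copy;
}
```

In this sketch most operations (comparison, writing to a file with fwrite, handing a pointer and a length to another API) can work on the range directly, and value_to_cstring is needed only when an API insists on a zero-terminated string.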
4. Parsing and Decoding at the Same Time
Here, "decoding" means locating an attribute or a particular text element, not converting a raw string to a typed value such as integer or timestamp.
Parsing and decoding at the same time makes it possible to read the attribute name only once. This reduces memory accesses and avoids performing two separate verifications on a name:
- The name contains only valid characters (according to the XML specification)
- The name is a valid name (according to a DTD, XSD, or anything else)
If the second assertion holds, the first one is implicitly verified as well.
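The single-pass idea can be sketched in C as follows. This is only an illustration under stated assumptions, not AsmXml's actual algorithm: the function name match_attribute, the list of expected names and the bitmask of surviving candidates are all hypothetical. Each input character is read once and compared against the names that are still possible, so a name that matches one of the expected names never needs a separate check against the XML name rules.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the AsmXml API.
   Matches the attribute name at *cursor against names[0..count).
   Each input character is read exactly once.
   Returns the index of the matching name and moves *cursor to the
   character that ended the name, or returns -1 if nothing matches. */
int match_attribute(const char **cursor, const char *const names[], int count)
{
    const char *p = *cursor;
    unsigned alive;             /* bit i set: names[i] still matches */
    size_t pos = 0;

    assert(count > 0 && count <= 32);
    alive = (count == 32) ? ~0u : ((1u << count) - 1u);

    for (;; p++, pos++) {
        char c = *p;            /* the only read of this character */
        int at_end = (c == '=' || c == ' ' || c == '\t' || c == '\n' ||
                      c == '\r' || c == '>' || c == '/' || c == '\0');
        int i;

        for (i = 0; i < count; i++) {
            char expected;

            if (!(alive & (1u << i)))
                continue;
            expected = names[i][pos];
            if (at_end && expected == '\0') {
                *cursor = p;    /* whole name matched */
                return i;
            }
            if (at_end || expected != c)
                alive &= ~(1u << i);    /* candidate eliminated */
        }
        if (at_end || alive == 0)
            return -1;
    }
}
```

With the expected names "id", "idref" and "name", the input `idref="x"` returns 1 and leaves the cursor on the `=` sign, while an unknown name such as `bogus` is rejected after its first character.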
5. The File has to be Fully Loaded into Memory
AsmXml can parse large files, but only up to a point because of this constraint: the file cannot exceed 4 GB, and since it must share the process address space with code and other data, you shouldn't expect to handle files larger than 3 GB.
With this algorithm, each character is read from memory once and only once, and characters are read in order, so an I/O function such as fgetc() would work perfectly well, except that it would be extremely slow compared to direct memory access.
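One way to satisfy this constraint is to read the whole file into a heap buffer before parsing. The C sketch below uses only the standard library. The function name load_file is hypothetical and not part of AsmXml, and the extra terminating zero byte is a convenience of this sketch rather than a documented requirement. Because it relies on ftell(), which returns a long, it is limited to files whose size fits in that type.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper, not part of the AsmXml API.
   Reads the whole file at path into a newly allocated buffer and
   stores its size in *size_out. A zero byte is appended after the data.
   Returns NULL on any error. The caller must free() the result. */
char *load_file(const char *path, size_t *size_out)
{
    FILE *f;
    long size;
    char *buf = NULL;

    assert(path != NULL && size_out != NULL);
    f = fopen(path, "rb");
    if (f == NULL)
        return NULL;

    /* Find the file size by seeking to the end. */
    if (fseek(f, 0, SEEK_END) == 0 && (size = ftell(f)) >= 0) {
        rewind(f);
        buf = malloc((size_t)size + 1);
        if (buf != NULL) {
            if (fread(buf, 1, (size_t)size, f) == (size_t)size) {
                buf[size] = '\0';
                *size_out = (size_t)size;
            } else {
                free(buf);      /* short read: give up */
                buf = NULL;
            }
        }
    }
    fclose(f);
    return buf;
}
```

The returned buffer and size can then be handed to the parser, and the buffer must stay allocated for as long as values pointing into it are in use.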