ITSF internal file formats

Preface

In this section, where the description of a file says that an item is an offset into another file, that file may be located in the same CHM, or it may be located in an accompanying CHI file.

The different types of ITSF files contain different internal files. The list below indicates which file types contain which internal files:

CHI
/#ITBITS, /#SYSTEM, /#IDXHDR, /#STRINGS, /#TOCIDX, /#TOPICS, /#URLSTR, /#URLTBL, /#WINDOWS, /$OBJINST, /$WWAssociativeLinks/BTree, /$WWAssociativeLinks/Data, /$WWAssociativeLinks/Map, /$WWAssociativeLinks/Property, /$WWKeywordLinks/BTree, /$WWKeywordLinks/Data, /$WWKeywordLinks/Map, /$WWKeywordLinks/Property
CHM
/#ITBITS, /#SYSTEM, /#IDXHDR, /#STRINGS, /#TOCIDX, /#TOPICS, /#URLSTR, /#URLTBL, /#IVB, /#SUBSETS, /#WINDOWS, /$FIftiMain, /$OBJINST, /$WWAssociativeLinks/BTree, /$WWAssociativeLinks/Data, /$WWAssociativeLinks/Map, /$WWAssociativeLinks/Property, /$WWKeywordLinks/BTree, /$WWKeywordLinks/Data, /$WWKeywordLinks/Map, /$WWKeywordLinks/Property
CHQ
/$FIftiMain, /$OBJINST, /$TitleMap
CHW
/$OBJINST, /$HHTitleMap, /$WWAssociativeLinks/BTREE, /$WWAssociativeLinks/DATA, /$WWAssociativeLinks/MAP, /$WWAssociativeLinks/PROPERTY, /$WWKeywordLinks/BTREE, /$WWKeywordLinks/DATA, /$WWKeywordLinks/MAP, /$WWKeywordLinks/PROPERTY
hh.dat
/Path/file.chm/windowtype, /Path/file.chm/AdvSearchUI/Keywords, /Path/file.chm/AdvSearchUI/Properties, /Path/file.chm/Bookmarks/v1/Count, /Path/file.chm/Bookmarks/v1/n/Topic, /Path/file.chm/Bookmarks/v1/n/Url
KPD
/#KEY_DATA, /#KEY_DELETED
Seen in HHA.dll or on the internet, but not seen in any ITSF files
/#GRPINF (see helpdeco docs by Manfred Winterhoff for a possible function), /#INFOTYPES (probably will be output when MS implements information types), /#URLS (probably a previous incarnation of the /#URLTBL + /#URLSTR combination), /#BSSC (8 bytes, based on something I saw in KeyTools.exe it might contain the version of RoboHelp used to create the CHM. If you have RoboHelp please check this out & be sure to send me your chm.)

Internal file formats

/#ITBITS

The files I have seen so far have been empty or filled with zero BYTEs so who knows. My guess is that it has something to do with information types. The file where it had a non-zero size (12 zero BYTEs in VOICESDK.CHI from the MSDN) also had a non-zero /#SYSTEM code 15 (Information type checksum) entry of 0xFFFFFFFF.

/#SYSTEM

Offset  Type    Comment/Value
0       DWORD   Version number. 2 in files compiled with "Compatibility=1.0", 3 in files compiled with "Compatibility=1.1 or later"
4               /#SYSTEM entries to the EOF

/#SYSTEM entries have the following format:

Offset  Type    Comment/Value
0       WORD    code - see below for values & meanings
2       WORD    length of data
4       BYTEs   data
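
The layout above can be walked with a few lines of Python. This is only a sketch: it assumes little-endian integers (not stated in this section, but usual for these files), and the function name parse_system and the sample bytes are made up for illustration.

```python
import struct

def parse_system(data: bytes):
    """Return (version, [(code, payload), ...]) from raw /#SYSTEM bytes.

    Assumption: all integers are little-endian.
    """
    (version,) = struct.unpack_from("<I", data, 0)
    entries = []
    pos = 4
    # Each entry: WORD code, WORD length, then `length` bytes of data.
    while pos + 4 <= len(data):
        code, length = struct.unpack_from("<HH", data, pos)
        pos += 4
        entries.append((code, data[pos:pos + length]))
        pos += length
    return version, entries

# Hand-built sample: version 3, a code 3 (Title) entry and a code 12 entry.
sample = (struct.pack("<I", 3)
          + struct.pack("<HH", 3, 5) + b"Demo\0"
          + struct.pack("<HHI", 12, 4, 0))
version, entries = parse_system(sample)
```

The payload is returned as raw bytes because its meaning depends on the code, as listed below.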

The codes appear in the /#SYSTEM file in the following order: 10, 9, 4, 2, 3, 16, 6, then either 5, 0, 1 or 0, 1, 5 (I haven't been able to make files with all three), then 7, 11, 12, 13, 14, 8 and lastly 15.

An explanation for each of the /#SYSTEM codes
Code      Explanation
0         Value of Contents file in [OPTIONS] section of the HHP file. NT
1         Value of Index file in [OPTIONS] section of the HHP file. NT
2         Value of Default topic in [OPTIONS] section of the HHP file. NT
3         Value of Title in [OPTIONS] section of the HHP file. NT
4         28 (HHA Version 4.72.7294 and earlier) or 36 (HHA Version 4.72.8086 and later) byte structure:
          Offset  Type    Comment/Value
          0       DWORD   LCID from the HHP file.
          4       DWORD   One if DBCS is in use.
          8       DWORD   One if full-text search is on.
          0xC     DWORD   Non-zero if the file has KLinks.
          0x10    DWORD   Non-zero if the file has ALinks.
          0x14    QWORD   timestamp - Definitely not a straightforward Win32 FILETIME structure. On odd hours it seems to be reduced by a factor of 15, compared to even hours.
          0x1C    DWORD   0/1 (unknown) Only dsmsdn.chi from the MSDN has 1 here. Perhaps 1 means it is the root chm of a collection?
          0x20    DWORD   0 (unknown)
5         Value of Default Window in [OPTIONS] section of the HHP file. NT
6         Value of Compiled file in [OPTIONS] section of the HHP file. This is the lowercase of the stem of the CHM file name. If the name of the CHM is "..\bar\foo\ FOO-Bar . chm jimmy is a poo-bum" then this will be " foo-bar ". NT
7*        DWORD present in files with "Binary Index=Yes". The entry in the /#URLTBL file that points to the sitemap index had the same first DWORD.
8         Rare. VOICESDK.CHM & CHI and WOSA.CHI from the MSDN have one. The abbreviations and explanations seem to be the same in WOSA.CHI & VOICESDK.CHM, except for 2 mistakes (one in VOICESDK.CHM & one in WOSA.CHI) that seem to be created by bugs in the compiler. Both were compiled by the same version of HHA (4.72.8086), so perhaps this version has some weird bug. Each entry is 16 BYTEs:
          Offset  Type    Comment/Value
          0       DWORD   0, 4 in some (unknown)
          4       DWORD   Offset in /#STRINGS file. An abbreviation.
          8       DWORD   3 where 1st DWORD is 0, 5 where it is 4 (unknown)
          0xC     DWORD   Offset in /#STRINGS file. An explanation of the abbreviation.
9         The version/program that the CHM was compiled by - shown in the version dialog as "Compiled with %s" where %s is what is in this entry of the /#SYSTEM file. If compiled with the MS HTML Help Author dll then it will be something like "HHA Version 4.74.8702". It comes directly from the resource strings of HHA.dll (I saw it there in Unicode and successfully altered it). Beware that the text control in the version dialog that displays it is only so big: in some cases the string won't be displayed, & in other cases only part of it, depending upon the effect of wrapping. So if you write a compiler, be sure to test it and use a short name and version. Usually NT, but HH won't crash if it isn't.
10        time_t timestamp (DWORD). Not sure of the start year yet.
11*       DWORD present in files with "Binary TOC=Yes". The entry in the /#URLTBL file that points to the sitemap contents has the same first DWORD.
12*       Number of information types (DWORD).
13*       The /#IDXHDR file contains exactly the same bytes. See below for more info
14        Rare. The ones I saw were from MS Word 2000. My guess is that it is an MSOffice extension (or maybe not) that overrides the names & window types of the navigation tabs. DWORD number of windows to override, 2 ANSI/UTF-8 NT strings for each window. The first is the text for the tab & the second is probably the name of the window type to use. (eg 2, "&Answer Wizard\0MsoHelpAWDlg\0&Index\0MsoHelpKeyDlg\0")
          These are from the Custom tab variables of the [OPTIONS] section of the HHP file. The resources from MSOHELP.EXE have a weird .reg file that gives the CLSIDs involved in the provision of these dialogs.
15*       Information type checksum (DWORD). Unknown algorithm & data source.
16        Value of Default Font in [OPTIONS] section of the HHP file. NT
17-65535  Not yet seen. Please let us know if you see these.
* Not present in files with "Compatibility=1.0"

/#IDXHDR

This has exactly the same bytes as the code 13 entry in the /#SYSTEM file and is 4096 bytes long.

Offset  Type         Comment/Value
0       char[4]      T#SM
4       DWORD        Unknown timestamp/checksum
8       DWORD        1 (unknown)
0xC     DWORD        Number of topic nodes including the contents & index files
0x10    DWORD        0 (unknown)
0x14    DWORD        Offset in the /#STRINGS file of the ImageList param of the "text/site properties" object of the sitemap contents (0/-1 = none)
0x18    DWORD        0 (unknown)
0x1C    DWORD        1 if the value of the ImageType param of the "text/site properties" object of the sitemap contents is "Folder". 0 otherwise.
0x20    DWORD        The value of the Background param of the "text/site properties" object of the sitemap contents
0x24    DWORD        The value of the Foreground param of the "text/site properties" object of the sitemap contents
0x28    DWORD        Offset in the /#STRINGS file of the Font param of the "text/site properties" object of the sitemap contents (0/-1 = none)
0x2C    DWORD        The value of the Window Styles param of the "text/site properties" object of the sitemap contents
0x30    DWORD        The value of the ExWindow Styles param of the "text/site properties" object of the sitemap contents
0x34    DWORD        Unknown. Often -1. Sometimes 0.
0x38    DWORD        Offset in the /#STRINGS file of the FrameName param of the "text/site properties" object of the sitemap contents (0/-1 = none)
0x3C    DWORD        Offset in the /#STRINGS file of the WindowName param of the "text/site properties" object of the sitemap contents (0/-1 = none)
0x40    DWORD        Number of information types.
0x44    DWORD        Unknown. Often 1. Also 0, 3.
0x48    DWORD        Number of files in the [MERGE FILES] list
0x4C    DWORD        Unknown. Often 0. Non-zero mostly in files with some files in the merge files list.
0x50    DWORD[1004]  List of offsets in the /#STRINGS file that are the [MERGE FILES] list. Zero terminated, but don't count on it.

/#WINDOWS

This file contains information on the window types in the CHM. It has the following format:

Offset  Type    Comment/Value
0       DWORD   Number of entries in the file
4       DWORD   Size of each of the entries in the file (188 or 196)
8               /#WINDOWS entries to the EOF

/#WINDOWS entries are basically HH_WINTYPE structures as specified in htmlhelp.h. Note the first DWORD can be used to specify different versions of this structure. Also note that the HHW docs show a different structure to htmlhelp.h. Therefore many CHM files need to be surveyed to find structures with sizes other than 188 or 196. In the description of /#WINDOWS entries below, Arg n means that that item is argument n of the window definition in the HHP file, either converted to a DWORD or to an offset in the indicated file:

The format of each /#WINDOWS entry.
Offset  Type      Comment/Value
0       DWORD     Size of the entry (188 in CHMs compiled with "Compatibility=1.0", 196 in CHMs compiled with "Compatibility=1.1 or later")
4       DWORD     0 (unknown) - but htmlhelp.h indicates that this is "BOOL fUniCodeStrings; // IN/OUT: TRUE if all strings are in UNICODE"
8       DWORD     Arg 0. Offset in /#STRINGS file.
0xC     DWORD     Which window properties are valid & are to be used for this window. See the table below.
0x10    DWORD     Arg 10.
0x14    DWORD     Arg 1. Offset in /#STRINGS file.
0x18    DWORD     Arg 14.
0x1C    DWORD     Arg 15.
0x20    RECT      Arg 13. Order left, top, right & bottom.
0x30    DWORD     Arg 16.
0x34    DWORD     0 (unknown) - but htmlhelp.h indicates that this is "HWND hwndHelp; // OUT: window handle"
0x38    DWORD     0 (unknown) - but htmlhelp.h indicates that this is "HWND hwndCaller; // OUT: who called this window"
0x3C    DWORD     0 (unknown) - but htmlhelp.h indicates that this is "HH_INFOTYPE* paInfoTypes; // IN: Pointer to an array of Information Types"
0x40    DWORD     0 (unknown) - but htmlhelp.h indicates that this is "HWND hwndToolBar; // OUT: toolbar window in tri-pane window"
0x44    DWORD     0 (unknown) - but htmlhelp.h indicates that this is "HWND hwndNavigation; // OUT: navigation window in tri-pane window"
0x48    DWORD     0 (unknown) - but htmlhelp.h indicates that this is "HWND hwndHTML; // OUT: window displaying HTML in tri-pane window"
0x4C    DWORD     Arg 11.
0x50    BYTE[16]  0 (unknown) - but htmlhelp.h indicates that this is a RECT that is "RECT rcHTML; // OUT: HTML window coordinates" & the HHW docs say "Specifies the coordinates of the Topic pane."
0x60    DWORD     Arg 2. Offset in /#STRINGS file.
0x64    DWORD     Arg 3. Offset in /#STRINGS file.
0x68    DWORD     Arg 4. Offset in /#STRINGS file.
0x6C    DWORD     Arg 5. Offset in /#STRINGS file.
0x70    DWORD     Arg 12.
0x74    DWORD     Arg 17.
0x78    DWORD     Arg 18.
0x7C    DWORD     Arg 19.
0x80    DWORD     Arg 20.
0x84    BYTE[20]  0 (unknown) - but htmlhelp.h indicates that this is "BYTE tabOrder[HH_MAX_TABS + 1]; // IN/OUT: tab order: Contents, Index, Search, History, Favorites, Reserved 1-5, Custom tabs"
0x98    DWORD     0 (unknown) - but htmlhelp.h indicates that this is "int cHistory; // IN/OUT: number of history items to keep (default is 30)"
0x9C    DWORD     Arg 7. Offset in /#STRINGS file.
0xA0    DWORD     Arg 9. Offset in /#STRINGS file.
0xA4    DWORD     Arg 6. Offset in /#STRINGS file.
0xA8    DWORD     Arg 8. Offset in /#STRINGS file.
0xAC    BYTE[16]  0 (unknown) - but htmlhelp.h indicates that this is a RECT that is "RECT rcMinSize; // Minimum size for window (ignored in version 1)"
Everything after here is only present in CHMs compiled with "Compatibility=1.1 or later".
0xBC    DWORD     0 (unknown) - but htmlhelp.h indicates that this is "int cbInfoTypes; // size of paInfoTypes;"
0xC0    DWORD     0 (unknown) - but htmlhelp.h indicates that this is "LPCTSTR pszCustomTabs; // multiple zero-terminated strings"

Flags used to specify which values are valid.
Value       Valid property
0x00000002  Navigation Pane Style.
0x00000004  Style Flags.
0x00000008  Extended Style Flags.
0x00000010  Initial Position.
0x00000020  Navigation Pane Width.
0x00000040  Show state.
0x00000080  Info types.
0x00000100  Buttons.
0x00000200  Navigation Pane initially closed state.
0x00000400  Tab pos.
0x00000800  Tab order.
0x00001000  History count.
0x00002000  Default Pane.
0x?????00?  The rest of the values either do nothing or are unknown. Please let us know if you find out what the rest are.

/#STRINGS

This file is a list of ANSI/UTF-8 NT strings. The first is just a NIL character so that offsets to this file can specify zero & get a valid string. The strings are sliced up into blocks that are 4096 bytes in length. If a string crosses the end of a block then it will be cut off without a NT and repeated in full, with a NT, at the start of the next block. For example, "To customize the appearance of a contents file" might become "To customize the <block ending>To customize the appearance of a contents file" when there are 17 bytes left at the end of the block.
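
Because a string that would straddle a block boundary is repeated in full in the next block, resolving an offset should only need a scan to the next NUL. A small Python sketch follows; read_string is a hypothetical helper, and the default code page of cp1252 is an assumption, since the real encoding depends on the file.

```python
def read_string(strings: bytes, offset: int, encoding: str = "cp1252") -> str:
    """Return the NT string that starts at `offset` in raw /#STRINGS bytes.

    Assumption: `encoding` is a guess; pick the code page that matches the CHM.
    """
    end = strings.index(b"\0", offset)   # raises ValueError if no terminator
    return strings[offset:end].decode(encoding)

# Hand-built sample: the leading NIL, then two strings.
sample = b"\0Main\0index.htm\0"
title = read_string(sample, 1)
```

Offset 0 yields the empty string, which is why other files can use zero to mean "none".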

The strings are in this order; "\0", [WINDOWS] (Arg 0, Arg 1, Arg 7, Arg 9, Arg 2, Arg 3, Arg 4, Arg 5, Arg 6, Arg 8) #n..., Contents_0_Entry_title, Index_0_Keyword, Contents_Image_file, Contents_Font, Contents_Default_frame, Contents_Default_window, [MERGE FILES] #n...

/#TOCIDX

Present in files with a non-empty contents file, "Binary TOC=Yes" and "Compatibility=1.1 or later".

This file is made up of 0x1000 byte blocks, but this is only apparent because of extra bytes interrupting what would otherwise be a stream of 20/28 byte structs. If the other parts (DWORDS & 16 byte structs) didn't fit into these blocks then presumably this would show up in the other parts too.

The first block is the header:

/#TOCIDX header
Offset  Type        Comment/Value
0       DWORD       4096/header length/offset of #1 below
4       DWORD       offset of #3 below
8       DWORD       number of #3 below
0xC     DWORD       offset of #2 below
0x10    BYTE[4080]  0 (unknown)

The header is followed by the following different types of structs in the specified order:

  1. 20/28 byte structs (pages/books)
  2. list of dwords into /#TOPICS file
  3. 16 byte structs - links above stuff

First all the top level books/pages, then the next level, then the next & so on

20/28 byte structures
Offset  Type    Comment/Value
0       WORD    0 (unknown)
2       WORD    Unknown
4       DWORD   Seems to be a bit field: 0x2 is whether or not the New value is set to 1, 0x4 is set when the entry is a book/has children and 0x8 is set when the entry has a Local value. The other bits are unknown (0x1, 0x40, 0x100 are sometimes set on books).
8       DWORD   Unknown. In some cases it is an index into the /#TOPICS file of the entry containing offsets to the title & filename.
0xC     DWORD   Offset to the parent book.
0x10    DWORD   Offset to the next book/page in the current book/page.
The next two DWORDs are only present in books (28 byte structs)
0x14    DWORD   Offset to the first child of the book.
0x18    DWORD   0 (unknown)

16 byte structures
Offset  Type    Comment/Value
0       DWORD   Offset into #1 above.
4       DWORD   Some kind of sequence number that is incremented by one and starts at 666. I swear :)
8       DWORD   Offset into #2 above. Can contain RAM litter.
0xC     DWORD   Index in /#TOPICS file of the entry containing offsets to the title & filename. Can contain RAM litter.

/#TOPICS

An index into this file can be converted to an offset in the /#URLTBL file, without reading this file using the following formula: offset = (index%341)*12 + index/341*4096
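
The formula can be written out as a short Python sketch (the function and constant names are made up for illustration). The 341 comes from the number of 12-byte entries that fit in one 4096-byte /#URLTBL block.

```python
ENTRIES_PER_BLOCK = 341   # 12-byte entries per 4096-byte /#URLTBL block
ENTRY_SIZE = 12
BLOCK_SIZE = 4096

def urltbl_offset(topic_index: int) -> int:
    """Offset in /#URLTBL of the entry that belongs to a /#TOPICS index."""
    block, slot = divmod(topic_index, ENTRIES_PER_BLOCK)
    return slot * ENTRY_SIZE + block * BLOCK_SIZE

first_in_second_block = urltbl_offset(341)
```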

This file contains information on the topics present.

Each entry has the following format.

Offset  Type    Comment/Value
0       DWORD   Offset into the tree in the /#TOCIDX file.
4       DWORD   Offset in /#STRINGS file of the contents of the title tag or the Name param of the file in question. -1 = no title.
8       DWORD   Offset in /#URLTBL of entry containing offset to /#URLSTR entry containing the URL.
0xC     WORD    2 indicates not in contents, 6 indicates that it is in the contents, 0/4 something else (unknown)
0xE     WORD    0, 2, 4, 8, 10, 12, 16, 32 (unknown)
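
As a sketch, the 16-byte entries can be unpacked with Python's struct module. The field names and the TopicEntry type are invented for this example, little-endian integers are assumed, and a title offset of -1 is mapped to None.

```python
import struct
from typing import NamedTuple, Optional

class TopicEntry(NamedTuple):
    tocidx_offset: int            # offset into the tree in /#TOCIDX
    title_offset: Optional[int]   # offset in /#STRINGS, None when stored as -1
    urltbl_offset: int            # offset of the matching /#URLTBL entry
    contents_flag: int            # 2 = not in contents, 6 = in contents
    unknown: int

def parse_topics(data: bytes):
    """Split raw /#TOPICS bytes into TopicEntry tuples (little-endian assumed)."""
    entries = []
    for pos in range(0, len(data) - 15, 16):
        toc, title, url, flag, unk = struct.unpack_from("<IIIHH", data, pos)
        entries.append(TopicEntry(toc, None if title == 0xFFFFFFFF else title,
                                  url, flag, unk))
    return entries

# Hand-built sample with two entries; the first has no title.
sample = (struct.pack("<IIIHH", 0, 0xFFFFFFFF, 0, 2, 0)
          + struct.pack("<IIIHH", 4096, 17, 12, 6, 0))
topics = parse_topics(sample)
```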

/#URLSTR

This file is made up of 0x4000 byte blocks. If the last block is not filled then it will be smaller than 0x4000 bytes. The free space at the end of the blocks is filled with NUL bytes. The blocks contain the following elements one after another:

An unknown BYTE. So far this has been 0, 0x42 and in spechsdk.chi it was 0x49. Does not indicate presence/absence of URL/FrameName strings.

This is followed by pairs of URL, FrameName strings (both NT) from the HHC.

Then come all the Local strings from the HHC:

Offset  Type    Comment/Value
0       DWORD   Offset of the URL for this topic.
4       DWORD   Offset of the FrameName for this topic.
8               ANSI/UTF-8 NT string that is the Local for this topic.

There is one way to tell where the end of the URL/FrameName pairs occurs: Repeat the following: read 2 DWORDs and if both are less than the current offset then this is the start of the Local strings else skip two NT strings.

/#URLTBL

An offset in this file can be converted to an index into the /#TOPICS file, without reading this file, using the following formula (integer division): index = ((offset%4096) + (offset/4096)*(4096-4))/12
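
A Python sketch of this conversion, written as "slot within the block plus 341 entries for every preceding block", which is the same arithmetic because 4096-4 = 341*12. The function name is hypothetical.

```python
def topics_index(urltbl_offset: int) -> int:
    """Index into /#TOPICS for the entry at `urltbl_offset` in /#URLTBL."""
    block, within = divmod(urltbl_offset, 4096)
    # Each block holds 341 entries of 12 bytes, followed by one extra DWORD.
    return within // 12 + block * 341

index_in_fifth_block = topics_index(4 * 4096)
```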

Each 0x1000 byte block has the following format.

Offset  Type           Comment/Value
0       DWORD[3][341]  341 entries. 12 bytes each.
0xFFC   DWORD          4096 (unknown) possibly the length of the block? That MS would pull this kind of shit is really annoying; they should have just put all the entries one after another, not stuffed in an arbitrary DWORD after every 4092 bytes. From this and other blockness I guess they are optimizing for the Wintel platform.

Each entry has the following format.

Offset  Type    Comment/Value
0       DWORD   Unknown. I suspect that this is either some kind of unique ID or two WORDs.
4       DWORD   Index of entry in /#TOPICS file.
8       DWORD   Offset in /#URLSTR file of entry containing filename.

/#IVB

This is basically the [ALIAS] section of the HHP file.

Offset  Type    Comment/Value
0       DWORD   Size of the file minus 4 (num entries = (filelen-4)/8)
4               /#IVB entries to the EOF

/#IVB entries have the following format.

Offset  Type    Comment/Value
0       DWORD   The value of the alias
4       DWORD   Offset in /#STRINGS file of the file to show
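
A short Python sketch of reading this file, assuming little-endian integers. parse_ivb is a made-up name; it returns (alias value, /#STRINGS offset) pairs and leaves the string lookup to the caller.

```python
import struct

def parse_ivb(data: bytes):
    """Return [(alias_value, strings_offset), ...] from raw /#IVB bytes."""
    (payload_len,) = struct.unpack_from("<I", data, 0)   # file size minus 4
    count = payload_len // 8
    return [struct.unpack_from("<II", data, 4 + 8 * i) for i in range(count)]

# Hand-built sample with two aliases.
sample = struct.pack("<I", 16) + struct.pack("<IIII", 1000, 1, 1001, 11)
aliases = parse_ivb(sample)
```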

/#SUBSETS

This file is present when the [SUBSETS] section is present in the HHP file.

Offset  Type    Comment/Value
0       WORD    0 (unknown)
2       WORD    Number of bytes taken up by the subset entries.

The subset entries currently seem to be garbage left over from previous usage of the same memory locations. Based on the number of bytes per non-whitespace line in the [SUBSETS] section each subset entry is 12 BYTEs in length.

/$FIftiMain

The majority of this description was contributed and/or corrected by Jed Wing.

Empty when "Full-text search=No" or when no HTML files have been indexed. Holds the full-text search information. If you have a word longer than 99 characters in an HTML file then it seems the indexing routines will die during indexing of that file and then skip on to the next one. All word sorting, processing and storage is done case-insensitively and is not case-preserving. Note that files without ".h" in their names will not contribute keywords to this fast-search index. The function of this file seems to be to store the locations of the words found in the HTML files, so the search code can quickly find where those words occur.

This file is yet another tree, more similar to the ITSP directory than the BTree file.

This file makes use of 2 ways of encoding integers in variable length fields: the so called scale and root method and a variant of the ENCINT method used in the PMGL/PMGI directory chunks. For the ENCINTs in this file the bytes are stored least significant first (little endian), whereas in the PMGL/PMGI chunks they are stored most significant first (big endian).
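
A hedged Python sketch of the least-significant-first ENCINT reader. It assumes the usual ENCINT layout of 7 data bits per byte with the high bit set on every byte except the last; that layout is not spelled out in this section, so treat it as an assumption. The function name is made up.

```python
def read_encint_le(data: bytes, pos: int = 0):
    """Read one little-endian ENCINT; return (value, position after it).

    Assumption: 7 data bits per byte, high bit set means "more bytes follow".
    """
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift   # later bytes are more significant
        shift += 7
        if not byte & 0x80:
            return value, pos

value, next_pos = read_encint_le(b"\x81\x01")
```

Single-byte values below 0x80 read the same in either byte order.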

The scale and root method needs two parameters, which I'll call s (scale) and r (root size). In the context of /$FIftiMain files, s always appears to be '2', but any other power of 2 could also work (and might be used in some rare cases). The encoding is as follows:

The integer is encoded as two parts, p (prefix) and q (actual bits). p determines how many bits are stored, as well as implicitly determining the high-order bit of the integer. To encode an integer, p starts out as a single 0. If the integer fits in r bits, you're done. If the integer fits in r+1 bits (i.e. r-th bit is set, counting from 0), prepend a 1 to the p and store the low r bits of the integer in q. Otherwise, while the integer does not fit in the allotted space, prepend a bit to p, and increase the size of q by one bit. It's hard to see from the description, but an example will make it more clear. Using s=2, r=3:

value   p  q
0:      0 000
1:      0 001
2:      0 010
..
7:      0 111
8:     10 000
9:     10 001
10:    10 010
..
15:    10 111
16:   110 0000
17:   110 0001
18:   110 0010
..
30:   110 1110
31:   110 1111
32:  1110 00000
33:  1110 00001
34:  1110 00010
..
62:  1110 11110
63:  1110 11111
64: 11110 000000

and so on.
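
The table can be reproduced with a small Python sketch that works on strings of '0' and '1' characters for readability. It covers s=2 only and assumes r is at least 1; encode_sr and decode_sr are names invented for this example.

```python
def encode_sr(value: int, r: int) -> str:
    """Scale and root encode `value` (s=2, r>=1) as a string of bits."""
    if value < (1 << r):
        return "0" + format(value, f"0{r}b")
    ones = value.bit_length() - r      # number of 1s in the prefix p
    qbits = r + ones - 1               # width of q; the top bit is implied
    low = value & ((1 << qbits) - 1)
    return "1" * ones + "0" + format(low, f"0{qbits}b")

def decode_sr(bits: str, r: int):
    """Decode one value from the front of `bits`; return (value, bits used)."""
    ones = 0
    while bits[ones] == "1":
        ones += 1
    start = ones + 1                   # skip the terminating 0 of p
    if ones == 0:
        return int(bits[start:start + r], 2), start + r
    qbits = r + ones - 1
    value = (1 << qbits) | int(bits[start:start + qbits], 2)
    return value, start + qbits

code_for_17 = encode_sr(17, 3)
```

The prefix length alone fixes both the width of q and the implied high-order bit, which is why no separate length field is needed.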

A scale other than 2 has never been seen, so it is hard to say how s/r encoding works when s=4, etc. The following is how it might work using s=4, r=2:

value  p (base 2) q (base 4)
0:             0 00
1:             0 01
..
14:            0 32
15:            0 33
16:           10 00
17:           10 01
..
30:           10 32
31:           10 33
32:          110 000
33:          110 001
..

and so on (i.e. a base-4 digit is added each time, meaning two bits are added each time). In binary that looks like:

value         p   q
0:             0 0000
1:             0 0001
..
14:            0 1110
15:            0 1111
16:           10 0000
17:           10 0001
..
30:           10 1110
31:           10 1111
32:          110 000000
33:          110 000001
..

Of course, this is all wild speculation, since examples with s other than 2 haven't been seen... But the codes do work this way (i.e. prepending a 1 to the prefix multiplies the additive value 'b' by s and adds another log2(s) bits.)

The file begins with a header that is 0x400 bytes in length.

Offset  Type       Comment/Value
0       BYTE[4]    0x00 0x00 0x28 0x00 (unknown)
4       DWORD      Number of HTML files indexed after any automatic splitting.
8       DWORD      Offset to the last word tree block (4096 less than the file length)
0xC     DWORD      0 (unknown)
0x10    DWORD      The number of "leaf node" pages in the file.
0x14    DWORD      Offset to the last word tree block (4096 less than the file length)
0x18    WORD       Depth of the tree of blocks (i.e. 1 if only leaf nodes, 2 if there is a non-leaf node page to index among the leaf nodes, 3 if there are 2 levels of index node chunks, but could theoretically be even deeper).
0x1A    DWORD      7 (unknown)
0x1E    BYTE       Scale for encoding of "document index" in Word Location Code (WLC) entries
0x1F    BYTE       Root size for encoding of "document index" in WLC entries
0x20    BYTE       Scale for encoding of "code count" in WLC entries
0x21    BYTE       Root size for encoding of "code count" in WLC entries
0x22    BYTE       Scale for encoding of "location codes" in WLC entries
0x23    BYTE       Root size for encoding of "location codes" in WLC entries
0x24    BYTE[10]   0 (unknown)
0x2E    DWORD      Length of the word tree blocks (4096).
0x32    DWORD      0/1 (unknown)
0x36    DWORD      Word index of the last duplicate.
0x3A    DWORD      Character index of the last duplicate. From the first character of the first word. The whitespace after tags is not included. &amp; type things are counted as one character. Line endings are not counted in this.
0x3E    DWORD      Length of the longest word in the list not including NT (maximum of 99).
0x42    DWORD      Number of words including duplicates.
0x46    DWORD      Number of words not including duplicates.
0x4A    DWORD      The total length of all the words including duplicates is this DWORD plus the next one. It is unknown how the split is performed.
0x4E    DWORD      This one is usually smaller than the previous one.
0x52    DWORD      Total length of all the words not including duplicates.
0x56    DWORD      Length of unused/null bytes at the end of the word block (if only 1 block, more than total if > 1 block - possibly some free space in WLC blocks).
0x5A    DWORD      0 (unknown)
0x5E    DWORD      One less than the number of HTML files indexed (not entirely sure)
0x62    BYTE[24]   0 (unknown)
0x7A    DWORD      Windows code page identifier (usually 1252 - Windows 3.1 US (ANSI))
0x7E    DWORD      LCID from the HHP file.
0x82    BYTE[894]  0 (unknown)

The header is followed by pairs of variable size WLC (scale and root encoded) blocks and leaf node chunks (in that order).

Each WLC entry is made up of bit fields packed as tightly as possible. Each entry, however, is right-padded with 0s to a full byte. The fields are encoded as the scale and root variable-length integer format described above, with the parameters taken from the initial header. "Delta coding" is also used in a couple of places to reduce the size of the codes -- that is, the first value is stored verbatim, and subsequent values are stored as a delta or difference from the previous value.

WLC entries
Encoding(s)                                                   Name/Comment
r/s and delta (across the various entries for a single word)  Document index - indicates which document this entry is for by way of an index into the #TOPICS file.
r/s                                                           Code count - the number of times that the word this entry specifies is used in the specified topic.
r/s and delta (within each WLC entry)                         A sequence of location codes - these are the word indices (zero based) where this word occurs in the specified topic.

The leaf and index node chunks are 4096 bytes in length. They begin with a header followed by entries.

Leaf node header
Offset  Type    Comment/Value
0       DWORD   Offset to the next leaf node chunk. 0 if this is the last leaf node chunk.
4       WORD    0 (unknown)
6       WORD    Length of free space at the end of the current leaf node chunk.

This is followed by leaf node entries:

Leaf node entries
Offset  Type    Comment/Value
0       BYTE    Length of the word/partial word in this entry plus one. Maximum of 100.
1       BYTE    Position in the word where characters are placed.
2       BYTEs   Length-1 bytes make up the word or part of the word. Maximum of 99 BYTEs. Not NT
+0      BYTE    Context (0 for body tag, 1 for title tag, other values are unknown)
+1      ENCINT  Number of WLC entries.
+0      DWORD   Offset in this file of the WLC entries for this word.
+4      WORD    0 (unknown)
+6      ENCINT  Number of bytes used by the WLC entries for this word.

This is all fairly complex, so an example will be extremely useful here. This example is taken from a copy of 'windows.chm', the system documentation apparently distributed with some version of Windows 98:

Hex dump of two leaf node entries:

000223d:                                        02 00 31  ...0...........1
0002240: 00 0a 03 04 00 00 00 00 1d 01 01 01 01 20 04 00  ............. ..
0002250: 00 00 00 03

The fields of these two entries are as follows:

Leaf node entries
Field         1st entry  2nd entry
New length    2          1
Old length    0          1
Name          "1"        (empty)
Context       0          1
Num WLC ents  0xA        1
Offset        0x403      0x420
Unknown       0          0
WLCs length   0x1D       3
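
The two entries from the hex dump can be parsed with the Python sketch below. Several things here are assumptions rather than facts from the format: the "position" byte is read as the number of characters reused from the previous word, ENCINTs are taken as 7 bits per byte with the least significant byte first, integers are little-endian, and the caller passes the entry count directly (a real reader would stop at the free space given in the leaf node header). All names are made up.

```python
import struct

def read_encint_le(data, pos):
    """Little-endian ENCINT (assumed 7 bits per byte, high bit = continue)."""
    value = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos

def parse_leaf_entries(data: bytes, count: int):
    """Return [(word, context, num_wlc, wlc_offset, wlc_length), ...]."""
    entries, pos, prev = [], 0, b""
    for _ in range(count):
        new_len = data[pos] - 1          # stored as length plus one
        keep = data[pos + 1]             # chars reused from the previous word
        pos += 2
        word = prev[:keep] + data[pos:pos + new_len]
        pos += new_len
        context = data[pos]
        pos += 1
        num_wlc, pos = read_encint_le(data, pos)
        (wlc_offset,) = struct.unpack_from("<I", data, pos)
        pos += 6                         # DWORD offset plus the unknown WORD
        wlc_length, pos = read_encint_le(data, pos)
        entries.append((word, context, num_wlc, wlc_offset, wlc_length))
        prev = word
    return entries

dump = bytes.fromhex("02 00 31 00 0a 03 04 00 00 00 00 1d"
                     "01 01 01 01 20 04 00 00 00 00 03")
leaf_entries = parse_leaf_entries(dump, 2)
```

Under this reading the second entry stores no new characters and reuses the single character of the first, so both entries are the word "1", once in the body and once in the title.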

Scanning over to offset 0x403 in the file, we see:

0000403:          f9 f4 60 86 b8 ea 6a 00 ed 78 00 2d c0
0000410: f8 d7 28 2c f0 f6 dc c8 ce 66 61 80 87 02 00 00
0000420: f9 f4 40

Broken out, these WLC entries are:

1          <10, 1027, 29>:  f9 f4 60 86 b8 ea 6a 00 ed 78 00 2d c0 f8 d7 28
                            2c f0 f6 dc c8 ce 66 61 80 87 02 00 00
1 (TITLE)  <1, 1056, 3>:    f9 f4 40

Now, the parameters for the WLC in this file are 2/2, 2/1, 2/5. Here is a quick reference table for the codes:

p       value       q (bits)

2/1:
0:      0-1         1
10:     2-3         1
110:    4-7         2
1110:   8-15        3
11110:  16-31       4
111110: 32-63       5

2/2:
0:      0-3         2
10:     4-7         2
110:    8-15        3
1110:   16-31       4
11110:  32-63       5
111110: 64-127      6

2/5:
0:      0-31        5
10:     32-63       5
110:    64-127      6
1110:   128-255     7
11110:  256-511     8
111110: 512-1023    9

Let's start with the short one, since it's very simple:

f9 f4 40 => 1111 1001 1111 0100 0100 0000

2/2 Document index: 111110 011111 => 64 + 31 => Document #95
2/1 Code count:     0 1           => 1
2/5 Location codes: 0 00100       => 4       => Word #4
    padding:        0000

So, in document #95, word #4 is a '1' which is in the title. Now, the ordering of the documents is provided by the '#URLTBL' and '#URLSTR' files. Looking up document #95 in there (0-based indexing!), we see the file is 'internet_account.htm', in which, the first non-markup text is:

<title>Dial-Up Networking: Step 1</title>

0: dial
1: up
2: networking
3: step
4: 1

Now, the next one is a little more complicated. I won't go over it in as much detail, but I'll just break it out quickly. It contains 10 entries:

(111110 011111) ( 0 1) (   0 00110  )             0000
(    10 00    ) ( 0 1) (  10 10111  )             000
(  1110 1010  ) ( 0 1) (  10 10100  )             0000000
(  1110 1101  ) ( 0 1) (1110 0000000)             000
(     0 01    ) ( 0 1) (  10 11100  )             0000
(111110 001101) ( 0 1) ( 110 010100 )             0
(     0 01    ) ( 0 1) (  10 01111  )             0000
( 11110 11011 ) ( 0 1) ( 110 011001 )             000
(   110 011   ) (10 0) ( 110 011001 ) ( 10 00011) 0000000
(    10 00    ) ( 0 1) ( 110 000001 )             0
                                                  00000000 00000000

Parsing those entries, we get:

WLC entries
Document number  Word numbers
95               6
95+4 = 99        55
99+26 = 125      52
125+29 = 154     128
154+1 = 155      60
155+77 = 232     84
232+1 = 233      47
233+59 = 292     89
292+11 = 303     89 and 89+35 = 124
303+4 = 307      65

Picking one at random, say, Document #303 with 2 hits, we open up windows_netsetup_netwin.htm, from which I've generated a wordlist containing all of the words in order:

  0: To(TITLE)
  1: set(TITLE)
  2: up(TITLE)
  ..
  ..
 86: client
 87: follow
 88: steps
 89: 1
 90: 3
 ..
122: follow
123: steps
124: 1
125: 3
126: and
 ..

And we can see the word '1' shows up in precisely the 89th and 124th spots.
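
The whole worked example can be checked with a Python sketch of a WLC decoder. It reads bits most significant first, handles only s=2, and takes the three scale/root pairs as arguments. The class and function names are invented for this illustration, and the number of entries has to come from the leaf node entry.

```python
class BitReader:
    """Reads bits most-significant-first from a bytes object."""
    def __init__(self, data: bytes):
        self.data, self.pos = data, 0

    def bit(self) -> int:
        b = (self.data[self.pos >> 3] >> (7 - (self.pos & 7))) & 1
        self.pos += 1
        return b

    def bits(self, n: int) -> int:
        value = 0
        for _ in range(n):
            value = (value << 1) | self.bit()
        return value

    def align(self) -> None:
        self.pos = (self.pos + 7) & ~7   # skip padding up to the next byte

def read_sr(reader: BitReader, scale: int, root: int) -> int:
    if scale != 2:
        raise ValueError("only scale 2 has been observed")
    ones = 0
    while reader.bit():                  # count the 1s of the prefix
        ones += 1
    if ones == 0:
        return reader.bits(root)
    nbits = root + ones - 1
    return (1 << nbits) | reader.bits(nbits)

def decode_wlc(data, count, doc_sr, count_sr, loc_sr):
    """Return [(document index, [word positions]), ...] for one word."""
    reader, result, doc = BitReader(data), [], 0
    for _ in range(count):
        doc += read_sr(reader, *doc_sr)          # delta across entries
        hits = read_sr(reader, *count_sr)
        positions, loc = [], 0
        for _ in range(hits):
            loc += read_sr(reader, *loc_sr)      # delta within the entry
            positions.append(loc)
        result.append((doc, positions))
        reader.align()
    return result

wlc = bytes.fromhex("f9 f4 60 86 b8 ea 6a 00 ed 78 00 2d c0 f8 d7 28"
                    "2c f0 f6 dc c8 ce 66 61 80 87 02 00 00")
body_hits = decode_wlc(wlc, 10, (2, 2), (2, 1), (2, 5))
title_hits = decode_wlc(bytes.fromhex("f9 f4 40"), 1, (2, 2), (2, 1), (2, 5))
```

Here body_hits should match the table of document and word numbers above, and title_hits the single title entry for document #95.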

After the WLC blocks and the leaf node chunks comes the index node chunk (for a depth of 2). For higher tree depths the index node blocks are interspersed with the listing node blocks, similarly to how the PMGL/PMGI chunks are laid out in the directory of the ITSF format. The method of splitting used is likely the same space filling method used in the directory. The index node header is just a WORD indicating the length of free space at the end of the current index node chunk.

Index node entries
Offset  Type    Comment/Value
0       BYTE    One more than the length of the word/partial word in this entry.
1       BYTE    Position in the word where characters are placed.
2       BYTEs   Length-1 bytes make up the word or part of the word. Not NT
+0      DWORD   Offset of the leaf node chunk whose last entry is this word.
+4      WORD    0 (unknown)

Words in the node chunks are made up of the following characters stored as is: 0x01 (buggy), 0-9, a-z, _, 0xDE, 0xFE. Bytes are converted and stored as per the table below. Character entity references of the form &#9660; are truncated to BYTEs and translated as per the table below. Character entity references of the form &amp; are treated as whitespace, except for the Latin characters, which are converted as per the table below.

Conversions
Before => After
A-Z => a-z
0x8A, 0x9A => s
0x8C, 0x9C => oe
0x9F, 0xDD, 0xFD, 0xFF, &Yacute;, &yacute;, &yuml; => y
0xC0-0xC5, 0xE0-0xE5, &Agrave;, &Aacute;, &Acirc;, &Atilde;, &Auml;, &Aring;, &agrave;, &aacute;, &acirc;, &atilde;, &auml;, &aring; => a
0xC6, 0xE6, &AElig;, &aelig; => ae
0xC7, 0xE7, &Ccedil;, &ccedil; => c
0xC8-0xCB, 0xE8-0xEB, &Egrave;, &Eacute;, &Ecirc;, &Euml;, &egrave;, &eacute;, &ecirc;, &euml; => e
0xCC-0xCF, 0xEC-0xEF, &Igrave;, &Iacute;, &Icirc;, &Iuml;, &igrave;, &iacute;, &icirc;, &iuml; => i
0xD0, &ETH; => d
0xD1, 0xF1, &ntilde;, &Ntilde; => n
0xD2-0xD8, 0xF0, 0xF2-0xF8, &eth;, &Ograve;, &Oacute;, &Ocirc;, &Otilde;, &Ouml;, &Oslash;, &ograve;, &oacute;, &ocirc;, &otilde;, &ouml;, &oslash; => o
0xD9-0xDC, 0xF9-0xFC, &Ugrave;, &Uacute;, &Ucirc;, &Uuml;, &ugrave;, &uacute;, &ucirc;, &uuml; => u
0xDF, &szlig; => ss
&thorn; => 0xFE
&THORN; => 0xDE

These conversions may depend on the system codepage, character set, font and language set in the HHP file (I'm just guessing here).

There are a few bugs:

An 0x01 in a word causes the first whitespace character at the end of the word to be included in the word and if the next character is non-whitespace the word is joined to the next word. If the word begins with 0-9 then the word is terminated before the 0x01 and a new word begins at the 0x01. This bug affects the fields in the initial header.
For example:
"abcd0x1efghi-foobar" is converted to "abcd0x1efghi-foobar".
"abcd0x1efghi- foobar" is converted to "abcd0x1efghi-" and "foobar".
"0bcd0x1efghi-foobar" is converted to "0bcd" and "0x1efghi-foobar".
"0bcd0x1efghi- foobar" is converted to "0bcd", "0x1efghi-" and "foobar".

Weird bug where if the word is 16 characters in length then the word is doubled plus the first 7 chars in length.

There is a weird feature that if a word starts with 0-9 then it may contain multiple periods (0x2E = '.') or commas (0x2C = ',') embedded in the word before the non-period, non-comma word terminating character. I think this feature is so that the user can search for version numbers or numbers with a decimal point or thousands separator in them. Note that commas are removed from the word, while periods are not. For example "v1.1.234.5......,6" will become "v1" and "1.234.5......6".

Weird bug involving words ending in single quote (') being forgotten when the same word is also normal and also ending in a period (.).

There are probably many more hidden bugs and features in the word converter (I think it's the ITIR.StdWordBreaker class in ITIRCL.DLL).

/$OBJINST

From the name and the number of GUIDs present I guess it has something to do with ActiveX objects. Seems like it can be deleted without major consequence.

Offset  Type   Comment/Value
0       DWORD  0x04000000 (unknown)
4       DWORD  Number of entries

This is followed by a listing, and each listing entry is as follows

Offset  Type   Comment/Value
0       DWORD  Offset of the entry in this file
4       DWORD  Length of the entry

The listing is followed by the entries one after another at offsets specified in the listing.
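
A short Python sketch of reading the header and listing described above and slicing out the raw entries. It assumes little-endian DWORDs and that listing offsets are measured from the start of the file. The function name read_objinst_entries is hypothetical.

```python
import struct

def read_objinst_entries(data: bytes):
    """Return the raw bytes of each entry in an /$OBJINST file."""
    _unknown, count = struct.unpack_from("<II", data, 0)
    entries = []
    for i in range(count):
        # Each listing entry is two DWORDs: offset in this file, then length.
        offset, length = struct.unpack_from("<II", data, 8 + 8 * i)
        entries.append(data[offset : offset + length])
    return entries
```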

There are 2 known types of entries. The first seems to be made up of up to 3 different sub entries. The second is a 36 BYTE structure.

The first entry
Offset  Type   Comment/Value
0       GUID   {4662DAAF-D393-11D0-9A56-00C04FB68BF7}
0x10    DWORD  0x04000000 (unknown) Possibly a big-endian version number of the class that the GUID refers to.
0x14    DWORD  Unknown. Methinks bitflags that somehow affect the size of entries that have the 0x04000000 DWORD, like each bit specifies the presence/absence of a specific subentry.
0x18    DWORD  Windows code page identifier (usually 1252 - Windows 3.1 US (ANSI))
0x1C    DWORD  LCID from the HHP file.
0x20    BYTEs  Unknown
+0             Entries

I haven't been able to find any files without the data for bits 0 & 1, so I can't really say exactly how big the header is and which bytes are part of the bit 0 block and which are part of the bit 1 block. Together, though, bits 0 & 1 account for a large bulk of repeatedly increasing byte blocks of 10 bytes each, plus something else at the end. I suspect that the repeats are for bit 0 and the stuff at the end is bit 1. As to the function of these two bit blocks, well, there are no GUIDs and no other clues, so who knows.

bit 2. Only present when "Full text search stop list file" has been specified in the HHP.
Offset  Type      Comment/Value
0       char[4]   ""(\0
4       DWORD     Length in bytes of the entries not including the last zero word.
8       BYTE[32]  0 (unknown)
0x28              Entries. The last entry has a zero length word.
bit 2 entries
Offset  Type          Comment/Value
0       WORD          Length of the word
2       char[length]  ANSI/UTF-8 string from the stop list file, may be uniqified & sorted reverse alphabetically. Not NT.
bit 3
Offset  Type   Comment/Value
0       GUID   {8FA0D5A8-DEDF-11D0-9A61-00C04FB68BF7}
0x10    DWORD  0x04000000 (unknown) Possibly a big-endian version number of the class that the GUID refers to.
0x14    DWORD  1 (unknown)
0x18    DWORD  Windows code page identifier (usually 1252 - Windows 3.1 US (ANSI))
0x1C    DWORD  LCID from the HHP file.
0x20    DWORD  0 (unknown)
The second entry
Offset  Type   Comment/Value
0       GUID   {4662DAB0-D393-11D0-9A56-00C04FB68B66}
0x10    DWORD  666 (May represent the version of the class that the GUID refers to)
0x14    DWORD  Windows code page identifier (usually 1252 - Windows 3.1 US (ANSI))
0x18    DWORD  LCID from the HHP file.
0x1C    DWORD  Unknown. Almost always 10031. Also 66631 (accessib.chm from the MSDN).
0x20    DWORD  0 (unknown)
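
The two entry types can be told apart by their leading GUID. The Python sketch below assumes the GUIDs are stored in the usual Windows in-memory layout (first three fields little-endian), which is what uuid.UUID(bytes_le=...) decodes, and that the DWORDs are little-endian. The GUID strings are copied from the tables above; describe_entry and the dictionary keys are made up for this example.

```python
import struct
import uuid

FIRST_GUID = uuid.UUID("4662DAAF-D393-11D0-9A56-00C04FB68BF7")
SECOND_GUID = uuid.UUID("4662DAB0-D393-11D0-9A56-00C04FB68B66")

def describe_entry(entry: bytes):
    """Decode the fixed fields that follow the GUID of an /$OBJINST entry."""
    guid = uuid.UUID(bytes_le=entry[:16])
    if guid == SECOND_GUID and len(entry) == 36:
        _version, codepage, lcid, _unknown, _zero = struct.unpack_from("<5I", entry, 0x10)
        return {"kind": "second", "codepage": codepage, "lcid": lcid}
    if guid == FIRST_GUID:
        # The sub entries after offset 0x20 are not understood, so they are skipped.
        _version, flags, codepage, lcid = struct.unpack_from("<4I", entry, 0x10)
        return {"kind": "first", "flags": flags, "codepage": codepage, "lcid": lcid}
    return {"kind": "unknown"}
```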

/$WWAssociativeLinks/* & /$WWKeywordLinks/*

The files in the /$WWAssociativeLinks and /$WWKeywordLinks directories have the same formats. The maximum total length (including parents) of an entry in one of these files is 488 characters (including NT). HHW complains about and refuses to output any that are greater than this length.

The /$WWKeywordLinks dir specifies the contents of the Index navigation pane & the /$WWAssociativeLinks dir specifies the Alinks.

In CHW files this is named BTREE and in CHI/CHM files it is named BTree.

This file has a 76 byte header, then 2048 byte blocks. First come all the listing blocks, then all the index blocks. This file is similar to the directory entries in the ITSF format, except that the index blocks are at the end instead of interspersed with the listing blocks. All block indices below are zero based. This file forms a tree, with the last (index mostly) block being the root of the tree. If there is more than one level of index blocks then the root block will have two children; the first in the block header and the second in the entry. WARNING: just as in the ITSF directory there can be garbage in the free space, so respect that first WORD and use it. I'm not yet sure how the listing blocks are split up, though it is probably the same as the ITSF directory (space filling).

BTree header
Offset  Type      Comment/Value
0       char[2]   ;) (0x3B 0x29) (signature)*
2       WORD      *Flags. Bit 0x2 always 1. Bit 0x0400 1 if directory?? (this is always on)
4       WORD      Size of the blocks (2048)
6       BYTE[16]  Always X44. *says it is a string describing format of data
                    'L' = DWORD (indexed)
                    'F' = NUL-terminated string (indexed)
                    'i' = NUL-terminated string (indexed)
                    '2' = WORD
                    '4' = DWORD
                    'z' = NUL-terminated string
                    '!' = DWORD count value, count/8 * record
                            DWORD filenumber
                            DWORD TopicOffset
0x16    DWORD     0 (unknown)
0x1A    DWORD     Index of the last listing block in the file.
0x1E    DWORD     Index of the root block in the file.
0x22    DWORD     -1 (unknown)
0x26    DWORD     Number of blocks
0x2A    WORD      The depth of the tree of blocks (1 if no index blocks, 2 one level of index blocks, ...)
0x2C    DWORD     Number of keywords in the file.
0x30    DWORD     Windows code page identifier (usually 1252 - Windows 3.1 US (ANSI))
0x34    DWORD     LCID from the HHP file.
0x38    DWORD     0 if this is a BTREE and is part of a CHW file, 1 if it is a BTree and is part of a CHI or CHM file
0x3C    DWORD     Unknown. Almost always 10031. Also 66631 (accessib.chm, ieeula.chm, iesupp.chm, iexplore.chm, msoe.chm, mstask.chm, ratings.chm, wab.chm).
0x40    DWORD     0 (unknown)
0x44    DWORD     0 (unknown)
0x48    DWORD     0 (unknown)
*These were guessed from the documentation provided with helpdeco by Manfred Winterhoff.
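
The 76 byte header maps onto a single struct format string. The following is a minimal Python sketch, assuming little-endian values and no padding between fields. The field names (including the unkNN placeholders for unknown DWORDs) and the helper block_offset are invented for this example.

```python
import struct
from collections import namedtuple

BTREE_HEADER = struct.Struct("<2sHH16sIIIiIHI7I")  # 76 bytes in total

BTreeHeader = namedtuple(
    "BTreeHeader",
    "signature flags block_size format unk16 last_listing_block root_block "
    "unk22 block_count depth keyword_count codepage lcid is_chm "
    "unk3c unk40 unk44 unk48",
)

def parse_btree_header(data: bytes) -> BTreeHeader:
    """Unpack the BTree header and check the ;) signature."""
    hdr = BTreeHeader(*BTREE_HEADER.unpack_from(data, 0))
    if hdr.signature != b";)":
        raise ValueError("not a BTree file: bad signature")
    return hdr

def block_offset(hdr: BTreeHeader, index: int) -> int:
    """File offset of the zero based block `index` (blocks follow the header)."""
    return BTREE_HEADER.size + index * hdr.block_size
```

For example, block_offset(hdr, hdr.root_block) gives the position of the root block.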
BTree listing blocks header
Offset  Type   Comment/Value
0       WORD   Length of free space at the end of the block.
2       WORD   Number of entries in the block.
4       DWORD  Index of the previous block. -1 if this is the first listing block.
8       DWORD  Index of the next block. -1 if this is the last listing block.
BTree listing block entries
Offset  Type              Comment/Value
0       WCHARs            Value of the first Name entry from the HHK. If this is a sub-keyword, then this will be all the parent keywords, including this one, separated by ", ". UTF-16/UCS-2 NT.
+0      WORD              2 if this keyword is a See Also keyword, 0 if it is not.
+2      WORD              Depth of this entry into the tree.
+4      DWORD             Character index of the last keyword in the ", " separated list.
+8      DWORD             0 (unknown)
+0xC    DWORD             Number of Name, Local pairs
+0x10   DWORDs or WCHARs  DWORDs: Index into the /#TOPICS file.
                          UTF-16/UCS-2 NT string: The value of the See Also string.
+0      DWORD             Mostly 1 (unknown)
+4      DWORD             Zero based index of this entry in the file (not block). Increments by 13 (each entry is 13 more than the last).
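
Here is a Python sketch of decoding one listing block using the header and entry layouts above. It assumes little-endian values, and that a See Also entry (WORD value 2) stores a single NT string in place of the DWORD list. The second assumption is my reading of the table, not something the source confirms. The names parse_listing_block and _read_wstr are hypothetical.

```python
import struct

def _read_wstr(buf: bytes, pos: int):
    """Read a NUL-terminated UTF-16LE string; return (text, position after it)."""
    end = pos
    while end + 2 <= len(buf) and buf[end : end + 2] != b"\0\0":
        end += 2
    return buf[pos:end].decode("utf-16-le"), end + 2

def parse_listing_block(block: bytes):
    """Return (previous block, next block, entries) for one listing block."""
    _free, count, prev_blk, next_blk = struct.unpack_from("<HHii", block, 0)
    pos, entries = 12, []
    for _ in range(count):
        keyword, pos = _read_wstr(block, pos)
        see_also, depth, last_kw, _zero, pairs = struct.unpack_from("<HHIII", block, pos)
        pos += 16
        if see_also == 2:
            # Assumption: one See Also string replaces the list of topic indices.
            target, pos = _read_wstr(block, pos)
        else:
            target = list(struct.unpack_from(f"<{pairs}I", block, pos))
            pos += 4 * pairs
        _one, index = struct.unpack_from("<II", block, pos)
        pos += 8
        entries.append({"keyword": keyword, "depth": depth, "last_keyword_at": last_kw,
                        "target": target, "index": index})
    return prev_blk, next_blk, entries
```

Driving the loop by the entry count means the free space at the end of the block, which may contain garbage, is never read.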
BTree index blocks header
Offset  Type   Comment/Value
0       WORD   Length of free space at the end of the block.
2       WORD   Number of entries in the block.
4       DWORD  Index of a child block.
BTree index block entries
Offset  Type              Comment/Value
0       WCHARs            Value of the first Name entry from the HHK. If this is a sub-keyword, then this will be all the parent keywords, including this one, separated by ", ". UTF-16/UCS-2 NT.
+0      WORD              2 if this keyword is a See Also keyword, 0 if it is not.
+2      WORD              Depth of this entry into the tree.
+4      DWORD             Character index of the last keyword in the ", " separated list.
+8      DWORD             0 (unknown)
+0xC    DWORD             Number of Name, Local pairs
+0x10   DWORDs or WCHARs  DWORDs: Index into the /#TOPICS file.
                          UTF-16/UCS-2 NT string: The value of the See Also string.
+0      DWORD             Index of a child block. If it is a listing block then it is the one that starts with the keyword at the start of this entry

In CHW files this is named DATA and in CHI/CHM files it is named Data.

This file contains entries that are 13 bytes in length. All known entries have thus far contained the following bytes: 00000000 05000000 80000000 00. AFAICS this file is useless.

In CHW files this is named MAP and in CHI/CHM files it is named Map.

Begins with a WORD indicating the number of entries in the file (also the number of listing blocks in the BTree file). Each entry is 2 DWORDs. The first is a cumulative sum of the number of keywords in the BTree listing blocks & the second is a consecutively increasing index number. Both start at zero.
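
One plausible use of the Map file is to find the listing block that holds the n-th keyword without walking the tree. The Python sketch below assumes little-endian values, that the first DWORD of each entry is the number of keywords in all earlier listing blocks, and that the second DWORD is the listing block index. Both readings are guesses based on the description above. The names parse_map and block_for_keyword are invented here.

```python
import bisect
import struct

def parse_map(data: bytes):
    """Return a list of (cumulative keyword count, index number) pairs."""
    (count,) = struct.unpack_from("<H", data, 0)
    return [struct.unpack_from("<II", data, 2 + 8 * i) for i in range(count)]

def block_for_keyword(entries, keyword_number: int) -> int:
    """Index number of the entry whose block should contain the given keyword."""
    starts = [first for first, _ in entries]
    # The last entry whose starting keyword number is <= keyword_number.
    i = bisect.bisect_right(starts, keyword_number) - 1
    return entries[i][1]
```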

In CHW files this is named PROPERTY and in CHI/CHM files it is named Property.

If there are no links of this type in the CHM then this will be a zero DWORD. Otherwise it contains the following DWORDs: 0, 0, 0, 0xC, 1, 1, 0, 0. AFAICS this file is pretty much useless.

/$HHTitleMap

The file begins with a WORD indicating the number of entries.

Each entry has the following format:

Offset  Type   Comment/Value
0       WORD   Length of the file stem.
2       BYTEs  File stem. ANSI/UTF-8 string. Not NT.
+0      DWORD  Unknown.
+4      DWORD  Unknown. Same value as previous DWORD.
+8      DWORD  LCID of the specified file.
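
Because the file stems are length-prefixed, the entries have to be read sequentially. A minimal Python sketch, assuming little-endian values and cp1252 for the stem (the real encoding may differ); parse_hhtitlemap is a made-up name:

```python
import struct

def parse_hhtitlemap(data: bytes, encoding: str = "cp1252"):
    """Return a list of (file stem, LCID) pairs from /$HHTitleMap."""
    (count,) = struct.unpack_from("<H", data, 0)
    pos, out = 2, []
    for _ in range(count):
        (n,) = struct.unpack_from("<H", data, pos)
        stem = data[pos + 2 : pos + 2 + n].decode(encoding)  # not NUL-terminated
        _unk1, _unk2, lcid = struct.unpack_from("<III", data, pos + 2 + n)
        pos += 2 + n + 12
        out.append((stem, lcid))
    return out
```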

/$TitleMap

The file begins with a WORD indicating the number of entries.

Each entry is 68 BYTEs in length and has the following format:

Offset  Type      Comment/Value
0       BYTE[25]  File stem. ANSI/UTF-8 NT fixed length string.
0x19    BYTE[25]  Unknown. Seems to be RAM litter, but contains paths, file names, zero bytes, DWORDs and mixtures.
0x32    WORD      An index number that begins at 1 and is incremented by 1 for each entry.
0x34    DWORD     Unknown.
0x38    DWORD     Unknown. Same value as previous DWORD.
0x3C    DWORD     LCID of the specified file.
0x40    DWORD     Number of topic nodes including the contents & index files in the specified file.
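
Since every entry is a fixed 68 bytes, the file can be read with one struct format per entry. This Python sketch assumes little-endian values, no padding, and cp1252 for the stem; parse_titlemap is a hypothetical name.

```python
import struct

TITLEMAP_ENTRY = struct.Struct("<25s25sHIIII")  # 68 bytes per entry

def parse_titlemap(data: bytes, encoding: str = "cp1252"):
    """Return (file stem, index number, LCID, topic node count) per entry."""
    (count,) = struct.unpack_from("<H", data, 0)
    out = []
    for i in range(count):
        stem, _litter, index, _unk1, _unk2, lcid, topics = TITLEMAP_ENTRY.unpack_from(
            data, 2 + TITLEMAP_ENTRY.size * i
        )
        # The stem is NUL-terminated inside its fixed 25 byte field.
        out.append((stem.split(b"\0", 1)[0].decode(encoding), index, lcid, topics))
    return out
```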

/Path/file.chm/windowtype

It is a cache of user customized bits of the windowtype entry from the /#WINDOWS file of the \Path\file.chm CHM file.

Offset  Type          Comment/Value
0       DWORD         Size of the file in bytes (44)
4       Signed DWORD  Position of the left edge of the window.
8       Signed DWORD  Position of the top edge of the window.
0xC     Signed DWORD  Position of the right edge of the window.
0x10    Signed DWORD  Position of the bottom edge of the window.
0x14    DWORD         Width of the navigation pane in pixels.
0x18    DWORD         Non-zero if search highlight is on.
0x1C    DWORD         Unknown. Not font size, printing options or show state.
0x20    DWORD         Non-zero if there is no text on the toolbar buttons.
0x24    DWORD         Non-zero if the navigation pane is initially closed.
0x28    DWORD         Which navigation tab is currently open.
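
The 44 byte windowtype cache is a flat record, so it unpacks in one call. A minimal Python sketch, assuming little-endian values; the field names and parse_windowtype are invented for this example.

```python
import struct
from collections import namedtuple

WINDOWTYPE = struct.Struct("<I4i6I")  # 44 bytes: size, 4 signed edges, 6 DWORDs

WindowType = namedtuple(
    "WindowType",
    "size left top right bottom nav_width highlight unknown "
    "no_button_text nav_closed current_tab",
)

def parse_windowtype(data: bytes) -> WindowType:
    """Unpack the cached window settings and check the stored size field."""
    wt = WindowType(*WINDOWTYPE.unpack_from(data, 0))
    if wt.size != WINDOWTYPE.size:
        raise ValueError("unexpected windowtype size: %d" % wt.size)
    return wt
```

The window width and height are then wt.right - wt.left and wt.bottom - wt.top.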

/Path/file.chm/AdvSearchUI/Keywords

UTF-16/UCS-2 NT string. Each search item is separated by a UTF-16/UCS-2 Line Feed character. The string is followed by an unknown WORD.

/Path/file.chm/AdvSearchUI/Properties

DWORD. Only the lowest 3 bits are used. "Match similar words" is controlled by bit 0. "Search titles only" is controlled by bit 1. "Search previous results" is controlled by bit 2. Note that since previous search results are not stored anywhere as yet HH will uncheck the "Search previous results" checkbox even if its bit is on. IMHO this is a bug: HH should automatically search the whole file if there are no previous results and the checkbox is checked.
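
The three bits can be modelled as flags. A small Python sketch, assuming a little-endian DWORD; the flag names and decode_properties are made up for this example.

```python
import struct
from enum import IntFlag

class AdvSearch(IntFlag):
    MATCH_SIMILAR = 1      # bit 0: "Match similar words"
    TITLES_ONLY = 2        # bit 1: "Search titles only"
    PREVIOUS_RESULTS = 4   # bit 2: "Search previous results"

def decode_properties(data: bytes) -> AdvSearch:
    """Decode the AdvSearchUI/Properties DWORD, ignoring all but the low 3 bits."""
    (value,) = struct.unpack_from("<I", data, 0)
    return AdvSearch(value & 0x7)
```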

/Path/file.chm/Bookmarks/v1/Count

A DWORD indicating the number of favourites stored for the \Path\file.chm CHM file.

/Path/file.chm/Bookmarks/v1/n/Topic

An NT UTF-16/UCS-2 string showing the topic name of bookmark number n (n is zero based).

/Path/file.chm/Bookmarks/v1/n/Url

An NT UTF-16/UCS-2 string showing the URL of bookmark number n (n is zero based). It is a fully qualified path into the \Path\file.chm CHM file.

/#KEY_DELETED

A set of ANSI/UTF-8 NT strings indicating which internal files have been deleted in the new file. Names use backslash (\) instead of forward slash (/) & don't have an initial slash.
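
To compare these names with the internal file names used elsewhere in this section, the slashes need converting. A short Python sketch, assuming the strings are simply stored back to back and are cp1252 encoded; parse_key_deleted is a hypothetical name.

```python
def parse_key_deleted(data: bytes, encoding: str = "cp1252"):
    """Return the deleted internal file names in /Dir/File form."""
    names = [n for n in data.split(b"\0") if n]
    # Stored names use backslashes and have no initial slash.
    return ["/" + n.decode(encoding).replace("\\", "/") for n in names]
```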

/#KEY_DATA

Offset  Type      Comment/Value
0       DWORD[8]  Unknown.
0x20    DWORD     Length of the name of the old chm.
0x24    BYTEs     Name of the old chm. ANSI/UTF-8 NT.

Please let us know if you find any other internal files, figure out formats of any internal files or find out what unknown parts of the above files do. Any and all contributions will be fully attributed and, if necessary, co-copyright given.