KI_WS_CREATE
Argument | Enumeration | Purpose |
---|---|---|
extract_handle | | Handle of the extract file |
index_handle | | Handle of the WS index file |
data_handle | | Handle of the data table |
SYM(ws_spec$) | | Defines options for the WS index |
row_length | | Data table row length |
key_length | | Data table key length |
data_ptr_size | | Size of ROWID |
block_size | | Extract file block size |
status | KDB_ERROR_ENUM | Return status |
This call is no longer supported for type 7 tables; use the CREATE WORD INDEX statement for those instead.
KI_WS_CREATE creates a word search extract file and index file for the table open on data_handle. Any previous extract or index file will be implicitly dropped. The extract file will be created with an extension of .wd appended to the base name of the data table. Similarly, the index file will have an extension of .wi.
The indexing options are specified by the ws_spec$ parameter which is laid out as follows (assuming the WS2SPEC environment variable is set to TRUE):
Byte | Length | Purpose |
---|---|---|
1 | 1 | Key path for the extract. This is the index path number to be followed in the data table and determines the ordering of duplicates |
2 | 1 | Index order. 'A' for Ascending or 'D' for Descending |
3 | 1 | Extract mode. 'C' for Compact or 'L' for Large |
4 | 1 | Minimum word length. Unsigned binary. |
5 | 1 | Maximum word length. Unsigned binary. |
6 | 1 | DBCS word length, zero if not applicable |
7 | 1 | Language code. Only used for DBCS languages as a rough check for punctuation characters. Leave as HEX(00) if not applicable. |
8 | 1 | First DBCS lead in character, HEX(00) if not required. |
9 | 1 | Final DBCS lead in character, HEX(00) if not required. |
10 | 16 | Extra characters to be included. Implicitly included are all characters which are affected by $UPPER/$LOWER translation in the current code page, as set by byte 50 of $OPTIONS RUN. Numbers are not included. |
26 | 64 | Fields defining the extract. This is ordered as 16 fields each of 4 bytes. Each field consists of a 2 byte unsigned binary start offset and a 2 byte unsigned binary length. The offset is counted from 1. A zero offset marks the end of the field definitions. |
90 | 2 | The offset of an exclusion flag, counted from 1 and expressed as a 2 byte unsigned binary number. A zero value means no flag exists. |
92 | 16 | Possible values for the single byte exclusion flag. If the byte at the specified offset in the data row buffer is in this list then that row will be ignored. |
108 | 2 | Minimum index file size. Two byte unsigned binary integer. Leave zero for default. |
110 | 1 | Free space factor. Leave zero for default. |
111 | 1 | Index packing factor. Leave zero for default. |
If the WS2SPEC environment variable is not set to TRUE then the field specifiers at byte 26 will be interpreted as 3 byte groups with two byte offsets and 1 byte lengths. The subsequent fields are not used and need not be present.
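The version 2 layout in the table above (WS2SPEC set to TRUE) can be illustrated with a short sketch. It is written in Python rather than KCML, purely to show the byte positions, and it rests on several assumptions that the text does not confirm: the function name `build_ws_spec` is hypothetical, multi-byte unsigned binary values are taken to be stored most significant byte first, the key path in byte 1 is taken to be a binary number, and unused bytes are padded with HEX(00).

```python
import struct

def build_ws_spec(key_path, order, mode, min_word, max_word, fields,
                  dbcs_word_len=0, language=0, dbcs_first=0, dbcs_last=0,
                  extra_chars=b"", excl_offset=0, excl_values=b"",
                  min_index_size=0, free_space=0, packing=0):
    """Return a 111-byte buffer following the version 2 ws_spec$ table.

    Assumptions (not confirmed by the documentation): big-endian
    multi-byte values, binary key path, HEX(00) padding.
    """
    if order not in ("A", "D"):
        raise ValueError("order must be 'A' or 'D'")
    if mode not in ("C", "L"):
        raise ValueError("mode must be 'C' or 'L'")
    if len(fields) > 16:
        raise ValueError("at most 16 extract fields")
    if len(extra_chars) > 16 or len(excl_values) > 16:
        raise ValueError("at most 16 extra characters or exclusion values")

    buf = bytearray(111)            # bytes 1..111, all HEX(00) to start
    buf[0] = key_path               # byte 1: key path
    buf[1] = ord(order)             # byte 2: index order
    buf[2] = ord(mode)              # byte 3: extract mode
    buf[3] = min_word               # byte 4: minimum word length
    buf[4] = max_word               # byte 5: maximum word length
    buf[5] = dbcs_word_len          # byte 6: DBCS word length
    buf[6] = language               # byte 7: language code
    buf[7] = dbcs_first             # byte 8: first DBCS lead-in
    buf[8] = dbcs_last              # byte 9: final DBCS lead-in
    buf[9:9 + len(extra_chars)] = extra_chars          # bytes 10..25
    for i, (offset, length) in enumerate(fields):      # bytes 26..89
        # 2-byte offset (counted from 1) then 2-byte length
        struct.pack_into(">HH", buf, 25 + 4 * i, offset, length)
    struct.pack_into(">H", buf, 89, excl_offset)       # bytes 90..91
    buf[91:91 + len(excl_values)] = excl_values        # bytes 92..107
    struct.pack_into(">H", buf, 107, min_index_size)   # bytes 108..109
    buf[109] = free_space           # byte 110: free space factor
    buf[110] = packing              # byte 111: index packing factor
    return bytes(buf)

# Two extract fields: offset 1 length 30, and offset 41 length 10.
spec = build_ws_spec(1, "A", "C", 3, 20, [(1, 30), (41, 10)])
```

Unused field slots stay zero, which matches the rule that a zero offset marks the end of the field definitions.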
By default, indexes are created to the version 3 specification. This can be overridden or augmented by setting the following environment variables. The settings in force when the index is created are stored permanently in the index.
Variable | Purpose |
---|---|
WS1 | If TRUE then use the original version 1 specification. |
WS2SPEC | If TRUE then use version 2 specification with extra DBCS support. |
EXTRA_DBCS | Up to 4 extra DBCS start and end pairs as hex digits. Version 2 and 3 only. |
EXTRA_CHARS | Up to 8 extra valid character ranges defined as start and end pairs in hex digits. Version 2 and 3 only. |
DBCS_WORD_LEN | Only valid for DBCS languages. See below. |
The data_ptr_size refers to the size of the ROWID and will normally be 4 but may be 4 on applications that support it.
The block_size argument refers to the blocking in the extract file and should be a multiple of 512, typically 1024.
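In single byte character set languages, words are considered to be runs of valid characters. As described for byte 10 of ws_spec$, the valid characters are those affected by $UPPER/$LOWER translation in the current code page, plus any extra characters that have been specified.
Unless explicitly added with EXTRA_CHARS, numbers are not included. Space and HEX(00) are never considered valid characters.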
For Korean language systems using Windows CP949 you should set byte 8 to HEX(81) and byte 9 to HEX(FE). You do not need the EXTRA_DBCS environment variable.
For Traditional Chinese language systems (Big5 encoded, e.g. Hong Kong or Singapore) using Windows CP950 you should set byte 8 to HEX(A2) and byte 9 to HEX(FE) as the HEX(A1) page only contains punctuation characters. You do not need the EXTRA_DBCS environment variable.
For Simplified Chinese language systems (GBK encoded, e.g. PRC) using Windows CP936 you should set byte 8 to HEX(81) and byte 9 to HEX(FE). You do not need the EXTRA_DBCS environment variable.
For Japanese systems using Windows CP932 you should set byte 8 to HEX(82) and byte 9 to HEX(9F) as the HEX(81) page contains only punctuation characters. The EXTRA_DBCS environment variable should be set to "E0FC".
Bytes 8 and 9 must be HEX(00) for all other code pages.
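The code page settings above can be collected into a small lookup, together with a test for whether a given byte falls in a DBCS lead-in range. This is a Python sketch for illustration only. The names `DBCS_SETTINGS`, `parse_hex_pairs` and `is_lead_byte` are hypothetical, and it assumes that the ranges are inclusive, that EXTRA_DBCS pairs are written as consecutive hex digits (as in "E0FC"), and that they add to the range given by bytes 8 and 9.

```python
# (byte 8, byte 9, EXTRA_DBCS) per code page, taken from the text above.
DBCS_SETTINGS = {
    "cp949": (0x81, 0xFE, ""),      # Korean
    "cp950": (0xA2, 0xFE, ""),      # Traditional Chinese (Big5)
    "cp936": (0x81, 0xFE, ""),      # Simplified Chinese (GBK)
    "cp932": (0x82, 0x9F, "E0FC"),  # Japanese
}

def parse_hex_pairs(text, max_pairs):
    """Split hex digits such as 'E0FC' into (start, end) byte pairs."""
    raw = bytes.fromhex(text)
    if len(raw) % 2 != 0 or len(raw) // 2 > max_pairs:
        raise ValueError("expected up to %d start/end pairs" % max_pairs)
    return [(raw[i], raw[i + 1]) for i in range(0, len(raw), 2)]

def is_lead_byte(byte, first, last, extra_dbcs=""):
    """True if byte lies in the byte 8..byte 9 range or an extra range."""
    ranges = []
    if first or last:               # HEX(00)/HEX(00) means no DBCS range
        ranges.append((first, last))
    ranges.extend(parse_hex_pairs(extra_dbcs, 4))  # EXTRA_DBCS: up to 4 pairs
    return any(lo <= byte <= hi for lo, hi in ranges)

first, last, extra = DBCS_SETTINGS["cp932"]
japanese_checks = [is_lead_byte(b, first, last, extra)
                   for b in (0x81, 0x82, 0xA0, 0xE0)]
```

For the Japanese settings, HEX(81) is rejected because that page holds only punctuation, while HEX(E0) is accepted through the EXTRA_DBCS pair.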
Chinese text does not use word separators such as spaces, because readers infer the word boundaries from context. To help KCML index such languages you should specify the maximum number of DBCS characters to be considered as a word. This can be set in byte 6 of the ws_spec$ variable, and that value can be overridden using the DBCS_WORD_LEN environment variable. The maximum supported word length is 9. KCML will index all the subsets of DBCS strings up to this maximum length.
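The effect of the DBCS word length can be illustrated with a sketch that lists the terms a run of characters would contribute. It is written in Python for illustration and is not KCML's actual algorithm. The function name `dbcs_index_terms` is hypothetical, and it assumes that "subsets" means contiguous substrings and that the shortest term is a single character.

```python
def dbcs_index_terms(run, max_len):
    """Return every contiguous substring of run with 1..max_len characters.

    Assumption: KCML's "subsets" are contiguous substrings; this is an
    illustration, not the documented algorithm.
    """
    if not 1 <= max_len <= 9:       # 9 is the maximum supported length
        raise ValueError("max_len must be between 1 and 9")
    terms = []
    for start in range(len(run)):
        for length in range(1, max_len + 1):
            if start + length > len(run):
                break               # ran off the end of the run
            terms.append(run[start:start + length])
    return terms

# A four-character run with a maximum word length of 2.
terms = dbcs_index_terms("ABCD", 2)
```

Under these assumptions the number of terms grows roughly as the run length multiplied by the word length, so a larger DBCS word length gives a correspondingly larger extract.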
If the language code in byte 7 of the ws_spec$ is set appropriately for a DBCS language then KCML will use some rough rules for excluding punctuation characters (Unicode U+3000 to U+303F) from consideration in words.
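The Unicode range quoted above is the CJK Symbols and Punctuation block. The sketch below, in Python for illustration, shows one way a check against that range could split text into candidate runs. The function names are hypothetical, and KCML's own rules are described only as rough, so this is not a statement of what KCML actually does.

```python
def is_cjk_punctuation(ch):
    """True if ch lies in U+3000..U+303F (CJK Symbols and Punctuation)."""
    return 0x3000 <= ord(ch) <= 0x303F

def split_on_cjk_punctuation(text):
    """Split text into runs of characters that are not CJK punctuation."""
    runs, current = [], []
    for ch in text:
        if is_cjk_punctuation(ch):
            if current:             # punctuation closes the current run
                runs.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        runs.append("".join(current))
    return runs

# U+3002 is the ideographic full stop.
runs = split_on_cjk_punctuation("\u4e2d\u6587\u3002\u7d22\u5f15")
```

Fullwidth forms such as U+FF0C lie outside this range and would not be caught by a check of this kind.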