KI_WS_CREATE

Create a word search index

Synopsis:

CALL KI_WS_CREATE extract_handle, index_handle, data_handle, SYM(ws_spec$), row_length, key_length, data_ptr_size, block_size TO status

status = 'ki_ws_create( extract_handle, index_handle, data_handle, SYM(ws_spec$), row_length, key_length, data_ptr_size, block_size )

Argument	Enumeration	Purpose
extract_handle		Handle of the extract file
index_handle		Handle of the WS index file
data_handle		Handle of the data table
SYM(ws_spec$)		Defines options for the WS index
row_length		Data table row length
key_length		Data table key length
data_ptr_size		Size of ROWID
block_size		Extract file block size
status	KDB_ERROR_ENUM	Return status

This call is no longer supported for type 7 tables. Use the CREATE WORD INDEX statement for those tables instead.

KI_WS_CREATE creates a word search extract file and index file for the table open on data_handle. Any previous extract or index file will be implicitly dropped. The extract file will be created with an extension of .wd appended to the base name of the data table. Similarly, the index file will have an extension of .wi.

The indexing options are specified by the ws_spec$ parameter which is laid out as follows (assuming the WS2SPEC environment variable is set to TRUE):

Byte	Length	Purpose
1	1	Key path for the extract. This is the index path number to be followed in the data table and determines the ordering of duplicates
2	1	Index order. 'A' for Ascending or 'D' for Descending
3	1	Extract mode. 'C' for Compact or 'L' for Large
4	1	Minimum word length. Unsigned binary.
5	1	Maximum word length. Unsigned binary.
6	1	DBCS word length, zero if not applicable
7	1	Language code. Only used for DBCS languages as a rough check for punctuation characters. Leave as HEX(00) if not applicable.
8	1	First DBCS lead in character, HEX(00) if not required.
9	1	Final DBCS lead in character, HEX(00) if not required.
10	16	Extra characters to be included. Implicitly included are all characters which are affected by $UPPER/$LOWER translation in the current code page, as set by byte 50 of $OPTIONS RUN. Numbers are not included.
26	64	Fields defining the extract. This is ordered as 16 fields each of 4 bytes. Each field consists of a 2 byte unsigned binary start offset and a 2 byte unsigned binary length. The offset is counted from 1. A zero offset marks the end of the field definitions.
90	2	The offset of an exclusion flag, counted from 1 and expressed as a 2 byte unsigned binary number. A zero value means no flag exists.
92	16	Possible values for the single byte exclusion flag. If the byte at the specified offset in the data row buffer is in this list then that row will be ignored.
108	2	Minimum index file size. Two byte unsigned binary integer. Leave zero for default.
110	1	Free space factor. Leave zero for default.
111	1	Index packing factor. Leave zero for default.

If the WS2SPEC environment variable is not set to TRUE then the field specifiers at byte 26 will be interpreted as 3 byte groups with two byte offsets and 1 byte lengths. The subsequent fields are not used and need not be present.
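As an illustration of the version 2 layout in the table above, the following Python sketch packs a 111 byte buffer with the same offsets. It is not KCML code, and build_ws_spec is a hypothetical helper. Two points are assumptions rather than documented behaviour: the 2 byte binary values are written most significant byte first, and unused bytes in the extra character and exclusion value lists are left as HEX(00).

```python
import struct


def build_ws_spec(key_path, fields, min_word, max_word, order="A", mode="C",
                  extra_chars=b"", excl_offset=0, excl_values=b""):
    """Pack a 111 byte ws_spec buffer following the version 2 layout."""
    if order not in ("A", "D") or mode not in ("C", "L"):
        raise ValueError("order must be A or D, mode must be C or L")
    if len(fields) > 16:
        raise ValueError("at most 16 extract fields")
    if len(extra_chars) > 16 or len(excl_values) > 16:
        raise ValueError("at most 16 extra characters and 16 flag values")

    spec = bytearray(111)        # zero filled, so unset options take defaults
    spec[0] = key_path           # byte 1: key path for the extract
    spec[1] = ord(order)         # byte 2: index order
    spec[2] = ord(mode)          # byte 3: extract mode
    spec[3] = min_word           # byte 4: minimum word length
    spec[4] = max_word           # byte 5: maximum word length
    # bytes 6 to 9 (DBCS settings) stay HEX(00) in this single byte example
    spec[9:9 + len(extra_chars)] = extra_chars        # bytes 10 to 25

    pos = 25                     # byte 26: first 4 byte field definition
    for start, length in fields:
        # assumption: big-endian unsigned 16 bit offset and length
        struct.pack_into(">HH", spec, pos, start, length)
        pos += 4                 # remaining slots stay zero (end marker)

    struct.pack_into(">H", spec, 89, excl_offset)     # bytes 90 to 91
    spec[91:91 + len(excl_values)] = excl_values      # bytes 92 to 107
    # bytes 108 to 111 (index size, free space, packing) stay zero: defaults
    return bytes(spec)
```

For example, build_ws_spec(1, [(1, 30), (41, 20)], min_word=3, max_word=12) describes an ascending, compact index on key path 1 that extracts words from two fields, the first starting at offset 1 for 30 bytes and the second at offset 41 for 20 bytes.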

By default, indexes are created to the version 3 specification. This can be overridden or augmented by setting the environment variables below. The settings in force when the index is created are stored in the index.

WS1	If TRUE then use the original version 1 specification.
WS2SPEC	If TRUE then use version 2 specification with extra DBCS support.
EXTRA_DBCS	Up to 4 extra DBCS start and end pairs as hex digits. Version 2 and 3 only.
EXTRA_CHARS	Up to 8 extra valid character ranges defined as start and end pairs in hex digits. Version 2 and 3 only.
DBCS_WORD_LEN	Only valid for DBCS languages. See "Double byte character set languages" below.

The data_ptr_size argument refers to the size of the ROWID in bytes. It will normally be 4, but may be larger on applications that support it.

The block_size argument refers to the blocking in the extract file and should be a multiple of 512, typically 1024.

Valid characters for words

In single byte character set languages, words are considered to be runs of characters, each of which meets at least one of these criteria:

characters which have distinct upper and lower case values.
characters specifically listed in the 16 bytes starting at byte 10 of ws_spec$
characters falling in the inclusive ranges specified by the EXTRA_CHARS environment variable
any DBCS character pair unless excluded by the language punctuation check

Numbers are not included unless explicitly added, for example with EXTRA_CHARS. Space and HEX(00) are never considered valid characters.
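The rules above can be illustrated with a small Python model. This is only a sketch and not KCML code: the helper names (parse_ranges, is_word_char, split_words) are hypothetical, Python's own case mapping stands in for $UPPER/$LOWER translation in the current code page, and the EXTRA_CHARS value is assumed to be a run of hex digit pairs such as "3039". The DBCS rule and the minimum and maximum word lengths are not modelled.

```python
def parse_ranges(hex_pairs):
    """Parse an EXTRA_CHARS-style string, e.g. '3039' -> [(0x30, 0x39)].

    Assumption: ranges are written as consecutive start/end hex byte pairs.
    """
    raw = bytes.fromhex(hex_pairs)
    if len(raw) % 2:
        raise ValueError("ranges must be start/end pairs")
    return [(raw[i], raw[i + 1]) for i in range(0, len(raw), 2)]


def is_word_char(ch, extra_chars="", ranges=()):
    """Return True if ch may form part of a word under the rules above."""
    if ch in (" ", "\x00"):
        return False                      # never valid, whatever else is set
    if ch.upper() != ch.lower():
        return True                       # has distinct upper and lower case
    if ch in extra_chars:
        return True                       # listed at byte 10 of ws_spec$
    return any(lo <= ord(ch) <= hi for lo, hi in ranges)  # EXTRA_CHARS ranges


def split_words(text, extra_chars="", ranges=()):
    """Split text into runs of valid word characters."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch, extra_chars, ranges):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words
```

With no extras, split_words("Order 66 for O'Brien-Smith") drops the digits and breaks at the apostrophe and hyphen. Passing ranges=parse_ranges("3039") keeps "66" as a word, and passing extra_chars="'" keeps "O'Brien" together.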

Double byte character set languages

For Korean language systems using Windows CP949 you should set byte 8 to HEX(81) and byte 9 to HEX(FE). You do not need the EXTRA_DBCS environment variable.

For Traditional Chinese language systems (Big5 encoded, e.g. Hong Kong or Singapore) using Windows CP950 you should set byte 8 to HEX(A2) and byte 9 to HEX(FE) as the HEX(A1) page only contains punctuation characters. You do not need the EXTRA_DBCS environment variable.

For Simplified Chinese language systems (GBK encoded, e.g. PRC) using Windows CP936 you should set byte 8 to HEX(81) and byte 9 to HEX(FE). You do not need the EXTRA_DBCS environment variable.

For Japanese systems using Windows CP932 you should set byte 8 to HEX(82) and byte 9 to HEX(9F) as the HEX(81) page contains only punctuation characters. The EXTRA_DBCS environment variable should be set to "E0FC".

Bytes 8 and 9 must be HEX(00) for all other code pages.
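The settings for bytes 8 and 9 and for EXTRA_DBCS can be summarised as a lookup table. The Python sketch below is illustrative only: the names DBCS_SETTINGS, lead_byte_ranges and is_lead_byte are hypothetical, and the code assumes that EXTRA_DBCS is a run of start and end hex pairs, as the "E0FC" value for Japanese suggests.

```python
# code page: (byte 8, byte 9, EXTRA_DBCS value), taken from the text above
DBCS_SETTINGS = {
    949: (0x81, 0xFE, ""),       # Korean
    950: (0xA2, 0xFE, ""),       # Traditional Chinese (Big5)
    936: (0x81, 0xFE, ""),       # Simplified Chinese (GBK)
    932: (0x82, 0x9F, "E0FC"),   # Japanese
}


def lead_byte_ranges(code_page):
    """Return the inclusive lead-in byte ranges implied by the settings."""
    first, final, extra = DBCS_SETTINGS.get(code_page, (0x00, 0x00, ""))
    if first == 0x00:
        return []                # bytes 8 and 9 are HEX(00): no DBCS handling
    ranges = [(first, final)]
    raw = bytes.fromhex(extra)   # assumption: pairs of start/end hex bytes
    ranges += [(raw[i], raw[i + 1]) for i in range(0, len(raw), 2)]
    return ranges


def is_lead_byte(value, code_page):
    """Return True if value would be treated as a DBCS lead-in byte."""
    return any(lo <= value <= hi for lo, hi in lead_byte_ranges(code_page))
```

Under these settings HEX(81) is not treated as a lead-in byte for CP932, because that page holds only punctuation, while HEX(E0) is accepted through the EXTRA_DBCS range.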

Chinese does not require word separators such as spaces, because word boundaries can be inferred heuristically from the context. To help KCML index such languages, specify the maximum number of DBCS characters to be considered as a word. Set this in byte 6 of ws_spec$, or override it with the DBCS_WORD_LEN environment variable. The maximum supported word length is 9. KCML will index all the subsets of DBCS strings up to this maximum length.
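One way to picture the effect of the DBCS word length is the Python sketch below. It is an interpretation, not a description of the KCML implementation: it assumes that "all the subsets" means every contiguous substring of a DBCS run from 1 character up to the maximum length, and dbcs_subwords is a hypothetical name.

```python
def dbcs_subwords(run, max_len):
    """List every contiguous substring of run that is 1..max_len characters.

    run is a string of DBCS characters with no separators; max_len models
    byte 6 of ws_spec$ (or DBCS_WORD_LEN) and may not exceed 9.
    """
    if not 1 <= max_len <= 9:
        raise ValueError("DBCS word length must be between 1 and 9")
    return [
        run[i:i + n]
        for i in range(len(run))          # each starting character
        for n in range(1, max_len + 1)    # each permitted word length
        if i + n <= len(run)              # stay inside the run
    ]
```

For a run of four characters and a word length of 2 this yields seven entries: four single characters and three adjacent pairs. The number of entries grows with the word length, so a larger setting means more index entries for each run of DBCS text.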

If the language code in byte 7 of ws_spec$ is set appropriately for a DBCS language, KCML will apply some rough rules to exclude punctuation characters (Unicode U+3000 to U+303F) from consideration in words.

History