There are two basic forms of encoding of files on Windows:
•Using the Locale (MBCS)
•Using Unicode
See Fixed format data files for how this type of data file is treated.
The Locale is set in the Control Panel; changing the Locale involves restarting the PC.
Unicode is a system of character encoding that handles characters in any language and does not depend on the Locale set in the Control Panel.
Unicode assigns a unique value to every character in most of the languages used worldwide.
Unicode files can be moved to another country and will remain readable, with the contents unaffected.
For text files, these are the normal ways to store the data:
•Unicode UTF-8 Encoding
•Unicode UTF-16 LE Encoding
•ASCII (MBCS) encoding, which relies on the correct Locale being set in the Control Panel.
The Companion uses UTF-16 LE internally and most files read and produced are now UTF-8 encoded with a BOM.
Programs can detect which files are Unicode because a Unicode file can have a BOM (byte order mark) at the beginning of the file.
The BOM is a few special bytes at the front of the file that identify it as Unicode encoded:
•UTF-8 uses a 3-byte BOM
•UTF-16 uses a 2-byte BOM. For Windows this will be UTF-16 LE (little endian)
•ASCII does not use a BOM
If a QDF file is Unicode this is shown in the main window title.
If a file does not have a BOM then the program scans the file to check if it is UTF-8 encoded.
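The detection logic described above (check for a BOM first, then scan the bytes to see whether they form valid UTF-8) can be sketched in Python. The function name and the returned labels are illustrative only, not the Companion's actual implementation:

```python
# Sketch of BOM-based encoding detection with a UTF-8 fallback scan.
# Assumes the rules described above; labels are illustrative only.
import codecs

def detect_encoding(data: bytes) -> str:
    """Return 'UTF-8', 'UTF-16 LE', 'UTF-16 BE', or 'ASCII/MBCS'."""
    if data.startswith(codecs.BOM_UTF8):        # 3-byte BOM: EF BB BF
        return "UTF-8"
    if data.startswith(codecs.BOM_UTF16_LE):    # 2-byte BOM: FF FE
        return "UTF-16 LE"
    if data.startswith(codecs.BOM_UTF16_BE):    # 2-byte BOM: FE FF
        return "UTF-16 BE"
    # No BOM: scan the bytes. Random MBCS data is very unlikely to form
    # valid UTF-8 by chance; plain ASCII is a subset of UTF-8 anyway.
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        return "ASCII/MBCS"
```

Note that a pure-ASCII file is reported as UTF-8 here, which is harmless: ASCII bytes mean the same thing in both encodings.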
Classic files will be ANSI. On Windows this uses MBCS and relies on the Language for non-Unicode programs (the Locale) being set correctly in the Control Panel.
There is a move towards assuming such files are UTF-8, which does not rely on the Language for non-Unicode programs being set correctly in the Control Panel.
The Companion always puts a BOM on the UTF-8 and UTF-16 files it generates, to avoid confusion.
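As an illustration of writing a file with a BOM (not the Companion's own code), Python's 'utf-8-sig' codec prepends the 3-byte UTF-8 BOM automatically, while plain 'utf-8' writes none. The file name below is just an example:

```python
# Writing a UTF-8 file with a BOM using the 'utf-8-sig' codec.
# The path is illustrative only.
import os
import tempfile

text = "Grüße, 世界"
path = os.path.join(tempfile.gettempdir(), "bom_demo.txt")

with open(path, "w", encoding="utf-8-sig") as f:  # BOM added on write
    f.write(text)

with open(path, "rb") as f:
    raw = f.read()

assert raw[:3] == b"\xef\xbb\xbf"  # the 3-byte UTF-8 BOM is present

# Reading back with 'utf-8-sig' strips the BOM again:
with open(path, encoding="utf-8-sig") as f:
    assert f.read() == text
```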
A data file from another source without a BOM may be UTF-8 (UNI) or ASCII (ASC).
IMPORTANT: You can use [Raw] [File encoding check] to check whether a file is UTF-8 encoded. Alternatively, opening a file without a BOM in Notepad (not WordPad) will normally report the encoding correctly.
In UTF-8, each character takes up 1 to 4 bytes. English characters and the usual punctuation marks use a single byte.
UTF-8 is a form of Unicode encoding designed to reduce the size of the file where a lot of the content is English or Western European.
XML and HTML files contain a lot of standard English characters in mark-up, so these are usually stored in UTF-8 format. This preserves all the Unicode text in a smaller file.
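The variable-length behaviour is easy to demonstrate; the sample characters below are just examples of each byte length:

```python
# UTF-8 byte lengths: ASCII takes 1 byte, accented Latin letters 2,
# most CJK characters 3, and characters outside the Basic Multilingual
# Plane (such as emoji) 4.
samples = {
    "A": 1,    # ASCII letter
    "é": 2,    # Latin-1 Supplement
    "中": 3,   # CJK ideograph
    "😀": 4,   # outside the BMP
}
for ch, expected in samples.items():
    assert len(ch.encode("utf-8")) == expected
```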
IMPORTANT: We recommend using UTF-8 encoding for all files because:
•Any text in any language will be shown correctly provided you have the relevant language pack installed.
•UTF-8 files can be moved safely between UNIX and Windows environments.
In UTF-16, every character uses at least two bytes, so a UTF-16 file will normally be larger than the equivalent UTF-8 file.
The Windows .NET Framework uses UTF-16 LE (Little Endian) internally.
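The size difference is simple to see for mostly-ASCII content such as mark-up; the HTML snippet below is just an example:

```python
# UTF-16 LE uses two bytes for every ASCII character, so ASCII-heavy
# text (typical of XML/HTML mark-up) doubles in size versus UTF-8.
text = "<html><body>Hello</body></html>"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")  # raw UTF-16 LE, no BOM

assert len(utf8) == len(text)        # 1 byte per ASCII character
assert len(utf16) == 2 * len(utf8)   # 2 bytes per character
```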
In locale (MBCS) encoding, each character takes up 1 or 2 bytes. The 2-byte characters map to different characters in different languages, so only one language other than English can be used.
There is a setting in the Control Panel to tell programs which language to use for non-Unicode files (the Locale).
A Windows-specific MBCS encoding is used.
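This is why the Locale matters for non-Unicode files: the same byte means a different character under different Windows code pages. The two code pages below are just examples:

```python
# The same byte decoded under two Windows code pages gives two
# different characters — cp1252 (Western European) vs cp1251 (Cyrillic).
raw = b"\xe9"  # one byte, no BOM, no way to tell the encoding

assert raw.decode("cp1252") == "é"  # under a Western European Locale
assert raw.decode("cp1251") == "й"  # under a Cyrillic Locale
```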
See Fixed format data files for details about this type of data file.
CSV files will now normally be UTF-8. We recommend using CSV data files because they will normally be smaller than fixed format files and there is no risk of slippage, where a data item takes up more or less space than has been allocated.
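A CSV file can be written as UTF-8 with a BOM using Python's standard csv module and the 'utf-8-sig' codec, making the file self-identifying as described above. The file name and columns are illustrative only:

```python
# Writing and re-reading a UTF-8 CSV file with a BOM. The 'utf-8-sig'
# codec adds the BOM on write and strips it again on read, so the first
# header cell is not polluted with BOM bytes.
import csv
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "demo.csv")
rows = [["Name", "City"], ["Søren", "København"], ["José", "São Paulo"]]

with open(path, "w", newline="", encoding="utf-8-sig") as f:
    csv.writer(f).writerows(rows)

with open(path, newline="", encoding="utf-8-sig") as f:
    assert list(csv.reader(f)) == rows
```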