Data file

Data files contain information that is usually given by respondents to the questions in a questionnaire.

If Companion Input is used on a data file then a Data index file will be created.

There are many different types of data file format; the way in which the data is stored in the file varies:

CSV type

  o Portable CSV (comma separated value)

  o OpenXML/Excel (spreadsheet)

Raw data

  o Fixed format data (see Fixed format data files explained):

      UNI: data locations are set as characters within each record

      ASC: data locations are set as bytes within each record

Binary data:

  o CBA (QPSMR binary)

  o CBE (Quantum binary)

  o CSI (360 column binary, fixed record length)

We recommend using CSV for data files because these can contain any language text as data. For CSV data we recommend UTF-8 with BOM encoding.
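As an illustration of the recommended encoding, the following minimal Python sketch writes a CSV file as UTF-8 with a BOM. The function name and the sample column names are hypothetical; they are not part of Companion.

```python
import csv


def write_csv_with_bom(path, header, rows):
    """Write rows to a CSV file encoded as UTF-8 with a BOM.

    The "utf-8-sig" codec writes the three BOM bytes (EF BB BF) at the
    start of the file. newline="" lets the csv module control line ends.
    """
    with open(path, "w", encoding="utf-8-sig", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
```

For example, write_csv_with_bom("data.csv", ["id", "answer"], [[1, "café"]]) produces a file that starts with the BOM and keeps the accented text intact.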

Data files generated from another source may be used with Companion, although they will not necessarily have the correct extension.

If you are not sure of the type of data file which has been supplied to you, use the Raw data view window or the Raw CSV file view window.

Supplied data files from other programs are most likely to be one of the following:

UNI (fixed format UTF-8 encoded)

This type of data is often exported from another program and used by the triple-s standard.  It may or may not have a BOM (byte order mark).

ASC (fixed format ASCII)

This type of data may be exported from another program and may be used by the triple-s standard.

CSV (comma separated values, also known as comma delimited)

This type of file is normally output from a spreadsheet program, database or statistical package.  It may need some changes before it can be used as a data file.

CBE

Binary file in Quantum format.

CSI

A rare form of column binary file.  If a card image file is supplied to you and the total byte size of the data file is exactly divisible by 160, it is likely to be a CSI file.  If the total byte size of the file is exactly divisible by 162, the file may be a CSI file that includes termination controls at the end of each line.
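The size check described above can be sketched in Python as follows. This is only a rough heuristic based on the 160 and 162 byte rule; the function name and the returned labels are hypothetical.

```python
def guess_csi_layout(size_in_bytes):
    """Guess whether a file size is consistent with a CSI file.

    Returns "csi" if the size is exactly divisible by 160,
    "csi-with-line-ends" if it is exactly divisible by 162,
    and None otherwise. A match is a hint only, not proof.
    """
    if size_in_bytes <= 0:
        return None
    if size_in_bytes % 160 == 0:
        return "csi"
    if size_in_bytes % 162 == 0:
        return "csi-with-line-ends"
    return None
```

Some sizes (multiples of 12960) are divisible by both 160 and 162; this sketch reports them as "csi", so the file contents should still be inspected in the Raw data view window.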

Verbatim files

In CSV type data files, verbatim question answers are stored in the data file with the other question data.

If the project contains verbatim questions and the data is a raw data type, then the verbatim answers are stored in a separate Verbatim CSV file.

When a raw data file is opened, you can (optionally) read the associated verbatim file to get the verbatim text.  The verbatim answers are checked against the main data file and any duplicates are removed.  A log of any inconsistencies found will be shown.

If you have included the verbatim data then a new verbatim file will be saved when the main raw data file is saved.  A copy of the original verbatim file will be kept in the Archive folder.

Data file structures

For information about the layout of data in a data file, see User Guide, Handling raw data.

See also the special requirements for CATI data files.

All native data file types except CSI are split into lines; each line ends with two termination characters, a CR (carriage return) and an LF (line feed).  Lines vary in length because blank columns at the end of a card are not usually written to the file.

Data files from other operating systems, for example UNIX, may only have a LF (line feed) at the end of each line.

In byte swapped files, each pair of bytes (characters) has been swapped in the file; this can happen accidentally when moving files between different types of computer.
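The two repairs mentioned above (line ends and byte swapping) can be sketched in Python. Both function names are hypothetical, and the sketch works on the raw bytes of a file outside Companion.

```python
def normalise_line_ends(data):
    """Make every line end CR LF, whether it was LF only or already CR LF.

    Not suitable for CSI files, where bytes 10 and 13 can be data values.
    """
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


def swap_byte_pairs(data):
    """Swap each pair of bytes back, undoing a byte swapped transfer."""
    if len(data) % 2:
        raise ValueError("byte swapped data must have an even length")
    swapped = bytearray(data)
    # Bytes at even positions take the values from odd positions and vice versa.
    swapped[0::2], swapped[1::2] = data[1::2], data[0::2]
    return bytes(swapped)
```

For example, swap_byte_pairs(b"2143") returns b"1234".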

UNI

This is the data file type normally used in the default character data mode.

This is an ordinary character data file often known as "fixed format".  This type of file may be produced by a line editor or word processor, or it may be output from a variety of programs, such as scanning (OMR and OCR) software, databases or statistics packages.

If codes V and X are used in ASCII data, they will be displayed as & (ampersand) and - (minus) respectively.

ASC (ASCII character)

This is the data file type normally used in older programs.

This is an ordinary character data file often known as "fixed format ASCII".  This type of file may be produced by a line editor or word processor, or it may be output from a variety of programs, such as scanning (OMR and OCR) software, databases or statistics packages.

If codes V and X are used in ASCII data, they will be displayed as & (ampersand) and - (minus) respectively.

CBA (QPSMR binary)

This is the data file type normally used in binary mode.

Column binary data is stored at the rate of one column per two bytes. The columns are split into two parts:

the first part VX0123 is stored in the first byte

the second part 456789 in the second byte

These parts are placed in the least significant six bits in each byte, giving each character a value between 0 (no codes) and 63 (all six codes). A blank (32) is then added to each character to give a value between 32 (no codes) and 95 (all six codes).

An 80 column card will produce a line of up to 160 valid printable ASCII characters, which means that CBA files can be handled by all the standard file handling programs.  A standard sort program can be used with reverse set (highest first).
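The CBA encoding described above can be sketched in Python. Note one assumption: this page does not say which of the six bits holds which code, so the sketch assumes the leftmost code of each part (V and 4) occupies the most significant of the six bits. The function names are hypothetical.

```python
FIRST_PART = "VX0123"   # codes stored in the first byte
SECOND_PART = "456789"  # codes stored in the second byte
OFFSET = 32             # the blank (32) added to each six-bit value


def _pack(codes, part):
    # ASSUMPTION: leftmost code of the part is the most significant bit.
    value = 0
    for position, code in enumerate(part):
        if code in codes:
            value |= 1 << (5 - position)
    return value


def encode_cba_column(codes):
    """Return the two bytes that store one column's set of codes."""
    codes = set(codes)
    unknown = codes - set(FIRST_PART + SECOND_PART)
    if unknown:
        raise ValueError(f"unknown codes: {sorted(unknown)}")
    return bytes([_pack(codes, FIRST_PART) + OFFSET,
                  _pack(codes, SECOND_PART) + OFFSET])


def decode_cba_column(pair):
    """Return the set of codes stored in a two-byte column."""
    if len(pair) != 2:
        raise ValueError("a CBA column is exactly two bytes")
    codes = set()
    for byte, part in zip(pair, (FIRST_PART, SECOND_PART)):
        value = byte - OFFSET
        if not 0 <= value <= 63:
            raise ValueError(f"byte {byte} is outside the range 32 to 95")
        for position, code in enumerate(part):
            if value & (1 << (5 - position)):
                codes.add(code)
    return codes
```

Whatever the real bit order is, a column with no codes is stored as two blanks (32, 32) and a column with all twelve codes as two underscores (95, 95).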

CBE (Quantum binary)

This data file type is similar to ASC except that multi-coded columns are allowed; each multi-coded column is replaced by an asterisk.  If there are any multi-coded columns then the last column on the card is followed by a DEL character (ASCII 127), which is followed by a pair of characters for each asterisk.

Each pair of characters is similar to CBA format except that an "@" (64) is added to each character, giving a value between 64 (no codes) and 127 (all six codes).  Unfortunately, the value for all six codes (127) forms the DEL character, which may prevent the transfer of CBE files over a computer link.

Columns that contain ASCII 32 through 126 (except 42) are not treated as multi-coded; the relevant character is placed in the column.
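The CBE layout described above can be sketched in Python as follows. The function name is hypothetical. To avoid guessing the bit order, each multi-coded column is returned as a pair of six-bit values (0 to 63) rather than as a set of codes.

```python
DEL = 0x7F       # separates the card from the multi-code pairs
ASTERISK = 0x2A  # marks a multi-coded column on the card
OFFSET = 64      # the "@" (64) added to each six-bit value


def decode_cbe_line(line):
    """Decode one CBE line into a list of columns.

    Ordinary columns are returned as one-character strings.
    Multi-coded columns are returned as (first, second) six-bit values.
    """
    line = line.rstrip(b"\r\n")
    # Card columns never contain DEL, so the first DEL is the separator.
    card, _, tail = line.partition(bytes([DEL]))
    if len(tail) != 2 * card.count(b"*"):
        raise ValueError("expected one pair of characters per asterisk")
    columns = []
    position = 0
    for byte in card:
        if byte == ASTERISK:
            first = tail[position] - OFFSET
            second = tail[position + 1] - OFFSET
            if not (0 <= first <= 63 and 0 <= second <= 63):
                raise ValueError("pair value outside the range 64 to 127")
            columns.append((first, second))
            position += 2
        else:
            columns.append(chr(byte))
    return columns
```

For example, a line holding the bytes "1*3", DEL, 127, 64 decodes to "1", (63, 0), "3": the second column has all six codes of the first part and none of the second.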

CSI (IBM 360 binary)

In this data file type column binary data is stored at the rate of one column per two bytes. The columns are split into two parts:

VX0123 in the first byte

456789 in the second byte

These parts are placed in the least significant six bits in each byte, giving each character a value between 0 (no codes) and 63 (all six codes).

This file type is not split into lines.  Every card (line) uses the same space (usually 80 columns, 160 bytes).  Sometimes line ends are added and these can be removed in the Raw data view window by setting the padding to 2 (or 1) in the treatment.
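Reading a CSI file outside Companion could be sketched in Python as below. The function names and the padding parameter are hypothetical; padding here simply means the number of line-end bytes to skip after each card, in the spirit of the padding setting mentioned above.

```python
def split_csi_cards(data, columns=80, padding=0):
    """Split CSI data into fixed-length cards of columns * 2 bytes.

    padding is the number of extra bytes (for example 2 for CR LF)
    that follow each card and are discarded.
    """
    card_size = columns * 2
    record_size = card_size + padding
    if len(data) % record_size:
        raise ValueError("file size is not a whole number of records")
    return [data[start:start + card_size]
            for start in range(0, len(data), record_size)]


def column_pairs(card):
    """Return the (first, second) six-bit values for each column of a card."""
    pairs = []
    for index in range(0, len(card), 2):
        first, second = card[index], card[index + 1]
        if first > 63 or second > 63:
            raise ValueError("CSI bytes must be in the range 0 to 63")
        pairs.append((first, second))
    return pairs
```

With the default 80 columns, each card is 160 bytes, or 162 bytes on disk when padding is 2.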