    ratings:
      file: "ratings.csv"
      format: csv
      entity: rating
      header: true
    items:
      file: "items.csv"
      entity: movie
      header: true
      columns:
        movieId: id
        title: name

Data Manifest Specification
This document describes the format of data manifests.
Data manifests are written as YAML files. Only the JSON-compatible semantics of YAML are used, so in the future they may be representable in formats such as JSON or TOML.
A manifest file represents either a list or a map.
If it is a map and contains at least one of the keys `file` or `type`, then the map is taken to describe a single data file (see below).
Otherwise, the file describes a collection of data files. If it is a map, then those files are labelled with their map keys; otherwise, they are labelled with their positions (starting from 0).
The example manifest at the top of this document defines a data source consisting of ratings data, read from `ratings.csv`, and movie titles, read from `items.csv`. The files are labelled `ratings` and `items`. The same manifest can be written with numeric labels as follows:

    - file: "ratings.csv"
      format: csv
      entity: rating
      header: true
    - file: "items.csv"
      entity: movie
      header: true
      columns:
        movieId: id
        title: name

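The labelling rules above can be sketched in Python, operating on an already-parsed manifest (plain dicts and lists, matching the JSON-compatible subset of YAML). This is an illustration only, not LensKit's implementation: `label_sources` and `SINGLE_SOURCE_KEYS` are hypothetical names, the key set reflects one reading of the single-file rule, and returning `None` as the label of a single data file is an assumption.

```python
# Keys assumed to mark a map as a single data source description.
SINGLE_SOURCE_KEYS = ("file", "type")


def label_sources(manifest):
    """Return (label, description) pairs for a parsed manifest."""
    if isinstance(manifest, dict):
        if any(key in manifest for key in SINGLE_SOURCE_KEYS):
            # A single data file; it has no label of its own (assumption).
            return [(None, manifest)]
        # A collection labelled by its map keys.
        return list(manifest.items())
    if isinstance(manifest, list):
        # A collection labelled by position, starting from 0.
        return list(enumerate(manifest))
    raise TypeError("a manifest must be a list or a map")


named = {
    "ratings": {"file": "ratings.csv", "format": "csv"},
    "items": {"file": "items.csv", "entity": "movie"},
}
print([label for label, _ in label_sources(named)])                  # ['ratings', 'items']
print([label for label, _ in label_sources(list(named.values()))])  # [0, 1]
```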
Data Source Description
Individual data sources are described with the following schema.
The data source type. Currently only
The default source type is `textfile`.
The remainder of the keys are defined by the particular data source.
textfile data sources
textfile sources read data from text files, with one entity per line. A text file may have one or more lines of header data.
`file`: The file to read.
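`format`: The file format. Can be one of:
- `delimited`: delimited, columnar text (a default delimiter applies if none is given)
- `csv`: comma-separated (`delimited` with a delimiter of `,`)
- `tsv`: tab-separated (`delimited` with a delimiter of the tab character)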
The delimiter string for files with the `delimited` format.
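As a rough sketch of how a format name could map to a delimiter string, the hypothetical helper below handles the `csv`, `tsv`, and `delimited` cases. It is not LensKit's code. The text mentions a default delimiter for `delimited`, but the sketch makes no assumption about its value and requires the caller to supply one; ignoring an explicit delimiter for `csv` and `tsv` is also an assumption.

```python
def resolve_delimiter(fmt, delimiter=None):
    """Return the delimiter string implied by a format name."""
    if fmt == "csv":
        return ","
    if fmt == "tsv":
        return "\t"
    if fmt == "delimited":
        if delimiter is None:
            # This sketch does not guess the default; the caller must pass one.
            raise ValueError("no delimiter given for the 'delimited' format")
        return delimiter
    raise ValueError(f"unknown format: {fmt!r}")
```

For instance, `resolve_delimiter("delimited", "|")` returns `"|"`, while `resolve_delimiter("tsv")` returns a tab character.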
The name of the entity type contained in this file. The entity type is also used to provide defaults for the columns. The default is
The entity builder to be used for these entities. The entity type may provide a default; otherwise, `org.lenskit.data.entities.BasicEntityBuilder` is used. The keyword `basic` can be used to refer to the basic entity builder, to override a default entity builder if desired.
`header`: Whether the file has a header. If `true`, the file has a single-line header; if
false, no header is assumed. If an integer, it is the number of header lines. The default is
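The three accepted forms of the header setting (true, false, or an integer) can be reduced to a count of lines to skip. The helper below is a hypothetical sketch of that rule, not LensKit's implementation. It takes an explicit value and makes no assumption about the default.

```python
def header_line_count(header):
    """Translate a header setting into the number of header lines to skip."""
    # bool is a subclass of int in Python, so it must be tested first.
    if isinstance(header, bool):
        return 1 if header else 0
    if isinstance(header, int):
        if header < 0:
            raise ValueError("header line count cannot be negative")
        return header
    raise TypeError("header must be a boolean or an integer")
```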
`columns`: A list describing the columns in the file (for columnar formats). A column descriptor can be either a string, giving a column name, or a map with keys `name` (the column name) and `type` (the column type, see [attribute data types](#data-types)).
If `header` is `true`, then this can be a map whose keys are column header labels and whose values are column descriptors.
If an `id` column is not specified, then entity IDs are synthesized from the line numbers in the file.
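To make the column descriptors and ID synthesis concrete, here is a minimal sketch that reads delimited text using a list of column descriptors. It is an illustration under stated assumptions rather than LensKit's reader: `normalize_column` and `read_entities` are hypothetical names, values are left as strings with no type conversion, and synthesized IDs are assumed to be 1-based row numbers that count header lines.

```python
import csv
import io


def normalize_column(descriptor):
    """Turn a string or map column descriptor into a (name, type) pair."""
    if isinstance(descriptor, str):
        return descriptor, None  # type is left to the entity type's defaults
    return descriptor["name"], descriptor.get("type")


def read_entities(text, delimiter, columns, header_lines=0):
    """Parse delimited text into a list of attribute dictionaries."""
    names = [normalize_column(column)[0] for column in columns]
    entities = []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for line_no, row in enumerate(reader, start=1):
        if line_no <= header_lines:
            continue  # skip header lines
        entity = dict(zip(names, row))
        # Assumption: a missing id is filled in with the 1-based row number.
        entity.setdefault("id", line_no)
        entities.append(entity)
    return entities


rows = read_entities(
    "user,item\n10,42\n11,43\n",
    ",",
    ["user", {"name": "item"}],
    header_lines=1,
)
```

Here neither column is named `id`, so each entity receives its row number as an ID. If one of the descriptors were `id`, the value read from the file would be kept instead.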
A list of attribute names to be indexed for fast lookup. If no indexes are specified,
`user` if they are present on the entities.
`meta`: Metadata about the data, such as the `domain` for rating values. Some of the common metadata:
`domain`: The valid range of rating values. A map with the keys `minimum`, `maximum`, and `precision`; for example, 0.5-5 star data with 1/2 star precision would be described as follows:

    meta:
      domain:
        minimum: 0.5
        maximum: 5
        precision: 0.5

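One way to read such a domain is as a validity check on rating values. The sketch below is a hypothetical helper, not part of LensKit. It assumes that valid values are whole multiples of the precision, and it uses a small tolerance to absorb floating-point error.

```python
def in_domain(value, domain, tolerance=1e-9):
    """Check a rating against a domain map with minimum, maximum, precision."""
    if value < domain["minimum"] or value > domain["maximum"]:
        return False
    precision = domain.get("precision")
    if not precision:
        return True  # no precision given: any value in range is accepted
    # Assumption: valid values are whole multiples of the precision.
    steps = value / precision
    return abs(steps - round(steps)) <= tolerance


half_stars = {"minimum": 0.5, "maximum": 5, "precision": 0.5}
```

With `half_stars`, the value 3.5 is accepted, while 3.7 (off the half-star grid) and 5.5 (out of range) are rejected.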
derived Data Sources
Derived data sources extract entity IDs (but no other attributes) from other entities in the completed data source. This can be used to do things such as extract user IDs from ratings or purchase events.
Derived data sources are indicated as follows:

    type: derived
    source_type: purchase
    source_attribute: user
    entity_type: user

Some entity types include default derivations; for example, `rating` entities automatically produce `user` and `item` entities.
Derived entities are only used if no other component of the data source provides an entity. So if you have a file of users, and you also derive users, then the derived users will only be used when there is not a ‘real’ user to use.
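The precedence rule can be illustrated with a small sketch in which entities are plain dictionaries. `derive_entities` is a hypothetical helper, not LensKit's API: it takes the source entities, the attribute holding the ID, and a map of 'real' entities keyed by ID, and adds an ID-only entity only where no real one exists.

```python
def derive_entities(sources, source_attribute, existing):
    """Merge real entities with ID-only entities derived from `sources`."""
    result = dict(existing)  # real entities always take priority
    for entity in sources:
        entity_id = entity.get(source_attribute)
        if entity_id is not None and entity_id not in result:
            # Derived entities carry the ID and no other attributes.
            result[entity_id] = {"id": entity_id}
    return result


purchases = [
    {"user": 1, "item": 10},
    {"user": 2, "item": 11},
    {"user": 1, "item": 12},
]
real_users = {2: {"id": 2, "name": "Ann"}}
users = derive_entities(purchases, "user", real_users)
```

In this example user 2 keeps the attributes from the real user record, and user 1 appears as a derived entity with only an ID.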
Attribute Data Types
The following types are supported for attributes:
- Java class name
The corresponding class. Must be convertible with Joda-Convert.
The entity type may provide default types for various attribute names, in addition to providing a default set of columns if `columns` is missing entirely. If no default is available and the type is not specified, attributes are assumed to be strings.
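The fallback order for attribute types (explicit type, then the entity type's default, then string) can be sketched as follows. `resolve_type` is a hypothetical helper, and the type names `"string"` and `"long"` are placeholders for illustration rather than names confirmed by this specification.

```python
def resolve_type(name, declared_type=None, entity_defaults=None):
    """Pick the type for an attribute using the fallback order above."""
    if declared_type is not None:
        return declared_type  # an explicit type always wins
    defaults = entity_defaults or {}
    # Fall back to the entity type's default, then to string.
    return defaults.get(name, "string")


# Hypothetical defaults that an entity type might supply.
example_defaults = {"id": "long"}
```

For example, `resolve_type("title", None, example_defaults)` falls through to `"string"`, while `resolve_type("id", None, example_defaults)` picks up the entity default.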