Data Processing in the Evaluator
Before you can run an evaluation, you usually need to pre-process the evaluation data. In LensKit 3, we have separated this step out from the main evaluator, so that you can use the output of LensKit’s data processing in other tools, or use your own code to prepare your train and test sets for the evaluator.
Cross-validating
Cross-validation is supported in LensKit by first cross-folding a data set to produce a set of train-test pairs, and then running the train-test evaluator on those data sets. Cross-folding is supported by the crossfold command and the Crossfold Gradle task.
Here’s a quick example of a crossfold task in a Gradle build script:
task crossfold(type: Crossfold, group: 'evaluate') {
    input 'data/ml-100k.yml'
    // test on 5 random ratings from each user
    userPartitionMethod holdout(5, 'random')
    // use 5-fold cross-validation
    partitionCount 5
    // pack data for efficiency
    outputFormat 'PACK'
}
The options supported by the cross-validation process are defined by Crossfold; common ones include:
input - The input data, given as the path to its YAML data manifest; see Specifying Input Data for more details.
partitionCount - The number of train-test data splits to produce.
name - A name for the data set.
outputDir - The output directory; defaults to ${buildDir}/${name}.out, e.g. build/crossfold.out.
outputFormat - The output format; can be one of CSV, CSV_GZ, CSV_XZ, or PACK. If you do not need to process the crossfolded output with other software, PACK is the most efficient for evaluation.
method - The cross-folding method. Can be one of the following:
    PARTITION_USERS - Split the users into partitionCount disjoint partitions. For each partition, produce a train-test split by considering some ratings from the users in that partition to be test ratings, and the remainder of those users' ratings along with all ratings by other users to be the training ratings. This is the default option.
    PARTITION_RATINGS - Split the ratings into partitionCount disjoint partitions.
    SAMPLE_USERS - Select partitionCount disjoint sets of users by random sampling. Produce train and test data as with PARTITION_USERS. This is useful for large data sets where you don’t want to test on every user.
sampleSize - When method is SAMPLE_USERS, determines how many users are used to prepare testing data for each train-test set.
userPartitionMethod - When method is PARTITION_USERS or SAMPLE_USERS, determines how each test user's ratings are split into train and test ratings. Can be one of:
    - holdout(n, 'random') - Select n random ratings to be test ratings, with the remainder used for training.
    - holdout(n, 'timestamp') - Select the n most recent ratings to be test ratings.
    - holdoutFraction(f, order) - Select a fraction f (0 < f < 1) of the user’s ratings to be test ratings. order is one of 'random' or 'timestamp', as with holdout.
    - retain(n, order) - Select n ratings to be training ratings, with the remainder used for testing. order is one of 'random' (for n random ratings) or 'timestamp' (for the n oldest ratings).
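As a rough sketch of how these options combine, the following hypothetical task samples users rather than partitioning all of them. The option names are the ones listed in this section; the task name, the sample size, and the holdout fraction are illustrative values. It also assumes that method accepts a quoted constant name in the same way outputFormat does in the example task, which should be checked against the Crossfold task documentation.

```groovy
// Hypothetical variant of the crossfold task; all values are illustrative.
task crossfoldSample(type: Crossfold, group: 'evaluate') {
    input 'data/ml-100k.yml'
    // assumption: method takes a quoted constant name, like outputFormat
    method 'SAMPLE_USERS'
    // use 100 sampled users to prepare test data for each train-test set
    sampleSize 100
    // produce 5 train-test sets from disjoint user samples
    partitionCount 5
    // test on the most recent 20% of each test user's ratings
    userPartitionMethod holdoutFraction(0.2, 'timestamp')
    // compressed CSV, so other software can read the output
    outputFormat 'CSV_GZ'
}
```

Choosing CSV_GZ here trades the evaluation efficiency of PACK for output that other tools can process.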
Specifying Input Data
Several tasks take input data from some data set. While LensKit can read data from a wide variety of sources through custom data access objects, the evaluator currently supports only static data sources described by data manifests.
Configure a data source by specifying the path to its data manifest.