Connecting to Data

As described in The LensKit Data Model, LensKit abstracts data access using data access objects (DAOs). We’ve seen how to use the DAO to access entities; this page explains how to create and configure a DAO.

Currently, LensKit provides DAOs that support in-memory storage and reading from static files (e.g. CSV files). The DAO interface can be reimplemented on top of SQL, Hibernate, MongoDB, or whatever your infrastructure requires.

LensKit’s data access is all read-only. LensKit does not provide any facilities for modifying the underlying data; if you need to modify data while LensKit is running, just write the new data directly to the database and provide a LensKit DAO that reads this data. Prebuilt model components will be out of date with respect to the new data, but many components will take into account the latest data when producing recommendations.

The Static File Data Source

The StaticDataSource class compiles a DAO from entities stored in static data files, such as CSV files, and collections in memory. It can additionally derive entities from mentions in other entity lists, such as synthesizing a set of items from the item IDs in ratings.

Static data sources are described by data manifests, YAML files that describe the collection of files comprising the data source. For example, to read the ratings.csv and movies.csv files from one of the recent MovieLens data sets, you could use the following:

  file: ratings.csv
  format: csv
  header: true
  entity_type: rating
  file: movies.csv
  format: csv
  header: true
  entity_type: item
  columns: [id, name]

This describes a data source that reads ratings from ratings.csv, using the default column layout for ratings (user,item,rating,timestamp), and movie IDs and titles (as the name attribute) from movies.csv. See the Data Manifest Specification for more details on manifest files.

Once you have such a data source, you can load it:

StaticDataSource movielens = StaticDataSource.load(Paths.get("ml-latest/movielens.yml"));

StaticDataSource is a Provider of data access objects; call its get() method to get a data access object suitable for use in a recommmender.

Common Data Set Manifests

For convenience, we provdie manifests for several common data sets:

results matching ""

    No results matching ""