Configuration¶
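Singlet is designed to work on several separate projects at once. To keep projects tidy and independent, there are two layers of configuration: an environment variable that selects the configuration file, and the configuration file itself.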
The SINGLET_CONFIG_FILENAME environment variable¶
If you set the environment variable SINGLET_CONFIG_FILENAME to point at a YAML file, singlet will use it to configure your current session. Because the configuration is read when the singlet module is imported, set the variable before the import. To run separate sessions in parallel, start each script with:
import os
os.environ['SINGLET_CONFIG_FILENAME'] = '<full path to config file>'
and each session will work independently of the others.
The configuration file¶
Singlet loads a configuration file in YAML format when you import the singlet module. If you have not specified the location of this file with the SINGLET_CONFIG_FILENAME environment variable, it defaults to:
<your home folder>/.singlet/config.yml
An example configuration file is available online. If you are not familiar with YAML syntax, it resembles Python dictionaries without brackets, or JSON.
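The lookup order described above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not singlet's own code: the function name resolve_config_path is hypothetical, and the sketch assumes that an empty variable is treated the same as an unset one.

```python
import os
from pathlib import Path


def resolve_config_path(environ=None):
    """Return the configuration path following the documented lookup order.

    Hypothetical helper: the environment variable wins if it is set,
    otherwise the default location in the home folder is used.
    """
    environ = os.environ if environ is None else environ
    custom = environ.get('SINGLET_CONFIG_FILENAME')
    if custom:
        return Path(custom)
    # Default: <your home folder>/.singlet/config.yml
    return Path(os.path.expanduser('~')) / '.singlet' / 'config.yml'


# No variable set: fall back to the default location.
print(resolve_config_path({}))
# Variable set: use the project-specific file (the path is a made-up example).
print(resolve_config_path({'SINGLET_CONFIG_FILENAME': '/tmp/project_a.yml'}))
```

Passing the environment as an argument keeps the sketch easy to try without touching the real environment of the session.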
Before going into the specifics, here’s a schematic example of the configuration file:
io:
  samplesheets:
    ss1:
      path: xxx.csv
      index: samplename
  featuresheets:
    fs1:
      path: yyy.csv
      index: EnsemblGeneID
  count_tables:
    ct1:
      path: zzz.csv
      normalized: no
      spikeins:
        - ERCC-00002
        - ERCC-00003
      other:
        - __alignment_not_unique
        - __not_aligned
  datasets:
    ds1:
      path: xxx.loom
      format: loom
      axis_samples: columns
      index_samples: Cell
      index_features: Gene
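For readers more at home with Python than YAML, the schematic example corresponds roughly to the nested dictionary below. This is a hand-written equivalent, not the output of singlet; it assumes a YAML 1.1 parser such as PyYAML, which reads an unquoted no as the boolean False.

```python
# Hand-written Python equivalent of the schematic YAML example.
config = {
    'io': {
        'samplesheets': {
            'ss1': {'path': 'xxx.csv', 'index': 'samplename'},
        },
        'featuresheets': {
            'fs1': {'path': 'yyy.csv', 'index': 'EnsemblGeneID'},
        },
        'count_tables': {
            'ct1': {
                'path': 'zzz.csv',
                # Assumption: `normalized: no` is parsed as boolean False.
                'normalized': False,
                'spikeins': ['ERCC-00002', 'ERCC-00003'],
                'other': ['__alignment_not_unique', '__not_aligned'],
            },
        },
        'datasets': {
            'ds1': {
                'path': 'xxx.loom',
                'format': 'loom',
                'axis_samples': 'columns',
                'index_samples': 'Cell',
                'index_features': 'Gene',
            },
        },
    },
}

# YAML mappings become dicts and YAML lists become Python lists.
ct1 = config['io']['count_tables']['ct1']
print(ct1['path'])
print(ct1['spikeins'])
```

The ids ss1, fs1, ct1 and ds1 are the dictionary keys one level below each section name.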
Now for the full specification, the root key value pairs are:
- io: for input/output specifications. At the moment this key is the only master key and is required.
There are no root lists.
io¶
The io section has the following key value pairs:
- samplesheets: samplesheet files or Google Sheets (for sample metadata).
- featuresheets: featuresheet files (for feature/gene annotations or metadata).
- count_tables: count table files.
- datasets: integrated datasets (a single file that contains all three of the above, e.g. loom files).
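As a rough illustration of the top-level layout, the following sketch checks a configuration that has already been parsed into a Python dictionary. The function check_config and its messages are hypothetical and are not part of singlet; the sketch only encodes the rules stated above (io is required and is the only root key, and it contains up to four known sections).

```python
KNOWN_IO_KEYS = {'samplesheets', 'featuresheets', 'count_tables', 'datasets'}


def check_config(config):
    """Return a list of problems found at the top levels of a config dict.

    Hypothetical helper for illustration; an empty list means no problem
    was found by these (deliberately shallow) checks.
    """
    problems = []
    for key in config:
        if key != 'io':
            problems.append(f"unexpected root key {key!r}")
    if 'io' not in config:
        problems.append("missing required root key 'io'")
        return problems
    # An empty `io:` section is parsed as None by YAML loaders.
    for key in (config['io'] or {}):
        if key not in KNOWN_IO_KEYS:
            problems.append(f"unexpected key {key!r} in 'io'")
    return problems


print(check_config({'io': {'samplesheets': {}, 'datasets': {}}}))
print(check_config({'input': {}}))
```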
samplesheets¶
- The samplesheets section contains an arbitrary number of key value pairs and no lists. Each entry describes a samplesheet and has the following format:
  - the key determines the id of the samplesheet: this id is used in the constructor of Dataset.
  - the value is a series of key value pairs, no lists.
- Singlet can source samplesheets either from a local file or from an online Google Sheet. If you want to use a local file, use the following key value pairs:
  - path: a filename on disk containing the samplesheet, usually in CSV/TSV format.
  - format: the file format of the samplesheet (optional). If missing, it is inferred from the path filename.
- If you prefer to source an online Google Sheet, use the following key value pairs:
  - url: the URL of the spreadsheet, e.g. ‘https://docs.google.com/spreadsheets/d/15OKOC48WZYFUQvYl9E7qEsR6AjqE4_BW7qcCsjJAD6w’ for the example sheet.
  - client_id_filename: a local filename (initially empty) where your login information for OAuth2 is stored. This is a JSON file, so this variable typically ends with .json.
  - client_secret_filename: a local filename (initially empty) where your secret information for OAuth2 is stored. This is a JSON file, so this variable typically ends with .json.
  - sheet: the name of the sheet with the data within the spreadsheet.
- Whichever way you source the data, the following key value pairs are available:
  - description: a description of the samplesheet (optional).
  - cells: one of rows or columns. If each row in the samplesheet is a sample, use rows, else use columns. Notice that singlet samplesheets have samples as rows.
  - index: the name of the column/row of the samplesheet containing the sample names (optional). This defaults to name.
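The two ways of sourcing a samplesheet and the default for index can be sketched as follows. Both helper functions are hypothetical and only mirror the rules listed above; the filenames client_id.json and client_secret.json, the sheet name and the shortened URL are made-up placeholders.

```python
def samplesheet_source(entry):
    """Classify a samplesheet entry as 'local' (path) or 'google' (url).

    Hypothetical helper: assumes an entry uses exactly one of the two keys.
    """
    has_path = 'path' in entry
    has_url = 'url' in entry
    if has_path == has_url:
        raise ValueError("expected exactly one of 'path' or 'url'")
    return 'local' if has_path else 'google'


def with_defaults(entry):
    """Return a copy of the entry with the documented default for index."""
    completed = dict(entry)
    completed.setdefault('index', 'name')
    return completed


local_entry = {'path': 'xxx.csv', 'index': 'samplename'}
google_entry = {
    'url': 'https://docs.google.com/spreadsheets/d/<spreadsheet id>',
    'client_id_filename': 'client_id.json',          # placeholder filename
    'client_secret_filename': 'client_secret.json',  # placeholder filename
    'sheet': 'Sheet1',                               # placeholder sheet name
}

print(samplesheet_source(local_entry))
print(samplesheet_source(google_entry))
print(with_defaults(google_entry)['index'])
```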
count_tables¶
- The count_tables section contains an arbitrary number of key value pairs and no lists. Each entry describes a counts table and has the following format:
  - the key determines the id of the counts table: this id is used in the constructor of Dataset.
  - the value is a series of key value pairs, no lists.
- The following key value pairs are available:
  - description: a description of the counts table (optional).
  - path: a filename on disk containing the counts table, usually in CSV/TSV format.
  - format: the file format of the counts table (optional). If missing, it is inferred from the path filename.
  - cells: one of rows or columns. If each row in the counts table is a sample, use rows, else use columns.
  - normalized: either yes or no. If the data is not normalized, you can normalize it with singlet by using the CountsTable.normalize method.
  - sparse: either yes or no (default). If yes, the counts table will be loaded by default as CountsTableSparse, else as CountsTable (dense).
  - spikeins: a YAML list of features that appear in the counts table and represent spike-in controls as opposed to real features. Spike-ins can be excluded from the counts table using CountsTable.exclude_features.
  - other: a YAML list of features that are neither biological features nor spike-in controls. This list typically includes ambiguous alignments, multiple-aligned reads, reads outside features, etc. Other features can be excluded from the counts table using CountsTable.exclude_features.
The first column/row of the counts table must be the list of samples.
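To make the role of the spikeins and other lists concrete, here is a plain-Python sketch of the filtering they enable. It does not call singlet and is not how CountsTable.exclude_features is implemented; the function biological_features and the feature names GeneA and GeneB are invented for the illustration.

```python
def biological_features(feature_names, table_config):
    """Keep only features that are neither spike-ins nor 'other' features.

    Hypothetical helper; `table_config` is one entry of the count_tables
    section, already parsed into a dict.
    """
    excluded = set(table_config.get('spikeins', []))
    excluded |= set(table_config.get('other', []))
    return [name for name in feature_names if name not in excluded]


table_config = {
    'path': 'zzz.csv',
    'spikeins': ['ERCC-00002', 'ERCC-00003'],
    'other': ['__alignment_not_unique', '__not_aligned'],
}
# Invented feature names standing in for the row labels of a counts table.
features = ['GeneA', 'ERCC-00002', 'GeneB', '__not_aligned']

print(biological_features(features, table_config))
```

Entries without spikeins or other lists are handled by the empty-list defaults, in which case every feature is kept.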
featuresheets¶
- The featuresheets section contains an arbitrary number of key value pairs and no lists. Each entry describes a featuresheet, i.e. a table with metadata for the features. A typical usage of featuresheets is to connect feature ids (e.g. EnsemblGeneID) with human-readable names, Gene Ontology terms, species information, pathways, cellular localization, etc. Each entry has the following format:
  - the key is the id of the featuresheet: this id is used in the constructor of Dataset.
  - the value is a series of key value pairs, no lists.
- The following key value pairs are available:
  - description: a description of the featuresheet (optional).
  - path: a filename on disk containing the featuresheet, usually in CSV/TSV format.
  - format: the file format of the featuresheet (optional). If missing, it is inferred from the path filename.
  - features: one of rows or columns. If each row in the featuresheet is a feature, use rows, otherwise use columns.
  - index: the name of the column/row of the featuresheet containing the feature names (optional). This defaults to name.
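The idea of a featuresheet indexed by a chosen column can be illustrated with the standard csv module. This is a minimal sketch and not singlet's loader: the function load_featuresheet, the column GeneName and the two rows of data are invented, and the sketch assumes features are stored as rows.

```python
import csv
import io

# Invented featuresheet content, with features as rows.
FEATURESHEET_CSV = """EnsemblGeneID,GeneName
ENSG_EXAMPLE_1,GeneA
ENSG_EXAMPLE_2,GeneB
"""


def load_featuresheet(handle, index='name'):
    """Read a CSV featuresheet into a dict keyed by the index column.

    Hypothetical helper; `index` plays the role of the index key above.
    """
    reader = csv.DictReader(handle)
    return {row[index]: row for row in reader}


sheet = load_featuresheet(io.StringIO(FEATURESHEET_CSV), index='EnsemblGeneID')
# Look up the human-readable name for a feature id.
print(sheet['ENSG_EXAMPLE_2']['GeneName'])
```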
datasets¶
- The datasets section contains an arbitrary number of key value pairs and no lists. Each entry describes an integrated dataset, i.e. a single file containing one or more of the three main data structures (CountsTable, Samplesheet, and Featuresheet). The most common use of integrated datasets is when all three data structures are present and they are embedded in a single file for portability purposes or lazy evaluation (the latter is not implemented yet). Each entry has the following format:
  - the key is the id of the dataset: this id is used in the constructor of Dataset.
  - the value is a series of key value pairs, no lists.
- The following key value pairs are available:
  - description: a description of the dataset (optional).
  - path: a filename on disk containing the integrated dataset, e.g. in LOOM format.
  - format: the file format of the dataset (optional). If missing, it is inferred from the path filename.
  - axis_samples: one of rows or columns. If every sample is a column in the count matrix, use columns, else use rows.
  - index_samples: the name of the column/row of the dataset containing the sample names.
  - index_features: the name of the column/row of the dataset containing the feature names.
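What axis_samples communicates can be shown with a small orientation helper. The sketch assumes, purely for illustration, that the desired layout has features as rows and samples as columns; the function samples_as_columns is hypothetical and works on plain nested lists rather than on singlet objects.

```python
def samples_as_columns(matrix, axis_samples):
    """Return the count matrix with samples as columns.

    Hypothetical helper: `matrix` is a list of rows, and `axis_samples`
    is the value of the axis_samples key ('rows' or 'columns').
    """
    if axis_samples == 'columns':
        # Already in the assumed target layout: just copy the rows.
        return [list(row) for row in matrix]
    if axis_samples == 'rows':
        # Transpose so that each sample becomes a column.
        return [list(column) for column in zip(*matrix)]
    raise ValueError("axis_samples must be 'rows' or 'columns'")


# Two samples (rows) measured on three features (columns).
stored = [[1, 2, 3],
          [4, 5, 6]]
print(samples_as_columns(stored, 'rows'))
```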