Configuration
=============

Singlet is designed to work on separate projects at once. To keep projects tidy and independent, there are two layers of configuration.

The ``SINGLET_CONFIG_FILENAME`` environment variable
-----------------------------------------------------

If you set the environment variable ``SINGLET_CONFIG_FILENAME`` to point at a YAML file, singlet will use it to configure your current session. To use separate sessions in parallel, just prepend your scripts with::

    import os
    os.environ['SINGLET_CONFIG_FILENAME'] = ''

and each will work totally independently.

The configuration file
----------------------

Singlet loads a configuration file in YAML format when you import the ``singlet`` module. If you have not specified the location of this file with the ``SINGLET_CONFIG_FILENAME`` environment variable, it defaults to::

    /.singlet/config.yml

so the software will look there. An example configuration file is available online. If you are not familiar with YAML syntax, it is a bit like Python dictionaries without brackets... or like JSON.

Before going into the specifics, here's a schematic example of the configuration file:

.. code-block:: yaml

    io:
      samplesheets:
        ss1:
          path: xxx.csv
          index: samplename

      featuresheets:
        fs1:
          path: yyy.csv
          index: EnsemblGeneID

      count_tables:
        ct1:
          path: zzz.csv
          normalized: no
          spikeins:
            - ERCC-00002
            - ERCC-00003
          other:
            - __alignment_not_unique
            - __not_aligned

      datasets:
        ds1:
          path: xxx.loom
          format: loom
          axis_samples: columns
          index_samples: Cell
          index_features: Gene

Now for the full specification, the root key value pairs are:

- ``io``: for input/output specifications. At the moment this key is the only master key and is required.

There are no root lists.

io
~~~~~~~~

The ``io`` section has the following key value pairs:

- ``samplesheets``: samplesheet files or Google Documents (for sample metadata).
- ``featuresheets``: featuresheet files (for feature/gene annotations or metadata).
- ``count_tables``: count table files.
- ``datasets``: integrated datasets (a single file contains all three properties above, e.g. loom files).

samplesheets
_______________

The ``samplesheets`` section contains an arbitrary number of key value pairs and no lists. Each entry describes a samplesheet and has the following format:

- the key determines the id of the samplesheet: this id is used in the constructor of ``Dataset``.
- the value is a series of key value pairs, no lists.

Singlet can source samplesheets either from a local file or from an online Google Sheet. If you want to use a local file, use the following key value pairs:

- ``path``: a filename on disk containing the samplesheet, usually in CSV/TSV format.
- ``format``: the file format of the samplesheet (optional). If missing, it is inferred from the ``path`` filename.

If you prefer to source an online Google Sheet, use the following key value pairs:

- ``url``: the URL of the spreadsheet, e.g. 'https://docs.google.com/spreadsheets/d/15OKOC48WZYFUQvYl9E7qEsR6AjqE4_BW7qcCsjJAD6w' for the example sheet.
- ``client_id_filename``: a local filename (initially empty) where your login information for OAUTH2 is stored. This is a JSON file, so this variable typically ends with ``.json``.
- ``client_secret_filename``: a local filename (initially empty) where your secret information for OAUTH2 is stored. This is a JSON file, so this variable typically ends with ``.json``.
- ``sheet``: the name of the sheet with the data within the spreadsheet.

Whichever way you source the data, the following key value pairs are available:

- ``description``: a description of the sample sheet (optional).
- ``cells``: one of ``rows`` or ``columns``. If each row in the samplesheet is a sample, use ``rows``, else use ``columns``. Notice that singlet samplesheets have samples as **rows**.
- ``index``: the name of the column/row of the samplesheet containing the sample names. This defaults to ``name`` (optional).
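
The two ways of sourcing a samplesheet described above can be sketched as a small check on a single ``samplesheets`` entry, represented here as a plain Python dictionary. This is only an illustration: the helper ``samplesheet_source`` is hypothetical and not part of singlet, and the rules it enforces (exactly one of ``path`` or ``url``, and all four Google Sheet keys present together) are assumptions read off the lists above, not documented singlet behavior.

.. code-block:: python

    # Hypothetical helper, NOT part of singlet: classify a samplesheet entry.
    LOCAL_KEYS = {'path', 'format'}
    GOOGLE_KEYS = {'url', 'client_id_filename', 'client_secret_filename', 'sheet'}
    COMMON_KEYS = {'description', 'cells', 'index'}


    def samplesheet_source(entry):
        """Return 'local' or 'google' for a samplesheet entry (a dict)."""
        unknown = set(entry) - LOCAL_KEYS - GOOGLE_KEYS - COMMON_KEYS
        if unknown:
            raise ValueError('Unknown keys: {}'.format(sorted(unknown)))

        has_path = 'path' in entry
        has_url = 'url' in entry
        if has_path == has_url:
            # Assumption: exactly one of the two sources must be given
            raise ValueError('Specify exactly one of "path" or "url"')

        if has_path:
            return 'local'

        # Assumption: a Google Sheet entry needs all four Google keys
        missing = GOOGLE_KEYS - set(entry)
        if missing:
            raise ValueError('Missing keys: {}'.format(sorted(missing)))
        return 'google'


    # Entry taken from the schematic example (ss1)
    print(samplesheet_source({'path': 'xxx.csv', 'index': 'samplename'}))

Running a check of this kind before importing ``singlet`` may help catch a mistyped key early, since the configuration is loaded at import time.
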

count_tables
_____________________

The ``count_tables`` section contains an arbitrary number of key value pairs and no lists. Each entry describes a counts table and has the following format:

- the key determines the id of the counts table: this id is used in the constructor of ``Dataset``.
- the value is a series of key value pairs, no lists.

The following key value pairs are available:

- ``description``: a description of the counts table (optional).
- ``path``: a filename on disk containing the counts table, usually in CSV/TSV format.
- ``format``: the file format of the counts table (optional). If missing, it is inferred from the ``path`` filename.
- ``cells``: one of ``rows`` or ``columns``. If each row in the counts table is a sample, use ``rows``, else use ``columns``.
- ``normalized``: either ``yes`` or ``no``. If data is not normalized, you can normalize it with singlet by using the ``CountsTable.normalize`` method.
- ``sparse``: either ``yes`` or ``no`` (default). If ``yes``, the count table will be loaded by default as ``CountsTableSparse``, else as ``CountsTable`` (dense).
- ``spikeins``: a YAML list of features that appear in the counts table and represent spike-in controls as opposed to real features. Spikeins can be excluded from the counts table using ``CountsTable.exclude_features``.
- ``other``: a YAML list of features that are neither biological features nor spike-in controls. This list typically includes ambiguous alignments, multiple-aligned reads, reads outside features, etc. Other features can be excluded from the counts table using ``CountsTable.exclude_features``.

The first column/row of the counts table must be the list of samples.

featuresheets
________________

The ``featuresheets`` section contains an arbitrary number of key value pairs and no lists. Each entry describes a featuresheet, i.e. a table with metadata for the features. A typical usage of featuresheets is to connect feature ids (e.g.
``EnsemblGeneID``) with human-readable names, Gene Ontology terms, species information, pathways, cellular localization, etc. Each entry has the following format:

- the key is the id of the featuresheet: this id is used in the constructor of ``Dataset``.
- the value is a series of key value pairs, no lists.

The following key value pairs are available:

- ``description``: a description of the featuresheet (optional).
- ``path``: a filename on disk containing the featuresheet, usually in CSV/TSV format.
- ``format``: the file format of the featuresheet (optional). If missing, it is inferred from the ``path`` filename.
- ``features``: one of ``rows`` or ``columns``. If each row in the featuresheet is a feature, use ``rows``, otherwise use ``columns``.
- ``index``: the name of the column/row of the featuresheet containing the feature names. This defaults to ``name`` (optional).

datasets
________________

The ``datasets`` section contains an arbitrary number of key value pairs and no lists. Each entry describes an integrated dataset, i.e. a single file containing one or more of the three main data structures (``CountsTable``, ``Samplesheet``, and ``Featuresheet``). The most common use of integrated datasets is when all three data structures are present and they are embedded in a single file for portability purposes or lazy evaluation (the latter is not implemented yet). Each entry has the following format:

- the key is the id of the dataset: this id is used in the constructor of ``Dataset``.
- the value is a series of key value pairs, no lists.

The following key value pairs are available:

- ``description``: a description of the dataset (optional).
- ``path``: a filename on disk containing the integrated dataset, e.g. in LOOM format.
- ``format``: the file format of the dataset (optional). If missing, it is inferred from the ``path`` filename.
- ``axis_samples``: one of ``rows`` or ``columns``.
  If every sample is a column in the count matrix, use ``columns``, else use ``rows``.
- ``index_samples``: the name of the column/row of the dataset containing the sample names.
- ``index_features``: the name of the column/row of the dataset containing the feature names.
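
To tie the sections together, here is a minimal sketch that walks a configuration once it has been parsed into nested Python dictionaries. The dictionary below mirrors the schematic example shown earlier; the helpers ``list_ids`` and ``non_biological_features`` are hypothetical names used for illustration only and are not part of singlet. The first collects the ids that each ``io`` section defines (the ids used in the constructor of ``Dataset``), the second gathers the ``spikeins`` and ``other`` features of one counts table, i.e. the candidates one might pass to ``CountsTable.exclude_features``.

.. code-block:: python

    # Plain-dict mirror of the schematic configuration shown earlier.
    # YAML 1.1 parsers such as PyYAML read the bare word "no" as False.
    config = {
        'io': {
            'samplesheets': {
                'ss1': {'path': 'xxx.csv', 'index': 'samplename'},
            },
            'featuresheets': {
                'fs1': {'path': 'yyy.csv', 'index': 'EnsemblGeneID'},
            },
            'count_tables': {
                'ct1': {
                    'path': 'zzz.csv',
                    'normalized': False,
                    'spikeins': ['ERCC-00002', 'ERCC-00003'],
                    'other': ['__alignment_not_unique', '__not_aligned'],
                },
            },
            'datasets': {
                'ds1': {
                    'path': 'xxx.loom',
                    'format': 'loom',
                    'axis_samples': 'columns',
                    'index_samples': 'Cell',
                    'index_features': 'Gene',
                },
            },
        },
    }

    SECTIONS = ('samplesheets', 'featuresheets', 'count_tables', 'datasets')


    def list_ids(config):
        """Map each io section to the sorted ids defined in it."""
        io = config['io']  # "io" is the only root key and is required
        return {section: sorted(io.get(section, {})) for section in SECTIONS}


    def non_biological_features(config, table_id):
        """Spike-ins followed by 'other' features of one counts table."""
        entry = config['io']['count_tables'][table_id]
        return list(entry.get('spikeins', [])) + list(entry.get('other', []))


    print(list_ids(config))
    print(non_biological_features(config, 'ct1'))

Both helpers treat missing sections or missing lists as empty, which is an assumption made to keep the sketch short rather than a statement about how singlet handles incomplete configurations.
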