API Reference
Main module
IO
- pyrepseq.io.isvalidaa(string)[source]
returns true if string is composed only of characters from the standard amino acid alphabet
- pyrepseq.io.isvalidcdr3(string)[source]
returns True if string is a valid CDR3 sequence
- Checks the following:
first amino acid is a cysteine (C)
last amino acid is either phenylalanine (F), tryptophan (W), or cysteine (C)
each amino acid is part of the standard amino acid alphabet
See http://www.imgt.org/IMGTScientificChart/Numbering/IMGTIGVLsuperfamily.html and also https://doi.org/10.1093/nar/gkac190
- pyrepseq.io.multimerge(dfs, on, suffixes=None, **kwargs)[source]
Merge multiple dataframes on a common column.
Provides support for custom suffixes.
- Parameters:
on ('index' or column name) –
suffixes ([list-like | None]) – list of suffixes to append to the data
**kwargs (keyword arguments passed along to pd.merge) –
- Return type:
merged dataframe
- pyrepseq.io.standardize_dataframe(df_old: DataFrame, col_mapper: Mapping, standardize: bool = True, species: str = 'HomoSapiens', tcr_enforce_functional: bool = True, tcr_precision: str = 'gene', mhc_precision: str = 'gene', strict_cdr3_standardization: bool = False, suppress_warnings: bool = False)[source]
Utility function to organise TCR data into a standardized format.
If standardization is enabled (True by default), the function will additionally attempt to standardize the TCR and MHC gene symbols to be IMGT-compliant, and CDR3/Epitope amino acid sequences to be valid. During standardization, most non-standardizable/nonsensical values will be removed, replaced with None. However, since epitopes are not necessarily always amino acid sequences, values in the Epitope column that fail standardization will be kept as their original value. The appropriate standardization procedures will be applied for columns with the following names:
TRAV / TRBV
TRAJ / TRBJ
CDR3A / CDR3B
MHCA / MHCB
Epitope
- Parameters:
df_old (pandas.DataFrame) – Source
DataFrame
from which to pull data.col_mapper (Mapping) – A mapping object, such as a dictionary, which maps the old column names to the new column names.
standardize (bool) – When set to
False
, gene name standardisation is not attempted. Defaults toTrue
.species (str) – Name of the species from which the TCR data is derived, in their binomial nomenclature, camel-cased. Defaults to
'HomoSapiens'
.tcr_enforce_functional (bool) – When set to
True
, TCR genes that are not functional (i.e. ORF or pseudogene) are removed, and replaced withNone
. Defaults toTrue
.tcr_precision (str) – Level of precision to trim the TCR gene data to (
'gene'
or'allele'
). Defaults to'gene'
.mhc_precision (str) – Level of precision to trim the MHC gene data to (
'gene'
,'protein'
or'allele'
). Defaults to'gene'
.strict_cdr3_standardization (bool) – If True, any string that does not look like a CDR3 sequence is rejected. If False, any inputs that are valid amino acid sequences but do not start with C and end with F/W are not rejected and instead are corrected by having a C appended to the beginning and an F appended at the end. Defaults to False.
suppress_warnings (bool) – If
True
, suppresses warnings that are emitted when the standardisation of certain values fails. Defaults toFalse
.
- Returns:
Standardized
DataFrame
containing the original data, cleaned.- Return type:
pandas.DataFrame
Stats
- pyrepseq.stats.jaccard_index(A, B)[source]
Calculate the Jaccard index for two sets.
This measure is defined defined as
math:J(A, B) = |A intersection B| / |A union B| A, B: iterables (will be converted to sets). If A, B are pd.Series na values will be dropped first
- pyrepseq.stats.overlap(A, B)[source]
Calculate the number of overlapping elements of two sets.
This measure is defined as \(|A intersection B|\)
A, B: iterables (will be converted to sets). na values will be dropped first
- pyrepseq.stats.overlap_coefficient(A, B)[source]
Calculate the overlap coefficient for two sets.
This measure is defined as \(O(A, B) = |A intersection B| / min(|A|, |B|)\)
A, B: iterables (will be converted to sets). na values will be dropped first
- pyrepseq.stats.pc(array, array2=None)[source]
Estimate the coincidence probability \(p_C\) from a sample. \(p_C\) is equal to the probability that two distinct sampled elements are the same. If \(n_i\) are the counts of the i-th unique element and \(N = \sum_i n_i\) the length of the array, then: \(p_C = \sum_i n_i (n_i-1)/(N(N-1))\)
Note: This measure is also known as the Simpson or Hunter-Gaston index
- Parameters:
array (array-like) – list of sampled elements
array2 (array-like) – second list of sampled elements: if provided probability of cross-coincidences is calculated as \(p_C = (sum_i n_1i n_2i) / (N_1 N_2)\)
- pyrepseq.stats.pc_conditional(df, by, on, take_mean=True, weight_uniformly=False)[source]
Conditional coincidence probability estimator
- Parameters:
df (pandas DataFrame) –
by (list) – conditioning parameters used to group input data frame
on (string/list of strings) – column or columns to compute probability of coincidence or joint probability of coincidence on. If type(on) == list then pc is computed on the concatenations of each specified column
take_mean (bool) – specify wether to take the average once pc has been computed for each specified group
- Returns:
pc of df[on] computed over each group specified in by. if take_mean=True then the average of these group by pcs is returned
- Return type:
pandas DataFrame/float
- pyrepseq.stats.pc_n(n)[source]
Estimate the coincidence probability \(p_C\) from sampled counts. \(p_C\) is equal to the probability that two distinct sampled elements are the same. If \(n_i\) are the counts of the i-th unique element and \(N = \sum_i n_i\) the length of the array, then: \(p_C = \sum_i n_i (n_i-1)/(N(N-1))\)
Note: This measure is also known as the Simpson or Hunter-Gaston index
- Parameters:
n (array-like) – list of counts
- pyrepseq.stats.powerlaw_mle_alpha(c, cmin=1.0, method='exact', **kwargs)[source]
Maximum likelihood estimate of the power-law exponent.
- Parameters:
c (counts) –
cmin (only counts >= cmin are included in fit) –
continuitycorrection (use continuitycorrection (more accurate for integer counts)) –
method (one of ['simple', 'continuitycorrection', 'exact']) –
- ‘simple’: Uses an analytical formula that is exact in the continuous case
(Eq. B17 in Clauset et al. arXiv 0706.1062v2)
’continuitycorrection’: applies a continuity correction to the analytical formula ‘exact’: Numerically maximizes the discrete loglikelihood
kwargs (dict) – passed on to scipy.optimize.minimize_scalar Default: bounds=[1.5, 4.5], method=’bounded’
- Returns:
estimated power-law exponent
- Return type:
- pyrepseq.stats.powerlaw_sample(size=1, xmin=1.0, alpha=2.0)[source]
Draw samples from a discrete power-law.
Uses an approximate transformation technique, see Eq. D6 in Clauset et al. arXiv 0706.1062v2 for details.
- Parameters:
size (number of values to draw) –
xmin (minimal value) –
alpha (power-law exponent) –
- Return type:
array of integer samples
- pyrepseq.stats.renyi2_entropy(df, features, by=None, base=2.0)[source]
Compute Renyi-Simpson entropies
- pyrepseq.stats.subsample(counts, n)[source]
Randomly subsample from a vector of counts without replacement.
- Parameters:
counts (Vector of counts (integers) to randomly subsample from.) –
n (Number of items to subsample from counts. Must be less than or equal) – to the sum of counts.
- Returns:
indices, counts
- Return type:
Subsampled vector of counts where the sum of the elements equals n
Distance
- pyrepseq.distance.calculate_neighbor_numbers(seqs, reference=None, neighborhood=<function levenshtein_neighbors>)[source]
Calculate the number of neighbors for each sequence in a list.
- pyrepseq.distance.cdist(stringsA, stringsB, metric=None, dtype=<class 'numpy.uint8'>, **kwargs)[source]
- Compute distance between each pair of the two collections of strings.
(scipy.spatial.distance.cdist equivalent for strings)
- Parameters:
stringsA (iterable of strings) – An mA-length iterable.
stringsB (iterable of strings) – An mB-length iterable.
metric (function, optional) – The distance metric to use. Default: Levenshtein distance.
dtype (np.dtype) – data type of the distance matrix, default: np.uint8 Note: make sure to change the dtype, if the metric does not return integers
- Returns:
Y – A \(m_A\) by \(m_B\) distance matrix is returned. For each \(i\) and \(j\), the metric
dist(u=XA[i], v=XB[j])
is computed and stored in the \(ij\) th entry.- Return type:
ndarray
- pyrepseq.distance.downsample(seqs, maxseqs)[source]
Random downsampling of a list of sequences.
Also works for tuples (seqs_alpha, seqs_beta).
- pyrepseq.distance.find_neighbor_pairs(seqs, neighborhood=<function hamming_neighbors>)[source]
Find neighboring sequences in a list of unique sequences.
- Parameters:
neighborhood (callable returning an iterable of neighbors) –
- Return type:
list of tuples (seq1, seq2)
- pyrepseq.distance.find_neighbor_pairs_index(seqs, neighborhood=<function hamming_neighbors>)[source]
Find neighboring sequences in a list of unique sequences.
- Parameters:
neighborhood (callable returning an iterable of neighbors) –
- Return type:
list of tuples (index1, index2)
- pyrepseq.distance.hamming_neighbors(x, alphabet='ACDEFGHIKLMNPQRSTVWY', variable_positions=None)[source]
Iterator over Hamming neighbors of a string x.
- Parameters:
alphabet (iterable of characters) –
variable_positions (iterable of positions to be varied (default: all)) –
- pyrepseq.distance.hierarchical_clustering(seqs, pdist_kws={}, linkage_kws={'method': 'average', 'optimal_ordering': True}, cluster_kws={'criterion': 'distance', 't': 6})[source]
Hierarchical clustering by sequence similarity.
pdist_kws: keyword arguments for distance calculation linkage_kws: keyword arguments for linkage algorithm cluster_kws: keyword arguments for clustering algorithm
- pyrepseq.distance.isdist1(x, reference, neighborhood=<function levenshtein_neighbors>)[source]
Is the string x distance 1 away from any of the strings in the reference set
- pyrepseq.distance.levenshtein_neighbors(x, alphabet='ACDEFGHIKLMNPQRSTVWY')[source]
Iterator over Levenshtein neighbors of a string x
- pyrepseq.distance.load_pcDelta_background(return_bins=True)[source]
Loads pre-computed background pcDelta distributions calculated for PBMC TCRs.
Data: Sample W_F1_2018 from Minervina et al. https://zenodo.org/record/4065547/
- Returns:
back (pd.DataFrame) – DataFrame with coincidence probabilities
bins (ndarray [if return_bins = True]) – Delta bins to be used as bins for other data
- pyrepseq.distance.next_nearest_neighbors(x, neighborhood, maxdistance=2)[source]
Set of next nearest neighbors of a string x.
- Parameters:
alphabet (iterable of characters) –
neighborhood (neighborhood iterator) –
maxdistance (go up to maxdistance nearest neighbor) –
- Return type:
set of neighboring sequences
- pyrepseq.distance.nndist_hamming(seq, reference, maxdist=4)[source]
Calculate the nearest-neighbor distance by Hamming distance
- pyrepseq.distance.pcDelta(seqs, seqs2=None, bins=None, normalize=True, pseudocount=0.0, maxseqs=None, **kwargs)[source]
Calculates binned near-coincidence probabilities \(p_C(\Delta)\) among input sequences.
- Parameters:
seqs ([list of strings | tuple of lists]) – sequences, or (seqs_alpha, seqs_beta)
seqs2 ([list of strings | tuple of lists] (optional)) – second list of sequences for cross-comparisons
bins (iterable) – bins for the distances Delta. (Default: range(0, 25)) bins=0: Calculate exact coincidence probability
normalize (bool) – whether to return pc (normalized) or raw counts
pseudocount (float) – for a Bayesian estimation of coincidence frequencies e.g. can use Jeffrey’s prior value of 0.5
maxseqs (int) – maximal number of sequences to keep by random downsampling
**kwargs (dict) – passed on to pdist or cdist
- Returns:
(normalized) histogram of sequence distances
- Return type:
np.ndarray
- pyrepseq.distance.pcDelta_grouped(df, by, seq_columns, **kwargs)[source]
Near-coincidence probabilities conditioned to within-group comparisons.
- Parameters:
df (pd.DataFrame) –
by (mapping, function, label, or list of labels) – see pd.DataFrame.groupby
seq_columns (string) – The data frame column on which we want to apply the pcDelta analysis
**kwargs (keyword arguments) – passed on to pcDelta
- Returns:
pcs – Returns a DataFrame of pC(delta) for each group
- Return type:
pd.DataFrame
- pyrepseq.distance.pcDelta_grouped_cross(df, by, seq_columns, condensed=False, **kwargs)[source]
Near-coincidence probabilities conditioned to cross-group comparisons.
- Parameters:
df (pd.DataFrame) –
by (mapping, function, label, or list of labels) – see pd.DataFrame.groupby
seq_columns (string) – The data frame column on which we want to apply the pcDelta analysis
condensed (bool) – Return a condensed instead of squareform matrix (default: False)
**kwargs (keyword arguments) – passed on to pcDelta
- Returns:
pcs – Returns a DataFrame of pC(delta) across pairs of groups
- Return type:
pd.DataFrame
- pyrepseq.distance.pdist(strings, metric=None, dtype=<class 'numpy.uint8'>, **kwargs)[source]
- Pairwise distances between collection of strings.
(scipy.spatial.distance.pdist equivalent for strings)
- Parameters:
strings (iterable of strings) – An m-length iterable.
metric (function, optional) – The distance metric to use. Default: Levenshtein distance.
dtype (np.dtype) – data type of the distance matrix, default: np.uint8 Note: make sure to change the dtype, if the metric does not return integers
- Returns:
Y – Returns a condensed distance matrix Y. For each \(i\) and \(j\) (where \(i<j<m\)), where m is the number of original observations. The metric
dist(u=X[i], v=X[j])
is computed and stored in entrym * i + j - ((i + 2) * (i + 1)) // 2
.- Return type:
ndarray
Nearest Neighbor
- pyrepseq.nn.hash_based(seqs, max_edits=1, max_returns=None, n_cpu=1, custom_distance=None, max_custom_distance=inf, output_type='triplets')[source]
List all neighboring CDR3B sequences efficiently for small edit distances. The idea is to list all possible sequences within a given distance and lookup the dictionary if it exists. This implementation is faster than kdtree implementation for max_edits == 1
- Parameters:
strings (iterable of strings) – list of CDR3B sequences
max_edits (int) – maximum edit distance defining the neighbors
max_returns (int or None) – maximum neighbor size
n_cpu (int) – number of CPU cores running in parallel
custom_distance (Function(str1, str2) or "hamming") – custom distance function to use, must statisfy 4 properties of distance (https://en.wikipedia.org/wiki/Distance#Mathematical_formalization)
max_custom_distance (float) – maximum distance to include in the result, ignored if custom distance is not supplied
output_type (string) – format of returns, can be “triplets”, “coo_matrix”, or “ndarray”
- Returns:
neighbors – neigbors along with their edit distances according to the given output_type if “triplets” returns are [(x_index, y_index, edit_distance)] if “coo_matrix” returns are scipy’s sparse matrix where C[i,j] = distance(X_i, X_j) or 0 if not neighbor if “ndarray” returns numpy’s 2d array representing dense matrix
- Return type:
array of 3D-tuples, sparse matrix, or dense matrix
- pyrepseq.nn.kdtree(seqs, max_edits=1, max_returns=None, n_cpu=1, custom_distance=None, max_custom_distance=inf, output_type='triplets', compression=1)[source]
List all neighboring CDR3B sequences efficiently within the given edit distance. With KDTree, the algorithms run with O(N logN) eliminating unnecessary comparisons. With RapidFuzz library, the edit distance comparison is efficiently written in C++. With multiprocessing, the algorithm can take advantage of multiple CPU cores. This implementation is faster than hash-based implementation for max_edits > 1
- Parameters:
strings (iterable of strings) – list of CDR3B sequences
max_edits (int) – maximum edit distance defining the neighbors
max_returns (int or None) – maximum neighbor size
n_cpu (int) – number of CPU cores running in parallel
custom_distance (Function(str1, str2) or "hamming") – custom distance function to use, must statisfy 4 properties of distance (https://en.wikipedia.org/wiki/Distance#Mathematical_formalization)
max_custom_distance (float) – maximum distance to include in the result, ignored if custom distance is not supplied
output_type (string) – format of returns, can be “triplets”, “coo_matrix”, or “ndarray”
- Returns:
neighbors – neigbors along with their edit distances according to the given output_type if “triplets” returns are [(x_index, y_index, edit_distance)] if “coo_matrix” returns are scipy’s sparse matrix where C[i,j] = distance(X_i, X_j) or 0 if not neighbor if “ndarray” returns numpy’s 2d array representing dense matrix
- Return type:
array of 3D-tuples, sparse matrix, or dense matrix
- pyrepseq.nn.nearest_neighbor(seqs, max_edits=1, max_returns=None, n_cpu=1, custom_distance=None, max_custom_distance=inf, output_type='triplets', seqs2=None)[source]
List all neighboring sequences efficiently within a given distance. The distance can be given in terms of hamming, levenshtein, or custom.
If seqs2 is not provided, every sequence is compared against every other sequence.
- Parameters:
strings (iterable of strings) – list of CDR3B sequences
max_edits (int) – maximum edit distance defining the neighbors
max_returns (int or None) – maximum neighbor size
n_cpu (int) – ignored
custom_distance (Function(str1, str2) or "hamming") – custom distance function to use, must statisfy 4 properties of distance (https://en.wikipedia.org/wiki/Distance#Mathematical_formalization)
max_custom_distance (float) – maximum distance to include in the result, ignored if custom distance is not supplied
output_type (string) – format of returns, can be “triplets”, “coo_matrix”, or “ndarray”
seq2 (iterable of strings or None) – another list of CDR3B sequences to compare against
- Returns:
neighbors – neigbors along with their edit distances according to the given output_type if “triplets” returns are [(x_index, y_index, edit_distance)] if “coo_matrix” returns are scipy’s sparse matrix where C[i,j] = distance(X_i, X_j) or 0 if not neighbor if “ndarray” returns numpy’s 2d array representing dense matrix
- Return type:
array of 3D-tuples, sparse matrix, or dense matrix
- pyrepseq.nn.nearest_neighbor_tcrdist(df, chain='beta', max_edits=1, max_tcrdist=20, **kwargs)[source]
List all neighboring TCR sequences efficiently within a given edit and TCRdist radius.
chain: ‘alpha’ or ‘beta’ max_edits : only return neighbors up to <= this edit distance max_tcrdist : only return neighbor up to <= this TCR distance
**kwargs : passed on to nearest_neighbor function
- pyrepseq.nn.symspell(seqs, max_edits=1, max_returns=None, n_cpu=1, custom_distance=None, max_custom_distance=inf, output_type='triplets', seqs2=None)[source]
List all neighboring CDR3B sequences efficiently within the given distance. This is an improved version over the hash-based.
If seqs2 is not provided, every sequences are compared against every other sequences resulting in N(seqs)**2 combinations. Otherwise, seqs are compared against seqs2 resulting in N(seqs)*N(seqs2) combinations.
- Parameters:
strings (iterable of strings) – list of CDR3B sequences
max_edits (int) – maximum edit distance defining the neighbors
max_returns (int or None) – maximum neighbor size
n_cpu (int) – ignored
custom_distance (Function(str1, str2) or "hamming") – custom distance function to use, must statisfy 4 properties of distance (https://en.wikipedia.org/wiki/Distance#Mathematical_formalization)
max_custom_distance (float) – maximum distance to include in the result, ignored if custom distance is not supplied
output_type (string) – format of returns, can be “triplets”, “coo_matrix”, or “ndarray”
seq2 (iterable of strings or None) – another list of CDR3B sequences to compare against
- Returns:
neighbors – neigbors along with their edit distances according to the given output_type if “triplets” returns are [(x_index, y_index, edit_distance)] if “coo_matrix” returns are scipy’s sparse matrix where C[i,j] = distance(X_i, X_j) or 0 if not neighbor if “ndarray” returns numpy’s 2d array representing dense matrix
- Return type:
array of 3D-tuples, sparse matrix, or dense matrix
Plotting submodule
- pyrepseq.plotting.align_seqs(seqs)[source]
Align multiple sequences using mafft-linsi with default parameters.
- Parameters:
seqs (iterable of strings) –
- Returns:
aligned sequences (with gaps)
- Return type:
list of strings
- pyrepseq.plotting.label_axes(fig_or_axes, labels='ABCDEFGHIJKLMNOPQRSTUVWXYZ', labelstyle='%s', xy=(-0.1, 0.95), xycoords='axes fraction', **kwargs)[source]
Walks through axes and labels each. kwargs are collected and passed to annotate
- Parameters:
fig (Figure or Axes to work on) –
labels (iterable or None) – iterable of strings to use to label the axes. If None, lower case letters are used.
loc (Where to put the label units (len=2 tuple of floats)) –
xycoords (loc relative to axes, figure, etc.) –
kwargs (to be passed to annotate) –
- pyrepseq.plotting.labels_to_colors_hls(labels, palette_kws={'l': 0.5, 's': 0.8}, min_count=None)[source]
Map a list of labels to a list of unique colors. Uses seaborn.hls_palette.
- Parameters:
df (pandas DataFrame with data) –
labels (list of labels) –
min_count (map all labels seen less than min_count to black) –
palette_kws (passed to seaborn.hls_palette) –
- pyrepseq.plotting.labels_to_colors_tableau(labels, min_count=None)[source]
Map a list of labels to a list of unique colors. Uses Tableau_10 colors
- Parameters:
df (pandas DataFrame with data) –
labels (list of labels) –
min_count (map all labels seen less than min_count to black) –
- pyrepseq.plotting.rankfrequency(data, ax=None, normalize_x=True, normalize_y=False, log_x=True, log_y=True, scalex=1.0, scaley=1.0, **kwargs)[source]
Plot rank frequency plots.
- Parameters:
- Returns:
Objectes representing the plotted data.
- Return type:
list of Line2D
- pyrepseq.plotting.seqlogos(seqs, ax=None, **kwargs)[source]
Display a sequence logo.
Aligns sequences using align_seqs if they are are not of equal length.
- Parameters:
seqs (iterable of strings) – sequences to be displayed
ax (matplotlib.axes) – if None create new figure
**kwargs (dict) – passed on to logomaker.Logo
- Return type:
axes, counts_matrix
- pyrepseq.plotting.seqlogos_vj(df, cdr3_column, v_column, j_column, axes=None, **kwargs)[source]
Display a sequence logo with V and J gene information.
- pyrepseq.plotting.similarity_clustermap(df, alpha_column='cdr3a', beta_column='cdr3b', norm=None, bounds=array([0, 1, 2, 3, 4, 5, 6]), linkage_kws={'method': 'average', 'optimal_ordering': True}, cluster_kws={'criterion': 'distance', 't': 6}, cbar_kws={'format': '%d', 'label': 'Sequence Distance', 'orientation': 'horizontal'}, meta_columns=None, meta_to_colors=None, **kws)[source]
Plots a sequence-similarity clustermap.
- Parameters:
df (pandas DataFrame with data) –
alpha_column (column name with alpha and beta amino acid information (set one to None for single chain plotting)) –
beta_column (column name with alpha and beta amino acid information (set one to None for single chain plotting)) –
norm (matplotlib.colors.Normalize subclass for turning distances into colors) –
bounds (bounds used for colormap matplotlib.colors.BoundaryNorm (only used if norm = None)) –
linkage_kws (keyword arguments for linkage algorithm) –
cluster_kws (keyword arguments for clustering algorithm) –
cbar_kws (keyword arguments for colorbar) –
meta_columns (list-like) – metadata to plot alongside the cluster assignment
meta_to_colors (list-like) – list of functions mapping metadata labels to colors first element of list is for clusters
kws (keyword arguments passed on to the clustermap.) –