API Reference

Main module

IO

pyrepseq.io.isvalidaa(string)[source]

returns true if string is composed only of characters from the standard amino acid alphabet

pyrepseq.io.isvalidcdr3(string)[source]

returns True if string is a valid CDR3 sequence

Checks the following:
  • first amino acid is a cysteine (C)

  • last amino acid is either phenylalanine (F), tryptophan (W), or cysteine (C)

  • each amino acid is part of the standard amino acid alphabet

See http://www.imgt.org/IMGTScientificChart/Numbering/IMGTIGVLsuperfamily.html and also https://doi.org/10.1093/nar/gkac190

pyrepseq.io.multimerge(dfs, on, suffixes=None, **kwargs)[source]

Merge multiple dataframes on a common column.

Provides support for custom suffixes.

Parameters:
  • on ('index' or column name) –

  • suffixes ([list-like | None]) – list of suffixes to append to the data

  • **kwargs (keyword arguments passed along to pd.merge) –

Return type:

merged dataframe

pyrepseq.io.standardize_dataframe(df_old: DataFrame, col_mapper: Mapping, standardize: bool = True, species: str = 'HomoSapiens', tcr_enforce_functional: bool = True, tcr_precision: str = 'gene', mhc_precision: str = 'gene', strict_cdr3_standardization: bool = False, suppress_warnings: bool = False)[source]

Utility function to organise TCR data into a standardized format.

If standardization is enabled (True by default), the function will additionally attempt to standardize the TCR and MHC gene symbols to be IMGT-compliant, and CDR3/Epitope amino acid sequences to be valid. During standardization, most non-standardizable/nonsensical values will be removed, replaced with None. However, since epitopes are not necessarily always amino acid sequences, values in the Epitope column that fail standardization will be kept as their original value. The appropriate standardization procedures will be applied for columns with the following names:

  • TRAV / TRBV

  • TRAJ / TRBJ

  • CDR3A / CDR3B

  • MHCA / MHCB

  • Epitope

Parameters:
  • df_old (pandas.DataFrame) – Source DataFrame from which to pull data.

  • col_mapper (Mapping) – A mapping object, such as a dictionary, which maps the old column names to the new column names.

  • standardize (bool) – When set to False, gene name standardisation is not attempted. Defaults to True.

  • species (str) – Name of the species from which the TCR data is derived, in their binomial nomenclature, camel-cased. Defaults to 'HomoSapiens'.

  • tcr_enforce_functional (bool) – When set to True, TCR genes that are not functional (i.e. ORF or pseudogene) are removed, and replaced with None. Defaults to True.

  • tcr_precision (str) – Level of precision to trim the TCR gene data to ('gene' or 'allele'). Defaults to 'gene'.

  • mhc_precision (str) – Level of precision to trim the MHC gene data to ('gene', 'protein' or 'allele'). Defaults to 'gene'.

  • strict_cdr3_standardization (bool) – If True, any string that does not look like a CDR3 sequence is rejected. If False, any inputs that are valid amino acid sequences but do not start with C and end with F/W are not rejected and instead are corrected by having a C appended to the beginning and an F appended at the end. Defaults to False.

  • suppress_warnings (bool) – If True, suppresses warnings that are emitted when the standardisation of certain values fails. Defaults to False.

Returns:

Standardized DataFrame containing the original data, cleaned.

Return type:

pandas.DataFrame

Stats

pyrepseq.stats.chao1(counts)[source]

Estimate richness from sampled counts.

pyrepseq.stats.chao2(counts, m)[source]

Estimate richness from incidence data

pyrepseq.stats.jaccard_index(A, B)[source]

Calculate the Jaccard index for two sets.

This measure is defined defined as

math:J(A, B) = |A intersection B| / |A union B| A, B: iterables (will be converted to sets). If A, B are pd.Series na values will be dropped first

pyrepseq.stats.overlap(A, B)[source]

Calculate the number of overlapping elements of two sets.

This measure is defined as \(|A intersection B|\)

A, B: iterables (will be converted to sets). na values will be dropped first

pyrepseq.stats.overlap_coefficient(A, B)[source]

Calculate the overlap coefficient for two sets.

This measure is defined as \(O(A, B) = |A intersection B| / min(|A|, |B|)\)

A, B: iterables (will be converted to sets). na values will be dropped first

pyrepseq.stats.pc(array, array2=None)[source]

Estimate the coincidence probability \(p_C\) from a sample. \(p_C\) is equal to the probability that two distinct sampled elements are the same. If \(n_i\) are the counts of the i-th unique element and \(N = \sum_i n_i\) the length of the array, then: \(p_C = \sum_i n_i (n_i-1)/(N(N-1))\)

Note: This measure is also known as the Simpson or Hunter-Gaston index

Parameters:
  • array (array-like) – list of sampled elements

  • array2 (array-like) – second list of sampled elements: if provided probability of cross-coincidences is calculated as \(p_C = (sum_i n_1i n_2i) / (N_1 N_2)\)

pyrepseq.stats.pc_conditional(df, by, on, take_mean=True, weight_uniformly=False)[source]

Conditional coincidence probability estimator

Parameters:
  • df (pandas DataFrame) –

  • by (list) – conditioning parameters used to group input data frame

  • on (string/list of strings) – column or columns to compute probability of coincidence or joint probability of coincidence on. If type(on) == list then pc is computed on the concatenations of each specified column

  • take_mean (bool) – specify wether to take the average once pc has been computed for each specified group

Returns:

pc of df[on] computed over each group specified in by. if take_mean=True then the average of these group by pcs is returned

Return type:

pandas DataFrame/float

pyrepseq.stats.pc_joint(df, on)[source]

Joint coincidence probability estimator

Parameters:
  • df (pandas DataFrame) –

  • on (list of strings) – columns on which to obtain a joint probability of coincidence

Returns:

pc computed on the concatenations of each specified column in on

Return type:

float

pyrepseq.stats.pc_n(n)[source]

Estimate the coincidence probability \(p_C\) from sampled counts. \(p_C\) is equal to the probability that two distinct sampled elements are the same. If \(n_i\) are the counts of the i-th unique element and \(N = \sum_i n_i\) the length of the array, then: \(p_C = \sum_i n_i (n_i-1)/(N(N-1))\)

Note: This measure is also known as the Simpson or Hunter-Gaston index

Parameters:

n (array-like) – list of counts

pyrepseq.stats.powerlaw_mle_alpha(c, cmin=1.0, method='exact', **kwargs)[source]

Maximum likelihood estimate of the power-law exponent.

Parameters:
  • c (counts) –

  • cmin (only counts >= cmin are included in fit) –

  • continuitycorrection (use continuitycorrection (more accurate for integer counts)) –

  • method (one of ['simple', 'continuitycorrection', 'exact']) –

    ‘simple’: Uses an analytical formula that is exact in the continuous case

    (Eq. B17 in Clauset et al. arXiv 0706.1062v2)

    ’continuitycorrection’: applies a continuity correction to the analytical formula ‘exact’: Numerically maximizes the discrete loglikelihood

  • kwargs (dict) – passed on to scipy.optimize.minimize_scalar Default: bounds=[1.5, 4.5], method=’bounded’

Returns:

estimated power-law exponent

Return type:

float

pyrepseq.stats.powerlaw_sample(size=1, xmin=1.0, alpha=2.0)[source]

Draw samples from a discrete power-law.

Uses an approximate transformation technique, see Eq. D6 in Clauset et al. arXiv 0706.1062v2 for details.

Parameters:
  • size (number of values to draw) –

  • xmin (minimal value) –

  • alpha (power-law exponent) –

Return type:

array of integer samples

pyrepseq.stats.renyi2_entropy(df, features, by=None, base=2.0)[source]

Compute Renyi-Simpson entropies

Parameters:
  • df (pandas DataFrame) –

  • features (list) –

  • by (string/list of strings) –

  • base (float) –

Return type:

float

pyrepseq.stats.shannon_entropy(df, features, by=None, base=2.0)[source]

Compute Shannon entropies

Parameters:
  • df (pandas DataFrame) –

  • features (list) –

  • by (string/list of strings) –

  • base (float) –

Return type:

float

pyrepseq.stats.stdpc(array)[source]

Std.dev. estimator for Simpson’s index

pyrepseq.stats.stdpc_n(n)[source]

Std.dev. estimator for Simpson’s index

pyrepseq.stats.subsample(counts, n)[source]

Randomly subsample from a vector of counts without replacement.

Parameters:
  • counts (Vector of counts (integers) to randomly subsample from.) –

  • n (Number of items to subsample from counts. Must be less than or equal) – to the sum of counts.

Returns:

indices, counts

Return type:

Subsampled vector of counts where the sum of the elements equals n

pyrepseq.stats.var_chao1(counts)[source]

Variance estimator for Chao1 richness.

pyrepseq.stats.var_chao2(counts, m)[source]

Variance estimator for Chao2 richness.

pyrepseq.stats.varpc_n(n)[source]

Variance estimator for Simpson’s index

Distance

pyrepseq.distance.calculate_neighbor_numbers(seqs, reference=None, neighborhood=<function levenshtein_neighbors>)[source]

Calculate the number of neighbors for each sequence in a list.

Parameters:
  • seqs (list of sequences) –

  • reference (list of sequences, set(seqs) if None) –

  • neighborhood (function returning iterator over neighbors) –

Return type:

integer array of number of neighboring sequences

pyrepseq.distance.cdist(stringsA, stringsB, metric=None, dtype=<class 'numpy.uint8'>, **kwargs)[source]
Compute distance between each pair of the two collections of strings.

(scipy.spatial.distance.cdist equivalent for strings)

Parameters:
  • stringsA (iterable of strings) – An mA-length iterable.

  • stringsB (iterable of strings) – An mB-length iterable.

  • metric (function, optional) – The distance metric to use. Default: Levenshtein distance.

  • dtype (np.dtype) – data type of the distance matrix, default: np.uint8 Note: make sure to change the dtype, if the metric does not return integers

Returns:

Y – A \(m_A\) by \(m_B\) distance matrix is returned. For each \(i\) and \(j\), the metric dist(u=XA[i], v=XB[j]) is computed and stored in the \(ij\) th entry.

Return type:

ndarray

pyrepseq.distance.downsample(seqs, maxseqs)[source]

Random downsampling of a list of sequences.

Also works for tuples (seqs_alpha, seqs_beta).

pyrepseq.distance.find_neighbor_pairs(seqs, neighborhood=<function hamming_neighbors>)[source]

Find neighboring sequences in a list of unique sequences.

Parameters:

neighborhood (callable returning an iterable of neighbors) –

Return type:

list of tuples (seq1, seq2)

pyrepseq.distance.find_neighbor_pairs_index(seqs, neighborhood=<function hamming_neighbors>)[source]

Find neighboring sequences in a list of unique sequences.

Parameters:

neighborhood (callable returning an iterable of neighbors) –

Return type:

list of tuples (index1, index2)

pyrepseq.distance.hamming_neighbors(x, alphabet='ACDEFGHIKLMNPQRSTVWY', variable_positions=None)[source]

Iterator over Hamming neighbors of a string x.

Parameters:
  • alphabet (iterable of characters) –

  • variable_positions (iterable of positions to be varied (default: all)) –

pyrepseq.distance.hierarchical_clustering(seqs, pdist_kws={}, linkage_kws={'method': 'average', 'optimal_ordering': True}, cluster_kws={'criterion': 'distance', 't': 6})[source]

Hierarchical clustering by sequence similarity.

pdist_kws: keyword arguments for distance calculation linkage_kws: keyword arguments for linkage algorithm cluster_kws: keyword arguments for clustering algorithm

pyrepseq.distance.isdist1(x, reference, neighborhood=<function levenshtein_neighbors>)[source]

Is the string x distance 1 away from any of the strings in the reference set

pyrepseq.distance.levenshtein_neighbors(x, alphabet='ACDEFGHIKLMNPQRSTVWY')[source]

Iterator over Levenshtein neighbors of a string x

pyrepseq.distance.load_pcDelta_background(return_bins=True)[source]

Loads pre-computed background pcDelta distributions calculated for PBMC TCRs.

Data: Sample W_F1_2018 from Minervina et al. https://zenodo.org/record/4065547/

Returns:

  • back (pd.DataFrame) – DataFrame with coincidence probabilities

  • bins (ndarray [if return_bins = True]) – Delta bins to be used as bins for other data

pyrepseq.distance.next_nearest_neighbors(x, neighborhood, maxdistance=2)[source]

Set of next nearest neighbors of a string x.

Parameters:
  • alphabet (iterable of characters) –

  • neighborhood (neighborhood iterator) –

  • maxdistance (go up to maxdistance nearest neighbor) –

Return type:

set of neighboring sequences

pyrepseq.distance.nndist_hamming(seq, reference, maxdist=4)[source]

Calculate the nearest-neighbor distance by Hamming distance

Parameters:
  • seqs (list of sequences) –

  • seq (sequence instance) –

  • reference (set of referencesequences) –

  • maxdist (distance beyond which to cut off the calculation (needs to be <=4)) –

Returns:

  • distance of nearest neighbor

  • Note (This function does not check if strings are of same length.)

pyrepseq.distance.pcDelta(seqs, seqs2=None, bins=None, normalize=True, pseudocount=0.0, maxseqs=None, **kwargs)[source]

Calculates binned near-coincidence probabilities \(p_C(\Delta)\) among input sequences.

Parameters:
  • seqs ([list of strings | tuple of lists]) – sequences, or (seqs_alpha, seqs_beta)

  • seqs2 ([list of strings | tuple of lists] (optional)) – second list of sequences for cross-comparisons

  • bins (iterable) – bins for the distances Delta. (Default: range(0, 25)) bins=0: Calculate exact coincidence probability

  • normalize (bool) – whether to return pc (normalized) or raw counts

  • pseudocount (float) – for a Bayesian estimation of coincidence frequencies e.g. can use Jeffrey’s prior value of 0.5

  • maxseqs (int) – maximal number of sequences to keep by random downsampling

  • **kwargs (dict) – passed on to pdist or cdist

Returns:

(normalized) histogram of sequence distances

Return type:

np.ndarray

pyrepseq.distance.pcDelta_grouped(df, by, seq_columns, **kwargs)[source]

Near-coincidence probabilities conditioned to within-group comparisons.

Parameters:
  • df (pd.DataFrame) –

  • by (mapping, function, label, or list of labels) – see pd.DataFrame.groupby

  • seq_columns (string) – The data frame column on which we want to apply the pcDelta analysis

  • **kwargs (keyword arguments) – passed on to pcDelta

Returns:

pcs – Returns a DataFrame of pC(delta) for each group

Return type:

pd.DataFrame

pyrepseq.distance.pcDelta_grouped_cross(df, by, seq_columns, condensed=False, **kwargs)[source]

Near-coincidence probabilities conditioned to cross-group comparisons.

Parameters:
  • df (pd.DataFrame) –

  • by (mapping, function, label, or list of labels) – see pd.DataFrame.groupby

  • seq_columns (string) – The data frame column on which we want to apply the pcDelta analysis

  • condensed (bool) – Return a condensed instead of squareform matrix (default: False)

  • **kwargs (keyword arguments) – passed on to pcDelta

Returns:

pcs – Returns a DataFrame of pC(delta) across pairs of groups

Return type:

pd.DataFrame

pyrepseq.distance.pdist(strings, metric=None, dtype=<class 'numpy.uint8'>, **kwargs)[source]
Pairwise distances between collection of strings.

(scipy.spatial.distance.pdist equivalent for strings)

Parameters:
  • strings (iterable of strings) – An m-length iterable.

  • metric (function, optional) – The distance metric to use. Default: Levenshtein distance.

  • dtype (np.dtype) – data type of the distance matrix, default: np.uint8 Note: make sure to change the dtype, if the metric does not return integers

Returns:

Y – Returns a condensed distance matrix Y. For each \(i\) and \(j\) (where \(i<j<m\)), where m is the number of original observations. The metric dist(u=X[i], v=X[j]) is computed and stored in entry m * i + j - ((i + 2) * (i + 1)) // 2.

Return type:

ndarray

Nearest Neighbor

pyrepseq.nn.hash_based(seqs, max_edits=1, max_returns=None, n_cpu=1, custom_distance=None, max_custom_distance=inf, output_type='triplets')[source]

List all neighboring CDR3B sequences efficiently for small edit distances. The idea is to list all possible sequences within a given distance and lookup the dictionary if it exists. This implementation is faster than kdtree implementation for max_edits == 1

Parameters:
  • strings (iterable of strings) – list of CDR3B sequences

  • max_edits (int) – maximum edit distance defining the neighbors

  • max_returns (int or None) – maximum neighbor size

  • n_cpu (int) – number of CPU cores running in parallel

  • custom_distance (Function(str1, str2) or "hamming") – custom distance function to use, must statisfy 4 properties of distance (https://en.wikipedia.org/wiki/Distance#Mathematical_formalization)

  • max_custom_distance (float) – maximum distance to include in the result, ignored if custom distance is not supplied

  • output_type (string) – format of returns, can be “triplets”, “coo_matrix”, or “ndarray”

Returns:

neighbors – neigbors along with their edit distances according to the given output_type if “triplets” returns are [(x_index, y_index, edit_distance)] if “coo_matrix” returns are scipy’s sparse matrix where C[i,j] = distance(X_i, X_j) or 0 if not neighbor if “ndarray” returns numpy’s 2d array representing dense matrix

Return type:

array of 3D-tuples, sparse matrix, or dense matrix

pyrepseq.nn.kdtree(seqs, max_edits=1, max_returns=None, n_cpu=1, custom_distance=None, max_custom_distance=inf, output_type='triplets', compression=1)[source]

List all neighboring CDR3B sequences efficiently within the given edit distance. With KDTree, the algorithms run with O(N logN) eliminating unnecessary comparisons. With RapidFuzz library, the edit distance comparison is efficiently written in C++. With multiprocessing, the algorithm can take advantage of multiple CPU cores. This implementation is faster than hash-based implementation for max_edits > 1

Parameters:
  • strings (iterable of strings) – list of CDR3B sequences

  • max_edits (int) – maximum edit distance defining the neighbors

  • max_returns (int or None) – maximum neighbor size

  • n_cpu (int) – number of CPU cores running in parallel

  • custom_distance (Function(str1, str2) or "hamming") – custom distance function to use, must statisfy 4 properties of distance (https://en.wikipedia.org/wiki/Distance#Mathematical_formalization)

  • max_custom_distance (float) – maximum distance to include in the result, ignored if custom distance is not supplied

  • output_type (string) – format of returns, can be “triplets”, “coo_matrix”, or “ndarray”

Returns:

neighbors – neigbors along with their edit distances according to the given output_type if “triplets” returns are [(x_index, y_index, edit_distance)] if “coo_matrix” returns are scipy’s sparse matrix where C[i,j] = distance(X_i, X_j) or 0 if not neighbor if “ndarray” returns numpy’s 2d array representing dense matrix

Return type:

array of 3D-tuples, sparse matrix, or dense matrix

pyrepseq.nn.nearest_neighbor(seqs, max_edits=1, max_returns=None, n_cpu=1, custom_distance=None, max_custom_distance=inf, output_type='triplets', seqs2=None)[source]

List all neighboring sequences efficiently within a given distance. The distance can be given in terms of hamming, levenshtein, or custom.

If seqs2 is not provided, every sequence is compared against every other sequence.

Parameters:
  • strings (iterable of strings) – list of CDR3B sequences

  • max_edits (int) – maximum edit distance defining the neighbors

  • max_returns (int or None) – maximum neighbor size

  • n_cpu (int) – ignored

  • custom_distance (Function(str1, str2) or "hamming") – custom distance function to use, must statisfy 4 properties of distance (https://en.wikipedia.org/wiki/Distance#Mathematical_formalization)

  • max_custom_distance (float) – maximum distance to include in the result, ignored if custom distance is not supplied

  • output_type (string) – format of returns, can be “triplets”, “coo_matrix”, or “ndarray”

  • seq2 (iterable of strings or None) – another list of CDR3B sequences to compare against

Returns:

neighbors – neigbors along with their edit distances according to the given output_type if “triplets” returns are [(x_index, y_index, edit_distance)] if “coo_matrix” returns are scipy’s sparse matrix where C[i,j] = distance(X_i, X_j) or 0 if not neighbor if “ndarray” returns numpy’s 2d array representing dense matrix

Return type:

array of 3D-tuples, sparse matrix, or dense matrix

pyrepseq.nn.nearest_neighbor_tcrdist(df, chain='beta', max_edits=1, max_tcrdist=20, **kwargs)[source]

List all neighboring TCR sequences efficiently within a given edit and TCRdist radius.

chain: ‘alpha’ or ‘beta’ max_edits : only return neighbors up to <= this edit distance max_tcrdist : only return neighbor up to <= this TCR distance

**kwargs : passed on to nearest_neighbor function

pyrepseq.nn.symspell(seqs, max_edits=1, max_returns=None, n_cpu=1, custom_distance=None, max_custom_distance=inf, output_type='triplets', seqs2=None)[source]

List all neighboring CDR3B sequences efficiently within the given distance. This is an improved version over the hash-based.

If seqs2 is not provided, every sequences are compared against every other sequences resulting in N(seqs)**2 combinations. Otherwise, seqs are compared against seqs2 resulting in N(seqs)*N(seqs2) combinations.

Parameters:
  • strings (iterable of strings) – list of CDR3B sequences

  • max_edits (int) – maximum edit distance defining the neighbors

  • max_returns (int or None) – maximum neighbor size

  • n_cpu (int) – ignored

  • custom_distance (Function(str1, str2) or "hamming") – custom distance function to use, must statisfy 4 properties of distance (https://en.wikipedia.org/wiki/Distance#Mathematical_formalization)

  • max_custom_distance (float) – maximum distance to include in the result, ignored if custom distance is not supplied

  • output_type (string) – format of returns, can be “triplets”, “coo_matrix”, or “ndarray”

  • seq2 (iterable of strings or None) – another list of CDR3B sequences to compare against

Returns:

neighbors – neigbors along with their edit distances according to the given output_type if “triplets” returns are [(x_index, y_index, edit_distance)] if “coo_matrix” returns are scipy’s sparse matrix where C[i,j] = distance(X_i, X_j) or 0 if not neighbor if “ndarray” returns numpy’s 2d array representing dense matrix

Return type:

array of 3D-tuples, sparse matrix, or dense matrix

Plotting submodule

pyrepseq.plotting.align_seqs(seqs)[source]

Align multiple sequences using mafft-linsi with default parameters.

Parameters:

seqs (iterable of strings) –

Returns:

aligned sequences (with gaps)

Return type:

list of strings

pyrepseq.plotting.label_axes(fig_or_axes, labels='ABCDEFGHIJKLMNOPQRSTUVWXYZ', labelstyle='%s', xy=(-0.1, 0.95), xycoords='axes fraction', **kwargs)[source]

Walks through axes and labels each. kwargs are collected and passed to annotate

Parameters:
  • fig (Figure or Axes to work on) –

  • labels (iterable or None) – iterable of strings to use to label the axes. If None, lower case letters are used.

  • loc (Where to put the label units (len=2 tuple of floats)) –

  • xycoords (loc relative to axes, figure, etc.) –

  • kwargs (to be passed to annotate) –

pyrepseq.plotting.labels_to_colors_hls(labels, palette_kws={'l': 0.5, 's': 0.8}, min_count=None)[source]

Map a list of labels to a list of unique colors. Uses seaborn.hls_palette.

Parameters:
  • df (pandas DataFrame with data) –

  • labels (list of labels) –

  • min_count (map all labels seen less than min_count to black) –

  • palette_kws (passed to seaborn.hls_palette) –

pyrepseq.plotting.labels_to_colors_tableau(labels, min_count=None)[source]

Map a list of labels to a list of unique colors. Uses Tableau_10 colors

Parameters:
  • df (pandas DataFrame with data) –

  • labels (list of labels) –

  • min_count (map all labels seen less than min_count to black) –

pyrepseq.plotting.rankfrequency(data, ax=None, normalize_x=True, normalize_y=False, log_x=True, log_y=True, scalex=1.0, scaley=1.0, **kwargs)[source]

Plot rank frequency plots.

Parameters:
  • data (array-like) – count data

  • ax (matplotlib.Axes) – axes on which to plot the data

  • normalize_x (bool, default:True) – whether to normalize counts to relative frequencies

  • normalize_y (bool, default:False) – whether to normalize ranks to cumulative probabilities

Returns:

Objectes representing the plotted data.

Return type:

list of Line2D

pyrepseq.plotting.seqlogos(seqs, ax=None, **kwargs)[source]

Display a sequence logo.

Aligns sequences using align_seqs if they are are not of equal length.

Parameters:
  • seqs (iterable of strings) – sequences to be displayed

  • ax (matplotlib.axes) – if None create new figure

  • **kwargs (dict) – passed on to logomaker.Logo

Return type:

axes, counts_matrix

pyrepseq.plotting.seqlogos_vj(df, cdr3_column, v_column, j_column, axes=None, **kwargs)[source]

Display a sequence logo with V and J gene information.

Parameters:
  • df (pd.DataFrame) – input data

  • cdr3_column (str) – column name for cdr3 sequences

  • v_column (str) – column name for v genes

  • j_column (str) – column name for j genes

  • **kwargs (dict) – passed on to seqlogos

pyrepseq.plotting.similarity_clustermap(df, alpha_column='cdr3a', beta_column='cdr3b', norm=None, bounds=array([0, 1, 2, 3, 4, 5, 6]), linkage_kws={'method': 'average', 'optimal_ordering': True}, cluster_kws={'criterion': 'distance', 't': 6}, cbar_kws={'format': '%d', 'label': 'Sequence Distance', 'orientation': 'horizontal'}, meta_columns=None, meta_to_colors=None, **kws)[source]

Plots a sequence-similarity clustermap.

Parameters:
  • df (pandas DataFrame with data) –

  • alpha_column (column name with alpha and beta amino acid information (set one to None for single chain plotting)) –

  • beta_column (column name with alpha and beta amino acid information (set one to None for single chain plotting)) –

  • norm (matplotlib.colors.Normalize subclass for turning distances into colors) –

  • bounds (bounds used for colormap matplotlib.colors.BoundaryNorm (only used if norm = None)) –

  • linkage_kws (keyword arguments for linkage algorithm) –

  • cluster_kws (keyword arguments for clustering algorithm) –

  • cbar_kws (keyword arguments for colorbar) –

  • meta_columns (list-like) – metadata to plot alongside the cluster assignment

  • meta_to_colors (list-like) – list of functions mapping metadata labels to colors first element of list is for clusters

  • kws (keyword arguments passed on to the clustermap.) –