Python Module Documentation

GOOSE functions

goose.create.seq_by_classes(length: int, aromatic: float | None = None, aliphatic: float | None = None, polar: float | None = None, positive: float | None = None, negative: float | None = None, glycine: float | None = None, proline: float | None = None, cysteine: float | None = None, histidine: float | None = None, num_attempts=100, strict_disorder=False, disorder_cutoff=0.5, cutoff=None, metapredict_version=3, max_consecutive_ordered=3, max_total_ordered=0.05, remaining_probabilities=None, max_class_fractions=None)

Generate a disordered sequence with specified amino acid class fractions.

This function creates intrinsically disordered sequences where you can specify the fraction of different amino acid classes (aromatic, aliphatic, polar, charged, etc.) rather than individual amino acids. This provides a higher-level approach to sequence composition control.

Parameters:
  • length (int) – Length of the desired disordered sequence.

  • aromatic (float, optional) – Fraction of aromatic amino acids (F, W, Y) in the sequence (between 0 and 1). Default is None.

  • aliphatic (float, optional) – Fraction of aliphatic amino acids (A, I, L, V) in the sequence (between 0 and 1). Default is None.

  • polar (float, optional) – Fraction of polar amino acids (N, Q, S, T) in the sequence (between 0 and 1). Default is None.

  • positive (float, optional) – Fraction of positively charged amino acids (K, R) in the sequence (between 0 and 1). Default is None.

  • negative (float, optional) – Fraction of negatively charged amino acids (D, E) in the sequence (between 0 and 1). Default is None.

  • glycine (float, optional) – Fraction of glycine (G) in the sequence (between 0 and 1). Default is None.

  • proline (float, optional) – Fraction of proline (P) in the sequence (between 0 and 1). Default is None.

  • cysteine (float, optional) – Fraction of cysteine (C) in the sequence (between 0 and 1). Default is None.

  • histidine (float, optional) – Fraction of histidine (H) in the sequence (between 0 and 1). Default is None.

  • num_attempts (int, optional) – Number of attempts to generate the sequence. Default is 100.

  • strict_disorder (bool, optional) – Whether to use strict disorder checking. If True, all residues must be above the disorder threshold. Default is False.

  • disorder_cutoff (float, optional) – Disorder threshold for sequence validation. Default is 0.5.

  • metapredict_version (int, optional) – Version of MetaPredict to use for disorder prediction. Default is 3.

  • max_consecutive_ordered (int, optional) – Maximum number of consecutive ordered residues allowed. Default is 3.

  • max_total_ordered (float, optional) – Maximum fraction of ordered residues allowed. Default is 0.05.

  • remaining_probabilities (dict or str, optional) – Custom amino acid probabilities for sequence generation. For a dict, keys should be single-letter amino acid codes and values should be probabilities. String options are the organisms defined in idr_probabilities.py: ‘mouse’, ‘fly’, ‘neurospora’, ‘yeast’, ‘arabidopsis’, ‘e_coli’, ‘worm’, ‘zebrafish’, ‘frog’, ‘dictyostelium’, ‘human’, ‘unbiased’, ‘all’.

  • cutoff (float, optional) – Legacy parameter name for disorder cutoff. If provided, it will override the default disorder_cutoff value.

  • max_class_fractions (dict, optional) – Dictionary to override the maximum allowed fraction of any amino acid class. Keys should be class names (‘aromatic’, ‘aliphatic’, ‘polar’, ‘positive’, ‘negative’, ‘glycine’, ‘proline’, ‘cysteine’, ‘histidine’), values should be floats between 0 and 1. If not specified, default GOOSE thresholds are used.

Returns:

Generated amino acid sequence as a string.

Return type:

str

Raises:
  • GooseInputError – If invalid parameters are provided.

  • GooseFail – If sequence generation fails after all attempts.

Examples

>>> from goose.create import seq_by_classes
>>>
>>> # Generate sequence with 20% aromatic and 10% positive residues
>>> seq = seq_by_classes(100, aromatic=0.2, positive=0.1)
>>>
>>> # Generate sequence with multiple class constraints
>>> seq = seq_by_classes(75, aromatic=0.15, polar=0.25, glycine=0.1)
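
To check a generated sequence against the requested class fractions, a small helper along the following lines may be useful. This is a sketch, not part of GOOSE: the name class_fractions and the "other" bucket are assumptions, and the class membership is copied from the parameter descriptions above (which do not assign methionine to any class).

```python
from collections import Counter

# Class membership as listed in the parameter descriptions above.
# Methionine (M) is not assigned to a class there, so it falls into "other".
AA_CLASSES = {
    "aromatic": "FWY",
    "aliphatic": "AILV",
    "polar": "NQST",
    "positive": "KR",
    "negative": "DE",
    "glycine": "G",
    "proline": "P",
    "cysteine": "C",
    "histidine": "H",
}


def class_fractions(seq):
    """Return the fraction of residues in each class (hypothetical helper)."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    counts = Counter(seq.upper())
    classified = set("".join(AA_CLASSES.values()))
    fractions = {
        name: sum(counts[aa] for aa in members) / len(seq)
        for name, members in AA_CLASSES.items()
    }
    # Anything not covered by a class (for example M) is counted separately.
    fractions["other"] = sum(
        n for aa, n in counts.items() if aa not in classified
    ) / len(seq)
    return fractions
```

Calling class_fractions(seq) on the output of seq_by_classes gives a dictionary that can be compared with the requested values.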

goose.create.seq_by_fractions(length, **kwargs)

Generate a disordered sequence with specified amino acid fractions.

This function creates intrinsically disordered sequences where you can specify the exact fraction of each amino acid type. This provides fine-grained control over sequence composition.

Parameters:
  • length (int) – Length of the desired disordered sequence.

  • A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y (float, optional) – Fraction of the sequence that should be made up of each specific amino acid (e.g., A=0.2, Y=0.05). Values should be between 0 and 1. The sum of all specified fractions should not exceed 1.

  • max_aa_fractions (dict, optional) – Dictionary to override the maximum allowed fraction of any amino acid. Keys should be single-letter amino acid codes, values should be floats between 0 and 1. If not specified, default GOOSE thresholds are used.

  • disorder_cutoff (float, optional) – Disorder threshold for sequence validation. Default is 0.6.

  • attempts (int, optional) – Number of attempts to generate the sequence. Default is 100.

  • strict_disorder (bool, optional) – Whether to use strict disorder checking. If True, all residues must be above the disorder threshold. Default is False.

  • remaining_probabilities (dict or str, optional) – Custom amino acid probabilities for sequence generation. For a dict, keys should be single-letter amino acid codes and values should be probabilities. String options are the organisms defined in idr_probabilities.py: ‘mouse’, ‘fly’, ‘neurospora’, ‘yeast’, ‘arabidopsis’, ‘e_coli’, ‘worm’, ‘zebrafish’, ‘frog’, ‘dictyostelium’, ‘human’, ‘unbiased’, ‘all’.

  • return_all_sequences (bool, optional) – Whether to return all generated sequences. Default is False.

  • metapredict_version (int, optional) – Version of MetaPredict to use for disorder prediction. Default is 3.

  • max_consecutive_ordered (int, optional) – Maximum number of consecutive ordered residues allowed.

  • max_total_ordered (float, optional) – Maximum fraction of ordered residues allowed.

  • batch_size (int, optional) – Number of sequences to generate in each batch.

Returns:

Generated amino acid sequence as a string, or list of sequences if return_all_sequences is True.

Return type:

str or list

Raises:
  • GooseInputError – If invalid parameters are provided.

  • GooseFail – If sequence generation fails after all attempts.

Examples

>>> from goose.create import seq_by_fractions
>>>
>>> # Generate sequence with 30% alanine and 10% glycine
>>> seq = seq_by_fractions(100, A=0.3, G=0.1)
>>>
>>> # Generate sequence with custom max fractions
>>> seq = seq_by_fractions(50, A=0.4, max_aa_fractions={'A': 0.5})
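
The constraints on the per-residue fractions (each between 0 and 1, sum not exceeding 1) can be checked before calling the generator. The sketch below is illustrative only: fractions_to_counts is a hypothetical helper, and rounding counts down is an assumption that may differ from how GOOSE converts fractions to residue counts.

```python
import math

VALID_AAS = set("ACDEFGHIKLMNPQRSTVWY")


def fractions_to_counts(length, **fractions):
    """Validate requested fractions and convert them to residue counts.

    Hypothetical helper. Returns (counts, unspecified), where unspecified is
    the number of positions not covered by any requested fraction.
    """
    unknown = set(fractions) - VALID_AAS
    if unknown:
        raise ValueError(f"unknown amino acid codes: {sorted(unknown)}")
    for aa, frac in fractions.items():
        if not 0 <= frac <= 1:
            raise ValueError(f"fraction for {aa} must be between 0 and 1")
    if sum(fractions.values()) > 1 + 1e-9:
        raise ValueError("specified fractions sum to more than 1")
    # Assumption: counts are rounded down; the small epsilon absorbs
    # floating-point noise such as 0.3 * 100 == 30.000000000000004.
    counts = {aa: math.floor(frac * length + 1e-9) for aa, frac in fractions.items()}
    unspecified = length - sum(counts.values())
    return counts, unspecified
```

For example, fractions_to_counts(100, A=0.3, G=0.1) reports 30 alanines, 10 glycines, and 60 positions left to be filled from the remaining probabilities.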

goose.create.seq_by_re(length, objective_re, allowed_error=0.5, attempts=100, disorder_cutoff=0.5, strict_disorder=False, reduce_pos_charged=False, exclude_aas=None, metapredict_version=3, max_consecutive_ordered=3, max_total_ordered=0.05, cutoff=None)

Generate a disordered sequence with a specified end-to-end distance (Re).

This function creates intrinsically disordered sequences with a target end-to-end distance (Re) in Angstroms. The end-to-end distance is the average distance between the N and C termini of the sequence.

Parameters:
  • length (int) – Length of the sequence to generate.

  • objective_re (float) – Target end-to-end distance in Angstroms.

  • allowed_error (float, optional) – Allowed error between the target and actual Re value. Default is 0.5.

  • attempts (int, optional) – Number of attempts to generate the sequence. Default is 100.

  • disorder_cutoff (float, optional) – Disorder threshold for sequence validation. Default is 0.5.

  • strict_disorder (bool, optional) – Whether to use strict disorder checking. If True, all residues must be above the disorder threshold. Default is False.

  • reduce_pos_charged (bool, optional) – Whether to reduce positively charged amino acids in the sequence. Default is False. In vivo data suggests positively charged residues may not drive sequence expansion as much as predicted by the model.

  • exclude_aas (list, optional) – List of amino acids to exclude from the sequence. Default is None.

  • metapredict_version (int, optional) – Version of MetaPredict to use for disorder prediction. Default is 3.

  • max_consecutive_ordered (int, optional) – Maximum number of consecutive ordered residues allowed. Default is 3.

  • max_total_ordered (float, optional) – Maximum fraction of ordered residues allowed. Default is 0.05.

  • cutoff (float, optional) – Legacy parameter name for disorder cutoff. If provided, it will override the default disorder_cutoff value.

Returns:

Generated amino acid sequence as a string.

Return type:

str

Raises:
  • GooseInputError – If the objective_re is outside the possible range for the given length, or if other invalid parameters are provided.

  • GooseFail – If sequence generation fails after all attempts.

Examples

>>> from goose.create import seq_by_re
>>>
>>> # Generate a 100-residue sequence with Re = 50 Å
>>> seq = seq_by_re(100, 50.0)
>>>
>>> # Generate with custom error tolerance
>>> seq = seq_by_re(75, 40.0, allowed_error=2.0)
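
Because generation can fail when allowed_error is tight, one possible pattern is to retry with progressively looser tolerances. The wrapper below is a sketch under assumptions: generate_with_relaxed_error is a hypothetical name, the tolerance schedule is arbitrary, and the exception class to catch is passed in by the caller because this page does not state where GooseFail is imported from.

```python
def generate_with_relaxed_error(generate, length, target,
                                errors=(0.5, 1.0, 2.0), failure=Exception):
    """Call generate(length, target, allowed_error=e) with loosening tolerances.

    Hypothetical helper. `generate` is any callable with that calling
    convention, such as seq_by_re or seq_by_rg. Returns (sequence, error_used).
    """
    last_exc = None
    for err in errors:
        try:
            return generate(length, target, allowed_error=err), err
        except failure as exc:
            # Remember the failure and try the next, looser tolerance.
            last_exc = exc
    raise RuntimeError(f"no sequence found for tolerances {errors}") from last_exc
```

With GOOSE installed, a call might look like generate_with_relaxed_error(seq_by_re, 100, 50.0, failure=GooseFail), with GooseFail imported from wherever the installed version defines it.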

goose.create.seq_by_rg(length, objective_rg, allowed_error=0.5, attempts=100, disorder_cutoff=0.5, strict_disorder=False, reduce_pos_charged=False, exclude_aas=None, metapredict_version=3, max_consecutive_ordered=3, max_total_ordered=0.05, cutoff=None)

Generate a disordered sequence with a specified radius of gyration (Rg).

This function creates intrinsically disordered sequences with a target radius of gyration (Rg) in Angstroms. The radius of gyration is a measure of the compactness of the sequence’s ensemble of conformations.

Parameters:
  • length (int) – Length of the sequence to generate.

  • objective_rg (float) – Target radius of gyration in Angstroms.

  • allowed_error (float, optional) – Allowed error between the target and actual Rg value. Default is 0.5.

  • attempts (int, optional) – Number of attempts to generate the sequence. Default is 100.

  • disorder_cutoff (float, optional) – Disorder threshold for sequence validation. Default is 0.5.

  • strict_disorder (bool, optional) – Whether to use strict disorder checking. If True, all residues must be above the disorder threshold. Default is False.

  • reduce_pos_charged (bool, optional) – Whether to reduce positively charged amino acids in the sequence. Default is False. In vivo data suggests positively charged residues may not drive sequence expansion as much as predicted by the model.

  • exclude_aas (list, optional) – List of amino acids to exclude from the sequence. Default is None.

  • metapredict_version (int, optional) – Version of MetaPredict to use for disorder prediction. Default is 3.

  • max_consecutive_ordered (int, optional) – Maximum number of consecutive ordered residues allowed. Default is 3.

  • max_total_ordered (float, optional) – Maximum fraction of ordered residues allowed. Default is 0.05.

  • cutoff (float, optional) – Legacy parameter name for disorder cutoff. If provided, it will override the default disorder_cutoff value.

Returns:

Generated amino acid sequence as a string.

Return type:

str

Raises:
  • GooseInputError – If the objective_rg is outside the possible range for the given length, or if other invalid parameters are provided.

  • GooseFail – If sequence generation fails after all attempts.

Examples

>>> from goose.create import seq_by_rg
>>>
>>> # Generate a 100-residue sequence with Rg = 25 Å
>>> seq = seq_by_rg(100, 25.0)
>>>
>>> # Generate with reduced positive charges
>>> seq = seq_by_rg(75, 20.0, reduce_pos_charged=True)
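
The disorder-related parameters (disorder_cutoff, strict_disorder, max_consecutive_ordered, max_total_ordered) can be read as the acceptance test sketched below. This is an interpretation of the parameter descriptions, not GOOSE's actual implementation: passes_disorder_check is a hypothetical function, it takes per-residue disorder scores as a plain list, and treating a score exactly equal to the cutoff as disordered is an assumption.

```python
def passes_disorder_check(scores, disorder_cutoff=0.5, strict_disorder=False,
                          max_consecutive_ordered=3, max_total_ordered=0.05):
    """Decide whether per-residue disorder scores satisfy the thresholds.

    Hypothetical sketch. A residue counts as ordered when its score is
    below disorder_cutoff (assumption: equality counts as disordered).
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    ordered = [s < disorder_cutoff for s in scores]
    if strict_disorder:
        # Strict mode: no residue may fall below the cutoff.
        return not any(ordered)
    # Non-strict mode: limit the longest ordered run and the ordered fraction.
    longest = run = 0
    for is_ordered in ordered:
        run = run + 1 if is_ordered else 0
        longest = max(longest, run)
    return (longest <= max_consecutive_ordered
            and sum(ordered) / len(ordered) <= max_total_ordered)
```

Under this reading, a 100-residue sequence with one residue below the cutoff passes the default check but fails when strict_disorder=True.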

goose.create.seq_fractions(length, **kwargs)

Generate a disordered sequence with specified amino acid fractions.

This function is a backwards compatibility wrapper around seq_by_fractions. Please use seq_by_fractions for new code.

Parameters:
  • length (int) – Length of the desired disordered sequence.

  • **kwargs (dict) – All keyword arguments are passed directly to seq_by_fractions. See seq_by_fractions documentation for full parameter details.

Returns:

Generated amino acid sequence(s) - see seq_by_fractions for details.

Return type:

str or list

See also

seq_by_fractions

The main function for generating sequences by fractions.

Examples

>>> from goose.create import seq_fractions
>>>
>>> # Generate sequence with 30% alanine and 10% glycine
>>> seq = seq_fractions(100, A=0.3, G=0.1)

goose.create.sequence(length, **kwargs)

Generate a disordered sequence with specified physicochemical properties.

This is the main function for creating intrinsically disordered sequences with specific characteristics. You can specify multiple properties simultaneously to create sequences with desired combinations of NCPR, FCR, kappa, and hydropathy values.

Parameters:
  • length (int) – Length of the desired disordered sequence. Must be between the minimum and maximum allowed lengths as defined in the parameters module.

  • FCR (float, optional) – Fraction of charged residues (between 0 and 1). This includes both positively and negatively charged residues.

  • NCPR (float, optional) – Net charge per residue (between -1 and 1). Positive values indicate net positive charge, negative values indicate net negative charge.

  • hydropathy (float, optional) – Mean hydropathy of the sequence (between 0 and 6.1). Higher values indicate more hydrophobic sequences.

  • kappa (float, optional) – Kappa value describing charge patterning (between 0 and 1). Values closer to 0 indicate well-mixed positive and negative charges; values closer to 1 indicate oppositely charged residues segregated into blocks.

  • attempts (int, optional) – Number of attempts to generate the sequence. Default is 20. Higher values increase success probability but take longer.

  • disorder_cutoff (float, optional) – Disorder threshold for sequence validation. Sequences must have disorder scores above this threshold. Default from parameters module.

  • exclude (list, optional) – List of amino acid residues to exclude from sequence generation. Cannot exclude charged residues if FCR is specified.

  • use_weighted_probabilities (bool, optional) – Whether to use weighted amino acid probabilities. This can increase generation success but may reduce sequence diversity. Default is False.

  • strict_disorder (bool, optional) – Whether to use strict disorder checking. If True, all residues must be above the disorder threshold. Default is False.

  • return_all_sequences (bool, optional) – Whether to return all generated sequences. If False, returns only the first successful sequence. Default is False.

  • custom_probabilities (dict or str, optional) – Custom amino acid probabilities for sequence generation. For a dict, keys should be single-letter amino acid codes and values should be probabilities. String options are the organisms defined in idr_probabilities.py: ‘mouse’, ‘fly’, ‘neurospora’, ‘yeast’, ‘arabidopsis’, ‘e_coli’, ‘worm’, ‘zebrafish’, ‘frog’, ‘dictyostelium’, ‘human’, ‘unbiased’, ‘all’.

  • metapredict_version (int, optional) – Version of MetaPredict to use for disorder prediction. Default is 3.

  • max_consecutive_ordered (int, optional) – Maximum number of consecutive ordered residues allowed. Default from parameters module.

  • max_total_ordered (float, optional) – Maximum fraction of ordered residues allowed. Default from parameters module.

  • batch_size (int, optional) – Number of sequences to generate in each batch. Default from parameters module.

  • hydropathy_tolerance (float, optional) – Tolerance for hydropathy matching. Default from parameters module.

  • kappa_tolerance (float, optional) – Tolerance for kappa matching. Default from parameters module.

Returns:

Generated amino acid sequence as a string if return_all_sequences is False, or list of sequences if return_all_sequences is True.

Return type:

str or list

Raises:
  • GooseInputError – If invalid parameters are provided.

  • GooseFail – If sequence generation fails after all attempts.

Examples

>>> from goose.create import sequence
>>>
>>> # Generate a 100-residue sequence with specific properties
>>> seq = sequence(100, FCR=0.3, NCPR=0.1, hydropathy=3.0)
>>>
>>> # Generate sequence excluding certain residues
>>> seq = sequence(50, exclude=['C', 'M'])
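
FCR and NCPR follow directly from residue counts, so the charge properties of a returned sequence can be verified without GOOSE. The helper below is a sketch: fcr_and_ncpr is a hypothetical name, and it counts K and R as positive and D and E as negative, matching the class definitions used elsewhere on this page.

```python
def fcr_and_ncpr(seq):
    """Return (FCR, NCPR) for an amino acid sequence (hypothetical helper)."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    seq = seq.upper()
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    n = len(seq)
    fcr = (positive + negative) / n   # fraction of charged residues
    ncpr = (positive - negative) / n  # net charge per residue
    return fcr, ncpr
```

These definitions also imply that the magnitude of NCPR can never exceed FCR, so a request such as FCR=0.1 with NCPR=0.3 cannot be satisfied.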

goose.create.variant(sequence, variant_type, **kwargs)

Generate variants of an input sequence using various transformation methods.

This function provides a unified interface for creating sequence variants using different algorithms. It supports shuffling, repositioning, and property-based modifications of amino acid sequences while maintaining disorder characteristics.

Parameters:
  • sequence (str) – The amino acid sequence to generate variants from. Must be a non-empty string containing valid amino acid codes.

  • variant_type (str) – The type of variant to generate. Available options:

    Shuffling methods:
      - ‘shuffle_specific_regions’: Shuffle only specified regions
      - ‘shuffle_except_specific_regions’: Shuffle all except specified regions
      - ‘shuffle_specific_residues’: Shuffle only specific residue types
      - ‘shuffle_except_specific_residues’: Shuffle all except specific residue types
      - ‘weighted_shuffle_specific_residues’: Weighted shuffle of specific residues
      - ‘targeted_reposition_specific_residues’: Reposition specific residues

    Asymmetry and property methods:
      - ‘change_residue_asymmetry’: Change residue asymmetry patterns
      - ‘constant_properties’: Generate variant with constant properties
      - ‘constant_residues_and_properties’: Keep specified residues and properties constant
      - ‘constant_properties_and_class’: Generate variant with constant properties and class
      - ‘constant_properties_and_class_by_order’: Generate variant with constant properties and class by order

    Property modification methods:
      - ‘change_hydropathy_constant_class’: Change hydropathy while keeping class constant
      - ‘change_fcr_minimize_class_changes’: Change FCR while minimizing class changes
      - ‘change_ncpr_constant_class’: Change NCPR while keeping class constant
      - ‘change_kappa’: Change kappa value
      - ‘change_properties_minimize_differences’: Change properties while minimizing differences
      - ‘change_any_properties’: Change any combination of properties
      - ‘change_dimensions’: Change sequence dimensions (Rg/Re)

  • **kwargs (dict) – Additional parameters specific to the variant type. Common parameters include:

    General parameters:
      - num_attempts (int): Number of attempts to generate variant (default: 100)
      - strict_disorder (bool): Whether to use strict disorder checking (default: False)
      - disorder_cutoff (float): Disorder cutoff threshold (default: from parameters)
      - metapredict_version (int): MetaPredict version to use (default: 3)
      - hydropathy_tolerance (float): Hydropathy tolerance (default: from parameters)
      - kappa_tolerance (float): Kappa tolerance (default: from parameters)

    Variant-specific parameters:
      - shuffle_regions (list): Regions to shuffle (tuple pairs of start/end positions)
      - excluded_regions (list): Regions to exclude from shuffling
      - target_residues (list): Specific residues to target
      - excluded_residues (list): Specific residues to exclude
      - shuffle_weight (float): Weight for shuffling operations
      - num_changes (int): Number of changes to make
      - increase_or_decrease (str): Direction of change (‘increase’ or ‘decrease’)
      - exclude_residues (list): Residues to exclude from modifications
      - constant_residues (list): Residues to keep constant
      - target_hydropathy (float): Target hydropathy value
      - target_FCR (float): Target FCR value
      - target_NCPR (float): Target NCPR value
      - target_kappa (float): Target kappa value
      - rg_or_re (str): Whether to optimize ‘rg’ or ‘re’
      - num_dim_attempts (int): Number of dimensional optimization attempts
      - allowed_error (float): Allowed error for dimensional constraints
      - reduce_pos_charged (bool): Whether to reduce positive charges
      - exclude_aas (list): Amino acids to exclude from generation

Returns:

Generated variant sequence as a string.

Return type:

str

Raises:
  • GooseInputError – If invalid parameters are provided, including:
      - Empty or invalid sequence
      - Invalid variant_type
      - Missing required parameters for the specified variant type
      - Invalid parameter values

  • GooseFail – If variant generation fails after all attempts.

Examples

>>> from goose.create import variant
>>>
>>> # Shuffle specific regions of a sequence
>>> original = "MSEDKQRTYHLNVAIGPKWF"
>>> shuffled = variant(original, 'shuffle_specific_regions',
...                    shuffle_regions=[(0, 5), (10, 15)])
>>>
>>> # Change hydropathy while keeping amino acid classes constant
>>> more_hydrophobic = variant(original, 'change_hydropathy_constant_class',
...                            target_hydropathy=3.5)
>>>
>>> # Generate variant with constant properties but different sequence
>>> same_properties = variant(original, 'constant_properties', num_attempts=50)

Notes

The function uses the VariantGenerator class from the backend to perform the actual sequence modifications. Each variant type has specific parameter requirements - consult the documentation for detailed parameter descriptions.

Region specifications use 0-based indexing where (start, end) includes positions from start to end-1, following Python slice conventions.
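
The region convention can be illustrated with plain Python slices. The function below is only a sketch of the indexing semantics described in the note above; shuffle_regions_sketch is a hypothetical name, it performs no disorder checking, and it is not how the VariantGenerator backend is implemented.

```python
import random


def shuffle_regions_sketch(seq, regions, seed=None):
    """Shuffle residues inside each (start, end) region, leaving the rest fixed.

    Hypothetical illustration of the 0-based, end-exclusive region
    convention: region (start, end) covers seq[start:end].
    """
    rng = random.Random(seed)
    residues = list(seq)
    for start, end in regions:
        segment = residues[start:end]   # positions start .. end-1
        rng.shuffle(segment)
        residues[start:end] = segment
    return "".join(residues)
```

For the 20-residue example sequence, regions (0, 5) and (10, 15) cover positions 0 to 4 and 10 to 14; positions 5 to 9 and 15 to 19 are returned unchanged.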