Using the SequenceOptimizer
GOOSE’s SequenceOptimizer is a flexible tool for designing protein sequences that match user-defined target values. It uses stochastic optimization with adaptive scaling to explore sequence space and minimize the difference between calculated and target property values. You can optimize towards any number of properties simultaneously, each with its own weight, tolerance, and constraint type.
IMPORTANT NOTE PLEASE READ: GOOSE is an IDR design tool. HOWEVER, when using SequenceOptimizer, you can design anything you want. Thus, sequences are not guaranteed to be predicted to be disordered unless you specify the FractionDisorder property.
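The stochastic search described above can be pictured with a toy example. The sketch below is not GOOSE's algorithm: it is a minimal hill climb in plain Python that accepts a random point mutation only when the error does not increase. The names `fcr` and `toy_optimize` are hypothetical and exist only for this illustration.

```python
import random

CHARGED = set("DEKR")
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def fcr(sequence: str) -> float:
    """Fraction of charged residues (D, E, K, R)."""
    return sum(aa in CHARGED for aa in sequence) / len(sequence)


def toy_optimize(length: int, target_fcr: float,
                 iterations: int = 500, seed: int = 0) -> str:
    """Toy stochastic search: keep a point mutation only if the error does not grow."""
    rng = random.Random(seed)
    best = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    best_error = abs(fcr(best) - target_fcr)
    for _ in range(iterations):
        position = rng.randrange(length)
        candidate = best[:position] + rng.choice(AMINO_ACIDS) + best[position + 1:]
        error = abs(fcr(candidate) - target_fcr)
        if error <= best_error:
            best, best_error = candidate, error
    return best


designed = toy_optimize(length=50, target_fcr=0.3)
print(designed, round(fcr(designed), 2))
```

The real optimizer layers adaptive scaling, multiple candidates per iteration, shuffling, and convergence checks on top of this basic idea.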
Key Features of the New SequenceOptimizer
The SequenceOptimizer has been completely rewritten to provide:
Adaptive Property Scaling: Automatically adjusts optimization focus based on property convergence patterns and error magnitudes. This makes it easier to optimize towards properties with highly variable scales or difficult optimization landscapes.
Diverse Initial Sequences: If you are generating a completely new sequence, you can specify the number of starting sequences, which improves the chance of starting from a good candidate.
Flexible Constraint Types: Support for exact matching, minimum thresholds, and maximum constraints for each specified property
Per-Property Tolerances: Set individual error tolerances for each property, allowing fine-grained control
Advanced Convergence Detection: Multiple convergence criteria including error tolerance, trend analysis, and stagnation detection
Performance Optimization: Comprehensive caching system for faster optimization
Arbitrary Number of Properties: Optimize towards multiple instances of the same property. This was not previously supported.
Easier Property Value Setting: For many of the properties, you can now set the target value using a sequence of interest rather than a numeric value.
Match to arbitrary interaction matrices: You can now optimize sequences to match arbitrary interaction matrices.
Linear Profiles for Values: You can now set can_be_linear_profile=True for some properties and provide a sequence or list of target values. The optimizer will then attempt to match the profile along the values.
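To make the idea of a linear profile concrete, the sketch below computes a per-window NCPR profile along a sequence and compares two profiles. It is an illustration only: the window size, the helper names (`linear_profile`, `profile_error`), and the mean-absolute-difference metric are assumptions, not GOOSE's implementation.

```python
def ncpr(segment: str) -> float:
    """Net charge per residue: (K + R - D - E) / length."""
    positive = sum(aa in "KR" for aa in segment)
    negative = sum(aa in "DE" for aa in segment)
    return (positive - negative) / len(segment)


def linear_profile(sequence: str, window: int = 5) -> list:
    """NCPR of each full sliding window along the sequence."""
    return [ncpr(sequence[i:i + window])
            for i in range(len(sequence) - window + 1)]


def profile_error(candidate: str, target: str, window: int = 5) -> float:
    """Mean absolute difference between the profiles of two equal-length sequences."""
    a = linear_profile(candidate, window)
    b = linear_profile(target, window)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


target = "KKKKKGGGGGEEEEE"
print(linear_profile(target))                     # starts at 1.0, ends at -1.0
print(profile_error(target, target))              # 0.0
print(profile_error("EEEEEGGGGGKKKKK", target))   # greater than 0: charges are mirrored
```

Two sequences can share the same overall NCPR and still have very different profiles, which is the case a profile target is meant to distinguish.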
Critical Differences between SequenceOptimizer and Create Functionality
The SequenceOptimizer represents a fundamentally different approach to sequence generation compared to the create module:
Flexibility vs. Speed: SequenceOptimizer prioritizes extreme flexibility and handles complex multi-property optimization scenarios that would be difficult with create. However, for simple, well-defined property targets, create functions are typically faster.
Approximate vs. Exact Solutions: SequenceOptimizer returns the best sequence it finds within the optimization constraints and may not achieve exact target values. In contrast, create functions either generate sequences that exactly meet specifications or fail completely.
Extensibility: Adding new properties to SequenceOptimizer requires only implementing a simple property class. Adding new functionality to create requires significant backend overhead.
Multi-Property Optimization: SequenceOptimizer excels at balancing multiple competing properties simultaneously, while create functions typically handle individual properties or simple property combinations.
Quick Start Example
Design a sequence of length 50 with a target hydrophobicity:
import goose
from sparrow import Protein
# Initialize optimizer with basic parameters
optimizer = goose.SequenceOptimizer(
target_length=50,
max_iterations=1000,
verbose=True
)
# Add hydrophobicity property with a tolerance
optimizer.add_property(
goose.Hydrophobicity,
target_value=0.5,
weight=1.0,
tolerance=0.05 # Allow 5% deviation
)
# Run optimization
optimized_sequence = optimizer.run()
# Analyze results
final_protein = Protein(optimized_sequence)
print(f"Optimized Sequence: {optimized_sequence}")
print(f"Final Hydrophobicity: {final_protein.hydrophobicity:.3f}")
print(f"Target Hydrophobicity: 0.5 ± 0.05")
Explanation:
- SequenceOptimizer(target_length=50, max_iterations=1000, verbose=True): Creates optimizer with sequence length, iteration limit, and progress reporting.
- add_property(..., tolerance=0.05): Adds hydrophobicity optimization with 5% error tolerance.
- run(): Executes optimization with adaptive scaling and convergence detection.
Advanced Quick Start with Multiple Properties:
import goose
optimizer = goose.SequenceOptimizer(target_length=100, verbose=True)
# Exact hydrophobicity target
optimizer.add_property(
goose.Hydrophobicity,
target_value=2.4,
weight=1.0,
)
# Minimum disorder requirement
optimizer.add_property(
goose.FractionDisorder,
target_value=0.8,
weight=2.0, # Higher weight = more important
constraint_type='minimum',
disorder_cutoff=0.5
)
# Maximum FCR constraint
optimizer.add_property(
goose.FCR,
target_value=0.3,
weight=1.5,
constraint_type='maximum'
)
optimized_sequence = optimizer.run()
Property Classes Overview
All property classes support three constraint types and individual tolerances:
exact: Minimize absolute difference from target (default)
minimum: Penalize only when below target value
maximum: Penalize only when above target value
To specify constraint type, use the constraint_type argument when adding a property:
# Exact target (default)
optimizer.add_property(goose.Hydrophobicity, target_value=0.5, constraint_type='exact')
# Minimum requirement
optimizer.add_property(goose.FractionDisorder, target_value=0.8, constraint_type='minimum')
# Maximum constraint
optimizer.add_property(goose.FCR, target_value=0.3, constraint_type='maximum')
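The three constraint types can be summarized as a small penalty function. This is a sketch under stated assumptions, not the library's code: `constraint_error` is a hypothetical helper, and treating deviations inside the tolerance as zero error is an assumption about how tolerances behave.

```python
def constraint_error(value: float, target: float,
                     constraint_type: str = "exact",
                     tolerance: float = 0.0) -> float:
    """Hypothetical penalty for one property under the three constraint types."""
    if constraint_type == "exact":
        raw = abs(value - target)
    elif constraint_type == "minimum":
        raw = max(0.0, target - value)   # penalize only when below target
    elif constraint_type == "maximum":
        raw = max(0.0, value - target)   # penalize only when above target
    else:
        raise ValueError(f"unknown constraint_type: {constraint_type!r}")
    # Assumption: deviations inside the tolerance band count as zero error.
    return 0.0 if raw <= tolerance else raw


print(constraint_error(0.9, 0.8, "minimum"))    # 0.0: already above the minimum
print(constraint_error(0.25, 0.5, "maximum"))   # 0.0: already below the maximum
print(constraint_error(0.5, 0.25, "maximum"))   # 0.25: exceeds the maximum
```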
Basic Properties
| Property Class | Description | Key Arguments |
|---|---|---|
| Hydrophobicity | Average hydrophobicity (0-9.0 scale) | target_value, weight, constraint_type |
| FCR | Fraction of Charged Residues (0-1) | target_value, weight, constraint_type |
| NCPR | Net Charge Per Residue (-1 to 1) | target_value, weight, constraint_type |
| Kappa | Charge patterning parameter (0-1) | target_value, weight, constraint_type |
| SCD | Sequence Charge Decoration | target_value, weight, constraint_type |
| SHD | Sequence Hydropathy Decoration | target_value, weight, constraint_type |
| Complexity | Wootton-Federhen complexity | target_value, weight, constraint_type |
| ComputeIWD | Inverse Weighted Distance | residues (tuple), target_value, weight, constraint_type |
| AminoAcidFractions | Target amino acid composition | target_fractions (dict), weight, constraint_type |
| MatchingResidues | Number of matching residues to target | target_sequence, target_value, weight, constraint_type |
Ensemble Properties
| Property Class | Description | Key Arguments |
|---|---|---|
| RadiusOfGyration | Predicted radius of gyration (Å) | target_value, weight, constraint_type |
| EndToEndDistance | Predicted end-to-end distance (Å) | target_value, weight, constraint_type |
Disorder
| Property Class | Description | Key Arguments |
|---|---|---|
| FractionDisorder | Fraction of disordered residues (0-1) | target_value, weight, constraint_type, disorder_cutoff |
| MatchSequenceDisorder | Match disorder profile of target sequence | target_sequence, weight, constraint_type, exact_match, target_value |
Interaction Properties (Epsilon-based)
| Property Class | Description | Key Arguments |
|---|---|---|
| MeanSelfEpsilon | Self-interaction potential | target_value, weight, preloaded_model, constraint_type, model |
| MeanEpsilonWithTarget | Mean interaction with target sequence | target_value, target_sequence, weight, constraint_type, model, preloaded_model |
| ChemicalFingerprint | Match chemical fingerprint to target | target_sequence, target_value, weight, constraint_type, model, preloaded_model, window_size |
Matrix-based Interaction Properties
| Property Class | Description | Key Arguments |
|---|---|---|
| MatchSelfIntermap | Match self-interaction matrix | sequence, weight, constraint_type, model, preloaded_model, inverse, window_size, allow_matrix_resizing |
| MatchIntermap | Match interaction matrix with target | sequence, target_sequence, weight, constraint_type, model, preloaded_model, window_size, allow_matrix_resizing |
| ModifyAttractiveValues | Modify attractive interactions | sequence, target_sequence, multiplier, weight, constraint_type, model, preloaded_model, window_size |
| ModifyRepulsiveValues | Modify repulsive interactions | interacting_sequence, target_interacting_sequence, multiplier, weight, constraint_type, model, preloaded_model, window_size |
| ModifyMatrixValues | Modify both attractive and repulsive | interacting_sequence, target_interacting_sequence, repulsive_multiplier, attractive_multiplier, weight, constraint_type, model, preloaded_model, window_size |
Folded Domain Surface Properties
| Property Class | Description | Key Arguments |
|---|---|---|
| FDMeanSurfaceEpsilon | Mean surface epsilon for folded domains | target_value, weight, constraint_type, model, preloaded_model, path_to_pdb, probe_radius, surface_thresh, sasa_mode, fd_start, fd_end, preloaded_fd |
| FDSurfaceEpsilon | Surface epsilon interactions | repulsive_target, attractive_target, weight, constraint_type, model, preloaded_model, path_to_pdb, probe_radius, surface_thresh, sasa_mode, fd_start, fd_end, preloaded_fd |
| FDSurfacePatchInteractions | Surface patch interaction analysis | target_value, weight, constraint_type, model, preloaded_model, path_to_pdb, probe_radius, surface_thresh, sasa_mode, fd_start, fd_end, preloaded_fd, patch_residues |
Optimizer Initialization and Basic Parameters
The SequenceOptimizer provides extensive control over the optimization process through initialization parameters. You can see additional parameters to change in the Advanced Optimizer Configuration section below.
Basic Parameters:
optimizer = goose.SequenceOptimizer(
target_length=100, # Required: target sequence length
max_iterations=1000, # Maximum optimization iterations
verbose=True # Enable progress reporting
)
Mutation and Diversity Parameters:
optimizer = goose.SequenceOptimizer(
target_length=100,
# Candidate generation
num_candidates=5, # Candidate sequences per iteration
num_starting_candidates=100, # Number of sequences to start with.
min_mutations=1, # Minimum mutations per candidate
max_mutations=15, # Maximum mutations per candidate
mutation_ratio=10, # Length divisor for mutation calculation
# Shuffling for diversity
enable_shuffling=True, # Enable sequence shuffling
shuffle_frequency=50, # Shuffle every N iterations
global_shuffle_probability=0.4, # Probability of global vs local shuffle
shuffle_window_size=15 # Window size for local shuffling
)
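The comments on the mutation and shuffling parameters suggest how they might interact. The sketch below is one plausible reading, not the library's implementation: both helper functions are hypothetical, and the clamping rule and the choice between a global and a windowed shuffle are assumptions inferred from the parameter descriptions.

```python
import random


def mutations_per_candidate(length: int, mutation_ratio: int = 10,
                            min_mutations: int = 1,
                            max_mutations: int = 15) -> int:
    """Assumed rule: length // mutation_ratio, clamped to the allowed range."""
    return max(min_mutations, min(max_mutations, length // mutation_ratio))


def shuffle_sequence(sequence: str, rng: random.Random,
                     global_shuffle_probability: float = 0.4,
                     shuffle_window_size: int = 15) -> str:
    """Shuffle the whole sequence, or only one window, chosen at random."""
    residues = list(sequence)
    if (rng.random() < global_shuffle_probability
            or len(residues) <= shuffle_window_size):
        rng.shuffle(residues)
        return "".join(residues)
    start = rng.randrange(len(residues) - shuffle_window_size + 1)
    window = residues[start:start + shuffle_window_size]
    rng.shuffle(window)
    residues[start:start + shuffle_window_size] = window
    return "".join(residues)


print(mutations_per_candidate(100))   # 10
print(mutations_per_candidate(300))   # 15 (capped by max_mutations)

original = "ACDEFGHIKLMNPQRSTVWY" * 2
shuffled = shuffle_sequence(original, random.Random(1))
print(sorted(shuffled) == sorted(original))   # True: composition is preserved
```

Because shuffling only reorders residues, it leaves composition-based properties such as FCR unchanged while still altering patterning-dependent ones such as kappa.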
Setting Initial Sequences:
# Start from a specific sequence
initial_seq = "MGSWAEFKQRLAAIKTRLQALGSQAGKKDAE" * 3  # 93 residues; length must match target_length
optimizer.set_initial_sequence(initial_seq)
# The optimizer will automatically calculate normalization factors
# based on the initial sequence for adaptive scaling
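Since the initial sequence must match target_length, it is worth checking the length before building the optimizer. The repeat unit in the example above has 31 residues, so the tripled sequence is 93 residues long and would not fit an optimizer created with target_length=100. A minimal check in plain Python, with the optimizer call left as a comment:

```python
initial_seq = "MGSWAEFKQRLAAIKTRLQALGSQAGKKDAE" * 3
print(len(initial_seq))  # 93

# Derive the length from the sequence so the two cannot drift apart.
target_length = len(initial_seq)
# optimizer = goose.SequenceOptimizer(target_length=target_length)
# optimizer.set_initial_sequence(initial_seq)
```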
Multiple Properties, Weights, and Tolerances
The optimizer excels at balancing multiple competing properties simultaneously. Each property can have individual weights, tolerances, and constraint types:
import goose
from sparrow import Protein
# Create optimizer with advanced parameters
optimizer = goose.SequenceOptimizer(
target_length=100,
max_iterations=2000,
verbose=True
)
# Critical property - must be close to target
optimizer.add_property(
goose.FractionDisorder,
target_value=0.85,
weight=3.0, # High importance
tolerance=0.02, # Very strict tolerance (2%)
constraint_type='minimum' # Must be at least 85% disordered
)
# Important but flexible property
optimizer.add_property(
goose.FCR,
target_value=0.4,
weight=2.0, # Medium-high importance
tolerance=0.05, # 5% tolerance
)
# Secondary property - more flexible
optimizer.add_property(
goose.NCPR,
target_value=-0.1,
weight=1.0, # Lower importance
tolerance=0.1 # 10% tolerance - quite flexible
)
# Compositional constraint
optimizer.add_property(
goose.AminoAcidFractions,
target_fractions={'G': 0.15, 'P': 0.10, 'S': 0.12},
weight=1.5,
tolerance=0.03 # 3% tolerance on each amino acid
)
# Run optimization
optimized_sequence = optimizer.run()
# Analyze results
final_protein = Protein(optimized_sequence)
print(f"Optimized Sequence: {optimized_sequence}")
print(f"Final FCR: {final_protein.FCR:.3f} (target: 0.4 ± 0.05)")
print(f"Final NCPR: {final_protein.NCPR:.3f} (target: -0.1 ± 0.1)")
fracs = final_protein.amino_acid_fractions
print(f"Final fractions: G = {fracs['G']:.3f}, P = {fracs['P']:.3f}, S = {fracs['S']:.3f}")
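To see why weights matter, it can help to think of the objective as a weighted sum of per-property errors. That form is an assumption made for illustration; the actual objective also involves adaptive scaling, and `total_error` is a hypothetical helper.

```python
def total_error(errors: dict, weights: dict) -> float:
    """Assumed objective: weighted sum of per-property errors."""
    return sum(weights[name] * errors[name] for name in errors)


weights = {"FractionDisorder": 3.0, "FCR": 2.0, "NCPR": 1.0}

# The same raw error (0.25) placed on two different properties.
candidate_a = {"FractionDisorder": 0.0, "FCR": 0.25, "NCPR": 0.0}
candidate_b = {"FractionDisorder": 0.25, "FCR": 0.0, "NCPR": 0.0}

print(total_error(candidate_a, weights))  # 0.5
print(total_error(candidate_b, weights))  # 0.75
```

Under this reading, an identical deviation costs more on the higher-weighted property, so the search is pushed to fix that property first.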
Custom Properties
Creating custom properties is straightforward by subclassing CustomProperty. The new system supports all constraint types and tolerances automatically:
import goose
from goose.backend.optimizer_properties import CustomProperty, ConstraintType
import sparrow
class AlanineCount(CustomProperty):
    """Count the number of alanine residues in the sequence."""

    def __init__(self, target_value: float, weight: float = 1.0,
                 constraint_type: ConstraintType = ConstraintType.EXACT):
        super().__init__(target_value, weight, constraint_type)

    def calculate_raw_value(self, protein: 'sparrow.Protein') -> float:
        """Calculate the raw property value (before constraint application)."""
        return float(protein.sequence.count('A'))


class MotifCount(CustomProperty):
    """Count occurrences of a specific motif in the sequence."""

    def __init__(self, motif: str, target_value: float, weight: float = 1.0,
                 constraint_type: ConstraintType = ConstraintType.EXACT):
        super().__init__(target_value, weight, constraint_type)
        self.motif = motif

    def get_init_args(self) -> dict:
        """Override to include motif parameter for serialization."""
        return {
            "motif": self.motif,
            "target_value": self.target_value,
            "weight": self.weight,
            "constraint_type": self.constraint_type.value
        }

    def calculate_raw_value(self, protein: 'sparrow.Protein') -> float:
        # Count overlapping occurrences by restarting one position
        # after each match.
        sequence = protein.sequence
        count = 0
        start = 0
        while True:
            pos = sequence.find(self.motif, start)
            if pos == -1:
                break
            count += 1
            start = pos + 1
        return float(count)
Using Custom Properties:
# Create optimizer
optimizer = goose.SequenceOptimizer(target_length=100, verbose=True)
# Add custom properties with different constraint types
optimizer.add_property(
AlanineCount,
target_value=12.0,
weight=1.0,
constraint_type='exact',
tolerance=1.0 # Allow ±1 alanine
)
optimizer.add_property(
MotifCount,
motif="GPG",
target_value=3.0, # Want exactly 3 GPG motifs
weight=2.0,
constraint_type='exact',
tolerance=0.0 # Must be exact
)
# Standard properties
optimizer.add_property(
goose.FractionDisorder,
target_value=0.8,
weight=3.0,
constraint_type='minimum',
)
# Run optimization
optimized_sequence = optimizer.run()
# Analyze results
final_protein = sparrow.Protein(optimized_sequence)
print(f"Optimized Sequence: {optimized_sequence}")
print(f"Alanine count: {optimized_sequence.count('A')}")
print(f"GPG motifs: {optimized_sequence.count('GPG')}")
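One detail to keep in mind when checking results: the MotifCount loop above counts overlapping matches, while Python's str.count does not. The standalone function below reproduces that loop outside the class (the name `count_overlapping` is only for this illustration) to show where the two differ.

```python
def count_overlapping(sequence: str, motif: str) -> int:
    """Count motif occurrences, allowing overlaps (same loop as MotifCount)."""
    count = 0
    start = 0
    while True:
        pos = sequence.find(motif, start)
        if pos == -1:
            break
        count += 1
        start = pos + 1
    return count


sequence = "AGPGPGA"
print(count_overlapping(sequence, "GPG"))  # 2
print(sequence.count("GPG"))               # 1
```

If a designed sequence contains overlapping motifs such as GPGPG, the value reported by str.count can be lower than the value the property optimized for.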
Implementing Batch Calculation for Performance (Optional):
For properties that benefit from batch processing (e.g., using external APIs or vectorized operations),
you can enable batch calculation by setting the calculate_in_batch class attribute and implementing
calculate_raw_value_batch():
import numpy as np
from goose.backend.optimizer_properties import CustomProperty
import sparrow
class VectorizedHydrophobicity(CustomProperty):
    """Example property with batch calculation support."""

    calculate_in_batch = True  # Enable batch processing

    def __init__(self, target_value: float, weight: float = 1.0):
        super().__init__(target_value, weight)

    def calculate_raw_value(self, protein: 'sparrow.Protein') -> float:
        """Single sequence calculation (fallback)."""
        return protein.hydrophobicity

    def calculate_raw_value_batch(self, proteins: list) -> list:
        """
        Batch calculation for multiple proteins (more efficient).

        Parameters
        ----------
        proteins : list of sparrow.Protein
            List of protein instances to calculate

        Returns
        -------
        list of float
            List of calculated property values
        """
        # Placeholder: a real implementation would make one vectorized or
        # batched call here. This list comprehension is no faster than the
        # single-sequence path.
        return [p.hydrophobicity for p in proteins]
Note
When to Use Batch Calculation:
When calling external APIs that support batch processing (e.g., metapredict or other predictors that support batches)
When using vectorized NumPy operations across multiple sequences
When property calculation has expensive setup costs that can be amortized
Performance Impact:
FractionDisorder uses batch calculation for ~2-5× speedup with metapredict
Not all properties benefit from batch calculation
Single-sequence calculation is used as fallback when batch is unavailable
Note
Best Practices for Custom Properties:
Always implement calculate_raw_value() instead of calculate()
Use get_init_args() if your property has additional parameters
The base class automatically handles constraint types and tolerances
Optionally implement batch calculation for performance with calculate_in_batch = True
Batch calculation is automatically used when available if calculate_in_batch is True; fallback is single-sequence mode
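The batch-or-fallback behavior described in these notes can be sketched without GOOSE installed. The classes below are stand-ins that operate on plain strings rather than sparrow.Protein objects, and `evaluate` is a hypothetical dispatcher; how the optimizer actually selects the code path is an assumption.

```python
class SingleOnly:
    """Stand-in property without batch support."""
    calculate_in_batch = False

    def calculate_raw_value(self, sequence: str) -> float:
        return float(sequence.count("A"))


class Batched(SingleOnly):
    """Stand-in property that also offers a batch method."""
    calculate_in_batch = True

    def calculate_raw_value_batch(self, sequences: list) -> list:
        return [float(s.count("A")) for s in sequences]


def evaluate(prop, sequences: list) -> list:
    """Use the batch method when the flag is set, else one call per sequence."""
    if getattr(prop, "calculate_in_batch", False):
        return prop.calculate_raw_value_batch(sequences)
    return [prop.calculate_raw_value(s) for s in sequences]


candidates = ["AAG", "GGG", "ASA"]
print(evaluate(SingleOnly(), candidates))  # [2.0, 0.0, 2.0]
print(evaluate(Batched(), candidates))     # [2.0, 0.0, 2.0]
```

Both paths should return the same values, so a quick comparison like this is a reasonable sanity check when adding a batch method to a custom property.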
Advanced Optimizer Configuration
Below are additional parameters to customize the optimization process. You can set these during initialization or modify them later using dedicated methods. The default parameter values are chosen to provide robust performance across a wide range of scenarios. However, you can adjust them to better suit your specific optimization needs.
Convergence and Tolerance Controls:
optimizer = goose.SequenceOptimizer(
target_length=100,
# Error tolerance stopping
error_tolerance=1e-6, # Stop when total error below this value
enable_error_tolerance=True, # Enable error tolerance early stopping
# Convergence detection
convergence_tolerance=1e-4, # Convergence criterion for early stopping
convergence_window=20, # Number of recent iterations to check
enable_early_convergence=False, # Enable early stopping on convergence
convergence_patience=20, # Wait iterations after convergence
# Stagnation detection
stagnation_threshold=25, # Iterations before considering stagnant
stagnation_improvement_threshold=0.005 # Minimum improvement to avoid stagnation
)
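As a rough mental model of these parameters, convergence and stagnation can both be read as checks on the recent error history. The two functions below are hypothetical and encode assumed rules inferred from the parameter comments; the library's actual criteria (including trend analysis) may differ.

```python
def has_converged(error_history: list, convergence_window: int = 20,
                  convergence_tolerance: float = 1e-4) -> bool:
    """Assumed rule: the last `convergence_window` errors span less than the tolerance."""
    if len(error_history) < convergence_window:
        return False
    recent = error_history[-convergence_window:]
    return max(recent) - min(recent) < convergence_tolerance


def is_stagnant(error_history: list, stagnation_threshold: int = 25,
                stagnation_improvement_threshold: float = 0.005) -> bool:
    """Assumed rule: the best error improved too little over the last N iterations."""
    if len(error_history) <= stagnation_threshold:
        return False
    best_before = min(error_history[:-stagnation_threshold])
    best_recent = min(error_history[-stagnation_threshold:])
    return best_before - best_recent < stagnation_improvement_threshold


improving = [1.0 / (i + 1) for i in range(30)]
flat = [0.5] * 30
print(has_converged(improving), has_converged(flat))  # False True
print(is_stagnant(improving), is_stagnant(flat))      # False True
```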
Adaptive Scaling Parameters:
optimizer = goose.SequenceOptimizer(
target_length=100,
# Adaptive scaling control
enable_adaptive_scaling=True, # Enable adaptive property scaling
max_distance_factor=3.0, # Maximum scaling based on distance
distance_offset=0.2, # Offset for distance calculation
boost_factor=2.0, # Factor to boost underperforming properties
scale_momentum=0.5, # Momentum for scale smoothing (0-1)
scale_learning_rate=0.5, # Learning rate for scale updates (0-1)
min_scale=0.1, # Minimum allowed property scale
max_scale=8.0, # Maximum allowed property scale
# Thresholds for adaptive behavior
low_contribution_threshold=0.15, # Threshold for low-contributing properties
high_error_threshold=0.05, # Threshold for high-error properties
stagnation_multiplier=1.0 # Multiplier for stagnation response
)
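The scale_momentum, min_scale, and max_scale parameters suggest a smoothed and bounded update of each property's scale. The function below sketches one such update. It is an assumption for illustration (`update_scale` is hypothetical and the blend formula is not taken from the GOOSE source), and it ignores the other parameters such as scale_learning_rate and boost_factor.

```python
def update_scale(current_scale: float, proposed_scale: float,
                 scale_momentum: float = 0.5,
                 min_scale: float = 0.1, max_scale: float = 8.0) -> float:
    """Assumed update: blend old and proposed scale, then clamp to the allowed range."""
    blended = (scale_momentum * current_scale
               + (1.0 - scale_momentum) * proposed_scale)
    return max(min_scale, min(max_scale, blended))


print(update_scale(1.0, 3.0))    # 2.0
print(update_scale(8.0, 20.0))   # 8.0 (clamped to max_scale)
print(update_scale(0.1, 0.0))    # 0.1 (clamped to min_scale)
```

In this reading, a higher momentum makes scales change more slowly between iterations, and the clamp keeps any single property from dominating or vanishing from the objective.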
History and Analysis Parameters:
optimizer = goose.SequenceOptimizer(
target_length=100,
# History tracking
improvement_history_size=20, # Recent improvements per property
error_history_size=50, # Recent error values to store
# Analysis parameters
min_analysis_samples=5, # Minimum samples for analysis
min_trend_samples=5, # Minimum samples for trend calculation
improvement_threshold=-0.001, # Threshold for improvement detection
stability_threshold=0.01, # Variance threshold for stability
# Progress reporting
update_interval=10 # Update progress every N iterations
)
Dynamic Configuration Methods:
You can modify convergence and error tolerance settings after initialization:
# Configure convergence detection
optimizer.configure_convergence(
tolerance=1e-5, # New convergence tolerance
window=30, # New convergence window
enable_early_stopping=True, # Enable early stopping
patience=15 # New patience value
)
# Configure error tolerance
optimizer.configure_error_tolerance(
tolerance=1e-7, # New error tolerance
enable=True # Enable/disable error tolerance stopping
)
# Get convergence information
convergence_info = optimizer.get_convergence_info()
print(f"Convergence status: {convergence_info}")
Troubleshooting and Optimization Tips
Optimization Not Converging
Symptoms: Error plateaus at high values, properties far from targets
Solutions:
Increase iterations: max_iterations=5000 or higher for complex problems
Enable adaptive scaling: enable_adaptive_scaling=True (default)
Increase diversity: shuffle_frequency=25, num_candidates=10
Check target compatibility: Ensure properties don’t fundamentally conflict
Use tolerances: Set reasonable tolerance values for each property
Verify constraint types: Make sure you’re using appropriate constraints
Slow Optimization Performance
Symptoms: Optimization takes too long, high memory usage
Solutions:
Reduce candidates: num_candidates=3 for faster iterations (default is 5)
Disable expensive features: enable_adaptive_scaling=False, enable_shuffling=False
Stop earlier: error_tolerance=1e-4, enable_early_convergence=True
Optimize caching: Check cache hit rate with get_cache_statistics()
Pre-load models: Use preloaded_model for epsilon properties
Property Conflicts and Balancing
Symptoms: Some properties optimize while others get worse
Solutions:
Adjust weights: Higher weight = higher priority
Use appropriate constraint types: MINIMUM/MAXIMUM instead of EXACT when possible
Set generous tolerances: Allow some flexibility in less critical properties
Check physical compatibility: Some combinations may be impossible
Monitor individual properties: Enable verbose=True to track individual progress
# Balanced multi-property optimization
from goose.backend.optimizer_properties import ConstraintType
optimizer.add_property(goose.FractionDisorder, target_value=0.8, weight=3.0,
                       constraint_type=ConstraintType.MINIMUM, tolerance=0.05)
optimizer.add_property(goose.FCR, target_value=0.3, weight=1.0,
                       constraint_type=ConstraintType.EXACT, tolerance=0.1)
optimizer.add_property(goose.Hydrophobicity, target_value=0.4, weight=0.5,
                       constraint_type=ConstraintType.EXACT, tolerance=0.2)
Memory Issues with Large Sequences
Symptoms: Out of memory errors, excessive RAM usage
Solutions:
Reduce history sizes: improvement_history_size=5, error_history_size=10
Clear cache periodically: Call optimizer._clear_evaluation_cache() if needed
Disable caching: Set caching parameters conservatively
Use fewer candidates: num_candidates=3 for large sequences
# Memory-efficient settings for large sequences
optimizer = goose.SequenceOptimizer(
target_length=1000,
improvement_history_size=5,
error_history_size=10,
num_candidates=3,
debugging=False
)
Stagnation Issues
Symptoms: Error doesn’t improve for many iterations
Solutions:
Enable shuffling: enable_shuffling=True with frequent shuffling
Adjust stagnation detection: Lower stagnation_threshold=15
Increase mutation diversity: Higher max_mutations=20
Check for impossible targets: Some property combinations may be unachievable
Examples and Demo Notebooks
GOOSE includes comprehensive demo notebooks showcasing advanced SequenceOptimizer usage in the /demos directory. These include:
Basic optimization: see sequence_optimization.ipynb for basic usage.
Custom properties: see custom_optimizer_peroperties.ipynb for creating and implementing custom user-defined properties
Design by interaction: see generate_sequences_by_interaction.ipynb for designing sequences to interact with a target sequence using epsilon-based properties.
Design by linear profiles: see linear_profiles.ipynb for designing sequences to match linear profiles of properties like NCPR.
Design by interaction matrices: see epsilon_matrix_variants.ipynb for designing sequences to match or modify interaction matrices.
Demo Location:
Check the demos directory for Jupyter notebooks with detailed examples and explanations.
API Reference
Core Classes:
- goose.SequenceOptimizer: Main optimization engine
- goose.backend.optimizer_properties.ProteinProperty: Base class for properties
- goose.backend.optimizer_properties.CustomProperty: Base class for custom properties users can define
Key Methods:
- SequenceOptimizer.add_property(): Add properties to optimize
- SequenceOptimizer.set_initial_sequence(): Set starting sequence
- SequenceOptimizer.run(): Execute optimization
See Also
For complete API documentation, see goose/optimize.py and goose/backend/optimizer_properties.py.
For implementation examples and advanced usage patterns, explore the demo notebooks in demos/.