csd_operations module
Module: csd_operations.py
High-level interface for interacting with the Cambridge Structural Database (CSD).
This module provides functionality to:
- Extract and filter refcode families
- Cluster structures by packing similarity
- Select representative structures using the vdWFV metric
- Save intermediate results to CSV
Dependencies
- pandas
- networkx
- ccdc
- csd_structure_validator
- class csd_operations.SimilaritySettings(distance_tolerance=0.2, angle_tolerance=20.0, ignore_bond_types=True, ignore_hydrogen_counts=True, ignore_hydrogen_positions=True, packing_shell_size=15, ignore_spacegroup=True, normalise_unit_cell=True)[source]
Bases: object
Configuration settings for packing similarity comparisons of crystal structures.
- Parameters:
distance_tolerance (float, default=0.2) – Maximum allowed deviation in atomic distances (Å) when comparing packings.
angle_tolerance (float, default=20.0) – Maximum allowed angular deviation (degrees) between molecular orientations.
ignore_bond_types (bool, default=True) – If True, matching bond orders are not required for similarity.
ignore_hydrogen_counts (bool, default=True) – If True, differences in hydrogen counts are ignored.
ignore_hydrogen_positions (bool, default=True) – If True, explicit hydrogen coordinate differences are ignored.
packing_shell_size (int, default=15) – Number of molecules considered in each packing-shell comparison.
ignore_spacegroup (bool, default=True) – If True, space-group designations are not required to match.
normalise_unit_cell (bool, default=True) – If True, unit cell parameters are normalized before comparison.
- __init__(distance_tolerance=0.2, angle_tolerance=20.0, ignore_bond_types=True, ignore_hydrogen_counts=True, ignore_hydrogen_positions=True, packing_shell_size=15, ignore_spacegroup=True, normalise_unit_cell=True)
- class csd_operations.CSDOperations(data_directory, data_prefix)[source]
Bases: object
High-level interface for querying, validating, clustering, and selecting crystal structures from the Cambridge Structural Database (CSD).
- data_directory
Base directory for reading and writing CSD-related files.
- Type:
Path
- reader
CCDC EntryReader instance connected to the “CSD” database.
- Type:
io.EntryReader
- similarity_engine
Engine for computing pairwise packing similarity.
- Type:
PackingSimilarity
- __init__(data_directory, data_prefix)[source]
Initialize CSDOperations with target directory and filename prefix.
- get_refcode_families_df()[source]
Query the CSD and group entries by base refcode.
- Returns:
DataFrame with columns:
- family_id : str, first six characters of the refcode
- refcode : str, full CSD refcode
- Return type:
pd.DataFrame
- save_refcode_families_csv(df=None, filename=None)[source]
Write the refcode-families DataFrame to a CSV file.
- filter_families_by_size(df, min_size=2)[source]
Exclude families with fewer than a specified number of members.
- cluster_families(filters)[source]
Perform packing similarity clustering on each refcode family.
Workflow
1. Load initial refcode families CSV.
2. Group refcodes by 'family_id'.
3. For each group, validate entries and build a similarity graph.
4. Identify connected components as clusters.
5. Save clustered results to CSV.
- Parameters:
filters (Dict[str, Any]) – Criteria for structure validation.
- Returns:
DataFrame with columns ['family_id', 'refcode', 'cluster_id'].
- Return type:
pd.DataFrame
- Raises:
FileNotFoundError – If the initial CSV is missing.
RuntimeError – If clustering fails for any family.
- get_unique_structures(filters, method='vdWFV')[source]
Select one representative per cluster using the vdWFV metric.
Workflow
1. Load clustered families CSV.
2. Group by ['family_id', 'cluster_id'].
3. Compute vdWFV for each refcode; select the minimum.
4. Save unique representatives to CSV.
- Parameters:
filters (Dict[str, Any]) – Placeholder for revalidation filters.
method (str, default='vdWFV') – Only 'vdWFV' is supported.
- Returns:
DataFrame with columns ['family_id', 'refcode'].
- Return type:
pd.DataFrame
- Raises:
FileNotFoundError – If clustered CSV is missing.
NotImplementedError – If method is not 'vdWFV'.
Cambridge Structural Database Operations
The csd_operations module provides high-level interfaces for interacting with the Cambridge Structural Database (CSD), including family extraction, similarity clustering, and representative structure selection.
SimilaritySettings Class
Configuration for Packing Similarity Comparisons
Dataclass controlling parameters for 3D crystal packing similarity calculations using CCDC algorithms.
Key Parameters:
distance_tolerance (float) - Maximum deviation in atomic distances (Å)
angle_tolerance (float) - Maximum angular deviation (degrees)
packing_shell_size (int) - Number of molecules in comparison shell
ignore_hydrogen_positions (bool) - Whether to ignore H-atom coordinates
normalise_unit_cell (bool) - Whether to normalize unit cell parameters
Default Configuration:
    settings = SimilaritySettings(
        distance_tolerance=0.2,          # 0.2 Å distance tolerance
        angle_tolerance=20.0,            # 20° angular tolerance
        ignore_bond_types=True,          # Ignore bond order differences
        ignore_hydrogen_counts=True,     # Ignore H-count differences
        ignore_hydrogen_positions=True,  # Ignore H-position differences
        packing_shell_size=15,           # 15-molecule comparison shell
        ignore_spacegroup=True,          # Ignore space group differences
        normalise_unit_cell=True,        # Normalize unit cell parameters
    )
Tuning Guidelines:
Strict Similarity - Reduce distance/angle tolerances
Loose Similarity - Increase tolerances for broader clustering
Performance - Reduce packing_shell_size for faster comparisons
Accuracy - Increase packing_shell_size for more reliable comparisons
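As a rough illustration of these guidelines, the sketch below derives a stricter and a looser configuration from the defaults. It uses a stand-in dataclass that only mirrors the documented fields and defaults (the real class lives in csd_operations and is not imported here), and the strict and loose values are arbitrary examples, not recommendations.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SimilaritySettings:
    """Stand-in mirroring the documented fields and defaults."""
    distance_tolerance: float = 0.2
    angle_tolerance: float = 20.0
    ignore_bond_types: bool = True
    ignore_hydrogen_counts: bool = True
    ignore_hydrogen_positions: bool = True
    packing_shell_size: int = 15
    ignore_spacegroup: bool = True
    normalise_unit_cell: bool = True


default = SimilaritySettings()

# Stricter matching: tighter tolerances (illustrative values only).
strict = replace(default, distance_tolerance=0.1, angle_tolerance=10.0)

# Looser, faster matching: wider tolerances and a smaller shell.
loose = replace(default, distance_tolerance=0.3, angle_tolerance=30.0,
                packing_shell_size=10)
```

Deriving variants from one default object keeps every untouched field at its documented value, which makes the difference between two clustering runs easy to read off.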
CSDOperations Class
High-Level CSD Interface for Structure Operations
Primary interface for querying, validating, clustering, and selecting crystal structures from the Cambridge Structural Database.
Core Responsibilities:
Family Extraction - Query and organize structures into chemical families
Quality Validation - Filter structures based on experimental criteria
Similarity Clustering - Group structures by 3D packing similarity
Representative Selection - Choose one structure per cluster using the vdWFV metric
Data Management - Save intermediate results and manage file I/O
- Attributes:
data_directory (Path) - Base directory for file operations
data_prefix (str) - Filename prefix for all generated files
reader (io.EntryReader) - CCDC database connection
similarity_engine (PackingSimilarity) - Packing comparison engine
- __init__(data_directory, data_prefix)[source]
Initialize CSDOperations with target directory and filename prefix.
Initialize CSD Operations Handler
- Parameters:
data_directory (Union[str, Path]) - Base directory for file I/O
data_prefix (str) - Prefix for generated filenames
Initialization Process:
    # Set up file paths and directories
    self.data_directory = Path(data_directory)
    self.data_prefix = data_prefix
    # Initialize CSD connection
    self.reader = io.EntryReader("CSD")
    # Set up packing similarity engine
    self.similarity_engine = PackingSimilarity()
Directory Structure Created:
    data_directory/
    ├── {prefix}_refcode_families.csv
    ├── {prefix}_refcode_families_clustered.csv
    ├── {prefix}_refcode_families_unique.csv
    └── structures/
        ├── REFCODE01.cif
        ├── REFCODE02.cif
        └── ...
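The file layout above can be reproduced with pathlib. This is a minimal sketch: the helper name `default_paths` and the dictionary keys are hypothetical, and only the file names themselves come from the listing.

```python
from pathlib import Path


def default_paths(data_directory, data_prefix):
    """Hypothetical helper: build the file locations shown in the listing."""
    base = Path(data_directory)
    return {
        "families": base / f"{data_prefix}_refcode_families.csv",
        "clustered": base / f"{data_prefix}_refcode_families_clustered.csv",
        "unique": base / f"{data_prefix}_refcode_families_unique.csv",
        "structures": base / "structures",
    }


paths = default_paths("data", "csd")
```

With these arguments, `paths["families"]` points at `data/csd_refcode_families.csv`; nothing is created on disk.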
Family Extraction Methods
- CSDOperations.get_refcode_families_df()[source]
Extract Structure Families from CSD
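Queries the CSD and organizes entries into families by base refcode.
Family Organization:
Structures are grouped by base refcode: the first six characters of each refcode form the family_id. Entries that share this stem (for example ACSALA, ACSALA01, ACSALA02) are generally determinations of the same compound, so a family reflects:
* Chemical connectivity - Same molecular graph
* Refcode patterns - Same six-character stem, with a numeric suffix for further determinations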
- Returns:
pandas.DataFrame with columns:
family_id - Unique identifier for each chemical family
refcode - CSD refcode for individual structures
Example Output:
    family_id  refcode
    ACSALA     ACSALA
    ACSALA     ACSALA01
    ACSALA     ACSALA02
    ...
    BENZEN     BENZEN
    BENZEN     BENZEN01
    BENZEN     BENZEN02
    ...
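The grouping rule stated in the docstring (family_id is the first six characters of the refcode) can be sketched without a CSD installation. The helper name `group_by_family` is hypothetical, and the refcode list is a hand-written sample rather than a database query.

```python
from collections import defaultdict


def group_by_family(refcodes):
    """Map each six-character base refcode to its member refcodes."""
    families = defaultdict(list)
    for refcode in refcodes:
        families[refcode[:6]].append(refcode)
    return dict(families)


refcodes = ["ACSALA", "ACSALA01", "ACSALA02", "BENZEN", "BENZEN01"]
families = group_by_family(refcodes)

# Flatten to (family_id, refcode) rows, matching the two-column layout.
rows = [(fid, rc) for fid, members in families.items() for rc in members]
```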
- CSDOperations.save_refcode_families_csv(df=None, filename=None)[source]
Write the refcode-families DataFrame to a CSV file.
- Parameters:
df (pd.DataFrame, optional) – DataFrame to save. If None, uses get_refcode_families_df().
filename (Union[str, Path], optional) – Full file path for output. If None, defaults to data_directory / f"{data_prefix}_refcode_families.csv".
- Raises:
OSError – If writing to disk fails.
Save Family Assignments to CSV
Writes refcode family assignments to disk for persistence and downstream processing.
- Parameters:
df (pandas.DataFrame, optional) - DataFrame to save; if None, generates a new one
filename (Union[str, Path], optional) - Output path; if None, uses default naming
Default File Path:
    {data_directory}/{data_prefix}_refcode_families.csv
CSV Format:
    family_id,refcode
    ACSALA,ACSALA
    ACSALA,ACSALA01
    ...
    BENZEN,BENZEN
    BENZEN,BENZEN01
    ...
- CSDOperations.filter_families_by_size(df, min_size=2)[source]
Exclude families with fewer than a specified number of members.
- Parameters:
df (pd.DataFrame) – DataFrame with columns [‘family_id’, ‘refcode’].
min_size (int, default=2) – Minimum number of members for a family to be retained.
- Returns:
Filtered DataFrame.
- Return type:
pd.DataFrame
- Raises:
KeyError – If ‘family_id’ column is missing.
Filter Families by Member Count
Removes families with insufficient members for meaningful clustering analysis.
- Parameters:
df (pandas.DataFrame) - Family assignments DataFrame
min_size (int) - Minimum family size (default: 2)
Filtering Logic:
    # Count members per family
    family_counts = df['family_id'].value_counts()
    # Keep only families with sufficient members
    valid_families = family_counts[family_counts >= min_size].index
    filtered_df = df[df['family_id'].isin(valid_families)]
- Use Cases:
Statistical significance - Ensure meaningful clustering
Computational efficiency - Focus on families with multiple structures
Quality control - Remove singleton families
- Returns:
pandas.DataFrame with filtered family assignments
Clustering Methods
- CSDOperations.cluster_families(filters)[source]
Perform Packing Similarity Clustering
Groups structures within each family based on 3D crystal packing similarity using CCDC algorithms.
Clustering Workflow:
Load Families - Read refcode family assignments
Parallel Processing - Distribute families across CPU cores
Structure Validation - Apply quality filters to each structure
Similarity Computation - Calculate pairwise packing similarities
Graph Construction - Build similarity graphs with threshold cutoffs
Cluster Identification - Find connected components as clusters
Result Aggregation - Combine results from all families
- Parameters:
filters (Dict[str, Any]) - Structure validation criteria
Similarity Algorithm:
    # For each pair of structures in a family
    similarity = PackingSimilarity.compare(
        crystal1=entry1.crystal,
        crystal2=entry2.crystal,
        distance_tolerance=0.2,
        angle_tolerance=20.0,
        packing_shell_size=15
    )
    # Similarity values range from 0 (dissimilar) to 1 (identical)
    if similarity > threshold:
        graph.add_edge(refcode1, refcode2)
Cluster Output:
    family_id  refcode   cluster_id
    ACSALA     ACSALA    1
    ACSALA     ACSALA01  1
    ACSALA     ACSALA02  1
    ...
    ACSALA     ACSALA13  2
    ACSALA     ACSALA15  2
    ACSALA     ACSALA17  2
    ...
    ACSALA     ACSALA23  3
    ACSALA     ACSALA24  3
    ...
    BENZEN     BENZEN    1
    BENZEN     BENZEN01  1
    BENZEN     BENZEN02  1
    ...
    BENZEN     BENZEN03  2
    BENZEN     BENZEN04  2
    BENZEN     BENZEN16  2
    ...
- Performance Characteristics:
CPU Parallelization - Uses multiple cores for family processing
Memory Efficiency - Processes families independently
Scalability - Linear scaling with number of families
- Returns:
pandas.DataFrame with clustered family assignments
- Raises:
FileNotFoundError - If refcode families CSV is missing
RuntimeError - If clustering fails for any family
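The result-aggregation step can be pictured as flattening per-family cluster lists into rows of (family_id, refcode, cluster_id). The sketch below assumes each worker returns a (family_id, clusters) pair and that cluster ids restart at 1 within every family, as the sample output suggests; the function name `clusters_to_rows` is hypothetical.

```python
def clusters_to_rows(results):
    """Flatten (family_id, clusters) pairs into table rows.

    Assumption: cluster ids start at 1 within each family.
    """
    rows = []
    for family_id, clusters in results:
        for cluster_id, members in enumerate(clusters, start=1):
            for refcode in members:
                rows.append((family_id, refcode, cluster_id))
    return rows


# Hand-written sample input, not real clustering output.
results = [
    ("ACSALA", [["ACSALA", "ACSALA01"], ["ACSALA13"]]),
    ("BENZEN", [["BENZEN", "BENZEN01"]]),
]
rows = clusters_to_rows(results)
```

The resulting list of tuples can be passed to a DataFrame constructor with the three column names to obtain the documented layout.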
- CSDOperations._check_structure(identifier, filters, entry=None)[source]
Validate a CSD entry against filter criteria.
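- Parameters:
identifier (str) – CSD refcode to validate.
filters (Dict) – Validation criteria dictionary.
entry (io.Entry, optional) – Pre-loaded CSD entry.
- Returns:
True if the structure is valid, False otherwise.
- Return type:
bool
- Raises:
Exception – If validation fails unexpectedly.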
Validate Structure Against Filter Criteria
Applies comprehensive quality filters to determine structure suitability for analysis.
- Parameters:
identifier (str) - CSD refcode to validate
filters (Dict) - Validation criteria dictionary
entry (io.Entry, optional) - Pre-loaded CSD entry
Validation Categories:
Quality Filters:
* Completeness requirements - Data collection completeness
Chemical Filters:
* Element restrictions - Allowed atomic species
* Molecular weight limits - Size constraints
* Z' value constraints - Asymmetric unit requirements
* Crystal type requirements - Homomolecular vs. solvated
Structural Filters:
* Disorder exclusion - Remove disordered structures
* Polymer exclusion - Exclude polymeric materials
Example Filter Configuration:
    filters = {
        "structure_list": ["csd-unique"],        # Use unique structures in CSD
        "crystal_type": ["homomolecular"],       # Homomolecular structures only
        "target_species": ["C", "H"],            # Hydrocarbons only
        "target_space_groups": ["P1", "P-1"],    # Triclinic structures only
        "target_z_prime_values": [1],            # Z' = 1 only
        "molecule_weight_limit": 300.0,          # Small molecules
        "molecule_formal_charges": [0],          # Neutral molecules
        # Use van der Waals free volume to select unique structures from a cluster
        "unique_structures_clustering_method": "vdWFV",
    }
- Returns:
bool indicating if structure passes all validation criteria
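To make the filter semantics concrete, here is a minimal sketch that checks a plain-dict structure summary against three of the keys from the example configuration. It is not the module's validator: the `record` fields, the helper name `passes_filters`, and the interpretation of `target_species` as an allowed-element set are all assumptions, and the two records carry made-up values.

```python
def passes_filters(record, filters):
    """Check a plain-dict structure summary against a subset of filters.

    `record` and its keys are hypothetical; a real check would read these
    values from a CSD entry.
    """
    # Assumption: every element present must be in the allowed list.
    allowed = filters.get("target_species")
    if allowed is not None and not set(record["species"]) <= set(allowed):
        return False
    z_primes = filters.get("target_z_prime_values")
    if z_primes is not None and record["z_prime"] not in z_primes:
        return False
    limit = filters.get("molecule_weight_limit")
    if limit is not None and record["molecular_weight"] > limit:
        return False
    return True


filters = {
    "target_species": ["C", "H"],
    "target_z_prime_values": [1],
    "molecule_weight_limit": 300.0,
}
hydrocarbon = {"species": ["C", "H"], "z_prime": 1, "molecular_weight": 128.0}
oxygenated = {"species": ["C", "H", "O"], "z_prime": 1, "molecular_weight": 180.0}

accepted = passes_filters(hydrocarbon, filters)
rejected = passes_filters(oxygenated, filters)
```

Treating a missing key as "no restriction" lets the same function work with partial filter dictionaries.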
Representative Selection Methods
- CSDOperations.get_unique_structures(filters, method='vdWFV')[source]
Select Representative Structures from Clusters
Chooses one representative structure from each cluster using the vdWFV (van der Waals free volume) metric.
Selection Algorithm:
The vdWFV method computes vdWFV for every structure in a cluster and selects the structure with the minimum value.
- Parameters:
method (str) - Selection method (only "vdWFV" is supported)
- Output Files:
CSV - {prefix}_refcode_families_unique.csv
Structure Directory - Individual CIF files for representatives
- Returns:
pandas.DataFrame with selected representative structures
- Raises:
FileNotFoundError - If clustered families CSV is missing
NotImplementedError - If a method other than "vdWFV" is requested
- CSDOperations._save_unique_structures(df)[source]
Save unique structure representatives to CSV.
- Parameters:
df (pd.DataFrame) – DataFrame with columns [‘family_id’, ‘refcode’].
- Raises:
OSError – If file writing fails.
Save Representative Structures to CSV
Persists the selected unique structures for downstream processing.
- Parameters:
df (pandas.DataFrame) - DataFrame with representative assignments
Output Format:
    family_id,refcode
    ACSALA,ACSALA13
    ACSALA,ACSALA24
    ACSALA,ACSALA35
    BENZEN,BENZEN22
    BENZEN,BENZEN24
    ...
File Location:
{data_directory}/{data_prefix}_refcode_families_unique.csv
Utility Functions
_process_single_family
- csd_operations._process_single_family(args)[source]
Validate and cluster a single refcode family by packing similarity.
- Parameters:
args (Tuple[str, List[str], Dict[str, Any]]) –
family_id : str
structures : List[str]
filters : Dict of validation criteria
- Returns:
family_id and list of clusters (each a list of refcodes).
- Return type:
Tuple[str, List[List[str]]]
- Raises:
Exception – If any error occurs during processing.
Process Individual Family for Clustering
Worker function for parallel clustering of structure families.
- Parameters:
args (Tuple[str, List[str], Dict]) - (family_id, refcodes, filters)
- Processing Steps:
Validation - Check each structure against filters
Similarity Matrix - Compute all pairwise similarities
Graph Construction - Build similarity network
Clustering - Identify connected components
Result Packaging - Return cluster assignments
- Returns:
Tuple[str, List[List[str]]] - (family_id, list of clusters)
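The graph-construction and clustering steps amount to finding connected components of a similarity graph. The sketch below uses a plain breadth-first search in place of a graph library (networkx is listed as a dependency, but how the module calls it is not shown here), and a toy `is_similar` predicate stands in for the packing comparison. The function name `cluster_family` and the short refcode-like labels are hypothetical.

```python
from collections import deque
from itertools import combinations


def cluster_family(refcodes, is_similar):
    """Group refcodes into connected components of the similarity graph."""
    # Build an undirected adjacency map from all pairwise comparisons.
    neighbours = {rc: set() for rc in refcodes}
    for a, b in combinations(refcodes, 2):
        if is_similar(a, b):
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Breadth-first search collects one component per unseen refcode.
    seen, clusters = set(), []
    for start in refcodes:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            node = queue.popleft()
            component.append(node)
            for nxt in sorted(neighbours[node]):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        clusters.append(sorted(component))
    return clusters


# Toy predicate standing in for a packing comparison.
similar_pairs = {frozenset(p) for p in [("A", "A01"), ("A01", "A02"), ("A13", "A15")]}
clusters = cluster_family(
    ["A", "A01", "A02", "A13", "A15", "A23"],
    lambda a, b: frozenset((a, b)) in similar_pairs,
)
```

Note that connected components are transitive: "A" and "A02" end up in the same cluster through "A01" even though they were never directly matched.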
_representative_for_cluster
- csd_operations._representative_for_cluster(args)[source]
Select the refcode with minimal vdWFV in a cluster.
- Parameters:
args (Tuple[str, List[str]]) –
family_id : str
cluster : List[str]
- Returns:
family_id and representative refcode.
- Return type:
Tuple[str, str]
- Raises:
Exception – If any lookup fails.
Select Representative from Single Cluster
Worker function for parallel representative selection.
- Selection Process:
Load Structures - Access CSD entries for cluster members
Compute Metrics - Calculate vdWFV for each structure
Representative Selection - Choose the structure with the minimal vdWFV
- Returns:
Tuple[str, str] - (family_id, representative_refcode)
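Following the docstring's rule of selecting the refcode with minimal vdWFV, a worker of this shape could look roughly like the sketch below. The `vdwfv` mapping is a hypothetical precomputed lookup with made-up values; computing vdWFV from CSD entries is outside the scope of the example.

```python
def representative_for_cluster(family_id, cluster, vdwfv):
    """Return (family_id, refcode) for the member with the smallest vdWFV.

    `vdwfv` is a hypothetical precomputed mapping refcode -> value.
    """
    best = min(cluster, key=lambda refcode: vdwfv[refcode])
    return family_id, best


# Made-up values for illustration only.
vdwfv = {"ACSALA13": 0.31, "ACSALA15": 0.28, "ACSALA17": 0.30}
result = representative_for_cluster(
    "ACSALA", ["ACSALA13", "ACSALA15", "ACSALA17"], vdwfv
)
```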
See Also
crystal_analyzer module : Pipeline orchestration
structure_data_extractor module : Raw data extraction
../validation/csd_structure_validator : Structure validation
geometry_utils module : Geometric calculations