csd_operations module

Module: csd_operations.py

High-level interface for interacting with the Cambridge Structural Database (CSD).

This module provides functionality to:

  • Extract and filter refcode families

  • Cluster structures by packing similarity

  • Select representative structures using the vdWFV metric

  • Save intermediate results to CSV

Dependencies

  • pandas

  • networkx

  • ccdc

  • csd_structure_validator

class csd_operations.SimilaritySettings(distance_tolerance=0.2, angle_tolerance=20.0, ignore_bond_types=True, ignore_hydrogen_counts=True, ignore_hydrogen_positions=True, packing_shell_size=15, ignore_spacegroup=True, normalise_unit_cell=True)

Bases: object

Configuration settings for packing similarity comparisons of crystal structures.

Parameters:
  • distance_tolerance (float, default=0.2) – Maximum allowed deviation in atomic distances (Å) when comparing packings.

  • angle_tolerance (float, default=20.0) – Maximum allowed angular deviation (degrees) between molecular orientations.

  • ignore_bond_types (bool, default=True) – If True, matching bond orders are not required for similarity.

  • ignore_hydrogen_counts (bool, default=True) – If True, differences in hydrogen counts are ignored.

  • ignore_hydrogen_positions (bool, default=True) – If True, explicit hydrogen coordinate differences are ignored.

  • packing_shell_size (int, default=15) – Number of molecules considered in each packing-shell comparison.

  • ignore_spacegroup (bool, default=True) – If True, space-group designations are not required to match.

  • normalise_unit_cell (bool, default=True) – If True, unit cell parameters are normalized before comparison.

distance_tolerance: float = 0.2
angle_tolerance: float = 20.0
ignore_bond_types: bool = True
ignore_hydrogen_counts: bool = True
ignore_hydrogen_positions: bool = True
packing_shell_size: int = 15
ignore_spacegroup: bool = True
normalise_unit_cell: bool = True
__init__(distance_tolerance=0.2, angle_tolerance=20.0, ignore_bond_types=True, ignore_hydrogen_counts=True, ignore_hydrogen_positions=True, packing_shell_size=15, ignore_spacegroup=True, normalise_unit_cell=True)
class csd_operations.CSDOperations(data_directory, data_prefix)

Bases: object

High-level interface for querying, validating, clustering, and selecting crystal structures from the Cambridge Structural Database (CSD).

data_directory

Base directory for reading and writing CSD-related files.

Type:

Path

data_prefix

Prefix used when naming output files.

Type:

str

reader

CCDC EntryReader instance connected to the “CSD” database.

Type:

io.EntryReader

similarity_engine

Engine for computing pairwise packing similarity.

Type:

PackingSimilarity

__init__(data_directory, data_prefix)

Initialize CSDOperations with target directory and filename prefix.

Parameters:
  • data_directory (Union[str, Path]) – Directory under which all CSV outputs will be saved.

  • data_prefix (str) – Prefix for generated CSV filenames (e.g., “<prefix>_refcode_families.csv”).

get_refcode_families_df()

Query the CSD and group entries by base refcode.

Returns:

DataFrame with columns:

  • family_id : str, first six characters of the refcode

  • refcode : str, full CSD refcode

Return type:

pd.DataFrame

save_refcode_families_csv(df=None, filename=None)

Write the refcode-families DataFrame to a CSV file.

Parameters:
  • df (pd.DataFrame, optional) – DataFrame to save. If None, uses get_refcode_families_df().

  • filename (Union[str, Path], optional) – Full file path for output. If None, defaults to data_directory / f"{data_prefix}_refcode_families.csv".

Raises:

OSError – If writing to disk fails.

filter_families_by_size(df, min_size=2)

Exclude families with fewer than a specified number of members.

Parameters:
  • df (pd.DataFrame) – DataFrame with columns [‘family_id’, ‘refcode’].

  • min_size (int, default=2) – Minimum number of members for a family to be retained.

Returns:

Filtered DataFrame.

Return type:

pd.DataFrame

Raises:

KeyError – If ‘family_id’ column is missing.

cluster_families(filters)

Perform packing similarity clustering on each refcode family.

Workflow

  1. Load initial refcode families CSV.

  2. Group refcodes by ‘family_id’.

  3. For each group, validate entries and build a similarity graph.

  4. Identify connected components as clusters.

  5. Save clustered results to CSV.

Parameters:

filters (Dict[str, Any]) – Criteria for structure validation.

Returns:

DataFrame with columns [‘family_id’, ‘refcode’, ‘cluster_id’].

Return type:

pd.DataFrame

Raises:
  • FileNotFoundError – If the initial CSV is missing.

  • RuntimeError – If clustering fails for any family.

get_unique_structures(filters, method='vdWFV')

Select one representative per cluster using the vdWFV metric.

Workflow

  1. Load clustered families CSV.

  2. Group by [‘family_id’, ‘cluster_id’].

  3. Compute vdWFV for each refcode; select the minimum.

  4. Save unique representatives to CSV.

Parameters:
  • filters (Dict[str, Any]) – Placeholder for revalidation filters.

  • method (str, default="vdWFV") – Only ‘vdWFV’ is supported.

Returns:

DataFrame with columns [‘family_id’, ‘refcode’].

Return type:

pd.DataFrame

Raises:
  • FileNotFoundError – If the clustered CSV is missing.

  • NotImplementedError – If method is not ‘vdWFV’.

Cambridge Structural Database Operations

The csd_operations module provides high-level interfaces for interacting with the Cambridge Structural Database (CSD), including family extraction, similarity clustering, and representative structure selection.

SimilaritySettings Class

class csd_operations.SimilaritySettings(distance_tolerance=0.2, angle_tolerance=20.0, ignore_bond_types=True, ignore_hydrogen_counts=True, ignore_hydrogen_positions=True, packing_shell_size=15, ignore_spacegroup=True, normalise_unit_cell=True)

Bases: object

Configuration settings for packing similarity comparisons of crystal structures.

Parameters:
  • distance_tolerance (float, default=0.2) – Maximum allowed deviation in atomic distances (Å) when comparing packings.

  • angle_tolerance (float, default=20.0) – Maximum allowed angular deviation (degrees) between molecular orientations.

  • ignore_bond_types (bool, default=True) – If True, matching bond orders are not required for similarity.

  • ignore_hydrogen_counts (bool, default=True) – If True, differences in hydrogen counts are ignored.

  • ignore_hydrogen_positions (bool, default=True) – If True, explicit hydrogen coordinate differences are ignored.

  • packing_shell_size (int, default=15) – Number of molecules considered in each packing-shell comparison.

  • ignore_spacegroup (bool, default=True) – If True, space-group designations are not required to match.

  • normalise_unit_cell (bool, default=True) – If True, unit cell parameters are normalized before comparison.

Configuration for Packing Similarity Comparisons

Dataclass controlling parameters for 3D crystal packing similarity calculations using CCDC algorithms.

Key Parameters:

  • distance_tolerance (float) - Maximum deviation in atomic distances (Å)

  • angle_tolerance (float) - Maximum angular deviation (degrees)

  • packing_shell_size (int) - Number of molecules in comparison shell

  • ignore_hydrogen_positions (bool) - Whether to ignore H-atom coordinates

  • normalise_unit_cell (bool) - Whether to normalize unit cell parameters

Default Configuration:

settings = SimilaritySettings(
    distance_tolerance=0.2,        # 0.2 Å distance tolerance
    angle_tolerance=20.0,          # 20° angular tolerance
    ignore_bond_types=True,        # Ignore bond order differences
    ignore_hydrogen_counts=True,   # Ignore H-count differences
    ignore_hydrogen_positions=True,# Ignore H-position differences
    packing_shell_size=15,         # 15-molecule comparison shell
    ignore_spacegroup=True,        # Ignore space group differences
    normalise_unit_cell=True       # Normalize unit cell parameters
)

Tuning Guidelines:

  • Strict Similarity - Reduce distance/angle tolerances

  • Loose Similarity - Increase tolerances for broader clustering

  • Performance - Reduce packing_shell_size for faster comparisons

  • Accuracy - Increase packing_shell_size for more reliable comparisons
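
The tuning guidelines above can be sketched with a small stand-in dataclass. This is a minimal illustration, assuming the class behaves like an ordinary Python dataclass with the documented fields and defaults; the stand-in below is redefined locally so the snippet runs without the CSD installed, and the tightened and loosened values are illustrative choices, not recommendations from the module.

```python
from dataclasses import dataclass, replace


@dataclass
class SimilaritySettings:
    """Local stand-in mirroring the documented fields and defaults."""
    distance_tolerance: float = 0.2
    angle_tolerance: float = 20.0
    ignore_bond_types: bool = True
    ignore_hydrogen_counts: bool = True
    ignore_hydrogen_positions: bool = True
    packing_shell_size: int = 15
    ignore_spacegroup: bool = True
    normalise_unit_cell: bool = True


default = SimilaritySettings()

# Stricter similarity: tighter distance and angle tolerances (illustrative values)
strict = replace(default, distance_tolerance=0.1, angle_tolerance=10.0)

# Looser similarity with a smaller shell for faster comparisons (illustrative values)
fast_loose = replace(
    default,
    distance_tolerance=0.3,
    angle_tolerance=30.0,
    packing_shell_size=10,
)

print(strict)
print(fast_loose)
```

Using `dataclasses.replace` keeps the documented defaults in one place and makes each variant differ only in the fields being tuned.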


CSDOperations Class

class csd_operations.CSDOperations(data_directory, data_prefix)

Bases: object

High-level interface for querying, validating, clustering, and selecting crystal structures from the Cambridge Structural Database (CSD).

data_directory

Base directory for reading and writing CSD-related files.

Type:

Path

data_prefix

Prefix used when naming output files.

Type:

str

reader

CCDC EntryReader instance connected to the “CSD” database.

Type:

io.EntryReader

similarity_engine

Engine for computing pairwise packing similarity.

Type:

PackingSimilarity

High-Level CSD Interface for Structure Operations

Primary interface for querying, validating, clustering, and selecting crystal structures from the Cambridge Structural Database.

Core Responsibilities:

  • Family Extraction - Query and organize structures into chemical families

  • Quality Validation - Filter structures based on experimental criteria

  • Similarity Clustering - Group structures by 3D packing similarity

  • Representative Selection - Choose one representative per cluster using the vdWFV metric

  • Data Management - Save intermediate results and manage file I/O

Attributes:
  • data_directory (Path) - Base directory for file operations

  • data_prefix (str) - Filename prefix for all generated files

  • reader (io.EntryReader) - CCDC database connection

  • similarity_engine (PackingSimilarity) - Packing comparison engine

__init__(data_directory, data_prefix)

Initialize CSDOperations with target directory and filename prefix.

Parameters:
  • data_directory (Union[str, Path]) – Directory under which all CSV outputs will be saved.

  • data_prefix (str) – Prefix for generated CSV filenames (e.g., “<prefix>_refcode_families.csv”).

Initialize CSD Operations Handler

Parameters:
  • data_directory (Union[str, Path]) - Base directory for file I/O

  • data_prefix (str) - Prefix for generated filenames

Initialization Process:

# Set up file paths and directories
self.data_directory = Path(data_directory)
self.data_prefix = data_prefix

# Initialize CSD connection
self.reader = io.EntryReader("CSD")

# Set up packing similarity engine
self.similarity_engine = PackingSimilarity()

Directory Structure Created:

data_directory/
├── {prefix}_refcode_families.csv
├── {prefix}_refcode_families_clustered.csv
├── {prefix}_refcode_families_unique.csv
└── structures/
    ├── REFCODE01.cif
    ├── REFCODE02.cif
    └── ...

Family Extraction Methods

CSDOperations.get_refcode_families_df()

Query the CSD and group entries by base refcode.

Returns:

DataFrame with columns:

  • family_id : str, first six characters of the refcode

  • refcode : str, full CSD refcode

Return type:

pd.DataFrame

Extract Structure Families from CSD

Queries the CSD and organizes structures into families by base refcode.

Family Organization:

Structures are grouped by:

  • Refcode pattern - Shared six-character base refcode, which becomes the family_id (for example ACSALA, ACSALA01, ACSALA02)

  • Chemical identity - Entries that share a base refcode are determinations of the same compound, such as redeterminations or polymorphs

Returns:
pandas.DataFrame with columns:
  • family_id - Unique identifier for each chemical family

  • refcode - CSD refcode for individual structures

Example Output:

family_id    refcode
ACSALA       ACSALA
ACSALA       ACSALA01
ACSALA       ACSALA02
...
BENZEN       BENZEN
BENZEN       BENZEN01
BENZEN       BENZEN02
...
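
The grouping shown in the example output can be sketched in a few lines. This is a minimal, dependency-free illustration that assumes only what the return description states (family_id is the first six characters of the refcode); the helper names `family_id` and `group_by_family` are hypothetical and are not part of the module, which returns a pandas DataFrame rather than plain tuples.

```python
from collections import defaultdict


def family_id(refcode: str) -> str:
    """Base refcode: the first six characters of a CSD refcode."""
    return refcode[:6]


def group_by_family(refcodes):
    """Map each family_id to its member refcodes, preserving input order."""
    families = defaultdict(list)
    for refcode in refcodes:
        families[family_id(refcode)].append(refcode)
    return dict(families)


refcodes = ["ACSALA", "ACSALA01", "ACSALA02", "BENZEN", "BENZEN01"]
families = group_by_family(refcodes)

# Flatten to (family_id, refcode) rows, matching the two documented columns
rows = [(fid, rc) for fid, members in families.items() for rc in members]
for fid, rc in rows:
    print(f"{fid:<12} {rc}")
```
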
CSDOperations.save_refcode_families_csv(df=None, filename=None)

Write the refcode-families DataFrame to a CSV file.

Parameters:
  • df (pd.DataFrame, optional) – DataFrame to save. If None, uses get_refcode_families_df().

  • filename (Union[str, Path], optional) – Full file path for output. If None, defaults to data_directory / f"{data_prefix}_refcode_families.csv".

Raises:

OSError – If writing to disk fails.

Save Family Assignments to CSV

Writes refcode family assignments to disk for persistence and downstream processing.

Parameters:
  • df (pandas.DataFrame, optional) - DataFrame to save; if None, generates new one

  • filename (Union[str, Path], optional) - Output path; if None, uses default naming

Default File Path:

{data_directory}/{data_prefix}_refcode_families.csv
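
The default-path rule can be sketched as follows. This is an illustration of the documented fallback only; `resolve_families_csv` is a hypothetical helper name, and the directory and prefix values are made up.

```python
from pathlib import Path
from typing import Optional, Union


def resolve_families_csv(
    data_directory: Union[str, Path],
    data_prefix: str,
    filename: Optional[Union[str, Path]] = None,
) -> Path:
    """Return the explicit filename if given, else the documented default path."""
    if filename is not None:
        return Path(filename)
    return Path(data_directory) / f"{data_prefix}_refcode_families.csv"


# No filename given: fall back to {data_directory}/{data_prefix}_refcode_families.csv
default_path = resolve_families_csv("data", "run1")

# Explicit filename: used as-is
custom_path = resolve_families_csv("data", "run1", filename="out/custom.csv")

print(default_path)
print(custom_path)
```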

CSV Format:

family_id,refcode
ACSALA,ACSALA
ACSALA,ACSALA01
...
BENZEN,BENZEN
BENZEN,BENZEN01
...
CSDOperations.filter_families_by_size(df, min_size=2)

Exclude families with fewer than a specified number of members.

Parameters:
  • df (pd.DataFrame) – DataFrame with columns [‘family_id’, ‘refcode’].

  • min_size (int, default=2) – Minimum number of members for a family to be retained.

Returns:

Filtered DataFrame.

Return type:

pd.DataFrame

Raises:

KeyError – If ‘family_id’ column is missing.

Filter Families by Member Count

Removes families with insufficient members for meaningful clustering analysis.

Parameters:
  • df (pandas.DataFrame) - Family assignments DataFrame

  • min_size (int) - Minimum family size (default: 2)

Filtering Logic:

# Count members per family
family_counts = df['family_id'].value_counts()

# Keep only families with sufficient members
valid_families = family_counts[family_counts >= min_size].index
filtered_df = df[df['family_id'].isin(valid_families)]
Use Cases:
  • Statistical significance - Ensure meaningful clustering

  • Computational efficiency - Focus on families with multiple structures

  • Quality control - Remove singleton families

Returns:

pandas.DataFrame with filtered family assignments

Clustering Methods

CSDOperations.cluster_families(filters)

Perform packing similarity clustering on each refcode family.

Workflow

  1. Load initial refcode families CSV.

  2. Group refcodes by ‘family_id’.

  3. For each group, validate entries and build a similarity graph.

  4. Identify connected components as clusters.

  5. Save clustered results to CSV.

Parameters:

filters (Dict[str, Any]) – Criteria for structure validation.

Returns:

DataFrame with columns [‘family_id’, ‘refcode’, ‘cluster_id’].

Return type:

pd.DataFrame

Raises:
  • FileNotFoundError – If the initial CSV is missing.

  • RuntimeError – If clustering fails for any family.

Perform Packing Similarity Clustering

Groups structures within each family based on 3D crystal packing similarity using CCDC algorithms.

Clustering Workflow:

  1. Load Families - Read refcode family assignments

  2. Parallel Processing - Distribute families across CPU cores

  3. Structure Validation - Apply quality filters to each structure

  4. Similarity Computation - Calculate pairwise packing similarities

  5. Graph Construction - Build similarity graphs with threshold cutoffs

  6. Cluster Identification - Find connected components as clusters

  7. Result Aggregation - Combine results from all families

Parameters:
  • filters (Dict[str, Any]) - Structure validation criteria

Similarity Algorithm:

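# Schematic of the comparison for each pair of structures in a family.
# In the CSD Python API, tolerances are set on the engine's settings
# rather than passed to compare().
engine = PackingSimilarity()
engine.settings.distance_tolerance = 0.2
engine.settings.angle_tolerance = 20.0
engine.settings.packing_shell_size = 15
result = engine.compare(entry1.crystal, entry2.crystal)

# `similarity` is a score derived from `result`, from 0 (dissimilar) to 1 (identical);
# how it is computed is not shown here
if similarity > threshold:
    graph.add_edge(refcode1, refcode2)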

Cluster Output:

family_id    refcode     cluster_id
ACSALA       ACSALA      1
ACSALA       ACSALA01    1
ACSALA       ACSALA02    1
...
ACSALA       ACSALA13    2
ACSALA       ACSALA15    2
ACSALA       ACSALA17    2
...
ACSALA       ACSALA23    3
ACSALA       ACSALA24    3
...
BENZEN       BENZEN      1
BENZEN       BENZEN01    1
BENZEN       BENZEN02    1
...
BENZEN       BENZEN03    2
BENZEN       BENZEN04    2
BENZEN       BENZEN16    2
...
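
The step from similarity graph to cluster_id can be sketched without any dependencies. The module lists networkx as a dependency, so it presumably uses that library's connected-components routine; the depth-first search below is a stand-in chosen so the snippet runs anywhere, and the edges are hypothetical similarity results rather than real CSD comparisons.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Return clusters (sorted lists of nodes) of an undirected graph."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    clusters = []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first search collects everything reachable from `start`
        stack = [start]
        seen.add(start)
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(sorted(component))
    return clusters


nodes = ["ACSALA", "ACSALA01", "ACSALA13", "ACSALA15", "ACSALA23"]
# Hypothetical pairs judged similar; a structure with no edge forms its own cluster
edges = [("ACSALA", "ACSALA01"), ("ACSALA13", "ACSALA15")]

clusters = connected_components(nodes, edges)
cluster_ids = {
    refcode: cluster_id
    for cluster_id, members in enumerate(clusters, start=1)
    for refcode in members
}
print(cluster_ids)
```

Because clusters are connected components, two structures can share a cluster through a chain of pairwise matches even if they were never matched to each other directly.
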
Performance Characteristics:
  • CPU Parallelization - Uses multiple cores for family processing

  • Memory Efficiency - Processes families independently

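  • Scalability - Families are processed independently, so work grows with the number of families; within a family, the number of pairwise comparisons grows quadratically with family size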

Returns:

pandas.DataFrame with clustered family assignments

Raises:
  • FileNotFoundError - If the initial CSV is missing.

  • RuntimeError - If clustering fails for any family.

CSDOperations._check_structure(identifier, filters, entry=None)

Validate a CSD entry against filter criteria.

Parameters:
  • identifier (str) – CSD refcode.

  • filters (Dict[str, Any]) – Validation criteria.

  • entry (io.Entry, optional) – Preloaded CSD entry. If None, loaded internally.

Returns:

True if the structure is valid, False otherwise.

Return type:

bool

Raises:

Exception – If validation fails unexpectedly.

Validate Structure Against Filter Criteria

Applies comprehensive quality filters to determine structure suitability for analysis.

Parameters:
  • identifier (str) - CSD refcode to validate

  • filters (Dict) - Validation criteria dictionary

  • entry (io.Entry, optional) - Pre-loaded CSD entry

Validation Categories:

Quality Filters:

  • Completeness requirements - Data collection completeness

Chemical Filters:

  • Element restrictions - Allowed atomic species

  • Molecular weight limits - Size constraints

  • Z’ value constraints - Asymmetric unit requirements

  • Crystal type requirements - Homomolecular vs. solvated

Structural Filters:

  • Disorder exclusion - Remove disordered structures

  • Polymer exclusion - Exclude polymeric materials

Example Filter Configuration:

filters = {
    "structure_list": ["csd-unique"],     # Use unique structures in CSD
    "crystal_type": ["homomolecular"],    # Homomolecular structures only
    "target_species": ["C", "H"],         # Hydrocarbons only
    "target_space_groups": ["P1", "P-1"], # Triclinic structures only
    "target_z_prime_values": [1],         # Z' = 1 only
    "molecule_weight_limit": 300.0,       # Small molecules
    "molecule_formal_charges": [0],       # Neutral molecules
    "unique_structures_clustering_method": "vdWFV", # Use van der Waals free volume to select unique structures from a cluster
}
Returns:

bool indicating if structure passes all validation criteria
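
How a filter dictionary of this shape might be applied can be sketched on plain data. This is a simplified illustration, not the module's validator: `StructureInfo` and `passes_filters` are hypothetical names, the refcodes and property values are invented, and the interpretation of each key (for example, treating `target_species` as the set of allowed elements) is an assumption. The real method reads these properties from a CCDC entry and applies more checks than are shown here.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class StructureInfo:
    """Hypothetical plain-data summary of a CSD entry."""
    refcode: str
    elements: List[str]
    space_group: str
    z_prime: float
    molecular_weight: float


def passes_filters(info: StructureInfo, filters: Dict[str, Any]) -> bool:
    """Return True only if every filter present in `filters` is satisfied."""
    species = filters.get("target_species")
    if species is not None and not set(info.elements) <= set(species):
        return False  # contains an element outside the allowed set

    groups = filters.get("target_space_groups")
    if groups is not None and info.space_group not in groups:
        return False

    z_primes = filters.get("target_z_prime_values")
    if z_primes is not None and info.z_prime not in z_primes:
        return False

    limit = filters.get("molecule_weight_limit")
    if limit is not None and info.molecular_weight > limit:
        return False

    return True


filters = {
    "target_species": ["C", "H"],
    "target_space_groups": ["P1", "P-1"],
    "target_z_prime_values": [1],
    "molecule_weight_limit": 300.0,
}

# Invented example records, not real CSD entries
hydrocarbon = StructureInfo("EXAMPL01", ["C", "H"], "P-1", 1, 128.2)
chlorinated = StructureInfo("EXAMPL02", ["C", "H", "Cl"], "P-1", 1, 112.6)

print(passes_filters(hydrocarbon, filters))
print(passes_filters(chlorinated, filters))
```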

Representative Selection Methods

CSDOperations.get_unique_structures(filters, method='vdWFV')

Select one representative per cluster using the vdWFV metric.

Workflow

  1. Load clustered families CSV.

  2. Group by [‘family_id’, ‘cluster_id’].

  3. Compute vdWFV for each refcode; select the minimum.

  4. Save unique representatives to CSV.

Parameters:
  • filters (Dict[str, Any]) – Placeholder for revalidation filters.

  • method (str, default="vdWFV") – Only ‘vdWFV’ is supported.

Returns:

DataFrame with columns [‘family_id’, ‘refcode’].

Return type:

pd.DataFrame

Raises:
  • FileNotFoundError – If the clustered CSV is missing.

  • NotImplementedError – If method is not ‘vdWFV’.

Select Representative Structures from Clusters

Chooses one representative structure from each cluster using the vdWFV (van der Waals free volume) metric.

Selection Algorithm:

The vdWFV method computes vdWFV for every refcode in a cluster and selects the structure with the minimum value, as stated in the workflow above.

Parameters:
  • method (str) - Selection method (“vdWFV” only supported)

Output Files:
  • CSV - {prefix}_refcode_families_unique.csv

  • Structure Directory - Individual CIF files for representatives

Returns:

pandas.DataFrame with selected representative structures
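
The documented workflow (group by family_id and cluster_id, compute vdWFV, select the minimum) can be sketched on plain tuples. This is a minimal illustration under stated assumptions: `select_representatives` is a hypothetical name, the vdWFV values are invented placeholders, and the real method computes the metric from CSD entries and returns a pandas DataFrame.

```python
from collections import defaultdict


def select_representatives(rows, vdwfv, method="vdWFV"):
    """Pick the refcode with the smallest vdWFV in each (family, cluster) group."""
    if method != "vdWFV":
        raise NotImplementedError(f"Unsupported method: {method!r}")

    clusters = defaultdict(list)
    for family_id, refcode, cluster_id in rows:
        clusters[(family_id, cluster_id)].append(refcode)

    return [
        (family_id, min(members, key=lambda rc: vdwfv[rc]))
        for (family_id, _cluster_id), members in clusters.items()
    ]


rows = [
    ("ACSALA", "ACSALA", 1),
    ("ACSALA", "ACSALA01", 1),
    ("ACSALA", "ACSALA13", 2),
    ("ACSALA", "ACSALA15", 2),
]
# Invented vdWFV values, used only to exercise the selection logic
vdwfv = {"ACSALA": 0.31, "ACSALA01": 0.29, "ACSALA13": 0.33, "ACSALA15": 0.35}

representatives = select_representatives(rows, vdwfv)
print(representatives)
```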

Raises:
  • FileNotFoundError - If the clustered CSV is missing.

  • NotImplementedError - If method is not "vdWFV".

CSDOperations._save_unique_structures(df)

Save unique structure representatives to CSV.

Parameters:

df (pd.DataFrame) – DataFrame with columns [‘family_id’, ‘refcode’].

Raises:

OSError – If file writing fails.

Save Representative Structures to CSV

Persists the selected unique structures for downstream processing.

Parameters:
  • df (pandas.DataFrame) - DataFrame with columns ['family_id', 'refcode']

Output Format:

family_id,refcode
ACSALA,ACSALA13
ACSALA,ACSALA24
ACSALA,ACSALA35
BENZEN,BENZEN22
BENZEN,BENZEN24
...

File Location:

{data_directory}/{data_prefix}_refcode_families_unique.csv

Utility Functions

_process_single_family

csd_operations._process_single_family(args)

Validate and cluster a single refcode family by packing similarity.

Parameters:

args (Tuple[str, List[str], Dict[str, Any]]) –

  • family_id : str

  • structures : List[str]

  • filters : Dict of validation criteria

Returns:

family_id and list of clusters (each a list of refcodes).

Return type:

Tuple[str, List[List[str]]]

Raises:

Exception – If any error occurs during processing.

Process Individual Family for Clustering

Worker function for parallel clustering of structure families.

Parameters:
  • args (Tuple[str, List[str], Dict]) - (family_id, refcodes, filters)

Processing Steps:
  1. Validation - Check each structure against filters

  2. Similarity Matrix - Compute all pairwise similarities

  3. Graph Construction - Build similarity network

  4. Clustering - Identify connected components

  5. Result Packaging - Return cluster assignments

Returns:

Tuple[str, List[List[str]]] - (family_id, list of clusters)
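
The single-tuple argument matches the usual pattern for pool-based mapping, where each task is passed to the worker as one object. The sketch below illustrates that pattern only: `process_single_family` here is a placeholder that puts every refcode in its own cluster, and a thread pool is used so the snippet runs anywhere, whereas CPU-bound work of the kind described above would more commonly use a process pool.

```python
from concurrent.futures import ThreadPoolExecutor


def process_single_family(args):
    """Placeholder worker: unpack the task tuple and return (family_id, clusters)."""
    family_id, refcodes, filters = args
    # Stand-in for validation and clustering: one cluster per refcode
    clusters = [[refcode] for refcode in refcodes]
    return family_id, clusters


tasks = [
    ("ACSALA", ["ACSALA", "ACSALA01"], {}),
    ("BENZEN", ["BENZEN"], {}),
]

# map() preserves task order, so results line up with `tasks`
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(process_single_family, tasks))

# Flatten to (family_id, refcode, cluster_id) rows, as in the clustered CSV
rows = [
    (family_id, refcode, cluster_id)
    for family_id, clusters in results
    for cluster_id, members in enumerate(clusters, start=1)
    for refcode in members
]
print(rows)
```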

_representative_for_cluster

csd_operations._representative_for_cluster(args)

Select the refcode with minimal vdWFV in a cluster.

Parameters:

args (Tuple[str, List[str]]) –

  • family_id : str

  • cluster : List[str]

Returns:

family_id and representative refcode.

Return type:

Tuple[str, str]

Raises:

Exception – If any lookup fails.

Select Representative from Single Cluster

Worker function for parallel representative selection.

Selection Process:
  1. Load Structures - Access CSD entries for cluster members

  2. Compute Metrics - Calculate vdWFV for each structure

  3. Comparison - Compare vdWFV values across cluster members

  4. Representative Selection - Choose the structure with the minimal vdWFV

Returns:

Tuple[str, str] - (family_id, representative_refcode)

See Also

  • crystal_analyzer module : Pipeline orchestration

  • structure_data_extractor module : Raw data extraction

  • ../validation/csd_structure_validator : Structure validation

  • geometry_utils module : Geometric calculations