csa_config.ExtractionConfig

class csa_config.ExtractionConfig(data_directory, data_prefix, actions, filters, extraction_batch_size, post_extraction_batch_size)

Configuration settings for the data-extraction pipeline.

Parameters:
  • data_directory (Path) – Directory under which all raw and intermediate extraction outputs will be stored. Subdirectories (e.g. “structures/”, “csv/”) are created automatically.

  • data_prefix (str) – Prefix used when naming output files, for example "{data_prefix}_refcode_families.csv".

  • actions (Dict[str, bool]) – Flags to enable or skip individual extraction substeps:
      - get_refcode_families
      - cluster_refcode_families
      - get_unique_structures
      - get_structure_data
      - post_extraction_process

  • filters (Dict[str, Any]) – Criteria for filtering CSD entries, for example:
      - elements (List[str]): only structures containing these elements
      - min_resolution (float): only structures with resolution ≤ this value
      - space_groups (List[str]): only structures in these space groups

  • extraction_batch_size (int) – Number of structures or refcode families to process per batch during extraction.

  • post_extraction_batch_size (int) – Number of structures to process per batch during post-extraction.
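
The parameters above can be illustrated with a minimal sketch. Because csa_config may not be installed, the example uses a hypothetical stand-in dataclass (ExtractionConfigSketch) that only mirrors the documented field names and types; it is not the library class. All values (path, prefix, element list, batch sizes) are illustrative assumptions.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Dict


@dataclass
class ExtractionConfigSketch:
    """Hypothetical stand-in mirroring the documented fields; not csa_config's class."""

    data_directory: Path
    data_prefix: str
    actions: Dict[str, bool]
    filters: Dict[str, Any]
    extraction_batch_size: int
    post_extraction_batch_size: int


config = ExtractionConfigSketch(
    data_directory=Path("data/run_01"),  # illustrative path
    data_prefix="example",               # illustrative prefix
    actions={
        "get_refcode_families": True,
        "cluster_refcode_families": True,
        "get_unique_structures": True,
        "get_structure_data": True,
        "post_extraction_process": False,  # skip this substep
    },
    filters={
        "elements": ["C", "H", "N", "O"],
        "min_resolution": 1.5,             # illustrative value
        "space_groups": ["P21/c", "P-1"],
    },
    extraction_batch_size=500,
    post_extraction_batch_size=1000,
)

# File name pattern quoted in the data_prefix description.
families_csv = f"{config.data_prefix}_refcode_families.csv"

# Substeps whose flag is True, in the order they were declared.
enabled_steps = [name for name, enabled in config.actions.items() if enabled]
```

With the real class, the same keyword arguments would presumably be passed to csa_config.ExtractionConfig, since the documented signature lists the same six parameters.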

classmethod from_json(json_path)

Load and validate fields from the “extraction” section of a JSON file.
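
As a rough sketch of what loading and validating an “extraction” section could look like, the example below writes a small JSON file and reads it back. The helper load_extraction_section, the REQUIRED_FIELDS table, the exception types, and the file contents are all hypothetical; the validation rules that from_json actually applies are not documented here.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Union

# Assumed field names and JSON types, taken from the documented parameters.
REQUIRED_FIELDS = {
    "data_directory": str,
    "data_prefix": str,
    "actions": dict,
    "filters": dict,
    "extraction_batch_size": int,
    "post_extraction_batch_size": int,
}


def load_extraction_section(json_path: Union[str, Path]) -> Dict[str, Any]:
    """Hypothetical helper: read and type-check the "extraction" section."""
    with open(json_path, "r", encoding="utf-8") as handle:
        document = json.load(handle)
    if "extraction" not in document:
        raise KeyError('missing "extraction" section')
    section = document["extraction"]
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in section:
            raise KeyError(f"missing field: {name}")
        if not isinstance(section[name], expected_type):
            raise TypeError(f"{name} must be of type {expected_type.__name__}")
    fields = dict(section)
    # JSON has no path type, so convert the string to a Path here.
    fields["data_directory"] = Path(fields["data_directory"])
    return fields


example_document = {
    "extraction": {
        "data_directory": "data/run_01",
        "data_prefix": "example",
        "actions": {"get_refcode_families": True, "post_extraction_process": False},
        "filters": {"elements": ["C", "H", "N", "O"]},
        "extraction_batch_size": 500,
        "post_extraction_batch_size": 1000,
    }
}

with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(json.dumps(example_document), encoding="utf-8")
    fields = load_extraction_section(config_path)
```

Keeping the settings under a named top-level section allows one JSON file to hold configuration for other pipeline stages as well.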

__init__(data_directory, data_prefix, actions, filters, extraction_batch_size, post_extraction_batch_size)

Methods

__init__(data_directory, data_prefix, ...)

from_json(json_path)

Load an ExtractionConfig from a JSON file.

Attributes

data_directory

data_prefix

actions

filters

extraction_batch_size

post_extraction_batch_size

data_directory: Path
data_prefix: str
actions: Dict[str, bool]
filters: Dict[str, Any]
extraction_batch_size: int
post_extraction_batch_size: int

classmethod from_json(json_path)

Load an ExtractionConfig from a JSON file.

Parameters:

json_path (Union[str, Path]) – Path to the JSON configuration file.

Returns:

Instance populated from the “extraction” section.

Return type:

ExtractionConfig

Raises: