focal.data_processing module

Data processing module for the Fiber Cleave Processing application.

This module provides classes for loading, preprocessing, and organizing data for training CNN and MLP models for fiber cleave analysis.

class focal.data_processing.BadCleaveTensionClassifier(csv_path: str, img_folder: str, tension_threshold: int, backbone: str | None = 'efficientnet', encoder_path: str | None = None, classification_type: str | None = 'binary')

Bases: DataCollector

extract_data()

Extract data from DataFrame into separate arrays for model training.

Returns:

Tuple of (images, features, labels) arrays

class focal.data_processing.DataCollector(csv_path: str, img_folder: str, angle_threshold: float, diameter_threshold: float, classification_type: str | None = 'binary', backbone: str | None = 'mobilenet', set_mask: str | None = 'n', encoder_path: str | None = None)

Bases: object

Class for collecting and preprocessing data from CSV files and image folders.

This class handles loading cleave metadata from CSV files, processing images, and creating TensorFlow datasets for training machine learning models.

create_custom_dataset(image_shape: Tuple[int, int, int], test_size: float = 0.2, buffer_size: int = 32, batch_size: int = 16) → Tuple[DatasetV2, DatasetV2]

Create datasets using only grayscale images and labels with a custom image shape.

Parameters:

image_shape – Desired image shape (height, width, channels)
test_size – Fraction of data to use for testing
buffer_size – Buffer size for shuffling
batch_size – Batch size for training

Returns:

Tuple of (train_ds, test_ds)

create_datasets(images: ndarray, features: ndarray, labels: ndarray, test_size: float, buffer_size: int, batch_size: int, train_p: float, test_p: float, feature_scaler_path: str | None = None) → Tuple[DatasetV2, DatasetV2, dict[int, float] | None]

Create train and test datasets with feature scaling.

Parameters:

images – Array of image paths
features – Array of numerical features
labels – Array of target labels
test_size – Fraction of data to use for testing
buffer_size – Buffer size for dataset shuffling
batch_size – Batch size for training
train_p – Masking probability for training
test_p – Masking probability for testing
feature_scaler_path – Optional path to save feature scaler

Returns:

Tuple of (train_ds, test_ds) followed by a third element of type dict[int, float] | None, as declared in the signature

create_kfold_datasets(images: ndarray, features: ndarray, labels: ndarray, buffer_size: int, batch_size: int, train_p: float, test_p: float, n_splits: int = 5) → List[Tuple[DatasetV2, DatasetV2]]

Create datasets based on stratified k-fold cross validation.

Parameters:

images – Array of image paths
features – Array of numerical features
labels – Array of target labels
buffer_size – Buffer size for dataset shuffling
batch_size – Batch size for training
train_p – Masking probability for training
test_p – Masking probability for testing
n_splits – Number of k-fold splits

Returns:

List of (train_ds, test_ds) tuples for each fold
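Stratified splitting keeps each fold's class ratio close to that of the full label array. The sketch below illustrates the idea with plain Python lists; it is not the module's implementation. The function name stratified_kfold_indices is hypothetical, and the sketch does not shuffle. A library routine such as scikit-learn's StratifiedKFold would normally be used for this; whether this module uses it is not stated here.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_kfold_indices(
    labels: Sequence[int], n_splits: int = 5
) -> List[Tuple[List[int], List[int]]]:
    """Return one (train_idx, test_idx) pair per fold, preserving class ratios."""
    # Group sample indices by class label.
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal each class's indices round-robin across the folds so every fold
    # receives a near-equal share of every class.
    folds: List[List[int]] = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)

    # Fold k is the test set; all other folds together form the training set.
    splits: List[Tuple[List[int], List[int]]] = []
    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != k for idx in fold
        )
        splits.append((train_idx, test_idx))
    return splits
```

Each index pair would then be used to select the matching rows of images, features, and labels before building the per-fold datasets.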

property df: DataFrame | None

Loaded lazily for memory efficiency.

extract_data() → Tuple[ndarray, ndarray, ndarray]

Extract data from DataFrame into separate arrays for model training.

Returns:

Tuple of (images, features, labels) arrays

get_backbone_preprocessor(backbone: str)

Return the preprocessing function for the specified backbone model.

Parameters:

backbone (str) – Name of the backbone to use. Must be one of “mobilenet”, “resnet”, or “efficientnet”.

Returns:

The preprocess_input function tied to the chosen backbone.

Return type:

Callable

Raises:

ValueError – If backbone is not one of the supported options.
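The lookup-and-validate behaviour described for get_backbone_preprocessor can be sketched as a dictionary dispatch. This is a minimal illustration, not the module's code: the three placeholder functions stand in for each backbone's real preprocess_input, their arithmetic is arbitrary, and the name get_preprocessor_sketch is hypothetical.

```python
from typing import Callable, Dict, List

Pixels = List[float]


# Placeholder preprocessors (assumptions for illustration only). The real
# method returns the preprocess_input function of the chosen backbone.
def _mobilenet_placeholder(pixels: Pixels) -> Pixels:
    return [p / 127.5 - 1.0 for p in pixels]


def _resnet_placeholder(pixels: Pixels) -> Pixels:
    return [p - 128.0 for p in pixels]


def _efficientnet_placeholder(pixels: Pixels) -> Pixels:
    return list(pixels)


_PREPROCESSORS: Dict[str, Callable[[Pixels], Pixels]] = {
    "mobilenet": _mobilenet_placeholder,
    "resnet": _resnet_placeholder,
    "efficientnet": _efficientnet_placeholder,
}


def get_preprocessor_sketch(backbone: str) -> Callable[[Pixels], Pixels]:
    """Return the callable registered for `backbone`, or raise ValueError."""
    try:
        return _PREPROCESSORS[backbone]
    except KeyError:
        supported = ", ".join(sorted(_PREPROCESSORS))
        raise ValueError(
            f"Unsupported backbone {backbone!r}; expected one of: {supported}"
        ) from None
```

Keeping the supported names in a single mapping means the error message and the accepted values cannot drift apart.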

image_only_dataset(original_dataset: DatasetV2) → DatasetV2

Convert dataset to image-only format (remove feature inputs).

Parameters:

original_dataset – Original dataset with (image, features) inputs

Returns:

Dataset with only image inputs

Return type:

tf.data.Dataset
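Dropping the feature input amounts to a per-element mapping. The sketch below shows that mapping on plain Python iterables, assuming each element has the structure ((image, features), label); that structure and the name image_only are assumptions, not taken from the module. With a tf.data.Dataset of the same structure, the equivalent step would be a Dataset.map call that keeps inputs[0] and the label.

```python
from typing import Any, Iterable, Iterator, Tuple

# Assumed element structure: ((image, features), label).
Element = Tuple[Tuple[Any, Any], Any]


def image_only(elements: Iterable[Element]) -> Iterator[Tuple[Any, Any]]:
    """Yield (image, label) pairs, discarding the feature input."""
    for (image, _features), label in elements:
        yield image, label


# Usage with stand-in values; real elements would hold tensors.
sample = [(("img_a", [0.1, 0.2]), 0), (("img_b", [0.3, 0.4]), 1)]
image_label_pairs = list(image_only(sample))
```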

load_process_images(filename: str) → Tensor

Load and preprocess image from file path.

Parameters:

filename – Image filename or path

Returns:

Preprocessed image tensor

Return type:

tf.Tensor

save_scaler_encoder(obj: object, filepath: str) → None

Save a scaler or encoder to disk for future use.

Parameters:

obj – Scaler or encoder object to save
filepath – Path to save the scaler or encoder to

class focal.data_processing.MLPDataCollector(csv_path: str, img_folder: str, angle_threshold: float, diameter_threshold: float, backbone: str | None = None)

Bases: DataCollector

Data collector specifically for MLP regression models.

This class handles data preparation for tension prediction models, including proper scaling of both features and labels.

create_datasets(images: ndarray, features: ndarray, labels: ndarray, test_size: float, buffer_size: int, batch_size: int, feature_scaler_path: str | None = None, tension_scaler_path: str | None = None) → Tuple[DatasetV2, DatasetV2]

Create train and test datasets for MLP regression with proper scaling.

Parameters:

images – Array of image paths
features – Array of numerical features
labels – Array of tension values
test_size – Fraction of data to use for testing
buffer_size – Buffer size for dataset shuffling
batch_size – Batch size for training
feature_scaler_path – Optional path to save feature scaler
tension_scaler_path – Optional path to save tension scaler

Returns:

Tuple of (train_ds, test_ds)

create_kfold_datasets(images: ndarray, features: ndarray, labels: ndarray, buffer_size: int, batch_size: int, n_splits: int = 5) → Tuple[List[Tuple[DatasetV2, DatasetV2]], MinMaxScaler]

Create k-fold datasets for MLP regression with proper scaling.

Parameters:

images – Array of image paths
features – Array of numerical features
labels – Array of tension values
buffer_size – Buffer size for dataset shuffling
batch_size – Batch size for training
n_splits – Number of k-fold splits

Returns:

Tuple of (datasets, label_scaler)
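The signature returns the label scaler as a MinMaxScaler alongside the datasets. One plausible use, sketched below, is mapping scaled predictions back to tension units. MinMaxSketch is a hypothetical pure-Python stand-in for the scaler, and the tension values are made up for illustration.

```python
from typing import List, Sequence


class MinMaxSketch:
    """Pure-Python stand-in for a min-max scaler over one column of values."""

    def fit(self, values: Sequence[float]) -> "MinMaxSketch":
        self.lo = min(values)
        self.hi = max(values)
        # Guard against a zero range when all values are identical.
        self.span = (self.hi - self.lo) or 1.0
        return self

    def transform(self, values: Sequence[float]) -> List[float]:
        # Map [lo, hi] onto [0, 1].
        return [(v - self.lo) / self.span for v in values]

    def inverse_transform(self, scaled: Sequence[float]) -> List[float]:
        # Map [0, 1] back onto the original units.
        return [s * self.span + self.lo for s in scaled]


# Illustrative tension labels (made-up numbers, arbitrary units).
train_tensions = [100.0, 150.0, 200.0]
label_scaler = MinMaxSketch().fit(train_tensions)

scaled_labels = label_scaler.transform([125.0])
recovered = label_scaler.inverse_transform([0.5])
```

Fitting the scaler on training labels only, and reusing it for held-out data, keeps information from the test fold out of the scaling step.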

extract_data() → Tuple[ndarray, ndarray, ndarray]

Extract data for MLP regression (tension prediction).

Returns:

Tuple of (images, features, labels) arrays