focal.data_processing module

Data processing module for the Fiber Cleave Processing application.

This module provides classes that load, preprocess, and organize the data used to train CNN and MLP models for fiber cleave analysis.

class focal.data_processing.BadCleaveTensionClassifier(csv_path: str, img_folder: str, tension_threshold: int, backbone: str | None = 'efficientnet', encoder_path: str | None = None, classification_type: str | None = 'binary')

Bases: DataCollector

extract_data()

Extract data from DataFrame into separate arrays for model training.

Returns:

Tuple of (images, features, labels) arrays

class focal.data_processing.DataCollector(csv_path: str, img_folder: str, angle_threshold: float, diameter_threshold: float, classification_type: str | None = 'binary', backbone: str | None = 'mobilenet', set_mask: str | None = 'n', encoder_path: str | None = None)

Bases: object

Class for collecting and preprocessing data from CSV files and image folders.

This class handles loading cleave metadata from CSV files, processing images, and creating TensorFlow datasets for training machine learning models.

create_custom_dataset(image_shape: Tuple[int, int, int], test_size: float = 0.2, buffer_size: int = 32, batch_size: int = 16) → Tuple[DatasetV2, DatasetV2]

Create datasets using only grayscale images and labels with a custom image shape.

Parameters:
  • image_shape – Desired image shape (height, width, channels)

  • test_size – Fraction of data to use for testing

  • buffer_size – Buffer size for shuffling

  • batch_size – Batch size for training

Returns:

Tuple of (train_ds, test_ds)

create_datasets(images: ndarray, features: ndarray, labels: ndarray, test_size: float, buffer_size: int, batch_size: int, train_p: float, test_p: float, feature_scaler_path: str | None = None) → Tuple[DatasetV2, DatasetV2, dict[int, float] | None]

Create train and test datasets with feature scaling.

Parameters:
  • images – Array of image paths

  • features – Array of numerical features

  • labels – Array of target labels

  • test_size – Fraction of data to use for testing

  • buffer_size – Buffer size for dataset shuffling

  • batch_size – Batch size for training

  • train_p – Masking probability for training

  • test_p – Masking probability for testing

  • feature_scaler_path – Optional path to save the feature scaler

Returns:

Tuple of (train_ds, test_ds) followed by a third element, a dict[int, float] or None, as declared in the signature
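The split-then-scale step described above can be sketched without TensorFlow. This is a minimal illustration, not the module's implementation: the helper name split_and_scale, the seed parameter, and the choice of standardization (zero mean, unit variance) are all assumptions, since the documentation does not say which scaler is used. The point it shows is that scaling statistics are computed from the training rows only and then applied to both splits.

```python
import random
from statistics import mean, pstdev


def split_and_scale(features, labels, test_size=0.2, seed=0):
    """Shuffle, split by test_size, and standardize with train-only statistics.

    Hypothetical helper: the real create_datasets may use a different scaler.
    """
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    n_cols = len(features[0])
    # Per-column statistics come from the training rows only.
    mus = [mean(features[i][c] for i in train_idx) for c in range(n_cols)]
    sigmas = [pstdev([features[i][c] for i in train_idx]) or 1.0
              for c in range(n_cols)]

    def scale(row):
        return [(v - m) / s for v, m, s in zip(row, mus, sigmas)]

    train = [(scale(features[i]), labels[i]) for i in train_idx]
    test = [(scale(features[i]), labels[i]) for i in test_idx]
    return train, test
```

Fitting on the training split alone keeps information about the test rows out of the scaling parameters.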

create_kfold_datasets(images: ndarray, features: ndarray, labels: ndarray, buffer_size: int, batch_size: int, train_p: float, test_p: float, n_splits: int = 5) → List[Tuple[DatasetV2, DatasetV2]]

Create datasets based on stratified k-fold cross validation.

Parameters:
  • images – Array of image paths

  • features – Array of numerical features

  • labels – Array of target labels

  • buffer_size – Buffer size for dataset shuffling

  • batch_size – Batch size for training

  • train_p – Masking probability for training

  • test_p – Masking probability for testing

  • n_splits – Number of k-fold splits

Returns:

List of (train_ds, test_ds) tuples, one per fold
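To make "stratified" concrete, here is a dependency-free sketch of how fold indices can be assigned so that every fold keeps roughly the same class proportions. The function name stratified_folds is hypothetical and the round-robin assignment is only one simple way to stratify. Library routines such as scikit-learn's StratifiedKFold do the same job, and the documentation does not say which approach the module uses.

```python
from collections import defaultdict


def stratified_folds(labels, n_splits=5):
    """Yield (train_idx, test_idx) pairs with class proportions preserved.

    Illustrative only; not taken from the module's source.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for members in by_class.values():
        # Deal each class's samples round-robin across the folds.
        for position, idx in enumerate(members):
            folds[position % n_splits].append(idx)

    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(
            i for j, fold in enumerate(folds) if j != k for i in fold
        )
        yield train_idx, test_idx
```

Each yielded index pair would then be used to build one (train_ds, test_ds) tuple.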

property df: DataFrame | None

The cleave metadata DataFrame, loaded lazily on first access for memory efficiency.
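The lazy-loading pattern behind a property like this can be sketched as follows. The class LazyTable and its attributes are hypothetical stand-ins. The sketch uses the standard-library csv module and a list of dicts in place of a pandas DataFrame so that it runs without extra dependencies.

```python
import csv
import io


class LazyTable:
    """Reads a CSV source the first time `rows` is accessed, then caches it."""

    def __init__(self, open_source):
        self._open_source = open_source  # callable returning a text file object
        self._rows = None                # nothing is loaded at construction
        self.load_count = 0

    @property
    def rows(self):
        if self._rows is None:
            with self._open_source() as handle:
                self._rows = list(csv.DictReader(handle))
            self.load_count += 1
        return self._rows


# Usage: the source is read once, on the first access only.
table = LazyTable(lambda: io.StringIO("image,angle\na.png,0.3\n"))
first = table.rows
second = table.rows
```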

extract_data() → Tuple[ndarray, ndarray, ndarray]

Extract data from DataFrame into separate arrays for model training.

Returns:

Tuple of (images, features, labels) arrays

get_backbone_preprocessor(backbone: str)

Return the preprocessing function for the specified backbone model.

Parameters:

backbone (str) – Name of the backbone to use. Must be one of:
  • “mobilenet”

  • “resnet”

  • “efficientnet”

Returns:

The preprocess_input function tied to the chosen backbone.

Return type:

Callable

Raises:

ValueError – If backbone is not one of the supported options.
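A name-to-function dispatch of this kind might look like the sketch below. The function name pick_preprocessor is hypothetical. The specific Keras variants chosen (mobilenet_v2, resnet50, efficientnet) are assumptions, since the documentation names only the backbone families. TensorFlow is imported after validation so that an unsupported name fails fast.

```python
from typing import Callable

SUPPORTED_BACKBONES = ("mobilenet", "resnet", "efficientnet")


def pick_preprocessor(backbone: str) -> Callable:
    """Return a Keras preprocess_input function for the given backbone name.

    Hypothetical sketch; the exact model variants are assumptions.
    """
    if backbone not in SUPPORTED_BACKBONES:
        raise ValueError(
            f"Unsupported backbone {backbone!r}; "
            f"expected one of {SUPPORTED_BACKBONES}"
        )

    import tensorflow as tf  # deferred: only needed for a valid name

    apps = tf.keras.applications
    table = {
        "mobilenet": apps.mobilenet_v2.preprocess_input,
        "resnet": apps.resnet50.preprocess_input,
        "efficientnet": apps.efficientnet.preprocess_input,
    }
    return table[backbone]
```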

image_only_dataset(original_dataset: DatasetV2) → DatasetV2

Convert dataset to image-only format (remove feature inputs).

Parameters:

original_dataset – Original dataset with (image, features) inputs

Returns:

Dataset with only image inputs

Return type:

tf.data.Dataset
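With tf.data, dropping the feature input is typically a single map call. The sketch below assumes each element is structured as ((image, features), label); that layout is inferred from the parameter description and is not confirmed by the source. The function name to_image_only is hypothetical.

```python
def to_image_only(dataset):
    """Drop the feature input so each element becomes (image, label).

    Assumes elements are structured as ((image, features), label).
    Works with any object exposing a tf.data-style map method.
    """
    return dataset.map(lambda inputs, label: (inputs[0], label))


# Usage, given a dataset built elsewhere:
# train_images = to_image_only(train_ds)
```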

load_process_images(filename: str) → Tensor

Load and preprocess an image from a file path.

Parameters:

filename – Image filename or path

Returns:

Preprocessed image tensor

Return type:

tf.Tensor

save_scaler_encoder(obj: object, filepath: str) → None

Save a scaler or encoder to disk for future use.

Parameters:
  • obj – Scaler or encoder object to save

  • filepath – Path where the scaler or encoder is saved
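Persisting a fitted scaler or encoder usually comes down to serializing the Python object. The sketch below uses the standard-library pickle module; whether the module itself uses pickle, joblib, or something else is not stated, so treat the format as an assumption. The helper names save_object and load_object are hypothetical.

```python
import pickle
from pathlib import Path


def save_object(obj, filepath):
    """Serialize obj to filepath, creating parent directories if needed."""
    path = Path(filepath)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(obj, handle)


def load_object(filepath):
    """Load an object previously written by save_object."""
    with open(filepath, "rb") as handle:
        return pickle.load(handle)
```

Reloading the same fitted object at inference time keeps preprocessing identical to what the model saw during training. Only unpickle files from trusted sources.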

class focal.data_processing.MLPDataCollector(csv_path: str, img_folder: str, angle_threshold: float, diameter_threshold: float, backbone: str | None = None)

Bases: DataCollector

Data collector specifically for MLP regression models.

This class handles data preparation for tension prediction models, including proper scaling of both features and labels.

create_datasets(images: ndarray, features: ndarray, labels: ndarray, test_size: float, buffer_size: int, batch_size: int, feature_scaler_path: str | None = None, tension_scaler_path: str | None = None) → Tuple[DatasetV2, DatasetV2]

Create train and test datasets for MLP regression with proper scaling.

Parameters:
  • images – Array of image paths

  • features – Array of numerical features

  • labels – Array of tension values

  • test_size – Fraction of data to use for testing

  • buffer_size – Buffer size for dataset shuffling

  • batch_size – Batch size for training

  • feature_scaler_path – Optional path to save feature scaler

  • tension_scaler_path – Optional path to save tension scaler

Returns:

Tuple of (train_ds, test_ds)

create_kfold_datasets(images: ndarray, features: ndarray, labels: ndarray, buffer_size: int, batch_size: int, n_splits: int = 5) → Tuple[List[Tuple[DatasetV2, DatasetV2]], MinMaxScaler]

Create k-fold datasets for MLP regression with proper scaling.

Parameters:
  • images – Array of image paths

  • features – Array of numerical features

  • labels – Array of tension values

  • buffer_size – Buffer size for dataset shuffling

  • batch_size – Batch size for training

  • n_splits – Number of k-fold splits

Returns:

Tuple of (datasets, label_scaler)
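The label scaler is returned so that predictions made in scaled space can be mapped back to tension values. The signature names MinMaxScaler; the sketch below is a hand-rolled, dependency-free equivalent of min-max scaling for a flat list of labels, written only to show the transform and its inverse. The class name MinMaxLabelScaler and the sample numbers are hypothetical.

```python
class MinMaxLabelScaler:
    """Minimal min-max scaler for a flat list of label values."""

    def fit(self, values):
        self.low, self.high = min(values), max(values)
        # Guard against a zero range when all labels are equal.
        self.span = (self.high - self.low) or 1.0
        return self

    def transform(self, values):
        return [(v - self.low) / self.span for v in values]

    def inverse_transform(self, scaled):
        return [s * self.span + self.low for s in scaled]


# Usage with made-up tension values:
scaler = MinMaxLabelScaler().fit([100.0, 150.0, 200.0])
scaled_labels = scaler.transform([100.0, 150.0, 200.0])
recovered = scaler.inverse_transform([0.25])
```

A model trained on scaled labels outputs values in the scaled range, so inverse_transform is the step that turns a prediction back into a tension value.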

extract_data() → Tuple[ndarray, ndarray, ndarray]

Extract data for MLP regression (tension prediction).

Returns:

Tuple of (images, features, labels) arrays