lnb package

classifiers

lnb.classifiers.drop_zero_cols(X_train: DataFrame, X_test: DataFrame | None = None) tuple

Drops columns from the input DataFrames where all column values are zero.

Parameters:
  • X_train (pd.DataFrame) – The training data.

  • X_test (pd.DataFrame, optional) – The test data.

Returns:

If X_test is not provided, returns the modified X_train with zero-sum columns dropped. If X_test is provided, returns both X_train and X_test with zero-sum columns dropped.

Return type:

pd.DataFrame or tuple of pd.DataFrame
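The behaviour described above can be sketched as follows. This is a minimal illustration, not the package's code: drop_zero_cols_sketch is a hypothetical name, and removing from X_test the columns that are all zero in X_train is an assumption.

```python
import pandas as pd


def drop_zero_cols_sketch(X_train, X_test=None):
    """Hypothetical stand-in: drop the columns that are all zero in X_train."""
    # Keep every column of X_train that holds at least one non-zero value.
    keep = [col for col in X_train.columns if (X_train[col] != 0).any()]
    if X_test is None:
        return X_train[keep]
    # Assumption: the same columns are dropped from X_test so both frames stay aligned.
    return X_train[keep], X_test[keep]


X_train = pd.DataFrame({"a": [1, 0, 2], "b": [0, 0, 0], "c": [0, 3, 0]})
X_test = pd.DataFrame({"a": [5], "b": [7], "c": [0]})

train_only = drop_zero_cols_sketch(X_train)
train_kept, test_kept = drop_zero_cols_sketch(X_train, X_test)
print(list(train_kept.columns), list(test_kept.columns))
```

Note the two return shapes: a single DataFrame when only X_train is passed, and a tuple when X_test is passed as well.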

lnb.classifiers.fit_classifier(X_train: DataFrame, y_train: DataFrame, model: str, cv=False)

Trains a classifier based on the specified model type using the provided training data.

Parameters:
  • X_train (pd.DataFrame) – The feature data for the training set.

  • y_train (pd.DataFrame) – The target labels for the training set.

  • model (str) – The name of the model type to train.

  • cv (bool, optional) – If True, performs cross-validation to find the best hyperparameters. If False, trains the model using fixed hyperparameters. Defaults to False.

Returns:

A trained classifier object based on the specified model.

Return type:

object

lnb.classifiers.fit_classifiers(X_train: DataFrame, y_train: DataFrame, models: list, cv=False)

Trains classifiers based on the specified model types using the provided training data.

Parameters:
  • X_train (pd.DataFrame) – The feature data for the training set.

  • y_train (pd.DataFrame) – The target labels for the training set.

  • models (list) – A list of model names (as strings) to be trained and validated. Supported models are: ‘logistic_regression’, ‘random_forest’, ‘mlp’.

  • cv (bool, optional) – If True, performs cross-validation to find the best hyperparameters. If False, trains the model using fixed hyperparameters. Defaults to False

Returns:

list of fitted classifiers

Return type:

list
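One way to picture the dispatch from model names to fitted estimators is the sketch below. The model names come from the parameter list above. Everything else is an assumption made for illustration: the estimator settings, the hyperparameter grids, the use of GridSearchCV and the name fit_classifiers_sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Hypothetical estimator factories and search grids, keyed by model name.
MODEL_SPECS = {
    "logistic_regression": (
        lambda: LogisticRegression(max_iter=1000),
        {"C": [0.1, 1.0, 10.0]},
    ),
    "random_forest": (
        lambda: RandomForestClassifier(n_estimators=20, random_state=0),
        {"max_depth": [2, None]},
    ),
    "mlp": (
        lambda: MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=0),
        {"alpha": [1e-4, 1e-2]},
    ),
}


def fit_classifiers_sketch(X_train, y_train, models, cv=False):
    """Fit one estimator per requested model name and return them as a list."""
    fitted = []
    for name in models:
        make_estimator, grid = MODEL_SPECS[name]
        if cv:
            # Cross-validated grid search; keep the refitted best estimator.
            search = GridSearchCV(make_estimator(), grid, cv=3)
            fitted.append(search.fit(X_train, y_train).best_estimator_)
        else:
            # Fixed hyperparameters.
            fitted.append(make_estimator().fit(X_train, y_train))
    return fitted


rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] > 0).astype(int)

classifiers = fit_classifiers_sketch(X, y, ["logistic_regression", "random_forest"])
cv_classifiers = fit_classifiers_sketch(X, y, ["logistic_regression"], cv=True)
```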

lnb.classifiers.fit_validate_classifiers(X_train: DataFrame, y_train: DataFrame, X_test: DataFrame, y_test: DataFrame, models: list, cv: bool = False) tuple

Trains and validates multiple classifiers on the provided training and test sets.

Parameters:
  • X_train (pd.DataFrame) – The feature data for the training set.

  • y_train (pd.DataFrame) – The target labels for the training set.

  • X_test (pd.DataFrame) – The feature data for the test set.

  • y_test (pd.DataFrame) – The target labels for the test set.

  • models (list) – A list of model names (as strings) to be trained and validated. Supported models are: 'logistic_regression', 'random_forest', 'mlp'.

  • cv (bool, optional) – If True, performs cross-validation during model training to tune hyperparameters. If False, trains the model using fixed hyperparameters. Defaults to False.

Returns:

A tuple containing:

  • A list of trained models.

  • A list of the results (training and test accuracy, AUC, etc.) for each model.

Return type:

tuple

lnb.classifiers.scale_features(X_train: DataFrame, X_test: DataFrame | None = None) tuple

Scales the features in X_train by standardizing them. If X_test is provided, it scales the test set using the same mean and standard deviation as X_train.

Parameters:
  • X_train (pd.DataFrame) – The training data

  • X_test (pd.DataFrame, optional) – The test data. Defaults to None.

Returns:

If X_test is not provided, returns the standardized X_train. If X_test is provided, returns both the standardized X_train and X_test.

Return type:

tuple
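The key point in the description is that the test set is scaled with the statistics of the training set, not its own. A minimal sketch of that idea, assuming a population standard deviation (ddof=0) and a hypothetical function name:

```python
import pandas as pd


def scale_features_sketch(X_train, X_test=None):
    """Hypothetical stand-in: standardize with the mean and std of X_train."""
    mean = X_train.mean()
    # Assumption: population standard deviation; constant columns are left unscaled.
    std = X_train.std(ddof=0).replace(0, 1.0)
    X_train_scaled = (X_train - mean) / std
    if X_test is None:
        return X_train_scaled
    # The test set reuses the training statistics.
    return X_train_scaled, (X_test - mean) / std


X_train = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 10.0, 40.0]})
X_test = pd.DataFrame({"x": [2.0], "y": [20.0]})

train_scaled, test_scaled = scale_features_sketch(X_train, X_test)
print(test_scaled)
```

Here the test row equals the training means, so it maps to zero in both columns.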

lnb.classifiers.train_LogisticRegression(X_train: DataFrame, y_train: DataFrame, cv: bool = False) LogisticRegression

Trains a logistic regression model using either a fixed regularization parameter or cross-validation.

Parameters:
  • X_train (pd.DataFrame) – The feature data for the training set.

  • y_train (pd.DataFrame) – The target labels for the training set.

  • cv (bool, optional) – If True, performs cross-validation to find the best regularization parameter. If False, trains the model using a fixed regularization parameter. Defaults to False.

Returns:

A trained logistic regression model.

Return type:

LogisticRegression

lnb.classifiers.train_MLP(X_train: DataFrame, y_train: DataFrame, cv: bool = False) MLPClassifier

Trains a Multi-layer Perceptron (MLP) classifier using either fixed hyperparameters or cross-validation for hyperparameter tuning.

Parameters:
  • X_train (pd.DataFrame) – The feature data for the training set.

  • y_train (pd.DataFrame) – The target labels for the training set.

  • cv (bool, optional) – If True, performs cross-validation to find the best hyperparameters. If False, trains the model using fixed hyperparameters. Defaults to False

Returns:

A trained MLP classifier.

Return type:

MLPClassifier

lnb.classifiers.train_RandomForest(X_train: DataFrame, y_train: DataFrame, cv: bool = False) RandomForestClassifier

Trains a random forest classifier using either fixed hyperparameters or cross-validation for hyperparameter tuning.

Parameters:
  • X_train (pd.DataFrame) – The feature data for the training set.

  • y_train (pd.DataFrame) – The target labels for the training set.

  • cv (bool, optional) – If True, performs cross-validation to find the best hyperparameters. If False, trains the model using fixed hyperparameters. Defaults to False

Returns:

A trained random forest classifier.

Return type:

RandomForestClassifier

lnb.classifiers.validate_clf(clf: ClassifierMixin, X_train: DataFrame, y_train: DataFrame, X_test: DataFrame, y_test: DataFrame) tuple

Evaluates the classifier’s performance on both the training and test datasets. Prints the accuracy and AUC (Area Under the Curve) for both training and test sets.

Parameters:
  • clf (ClassifierMixin) – The classifier to be evaluated (any estimator with predict and predict_proba methods).

  • X_train (pd.DataFrame) – The feature data for the training set.

  • y_train (pd.DataFrame) – The target labels for the training set.

  • X_test (pd.DataFrame) – The feature data for the test set.

  • y_test (pd.DataFrame) – The target labels for the test set.

Returns:

A tuple containing:

  • Training accuracy

  • Training AUC

  • Test accuracy

  • Test AUC (if computable)

Return type:

tuple
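A sketch of how the four returned metrics could be computed with scikit-learn is shown below. It uses NumPy arrays instead of DataFrames to stay short. The function name and the rule for when AUC is "computable" (both classes present in the labels) are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score


def validate_clf_sketch(clf, X_train, y_train, X_test, y_test):
    """Return (train accuracy, train AUC, test accuracy, test AUC)."""
    results = []
    for X, y in ((X_train, y_train), (X_test, y_test)):
        accuracy = accuracy_score(y, clf.predict(X))
        # Assumption: AUC is only computable when both classes occur in y.
        if len(np.unique(y)) == 2:
            auc = roc_auc_score(y, clf.predict_proba(X)[:, 1])
        else:
            auc = None
        results.extend([accuracy, auc])
    return tuple(results)


X_train = np.array([[0.0], [0.2], [0.8], [1.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.1], [0.9]])
y_test = np.array([0, 1])

clf = LogisticRegression().fit(X_train, y_train)
train_acc, train_auc, test_acc, test_auc = validate_clf_sketch(
    clf, X_train, y_train, X_test, y_test
)
print(train_acc, train_auc, test_acc, test_auc)
```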

data_prep

lnb.data_prep.discretize_dataset(df: DataFrame, columns: list) DataFrame

Convert the dataset to one where all categories in categorical columns are integers instead of class name strings

Parameters:
  • df (pd.DataFrame) – dataset to discretize

  • columns (list) – columns to discretize

Returns:

discretized dataset

Return type:

pd.DataFrame
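As an illustration of the conversion, the sketch below maps category strings to integer codes with pandas categoricals. The function name is hypothetical, and the numbering of categories in sorted order is an assumption, not necessarily the package's mapping.

```python
import pandas as pd


def discretize_sketch(df, columns):
    """Hypothetical stand-in: replace category strings with integer codes."""
    out = df.copy()
    for col in columns:
        # Assumption: categories are numbered in sorted order, starting at 0.
        out[col] = out[col].astype("category").cat.codes
    return out


df = pd.DataFrame({"colour": ["red", "blue", "red"], "age": [31, 45, 27]})
encoded = discretize_sketch(df, ["colour"])
print(encoded)
```

Columns that are not listed (here age) are left unchanged, and the input frame is not modified.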

lnb.data_prep.get_target_record(df: DataFrame, index: int) DataFrame

Given an index, return the 1-record dataframe corresponding to the index

Parameters:
  • df (pd.DataFrame) – dataset

  • index (int) – index of the record to return

Returns:

dataframe containing the record at the specified index of df

Return type:

pd.DataFrame

lnb.data_prep.load_data(path_to_data: str, path_to_metadata: str, cols_to_select: list = ['all'])

lnb.data_prep.normalize_cont_cols(df: DataFrame, meta_data: list, df_aux: DataFrame, types: tuple = ('Float',)) DataFrame

Normalize continuous columns

Parameters:
  • df (pd.DataFrame) – dataframe containing data to normalize

  • meta_data (list) – meta data

  • df_aux (pd.DataFrame) – auxiliary data based on which normalization is done

  • types (tuple, optional) – types of column to normalize, defaults to (“Float”,)

Returns:

normalized dataframe

Return type:

pd.DataFrame

lnb.data_prep.read_data(data_path: str, categorical_cols: list, continuous_cols: list) DataFrame

Read the CSV file at data_path and return a pandas DataFrame. If all columns are categorical, ensure that all column values are strings.

Parameters:
  • data_path (str) – path to data

  • categorical_cols (list) – names of categorical columns

  • continuous_cols (list) – names of continuous columns

Returns:

dataframe containing loaded data

Return type:

pd.DataFrame

lnb.data_prep.read_metadata(metadata_path: str) tuple

Read metadata from a JSON file (necessary for the reprosyn generators).

Parameters:

metadata_path (str) – path to metadata

Returns:

tuple containing metadata, categorical column names, and continuous column names

Return type:

tuple

lnb.data_prep.select_columns(df: DataFrame, categorical_cols: list, continuous_cols: list, cols_to_select: list, meta_data_og: list) tuple

Select specified columns of dataset and drop the rest.

Parameters:
  • df (pd.DataFrame) – dataset

  • categorical_cols (list) – names of categorical columns

  • continuous_cols (list) – names of continuous columns

  • cols_to_select (list) – columns to keep in dataframe. If “all” then all columns are kept.

  • meta_data_og (list) – metadata

Returns:

data with selected columns, categorical column names, continuous column names, metadata concerning selected columns.

Return type:

tuple

lnb.data_prep.split_data(df: DataFrame, path_to_ids: str)

distance

lnb.distance.compute_achilles(df: DataFrame, categorical_cols: list, continuous_cols: list, meta_data: list, n_to_save: int)

Function to compute Achilles scores for each record in dataset concurrently.

Parameters:
  • df (pd.DataFrame) – dataset based on which to compute Achilles scores

  • categorical_cols (list) – names of categorical columns

  • continuous_cols (list) – names of continuous columns

  • meta_data (list) – metadata

  • n_to_save (int) – number of nearest neighbors to consider when computing Achilles score

Returns:

dictionary where key is record id and value is Achilles score

Return type:

dict

async lnb.distance.compute_achilles_one_record(df: DataFrame, record_id: int, ohe_cat_indices: list, n_cat_cols: int, cont_indices: list, n_cont_cols: int, all_distances: dict, n_to_save: int)

Async function to compute achilles score (mean distance to n_to_save closest neighbors) for a single record.

Parameters:
  • df (pd.DataFrame) – dataframe containing dataset based on which to compute Achilles score

  • record_id (int) – id of record to compute Achilles score for

  • ohe_cat_indices (list) – indices of one-hot encoded columns

  • n_cat_cols (int) – number of categorical columns

  • cont_indices (list) – indices of continuous columns

  • n_cont_cols (int) – number of continuous columns

  • all_distances (dict) – dictionary where key is record id and value is Achilles score

  • n_to_save (int) – number of nearest neighbors to consider when computing Achilles score

Returns:

None

Return type:

None

async lnb.distance.compute_achilles_parallel(df: DataFrame, categorical_cols: list, continuous_cols: list, meta_data: list, n_to_save: int)

Async function to compute Achilles scores for each record in dataset concurrently.

Parameters:
  • df (pd.DataFrame) – dataset based on which to compute Achilles scores

  • categorical_cols (list) – names of categorical columns

  • continuous_cols (list) – names of continuous columns

  • meta_data (list) – metadata

  • n_to_save (int) – number of nearest neighbors to consider when computing Achilles score

Returns:

dictionary where key is record id and value is Achilles score

Return type:

dict
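The pattern described above, in which one coroutine per record writes its score into a shared dictionary, can be sketched with asyncio. This is an illustration only: records are plain tuples, the distance is Euclidean (an assumption; the package works with one-hot encoded and continuous columns), and the function names are hypothetical. Coroutines that never await run one after another on the event loop, so the sketch shows the structure, not a speed-up.

```python
import asyncio
import math


async def score_one_record(points, record_id, all_distances, n_to_save):
    """Store the mean distance from one record to its n_to_save nearest neighbours."""
    distances = sorted(
        math.dist(points[record_id], other)
        for other_id, other in enumerate(points)
        if other_id != record_id
    )
    # Returns None: the score is written into the shared dictionary.
    all_distances[record_id] = sum(distances[:n_to_save]) / n_to_save


async def score_all_records(points, n_to_save):
    """Schedule one coroutine per record and collect the scores in a dictionary."""
    all_distances = {}
    await asyncio.gather(
        *(
            score_one_record(points, record_id, all_distances, n_to_save)
            for record_id in range(len(points))
        )
    )
    return all_distances


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]
scores = asyncio.run(score_all_records(points, n_to_save=2))
print(scores)
```

The isolated point (5.0, 5.0) ends up with the largest mean distance to its neighbours.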

lnb.distance.compute_achilles_seq(df: DataFrame, categorical_cols: list, continuous_cols: list, meta_data: list, n_to_save: int)

Function to compute Achilles scores for each record in dataset sequentially.

Parameters:
  • df (pd.DataFrame) – dataset based on which to compute Achilles scores

  • categorical_cols (list) – names of categorical columns

  • continuous_cols (list) – names of continuous columns

  • meta_data (list) – metadata

  • n_to_save (int) – number of nearest neighbors to consider when computing Achilles score

Returns:

dictionary where key is record id and value is Achilles score

Return type:

dict

lnb.distance.top_n_vulnerable_dists(distances: dict, n: int) list

Return the risk scores of the top n vulnerable records

Parameters:
  • distances (dict) – dictionary where key is record id and value is the corresponding record’s risk score (in this case the mean distance to its 5 closest neighbors)

  • n (int) – number of most vulnerable records to find

Returns:

list of the n most vulnerable records’ risk scores

Return type:

list

lnb.distance.top_n_vulnerable_records(distances: dict, n: int) list

Return the ids of the top n vulnerable records

Parameters:
  • distances (dict) – dictionary where key is record id and value is the corresponding record’s risk score (in this case the mean distance to its 5 closest neighbors)

  • n (int) – number of most vulnerable records to find

Returns:

list of the n most vulnerable records’ ids

Return type:

list
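Both top-n helpers amount to ranking the dictionary by score. The sketch below assumes that a larger mean distance means a more vulnerable record (an outlier far from its neighbours). The documentation does not state the direction, so treat this ordering and the function names as assumptions.

```python
def top_n_vulnerable_records_sketch(distances, n):
    """Ids of the n records with the largest risk score."""
    ranked_ids = sorted(distances, key=distances.get, reverse=True)
    return ranked_ids[:n]


def top_n_vulnerable_dists_sketch(distances, n):
    """Risk scores of the n records with the largest risk score."""
    return sorted(distances.values(), reverse=True)[:n]


distances = {10: 0.4, 11: 2.5, 12: 0.9, 13: 1.7}
top_ids = top_n_vulnerable_records_sketch(distances, 2)
top_scores = top_n_vulnerable_dists_sketch(distances, 2)
print(top_ids, top_scores)
```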

feature_extractors

lnb.feature_extractors.apply_feature_extractor_one_dataset_parallel(dataset: list, target_record: DataFrame, ohe: OneHotEncoder, ohe_columns: list, ohe_column_names: list, continuous_cols: list, feature_extractors: list, do_ohe: list, queries_list: list, query_extractor, train: bool, membership_label: bool, i: int) tuple

Apply feature extraction in parallel for a given dataset.

Parameters:
  • dataset (pd.DataFrame) – The dataset for which features are to be extracted.

  • target_record (pd.DataFrame) – The target record for which features are to be extracted.

  • ohe (OneHotEncoder) – A fitted one-hot encoder instance.

  • ohe_columns (list) – A list of column names representing one-hot encoded categorical features.

  • ohe_column_names (list) – The names of the columns of the one-hot encoding result.

  • continuous_cols (list) – A list of column names representing continuous features.

  • feature_extractors (list) – A list of feature extractor functions or tuples specifying the feature extractors to be used.

  • do_ohe (list) – A list of boolean values indicating whether one-hot encoding is required for each feature extractor.

  • queries_list (list) – A list of queries for extracting features.

  • query_extractor (function) – The function used for extracting features when the feature extractor is a tuple.

  • train (bool) – A boolean indicating if the dataset is for training.

  • membership_label (bool) – A boolean indicating if membership labeling is required.

  • i (int) – An index to specify which feature extractor to use.

Returns:

A tuple containing:

  • X (pd.DataFrame): A DataFrame containing the extracted features.

  • membership_label (bool): The membership label associated with the dataset.

  • train (bool): The training flag.

Return type:

tuple

lnb.feature_extractors.apply_feature_extractor_sequential(datasets: list, target_record: DataFrame, labels: list, ohe: OneHotEncoder, ohe_columns: list, ohe_column_names: list, continuous_cols: list, feature_extractors: list, do_ohe: list) tuple

Given a list of feature extractor functions and synthetic datasets, extract all features and create a new DataFrame with all features per dataset as individual records.

Parameters:
  • datasets (list) – A list of shadow synthetic datasets.

  • target_record (pd.DataFrame) – DataFrame of one record with the target record, potentially to be used by the feature extractor.

  • labels (list) – A list of labels corresponding to the datasets.

  • ohe (OneHotEncoder) – A fitted one-hot encoder instance.

  • ohe_columns (list) – The columns on which the one-hot encoding should be applied.

  • ohe_column_names (list) – The names of the columns of the one-hot encoding result.

  • continuous_cols (list) – The columns that are continuous.

  • feature_extractors (list) – A list of feature extractor functions. All functions have as input a dataset and output a list of features and a list of column names. If more than one feature extractor is specified, all features are extracted and appended.

  • do_ohe (list) – A list of boolean values indicating whether each feature extractor function requires the dataset to be one-hot encoded or not.

Returns:

DataFrame containing all features per dataset and the corresponding labels.

Return type:

pd.DataFrame
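The sequential procedure can be pictured as a double loop: for every dataset, run every extractor, append the features, and store the label. The sketch below follows the documented extractor contract (a dataset in, a list of features and a list of column names out). It uses plain lists and two toy extractors that are hypothetical, not the package's; one-hot encoding and the target record are left out.

```python
from statistics import mean


def extract_means(dataset):
    """Toy extractor: column means and their names."""
    columns = list(zip(*dataset))
    return [mean(col) for col in columns], [f"mean_{i}" for i in range(len(columns))]


def extract_maxima(dataset):
    """Toy extractor: column maxima and their names."""
    columns = list(zip(*dataset))
    return [max(col) for col in columns], [f"max_{i}" for i in range(len(columns))]


def apply_extractors_sequential(datasets, labels, feature_extractors):
    """Build one feature row per dataset, appending the output of every extractor."""
    rows = []
    for dataset, label in zip(datasets, labels):
        features, names = [], []
        for extractor in feature_extractors:
            new_features, new_names = extractor(dataset)
            features.extend(new_features)
            names.extend(new_names)
        row = dict(zip(names, features))
        row["label"] = label
        rows.append(row)
    return rows


datasets = [[(1.0, 2.0), (3.0, 4.0)], [(0.0, 0.0), (2.0, 6.0)]]
labels = [1, 0]
rows = apply_extractors_sequential(datasets, labels, [extract_means, extract_maxima])
print(rows)
```

Passing rows to pd.DataFrame would give a table with one record per dataset, as in the documented return value.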

lnb.feature_extractors.apply_feature_extractor_to_datasets(datasets_train: list, datasets_eval: list, target_record: DataFrame, ohe: OneHotEncoder, ohe_columns: list, ohe_column_names: list, continuous_cols: list, feature_extractors: list, do_ohe: list)

Apply feature extraction to both training and evaluation datasets.

Parameters:
  • datasets_train (list) – A list of training datasets, each containing synthetic data and corresponding membership labels.

  • datasets_eval (list) – A list of evaluation datasets, each containing synthetic data and corresponding membership labels.

  • target_record (pd.DataFrame) – The target record for which features are to be extracted.

  • ohe (OneHotEncoder) – A fitted one-hot encoder instance.

  • ohe_columns (list) – A list of column names representing one-hot encoded categorical features.

  • ohe_column_names (list) – The names of the columns of the one-hot encoding result.

  • continuous_cols (list) – A list of column names representing continuous features.

  • feature_extractors (list) – A list of feature extractor functions or tuples specifying the feature extractors to be used.

  • do_ohe (list) – A list of boolean values indicating whether one-hot encoding is required for each feature extractor.

Returns:

A list containing extracted features and labels for both training and evaluation datasets.

Return type:

list

lnb.feature_extractors.apply_ohe(df: DataFrame, ohe: OneHotEncoder, categorical_cols: list, ohe_column_names: list, continous_cols: list) DataFrame

One-hot-encode dataset

Parameters:
  • df (pd.DataFrame) – dataset to encode

  • ohe (OneHotEncoder) – OneHotEncoder instance

  • categorical_cols (list) – names of categorical columns

  • ohe_column_names (list) – names for one-hot-encoded columns

  • continous_cols (list) – names of continuous columns

Returns:

One-hot-encoded dataset

Return type:

pd.DataFrame

lnb.feature_extractors.create_queries(queries_list: list, feature_extractors: list, dataset: DataFrame, ohe_columns: list, continuous_cols: list)

Generate queries based on the provided feature extractors and dataset.

Parameters:
  • queries_list (list) – A list to store the generated queries for each feature extractor.

  • feature_extractors (list) – A list of feature extractors, where each element can be a function or a tuple with parameters.

  • dataset (pd.DataFrame) – The dataset containing the features for query generation.

  • ohe_columns (list) – A list of column names representing one-hot encoded categorical features.

  • continuous_cols (list) – A list of column names representing continuous features.

Returns:

A tuple containing:

  • queries_list (list): The updated list of queries generated for each feature extractor.

  • query_extractor: The last used query extractor from the feature extractors list.

Return type:

tuple

lnb.feature_extractors.extract_correlation_features(synthetic_df: DataFrame, categorical_cols: list, ohe_column_names: list, continuous_cols: list, target_record=DataFrame) tuple

Extract correlation features

Parameters:
  • synthetic_df (pd.DataFrame) – Synthetic dataset

  • categorical_cols (list) – names of categorical columns

  • ohe_column_names (list) – names of one-hot encoded columns

  • continuous_cols (list) – names of continuous columns

  • target_record (pd.DataFrame, optional) – dataframe containing target record, defaults to pd.DataFrame

Returns:

extracted features and names

Return type:

tuple

lnb.feature_extractors.extract_naive_features(synthetic_df: DataFrame, categorical_cols: list, ohe_column_names: list, continuous_cols: list, target_record=DataFrame) tuple

Compute the Naive method as described in “Synthetic data – anonymisation groundhog day” (Usenix 2022)

Parameters:
  • synthetic_df (pd.DataFrame) – Synthetic dataset

  • categorical_cols (list) – names of categorical columns

  • ohe_column_names (list) – names of one-hot encoded columns

  • continuous_cols (list) – names of continuous columns

  • target_record (pd.DataFrame, optional) – dataframe containing target record, defaults to pd.DataFrame

Returns:

extracted features and names

Return type:

tuple

lnb.feature_extractors.extract_one_feature(feature_extractor, queries, dataset, ohe_columns, target_record, query_extractor, do_ohe, data_ohe, ohe_column_names, continuous_cols, target_ohe)

Extract features using a given feature extractor.

Parameters:
  • feature_extractor (function or tuple) – The feature extractor function or a tuple containing the function and additional parameters.

  • queries (list) – A list of queries for extracting features.

  • dataset (pd.DataFrame) – The dataset containing the features for extraction.

  • ohe_columns (list) – A list of column names representing one-hot encoded categorical features.

  • target_record (pd.DataFrame) – The target record for which features are to be extracted.

  • query_extractor (function) – The function used for extracting features when the feature extractor is a tuple.

  • do_ohe (bool) – A boolean indicating whether one-hot encoding is required.

  • data_ohe (pd.DataFrame) – The one-hot encoded version of the dataset.

  • ohe_column_names (list) – The names of the columns of the one-hot encoding result.

  • continuous_cols (list) – A list of column names representing continuous features.

  • target_ohe (pd.DataFrame) – The one-hot encoded version of the target record.

Returns:

A tuple containing:

  • features (list): The extracted features.

  • col_names (list): The names of the extracted features.

Return type:

tuple

lnb.feature_extractors.feature_extractor_distances(synthetic_df: DataFrame, target_record_ohe: DataFrame)

Extract features based on cosine similarity between synthetic data and target record

Parameters:
  • synthetic_df (pd.DataFrame) – synthetic data

  • target_record_ohe (pd.DataFrame) – one-hot encoded target record

Returns:

extracted features and names

Return type:

tuple

lnb.feature_extractors.feature_extractor_queries_CQBS(synthetic_df: DataFrame, target_record: DataFrame, queries: list)

Extract query-based features

Parameters:
  • synthetic_df (pd.DataFrame) – synthetic data

  • target_record (pd.DataFrame) – target record

  • queries (list) – queries

Returns:

extracted features and names

Return type:

tuple

lnb.feature_extractors.feature_extractor_topX_full(synthetic_df: DataFrame, target_record_ohe: DataFrame, top_X: int = 50)

Extract features based on top X cosine similarity between synthetic data and target record

Parameters:
  • synthetic_df (pd.DataFrame) – synthetic data

  • target_record_ohe (pd.DataFrame) – one-hot encoded target record

  • top_X (int, optional) – number of most similar records to consider, defaults to 50

Returns:

extracted features and names

Return type:

tuple
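The two similarity-based extractors above can be sketched with NumPy: compute the cosine similarity between every synthetic record and the target record, then either keep all values or only the top_X largest. The function names, the handling of all-zero rows and the descending ordering of the top-X features are assumptions.

```python
import numpy as np


def cosine_similarities(synthetic, target):
    """Cosine similarity between every synthetic row and the single target row."""
    synthetic = np.asarray(synthetic, dtype=float)
    target = np.asarray(target, dtype=float).ravel()
    norms = np.linalg.norm(synthetic, axis=1) * np.linalg.norm(target)
    dots = synthetic @ target
    # Assumption: rows with zero norm get similarity 0 instead of NaN.
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)


def top_x_similarities(synthetic, target, top_x=50):
    """The top_x largest similarities, in descending order."""
    similarities = cosine_similarities(synthetic, target)
    return np.sort(similarities)[::-1][:top_x]


synthetic = [[1, 0, 1], [0, 1, 0], [2, 0, 2], [1, 1, 0]]
target = [[1, 0, 1]]

all_sims = cosine_similarities(synthetic, target)
top_two = top_x_similarities(synthetic, target, top_x=2)
print(all_sims, top_two)
```

If top_X exceeds the number of synthetic records, this sketch simply returns all similarities.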

lnb.feature_extractors.fit_ohe(df: DataFrame, categorical_cols: list, metadata: dict) tuple

lnb.feature_extractors.get_feature_extractors(feature_extractor_names: list) tuple

Given a list of strings or tuples specifying the feature extractors to be used, create a list of the corresponding functions and parameters.

Parameters:

feature_extractor_names (list) – A list of feature extractors, where each element can be:

  • A string specifying a feature extractor (e.g., ‘naive’, ‘correlation’, ‘closest_X_full’, ‘all_distances’)

  • A tuple with a feature extractor name and additional parameters (name, orders, number, conditions)

Returns:

A tuple containing:

  • feature_extractors (list): A list of functions (or tuples with functions and parameters) corresponding to the requested feature extractors.

  • do_ohe (list): A list of boolean values indicating whether one-hot encoding (OHE) should be performed for each feature extractor.

Return type:

tuple

lnb.feature_extractors.get_queries(orders, categorical_indices: list, continous_indices: list, num_cols: int, number: int, cat_condition_options: tuple = (-1, 1), cont_condition_options: tuple = (3, -3), random_state: int = 42) list

Condition options:

  • 0 -> no condition on this attribute
  • 1 -> ==
  • -1 -> !=
  • 2 -> >
  • 3 -> >=
  • -2 -> <
  • -3 -> <=

Parameters:
  • orders (list) – list containing numbers of features to consider when extracting queries from feature combinations

  • categorical_indices (list) – indices of categorical columns

  • continous_indices (list) – indices of continuous columns

  • num_cols (int) – number of columns in dataset

  • number (int) – number of combinations to return

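  • cat_condition_options (tuple, optional) – condition options allowed for categorical columns, defaults to (-1, 1)

  • cont_condition_options (tuple, optional) – condition options allowed for continuous columns, defaults to (3, -3)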

  • random_state (int, optional) – random seed, defaults to 42

Returns:

list of generated queries

Return type:

list
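To make the condition options concrete, the sketch below maps each code to a comparison operator and counts how many records satisfy a query. The representation of a query as one code per column, the comparison of each record against the target record's values, and the function names are all assumptions made for illustration.

```python
import operator

# Condition codes as listed in the documentation; 0 means "no condition".
CONDITION_OPS = {
    1: operator.eq,
    -1: operator.ne,
    2: operator.gt,
    3: operator.ge,
    -2: operator.lt,
    -3: operator.le,
}


def record_matches(record, target, conditions):
    """True if the record satisfies every non-zero condition relative to the target."""
    for value, target_value, code in zip(record, target, conditions):
        if code == 0:
            continue  # no condition on this attribute
        if not CONDITION_OPS[code](value, target_value):
            return False
    return True


def count_matches(records, target, conditions):
    """Number of records that satisfy the query."""
    return sum(record_matches(record, target, conditions) for record in records)


target = ("blue", 30.0)
records = [("blue", 25.0), ("red", 30.0), ("blue", 41.0)]

# Column 0 must equal the target value, column 1 must be <= the target value.
count = count_matches(records, target, conditions=(1, -3))
print(count)
```

With the default options, a categorical column would be tested with != or ==, and a continuous column with >= or <=.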

generators

class lnb.generators.Generator

Bases: ABC

Base class for generators

property label

class lnb.generators.baynet

Bases: Generator

This generator is based on BAYNET.

fit_generate(dataset, metadata, size, seed)
property label

class lnb.generators.ctgan

Bases: Generator

This generator is based on CTGAN.

fit_generate(dataset, metadata, size, seed, epochs=50)
property label

lnb.generators.get_generator(name_generator: str, epsilon: float)

Get generator instance

Parameters:
  • name_generator (str) – generator name. Supports “identity”, “BAYNET”, “privbayes”, “CTGAN”, “SYNTHPOP”, “INDHIST”

  • epsilon (float) – epsilon for training differentially private generators

Returns:

generator instance

Return type:

Generator

class lnb.generators.identity

Bases: Generator

This generator is the identity generator: just return the input dataset.

fit_generate(dataset, metadata, size, seed)
property label

class lnb.generators.indhist

Bases: Generator

This generator is based on INDHIST.

fit_generate(dataset, metadata, size, seed)
property label

class lnb.generators.privbayes(epsilon: float)

Bases: Generator

This generator is based on privbayes.

fit_generate(dataset, metadata, size, seed)
property label

class lnb.generators.synthpop

Bases: Generator

This generator is based on SYNTHPOP.

fit_generate(dataset, metadata, size, seed)
property label
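All generator classes share the same shape: a label property and a fit_generate(dataset, metadata, size, seed) method. The sketch below mirrors that shape with an abstract base class and an identity generator. The class names are hypothetical, and marking both members as abstract is an assumption, since the documentation does not say how the base class is defined.

```python
from abc import ABC, abstractmethod


class GeneratorSketch(ABC):
    """Hypothetical base class mirroring the documented interface."""

    @property
    @abstractmethod
    def label(self):
        """Short name identifying the generator."""

    @abstractmethod
    def fit_generate(self, dataset, metadata, size, seed):
        """Fit on the dataset and return a synthetic dataset."""


class IdentitySketch(GeneratorSketch):
    """Identity generator: return the input dataset unchanged."""

    @property
    def label(self):
        return "identity"

    def fit_generate(self, dataset, metadata, size, seed):
        # metadata, size and seed are accepted for interface compatibility only.
        return list(dataset)


rows = [("blue", 30.0), ("red", 41.0)]
generator = IdentitySketch()
synthetic = generator.fit_generate(rows, metadata=None, size=len(rows), seed=0)
print(generator.label, synthetic)
```

A shared interface like this lets calling code swap generators by name without changing how they are invoked.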

mia

lnb.mia.mia(path_to_data: str, path_to_metadata: str, path_to_data_split: str, target_records: list, generator_name: str, n_synth: int | None = None, n_datasets: int = 1000, epsilon: float = 0.0, models: list = ['random_forest', 'logistic_regression'], output_path: str = './output/files/')

Membership Inference Attack (MIA) function to evaluate data privacy risks.

Parameters:
  • path_to_data (str) – Path to the data file.

  • path_to_metadata (str) – Path to the metadata file.

  • path_to_data_split (str) – Path to the data split information.

  • target_records (list) – List of target records for MIA.

  • generator_name (str) – Name of the data generator being used.

  • n_synth (int) – Number of synthetic records to use. Defaults to the size of df_target if not provided.

  • n_datasets (int) – Number of datasets to generate. Defaults to 1000.

  • epsilon (float) – Differential privacy parameter. Defaults to 0.0.

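  • models (list) – Names of the models to train for the attack. Defaults to [‘random_forest’, ‘logistic_regression’].

  • output_path (str) – Path to store output files. Defaults to ‘./output/files/’.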

Returns:

A dictionary containing the MIA results for each target record.

Return type:

dict

lnb.mia.train_evaluate_mia(df_aux: DataFrame, df_target: DataFrame, meta_data: list, target_record_id: int, df_eval: DataFrame, generator_name: str, continuous_cols: list, categorical_cols: list, n_synth: int = 1000, n_datasets: int = 1000, seeds_train: list | None = None, seeds_eval: list | None = None, epsilon: float = 0.0, models: list | None = None, cv: bool = False)

Train and evaluate a membership inference attack (MIA) using shadow datasets and target record.

Parameters:
  • df_aux (pd.DataFrame) – Auxiliary dataset used for generating shadow datasets.

  • df_target (pd.DataFrame) – Dataset containing the target record for MIA.

  • meta_data (list) – Metadata information used for feature extraction and generating synthetic datasets.

  • target_record_id (int) – The ID of the target record for MIA.

  • df_eval (pd.DataFrame) – Evaluation dataset used for testing the trained models.

  • generator_name (str) – Name of the data generator used for generating synthetic datasets.

  • continuous_cols (list) – A list of column names representing continuous features.

  • categorical_cols (list) – A list of column names representing categorical features.

  • n_synth (int, optional) – Number of synthetic records to generate for each shadow dataset (default is 1000).

  • n_datasets (int, optional) – Number of shadow datasets to generate (default is 1000).

  • seeds_train (list, optional) – List of seeds used for training dataset generation (default is None).

  • seeds_eval (list, optional) – List of seeds used for evaluation dataset generation (default is None).

  • epsilon (float, optional) – Differential privacy parameter for synthetic dataset generation (default is 0.0).

  • models (list, optional) – A list of model names to use for training the meta-classifier (default is [‘random_forest’]).

  • cv (bool, optional) – Whether to use cross-validation during model training (default is False).

  • output_path (str, optional) – Path to save output files (default is ‘./output/files/’).

Returns:

A tuple containing:

  • target_record_id (int): The ID of the target record used for MIA.

  • model_metrics (dict): A dictionary containing AUC and accuracy metrics for each trained model.

Return type:

tuple

plots

lnb.plots.calculate_statistics(distances: dict[int, float]) None

Calculate summary statistics of Achilles distances

Parameters:

distances (dict) – dictionary where key is record id, value is Achilles score

lnb.plots.mia_results_to_df(mia_results: list[float]) DataFrame

Convert MIA results list to dataframe

Parameters:

mia_results (list[float]) – list containing MIA scores

Returns:

dataframe containing MIA results

Return type:

pd.DataFrame

lnb.plots.plot_achilles(distances: dict[int, float], n: int) None

Plot histogram of Achilles scores

Parameters:
  • distances (dict) – dictionary where key is record id, value is Achilles score

  • n (int) – number of most vulnerable records to identify

lnb.plots.plot_mia_scores(mia_results: list[float], output_path: str | None = None) None

Plot MIA scores

Parameters:
  • mia_results (list[float]) – list containing MIA scores

  • output_path (str, optional) – path to save plot, defaults to None

shadow

lnb.shadow_data.generate_dataset_parallel(df_aux: DataFrame, df_target: DataFrame, meta_data: list, target_record: DataFrame, df_eval: DataFrame, in_dataset: bool, generator_name: str, n_synth: int, n_original: int, seeds_train: list, seeds_eval: list, idx: int, shadow_datasets: list, shadow_membership_labels: list, evaluation_datasets: list, evaluation_membership_labels: list, epsilon: float, train: bool)

This function trains one generator and generates a single dataset. The generator is trained on df_target, either with the target record swapped out for a different record sampled from df_eval (in_dataset=False), or on the full df_target containing the target record (in_dataset=True). The generated synthetic dataset and its membership label are placed in the corresponding lists.

Parameters:
  • df_target (pandas.DataFrame) – Target dataset.

  • meta_data (list) – List containing metadata concerning the data (feature types, ranges, etc.), necessary for training synthetic data generators.

  • target_record (pandas.DataFrame) – DataFrame containing only the target record.

  • df_eval (pandas.DataFrame) – Evaluation pool to draw the reference record from.

  • in_dataset (bool) – If True, the target record is in the dataset used to train the synthetic data generator. If False, it is replaced by a reference record randomly sampled from df_eval.

  • generator_name (str) – Name of the generator used, e.g., ‘SYNTHPOP’, ‘BAYNET’, etc. See reprosyn library for more details.

  • n_synth (int) – Size of each synthetic dataset.

  • seeds_train, seeds_eval (list) – Lists of seed values; length must be equal to n_datasets.

  • idx (int) – Counter for number of generated synthetic datasets.

  • shadow_datasets, evaluation_datasets (list) – Lists containing all generated synthetic datasets. This function generates a single synthetic dataset and places it in the corresponding list, selecting the position according to idx.

  • shadow_membership_labels, evaluation_membership_labels (list) – Lists containing the membership label for each synthetic dataset. If the target record is included in the training set, the membership label is 1; otherwise it is 0. The label at index idx corresponds to the synthetic dataset at index idx in the matching dataset list.

  • epsilon (float) – Epsilon when training with differential privacy.

Returns:

A tuple of the synthetic dataset, a bool indicating whether the synthetic data generator was trained on the target record, and a bool indicating whether the synthetic dataset is used for MIA training or evaluation.

Return type:

tuple
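
The member/non-member swap described above can be sketched as follows. This is a minimal illustration assuming pandas; build_training_set and the toy data are hypothetical and not part of lnb, and the actual sampling and indexing logic in lnb.shadow_data may differ.

```python
import pandas as pd


def build_training_set(df_target, target_record, df_eval, in_dataset, seed):
    """Hypothetical helper: return (training frame, membership label)."""
    if in_dataset:
        # Member case: train on the full target dataset, label 1.
        return df_target.copy(), 1
    # Non-member case: drop the target record and add one reference record
    # sampled from the evaluation pool, so the dataset size is unchanged.
    reference = df_eval.sample(n=1, random_state=seed)
    without_target = df_target.drop(index=target_record.index)
    return pd.concat([without_target, reference]), 0


# Toy data, for illustration only.
df_target = pd.DataFrame({"age": [30, 41, 52]}, index=[0, 1, 2])
df_eval = pd.DataFrame({"age": [63, 74]}, index=[10, 11])
target_record = df_target.loc[[1]]  # one-row DataFrame, as in the docs

df_in, label_in = build_training_set(df_target, target_record, df_eval, True, seed=0)
df_out, label_out = build_training_set(df_target, target_record, df_eval, False, seed=0)
```

Both variants have the same number of rows, so a membership classifier cannot separate them by dataset size alone.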

lnb.shadow_data.generate_datasets(df_aux: DataFrame, df_target: DataFrame, meta_data: list, target_record_id: int, df_eval: DataFrame, generator_name: str, n_synth: int = 1000, n_original: int = 1000, n_datasets: int = 1000, seeds_train: list | None = None, seeds_eval: list | None = None, epsilon: float = 0.0)

Launch the pipeline to generate evaluation synthetic datasets.

Parameters:
  • df_target (pandas.DataFrame) – Target dataset.

  • meta_data (list) – List containing metadata concerning the data (feature types, ranges, etc.), necessary for training synthetic data generators.

  • target_record_id (int) – Index of the target record.

  • df_eval (pandas.DataFrame) – Evaluation pool to draw the reference record from.

  • generator_name (str) – Name of the generator used, e.g., ‘SYNTHPOP’, ‘BAYNET’, etc. See reprosyn library for more details.

  • n_synth (int) – Size of each synthetic dataset.

  • n_datasets (int) – Number of synthetic datasets to generate.

  • seeds_train, seeds_eval (list, optional) – Lists of seed values; length must be equal to n_datasets. Default to None.

  • epsilon (float) – Epsilon when training with differential privacy.

Returns:

A list containing tuples of (synthetic dataset, membership label)

Return type:

list

lnb.shadow_data.generate_datasets_parallel(df_aux: DataFrame, df_target: DataFrame, meta_data: list, target_record: DataFrame, df_eval: DataFrame, generator_name: str, n_synth: int, n_original: int, n_datasets: int, seeds_train: list, seeds_eval: list, epsilon: float, train: bool)

This function allows evaluation datasets to be generated concurrently. It is not meant to be called directly, but rather from generate_datasets. Each launched task corresponds to a single synthetic data generator being trained and used to generate a single synthetic dataset. Exactly half of the synthetic data generators are trained on data including the target record.

Parameters:
  • df_target (pandas.DataFrame) – Target dataset.

  • meta_data (list) – List containing metadata concerning the data (feature types, ranges, etc.), necessary for training synthetic data generators.

  • target_record (pandas.DataFrame) – DataFrame containing only the target record.

  • df_eval (pandas.DataFrame) – Evaluation pool to draw the reference record from.

  • generator_name (str) – Name of the generator used, e.g., ‘SYNTHPOP’, ‘BAYNET’, etc. See reprosyn library for more details.

  • n_synth (int) – Size of each synthetic dataset.

  • n_datasets (int) – Number of synthetic datasets to generate.

  • seeds_train, seeds_eval (list) – Lists of seed values; length must be equal to n_datasets.

  • epsilon (float) – Epsilon when training with differential privacy.

Returns:

A tuple containing:
  • A list of all generated synthetic datasets.

  • A list of membership labels for each generated synthetic dataset.

Return type:

tuple
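
The one-task-per-dataset pattern with index-aligned result lists can be sketched as below. This is an illustration under stated assumptions, not the library's implementation: generate_one and generate_all are hypothetical names, the payload is a placeholder for a trained generator's output, a thread pool stands in for whatever executor lnb uses, and alternating labels by index is just one way to get an exact half/half split.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_one(idx, in_dataset, seed, datasets, labels):
    """Stand-in for training one generator and sampling one dataset."""
    # Placeholder payload; a real task would train a generator here.
    datasets[idx] = {"seed": seed, "trained_on_target": in_dataset}
    # Label 1 if the target record was in the training data, else 0.
    labels[idx] = 1 if in_dataset else 0


def generate_all(n_datasets, seeds):
    if len(seeds) != n_datasets:
        raise ValueError("seeds must contain one value per dataset")
    # Pre-allocate so each task writes only to its own slot.
    datasets = [None] * n_datasets
    labels = [None] * n_datasets
    with ThreadPoolExecutor() as pool:
        futures = [
            # Assumption: even indices are members, odd indices are not.
            pool.submit(generate_one, idx, idx % 2 == 0, seeds[idx], datasets, labels)
            for idx in range(n_datasets)
        ]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return datasets, labels


datasets, labels = generate_all(4, seeds=[11, 12, 13, 14])
```

Because every task writes to position idx in both lists, the label at index idx always describes the dataset at index idx, regardless of the order in which tasks finish.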

utils

lnb.utils.blockPrint()
lnb.utils.enablePrint()
lnb.utils.ignore_depreciation()
async lnb.utils.save_metrics_to_file(file_path: str, data)

Save metrics to a file.

Parameters:
  • file_path (str) – Path of the file to write.

  • data – Data to save.

lnb.utils.str2bool(s: str)

Convert a string to a bool. Used in the argument parser.

Parameters:

s (str) – “True” or “False”. Raises argparse.ArgumentTypeError if value is neither.

Returns:

The parsed boolean value.

Return type:

bool
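
A converter with the documented behavior can be plugged into argparse through the type argument. The sketch below is an assumption-based reimplementation, not the source of lnb.utils.str2bool: str2bool_sketch and the --cv flag are hypothetical, and whether the real function also accepts other spellings (for example lowercase "true") is not stated here.

```python
import argparse


def str2bool_sketch(s: str) -> bool:
    """Hypothetical converter: accept only the strings 'True' and 'False'."""
    if s == "True":
        return True
    if s == "False":
        return False
    # argparse turns this exception into a readable command-line error.
    raise argparse.ArgumentTypeError(f"expected 'True' or 'False', got {s!r}")


parser = argparse.ArgumentParser()
parser.add_argument("--cv", type=str2bool_sketch, default=False)  # hypothetical flag
args = parser.parse_args(["--cv", "True"])
```

Using a dedicated converter avoids the common pitfall of type=bool, where any non-empty string, including "False", evaluates to True.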

lnb.utils.str2list(s)

Convert a string to a list. Used in parser.

Parameters:

s (str) – The string to convert.

Returns:

The resulting list.

Return type:

list