lnb package
classifiers
- lnb.classifiers.drop_zero_cols(X_train: DataFrame, X_test: DataFrame | None = None) tuple
Drops columns from the input DataFrames where all column values are zero.
- Parameters:
X_train (pd.DataFrame) – The training data.
X_test (pd.DataFrame, optional) – The test data. Defaults to None.
- Returns:
If X_test is not provided, returns the modified X_train with zero-sum columns dropped. If X_test is provided, returns both X_train and X_test with zero-sum columns dropped.
- Return type:
pd.DataFrame or tuple of pd.DataFrame
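The behaviour described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the name drop_zero_cols_sketch is hypothetical, and it assumes that the columns to drop are chosen from X_train alone and that the same selection is then applied to X_test.

```python
import pandas as pd


def drop_zero_cols_sketch(X_train, X_test=None):
    """Hypothetical sketch: drop columns whose training values are all zero."""
    # Keep a column if at least one training value is non-zero.
    keep = [col for col in X_train.columns if (X_train[col] != 0).any()]
    if X_test is None:
        return X_train[keep]
    # Assumption: the test set keeps exactly the columns kept for training.
    return X_train[keep], X_test[keep]


X_train = pd.DataFrame({"a": [0, 0, 0], "b": [1, 0, 2], "c": [0, 3, 0]})
X_test = pd.DataFrame({"a": [5, 0], "b": [0, 0], "c": [1, 1]})

train_only = drop_zero_cols_sketch(X_train)
train_kept, test_kept = drop_zero_cols_sketch(X_train, X_test)
print(list(train_kept.columns))
```

Selecting columns from the training set only keeps both frames aligned, even when a column that is all zero in training has non-zero values in the test set.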
- lnb.classifiers.fit_classifier(X_train: DataFrame, y_train: DataFrame, model: str, cv=False)
Trains a classifier based on the specified model type using the provided training data.
- Parameters:
X_train (pd.DataFrame) – The feature data for the training set.
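y_train (pd.DataFrame) – The target labels for the training set.
model (str) – The name of the model to train. Supported models are: ‘logistic_regression’, ‘random_forest’, ‘mlp’.
cv (bool, optional) – If True, performs cross-validation to find the best hyperparameters. If False, trains the model using fixed hyperparameters. Defaults to False.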
- Returns:
A trained classifier object based on the specified model.
- Return type:
object
- lnb.classifiers.fit_classifiers(X_train: DataFrame, y_train: DataFrame, models: list, cv=False)
Trains classifiers based on the specified model types using the provided training data.
- Parameters:
X_train (pd.DataFrame) – The feature data for the training set.
y_train (pd.DataFrame) – The target labels for the training set.
models (list) – A list of model names (as strings) to be trained and validated. Supported models are: ‘logistic_regression’, ‘random_forest’, ‘mlp’.
cv (bool, optional) – If True, performs cross-validation to find the best hyperparameters. If False, trains the model using fixed hyperparameters. Defaults to False
- Returns:
list of fitted classifiers
- Return type:
list
- lnb.classifiers.fit_validate_classifiers(X_train: DataFrame, y_train: DataFrame, X_test: DataFrame, y_test: DataFrame, models: list, cv: bool = False) tuple
Trains and validates multiple classifiers on the provided training and test sets.
- Parameters:
X_train (pd.DataFrame) – The feature data for the training set.
y_train (pd.DataFrame) – The target labels for the training set.
X_test (pd.DataFrame) – The feature data for the test set.
y_test (pd.DataFrame) – The target labels for the test set.
models (list) – A list of model names (as strings) to be trained and validated. Supported models are: 'logistic_regression', 'random_forest', 'mlp'.
cv (bool, optional) – If True, performs cross-validation during model training to tune hyperparameters. If False, trains the model using fixed hyperparameters. Defaults to False.
- Returns:
A tuple containing: - A list of trained models. - A list of the results (training and test accuracy, AUC, etc.) for each model.
- Return type:
tuple
- lnb.classifiers.scale_features(X_train: DataFrame, X_test: DataFrame | None = None) tuple
Scales the features in X_train by standardizing them. If X_test is provided, it scales the test set using the same mean and standard deviation as X_train.
- Parameters:
X_train (pd.DataFrame) – The training data
X_test (pd.DataFrame, optional) – The test data. Defaults to None.
- Returns:
If X_test is not provided, returns the standardized X_train. If X_test is provided, returns both the standardized X_train and X_test.
- Return type:
pd.DataFrame or tuple of pd.DataFrame
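A minimal sketch of standardizing with training-set statistics is shown below. The name scale_features_sketch is hypothetical. The use of the population standard deviation (ddof=0, as scikit-learn's StandardScaler does) and the handling of constant columns are assumptions, since the documentation does not specify them.

```python
import pandas as pd


def scale_features_sketch(X_train, X_test=None):
    """Hypothetical sketch: standardize using the mean and std of X_train."""
    mean = X_train.mean(axis=0)
    # Assumption: population standard deviation (ddof=0).
    std = X_train.std(axis=0, ddof=0)
    # Assumption: constant columns are divided by 1 to avoid division by zero.
    std = std.replace(0, 1.0)
    X_train_scaled = (X_train - mean) / std
    if X_test is None:
        return X_train_scaled
    # The test set reuses the training mean and std, never its own.
    return X_train_scaled, (X_test - mean) / std


X_train = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 10.0, 10.0]})
X_test = pd.DataFrame({"x": [2.0, 4.0], "y": [10.0, 11.0]})

train_scaled, test_scaled = scale_features_sketch(X_train, X_test)
print(train_scaled)
print(test_scaled)
```

Reusing the training statistics for the test set avoids leaking information about the test distribution into the preprocessing step.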
- lnb.classifiers.train_LogisticRegression(X_train: DataFrame, y_train: DataFrame, cv: bool = False) LogisticRegression
Trains a logistic regression model using either a fixed regularization parameter or cross-validation.
- Parameters:
X_train (pd.DataFrame) – The feature data for the training set.
y_train (pd.DataFrame) – The target labels for the training set.
cv (bool, optional) – If True, performs cross-validation to find the best regularization parameter. If False, trains the model using a fixed regularization parameter. Defaults to False.
- Returns:
A trained logistic regression model.
- Return type:
LogisticRegression
- lnb.classifiers.train_MLP(X_train: DataFrame, y_train: DataFrame, cv: bool = False) MLPClassifier
Trains a Multi-layer Perceptron (MLP) classifier using either fixed hyperparameters or cross-validation for hyperparameter tuning.
- Parameters:
X_train (pd.DataFrame) – The feature data for the training set.
y_train (pd.DataFrame) – The target labels for the training set.
cv (bool, optional) – If True, performs cross-validation to find the best hyperparameters. If False, trains the model using fixed hyperparameters. Defaults to False
- Returns:
A trained MLP classifier.
- Return type:
MLPClassifier
- lnb.classifiers.train_RandomForest(X_train: DataFrame, y_train: DataFrame, cv: bool = False) RandomForestClassifier
Trains a random forest classifier using either fixed hyperparameters or cross-validation for hyperparameter tuning.
- Parameters:
X_train (pd.DataFrame) – The feature data for the training set.
y_train (pd.DataFrame) – The target labels for the training set.
cv (bool, optional) – If True, performs cross-validation to find the best hyperparameters. If False, trains the model using fixed hyperparameters. Defaults to False
- Returns:
A trained random forest classifier.
- Return type:
RandomForestClassifier
- lnb.classifiers.validate_clf(clf: ClassifierMixin, X_train: DataFrame, y_train: DataFrame, X_test: DataFrame, y_test: DataFrame) tuple
Evaluates the classifier’s performance on both the training and test datasets. Prints the accuracy and AUC (Area Under the Curve) for both training and test sets.
- Parameters:
clf (ClassifierMixin) – The classifier to be evaluated (any estimator with predict and predict_proba methods).
X_train (pd.DataFrame) – The feature data for the training set.
y_train (pd.DataFrame) – The target labels for the training set.
X_test (pd.DataFrame) – The feature data for the test set.
y_test (pd.DataFrame) – The target labels for the test set.
- Returns:
A tuple containing: - Training accuracy - Training AUC - Test accuracy - Test AUC (if computable)
- Return type:
tuple
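The evaluation described above can be approximated with scikit-learn's metric functions. The sketch below is illustrative only: validate_clf_sketch is a hypothetical name, NumPy arrays stand in for DataFrames for brevity, and returning None when the AUC cannot be computed is an assumption about what "if computable" means.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score


def validate_clf_sketch(clf, X_train, y_train, X_test, y_test):
    """Hypothetical sketch: accuracy and AUC on the training and test sets."""
    results = []
    for X, y in ((X_train, y_train), (X_test, y_test)):
        acc = accuracy_score(y, clf.predict(X))
        # AUC is undefined when y contains a single class.
        if len(np.unique(y)) == 2:
            auc = roc_auc_score(y, clf.predict_proba(X)[:, 1])
        else:
            auc = None
        results.extend([acc, auc])
    # Order: training accuracy, training AUC, test accuracy, test AUC.
    return tuple(results)


# Synthetic binary problem: the label depends on the first feature plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression().fit(X[:150], y[:150])
train_acc, train_auc, test_acc, test_auc = validate_clf_sketch(
    clf, X[:150], y[:150], X[150:], y[150:]
)
print(train_acc, train_auc, test_acc, test_auc)
```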
data_prep
- lnb.data_prep.discretize_dataset(df: DataFrame, columns: list) DataFrame
Convert the dataset to one where all categories in categorical columns are integers instead of class name strings
- Parameters:
df (pd.DataFrame) – dataset to discretize
columns (list) – columns to discretize
- Returns:
discretized dataset
- Return type:
pd.DataFrame
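One way to obtain integer categories is through pandas categorical codes, as sketched below. The name discretize_sketch is hypothetical, and the mapping from category strings to integers (sorted order, starting at 0) is an assumption. The package may assign codes differently.

```python
import pandas as pd


def discretize_sketch(df, columns):
    """Hypothetical sketch: replace category strings with integer codes."""
    out = df.copy()
    for col in columns:
        # Categories are sorted, then numbered 0..k-1.
        out[col] = out[col].astype("category").cat.codes
    return out


df = pd.DataFrame(
    {
        "color": ["red", "blue", "red"],
        "size": ["S", "L", "M"],
        "price": [1.5, 2.0, 3.5],
    }
)
encoded = discretize_sketch(df, ["color", "size"])
print(encoded)
```

Columns not listed in columns, such as price here, are left untouched.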
- lnb.data_prep.get_target_record(df: DataFrame, index: int) DataFrame
Given an index, return the 1-record dataframe corresponding to the index
- Parameters:
df (pd.DataFrame) – dataset
index (int) – index of the record to return
- Returns:
dataframe containing the record at the specified index of df
- Return type:
pd.DataFrame
- lnb.data_prep.load_data(path_to_data: str, path_to_metadata: str, cols_to_select: list = ['all'])
- lnb.data_prep.normalize_cont_cols(df: DataFrame, meta_data: list, df_aux: DataFrame, types: tuple = ('Float',)) DataFrame
Normalize continuous columns
- Parameters:
df (pd.DataFrame) – dataframe containing data to normalize
meta_data (list) – meta data
df_aux (pd.DataFrame) – auxiliary data based on which normalization is done
types (tuple, optional) – types of column to normalize, defaults to (“Float”,)
- Returns:
normalized dataframe
- Return type:
pd.DataFrame
- lnb.data_prep.read_data(data_path: str, categorical_cols: list, continuous_cols: list) DataFrame
Read the CSV file at data_path and return a pandas DataFrame. If all columns are categorical, make sure all column values are strings.
- Parameters:
data_path (str) – path to data
categorical_cols (list) – names of categorical columns
continuous_cols (list) – names of continuous columns
- Returns:
dataframe containing loaded data
- Return type:
pd.DataFrame
- lnb.data_prep.read_metadata(metadata_path: str) tuple
Read metadata from a JSON file (necessary for the reprosyn generators).
- Parameters:
metadata_path (str) – path to metadata
- Returns:
tuple containing metadata, categorical column names, and continuous column names
- Return type:
tuple
- lnb.data_prep.select_columns(df: DataFrame, categorical_cols: list, continuous_cols: list, cols_to_select: list, meta_data_og: list) tuple
Select specified columns of dataset and drop the rest.
- Parameters:
df (pd.DataFrame) – dataset
categorical_cols (list) – names of categorical columns
continuous_cols (list) – names of continuous columns
cols_to_select (list) – columns to keep in dataframe. If “all” then all columns are kept.
meta_data_og (list) – metadata
- Returns:
data with selected columns, categorical column names, continuous column names, metadata concerning selected columns.
- Return type:
tuple
- lnb.data_prep.split_data(df: DataFrame, path_to_ids: str)
distance
- lnb.distance.compute_achilles(df: DataFrame, categorical_cols: list, continuous_cols: list, meta_data: list, n_to_save: int)
Function to compute Achilles scores for each record in dataset concurrently.
- Parameters:
df (pd.DataFrame) – dataset based on which to compute Achilles scores
categorical_cols (list) – names of categorical columns
continuous_cols (list) – names of continuous columns
meta_data (list) – metadata
n_to_save (int) – number of nearest neighbors to consider when computing Achilles score
- Returns:
dictionary where key is record id and value is Achilles score
- Return type:
dict
- async lnb.distance.compute_achilles_one_record(df: DataFrame, record_id: int, ohe_cat_indices: list, n_cat_cols: int, cont_indices: list, n_cont_cols: int, all_distances: dict, n_to_save: int)
Async function to compute achilles score (mean distance to n_to_save closest neighbors) for a single record.
- Parameters:
df (pd.DataFrame) – dataframe containing dataset based on which to compute Achilles score
record_id (int) – id of record to compute Achilles score for
ohe_cat_indices (list) – indices of one-hot encoded columns
n_cat_cols (int) – number of categorical columns
cont_indices (list) – indices of continuous columns
n_cont_cols (int) – number of continuous columns
all_distances (dict) – dictionary where key is record id and value is Achilles score
n_to_save (int) – number of nearest neighbors to consider when computing Achilles score
- Returns:
None
- Return type:
None
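The Achilles score is described above as the mean distance from a record to its n_to_save closest neighbors. The sketch below illustrates that idea on numeric rows. It assumes Euclidean distance, which the documentation does not confirm, and the name achilles_scores_sketch is hypothetical.

```python
import numpy as np


def achilles_scores_sketch(records, n_to_save):
    """Hypothetical sketch: mean distance to the n_to_save nearest neighbors."""
    records = np.asarray(records, dtype=float)
    # Pairwise Euclidean distances (assumption: the package may use another metric).
    diffs = records[:, None, :] - records[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # A record is not its own neighbor.
    np.fill_diagonal(dists, np.inf)
    nearest = np.sort(dists, axis=1)[:, :n_to_save]
    return {i: float(nearest[i].mean()) for i in range(len(records))}


records = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0]]
scores = achilles_scores_sketch(records, n_to_save=2)
print(scores)
```

In this toy data the last record lies far from the others, so it receives the largest score.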
- async lnb.distance.compute_achilles_parallel(df: DataFrame, categorical_cols: list, continuous_cols: list, meta_data: list, n_to_save: int)
Async function to compute Achilles scores for each record in dataset concurrently.
- Parameters:
df (pd.DataFrame) – dataset based on which to compute Achilles scores
categorical_cols (list) – names of categorical columns
continuous_cols (list) – names of continuous columns
meta_data (list) – metadata
n_to_save (int) – number of nearest neighbors to consider when computing Achilles score
- Returns:
dictionary where key is record id and value is Achilles score
- Return type:
dict
- lnb.distance.compute_achilles_seq(df: DataFrame, categorical_cols: list, continuous_cols: list, meta_data: list, n_to_save: int)
Function to compute Achilles scores for each record in dataset sequentially.
- Parameters:
df (pd.DataFrame) – dataset based on which to compute Achilles scores
categorical_cols (list) – names of categorical columns
continuous_cols (list) – names of continuous columns
meta_data (list) – metadata
n_to_save (int) – number of nearest neighbors to consider when computing Achilles score
- Returns:
dictionary where key is record id and value is Achilles score
- Return type:
dict
- lnb.distance.top_n_vulnerable_dists(distances: dict, n: int) list
Return the risk scores of the top n vulnerable records
- Parameters:
distances (dict) – dictionary where key is record id and value is the corresponding record’s risk score (in this case the mean distance to its 5 closest neighbors)
n (int) – number of most vulnerable records to find
- Returns:
list of the n most vulnerable records’ risk scores
- Return type:
list
- lnb.distance.top_n_vulnerable_records(distances: dict, n: int) list
Return the ids of the top n vulnerable records
- Parameters:
distances (dict) – dictionary where key is record id and value is the corresponding record’s risk score (in this case the mean distance to its 5 closest neighbors)
n (int) – number of most vulnerable records to find
- Returns:
list of the n most vulnerable records’ ids
- Return type:
list
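Both top-n functions amount to ranking the score dictionary. The sketch below assumes that a larger score (a more isolated record) means higher vulnerability. That ordering is an assumption, and top_n_vulnerable_sketch is a hypothetical name that returns ids and scores together for brevity.

```python
def top_n_vulnerable_sketch(distances, n):
    """Hypothetical sketch: ids and scores of the n highest-scoring records."""
    # Assumption: larger mean neighbor distance = more vulnerable.
    ranked = sorted(distances.items(), key=lambda item: item[1], reverse=True)[:n]
    ids = [record_id for record_id, _ in ranked]
    scores = [score for _, score in ranked]
    return ids, scores


distances = {101: 0.12, 102: 0.87, 103: 0.45, 104: 0.91}
ids, scores = top_n_vulnerable_sketch(distances, n=2)
print(ids)     # [104, 102]
print(scores)  # [0.91, 0.87]
```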
feature_extractors
- lnb.feature_extractors.apply_feature_extractor_one_dataset_parallel(dataset: list, target_record: DataFrame, ohe: OneHotEncoder, ohe_columns: list, ohe_column_names: list, continuous_cols: list, feature_extractors: list, do_ohe: list, queries_list: list, query_extractor, train: bool, membership_label: bool, i: int) tuple
Apply feature extraction in parallel for a given dataset.
- Parameters:
dataset (pd.DataFrame) – The dataset for which features are to be extracted.
target_record (pd.DataFrame) – The target record for which features are to be extracted.
ohe (OneHotEncoder) – A fitted one-hot encoder instance.
ohe_columns (list) – A list of column names representing one-hot encoded categorical features.
ohe_column_names (list) – The names of the columns of the one-hot encoding result.
continuous_cols (list) – A list of column names representing continuous features.
feature_extractors (list) – A list of feature extractor functions or tuples specifying the feature extractors to be used.
do_ohe (list) – A list of boolean values indicating whether one-hot encoding is required for each feature extractor.
queries_list (list) – A list of queries for extracting features.
query_extractor (function) – The function used for extracting features when the feature extractor is a tuple.
train (bool) – A boolean indicating if the dataset is for training.
membership_label (bool) – A boolean indicating if membership labeling is required.
i (int) – An index to specify which feature extractor to use.
- Returns:
A tuple containing: - X (pd.DataFrame): A DataFrame containing the extracted features. - membership_label (bool): The membership label associated with the dataset. - train (bool): The training flag.
- Return type:
tuple
- lnb.feature_extractors.apply_feature_extractor_sequential(datasets: list, target_record: DataFrame, labels: list, ohe: OneHotEncoder, ohe_columns: list, ohe_column_names: list, continuous_cols: list, feature_extractors: list, do_ohe: list) tuple
Given a list of feature extractor functions and synthetic datasets, extract all features and create a new DataFrame with all features per dataset as individual records.
- Parameters:
datasets (list) – A list of shadow synthetic datasets.
target_record (pd.DataFrame) – DataFrame of one record with the target record, potentially to be used by the feature extractor.
labels (list) – A list of labels corresponding to the datasets.
ohe (OneHotEncoder) – A fitted one-hot encoder instance.
ohe_columns (list) – The columns on which the one-hot encoding should be applied.
ohe_column_names (list) – The names of the columns of the one-hot encoding result.
continuous_cols (list) – The columns that are continuous.
feature_extractors (list) – A list of feature extractor functions. All functions have as input a dataset and output a list of features and a list of column names. If more than one feature extractor is specified, all features are extracted and appended.
do_ohe (list) – A list of boolean values indicating whether each feature extractor function requires the dataset to be one-hot encoded or not.
- Returns:
DataFrame containing all features per dataset and the corresponding labels.
- Return type:
pd.DataFrame
- lnb.feature_extractors.apply_feature_extractor_to_datasets(datasets_train: list, datasets_eval: list, target_record: DataFrame, ohe: OneHotEncoder, ohe_columns: list, ohe_column_names: list, continuous_cols: list, feature_extractors: list, do_ohe: list)
Apply feature extraction to both training and evaluation datasets.
- Parameters:
datasets_train (list) – A list of training datasets, each containing synthetic data and corresponding membership labels.
datasets_eval (list) – A list of evaluation datasets, each containing synthetic data and corresponding membership labels.
target_record (pd.DataFrame) – The target record for which features are to be extracted.
ohe (OneHotEncoder) – A fitted one-hot encoder instance.
ohe_columns (list) – A list of column names representing one-hot encoded categorical features.
ohe_column_names (list) – The names of the columns of the one-hot encoding result.
continuous_cols (list) – A list of column names representing continuous features.
feature_extractors (list) – A list of feature extractor functions or tuples specifying the feature extractors to be used.
do_ohe (list) – A list of boolean values indicating whether one-hot encoding is required for each feature extractor.
- Returns:
A list containing extracted features and labels for both training and evaluation datasets.
- Return type:
list
- lnb.feature_extractors.apply_ohe(df: DataFrame, ohe: OneHotEncoder, categorical_cols: list, ohe_column_names: list, continous_cols: list) DataFrame
One-hot-encode dataset
- Parameters:
df (pd.DataFrame) – dataset to encode
ohe (OneHotEncoder) – OneHotEncoder instance
categorical_cols (list) – names of categorical columns
ohe_column_names (list) – names for one-hot-encoded columns
continous_cols (list) – names of continuous columns
- Returns:
One-hot-encoded dataset
- Return type:
pd.DataFrame
- lnb.feature_extractors.create_queries(queries_list: list, feature_extractors: list, dataset: DataFrame, ohe_columns: list, continuous_cols: list)
Generate queries based on the provided feature extractors and dataset.
- Parameters:
queries_list (list) – A list to store the generated queries for each feature extractor.
feature_extractors (list) – A list of feature extractors, where each element can be a function or a tuple with parameters.
dataset (pd.DataFrame) – The dataset containing the features for query generation.
ohe_columns (list) – A list of column names representing one-hot encoded categorical features.
continuous_cols (list) – A list of column names representing continuous features.
- Returns:
A tuple containing: - queries_list (list): The updated list of queries generated for each feature extractor. - query_extractor: The last used query extractor from the feature extractors list.
- Return type:
tuple
- lnb.feature_extractors.extract_correlation_features(synthetic_df: ~pandas.core.frame.DataFrame, categorical_cols: list, ohe_column_names: list, continuous_cols: list, target_record=<class 'pandas.core.frame.DataFrame'>) tuple
Extract correlation features
- Parameters:
synthetic_df (pd.DataFrame) – Synthetic dataset
categorical_cols (list) – names of categorical columns
ohe_column_names (list) – names of one-hot encoded columns
continuous_cols (list) – names of continuous columns
target_record (pd.DataFrame, optional) – dataframe containing target record, defaults to pd.DataFrame
- Returns:
extracted features and names
- Return type:
tuple
- lnb.feature_extractors.extract_naive_features(synthetic_df: ~pandas.core.frame.DataFrame, categorical_cols: list, ohe_column_names: list, continuous_cols: list, target_record=<class 'pandas.core.frame.DataFrame'>) tuple
Compute the Naive method as described in “Synthetic data – anonymisation groundhog day” (Usenix 2022)
- Parameters:
synthetic_df (pd.DataFrame) – Synthetic dataset
categorical_cols (list) – names of categorical columns
ohe_column_names (list) – names of one-hot encoded columns
continuous_cols (list) – names of continuous columns
target_record (pd.DataFrame, optional) – dataframe containing target record, defaults to pd.DataFrame
- Returns:
extracted features and names
- Return type:
tuple
- lnb.feature_extractors.extract_one_feature(feature_extractor, queries, dataset, ohe_columns, target_record, query_extractor, do_ohe, data_ohe, ohe_column_names, continuous_cols, target_ohe)
Extract features using a given feature extractor.
- Parameters:
feature_extractor (function or tuple) – The feature extractor function or a tuple containing the function and additional parameters.
queries (list) – A list of queries for extracting features.
dataset (pd.DataFrame) – The dataset containing the features for extraction.
ohe_columns (list) – A list of column names representing one-hot encoded categorical features.
target_record (pd.DataFrame) – The target record for which features are to be extracted.
query_extractor (function) – The function used for extracting features when the feature extractor is a tuple.
do_ohe (bool) – A boolean indicating whether one-hot encoding is required.
data_ohe (pd.DataFrame) – The one-hot encoded version of the dataset.
ohe_column_names (list) – The names of the columns of the one-hot encoding result.
continuous_cols (list) – A list of column names representing continuous features.
target_ohe (pd.DataFrame) – The one-hot encoded version of the target record.
- Returns:
A tuple containing: - features (list): The extracted features. - col_names (list): The names of the extracted features.
- Return type:
tuple
- lnb.feature_extractors.feature_extractor_distances(synthetic_df: DataFrame, target_record_ohe: DataFrame)
Extract features based on cosine similarity between synthetic data and target record
- Parameters:
synthetic_df (pd.DataFrame) – synthetic data
target_record_ohe (pd.DataFrame) – one-hot encoded target record
- Returns:
extracted features and names
- Return type:
tuple
- lnb.feature_extractors.feature_extractor_queries_CQBS(synthetic_df: DataFrame, target_record: DataFrame, queries: list)
Extract query-based features
- Parameters:
synthetic_df (pd.DataFrame) – synthetic data
target_record (pd.DataFrame) – target record
queries (list) – queries
- Returns:
extracted features and names
- Return type:
tuple
- lnb.feature_extractors.feature_extractor_topX_full(synthetic_df: DataFrame, target_record_ohe: DataFrame, top_X: int = 50)
Extract features based on top X cosine similarity between synthetic data and target record
- Parameters:
synthetic_df (pd.DataFrame) – synthetic data
target_record_ohe (pd.DataFrame) – one-hot encoded target record
top_X (int, optional) – number of most similar records to consider, defaults to 50
- Returns:
extracted features and names
- Return type:
tuple
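A rough illustration of a top-X cosine-similarity feature follows. It assumes that the features are simply the X largest similarity values between the one-hot encoded target record and each synthetic record, sorted in descending order. The documentation does not state this, and top_x_cosine_sketch is a hypothetical name.

```python
import numpy as np


def top_x_cosine_sketch(synthetic, target, top_x=50):
    """Hypothetical sketch: the top_x largest cosine similarities to the target."""
    synthetic = np.asarray(synthetic, dtype=float)
    target = np.asarray(target, dtype=float).ravel()
    norms = np.linalg.norm(synthetic, axis=1) * np.linalg.norm(target)
    # Guard against all-zero rows, whose similarity is set to 0.
    sims = synthetic @ target / np.where(norms == 0, 1.0, norms)
    # Sort descending and keep the top_x values.
    return np.sort(sims)[::-1][:top_x].tolist()


synthetic = [[1, 0, 1], [0, 1, 0], [1, 0, 0], [1, 0, 1]]
target = [[1, 0, 1]]
features = top_x_cosine_sketch(synthetic, target, top_x=3)
print(features)
```

If the synthetic dataset has fewer than top_x records, this sketch returns fewer values. A fixed-length feature vector would need padding, which is omitted here.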
- lnb.feature_extractors.fit_ohe(df: DataFrame, categorical_cols: list, metadata: dict) tuple
- lnb.feature_extractors.get_feature_extractors(feature_extractor_names: list) tuple
Given a list of strings or tuples specifying the feature extractors to be used, create a list of the corresponding functions and parameters.
- Parameters:
feature_extractor_names (list) – A list of feature extractors, where each element can be: - A string specifying a feature extractor (e.g., ‘naive’, ‘correlation’, ‘closest_X_full’, ‘all_distances’) - A tuple with a feature extractor name and additional parameters (name, orders, number, conditions)
- Returns:
A tuple containing: - feature_extractors (list): A list of functions (or tuples with functions and parameters) corresponding to the requested feature extractors. - do_ohe (list): A list of boolean values indicating whether one-hot encoding (OHE) should be performed for each feature extractor.
- Return type:
tuple
- lnb.feature_extractors.get_queries(orders, categorical_indices: list, continous_indices: list, num_cols: int, number: int, cat_condition_options: tuple = (-1, 1), cont_condition_options: tuple = (3, -3), random_state: int = 42) list
- Condition options:
0 -> no condition on this attribute
1 -> ==
-1 -> !=
2 -> >
3 -> >=
-2 -> <
-3 -> <=
- Parameters:
orders (list) – list containing numbers of features to consider when extracting queries from feature combinations
categorical_indices (list) – indices of categorical columns
continous_indices (list) – indices of continuous columns
num_cols (int) – number of columns in dataset
number (int) – number of combinations to return
cat_condition_options (tuple, optional) – defaults to (-1, 1)
cont_condition_options (tuple, optional) – defaults to (3, -3)
random_state (int, optional) – random seed, defaults to 42
- Returns:
extracted features and names
- Return type:
list
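The condition codes listed above map naturally onto Python's comparison operators. The sketch below shows one possible way to evaluate a query given as a tuple of codes. Comparing each attribute against the target record's value and counting the matching records is an assumption about how queries are used, and the names CONDITION_OPS, record_matches and count_query are hypothetical.

```python
import operator

# Condition codes as documented; code 0 means "no condition on this attribute".
CONDITION_OPS = {
    1: operator.eq,
    -1: operator.ne,
    2: operator.gt,
    3: operator.ge,
    -2: operator.lt,
    -3: operator.le,
}


def record_matches(record, target, conditions):
    """Return True if the record satisfies every non-zero condition."""
    for value, target_value, code in zip(record, target, conditions):
        if code == 0:
            continue
        if not CONDITION_OPS[code](value, target_value):
            return False
    return True


def count_query(records, target, conditions):
    """Count the records that satisfy the query (assumed usage)."""
    return sum(record_matches(r, target, conditions) for r in records)


records = [(1, 5.0), (1, 2.0), (0, 7.0)]
target = (1, 4.0)

# Query: column 0 == 1 and column 1 >= 4.0
print(count_query(records, target, conditions=(1, 3)))  # 1
```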
generators
- class lnb.generators.baynet
Bases: Generator
This generator is based on BAYNET.
- fit_generate(dataset, metadata, size, seed)
- property label
- class lnb.generators.ctgan
Bases: Generator
This generator is based on CTGAN.
- fit_generate(dataset, metadata, size, seed, epochs=50)
- property label
- lnb.generators.get_generator(name_generator: str, epsilon: float)
Get generator instance
- Parameters:
name_generator (str) – generator name. Supports “identity”, “BAYNET”, “privbayes”, “CTGAN”, “SYNTHPOP”, “INDHIST”
epsilon (float) – epsilon for training differentially private generators
- Returns:
generator instance
- Return type:
- class lnb.generators.identity
Bases: Generator
This generator is the identity generator: it just returns the input dataset.
- fit_generate(dataset, metadata, size, seed)
- property label
- class lnb.generators.indhist
Bases: Generator
This generator is based on INDHIST.
- fit_generate(dataset, metadata, size, seed)
- property label
mia
- lnb.mia.mia(path_to_data: str, path_to_metadata: str, path_to_data_split: str, target_records: list, generator_name: str, n_synth: int | None = None, n_datasets: int = 1000, epsilon: float = 0.0, models: list = ['random_forest', 'logistic_regression'], output_path: str = './output/files/')
Membership Inference Attack (MIA) function to evaluate data privacy risks.
- Parameters:
path_to_data (str) – Path to the data file.
path_to_metadata (str) – Path to the metadata file.
path_to_data_split (str) – Path to the data split information.
target_records (list) – List of target records for MIA.
generator_name (str) – Name of the data generator being used.
n_synth (int) – Number of synthetic records to use. Defaults to the size of df_target if not provided.
n_datasets (int) – Number of datasets to generate. Defaults to 1000.
epsilon (float) – Differential privacy parameter. Defaults to 0.0.
output_path (str) – Path to store output files. Defaults to ‘./output/files/’.
- Returns:
A dictionary containing the MIA results for each target record.
- Return type:
dict
- lnb.mia.train_evaluate_mia(df_aux: DataFrame, df_target: DataFrame, meta_data: list, target_record_id: int, df_eval: DataFrame, generator_name: str, continuous_cols: list, categorical_cols: list, n_synth: int = 1000, n_datasets: int = 1000, seeds_train: list | None = None, seeds_eval: list | None = None, epsilon: float = 0.0, models: list | None = None, cv: bool = False)
Train and evaluate a membership inference attack (MIA) using shadow datasets and target record.
- Parameters:
df_aux (pd.DataFrame) – Auxiliary dataset used for generating shadow datasets.
df_target (pd.DataFrame) – Dataset containing the target record for MIA.
meta_data (list) – Metadata information used for feature extraction and generating synthetic datasets.
target_record_id (int) – The ID of the target record for MIA.
df_eval (pd.DataFrame) – Evaluation dataset used for testing the trained models.
generator_name (str) – Name of the data generator used for generating synthetic datasets.
continuous_cols (list) – A list of column names representing continuous features.
categorical_cols (list) – A list of column names representing categorical features.
n_synth (int, optional) – Number of synthetic records to generate for each shadow dataset (default is 1000).
n_datasets (int, optional) – Number of shadow datasets to generate (default is 1000).
seeds_train (list, optional) – List of seeds used for training dataset generation (default is None).
seeds_eval (list, optional) – List of seeds used for evaluation dataset generation (default is None).
epsilon (float, optional) – Differential privacy parameter for synthetic dataset generation (default is 0.0).
models (list, optional) – A list of model names to use for training the meta-classifier (default is [‘random_forest’]).
cv (bool, optional) – Whether to use cross-validation during model training (default is False).
output_path (str, optional) – Path to save output files (default is ‘./output/files/’).
- Returns:
A tuple containing: - target_record_id (int): The ID of the target record used for MIA. - model_metrics (dict): A dictionary containing AUC and accuracy metrics for each trained model.
- Return type:
tuple
plots
- lnb.plots.calculate_statistics(distances: dict[int, float]) None
Calculate summary statistics of Achilles distances
- Parameters:
distances (dict) – dictionary where key is record id, value is Achilles score
- lnb.plots.mia_results_to_df(mia_results: list[float]) DataFrame
Convert MIA results list to dataframe
- Parameters:
mia_results (list[float]) – list containing MIA scores
- Returns:
dataframe containing MIA results
- Return type:
pd.DataFrame
- lnb.plots.plot_achilles(distances: dict[int, float], n: int) None
Plot histogram of Achilles scores
- Parameters:
distances (dict) – dictionary where key is record id, value is Achilles score
n (int) – number of most vulnerable records to identify
- lnb.plots.plot_mia_scores(mia_results: list[float], output_path: str | None = None) None
Plot MIA scores
- Parameters:
mia_results (list[float]) – list containing MIA scores
output_path (str, optional) – path to save plot, defaults to None
shadow_data
- lnb.shadow_data.generate_dataset_parallel(df_aux: DataFrame, df_target: DataFrame, meta_data: list, target_record: DataFrame, df_eval: DataFrame, in_dataset: bool, generator_name: str, n_synth: int, n_original: int, seeds_train: list, seeds_eval: list, idx: int, shadow_datasets: list, shadow_membership_labels: list, evaluation_datasets: list, evaluation_membership_labels: list, epsilon: float, train: bool)
This function trains one generator and generates a single dataset. The generator is trained on df_target, either with the target record swapped out for a different record sampled from df_eval (in_dataset=False), or on the full df_target containing the target record (in_dataset=True). The generated synthetic dataset and its membership label are placed in the corresponding lists.
- Parameters:
df_target (pandas.DataFrame) – Target dataset.
meta_data (list) – List containing metadata concerning the data (feature types, ranges, etc.), necessary for training synthetic data generators.
target_record (pandas.DataFrame) – DataFrame containing only the target record.
df_eval (pandas.DataFrame) – Evaluation pool to draw the reference record from.
in_dataset (bool) – If True, the target record is in the dataset used to train the synthetic data generator. If False, it is replaced by a reference record randomly sampled from df_eval.
generator_name (str) – Name of the generator used, e.g., ‘SYNTHPOP’, ‘BAYNET’, etc. See reprosyn library for more details.
n_synth (int) – Size of each synthetic dataset.
seeds (list) – List of seed values; length must be equal to n_datasets.
idx (int) – Counter for number of generated synthetic datasets.
synthetic_datasets (list) – List containing all generated synthetic datasets. This function generates a single synthetic dataset and places it in the list, selecting the position according to idx.
membership_labels (list) – List containing membership labels for each synthetic dataset. If the target record is included in the training set, the membership label is 1, otherwise it is 0. Label at index idx in membership_labels corresponds to the synthetic dataset at index idx in synthetic_datasets.
epsilon (float) – Epsilon when training with differential privacy.
- Returns:
synthetic dataset, bool indicating whether the synthetic data generator was trained on the target record, bool indicating whether the synthetic dataset is used for MIA training or evaluation
- Return type:
tuple
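The in_dataset logic described above can be sketched as follows. Only the construction of the generator's training set and its membership label is shown. Training the generator and sampling synthetic data are omitted. The name build_training_set_sketch and its seed argument are hypothetical.

```python
import pandas as pd


def build_training_set_sketch(df_target, target_record, df_eval, in_dataset, seed):
    """Hypothetical sketch: training data for one generator and its label."""
    if in_dataset:
        # The target record stays in the data: membership label 1.
        return df_target.copy(), 1
    # Swap the target record for a reference record drawn from df_eval.
    reference = df_eval.sample(n=1, random_state=seed)
    without_target = df_target.drop(index=target_record.index)
    return pd.concat([without_target, reference]), 0


df_target = pd.DataFrame({"age": [30, 41, 52]}, index=[0, 1, 2])
df_eval = pd.DataFrame({"age": [25, 63]}, index=[10, 11])
target_record = df_target.loc[[1]]

members, label_in = build_training_set_sketch(df_target, target_record, df_eval, True, seed=0)
non_members, label_out = build_training_set_sketch(df_target, target_record, df_eval, False, seed=0)
print(label_in, list(members.index))
print(label_out, list(non_members.index))
```

Swapping rather than removing the target record keeps both training sets the same size, so the two cases differ only in whether the target record is present.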
- lnb.shadow_data.generate_datasets(df_aux: DataFrame, df_target: DataFrame, meta_data: list, target_record_id: int, df_eval: DataFrame, generator_name: str, n_synth: int = 1000, n_original: int = 1000, n_datasets: int = 1000, seeds_train: list | None = None, seeds_eval: list | None = None, epsilon: float = 0.0)
Launch the pipeline to generate evaluation synthetic datasets.
- Parameters:
df_target (pandas.DataFrame) – Target dataset.
meta_data (list) – List containing metadata concerning the data (feature types, ranges, etc.), necessary for training synthetic data generators.
target_record_id (int) – Index of the target record.
df_eval (pandas.DataFrame) – Evaluation pool to draw the reference record from.
generator_name (str) – Name of the generator used, e.g., ‘SYNTHPOP’, ‘BAYNET’, etc. See reprosyn library for more details.
n_synth (int) – Size of each synthetic dataset.
n_datasets (int) – Number of synthetic datasets to generate.
seeds (list) – List of seed values; length must be equal to n_datasets.
epsilon (float) – Epsilon when training with differential privacy.
- Returns:
A list containing tuples of (synthetic dataset, membership label)
- Return type:
list
- lnb.shadow_data.generate_datasets_parallel(df_aux: DataFrame, df_target: DataFrame, meta_data: list, target_record: DataFrame, df_eval: DataFrame, generator_name: str, n_synth: int, n_original: int, n_datasets: int, seeds_train: list, seeds_eval: list, epsilon: float, train: bool)
This function allows evaluation datasets to be generated concurrently. It is not meant to be called directly, but rather from generate_evaluation_datasets. Each launched task corresponds to a single synthetic data generator being trained and used to generate a single synthetic dataset. Exactly half of the synthetic data generators are trained on data including the target record.
- Parameters:
df_target (pandas.DataFrame) – Target dataset.
meta_data (list) – List containing metadata concerning the data (feature types, ranges, etc.), necessary for training synthetic data generators.
target_record (pandas.DataFrame) – DataFrame containing only the target record.
df_eval (pandas.DataFrame) – Evaluation pool to draw the reference record from.
generator_name (str) – Name of the generator used, e.g., ‘SYNTHPOP’, ‘BAYNET’, etc. See reprosyn library for more details.
n_synth (int) – Size of each synthetic dataset.
n_datasets (int) – Number of synthetic datasets to generate.
seeds (list) – List of seed values; length must be equal to n_datasets.
epsilon (float) – Epsilon when training with differential privacy.
- Returns:
A tuple containing: - A list of all generated synthetic datasets. - A list of membership labels for each generated synthetic dataset.
- Return type:
tuple
utils
- lnb.utils.blockPrint()
- lnb.utils.enablePrint()
- lnb.utils.ignore_depreciation()
- async lnb.utils.save_metrics_to_file(file_path: str, data)
Save metrics to file
- Parameters:
file_path (str) – file path
data – data to save
- lnb.utils.str2bool(s: str)
Convert a string to a bool. Used in parser.
- Parameters:
s (str) – “True” or “False”. Raises argparse.ArgumentTypeError if value is neither.
- Returns:
boolean
- Return type:
bool
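A converter of this kind is typically passed to argparse through the type argument. The sketch below follows the documented behaviour (“True” or “False”, otherwise argparse.ArgumentTypeError). The name str2bool_sketch and the --cv flag are hypothetical.

```python
import argparse


def str2bool_sketch(s):
    """Hypothetical sketch: convert "True" or "False" to a bool."""
    if s == "True":
        return True
    if s == "False":
        return False
    raise argparse.ArgumentTypeError(f"expected 'True' or 'False', got {s!r}")


parser = argparse.ArgumentParser()
# The --cv flag is an illustrative example, not a documented option.
parser.add_argument("--cv", type=str2bool_sketch, default=False)

args = parser.parse_args(["--cv", "True"])
print(args.cv)
```

Raising argparse.ArgumentTypeError lets argparse report a clean usage error instead of a traceback when the value is invalid.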
- lnb.utils.str2list(s)
Convert a string to a list. Used in parser.
- Parameters:
s (str) – string
- Returns:
list
- Return type:
list