Utilities
Causal Inference Helpers
- bayesgm.utils.helpers.get_ADRF(x_values=None, x_min=None, x_max=None, nb_intervals=None, dataset='Imbens')
Compute the values of the Average Dose-Response Function (ADRF).
- Parameters:
x_values (list or np.ndarray, optional) – A list or array of values at which to evaluate the ADRF. If provided, overrides x_min, x_max, and nb_intervals.
x_min (float, optional) – The minimum value of the range (used when x_values is not provided).
x_max (float, optional) – The maximum value of the range (used when x_values is not provided).
nb_intervals (int, optional) – The number of intervals in the range (used when x_values is not provided).
dataset (str, optional) – The dataset name (default: ‘Imbens’). Must be one of {‘Imbens’, ‘Sun’, ‘Lee’}.
- Returns:
true_values – The computed ADRF values.
- Return type:
np.ndarray
Notes
Either x_values or (x_min, x_max, nb_intervals) must be provided.
- Supported datasets:
‘Imbens’: ADRF = x + 2 / (1 + x)^3
‘Sun’: ADRF = x - 1/2 + exp(-0.5) + 1
‘Lee’: ADRF = 1.2 * x + x^3
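The closed forms listed above can be evaluated directly with NumPy. The following is a minimal sketch, not the library's implementation: `adrf_sketch` is a hypothetical name, and building the grid with `np.linspace(x_min, x_max, nb_intervals)` is an assumption about how the range arguments are interpreted.

```python
import numpy as np


def adrf_sketch(x_values=None, x_min=None, x_max=None, nb_intervals=None,
                dataset="Imbens"):
    """Hypothetical stand-in that evaluates the documented ADRF formulas."""
    if x_values is None:
        if x_min is None or x_max is None or nb_intervals is None:
            raise ValueError(
                "Provide x_values or all of x_min, x_max, and nb_intervals.")
        # Assumption: an evenly spaced grid with nb_intervals points.
        x = np.linspace(x_min, x_max, nb_intervals)
    else:
        # x_values takes precedence over the range arguments.
        x = np.asarray(x_values, dtype=float)

    if dataset == "Imbens":
        return x + 2.0 / (1.0 + x) ** 3
    if dataset == "Sun":
        return x - 0.5 + np.exp(-0.5) + 1.0
    if dataset == "Lee":
        return 1.2 * x + x ** 3
    raise ValueError("dataset must be one of {'Imbens', 'Sun', 'Lee'}")


imbens_values = adrf_sketch(x_values=[0.0, 1.0], dataset="Imbens")
lee_values = adrf_sketch(x_min=0.0, x_max=1.0, nb_intervals=2, dataset="Lee")
```

Passing `x_values` skips the range arguments entirely, which mirrors the documented override behavior.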
- bayesgm.utils.helpers.estimate_latent_dims(x, y, v, v_ratio=0.7, z0_dim=3, max_total_dim=64, min_z3_dim=3)
Estimate the latent-dimension split for CausalBGM.
Uses Sliced Inverse Regression (SIR) and PCA to automatically choose the dimensions [z0, z1, z2, z3] for the four latent sub-vectors.
- Parameters:
x (np.ndarray) – Treatment variable with shape (n, 1).
y (np.ndarray) – Outcome variable with shape (n, 1).
v (np.ndarray) – Covariates with shape (n, v_dim).
v_ratio (float, default=0.7) – Cumulative PCA variance ratio used to determine the total latent dimension.
z0_dim (int, default=3) – Fixed dimension for the confounding sub-vector \(Z_0\).
max_total_dim (int, default=64) – Upper bound on the total latent dimension.
min_z3_dim (int, default=3) – Minimum dimension for the residual sub-vector \(Z_3\).
- Returns:
A list [z0_dim, z1_dim, z2_dim, z3_dim].
- Return type:
list
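To illustrate how a cumulative PCA variance ratio such as `v_ratio` can translate into a total dimension, here is a minimal NumPy sketch. It covers only that one step: the SIR step and the way the total is split into z1, z2, and z3 are not described here and are not reproduced. The name `total_dim_from_pca` is hypothetical.

```python
import numpy as np


def total_dim_from_pca(v, v_ratio=0.7, max_total_dim=64):
    """Smallest number of principal components whose cumulative explained
    variance ratio reaches v_ratio, capped at max_total_dim (a sketch)."""
    v = np.asarray(v, dtype=float)
    v_centered = v - v.mean(axis=0, keepdims=True)
    # Squared singular values of centered data are proportional to the
    # variance explained by each principal component.
    s = np.linalg.svd(v_centered, compute_uv=False)
    explained_ratio = s ** 2 / np.sum(s ** 2)
    cumulative = np.cumsum(explained_ratio)
    # First index where the cumulative ratio is >= v_ratio, as a count.
    k = int(np.searchsorted(cumulative, v_ratio)) + 1
    return min(k, len(s), max_total_dim)


# Toy covariates: the first column carries 90% of the total variance.
v_demo = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
dim_70 = total_dim_from_pca(v_demo, v_ratio=0.7)
dim_95 = total_dim_from_pca(v_demo, v_ratio=0.95)
```

A higher `v_ratio` keeps more components, while `max_total_dim` bounds the result regardless of the ratio.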
Image Helpers
- bayesgm.utils.helpers.mnist_mask_indices(shape=(28, 28), mode='hole', center=(14, 14), num_holes=1, hole_size=3, orientation='horizontal', stripe_width=4, stripe_pos=14, seed=None)
Create pixel masks on a 2D grid and return flattened index arrays.
- Parameters:
shape ((H, W)) – Image height and width.
mode (str) –
- One of:
'hole' : mask a square hole of size hole_size × hole_size.
'edge_stripe' : mask a stripe along the edges; choose orientation and stripe_width.
'upper_half' : mask rows [0 : H//2)
'lower_half' : mask rows [H//2 : H)
'left_half' : mask cols [0 : W//2)
'right_half' : mask cols [W//2 : W)
center (tuple (row, col)) – Center of the hole when mode='hole'.
hole_size (int) – Side length of each square hole (odd values work best).
orientation (str) – Which edges to mask for mode='edge_stripe'. 'horizontal' masks a horizontal stripe; 'vertical' masks a vertical stripe.
stripe_width (int) – Stripe thickness in pixels (for edge stripes).
stripe_pos (int) – Position of the stripe when mode='edge_stripe'.
seed (int or None) – RNG seed for reproducibility.
- Returns:
ind_x1 (np.ndarray (1D, dtype=int)) – Flattened indices of unmasked pixels.
ind_x2 (np.ndarray (1D, dtype=int)) – Flattened indices of masked pixels.
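The relationship between a 2D mask and the two flattened index arrays can be sketched as follows for a single square hole. This is an illustration under stated assumptions, not the library's code: `hole_mask_indices_sketch` is a hypothetical name, the flattening is assumed to be row-major (index = row * W + col), and the hole is assumed to extend `hole_size // 2` pixels above and to the left of `center`.

```python
import numpy as np


def hole_mask_indices_sketch(shape=(28, 28), center=(14, 14), hole_size=3):
    """Return (unmasked, masked) flattened indices for one square hole."""
    h, w = shape
    half = hole_size // 2
    r0, c0 = center
    # Assumption: the hole spans hole_size pixels starting half a side
    # before the center, clipped to the image bounds.
    r_start, r_end = max(r0 - half, 0), min(r0 - half + hole_size, h)
    c_start, c_end = max(c0 - half, 0), min(c0 - half + hole_size, w)

    mask = np.zeros((h, w), dtype=bool)
    mask[r_start:r_end, c_start:c_end] = True

    flat = mask.ravel()  # row-major flattening
    ind_x1 = np.flatnonzero(~flat)  # unmasked pixels
    ind_x2 = np.flatnonzero(flat)   # masked pixels
    return ind_x1, ind_x2


ind_x1, ind_x2 = hole_mask_indices_sketch()
```

The two arrays partition the flattened image, so indexing a flattened image with `ind_x1` and `ind_x2` yields the observed and hidden pixels respectively.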
Data I/O
- bayesgm.utils.data_io.save_data(fname, data, delimiter='\t')
Save the data to the specified path.
- Parameters:
fname (str) – The file name or path where the data will be saved.
data (np.ndarray) – The data to save.
delimiter (str, optional) – The delimiter for saving .txt or .csv files (default: '\t').
- Raises:
ValueError – If the file extension is not recognized.
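Dispatching on the file extension, as the description above implies, might look like the sketch below. It is not the library's implementation: `save_array_sketch` is a hypothetical name, and the handling of `.npy` is an assumption, since only `.txt` and `.csv` are named explicitly.

```python
import os
import tempfile

import numpy as np


def save_array_sketch(fname, data, delimiter="\t"):
    """Save an array, choosing the writer from the file extension."""
    ext = os.path.splitext(fname)[1].lower()
    if ext == ".npy":
        # Assumption: a binary NumPy format is among the accepted ones.
        np.save(fname, data)
    elif ext in (".txt", ".csv"):
        np.savetxt(fname, data, delimiter=delimiter)
    else:
        raise ValueError(f"Unrecognized file extension: {ext!r}")


# Round trip through a temporary .csv file.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "demo.csv")
    save_array_sketch(csv_path, np.arange(6.0).reshape(2, 3), delimiter=",")
    reloaded = np.loadtxt(csv_path, delimiter=",")
```

Checking the extension before writing means an unsupported name fails early, without creating a file.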
- bayesgm.utils.data_io.parse_file(path, sep='\t', header=0, normalize=True)
Parse an input data file and return a single data matrix.
This is a general-purpose loader for the BGM model, where the input is a single data matrix (as opposed to the causal triplet format with treatment, outcome, and covariates).
- Parameters:
path (str) – Path to the input file. Supported formats: .npz, .csv, .txt.
sep (str, optional) – Separator for .csv or .txt files. Default is tab-delimited.
header (int or None, optional) – Row number to use as column names in .csv files. Default is 0.
normalize (bool, optional) – If True, the data will be normalized using StandardScaler.
- Returns:
data – The data matrix with shape (n_samples, n_features), dtype float32.
- Return type:
np.ndarray
Examples
>>> data = parse_file("data.csv", sep=',', normalize=True)
>>> data = parse_file("data.npz", normalize=False)
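For readers unfamiliar with StandardScaler, its default behavior is to rescale each column to zero mean and unit variance using the population standard deviation. The sketch below shows the NumPy equivalent; `standardize_columns` is a hypothetical helper, and the float32 cast is included only to mirror the documented return dtype.

```python
import numpy as np


def standardize_columns(data):
    """Column-wise standardization equivalent to StandardScaler defaults."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)  # population standard deviation (ddof=0)
    # Constant columns are only centered; dividing by zero is avoided.
    std[std == 0] = 1.0
    return ((data - mean) / std).astype(np.float32)


raw = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
scaled = standardize_columns(raw)
```

After scaling, every non-constant column has mean 0 and standard deviation 1, and the constant second column becomes all zeros.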
- bayesgm.utils.data_io.parse_file_triplet(path, sep='\t', header=0, normalize=True)
Parse an input file and extract the (treatment, outcome, covariates) triplet for CausalBGM model training or evaluation.
- Parameters:
path (str) – Path to the input file. The file can be in .npz, .csv, or .txt format.
sep (str, optional) – Separator used in .csv or .txt files. Defaults to tab-delimited format.
header (int or None, optional) – Row number to use as column names in .csv files. Default is 0 (the first row). Use None if the file does not have a header.
normalize (bool, optional) – If True, the features in v will be normalized using StandardScaler.
- Returns:
data_x (np.ndarray) – The treatment variable(s) extracted from the file, reshaped to (-1, 1).
data_y (np.ndarray) – The outcome variable(s) extracted from the file, reshaped to (-1, 1).
data_v (np.ndarray) – Covariates extracted from the file. Normalized if normalize=True.
Notes
- Supported file formats:
.npz: NumPy compressed files with keys x, y, and v.
.csv: Comma-separated value files with treatment, outcome, and covariates as columns.
.txt: Tab- or other character-delimited text files with similar structure to .csv.
The input file must exist at the specified path.
The first column is assumed to be the treatment variable (x).
The second column is assumed to be the outcome variable (y).
Remaining columns are assumed to be covariates (v).
Examples
# Example for .csv input
data_x, data_y, data_v = parse_file_triplet("data.csv", sep=',', header=0, normalize=True)
# Example for .npz input
data_x, data_y, data_v = parse_file_triplet("data.npz", normalize=False)
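The column convention described in the Notes (first column is treatment, second is outcome, the rest are covariates) amounts to the slicing below. This sketch works on an in-memory array rather than a file, leaves out normalization, and uses a hypothetical name, `split_triplet_sketch`.

```python
import numpy as np


def split_triplet_sketch(table):
    """Split a 2D table into (treatment, outcome, covariates)."""
    table = np.asarray(table, dtype=np.float32)
    data_x = table[:, 0].reshape(-1, 1)  # first column: treatment
    data_y = table[:, 1].reshape(-1, 1)  # second column: outcome
    data_v = table[:, 2:]                # remaining columns: covariates
    return data_x, data_y, data_v


table = np.array([
    [0.1, 1.0, 5.0, 6.0],
    [0.2, 2.0, 7.0, 8.0],
    [0.3, 3.0, 9.0, 10.0],
])
data_x, data_y, data_v = split_triplet_sketch(table)
```

Reshaping to (-1, 1) keeps the treatment and outcome as column vectors, matching the documented return shapes.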