The code in this archive was written by Andrew Reid (2024). It is released under the GNU General Public License (GPL-3.0), available at: https://www.gnu.org/licenses/gpl-3.0.html

The software is for demonstrative purposes only. It has not been tested to any degree and should not be used for anything other than learning or teaching.

USAGE:

1. INPUT

Look at smoking_cancer.json or smoking_cancer_latent.json to see how DAGs are specified:

"SmokingNetwork":
    {
        "X": {
            "Name": "Gene X",
            "Parents": [],
            "ProbTrue": 0.2
        },
        "A": {
            "Name": "Addictive Predisposition",
            "Parents": ["Y"],
            "ProbTrue": [0.6, 0.25]
        },
        "S": {
            "Name": "Smoking",
            "Parents": ["A","R"],
            "ProbTrue": [0.6, 0.35, 0.25, 0.06]
        }
    }

The network name is the key for a Bayesian network specification using binary variables. This specification is a dictionary with one key for each node/variable in the network. Each node is specified by three items: Name, Parents, and ProbTrue.

"Name" is the long name of the variable. 

"Parents" is an array of the parents of this node, specified by their keys. A parent A of node B specifies the directed edge A → B.

"ProbTrue" specifies the (conditional) probability of this variable being true. If a single value is provided, this in the marginal probability. If multiple values are provided, this specifies the conditional probability, given that the state of the parent is TRUE (1) or FALSE (0), in that order. If multiple parents exist, joint probabilities are appended to the array in the order they appear in the "Parents" array. For the "Smoking" entry above, "ProbTrue" specifies [P(S|A=1,R=1), P(S|A=1,R=0), P(S|A=0,R=1), P(S|A=0,R=0)]. NOTE: this has only been tested for two parents.

2. OUTPUT

2.1. Binary samples

Binary samples are specified as CSV-format files with headers corresponding to the variable names for a given Bayesian network. Each column takes an integer value of 0 or 1, representing FALSE or TRUE, respectively. The header line is prefaced by a pound sign (#).

Open "blog_data/smoking_sample.csv" in a text editor to see how these data are stored.

2.2. Graphs

Directed graphs are output in JSON format. Open "blog_21_smoking_directed.json" to see an example. You may need to prettify the text.

{
    "nodes": [
        {
            "id": "X",
            "label": "X",
            "fx": 125.0,
            "fy": 0.0
        },
    ],
    "links": [
        {
            "source": "X",
            "target": "L"
        },
    ]
}

The file is a dictionary with two keys: "nodes" and "links":

"nodes" is an array of dictionary items specifying nodes. These have four items:
    "id" is a simple text identifier
    "label" is a label for, e.g., visualization
    "fx" and "fy" is the x- and y-coordinates for rendering; the default layout is a circle
    
"links" is an array of dictionary items specifying directed edges. These have two items:
    "source" is the identifier for the source node
    "target" is the identifier for the target node

Each link specifies the directed edge "source" → "target".
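As an illustration of this structure, a graph file can be turned into a child list per node. The inline JSON below is a made-up two-node example (the coordinates for "L" are invented), not the contents of any file in the archive.

```python
import json

# Hypothetical graph in the format described above.
graph = json.loads("""
{
    "nodes": [
        {"id": "X", "label": "X", "fx": 125.0, "fy": 0.0},
        {"id": "L", "label": "L", "fx": -125.0, "fy": 0.0}
    ],
    "links": [
        {"source": "X", "target": "L"}
    ]
}
""")

# Map each node id to the ids of the nodes it points to.
children = {node["id"]: [] for node in graph["nodes"]}
for link in graph["links"]:
    children[link["source"]].append(link["target"])
```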

3. SCRIPTS

3.1. generate_*.py

This script loads a Bayesian network specified as above and generates a set of samples by pseudorandomly assigning TRUE (1) or FALSE (0) to each variable, according to its marginal or conditional probability. Alter the input/output of this script by modifying the configuration parameters.

Configuration:
network_file: Specifies the JSON-format network input file, as specified above
output_file: Specifies where to write the output
N_samp: The number of samples to generate

Output:
A CSV sample is generated at "output_file"
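The sampling idea can be sketched as follows: visit each node once its parents have been assigned, then draw its state from the matching "ProbTrue" entry. This is a minimal sketch under the input format described in section 1, not the archive's implementation; the toy network and the name sample_once are hypothetical.

```python
import random

# Hypothetical two-node network; the child is listed first on purpose to
# show that parents are resolved before children.
network = {
    "V": {"Name": "Child", "Parents": ["U"], "ProbTrue": [0.6, 0.25]},
    "U": {"Name": "Root", "Parents": [], "ProbTrue": 0.2},
}

def sample_once(network, rng):
    """Draw one joint sample, assigning parents before their children."""
    sample = {}
    pending = list(network)
    while pending:
        progressed = False
        for key in list(pending):
            node = network[key]
            if not all(p in sample for p in node["Parents"]):
                continue  # Wait until every parent has a value.
            prob = node["ProbTrue"]
            if node["Parents"]:
                index = 0
                for p in node["Parents"]:
                    # TRUE (1) is listed first, so it maps to a 0 bit.
                    index = index * 2 + (1 - sample[p])
                prob = prob[index]
            sample[key] = 1 if rng.random() < prob else 0
            pending.remove(key)
            progressed = True
        if not progressed:
            raise ValueError("network has a cycle or an undefined parent")
    return sample

rng = random.Random(0)  # Fixed seed so the run is repeatable.
samples = [sample_once(network, rng) for _ in range(1000)]
```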

3.2. process_*.py

This script runs the PC algorithm on a sample. 

Configuration:
input_file: The sample to process, specified as a CSV file (output of generate_*.py)
output_prefix: A prefix for output files
pthres: The significance threshold for inferring independence

Output:
The script outputs JSON-format graphs for each stage of the process, including pruning and directionality assignment. 

Pruning stage output has the suffix "_pruned_*", where * is a number for the iteration, "Full" for the full undirected graph, or "Final" for the pruned undirected graph.

The final directed graph estimate has the suffix "_directed". Edges whose direction is ambiguous are represented as bidirectional edges.
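To illustrate how a threshold like pthres can be used, the sketch below runs a chi-square test of independence on two binary columns and compares the p-value with the threshold. This README does not state which test the script uses, so this is an assumption; it also covers only the unconditional case, whereas the PC algorithm additionally tests independence conditional on other variables. The function name is hypothetical.

```python
import math

def independence_p_value(xs, ys):
    """p-value of a chi-square independence test for two binary columns."""
    n = len(xs)
    counts = [[0, 0], [0, 0]]
    for x, y in zip(xs, ys):
        counts[x][y] += 1
    row = [sum(counts[0]), sum(counts[1])]
    col = [counts[0][0] + counts[1][0], counts[0][1] + counts[1][1]]
    if 0 in row or 0 in col:
        # A constant column leaves the test undefined.
        raise ValueError("degenerate sample: a column has a single value")
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            chi2 += (counts[i][j] - expected) ** 2 / expected
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))

pthres = 0.05
xs = [0, 0, 1, 1] * 25
p_dependent = independence_p_value(xs, xs)                  # identical columns
p_independent = independence_p_value(xs, [0, 1, 0, 1] * 25) # balanced columns

# An edge would be pruned only when independence is not rejected.
keep_edge = p_dependent < pthres
```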

3.3. test_stability.py

This tests the stability of the algorithm for a given input network, by iterating over samples for multiple sample sizes, and computing accuracy.

Configuration:
network_file: Input network in JSON format
output_file: Output CSV file
N_samp: The number of samples to generate per sample size
pthres: The threshold for determining independence
sizes: A set of sample sizes to iterate over
verbose: Whether to write verbose information to the console
debug: Whether to write debug information to the console
max_iter: Maximum number of iterations to produce new samples when a sample is degenerate

Output:
A CSV file with the following columns:
N: Sample size
mean_agree: Mean proportion of edges in estimate that match ground truth
std_agree: Standard deviation of the same
skipped: Number of samples skipped due to degeneracy
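The agreement columns can be illustrated with a small calculation. The sketch below assumes that edges are compared as ordered (source, target) pairs and that std_agree is the sample standard deviation; the script's exact definitions may differ. The edge sets and the name edge_agreement are hypothetical.

```python
import statistics

def edge_agreement(estimated, truth):
    """Proportion of estimated edges that also appear in the ground truth."""
    estimated, truth = set(estimated), set(truth)
    if not estimated:
        return 0.0  # No estimated edges: nothing can match.
    return len(estimated & truth) / len(estimated)

# Hypothetical ground truth and two estimates for one sample size.
truth = {("U", "V"), ("V", "W")}
estimates = [
    {("U", "V"), ("V", "W")},  # both edges match
    {("U", "V"), ("W", "V")},  # one edge is reversed
]

scores = [edge_agreement(e, truth) for e in estimates]
mean_agree = statistics.mean(scores)
std_agree = statistics.stdev(scores)
```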