Subgraph Matching Datasets

Due to the sizes of these files, we include here the sources of the datasets we use, rather than the full dataset.

Small Dataset

To download the dataset:

wget http://liris.cnrs.fr/csolnon/newSIPbenchmarks.tgz
tar -zxvf newSIPbenchmarks.tgz
rm newSIPbenchmarks.tgz
cd newSIPbenchmarks.tgz/scalefree/A.01
cp -r newSIPbenchmarks.tgz/scalefree/A.01 SMALL_A.01

Format: Unsure of the format, but needs to be converted into something that the fv3 algorithm can eat using our grf_converter script. Assuming you are in the subgraph_matching directory:

python ../../Code/Data_Analysis/grf_converter.py -i target -o target.grf

Pattern

  • Number of Subgraphs: 180
  • Number of Nodes: 180
  • Number of Edges: 180
  • SOD: 4 Kb
  • Source: Same as A.01 Target. To convert the pattern file into the graph file, assuming you are in the subgraph_matching directory:
python ../../Code/Data_Analysis/grf_converter.py -i pattern -o pattern.grf

Medium Dataset

Target

  • Name: Enron Email Dataset
  • Nodes: 36,692
  • Edges: 183,131
  • SOD: 3,955 Kb
  • Source: Stanford Network Analytics Program (SNAP) <https://snap.stanford.edu/data/email-Enron.html> via the MIT Graph Challenge Datasets

To download the dataset:

wget https://snap.stanford.edu/data/email-Enron.txt.gz
gunzip email-Enron.txt.gz
mv email-Enron.txt MEDIUM_email-Enron.txt

Format: Comes in an adjacency list format with some leading comments, can be moved into the required GRF format for vf3 with our converter. Assuming you are in the subgraph_matching directory:

python ../../Code/Data_Analysis/grf_converter.py -i email-Enron.txt -o email-Enron.grf -AL

Pattern

None Provided.

Large Dataset

Target

  • Name: Measurement and Analysis on the Wide Internet (MAWI) Graph 1
  • Nodes: 18,571,154
  • Edges: 38,040,320
  • SOD: 684,690 Kb
  • Source: MIT Graph Challenge Datasets generated from MWAI Archives

To download the dataset:

wget https://graphchallenge.s3.amazonaws.com/synthetic/gc5/201512012345.v18571154_e38040320.tsv
mv 201512012345.v18571154_e38040320.tsv LARGE_201512012345.v18571154_e38040320.tsv

Format: Tab Separated Adjacency List, needs to be converted into a grf format.

Pattern

Huge Dataset

  • Name: Measurement and Analysis on the Wide Internet (MAWI) Graph 5
  • Nodes: 226,196,185
  • Edges: 480,047,894
  • SOD: 10,048,635 Kb
  • Source: MIT Graph Challenge Datasets generated from MWAI Archives

To download the dataset:

wget https://graphchallenge.s3.amazonaws.com/synthetic/gc5/201512020330.v226196185_e480047894.tsv
mv 201512020330.v226196185_e480047894.tsv HUGE_201512020330.v226196185_e480047894.tsv

Format: Tab Separated Adjacency List, needs to be converted into a grf format.