Community Detection Datasets¶

Due to the sizes of these files, we include here the sources of the datasets we use, rather than the full dataset.

Small Dataset¶

The Small Dataset is derived from the Chesepeake dataset:

Name: Chesepeake
Nodes: 39
Edges: 170
SOD: 22 Kb
Source: Gunrock Benchmark

To download the dataset:

wget https://github.com/gunrock/gunrock/blob/main/datasets/chesapeake/chesapeake.mtx
mv chesapeake.mtx SMALL_chesapeake.mtx

Format: MTX file (the Matrix Market Format <https://people.sc.fsu.edu/~jburkardt/data/mm/mm.html>). It will work with GUNROCK GPU implementations out-of-the-box but needs reformatting to work with CPU implementations.

Medium Dataset¶

Name: email-EU-Core
Nodes: 1,005
Edges: 25,571
SOD: 189 Kb
Source: The Stanford Network Analysis Project

To download the dataset:

wget https://snap.stanford.edu/data/email-Eu-core.txt.gz
gunzip email-Eu-core.txt.gz
mv email-Eu-core.txt MEDIUM_email-Eu-core.txt

Format: .txt adjacency list. It will work out-of-the-box with sequential Louvain but will require conversion to MTX format for GPU.

Large Dataset¶

Name: wiki-talk
Nodes: 2,394,385
Edges: 4,659,565
SOD: 64,911 Kb
Source: Stanford Network Analysis Project

To download the dataset:

wget https://snap.stanford.edu/data/wiki-Talk.txt.gz
gunzip wiki-Talk.txt.gz
mv wiki-Talk.txt LARGE_wiki-Talk.txt

Format: .txt adjacency list with # prepended comments. It will need to be transformed into MTX format to work with GPU implementations.

Huge Dataset¶

The huge dataset is derived from the Twitter benchmark dataset used by the UC Berkeley GAP benchmark.

Name: Twitter RV
Nodes: 61,578,415
Edges: 2,405,026,930
SOD: 25,558,868 Kb
Source: ANLAB KAIST Twitter Dataset

To download the dataset:

wget https://github.com/ANLAB-KAIST/traces/releases/download/twitter_rv.net/twitter_rv.net.00.gz
wget https://github.com/ANLAB-KAIST/traces/releases/download/twitter_rv.net/twitter_rv.net.01.gz
wget https://github.com/ANLAB-KAIST/traces/releases/download/twitter_rv.net/twitter_rv.net.02.gz
wget https://github.com/ANLAB-KAIST/traces/releases/download/twitter_rv.net/twitter_rv.net.03.gz
gunzip -c twitter_rv.net.00 twitter_rv.net.01 twitter_rv.net.02 twitter_rv.net.03 > HUGE_twitter_rv.net

Format: .txt Adjacency List