Community Detection Datasets

Because these files are large, we list the sources of the datasets we use here rather than including the datasets themselves.

Small Dataset

The Small Dataset is derived from the Chesapeake dataset.

To download the dataset:

wget https://github.com/gunrock/gunrock/raw/main/datasets/chesapeake/chesapeake.mtx
mv chesapeake.mtx SMALL_chesapeake.mtx

Format: MTX file (the Matrix Market format, <https://people.sc.fsu.edu/~jburkardt/data/mm/mm.html>). It works with the Gunrock GPU implementations out of the box but needs reformatting to work with the CPU implementations.
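The reformatting for the CPU implementations could look like the sketch below. It assumes the CPU implementations accept the same kind of plain-text file as the Medium Dataset (one pair of node IDs per line) and that they expect 0-based IDs; neither is stated here, so adjust as needed. The function name mtx_to_edge_list and the zero_based parameter are hypothetical. Any weight column is dropped, and for a symmetric matrix each stored entry is written once, without adding the mirrored edge.

```python
def mtx_to_edge_list(mtx_path, txt_path, zero_based=True):
    """Convert a Matrix Market coordinate file to a plain-text edge list.

    Assumption: the target tool wants "u v" pairs, one per line.
    """
    # Matrix Market indices are 1-based; shift them if 0-based IDs are wanted.
    offset = 1 if zero_based else 0
    size_line_seen = False
    with open(mtx_path) as src, open(txt_path, "w") as dst:
        for line in src:
            line = line.strip()
            # Skip the %%MatrixMarket banner, % comments, and blank lines.
            if not line or line.startswith("%"):
                continue
            # The first remaining line is "rows cols nnz", not an edge.
            if not size_line_seen:
                size_line_seen = True
                continue
            fields = line.split()
            u = int(fields[0]) - offset
            v = int(fields[1]) - offset
            dst.write(f"{u} {v}\n")
```

For example, calling mtx_to_edge_list("SMALL_chesapeake.mtx", "SMALL_chesapeake.txt") would write the converted file next to the downloaded one.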

Medium Dataset

To download the dataset:

wget https://snap.stanford.edu/data/email-Eu-core.txt.gz
gunzip email-Eu-core.txt.gz
mv email-Eu-core.txt MEDIUM_email-Eu-core.txt

Format: .txt edge list (one pair of node IDs per line). It works out of the box with sequential Louvain but must be converted to MTX format for the GPU implementations.
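One way to do the MTX conversion is sketched below. It assumes each line of the .txt file holds one "source target" pair of 0-based integer node IDs, and it writes an unweighted, directed ("pattern general") Matrix Market file; whether the GPU implementations expect that exact variant is an assumption. The function name edge_list_to_mtx is hypothetical.

```python
def edge_list_to_mtx(txt_path, mtx_path):
    """Convert a "u v" edge list to a Matrix Market coordinate file."""
    edges = []
    with open(txt_path) as src:
        for line in src:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            edges.append((int(fields[0]), int(fields[1])))

    # Assumption: node IDs are 0-based, so the matrix dimension is max ID + 1.
    num_nodes = (max(max(u, v) for u, v in edges) + 1) if edges else 0

    with open(mtx_path, "w") as dst:
        dst.write("%%MatrixMarket matrix coordinate pattern general\n")
        # Size line: rows, columns, number of stored entries.
        dst.write(f"{num_nodes} {num_nodes} {len(edges)}\n")
        for u, v in edges:
            # Matrix Market indices are 1-based.
            dst.write(f"{u + 1} {v + 1}\n")
```

For example, edge_list_to_mtx("MEDIUM_email-Eu-core.txt", "MEDIUM_email-Eu-core.mtx") would produce the MTX file. This version keeps every edge in memory, which is fine for a graph of this size.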

Large Dataset

To download the dataset:

wget https://snap.stanford.edu/data/wiki-Talk.txt.gz
gunzip wiki-Talk.txt.gz
mv wiki-Talk.txt LARGE_wiki-Talk.txt

Format: .txt edge list with comment lines prefixed by #. It must be converted to MTX format to work with the GPU implementations.
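Because this file is larger and carries # comment lines, a streaming conversion may be preferable. The sketch below reads the file twice: once to count edges and find the largest node ID, and once to write the entries, so the edges are never all held in memory. It assumes whitespace-separated pairs of 0-based integer node IDs and writes a "pattern general" Matrix Market file; the names iter_edges and commented_edge_list_to_mtx are hypothetical.

```python
def iter_edges(path):
    """Yield (u, v) pairs, skipping blank lines and '#' comment lines."""
    with open(path) as src:
        for line in src:
            if line.startswith("#") or not line.strip():
                continue
            u, v = line.split()[:2]
            yield int(u), int(v)


def commented_edge_list_to_mtx(txt_path, mtx_path):
    """Two-pass conversion from a commented edge list to Matrix Market."""
    # Pass 1: count edges and track the largest node ID.
    num_edges = 0
    max_id = -1
    for u, v in iter_edges(txt_path):
        num_edges += 1
        max_id = max(max_id, u, v)
    # Assumption: node IDs are 0-based, so the dimension is max ID + 1.
    num_nodes = max_id + 1

    # Pass 2: write the header, then stream the edges with 1-based indices.
    with open(mtx_path, "w") as dst:
        dst.write("%%MatrixMarket matrix coordinate pattern general\n")
        dst.write(f"{num_nodes} {num_nodes} {num_edges}\n")
        for u, v in iter_edges(txt_path):
            dst.write(f"{u + 1} {v + 1}\n")
```

For example, commented_edge_list_to_mtx("LARGE_wiki-Talk.txt", "LARGE_wiki-Talk.mtx") would write the converted file. Reading the input twice trades some I/O time for a small, constant memory footprint.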

Huge Dataset

The Huge Dataset is derived from the Twitter benchmark dataset used by the UC Berkeley GAP Benchmark Suite.

To download the dataset:

wget https://github.com/ANLAB-KAIST/traces/releases/download/twitter_rv.net/twitter_rv.net.00.gz
wget https://github.com/ANLAB-KAIST/traces/releases/download/twitter_rv.net/twitter_rv.net.01.gz
wget https://github.com/ANLAB-KAIST/traces/releases/download/twitter_rv.net/twitter_rv.net.02.gz
wget https://github.com/ANLAB-KAIST/traces/releases/download/twitter_rv.net/twitter_rv.net.03.gz
gunzip -c twitter_rv.net.00.gz twitter_rv.net.01.gz twitter_rv.net.02.gz twitter_rv.net.03.gz > HUGE_twitter_rv.net

Format: .txt edge list.