-
Notifications
You must be signed in to change notification settings - Fork 4
/
konect-handbook.tex
5240 lines (4789 loc) · 243 KB
/
konect-handbook.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
%
% This file is released under the Creative Commons CC-BY-SA
% 4.0 License; see:
%
% https://creativecommons.org/licenses/by-sa/4.0/
%
% AUTHORS
% Jérôme KUNEGIS <[email protected]>
% Daniel DÜNKER <[email protected]>
%
\documentclass{article}
\usepackage{hyperref}
\pdfoutput=1
\usepackage[utf8]{inputenc}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{booktabs}
\usepackage{graphicx}
\usepackage{grffile}
\usepackage{subfigure}
\usepackage{marvosym} % for \Clocklogo
\usepackage{color}
\usepackage{textcomp} % for \textlangle, \textrangle
\usepackage{array}
\usepackage{wasysym} % for \newmoon
\usepackage{tikz}
\usepackage{natbib}
\usepackage[nottoc]{tocbibind}
%%\usepackage{wrapfig}
\usepackage{longtable}
\usepackage{multicol}
\usepackage{multirow}
\usepackage{geometry}
\newcommand{\wPlot}{0.45\textwidth}
\newcommand{\wFull}{0.95\textwidth}
\newcommand{\wOnePointFive}{0.8\textwidth}
\newcommand{\wTwo}{0.45\textwidth}
\newcommand{\wThree}{0.31\textwidth}
\newcommand{\wFour}{0.24\textwidth}
\def\mathnote#1{%
\tag*{\rlap{\hspace\marginparsep\smash{\parbox[t]{\marginparwidth}{#1}}}}
}
\input{colors}
\definecolor{InternalLinkColor}{rgb}{0.0, 0.35, 0.0}
\definecolor{ExternalLinkColor}{rgb}{0.0, 0.0, 0.69}
\hypersetup{
colorlinks=true,
citecolor=InternalLinkColor,
urlcolor=ExternalLinkColor,
linkcolor=InternalLinkColor
}
\begin{document}
\title{
\includegraphics[width=2.8cm]{konect-logo} \\
\vspace{0.9cm}
{\Huge Handbook of Network Analysis\footnote{This handbook is
continuously updated; an up-to-date version can always be found at
\href{https://github.com/kunegis/konect-handbook/raw/master/konect-handbook.pdf}{https://github.com/kunegis/konect-handbook/raw/master/konect-handbook.pdf}.}} \\
The KONECT Project \\
\texttt{\href{http://KONECT.cc/}{KONECT.cc}}
}
\author{
Jérôme Kunegis
}
\newgeometry{top=4cm}
\maketitle
\section*{Abstract}
This is the handbook for the KONECT project, a scientific project to
archive network datasets, compute systematic network theoretic
statistics about them, visualize their properties, and provide
corresponding data and Free Software tools to programmers, researchers and teachers in
fields related to network analysis and graph mining, by
Jérôme Kunegis. The name \emph{KONECT} stands for
\emph{\underline{Ko}blenz \underline{Ne}twork \underline{C}ollec\underline{t}ion}, as parts of the KONECT project
were initiated for the PhD thesis of Jérôme Kunegis at
the University of Koblenz--Landau in Germany \citeyearpar{kunegis:phd}.
This handbook
documents all methods, definitions and conventions used in the project,
and serves as a general handbook of network mining, with an emphasis on
spectral graph theoretical methods, i.e., such methods that are based on
the use of specific characteristic matrices of graphs.
KONECT datasets and code is used in academia as the basis for research
on real-world datasets, in education as the basis for teaching, and
in particular it serves as the basis for research projects with the goal
to study large numbers of network datasets. KONECT borrows
datasets from many sources in and outside academia, and lends itself
datasets to other network dataset collection projects.
\thispagestyle{empty}
\restoregeometry
\newpage
\section{Introduction}
Everything is a network~-- whenever we look at the interactions between
things, a network is formed implicitly. In the areas of data mining,
machine learning, information retrieval, etc., networks are modeled
as \emph{graphs}. Many, if not most problem types can be applied to
graphs: clustering, classification, prediction, pattern recognition, and
others. Networks arise in almost all areas of research, commerce and
daily life in the form of social networks, road networks, communication
networks, trust networks, hyperlink networks, chemical interaction
networks, neural networks, collaboration networks and lexical networks.
The content of text documents is routinely modeled as document--word
networks, taste as person--item networks and trust as person--person
networks. In recent years, whole database systems have appeared
specializing in storing networks. In fact, a majority of research
projects in the areas of web mining, web science and related areas uses
datasets that can be understood as networks. Unfortunately, results
from the literature can often not be compared easily because
they use different datasets. What is more, different network datasets
have slightly different properties, such as allowing multiple or only
single edges between two nodes. In order to provide a unified view on
such network datasets, and to allow the application of network analysis
methods across disciplines, the KONECT project defines a comprehensive
network taxonomy and provides a consistent access to network datasets.
To validate this approach on real-world data from the Web, KONECT
also provides a large number (1,000+) of network datasets of different
types and different application areas.
KONECT, the \underline{Ko}blenz \underline{Ne}twork \underline
Collec\underline tion, contains over 1,000 network datasets as of 2017.
In addition to these datasets, KONECT consists of Matlab code to
generate statistics and plots about them, which are shown on the
KONECT
website\footnote{\href{http://konect.cc/}{http://konect.cc/}}.
KONECT contains networks of all sizes, from small classical datasets
from the social sciences such as Kenneth Read's Highland Tribes network
with 16 vertices and 58 edges
(\href{http://konect.cc/networks/ucidata-gama/}{\textsf{HT}}),
to the Twitter social network with 52 million nodes and 1.9 billion
edges
(\href{http://konect.cc/networks/twitter_mpi/}{\textsf{TF}}).
Figure~\ref{fig:scatter.size.avgdegree} shows a scatter plot of all
networks by the number of nodes and the average degree in the network.
Each network in KONECT is represented by a unique two- or
three-character code which we write in a \textsf{sans-serif font}, and
is indicated in parentheses as used previously in this paragraph. The
full list of codes is given
online.\footnote{\href{http://konect.cc/networks/}{http://konect.cc/networks/}}
\begin{figure}
\centering
\includegraphics[width=0.8\textwidth]{plot/scatter.c.size.avgdegree.everything}
\caption[*]{
All networks in KONECT
arranged by the size (the number of nodes) and the
average number of neighbors of all nodes. Each network is
represented by a two- or three-character code. The color of each
code corresponds to the network
category as given in Table~\ref{tab:categories}.
}
\label{fig:scatter.size.avgdegree}
\end{figure}
\subsection{Software and Software Packages}
The KONECT project consists of several components, whose interactions is
summarized in Figure~\ref{fig:organization}. Various parts of the
KONECT project are available at Github, including this Handbook.\footnote{\href{https://github.com/kunegis/konect-analysis}{github.com/kunegis/konect-analysis}}\footnote{\href{https://github.com/kunegis/konect-toolbox}{github.com/kunegis/konect-toolbox}}\footnote{\href{https://github.com/kunegis/konect-handbook}{github.com/kunegis/konect-handbook}}\footnote{\href{https://github.com/kunegis/konect-extr}{github.com/kunegis/konect-extr}}\footnote{\href{https://github.com/kunegis/konect-www}{github.com/kunegis/konect-www}}
\begin{figure}
\centering
\includegraphics[width=\wOnePointFive]{organization.white}
\caption{
\label{fig:organization}
Overview of KONECT's components.
}
\end{figure}
\subsection{History of KONECT}
The roots of KONECT lie in the research of Jérôme Kunegis at the
Technical University of Berlin, within the DAI Laboratory. The first
networks were bipartite rating graphs, collected to support Jérôme
Kunegis' work on collaborative filtering
\citep{kunegis:negative-resistance,kunegis:adapting-ratings}. The
earliest networks include MovieLens, Jester, BookCrossing, and the Netflix Prize
dataset.
What later became known as the KONECT project's collection of networks
properly started out in December 2008,
as evaluation for Jérôme Kunegis' ICML 2009
paper \emph{Learning Spectral Graph Transformations for Link Prediction}
\citep{kunegis:spectral-transformation}, codenamed \emph{Spectral
Transformation}. It then consisted of a collection of network
datasets and spectral link prediction methods, i.e., code based on the
decomposition of various characteristic graph matrices.
The first unipartite networks appeared at this time, one of the earliest
being the trust network of Advogato.
The first dataset crawled specifically for KONECT was the Slashdot Zoo,
for which crawling began in 2008, with the corresponding paper published
a year later \citep{kunegis:slashdot-zoo}.
Later, more datasets
were added and the codebase was called the \emph{Graph Store}. This was
at a time when the word \emph{store} was in fashion, in part due to the
emergence of \emph{app stores}.
In that phase, the project was used for the experiments of several papers in the area of
collaborative filtering and recommender systems.
When Jérôme moved from
TU Berlin to the University of Koblenz--Landau in Koblenz (Germany) the
project was renamed \emph{Web Store}, in line with Koblenz' Institute
for Web Science and Technologies (WeST). The name \emph{KONECT --
\underline{Ko}blenz \underline{Ne}twork
\underline{C}ollec\underline{t}ion} was adopted sometime in 2011.
Jérôme wrote his PhD thesis at the University of Koblenz--Landau in the
same year.
In that phase, most of the mathematical notation of KONECT was
settled. In particular, Jérôme Kunegis' PhD thesis contains a
mathematical glossary that contains many of the symbol in the present
document. That PhD thesis was also the first time that a list of all
networks was generated, before the website went online.
The first KONECT website was created in 2011 at \texttt{konect.uni-koblenz.de}.
In this phase of the project, extraction of datasets and maintenance of the
KONECT website was performed in collaboration with Daniel Dünker, Holger
Heinz, and Martina Sekulla.
Code for dataset extraction and the Matlab Toolbox were first published
on the KONECT website as downloadable tarballs. All published code was
licensed under the GNU General Public License (GPL) version~3 from the beginning.
A short overview paper of the KONECT system was
published in 2013 at the International World Wide Web Conference (WWW),
as part of the Web Observatory Workshop \citep{kunegis:konect}\footnote{
The Google Scholar citation page for that publication gives a list of
papers using KONECT, see
\href{https://scholar.google.com/scholar?cites=7174338004474749050}{https://scholar.google.com/scholar?cites=7174338004474749050}},
and remains the primary publication to cite when using KONECT.
In 2015 and 2016, various parts of the KONECT project were placed on
GitHub, again under the GNU
General Public License version~3, including this handbook.
From 2017 on, the KONECT project continued to be developed at the University of
Namur (Belgium), with web hosting provided by the Institute for Web
Science and Technologies (WeST) at the University of
Koblenz--Landau in Koblenz (Germany). In that phase, the number of
networks, statistics, and plots was yet again increased, and a more
powerful computation server was acquired. The domain name
\href{http://konect.cc/}{\texttt{konect.cc}} was adopted in October
2017, with hosting now switched to the University of Namur.
A tutorial on the use of large network dataset collections, largely based
on KONECT, was given in the same year at the Conference on Information and Knowledge Management (CIKM~2017) under the name \emph{Network Analysis
in the Age of Large Network Dataset Collections -- Challenges,
Solutions and Applications}.\footnote{\href{http://xn.unamur.be/network-collection-tutorial-cikm2017/}{http://xn.unamur.be/network-collection-tutorial-cikm2017/}}
The Stu build
tool\footnote{\href{https://github.com/kunegis/stu}{https://github.com/kunegis/stu}}
was developed in parallel with KONECT. KONECT used \texttt{make(1)} in
the first years of its operation. The Stu build system was initiated in
2014, and KONECT was gradually switched to it, serving as its main use
case. Stu was essentially completed in August 2017 with publication of version
2.5, but continues to see occasional improvements.
As Jérôme Kunegis left academia in 2023 to join the industrial world in 2023,
the KONECT website continues to be available.
\subsection{Status of this Handbook}
This handbook is constantly updated. The most up to date version can
always be found on
GitHub.\footnote{\href{https://github.com/kunegis/konect-handbook/raw/master/konect-handbook.pdf}{https://github.com/kunegis/konect-handbook/raw/master/konect-handbook.pdf}}
Approximately every year, a version of the handbook is published on
arXiv, as a new version of the existing arXiv
entry.\footnote{\href{http://arxiv.org/abs/1402.5500}{http://arxiv.org/abs/1402.5500}}
The handbook is explicitly \emph{not} peer reviewed, as we update it as
needed. Since the handbook serves as the central place in which we add
new definitions, the reader will notice that the prose may vary greatly
from section to section. Also, some notation used may be very establish
and will not change, but other choices in notation are not set in stone,
and we reserve the possibility to change our notation to suit our
needs. However, any change in notation will be reflected in KONECT
source code and on the KONECT world wide web site. While it is
perfectly acceptable to cite this handbook, we urge readers not to refer
to any equation or section number as these \emph{will} change
constantly.
What will \emph{not} change, by design, are all internal
names. For instance, the clustering coefficient will always be denoted
as \texttt{clusco}. Thus, when citing statistics, decompositions or
other items defined in this handbook, please use internal names; these
are always given in a non-proportional font.
Throughout the handbook, we use margin notes to give the internal
\marginpar{\textlangle{}\texttt{name}\textrangle}
names of various parameters, as shown here on the side of this text.
In KONECT, almost everything has an \emph{internal name}, i.e., a
systematic name that is used throughout code, and which is stable.
Internal names are never changed. Therefore, some internal names may be
slightly inaccurate or inconsistent. For purposes of backward
compatibility, this is preferred over changing names. For instance,
KONECT's Enron email network is called \texttt{enron}, the edge weights
of networks with negative edges is called \texttt{signed}, the set
of all unipartite networks is called \texttt{SQUARE}, and the category
of online social networks is called \texttt{Social}. Internal names are
case-sensitive.
Certain parts of the handbook are based on previous work, and certain
parts have been published as papers by the authors. These works are
always cited in the relevant sections, and should be cited in preference
when using the corresponding methods.
\subsection{Structure of this Handbook}
This handbook serves both as a compendium of mathematical definitions
used in the KONECT project, as well as documentation of the software
used in KONECT.
\begin{itemize}
\item Section \ref{sec:taxonomy} -- \textbf{Networks} gives the basic
definition of a network as used in KONECT -- as networks exists in
many different variants such as directed, bipartite, signed, etc., the
exact definitions are often overlooked, but can be crucial for
understanding individual relations. The taxonomy of networks given in
this section serves for the rest of the project. This section also
attempts to illustrate the breadth of the concept of \emph{network},
bridging many different disciplines.
\item Section \ref{sec:definitions} -- \textbf{Graph Theory} gives the
graph-theoretical definitions underlying all topics of network analysis. Again,
the devil is in the details and the graph theory literature is often
split between multiple incompatible definitions for commonly used
terms. This sections presents the terminology and definitions as used
in KONECT, which represents a compromise that has been proven useful
in practice.
\item Section \ref{sec:statistics} -- \textbf{Statistics} is devoted to
network statistics, i.e., numerical measures that characterize a
network as a whole. These are central to network analysis, and most
common network statistics are covered. The section is structured by
the type of analysis underlying the different statistics, roughly
order from simple to complex. Each subsection serves as an
introduction to the underlying topic, although not all subsections
(yet) contain enough exposition to server as a general introduction to
these topics.
\item Section \ref{sec:matrix} -- \textbf{Matrices and Decompositions} reviews
characteristic matrices used to analyse graphs, with a focus on their
decompositions. These are crucial in various types of analyses,
include pairwise node measures such as distances and similarities, as
well as node-based measures such as centralities. Due to the focus of
the KONECT project on such decompositions, this section quite detailed
and complete, although the emphasis is mainly on the eigenvalue and
singular value decompositions, and matrix to which they can be
applied.
\item Section \ref{sec:plots} -- \textbf{Plots} reviews common ways to
visualize properties of a network. Some of those, such as the degree
distribution, are ubiquitous in the literature. Since KONECT plots
are extensively used by the authors in published research, the plots
cover quite distinct areas of network analysis.
\item Section \ref{sec:other} -- \textbf{Other Definitions} covers other
definitions used in KONECT, including node features,
i.e., numerical measures that characterize individual nodes in a
network, and error measures. These are often used as measures of centrality or
importance, etc.
\item Section \ref{sec:toolbox} -- \textbf{The KONECT Toolbox} describes the GNU
Octave and Matlab toolbox that is part of the KONECT project.
\item Section \ref{sec:format} -- \textbf{File Formats} finally documents the
different file formats used by KONECT.
\end{itemize}
\section{Networks}
\label{sec:taxonomy}
Datasets in KONECT represent networks, i.e., a set of nodes connected by
links. Networks can be classified by their format
(directed/undirected/bipartite), by their edge weight types and
multiplicities, by the
presence of metadata such as timestamps and node labels, and by
the types of objects represented by nodes and links.
The full list of networks is given
online.\footnote{\href{http://konect.cc/networks/}{http://konect.cc/networks/}}
\subsection{Format}
The format of a network is always one of the following. The network
formats are summarized in Table~\ref{tab:format}.
\begin{itemize}
\item
In \textbf{undirected networks} (U),
\marginpar{\texttt{sym}}
edges are undirected. That is,
there is no difference between the edge from $u$ to $v$ and the edge
from $v$ to $u$; both are the edge $\{u,v\}$.
An example of an undirected network is the social network of
Facebook
(\href{http://konect.cc/networks/facebook-wosn-wall/}{\textsf{Ow}}),
in which there is no difference between the statements ``A
is a friend of B'' and ``B is a friend of A.''
\item In a \textbf{directed network} (D),
\marginpar{\texttt{asym}}
the links are directed. That is, there is a
difference between the edge $(u,v)$ and the edge $(u,v)$.
Directed networks are sometimes also called \emph{digraphs} (for \emph{directed
graphs}), and their edges \emph{arcs}.
An example of a directed social network is the follower network of
Twitter
(\href{http://konect.cc/networks/twitter_mpi/}{\textsf{TF}}),
in which the fact that user A follows user B does not imply
that user B follows user A.
\item \textbf{Bipartite networks} (B)
\marginpar{\texttt{bip}}
include two types of nodes, and all edges
connect one node type with the other. An example of a bipartite
network is a rating graph, consisting of the node types \emph{user}
and \emph{movie}, and each rating connects a user and a movie
(\href{http://konect.cc/networks/movielens-10m_rating/}{\textsf{M3}}).
Bipartite networks are always undirected in KONECT. Datasets that can
be represented as hypergraphs can also be represented without loss as
bipartite networks, explaining why KONECT does not have a
\emph{hypergraph} format, and also why bipartite networks are common.
\end{itemize}
\begin{table}
\caption{
The network formats allowed in KONECT.
Each network dataset is exactly of one type.
\label{tab:format}
}
\centering
\makebox[\textwidth]{
\begin{tabular}{ c l llll }
\toprule
\textbf{\#} & \textbf{Icon} & \textbf{Type} & \textbf{Edge partition} & \textbf{Edge types} & \textbf{Internal name} \\
\midrule
1 & U & Undirected & Unipartite & Undirected & \texttt{sym} \\
2 & D & Directed & Unipartite & Directed & \texttt{asym} \\
3 & B & Bipartite & Bipartite & Undirected & \texttt{bip} \\
\bottomrule
\end{tabular}
}
\end{table}
\subsection{Weights}
The edge weight and multiplicity types of networks are represented by
one of the following types.
The types of edge weights and multiplicities are summarized in
Table~\ref{tab:weights}.
\begin{itemize}
\item An \textbf{unweighted network} ($-$)
\marginpar{\texttt{unweighted}}
has edges that are
unweighted, and only a
single edge is allowed between any two nodes.
\item In a \textbf{network with multiple edges} ($=$),
\marginpar{\texttt{positive}}
two nodes can be
connected by any number of edges, and all edges are unweighted. This
type of network is also called a multigraph.
\item In a \textbf{positive network} ($+$),
\marginpar{\texttt{posweighted}}
edges are annotated
with positive weights, and only a single edge is allowed between
any node pair. The weight of zero is identified with the lack of an edge
and thus, we require that each edge has a weight strictly larger than
zero, except when the special tag \texttt{\#zeroweight} is defined.
\item In a \textbf{signed network} ($\pm$),
\marginpar{\texttt{signed}}
both positive and negative edges are
allowed. Positive and negative edges are represented by positive and
negative edge weights. Many networks of this type have only the
weights $\pm 1$, but in the general case we allow any nonzero weight;
this distinction is not made.
\item \textbf{Networks with multiple signed edges} ($\stackrel{+}{=}$)
\marginpar{\texttt{multisigned}}
allow multiple edges between two nodes, which may have the same values
as edges in a signed network.
\item \textbf{Rating networks} ($*$)
\marginpar{\texttt{weighted}}
have arbitrary real edge weights. They
differ from positive and signed networks in that the edge weights are
interpreted as an interval scale, and thus the value zero has no
special meaning. Adding a constant to all edge weights does not
change the semantics of a rating network.
Ratings can be discrete, such as the one-to-five star ratings, or
continuous, such as a rating given in percent.
This type of network allows only a single edge between two nodes.
\item \textbf{Networks with multiple ratings} ($_*{}^*$)
\marginpar{\texttt{multiweighted}}
have edges annotated
with rating values, and allow multiple edges between two nodes.
\item \textbf{Dynamic networks} ($\rightleftharpoons$) are networks in
\marginpar{\texttt{dynamic}}
which edges can appear and disappear. They are always
temporal. Individual edges are not weighted.
\end{itemize}
\begin{table}
\caption{
The edge weight and multiplicity types allowed in KONECT. Each
network dataset is exactly of one type.
Note that due to historical reasons, networks with multiple
unweighted edges have the internal name \texttt{positive}, while
positively weighted networks have the internal
\texttt{posweighted}.
For signed networks and positive edge weights, weights of zero are
only allowed when the tag \texttt{\#zeroweight} is set.
\label{tab:weights}
}
\centering
\makebox[\textwidth]{
\begin{tabular}{cllllll}
\toprule
\textbf{\#} & \textbf{Icon} &\textbf{Type} & \textbf{Multiple} & \textbf{Edge weight} &
\textbf{Edge weight} &
\textbf{Internal name} \\
& & & \textbf{edges} & \textbf{range} & \textbf{scale} \\
\midrule
1 & $-$ & Unweighted & No & $\{1\}$ & -- & \texttt{unweighted} \\
2 & $=$ & Multiple unweighted & Yes & $\{1\}$ & -- & \texttt{positive} \\
3 & $+$ & Positive weights & No & $(0, \infty)$ & Ratio scale & \texttt{posweighted}\\
4 & $\pm$ & Signed & No & $(-\infty,+\infty)$ & Ratio scale & \texttt{signed} \\
5 & $\stackrel{+}{=}$ & Multiple signed & Yes & $(-\infty,+\infty)$ & Ratio scale & \texttt{multisigned} \\
6 & $*$ & Rating & No & $(-\infty,+\infty)$ &Interval scale & \texttt{weighted} \\
7 & $_*{}^*$ & Multiple ratings & Yes & $(-\infty,+\infty)$ & Interval scale & \texttt{multiweighted}\\
8 & $\rightleftharpoons$ & Dynamic & Yes & $\{1\}$ & -- & \texttt{dynamic} \\
9 & & Multiple positive weights & Yes & $(0, \infty)$ & Ratio scale & \texttt{multiposweighted} \\
\bottomrule
\end{tabular}
}
\end{table}
\subsection{Temporal Networks}
Furthermore, networks can have one more property:
\begin{itemize}
\item \textbf{Temporal networks} (\Clocklogo) include a timestamp for each
edge, and thus the network can be reconstructed for any moment in the
past. By default, the timestamp is in Unix time, and gives one
timestamp to each edge, denoting when that edge was added, or when the
event represented by the edge took place.
It is unspecified whether timestamps include or not leap seconds. In
practice, no dataset has precise enough information for that to matter.
When the tag \texttt{\#unspecifiedtime} is set, the timestamps are not in Unix
time, but rather in an unspecified monotonous time scale.
The fact that a network is temporal is not saved in any special
location, but is determined by the fact that timestamps are present in
the TSV file, as will be described later.
\end{itemize}
\subsection{Categories}
Finally, the network categories classify networks by the type of data they
represent.
An overview of the categories is given in Table~\ref{tab:categories}.
\begin{table*}
\caption{
The network categories in KONECT.
Each category is assigned a color, which is used in plots, for
instance in Figure~\ref{fig:scatter.size.avgdegree}. The property
icons are defined in Table \ref{tab:weights}. U: Undirected
network, D: Directed network, B: Bipartite network.
\label{tab:categories}
}
\centering
\makebox[\textwidth]{
\input{tex/category-tabular}
}
\end{table*}
%
% The following texts describing each category are extracted
% automatically by konect-www/sh/handbook-category to generate the
% website, and therefore they follow strict formatting rules:
% - Each line begins by \item[LONGNAMEs], where LONGNAME is the long
% name of the category as given in
% konect-toolbox/m/konect_data_category.m
% - Each block ends in an empty line, or a line starting
% with \item or \end
% - Not all Latex markup is supported; see
% konect-www/sh/handbook-category
%
\begin{description}
\item[Affiliation networks] are bipartite networks denoting the
\marginpar{\texttt{Affiliation}}
membership of actors in groups. Groups can be defined as narrowly as
individual online communities in which users have been active
(\href{http://konect.cc/networks/flickr-groupmemberships/}{\textsf{FG}})
or as broadly as countries
(\href{http://konect.cc/networks/dbpedia-country/}{\textsf{CN}}). The
actors are mainly persons, but can also be other actors such as musical
groups. Note that in all affiliation networks we consider, each actor
can be in more than one group, as otherwise the network cannot be
connected.
\item[Animal networks] are networks of contacts between animals.
\marginpar{\texttt{Animal}}
They are the animal equivalent to human social networks. Note that
datasets of websites such as Dogster
(\href{http://konect.cc/networks/petster-friendships-dog/}{\textsf{Sd}})
are \emph{not} included here but in the \texttt{Social} (online social
network) category, since the networks are generated by humans.
\item[Authorship networks] are unweighted bipartite networks consisting
\marginpar{\texttt{Authorship}}
of links between authors and their works. In some authorship networks
such as that of scientific literature
(\href{http://konect.cc/networks/dblp-author/}{\textsf{Pa}}),
works have typically only few authors, whereas works in other
authorship networks may have many authors, as in Wikipedia articles
(\href{http://konect.cc/networks/edit-enwiki/}{\textsf{en}}).
\item[Citation networks] consist of documents that reference each
\marginpar{\texttt{Citation}}
other. The primary example are scientific publications, but the
category also allow patents and other types of documents that
reference each other. The category does not include hyperlink
networks, i.e., web pages that reference each other. Those are in the
category \texttt{Hyperlink}.
\item[Co-authorship networks] are unipartite networks connecting authors
\marginpar{\texttt{Coauthorship}}
who have written works together, for instance academic literature, but
also other types of works such as music or movies. These are also
often called collaboration networks.
\item[Co-citation networks] are unipartite networks of documents,
connected by an edge if they are both cited by a same other document.
As a general rule, co-citation networks are the self-join of citation
networks.
\item[Communication networks] contain edges that represent
\marginpar{\texttt{Communication}}
individual messages between persons. Communication networks are directed
and allow multiple edges.
Examples of communication networks are those of
emails (\href{http://konect.cc/networks/enron/}{\textsf{EN}})
and those of
Facebook messages
(\href{http://konect.cc/networks/facebook-wosn-wall/}{\textsf{Ow}}). Note
that in some instances, edge directions are not
known and KONECT can only provide an undirected network.
\item[Computer networks] are networks of connected computers.
\marginpar{\texttt{Computer}}
Nodes in them are computers, and edges are connections.
When speaking about \emph{networks} in a computer science context, one
often means only computer networks. An example is the internet
topology network (\href{http://konect.cc/networks/topology/}{\textsf{TO}}).
\item[Feature networks] are bipartite, and denote any kind of feature
\marginpar{\texttt{Feature}}
assigned to entities. Feature networks are unweighted and have
edges that are not annotated with edge creation times. Examples are
songs and their genres
(\href{http://konect.cc/networks/dbpedia-genre/}{\textsf{GE}}).
\item[Human contact networks] are unipartite networks of actual contact
\marginpar{\texttt{HumanContact}}
between persons, i.e., talking with each other, spending time
together, or at least being physically close. Usually, these datasets
are collected by giving out RFID tags to people with chips that record
which other people are in the vicinity. Determining when an actual
contact has happened (as opposed to for instance to persons standing
back to back) is a nontrivial research problem.
An example is the Reality Mining dataset
(\href{http://konect.cc/networks/mit/}{\textsf{RM}}).
\item[Human social networks] are real-world social networks between
humans. \marginpar{\texttt{HumanSocial}} The ties are offline,
and not from an online social network. Also, the ties represent a
state, as opposed to human contact networks, in which each edge
represents an event.
\item[Hyperlink networks] are the
networks \marginpar{\texttt{Hyperlink}} of pages on the World Wide Web
or on another system of linked knowledge,
connected by hyperlinks or similar types of links. These are in general directed. Since any
type of information can be represented on the World Wide Web,
hyperlink networks are often simultaneously in another category. For
instance, user pages linked to each other represent a hyperlink
network and a social network at the same time. In such cases, only
the more specific (non-hyperlink) category is used in KONECT.
\item[Infrastructure networks] are networks of physical infrastructure.
\marginpar{\texttt{Infrastructure}}
Examples are road networks
(\href{http://konect.cc/networks/roadNet-CA/}{\textsf{RO}}), airline
connection networks
(\href{http://konect.cc/networks/opsahl-openflights/}{\textsf{OF}}),
and power grids
(\href{http://konect.cc/networks/opsahl-powergrid/}{\textsf{UG}}).
\item[Interaction networks] are bipartite networks consisting of people
\marginpar{\texttt{Interaction}}
and items or between people and other people, where each edge represents an interaction.
In interaction networks, we always allow multiple edges between the
same person--item pair, and an interaction is always to be understood as an event.
Interaction networks can be online or offline.
Examples are
people writing in forums
(\href{http://konect.cc/networks/opsahl-ucforum/}{\textsf{UF}}),
commenting on movies
(\href{http://konect.cc/networks/filmtipset_comment/}{\textsf{Fc}}),
listening to songs
(\href{http://konect.cc/networks/lastfm_song/}{\textsf{Ls}})
and sports results.
\item[Lexical networks] consist of words from natural
\marginpar{\texttt{Lexical}} languages and the relationships between
them. Relationships can be semantic (i.e, related to the meaning of
words) such as the synonym relationship
(\href{http://konect.cc/networks/wordnet-words/}{\textsf{WO}}),
associative such as when two words are associated with each other by
people in experiments
(\href{http://konect.cc/networks/eat/}{\textsf{EA}}), or denote
co-occurrence, i.e., the fact that two words co-occur in text
(\href{http://konect.cc/networks/lasagne-spanishbook/}{\textsf{SB}}).
\item[Metabolic networks] model metabolic pathways,
\marginpar{\texttt{Metabolic}} in which nodes a chemical substances
and edges are often directed and represents chemical interactions. A
subset are protein--protein interaction networks (PPI), in which nodes
are proteins, and which are usually undirected.
\item[Miscellaneous networks] are any networks that do not fit into one
\marginpar{\texttt{Misc}}
of the other categories.
\item[Neural networks] are networks reprensentating the structure of the
brain. Nodes are neurons are higher-level groupings of the brain,
while edges are connections between them. The field concerned with
the network analysis of such structures is called \emph{network neuroscience}.
\item[Online contact networks] consist of people and interactions between
\marginpar{\texttt{OnlineContact}} them. Contact networks are
unipartite and allow multiple edges, i.e., there can always be
multiple interactions between the same two persons. They can be both
directed or undirected.
\item[Physical networks] represent physically existing network
\marginpar{\texttt{Physical}} structures in the broadest sense. This
category covers such diverse data as physical computer networks
(\href{http://konect.cc/networks/topology/}{\textsf{TO}}), transport
networks
(\href{http://konect.cc/networks/opsahl-openflights/}{\textsf{OF}}) and
biological food networks
(\href{http://konect.cc/networks/foodweb-baydry/}{\textsf{FD}}).
\item[Rating networks] consist of assessments given to items by users,
\marginpar{\texttt{Rating}} weighted by a rating value. Rating
networks are bipartite. Networks in which users can rate other users
are not included here, but in the Social category instead. If only a
single type of rating is possible, for instance the ``favorite''
relationship, then rating networks are unweighted. Examples of items
that are rated are movies
(\href{http://konect.cc/networks/movielens-10m_rating/}{\textsf{M3}}),
songs (\href{http://konect.cc/networks/yahoo-song/}{\textsf{YS}}),
jokes (\href{http://konect.cc/networks/jester1/}{\textsf{J1}}), and even
sexual escorts
(\href{http://konect.cc/networks/escorts/}{\textsf{SX}}).
\item[Online social networks] represent ties between
\marginpar{\texttt{Social}} persons in online social networking
platforms. Certain social networks allow negative edges, which denote
enmity, distrust or dislike. Examples are Facebook friendships
(\href{http://konect.cc/networks/facebook-wosn-links/}{\textsf{Ol}}),
the
Twitter follower relationship
(\href{http://konect.cc/networks/twitter_mpi/}{\textsf{TF}}), and
friends and foes on Slashdot
(\href{http://konect.cc/networks/slashdot-zoo/}{\textsf{SZ}}). Note
that some social networks can be argued to be rating networks, for
instance the user--user rating network of a dating site
(\href{http://konect.cc/networks/libimseti/}{\textsf{LI}}). These
networks are all included in the \textsf{Social} category.
Online social networks may be undirected (such as on Facebook),
or directed (such as on Twitter).
For historical reasons, the internal name of this category is
\textsf{Social}, even though it includes only online social networks.
\item[Software networks] are networks of interacting software
\marginpar{\texttt{Software}} component. Node can be software
packages connected by their dependencies, source files connected by
includes, and classes connected by imports.
\item[Text networks] consist of text documents containing words. They
\marginpar{\texttt{Text}} are bipartite and their nodes are documents
and words. Each edge represents the occurrence of a word in a
document. Document types are for instance newspaper articles
(\href{http://konect.cc/networks/gottron-trec/}{\textsf{TR}}) and
Wikipedia articles
(\href{http://konect.cc/networks/gottron-excellent/}{\textsf{EX}}).
\item[Trophic networks] consist of biological
\marginpar{\texttt{Trophic}} species connected by edges denoting which
pairs of species are subject to exchange of substances such as carbon
or nitrogen. In simple cases, these networks can be described as
``who eats whom'', but the category also includes the exchanges of
more specific chemical species.
The term \emph{food chain} may describe such relationships,
but note that in the general case, a trophic network is not a chain,
i.e., it is not linear. Trophic networks are directed. Nodes may be
individual species, may may also be broader or narrower classes of
organisms.
\end{description}
Note that the category system of KONECT is in flux. As networks are
added to the collection, large categories are split into smaller ones.
\subsection{What Is and Is Not a Network}
While the KONECT motto asserts that \emph{everything is a network}, this
does not imply that everything is a complex network. Thus, certain networks
have a structure that is trivial in a way that render many network
analysis methods moot. In KONECT,
we do not include such networtks.
This includes networks without a giant connected component,
in which most nodes are not reachable from each other, and trees, in
which there is only a single path between any two nodes. Note that
bipartite relationships extracted from n-to-1 relationships are
therefore excluded, as they lead to a disjoint network. For instance, a
bipartite person--city network containing \emph{was-born-in} edges would
not be included, as each city would form its own component disconnected
from the rest of the network. On the other hand, a band--country
network where edges denote the country of origin of individual band
members is included, as members of a single band can have different
countries of origin. In fact the Countries network
(\href{http://konect.cc/networks/dbpedia-country/}{\textsf{CN}})
is of this form. Another example is a bipartite song--genre network,
which would only be included in KONECT when songs can have multiple
genres. As an example of the lack of complex structure when only a
single genre is allowed, the degree distribution in such a song--genre
network is skewed because all song nodes have degree one, the diameter
cannot be computed since the network is disconnected, and each connected
component trivially has a diameter of two or less.
Certain other types of structures are equivalent to networks. For
instance, hypergraphs, in which each hyperedge may include any number of
nodes, can be presented equivalently as bipartite graphs, up to certain
small differences described in Section~\ref{sec:definitions}. The same holds e.g.\ for partially-ordered sets and
directed graphs. In KONECT, we choose the ``network'' aspect to model those.
\subsection{Groups}
\label{sec:groups}
Types of networks are separated into groups, as given in
Table~\ref{tab:groups}.
Each groups represents a combination of metadata nased on network
format, weights and other attributes. The importance of groups is that
for most network analysis methods, we can specify to which groups of
networks they apply.
\begin{table}
\centering
\caption{
\label{tab:groups}
The list of groups in KONECT. Groups represent sets of networks
that have similar metadata, based on their format, weights and other
attributes.
}
\makebox[\textwidth]{
\begin{tabular}{ l l }
\toprule
\textbf{Internal name} & \textbf{Definition} \\
\midrule
\texttt{ALL} & All networks \\
\texttt{SYM} & Undirected, unipartite \\
\texttt{ASYM} & Directed, unipartite \\
\texttt{BIP} & Bipartite \\
\texttt{SQUARE} & Unipartite \\
\texttt{NEGATIVE} & Networks where edges can be interpreted to have negative edges \\
\texttt{NONUNWEIGHTED} & Networks where edges can be interpreted to have a weight or multiplicity \\
\texttt{ASYMNEGATIVE} & Directed networks where edges that can be negative \\
\texttt{SQUARENEGATIVE} & Unipartite networks where edges can be negative \\
\texttt{TIME} & Temporal networks \\
\texttt{TIME\_NEGATIVE} & Temporal networks where edges can be negative \\
\texttt{MULTI} & Networks with multiple edges \\
\bottomrule
\end{tabular}
}
\end{table}
\subsection{Tags}
\label{sec:tags}
The following tags can be given to networks. They are declared in the
\texttt{tags} field in the \texttt{meta.*} file for each network.
\begin{itemize}
\item \texttt{\#acyclic}: The network is acyclic. Can only be
set for directed networks. If this is not set, a directed
network must contain at least two pairs of reciprocal edges of
the form $(u,v)$ and $(v,u)$. If the network does not contain
reciprocal edges, but has cycles, the tag
\texttt{\#nonreciprocal} is used.
\item \texttt{\#aggregatetime}: The smallest value in the timestamp column
stands for any earlier time; all timestamp having this smallest value should not be
considered when performing time-based methods and plots. In most
cases, this value is zero, but it can just as well be -1 or the most
negative integer of the integer type used when creating the dataset.
\item \texttt{\#clique}: All possible edges are present, i.e., the
graph is complete. This is \emph{not} the opposite of
\texttt{\#incomplete}. Which edges are taken into account depends on
the type of graph, i.e., whether the graph is bipartite, directed,
allows loops. This does not take into account edge weights and
multiplicities, and in fact this tag only really makes sense when
those are present.
\item \texttt{\#incomplete}: The network is incomplete, i.e.,
the present dataset represents a subset of the actual data. This is
mostly due to the fact that the data was crawled, or aggregated in
another way that cannot guarantee that all nodes and links were seen.
In such graphs, it is not specified \emph{which} nodes and edges are
missing.
This implies that strictly speaking, certain statistics or plots like
the degree distribution are not meaningful, since they may
depend on which parts are missing. In practice, many datasets
fall into this category, and they do not necessarily all have this
tag.
\item \texttt{\#join}: The network is the join (in the
database-theoretical sense) of more
fundamental networks. For instance, a co-authorship network
is a join of the authorship network with itself. In general, all
networks that can be described as being a \emph{co-X network}, where
\emph{X} is some other network, are of this type.
Networks that have this tag may have skewed properties, such as skewed
degree distributions. In KONECT, we recommend to \emph{not} generate
the join of a given network, but to publish the underlying network(s)
itself. However, many networks are only known as their join, and thus
are included in KONECT. (Note that in many cases, the self join has
been used to make a unipartite network out of a bipartite one, in
order to apply network analysis methods that otherwise do not apply to
bipartite networks.)
\item \texttt{\#kcore}: The network contains only nodes with a
certain minimal degree $k$. In other words, the nodes with
degree less than a certain number $k$ were removed from the
dataset. This changes a network drastically, and is called
the ``$k$-core'' of a network. This is sometimes done to get
a less sparse network in applications that do not perform well
on sparse networks. This tag implies the
\texttt{\#incomplete} tag.
\item \texttt{\#lcc}: The dataset actually contains only the
largest connected component of the actual network. Implies
\texttt{\#incomplete}. This tag is not used when the network
is connected for other reasons.
\item \texttt{\#loop}: The network may contain loops, i.e.,
egdes connecting a vertex to itself. This tag is only
allowed for unipartite networks. When this tag is not
present, loops are not allowed, and the presence of loops
will be considered an error by analysis code.
\item \texttt{\#lowmultiplicity}: Set in networks with multiple
edges in which the actual maximal edge multiplicity is very
low. Used to be able to use the maximal multiplicity as a
sanity check. Indicates a dataset error, as edge
multiplicities usually have a power law-like distribution, and
thus very high edge multiplicities are usually present.
\item \texttt{\#missingmultiplicity}: This tag is used when the
underlying network had inherent multiple edges, but these are
not present in the dataset. For instance, any email network
that does not contain multiple edges is tagged with this.
\item \texttt{\#missingorientation}: This tag is used for
undirected networks which are based on an underlying
directed network. For instance, in a citation network, we
may only know that the documents A and B are linked, but not
which one cites the other. In such a case, the network in
KONECT is undirected, although the underlying network is
actually directed.
\item \texttt{\#nonreciprocal}: For directed networks only.
The network does not contain reciprocal edges. This is only
used when the network is non-reciprocal, but does contain
directed cycles. If the network is acyclic,
\texttt{\#acyclic} is used.
\item \texttt{\#unspecifiedtime}: The timestamps do not represent Unix
time, but something else. The only assumption that can be made is
that timestamps are monotonously increasing, i.e., that larger values
denote later times. These could be simply year numbers, number of
days since a starting point, or something else. In networks having
this tag, the number of unique timestamps may be very low, making
temporal methods unsuited. For instance, a network that is known from
only two snapshots may have the timestamp values 1 and 2.
\item \texttt{\#path}: The edges represent paths have have been
navigated in some form. Thus, a purely network-analysis approach will
fail to take into account the paths and only consider the
adjacencies.
\item \texttt{\#regenerate}: The network can be regenerated
periodically and may be updated when a more recent dataset
becomes available.
\item \texttt{\#skew}: The graph is directed,
and can be interpreted such that an inverted edge is the same than a
negative edge. This applies for instance to sports results network,
where a directed edge means A won against B, but could be equivalently
expressed as a negative edge from B to A. It also applies to networks
denoting dominance behavior, in particular between animals.
\item \texttt{\#tournament}: The graph is directed and for each
pair of nodes $\{u,v\}$, either the directed edge $u \rightarrow v$ or
the directed edge $v \rightarrow u$ exists, but not both. It
is an error for a non-directed graph to have this tag. If
\texttt{\#tournament} is defined, then
\texttt{\#nonreciprocal} must also be defined. Also, the graph must
not contain loops, and thus
\texttt{\#loop} must not be defined.
\item \texttt{\#trianglefree}: The network is triangle-free, i.e., the
network does not contain any triangles. Can only be used on
unipartite networks. By default, it is an error for a unipartite
network to not contain triangles. This tag allows a network to be
triangle-free.
\item \texttt{\#zeroweight}: Must be set if it is allowed for edge
weights to be zero. Only used for networks with positive edge
weights and signed/multisigned networks.
\end{itemize}
\section{Graph Theory}
\label{sec:definitions}
The areas of graph theory and network analysis are young, and