5 One-mode visualization
5.1 Introduction
5.1.1 Preparing to work with network
For the remainder of this tutorial we are going to stick to working with the network package, even though the functions we will use for visualization are compatible with both network and igraph objects. I propose using network objects because they are compatible with more advanced statistical analysis provided through the statnet suite of packages. igraph objects should also work with the ggraph function, but because the syntax for working with the objects is different, the code used to do that would need to change.
To continue working with network objects and ggraph, we want to load in a few more packages. The sna package provides us with several functions for calculating network statistics, and dplyr and magrittr are packages from the tidyverse that will help us streamline our plotting.
library(sna)
library(dplyr)
library(magrittr)
Because we will be using exclusively network objects, we want to detach igraph before we continue further. This is because there are several commonly-used network functions in both igraph and sna that mask one another. For instance:
sna::degree()
igraph::degree()
The igraph degree function will work only on igraph objects and the sna degree function will work only on network objects. To avoid confusion and masking in R, we are going to detach the igraph package and work only with network objects and compatible packages like sna.
detach("package:igraph", unload = TRUE)
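If igraph was never attached in your session, the detach() call above stops with an error. A minimal sketch of a guarded version, using only base R, checks the search path first:

```r
# Detach igraph only if it is currently attached;
# calling detach() on a package that is not attached raises an error
if ("package:igraph" %in% search()) {
  detach("package:igraph", unload = TRUE)
}

# Confirm that igraph is no longer on the search path
"package:igraph" %in% search()
```

The final line should return FALSE whether or not igraph was attached beforehand.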
5.2 Guiding questions:
To get started with visualization, let’s remind ourselves of our visualization goals and guiding questions. With our one-mode network, we want to know:
- What is the structure of the collaborative research network in the Delta?
- How have the Delta’s research collaborations changed over time?
5.3 Getting started with ggraph
To introduce the basics of the ggraph package, we’re going to focus on our first question: What is the structure of the collaborative research network in the Delta?
As described in the previous section, ggraph uses the same approach as the ggplot2 grammar of graphics. This means it has three core components: 1.) (network) data, 2.) aesthetic mappings, and 3.) geometries, in this case edges and nodes. As with any ggplot we can add a theme that suits the look of the figure, and ‘void’ themes are often suited for networks.
ggraph(net1) +
geom_node_point() +
geom_edge_link() +
theme_void()
## Using "stress" as default layout
Note that there are some defaults at play here:
- The layout argument is set to ‘auto’, which here resolves to the ‘stress’ layout
- The default aesthetic mappings for geom_node_point are x and y, which do not need to be specified
- The default aesthetic mappings for geom_edge_link are x, y, xend, and yend, which do not need to be specified
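To make these defaults concrete, here is a minimal sketch that writes out the layout and the node aesthetics explicitly. It uses a small toy network (`toy`, a hypothetical object that is not part of the Delta data) so that it can run on its own. The edge aesthetics are left at their defaults.

```r
library(network)
library(ggraph)

# Hypothetical toy network: 5 nodes, 4 undirected ties, node 5 is an isolate
adj <- matrix(0, nrow = 5, ncol = 5)
ties <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4))
adj[ties] <- 1
adj[ties[, 2:1]] <- 1
toy <- network(adj, matrix.type = "adjacency", directed = FALSE)

# Same plot as above, with the layout and node aesthetics spelled out
p <- ggraph(toy, layout = 'stress') +
  geom_node_point(aes(x = x, y = y)) +
  geom_edge_link() +
  theme_void()
p
```

Because the layout is named explicitly, the "Using "stress" as default layout" message should no longer appear.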
5.3.1 Layouts
First, let’s talk about layouts. Networks are typically laid out based on the algorithm you (or the package default) choose. Layouts are a choice based on how you would like to present the network. Because networks depict actors and relationships in a two-dimensional space (x, y), there is no ‘right’ way to plot actors in space (unless, of course, you have actual spatial data associated with your nodes, in which case you could use a geospatial network mapping, e.g. flight route maps).
For scenarios without geospatial coordinates, network theorists have developed layout algorithms which define rules for calculating the x and y coordinates of the nodes. Many layout algorithms aim to represent networks so that strongly connected nodes are plotted in close proximity to one another, representing a ‘core’ and a ‘periphery’.
There is an overview of ggraph’s different layouts presented in this blog post, and descriptions of some of them in the ggraph vignette and documentation, under the layout_tbl_graph_... functions. The layout is selected with the ‘layout’ argument within the ggraph function. Below we present our network with a few different layouts.
Note that there is a whole world of plotting networks using different conceptualizations of nodes and edges. These conceptualizations of nodes and edges may be dependent on certain layout algorithms. For example, if working with a ‘tree’ based layout, which creates more of a hierarchical structure like a dendrogram, edges as lines may not be suited. Instead, you can specify different edges (e.g. diagonals).
For the sake of this workshop we are sticking to more ‘traditional’ network visualization, so we will only be depicting nodes as points and edges as lines.
You can also set a manual layout. Because layouts are just x and y coordinates of points defined by a certain algorithm, you can extract those coordinates using the create_layout() function from ggraph, and fix and/or manipulate those coordinates, if need be. We will deal with this more shortly.
fixed_coord <- create_layout(net1, layout = 'fr')
head(fixed_coord[c(1:2,5)])
## x y name
## 1 -10.3582823 0.6269689 Agricultural Coalitions: Landowners membership fees
## 2 2.3912230 1.0975265 Anchor QEA
## 3 -8.3744602 -4.1790223 Audubon Canyon Ranch
## 4 0.2279849 -2.6019089 Bachand and Associates
## 5 -7.1904394 8.1055097 BTS
## 6 2.6774555 -8.3683646 CalFish
For our one-mode networks, we are going to use the ‘fr’ layout, which is the Fruchterman-Reingold force-directed algorithm. This choice tends to place higher-degree nodes in the center, and low-degree and isolate nodes on the periphery.
ggraph(net1, layout = 'fr') +
geom_node_point() +
geom_edge_link() +
theme_void()
5.3.2 Aesthetics
Now that we’ve chosen a layout, let’s start adding some aesthetic features. This process will look very much like aesthetic mapping in ggplot2: we can assign aesthetics like color, size, shape, etc. to both our edge and node geometries.
Size by degree: One common approach for network visualization is to size nodes by their degree centrality. While more central nodes are already placed near the center by our layout algorithm, it can be helpful to also increase their size to communicate this point. To size by degree, we will want to create a degree variable as a node attribute using the degree function, and assign that attribute to our network data.
net1 %v% 'degree' <- sna::degree(net1)
Hint: if you get Error in degree(net1) : Not a graph object, double check that you have detached the igraph package!
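One detail worth knowing about sna::degree(): its gmode argument defaults to "digraph", which adds in-ties and out-ties together. On an undirected network each tie is then counted twice. The sketch below illustrates this on a hypothetical 4-node star network (`toy`, not part of the Delta data). Node sizes in a plot are unaffected, since every value is scaled by the same factor, but the raw numbers are easier to interpret with gmode = "graph".

```r
library(network)
library(sna)

# Hypothetical toy network: a 4-node undirected star (node 1 tied to 2, 3, 4)
adj <- matrix(0, nrow = 4, ncol = 4)
adj[1, 2:4] <- 1
adj[2:4, 1] <- 1
toy <- network(adj, matrix.type = "adjacency", directed = FALSE)

# Default gmode = "digraph": in-ties plus out-ties
deg_default <- sna::degree(toy)

# gmode = "graph": each undirected tie is counted once
deg_graph <- sna::degree(toy, gmode = "graph")

deg_default
deg_graph
```

In this sketch the first vector should be exactly twice the second.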
Color-blind friendly colors: We’d also like our nodes and edges to be colored differently than the default black, so we can set these colors in our geometries. Because the focus of these networks is on nodes, not edges, we can set our edges to a less pronounced color like grey, and choose a node color from an accessible, color-blind friendly palette. I personally like to use viridis:
viridis::viridis(12)
## [1] "#440154FF" "#482173FF" "#433E85FF" "#38598CFF" "#2D708EFF" "#25858EFF"
## [7] "#1E9B8AFF" "#2BB07FFF" "#51C56AFF" "#85D54AFF" "#C2DF23FF" "#FDE725FF"
The latest version of RStudio lets us see these colors when we write them out in a script, so let’s do that, and assign these colors to an object named clrs. We will be referencing this vector as we start using the palette.
clrs <- c("#440154FF", "#482173FF", "#433E85FF", "#38598CFF",
          "#2D708EFF", "#25858EFF", "#1E9B8AFF", "#2BB07FFF",
          "#51C56AFF", "#85D54AFF", "#C2DF23FF", "#FDE725FF")
We can now integrate these three features: node size, edge color, and node color, into our plot.
ggraph(net1, layout = 'fr') +
geom_node_point(aes(size = degree),
color = clrs[4]) +
geom_edge_link(color = "gray80") +
theme_void()
Notice how layer order matters (as with ggplot2): by having edges layered on top of nodes, we are really hiding the nodes. Let’s switch the order, and also include an alpha argument to add some transparency.
ggraph(net1, layout = 'fr') +
geom_edge_link(color = "gray80") +
geom_node_point(aes(size = degree),
color = clrs[4], alpha = .7) +
theme_void()
We can also add labels and make thematic alterations to features like the legend, just like with ggplot2.
ggraph(net1, layout = 'fr') +
geom_edge_link(color = "gray80") +
geom_node_point(aes(size = degree), color = clrs[4],
alpha = .7) +
theme_void() +
labs(title = "Delta Science Collaborative Research Network") +
theme(legend.position = "none")
Notice that with the ‘fr’ layout (and other layout algorithms with a random component), the coordinates change a bit every time. This is because the algorithm is re-run each time we create a visualization, and it starts from random initial positions. You can set your seed (every time before you plot) to keep it consistent.
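As a minimal sketch of what setting the seed does, the example below computes the ‘fr’ layout twice on a hypothetical toy network (`toy`, a 6-node ring that is not part of the Delta data), resetting the seed before each call. The seed value 26 is arbitrary.

```r
library(network)
library(ggraph)

# Hypothetical toy network: a 6-node undirected ring
adj <- matrix(0, nrow = 6, ncol = 6)
for (i in 1:6) {
  j <- i %% 6 + 1
  adj[i, j] <- 1
  adj[j, i] <- 1
}
toy <- network(adj, matrix.type = "adjacency", directed = FALSE)

# Reset the seed immediately before each layout calculation
set.seed(26)
lay_a <- create_layout(toy, layout = 'fr')
set.seed(26)
lay_b <- create_layout(toy, layout = 'fr')

# With the same seed, the two sets of coordinates should match
all.equal(lay_a$x, lay_b$x)
all.equal(lay_a$y, lay_b$y)
```

The same idea applies to plotting: call set.seed() right before each ggraph() call that uses the ‘fr’ layout.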
5.3.3 Node labels
So far we are getting a clear shape of the network. But related to our first question about understanding network structure, we may want to understand who is central to collaboration. To better identify our nodes, let’s try to add some node text with geom_node_text(). Already, we have a variable that is the name of our vertices:
head(net1 %v% 'name')
## [1] "Agricultural Coalitions: Landowners membership fees"
## [2] "Anchor QEA"
## [3] "Audubon Canyon Ranch"
## [4] "Bachand and Associates"
## [5] "BTS"
## [6] "CalFish"
Let’s add this as a text geometry.
ggraph(net1, layout = 'fr') +
geom_edge_link(color = "gray80") +
geom_node_point(aes(size = degree), color = clrs[4],
alpha = .7) +
theme_void() +
theme(legend.position = "none") +
labs(title = "Delta Science Collaborative Research Network") +
geom_node_text(aes(label = name),
size = 3,
color="black")
Okay, a bit overwhelming. Instead, let’s be selective based on degree. Let’s say we want to take the top 5-degree nodes and label them.
# Extract the network's degree values based on the order of degrees
degs <- (net1 %v% 'degree')[order(net1 %v% 'degree')]
# Then identify the top 5 unique degree values
topdegs <- unique(rev(degs))[1:5]
# Then create a network variable named labels and add the name only if a
# node has the number of degrees in the 'top degrees'
net1 %v% 'labels' <- ifelse((net1 %v% 'degree') %in% topdegs,
                            net1 %v% 'name', '')
Now we have a sparse label attribute.
net1 %v% 'labels'
## [1] "" "" "" "" "" "" "" "CDFW" ""
## [10] "" "" "" "" "" "DWR" "" "" ""
## [19] "" "" "" "" "" "" "" "" ""
## [28] "" "" "" "" "" "" "" "" ""
## [37] "" "" "" "" "" "" "" "" ""
## [46] "" "" "" "" "" "" "" "" ""
## [55] "" "" "" "" "" "" "" "" ""
## [64] "" "" "" "" "" "" "" "" ""
## [73] "" "" "" "" "" "" "" "" ""
## [82] "" "SFEI" "" "" "" "" "" "" ""
## [91] "" "" "" "" "" "" "" "" ""
## [100] "" "" "USBR" "" "" "" "" "" ""
## [109] "" "" "USFWS" "" "USGS" "" "" "" ""
## [118] "" "" "" "" "" "" "" "" ""
## [127] "" "" "" "" "" "" "" "" ""
## [136] "" "" "" ""
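Because this approach selects the top unique degree values rather than the top nodes, tied nodes all get a label, which is why the output above contains six labels rather than five. The base R sketch below shows the same logic on hypothetical values (`deg` and `nm` are made up for illustration), alongside a stricter alternative that labels exactly n nodes.

```r
# Hypothetical degree values and names for six nodes
deg <- c(4, 2, 4, 1, 3, 2)
nm  <- c("A", "B", "C", "D", "E", "F")

# Approach used above: take the top 2 unique degree values
topdegs <- sort(unique(deg), decreasing = TRUE)[1:2]
labels <- ifelse(deg %in% topdegs, nm, "")
labels

# Stricter alternative: label exactly the 2 highest-degree nodes
top_idx <- order(deg, decreasing = TRUE)[1:2]
strict_labels <- ifelse(seq_along(deg) %in% top_idx, nm, "")
strict_labels
```

The first version labels three nodes here because two nodes share the highest degree. Which behavior is preferable depends on whether you would rather show all tied nodes or cap the number of labels.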
We can use this labels attribute to make our figure more easily readable.
ggraph(net1, layout = 'fr') +
geom_edge_link(color = "gray80") +
geom_node_point(aes(size = degree), color = clrs[4],
alpha = .7) +
theme_void() +
theme(legend.position = "none") +
labs(title = "Delta Science Collaborative Research Network") +
geom_node_text(aes(label = labels),
size = 3)
Almost. Let’s include a repel = T argument to make sure the text doesn’t overlap.
ggraph(net1, layout = 'fr') +
geom_edge_link(color = "gray80") +
geom_node_point(aes(size = degree), color = clrs[4],
alpha = .7) +
theme_void() +
theme(legend.position = "none") +
labs(title = "Delta Science Collaborative Research Network") +
geom_node_text(aes(label = labels),
size = 3,
repel = T)
So, what is the structure of the collaborative research network in the Delta? This network shows us the overall picture, and we can use summary statistics to fill in the gaps. Among the 139 organizations involved in scientific research in the Delta, the mean of our degree attribute is about 14. Because sna::degree() defaults to gmode = "digraph", which counts each undirected tie twice, this corresponds to organizations collaborating with about 7 other organizations on average (across one or more projects). The network is quite connected, as the main component includes 107 (77%) of the organizations, with 27 isolates, meaning that 27 organizations have not collaborated at all. Within that main component, the average path length is 2.3, meaning that on average an organization is less than 3 connections away from any other organization. At the center of the network are three federal agencies, the US Geological Survey (USGS), US Fish and Wildlife Service (USFWS), and US Bureau of Reclamation (USBR); two state agencies, the California Department of Fish and Wildlife (CDFW) and California Department of Water Resources (DWR); and one research institute, the San Francisco Estuary Institute (SFEI).
How to calculate network-level statistics with the sna package:
network.size(net1)
## [1] 139
mean(net1 %v% 'degree')
## [1] 13.66906
sum(component.largest(net1))
## [1] 107
length(isolates(net1))
## [1] 27
main_comp <- component.largest(net1, result = 'graph')
mean(geodist(main_comp)[['gdist']], na.rm = T)
## [1] 2.345008
5.3.4 Network plotting function
Now that we’ve got that down as a base, I want to create this network visualization approach as a function so that we can move through other material a little more smoothly. Feel free to just copy this function – all we are doing is taking the code we previously wrote, and replacing the network name that we’ve been using, net1, with the generic argument for the network name, netname.
netplot_function <- function(netname){
  p <- ggraph(netname, layout = 'fr') +
    geom_edge_link(color = "gray80") +
    geom_node_point(aes(size = degree), color = clrs[4],
                    alpha = .7) +
    theme_void() +
    theme(legend.position = "none") +
    labs(title = "Delta Science Collaborative Research Network") +
    geom_node_text(aes(label = labels),
                   size = 3,
                   repel = T)
  return(p)
}
5.3.5 Removing isolates
Before we move too far along in our formatting of this figure, we may want to remove isolates. There are certainly occasions where we want to see isolates in our network, but other times we are interested in the main component. We can identify our isolates and then induce our subgraph with only the non-isolate nodes using the get.inducedSubgraph() function.
isolates(net1)
## [1] 5 6 10 14 16 18 26 33 37 42 46 50 56 58 69 76 77 79 97
## [20] 106 108 115 117 127 131 134 137
noiso <- (1:network.size(net1))[-isolates(net1)]
net1_noiso <- get.inducedSubgraph(net1, noiso)
Now we can see our network without isolates, and quickly use our new netplot_function:
netplot_function(net1_noiso)
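One caveat with the negative-indexing step used to build noiso: if a network happens to have no isolates, isolates() returns an empty vector, and indexing with a negated empty vector selects nothing rather than everything. The base R sketch below illustrates this with hypothetical values (`n_nodes` and `iso` stand in for network.size() and isolates() output) and shows setdiff() as a more defensive alternative.

```r
# Hypothetical case: a 5-node network with no isolates,
# so the vector of isolate ids is empty
n_nodes <- 5
iso <- integer(0)

# Negative indexing with an empty vector selects nothing
keep_neg <- (1:n_nodes)[-iso]

# setdiff() keeps every node when there is nothing to drop
keep_set <- setdiff(seq_len(n_nodes), iso)

length(keep_neg)
length(keep_set)
```

For net1 this makes no difference, since it has 27 isolates, but it matters if the same code is reused on other networks.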
5.4 Longitudinal networks
Let’s now turn to think more deeply about edges with the second guiding question: How have the Delta’s research collaborations changed over time? So far we’ve been looking at all of the research collaborations in the DST database, which range from 1950 to more or less present day. But remember that our data have edge attributes based on when the collaborative project occurred, binned into 4 time periods: Before 1980, 1980-1994, 1995-2009, and 2010-2024 (including ongoing projects).
net1
## Network attributes:
## vertices = 139
## directed = FALSE
## hyper = FALSE
## loops = FALSE
## multiple = FALSE
## bipartite = FALSE
## total edges= 475
## missing edges= 0
## non-missing edges= 475
##
## Vertex attribute names:
## degree labels mode name url vertex.names
##
## Edge attribute names:
## before_1980 Y1980_1994 Y1995_2009 Y2010_2024
To visualize how our networks change over time we are going to be ‘inducing subgraphs’, which is a network phrase for taking slices of our network. Networks can be induced based on certain nodes, which we will experiment with when we look at two mode networks. But for this question we will be making subgraphs based on edge attributes.
5.4.1 Inducing subgraphs by edge attribute
To induce our network based on edge attributes, we’ll want to identify which edges have those attributes. We can identify the edge ids for which each time category is equal to TRUE. We have these as four binary variables, rather than one attribute with four time categories, because organizations can collaborate on projects in more than one time period.
# First, let's give these ids, which will become important later
net1 %v% 'id' <- net1 %v% 'vertex.names'

# Get the edges for each time period
t1 <- which(net1 %e% 'before_1980' == T)
t2 <- which(net1 %e% 'Y1980_1994' == T)
t3 <- which(net1 %e% 'Y1995_2009' == T)
t4 <- which(net1 %e% 'Y2010_2024' == T)
Now that we have our edge ids for each time period (t1 through t4), we can use the get.inducedSubgraph function and identify the edge ids that we’d like to keep in each network.
# Induce subgraphs based on edges
net1_t1 <- get.inducedSubgraph(net1, eid = t1)
net1_t2 <- get.inducedSubgraph(net1, eid = t2)
net1_t3 <- get.inducedSubgraph(net1, eid = t3)
net1_t4 <- get.inducedSubgraph(net1, eid = t4)
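To see what inducing by edge id does on a small scale, here is a minimal sketch on a hypothetical toy network. The object `toy` and the edge attribute `early` are made up for illustration and are not part of the Delta data.

```r
library(network)

# Hypothetical toy network: 5 nodes, ties 1-2, 2-3, and 4-5 (added in that order)
toy <- network.initialize(5, directed = FALSE)
add.edges(toy, tail = c(1, 2, 4), head = c(2, 3, 5))

# Hypothetical binary edge attribute: ties 1-2 and 4-5 are 'early'
toy %e% 'early' <- c(TRUE, FALSE, TRUE)

# Identify the edge ids where the attribute is TRUE, then induce the subgraph
early_ids <- which(toy %e% 'early' == T)
toy_early <- get.inducedSubgraph(toy, eid = early_ids)

network.size(toy_early)       # nodes kept in the subgraph
network.edgecount(toy_early)  # edges kept in the subgraph
```

Node 3 only appears on the tie that is not ‘early’, so it should drop out of the subgraph along with that tie.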
Now we have a slice of the network for each time period. Note that because we did not specify the vertex ids, these networks will include only the nodes that have connections for that time period, and will not include isolates. For example, though our whole collaborative network has 139 nodes, the network from time period 2 has only 30.
net1_t2
## Network attributes:
## vertices = 30
## directed = FALSE
## hyper = FALSE
## loops = FALSE
## multiple = FALSE
## bipartite = FALSE
## total edges= 98
## missing edges= 0
## non-missing edges= 98
##
## Vertex attribute names:
## degree id labels mode name url vertex.names
##
## Edge attribute names:
## before_1980 Y1980_1994 Y1995_2009 Y2010_2024
With these induced networks, an important thing to notice is that the attributes assigned in the complete network remain. This is less important for exogenous, fixed attributes like name, but is important for endogenous/structural attributes like degree. For example, we can check the degree of SFEI in two different networks against the whole network and see that the degree attribute has carried over, but that calculation is no longer correct in the induced networks.
(net1_t1 %v% 'degree')[(net1_t1 %v% 'name' == "SFEI")]
## [1] 60
(net1_t2 %v% 'degree')[(net1_t2 %v% 'name' == "SFEI")]
## [1] 60
(net1 %v% 'degree')[(net1 %v% 'name' == "SFEI")]
## [1] 60
We need to update/reassign any structural values that we calculated for the whole network so that they are accurate for each sub-network.
net1_t1 %v% 'degree' <- degree(net1_t1)
net1_t2 %v% 'degree' <- degree(net1_t2)
net1_t3 %v% 'degree' <- degree(net1_t3)
net1_t4 %v% 'degree' <- degree(net1_t4)
Just as the degree attribute needed to change, we also need to change the label attribute, which was assigned based on degree. We can write a function to do that to avoid repetition across each time period.
label_top_degree <- function(netname, n){
  degs <- (netname %v% 'degree')[order(netname %v% 'degree')]
  topdegs <- unique(rev(degs))[1:n]
  labels <- ifelse((netname %v% 'degree') %in% topdegs,
                   netname %v% 'name', '')
  return(labels)
}
net1_t1 %v% 'labels' <- label_top_degree(net1_t1, 5)
net1_t2 %v% 'labels' <- label_top_degree(net1_t2, 5)
net1_t3 %v% 'labels' <- label_top_degree(net1_t3, 5)
net1_t4 %v% 'labels' <- label_top_degree(net1_t4, 5)
Now that we’ve updated our attributes, let’s plot our subgraphs. We can use the netplot_function that we wrote in Section 5.3.4.
netplot_function(net1_t1) + labs(title = "Collaborative network: Pre 1980")
netplot_function(net1_t2) + labs(title = "Collaborative network: 1980-1994")
netplot_function(net1_t3) + labs(title = "Collaborative network: 1995-2009")
netplot_function(net1_t4) + labs(title = "Collaborative network: 2010-2024")
This is a start, but the visualization challenge here is that it is hard to really detect change because the layout changes every time. Remember, our layout algorithm wants to cluster densely connected nodes, and because those clusters shift across time periods (combined with the random element in the algorithmic calculation itself), the algorithm will move nodes to a new coordinate in each time period. To improve this visualization, then, we want nodes to be in the same position for each subgraph. So next we learn how to fix the coordinates of the nodes across multiple graphs.
5.4.2 Fixing coordinates
With ggraph, we can fix coordinates by creating a layout table from our initial network. Let’s all set the same seed so that we can have the same coordinates across computers. Note that we created an ‘id’ variable earlier based on the vertex name to serve as unique identifiers in these layout tables. We’ll do that again for our no-isolates network.
set.seed(26)
net1_noiso %v% 'id' <- net1_noiso %v% 'vertex.names'
fixed_coord <- create_layout(net1_noiso, layout = 'fr')
head(fixed_coord[,c(1:8)])
## x y degree id labels member_label membership mode
## 1 -4.525570 -9.489269 2 49590 1 1
## 2 -3.393409 -1.165814 10 49592 1 1
## 3 6.230619 -4.040640 2 49593 1 1
## 4 -8.142122 1.745267 8 49594 1 1
## 5 -2.001708 -3.333115 6 49601 1 1
## 6 -2.796420 1.448559 88 49602 CDFW CDFW 2 1
With these coordinates fixed from our full plot, we can then apply those same coordinates for each subgraph. To do that, we’ll first create manual layouts for each subgraph. Next we will subset the relevant coordinates from the full coordinate list using the node ‘id’.
# 1. Create a layout table for the subgraph
coord_t1 <- create_layout(net1_t1, layout = 'fr')
# 2. Subset the relevant coordinates from the full layout table
fixed_coord_t1 <- fixed_coord[fixed_coord$id %in% coord_t1$id, c('x','y', 'id')]
# 3. Assign the fixed coordinates to the subgraph's layout table
coord_t1$x <- fixed_coord_t1$x
coord_t1$y <- fixed_coord_t1$y
Instead of copying and pasting this over again, we’ll write a function to quickly assign the coordinates that we set as fixed to a given subgraph. To do that, we take the code we wrote above but generalize the network and fixed coordinate arguments to ‘netname’ and ‘fixed’. Then we can input any network name and any fixed coordinates, and set them all.
assign_fixed_coords <- function(netname, fixed){
  coord_t <- create_layout(netname, layout = 'fr')
  fixed_coord_t <- fixed[fixed$id %in% coord_t$id, c('x','y')]
  coord_t$x <- fixed_coord_t$x
  coord_t$y <- fixed_coord_t$y
  return(coord_t)
}
coord_t1 <- assign_fixed_coords(net1_t1, fixed_coord)
coord_t2 <- assign_fixed_coords(net1_t2, fixed_coord)
coord_t3 <- assign_fixed_coords(net1_t3, fixed_coord)
coord_t4 <- assign_fixed_coords(net1_t4, fixed_coord)
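The %in% subsetting in this function relies on the full layout table and the subgraph layout table listing their shared nodes in the same order. If that is not guaranteed, match() is a more defensive way to look up each node’s coordinates by id. The base R sketch below uses two hypothetical data frames (`fixed` and `sub`) as stand-ins for the layout tables.

```r
# Hypothetical layout tables: `fixed` for the full network,
# `sub` for a subgraph whose rows come in a different order
fixed <- data.frame(id = c("a", "b", "c", "d"),
                    x  = c(0, 1, 2, 3),
                    y  = c(0, 10, 20, 30))
sub <- data.frame(id = c("d", "b"), x = NA_real_, y = NA_real_)

# match() returns, for each subgraph id, its row position in the full table
idx <- match(sub$id, fixed$id)
sub$x <- fixed$x[idx]
sub$y <- fixed$y[idx]
sub
```

Each row of `sub` picks up the coordinates belonging to its own id, regardless of row order.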
Now we can feed these coordinates directly into the netplot_function and just add new labels. Notice that we can feed these layout data frames directly into the function, as we would a network object. This is a great functionality of ggraph.
netplot_function(coord_t1) + labs(title = "Collaborative network: Pre 1980")
netplot_function(coord_t2) + labs(title = "Collaborative network: 1980-1994")
netplot_function(coord_t3) + labs(title = "Collaborative network: 1995-2009")
netplot_function(coord_t4) + labs(title = "Collaborative network: 2010-2024")
Huh, we’re really close, but something is not quite right yet. Even though the points have the same coordinates, each version of the network does not take up the same amount of space. For example, the subgraph for the 4th time period includes University of Kansas, which is fixed at x = -8.3, so it is one of the left-most points on the network. However, University of Kansas is not included in the subgraph for the 1st time period, and the left-most node in this network is only positioned at x = -4.7. So our issue is that the x and y axis limits adjust based on the data we input. We can fix that with one more layer, specifying the widest range of x and y values in the complete network:
netplot_function(coord_t1) + labs(title = "Collaborative network: Pre 1980") +
xlim(c(-9,7)) + ylim(c(-10,6))
netplot_function(coord_t2) + labs(title = "Collaborative network: 1980-1994") +
xlim(c(-9,7)) + ylim(c(-10,6))
netplot_function(coord_t3) + labs(title = "Collaborative network: 1995-2009") +
xlim(c(-9,7)) + ylim(c(-10,6))
netplot_function(coord_t4) + labs(title = "Collaborative network: 2010-2024") +
xlim(c(-9,7)) + ylim(c(-10,6))
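Rather than reading the limits off the plot by eye, they can also be computed from the full layout table. The base R sketch below uses a hypothetical stand-in data frame (`fixed`) and an arbitrary padding value (`pad`); with the real data you would pass fixed_coord instead.

```r
# Hypothetical stand-in for the full layout table
fixed <- data.frame(x = c(-8.3, 2.4, 6.2),
                    y = c(-9.5, 1.1, 5.4))

# Widest x and y ranges, padded slightly so edge nodes are not clipped
pad <- 0.5
x_lim <- range(fixed$x) + c(-pad, pad)
y_lim <- range(fixed$y) + c(-pad, pad)

x_lim
y_lim
```

These vectors can then be added to each plot with + xlim(x_lim) + ylim(y_lim), so the limits stay correct if the seed or the layout changes.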