Collecting basic statistics
In a series of posts, I will provide an overview of several machine learning approaches to learning from graph data. Starting with basic statistics that are used to describe graphs, I will go deeper into the subject by discussing node embeddings, graph kernels, graph signal processing, and eventually graph neural networks. The posts are intended to reflect on my personal experience in academia and industry, including some of my research papers. My main motivation is to present first some basic approaches to machine learning on graphs that should be used before digging into advanced algorithms like graph neural networks.
In the first post, I present some common techniques for graph analysis that should help us better understand our data.
I assume familiarity with basic graph concepts. I will consider undirected graphs without self-loops (nodes connected to themselves by an edge) but this is only for the ease of presentation. A graph is denoted by G=(V, E) where V is the set of nodes or vertices, and E is the set of edges. The number of nodes is denoted by n = |V| and the number of edges by m = |E|.
What graph statistics to look for?
Before starting to work on a problem, we usually invest some time in exploratory data analysis. This provides us with insights into the data that might be crucial for our later work. Ideally, we would like to plot a graph like we do for a subway map but most real-life graphs are large and so easy to be visually represented. Instead, here are some important statistics to look for:
- Degree distribution. The degree distribution of two real-life networks is shown in Figure 1. The first graph comes from email correspondence in an EU institution, and the second graph is the co-authorship graph from the DBLP database. We observe that the most common node degree in the email graph is 1, meaning that most people only communicate with a single person. In the co-authorship graph most authors have two or three co-authors.
- Graph density. The quantity is defined as
The explanation is simple: the maximum possible number of edges is n*(n-1)/2 as every node can be connected to exactly n-1 other nodes and the edges are undirected. It measures how close is the graph to a fully connected graph, or a clique. Most real-life networks are very sparse, thus the measure is in most cases not very informative.
- Connected components. We…
Continue reading: https://towardsdatascience.com/machine-learning-on-graphs-part-1-9ec3b0bd6abc?source=rss—-7f60cf5620c9—4