Jeffrey Phillips Freeman discussing Bioalgorithms, Machine Learning, Computer Science, Electrical Engineering, HAM Radio and everything science.

Hyperassociative Map Explanation


Almost 8 years ago, on Aug 15, 2009, I invented a new game-changing algorithm called the Hyperassociative Map algorithm. It was released as part of the dANN v2.x library. The HAM algorithm, as it is often called, has since been used by countless developers and in hundreds of projects. HAM is a Graph Drawing algorithm that is similar to force-directed algorithms but in a class all its own. Unlike other force-directed algorithms, HAM has no sense of momentum or acceleration, which makes it debatable whether it can still be called force-directed at all.

Below is a video demonstration of HAM in action. In this 3D visualization the vertices of the graph are displayed as grey spheres; the edges are not rendered. The graph's topology is relatively simple: 128 nodes in groups of 16, layered such that each group is fully connected to each adjacent group. This results in 256 edges between each pair of adjacent groups, and since the two groups on either end are adjacent to only one other group, the graph has 1,792 edges in total. Despite this the graph aligns quickly and smoothly on a single 1 GHz processor, as demonstrated in the video. It starts with randomized locations for each vertex and then aligns. After each alignment the graph is reset with new random starting positions to show that the same alignment is achieved every time.
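If you want to sanity-check that edge count, the arithmetic is simple enough to script. A throwaway Java snippet (the names here are mine, not part of dANN):

```java
public class EdgeCount {
    public static void main(String[] args) {
        int groups = 128 / 16;           // 8 groups of 16 vertices each
        int perAdjacentPair = 16 * 16;   // fully connected pair of groups
        int adjacentPairs = groups - 1;  // the groups form a chain, so 7 pairs
        int total = perAdjacentPair * adjacentPairs;
        System.out.println(total);       // 1792
    }
}
```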

What makes HAM so special is that it retains many of the advantages that have made force-directed algorithms so popular while simultaneously addressing their shortcomings. Wikipedia describes the following advantages of using force-directed algorithms, all of which hold true for the HAM algorithm.

  • Good-quality results - The layouts obtained are usually of very good quality by the following criteria: uniform edge length, uniform vertex distribution and showing symmetry. This last criterion is among the most important ones and is hard to achieve with any other type of algorithm.
  • Flexibility - Force-directed algorithms can be easily adapted and extended to fulfill additional aesthetic criteria. This makes them the most versatile class of graph drawing algorithms. Examples of existing extensions include the ones for directed graphs, 3D graph drawing, cluster graph drawing, constrained graph drawing, and dynamic graph drawing.
  • Intuitive - Since they are based on physical analogies of common objects, like springs, the behavior of the algorithms is relatively easy to predict and understand. This is not the case with other types of graph-drawing algorithms.
  • Simplicity - Typical force-directed algorithms are simple and can be implemented in a few lines of code. Other classes of graph-drawing algorithms, like the ones for orthogonal layouts, are usually much more involved.
  • Interactivity - Another advantage of this class of algorithm is the interactive aspect. By drawing the intermediate stages of the graph, the user can follow how the graph evolves, seeing it unfold from a tangled mess into a good-looking configuration. In some interactive graph drawing tools, the user can pull one or more nodes out of their equilibrium state and watch them migrate back into position. This makes them a preferred choice for dynamic and online graph-drawing systems.
  • Strong theoretical foundations - While simple ad-hoc force-directed algorithms often appear in the literature and in practice (because they are relatively easy to understand), more reasoned approaches are starting to gain traction. Statisticians have been solving similar problems in multidimensional scaling (MDS) since the 1930s, and physicists also have a long history of working with related n-body problems - so extremely mature approaches exist. As an example, the stress majorization approach to metric MDS can be applied to graph drawing as described above. This has been proven to converge monotonically. Monotonic convergence, the property that the algorithm will at each iteration decrease the stress or cost of the layout, is important because it guarantees that the layout will eventually reach a local minimum and stop. Damping schedules cause the algorithm to stop, but cannot guarantee that a true local minimum is reached.

However, the two disadvantages commonly attributed to force-directed algorithms, namely high running time and poor local minima, have been corrected in the HAM algorithm. As described earlier, HAM is not a true force-directed algorithm because it lacks any sense of momentum. This was intentional: it ensures there is no need for a damping schedule to eliminate the oscillations that arise from the momentum of nodes. It has the added advantage that the algorithm does not prematurely come to rest at a local minimum. It also means fewer processing cycles are wasted on modeling oscillations and vibrations throughout the network.

These properties alone already make HAM a worthwhile algorithm for general study and real-world applications; however, it is important to note that HAM was originally designed with a very specific use case in mind: facilitating the distribution of massive real-time graph processing networks. The sort of scenario where each vertex in a graph has to process some input data and produce some output data, and where each vertex is part of a large interdependent graph working on the data in real time. When distributing the tasks across a cluster of computers it is critical that vertices that are highly interconnected reside on the same computer in the cluster, and physically close to the computers housing the vertices that will ultimately receive their data, process it, and carry it through the rest of the network. For this purpose HAM was created to model graphs such that each node in a compute cluster takes ownership of the tasks associated with vertices that are spatially close to each other in HAM's drawing of the compute graph.

In order for HAM to be successful at its job it needed to exhibit a few very specific properties. For starters, the interactivity property mentioned earlier was a must. HAM needed to work with a graph that is constantly changing its topology, with vertices added, removed, or reconfigured in real time. This is ultimately what led the algorithm to be modeled in a way similar to force-directed algorithms.

The other requirement was that the motion of the vertices had to be smooth, without oscillations, as they align. This was critical because if a vertex oscillated near the border that separates one compute node from another, every crossing of that border would cause the associated task to be transferred between nodes in the cluster. Since this is an expensive operation it was important that, as HAM aligned the vertices, they didn't jitter across these borders excessively.

Finally, HAM needed to be parallelized and segmented. That means it needed to scale well for multi-threading, but in such a way that each thread didn't need to be aware of the entire graph in order to process it; instead each thread had to be capable of computing HAM's alignment on an isolated section of the graph. This is obviously critical because of the distributed nature of the compute graph, particularly if we want something capable of unbounded scaling. I basically wanted an algorithm that could be successful on even massively large graphs.

With almost 8 years of testing it has become evident that HAM is at the top of its class among graph drawing algorithms. Despite this it is still scarcely understood by those studying graph drawing algorithms. For this reason I wanted to write this article to share some of its internal workings so others can adapt and play with the algorithm in their own projects.

The Algorithm

In this section I want to get into the internal workings of the Hyperassociative Map algorithm, HAM. Below is the pseudocode breakdown explaining the algorithm. Notice I use some math notation here for simplicity. Most notably I use vector notation, where all variables representing vectors have a small arrow above the variable and the norm, or magnitude, of a vector is represented by double vertical bars on either side, for example \(||\vec{p}||\). If you have trouble with vector notation or just want to see a concrete example, the full working Java code can be found at the end of this article for reference.

\begin{algorithm}
\caption{Hyperassociative Map}
\begin{algorithmic}
% equilibrium distance
\REQUIRE $\tilde{\chi} > 0$
% repulsion strength
\REQUIRE $\delta > 1$
% learning rate
\REQUIRE $\eta = 0.05$
% alignment threshold (determines when graph is aligned)
\REQUIRE $\beta = 0.005$
\PROCEDURE{HAM}{Vertex Set \textbf{as} $g$}
  \STATE \CALL{Randomize}{$g$}
  \WHILE{\CALL{AlignAll}{$g$} $> \beta \cdot \tilde{\chi}$}
    \STATE optionally recenter the graph
  \ENDWHILE
\ENDPROCEDURE
\PROCEDURE{AlignAll}{Vertex Set \textbf{as} $g$}
  \STATE $\zeta = 0$
  \FORALL{$v$ \textbf{in} $g$}
    \STATE $\vec{{\scriptsize \triangle} p} =$ \CALL{Align}{$v$}
    \IF{$||\vec{{\scriptsize \triangle} p}|| > \zeta$}
      \STATE $\zeta = ||\vec{{\scriptsize \triangle} p}||$
    \ENDIF
    \STATE \CALL{Place}{$v$, \CALL{Position}{$v$} $+ \vec{{\scriptsize \triangle} p}$}
  \ENDFOR
  \RETURN $\zeta$
\ENDPROCEDURE
\PROCEDURE{Align}{Vertex \textbf{as} $v$}
  \STATE $\vec{p} =$ \CALL{Position}{$v$}
  \STATE $\vec{{\scriptsize \triangle} p} = \vec{0}$
  \FORALL{$m$ \textbf{in} \CALL{Neighbors}{$v$}}
    \STATE $\vec{q} =$ \CALL{Position}{$m$} $- \vec{p}$
    \STATE $\vec{{\scriptsize \triangle} p} = \vec{{\scriptsize \triangle} p} + \vec{q} \cdot \frac{(||\vec{q}|| - \tilde{\chi}) \cdot \eta}{||\vec{q}||}$
  \ENDFOR
  \FORALL{$m$ \textbf{in} \CALL{NotNeighbors}{$v$}}
    \STATE $\vec{q} =$ \CALL{Position}{$m$} $- \vec{p}$
    \STATE $\vec{c} = \vec{q} \cdot \frac{-\eta}{{||\vec{q}||}^{\delta + 1}}$
    \IF{$||\vec{c}|| > \tilde{\chi}$}
      \STATE $\vec{c} = \vec{c} \cdot \frac{\tilde{\chi}}{||\vec{c}||}$
    \ENDIF
    \STATE $\vec{{\scriptsize \triangle} p} = \vec{{\scriptsize \triangle} p} + \vec{c}$
  \ENDFOR
  \RETURN $\vec{{\scriptsize \triangle} p}$
\ENDPROCEDURE
\PROCEDURE{Randomize}{Vertex Set \textbf{as} $g$}
  \STATE randomize the position of every vertex in $g$
\ENDPROCEDURE
\PROCEDURE{Place}{Vertex \textbf{as} $v$, Vector \textbf{as} $\vec{p}$}
  \STATE set the position of $v$ to $\vec{p}$
\ENDPROCEDURE
\PROCEDURE{Neighbors}{Vertex \textbf{as} $v$}
  \RETURN set of all vertices adjacent to $v$
\ENDPROCEDURE
\PROCEDURE{NotNeighbors}{Vertex \textbf{as} $v$}
  \STATE $s =$ set of all vertices not adjacent to $v$
  \STATE $w =$ set of all vertices whose position is close to that of $v$
  \RETURN $s \cap w$
\ENDPROCEDURE
\PROCEDURE{Position}{Vertex \textbf{as} $v$}
  \RETURN vector representing the position of $v$
\ENDPROCEDURE
\end{algorithmic}
\end{algorithm}

Obviously the pseudocode packs a lot of information into only a few lines, so I'll explain some of the more important parts to give you an idea of what you're looking at.


First, let's consider the constants defined at the beginning. The variable \(\tilde{\chi}\) is called the Equilibrium Distance. It defines the ideal distance between two vertices connected by an edge. If a pair of vertices connected by a single edge are the only vertices present, they will align such that they end up approximately \(\tilde{\chi}\) apart. For simplicity we represent \(\tilde{\chi}\) here as a single constant, but in practice it is also possible to assign a different value to each edge, resulting in a graph with different aesthetic qualities. This value must of course always be a positive number greater than \(0\). The default value, and the one used in the demonstration video, is \(1\).

The second constant is called the Repulsion Strength and is represented by the \(\delta\) variable. This constant determines how strong the repulsion is between two unconnected vertices, that is, two vertices not connected by an edge. Lower values of \(\delta\) result in a stronger repulsion and larger values in a weaker repulsion. The default value is \(2\), and this is the value used in the demonstration video.

Next is the Learning Rate constant, \(\eta\). This is simply a scaling factor applied as the vertices are aligned, ensuring the graph responds to each vertex with equivalent effect rather than over-fitting to the last vertex processed.

The last constant is the Alignment Threshold, \(\beta\); it represents the minimum movement threshold. Once every vertex moves less than this value during an alignment cycle the graph is presumed sufficiently aligned and the loop ends.

Align Procedure

The algorithm itself is represented such that it is broken up into three major procedures. The procedure named HAM is the entry point for the algorithm, the procedure named Align calculates the incremental alignment for a single vertex, and the procedure named AlignAll calculates alignment once for every vertex in the graph.

Let's first explain what is going on in the Align procedure. Here we have a single value being passed in: the vertex to be aligned, \(v\). On line 19 the current position of the vertex is obtained and represented as a Euclidean vector, \(\vec{p}\). Next, on line 20, a zero vector is initialized and represented as \(\vec{{\scriptsize \triangle} p}\). The vector \(\vec{{\scriptsize \triangle} p}\) is calculated throughout the procedure and represents the desired change to the vertex's current position \(\vec{p}\). In other words, once \(\vec{{\scriptsize \triangle} p}\) is calculated it can be added to the current position of the vertex to produce the vertex's new position. Just as if we were calculating forces, \(\vec{{\scriptsize \triangle} p}\) is the sum of all the composite influences acting on the vertex; it represents the overall influence exerted on the vertex at any time.

When calculating \(\vec{{\scriptsize \triangle} p}\) the procedure must iterate through all the other vertices that have an effect on the vertex being aligned. There are two types of vertices, each with different effects: neighbors and non-neighbors. Neighbors are all the vertices connected directly to the current vertex by an edge; non-neighbors are all the other vertices, those not connected by an edge.

First the influence from the neighbor vertices is calculated on lines 21 - 24. The influence two neighbor vertices have on each other depends on how far apart they are. If they are closer than the Equilibrium Distance, \(\tilde{\chi}\), the effect is repulsive; if they are farther apart than \(\tilde{\chi}\), the effect is attractive. The calculation is on line 23. It takes the vector representing the difference between the position of the vertex being aligned and that of its neighbor, and rescales it to the magnitude \((||\vec{q}|| - \tilde{\chi}) \cdot \eta\). To look at it another way, if the magnitude were just \(||\vec{q}|| - \tilde{\chi}\) then the vertex's new position would land exactly at the Equilibrium Distance; instead it is scaled down to a fraction of this by \(\eta\), which adjusts how quickly the vertex approaches its equilibrium point.
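To make the neighbor calculation concrete, here it is in isolation as a tiny self-contained Java sketch. The names are my own and it uses plain 2D double arrays rather than dANN's Vector class:

```java
public class NeighborStep {
    // Influence a neighbor at position q exerts on a vertex at position p:
    // the difference vector q - p rescaled to magnitude (||q - p|| - chi) * eta.
    static double[] neighborInfluence(double[] p, double[] q,
                                      double chi, double eta) {
        double dx = q[0] - p[0], dy = q[1] - p[1];
        double dist = Math.sqrt(dx * dx + dy * dy);
        double scale = ((dist - chi) * eta) / dist;
        return new double[]{dx * scale, dy * scale};
    }

    public static void main(String[] args) {
        double chi = 1.0, eta = 0.05;
        // neighbor 3 units away, farther than chi: attraction along +x,
        // magnitude (3 - 1) * 0.05, i.e. about 0.1
        double[] pull = neighborInfluence(new double[]{0, 0},
                new double[]{3, 0}, chi, eta);
        System.out.println(pull[0]);
        // neighbor 0.5 units away, closer than chi: repulsion along -x,
        // magnitude |0.5 - 1| * 0.05, i.e. about 0.025
        double[] push = neighborInfluence(new double[]{0, 0},
                new double[]{0.5, 0}, chi, eta);
        System.out.println(push[0]);
    }
}
```

Repeatedly applying this influence nudges the vertex toward sitting exactly \(\tilde{\chi}\) away from its neighbor, at a speed controlled by \(\eta\).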

Next the influence between non-neighbor vertices is calculated on lines 25 - 32. Non-neighbor vertices, that is vertices not connected by an edge, always exert a purely repulsive influence. Line 27 calculates this with a similar technique as before: the difference between the positions of the two vertices is represented by \(\vec{q}\) and then its magnitude is rescaled. Of course it is also negated, to indicate that the force is repulsive. The equation only seems confusing in its simplified, compacted form. It was originally derived by calculating the new magnitude of \(\vec{q}\) as the following.

$$ \frac{-1}{{\mid\mid\vec{q}\mid\mid}^{\delta}} \cdot \eta $$

This makes a lot more sense when we recall that in nature repulsive forces diminish with the square of the distance; so this accurately represents a repulsive influence that diminishes with distance. Once we actually apply that magnitude to the vector and simplify we arrive at our final equation.

$$ \vec{q} \cdot \frac{\frac{-1}{{\mid\mid\vec{q}\mid\mid}^{\delta}} \cdot \eta}{\mid\mid\vec{q}\mid\mid} \Rightarrow \vec{q} \cdot \frac{-\eta}{{||\vec{q}||}^{\delta + 1}} $$

The only caveat is seen on lines 28 to 30, where the distance moved as a result of the repulsive influence is checked. If it is greater than the Equilibrium Distance, \(\tilde{\chi}\), its magnitude is scaled back to \(\tilde{\chi}\). This is done because at very close distances the power-law nature of the repulsive influence becomes overwhelming; we want the bulk of this influence to work at a distance, spreading the graph apart while still allowing the other influences to dominate the layout.
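Here is the same treatment for the repulsive influence, including the magnitude clamp from lines 28 to 30, again as a self-contained 2D Java sketch with my own naming:

```java
public class RepulsionStep {
    // Repulsive influence of a non-neighbor at q on a vertex at p:
    // q - p rescaled to magnitude -eta / ||q - p||^delta, with the
    // resulting magnitude clamped to at most chi.
    static double[] repulsion(double[] p, double[] q,
                              double chi, double delta, double eta) {
        double dx = q[0] - p[0], dy = q[1] - p[1];
        double dist = Math.sqrt(dx * dx + dy * dy);
        double scale = -eta / Math.pow(dist, delta + 1.0);
        double cx = dx * scale, cy = dy * scale;
        double mag = Math.sqrt(cx * cx + cy * cy);
        if (mag > chi) {  // tame the runaway close-range repulsion
            cx *= chi / mag;
            cy *= chi / mag;
        }
        return new double[]{cx, cy};
    }

    public static void main(String[] args) {
        double chi = 1.0, delta = 2.0, eta = 0.05;
        // two units apart: a tiny push away, magnitude 0.05 / 2^2 = 0.0125
        double[] far = repulsion(new double[]{0, 0}, new double[]{2, 0},
                chi, delta, eta);
        System.out.println(far[0]);
        // nearly overlapping: the raw push would be enormous,
        // so its magnitude is clamped to chi
        double[] near = repulsion(new double[]{0, 0}, new double[]{0.01, 0},
                chi, delta, eta);
        System.out.println(near[0]);
    }
}
```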

At this point the computed change in position for the vertex is simply returned at line 33 for further processing by the AlignAll procedure.

AlignAll Procedure

The AlignAll procedure is extremely straightforward. It is passed the set of all vertices in the graph as \(g\) and iterates over the set, aligning the vertices one at a time. Each vertex gets aligned once per call to the procedure, which means the procedure will usually need to be called multiple times.

On line 8 the Maximum Vertex Movement variable, represented as \(\zeta\), is initialized to \(0\). This variable represents the greatest distance any vertex moved during the alignment; after being calculated its value is returned on line 15. The Maximum Vertex Movement is important for determining when the HAM algorithm has finished processing.

Other than that the procedure doesn't do anything special: the vertex alignment vector is calculated on line 10 and the new position for the vertex is set on line 14.

HAM Procedure

The HAM procedure is also rather straightforward. It starts by assigning initial random coordinates to each vertex in the graph. After that it loops, calling AlignAll until the graph is sufficiently aligned.

On line 3 the AlignAll procedure is called in a loop until the Maximum Vertex Movement it returns is less than \(\beta \cdot \tilde{\chi}\), which is just the Alignment Threshold scaled by the Equilibrium Distance constant. The Alignment Threshold is sufficiently small that movements below this value are considered negligible and the alignment can end.

As an optional step after each alignment iteration, it may be desirable to translate the entire graph so it is centered around the zero vector. A small amount of drift occurs as the alignment of the graph is calculated, and recentering ensures the graph remains in the center of the view when rendered. The drift is usually negligible, however, so this step is entirely optional. In the full Java example below the logic for centering the graph is included.
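To tie the three procedures together, below is a minimal, self-contained 2D reduction of the whole algorithm. It is a sketch, not the dANN implementation: plain arrays instead of dANN's Vector and Graph types, my own names, the default constants, and a three-vertex triangle as the test graph. It runs AlignAll-style passes until the maximum movement drops below \(\beta \cdot \tilde{\chi}\), recentering after each pass:

```java
public class MiniHam {
    static final double CHI = 1.0, DELTA = 2.0, ETA = 0.05, BETA = 0.005;
    static double maxMove; // greatest distance any vertex moved in one pass

    // one AlignAll pass: adj is the adjacency matrix, pos[i] = {x, y}
    static void alignAll(boolean[][] adj, double[][] pos) {
        maxMove = 0.0;
        for (int v = 0; v < pos.length; v++) {
            double dpx = 0, dpy = 0;
            for (int m = 0; m < pos.length; m++) {
                if (m == v) continue;
                double qx = pos[m][0] - pos[v][0];
                double qy = pos[m][1] - pos[v][1];
                double dist = Math.hypot(qx, qy);
                if (adj[v][m]) {
                    // spring toward the equilibrium distance CHI
                    double s = ((dist - CHI) * ETA) / dist;
                    dpx += qx * s;
                    dpy += qy * s;
                } else {
                    // pure repulsion, clamped to magnitude CHI
                    double s = -ETA / Math.pow(dist, DELTA + 1.0);
                    double cx = qx * s, cy = qy * s;
                    double mag = Math.hypot(cx, cy);
                    if (mag > CHI) { cx *= CHI / mag; cy *= CHI / mag; }
                    dpx += cx;
                    dpy += cy;
                }
            }
            double moved = Math.hypot(dpx, dpy);
            if (moved > maxMove) maxMove = moved;
            pos[v][0] += dpx;
            pos[v][1] += dpy;
        }
    }

    public static void main(String[] args) {
        // triangle graph: every vertex adjacent to every other
        boolean[][] adj = {{false, true, true},
                           {true, false, true},
                           {true, true, false}};
        double[][] pos = {{0.9, 0.1}, {-0.3, 0.4}, {0.2, -0.7}};
        do {
            alignAll(adj, pos);
            // optional recentering: translate the centroid back to zero
            double cx = 0, cy = 0;
            for (double[] p : pos) { cx += p[0]; cy += p[1]; }
            cx /= pos.length; cy /= pos.length;
            for (double[] p : pos) { p[0] -= cx; p[1] -= cy; }
        } while (maxMove > BETA * CHI);
        for (int i = 0; i < 3; i++) {
            double d = Math.hypot(pos[i][0] - pos[(i + 1) % 3][0],
                                  pos[i][1] - pos[(i + 1) % 3][1]);
            System.out.printf("edge %d: %.2f%n", i, d);
        }
    }
}
```

Run it and the three edge lengths settle near the Equilibrium Distance of \(1\), mirroring at toy scale what the video shows for the 128-node graph.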

Appendix: Full Java Code

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Random;
// Graph, GraphDrawer, Vector, TraversableCloud and Weighted come from the
// dANN library itself.

public class HyperassociativeMap<G extends Graph<N, ?>, N> implements 
        GraphDrawer<G, N> {
    private static final double REPULSIVE_WEAKNESS = 2.0;
    private static final double DEFAULT_LEARNING_RATE = 0.05;
    private static final double EQUILIBRIUM_DISTANCE = 1.0;
    private static final double EQUILIBRIUM_ALIGNMENT_FACTOR = 0.005;

    private final G graph;
    private final int dimensions;
    private Map<N, Vector> coordinates = Collections.synchronizedMap(
            new HashMap<N, Vector>());
    private static final Random RANDOM = new Random();
    private final boolean useWeights;
    private double equilibriumDistance;
    private double learningRate = DEFAULT_LEARNING_RATE;
    private double maxMovement = 0.0;

    public HyperassociativeMap(final G graph, final int dimensions,
            final double equilibriumDistance, final boolean useWeights) {
        if (graph == null)
            throw new IllegalArgumentException("Graph can not be null");
        if (dimensions <= 0)
            throw new IllegalArgumentException("dimensions must be 1 or more");

        this.graph = graph;
        this.dimensions = dimensions;
        this.equilibriumDistance = equilibriumDistance;
        this.useWeights = useWeights;

        // give every node a random starting position
        for (final N node : this.graph.getNodes()) {
            this.coordinates.put(node, randomCoordinates(this.dimensions));
        }
    }

    public G getGraph() {
        return graph;
    }

    public double getEquilibriumDistance() {
        return equilibriumDistance;
    }

    public void setEquilibriumDistance(final double equilibriumDistance) {
        this.equilibriumDistance = equilibriumDistance;
    }

    public void resetLearning() {
        maxMovement = 0.0;
    }

    public void reset() {
        // randomize all nodes
        for (final N node : coordinates.keySet()) {
            coordinates.put(node, randomCoordinates(dimensions));
        }
    }

    public boolean isAlignable() {
        return true;
    }

    public boolean isAligned() {
        return isAlignable()
                && (maxMovement < (EQUILIBRIUM_ALIGNMENT_FACTOR
                        * equilibriumDistance))
                && (maxMovement > 0.0);
    }

    public void align() {
        // refresh all nodes in case the graph's topology has changed
        if (!coordinates.keySet().equals(graph.getNodes())) {
            final Map<N, Vector> newCoordinates = new HashMap<N, Vector>();
            for (final N node : graph.getNodes()) {
                if (coordinates.containsKey(node)) {
                    newCoordinates.put(node, coordinates.get(node));
                } else {
                    newCoordinates.put(node, randomCoordinates(dimensions));
                }
            }
            coordinates = Collections.synchronizedMap(newCoordinates);
        }

        maxMovement = 0.0;
        Vector center;

        center = processLocally();

        // divide each coordinate of the sum of all the points by the number of
        // nodes in order to calculate the average point, or center of all the
        // points
        for (int dimensionIndex = 1; dimensionIndex <= dimensions; 
             dimensionIndex++) {
            center = center.setCoordinate(center.getCoordinate
                    (dimensionIndex) / graph.getNodes().size(), dimensionIndex);
        }

        recenterNodes(center);
    }

    public int getDimensions() {
        return dimensions;
    }

    public Map<N, Vector> getCoordinates() {
        return Collections.unmodifiableMap(coordinates);
    }

    private void recenterNodes(final Vector center) {
        for (final N node : graph.getNodes()) {
            coordinates.put(node, coordinates.get(node).calculateRelativeTo
                    (center));
        }
    }

    public boolean isUsingWeights() {
        return useWeights;
    }

    Map<N, Double> getNeighbors(final N nodeToQuery) {
        final Map<N, Double> neighbors = new HashMap<N, Double>();
        for (final TraversableCloud<N> neighborEdge : graph.getAdjacentEdges
                (nodeToQuery)) {
            final Double currentWeight = (((neighborEdge instanceof Weighted)
                    && useWeights) ? ((Weighted) neighborEdge).getWeight()
                    : equilibriumDistance);
            for (final N neighbor : neighborEdge.getNodes()) {
                if (!neighbor.equals(nodeToQuery)) {
                    neighbors.put(neighbor, currentWeight);
                }
            }
        }
        return neighbors;
    }

    private Vector align(final N nodeToAlign) {
        // calculate equilibrium with neighbors
        final Vector location = coordinates.get(nodeToAlign);
        final Map<N, Double> neighbors = getNeighbors(nodeToAlign);

        Vector compositeVector = new Vector(location.getDimensions());
        // align with neighbours
        for (final Entry<N, Double> neighborEntry : neighbors.entrySet()) {
            final N neighbor = neighborEntry.getKey();
            final double associationEquilibriumDistance = neighborEntry
                    .getValue();

            Vector neighborVector = coordinates.get(neighbor)
                    .calculateRelativeTo(location);
            double newDistance = Math.abs(neighborVector.getDistance())
                    - associationEquilibriumDistance;
            newDistance *= learningRate;
            neighborVector = neighborVector.setDistance(newDistance);
            compositeVector = compositeVector.add(neighborVector);
        }
        // calculate repulsion with all non-neighbors
        for (final N node : graph.getNodes()) {
            if ((!neighbors.containsKey(node)) && (node != nodeToAlign)
                    && (!graph.getAdjacentNodes(node).contains(nodeToAlign))) {
                Vector nodeVector = coordinates.get(node).calculateRelativeTo
                        (location);
                double newDistance = -1.0 / Math.pow
                        (nodeVector.getDistance(), REPULSIVE_WEAKNESS);
                if (Math.abs(newDistance) > Math.abs(equilibriumDistance)) {
                    newDistance = Math.copySign(equilibriumDistance,
                            newDistance);
                }
                newDistance *= learningRate;
                nodeVector = nodeVector.setDistance(newDistance);
                compositeVector = compositeVector.add(nodeVector);
            }
        }
        Vector newLocation = location.add(compositeVector);
        final Vector oldLocation = coordinates.get(nodeToAlign);
        double moveDistance = Math.abs(newLocation.calculateRelativeTo
                (oldLocation).getDistance());

        if (moveDistance > maxMovement) {
            maxMovement = moveDistance;
        }

        coordinates.put(nodeToAlign, newLocation);
        return newLocation;
    }

    public static Vector randomCoordinates(final int dimensions) {
        final double[] randomCoordinates = new double[dimensions];
        for (int randomCoordinatesIndex = 0; randomCoordinatesIndex < 
                dimensions; randomCoordinatesIndex++) {
            randomCoordinates[randomCoordinatesIndex] = (RANDOM.nextDouble() 
                    * 2.0) - 1.0;
        }
        return new Vector(randomCoordinates);
    }

    private Vector processLocally() {
        Vector pointSum = new Vector(dimensions);
        for (final N node : graph.getNodes()) {
            final Vector newPoint = align(node);
            // accumulate the aligned positions so align() can average them
            // into the graph's center point
            for (int dimensionIndex = 1; dimensionIndex <= dimensions; 
                 dimensionIndex++) {
                pointSum = pointSum.setCoordinate(pointSum.getCoordinate
                        (dimensionIndex) + newPoint.getCoordinate
                        (dimensionIndex), dimensionIndex);
            }
        }
        return pointSum;
    }
}

LaTeX-Inspired Rendering of Algorithms in HTML

I recently updated the code for my blog (yes, this one) to render algorithms as pseudocode in classic LaTeX style. This style has been used extensively since the 80's and remains a standard format for presenting algorithms as pseudocode. Below is an example of what I'm talking about: the Quicksort algorithm rendered using this style.

% This quicksort algorithm is extracted from Chapter 7, Introduction to Algorithms (3rd edition)
\begin{algorithmic}
\PROCEDURE{Quicksort}{$A, p, r$}
  \IF{$p < r$}
    \STATE $q = $ \CALL{Partition}{$A, p, r$}
    \STATE \CALL{Quicksort}{$A, p, q - 1$}
    \STATE \CALL{Quicksort}{$A, q + 1, r$}
  \ENDIF
\ENDPROCEDURE
\PROCEDURE{Partition}{$A, p, r$}
  \STATE $x = A[r]$
  \STATE $i = p - 1$
  \FOR{$j = p$ \TO $r - 1$}
    \IF{$A[j] < x$}
      \STATE $i = i + 1$
      \STATE exchange $A[i]$ with $A[j]$
    \ENDIF
  \ENDFOR
  \STATE exchange $A[i + 1]$ with $A[r]$
  \RETURN $i + 1$
\ENDPROCEDURE
\end{algorithmic}

If you'd like to use it yourself it isn't too hard. Most of the magic comes from a nice little JavaScript library on GitHub called pseudocode.js. The problem is that the library alone only gets you halfway there: the parts it provides work great, but it won't search your HTML and render every block for you; you have to write that part yourself. If you put the following block of code at the end of your page's source, with the project I linked above already installed, it will solve that problem for you.

  var blocks = document.getElementsByClassName("pseudocode");
  for (var blockId = 0; blockId < blocks.length; blockId++) {
    var block = blocks[blockId];

    var code = block.textContent;
    var options = {
      lineNumber: true
    };

    var outputEl = document.createElement('div');
    outputEl.className += " pseudocode-out";
    block.parentNode.insertBefore(outputEl, block.nextSibling);

    pseudocode.render(code, outputEl, options);
  }

  // remove the original source blocks now that they have been rendered;
  // "blocks" is a live collection, so it shrinks as elements are removed
  while (blocks[0]) {
    blocks[0].parentNode.removeChild(blocks[0]);
  }
Now all you have to do is add any element with a class of "pseudocode" and write the actual pseudocode inside the element, following the format specified in the pseudocode.js project. I prefer to use a "pre" element so that if JavaScript is turned off the page still shows a human-readable block. Other than that I didn't have to do anything too special to get it working.

Conditional Probabilities and Bayes Theorem

I've been getting a lot of questions from friends lately about what Bayes Theorem means. The confusion is understandable because it appears in a few models that seem to be completely unrelated to each other; for example, Naive Bayes Classifiers and Bayesian Networks operate on completely different principles. The confusion is compounded by a general lack of understanding of unconditional and conditional probabilities.

In this article I offer a tutorial to help bring the lay person up to speed with some basic understanding on these concepts and how Bayes Theorem can be applied.

Probability Space

We have to start by explaining a few terms and how they are used.

A Random Trial is a trial where we perform some random experiment. For example it might be flipping a coin or rolling dice.

The Sample Space of a Random Trial, typically denoted by \(\Omega\), represents all possible outcomes for the Random Trial being performed. So for flipping a coin the outcome can be either heads or tails, so the Sample Space would be a set containing only these two values.

$$ \Omega = \{Heads, Tails\} $$

For rolling dice the Sample Space would be the set of all the various faces for the dice being rolled. When rolling only one standard six sided die the Sample space would be as follows.

$$ \Omega = \{1, 2, 3, 4, 5, 6\} $$

In both of these examples the Random Trial being performed will select an outcome from their respective Sample Space at random. In these trials each outcome has an equal chance of being selected, though that does not necessarily need to be the case.

For our purposes here I want to formulate a Random Trial thought experiment that simulates a medical trial consisting of 10 patients. We will be using this example throughout much of this tutorial. Therefore our Sample Space will be a set consisting of 10 elements, each element represents a single unique patient in the trial. Patients are represented with the variable x with a subscript from 1 to 10 that uniquely identifies each patient.

$$ \Omega = \{x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_{10}\} $$

An Event in the context of probabilities is a set of outcomes that can be satisfied. Typically these sets are represented using lowercase Greek letters such as \(\alpha\) or \(\beta\). For example, if we were rolling a single die and wanted it to land on an odd number, then the Event representing an odd outcome would be represented as the following set.

$$ \alpha = \{1, 3, 5\} $$

Similarly if we simply wanted to roll the number 6 then the Event would be a set containing only that one number.

$$ \alpha = \{6\} $$

The Event Space, often denoted by \(\mathcal{F}\), is the set of all Events to be observed. It is a set of sets representing every possible combination of subsets of the Sample Space, or some part thereof. Not all Events in the Event Space need to be possible, however. For example, if we are talking about flipping a coin then the Event Space would have at most 4 members, representing the outcomes Heads, Tails, either Heads or Tails, and neither Heads nor Tails. We could represent this with the following set notation.

$$ \mathcal{F} = \{\{\}, \{Heads\}, \{Tails\}, \{Heads, Tails\}\} $$

The empty set is usually represented with the \(\emptyset\) symbol. So the previous set can be rewritten using this shorthand as follows.

$$ \mathcal{F} = \{\emptyset, \{Heads\}, \{Tails\}, \{Heads, Tails\}\} $$

Notice that one of the members of the Event Space is equivalent to the Sample Space. The fact that it contains both the empty set and the Sample Space as members is largely a matter of mathematical completeness, and plays a role in making some mathematical proofs easier to carry out. For our purposes here they will largely be ignored.

At this point I want to go over a little bit of mathematical notation that may help when reading other texts on the subject. The first is the concept of a Power Set. The Power Set is simply the set of every possible subset of a particular set. In the example above regarding the coin-toss Event Space, we can say that the Event Space specified is the Power Set of the Sample Space. The notation for the Power Set is the number 2 with the set as its exponent. For example, shorthand for the above Event Space definition could have been the following.

$$ \mathcal{F} = 2^{\Omega} $$

Every Event Space must be either equal to, or a subset of, the Power Set of the Sample Space. We can represent that with the following set notation.

$$ \mathcal{F} \subseteq 2^{\Omega} $$
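To make the Power Set relation concrete, here is a small Python sketch (illustrative only; the `power_set` helper is simply a name I chose) that builds \(2^{\Omega}\) for the coin-toss Sample Space and confirms it has exactly the four members listed above.

```python
from itertools import chain, combinations

def power_set(s):
    """Return the Power Set of s: the set of all of its subsets."""
    items = list(s)
    return {frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))}

omega = {"Heads", "Tails"}          # the Sample Space for a coin flip
event_space = power_set(omega)      # F = 2^Omega

print(len(event_space))                  # 4 Events in total
print(frozenset() in event_space)        # True: the empty set is a member
print(frozenset(omega) in event_space)   # True: so is the Sample Space itself
```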

Going back to our example of patients in a clinical trial, we might want to know the chance of selecting a patient at random who has a fever. In that case the Event would be the set of all patients that have a fever, and the outcome would be a single patient selected at random. Each Event is an element in the Event Space, so we will denote it as \(\mathcal{F}\) with a subscript, which is easier to read than the usual convention of arbitrary lowercase Greek letters. If 3 of the 10 patients in our Sample Space have a fever, we can represent the fever Event as follows.

$$ \mathcal{F}_{fever} = \{x_1, x_6, x_8\} $$

This means that if we select a patient at random and that patient is a member of the \(\mathcal{F}_{fever}\) set then that patient has a fever and the outcome has satisfied the event. Similarly we can define the event representing patients that have the flu with the following notation.

$$ \mathcal{F}_{flu} = \{x_2, x_4, x_6, x_8\} $$

As stated earlier, Events are simply members of the Event Space. This can be indicated using the following set notation, which states that the fever Event is a member of the Event Space, and likewise for the flu Event.

$$ \mathcal{F}_{fever} \in \mathcal{F} $$
$$ \mathcal{F}_{flu} \in \mathcal{F} $$

Similarly, if we wish to indicate that the fever and flu Events are subsets of the Sample Space, we can do so using the following notation.

$$ \mathcal{F}_{fever} \subset \Omega $$
$$ \mathcal{F}_{flu} \subset \Omega $$
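The membership and subset relations above are easy to check directly with Python sets; this is just an illustrative sketch using the patient identifiers as strings.

```python
# The Sample Space and the two Events from the text, modeled as Python sets.
omega = {f"x{i}" for i in range(1, 11)}   # patients x1 through x10
fever = {"x1", "x6", "x8"}
flu = {"x2", "x4", "x6", "x8"}

print(fever < omega)   # True: the fever Event is a proper subset of Omega
print(flu < omega)     # True: so is the flu Event
```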

The only term left to define is the Probability Space. This is just the combination of the Event Space, the Sample Space, as well as the probability of each of the Events taking place. It represents all the information we need to determine the chance of any possible outcome occurring. It is denoted as a 3-tuple containing these three things. The probability, P, represents a function that maps Events in \(\mathcal{F}\) to probabilities.

$$ (\Omega, \mathcal{F}, P) $$

Unconditional Probability

This is where things get interesting. With all the important terms defined, we can start talking about actual probabilities. We start with the simplest type, the Unconditional Probability. This is the sort of probability most people are familiar with: the chance that an Event will be satisfied independent of any other Events. For example, if I flip a coin I can say the probability of it landing on Heads is 50%; this would be an Unconditional Probability.

If all the outcomes in our Sample Space have the same chance of being selected by a Random Trial then calculating the Unconditional Probability is rather easy. The Event would represent all desired outcomes from our Sample Space. So if we wanted to flip a coin and get heads then our Event is a set with a single member and the Sample Space consists of only two members, the possible outcomes. We can write this as the following.

$$ \Omega = \{Heads, Tails\} $$
$$ \mathcal{F}_{Heads} = \{Heads\} $$

If the above event is satisfied by the flip of a coin it means the outcome of the coin toss was heads. To calculate the probability for this event we simply count the number of members in the Event Set and divide it by the number of members in the Sample Space. In this case the result is 50% but we can represent that as follows.

$$ P(\mathcal{F}_{Heads}) = \frac{1}{2} $$

The number of members in a set is called Cardinality. We can represent that using notation that is the same as the absolute value sign used around a set. Therefore we can represent the previous equation using the following notation.

$$ P(\mathcal{F}_{Heads}) = \frac{\mid \mathcal{F}_{Heads} \mid}{\mid \Omega \mid} = \frac{1}{2} $$

We can generalize this for any Event represented as \(\mathcal{F}_{i}\) with the following definition for calculating an Unconditional Probability.

$$ P(\mathcal{F}_{i}) = \frac{\mid \mathcal{F}_{i} \mid}{\mid \Omega \mid} $$

Now let’s apply this to our clinical trial example from earlier. Say we wanted to calculate the chance of selecting someone from the 10 patients in the trial, at random, such that the person selected has a fever. We can calculate that with the following.

$$ P(\mathcal{F}_{fever}) = \frac{\mid \mathcal{F}_{fever} \mid}{\mid \Omega \mid} = \frac{3}{10} $$

We can also do the same for calculating the chance of randomly selecting a patient that has the flu.

$$ P(\mathcal{F}_{flu}) = \frac{\mid \mathcal{F}_{flu} \mid}{\mid \Omega \mid} = \frac{4}{10} = \frac{2}{5} $$
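The counting definition above translates directly into code. Here is a hypothetical `unconditional` helper in Python that computes \(\frac{\mid \mathcal{F}_{i} \mid}{\mid \Omega \mid}\) with exact fractions, applied to the fever and flu Events.

```python
from fractions import Fraction

def unconditional(event, omega):
    """P(F_i) = |F_i| / |Omega|, assuming all outcomes are equally likely."""
    return Fraction(len(event), len(omega))

omega = {f"x{i}" for i in range(1, 11)}   # the 10 patients
fever = {"x1", "x6", "x8"}
flu = {"x2", "x4", "x6", "x8"}

print(unconditional(fever, omega))   # 3/10
print(unconditional(flu, omega))     # 2/5
```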

Conditional Probability

A Conditional Probability takes this idea one step further. A Conditional Probability specifies the probability of an event being satisfied if it is known that another event was also satisfied. For example using our clinical trial thought experiment one might ask what is the probability of someone having the flu if we know that person has a fever. This would be represented with the following notation.

$$ P(\mathcal{F}_{flu} \mid \mathcal{F}_{fever}) $$

Assuming that having a fever has some effect on the likelihood of having the flu, this probability will differ from the chance of just any randomly selected member having the flu; after all, people with fevers are more likely to have the flu than people without one.

Since we already know which patients have the flu and which have a fever it is easy to determine an answer to this question. To calculate the probability we can look at how many patients in our Sample Space have a fever and what percentage of those patients with fever also have the flu. By looking at the data we can see that there are 3 patients with fevers and of those patients only 2 of them have the flu. So the answer is \(\frac{2}{3}\).

$$ \mathcal{F}_{fever} = \{x_1, x_6, x_8\} $$
$$ \mathcal{F}_{flu} = \{x_2, x_4, x_6, x_8\} $$
$$ P(\mathcal{F}_{flu} \mid \mathcal{F}_{fever}) = \frac{2}{3} $$

We can generalize this statement by saying that we take the intersection of the sets that represent the Event for patients with the flu and patients with a fever. The intersection is just the set of all the elements that those two sets have in common.

The symbol for intersection is \(\cap\), therefore we can show the intersection of these two sets as follows.

$$ \mathcal{F}_{flu} \cap \mathcal{F}_{fever} = \{x_6, x_8\} $$

Another way to look at calculating the Conditional Probability would be to take the Cardinality of the intersection of these two Events and divide it by the cardinality of the conditional Event that has been satisfied. So now we have the following.

$$ P(\mathcal{F}_{flu} \mid \mathcal{F}_{fever}) = \frac{\mid \mathcal{F}_{flu} \cap \mathcal{F}_{fever} \mid}{\mid \mathcal{F}_{fever} \mid} = \frac{2}{3} $$

We can also ask a similar, but markedly different, question: if we know a patient has the flu, what is the chance that same patient will have a fever? For this we can use the same logic as above and come up with the following.

$$ P(\mathcal{F}_{fever} \mid \mathcal{F}_{flu}) = \frac{\mid \mathcal{F}_{flu} \cap \mathcal{F}_{fever} \mid}{\mid \mathcal{F}_{flu} \mid} = \frac{2}{4} = \frac{1}{2} $$

As you can see the only thing that changed is the denominator which is now the Cardinality of the flu Event rather than the fever Event. We can generalize the equation for calculating a Conditional Probability as follows.

$$ P(\mathcal{F}_{i} \mid \mathcal{F}_{j}) = \frac{\mid \mathcal{F}_{i} \cap \mathcal{F}_{j} \mid}{\mid \mathcal{F}_{j} \mid} $$
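The generalized formula again maps directly onto set operations; the `conditional` helper below is just an illustrative name, and Python's `&` operator computes the intersection.

```python
from fractions import Fraction

def conditional(event_i, event_j):
    """P(F_i | F_j) = |F_i intersect F_j| / |F_j| for equally likely outcomes."""
    return Fraction(len(event_i & event_j), len(event_j))

fever = {"x1", "x6", "x8"}
flu = {"x2", "x4", "x6", "x8"}

print(conditional(flu, fever))   # 2/3: chance of flu given a fever
print(conditional(fever, flu))   # 1/2: chance of a fever given the flu
```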

Bayes Theorem

Bayes Theorem itself is remarkably simple on the surface yet immensely useful in practice. In its simplest form it lets us calculate a Conditional Probability when we have limited information to work with. If we only knew, for example, the probabilities \(P(\mathcal{F}_j \mid \mathcal{F}_i)\), \(P(\mathcal{F}_i)\), and \(P(\mathcal{F}_j)\), then using Bayes Theorem we could calculate the probability \(P(\mathcal{F}_i \mid \mathcal{F}_j)\). The precise equation for Bayes Theorem is as follows.

$$ P(\mathcal{F}_{i} \mid \mathcal{F}_{j}) = \frac{ P(\mathcal{F}_{j} \mid \mathcal{F}_{i}) \cdot P(\mathcal{F}_{i}) }{ P(\mathcal{F}_{j}) } $$

Let’s say we didn’t know all the details of the clinical trial from earlier; we have no idea what the Sample Space is or what members belong to each Event set. All we know is the probability that someone will have a fever at any given time, the probability they will have the flu, and the probability that someone with the flu has a fever. From this limited information, and using Bayes Theorem it would be possible to infer the probability of having the flu if you have a fever. First let’s copy the probabilities we know to match what we previously calculated manually.

$$ P(\mathcal{F}_{fever}) = \frac{3}{10} $$
$$ P(\mathcal{F}_{flu}) = \frac{2}{5} $$
$$ P(\mathcal{F}_{fever} \mid \mathcal{F}_{flu}) = \frac{1}{2} $$

Using only this information, along with Bayes Theorem, we can calculate the probability of someone having the flu if they have a fever as follows.

$$ P(\mathcal{F}_{flu} \mid \mathcal{F}_{fever}) = \frac{ P(\mathcal{F}_{fever} \mid \mathcal{F}_{flu}) \cdot P(\mathcal{F}_{flu}) }{ P(\mathcal{F}_{fever}) } $$
$$ P(\mathcal{F}_{flu} \mid \mathcal{F}_{fever}) = \frac{ \frac{1}{2} \cdot \frac{2}{5} }{ \frac{3}{10} } $$
$$ P(\mathcal{F}_{flu} \mid \mathcal{F}_{fever}) = \frac{2}{3} $$
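As a sanity check, the same calculation can be carried out in a few lines of Python using exact fractions (the `bayes` function is simply a name for the formula above).

```python
from fractions import Fraction

def bayes(p_j_given_i, p_i, p_j):
    """Bayes Theorem: P(F_i | F_j) = P(F_j | F_i) * P(F_i) / P(F_j)."""
    return p_j_given_i * p_i / p_j

# P(flu | fever) from P(fever | flu), P(flu), and P(fever).
p_flu_given_fever = bayes(Fraction(1, 2), Fraction(2, 5), Fraction(3, 10))
print(p_flu_given_fever)   # 2/3
```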

This solution of course agrees with our earlier results when we were able to calculate the answer by manually counting the data. However, this time we did not have to use the data directly.

Let’s do one more example to drive the point home. Say we have a test for Tuberculosis (TB) that is 95% accurate. That is to say, if you have TB then 95% of the time the test will give you a positive result; similarly, if you do not have TB then 95% of the time you will get a negative result. We can represent this as follows.

$$ P(\mathcal{F}_{positive} \mid \mathcal{F}_{infected}) = \frac{19}{20} $$

Furthermore let’s say we know that only one in a thousand members of the population are infected with TB at any one time. We can demonstrate this as follows.

$$ P(\mathcal{F}_{infected}) = \frac{1}{1000} $$

Finally let’s say when tested on the general population that 509 out of every 10,000 people received a positive result. We can represent that with the following.

$$ P(\mathcal{F}_{positive}) = \frac{509}{10000} $$

With this information it is possible to calculate the probability someone will have TB if they receive a positive test result. Using Bayes Theorem we can solve for the probability as follows.

$$ P(\mathcal{F}_{infected} \mid \mathcal{F}_{positive}) = \frac{ P(\mathcal{F}_{positive} \mid \mathcal{F}_{infected}) \cdot P(\mathcal{F}_{infected}) }{ P(\mathcal{F}_{positive}) } $$
$$ P(\mathcal{F}_{infected} \mid \mathcal{F}_{positive}) = \frac{ \frac{19}{20} \cdot \frac{1}{1000} }{ \frac{509}{10000} } $$
$$ P(\mathcal{F}_{infected} \mid \mathcal{F}_{positive}) = \frac{ 19 }{ 1018 } = 0.018664 = 1.8664\% $$

This gives us a very surprising result. It says that of the people who take the TB test and show up positive less than 2% of them actually have TB. This demonstrates the importance of using very accurate clinical tests when testing for diseases that have a low occurrence in the population. Even a small error in the test can give false positives at an alarmingly high rate.
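The TB numbers are easy to verify yourself; here is the same arithmetic as a short Python sketch.

```python
from fractions import Fraction

# The three probabilities we know about the TB test.
p_pos_given_inf = Fraction(19, 20)    # 95% of infected people test positive
p_inf = Fraction(1, 1000)             # 1 in 1000 people are infected
p_pos = Fraction(509, 10000)          # 509 in 10,000 tests come back positive

# Bayes Theorem: P(infected | positive).
p_inf_given_pos = p_pos_given_inf * p_inf / p_pos
print(p_inf_given_pos)          # 19/1018
print(float(p_inf_given_pos))   # roughly 0.0187, i.e. under 2%
```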

Restricted Logarithmic Growth with Injection

The Logistic Function, sometimes with modifications, has been used successfully to model a large range of natural systems. Some examples include bacterial growth, tumor growth, animal populations, neural network transfer functions, chemical reaction rates, language adoption, and diffusion of innovation, to name a few. The real world applications are simply staggering.

Because of the utility of this powerful little equation, I was led to investigate more thoroughly how it works, and how it can be applied to novel innovations.

There have been many contributions that have led to numerous variations of the Logistic function. One that struck my attention in particular is the Verhulst Equation, sometimes referred to descriptively as the “Restricted Logarithmic Growth Function”, or simply “Logistic Growth Function”. It is used in ecology to model the expected growth of a population while taking into consideration that resources, such as food, are finite, resulting in a maximum sustainable population, called the carrying capacity. The model demonstrates an exponential growth of the population when it is significantly below this carrying capacity, but as the population approaches the carrying capacity, and resource competition increases, the growth rate slows asymptotically. The Verhulst Equation has been tested against numerous real world populations, including Seals and Elk, and has been shown to be a relatively accurate model.

Of course the Verhulst Equation isn’t limited to modeling animal populations, it has also been successfully used to model diffusion of innovation. In this sense it can be used to represent how ideas spread throughout a population. It is this particular application that was most interesting to me.

Unrestricted Exponential Growth

As with any complex idea, we have to start with the basics. Whether we are talking about an idea or an organism, if it is capable of spreading by multiplying itself, then obviously the more it spreads, the faster it will spread. You start out with one, which turns into two, then four, then eight; in just a few iterations you will have billions. If there is nothing to impede the growth, it will follow an exponential curve.

We can model this sort of growth quite simply; the rate at which new members of the population appear can be represented as the growth rate, G, multiplied by the population, p.

$$ \frac{dp}{dt} = G \cdot p $$

You will notice in the above equation that there is a derivative on the left hand side. The equation can therefore be read as “the rate of change in the population per unit of time is equal to the growth rate multiplied by the current population”. Note that G is a continuous rate: if time is measured in years and the population doubles each year, then G would be ln 2 ≈ 0.693, not 2, since the growth compounds continuously.

Of course the equation becomes much more useful if we can get rid of the derivative and simply express the total population at any point in time. To do that we integrate the equation, solving for p, and arrive at the following.

$$ p(t) = P_0 \cdot e^{G \cdot t} $$

Here e is Euler’s number, and \(P_0\) is our constant of integration, which represents the population when time, t, is equal to 0.

To help visualize what’s going on, let’s try some arbitrary values for our constants. Let’s suppose the growth rate is 0.1, and \(P_0\) is 1. This would represent an initial population with only one member, replicating roughly once every 10 units of time. If we plug these values in and simplify we arrive at the following equation.

$$ p(t) = e^{0.1 \cdot t} $$

It is immediately obvious this is an exponential function, nothing too special about that. We can see from the graph below that the population would continue to grow exponentially and unrestricted.
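If you would like to play with the numbers yourself, the closed-form solution is a one-liner in Python; the function name and default values here are just my choices for this sketch.

```python
import math

def population(t, growth=0.1, p0=1.0):
    """Unrestricted exponential growth: p(t) = P0 * e**(G*t)."""
    return p0 * math.exp(growth * t)

print(population(0))    # 1.0: the initial population
print(population(10))   # about 2.718: the population has grown e-fold
```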

Restricted Logarithmic Growth

Of course in the real world it is rare that anything will truly grow in an unrestricted manner. Space is finite, resources are finite, and eventually everything must stop growing. This brings us to the Restricted Logarithmic Growth model, also called the Verhulst Equation.

Verhulst recognized that, at least amongst animals, resources are limited. When unlimited resources are available to an organism it will grow and reproduce unrestricted. However, as the population grows, resources become scarce. The more scarce the resources, the greater the restriction on growth and reproduction, and this effect scales with the proportion of the resources that remain free. Therefore, as the population reaches the carrying capacity of the environment, the growth rate should approach 0 in a linear fashion.

Therefore, if we take the growth rate equation from before, we can add an additional term to represent the availability of resources.

$$ \frac{dp}{dt} = G \cdot p \cdot \left(1 - \frac{p}{k(t)}\right) $$

In the above equation we can see that we multiply by a term that scales proportionally to the population. The term will evaluate to 1 when the population is 0, and will approach 0 as the population approaches the carrying capacity, represented as k(t). This is where the population’s growth restriction is represented.

Before we can actually solve the equation above, we have to know what the k(t) function evaluates to; after all, we can’t integrate a function we haven’t defined. So let’s just assume the carrying capacity is a constant we will call K.

At this point we can solve for p and integrate the equation as before. This gives us an equation that models the total population as a function of time.

$$ p(t) = \frac{K \cdot P_0 \cdot e^{G \cdot t}}{K + P_0 \cdot \left(e^{G \cdot t} - 1\right)} $$

The above equation is usually what people refer to when they talk about the Verhulst equation, but it might help to see what it looks like with some actual values plugged in for the constants. Let’s pick similar values: a growth rate, G, of 0.1, an initial population, \(P_0\), of 1, and a carrying capacity, K, of 100.

$$ p(t) = \frac{100 \cdot e^{0.1 \cdot t}}{99 + e^{0.1 \cdot t}} $$

Once we plug these values in and simplify we arrive at the above equation. It is a specific form of the Logistic Function. If we graph that we will get the following graph.

In the above graph the dotted horizontal orange line represents the carrying capacity, K. This is of course a constant with a value of 100. We can see that early on the population has what appears to be an exponential growth pattern, but as it begins to approach the orange dashed line it asymptotically tapers off. As time (the x axis) approaches infinity, the population approaches the carrying capacity.
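For the curious, here is the logistic curve evaluated in Python, using the standard closed-form solution of the Verhulst equation with G = 0.1, \(P_0\) = 1, and K = 100 (the function name and defaults are mine).

```python
import math

def logistic(t, growth=0.1, p0=1.0, k=100.0):
    """Verhulst logistic growth: p(t) = K*P0*e**(G*t) / (K + P0*(e**(G*t) - 1))."""
    e_gt = math.exp(growth * t)
    return k * p0 * e_gt / (k + p0 * (e_gt - 1))

print(logistic(0))     # 1.0: the initial population
print(logistic(200))   # very close to 100, the carrying capacity
```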

Time-varying Carrying Capacity

If we want to play with this equation a bit more, we can make things more complex (and fun) by picking a carrying capacity that actually changes with time. If we go back to the original growth rate equation, we can make k(t) a function equal to (t/10)+100 rather than a constant. If we do this, and keep the growth rate and \(P_0\) constants the same as before, then we will get the following equation.

$$ \frac{dp}{dt} = 0.1 \cdot p \cdot \left(1 - \frac{p}{\frac{t}{10} + 100}\right) $$

Again we can solve for p, integrate, and simplify and we will arrive at the following equation.

In the above equation Ei is the Exponential Integral Function. If you don’t know what that is don’t worry, we can just graph the equation below to get a better understanding of how it behaves.

You can see the equation behaves similarly in this graph as in the last one, except that the carrying capacity now starts at 100 and slowly increases over time. As it does, the population also increases to match the carrying capacity, without ever exceeding it. Pretty much how we would expect it to work.

Restricted Logarithmic Growth with Injection

This is where the lessons of history end, and my own contribution begins. I was left considering the effects of a modern world and how they could be reflected in the Verhulst model.

As I mentioned in the introduction, the Verhulst equation has since been used in countless applications outside of ecology. One that was particularly interesting is that it has been used successfully to model the Diffusion of Innovation. Essentially, it can accurately describe the adoption of an idea within a population. For example, it can model the proliferation of linguistic changes, such as a change in the spelling of a word, or even the adoption of an entirely new word.

All of this, of course, has implications in marketing, where the market penetration of a product can spread through word of mouth alone and can therefore be modeled using the Verhulst equation. But how could we also account for the effects of advertising, which often acts as an accelerant to the spread of an idea that is still primarily driven by word of mouth? That’s when inspiration struck.

If we accept that ideas spreading through word of mouth will often fit a model similar to population growth, then advertising is a bit like artificially injecting new members into a population as it grows. The advertising acts as a jump start to the population growth, accelerating it, and magnifying the effects that would be seen from word of mouth alone.

If I could incorporate this effect into the model, not only could it be used for modeling product adoption, but conservation efforts as well. It is not uncommon for conservationists to breed animals in captivity and then release them in the hopes of supplementing a species’ population. In this sense it would be modeled with the same artificial injection.

In fact, almost all the examples given earlier where the Verhulst model has been used could potentially benefit from this addition. Consider chemical reactions: an artificial injection factor could represent the effects of slowly adding new reactants to a reaction as it progresses.

So I had to figure out where this new function might fit in. When we go back and take a look at our original differential growth rate equation, we have two terms: the left hand term represents the unrestricted growth rate, and the right hand term represents the constraint on growth as resource contention grows. So all it really takes is adding an injection factor to the left hand term. The new equation looks like the following.

$$ \frac{dp}{dt} = \left(G \cdot p + i(t)\right) \cdot \left(1 - \frac{p}{k(t)}\right) $$

You will notice in the above equation that all we did was add the new injection function, shown as i(t). This is just the number of artificially injected members of the population per unit of time. By representing it as a function, we can inject a varying number of members as time passes.

It is important to keep in mind that this injection is still scaled by the right hand term: the more resource contention there is, the less effective the injection is likely to be. In the earlier example of breeding animals in captivity and releasing them into the wild, this represents the fact that if resources are scarce, many of the released animals are likely to die of starvation before they have a chance to reproduce. In the case of advertisement, or Diffusion of Innovation in general, it is reasonable to expect that the more people already know about an idea, the harder it is for your advertisements to reach people who have yet to hear of it. If Pepsi plays a commercial on TV, for example, you might expect the vast majority of people who see it to already know about the Pepsi brand.

As before, if we want to play around with the equation a bit to see how the model behaves, we are going to have to create some definitions for our functions.

$$ \frac{dp}{dt} = \left(G \cdot p + I\right) \cdot \left(1 - \frac{p}{K}\right) $$

For the sake of simplicity we turned our injection function, i(t), into the constant I, and the carrying capacity, k(t), into the constant K. This just gives us something to play with, and allows us to solve for p and integrate, arriving at the following equation.

$$ p(t) = \frac{C \cdot K \cdot e^{r t} - I}{C \cdot e^{r t} + G} \quad \text{where} \quad r = \frac{G \cdot K + I}{K}, \quad C = \frac{G \cdot P_0 + I}{K - P_0} $$

Now let’s plug in some arbitrary values for our constants and see how it graphs out. Let’s pick a value of 1/2 for I, 1/10 for G, 100 for K, and 1 for \(P_0\). If we plug these values in and simplify we arrive at the following equation.

$$ p(t) = \frac{200 \cdot e^{0.105 \cdot t} - 165}{2 \cdot e^{0.105 \cdot t} + 33} $$

Finally, let’s graph the carrying capacity and growth model as before. In the graph below carrying capacity is once again represented by the dashed yellow line, and the solid blue line is the population. We can immediately notice that, while the graph still has a logistic curve, the population grows significantly quicker in the beginning than in the other models; yet despite the injection it still never exceeds the carrying capacity.

While this model has yet to be tested against real world data, I am hopeful that it will provide an accurate representation of artificial population injection in growth models. I look forward to seeing how I and others might apply this and similar models to data in the future.
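If you want to experiment with the injection model yourself without doing the integration, a simple Euler simulation of the differential form \( \frac{dp}{dt} = (G \cdot p + i(t)) \cdot (1 - \frac{p}{k(t)}) \) is enough to reproduce the behavior described above. Everything here (function name, step size, constants) is just a sketch of mine, not part of the original derivation.

```python
def simulate(g=0.1, i=0.5, k=100.0, p0=1.0, t_end=200.0, dt=0.01):
    """Euler-integrate dp/dt = (g*p + i) * (1 - p/k) and track the peak."""
    p, t, peak = p0, 0.0, p0
    while t < t_end:
        p += (g * p + i) * (1 - p / k) * dt
        peak = max(peak, p)
        t += dt
    return p, peak

final, peak = simulate()
print(final)           # settles just below the carrying capacity of 100
print(peak <= 100.0)   # True: the injection never pushes the population past K
```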

The Great Nebula in Orion

As some of you may already know, I am an avid astrophotographer. Several years back I set out to photograph the Great Nebula in Orion, particularly focusing on the 4 brightest stars at its center, called the Trapezium. These stars produce the majority of the light responsible for lighting up the entire nebula. They are also so bright and so closely spaced that they often appear to the onlooker to be a single super-bright star. My primary goal in this shoot was to provide enough resolution to see the Trapezium as 4 distinct stars.

Deep field image of The Great Nebula in Orion

The image you see above is the final product of several days’ worth of work capturing and processing the Orion Nebula (NGC 1976 / The Great Nebula in Orion / M42), taken on November 12th, 2010. It also includes a faint wisp of De Mairan’s Nebula (NGC 1982 / M43). The shot was taken with an EQ-6 mount, without an autoguider, attached to an 800mm focal length, 8" aperture, f/4 Newtonian telescope. My Canon Rebel T1i acted as the imager, and I used a 3x Barlow without an eyepiece to increase the effective focal length. The majority of the shot is the result of 6 stacked 30-second ISO 3200 exposures; a stack of 15 additional exposures was used on the Trapezium star cluster in the center. The shot was also post-processed to improve contrast and light levels. The primary goal was to show the detail of the inner core of the Orion Nebula.

Of course in the interest of science I couldn’t just stop there. I felt it was important, and hopefully useful, to chart the various stars in the image so people knew exactly what they were looking at, as well as cross reference it with any catalogs.

Simple annotated image of The Great Nebula in Orion

In the above picture all the stars big enough to have common names have been labeled with them. Unfortunately none of the other stars in the picture have common names, so the majority were left unlabeled.

Extended annotated image of The Great Nebula in Orion

Of course I wasn’t going to stop there. While many stars don’t have names, they do at least show up in a few star catalogs. The final image above labels every single star, no matter how faint, with its catalog identification value. Where a star shows up in multiple catalogs, multiple identifiers are listed.