Distance and Similarity in Manifold Learning
Imagine you crumple a flat piece of paper into a ball. The paper is fundamentally 2-dimensional (a flat surface), but now it's twisted and folded into 3D space. A manifold is like that crumpled paper. Manifold learning asks: what is the true, simple structure hiding inside the high-dimensional mess?
The catch: straight-line distance through 3D space is not the same as distance along the crumpled paper's surface. Two points might be close if you measure straight through the air, but far apart if you have to follow the paper's folds and creases. Manifold learning cares about the distance along the paper's surface, not the straight-line distance through space.
Interactive demo: Crumple the Paper (3D)
The Core Problem: Which Distance Matters?
Euclidean distance measures straight-line distance through the full space. If you're in a room with a curved wall, Euclidean distance is the direct path through the wall. Geodesic distance follows the surface, the actual path you'd walk along the manifold.
Example: Two people on opposite sides of a mountain are close in Euclidean distance (straight through the rock) but far in geodesic distance (walking around the mountain). Manifold learning cares about geodesics, because that's how the data actually relates.
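A tiny numeric illustration of the gap, assuming Python with NumPy: two points on opposite sides of a unit circle are 2 units apart in a straight line, but about 3.14 units apart along the circle itself.

```python
import numpy as np

# Two points on the unit circle, at angles 0 and pi (opposite sides).
theta_a, theta_b = 0.0, np.pi
a = np.array([np.cos(theta_a), np.sin(theta_a)])
b = np.array([np.cos(theta_b), np.sin(theta_b)])

radius = 1.0
euclidean = np.linalg.norm(a - b)             # chord straight through the interior: 2.0
geodesic = abs(theta_b - theta_a) * radius    # arc length along the circle: pi ~ 3.14

print(f"Euclidean (through space): {euclidean:.3f}")
print(f"Geodesic (along surface):  {geodesic:.3f}")
```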
Similarity vs. Closeness
Here's the key insight: two points can be close when you measure straight through the high-dimensional space, yet far apart along the actual curved surface. Let's use our examples:
Rolled-up scroll: Two points on different layers are close if you measure straight through (high-dimensional space), but far apart if you follow the paper (manifold).
Handwritten digits: Two digit images might have similar pixel values (close in 784D space), but they might be very different digits (far on the manifold). Or two images that look identical might have different pixel values due to noise.
This is why manifold learning matters: straight-line distance in the high-dimensional space is misleading. You need to measure distance along the actual structure.
Similarity should be based on manifold structure, not raw 3D distance. That's what manifold learning algorithms try to discover: which points are truly similar given the underlying geometry?
The Manifold Assumption: Real data doesn't randomly use all available dimensions. It clusters on lower-dimensional manifolds within the high-dimensional space. For example: a handwritten digit image has 784 dimensions (28×28 pixels), but not every possible combination of pixel values is a valid digit. Valid digits form only a roughly 10-50 dimensional manifold within that 784D space. Same with text embeddings: they live in a 512+ dimensional space, but meaningful language only occupies a much lower-dimensional manifold within it. This assumption is the foundation of manifold learning.
Local Neighborhoods: The Building Block
The key insight: if you zoom in really close to the curved paper, it looks almost flat locally. This means a point's neighbors should be points that are nearby along the surface, not points reached by cutting through empty space. The algorithm exploits this simple idea in three steps (a code sketch follows the three steps):
Step 1: Find neighbors. For each data point, find the k closest points using straight-line distance (Euclidean). Call these your neighbors. If your data is dense enough, the closest points in space should also be close on the actual surface.
Step 2: Build a network. Create a graph: each point is a node, and you draw edges between neighbors. This creates a connected web of points.
Step 3: Use the network to measure distance. Instead of measuring straight-line distance between two points, trace the shortest path along the network edges. This path-distance approximates how far apart points are along the actual surface. The more data points you have, the better this approximation becomes.
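Here is a minimal sketch of those three steps, assuming scikit-learn and SciPy are available; the Swiss roll dataset, k=10, and the chosen point indices are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

# Sample points from a Swiss roll: a 2D sheet rolled up in 3D.
X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# Steps 1 + 2: connect each point to its 10 nearest neighbors (Euclidean),
# keeping edge lengths as weights. This is the neighborhood graph.
graph = kneighbors_graph(X, n_neighbors=10, mode="distance")

# Step 3: shortest paths along the graph approximate geodesic distances.
# (If k is too small the graph can disconnect, leaving some distances infinite.)
geodesic = shortest_path(graph, directed=False)

# Compare straight-line vs. along-the-surface distance for one pair of points.
i, j = 0, 500
print(f"Euclidean: {np.linalg.norm(X[i] - X[j]):.2f}")
print(f"Geodesic (graph path): {geodesic[i, j]:.2f}")
```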
Three Popular Approaches
1. Isomap (Isometric Mapping)
Build a neighborhood graph with k-NN, then use geodesic distances (shortest paths on the graph) instead of straight-line Euclidean distances; these respect the manifold's geometry. Finally, apply classical multidimensional scaling (MDS) to find a low-dimensional representation in which those geodesic distances are preserved.
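A hedged sketch of this with scikit-learn's Isomap; the Swiss roll data and parameters are illustrative, not tuned.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1500, random_state=0)

# Isomap: k-NN graph -> graph shortest paths -> classical MDS on those distances.
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(embedding.shape)  # (1500, 2): the "unrolled" 2D coordinates
```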
2. Locally Linear Embedding (LLE)
The idea: each point can be written as a linear combination of its neighbors. This works because locally (zoomed in), the manifold is flat. Find weights that reconstruct each point from its neighbors. Then find a low-dimensional representation where the same reconstruction weights work.
This preserves local geometry without computing distances explicitly. It's asking: "what's the simplest low-dimensional space where these reconstruction relationships still hold?"
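A similar sketch with scikit-learn's LocallyLinearEmbedding, on the S-curve toy dataset (again assumed available, with illustrative parameters):

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_s_curve(n_samples=1500, random_state=0)

# LLE: learn weights that reconstruct each point from its neighbors,
# then find 2D coordinates where the same weights still work.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
embedding = lle.fit_transform(X)
print(embedding.shape)                                       # (1500, 2)
print(f"reconstruction error: {lle.reconstruction_error_:.2e}")
```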
3. t-SNE (t-Distributed Stochastic Neighbor Embedding)
Convert distances to probabilities. In the high-dimensional space, the probability that two points are neighbors falls off with their distance (via a Gaussian kernel); in the low-dimensional space, a heavier-tailed Student-t distribution plays the same role (hence the "t"). The embedding is optimized so the two sets of neighbor probabilities match. t-SNE is very good at revealing cluster structure and is widely used for visualization.
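A short sketch with scikit-learn's TSNE on the built-in 8×8 handwritten digits (64 dimensions); perplexity and other settings are left near their defaults:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 1797 digit images, each an 8x8 grid = 64 dimensions.
X, y = load_digits(return_X_y=True)

# t-SNE: match neighbor probabilities between 64D and 2D.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(embedding.shape)  # (1797, 2): points tend to cluster by digit
```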
Why This Matters
Understanding manifold structure changes how you think about similarity. It explains why:
- Euclidean distance fails for high-dimensional data: In 100 dimensions, pairwise distances concentrate and most of them look nearly the same (see the sketch after this list). The manifold structure is what matters.
- Visualization is hard: To plot data meaningfully, you need to respect manifold distances, not just raw high-dimensional distances.
- Clustering can be misleading: Two clusters separated in the original space might be one connected manifold. Or vice versa.
- Interpolation matters: If you move smoothly in manifold space, you get meaningful transitions. Moving in raw high-dimensional space gives garbage.
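A quick way to see the first point (distance concentration), assuming NumPy and SciPy: compare how spread out pairwise distances are for random points in 2 versus 100 dimensions.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for dim in (2, 100):
    X = rng.standard_normal((1000, dim))
    d = pdist(X)  # all unique pairwise Euclidean distances
    # In high dimensions the spread shrinks relative to the typical distance,
    # so "nearest" and "farthest" neighbors look increasingly alike.
    print(f"dim={dim:3d}  relative spread (std/mean) = {d.std() / d.mean():.3f}")
```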
Practical insight: When your algorithm treats all distances equally (like k-means), it's blind to manifold structure. When you use manifold-aware methods, you respect the true geometry of your data. This often leads to better clusters, better embeddings, and more interpretable results.
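A small illustration of that practical point, assuming scikit-learn: plain k-means on the two-moons toy dataset versus spectral clustering, which builds a neighborhood graph first and so stands in here for a "manifold-aware" method.

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=500, noise=0.05, random_state=0)

# k-means uses raw Euclidean distance and tends to cut each moon in half.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Spectral clustering works on a k-NN graph, so it can follow the moons' shape.
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, random_state=0).fit_predict(X)

print("k-means ARI: ", adjusted_rand_score(y, km))  # typically well below 1
print("spectral ARI:", adjusted_rand_score(y, sc))  # typically close to 1
```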
The Catch: The Manifold Assumption Isn't Always True
Not all data lies on a manifold. If your data is uniformly scattered in high dimensions with no structure, manifold learning adds no value, and might confuse things. The manifold assumption works best when:
- Data has intrinsic structure (images, text, time series)
- There are fewer truly independent factors than dimensions suggest
- The manifold is dense (enough data to estimate the structure)
The takeaway: Distance and similarity in manifold learning are fundamentally about discovering the hidden geometry in your data. Once you see the manifold, you see what your data truly looks like.