Distance and Similarity in Manifold Learning
Imagine you crumple a flat piece of paper into a ball. The paper is fundamentally 2-dimensional (a flat surface), but now it's twisted and folded into 3D space. A manifold is like that crumpled paper. Manifold learning asks: what is the true, simple structure hiding inside the high-dimensional mess?
The catch: straight-line distance through 3D space is not the same as distance along the crumpled paper's surface. Two points might be close if you measure straight through the air, but far apart if you have to follow the paper's folds and creases. Manifold learning cares about the distance along the paper's surface, not the straight-line distance through space.
Interactive demo: Crumple the Paper (3D)
The Core Problem: Which Distance Matters?
Euclidean distance measures straight-line distance through the full space. If you're in a room with a curved wall, Euclidean distance is the direct path through the wall. Geodesic distance follows the surface, the actual path you'd walk along the manifold.
Example: Two people on opposite sides of a mountain are close in Euclidean distance (straight through the rock) but far in geodesic distance (walking around the mountain). Manifold learning cares about geodesics, because that's how the data actually relates.
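A tiny numeric illustration of the gap, assuming Python with NumPy: two points on opposite sides of a unit circle are 2 units apart in a straight line, but about 3.14 units apart along the circle itself.

```python
import numpy as np

# Two points on the unit circle, at angles 0 and pi (opposite sides).
theta_a, theta_b = 0.0, np.pi
a = np.array([np.cos(theta_a), np.sin(theta_a)])
b = np.array([np.cos(theta_b), np.sin(theta_b)])

radius = 1.0
euclidean = np.linalg.norm(a - b)             # chord straight through the interior: 2.0
geodesic = abs(theta_b - theta_a) * radius    # arc length along the circle: pi ~ 3.14

print(f"Euclidean (through space): {euclidean:.3f}")
print(f"Geodesic (along surface):  {geodesic:.3f}")
```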
Similarity vs. Closeness
Here's the key insight: two points can be close when you measure straight through the high-dimensional space, yet far apart along the actual curved surface. Let's use our examples:
Rolled-up scroll: Two points on different layers are close if you measure straight through (high-dimensional space), but far apart if you follow the paper (manifold).
Handwritten digits: Two digit images might have similar pixel values (close in 784D space), but they might be very different digits (far on the manifold). Or two images that look identical might have different pixel values due to noise.
This is why manifold learning matters: straight-line distance in the high-dimensional space is misleading. You need to measure distance along the actual structure.
Similarity should be based on manifold structure, not raw 3D distance. That's what manifold learning algorithms try to discover: which points are truly similar given the underlying geometry?
The Manifold Assumption: Real data doesn't randomly use all available dimensions. It clusters on lower-dimensional manifolds within the high-dimensional space. For example: a handwritten digit image has 784 dimensions (28×28 pixels), but not every possible combination of pixel values is a valid digit. Valid digits form only a roughly 10-50 dimensional manifold within that 784D space. Same with text embeddings: they live in a 512+ dimensional space, but meaningful language only occupies a much lower-dimensional manifold within it. This assumption is the foundation of manifold learning.
Local Neighborhoods: The Building Block
The key insight: if you zoom in really close to the curved paper, it looks almost flat locally. This means a point's neighbors should be points that are nearby along the surface, not points reached by cutting through empty space. The algorithm exploits this simple idea in three steps (a code sketch follows the three steps):
Step 1: Find neighbors. For each data point, find the k closest points using straight-line distance (Euclidean). Call these your neighbors. If your data is dense enough, the closest points in space should also be close on the actual surface.
Step 2: Build a network. Create a graph: each point is a node, and you draw edges between neighbors. This creates a connected web of points.
Step 3: Use the network to measure distance. Instead of measuring straight-line distance between two points, trace the shortest path along the network edges. This path-distance approximates how far apart points are along the actual surface. The more data points you have, the better this approximation becomes.
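Here is a minimal sketch of those three steps, assuming scikit-learn and SciPy are available; the Swiss roll dataset, k=10, and the chosen point indices are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

# Sample points from a Swiss roll: a 2D sheet rolled up in 3D.
X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# Steps 1 + 2: connect each point to its 10 nearest neighbors (Euclidean),
# keeping edge lengths as weights. This is the neighborhood graph.
graph = kneighbors_graph(X, n_neighbors=10, mode="distance")

# Step 3: shortest paths along the graph approximate geodesic distances.
# (If k is too small the graph can disconnect, leaving some distances infinite.)
geodesic = shortest_path(graph, directed=False)

# Compare straight-line vs. along-the-surface distance for one pair of points.
i, j = 0, 500
print(f"Euclidean: {np.linalg.norm(X[i] - X[j]):.2f}")
print(f"Geodesic (graph path): {geodesic[i, j]:.2f}")
```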
Three Popular Approaches
1. Isomap (Isometric Mapping)
Build a neighborhood graph with k-NN, then use geodesic distances (shortest paths on the graph) instead of straight-line Euclidean distances; these respect the manifold's geometry. Finally, apply classical multidimensional scaling (MDS) to find a low-dimensional representation in which those geodesic distances are preserved.
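A hedged sketch of this with scikit-learn's Isomap; the Swiss roll data and parameters are illustrative, not tuned.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1500, random_state=0)

# Isomap: k-NN graph -> graph shortest paths -> classical MDS on those distances.
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(embedding.shape)  # (1500, 2): the "unrolled" 2D coordinates
```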
2. Locally Linear Embedding (LLE)
The idea: each point can be written as a linear combination of its neighbors. This works because locally (zoomed in), the manifold is flat. Find weights that reconstruct each point from its neighbors. Then find a low-dimensional representation where the same reconstruction weights work.
This preserves local geometry without computing distances explicitly. It's asking: "what's the simplest low-dimensional space where these reconstruction relationships still hold?"
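A similar sketch with scikit-learn's LocallyLinearEmbedding, on the S-curve toy dataset (again assumed available, with illustrative parameters):

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_s_curve(n_samples=1500, random_state=0)

# LLE: learn weights that reconstruct each point from its neighbors,
# then find 2D coordinates where the same weights still work.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
embedding = lle.fit_transform(X)
print(embedding.shape)                                       # (1500, 2)
print(f"reconstruction error: {lle.reconstruction_error_:.2e}")
```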
3. t-SNE (t-Distributed Stochastic Neighbor Embedding)
Convert distances to probabilities. In the high-dimensional space, the probability that two points are neighbors falls off with their distance (via a Gaussian kernel); in the low-dimensional space, a heavier-tailed Student-t distribution plays the same role (hence the "t"). The embedding is optimized so the two sets of neighbor probabilities match. t-SNE is very good at revealing cluster structure and is widely used for visualization.
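A short sketch with scikit-learn's TSNE on the built-in 8×8 handwritten digits (64 dimensions); perplexity and other settings are left near their defaults:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 1797 digit images, each an 8x8 grid = 64 dimensions.
X, y = load_digits(return_X_y=True)

# t-SNE: match neighbor probabilities between 64D and 2D.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(embedding.shape)  # (1797, 2): points tend to cluster by digit
```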
Why This Matters
Understanding manifold structure changes how you think about similarity. It explains why:
- Euclidean distance fails for high-dimensional data: In 100 dimensions, pairwise distances concentrate and most of them look nearly the same (see the sketch after this list). The manifold structure is what matters.
- Visualization is hard: To plot data meaningfully, you need to respect manifold distances, not just raw high-dimensional distances.
- Clustering can be misleading: Two clusters separated in the original space might be one connected manifold. Or vice versa.
- Interpolation matters: If you move smoothly in manifold space, you get meaningful transitions. Moving in raw high-dimensional space gives garbage.
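A quick way to see the first point (distance concentration), assuming NumPy and SciPy: compare how spread out pairwise distances are for random points in 2 versus 100 dimensions.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for dim in (2, 100):
    X = rng.standard_normal((1000, dim))
    d = pdist(X)  # all unique pairwise Euclidean distances
    # In high dimensions the spread shrinks relative to the typical distance,
    # so "nearest" and "farthest" neighbors look increasingly alike.
    print(f"dim={dim:3d}  relative spread (std/mean) = {d.std() / d.mean():.3f}")
```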
Practical insight: When your algorithm treats all distances equally (like k-means), it's blind to manifold structure. When you use manifold-aware methods, you respect the true geometry of your data. This often leads to better clusters, better embeddings, and more interpretable results.
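A small illustration of that practical point, assuming scikit-learn: plain k-means on the two-moons toy dataset versus spectral clustering, which builds a neighborhood graph first and so stands in here for a "manifold-aware" method.

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=500, noise=0.05, random_state=0)

# k-means uses raw Euclidean distance and tends to cut each moon in half.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Spectral clustering works on a k-NN graph, so it can follow the moons' shape.
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, random_state=0).fit_predict(X)

print("k-means ARI: ", adjusted_rand_score(y, km))  # typically well below 1
print("spectral ARI:", adjusted_rand_score(y, sc))  # typically close to 1
```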
The Catch: The Manifold Assumption Isn't Always True
Not all data lies on a manifold. If your data is uniformly scattered in high dimensions with no structure, manifold learning adds no value, and might confuse things. The manifold assumption works best when:
- Data has intrinsic structure (images, text, time series)
- There are fewer truly independent factors than dimensions suggest
- The manifold is dense (enough data to estimate the structure)
The takeaway: Distance and similarity in manifold learning are fundamentally about discovering the hidden geometry in your data. Once you see the manifold, you see what your data truly looks like.