  • Probably the only interesting part of that study to me is how they measure “erratic”, which uses a measure they’ve called “novelty”. It’s in appendix A.1:

    A.1 Embedding and Novelty Measurement

    To quantify content novelty, we first convert the text of each post into a high-dimensional vector representation (embedding). This process begins by cleaning the raw post content (e.g., stripping HTML tags) and feeding the text into a pre-trained SentenceTransformer model, specifically all-MiniLM-L6-v2. This model maps each post to a 384-dimensional vector. From the full corpus of N posts, we obtain a matrix of “raw” embeddings, 𝐄_raw.

    These raw embeddings are known to suffer from anisotropy (a non-uniform distribution in the vector space), which can make distance metrics unreliable [li2020sentence]. To correct this, we apply a standard decorrelation step. We fit a Principal Component Analysis model with whitening to the entire matrix 𝐄_raw. This transformation de-correlates the features and scales them to have unit variance, yielding a matrix of “whitened” embeddings, 𝐄_white [su2021whitening]. These whitened vectors are used for all novelty calculations.
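    A minimal sketch of that pipeline, assuming the sentence-transformers and scikit-learn packages (the appendix names the model but not the PCA implementation, so sklearn here is my guess):

    ```python
    # Sketch of the A.1 pipeline: embed posts, then PCA-whiten.
    # Assumptions: sentence-transformers for the model (named in the
    # paper) and scikit-learn for the whitening step (my guess).
    from sentence_transformers import SentenceTransformer
    from sklearn.decomposition import PCA

    posts = [
        "First post, already stripped of HTML.",
        "Another post about something else entirely.",
        "A third post, back on the first topic.",
    ]

    # Map each post to a 384-dimensional vector.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    E_raw = model.encode(posts)  # shape (N, 384)

    # Decorrelate the features and scale them to unit variance.
    # n_components=2 only because this toy corpus has three posts;
    # the paper fits PCA on the full corpus.
    pca = PCA(n_components=2, whiten=True)
    E_white = pca.fit_transform(E_raw)
    ```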

    There is a decent primer on the sentence-transformer model here:

    https://medium.com/@rahultiwari065/unlocking-the-power-of-sentence-embeddings-with-all-minilm-l6-v2-7d6589a5f0aa

    I’m not sure of a great primer on PCA; roughly, it finds the dominant directions (the axes of greatest variance) in a set of vectors.
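    For intuition, here’s a toy 2-D example (mine, not the paper’s) where PCA recovers the axis the points mostly lie along:

    ```python
    # Toy illustration: PCA finds the directions of greatest variance.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    t = rng.normal(size=200)
    # Points scattered tightly around the line y = x.
    points = np.column_stack([t, t + 0.1 * rng.normal(size=200)])

    pca = PCA(n_components=2).fit(points)
    print(pca.components_[0])             # first axis, roughly ±[0.71, 0.71]
    print(pca.explained_variance_ratio_)  # nearly all variance on that axis
    ```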

    With that novelty measurement, the erraticness seems to be computed by averaging the whitened embeddings over a seven-day window and then measuring the Euclidean distance from that average.
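    If I’m reading it right, each post’s novelty would be something like the sketch below; the exact window handling is my guess, since I only have the appendix text to go on:

    ```python
    # My reading of the erraticness/novelty measure: Euclidean distance
    # between a post's whitened embedding and the mean embedding of a
    # trailing seven-day window. The windowing details are guesses.
    import numpy as np

    def novelty(e_post: np.ndarray, window_embeddings: np.ndarray) -> float:
        """Distance of one embedding from the window's average embedding."""
        centroid = window_embeddings.mean(axis=0)
        return float(np.linalg.norm(e_post - centroid))

    # Stand-in whitened embeddings (in practice, rows of E_white from
    # the pipeline sketch above, grouped by post timestamp).
    rng = np.random.default_rng(1)
    E_white = rng.normal(size=(10, 384))
    print(novelty(E_white[-1], E_white[:-1]))
    ```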

    I did have a pint just before reading and writing this, so there are probably some mistakes here.