Probably the only interesting part of that study to me is how they're measuring “erratic”, which uses a measure they've called “novelty”. It's in appendix A.1:
A.1 Embedding and Novelty Measurement
To quantify content novelty, we first convert the text of each post into a high-dimensional vector representation (embedding). This process begins by cleaning the raw post content (e.g., stripping HTML tags) and feeding the text into a pre-trained SentenceTransformer model, specifically all-MiniLM-L6-v2. This model maps each post to a 384-dimensional vector. From the full corpus of N posts, we obtain a matrix of “raw” embeddings, E_raw.
These raw embeddings are known to suffer from anisotropy (a non-uniform distribution in the vector space), which can make distance metrics unreliable [li2020sentence]. To correct this, we apply a standard decorrelation step. We fit a Principal Component Analysis model with whitening to the entire matrix E_raw. This transformation de-correlates the features and scales them to have unit variance, yielding a matrix of “whitened” embeddings, E_white [su2021whitening]. These whitened vectors are used for all novelty calculations.
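If I'm reading that right, the pipeline is something like the sketch below. This assumes the usual sentence-transformers and scikit-learn APIs; the posts list and the HTML stripping are stand-ins, not anything from the paper.

    # Rough sketch of the A.1 pipeline: strip HTML, embed with
    # all-MiniLM-L6-v2 (384-dim), then PCA-whiten over the whole corpus.
    import re
    from sentence_transformers import SentenceTransformer
    from sklearn.decomposition import PCA

    posts = ["<p>first post</p>", "<p>second post</p>", "<p>third post</p>"]

    def clean(html):
        # stand-in for the paper's cleaning step
        return re.sub(r"<[^>]+>", " ", html).strip()

    model = SentenceTransformer("all-MiniLM-L6-v2")
    E_raw = model.encode([clean(p) for p in posts])  # shape (N, 384)

    # whitening de-correlates the features and scales them to unit variance
    pca = PCA(whiten=True).fit(E_raw)
    E_white = pca.transform(E_raw)  # used for all the novelty calculations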
There is a decent primer on the embedding model here:
https://medium.com/@rahultiwari065/unlocking-the-power-of-sentence-embeddings-with-all-minilm-l6-v2-7d6589a5f0aa
I'm not sure of a great primer on PCA; roughly, it finds the dominant directions of variation in a set of vectors, as in the toy example below.
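To illustrate with made-up 2-D data (everything here is my own toy example, nothing from the paper):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    t = rng.normal(size=(500, 1))
    points = np.hstack([t, t]) + 0.1 * rng.normal(size=(500, 2))  # stretched along y = x

    pca = PCA(n_components=2).fit(points)
    print(pca.components_[0])             # ~ [0.71, 0.71]: the dominant direction
    print(pca.explained_variance_ratio_)  # first component carries nearly all the variance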
With that novelty measurement in hand, the erraticness seems to come from averaging the whitened embeddings over a seven-day window and then measuring the Euclidean distance from that window average.
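Here's my guess at what that looks like in code. The trailing seven-day window, the function name, and the handling of posts with an empty window are all my assumptions, not the paper's:

    # How I think the erraticness works: for each post, take the mean of the
    # whitened embeddings from the previous seven days and measure the
    # Euclidean distance from the post to that mean.
    import numpy as np
    import pandas as pd

    def erraticness(E_white, timestamps):
        ts = pd.to_datetime(pd.Series(timestamps)).reset_index(drop=True)
        scores = np.full(len(E_white), np.nan)
        for i in range(len(E_white)):
            window = (ts < ts[i]) & (ts >= ts[i] - pd.Timedelta(days=7))
            if window.any():
                window_mean = E_white[window.to_numpy()].mean(axis=0)
                scores[i] = np.linalg.norm(E_white[i] - window_mean)
        return scores  # NaN where a post has no prior posts in its window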
I did have a pint just before reading and writing this, so there are probably some mistakes here.