Probably the only interesting part of that study to me is how they're measuring “erratic”, which uses a measure they've called “novelty”. It's in appendix A.1:
A.1 Embedding and Novelty Measurement
To quantify content novelty, we first convert the text of each post into a high-dimensional vector representation (embedding). This process begins by cleaning the raw post content (e.g., stripping HTML tags) and feeding the text into a pre-trained SentenceTransformer model, specifically all-MiniLM-L6-v2. This model maps each post to a 384-dimensional vector. From the full corpus of N posts, we obtain a matrix of “raw” embeddings, E_raw.
These raw embeddings are known to suffer from anisotropy (a non-uniform distribution in the vector space), which can make distance metrics unreliable [li2020sentence]. To correct this, we apply a standard decorrelation step. We fit a Principal Component Analysis model with whitening to the entire matrix E_raw. This transformation de-correlates the features and scales them to have unit variance, yielding a matrix of “whitened” embeddings, E_white [su2021whitening]. These whitened vectors are used for all novelty calculations.
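If I'm reading that right, the pipeline is something like the sketch below. This assumes the usual sentence-transformers and scikit-learn APIs; the posts list and the HTML stripping are stand-ins, not anything from the paper.

    # Rough sketch of the A.1 pipeline: strip HTML, embed with
    # all-MiniLM-L6-v2 (384-dim), then PCA-whiten over the whole corpus.
    import re
    from sentence_transformers import SentenceTransformer
    from sklearn.decomposition import PCA

    posts = ["<p>first post</p>", "<p>second post</p>", "<p>third post</p>"]

    def clean(html):
        # stand-in for the paper's cleaning step
        return re.sub(r"<[^>]+>", " ", html).strip()

    model = SentenceTransformer("all-MiniLM-L6-v2")
    E_raw = model.encode([clean(p) for p in posts])  # shape (N, 384)

    # whitening de-correlates the features and scales them to unit variance
    pca = PCA(whiten=True).fit(E_raw)
    E_white = pca.transform(E_raw)  # used for all the novelty calculations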
There is a decent primer on the embedding model here:
https://medium.com/@rahultiwari065/unlocking-the-power-of-sentence-embeddings-with-all-minilm-l6-v2-7d6589a5f0aa
I'm not sure of a great primer on PCA; roughly, it finds the dominant directions of variation in a set of vectors, as in the toy example below.
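To illustrate with made-up 2-D data (everything here is my own toy example, nothing from the paper):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    t = rng.normal(size=(500, 1))
    points = np.hstack([t, t]) + 0.1 * rng.normal(size=(500, 2))  # stretched along y = x

    pca = PCA(n_components=2).fit(points)
    print(pca.components_[0])             # ~ [0.71, 0.71]: the dominant direction
    print(pca.explained_variance_ratio_)  # first component carries nearly all the variance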
With that novelty measurement in hand, the erraticness seems to come from averaging the whitened embeddings over a seven-day window and then measuring the Euclidean distance from that window average.
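Here's my guess at what that looks like in code. The trailing seven-day window, the function name, and the handling of posts with an empty window are all my assumptions, not the paper's:

    # How I think the erraticness works: for each post, take the mean of the
    # whitened embeddings from the previous seven days and measure the
    # Euclidean distance from the post to that mean.
    import numpy as np
    import pandas as pd

    def erraticness(E_white, timestamps):
        ts = pd.to_datetime(pd.Series(timestamps)).reset_index(drop=True)
        scores = np.full(len(E_white), np.nan)
        for i in range(len(E_white)):
            window = (ts < ts[i]) & (ts >= ts[i] - pd.Timedelta(days=7))
            if window.any():
                window_mean = E_white[window.to_numpy()].mean(axis=0)
                scores[i] = np.linalg.norm(E_white[i] - window_mean)
        return scores  # NaN where a post has no prior posts in its window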
I did have a pint just before reading and writing this, so there are probably some mistakes here.