RSS/Atom feed to the site content #11

pointlessone · 2024-07-30T14:26:36Z

Is there a feed to the content that is published on the site itself?

There's a feed to substack that hasn't been updated in a while, and the substack itself seems to be gone from the web, but I can't find any feeds for the site content.

gwern · 2024-07-30T22:55:37Z

RSS is not supported because I have been unhappy with past attempts to create an RSS feed; the usual blog-centric approach doesn't work well with lots of incremental updates and I dropped the darcs/git→RSS bridge as it was nothing but noise for readers.*

You can probably turn the changelog page into an RSS feed fairly easily, and there is a Gwern flair on the subreddit which comes with an authenticated RSS feed. (Compilation might also be of interest.) That is the closest thing right now to what you want. (And the new 'author' metadata+backlinks is halfway to an RSS feed - that is how the 'Gwern' section of https://gwern.net/doc/newest/index is implemented - so you could probably scrape that too and RSSfy it.)

After reading namespace's comments on blogging and my own thinking about multi-level design, I have been mulling over an approach to RSS feeds that I think can work well.

The problem with standard RSS feeds is that they are designed for one-off pages, like a newspaper or a blog: a URL gets created as a single standalone finished page, announced, and that's that. (The URL contents will inevitably change, but these changes are not considered important.) But this is a poor fit for Gwern.net because I 'finish' major pages only once a month or so, and it is clear that people are interested in a more granular view of my writing than that. (I am always embarrassed when I see someone tweeting out or including in a newsletter a tweet or comment of mine - because it indicates a failure of curation on my part if they have to link those instead of a page on my site.) The other extreme is that every single file modification is reported in the RSS feed, like the RSS feed for English Wikipedia's Recent Changes. This is an equally poor fit, because the nature of Gwern.net is that there is a lot of small change constantly going on site-wide, particularly related to formatting or reorganization, of less than zero interest to readers, and which is why I killed the original RSS feed: I couldn't even read it myself!

More broadly, in general, there are just no good ways to announce the full spectrum of changes from comments or tweets (recall the original name: 'microblogging') to shorter blog posts to longform essays/books. No one has done this, and most generally do not recognize this as any kind of problem, but there is a big 'gap' between each speed of service, going from chat to microblog to blog/newsletter to essay to book. So writers online tend to pigeonhole themselves: someone will tweet a lot, or they will instead write a lot of blog posts, or they will periodically write a long effortpost. When they engage in multiple time-scales, usually, one 'wins' and the others are a 'waste' in the sense that they get abandoned: either the author stops using them, or the content there gets 'stranded'. Most writers simply accept this, and chop their writing down to fit - Scott Alexander writes solely in blogposts & blogpost comments (having largely given up on Twitter, Tumblr, and LW/Reddit), even though if you wanted to know his big take on, say, 'predictive processing' as the theory of everything in AI/psychiatry, the best he can do is shrug and point you to like 30 blog posts scattered across at least 3 sites going back a decade (LW, SSC, & ACX); and Matt Levine has to repeat himself, again and again, every time a specific recurring drama comes up. If you don't want that to happen (maybe you don't love to hear yourself write the same thing again and again as much as Mencius Moldbug does) and don't want to become solely a poaster or to have your secondary writings become so much water under the bridge, your only option is to do a lot of tedious work copying back and forth: I try to monthly review my tweets & comments and pull stuff onto Gwern.net, but it is a lot of work and I often think that I am leaving behind a lot of stuff, even when I do manage to do the review. It is clearly not very sustainable. 
And if I wanted to summarize it at multiple levels (like a level in between a list of tweets and a full essay, or an annual level), when would I ever actually do something like write or think or live?

I think this is a major reason for the death of blogging and the increasing rarity of non-blog homepage sites like Gwern.net: the friction of multiple writing places means you are constantly being sucked into just one and tempted to abandon the others; and the rewards of social media tend to win out. You may be able to maintain dual-posting for a while, but at some point a shock happens, and when you return, you have a backlog and never catch up and settle for one. (This is expected from a queuing theory perspective if the friction is heavily overloading you: at some point things will break down catastrophically, and since you aren't writing all that many words per day, which would be easy to copy-paste in total, it must be everything else, the friction/overhead.) And once you are writing solely on Twitter or Facebook or whatever, it's hard to ever escape with your stuff to your own website, despite the perpetual amnesia / eternal now of the microblogs. (The social media sites don't even need to be hostile for this to happen. It's just the constant trivial inconvenience and toil.)

So, how do I solve this broad problem of packaging up my writing in logical units spanning the continuum from single tweets to 'best essay of the year'?

After several years of the annotation popup system and watching LLMs become effectively superhuman at summarization & resolving major limitations like short context windows / cost, I think I can propose a design:

you provide all the levels, without the toil, by starting with a link-centric approach where every comment or tweet or essay or reference is a URL with metadata, and using LLM recursive summarization to fill in the gaps. These different levels can then be exposed to readers as separate RSS feeds.

So the workflow would look like this: every comment is copied by the backend, with its data & metadata like author/date and a LLM auto-title/summary; sets of related comments like a tweet thread get grouped and summarized together as a whole; updates to blog posts or essays likewise get grouped and summarized; finally, whole new posts/essays (with a handwritten summary, or again LLM-written).
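
The workflow above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual Gwern.net backend: the `Item` type, the `thread_id` grouping key, and the `summarize` callable (a stand-in for whatever LLM call would really be used) are all hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Item:
    """One atomic piece of writing (a comment, a tweet, an essay edit)."""
    url: str
    author: str
    date: str       # ISO 8601 date, e.g. "2024-07-30"
    thread_id: str  # hypothetical key linking related items together
    text: str


def build_levels(items: List[Item],
                 summarize: Callable[[List[str]], str]) -> Dict[str, List[str]]:
    """Return one list of feed entries per level of granularity."""
    # Level 0: every atomic item, each with its own auto-generated summary.
    atomic = [summarize([item.text]) for item in items]

    # Level 1: related items (e.g. a tweet thread) summarized as a whole.
    threads: Dict[str, List[Item]] = defaultdict(list)
    for item in items:
        threads[item.thread_id].append(item)
    grouped = [summarize([i.text for i in group]) for group in threads.values()]

    # Level 2: a summary of the summaries, i.e. the recursive step.
    overview = [summarize(grouped)] if grouped else []

    return {"atomic": atomic, "grouped": grouped, "overview": overview}
```

Each key of the returned dictionary would correspond to one RSS feed. Passing the summarizer in as a parameter keeps the grouping logic testable without any model attached.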

Once this has been set up, as the author, I go around tweeting or Redditing, possibly replying to my own comments repeatedly as the muse demands, and my tweets all get logged and saved automatically; a reader interested in the blow-by-blow can subscribe to the most atomic RSS feed, and read each one; a reader with less time to spare reads the grouped comment summary RSS feed; or they can read the whole essay that I eventually sit down and write with proper references (starting from the grouped-comment summary as a quick-and-dirty outline to help me get started). But there is almost no friction which stops my comments from percolating up through my site, from individual atomic comments to short summaries to finished writings, and readers can pick what level they want to read at, rather than reading a one-size-fits-all-but-suits-none 'most recent' RSS feed. (cf. Kicks Condor's "Fraidycat" with fixed allocation to feeds.)

For Gwern.net specifically, I would start with the annotation system as the 'atomic' level and try to build up from there.

This is now feasible, because I have finally overhauled the backend to support additional metadata on annotations, like the critical 'date modified' field (so necessary for /doc/newest/index).

With meaningful 'last-modified' vs 'date-created' metadata on all pages+annotations, I can now populate a sane RSS feed with both newly-created & recently-modified items, and these items can be both my essays & any new links I bookmark or annotations created.

So the idea is that reading the RSS is like a link feed with essay updates once in a while. It might go something like this:

"Golden Gate Bridge WP article / SF city WP article / annotation (suicide study) / annotation (poem) / weekly batch list of miscellaneous URLs & image uploads / 'Movie Reviews: +review of The Bridge 2006' / annotation (Arxiv) / annotation (Arxiv) / annotation (Arxiv) / annotation (Arxiv) / 'Research Ideas: free play for RL exploration' / annotation (link) / annotation (link) / annotation (link) / weekly batch list of miscellaneous links / annotation (link) / annotation (link) / annotation (link) / ..."

'Full' annotations & essays get a separate entry, while the shorter 'partial' annotations (which have only a little metadata like a title or a tag) get rolled up into a single large weekly item which can be skipped or skimmed.
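
A sketch of that split, assuming (hypothetically) that an annotation counts as 'full' when it has a non-empty abstract and that weekly batches are keyed by ISO week. The `Annotation` type and the batch title format are illustrative, not taken from the site's code.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Tuple


@dataclass
class Annotation:
    title: str
    url: str
    modified: date
    abstract: str = ""  # empty for a 'partial' annotation


def is_full(a: Annotation) -> bool:
    # Assumption: "full" simply means the annotation has an abstract.
    return bool(a.abstract.strip())


def feed_entries(annotations: List[Annotation]) -> List[Tuple[str, List[str]]]:
    """Return (entry title, list of URLs) pairs, newest first."""
    entries: List[Tuple[date, str, List[str]]] = []
    batches: Dict[Tuple[int, int], List[Annotation]] = defaultdict(list)
    for a in annotations:
        if is_full(a):
            # Full annotations and essays each get their own feed entry.
            entries.append((a.modified, a.title, [a.url]))
        else:
            iso = a.modified.isocalendar()
            batches[(iso[0], iso[1])].append(a)  # key: (ISO year, ISO week)

    # Partial annotations are rolled up into one entry per week.
    for (year, week), group in batches.items():
        newest = max(a.modified for a in group)
        title = f"Miscellaneous links: {year}-W{week:02d}"
        entries.append((newest, title, [a.url for a in group]))

    entries.sort(key=lambda entry: entry[0], reverse=True)
    return [(title, urls) for _, title, urls in entries]
```

A weekly batch is dated by its newest member, so it sorts into the feed alongside the full entries from the same week.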

Because the annotations are just static HTML snippets already, they can be easily put into the RSS feed itself, to allow a preview (even if that obviously wouldn't support all of the on-site features; they will link to the essay or the first tag-directory entry, so the RSS entry for https://www.theinformation.com/articles/openai-removes-ai-safety-leader-m-dry-a-onetime-ally-of-ceo-altman wouldn't link to TI but to its current tag-directory entry).
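
For illustration, embedding a pre-rendered HTML snippet in an RSS 2.0 item needs nothing beyond the Python standard library. The item dictionary keys (`title`, `link`, `html`, `modified`) are assumptions made for this sketch. The `link` value is meant to be the on-site path (essay or tag-directory entry), not the original external URL.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def rss_feed(title: str, site: str, items: list) -> str:
    """Serialize items (dicts with title, link, html, modified) as RSS 2.0."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = site
    ET.SubElement(channel, "description").text = title
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        # Link to the on-site location, not the original external URL.
        ET.SubElement(node, "link").text = site + item["link"]
        # ElementTree entity-escapes the HTML, which is the usual RSS 2.0 encoding.
        ET.SubElement(node, "description").text = item["html"]
        # RSS 2.0 dates use the RFC 822 format, which format_datetime emits.
        ET.SubElement(node, "pubDate").text = format_datetime(item["modified"])
    return ET.tostring(rss, encoding="unicode")
```

A reader that renders the `description` would show the annotation as a static preview, without any of the on-site popup behavior.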

The main problem with this is that if an essay like a review is included each time its last-modified changes, the entry is not too useful: the annotation, which contains the page abstract, will likely still be the same and not mention whatever is changed. So you might see an essay pop up a dozen times in this RSS feed without knowing what changed. I could write a manual diff, but that is exactly the sort of "toil" I am trying to avoid on Gwern.net because it is unsustainable in the long run & such overhead unconsciously discourages a writer. (You feel like you are being punished---you wrote something, and your reward for a job well done is... being required to write even more? Not fun.) It is also difficult to do any kind of labeling of importance of a patch upfront: a good essay update might be composed of dozens of patches, each trivial on their own; indeed, I might not realize where something is going until after the writing is all done, as I explore a topic or people respond or I dig up new sources or have a sudden realization - "completeness" is something that can be known only in retrospect, reviewing changes.

But with the date-range and the git history and progress in LLM context windows, this can now be automated.
When an essay's last-modified indicates that it should get an RSS feed item and the date-created indicates that this is an 'old' essay which has been recently modified rather than a new essay which has never been in the RSS feed before, the RSS-generating code can skip the hand-written abstract and generate a summary of the changes instead.
To do this, call git on the essay's Markdown source file for the last month of patches, extract the patch summaries & even the patches themselves if necessary, and feed them into a LLM like GPT-4o-mini to get back a consolidated description.
(With context windows like 128k, I can easily feed in some examples to few-shot the task and still have room for big sets of diffs.)

The LLM will know to not bother mentioning the massive churn on Gwern.net like spellcheck or linkrot fixing, and will summarize it on a more semantic level for readers.
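
A rough sketch of that pipeline, with the model call left as a pluggable `llm` function since no particular API is specified here. The function names, the 30-day window, and the prompt wording are all assumptions for illustration. The `git log` flags used (`--since`, `--format=%s`, `--patch`) are standard.

```python
import subprocess
from datetime import date, timedelta
from typing import Callable, List


def needs_change_summary(created: date, modified: date, today: date,
                         window_days: int = 30) -> bool:
    """True for an 'old' essay that was modified within the window."""
    cutoff = today - timedelta(days=window_days)
    return modified >= cutoff and created < cutoff


def git_log_command(path: str, since: str = "1 month ago",
                    with_patches: bool = False) -> List[str]:
    """Build the `git log` invocation for one essay's Markdown source."""
    cmd = ["git", "log", f"--since={since}", "--format=%s"]
    if with_patches:
        cmd.append("--patch")  # include the diffs, not just the subjects
    return cmd + ["--", path]


def build_prompt(history: str, examples: str = "") -> str:
    """Assemble the few-shot prompt; the wording is purely illustrative."""
    return (
        "Summarize the substantive changes to this essay for readers. "
        "Ignore formatting, spellcheck, and link-rot fixes.\n\n"
        f"{examples}\n\nPatches:\n{history}\n\nSummary:"
    )


def summarize_changes(path: str, llm: Callable[[str], str]) -> str:
    """Run git on the essay source and ask the LLM for one description."""
    history = subprocess.run(git_log_command(path), capture_output=True,
                             text=True, check=True).stdout
    return llm(build_prompt(history))
```

Only essays for which `needs_change_summary` is true would go through `summarize_changes`; brand-new essays would keep their hand-written abstract.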

And you can generalize this idea further, and start to meet Namespace's challenge for blogging software that can go from tweets to posts: right now, there is no good way to go from writing in tweet-sized chunks to writing longform essays. So many people who could have written blog posts or even books wind up trapped in long tweet-threads (or given the hostility of Twitter these days to any kind of serious intellectual work, like hiding tweets from non-logged-in users & penalizing tweets with external links, not writing anything at all), which never leave Twitter and are impossible to find or browse sanely.
But with LLMs, you can fix that: the human writer writes in tweets or comments or sections as the muse moves them, and the LLM can consolidate them into progressively larger chunks, culminating in whole essays, and the human writer polishes them up and finalizes them. (After all, when it comes to writing, many people find writing very easy, as demonstrated by their ability to write tens of thousands of words on Twitter or Discord or Reddit or IRC---it's the editing it all together that destroys them with the tedium and fear of criticism and they just never get started.)
Each of these can be a separate RSS feed: one feed for atomic writing like tweets, one feed for the next level up (tweet-threads, sections?), one feed for the next level up (essays?), and so on.

Then a new reader can easily catch up on the backlog: simply read the essays, and then drop down to the level of granularity they have the time & interest for. (A big fan will of course read the most granular tweet-level daily feed, but others will prefer a higher level like weekly summaries, or even monthly essay-sized outputs.)

And then in the ultimate evolution of this, the writer just writes atomic bits without any editing, and the LLM takes care of adding it to an ever-enlarging corpus and expanding it as appropriate, and then summarizing it for the writer & readers to review/read, and updating it based on feedback. (See also "Nenex".)

(One might wonder how to present this outside the RSS feed context, but that's straightforward, especially when you have transclusions + collapses. Like the other things, you just generate each 'level' statically, at compile-time, and then you can present them to the reader as a series of collapses+transcludes. You can present them as simply a flat list or as a recursive set of collapses from small to large, or however you wish. Just rearrange a few links/div-wrappers as you please.)

* For perspective, here are the last 40 git patches to Gwern.net, to give you an idea of how miserably useless a reading experience the patches per se would be for an ordinary reader who just wants a list of 'interesting' updates or writings:

"+lns; begin catching up after trip" · "record all minor pending edits (1722141005)" · "lint; rescrape for date-range subscripts" · "lint" · "+commafy number pass" · "lint; +first comma-fying pass" · "lint" · "lint" · "lorem: inline: date subscripts: continue finetuning examples" · "+lns" · "second EN DASH pass" · "big EN DASH searh-and-replace while working on the newly-enabled date-range subscripts" · "+lns, catch up on some comments" · "Crumb: +apposite comic that's a bit of a meme for Crumb, and some details from an interview" · "lint" · "+lns; lint" · "embryo selection: split out 'history of iterated embryo selection'/IES history to separate page because long enough & of independent interest" · "split out Twitter UX essay for easier linking in my Twitter DMs" · "anime reviews: factor out development hell" · "initialize a miscitation/miscite tag as a subset of publication-bias (statistics/bias/publication/miscitation)" · "RTX: +thumbnail in Midjourneyv6 based on ChatGPT-4o suggestion to use a 'nano' skull and crossbones" · "note: split out 'Highly Potent Drugs As Psychological Warfare Weapons' essay to /rtx due to length and to make more linkable" · "initialize disappearing polymorphs tag (science/chemistry/disappearing-polymorph) apropos of sudden burst of Twitter interest" · "initialize Stigler's diet problem (Dantzig) statistics/decision/stigler-diet tag" · "+lns" · "fully re-paragraphized all outstanding abstracts" · "lint" · "record all minor pending edits (1720931406); rescrape abstracts to update thumbnails for split-out pages" · "GTX: added few-shot examples to paragraphizer.py & loosened length constraint, so rerun paragraphizer on the holdouts" · "+lns" · "split out The Tale of Princess Kaguya anime movie review to /review/princess-kaguya; re-enable Midjourney to test personalization & generate a good Kaguya danse macabre thumbnail" · "lint" · "+lns" · "Timecrimes: finally thought up a thumbnail" · "lint" · "initialize truesight tag 
doc/statistics/stylometry/truesight/ for LLM-powered stylometrics/deanonymization" · "initialize mode collapse preference learning tag for AI slop/ChatGPTese infections in ChatGPT, DALL-E 3, Midjourney, etc (reinforcement-learning/preference-learning/mode-collapse)" · "lorem block: collapses: mismatch cases: rm unnecessary case that is ~impossible to write" · "lint" · "+lns; further new sync lint work"

pointlessone · 2024-07-31T08:20:40Z

OK, I appreciate the thoroughness, and your thoughts on this are very interesting, but I just want to know when new stuff is up. So in the interest of getting it rolling, how about starting small?

Say, first add a feed for essays. Just the list of essays.

Next you can add major updates. It is my understanding that the date of the last major update is added manually. When you do it, you can try formulating in a sentence or two what compelled you to update that date. Manually. If it indeed turns out to be that bad, then you can try going the LLM route or even just dropping it. I mean, if you don't think it's worth exposing the change summary on the website then it's probably not worth doing in the feed. Just add the update date and let RSS readers figure it out.

I honestly don't want to add a lot of work for you. I just want to know about new writing. In a sense, I want to eliminate the toil of checking your site from time to time to find out if there's anything new.

dbohdan · 2024-07-31T09:30:05Z

I think a feed that was just new pages (including refactors, like split-out reviews?) would be useful.

Something I'd want in a unified update feed for gwern.net is a form or multiple forms of tagging for different types of content. A basic one would be a way to distinguish between annotations/links and essays. Title prefixes like "Movie Reviews:" are a good idea. For additional tags that aren't easy or practical to fit in the title, like "essay" for all types of essays regardless of the prefix, you could have RSS/Atom categories on items.

gwern · 2024-07-31T17:05:59Z

So in the interest of getting it rolling how about starting small?

I don't want to, though. It isn't fun, I won't learn anything from it, and it doesn't excite me to half-ass an RSS feed which I know is an unsatisfactory solution to the problem. (And it would be a liability as well: an RSS feed is a promise to the reader, and has long-term consequences. I am still fixing spurious 404s from the old Gitit-style RSS feed I deleted a decade ago - once the URLs get out there and are linked, you can't recall them.)

And the RSS libraries are a pain to work with because it's all XML-based, which is one reason I never was able to do much with the old RSS feed. "You see a maze of twisty types, each alike..."

Meanwhile, I have plenty of other things to work on which are also useful to readers or myself, and which may help clarify what I want from an RSS feed. (As it happens, they did, as you can see above.) Sometimes, design just takes a while and you have to use the system 'in anger' for a few years until you can see what the logical next step is. When I killed the first RSS feed, I had no idea what would be a good replacement, which was not simply jamming a Wikipedia Recent Changes or a blog/journalism peg into the square hole of Gwern.net. Now I do.

I think a feed that was just new pages (including refactors, like split-out reviews?)

I don't consider refactoring like splitting out pages to be of interest to readers. It's just bookkeeping, shuffling around mostly-unmodified text. In theory, if a split-out review was worth reading because it's a substantive review, it is linked in the newsletter for that month, and it would hypothetically be included as an annotation with its hand-written abstract or summarized by the LLM pass after it was finished.

For additional tags that aren't easy or practical to fit in the title, like "essay" for all types of essays regardless of the prefix, you could have RSS/Atom categories on items.

Oh, obviously I can support multiple RSS feeds - once the paradigm has been sorted out and I've decided what I even want from RSS feeds in the first place. (The whole popup/annotation system is partially motivated by the goal of making updates more meaningful & granular and organizing references.)

Beyond the master/site-wide/firehose RSS feed which included all annotations/essays and the weekly miscellaneous batch entry (something like /site.rss vs /site-essays.rss vs /site-links.rss), you would want a RSS feed for each tag-directory which filtered to just items with that tag (eg. /doc/ai/index.rss), and you would want per-page RSS feeds following the existing convention of a file-extension suffix (so /foo is the essay, which you know because it has no period and the Gwern.net convention is that essays never have periods & files always have periods, and then /foo.rss would be the RSS feed for exclusively that essay), and you could easily split the firehose feed into essays-only/annotations-only.
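
The naming convention described above is simple enough to state as code. The sketch below uses hypothetical function names and assumes that a bare directory URL stands for its `index` page.

```python
import posixpath


def is_essay(path: str) -> bool:
    """Essays never have a period in their final component; files always do."""
    return "." not in posixpath.basename(path)


def feed_url(path: str) -> str:
    """Map an essay or tag-directory index path to its RSS feed path."""
    if path.endswith("/"):
        path += "index"  # assumption: a bare directory URL means its index page
    if not is_essay(path):
        raise ValueError(f"{path!r} is a file, not an essay or index page")
    # The feed lives at the same path plus a file-extension suffix.
    return path + ".rss"
```

So `feed_url("/foo")` gives `/foo.rss` and `feed_url("/doc/ai/index")` gives `/doc/ai/index.rss`, while a file path such as `/doc/foo.pdf` is rejected.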

The RSS/Atom 'category' field seems redundant with the tags that are already in the generated snippet, and I don't know of any RSS reader which makes those easy to access*, so I am doubtful they would be worth encoding & supporting. Seems like you'd have to be using some custom RSS filtering tool and then you are able to do arbitrary operations on the text anyway, and much of what you would want to do would be doable by the per-tag RSS feeds. (eg. if you want just reviews, you would use /review/index.rss or one of the tags like fiction/science-fiction).

* like a lot of 2000s-era Semantic Web-influenced tech, RSS/Atom comes with a lot of fancy semantic features which in practice no one bothers with for their own stuff and which are too dangerous or rare to rely on when provided by others

dbohdan · 2024-07-31T19:06:20Z

And the RSS libraries are a pain to work with because it's all XML-based, which is one reason I never was able to do much with the old RSS feed.

Finding the library xmltodict, which I use in my static site generator, solved most of the pain for me. (xmltodict is actually bidirectional.) I highly recommend this library if you can do the RSS part in Python, and I would recommend using Python in order to use it.

I don't consider refactoring like splitting out pages to be of interest to readers. It's just bookkeeping, shuffling around mostly-unmodified text.

Fair enough. I was thinking about tracking refactors as a second chance to notice longer reviews. One should probably address the problem of readers missing something in a more principled way, like an occasional "Did you miss?" item based on visit statistics. This is not an early concern.

Oh, obviously I can support multiple RSS feeds - once the paradigm has been sorted out and I've decided what I even want from RSS feeds in the first place.

[...] much of what you would want to do would be doable by the per-tag RSS feeds.

Great. With per-tag feeds, categories would be redundant or close enough.

pointlessone · 2024-07-31T19:40:01Z

I don't want to, though.

I appreciate the honesty and respect your choice.

once the URLs get out there and are linked, you can't recall them.

True, but I also believe you're treating it more seriously than it actually requires. Yes, cool URLs don't change and all that, but they actually do change all the time. If you've been on the internet for a bit then you know what to do when you're faced with a 404 and you're convinced the URL was correct at some point. Archives, caches, search, etc.: all tools that users can employ to help themselves. You don't have to help them. Exactly like you don't want to provide the simplest possible feed. It is a choice. You can choose to not care about everlasting stable URLs. Similar to how you've chosen to not provide a feed.

Anyway, I just wanted to let you know that there's a desire for a feed. I think I succeeded in that particular aspect. The choice of what to do with that knowledge is yours. And the actual implementation, if you decide to do it, is yours too. Obviously.

Thank you.

gwern · 2024-07-31T22:42:03Z

One should probably address the problem of readers missing something in a more principled way, like an occasional "Did you miss?" item based on visit statistics.

I think that that need is largely handled by the combination of a:visited styling so readers know at a glance if they have/have not visited a URL before, and the similar-links/backlinks+linkbibliographies, which list either similar things or direct bidirectional links. There's no way to track users better than their browser history tracks them (short of going fully dynamic and having user accounts and logins, I suppose), and I have no idea how I could feasibly surface recommendations better than the embedding nearest-neighbors & forward/backwards citations do now.

dbohdan · 2024-08-01T08:39:14Z

I had in mind tracking it in aggregate based on (admittedly difficult-to-determine) expected vs. actual page analytics, not individually per user. "Did you miss?" referred to the readership as a whole. (Gwern.net getting user accounts would be funny. One step closer to Xanadu.)

filipeabperes · 2024-10-12T03:06:10Z

@pointlessone You can use a feed generator to create one automatically from the changelog. It doesn't look great, but it may cover your basic needs. I created one here as an example that I think should work, but you can probably do better by spending more than the 30s I did on it and/or using a paid service.
