RSS/Atom feed to the site content #11
RSS is not supported because I have been unhappy with past attempts to create an RSS feed; the usual blog-centric approach doesn't work well with lots of incremental updates and I dropped the darcs/git→RSS bridge as it was nothing but noise for readers.* You can probably turn the changelog page into an RSS feed fairly easily, and there is a After reading namespace's comments on blogging and my own thinking about multi-level design, I have been mulling over an approach to RSS feeds that I think can work well. The problem with standard RSS feeds is that they are designed for one-off pages, like a newspaper or a blog: a URL gets created as a single standalone finished page, announced, and that's that. (The URL contents will inevitably change, but these changes are not considered important.) But this is a poor fit for Gwern.net because I 'finish' major pages only once a month or so, and it is clear that people are interested in a more granular view of my writing than that. (I am always embarrassed when I see someone tweeting out or including in a newsletter a tweet or comment of mine - because it indicates a failure of curation on my part if they have to link those instead of a page on my site.) The other extreme is that every single file modification is reported in the RSS feed, like the RSS feed for English Wikipedia's Recent Changes. This is an equally poor fit, because the nature of Gwern.net is that there is a lot of small change constantly going on site-wide, particularly related to formatting or reorganization, of less than zero interest to readers, and which is why I killed the original RSS feed: I couldn't even read it myself! More broadly, in general, there are just no good ways to announce the full spectrum of changes from comments or tweets (recall the original name: 'microblogging') to shorter blog posts to longform essays/books. 
No one has done this, and most generally do not recognize this as any kind of problem, but there is a big 'gap' between each speed of service, going from chat to microblog to blog/newsletter to essay to book. So writers online tend to pigeonhole themselves: someone will tweet a lot, or they will instead write a lot of blog posts, or they will periodically write a long effortpost. When they engage in multiple time-scales, usually, one 'wins' and the others are a 'waste' in the sense that they get abandoned: either the author stops using them, or the content there gets 'stranded'. Most writers simply accept this, and chop their writing down to fit - Scott Alexander writes solely in blogposts & blogpost comments (having largely given up on Twitter, Tumblr, and LW/Reddit), even though if you wanted to know his big take on, say, 'predictive processing' as the theory of everything in AI/psychiatry, the best he can do is shrug and point you to like 30 blog posts scattered across at least 3 sites going back a decade (LW, SSC, & ACX); and Matt Levine has to repeat himself, again and again, every time a specific recurring drama comes up. If you don't want that to happen (maybe you don't love to hear yourself write the same thing again and again as much as Mencius Moldbug does) and don't want to become solely a poaster or to have your secondary writings become so much water under the bridge, your only option is to do a lot of tedious work copying back and forth: I try to monthly review my tweets & comments and pull stuff onto Gwern.net, but it is a lot of work and I often think that I am leaving behind a lot of stuff, even when I do manage to do the review. It is clearly not very sustainable. And if I wanted to summarize it at multiple levels (like a level in between a list of tweets and a full essay, or an annual level), when would I ever actually do something like write or think or live? 
I think this is a major reason for the death of blogging and the increasing rarity of non-blog homepage sites like Gwern.net: the friction of multiple writing places means you are constantly being sucked into just one and tempted to abandon the others; and the rewards of social media tend to win out. You may be able to maintain dual-posting for a while, but at some point a shock happens, and when you return, you have a backlog and never catch up and settle for one. (This is expected from a queuing theory perspective if the friction is heavily overloading you: at some point things will break down catastrophically, and since you aren't writing all that many words per day, which would be easy to copy-paste in total, it must be everything else, the friction/overhead.) And once you are writing solely on Twitter or Facebook or whatever, it's hard to ever escape with your stuff to your own website, despite the perpetual amnesia / eternal now of the microblogs. (The social media sites don't even need to be hostile for this to happen. It's just the constant trivial inconvenience and toil.) So, how do I solve this broad problem of packaging up my writing in logical units spanning the continuum from single tweets to 'best essay of the year'? After several years of the annotation popup system and watching LLMs become effectively superhuman at summarization & resolving major limitations like short context windows / cost, I think I can propose a design: you provide all the levels, without the toil, by starting with a link-centric approach where every comment or tweet or essay or reference is a URL with metadata, and using LLM recursive summarization to fill in the gaps. These different levels can then be exposed to readers as separate RSS feeds. 
So the workflow would look like this: every comment is copied by the backend, with its data & metadata like author/date and a LLM auto-title/summary; sets of related comments like a tweet thread get grouped and summarized together as a whole; updates to blog posts or essays likewise get grouped and summarized; finally, whole new posts/essays (with a handwritten summary, or again LLM-written). Once this has been set up, as the author, I go around tweeting or Redditing, possibly replying to my own comments repeatedly as the muse demands, and my tweets all get logged and saved automatically; a reader interested in the blow-by-blow can subscribe to the most atomic RSS feed, and read each one; a reader with less time to spare reads the grouped comment summary RSS feed; or they can read the whole essay that I eventually sit down and write with proper references (starting from the grouped-comment summary as a quick-and-dirty outline to help me get started). But there is almost no friction which stops my comments from percolating up through my site, from individual atomic comments to short summaries to finished writings, and readers can pick what level they want to read at, rather than reading a one-size-fits-all-but-suits-none 'most recent' RSS feed. (cf. Kicks Condor's "Fraidycat" with fixed allocation to feeds.) For Gwern.net specifically, I would start with the annotation system as the 'atomic' level and try to build up from there. Now that I have finally overhauled the backend to support additional metadata on annotations like the critical 'date modified' field (so necessary for With meaningful 'last-modified' vs 'date-created' metadata on all pages+annotations, I can now populate a sane RSS feed with both newly-created & recently-modified items, and these items can be both my essays & any new links I bookmark or annotations created. So the idea is that reading the RSS is like a link feed with essay updates once in a while. 
It might go something like this: "Golden Gate Bridge WP article / SF city WP article / annotation (suicide study) / annotation (poem) / weekly batch list of miscellaneous URLs & image uploads / 'Movie Reviews: +review of The Bridge 2006' / annotation (Arxiv) / annotation (Arxiv) / annotation (Arxiv) / annotation (Arxiv) / 'Research Ideas: free play for RL exploration' / annotation (link) / annotation (link) / annotation (link) / weekly batch list of miscellaneous links / annotation (link) / annotation (link) / annotation (link) / ..." 'Full' annotations & essays get a separate entry, while the shorter 'partial' annotations (which have only a little metadata like a title or a tag) get rolled up into a single large weekly item which can be skipped or skimmed. Because the annotations are just static HTML snippets already, they can be easily put into the RSS feed itself, to allow a preview (even if that obviously wouldn't support all of the on-site features; they will link to the essay or the first tag-directory entry, so the RSS entry for The main problem with this is that if an essay like a review is included each time its last-modified changes, the entry is not too useful: the annotation, which contains the page abstract, will likely still be the same and not mention whatever is changed. So you might see an essay pop up a dozen times in this RSS feed without knowing what changed. I could write a manual diff, but that is exactly the sort of "toil" I am trying to avoid on Gwern.net because it is unsustainable in the long run & such overhead unconsciously discourages a writer. (You feel like you are being punished---you wrote something, and your reward for a job well done is... being required to write even more? Not fun.) 
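The split described above, where 'full' annotations and essays each get their own entry while 'partial' annotations are rolled up into one weekly item, could be sketched roughly as follows. This is only an illustration under assumptions: the `Item` record, its `full` flag, the `build_entries` name, and the batch title wording are all hypothetical and not taken from the Gwern.net backend.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class Item:
    # Hypothetical record for one annotation or essay.
    title: str
    url: str
    modified: date
    full: bool  # True for essays and 'full' annotations, False for 'partial' ones


def week_start(d: date) -> date:
    """Return the Monday of the ISO week containing d."""
    iso_year, iso_week, _ = d.isocalendar()
    return date.fromisocalendar(iso_year, iso_week, 1)


def build_entries(items):
    """Give each full item its own entry; roll partial items into one entry per week."""
    entries = []
    batches = defaultdict(list)
    for item in items:
        if item.full:
            entries.append({"title": item.title, "date": item.modified, "links": [item.url]})
        else:
            batches[week_start(item.modified)].append(item)
    for monday, batch in batches.items():
        entries.append({
            "title": f"Miscellaneous links, week of {monday.isoformat()} ({len(batch)} items)",
            "date": max(i.modified for i in batch),
            "links": [i.url for i in batch],
        })
    # Newest first, as a feed reader would normally expect.
    return sorted(entries, key=lambda e: e["date"], reverse=True)
```

A reader of the resulting feed can skip or skim the single weekly batch entry while still seeing every full annotation and essay as a separate item.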
It is also difficult to do any kind of labeling of importance of a patch upfront: a good essay update might be composed of dozens of patches, each trivial on their own; indeed, I might not realize where something is going until after the writing is all done, as I explore a topic or people respond or I dig up new sources or have a sudden realization - "completeness" is something that can be known only in retrospect, reviewing changes. But with the date-range and the git history and progress in LLM context windows, this can now be automated. The LLM will know to not bother mentioning the massive churn on Gwern.net like spellcheck or linkrot fixing, and will summarize it on a more semantic level for readers. And you can generalize this idea further, and start to meet Namespace's challenge for blogging software that can go from tweets to posts: right now, there is no good way to go from writing in tweet-sized chunks to writing longform essays. So many people who could have written blog posts or even books wind up trapped in long tweet-threads (or given the hostility of Twitter these days to any kind of serious intellectual work, like hiding tweets from non-logged-in users & penalizing tweets with external links, not writing anything at all), which never leave Twitter and are impossible to find or browse sanely. Then a new reader can easily catch up on the backlog: simply read the essays, and then drop down to the level of granularity they have the time & interest for. (A big fan will of course read the most granular tweet-level daily feed, but others will prefer a higher level like weekly summaries, or even monthly essay-sized outputs.) And then in the ultimate evolution of this, the writer just writes atomic bits without any editing, and the LLM takes care of adding it to an ever-enlarging corpus and expanding it as appropriate, and then summarizing it for the writer & readers to review/read, and updating it based on feedback. (See also "Nenex".) 
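As a rough sketch of the "date-range plus git history" step mentioned above: one could collect the commit subjects touching a page within a date range and hand them to a summarizer. Everything named here is an assumption made for illustration (the function names, the prompt wording, and the idea of passing only commit subjects rather than full diffs); the LLM call itself is deliberately left out.

```python
import subprocess
from datetime import date


def git_log_command(path, since, until):
    """Build a `git log` invocation listing commit subjects for one file in a date range."""
    return [
        "git", "log",
        f"--since={since.isoformat()}",
        f"--until={until.isoformat()}",
        "--pretty=format:%s",
        "--", path,
    ]


def commit_subjects(repo_dir, path, since, until):
    """Run git in repo_dir and return the non-empty commit subject lines."""
    result = subprocess.run(
        git_log_command(path, since, until),
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line.strip()]


def build_summary_prompt(page_title, subjects):
    """Assemble a (hypothetical) prompt asking for a reader-level summary of the changes."""
    patch_list = "\n".join(f"- {s}" for s in subjects)
    return (
        f"Summarize for readers what changed in '{page_title}'.\n"
        "Ignore churn such as spellcheck, lint, or linkrot fixes.\n"
        f"Patches:\n{patch_list}"
    )
```

The prompt string would then go to whatever summarization model is in use; restricting the input to a date range keeps each summary aligned with one feed period.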
(One might wonder how to present this outside the RSS feed context, but that's straightforward, especially when you have transclusions + collapses. Like the other things, you just generate statically, at a compile-time, each 'level', and then you can present them to the reader as a series of collapses+transcludes. You can present them as simply a flat list or as a recursive set of collapses from small to large, or however you wish. Just rearrange a few links/div-wrappers as you please.) * For perspective, here are the last 40 git patches to Gwern.net, to give you an idea of how miserably useless a reading experience the patches per se would be for an ordinary reader who just wants a list of 'interesting' updates or writings: "+lns; begin catching up after trip" · "record all minor pending edits (1722141005)" · "lint; rescrape for date-range subscripts" · "lint" · "+commafy number pass" · "lint; +first comma-fying pass" · "lint" · "lint" · "lorem: inline: date subscripts: continue finetuning examples" · "+lns" · "second EN DASH pass" · "big EN DASH searh-and-replace while working on the newly-enabled date-range subscripts" · "+lns, catch up on some comments" · "Crumb: +apposite comic that's a bit of a meme for Crumb, and some details from an interview" · "lint" · "+lns; lint" · "embryo selection: split out 'history of iterated embryo selection'/IES history to separate page because long enough & of independent interest" · "split out Twitter UX essay for easier linking in my Twitter DMs" · "anime reviews: factor out development hell" · "initialize a miscitation/miscite tag as a subset of publication-bias (statistics/bias/publication/miscitation)" · "RTX: +thumbnail in Midjourneyv6 based on ChatGPT-4o suggestion to use a 'nano' skull and crossbones" · "note: split out 'Highly Potent Drugs As Psychological Warfare Weapons' essay to /rtx due to length and to make more linkable" · "initialize disappearing polymorphs tag (science/chemistry/disappearing-polymorph) apropos 
of sudden burst of Twitter interest" · "initialize Stigler's diet problem (Dantzig) statistics/decision/stigler-diet tag" · "+lns" · "fully re-paragraphized all outstanding abstracts" · "lint" · "record all minor pending edits (1720931406); rescrape abstracts to update thumbnails for split-out pages" · "GTX: added few-shot examples to paragraphizer.py & loosened length constraint, so rerun paragraphizer on the holdouts" · "+lns" · "split out The Tale of Princess Kaguya anime movie review to /review/princess-kaguya; re-enable Midjourney to test personalization & generate a good Kaguya danse macabre thumbnail" · "lint" · "+lns" · "Timecrimes: finally thought up a thumbnail" · "lint" · "initialize truesight tag doc/statistics/stylometry/truesight/ for LLM-powered stylometrics/deanonymization" · "initialize mode collapse preference learning tag for AI slop/ChatGPTese infections in ChatGPT, DALL-E 3, Midjourney, etc (reinforcement-learning/preference-learning/mode-collapse)" · "lorem block: collapses: mismatch cases: rm unnecessary case that is ~impossible to write" · "lint" · "+lns; further new sync lint work" |
OK, I appreciate the thoroughness, and your thoughts on this are very interesting, but I just want to know when new stuff is up. So in the interest of getting it rolling, how about starting small? Say, first add a feed for essays. Just the list of essays. Next you can add major updates. It is my understanding that the date of the last major update is added manually. When you do it, you can try formulating in a sentence or two what compelled you to update that date. Manually. If it indeed turns out to be that bad, then you can try going the LLM route or even just dropping it. I mean, if you don't think it's worth exposing the change summary on the website, then it's probably not worth doing in the feed. Just add the update date and let RSS readers figure it out. I honestly don't want to add a lot of work for you. I just want to know about new writing. In a sense, I want to eliminate the toil of checking your site from time to time to find out if there's anything new.
I think a feed that was just new pages (including refactors, like split-out reviews?) would be useful. Something I'd want in a unified update feed for gwern.net is a form or multiple forms of tagging for different types of content. A basic one would be a way to distinguish between annotations/links and essays. Title prefixes like "Movie Reviews:" are a good idea. For additional tags that aren't easy or practical to fit in the title, like "essay" for all types of essays regardless of the prefix, you could have RSS/Atom categories on items.
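For illustration, title prefixes combined with per-item categories might look like the following in RSS 2.0, built here with Python's standard `xml.etree.ElementTree`. This is a minimal sketch; the `make_item` and `make_feed` helpers and the example.com URLs are hypothetical.

```python
import xml.etree.ElementTree as ET


def make_item(title, link, categories, prefix=None):
    """Build one RSS <item> with an optional title prefix and <category> elements."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = f"{prefix}: {title}" if prefix else title
    ET.SubElement(item, "link").text = link
    for name in categories:
        ET.SubElement(item, "category").text = name
    return item


def make_feed(channel_title, channel_link, items):
    """Wrap items in a minimal RSS 2.0 channel and return the XML as a string."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = channel_title
    channel.extend(items)
    return ET.tostring(rss, encoding="unicode")


feed_xml = make_feed(
    "Site updates",
    "https://example.com/",
    [
        make_item("+review of The Bridge", "https://example.com/review/movie",
                  ["essay", "review"], prefix="Movie Reviews"),
        make_item("Some bookmarked paper", "https://example.com/doc/paper",
                  ["annotation"]),
    ],
)
```

A reader or filtering tool that understands `<category>` could then separate essays from annotations without parsing titles.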
I don't want to, though. It isn't fun, I won't learn anything from it, and it doesn't excite me to half-ass an RSS feed which I know is an unsatisfactory solution to the problem. (And it would be a liability as well: an RSS feed is a promise to the reader, and has long-term consequences. I am still fixing spurious 404s from the old Gitit-style RSS feed I deleted a decade ago - once the URLs get out there and are linked, you can't recall them.) And the RSS libraries are a pain to work with because it's all XML-based, which is one reason I never was able to do much with the old RSS feed. "You see a maze of twisty types, each alike..." While I have plenty of other things to work on which are also useful to readers or myself, and which may help clarify what I want from an RSS feed. (As it happens, they did, as you can see above.) Sometimes, design just takes a while and you have to use the system 'in anger' for a few years until you can see what the logical next step is. When I killed the first RSS feed, I had no idea what would be a good replacement, which was not simply jamming a Wikipedia Recent Changes or a blog/journalism peg into the square hole of Gwern.net. Now I do.
I don't consider refactoring like splitting out pages to be of interest to readers. It's just bookkeeping, shuffling around mostly-unmodified text. In theory, if a split-out review was worth reading because it's a substantive review, it is linked in the newsletter for that month, and it would hypothetically be included as an annotation with its hand-written abstract or summarized by the LLM pass after it was finished.
Oh, obviously I can support multiple RSS feeds - once the paradigm has been sorted out and I've decided what I even want from RSS feeds in the first place. (The whole popup/annotation system is partially motivated by the goal of making updates more meaningful & granular and organizing references.) Beyond the master/site-wide/firehose RSS feed which included all annotations/essays and the weekly miscellaneous batch entry (something like The RSS/Atom 'category' field seems redundant with the tags that are already in the generated snippet, and I don't know of any RSS reader which makes those easy to access*, so I am doubtful they would be worth encoding & supporting. Seems like you'd have to be using some custom RSS filtering tool and then you are able to do arbitrary operations on the text anyway, and much of what you would want to do would be doable by the per-tag RSS feeds. (eg. if you want just reviews, you would use * like a lot of 2000s-era Semantic Web-influenced tech, RSS/Atom comes with a lot of fancy semantic features which in practice no one bothers with for their own stuff and which are too dangerous or rare to rely on when provided by others |
It solved most of the pain for me when I found the library xmltodict to use in my static site generator. (xmltodict is actually bidirectional.) I highly recommend this library if you can do the RSS part in Python, and I would recommend Python just so you can use it.
Fair enough. I was thinking about tracking refactors as a second chance to notice longer reviews. One should probably address the problem of readers missing something in a more principled way, like an occasional "Did you miss?" item based on visit statistics. This is not an early concern.
Great. With per-tag feeds, categories would be redundant or close enough.
I appreciate the honesty and respect your choice.
True, but also, I believe, you're treating it more seriously than it actually requires. Yes, cool URLs don't change and all that, but also they actually do change all the time. If you've been on the internet for a bit then you know what to do when you're faced with a 404 and you're convinced the URL was correct at some point. Archives, caches, search, etc. All tools that users can employ to help themselves. You don't have to help them. Exactly like you don't want to provide the simplest possible feed. It is a choice. You can choose to not care about everlasting stable URLs. Similar to how you've chosen to not provide a feed. Anyway, I just wanted to let you know that there's a desire for a feed. I think I succeeded in that particular aspect. The choice of what to do with that knowledge is yours. And the actual implementation, if you decide to do it, is yours, too. Obviously. Thank you.
I think that that need is largely handled by the combination of
I had in mind tracking it in aggregate based on (admittedly difficult-to-determine) expected vs. actual page analytics, not individually per user. "Did you miss?" referred to the readership as a whole. (Gwern.net getting user accounts would be funny. One step closer to Xanadu.)
@pointlessone You can use a feed generator to create one automatically from the changelog. It doesn't look great, but maybe it satisfies your basic needs. I created one here as an example that I think should work, but you can probably do better by spending more than the 30s I did on it and/or using a paid service.
Is there a feed to the content that is published on the site itself?
There's a feed for the Substack that hasn't been updated in a while, and the Substack itself seems to be gone from the web, but I can't find any feeds for the site content.