The tech giants make enough money that they could keep on growing forever, from my understanding.
But the fediverse? Sure the main instances that get enough funding are going to be okay, but what about the single-user instances 10 years from now on when there’s a lot more content to download? Won’t they go bankrupt just by trying to annex the big instances?
And I have the impression that the lemmy giants are going to change over time: does that mean that 50 years from now on, the posts I’m posting here today might get lost in time because the instances that annex it will have shut down by then?
I probably misunderstand how the fediverse works, but my worry is that the small instances won’t be able to hold an ever-growing amount of data forever.
I spoke in absolutes for the sake of readability, but I’m as in-the-dark as can be.
No. After a sufficient amount of time has passed, we would run out of usable matter and energy in the universe. This theorized end state, heat death, puts a finite cap on the size of the Fediverse.
Constrained to Earth, it’d probably be fine. Though I do see it splintering eventually, with sub-communities existing independently from the main organism.
But would it work with spherical servers in vacuum?
Time to invent the Dyson Server!
We already got one: the dyson.com server. /jk
Is this amount of time clearly stated or defined? Indefinite does not mean infinite.
Mostly serious answer: the current implementation is not going to scale effectively with growth. The software implementation is still rough around the edges, and the ActivityPub protocol probably needs more knobs to handle bulk data synchronization. Within the service, moderation is a serious challenge with many unanswered questions.
Likewise, the back end software implementation is monolithic, meaning it’s one software stack that does everything from sign-in to subscriptions to synchronization and scheduling. Housekeeping and garbage collection probably aren’t that tight, either. This is mostly speculation, based on watching things over the last couple of weeks’ growth.
I believe the data store is based on the Postgres RDBMS, which, while robust and scalable, is fussy and needs tuning when turning over large amounts of highly unique data.
None of this is an indictment on the devs! Rather the opposite, because the software IS chugging along while experiencing tremendous growth.
I expect over time the back end will devolve into microservices that communicate over a highly scalable or stream-based messaging bus. Larger instances could probably also benefit from static caching and CDN techniques to keep pages loading quickly even while the back end thrashes.
The structure of the ecosystem needs to strike a balance between fewer large instances and many, many small instances. In the first scenario, the scaling limit is in the monolithic stack, which introduces I/O bottlenecks and serialization delays (even if massively threaded). In the latter scenario, message state and synchronous distribution become challenging because a full mesh of federations could scale faster than network state tables have room to support. Some middle tier might be needed, and I have no idea what that might even look like.
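A rough way to see why the full-mesh case gets expensive: if every instance federated directly with every other one, the number of pairwise relationships would grow with the square of the instance count. This is a minimal sketch in Python; the function name and the instance counts are made up for illustration and say nothing about how Lemmy actually tracks federation state.

```python
def full_mesh_links(instances: int) -> int:
    """Number of undirected pairwise links if every instance federates with every other."""
    if instances < 0:
        raise ValueError("instances must be non-negative")
    return instances * (instances - 1) // 2


# Illustrative instance counts only; not measurements of the real Fediverse.
for n in (10, 100, 1_000, 10_000):
    print(f"{n:>6} instances -> {full_mesh_links(n):>12,} links")
```

Going from 100 to 10,000 instances multiplies the instance count by 100 but the link count by roughly 10,000, which is the kind of growth a middle tier would be meant to avoid.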
So to answer your question, can it scale indefinitely? Probably not, because we hit scaling limits pretty quickly on a number of dimensions. Nevertheless, smart people are starting to hang out here, and I expect they will take an interest in how it all works. Improvement is inevitable, and I think the early roadblocks will be overcome easily enough.
There’s nothing wrong with a monolith. Microservices are not inherently more scalable. Their advantage is around scaling teams. If anything, a monolith can be more performant, as in-process calls are much faster than network calls.
There can be better efficiencies from disaggregating the full stack into microservices and making IPC calls among scalable workers, versus strictly service-per-server models which, yes, incur scaling issues from network iowait. Modern network operating systems do this, which allows more heavily loaded processes more access to resources while less heavily loaded processes are deferred.
I’m not sure what you mean by a “network operating system”, but monoliths are inherently just as scalable as services.
Imagine you have a service architecture, and you are running 2 of service A, 4 of service B, and 8 of service C.
Alternatively, you could be running a monolith on 14 nodes. Most of the work those 14 nodes will be doing is work that would have been covered by service C; it’s just spread out in a different way.
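To put rough numbers on that comparison, here is a minimal sketch in Python. It assumes every node has the same capacity and that work splits evenly across nodes, which is a simplification for illustration; the service names and counts are the ones from the example above.

```python
# Node counts from the example above: service name -> number of nodes.
service_nodes = {"A": 2, "B": 4, "C": 8}

# A monolith handling the same total load needs the same total node count,
# assuming identical nodes (an assumption made for this sketch).
total_nodes = sum(service_nodes.values())

# Fraction of each monolith node's work that corresponds to each service.
work_share = {name: count / total_nodes for name, count in service_nodes.items()}

print(f"monolith nodes needed: {total_nodes}")
for name, share in work_share.items():
    print(f"service {name}: {share:.0%} of the work on each monolith node")
```

Under those assumptions, each of the 14 monolith nodes spends a bit over half its time on what would have been service C's work.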
I’m talking about Cisco IOS-XR, Juniper JunOS, Arista EOS and others.
Those operating systems are disaggregated, meaning different features can be restarted, replicated, scaled out horizontally, or upgraded without having to disturb the other components in runtime.
Maybe we’re getting at the same point from different ends. I’m not a traditional software engineer, but I have had academic and professional training on these topics.
The world produces 15Mt of beans every year. The average shitpost with beans has 700g of beans in it. This means Lemmy can scale to around 21 billion shitposts/year. We have some margin.
#shitpost
This math checks out. I ran it through the bean calculator using OpenBeanAI. 32.33% of the simulations show these numbers.
Repeating of course.
Of course
I probably misunderstand how the fediverse works, but my worry is that the small instances won’t be able to hold an ever-growing amount of data forever.
Let’s pretend you run a small Lemmy instance (~100 users).
If you federate with a large instance, you (i.e. your instance) will only receive new posts from communities that your users subscribe to, or users that your users follow [1]. These are deduplicated, in the sense that if all 100 of your users subscribe to the same community, you only need to download and store one copy of that community’s posts in your database.
[1] AFAICT. The current implementation of Lemmy seems to handle federation using the activitypub_federation crate. I skimmed the docs of that crate, but they aren’t 100% clear about this.
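The deduplication point can be sketched like this in Python. It is a toy model under stated assumptions, not Lemmy's actual data model or the activitypub_federation crate's API: the `LocalInstance` class, its methods, and the post fields are all hypothetical.

```python
class LocalInstance:
    """Toy model of a small instance (hypothetical; not Lemmy's real schema)."""

    def __init__(self):
        self.subscribers = {}  # community name -> set of local users subscribed
        self.posts = {}        # post id -> post, stored once per instance

    def subscribe(self, user, community):
        self.subscribers.setdefault(community, set()).add(user)

    def receive(self, post):
        """Store an incoming post at most once; return True if newly stored."""
        if not self.subscribers.get(post["community"]):
            return False  # no local user subscribes, so nothing is kept
        if post["id"] in self.posts:
            return False  # already have this post, whoever subscribed
        self.posts[post["id"]] = post
        return True


instance = LocalInstance()
for i in range(100):
    instance.subscribe(f"user{i}", "technology@lemmy.world")

post = {"id": "https://lemmy.world/post/1", "community": "technology@lemmy.world"}
instance.receive(post)
instance.receive(post)      # a repeat delivery changes nothing
print(len(instance.posts))  # 1
```

Storage here depends on how many distinct communities the instance's users follow, not on how many users follow each one.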
the posts I’m posting here today might get lost in time because the instances that annex it will have shut down by then?
You have the same problem with any data you put online anywhere: The people currently keeping your stuff online might delete it anytime they decide it’s not worth the trouble to keep it online.
If it’s important to you that certain information stays online, keep a copy on a disk in your house; check back periodically to be sure it’s still online, and if it’s not, you can always use the copy in your house to put it online again somewhere else. If it’s very important to you, keep multiple copies on multiple disks hosted by multiple companies on different continents.
50 years from now on
Predicting what will happen in tech in 50 years is a pretty daunting challenge.
50 years ago, in 1973, all the computers on the ARPAnet (the predecessor of the Internet) could be easily listed on a single piece of paper. The home computer was still years from birth. The Intel 8080, Motorola 6800, MOS Technology 6502 and Zilog Z80, which would play key roles in early home computers and gaming consoles, were still one to three years away from the market.
All the answers I got were very useful and informative, but this one is definitely the one that catered the most to my worries.
The Fediverse needs to encourage different instances. It’s the only way it can work. It has the technical framework to do it, and for it to be transparent to the end user, but I feel like it’s not there yet.
For example, I think users should be strongly encouraged to choose regional instances instead of lemmy.world (I know, I know, ironic coming from me). It should be the default, and the user should have to go out of their way to select a different instance. It should also be concisely explained that your instance doesn’t matter and that you can see any other federated instance. Yes, this is not always true, but it doesn’t matter to someone just joining. Let them get here first and then they’ll naturally learn about the intricacies. Don’t scare them away at the gates.
I guess my answer is, who cares? Don’t treat a social media account as some immortal time capsule of your life. Keep a photo album, write some diary entries, but don’t rely on any form of social media to be the historical record of your existence. If it’s important, keep it somewhere you can ensure its preservation.
I’m pretty sure the world will continue long after we’ve forgotten beans and not pooping for X days.
I needed to be reminded of this, thanks.
Still, Reddit is probably the biggest and most accessible source of information in the world, written out of passion by people, experts, professors, neckbeards… trolls… uni students, researchers, and I wish Lemmy could also become the archive that Reddit is, but if information has a high likelihood to get lost with time, why bother? It should then really only be treated as a very temporary social media which is… okay, I guess.
Everything is temporary. Nothing is permanent. Embrace it and live in the now.
It’s weird to think about, but data has a shelf life. Software needs to grow and be pruned regularly, or it dies.
Social media is both - the data dump is useless without an ecosystem of tools around it, and if the data itself stops interacting with the zeitgeist of the parent society, it basically becomes an old journal. It’s interesting to a very specific group of people, and literally no one else wants to see it (aside from a few gems picked out and cleaned up for public consumption)
At any point we could go back to Reddit’s explosion after the Digg migration. We could pull up posts that mirror exactly what’s happening now. It’d be interesting for sure, and there are days’ worth of then-and-now posts that people could be making… but instead we just have people telling us about their memories of that process.
Why? Because that data is old and stale. You’d have to hunt it down with tools not intended for it, filter out the best of it, fix broken links, and probably put it through a slur filter.
Reddit changes all the time too. Posts are added, edited, deleted. If they don’t find a way to monetize soon, they also likely won’t be able to pay for their storage indefinitely.
Reddit is just a website (and really just a forum with a special interface) that has been around for a decade+
The knowledge and accumulation came from users and time. Same as anywhere else
I think people need to be reminded of two big things when it comes to Lemmy:
1. It is impermanent. Not intentionally, I’m sure most instances will try to keep all the posts for as long as possible. But we’re just hosting this stuff on independent servers (also known as “somebody else’s computer”) and we can’t rely on them to stay online forever.
2. Lemmy is NOT PRIVATE. You cannot delete your posts, and this is by design. You can edit them, but there’s an edit history, and even if there wasn’t, it would be impossible to ensure that the old versions of your posts aren’t stored on some random, rarely used instance. There is no big man in charge like Mark Zuckerberg that you can sue to delete your data. If you want to use Lemmy privately, DON’T POST YOUR PERSONAL INFO. Don’t post things that can be used to identify you. This is a public forum. Treat it like one. If you don’t like that, go somewhere else.
Sorry, #2 is kinda off topic, but I see a lot of confusion about what Lemmy is and isn’t.
Thanks, I’m also definitely confused about what Lemmy is and isn’t. This clears up a lot.
Each instance only needs to hold the data from communities its users are subscribed to. And images live on their host instances anyway. No instance needs to hold the entirety of Lemmy. :)
Smaller instances don’t grab everything from every other server; they only grab data from other servers when their users are subscribed to specific communities. I also suspect they don’t grab all historical data automatically (though I don’t know how much they do grab by default).
Right now there’s no migration tool for when instances shut down, but it should be technically possible; someone just needs to implement it.
It doesn’t get everything. At least that’s not what kbin is doing, and I expect Lemmy is the same. How it works for kbin is: once you subscribe, you will start getting info about ANY change. A like, a comment or anything like that.
So, someone likes a comment, anywhere on any instance. We’ll get sent that like. But we don’t have the comment. So, we fetch the comment. But wait, this one was a comment on a previous comment. So we’ll fetch that too. All the way to the post. Also, any users involved (the liker, and commenters back to the original poster) will be fetched. The hierarchy to the comment that was liked is then built, and then the like itself is applied.
That’s why you will start to see a lot of old posts, but not all of them. It’s just going to slowly build up over time as people interact and of course you’ll get anything new.
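The back-fill behaviour described above can be sketched roughly like this in Python. This is only an illustration of the idea, not kbin's actual code: the `REMOTE` table stands in for HTTP fetches from the origin instance, and every name here (`fetch_remote`, `ensure_chain`, `receive_like`) is hypothetical.

```python
# Hypothetical remote objects keyed by id; "parent" is what each one replies to.
REMOTE = {
    "post/1": {"author": "alice", "parent": None},
    "comment/2": {"author": "bob", "parent": "post/1"},
    "comment/3": {"author": "carol", "parent": "comment/2"},
}

local_objects = {}   # what our instance has stored so far
local_users = set()  # users our instance knows about
likes = {}           # object id -> set of users who liked it


def fetch_remote(object_id):
    """Stand-in for fetching an object from its origin instance."""
    return REMOTE[object_id]


def ensure_chain(object_id):
    """Fetch object_id and any missing ancestors up to the post; return ids fetched."""
    fetched = []
    current = object_id
    while current is not None and current not in local_objects:
        obj = fetch_remote(current)
        local_objects[current] = obj
        local_users.add(obj["author"])
        fetched.append(current)
        current = obj["parent"]
    return fetched


def receive_like(liker, object_id):
    """Back-fill the comment chain, then apply the like."""
    fetched = ensure_chain(object_id)
    local_users.add(liker)
    likes.setdefault(object_id, set()).add(liker)
    return fetched


print(receive_like("dave", "comment/3"))  # ['comment/3', 'comment/2', 'post/1']
print(receive_like("erin", "comment/2"))  # [] (chain is already stored locally)
```

The first like pulls in the whole chain up to the post; later activity on the same thread costs nothing extra, which matches old posts showing up gradually as people interact with them.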
Even with this, I’m getting a LOT of content delivered right now.
EDIT: Not sure why I posted without finishing typing before.
I see, that clears up a lot, thank you! I just hope that Lemmy is, as you suspect, doing the same as kbin.
It doesn’t even work right: lemmy.ca and beehaw.org are not synchronized even though they federate with each other, and it’s certainly the same with other instances.
Yes, I’ve noticed that on Kbin.social, there are fewer visible comments than on Lemmy.world. It looks like quite a few comments aren’t being federated properly.
I’ve been wondering about that too. Specifically, how long can old posts persist, and how long are instances compelled to host older content, especially from other servers?
Probably depends on who specifically is running the instance and the community on that instance
I imagine the devs aren’t worried about this yet.
Long-term, I imagine that archiving and culling old data would maybe make sense.
Maybe one day there will be a Lemmy-only equivalent of archive.org.
Mastodon already has something to purge posts older than X days, and that’s with relatively small snippets of text, generally. It’ll be essential for any kind of long-term running on small nodes to automate removal of posts to avoid blowing up the host’s disk space. There’s also the possibility of ‘purge after X period of time with no activity’, which would make sense for something like this, where things can turn into long discussions more often than they do with a toot/tweet.
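The two purge rules mentioned here (older than X days, or no activity for X days) could look something like this in Python. It is a sketch under assumptions, not Mastodon's or Lemmy's actual implementation: the `posts_to_purge` function, the post fields, and the thresholds are all invented for illustration.

```python
from datetime import datetime, timedelta, timezone


def posts_to_purge(posts, now, max_age=None, max_idle=None):
    """Return ids of posts older than max_age or inactive for longer than max_idle."""
    doomed = []
    for post in posts:
        too_old = max_age is not None and now - post["created"] > max_age
        too_idle = max_idle is not None and now - post["last_activity"] > max_idle
        if too_old or too_idle:
            doomed.append(post["id"])
    return doomed


now = datetime(2023, 7, 1, tzinfo=timezone.utc)
posts = [
    {"id": "old-but-active", "created": now - timedelta(days=400),
     "last_activity": now - timedelta(days=2)},
    {"id": "old-and-quiet", "created": now - timedelta(days=400),
     "last_activity": now - timedelta(days=390)},
    {"id": "fresh", "created": now - timedelta(days=3),
     "last_activity": now - timedelta(days=3)},
]

# Age-based rule: drops both old posts, even the one still being discussed.
print(posts_to_purge(posts, now, max_age=timedelta(days=365)))
# Inactivity-based rule: keeps the old thread that is still active.
print(posts_to_purge(posts, now, max_idle=timedelta(days=180)))
```

The inactivity rule is the one that spares long-running discussions, at the cost of tracking a last-activity timestamp per post.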
You can trigger https://web.archive.org/ to “Save Page Now”.
It says below the input URL box “Capture a web page as it appears now for use as a trusted citation in the future.”
archive.org also has an extension that automatically scrapes webpages that haven’t been downloaded in 90/60/30/7 days or 24 hours.
Why would an instance annex another instance? I can’t imagine the pain in trying to merge two databases like that, then handling the changing of the address.
If anything, I expect a single larger instance, or a collection of them, could choose to defederate from the wider federation.
My bad. By annexing, I meant downloading posts from other instances.
It is basically like when a person goes to a webpage. If anything, it is more likely for a group of smaller instances to take down a big instance hosting tons of communities.