Private container to cache resources shared with me #330

Open
srosset81 opened this issue Apr 7, 2025 · 4 comments
Comments

@srosset81
Contributor

srosset81 commented Apr 7, 2025

Context and use case

If tomorrow we used SAI for the Welcome to my place app, the performance would most likely be very bad.

In my own account, about 50 people shared 300 events with me. And I need to list them by date, filtering out events that are in the past.

I have also created about 30 events. It would be easy to sort these with a SPARQL query or with our upcoming container SHACL filtering. But I cannot do this with the 300 events shared with me because I only have their URIs, and I can't do any filtering before I GET them.

So if I switched to SAI, I would need to load 50 Delegated Data Grants (DDG), and then 300 Data Instances (events) linked from these DDG. After I've loaded all these 350 items (+ my own 30 events), I would need to order and filter them, to display (in the end) 10 events per page.
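As a rough illustration of that last step, here is a minimal sketch of the client-side work once every event has been fetched. The `upcoming_page` helper, the `start` and `uri` keys, and the `pod.example` URIs are hypothetical names chosen for this example, not part of SAI or ActivityPods.

```python
from datetime import date, timedelta


def upcoming_page(events, today, page=1, page_size=10):
    """Drop past events, sort the rest by start date, and return one page."""
    upcoming = [e for e in events if e["start"] >= today]
    upcoming.sort(key=lambda e: e["start"])
    first = (page - 1) * page_size
    return upcoming[first:first + page_size]


# Hypothetical stand-ins for the 330 fetched events (30 own + 300 shared),
# deliberately built in reverse date order so the sort has work to do.
events = [
    {"uri": f"https://pod.example/events/{i}",
     "start": date(2025, 1, 1) + timedelta(days=i)}
    for i in reversed(range(330))
]

page = upcoming_page(events, today=date(2025, 4, 7))
```

The filtering and sorting themselves are cheap; the point is that none of it can start until all 330 resources have been downloaded.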

Since browsers do not fire more than 15-20 requests at the same time, it would probably take several minutes just to display these 10 events -- something that is not acceptable to any modern user.

I'll point out that this is a problem that is general to Solid. The lack of real applications makes it less visible.

Proposed solution

In ActivityPods, we cache (in the user's storage) the resources that are shared with them. Right now, they are cached in the same LDP container as the resources that the user created themselves, but it could be a different container. Thanks to this, it's easy to filter and order resources (with a SPARQL query and, tomorrow, with SHACL shapes).

The cache is kept up-to-date thanks to ActivityPub: whenever a resource is updated, the resource owner sends an Update activity to all users with whom the resource has been shared. This mechanism has been in place since 2022 and it works very well.
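To make the mechanism concrete, here is a minimal sketch of how a receiver might apply such an activity to its cache. It assumes the activity embeds the full updated object, and `apply_activity` and the dict-based cache are hypothetical names for illustration; this is not the ActivityPods implementation.

```python
def apply_activity(cache, activity):
    """Apply an incoming activity to a cache keyed by resource URI.

    Returns True if the cache was changed, False otherwise.
    """
    if activity.get("type") != "Update":
        return False  # other activity types are out of scope for this sketch
    obj = activity.get("object")
    if not isinstance(obj, dict) or "id" not in obj:
        return False  # assumption: the updated resource is embedded in full
    cache[obj["id"]] = obj  # replace the cached copy with the new version
    return True


cache = {}
changed = apply_activity(cache, {
    "type": "Update",
    "actor": "https://alice.example/",
    "object": {"id": "https://alice.example/events/1", "name": "Picnic"},
})
```

A real handler would also need to check that the sender is allowed to update the resource before touching the cache.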

I don't think there would be a philosophical problem with using the user's storage as a cache, as long as the data are kept private and can only be seen by the user (and the applications that have been granted the interop:All right on the given resource).

Alternative

The alternative could be to cache data in the local browser. But on the first load, or when using a new device, the performance would still be terrible. Will users be ready to wait several minutes for an application? Most likely they will leave before it loads.

Another problem with a local cache is that the browser cannot know whether a resource has been updated, so to have fresh data it will still need to query ALL the data. This can be done in the background, but it will have a bad impact on the server, and I don't think it's a very elegant solution.

@jg10-mastodon-social

The question of controlling data exfiltration regularly comes up: once a user has granted access to a third party, what stops the third party from holding on to the data?
I think SAI would need to provide explicit consent for a caching use case. In the general case, the user should be aware that their data has gone beyond the requesting client [1].

That said, it does sound like at the very least some kind of indexing would be useful, and this sounds like a sensible solution to performance.

The issue of performance with many LDP resources is well known, though probably not well documented for newcomers. In my opinion, it's fundamental to working at web scale, instead of relying on a platform for aggregation.
There's a variety of engineering solutions, both client-side (local storage, indexing, streaming, design of document structure) and with server involvement (globbing, SPARQL, query interfaces, and HTTP/2, with only HTTP/2 currently supported by the Solid spec, to my knowledge).
See e.g. https://forum.solidproject.org/t/state-of-the-art-for-querying-large-containers/3320

Footnotes

  1. Local storage is probably fine without additional consent in my opinion - it's still in the client and localStorage doesn't significantly increase security risk if used appropriately

@srosset81
Contributor Author

srosset81 commented Apr 8, 2025

Thanks for your contribution, @jg10-mastodon-social!

> I think SAI would need to provide explicit consent for a caching use case. In the general case, the user should be aware that their data has gone beyond the requesting client

What is fundamentally different between caching data in my browser (so my "local computer") and caching it in my Pod (my "remote computer"), especially if it's in a private container no one else can access? In terms of security and privacy, I don't see a big difference.

> That said, it does sound like at the very least some kind of indexing would be useful, and this sounds like a sensible solution to performance.

Indeed, app-specific indexes could be a middle-ground solution. They could be used for both ordering and filtering. If I know my app will need to order or filter some resources by a given predicate, I can build an index from all the resources the user has created or that have been shared with them.
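As a rough sketch of the idea, an index on one predicate can be as simple as a sorted list of (value, URI) pairs. The `build_index` and `uris_from` helpers, the `startDate` key, and the example URIs are hypothetical, and the values are assumed to be ISO 8601 date strings so that plain string comparison orders them correctly.

```python
from bisect import bisect_left


def build_index(resources, predicate):
    """Return (value, uri) pairs sorted by the value of the given predicate."""
    return sorted((r[predicate], r["uri"]) for r in resources if predicate in r)


def uris_from(index, minimum):
    """URIs whose indexed value is >= minimum, in ascending value order."""
    # (minimum,) sorts before any (minimum, uri) pair, so bisect_left lands
    # on the first entry whose value is at least `minimum`.
    start = bisect_left(index, (minimum,))
    return [uri for _, uri in index[start:]]


resources = [
    {"uri": "https://alice.example/events/a", "startDate": "2025-05-01"},
    {"uri": "https://bob.example/events/b", "startDate": "2025-03-01"},
    {"uri": "https://carol.example/events/c", "startDate": "2025-04-10"},
]
index = build_index(resources, "startDate")
upcoming = uris_from(index, "2025-04-07")
```

With such an index, an app only has to GET the resources it is about to display, in the order it needs them.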

I know this is something @lecoqlibre and @SlyRock are working on at the moment (see this old blog post). We could think about how it could be integrated seamlessly into SAI, and also automated in some ways, so that apps don't need to invent a whole mechanism when they have this very common need.

@jg10-mastodon-social

> What is fundamentally different between caching a data in my browser (so my "local computer") and caching it in my Pod (my "remote computer")

Local storage is by design hidden from the user, who just sees data loading faster than they would otherwise. A pod is designed to be in the control of the user - it's more analogous to saving files in the user's downloads folder. There may be ways to obfuscate (e.g. browser cache files are saved where a user doesn't usually go in a format they are not familiar with), but fundamentally if there's no explicit consent, then some alternative measure should be taken so that a user is not able to reshare content that was shared with them.

Indexes are restricted in what they disclose, and are typically user-unfriendly too, which is why I consider them less of an issue.

My understanding is that indexes are usually tackled through the notifications system. An indexing app would need to have appropriate access, and the consumer app would need to request access to the index.
In other words, my first reaction is that this would be just like any other SAI data grant?

@elf-pavlik
Member

> I'll point out this is a problem that is general to Solid. The lack of real applications makes it less visible.

Currently this is a well-known limitation of Solid in general; this use case seems very similar:

To be honest, I don't think there is a quick solution to this problem. It probably needs to be captured in https://github.com/w3c/lws-ucs and considered together with authorization, replication, ODRL policies etc.

There is also a video recording from a related Solid Practitioners meeting: https://spectra.video/w/rE4CGHB5Sr74gR6TA1Mccs

@srosset81 if you like, we could bring it up again during the next Solid CG call and get more input on possible short- and long-term paths forward.


> Since browsers do not fire more than 15-20 requests at the same time, it would probably take several minutes just to display these 10 events -- something that is not acceptable to any modern user.

I'm assuming at least HTTP/2, which allows a lot of concurrent requests over the same connection. I think it should support at least 100 streams per connection, though some servers can set lower limits. To be honest, I couldn't find what the browser limits are on the number of connections across different domains.
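A back-of-envelope sketch of how much the concurrency limit matters: it counts sequential round-trip "waves" for the two phases described in the issue (50 DDGs, then 300 events). The `round_trip_waves` helper is hypothetical, and the model is deliberately simplified: it assumes every request in a wave takes about the same time and ignores that the resources are spread across roughly 50 different servers.

```python
from math import ceil


def round_trip_waves(batch_sizes, max_concurrent):
    """Count sequential waves when each batch must finish before the next starts."""
    return sum(ceil(n / max_concurrent) for n in batch_sizes)


# Phase 1: 50 Delegated Data Grants; phase 2: the 300 events they link to.
batches = [50, 300]

waves_at_20 = round_trip_waves(batches, max_concurrent=20)
waves_at_100 = round_trip_waves(batches, max_concurrent=100)
```

Under these assumptions, raising the limit from 20 to 100 concurrent requests cuts the number of waves from 18 to 4; the actual wall-clock time still depends on per-request latency, which this sketch does not try to guess.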

3 participants