-
Notifications
You must be signed in to change notification settings - Fork 1.2k
fix(memory/rag): add support of loading file from file:// uri #4286
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
mattf
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is dangerous. please make it configurable and disabled by default.
alternatively, advise users to upload files using /v1/files and attach them with POST /v1/vector_stores or POST /v1/vector_stores/{id}/files.
franciscojavierarceo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are actively deprecating the RAGDocument so we shouldn't be steering users this way. As @mattf we recommend the Files API.
|
Thanks for the feedback @mattf @franciscojavierarceo - addressed the security concerns: Changes:
Security retained:
|
mattf
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about using aiofiles, which gets you async stat and io?
I'm not opposing in any way, but I don't see us using this anywhere in the project, which most likely means a new dependency, if we are ok with introducing a new dependency here, ok, but it might introduce new issues |
i'll help you get this in shape, but you'll need someone else to approve.
mattf
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aiofiles
I'm not opposing in any way, but I don't see us using this anywhere in the project, which most likely means a new dependency, if we are ok with introducing a new dependency here, ok, but it might introduce new issues
fyi, the localfs files impl isn't async either. it's arguably ok to wait to make this async.
So I'm thinking of finishing this, the current way, then introducing aiofiles as a different RFE, see what the general feeling towards this, and then perform the changes in a different PR (or PRs) |
| filename = os.path.basename(real_path) | ||
|
|
||
| try: | ||
| file_stat = os.stat(real_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in your use case, is it ok for all users to have access to any file accessible by the running stack server?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while this is a viable security concern, its not exactly my use case (I'm fixing an issue for another user). its worth putting into question, the support for file:// in general, and again I believe its out of the scope for this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dkennetzoracle will you shed some light on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are steering users towards the /v1/files API and deprecating RAGDocument this may be moot. This issue was requested because RAGDocument couldn't work from file:// but if there is a new preferred way, we should probably use that way.
If this is to be merged, I don't think users should have access to every file on the server. You could enforce scoped access similar to how you enforce users to "opt-in" to this behavior at all. Just make them define allowed paths.
What does this PR do?
Adds support for
file://URIs in RAGDocument content. Previously, the URI regex pattern matchedfile://but the code then passed it to httpx which only supports http/https, causing anUnsupportedProtocolerror.This PR:
read_file_uri()helper function with security controls (path sanitization, file size limits)content_from_doc()andraw_data_from_doc()to handlefile://URIs before http/httpsLLAMA_STACK_MAX_FILE_URI_SIZE_MBenv var (default: 100MB)Closes #3463
Test Plan
Run the unit test suite:
Manual verification: