
Conversation

@r-bit-rry
Contributor

@r-bit-rry r-bit-rry commented Dec 3, 2025

What does this PR do?

Adds support for file:// URIs in RAGDocument content. Previously, the URI regex pattern matched file://, but the code then passed the URI to httpx, which only supports http/https, causing an UnsupportedProtocol error.

This PR:

  • Adds a read_file_uri() helper function with security controls (path sanitization, file size limits)
  • Updates content_from_doc() and raw_data_from_doc() to handle file:// URIs before http/https
  • Adds configurable max file size via LLAMA_STACK_MAX_FILE_URI_SIZE_MB env var (default: 100MB)
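For readers without the diff at hand, here is a minimal sketch of what a helper along these lines could look like. It is not the PR's actual read_file_uri(): the function name read_file_uri_sketch, the rejection of remote hosts, and the exception types are assumptions made for illustration. Only the env var name and the 100MB default come from the description above.

```python
import os
from urllib.parse import unquote, urlparse

# Env var name and 100 MB default are taken from the PR description;
# everything else in this sketch is illustrative.
MAX_FILE_URI_SIZE_MB = int(os.environ.get("LLAMA_STACK_MAX_FILE_URI_SIZE_MB", "100"))


def read_file_uri_sketch(uri: str, max_size_mb: int = MAX_FILE_URI_SIZE_MB) -> bytes:
    """Read a local file named by a file:// URI, enforcing a size limit."""
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        raise ValueError(f"Not a file:// URI: {uri}")
    # Assumption: only local files are supported, so reject remote hosts.
    if parsed.netloc not in ("", "localhost"):
        raise ValueError(f"Remote hosts are not supported in file:// URIs: {uri}")
    # Decode percent-escapes, then resolve symlinks and ".." segments.
    real_path = os.path.realpath(unquote(parsed.path))
    if not os.path.isfile(real_path):
        raise FileNotFoundError(f"No such file: {os.path.basename(real_path)}")
    # Check the size before reading so oversized files are never loaded.
    size = os.stat(real_path).st_size
    if size > max_size_mb * 1024 * 1024:
        raise ValueError(f"File exceeds the {max_size_mb} MB limit")
    with open(real_path, "rb") as f:
        return f.read()
```

Note that os.path.realpath() only normalizes the path; on its own it does not restrict which files can be read.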

Closes #3463

Test Plan

Run the unit test suite:

python -m pytest tests/unit/providers/utils/memory/test_vector_store.py -v

Manual verification:

import asyncio
from llama_stack.providers.utils.memory.vector_store import content_from_doc
from llama_stack_api import RAGDocument

async def test():
    doc = RAGDocument(document_id="test", content="file:///path/to/your/file.txt")
    result = await content_from_doc(doc)
    print(result)

asyncio.run(test())

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Dec 3, 2025
@r-bit-rry r-bit-rry marked this pull request as ready for review December 4, 2025 17:25
Collaborator

@mattf mattf left a comment


this is dangerous. please make it configurable and disabled by default.

alternatively, advise users to upload files using /v1/files and attach them with POST /v1/vector_stores or POST /v1/vector_stores/{id}/files.

Collaborator

@franciscojavierarceo franciscojavierarceo left a comment


We are actively deprecating RAGDocument, so we shouldn't be steering users this way. As @mattf said, we recommend the Files API.

@r-bit-rry
Contributor Author

Thanks for the feedback @mattf @franciscojavierarceo - addressed the security concerns:

Changes:

  • file:// URIs now disabled by default via LLAMA_STACK_ALLOW_FILE_URI env var
  • Error guides users to Files API: "file:// URIs disabled. Use Files API (/v1/files) instead, or set LLAMA_STACK_ALLOW_FILE_URI=true."
  • Refactored content_from_doc() to eliminate duplicate code paths

Security retained:

  • Path sanitization (os.path.realpath())
  • Size limits (LLAMA_STACK_MAX_FILE_URI_SIZE_MB, default 100MB)
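As a rough illustration of the opt-in gate described above (a sketch, not the code in this PR): the helper names file_uri_allowed and ensure_file_uri_allowed, the PermissionError type, and the rule that only the literal value "true" enables the feature are all assumptions. The env var name and the error text are quoted from this comment.

```python
import os

# Error text quoted from the comment above.
FILE_URI_DISABLED_MSG = (
    "file:// URIs disabled. Use Files API (/v1/files) instead, "
    "or set LLAMA_STACK_ALLOW_FILE_URI=true."
)


def file_uri_allowed() -> bool:
    """Return True only when the operator has explicitly opted in."""
    # Assumption: only "true" (case-insensitive) enables the feature;
    # an unset or empty variable keeps it disabled.
    return os.environ.get("LLAMA_STACK_ALLOW_FILE_URI", "").strip().lower() == "true"


def ensure_file_uri_allowed(uri: str) -> None:
    """Raise if `uri` is a file:// URI and the feature is not enabled."""
    if uri.startswith("file://") and not file_uri_allowed():
        raise PermissionError(FILE_URI_DISABLED_MSG)
```

Calling ensure_file_uri_allowed() before any filesystem access keeps the default behavior closed: http/https URIs pass through untouched, and file:// URIs fail with a message that points at the Files API.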

mattf
mattf previously requested changes Dec 7, 2025
Collaborator

@mattf mattf left a comment


what about using aiofiles, which gets you async stat and io?

@r-bit-rry
Contributor Author

> aiofiles

I'm not opposed in any way, but I don't see aiofiles used anywhere in the project, which most likely means a new dependency. If we're OK with introducing a new dependency here, fine, but it might introduce new issues.

@mattf mattf dismissed their stale review December 8, 2025 11:31

i'll help you get this in shape, but you'll need someone else to approve.

Collaborator

@mattf mattf left a comment


> > aiofiles
>
> I'm not opposed in any way, but I don't see aiofiles used anywhere in the project, which most likely means a new dependency. If we're OK with introducing a new dependency here, fine, but it might introduce new issues.

fyi, the localfs files impl isn't async either. it's arguably ok to wait to make this async.

@r-bit-rry
Contributor Author

> > > aiofiles
> >
> > I'm not opposed in any way, but I don't see aiofiles used anywhere in the project, which most likely means a new dependency. If we're OK with introducing a new dependency here, fine, but it might introduce new issues.
>
> fyi, the localfs files impl isn't async either. it's arguably ok to wait to make this async.

So I'm thinking of finishing this PR the current way, then proposing aiofiles in a separate RFE to see what the general feeling is, and making the changes in a separate PR (or PRs).
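For context on the trade-off discussed here, one way to get a non-blocking stat and read without adding a dependency is the standard library's asyncio.to_thread (Python 3.9+). This is a different technique from aiofiles and is not something this PR does; the function name read_file_async and the max_bytes parameter are hypothetical.

```python
import asyncio
import os


async def read_file_async(path: str, max_bytes: int) -> bytes:
    """Stat and read a file without blocking the event loop."""
    # os.stat runs in a worker thread, so the event loop stays responsive.
    file_stat = await asyncio.to_thread(os.stat, path)
    if file_stat.st_size > max_bytes:
        raise ValueError(f"File exceeds the {max_bytes}-byte limit")

    def _read() -> bytes:
        with open(path, "rb") as f:
            return f.read()

    # The blocking read is likewise offloaded to a worker thread.
    return await asyncio.to_thread(_read)
```

The file I/O itself is still synchronous inside the worker thread; the benefit is only that other coroutines can run while it happens.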

filename = os.path.basename(real_path)

try:
    file_stat = os.stat(real_path)
Collaborator


in your use case, is it ok for all users to have access to any file accessible by the running stack server?

Contributor Author


While this is a valid security concern, it's not exactly my use case (I'm fixing an issue for another user). It's worth questioning support for file:// in general, but again I believe that's out of scope for this PR.

Collaborator


@dkennetzoracle will you shed some light on this?

Contributor


If we are steering users towards the /v1/files API and deprecating RAGDocument this may be moot. This issue was requested because RAGDocument couldn't work from file:// but if there is a new preferred way, we should probably use that way.

If this is to be merged, I don't think users should have access to every file on the server. You could enforce scoped access, similar to how you require users to opt in to this behavior at all: just make them define allowed paths.
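A minimal sketch of the allowed-paths idea, under stated assumptions: the env var LLAMA_STACK_FILE_URI_ALLOWED_PATHS, its os.pathsep-separated format, and both function names are hypothetical and do not exist in this PR. The check resolves symlinks and ".." segments first, then compares whole path components so that a sibling directory sharing a name prefix does not slip through.

```python
import os


def allowed_dirs_from_env() -> list[str]:
    """Parse a hypothetical allowlist env var into a list of directories."""
    # Assumption: directories are separated by os.pathsep (":" on POSIX).
    raw = os.environ.get("LLAMA_STACK_FILE_URI_ALLOWED_PATHS", "")
    return [entry for entry in raw.split(os.pathsep) if entry]


def is_path_allowed(path: str, allowed_dirs: list[str]) -> bool:
    """Return True if `path` resolves to a location inside an allowed directory."""
    real_path = os.path.realpath(path)
    for allowed in allowed_dirs:
        real_allowed = os.path.realpath(allowed)
        try:
            # commonpath compares whole components, so "/data/docs_private"
            # is not treated as being inside "/data/docs".
            if os.path.commonpath([real_path, real_allowed]) == real_allowed:
                return True
        except ValueError:
            # Raised on Windows when the two paths are on different drives.
            continue
    # An empty allowlist denies everything.
    return False
```

With an empty allowlist the check denies every path, which matches the disabled-by-default stance requested earlier in the review.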


Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

RFE: Accept file:// for URI for RAGDocument

4 participants