Building an End-to-End RAG Pipeline to Query Local Files, Audio, and Video
June 20, 2024
A key challenge when working with AI models is ensuring that the information a model generates is relevant, timely, and context-sensitive. Retrieval-Augmented Generation (RAG) is a technique for achieving this: a generative model is linked with additional, relevant data sources to improve the accuracy of the answers it generates.
Since models can only accept a limited amount of context, large data inputs must be split into smaller chunks, and only the relevant chunks sent to the model. Relevance is determined by generating vector representations (embeddings) of the chunks, then using a similarity measure, such as cosine similarity, to find the chunks closest to the question being asked of the model.
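To make that concrete, here is a minimal, self-contained sketch of the relevance calculation using cosine similarity. The embedding values are made-up stand-ins for what a real embedding model would produce; this illustrates the idea only, and is not LlamaIndex's internal code.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors: closer to 1.0 means more similar.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend embeddings for three chunks and one question. Real embeddings come
# from a model (e.g. an OpenAI embedding endpoint) and have hundreds or
# thousands of dimensions rather than three.
chunks = {
    "chunk-about-pricing": np.array([0.9, 0.1, 0.0]),
    "chunk-about-installation": np.array([0.1, 0.8, 0.1]),
    "chunk-about-history": np.array([0.0, 0.2, 0.9]),
}
question = np.array([0.85, 0.15, 0.05])

# Rank the chunks by similarity to the question; only the top chunks are
# sent to the model alongside the question.
ranked = sorted(chunks, key=lambda name: cosine_similarity(chunks[name], question), reverse=True)
print(ranked[0])  # -> chunk-about-pricing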
A number of tools are available to do this; a popular and easy-to-use example is the LlamaIndex framework. This framework includes a RAG CLI, llamaindex-cli, which does all the heavy lifting: indexing and vectorizing data, integrating with a language model, and handling queries and responses.
The RAG CLI doesn't live in isolation, though: it still needs a pipeline that delivers data to it, and building that pipeline usually means interacting with multiple tools and processes. The typical approach is messy and unreliable: shell-script "glue", or "shelling out" from Python or another programming language. This is where Dagger can help!
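For contrast, here is a hedged sketch of what that traditional glue often looks like (this is not code from the pipeline described below): Python shelling out to each tool with subprocess, silently assuming that youtube-dl, ffmpeg, whisper, and llamaindex-cli are all installed locally at compatible versions.

import subprocess

def ask_video(url: str, question: str) -> str:
    # Each step assumes the tool is on PATH; nothing pins versions or
    # isolates the environment, which is exactly the fragility described above.
    subprocess.run(["youtube-dl", "-x", "--audio-format", "mp3",
                    "-o", "audio.%(ext)s", url], check=True)
    subprocess.run(["whisper", "audio.mp3", "--output_format", "txt"], check=True)
    result = subprocess.run(
        ["llamaindex-cli", "rag", "--files", "audio.txt", "--question", question],
        check=True, capture_output=True, text=True,
    )
    return result.stdout

If any of those assumptions breaks on a teammate's machine or in CI, so does the pipeline.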
Dagger for Data Pipelines
Dagger ensures that your data pipeline works consistently across operating systems and environments by encapsulating each of the workflows that make up the pipeline (retrieving or extracting data, transforming it, storing it) into reusable, standardized Dagger Functions.
In a recent community call, Marcos Nils demonstrated how to use Dagger with LlamaIndex to build a RAG data pipeline for YouTube videos. His pipeline consists of two Dagger Functions:
A lower-level Dagger Function named rag, which accepts a collection of documents and internally calls llamaindex-cli to index and respond to queries on those documents.
A higher-level Dagger Function named ytChat (called as yt-chat from the CLI), which accepts a YouTube video URL as its data source. This function internally calls youtube-dl to download the video, whisper to transcribe its audio track, and rag to index the transcript and respond to queries. (A simplified sketch of both functions follows below.)
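To give a feel for the shape of such a module, here is a simplified sketch in Python using the Dagger SDK. It is not Marcos's actual code: the base image, package installs, and CLI flags are illustrative assumptions, so consult his module on the Daggerverse for the real implementation.

import dagger
from dagger import dag, function, object_type

@object_type
class Gptools:
    @function
    async def rag(
        self,
        source: dagger.Directory,
        openai_api_key: dagger.Secret,
        question: str,
    ) -> str:
        """Index a directory of documents and answer a question about them."""
        return await (
            dag.container()
            .from_("python:3.11-slim")  # base image is an assumption
            .with_exec(["pip", "install", "llama-index"])  # provides llamaindex-cli
            .with_mounted_directory("/data", source)
            .with_secret_variable("OPENAI_API_KEY", openai_api_key)
            .with_exec(["llamaindex-cli", "rag", "--files", "/data", "--question", question])
            .stdout()
        )

    @function
    async def yt_chat(
        self,
        url: str,
        openai_api_key: dagger.Secret,
        question: str,
    ) -> str:
        """Download a YouTube video, transcribe it, and answer a question about it."""
        transcript = (
            dag.container()
            .from_("python:3.11-slim")
            .with_exec(["apt-get", "update"])
            .with_exec(["apt-get", "install", "-y", "ffmpeg"])  # whisper needs ffmpeg
            .with_exec(["pip", "install", "youtube-dl", "openai-whisper"])
            .with_workdir("/work")
            .with_exec(["youtube-dl", "-x", "--audio-format", "mp3",
                        "-o", "audio.%(ext)s", url])
            .with_exec(["whisper", "audio.mp3", "--model", "base",
                        "--output_format", "txt"])
            .file("/work/audio.txt")  # whisper writes <input-stem>.txt
        )
        # Reuse the lower-level function on a directory holding the transcript.
        docs = dag.directory().with_file("transcript.txt", transcript)
        return await self.rag(docs, openai_api_key, question)

Because yt_chat calls rag as an ordinary method, the two functions compose without any shell glue, and each containerized step is cached independently by Dagger.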
Benefits
Why do it this way? Dagger Functions are programmable and reusable, and they run everything in containers, which means that:
You can interact with multiple tools and processes using containers and clean, maintainable Python code (or any of the other supported programming languages) instead of relying on messy YAML or unreliable shell scripts.
You can expose a clean API to your different pipeline functions. This allows you to reuse the same functionality in other projects, or share your code with other users in your organization or community.
You benefit from all the tooling, best practices and third-party libraries available in your language’s ecosystem.
Dagger’s container-first approach ensures that your data pipeline runs the same way every time, unaffected by environment differences or operating system variability.
You get all the benefits of Dagger's caching, which accelerates pipeline runs significantly.
Try It Out!
Dagger Functions are reusable and can be called at will from anywhere that you can run the Dagger CLI, which means that you can use the same Dagger Functions to query your own local documents or YouTube videos. Here's how:
NOTE: The RAG CLI uses OpenAI, which means that your data will be sent to the OpenAI API. You will also need an OpenAI API key to use these Dagger Functions. Learn more about the OpenAI API.
Query local documents
1. Install the Dagger CLI.
2. Set the OPENAI_API_KEY environment variable to the value of your OpenAI API key.
export OPENAI_API_KEY=YOUR-API-KEY-HERE
3. Add all the documents (PDFs, images, text files, ...) you want to index to a local directory.
4. Call the Dagger Function as below, replacing the placeholders with valid inputs.
dagger -m github.com/marcosnils/daggerverse/gptools@c3f85409f535112047ef609885e152648826b4ce call rag \
  --openai-api-key=env:OPENAI_API_KEY \
  --source=YOUR-DOCUMENT-DIR-PATH-HERE \
  --question="YOUR-QUESTION-HERE"
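On the first run, Dagger pulls and builds everything the module needs, so expect some setup time; subsequent runs reuse Dagger's cache and are much faster. The answer to your question is printed to the terminal.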
Query YouTube videos
1. Install the Dagger CLI.
2. Set the OPENAI_API_KEY environment variable to the value of your OpenAI API key.
export OPENAI_API_KEY=YOUR-API-KEY-HERE
3. Call the Dagger Function as below, replacing the placeholders with valid inputs.
dagger -m github.com/marcosnils/daggerverse/gptools@c3f85409f535112047ef609885e152648826b4ce call yt-chat \
  --openai-api-key=env:OPENAI_API_KEY \
  --url=YOUR-YOUTUBE-VIDEO-URL-HERE \
  --question="YOUR-QUESTION-HERE"
Watch Marcos’ complete demo below, and let us know in Discord if you have any questions!