Count LLM Tokens flow
The "Count LLM Tokens" processor in Lunar.dev provides an estimation of the number of tokens required for a given request body, based on a specified Language Learning Model (LLM). This estimation is performed before the request is sent to the upstream LLM provider. The estimated token count is then added as a custom header, x-lunar-estimated-tokens, to the request, which is removed by Lunar.dev before the request is actually forwarded to the LLM provider. This mechanism allows users to leverage this token estimation for internal logic within Lunar.dev, most notably in conjunction with the Custom Quota feature, enabling precise control over LLM usage based on token consumption.
Scenarios
This feature allows users to implement fine-grained control over their LLM usage, leading to better cost management and adherence to provider limits. Here are a few scenarios illustrating its value:
Use Case 1: Preventing Overspending on LLM Costs
- User Story: As a developer managing an application heavily reliant on a pay-per-token LLM, I want to prevent unexpected cost overruns. I want to define a maximum token limit per request, so that Lunar.dev automatically blocks requests exceeding this limit before they are sent to the LLM provider.
- How the Flow Helps: The "Count LLM Tokens" processor estimates the tokens. A Custom Quota rule can then be configured to check the x-lunar-estimated-tokens header. If the estimated token count exceeds the defined limit, the request is rejected by Lunar.dev, preventing unnecessary LLM costs.
Use Case 2: Fair Usage and Prioritization
- User Story: As a product owner, I want to ensure fair access to our LLM-powered features for all users and prevent a single user from consuming an excessive number of tokens. I want to implement a quota system that limits the total number of tokens a user can consume within a specific time window.
- How the Flow Helps: By estimating tokens with this processor, and then using a Custom Quota based on the x-lunar-estimated-tokens header, it's possible to track and limit the total estimated tokens consumed by a user within a defined period (see the sketch below). This allows for more granular quota management than simply limiting the number of requests.
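A minimal sketch of such a window-based token budget, modeled on the quota definition shown in the Flow Example at the end of this page. The quota id, limit, and window size are illustrative, and attributing consumption to individual users (for example, by API key) is an assumption that would follow the Custom Quota documentation rather than anything shown here:

quotas:
  - id: PerUserTokenBudget # Illustrative id
    filter:
      url: api.openai.com # Same upstream as in the Flow Example below
    strategy:
      fixed_window_custom_quota: # Custom quota strategy, as in the Flow Example below
        static:
          max: 50000 # Illustrative token budget per window
        interval: 60
        interval_unit: minute # Illustrative one-hour window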
Use Case 3: Optimizing for Rate Limits based on Token Count
- User Story: As an engineer integrating with an LLM provider that has different rate limits based on the number of tokens per request, I want to ensure our application stays within these limits to avoid throttling.
- How the Flow Helps: The "Count LLM Tokens" processor allows for setting up rate limiting rules based on the estimated token count. For example, you could configure a more restrictive rate limit for requests with a higher estimated token count and a more lenient one for smaller requests, effectively optimizing usage according to the provider's rate limiting policies.
Use Case 4: Implementing Tiered Access Based on Token Consumption
- User Story: As a business, we offer different service tiers with varying levels of LLM access. Higher tiers should allow for processing larger and more complex requests, which translate to higher token counts.
- How the Flow Helps: By using the estimated token count from the header, different routing or processing logic can be applied based on the request size. For example, requests with estimated tokens below a certain threshold can be routed to a less expensive LLM model or a different processing pipeline.
In summary, the "Count LLM Tokens" flow provides a powerful mechanism to understand and control the token consumption of your LLM interactions within Lunar.dev, primarily enabling more sophisticated Custom Quota management and cost optimization.
Flow Configuration Template
name: <FlowName> # The name of the flow, e.g., CountLLMTokensFlow
filter:
  url: <URLPattern> # Define the URL pattern for the filter, e.g., api.example.com/*
  method: ["<HTTPMethod>"] # Optional: List of HTTP methods, e.g., GET, POST
  headers:
    - key: <HeaderKey> # Optional: Header key, e.g., 'X-API-Key'
      value: <HeaderValue> # Optional: Header value, e.g., '12345', '67890'
processors:
  CountLLMTokens:
    processor: CountLLMTokens # The token counting processor
    parameters:
      - key: store_count_header
        value: <header-name> # Optional: Override the custom header that stores the request token count. Default: x-lunar-estimated-tokens
      - key: encoding
        value: <Encoding> # Optional: Override the default encoding used for tokenization, e.g., cl100k_base, o200k_base, p50k_base, r50k_base
      - key: model
        value: <model-name> # Optional: Override the model name instead of relying on Lunar.dev automatic detection (specific or wildcard to cover a range of models), e.g., gpt-4o-*, gpt-4-*, gpt-4, gpt-4o
flow:
  request:
    - from:
        stream:
          name: globalStream # The stream that starts the request flow
          at: start # Start point
      to:
        processor:
          name: CountLLMTokens # Process the request through the CountLLMTokens processor
    - from:
        processor:
          name: CountLLMTokens # After the CountLLMTokens processor
      to:
        stream:
          name: globalStream # Send the request to the global stream
          at: end # End point of the request flow
  response:
    - from:
        stream:
          name: globalStream # The stream that starts the response flow
          at: start # Start point
      to:
        stream:
          name: globalStream # Send the response back to the global stream
          at: end # End point of the response flow
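Note that all three parameters are optional. A minimal sketch that relies on the defaults (the x-lunar-estimated-tokens header and Lunar.dev's automatic model detection), assuming the parameters block can simply be left out when no overrides are needed:

processors:
  CountLLMTokens:
    processor: CountLLMTokens # No parameters: defaults apply (header x-lunar-estimated-tokens, automatic model detection)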
Flow Example
name: CountRequestTokens
filter:
  url: api.openai.com/v1/chat/completions
  body:
    - key: model
      regex_match: "gpt-4-*"
processors:
  CountLLMTokens:
    processor: CountLLMTokens
    parameters:
      - key: store_count_header
        value: "x-lunar-estimated-tokens"
      - key: model
        value: "gpt-4-*"
flow:
  request:
    - from:
        stream:
          name: globalStream
          at: start
      to:
        processor:
          name: CountLLMTokens
    - from:
        processor:
          name: CountLLMTokens
      to:
        stream:
          name: globalStream
          at: end
  response:
    - from:
        stream:
          name: globalStream
          at: start
      to:
        stream:
          name: globalStream
          at: end
quotas:
  - id: MyQuota
    filter:
      url: api.openai.com
    strategy:
      fixed_window_custom_quota:
        static:
          max: 100
        interval: 1
        interval_unit: minute
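Taken together, the CountRequestTokens flow estimates tokens for gpt-4-* requests to api.openai.com/v1/chat/completions and records the estimate in the x-lunar-estimated-tokens header, while MyQuota applies a fixed window of 100 per minute to api.openai.com traffic; as described above, that budget can be driven by the estimated token count rather than by the raw number of requests.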