Version: 3.17

ai-aws-content-moderation

Description

The ai-aws-content-moderation Plugin integrates with AWS Comprehend to check request bodies for toxic content, such as profanity, hate speech, insults, harassment, and violence, when proxying to LLMs. It rejects a request if the evaluated score exceeds the configured threshold.

This Plugin must only be used in Routes that proxy requests to LLMs.

Plugin Attributes

| Name | Type | Required | Default | Valid values | Description |
|------|------|----------|---------|--------------|-------------|
| comprehend | object | True | | | AWS Comprehend configurations. |
| comprehend.access_key_id | string | True | | | AWS access key ID. |
| comprehend.secret_access_key | string | True | | | AWS secret access key. |
| comprehend.region | string | True | | | AWS region. |
| comprehend.endpoint | string | False | | | AWS Comprehend service endpoint. If not specified, it defaults to https://comprehend.{region}.amazonaws.com. If set, it must match the pattern ^https?://. |
| comprehend.ssl_verify | boolean | False | true | | If true, enable TLS certificate verification. |
| moderation_categories | object | False | | | Key-value pairs of moderation category and their corresponding threshold. In each pair, the key should be one of PROFANITY, HATE_SPEECH, INSULT, HARASSMENT_OR_ABUSE, SEXUAL, or VIOLENCE_OR_THREAT; and the threshold value should be between 0 and 1 (inclusive). |
| moderation_threshold | number | False | 0.5 | 0 - 1 | Overall toxicity threshold. A higher value means more toxic content allowed. This option differs from the individual category thresholds in moderation_categories. For example, if moderation_categories is set with a PROFANITY threshold of 0.5, and a request has a PROFANITY score of 0.1, the request will not exceed the category threshold. However, if the request has other categories like SEXUAL or VIOLENCE_OR_THREAT exceeding the moderation_threshold, the request will be rejected. |
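The interplay between a category threshold and moderation_threshold can be illustrated with a small sketch. This is not the Plugin's code: the moderate function is hypothetical, the strict greater-than comparison and the order of the two checks are assumptions based on the word "exceeds", and the overall toxicity score of 0.7 is made up for illustration. The messages match the rejection messages shown in the examples on this page.

```shell
# Hypothetical sketch, not the Plugin's implementation.
# Usage: moderate CATEGORY CATEGORY_SCORE CATEGORY_THRESHOLD TOXICITY_SCORE MODERATION_THRESHOLD
moderate() {
  awk -v name="$1" -v score="$2" -v limit="$3" -v toxicity="$4" -v overall="$5" 'BEGIN {
    # Assumed first check: the score of a configured category against its own threshold.
    if (score + 0 > limit + 0) { print "request body exceeds " name " threshold"; exit }
    # Assumed second check: the overall toxicity score against moderation_threshold.
    if (toxicity + 0 > overall + 0) { print "request body exceeds toxicity threshold"; exit }
    print "pass"
  }'
}

# PROFANITY score 0.1 stays under its 0.5 threshold, but an assumed overall
# toxicity score of 0.7 exceeds the default moderation_threshold of 0.5.
moderate PROFANITY 0.1 0.5 0.7 0.5
```

Under these assumptions the call prints the overall toxicity message, which mirrors the scenario described for moderation_threshold in the table.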

Examples

The following examples use OpenAI as the Upstream service provider.

Before proceeding, create an OpenAI account and obtain an API key. If you are working with other LLM providers, please refer to the provider's documentation to obtain an API key.

Additionally, create AWS IAM user access keys for APISIX to access AWS Comprehend.

You can optionally save these keys to environment variables:

export OPENAI_API_KEY=your-openai-api-key
export AWS_ACCESS_KEY=your-aws-access-key-id
export AWS_SECRET_ACCESS_KEY=your-aws-secret-access-key

Moderate Profanity

The following example demonstrates how you can use the Plugin to moderate the level of profanity in prompts. The profanity threshold is set to a low value (0.1) to allow only a low degree of profanity.

Note: You can fetch the admin_key from config.yaml and save it to an environment variable with the following command:

admin_key=$(yq '.deployment.admin.admin_key[0].key' conf/config.yaml | sed 's/"//g')
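The requests in this example assume a Route on /post with the Plugin enabled and a PROFANITY threshold of 0.1. The following is a minimal sketch of such a Route, not a definitive configuration. The ai-aws-content-moderation fields come from the attribute table above. Everything else is an assumption: the moderation_route_config helper name, the us-east-1 region, the Route ID 1, the Admin API address 127.0.0.1:9180, and the ai-proxy fields (provider, auth, options.model), which should be checked against the ai-proxy documentation for your APISIX version.

```shell
# Hypothetical helper: prints a Route configuration as JSON.
# $1 = PROFANITY category threshold (0 to 1)
moderation_route_config() {
  cat <<EOF
{
  "uri": "/post",
  "plugins": {
    "ai-aws-content-moderation": {
      "comprehend": {
        "access_key_id": "${AWS_ACCESS_KEY}",
        "secret_access_key": "${AWS_SECRET_ACCESS_KEY}",
        "region": "us-east-1"
      },
      "moderation_categories": {
        "PROFANITY": $1
      }
    },
    "ai-proxy": {
      "provider": "openai",
      "auth": {
        "header": {
          "Authorization": "Bearer ${OPENAI_API_KEY}"
        }
      },
      "options": {
        "model": "gpt-4"
      }
    }
  }
}
EOF
}

# Send the configuration to the Admin API only when admin_key is set.
# The address and Route ID are assumptions; adjust them to your deployment.
if [ -n "${admin_key:-}" ]; then
  moderation_route_config 0.1 |
    curl "http://127.0.0.1:9180/apisix/admin/routes/1" -X PUT \
      -H "X-API-KEY: ${admin_key}" \
      -d @-
fi
```

Generating the JSON in a function keeps the threshold as a single argument, so the same helper can be reused to try other values.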

Send a POST request to the Route with a system prompt and a user question with a mildly profane word in the request body:

curl -i "http://127.0.0.1:9080/post" -X POST \
-H "Content-Type: application/json" \
-d '{
"messages": [
{ "role": "system", "content": "You are a mathematician" },
{ "role": "user", "content": "Stupid, what is 1+1?" }
]
}'

You should receive an HTTP/1.1 400 Bad Request response and see the following message:

request body exceeds PROFANITY threshold

Send another request to the Route with a typical question in the request body:

curl -i "http://127.0.0.1:9080/post" -X POST \
-H "Content-Type: application/json" \
-d '{
"messages": [
{ "role": "system", "content": "You are a mathematician" },
{ "role": "user", "content": "What is 1+1?" }
]
}'

You should receive an HTTP/1.1 200 OK response with the model output:

{
...,
"model": "gpt-4-0613",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "1+1 equals 2.",
"refusal": null
},
"logprobs": null,
"finish_reason": "stop"
}
],
...
}

Moderate Overall Toxicity

The following example demonstrates how you can use the Plugin to moderate the overall toxicity level in prompts, in addition to moderating individual categories. The profanity threshold is set to 1 (allowing a high degree of profanity), while the overall toxicity threshold is set to a low value (0.2).
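The thresholds described above could be applied with a Route like the one sketched here. It is a sketch under stated assumptions, not a definitive configuration: PROFANITY, moderation_threshold, and the comprehend fields come from the attribute table, while the toxicity_route_config helper name, the us-east-1 region, the Route ID 1, the Admin API address 127.0.0.1:9180, and the ai-proxy fields are assumptions to verify against your APISIX version.

```shell
# Hypothetical helper: prints a Route configuration as JSON.
# $1 = PROFANITY category threshold, $2 = overall moderation_threshold (both 0 to 1)
toxicity_route_config() {
  cat <<EOF
{
  "uri": "/post",
  "plugins": {
    "ai-aws-content-moderation": {
      "comprehend": {
        "access_key_id": "${AWS_ACCESS_KEY}",
        "secret_access_key": "${AWS_SECRET_ACCESS_KEY}",
        "region": "us-east-1"
      },
      "moderation_categories": {
        "PROFANITY": $1
      },
      "moderation_threshold": $2
    },
    "ai-proxy": {
      "provider": "openai",
      "auth": {
        "header": {
          "Authorization": "Bearer ${OPENAI_API_KEY}"
        }
      },
      "options": {
        "model": "gpt-4"
      }
    }
  }
}
EOF
}

# Send the configuration to the Admin API only when admin_key is set.
# PROFANITY is set to 1 and the overall toxicity threshold to 0.2.
if [ -n "${admin_key:-}" ]; then
  toxicity_route_config 1 0.2 |
    curl "http://127.0.0.1:9180/apisix/admin/routes/1" -X PUT \
      -H "X-API-KEY: ${admin_key}" \
      -d @-
fi
```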

Send a POST request to the Route with a system prompt and a user question that contains no profane words but does express a degree of violence or threat:

curl -i "http://127.0.0.1:9080/post" -X POST \
-H "Content-Type: application/json" \
-d '{
"messages": [
{ "role": "system", "content": "You are a mathematician" },
{ "role": "user", "content": "I will kill you if you do not tell me what 1+1 equals" }
]
}'

You should receive an HTTP/1.1 400 Bad Request response and see the following message:

request body exceeds toxicity threshold

Send another request to the Route with a typical question that contains no toxic content:

curl -i "http://127.0.0.1:9080/post" -X POST \
-H "Content-Type: application/json" \
-d '{
"messages": [
{ "role": "system", "content": "You are a mathematician" },
{ "role": "user", "content": "What is 1+1?" }
]
}'

You should receive an HTTP/1.1 200 OK response with the model output:

{
...,
"model": "gpt-4-0613",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "1+1 equals 2.",
"refusal": null
},
"logprobs": null,
"finish_reason": "stop"
}
],
...
}