Generative artificial intelligence (AI) is rapidly being adopted across enterprises, moving beyond simple foundation model interactions to sophisticated agentic workflows. As organizations progress from proofs-of-concept to production deployments, they need robust tools for developing, evaluating, and monitoring AI applications at scale.
This post demonstrates how to use Foundation Models (FMs) from Amazon Bedrock and the newly launched Amazon Bedrock AgentCore with W&B Weave to build, evaluate, and monitor enterprise AI solutions. The discussion covers the complete development lifecycle, from tracking individual FM calls to monitoring complex agent workflows in production.
Overview of W&B Weave
Weights & Biases (W&B) offers an AI developer platform that provides comprehensive tools for training, fine-tuning, and leveraging foundation models, serving enterprises of all sizes across various industries.
W&B Weave provides a unified suite of developer tools to support every stage of agentic AI workflows, including:
- Tracing & monitoring: Track large language model (LLM) calls and application logic to debug and analyze production systems.
- Systematic iteration: Refine and iterate on prompts, datasets, and models.
- Experimentation: Experiment with different models and prompts in the LLM Playground.
- Evaluation: Use custom or pre-built scorers and comparison tools to systematically assess and improve application performance, and collect user and expert feedback for real-world testing and evaluation.
- Guardrails: Protect applications with safeguards for content moderation, prompt safety, and more, using custom or third-party guardrails (including Amazon Bedrock Guardrails) or W&B Weave's native guardrails (see the sketch after this list).
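For illustration, here is a minimal, hedged sketch of pairing Amazon Bedrock Guardrails with W&B Weave tracing: the Bedrock ApplyGuardrail call is wrapped in a @weave.op so each content check is traced, and the guardrail identifier, version, and project name are placeholders.
import boto3
import weave

weave.init("guardrails-demo")  # placeholder project name
client = boto3.client("bedrock-runtime")

@weave.op()
def check_input(text: str) -> dict:
    # Apply an Amazon Bedrock guardrail to user input; the identifier and version are placeholders
    response = client.apply_guardrail(
        guardrailIdentifier="your-guardrail-id",
        guardrailVersion="1",
        source="INPUT",
        content=[{"text": {"text": text}}],
    )
    # action is "GUARDRAIL_INTERVENED" when the guardrail blocks or masks content
    return {"action": response["action"], "assessments": response.get("assessments", [])}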
W&B Weave can be fully managed by Weights & Biases in a multi-tenant or single-tenant environment, or deployed directly within a customer's Amazon Virtual Private Cloud (VPC). Because W&B Weave is part of the W&B Development Platform, organizations also get a seamless experience between model training and fine-tuning and agentic AI workflows.
The Weights & Biases AI Development Platform is available through AWS Marketplace. Individuals and academic teams can access W&B without additional cost.
Tracking Amazon Bedrock FMs with W&B Weave SDK
W&B Weave integrates smoothly with Amazon Bedrock via Python and TypeScript SDKs. After installing the library and patching the Bedrock client, W&B Weave automatically tracks LLM calls:
!pip install weave
import weave
import boto3
import json
from weave.integrations.bedrock.bedrock_sdk import patch_client
weave.init("my_bedrock_app")
# Create and patch the Bedrock client
client = boto3.client("bedrock-runtime")
patch_client(client)
# Use the client as usual
response = client.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 100,
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ]
    }),
    contentType="application/json",
    accept="application/json"
)
response_dict = json.loads(response.get('body').read())
print(response_dict["content"][0]["text"])

This integration automatically versions experiments and tracks configurations, offering complete visibility into Amazon Bedrock applications without requiring modifications to core logic.
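Because patch_client instruments the Bedrock client itself, higher-level application logic can also be wrapped with @weave.op so the invoke_model call appears as a nested span in the trace. A brief sketch, reusing the patched client and model from the snippet above:
@weave.op()
def answer_question(question: str) -> str:
    # The patched invoke_model call is captured as a child of this op's trace
    response = client.invoke_model(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 100,
            "messages": [{"role": "user", "content": question}],
        }),
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(response["body"].read())["content"][0]["text"]

print(answer_question("What is the capital of Germany?"))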
Experimenting with Amazon Bedrock FMs in W&B Weave Playground
The W&B Weave Playground accelerates prompt engineering with an intuitive interface for testing and comparing Bedrock models. Key features include:
- Direct prompt editing and message retrying
- Side-by-side model comparison
- Access from trace views for rapid iteration
Users can add AWS credentials in the Playground settings, select preferred Amazon Bedrock FMs, and begin experimenting. The interface supports rapid iteration on prompts while maintaining full traceability of experiments.

Evaluating Amazon Bedrock FMs with W&B Weave Evaluations
W&B Weave Evaluations offers dedicated tools for effectively evaluating generative AI models. By combining W&B Weave Evaluations with Amazon Bedrock, users can efficiently assess these models, analyze outputs, and visualize performance across key metrics. Users have the option to use built-in scorers from W&B Weave, third-party or custom scorers, and human/expert feedback. This combination facilitates a deeper understanding of model tradeoffs, such as differences in cost, accuracy, speed, and output quality.
W&B Weave provides a robust method for tracking evaluations using Model & Evaluation classes. To configure an evaluation job, customers can:
- Define a dataset or a list of dictionaries containing examples for evaluation.
- Create a list of scoring functions. Each function should accept the model output and, optionally, other fields from the examples, and return a dictionary with the scores.
- Define an Amazon Bedrock model using the Model class.
- Evaluate this model by calling Evaluation.
An example of setting up an evaluation job is provided below:
import weave
from weave import Evaluation
import asyncio
# Collect your examples
examples = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "Who wrote 'To Kill a Mockingbird'?", "expected": "Harper Lee"},
    {"question": "What is the square root of 64?", "expected": "8"},
]

# Define any custom scoring function
@weave.op()
def match_score1(expected: str, output: dict) -> dict:
    # Here is where you'd define the logic to score the model output
    return {"match": expected == output["generated_text"]}

@weave.op()
def function_to_evaluate(question: str):
    # Here is where you would add your LLM call and return the output
    return {"generated_text": "Paris"}

# Score your examples using scoring functions
evaluation = Evaluation(
    dataset=examples, scorers=[match_score1]
)
# Start tracking the evaluation
weave.init('intro-example')
# Run the evaluation
asyncio.run(evaluation.evaluate(function_to_evaluate))
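As noted above, the system under test can also be defined with W&B Weave's Model class instead of a plain function, which versions the model configuration alongside the evaluation. A minimal sketch follows; it assumes the Bedrock Converse API via boto3 and reuses the evaluation defined above, and it is illustrative rather than a definitive implementation:
import boto3
import weave

class BedrockQAModel(weave.Model):
    # Attributes on a weave.Model are tracked and versioned
    model_id: str = "anthropic.claude-3-5-sonnet-20240620-v1:0"

    @weave.op()
    def predict(self, question: str) -> dict:
        # Call the Bedrock Converse API and return the generated text
        client = boto3.client("bedrock-runtime")
        response = client.converse(
            modelId=self.model_id,
            messages=[{"role": "user", "content": [{"text": question}]}],
            inferenceConfig={"maxTokens": 100},
        )
        text = response["output"]["message"]["content"][0]["text"]
        return {"generated_text": text}

# The same Evaluation can score the Model instance
asyncio.run(evaluation.evaluate(BedrockQAModel()))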

The evaluation dashboard visualizes performance metrics, enabling informed decisions regarding model selection and configuration. For detailed guidance, a previous post on evaluating LLM summarization with Amazon Bedrock and Weave is available.
Enhancing Amazon Bedrock AgentCore Observability with W&B Weave
Amazon Bedrock AgentCore is a comprehensive suite of services designed for deploying and operating highly capable agents securely at enterprise scale. It offers secure runtime environments, workflow execution tools, and operational controls that are compatible with popular frameworks such as Strands Agents, CrewAI, LangGraph, and LlamaIndex, as well as many LLMs from Amazon Bedrock or external sources.
AgentCore includes built-in observability through Amazon CloudWatch dashboards, which track key metrics such as token usage, latency, session duration, and error rates. It also traces workflow steps, detailing which tools were invoked and how the model responded, providing essential visibility for debugging and quality assurance in production.
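As a rough illustration of pulling those operational metrics programmatically, the sketch below queries CloudWatch with boto3; the namespace, metric name, and dimension are placeholders and should be replaced with the values AgentCore publishes in your account.
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

# Namespace, metric, and dimension names below are placeholders for illustration
stats = cloudwatch.get_metric_statistics(
    Namespace="YourAgentCoreNamespace",
    MetricName="Latency",
    Dimensions=[{"Name": "AgentRuntimeId", "Value": "your-agent-runtime-id"}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Average", "Maximum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])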
Used together, AgentCore supplies the operational monitoring and security foundations, while W&B Weave adds visualization and debugging tools for organizations already working in the W&B environment. This gives teams the flexibility to choose the observability approach that best fits their established processes and preferences when developing complex agents that chain multiple tools and reasoning steps.

There are two primary methods for adding W&B Weave observability to AgentCore agents: using the native W&B Weave SDK or integrating via OpenTelemetry.
Native W&B Weave SDK
The simplest method involves using W&B Weave’s @weave.op decorator to automatically track function calls. Users initialize W&B Weave with their project name and wrap the functions they wish to monitor:
import os
from typing import Any, Dict

import weave
from strands import Agent  # assuming the agent is built with the Strands Agents SDK

os.environ["WANDB_API_KEY"] = "your_api_key"
weave.init("your_project_name")

@weave.op()
def word_count_op(text: str) -> int:
    return len(text.split())

@weave.op()
def run_agent(agent: Agent, user_message: str) -> Dict[str, Any]:
    result = agent(user_message)
    return {"message": result.message, "model": agent.model.config["model_id"]}
Since AgentCore operates as a Docker container, W&B Weave should be added to dependencies (e.g., uv add weave) to include it in the container image.
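For context, a hedged sketch of an AgentCore entrypoint instrumented this way is shown below; it assumes the bedrock-agentcore Python SDK's BedrockAgentCoreApp, a Strands-based agent, and the run_agent op defined above, so treat it as an outline rather than the canonical deployment pattern.
import weave
from bedrock_agentcore.runtime import BedrockAgentCoreApp  # assumed AgentCore Python SDK import
from strands import Agent  # assuming a Strands Agents-based agent

weave.init("your_project_name")
app = BedrockAgentCoreApp()
agent = Agent()  # hypothetical default agent configuration

@app.entrypoint
def invoke(payload: dict) -> dict:
    # run_agent is the @weave.op defined in the previous snippet, so each invocation is traced
    return run_agent(agent, payload.get("prompt", ""))

if __name__ == "__main__":
    app.run()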
OpenTelemetry Integration
For teams already using OpenTelemetry or seeking vendor-neutral instrumentation, W&B Weave directly supports OTLP (OpenTelemetry Protocol):
import base64
import json
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# WANDB_API_KEY and WEAVE_PROJECT are assumed to be defined elsewhere
auth_b64 = base64.b64encode(f"api:{WANDB_API_KEY}".encode()).decode()
exporter = OTLPSpanExporter(
    endpoint="https://trace.wandb.ai/otel/v1/traces",
    headers={"Authorization": f"Basic {auth_b64}", "project_id": WEAVE_PROJECT}
)
# Register the exporter with a tracer provider
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
# Create spans to track execution
with tracer.start_as_current_span("invoke_agent") as span:
    span.set_attribute("input.value", json.dumps({"prompt": user_message}))
    result = agent(user_message)
    span.set_attribute("output.value", json.dumps({"message": result.message}))
This approach maintains compatibility with AgentCore's existing OpenTelemetry infrastructure while routing traces to W&B Weave for visualization. Used together, AgentCore's CloudWatch integration monitors system health, resource utilization, and error rates and traces agent reasoning and tool selection, while W&B Weave presents execution data in formats familiar to teams already working in the W&B environment. Both layers provide visibility into how agents process information and make decisions, so organizations can select whichever best aligns with their existing workflows. This dual-layer approach allows users to:
- Monitor production service level agreements (SLAs) through CloudWatch alerts.
- Debug complex agent behaviors using W&B Weave’s trace explorer.
- Optimize token usage and latency with detailed execution breakdowns.
- Compare agent performance across different prompts and configurations.
The integration requires minimal code changes, preserves existing AgentCore deployments, and scales with agent complexity. Whether building simple tool-calling agents or orchestrating multi-step workflows, this observability stack provides the necessary insights for rapid iteration and confident deployment.
For implementation details and complete code examples, refer to a previous post.
Conclusion
This post demonstrated how to build and optimize enterprise-grade agentic AI solutions by combining Amazon Bedrock’s FMs and AgentCore with W&B Weave’s comprehensive observability toolkit. It explored how W&B Weave can enhance every stage of the LLM development lifecycle—from initial experimentation in the Playground to systematic evaluation of model performance, and finally to production monitoring of complex agent workflows.
The integration between Amazon Bedrock and W&B Weave offers several key capabilities:
- Automatic tracking of Amazon Bedrock FM calls with minimal code changes using the W&B Weave SDK.
- Rapid experimentation through the W&B Weave Playground’s intuitive interface for testing prompts and comparing models.
- Systematic evaluation with custom scoring functions to assess different Amazon Bedrock models.
- Comprehensive observability for AgentCore deployments, with CloudWatch metrics providing robust operational monitoring supplemented by W&B Weave's detailed execution traces.
A simple integration to track Amazon Bedrock calls can be a starting point, with more advanced features progressively adopted as AI applications grow in complexity. The combination of Amazon Bedrock and W&B Weave’s comprehensive development tools provides the foundation needed to build, evaluate, and maintain production-ready AI solutions at scale.

