Add aevaluate pointer (#439)
Also spelling fixes
hinthornw authored Sep 19, 2024
2 parents 93dcf01 + 4bbbcb9 commit 6a819b0
Showing 17 changed files with 38 additions and 26 deletions.
@@ -313,7 +313,7 @@ The outputs are just the output of that step, which is usually the LLM response.

The evaluator for this is usually some binary score for whether the correct tool call was selected, as well as some heuristic for whether the input to the tool was correct. The reference tool can be simply specified as a string.

-There are several benefits to this type of evaluation. It allows you to evaluate individual actions, which lets you hone in where your application may be failing. They are also relatively fast to run (because they only involve a single LLM call) and evaluation often uses simple heuristc evaluation of the selected tool relative to the reference tool. One downside is that they don't capture the full agent - only one particular step. Another downside is that dataset creation can be challenging, particular if you want to include past history in the agent input. It is pretty easy to generate a dataset for steps early on in an agent's trajectory (e.g., this may only include the input prompt), but it can be difficult to generate a dataset for steps later on in the trajectory (e.g., including numerous prior agent actions and responses).
+There are several benefits to this type of evaluation. It allows you to evaluate individual actions, which lets you hone in where your application may be failing. They are also relatively fast to run (because they only involve a single LLM call) and evaluation often uses simple heuristic evaluation of the selected tool relative to the reference tool. One downside is that they don't capture the full agent - only one particular step. Another downside is that dataset creation can be challenging, particular if you want to include past history in the agent input. It is pretty easy to generate a dataset for steps early on in an agent's trajectory (e.g., this may only include the input prompt), but it can be difficult to generate a dataset for steps later on in the trajectory (e.g., including numerous prior agent actions and responses).

:::tip

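As an aside (illustrative, not part of this commit's diff): the single-step evaluation described above usually boils down to comparing the tool the agent selected against a reference tool name, plus a simple heuristic over the arguments. A minimal sketch, where the run and reference shapes are assumptions:

```python
def tool_selection_score(selected_tool: str, selected_args: dict,
                         reference_tool: str, reference_args: dict) -> dict:
    """Binary score for the tool choice plus a heuristic score for its arguments."""
    correct_tool = selected_tool == reference_tool
    # Heuristic: fraction of reference arguments that appear unchanged in the call.
    matched = sum(1 for k, v in reference_args.items() if selected_args.get(k) == v)
    arg_score = matched / len(reference_args) if reference_args else 1.0
    return {"correct_tool": int(correct_tool), "arg_score": arg_score}

# Example: the right tool was chosen, but one of two arguments is wrong.
print(tool_selection_score(
    "search_flights", {"origin": "SFO", "destination": "JFK"},
    "search_flights", {"origin": "SFO", "destination": "EWR"},
))  # {'correct_tool': 1, 'arg_score': 0.5}
```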
@@ -7,7 +7,7 @@ This page assumes that you have already read our guide on [data retention](./dat
## How usage limits work

LangSmith lets you configure usage limits on tracing. Note that these are _usage_ limits, not _spend_ limits, which
-mean they let you limit the quantity of occurrances of some event rather than the total amount you will spend.
+mean they let you limit the quantity of occurrences of some event rather than the total amount you will spend.

LangSmith lets you set two different monthly limits, mirroring our Billable Metrics discussed in the aforementioned data retention guide:

@@ -34,7 +34,7 @@ You can also tag versions of your dataset using the SDK. Here's an example of ho

```python
from langsmith import Client
-fromt datetime import datetime
+from datetime import datetime

client = Client()

@@ -199,7 +199,7 @@ The following metrics are available off-the-shelf:
| `embedding_distance` | Cosine distance between two embeddings | expect.embedding_distance(prediction=prediction, expectation=expectation) |
| `edit_distance` | Edit distance between two strings | expect.edit_distance(prediction=prediction, expectation=expectation) |

-You can also log any arbitrary feeback within a unit test manually using the `client`.
+You can also log any arbitrary feedback within a unit test manually using the `client`.

```python
from langsmith import unit, Client
@@ -182,7 +182,7 @@ stub = Stub("auth-example", image=Image.debian_slim().pip_install("langsmith"))
@web_endpoint(method="POST")
# We set up a `secret` query parameter
def f(data: dict, secret: str = Query(...)):
-# You can import dependencies you don't have locally inside Modal funxtions
+# You can import dependencies you don't have locally inside Modal functions
from langsmith import Client

# First, we validate the secret key we pass
@@ -223,7 +223,7 @@ openai_response = oai_client.chat.completions.create(**openai_payload)`),

## List, delete, and like prompts

-You can also list, delete, and like/unline prompts using the `list prompts`, `delete prompt`, `like prompt` and `unlike prompt` methods.
+You can also list, delete, and like/unlike prompts using the `list prompts`, `delete prompt`, `like prompt` and `unlike prompt` methods.
See the [LangSmith SDK client](https://github.com/langchain-ai/langsmith-sdk) for extensive documentation on these methods.

<CodeTabs
@@ -150,7 +150,7 @@ for await (const run of client.listRuns({

## Use filter query language

-For more complex queries, you can use the query language described in the [filter query language refernece](../../reference/data_formats/trace_query_syntax#filter-query-language).
+For more complex queries, you can use the query language described in the [filter query language reference](../../reference/data_formats/trace_query_syntax#filter-query-language).

### List all runs called "extractor" whose root of the trace was assigned feedback "user_score" score of 1

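For context rather than as part of the diff: in the Python SDK, filters written in that query language are passed to `Client.list_runs`. A minimal sketch along the lines of the heading above, where the project name is a placeholder and the trace-level filter syntax should be treated as indicative rather than authoritative:

```python
from langsmith import Client

client = Client()

# Runs named "extractor" whose root trace received a "user_score" feedback of 1.
runs = client.list_runs(
    project_name="my-project",  # placeholder project name
    filter='eq(name, "extractor")',
    trace_filter='and(eq(feedback_key, "user_score"), eq(feedback_score, 1))',
)
for run in runs:
    print(run.id, run.name)
```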
@@ -64,7 +64,7 @@ call_openai(messages)\n
call_openai(
messages,
# highlight-next-line
langsmith_extra={"project_name": "My Overriden Project"},
langsmith_extra={"project_name": "My Overridden Project"},
)\n
# The wrapped OpenAI client accepts all the same langsmith_extra parameters
# as @traceable decorated functions, and logs traces to LangSmith automatically.
2 changes: 1 addition & 1 deletion versioned_docs/version-2.0/pricing.mdx
@@ -240,7 +240,7 @@ You are not currently able to set a spend limit in the product.

### How can my track my usage so far this month?

-Under the Settings section for your Organization you will see subsection for <b>Usage</b>. There, you will able to see a graph of the daily nunber of billable LangSmith traces from the last 30, 60, or 90 days. Note that this data is delayed by 1-2 hours and so may trail your actual number of runs slightly for the current day.
+Under the Settings section for your Organization you will see subsection for <b>Usage</b>. There, you will able to see a graph of the daily number of billable LangSmith traces from the last 30, 60, or 90 days. Note that this data is delayed by 1-2 hours and so may trail your actual number of runs slightly for the current day.

### I have a question about my bill...

@@ -14,7 +14,7 @@ However, you can configure LangSmith to use an external Postgres database (**str
- A provisioned Postgres database that your LangSmith instance will have network access to. We recommend using a managed Postgres service like:
- [Amazon RDS](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_GettingStarted.CreatingConnecting.PostgreSQL.html)
- [Google Cloud SQL](https://cloud.google.com/curated-resources/cloud-sql#section-1)
-- [Azure Database for PostgreSQ](https://azure.microsoft.com/en-us/products/postgresql#features)
+- [Azure Database for PostgreSQL](https://azure.microsoft.com/en-us/products/postgresql#features)
- Note: We only officially support Postgres versions >= 14.
- A user with admin access to the Postgres database. This user will be used to create the necessary tables, indexes, and schemas.
- This user will also need to have the ability to create extensions in the database. We use/will try to install the btree_gin, btree_gist, pgcrypto, citext, and pg_trgm extensions.
2 changes: 1 addition & 1 deletion versioned_docs/version-2.0/self_hosting/release_notes.mdx
@@ -17,7 +17,7 @@ This release adds a number of new features, improves the performance of the Thre

- [Resource tags to organize your Workspace in LangSmith](https://changelog.langchain.com/announcements/resource-tags-to-organize-your-workspace-in-langsmith)
- [Generate synthetic examples to enhance a LangSmith dataset](https://changelog.langchain.com/announcements/generate-synthetic-examples-to-enhance-a-langsmith-dataset)
-- [Enhanced trace comparison view and savable custom trace filters](https://changelog.langchain.com/announcements/trace-comparison-view-saving-custom-trace-filters)
+- [Enhanced trace comparison view and saveable custom trace filters](https://changelog.langchain.com/announcements/trace-comparison-view-saving-custom-trace-filters)
- [Defining, validating and updating dataset schemas](https://changelog.langchain.com/announcements/define-validate-and-update-dataset-schemas-in-langsmith)
- [Multiple annotators can review a run in LangSmith](https://changelog.langchain.com/announcements/multiple-annotators-can-review-a-run-in-langsmith)
- [Support for filtering runs within the trace view](https://changelog.langchain.com/announcements/filtering-runs-within-the-trace-view)
@@ -6,7 +6,7 @@ table_of_contents: true

# Deleting Traces

-The LangSmith UI does not currently support the deletion of an invidual trace. This, however, can be accomplished by directly removing the trace from all materialized views in ClickHouse (except the runs_history views) and the runs and feedback tables themselves.
+The LangSmith UI does not currently support the deletion of an individual trace. This, however, can be accomplished by directly removing the trace from all materialized views in ClickHouse (except the runs_history views) and the runs and feedback tables themselves.

This command can either be run using a trace ID as an argument or using a file that is a list of trace IDs.

@@ -228,7 +228,7 @@ use this feature, please read more about its functionality [here](../../concepts

### Set dev/staging limits and view total spent limit across workspaces

-Following the same logic for our dev and staging environments, whe set limits at 10% of the production
+Following the same logic for our dev and staging environments, we set limits at 10% of the production
limit on usage for each workspace.

While this works with our usage pattern, setting good dev and staging limits may vary depending on
8 changes: 4 additions & 4 deletions versioned_docs/version-2.0/tutorials/Developers/agents.mdx
@@ -160,7 +160,7 @@ class State(TypedDict):

### SQL Assistant

-Use [prompt based roughtly on what is shown here](https://python.langchain.com/v0.2/docs/tutorials/sql_qa/#agents).
+Use [prompt based roughly on what is shown here](https://python.langchain.com/v0.2/docs/tutorials/sql_qa/#agents).

```python
from langchain_core.runnables import Runnable, RunnableConfig
@@ -178,8 +178,8 @@ class Assistant:
# Invoke the tool-calling LLM
result = self.runnable.invoke(state)
# If it is a tool call -> response is valid
-# If it has meaninful text -> response is valid
-# Otherwise, we re-prompt it b/c response is not meaninful
+# If it has meaningful text -> response is valid
+# Otherwise, we re-prompt it b/c response is not meaningful
if not result.tool_calls and (
not result.content
or isinstance(result.content, list)
@@ -478,7 +478,7 @@ def check_specific_tool_call(root_run: Run, example: Example) -> dict:
"""
Check if the first tool call in the response matches the expected tool call.
"""
-# Exepected tool call
+# Expected tool call
expected_tool_call = 'sql_db_list_tables'

# Run
16 changes: 14 additions & 2 deletions versioned_docs/version-2.0/tutorials/Developers/evaluation.mdx
@@ -19,7 +19,7 @@ At a high level, in this tutorial we will go over how to:
- _Track results over time_
- _Set up automated testing to run in CI/CD_

-For more information on the evaluation workflows LangSmith supports, check out the [how-to guides](../../how_to_guides).
+For more information on the evaluation workflows LangSmith supports, check out the [how-to guides](../../how_to_guides), or see the reference docs for [evaluate](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._runner.evaluate.html) and its asynchronous [aevaluate](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._arunner.aevaluate.html) counterpart.

Lots to cover, let's dive in!

@@ -192,14 +192,26 @@ Now we're ready to run evaluation.
Let's do it!

```python
-from langsmith.evaluation import evaluate
+from langsmith import evaluate

experiment_results = evaluate(
langsmith_app, # Your AI system
data=dataset_name, # The data to predict and grade over
evaluators=[evaluate_length, qa_evaluator], # The evaluators to score the results
experiment_prefix="openai-3.5", # A prefix for your experiment names to easily identify them
)

+# Note: If your system is async, you can use the asynchronous `aevaluate` function
+# import asyncio
+# from langsmith import aevaluate
+#
+# experiment_results = asyncio.run(aevaluate(
+# my_async_langsmith_app, # Your AI system
+# data=dataset_name, # The data to predict and grade over
+# evaluators=[evaluate_length, qa_evaluator], # The evaluators to score the results
+# experiment_prefix="openai-3.5", # A prefix for your experiment names to easily identify them
+# ))

```

This will output a URL. If we click on it, we should see results of our evaluation!
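To make the snippet above more concrete (again, an illustrative sketch rather than diff content): custom evaluators such as `evaluate_length` are plain functions that take the run and the reference example and return a feedback dict. One possible shape, where the `"output"`/`"answer"` field names and the length threshold are assumptions:

```python
from langsmith.schemas import Example, Run

def evaluate_length(run: Run, example: Example) -> dict:
    """Score 1 if the app's answer is not much longer than the reference answer."""
    prediction = (run.outputs or {}).get("output", "")
    reference = (example.outputs or {}).get("answer", "")
    # Assumed threshold: allow up to 2x the reference length.
    score = int(len(prediction) < 2 * len(reference))
    return {"key": "length", "score": score}
```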
@@ -237,10 +237,10 @@ import numpy as np

def find_similar(examples, topic, k=5):
inputs = [e.inputs['topic'] for e in examples] + [topic]
-embedds = client.embeddings.create(input=inputs, model="text-embedding-3-small")
-embedds = [e.embedding for e in embedds.data]
-embedds = np.array(embedds)
-args = np.argsort(-embedds.dot(embedds[-1])[:-1])[:5]
+vectors = client.embeddings.create(input=inputs, model="text-embedding-3-small")
+vectors = [e.embedding for e in vectors.data]
+vectors = np.array(vectors)
+args = np.argsort(-vectors.dot(vectors[-1])[:-1])[:5]
examples = [examples[i] for i in args]
return examples
```
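A usage note, likewise illustrative and not part of the commit: `find_similar` relies on an OpenAI client named `client` for the embedding call. A sketch of invoking it over examples pulled from a LangSmith dataset, with the dataset name as a placeholder:

```python
from langsmith import Client
from openai import OpenAI

client = OpenAI()    # used inside find_similar for the embeddings call
ls_client = Client()

# Pull examples from a dataset (name is a placeholder) and select the few-shot
# examples whose topics are closest to the new topic.
examples = list(ls_client.list_examples(dataset_name="tweet-topics"))
few_shots = find_similar(examples, topic="open source LLMs")
for ex in few_shots:
    print(ex.inputs["topic"])
```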
@@ -39,7 +39,7 @@ df['version'] = df['version'].apply(lambda x: f"version:{x}")

### Save to CSV

-To upload the data to LangSmith, we first need to save it to a CSV, which we can do using the `to_csv` function provided by pandas. Make sure to save this file somewhere that is easily accesible to you.
+To upload the data to LangSmith, we first need to save it to a CSV, which we can do using the `to_csv` function provided by pandas. Make sure to save this file somewhere that is easily accessible to you.

```python
df.to_csv("./../SWE-bench.csv",index=False)
@@ -55,7 +55,7 @@ Next, select `Key-Value` as the dataset type. Lastly head to the `Create Schema`

Once you have populated the `Input fields` (and left the `Output fields` empty!) you can click the blue `Create` button in the top right corner, and your dataset will be created!

-### Upload CSV to LangSmith Programatically
+### Upload CSV to LangSmith Programmatically

Alternatively you can upload your csv to LangSmith using the sdk as shown in the code block below:
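For orientation (a sketch, not the code block the guide refers to): a programmatic upload generally goes through the SDK's `upload_csv` helper. The column names below are placeholders; use the input columns of your own CSV:

```python
from langsmith import Client

client = Client()

dataset = client.upload_csv(
    csv_file="./../SWE-bench.csv",
    input_keys=["instance_id", "problem_statement", "version"],  # placeholder columns
    output_keys=[],  # the guide leaves the output fields empty
    name="SWE-bench",
    description="SWE-bench instances uploaded from a CSV",
)
print(dataset.id)
```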

@@ -178,7 +178,7 @@ def convert_runs_to_langsmith_feedback(
else:
feedback_for_instance.append({"key":"resolved-patch","score":0})
else:
-# The instance did not run succesfully
+# The instance did not run successfully
feedback_for_instance += [{"key":"completed-patch","score":0},{"key":"resolved-patch","score":0}]
feedback_for_all_instances[prediction['run_id']] = feedback_for_instance

