Add aevaluate pointer (#439)
Also spelling fixes
hinthornw authored Sep 19, 2024
2 parents 93dcf01 + 4bbbcb9 commit 6a819b0
Showing 17 changed files with 38 additions and 26 deletions.
@@ -313,7 +313,7 @@ The outputs are just the output of that step, which is usually the LLM response.

The evaluator for this is usually some binary score for whether the correct tool call was selected, as well as some heuristic for whether the input to the tool was correct. The reference tool can be simply specified as a string.

-There are several benefits to this type of evaluation. It allows you to evaluate individual actions, which lets you hone in where your application may be failing. They are also relatively fast to run (because they only involve a single LLM call) and evaluation often uses simple heuristc evaluation of the selected tool relative to the reference tool. One downside is that they don't capture the full agent - only one particular step. Another downside is that dataset creation can be challenging, particular if you want to include past history in the agent input. It is pretty easy to generate a dataset for steps early on in an agent's trajectory (e.g., this may only include the input prompt), but it can be difficult to generate a dataset for steps later on in the trajectory (e.g., including numerous prior agent actions and responses).
+There are several benefits to this type of evaluation. It allows you to evaluate individual actions, which lets you hone in where your application may be failing. They are also relatively fast to run (because they only involve a single LLM call) and evaluation often uses simple heuristic evaluation of the selected tool relative to the reference tool. One downside is that they don't capture the full agent - only one particular step. Another downside is that dataset creation can be challenging, particular if you want to include past history in the agent input. It is pretty easy to generate a dataset for steps early on in an agent's trajectory (e.g., this may only include the input prompt), but it can be difficult to generate a dataset for steps later on in the trajectory (e.g., including numerous prior agent actions and responses).

:::tip

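As an aside (illustrative, not part of this commit's diff): the single-step evaluation described above usually boils down to comparing the tool the agent selected against a reference tool name, plus a simple heuristic over the arguments. A minimal sketch, where the run and reference shapes are assumptions:

```python
def tool_selection_score(selected_tool: str, selected_args: dict,
                         reference_tool: str, reference_args: dict) -> dict:
    """Binary score for the tool choice plus a heuristic score for its arguments."""
    correct_tool = selected_tool == reference_tool
    # Heuristic: fraction of reference arguments that appear unchanged in the call.
    matched = sum(1 for k, v in reference_args.items() if selected_args.get(k) == v)
    arg_score = matched / len(reference_args) if reference_args else 1.0
    return {"correct_tool": int(correct_tool), "arg_score": arg_score}

# Example: the right tool was chosen, but one of two arguments is wrong.
print(tool_selection_score(
    "search_flights", {"origin": "SFO", "destination": "JFK"},
    "search_flights", {"origin": "SFO", "destination": "EWR"},
))  # {'correct_tool': 1, 'arg_score': 0.5}
```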
@@ -7,7 +7,7 @@ This page assumes that you have already read our guide on [data retention](./dat
## How usage limits work

LangSmith lets you configure usage limits on tracing. Note that these are _usage_ limits, not _spend_ limits, which
-mean they let you limit the quantity of occurrances of some event rather than the total amount you will spend.
+mean they let you limit the quantity of occurrences of some event rather than the total amount you will spend.

LangSmith lets you set two different monthly limits, mirroring our Billable Metrics discussed in the aforementioned data retention guide:

@@ -34,7 +34,7 @@ You can also tag versions of your dataset using the SDK. Here's an example of ho

```python
from langsmith import Client
-fromt datetime import datetime
+from datetime import datetime

client = Client()

@@ -199,7 +199,7 @@ The following metrics are available off-the-shelf:
| `embedding_distance` | Cosine distance between two embeddings | expect.embedding_distance(prediction=prediction, expectation=expectation) |
| `edit_distance` | Edit distance between two strings | expect.edit_distance(prediction=prediction, expectation=expectation) |

-You can also log any arbitrary feeback within a unit test manually using the `client`.
+You can also log any arbitrary feedback within a unit test manually using the `client`.

```python
from langsmith import unit, Client
@@ -182,7 +182,7 @@ stub = Stub("auth-example", image=Image.debian_slim().pip_install("langsmith"))
@web_endpoint(method="POST")
# We set up a `secret` query parameter
def f(data: dict, secret: str = Query(...)):
-# You can import dependencies you don't have locally inside Modal funxtions
+# You can import dependencies you don't have locally inside Modal functions
from langsmith import Client

# First, we validate the secret key we pass
@@ -223,7 +223,7 @@ openai_response = oai_client.chat.completions.create(**openai_payload)`),

## List, delete, and like prompts

-You can also list, delete, and like/unline prompts using the `list prompts`, `delete prompt`, `like prompt` and `unlike prompt` methods.
+You can also list, delete, and like/unlike prompts using the `list prompts`, `delete prompt`, `like prompt` and `unlike prompt` methods.
See the [LangSmith SDK client](https://github.com/langchain-ai/langsmith-sdk) for extensive documentation on these methods.

<CodeTabs
@@ -150,7 +150,7 @@ for await (const run of client.listRuns({

## Use filter query language

-For more complex queries, you can use the query language described in the [filter query language refernece](../../reference/data_formats/trace_query_syntax#filter-query-language).
+For more complex queries, you can use the query language described in the [filter query language reference](../../reference/data_formats/trace_query_syntax#filter-query-language).

### List all runs called "extractor" whose root of the trace was assigned feedback "user_score" score of 1

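For context rather than as part of the diff: in the Python SDK, filters written in that query language are passed to `Client.list_runs`. A minimal sketch along the lines of the heading above, where the project name is a placeholder and the trace-level filter syntax should be treated as indicative rather than authoritative:

```python
from langsmith import Client

client = Client()

# Runs named "extractor" whose root trace received a "user_score" feedback of 1.
runs = client.list_runs(
    project_name="my-project",  # placeholder project name
    filter='eq(name, "extractor")',
    trace_filter='and(eq(feedback_key, "user_score"), eq(feedback_score, 1))',
)
for run in runs:
    print(run.id, run.name)
```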
@@ -64,7 +64,7 @@ call_openai(messages)\n
call_openai(
messages,
# highlight-next-line
langsmith_extra={"project_name": "My Overriden Project"},
langsmith_extra={"project_name": "My Overridden Project"},
)\n
# The wrapped OpenAI client accepts all the same langsmith_extra parameters
# as @traceable decorated functions, and logs traces to LangSmith automatically.
2 changes: 1 addition & 1 deletion versioned_docs/version-2.0/pricing.mdx
@@ -240,7 +240,7 @@ You are not currently able to set a spend limit in the product.

### How can my track my usage so far this month?

-Under the Settings section for your Organization you will see subsection for <b>Usage</b>. There, you will able to see a graph of the daily nunber of billable LangSmith traces from the last 30, 60, or 90 days. Note that this data is delayed by 1-2 hours and so may trail your actual number of runs slightly for the current day.
+Under the Settings section for your Organization you will see subsection for <b>Usage</b>. There, you will able to see a graph of the daily number of billable LangSmith traces from the last 30, 60, or 90 days. Note that this data is delayed by 1-2 hours and so may trail your actual number of runs slightly for the current day.

### I have a question about my bill...

@@ -14,7 +14,7 @@ However, you can configure LangSmith to use an external Postgres database (**str
- A provisioned Postgres database that your LangSmith instance will have network access to. We recommend using a managed Postgres service like:
- [Amazon RDS](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_GettingStarted.CreatingConnecting.PostgreSQL.html)
- [Google Cloud SQL](https://cloud.google.com/curated-resources/cloud-sql#section-1)
-- [Azure Database for PostgreSQ](https://azure.microsoft.com/en-us/products/postgresql#features)
+- [Azure Database for PostgreSQL](https://azure.microsoft.com/en-us/products/postgresql#features)
- Note: We only officially support Postgres versions >= 14.
- A user with admin access to the Postgres database. This user will be used to create the necessary tables, indexes, and schemas.
- This user will also need to have the ability to create extensions in the database. We use/will try to install the btree_gin, btree_gist, pgcrypto, citext, and pg_trgm extensions.
2 changes: 1 addition & 1 deletion versioned_docs/version-2.0/self_hosting/release_notes.mdx
@@ -17,7 +17,7 @@ This release adds a number of new features, improves the performance of the Thre

- [Resource tags to organize your Workspace in LangSmith](https://changelog.langchain.com/announcements/resource-tags-to-organize-your-workspace-in-langsmith)
- [Generate synthetic examples to enhance a LangSmith dataset](https://changelog.langchain.com/announcements/generate-synthetic-examples-to-enhance-a-langsmith-dataset)
-- [Enhanced trace comparison view and savable custom trace filters](https://changelog.langchain.com/announcements/trace-comparison-view-saving-custom-trace-filters)
+- [Enhanced trace comparison view and saveable custom trace filters](https://changelog.langchain.com/announcements/trace-comparison-view-saving-custom-trace-filters)
- [Defining, validating and updating dataset schemas](https://changelog.langchain.com/announcements/define-validate-and-update-dataset-schemas-in-langsmith)
- [Multiple annotators can review a run in LangSmith](https://changelog.langchain.com/announcements/multiple-annotators-can-review-a-run-in-langsmith)
- [Support for filtering runs within the trace view](https://changelog.langchain.com/announcements/filtering-runs-within-the-trace-view)
@@ -6,7 +6,7 @@ table_of_contents: true

# Deleting Traces

-The LangSmith UI does not currently support the deletion of an invidual trace. This, however, can be accomplished by directly removing the trace from all materialized views in ClickHouse (except the runs_history views) and the runs and feedback tables themselves.
+The LangSmith UI does not currently support the deletion of an individual trace. This, however, can be accomplished by directly removing the trace from all materialized views in ClickHouse (except the runs_history views) and the runs and feedback tables themselves.

This command can either be run using a trace ID as an argument or using a file that is a list of trace IDs.

@@ -228,7 +228,7 @@ use this feature, please read more about its functionality [here](../../concepts

### Set dev/staging limits and view total spent limit across workspaces

-Following the same logic for our dev and staging environments, whe set limits at 10% of the production
+Following the same logic for our dev and staging environments, we set limits at 10% of the production
limit on usage for each workspace.

While this works with our usage pattern, setting good dev and staging limits may vary depending on
8 changes: 4 additions & 4 deletions versioned_docs/version-2.0/tutorials/Developers/agents.mdx
@@ -160,7 +160,7 @@ class State(TypedDict):

### SQL Assistant

-Use [prompt based roughtly on what is shown here](https://python.langchain.com/v0.2/docs/tutorials/sql_qa/#agents).
+Use [prompt based roughly on what is shown here](https://python.langchain.com/v0.2/docs/tutorials/sql_qa/#agents).

```python
from langchain_core.runnables import Runnable, RunnableConfig
@@ -178,8 +178,8 @@ class Assistant:
# Invoke the tool-calling LLM
result = self.runnable.invoke(state)
# If it is a tool call -> response is valid
-# If it has meaninful text -> response is valid
-# Otherwise, we re-prompt it b/c response is not meaninful
+# If it has meaningful text -> response is valid
+# Otherwise, we re-prompt it b/c response is not meaningful
if not result.tool_calls and (
not result.content
or isinstance(result.content, list)
@@ -478,7 +478,7 @@ def check_specific_tool_call(root_run: Run, example: Example) -> dict:
"""
Check if the first tool call in the response matches the expected tool call.
"""
-# Exepected tool call
+# Expected tool call
expected_tool_call = 'sql_db_list_tables'

# Run
16 changes: 14 additions & 2 deletions versioned_docs/version-2.0/tutorials/Developers/evaluation.mdx
@@ -19,7 +19,7 @@ At a high level, in this tutorial we will go over how to:
- _Track results over time_
- _Set up automated testing to run in CI/CD_

-For more information on the evaluation workflows LangSmith supports, check out the [how-to guides](../../how_to_guides).
+For more information on the evaluation workflows LangSmith supports, check out the [how-to guides](../../how_to_guides), or see the reference docs for [evaluate](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._runner.evaluate.html) and its asynchronous [aevaluate](https://langsmith-sdk.readthedocs.io/en/latest/evaluation/langsmith.evaluation._arunner.aevaluate.html) counterpart.

Lots to cover, let's dive in!

@@ -192,14 +192,26 @@ Now we're ready to run evaluation.
Let's do it!

```python
-from langsmith.evaluation import evaluate
+from langsmith import evaluate

experiment_results = evaluate(
langsmith_app, # Your AI system
data=dataset_name, # The data to predict and grade over
evaluators=[evaluate_length, qa_evaluator], # The evaluators to score the results
experiment_prefix="openai-3.5", # A prefix for your experiment names to easily identify them
)

+# Note: If your system is async, you can use the asynchronous `aevaluate` function
+# import asyncio
+# from langsmith import aevaluate
+#
+# experiment_results = asyncio.run(aevaluate(
+# my_async_langsmith_app, # Your AI system
+# data=dataset_name, # The data to predict and grade over
+# evaluators=[evaluate_length, qa_evaluator], # The evaluators to score the results
+# experiment_prefix="openai-3.5", # A prefix for your experiment names to easily identify them
+# ))

```

This will output a URL. If we click on it, we should see results of our evaluation!
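To make the snippet above more concrete (again, an illustrative sketch rather than diff content): custom evaluators such as `evaluate_length` are plain functions that take the run and the reference example and return a feedback dict. One possible shape, where the `"output"`/`"answer"` field names and the length threshold are assumptions:

```python
from langsmith.schemas import Example, Run

def evaluate_length(run: Run, example: Example) -> dict:
    """Score 1 if the app's answer is not much longer than the reference answer."""
    prediction = (run.outputs or {}).get("output", "")
    reference = (example.outputs or {}).get("answer", "")
    # Assumed threshold: allow up to 2x the reference length.
    score = int(len(prediction) < 2 * len(reference))
    return {"key": "length", "score": score}
```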
@@ -237,10 +237,10 @@ import numpy as np

def find_similar(examples, topic, k=5):
inputs = [e.inputs['topic'] for e in examples] + [topic]
-embedds = client.embeddings.create(input=inputs, model="text-embedding-3-small")
-embedds = [e.embedding for e in embedds.data]
-embedds = np.array(embedds)
-args = np.argsort(-embedds.dot(embedds[-1])[:-1])[:5]
+vectors = client.embeddings.create(input=inputs, model="text-embedding-3-small")
+vectors = [e.embedding for e in vectors.data]
+vectors = np.array(vectors)
+args = np.argsort(-vectors.dot(vectors[-1])[:-1])[:5]
examples = [examples[i] for i in args]
return examples
```
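A usage note, likewise illustrative and not part of the commit: `find_similar` relies on an OpenAI client named `client` for the embedding call. A sketch of invoking it over examples pulled from a LangSmith dataset, with the dataset name as a placeholder:

```python
from langsmith import Client
from openai import OpenAI

client = OpenAI()    # used inside find_similar for the embeddings call
ls_client = Client()

# Pull examples from a dataset (name is a placeholder) and select the few-shot
# examples whose topics are closest to the new topic.
examples = list(ls_client.list_examples(dataset_name="tweet-topics"))
few_shots = find_similar(examples, topic="open source LLMs")
for ex in few_shots:
    print(ex.inputs["topic"])
```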
@@ -39,7 +39,7 @@ df['version'] = df['version'].apply(lambda x: f"version:{x}")

### Save to CSV

-To upload the data to LangSmith, we first need to save it to a CSV, which we can do using the `to_csv` function provided by pandas. Make sure to save this file somewhere that is easily accesible to you.
+To upload the data to LangSmith, we first need to save it to a CSV, which we can do using the `to_csv` function provided by pandas. Make sure to save this file somewhere that is easily accessible to you.

```python
df.to_csv("./../SWE-bench.csv",index=False)
@@ -55,7 +55,7 @@ Next, select `Key-Value` as the dataset type. Lastly head to the `Create Schema`

Once you have populated the `Input fields` (and left the `Output fields` empty!) you can click the blue `Create` button in the top right corner, and your dataset will be created!

-### Upload CSV to LangSmith Programatically
+### Upload CSV to LangSmith Programmatically

Alternatively you can upload your csv to LangSmith using the sdk as shown in the code block below:
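For orientation (a sketch, not the code block the guide refers to): a programmatic upload generally goes through the SDK's `upload_csv` helper. The column names below are placeholders; use the input columns of your own CSV:

```python
from langsmith import Client

client = Client()

dataset = client.upload_csv(
    csv_file="./../SWE-bench.csv",
    input_keys=["instance_id", "problem_statement", "version"],  # placeholder columns
    output_keys=[],  # the guide leaves the output fields empty
    name="SWE-bench",
    description="SWE-bench instances uploaded from a CSV",
)
print(dataset.id)
```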

@@ -178,7 +178,7 @@ def convert_runs_to_langsmith_feedback(
else:
feedback_for_instance.append({"key":"resolved-patch","score":0})
else:
-# The instance did not run succesfully
+# The instance did not run successfully
feedback_for_instance += [{"key":"completed-patch","score":0},{"key":"resolved-patch","score":0}]
feedback_for_all_instances[prediction['run_id']] = feedback_for_instance

