Scalability issues #82
BellmannRichard
started this conversation in
General
Replies: 2 comments 1 reply
-
Hi @BellmannRichard
-
Hey @BellmannRichard, as @yzaparto said, we only send the first 5 records of the table. The only scalability issue arises with tables that have many columns. I'll turn this into a discussion; let's see if we manage to find a solution!
-
You clearly have to pass the whole data frame to the OpenAI API. Even for small data frames (hundreds of rows, dozens of columns) this could easily fill up a 4096-token context window, or make users spend a lot of money. You should compute the number of tokens before you make the API call, and if it's over some threshold, warn the user.
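Not from the project's code — just a sketch of what that pre-call check could look like. A real implementation would use a proper tokenizer (e.g. `tiktoken` for OpenAI models); here a crude chars/4 heuristic stands in so the idea is self-contained:

```python
# Hypothetical pre-flight check: estimate token count before calling the API
# and warn the user when the prompt would blow past the context window.

def estimate_tokens(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English text.
    # Swap in a real tokenizer (e.g. tiktoken) for accurate counts.
    return max(1, len(text) // 4)

def check_budget(prompt: str, limit: int = 4096) -> bool:
    """Return True if the prompt likely fits within `limit` tokens."""
    n = estimate_tokens(prompt)
    if n > limit:
        print(f"Warning: prompt is ~{n} tokens, over the {limit}-token limit.")
        return False
    return True
```

The threshold and the warning behavior (hard stop vs. prompt the user) are design choices; the point is only that the check happens before any money is spent.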
Also, this will clearly not scale to the size of the datasets used in industry. Try a random dataset with 10,000 rows and 100 columns, for example. If it doesn't work (as I expect), consider testing some fix, such as splitting the data frame into chunks, summarizing them, and using the summaries to answer the research question. Summaries will most likely mess up the floating-point numbers, though. All in all, I don't see how this can work even for medium-sized data frames.
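To make the chunk-and-summarize suggestion concrete, here is a minimal sketch (all names are illustrative, nothing here is from the project): rows are split into fixed-size chunks and each numeric column is reduced to min/max/mean, so only the summaries would ever reach the model:

```python
# Illustrative chunk-and-summarize sketch: rows as a list of dicts with
# numeric values, reduced per chunk to a small per-column summary.

def chunked(rows, size):
    """Yield consecutive slices of `rows` with at most `size` elements."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def summarize_chunk(rows):
    """Reduce each column to min/max/mean for one chunk of rows."""
    summary = {}
    for col in rows[0]:
        vals = [r[col] for r in rows]
        # Note: the mean is a lossy float aggregate -- this is exactly the
        # floating-point precision loss mentioned above.
        summary[col] = {"min": min(vals), "max": max(vals),
                        "mean": sum(vals) / len(vals)}
    return summary

# 10,000-row toy dataset split into 10 chunks of 1,000 rows each.
rows = [{"a": i, "b": i * 0.5} for i in range(10_000)]
summaries = [summarize_chunk(c) for c in chunked(rows, 1_000)]
```

Whether such summaries are enough to answer a research question depends entirely on the question; anything needing exact cell values is lost at this step.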