Using read_csv within Databricks to open a local file #2177
-
I have ported some code from pandas to Databricks/Koalas. My `read_csv` statement does not work because the file is local to my computer, not within the Databricks file system (DBFS). I feel like I am missing something obvious. I want my pandas code to work on Databricks/Koalas with minor changes. I know I can use the Databricks GUI point-and-click to create a DBFS table and then make a DataFrame from the table, but that is not programmatic and is a poor solution if I have hundreds of local files. After `import databricks.koalas as ks`, the read fails with `java.io.IOException: No FileSystem for scheme: C`.
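For reference, a minimal sketch of the kind of call that produces this error, assuming a hypothetical local Windows path (`C:/data/example.csv` is illustrative, not the actual file):

```python
import databricks.koalas as ks

# Hypothetical local path -- Spark treats the "C:" prefix as a URL scheme,
# which is why the error reads "No FileSystem for scheme: C".
kdf = ks.read_csv("C:/data/example.csv")
```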
-
It should be either a canonical URL (e.g., …)
-
Can you see if it works with plain PySpark?
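For example, a quick check with plain PySpark against the same path (the path below is a placeholder) would look roughly like this; if this also fails, the problem is in Spark's file access rather than in Koalas:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder path -- substitute the same path you passed to ks.read_csv
sdf = spark.read.csv("C:/data/example.csv", header=True)
sdf.show()
```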
-
Additionally to #2177 (comment), this is just not going to work ‒ internally, `read_*` methods use standard Spark data sources (koalas/databricks/koalas/namespace.py, line 282 in 4d14f37), so the same restriction applies. Any path you want to read has to be accessible to every Spark worker in your cluster, and that's really not the case when you use your local file system. In general, you should migrate your data to distributed file storage first ‒ in the case of DBFS, the official CLI should do the trick.
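A rough sketch of that workflow, assuming the Databricks CLI is installed and configured; the local directory and DBFS path below are placeholders:

```python
import databricks.koalas as ks

# First copy the files to DBFS outside of Spark, e.g. with the Databricks CLI
# (roughly: databricks fs cp --recursive ./local_data/ dbfs:/data/ -- paths are placeholders).
# Then read them through a dbfs:/ URL, which every Spark worker can reach.
kdf = ks.read_csv("dbfs:/data/example.csv")
kdf.head()
```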