Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow csv unescaping for station names #81

Merged
merged 1 commit into from
May 6, 2024

Conversation

jbruechert
Copy link
Contributor

This will only start doing something once motis-project/utl#18 is merged

It is the minimum amount of changes to fix the issue I encountered, but should probably be done for more fields.

@jbruechert
Copy link
Contributor Author

to get rid of the std::string allocations which are unnecessary in ~99% of the cases were nothing was escaped, I suppose a global string pool could be used that string_views could point to.

@felixguendling
Copy link
Member

Wouldn't it be sufficient to have csv_range not create a new T every time read_row is called?
https://github.com/motis-project/utl/blob/4c1503afe58e209977d9a1e3db6a6b271a50c521/include/utl/parser/csv_range.h#L94

If it would have a T member variable that it can reuse every time read_row is called, allocations would be very limited.

Having global state is something I would try to avoid as it can introduce unexpected problems in with multi-threading, etc.

@felixguendling
Copy link
Member

which are unnecessary in ~99% of the cases were nothing was escaped

In this case here (stop.name) there should not be any additional allocation now as there has been an allocation anyway (stop.name needs to be allocated), just that it's now allocated earlier during CSV parsing. But there's no extra allocation with the std::move.

@jbruechert
Copy link
Contributor Author

In this case here (stop.name) there should not be any additional allocation now as there has been an allocation anyway (stop.name needs to be allocated), just that it's now allocated earlier during CSV parsing. But there's no extra allocation with the std::move.

Wasn't it just a string_view pointing to the csv source data?

@felixguendling
Copy link
Member

True. You're right.

I think I'll try to change the procedure a bit so data is directly stored to the timetable struct.

@felixguendling
Copy link
Member

I think another option would be to use cista::raw::basic_string as it act as both: a non-owning string_view or a owning string. So in 99.9% of the time it would act as string_view but in case an allocation is needed, it becomes a string and allocates. Unfortunately, it doesn't have a std::string compatible interface (like erase).

@jbruechert
Copy link
Contributor Author

cista::raw::string is pretty neat, I added an overload for it in motis-project/utl#19

src/loader/gtfs/stop.cc Outdated Show resolved Hide resolved
@felixguendling
Copy link
Member

Next step would be to update utl and use cista::raw::generic_string, right? Or is anything else blocking? Just want to make sure I didn't miss any other PR where you're blocked waiting for my review.

@jbruechert
Copy link
Contributor Author

I just can't get pkg do create a state that I can work in right now.
My changes in nigiri are constantly hard reset.

@felixguendling
Copy link
Member

If you work directly on the nigiri repo (not in deps, but nigiri = root), pkg should not interfere.

pkg in general only checks out a version if current_ref != target_ref where target_ref is the latest commit referenced over the whole tree. So if you set this to your current commit, there should not be any resets.

@felixguendling
Copy link
Member

Otherwise, you can just replace include(cmake/pkg.cmake) with add_subdirectory(deps) to disable the check.

@jbruechert
Copy link
Contributor Author

jbruechert commented May 5, 2024

After switching to generic_string, the code now crashes.
I also added a move operator in csv_col:

  T&& operator*() { return std::move(t); }
0x00005555560826a1 in cista::generic_string<char const*>::reset (this=0x656c6b) at /home/jonah/kde/src/motis/deps/cista/include/cista/containers/string.h:67
67          if (!h_.is_short_ && h_.ptr_ != nullptr && h_.self_allocated_) {
(gdb) bt
#0  0x00005555560826a1 in cista::generic_string<char const*>::reset (this=0x656c6b) at /home/jonah/kde/src/motis/deps/cista/include/cista/containers/string.h:67
#1  0x000055555570033c in cista::generic_string<char const*>::~generic_string (this=0x656c6b, __in_chrg=<optimized out>) at /home/jonah/kde/src/motis/deps/cista/include/cista/containers/string.h:31
#2  utl::csv_col<cista::generic_string<char const*>, utl::csv_col_name<&utl::const_str<(char)115, (char)116, (char)111, (char)112, (char)95, (char)110, (char)97, (char)109, (char)101, (char)0, (char)0, (char)0, (char)0, (char)0, (char)0, (char)0, (char)0, (char)0, (char)0, (char)0, (char)0, (char)0, (char)0, (char)0>::s> >::~csv_col (this=0x656c6b, __in_chrg=<optimized out>)                                                 
    at /home/jonah/kde/src/motis/deps/utl/include/utl/parser/csv_range.h:23
#3  csv_stop::~csv_stop (this=0x7fffffffac90, __in_chrg=<optimized out>) at /home/jonah/kde/src/motis/deps/nigiri/src/loader/gtfs/stop.cc:177
#4  utl::csv_range<nigiri::loader::gtfs::read_stops(nigiri::source_idx_t, nigiri::timetable&, tz_map&, std::string_view, std::string_view, unsigned int)::csv_stop, utl::line_range<utl::buf_reader<utl::progress_tracker::update_fn()::<lambda(size_t)> > >, ','>::read_row (s=..., this=<error reading variable: Cannot access memory at address 0xfffffffffffffd13>)                                                                   
    at /home/jonah/kde/src/motis/deps/utl/include/utl/parser/csv_range.h:103
#5  utl::csv_range<nigiri::loader::gtfs::read_stops(nigiri::source_idx_t, nigiri::timetable&, tz_map&, std::string_view, std::string_view, unsigned int)::csv_stop, utl::line_range<utl::buf_reader<utl::progress_tracker::update_fn()::<lambda(size_t)> > >, ','>::begin (this=<error reading variable: Cannot access memory at address 0xfffffffffffffd13>) at /home/jonah/kde/src/motis/deps/utl/include/utl/parser/csv_range.h:109    
#6  utl::operator|<utl::csv_range<nigiri::loader::gtfs::read_stops(nigiri::source_idx_t, nigiri::timetable&, tz_map&, std::string_view, std::string_view, unsigned int)::csv_stop, utl::line_range<utl::buf_reader<utl::progress_tracker::update_fn()::<lambda(size_t)> > >, ','> >(utl::csv_range<nigiri::loader::gtfs::read_stops(nigiri::source_idx_t, nigiri::timetable&, tz_map&, std::string_view, std::string_view, unsigned int)::csv_stop, utl::line_range<utl::buf_reader<utl::progress_tracker::update_fn()::<lambda(size_t)> > >, ','> &&, utl::for_each_t<nigiri::loader::gtfs::read_stops(nigiri::source_idx_t, nigiri::timetable&, tz_map&, std::string_view, std::string_view, unsigned int)::<lambda(nigiri::loader::gtfs::read_stops(nigiri::source_idx_t, nigiri::timetable&, tz_map&, std::string_view, std::string_view, unsigned int)::csv_stop&)> > &&) (    
    r=..., f=...) at /home/jonah/kde/src/motis/deps/utl/include/utl/pipes/for_each.h:15
#7  0x00005555568cad08 in nigiri::loader::gtfs::read_stops (src=..., src@entry=..., tt=..., timezones=..., stops_file_content=..., transfers_file_content="", link_stop_distance=<optimized out>)
    at /home/jonah/kde/src/motis/deps/nigiri/src/loader/gtfs/stop.cc:193

Do you have an idea why that could happen?

Without the move operator, switching to generic_string just worsens import performance a lot, but doesn't crash

@felixguendling
Copy link
Member

My mistake. I missed to return *this in the assignment operators of generic_string. I updated utl to have a mutable csv_col::operator* that returns a mutable reference which should enable std::move to work. In case it works, could you please benchmark again so we make sure there's no performance degradation for importing data?

@jbruechert
Copy link
Contributor Author

I updated utl to have a mutable csv_col::operator* that returns a mutable reference which should enable std::move to work.

Thanks!

@jbruechert
Copy link
Contributor Author

It should finally work properly now :)

include/nigiri/location.h Outdated Show resolved Hide resolved
@felixguendling
Copy link
Member

I think it was good with the version we had before. The latest change with making location::name_ a generic_string didn't work and is also not necessary as location is always only a (volatile/temporary) view on data.

@felixguendling
Copy link
Member

felixguendling commented May 5, 2024

With this change address sanitizer doesn't complain anymore: felixguendling@07f6ec7

@felixguendling
Copy link
Member

What's the performance impact of the new solution?

@jbruechert
Copy link
Contributor Author

I ran it on the transitous data set, and it wasn't noticeably slower (I didn't have a previous number unfortunately).
Additionally, I had a look at a perf flame graph, and string allocation wasn't a noticeable percentage of the overall time.

@jbruechert
Copy link
Contributor Author

I think it was good with the version we had before. The latest change with making location::name_ a generic_string didn't work and is also not necessary as location is always only a (volatile/temporary) view on data.

Okay, wasn't sure whether location really never lives longer than stop, but that simplifies things 👍

@jbruechert
Copy link
Contributor Author

image

@felixguendling felixguendling self-requested a review May 6, 2024 07:49
@felixguendling felixguendling merged commit d998a24 into motis-project:master May 6, 2024
12 checks passed
@felixguendling
Copy link
Member

Looks good! :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants