# ShortStrings
[![contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat)](https://github.com/JuliaString/ShortStrings.jl/issues)
[![CI](https://github.com/JuliaString/ShortStrings.jl/workflows/CI/badge.svg)](https://github.com/JuliaString/ShortStrings.jl/actions?query=workflow%3ACI)
[![codecov](https://codecov.io/gh/JuliaString/ShortStrings.jl/branch/master/graph/badge.svg)](https://codecov.io/gh/JuliaString/ShortStrings.jl)
ShortStrings provides an efficient string format that stores short strings inside integer types. One byte of the integer records the size of the string and the remaining bytes hold its content. For example, a `UInt32` can hold a string of up to 3 bytes and a `UInt128` can hold a string of up to 15 bytes.
With BitIntegers.jl, integer types larger than `UInt128` can be defined. This package supports strings of up to 255 bytes.
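
To make the layout described above concrete, here is a minimal sketch that uses only Base Julia. The helper names `pack15` and `unpack15` are hypothetical, and the exact byte placement (content in the high-order bytes, size in the lowest byte) is an assumption chosen for illustration; it is not taken from the package's source.

```julia
# Illustrative only: `pack15` and `unpack15` are hypothetical helpers,
# not part of ShortStrings.jl. Byte placement is an assumption.

# Pack up to 15 bytes of `s` into a UInt128: content goes into the
# high-order bytes, the byte count goes into the lowest byte.
function pack15(s::String)
    bytes = codeunits(s)
    n = length(bytes)
    n <= 15 || throw(ArgumentError("string is $n bytes; at most 15 fit in a UInt128"))
    packed = UInt128(0)
    for (i, b) in enumerate(bytes)
        # the first byte lands in the most significant position
        packed |= UInt128(b) << (8 * (16 - i))
    end
    return packed | UInt128(n)
end

# Recover the original string from the packed integer.
function unpack15(packed::UInt128)
    n = Int(packed & 0xff)  # size is stored in the lowest byte
    bytes = [UInt8((packed >> (8 * (16 - i))) & 0xff) for i in 1:n]
    return String(bytes)
end

a, b = pack15("apple"), pack15("banana")
unpack15(a) == "apple"            # round trip
(a < b) == ("apple" < "banana")   # integer order follows string order
```

With the content in the high-order bytes, comparing two packed integers gives the same result as comparing the strings byte by byte, which is why an integer sort such as radix sort can be applied to the packed values.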
## Quick Start
```julia
using ShortStrings
using SortingAlgorithms
using Random: randstring
N = Int(1e6)
svec = [randstring(rand(1:15)) for i=1:N]
# convert to ShortString
ssvec = ShortString15.(svec)
# sort both vectors and compare the timings
@time sort(svec);
@time sort(ssvec, by = x->x.size_content, alg=RadixSort);
# conversion to shorter strings is also possible with
ShortString7(randstring(7))
ShortString3(randstring(3))
# convenience macros are provided for writing actual strings (e.g., for comparison)
s15 = ss15"A short string" # ShortString15 === ShortString{UInt128}
s7 = ss7"shorter" # ShortString7 === ShortString{UInt64}
s3 = ss3"srt" # ShortString3 === ShortString{UInt32}
# The ShortString constructor can automatically select the shortest size that a string will fit in
ShortString("This is a long string")
# A maximum length can also be given as the second argument:
ShortString("Foo", 15)
# The `ss` macro will also select the shortest size that will fit
s31 = ss"This also is a long string"
```
## Benchmarks
```julia
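using SortingLab, ShortStrings, SortingAlgorithms, BenchmarkTools;
using Random: randstring # needed for randstring below
using Statistics: mean # needed for mean in the plot call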
N = Int(1e6);
svec = [randstring(rand(1:15)) for i=1:N];
# convert to ShortString
ssvec = ShortString15.(svec);
basesort = @benchmark sort($svec)
radixsort_timings = @benchmark SortingLab.radixsort($svec)
short_radixsort = @benchmark ShortStrings.fsort($ssvec)
# another way to do sorting
sort(ssvec, by = x->x.size_content, alg=RadixSort)
using RCall
@rput svec;
r_timings = R"""
replicate($(length(short_radixsort.times)), system.time(sort(svec, method="radix"))[3])
""";
using Plots
bar(["Base.sort","SortingLab.radixsort","ShortStrings radix sort", "R radix sort"],
mean.([basesort.times./1e9, radixsort_timings.times./1e9, short_radixsort.times./1e9, r_timings]),
title="String sort performance - len: 1m, variable size 15",
label = "seconds")
```
```julia
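using SortingLab, ShortStrings, SortingAlgorithms, BenchmarkTools;
using Random: randstring # needed for randstring below
using Statistics: mean # needed for mean in the plot call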
N = Int(1e6);
svec = rand([randstring(rand(1:15)) for i=1:N÷100],N)
# convert to ShortString
ssvec = ShortString15.(svec);
basesort = @benchmark sort($svec) samples = 5 seconds = 120
radixsort_timings = @benchmark SortingLab.radixsort($svec) samples = 5 seconds = 120
short_radixsort = @benchmark ShortStrings.fsort($ssvec) samples = 5 seconds = 120
using RCall
@rput svec;
r_timings = R"""
replicate(max(5, $(length(short_radixsort.times))), system.time(sort(svec, method="radix"))[3])
""";
using Plots
bar(["Base.sort","SortingLab.radixsort","ShortStrings radix sort", "R radix sort"],
mean.([basesort.times./1e9, radixsort_timings.times./1e9, short_radixsort.times./1e9, r_timings]),
title="String sort performance - len: $(N÷1_000_000)m, fixed size: 15",
label = "seconds")
```
## Notes
This package is based on the discussion [here](https://discourse.julialang.org/t/progress-towards-faster-sortperm-for-strings/8505/4?u=xiaodai). If Julia's `Base` adopts the hybrid string representation discussed there, this package will become redundant.