
R Dataset

genmeblog edited this page Jun 22, 2020 · 1 revision

R data objects are usually converted to a Clojure data structure or a tech.ml.dataset object. The notes below cover typical use cases, using R's built-in datasets as examples.

Data Frame

Any data.frame is handled the same way, including tibble and data.table. If row.names are available, they are converted to an additional column, :$row.names.

BOD

No row.names available.

:Time :demand
1.0 8.3
2.0 10.3
3.0 19.0
4.0 16.0
5.0 15.6
7.0 19.8

CO2

With row.names

:$row.names :Plant :Type :Treatment :conc :uptake
1 1 :Quebec :nonchilled 95.0 16.0
2 1 :Quebec :nonchilled 175.0 30.4
3 1 :Quebec :nonchilled 250.0 34.8
4 1 :Quebec :nonchilled 350.0 37.2
5 1 :Quebec :nonchilled 500.0 35.3
6 1 :Quebec :nonchilled 675.0 39.2
7 1 :Quebec :nonchilled 1000.0 39.7
8 2 :Quebec :nonchilled 95.0 13.6
9 2 :Quebec :nonchilled 175.0 27.3
10 2 :Quebec :nonchilled 250.0 37.1

Table

A table is converted to long form, where each dimension has its own column. If a dimension has no name, its column name is built from the :$col prefix and the dimension index (for example :$col-0). Values are stored in the last column, :$value.

UCBAdmissions

Dimensions with names.

:Admit :Gender :Dept :$value
Admitted Male A 512.0
Rejected Male A 313.0
Admitted Female A 89.0
Rejected Female A 19.0
Admitted Male B 353.0
Rejected Male B 207.0
Admitted Female B 17.0
Rejected Female B 8.0
Admitted Male C 120.0
Rejected Male C 205.0

crimtab

Dimensions without names

:$col-0 :$col-1 :$value
9.4 142.24 0
9.5 142.24 0
9.6 142.24 0
9.7 142.24 0
9.8 142.24 0
9.9 142.24 0
10 142.24 1
10.1 142.24 0
10.2 142.24 0
10.3 142.24 0
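
The long-form layout above can be imitated in plain Clojure. This is only a sketch of the idea, not the library's implementation: the function names `dim-key` and `table->long` are hypothetical, the result is a vector of row maps rather than a dataset, and it assumes the table values arrive in R's column-major order (first dimension varying fastest).

```clojure
(defn dim-key
  "Column key for dimension `idx`: the dimension's own name when present,
  otherwise an artificial :$col-<idx> keyword. (Hypothetical helper.)"
  [idx dim-name]
  (if (seq dim-name)
    (keyword dim-name)
    (keyword (str "$col-" idx))))

(defn table->long
  "Builds one row map per table cell. `dim-names` holds a name (or nil) per
  dimension, `levels` holds the labels of each dimension, and `values` is the
  flat cell data in column-major order. (Hypothetical helper.)"
  [dim-names levels values]
  (let [ks     (map-indexed dim-key dim-names)
        ;; All label combinations, with the first dimension varying fastest.
        combos (reduce (fn [acc lvls]
                         (for [l lvls, a acc] (conj a l)))
                       [[]]
                       levels)]
    (mapv (fn [combo v] (assoc (zipmap ks combo) :$value v))
          combos
          values)))

;; Usage: the first eight cells of the named table shown earlier.
(table->long ["Admit" "Gender" "Dept"]
             [["Admitted" "Rejected"] ["Male" "Female"] ["A" "B"]]
             [512.0 313.0 89.0 19.0 353.0 207.0 17.0 8.0])
```

Passing `nil` for a dimension name yields the artificial `:$col-0`, `:$col-1` keys, as in the unnamed example.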

Matrices, arrays, multidimensional arrays

The approach mirrors how R prints arrays: a higher-dimensional array is treated as a series of 2D matrices, each tagged with its indices in the remaining dimensions. The first two dimensions form the matrix; every further dimension is added as a column. If names are missing, artificial column names are generated. Row names are stored in the :$row.names column.

VADeaths

Matrix with row and column names

:$row.names   Rural Male   Rural Female   Urban Male   Urban Female
50-54         11.7         8.7            15.4         8.4
55-59         18.1         11.7           24.3         13.6
60-64         26.9         20.3           37.0         19.3
65-69         41.0         30.9           54.6         35.1
70-74         66.0         54.3           71.1         50.0
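
As a rough illustration of the matrix case, the sketch below turns a named matrix into row maps with a `:$row.names` entry. It is not the library's code: `matrix->rows` is a hypothetical name, and the flat `values` vector is assumed to be in R's column-major order (one full column after another).

```clojure
(defn matrix->rows
  "Builds one map per matrix row. Column names are used as keys unchanged,
  and the row name is stored under :$row.names. (Hypothetical helper.)"
  [row-names col-names values]
  (let [nrow    (count row-names)
        ;; Column-major data: every `nrow` consecutive values form one column.
        columns (partition nrow values)]
    (vec (map-indexed
          (fn [i row-name]
            (into {:$row.names row-name}
                  (map (fn [col-name col] [col-name (nth col i)])
                       col-names
                       columns)))
          row-names))))

;; Usage: the top-left 2x2 corner of the matrix shown above.
(matrix->rows ["50-54" "55-59"]
              ["Rural Male" "Rural Female"]
              [11.7 18.1 8.7 11.7])
```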

freeny-x

Matrix with column names

lag quarterly revenue   price index   income level   market potential
8.79636                 4.70997       5.82110        12.9699
8.79236                 4.70217       5.82558        12.9733
8.79137                 4.68944       5.83112        12.9774
8.81486                 4.68558       5.84046        12.9806
8.81301                 4.64019       5.85036        12.9831
8.90751                 4.62553       5.86464        12.9854
8.93673                 4.61991       5.87769        12.9900
8.96161                 4.61654       5.89763        12.9943
8.96044                 4.61407       5.92574        12.9992
9.00868                 4.60766       5.94232        13.0033

iris3

3d array, with names in second and third dimensions

:$col-0   Sepal L.   Sepal W.   Petal L.   Petal W.
Setosa    5.1        3.5        1.4        0.2
Setosa    4.9        3.0        1.4        0.2
Setosa    4.7        3.2        1.3        0.2
Setosa    4.6        3.1        1.5        0.2
Setosa    5.0        3.6        1.4        0.2
Setosa    5.4        3.9        1.7        0.4
Setosa    4.6        3.4        1.4        0.3
Setosa    5.0        3.4        1.5        0.2
Setosa    4.4        2.9        1.4        0.2
Setosa    4.9        3.1        1.5        0.1

5D array

Created with (r/r '(array ~(range 60) :dim [2 5 1 3 2]))

:$col-0 :$col-1 :$col-2 1 2 3 4 5
1 1 1 0.0 2.0 4.0 6.0 8.0
1 1 1 1.0 3.0 5.0 7.0 9.0
1 2 1 10.0 12.0 14.0 16.0 18.0
1 2 1 11.0 13.0 15.0 17.0 19.0
1 3 1 20.0 22.0 24.0 26.0 28.0
1 3 1 21.0 23.0 25.0 27.0 29.0
1 1 2 30.0 32.0 34.0 36.0 38.0
1 1 2 31.0 33.0 35.0 37.0 39.0
1 2 2 40.0 42.0 44.0 46.0 48.0
1 2 2 41.0 43.0 45.0 47.0 49.0
1 3 2 50.0 52.0 54.0 56.0 58.0
1 3 2 51.0 53.0 55.0 57.0 59.0
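
The layout of the 5D example can be reproduced with a small pure-Clojure sketch. It is an illustration under stated assumptions, not the library's implementation: `array->rows` is a hypothetical name, the data is assumed to be in R's column-major order, and each row is returned as a plain vector (indices of the extra dimensions first, then the matrix row).

```clojure
(defn array->rows
  "Splits an N-d array into 2D slices over the first two dimensions and
  prefixes every matrix row with the 1-based indices of the remaining
  dimensions. (Hypothetical helper.)"
  [dims values]
  (let [[nrow ncol & extra] dims
        ;; Index combinations of the extra dimensions, first one varying fastest.
        combos (reduce (fn [acc n]
                         (for [i (range 1 (inc n)), a acc] (conj a i)))
                       [[]]
                       extra)
        ;; Each slice holds nrow * ncol consecutive values.
        slices (partition (* nrow ncol) values)]
    (vec (mapcat (fn [combo slice]
                   (let [columns (partition nrow slice)]
                     (for [r (range nrow)]
                       (into combo (map #(nth % r) columns)))))
                 combos
                 slices))))

;; Usage: same dimensions and data as the 5D example.
(array->rows [2 5 1 3 2] (map double (range 60)))
```

With these inputs the first row is `[1 1 1 0.0 2.0 4.0 6.0 8.0]`, matching the first line of the listing above.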

1D Timeseries

A time series is stored in two columns:

  • :$time - the time index, stored as a float
  • :$series - the series values

BJsales

:$time :$series
1.0 200.1
2.0 199.5
3.0 199.4
4.0 198.9
5.0 199.0
6.0 200.2
7.0 198.6
8.0 200.0
9.0 200.3
10.0 201.2
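
The two-column layout can be sketched as follows, assuming the time index is derived from the series start and its sampling frequency (as R's `time()` does for a regular `ts` object). The function name `ts->columns` and its parameters are hypothetical, and the actual conversion may obtain the time values differently.

```clojure
(defn ts->columns
  "Returns a column map with a float time index and the series values.
  Observation i gets time start + i / frequency. (Hypothetical helper.)"
  [start frequency series]
  {:$time   (mapv #(+ start (/ % (double frequency)))
                  (range (count series)))
   :$series (vec series)})

;; Usage: a series starting at 1 with one observation per time unit.
(ts->columns 1.0 1 [200.1 199.5 199.4])
```

A quarterly series would use a frequency of 4, giving time steps of 0.25.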

Multidimensional timeseries

A multidimensional time series is converted like a multidimensional array, with an added :$time column.

EuStockMarkets

:$time DAX SMI CAC FTSE
1991.49615385 1628.75 1678.1 1772.8 2443.6
1991.50000000 1613.63 1688.5 1750.5 2460.2
1991.50384615 1606.51 1678.6 1718.0 2448.2
1991.50769231 1621.04 1684.1 1708.1 2470.4
1991.51153846 1618.16 1686.6 1723.1 2484.7
1991.51538462 1610.61 1671.6 1714.3 2466.8
1991.51923077 1630.75 1682.9 1734.5 2487.9
1991.52307692 1640.17 1703.6 1757.4 2508.4
1991.52692308 1635.47 1697.5 1754.0 2510.5
1991.53076923 1645.89 1716.3 1754.3 2497.4

Datatypes with time

(r/r "
   day <- c(\"20081101\", \"20081101\", \"20081101\", \"20081101\", \"18081101\", \"20081102\", \"20081102\", \"20081102\", \"20081102\", \"20081103\")
   time <- c(\"01:20:00\", \"06:00:00\", \"12:20:00\", \"17:30:00\", \"21:45:00\", \"01:15:00\", \"06:30:00\", \"12:50:00\", \"20:00:00\", \"01:05:00\")
   dts1 <- paste(day, time)
   dts2 <- as.POSIXct(dts1, format = \"%Y%m%d %H:%M:%S\")
   dts3 <- as.POSIXlt(dts1, format = \"%Y%m%d %H:%M:%S\")
   dts <- data.frame(posixct=dts2, posixlt=dts3)") 
:posixct :posixlt
2008-11-01T01:20+01:00[Europe/Warsaw] 2008-11-01T01:20+01:00[Europe/Warsaw]
2008-11-01T06:00+01:00[Europe/Warsaw] 2008-11-01T06:00+01:00[Europe/Warsaw]
2008-11-01T12:20+01:00[Europe/Warsaw] 2008-11-01T12:20+01:00[Europe/Warsaw]
2008-11-01T17:30+01:00[Europe/Warsaw] 2008-11-01T17:30+01:00[Europe/Warsaw]
1808-11-01T21:45+01:24[Europe/Warsaw] 1808-11-01T21:45+01:24[Europe/Warsaw]
2008-11-02T01:15+01:00[Europe/Warsaw] 2008-11-02T01:15+01:00[Europe/Warsaw]
2008-11-02T06:30+01:00[Europe/Warsaw] 2008-11-02T06:30+01:00[Europe/Warsaw]
2008-11-02T12:50+01:00[Europe/Warsaw] 2008-11-02T12:50+01:00[Europe/Warsaw]
2008-11-02T20:00+01:00[Europe/Warsaw] 2008-11-02T20:00+01:00[Europe/Warsaw]
2008-11-03T01:05+01:00[Europe/Warsaw] 2008-11-03T01:05+01:00[Europe/Warsaw]

Other

Harman23-cor

Named list

{:cov
 [1.0 0.846 0.805 0.859 0.473 0.398 0.301 0.382 0.846 1.0 0.881 0.826
  0.376 0.326 0.277 0.415 0.805 0.881 1.0 0.801 0.38 0.319 0.237 0.345 0.859
  0.826 0.801 1.0 0.436 0.329 0.327 0.365 0.473 0.376 0.38 0.436 1.0 0.762
  0.73 0.629 0.398 0.326 0.319 0.329 0.762 1.0 0.583 0.577 0.301 0.277 0.237
  0.327 0.73 0.583 1.0 0.539 0.382 0.415 0.345 0.365 0.629 0.577 0.539 1.0],
 :center [0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0],
 :n.obs [305.0]}

Partially named list

{:a [11.0], :b [22.0], [[3]] [33.0], [[4]] [44.0], :e [55.0], :f [66.0], [[7]] [77.0], [[8]] [88.0], :i [99.0]}