R - Why is allow.cartesian required at times when joining data.tables with duplicate keys?


I am trying to understand the logic of the J() lookup when there are duplicate keys in a data.table in R.

Here's a little experiment I have tried:

    library(data.table)
    options(stringsAsFactors = FALSE)

    # keyed table with a duplicated key ("c" appears twice)
    x <- data.table(keyvar = c("a", "b", "c", "c"),
                    value  = c(  1,   2,   3,   4))
    setkey(x, keyvar)

    y1 <- data.frame(name = c("d", "c", "a"))
    x[J(y1$name), ] ## ok

    y2 <- data.frame(name = c("d", "c", "a", "b"))
    x[J(y2$name), ] ## error: see below

    # same lookup against a table with unique keys
    x2 <- data.table(keyvar = c("a", "b", "c"),
                     value  = c(  1,   2,   3))
    setkey(x2, keyvar)
    x2[J(y2$name), ] ## ok

The error message I am getting is:

    Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x),  :
      Join results in 5 rows; more than 4 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

I don't really understand this. I know I should avoid duplicate keys in a lookup function, but I want to gain some insight so that I won't make this error in the future.

Thanks a ton for the help. This is a great tool.

You don't have to avoid duplicate keys. As long as the result is not bigger than max(nrow(x), nrow(i)), you won't get this error, even if you have duplicates. It is basically a precautionary measure.

When you have duplicate keys, the resulting join can sometimes get much bigger. Since data.table knows the total number of rows that will result from the join early enough, it raises this error and asks you to pass the argument allow.cartesian=TRUE if you are really sure.
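To see the check in action on something small, here is a minimal sketch. The table names small_x and small_i are made up for illustration, and the sizes are chosen so that the result (30 rows) is well above the size of either input; the exact threshold may differ between data.table versions, so treat this as an illustration rather than a specification.

```r
library(data.table)

# Hypothetical tables: 10 rows sharing the key "b", looked up 3 times.
small_x <- data.table(keyvar = rep("b", 10), value = 1:10, key = "keyvar")
small_i <- data.table(keyvar = rep("b", 3))

# Each of the 3 rows in small_i matches all 10 rows in small_x,
# so the join would produce 3 * 10 = 30 rows and is refused by default.
res <- tryCatch(small_x[small_i],
                error = function(e) conditionMessage(e))
is.character(res)  # TRUE when the join stopped with an error message

# Opting in explicitly lets the join go ahead.
n_rows <- nrow(small_x[small_i, allow.cartesian = TRUE])
n_rows
```

Wrapping the first join in tryCatch() only serves to capture the message instead of halting the script.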

Here's an (exaggerated) example that illustrates the idea behind this error message:

    require(data.table)
    # 100 rows with key "a" and 1e7 rows with key "b"
    dt1 <- data.table(x = rep(letters[1:2], c(1e2, 1e7)),
                      y = 1L, key = "x")
    # the key "b" repeated 3 times
    dt2 <- data.table(x = rep("b", 3), key = "x")

    # not run
    # dt1[dt2] ## error

    dim(dt1[dt2, allow.cartesian = TRUE])
    # [1] 30000000        2

The duplicates in dt2 resulted in 3 times the total number of "b" rows in dt1 (=1e7). Imagine if you performed the join with 1e4 values in dt2: the results would explode! To avoid this, there's the allow.cartesian argument, which is FALSE by default.
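If you want to know how big such a join would get before running it, one option is to count the rows per key on each side and multiply. The sketch below uses much smaller, made-up tables (big and lookup are illustrative names) and only accounts for keys that appear on both sides; lookup values with no match are not counted here.

```r
library(data.table)

# Hypothetical scaled-down tables: 5 "a" rows and 20 "b" rows,
# looked up with the key "b" repeated 3 times.
big    <- data.table(x = rep(letters[1:2], c(5, 20)), y = 1L, key = "x")
lookup <- data.table(x = rep("b", 3), key = "x")

# Rows per key on each side.
n_big    <- big[, .(n_x = .N), by = x]
n_lookup <- lookup[, .(n_i = .N), by = x]

# Each matching key contributes n_x * n_i rows to the join.
est <- merge(n_big, n_lookup, by = "x")
expected_rows <- sum(est$n_x * est$n_i)  # 20 * 3

# Compare with the size of the actual join.
actual_rows <- nrow(big[lookup, allow.cartesian = TRUE])
c(expected = expected_rows, actual = actual_rows)
```

The per-key counts are cheap to compute, so this is a quick way to decide whether passing allow.cartesian=TRUE is reasonable for your data.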

That being said, I think Matt once mentioned that it may be possible to raise the error only in the case of "large" joins (or joins that result in a huge number of rows, a limit which might be set arbitrarily, I guess). This, when/if implemented, will let the join run without this error message in the case of joins that don't combinatorially explode.

