How to read specific columns of a CSV when given the header as a vector

Question

I have a large CSV file without a header row, and the header is available to me as a vector. I want to use a subset of the file's columns without loading the entire file. The required subset of columns is provided as a separate vector.

Edit: in this case, the column names provided in the header list are important. This MRE only has 4 column names, but the solution should work for a large dataset with pre-specified column names. The catch is that the column names are only provided externally, not as a header in the CSV file.

1,2,3,4
5,6,7,8
9,10,11,12

header <- c("A", "B", "C", "D")
subset <- c("D", "B")

So far I have been reading the data in the following manner, which gets me the result I want, but loads the entire file first.

# Setup

library(readr)

write.table(
  structure(list(V1 = c(1L, 5L, 9L), V2 = c(2L, 6L, 10L), V3 = c(3L, 7L, 11L), V4 = c(4L, 8L, 12L)), class = "data.frame", row.names = c(NA, -3L)),
  file="sample-data.csv",
  row.names=FALSE,
  col.names=FALSE,
  sep=","
)

header <- c("A", "B", "C", "D")
subset <- c("D", "B")

# Current approach

df1 <- read_csv(
  "sample-data.csv",
  col_names = header
)[subset]

df1

# A tibble: 3 × 2
      D     B
  <dbl> <dbl>
1     4     2
2     8     6
3    12    10

How can I get the same result without loading the entire file first?

Related questions

"Only read selected columns" includes the header in the first row.
"Ways to read only select columns from a file into R? (A happy medium between read.table and scan?)" [duplicate] does not specify column names outside the file, and the answers do not apply to this situation.
"how to skip reading certain columns in readr" [duplicate] is different because it seems to be about skipping an unknown first column and reading a known second and third column across multiple files. Data types are not necessarily known in advance in this question.
"Is there a way to omit the first column when reading a csv" [duplicate]: column is skipped based on position, not position in an externally provided list of column names.

Darren Tsai · Accepted Answer · 2024-03-15 15:29:35Z

2

You can use readr::read_csv with col_names and col_select arguments.

header <- c("A", "B", "C", "D")
subset <- c("D", "B")

readr::read_csv("sample-data.csv",
                col_names = header,
                col_select = any_of(subset))

# # A tibble: 3 × 2
#       D     B
#   <dbl> <dbl>
# 1     4     2
# 2     8     6
# 3    12    10


Thank you for your answer. Is there a reason to use any_of instead of all_of?
– Joshua Shew
Commented Mar 15 at 15:29
1

@JoshuaShew all_of() is for strict selection: if any column name in subset does not exist in the imported data, you will get an error. any_of() is the lenient option that avoids such errors by ignoring missing names. In your case, either any_of() or all_of() is okay!
– Darren Tsai
Commented Mar 15 at 15:43
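
To illustrate the difference described in the comment above, here is a minimal sketch. It assumes readr 2.0 or later (for col_select and show_col_types), recreates the question's sample data in a temporary file, and uses a hypothetical vector `wanted` that contains a name, "Z", which is not in the header.

```r
library(readr)

# Recreate the question's headerless sample file in a temporary location.
path <- tempfile(fileext = ".csv")
writeLines(c("1,2,3,4", "5,6,7,8", "9,10,11,12"), path)

header <- c("A", "B", "C", "D")
wanted <- c("D", "B", "Z")  # hypothetical: "Z" is not one of the header names

# any_of() ignores names that are not present, so only D and B are read.
lenient <- read_csv(path, col_names = header, col_select = any_of(wanted),
                    show_col_types = FALSE)
names(lenient)

# all_of() is strict: the missing name "Z" raises an error, caught here.
strict <- tryCatch(
  read_csv(path, col_names = header, col_select = all_of(wanted),
           show_col_types = FALSE),
  error = function(e) "error"
)
strict
```

If every requested name must be present, all_of() is probably the safer choice, because a typo in the list then fails loudly instead of silently dropping a column.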
1

Noted! This answer works best for me because the parameters use the header and subset variables instead of copy-pasting the values. The explanation for when to use any_of() vs all_of() was also helpful.
– Joshua Shew
Commented Mar 15 at 15:57
@stefan_aus_hannover When I set col_names = header, the column names A,B,C,D have been assigned to the data, so col_select consequently works. I think your suggestion is not correct here.
– Darren Tsai
Commented Mar 15 at 16:01


stefan_aus_hannover · Answer · 2024-03-15 17:10:52Z

Before the OP's edit about the header

You don't have to read the entire file at once: read_csv() has a col_select argument for this. You would just need to modify your code to

df1 <- read_csv(
  "sample-data.csv",
  col_select=c("D","B")
)

After edit

df1 <- read_csv(
  "c:/data/56791/originals/test.csv",
  col_names = c("A","B","C","D"),
  col_select = c(4,2)
)

If you pass previously defined vectors as the arguments, as in the OP's question, you need to follow Darren's answer and wrap the selection in any_of(), or you will receive the warning message

Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.

Important note: col_names must contain one name for each column in the CSV file, or you will get the error

! Names repair functions can't return `NA` values.

Théodore Targerian · Answer · 2024-03-15 15:09:34Z

0

If you use read_csv() from the readr package, the col_select argument lets you choose which columns to read.


cristian-vargas · Answer · 2024-03-15 15:11:20Z

0

The readr::read_csv() function has an argument called col_select that allows you to specify which columns to read using the same language as dplyr::select(). So in practice, this looks like:

df1 <- readr::read_csv(
  file = "sample-data.csv",
  col_names = header,
  col_select = c(D, B)
)

Which then gives the desired output:

# A tibble: 3 × 2
      D     B
  <dbl> <dbl>
1     4     2
2     8     6
3    12    10

You can also call attr(df1, "spec") which confirms that columns A and C were skipped when reading the file.
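
As a rough sketch of that check, assuming readr 2.0 or later and using a temporary copy of the question's sample data (the name `col_spec` is only illustrative):

```r
library(readr)

# Recreate the question's headerless sample file in a temporary location.
path <- tempfile(fileext = ".csv")
writeLines(c("1,2,3,4", "5,6,7,8", "9,10,11,12"), path)

header <- c("A", "B", "C", "D")

# Read only columns D and B, naming the columns from the external header.
df1 <- read_csv(path, col_names = header, col_select = c(D, B),
                show_col_types = FALSE)

# The column specification readr recorded for this read; per the answer,
# the unselected columns A and C should be listed there as skipped.
col_spec <- attr(df1, "spec")
print(col_spec)
```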


Thank you for your answer. I was hoping to pass the variable subset in as the col_select parameter, but I got the following warning: FAQ - Note: Using an external vector in selections is ambiguous. It seems that the approach here is to use col_select = all_of(subset) instead.
– Joshua Shew
Commented Mar 15 at 15:28
