
So I've got a data file (semicolon separated) that has a lot of detail and incomplete rows (leading Access and SQL to choke). It's a county-level data set broken into segments, sub-segments, and sub-sub-segments (for a total of ~200 factors) for 40 years. In short, it's huge, and it's not going to fit into memory if I try to simply read it.

So my question is this: given that I want all the counties but only a single year (and just the highest level of segment, leading to about 100,000 rows in the end), what would be the best way to go about getting this rollup into R? Currently I'm trying to chop out irrelevant years with Python, getting around the file-size limit by reading and operating on one line at a time, but I'd prefer an R-only solution (CRAN packages OK). Is there a similar way to read in files a piece at a time in R?

Constraints:
- Needs to use my machine, so no EC2 instances.
- Speed and resources are not concerns in this case.
- As you can see below, the data contains mixed types, which I need to operate on later.
- The data is 3.5GB, with about 8.5 million rows and 17 columns. A couple thousand rows (~2k) are malformed, with only one column instead of 17; these are entirely unimportant and can be dropped. I only need ~100,000 rows out of this file (see below).

Data example:
    County  State  Year  Quarter  Segment  Sub-Segment  Sub-Sub-Segment  GDP ...
    Ada County  NC  2009  4  FIRE  Financial  Banks  80.1 ...
    Ada County  NC  2010  1  FIRE  Financial  Banks  82.5 ...

I want to chop out some columns and pick two out of 40 available years (2009-2010 from 1980-2020), so that the data can fit into R:
    County  State  Year  Quarter  Segment  GDP ...
After tinkering with all the suggestions made, I decided that readLines, suggested by JD and Marek, would work best. I gave Marek the check because he gave a sample implementation. I've reproduced a slightly adapted version of Marek's implementation for my final answer here, using strsplit and cat to keep only the columns I want. It should also be noted this is MUCH less efficient than Python: Python chomps through the 3.5GB file in 5 minutes while R takes about 60. But if all you have is R, then this is the ticket. The key lines of the adapted version:

    # Open a connection separately to hold the cursor position
    cat(line.split[[1]][keep.cols], sep = ' ', file = file.out, fill = TRUE)  # keep.cols stands in for the indices of the columns kept
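For concreteness, here is a minimal end-to-end sketch of what that adapted readLines/strsplit/cat loop could look like; the file paths, the position of the Year column, and the choice of columns to keep are assumptions for illustration, not the exact values from the original implementation.

    ## Sketch only: paths, the Year column position, and keep.cols are assumptions.
    file.in  <- file("raw_data.txt", "rt")        # hypothetical input file
    file.out <- file("chopped_data.txt", "wt")    # hypothetical output file
    keep.cols <- c(1:5, 8)                        # illustrative choice of columns to keep
    repeat {
      line <- readLines(file.in, n = 1)
      if (length(line) == 0) break                # end of file
      line.split <- strsplit(line, ";")[[1]]
      if (length(line.split) < 17) next           # drop the ~2k malformed one-column rows
      if (line.split[3] %in% c("2009", "2010"))   # keep only the two target years
        cat(line.split[keep.cols], sep = " ", file = file.out, fill = TRUE)
    }
    close(file.in); close(file.out)

Keeping only one line in memory at a time is what keeps the footprint flat, and the per-line overhead is also a plausible reason the R version takes roughly an hour where Python takes five minutes.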
For reference, notes on the other approaches I tinkered with:

- sqldf: This is definitely what I'll use for this type of problem in the future if the data is well-formed (a sketch of this route follows the list). However, if it's not, then SQLite chokes.
- MapReduce: To be honest, the docs intimidated me on this one a bit, so I didn't get around to trying it. It looked like it required the object to be in memory as well, which would defeat the point if that were the case.
- bigmemory: This approach cleanly linked to the data, but it can only handle one type at a time. As a result, all my character vectors dropped when put into a big.table. If I need to design large data sets for the future, though, I'd consider only using numbers just to keep this option alive.
- scan: Scan seemed to have similar type issues as bigmemory, but with all the mechanics of readLines. In short, it just didn't fit the bill this time.
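To illustrate the sqldf option, here is a rough sketch using sqldf's read.csv.sql to filter during the import; the file name, separator, and column names are assumptions taken from the data example above, not code from the post.

    ## Sketch only: assumes a well-formed, semicolon-separated file with a header row.
    library(sqldf)
    gdp <- read.csv.sql(
      "raw_data.txt",                             # hypothetical path
      sql    = "select County, State, Year, Quarter, Segment, GDP
                from file where Year in (2009, 2010)",
      header = TRUE,
      sep    = ";"
    )

As noted above, this only works if the file is well-formed; the ~2k single-column rows are exactly the sort of thing that can make the SQLite import choke.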
Is there a similar way to read in files a piece at a time in R? The readChar() function will read in a block of characters without assuming they are null-terminated. If you want to read data in a line at a time you can use readLines(). If you read a block or a line, do an operation, then write the data out, you can avoid the memory issue.
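To make that read-a-block, operate, write-it-out pattern concrete, here is a rough block-wise variant of the earlier line-by-line sketch; the batch size, file paths, and the regex used as a year filter are illustrative assumptions.

    ## Sketch only: batch size, paths, and the year filter pattern are assumptions.
    con.in  <- file("raw_data.txt", "rt")
    con.out <- file("filtered_data.txt", "wt")
    repeat {
      chunk <- readLines(con.in, n = 100000)      # read a block of lines
      if (length(chunk) == 0) break               # end of file
      keep <- grepl(";(2009|2010);", chunk)       # crude filter on the raw text
      writeLines(chunk[keep], con.out)            # write the reduced block straight out
    }
    close(con.in); close(con.out)

Only one batch of lines is ever held in memory, so the full 3.5GB never has to be loaded at once.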
Although if you feel like firing up a big-memory instance on Amazon's EC2, you can get up to 64GB of RAM; that should hold your file plus plenty of room to manipulate the data. If you need more speed, then Shane's recommendation to use Map Reduce is a very good one. However, if you go the route of using a big-memory instance on EC2, you should look at the multicore package for using all the cores on a machine. And if you find yourself wanting to read many gigs of delimited data into R, you should at least research the sqldf package, which allows you to import directly into sqldf from R and then operate on the data from within R. I've found sqldf to be one of the fastest ways to import gigs of data into R, as mentioned in this previous question.
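If one did go the many-core route, a sketch of the parallel angle might look like this; the answer names the multicore package, whose mclapply() is now part of base R's parallel package (and uses more than one core only on Unix-alikes), and the chunk directory, worker count, and read.table call here are assumptions.

    ## Sketch only: assumes the big file was already split into smaller chunk files.
    library(parallel)                             # mclapply() lives here in current R
    chunk.files <- list.files("chunks", full.names = TRUE)   # hypothetical pre-split pieces
    read.chunk  <- function(f)
      read.table(f, sep = ";", header = FALSE, stringsAsFactors = FALSE)
    pieces <- mclapply(chunk.files, read.chunk, mc.cores = 4) # one chunk per core
    gdp    <- do.call(rbind, pieces)              # reassemble into a single data frame

This complements rather than replaces the chunked-reading approaches above, since the file still has to be cut into manageable pieces first.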