Read and Combine Multiple Files with Node fs.readFile
This blog post has an interesting inspiration point. Last week, someone in one of my Slack channels posted a coding challenge he'd received for a developer position with an insurance technology company.
It piqued my interest, as the challenge involved reading through very large files of data from the Federal Elections Commission and displaying back specific data from those files. Since I've not worked much with raw data, and I'm always up for a new challenge, I decided to tackle this with Node.js and see if I could complete the challenge myself, for the fun of it.
Here are the four questions asked, and a link to the data set the program was to parse through.
- Write a program that will print out the total number of lines in the file.
- Notice that the 8th column contains a person's name. Write a program that loads in this data and creates an array with all name strings. Print out the 432nd and 43243rd names.
- Notice that the 5th column contains a form of date. Count how many donations occurred in each month and print out the results.
- Notice that the 8th column contains a person's name. Create an array with each first name. Identify the most common first name in the data and how many times it occurs.
Link to the data: https://www.fec.gov/files/bulk-downloads/2018/indiv18.zip
When you unzip the folder, you should see one main .txt file that's 2.55GB and a folder containing smaller pieces of that main file (which is what I used while testing my solutions before moving to the main file).
Not too terrible, right? Seems doable. So let's talk about how I approached this.
The Two Original Node.js Solutions I Came Up With
Processing large files is nothing new to JavaScript; in fact, the core functionality of Node.js includes a number of standard solutions for reading from and writing to files.
The most straightforward is fs.readFile(), wherein the whole file is read into memory and then acted upon once Node has read it. The second option is fs.createReadStream(), which streams the data in (and out), similar to other languages like Python and Java.
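For a quick sense of the two shapes, here's a minimal sketch (the file path is just a placeholder):

```javascript
const fs = require('fs');

// Option 1: fs.readFile() — the whole file is read into memory first,
// then handed to the callback in one piece.
fs.readFile('data/itcont.txt', 'utf8', (err, contents) => {
  if (err) throw err;
  console.log(`Read ${contents.length} characters`);
});

// Option 2: fs.createReadStream() — the file is emitted in chunks as it's read.
fs.createReadStream('data/itcont.txt', 'utf8')
  .on('data', (chunk) => {
    // handle each chunk as it arrives
  })
  .on('end', () => console.log('Done streaming'));
```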
The Solution I Chose to Run With & Why
Since my solution needed to involve such things as counting the total number of lines and parsing through each line to get donation names and dates, I chose to use the second method: fs.createReadStream(). Then, I could use the rl.on('line', ...) function to get the necessary data from each line as I streamed through the document.
It seemed easier to me than having to split apart the whole file once it was read in and run through the lines that way.
Node.js CreateReadStream() & ReadFile() Code Implementation
Below is the code I came up with using Node.js's fs.createReadStream() function. I'll break it down below.
The very first things I had to do to set this up were import the required functions from Node.js: fs (file system), readline, and stream. These imports allowed me to then create an instream and outstream and then the readline.createInterface(), which would let me read through the stream line by line and print out data from it.
I also added some variables (and comments) to hold various bits of data: a lineCount, a names array, a donation array and object, and a firstNames array and dupeNames object. You'll see where these come into play a little later.
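Roughly, that setup looks something like this (a minimal sketch; the file path and the exact variable names are my assumptions, not the original code verbatim):

```javascript
const fs = require('fs');
const readline = require('readline');
const stream = require('stream');

// Set up the in/out streams and the line-by-line reader.
const instream = fs.createReadStream('data/itcont.txt');
const outstream = new stream.Writable();
const rl = readline.createInterface(instream, outstream);

// Variables to hold the various bits of data as the file streams through.
let lineCount = 0;             // total number of lines
const names = [];              // every full name from the name column
const firstNames = [];         // just the first names
const dateDonationCount = {};  // donation counts keyed by YYYY-MM
const dupeNames = {};          // first name -> number of occurrences
```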
Inside of the rl.on('line', ...) function, I was able to do all of my line-by-line data parsing. In there, I incremented the lineCount variable for each line it streamed through. I used the JavaScript split() method to parse out each name and added it to my names array. I further reduced each name down to just the first name, while accounting for middle initials, multiple names, etc., with the help of the JavaScript trim(), includes() and split() methods. And I sliced the year and month out of the date column, reformatted those into a more readable YYYY-MM format, and added them to the dateDonationCount array.
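A sketch of that line handler might look like the following, assuming a pipe-delimited file where the name sits in column 8 as "LAST, FIRST MIDDLE" and the date sits in column 5 as an MMDDYYYY value (the real file's layout may differ slightly):

```javascript
rl.on('line', (line) => {
  lineCount++;

  const columns = line.split('|');

  // Column 8 (index 7): the donor's name, roughly "LAST, FIRST MIDDLE".
  const name = columns[7];
  names.push(name);

  // Reduce the full name to just a first name, dropping middle initials, etc.
  if (name && name.includes(',')) {
    let firstName = name.split(',')[1].trim(); // the part after the comma
    firstName = firstName.split(' ')[0];       // drop middle initials / extra names
    firstNames.push(firstName);
    dupeNames[firstName] = (dupeNames[firstName] || 0) + 1;
  }

  // Column 5 (index 4): a date in MMDDYYYY form; reformat to YYYY-MM.
  const rawDate = columns[4];
  if (rawDate && rawDate.length === 8) {
    const yearMonth = `${rawDate.slice(4, 8)}-${rawDate.slice(0, 2)}`;
    dateDonationCount[yearMonth] = (dateDonationCount[yearMonth] || 0) + 1;
  }
});
```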
In the rl.on('close', ...) function, I did all the transformations on the data I'd gathered into arrays and console.logged out all my data for the user to see.
The lineCount and the names at the 432nd and 43,243rd index required no further manipulation. Finding the most common name and the number of donations for each month was a little trickier.
For the most common first name, I first had to create an object of key-value pairs for each name (the key) and the number of times it appeared (the value), then I transformed that into an array of arrays using the ES6 function Object.entries(). From there, it was a simple task to sort the names by their value and print the largest value.
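Put together, the close handler might look something like this sketch, built on the dupeNames object from above (the zero-based indexing for the 432nd and 43,243rd names is my assumption):

```javascript
rl.on('close', () => {
  console.log(`Total lines: ${lineCount}`);
  console.log(`432nd name: ${names[431]}`);
  console.log(`43,243rd name: ${names[43242]}`);

  // Turn the { name: count } object into [name, count] pairs and sort by count.
  const sortedNames = Object.entries(dupeNames).sort((a, b) => b[1] - a[1]);
  const [mostCommonName, occurrences] = sortedNames[0];
  console.log(`Most common first name: ${mostCommonName} (${occurrences} times)`);
});
```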
Donations also required me to make a similar object of key-value pairs and create a logDateElements() function, where I could nicely use ES6's string interpolation to display the keys and values for each donation month. Then I created a new Map() transforming the dateDonations object into an array of arrays, and looped through each array calling the logDateElements() function on it. Whew! Not quite as simple as I first thought.
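For the month counts, still inside that same close handler, something like this sketch would do it (logDateElements() here is a hypothetical helper matching the description above, and dateDonationCount is the object built in the line handler):

```javascript
// Pretty-print one month's donation count using string interpolation.
function logDateElements(month, count) {
  console.log(`Donations made in ${month}: ${count}`);
}

// Inside rl.on('close', ...): turn the { 'YYYY-MM': count } object into a Map
// and log each entry.
const donationMonths = new Map(Object.entries(dateDonationCount));
donationMonths.forEach((count, month) => logDateElements(month, count));
```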
But it worked. At least with the smaller 400MB file I was using for testing…
After I'd done that with fs.createReadStream(), I went back and also implemented my solutions with fs.readFile() to see the differences. Here's the code for that, but I won't go through all the details here — it's pretty similar to the first snippet, just more synchronous looking (unless you use the fs.readFileSync() function, JavaScript will run this code just as asynchronously as all its other code, not to worry).
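As a sketch, the readFile() version boils down to reading everything, splitting on newlines, and reusing the same per-line parsing (same column-layout assumptions as before):

```javascript
const fs = require('fs');

fs.readFile('data/itcont.txt', 'utf8', (err, contents) => {
  if (err) throw err;

  // With readFile() the whole file is already in memory, so split it into lines...
  const lines = contents.split('\n');
  console.log(`Total lines: ${lines.length}`);

  // ...then run the same per-line name and date handling as the streaming version.
  lines.forEach((line) => {
    const columns = line.split('|');
    // same name / date parsing as before
  });
});
```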
If you'd like to see my full repo with all my code, you can see it here.
Initial Results from Node.js
With my working solution in place, I added the file path for the 2.55GB monster file into my readFileStream.js file, and watched my Node server crash with a JavaScript heap out of memory error.
As it turns out, although Node.js is streaming the file input and output, in between it is still attempting to hold the entire file contents in memory, which it can't do with a file that size. Node can hold up to about 1.5GB in memory at one time, but no more.
So neither of my current solutions was up for the full challenge.
I needed a new solution. A solution for even larger datasets running through Node.
The New Data Streaming Solution
I found my solution in the form of EventStream, a popular NPM module with over 2 million weekly downloads and a promise "to make creating and working with streams easy".
With a little help from EventStream's documentation, I was able to figure out how to, once again, read the file line by line and do what needed to be done, hopefully in a more CPU-friendly way for Node.
EventStream Code Implementation
Here's my new code using the NPM module EventStream.
The biggest change was the pipe commands at the beginning of the file — all of that syntax is the way EventStream's documentation recommends you break up the stream into chunks delimited by the \n character at the end of each line of the .txt file.
The only other thing I had to change was the names answer. I had to fudge that a little bit, since if I tried to add all 13MM names into an array, I again hit the out-of-memory issue. I got around it by only collecting the 432nd and 43,243rd names and adding them to their own array. Not quite what was being asked, but hey, I had to get a little creative.
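Here's a pared-down sketch of that EventStream approach (the file path and pipe-delimited column indexes are assumptions carried over from the earlier sketches, not the exact code from my repo):

```javascript
const fs = require('fs');
const es = require('event-stream');

let lineCount = 0;
const names = []; // only the two requested names, to keep memory use down

fs.createReadStream('data/itcont.txt')
  // Break the stream into chunks delimited by the \n at the end of each line.
  .pipe(es.split('\n'))
  .pipe(
    es.mapSync((line) => {
      lineCount++;

      // Only collect the 432nd and 43,243rd names instead of all ~13MM of them.
      if (lineCount === 432 || lineCount === 43243) {
        names.push(line.split('|')[7]);
      }

      // ...the first-name and YYYY-MM date counting from the earlier sketch
      // would go here, unchanged.
    })
  )
  .on('error', (err) => console.error('Error while reading the file.', err))
  .on('end', () => {
    console.log(`Total lines: ${lineCount}`);
    console.log(`Names at 432 and 43,243: ${names.join(' and ')}`);
  });
```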
Results from Node.js & EventStream: Round 2
Ok, with the new solution implemented, I once again fired up Node.js with my 2.55GB file and my fingers crossed that this would work. Check out the results.
Success!
Conclusion
In the end, Node.js's pure file and big data handling functions fell a little short of what I needed, but with only one extra NPM package, EventStream, I was able to parse through a massive dataset without crashing the Node server.
Stay tuned for part 2 of this series, where I compare my three different ways of reading data in Node.js with performance testing to see which one is truly superior to the others. The results are pretty eye-opening — especially as the data gets larger…
Thanks for reading. I hope this gives you an idea of how to handle large amounts of data with Node.js. Claps and shares are very much appreciated!
If you enjoyed reading this, you may also enjoy some of my other blogs:
- Postman vs. Insomnia: Comparing the API Testing Tools
- How to Use Netflix's Eureka and Spring Cloud for Service Registry
- Jib: Getting Expert Docker Results Without Any Knowledge of Docker
Source: https://itnext.io/using-node-js-to-read-really-really-large-files-pt-1-d2057fe76b33