
i know more than i ever wanted to about programmatically composed node streams and now so do you

Wed, 3 May 2017

When I accepted a developer relations role, the one I'm in currently, one of the last things I expected was that it would be the reason I finally did some serious development with Node streams. For the past month, though, I've been designing and writing a small framework that runs everything with streams, and as a result I know more about them than I ever thought I would.

Pretty much since I started working with Node, people have been saying to me how streams are wonderful and in the future everything will be streams and we'll have world peace and universal basic income and all the backpressure we could ever desire. I just didn't see the point. Most of JavaScript's life we've had ways to deal with potentially infinite input (like a webpage) and haven't needed a formal concept of streaming. We just used event listeners. Eventually, however, I was forced to admit that sometimes data (like life) comes at you fast, and it's nice to have things like buffering and backpressure already included in a fancy formal stream so you don't have to write all that shit yourself. Although I wouldn't call myself a streams convert, I can see that there are times when they're an interesting tool for the job.

I'm not here to convince you, though. One thing I noticed very quickly in trying to develop with streams was that almost all of the documentation assumes your streams are hard-coded. If I wanted to, let's say, dynamically compose a pipeline of functions that I secretly turned into streams under the hood to hide complexity from people developing against my little framework, there was very little to help me.

I did want to do that, and in case someone else ever wants to do something similar, here's what I learned:

there's some inconsistency in how streams are used

Depending on the type of stream, there's approximately one function the stream gets from its parent class that's very important to its implementation. For readable streams it's push. For writable streams it's an argument helpfully named cb, next, or callback in most documentation that you need to call at various times in various ways to accomplish various outcomes. For transforms it's actually both. If you want to throw an error in a writable you stick the error in the callback. If you're in a readable, however, you need to emit an error event.
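To make that concrete, here's a bare-bones sketch of all three types in object mode (the stream names and data are made up for illustration; the commented lines show the error paths):

```js
const { Readable, Writable, Transform } = require('stream');

// Readable: data goes out via push(); errors have to be emitted as events.
const source = new Readable({
  objectMode: true,
  read() {
    this.push({ hello: 'world' });
    this.push(null); // no more data
    // this.emit('error', new Error('nope')); // how a readable complains
  }
});

// Writable: you signal "done with this chunk" (or an error) via the callback.
const sink = new Writable({
  objectMode: true,
  write(chunk, encoding, callback) {
    console.log('got', chunk);
    callback(); // all good
    // callback(new Error('nope')); // how a writable complains
  }
});

// Transform: it's both. push() for output, the callback for "done" and errors.
const noop = new Transform({
  objectMode: true,
  transform(chunk, encoding, callback) {
    this.push(chunk);
    callback();
  }
});

source.pipe(noop).pipe(sink);
```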

If the only dynamic thing you're doing with streams is the act of chaining them together, or the streams you're chaining together are already proper streams where the authors have taken those slight differences into account, that's just fine. If you're trying to accept pure functionality and simply use streams as a mechanism to run it, however, it's fucking annoying. For any piece of software other people will develop against, it's just good manners to offer as consistent a contract between the framework or tool and the implementation as possible. Since my use case allowed me to define a signature for the functions it accepted whose consistency was more to my liking, I took the opportunity to standardize those main functions as push, done, and error and send them all down as properties of a single object so they could all be called the same way.
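Something along these lines, roughly (the toStream name and the exact function signature here are hypothetical, not what my framework literally ships):

```js
const { Transform } = require('stream');

// Hypothetical wrapper: accept a plain function and run it inside a transform,
// handing it push, done, and error as properties of one object.
function toStream(fn) {
  return new Transform({
    objectMode: true,
    transform(chunk, encoding, callback) {
      fn(chunk, {
        push: (data) => this.push(data),
        done: () => callback(),
        error: (err) => callback(err)
      });
    }
  });
}

// Implementors write plain functions and never touch the streams API directly:
const shout = toStream((chunk, { push, done }) => {
  push(String(chunk).toUpperCase());
  done();
});
// somewhere else: source.pipe(shout).pipe(destination)
```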

null has some surprising effects

You could probably be forgiven if you consider null to be a less-threatening cousin of undefined. Often null is just empty, while undefined will blow up all your shit. Not so in streamsland. If you have a data source that may sometimes return null data, you have to make sure not to pass that along, assuming you want your pipeline to continue piping. Pushing null is a semi-secret signal that a readable stream has no more data left to give. Conversely, when you call the callback in a writable, passing it anything truthy causes an error to be emitted. As above, if you were writing your own interface to Node streams you could un-overload these functions to let implementors write more declarative code, but that's probably not cool anymore and I'm rude for suggesting it.
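A small sketch of both gotchas, with a hard-coded array standing in for whatever flaky data source hands you nulls:

```js
const { Readable, Writable } = require('stream');

const records = [1, 2, null, 3]; // a source that sometimes hands back null

const source = new Readable({
  objectMode: true,
  read() {
    while (records.length) {
      const record = records.shift();
      if (record === null) continue;           // don't pass it along; push(null) means "I'm done forever"
      if (this.push(record) === false) return; // respect backpressure, wait to be asked again
    }
    this.push(null);                           // now we actually mean it: no more data
  }
});

const sink = new Writable({
  objectMode: true,
  write(chunk, encoding, callback) {
    console.log('wrote', chunk);
    callback(null);    // fine: this chunk went through
    // callback('ugh') // anything truthy here gets emitted as an error
  }
});

source.pipe(sink);
```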

pausing doesn't necessarily work like you think

In almost any movie where someone comes up with a way to reanimate the dead, they come back slightly off. This is also what happens with your data when you pause and resume a stream. If something in the pipeline is still pushing data during that time, it gets packed up in something called a WriteReq object that I can't find any documentation for. I can tell you my experience, though, which is that upon resuming the stream you'll get an array of those in your writev function and they'll all have a chunk property which will contain the data you missed (that maybe you missed on purpose and so choose to ignore, it's your life). One caveat is that if you were using streams with objectMode left set to false, this might be different since you'd still be working with buffers. I haven't checked, but feel free to @ me if you have!
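For what it's worth, here's the shape of the data I'm talking about. This sketch uses cork() and uncork() to force writes to pile up, which isn't exactly the pause-and-resume situation above, but what lands in your writev function looks the same:

```js
const { Writable } = require('stream');

const sink = new Writable({
  objectMode: true,
  write(chunk, encoding, callback) {
    console.log('normal write:', chunk);
    callback();
  },
  // Gets an array of everything that piled up while the stream wasn't flowing;
  // each entry has a chunk property holding the data you missed.
  writev(chunks, callback) {
    for (const { chunk } of chunks) {
      console.log('caught up on:', chunk);
    }
    callback();
  }
});

sink.cork();                           // start hoarding writes
sink.write('a');
sink.write('b');
process.nextTick(() => sink.uncork()); // everything arrives at writev in one batch
```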

ending one stream doesn't end them all

I swear I read that a writable is supposed to end if a readable piped into it ends, but all I can say definitively is that's not what I've seen. Possibly because I fucked something up? Possibly because I've misunderstood something and that's not supposed to happen at all. At any rate, a "way" to shut down your pipeline and guarantee it gives control back to other things that might want it is to listen for the end event on your streams. You can be real gross and save an array of all the streams in your pipeline, then go through and manually close all of them in the end event handler. It's not very pretty, but it's the only thing I was able to find that ensured the behavior I expected.
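The gross version looks something like this sketch (the streams themselves are placeholders):

```js
const { Readable, PassThrough, Writable } = require('stream');

const source = new Readable({
  objectMode: true,
  read() { this.push('some data'); this.push(null); }
});
const middle = new PassThrough({ objectMode: true });
const destination = new Writable({
  objectMode: true,
  write(chunk, encoding, callback) { callback(); }
});

// Be real gross: hang onto every stream in the pipeline...
const pipeline = [source, middle, destination];

// ...and when the readable runs dry, go shut the rest of them down yourself.
source.on('end', () => {
  pipeline.forEach((stream) => {
    if (typeof stream.end === 'function') stream.end();
  });
});

source.pipe(middle).pipe(destination);
```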

there is no restart button

I had a cool idea that my framework would allow a pipeline to be stopped and restarted, but if you take a close look at all the docs you'll notice an interesting pattern: there are lots of end and finish type references and zero start or letsParty methods. In hindsight, I can see how this is philosophically consistent with not reanimating the dead. If you want to restart a stream that's ended, you have to actually go and reassemble the whole pipeline, and/or restart the loop where you manually read each stream until you get to the end, effectively rewriting the streams API because you think that kind of shit is fun. Whatever your pleasure, "restarting" a pipeline means, in practical terms, programmatically composing it all over again.
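In practice the "restart button" ends up being a factory function. A sketch, with toy streams standing in for real ones:

```js
const { Readable, Transform, Writable } = require('stream');

// Factories, because ended streams can't be revived; you need fresh instances every time.
const makeSource = () => new Readable({
  objectMode: true,
  read() { this.push('tick'); this.push(null); }
});
const makeShout = () => new Transform({
  objectMode: true,
  transform(chunk, encoding, callback) { callback(null, String(chunk).toUpperCase()); }
});
const makeSink = () => new Writable({
  objectMode: true,
  write(chunk, encoding, callback) { console.log(chunk); callback(); }
});

// "Restarting" the pipeline just means composing it all over again.
const buildPipeline = () => makeSource().pipe(makeShout()).pipe(makeSink());

let pipeline = buildPipeline(); // first run
// ...later, once it's ended and you want it back:
pipeline = buildPipeline();     // brand new streams; the old ones stay dead
```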

you can buffer your buffers

If you're going to put your readable streams into flowing mode by using pipe, you haven't necessarily given up the ability for a writable destination to pull data as it wants it. Ordinarily a readable will send data down the pipeline as quickly as the rest of the pipeline can handle it, but you can allow the destination to set its own terms by adding transforms between it and the readable segment of the pipeline. For example, you could use a timer to wait a certain amount of time before the transform signals it will accept more data, while allowing the destination not to worry about any of that and just signal it's done when it's redrawn some HTML or whatever it might be up to. Although streams themselves don't give you a ton of options for control in an obvious way, you can use the events they emit and the control functions they provide to make in-flight changes to how they operate together. For example, there's a sneaky way of making your streams pull streams at the bottom of the official docs for writable.write.
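A sketch of the timer idea (the one-second delay is arbitrary, and the source and destination in the comment are whatever yours happen to be):

```js
const { Transform } = require('stream');

// A transform that refuses to accept more data until a timer fires, so a slow
// destination downstream can just redraw its HTML and call its callback whenever.
const throttle = new Transform({
  objectMode: true,
  transform(chunk, encoding, callback) {
    this.push(chunk);
    setTimeout(callback, 1000); // don't signal "ready for the next chunk" for a second
  }
});

// somewhere else: source.pipe(throttle).pipe(destination);
```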

there is no pipeline

If you want to define a pipeline of streams but not actually start the flow of data, I'm afraid you are shit out of luck, unless you want to implement that yourself. The pipeline that connects your streams is conceptual, not something you can do stuff to or, to the frustration of anyone who wants to programmatically compose one, define to use at a later time. But if you start disassembling the pipeline so you can exert more control over it, you become responsible for more of those concerns like whether a writable is done doing its stuff or whether a readable has run out of data. Decompose far enough and my suspicion, based on having started down that path and quickly realizing I wanted no part of it, is that you end up with very, very expensive pub/sub.

You can sort of do things to a pipeline using wrap, which will take a pipeline of streams and wrap it in a single readable. Apparently that is not the intent of the function, but it does kind of work. However, you need to do additional management if you go that route. Errors in wrapped streams will propagate up into the stream wrapping them, for example, so you'll need to handle those or you'll get yelled at.
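Hedged heavily, since again this isn't what wrap was designed for, here's roughly what that looks like with a couple of toy streams:

```js
const { Readable, PassThrough } = require('stream');

const source = new Readable({
  objectMode: true,
  read() { this.push('hi'); this.push(null); }
});
const transform = new PassThrough({ objectMode: true });

// Abuse wrap() to treat the tail of a pipeline as a single readable.
const wrapped = new Readable({ objectMode: true }).wrap(source.pipe(transform));

// Errors from the wrapped streams propagate up here, so handle them or get yelled at.
wrapped.on('error', (err) => console.error('pipeline blew up:', err));
wrapped.on('data', (chunk) => console.log(chunk));
```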

This was actually the most difficult part of learning about streams for me, particularly because I was trying to compose them programmatically. To have your commands disappear into an invisible web of callbacks and internal events is a little disconcerting, and makes it difficult to determine where and how things are happening. It brought up some questions for me about the trade-offs between the suitability of a pattern for solving a problem, and how easy the pattern is to work with during development. For this reason, if I were going to write this framework again I'd seriously consider leaving streams to more behind-the-scenes tasks.



These notes are based on some pretty new code, so I'll update them if I learn that something was way wrong, or add to them if I find any more surprises. And this is on GitHub, so if I said something really, really wrong and it's upset you badly, feel free to make a pull request.