Texas: The Land of Frequent Water Outages

I’m now on my third day without water in Texas. I mean, I have zero water. Nothing at all comes out of my tap.

Even worse, there is nothing to buy because stores quickly sold out due to the detestable Texan habit of panic buying water whenever a major storm comes through.

This sort of thing has happened once before, but this is the first time I’ve ever lost all water. In October 2018, we lost water filtration for a week after the county’s water treatment plant flooded. In that case, though, we still had water at the tap – it just wasn’t treated, so we had to boil it first.

I moved to Texas for work and better economic opportunities. Many millions have chosen to relocate here for the same reason, leading to huge growth in the state, especially in metro areas.

But it seems that this growth has outpaced modernization of infrastructure. Any area will experience growing pains as population increases, but it isn’t common for basic services like electric power and water to become unavailable.

Let’s Have Some Fun Securing Our Home Network With pfSense

As I alluded to in a previous blog post, I’ve been beefing up my home network security as it’s become apparent we are now living in the midst of a cyber war.

I use a VPN and have for many years now, but I have had all kinds of problems with running the VPN connection on all of my devices. The iOS support for VPN connections is especially terrible. So I decided to instead get a firewall device to sit between my network and the internet. I configured a Protectli device with pfSense, a FreeBSD-based open source OS for firewalls.

[Diagram of my network configuration.]

pfSense is configured with a VPN connection that “tunnels” my network traffic to another server somewhere on the other side of the internet. This is important because it prevents my ISP from reading my network traffic. To the ISP, any traffic coming from the pfSense firewall looks like static — it is indistinguishable from noise. In effect, this allows all the devices attached to my network to appear with the same phony IP address (that of the VPN server).

Hiding your IP address is a good idea if you care about privacy. Ad tech companies, including the big hitters like Google, Facebook, and Twitter, use your IP address (or a hash of it) to identify you online. Using your IP address, a company like Facebook can connect your activity across browser sessions, devices, and so on. Additionally, your ISP knows your IP address and can, if compelled by a court order or DMCA request, hand over your name and identification in association with your IP address.

In short, here’s what you’ll need for this setup:

  • A physical firewall device that can run pfSense (I recommend Protectli) [about $300]
  • A USB thumb drive, which we’ll use to install pfSense [$10 or less]
  • A speedy WiFi router that is separate from your ISP’s modem (if your modem and router are already separate devices, you can probably reuse your existing router, but my instructions here assume your ISP’s WiFi device cannot be trusted) [about $100]

This isn’t a home security panacea. Next, I’ll need to configure my own DNS to avoid DNS leaks. I may try to get DoH set up, but I’m not sure if pfSense supports encrypted DNS yet. Once I have that set up, I’ll also be able to block all advertising inside my network as a nice bonus!

Finally, my eventual goal is to have this network function as a sort of “home lab” setup. I want to be able to VPN into my firewall from anywhere on the internet, effectively allowing me to establish a secure connection for file sharing between friends and family. My pfSense adventure is just beginning! 🙂

Why Tech People Can Work From Home So Easily

Media outlets have reported that tech workers are among the least impacted by “lockdowns” and workplace closures over the last year. This is ostensibly because we can work remotely.

But can’t anyone with a white-collar job work remotely? Unless you need to be physically doing something in proximity to your workplace, most modern office jobs can be done over a video or chat application.

I will submit to you that the real reason tech people are having such an easy time transitioning to remote work is that we have already built all the tools to do our jobs remotely. We have been so far ahead of the curve that our industry is already built for remote work.

Consider the problems of remote work, broadly:

  • Workers may be working from different time zones. Those time zones may overlap significantly (e.g., New York and Chicago) or barely at all (e.g., Mumbai and Texas). Thus, being able to work independently is necessary to maintain productivity.
  • Workers must be able to work independently, but eventually their work needs to be integrated together. How shall this be accomplished? If workers A and B are both going to do 50% of a project, how can we divide the work so that both are contributing their fair share, without significantly impacting productivity?

These are the very problems that source control systems are designed to solve. The popular Git system works as follows:

  • Work is sub-divided into “commits.” When worker A has completed a task to their satisfaction, they then “commit” that work to their local copy of the project. Note that the work is still not integrated with worker B’s work.
  • Commits should be small. Large commits are harder to integrate and are thus discouraged, so workers A and B plan their contributions accordingly, producing many small commits.
  • There is a “master” version of the work, similar to a multi-track master for an audio recording, which contains the sum of all past commits. When a worker begins changing the project, he updates his local copy of the “master” branch, and then creates a new branch off of this local copy of master where he will do his work and create commits as he completes it.
  • master is not a stable, static thing – it is changing constantly as others complete work. When worker A is done, he will integrate or “merge” his changes back into the master branch. Here he must fetch whatever other commits have been merged by workers B, C, D, E, F, etc. This may trigger a merge conflict, requiring him to consider how his changes should be integrated with everyone else’s work before it is pushed out for distribution to the rest of the company.

All of this is accomplished through a clever use of a graph data structure and hashes. If you’re interested in learning more, I suggest getting a book about how Git’s internals work.
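
To make the “graph data structure and hashes” idea a bit more concrete, here is a toy Python sketch of my own – it is not Git’s actual object format – where each commit is identified by a hash that covers its parents’ ids, so the history forms a content-addressed graph that workers A and B can extend independently and later merge.

```python
import hashlib
from dataclasses import dataclass
from typing import Tuple

# Toy model of a commit graph (NOT Git's real object format): each commit's
# id is a hash over its parents' ids plus its content, so history forms a
# content-addressed graph that independent workers can extend and then merge.

@dataclass(frozen=True)
class Commit:
    parents: Tuple[str, ...]   # parent commit ids (empty for the first commit)
    message: str
    change: str                # stand-in for the actual file changes

    @property
    def commit_id(self) -> str:
        payload = "|".join(self.parents) + self.message + self.change
        return hashlib.sha1(payload.encode()).hexdigest()

# Workers A and B branch off the same starting point...
root = Commit((), "initial commit", "+README")
a1 = Commit((root.commit_id,), "A: add login form", "+login.html")
b1 = Commit((root.commit_id,), "B: add billing job", "+billing.py")

# ...and a merge commit ties both lines of work back together.
merge = Commit((a1.commit_id, b1.commit_id), "merge A and B", "")

for c in (root, a1, b1, merge):
    print(c.commit_id[:8], c.message)
```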

Taken together, Git’s features allow worker A to work on a small piece of the total project without coordinating with worker B, and yet also ensures that worker A’s hard work will be integrated into the project. Hopefully you can see why this system is ideal for remote work.

Software people, I think, underestimate how powerful source control really is. I greatly doubt that lawyers, accountants, or civil servants have a similarly sophisticated method of working independently.

My Quixotic Quest for Digital Security

It’s no secret the internet has gotten a lot scarier in recent years. When crime in your area goes up noticeably, it’s rational to invest in a new security system. So why not do the same with our digital lives? This impulse began my attempts at hardening my digital security.

The first step I took back in 2018 was to transition away from Gmail toward a private, end-to-end encrypted email solution at Protonmail. I also decided to buy their bundled VPN service and began using it constantly on all of my devices. The ProtonVPN app is decent enough 99% of the time, but it has a lot of problems, mainly when it has to reconnect. This is an especially salient problem for mobile devices, which are constantly connecting and disconnecting from different WiFi networks or falling back to 4G as you move around. Even for home WiFi connections, the ProtonVPN app is just not that great, though some major stability improvements have been made in the last few years, like the ‘kill switch’ feature that cuts your connection if the VPN drops – which is unfortunately common.

This led me to seek out a custom firewall for my home network that could route all my traffic through the VPN. I landed on pfSense, a FreeBSD-based OS for firewalls that purports to offer “enterprise”-grade security. Then came the question of what to run it on. A blog I follow suggested that a little firewall device from Protectli would be more than sufficient for a home network.

I picked out my Protectli device for about $300 and began setting it up. I managed to get pfSense installed and saw some packets come through the firewall. I then set up an OpenVPN client with my ProtonVPN credentials and verified everything was getting routed through the VPN tunnel. Looking good so far! I then went out and bought a new WiFi router to put behind the firewall (replacing my ISP’s router – lord knows what that thing’s running) and voilà – I now had my own secured home network with no need to run the ProtonVPN app on each device. Everything behind that network would be VPN’d to another IP, and the WiFi router itself was protected by pfSense running on the firewall.

All was well and good until I had to restart the firewall. I was moving some equipment around my living room and had to unplug it. When I started it up again, I couldn’t get it to route traffic through the firewall, even after making a fresh pfSense install!

Right now I’m stuck trying to figure out this issue. I don’t know if it’s a hardware failure or some problem with how I’ve got it set up. The Protectli folks have been nice about communicating via email to see if we can figure out the problem. For now, I’m going to have to keep the dream alive in my heart.

Visa To Acquire Plaid

If you use any third-party finance tools (such as Mint, Personal Capital, You Need A Budget, etc.), the tool very likely uses Plaid. Plaid is an API service for securely accessing bank transactions, balances, and so on. It does so by using token authentication to your bank(s). Your financial data then passes through Plaid’s systems decrypted, so if you use these tools, Plaid knows every transaction across all of your connected accounts, including balances. The app developers (such as Intuit, which owns Mint and TurboTax) also have a copy of this data.

Visa now wants to buy Plaid. This will grant Visa visibility into every transaction you make, across every connected account.

Why is this data valuable to them? It can be used to assess credit-worthiness for Visa’s own products. Furthermore, Visa already has loads of data on consumers since so many of us have Visa cards. Every transaction that goes through Visa’s systems is recorded and part of your permanent record. Adding Plaid will only enhance their data set. That data set can then be repackaged and sold to marketing firms, advertisers, and law enforcement.

The only way to opt out of this system is to a) abstain from using tools like Mint, which use Plaid to access your financial data, and b) use cash where possible.

Covid-19 is now being used to promote a cashless society. I’ve already had several businesses tell me they “strongly prefer” I pay electronically. Today businesses are still legally required to accept cash, but that could change.

Unless you can pay with cash, there is no way to opt out of this system wherein your financial transactions are being recorded, packaged, and sold. We should expect to see more mergers of this sort in the future.

I use Mint myself, but I am looking to transition to an open source solution in the future. Unfortunately, all of the ones I have seen require Plaid accounts. It may not currently be possible to recreate the software features of Mint without running afoul of the surveillance system.

Sources:
– https://www.eff.org/deeplinks/2020/11/visa-wants-buy-plaid-and-it-transaction-data-millions-people
– https://www.nytimes.com/2020/07/06/business/cashless-transactions.html
– https://www.denverpost.com/2020/12/31/cash-dollar-bill-credit-card-covid-pandemic/

Holistic Software Engineering

All organizational problems in software development can be traced back to the erroneous assumption that developing software systems is like developing other large, complex things, like buildings or cars.

In a company producing cars, for example, the process is divided into two roles, broadly speaking. There are people who design the cars, creating blueprints and specifications as to how the engine components and various features will fit together functionally — let’s call them designers — and then there are the people who actually assemble the cars — let’s call them assemblers.

The assemblers need specialized skills to operate the various tools used to put together the components at the assembly plant. But they don’t have to know how cars work. They certainly don’t need a mental model of how a modern vehicle is designed. They are mostly concerned with the actual assembly process, how to make it more efficient, how to avoid costly accidents and delays in the assembly process, and so on.

The designers, on the other hand, have a completely different skill set. They have a deep knowledge of what kinds of engine layouts work, what effect changing certain components has on the overall performance of the car, and so on. They talk to executive leadership about what kinds of designs are selling, learn about performance improvements that can be made, and respond to problems in the design by producing new designs.

We also have a middle tier of managers that serve as connective tissue between these groups. They help to keep communication working smoothly, resolve issues with employees, and identify needs for new hires as the production line grows.

(OK, this is probably not exactly how automotive design works, but bear with me!)

Divergent Levels of Understanding

It’s tempting to think that building something like a mobile app works the same way. Perhaps you’re shaking your head now as you read this article, thinking to yourself, “Of course that’s how it works!”

In software, it’s quite common to have the assembler and designer roles separate, especially in corporate environments. From a hiring standpoint, this makes sense.

Let’s say we are developing a new mobile app front end to our back end system. We can hire an off-shore team of contractors to put together an iOS app, and throw in a senior back-end engineer from the main office to help direct them on Zoom calls once or twice a week. This might help us keep our personnel costs down, right?

Already we’ve split the task of building software into two roles. Now we have a senior engineer who’s really doing the design phase of the project and a group of assemblers who are actually doing the programming.

Let’s say we also have a product manager to oversee this project. The PM knows how the mobile app needs to work.

Notice how divergent the different levels of understanding of the project have become. We have different humans handling three distinct levels of understanding:

  1. What is the project supposed to do?
  2. What is the best approach to meet these requirements?
  3. How are we going to implement this best approach?

Roles Within Software

The people who understand the product and what it needs to be able to do in the long term have titles like product manager or software engineering manager. They usually don’t write code or implement the design. Sometimes they have development experience, but not always. These roles have a level 1 understanding of the project.

Even software architects frequently spend little time implementing designs. Usually, they’re concerned with making sure the team has the right approach, and they have to justify that approach. Sometimes a similar role is taken by lead engineers who are responsible for guiding the team toward an implementation. These are your level 2s.

And, of course, there are often several engineers working on implementing the design. Usually when we say engineer we mean someone writing application code to be deployed either to a server somewhere or delivered to a browser in a bundle of compressed JavaScript. These folks are deep in level three.

But there are also those tasked with ensuring the environment where the application actually runs is healthy. For maddening reasons, we call them devops engineers. In a past life, these brave souls might have been called sysadmins. If they’re working with a cloud provider, like Amazon Web Services, we’ve sometimes taken to calling them cloud engineers because their job is to provision cloud resources for the team. These ops roles are also level three.

Split Brain

In our zeal to make software development more efficient, I think we’ve erroneously taken this metaphor of the assembly line and applied it to software.

Instead of making things more efficient, we split the complete understanding of the project, so when we make changes in the system, we now have to begin at level one, and translate the requirements all the way down to level three.

Dare I ask, is there a better way? Let’s let our imaginations go wild for a moment. I give you the holistic software engineer.

A Holistic Software Engineer

A holistic software engineer is capable of:

  • Designing a system from beginning to end, taking into account various tradeoffs.
  • Administering a working system, be it in the cloud or on-premises.
  • Communicating these concepts to non-technical staff outside the team, e.g. executives or sales.
  • Understanding the need that the project fills for the user, whether it be an internal user like an executive or an external customer.

In other words, the holistic software engineer has a complete understanding of the project — from its purpose right down to its implementation details.

They can speak to the various decisions that went into designing the project. They can quickly iterate on new designs when the “assembly” of the software runs into performance problems. They can also understand how decisions impact users of the system, and which aspects of the system’s performance matter most to those users.

The most useful contributors on software projects are the ones that have this holistic understanding. They can bridge the gap between the different levels that the organizational strategy has created.

They could come from either direction. They might be a technical person who has gone out of their way to understand the product side of things.

Or they could be a product person who has done a lot of face-to-face work with engineers to understand how the system was designed and what its limitations are.

Whichever direction the holistic software engineer comes from, they’re useful because they’ve started to re-compose the complete understanding that was lost when we split roles along levels of understanding.

A Challenge

I challenge you to think about your own role in these terms and try to take on a more holistic role.

Are you working at level one, with a solid understanding of the product but no idea how to design or implement it? Try learning some technical skills so you can understand what the engineers mean when they say that service X needs to call service Y.

Or are you working at level two, where you sketch out designs of systems but cannot be bothered to iterate on those designs when your offshore team runs into problems? You’d be a better architect if you dove into the code and had a solid grasp of where the product is heading.

Perhaps you’re deep in level three, heads down in code, but not really aware of how the project might change and grow in the future. I would challenge you to think of yourself not as a “coder” but as someone with a role in structuring the implementation to allow the design to evolve to meet future product needs. Talk to product and ask questions to fill in the gaps in your understanding.

Conclusion

We can all improve in our craft if we have a little more understanding of the concerns of the people we interface with at work. I present this concept of holistic engineering as a way of getting at the kind of empathetic approach that I find works best in software teams.

I think that the Agile movement was an attempt to re-constitute this holistic engineer. Processes like point estimation, storyboards, and so on are just formalized ways of communicating technical challenges to management and communicating product requirements to engineers.

Whatever system we use, the end objective is the same: engineers and product experts working together toward a complete understanding of the project, so that progress can be made. Try to understand the whole picture and you will be more useful as a contributor, no matter what your role is.

Article: Inventing the Data Machine

I’m late updating this, but I wanted to link to my latest article about data science in the political realm.

The country is in a weird place right now, and it seems like we are hurtling toward perpetual partisan deadlock, so please bear in mind this was written back in January. I tried to focus on facts – what a notion! – rather than project my own politics onto the subject. I cover the development of data science in political campaigns and the individuals who contributed to its development.

Why You Should Not Have A Public Profile Photo

As of this writing, there are zero available public photos of me. You cannot google my name and come up with an accurate image of my likeness. The only ones that appear are not of me. You might instead find a Hong Kong investor, for example, or any of several others around the world who happen to share my name. None of them look anything like me, nor are they related to me in any way.

I am quite gloatingly proud of this fact. Being unknowable in the information age is difficult, and it took some work to achieve my status as the invisible man.

Why in the era of social media am I so protective of my visage? In 2020 it is common to meticulously document your life on platforms like Instagram and TikTok. I used to have accounts on some of these platforms, but no longer. I am blissfully invisible to the internet, and I love it. It is all part of my privacy philosophy.

My philosophy is this: I want to have as much control over my public information as possible. I am aware that every bit of information about me is a potential tool for someone to do me harm, and the single most dangerous bit of information is my location and what I look like.

As a basic first requirement for privacy, you never want to make the following information public*:

  • Your home address.
  • Your likeness.

With these two pieces of information, anyone in the world can find you, stalk you, and do whatever they’d like to you. So-called “people search” sites scrape public records such as voter registrations to compile databases of addresses, so it is often trivial to find the home address of anyone you want, even celebrities and politicians. I have gone to great lengths to remove my information from these sites.

It may seem paranoid to worry about this – after all, I’m a law-abiding citizen, aren’t I? – but sadly this is a very real threat in today’s fracturing society. Online bullying and harassment are increasingly common. We must take a defensive stance.

That said, there are some downsides to being unfindable online. Namely, you may end up looking more suspicious for your lack of public information. It’s become common for someone to “stalk” a date or a job applicant before meeting them in person, and if nothing turns up on that search, it may well be assumed that you’re not a real person, or even that you’re hiding something. In fact, this problem is the very reason I originally created this website – to prove my existence in the absence of any other digital presence.

I feel much safer knowing that someone I do not know cannot locate me. And with my own personal site, I can control what information is available about me, showcase my professional accomplishments and some of my interests, and control the context in which they appear.

*: The only people I would exempt from this rule are those whose profession requires them to be in the public eye. Public speakers or C-suite business leaders, for example, can’t very well hide their likeness from search engines, since they are likely to appear blurbed in articles or prominently displayed on a corporate website. Until I decide to run for office or take a CTO job, I’ll happily remain faceless.

Why Everyone Should Learn Functional Programming Today

In the world of programming languages, trends come and go. One trend that deserves consideration is the interest in functional programming that began earlier this decade. Functional programming is a style that emphasizes immutable data, functional primitives, and avoidance of state.

I know what you’re thinking. You wrote some Lisp in college and dreaded it. Or, you had to manage some awful Scala code at your last job and you’d rather not deal with that again.

I know, I know. But hear me out.

Functional programming is more than a trend. Understanding its concepts and appeal goes a long way toward understanding the problems facing software engineers in 2019 and on into the next decade.

In fact, it helps us understand the current state of the world, as data mining and Machine Learning algorithms become an issue of public concern.

Even if you don’t work in a functional language, the solutions offered by the functional way of thinking can help you solve difficult problems and understand the world of computing.

Imperative Style

Most programming languages in wide use today are Von Neumann languages. These are languages that mirror a Von Neumann computer architecture, in which memory, storage, control flow instructions, and I/O are parts of the language. A programmer creates a variable in memory, sets its value, deletes it, and controls what the next command will be.

Everyone who has written a program is familiar with these concepts. Indeed, all the most popular languages in use are Von Neumann family languages: Java, C++, Python, Ruby, Go.
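
As a small (and admittedly contrived) Python illustration of the imperative, Von Neumann flavor: we allocate variables, mutate them inside a loop, and spell out every step of control flow ourselves.

```python
# Imperative style: mutable state plus explicit, step-by-step control flow.
orders = [12.50, 8.00, 41.25, 5.75]

total = 0.0              # a named cell in memory we will mutate
count = 0
for price in orders:     # we dictate the loop...
    if price > 10.0:     # ...and every branch...
        total += price   # ...and every update to state.
        count += 1

print(count, total)
```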

Enter Functional Style

In August 1978, computer scientist John Backus published an article in the Communications of the ACM. Backus accused conventional Von Neumann style languages of being “fat and flabby.” He bemoaned the complexity of new generations of languages that required enormous manuals to understand. Each successive generation added more features that looked like enhancements but considerably degraded the language by adding complexity.

Furthermore, programs written in these languages couldn’t be composed into new programs because their components weren’t created in generic forms.

A sorry state of affairs, indeed.

Backus asked why we can’t create programs that are structured more like mathematical formulas. In such a language, data could be manipulated as in algebra. He proposed that this “functional style of programming” would be more correct, simpler, and more composable. Backus also stressed the importance of the “clarity and conceptual usefulness” of programs.

It has been four decades since this paper was written, but I think we can all relate to this!

Languages like Java, Python, and JavaScript add new features intended to clarify syntax, but the overall trend of these languages is toward increasing complexity. Object-Oriented Programming (OOP) at least gives us modularity, but inheritance hierarchies lead to well-known design problems.

Models of Computing

The blame for all this complexity, according to Backus, goes back to the Von Neumann computer architecture itself. It served us well in the 1940s and ‘50s, but by 1978, it had begun to show its age. He defines several conceptual models to demonstrate the limitations of Von Neumann’s ubiquitous model.

Turing machines and automata

These are conceptual models of computers used by computer scientists. They meet all the requirements for computing, but they’re too unwieldy for human beings tasked with designing software programs.

The dreaded Von Neumann model

Backus calls the Von Neumann model, exemplified by most of the conventional languages we use today, “complex, bulky, not useful.”

Backus concedes that Von Neumann languages can be “moderately clear,” but he calls out their lack of conceptual usefulness.

Indeed, how many of us have stared cross-eyed at a 1,000-line block of Python or Java, trying to suss out what all these loops and conditional statements are trying to do? And with multiple contributors, it can be a nightmare to understand highly procedural code.

Backus also notes that the Von Neumann model is designed for a single CPU machine. Instructions are executed one at a time.

The functional model

Here, Backus identifies the lambda calculus, the Lisp language, and his own concept of “functional style programming” as the third category.

Programs written in this model have no state. Instead of setting variables directly, we bind values to symbols. Instead of looping, we transform collections. The result is programs that are concise and clear, as well as conceptually useful.

Another way to put it: functional style is obvious.

Indeed, a program written in a functional style language is often quite short, but its concise definition makes it easier to understand than its non-functional equivalent.
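
Sticking with Python for comparison (a rough sketch of the style, not a full functional language), the same computation from the imperative example earlier can be expressed without mutation: we bind names to the results of whole-collection transformations.

```python
from functools import reduce

orders = (12.50, 8.00, 41.25, 5.75)   # an immutable tuple instead of a mutable list

# Bind names to the results of transformations over the whole collection,
# instead of mutating counters inside a hand-written loop.
big_orders = tuple(filter(lambda price: price > 10.0, orders))
count = len(big_orders)
total = reduce(lambda acc, price: acc + price, big_orders, 0.0)

print(count, total)
```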

Why Should I Care?

OK, so maybe we could make better programs if we all dropped Python and Java and started writing Haskell. Uh-huh. OK. Sure.

But who’s going to do that? And, more importantly, why? How are we going to train developers fresh out of college in languages they don’t know? Certainly, there has been a lot of quality software written in existing languages, and as C++ creator Bjarne Stroustrup once said:

“There are only two kinds of languages: the ones people complain about and the ones nobody uses.”

The reason we should care about all this beyond an academic exercise is that the present movement toward “Big Data”-driven products has led to problems in computing that the functional model is uniquely good at solving.

Concurrency

As Backus noted in 1978, the Von Neumann model is really oriented around simple computers that execute one instruction at a time. The flow of a Von Neumann style program puts the control of every instruction into the hands of the programmer.

Unfortunately, it didn’t take long before our computers became more complex. We now have computers with many CPUs, executing many instructions at the same time. Popular languages like Python and Java weren’t built from the ground up to take advantage of this. These languages have bolted on threading APIs to allow programmers to take advantage of multiple processors. Others rely on process forking, essentially pushing the problem down to the operating system.

Multi-threaded programs are hard to write correctly, and even very experienced programmers can make serious errors. Writing multi-threaded programs is so complex that there are entire books dedicated to doing it correctly.
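
To make the pain concrete, here’s a minimal Python sketch of the bookkeeping the shared-state model pushes onto the programmer: every access to shared data has to be guarded by hand, and forgetting the lock in even one place can silently corrupt the result.

```python
import threading

counter = 0
lock = threading.Lock()

def work(iterations: int) -> None:
    global counter
    for _ in range(iterations):
        # "Read, add, write back" from several threads can interleave and
        # lose updates, so every touch of shared state must be guarded.
        with lock:
            counter += 1

threads = [threading.Thread(target=work, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000, but only because every increment held the lock
```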

What would our programming languages look like if computers with many CPUs were commonplace in the 1940s? Would we choose to program each thread individually, or would we come up with different concepts for achieving the same goal?

Distributed Systems

Ten years ago, most software was written to run on an operating system on a customer’s PC. Any operations that the software needed to do were processed using the customer’s CPU, memory, and local disk.

But the early success of Gmail and other web-based tools proved that a sophisticated software system could be run over the internet.

Today’s commercial software doesn’t just run on a customer’s PC. It runs in the cloud, across perhaps hundreds of machines. Software-as-a-Service (SaaS) products are now commonplace, used by individuals and enterprises alike.

With the data taken off of the customer’s PC and sent over the wire to our data center in the cloud, we can now look at the data for all customers in aggregate. And that aggregate view can reveal trends — for example, detecting fraud in bank transactions.

But these systems are hard to write. Instead of running on a single-threaded computer with local memory and disk access, as the Von Neumann model presupposes, our programs now have to run across potentially hundreds of machines with many CPUs. Even worse, we’re now processing way, way more data than we could ever hope to store on a single machine. And we need to be working on this data. It can’t just be shoveled into a data warehouse and queried later.

A Naïve Solution

One approach is to keep using the threading or process-forking models we have been given to write our code, and then build a fleet of machines to scale it. Those machines will then process data and push that data somewhere (a database?) to keep it from filling up the local disk.

As you might guess, this solution is very operationally complex. We have to manually shard the data somehow — i.e., split our data set evenly across our n processing machines — and write the glue code for all these machines to talk to one another and perform some sort of leader election to figure out whose job it is to coordinate all of this.

In practical programmer terms, it also means we’re going to have to write, maintain, and version the following in code:

  • Complex multi-threaded code written in Java, for example.
  • A bunch of bash scripts to deploy and update this code on our n machines in the cloud.
  • Code to scale up and down our solution to more machines as our data volume grows or shrinks.
  • Some kind of scheduler and coordination system to get all these operations to work together and join their result somewhere.

Now imagine debugging and maintaining this system. Doesn’t sound fun, does it? Certainly, the resulting solution in code will not be obvious.

An Elegant Solution

In 2013, Berkeley’s AMPLab donated the Spark project to the Apache Software Foundation. Over the years, Spark has become one of the favored ‘big data’ cluster programming platforms, supplanting a variety of systems built by engineers at Twitter and Google.

Spark is written in the Scala language, a functional programming language that runs in the Java Virtual Machine (JVM). I won’t get into the gory details of how Spark works or write any code here. You can find plenty of examples online for that.

Instead, I’ll present the Spark conceptual framework and show how the functional model of computing is crucial to its elegant solution.

What is “the program?”

Ask yourself this question. In our hypothetical distributed system described above, what is “the program?”

Is it the concurrent Java code that we wrote? Or is it the bash scripts that deploy that code out to the various machines in our “cluster?” Is it the scheduling algorithm?

I’d argue that all of these components put together contain pieces of “the program.” The program is the instructions for transforming the data. Details like thread management and managing resources are incidental to this goal.

Think of it this way. Say we have a dozen machines in the cloud, each with 4 CPUs and 16 GB of memory. Throw all those machines together into a big “cluster” of computing resources. Now we have one big “computer” with 4 * 12 = 48 CPUs, 16 * 12 = 192 GB of memory, and a certain amount of disk storage.

Now, imagine we write our data transformations in the functional style described by Backus. Each transformation is written like a mathematical function. There’s an input and an output. No state. All data is immutable, stored in stages on disk on each machine, and deleted when it’s no longer needed.

We could now have a scheduler that knows about the structure of our cluster. In other words, it knows it has 12 machines with 4 CPUs and 16 GB memory. The scheduler dispatches a portion of the data along with the data transformation function we’ve defined.

In fact, if we write our data transformation “program” in a purely functional style, the scheduler can dispatch many of these transformations at the same time – as many as can fit in the cluster with its limited resources. That allows us to process our data efficiently.

Programming the Cluster in Functional Style

I’m not going to promote Spark as the end-all, be-all of cluster computing. Perhaps we’ll come up with something better in the future, and Spark isn’t good for every distributed system. It’s optimized for data processing and streaming, and not serving up live requests, for example.

But I want to emphasize the shift in perspective that allows this type of system to be built, namely functional programming style. And indeed, when we enter the realm of ‘big data,’ we tend to find that most solutions rely on the functional model of computing.

Spark offers a Scala, Java, and Python API. Whatever language you choose, you’re going to be writing your Spark program in a functional style.
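
Here’s roughly what that looks like with the Python API (a minimal sketch assuming a local pyspark installation; the sample data and names are mine, purely for illustration). Each step is a pure transformation from one immutable dataset to the next, and Spark’s scheduler decides where in the cluster each piece actually runs.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("functional-style-sketch").getOrCreate()

# (account, description, amount) records; pretend they live across the cluster.
transactions = spark.sparkContext.parallelize([
    ("acct-1", "grocery", 120.00),
    ("acct-2", "wire",    990.00),
    ("acct-1", "coffee",    4.50),
])

# Pure transformations: no loops, no shared state, no thread management.
totals = (
    transactions
    .filter(lambda t: t[2] > 50.0)     # keep only large transactions
    .map(lambda t: (t[0], t[2]))       # -> (account, amount) pairs
    .reduceByKey(lambda a, b: a + b)   # sum per account, wherever the data lives
)

print(totals.collect())
spark.stop()
```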

We also tend to find that the separation of transformation code from resource management is a theme. Apache Spark’s solution separates out the resource management aspects of our distributed system, leaving us to work with the data. Data transformation rules are clear and require no complex multithreaded code.

It seems that distributed systems are finally freeing us from the limitations of the Von Neumann model.

Conclusion

Functional programming languages may be falling out of favor as a popular replacement for languages like Java or Python. As a drop-in replacement for simple use cases, like a small web application, Scala or Haskell may be overkill.

But the functional model of computing has not gone away by a long shot. If anything, it’s more ascendant than ever. It’s hiding behind the scenes, powering the Machine Learning algorithms, business intelligence, and analytics engines that provide insights to modern organizations.

Software engineers and managers would do well to learn these concepts and understand why so many projects that run at the heart of the biggest tech companies rely on functional style projects like Apache Spark.

Functional style allows us to separate the “how” of computing resource management from the “what” of a program. It frees us from burdensome and complex multithreading APIs bolted on to languages that are based on a model of a simple computer conceived of in the 1940s.

The functional model is uniquely well-adapted to the data-rich world that we’re entering. It’s an indispensable tool for any software engineer working today.