15 May 2018

Is blockchain "web scale"?

For something to be truly awesome, it must be "web scale", right? The rather excellent video below shows how hype can blind us to the real value of a technology. It's quite a famous trope in some circles and I think it's a useful parallel to the excitement around blockchain.

In it an enthusiastic user of a new technology repeats marketing hype despite having no understanding of the technical concerns involved. Watch the video and you'll hear every piece of marketing spin that surrounds NoSQL databases.

I usually get very excited about new technologies but I'm quite underwhelmed by blockchain. In the commercial context I see it as a great solution that really just needs to find a problem to solve.

Lots of companies are looking for the problem that the blockchain solution will solve. Facebook has a group dedicated to studying blockchain and how it could be used at Facebook.

They don't seem to have found a particularly novel use case for it and appear to be set on using it to launch a Facebook crypto-currency. This would let Facebook track your financial transactions (https://cheddar.com/videos/facebook-plans-to-create-its-own-cryptocurrency) which to me sounds like juicy information to sell to advertisers.

I'm convinced that when we finally find that problem blockchain will be a godsend, but right now outside of cryptocurrency I'm struggling to see how the cost and risk of blockchain is worthwhile in the commercial context.

A publicly shared database

Blockchain offers a distributed ledger system, which sounds like something we want, right? Everybody can look at the list of transactions at any time and verify that a particular transaction is genuine.

Using boring traditional data services you'd be forced to have a database and expose an API that lets authenticated users access it in ways that you directly control. How dull is that!

Like communism, decentralizing control of data is a great idea because it promotes equality and encourages distributed effort focused on the greater good. Capitalists and other realists will point out that it's not machine learning algorithms that will be the oil of the future, it's data. Facebook isn't worth billions because it has clever algorithms, but rather because it controls data about people. Data has value, are you sure you want to give it away?


Broadly speaking, instead of your company having a private database that it controls access to, you're able to have a shared database that you can only control the contents of if you spend more on hardware than the rest of the world.

Wait, what? Your blockchain consultant didn't explain to you that anybody with a botnet is going to be able to rewrite your public ledger?

Well, it's true. The way that blockchain works means that the record is only immutable as long as no single actor controls a majority of the processing power in the network.

If you have a private blockchain you're going to need to be certain that you can throw enough resources at it to prevent malicious actors from rewriting history. Or you could only allow trusted actors to use your blockchain, which sounds very much like a traditional access method.

Who will mine your blocks?

The most popular use for blockchain (by far) is to provide a ledger for currency, like Bitcoin and the other coins out there. When users mine blocks they are rewarded with coins, which they hope to eventually convert into fiat currency or material goods and services. There is a direct incentive to spend money on the electricity needed to crunch the numbers that verify Bitcoin transactions.

If you build your own private blockchain who is going to mine the blocks? What is the value in them doing so and are you going to end up mining the blocks yourself just to get the transactions processed? How is this cheaper than a traditional database?

Given the problems with immutability it should be clear that a private blockchain is a risky way to avoid boring traditional data-sharing approaches like an API. And that's assuming there will be people wanting to mine your blocks at all.

Digital smart contracts will replace lawyers

Blockchain lets you dispense with traditional signed contracts with suppliers and instead enter into digitally signed smart contracts. They're touted to replace lawyers (https://blockgeeks.com/guides/smart-contracts/).

Smart contracts cut out the middle-man when it comes to transferring goods and services. They're an alternative to conveyancing costs, credit-card processing fees, escrow costs, and so on. Essentially you place the asset into the trust of the computer program which will then decide where to transfer it depending on whatever conditions you program into it.

Instead of placing your house into escrow with a lawyer who is bound by professional conduct rules and has an insured lawyer's trust fund you can use a smart contract written by anybody. That's the democratizing power of blockchain!

Who can forget the Ethereum fork that happened because of faulty code in a smart contract? I'm not nearly arrogant enough to assume that I can write tighter code than the people who created the DAO - are you willing to bet your house that you are?

I am horribly unfair

Maybe it's not fair to use a high-value asset as an example for considering a smart contract. What about other use cases touted by blockchain evangelists?

Blockchain supposedly streamlines business decisions by eliminating back-and-forth decision making. A smart contract can simply have the business rules coded into it so instead of seeking approval from humans you can just rely on the contract to grant your request.

For example, if you need to make a business trip it can happen automatically just so long as the coded requirements are met. As long as the contract is coded to be aware of every factor that goes into the decision it can automatically approve your travel request.

That contract won't work for requesting a new monitor for your workstation though. You'll need a different contract for that. Or maybe you can extend the old contract and just add more rules to it?

Given the rate cards for blockchain developers, very simple administrative decisions can end up taking a lot more coding effort (and money) than they warrant. And what happens if a management decision needs judgment that hasn't been coded?

Banks are already using smart contracts

Indeed they are (https://www.barclayscorporate.com/insight-and-research/technology-and-digital-innovation/what-does-blockchain-do.html).

Barclays sees a future where identity theft is impossible because digital identity is immutable and publicly accessible for them to read and share with law enforcement. My financial records would be available to Barclays (and whoever else can read the blockchain) and they'd be able to make a decision about opening an account quicker than the time it currently takes (about an hour the last time I did it at the branch).

I wouldn't need to take my photo identity and proof of address documents to the bank, I would just need to show that I own the private key associated with the digital identity profile. This will prevent identity theft, according to Barclays, presumably because consumer computers are secure and digital keys can't be stolen.

Trade documents are given as a good example of how digital signing and identity can be accomplished with blockchain.

In blockchain how do you establish identity? The digital identity is established by ownership of a private key, but how do you link that to a physical entity? Surely you need some way to link the digital identity with the physical entity before you ship the goods that your smart contract says have been paid for.

How do I know that wallet address 0xaa8e0d3ded810704c4d8bc879a650aad50f36bc5 is actually Acme Inc trading out of London and not Pirates Ltd trading out of Antigua? Who is responsible for authenticating the digital identities in blockchain?

You can trust the blockchain (as long as the hash power is evenly distributed) but can you trust the digital identities on it?

Digital signing and identity can also be managed through public key cryptography, where a recognized and trusted central authority signs keys after verifying the owner's identity. This isn't a new arena, and blockchain doesn't solve the problems that public key cryptography has.

There are already established digital signing solutions that don't rely on snake-oil. I signed my rent agreement digitally in the UK with my landlords, who live in continental Europe. I hardly think that this space is a raison d'ĂȘtre for blockchain.

Public, not private blockchains

It seems that my beef is with the impractical nature of using private blockchain where existing solutions are more secure and cheaper. So, what about public blockchains, like Ethereum?

In a traditional blockchain each node on the network needs a full copy of the entire chain in order to be able to verify transactions. Without this public scrutiny the blockchain is no longer secure.

The problem with this is that Ethereum can only process a very limited number of transactions per second. Currently it runs at about 45 transactions per second, which isn't an awful lot once you share it out amongst your company and all the people speculatively trading Ethereum.

Ethereum is considering a sharding approach that will centralize the chain slightly in order to improve transaction speed. A few nodes on the network will have more authority than others. These nodes will need to trust each other explicitly, and the rest of the network will need to trust them.

As a company do you want to commit to this level of trust? Who are the actors controlling these nodes? Who will they be in three years' time? Which country's laws are they bound by?

Data you put into the blockchain will be shared with everybody, forever. Governments don't like competition when it comes to spying on people and are passing increasingly strict privacy laws - how will you plan for compliance when you don't control your data?

Blockchain is web-scale!


28 December 2017

Component cohesion

Breaking your application down into components can be a useful approach to a "divide and conquer" methodology.  Assigning specific behaviour to a component and then defining interfaces for other components to access it allows you to develop a service driven architecture. 

I'm in the process of decomposing a monolithic application into services that will eventually become standalone micro-services.  Part of the task ahead lies in determining the service boundaries, which are analogous to software components for my micro-service application. 

I want components to be modular to allow them to be developed and deployed as independently as possible.  I'm using the approach suggested by Eric Evans in his book on domain driven design, where he describes the concept of "bounded contexts".  I like to think of a bounded context as being for domain models what a namespace is for classes.  These contexts are spaces where a domain model defined in the Ubiquitous Domain Language will have a precise and consistent meaning.  Keeping components modular helps to define and maintain these boundaries.

I want my components to be cohesive because I want my architecture to be so simple that people wonder why we need an architect at all.  It should be intuitively obvious why a group of classes belong together in a component and what part of my domain logic they're implementing.  Cohesion is a good thing and we're all familiar with writing cohesive classes, but what principles are important to consider when grouping classes into cohesive components?

Robert C Martin discusses three important principles that govern component cohesion on his website (here):

  ‱ Release-Reuse Equivalence Principle (REP) - the granule of release is the granule of reuse
  ‱ Common Closure Principle (CCP) - classes that change together are packaged together
  ‱ Common Reuse Principle (CRP) - classes that are used together are packaged together

The Release-Reuse Equivalence Principle (REP) is very simple.  It states that classes that are packaged together into a component need to be able to be released together.  In practice this boils down to properly versioning your releases and having all of the classes in your component versioned and released together.
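In PHP terms this usually means giving each component its own Composer package, so that everything inside it ships as one versioned unit.  A hypothetical manifest (the package name is invented; the release version itself would come from a git tag):

```
{
    "name": "acme/payroll-calculation",
    "description": "Wage calculation component - released and versioned as a single unit",
    "require": {
        "php": "^7.1"
    }
}
```

Tagging a release (v1.4.0, say) then releases every class in the component together.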

The Common Closure Principle (CCP) states that you should gather together classes that change for the same reasons and at the same times.  Conversely, you should separate out classes that change for different reasons and at different times.

Remember that the S of SOLID stands for "single responsibility principle" (SRP) where a class should have only one reason to change?  The CCP is for components what the SRP is for classes.

We can say that generally stuff that changes together should be kept together.

The Common Reuse principle (CRP) states that you should not force users of a component to depend on things they don't need. 

The CRP more strongly suggests that we do not include classes in a component that are not tightly bound to the function of the component.  Every time we touch one of those classes we will be forced to retest all of the client applications that are using the component.  Our deployment process will be heavier than it needs to be, and crucially we'll be deploying more than we have to.

The CRP is a more general form of the interface segregation principle but suggests that a component should be built from classes that are commonly used together. 

Generally speaking, we should avoid depending on things that we don't need.

We've seen three principles that govern how we group classes into components.  The REP and CCP are inclusive and suggest which classes do belong together.  The CRP is stronger about excluding classes from a component.  There is therefore a balance to be struck between these principles.

Tim Ottinger suggested a diagram that helps to see the cost of abandoning a principle.  The label on an edge is the cost of weakening adherence to the principle on the opposite vertex.  So, for example, the cost of abandoning the CCP is that we have too many components changing at one time.

Diagram suggested by Tim Ottinger illustrating tension between component cohesion principles
Your application will fall somewhere within this triangle as you balance your focus between the principles. 

This balance is dynamic and changes over time.  Robert C Martin notes that “A good architect finds a position in that tension triangle that meets the _current_ concerns of the development team, but is also aware that those concerns will change over time.”

These principles will govern how I examine my monolith and identify classes that I can group together to form components.


27 December 2017

Writing SOLID Laravel code

SOLID is a mnemonic acronym for five object-oriented design principles that are intended to make software designs more understandable (see Wikipedia). They were promoted by a chap called Robert C Martin, who has been programming since before I was born and is an authority on writing clean code.

 Laravel is a PHP framework that implements the model-view-controller (MVC) pattern. A lot of people think that their responsibility for OOP design ends with adopting a framework, but actually Laravel is relatively un-opinionated on your OOP design and you still need to think about writing code that is testable and maintainable.

 The reason that SOLID principles matter becomes apparent when you work on a single project for a long time. If you're writing throwaway applications for clients that you never expect to work on again (presumably because the client won't hire you again) then the quality of your code doesn't matter. But if you're the guy stuck with managing and implementing change in an application that is actively growing and improving then you're going to want code that is easy to change, easy to test, and easy to deploy.

The most common problem I've seen in Laravel is "fat controllers" that use the ORM to get some data and then push it through to a view.  Let's take a look at an example I've made.  Imagine that we're writing a payroll program.  We might write something like the following controller methods:
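A fat controller along these lines might look like the following sketch - the class and method names (PayrollController, HoursWorked and so on) are illustrative, not from a real project:

```php
<?php
// A hypothetical "fat controller": it fetches data, applies business
// rules, and formats the result all in one method.
class HoursWorked
{
    public $employee;
    public $hours;
    public $hourlyRate;

    public function __construct(string $employee, float $hours, float $hourlyRate)
    {
        $this->employee = $employee;
        $this->hours = $hours;
        $this->hourlyRate = $hourlyRate;
    }
}

class PayrollController
{
    /** @param HoursWorked[] $records */
    public function show(array $records): array
    {
        $payslips = [];
        foreach ($records as $record) {
            // Business rule buried in the controller: hours over 40
            // are paid at time-and-a-half.
            $overtime = max(0.0, $record->hours - 40.0);
            $regular = $record->hours - $overtime;
            $payslips[] = [
                'employee' => $record->employee,
                'pay' => $regular * $record->hourlyRate
                       + $overtime * $record->hourlyRate * 1.5,
            ];
        }

        return $payslips; // in Laravel this would be passed to a view
    }
}
```

In a real Laravel controller the records would come from the ORM rather than being passed in, which only makes the coupling worse.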

This is an unfortunately common Laravel pattern that is taught in countless tutorials. We call the model from the controller, format the data, and then pass it on to the view. This is the easiest way to teach the MVC pattern to beginners but unfortunately it violates the SOLID principles. Let's see why, and how we can improve this code.

The "S" in SOLID stands for the single responsibility principle, which requires that each module or class should have a single responsibility for the functionality of the application.  A more subtle understanding is put forward by Robert C Martin, who says that "A class should have only one reason to change".

The thinking behind limiting the reasons for changing a class comes from the observation that software is usually developed by teams, and that a team is typically implementing a feature for a particular actor.  In our example the CEO of the company will have different requirements from the CFO, and when either of them requests a change we want to limit the impact of that change.  The actor is the reason for software to change - they request a feature and a team goes ahead and implements it.

In our controller above if the CEO requested a change then that change would definitely affect the CFO.  The teams working on the code would need to merge in code from each other.  If our code was properly designed then the controller class would be responsible to just one actor.

In this example I've moved the responsibility for calculating the employee's pay into its own object.  This object will only change if the CFO requests a change to the way that wages are calculated, and so it adheres to the single responsibility principle.  We would similarly have an object that is responsible for counting the hours.  I've chosen this way of solving the problem because the term "Facade" is heavily overloaded in Laravel and I think it would just muddy the waters to use that pattern here.
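A sketch of that refactor, again with illustrative names:

```php
<?php
// Pay calculation extracted into its own class.  It changes only when
// the CFO changes how wages are calculated - a single reason to change.
class PayCalculator
{
    private $hourlyRate;
    private $overtimeMultiplier;
    private $regularHoursCap;

    public function __construct(float $hourlyRate, float $overtimeMultiplier = 1.5, float $regularHoursCap = 40.0)
    {
        $this->hourlyRate = $hourlyRate;
        $this->overtimeMultiplier = $overtimeMultiplier;
        $this->regularHoursCap = $regularHoursCap;
    }

    public function calculate(float $hoursWorked): float
    {
        $overtime = max(0.0, $hoursWorked - $this->regularHoursCap);
        $regular = $hoursWorked - $overtime;

        return $regular * $this->hourlyRate
             + $overtime * $this->hourlyRate * $this->overtimeMultiplier;
    }
}
```

The controller can now shrink to wiring: fetch records, hand them to the calculator, pass the result to the view.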

Let's move on to "O", the open-closed principle, which requires that "A software artifact should be open for extension but closed for modification".  It was developed by Bertrand Meyer and holds that you should be able to extend a module's functionality without having to change that module.

The aim of the OCP is to protect important code that implements high level policies from changes made to code that implements low-level details.  We want the part of our code that is central to our application to be insulated from changes in other parts of the application.

There is some level of separation in our Laravel application.  We can make a change to the View without there being any impact on the Controller, but within the controller above we have no such insulation.  If we make a change to the way we read the database then we will be affecting exactly the same function that is responsible for calculating wages!

The open-closed principle seeks to prevent you from changing core functionality as a side-effect of adding new functionality to your application.  It works by separating the application into a hierarchy of dependencies.  You can extend functionality from the lower levels of the hierarchy without changing the code in the higher levels.

The "L" in SOLID is named for Barbara Liskov, who put forward what is now known as the Liskov substitution principle.  The principle holds that "if S is a subtype of T, then objects of type T in a program may be replaced with objects of type S without altering any of the desirable properties of that program".

In the example above I've amended the object to make it implement an interface.  Both the PermanentEmployeePayCalculator class and the TemporaryEmployeePayCalculator class implement this interface and can be substituted for each other.  This makes a lot more sense if you consider an LSP violation, such as this one:
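Such a violation might look like this sketch (illustrative names; the two classes deliberately share no parent, since PHP itself rejects an override with an incompatible signature):

```php
<?php
// An LSP violation: two calculators with the "same" method but
// incompatible signatures, so callers must special-case each type.
class PermanentPayCalculator
{
    public function calculatePay(float $hours): float
    {
        return $hours * 20.0;
    }
}

class TemporaryPayCalculator
{
    // Extra required parameter - not substitutable for the class above.
    public function calculatePay(float $hours, float $dayRate): float
    {
        return ($hours / 8.0) * $dayRate;
    }
}

function payEmployee($calculator, float $hours): float
{
    // The caller is forced to know about every concrete type.
    if ($calculator instanceof TemporaryPayCalculator) {
        return $calculator->calculatePay($hours, 120.0);
    }

    return $calculator->calculatePay($hours);
}
```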

This violates the Liskov substitution principle because the methods have different signatures. You cannot substitute the subtypes of PayCalculator for each other because they're incompatible.  An object that depended on them would need to implement some logic to know how many parameters to pass to the method.  Adhering to the Liskov substitution principle removes this need and removes special cases from your code.

Adhering to the Liskov substitution principle
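In code, an adhering version might look like this sketch (illustrative names):

```php
<?php
interface PayCalculatorInterface
{
    public function calculatePay(float $hours): float;
}

class PermanentEmployeePayCalculator implements PayCalculatorInterface
{
    public function calculatePay(float $hours): float
    {
        return $hours * 20.0;
    }
}

class TemporaryEmployeePayCalculator implements PayCalculatorInterface
{
    public function calculatePay(float $hours): float
    {
        return $hours * 15.0; // the day rate is folded into the class
    }
}

// No special cases: any implementation can be substituted.
function payEmployee(PayCalculatorInterface $calculator, float $hours): float
{
    return $calculator->calculatePay($hours);
}
```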
The "I" in SOLID was proposed by Robert C Martin and stands for interface segregation.  The idea is that code should not be made to depend on methods that it does not use.  By reducing the dependencies between classes you help to decouple your code, making it easier to make changes in one section without impacting others.

Let's imagine that we separated out our controller into classes like the diagram below.  We have an Employee data object that is responsible for interacting with the persistence layer and returning results.  It has a method that the PayCalculator object uses to determine whether an hour should be paid at the overtime rate, and a method that both objects use to fetch the list of hours that an employee has worked (which may or may not violate the single responsibility principle).

Violation of the ISP

The problem here is that the HoursReporter is forced to depend on the isHourOvertime() function.  This introduces an additional coupling between the classes that we need to avoid if we want to adhere to the interface segregation principle.
Adhering to the ISP

We can easily solve this problem by declaring an interface for the classes to depend on.  The interface for the HoursReporter class excludes the function that we do not want to depend on.
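Sketched in code (illustrative names, with only isHourOvertime() taken from the diagram):

```php
<?php
// Segregated interfaces: each client sees only the methods it uses.
interface ProvidesHours
{
    /** @return float[] hours worked, one entry per day */
    public function getHoursWorked(): array;
}

interface ChecksOvertime
{
    public function isHourOvertime(int $hourOfWeek): bool;
}

class Employee implements ProvidesHours, ChecksOvertime
{
    public function getHoursWorked(): array
    {
        return [8.0, 8.0, 9.5];
    }

    public function isHourOvertime(int $hourOfWeek): bool
    {
        return $hourOfWeek > 40;
    }
}

class HoursReporter
{
    // Depends only on ProvidesHours: a change to isHourOvertime()
    // can no longer force this class to be retested and redeployed.
    public function totalHours(ProvidesHours $source): float
    {
        return array_sum($source->getHoursWorked());
    }
}
```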

The last letter in SOLID is "D", which stands for the dependency inversion principle. It holds that the most flexible modules are those in which source code dependencies refer only to abstractions rather than concretions.

To understand dependency inversion, consider two things: flow of control and source code dependency.  We want our source code dependencies to be independent of how control flows through our application.

Some classes and files in our application are more prone to change than others.  They are "volatile" classes.  We want to minimise the effect of the changes in these classes to the more stable classes.  Ideally we want our business logic to be very stable and highly insulated from changes elsewhere in our system.

In the diagram below I'm illustrating a source code dependency hierarchy.  High level classes call functions in lower level classes, but in order to do so they need to depend on that class.  This means that your source code dependencies are unavoidably tied to how your flow of control works.

Source code dependency hierarchy
The problem that arises from this dependency is that it becomes difficult to swap functionality.  A change to the source code of a low-level object means that we need to rebuild all of the files that depend on it.  Admittedly rebuilding files in PHP is less of an issue than for statically typed languages that are built in advance, but we are still directly impacting files other than the one we are touching.

Let's say, for example, we had a class that outputs the Employee wages to screen.  In the diagram above we would see the Employee object as the High Level object and perhaps a "ScreenOutput" object as a low-level object.  Our Employee object calls the ScreenOutput class directly, and so we have to mention the source code in Employee, like this:
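Something like this sketch - in a real Laravel app the dependency would be a `use` import naming the concrete class at the top of the Employee file (all names here are illustrative):

```php
<?php
// Employee depends directly on a volatile concretion.
class ScreenOutput
{
    public function write(string $text): string
    {
        // The real class would echo to the screen; returning the
        // string keeps the sketch testable.
        return 'screen: ' . $text;
    }
}

class Employee
{
    public function showWages(float $wages): string
    {
        // Hard-wired source code dependency: swapping in a printer
        // means editing (and redeploying) this class.
        $output = new ScreenOutput();

        return $output->write('wages: ' . $wages);
    }
}
```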

Now our CFO asks us to be able to print out the wages using the black and white printer in her office.  Uh-oh - now we need to rewrite our source code dependency, because the "use" statement refers specifically to a concretion.

What happens if we want to make a change to the way that wages are displayed on the screen?  We can easily tweak the ScreenOutput object, but can we deploy it separately?  What impact is it going to have on all the places that depend on it?

How could we fix this problem and allow ourselves to swap functionality in and out without affecting our source code dependencies?  How do we actually decouple these objects?

The answer is to always depend on abstractions rather than concretions.  This insulates you from changes in the underlying files and lets you change and deploy parts of your application separately.

Using an interface to implement dependency inversion
In the diagram above the Employee object is calling the ScreenOutput class method through an interface.  The class has a source code dependency on the interface file (which shouldn't change often) and any code change in the ScreenOutput class will not affect the Employee object.
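In code, that arrangement might look like this sketch (illustrative names):

```php
<?php
// Employee now depends only on an abstraction; concretions can be
// swapped without touching it.
interface OutputInterface
{
    public function write(string $text): string;
}

class ScreenOutput implements OutputInterface
{
    public function write(string $text): string
    {
        return 'screen: ' . $text;
    }
}

class PrinterOutput implements OutputInterface
{
    public function write(string $text): string
    {
        return 'printer: ' . $text;
    }
}

class Employee
{
    private $output;

    public function __construct(OutputInterface $output)
    {
        $this->output = $output;
    }

    public function showWages(float $wages): string
    {
        return $this->output->write('wages: ' . $wages);
    }
}
```

The CFO's printer request now means adding PrinterOutput, not editing Employee.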

The rules to follow for the dependency inversion principle are:

  1. Do not reference volatile concrete classes 
  2. Do not derive from volatile concrete classes
  3. Do not override concrete functions
  4. Never mention the name of anything concrete and volatile

One way that you can accomplish this is through using a Factory to instantiate volatile concrete classes.  This removes the requirement to have a source code dependency on the class that you're instantiating in the object where you need it.
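For example (a hypothetical factory continuing the output example):

```php
<?php
interface OutputInterface
{
    public function write(string $text): string;
}

class ScreenOutput implements OutputInterface
{
    public function write(string $text): string
    {
        return 'screen: ' . $text;
    }
}

// The factory is now the only place that names the volatile
// concretion; everything else asks for the interface.
class OutputFactory
{
    public static function make(string $channel): OutputInterface
    {
        switch ($channel) {
            case 'screen':
                return new ScreenOutput();
            default:
                throw new InvalidArgumentException("Unknown channel: {$channel}");
        }
    }
}
```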

Laravel approaches dependency inversion by using a "service container".  Your code no longer depends on a concrete implementation of a class, but rather requests an instance of an object from the IoC container.
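Laravel's container does far more than this (reflection-based auto-wiring, singletons, contextual bindings), but a toy version shows the shape of the idea - client code names an abstraction and the container hands back whatever concretion was bound:

```php
<?php
// A toy service container illustrating the idea behind Laravel's
// IoC container (this is not Laravel's actual implementation).
class Container
{
    /** @var array<string, callable> */
    private $bindings = [];

    public function bind(string $abstract, callable $factory): void
    {
        $this->bindings[$abstract] = $factory;
    }

    public function make(string $abstract)
    {
        return call_user_func($this->bindings[$abstract]);
    }
}

interface OutputInterface
{
    public function write(string $text): string;
}

class ScreenOutput implements OutputInterface
{
    public function write(string $text): string
    {
        return 'screen: ' . $text;
    }
}

$app = new Container();
$app->bind(OutputInterface::class, function () {
    return new ScreenOutput();
});

// Client code mentions only the abstraction's name:
$output = $app->make(OutputInterface::class);
```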

In our controller code above, the IoC container returns an instance of the HoursWorked model through the Facade pattern.  The controller is not directly dependent on the source code file of the HoursWorked model.  So in this particular case we're just lucky to be adhering to a SOLID principle!

23 June 2017

How to get Virtualbox VMs to talk to each other

I'm busy writing an Ansible script and want to test it locally before deploying it anywhere.  The easiest way to make my local environment as close to my deployment environment as possible was to set up a network of Virtualbox VMs.

The problem was that I've always configured my VMs to use NAT networking.  I ssh onto them by setting up port forwarding and have never really needed them to have their own address.

The solution to this problem is pretty simple.  Just stop the machines and add a new network adapter of type "Host Only".  This adapter will handle communication between the guest and host machines.

The trick is that you need to configure the guest OS network interface too.

To do this, SSH onto your VM and run "ip addr" to list your network adapters.  If you're like me and started with NAT before adding "Host Only" as your second adapter, the output should look something like this:
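Something along these lines - interface names and addresses will vary; in this example enp0s3 is the NAT adapter and enp0s8 the new Host Only one:

```
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN
    inet 127.0.0.1/8 scope host lo
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP
    inet 10.0.2.15/24 brd 10.0.2.255 scope global enp0s3
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP
```

Note that enp0s8 has no inet line yet - it hasn't been configured, which is the next step.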

You need to identify the adapter that is your "Host Only" network.  You can do this by running "ip addr" on your host machine and looking for the vboxnet0 network address (assuming you're using the defaults given to you by Virtualbox).

Now you need to edit /etc/network/interfaces and tell Linux (I'm using Ubuntu 16.04) to set up that interface.  Add lines like this snippet to your file:
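A minimal static configuration, assuming the Host Only adapter came up as enp0s8 and you're on VirtualBox's default 192.168.56.0/24 host-only network (adjust the names and addresses to match your own output):

```
# Host Only adapter
auto enp0s8
iface enp0s8 inet static
    address 192.168.56.10
    netmask 255.255.255.0
```

Bring it up with "sudo ifup enp0s8" (or reboot), and give each VM its own address on that subnet.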

Now your virtual machines will have an IP address (you can grab it with ifconfig) that you can add to your Ansible inventory.

13 April 2017

Is PHP a good fit for an API server?

Calling PHP a double-clawed hammer is a bit of an in-joke in the PHP community.  A lot of people bemoan PHP as a language – it's fashionable to do so and it seems to be a way to look clever.  The joke came about from a blog post where somebody pointed out all of the problems with PHP (discussed further at https://blog.codinghorror.com/the-php-singularity/ ).

Anyway, PHP is a warty language that sucks in academic circles, but it doesn't matter: it's really good at web stuff, there are lots of people who know it (so it's cheap to hire), and there are lots of libraries and frameworks (so it's cheap and fast to develop in).  The commercial world is willing to overlook the academic warts.

I'm busy helping to improve the performance of an API server.  As part of my effort I'm profiling calls to the endpoints.  I'm using Blackfire to generate call graphs and also logging the SQL queries that the ORM is producing so that I can check the indexes and joins.

Here's a callgraph for a call to the endpoint where we are looking to run a paginated SQL query.  We're not applying any business logic or having any side-effects - all we're trying to do is query the database and return a JSON string to the frontend.

Blackfire call graph
That's a pretty substantial call graph for what sounds like a simple task, right?  All I want to do is route the request to a controller, query the database, and send the results back.

Blackfire tells me that 172 different MySQL queries are being run.  The PHP code responsible is using the ORM to build up the joins and so on.  I suspect that the problem is that there is pagination being applied and the ORM is not able to optimize the queries it needs to do in order to paginate efficiently.

Okay, so what questions do I have?

Why are we not querying the database more directly?  I appreciate that developer productivity is a good reason to use ORM but is it a good reason in this case?  172 queries is an awful lot, especially when a lot of them are related to querying the schema so our ORM can run.
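For contrast, a paginated listing really needs only two hand-written queries - one for the page of rows and one for the total count.  A sketch (the table name is hypothetical, and with PDO these would be prepared with bound parameters):

```php
<?php
// Build the two SQL statements a paginated endpoint actually needs,
// instead of the dozens the ORM generated.
function paginatedQueries(string $table, int $page, int $perPage): array
{
    $offset = ($page - 1) * $perPage;

    return [
        'rows' => "SELECT * FROM {$table} ORDER BY id LIMIT {$perPage} OFFSET {$offset}",
        'count' => "SELECT COUNT(*) AS total FROM {$table}",
    ];
}
```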

Why on earth does PHP have to spend so much time in disk I/O reading all of those source files when really what we need is request routing, a database query, and a response handler?  

Blackfire reports that 304KB of data was transmitted across the wire for this.  That seems like a lot of data for the five or six records that I'm returning to the frontend.

The call graph is frustrating – I'm lumbered with a whole lot of black box code and I have no control over the SQL that is being run.  How can I improve the performance of this transaction?

So is PHP the best tool for this job?

I have previously had intractable problems with PHP when it comes to memory management.  It's pretty complicated, and it differs depending on the way that PHP is run, but I do not have 100% confidence in PHP's garbage collection.

Circular object references (which I encountered while using an ORM where a model referenced itself as a parent to form a hierarchy) were not completely collected by PHP in my experience.  In practice PHP relies on the process terminating for that memory to be reclaimed.

PHP is not built to be a long-running program.  It was never designed for that, and it should not be used for that.  It was built to handle a request for a page and then terminate.

The application is bootstrapped for every request.  How much overhead does this add?  Well, there's a question that Blackfire can answer for me.  Take a look at the timeline for the transaction from before:
Blackfire call timeline
The timeline shows when a PHP function was called in relation to the time taken to generate the response. 

My controller function starts at around 750ms into the transaction.  The actual time is irrelevant as a benchmark, but the fact that the first time *my* code runs is halfway into the transaction is what is relevant.

For the first half of the transaction I've been waiting for PHP to bootstrap my application.  You could argue that this is down to the PHP framework I'm using, but really it is PHP's inability to maintain state between requests that forces us to continuously bootstrap the application.

Bootstrapping our application might involve disk I/O (depending on OpCache).  It definitely involves network I/O, because we have to connect to MySQL and wait for it to authenticate us.  I know that there are ways to improve this, like not using a framework and tuning OpCache to improve compile time.

I'm concerned about what will happen when the application has 50,000 concurrent users.  How much of a strain will it place on my database server to be constantly connecting (and authenticating)?

I think PHP is brilliant at web pages and not so good at being a long-running application that is capable of reusing resources.  I'm a huge PHP fan, but as an architect I do not want it to be my only tool.

I'm busy learning Elixir and the Phoenix framework (again with the frameworks!), which boasts response times in microseconds (not milliseconds).  I don't think we should be using PHP as the one hammer we reach for on every job.