System Theories

These are thoughts on politics, economics, systems theory, science and anything else that catches the eye of Kurt Cagle. Posted every Sunday.

Kurzweil Cities and Kunstlervilles, Revisited

Steampunk Laptop

The future is a fascinating place, yet lately it seems like we’ve reached a rather disturbing fork in the road. On one side are the techno-fetishist fantasies of Ray Kurzweil, computer pioneer and auteur of the Singularity concept, in which by 2040, the world will reach a point where everything goes hyperbolic and the distinction between humans and machines gets lost forever. The utopian world that he inhabits is one in which the compound interest derived from Moore’s Law reaches a stage where technology is all pervasive, where we are all completely interconnected, and we hit an event horizon beyond which our febrile flesh and blood brains of today seem incapable of imagining or envisioning. We ARE the network.

Take the other road, however, and the world begins to get disturbingly scary in a different way. In this future we’ve hit the peak resource wall - oil is declining everywhere, the society that’s so utterly dependent upon that oil collapses into either anarchy or a post-modern pioneer society, where being a blacksmith is a good profession, we’re hitching those carbon composite cars to the horses to get anywhere, and the technological boom will fade away as power distribution systems disintegrate. It’s not a bad life, if you like living like the Amish.

In an earlier article (one sadly no longer on the web) I called these two destinations Kurzweil Cities and Kunstlervilles. They seem to represent the two extremes of our love/hate relationship with the technological society, the one transcendent but based upon some fairly absurd assumptions, the other bleak and dark, the expulsion from the rather dubious garden of the petroleum driven Eden. Curiously there are in many ways more people who seem to long for the seeming dystopia of the Kunstlervilles, the neo-Luddites who spend their evenings stockpiling supplies awaiting the fall of civilization while blogging this fact to their technologically connected friends and cohorts.

Yet for all of that I cannot help but wonder if perhaps the fork in the road is itself largely an illusion. My gut feel - my intuition, if you will - is that societies are remarkably more resilient than we assume, even if characteristics of those societies change dramatically over time. Personally, I believe that the current cultural edifice of wealthy robber barons (read - the Financial, Military, Petroleum based superindustry and their chief investors) will collapse within the next decade.

Why? Because overall they are all too reliant not only upon hydrocarbons but even a certain grade of those hydrocarbons, namely those in and around the higher end of the “-ane” group - high octane groups. Electricity generation can be accomplished readily with other means, but the conversion of the American fleet of cars, trucks and aircraft can’t be - and the disruption this is already bringing is forcing major societal changes.

Automobile ownership among those aged 16-35 is the lowest it’s ever been, and the share of people in this age group who do not even hold a driver’s license is the highest ever. Some of that is due to the explosion in mobile devices - in addition to reducing the need to “get together” to interact, such devices also make it far easier to share cars and to coordinate schedules with other transit options. Some of it is due to the fact that this group has fewer job opportunities, but when those opportunities do arise, many of them can be done without the need to spend an hour on the road for eight hours of time wasting meetings and another hour on the way back home.

But there’s another reason as well. These kids aren’t stupid, and they aren’t as vested into the system of lies and self-deceptions that their Boomer managers and grandparents are. It makes sense to them to keep their expenses low, and a car is a major expense. Put another way, a car has gone from being a luxury, to being a necessity, to increasingly being a luxury again, and one with less and less desirability … especially if you opt out of the corporate model of life.

The same thing applies to the financial system, and with it the military. The US Military has many missions, and it is unfair to categorize all of them as negative. However, a great deal of military policy is ultimately geared towards the protection of oil supply chains globally. The financial sector has gained the primacy it has largely because militaries are expensive - the budget for the Dept. of Defense dwarfs every other department in the Federal Government. Much of this budget has been suborned by military contractors such as Halliburton, Lockheed Martin, Raytheon and others for advanced weapon systems that are, for the most part, designed for “conventional” warfare, only faster, with more muscle, and a greater “boom” factor. 

However, the writing is on the wall even there. President Obama has “ended” the US military involvement in Iraq, though there are still tens of thousands of contractors there, and, while it is a slower and more complicated process, he will, if he serves through November 2012, likely see military troops come home from Afghanistan by the summer of 2014. A trillion dollars worth of cuts loom in 2013, and the military will take the brunt of that (and likely much of that brunt will be borne by the contractors whose programs get cut). Romney is self-destructing - I put the chance that a new nominee will emerge from the GOP convention as approaching 50% at this point (with about ten weeks to “sell” that nominee to the public), so the reality is that with neither the billions spent on such systems nor the billions in fees generated by servicing the loans for this, neither the military nor financial sectors will survive without radically diminishing their influence.

One of the central tenets of the post-abundance movement is that we’ll go through a radical crash as the systems seize up, and that crash could end civilization as we know it. Yes, of course. But what about civilization as we don’t know it?

Here’s a question to ask yourself: Today, if you had to give up your car or your access to the Internet, which would you choose?

I suspect that most people born before 1960 would answer that they would far rather keep their cars. Those born after most assuredly would answer that they would far rather give up their car. The Internet means communication, accessibility, entertainment. Today, with the Internet, I can have food (from restaurants or grocery stores), furnishings and equipment delivered, can find and do work, can order finished products or raw goods, can stay informed, can promote what I do, can communicate with friends, family and customers, can educate myself and my children, can engage in politics and even societal functions.

The car, on the other hand, is primarily a means for getting you to work, for meetings or for transporting goods. It serves a secondary purpose, that of a status symbol, though again that status symbol comes at no small expense (which is, I suppose, part of the rationale for it being a status symbol). In most cases, it is inextricably tied to the type of society that is car-bound - malls, fast food restaurants, millions of miles of highways, centralized offices, gas stations, supermarkets, big box stores, and so forth. What this means in practice is that as this generation begins to gain affluence, all of these businesses that have built their business model on the car will also effectively collapse. Less travel, less need for having fast food restaurants at every highway intersection and in every shopping complex, stagnation and eventually decline of their parent companies. And if you believe the future is bright for such companies, take a look at their sales figures in the last five years.

Yet the infrastructure won’t completely collapse. Delivery becomes a bigger concern. Snow Crash’s Neal Stephenson clearly understood that - home delivery clearly becomes an integral part of the business model. Here, ironically, the second major disruptive aspect of the mobile web revolution takes place. Colby Cosh of Maclean’s magazine wrote what should be a must-read for anyone looking for what the future will bring: Artisan Chocolate and Social Revolution. In it he highlights a phenomenon that has been given names like the neoVictorian Aesthetic and the Second Arts and Crafts Era. The notion is that mass production will not suddenly cease - it is actually a remarkably efficient use of energy to create necessary goods and services, even though if carried through to its logical conclusion it also tends to eliminate almost everyone in the workforce.

What Cosh argues is that this is giving rise not just to “mass customization” but rather “artisanal” products. Artisanal products and services are those that are customized to a specific audience or customer base. His argument is that the chocolates that the team in question make likely end up using the same base products as large manufacturers, but that because they can hand-create their chocolates to individual needs - a party or celebration, a gift, specialized items for artisanal restaurants - it is in fact this customization that provides the added value.

Another good example, one that I love, is the Steampunk laptop. Steampunk, of course, epitomizes the neoVictorian sensibilities of William Morris. The fascinating thing is that most of these laptops were originally created as works of art, but they proved to be so popular that the artist ended up going full time into creating such specialized laptops. Internally, the laptop is similar to countless others, but it is the customization, and the time and talent involved in doing so, that makes these laptops so popular.

The Kickstarter project is yet another example of this phenomenon. Kickstarter makes it possible to raise small amounts of private capital from small investors in order to get a book, magazine, or production funded. The investors may share in some of the profits or receive product in compensation, but the key point here is that the investors in question aren’t billion dollar funds looking to get a high yield return on their money (indeed, these kinds of opportunities are drying up), but are instead people investing to see small, manageable products and services created that don’t have a high enough profit margin to be attractive to the moneyed interests but that nonetheless fulfill a need or want. Significantly, this is actually one of the most benign forms of capitalism out there, because it serves to create a community of interest.

On the services side, you’re beginning to see the emergence of itinerant service professionals - people who will come to your house to cut your hair, give you a manicure, instruct your children (or yourself), and so forth. Why? Because service people bill by the customer, in general, and the customers are drying up. Transportation costs shift to the provider rather than the customer - I think that’s going to increasingly be the rule moving forward as the Millennials age. You may in fact see the rise of doctors doing housecalls, a practice that became unfeasible primarily because of the distances involved in the suburban era.

And in many respects, that brings me back full circle to the question of which road will be taken. Artisanal customization is dependent upon mass production to get started, but over time, I think it will increasingly dominate as the underlying forces that currently favor mass-production fail one by one. Long term I think that the strength of regional economies in the US and Canada will ultimately outweigh the national economy. Artisanal development is an interim step there, and one that may very well be required in order to push aside the rubric of a hundred years of mega-corporate oligarchical controls over everything from employment to zoning to the intrinsic shape of cities.

In time, as supply chains begin to collapse, the scope of the artisans’ efforts will increase even as their effective reach diminishes (because of transportation costs and subsequent reduction of raw materials). Cities become more concentrated and more autonomous even as their suburban neighborhoods begin to crumble, leaving an annulus that will ultimately revert back to wilderness with towns emerging towards the periphery where distances to the city become unfeasibly far but where the town has a reason for being (a port, a crossroads, an agricultural center, etc.). The artisans may at that point end up becoming the new mercantilists, but it will take a while for that cycle to repeat to the extent that it has today (largely dependent upon energy profiles).

The Internet will be a part of this - people will fight for their ability to stay connected even as the automobile era ends. More than likely, what you’ll end up seeing is the rise of regional internet authorities in the same way that you have regional power authorities, beholden to their customer base directly but regulated by the regional governments. The regional authorities would keep the primary lines connected as much as possible, even though I suspect that at least at some time during the next several decades, the bandwidth available over the systems will decline overall for a number of years before turning around.

The Kunstlervillagers are right in thinking that society will become very unstable for a while. Where I break from them is the idea that we’re heading back to a wild-west type society as it was 150 years ago. I think that our society will be different, not so much regressed as reforged, and that for a few generations yet, these unborn children will challenge us by going in directions that may seem completely foreign to us, but that will work for them. The only real unanswered question is whether they will look upon our time as a golden age or as an object lesson.

Obama’s Drone “Problem”

This may prove a very unpopular statement, but I have been thinking about Obama’s drone problem. There is a concern among liberals that the increase in Obama’s use of drone aircraft is ominous, that he has become a murderer for the selective killing attempts through the use of targeted remote control aircraft.

There is something about drones that is quite scary, but it took me a while to figure out what it was. Then I worked it out - drones are scary because Americans have become desensitized to war. The military currently makes up less than 2% of the overall population, meaning that at any given time the vast majority of people have never been in a battlefield situation, have never seen the effects of a bomb or a missile upon buildings and people, have never lived with the fear that those in battlegrounds know instinctively, of being killed by a deliberate or a stray bullet, of stepping on a land mine, or of having a bomb go off in their vicinity.

A bomber aircraft drops hundreds of bombs in a given run. From the air, it’s fireworks. From the ground, it’s death and devastation. A missile is even worse, because such missiles can come at any time. For every “terrorist” killed by these attacks, a dozen or a hundred or a thousand may be “collateral damage”. Americans though never see this, never see the numbers or the horror involved. Yet bombers and missiles are always “honorable” because the pilot and crew put themselves at potential risk.

A drone, on the other hand, isn’t “honorable” - the person controlling the drone may be half a world away. A drone is a very precise killing weapon, however - while it is still not perfect, a drone can deliver a far less lethal attack to a far smaller and more precise target. It is not, in other words, anonymous. It is a bullet with a person’s name on it.

Obama has taken to using drones for precisely this reason. Bush (and Cheney and Rumsfeld) viewed war as an economic opportunity, a way to enrich big military-industrial players (which all three had major stakes in) by taking advantage of very imprecise, large scale military actions in order to prosecute a war against a force measured in the hundreds. They then waged a second war while the first was ongoing, primarily as a pretext for a pre-emptive invasion to capture a country’s oil delivering capability. The total number of people killed in Iraq is likely in the millions, the overwhelming majority of whom were Iraqis (and the overwhelming majority of those were either civilians or young men drafted at gunpoint into a hopeless war that they really had no skills to wage).

Obama, on the other hand, has ended the Iraq war, and has shifted the Afghanistan war (which is increasingly a war with Pakistan) away from large scale bombing actions and towards much more tightly focused tactical strikes using drones. Why? Because Obama recognizes the lie in Bush’s strategy. Afghanistan is a guerrilla war. It involves members of the Taliban, local warlords and the re-emergent Al Qaeda carrying out attacks both on their own populace and on US military forces. Obama likely would prefer to get Americans out of the line of fire entirely, but there are logistical and strategic reasons why this is both hard and time consuming.

This means that Obama’s other alternative is to treat terrorism the way that Clinton treated it, which is as a crime rather than a war. As a crime, the people who commit the crime are ideally captured, or if that’s not possible, killed, with as little collateral damage as possible. Due process is involved - a decision must be made to act against the criminal, the case for and against the action must be weighed, and ultimately the action must be carried out - with the name of a person who signed off on that decision publicly known. This is what due process is - it is a mechanism to ensure that such retribution is not made without review, and that someone, in this case Obama, must ultimately make the decision to execute that retribution, at the potential cost of his political career and possibly his own freedom.

War absolves the commander-in-chief of this responsibility. Hundreds of thousands or even millions of innocent people may die, but the commander in chief at the end of the day doesn’t really have these lives on his own soul. There are no names, and increasingly not even numbers, to tell how many innocents died. By several estimates, a minimum of 150,000 Iraqis died (possibly up into the millions). There are no names for those that died. Obama has inherited a war that he neither wanted nor feels is worthwhile, and the number of people that he has authorized to be killed is probably less than 1000, of which maybe as many as 40% were deliberately targeted.

Ideally, I think most Americans would love for it to be zero. However, with targeted drones, Americans are faced with an uncomfortable truth. Each of those bullets has a name on it now, and it is America’s burden, not just Obama’s, to acknowledge that we have killed. War is no longer anonymous, and this is good, if not necessarily pleasant. 

Artists vs. Engineers: Millennials (and Virtuals) in the Workforce

Further thoughts on the coming generations.

A recent post about Millennials in the workforce was as notable for what it didn’t say about that generation (from the perspective of a Boomer)  as for what it did. Indeed, to me it highlighted the fact that the Boomers really, really do not understand the Millennials, just as they didn’t understand the GenXers after them. This is perhaps not surprising - the Internet and the resulting explosive connectivity changed the very language of business, and to a great degree that difference can be summed up in the aphorism “If the Boomers are the ME generation, the Millennials are the US generation.”

First, the motivations that drive Millennials are VERY different from the ones that drove Boomers, and in a number of ways are different from those that drive GenXers. First and foremost, there’s a lot of pent-up anger out there in the Millennial generation towards corporations in general. When the Boomers came of age, most corporations consciously or unconsciously emulated the command and control structures of the military, because the young men that fathered those boomers had come right out of World War II, and had brought back with them not only GI Bill schooling but a very clear idea about how large organizations should be structured, which served well as the United States became the primary supplier of goods to the world in the aftermath of the destruction of WWII. The Boomers entered universities that had a similar command and control structure, and while there may have been protests and the like, once Boomers entered into the corporate world they took to it like a duck to water.

For the GI generation, the idea of being employed for life at the same corporation you started with in the mail room to the time you finally retired was pretty much a given. For the Boomer generation, ascent was in 8-10 year arcs with different corporations, each arc providing you with counter experiences to your previous job, until ultimately you ended up as senior management with the job you finally retired from. For GenXers, you were more likely to be a freelancer or consultant, in between speculative startups. Job segments were shorter and riskier, you could make a lot of money, but job security was always an iffy thing, and not surprisingly, ours is much more of an engineering generation than the Boomers ever were. As we enter into the end game of our own careers, we’re looking at uncertain futures (retirement? huh?) and typically are ending up as senior academics, heads of consultancies, researchers, senior engineers and systems architects. 

The Millennials grew up with the Internet, and are easily the most connected generation ever. Their notion of corporations is informed by Google and Facebook, not General Electric - relatively small and autonomous working groups under a larger umbrella group, team oriented but with small, spatially-disparate teammates that communicate largely via electronic means, as often as not outside cubicle walls. They have far greater job mobility, and the distinction between employment and unemployment is far fuzzier. The corporation that they work for is less important than the team they work with, and the distinction between employee and contractor - so significant to the power games of the Boomers - is pretty much meaningless to the Millennials. Money is a motivation - especially in these times - but for the most part it’s not at the top of the list of motivations, but simply a reflection of the fact that to stay connected they need to have the tools to do so, and need to have a place to sleep at night. Ironically, one benefit that this generation has is that they are likely to be great savers, because not only do they have the object lesson of the Greater Recession, but beyond improving their communication gear they really do not have a big need for materialistic possessions per se.

To Millennials, the Boomers are hopelessly materialistic, not because of any reflection of their moral failings (indeed, Millennials consider most Boomers to be sanctimonious) but because materialism does not translate well to mobility. Sales of mobile homes are actually increasing among this generation, even with the high cost of gasoline, because a lot of Millennials are accustomed to going to where the work is (or at least where a steady internet connection can be found) and see physical homes as liabilities. Millennials are also getting married much later in life, are having kids much later, and are having fewer of them (if they decide to have any at all). Marriage itself is increasingly seen as optional, and both of these have a huge impact upon business, as it means that one of the biggest factors that tends to stabilize a person’s career, the presence of a family, is delayed or absent.

This means that Millennials view employers as clients to be serviced rather than overarching structures that provide long term meaningful careers. They are loyal to a fault, but they are loyal to their circle, not to any large institution. They distrust marketing and corporate spin, find the political games and infighting in large organizations distasteful, and yet are more inclined to act by creating temporary alliances between different groups to provide a united front to meet a crisis than they are by trying to subsume those same groups.

This is one of the reasons why the dominant innovation of the Millennials thus far is the Flash attack, in which action tends to percolate quietly in the background, often well below the current mainstream, then suddenly overwhelm the media with a seemingly large and monolithic front. Boomers are used to campaigns, much like military planners preparing for the next battle. Millennials are more like guerrilla fighters - snipers coming out of nowhere then disappearing into the background. This agility is both a strength and a weakness - it can get things done very quickly, but sustained action becomes problematic, and this tends to manifest in business focus as well. Millennials are the ADHD generation - they tend to be easily distracted from long term goals by immediate needs or crises, and consequently can get bored when projects extend beyond a certain window.

Of course, the flip-side to that is that Millennials also tend to be very innovative, especially with regard to social innovation. They are reinventing “media”, moving it far outside the box that their parents envisioned, seeing news, entertainment and education as all effectively just part of a single broad digital experience. The iconic image of the world for Boomers was the Mercator projection map of the world, for GenXers is “The Big Blue Marble” and for Millennials is Google Earth. Carrying this point further, Google Earth is the world as entertainment - it is fundamentally interactive, is contributed to by a large community providing different layers of information, is generally not formally curated, combines temporality, geophysical location, images, media files and hyperlinks, and more importantly turns everything we know about geography “applications” on its ear. Millennials are not working out of the box, they are creating their own tesseracts and throwing the idea of the box out altogether.

Most Millennials are also remarkably optimistic about the future. Unlike the GenXers, who for the most part have had to deal with the disintegrating remnants of the concept of “job” that came from the Boomers, Millennials are redefining the very concept, and are doing so in ways that are increasingly moving away from the institutional view. A job is something that you do for a couple of months to perhaps a couple of years, wrap it up and move on to the next job (or perhaps jobs). More and more of that work is virtual - Boomers generally have shied away from telecommuting, as their focus has always been the office. GenXers began to embrace it, but  the generation itself tends to be more introverted than their (highly) extroverted parents so working in a solitary fashion with the occasional interactions with work was more natural to them. Millennials are like their grandparents - highly extroverted and social as a rule - but overall so tuned into the connectivity of the web that their socialization (both personal and business) tends to take place on that medium in preference even to physical interaction. Ironically, this means that commercial real estate is going to stay depressed for a long time, even as “business” picks up, because as the Millennials increasingly become the dominant workers, the need for dedicated business spaces for people will diminish dramatically.

And what of the generation after, the Virtuals? They will likely share some of the characteristics of Millennials (certainly the connectivity aspect), but will also tend towards introversion (this cycling of extroversion and introversion seems to be a generational characteristic), and all that implies. At the moment, the leading edge of the Virtuals is 12-13 years of age, so it is difficult to generalize, but there are several intriguing signs. Test scores for Virtuals in mathematics and science have been going up in comparison to those of Millennials (which went down in those areas when measured at the same age), and interest in those fields is rising.

The Virtuals generally have a higher incidence of Asperger’s and high-functioning autism per capita than the Millennials did, which usually manifests as delayed social development but higher focus or intensity in specific areas. They are not as media driven, and ironically they are more inclined to play strategic games and build applications than communicate with their peers over computer or smart tablet environments. They are more avid readers, however, and tend more towards non-fiction or speculative fiction than their Millennial brethren did. If SMS and social media were the iconic symbols of the Millennials, for the Virtuals it’s tablets, and likely virtual glasses as they start rolling out towards the end of 2012 and into 2013. Their world will be immersive - the web will simply be an overlay on everyday life, and everything in that world will have information and context. “Traditional” academia will also be crumbling pretty dramatically by this point, and the Virtuals will far more likely be self-educated and auto-didactically skilled - education will be unable to keep up with the disruptive changes and challenge to its authority that the coming era of Big Data augurs, and while Virtuals will be considerably more knowledgeable (and potentially skilled) in specific areas than any previous generation, they will largely be building the edifices which would nominally be educating them.

As a generation, they will be entering the workforce at a time when there will be massive upheavals in the corporate and political world as Millennials become the prime shapers of social policy and direction. Indeed in many respects they will be the primary agents by which these radical reforms are actually implemented (just as GenXers built the web that was largely envisioned by Boomers). The office of the future (circa 2030) will be notable primarily for being non-existent. Businesses will still exist, but retail will be a far reduced shadow of itself (and malls will likely end up being repurposed as work hotels where spaces get rented out as needed, if they don’t get torn down outright). Big box stores will become fulfillment centers for online retailers from grocers to clothiers to automobiles. Work will be done by ad hoc groups working distributed, with perhaps half of those working from home. Manufacturing will shift to mass cottage industries (pay very close attention to 3d printers), and zoning will have to take into account the rise of new residential/light industry sectors.

Note that I suspect this will be the case perhaps even more if we are in a diminishing resource environment. Short of a complete catabolic societal collapse, which is possible but unlikely, what will far more likely happen is that society will adapt to a mode where driving an hour each way to work every day will become prohibitive, where work will likely be either immediately local or will be far enough away that travel on a regular basis to it is not feasible (it’s also worth noting that the Millennials are the first generation since the 1930s in which driving does not play a prominent role, and this will carry through in their approach towards work … if they have to drive any significant distance to get to it, they won’t be interested in taking the job).

On the other hand, this is also a generation where marriage occurs late and child rearing occurs later if at all, and this means that the Millennials will be far more likely to hop - migrating from one city to another to take on a job for a certain period of time, then moving to the next city. Ironically, this mode doesn’t necessarily involve a car - the Millennials will tend to travel very light (a tablet, a couple of changes of clothes and toiletries), will travel by train or bus, and will rent a car as needed rather than own one outright. They are also growing up distrusting big business and big government simultaneously, and this means that they will tend to be very conservative both in their spending and saving.

Finally, Millennials are already defined by the cohesiveness of their extended networks as compared to older generations. This is a generation of specialty convention goers, and many of their closest relationships will be shaped by common interest rather than by geography. From an outsider’s perspective they may appear somewhat childish, but these conventions serve much the same purpose as bars did to an older generation - a place to meet others and establish new relationships (romantic and otherwise). This generation is also less likely to do hard drugs or become alcoholics than previous generations were (as demographic trends seem to be bearing out).

The combination of living light (which places a far lower demand on finances than maintaining a house, car, furnishings, and so forth) and demand for mobility means that work will tend towards transient relationships as well - it simply will not play as big a role in the lives of Millennials compared with their social life. (They also will tend to stay “in the nest” far longer than preceding generations.) That doesn’t mean that they will be beggars - that same mobility will translate into a penchant for saving rather than spending, and when they do finally “settle down” towards the end of their 30s, they will likely be far better off than the preceding GenXers at that age. 

It’s hard to say what the longer term characteristics of the Virtuals will be - the oldest is now twelve - but there are a few indications. Expect Virtuals to be homebodies - they will establish nests, workshops, and other bases of operation fairly early, will likely not be anywhere near as transient as Millennials, and may be somewhat more materialistically inclined. They will see Millennials as flighty and somewhat inconsequential, too hung up on media and rather spoiled. As children, they will have grown up during fairly harsh times, and as such Virtuals will likely also be thrifty, but in different ways than the Millennials - they will be inclined towards saving as a defense against potential downtimes vs. saving as a consequence of a light living style. The Millennials will envision the social foundation for the century, and the Virtuals will be the ones to lay down the infrastructure to support it - the artists vs. the engineers.

From Dividends to Bonuses

Suppose for the moment that companies that have been paying extraordinary dividends of late to stockholders were to do the unthinkable: Every other month, all of the money that would have gone into the dividend pool was instead given as bonuses to employees, at all levels. The investors, of course, would probably howl - after all, they invested money into the company in order to assure a steady income. A few might be so disgruntled they might actually pull their money out, though probably not all that many - pulling your money out of a company is a fairly cumbersome process, especially if you’re still receiving a regular payment.

However, think about what happens on the other side. Employees have more money in their paycheck once every couple of months. Those who might have been on the verge of looking for something else (which as often as not are your more expensive, harder to keep employees) start to cut down on the number of resumes they are sending out. People may begin to want to come into work just a bit earlier, spend less time on Facebook and more time on making the company itself profitable - they, after all, now have a stake in the company’s success. They become less worried about being fired (which often contributes to a high stress level that reduces productivity) and more focused on doing their job. Hiring becomes easier, because your first best source of hiring is referrals from your own employees.

When this happens on a larger scale, those people have more money to spend. This increases the profitability of other companies, and means that other companies’ employees have more money to spend on your goods or services. Business increases, which means that the stock dividends that were being paid out earlier also increase, along with the values of the investors’ other stocks in their respective portfolios. People are able to pay more in taxes, so the quality of infrastructure around them improves. More people get jobs, so the drain on the unemployment system drops.

This is also a solution that doesn’t require large scale government action. All it takes is for a few companies to go down this path, asking investors to take a back seat for a little while in order to protect the long term viability of the businesses in question. The more companies that do this, the more that business will begin to improve, and the faster that companies reach a point where they are paying out as much as they used to in dividends. The alternative is a slow long term anoxic death, where companies are increasingly forced to use their internal funds to keep their quarterly earnings consistent, until one day that money runs out.

Technology and Generations

A couple of days ago I came across a story about NASA’s interest in drawing the next generation of students into science and technology careers (the so-called STEM, or Science, Technology, Engineering and Mathematics, fields). It’s been one of the greater mysteries in technical circles what caused such a massive fall-off in the number of people pursuing technical degrees in the early 2000s as compared to the 1990s, and most of the obvious explanations (the tech recession in 2000, for instance) have always seemed rather facile to me. I actually think the reason is deeper, and if I’m right then this may in fact be the perfect time for policy makers to be investing in STEM related educational programs.

Recently, I had a chance to thoroughly read the classic work The Fourth Turning, by Strauss and Howe. For those not familiar with their theories, the core idea is that there is a societal cycle called a saeculum (from which we derive words like secular) that roughly spans 80-90 years. Each saeculum consists of four generations, each of which tends to have similar values, motivations and philosophies, and each of which interacts with the other generations in a clear and distinct pattern. As each generation moves through the various stages of life (youth, adulthood, middle age, senescence), each of which tends to be 18-20 years in length as well, its members also tend to have very different concerns, expectations and desires. As prior and succeeding generations are also moving along those same stages but offset, this means that there are distinct configurations that describe the psycho-social characteristics of these generations.

I believe that the current drought in STEM interest (except in certain very specific areas) may actually be generationally driven, and both points to the likely characteristics of the incoming generations and gives a road map that educators and policy makers should pay attention to closely.

A good reference point to show this is to look closely at the Baby Boomers and how they ended up shaping both business and society. The Boomers (born from 1943 to 1961, using what I think is a realistic ethos-based rather than demographic division), for the most part, were not engineers or scientists, though there were several notable engineers and scientists in that generation. It was the GI generation that built the space program, created the first computers, and built much of the highway and electrical infrastructure of the country. The Boomers were marketers, managers and salesmen. They were the corporate warriors, and as they moved into the workforce, the engineering ethos of the previous generation was replaced with the marketing ethos of this one.

The GenXers (born from 1962 to 1981), on the other hand, were engineers of sorts, but their playground was not space, but computer technology and biotech. They did the bulk of the programming, designing, engineering and analysis work of the Internet and of the Biotech revolution. What’s interesting is that as the Boomers retire and the GenXers begin to replace them on the other side of the generational gap, the focus of management, of education, and of policy is going to shift increasingly towards problem solving - not “How do we make the most money doing this?” but “How do we solve the problems we’re facing in the most efficient and elegant manner we can?”

I’d argue that this represents a radical shift in thinking in society. It’s hard for a 60 year old C-level manager whose uppermost thought during the day is “How can we improve the share price of our company?” to understand the motivations of a 42 year old senior engineer who’s looking at finding the optimal solution to building a software system. More significantly, when that 42 year old becomes the 60 year old CEO of the company nearly two decades later, her motivation is not enhancing share price, but building the software products that meet the greatest needs of her customers, with shareholder value far lower on the priority chain. The company structures will be different, the valuation systems will be different, EVERYTHING will be different. They will be focused on SOLVING PROBLEMS.

The Millennials (1982 to 2000), on the other hand, are media people. They grew up in the silver age of Social Media. The Internet had reached a point of complexity that it could start supporting a number of different kinds of media, and the communication aspects of the Internet are far more important to them than the technical aspects. For many of them, there was never a time where the Internet didn’t exist. The oldest of the Millennials are now out of college, they are intensely anti-marketing (this is the generation under which media deconstruction hit its high point) and they are highly genre savvy. This is the generation that will a hundred years from now be seen as the artistic giants of the twenty first century.

However, it is the next generation, what I call the Virtuals (born 2000-2018), that will be the bringers of the next wave of technical innovation (outside the media space). This is a generation that will have high capacity gene sequencers, big data cloud infrastructures and semantically aware computer systems, mobile sensor networks, near-earth commercial space travel, LEDs and memristors and high voltage solar “fabric” and all the things that are emerging largely from the work of the GenXers (who are now going into research rather than management) before most of them are out of high school.

The oldest Virtual at this point is twelve years old and is in sixth grade. The youngest will not be born for another six years. The Virtuals are not like the Millennials. I have two children - one born in 1993, the other born in 2000. The elder of the two is a classical Millennial - she’s into cosplay, animation, computer graphics and computer games, and social media. She’s entering college in media arts, and I fully anticipate that she’ll find herself very much caught up in a world where creatives are very much in demand and where the rules of society are rewritten daily. She’s a social deconstructionist.

My youngest was born in 2000, and she is what I believe many Virtuals will be like. She’s more literal than her sister, was programming game levels by the time she was seven (and taught herself how to read off the Internet), and is rather scarily good at finding the information that she needs to educate herself. She’s a technical synthesist. She has trouble with school though, because school doesn’t work the way she thinks - she can find information, but she’s having trouble learning strategies for synthesizing that information. Of course, the schools themselves haven’t really caught up with this fact - they’re just starting to come to grips with the fact that the Millennials exist in a world that is global, is more engrossing than school, and is mediated by networks - and many of those Millennials have already graduated.

As not so much of a diversion here, I think education is a critical part of any society, but I rather despair at the educational system in the US. The content of it is designed by values-conscious Boomers determined to put a stamp of morality and jingoistic patriotism on it (while minimizing the importance of science in many parts of the country), and implemented by technical GenXers who chafe under this system and despair about the Millennials, who all seem like ADHD candidates permanently wired to their smart phones and who for the most part are more interested in video games and cosplay than in IMPORTANT THINGS (even as the GenXers themselves wonder whether what they’re teaching is worth anything). And of course, STEM (science, technology, engineering and mathematics) courses of study have seen a massive drop in participation. We’re becoming a nation of gamers and idiots.

Except I’m not so sure that’s really the case. The Millennials are the counter-stroke generation to the Boomers - interested in art and literature, philosophy and media, architecture and music. They are communicators first and foremost, but they really have in the aggregate comparatively little interest in the technical except as it relates to these areas.

The Virtuals, on the other hand, will be technical synthesists. The GenXers have built the scaffolding and infrastructure that the Millennials use for communication and social bonding, but they have also built the scaffolding and very early infrastructure for the Virtuals to build on in combining bio-engineering with information management, for building and designing specialized energy aggregators and generators, and for integrating all of these together into a cohesive technical superstructure of applications (one that reengineers the human body all the way up to the height of the human noosphere). They will in fact be the ones that rebuild the technical underpinnings of society, quite possibly as the world that the Boomers built finally collapses under its own weight.

The GenXers started entering college (the start of adulthood) in 1982, and it’s noteworthy that the number of students graduating in STEM fields started picking up dramatically by 1986. It hit its peak in 1995, four years after the GenXer population peaked (and four years after they entered college). By 1999, even though the tech field was still hot, STEM graduates were declining again. Where were the (now) Millennials going? New media, gaming, communications, web design, graphics, as well as a noticeable pick up in theatre arts, writing, photography and similar fields. Certainly the technologies were now coming online to make these fields attractive, but it’s worth noting the places they weren’t going - not just STEM (except for technology related to the communications revolution) or medicine but also law, finance, business or even the more humdrum aspects of marketing and sales, places where, ironically, the tools and technologies were just as well developed.

The Millennials are now coming out of college - they hit their peak in 2009 and there’s some evidence to indicate that the number of graduates in the media arts arena is leveling off, consistent with a graduation peak of about 2013. It’s also worth noting that most generations have somewhat different characteristics pre- and post-peak. Pre-peak members carry shadows of the previous generation that color their attitudes and beliefs. Post-peak members get “premonitions” of the next generation, sharing more and more of its values. At the cusp points between generations, you often end up with people who are generalists, not necessarily strong in any one generation but often being renaissance characters that don’t easily fit into any generation.

If, as I suspect, the Virtuals end up being technological synthesists (as opposed to the GenXer’s role as technological analysts), then 2013 will also mark the trough of STEM graduates, and the trend should turn around. However, their focus is going to shift - alt-energy vs. geologist engineers and chemists, distributed AI construction (possibly with robots and telepresence) vs. business applications, life-form engineers vs. geneticists and oncologists. As a generation they will be very utilitarian and focused compared to the previous generation (whom they will consider as being rather frivolous and perhaps overly indulged). The mid-point in the trend will occur around 2022 with the generation peaking in 2031 in terms of STEM graduates.

Of course, this also brings up an interesting conundrum. The Millennials are for the most part community oriented, though that community is defined virtually rather than physically. This means that their optimal learning style (all other things being equal) is one where learning takes place via interactions with their peers, and social awareness is considered of greater value than technical competence. There, the principal role of the teacher is very much that of the mediator and director, shaping the conversations towards the completion of communal projects.

Virtuals, on the other hand, are already showing that they respond best to autodidactic approaches to learning, where they learn by doing, research what they need when they need it, and generally find traditional teaching methodologies to be confusing at best and counterproductive at worst. As it turns out, this is in fact the best way to learn science, where the role of the teacher is primarily that of advisor rather than authority. The students also tend to gravitate to an apprenticeship model, where you have a master with a limited number of apprentices and journeymen (the pairing of a GenXer with one or more Virtuals is a particularly effective combination), especially as the GenXers will be entering senescence at this stage in their own lives, when their principal role is to be teachers and advisors rather than decision makers.

There’s been a pendulum swing towards anti-intellectualism that seems to be reaching its peak in the US, but we may in fact be near the end of that swing. The Boomers entered into the period of senescence starting around 2000 (these things tend to be fuzzy, +/- three years), and the Boomers have generally been the generation of the salesman. In conjunction with senescence this has meant that the Boomers have been focused on physical and financial security, mortality, and maximization of financial assets. They also have tended to push conformance to the status quo, which, given the demographic size of the group, has generally meant THEIR status quo, and in old age this has tended to result in dogmatic uniformity, ideological rigidity and a move towards centralization.

By 2009, the peak of the Boomer generation had entered senescence (and moved out of a decision making capacity). By 2018, the Boomers will be completely within senescence, with the GenXers fully invested in the decision-making “Middle Aged” bracket. Since societal direction tends to be determined largely by this bracket, this again hints at society beginning to shift towards more pragmatism, more focus on problem solving rather than profit maximization, and more of a need for (and respect of) scientists and technicians. Just as with the rise of STEM graduates, society itself is beginning to move back towards a mode where the problem solvers, rather than the empire builders, are coming to the fore. Personally, I think it couldn’t happen soon enough.


One final note. The one area where I break with Strauss and Howe is in their designations of saecular titles. In the fourth turning (the one we’re in now, extending from 2000-2018 +/- a few years) the Millennials are “Heroes” while the Virtuals are “Artists”. I believe that a perhaps more accurate way of thinking is to see Millennials in this phase as Social Deconstructionists (with the Boomers being Social Constructionists, promoting the status quo, and GenXers being Technical Constructionists, building technical infrastructure). This means that Virtuals would be Technical Deconstructionists - they will be the mix and match generation, crossing technical disciplines, questioning the technical status quo.

Deconstructionism in literary terms is the process of identifying literary tropes (cultural assumptions), and deconstructing them in an attempt to understand how they work, why they work and how they can then be reconstructed to more closely model the world. Technical construction effectively builds on existing infrastructure to create new works, while technical deconstruction is the process of re-examining those core assumptions, discipline boundaries and underlying physical constraints and creating whole new directions with them. The OWS movement is fueled largely by early cycle Millennials (just as the Tea Party is primarily made up of early cycle Boomers). GenXers largely were tool builders, Virtuals will be tool users.

Okay, THIS is the final note and a pet peeve. GenXers have generally gotten a bad rap compared to the Boomers - introverted to the Boomers’ extroversion, indifferent to material success compared to the Boomers’ avid capitalistic streak, perceptive and slow to make judgements or decisions compared to the Boomers’ decisive leadership and charisma, pragmatists to the Boomers’ idealism. Yet it was the GenXers who were mostly responsible for the creation of the web, probably the single most important invention of the last century. The Internet was initially a construct of the GI generation, while the web was conceived by a late cycle Boomer (Tim Berners-Lee, born in that incredible technical banner year of 1955, the same year that both Bill Gates and Steve Jobs were born), and implemented for the most part by GenXers (Marc Andreessen, Linus Torvalds, Dan Connolly, Roy Fielding, many others).

XSLTForms Gets AVTs!!!!!

XSLTForms Gets AVTs!!!!! Okay, so five exclamation marks is typically one of the signs of a truly deranged mind, but in this case I think the enthusiasm may very well be justified.

XSLTForms has gained a small, dedicated following over the years, but because of the general feeling among the web community that XForms has been a failure (probably because it uses that damned XML language) it’s long been an underappreciated gem. Yet due to the hard work of XSLTForms’ Alain Couthures, the client side form language today picked up several powerful new additions in what is billed as the XSLTForms 1.0 Release Candidate 1.

Introducing AVTs

Alain is on the W3C XForms working group (as am I) and has been closely following (and shaping) the direction of XForms 2.0. There are a number of features that are very exciting in this next version of the venerable specification, but one of the most important is a capability called Attribute Value Templates, or AVTs.

For anyone who has used XQuery or XSLT, AVTs are old hat. In those languages, an AVT is a block of text within an attribute, surrounded by braces “{}”, that, when processed, is evaluated as an expression in the host language. For instance, in XQuery you may have an XML document of the form:

<records>
     <record>
           <id>1</id>
           <name>Jane Doe</name>
           <occupation>adventurer</occupation>
           <url>/person/JaneDoe.html</url>
           <level>12</level>
     </record>
     <record>
           <id>2</id>
           <name>John Dee</name>
           <occupation>researcher</occupation>
           <url>/person/JohnDee.html</url>
           <level>14</level>
     </record>
     <record>
           <id>3</id>
           <name>Jean Dair</name>
           <occupation>administrator</occupation>
           <url>/person/JeanDair.html</url>
           <level>13</level>
     </record>
</records>

You can create a table from this data in XQuery using AVTs:

<table>
     <tr>
           <th>Id</th>
           <th>Name</th>
           <th>Occupation</th>
           <th>Level</th>
     </tr>
{for $record in $records/record return
     <tr>
            <td>{fn:string($record/id)}</td>
            <td><a href="{$record/url}">{fn:string($record/name)}</a></td>
            <td>{fn:string($record/occupation)}</td>
            <td>{xs:integer($record/level)}</td>
     </tr>
}
</table>

Anything within braces is evaluated as XQuery. Within the @href attribute (which is used to place a link on the displayed name from the record) the content (in this case the URL) is also automatically evaluated as text, which is then used as the link target for the item itself.

XForms 1.x doesn’t have this capability.  This means that in order to do the analogous operation with XForms, you have the rather unwieldy:

<tr>
<xf:repeat nodeset = "record">
      <td><xf:output ref="id"/></td>
      <td><xf:trigger ref="." appearance="minimal">
             <xf:label><xf:output ref="name"/></xf:label>
              <xf:action ev:event="DOMActivate">
                     <xf:load>
                            <xf:resource value="url"/>
                     </xf:load>
              </xf:action>
            </xf:trigger>
       </td>
      <td><xf:output ref="occupation"/></td>
      <td><xf:output ref="level"/></td>
</xf:repeat>
</tr> 

Consequently, it’s perhaps not surprising that XForms has a reputation for verbosity. Yet suppose that XForms had AVTs. Then the statement above could be rendered as:

<tr>
<xf:repeat nodeset = "record">
      <td><xf:output ref="id"/></td>
      <td><a href="{url}"><xf:output ref="name"/></a></td>
      <td><xf:output ref="occupation"/></td>
      <td><xf:output ref="level"/></td>
</xf:repeat>
</tr> 

This results in considerably less code and much better legibility. However, there’s another, more subtle benefit here. If the XForms application is able to add new records to the record collection given above, then when a record is added, the repeat statement will recalculate and render the new record with the proper URL as part of the sequence.

AVTs and CSS

AVTs can work well with CSS. For instance, the following XForms with AVT can be used to set both the foreground and background color of the page from a form:

<html xmlns="http://www.w3.org/1999/xhtml" 
    xmlns:xf="http://www.w3.org/2002/xforms">
    <head>
        <title>Background Colors</title>
        <xf:model>
            <xf:instance id="colors">
                <colors xmlns="">
                    <color name="Blue" code="#0000FF" complement="#FFFFFF"/>
                    <color name="Green" code="#00FF00" complement="#000000"/>
                    <color name="Olive" code="#808000" complement="#FFFFFF"/>
                    <color name="Orange" code="#F87A17" complement="#000000"/>
                    <color name="Pink" code="#FFC0CB" complement="#000000"/>
                    <color name="Purple" code="#800080" complement="#FFFFFF"/>
                    <color name="Red" code="#FF0000" complement="#FFFFFF"/>
                    <color name="Yellow" code="#FFFF00" complement="#000000"/>
                </colors>
            </xf:instance>
            <xf:instance id="state">
                <state xmlns="">
                    <color>#0000FF</color>
                </state>
            </xf:instance>
        </xf:model>
    </head>
    <body>
        <div id="background"
            style="position:absolute;left:0;top:0;width:100%;height:100%;
background-color:
{instance('colors')/color[@code=instance('state')/color]/@code};
color:{instance('colors')/color[@code=instance('state')/color]/@complement}">
            <h1>Background Color</h1>
            <xf:select1 ref="instance('state')/color">
                <xf:itemset nodeset="instance('colors')//color">
                    <xf:label ref="@name"/>
                    <xf:value ref="@code"/>
                </xf:itemset>
            </xf:select1>
            <input type="text"
                value="{instance('colors')/color[@code=instance('state')/color]/@code}"></input>
        </div>
    </body>
</html>

This actually shows a number of AVT uses. When the user selects an item from the color list drop down, this sets a second “state” instance’s color value. This in turn is used to re-evaluate the AVTs within the @style attribute of the “background” div and sets the background color to the corresponding color code. It also does a look-up with this color value to retrieve the corresponding “complement” color, which is used to set the foreground color of text in order to make the text visible against dark colors.

An AVT is also used within an HTML input element to display the color code. This points out one restriction on AVTs - they are intended as read-only properties. You can overwrite the value of the above input element, but it won’t reflect back into the model - the next time the model refreshes, the AVT will re-evaluate its template expression and replace whatever you typed in.

If an XForms statement had been used instead, it would look more like:

<xf:input ref="instance('colors')/color[@code=instance('state')/color]/@code"></xf:input>

which will in fact change the underlying model, but it’s worth noting that this isn’t an AVT - the ref attribute evaluates not to a string but to a node, and there are no braces.

Note: I found that when an HTML and XForms input referenced the same node, errors were generated, so it’s likely that this hasn’t been fully debugged yet.

A similar approach can be used with images. An AVT in an image @src attribute can be used to set the image to that particular URL, while an AVT can also be used with the CSS background image property:

<div style="background-image:url({color/@url});">Content</div>

… in order to set an image for a background. 

If you, like me, serve up your XForms from XQuery, in which the {} characters are already overloaded to handle XQuery AVTs, you can still use XForms AVTs by simply escaping the braces - using two left or right braces instead of one. XQuery treats each doubled brace as a literal brace rather than as the start or end of an expression, so the client receives the template wrapped in single braces.
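The two-stage handling of braces can be sketched in a few lines of JavaScript. This is only a toy model of the escaping rule, not how XQuery or XSLTForms is actually implemented; the names expandBraces and evaluate, and the use of a plain object in place of an XForms instance, are assumptions made for illustration.

```javascript
// Toy model of brace handling in an attribute value template.
// NOT the XQuery or XSLTForms implementation; expandBraces and
// evaluate are hypothetical names used only for illustration.
function expandBraces(text, evaluate) {
    let out = "";
    let i = 0;
    while (i < text.length) {
        const ch = text[i];
        if ((ch === "{" || ch === "}") && text[i + 1] === ch) {
            // A doubled brace is an escape: emit one literal brace.
            out += ch;
            i += 2;
        } else if (ch === "{") {
            // A single left brace opens an expression that runs to the next "}".
            const end = text.indexOf("}", i);
            if (end === -1) throw new Error("Unclosed { in template");
            out += evaluate(text.slice(i + 1, end));
            i = end + 1;
        } else {
            out += ch;
            i += 1;
        }
    }
    return out;
}

// Stage 1: the server (standing in for XQuery) sees only doubled braces,
// so it never evaluates anything and emits a single-braced template.
const served = expandBraces('<a href="{{url}}">', function () {
    throw new Error("nothing should be evaluated on the server");
});

// Stage 2: the client (standing in for XForms) evaluates the template
// against a record; a plain object replaces the XForms instance here.
const record = { url: "/person/JaneDoe.html" };
const rendered = expandBraces(served, function (expr) {
    return record[expr];
});
```

After the first pass, served holds the template with single braces; after the second pass, rendered holds the attribute with the record's URL substituted in.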

XSLTForms and JavaScript

In addition to the AVTs, there are a few additional features in the XSLTForms RC1 release that are worth exploring. One of them is the ability to use JavaScript functions directly from XPath. In this particular case, the functions are marginally limited in that only strings can be passed in or out, but with JavaScript this is generally not that much of a limitation.

A very simple example of this can be illustrated with a page in which the user enters an angle, and the JavaScript code then generates the sine, cosine and tangent values for that angle.


<html xmlns="http://www.w3.org/1999/xhtml" 
    xmlns:xf="http://www.w3.org/2002/xforms">
    <head>
        <title>Trig Functions</title>
        <xf:model>
            <xf:instance id="state">
                <state xmlns="">
                    <angle>0</angle>
                </state>
            </xf:instance>
        </xf:model>
        <script type="text/javascript"><![CDATA[
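// Evaluate Math.sin, Math.cos or Math.tan for an angle given in degrees.
// XPath passes the angle in as a string; the arithmetic coerces it to a number.
var trig = function(op,angle){
    return Math[op](Math.PI * angle / 180).toFixed(4);
    };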
]]></script>
    </head>
    <body>
        <div>
            <table>
                <tr>
                    <th>Angle</th>
                    <th>sin()</th>
                    <th>cos()</th>
                    <th>tan()</th>
                </tr>
                <tr>
                    <td><xf:input ref="angle" incremental="true"/></td>
                    <td><xf:output value="trig('sin',angle)"/></td>
                    <td><xf:output value="trig('cos',angle)"/></td>
                    <td><xf:output value="trig('tan',angle)"/></td>
                </tr>
            </table>
        </div>
    </body>
</html>

XPath 1.0 by itself doesn’t support standard trigonometric functions, but JavaScript obviously does. In the code above, the trig() function takes the name of one of the Math functions (normally called as Math.sin(angle), etc., but here called as Math["sin"](angle)) and evaluates that function using the string value of the /state/angle node (which is implicitly converted to a float), then fixes the length to four digits after the decimal point before returning it to the XPath context.
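Because the angle arrives as whatever the user has typed so far, a slightly more defensive variant may be worth considering. The following is a sketch only; the whitelist of operations and the choice to return an empty string for bad input are my assumptions, not something XSLTForms requires.

```javascript
// Defensive variant of the trig() helper; a sketch, not part of XSLTForms.
// Returning "" for bad input is an assumption made for this example.
var trig = function (op, angle) {
    var allowed = { sin: true, cos: true, tan: true };
    if (!Object.prototype.hasOwnProperty.call(allowed, op)) {
        return "";                      // unknown operation name
    }
    var degrees = parseFloat(angle);    // the node value arrives as a string
    if (isNaN(degrees)) {
        return "";                      // empty or non-numeric field
    }
    // Convert degrees to radians, evaluate, and keep four decimal places.
    return Math[op](Math.PI * degrees / 180).toFixed(4);
};
```

With this version, trig('sin', '30') returns the string "0.5000" and trig('tan', '45') returns "1.0000", while an empty input field yields an empty output rather than "NaN". Angles such as 90 degrees still produce a very large tangent rather than an error, since floating point never lands exactly on the asymptote.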

Such functions could obviously be far more sophisticated. For instance, if a particular web service only returns JSON rather than XML, this function can retrieve the JSON, store this in a variable closure, and either update the web page or translate this back into the model. Additionally, since it is possible to get access to the XForms model itself, these functions can provide an additional bridge between the XForms model and alternative representations.

The js-eval() function is also available with this update. It takes a JavaScript expression and evaluates it, then returns the results (think of it as an anonymous function). There’s obviously much more that can be done with these functions (and I have another article specifically about XSLTForms and JavaScript in the works to cover a few of these), but the key takeaway is that XSLTForms is a very JavaScript-friendly client library.

Wrap Up

In my experience working with AVTs, there were several times when the code generated errors, so I’m not sure I’d necessarily recommend that you go with the XSLTForms AVT implementation unless you’re willing to spend a certain amount of time debugging. The AVT library is also still very much HTML specific - a test with embedded SVG caused several problems (which is a shame, because I think AVTs and SVG were made for one another).

However, even given these caveats (which I fully expect to see resolved with the next major iteration) now is the time to start playing with AVTs in XSLTForms. The ability to control CSS from imported XML models by itself is a huge win, as is the ability to take advantage of parameterization of HTML elements via their attributes coming from an XForms model (and impacting that model in return).

The most recent code update for XSLTForms is available through the SourceForge repository at http://sourceforge.net/projects/xsltforms/develop. More information about XSLTForms itself is available from the XSLTForms support mailing list archive at http://sourceforge.net/mailarchive/forum.php?forum_name=xsltforms-support.

MarkLogic, eBookmarks and Distributed File Systems


In the last couple of weeks I’ve been going through Tom White’s excellent book on Hadoop by O’Reilly Press as an e-book. The eBook experience has actually been a somewhat new one for me - while I’ve been involved in producing eBooks for a while, I’ve not generally been a big consumer of them until lately. One of the things that I find most satisfying about them (especially through Kindle Cloud) is that I can read a chapter on my smartphone, then open up the same book on my laptop and pick it up at exactly the same point - complete with my annotations and commentary. It only increases my own intent to publish The Jane Doe Project, my Steampunkesque science fiction novel, to eBook when I finally finish the revisions.

Something else that comes out of this, however, is the growing awareness of the whole issue of intelligent distributed file systems. eBooks (and their platforms) are actually a very good starting point for considering what is possible with such a system (and some of the complexities that are involved). The idea here is that I have a resource - call it an eBook for now - that can be accessed via a URL. That resource not only has the contents of the book in question (in whatever format) but it also includes a metadata layer that identifies the resource via a specific “name” or “id” (most likely a URI of some sort).

Additionally, somewhere, there is a database of some sort that contains bookmarks that link both to a user in the system (which also has a corresponding data record) and to the book id itself. When the URL is called, it invokes a RESTful application that combines the book data with the bookmark data (after confirming the identity of the user) and passes this bundle to the associated client, either directly or more likely through a couple of related calls. The client in turn will take the book data, combine it with the bookmark and annotation data and provide the options to the reader to “sync” their books so that the latest bookmarks are kept. Once synced, the stored bookmark data can then be used both to restore the reader’s position in the book and to display the corresponding annotations.

Similarly, when you create a new annotation (or change pages), the new bookmark, possibly with annotation, gets sent to a collection of bookmarks with both the book id and the user id as  keys. The sent annotations (likely an XML file though possibly a JSON file) then get read back into the database, either replacing or modifying the collection of bookmark resources. Note that when you ask for new annotations from the web, using the userId and bookId as keys, what gets returned is a wrapped “collection” of bookmark resources, typically ordered by date (though this isn’t always a given) that satisfy the query for both keys, where an individual bookmark “document” might look something like this:

<bookmark>
      <bookmarkId>1001</bookmarkId>
      <bookId>2853</bookId>
      <userId>jane_doe</userId>
      <bookmarkType>annotation</bookmarkType>
      <position>19523</position>
      <annotation>This is an interesting passage, though I rather wonder
if it could have been written better.</annotation>
      <lastUpdated>2012-02-05T11:59:17.02851-05:00</lastUpdated>
</bookmark>

In this case, it’s assumed that the various ids are indexed, and lastUpdated is set as a dateTime range index. 

A MarkLogic XQuery script to retrieve these is simplicity itself, given the relevant keys. Let’s assume for the moment that you have a script called /scripts/bookmarks.xq, defined as follows:

(: Return all bookmarks for one book and one user, ordered by lastUpdated. :)
let $book-id := xdmp:get-request-field("book")
let $user-id := xdmp:get-request-field("user")
(: Both element-value constraints must match. :)
let $query := cts:and-query((
    cts:element-value-query(fn:QName("","bookId"),$book-id),
    cts:element-value-query(fn:QName("","userId"),$user-id)))
return (xdmp:set-response-content-type("text/xml"),<bookmarks>
{for $bookmark in cts:search(/bookmark,$query) order by $bookmark/lastUpdated return $bookmark}
</bookmarks>)

This would then be invoked as:

http://yourserver.com:8010/scripts/bookmarks.xq?book=2853&user=jane_doe

Here, 8010 should be replaced with the port of the MarkLogic app server being used.

Briefly, the cts: namespace handles low level queries (I’ll do the higher level search:search api in a subsequent article). In this case, the book and user keys are retrieved from a query string, then the query itself is constructed by ANDing two element-value queries, one matching all bookmarks with the $book-id value for <bookId>, the second matching all bookmarks with the $user-id value for <userId>.

The query itself is an object, internally kept as a binary object, though this can be mapped into XML by wrapping the object itself in an XML element (in the same way that a map, also a binary object, can be coerced into an XML representation). The query object doesn’t in and of itself perform the query - you need to pass it as an argument to cts:search() - but what it does do is define the constraints working on the search.

The cts:search() function in turn takes two arguments: an XPath expression that identifies in the database the root element that needs to be matched, and the aforementioned query. Note that this isn’t a “directory” query - the query actually talks to ALL of the items in the database. (We’ll come back to this point in a moment.)
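
To make the semantics concrete outside of MarkLogic, here is a minimal Python sketch of what the query amounts to: two value constraints ANDed together, with the matches ordered by lastUpdated. The sample documents and the helper name find_bookmarks are assumptions made purely for illustration, and the sketch scans a list in memory, whereas MarkLogic resolves such queries against its indexes.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical sample documents, shaped like the bookmark shown earlier.
SAMPLE_DOCS = [
    "<bookmark><bookmarkId>1002</bookmarkId><bookId>2853</bookId>"
    "<userId>jane_doe</userId>"
    "<lastUpdated>2012-02-06T09:00:00-05:00</lastUpdated></bookmark>",
    "<bookmark><bookmarkId>1001</bookmarkId><bookId>2853</bookId>"
    "<userId>jane_doe</userId>"
    "<lastUpdated>2012-02-05T11:59:17-05:00</lastUpdated></bookmark>",
    "<bookmark><bookmarkId>1003</bookmarkId><bookId>2853</bookId>"
    "<userId>john_roe</userId>"
    "<lastUpdated>2012-02-04T08:30:00-05:00</lastUpdated></bookmark>",
]
BOOKMARKS = [ET.fromstring(doc) for doc in SAMPLE_DOCS]

def find_bookmarks(bookmarks, book_id, user_id):
    """Keep bookmarks that match BOTH keys, ordered by lastUpdated."""
    matches = [
        b for b in bookmarks
        if b.findtext("bookId") == book_id and b.findtext("userId") == user_id
    ]
    # Parse the timestamps so the ordering is chronological, not merely textual.
    return sorted(matches, key=lambda b: datetime.fromisoformat(b.findtext("lastUpdated")))

for bookmark in find_bookmarks(BOOKMARKS, "2853", "jane_doe"):
    print(bookmark.findtext("bookmarkId"), bookmark.findtext("lastUpdated"))
```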

It’s possible to create a URI rewriter that can turn what amounts to a remote procedure call (an RPC-type invocation) into a more RESTful statement. The file /scripts/rewriter.xq illustrates this:

(: Map /bookmarks/{book-id}/{user-id} onto the bookmarks.xq query string. :)
let $request-url := xdmp:get-request-url()
let $terms := fn:tokenize($request-url,"/")
return if (fn:starts-with($request-url,"/bookmarks/")) then
    let $target-script := "/scripts/bookmarks.xq?"
    (: $terms[1] is the empty string before the first slash; $terms[2] is "bookmarks". :)
    let $book-id := $terms[3]
    let $user-id := $terms[4]
    let $final-url := fn:concat($target-script,"book=",$book-id,"&amp;user=",$user-id)
    return $final-url
else $request-url

The rewriter.xq script is assigned in the URL rewriter field of the app server setup page (in the admin interface, usually on port 8001 for a MarkLogic instance), and its purpose is to generate the internal URL that MarkLogic will actually serve. If this field is blank, then the URI passed is sent on to the normal handlers, but if it contains a script location, then MarkLogic will call that script to determine the final URI. In the case of this particular rewriter, anything starting with /bookmarks/ will activate it:

http://yourserver.com:8010/bookmarks/2853/jane_doe

The script then generates the previous URL, passing the second “folder” in as the id for a book and the third “file” in as the id for a user.
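
The mapping itself is easy to test in isolation. The following Python sketch mirrors the rewriter’s logic under the same assumptions (paths of the form /bookmarks/{book}/{user}, and a target script at /scripts/bookmarks.xq); the function name rewrite is hypothetical, and this is not how MarkLogic itself invokes a rewriter.

```python
from urllib.parse import urlencode

def rewrite(request_url: str) -> str:
    """Turn /bookmarks/{book}/{user} into the query-string form; pass anything else through."""
    if not request_url.startswith("/bookmarks/"):
        return request_url
    # Splitting "/bookmarks/2853/jane_doe" gives ["", "bookmarks", "2853", "jane_doe"].
    terms = request_url.split("/")
    book_id = terms[2]
    # The user segment is optional; fall back to an empty value when it is missing.
    user_id = terms[3] if len(terms) > 3 else ""
    return "/scripts/bookmarks.xq?" + urlencode({"book": book_id, "user": user_id})

print(rewrite("/bookmarks/2853/jane_doe"))
# prints /scripts/bookmarks.xq?book=2853&user=jane_doe
```

Note that Python lists are zero-based, so the book id sits at index 2 here, where the XQuery version uses $terms[3].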

You could of course get more sophisticated. For instance, if only the book id is passed, then what may be returned are all of the bookmarks for everyone for that book. For security purposes, this might also require a second query that will filter only those for which <bookmarkType> is “public-annotation”, meaning that the only bookmarks that show up are those that the person felt should be publicly available. This would change the bookmarks.xq script somewhat:

let $book-id := xdmp:get-request-field("book")
let $user-id := xdmp:get-request-field("user")
(: With a user id, return that user's bookmarks; without one, only public annotations. :)
let $query := if ($user-id != "") then
    cts:and-query((
        cts:element-value-query(fn:QName("","bookId"),$book-id),
        cts:element-value-query(fn:QName("","userId"),$user-id)))
else
    cts:and-query((
        cts:element-value-query(fn:QName("","bookId"),$book-id),
        cts:element-value-query(fn:QName("","bookmarkType"),"public-annotation")))
return (xdmp:set-response-content-type("text/xml"),<bookmarks>
{for $bookmark in cts:search(/bookmark,$query) order by $bookmark/lastUpdated return $bookmark}
</bookmarks>)

This simple change makes it possible to support both private and public annotations in one script.
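
The branching is the only new idea here, and it can be sketched in a few lines of Python. The records below are hypothetical stand-ins for bookmark documents, and select_bookmarks is an illustrative name rather than anything MarkLogic provides.

```python
# Hypothetical in-memory stand-ins for bookmark documents.
RECORDS = [
    {"bookmarkId": "1001", "bookId": "2853", "userId": "jane_doe", "bookmarkType": "annotation"},
    {"bookmarkId": "1002", "bookId": "2853", "userId": "john_roe", "bookmarkType": "public-annotation"},
    {"bookmarkId": "1003", "bookId": "2853", "userId": "john_roe", "bookmarkType": "annotation"},
]

def select_bookmarks(records, book_id, user_id=""):
    """With a user id, return that user's bookmarks; without one, only public annotations."""
    if user_id:
        keep = lambda r: r["bookId"] == book_id and r["userId"] == user_id
    else:
        keep = lambda r: r["bookId"] == book_id and r["bookmarkType"] == "public-annotation"
    return [r for r in records if keep(r)]

print([r["bookmarkId"] for r in select_bookmarks(RECORDS, "2853", "jane_doe")])  # prints ['1001']
print([r["bookmarkId"] for r in select_bookmarks(RECORDS, "2853")])              # prints ['1002']
```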

The direct implication of this is that a RESTful services architecture can become independent of any internal storage mechanism. In this particular case, in fact, if you were dealing with a MarkLogic cluster of several different forests (the foundations for databases), this code would work even if the bookmarks were distributed over each of these forests. In other words, externally, this system has the potential to be distributed over several (perhaps even several dozen) databases that are set up within the same cluster. Unlike HBase or HDFS, on the other hand, this also works remarkably well for many thousands or millions (or potentially billions) of small bookmark objects.

Yet the idea of having searches looking at potentially ALL of the objects in the database, even with indexing, still may make the idea of having internal folders of content preferable. It may be preferable, for instance, to have a system where each bookmark is in fact in a bookmarks folder within a user folder along with other information, perhaps something along the lines of:

/bookstore/users/jane_doe/bookmarks/1001.xml

This is certainly doable - simply save the files with URIs in this particular folder using the xdmp:document-insert() function when saving the bookmark, and MarkLogic will create the relevant folders internally. When the search is made, then, you can constrain the search by passing the directory in question using the directory query:

let $query := cts:and-query((
    cts:directory-query(fn:concat("/bookstore/users/",$user-id,"/bookmarks/"),"1"),
    cts:element-value-query(fn:QName("","bookId"),$book-id)))

The order here can be critical - “and” queries are iterative, which means that by doing the evaluation on the cts:directory-query first, you are effectively guaranteeing a much smaller search (only those items that satisfy that query) in the next iteration, here the element-value-query. This generally holds true for collection queries as well; by passing these first into and-queries, cts:search will only look for those items in the second set that are also in the first, rather than searching across the whole space. (I believe in this case that code optimizers might automatically shuffle these items around internally, but it never hurts to be consistent in one’s coding.)
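
The narrowing idea can be sketched in Python as follows. Everything here is an assumption for illustration: the URI-to-document map, the helper names, and the scan itself, which stands in for what MarkLogic would resolve through its indexes. The depth of "1" is modeled as "immediate children only".

```python
# Hypothetical URI-to-document map; the dicts stand in for stored bookmark XML.
DOCUMENTS = {
    "/bookstore/users/jane_doe/bookmarks/1001.xml": {"bookId": "2853"},
    "/bookstore/users/jane_doe/bookmarks/1002.xml": {"bookId": "3100"},
    "/bookstore/users/jane_doe/bookmarks/archive/0900.xml": {"bookId": "2853"},
    "/bookstore/users/john_roe/bookmarks/1003.xml": {"bookId": "2853"},
}

def in_directory(uri, directory):
    """True when uri is an immediate child of directory (depth 1)."""
    return uri.startswith(directory) and "/" not in uri[len(directory):]

def user_bookmarks_for_book(documents, user_id, book_id):
    directory = "/bookstore/users/" + user_id + "/bookmarks/"
    # Narrow by directory first, then test the element value on what is left.
    candidates = {uri: doc for uri, doc in documents.items() if in_directory(uri, directory)}
    return sorted(uri for uri, doc in candidates.items() if doc["bookId"] == book_id)

print(user_bookmarks_for_book(DOCUMENTS, "jane_doe", "2853"))
# prints ['/bookstore/users/jane_doe/bookmarks/1001.xml']
```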

However, it’s worth asking what effect this has on clustering and distributed files. The answer here, surprisingly, is none. In MarkLogic, a directory is essentially just an “exclusive” type of collection, one in which there is only one such “directory collection” per file. This means that in practice there’s an internal property that holds this particular URI. What it means as well is that there is in fact no “folder” per se within MarkLogic that ties a document to a particular local file system. Internally, as well as externally, you can have multiple resources that are all in the same “folder” that may be on different machines and hard drives - the file path provided is essentially virtual, with each URI then mapped to a given document node in the database.

The upshot is that the internal file system is just as distributed as any external REST-based system. Since the contents of such nodes can be text or binary files as well as XML, this then implies that MarkLogic happens to be a particularly intelligent distributed file system in its own right, regardless of the content. Hadoop is distributed as well, but both the HDFS and HBase systems are fairly high-latency systems in comparison in terms of retrieval.

Indeed, this points to a logical place where Hadoop and MarkLogic can actually act in tandem. Hadoop works remarkably well for large files, because it can parse those files into chunks and process those chunks in parallel. Unfortunately, Hadoop is especially inefficient when it comes to large numbers of small files.

MarkLogic, on the other hand, much prefers working with small files, because it is optimized to work either at a document root or document fragment level, but doesn’t do as well when handling large content (fragment searches require reindexing and make certain operations, such as XIncludes, much more difficult as a consequence). By converting large files into small XML files that can be then stored internally in MarkLogic, Hadoop can act as a preprocessor to make the data more efficient for search and transformative types of operations.

This article touches on a lot of core concepts and provides some low-level overviews of basic calls in the MarkLogic XQuery environment. There are also a number of very useful extensions to MarkLogic that are designed to simplify the development process for these types of applications. I’ve previously mentioned Corona, which makes it possible to develop a quick and dirty REST API even for non-XML content. Norman Walsh created an ePub2 Reader as well, making it possible to read ePub 2.0 formatted (non-DRM) books using the MarkLogic system. The Rewrite framework provides a way to handle URL rewriting declaratively (which definitely comes in handy when building complex applications). Finally, the Hadoop Connector is designed to provide a bridge between MarkLogic and the Apache Hadoop framework.

So … go read an eBook (once I’m a little farther along in the edits, I may start publishing a teaser chapter or two of my own).

Komen Foundation, Bank of America, Barack Obama and Activist Analysts

The Komen Foundation - they of the pink ribbons, who stirred so much viral opposition by dropping a sizeable grant to Planned Parenthood after a Republican congressman opened up an “investigation” of the Planned Parenthood organization - announced today that they will be restoring the grant, and indicated that they are amending their bylaws to respond only to investigations that are criminal rather than political in nature.

As someone who has a family member that has used Planned Parenthood services for breast examinations in the past, I applaud the decision, and indeed find it amusing that Planned Parenthood managed in the three days of controversy to receive so many additional donations that the Komen Foundation shortfall was made up and wildly exceeded. The applause is admittedly tepid - it’s not hard to connect the dots to imagine a phone call from a congressman to a stalwart Republican executive for the Komen Foundation discussing bylaws and loopholes within the organization’s charter, which means that this move is a belated attempt to save face given the potential for imploding donations.

However, my goal here is not to discuss the immediate political ramifications but to explore the bigger ones concerning the marked increase in power of social media activism, and its relationship to the reputation economy.

Within most corporations, even non-profit ones, there are three distinct arenas that are “customer facing” - sales, marketing and public relations. In an increasingly reputation based economy, the purpose of marketing is to create and enhance a positive reputation for both the products a company makes and the brand name of the product itself, and internally it can be seen as trading money (energy in fiscal marketing terms) for reputation via the vehicle of advertising. Sales capitalizes on that marketing in order to translate brand reputation into money.

In this respect, reputation can be seen as a measure of trust; it is the degree to which one person trusts another person’s word, whether that word be an opinion, the use of a particular product or service (from creating a widget to singing a song) or fitness for representing them as a proxy. When trust is high, transactions are likely to move ahead (the meme is transmitted, the sale is made, the vote is cast). When trust is low, transactions bog down amid uncertainty, and the person making the commitment of valued resources may demand a premium to compensate them for the potential risk of a failed transaction, or may just choose to go with someone who has a higher reputation.

The third division - public relations - is quite frequently underdeveloped in organizations, because its principal job (as seen by companies) is repairing trust. In the twentieth century, most large corporations that extended across multiple communities could rely upon the inability of the average person to make complaints known. If a person had a problem with the company, they were out of luck. If they brought the complaint to court, the company might very well choose to reach a settlement of a certain amount of money in exchange for that person’s silence. In a few cases, where the company in question was playing for higher stakes, that person might very well disappear.

The class action lawsuit changed that. In such a suit, a lawyer representing a plaintiff may very well decide that the likelihood of change was higher if others with a similar grievance also brought their charges forward, and so acted as a communications mediator - finding people with common ground and convincing them to commit to the suit. Companies hated class action lawsuits because they could neither ignore them nor buy silence. The settlement, when it came, was a tacit admission of guilt that would tarnish their reputation, meaning not only would it cost them in the settlement but it would cost them again in reduced sales and additional funds expended in marketing. Worse, if a settlement couldn’t be reached, it might very well force a company to recall products or end revenue producing services, or, worst case scenario, may prove so crippling as to cause the company to go into bankruptcy.

The public relations organ of organizations grew out of that fear. Its first purpose was “spin” - it provided a way of justifying the actions of the organization to a community, in order to keep a critical mass of reaction from forming. It often manifested as “giving back to the community” via charitable donations, embracing of community standards, and providing testimonials to say “Look at all the good that we do.” In this respect it was another form of marketing, but it also provided a layer of protection - people are less likely to bring action against a company that donates services and money that they themselves benefit from.

In some cases, the PR (or community relations) department actually does make a difference in an organization - shaping the philosophy of its executives and its workers, and acting, in effect, as the conscience of the organization. In others, however, its goal is simply white-washing - providing another form of marketing (market economy to reputation economy) that just masks more questionable activities going on underneath the surface.

The problem with class action lawsuits that emerged over the years was that, with enough political clout, corporations were able to significantly reduce both their sting and their efficacy. With enough money, class action lawsuits could be dragged through the courts indefinitely, which served not only to place a de facto gag order on the plaintiffs but also to risk a mistrial as the physical resources of both plaintiffs and jury were exhausted. This meant that a settlement, which often included the deliberate non-admission of wrongdoing on the part of the defendant companies, might be the best that could be achieved.

However, just as larger companies were congratulating themselves on defusing the threat of class action lawsuits, networked social media was just taking off, and this has in turn changed the game dramatically. Not that long ago, boycotts were far down on the list of CEOs’ concerns as they went to bed at night. Most boycotts were poorly organized, were typically local to a given area, and as often as not involved “left-wing lunatics” who could be readily marginalized. Worker boycotts (unions) were of course a different issue, which is a big part of the reason why big business has been so adamant about fighting unionism at every turn. However, consumer boycotts were rare and ineffective.

With the Internet, that has changed, as has been shown several times this year. The recent attempts to push SOPA and PIPA in Congress created a boycott that literally appeared overnight, not only exposing the various actors involved but also threatening retaliation (up to and including the RIAA being knocked off the web for a certain period of time).

However, the Komen case shows that it’s not just Internet junkies protecting their home turf. The people involved at the Komen Foundation bet that a trumped up investigation of what is increasingly a public resource (Planned Parenthood) would provide enough doubts to provide legitimate cover for a “backdoor” defunding of a particular right-wing target. Social media spilled over into the mainstream media of television, threatening a boycott in donations that could very well have ended up with Komen financially drained within weeks, and moreover destroying its reputation currency in the process. As it turned out, Planned Parenthood now actually has a greater reputation currency than Komen, and it is very likely that the people who instigated this action on the Komen side will find themselves voted off the board the next time they meet - their own reputation currency similarly non-existent.

A similar action took place a few months ago with the debit card fee hike that Bank of America decided to levy. People were taking their money out of the bank in response, though it can be argued that financially this was perhaps the desired goal - it moved non-performing accounts off the books. However, what B of A failed to realize was that the reputation costs were going to end up being much steeper than anticipated, because it has eroded trust in the institution. If you do not trust a bank, you won’t set up an account there, you won’t give that institution your money to invest, and you will spread via word of mouth your own reservations. You are, in effect, practicing a form of boycott. It is very likely that the damage that B of A did to themselves will cost a great deal in marketing to make up, and may last for years.

Social media activism and boycotting represent an existential threat to corporations, because they have little in the way of defenses against them. Senior executives and corporate owners have generally worked upon the assumption that their organizations were relatively opaque, in part because such opacity was required against the potential for corporate espionage. Yet what they can’t control is the ability for people to make inferences based upon external connections into the company, something that they absolutely depend upon themselves.

Such connections have always existed, but in the analog world, the time it took for people to connect the dots was often measured in months or even years, and the time to communicate these revelations was similarly long - typically long after any damage was done. Today, a determined activist analyst can put together these same relationships in a matter of days, and the result can “go viral” within hours. What that means is that for corporations, “spin” is no longer enough. Organizations that think they can go on with business as usual are discovering that it is becoming more and more difficult to stay out of the scrutiny of the public eye, and that the costs of damage control begin eating not only into profits but even operating margins. It also means that it’s more difficult for political collusion to take place without that fact becoming public knowledge.

Democracy has frequently been described by the privileged as mob rule, and there is some truth in this. Such activism can go in both directions, with a corporation’s or person’s reputation brought low stunningly fast even if they don’t deserve it. However, the funny thing is that while this potential certainly exists, there are some brakes to it. Memes are introduced into the noosphere all the time, but most fail to reach a critical threshold. The ones that do usually require endorsement from those with solid reputations, must meet a bar of validity, and usually must be simple. Moreover, such memes often tend to bring reactions of people counter to these memes, which means that over time the counter-memes may bring into question the authority of those making the initial allegations.

This differs radically from the world offered by passive media. Television and radio advertising relies on the fact that in this environment the repetition of a meme with no counter to that meme will strengthen the “truthiness” of that meme. In the noosphere of the Internet, on the other hand, the initial reaction time may be swift, but the ability to provide reflection (countermemes) is also there to better establish the value of that meme. Given that, it’s perhaps not surprising that the world of Internet social media is generally far better informed about the world - and more reactive to potential threats - than television and radio audiences are.

It’s also one of the reasons I think that Barack Obama will get re-elected. President Obama this week held a “town-hall” meeting over the Internet, with the questioners actually able to interact with him about his answers. Yes, the questions were obviously pre-selected, but there were some very uncomfortable questions nonetheless, about the use of drones, his position on SOPA, and the use of H-1B visas by companies even at a time when unemployment remains high in the engineering sectors. What I found fascinating was that Obama answered these, in some cases defending his own administration’s record, but in other cases seeming to be honestly surprised about the issues and appearing a little troubled that he didn’t know about them.

Obama’s on the cusp between being a boomer and a genXer (he was born in 1961). He grew up with the Internet, much of his thinking was shaped by the Internet and an awareness of social media, and with that thinking comes a considerably more nuanced understanding about communication than either his predecessor or any of the contenders on the Right, most of whom either grew up in an era where the dominant communication paradigm was the television set and the radio, or who are ideologically predisposed towards distrusting decentralized communication media in general. Their viewpoint is very much corporate - repeat the same marketing meme over and over again over one way channels through the directed application of money, then rely upon a friendly media to bury the gaffes that inevitably come out on the campaign trail.

However, that media is now facing real competition from the social media side (which it does not control) and is being forced as a consequence to become more critical of the candidates themselves - which is exactly the relationship that the media should have with any representative. On the web, the messenger is often under as much scrutiny as the message. While the Obama administration has run afoul of both the traditional and internet media a few times over his term, he understands the Internet media far better, and is considerably more aware of both the power and peril of being a politician in a world of activist analysts.

Big Data vs. Smart Data

I’ve heard the Big Data meme thrown around quite a bit among the marketing folks, but have never been terribly comfortable with the traction that it’s gaining from a technical standpoint. It sounds like a way to sell lots and lots of servers (real and virtual) for doing all that processing of the “vast amounts of data” that we’re swimming in.

However, before investors start writing out those checks to the new up and coming Big Data vendors, it’s probably worth asking a few really Big questions. Perhaps by throwing out a few use cases, it might help to define what Big Data really isn’t, making it at least marginally easier to figure out what the marketing term really does mean.

Use Case the First: Consumer Sentiment Analysis. We love us our polls, real good - we want to know exactly what everyone is thinking, is buying, is using in the bathroom, who everyone is voting for and who they’re watching every minute of the day! Yup - lots and lots of data there.

The problem there is not in gathering the data, or even all that much in integrating it. There are structured forms for sentiment analysis that can be readily ingested, transformed and otherwise shaped with comparatively little work, and this has been ongoing for quite some time. Google, Microsoft, Amazon, and so forth probably have a clearer recollection of what you did in the last twenty four hours than you do, as does your cell phone provider. Don’t believe me? Consider that an application such as Google Navigate, which I use all the time, doesn’t need to worry about querying street cameras or city navigation for determining its route times - it just takes the incoming data stream from all the cell phone users impinging on the various cell towers to determine how quickly traffic is moving, and calculates things out from there.

Indeed, the bigger problem is that there is a tug of war between corporations and privacy advocates. Getting the data is comparatively easy. Getting the data in such a way as to avoid violating a person’s privacy is well nigh impossible, and the reality is that most corporations don’t really try very hard, because their goal is determining who will spend their hard-earned money on the company’s goods or services, and then acting on that determination. Indeed, for many companies, the bigger problem is not in collecting the data, but in figuring out how much of that data they can throw away before saving it in the first place.

Admittedly this does point to one of the benefits of Big Data as well. Storage space, while cheap, is not free. One of the critical roles of big data is to act as a filter, keeping data which is not in fact relevant from being persisted for any longer than it has to be. This may seem somewhat counter-intuitive - wouldn’t it be better to keep the data around so that it can be analyzed in other ways?

Perhaps not. Data decays in value over time, and the more atomic the data, the more quickly it decays. My needs are context based, and this context is seldom visible from the data itself. My tastes in music or food or entertainment change over time, so that data about my buying habits today may not necessarily be germane ten years from now, though they will likely be pertinent a few weeks or even months down the road. Typically this means that data about my buying habits (as documents) from five years ago has far lower value than the same data from today. Stock traders and economists often talk about time series - 50 or 200 day moving averages that smooth out the daily “noise” to make sense of broader trends. Having sampled points from the data and a general idea of the moving average at that stage can go a long way towards understanding the shape of the data while still retaining relevancy.
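
As a small illustration of the smoothing idea, here is a Python sketch of a simple moving average. The function name and the toy numbers are assumptions for the example; a real 50 or 200 day average would simply use a larger window over daily values.

```python
from collections import deque

def moving_average(values, window):
    """Yield the mean of the most recent `window` values once enough have arrived."""
    recent = deque(maxlen=window)  # older values fall off the left automatically
    for value in values:
        recent.append(value)
        if len(recent) == window:
            yield sum(recent) / window

daily = [1, 2, 3, 4, 5]
print(list(moving_average(daily, 3)))  # prints [2.0, 3.0, 4.0]
```

Only the last few values are held at any time, which is the point: the raw daily figures can be discarded once the smoothed series and a few sample points have been kept.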

Big Data Is Streaming Data

Put another way, most “big data” systems should more properly be seen as the processing of “streams” of data. A stream can be thought of as a buffer for processing combined with a sampling strategy. An input operation populates the buffer, the buffer is processed to generate “smart data” (more structured content, reports, etc.), the buffer data is persisted and the process begins all over again at some later point. Periodically, the archived buffer snapshots past a certain date are then purged from the system (in essence a different stream).
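The buffer, sampling and purge cycle can be sketched roughly as follows. This is a toy model under stated assumptions, not how any particular product does it: the class name SampledStream and the options sampleEvery, bufferSize and maxAgeMs are all hypothetical, and timestamps are passed in explicitly so that the run is repeatable.

```javascript
// A toy stream: buffer + sampling strategy + snapshot archive with purge.
// Every name here is hypothetical and chosen for the illustration.
class SampledStream {
  constructor({ sampleEvery, bufferSize, maxAgeMs, process }) {
    this.sampleEvery = sampleEvery; // keep every Nth incoming item
    this.bufferSize = bufferSize;   // process once this many items are buffered
    this.maxAgeMs = maxAgeMs;       // archived snapshots older than this are purged
    this.process = process;         // turns a raw buffer into "smart data"
    this.seen = 0;
    this.buffer = [];
    this.archive = [];              // entries look like { takenAt, snapshot }
    this.results = [];
  }

  push(item, now = Date.now()) {
    this.seen += 1;
    if (this.seen % this.sampleEvery !== 0) return; // sampling strategy
    this.buffer.push(item);
    if (this.buffer.length >= this.bufferSize) this.flush(now);
  }

  flush(now) {
    this.results.push(this.process(this.buffer));               // generate smart data
    this.archive.push({ takenAt: now, snapshot: this.buffer }); // persist the buffer
    this.buffer = [];
    this.purge(now);
  }

  purge(now) {
    // Drop archived snapshots that are older than the retention window.
    this.archive = this.archive.filter((s) => now - s.takenAt <= this.maxAgeMs);
  }
}

const stream = new SampledStream({
  sampleEvery: 2,
  bufferSize: 3,
  maxAgeMs: 1000,
  process: (items) => ({
    count: items.length,
    total: items.reduce((sum, n) => sum + n, 0),
  }),
});

// Feed twelve readings, 200 ms apart.
for (let i = 1; i <= 12; i++) stream.push(i, i * 200);
console.log(stream.results, stream.archive.length);
```

In this run half of the readings are never buffered at all, and the first archived snapshot is purged by the time the second one is written.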

The processed data has more extensive machine semantics - it is marked up or encoded so that a machine can better process (i.e., understand) the content relative to more abstract entities. Because the information structure contained therein is more standardized, it requires less reinterpretation by downstream processes. It can be more readily “mashed up” with other content. 

You can see this in things like Twitter. Twitter is a fast stream: billions of tweets go through the system annually. Storing all those tweets would not be impossible, but in most cases it would be impractical, especially if what you’re looking for is specific textual information, hashtags or links. Instead, what makes more sense is creating reverse indexes that correlate hashtags with message ids, user ids, other tags and links. Any such system might also employ Twitter’s own search mechanism to retrieve near real time lists - in effect, Twitter is already doing a certain amount of its own smart data processing in order to populate its search capabilities. Any application in turn can use the results of this API to sample the stream over time, persisting only those items in the stream that have relevance. Moreover, because you do have links back to the source, it’s also possible to spider back to correlated links.
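A reverse index of this kind takes only a few lines to sketch. The tweet shape used here ({ id, user, text }) is an assumption made for the example rather than Twitter’s actual payload, and the function names are hypothetical.

```javascript
// Build a reverse index from hashtag -> set of message ids.
// The tweet objects below are invented sample data.
function extractHashtags(text) {
  const matches = text.match(/#[A-Za-z0-9_]+/g) || [];
  return matches.map((tag) => tag.slice(1).toLowerCase()); // normalize case
}

function indexTweets(tweets) {
  const index = new Map();
  for (const tweet of tweets) {
    for (const tag of extractHashtags(tweet.text)) {
      if (!index.has(tag)) index.set(tag, new Set());
      index.get(tag).add(tweet.id);
    }
  }
  return index;
}

const sampleTweets = [
  { id: "t1", user: "joe", text: "Streaming beats batch #bigdata #streams" },
  { id: "t2", user: "ann", text: "Triples everywhere #semantics #BigData" },
  { id: "t3", user: "raj", text: "No tags here" },
];

const hashtagIndex = indexTweets(sampleTweets);
console.log([...hashtagIndex.get("bigdata")]);
```

Only the ids are persisted against each tag; the full text can be fetched again from the source when it is actually needed.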

From Streams to Semantics

Of course, once that happens, then your streams can be used to construct relational graphs, and it is here that things get interesting. The Tweet structure can in fact be broken down into relationships: tweetA is produced by personA, tweetA has hash tags (establishes a relationship to) #tagA, #tagB and #tagC, has person references @personE and @personF, and has urls http://linkG and http://linkH. These (and other) relationships can be expressed as assertion triples:

tweet:tweetA tweet:createdBy user:personA.
tweet:tweetA tweet:hasHash hash:tagB.
tweet:tweetA tweet:hasReference user:personE.

… and so forth. This can then be used with SPARQL type queries such as:

SELECT ?user WHERE {
    ?tweet1 tweet:createdBy ?user.
    ?tweet1 tweet:hasHash ?hash.
    ?tweet2 tweet:hasHash ?hash.
    ?tweet2 tweet:createdBy user:Joe.
}

which can be restated as “select all users who’ve written tweets referencing a hash tag that was also referenced by the user named Joe”, or put another way “find all the people that are interested in the same topics that Joe is interested in”. If Joe is perceived as an influencer, then such a query can in fact determine the overall shape and constituency of his influence group. A similar query could determine such things as the set of topics that a group of influencers in a given field cover, and who is likely to be a new influencer within that group.
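For readers without a triple store to hand, the same join can be walked through in plain JavaScript. This is only a sketch of what the query pattern means, not of how a SPARQL engine evaluates it; the triples are invented, and the helper names (objectsOf, subjectsOf, usersSharingTagsWith) are hypothetical.

```javascript
// Triples as [subject, predicate, object] arrays, mirroring the notation above.
// The data is invented for illustration.
const triples = [
  ["tweet:t1", "tweet:createdBy", "user:Joe"],
  ["tweet:t1", "tweet:hasHash", "hash:bigdata"],
  ["tweet:t2", "tweet:createdBy", "user:Ann"],
  ["tweet:t2", "tweet:hasHash", "hash:bigdata"],
  ["tweet:t3", "tweet:createdBy", "user:Raj"],
  ["tweet:t3", "tweet:hasHash", "hash:beer"],
];

// Objects of every triple matching a subject and predicate.
function objectsOf(subject, predicate) {
  return triples
    .filter(([s, p]) => s === subject && p === predicate)
    .map(([, , o]) => o);
}

// Subjects of every triple matching a predicate and object.
function subjectsOf(predicate, object) {
  return triples
    .filter(([, p, o]) => p === predicate && o === object)
    .map(([s]) => s);
}

// Users who wrote a tweet sharing at least one hashtag with the given user.
function usersSharingTagsWith(user) {
  const found = new Set();
  for (const tweet2 of subjectsOf("tweet:createdBy", user)) {
    for (const hash of objectsOf(tweet2, "tweet:hasHash")) {
      for (const tweet1 of subjectsOf("tweet:hasHash", hash)) {
        for (const author of objectsOf(tweet1, "tweet:createdBy")) {
          found.add(author);
        }
      }
    }
  }
  return [...found].sort();
}

const peers = usersSharingTagsWith("user:Joe");
console.log(peers);
```

Note that, like the query as written, this includes Joe in his own result set; in SPARQL a FILTER clause would exclude him.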

This is where big data becomes smart data. Such a query is inferential in nature, and is remarkably difficult to write using traditional query tools, but is fairly easy to write with a semantic language such as SPARQL - if the data is in the right form. Thus, the role of effective streaming applications is to shape data stream samples into such a form that higher level processing on them becomes possible - restructuring, document enrichment, semantic deconstruction.

This will become especially important as the web shifts towards sensor data. We see intimations of the sensor web now in the importance of cell phones, which from a data collection standpoint are really little more than bundles of sensors - identity (a URL), location, internal state, orientation, timestamp, as well as potentially higher level information such as purchases, phone make, and even tweets and similar SMS messages, depending upon what gets exposed.

This web of sensor data is already in use for everything from real time traffic analysis (whenever you power up a navigation app, you’re seeing Big Data in action, as the sensors that organizations use to determine traffic congestion are really just reading the vectors of millions of cell phones as they move through space and correlating that with street maps) to determining crowd congestion in order to better reallocate routing of signals, or to predicting the locations of crime in real time. As intent becomes more readily discernible from these sensors, this opens up additional possibilities for analysis.

The Ethics of Smart Data

However, this should be prefaced with a significant warning. Inference modeling of Big Data streams is still just a model, albeit a potent one. Knowing the right questions to ask is increasingly critical, as the tools may readily point out marketing or law enforcement patterns that don’t really exist if the wrong assumptions are built into the queries themselves. Indeed, even as one builds up an increasingly abstract representation of information from data, assumptions may creep into the semantics of that representation and introduce biases.

Moreover, such analysis cannot help but identify individuals - there is a high degree of correlation between a cell phone and its owner (it’s not 100% - I may give my eleven year old daughter my cell phone for a while so she can play Angry Birds while I’m otherwise engaged, for instance - but it’s pretty close), and one of the central challenges that we as a society face is the degree to which this information can be used. In my experience, people usually tend to agree to this loss of privacy for the sake of convenience, but if it begins to seriously impinge upon the civil liberties of the populace, this may be changed by social action or legislation.

This becomes even more of a challenge because of the third aspect of big data - the ability to correlate disparate data systems. Web services of all stripes provide mechanisms for data interchange, and especially as data systems move towards a more RESTful interface (which is especially useful for constraining queries against data records), it becomes possible to cross-correlate databases at the data store layer. By themselves, databases may be relatively “clean”; however, couple that with databases querying other databases and a deep inference engine, and such data analysis can often reinfer the presence and profiles of individuals synthetically.

We are reaching a stage where inference engines can tie in building permits, corporate ownership records, campaign contributions, voting records, travel schedules and the like to build a strong circumstantial case for bribery and corruption, for instance, just as they can be used to target street criminals - indeed, what I found so fascinating about the Fox TV show Numb3rs was not the extent to which the inferencing mathematics was pure bs (though there was some of that) but the extent to which it was plausible … and disturbing.

Summary

In the long run, most of what marketing tends to term Big Data consists of the processing of either current or archival streams. ETL of raw data is not in fact qualitatively different from working against live sampled streams; the primary difference is that the archived data needs to be “resampled”. Moreover, the progression from purely unstructured to structured data (XML or JSON) and from there to semantic data (as RDF triples) is a processing continuum, with the goal being to keep your models as devoid of potentially corrupting inferences as possible, as these inferences are in turn derived by the corresponding queries and reports generated from XQuery or SPARQL (or both).

This also points to a perhaps uncomfortable truth for the Hadoop community. Hadoop is a very powerful tool for ETL processing, but the Hadoop File System is not an intelligent 4th generation database. It’s more the equivalent of a strip-miner trying to pull rare earths from terabytes or petabytes of electronic dross - it can be used to extract raw ore from the earth, but the relationship mining and interconnectivity need a multiply indexed data system in order to convert the rare earths into a form useful for industry. While no doubt there will be efforts to do just that within Hadoop, by seeing Big Data as a stream processing system it makes more sense to make use of the streaming analogy throughout the processing pipeline, with tools such as MarkLogic handling the XML search and intermediate … and quite likely the semantic side.

Why MarkLogic May Be Your Next JSON Store

For several years MarkLogic Server has been quietly building a reputation as a powerhouse in the XML space. It combines blazingly fast XML search capabilities, an impressive core XQuery library, the ability to use the database as a surprisingly sophisticated web application server, impressive tools for working with ingested documents (including both the traditional Microsoft Office suite along with a considerably richer suite of images, alternative document formats and even media with the recent 5.0 release) and support for clustering, so it’s perhaps not surprising that it is becoming the default XML server for publishers, libraries, educational institutions and government agencies.

However, it can also be argued that we are entering into the “post-XML” world. For web application developers, XML has always been something of a hassle, especially as there’s been a strong resistance to the incorporation of XQuery or even an easy to use XPath processor within browsers themselves. Largely by fiat, the native language of the web is JavaScript, and increasingly this means that the default language for web communication is JSON.

This presents an interesting challenge for an XML database company. It’s worth understanding that while it is perfectly possible (and in fact quite useful) to serve up web pages and accept form content through MarkLogic, more often I find (after several years of both MarkLogic and eXist development) that people prefer to use MarkLogic as a services hub, providing both SOAP and REST services to dedicated clients both on the desktop and on mobile devices. So long as these services are XML oriented, this works great. But what about JSON, which continues to gain in this sector?

The Power of Corona

As it turns out, MarkLogic may actually be in a position to become a major player in this space as well. I ran into Jason Hunter, MarkLogic’s Deputy CTO and creator of the highly regarded MarkMail service, at a recent Government Enterprise summit and we spent a fair amount of time talking about what MarkLogic is doing both with the Federal Government as well as with smaller customers, and the discussion of RESTful services and JSON came up. Jason grinned wickedly, kicked back his feet onto a nearby chair, and told me about a new project that they had underway.

Corona is a RESTful services architecture that’s intended to make MarkLogic more accessible for those developers that aren’t familiar with MarkLogic, XQuery, or possibly even XML. The core idea is simple. You start with a document. That document could be XML, but it could also be JSON based. For instance, suppose that you were creating an application intended to provide information for a ratings and review service for various beers (ya knew that was coming). Your web app on the client sets up the document in question, say something like:

var beer = {
    name:"Copper Hook",
    id: "RedHook-CopperHook",
    brewer: {
         brewerRefId:"RedHookBreweries",
         name: "Red Hook",
         city:"Seattle",
         country:"United States"
         },
    brew: {
         style: "Copper Ale",
         flavor: "Smooth",
         ABV: 0.058,
         },
    brewedFrom: 2001
    };

This could also be represented in XML, of course, but the point here is that it doesn’t have to be. We can also create other types of documents such as a review:

var review = {
    username:"Jane Doe",
    id: "19ACED922859102",
    beerReviewed:"RedHook-CopperHook",
    comments:"Fantastic copper beer, definitely smooth", 
    rating:4,
    reviewDate:"2012-01-05T12:86:24.00249-05:00"
    };

The Corona package, once installed, lets you set up a base application URI for hosting the endpoints, such as http://www.myserver.com/reviewer/, which points to the relevant Corona-enabled web application. The package then includes a few different “endpoints” that provide RESTful services support. The /store endpoint is critical for interacting with the database, /search is used for queries, and so forth. Query string parameters can then be used to pass arguments to the Corona endpoints.
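Because every call is just a base URI plus an endpoint plus query string parameters, it can help to build the URLs with the standard URL API rather than by string concatenation. The helper below is a sketch of that idea; buildEndpointUrl is a hypothetical name, and the uri and collection values are simply the ones used in this article’s examples.

```javascript
// Build a URL for an endpoint under the application base URI.
// buildEndpointUrl is a hypothetical helper, not part of Corona.
function buildEndpointUrl(baseUri, endpoint, params) {
  const url = new URL(endpoint, baseUri); // resolves "store" against ".../reviewer/"
  for (const [name, value] of Object.entries(params)) {
    url.searchParams.set(name, value);    // percent-encodes names and values
  }
  return url.toString();
}

const base = "http://www.myserver.com/reviewer/";
const storeUrl = buildEndpointUrl(base, "store", {
  uri: "/RedHook-CopperHook.json",
  collection: "beers",
});
console.log(storeUrl);
```

The encoded form looks a little different from a hand-written query string (the leading slash in the uri value is percent-encoded, for instance), but it decodes to the same parameter values on the server.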

Suppose, for instance, that you wanted to post both the beer and the review to the server. These would logically be broken down into two distinct collections - beers and reviews. You’d use PUT rather than POST to insert the contents into the database; POST is the more familiar choice for creating resources, but PUT fits here because the client supplies the URI under which each document is stored. From JavaScript, this would probably look something like:

var beerPath = "http://www.myserver.com/reviewer/store?uri=/" + beer.id + ".json&collection=beers";
var beerXhr = new XMLHttpRequest();
beerXhr.open("PUT", beerPath, true);
// send() will not serialize a plain object, so convert it to a JSON string first.
beerXhr.send(JSON.stringify(beer));

var reviewPath = "http://www.myserver.com/reviewer/store?uri=/" + review.id + ".json&collection=reviews";
var reviewXhr = new XMLHttpRequest();
reviewXhr.open("PUT", reviewPath, true);
reviewXhr.send(JSON.stringify(review));

So what happens on the server? Quite a bit, as it turns out. The endpoint takes the inbound JSON objects and internally converts them into a round-trip enabled XML format, stores them associated with the given URIs (here “/RedHook-CopperHook.json” and “/19ACED922859102.json” respectively) and adds the objects to their corresponding collections.

When you want the review back, you use the GET method on the /store endpoint.

var xhr = new XMLHttpRequest();
// Passing false makes the request synchronous, which keeps the example short.
xhr.open("GET", "http://www.myserver.com/reviewer/store?uri=/19ACED922859102.json", false);
xhr.send();
var review = JSON.parse(xhr.responseText);
console.log(review.comments);

Updating and deleting resources (as well as more sophisticated options such as transactional management) are also supported.

Search and Rescue

So far, this falls into the category of “hey, that’s cool,” but in reality, especially if you’re using local storage on the client, this is something of a yawner. Where MarkLogic really shines is in its search capabilities, and this in turn points to the second endpoint.

For instance, suppose that you wished to get all records which contain the words “smooth” and “beer”. You can use the following query string on the /search endpoint:

/search?stringQuery=smooth+beer

which will return all records that contain both smooth and beer:

[{
    username:"Jane Doe",
    id: "19ACED922859102",
    beerReviewed:"RedHook-CopperHook",
    comments:"Fantastic copper beer, definitely smooth", 
    rating:4,
    reviewDate:"2012-01-05T12:86:24.00249-05:00"
    },
 {
    username:"JohnDonne",
    id: "8352FADE229A91205",
    beerReviewed:"Olympia-Pilsner",
    comments:"I liked this, overall, though not my favorite. Still, the beer had a very smooth taste.", 
    rating:4,
    reviewDate:"2012-01-05T12:86:24.00249-05:00"
    } 
];

Similarly,

/search?stringQuery="smooth beer"

returns all records that contain the exact phrase “smooth beer”.

You can do field based searches:

/search?stringQuery=name:Copper+Hook

and even ranged searches:

/search?stringQuery=brewedAfter:1971

as well as combinations of these:

/search?stringQuery=brewedAfter:1971+smooth+beer

(+ replaces a space in URL encoded notation).

Additionally, you can constrain searches to specific collections:

/search?stringQuery=brewedAfter:1971+smooth+beer&collection=beers

among other options (including being able to run geospatial queries, change sorting and sort order, modify and control paging and so forth).
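Composing these query strings by hand gets fiddly once phrases and field constraints are mixed, so here is a small sketch that assembles one and lets URLSearchParams handle the encoding. The helper name buildSearchPath and its option names are hypothetical; only the stringQuery and collection parameters come from the examples above.

```javascript
// Compose a stringQuery from field constraints, an exact phrase, and free-text
// terms, then URL-encode it. buildSearchPath is a hypothetical helper.
function buildSearchPath({ terms = [], phrase, fields = {}, collection }) {
  const parts = [];
  for (const [field, value] of Object.entries(fields)) {
    parts.push(`${field}:${value}`);        // e.g. brewedAfter:1971
  }
  if (phrase) parts.push(`"${phrase}"`);    // quoted phrases match exactly
  parts.push(...terms);

  const query = new URLSearchParams();
  query.set("stringQuery", parts.join(" ")); // spaces are encoded as +
  if (collection) query.set("collection", collection);
  return `/search?${query.toString()}`;
}

const searchPath = buildSearchPath({
  fields: { brewedAfter: 1971 },
  terms: ["smooth", "beer"],
  collection: "beers",
});
console.log(searchPath);
```

The colon in the field constraint comes out percent-encoded, which is equivalent to the literal form shown above once the server decodes the query string.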

Beyond these simple queries, it’s possible to create considerably more complex queries by sending a specialized JSON document to the /search endpoint containing the relevant parameters. This is described in much greater detail in the documentation itself. Indeed, most, if not all, of the MarkLogic query functionality is expressible through the Corona interfaces.

NoSQL Powerhouse?

This opens up some interesting speculation. MarkLogic has been somewhat marginalized in the web world because of its initial XML bias, though as MarkLogic founder Chris Lindblad has said more than once, this bias was largely accidental - the underlying indexing structures were set up in a way that could have just as readily been used to represent JSON. A couple of generations of processors and the Corona extensions look to have made this a reality, albeit via a “virtual data layer” in XML.

The Corona interfaces can be used with other “unstructured” documents as well - for the MarkLogic 4.x line this includes support for binary files such as Microsoft Word and Excel; for the ML 5.x line, this expands to a wide range of media resources, from EXIF data on image, audio and video files to open office formats, among others. As such, the company is poised to enter the enterprise media server market in a big way, as well as being able to act not only as a web server for XML web content but as a generalized web services hub for JSON and similar formats.

The project is still under development, though all of the core functionality is there now. Corona has the potential to significantly expand the market presence of MarkLogic, and should definitely be on the Must Review list for anyone looking at building out RESTful service systems.

MarkLogic Corona Project GitHub