Today's designers of data analytics systems are using thirty-year-old mental models around scarcity of compute and are thus crippling their designs, not fully realizing how radically different 21st-century analytics has become.
In higher education administration, 21st-century data analytics has arrived. Signs of improvements are emerging sporadically, like flowers blooming in the desert.
Yet at most institutions, troubles abound. Political silos, held together by the maxim that hiding and hoarding information secures one's position, prevent the consolidation of systems. Combatants use information technology as a proxy war, thus avoiding more difficult head-to-head conflict. Many staff in our IT organizations and end-user communities have not kept pace with the sudden advances in analytical approaches and tools and are often lacking basic skills in statistics and data management. New technologies go through careful and sometimes lengthy reviews before adoption while current approaches, delinquent as they may be, escape similar scrutiny. College and university leaders are loath to change their organizational structures to utilize data better, and given today's technology, these structures now look downright anachronistic.
The refrain of objections repeats:
- We have so many ways to define data! (In reality, there aren't that many.)
- Our data is of poor quality! (It can be fixed.)
- Legal regulations say we can't! (These are commonly overstated; laws should be read carefully.)
- Our politics will prevent this! (That may be, but leaders can influence the culture.)
- What about privacy and security concerns? (Tools for managing these concerns are now more advanced.)
- We need outside experts to do this for us! (This is not always true.)
- We can't integrate the data because our tools are too old! (Read on.)
One could argue that since the dawn of civilization, the adoption of new ways of doing things has been slowed by two factors: (1) the length of time required for the practice to be communicated from mind to mind, and (2) the length of time required for a single mind to accept that a new practice is worthwhile. With the communication technology available today, the first barrier has largely been removed: practices can propagate prodigiously and swiftly. It is the second impediment to adoption, mental inertia, that remains a concern. We all look at the world through a certain framing or lens (mental model), based on our background. New data and new logic must fit into these frames before we can adjust them and adopt a new practice.
Mental models can be hard to move because humans can be oh-so stubborn. These mental models are more than just intellectual tools; they partially define who we are. We react to changing them as if our very bodies were being attacked. We very quickly understand when someone is subtly trying to bend our mental models to accommodate their way of thinking, and we sometimes respond emotionally and viscerally. The bodily attachment to mental models is part of our genetic heritage and is deeply human.
In the area of analytics, our mental models need an overhaul.
Thirty Years Later
One area of mental-model stubbornness involves managing basic data. In most IT organizations today, staff are moving data between systems using "flat files" (text files containing rows of data, spreadsheets, and replicas of databases, also known as database dumps), just like our forebears did in 1989. The basics of the batch file import and export have not changed much, if at all. Worse still, in many IT organizations, data movement is a cottage industry that exists in piecemeal form sprinkled across, if not beyond, the central IT organization and is usually loosely regulated and often not systematically improved. For me, this is infuriating. Even though many organizations have adopted all sorts of new technologies and approaches, vendors and IT organizations alike persist in flat-file tangos.
In the area of structured analytics, the Kimball-style data warehouse, with fact tables and dimensions and with star or snowflake permutations, still reigns supreme. The concept of extract, transform, and load (ETL) is alive and well and permeates people's thoughts and language. Data at rest needs to be extracted from the ground, lifted out, and transformed, like a manufacturing process, in often convoluted operations that fold, bend, and otherwise spittle upon the data and then finally load the data, like a railcar of goods, into the retail shop floor, which in this case is the Kimball model, arranged like empty shelves awaiting information goods to appear. Report writers and data visualizers then, again, rummage through the Kimball model in the data warehouse, combining data using a series of joins and, very frequently, making copies of the data in order to facilitate additional data transformation and analytics methods or to be combined with data not in the data warehouse. The current manufacturing process needs improvement.
The lack of processing speed within the computing environments has been a strong force shaping the evolution of our data management practices. Slow disk drives dictate a certain response from operating and database system software developers, involving things like indexes, transaction logs, and caches. Data warehouse designers respond by forcing the surface design of the solution to respect these limitations. The result is that data warehouses, once built, still require significant translation or add-ons to meet specific analysis needs. Until now, compute performance has been scarce and has not kept pace with the growth in data, and this scarcity has profoundly shaped not just the nature of IT systems but the common thought processes of everyone involved. Designers of solutions today are using thirty-year-old mental models around scarcity of compute and are thus crippling their designs, not fully realizing how radically different 21st-century analytics has become.
The New Technologies
The past evolutionary pressure—that is, the performance constraints in older data analytics systems design—is fading. New technologies are significantly bending the cost-benefit curve and letting organizations process more data in ways that are better, faster, and cheaper than before. While some data, especially scientific data, continues to get arbitrarily dense and hence exponentially bigger, most, if not all, of the structured and unstructured data in college and university operations today lies safely within the comfort zones of the available technology. Five technologies in particular are enabling this transformation.
Technology #1: Improvements in Scale-Out, Low-Cost, Near-Real-Time Streaming
While usually associated with Internet of Things (IoT) applications, streaming technologies are now poised to take over the synchronization of data for analytics. These technologies include, but are not limited to, Apache's Kafka, NiFi, and Storm and Amazon's Kinesis (including Kinesis Data Firehose). These tools support real-time and near-real-time use cases and can expand using the horizontal scaling approaches that are the norm today to quickly handle big data movement at (usually) extremely low cost. When data moves to streaming, the real-time nature of the tools hardens the environment because errors get rooted out quickly. When data is replaced in an overnight batch once a month, the IT organization gets twelve chances a year to fix it. When data is streaming hundreds of times a day, the IT organization fixes it many times a day. These streaming approaches also enable real-time analytics to run as data moves through the stream, even before the data comes to rest in a data store.
Technology #2: High-Speed, In-Memory Analytics
These tools (e.g., SAP HANA) sport scale-out, parallel designs that make mincemeat out of billion-row data sets. These environments also heavily compress the data, maximizing use of higher-speed memory. Very large data sets can be analyzed in these environments, which resemble supercomputers. The cost of in-memory analytics has dropped significantly in the last decade, and in some cases in-memory environments are starting to replace high-speed disk drives as the cheaper alternative.
Technology #3: Low-Cost, Big Data Environments
Tools like Apache Hadoop, Amazon Redshift, and Google's serverless BigQuery let organizations store petabytes of data cost-effectively. The high-speed, in-memory tools referred to above can now federate their queries with these environments, providing organizations with a two-tier approach. Tier 1 is very fast but more expensive. Tier 2 is very big and slower but super cheap. Petabytes of data can be modeled in one, synoptic architecture.
Technology #4: Artificial Neural Network Resurgence
After enduring an impossibly long and cold winter of a couple decades, artificial intelligence (AI) has undergone a renaissance, partly enabled by improvements in computer memory and in processing hardware beyond general-purpose CPUs (for example, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), and tensor processing units (TPUs)) but also due to software improvements represented by the various forms of new neural networks available today. These neural network techniques are now ubiquitous in the area of social media and e-commerce.
Technology #5: Hyperscale Cloud Providers
These cloud providers, dominated by Amazon Web Services (AWS) but with Microsoft Azure and Google Cloud Platform following close behind, are enabling all of these technologies to be run in very dynamic and elastic environments. Analytics data processing can ebb and flow in a pattern of big bursts punctuated by dry spells, and the resource consumption and pricing can also ebb and flow.
The New Rules
These five technologies, which are subverting the dominant paradigm of data and analytics, are begging for a new set of rules. The following six rules are an inversion, of sorts, of the old rules.
Rule #1: Everything Is a Verb.
In working with my analytics teams over the last few years, I have found that the old focus on nouns and their relationships to each other (e.g., entity-relationship modeling, among other approaches) is much less important. Relational modeling originally grew out of the need to divorce logical hierarchies (relationships) from physical data structures, providing great flexibility. In time, relational modeling also had to genuflect before the performance altar, and today, most data warehouse designers cut from this cloth cannot stop themselves from trying to conserve performance and hence adjust their designs. Also, with master data management methods now fairly well adopted, the teams I have worked with have made significant progress in ensuring that we understand and accurately model key nouns (entities).
With the five technologies listed above, we can now put verbs first. For each noun, we work out all the events that can change or utilize the noun. Each of those events comes from one or more source systems in tiny batches—a stream—of data creations, updates, or deletions. Rather than try to make sure we have all the additional fields that describe the noun perfectly conformed, we instead work on making sure all streams of changes regarding the noun are suitably captured. In short, we are replicating a transaction log that describes all updates. We call this a replayable log, meaning that we have the time history of changes captured in the stream of events. We treat the events in this log as idempotent, allowing duplicate messages to be detected and dealt with safely within the analytics environment. This lets us ingest data very rapidly and very inexpensively with greater resiliency in case of any system failures. This method of designing the data to allow some duplicate data with after-the-fact resolution is called an eventual consistency data integrity model and enables the very low-cost, high-speed data integration capabilities.
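The replayable, idempotent log described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the design used at any particular institution: the `Event` fields, the `ReplayableLog` class, and the choice to detect duplicates by `event_id` at append time are all hypothetical. In an eventually consistent system the same duplicate resolution could just as well happen after the fact, during replay.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Set


@dataclass
class Event:
    """One change to a noun (entity). All field names are hypothetical."""
    event_id: str                           # unique per source event; used to spot duplicates
    entity_id: str                          # the noun being created, updated, or deleted
    op: str                                 # "create", "update", or "delete"
    ts: int                                 # source-system timestamp
    fields: Optional[Dict[str, Any]] = None


class ReplayableLog:
    """Append-only log whose state can be rebuilt at any point in time."""

    def __init__(self) -> None:
        self._events: List[Event] = []
        self._seen: Set[str] = set()

    def append(self, event: Event) -> bool:
        """Store the event; return False if it is a duplicate delivery."""
        if event.event_id in self._seen:
            return False  # re-delivery is harmless: the log is unchanged
        self._seen.add(event.event_id)
        self._events.append(event)
        return True

    def replay(self, as_of: Optional[int] = None) -> Dict[str, Dict[str, Any]]:
        """Rebuild entity state by applying events in timestamp order."""
        state: Dict[str, Dict[str, Any]] = {}
        for ev in sorted(self._events, key=lambda e: e.ts):
            if as_of is not None and ev.ts > as_of:
                break
            if ev.op == "delete":
                state.pop(ev.entity_id, None)
            else:
                state.setdefault(ev.entity_id, {}).update(ev.fields or {})
        return state


log = ReplayableLog()
log.append(Event("e1", "student-42", "create", 1, {"major": "Biology"}))
log.append(Event("e2", "student-42", "update", 2, {"major": "Chemistry"}))
# The source system resends e2 after a network failure.
duplicate_accepted = log.append(Event("e2", "student-42", "update", 2, {"major": "Chemistry"}))

current = log.replay()          # state after all events
earlier = log.replay(as_of=1)   # state as it stood at timestamp 1
```

Because the full history is kept, both the current state and any earlier state can be derived from the same log, which is what distinguishes it from a point-in-time snapshot.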
All these events, or verbs, get placed into one very long table. Specifically, the table is one that has a variable record length, with different uses of columns, but all placed in one big, fat, wide table. While this is raising the hair on the back of the neck of old-school data warehousing folks, this sort of design is actually quite old, going back to very early mainframe days of computing, and works well in today's modern systems. For example, although the developer sees a single wide table, in-memory columnar database tools store that activity log in an entirely different internal structure. These large activity tables then serve as our data lake in a very simple, very big, and often very wide data structure. This simplicity enables all sorts of extensibility since it relaxes so many rules of data design. In addition, the streaming technologies let us more easily ingest data, IoT style, from a variety of differing systems. We use this activity table as the base structure upon which we build all our other analytic views.
Rule #2: Express Maximum Semantic Complexity.
In the old way of doing things, analytics developers will often include only data that matches the need required. We do the opposite. We try to bring in all the data that we can find in any given stream, whether we think we will use the data or not. An analogy is home construction, where it is much easier to put in electrical cabling before the walls go up, not after. For example, in one of our analytic domains, we are bringing in over 5,000 unique fields of activity data but surfacing only about 800 in our initial analytic views. The rest will be surfaced over time.
A second aspect of this rule is that we also bring in data at the lowest level of granularity possible. While this can often explode the number of rows of data within our models, the new technologies come to our aid with all sorts of tiering between high-cost and low-cost storage and several automatic compression methods.
Bringing in as many attributes as possible and all data at the lowest level of granularity also ensures that we will be able to answer any question that may be asked. The only questions that can't be answered are those for which there is no data or for which the source system did not capture data at that level of granularity.
We have found that this in-memory, parallel columnar analytics environment also means that we do not have to handcraft the pre-aggregation of data. All of our models can rely on the lowest level of detail at all times. If we do any pre-aggregation work in our designs, it is for the convenience of the analyst. Sometimes a pre-aggregated number is much easier to work with in visualization tools. Thus, we aggregate data for convenience only, not for speed.
Rule #3: Build Provisionally.
This rule is the hardest for data warehouse developers to swallow. Historically, these developers build fact tables and dimensions that must stand the test of time, and they do. But these models also end up requiring additional data structures, placed on top, to support analysts who often have many different ad hoc issues requiring their attention. Analytics has more than a whiff of ephemerality and needs to change based on ever-evolving business needs. Our designs must capture that ephemerality in a deeper way.
In our environment at UC San Diego, we have a replayable log in the activity tables serving as a data lake, which by itself is not really usable. We must build something on top of the data lake—something that can serve for the duration of time needed but that can also be considered ephemeral and adaptive. We have used the term curated view to capture two opposing intents here. The first intent is to ensure that a curated view of the data is appropriate for a specific, perhaps narrow, but also evolving use case. We call this type of view an analytics vignette. The second intent is to ensure that the curated view has as much structure as possible contained within it. Thus, the structure of hierarchies associated with any given data element and also the exact data type and formatting are carefully worked out. While the curated view may be provisional, the deep structures within it often are not.
Curated views can be redundant and overlapping. Several curated views can be combined and built on top of one another. Our curated view designs typically have three or four levels of hierarchy so that we can reuse code (SQL for us) when constructing curated views. Thus, what the user sees is a single, flat file with often a few hundred attributes (columns) that they can easily use. We take any joining of different tables away from the analyst. All our curated views are designed with all needed joins made.
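The layering of curated views can be illustrated with a small sketch that uses Python's built-in `sqlite3` module as a stand-in for an in-memory columnar platform. Every table, view, and column name here is hypothetical, and the example uses only two levels rather than the three or four mentioned above. The point is the shape: a wide activity table at the bottom, reusable building-block views in the middle, and a flat curated view on top with the join already made.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical activity table: one wide table; columns are used
-- differently (or left NULL) depending on the type of event.
CREATE TABLE activity (
    event_id    TEXT PRIMARY KEY,
    event_type  TEXT,
    student_id  TEXT,
    course_id   TEXT,
    first_name  TEXT,
    last_name   TEXT,
    units       INTEGER
);
INSERT INTO activity VALUES
    ('e1', 'student_created', 's1', NULL,      'Ada', 'Lovelace', NULL),
    ('e2', 'enrolled',        's1', 'MATH101', NULL,  NULL,       4),
    ('e3', 'enrolled',        's1', 'CHEM200', NULL,  NULL,       5);

-- Level 1: reusable building blocks, one per kind of event.
CREATE VIEW v_students AS
    SELECT student_id, first_name, last_name
    FROM activity
    WHERE event_type = 'student_created';

CREATE VIEW v_enrollments AS
    SELECT student_id, course_id, units
    FROM activity
    WHERE event_type = 'enrolled';

-- Level 2: curated view, flat and pre-joined for the analyst.
CREATE VIEW cv_student_enrollment AS
    SELECT s.student_id, s.first_name, s.last_name, e.course_id, e.units
    FROM v_enrollments AS e
    JOIN v_students AS s ON s.student_id = e.student_id;
""")

# The analyst queries one flat view and never writes a join.
rows = conn.execute(
    "SELECT first_name, course_id, units "
    "FROM cv_student_enrollment ORDER BY course_id"
).fetchall()
```

Because the curated view is only a definition over the lower levels, it can be dropped, replaced, or duplicated with variations without touching the activity table underneath.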
I have been focused on structured data, but our data lake may carry with it lots of unstructured data—that is, data not adequately described in terms of attributes and relationships. This unstructured data will be continually seeking structure, chiefly through advanced algorithms including AI and machine learning. As the unstructured data grows exponentially, so will the need to structure it. Hence, over time, structure will be added, and our data lake is designed to accommodate that.
Although this may sound hideously wasteful of computational resources, the five technologies listed above have made this approach eminently feasible and cost-effective. Because our curated views are built on top of highly reusable components and are themselves reusable in other curated views, we can have as much model overlap or outright redundancy as our analysts need while still keeping the views quickly changeable. We can easily create new curated views borrowing promiscuously from the collection of other views. By merging the data lake (the activity logs) and the curated views in one environment, we can ingest very complex and unstructured data in the activity tables right away and then incrementally add structure to the data and include that structure in existing or new curated views.
Rule #4: Design for the Speed of Thought.
Why the need for speed, especially when many decisions in business are not made with real-time or near-real-time data? A faster-moving environment is cheaper to manage and more pleasing to the analyst. Designing with speed from the start is far less costly than trying to add it back in later. Managing a real-time, incremental data integration architecture is also less expensive than managing an old-school batch-oriented one. Speed is important for analysts, data developers, and data scientists. A fast environment lets analysts work at the speed of thought with subsecond responses for all clicks in an analytics tool, regardless of whether the task is to aggregate a single column across 500 million rows or to drill into 500 rows of fine detail. A fast environment lets data scientists build and deploy models that much faster.
Designing for the speed of thought also requires moving as much of the complexity as possible to the infrastructure. For example, in our analytic environment at UC San Diego, we can take difficult filtering logic (sometimes complex Boolean and set logic) that would normally be expressed in the front-end visualization tool and convert it into reusable bundles in the back-end, relieving the analyst of that work. Our curated views will contain what look like redundant fields and will include permutations of a single field side by side (e.g., last name and first name combined into one field, in that order, as well as first name and last name, in that order). Although a downstream analyst can easily do that simple manipulation, we found that providing these small details within the analytics environment increases analysts' usage and throughput.
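As a rough sketch of what moving complexity to the infrastructure can look like, the hypothetical function below precomputes both name orderings and packages a piece of Boolean filter logic as a ready-made flag. The field names, the comma separator, and the "full-time undergraduate" rule are all assumptions made for illustration. In practice this logic would more likely live in the SQL of a curated view than in Python.

```python
from typing import Any, Dict


def enrich(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the row with convenience fields added."""
    out = dict(row)

    # Both orderings of the name, side by side, so no analyst has to build them.
    out["name_last_first"] = f"{row['last_name']}, {row['first_name']}"
    out["name_first_last"] = f"{row['first_name']} {row['last_name']}"

    # Reusable filter: Boolean logic defined once in the back end
    # (hypothetical rule) instead of in every front-end dashboard.
    out["is_full_time_undergrad"] = (
        row["level"] == "UG"
        and row["units"] >= 12
        and row["status"] != "withdrawn"
    )
    return out


enriched = enrich({
    "first_name": "Ada",
    "last_name": "Lovelace",
    "level": "UG",
    "units": 16,
    "status": "enrolled",
})
```

An analyst can then filter on `is_full_time_undergrad` with a single click rather than rebuilding the three-part condition in each visualization.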
The overall analytics environment also needs to handle a variety of tools that good analysts typically use and that can also support real-time, daily business activities. This "bring your own tools" approach requires a robust and flexible platform for handling many different analytics packages, including traditional statistics, neural networks, older and newer machine learning algorithms, and graphing algorithms, along with different deployment options for use in production settings. In the near future, many organizations will need to be able to manage dozens or hundreds of AI or machine learning models, all running real-time, acting on the stream of data as it flows in. These models need to be placed into service and taken out of service much more dynamically and frequently than in analytics of yore.
These new analytics models will be handling personalized alerting and nudging of people, as well as initiating system-to-system communication that will control human and system activities in an autonomic manner. This analytics environment must be capable of delivering timely information that fits within the cognitive time frame of the task at hand for each person and each intelligent computing system.
Rule #5: Waste Is Good.
Even though the "waste is good" rule is implied or directly called out in the prior four rules, it is worthwhile touching on it further. Developers who make the transition from traditional environments to the new environment all spend at least a few months struggling to accept what was so ingrained in them: the need to conserve computing resources. This is where younger and perhaps less experienced developers and data scientists may have an advantage.
All this supposed waste enables agile, flexible, super-fast, and super-rich analytics environments at a price that was unheard of ten years ago. In this environment we routinely take what would be a parsimonious data set and explode it, often to enormous sizes. Why? In a nutshell, we are trading off a larger data structure for a simpler algorithm. For example, in a typical large university, a class schedule for all courses offered can fit into a decently sized spreadsheet. Exploding that data set to also show each room's usage for each minute of the day, for every day in a year for twenty years, results in 2.3 billion rows. Why would anyone do such a thing? Because visualizing an "exploded" data set is trivial and allows the analyst an incredible level of granularity with either a visual or an analytics algorithm. This allows easy historical analysis and quick answers to numerous questions without requiring effortful programming. Is the university more or less efficient in class utilization over time? In which time slots can additional classes be added? How do physical classroom factors such as the quality of the building and the classroom affect student performance in classes over the twenty years? What are the long-term correlations with facility utilization and energy and other utility costs? These questions can be answered with little to no additional data-preparation work.
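The arithmetic behind the classroom example is easy to check. One room tracked minute by minute for twenty years produces 1,440 × 365 × 20 = 10,512,000 rows, so 2.3 billion rows corresponds to roughly 219 rooms. That room count is an inference from the stated total, not a figure given in the article, and the calculation ignores leap days. The sketch below also shows one hypothetical way to explode a single day of bookings into per-minute rows; the function and room names are illustrative only.

```python
from datetime import date, datetime, time, timedelta
from typing import Iterator, List, Tuple

MINUTES_PER_DAY = 24 * 60


def exploded_row_count(rooms: int, years: int, days_per_year: int = 365) -> int:
    """Rows needed to record every room for every minute (leap days ignored)."""
    return rooms * years * days_per_year * MINUTES_PER_DAY


def explode_day(
    room: str, day: date, bookings: List[Tuple[time, time]]
) -> Iterator[Tuple[str, datetime, bool]]:
    """Yield one (room, minute, in_use) row for each minute of the day."""
    midnight = datetime.combine(day, time.min)
    for offset in range(MINUTES_PER_DAY):
        minute = midnight + timedelta(minutes=offset)
        # A booking covers its start minute but not its end minute.
        in_use = any(start <= minute.time() < end for start, end in bookings)
        yield room, minute, in_use


rows_per_room = exploded_row_count(rooms=1, years=20)
implied_rooms = round(2_300_000_000 / rows_per_room)

# One 50-minute lecture in a hypothetical room.
day_rows = list(explode_day("HALL-101", date(2019, 1, 7), [(time(9, 0), time(9, 50))]))
minutes_in_use = sum(1 for _, _, in_use in day_rows if in_use)
```

Once the data is in this shape, a utilization question becomes a simple filter and count over rows rather than interval arithmetic over schedule records, which is the trade of a larger data structure for a simpler algorithm.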
The old rules for normalizing databases can be thrown away. But this is not a free-for-all environment. Quite the contrary. A new, strict set of construction rules replaces the old set. For example, the new methods for handling activity tables and curated views are as rigorous as any of the old rules for dimensional modeling. It's just that the new methods involve an inverted set of assumptions regarding space, size, and performance. We don't care about the growth or explosion of data. We embrace it.
Rule #6: Democratize the Data.
Data democratization means providing equal access to everyone, leveling the playing field between parts of the organization so that all parties can get access to and use the data. Today's economy is an information one, filled with information workers—and information workers need information. If key staff members in your organization end up in an environment not suitable for their intellectual skills, they will opt to leave. So, when we consider access to data more broadly, we're not talking only about the data itself. We're also recognizing the information-oriented nature of today's work and recognizing the complexity of organizations.
Organizations that invest in decentralized decision-making and that make the necessary investments in technology and organizational practices perform much better than their peers that don't do so.[1] Data must be freed from silos and transcend traditional hierarchies, making data hoarders a thing of the past. In this new world, information is no longer power. Information sharing is power. While anecdote and intuition can be potent influences on decision makers, this "faith-based" reasoning gets replaced with "fact-based" reasoning that accommodates more diverse thinking and perspectives than in the past. People who are lower in the organization and are empowered by data can now begin to correct errant anecdotes promoted by powerful people.
Organizations will need to establish a new culture to achieve this data democratization. That culture is going to require executive support—an executive mandate, perhaps—as well as a bottom-up push. But the shift to data democratization can be somewhat subtle. Organizational leaders can tell staff that information sharing is power and that information management is a team sport. People don't own data; rather, they steward it. Data democratization can also be furthered through practices such as the following:
- Develop team-based processes for fixing data quality at the source. These processes put data consumers in tighter collaboration with data producers and place responsibility for data quality in the community.
- Reduce between-team rivalries by eliminating mission overlap and redundant activities where possible. Conflict over information is often caused by groups with highly overlapping missions, activities, and populations served. In this model, each duplicative unit has something to gain by competing with or trash-talking its peer unit.
- Work bottom-up, top-down, and sideways in the organization with messaging and community building.
- Ensure that "getting the culture right" is an executive concern and that leaders are promoting fact-based reasoning.
The traditional two-dimensional hierarchical model of organizations no longer accurately represents how information is managed within colleges and universities and how staff activities generate value. The new rules require organizations with more organic, molecular structures that can alter themselves in order to improve nearest-neighbor communication whenever and wherever that communication is needed. All notions of a hierarchy disappear as individuals find the colleagues they need to communicate with on a regular basis, independent of the organizational structure. Two structures are at play here: the structure that exists on paper and the structure that exists in reality, based on information flows. The latter is a three-dimensional, molecular protein folding structure, where parts of the organization fold in on themselves in order to enhance communication and collaboration. That's the dynamic structure we want to enable.
Data democratization turns individual accountability on its head. In complex processes that involve multiple units, success and failure become a team affair. All team players (organizational units) must share information in order to adjust themselves as needed. Of all the new rules, this one is the most flammable. The privacy, political, and organizational untangling that may be required to enable data democratization can be overwhelming.
Impact on IT Organizations
The implications here are manifold. IT teams will need to be organized very differently, around the following activities:
- Data movement design
- Data movement orchestration and monitoring
- Data architecture and data design
- Data science modeling
- Data science orchestration and monitoring
- Data security and privacy
- Data democratization community development and management
These skills, which are a repackaging of older skills but now wrapped around dynamic cloud technologies, form a sort of data science design and operations group. With software as a service growing by leaps and bounds, many IT organizations do not develop software. Instead, with a coterie of various systems in the cloud originating data, a new "Dev Ops" or "Data and Data Science Ops" team is being born. While the rigor learned from older methods like Kimball- or Inmon-style data warehousing is always useful,[2] the specific design patterns and assumptions of older methods, especially if they are deeply ingrained in an IT worker, are not. Concepts like ETL, operational data stores (ODS), staging tables, and fact tables and dimensions were formed around systems with more significant performance limitations and are not directly applicable or needed in new environments.
Managing 21st-century AI and machine learning algorithms requires an operational environment that is very different from conventional analytics.[3] Machine learning models are like tires: organizations will have many in service, and the models will require ongoing data operational monitoring and evaluation to identify anomalous output and responses, as well as to determine how fast or slow the models may be responding to the data. IT teams will need the right platforms for holding models on a "runway" for development and testing, for letting the model "take off" in production, and then for bringing models out of service while introducing new models seamlessly. This is not your grandparents' data analytics environment.
These new technologies also raise more legal and ethical concerns. Privacy laws and policies are growing rapidly and differentially across multiple regions of the globe. New AI and machine learning approaches can introduce new forms of benefits and legal liabilities not previously imagined. While data democratization is both important and helpful, ensuring a highly secure environment takes on greater importance too. The tension between serving students and serving our communities needs to be balanced against institutional risk and personal privacy concerns. Fortunately, higher education is not alone. Our peers in the health care world have been grappling with security, privacy, and ethics issues and can offer good models and guidance. Research into quantitative techniques for ensuring privacy while still publishing (or using) data is advancing, and solutions for using these advanced methods (e.g., k-anonymity and differential privacy) are finding their way into products, with more advances in technical controls coming.
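Of the techniques mentioned above, k-anonymity is the simplest to illustrate: a data set is k-anonymous with respect to a set of quasi-identifiers if every combination of their values is shared by at least k records. The sketch below measures k for a toy data set and shows how generalizing a field (here, truncating a ZIP code) can raise it. The records, field names, and generalization rule are invented for illustration; production tools handle many more cases, and this sketch says nothing about differential privacy.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def k_anonymity(records: Iterable[Dict[str, str]], quasi_identifiers: Sequence[str]) -> int:
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


# Invented records; "major" plays the role of the sensitive attribute.
records = [
    {"zip": "92093", "birth_year": "2000", "major": "Biology"},
    {"zip": "92093", "birth_year": "2000", "major": "Physics"},
    {"zip": "92037", "birth_year": "2001", "major": "Biology"},
    {"zip": "92037", "birth_year": "2001", "major": "History"},
    {"zip": "92039", "birth_year": "2001", "major": "Physics"},
]

# With full ZIP codes, one student is unique on (zip, birth_year).
k_raw = k_anonymity(records, ["zip", "birth_year"])

# Generalize: keep only the first three digits of the ZIP code.
generalized = [{**r, "zip": r["zip"][:3] + "**"} for r in records]
k_generalized = k_anonymity(generalized, ["zip", "birth_year"])
```

Here the raw data has k = 1 because one record is unique, while the generalized data has k = 2: some precision is given up so that no individual stands alone.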
The impact outside of the IT organization is likely to be even more significant. Organizational structures that are more flexible and organic in nature are difficult for traditional, control-oriented managers to accept. As information flows more freely to people lower in the organization, power appears to shift away from traditional managers. That is probably as it should be. Colleges and universities are information- and knowledge-generating entities. Information workers in higher education can't be held captive; highly skilled information workers can and will leave for environments friendlier to their talents. Managers will need to move away from a mental model in which they own and control resources, toward a model in which they are primarily facilitators and guides on the side for those they lead.
How higher education institutions portion out accountability will need to be re-examined as well. Institutional success will be increasingly guided by fascinating forms of fast and advanced analytics brought to bear to help reduce operating costs while making an impact on the institutional missions. A shared-information environment requires a shared-risk/reward environment that can promote excellent team play. This can make managing accountability more difficult, since simplistic, top-down reviews need to be blended into more complicated assessments of multi-unit system dynamics with a more diffuse accountability structure. This is also an inversion of the traditional accountability mindset that focuses on specific individuals and leaders. On the surface, this seems harder for managers to do, but one thing that is uniquely human is our ability, as a species, to manage a much larger and more complex set of social linkages. People can and do learn how to operate in more socially complex and dynamic environments. Many followers are up to the challenge. Are leaders?
Summarizing, the new technologies and new rules bring with them different concepts. Table 1 summarizes the key differences between the old approach and the new.
| # | Old Concepts | New Concepts |
|---|---|---|
| 2. | Design for system performance constraints | Design for speed-of-thought |
| 3. | Periodic batch data movement or ETL | Incremental, streaming real-time data movement |
| 4. | Point-in-time snapshots | "Replayable log" data lake |
| 5. | Separate models for live, near-line, and archival tiers | Single synoptic models across tiers |
| 6. | Dimensional modeling | Multiple modeling techniques |
| 7. | Data Ops | Data Science Ops |
| 8. | Statistical analysis | Machine learning / artificial intelligence |
| 9. | Actionable analytics to humans | Machine-to-machine analytics in action |
| 10. | Top-down data control | Bottom-up data democratization |
| 11. | Hierarchical organizational models | Network/molecular organizational models |
| 12. | Individual accountability | Team accountability |
Other implications abound. We need to enumerate them, and we need to ponder them. But we don't have the luxury of time. 21st-century analytics is here, and it is radically different. New technologies call for new rules. Those of us in higher education should learn and adopt these rules. Only with new data and analytics mental models can we discover and take advantage of new possibilities.
This article is expanded and updated from Vince Kellen, "6 New Rules for Managing 21st-Century Analytics," Cutter Business Technology Journal 32, no. 1 (January 2019).
1. Erik Brynjolfsson, Lorin M. Hitt, and Shinkyu Yang, "Intangible Assets: Computers and Organizational Capital," in George L. Perry and William C. Brainard, eds., Brookings Papers on Economic Activity 2002 (Washington, DC: Brookings Institution Press, July 1, 2003).
2. Ian Abramson, "Data Warehouse: The Choice of Inmon versus Kimball," presentation slides (New Berlin, WI: IAS, 2004).
3. See Ellen Friedman and Ted Dunning, Machine Learning Logistics (Sebastopol, CA: O'Reilly Media, October 2017).
Vince Kellen is CIO at the University of California, San Diego (UCSD), a member of UCSD's Chancellor's Cabinet, and a member of UCSD's Vice Chancellor and CFO's senior management team.
© 2019 Vince Kellen. The text of this article is licensed under the Creative Commons Attribution-NoDerivatives 4.0 International License.
EDUCAUSE Review 54, no. 2 (Spring 2019)