Fostering Innovation with Data

Our past experience in sustainable project analysis

Billions have been invested over the past three decades in scaling up green technologies [1] (anaerobic digestion, biodiesel, cellulosic ethanol, gasification-to-fuels, fast pyrolysis, biochar), with vastly different commercial outcomes. Some technologies, supported by major industrial players and intensive R&D, remain without sustained commercial activity, while others, relatively neglected, have achieved credible deployment [2]. The reasons for these differences are currently hard to analyze. Could alternative funding and technology development strategies foster more sustained innovation? Are we supporting the most cost-effective technologies for reducing GHG emissions, or are some game-changing technologies (more competitive economically, with lower market entry barriers, more robust operationally) being left behind because their advantages go unacknowledged or because their early development faces greater constraints? Beyond citation metrics, rigorous analyses of how R&D and funding translate into market outcomes remain scarce.

The data, however, exists: a growing number of R&D portfolios are now publicly accessible, thousands of commercial projects have been attempted, and hundreds of economic and sustainability analyses have been produced, sometimes at high cost. Yet this knowledge is fragmented and often underutilized. Debates on fostering technological innovation tend to focus on bureaucratic inertia, risk aversion and funding levels. These are real concerns, but they obscure more granular and tractable questions about what is actually going well in the innovation ecosystem (e.g., the BETO Peer Review [3]) and what is really required to bring a good idea to life. NSF, BETO, CORDIS, ARPA-E and institutions like VTT operate with different mandates, risk profiles and time horizons. When industrial actors such as Mitsubishi [4], GTI [5], and Hitachi work on the same technology alongside publicly funded researchers, their development strategies diverge in ways that are rarely documented, let alone systematically compared. Have industry and research solved complementary problems, or are advances made in one sector going unrecognized by, or remaining inaccessible to, the other, while some challenges are left unaddressed [5]? To our knowledge, no organization or infrastructure attempts to assemble this knowledge in a way that makes the patterns visible.

Several reasons likely explain why building such infrastructure is structurally challenging. First, recognizing a more promising direction is institutionally difficult: it implies questioning past investments, navigating political constraints, and justifying complex tradeoffs within the time available to make a decision. Second, the data exists but remains inaccessible without dedicated infrastructure; building that infrastructure requires both technical and domain expertise that few people combine, and those who do rarely have the institutional power to act on what it reveals. Third, no institutional space currently exists where comparative market data could inform decisions without threatening academic freedom or imposing commercial logic. Yet the technical barriers to building such infrastructure have dropped far enough that a small, dedicated team could now act where institutions cannot.

Hypothesis

What is missing for innovation is often not more data, but the infrastructure to benchmark a dense and accelerating flow of information. It is currently difficult to analyze how institutional datasets compare with each other and with recent knowledge, how R&D portfolios compare with commercial outcomes, or how economic models compare with market realities. Research is costly to produce, so its outcomes should be put to use. A rigorous evaluation of a technology's potential requires bringing together multiple perspectives, each with its own data challenges: the commercial track record, the environmental footprint, and the R&D portfolio, for instance. Read in isolation, each is limited; read together and compared across dozens of projects, they reveal patterns that an individual study can hardly capture.

Over two years of PhD work, we have begun to explore this across three complementary perspectives: economic, sustainability, and R&D portfolio evaluation. The pace accelerated sharply with AI-assisted coding: more was built in the last two months than in all the time before. This matters beyond our own experience. A growing number of R&D databases are now publicly accessible via API; modern data management tools have drastically lowered the barrier to linking heterogeneous sources; and AI assistance reduces the expertise required to explore unfamiliar data domains.

We thus hypothesize that the technical barriers to building such infrastructure have dropped far enough that a small, dedicated organization could now produce analytical capacity well beyond what funding agencies and institutions currently use, and in a fraction of the time. Given the nature of such work and the multitude of directions it can take, an agile workflow coupled with a time-bounded go/no-go pilot appears to be the most appropriate strategy. A concrete pilot could thus test whether a small team can build, in three months, a modular infrastructure capable of linking heterogeneous data domains for various uses. In two months, a single developer was able to build an infrastructure linking heterogeneous complex objects (process, operational unit, feedstock, literature…) [6], interactive timelines, process schemas, economic models [7], a module to batch AI requests on object data [8], a module to extract and filter large R&D datasets [9], etc. A team with the proper know-how could likely do more, on a broader range of topics.
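To give a sense of what "linking heterogeneous objects" can mean in practice, here is a minimal sketch. It is not the infrastructure described above: the `Entity` and `LinkStore` names, the relation labels and the sample objects are all assumptions made for illustration. It only shows one simple way to connect objects of different kinds and query across them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A typed object; `kind` might be 'process', 'feedstock', 'literature'..."""
    kind: str
    name: str


class LinkStore:
    """Hypothetical in-memory store of labelled, undirected links."""

    def __init__(self):
        self._links = defaultdict(set)

    def link(self, a, b, relation):
        # Store the link in both directions so either end can be queried.
        self._links[a].add((relation, b))
        self._links[b].add((relation, a))

    def neighbours(self, entity, kind=None):
        # Return linked entities, optionally restricted to one kind.
        linked = self._links.get(entity, set())
        found = {other for _, other in linked if kind is None or other.kind == kind}
        return sorted(found, key=lambda e: (e.kind, e.name))


# Illustrative objects only; none of these come from the actual dataset.
pyrolysis = Entity("process", "fast pyrolysis")
reactor = Entity("operational unit", "fluidized-bed reactor")
residues = Entity("feedstock", "forest residues")
paper = Entity("literature", "placeholder reference")

store = LinkStore()
store.link(pyrolysis, reactor, "uses")
store.link(pyrolysis, residues, "consumes")
store.link(paper, pyrolysis, "documents")

for entity in store.neighbours(pyrolysis):
    print(entity.kind, "->", entity.name)
```

A real system would likely need persistent storage, directed relations and provenance for each link; the sketch only illustrates that the core linking idea is small.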

Economic analysis examples – showcasing how policy and R&D priorities can diverge from market realities

Evaluating the economic and commercial potential of bioeconomy technologies requires assembling data of inherently limited quality: techno-economic assessments built on idealized assumptions, public announcements, press releases, etc. The examples below illustrate what structured comparison reveals. Achieving the expected performance for new technologies often takes more than five years [6], and developing them spans decades [4]. Technologies that cannot reach commercial viability at smaller scales have a consistently poor commercialization record [10], [11], while smaller-scale approaches show more resilience: biochar has grown rapidly over the past 10 years [12], and some cellulosic ethanol and fast pyrolysis plants have survived despite persistent performance challenges and difficult market conditions [6]. Market selection failures are costly and sometimes foreseeable [11]. And economic models used in R&D evaluations routinely rest on assumptions that commercial experience has quietly invalidated [13].

R&D portfolio and AI usage – dealing with large databases

R&D portfolio analysis reveals a different layer of the same problem. Funding databases from NSF, BETO, CORDIS and ARPA-E contain hundreds of thousands of project descriptions — but without infrastructure to filter, annotate and compare them, the patterns they contain remain invisible. The gap between what gets funded and what reaches commercial viability is sometimes striking, and rarely documented in a way that informs future allocation. The examples below show some early attempts to make these patterns visible — what works, what remains frustratingly unclear, and where the real methodological barriers lie.
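As a rough illustration of the "filter, annotate and compare" step, the sketch below tags project descriptions with technology keywords and counts them per funder. Everything in it is an assumption: the `Project` structure, the keyword patterns and the three sample records are invented, and no real funder API is called. Keyword matching is also a crude stand-in for the annotation a real analysis would need.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Project:
    funder: str
    title: str
    abstract: str
    tags: set = field(default_factory=set)


# Hypothetical keyword patterns; a real taxonomy would be far richer.
TECH_PATTERNS = {
    "fast pyrolysis": re.compile(r"\bfast pyrolysis\b", re.IGNORECASE),
    "biochar": re.compile(r"\bbiochar\b", re.IGNORECASE),
    "cellulosic ethanol": re.compile(r"\bcellulosic ethanol\b", re.IGNORECASE),
    "anaerobic digestion": re.compile(r"\banaerobic digestion\b", re.IGNORECASE),
}


def annotate(projects):
    """Attach every technology whose pattern appears in the title or abstract."""
    for project in projects:
        text = f"{project.title} {project.abstract}"
        project.tags = {
            tech for tech, pattern in TECH_PATTERNS.items() if pattern.search(text)
        }
    return projects


def count_by_funder_and_tech(projects):
    """Count annotated projects per (funder, technology) pair."""
    counts = Counter()
    for project in projects:
        for tech in project.tags:
            counts[(project.funder, tech)] += 1
    return counts


# Invented records, for illustration only.
SAMPLE = [
    Project("NSF", "Catalytic upgrading of fast pyrolysis oil", "Bio-oil stability."),
    Project("BETO", "Biochar as a soil amendment", "Biochar from fast pyrolysis."),
    Project("CORDIS", "Grid-scale storage", "No biomass conversion involved."),
]

counts = count_by_funder_and_tech(annotate(SAMPLE))
for (funder, tech), n in sorted(counts.items()):
    print(f"{funder:<8} {tech:<20} {n}")
```

Once projects carry consistent tags, the resulting counts can be set against commercial records for the same technologies, which is where the comparison becomes informative.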

Sustainability analysis – the challenge of making politically sensitive data transparent

Environmental assessments present a different challenge. Life-cycle assessment data is often technically accessible but practically unworkable: large Excel files with poor annotation, where tracking assumptions, updating values, or comparing across studies requires infrastructure that spreadsheets cannot provide. Land-use change adds a further layer. Comparing institutional frameworks at regional resolution already reveals striking patterns: for Indonesian palm oil, regional estimates range from under 20 to over 300 g CO2e/MJ, against a CORSIA value of 39 g CO2e/MJ. But challenging the assumptions embedded in the underlying models requires linking heterogeneous data sources that were never designed to work together (soil surveys, agricultural registries, biomass availability data, land capability classifications), each at a different resolution and maintained by a different institution.
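The comparison against a reference value can be sketched in a few lines. Only the reference of 39 comes from the text above; the three regional figures are placeholders chosen to span the range mentioned, not real estimates, and the region names are hypothetical.

```python
# Reference value quoted in the text (same unit as the regional figures).
CORSIA_REFERENCE = 39.0

# Placeholder values, NOT real regional estimates.
regional_estimates = {
    "region_A": 18.0,
    "region_B": 45.0,
    "region_C": 310.0,
}


def compare_to_reference(estimates, reference):
    """Return one row per region, sorted by value, with gap and ratio."""
    rows = []
    for region, value in sorted(estimates.items(), key=lambda item: item[1]):
        rows.append(
            {
                "region": region,
                "value": value,
                "difference": value - reference,
                "ratio": value / reference,
            }
        )
    return rows


rows = compare_to_reference(regional_estimates, CORSIA_REFERENCE)
for row in rows:
    print(
        f"{row['region']:<10} {row['value']:>6.1f} "
        f"{row['difference']:>+7.1f}  x{row['ratio']:.2f}"
    )
```

The arithmetic is trivial; the difficulty lies upstream, in producing regional values whose assumptions are traceable enough to be compared at all.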

Conclusion

The examples presented here illustrate a recurring pattern: the gap between R&D investment and commercial outcomes is rarely explained by a lack of good ideas or insufficient funding alone. Technologies that require a decade to reach expected performance, projects scaled beyond what early markets can absorb, economic models systematically disconnected from commercial realities: these are not unforeseeable outcomes. They appear repeatedly in the historical record, and failing to account for them affects both funding and policy decisions.

What structured, comparative data can do is make these patterns visible so that more people can act on them, and the technical barriers to doing so have dropped drastically. With current tools, a small team can plausibly build in months what, two years ago, would have required years and significant institutional resources. The go/no-go question remains concrete: can a small team build, in three months, a modular infrastructure stable enough for external use, with a functional data entry pipeline, core modules that generate insights a serious analyst would trust, and a clear view of what additional development the system still requires? That question is worth testing now, and the answer will inform whether this kind of infrastructure can realistically scale beyond a single research context.
