“Digital Vellum and Archives” – Dr. Vinton G. Cerf

Again, this is my privilege. I had the privilege of learning from you, Vint, in conversations and discussions last April, at a meeting of the Trusted Digital Repositories ISO working group that was held at Goddard Space Flight Center. One of the things we saw there is real interest across the government, across the scientific community, and among the public in learning about your insights and your perspectives on the challenges and the opportunities of digital preservation. I think we agree that this is still a significant challenge across many communities. So with that, it's my privilege to welcome Dr. Vinton Cerf, Vice President and Evangelist of Google Corporation, to share his perspectives on these challenges and these opportunities. So thank you.

Thank you very much. I always get nervous when people
clap before I’ve said anything. I think I should just
sit down because it won’t get any better than that. My actual title is Chief
Internet Evangelist, so I’m not just a
generic evangelist. I’m geek orthodox. So I do want to say
something about NITRD because I was involved fairly
early on in that activity. I served on the President’s
Information Technology Advisory Committee. And I can tell you
that that was actually a very powerful, very
effective cross-correlation across the agencies, even
back then in the early 1990s. And it continues, today, to
be a really powerful resource for coordinating people's R&D efforts. It's evidence that
it’s possible to have inter-agency cooperation. And in these
contentious times, it’s nice to see that the
agencies can work together to produce better research. So what I want to do
today is talk about how we preserve digital content. And I’m not going to be able
to expose all of the issues and problems, but I hope I
will trigger some discussion. And I think there’s time in
this session for at least a half an hour or more, depending
on how long I run off at the mouth, to talk about
some of the implications of these issues. So let’s start out by making
an observation that there is a substantial amount
of digital content that’s being produced
every day, increasingly so as we depend more and
more on digital devices for our everyday work. These things, the production
of these digital objects, produces very, very
complex things. If you look at a spreadsheet, for example: yes, it's just a file, but it's filled with meaning if you have the right software to interpret it. So we create all of these
very complex objects on a regular basis, and they
enter into storage systems everywhere. We are also interested
in describing what these things look like, what their structure is,
what their semantics are. The first problem is
being able to describe these complex objects in a
way that can be correctly interpreted in the future. The second thing
is figuring out how to identify these objects, to
give them labels that we can use later to find things again. And there are issues
associated with the longevity of the use of those labels. Domain names, for
example, are not stable. If you don’t pay
your annual fee, the domain name may fall
into somebody else’s hands or simply fall into disuse. And every URL that used that
domain name won’t resolve, and a reference to it will fail. And it’s very scary, when you
look at today’s publications, to see all those URLs in there
that people intended to draw you to important content. And yet those
things may no longer work even five years, or
10 years, or even tomorrow. So that's the big issue: finding stable identifier spaces that we can rely on for decades or more. There is a question about
taking content into an archive and having a very regular process for making sure we've captured all of the metadata associated with these digital objects that we want to preserve and make reference to. So there should be standard
processes for that. There’s also a question
about where did it come from, what does it mean,
how was it prepared? And this kind of information
becomes increasingly important when the data turns out to be
scientific data, measurement information. We need to know about the
calibration of the instruments that produced the data. We need to know what units the data is expressed in. Five. OK, five what? You know, is it centigrade, Fahrenheit, pounds per square inch, or something else? So there is a substantial amount of information that has to be captured in addition to the specific content.
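To make that concrete, here is a minimal sketch in Python of the kind of record an archive might require alongside a raw measurement. The field names here are hypothetical, not any particular metadata standard.

```python
from dataclasses import dataclass, field

@dataclass
class Measurement:
    """A raw value plus the context needed to interpret it later."""
    value: float
    units: str                      # e.g. "degrees Celsius", "psi"
    quantity: str                   # what was measured
    instrument: str                 # which instrument produced it
    calibration_note: str           # when/how the instrument was calibrated
    provenance: dict = field(default_factory=dict)  # who, where, how prepared

# "Five" by itself is meaningless; this record says five *what*.
reading = Measurement(
    value=5.0,
    units="degrees Celsius",
    quantity="sea surface temperature",
    instrument="buoy thermistor #42",
    calibration_note="calibrated 2016-03-01 against a traceable reference",
    provenance={"campaign": "example cruise", "processing": "raw, unfiltered"},
)
print(reading.value, reading.units)
```

Anyone opening the archive a century from now gets the five and the "what" together.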
Then there's this question about saving things in a legal context. So let's suppose, for
the sake of argument, that you have created an object
with a piece of software that’s proprietary. And it turns out
that– let’s take a text document produced with
Microsoft Word to be concrete. Well the question is,
100 years from now, will that document encoded in
the Microsoft Word format still be acceptable to the
people of the day? Will they be able to
run the Microsoft Word program, which might not be
current 100 years from now? And the answer may be no. So the question is,
if you’re thinking in those terms,
under what conditions are you allowed to save a copy
of the Microsoft Word program? And then allow third
parties to execute it, even if they didn’t pay
a license fee to do that? So you can make up a whole
series of scenarios about this where companies go bankrupt,
companies are bought up by some other company, a
product line is terminated for business reasons. There are all kinds
of things that may interfere with the way in
which we can preserve access to and use of data. And finally, there’s a
question about business models. If you’re really seriously
thinking hundreds of years, how many businesses
do we know that have lasted hundreds of years? And yet the archive– which is
supposed to hold onto this data until year 3,000
or something– is going to have to have a business
model that sustains that. So far, the only
businesses that I’m familiar with that
have longevity on the order of
hundreds of years are the Catholic Church,
breweries, and wineries. And so I’ve never been
able to figure out exactly how to connect the
breweries and the wineries with digital archiving,
but it sounds like it would be a fun problem. And of course, as the Chief
Internet Evangelist of Google, I have this feeling like I
have a religious commitment to preservation. But it’s a serious question. What instrumentality
allows information to be preserved and paid for,
for that long period of time? I’m feeling funny
about these slides. Did it even start out that way? OK. We’re going to flip
through these slides because this is not how I
remember my slides started. So I’m beginning to
worry whether– OK, this is interesting. Well, we’ll do it
this way, hang on. Usually I start
with that one, but. OK, here we go. It’s a very interesting
problem to figure out how to inject information
into an archive, and how to describe it. And when you get down
deep into this stuff, you worry that the
people who recorded the data in the first
place, and represented it in one way or another, used
language and descriptors that they were familiar with,
they were comfortable with, and knew how to interpret. But 100 years or
1,000 years from now, it’s not clear
whether the context and the meaning of all those
terms and representations will be clear. And so you have this
question, what’s the community that
will be able to access the data from the archive
and correctly interpret it? If we have a group of
people who understand the meaning of the descriptors
and all the metadata and everything else, over time, that community
may shrink to zero. In which case, you
end up having to do a kind of recursive process
where you define more and more precisely, for future
generations, what the meaning of this stuff is. And so even if you think
you’ve done your job for say, the next 10 years. You may have, 10 years
from now, a new job to do, which is to define the
descriptions so that someone who isn’t familiar
with the context can still figure out
what the data means. So this process could turn
out to be quite recursive. There’s this system called
Open Archival Information System, or OAIS, which NASA has
been very much involved with. Its Consultative Committee for Space Data Systems has looked at and tried
very hard to describe what an archive is, what it
looks like, how it functions, what language it uses,
what models it uses, in order to capture data
and preserve it over time. And so those standard
definitions, descriptors, and everything else
are very important. Because standardization
will help us speak to the
future generations about what this content is. And then of course, you
end up with special cases depending on where the data
came from and what kind of data it is. If it’s data from the
Large Hadron Collider, that’s one thing. If it’s data from
astronomical telescopes, it’s something else. So we need language that’s
specific to the kind of content that we're storing away. That's also true of things like programs: games, analytical programs, and everything else. Some of these software packages are going to be very unique. Simulations of physical processes, for example. And you may need special
terminology and special descriptors in order to explain
the structure of the content. We need to be
systematic about this, and I think every
one of you knows that the President's Council
of Advisors on Science and Technology and the Office
of Science and Technology Policy have proposed that
all research that’s sponsored by the US
government have a preservation element to it. And so there needs to be a plan
offered by the researchers who are funded by taxpayer dollars,
to preserve the information that they produce in the
course of their research. I also think that
it’s very important not to rely on any one place
to store all this stuff away. And so architecturally, we want
to have multiple things that we can identify as archives. They need to be able to
inter-work with each other so we can move data from one
archive to another at need. Partly to distribute
risk and partly to cover the case when
an archive is going to be– its business
model has failed and we need to move the
data someplace else. And then there is this question
of what policies will provide incentives for creating,
operating, and sustaining archival processes. All of these things
have to be dealt with, so this is not just
a technical problem. This is actually a very
complex business, legal, and technical problem
that we have to deal with. So let me go back to
history for a moment and observe that we have been
preserving a lot of information over long periods of time. For example, in the Babylonian
and Assyrian periods, we had these cuneiform tablets. They weren’t actually intended
to last for 5,000 years. For the most part,
they were records of transactions stored in
warehouses, in jars, and so on. Except that often, fires baked
the tablets into much more resilient material. I had dinner last Saturday
at a place called 2941. The owner, Rick Adams,
is a remarkable collector of written materials. And among his collection are
the Archimedes Palimpsest, which is about 1,000 years old. And a number of
cuneiform tablets and other papyrus materials,
Egyptian materials. And we spent a couple of hours
just crawling around, having a great time looking at the
historically preserved things. The bad thing about a lot
of the objects that he has preserved– or
conserved, or retained– is that it’s all by accident. It wasn’t planned. A fire created a tablet
which lasted a long time. Somebody stashed away a bag
full of papyrus manuscripts in the desert but it didn’t
rot because it was dry, and it was discovered
accidentally. We should not be doing digital
preservation by accident. We should be doing
this on purpose. But you can see that these
various physical media have been pretty resilient. So papyrus is not
particularly resilient unless you stored it
away in a dry atmosphere. But what about vellum? It’s sheep skin or calf skin. Actually originally
it was calf skin. But the term generally
has referred to materials made out of the skin of animals. Sometimes you hear
the word parchment. Those things last
1,000 years and more. We were looking at
manuscripts at 2941 that were 1,200 years old. And they were illuminated,
magnificent, and beautiful– if you could read the
Greek, Latin, and so on. That’s the wetware that’s
required to correctly interpret the otherwise preserved object. But these things
are static objects that have been
preserved in the media, that you see on this chart. And then of course we have
these wonderful inventions that we’re all so proud of, and
they store digital information for a few years. As opposed to a few
hundred, or a few thousand. I still have 5 and
1/4 inch floppies in the closet somewhere, but
nothing to read them with. I have 3 and 1/2 inch floppies. I even have a 3 and a
1/2 inch floppy reader which plugs into a USB
port in the Macintosh. It actually reads
the files but it doesn’t know what to do with
them after that because they’re WordPerfect files, and
I don’t have WordPerfect running in anything. VHS. I hear that the
last VHS production operation has now closed down. I think I have one VHS
recorder or player at home, and hundreds of VHS tapes. So our embarrassing problem
is that while we've invented these wonderful tools
for storing digital content, we don’t have any guarantee
they can be read in the future. So that’s part of the challenge,
is the physical medium. We have to keep copying bits
from one medium to another if we want to hang onto the
bits over any reasonable period of time. And then there are other
kinds of static materials like YouTube videos,
photographs, and so on. And I will argue that a video
is, for all practical purposes, a static image. It’s just a series of fixed
images and audio recordings. But it’s not
dynamic in the sense that it doesn’t change over
time; it’s a fixed object. So we’ve learned how to store a
lot of that stuff fairly well. But I want to ask you to imagine
that you are Doris Kearns Goodwin in the 22nd century. Some of you will have read the
story of Lincoln’s presidency. The book called Team of Rivals
was actually astonishing. If you’ve read it,
you’ll recognize the remarkable ability she had
to recreate, in a credible way, the dialogue of the time among the various principals. And I remember asking,
how did she do that? It sounded as if she had been
a fly on the wall in 1860, except we know she wasn’t. Well she went to, I don’t
know, a hundred libraries as a sample number. I don’t know how many
she had to go to, but it was certainly a lot. She got the letters
that the principals exchanged with each other. So she knew the
topics of the day, and she knew the
language that was used, and she knew the positions that
many of the principals took. She was able to
recreate that dialogue. Now if you’re in
the 22nd century and you’re curious about the
beginning of the 21st century, where would you get the e-mails,
Tweets, Quoras, Facebook images, and all the other stuff? In the 22nd century,
it isn’t all clear that those things
will be preserved. The companies that created
them may not be in existence. The cloud systems that have all
that data may no longer exist. Or somebody would have
failed to pay their fees and so their data’s
been flushed. It’s possible that we’re about
to walk into a digital dark age where all this content
is no longer accessible. And even if we’ve
got the bits around, there are some other
problems that we’ll come to. So executable content
is the next big problem, because this stuff isn’t static. You have to run a
program on a computer in order to make it do anything. And so if you have
a complex file like a video game
or a spreadsheet, you need the software
to be running in order to make that object useful. And we have a little
problem with that; there are all kinds of examples of executable content where, in the future, we would still need to run the app. Well, we have to figure
out what the bits mean, which usually means we have to
run the software that created the bits in the first place. We need to know what those bits
are, so we need the metadata. We have to have executable
code or the source code. If we have the
source code, we have to have a compiler
to compile the source code into something we can run. And so the model
I have in my head is that– suppose you have a
computer, like a laptop down there at the end of the table,
running a piece of software. If I could take a digital
X-ray of the machine– and this is a metaphor, OK,
don’t take this literally. But if I could take a
digital X-ray of the machine, I would have the image
of the instruction set of the machine, the hardware. I would have an image
of the operating system. I’d have an image of the
executable application program, and I'd have the bits
of the complex files that that program
interacts with. So if I could capture that
digital X-ray in a reliable way and then store it away, I
would have enough information in order to re-execute the
program with the data that was associated with it.
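As a rough illustration of that metaphor, here is a hypothetical sketch of what such a digital X-ray might record. The structure and field names are invented for this example; they are not an actual archival format.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class DigitalXRay:
    """A snapshot of everything needed to re-execute a program later."""
    instruction_set: str               # description of the hardware / ISA
    operating_system: Dict[str, str]   # name, version, image checksum
    application: Dict[str, str]        # name, version, binary checksum
    data_files: Dict[str, str]         # filename -> checksum of the complex files

    def is_complete(self) -> bool:
        # An ingest step might refuse a snapshot that is missing any layer.
        return all([self.instruction_set, self.operating_system,
                    self.application, self.data_files])

snapshot = DigitalXRay(
    instruction_set="x86-64, as documented in the vendor's manual",
    operating_system={"name": "SomeOS", "version": "1.0", "sha256": "..."},
    application={"name": "SomeSpreadsheet", "version": "2.3", "sha256": "..."},
    data_files={"budget.xyz": "..."},
)
print(snapshot.is_complete())
```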
So that's sort of a flavor of where I think I want to go to solve
part of the digital archiving problem. We also have a
problem that there is a lot of digital content
that’s being created, more and more every day. It has to get stored
somewhere, so we have to have capacity
for a lot of data. This is true of your e-mail
and our Tweets and all this, plus all the scientific
data that we’re starting to accumulate over time. I’ve already mentioned
various things that can happen to the companies
that own the software or created the software–
bankruptcy, for example, and sunsetting the application. Then there are these intellectual property questions; some things are under copyright for 75 years after the death of the authors. You have no right to do
things other than those granted to you by the copyright
owner until that expires. So now if we’re going
to store things away that are subject to
copyright, that we intend to share with others
because it’s supposed to be archival information. There may be a time when
we have to keep track of what the copyrights
are, and then we have to figure out when they
expire so that we can freely make use of things. Otherwise we’ll get into
some kind of a legal pissing contest– that’s a
technical term– with people who have those rights. So we have to figure out
what legal frameworks there are for exercising
archival practices. And it seems to me, just like
we have exceptions in copyrights for what’s called
fair use, that it would be important to have
an exception under copyright for archival and digital preservation purposes. There are no such laws now, and
so among the many other things we need to worry
about is creating a legal framework which
allows digital preservation to make progress. Now at Carnegie Mellon, there's a professor whose name
is Mahadev Satyanarayanan. And I’ve practiced
saying that many times, but we call him Satya
for obvious reasons. Satya figured out,
under NSF support, a way to emulate old operating
systems and old hardware. Or to emulate old hardware
to run old operating systems in a reliable way. It’s actually quite
an interesting thing. That’s probably not
very visible, but. This is not easy, to
emulate hardware correctly, because you have to get the
precise meaning of every one of the instructions
that the machine is capable of executing. You have to be able to run
your old operating system, you have to worry about
dynamic-link libraries, you have to worry about how
the I/O is supposed to work. This is really hard. And then we have
this other problem. When you emulate old hardware in virtual machines on new hardware, those virtual machines run so fast that the software runs faster than it did on the old hardware. In which case,
sometimes the programs don’t work right, especially
if it’s video games. The computer is 100 times
or 1,000 times faster than it used to be and
you’ll never win, right? So you have the
So you have the problem of trying to get this fidelity of the
emulation exactly right. And so this is sort of part
of the digital X-ray idea. What Satya did–
he said also, when you build a virtual
machine, there’s actually a fairly big hunk of code. And sometimes it might not
actually fit in the machine that you’re trying to emulate,
or needing to emulate. So he figured out that this was
a very interesting possibility. He looked at what YouTube does
when we’re playing videos; we don’t send you
the whole video and then you play it back. We send it to you
in a timely way, and we hope we get enough
of the frames and the audio there so it doesn’t break up. But we feed this stuff to you. Video is easy in some
sense because it’s serial. It’s frame by frame by frame,
you don’t have to guess. But when you’re
running a program, it hops around in
the memory space. And so you can’t always exactly
predict what piece of memory it’s going to need,
what piece of software it’s going to need next,
if you’re essentially doing paging over the network. So he’s figured out a way to
do that fairly effectively. So you can run a virtual
machine in the cloud, or at least have the bulk of the
virtual machine in the cloud, and then run the
portion of it that you need in the local
machine next to you. So this is just to emphasize the
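A crude way to picture that demand-paging idea is the sketch below: a hypothetical client keeps the bulk of a virtual machine image remote and fetches only the chunks that are actually touched, caching them locally. This is not Satya's implementation, just the flavor of the access pattern.

```python
from typing import Callable, Dict

CHUNK_SIZE = 4096  # bytes per chunk, chosen arbitrarily for the sketch

class DemandPagedImage:
    """Fetch chunks of a large remote image only when they are first accessed."""

    def __init__(self, fetch_chunk: Callable[[int], bytes]):
        self.fetch_chunk = fetch_chunk      # in practice, e.g. an HTTP range request
        self.cache: Dict[int, bytes] = {}   # locally cached chunks

    def read(self, offset: int, length: int) -> bytes:
        data = b""
        while length > 0:
            index = offset // CHUNK_SIZE
            if index not in self.cache:          # "page fault": go to the cloud
                self.cache[index] = self.fetch_chunk(index)
            within = offset % CHUNK_SIZE
            take = min(length, CHUNK_SIZE - within)
            data += self.cache[index][within:within + take]
            offset += take
            length -= take
        return data

# Stand-in for the remote store: a deterministic fake chunk server.
image = DemandPagedImage(fetch_chunk=lambda i: bytes([i % 256]) * CHUNK_SIZE)
print(len(image.read(offset=10_000, length=100)), "bytes read,",
      len(image.cache), "chunk(s) fetched")
```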
So this is just to emphasize the demand paging idea for running a virtual machine in the cloud. He's been successful doing this
for at least a dozen different operating systems. And he was demonstrating,
for example, running old DOS 3.1 programs
on an emulated machine in the cloud with all the crappy
graphics and everything else. The fidelity was very good,
it was very impressive. And of course this
is what we need if we want people to experience
what it was like to run programs 25 or 30 years ago. So you probably don’t want
to dive too deep into this, but basically what he does,
he builds a virtual machine emulator. And then on top of
the emulator, he describes what the hardware is
that it’s supposed to emulate. Plus the old operating
system and the application and everything else. The package of information
is what actually runs. Now this sort of works well
for a self-contained program like a spreadsheet. It doesn’t work quite
as well, if for example, what you’re trying to do is run
a browser from 20 years ago. And the browser is taking web
pages that have URLs in them. And when you click
on the URL, that exits the space of the
machine that you’re in. It’s outside of the
browser context. And you reach out to–
well, maybe nothing, because that may
not resolve anymore. And so this notion of
emulating the hardware and running your own operating
system only goes so far. If the system that
you’re trying to preserve is in fact the
World Wide Web, now you have many, many more
problems ahead of you in order to solve that problem,
in order to make your URLs continue to resolve somehow. And we’ll come to
that at the end. This is just how he manages
to package up the stuff that runs in the cloud. It’s all one gigantic file. And he makes these
changes between the cloud and the machine that
you’re interacting with, presenting to you the
hardware and software. He makes it all look like
ordinary web exchanges, so this is a purely
web-driven infrastructure, but it’s doing some pretty
sophisticated stuff. I’ll just skip over that. This is just another
way of showing you what this looks like. So we’re a long way from being
done solving the problem. Part of it is the
scaling problem. The virtual machines get bigger,
more complex, and more complicated, as does the hardware that we're emulating. And the precision of the
emulation is the big problem. He’s run into a whole
bunch of problems. Sometimes memory modes like
the x86 extended memory mode didn’t work very well in
one of the most popular virtual machine emulators, which is called QEMU. He's also got problems with
exotic hardware platforms. Think for a minute
about what NSF sponsors in the way of research and
the computing entities that are used in supercomputers,
for example. Imagine trying to emulate
a supercomputer environment of 20 years ago in something. It’s not even clear what
that something should be. Now he also runs into problems
where the hardware had bugs, and the software
depended on the bugs, so he has to implement
the bugs in the emulation of the hardware in order to make
the software work correctly. Having to know about
weird stuff like that is part of the challenge. The other problem is that the
people who make the hardware don’t necessarily document it
to a level of adequacy that would let you emulate it. And they’ll say, well, it’s
proprietary information; you can’t have that. And you’re sitting
there saying, but I need to know that in order to
emulate the machine well enough to run the application software
for purposes of preservation. You leave out the “you
idiot” parts like– There are really hard
problems here hiding, that are technical in nature
or sometimes legal in nature. To say nothing of
business models. So I think that if
we’re serious about and are concerned about
our scientific enterprise, it needs to be an
important topic. And we put this into
our requirements. The OSTP has told
NSF and others, if you’re supporting
research, you must have your researchers
give you a plan for preserving their data. But nobody who’s busy doing
physics research, biochemistry, or something else,
is going to be expert in this sort of thing. And yet this sort of thing
may be needed in order for their preservation
to actually work. They save their files,
they save their software, but the software
won’t run unless we have all these other
things taken care of. Now here’s a very
interesting notion. Today we have this
World Wide Web. It has the property that it
is the current collection of information that
you can get to today. This self-defining thing. The set of all sets. Does the set of all sets
contain– don’t answer that. What we have is
what’s there now. What we don’t have is
what was there before. And we don’t have
what is coming later; we just have what’s there now. This is true, by the way, in
the index that Google runs. That’s an index which is
being continually updated of what is there now. So it has nothing to do
with what used to be there, for the most part. The problem is that nobody,
except maybe Brewster Kahle at the
Internet Archive, is trying to preserve what
has been put into the web and might disappear. And so his Internet Archive
is an attempt at doing that, and he’s been doing this
since the mid-1990s. So we– Brewster and
I, and Tim Berners-Lee, and a number of other people
got together a few months ago to talk about the possibility of
creating a World Wide Web which has the property that when
you post something to the web, it’s automatically
archived somehow, whatever that turns out to mean. So that you don’t have to
do what he has been doing. In fact, think about
what Brewster does. He does what we do at Google;
he crawls through the web. He captures web pages
and he stores them away in his petabytes of memory in
his– if you've never seen it, he has a church, a building which
he acquired that used to belong to the Christian
Science Church or something. And it looked like a church
with the nice Greek columns and everything in the front. When you get down into the
place where you normally would have pews and
everything, where there might have been
icons and things, he has stacks of memories
going to the ceiling with flashing lights. So what he has to do, he grabs a
web page and he stores it away, but then he has to
go through every URL. And if he is going to pull the
pages that that page points to, not only does he have
to pull that page, but he has to change
the URL that’s pointing to it to
something that’s local inside of his memory. So that it resolves locally. Because you can’t
Because you can't guarantee that if you try to resolve on
the World Wide Web, that it will still be there. On top of which, the World
Wide Web may go away, may not ever be there after
the year 2020 or something. Or, well, hopefully it’ll
last longer than that. Maybe 2120. Let’s try 2120 or something. So his problem is to preserve
the data as he catches it. Actually it’s sort
of a manual thing. You definitely have
people type in URLs, but there’s a
process that he has to go through in addition to
whatever the page creators went through to create those pages. So our notion, and this is
purely notional at this stage of the game– I
don’t think we’ve got a detailed idea
of how to do this– is to make that whole
thing more automatic. So that when you host a page,
when you say publish this, it also goes through
some archival process. It’s important to
know at what point you should apply
the archival action, because you don’t want
everything a little changed in the web page to create
yet another archive. That’s kind of like
the multi-universe idea where every time an atom
sneezes, it splits into two and a new universe gets created. So we don’t want that, but we
want some well-defined time at which an archival
action should take place. So we learn some lessons from
the internet’s evolution, which I think apply to its
creation and the distributed archive of what we now
call the World Wide Web. One of them is that
incentives for collaboration and cooperation turned out
to be really important. And all that collaboration, like what you do at [INAUDIBLE], writ large among the research community and user community, is what has allowed the internet to evolve and progress. Everything was open. The protocols have been open,
the evolution process by which those protocols evolved. Humans created it all open. Anybody could join the process. You can’t actually join the
Internet Engineering Task Force. There's no membership. All you can do is show up. And like Woody Allen says,
80% of life is showing up. And if your ideas have
attraction to them, they will take hold. And if they don’t, they
won’t, and that’s it. It’s a pure meritocracy. Also we said in the
internet that there would be an arbitrarily
large number of networks. But we said we didn’t care
what the business model was, and so we dictated
nothing on that score. And the result is we have
parts of the internet run by the government, parts of
the internet in the private sector, some of them are for profit,
some are not for profit, some are private like
those little things we run at home in our living room. All of those models are
acceptable because the internet itself didn’t care. It doesn’t rely on any
particular business model, and that probably should be
true for an archival system. Some of them may be for profit,
some may be not for profit, some may be government. There should be
multiple archives and they should have
whatever means of sustainable support are feasible. And then of course there needs
to be a lot of modularity here, because there
is so much detail that you don’t want to have
to know everything in order to do something. And so you want to layer the
architecture of archiving in such a way that some people
can specialize in this part and other people can
not worry about that and only worry about this part. It’s a little bit like what
happens with apps on mobile. The people who write
the apps on the mobiles don’t actually know how
the mobile system works, and they don’t have to. There’s an API, an Application
Programming Interface. They know that if they can meet
the requirements of that API for data transfer, that
their applications can run. All the stuff that
goes on underneath, the internet component, the
Wi-Fi component, the LTE, 4G, 3G, and so on. All of that
infrastructure that’s hidden by that [INAUDIBLE],
and they don’t care. And that’s important that
they shouldn’t have to know. The same should be true
for the archiving system. You shouldn’t have to
know very much in order to participate in and benefit
from the archival system, or contribute to it at different
parts of its architecture.
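To make the layering point concrete, here is a hypothetical, minimal archive API sketched in Python. Someone building tools on top would only need these two calls, with no knowledge of whatever storage, replication, or emulation machinery sits underneath; the names are invented for illustration.

```python
from abc import ABC, abstractmethod

class Archive(ABC):
    """The only surface a content creator's tools would need to know about."""

    @abstractmethod
    def deposit(self, identifier: str, payload: bytes, metadata: dict) -> None:
        """Store an object plus its descriptive metadata."""

    @abstractmethod
    def retrieve(self, identifier: str) -> tuple:
        """Return (payload, metadata) for a previously deposited object."""

class InMemoryArchive(Archive):
    """One possible implementation; callers never see this detail."""

    def __init__(self):
        self._store = {}

    def deposit(self, identifier, payload, metadata):
        self._store[identifier] = (payload, metadata)

    def retrieve(self, identifier):
        return self._store[identifier]

archive: Archive = InMemoryArchive()
archive.deposit("obj-1", b"some bits", {"units": "n/a", "creator": "example"})
print(archive.retrieve("obj-1")[1])
```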
So I'm just thinking out loud, now, about the kinds of things that we will want to have in our tool kit for inventing a self-archiving way. One of them is to
have the ability to compress the data to reduce
the amount of storage that’s required. Things like Tarball
and formats like that have been used for
a long time, just to be efficient about
storing information. We certainly want to recover–
what did I mean by that? That’s what happens
when you think out loud; you begin to wonder
what you meant. So what did I mean by that? Oh, I know. The question had to do with
what are the things that get stored in the archives? And you have to be able to draw
a boundary around something in order to label it, identify
it, and find it again. So that means that you have
to store things that you have identified as a digital object. So it can’t be just
blowing bits into the ether and hoping that somehow
they’ll be recoverable. You literally have
to be disciplined about identifying something in
particular to get stored away. There is this interesting
issue about when you should store intermediate
versions of things. And this is a choice
that we get to make. It's not dictated to us. But we have to be
thoughtful about how many intermediate steps
we’re willing to remember, and codify as separate
objects, new objects. Sometimes you can escape
this a little bit. In the case of Google
Docs, for example, because of the way in which the
document is constructed, we remember changes
to the document. And so as the document evolves,
the corpus of the document remembers its history. So we don’t have to say version
one, version two, version three, version four. We just keep the
object, and as long as we’re archiving
it as an object and we update the archive
of that current object, we have preserved its
history automatically.
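The underlying idea, storing a document as its history of changes rather than as numbered copies, can be sketched in a few lines. This is only a toy model of history-preserving storage, not how Google Docs is actually implemented.

```python
from typing import List, Optional, Tuple

class HistoryDoc:
    """A document stored as an append-only log of edits, not as separate copies."""

    def __init__(self):
        self.log: List[Tuple[int, str]] = []   # (position, inserted text)

    def insert(self, position: int, text: str) -> None:
        self.log.append((position, text))      # archiving this log preserves history

    def render(self, upto: Optional[int] = None) -> str:
        """Replay the first `upto` edits to reconstruct any historical state."""
        doc = ""
        for position, text in self.log[:upto]:
            doc = doc[:position] + text + doc[position:]
        return doc

d = HistoryDoc()
d.insert(0, "Hello world")
d.insert(5, ", digital")
print(d.render(upto=1))   # the document as it stood after the first edit
print(d.render())         # the current document, with its history still in the log
```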
And so this is a way of avoiding this question of having 10,000 copies of something. But it's important to
be able to preserve the historical
evolution of things. Think about what it’s like
if you’re are in NARA, the National Archives. You want very much to be able
to look back and understand the evolution of various
and sundry documents. If we go all the way
back to our origins of the country in the 1770s, one
of the most interesting things is to see different
versions of the Declaration of Independence. There was more than one version. Part of it, a consequence of not
having all the states agreeing at the same time. They weren’t all in the
same room suddenly declaring everything. When the versions were
printed and distributed, some states had not yet agreed. And New York might have
been one of the holdouts. And so the language
of the Declaration actually changes depending
on which instance of the Declaration
you get your hands on. The language changes
to finally, ultimately, refer to all the states,
not just some of them. And so in that
particular case, we had physical instances
of the version. In the digital
case, we might not have to have physically
different instances. If some of you are familiar
with the way programmers work, software of course is never
done, and it always changes. Especially if you find
bugs that have to be fixed. There are aids to help you
keep track of which version of the software am
I working with now, how many different people are
playing around with the code, is it only one person at
a time that can edit it. You check it out, you
check it back in again. So creating a framework in
which this archival process can be disciplined is
really important. And I’ve already
alluded to the fact that the World Wide Web has
a fragility and brittleness to it. So Brewster time
indexes his snapshots. So he has a thing called
the Wayback Machine, and you can actually search
his archives with time as one of the elements. So you can look at, what did
the Web look like in 1997? Of course he doesn’t
have all of the Web, but he has portions
of the web from 1997; you get to look at that. And you can click
on the web page and see what it looked like. Of course this is
kind of interesting, because the underlying
HTML has evolved over time from the original HTML format,
to XML, to HTML5, and probably some other intermediaries. His software has
to present to you, to the best of its ability,
what that web page actually looked like under the
interpretation that was intended. So we’ve already touched
on a lot of these topics. So I think we can
actually finish that. So we have this problem that
the Web is bigger than itself in some sense. If you imagine that
all we have of the web is what is currently stored, then storing the history of the web means you need more space than the current web itself requires. It's already hard enough
to store the current web, let alone the past web. So we have a lot of
work to do to improve the quantity of storage
we have available, it’s efficiency,
its compactness, the power required,
and everything else. The backward
compatibility problem, I’ve already alluded to. And of course, all
these questions about what are we
actually legally permitted to do
with the information that we’ve captured off the web. And some of the
content of the web is supposed to be pay
wall blocked, for example. Well, are we allowed
to archive that without having to pay to
get behind the pay wall? Under what terms and
conditions would someone be permitted to do that? And what obligations would
they incur as a result, for how long? Having beautifully
archived the information, at what point are they allowed
to make it available to people freely? Nobody has that
worked out right now. So those issues still
have to be resolved. I like the idea of doing
this stuff as automatically as possible, because most
people who create content don’t really have the
time, energy, wherewithal, or knowledge to figure
out how to archive it. So we would be doing
everybody a big favor. Everyone who creates
content, regardless of whether it’s scientific
content, or entertainment, or anything else. We would do them the big favor
if we create an automated way for them to preserve
their content over time. Now some people
will tell you, well, most of the stuff on the
web isn’t worth saving. And I won’t dispute that. It’s sort of like, look at
all the blogs there are, how many people read the blogs. The average is 1.4 people,
the creator of the blog and the dog. So I’m not arguing that
we should save everything, but I am arguing that if you
did want to save something, there should be
a way to do that. So we should have the
technical capability, and the legal and business
framework in place so if you thought it
was worth archiving, you have the ability to do that. So I’ve been asking
people at Google, is there a role for– one of
the interesting properties of Google Docs, if
you’ve ever used them. One of the interesting
properties of the design is that multiple people
can be not only looking at the document
at the same time, regardless of where they
are on the internet, but they can be editing a
document at the same time. So there are multiple
copies of the document, and those multiple copies are
kept in sync in real time. Frankly I’m astonished that the
engineers were able to do that. It’s really impressive. But the interesting
thing about it is that it means multiple
copies were replicated and kept in sync in real time, which
is what you would want if you had a distributed archive. You’d want things to stay in
sync as quickly as possible. The methods that we use to
achieve that in a Google Doc case might also be applicable
to a distributed archive. It’s also very important
that we find a labeling and identification state
that’s stable over long periods of time, which means if we
identify an object that we want to store away and
we give it a label, the label should not be
assigned to anything other than that object,
and the label should be resolvable over
long periods of time. There are all kinds
of implications of that, sort of like
domain name system having to be around forever and
ever, amen, to resolve a URL. We need this other, whatever it
is– label and identification states to be preserved
over long periods of time so resolution will work. One of the things that’s
interesting about this is that inside of the Google
Cloud, the files that we keep are not actually kept
in a unitary form. They are broken up into pieces. The pieces are scattered
across multiple data centers, so even if you lose
a whole data center, the document is still there
in almost a holographic form. And the identification
part allows you to reassemble
the pieces when that document is referenced.
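A toy version of that chunk-and-scatter idea looks like the sketch below: split an object into pieces, replicate the pieces across several stores, and keep a manifest under the object's identifier so it can be reassembled even if a whole store is lost. This is an illustration only, not Google's actual storage system.

```python
import hashlib
from typing import Dict, List

CHUNK = 8  # tiny chunk size so the example is easy to follow

def store(data: bytes, centers: List[Dict[str, bytes]]) -> List[str]:
    """Split data into chunks, replicate each chunk in every center,
    and return the manifest (ordered list of chunk ids)."""
    manifest = []
    for i in range(0, len(data), CHUNK):
        piece = data[i:i + CHUNK]
        chunk_id = hashlib.sha256(piece).hexdigest()
        for center in centers:          # replication across pretend data centers
            center[chunk_id] = piece
        manifest.append(chunk_id)
    return manifest

def retrieve(manifest: List[str], centers: List[Dict[str, bytes]]) -> bytes:
    """Reassemble the object from whichever centers still hold each chunk."""
    out = b""
    for chunk_id in manifest:
        for center in centers:
            if chunk_id in center:
                out += center[chunk_id]
                break
    return out

centers = [{}, {}, {}]                      # three pretend data centers
manifest = store(b"a document worth keeping", centers)
centers[0].clear()                          # lose an entire data center
print(retrieve(manifest, centers))          # the document survives
```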
So we have bits and pieces of the elements of the digital
archive at Google. I’m not suggesting that
Google is the exclusive agent through which this
technology has been developed or could be developed;
it’s just that it has something to contribute. As I hope others do. Some of you are familiar with
the publish and subscribe notion. It’s a very popular
idea, and I think that’s part of this
archival thing. You should be able to assert
that I’m publishing this, and by the way, by implication,
I want to archive it. And if somebody else wants
access to that content, they should be able
to subscribe to it. So there’s some Pub/Sub
component to all of this.
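In code, the flavor of that is simply that the archive is one more subscriber to everything that gets published. A minimal sketch, not any particular Pub/Sub product:

```python
from typing import Callable, Dict, List

class Bus:
    """A tiny publish/subscribe bus where the archive subscribes to everything."""

    def __init__(self):
        self.subscribers: Dict[str, List[Callable[[str, dict], None]]] = {}

    def subscribe(self, topic: str, handler: Callable[[str, dict], None]) -> None:
        self.subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic: str, obj: dict) -> None:
        for handler in self.subscribers.get(topic, []):
            handler(topic, obj)

archive = []
bus = Bus()
bus.subscribe("web/pages", lambda topic, obj: archive.append(obj))  # archival side effect
bus.subscribe("web/pages", lambda topic, obj: print("reader got", obj["id"]))

bus.publish("web/pages", {"id": "page-1", "body": "<html>...</html>"})
print("archived:", len(archive), "object(s)")
```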
What about all the metadata? The tools that allow us to create content need to also understand the meaning of metadata, the need for it, and the ability to capture it and store it away with the documents
or with the objects. And if we don’t
do that, of course then we’ll have a big
problem, because we’ll have a bag of bits. It has a label and
we can find it again, but otherwise it’s not
clear exactly what it is. So the metadata is necessary. This rendering and
interpretation thing is important, too,
because once you find a digital object
that’s been archived and you want to do
something with it, and you’re out here at
the edge of the system. When that object
shows up, you need to have the software
to correctly render it if it’s a static thing, or
it’s video, or something else. Or it may actually
require interpretation. So the act of retrieval has to
recover the context in which the object can be interpreted. And I’m not even going to
take time on the permission-to-use question. You know, who can use what
and under what circumstances? We need vocabulary
for that, and we need mechanism for
registering what rights you’re willing to give
and under what conditions, and for what period
of time, for all these various digital objects. Again, as much of
this as possible needs to be automatic
because a lot of people won’t take the time
to do that themselves. So here’s an
interesting problem. If I am looking–
this is surfing the self-archiving web– if
I’m looking for an object and I find a reference to it. OK, so I’ve got the
handle, so to speak. The question is where do I
go, where does it resolve to? And it should be the case
that there is more than one useful resolution. I don’t want it to only be one
place that holds this object. So no matter which
resolution I get, I should still end up with the
object that I’m looking for. This is a snapshot
question: when do we take a snapshot of something and say it's an archivable moment? And I think we really, really
need to work hard on that. The publishing of magazines, the publishing of newspapers, has the notion of an edition in it. And that helps us draw a box around something and say, this is an object that contains the content we want to archive away as this edition. When we have this rather fluid web environment, it's not clear what an edition is. So we may have to invent
new concepts in order to control the rate at which
snapshotting takes place. So automatic archival
upon asserting that you want to publish. If this is a distributed World
Wide Web Archival System, how do you sign up for it? And who pays for it? What about the rendering
engine, and also the executable software? Do you register that somewhere
to say, here is my product. This is useful for
editing photographs, and I want other people to
have access to it, but there are terms and
conditions associated with it. Either you are
able to get access to a copy of the software and
you can download it and somehow run it in an
emulated environment, or maybe you have to
run it in the cloud. I don’t know whether
you’ve noticed it but the people who
used to make software, that would sell you a
copy of the software, don’t want to do that anymore. They would like to sell
you the rights to use, the license to use,
but the software is running in the
cloud somewhere. You don’t actually get it. You never do; all you do is
keep paying fees to use it. And they like that
business model better, because having to keep coming up
with a new version of software that you're willing to pay for
is less attractive than, you can’t use this at all unless
you pay me for the next month’s rent. So there’s that. Now what about other problems? Suppose you’re– none of
you would do this, I’m sure. But suppose that
you thought, boy, wouldn’t it be fun
to archive away really malicious viruses, worms,
Trojan horses, and other stuff. And induce people
to click on a link. You could even– some
people, if you said, do not click on this link. This is highly
malevolent virus malware. Four out of 10 people would
click on the link, you know. It’s sort of like wet paint. I wonder if it’s still wet? Right, OK. So I don’t know what
to do about this. If you have this distributed
archival system and somebody wants to archive away a virus–
which you might want to do. You might want to
capture what we’ve discovered about the
dark side of the web, but how do you filter this,
how do you protect people from accidentally
ingesting malware? And then there’s a
question of how good is the archival process? When I retrieve the
object, what do I get? One possibility is,
you just get to see what the object produced. Sort of a very two dimensional,
here’s what it looks like. Or you get something
where everything works and you can interact
with it, play a game, run any spreadsheet,
download new things. So there’s a question
about saying something about what the fidelity of
the emulation and recovery is, and we’ll have to have
new vocabulary for that. So suppose that the
archive is really, really good about saying, this
page has been stored away. It’s part of the
archive and it can’t be changed without visible damage. Digital signatures and
digitally signed hashes might allow you
to say that any alteration of this page will be visible, because we can catch it.
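Here is a minimal sketch of that idea using a keyed hash from the Python standard library. A real archive would more likely use public-key digital signatures, but the tamper-evidence property is the same.

```python
import hashlib
import hmac

SECRET = b"archive-signing-key"   # stand-in for a real signing key

def seal(page: bytes) -> str:
    """Record a keyed hash of the page at archive time."""
    return hmac.new(SECRET, page, hashlib.sha256).hexdigest()

def unchanged(page: bytes, seal_value: str) -> bool:
    """Later, verify that the page still matches what was sealed."""
    return hmac.compare_digest(seal(page), seal_value)

original = b"<html>the contract as agreed</html>"
receipt = seal(original)

print(unchanged(original, receipt))                               # True
print(unchanged(b"<html>the contract, altered</html>", receipt))  # False
```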
Well, if you actually put in some kind of a mechanism like this in the
archival system, you might imagine
that that would be very useful in a court of
law because you could say, this object was stored
at such and such a time. It was a contract between
these two parties. They both agreed to it. I have mechanisms to
prevent either of them, or some third party, from
altering the contract without it becoming clear
that that’s what’s happened. And that now becomes
an interesting piece of legal evidence. I don’t know that we
should deliberately try to create that
feature in the archive, but the side effect of some
of the things that you might want the archives
to do might also produce that effect as well. That might be quite useful. There’s also this question about
the act of archiving something being a requirement for
official record keeping. And those of us who either
work in, or have worked in, the government know that
there are requirements for record keeping. And the archive, if
properly constructed, could satisfy that
particular requirement. And finally, what about
the problem of access control in a timed way? So sometimes people will
put information away, like in the
Presidential Archives, and say, you can’t have
access to this for 50 years, or after a certain
event has occurred. The archive might also
try to observe that set of requirements as well. Some of you will
know that there are concepts that are
starting to show up in the cloud environment. One of them is
called containers, and it’s the way–
essentially, it’s trying to take a
body of software and make it work in
any cloud-based system. The clouds don’t all work
exactly the same way, but a container is intended
to allow you to move things from one cloud to another. That would be a very good
property in the archival thing. So that you could have
multiple archives, and the containerized
software that’s needed could run on all of them. And so that might turn out to
be a very useful mechanism as well. And I think this
is the last slide. I’ve already mentioned
the Internet Archive that Brewster Kahle
has been running. The Library of Alexandria–
the one in Alexandria, Egypt– is one of the backup sites
for the Internet Archive. And I think there’s also one
in– I want to say Japan. Tokyo or Singapore. One of the two. The Computer History Museum
out on the west coast, in Mountain View, also has been
archiving things in addition to just digital content. They’ve been archiving software
as well, and so has Brewster. Although the archiving
of the software has mostly been preserving
the bits of the software, not necessarily an
environment in which they could execute. Google, of course, has
been doing book scans. And the Cultural Institute has
been capturing digital images from museums and cultural
areas for a long time. And finally there’s
Bob Conn’s work, what’s called the Digital
Object Architecture. And he’s created an identifier
space he calls digital object identifiers. And those of you who read
scientific publications will often see DOI, colon,
and then some string. That’s actually a digital
object identifier, which if you plug
into a browser, will resolve into a URL. Which eventually
gets to the target if the target’s still around. The way it works right
now is the inverse of what is needed because
the URLs themselves are not necessarily stable,
but the DOIs could be if we could map
directly to the target.
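The indirection being described can be sketched like this: a registry maps a persistent identifier to whatever locations currently hold the object, and only the registry entry changes when the object moves. This is an illustrative toy, not the actual Handle System that sits behind DOIs.

```python
from typing import Dict, List

class Resolver:
    """Map stable identifiers to (possibly several) current locations."""

    def __init__(self):
        self.registry: Dict[str, List[str]] = {}

    def register(self, identifier: str, locations: List[str]) -> None:
        self.registry[identifier] = list(locations)

    def move(self, identifier: str, new_locations: List[str]) -> None:
        # The identifier in every citation stays the same; only this entry changes.
        self.registry[identifier] = list(new_locations)

    def resolve(self, identifier: str) -> List[str]:
        return self.registry.get(identifier, [])

r = Resolver()
r.register("doi:10.9999/example.123",
           ["https://old-host.example/paper.pdf"])
r.move("doi:10.9999/example.123",
       ["https://archive-a.example/paper.pdf",
        "https://archive-b.example/paper.pdf"])
print(r.resolve("doi:10.9999/example.123"))
```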
hour’s worth of on and on, and I think that was the last–
