WEBVTT

00:00.000 --> 00:08.800
Hi, I'm Laura Welcher from the Long Now Foundation.

00:08.800 --> 00:13.720
I direct a project called the Rosetta Project, which is building an archive of information

00:13.720 --> 00:15.920
about all of the world's languages.

00:15.920 --> 00:19.000
And I'm actually up here to present a poster to you tonight.

00:19.000 --> 00:21.320
But it turns out that I was the only one who brought a poster.

00:21.320 --> 00:24.960
So rather than stand up here with my big piece of paper and really small print, I thought

00:24.960 --> 00:28.080
I would kind of wing it and show you what we have online.

00:28.080 --> 00:30.920
And we have a fair amount of stuff online.

00:30.920 --> 00:35.680
So the Rosetta Project has really two big parts to it.

00:35.680 --> 00:42.160
We're building this archive, this large collection of information on all of the world's languages.

00:42.160 --> 00:47.280
And there are about 7,000 of those.

00:47.280 --> 00:52.600
But what most people see, what people know the Rosetta Project for is this millennial

00:52.600 --> 00:54.240
Rosetta Disk that we're building.

00:54.240 --> 01:00.440
So this is like a backup version that can last for thousands of years.

01:00.440 --> 01:07.120
So the web presence we've always had has these two kind of sides to it.

01:07.120 --> 01:11.280
If you go to our website today, one of the main features we have on our landing page

01:11.280 --> 01:18.120
is this interactive version of the disk.

01:18.120 --> 01:24.560
And so we've actually built this prototype of this very long-term archive.

01:24.560 --> 01:30.000
But behind this is actually one of the largest collections of information on human language

01:30.000 --> 01:34.640
that's available publicly on the net.

01:34.640 --> 01:40.600
And while there are other archives out there on human language, they're actually organized

01:40.600 --> 01:42.960
quite a bit differently than ours is.

01:42.960 --> 01:47.320
We're one of only about three sites that actually has information on all of the world's languages.

01:47.320 --> 01:53.040
Most archives are much more specialized, having to do with a small group of languages or individual

01:53.040 --> 01:54.040
languages.

01:54.040 --> 01:59.360
And we're also really different in that most language archives, because they have sensitive

01:59.360 --> 02:04.880
cultural information, have very complicated systems of permissions that you have to navigate

02:04.880 --> 02:08.000
in order to be able to access the collections.

02:08.000 --> 02:09.960
We've actually sidestepped that.

02:09.960 --> 02:15.320
And everything in our collection is open and publicly available, which makes us pretty

02:15.320 --> 02:21.400
different in the landscape of these other archival entities.

02:21.400 --> 02:25.440
But I think one of the reasons that I was asked to come and speak at this particular

02:25.440 --> 02:30.480
workshop, even though this isn't specifically about personal archiving, is that we have to wrestle

02:30.480 --> 02:33.080
with some of the same kinds of issues.

02:33.080 --> 02:38.960
And one of them is limited sets of resources in order to be able to maintain this digital

02:38.960 --> 02:39.960
collection.

02:39.960 --> 02:45.360
We're a nonprofit, and we work on government and private grants, and the maintenance and

02:45.360 --> 02:51.320
development of large web collections can actually be quite expensive.

02:51.320 --> 02:53.360
And we used to try to maintain it all in-house.

02:53.360 --> 02:57.640
We used to try to build our own archival back end and manage all of our metadata.

02:57.640 --> 03:00.400
And it was just, it was very complicated.

03:00.400 --> 03:05.880
We never really built it to a level that I was particularly satisfied with.

03:05.880 --> 03:09.880
And so a couple of years ago, we decided to actually scrap the whole thing.

03:09.880 --> 03:14.840
We'd actually built an entire website out in a content management system and had this

03:14.840 --> 03:16.080
very complex structure.

03:16.080 --> 03:17.440
We decided this was not working.

03:17.440 --> 03:19.560
We're not going to be able to maintain it.

03:19.560 --> 03:24.280
So we basically tossed everything out and then said, OK, now what are we going to do?

03:24.280 --> 03:32.440
So we took all of the content in our collection, which amounts to probably over 100,000 scanned

03:32.440 --> 03:38.440
pages, at the time, of language documentation, which isn't a lot in the era of Google scanning,

03:38.440 --> 03:44.600
but it's all highly vetted and it's all connected to individual human languages that we've actually

03:44.600 --> 03:46.920
rectified.

03:46.920 --> 03:53.840
So we moved all of our content over into this wonderful resource called the Internet Archive,

03:53.840 --> 03:59.760
where we have a special collection.

03:59.760 --> 04:03.840
And so here you can see all of the items that are in our collection that reside here in

04:03.840 --> 04:05.560
the Internet Archive.

04:05.560 --> 04:10.040
What you may notice is that this is just one big, huge list.

04:10.040 --> 04:13.480
What you don't see, this is kind of the user's perspective when they kind of land on our

04:13.480 --> 04:15.760
page and they want to browse.

04:15.760 --> 04:17.900
It's not really intended for that.

04:17.900 --> 04:24.200
This is actually where we manage all of the collection and we provide, we enrich the metadata.

04:24.200 --> 04:28.320
But if you wanted to find something, it is actually quite difficult.

04:28.320 --> 04:31.840
You would need to know what the language code is or the language name, and people don't

04:31.840 --> 04:37.080
know, there's different ways to spell language names and a lot of people don't even know

04:37.080 --> 04:41.040
what a particular language name is. After all, there's 7,000 of them.

04:41.040 --> 04:47.520
So this isn't a particularly good way to browse the collection, but it's not really intended

04:47.520 --> 04:51.880
to be.

04:51.880 --> 04:58.160
What does work is if you actually build your interfaces outside of the Internet Archive.

04:58.160 --> 05:02.520
So use the Internet Archive as your long-term repository and build the structure for your

05:02.520 --> 05:05.120
interface somewhere else.

05:05.120 --> 05:12.640
So what we did is we used a free cloud service called Freebase, which is an open contribution

05:12.640 --> 05:13.640
database.

05:13.640 --> 05:19.880
And the idea behind Freebase is to structure the semantic web, but not from the top down,

05:19.880 --> 05:22.640
but from the bottom up through user contribution.

05:22.640 --> 05:28.000
So users contribute data sets that they think are interesting and they want to develop and

05:28.000 --> 05:31.480
then you can link them to all of the other data sets in Freebase.

05:31.480 --> 05:38.360
So we took all of our metadata about language, so there's a large taxonomy of relationships

05:38.360 --> 05:44.640
between languages, kind of like you have species taxonomies, and other kinds of information

05:44.640 --> 05:49.680
like the kinds of documents in our collection, and we populated that all in Freebase.

05:49.680 --> 05:53.840
And so you see here this is the Rosetta language base inside of Freebase, but it's linked to

05:53.840 --> 05:58.240
all of the other data that resides in Freebase.

05:58.240 --> 06:05.360
And this topic area called Languoid has 10,000 topics, so those are all of the languages

06:05.360 --> 06:11.080
and they're actually linked all to each other in terms of language relationships.

06:11.080 --> 06:20.520
And we used this data set, this metadata about language, to push a static wiki of the world's

06:20.520 --> 06:27.040
languages with one page for every human language and links between them.

06:27.040 --> 06:34.440
So to give you an example, this is a page that's actually a fairly well populated page

06:34.440 --> 06:36.680
for French.

06:36.680 --> 06:41.320
And all of the information that you see down here in the classification taxonomy allows

06:41.320 --> 06:45.540
you to navigate to all of the other language pages.

06:45.540 --> 06:50.080
This is for the Romance language family in Indo-European, but you can get to any other

06:50.080 --> 06:52.360
language family.

06:52.360 --> 06:57.420
All of the data up in here is data that's coming from elsewhere in Freebase that we're

06:57.420 --> 06:59.160
linking to.

06:59.160 --> 07:04.520
All of the data up in here are descriptions that we pulled in from Wikipedia.

07:04.520 --> 07:11.200
We actually rectified about 400 pages, no, I'm sorry, probably about 1,000 pages on human

07:11.200 --> 07:16.160
language in Wikipedia to our collection, so we were able to draw those in.

07:16.160 --> 07:21.880
Here's a map that's contributed from another project called the LL-MAP project, which is

07:21.880 --> 07:26.000
completely external to our project, so they allowed us to use that URL.

07:26.000 --> 07:32.680
And then all of the documents down here are linking to items that we have in the Internet

07:32.680 --> 07:34.720
Archive.

07:34.720 --> 07:49.000
So what this represents is actually a distributed system where you have the content in one place,

07:49.000 --> 07:53.800
you have kind of the data structure in another place, and then you have another place where

07:53.800 --> 07:58.560
you're building out an interface, or maybe you have multiple interfaces.

07:58.560 --> 08:03.600
And I've actually been talking to my other colleagues who are building linguistic archives

08:03.600 --> 08:10.840
about this, because I think it's a pretty robust model for maintaining your archive

08:10.840 --> 08:14.560
in the long run, if any of the pieces look precarious.

08:14.560 --> 08:22.120
So Freebase was acquired by Google last year, and they have a commitment to keep that data

08:22.120 --> 08:23.120
set open.

08:23.120 --> 08:26.840
But if something happened in the future and we needed to move the data around, as long

08:26.840 --> 08:31.800
as you can get your data in and get your data out, then you're in, you should be in fairly

08:31.800 --> 08:34.720
good shape.

08:34.720 --> 08:41.960
So what I think we've built is something that's fairly recession-proof, too.

08:41.960 --> 08:47.880
We're relying all on cloud services, so there's no service here that we actually have to pay

08:47.880 --> 08:49.740
for and maintain ourselves.

08:49.740 --> 08:58.120
So we are able to piggyback on the resources of many other generous organizations.

08:58.120 --> 09:04.760
And where it looks like this is going in the future, there's a couple of new directions.

09:04.760 --> 09:12.280
One is, because we've kind of built out this wiki interface, one of the things we're

09:12.280 --> 09:18.120
potentially interested in doing is elaborating that as a wiki of information on all human

09:18.120 --> 09:20.080
languages.

09:20.080 --> 09:26.620
And so we did a presentation at the Wikimedia Conference last year about this, and that's

09:26.620 --> 09:28.800
one possibility down the road.

09:28.800 --> 09:33.560
I think another great possibility for it would be, I've been admiring the Encyclopedia of

09:33.560 --> 09:38.000
Life, and there's much about the Encyclopedia of Life that could be ported over into the

09:38.000 --> 09:40.960
domain of human language.

09:40.960 --> 09:49.040
But one nearer future for this distributed archive model is a project called the Language

09:49.040 --> 09:56.440
Commons, where Rosetta is one of the organizations that's leading this particular effort.

09:56.440 --> 10:04.120
And the goal is to build a site for all human languages that works to promote open public

10:04.120 --> 10:09.760
data sets for all of the languages and to better enable their online future.

10:09.760 --> 10:13.800
And the Language Commons backend is actually also here at the Internet Archive.

10:13.800 --> 10:18.120
And so what may happen in the near future is kind of a merger of the Language Commons

10:18.120 --> 10:23.920
and the Rosetta Project backend to become the future of the Language Commons proper,

10:23.920 --> 10:31.080
which would be a site for information on all the world's languages.

10:31.080 --> 10:32.080
Any questions?

10:32.080 --> 10:54.200
Thank you.