WEBVTT 00:00.000 --> 00:08.800 Hi, I'm Laura Welcher from the Long Now Foundation. 00:08.800 --> 00:13.720 I direct a project called the Rosetta Project, which is building an archive of information 00:13.720 --> 00:15.920 about all of the world's languages. 00:15.920 --> 00:19.000 And I'm actually up here to present a poster to you tonight. 00:19.000 --> 00:21.320 But it turns out that I was the only one who brought a poster. 00:21.320 --> 00:24.960 So rather than stand up here with my big piece of paper in a really small print, I thought 00:24.960 --> 00:28.080 I would kind of wing it and show you what we have online. 00:28.080 --> 00:30.920 And we have a fair amount of stuff online. 00:30.920 --> 00:35.680 So the Rosetta Project has really two big parts to it. 00:35.680 --> 00:42.160 We're building this archive, this large collection of information on all of the world's languages. 00:42.160 --> 00:47.280 And there are about 7,000 of those. 00:47.280 --> 00:52.600 But what most people see, what people know the Rosetta Project for is this millennial 00:52.600 --> 00:54.240 Rosetta disk that we're building. 00:54.240 --> 01:00.440 So this is like a backup version that can last for thousands of years. 01:00.440 --> 01:07.120 So the web presence we've always had has these two kind of sides to it. 01:07.120 --> 01:11.280 If you go to our website today, one of the main features we have on our landing page 01:11.280 --> 01:18.120 is this interactive version of the disk. 01:18.120 --> 01:24.560 And so we've actually built this prototype of this very long-term archive. 01:24.560 --> 01:30.000 But behind this is actually one of the largest collections of information on human language 01:30.000 --> 01:34.640 that's available publicly on the net. 01:34.640 --> 01:40.600 And while there are other archives out there on human language, they're actually organized 01:40.600 --> 01:42.960 quite a bit differently than ours is. 
01:42.960 --> 01:47.320 We're one of only about three sites that actually has information on all of the world's languages. 01:47.320 --> 01:53.040 Most archives are much more specialized, having to do with a small group of languages or individual 01:53.040 --> 01:54.040 languages. 01:54.040 --> 01:59.360 And we're also really different in that most language archives, because they have sensitive 01:59.360 --> 02:04.880 cultural information, they have very complicated systems of permissions that you have to navigate 02:04.880 --> 02:08.000 in order to be able to access the collections. 02:08.000 --> 02:09.960 We've actually sidestepped that. 02:09.960 --> 02:15.320 And everything in our collection is open and publicly available, which makes us pretty 02:15.320 --> 02:21.400 different in the landscape of these other archival entities. 02:21.400 --> 02:25.440 But I think one of the reasons that I was asked to come and speak at this particular 02:25.440 --> 02:30.480 workshop, because this isn't specifically about personal archiving, but we have to wrestle 02:30.480 --> 02:33.080 with some of the same kinds of issues. 02:33.080 --> 02:38.960 And one of them is limited sets of resources in order to be able to maintain this digital 02:38.960 --> 02:39.960 collection. 02:39.960 --> 02:45.360 We're a nonprofit, and we work on government and private grants, and the maintenance and 02:45.360 --> 02:51.320 development of large web collections can actually be quite expensive. 02:51.320 --> 02:53.360 And we used to try to maintain it all in-house. 02:53.360 --> 02:57.640 We used to try to build our own archival back end and manage all of our metadata. 02:57.640 --> 03:00.400 And it was just, it was very complicated. 03:00.400 --> 03:05.880 We never really built it to a level that I was particularly satisfied with. 03:05.880 --> 03:09.880 And so a couple of years ago, we decided to actually scrap the whole thing. 
03:09.880 --> 03:14.840 We'd actually built an entire website out in a content management system and had this 03:14.840 --> 03:16.080 very complex structure. 03:16.080 --> 03:17.440 We decided this was not working. 03:17.440 --> 03:19.560 We're not going to be able to maintain it. 03:19.560 --> 03:24.280 So we basically tossed everything out and then said, OK, now what are we going to do? 03:24.280 --> 03:32.440 So we took all of the content in our collection, which amounts to probably over 100,000 scanned 03:32.440 --> 03:38.440 pages at the time of language documentation, which isn't a lot in the era of Google scanning, 03:38.440 --> 03:44.600 but it's all highly vetted and it's all connected to individual human languages that we've actually 03:44.600 --> 03:46.920 rectified. 03:46.920 --> 03:53.840 So we moved all of our content over into this wonderful resource called the Internet Archive, 03:53.840 --> 03:59.760 where we have a special collection. 03:59.760 --> 04:03.840 And so here you can see all of the items that are in our collection that reside here in 04:03.840 --> 04:05.560 the Internet Archive. 04:05.560 --> 04:10.040 What you may notice is that this is just one big, huge list. 04:10.040 --> 04:13.480 What you don't see, this is kind of the user's perspective when they kind of land on our 04:13.480 --> 04:15.760 page and they want to browse. 04:15.760 --> 04:17.900 It's not really intended for that. 04:17.900 --> 04:24.200 This is actually where we manage all of the collection and we provide, we enrich the metadata. 04:24.200 --> 04:28.320 But if you wanted to find something, it is actually quite difficult. 04:28.320 --> 04:31.840 You would need to know what the language code is or the language name, and people don't 04:31.840 --> 04:37.080 know, there's different ways to spell language names and a lot of people don't even know 04:37.080 --> 04:41.040 what a particular language name is, after all, there's 7,000 of them. 
04:41.040 --> 04:47.520 So this isn't a particularly good way to browse the collection, but it's not really intended 04:47.520 --> 04:51.880 to be. 04:51.880 --> 04:58.160 What does work is if you actually build your interfaces outside of the Internet Archive. 04:58.160 --> 05:02.520 So use the Internet Archive as your long-term repository and build the structure for your 05:02.520 --> 05:05.120 interface somewhere else. 05:05.120 --> 05:12.640 So what we did is we used a free cloud service called Freebase, which is an open contribution 05:12.640 --> 05:13.640 database. 05:13.640 --> 05:19.880 And the idea behind Freebase is to structure the semantic web, but not from the top down, 05:19.880 --> 05:22.640 but from the bottom up through user contribution. 05:22.640 --> 05:28.000 So users contribute data sets that they think are interesting and they want to develop and 05:28.000 --> 05:31.480 then you can link them to all of the other data sets in Freebase. 05:31.480 --> 05:38.360 So we took all of our metadata about language, so there's a large taxonomy of relationships 05:38.360 --> 05:44.640 between languages, kind of like you have species taxonomies, and other kinds of information 05:44.640 --> 05:49.680 like the kinds of documents in our collection, and we populated that all in Freebase. 05:49.680 --> 05:53.840 And so you see here this is the Rosetta language base inside of Freebase, but it's linked to 05:53.840 --> 05:58.240 all of the other data that resides in Freebase. 05:58.240 --> 06:05.360 And this topic area called Languoid has 10,000 topics, so those are all of the languages 06:05.360 --> 06:11.080 and they're actually linked all to each other in terms of language relationships. 06:11.080 --> 06:20.520 And we used this data set, this metadata about language, to push a static wiki of the world's 06:20.520 --> 06:27.040 languages with one page for every human language and links between them. 
06:27.040 --> 06:34.440 So to give you an example, this is a page that's actually a fairly well populated page 06:34.440 --> 06:36.680 for French. 06:36.680 --> 06:41.320 And all of the information that you see down here in the classification taxonomy allows 06:41.320 --> 06:45.540 you to navigate to all of the other language pages. 06:45.540 --> 06:50.080 This is for the Romance language family in Indo-European, but you can get to any other 06:50.080 --> 06:52.360 language family. 06:52.360 --> 06:57.420 All of the data up in here is data that's coming from elsewhere in Freebase that we're 06:57.420 --> 06:59.160 linking to. 06:59.160 --> 07:04.520 All of the data up in here are descriptions that we pulled in from Wikipedia. 07:04.520 --> 07:11.200 We actually rectified about 400 pages, no, I'm sorry, probably about 1,000 pages on human 07:11.200 --> 07:16.160 language in Wikipedia to our collection, so we were able to draw those in. 07:16.160 --> 07:21.880 Here's a map that's contributed from another project called the LL-MAP project, which is 07:21.880 --> 07:26.000 completely external to our project, so they allowed us to use that URL. 07:26.000 --> 07:32.680 And then all of the documents down here are linking to items that we have in the Internet 07:32.680 --> 07:34.720 Archive. 07:34.720 --> 07:49.000 So what this represents is actually a distributed system where you have the content in one place, 07:49.000 --> 07:53.800 you have kind of the data structure in another place, and then you have another place where 07:53.800 --> 07:58.560 you're building out an interface, or maybe you have multiple interfaces. 07:58.560 --> 08:03.600 And I've actually been talking to my other colleagues who are building linguistic archives 08:03.600 --> 08:10.840 about this, because I think it's a pretty robust model for maintaining your archive 08:10.840 --> 08:14.560 in the long run, if any of the pieces look precarious. 
08:14.560 --> 08:22.120 So Freebase was acquired by Google last year, and they have a commitment to keep that data 08:22.120 --> 08:23.120 set open. 08:23.120 --> 08:26.840 But if something happened in the future and we needed to move the data around, as long 08:26.840 --> 08:31.800 as you can get your data in and get your data out, then you're in, you should be in fairly 08:31.800 --> 08:34.720 good shape. 08:34.720 --> 08:41.960 So what I think we've built is something that's fairly recession-proof, too. 08:41.960 --> 08:47.880 We're relying all on cloud services, so there's no service here that we actually have to pay 08:47.880 --> 08:49.740 for and maintain ourselves. 08:49.740 --> 08:58.120 So we are able to piggyback on the resources of many other generous organizations. 08:58.120 --> 09:04.760 And where it looks like this is going in the future, there's a couple of new directions 09:04.760 --> 09:12.280 that one is because we've kind of built out this wiki interface, one of the things we're 09:12.280 --> 09:18.120 potentially interested in doing is elaborating that as a wiki of information on all human 09:18.120 --> 09:20.080 languages. 09:20.080 --> 09:26.620 And so we did a presentation at the Wikimedia Conference last year about this, and that's 09:26.620 --> 09:28.800 one possibility down the road. 09:28.800 --> 09:33.560 I think another great possibility for it would be, I've been admiring the Encyclopedia of 09:33.560 --> 09:38.000 Life, and there's much about the Encyclopedia of Life that could be ported over into the 09:38.000 --> 09:40.960 domain of human language. 09:40.960 --> 09:49.040 But one nearer future for this distributed archive model is a project called the Language 09:49.040 --> 09:56.440 Commons, which Rosetta is one of the organizations that's leading this particular effort. 
09:56.440 --> 10:04.120 And the goal is to build a site for all human languages that works to promote open public 10:04.120 --> 10:09.760 data sets for all of the languages and to better enable their online future. 10:09.760 --> 10:13.800 And the Language Commons backend is actually also here at the Internet Archive. 10:13.800 --> 10:18.120 And so what may happen in the near future is kind of a merger of the Language Commons 10:18.120 --> 10:23.920 and the Rosetta Project backend to become the future of the Language Commons proper, 10:23.920 --> 10:31.080 which would be a site for information on all the world's languages. 10:31.080 --> 10:32.080 Any questions? 10:32.080 --> 10:54.200 Thank you.