Chapter 19: Experience with Metadata on the Internet | Data for Science and Society: The Second National Conference on Scientific and Technical Data | U.S. National Committee for CODATA | National Research Council

19

Experience with Metadata on the Internet

James Restivo

I will start out by asking, what are the problems that metadata are trying to solve? I would like to use metadata to minimize the amount of time spent searching for information and retrieving it. This is one aspect of the problem that metadata address.

Metadata also address sharing information on the Internet, not just within the one community or group that wants to see a particular set of information, but with other communities as well. Another aspect is that the information that people want to see on the Internet should be current. It cannot be out-of-date. Data and information change constantly, and they are not of value if they are not current.

Another problem we are trying to solve is that we want to ask a question once and receive the information from different places. People want the answer to pop up on their desktop, so this means being able to broaden the question. Metadata allow for a common interface and uniform access to distributed resources.

Also, metadata are driving open nonproprietary standards. People who have closed systems are publishing information, but it's not as valuable, because people tend to go to those places where they can ask one question and get broad answers coming back.

Finally, metadata allow for intelligent postprocessing. People want the information coming back to be in a format that their desktop applications can actually work with and do analysis on.

Figure 19.1

In framing the problem as it exists, the access strategy for the metadata is very important. Figure 19.1 shows the standard way of doing research where we deal with different Web sites and different places that publish information. Users now have to go to one source, then the next, and the next. This is a problem when you've got hundreds of places that are publishing their information. It's a problem because it takes a lot more time to do things. It's also a problem because you have to collate all the information that is coming back, and everybody is formatting it differently. Metadata resolve this because they separate format from content.

Also, when you are collating results manually and you're just dealing with hypertext markup language (HTML), not metadata, you have to work a lot harder to strip out all that formatting, because you want to compare apples to apples and not just different types of information.

Figure 19.2

We are finding that people are interested in having one interface (see Figure 19.2). They want the information collated on the way back, and they want it similarly formatted. One way to do distributed searching on the Web is to have a Java applet accomplish this. Another way is to use a Web browser, which goes off to a Web server that uses some sort of gateway. The gateway, in turn, mediates the search across all these different servers.

The last type, which I think is really valuable and really isn't generally available yet, is a desktop application. This allows for a query right on the desktop, which will retrieve metadata directly into your desktop applications, such as Access, or spreadsheets, or whatever your favorite tools are. I think metadata will allow this to happen. That's where I think you see a lot of effort. The nice aspect also is that things are formatted similarly.
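The gateway idea can be sketched in a few lines. This is only an illustration, assuming two hypothetical in-memory "servers" (`server_a`, `server_b`) and made-up record titles; a real gateway would query remote hosts over a search protocol. One query goes out to every server in parallel, and the answers come back collated in one uniform shape.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "servers"; a real gateway would contact remote hosts.
SERVERS = {
    "server_a": [{"title": "Soil moisture survey"}],
    "server_b": [{"title": "Soil carbon inventory"},
                 {"title": "River discharge records"}],
}

def search_server(name, term):
    """Return the records on one server whose title contains the term."""
    return [
        {"server": name, "title": rec["title"]}
        for rec in SERVERS[name]
        if term.lower() in rec["title"].lower()
    ]

def gateway_search(term):
    """Send one query to every server in parallel and collate the results."""
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda name: search_server(name, term), SERVERS))
    collated = [rec for results in result_lists for rec in results]
    # Every hit has the same shape, so the results can be sorted together.
    return sorted(collated, key=lambda rec: rec["title"])

hits = gateway_search("soil")
for hit in hits:
    print(hit["server"], "-", hit["title"])
```

The user asks once; the gateway does the visiting and the collating.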

I'm framing a lot of the architecture that is centered around metadata, because XML or metadata unto themselves are not enough. You have to frame the whole architecture if you are going to really solve this problem as a community, and communities are really working hard on this.

When talking about access strategies, you can deal with distribution. How am I going to distribute my information? How is it going to be organized? Am I going to have a central publication paradigm? This is a nice feature; it's fast, it's stylized. The downside is that the data become obsolete because they are all maintained in a centralized place. The Internet is making it clear that the world is going to be distributed. People want to maintain their data locally. They want to be in control of their data.

Currently most indexing is of HTML. When you go to one of the aggregation sites, everything you see is indexed HTML. I saw a statistic last year that said that half of the HTML on the Internet is indexed by search engines, such as Yahoo and Excite. These indexes don't include databases, because Web crawlers can't get to databases, and they don't typically bother with the documents.

Distributed information publication is another access strategy that allows users to go out to many different places, get the information they are seeking, and bring it back to the desktop. The good side is that you are getting up-to-date information. The downside is that to achieve this, you need cooperation among the people who are publishing information, particularly cooperation on standards, and this is hard to obtain. Another downside is that because you are going many places, to a certain extent your behavior is controlled by the least common denominator, or at least if there is a slow guy in the pack, if you want to obtain all your answers, you have to wait for that person.
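One common way to cope with the slow source is to set a deadline and return whatever has arrived by then. The sketch below assumes two hypothetical sources with simulated delays; the names, delays, and the 0.2-second deadline are all illustrative, not taken from any real system.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

# Hypothetical sources: name -> (simulated delay in seconds, records)
SOURCES = {
    "fast_site": (0.0, ["record 1", "record 2"]),
    "slow_site": (1.0, ["record 3"]),
}

def query_source(name):
    """Pretend to query one remote source."""
    delay, records = SOURCES[name]
    time.sleep(delay)
    return records

def search_with_deadline(deadline_seconds):
    """Query all sources at once; keep only the answers that beat the deadline."""
    pool = ThreadPoolExecutor()
    futures = {pool.submit(query_source, name): name for name in SOURCES}
    done, not_done = wait(futures, timeout=deadline_seconds)
    pool.shutdown(wait=False)  # do not block on the stragglers
    results = {futures[f]: f.result() for f in done}
    skipped = sorted(futures[f] for f in not_done)
    return results, skipped

results, skipped = search_with_deadline(0.2)
print("answered:", sorted(results))
print("still waiting on:", skipped)
```

The trade-off is explicit: a shorter deadline gives a faster answer but a less complete one.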

We are finding that there is a hybrid that people want. The hybrid is that people want to publish their information as a collection in some centralized spot, but they also want to be able to play with other data sources.

Another issue is semantic organization of the information. Is it unstructured information? Is it full text? Does it have metadata? When you are dealing with lots of information, software robots are going to do this work. It's easy to index full text. The downside is that you get poor-quality results, because if you want to find a particular author, you are going to find every occurrence of that name, even where it appears as a reference or in passing and not as the author.

The other side is structured information. This type of organization provides many fields that describe every aspect that makes sense. This is what a lot of the standards are doing. The nice thing about structured organization is you can ask very detailed questions, including the location that some document is describing, such as latitude and longitude, and request all the documents that are near it, intersect with it, or are within this country, and so on. The downside is that structured organization requires cooperation with standards. It is a downside in that it represents work and can impede the process. Consensus is hard work.
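The difference between full-text and fielded searching, and the spatial question mentioned above, can be shown with a small sketch. The two records, their field names, and the coordinates are invented for illustration; the bounding box is assumed to be (west, south, east, north) in decimal degrees.

```python
# Hypothetical records; "bbox" is (west, south, east, north) in decimal degrees.
RECORDS = [
    {"title": "Coastal erosion study", "author": "Smith",
     "text": "Erosion rates along the coast.",
     "bbox": (-77.0, 34.0, -75.0, 36.0)},
    {"title": "Wetland inventory", "author": "Jones",
     "text": "Methods follow Smith.",
     "bbox": (-90.0, 29.0, -89.0, 30.5)},
]

def full_text_search(term):
    """Match the term anywhere in the title, author, or body text."""
    term = term.lower()
    return [r["title"] for r in RECORDS
            if any(term in r[f].lower() for f in ("title", "author", "text"))]

def fielded_search(field, value):
    """Match the value only in the named field."""
    return [r["title"] for r in RECORDS if r[field].lower() == value.lower()]

def intersects(a, b):
    """True when two (west, south, east, north) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def spatial_search(query_bbox):
    """Return the records whose footprint intersects the query box."""
    return [r["title"] for r in RECORDS if intersects(r["bbox"], query_bbox)]

print(full_text_search("smith"))         # finds the citation as well as the author
print(fielded_search("author", "Smith")) # finds only the record Smith wrote
print(spatial_search((-76.0, 35.0, -74.0, 37.0)))
```

The full-text query returns both records because one of them merely cites Smith; the fielded query returns only the record that Smith wrote.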

We find that people want systems that can cope in the face of both structured and unstructured information, because you've got your legacy data and information. You're not going to go back and catalogue it. It's really hard and expensive to do this, but you've got new systems and software that are tagging the information correctly and are available. So again, a hybrid is what is needed.

I want to return to a discussion of metadata. Metadata are information describing a resource. Very often the metadata are indeed the resource itself. They can be created from scratch or extracted from existing information by robots or people.

Prevalent metadata properties are record or form based. Each record consists of one or more elements, such as a title, an author, or a subject. Any element has clearly defined semantics. As we heard from John Rumble's talk, it is difficult for the community to agree on what the semantics of the elements are. By having a clear definition, people can comply with the standards that you come up with. The information can include dates, numerics, floats, integers, latitude and longitude coordinates, and so forth. The elements can be grouped hierarchically. They can also have arbitrary repeatability. One of the tough things about dealing with XML is that XML records are postrelational. So with a lot of standards it's not so easy to go into a relational database and pull out a record, especially when that record can span more than 100 tables. There are some technical issues that vendors are trying to resolve, especially when there can be more than 1,000 elements in a record.

I want to give an idea of what metadata are, because people often haven't touched them and haven't seen them. Figure 19.3 shows some simple examples of metadata. The painting could be my metadata standard for painting. It comes with what I call a schema or a profile, which is just a list of the elements: title, artist, and image link. I'm showing an example of this metadata standard, whose title is Mona Lisa. Leonardo da Vinci is the artist. Here is an image link to it.
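To see why a hierarchical, repeatable record does not sit comfortably in relational tables, consider this sketch. The XML record and its element names are hypothetical; the point is only that one small record already spreads across three tables once the repeated and nested elements are split out.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: "author" repeats and "coverage" is a nested group.
RECORD_XML = """
<record id="r1">
  <title>Stream gauge readings</title>
  <author>A. Rivera</author>
  <author>B. Chen</author>
  <coverage>
    <west>-84.5</west>
    <east>-84.0</east>
    <south>35.5</south>
    <north>36.0</north>
  </coverage>
</record>
"""

def flatten(xml_text):
    """Split one hierarchical record into rows for three relational tables."""
    root = ET.fromstring(xml_text)
    record_id = root.get("id")
    record_row = {"id": record_id, "title": root.findtext("title")}
    # A repeatable element needs its own table, one row per occurrence.
    author_rows = [
        {"record_id": record_id, "position": i, "name": author.text}
        for i, author in enumerate(root.findall("author"), start=1)
    ]
    # A nested group also becomes its own table, keyed back to the record.
    coverage_row = {"record_id": record_id}
    for child in root.find("coverage"):
        coverage_row[child.tag] = float(child.text)
    return {"record": [record_row], "author": author_rows, "coverage": [coverage_row]}

tables = flatten(RECORD_XML)
for name, rows in tables.items():
    print(name, rows)
```

Pulling the record back out means joining every one of those tables again, which is where the difficulty grows with the size of the standard.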

Figure 19.3

A piece of land can have metadata too; the metadata are in fact the multiple listing sheet, if you have seen one. It just lists a bunch of different elements describing the real estate property.

It's also important to realize that metadata exist at different levels. You can talk about them in terms of an entire database or of its details.
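The painting example can be written down directly. In this sketch the element names `title`, `artist`, and `image_link` follow the talk, while the URL is a placeholder and the `validate` helper is a hypothetical illustration of what checking a record against a schema might look like.

```python
# The schema (or profile) is just the list of allowed elements.
PAINTING_SCHEMA = ("title", "artist", "image_link")

def validate(record, schema):
    """Return the elements the record is missing and any the schema does not define."""
    missing = [element for element in schema if element not in record]
    unknown = [element for element in record if element not in schema]
    return missing, unknown

mona_lisa = {
    "title": "Mona Lisa",
    "artist": "Leonardo da Vinci",
    "image_link": "https://example.org/mona-lisa.jpg",  # placeholder URL
}

print(validate(mona_lisa, PAINTING_SCHEMA))
print(validate({"title": "Untitled", "price": "unknown"}, PAINTING_SCHEMA))
```

A multiple listing sheet would work the same way, only with a longer list of elements.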

There are several metadata benefits. If you go through the trouble of dealing with metadata, or overloading metadata onto your information strategies, you get better search results than full text. You don't lose the ability to do full-text searching in the real world, but you get the benefit of being able to do fielded searching. Another benefit is that you spend less time looking, leaving more time for analysis. It is also easier to use the information. Knowledge workers are more likely to use data if they can get at them, especially if they are in a usable form. If you are looking for information and you can't find it, I like to say it doesn't exist, because it doesn't exist in my reality if I can't find it. So if it's in the basement of some place, and nobody can get to some report, it doesn't exist, especially in the face of the Internet. People are starting to use this as their reality.

Information interoperability is another benefit of metadata. Information from many sources can be viewed and analyzed together. This is a really key property. Almost every person who complies with one metadata standard comes back later on and says, "That's really wonderful, but there is another collection of information. Can I get to it?" If the publishers of that collection are complying with standards, it's usually easy to get into their information and combine it into a homogeneous collection. If they are not, it's difficult.

This brings up the issue of metadata standards. Standards provide a framework so that software can access information from numerous organizations and applications. In addition, one query accesses many different data sources. This opens up the possibility of querying across different kinds of information from different disciplines, which is pretty significant. Results are also of high quality due to the common semantics. So title names are consistent across multiple disciplines. It now becomes possible with metadata to submit a query and within seconds to get an answer, whereas in the past you would have to go and drill into all of these different disciplines. The idea of mining and discovery becomes much easier than it has been in the past.

The other thing to keep in mind with regard to metadata standards is that each community has special needs. As a result, there are always going to be different standards. I think that this is another significant aspect of metadata--whatever tools are used, they have to cope with different standards. When you start dealing with hundreds of different targets with different standards, it becomes a very arduous task. Figure 19.4 provides some examples of metadata standards.

Figure 19.4
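Coping with different standards usually comes down to a crosswalk: a table that maps each community's element names onto a shared set of searchable fields. The sketch below assumes two invented profiles (`painting_profile`, `report_profile`), invented common field names, and made-up records; none of these is drawn from a real standard.

```python
# Hypothetical crosswalks: local element name -> common searchable field.
CROSSWALKS = {
    "painting_profile": {"title": "title", "artist": "creator"},
    "report_profile": {"report_name": "title", "author": "creator"},
}

def to_common(record, profile):
    """Translate a record from its community profile into the common fields."""
    mapping = CROSSWALKS[profile]
    return {mapping[name]: value for name, value in record.items() if name in mapping}

def search(records, field, value):
    """Fielded search over records already expressed in the common fields."""
    return [r for r in records if r.get(field) == value]

# Made-up records from two communities that name their elements differently.
painting = {"title": "Harbor at dawn", "artist": "J. Doe"}
report = {"report_name": "Pigment analysis notes", "author": "J. Doe"}

common_records = [
    to_common(painting, "painting_profile"),
    to_common(report, "report_profile"),
]
matches = search(common_records, "creator", "J. Doe")
print(matches)
```

Each new standard adds one more mapping to maintain, which is why hundreds of targets become arduous.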

Metadata are not enough. To have machines communicate on the Internet, you need a metadata scheme. You also need a communications protocol to allow for search and retrieval and writing, and you have to have agreement on what are commonly searchable fields and on the syntax for sending and retrieving information. Finally, you need applications and tools that put this all together.

There are a number of metadata protocols. These protocols can be standards based, such as Z39.50. This is a formally established standard (ANSI/NISO Z39.50, adopted by the International Organization for Standardization as ISO 23950). It has a large following, and it supports structured search and retrieval of information. There are also proprietary protocols.

There are many metadata examples including formats such as HTML, which supports tags or embedded tags, or just putting the metadata right into HTML. Extensible markup language, XML, is huge. It's going to change a lot of what is going on. Machine-Readable Cataloging (MARC) is used by the library community. There also are a lot of different syntaxes. You have to decide how the information is going to come back, so that people can build those applications, and what standards you are going to use.
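As an illustration of metadata embedded right in HTML, the sketch below pulls `<meta name="..." content="...">` tags out of a page using Python's standard `html.parser` module. The page content and the element names (`author`, `keywords`) are made up for the example.

```python
from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    """Collect name/content pairs from <meta> tags, allowing repeated names."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.metadata.setdefault(name, []).append(content)

# Made-up page with embedded metadata.
PAGE = """<html><head>
<title>Survey report</title>
<meta name="author" content="J. Doe">
<meta name="keywords" content="hydrology">
<meta name="keywords" content="groundwater">
</head><body><p>Report text.</p></body></html>"""

parser = MetaTagParser()
parser.feed(PAGE)
print(parser.metadata)
```

The formatting stays in the page, while the extracted elements can be indexed and searched by field.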

When going through implementation, we tell people to consider the whole information management process before implementing a metadata solution. Avoid building information silos, because they are not as useful. People can't get to the information. Use technology that supports interoperability. Component solutions exist. A key aspect is to implement agreement on semantics of the elements that supports interoperability. If you can piggyback on an appropriate standard or find a community that has done the work already, it's a lot easier to go along and augment and modify that. We see this happening quite a bit.

I have some specific case studies that I would like to discuss. Oak Ridge National Laboratory (ORNL) wanted to create a worldwide index and delivery system for scientific information. It was going to build a system from scratch. ORNL has lots of people around the world generating scientific information, and it wanted a system that used the Internet to compile the information and make a centralized index. The data change continuously. Oak Ridge wanted to search affiliated databases, not just its own. The technology had to be scalable, and ORNL wanted it off the shelf. It used some of our tools and didn't have to do a lot of configuration. ORNL unified its databases, creating a virtual collection of information.

Let's track the information flow. People are taking measurements. They come back to a central regional office, and they download their information. You have to keep in mind that there are many regional offices around the world collecting data. Once that information has been collected, they run a robot that goes around and brings back structured information. They put it into a database, and from that they generate HTML, and they also just make it generally searchable using the Federal Geographic Data Committee (FGDC) standard. This is easier than using straight HTML to bring back information, the way you are used to seeing it. They also provide searching that goes through a piece of software called a gateway, which will search using the Z39.50 and FGDC standards. This brings back the information using a syntax called Generic Record Syntax or XML.

When somebody does a search, it searches the centralized index and some affiliated databases and then brings back the results in a formatted way, which actually refers to the data that are out in the field. So you are getting up-to-date data. I think they were really innovative in how they did this.
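The flow just described, a robot building a central index whose entries point back to data held in the field, can be sketched roughly as follows. The office names, record identifiers, titles, and dates are all invented; this is not ORNL's actual design, only one way the idea could look.

```python
# Hypothetical regional offices holding the live data.
REGIONAL_OFFICES = {
    "office_east": [{"id": "e1", "title": "Forest canopy measurements", "updated": "day 1"}],
    "office_west": [{"id": "w1", "title": "Canopy temperature logs", "updated": "day 1"}],
}

def harvest(offices):
    """Robot step: collect searchable metadata plus a pointer to each record."""
    return [
        {"title": rec["title"], "office": office, "record_id": rec["id"]}
        for office, records in offices.items()
        for rec in records
    ]

def search(index, term, offices):
    """Search the central index, then follow each pointer to the live record."""
    results = []
    for entry in index:
        if term.lower() in entry["title"].lower():
            live = next(r for r in offices[entry["office"]]
                        if r["id"] == entry["record_id"])
            results.append({"title": entry["title"], "office": entry["office"],
                            "updated": live["updated"]})
    return results

index = harvest(REGIONAL_OFFICES)

# The field data change after the index was built...
REGIONAL_OFFICES["office_west"][0]["updated"] = "day 2"

# ...and the search still reports the current state, because it follows the pointer.
found = search(index, "canopy", REGIONAL_OFFICES)
for item in found:
    print(item)
```

The index gives fast, fielded searching; the pointers keep the answers current.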

Now ORNL has its centralized index, and it has high-quality searching fields. ORNL is doing this as a service for other organizations that don't have all the technical wherewithal to do this.

Another case study involves the U.S. Geological Survey (USGS). USGS does worldwide search and retrieval of information through the National Spatial Data Infrastructure clearinghouse. More than 200 databases around the world can be searched with one query. It's all geospatial information. You are able to come in and select what gateway you want to use, because digital gateways give you faster Internet access. Once you have this, you go to the site and submit a search form, which searches all the databases in real time. If you pick them all, I think it will take up to a minute or so, but then you come back and you get results. What's significant is that you are able to get at information from more than 200 organizations all around the world, even though they are not using the same software. They are using a standard. How long would it take to access 200 Web sites? If you spent 5 minutes per site researching each topic, you would of course have to use potentially 200 different search engines. You would be there for a couple of days, and here you are able to do it in a couple of minutes. The key reason you are able to do this is that you are using standards. That is the major thrust of what I have to discuss today. Figure 19.5 provides a summary of my discussion.

Figure 19.5
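The "couple of days" estimate for visiting 200 sites by hand can be checked with quick arithmetic. The 200 sites and 5 minutes per site come from the talk; the 8-hour working day is an assumption added here to convert hours into days.

```python
SITES = 200
MINUTES_PER_SITE = 5
WORKDAY_HOURS = 8  # assumption: length of one working day

total_minutes = SITES * MINUTES_PER_SITE
total_hours = total_minutes / 60
working_days = total_hours / WORKDAY_HOURS

print(f"{total_minutes} minutes = {total_hours:.1f} hours = {working_days:.1f} working days")
# prints "1000 minutes = 16.7 hours = 2.1 working days"
```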