The next topic was standards for metadata and work processes. There are a number of standards currently used in Europe that are helpful for documenting and archiving methods used by statistical agencies to produce statistical products.
H.V. JAGADISH: AUTOMATING THE CAPTURE OF DATA TRANSFORMATION METADATA
H.V. Jagadish began with a presentation on automating the capture of data transformation metadata. (The talk was put together by George Alter, who could not be in attendance.) He asserted that data are useless without metadata—“data about data.” Metadata should include all relevant information about data creation, including transformations to variables. Because creating comprehensive metadata can be costly, metadata need to be easy to create, which is why the current goal is automated capture. Jagadish stressed that it is desirable for the capture of metadata to be automated so that it can happen during the process of data creation or transformation.
This is an example of the research being done at the Inter-university Consortium for Political and Social Research (ICPSR)—the world’s largest archive of social science data. ICPSR holds many datasets, and the metadata are recorded using the Data Documentation Initiative (DDI). A specific metadata format is defined as the desired target of whatever they create. ICPSR is building search tools based on the DDI XML. Codebooks (PDF and online) are rendered from the DDI. If a user goes into the
ICPSR Web page, looks up the online codebook, and searches for variables, he or she can find out about variables in the context of a dataset. These are the kinds of activities that researchers do when they are trying to find appropriate datasets for whatever their research question might be, and there is a lot of detail available.
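The kind of DDI markup behind such an online codebook can be sketched roughly as follows. This is a minimal, hypothetical fragment, not an actual ICPSR record; the variable name, question text, and category labels are invented, though the element names follow DDI Codebook (version 2.5) conventions:

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical DDI Codebook fragment for one variable.
ddi = """
<codeBook xmlns="ddi:codebook:2_5">
  <dataDscr>
    <var name="VOLSCHL">
      <labl>Volunteered for school or tutoring</labl>
      <qstn><qstnLit>In the past 12 months, did you volunteer for a school?</qstnLit></qstn>
      <catgry><catValu>1</catValu><labl>Yes</labl></catgry>
      <catgry><catValu>2</catValu><labl>No</labl></catgry>
    </var>
  </dataDscr>
</codeBook>
"""

# Search tools of the sort ICPSR builds can be driven by walking this markup.
ns = {"c": "ddi:codebook:2_5"}
var = ET.fromstring(ddi).find(".//c:var", ns)
labels = {c.findtext("c:catValu", namespaces=ns): c.findtext("c:labl", namespaces=ns)
          for c in var.findall("c:catgry", ns)}
# var.get("name") is 'VOLSCHL'; labels is {'1': 'Yes', '2': 'No'}
```

Because the question text and value codes live in structured elements rather than a PDF, the same record can drive variable search, cross-tabulation tools, and rendered codebooks.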
Jagadish used the example of volunteering for schools/tutoring. He showed how this variable was coded, with a link to an online graph tool and cross-tab tool. One can also get an online codebook for comparable variables. If a user wants to know what question was asked, who answered it given various skip sequences, and how the question was coded, it is either in the PDF or in the code, which might be in SAS, SPSS, or Stata. What happens now is that one places data in the archive, and typically not much documentation comes with the data directly; there is a separate process to get that. When data arrive at the archive, there is no text for questions, no interview flow (question order, skip patterns), and no variable provenance, and data transformations are not documented. How are research data created? Most surveys are conducted with computer-assisted interviewing software: telephone interviewing, personal interviewing, or Web interviewing. In other words, the software questionnaire program is the metadata. So the sequence is (1) carrying out computer-assisted interviewing, (2) creating the data, and (3) separately converting the program to DDI. Thus, there are two different paths, one for the data and one for the DDI metadata. Furthermore, people do not usually use the original data; they perform transformations on it. There are commands that transform the original data, and now a user has revised data or some derived data products. One problem is that the modified data may no longer match the metadata. One remedy is to create the metadata after the data are transformed, placing the written code into the metadata; this requires that someone make the additions to the metadata that correspond to these transformations.
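The idea of deriving metadata from the transformation code itself can be sketched very roughly. In the sketch below, the command syntax is Stata-style and the provenance record format is invented for illustration (it is far cruder than the real SDTL work this talk described):

```python
import re

def capture(command):
    """Parse a simple Stata-style 'gen VAR = EXPR if COND' command into a
    provenance record. The record format is invented for illustration;
    real capture must handle far more of each package's syntax."""
    m = re.fullmatch(r"gen (\w+) = (.+?)(?: if (.+))?", command)
    var, expr, cond = m.groups()
    return {"operation": "Compute", "produces": var,
            "expression": expr, "condition": cond,
            "source_command": command}

rec = capture("gen Y = 9 if X > 3")
# rec["produces"] is 'Y'; rec["condition"] is 'X > 3'
```

Appending records like this to the metadata as each command runs is what keeps derived data products matched to their documentation.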
Jagadish continued that the modifications to the metadata could be extracted from the scripts in the statistical packages in an automated way, because the set of transformations that one applies using statistical packages is fairly limited in scope. Statistical packages carry limited metadata: there are variable names, variable labels, and value labels, but no provenance. There is a Structured Data Transformation Language (SDTL)-to-DDI converter, so the user gets the updated, revised metadata through a process that can probably be automated. The formats that could be converted include delimited text, SPSS, SAS, Stata, R, and Excel. An SDTL is useful because there are many ways to implement transformations in different statistical programming languages, including how missing values are treated. Statistical packages differ in what happens when a missing value appears in a logical comparison. For example, in a simple program statement [if X is greater than 3, then set Y equal to 9; if X is less than or equal to 3, set Y equal to 8], the output data can be quite different in SPSS, Stata, and SAS, depending on which values are missing and how missing and zero values are treated; working across systems brings many nonstandard rules into play. The point of this is to have standardized machine-readable metadata that are derived in an automatic way for the computational transformations that apply to collected survey data. This automated capture will cover the vast majority of actual cases, and part of the challenge is providing appropriate escape hatches for the small number of cases that are missed. Jagadish concluded that the benefits of automation will include (1) better metadata, (2) lower costs of creating the metadata, (3) standardized and machine-readable metadata, and (4) use by researchers of the codebooks that are produced.
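The divergence described here can be illustrated with a small sketch. The conventions encoded below are deliberately simplified (SAS sorts missing values below every number, Stata sorts them above every number, and SPSS treats comparisons involving system-missing as not true); real package behavior has further wrinkles, such as Stata's extended missing-value codes:

```python
import math

def recode(x, convention):
    """Apply 'if X > 3 then Y = 9; if X <= 3 then Y = 8' under simplified
    missing-value conventions of three packages (a sketch, not a full
    emulation of any package)."""
    missing = x is None or (isinstance(x, float) and math.isnan(x))
    if convention == "sas":    # SAS: missing sorts below every number
        return 8 if missing else (9 if x > 3 else 8)
    if convention == "stata":  # Stata: missing sorts above every number
        return 9 if missing else (9 if x > 3 else 8)
    if convention == "spss":   # SPSS: comparisons with sysmis are not true
        return None if missing else (9 if x > 3 else 8)
    raise ValueError(convention)

# The same missing input yields three different outputs:
results = {c: recode(None, c) for c in ("sas", "stata", "spss")}
# → {'sas': 8, 'stata': 9, 'spss': None}
```

A standardized transformation language has to make such rules explicit so that the same recode means the same thing regardless of the package that originally executed it.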
DANIEL GILLMAN: DATA DOCUMENTATION INITIATIVE
Dan Gillman then spoke about the DDI. His goal was to describe the efforts under way at the Bureau of Labor Statistics (BLS) to use DDI to describe the Consumer Expenditure Surveys (CES). The DDI is an effort to produce standards and products used to describe statistical metadata. These products are developed under a consortium, the DDI Alliance, which is managed by ICPSR. The members of the alliance are statistical offices, libraries, archives, and researchers. The products are the Codebook, which Jagadish had been talking about and which is used at ICPSR, and an expanded version called Lifecycle. While the Codebook describes an individual dataset, Lifecycle addresses the data lifecycle, including the ability to compare or reuse metadata across time and across datasets, today typically across surveys.
Gillman continued that there are also the vocabularies being developed under the use of the Resource Description Framework, which is a World Wide Web consortium standard for being able to link data and metadata together. There are three products: (1) dataset discovery for data published on the Web, (2) physical data description, which allows users to talk about how data are laid out, and (3) the eXtended Knowledge Organization System (XKOS), which is a vocabulary for describing statistical classifications (it is based on SKOS, the Simple Knowledge Organization System, and XKOS is an extension of that).
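To give a flavor of these vocabularies, the sketch below describes a hypothetical two-level statistical classification in RDF Turtle, using SKOS plus a few XKOS extensions. The classification, its codes, and all example.org URIs are invented for illustration; only a handful of SKOS/XKOS properties are shown:

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix xkos: <http://rdf-vocabulary.ddialliance.org/xkos#> .
@prefix ex:   <http://example.org/classification/> .

# A hypothetical industry classification with two levels.
ex:industries a skos:ConceptScheme ;
    skos:prefLabel "Example industry classification"@en ;
    xkos:levels ( ex:sectorLevel ex:subsectorLevel ) .

ex:sectorLevel a xkos:ClassificationLevel ;
    xkos:depth 1 .

ex:retail a skos:Concept ;
    skos:inScheme ex:industries ;
    skos:notation "44" ;
    skos:prefLabel "Retail trade"@en .
```

Because the scheme, levels, and coded concepts are all RDF resources, classifications published this way can be linked to the data and metadata that use them.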
Gillman said that the current work of the alliance includes the development of DDI-4 Moving Forward, which is an effort to expand both the codebook and Lifecycle to create a unified model for both, and therefore have the extended ability to describe a dataset’s lifecycle. There is also an effort to describe methodologies, the production process, questionnaires (including all
of the skip patterns), and so forth. DDI-4 Moving Forward is also intended to create multiple bindings, which means that users can implement it in multiple ways. Currently, DDI is implemented in XML, which will continue to be used, but there is interest in using the Resource Description Framework, and users want to be able to use the Structured Query Language (SQL), which is the language behind relational databases. For DDI, there are a lot of implementations, especially in data libraries and archives around the world. Statistical offices are beginning to adopt it, and all of the DDI products are being implemented in one way or another. DDI Codebook is by far the most common, but Lifecycle is also gaining adherents.
Gillman then discussed the implementation that they are engaged in at the BLS. They are trying to describe the CES. The CES measures how people and households in the United States spend their money. It is conducted by BLS, and the data are collected by the U.S. Census Bureau. The CES consists of two surveys: (1) the interview survey (taking place quarterly) that includes large or recurring expenses (e.g., rent) and (2) a diary (taking place every 2 weeks) that includes small, frequent expenses (e.g., groceries). As for the processing, there are four different subsystems from collection to dissemination. Right now they are doing a complete redesign of this processing, and so they want to be able to describe it in a coherent way. The variables that describe the various subsystems throughout the processing are managed in separate Access databases, one per subsystem, which do not communicate. As a result, tracking variables across databases is very hard, albeit necessary, in order to manage quality and the production environment. Instead, they want to build a single system for managing variables. This would work across surveys (interview and diary) and throughout the lifecycle, including dissemination. The lifecycle includes the edit system, the estimation system, and the ultimate data products to the user. They want to show how similar variables change over time and over the lifecycle; changes to the survey design happen in the odd years. Because these databases were developed independently, they differ in the way codes and categories are defined from one database to the other. They are not harmonized.
Gillman continued that they want to follow groupings of variables because they have expenditure groupings that end up in the expenditure data that CES disseminates. They also have universal classification code groupings for products that end up as input to the Consumer Price Index. They want to do this over the entire lifecycle, as they are interested in how questions are defined, how they end up defining the variables, and how this manifests in the public use microdata tables that they disseminate. They also want to include a description of all the production systems, which includes the steps in those production systems. They want to show what happens at all possible levels.
Most of these metadata would not be available to the public, especially at the bottom level, but a higher level can be publicly released. It would include links to variables as inputs/outputs, and it would show the flow through and between each subsystem. In addition, it would help automate instrument design, including wording and skip flow.
Gillman said that the bottom line is that a metadata system could be useful for handling all of this. They selected DDI Lifecycle version 3.2 because it is a standard in common use, and there is a company called Colectica that makes software that speaks DDI. The software is relatively cheap, and they do not have to build a system themselves to be able to use the standard.
Gillman continued that they are now in the process of iterative system development. They are starting small and are currently developing a set of pilot systems. They have developed two so far. The first was to show whether DDI was sufficient for handling the needs that CES had, based on the goals that were laid out. The second pilot was to show whether the Colectica software, which includes a Web portal that works with their metadata repository, was flexible enough to build the interface necessary to answer the questions that were laid out for the problem at hand.
These pilot systems are resource and time limited. They chose 2012 and 2013, because the CES can change in the odd years. They wanted to show that they could account for changes from one year to the next, and they selected education, hospitalization, and health insurance as the variables to follow through from beginning to end to see what happens as the survey and production environment change. In order to compare variables, they built a “correspondence tree” that can show how things look across surveys, over time, and throughout the lifecycle, as well as a “code comparison” of the codes, the small numeric values used to represent categories in variables.
They wanted to show the mess that has been created by the differences among all of the variables defined in these various systems. Some of these differences are meaningful, while some are gratuitous and unnecessary. DDI has a construct called the variable cascade that deals with this. A conceptual variable is a high-level, concept-related description that says what the variable means and what concepts represent the categories or the range that the variable defines. The represented variable is what the variable looks like from the substantive point of view. The instance variable is the variable as it appears in a particular application, including how missing values look. They have a cascade of increasing detail, and they have put it to use.
Gillman then demonstrated what the interface looks like. The indentation shows, from the point of view of the cascade, where the education variable starts. At the top, there is a conceptual variable that says education is defined as follows, and then at the next two levels there are different represented variables. The variable is “highest grade completed.” But there are substantive differences. As part of the code comparison, in 2012 there were 21 categories, and nursery and kindergarten were not available choices. In 2013, there were 8 categories, including nursery/kindergarten/elementary, high school, and one category for professional/master’s/PhD. There were also gratuitous differences. For example, the category labels changed. Blanks were added; the grade was described as “12th” versus “twelfth.” The last category was “professional school degree” versus “professional degree”; “high school (grades 9–12), no degree” versus “high school (grades 9–12, no degree)”; and “high school graduate—high school diploma or the equivalent (GED)” versus “high school graduate—high school diploma or equivalent (GED).” There are different variables for each one of these subsystems for each year.
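The distinction between substantive and gratuitous differences can be sketched as a simple comparison of year-specific code lists. The normalization rule and the labels below are invented for illustration and are far cruder than the actual correspondence work:

```python
def diff_codelists(a, b):
    """Split differences between two code lists (code -> label) into
    substantive ones (codes present in only one year) and gratuitous ones
    (labels that differ only in punctuation, case, or spacing)."""
    norm = lambda s: "".join(ch for ch in s.lower() if ch.isalnum())
    substantive = a.keys() ^ b.keys()   # symmetric difference of codes
    gratuitous = {c for c in a.keys() & b.keys()
                  if a[c] != b[c] and norm(a[c]) == norm(b[c])}
    return substantive, gratuitous

# Invented year-specific code lists for an education-like variable:
edu_2012 = {"12": "High school (grades 9-12), no degree", "01": "Nursery"}
edu_2013 = {"12": "High school (grades 9-12, no degree)"}

substantive, gratuitous = diff_codelists(edu_2012, edu_2013)
# → substantive == {'01'}, gratuitous == {'12'}
```

Word-level changes such as “professional school degree” versus “professional degree” still need expert review; the point of the cascade and the correspondence tree is to give that review a structured place to happen.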
The question is whether those changes make sense or if they are trivial. To address this, experts look to the correspondence tree of the variables that they have defined. They want to be able to see the path of the variables through the lifecycle, so they can go from the question for each of the processing subsystems to the final output. There is a variable at the top, an instance variable that is at a dissemination point and that comes from one of the subsystems behind it. Indented further is the previous subsystem, and indented further is the question that was associated with that variable. This shows the path that a variable takes, and it is easy to see how that migrates through the processing.
Gillman added that people also want to be able to talk about the processing itself. They have four subsystems that are broken into smaller subsystems, and in this demonstration they showed that they could break the process down all the way to its smallest components. Right now they have a limitation because they cannot yet show what the inputs and outputs of each of the subsystems are; they have not done that linkage. But the demonstration showed their systems at the highest level, and they can show the details of the initial edit subsystem. This subsystem is broken into three bundles, and if bundle two is expanded, it has many subprocesses within it. Each of the other bundles can also be broken apart. What they want to do is show a process model for each of those systems and show what is happening to the data from input processing to output throughout the entire processing cascade.
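One way to picture breaking a subsystem down to its smallest components is as a nested description that can be enumerated. The subsystem, bundle, and step names below are invented stand-ins, not the actual CES processing steps:

```python
# A hypothetical nested description of one production subsystem.
edit_subsystem = {
    "initial edit": {
        "bundle 1": {},
        "bundle 2": {"range check": {}, "skip-pattern check": {}, "recode": {}},
        "bundle 3": {},
    }
}

def leaves(tree, path=()):
    """Enumerate the smallest components of a nested process description."""
    for name, sub in tree.items():
        if sub:
            yield from leaves(sub, path + (name,))
        else:
            yield path + (name,)

steps = list(leaves(edit_subsystem))
# e.g. ('initial edit', 'bundle 2', 'range check') is one of the 5 leaves
```

Attaching input and output variables to each leaf is the linkage step that, as Gillman noted, remains to be done.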
DAVID BARRACLOUGH: STATISTICAL DATA AND METADATA EXCHANGE
David Barraclough from the OECD spoke about the Statistical Data and Metadata eXchange (SDMX). He first set the stage by talking about
a specific problem that this addresses, and the kinds of things that SDMX tries to help with.
Imagine that there is an employee at a statistical agency who would like to avoid sending the same data to multiple agencies. Perhaps he or she would also like to avoid sending these data packages; instead, the employee simply wants to disseminate the data on a Website or make them available via application programming interfaces (APIs). In addition, the employee does not want to continuously come up with new formats and standards for a few specific data flows; he or she would rather have a single standard that people can use and know is supported by a community. Also, the employee would like to make datasets user friendly and compatible with shared code lists and related structures.
Barraclough continued that an international organization or a data receiver would like to avoid lost time and errors when processing different file formats from providers, since data can come in a variety of formats. For example, data could come as CSV files, Word documents, or Excel questionnaires. Also, agencies would like to avoid processing complete datasets each time, since they would prefer to slice out the data that they need and be able to query them. Agencies want to have comparable data, and they want to avoid the time delays that are caused by manual file processing.
Barraclough added that what happens now is that there are many round-trip validations. For example, when OECD gets some data from an agency, there is a lot of checking by statisticians, with possible edits going back to the agency to verify the numbers. Agencies would like to have some automatic validation and also be able to create validation rules that can be shared on both sides and across agencies. Everyone would like some kind of automatic validation within the statistical workflow and to be able to automate those workflows in order to lower the cost of processing, increase quality, and have more guidelines for implementing new statistical products and new exchanges. They would also like to document the structural metadata for their datasets and their reference metadata. They would like to be able to store that documentation, have a standard way of querying it, and make it discoverable in-house as well as on the Web. They would also like to benefit from a large community offering free tools and sharing expertise around a standard.
After this justification, Barraclough began by describing SDMX (see SDMX.org for more details). It was released in 2002 as an initiative to foster standards for the exchange of statistical information. The sponsor organizations include Eurostat, the International Monetary Fund, OECD, the United Nations, and the World Bank. There is an information model, and there is a standard for describing Web services in order to make these data queries. There are also standards on which to build registries for the cataloging of the metadata in order to query those metadata in a standard way.
Barraclough continued that another large part of SDMX results from a working group (of which he is the chair) called the statistical working group. This group makes guidelines on how to code certain concepts in statistics using SDMX. They also write best practices for how to use SDMX and implementation guidelines. With respect to the governance of SDMX, in addition to the sponsors, who form a steering group for the initiative, there is also a technical working group that works on the technical aspects of the standards. If they change the information model, it is this working group that works out how to do that. SDMX also offers a whole raft of reusable tools; just about all of them are free, and some of them have a freeware element to them. The main exchange format for SDMX is XML. A JSON format and a CSV format have also been proposed. In all cases there is always a standard schema with the format, so it is not just a simple CSV or simple XML, because that alone would be meaningless.
Barraclough then delved into the business case for using SDMX. Using SDMX saves resources by reusing exchange systems across domains and agencies, and through reuse of statistical metadata and methodology. What is avoided is each agency creating its own system in order to process and disseminate data. This latter part has been quite a success. SDMX also improves quality in the exchange, because it promotes the use of standard classifications, which is what his working group is trying to do. Use of those classifications reduces mapping and transformation errors. The automated exchange reduces manual intervention errors, and validation is one of the first-class features of SDMX. Through this automation, workflows have fewer wait states where users are waiting on someone to click a button. This is a very typical situation at OECD: some data come in, then a statistician has to open a file, copy the data, paste it into another file, perform a transformation, and click another button and do something else, all of which introduces potential errors and timeliness issues.
On the other hand, Barraclough asked what is wrong with sticking to the older formats. Simple CSV is not structured, and it is hard to validate. Furthermore, there are no metadata in it. Excel has been used for a lot of data exchange in questionnaires, but in Excel the metadata are tied to the presentation, and it becomes a design issue. Excel is also a proprietary format, which introduces licensing issues. Furthermore, Excel is hard to process and automate. FAME, SAS, and Stata files use proprietary formats and are not useful for exchange. GESMES is an older format, it is proprietary, and there are few tools for it. XBRL and DDI are great at what they do, but they are not focused on modeling the exchange of data, which is what SDMX is designed for. Finally, plain XML by itself is also not good for reuse; SDMX uses XML but adds context in a standard way.
Barraclough then talked briefly about the SDMX information model. In Excel, the information model is based on a sheet, cells, and rows. An
information model means that a user can write formulae and Visual Basic for Applications around the data that are in Excel. In a relational database such as SQL Server or Oracle, the information model is based on database tables and columns, so a user can write SQL and use standard interfaces. A different kind of information model is the OECD metadata model, where there are 42 categories; everybody at OECD knows what these categories are, and they are used throughout the OECD. The SDMX information model is designed for statistical data and metadata exchange and also to catalog those metadata for querying. SDMX was also designed for aggregated data, and in the most recent version there is a focus on allowing microdata to be modeled more easily.
Barraclough next displayed a depiction of the SDMX information model (High Level). One major component of it is data structure definition, and SDMX provides the model for a dataset. By using the data structure definition, one finds out that data are made up of dimensions and these dimensions use certain code lists. So for seasonal adjustment, if this is in a dataset, then the seasonal adjustment concept or dimension references a seasonal adjustment code list. There is a standard set of codes for seasonal adjustment.
The information model also has constructs to model data flows, Barraclough continued. One can have the same data structure or dataset used for different data flows. For example, for balance of payments there is a global data structure definition (DSD). But if a user wants to exchange data on services, which is like a subset of the balance of payments dataset, one basically constrains the balance of payments dataset. There are many other tools in there, such as categories, so one can create many datasets and categorize them to say that certain datasets have to do with trade, national accounts, and so on.
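The relationship between a DSD, its code lists, and a constrained dataflow can be caricatured in a few lines. The dimension names and codes below are invented stand-ins, not the real balance of payments DSD:

```python
# A toy sketch: a data structure definition (DSD) names dimensions and
# ties each to a code list, and a dataflow can reuse a DSD with constraints.
DSD_BOP = {
    "FREQ":       {"A", "Q", "M"},    # frequency code list
    "ADJUSTMENT": {"N", "S"},         # seasonal adjustment: none / adjusted
    "REF_AREA":   {"US", "FR", "MX"},
}

def validate_key(dsd, key):
    """Check that a series key uses only codes allowed by the DSD."""
    return all(key.get(dim) in codes for dim, codes in dsd.items())

# A dataflow for a subset of balance of payments constrains a dimension:
constrained = dict(DSD_BOP, REF_AREA={"US", "FR"})

ok = validate_key(DSD_BOP, {"FREQ": "Q", "ADJUSTMENT": "S", "REF_AREA": "MX"})
bad = validate_key(constrained, {"FREQ": "Q", "ADJUSTMENT": "S", "REF_AREA": "MX"})
# ok is True; bad is False because MX is outside the constrained code list
```

Shared code lists are what make this kind of automatic validation possible on both the sending and receiving sides.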
Barraclough added that there are also provision agreements, usually on the input side, to say that a particular agency is sending data. He said that an agency can hand the information model to a developer when asking whether he or she could create an SDMX tool; the developer can then see each concept, understand how it relates to the other parts of the information model, and click through for the details in UML.
Barraclough then provided the main tools of SDMX. First is the SDMX registry, which catalogs structural metadata. The global registry is a tool that lives in the cloud and was designed to host global data structure definitions such as balance of payments, national accounts, and foreign direct investment. There are also cross-domain code lists, such as seasonal adjustments. Looking at the data structures—which are completely public—one can see the latest balance of payments dataset. This has frequency and adjustment indicators, and one can see the coding of each of these concepts
on the side. It is possible to download a printer-friendly code list, but the idea is that member countries, when they send balance of payments data, use the global DSD for balance of payments to send it, and the data should match all of this coding. The idea is that they know exactly what to send and which coding to use, and the data can then be processed so that they are the same across the whole constituency. There is also another DSD for national accounts, and so on.
Second, there is the SDMX converter, a desktop application that can convert formats such as Excel and CSV to SDMX, furthering the implementation of SDMX. There is another tool that goes further than that, the reference infrastructure. It connects to an existing data warehouse that an agency may have, and it gives that agency an SDMX Web service. The agency can map existing structures within the data warehouse, creating a mapping between those structures and something like a global DSD, and then the agency can disseminate the SDMX using the Web service that the SDMX reference infrastructure provides. This is a free tool, which can be installed in agencies, and it is used by many countries, especially in Europe, because Eurostat mandates that its member countries send national accounts data using SDMX. The OECD, however, cannot issue such a mandate.
Barraclough said that there are also many plug-ins and libraries for econometric tools and there are Java and .Net software libraries to build tools. There is also the OECD warehouse platform, which is partly operational now but will be fully operational in SDMX in 2 years.
As a final demonstration, Barraclough queried a Web service for the OECD key economic indicators (KEI) dataset. The query retrieves the data from the KEI dataset using the actual codes for the separate dimensions. Basically, whichever SDMX query is run on whichever agency’s SDMX Web service, it is always constructed in the same way. The returned file is in XML, meaning it is readable. Barraclough added that his group looks after SDMX guidelines and tries to make them as user friendly as possible. They have a checklist for SDMX design projects, which addresses questions such as: if somebody wants to create a reporting framework or simply a set of datasets, how should they go about it so that it works well? In this design phase, the user first maps the data flows of what he or she wants to do, then defines a concept scheme, which defines, for national accounts, all of the concepts used to describe those national accounts, and then defines code lists. This leads to getting a lot more detail and developing reusable tools. There is also an SDMX glossary, which is a technical manual for SDMX. It describes various aspects of the standard itself, but it also describes the concepts that are used in the exchanges. For example, a term such as seasonal adjustment is described and coded. This allows a user to compare with other datasets and shows which code lists should be used.
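The uniform construction of such queries can be sketched as follows. The base URL and dimension codes below are placeholders, but the `.../data/{flow}/{key}` pattern, with dot-separated dimension codes ordered as the DSD orders its dimensions, follows the SDMX 2.1 RESTful API:

```python
def sdmx_data_url(base, flow, key, start=None):
    """Build an SDMX REST data query: one code per dimension, joined with
    dots in the order the DSD declares the dimensions; an empty string
    wildcards that dimension."""
    url = f"{base}/data/{flow}/{'.'.join(key)}"
    return url + (f"?startPeriod={start}" if start else "")

# Hypothetical query against a KEI-like dataflow (endpoint and codes invented):
url = sdmx_data_url("https://example.org/rest", "KEI",
                    ["CPI", "MEX", "GP", "M"], start="2015")
# → 'https://example.org/rest/data/KEI/CPI.MEX.GP.M?startPeriod=2015'
```

Because the pattern is the same for every compliant service, a client written against one agency's endpoint can query another agency's endpoint unchanged.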
There are other guidelines that they have created or that they are working on, like how to do versioning in SDMX (i.e., how to represent
data vintages). They are also working on how best to exchange reference metadata. Barraclough noted that the next presentation at this workshop is on the Generic Statistical Business Process Model (GSBPM), and Dan Gillman had just spoken on DDI. He showed which business processes are covered by DDI and SDMX. In terms of the GSBPM, DDI covers the design of a process, building it, and then collecting data. SDMX covers the processing, analysis, and dissemination phases. After dissemination, DDI again covers archiving and evaluation. Finally, there are global data structure definitions (DSDs), which aim to improve on the heterogeneous reporting methods found all over the place today. Having these global DSDs improves timeliness by allowing data queries across agencies, avoids the burden of having many different reporting systems and exchange agreements covering different formats, and saves money by reusing systems, standards, and methodologies. Barraclough pointed out that one can find these in the Global Registry. National accounts was the first process covered, followed by balance of payments and foreign direct investment. They are now working on international merchandise trade statistics, price statistics, labor statistics, education, sustainable development goals, R&D statistics, environmental-economic accounts, and energy statistics.
Barraclough noted that SDMX is not only a file format, but it is a set of standards covering mainly exchange, and there are many free tools for it. The goal is to help organize statistical metadata to make the exchange of information easier and more efficient. There is now a very active community that is producing new features like SDMX-CSV. It saves resources, improves quality, and supports the timeliness of data exchange. For transparency, it helps to manage, catalog, and surface metadata through registries. A user can actually take the registry software and install it at his or her agency for free. It helps with comparability since it provides standard exchange mechanisms and structures and is aligned with other standards through this high-level group and the international organization communities.
JUAN MUÑOZ: GENERIC STATISTICAL BUSINESS PROCESS MODEL
Juan Muñoz from INEGI gave the next presentation on how the GSBPM could contribute to transparency and reproducibility of national statistics. GSBPM is a reference model that describes and defines the set of business processes needed to produce official statistics. It provides a standard framework and harmonized terminology to help statistical organizations modernize their statistical production processes, as well as to share methods and components. It is intended to provide a tool to help address the following needs: (1) defining and describing statistical processes in a coherent way, (2) comparing and benchmarking processes within and between organizations, and (3) improving decisions on production systems and the organization of resources. Muñoz said that GSBPM comes from a model designed by Statistics New Zealand and was further developed by the Conference of European Statisticians Steering Group on Statistical Metadata (METIS). Right now they are using the 2013 version. GSBPM is one of the cornerstones of the efforts being made by the United Nations Economic Commission for Europe (UNECE), which is supporting the High-Level Group for the Modernisation of Official Statistics. They have a vision that includes the development of several standards and models that work together to support the modernization of the production of official statistics.
Muñoz pointed out that GSBPM shares an environment with other modernization standards described in this session, including the Generic Activity Model for Statistical Organizations (GAMSO) and the Common Statistical Production Architecture (CSPA). Information objects move from one activity to another, from one subprocess to another, and these information objects are described using the Generic Statistical Information Model (GSIM). Supporting this is CSPA, which is currently under development. It describes how to produce statistical services and systems that are modular and can be shared between organizations, so that one service can be replaced with another while the solution continues to be shared.
DDI and SDMX are two of the standards that work within GSBPM: DDI describes microdata and microdata processing, mostly for internal processes, while SDMX is used for sharing statistical data and metadata and is more associated with aggregate data and data flows.
Muñoz then described the advantages of using GSBPM. One can determine the different phases and subprocesses that are carried out to produce statistics. GSBPM can also show related activities and their scope so that users can identify good practices. Furthermore, because the processes are well documented, they can be shared; this is a first step toward having common solutions for similar processes. In addition, users can realize savings because of this efficiency. Muñoz added that comparisons with other organizations can show how a statistic is produced elsewhere and can indicate a better way to organize the process. In summary, GSBPM provides the following benefits: standardization of methodology; a structure for organizing documentation; promotion of standardization and identification of good practices; an instrument to share knowledge, methods, and tools (a first step toward common solutions for similar processes); facilitation of the use of common tools and methods (efficiency savings); a standard framework for benchmarking; and management of process quality.
There are eight phases of GSBPM: (1) specify needs, determining what the statistical activity is going to produce and the data required; (2) design, covering the development and design of the statistical outputs and methodology; (3) build, covering the construction and deployment of the production system; (4) collect, covering the selection of units from which to obtain data and the collection or extraction of those data; (5) process, working the input data into the target outputs; (6) analyze, examining the data before dissemination; (7) disseminate, releasing statistical products through various channels; and (8) evaluate, assessing the experiences gained from the specific instance of a statistical business process.
Muñoz continued that each phase has subprocesses, which do not have to be followed in a strict order. The user can configure different paths to represent specific instances of the process, because subprocesses can be skipped, repeated, and so on.
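The eight phases and the idea of a configurable path through them can be sketched as a simple lookup. The phase names below follow the published GSBPM; the code itself, including the example path, is only an illustration, not part of any standard.

```python
# The eight GSBPM phases, keyed by phase number.
GSBPM_PHASES = {
    1: "Specify Needs",
    2: "Design",
    3: "Build",
    4: "Collect",
    5: "Process",
    6: "Analyse",
    7: "Disseminate",
    8: "Evaluate",
}

def describe(path):
    """Render a sequence of phase numbers as a readable production path."""
    return " -> ".join(GSBPM_PHASES[p] for p in path)

# Phases and subprocesses need not be visited in strict order; a re-run
# of an existing survey might, for example, skip Design and Build.
print(describe([1, 4, 5, 6, 7, 8]))
```

The point of the model is exactly this kind of shared vocabulary: two agencies can compare the paths they configure because the phase names mean the same thing in both places.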
Continuing, Muñoz cited the following benefits of GSBPM for transparency and reproducibility: (1) it provides a way to diagram or trace the way in which an entity produces statistics; (2) it makes it possible to compare the processes used by different entities; (3) it allows the standardization of production processes for certain kinds of studies to be described; and (4) in combination with other standards, it allows statistical information and its quality to be assessed and examined more efficiently, since tools are provided for these purposes.
If other people have the appropriate input, they can use this structure to compare their output with the output that another agency has. In combination with other standards like GSIM, DDI, and SDMX, one can examine and analyze the quality of the information and maybe find a better way to produce it.
Muñoz closed by answering some frequently asked questions about GSBPM. GSBPM can be used to describe statistical processes based on any kind of input data—survey, administrative records, etc. GSBPM is a practical model that can be and is being applied by many National Statistical Offices. Use of GSBPM does not require a reorganization of one’s office. Muñoz said that GSBPM is a way to document and have some guidance on developing one’s solution to a statistical need. It is used by more than 50 countries and several international organizations.
MICHAELA DENK: GENERIC STATISTICAL INFORMATION MODEL
The next presentation, on the GSIM, was provided by Michaela Denk of the IMF. Denk started with an overview of GSIM, beginning with the statistical data lifecycle. One starts with data and metadata collection, followed by data processing and validation; finally, there are analytics and dissemination. One usually tries to know, understand, and document one's process. The statistical products that flow between these steps are the information objects. It is important to note that these vary between organizations and between the products of an organization, in terminology, granularity, and implementation. This is where GSIM and GSBPM connect to each other. Denk said that although one can use each of these two standards independently, there are benefits to using both, because they were planned for joint use.
Denk continued that GSBPM is about documenting and standardizing one’s processes, and GSIM is about the objects, parameters, and configurations that one could use to document the inputs and outputs of each of the process steps. These information objects in GSIM—and there are about 100 to 120 right now—are grouped into four categories to make it easier to access the model. The two groups with which most people are familiar are the structure and the concept groups, because they concern how to organize data. This includes code lists, datasets, data structures, variable definitions, and the population of interest.
Denk said that GSIM goes a step further in the sense that it also tries to cover the entire statistical business process. For information providers, this entails the channels that users exchange data with, as well as the products that are disseminated. On the business side, it is who defines the need for a new product, how it is defined, how one defines a statistical program, and what all of the elements in that program are. The business group also connects directly with GSBPM in the sense that it has objects that relate directly to the steps in a business process. This allows one to document all of the parameters in the configuration of a certain step that are necessary to carry out that step. For each of the items in the information model, there is a certain description that everybody can understand, a name, a definition, some examples, some synonyms, and a UML model that links all of the different information objects to each other.
Denk added that GSBPM provides a standard framework and a flexible model using harmonized terminology, which describes and defines the set of business processes used to produce official statistics. It is used to help statistical organizations move from topical stove-pipes (product-centric approaches) to process-centric approaches. It defines and describes statistical processes in a coherent way. GSBPM is used to modernize statistical production processes, to compare and benchmark processes within and between organizations, and to share methods and components. Comparing GSBPM and GSIM, GSBPM is about process, identifying activities (subprocesses) that result in information outputs, while GSIM is about the information that flows between those activities, controls them, and documents them. As such, there are inputs, which are any GSIM information object or objects (e.g., datasets, variables) used by a subprocess described by GSBPM, and as a result there are outputs, which are transformed (or new) GSIM information objects. The two are complementary models for documenting the statistical business process and the information that is used in and produced by that process. Denk added that there is also a clickable version of GSIM in which a user can enter any given group or object and drill down to look at all of the descriptions and navigate. The goal of GSIM is to cover all of the information objects needed throughout the process of producing official statistics. It is not a tool but rather a conceptual model at an abstract level; to implement it, it makes sense to see it in context with other related standards.
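The input, subprocess, and output relationship between the two models can be sketched as follows. The object and subprocess names here are invented stand-ins for illustration; neither standard prescribes code like this.

```python
from dataclasses import dataclass

@dataclass
class InformationObject:
    """A stand-in for a GSIM information object (e.g., a dataset or variable)."""
    name: str
    group: str  # which GSIM group the object belongs to

def run_subprocess(subprocess_name, inputs):
    """A GSBPM subprocess consumes GSIM objects and yields transformed ones.

    Hypothetical transformation: tag each output with the subprocess
    that produced it, so provenance is carried along with the object.
    """
    return [
        InformationObject(f"{obj.name} [{subprocess_name}]", obj.group)
        for obj in inputs
    ]

raw = InformationObject("raw dataset", "Structures")
edited, = run_subprocess("5.3 Review and validate", [raw])
print(edited.name)
```

The design point is that a GSBPM subprocess never changes the vocabulary: what goes in and what comes out are both GSIM information objects, so documentation of the step is documentation of a typed transformation.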
GSIM’s information objects come in four categories: (1) business information objects, which are the designs and plans of statistical programs; (2) exchange information objects, which are incoming and outgoing information (e.g., an information provider or an exchange channel); (3) structures, which are the organization and composition of data (e.g., datasets, data structures, and information resources); and (4) concepts that define the data.
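The four groups just listed can be written down as a small mapping. The group names follow GSIM as described above; the example objects placed in each group are illustrative only and far from exhaustive.

```python
# GSIM's four top-level groups, with a few illustrative information
# objects in each (example object names are not exhaustive).
GSIM_GROUPS = {
    "Business": ["Statistical Program", "Business Process"],
    "Exchange": ["Information Provider", "Exchange Channel"],
    "Structures": ["Data Set", "Data Structure"],
    "Concepts": ["Variable", "Population"],
}

def group_of(object_name):
    """Look up which GSIM group an information object belongs to."""
    for group, objects in GSIM_GROUPS.items():
        if object_name in objects:
            return group
    return None

print(group_of("Data Set"))
```

Grouping the hundred-odd objects this way is what makes the model navigable: a user looking for "how data are organized" goes straight to the structures and concepts groups rather than scanning the whole catalog.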
Denk said that GSIM is a reference model, not a software tool, and not all information objects are relevant to every implementation of the model. Denk continued that there are quite a few benefits to using GSIM. First, GSIM provides a reference framework and common terminology that let users speak a common language and understand each other when talking about a business process, because even though people might use the same words, they do not always mean the same thing. These are internationally agreed-on definitions, attributes, and relationships describing the information used in the production of official statistics (information objects). GSIM also enables generic descriptions of the definition, management, and use of data and metadata throughout the statistical business process, which provides needed transparency. It improves communication, coordination, cooperation, and collaboration. This is useful even within an organization, since it helps to supplant thinking and terminology specific to groups that do not interact much, which helps in moving further in the direction of automation. Furthermore, GSIM enables a high degree of automation of the statistical business process, which supports reproducibility. Denk stated that, in addition, GSIM facilitates capacity building in statistical organizations and helps in assessing existing statistical information systems and processes. Denk said that they have started using it when providing technical assistance to national statistical offices or central banks.
Denk added that for an agency thinking about implementing GSIM, it is a conceptual model, so it does not specify any implementation details. It also has no dependence on or reference to any particular tools, so an agency that wants to use GSIM does not have to buy a certain tool or change its whole infrastructure. To implement it at a business level, the agency would look at all of the statistical products it has and all of the objects that are necessary for the final output, and try to map its existing information model to GSIM. If the agency does not yet have an information model, it could map to GSIM simply to have one in a common language. The agency would ideally do this together with GSBPM, documenting not only objects but also agency processes. The technical level should be tackled only once the agency has worked through the business level, and there it would make sense to leverage the Common Statistical Production Architecture (CSPA). The idea is that CSPA brings together all of the standards being developed by the UNECE and partnering organizations, and gives statistical organizations that want to take advantage of these standards a reference, with tools or at least services through which they can communicate at a technical level.1
Denk pointed out that the High-Level Group for the Modernisation of Official Statistics was set up by the UNECE Conference of European Statisticians in 2010 to oversee and coordinate international work relating to statistical modernization, and they are responsible for all of these standards. CSPA covers statistical production across processes defined by GSBPM, provides a practical link between conceptual standards such as GSIM and GSBPM and technical standards such as SDMX, and includes application architecture and technology architecture for the delivery of statistical services. The major aim is international collaboration and sharing. Furthermore, it follows a collaborative approach to develop reusable systems fast and cost-effectively.
During the following floor discussion, Connie Citro said that she had a nodding acquaintance with DDI through the late Pat Doyle, and she is now glad to see these other pieces that provide a structure, set of processes, and common terminologies. What she is really interested in is what U.S. statistical agencies are doing. Participants heard from Dan Gillman about what BLS is doing with the consumer expenditure data, but her impression is that the United States is not as involved with these efforts as statistical agencies in other countries.
Gillman agreed that the United States is not nearly as involved as other countries. Part of the problem may be that the country has a bifurcated statistical operation, whereas all of the other countries have one statistical office.
Sally Thompson pointed out that the BEA is a heavy user of SDMX for balance of payments and national accounts. Her impression is that the U.S. Census Bureau is such a large organization that they have experts providing leadership for a lot of international activities, as well as doing international outreach, so they should take the lead on this.
1 For more information on GSIM, see reference materials at http://www1.unece.org/stat/platform/display/GSIMclick/Clickable+GSIM, which helps to navigate the information model and view object definitions and relationships. Another source is https://statswiki.unece.org/display/gsim/GSIM+Specification, which documents GSIM Version 1.1 and previous versions. For GSBPM and GSIM case studies, see https://statswiki.unece.org/display/GSBPM/Case+Studies+of+Metadata+use+with+GSBPM+and+GSIM [November 2018].
David Barraclough said that, related to SDMX, the United States is not heavily involved in its administration. Regarding GSIM, it is a reference model, and sometimes it is hard to see how it can be used. With respect to CSPA, at a technical level it is like a service-oriented architecture for statistics; to build a CSPA service, the service has to be described using GSIM and GSBPM.
John Abowd took up Thompson's challenge. The National Science Foundation–Census Research Network had one node that specialized in metadata systems, and this effort developed a comprehensive metadata architecture that was DDI compliant. That architecture is being used in parts of the U.S. Census Bureau to create inward- and outward-facing discoverable metadata in a DDI-compliant way. When the Bureau was putting that together, SDMX was offered up by referees of the proposal as an alternative. Abowd complimented the presenters for clarifying the relationships among the standards: they do not need to compete, and as each of the available ones carved out its space, one might have noticed that the U.S. Census Bureau is not an original producer of anything on the list of currently implemented objects. It is a supplier to other parts of the federal statistical system, in particular BEA and BLS via the Current Population Survey. He thinks it would be good at the ICSP level to discuss the extent to which the United States wants to be a more active participant in the international standards. Abowd explained that FCSM is the forum in the United States for discussing whether OMB and the Office of the Chief Statistician will promulgate suggestions or direct recommendations for standards that bring all of the statistical agencies into harmony with respect to which standards they do and do not implement in their systems.
Muñoz added to the conversation about the United States taking part in this effort. He heard a participant say that they are searching for a transformation language for statistical information; there is already a team under the SDMX technical working group that is developing a validation and transformation language. That team is also now working on developing a reference model for the data architecture. Some of the standards that could complement the work being done may therefore already exist, and agencies can make use of these efforts.
Gillman then commented that he has been involved in many of these efforts for quite some time, so the United States has been somewhat involved. He is also heavily involved in the development of DDI Version 4, and there is a relationship between DDI4 and GSIM: DDI can be seen as a profile of GSIM, as GSIM is an informational or conceptual model and DDI is immediately implementable.
Mike Cohen wondered if some of these models are more survey-data driven. Can they accommodate administrative records data and other data sources? Gillman answered that the survey perspective is what most of them know best because that is how they were brought up. They are all aware of the need to bring in administrative data and data from other sources. It is the understanding of the people developing standards like DDI that they can account for those things. He thinks that they actually have not had more than a few applications to prove that, so if one asked him he would say that of course DDI can handle administrative data and so forth, but they really have yet to prove that.
Muñoz said that in the case of GSBPM, there are some practical examples that are published where GSBPM is used for business registers.
Abowd said in response to Gillman that the DDI implementations that are being used at the U.S. Census Bureau include both administrative record and survey data, so for examples, see Vilhuber’s work.
Levenstein commented that there are things that have to be developed in DDI to make it more useful for administrative data, but that is one reason why having the federal statistical agencies involved in these kinds of organizations at this level would be helpful. She closed by noting that having the voices of the federal statistical agencies engaged in these discussions in terms of the development of standards would lead to standards and products that were more useful to the agencies.