Noshir Contractor, Northwestern University, a moderator of the panel on this topic, pointed out that the phrase “multilevel, high-dimensional, evolving, and emerging networks” is a descriptive way of characterizing the social networks that have come to be represented by the term “big data.” He added that the latter term itself is evolving. Initially, he said, the challenges of big data were referred to as the three V’s: volume, velocity, and variety. Now, he explained, researchers have added four more: variability, veracity, visualization, and value. The presentations in this panel examined the advances in and strategies for analysis of networks in which the data involved have these qualities.
Hsinchun Chen, University of Arizona, addressed the topic of “dark networks,” a term that refers to illegal and covert networks. His work on dark networks has encompassed gang and narcotic networks, extremist and terrorist networks, and computer hackers. He noted that he has examined these networks from a data science perspective, drawing on data and text mining and visualization tools, particularly in a multilingual context.
Chen’s initial work in this area led to the development of a crime information-sharing and data mining tool known as COPLINK. He explained that this tool draws on millions of records from multiple databases containing both structured data, such as those from police reports, and unstructured data. He pointed out that even with structured data, false identities are possible. Accordingly, he said, instead of relying on single points of information, the tool draws inferences from relations and multiple reports of similar relations, and not just among people but also between people and locations, vehicles, dates, and so on.
Chen provided the example of an application used in border crossing situations, which he characterized as a high-risk vehicle identification system. The system, he said, collects cross-jurisdictional information from multiple databases. A license plate reader captures a vehicle’s license plate information and within seconds, the data mining tool generates associations between the license plate and other types of information, such as the vehicle’s owner and other Department of Motor Vehicles information, police records, and the context of the crossing (e.g., day, time, other vehicles). From these associations or the absence of associations, Chen explained, the tool predicts whether a particular vehicle crossing is benign or at high risk of narcotic activity.
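The association-based inference Chen describes can be sketched in miniature. In the hypothetical Python sketch below, every database, field, weight, and threshold is invented for illustration; the real system joins cross-jurisdictional records at far larger scale and with far richer relations.

```python
# Hypothetical sketch of association-based risk scoring for a border
# crossing.  All data sources, fields, weights, and the threshold are
# invented for illustration only.

DMV = {"ABC123": {"owner": "o1"}, "XYZ789": {"owner": "o2"}}
POLICE_RECORDS = {"o1": ["narcotics arrest 2019"], "o2": []}
CROSSINGS_LAST_MONTH = {"ABC123": 14, "XYZ789": 2}

def risk_score(plate, hour):
    """Combine associations (and their absence) across sources."""
    owner = DMV.get(plate, {}).get("owner")
    score = 0
    score += 2 * len(POLICE_RECORDS.get(owner, []))   # criminal history
    if CROSSINGS_LAST_MONTH.get(plate, 0) > 10:       # unusual frequency
        score += 1
    if hour < 6:                                      # late-night context
        score += 1
    return score

def classify(plate, hour, threshold=2):
    return "high risk" if risk_score(plate, hour) >= threshold else "benign"
```

The point of the sketch is only that the prediction comes from joining relations across databases rather than from any single record.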
Chen noted that similar data mining techniques have been used in another project to examine terrorist networks and recruiting efforts through social media. In this case, the amount of information is vast (terabytes), but the challenge is that it resides on the dark web, a term referring to Internet sites that cannot be accessed by standard search engines, are encrypted, and can be accessed only by special software. According to Chen, this situation represents a different type of data collection in which all the available information must be collected at once and then stored prior to analysis. He explained that the analysis of these online communications draws on linguistic theories. The project, he elaborated, developed writing signatures for different authors, or “writeprints” (akin to fingerprints), that make it possible to follow an author across different forums even if the person uses different screen names. Chen added that developing these signatures was not an easy task given that writing features (e.g., syntactic and lexical features) had to be analyzed for multiple languages, including Arabic, English, French, German, and Russian. Once the signatures had been developed, he explained, authors with similar messages were identified, and a model of relationships was created. According to Chen, the analysis and tools resulting from this project proved useful for identifying members of terrorist networks who were overt—for example, already appearing on police records for some reason. “Smarter” members, he said, could not be identified by computer analysis alone; human insight was needed.
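A minimal version of the writeprint idea can be illustrated with a single feature type, character trigrams, compared by cosine similarity. The posts below are invented, and this is a drastic simplification; Chen's project combined many syntactic and lexical features across several languages.

```python
import math
from collections import Counter

# Minimal "writeprint" sketch: represent each post as character-trigram
# frequencies and compare posts by cosine similarity.  The posts and
# the single feature type are invented simplifications.

def trigram_profile(text):
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(p, q):
    dot = sum(p[g] * q[g] for g in p.keys() & q.keys())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

# Same hypothetical author under two screen names, plus an unrelated author.
post_a = "we must gather our brothers and act before the deadline"
post_b = "we must gather our supporters and act before the election"
post_c = "lol that new phone is so overpriced, not buying it"

sim_same = cosine(trigram_profile(post_a), trigram_profile(post_b))
sim_diff = cosine(trigram_profile(post_a), trigram_profile(post_c))
# sim_same exceeds sim_diff, linking the two screen names.
```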
Chen’s data techniques were applied more recently to cybersecurity issues (e.g., computer hackers stealing credit card information). In this project, he explained, the object is not only to identify the hackers but also to identify the product or source code involved. But even when searching for computer codes, he said, the process of drawing associations and linking features is important to identifying high-risk subjects. He noted that people who specialize in bank exploits also specialize in cryptology that can be used in ransomware. He added that this project has developed tools with which to search the dark web and online forums for exploitive source codes, tutorials on creating malicious documents, and malware attachments. The malware attachments, he said, indicate the expertise that was needed to create the malware, pointing to particular specialists.
In closing, Chen listed several challenges in using social network analysis for practical purposes, including identifying appropriate data sources (a great deal of open-source information is available on the Internet, but not all of it will be useful); recognizing appropriate nodes and levels as well as appropriate entities to extract (e.g., identities, writeprints); establishing appropriate links (e.g., linked by associations, time and space, or conversations); and tracking changes over time. He proposed that researchers continue to develop tools and methodological foundations for better understanding dark networks, hidden networks, noise, deception, and adversarial intents. He suggested further that advanced tools could improve the comprehensive and timely collection of open-source information; that AI could assist with entity and relationship recognition; and that advanced data analytics could expand research opportunities. Finally, he cited research on adversarial machine learning1 as an area potentially poised for advancement.
Benjamin Golub, Harvard University, gave an overview of some of the research questions often asked about networks and their processes. He also made an argument for simple, physics-inspired models and greater interdisciplinary work going forward.
The kinds of networks Golub considers are groups of agents involved in decision making, whether the decisions involved are productive or adversarial. When thinking about group decision making, he explained, several natural questions arise: Who are the most influential agents in a network? Is a network good at coordinating? How can a network be disrupted (i.e., what interventions work)? He characterized these as scientific questions that can be investigated for different types of groups or networks, from high school students to terrorist organizations.
Work in economics and related fields, Golub continued, has provided insights into these questions. For determining who is influential, he suggested, network science has focused on the concept of centrality, particularly eigenvector measures. For determining how well a group coordinates, he said, the field has focused on features of homophily, cohesion, and segregation. He stated that a rich spectrum of mathematical techniques is available with which to measure these features. With regard to interventions to influence group decision making, he noted that there are some studies, but the literature is not robust.

1 Adversarial machine learning involves investigating ways to incorporate machine learning techniques safely in adversarial settings such as spam filtering and malware detection.
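The eigenvector notion of influence mentioned above, in which an agent's score is proportional to the scores of its neighbors, can be computed with a few lines of power iteration. The star network below is an invented example; libraries such as networkx provide this measure off the shelf.

```python
# Eigenvector centrality by power iteration.  The iteration runs on
# (A + I); the identity shift guarantees convergence even on bipartite
# graphs (such as a star) without changing the eigenvectors.

def eigenvector_centrality(adj, iters=200):
    n = len(adj)
    x = [1.0] * n
    for _ in range(iters):
        x_new = [x[i] + sum(adj[i][j] * x[j] for j in range(n))
                 for i in range(n)]
        top = max(x_new)
        x = [v / top for v in x_new]
    return x

# Star network: node 0 connects to every other node.
adj = [[0, 1, 1, 1],
       [1, 0, 0, 0],
       [1, 0, 0, 0],
       [1, 0, 0, 0]]
centrality = eigenvector_centrality(adj)
# The hub (node 0) gets the highest score; the leaves tie at 1/sqrt(3).
```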
Golub suggested that until more is known about how networks respond to interventions, it is difficult to operationalize the theories behind networks. He listed some of the practical complications involved in conducting research on networks. The first is that networks are adaptive and dynamic and respond to interventions in ways that change the networks’ very nature. Golub offered the example of research on drug trafficking that demonstrated the response of trafficking networks to enforcement measures.2 He added that game theoretic modeling, which takes into account agents who are aware of interventions, has application to adaptive networks.
Golub cited as a second complication that networks are not perfectly observed, for two reasons: (1) random noise, and (2) nonrandom error that results because certain relationships may not be activated. In the latter case, he explained that the activation of relationships within a network is highly dependent on context, so that key relationships may not be activated at the time when data on the network are collected. For example, he said, people do not seek advice all the time even if they have access to a friend who can provide it. In addition, for certain events, some relationships or links will not be activated until the day of the event. Golub characterized this as a fairly standard error, a problem even in mundane contexts and not confined to events considered by the Intelligence Community. He added that econometricians have developed methods with which to adjust for this error in statistical estimation.
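The nonrandom observation error Golub describes can be made concrete with a toy simulation, with all numbers invented: each true advice tie is recorded only if it happens to be activated during the observation window, so naive degree counts are biased downward, and the bias is correctable when the activation rate can be estimated, in the spirit of the econometric adjustments he mentions.

```python
import random

# Toy simulation: ties exist but are recorded only when "activated"
# during the observation window.  All numbers are invented.

random.seed(0)
n_people = 1000
true_degree = 10       # everyone truly has ten advice ties
p_activated = 0.4      # chance a given tie is exercised while we watch

observed_degrees = [
    sum(random.random() < p_activated for _ in range(true_degree))
    for _ in range(n_people)
]
naive_estimate = sum(observed_degrees) / n_people   # ~4, well below 10

# If the activation rate can be estimated (say, from repeated
# observation), the downward bias is correctable:
corrected = naive_estimate / p_activated            # ~10
```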
The final complication cited by Golub was computational costs. Even if enough resources, time, and energy were available to collect massive amounts of information on a network, he explained, the computational power and time needed to run the statistical analyses of this information would be impractical. Thus there are good reasons, he said, to measure only parts of a network.
Golub then turned to the use of models, noting that good models can aid understanding even when only part of a network can be measured. He suggested that the best models with significant scientific impact are “simple, physics-inspired” ones in which intuiting physical or social forces helps in envisioning how a network or system is working. Bad models, on the other hand, are highly combinatorial and algorithmic, and often entail black box operations in which the applied theory of social process is unclear. Golub suggested a focus on useful decompositions as a way to develop successful models, using the analogy of a drum to explain decompositions. The motion of a drum, he said, can be decomposed into characteristic modes or principal oscillations. Any impulse to the drum can be decomposed into these component pieces, which in turn can be understood as contributing to the drum’s vibration. Golub suggested that the same mathematics and decomposition techniques can be applied to networks. Once the important components have been identified through decomposition, he added, it is important to find easy ways to measure these components.

2 Dell, M. (2015). Trafficking networks and the Mexican drug war. American Economic Review, 105(6), 1738–1779.
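The drum analogy can be sketched on the smallest possible network, two connected agents, an example invented here. The adjacency matrix has two modes, agreement [1, 1] and disagreement [1, -1]; any shock splits into these components, each of which evolves independently under repeated interaction x -> A x, and the pieces reassemble into exactly the direct dynamics.

```python
import math

# Modal decomposition of dynamics on a two-agent network: propagate an
# impulse mode-by-mode and check against direct matrix application.

A = [[0.0, 1.0],
     [1.0, 0.0]]
modes = [
    ([1 / math.sqrt(2),  1 / math.sqrt(2)],  1.0),  # agreement, eigenvalue +1
    ([1 / math.sqrt(2), -1 / math.sqrt(2)], -1.0),  # disagreement, eigenvalue -1
]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def evolve_by_modes(x, t):
    """Propagate an impulse t steps mode-by-mode, then reassemble."""
    out = [0.0] * len(x)
    for v, lam in modes:
        coeff = sum(a * b for a, b in zip(x, v)) * lam ** t
        out = [o + coeff * vi for o, vi in zip(out, v)]
    return out

impulse = [1.0, 0.0]                                # shock agent 0 only
direct = matvec(A, matvec(A, matvec(A, impulse)))   # three direct steps
spectral = evolve_by_modes(impulse, 3)              # same answer via modes
```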
According to Golub, what is needed in the future is the development of robust statistical models of individual network components. This can be accomplished, he argued, by training people and creating more interdisciplinary research teams at the intersection of game theory, physics, and statistics. He elaborated that those with expertise in game theory and physics can help build models that incorporate reasonable decision making and laws of motion and decomposition, while those with statistical expertise can ensure that the models are robust to sampling and account for systematic bias in the network links.
In closing, Golub reiterated that much can be understood about networks, their processes, and their response to interventions without extensive data collection. It is efficient and meaningful, he asserted, to measure just a part of a network—the part determined to be affected by interventions of interest. Expanding on this thought, he offered the analogy of functional magnetic resonance imaging experiments and suggested that useful science will test interventions so as to “shine something at the system [or network] and see how it vibrates.”
Alexander Volfovsky, Duke University, spoke about the challenges of using statistical techniques to study causal relationships within networks. He briefly considered a network of people, modeling approaches, and the types of experiments that would be run on such a network. Volfovsky suggested that the different nodes could represent different attributes of people, and a model could be created to explain what types of attributes link people together as friends. Once such a model has been created, he said, it can be used to predict the friendship links that will develop when a new person enters the network or reactions within the network when a certain treatment is applied to the model.
Volfovsky reported that researchers often cluster the people or nodes in a network into groups or communities in some way to detect something of interest inside the network. He discussed some of the challenges of detecting these communities. First, communities are frequently based on more than one attribute. Volfovsky explained that simple models, such as the stochastic block model, may not account for more than one attribute, and that adding complexity to these models just increases the computational time, which at some point becomes impractical. He cited spectral methods as a simpler approach that would yield the same or approximately the same results. However, he asserted, even these methods are limited in scalability. He argued that new tools are needed to address very large networks.
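The spectral approach can be demonstrated on an idealized two-block network: the expected adjacency matrix of a stochastic block model with invented weights (within-block ties 0.9, between-block ties 0.05). The sign pattern of the second eigenvector recovers the two communities; pure-Python power iteration with deflation suffices here, though work at scale would use sparse linear algebra.

```python
# Spectral community detection on the expected adjacency of a
# two-block stochastic block model (invented weights).

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def power_iteration(A, x, iters=500):
    for _ in range(iters):
        y = matvec(A, x)
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    eigenvalue = sum(a * b for a, b in zip(x, matvec(A, x)))
    return eigenvalue, x

n, half = 20, 10
A = [[0.9 if (i < half) == (j < half) else 0.05 for j in range(n)]
     for i in range(n)]

lam1, v1 = power_iteration(A, [1.0] * n)
# Deflate the leading mode, then extract the second eigenvector.
A2 = [[A[i][j] - lam1 * v1[i] * v1[j] for j in range(n)] for i in range(n)]
lam2, v2 = power_iteration(A2, [1.0 if i < half else -1.0 for i in range(n)])

communities = [x >= 0 for x in v2]  # True for one block, False for the other
```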
Describing a second challenge, Volfovsky pointed out that communities are more difficult to detect as they become more interconnected. He noted that algorithms exist for detecting communities with no or weak interconnections, but that current algorithms have difficulty detecting communities when strong connections exist among people in different communities. He noted there are some models for such situations, but they are computationally expensive.
Volfovsky described a study using data from the National Longitudinal Study of Adolescent to Adult Health (AddHealth) to illustrate the limitations of simple models and the need for developing models that directly account for the data collection process. As part of the AddHealth study, American high school students were asked to identify their top five friends. Volfovsky and colleagues used the resulting dataset to connect network characteristics to individual behavior, that is, to understand factors contributing to friendships and whether such relationships could be predicted. Volfovsky explained that the use of latent variable models is the standard in statistics for this type of investigation. However, he added, a number of assumptions are made in using these models, one of which is that the data reflect correct observations: in other words, the models do not account for the possibility that the data may be ranked or in some way censored. He noted that the AddHealth survey did indeed censor some data by limiting respondents to five friends (i.e., students with five or fewer friends had their friends recorded, whereas the data on students with more than five friends were limited). Volfovsky and colleagues incorporated this information on how the data were collected into their model of network characteristics and friendships, introducing a likelihood that accommodated the ranked and censored nature of the data and allowed for unbiased estimation of regression effects. Their model was able to predict how many friends a person would have if allowed to name as many friends as desired.
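A toy version of the censoring problem illustrates the idea: respondents may name at most five friends, so the recorded value is min(true count, 5). The numbers below are invented (true counts drawn from Poisson(6)), and the actual model is far richer, handling ranked nominations and regression effects; the sketch shows only that a likelihood modeling the cap recovers the truth while the naive mean is biased low.

```python
import math
import random
from collections import Counter

# Simulated survey with a five-friend cap, then a censoring-aware
# maximum likelihood fit.  All numbers are invented.

random.seed(1)

def poisson_sample(lam):
    """Knuth's multiplication method for Poisson draws."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

true_counts = [poisson_sample(6.0) for _ in range(2000)]
observed = [min(y, 5) for y in true_counts]   # survey caps at five
counts = Counter(observed)

naive_mean = sum(observed) / len(observed)    # biased low (~4.5, not 6)

def pmf(y, lam):
    return math.exp(-lam) * lam ** y / math.factorial(y)

def log_lik(lam):
    """Likelihood modeling the cap: an observed 5 means 'five or more'."""
    tail = 1.0 - sum(pmf(y, lam) for y in range(5))   # P(Y >= 5)
    ll = counts[5] * math.log(tail)
    for y in range(5):
        if counts[y]:
            ll += counts[y] * math.log(pmf(y, lam))
    return ll

# Grid-search maximum likelihood; accounting for censoring recovers ~6.
grid = [1 + 0.01 * i for i in range(1101)]
mle = max(grid, key=log_lik)
```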
Volfovsky noted that this type of modeling can also assist in developing an understanding of causality within networks by informing the development of better experiments. Such experiments, according to Volfovsky, are currently the best techniques for examining causal relationships within a network. He argued that research advances in this area would require determining how to combine observational data with experimental results. He pointed to the body of literature in econometrics and statistics on drawing causal inferences from observational and experimental data, but for much simpler contexts, and suggested that researchers need to build on this work to expand statistical techniques for applicability to complex network structures.
Volfovsky cited the example of an effort to understand the efficacy of isolation as treatment for influenza-like illnesses. He argued that the classic approach of subtracting the average outcome for controls from the average outcome for the treated was limited. New tools are emerging, he said, with which to estimate causal effects in networks, which is a substantially more difficult problem since networks likely influence both the outcomes and the way in which the outcomes are observed.
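The limitation of the classic estimator can be shown in a toy simulation of interference, with all numbers invented: infection risk depends on one's own treatment and on whether one's single contact is treated. Spillovers onto controls make the treated-versus-control difference understate the policy-relevant contrast of treating everyone versus no one.

```python
import random

# Difference in means under interference: each person is paired with
# one contact, and risk depends on both treatments (invented numbers).

random.seed(42)
n = 100_000                                  # individuals, in pairs

def risk(own_treated, contact_treated):
    return 0.6 - 0.3 * own_treated - 0.2 * contact_treated

treated = [random.random() < 0.5 for _ in range(n)]
outcomes = []
for i in range(n):
    partner = i + 1 if i % 2 == 0 else i - 1  # person 2k pairs with 2k+1
    outcomes.append(random.random() < risk(treated[i], treated[partner]))

n_treated = sum(treated)
mean_t = sum(y for y, t in zip(outcomes, treated) if t) / n_treated
mean_c = sum(y for y, t in zip(outcomes, treated) if not t) / (n - n_treated)
naive = mean_t - mean_c                       # ~ -0.3

# The policy-relevant contrast, treating everyone versus no one:
global_effect = risk(1, 1) - risk(0, 0)       # exactly -0.5
# Spillovers onto controls mask part of the effect in the naive estimate.
```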
In closing, Volfovsky suggested three areas that need to be addressed in the near future: substantive network challenges (i.e., better understanding of positions, relationships, and trigger points in a network); statistical techniques for addressing uncertainties in observed networks; and engineering solutions to the current computational expense of available models.