Multiview Embedding with Partial Labels to Recognize Users of Devices Based on Unified Transformer

Ren, Yimo; Li, Hong; Liu, Peipei; Liu, Jie; Zhu, Hongsong; Sun, Limin

doi:https://doi.org/10.1155/2023/3138551

International Journal of Intelligent Systems

Research Article | Open Access

Volume 2023 | Article ID 3138551 | https://doi.org/10.1155/2023/3138551

Academic Editor: Mohammad R. Khosravi

Received 13 Nov 2022

Revised 05 Dec 2022

Accepted 14 Dec 2022

Published 25 Apr 2023

Abstract

Recognizing the users of devices (or clusters of devices) that use IP addresses as unique identities on the Internet enables numerous security applications. Fast and accurate user recognition is critical for supervisors who need to find the affected organizations connected to their networks when new security threats emerge. Much user information is scattered across the multisource data of IP addresses. User recognition of devices currently faces two main problems. On the one hand, existing methods cannot fully use the multisource data of IP addresses and waste the valuable information in labels. On the other hand, only a tiny portion of devices can be manually tagged with high-confidence known users, so inferring the unknown users of devices is an urgent need. The problem of user recognition is therefore to infer the unknown user of a device from multisource data and from existing devices with known users. This paper proposes a multiview fusion method to deal with multisource data from devices with a small number of manually labelled samples. The paper uses GraphSAGE to obtain a representation of IP addresses and designs a label encoder to fully use the small number of devices with known users. Then, the paper builds a unified transformer to determine whether two devices have the same user. Real-world experiments show that the proposed method achieves 0.9158 accuracy and 0.6131 F1 in finding devices with the same users on the constructed real-world dataset.

1. Introduction

An Internet Protocol (IP) address is a unique identifier representing a device or a cluster of devices on the Internet. Thus, the user recognition of devices can be reduced to the user recognition of IP addresses. Recognizing the users of devices can enable numerous security applications. For example, when a new vulnerability is exposed, network supervisors can quickly identify the affected users connected or related to the networks they are responsible for and warn them to reduce potential losses. Many studies [1–3] use the traffic between devices to trace the source of attacks, namely user recognition of the attacking device, but studies on the user recognition of devices on the Internet in general are few. Querying public databases, such as the WHOIS of IANA [4], ASO via the BGP protocol [5], and rDNS (reverse domain name system) via the DNS protocol [6], is a common way to identify the users of large-scale devices. However, it has limitations: (1) many organizations recorded in the databases are the registrants or operators of IP addresses, and Internet service providers or cloud service providers, rather than the actual users, own most IP addresses; (2) IP addresses can be sublet or sold, so the registered organizations may not be the actual users, and it is difficult to learn who actually uses public devices on the Internet.

It may seem easy to find device users through public databases such as WHOIS and DNS [7]. Nevertheless, these methods can only return the registrants of devices, who are often not their users. There are some related methods for identifying the users of devices. AIWEN [8], a commercial company, links devices with users by analyzing companies’ official websites and the corresponding IPs. The method is referenced by commercial IoT search engines such as Zoomeye (https://www.zoomeye.org/). However, that method only uses websites instead of the multisource data of IPs, resulting in low accuracy and coverage. Other studies focus on user recognition in a different sense, such as person identification [9] in cameras, which differs from the user recognition of devices on the Internet.

By analyzing the public attributes of devices on the Internet, we observed that much user information is scattered across the multisource data of IP addresses, as shown in Table 1. For example, the “cptdc” in the rDNS of the first user is the abbreviation of “China Petroleum Technology and Development Corporation,” which is not recorded in ASO. The corresponding ASO is “China Unicom Beijing Province Network,” which is the device’s operator or Internet service provider. Because manual labelling is costly, only a tiny portion of device users can be labelled by experts. Therefore, an effective way to realize user recognition of devices is to calculate the similarity between a device with an unknown user and a device with a known user and judge whether the same user uses them. However, devices are usually deployed in various environments, which makes their data multisource and sparse, so it is challenging to recognize the users of devices well.

The knowledge graph effectively manages large-scale and multisource data, which is suitable for organizing devices’ data on the Internet. The graph representation is used to fuse and utilize the multisource data and express the graph as vectors of nodes and links to facilitate subsequent tasks. In the past decade, graph neural network (GNN) has realized many successful practices in graph representation for homogeneous and heterogeneous graphs. The typical models for homogeneous graphs are to use random walk [10] to generate and vectorize the nodes sequence of the graphs, such as Node2Vec [11] and Struct2Vec [12]. Furthermore, one of the classical paradigms for heterogeneous graphs is to define and use meta paths, such as Metapath2Vec [13]. However, these graph representation methods face several issues. First, the methods on homogeneous graphs only consider the relations between only one kind of node, and others are transformed into node features, making it hard to learn fine-grained representations of nodes when node features are too sparse. Second, most methods on heterogeneous graphs need specific knowledge to design meta paths for each type of graph. Third, most existing methods could not learn the representation thoroughly and efficiently under the web-scale Internet data of devices. Finally, most of the methods above do not consider the known partial labels in the face of specific tasks, making the effect of recognizing users slightly insufficient.

To improve the performance of user recognition of devices, this paper proposes MVEPL: multiview embedding with partial labels to recognize users of devices on the Internet based on a unified transformer. To integrate the multisource data of devices, MVEPL constructs multiview graphs of devices and uses GraphSAGE to obtain the node embeddings as the devices’ representation. To fully use the devices with known users, MVEPL calculates a label embedding with the label encoder and uses it as an extra feature of devices. To judge precisely whether the same user uses two devices, MVEPL trains the unified transformer on the features of devices based on the known labels. The paper also conducts experiments and finds that MVEPL achieves 0.9158 accuracy and 0.6131 F1 in finding devices with the same users on the constructed real-world dataset.

Our contributions are summarized as follows:

(1) This paper adopts a multiview method to effectively integrate the multisource data of devices, such as AS and DNS, with IP addresses as identities, and obtains their representation inductively with GraphSAGE.

(2) To fully use the small number of devices with known users, this paper proposes a label encoder that introduces label information early to obtain fine-grained embeddings of devices.

(3) To realize the user recognition of devices on the Internet, this paper designs a unified transformer that improves the similarity calculation between the embeddings of two devices to judge whether the same user uses them.

2. Motivation

2.1. Problem Statement

Our work is motivated by several well-known and influential cyber security incidents, such as the Apache Log4j vulnerability [14]. Log4j is a widely used tool for gathering log information and is deployed on many websites. The Log4j vulnerability allows attackers to execute code remotely to control the websites. According to media reports, the vulnerability affected many companies using Internet services, including Apple, Amazon, IBM, Microsoft, and Twitter. After regulators and security researchers detect the affected websites, the users should be promptly notified to protect against potential attacks.

However, academia and industry still lack adequate methods to identify the actual users of large-scale devices worldwide. Motivated by the fact that much information about users is embedded in the multisource data of devices, we identify the user by fusing the scattered information.

2.2. Challenge

It is very challenging to recognize the users of devices on the Internet due to the following reasons.

Handling the multisource data of devices on the Internet requires a method that can embed the devices well. The device data we use include rDNS [15], AS [16], subnet [17], location [18], and hardware/software [19]. The coverage of each kind of data differs and is sparse, so existing methods do not perform well.

For example, rDNS consists of pieces of text split by dots, such as .crawl.baidu.com. AS and subnet values are identifiers such as AS8075 and 104.245.188.0/24. Hardware/software values are also texts, such as Firewall and WebContainer. In practice, the coverage of rDNS is much lower than that of AS, making it difficult to integrate and use the multisource data of devices.
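
The raw forms mentioned above can be normalized with standard-library tools before any embedding step. The following is a minimal sketch, not the preprocessing used by MVEPL; the helper names `rdns_tokens`, `asn_number`, and `same_subnet` are hypothetical and chosen only for illustration.

```python
import ipaddress


def rdns_tokens(rdns):
    """Split an rDNS string on dots and drop empty pieces."""
    return [part for part in rdns.lower().split(".") if part]


def asn_number(asn):
    """Turn a string such as 'AS8075' into the integer 8075."""
    text = asn.strip().upper()
    if text.startswith("AS"):
        text = text[2:]
    return int(text)


def same_subnet(ip, subnet):
    """Check whether an IP address falls inside a CIDR subnet."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(subnet)


print(rdns_tokens(".crawl.baidu.com"))
print(asn_number("AS8075"))
print(same_subnet("104.245.188.7", "104.245.188.0/24"))
```

Missing sources can be represented as empty token lists or `None`, which keeps the sparsity of each source explicit for later steps.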

Currently, most existing graph representation methods use label information only when constructing task-specific models. However, because measuring the users of devices is difficult, very few devices have known users, which makes this a semisupervised problem. To use the label information as much as possible and improve the performance of user recognition, this paper uses label information as an added embedding of devices. When the labels of devices are used both as an added embedding and for calculating the losses in the testing process, the performance improvement is meaningless; this is called “evaluation crossing.” How to use label information in advance while avoiding “evaluation crossing” is therefore also a big challenge.

3. Related Work

GNN is a kind of technology in which the model obtains information from homogeneous and heterogeneous graphs. In recent years, GNNs have created many successes for tasks with relational data, such as node classification and link prediction on the open academic graph [20]. The main idea of GNN is to learn the representation of nodes or links by structure and other graph attributes using a neural network. Node2Vec [11] and Struct2Vec [12] use the random walk to generate and vectorize the graphs’ nodes’ sequences to get the nodes’ representation. Nevertheless, Node2Vec and Struct2Vec could only extract the structure features of the graph. GCN [21] and GAT [22] aggregate the features of neighbour nodes by Laplacian matrix and attention scores, respectively. At the same time, the GCN and GAT are too transductive to vectorize the new nodes to the graph. GraphSAGE [23] is more inductive than others because it learns how to aggregate the features to get the node representation instead of the node representation directly. UniMP [24] uses the transformer [25] to extract deeply abstract features of nodes and partial labels of nodes to predict others. All the GNN methods above are designed for graphs with only one kind of node and link, namely homogeneous graphs. So, they cannot model the graphs with more than one kind of node or link, namely heterogeneous graphs. Metapath2Vec [13] defines meta paths and considers the types of nodes and links when generating and vectorizing the nodes’ sequence by a random walk. In contrast, the performance of Metapath2Vec depends on the design of the meta path. RGCN [26] and RGAT [27] also consider the types of nodes and links when aggregating the features of neighbour nodes. Furthermore, HAN [28] gives weights to different link types and to neighbour nodes in the same type of link.

Nevertheless, RGCN, RGAT, and HAN do not consider the implied relations between different types of nodes and links, making it hard to get a fine-grained representation of nodes or links. HGT [29] imitates the structure of a transformer to learn the representation of nodes with different types. However, HGT ignores the benefit of using nodes with known labels to recognize nodes with unknown labels. Therefore, existing GNN methods either cannot be migrated or do not perform well in obtaining the fine-grained embedding of devices for user recognition of devices on the Internet.

3.1. Texts Similarity Model

The text similarity model (TSM), which calculates the similarity of two input texts, is one method that can be easily migrated to judge whether the same user uses two devices. The main idea of TSM is to obtain the representations of the two input texts separately and learn their similarity in a supervised manner. Typical and state-of-the-art TSMs in recent years are ABCNN [30], SiaGRU [31], ESIM [32], BIMPM [33], and RE2 [34]. ABCNN and SiaGRU fully learn the context information inside texts with a CNN and a GRU, respectively, but rarely learn the interactive information between texts. ESIM and BIMPM consider the interactive information between texts at the input part of the models but are limited in processing long input sequences. RE2 realizes fast and accurate text matching with enhanced residual blocks, but it mainly depends on the context information of the input texts.

Therefore, existing TSM methods either cannot be migrated or do not perform well when judging whether the same user uses two devices, that is, when inferring the user of one device from the known user of another.

4. Methodology

In this section, MVEPL is introduced, as shown in Figure 1. First, this paper constructs multiview graphs on the base graph derived from the CAIDA topology, using the separate embeddings of rDNS, AS/subnet, location, and hardware/software as node features. Second, by using the label encoder and training GraphSAGE on the graphs, the representation of devices is obtained inductively, including the embeddings of the partial devices with known users. Finally, a unified transformer is designed and used to calculate the similarity between the embeddings of two devices to judge whether the same user uses them. At the same time, the paper randomly masks the users of some devices and predicts their labels to ensure that no label information is leaked in advance.

4.1. Separate Embeddings

The separate embeddings layer vectorizes the multisource data preliminarily. The vectorization method differs for data from different sources. MVEPL chooses four features that may be related to users: rDNS, AS/subnet, location, and hardware/software of devices. The following are partial analyses of the chosen data:

4.1.1. rDNS

rDNS resolves an IP address to a domain name. The first-level (suffix) and second-level (domain) labels of rDNS are extracted, counted, and sorted to build the embeddings. Then, we visualized the top 10 suffixes and domains for analysis, as shown in Figure 2.

It is found in Figure 2 that most of the suffixes represent countries and regions, such as de, fr, etc., or the type of industry to which the domain name belongs, such as .net for network providers and .com for commercial enterprises. Moreover, Figure 2 also shows that the domains may contain some information about their users, such as Amazon, Comcast, etc. However, some domains cannot directly expose the relevant users, such as rr.
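
The extract, count, and sort step described above can be sketched as follows. This is only an illustration under simple assumptions: the suffix is taken as the last dot-separated label and the domain as the second-to-last one, which mishandles multi-label suffixes such as `co.uk`; the function names and the sample records are hypothetical.

```python
from collections import Counter


def suffix_and_domain(rdns):
    """Return (suffix, domain): the last and second-to-last labels."""
    labels = [p for p in rdns.lower().strip(".").split(".") if p]
    if len(labels) < 2:
        return None  # not enough labels to separate suffix and domain
    return labels[-1], labels[-2]


def top_counts(rdns_records, k=10):
    """Count suffixes and domains and return the k most common of each."""
    suffixes, domains = Counter(), Counter()
    for record in rdns_records:
        parsed = suffix_and_domain(record)
        if parsed is None:
            continue
        suffix, domain = parsed
        suffixes[suffix] += 1
        domains[domain] += 1
    return suffixes.most_common(k), domains.most_common(k)


# Hypothetical sample records, for illustration only.
records = [
    "spider.crawl.baidu.com",
    "ec2-1-2-3-4.compute.amazonaws.com",
    "host.example.de",
]
top_suffixes, top_domains = top_counts(records)
print(top_suffixes)
print(top_domains)
```

The resulting rank or frequency of each suffix and domain could then serve as a categorical index for an embedding table.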

4.1.2. AS

An autonomous system (AS) on the Internet is a group of networks with the right to decide autonomously which routing protocol is used within the system. This group can be a simple network or several networks controlled by one or more administrators, such as a university, an enterprise, or an individual company. Each AS is assigned a globally unique number, usually called the ASN, and the administrator of an AS is called the ASO.

This paper collects about 100,000 ASNs and their corresponding ASOs. To analyze the usability of AS to recognize the users, the largest 10 AS are listed in Table 2.

The number of IP addresses covered by each AS is different and distributed in the power of , which is relatively large for many users. Also, it can be seen from Table 2 that some ASOs are organizations that may use the IP addresses in AS. For example, the ASO of AS16509 is AMAZON-02, and Amazon probably uses its IP address. While the ASO of AS4134 is CHINANET-BACKBONE No. 31, Jin-Rong Street, CN, it is China’s public backbone Internet. Therefore, the IP addresses in AS4134 are likely to be used by many users. In a word, users can be mined but cannot be determined entirely with AS.

4.1.3. Label Encoder

The early use of label information can make it more convenient to determine whether the users of devices are the same [35, 36]. For example, once the user of one device is known, we can determine more precisely whether the other device has the same user. In this way, the problem is divided into two easier cases: one in which some of the users of the devices are known, and one in which all of their users are unknown.

The paper proposes to embed the partially observed labels into the same space as node features, which consist of the label embedding vector for labelled devices and zero vector for unlabelled devices. Unlike the UniMP, the nodes of the heterogeneous graph are various, and devices with IP as identity may be labelled with users. Therefore, the paper uses a randomly initialized linear projection to propagate the label embedding vector from IPs to other nodes, as follows:

For node , represents the node type of ; is the linear projection of trained in subsequent classification; are the neighbour devices of ; is the original features; and is the label embedding vector of node . The paper concatenates and as the final features . Furthermore, only the devices with unknown users could get the label embedding vector by propagation, and linear projection are the attention scores of label embedding based on different users.
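
The feature construction described in this subsection (a label embedding vector for labelled devices, a zero vector for unlabelled ones, concatenated with the original features) can be sketched as below. The sketch deliberately leaves out the propagation of label embeddings through the linear projection; the randomly initialized table stands in for a trainable embedding, and all names and dimensions are assumptions made for illustration.

```python
import random


def build_label_table(users, dim, seed=0):
    """Randomly initialize one embedding vector per known user."""
    rng = random.Random(seed)
    return {
        user: [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        for user in sorted(set(users))
    }


def features_with_labels(node_features, node_users, label_table, dim):
    """Concatenate original features with a label embedding or zeros."""
    out = {}
    for node, feats in node_features.items():
        user = node_users.get(node)
        if user is not None:
            label_vec = label_table[user]  # labelled device
        else:
            label_vec = [0.0] * dim  # unlabelled device
        out[node] = list(feats) + list(label_vec)
    return out


# Hypothetical toy input: two devices, only one with a known user.
features = {"ip1": [1.0, 2.0], "ip2": [3.0, 4.0]}
known_users = {"ip1": "userA"}
table = build_label_table(known_users.values(), dim=3)
final = features_with_labels(features, known_users, table, dim=3)
print(final["ip1"])
print(final["ip2"])
```

Because labelled and unlabelled nodes end up with vectors of the same length, the same downstream layers can process both.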

4.2. Multiview Embeddings

Multiview embeddings use GraphSAGE to obtain representations of the graphs built from the base graph, which is derived from the CAIDA topology [37], with the separate embeddings from the multiple sources of IPs as node features.

4.2.1. Base Graph

MVEPL constructs the base graph using the CAIDA topology, provided by CAIDA (https://www.caida.org/catalog/datasets/ipv4_routed_24_topology_dataset/). This dataset is designed for studying the topology of the Internet. The dataset is collected by a globally distributed set of monitors (https://www.caida.org/projects/ark/). So, a base graph can be constructed by analyzing CAIDA data to obtain the links between the IP addresses.
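
One plausible way to turn measured paths into IP-to-IP links is to connect consecutive responding hops. The sketch below assumes the paths have already been parsed into lists of IP strings, with `None` marking a hop that did not respond; it does not parse the actual CAIDA file format, and the addresses are made up for illustration.

```python
from collections import defaultdict


def build_base_graph(paths):
    """Build an undirected adjacency map from lists of hop addresses."""
    adjacency = defaultdict(set)
    for hops in paths:
        for a, b in zip(hops, hops[1:]):
            # Skip gaps (non-responding hops) and self-loops.
            if a is None or b is None or a == b:
                continue
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


# Hypothetical parsed paths.
paths = [
    ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    ["10.0.0.2", None, "10.0.0.4"],
]
graph = build_base_graph(paths)
print(sorted(graph["10.0.0.2"]))
```

Skipping links across a non-responding hop is a conservative choice: it avoids inventing adjacency that was not observed, at the cost of a sparser graph.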

4.2.2. GraphSAGE

Most existing graph embedding methods require all the nodes in the graph to be processed simultaneously, also called transductive, and cannot be naturally generalized to unseen nodes. Therefore, MVEPL chose GraphSAGE, an inductive framework that can efficiently and conveniently get the embedding of new nodes to collect multisource data of IP addresses and label embedding of users. For each view graph, the critical equations of GraphSAGE are:

$$h_{N(v)}^{k} = \mathrm{AGGREGATE}_{k}\left(\left\{h_{u}^{k-1}, \forall u \in N(v)\right\}\right),$$

$$h_{v}^{k} = \sigma\left(W^{k} \cdot \mathrm{CONCAT}\left(h_{v}^{k-1}, h_{N(v)}^{k}\right) + b^{k}\right),$$

where $h_{v}^{k}$ is the embedding of node $v$ in layer $k$; $N(v)$ represents the neighbours of node $v$; and $W^{k}$ and $b^{k}$ represent the weights and bias of GraphSAGE.
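
To make the aggregation step concrete, the following is a minimal pure-Python sketch of one GraphSAGE layer. It assumes a mean aggregator, a ReLU nonlinearity, and the L2 normalization used in the original GraphSAGE algorithm; the source does not state which aggregator MVEPL uses, and the weights here are fixed toy values rather than trained parameters.

```python
import math


def mean_vectors(vectors, dim):
    """Element-wise mean of a list of vectors (zeros if the list is empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def sage_layer(h, neighbours, weight, bias):
    """One layer: aggregate neighbours, concatenate, project, ReLU, normalize.

    h: dict mapping node -> feature list of length d
    weight: list of output rows, each of length 2 * d
    bias: list with one value per output row
    """
    dim = len(next(iter(h.values())))
    new_h = {}
    for v, h_v in h.items():
        agg = mean_vectors([h[u] for u in neighbours.get(v, [])], dim)
        concat = list(h_v) + agg
        z = [
            sum(w * x for w, x in zip(row, concat)) + b
            for row, b in zip(weight, bias)
        ]
        z = [max(0.0, val) for val in z]  # ReLU
        norm = math.sqrt(sum(val * val for val in z)) or 1.0
        new_h[v] = [val / norm for val in z]
    return new_h


# Toy graph with hypothetical features and weights.
h0 = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
nbrs = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
W = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
b = [0.0, 0.0]
h1 = sage_layer(h0, nbrs, W, b)
print(h1["a"])
```

Because the layer only depends on a node's own features and those of its neighbours, the same `weight` and `bias` can be applied to a node that was not present during training, which is the inductive property the text refers to.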

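The graph-based loss function encourages adjacent nodes to have similar node representations while making the representations of separated nodes as distinguishable as possible. The loss function is as follows:

$$J(z_{u}) = -\log\left(\sigma\left(z_{u}^{\top} z_{v}\right)\right) - Q \cdot \mathbb{E}_{v_{n} \sim P_{n}(v)} \log\left(\sigma\left(-z_{u}^{\top} z_{v_{n}}\right)\right),$$

where $z_{u}$ is the final representation of node $u$; $v$ is the neighbour that appears near $u$; and $P_{n}$ and $Q$ are the probability distribution and the number of negative samples, respectively.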
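
A small numerical sketch of this kind of loss is given below, following the unsupervised loss of the GraphSAGE paper: one positive neighbour and a list of negative samples per node. The expectation over negatives is estimated by summing over the drawn samples, which is an assumption of this sketch; the vectors are toy values.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def graph_loss(z_u, z_pos, z_negs):
    """Pull z_u towards z_pos and push it away from each negative sample."""
    positive_term = -log_sigmoid(dot(z_u, z_pos))
    negative_term = -sum(log_sigmoid(-dot(z_u, z_n)) for z_n in z_negs)
    return positive_term + negative_term


# A nearby neighbour with a similar vector gives a lower loss
# than a neighbour pointing in the opposite direction.
good = graph_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
bad = graph_loss([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]])
print(good, bad)
```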

4.2.3. Unified Transformer

There are some studies [38, 39] using the cluster method on the graphs to divide nodes into different groups, which seems to be useful in user recognition of devices. In contrast, MVEPL aims to judge whether the users of the two devices are the same by the similarity of their embeddings, so as to improve the accuracy of user recognition. These two devices are called query and context in information retrieval. The model based on transformer has achieved state-of-the-art performance in feature extraction and representation in natural language processing. The transformer consists of self-attention and a feed-forward neural network. The self-attention is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_{k}}}\right) V,$$

where $Q$, $K$, and $V$ could be the matrices derived from the input features $X$, and the dimension is $d_{k}$. Generally, $Q$, $K$, and $V$ are the same matrix. When $K$ and $V$ are the same matrix from the query and $Q$ is the matrix from the context, the attention is called coattention.
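
The difference between self-attention and coattention can be illustrated with scaled dot-product attention as defined in the transformer paper [25]. The sketch below is pure Python and omits the learned projections and the multihead structure; which device supplies the queries and which supplies the keys and values in `co_attention` is an assumption of this sketch.

```python
import math


def softmax(row):
    """Softmax over a list of scores, shifted for numerical stability."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(keys[0])
    out = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def self_attention(x):
    """Queries, keys, and values all come from the same device."""
    return attention(x, x, x)


def co_attention(x_queries, x_other):
    """Queries come from one device, keys and values from the other."""
    return attention(x_queries, x_other, x_other)


device_a = [[1.0, 0.0], [0.0, 1.0]]
device_b = [[0.5, 0.5]]
print(self_attention(device_a))
print(co_attention(device_a, device_b))
```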

Based on the attention defined above, the transformer and cotransformer are built in multihead ways. In order to better extract the interactive information between query and context, MVEPL built a cotransformer to integrate their embeddings. Then, the embeddings after transformers are:where represents the embedding of query device and represents the embedding of the context device.

After getting the final representations of devices from the unified transformer, MVEPL constructs a Siamese model and uses pairs of devices to train and test. In Siamese [31], the loss function used is the contrastive loss, which can effectively deal with the relationship of paired devices. The expression of contrastive loss is as follows:

$$L = \frac{1}{2N} \sum_{n=1}^{N} \left[ y_{n} d_{n}^{2} + \left(1 - y_{n}\right) \max\left(\mathrm{margin} - d_{n}, 0\right)^{2} \right],$$

where $d$ represents the Euclidean distance of the two sample features, $y$ is the label of whether the two devices have the same user, margin is the preset threshold, and $N$ is the number of sample pairs.

MVEPL trained the Siamese by minimizing the contrastive loss based on the final embedding of the query device and context device to determine whether they belong to the same user.
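
A minimal sketch of a contrastive loss over device pairs follows. It uses the common form with a squared distance for same-user pairs, a squared hinge on the margin for different-user pairs, and a 1/(2N) normalization; the exact normalization and the margin value used by MVEPL are not given here, so both are assumptions.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(pairs, margin=1.0):
    """pairs: iterable of (embedding_a, embedding_b, same_user in {0, 1})."""
    pairs = list(pairs)
    total = 0.0
    for emb_a, emb_b, same_user in pairs:
        d = euclidean(emb_a, emb_b)
        if same_user:
            total += d * d  # same user: penalize any distance
        else:
            total += max(margin - d, 0.0) ** 2  # different: penalize closeness
    return total / (2 * len(pairs))


# Same-user pair at distance 0 and different-user pair beyond the margin:
# both are already "correct", so the loss is zero.
print(contrastive_loss([([0, 0], [0, 0], 1), ([0, 0], [3, 4], 0)]))
```

At inference time, a pair would be judged as sharing a user when the distance between the two final embeddings falls below a chosen threshold.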

Of course, the earlier the label information is used, the more likely the problem of “evaluation crossing” will occur. That is, the information from the test dataset is leaked to the training dataset, resulting in a meaningless improvement in the models’ performance. Therefore, the paper randomly masks some users of devices and predicts their labels, like UniMP, to ensure that no label information will be leaked in advance. Then, the objective function for classification could be

$$\arg\max_{\theta} \log p_{\theta}\left(\bar{Y} \mid X, \tilde{Y}, A\right) = \sum_{i=1}^{\bar{V}} \log p_{\theta}\left(\bar{y}_{i} \mid X, \tilde{Y}, A\right),$$

where $\tilde{Y}$ are the labels used in training and $\bar{Y}$ are the labels that need to be predicted; $\bar{V}$ are the nodes used in testing; $X$ are the embedding vectors of nodes; $A$ is the graph structure; and $\theta$ are the parameters of the MVEPL, including those of the label encoder.
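
The masking step can be sketched as a random split of the known labels into a visible part, which may be fed to the label encoder, and a masked part, which is used only as prediction targets. The function name, the rounding rule, and the fixed seed are assumptions made for this illustration.

```python
import random


def split_labels(node_users, mask_ratio, seed=0):
    """Split known labels into visible inputs and masked targets."""
    rng = random.Random(seed)
    labelled = sorted(node_users)
    rng.shuffle(labelled)
    n_masked = round(len(labelled) * mask_ratio)
    masked = set(labelled[:n_masked])
    visible = {n: u for n, u in node_users.items() if n not in masked}
    targets = {n: u for n, u in node_users.items() if n in masked}
    return visible, targets


# Hypothetical labels for ten devices.
users = {f"ip{i}": f"user{i % 3}" for i in range(10)}
visible, targets = split_labels(users, mask_ratio=0.3)
print(len(visible), len(targets))
```

Keeping the two dictionaries disjoint is what prevents a target label from also appearing in the input features of the same node.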

5. Experiment

5.1. Dataset

This paper evaluates MVEPL in a real environment at a city in China. The labelled IPs with users are derived from the public data of the devices in the early stage, such as SSL certificates, protocol banners, and existing manual labels. The information about the dataset is shown in Table 3. The target is to judge whether two devices have the same user.

Then, the paper constructs the dataset used in the subsequent experiments from the original dataset described in Table 3. Each sample in the dataset is a pair of devices, labelled 1 if the two devices have the same user and 0 otherwise. Based on the numbers of users and IPs in Table 3, each user has about 10 IPs. Therefore, for convenience, the paper randomly selects 10,000 positive and 100,000 negative samples for the subsequent experiments. All experiments are evaluated in terms of accuracy (ACC) over all classes, and precision (P), recall (R), and F1-score (F1) of the positive class.
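
The four metrics can be computed from pair labels and predictions as in the sketch below, which uses the standard definitions with the same-user class as the positive class. The function name and the convention of returning 0.0 when a denominator is zero are choices made for this illustration.

```python
def pair_metrics(y_true, y_pred):
    """Accuracy over all pairs; precision, recall, F1 of the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"acc": acc, "precision": precision, "recall": recall, "f1": f1}


# Trivial baseline on the stated class balance: always predict "different".
y_true = [1] * 10_000 + [0] * 100_000
y_pred = [0] * len(y_true)
print(pair_metrics(y_true, y_pred))
```

With 10,000 positive and 100,000 negative pairs, always predicting "different users" already yields an accuracy of 100,000/110,000, about 0.909, with an F1 of 0, so accuracy alone says little at this class balance and the positive-class metrics are the more informative ones.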

5.2. Settings

The paper conducts all ablations and comparisons using PyTorch 1.10.0 in the environment shown in Table 4.

The main preset parameters of MVEPL are shown in Table 5.

5.3. Performance

5.3.1. Ablation Studies

The core components in MVEPL that realize user recognition of devices on the Internet are the label encoder, multiview embedding, and the unified transformer. To further show their effects, the paper conducts ablation studies (see Table 6). The table highlights the best three models in terms of accuracy and F1. The paper also evaluates the time cost of each model, measured in seconds per training epoch, using the parameters in Table 5 and the dataset in Table 3. The conclusions are summarized as follows:

(1) Models with the label encoder perform much better than models without it: all models achieve their best accuracy and F1 when using the label encoder. The results show that the early use of label information described in this paper can improve the performance of the user recognition models while avoiding “evaluation crossing.”

(2) The multiview way of fusing the multisource features of devices performs best on the dataset. The unified transformer based on multiview embedding outperforms those using single-view or concatenated features. For example, the unified transformer based on multiview embedding without a label encoder achieves 0.8543 accuracy and 0.4653 F1. In comparison, the best single-view model is the unified transformer based on AS/subnet, which achieves 0.8387 accuracy and 0.4135 F1. The unified transformer based on concatenated features, namely concat in Table 7, achieves 0.6123 accuracy and 0.2007 F1.

(3) Models in Table 6 based on cotransformers perform better than those based on transformers, whether or not the label encoder or multiview embedding is used. This indicates that the interaction information between the input embeddings of two devices plays an essential role in calculating their similarity for user recognition. The unified transformer combines the deeply extracted features of devices from the transformer with their interaction information from the cotransformer to gain the best performance, about 0.9158 accuracy and 0.6131 F1.

(4) Table 6 also shows that the models on multiview graphs are much slower than those on single-view graphs, because they need more calculations to get the node representation. The training time per epoch of total is in the 50 s range, much higher than those of rDNS, AS/subnet, location, and hardware/software. The unified transformer generally performs best but needs more training time than a transformer or cotransformer. For example, the accuracy of the unified transformer without the label encoder is 0.8543, but it needs more than 20 s longer than the cotransformer and transformer. Note that the label encoder runs before the node representation of the graphs is calculated, so its time cost is not included in Table 6.

At the same time, the paper analyses how the ratio of labels masked in training affects the performance of node representations to realize user recognition on the dataset. In this experiment, 10%–90% of the labelled data in the training set is masked. The overall performance is shown in Figure 3.

As seen in Figure 3, the more labels are masked in training, the worse MVEPL performs, eventually even worse than the models that do not use label embedding. A possible reason is that, when training with label embedding, the fewer partial labels are used in each epoch, the more difficult it is to obtain the information in the labels. Moreover, the training of models may fail because of too many missing values in their inputs.

Significantly, the performance when the masked label ratio is 10% is worse than when the masked label ratio is 20%. There may be a small number of errors in the labels from SSL certificates, protocol banners, and existing manual labels in the early stage of user recognition of devices. Therefore, if these labels are masked, the overall performance of the representation of nodes could be easily affected and worsen.

5.3.2. Comparisons with GNNs

In order to verify the ability of MVEPL to vectorize the multisource data of IP addresses for user recognition of devices, the paper introduces mainstream and typical GNNs to get node representations on homogeneous and heterogeneous graphs. For the homogeneous graph, the paper takes IP addresses as nodes and concatenates the separate embeddings as node features. For the heterogeneous graph, the paper takes IP-IP, IP-rDNS, IP-Hardware, IP-Location, and IP-ASN as links and the corresponding separate embeddings as node features. Both graphs are built on the base graph. Then the paper feeds the node representations from the GNNs to the unified transformer to evaluate the ability of the different methods for user recognition of devices. The results of comparisons with GNNs are shown in Table 7.

The results show that, in terms of all metrics, the proposed MVEPL significantly and consistently outperforms all baselines on the task of user recognition.

Compared with the models using concatenated node features, MVEPL gains about 0.3 in accuracy and 0.4 in F1. This means that concatenating node features is a very rough form of feature extraction that does not consider the graph structure, so it is challenging to achieve satisfactory recognition performance for users.

The best model on the homogeneous graph of devices is UniMP, with about 0.8952 accuracy and 0.5584 F1. The best model on the heterogeneous graph of devices is HGT, with about 0.8527 accuracy and 0.4839 F1. That means the multiview way to fuse multidata is more delicate than the way to vectorize the multidata as node features in a homogeneous graph and does not need to design the meta paths in the heterogeneous graph. Therefore, MVEPL designed in this paper outperforms the existing state-of-the-art models to realize user recognition.

At the same time, due to the GraphSAGE used in the framework, MVEPL is inductive and easier to adapt to web-scale and large-scale data.

5.3.3. Comparisons with TSMs

In order to further verify the MVEPL in user recognition of devices, the paper introduces the mainstream and typical TSMs to calculate the similarity of two devices to judge whether the same user uses them. All methods shared the output of the multiview embedding as the input features. The results of the comparisons are shown in Table 8.

As shown in Table 8, the performance of MVEPL is better than that of the TSMs. A possible reason is that those models are built for text matching and rely on sequence models such as LSTM, whereas there is little or no sequential correlation between the input features of IP addresses.

5.3.4. Visualization

In order to verify the ability of MVEPL to represent the multisource data of IP addresses for user recognition of devices, the paper takes the output of the unified transformer as the final representation provided by MVEPL. Then, the paper selects the IP addresses of the first 10, middle 10, and last 10 users, ordered by average address number, as the input of K-means. The clustering results are visualized with t-SNE in Figure 4.

We found that using the representation provided by MVEPL made it easier to cluster the IP addresses according to their users. It shows that MVEPL obtains a satisfying representation of IP addresses in “user space,” which means clusters of devices with the same user. Thus, the representation of devices from MVEPL could lay the foundation for judging whether two devices have the same user and inferring the unknown users of devices on the Internet.

6. Conclusion

As devices connected to the Internet become more widespread in industry and daily life, operating and protecting them effectively will become a top priority for countries and enterprises. Based on multiview graphs of devices with IP addresses as identities, this paper proposes MVEPL, which obtains fine-grained embeddings of devices with GraphSAGE and a label encoder and then judges whether two devices have the same user with a unified transformer. MVEPL not only makes quick and effective use of the multisource data of devices but also uses a small number of devices with manually labelled users to achieve good performance in user recognition of devices. In real-world experiments, MVEPL obtains a better representation of devices and achieves high performance in user recognition. The results show that the method achieves 0.9158 accuracy and 0.6131 F1 in finding devices with the same users on the constructed dataset.

User recognition of devices is a meaningful research topic that benefits network security. In the future, we will apply MVEPL to common IoT scenarios to enable further applications for users. On the one hand, user recognition of devices could be used in many network security applications, such as intrusion detection and locating IP addresses. On the other hand, the performance of MVEPL with more multisource data needs to be evaluated over a larger area, not just a city, to verify how well it recognizes the users of devices on the Internet.

Data Availability

The data and materials of this study are available from the corresponding author or first author upon reasonable request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Key R&D Program of China (2020YFB2103803) and the National Natural Science Foundation of China (Nos. 61931019 and U1766215).

Copyright

Copyright © 2023 Yimo Ren et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
