    [Model Training, Step by Step] How to Weight Your Data

      173****7719 last edited by 173****7719

      Setup:

      import numpy as np
      import torch
      from torch.utils.data import WeightedRandomSampler
      from torch.utils.data import DataLoader
      from torch.utils.data import TensorDataset
      

      Generate the data

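      # Assume a three-class problem with 10, 1000, and 3000 samples per class
      class_counts = np.array([10, 1000, 3000])
      # Total number of samples, as a plain Python int
      n_samples = int(class_counts.sum())  # 4010
      # Labels: class index i repeated class_counts[i] times
      labels = []
      for i in range(len(class_counts)):
          labels.extend([i] * int(class_counts[i]))

      Y = torch.from_numpy(np.array(labels, dtype=np.int64))
      # Random features; their values do not matter for this experiment
      X = torch.randn(n_samples)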
      

      Generate the weights

      # Give each class a weight: total sample count / class sample count
      class_weights = [n_samples / class_counts[i] for i in range(len(class_counts))]
      # [401.0, 4.01, 1.3367]
      # Give each sample the weight of its class
      weights = [class_weights[i] for i in labels]
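To see why these weights balance the classes, it can help to add up the weight each class carries in total. The sketch below is plain Python with exact fractions; the names `class_mass` and `class_prob` are introduced here for illustration and do not appear in the original post.

```python
from fractions import Fraction

class_counts = [10, 1000, 3000]
n_samples = sum(class_counts)  # 4010

# Per-sample weight of each class, as in the post: n_samples / class_count
class_weights = [Fraction(n_samples, c) for c in class_counts]

# Total weight carried by each class = per-sample weight * number of samples
class_mass = [w * c for w, c in zip(class_weights, class_counts)]
total_mass = sum(class_mass)

# Probability that a single weighted draw lands in each class
class_prob = [m / total_mass for m in class_mass]
print(class_prob)
```

Every class ends up with the same total weight (4010), so a single weighted draw picks each class with probability 1/3 regardless of how many samples the class has.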
      

      Wrap the data

      train_dataset = TensorDataset(X, Y)
      sampler = WeightedRandomSampler(weights, int(n_samples), replacement=True)
      

      Experiment A: weighted sampling with replacement (a sample can be drawn more than once)

      train_loader = DataLoader(train_dataset, batch_size=1024, sampler=sampler, drop_last=True)

      for i, (x, y) in enumerate(train_loader):
          # The class labels are 0, 1, and 2
          print(f"batch index {i}, n_0: {(y==0).sum()}, n_1: {(y==1).sum()}, n_2: {(y==2).sum()}")
      # output:
      # batch 1, samples per class: 349, 344, 331
      # batch 2, samples per class: 344, 360, 320
      # batch 3, samples per class: 339, 348, 337
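As a rough sanity check on these counts: if each draw lands in a given class with probability 1/3 (which follows from the class weights above), the per-class count in a batch of 1024 is binomial. The sketch below computes its mean and standard deviation; treating the draws as independent is an assumption that holds when sampling with replacement.

```python
import math

batch_size = 1024
p = 1 / 3  # probability that one weighted draw lands in a given class

expected = batch_size * p
std_dev = math.sqrt(batch_size * p * (1 - p))
print(f"expected per class: {expected:.1f}, std dev: {std_dev:.1f}")
# expected per class: 341.3, std dev: 15.1
```

All nine counts reported above fall within about one and a half standard deviations of 341, which is what balanced sampling should look like.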
      

      Experiment B: weighted sampling without replacement (each sample is drawn at most once)

      # n_samples is the variable defined above
      sampler = WeightedRandomSampler(weights, int(n_samples), replacement=False)

      train_loader = DataLoader(train_dataset, batch_size=1024, sampler=sampler, drop_last=True)

      for i, (x, y) in enumerate(train_loader):
          print(f"batch index {i}, n_0: {(y==0).sum()}, n_1: {(y==1).sum()}, n_2: {(y==2).sum()}")
      # output:
      # batch 1, samples per class: 10, 466, 548
      # batch 2, samples per class: 0, 333, 691
      # batch 3, samples per class: 0, 173, 851
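The pattern above (all 10 samples of class 0 used up in the first batch) can be reproduced without PyTorch. The sketch below is a standard-library stand-in, not how `WeightedRandomSampler` is implemented: it uses the Efraimidis-Spirakis key trick to get a weighted random order without replacement. The seed and the helper `count_classes` are illustrative choices.

```python
import random

class_counts = [10, 1000, 3000]
n_samples = sum(class_counts)
labels = [c for c, n in enumerate(class_counts) for _ in range(n)]
weights = [n_samples / class_counts[c] for c in labels]

rng = random.Random(0)  # fixed seed, chosen arbitrarily

# Efraimidis-Spirakis keys: sorting by u ** (1 / w) in descending order
# gives a weighted random order in which every index appears exactly once.
keys = [rng.random() ** (1.0 / w) for w in weights]
order = sorted(range(n_samples), key=lambda i: keys[i], reverse=True)

def count_classes(indices):
    """Count how many of the given sample indices belong to each class."""
    counts = [0] * len(class_counts)
    for i in indices:
        counts[labels[i]] += 1
    return counts

batch_size = 1024
first_batch = count_classes(order[:batch_size])
whole_epoch = count_classes(order)
print("first batch:", first_batch)
print("whole epoch:", whole_epoch)
```

Because every sample is drawn exactly once, the totals over the whole epoch equal the original class counts. The weights only change the order: heavily weighted rare samples move to the front, and later batches have none left.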
      

      Experiment C: plain random shuffling

      # batch_size is 1024 here, matching the counts below (each row sums to 1024)
      train_loader = DataLoader(train_dataset, batch_size=1024, shuffle=True, drop_last=True)

      for i, (x, y) in enumerate(train_loader):
          print(f"batch index {i}, n_0: {(y==0).sum()}, n_1: {(y==1).sum()}, n_2: {(y==2).sum()}")
      # output:
      # batch 1, samples per class: 0, 227, 797
      # batch 2, samples per class: 1, 271, 752
      # batch 3, samples per class: 6, 257, 761
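For comparison, the expected per-class counts under plain shuffling are just the batch size times each class's share of the dataset. A short arithmetic sketch, assuming a batch size of 1024 as in the reported counts:

```python
class_counts = [10, 1000, 3000]
n_samples = sum(class_counts)
batch_size = 1024

# Expected samples per class in one uniformly shuffled batch
expected = [batch_size * c / n_samples for c in class_counts]
print([round(e, 1) for e in expected])
# [2.6, 255.4, 766.1]
```

The observed batches scatter around these values, so plain shuffling simply carries the dataset's imbalance into every batch.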
      

      Conclusion

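      Using WeightedRandomSampler with replacement=True gives roughly class-balanced batches: every class carries the same total weight, so each draw is equally likely to land in any class. Without replacement (Experiment B), the rare class is used up in the first batch, and plain shuffling (Experiment C) just reproduces the original imbalance.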
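One side effect worth keeping in mind is how often an individual sample is seen per epoch when sampling with replacement. The sketch below computes the expected number of draws per sample for each class, assuming one epoch draws `n_samples` indices as in Experiment A; `draws_per_sample` is a name introduced here for illustration.

```python
class_counts = [10, 1000, 3000]
n_samples = sum(class_counts)

# Per-sample weight of each class, as in the post
weights_per_class = [n_samples / c for c in class_counts]
total_weight = sum(w * c for w, c in zip(weights_per_class, class_counts))

# Expected number of times one sample of each class is drawn per epoch
draws_per_sample = [n_samples * w / total_weight for w in weights_per_class]
print([round(d, 2) for d in draws_per_sample])
# [133.67, 1.34, 0.45]
```

Each class-0 sample is repeated well over a hundred times per epoch, while an average class-2 sample appears in fewer than half of the epochs.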
