• nyan@lemmy.cafe
    English
    arrow-up
    2
    ·
    6 days ago

    Massive deduplication across all accounts on all servers of image, audio, and video data would theoretically be possible, but ain’t gonna happen. Or we could just discourage people from posting cat videos and bad memes (even less likely to happen).

    • lemmyng@lemmy.ca
      English
      arrow-up
      4
      ·
      6 days ago

      I would argue that duplication of content is a feature, not a bug. It adds resilience, and is explicitly built into systems like CDNs, git, and blockchain (yes I know, blockchains suck at being useful, but nevertheless the point is that duplication of data is intentional and serves a purpose).

      • futatorius@lemm.ee
        English
        arrow-up
        2
        ·
        edited
        4 days ago

        > explicitly built into systems like CDNs, git, and blockchain

        Git mostly ends up duplicating binary blobs, which delta-compress poorly; textual content is generally stored as deltas once it has been packed (look at git-repack for more details). And it’s bad practice to version-control binary blobs: the more correct approach is to control the source from which the blob is generated.

        CDNs don’t all work alike so it’s impossible to generalize. I won’t comment on blockchain, since in my work as a developer and architect, I’ve never encountered a valid use case for it.

        • lemmyng@lemmy.ca
          English
          arrow-up
          3
          ·
          edited
          4 days ago

          You’re missing the forest for the trees here.

          Given identical client setups, two clones of a git repo are identical. That’s duplication, and it’s an intentional feature to allow concurrent development.
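          One way to see why clones end up identical: Git names each object by a hash of its content, so the same bytes get the same ID on every machine. The sketch below is an illustration rather than Git's own code, and assumes a SHA-1 repository; the helper name `git_blob_id` is made up for the example.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the object ID a SHA-1 Git repository gives to a blob.

    Git hashes a small header ("blob <size>\\0") followed by the raw bytes.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Two independent "clones" holding the same file content agree on its ID.
clone_a = git_blob_id(b"hello world\n")
clone_b = git_blob_id(b"hello world\n")
same_id = clone_a == clone_b

# The empty blob has a well-known ID, which makes a handy sanity check.
empty_id = git_blob_id(b"")
```

          The same property means identical files inside a single repository are stored once, while every clone still carries its own full copy.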

          A CDN works by replicating content in various locations. Anycast (or DNS-based routing, depending on the CDN) is then used to deliver the content from any one of those locations, which couldn’t be done reliably without content duplication.

          Blockchains work by checking new blocks against previous blocks. In order to fully guarantee the validity of a block you need to guarantee every block, going back to the beginning of the chain. This is why each full node on a chain needs a complete local copy of it. Duplication.
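          The "check every block back to the beginning" idea can be sketched as a toy hash chain. This is a simplified illustration, not any real blockchain's format: the block fields, the all-zero genesis value, and the function names are assumptions made for the example, and there is no consensus or proof-of-work here.

```python
import hashlib
import json

GENESIS_PREV = "0" * 64  # assumed placeholder "previous hash" for the first block


def block_hash(block: dict) -> str:
    """Hash a block's full contents, including its link to the previous block."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def make_chain(payloads: list) -> list:
    """Build a chain where each block records the hash of the one before it."""
    chain = []
    prev = GENESIS_PREV
    for index, data in enumerate(payloads):
        block = {"index": index, "data": data, "prev_hash": prev}
        chain.append(block)
        prev = block_hash(block)
    return chain


def verify_chain(chain: list) -> bool:
    """Walk from the first block forward, recomputing every link."""
    prev = GENESIS_PREV
    for block in chain:
        if block["prev_hash"] != prev:
            return False
        prev = block_hash(block)
    return True


chain = make_chain(["a", "b", "c"])
ok_before = verify_chain(chain)

chain[0]["data"] = "tampered"  # change history in the oldest block
ok_after = verify_chain(chain)
```

          Verification needs every earlier block on hand, which is why a node that wants to check the chain independently has to keep its own copy.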

          My point is that we have a lot of processes that rely on full or partial duplication of data, for several purposes: concurrency, faster content delivery, verification, etc. Duplicated data is a feature, not a bug.

      • nyan@lemmy.cafe
        English
        arrow-up
        4
        ·
        6 days ago

        If the data has value, then yes, duplication is a good thing up to a point. The thesis is that only 10% of the data has value, though, and therefore duplicating the other 90% is a waste of resources.

        The real problem is figuring out which 10% of the data has value, which may be more obvious in some cases than others.

    • Brkdncr@lemmy.world
      English
      arrow-up
      2
      ·
      6 days ago

      Deduplication is trivial when applied at the block level, as long as the data is not encrypted, or is encrypted at rest by the storage system.
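      A minimal sketch of what block-level deduplication looks like, assuming fixed-size blocks and an in-memory store. The class name `DedupStore`, the 4096-byte block size, and the file names are all made up for illustration; real storage systems add reference counting, persistence, and often variable-size chunking.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size for this sketch


class DedupStore:
    """Toy content-addressed store: each unique block is kept exactly once."""

    def __init__(self) -> None:
        self.blocks = {}  # block digest -> block bytes
        self.files = {}   # file name -> ordered list of block digests

    def write(self, name: str, data: bytes) -> None:
        digests = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            # Only store the block if this digest has not been seen before.
            self.blocks.setdefault(digest, block)
            digests.append(digest)
        self.files[name] = digests

    def read(self, name: str) -> bytes:
        return b"".join(self.blocks[d] for d in self.files[name])


store = DedupStore()
video = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE
store.write("cat1.mp4", video)
store.write("cat2.mp4", video)  # identical upload from a second account

unique_blocks = len(store.blocks)
roundtrip_ok = store.read("cat2.mp4") == video
```

      The encryption caveat falls out of this: if each client encrypts the same file with its own key before upload, the ciphertext blocks differ and no digests match, whereas encryption applied by the storage system underneath the dedup layer leaves the matching intact.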

      • nyan@lemmy.cafe
        English
        arrow-up
        3
        ·
        6 days ago

        If the storage all belongs to one machine, yes. If it’s spread across multiple machines with similar setups that share a LAN, then you need to put in a little thought to make sure that there’s only one copy for all machines, but it’s still doable.

        In this case, we’re talking millions of machines with different owners, OSs, network security setups, etc. that are only connected across the Internet. The logistics are enough to make a hardened sysadmin blanch.