Deduplication

Deduplication #

Contents #

De-duplication is a common data processing task which consists of taking an array of values and removing all the duplicate values.

There exists a range of different algorithms for performing deduplication efficiently.

The problem of deduplication consists of removing duplicate values from an array of values.

The following three operations are supported in addition to the usual comparison operation, which runs in constant time.
1. get(A, i) returns the value of the \(i\)-th element in \(O(1)\) time
2. search(A, i) searches the array \(A\) for all elements at indexes larger than \(i\) that are the same as the element at index \(i\) using a linear search (i.e., in \(O(n)\) time), and returns the indexes as a list
3. delete(A, i) deletes the element at index \(i\) from the array by shifting all elements from index \(i + 1\) to index \(n - 1\) down by one in \(O(n)\) time

It can be assumed that the maximum size of the output array is a small constant and so its space and time costs are negligible.

Algorithm 1 #

Propose an algorithm for performing deduplication with \(O(n^3)\) worst-case time complexity and \(O(1)\) space complexity. Prove both the correctness of this algorithm and its time and space complexity bounds.

 INPUT    an array A[0:n - 1] of n values where n > 0
OUTPUT    an array A‘ containing only the unique values of A

DEDUPLICATE(A, n)
1    i = 0
2    while i < (len(A) - 1) - 1
3        idxs = search(A[i])
4        for j = len(idxs) - 1 to 0
5            delete(A[idxs[j]])
6        i++

Complexity #

Time #

DEDUPLICATE(A, n)                     // COST     TIMES
  i = 0
  while i < (len(A) - 1) - 1       // 1        n - 1
      idxs = search(A[i])          // n        n - 1
      for j = len(idxs) - 1 to 0   // n - 1    n - 1
          delete(A[idxs[j]])       // n       (n - 1)(n - 1)
      i++

The outer loop iterates over at most \(n - 1\) elements in the worst case (it does not need to iterate over the last element in the array at any given time since that element is unique).

Each iteration consists of a linear-time search operation and the cost of the inner loop. The inner loop iterates over at most \(n - 1\) elements in the worst case, when the array consists of a single value duplicated \(n\) times. Each iteration of the inner loop is dominated by a linear-time delete operation.

\( T(n) = (n - 1)(n + (n - 1)(n)) = \underbrace{O(n)}_{\text{iterations}} \times (\underbrace{O(n)}_{\text{search op}} + \underbrace{O(n)}_{\text{iterations}} \times \underbrace{O(n)}_{\text{delete op}}) = O(n^3) \)

When the array contains no duplicate values as in 1 2 3 4 5 6 7, the outer loop iterates \(n - 1\) times and the inner loop iterates \(0\) time per outer loop iteration.

\( T(n) = (n - 1)(n + (0)(n)) = \underbrace{O(n)}_{\text{iterations}} \times \underbrace{O(n)}_{\text{search op}} = O(n^2) \)

When the array contains \(n\) copies of a single value as in 1 1 1 1 1 1 1, the outer while loop iterates \(1\) time and the inner while loop iterates \(n - 1\) times per outer loop iteration.

\( T(n) = (1)(n + (n - 1)(n)) = \underbrace{O(n)}_{\text{search op}} + \underbrace{O(n)}_{\text{iterations}} \times \underbrace{O(n)}_{\text{delete op}} = O(n^2) \)

CMPSC 465

For each element in \(A\) call search(A[i]). Then for each element returned by search call delete to remove that element. The problem states that the size of the array returned by search is a constant, and can be neglected.

Therefore the algorithm calls search \(O(n)\) times, once per array element. Each call to search may result in a constant number of calls to delete.

Space #

The algorithm uses no auxiliary data structures and so the space complexity is \(S(n) = O(1)\).

Correctness #

CMPSC 465

INVARIANT First it is shown that a single iteration of the algorithm removes all duplicates of the processed array element.

Then it is shown that this is sufficient to show that the algorithm itself is correct.

1

After iteration \(i\) it’s impossible for a record \(A[i]\) to appear with an index \(j \gt i\).

Assume that \(A[i] = x\) and that there exists a \(j\) such that \(j \gt i\) and \(A[j] = x\) after iteration \(i\). Then according to the algorithm it has to be the case that \(A[i] \ne A[j]\). Contradiction!

2

Since the algorithm iterates in this way over each element starting from the first no duplicates can exist upon its completion.

Loop #

\(\textbf{Precondition}\)

\(\textbf{Condition}\)

\(\textbf{Invariant}\)

\(\textbf{Variant}\)

Correctness #

\(\textbf{Initialization}\)

It must be shown that if the \(\texttt{precondition}\) is true then the \(\texttt{invariant}\) is true.

\(\textbf{Maintenance}\)

It must be shown that if both the \(\texttt{condition}\) is true and the \(\texttt{invariant}\) is true after iteration \(i\) then the \(\texttt{invariant}\) is true after iteration \(i' = i + 1\).

\(\textbf{Postcondition}\)

It must be shown that if both the \(\texttt{condition}\) is false and the \(\texttt{invariant}\) is true then the \(\texttt{postcondition}\) is true.

\(\textbf{Termination}\)

Up to now it has been shown that if the loop eventually terminates then the algorithm is correct. It must now be shown that the loop eventually terminates.

REVIEW #

Pre-condition

\(k = 0\)
\(k \lt n = |A|\)
The array \(A = a_0, \dots, a_{n - 1}\) consists of \(n\) (possibly duplicated) values

Loop Condition \(C\)

\(k \lt |A|\)

Loop Invariant \(I\)

\(k \le |A| \le n\)
The first \(k\) elements \(a_0, \dots, a_{k - 1}\) of the array \(A\) consist of non duplicated values

Post-condition

\(|A| = k \le n\)
The array \(A = a_0, \dots, a_{k - 1}\) consists of \(k\) non duplicated values

First we must show that if the pre-condition holds then the loop invariant holds.

Assume that the pre-condition holds.

\(k = 0\)
\(k \lt n = |A|\)
the array \(A = a_0, \dots, a_{n - 1}\) consists of \(n\) (possibly duplicated) values

Then \(I\) holds.

\(k \lt n = |A|\) since \(k = 0\) and \(n \ge 1\) and this implies \(k \le |A| \le n\)
the first \(k = 0\) elements of the array consist of non duplicated values

Next we must show that if both the loop condition holds and the loop invariant holds pre iteration then the loop invariant holds post iteration.

Assume both that the loop condition holds and that the loop invariant holds pre iteration.

loop condition: \(k \lt |A|\)
loop invariant: \(k \le |A| \le n\)
loop invariant: the first \(k\) elements \(a_0, \dots, a_{k - 1}\) of the array \(A\) consist of non duplicated values

Then \(I\) holds post iteration.

\(k' = k + 1\)
\(k \lt |A| \implies k' \le |A|\)
the first \(k'\) elements \(a_0, \dots, a_{k - 1}, a_k\) of the array \(A\) consist of non duplicated values

We must then show that eventually, after a finite number of steps, the loop condition will not hold.

After \(n\) iterations \(k = n = |A|\)

Finally we must show that if the loop invariant holds but the loop condition does not hold then the post-condition holds.

Assume that the loop invariant holds but the loop condition does not hold.

negation of the loop condition: \(k \ge |A|\)
loop invariant: \(k \le |A| \le n\)
post-condition: \(\implies k = |A| \le n\)
post-condition: the array \(A = a_0, \dots, a_{k - 1}\) consists of \(k\) non duplicated values

\(\blacksquare\)

Algorithm 2 #

Deduplication can be made more efficient by using additional memory. Propose an algorithm for performing deduplication with \(O(n^2)\) time complexity and with \(O(n)\) space complexity. You may use an auxiliary array that supports assignment, but you may not use any other data structures. You may not sort the data.

Prove both the correctness and the time and space complexity bounds of this algorithm.

 INPUT    an array A[0:n - 1] of n values where n > 0
OUTPUT    an array B containing only the unique values of A

DEDUPLICATE(A, n)
1    idx_B = 0
2
3    for i = 0 to n - 1
4        if len(search(A[i])) = 0   //n times O(n) cost
5            B[idx_B] = A[i]
6            idx_B++
7
8    return B

Alternative implementation - for each element in A, scan B; it the element in A does not appear in B, copy it to B

DEDUPLICATE(A, n)
  for i = 0 to n - 1
      for j = 0 to len(B) - 1
          if A[i] == B[j]
              break
      

Implementations #

Python #

A = array
B = empty array

foreach a in A
    foreach b in B:
        if a == b:
            break
    else:
        copy(B, a)

A = [ 2, 5, 5, 4, 2, 5, 1, 1, 3, 2, ]
B = []

for a in A:
  print(a, B)
  for b in B:
    if a == b:
      break
  else:
    B.append(a)
print(B)

[]
[2]
[2, 5]
[2, 5]
[2, 5, 4]
[2, 5, 4]
[2, 5, 4]
[2, 5, 4, 1]
[2, 5, 4, 1]
[2, 5, 4, 1, 3]
[2, 5, 4, 1, 3]

Complexity #

Time #

The algorithm iterates over all \(n\) elements of the input array. Each iteration is dominated by the linear-time search operation.

\(T(n) = \underbrace{O(n)}_{\text{iterations}} \times \underbrace{O(n)}_{\text{search op}} = O(n^2)\).

Space #

The auxiliary array is in the worst case as large as the input array.

\(S(n) = O(n)\)

Correctness #

CMPSC 465

It must be shown that a record will be copied from A to B iff that record is not already in B, which can be done via proof by contradiction.

Since the algorithm iterates over each element in this way starting from the first it must be the case that each record from A appears in B exactly once.

REVIEW #

Pre condition:

an array \(A = a_0, \dots, a_{n - 1}\) of \(n\) possibly duplicated values for some positive integer \(n\)
an array \(B\) is an empty array (which will contain the set of \(m\) non duplicated values of array \(A\) where \(1 \le m \le n\))
\(m = 0\)
\(k = 0\) (which will serve as the index of array \(A\))

Loop condition: \(k \lt n\)

Loop invariant: \(B\) is an array of some nonnegative number \(m\) of non duplicated values such that \(0 \le m \le k \le n\)

Post condition: \(B = b_0, \dots, b_{m - 1}\) is an array of \(m\) non duplicated values where \(1 \le m \le k = n\)

At the beginning of the algorithm, the pre-condition holds. This means that \(n\) is a positive integer and \(k = 0\). It follows that the loop invariant holds, namely that \(0 \le m = 0 \le k = 0 \le n\).

We assume that after \(k\) iterations of the algorithm that both the loop condition and the loop invariant hold. It follows that the loop invariant holds post iteration:

\(k' = k + 1\) (the algorithm iterates once from element A[k] to element A[k + 1])
\(k \lt n \implies k' \le n\) (the new value of k is no greater than \(n\))
\(m' = m \lor m' = m + 1\) (the size of B either remains (no new unique value) the same or increases by 1 (new unique value))
\(m \le k \implies m' \le k'\) (regardless, the new value of m is no greater than the new value of k)
\(B\) was an array of unique values and still is an array of unique values

After some finite number of iterations \(k = n\) the loop condition is false and the algorithm terminates.

Assuming that the loop condition is false while the loop invariant remains true, it follows that the post condition is true:

the loop condition is false: \(\lnot(k \lt n) \implies k \ge n\)
the loop invariant is true: \(0 \le m \le k \le n\)
\(\implies k = n\): the array \(A\) has been iterated over and the array \(B\) contains the non duplicated values of \(A\)

Algorithm 3 #

The Unix core utilities include a program called uniq which can perform de-duplication on an input stream in linear time assuming the data is sorted. Given this assumption, propose a de-duplication algorithm that requires \(\Theta(n)\) space and time, and prove its correctness.

 INPUT     an array A[0:n - 1] of n values sorted in ascending order where n >= 0
OUTPUT    the array A‘ containing all and only the m unique values of A where 0 <= m <= n

DEDUPLICATION(A, n)
  B = [A[0]]                // an auxiliary array initially containing just the first value of the input array
  i = 1
  while i < n
      if A[i] != A[i - 1]   // if A[i] is not equal to A[i - 1]
          append(B, A[i])   // copy A[i] into B
      i++

ALTERNATIVELY

DEDUPLICATION(A, n)
  B = [A[0]]                // an auxiliary array initially containing just the first value of the input array
  i = 1
  while i < n
      if A[i] != B[-1]      // if A[i] is not equal to the last value of B
          append(B, A[i])   // copy A[i] into B
      i++

Complexity #

Time #

DEDUPLICATION(A, n)            // TIMES   COST
  B = [A[0]]
  i = 1
  while i < n               // n - 1   1
      if A[i] != B[-1]      //         1
          append(B, A[i])   //         1
      i++                   //         1

This algorithm does not terminate early and always scans the input array.

\(T(n) = (n - 1)(c + \dotsc) = \Theta(n)\)

Space #

\(S(n) = \Theta(n)\)

Correctness #

CMPSC 465

Since the input array is sorted, all distinct values are grouped together into contiguous index ranges and so a record at index \(i\) is a duplicate iff \(A[i] == A[i - 1]\).

An auxiliary array \(B\) is allocated with the same size as \(A\). For each element of \(A\), compare it with the previous element. If they are the same, advance to the next element. If they are different, then append \(A[i]\) to the end of \(B\).

To show the correctness of this algorithm, show that it is impossible for a distinct value of \(A\) to appear more than once in \(B\), and show that it is impossible for a distinct value of \(A\) not to appear in \(B\), via proof by contradiction.

\(\textbf{Precondition}\)

\(A\) is a zero-indexed array of \(n\) integers sorted in ascending order where \(n \ge 0\)
\(B\) is a zero-indexed array containing the first value of \(A\)
\(i = 1\)

\(\textbf{Condition}\)

\(i \lt n\)

\(\textbf{Invariant}\)

array \(B\) contains some positive integer number \(m\) of unique values from \(A\) such that \(0 \lt m \le i \le n\)

\(\textbf{Variant}\)

\(n - i\)

\(\textbf{Postcondition}\)

array \(B\) contains some positive integer number \(m\) of unique values from \(A\) such that \(0 \lt m \le i = n\)

Implementations #

C #

Python #

A = [1, 1, 2, 2, 2, 3, 4, 4]   # O(1)
B = [A[0]]                     # O(1)
k = 1                          # O(1)

while (k < len(A)):            # O(n)
  if A[k] != B[-1]:            # O(1)
    B.append(A[k])             # O(1) - copy element k of A into B
  k = k + 1                    # O(1)

print(A)
print(B)

[1, 1, 2, 2, 2, 3, 4, 4]
[1, 2, 3, 4]