Deduplication#
Contents#
De-duplication is a common data processing task which consists of taking an array of values and removing all the duplicate values.
There exists a range of different algorithms for performing deduplication efficiently.
The problem of deduplication consists of removing duplicate values from an array of values.
The following three operations are supported in addition to the usual comparison operation, which runs in constant time.
1. get(A, i)
returns the value of the \(i\)-th element in \(O(1)\) time
2. search(A, i)
searches the array \(A\) for all elements at indexes larger than \(i\) that are the same as the element at index \(i\) using a linear search (i.e., in \(O(n)\) time), and returns the indexes as a list
3. delete(A, i)
deletes the element at index \(i\) from the array by shifting all elements from index \(i + 1\) to index \(n - 1\) down by one in \(O(n)\) time
It can be assumed that the maximum size of the output array is a small constant and so its space and time costs are negligible.
Algorithm 1#
Propose an algorithm for performing deduplication with \(O(n^3)\) worst-case time complexity and \(O(1)\) space complexity. Prove both the correctness of this algorithm and its time and space complexity bounds.
INPUT an array A[0:n - 1] of n values where n > 0
OUTPUT an array A‘ containing only the unique values of A
DEDUPLICATE(A, n)
1 i = 0
2 while i < (len(A) - 1) - 1
3 idxs = search(A[i])
4 for j = len(idxs) - 1 to 0
5 delete(A[idxs[j]])
6 i++
Complexity#
Time#
DEDUPLICATE(A, n) // COST TIMES
1 i = 0
2 while i < (len(A) - 1) - 1 // 1 n - 1
3 idxs = search(A[i]) // n n - 1
4 for j = len(idxs) - 1 to 0 // n - 1 n - 1
5 delete(A[idxs[j]]) // n (n - 1)(n - 1)
6 i++
The outer loop iterates over at most \(n - 1\) elements in the worst case (it does not need to iterate over the last element in the array at any given time since that element is unique).
Each iteration consists of a linear-time search operation and the cost of the inner loop. The inner loop iterates over at most \(n - 1\) elements in the worst case, when the array consists of a single value duplicated \(n\) times. Each iteration of the inner loop is dominated by a linear-time delete operation.
\( T(n) = (n - 1)(n + (n - 1)(n)) = \underbrace{O(n)}_{\text{iterations}} \times (\underbrace{O(n)}_{\text{search op}} + \underbrace{O(n)}_{\text{iterations}} \times \underbrace{O(n)}_{\text{delete op}}) = O(n^3) \)
When the array contains no duplicate values as in 1 2 3 4 5 6 7
, the outer loop iterates \(n - 1\) times and the inner loop iterates \(0\) time per outer loop iteration.
\( T(n) = (n - 1)(n + (0)(n)) = \underbrace{O(n)}_{\text{iterations}} \times \underbrace{O(n)}_{\text{search op}} = O(n^2) \)
When the array contains \(n\) copies of a single value as in 1 1 1 1 1 1 1
, the outer while loop iterates \(1\) time and the inner while loop iterates \(n - 1\) times per outer loop iteration.
\( T(n) = (1)(n + (n - 1)(n)) = \underbrace{O(n)}_{\text{search op}} + \underbrace{O(n)}_{\text{iterations}} \times \underbrace{O(n)}_{\text{delete op}} = O(n^2) \)
Space#
The algorithm uses no auxiliary data structures and so the space complexity is \(S(n) = O(1)\).
Correctness#
Loop#
\(\textbf{Precondition}\)
\(\textbf{Condition}\)
\(\textbf{Invariant}\)
\(\textbf{Variant}\)
Correctness#
\(\textbf{Initialization}\)
It must be shown that if the \(\texttt{precondition}\) is true then the \(\texttt{invariant}\) is true.
\(\textbf{Maintenance}\)
It must be shown that if both the \(\texttt{condition}\) is true and the \(\texttt{invariant}\) is true after iteration \(i\) then the \(\texttt{invariant}\) is true after iteration \(i' = i + 1\).
\(\textbf{Postcondition}\)
It must be shown that if both the \(\texttt{condition}\) is false and the \(\texttt{invariant}\) is true then the \(\texttt{postcondition}\) is true.
\(\textbf{Termination}\)
Up to now it has been shown that if the loop eventually terminates then the algorithm is correct. It must now be shown that the loop eventually terminates.
REVIEW#
Pre-condition
\(k = 0\)
\(k \lt n = |A|\)
The array \(A = a_0, \dots, a_{n - 1}\) consists of \(n\) (possibly duplicated) values
Loop Condition \(C\)
\(k \lt |A|\)
Loop Invariant \(I\)
\(k \le |A| \le n\)
The first \(k\) elements \(a_0, \dots, a_{k - 1}\) of the array \(A\) consist of non duplicated values
Post-condition
\(|A| = k \le n\)
The array \(A = a_0, \dots, a_{k - 1}\) consists of \(k\) non duplicated values
First we must show that if the pre-condition holds then the loop invariant holds.
Assume that the pre-condition holds.
\(k = 0\)
\(k \lt n = |A|\)
the array \(A = a_0, \dots, a_{n - 1}\) consists of \(n\) (possibly duplicated) values
Then \(I\) holds.
\(k \lt n = |A|\) since \(k = 0\) and \(n \ge 1\) and this implies \(k \le |A| \le n\)
the first \(k = 0\) elements of the array consist of non duplicated values
Next we must show that if both the loop condition holds and the loop invariant holds pre iteration then the loop invariant holds post iteration.
Assume both that the loop condition holds and that the loop invariant holds pre iteration.
loop condition: \(k \lt |A|\)
loop invariant: \(k \le |A| \le n\)
loop invariant: the first \(k\) elements \(a_0, \dots, a_{k - 1}\) of the array \(A\) consist of non duplicated values
Then \(I\) holds post iteration.
\(k' = k + 1\)
\(k \lt |A| \implies k' \le |A|\)
the first \(k'\) elements \(a_0, \dots, a_{k - 1}, a_k\) of the array \(A\) consist of non duplicated values
We must then show that eventually, after a finite number of steps, the loop condition will not hold.
After \(n\) iterations \(k = n = |A|\)
Finally we must show that if the loop invariant holds but the loop condition does not hold then the post-condition holds.
Assume that the loop invariant holds but the loop condition does not hold.
negation of the loop condition: \(k \ge |A|\)
loop invariant: \(k \le |A| \le n\)
post-condition: \(\implies k = |A| \le n\)
post-condition: the array \(A = a_0, \dots, a_{k - 1}\) consists of \(k\) non duplicated values
\(\blacksquare\)
Algorithm 2#
Deduplication can be made more efficient by using additional memory. Propose an algorithm for performing deduplication with \(O(n^2)\) time complexity and with \(O(n)\) space complexity. You may use an auxiliary array that supports assignment, but you may not use any other data structures. You may not sort the data.
Prove both the correctness and the time and space complexity bounds of this algorithm.
INPUT an array A[0:n - 1] of n values where n > 0
OUTPUT an array B containing only the unique values of A
DEDUPLICATE(A, n)
1 idx_B = 0
2
3 for i = 0 to n - 1
4 if len(search(A[i])) = 0 //n times O(n) cost
5 B[idx_B] = A[i]
6 idx_B++
7
8 return B
Alternative implementation - for each element in A, scan B; it the element in A does not appear in B, copy it to B
DEDUPLICATE(A, n)
1 for i = 0 to n - 1
3 for j = 0 to len(B) - 1
4 if A[i] == B[j]
5 break
5
Implementations#
Python#
A = array
B = empty array
foreach a in A
foreach b in B:
if a == b:
break
else:
copy(B, a)
A = [ 2, 5, 5, 4, 2, 5, 1, 1, 3, 2, ]
B = []
for a in A:
print(a, B)
for b in B:
if a == b:
break
else:
B.append(a)
print(B)
2 []
5 [2]
5 [2, 5]
4 [2, 5]
2 [2, 5, 4]
5 [2, 5, 4]
1 [2, 5, 4]
1 [2, 5, 4, 1]
3 [2, 5, 4, 1]
2 [2, 5, 4, 1, 3]
[2, 5, 4, 1, 3]
Complexity#
Time#
The algorithm iterates over all \(n\) elements of the input array. Each iteration is dominated by the linear-time search operation.
\(T(n) = \underbrace{O(n)}_{\text{iterations}} \times \underbrace{O(n)}_{\text{search op}} = O(n^2)\).
Space#
The auxiliary array is in the worst case as large as the input array.
\(S(n) = O(n)\)
Correctness#
REVIEW#
Pre condition:
an array \(A = a_0, \dots, a_{n - 1}\) of \(n\) possibly duplicated values for some positive integer \(n\)
an array \(B\) is an empty array (which will contain the set of \(m\) non duplicated values of array \(A\) where \(1 \le m \le n\))
\(m = 0\)
\(k = 0\) (which will serve as the index of array \(A\))
Loop condition: \(k \lt n\)
Loop invariant: \(B\) is an array of some nonnegative number \(m\) of non duplicated values such that \(0 \le m \le k \le n\)
Post condition: \(B = b_0, \dots, b_{m - 1}\) is an array of \(m\) non duplicated values where \(1 \le m \le k = n\)
At the beginning of the algorithm, the pre-condition holds. This means that \(n\) is a positive integer and \(k = 0\). It follows that the loop invariant holds, namely that \(0 \le m = 0 \le k = 0 \le n\).
We assume that after \(k\) iterations of the algorithm that both the loop condition and the loop invariant hold. It follows that the loop invariant holds post iteration:
\(k' = k + 1\) (the algorithm iterates once from element A[k] to element A[k + 1])
\(k \lt n \implies k' \le n\) (the new value of k is no greater than \(n\))
\(m' = m \lor m' = m + 1\) (the size of B either remains (no new unique value) the same or increases by 1 (new unique value))
\(m \le k \implies m' \le k'\) (regardless, the new value of m is no greater than the new value of k)
\(B\) was an array of unique values and still is an array of unique values
After some finite number of iterations \(k = n\) the loop condition is false and the algorithm terminates.
Assuming that the loop condition is false while the loop invariant remains true, it follows that the post condition is true:
the loop condition is false: \(\lnot(k \lt n) \implies k \ge n\)
the loop invariant is true: \(0 \le m \le k \le n\)
\(\implies k = n\): the array \(A\) has been iterated over and the array \(B\) contains the non duplicated values of \(A\)
Algorithm 3#
The Unix core utilities include a program called uniq
which can perform de-duplication on an input stream in linear time assuming the data is sorted. Given this assumption, propose a de-duplication algorithm that requires \(\Theta(n)\) space and time, and prove its correctness.
INPUT an array A[0:n - 1] of n values sorted in ascending order where n >= 0
OUTPUT the array A‘ containing all and only the m unique values of A where 0 <= m <= n
DEDUPLICATION(A, n)
1 B = [A[0]] // an auxiliary array initially containing just the first value of the input array
2 i = 1
3 while i < n
4 if A[i] != A[i - 1] // if A[i] is not equal to A[i - 1]
5 append(B, A[i]) // copy A[i] into B
6 i++
ALTERNATIVELY
DEDUPLICATION(A, n)
1 B = [A[0]] // an auxiliary array initially containing just the first value of the input array
2 i = 1
3 while i < n
4 if A[i] != B[-1] // if A[i] is not equal to the last value of B
5 append(B, A[i]) // copy A[i] into B
6 i++
Complexity#
Time#
DEDUPLICATION(A, n) // TIMES COST
1 B = [A[0]]
2 i = 1
3 while i < n // n - 1 1
4 if A[i] != B[-1] // 1
5 append(B, A[i]) // 1
6 i++ // 1
This algorithm does not terminate early and always scans the input array.
\(T(n) = (n - 1)(c + \dotsc) = \Theta(n)\)
Space#
\(S(n) = \Theta(n)\)
Correctness#
\(\textbf{Precondition}\)
\(A\) is a zero-indexed array of \(n\) integers sorted in ascending order where \(n \ge 0\)
\(B\) is a zero-indexed array containing the first value of \(A\)
\(i = 1\)
\(\textbf{Condition}\)
\(i \lt n\)
\(\textbf{Invariant}\)
array \(B\) contains some positive integer number \(m\) of unique values from \(A\) such that \(0 \lt m \le i \le n\)
\(\textbf{Variant}\)
\(n - i\)
\(\textbf{Postcondition}\)
array \(B\) contains some positive integer number \(m\) of unique values from \(A\) such that \(0 \lt m \le i = n\)
Implementations#
C#
Python#
A = [1, 1, 2, 2, 2, 3, 4, 4] # O(1)
B = [A[0]] # O(1)
k = 1 # O(1)
while (k < len(A)): # O(n)
if A[k] != B[-1]: # O(1)
B.append(A[k]) # O(1) - copy element k of A into B
k = k + 1 # O(1)
print(A)
print(B)
[1, 1, 2, 2, 2, 3, 4, 4]
[1, 2, 3, 4]