Regex Techniques for Complex Data Processing Scenarios Quiz

Challenge your expertise with these medium-level questions on advanced regex use cases in data processing, focusing on complex pattern extraction, transformation, and validation techniques. Enhance your understanding of practical regex strategies for text parsing, cleaning, and manipulation in real-world data workflows.

  1. Extracting Specific Patterns

    Which regex pattern best matches email addresses that only use lowercase letters, numbers, and periods before the @ symbol, as in 'john.doe12@example.com'?

    1. ^[a-z0-9.]+@[a-zA-Z0-9.-]+.[a-zA-Z]{2,}$
    2. ^[A-Z0-9.]+@[a-zA-Z0-9.-]+.[a-zA-Z]{2,}$
    3. ^w+@[a-z0-9.-]+.[a-z]{2,}$
    4. ^[a-zA-Z0-9]+@[a-zA-Z0-9.-]+.[A-Za-z]{2,}$

    Explanation: The pattern '^[a-z0-9.]+@[a-zA-Z0-9.-]+.[a-zA-Z]{2,}$' only allows lowercase letters, numbers, and periods before the '@', satisfying the condition. The second option allows uppercase letters which is not permitted. The third uses 'w' which includes underscores, violating the letter, number, and period restriction. The fourth option allows both uppercase and lacks the period in the first part, so it is too broad.

  2. Regex for Data Cleaning

    In cleaning phone numbers to retain only the digits from strings such as '+1 (234) 567-8900', which regex replacement pattern is most effective?

    1. D
    2. S
    3. d
    4. [^w]

    Explanation: The 'D' character class matches any non-digit, making it ideal for removing everything except digits from the input and leaving a clean numeric string. 'S' matches non-whitespace, removing too much information. 'd' matches digits and would remove the digits themselves, which is undesirable. '[^w]' incorrectly targets everything except lowercase 'w', not fulfilling the requirement.

  3. Lookahead/Lookbehind Usage

    Which regex pattern correctly extracts numbers that are directly followed by the word 'USD', as in '500USD is the total cost'?

    1. d+(?=USD)
    2. USD(?=d+)
    3. (?<=USD)d+
    4. d+USD

    Explanation: The pattern 'd+(?=USD)' uses a positive lookahead to match numbers that are immediately before 'USD'. 'USD(?=d+)' is reversed, attempting to match 'USD' before numbers. '(?<=USD)d+' uses lookbehind but would match numbers after 'USD', not before. 'd+USD' matches both but does not extract just the number; it includes 'USD' in the match.

  4. Multiline Matching

    When processing multiline logs, which regex modifier enables the caret '^' and dollar '$' anchors to match the start and end of each line rather than the entire input?

    1. m
    2. s
    3. g
    4. u

    Explanation: The 'm' modifier, short for multiline, allows '^' and '$' to match the start and end of each line rather than only the start and end of the whole string. The 's' modifier changes '.' to match newline characters. The 'g' modifier allows global, repeated matches but doesn't change anchor behavior. The 'u' modifier is for Unicode support, unrelated to line anchors.

  5. Regex Group Referencing

    Given the pattern '(d{4})-(d{2})-(d{2})', which replacement syntax would swap the day and month in dates like '2023-12-31' to '2023-31-12'?

    1. $1-$3-$2
    2. $3-$2-$1
    3. $2-$1-$3
    4. 1-3-2

    Explanation: In many regex engines, '$1-$3-$2' refers to the first, third, and second groups respectively, effectively swapping day and month in the date. Both '$3-$2-$1' and '$2-$1-$3' arrange the segments incorrectly. '1-3-2' is a syntax variant found in some engines, but '$' notation is more widely used in replacements.