Testing

Style Guides

Style automation
  • Enforce adherence to coding styles according to predefined guidelines
  • Example: Prettier.io
  • Advantages
    • Zero human effort
    • Uniform enforcement
    • Prevent accidentally misleading style
    • Can be applied after refactoring, synthesizing code
    • Can update entire codebase when style rules change
  • Disadvantages
    • Can’t reproduce all reasonable style rules
    • Special-case exceptions are awkward
    • Reformatting pollutes blame history
Style guide examples
  • Examples
  • Don’t blindly adopt someone else’s style guide – Some justifications may not apply externally
    • But good to inherit from
  • Elements of good style guides
    • Justify choices
      • Avoid danger
      • Enforce best practice
      • Ensure consistency
    • Avoid details that can be automated
    • Get developer buy-in

Portability

Advantages and Drawbacks
  • Advantages
    • Enlarges customer base
    • Futureproofing
      • e.g. Apple Silicon
    • Reduces implicit assumptions
    • Improves process robustness
    • Expands tooling options
      • Compilers
      • Analysis tools
    • Educates team
  • Example anecdote: Every time I build a project with a new compiler, I discover bugs
    • Sometimes those bugs are in the compiler… but most are in the application
  • Drawbacks
    • Maintenance burden
Portability targets
  • Architecture
    • x86, ARM, 32 vs. 64-bit
  • Operating system
    • Linux (Red Hat, Debian), Windows, macOS
    • Android, iOS
  • Form factor
    • Smartphone, tablet, laptop, desktop, dual monitors
  • Web browser
    • Chrome, Safari, Firefox
  • C/C++ compilers
    • GCC, Clang, MSVC, Intel, Solaris Studio, IBM XL, PGI, SGI/Open64/PathScale
  • Java virtual machines
    • Oracle/OpenJDK, IBM/OpenJ9, Azul
  • Python interpreters
    • CPython, PyPy, Jython
Techniques to improve portability
  • Heterogeneous developer environments
  • Automated cross-platform builds and tests
    • Cloud infrastructure available
    • Don’t ignore errors
  • Highlight in style guides, code review checklists
  • Use cross-platform standards and abstraction layers
    • Avoid writing your own #ifdefs unless portability is a business case
  • Common gotchas:
    • Integer sizes
    • Filesystems
    • Unsupported APIs, language features
    • Floating-point behavior
    • Performance characteristics
    • Assumptions about unspecified behavior
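Two of the gotchas above (integer sizes and filesystems) can be demonstrated with the Python standard library. This is a minimal sketch; the file names and values are made up for illustration.

```python
import struct
from pathlib import PurePosixPath, PureWindowsPath

# Integer sizes: in native mode, "l" (C long) follows the platform's C ABI.
# It is commonly 8 bytes on 64-bit Linux and macOS but 4 bytes on Windows.
native_long_size = struct.calcsize("l")

# With an explicit byte-order prefix, struct uses standard sizes and no padding,
# so the layout is the same everywhere.
portable_long_size = struct.calcsize("<l")  # 4 on every platform
record = struct.pack("<iq", 1, 2)           # 4-byte int + 8-byte int = 12 bytes

# Filesystems: Windows path comparisons ignore case, POSIX ones do not.
same_on_windows = PureWindowsPath("Data/Report.txt") == PureWindowsPath("data/report.txt")
same_on_posix = PurePosixPath("Data/Report.txt") == PurePosixPath("data/report.txt")

print(native_long_size, portable_long_size, len(record), same_on_windows, same_on_posix)
```

Code that writes binary files or compares paths with platform-default behavior may pass every test on one platform and still fail on another, which is why automated cross-platform builds are worth the setup cost.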

Testing

Goals of testing
  • Find and prevent bugs
  • Improve maintainability (esp. refactoring)
  • Clarify intended usage
  • To meet these goals, tests themselves should be:
    • Bug-free
    • Maintainable
    • Clearly documented and easy to read
Test coverage
  • Ways to measure “how much code” was tested
    • Function coverage
    • Statement (line) coverage
    • Branch coverage
    • Condition/decision coverage
    • Loop coverage
    • Path coverage
  • Coverage analysis can reveal gaps in testing
Identify test cases to cover the following snippet of code
if (a>b && c!=25) { 
    d++; 
}
Solution
  • Required cases for condition/decision coverage:
    • a<=b
    • a>b && c==25
    • a>b && c!=25
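The three cases can be written as executable checks. The sketch below transliterates the snippet into a hypothetical Python function `bump`; the name and the convention of returning `d` are assumptions made for illustration.

```python
def bump(a, b, c, d):
    """Python transliteration of the snippet: increment d when a > b and c != 25."""
    if a > b and c != 25:
        d += 1
    return d


# One input per required case.
cases = [
    # (a, b, c, d_in, d_expected)
    (1, 2, 0, 10, 10),   # a <= b: first condition false, decision false
    (2, 1, 25, 10, 10),  # a > b and c == 25: second condition false, decision false
    (2, 1, 0, 10, 11),   # a > b and c != 25: both conditions true, decision true
]

for a, b, c, d_in, d_expected in cases:
    assert bump(a, b, c, d_in) == d_expected
```

Because `and` short-circuits, the second condition is never evaluated in the first case, so no fourth case (a<=b with a specific value of c) is needed to see each condition take both outcomes.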
Coverage targets
  • Any statement not covered by a test is code you expect your client/users to run before you do
  • By this philosophy, 100% line coverage would be a minimum target
  • But chasing coverage metrics with low-quality tests can be self-defeating
  • Tests take time to write, review, and run; must consider cost/benefit ratio
Discussion: difficult testing scenarios
  • Discuss the following testing scenarios
    • Error codes & exceptions from library and system calls
      • Out of memory
      • Out of disk space
      • Incomplete I/O
      • Transient I/O error (EAGAIN)
      • Timeouts
    • Unbounded blocking
    • Crash/power loss
      • Corrupted data
    • Malicious intent
    • Concurrency
      • High lock contention
      • Race conditions
      • Caching & memory ordering
      • True concurrency vs. multitasking
    • Portability
      • Unsupported capabilities
      • Platform differences
    • Performance
      • NUMA
      • big.LITTLE
      • Disk I/O (bandwidth, latency)
      • Network I/O (bandwidth, latency)
Beyoncé rule
  • “If you liked it, then you shoulda put a test on it”
  • Manages responsibility during large-scale refactoring
    • Infrastructure team must ensure all tests pass before committing
    • If functionality breaks, product team must fix it (and add more tests)
  • Aim for sufficient coverage so that you (and your teammates) would be okay being held responsible for a production breakage in uncovered code
SQLite
  • https://www.sqlite.org/testing.html
  • 640x more test code than application code
  • 100% branch test coverage
  • OOM, I/O errors, crashes
    • Use abstractions to wrap malloc, I/O operations
  • Boundary values
  • Regression tests
  • Valgrind
  • Fuzz testing
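The idea of wrapping I/O behind an abstraction so that tests can inject failures can be sketched in a few lines. This is not SQLite's harness; `FaultyWriter` and `save_records` are hypothetical names, and the "disk full" error is simulated.

```python
import errno
import io


class FaultyWriter:
    """Wraps a file-like object and fails the Nth write, simulating a full disk."""

    def __init__(self, wrapped, fail_on_write):
        self._wrapped = wrapped
        self._fail_on_write = fail_on_write
        self._writes = 0

    def write(self, data):
        self._writes += 1
        if self._writes == self._fail_on_write:
            raise OSError(errno.ENOSPC, "No space left on device")
        return self._wrapped.write(data)


def save_records(out, records):
    """Hypothetical code under test: returns the number of records fully written."""
    written = 0
    for record in records:
        try:
            out.write(record + "\n")
        except OSError:
            break  # stop cleanly instead of crashing
        written += 1
    return written


records = ["a", "b", "c", "d"]
# Repeat the test with the failure injected at each successive write.
for fail_at in range(1, len(records) + 1):
    sink = io.StringIO()
    count = save_records(FaultyWriter(sink, fail_at), records)
    assert count == fail_at - 1
    assert sink.getvalue() == "".join(r + "\n" for r in records[:count])
```

Looping over every possible failure point is what makes this kind of test thorough: each iteration exercises a different error-handling path.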
Kinds of testing
  • Styles
    • Exploratory: unscripted testing done by the developers themselves
    • Smoke tests: minimal attempts to operate the software to identify basic problems
    • Black box: test development without knowledge of the implementation/source code
    • Glass box: test development with knowledge of the program’s internal structure/workflow
    • Fuzz testing: a type of black-box testing that feeds the program invalid, unexpected, or random data.
    • Dynamic analysis: evaluate software behavior during run time.
  • Scopes
    • Unit tests
    • Integration tests
    • End-to-end tests
  • Sizes
    • Small: fast, deterministic (in-process)
    • Medium: multi-process, allow blocking calls (single machine)
    • Large: Multi-node
  • Purpose
    • Prevent reoccurrence of bugs (regression tests)
    • Prepare for release (acceptance tests, beta testing)
    • Ensure operating health (self tests)
Aerospace testing
  • Unit tests
    • Ensure thorough coverage
    • Verify independent implementations
  • Smoke tests
    • Small-scale integration test
    • Ensure configs are valid
  • Regression tests
    • Catch any change to behavior (ensure refactoring changes are non-functional)
    • Ensure control algorithms achieve mission objectives
  • Checkpoint/restore tests
  • Exploratory tests
    • Logged data posted to reviews
  • Software-in-the-loop
    • Medium-scale integration test
    • Leverage virtualization, preloading, hardware simulation
    • Subsystem and end-to-end scope
  • Hardware-in-the-loop
    • Large-scale integration test
    • Verify non-functional requirements
  • Vehicle-in-the-loop
    • Large-scale integration test
    • Verify a particular “production unit”
  • Formal test deliverables
Flaky vs. brittle tests
  • Flaky
    • Non-deterministic failures
      • Multi-process/multi-node infrastructure failures
      • Performance/timeouts
      • Randomness
        • Always log seed
      • Concurrency
      • Difficult to reproduce
      • Time of day
  • Brittle
    • “High maintenance”
      • Leverage private functionality
      • Depend on private state
      • Assume behavior beyond the spec
        • e.g. checking interactions instead of state
    • Coming up: guidelines to avoid brittle tests
Aside: random numbers
  • In most settings, random numbers should be deterministic
    • Enables reproducibility, reduces test flakiness
    • Exceptions (in production): cryptography, gambling
  • Recommended approach
    • Application starts with a specified global seed (and logs it)
    • Each component constructs a private RNG by combining global seed with unique instance name
    • Alternative for parallel computation: sequence queries, use RNG that can “fast forward” state
  • Advantages
    • Results independent of amount of parallelism
    • Results do not change if “peripheral” components are added or removed
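One way to implement the recommended approach, as a minimal sketch: hash the global seed together with the instance name and use the result to seed a private generator. The function name, the component names, and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib
import random


def component_rng(global_seed: int, instance_name: str) -> random.Random:
    """Derive a private, reproducible RNG from the global seed and a unique name."""
    material = f"{global_seed}:{instance_name}".encode("utf-8")
    # hashlib is used because the built-in hash() of strings varies between
    # interpreter runs, which would defeat reproducibility.
    digest = hashlib.sha256(material).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


GLOBAL_SEED = 12345  # in a real application: read from config or CLI
print(f"global seed = {GLOBAL_SEED}")  # always log the seed

physics = component_rng(GLOBAL_SEED, "physics")
first_run = [physics.random() for _ in range(3)]

# A second "run" that also creates a peripheral component sees the same stream.
telemetry = component_rng(GLOBAL_SEED, "telemetry")
_ = telemetry.random()
physics_again = component_rng(GLOBAL_SEED, "physics")
second_run = [physics_again.random() for _ in range(3)]

assert first_run == second_run
```

Because each component owns its generator, the order in which components draw numbers (and therefore the degree of parallelism) does not affect any individual stream.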
Test scope
  • Small scope
    • Limited coverage (per test)
    • But coverage is orthogonal
    • May require awkward setup (dependency injection, mock objects)
    • Can be written simultaneously with the code-under-test
    • Easy to diagnose
      • Limited amount of code is executed
      • Easier to understand procedure and results
    • Typically faster
      • Can run more often
  • Large scope
    • Extensive coverage (per test)
      • Much coverage is redundant
      • Most results are not checked (false sense of security)
    • May be easier to set up than mid-scoped tests
      • But total configuration harder to reason about
    • Depends on whole system
      • Bugs may not be found until later
    • Difficult to diagnose
      • Slows down debugging when bugs are found
    • Typically slower
Exploratory testing
  • Applications
    • Developers check how existing code behaves
    • Developers “gut check” new code
    • Demonstrate functionality in a scenario of interest with complicated setup
    • QA testing (test behaviors developers often overlook)
  • Tools
    • Application itself (print)
    • REPL (JShell, IPython)
    • Dynamic analysis tools (callgrind)
  • Drawbacks
    • Not reproducible
      • Results may depend on unique context
      • Good habit to log all interactions
    • Good to think about expectations before running test, but if you can express what you expect, just write a unit test
    • Quality varies with tester
      • Can’t measure coverage
  • Appropriate for one-off scripts
Unit tests
  • Narrow scope (typically a single function or a single class)
  • Focus on publicly-visible, fully-specified behavior
    • Check state, not process
  • Write for clarity
    • Okay to be repetitive
    • Avoid new abstractions or logic
  • Bad example:
    • When registering a new user, the system first generates a password, then tries to insert a new auth table row, throwing an exception if insertion failed (name already taken)
  • Better example:
    • After registering a new user whose name is not taken, a new row will exist in the database with their username and password
    • If attempting to register a new user whose name is already taken, an exception is thrown.
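The "better example" translates directly into two state-based tests. The sketch below uses a hypothetical `AuthService` with a dict standing in for the auth table; plain-text password storage is only for brevity.

```python
class UserAlreadyExists(Exception):
    pass


class AuthService:
    """Hypothetical system under test; a dict stands in for the auth table."""

    def __init__(self):
        self._table = {}

    def register(self, username, password):
        if username in self._table:
            raise UserAlreadyExists(username)
        self._table[username] = password

    def lookup(self, username):
        return self._table.get(username)


def test_register_new_user_creates_row():
    auth = AuthService()
    auth.register("alice", "s3cret")
    assert auth.lookup("alice") == "s3cret"


def test_register_taken_name_raises():
    auth = AuthService()
    auth.register("alice", "s3cret")
    try:
        auth.register("alice", "other")
    except UserAlreadyExists:
        pass
    else:
        raise AssertionError("expected UserAlreadyExists")
    # The original row is unchanged.
    assert auth.lookup("alice") == "s3cret"


test_register_new_user_creates_row()
test_register_taken_name_raises()
```

Neither test cares whether the implementation checks for the name before inserting or relies on a failed insert, so the implementation can be refactored without touching the tests.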
Behavior-driven development
  • Structuring tests around methods can make them brittle, hard to read
    • Try to test too many behaviors at once
  • Better to structure tests around scenarios
  • Arrange-act-assert format
    • “Given …, when …, then …”
    • Analogous to User Stories preamble
  • Example: “Given two accounts, the first of which has at least $100, when transferring $100 from the first to the second account, then both account balances should reflect the transfer”
  • Test frameworks can help make tests self-documenting
  • Consider writing tests before implementing features
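The account-transfer scenario can be laid out in arrange-act-assert form, with the "given / when / then" wording kept as comments. `Account` and `transfer` are hypothetical names used only for this sketch.

```python
class Account:
    """Hypothetical account type used only for this illustration."""

    def __init__(self, balance):
        self.balance = balance


def transfer(source, destination, amount):
    if amount > source.balance:
        raise ValueError("insufficient funds")
    source.balance -= amount
    destination.balance += amount


def test_transfer_moves_funds_between_accounts():
    # Given two accounts, the first of which has at least $100
    first = Account(balance=150)
    second = Account(balance=20)

    # When transferring $100 from the first to the second account
    transfer(first, second, 100)

    # Then both account balances reflect the transfer
    assert first.balance == 50
    assert second.balance == 120


test_transfer_moves_funds_between_accounts()
```

The test name describes the scenario rather than the method being called, so a failure report reads as a statement about behavior.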
Integration tests
  • Broader scope
    • Check that multiple components interface correctly
    • Check behavior of subsystems
  • Tend to be larger in size
    • SOA (service-oriented architecture) requires multiple processes
    • Non-trivial data, config can be slow
    • Aim for smallest test possible
      • Split pipelines into pairwise interactions
  • Larger tests require non-trivial infrastructure, can be flaky
    • Fakes
    • Lightweight substitutions
      • In-memory databases
    • Hermetic services
      • Leverage virtualization to deploy isolated instances of service dependencies
    • Record/replay I/O
      • Trades flakiness for brittleness
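As one instance of a lightweight substitution, Python's built-in `sqlite3` module can open an in-memory database so that a storage component is tested against a real SQL engine without any external service. `UserStore` is a hypothetical component invented for this sketch.

```python
import sqlite3


class UserStore:
    """Hypothetical component that talks to a database through a DB-API connection."""

    def __init__(self, connection):
        self._db = connection
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS users (name TEXT PRIMARY KEY, email TEXT)"
        )

    def add(self, name, email):
        self._db.execute("INSERT INTO users VALUES (?, ?)", (name, email))
        self._db.commit()

    def email_of(self, name):
        row = self._db.execute(
            "SELECT email FROM users WHERE name = ?", (name,)
        ).fetchone()
        return row[0] if row else None


def test_store_round_trip():
    # ":memory:" gives each test a fresh database that lives only in this process.
    connection = sqlite3.connect(":memory:")
    try:
        store = UserStore(connection)
        store.add("alice", "alice@example.com")
        assert store.email_of("alice") == "alice@example.com"
        assert store.email_of("bob") is None
    finally:
        connection.close()


test_store_round_trip()
```

The trade-off is fidelity: if production uses a different database engine, SQL dialect differences can hide bugs that only a hermetic instance of the real service would reveal.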
Integration environments
  • Production
    • Highest fidelity, esp. for load
    • Failures affect real users
    • Canarying: deploy to subset of production systems
      • E.g. internal users, early access
      • Can lead to version skew – incompatibility between concurrently-running components
    • Feature flags: Allow operators to quickly toggle between new and old implementation
  • Staging
    • Ideally configured just like production
    • Potentially high infrastructure cost, limited availability
    • Often can’t duplicate production load
    • Failures do not harm users
    • Can practice disaster recovery
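Canarying and feature flags are often combined. The following is a minimal sketch under assumed names: `FLAGS` stands in for whatever store operators can change at run time, and users are assigned to the canary group by hashing their ID so that the assignment is stable.

```python
import hashlib


def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically place a user in the canary group for the given percentage."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < percent


# Hypothetical flag store; in practice operators would change this at run time.
FLAGS = {"new_checkout": {"enabled": True, "canary_percent": 5}}


def old_checkout(user_id):
    return f"old:{user_id}"


def new_checkout(user_id):
    return f"new:{user_id}"


def checkout(user_id):
    flag = FLAGS["new_checkout"]
    if flag["enabled"] and in_canary(user_id, flag["canary_percent"]):
        return new_checkout(user_id)
    return old_checkout(user_id)
```

Setting `enabled` to `False` sends every user back to the old implementation without a redeploy, which is the quick toggle the flag exists to provide.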
Chaos engineering
  • Originated at Netflix (Chaos Monkey)
  • High-reliability, distributed systems must tolerate failure
  • Recovery procedures are often not sufficiently rehearsed – painful, risky
  • Deliberately inject failures in production environment
    • Tests system resiliency under realistic load
    • Encourages recovery automation
Continuous integration (“CI”)
  • Build and test whole systems regularly
    • Discover issues earlier
    • Reduce integration pain through automation and isolation of issues
    • Test beyond single developer’s resources
    • Eliminate reliance on developers’ discipline
    • Continuously monitor readiness of code
  • Applies to both development and release
    • Continuous build+test
    • Continuous delivery
CI decisions
  • How to compose systems along release workflow
  • Which tests to run when along release workflow
  • Typical setup
    • Pre-submit test suite gates all merges
      • Compilation and fast tests relevant to affected code
    • Post-submit test suite verifies subset of commits on trunk
      • Contains larger, more integrated tests
      • Blesses commits that pass as “green”
    • Release promotion pipeline verifies candidates for release
      • Contains even larger tests, may require dedicated resources
Automation, speed, & infrastructure
  • Builds, tests, and deployment must be automated and reliable
    • Ideally completely reproducible
  • Most steps must be fast to avoid impeding productivity
    • Cache build products
    • Skip unaffected tests
    • Parallelize & invest in compute resources
  • Benefits from tooling
    • Integration with version control and code review
      • Pre-merge and pre-release gates
      • “Last-known-good” branch (new work should branch from here, not trunk)
    • Bisect breakages
    • Log all results
    • Automatically rerun flaky tests
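Bisecting a breakage is a binary search over the revision history. The sketch below assumes a hypothetical `is_good` callback that builds and tests one revision, and that the history goes from good to bad exactly once.

```python
def first_bad(revisions, is_good):
    """Return the first revision for which is_good is False.

    Assumes revisions[0] is good, revisions[-1] is bad, and that once a
    revision is bad every later one is bad too.
    """
    low, high = 0, len(revisions) - 1  # invariant: low is good, high is bad
    while high - low > 1:
        mid = (low + high) // 2
        if is_good(revisions[mid]):
            low = mid
        else:
            high = mid
    return revisions[high]


# Usage: 100 revisions, breakage introduced at revision 37.
checks = []


def is_good(revision):
    checks.append(revision)
    return revision < 37


culprit = first_bad(list(range(100)), is_good)
print(culprit, len(checks))
```

The search needs only a logarithmic number of build-and-test runs, but it relies on a deterministic verdict per revision; a flaky test can send it down the wrong half.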
Multi-system CI
  • Without monorepo, need to assemble system from several asynchronously-versioned repositories
  • Large integration tests can’t check every revision/combination
  • Objective: identify “configurations” (revision combinations) suitable for promotion (larger-scale testing, release)

Dynamic analysis

Common dynamic analysis tools
  • Coverage
  • Debuggers
  • Memory checkers
  • Sanitizers
  • Profilers
Fuzz testing
  • Give program random input, look for crashes, assertion violations
  • Increased in popularity in 2010s; very effective at finding security vulnerabilities
  • Can be enhanced with coverage feedback
  • Use genetic algorithms, neural networks to construct input that exercises particular branches
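The basic loop (random input in, crashes out) fits in a few lines. This sketch has no coverage feedback, and `parse_length_prefixed` is a toy target with a deliberately planted bug; real fuzzers are far more sophisticated.

```python
import random


def parse_length_prefixed(data: bytes) -> bytes:
    """Toy target: the first byte is a length, followed by that many payload bytes."""
    length = data[0]  # planted bug: raises IndexError on empty input
    payload = data[1:1 + length]
    if len(payload) != length:
        raise ValueError("truncated payload")  # documented rejection, not a bug
    return payload


def fuzz(target, trials, seed):
    """Feed random byte strings to target; collect inputs that crash it."""
    rng = random.Random(seed)  # seeded so that any finding can be reproduced
    crashes = []
    for _ in range(trials):
        size = rng.randrange(0, 9)
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            target(data)
        except ValueError:
            pass  # input rejected cleanly
        except Exception as exc:  # anything else counts as a crash
            crashes.append((data, type(exc).__name__))
    return crashes


crashes = fuzz(parse_length_prefixed, trials=1000, seed=0)
for data, error in sorted(set(crashes)):
    print(f"crash on {data!r}: {error}")
```

Distinguishing documented rejections from unexpected exceptions is the important design choice here: without it, every malformed input would be reported as a finding.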
What is a performance bug?
  • Avoid premature optimization!
  • Does not meet deadlines / satisfy SLA
  • Responsiveness, smoothness do not meet requirements
    • 100 ms: GUI
    • 16-33 ms: Animation (60-30 fps)
    • 10 ms: MIDI, VR
  • Unexpected slowdown for certain inputs / DoS vulnerability
  • Performance regression (gradual and acute degradation)
  • Performance variability across platforms
  • Sub-optimal throughput for HPC
Performance testing challenges
  • How much room for improvement is there?
    • Amdahl’s law: Limits to speedup from parallelization, local optimization
    • Roofline analysis: Do you expect to be limited by bandwidth or compute?
  • Is slowdown localized, dispersed, or emergent?
  • Getting reliable measurements is difficult
    • Inconsistency, load dependency, JIT compilation, non-representative datasets, intrusive tooling
    • Average case vs. worst case, tail metrics
    • Tension between latency and bandwidth
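Amdahl's law gives a quick upper bound on what parallelizing or optimizing one part of a program can buy. A minimal sketch, where the 90% parallel fraction is an assumed example value:

```python
def amdahl_speedup(parallel_fraction, workers):
    """Overall speedup when parallel_fraction of the run time is sped up workers-fold."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Assume 90% of the run time parallelizes perfectly.
for workers in (2, 10, 100, 10_000):
    print(f"{workers:6d} workers -> {amdahl_speedup(0.9, workers):.2f}x")

# The serial 10% caps the speedup at 1 / 0.1 = 10x, however many workers are added.
```

The same formula applies to local optimization: making a function that accounts for 20% of the run time infinitely fast yields at most a 1.25x overall speedup.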
Latency vs. throughput
  • Latency: Duration between a single trigger and the system’s response
    • “Tail latency” (e.g. 95th percentile under a specified load) is more important than average
  • Throughput: Amount of work processed per unit time (equivalently, the time it takes to process a fixed amount of work)
    • Often a function of workload
      • Typically throughput increases with workload size up to a saturation point
    • Reduce overhead with batching
      • Typically at expense of latency
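Both points can be illustrated with small calculations. The latency samples and the cost model (a fixed per-batch overhead plus a per-item cost) are made-up assumptions, and the batch latency ignores time spent waiting for a batch to fill.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples <= it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds: mostly fast, with a slow tail.
latencies_ms = [10] * 90 + [200] * 10
mean_ms = statistics.mean(latencies_ms)  # 29: hides the tail
p50_ms = percentile(latencies_ms, 50)    # 10
p95_ms = percentile(latencies_ms, 95)    # 200: reflects the slow 10% of requests


def batch_cost(batch_size, overhead_ms=5.0, per_item_ms=1.0):
    """Toy model: each batch pays a fixed overhead plus a per-item cost."""
    batch_time_ms = overhead_ms + per_item_ms * batch_size
    items_per_second = batch_size / batch_time_ms * 1000
    return items_per_second, batch_time_ms


for size in (1, 10, 100):
    throughput, latency = batch_cost(size)
    print(f"batch={size:3d}  throughput={throughput:7.1f} items/s  batch time={latency:5.1f} ms")
```

In this model larger batches amortize the overhead and raise throughput toward a ceiling of 1000 items/s, while every item in the batch waits longer for its result.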
Discussion
  • Consider adding new elements to a sorted list (initial size N) while maintaining sorted order.
  • Scenario A: Elements are inserted into their proper position one at a time.
  • Scenario B: All elements are appended to the list, then the whole list is sorted (comparison sort).
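To make the discussion concrete, both scenarios are sketched below for an array-backed Python list. The function names are hypothetical; a different data structure (linked list, balanced tree) would change the cost of Scenario A.

```python
import bisect


def scenario_a(sorted_list, new_elements):
    """Insert each element at its proper position, one at a time."""
    result = list(sorted_list)
    for element in new_elements:
        # Binary search finds the position in O(log n) comparisons,
        # but inserting into an array shifts O(n) elements.
        bisect.insort(result, element)
    return result


def scenario_b(sorted_list, new_elements):
    """Append all elements, then sort the whole list once."""
    result = list(sorted_list)
    result.extend(new_elements)
    result.sort()  # comparison sort over all N + M elements
    return result


existing = [1, 3, 5, 7]
incoming = [6, 2, 8, 4]
assert scenario_a(existing, incoming) == scenario_b(existing, incoming)
```

Both produce the same list, so the question is about cost: Scenario A keeps the list sorted after every insertion (useful if it is queried in between), while Scenario B defers all the work to the end and leaves the list unsorted in the meantime.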