Every organization that has been collecting address data for more than a few years has the same problem: a large, messy address table with unknown quality. Addresses entered through web forms by users who weren't paying attention. Records migrated from a legacy CRM that didn't validate input. Bulk imports from a partner's Excel sheet in an incompatible format. Business locations that have moved since the data was collected.
Running logistics on bad address data causes failed deliveries. Running marketing on it wastes direct mail spend. Running real estate valuations on it produces incorrect catchment area analyses. Before any of those operations happen, the addresses need to be cleaned and validated.
Geocoding is the most effective way to do this at scale. Send a raw address string to the Geocoding API; receive back structured components, geographic coordinates, and a confidence score. Addresses that geocode cleanly with high confidence are almost certainly valid. Addresses that return low confidence or no match at all need human review.
This tutorial builds a complete Python script that reads a CSV of addresses, geocodes each one against the MapAtlas Geocoding API, writes the results with confidence scores, and flags low-confidence records for manual review. It handles rate limiting, transient network errors, and EU address format quirks that US-centric geocoding tutorials miss.
Why Address Data Goes Bad
Address quality degrades for predictable reasons:
Manual entry errors. Web form users type quickly, autocorrect mangles street names, and validation that accepts any non-empty string lets garbage through. A study of B2C checkout data found that 7–12% of manually entered addresses contain errors significant enough to cause delivery failure.
Business relocations. A B2B database collected two years ago will have approximately 10–15% of addresses no longer matching the current business location due to moves, mergers, and closures.
Format inconsistencies. Data collected from multiple sources uses different conventions: full country names vs. ISO codes, "St." vs. "Street", house-number-before-street vs. after, "flat" vs. "apt" vs. "wohnung". A geocoder normalizes all of these into structured output.
Legacy system migrations. Address fields that were split across multiple database columns often get concatenated during migration, losing structure. Free-text address fields from older systems may include notes, references, or formatting that isn't part of the actual address.
The geocoding approach handles all of these because it relies on geographic matching, not string matching. An address that's formatted incorrectly can still geocode correctly if the underlying location data matches.
Understanding Confidence Scores
The MapAtlas Geocoding API returns a confidence property on each feature, ranging from 0.0 to 1.0. It represents how closely the returned result matches the input query, accounting for format differences, abbreviations, and ambiguity.
| Confidence | Interpretation | Recommended action |
|---|---|---|
| 0.90 – 1.00 | Exact or near-exact match | Accept automatically |
| 0.85 – 0.89 | Strong match, minor format differences | Accept with logging |
| 0.60 – 0.84 | Partial match, street found but house number uncertain | Flag for manual review |
| 0.40 – 0.59 | Ambiguous, locality matched but not specific address | Reject or escalate |
| 0.00 – 0.39 | No meaningful match | Reject, likely invalid |
These thresholds are starting points. For a logistics operation where failed deliveries are expensive, tighten the automatic-accept threshold to 0.92+. For a marketing mailing list where the cost of a manual review outweighs the cost of a few bad addresses, loosen it to 0.80.
The API also returns a match_type property indicating what level of the address hierarchy matched: point (exact building), interpolated (position estimated between known house numbers), street (street found but house number not), or locality (only the city matched).
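The two signals are stronger together than either alone: a high confidence on a mere locality match is still not a deliverable address. Here is a minimal sketch of a decision function that uses match_type as a ceiling on the confidence-based decision (the thresholds are the illustrative ones from the table above, not an official rule):

```python
def classify(confidence: float, match_type: str) -> str:
    """Combine confidence with match_type for a stricter decision.
    match_type acts as a ceiling: only point/interpolated matches
    can be auto-accepted, however high the confidence."""
    if match_type in ('point', 'interpolated') and confidence >= 0.85:
        return 'accept'
    if match_type == 'street' or 0.60 <= confidence < 0.85:
        return 'review'
    return 'reject'

print(classify(0.91, 'point'))     # accept
print(classify(0.91, 'locality'))  # reject: city-level match is unusable
print(classify(0.70, 'street'))    # review
```

A variant of this logic appears in the validation script below, which keys off confidence alone for simplicity.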
The Python Validation Script
Install dependencies:
pip install requests pandas tqdm
The script reads a CSV with an address column (or separate street, city, country columns), geocodes each row, and writes a new CSV with validation results appended.
#!/usr/bin/env python3
"""
bulk_geocode.py: Validate a CSV of addresses using the MapAtlas Geocoding API.
Input CSV must have either:
- An 'address' column (full address string), or
- 'street', 'city', and 'country' columns (will be concatenated)
Outputs a new CSV with added columns:
geocoded_label, geocoded_lat, geocoded_lng, confidence, match_type, status
"""
import csv
import time
import logging
import requests
import pandas as pd
from tqdm import tqdm
from pathlib import Path
# ── Configuration ──────────────────────────────────────────────────────────────
API_KEY = 'YOUR_API_KEY'
API_BASE = 'https://api.mapatlas.eu/geocoding/v1/search'
INPUT_CSV = 'addresses.csv'
OUTPUT_CSV = 'addresses_validated.csv'
RATE_LIMIT_RPS = 5 # Requests per second (stay within your plan limits)
RETRY_ATTEMPTS = 3 # Retries on transient errors
RETRY_DELAY_S = 2.0 # Seconds between retries
# Confidence thresholds
ACCEPT_THRESHOLD = 0.85
REVIEW_THRESHOLD = 0.60
logging.basicConfig(level=logging.INFO, format='%(levelname)s %(message)s')
log = logging.getLogger(__name__)
# ── Geocoding function ─────────────────────────────────────────────────────────
def geocode_address(address_str: str) -> dict:
"""
Geocode a single address string. Returns a dict with result fields.
Retries on network errors and 429 rate-limit responses.
"""
params = {
'text': address_str,
'key': API_KEY,
'size': 1,
}
for attempt in range(RETRY_ATTEMPTS):
try:
resp = requests.get(API_BASE, params=params, timeout=10)
if resp.status_code == 429:
# Rate limited, wait and retry
wait = float(resp.headers.get('Retry-After', RETRY_DELAY_S * (attempt + 1)))
log.warning(f'Rate limited. Waiting {wait:.1f}s before retry {attempt + 1}.')
time.sleep(wait)
continue
resp.raise_for_status()
data = resp.json()
if not data.get('features'):
return {'geocoded_label': '', 'geocoded_lat': None, 'geocoded_lng': None,
'confidence': 0.0, 'match_type': 'no_match', 'status': 'reject'}
feature = data['features'][0]
props = feature['properties']
coords = feature['geometry']['coordinates'] # [lng, lat]
confidence = round(float(props.get('confidence', 0.0)), 4)
match_type = props.get('match_type', 'unknown')
if confidence >= ACCEPT_THRESHOLD:
status = 'accept'
elif confidence >= REVIEW_THRESHOLD:
status = 'review'
else:
status = 'reject'
return {
'geocoded_label': props.get('label', ''),
'geocoded_lat': round(coords[1], 6),
'geocoded_lng': round(coords[0], 6),
'confidence': confidence,
'match_type': match_type,
'status': status,
}
except requests.exceptions.RequestException as exc:
log.warning(f'Network error on attempt {attempt + 1}: {exc}')
if attempt < RETRY_ATTEMPTS - 1:
time.sleep(RETRY_DELAY_S * (attempt + 1))
# All retries exhausted
return {'geocoded_label': '', 'geocoded_lat': None, 'geocoded_lng': None,
'confidence': 0.0, 'match_type': 'error', 'status': 'error'}
# ── Main processing loop ───────────────────────────────────────────────────────
def main():
df = pd.read_csv(INPUT_CSV, dtype=str).fillna('')
# Build address string from available columns
if 'address' in df.columns:
df['_query'] = df['address']
elif all(c in df.columns for c in ['street', 'city', 'country']):
df['_query'] = df['street'] + ', ' + df['city'] + ', ' + df['country']
else:
raise ValueError("CSV must have 'address' or 'street'+'city'+'country' columns.")
results = []
sleep_interval = 1.0 / RATE_LIMIT_RPS
for query in tqdm(df['_query'], desc='Geocoding', unit='addr'):
result = geocode_address(query.strip())
results.append(result)
time.sleep(sleep_interval)
results_df = pd.DataFrame(results)
output_df = pd.concat([df.drop(columns=['_query']), results_df], axis=1)
output_df.to_csv(OUTPUT_CSV, index=False, quoting=csv.QUOTE_NONNUMERIC)
# Summary
total = len(output_df)
accept = (output_df['status'] == 'accept').sum()
review = (output_df['status'] == 'review').sum()
reject = (output_df['status'] == 'reject').sum()
errors = (output_df['status'] == 'error').sum()
log.info(f'\n── Results ────────────────────────────────')
log.info(f'Total processed : {total:,}')
log.info(f'Accept (≥{ACCEPT_THRESHOLD}) : {accept:,} ({accept/total:.1%})')
log.info(f'Review : {review:,} ({review/total:.1%})')
log.info(f'Reject : {reject:,} ({reject/total:.1%})')
log.info(f'Errors : {errors:,} ({errors/total:.1%})')
log.info(f'Output written to {OUTPUT_CSV}')
if __name__ == '__main__':
main()
Run it:
python bulk_geocode.py
For a 10,000-address CSV at 5 requests/second, the script completes in approximately 35 minutes. The progress bar (via tqdm) shows real-time throughput and estimated completion time.

[Image: Terminal screenshot showing the tqdm progress bar at 67% completion (6,700/10,000 addresses), throughput at 4.9 addr/s, estimated time remaining 17 minutes. Below the progress bar, a log line shows "WARNING Network error on attempt 1: Connection timeout" followed by a successful retry.]
Handling the Output
The output CSV adds six columns to your input data:
address, ..., geocoded_label, geocoded_lat, geocoded_lng, confidence, match_type, status
Filter by status to generate three output files:
# After running main(), split into acceptance tiers:
df = pd.read_csv('addresses_validated.csv')
df[df['status'] == 'accept'].to_csv('addresses_clean.csv', index=False)
df[df['status'] == 'review'].to_csv('addresses_review.csv', index=False)
df[df['status'] == 'reject'].to_csv('addresses_reject.csv', index=False)
print(f"Clean: {len(df[df['status']=='accept']):,} addresses ready for use")
print(f"Review: {len(df[df['status']=='review']):,} addresses for manual check")
print(f"Reject: {len(df[df['status']=='reject']):,} addresses to discard or fix")
The addresses_review.csv file is the one that needs human attention. Typical review patterns:
- Match type street (confidence 0.65–0.80): The street was found but not the house number. Likely a new building, a rural address with sparse coverage, or a typo in the house number. Check the original source.
- Match type locality (confidence 0.50–0.65): Only the city matched. The street name is likely misspelled or doesn't exist in the data. Look up the address in a postal directory.
- Low confidence on clearly valid-looking addresses: Check for a country/language mismatch. A Dutch address queried without a country restriction may geocode against a similarly named German or Belgian town.
EU Address Quirks to Know
Bulk geocoding guides written for the US market skip the format differences that trip up EU datasets. Here are the most common issues:
Germany: house number ordering. German addresses use {street} {number} format: Berliner Straße 42. Many CRMs store addresses in {number} {street} format because they were built for UK/US convention. If your German addresses are geocoding with low confidence, try reversing the number and street name in the query string before submission.
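The reversal can be done with a small heuristic before the query string is built. This is an illustrative sketch only (the regex assumes a simple leading house number, optionally with a letter suffix like 12a); verify it against your own data before applying it in bulk:

```python
import re

def normalize_german_order(street_field: str) -> str:
    """Move a leading house number ('42 Berliner Straße', UK/US
    convention) behind the street name ('Berliner Straße 42').
    Leaves already-correct German-order strings untouched."""
    m = re.match(r'^(\d+[a-zA-Z]?)\s+(.+)$', street_field.strip())
    if m and not m.group(2)[0].isdigit():
        return f'{m.group(2)} {m.group(1)}'
    return street_field

print(normalize_german_order('42 Berliner Straße'))  # Berliner Straße 42
print(normalize_german_order('Berliner Straße 42'))  # unchanged
```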
France: arrondissements. Paris addresses include an arrondissement (1st–20th) as part of the postal code: 75001 through 75020. Queries that omit the arrondissement and just use Paris are geocoded to the city centroid, not the specific district; this appears as a locality match with low confidence rather than an address match.
Netherlands: postal code format. Dutch postal codes follow a strict DDDD LL pattern (4 digits, space, 2 uppercase letters). Codes stored without the space (1012LG instead of 1012 LG) or with lowercase letters will geocode correctly, but if you're validating postal code format separately, normalize to uppercase with a space.
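A normalization step for the Dutch format might look like this sketch (the regex encodes the DDDD LL pattern described above; it is a format check only, not a lookup against real postal code ranges):

```python
import re

def normalize_nl_postcode(raw: str) -> 'str | None':
    """Normalize a Dutch postal code to the strict 'DDDD LL' form.
    Returns None when the input cannot match the pattern at all."""
    m = re.fullmatch(r'(\d{4})\s*([A-Za-z]{2})', raw.strip())
    if not m:
        return None
    return f'{m.group(1)} {m.group(2).upper()}'

print(normalize_nl_postcode('1012LG'))   # 1012 LG
print(normalize_nl_postcode('1012 lg'))  # 1012 LG
print(normalize_nl_postcode('10123'))    # None
```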
Belgium: language ambiguity. Some Belgian municipalities have different French and Dutch names (Liège/Luik, Gent/Gand). The API handles both, but inconsistent naming in your dataset (some records using French, some Dutch) may produce different confidence levels. Normalize to one language per region before geocoding.
Spain and Italy: street prefix variations. "Calle", "Carrer", "Via", "Viale" are all valid street type prefixes and may appear abbreviated ("C/") in records. The geocoder handles common abbreviations, but uncommon shortenings from legacy systems may need a normalization step.
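Such a normalization step can be as simple as a prefix lookup. The mapping below is illustrative, not exhaustive; extend it with whatever shortenings your legacy data actually contains:

```python
# Common legacy abbreviations mapped to their full prefixes
# (illustrative examples; extend per your own dataset).
PREFIX_EXPANSIONS = {
    'C/': 'Calle ',
    'Avda.': 'Avenida ',
    'Pza.': 'Plaza ',
}

def expand_prefix(street: str) -> str:
    """Replace a known abbreviated street prefix with its full form."""
    for abbrev, full in PREFIX_EXPANSIONS.items():
        if street.startswith(abbrev):
            return full + street[len(abbrev):].lstrip()
    return street

print(expand_prefix('C/Mayor 10'))   # Calle Mayor 10
print(expand_prefix('Via Roma 1'))   # unchanged
```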
Increasing Throughput With Concurrent Requests
The sequential script is safe and simple but slow. For larger datasets, replace the sequential loop with a concurrent executor:
from concurrent.futures import ThreadPoolExecutor, as_completed
def main_concurrent(max_workers=10):
df = pd.read_csv(INPUT_CSV, dtype=str).fillna('')
# ... (same setup as before) ...
results = [None] * len(df)
with ThreadPoolExecutor(max_workers=max_workers) as executor:
future_to_idx = {
executor.submit(geocode_address, row['_query'].strip()): idx
for idx, row in df.iterrows()
}
for future in tqdm(as_completed(future_to_idx), total=len(df), desc='Geocoding'):
idx = future_to_idx[future]
results[idx] = future.result()
# ... (same output writing as before) ...
With 10 concurrent workers, aggregate throughput reaches approximately 50 requests/second, so 10,000 addresses complete in under 4 minutes. Check your MapAtlas plan's concurrent request limit before increasing max_workers.
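Note that the concurrent version drops the per-request sleep, so nothing caps aggregate throughput. One way to stay within a plan limit is a shared limiter that every worker calls before issuing its request. This is a minimal sketch (a simple spacing lock, not a full token bucket), and the call into geocode_address in the comment refers to the function defined in the main script:

```python
import threading
import time

class RateLimiter:
    """Thread-safe limiter that spaces calls so aggregate
    throughput across all workers stays at or below max_rps."""
    def __init__(self, max_rps: float):
        self.interval = 1.0 / max_rps
        self.lock = threading.Lock()
        self.next_slot = time.monotonic()

    def wait(self):
        # Reserve the next available time slot under the lock,
        # then sleep outside the lock so other threads can queue up.
        with self.lock:
            now = time.monotonic()
            if now < self.next_slot:
                wait_for = self.next_slot - now
                self.next_slot += self.interval
            else:
                wait_for = 0.0
                self.next_slot = now + self.interval
        if wait_for > 0:
            time.sleep(wait_for)

limiter = RateLimiter(max_rps=50)

# Each worker calls the limiter before its request, e.g.:
#   limiter.wait()
#   result = geocode_address(query)   # function from the main script
```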
For organizations running address validation as a recurring operation (cleaning new CRM imports weekly, validating delivery addresses nightly), see the Logistics & Delivery solutions page for how MapAtlas integrations fit into operational workflows.
If you're building address autocomplete to prevent bad data from entering your system in the first place (catching errors at the source rather than cleaning them in batch), Address Autocomplete API: How One Field Lifts Checkout Conversion by 35% covers the frontend implementation.

[Image: Horizontal bar chart showing a hypothetical confidence score distribution for 10,000 addresses: 72% in the 0.85-1.00 (accept) range, 18% in the 0.60-0.84 (review) range, and 10% in the 0.00-0.59 (reject) range. A vertical dashed line at 0.85 marks the accept threshold.]
Working With the Coordinates Output
The validated output includes geocoded_lat and geocoded_lng for every accepted address. These coordinates open up analysis capabilities that weren't possible with raw address strings:
- Distance calculations. Compute straight-line distances between a warehouse and each delivery address to estimate shipping cost tiers.
- Catchment area analysis. Plot validated customer locations on a map to see where demand concentrates geographically.
- Delivery zone assignment. Assign each address to a delivery zone by testing whether its coordinates fall within a zone polygon.
- Duplicate detection. Two records with the same coordinates (within a few meters) are likely duplicates, even if the address strings differ in formatting.
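The duplicate-detection idea can be sketched with a plain great-circle distance check between coordinate pairs (the 10 m tolerance below is an arbitrary illustration; tune it for your data):

```python
import math

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two WGS84 points."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Two records whose address strings differ but whose geocoded
# coordinates nearly coincide are duplicate candidates:
d = haversine_m(52.37022, 4.89517, 52.37025, 4.89520)
print(d < 10)  # True: flag as a likely duplicate
```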
For NAP (Name/Address/Phone) consistency and its impact on AI search visibility, particularly relevant for local business address databases, NAP Consistency for AI Search: Why Mismatched Addresses Kill Your ChatGPT Visibility explains why geocoding-validated addresses are the right foundation for structured data markup.
Summary
Bulk geocoding with the MapAtlas Geocoding API gives you:
- Validated addresses with confidence scores so you know which records to trust automatically and which need review.
- Structured components (street, house number, postal code, city, country) normalized from any input format.
- Geographic coordinates for every accepted address, enabling spatial analysis and routing.
- EU address format support built into the API, not something you need to handle in preprocessing.
The Python script runs at 5 addresses/second sequentially, up to 50/second with concurrent workers. For 10,000 addresses, budget 4–40 minutes depending on concurrency. Output is a clean CSV with three tiers: accept, review, reject.
Sign up for a free MapAtlas API key to start. The Geocoding API supports bulk queries on all plans, see pricing for per-request rates and monthly free tier limits.
Frequently Asked Questions
How long does it take to geocode 10,000 addresses with the MapAtlas API?
At a conservative rate of 5 requests per second with a small sleep buffer, 10,000 addresses take approximately 35–40 minutes. With concurrent requests (10 workers) and appropriate rate limits, this drops to under 5 minutes. The Python script in this tutorial includes both sequential and concurrent options.
What confidence score threshold should I use for address validation?
A practical three-tier system works well: accept addresses with confidence above 0.85, flag for manual review between 0.60 and 0.85, and reject below 0.60. The exact thresholds depend on how critical address accuracy is for your use case; logistics operations should use stricter thresholds than marketing campaigns.
Does the MapAtlas Geocoding API handle EU-specific address formats?
Yes. The API correctly parses German (house number after street), French (house number before street), Dutch (4+2 postal codes), and other EU country address conventions. Queries in the local language return better results than transliterated or English-format queries.
