Web information gathering
Web Reconnaissance Guide
1. Overview of Web Reconnaissance
- Definition: Web reconnaissance is the preliminary phase of a security assessment, focusing on systematically gathering information about a target website or web application to understand its structure and potential weaknesses.
- Objectives:
- Asset Discovery: Identify public-facing elements such as web pages, subdomains, IP addresses, and technologies in use.
- Uncover Sensitive Data: Detect exposed information like configuration files or backup data.
- Map Attack Surface: Pinpoint vulnerabilities, misconfigurations, or entry points for potential exploitation.
- Gather Intelligence: Collect data for social engineering, such as key personnel details or email addresses.
- Significance:
- Attackers leverage reconnaissance to craft targeted attacks and evade defenses.
- Defenders use it to proactively identify and mitigate vulnerabilities.
- Reconnaissance Types:
- Active Reconnaissance:
- Involves direct interaction with the target (e.g., scanning ports or vulnerabilities).
- Techniques: Port scanning, vulnerability scanning, network mapping, banner grabbing, OS fingerprinting, service enumeration, web crawling.
- Tools: Nmap, Nessus, Nikto, Burp Suite Spider, curl.
- Risk: High chance of detection due to triggering intrusion detection systems (IDS) or firewalls.
- Passive Reconnaissance:
- Relies on publicly available data without direct target interaction.
- Techniques: Search engine queries, WHOIS lookups, DNS analysis, web archive reviews, social media scraping, code repository analysis.
- Tools: Google, WHOIS CLI, dig, Wayback Machine, LinkedIn, GitHub.
- Risk: Minimal detection risk, as it resembles typical internet activity.
2. WHOIS Protocol
- Definition: WHOIS is a query/response protocol used to retrieve registration details for internet resources like domains, IP addresses, and autonomous systems from public databases.
- Purpose: Acts as the internet’s directory, providing ownership and technical details for online assets.
- Key WHOIS Record Elements:
- Domain Name: e.g., example.com.
- Registrar: The entity managing the domain (e.g., GoDaddy).
- Registrant Contact: The domain owner (individual or organization).
- Administrative Contact: Manages domain operations.
- Technical Contact: Handles technical configurations.
- Creation/Expiration Dates: Registration and expiry dates.
- Name Servers: Resolve domain to IP addresses.
- Historical Context:
- Developed in the 1970s by Elizabeth Feinler and her team at the Stanford Research Institute's NIC for ARPANET.
- Originally tracked network users, hostnames, and domains.
- Relevance to Reconnaissance:
- Personnel Insights: Exposes names, emails, or phone numbers for social engineering or phishing.
- Infrastructure Mapping: Name servers and IPs reveal hosting providers or misconfigurations.
- Historical Analysis: Tools like WhoisFreaks track changes in ownership or configurations.
- Use Cases:
- Phishing Detection:
- A WHOIS lookup on a suspicious email domain reveals recent registration, hidden ownership, or shady hosting, indicating phishing.
- Action: Block the domain, alert users, investigate the hosting provider.
- Malware Investigation:
- A malware C2 server’s WHOIS shows anonymous emails, high-risk hosting countries, or lax registrars, suggesting a compromised server.
- Action: Notify the provider, escalate investigation.
- Threat Intelligence:
- Analyzing WHOIS data across threat actor domains uncovers patterns like clustered registrations or shared name servers.
- Action: Develop threat profiles, share indicators of compromise (IOCs).
- Using WHOIS:
- Installation: sudo apt update && sudo apt install whois -y (Linux).
- Command: whois example.com (e.g., whois google.com).
- Sample Output (google.com):
- Registrar: MarkMonitor Inc.
- Creation Date: 1997-09-15.
- Expiry Date: 2028-09-14.
- Registrant: Google LLC, Domain Admin.
- Domain Status: Protected (clientDeleteProhibited, clientTransferProhibited).
- Name Servers: ns1.google.com, ns2.google.com, etc.
- Insight: Well-secured, long-standing domain with Google-managed DNS.
- Limitations: May not reveal specific vulnerabilities or employee details; supplement with other recon methods.
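For scripted lookups, here is a minimal Python sketch (assuming the system whois client is installed) that shells out to whois and pulls a few common fields; registries label fields differently, so the patterns are illustrative rather than a complete parser.

```python
# Best-effort WHOIS summary; field labels vary between registries.
import subprocess

FIELDS = ("Registrar:", "Creation Date:", "Registry Expiry Date:", "Name Server:")

def whois_summary(domain):
    raw = subprocess.run(["whois", domain], capture_output=True, text=True).stdout
    summary = {}
    for line in raw.splitlines():
        stripped = line.strip()
        for field in FIELDS:
            if stripped.startswith(field):
                summary.setdefault(field.rstrip(":"), set()).add(stripped[len(field):].strip())
    return summary

if __name__ == "__main__":
    for key, values in whois_summary("google.com").items():
        print(f"{key}: {', '.join(sorted(values))}")
```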
3. Domain Name System (DNS)
- Definition: DNS translates user-friendly domain names (e.g., example.com) into IP addresses (e.g., 93.184.216.34), serving as the internet's navigation system.
- DNS Resolution Process:
- Query Initiation: A device checks its cache, then queries a DNS resolver (e.g., ISP server).
- Recursive Query: Resolver contacts a root name server.
- Root Response: Directs to a top-level domain (TLD) server (e.g., .com).
- TLD Response: Points to the authoritative name server.
- Authoritative Response: Provides the IP address.
- Resolver Delivery: Returns the IP to the device and caches it.
- Connection: Device connects to the target server.
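As a quick illustration of the end result of this process, the sketch below asks the operating system's stub resolver (which delegates the recursive lookup to the configured DNS server) for a hostname's addresses.

```python
# The stub resolver and its upstream DNS server do the recursive work;
# socket.getaddrinfo simply returns the final A/AAAA answers.
import socket

def resolve(hostname):
    results = socket.getaddrinfo(hostname, None)
    return sorted({entry[4][0] for entry in results})  # unique IPv4/IPv6 addresses

if __name__ == "__main__":
    print(resolve("example.com"))
```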
- Hosts File:
- Location: /etc/hosts (Linux/macOS), C:\Windows\System32\drivers\etc\hosts (Windows).
- Format: <IP> <Hostname> [<Alias>] (e.g., 127.0.0.1 localhost).
- Purpose: Local DNS overrides for testing, development, or blocking (e.g., 0.0.0.0 ads.example.com).
- Editing: Requires admin/root access; changes are immediate.
- Core DNS Concepts:
- Zone: A managed segment of a domain's namespace (e.g., example.com and its subdomains).
- Zone File: Contains resource records for a zone (e.g., A, MX, NS).
- DNS Record Types:
- A: Links hostname to IPv4 (e.g., www.example.com IN A 93.184.216.34).
- AAAA: Links hostname to IPv6.
- CNAME: Aliases a hostname to another (e.g., blog.example.com IN CNAME server1.example.net).
- MX: Defines mail servers (e.g., example.com IN MX 10 mail.example.com).
- NS: Lists authoritative name servers.
- TXT: Stores arbitrary text (e.g., SPF records).
- SOA: Specifies zone authority (e.g., serial number, refresh intervals).
- SRV: Indicates service locations.
- PTR: Maps IP to hostname for reverse DNS.
- IN: Denotes the Internet class in resource records.
- Reconnaissance Value:
- Asset Identification: Reveals subdomains, mail servers, and hosting infrastructure.
- Infrastructure Mapping: NS/A records pinpoint hosting providers or load balancers.
- Change Monitoring: New subdomains (e.g., api.example.com) or TXT records (e.g., security tools) signal new services or vulnerabilities.
- Sample Zone File:
```
$TTL 3600
@     IN  SOA   ns1.example.com. admin.example.com. (
                2025010101 3600 900 604800 86400 )
@     IN  NS    ns1.example.com.
@     IN  NS    ns2.example.com.
@     IN  MX    10 mail.example.com.
www   IN  A     93.184.216.34
mail  IN  A     198.51.100.10
ftp   IN  CNAME www.example.com.
```
4. DNS Reconnaissance Techniques
- Objective: Extract detailed infrastructure insights using DNS-focused tools.
- Key Tools:
- dig: Robust tool for detailed DNS queries.
- nslookup: Simple DNS lookup utility.
- host: Quick tool for A/AAAA/MX queries.
- dnsenum, fierce, dnsrecon: Advanced tools for subdomain enumeration and zone transfers.
- theHarvester: OSINT tool for collecting emails, subdomains, and hosts.
- Online Platforms: Web-based DNS lookup services for ease of use.
- Using dig:
- Commands:
- dig example.com: Retrieves A record.
- dig example.com MX: Lists mail servers.
- dig example.com NS: Shows name servers.
- dig example.com TXT: Displays text records.
- dig example.com CNAME: Finds aliases.
- dig example.com SOA: Gets zone authority.
- dig @8.8.8.8 example.com: Queries a specific server.
- dig +trace example.com: Traces the resolution path.
- dig -x 93.184.216.34: Performs reverse lookup.
- dig +short example.com: Provides minimal output.
- dig +noall +answer example.com: Shows only the answer section.
- dig example.com ANY: Requests all records (often restricted per RFC 8482).
- Caution: Excessive queries may trigger rate limits or detection; always obtain permission.
- Sample dig Output (example.com):
```
dig example.com

; <<>> DiG 9.18.24 <<>> example.com
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12345

;; QUESTION SECTION:
;example.com.                  IN      A

;; ANSWER SECTION:
example.com.           3600    IN      A       93.184.216.34

;; Query time: 10 msec
;; SERVER: 8.8.8.8#53(8.8.8.8) (UDP)
;; WHEN: Wed May 14 13:32:00 +06 2025
```
- Analysis:
- Header: Confirms successful query (NOERROR).
- Question: Requested A record for example.com.
- Answer: IP address 93.184.216.34.
- Metadata: Query time, server used.
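The same queries can be scripted. Below is a minimal sketch using the third-party dnspython package (pip3 install dnspython) that mirrors the dig lookups above for several record types; treat it as a starting point rather than a full enumerator.

```python
import dns.exception
import dns.resolver  # third-party: dnspython

RECORD_TYPES = ("A", "AAAA", "MX", "NS", "TXT", "SOA")

def enumerate_records(domain, nameserver=None):
    resolver = dns.resolver.Resolver()
    if nameserver:
        resolver.nameservers = [nameserver]  # e.g., "8.8.8.8"
    for rtype in RECORD_TYPES:
        try:
            answers = resolver.resolve(domain, rtype)
        except dns.exception.DNSException:
            continue  # no record of this type, NXDOMAIN, timeout, etc.
        for rdata in answers:
            print(f"{domain} {rtype} {rdata.to_text()}")

if __name__ == "__main__":
    enumerate_records("example.com")
```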
5. Advanced Reconnaissance: Subdomains, Zone Transfers, Virtual Hosts, and Certificate Transparency
5.1 Subdomains
- Definition: Subdomains are extensions of a primary domain (e.g., app.example.com for example.com), used to segment services like email, blogs, or admin portals.
- Reconnaissance Value:
- Development Environments: Subdomains like staging.example.com may be less secure, exposing sensitive data.
- Administrative Portals: Hidden subdomains (e.g., admin.example.com) may host login interfaces.
- Legacy Systems: Forgotten subdomains may run outdated, exploitable software.
- Data Exposure: Misconfigured subdomains may leak configurations or internal documents.
- Enumeration Methods:
- Active Enumeration:
- Directly queries target DNS servers.
- Techniques:
- Zone Transfers: Attempts to retrieve the full zone file (rarely successful due to modern security).
- Brute-Forcing: Tests subdomain names using wordlists.
- Tools: dnsenum, fierce, gobuster.
- Risk: Detectable by security systems.
- Passive Enumeration:
- Leverages external data sources without contacting the target.
- Techniques:
- Certificate Transparency Logs: Public SSL certificate records reveal subdomains.
- Search Engines: Queries like site:*.example.com uncover subdomains.
- DNS Databases: Aggregate historical DNS data.
- Risk: Low detection risk, highly stealthy.
- Best Practice: Combine active and passive techniques for thorough coverage.
5.2 Subdomain Brute-Forcing
- Definition: An active method that tests potential subdomain names against a domain using wordlists to identify valid subdomains.
- Workflow:
- Select Wordlist:
- Generic: Common terms (e.g., dev, mail, admin).
- Targeted: Industry-specific or based on observed naming conventions.
- Custom: Derived from recon data or keywords.
- Query Execution: Tool appends wordlist entries to the domain (e.g., test.example.com).
- DNS Resolution: Verifies if subdomains resolve to IPs via A/AAAA records.
- Validation: Filters results, optionally checks accessibility.
- Tools:
- dnsenum: Multifaceted DNS recon with brute-forcing, zone transfers, and WHOIS lookups.
- fierce: Streamlined for subdomain discovery with wildcard detection.
- dnsrecon, amass, assetfinder, puredns: Specialized for efficient subdomain enumeration.
- Example (dnsenum):
```
dnsenum --enum example.com -f /usr/share/seclists/Discovery/DNS/subdomains-top1million-5000.txt
```
- Output: Identifies www.example.com and mail.example.com resolving to 93.184.216.34.
- Features: Recursive brute-forcing, DNS record enumeration, leverages SecLists.
- Considerations:
- Generates noticeable DNS traffic; may trigger alerts.
- Use focused wordlists to minimize noise and improve accuracy.
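A stripped-down sketch of the same brute-forcing idea, assuming a local SecLists-style wordlist path (the path below is illustrative); note that wildcard DNS can produce false positives, which tools like fierce detect for you.

```python
import socket

def brute_force_subdomains(domain, wordlist_path):
    found = {}
    with open(wordlist_path) as wordlist:
        for line in wordlist:
            label = line.strip()
            if not label:
                continue
            candidate = f"{label}.{domain}"
            try:
                found[candidate] = socket.gethostbyname(candidate)
            except socket.gaierror:
                pass  # does not resolve; move on
    return found

if __name__ == "__main__":
    hits = brute_force_subdomains("example.com", "subdomains-top1million-5000.txt")
    for name, ip in sorted(hits.items()):
        print(f"{name} -> {ip}")
```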
5.3 DNS Zone Transfers
- Definition: A process for syncing DNS records between primary and secondary name servers to ensure consistency.
- Mechanism:
- AXFR Request: Secondary server requests a full zone transfer (AXFR).
- SOA Delivery: Primary sends the Start of Authority record.
- Record Transfer: Sends all records (A, MX, NS, etc.).
- Completion: Primary signals transfer completion.
- Confirmation: Secondary acknowledges receipt.
- Security Risk:
- Misconfigured servers may allow unauthorized AXFR requests, exposing the entire zone file.
- Exposed Data: Subdomains, IPs, mail servers, hosting details, and misconfigurations.
- Historical Note: Common in the early internet; now rare but misconfigurations persist.
- Exploitation Example:
- Tool: dig.
- Command: dig axfr @nsztm1.digi.ninja zonetransfer.me
- Output (zonetransfer.me):
```
zonetransfer.me.          7200 IN SOA  nsztm1.digi.ninja. robin.digi.ninja. ...
zonetransfer.me.          7200 IN A    5.196.105.14
zonetransfer.me.          7200 IN MX   0 ASPMX.L.GOOGLE.COM.
www.zonetransfer.me.      7200 IN A    5.196.105.14
internal.zonetransfer.me. 7200 IN A    127.0.0.1
...
```
- Insight: Exposes subdomains (internal.zonetransfer.me), IPs, and mail servers.
- Note: zonetransfer.me is a test domain for educational purposes.
- Mitigation:
- Limit zone transfers to authorized secondary servers.
- Regularly audit DNS configurations.
- Recon Value:
- Provides a complete DNS infrastructure map.
- Uncovers hidden subdomains (e.g., internal or staging servers).
- Even failed transfers reveal configuration details.
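The dig axfr request can also be reproduced programmatically; here is a sketch with dnspython, which most properly configured servers will simply refuse.

```python
import dns.query
import dns.zone  # third-party: dnspython

def try_zone_transfer(nameserver, domain):
    """Attempt an AXFR; expect a refusal unless the server is misconfigured."""
    try:
        zone = dns.zone.from_xfr(dns.query.xfr(nameserver, domain))
    except Exception as exc:
        print(f"Transfer failed or refused: {exc}")
        return
    for name, node in zone.nodes.items():
        for rdataset in node.rdatasets:
            print(name, rdataset)

if __name__ == "__main__":
    try_zone_transfer("nsztm1.digi.ninja", "zonetransfer.me")
```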
5.4 Virtual Hosts (VHosts)
- Definition: Virtual hosts enable multiple websites or domains to operate on a single server or IP, differentiated by the HTTP Host header.
- VHosts vs. Subdomains:
- Subdomains: Managed via DNS records (e.g., blog.example.com).
- VHosts: Server-side configurations; may not have DNS entries.
- Example (Apache):
```
<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /var/www/example
</VirtualHost>

<VirtualHost *:80>
    ServerName app.example.org
    DocumentRoot /var/www/app
</VirtualHost>
```
- Operation:
- Browser sends an HTTP request with a Host header (e.g., app.example.com).
- Server matches the Host header to a VHost configuration.
- Serves content from the corresponding document root.
- Accessing Undocumented VHosts:
- Edit the local hosts file (e.g., /etc/hosts or C:\Windows\System32\drivers\etc\hosts).
- Example: 93.184.216.34 hidden.example.com to bypass DNS.
- Virtual Hosting Types:
- Name-Based: Relies on Host header; efficient, widely used, but limited for some protocols (e.g., older SSL).
- IP-Based: Assigns unique IPs per site; flexible but IP-intensive.
- Port-Based: Uses different ports (e.g., :8080); less common due to user inconvenience.
- VHost Discovery (Fuzzing):
- Technique: Tests various hostnames against a server’s IP to identify active VHosts.
- Tool: Gobuster.
- Command: gobuster vhost -u http://example.com -w /usr/share/seclists/Discovery/DNS/subdomains-top1million-110000.txt --append-domain
- Output: Discovers app.example.com (Status: 200).
- Flags:
- -u: Target URL or IP.
- -w: Wordlist path.
- --append-domain: Adds base domain to queries.
- -t: Adjusts thread count for speed.
- -k: Skips SSL verification.
- -o: Exports results to a file.
- Considerations:
- Generates significant HTTP traffic; may trigger WAF/IDS.
- Requires explicit authorization to avoid legal issues.
- Review results for hidden portals or internal systems.
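A minimal Python sketch of the same name-based probing gobuster performs: every request targets one IP while the Host header varies, and differing status codes or response sizes hint at real VHosts. The IP and labels below are illustrative placeholders.

```python
import urllib.request

def probe_vhosts(ip, labels, base_domain):
    for label in labels:
        hostname = f"{label}.{base_domain}"
        request = urllib.request.Request(f"http://{ip}/", headers={"Host": hostname})
        try:
            with urllib.request.urlopen(request, timeout=5) as response:
                body = response.read()
                print(f"{hostname}: HTTP {response.status}, {len(body)} bytes")
        except Exception as exc:
            print(f"{hostname}: {exc}")

if __name__ == "__main__":
    probe_vhosts("93.184.216.34", ["www", "app", "staging", "admin"], "example.com")
```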
5.5 Certificate Transparency (CT) Logs
- Definition: Certificate Transparency logs are public, tamper-proof records of SSL/TLS certificate issuances, maintained by independent entities to ensure transparency.
- Purpose:
- Identify Unauthorized Certificates: Detect rogue or misissued certificates.
- Ensure CA Accountability: Monitor certificate authorities for improper practices.
- Enhance Web Security: Strengthen trust in the Public Key Infrastructure (PKI).
- Reconnaissance Value:
- Exposes subdomains listed in certificate Subject Alternative Name (SAN) fields.
- Reveals historical or expired subdomains (e.g., forgotten development servers).
- Offers reliable subdomain discovery without relying on brute-forcing or wordlists.
- Tools:
- crt.sh:
- Web-based platform and API for querying certificate data.
- Pros: Free, no signup required, intuitive interface.
- Cons: Basic filtering capabilities.
- Censys:
- Comprehensive platform for certificate and device discovery.
- Pros: Rich dataset, API support.
- Cons: Requires account (free tier available).
- Example (crt.sh):
```
curl -s "https://crt.sh/?q=example.com&output=json" | jq -r '.[] | select(.name_value | contains("dev")) | .name_value' | sort -u
```
- Output: dev.example.com, secure.dev.example.com, .dev.example.com.
- Breakdown:
- curl: Retrieves JSON data from crt.sh for example.com.
- jq: Filters entries whose name_value contains "dev" and extracts the subdomain names.
- sort -u: Sorts and removes duplicates.
- Advantages:
- Passive and stealthy; no direct interaction with the target.
- Uncovers obscure or historical subdomains missed by other methods.
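The curl/jq pipeline above translates directly to a small Python sketch against the same crt.sh JSON endpoint:

```python
import json
import urllib.request

def ct_subdomains(domain):
    url = f"https://crt.sh/?q={domain}&output=json"
    with urllib.request.urlopen(url, timeout=30) as response:
        entries = json.load(response)
    names = set()
    for entry in entries:
        # name_value may hold several newline-separated SAN entries.
        for name in entry.get("name_value", "").splitlines():
            names.add(name.strip().lower())
    return sorted(names)

if __name__ == "__main__":
    for name in ct_subdomains("example.com"):
        print(name)
```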
Practical Reconnaissance Strategies
- Comprehensive Approach:
- Use zone transfers (if misconfigured) for complete DNS insights, brute-forcing for active subdomain discovery, CT logs for passive enumeration, and VHost fuzzing to identify server-side configurations.
- Stealth Techniques:
- Prioritize passive methods like CT logs and search engine queries to avoid detection.
- For active methods, use targeted wordlists and rate-limited queries to minimize noise.
- Tool Integration:
- Leverage dig for zone transfers, dnsenum or gobuster for brute-forcing, and crt.sh for CT log analysis.
- Automate data extraction with scripts (e.g., curl and jq for CT logs).
- Validation:
- Verify discovered subdomains and VHosts with HTTP requests or manual inspection.
- Investigate anomalies, such as development servers or internal portals.
- Ethical Guidelines:
- Obtain explicit permission before performing active reconnaissance (e.g., brute-forcing, VHost fuzzing, zone transfer attempts).
- Adhere to rate limits to prevent server disruption.
- Continuous Monitoring:
- Regularly query CT logs for newly issued certificates and subdomains.
- Track DNS changes (e.g., via historical zone transfer data, if available).
Key Tools and Commands
- dig (Zone Transfer): dig axfr @<nameserver> <domain>
- dnsenum (Subdomain Brute-Forcing): dnsenum --enum <domain> -f <wordlist>
- gobuster (VHost Fuzzing): gobuster vhost -u http://<IP> -w <wordlist> --append-domain
- crt.sh (CT Logs): curl -s "https://crt.sh/?q=<domain>&output=json" | jq -r '.[] | .name_value' | sort -u
6. Web Crawling, Fingerprinting, robots.txt, and Well-Known URIs
6.1 Fingerprinting
- Definition: The process of identifying a website’s technical components (e.g., web server, OS, CMS, frameworks) to understand its technology stack and potential vulnerabilities.
- Reconnaissance Value:
- Targeted Exploitation: Enables attacks tailored to specific software versions.
- Misconfiguration Detection: Uncovers outdated software or insecure settings.
- Prioritization: Guides focus toward vulnerable systems.
- Holistic Profiling: Builds a detailed picture of the target’s infrastructure.
- Techniques:
- Banner Grabbing: Extracts software/version details from server responses.
- HTTP Header Analysis: Inspects headers like Server or X-Powered-By.
- Response Probing: Sends crafted requests to trigger unique responses.
- Content Inspection: Analyzes page structure, scripts, or metadata for clues.
- Tools:
- Wappalyzer: Browser extension for detecting CMS and frameworks.
- BuiltWith: Comprehensive tech stack analysis (free/paid tiers).
- WhatWeb: CLI tool for fingerprinting web technologies.
- Nmap: Network scanner with NSE scripts for OS and service detection.
- Netcraft: Provides tech, hosting, and security insights.
- wafw00f: Identifies Web Application Firewalls (WAFs).
- Example (example.com):
- Banner Grabbing (curl): curl -I example.com
- Output: Server: nginx/1.14.2, redirects to HTTPS, X-Powered-By: PHP/7.4.
- Insight: Reveals nginx server, PHP backend.
- wafw00f: pip3 install wafw00f, then wafw00f example.com
- Output: Detects Cloudflare WAF.
- Implication: Indicates robust security; adjust recon to bypass WAF restrictions.
- Nikto: nikto -h example.com -Tuning b
- Output:
- IP: 93.184.216.34.
- Server: nginx/1.14.2 (potentially outdated).
- CMS: Drupal detected via /CHANGELOG.txt.
- Headers: Missing Content-Security-Policy.
- Issues: Potential Drupal vulnerabilities, insecure headers.
- Considerations:
- WAFs may block aggressive probes; use subtle techniques.
- Combine fingerprinting with crawling for contextual insights.
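A small Python sketch of header-based banner grabbing, equivalent to curl -I; the header list is just a starting point for fingerprinting.

```python
import urllib.request

INTERESTING = ("Server", "X-Powered-By", "X-Generator", "Set-Cookie")

def grab_headers(url):
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        print(f"{url} -> HTTP {response.status}")
        for header in INTERESTING:
            value = response.headers.get(header)
            if value:
                print(f"  {header}: {value}")

if __name__ == "__main__":
    grab_headers("https://example.com")
```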
6.2 Web Crawling
- Definition: An automated process (spidering) that systematically navigates a website by following links to collect data like pages, files, and metadata.
- Process:
- Begin with a seed URL (e.g., homepage).
- Fetch and parse the page, extracting links.
- Queue and crawl links iteratively.
- Crawling Approaches:
- Breadth-First: Explores all links on a page before diving deeper; ideal for mapping site structure.
- Depth-First: Pursues one link path deeply; suited for targeting specific content.
- Collected Data:
- Links: Internal (site hierarchy) and external (third-party connections).
- Comments: May reveal sensitive details (e.g., developer notes, software versions).
- Metadata: Includes titles, descriptions, keywords, or authorship.
- Sensitive Files: Configs (config.php), backups (.bak), logs (access.log), or credentials.
- Reconnaissance Value:
- Maps site architecture and uncovers hidden pages.
- Identifies exploitable files or comments.
- Enables contextual analysis (e.g., linking comments to exposed directories).
- Example:
- Crawling reveals /backups/ with directory listing enabled, exposing database.sql.
- A comment referencing "legacy API" combined with /api/ discovery suggests outdated endpoints.
- Considerations:
- Analyze findings holistically to connect data points.
- Avoid server overload by limiting request rates.
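Below is a minimal breadth-first crawler sketch using only the standard library; it stays on the seed host and records links and HTML comments, two of the data points discussed above. The regex-based parsing is deliberately simplistic.

```python
import re
import urllib.parse
import urllib.request
from collections import deque

def crawl(seed, max_pages=20):
    host = urllib.parse.urlparse(seed).netloc
    queue, seen = deque([seed]), {seed}
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue
        for comment in re.findall(r"<!--(.*?)-->", html, re.S):
            print(f"[comment] {url}: {comment.strip()[:80]}")
        for href in re.findall(r'href=["\'](.*?)["\']', html):
            link = urllib.parse.urljoin(url, href)
            if urllib.parse.urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)  # breadth-first: enqueue at the tail
    return seen

if __name__ == "__main__":
    for page in sorted(crawl("http://example.com")):
        print(page)
```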
6.3 Web Crawling Tools
- Purpose: Automate crawling to streamline data collection and focus on analysis.
- Key Tools:
- Burp Suite Spider: Active crawler for mapping web applications and identifying vulnerabilities.
- OWASP ZAP: Open-source scanner with a spider for manual or automated vulnerability discovery.
- Scrapy: Python framework for building custom crawlers tailored to specific needs.
- Apache Nutch: Scalable Java crawler for large or focused crawls; requires configuration expertise.
- Scrapy Example (example.com):
- Setup: pip3 install scrapy
- Custom Spider: scrapy crawl recon -a url=http://example.com -o results.json (a minimal spider sketch follows the data-structure table below).
- Output (results.json):
- emails: info@example.com, support@example.com.
- links: Internal (/about), external (cdn.example.net).
- external_files: report.pdf.
- js_files: main.js, vendor.js.
- form_fields, images, videos, audio, comments (e.g., <!-- debug mode -->).
- Data Structure:
| Key | Description |
| --- | --- |
| emails | Email addresses on the site |
| links | Internal/external URLs |
| external_files | Downloadable files (e.g., PDFs) |
| js_files | JavaScript files |
| form_fields | Form input fields |
| images | Image URLs |
| videos | Video URLs |
| audio | Audio URLs |
| comments | HTML comments |
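The "recon" spider invoked above is not shown in this guide; the following is a hypothetical minimal version, with illustrative selectors and field names chosen to echo the results.json keys. It can live in a Scrapy project (scrapy crawl recon) or run standalone with scrapy runspider recon_spider.py -a url=... -o results.json.

```python
import scrapy

class ReconSpider(scrapy.Spider):
    name = "recon"

    def __init__(self, url="http://example.com", *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [url]

    def parse(self, response):
        yield {
            "url": response.url,
            "emails": response.xpath('//a[starts-with(@href, "mailto:")]/@href').getall(),
            "links": response.css("a::attr(href)").getall(),
            "js_files": response.css("script::attr(src)").getall(),
            "comments": response.xpath("//comment()").getall(),
        }
        # Follow relative (same-site) links so the crawl covers more than the seed page.
        for href in response.css("a::attr(href)").getall():
            if href.startswith("/"):
                yield response.follow(href, callback=self.parse)
```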
- Ethical Considerations:
- Secure permission before crawling.
- Respect server limits to avoid disruption.
- Reconnaissance Value:
- Provides structured data for mapping site functionality.
- Highlights entry points like forms or sensitive files.
6.4 robots.txt
- Definition: A text file located at a website's root (e.g., example.com/robots.txt) that adheres to the Robots Exclusion Standard, instructing crawlers on allowed or restricted paths.
- Format:
- User-agent: Specifies bots (e.g., * for all, Bingbot for Bing).
- Directives:
- Disallow: Blocks paths (e.g., /private/).
- Allow: Permits paths (e.g., /public/).
- Crawl-delay: Sets delay between requests (e.g., Crawl-delay: 5).
- Sitemap: Links to sitemap (e.g., Sitemap: https://example.com/sitemap.xml).
- Example:
```
User-agent: *
Disallow: /admin/
Disallow: /internal/
Allow: /blog/

User-agent: Googlebot
Crawl-delay: 5

Sitemap: https://example.com/sitemap.xml
```
- Insight: Suggests /admin/ and /internal/ may contain sensitive content.
- Purpose of robots.txt:
- Prevents server overload from excessive crawling.
- Protects sensitive areas from search engine indexing.
- Ensures compliance with site policies.
- Reconnaissance Value:
- Hidden Paths: Disallow entries (e.g., /admin/) hint at sensitive directories.
- Site Layout: Allowed/disallowed paths reveal structure.
- Security Awareness: Traps or honeypot paths indicate defensive measures.
- Considerations:
- Respect robots.txt during ethical reconnaissance.
- Manually explore Disallow paths for potential insights.
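A short sketch that fetches robots.txt and lists its Disallow paths as manual-review leads:

```python
import urllib.parse
import urllib.request

def disallowed_paths(base_url):
    robots_url = urllib.parse.urljoin(base_url, "/robots.txt")
    with urllib.request.urlopen(robots_url, timeout=10) as response:
        text = response.read().decode("utf-8", errors="replace")
    paths = []
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths

if __name__ == "__main__":
    for path in disallowed_paths("https://example.com"):
        print(path)
```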
6.5 Well-Known URIs
- Definition: A standardized directory (/.well-known/) defined by RFC 8615, hosted at a site's root, containing metadata, configurations, and service details; the registry of well-known URIs is managed by IANA.
- Common URIs:
- security.txt (RFC 9116): Provides security contact information.
- change-password: Points to the password reset page.
- openid-configuration: Supplies OpenID Connect metadata.
- assetlinks.json: Verifies app or asset ownership.
- mta-sts.txt: Defines email security policies (MTA-STS).
- OpenID Connect Example:
- URL: https://example.com/.well-known/openid-configuration
- JSON Output:
- Endpoints for authorization, token issuance, and user info.
- jwks_uri for cryptographic keys.
- Supported scopes, response types, and algorithms.
- Recon Value:
- Maps authentication infrastructure.
- Exposes security configurations (e.g., signing algorithms).
- Reconnaissance Value:
- Reveals critical endpoints and configurations.
- Provides structured metadata for understanding site functionality.
- Methodology:
- Consult IANA’s well-known URI registry.
- Probe paths like curl https://example.com/.well-known/security.txt.
- Considerations:
- Passive method with minimal detection risk.
- Combine with crawling to map site features comprehensively.
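A sketch that probes a handful of registered well-known URIs; the list is a small sample from IANA's registry, not an exhaustive one.

```python
import urllib.error
import urllib.parse
import urllib.request

WELL_KNOWN = ("security.txt", "change-password", "openid-configuration",
              "assetlinks.json", "mta-sts.txt")

def probe_well_known(base_url):
    for path in WELL_KNOWN:
        url = urllib.parse.urljoin(base_url, f"/.well-known/{path}")
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                print(f"{url} -> HTTP {response.status}")
        except urllib.error.HTTPError as err:
            print(f"{url} -> HTTP {err.code}")
        except Exception as exc:
            print(f"{url} -> {exc}")

if __name__ == "__main__":
    probe_well_known("https://example.com")
```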
Practical Strategies for Crawling and Analysis
- Integrated Workflow:
- Fingerprinting: Identify technologies (e.g., nginx, Drupal) to prioritize vulnerability research.
- Crawling: Map site structure and extract links, files, or comments using tools like Scrapy.
- robots.txt: Investigate Disallow paths (e.g., /internal/) for sensitive content.
- Well-Known URIs: Check /.well-known/ for security or authentication details.
- Stealth Techniques:
- Focus on passive methods (e.g., robots.txt, well-known URIs) to avoid detection.
- Limit crawl intensity and respect Crawl-delay directives.
- Tool Synergy:
- curl: Fetch headers (curl -I) for quick fingerprinting.
- wafw00f/Nikto: Detect WAFs and vulnerabilities.
- Scrapy: Automate structured data collection.
- Validation:
- Manually verify sensitive paths from robots.txt or crawling results.
- Test well-known URIs to confirm active endpoints.
- Ethical Guidelines:
- Secure explicit authorization for active reconnaissance (e.g., crawling, fingerprinting).
- Avoid excessive requests to respect server resources.
- Contextual Analysis:
- Combine findings (e.g., Drupal from fingerprinting, /backups/ from crawling, /admin/ from robots.txt) to uncover exploitable weaknesses.
Key Tools and Commands
- curl (Banner Grabbing): curl -I https://example.com
- wafw00f (WAF Detection): wafw00f example.com
- Nikto (Fingerprinting): nikto -h example.com -Tuning b
- Scrapy (Crawling): scrapy crawl recon -a url=http://example.com -o results.json
- robots.txt Check: curl https://example.com/robots.txt
- Well-Known URIs: curl https://example.com/.well-known/security.txt
7. Search Engine Discovery, Web Archives, and Automation
7.1 Search Engine Discovery
- Definition: Using search engines for Open Source Intelligence (OSINT) to collect data on targets (e.g., websites, organizations) through advanced query techniques.
- Reconnaissance Value:
- Accessibility: Public, legal, and cost-free.
- Broad Coverage: Indexes extensive web content.
- Simplicity: Requires minimal technical expertise.
- Applications:
- Security Assessments: Identify exposed data, vulnerabilities, or entry points.
- Competitive Analysis: Gather insights on competitors’ strategies or technologies.
- Investigations: Uncover hidden relationships or activities.
- Threat Intelligence: Monitor malicious actors and predict attack patterns.
- Limitations:
- Incomplete indexing of web content.
- Restricted access to protected or unindexed data.
- Search Operators:
| Operator | Description | Example | Use Case |
| --- | --- | --- | --- |
| site: | Restricts to a domain | site:example.com | Map all pages on a domain |
| inurl: | Searches URL for term | inurl:admin | Find admin panels |
| filetype: | Targets file type | filetype:pdf | Locate documents |
| intitle: | Searches page title | intitle:"login portal" | Find login pages |
| intext: | Searches page content | intext:"confidential" | Find sensitive content |
| cache: | Views cached page | cache:example.com | Access past content |
| link: | Finds linking pages | link:example.com | Discover external links |
| related: | Finds similar sites | related:example.com | Identify comparable sites |
| numrange: | Searches number range | site:example.com numrange:2020-2025 | Find pages with specific numbers |
| allintext: | All terms in content | allintext:admin password | Precise content search |
| allinurl: | All terms in URL | allinurl:login panel | URLs with multiple terms |
| AND, OR, NOT | Logical operators | site:example.com NOT inurl:blog | Refine queries |
| * | Wildcard | site:example.com user*guide | Match variations |
| .. | Range search | site:example.com "price" 100..500 | Find price ranges |
| "" | Exact phrase | "security policy" | Exact matches |
| - | Excludes term | site:example.com -inurl:signup | Exclude irrelevant pages |
- Google Dorking:
- Advanced queries to uncover sensitive data or vulnerabilities.
- Examples:
- Login Pages: site:example.com inurl:(login | dashboard)
- Exposed Files: site:example.com filetype:(pdf | xlsx)
- Config Files: site:example.com inurl:config
- Backups: site:example.com filetype:bak
- Resource: Exploit-DB’s Google Hacking Database for curated dorks.
- Considerations:
- Passive method with low detection risk.
- Combine with web archives or crawling for deeper insights.
- Manually verify results to filter false positives.
7.2 Web Archives
- Definition: Repositories like the Internet Archive’s Wayback Machine that preserve historical snapshots of websites, capturing content, design, and functionality.
- Wayback Machine Mechanics:
- Crawling: Bots capture webpages, including HTML, CSS, JavaScript, and media.
- Storage: Snapshots are timestamped and archived.
- Retrieval: Users access snapshots by URL and date.
- Snapshot Frequency:
- Varies by site popularity and archive resources.
- High-traffic sites: Frequent snapshots (e.g., daily).
- Niche sites: Infrequent snapshots (e.g., yearly).
- Limitations:
- Incomplete capture of dynamic or restricted content.
- Site owners may request exclusion (not always enforced).
- Reconnaissance Value:
- Hidden Assets: Exposes old subdomains, directories, or files.
- Change Analysis: Tracks site evolution (e.g., tech upgrades, design changes).
- OSINT Insights: Reveals past strategies, personnel, or technologies.
- Stealth: Passive with no target interaction.
- Example (example.com):
- Access the Wayback Machine, enter example.com, and select a 2018 snapshot.
- Insight: Reveals a discontinued /forum/ subdirectory or outdated CMS.
- Considerations:
- Analyze snapshots for forgotten assets or vulnerabilities.
- Compare historical and current data to identify changes.
- Cross-reference with search engine results for anomalies.
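Beyond the web interface, the Wayback Machine exposes a CDX API for listing captures; the sketch below pulls de-duplicated archived URLs for a domain.

```python
import json
import urllib.parse
import urllib.request

def wayback_urls(domain, limit=20):
    params = urllib.parse.urlencode({
        "url": f"{domain}/*",
        "output": "json",
        "collapse": "urlkey",        # one row per unique URL
        "fl": "timestamp,original",  # fields to return
        "limit": limit,
    })
    url = f"https://web.archive.org/cdx/search/cdx?{params}"
    with urllib.request.urlopen(url, timeout=30) as response:
        rows = json.load(response)
    return rows[1:] if rows else []  # first row is the header

if __name__ == "__main__":
    for timestamp, original in wayback_urls("example.com"):
        print(timestamp, original)
```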
7.3 Automating Reconnaissance
- Definition: Employing tools and frameworks to automate repetitive reconnaissance tasks for efficiency and consistency.
- Benefits:
- Speed: Accelerates data collection.
- Scalability: Supports multiple targets or domains.
- Accuracy: Minimizes human error.
- Versatility: Covers DNS, subdomains, crawling, and scanning.
- Integration: Combines with other tools for streamlined workflows.
- Key Frameworks:
- FinalRecon: Python tool for headers, WHOIS, SSL, crawling, DNS, subdomains, and web archives.
- Recon-ng: Modular framework for DNS, subdomains, crawling, and exploit discovery.
- theHarvester: Collects emails, subdomains, and hosts from public sources.
- SpiderFoot: OSINT tool for domains, emails, social media, and scanning.
- OSINT Framework: Curated toolset for search engines, social media, and public records.
- FinalRecon Example (example.com):
- Setup:
```
git clone https://github.com/thewhiteh4t/FinalRecon.git
cd FinalRecon
pip3 install -r requirements.txt
chmod +x finalrecon.py
```
- Command: ./finalrecon.py --headers --whois --url http://example.com
- Output:
- Headers: Server: nginx/1.14.2, Content-Type: text/html.
- WHOIS:
- Domain: example.com.
- Registrar: Example Registrar.
- Creation: 1995-08-13.
- Expiry: 2026-08-12.
- Name Servers: ns1.example.com, ns2.example.com.
- Export: Results saved to ~/.local/share/finalrecon/dumps/.
- Options:
- --headers: Fetches HTTP headers.
- --whois: Performs WHOIS lookup.
- --sslinfo: Analyzes SSL certificates.
- --crawl: Crawls the site.
- --dns: Enumerates DNS records.
- --sub: Discovers subdomains.
- --dir: Scans directories.
- --wayback: Queries Wayback Machine.
- --ps: Conducts port scanning.
- --full: Runs all modules.
- Additional: -w (wordlist), -e (file extensions), -o (output format).
- Considerations:
- Active methods (e.g., scanning, crawling) may trigger detection; use cautiously.
- Obtain authorization to ensure legal and ethical compliance.
- Tailor modules to the target’s context for optimal results.
Practical Strategies for Search and Automation
- Integrated Workflow:
- Search Engine Discovery: Use Google Dorks to identify login pages or exposed files; validate findings manually.
- Web Archives: Query Wayback Machine for historical subdomains or technologies; compare with current data.
- Automation: Leverage FinalRecon for broad reconnaissance, supplemented by targeted tools like Nikto or Scrapy.
- Stealth Techniques:
- Emphasize passive methods (e.g., search engines, web archives) to minimize detection.
- Apply rate-limiting to automated scans to respect server limits.
- Tool Synergy:
- Google: site:example.com filetype:pdf to find documents.
- Wayback Machine: Access via archive.org for historical snapshots.
- FinalRecon: Use --sub, --crawl, --wayback for comprehensive data collection.
- Validation:
- Verify dork results (e.g., config files) through manual checks.
- Confirm Wayback findings against the live site for relevance.
- Review automated outputs for actionable vulnerabilities.
- Ethical Guidelines:
- Secure explicit permission for active reconnaissance (e.g., FinalRecon scans).
- Adhere to robots.txt and site terms during crawling.
- Contextual Analysis:
- Combine dork findings (e.g., /dashboard/ from inurl:dashboard), Wayback data (e.g., old CMS), and FinalRecon headers (e.g., nginx) to uncover exploitable insights.