
MongoDB

Certified

Important Capabilities

| Capability | Status | Notes |
|---|---|---|
| Table-Level Lineage | | Enabled by default |

This plugin extracts the following:

  • Databases and associated metadata
  • Collections in each database and schemas for each collection (via schema inference)

By default, schema inference samples 1,000 documents from each collection. Setting schemaSamplingSize: null scans the entire collection instead. Setting useRandomSampling: False takes the first documents found rather than a random selection, which may be faster for large collections.

Note that schemaSamplingSize has no effect if enableSchemaInference: False is set.

Very large schemas are truncated to a maximum of 300 schema fields. This limit is configurable using the maxSchemaSize parameter.
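To make the interaction between these options concrete, here is a minimal sketch of sampling-based schema inference over in-memory documents. It is not the plugin's actual implementation: the function names, the dotted-path flattening, and the rule of keeping the most frequently seen fields when truncating are all assumptions made for illustration.

```python
import random
from collections import Counter
from typing import Any, Dict, Iterable, List, Optional


def flatten(doc: Dict[str, Any], prefix: str = "") -> Iterable[str]:
    """Yield dotted field paths for a (possibly nested) document."""
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        yield path
        if isinstance(value, dict):
            yield from flatten(value, path)


def infer_fields(
    docs: List[Dict[str, Any]],
    sampling_size: Optional[int] = 1000,   # mirrors schemaSamplingSize
    use_random_sampling: bool = True,      # mirrors useRandomSampling
    max_schema_size: int = 300,            # mirrors maxSchemaSize
    seed: Optional[int] = None,            # hypothetical, for repeatable sampling
) -> List[str]:
    """Return at most max_schema_size field paths seen in the sampled documents."""
    if sampling_size is None or sampling_size >= len(docs):
        sample = docs                       # scan everything
    elif use_random_sampling:
        sample = random.Random(seed).sample(docs, sampling_size)
    else:
        sample = docs[:sampling_size]       # first documents found

    counts = Counter(path for doc in sample for path in flatten(doc))
    # Assumed truncation rule: keep the most common fields, ties broken by name.
    ranked = sorted(counts, key=lambda path: (-counts[path], path))
    return ranked[:max_schema_size]
```

For example, `infer_fields(docs, sampling_size=None)` considers every document, while `infer_fields(docs, sampling_size=100, use_random_sampling=False)` only looks at the first 100.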

CLI based Ingestion

Install the Plugin

pip install 'acryl-datahub[mongodb]'

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
  type: "mongodb"
  config:
    # Coordinates
    connect_uri: "mongodb://localhost"

    # Credentials
    username: admin
    password: password
    authMechanism: "DEFAULT"

    # Options
    enableSchemaInference: True
    useRandomSampling: True
    maxSchemaSize: 300

sink:
  # sink configs
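The same recipe can also be expressed as a Python dictionary and run through DataHub's `Pipeline` class instead of the CLI. The sketch below is illustrative: `build_recipe` and `run_recipe` are hypothetical helper names, and the `datahub-rest` sink pointing at `http://localhost:8080` is an assumption that should be replaced with your own sink configuration.

```python
from typing import Any, Dict


def build_recipe(connect_uri: str, username: str, password: str) -> Dict[str, Any]:
    """Build a recipe dict equivalent to the starter YAML recipe."""
    return {
        "source": {
            "type": "mongodb",
            "config": {
                # Coordinates
                "connect_uri": connect_uri,
                # Credentials
                "username": username,
                "password": password,
                "authMechanism": "DEFAULT",
                # Options
                "enableSchemaInference": True,
                "useRandomSampling": True,
                "maxSchemaSize": 300,
            },
        },
        # Assumption: a DataHub REST sink on localhost; adjust for your deployment.
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }


def run_recipe(recipe: Dict[str, Any]) -> None:
    """Run the recipe; requires acryl-datahub[mongodb] to be installed."""
    # Imported lazily so the recipe can be built without DataHub installed.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()
```

Calling `run_recipe(build_recipe("mongodb://localhost", "admin", "password"))` would start an ingestion run against a local MongoDB instance, assuming both MongoDB and the sink are reachable.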

Config Details

Note that a . is used to denote nested fields in the YAML recipe.
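As an illustration of the dot notation, the following sketch expands dotted field names into the nested structure they stand for in a recipe's config block. The helper name `expand_dotted` is hypothetical and is not part of DataHub.

```python
from typing import Any, Dict


def expand_dotted(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Turn {'a.b': 1} into {'a': {'b': 1}}."""
    nested: Dict[str, Any] = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            # Walk down, creating intermediate mappings as needed.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


config = expand_dotted(
    {
        "maxSchemaSize": 300,
        "collection_pattern.allow": ["^orders$"],
        "collection_pattern.ignoreCase": True,
    }
)
```

Here `collection_pattern.allow` ends up as the `allow` key nested under `collection_pattern`, which is how it would be written in the YAML recipe.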

| Field [Required] | Type | Description | Default | Notes |
|---|---|---|---|---|
| authMechanism | string | MongoDB authentication mechanism. | | |
| connect_uri | string | MongoDB connection URI. | mongodb://localhost | |
| enableSchemaInference | boolean | Whether to infer schemas. | True | |
| maxDocumentSize | integer | | 16793600 | |
| maxSchemaSize | integer | Maximum number of fields to include in the schema. | 300 | |
| options | object | Additional options to pass to pymongo.MongoClient(). | {} | |
| password | string | MongoDB password. | | |
| schemaSamplingSize | integer | Number of documents to use when inferring schema size. If set to 0, all documents will be scanned. | 1000 | |
| useRandomSampling | boolean | If documents for schema inference should be randomly selected. If False, documents will be selected from start. | True | |
| username | string | MongoDB username. | | |
| env | string | The environment that all assets produced by this connector belong to | PROD | |
| collection_pattern | AllowDenyPattern | regex patterns for collections to filter in ingestion. | {'allow': ['.*'], 'deny': [], 'ignoreCase': True} | |
| collection_pattern.allow | array(string) | | | |
| collection_pattern.deny | array(string) | | | |
| collection_pattern.ignoreCase | boolean | Whether to ignore case sensitivity during pattern matching. | True | |
| database_pattern | AllowDenyPattern | regex patterns for databases to filter in ingestion. | {'allow': ['.*'], 'deny': [], 'ignoreCase': True} | |
| database_pattern.allow | array(string) | | | |
| database_pattern.deny | array(string) | | | |
| database_pattern.ignoreCase | boolean | Whether to ignore case sensitivity during pattern matching. | True | |
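The allow/deny patterns above can be read as a simple regex filter. The sketch below shows one plausible interpretation and is not DataHub's own code: it assumes that deny patterns take precedence over allow patterns and that patterns are matched from the start of the name with `re.match`. The function name `is_allowed` is hypothetical.

```python
import re
from typing import List, Optional


def is_allowed(
    name: str,
    allow: Optional[List[str]] = None,
    deny: Optional[List[str]] = None,
    ignore_case: bool = True,
) -> bool:
    """Return True if name passes the allow list and is not denied."""
    allow = [".*"] if allow is None else allow   # default: allow everything
    deny = [] if deny is None else deny          # default: deny nothing
    flags = re.IGNORECASE if ignore_case else 0

    # Assumption: a deny match always wins over an allow match.
    if any(re.match(pattern, name, flags) for pattern in deny):
        return False
    return any(re.match(pattern, name, flags) for pattern in allow)


# Keep only collections starting with "orders", but skip archived ones.
names = ["orders", "Orders_2024", "orders_archive", "users"]
kept = [
    n for n in names
    if is_allowed(n, allow=["orders"], deny=[".*_archive$"])
]
```

With the default `ignoreCase: True`, `Orders_2024` is kept even though the allow pattern is lowercase.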

Code Coordinates

  • Class Name: datahub.ingestion.source.mongodb.MongoDBSource
  • Browse on GitHub

Questions

If you've got any questions on configuring ingestion for MongoDB, feel free to ping us on our Slack.