JSONL (JSON Lines) document
AI-powered detection and analysis of JSONL document files.
Instant JSONL File Detection
Use our advanced AI-powered tool to instantly detect and analyze JSONL document files with precision and speed.
File Information
JSONL document
Data
.jsonl
application/jsonl
JSONL (JSON Lines)
What is a JSONL file?
A JSONL (JSON Lines) file is a text file format where each line contains a separate, valid JSON object. Also known as newline-delimited JSON (NDJSON), this format provides a convenient way to store and stream large datasets as a sequence of JSON records. Each line is independent and can be processed individually, making it ideal for log files, data processing pipelines, and streaming applications.
History and Development
JSON Lines emerged from the need to handle large datasets and streaming JSON data more efficiently than traditional JSON arrays. It provides a simple solution for processing JSON data line-by-line without loading entire files into memory.
Key milestones:
- 2013: JSON Lines format informally specified
- 2014: Growing adoption in big data processing tools
- 2015: Widespread use in machine learning datasets
- 2016: Support added to major data processing frameworks
- Present: Standard format for streaming JSON and large datasets
File Structure and Format
Basic Structure
{"name": "Alice", "age": 30, "city": "New York"}
{"name": "Bob", "age": 25, "city": "San Francisco"}
{"name": "Charlie", "age": 35, "city": "Chicago"}
Format Rules
- Each line contains exactly one JSON object
- Lines are separated by newline characters (\n or \r\n)
- No commas between objects (unlike JSON arrays)
- Each line must be valid JSON
- Empty lines are typically ignored
- No overall array structure wrapping the objects
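These rules can be checked with a few lines of Python. The sketch below is a minimal example rather than a full validator: it reports the first non-empty line that fails to parse as standalone JSON.
import json

def check_jsonl_lines(file_path):
    """Minimal check of the rules above: every non-empty line must be valid JSON on its own."""
    with open(file_path, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue  # empty lines are typically ignored
            try:
                json.loads(line)
            except json.JSONDecodeError as e:
                print(f"Line {line_num} is not valid JSON: {e}")
                return False
    return True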
Comparison with JSON Array
// Traditional JSON Array
[
{"name": "Alice", "age": 30},
{"name": "Bob", "age": 25},
{"name": "Charlie", "age": 35}
]
// JSON Lines (JSONL)
{"name": "Alice", "age": 30}
{"name": "Bob", "age": 25}
{"name": "Charlie", "age": 35}
Common Use Cases
Log Files
{"timestamp": "2024-01-15T10:30:00Z", "level": "INFO", "message": "User logged in", "user_id": 12345}
{"timestamp": "2024-01-15T10:31:00Z", "level": "ERROR", "message": "Database connection failed", "error_code": 500}
{"timestamp": "2024-01-15T10:32:00Z", "level": "INFO", "message": "User logged out", "user_id": 12345}
Machine Learning Datasets
{"text": "This movie is amazing!", "sentiment": "positive", "confidence": 0.95}
{"text": "Terrible acting and plot", "sentiment": "negative", "confidence": 0.87}
{"text": "It was okay, nothing special", "sentiment": "neutral", "confidence": 0.62}
API Response Data
{"id": 1, "product": "Laptop", "price": 999.99, "category": "Electronics"}
{"id": 2, "product": "Book", "price": 29.99, "category": "Literature"}
{"id": 3, "product": "Headphones", "price": 199.99, "category": "Electronics"}
Event Streaming
{"event_type": "page_view", "user_id": "u123", "page": "/home", "timestamp": 1642234567}
{"event_type": "click", "user_id": "u123", "element": "signup_button", "timestamp": 1642234589}
{"event_type": "purchase", "user_id": "u456", "product_id": "p789", "amount": 49.99, "timestamp": 1642234612}
Technical Specifications
| Attribute | Details |
| --- | --- |
| File Extension | .jsonl, .ndjson, .jsonlines |
| MIME Type | application/jsonl |
| Encoding | UTF-8 |
| Line Ending | LF (\n) or CRLF (\r\n) |
| Maximum Line Length | No formal limit (around 1 MB per line is a common practical ceiling) |
| Structure | One JSON object per line |
Processing JSONL Files
Python Examples
Reading JSONL Files
import json
def read_jsonl(file_path):
"""Read JSONL file and return list of objects."""
data = []
with open(file_path, 'r', encoding='utf-8') as f:
for line in f:
line = line.strip()
if line: # Skip empty lines
data.append(json.loads(line))
return data
def read_jsonl_generator(file_path):
"""Read JSONL file using generator for memory efficiency."""
with open(file_path, 'r', encoding='utf-8') as f:
for line in f:
line = line.strip()
if line:
yield json.loads(line)
# Usage
data = read_jsonl('data.jsonl')
print(f"Loaded {len(data)} records")
# Memory-efficient processing (process_record stands in for your own record handler)
for record in read_jsonl_generator('large_data.jsonl'):
process_record(record)
Writing JSONL Files
import json
def write_jsonl(data, file_path):
"""Write list of objects to JSONL file."""
with open(file_path, 'w', encoding='utf-8') as f:
for item in data:
f.write(json.dumps(item, ensure_ascii=False) + '\n')
def append_to_jsonl(item, file_path):
"""Append single item to JSONL file."""
with open(file_path, 'a', encoding='utf-8') as f:
f.write(json.dumps(item, ensure_ascii=False) + '\n')
# Usage
records = [
{"name": "Alice", "score": 95},
{"name": "Bob", "score": 87},
{"name": "Charlie", "score": 92}
]
write_jsonl(records, 'scores.jsonl')
# Append new record
append_to_jsonl({"name": "David", "score": 89}, 'scores.jsonl')
Data Processing and Filtering
import json
from typing import Callable, Optional

def filter_jsonl(input_path: str, output_path: str,
                 filter_func: Callable[[dict], bool]) -> int:
"""Filter JSONL file based on predicate function."""
count = 0
with open(input_path, 'r', encoding='utf-8') as infile, \
open(output_path, 'w', encoding='utf-8') as outfile:
for line in infile:
line = line.strip()
if line:
record = json.loads(line)
if filter_func(record):
outfile.write(json.dumps(record, ensure_ascii=False) + '\n')
count += 1
return count
def transform_jsonl(input_path: str, output_path: str,
                    transform_func: Callable[[dict], Optional[dict]]) -> None:
"""Transform records in JSONL file."""
with open(input_path, 'r', encoding='utf-8') as infile, \
open(output_path, 'w', encoding='utf-8') as outfile:
for line in infile:
line = line.strip()
if line:
record = json.loads(line)
transformed = transform_func(record)
if transformed is not None:
outfile.write(json.dumps(transformed, ensure_ascii=False) + '\n')
# Usage examples
# Filter records where score > 90
high_scores = filter_jsonl('scores.jsonl', 'high_scores.jsonl',
lambda x: x.get('score', 0) > 90)
# Transform: add grade field
def add_grade(record):
score = record.get('score', 0)
if score >= 90:
record['grade'] = 'A'
elif score >= 80:
record['grade'] = 'B'
elif score >= 70:
record['grade'] = 'C'
else:
record['grade'] = 'F'
return record
transform_jsonl('scores.jsonl', 'scores_with_grades.jsonl', add_grade)
Pandas Integration
import pandas as pd
import json
from typing import Iterator
def jsonl_to_dataframe(file_path: str) -> pd.DataFrame:
"""Convert JSONL file to pandas DataFrame."""
data = []
with open(file_path, 'r', encoding='utf-8') as f:
for line in f:
line = line.strip()
if line:
data.append(json.loads(line))
return pd.DataFrame(data)
def dataframe_to_jsonl(df: pd.DataFrame, file_path: str) -> None:
"""Convert pandas DataFrame to JSONL file."""
with open(file_path, 'w', encoding='utf-8') as f:
for _, row in df.iterrows():
f.write(row.to_json(force_ascii=False) + '\n')
def chunked_jsonl_to_dataframe(file_path: str, chunk_size: int = 1000) -> Iterator[pd.DataFrame]:
"""Read large JSONL file in chunks."""
chunk = []
with open(file_path, 'r', encoding='utf-8') as f:
for line in f:
line = line.strip()
if line:
chunk.append(json.loads(line))
if len(chunk) >= chunk_size:
yield pd.DataFrame(chunk)
chunk = []
# Yield remaining records
if chunk:
yield pd.DataFrame(chunk)
# Usage
df = jsonl_to_dataframe('data.jsonl')
print(df.describe())
# Process large files in chunks
for chunk_df in chunked_jsonl_to_dataframe('large_data.jsonl', chunk_size=5000):
    # Process each chunk (numeric_only avoids errors on non-numeric columns)
    processed = chunk_df.groupby('category').mean(numeric_only=True)
print(f"Processed chunk with {len(chunk_df)} records")
Command Line Tools
Using jq for JSONL Processing
# Filter records (-c keeps output compact, one object per line)
jq -c 'select(.score > 90)' scores.jsonl > high_scores.jsonl
# Transform records
jq -c '. + {grade: (if .score >= 90 then "A" elif .score >= 80 then "B" else "C" end)}' scores.jsonl > graded.jsonl
# Extract specific fields
jq -c '{name: .name, score: .score}' scores.jsonl > names_scores.jsonl
# Count records
wc -l data.jsonl
# Get unique values
jq -r '.category' products.jsonl | sort | uniq -c
# Statistical operations
jq -s 'map(.price) | add / length' products.jsonl # Average price
Using Miller (mlr)
# Convert JSONL to CSV
mlr --ijsonl --ocsv cat data.jsonl > data.csv
# Filter and aggregate
mlr --jsonl filter '$score > 85' then stats1 -a mean -f score data.jsonl
# Group by operations
mlr --jsonl stats1 -a count,mean -f price -g category products.jsonl
# Sort records
mlr --jsonl sort -f score data.jsonl
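If neither jq nor Miller is available, the same JSONL-to-CSV conversion can be done with the standard library; a minimal sketch that builds the column set from the union of keys seen in the file:
import csv
import json

def jsonl_to_csv(jsonl_path, csv_path):
    """Convert a flat JSONL file to CSV, using the union of all keys as columns."""
    records = []
    fieldnames = []
    with open(jsonl_path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():
                record = json.loads(line)
                records.append(record)
                for key in record:
                    if key not in fieldnames:
                        fieldnames.append(key)
    with open(csv_path, 'w', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)

jsonl_to_csv('data.jsonl', 'data.csv')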
Streaming and Real-time Processing
Python Streaming Example
import json
import time
from typing import Any, Dict, Generator, List
def stream_jsonl_producer(file_path: str, delay: float = 0.1) -> Generator[Dict[str, Any], None, None]:
"""Simulate streaming JSONL data."""
with open(file_path, 'r', encoding='utf-8') as f:
for line in f:
line = line.strip()
if line:
yield json.loads(line)
time.sleep(delay) # Simulate real-time delay
def jsonl_stream_processor(stream: Generator[Dict[str, Any], None, None],
window_size: int = 100) -> None:
"""Process streaming JSONL data with windowing."""
window = []
for record in stream:
window.append(record)
if len(window) >= window_size:
# Process window
process_window(window)
# Slide window (keep last 50% for overlap)
window = window[window_size // 2:]
def process_window(records: List[Dict[str, Any]]) -> None:
"""Process a window of records."""
if not records:
return
# Example: Calculate statistics
scores = [r.get('score', 0) for r in records if 'score' in r]
if scores:
avg_score = sum(scores) / len(scores)
print(f"Window avg score: {avg_score:.2f}, count: {len(scores)}")
# Usage
stream = stream_jsonl_producer('events.jsonl', delay=0.01)
jsonl_stream_processor(stream, window_size=50)
Apache Kafka Integration
from kafka import KafkaProducer, KafkaConsumer
import json
class JSONLKafkaProducer:
def __init__(self, bootstrap_servers: str, topic: str):
self.producer = KafkaProducer(
bootstrap_servers=bootstrap_servers,
value_serializer=lambda v: json.dumps(v).encode('utf-8')
)
self.topic = topic
def send_jsonl_file(self, file_path: str) -> None:
"""Send JSONL file records to Kafka topic."""
with open(file_path, 'r', encoding='utf-8') as f:
for line in f:
line = line.strip()
if line:
record = json.loads(line)
self.producer.send(self.topic, record)
self.producer.flush()
class JSONLKafkaConsumer:
def __init__(self, bootstrap_servers: str, topic: str, group_id: str):
self.consumer = KafkaConsumer(
topic,
bootstrap_servers=bootstrap_servers,
group_id=group_id,
value_deserializer=lambda v: json.loads(v.decode('utf-8'))
)
self.output_file = None
def consume_to_jsonl(self, output_path: str) -> None:
"""Consume Kafka messages and write to JSONL file."""
with open(output_path, 'w', encoding='utf-8') as f:
for message in self.consumer:
f.write(json.dumps(message.value, ensure_ascii=False) + '\n')
f.flush() # Ensure real-time writing
# Usage
producer = JSONLKafkaProducer('localhost:9092', 'events')
producer.send_jsonl_file('events.jsonl')
consumer = JSONLKafkaConsumer('localhost:9092', 'events', 'jsonl-consumer')
consumer.consume_to_jsonl('consumed_events.jsonl')
Big Data Processing
Apache Spark with JSONL
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, avg, count
def process_jsonl_with_spark(input_path: str, output_path: str):
"""Process large JSONL files with Apache Spark."""
spark = SparkSession.builder \
.appName("JSONL Processing") \
.getOrCreate()
# Read JSONL file
df = spark.read.json(input_path)
# Data processing
processed_df = df \
.filter(col("score") > 80) \
.withColumn("grade",
when(col("score") >= 90, "A")
.when(col("score") >= 80, "B")
.otherwise("C")) \
.groupBy("category") \
.agg(avg("score").alias("avg_score"),
count("*").alias("count"))
# Write results as JSONL
processed_df.write \
.mode("overwrite") \
.json(output_path)
spark.stop()
# For very large files, use partitioning
def process_partitioned_jsonl(input_path: str, output_path: str):
"""Process partitioned JSONL data."""
spark = SparkSession.builder \
.appName("Partitioned JSONL") \
.config("spark.sql.adaptive.enabled", "true") \
.getOrCreate()
df = spark.read.json(input_path)
# Repartition for better performance
df_partitioned = df.repartition(100, "category")
# Process and write with partitioning
df_partitioned \
.write \
.mode("overwrite") \
.partitionBy("category") \
.json(output_path)
spark.stop()
Dask for Out-of-Core Processing
import dask.bag as db
import json
def process_large_jsonl_with_dask(input_path: str, output_path: str):
"""Process large JSONL files that don't fit in memory."""
# Read JSONL with Dask
    def parse_line(line):
        try:
            return json.loads(line.strip())
        except json.JSONDecodeError:
            # Skip blank or malformed lines
            return None
# Create Dask bag
lines = db.read_text(input_path)
records = lines.map(parse_line).filter(lambda x: x is not None)
# Process data
filtered = records.filter(lambda x: x.get('score', 0) > 85)
# Compute statistics
    avg_score = records.pluck('score', None).filter(lambda x: x is not None).mean()
    # Save results (output_path should be a glob such as 'filtered-*.jsonl')
    filtered.map(json.dumps).to_textfiles(output_path)
print(f"Average score: {avg_score.compute()}")
# Example with custom partitioning
def process_with_custom_partitions(input_path: str, npartitions: int = 10):
"""Process JSONL with custom partitioning strategy."""
lines = db.read_text(input_path, blocksize="64MB")
records = lines.map(lambda x: json.loads(x.strip()) if x.strip() else None) \
.filter(lambda x: x is not None)
# Repartition based on data characteristics
balanced = records.repartition(npartitions=npartitions)
# Group by category and calculate statistics
by_category = balanced.groupby(lambda x: x.get('category', 'unknown'))
    # Each grouped element is a (category, list_of_records) pair
    stats = by_category.map(lambda kv: {
        'category': kv[0],
        'count': len(kv[1]),
        'avg_score': sum(item.get('score', 0) for item in kv[1]) / len(kv[1])
    })
return stats.compute()
Validation and Schema
JSON Schema Validation for JSONL
import json
import jsonschema
from typing import List, Dict, Any
def validate_jsonl_schema(file_path: str, schema: Dict[str, Any]) -> List[Dict[str, Any]]:
"""Validate JSONL file against JSON schema."""
errors = []
with open(file_path, 'r', encoding='utf-8') as f:
for line_num, line in enumerate(f, 1):
line = line.strip()
if line:
try:
record = json.loads(line)
jsonschema.validate(record, schema)
except json.JSONDecodeError as e:
errors.append({
'line': line_num,
'error': 'Invalid JSON',
'details': str(e)
})
except jsonschema.ValidationError as e:
errors.append({
'line': line_num,
'error': 'Schema validation failed',
'details': e.message,
'path': list(e.path)
})
return errors
# Example schema
user_schema = {
"type": "object",
"properties": {
"name": {"type": "string", "minLength": 1},
"age": {"type": "integer", "minimum": 0, "maximum": 150},
"email": {"type": "string", "format": "email"},
"score": {"type": "number", "minimum": 0, "maximum": 100}
},
"required": ["name", "age"]
}
# Validate file
validation_errors = validate_jsonl_schema('users.jsonl', user_schema)
if validation_errors:
for error in validation_errors:
print(f"Line {error['line']}: {error['error']} - {error['details']}")
else:
print("All records are valid")
Data Quality Checks
def analyze_jsonl_quality(file_path: str) -> Dict[str, Any]:
"""Analyze data quality of JSONL file."""
stats = {
'total_lines': 0,
'valid_json': 0,
'empty_lines': 0,
'field_coverage': {},
'data_types': {},
'unique_values': {}
}
all_fields = set()
with open(file_path, 'r', encoding='utf-8') as f:
for line in f:
stats['total_lines'] += 1
line = line.strip()
if not line:
stats['empty_lines'] += 1
continue
try:
record = json.loads(line)
stats['valid_json'] += 1
# Analyze fields
for field, value in record.items():
all_fields.add(field)
# Field coverage
stats['field_coverage'][field] = stats['field_coverage'].get(field, 0) + 1
# Data types
value_type = type(value).__name__
if field not in stats['data_types']:
stats['data_types'][field] = {}
stats['data_types'][field][value_type] = \
stats['data_types'][field].get(value_type, 0) + 1
# Unique values (for categorical fields)
if isinstance(value, (str, int, bool)) and len(str(value)) < 100:
if field not in stats['unique_values']:
stats['unique_values'][field] = set()
stats['unique_values'][field].add(value)
except json.JSONDecodeError:
continue
# Convert sets to lists for JSON serialization
for field in stats['unique_values']:
stats['unique_values'][field] = list(stats['unique_values'][field])
# Calculate field coverage percentages
    valid_records = max(stats['valid_json'], 1)  # avoid division by zero on empty files
for field in stats['field_coverage']:
coverage_pct = (stats['field_coverage'][field] / valid_records) * 100
stats['field_coverage'][field] = {
'count': stats['field_coverage'][field],
'percentage': round(coverage_pct, 2)
}
return stats
# Usage
quality_report = analyze_jsonl_quality('data.jsonl')
print(f"Total lines: {quality_report['total_lines']}")
print(f"Valid JSON: {quality_report['valid_json']}")
print("Field coverage:")
for field, info in quality_report['field_coverage'].items():
print(f" {field}: {info['percentage']}%")
JSONL provides an excellent balance between simplicity and functionality, making it an ideal choice for data processing pipelines, streaming applications, and large dataset management where line-by-line processing is beneficial.
AI-Powered JSONL File Analysis
Instant Detection
Quickly identify JSONL document files with high accuracy using Google's advanced Magika AI technology.
Security Analysis
Analyze file structure and metadata to ensure the file is legitimate and safe to use.
Detailed Information
Get comprehensive details about file type, MIME type, and other technical specifications.
Privacy First
All analysis happens in your browser - no files are uploaded to our servers.
Related File Types
Explore other file types in the Data category and discover more formats.
Start Analyzing JSONL Files Now
Use our free AI-powered tool to detect and analyze JSONL document files instantly with Google's Magika technology.
⚡ Try File Detection Tool